When Graduate Students School the Machines: A Reality Check on AI Coding
Researchers from EPFL just dropped a study that cuts through the AI hype with surgical precision. They ran a tournament on a complex logistics problem, pitting 40 LLM-coded agents against 17 human-coded agents written back in 2020 - before ChatGPT was even a thing. The machines got absolutely destroyed.
The challenge was the Auction, Pickup, and Delivery Problem (APDP) - the kind of optimization nightmare that keeps logistics companies like Amazon and FedEx running. Multiple transportation companies compete in real-time auctions, bidding on delivery tasks, then optimizing routes while managing truck capacity constraints. Every bid changes the landscape for future decisions. It’s strategic, it’s dynamic, and it’s exactly the kind of problem where you’d expect modern AI to shine.
The Setup: A Fair Fight (With Advantages to the Machines)
The researchers used the Logist platform developed at EPFL’s Artificial Intelligence Laboratory - an open-source Java framework specifically built for this kind of multi-agent competition. The problem came from a postgraduate Intelligent Agents course, where students had 2-3 weeks to develop their agents and then compete in single-elimination tournaments for extra credit.
They selected 12 student agents from the 2020 class - pre-LLM era, no AI assistance whatsoever - plus 5 baseline agents developed by EPFL lab members. These baselines were intentionally simple, as the sketch after this list shows:
Naive: Calculates distance to pickup city plus distance to destination, bids that amount
Expected Cost Fixed Bid: Uses expected delivery cost plus a fixed markup
Greedy: Tries to maximize immediate profit per task
Random: Selects actions randomly (yes, they included this to set a floor)
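To give a feel for just how low that floor is, here's a minimal sketch of what the first two baselines amount to. The class names, fields, and Euclidean cost model are my own illustrative assumptions, not the Logist platform's actual API:

```java
// Illustrative sketch of the two simplest baselines - not the actual Logist code.
// City, Task, and the cost model are stand-ins for whatever the platform provides.

class City {
    double x, y;
    City(double x, double y) { this.x = x; this.y = y; }
    double distanceTo(City other) {
        return Math.hypot(x - other.x, y - other.y);
    }
}

class Task {
    City pickup, delivery;
    Task(City pickup, City delivery) { this.pickup = pickup; this.delivery = delivery; }
}

class NaiveBidder {
    City currentCity;
    double costPerKm;

    NaiveBidder(City start, double costPerKm) {
        this.currentCity = start;
        this.costPerKm = costPerKm;
    }

    // "Distance to pickup city plus distance to destination, bid that amount."
    double bid(Task task) {
        double km = currentCity.distanceTo(task.pickup)
                  + task.pickup.distanceTo(task.delivery);
        return km * costPerKm;
    }
}

class FixedMarkupBidder extends NaiveBidder {
    double markup;

    FixedMarkupBidder(City start, double costPerKm, double markup) {
        super(start, costPerKm);
        this.markup = markup;
    }

    // "Expected delivery cost plus a fixed markup."
    @Override
    double bid(Task task) {
        return super.bid(task) + markup;
    }
}
```

That's the entire decision logic - no lookahead, no opponent modeling - which is what makes the results below so striking.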
Against this, they threw every modern LLM you can name: GPT-4o Thinking, Gemini 2.0 Pro, Claude Opus 4.1, DeepSeek R1. They tested multiple prompting approaches, including the trendy “vibe coding” methodology where you describe what you want in natural language and trust the model to figure it out.
The setup was brutal but fair: 12 double all-play-all tournaments across different network topologies, roughly 40,000 matches total, with agents facing every opponent twice. Here’s the thing - the LLMs actually got several advantages the human students didn’t. The researchers intervened in LLM-generated code to fix obvious errors. They provided detailed problem descriptions. They even tried feeding winning solutions back to the models.
The Results: Embarrassing for the Machines
Graduate students swept the top 5 positions. They didn't just win - they dominated. The winning student agent averaged 96.5 wins out of 110 possible victories.
Meanwhile, 33 out of 40 LLM-coded agents couldn’t even beat those simplistic baseline algorithms. Think about that. These are state-of-the-art models losing to “calculate distance and bid that number” strategies that a first-year CS student would be embarrassed to submit.
The performance gap was a chasm. Student agents understood that you sometimes need to bid below cost early to position yourself advantageously for future auctions. They developed sophisticated routing heuristics that balanced immediate delivery costs against future flexibility. They modeled opponent behavior and adjusted strategies accordingly.
The LLM agents? They generated syntactically correct Java code that compiled without errors. But the strategies were naive, the routing inefficient, and the auction decisions short-sighted. Code that runs isn’t the same as code that competes.
The Most Damning Experiment
The researchers tried something particularly interesting. They took the best human solution - the winning student agent - and fed it directly to GPT-4o Pro with detailed instructions: “Here’s the problem. Here’s a winning solution. Improve it. Make it better. Win the tournament.”
The result? The “improved” version dropped from 1st place to 10th.
The LLM’s optimizations resulted in losing 9 spots on the leaderboard. This isn’t a data contamination problem where the model secretly trained on test data. Even when explicitly shown a winning solution, the LLM couldn’t recognize what made it work or preserve those strategic elements while improving other aspects.
This should terrify anyone building production AI systems. It means these models don’t just struggle to create optimal strategies from scratch, they actively damage working solutions when asked to improve them.
Why Standard Benchmarks Lie
The study exposes something I’ve been seeing in production for months: standard code benchmarks are fundamentally misleading. HumanEval, BigCodeBench, all those popular evaluation suites measure whether code compiles and passes unit tests. They don’t measure whether the code actually solves business problems effectively.
It’s like judging chess players by whether they can move pieces legally rather than whether they can win games. Syntax correctness is table stakes. Strategic thinking is the game.
The APDP benchmark is NP-hard - no known algorithm reliably finds optimal solutions in reasonable time as instances grow. This forces real-world trade-offs between precision and speed, between immediate optimization and long-term positioning. These are exactly the kinds of decisions that separate code that compiles from code that delivers business value.
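In practice, agents for problems like this tend to sidestep optimality altogether and run an anytime search under a hard time budget: keep improving the plan while the clock allows, then return the best plan found. Here's a hedged, generic sketch of that pattern - the plan representation, cost function, and neighbor move are placeholders, not the platform's API or the study's agents:

```java
import java.util.Random;
import java.util.function.Function;
import java.util.function.UnaryOperator;

// Anytime local search: trade solution quality for a hard time budget.
// P stands in for a real pickup-and-delivery plan representation.
class AnytimeLocalSearch<P> {
    private final Function<P, Double> cost;        // total route cost of a plan
    private final UnaryOperator<P> randomNeighbor; // small random modification of a plan
    private final Random rng = new Random();

    AnytimeLocalSearch(Function<P, Double> cost, UnaryOperator<P> randomNeighbor) {
        this.cost = cost;
        this.randomNeighbor = randomNeighbor;
    }

    P optimize(P initial, long budgetMillis, double acceptWorseProb) {
        long deadline = System.currentTimeMillis() + budgetMillis;
        P current = initial, best = initial;
        while (System.currentTimeMillis() < deadline) {
            P candidate = randomNeighbor.apply(current);
            // Accept improvements, and occasionally worse plans to escape local optima.
            if (cost.apply(candidate) < cost.apply(current)
                    || rng.nextDouble() < acceptWorseProb) {
                current = candidate;
            }
            if (cost.apply(current) < cost.apply(best)) {
                best = current;
            }
        }
        return best; // best plan found within the budget, not a guaranteed optimum
    }
}
```

The point of the sketch is the deadline, not the search moves: on an NP-hard problem you budget compute, and the quality of what comes back depends entirely on how good your moves and acceptance rules are - which is exactly where strategy lives.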
The Strategic Thinking Gap
Let’s break down where LLMs actually failed:
Bidding Strategy: Competitive bidding requires understanding your marginal cost (what it actually costs you to add this task given your current route), estimating opponent marginal costs (what can they afford to bid?), and strategic positioning (should you take a loss now to improve your position for future auctions?). The LLMs consistently bid naively based on simple distance calculations. (A sketch of marginal-cost bidding follows this breakdown.)
Route Optimization: Students developed sophisticated heuristics that balanced multiple factors: delivery time windows, truck capacity constraints, geographic clustering of tasks, flexibility for future deliveries. LLMs often produced feasible routes that were strategically terrible, sometimes ignoring available trucks entirely.
Opponent Modeling: The best human agents adapted their strategies based on opponent behavior. If a competitor was aggressive on certain routes, they’d pivot to underserved areas. LLMs showed minimal adaptive behavior, essentially playing the same strategy regardless of competition.
Long-term Planning: Strategic decisions cascade. Winning one auction might position you poorly for the next three. Students understood this. LLMs optimized for immediate gains.
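To make the first two failure modes concrete, here's a hedged sketch of marginal-cost bidding along the lines described above: the bid starts from the cost of your best plan with the new task minus the cost of your best plan without it, then gets adjusted by a margin (which can go negative early on to buy positioning) and an estimate of what the opponent can afford. The plan-cost callback and the one-unit undercut are illustrative assumptions, not code from the study:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hedged sketch of marginal-cost bidding - illustrative, not the actual student or LLM agents.
class MarginalCostBidder<T> {
    private final Function<List<T>, Double> bestPlanCost; // cost of the cheapest plan for a task set
    private final double margin;                           // can be negative to buy future positioning

    MarginalCostBidder(Function<List<T>, Double> bestPlanCost, double margin) {
        this.bestPlanCost = bestPlanCost;
        this.margin = margin;
    }

    double bid(List<T> wonTasks, T newTask, double estimatedOpponentBid) {
        // Marginal cost: how much more expensive does my best plan get if I win this task?
        double costWithout = bestPlanCost.apply(wonTasks);
        List<T> withTask = new ArrayList<>(wonTasks);
        withTask.add(newTask);
        double costWith = bestPlanCost.apply(withTask);
        double marginalCost = costWith - costWithout;

        // Base bid: marginal cost plus a margin.
        double myBid = marginalCost * (1.0 + margin);

        // Strategic adjustment: if the opponent is estimated to bid higher, raise the bid
        // to just under their estimate and capture the extra profit; if the task is nearly
        // free given my current routes, the bid can sit well below naive distance-based cost.
        if (estimatedOpponentBid > myBid) {
            myBid = Math.max(myBid, estimatedOpponentBid - 1.0);
        }
        return myBid;
    }
}
```

Nothing here is exotic - but notice that every line depends on having a good plan-cost estimator and a credible opponent model, which is precisely what the LLM agents lacked.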
The Real-World Parallel
This mirrors what we’ve experienced building AI solutions for pharma. We launched multiple domain-specific use cases - Smart Order Assistant AI, Visit Summary, predictive analytics for field sales. Standard LLMs out of the box? Barely functional.
The gap between “generates plausible code” and “solves the actual business problem” required massive architectural work:
Multiple specialized agents working in orchestrated workflows
Semantic search to surface relevant domain knowledge
RAG (Retrieval-Augmented Generation) for context-specific information
Graph RAG for understanding relationships in complex data
Custom fine-tuning on industry-specific patterns
Rule-based validators to catch strategic errors
We couldn’t just describe what we wanted and trust the LLM to architect a solution. We needed domain experts who understood both the technical requirements and the business context to design the system architecture.
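As one concrete example of the last item in that list, a rule-based validator is just deterministic code sitting between the LLM and anything that touches production: if a generated proposal violates a domain rule, it never ships. The sketch below is a generic, hypothetical illustration of the pattern - the Order fields and the rules themselves are made up for this post, not our actual pharma stack:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of a rule-based validation layer around LLM output.
// The Order type and the rules are illustrative, not a real product's schema.
record Order(String productCode, int quantity, double discountPct) {}

class OrderValidator {
    private record Rule(String description, Predicate<Order> check) {}

    private final List<Rule> rules = List.of(
        new Rule("quantity must be positive", o -> o.quantity() > 0),
        new Rule("discount capped at 20%", o -> o.discountPct() <= 20.0),
        new Rule("product code must be present", o -> !o.productCode().isBlank())
    );

    // Returns the violated rules; an empty list means the LLM-proposed order may proceed.
    List<String> validate(Order proposed) {
        List<String> violations = new ArrayList<>();
        for (Rule rule : rules) {
            if (!rule.check().test(proposed)) {
                violations.add(rule.description());
            }
        }
        return violations;
    }
}
```

The value isn't in the three toy rules; it's that the strategic constraints live in code you control rather than in a prompt you hope the model respects.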
What This Means for Enterprise AI
If you’re building AI products, this research should fundamentally inform your approach:
Don’t rely on raw LLM output for business-critical optimization. Code generation for boilerplate? Great. Strategic planning that affects revenue? Not yet. The gap between syntactically correct and strategically optimal remains enormous.
Specialized agents beat general-purpose models in specific domains. We’ve proven this repeatedly in pharma. The pattern holds across industries. A narrowly focused system with domain-specific training and validation rules will outperform a general-purpose LLM every time on complex domain problems.
Architecture matters more than model selection. The difference between our baseline AI implementations and production versions isn’t which LLM we use - GPT-4, Claude, Gemini all have similar capabilities for code generation. What matters is how we structure the system around them: what knowledge we provide, what validation we enforce, what strategic constraints we build in.
Human expertise remains irreplaceable for complex strategic decisions. The 2020 student code beat today’s cutting-edge AI because those students deeply understood the problem domain. They’d made mistakes, debugged them, developed intuition about what works and why. That foundation matters.
The Vibe Coding Reality Check
Vibe coding saves enormous amounts of time.
But there’s a dangerous myth spreading through the industry: that you can just describe what you want in natural language and AI will architect optimal solutions. This research proves otherwise. Definitively.
The LLMs tested here had every advantage. They had detailed problem descriptions. They had examples of good solutions. They had researchers intervening to fix obvious bugs. They still produced strategies that lost to undergraduate-level baselines.
Why? Because understanding the problem space, anticipating opponent behavior, optimizing across multiple interdependent decisions with long-term consequences - these require genuine strategic reasoning that current LLMs fundamentally lack.
The students who dominated this tournament had the foundation that matters: Computer Science coursework, lab exercises where they debugged their own mistakes, experience with algorithmic trade-offs. They’d internalized patterns about what works and why. They could look at a logistics problem and immediately see both the obvious approaches and the subtle ways those approaches fail under competitive pressure.
Beyond Logistics: Universal Implications
The researchers chose logistics, but the implications extend to every domain where success requires strategic thinking:
Financial Portfolio Optimization: Should you take losses in one sector to rebalance for better long-term positioning?
Supply Chain Management: How do you optimize inventory given uncertain demand and competitor behavior?
Resource Allocation: What’s the right balance between immediate utilization and future flexibility?
Competitive Strategy: When should you compete aggressively, and when should you pivot to underserved opportunities?
Anywhere decisions cascade through a system with long-term consequences, anywhere you need to model opponent or market behavior, anywhere immediate optimization might be strategically wrong - current LLMs struggle.
The Path Forward
This doesn’t mean AI can’t help with these problems. It means we need radically better frameworks for human-AI collaboration.
Use LLMs for what they’re genuinely good at:
Generating boilerplate and standard patterns
Exploring solution spaces quickly
Handling routine transformations and data processing
Accelerating the coding of well-understood components
Keep humans in the loop for:
Strategic architecture decisions
Domain-specific optimization
Validating that solutions actually work competitively
Understanding cascading consequences of design choices
The research team made their benchmark open source - the Logist platform and all agent code are available on GitHub. This is important. We need more evaluations that test real-world competitive performance, not just syntactic correctness. Code that compiles but loses money isn’t useful code. Code that passes unit tests but makes strategically poor decisions doesn’t solve business problems.
Confronting the Hype
The hype cycle around AI coding has been intense. We’ve seen claims that junior developers are obsolete, that software engineering will be fully automated by next quarter, that anyone can build complex systems by just chatting with an LLM.
This research provides a necessary correction, grounded in rigorous methodology and reproducible results. AI coding assistants are powerful tools. They’re not magic. They’re not strategic thinkers. They’re not architects who understand business context.
The gap between “generates code that looks right” and “solves complex problems strategically” remains massive. Understanding that difference matters if you’re building anything that needs to work in competitive real-world environments.
The student agents from 2020 had no access to AI assistance. They had Computer Science fundamentals, domain understanding, and time to debug and iterate. They demolished today’s state-of-the-art LLMs.
That should tell you everything you need to know about where we actually are versus where the marketing claims suggest we are. The tools are improving rapidly. But code that competes requires something current LLMs don’t have: strategic reasoning about long-term consequences in complex competitive environments.
And for now, that still requires humans in the room.