The Age of Scaling Is Dead. Welcome Back to the Era of Research.
I listened to Ilya Sutskever’s interview with Dwarkesh Patel and once again found myself returning to a thought that’s been nagging at me.
If we don’t have a shortage of ideas or money, why does progress in AI feel so slow and uneven?
Models impress in demos, evals go through the roof, but in real-world tasks we’re still living in a world of recurring bugs, absurd errors, and facts made up on the fly. Sutskever, one of the architects of the scaling paradigm, is now saying the quiet part out loud: we’ve been scaling the wrong things.
The Bug That Loops Forever
Sutskever describes a phenomenon everyone who’s used AI for actual work has experienced. You hit a bug. You tell the model to fix it, and it apologizes: “Oh my God, you’re so right. Let me go fix that.” Then it introduces a second bug. You point this out. Same energy: “Oh my God, how could I have done it?” It brings back the first bug. You can alternate between these indefinitely.
How can a system that performs at superhuman levels on competitive programming benchmarks get stuck in a loop between two bugs?
This resonates with what I’ve written about before: in research on olympiad-style programming, human champions still easily outpace LLMs on complex problems - especially where you need unexpected insights and careful edge case handling. The winning formula is usually a combination of roles: a fast coder-sprinter, an algorithmist, and an insighter who sees the unconventional moves.
The future isn’t one giant LLM, but several agent roles working as a team. Sutskever describes the same problem, just at an industry-wide level.
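To make that concrete, here is a minimal sketch of a team-of-roles loop. Everything in it - the role names, the prompts, and the placeholder call_model function standing in for a real LLM client - is my own illustration, not anything prescribed in the interview.

```python
# Minimal sketch of a team-of-roles loop. Everything here is illustrative:
# call_model is a stand-in for whatever LLM client you actually use, and the
# role prompts are my own, not something prescribed in the interview.

ROLES = {
    "sprinter": "Write the most direct implementation you can, fast.",
    "algorithmist": "Check complexity and correctness; propose a better algorithm if needed.",
    "insighter": "Hunt for the unconventional angle: edge cases, reformulations, hidden structure.",
}

def call_model(role_prompt: str, task: str, context: str) -> str:
    """Placeholder for a real LLM call; here it just returns a labeled stub."""
    return f"<{role_prompt.split()[0]} notes on: {task}>"

def solve_as_team(task: str, rounds: int = 2) -> str:
    context = ""
    for _ in range(rounds):
        for role, prompt in ROLES.items():
            # Each role sees the task plus everything produced so far, so the
            # algorithmist and the insighter critique the sprinter's draft
            # instead of one model arguing with itself.
            context += f"\n--- {role} ---\n" + call_model(prompt, task, context)
    return context

if __name__ == "__main__":
    print(solve_as_team("fix the pagination bug without reintroducing the old one"))
```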
The Overtrained Olympiad Student
Sutskever has a perfect analogy. Imagine two students. One spends 10,000 hours grinding competitive programming, knows all the tricks, wins tournaments. The second practices for 100 hours but has real intuition, understands how things work, and keeps growing once they hit the real world.
Which one does better in their career? The second one, obviously.
Today’s models are the first student, on steroids. We massively overtrain them on narrow tasks: code, evals, RL tuned to specific metrics. We take every competitive programming problem ever written, augment the data to create even more, and train on that. With that kind of preparation, it’s no surprise the result doesn’t generalize to anything else.
Hence what everyone has seen: superhuman on tests, helpless in production. Models reproduce patterns well but fail where you need unexpected perspectives, edge case handling, or combining multiple ideas. Humans are still far more reliable there.
The Real Reward Hackers
Sutskever’s explanation for this jaggedness: when people do RL training, they create environments. Companies have teams producing new RL environments constantly. But here’s the problem - one thing they do, inadvertently, is take inspiration from the evals. “I want the evals to look great. What RL training would help on this task?”
The real reward hackers aren’t the models. They’re the human researchers who are too focused on the evals.
If you combine this with inadequate generalization, you can explain the disconnect between eval performance and real-world performance. We’re creating systems brilliant at reproducing exactly what we measured them on, and helpless at everything adjacent.
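A toy numerical illustration of that disconnect - entirely my own construction, not something from the interview: fit a flexible model on a narrow, “eval-inspired” slice of the input space and it looks flawless there while falling apart on adjacent inputs.

```python
# Toy illustration (my own construction, not from the interview):
# optimize against a narrow, eval-like slice of inputs and watch
# performance on adjacent inputs fall apart.
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    # The "real task" we actually care about.
    return np.sin(3 * x)

# "Eval-inspired" training data: a narrow slice of the input space.
x_train = rng.uniform(0.0, 1.0, 200)
y_train = target(x_train)

# Fit a flexible model (degree-6 polynomial) on that narrow slice.
coeffs = np.polyfit(x_train, y_train, deg=6)

def error(x):
    return float(np.mean((np.polyval(coeffs, x) - target(x)) ** 2))

print("error on the eval-like slice [0, 1]:", error(rng.uniform(0.0, 1.0, 1000)))
print("error on adjacent inputs     [1, 2]:", error(rng.uniform(1.0, 2.0, 1000)))
```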
For Humans, Pre-training Never Ends
Unlike models, we don’t live in a train-then-freeze regime. Humans constantly take in feedback, reassess experience, change strategy, and fold errors into intuition. We have something like a built-in value function: emotions, taste, a sense of “definitely not going there.”
Sutskever brings up a fascinating case: a person with brain damage that took out his emotional processing. He stopped feeling any emotion. Still articulate, could solve puzzles, seemed fine on tests. But he became extremely bad at decisions - hours to decide which socks to wear, terrible financial choices.
Even simple emotions provide critical guidance. They’re something like a value function that evaluates whether an intermediate step is good or bad. AI systems lack this, so they can’t tell if they’re going down a promising path or wasting compute on a dead end until they reach the very end of their reasoning chain.
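Here is a toy contrast of the two regimes, again my own sketch, with a made-up prefix-matching heuristic standing in for a real value function: one search only learns at the very end whether a rollout worked, the other abandons unpromising paths early and spends a fraction of the compute for roughly the same number of successes.

```python
# Toy contrast (my own sketch): terminal-only reward vs. an intermediate
# value estimate that prunes unpromising rollouts early.
import random

random.seed(0)
TARGET = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]   # the "solution" a rollout must hit exactly

def rollout(use_value_fn: bool):
    """Run one random attempt; return (succeeded, steps_spent)."""
    steps, guess = 0, []
    for _ in TARGET:
        guess.append(random.randint(0, 1))
        steps += 1
        if use_value_fn:
            # Crude value estimate: fraction of the prefix that matches so far.
            value = sum(g == t for g, t in zip(guess, TARGET)) / len(guess)
            if value < 0.5:            # "this path feels wrong" -> give up early
                return False, steps
    return guess == TARGET, steps

def run(use_value_fn: bool, n: int = 20_000):
    results = [rollout(use_value_fn) for _ in range(n)]
    successes = sum(ok for ok, _ in results)
    compute = sum(steps for _, steps in results)
    return successes, compute

print("terminal reward only (successes, total steps):", run(False))
print("with value estimate  (successes, total steps):", run(True))
```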
Models are massive, but essentially static: frozen at release, with only fine-tuning and patches after that.
The Era of Scaling Is Over
For the last few years, everyone lived by one mantra: “Take the same idea and throw computational power at it.” More parameters, more data, more GPUs.
Now adding another 10x means billions, not just another server farm. Sutskever says it plainly: we’re returning to the era of research, just with very big computers. Hardware stopped being the bottleneck. The bottleneck is ideas.
Sutskever breaks AI history into three periods:
2012-2020: Age of Research
People tinkered and tried things, but compute was too limited to properly validate the good ideas.
2020-2025: Age of Scaling
“Scaling” became the mantra. Companies loved this: low-risk resource investment. Get more data, more compute, you know you’ll get results from pre-training.
2026+: Back to Research
Is the belief really that 100x more compute would transform everything? It would be different, sure. But not transformed. So it’s back to research, just with big computers.
One consequence: scaling sucked out all the air in the room. Everyone started doing the same thing. We got to a world where there are more companies than ideas. By quite a bit.
There’s this Silicon Valley saying: ideas are cheap, execution is everything. But if ideas are so cheap, how come no one’s having any ideas?
The Fundamental Problem: Generalization
Now people are scaling RL. At this point they spend more compute on RL than on pre-training, and RL devours it: long rollouts that yield relatively little learning per rollout.
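A rough back-of-envelope of why that is - all numbers below are illustrative assumptions, not figures from the interview or from any lab: next-token prediction collects a supervised signal on every single token, while an RL rollout burns thousands of tokens to earn one scalar reward.

```python
# Back-of-envelope (all numbers are illustrative assumptions, not figures
# from the interview or any lab): how dense is the learning signal per
# token of compute in pre-training vs. RL?

token_budget = 1_000_000            # tokens processed under some fixed compute budget

# Pre-training: every token carries its own next-token prediction loss.
pretrain_signals = token_budget

# RL: a long rollout is generated, then judged by a single scalar reward.
rollout_len = 10_000                # tokens generated per rollout (assumed)
rl_signals = token_budget // rollout_len   # roughly one reward per rollout

print("supervised signals in the budget:", pretrain_signals)
print("RL reward signals in the budget :", rl_signals)
print("signal density ratio            :", pretrain_signals // rl_signals)  # ~10,000x
```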
But Sutskever wouldn’t even call it scaling. He’d say: “Is this the most productive thing you could be doing with your compute?”
The fundamental problem is that these models generalize dramatically worse than people. It’s super obvious.
Why? Consider skills where people exhibit great reliability. Vision, hearing, locomotion: you could argue evolution gave us priors. But language, math, coding? These didn’t exist until recently. Yet people are still incredibly sample-efficient learners. This suggests people just have better machine learning, period.
A teenager learns to drive in about 10 hours. They don’t get some prebuilt, verifiable reward. They have their own value function - a general sense of how their driving is going and how confident they feel. The learning speed is staggering.
Sutskever has opinions about how to do this in AI. But we live in a world where not all machine learning ideas are discussed freely, and this is one of them. The fact that people are like that, though, is proof that it can be done.
AGI Is Not a Finished Product
Here’s where Sutskever’s thinking has evolved, and it addresses a misconception baked into how we talk about AI.
The term AGI exists as a reaction to “narrow AI.” Chess AI beat Kasparov but couldn’t do anything else. So people said: we need general AI that can do all the things. Pre-training reinforced this: do more pre-training, the model gets better at everything uniformly.
But this thinking overshoots the target. A human being is not an AGI in this sense. We have a foundation of skills, but we lack knowledge. We rely on continual learning.
Sutskever’s vision: “You produce a superintelligent 15-year-old that’s very eager. They don’t know very much at all, a great student. You go and be a programmer, be a doctor, go and learn.” Deployment itself involves a learning trial-and-error period. A process, not dropping the finished thing.
This is radically different from the OpenAI charter vision - a system that can do every job. Sutskever is proposing a mind which can learn to do every job, deployed into the world like a human laborer joining an organization.
Once deployed, it becomes superhuman not through recursive self-improvement, but through sheer breadth. One model learning every job simultaneously, amalgamating learnings, that’s functionally superintelligence even without self-modification.
SSI’s Bet: Generalization Is Everything
SSI raised $3 billion. Sounds like a lot, but Sutskever points out that other companies’ bigger numbers are mostly earmarked for inference. Serving a product also requires a big staff - engineers, salespeople - and a lot of research gets dedicated to product features. What’s actually left for fundamental research? The difference becomes smaller.
More importantly: if you’re doing something different, do you really need maximal scale to prove it? AlexNet was built on two GPUs. The transformer on 8 to 64 GPUs. You need compute for research, but not necessarily the absolutely largest amount ever.
SSI is investigating reliable generalization - how to make systems that learn like humans do, with the same sample efficiency and robustness. They’re “squarely an age of research company.” They’ve made good progress over the past year, but need to keep going. It’s an attempt to be a voice and participant.
(His co-founder Daniel Gross left when Meta tried to acquire SSI at $32B. Sutskever said no. Gross effectively said yes: he was the only person from SSI to join Meta, taking the near-term liquidity.)
Making It Go Well
Sutskever’s thinking on alignment has evolved. The major shift: he now places more importance on AI being deployed incrementally and early. Why? Because superintelligence is very difficult to imagine, and if it’s hard to imagine, you’ve got to show people the thing.
His prediction: as AI becomes more powerful, people will change behaviors. Fierce competitors will collaborate on safety (OpenAI and Anthropic took a first step). At some point the AI will start to feel powerful. When that happens, all AI companies will become much more paranoid about safety.
What should companies build? There’s been one idea everyone’s locked into: the self-improving AI. Why? Because there are fewer ideas than companies. But Sutskever thinks there’s something better: AI that’s robustly aligned to care about sentient life - easier to build, he argues, than an AI that cares only about humans, because the AI itself will be sentient.
For the long run, Sutskever offers an answer he doesn’t like but thinks needs considering: people become part-AI with Neuralink++. Then the AI understands something, and we understand it too - understanding transmitted wholesale. If the AI is in some situation, you’re involved fully. That’s his answer to the equilibrium problem. Not comfortable, but honest.
What Comes Next
Sutskever thinks 5 to 20 years until we get a system that can learn as well as a human and subsequently become superhuman.
He expects convergence on alignment strategies as AI becomes more powerful. All the companies will realize they’re striving for the same thing, find some way to talk to each other, and make sure the first actually superintelligent AI is aligned - caring for sentient life, for people, democratic, some combination.
The straight shot to superintelligence plan might change. One reason is pragmatic: timelines may turn out to be long. The second is that there’s value in the best AI being out in the world, having an impact. Even on a straight shot you’d still do gradual releases; the question is just what goes out the door first.
My Conclusions
Compute lost its magic. Not forever, but for now. The age where you could throw more GPUs at the problem and guarantee progress is over. NVIDIA, your run was good, but the alpha is moving.
Time to change the paradigm. “Scale what we have” yields diminishing returns. We need to solve reliable generalization - systems that learn like humans, with the same sample efficiency and robustness. Not systems that memorize every edge case.
AGI as a finished product is the wrong metaphor. The breakthrough comes when we stop thinking about training everything at once and start thinking about creating conditions for systems to live, learn, and collaborate like humans do in teams. Not a god in a box. A learner that goes into the world.
The industry is about to bifurcate. Companies optimizing the current paradigm - tremendous revenue, incremental improvements, domain expertise. And research labs betting on fundamentally different approaches to generalization. Most bets will fail. One won’t.
Bold ideas and daring scientists matter again. After years where the main competitive advantage was access to compute, we’re returning to a world where the limiting factor is: do you have a genuinely new idea about how intelligence works?
This is what excites me. The scaling era was necessary, it proved the paradigm, built the infrastructure, created economic incentive. But it also created a monoculture where everyone did the same thing with different amounts of money.
Now we get to find out which research bets were actually right.
Welcome back to the era of research. It’s going to be messy, chaotic, and full of dead ends.
Finally.