
“I have never been congratulated so much for winning one game,” said a smiling Lee Sedol, at the packed press conference amid enthusiastic applause and camera flashes. The great 9-dan Go player had just defeated DeepMind’s AlphaGo in game four with an unexpected wedge move that his adversary critically underestimated. Two days later, AlphaGo would prevail, winning four of the tournament’s five games. It wasn’t Lee’s move 78 that captured the world’s imagination, but AlphaGo’s move 37 two games earlier. Move 37 was so unconventional that Lee and much of the audience thought it was a mistake before realizing how “beautiful” it was. It had a 1-in-10,000 likelihood of being played by a human and was a landmark demonstration of AI’s ability to be superhumanly creative. 

During the tense final minutes of game five, the AlphaGo team crammed into the VIP suite at the Seoul Four Seasons Hotel that had become its command center. The team lead, David Silver, was flanked on one side by DeepMind founder Demis Hassabis and on the other by Ioannis Antonoglou, the research engineer who accelerated the neural networks that made AlphaGo tick. The moment when Lee finally resigned marked the end of a two-year journey that pushed the science of reinforcement learning (RL) to new heights, and the team reveled in the accomplishment.

For Ioannis, it was a vindication of what had seemed a risky career move, joining the AI startup in London in 2012 as employee #25 and researcher #6. “It was the only place in the world where people were seriously considering building AGI,” he recalls now, and he was amazed at what RL had achieved in such a short time. He wasn’t the only one who was amazed. In Chicago, a freshly minted PhD in quantum physics named Misha Laskin read the AlphaGo paper and abruptly changed the course of his life. Here was something as beautiful as theoretical physics but with real world impact. Misha realized, “this was a real glimpse into what it will be like to live among superintelligence.”

AlphaGo and beyond

Ioannis’s first collaborator at DeepMind was Vlad Mnih, the researcher credited with building the first deep reinforcement learning agent, DQN (Deep Q Network). He asked Ioannis to write a clean implementation of DQN as a library that DeepMind could use to advance deep reinforcement learning research. “Ioannis’s attention to detail and intellectual rigour stood out right away,” says Vlad. “He would always ask why we were going with particular design choices, why we weren’t incorporating things like LSTMs sooner, etc.”

A year later, when David Silver started working on AlphaGo, DQN had demonstrated how to use deep reinforcement learning in game playing. AlphaGo built on this foundation with many sophisticated innovations, including Monte Carlo Tree Search (MCTS), a powerful search algorithm that allowed it to explore possible future moves by simulating game outcomes. AlphaGo also featured a dual-network architecture, with a policy network to predict the next move and a value network to evaluate board positions.
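In miniature, that loop can be sketched in code: a PUCT-style tree search that uses a policy prior to guide exploration and a value estimate at the leaves. The sketch below is purely illustrative, not AlphaGo's actual implementation: it substitutes a toy take-away game for Go and stubs out the two networks (a uniform policy and a neutral value function).

```python
import math

# Toy stand-in for Go: players alternately add 1 or 2 to a running
# total; whoever reaches exactly 10 wins.
WIN_TOTAL = 10

def legal_moves(total):
    return [m for m in (1, 2) if total + m <= WIN_TOTAL]

def policy_net(total):
    # Stub "policy network": a uniform prior over legal moves.
    moves = legal_moves(total)
    return {m: 1.0 / len(moves) for m in moves}

def value_net(total):
    # Stub "value network": a neutral evaluation of non-terminal positions.
    return 0.0

class Node:
    def __init__(self, total, prior):
        self.total, self.prior = total, prior
        self.visits, self.value_sum = 0, 0.0
        self.children = {}  # move -> Node

    def value(self):
        return self.value_sum / self.visits if self.visits else 0.0

def puct_score(parent, child, c=1.4):
    # AlphaGo-style selection: exploit known value, explore in
    # proportion to the policy prior and inversely to visit count.
    u = c * child.prior * math.sqrt(parent.visits) / (1 + child.visits)
    return -child.value() + u  # child's value is from the opponent's view

def simulate(node):
    # One MCTS simulation; returns the position's value for the player to move.
    if node.total == WIN_TOTAL:
        node.visits += 1
        node.value_sum += -1.0  # previous player just won; mover has lost
        return -1.0
    if not node.children:  # expand a leaf using the policy prior
        for move, p in policy_net(node.total).items():
            node.children[move] = Node(node.total + move, p)
        v = value_net(node.total)
    else:  # select the best child and recurse; negate (zero-sum, turns alternate)
        move = max(node.children, key=lambda m: puct_score(node, node.children[m]))
        v = -simulate(node.children[move])
    node.visits += 1
    node.value_sum += v
    return v

def best_move(total, n_sims=200):
    root = Node(total, prior=1.0)
    for _ in range(n_sims):
        simulate(root)
    return max(root.children, key=lambda m: root.children[m].visits)
```

In the real system the two stubs are deep networks trained on expert games and self-play, and the search runs tens of thousands of simulations per move; the structure of the loop, however, is the same.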

This was a monster of a system by contemporary standards and Silver knew that maximizing throughput and minimizing latency was mission-critical to train the system and then deploy it for real-time game play. He tapped Ioannis to write kernels on GPUs that could speed up the neural networks they were training. DeepMind was acquired by Google in 2014 and Ioannis quickly moved on to implementing the system on Google’s first TPUs (tensor processing units). These first chips were extremely temperamental and hard to work with but Ioannis figured out how to tame them. “If I couldn’t get it to work,” Ioannis says now, “then we wouldn’t be able to achieve the things that we achieved with AlphaGo.”

After the triumphant victory in Seoul, Ioannis continued to work on expanding the capabilities of these game-playing agents with a small core team. “One of the first papers that we wrote was AlphaGo Zero,” he remembers. “We showed that you can actually get the same performance—actually much better performance than AlphaGo—starting from completely random behavior all the way to superintelligent behavior.” This tabula rasa approach, combined with self-play, was important to demonstrate that RL can scale to solve new problems.

Then the team built AlphaZero, which added superhuman-level play of chess and shogi starting from random play, with no domain knowledge except the game rules. The zero in this case also meant that the algorithm could learn to play any zero-sum board game. Three years later they introduced MuZero, which added Atari games to its arsenal while no longer supplying game rules—just raw pixels. Using reinforcement learning on problems without access to a perfect simulator of the world, as you have in Go or chess, was a longstanding challenge that MuZero solved. “You can actually have a system that both learns the world dynamics,” says Ioannis, “and uses this learned model of the world to actually plan into the future.”

Transformers in Toronto

Although Misha was captivated by DeepMind’s research, it took him a while to find his way there. He caught the entrepreneurial bug and went through Y Combinator with an AI startup, but the frontier of AI research kept calling—he dove into a postdoc in Pieter Abbeel’s lab at Berkeley to hone his AI research chops. Like a good reinforcement learning agent, he maximized learning from this environment of other brilliant researchers. Entering as a former physicist had a side benefit: “Because I didn’t know anything, I had this kind of clean slate. And I think that enabled me to try some pretty simple things that other people thought were too obvious to try.” He adds that, “It was surprising to me how doing the simple thing correctly was hard. And oftentimes when people try something simple and it didn’t work, it’s because they didn’t try it hard enough.”  

Over the course of the next two-and-a-half years, Misha followed the interests he developed through self study, improving RL methods in the DeepMind Control Suite and Atari Games environments along with the OpenAI Gym. Specifically, he worked on data augmentation strategies and extracting high-level features from raw pixels (CURL). The most cited paper he co-authored at Berkeley was Decision Transformer, which reframed RL as a sequence modeling problem, leveraging the simplicity and scalability of the Transformer architecture.

In 2021, Misha interviewed with Vlad, who was starting a new general agents team for DeepMind in Toronto. “I was immediately struck by Misha’s enthusiasm and belief in the power of reinforcement learning and unsupervised approaches, so it was really exciting when he decided to join the team,” recalls Vlad. “Misha was very productive from the start. He got up to speed with Google infrastructure and started producing research results in a matter of weeks, the same amount of time it took him to find all the good coffee shops in Toronto.” 

Vlad also appreciated the clean slate Misha brought with him from his academic research: “He liked to start on new problems with very simple approaches. It’s easy to overcomplicate things in research, but Misha seemed to have a knack for finding the simplest approach that will work on a problem.”

Vlad and Misha would go on to collaborate on a paper titled, In-context reinforcement learning with algorithm distillation after working together on a DeepMind hackathon project. “Misha and I decided to train a Transformer to mimic a reinforcement learning algorithm to see if it would exhibit self-improvement on tasks it wasn’t trained on,” says Vlad. “We had a lot of fun pair programming and getting the first prototype to work. This was an early example of in-context self-improvement, which has become a very active area of research in the LLM literature. We got really excited when we discovered that the Transformer learned a much more efficient RL algorithm than the one it was trained to mimic.”

Both Ioannis and Misha were pushing hard on how to make deep RL agents more capable and general through increased scale and algorithmic improvements. But on November 30, 2022, AI history took an abrupt turn. “Everything changed when ChatGPT came out,” Ioannis remembers. “I think that was a big moment for everyone in the industry to see that these systems are so powerful and they can be a starting point for something that’s even more intelligent.”

The Race to Gemini

ChatGPT’s success revealed that the value of deep reinforcement learning had been hiding in plain sight: reinforcement learning from human feedback (RLHF) was a crucial part of its training, aligning the model’s outputs with human preferences and values. As with the Transformer itself, OpenAI didn’t invent deep RL, but it was the first to deploy it at this scale.
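The first stage of an RLHF pipeline is typically a reward model trained on human preference pairs with a Bradley-Terry-style loss: the response a human preferred should score higher than the one they rejected. A minimal sketch of that idea, assuming a toy linear "reward model" over hand-made two-dimensional features (all data and names here are hypothetical, not any lab's actual setup):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(r_chosen, r_rejected):
    # Bradley-Terry objective: penalize the model when the rejected
    # response scores close to, or above, the chosen one.
    return -math.log(sigmoid(r_chosen - r_rejected))

def train_reward_model(pairs, lr=0.1, epochs=200):
    # Toy linear reward model r(x) = w . x, trained by SGD on the loss above.
    w = [0.0, 0.0]
    for _ in range(epochs):
        for chosen, rejected in pairs:
            rc = sum(wi * xi for wi, xi in zip(w, chosen))
            rr = sum(wi * xi for wi, xi in zip(w, rejected))
            g = sigmoid(rc - rr) - 1.0  # gradient of the loss w.r.t. (rc - rr)
            for i in range(len(w)):
                w[i] -= lr * g * (chosen[i] - rejected[i])
    return w

# Synthetic preference data: the first feature (say, "helpfulness")
# is what drives the human's choice in these made-up pairs.
pairs = [([1.0, 0.2], [0.1, 0.9]),
         ([0.8, 0.5], [0.3, 0.4])]
w = train_reward_model(pairs)
```

In the real recipe the linear model is the LLM itself with a scalar head, and the learned reward then drives a policy-optimization step; the preference loss, though, is this same one.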

ChatGPT’s ramp to 100 million users in two months sent shockwaves through the tech industry—but particularly at Google. Not only had OpenAI capitalized on the Transformer architecture, it had added an RL step in post-training to make a runaway consumer product. At the beginning of 2023, Google’s CEO Sundar Pichai called a “code red” and enlisted founders Larry Page and Sergey Brin to help the company catch up to OpenAI. A few months later, the Google Brain team was merged with DeepMind to become Google DeepMind under the leadership of Demis Hassabis.

The code red brought the AlphaGo lineage of research to an end for Ioannis, who was tapped to lead the RLHF effort for Google’s new large language model (LLM) Gemini. Misha was looking to join the Gemini team as well and Ioannis asked him to lead on reward models. Over the next two years they worked together to launch Gemini 1 and 1.5. Misha was initially intimidated by Ioannis, but soon came to see his colleague in a new light. “What was interesting about the RLHF team that he led was that he was extremely people oriented and worked very hard on ensuring that the people were self-motivated. He would deeply understand what drove them and would figure out the direction to steer them to ensure that what they were doing was impactful for Gemini. I think that’s a superpower that he has as a leader.”

By December of 2023, Google had announced Gemini, and when v1.0 was released in February of 2024 it was comparable to or better than OpenAI’s GPT-4 on most benchmarks. This was a huge accomplishment, but the team didn’t let up. A couple of months later it announced Gemini 1.5, three months before OpenAI released its next major upgrade, GPT-4o. Gemini 1.5 improved compute efficiency through a Mixture of Experts architecture, added multimodal capabilities, and extended the context window to a million tokens.

For Misha, the sudden prominence of LLMs was a revelation. “We were trying to solve the general agents problem, but we were looking at it completely the wrong way,” he says. “Because we thought that generality wasn’t solved.” For all their limitations and confabulations, LLMs are the most general systems ever invented. “They’re not very reliable,” Misha explains. “They can’t do any work for you autonomously. You can’t ask them to do something in the way that you would ask a coworker, but they’ll answer almost any question that you have.”

Combining that broad intelligence with the ability to take specific actions reliably was still an unsolved problem. “In AlphaGo, the reward is clear. You win the game or you lose. And in these systems where it’s clear, reinforcement learning works like magic,” says Misha. In most real life situations, however, the reward is quite fuzzy. How do you know if a complex task is complete? How do you know if a human would like a model’s response? This is where his work at DeepMind on open-ended reward functions for general agents comes in. “It just so happened that that building block that I was focused on outside of language models was one of the core building blocks for training them,” Misha explains.

Misha and Ioannis had achieved their goal of becoming, in Misha’s words, “the best craftsmen of training language models.” They understood these systems inside and out, and yet they didn’t feel the trajectory was moving fast enough towards their burning research question: how to build superintelligent systems capable of acting autonomously?

Reflection co-founders Ioannis Antonoglou and Misha Laskin

Snapping the puzzle pieces together with Reflection

LLMs have achieved a very wide range of capabilities, but their competence is shallow. They have, as Misha puts it, solved the breadth problem but not the depth problem. “It became very clear that these two streams of research are the two building blocks for building a superintelligence,” he says. “You get generality from the language models and you get capability from reinforcement learning.” And indeed, Ioannis adds, “all the necessary components to build a superintelligent agent that can have an impact in the real world are there.”

When they left DeepMind to form Reflection in March of 2024, their strategy was at odds with where most of the industry was heading. A year later, every major lab is working on reasoning models that augment LLMs with RL, but this wasn’t an obvious direction at the time. Similarly, they had decided to train their own models, a seemingly prohibitively expensive move for a small startup, but it’s clear now that you can radically improve the economics of model training through smart engineering.

“It’s one thing to have the algorithms,” says Ioannis, “but it’s equally important—or even more important—to build them in a way that’s scalable and uses the compute in the best way possible. And we’ve actually seen the power of engineering with DeepSeek, they’ve managed to get such good performance with such a small budget by doing the engineering really well.”

Scaling is crucial to the Reflection approach, but specifically it’s about scaling data and compute in post-training, which allows the agent to get better and better at specific tasks. As Ioannis learned when working on AlphaGo, you have to “really push hard on the systems” to get superhuman results. But, he admits, “there is a bit of a leap of faith involved, now you have the scaling laws, but until you actually put the investment in and really see that they hold as you go up the orders of magnitude of compute, then you can never be sure, right?”

The scaling laws for pre-trained LLMs are beginning to asymptote, but with RL in post-training the curve is just beginning to climb. Unlike the training data for LLMs, which may be running out, the data for RL training is largely created by the RL agent itself in its interactions with its environment. “It gets some things right, it gets some things wrong,” Ioannis explains. “Things that it gets right it should just do more of and things that it gets wrong it should do less of. So this is like learning from its mistakes. And this really simple mechanism, if scaled correctly, gives rise to superintelligent behavior.”
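The "do more of what worked, less of what didn't" mechanism Ioannis describes is essentially the policy-gradient idea at the heart of modern RL. A minimal REINFORCE-style sketch on a two-armed bandit, with a running-average baseline standing in for "better or worse than usual" (a toy illustration, not anyone's production system):

```python
import math, random

def softmax(logits):
    # Convert action preferences into a probability distribution.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def train(steps=2000, lr=0.1, seed=0):
    # Two-armed bandit: action 1 earns reward 1, action 0 earns nothing.
    rng = random.Random(seed)
    logits = [0.0, 0.0]
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        a = 0 if rng.random() < probs[0] else 1  # sample an action
        reward = 1.0 if a == 1 else 0.0
        baseline += 0.1 * (reward - baseline)  # running average of reward
        adv = reward - baseline  # did this outcome beat "usual"?
        for i in range(2):
            # Gradient of log pi(a) w.r.t. each logit under softmax.
            grad = (1.0 if i == a else 0.0) - probs[i]
            # Outcomes better than usual raise the action's probability;
            # worse-than-usual outcomes lower it.
            logits[i] += lr * adv * grad
    return softmax(logits)

probs = train()
```

Run it and the policy concentrates on the rewarded arm. Scaled up, the "bandit" becomes a language model acting in a rich environment, but the update rule is this same mechanism: reinforce what worked.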

Misha puts a finer point on this, “the fundamental thing about reinforcement learning—it is the most scalable form of AI. There’s basically no ceiling that we know of.”

Not all tasks are equal for LLMs. They are spiky, just like humans, primarily because of differences in the quality and quantity of their training data. Misha explains that the model’s intelligence isn’t evenly distributed across all of human knowledge, but is “a kind of jagged intelligence.” He predicts that the first generally superintelligent systems will be good at some things and not at others.

It turns out that the thing LLMs are spikiest on is also the thing they are currently used for the most in a work context: coding. “The areas where models are already starting to work well, they’re going to get amplified,” says Misha, “and that’s going to be the first place where we see superintelligent behavior.”

Superintelligence, the product

“A lot of AI teams think that the model is the product,” says Misha. “From talking to customers, the way we think about what we’re building, it’s really not a model, it’s a system. So it’s a model coupled to a product that solves real problems for our customer.”

The opportunity that the Reflection founders see is to create genuinely autonomous coding agents that can accurately complete end-to-end tasks that an engineer would currently do. “And if there’s one thing that customers care about above all else in autonomous coding,” Misha reports, “it’s reliability.”

Misha uses Waymo as an analogy for what this might look like in practice. “Part of the product,” he explains, “is the geofencing. It’s not just a car, it’s determining the boundary and its area of excellence.” Instead of claiming at the outset to be capable of any task—as most current coding agents do—the Reflection approach is to partition off the software equivalent of a large metro area and make strong guarantees for safety and reliability within that geography.

The bright side of this limitation is that it enables a more intuitive interface for users. “The most frictionless interface is like the one that you communicate with humans,” says Misha. “The example is you give it an engineering task, and it comes back to you with an implementation that is better than the one that you would have imagined. That’s what we think superintelligence looks like.”

The founders believe that solving autonomous coding is a “root-node” problem that will unlock more general superintelligence. “We think that autonomous coding is AGI complete. So if you show that you have a super intelligent software developer, then that’s all it takes, that’s an AGI,” says Ioannis. “Then it’s a matter of just taking the same algorithm and applying it to different verticals. But you have the recipe for how to get superintelligent systems. All the things that you need for intelligence are there in this particular problem.”

Why code, why now?

There are many kinds of intelligence, not just those required to write code. But code seems to be the most available surface area to push machine intelligence. “We think that the intelligence is going to evolve faster than the software,” predicts Misha. “The reason to work on software engineering today is because it’s the one category that is already ready for this. That entire category has been built for it to be machine friendly.”

Automating coding with autonomous agents will have a profound effect on software itself. “In software the form factor that we have as humans, the UIs we are used to using on a computer, are not necessarily the UIs that are going to be optimal for AIs,” says Misha. The GUI, the graphical user interface, evolved because humans have innate priors based on having eyes and hands. But for LLMs, their priors are the language on the internet. “But when it comes to interacting with a computer, the language that you interact with the computer with is code. So to a language model, code is as intuitive as three dimensional object manipulation is for humans,” continues Misha. “Code is ergonomic for language models.”

The implications of these changes will take time to unfold. In the process, companies that produce software will create AI-friendly UIs that enable human interactions with software products to be faster if not instantaneous. Misha foresees that “it might be that pieces of the GUI get eaten and under the hood it’s just a language model doing some work via code.” Instead of a ten click human workflow, the LLM might just write a single line of code that encapsulates the entire task. 

The Reflection team has a pragmatic definition of superintelligence as a thing that creates value by doing work on computers. “We think that a coding agent is actually going to be the way that a language model does work on any piece of software in the future,” says Misha. “So if you solve this problem, you’ve solved superintelligence on a computer for any piece of software that has an AI friendly interface.” Eventually, Sequoia partner Stephanie Zhan imagines that this will lead to “a future where we all become directors of superintelligent agents that conduct knowledge work on our behalf.”

Getting there will involve not just training models, but building the machine-friendly interfaces—from browsers and code editors to abstract representations of task categories—that literally set the stage for superintelligence. Autonomous agents will work best, the founders believe, when they have custom-designed environments (think DeepMind’s Atari environment or the OpenAI Gym) in which they can hone their particular craft. When it comes to code, the required environments and tools are fairly easy to imagine, but other cognitive categories may require greater leaps.

AI at its current stage is like the early steam engines before the discovery of thermodynamics: Lack of theory didn’t stop inventors from producing new engines. “I think deeply understanding why a model works from a theoretical sense will certainly be a very useful thing,” says Misha. “Whenever a thing has been deeply understood theoretically, at least in physics, it spurred a new era of empirical innovations because scientists knew where to search. But you don’t need to wait for that to build reliable systems today.”
Richard Feynman was an early hero of Misha’s, inspiring him to study physics. In his lecture on the conservation of energy, Feynman said, “It is important to realize that in physics today, we have no knowledge of what energy is.” The same is true of AI today, and the study of intelligence. During a Nobel Prize interview in Stockholm, DeepMind founder Demis Hassabis summed up where we are in the quest for superintelligence: “I think the science of AI is about trying to explore and understand what intelligence is, and the best expression of understanding something is actually trying to build it.”
