ReflectionAI founder Ioannis Antonoglou: From AlphaGo to AGI
Training Data: Ep27
Ioannis Antonoglou, founding engineer at DeepMind and co-founder of ReflectionAI, has seen the triumphs of reinforcement learning firsthand. From AlphaGo to AlphaZero and MuZero, Ioannis has built the most powerful agents in the world. Ioannis breaks down key moments in AlphaGo's games against Lee Sedol (Moves 37 and 78), the importance of self-play, and the impact of scale, reliability, planning and in-context learning as core factors that will unlock the next level of progress in AI.
Summary
Ioannis Antonoglou, DeepMind founding engineer and Co-Founder of ReflectionAI, helped create some of the most significant breakthroughs in AI history—from AlphaGo to AlphaZero to MuZero. In this episode, he shares critical insights about the evolution of AI agents and what it takes to build truly reliable, robust systems that can learn and adapt. His perspective on the intersection of reinforcement learning, planning and large language models points to key opportunities for founders building the next generation of AI companies.
- Scale plus planning equals breakthrough performance. The success of AlphaGo and subsequent systems showed that when you combine sufficient scale with sophisticated planning capabilities, you can achieve superhuman performance even in extremely complex domains. For founders, this means considering both raw computational power and sophisticated decision-making architectures.
- Robustness requires systematic evaluation and improvement. Just as AlphaGo needed to be rigorously tested against top human players to expose blind spots and hallucinations, modern AI systems need systematic ways to identify and correct mistakes. Building in mechanisms for error detection and recovery should be a core focus.
- Self-play and synthetic data generation are critical paths forward. The progression from AlphaGo to AlphaZero demonstrated that systems can achieve better performance by learning from self-generated experiences in addition to human examples. As we approach the limits of available human data, founders should invest in robust approaches to synthetic data generation and self-improvement.
- In-context learning and rapid adaptation capabilities are the next frontier. Beyond storing knowledge in weights, systems need to be able to learn and adapt quickly through interaction—picking up new skills and tools through few-shot learning and direct experience. This represents a major opportunity for innovation.
- The path to reliable AI agents requires solving the “sometimes” problem. While current language models can produce amazing results, they lack the consistency of game-playing systems like AlphaGo. Creating agents that can reliably execute tasks at a high level of performance—not just sometimes—is a key challenge founders must address.
Transcript
Ioannis Antonoglou: Go is a complex game. There was always a bit of worry about whether AlphaGo was truly as good as we believed. So we actually had the conviction that deep reinforcement learning is the answer based on everything that we could measure and everything we could see. But that's the thing about these systems, is that they're not like classic computers where you just know that they always produce the same answer. They're, like, stochastic, they are creative and they have some blind spots. They hallucinate similarly to how modern LLMs hallucinate. So you need to just, like, really push them and just, like, see exactly where they break. And the only way you could actually do that is by having, like, the best humans playing against them.
Stephanie Zhan: Ioannis, thank you so much for joining us today.
Ioannis Antonoglou: Thank you so much for having me.
Stephanie Zhan: Ioannis. You have an incredible background, having worked at DeepMind as a founding engineer for over a decade, starting with some of the most notable projects that have really defined the industry. DeepMind quite notably created this notion of building AI within games to start. Can you share a little bit more about why DeepMind chose to start with games at the time?
Ioannis Antonoglou: Yeah. So DeepMind was the first company to truly embrace the concept of artificial general intelligence, or AGI. From the outset, they had grand ambitions aiming to build systems that would match or exceed human intelligence. So the big question was—and still is—how do you build AGI? And more importantly, how do you measure intelligence in a way that allows for meaningful research and performance improvements?
So the idea of using video games as a testing ground came naturally to DeepMind's founders, Demis Hassabis and Shane Legg, because Demis had a background in the gaming industry and Shane's PhD thesis defined AGI as a system that could learn to complete any task. Video games provided a controlled yet complex environment where these ideas could be explored and tested.
The importance of games in AI
Sonya Huang: And to what extent—you mentioned games, they provide a very controlled environment. To what extent are games representative or not of the real world? Like, if you have a result in games, do you think that generalizes naturally to the real world or not?
Ioannis Antonoglou: So I mean, I guess games have indeed been valuable for developing AI, and you actually have a few examples of that. So you can see that PPO, for example, which is currently being used in RLHF, was developed using OpenAI Gym and MuJoCo and Atari. And similarly, we have MCTS, which stands for Monte Carlo Tree Search, and was developed through board games like backgammon and Go. But at the same time, games have a number of limitations. So the real world is messy, it's unbounded, and it's a much tougher nut to crack than even the most complex games. So even though they just give you an interesting test bed to develop new ideas, it's definitely limiting, and it doesn't really capture all the complexity of the real world.
Sonya Huang: Okay, interesting though. So, but a lot of the techniques and algorithms that you’ve developed in a game environment, PPO, et cetera, these are used in the real world.
Ioannis Antonoglou: Yeah. So PPO was actually exactly what ChatGPT used for RLHF. And so MCTS, it’s used in MuZero. And MuZero has been used in the real world in things like, you know, compression, video compression for YouTube. It was part of the self-driving system at Tesla at some time. And it was also, like, used for developing a pilot that was completely controlled by an AI. So yeah. I mean, you can see methods like that being used in the real world to solve real problems.
Stephanie Zhan: So interesting. Ioannis, I remember back in 2017 when AlphaGo the movie came out, and it featured the incredible game of AlphaGo against Lee Sedol. Can you take us back to that moment in time, and maybe the years leading up to it as you’re building AlphaGo? How was AlphaGo specifically chosen as the game to focus on?
Ioannis Antonoglou: So I feel like games have always been a benchmark for AI research. So, like, before Go, you had chess, and chess was like a major milestone with IBM's Deep Blue defeating Garry Kasparov in the late '90s. And I mean, even though chess and Go are completely different games, and Go is definitely a different beast, games have always acted as test beds, especially board games, for the development of new AI methods. Actually, even going back to the earliest days of AI research, Turing and Shannon both worked on their own versions of chess bots.
So now the thing about Go is that it’s a much harder problem than chess. And the reason for that is because it’s almost close to impossible to define an evaluation method, a heuristic. So in chess you can just take a look at the board, you can count the number of pawns that each side has, you can see what the ranks of these pawns are, and then you can just, like, draw some conclusions on, like, who is winning and why.
But like in Go, there’s—there’s nothing like that. Like, it’s mostly human intuition. And if you ask a Go professional player, like, how they know whether a position is a good one or a bad one, they will say that, like, you know, after having played the game for so long, they can just, like, feel it in their gut that this is a better position than the other one.
So now it's actually a question of how do you encode the feeling in your gut into an AI system, right? So this is exactly the reason why solving Go was considered the holy grail of AI research for a long time, and was a challenge that seemed almost impossible, but at the same time was, like, within reach. People felt that, like, you know, they could actually get it cracked. And this is exactly what AlphaGo did back in 2016. And it kind of, like, showcased two new methods, which are deep learning and reinforcement learning. Because now we kind of think of deep learning and reinforcement learning as mature technologies.
But back in 2015 and 2016, they were literally taking their first steps, and they were kind of like the new kids on the block. And most people were kind of, like, really skeptical about them. Like, everyone thought that deep learning was another AI fad that just wouldn't stand the test of time. So yeah, I mean, Go was chosen because it was, like, a clear way to showcase that you actually have, like, the most performant agent in the world. You could actually evaluate it, you could have it play against other humans. At the same time, it was within reach given, like, the latest developments in deep learning and reinforcement learning.
How AlphaGo works
Sonya Huang: I remember reading that there’s more configurations of the Go board than atoms in the universe by many orders of magnitude. And that blew me away because I mean, I grew up playing Go and it felt like such a—you know, it’s very simple in terms of the rules, but I see why it was the holy grail. Maybe can you explain how AlphaGo worked technically? Maybe explain it to me like I’m a fifth grader, because that is effectively my level of sophistication understanding these things. But how did it work? And you mentioned that both reinforcement learning and deep learning were involved. I’d love to peel that back a little bit.
Ioannis Antonoglou: Yeah, absolutely. So AlphaGo has two deep neural networks. So, like, a neural network is a function that takes something as an input and produces something as an output. And it's literally like a black box. We don't really know exactly how it does it. We just know that if you train it on enough data, it will just, like, learn the mapping, it will learn the function from the input to the output space.
So AlphaGo actually had access to two deep neural networks, the policy network and the value network. And the policy network suggested the most promising move. So it will just take a look at a current board position and just say, “Okay, you know, based on the current position, this is the list of moves that I would recommend you just, like, consider playing.”
And it also had access to the value network. It would just take a look at, like, a board position and just, like, give you a winning probability. Like, what are your chances of actually winning the game starting from this position? This is exactly the gut feeling. It had its own gut feeling on whether the position is a good one or a bad one.
So once you have access to these two networks, then you can actually play in your imagination a number of games. You can consider the most promising moves, then you can consider your opponent's most promising moves, and then you can just evaluate each move with the value network. And then you can use a method called minimax. What that says is that I want to win the game, but I also know that my opponent wants to win the game. So I want to just pick a move that will maximize my chances of winning, knowing that my opponent will try to maximize their chances of winning. So if you actually do that and simulate a bunch of moves, then you can just get the optimal action.
And the way to just do this imagination, this planning, this search in the most efficient way is by using a tree search method called Monte Carlo Tree Search. So MCTS. So whenever people talk about MCTS, they literally just mean this heuristic of how do I choose which futures to consider so that I can make informed decisions?
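To make the planning loop above concrete, here is a minimal Python sketch of policy- and value-guided tree search in the spirit of what Ioannis describes. The `policy_net`, `value_net`, and `apply_move` functions are hypothetical stand-ins, and the value backup is simplified to keep the example short; this is an illustration of the idea, not AlphaGo's actual implementation.

```python
import math

# A minimal sketch of policy- and value-guided tree search (simplified, PUCT-style MCTS).
# `policy_net(state)` -> {move: prior}, `value_net(state)` -> win probability in [0, 1],
# and `apply_move(state, move)` -> next state are hypothetical stand-ins, not AlphaGo's code.

class Node:
    def __init__(self, prior):
        self.prior = prior        # P(s, a): prior from the policy network
        self.visits = 0           # N(s, a): visit count
        self.value_sum = 0.0      # W(s, a): accumulated value
        self.children = {}        # move -> Node

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def select_child(node, c_puct=1.5):
    # Balance the value estimate (Q) against the policy prior and visit counts (U).
    total = sum(child.visits for child in node.children.values())
    return max(
        node.children.items(),
        key=lambda kv: kv[1].q()
        + c_puct * kv[1].prior * math.sqrt(total + 1) / (1 + kv[1].visits),
    )

def run_simulation(state, root, policy_net, value_net, apply_move):
    node, path = root, []
    while node.children:                           # descend to a leaf
        move, node = select_child(node)
        state = apply_move(state, move)
        path.append(node)
    for move, prior in policy_net(state).items():  # expand the leaf with policy priors
        node.children[move] = Node(prior)
    value = value_net(state)                       # simplification: value from the root player's view
    for visited in path:                           # backpropagate the evaluation
        visited.visits += 1
        visited.value_sum += value

def plan(root_state, policy_net, value_net, apply_move, num_simulations=800):
    root = Node(prior=1.0)
    for move, prior in policy_net(root_state).items():
        root.children[move] = Node(prior)
    for _ in range(num_simulations):
        run_simulation(root_state, root, policy_net, value_net, apply_move)
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]  # most-visited move
```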
As for the role of reinforcement learning and deep learning in building AlphaGo: AlphaGo was, first of all, a success of reinforcement learning and deep learning, because these are exactly the two methods that powered it. And the policy network was initially trained on a large set of human games. So you had, like, many games played by human professionals, and you just consider every position and you consider the move they took at this position. And then you have a deep neural network that tries to predict this move.
Then once you have the policy network, you need to somehow find a way to just, like, obtain a value network. So we did it in two steps. First, we just took the policy network and we had it play against itself. And we used reinforcement learning to improve the playing strength of the model. So we used a technique called 'policy gradient.' So what policy gradient does is that it just looks at the game and then it looks at the outcome. This is the simplest version of policy gradient. It looks at the outcome of the game, and for all the moves that led to a win, it'll just say "Great. Just increase the probability of choosing this move." And for all the moves that led to a loss, it says "Great. Now decrease the probability of this move being selected in the future." And if you do that for many games and for long enough, then you just get an improved policy.
Now once you have this improved policy, you can just generate a new data set of games where the policy plays against itself. And then you have, like, a huge amount of games where for each position you know who the final winner was. So then you can take another network, a value network, and have it predict the outcome of the game based on the current position. So what the network will learn is that if I start at this position and I play under my current policy, on average this is the player who wins. Like, it's either the black player or the white player. So this is the first version of a value network, and you can just use it within AlphaGo by combining it with the policy network.
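As a rough illustration of the recipe above (increase the probability of moves that led to wins, decrease it for losses, and regress a value network toward game outcomes), here is a toy training step. The `policy_net`, `value_net`, `optimizer`, and data format are assumptions made for the sketch, not the actual AlphaGo training code.

```python
import torch
import torch.nn.functional as F

def training_step(policy_net, value_net, optimizer, games):
    """Toy REINFORCE-from-outcomes step in the spirit of what is described above.

    `games` is a list of (positions, moves, outcomes) triples from self-play, where
    each outcome is +1 if the player to move at that position went on to win and -1
    otherwise. `policy_net(position)` returns move logits and `value_net(position)`
    returns a win probability; all of these names are illustrative stand-ins.
    """
    policy_loss = torch.tensor(0.0)
    value_loss = torch.tensor(0.0)
    for positions, moves, outcomes in games:
        for position, move, outcome in zip(positions, moves, outcomes):
            log_probs = F.log_softmax(policy_net(position), dim=-1)
            # Push up the probability of moves that led to a win, push down the losers.
            policy_loss = policy_loss - outcome * log_probs[move]
            # The value network regresses toward the eventual game outcome.
            target = torch.tensor((outcome + 1) / 2.0)   # map {-1, +1} -> {0, 1}
            value_loss = value_loss + F.binary_cross_entropy(value_net(position), target)
    optimizer.zero_grad()
    (policy_loss + value_loss).backward()
    optimizer.step()
```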
Stephanie Zhan: And what were some of the biggest challenges in building this, and how did you overcome them?
Ioannis Antonoglou: Yeah, so AlphaGo was not just a research challenge, but was mostly, I'd say, an engineering marvel. The early versions ran on 1,200 CPUs and 176 GPUs. And the version that played against Lee Sedol used 48 TPUs. So, like, TPUs were the first custom accelerators. And these accelerators were, like, really primitive back then, because literally it was like the first version, right? Like, now the later accelerators are much, much better and much more stable. So the system had to be highly optimized to minimize latency and maximize throughput. We had to build large-scale infrastructure for training these networks. And it was a massive endeavor. It just required a lot of coordinated effort from many talented individuals working on different aspects of the project.
But you know, I just walked you through a number of steps to just, like, obtain the policy network and the value network. And each of these steps had to just be implemented at the limits of, like, what was available and what was possible back then in terms of scale. And it had to be implemented in a way where people could just, like, think of everything. They could just try their research ideas fast and get results fast. So yeah, lots of people, scale at levels that hadn't been implemented before, and working at the forefront of what was possible back then.
Stephanie Zhan: I love your highlight of it being a research marvel and an engineering marvel. And I remember you sharing one time that part of the reason this project came about also was because Google had TPUs that they needed a test customer for, and that was the spark of this AlphaGo project. So that’s pretty incredible.
Sonya Huang: How much conviction did the DeepMind team have that this was going to work? You mentioned that at the time, deep learning, reinforcement learning were still relatively novel, but DeepMind was very much founded with that belief. But did you guys think that you were going to be able to have these superhuman-level results, beating the top Go player in the world? Like, was it a crazy idea and maybe it’ll work, or did the team have conviction, like, this is going to work?
Ioannis Antonoglou: Yeah. So I’d say that the team had a cautious optimism. So one of AlphaGo’s lead developers, Aja Huang, he is a strong amateur Go player, and he had been working on Go for, like, a decade before AlphaGo happened. And we also had, like, a leaderboard of computer players, and you could see that AlphaGo was significantly stronger than anything that had come before. But Go is a complex game, and there was always a bit of worry about whether AlphaGo was truly as good as we believed.
So we actually had the conviction that deep reinforcement learning is the answer based on everything that we could measure and everything we could see. But that's the thing about these systems, is that they're not like classic computers where you just know that they always produce the same answer. They're, like, stochastic, they are creative and they have some blind spots. They hallucinate similarly to how modern LLMs hallucinate. So you need to just, like, really push them and just, like, see exactly where they break. And the only way you could actually do that is by having, like, the best humans playing against them.
Move 37
Stephanie Zhan: Move 37. Can you tell us what that was? It was such a monumental move, and I think everyone watching it at the time was—and Lee Sedol maybe primarily was confused by that move. What was going on in your head when that happened?
Ioannis Antonoglou: So yeah, I mean, move 37 in game two against Lee Sedol was literally just a spectacular moment in the sense that it kind of showcased to the world that AlphaGo had creativity, and it demonstrated that AI could come up with strategies that even top human players hadn't considered. So at first, I still remember that we thought that AlphaGo made an error, that it had actually hallucinated, that it did something that it didn't mean to. But then it turned out to be a brilliant, unconventional move that underscored that the system had a deep understanding of the game, that the system actually had, like, creativity. It could think of things that people hadn't thought of before.
Sonya Huang: I want to take us to another key move in the game. I think it was in game four. At this point I was rooting for Lee because I was like, the poor guy needs to win a game.
Stephanie Zhan: [laughs]
Sonya Huang: Move 78. I think AlphaGo made a mistake and Lee Sedol notices it. I guess what was the weakness there that Lee found during the game?
Ioannis Antonoglou: Yeah, exactly. So I mean Lee Sedol's move in game four was literally a testament to human ingenuity. Like, move 78 was unexpected and caught AlphaGo off guard. Initially, AlphaGo, based on its evaluations, misinterpreted it as a mistake and thought that it was actually, like, winning. So that's why it didn't respond appropriately. And this kind of highlighted a blind spot in the system. So the game showed that, like, while systems like AlphaGo are extremely powerful, at the same time they still have vulnerabilities and there were, like, still areas where we could further improve it.
Sonya Huang: But how do you go about improving something like that? Do you need to show it a lot more data of, you know, kind of that type of human ingenuity move, or how do you go about fixing and patching those blind spots?
Ioannis Antonoglou: So yeah, I mean, it's actually interesting that by the end of the games with Lee Sedol, we just, like, put together a benchmark where you're just kind of like trying to quantify and just have a way of measuring the mistakes that AlphaGo makes, and these kinds of blind spots, let's say. And then we just tried a number of approaches to just, like, improve the algorithm so that we could solve these issues.
And what happened is that actually the most effective way of getting rid of them was to just do what we were doing, just, like, at a higher scale and better. So just, like, change the architecture of the model. We just switched to a deep ResNet with two output heads. And we also, like—we just had a bigger network trained on more data, then moved to AlphaZero and better algorithms, and that kind of, like, made it so that we didn't have any hallucinations anymore. So in a way, scale and data, which are kind of the well-known recipe in the field of AI, are exactly what solved it in our case, too.
Stephanie Zhan: With scale and data, how much did higher quality data, or maybe specifically data from great professional players, the best professional players, make a meaningful difference, or was it just any data?
Ioannis Antonoglou: No. For us, what mattered was that we solved it using self-play.
Stephanie Zhan: Yeah.
Ioannis Antonoglou: So we actually had access to the most competent Go player in the world, and we just, like, used it to generate the best quality games, and then we just trained on these games. So I guess, like, you know, we didn't need to have, like, human experts because we had an expert in house. It just wasn't human.
From AlphaGo to AlphaZero
Stephanie Zhan: Right. Huh. Interesting. Amazing. Well, I'd love to move on to the progression from AlphaGo to AlphaZero. And you talked a little bit about this notion of self-play just now. AlphaZero was powerful because it learned how to play the game from scratch, entirely from self-play, without any human intervention. Can you share more about how that worked and why that was important?
Ioannis Antonoglou: So AlphaZero was a game changer because it learned entirely from scratch through self-play, without any human data. And this was, like, a major leap from AlphaGo because AlphaGo, as I said, relied heavily on human expert games. So two things happened. First of all, AlphaZero managed to simplify the training process, and it also showed that AI can literally get from zero to superhuman performance purely by playing against itself. And that allowed it to be applicable to a whole range of new domains that were out of reach because there wasn't enough human data for them. But I feel like the more important thing is that AlphaZero also solved all the issues that AlphaGo had in terms of hallucinations, in terms of, you know, blind spots and robustness. So, like, AlphaZero was just a better method, full stop.
Sonya Huang: And you explained kind of how AlphaGo worked to a fifth grader. What would you tell the fifth grader would be the key difference technically that you implemented with AlphaZero?
Ioannis Antonoglou: So AlphaZero, just like AlphaGo, uses a policy network and a value network along with Monte Carlo Tree Search. So in that respect, it’s exactly the same as AlphaGo. So the key difference is in training. AlphaZero starts with random weights and learns by playing games against itself. And by playing games against itself, it iteratively improves its performance. But the main idea behind AlphaZero is that whenever you take a set of weights, a set of policy and value networks, and then you just combine them with search, then you just, like, end up with a better player. You just, like, increase your performance, you just, like, become a stronger player.
So what that meant is that we can actually use this mechanism to improve the model policy, the raw policy. So this is what we call, in reinforcement learning, a policy improvement operator. Whenever you can just, like, take an existing policy and then do something, some magic, and come up with a better policy, and then you can just, like, take this policy and distill it back to the initial policy and repeat this process, then you have a reinforcement learning algorithm.
And I think, like, you know, this is exactly what people are trying to do today with, you know, Q* or synthetic data. This is exactly the idea of, like, how can I take a policy, do something with it, planning, search, compute, whatever it is, and derive a better policy which I can then imitate and just, like, kind of distill back to the original policy? So this is exactly what AlphaZero is doing. It uses MCTS search to produce a better policy. Then it takes its trajectories, it trains this policy and value network on the new better trajectories, and it repeats this process until it converges, you know, to an expert-level Go player.
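In rough pseudocode, the improve-then-distill loop described above might look like this. The `self_play_game` and `train` helpers are hypothetical, and the sketch compresses many details of the real system.

```python
def alphazero_training_loop(network, environment, iterations=100, games_per_iteration=1000):
    """Structural sketch of search-as-policy-improvement (not DeepMind's code).

    `self_play_game` and `train` are hypothetical helpers: the first plays one game
    where every move is chosen by MCTS guided by `network`, returning a list of
    (position, mcts_visit_distribution, final_outcome) tuples; the second does
    supervised learning on those targets.
    """
    for _ in range(iterations):
        replay_buffer = []
        # 1. Improvement: running MCTS on top of the current network acts as a
        #    policy improvement operator; the visit counts define a stronger policy.
        for _ in range(games_per_iteration):
            replay_buffer.extend(self_play_game(network, environment))
        # 2. Distillation: train the raw network to imitate the search policy
        #    (the visit distribution) and to predict the final outcome from each position.
        network = train(
            network,
            replay_buffer,
            policy_target="mcts_visit_distribution",
            value_target="final_outcome",
        )
    return network
```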
Sonya Huang: That’s fascinating and counterintuitive, that kind of like starting without the weights that you would have from professional-level players is actually a better starting place.
The next level with MuZero
Stephanie Zhan: The epitome of AI agents in games was achieved, I think, via MuZero, which is the progression even from AlphaZero itself. And it's also where you became one of the leads of the project. AlphaZero was obviously impressive because of self-play, but it also needed to be told the environment's dynamics or the rules of the game. And MuZero takes this to the next level without needing to be told the rules of the game. And it mastered quite a few different games—Go, chess and many others. Can you share a little bit about how MuZero worked and why this was particularly meaningful?
Ioannis Antonoglou: Absolutely. So AlphaZero, as you said, was a massive success in games like chess, Go, and shogi, so in games where we actually had access to the game rules, where we actually had access to a perfect simulator of the world. But, like, this reliance on the perfect simulator made it challenging to apply it to real world problems. And real world problems are often messy and they lack clear rules, and it’s really hard to just write a perfect simulator for them. So that’s exactly what MuZero tried to solve.
So MuZero masters the games, of course, like Go, chess and shogi, but it also masters more visually challenging games, games that are hard to hand-code a simulator for, like Atari. And it does that without having access to the simulator. It just, like, learns how to build an internal simulator of the world, and then just uses this internal simulator in a way similar to what AlphaZero was doing. So it does that by using model-based reinforcement learning, which means that you can just take a number of trajectories generated by an agent, and then try and learn a prediction model of how the world works.
So this is actually quite similar to what methods like Sora are trying to do now, where they just take YouTube videos, and they try to just learn a world model by just trying to predict, starting from one frame, what's going to happen in the future frames. So MuZero tries to do exactly that, but it does it in a way different from generative models in the sense that it tries to only model things that matter for solving the reinforcement learning problem. So it tries to predict what the reward is going to be in the future. What's the value of future states? What's the policy for future states? So only things that you need within your MCTS. But the fundamental scaffold remains the same. So how do you just learn a model based on trajectories, and then once you have these models, you can just combine them with search and get superhuman performance.
So of course, like, you can always decouple the two problems and have the model being trained separately from data out in the wild, and then just combine that with MuZero. And we just found that back then, given the limitations of our models and the smaller sizes, it kind of, like, made more sense to just keep those two together, and only have the model predict things that matter for planning, instead of just trying to model everything, because you were hitting the limits of what the capacity of the model could take.
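A schematic of what planning inside a learned model can look like, with `representation`, `dynamics`, and `prediction` as hypothetical stand-ins for MuZero's three learned functions. The point of the sketch is that the imagined rollout predicts only rewards, values and policies, never raw observations.

```python
def muzero_style_rollout(observation, actions, representation, dynamics, prediction):
    """Sketch of an imagined rollout in a learned model, MuZero-style.

    `representation(observation)` encodes the observation into a hidden state,
    `dynamics(hidden_state, action)` steps the hidden state forward and predicts
    the reward, and `prediction(hidden_state)` outputs a policy and a value.
    All three are hypothetical stand-ins for the learned networks.
    """
    hidden_state = representation(observation)
    outputs = []
    for action in actions:                     # imagined trajectory, no real simulator needed
        policy, value = prediction(hidden_state)
        hidden_state, reward = dynamics(hidden_state, action)
        outputs.append((policy, value, reward))
    return outputs
```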
Similarity to Sora and world models
Stephanie Zhan: So interesting. Is it right to assume then that not only Sora takes the same approach, but maybe other world models or other robotics foundation models?
Ioannis Antonoglou: Yeah, so anything that tries to just build a model of how the world works and then just, like, use that for planning is within, you know, the family of MuZero-style methods. So yeah, you can just, like, train it on YouTube videos. You can train it on the inputs coming from robots. You can train it on, you know, any environment. You can even think of, like, large language models as a form of models of text. So, like, they model text. But the thing about text is that, like, the model is kind of trivial. Like, there aren't many artifacts happening when you're trying to predict what the next word is going to be.
Stephanie Zhan: Right.
Sonya Huang: Have you seen the ideas behind MuZero kind of be used outside gameplay or in any messy real world environments?
Ioannis Antonoglou: Yeah. I mean, so as I've said, AlphaZero and MuZero are quite general methods, and they've been picked up by a number of scientific communities. In chemistry, there's AlphaChem. In quantum computing and in optimization, some people just, like, adopted AlphaZero because it was really powerful at doing planning and just, like, solving these optimization problems. At the same time, MuZero was incorporated in a version of, like, Tesla's self-driving system. It was kind of reported at their AI day, and it was also used, and I think it's currently being used, within YouTube as a custom compression algorithm. But I think, you know, it's early days, and it takes time for, like, these new technologies to be fully adopted by the industry.
Stephanie Zhan: We'd love to talk a little bit more about reinforcement learning and agents. You alluded earlier to the fact that reinforcement learning and deep learning back in 2015 were new, nascent ideas. They really grew in popularity from 2017, 2018, 2019 onwards. And then they were overshadowed by LLMs, largely because of the GPT models and everything else that came out. But now reinforcement learning is back. Why do you think that is the case?
Ioannis Antonoglou: Yeah. I mean first of all, LLMs and multimodal models have indeed brought incredible progress to AI. So these models are exceptionally powerful and can perform some truly impressive tasks. But they have some fundamental limitations, and one of them is the availability of human data. People just keep talking about the data wall, and what happens once you run out of high quality data. And this is exactly where reinforcement learning shines.
So reinforcement learning excels because it doesn’t rely solely on pre-existing human data. Instead, reinforcement learning uses experience generated by the agent itself to improve its performance. So this self-generated experience allows reinforcement learning to learn and adapt, and to even adapt to scenarios where human data is scarce or, like, non-existent. So if you define the reinforcement learning problem in the right setting, in the right way, you can literally effectively exchange compute for intelligence. You can just, like, get to a point similar to where we were with AlphaZero, where we just—like, the moment we threw more compute at it, like, made the networks bigger, we just like, you know, used more games, we just literally got a better player. And it was deterministic. You always get a better player.
So I guess this is exactly where we want to be with, like, this synthetic data pipeline. Currently we have that with, you know, the scaling laws in LLMs, that if you have, like, more data and bigger models, then you get like—you know, you can predict that there's going to be an improvement to performance. But once you've run out of human data, how do you just keep going? And synthetic data is the answer to that. And the only way that you can actually get high quality data to just improve your model is via some form of reinforcement learning. And just to be clear, I'm just, like, using reinforcement learning as a really kind of blanket term here, where I just define it as anything that learns through trial and error.
Sonya Huang: How do you think reinforcement learning is being brought into the kind of like, LLM world? And you mentioned Q* earlier. Like, I guess in a closed form game you have, like, a pretty clearly defined policy and value function. How does that work in, like, a messy kind of real world environment or the LLM world?
Ioannis Antonoglou: So I mean, I guess, like, there are two different types of, like, messy real world, right? Like, there is the case where you try to just, like, build a controller or something, and that's a really messy environment. And then there is operating in the digital space. So my personal belief is that digital AGI will just happen much earlier than, you know, robotics AGI. And the reason for that is exactly that you have control over the environment, and the environment is like computers, like the digital world. So even though it's, like, messy and it's noisy, it's still contained. It's not like the real kind of, like, world in that sense. So now in terms of how do you bring in reinforcement learning? So reinforcement learning is—we used to say in DeepMind that you have the problem and you have the solution. And the problem setting of reinforcement learning is how do I take a model, how do I take a policy and generate synthetic data? Or, like, how do I find a way to improve this policy by interacting with the environment via trial and error? And this is like the reinforcement learning problem setting, right?
And then there’s like the solution space where you have value functions and you have, like, reinforcement learning methods. So I think that there’s a lot of inspiration to draw from, like classical reinforcement learning methods that were developed in the past decade, but you have to adjust them to the new world of LLMs. So methods like Q* try to do that by just taking the idea that if I have a policy and then I do planning, I consider possible future scenarios, and then I have a way to evaluate which one is better, then I can just take the best ones and then ask the model to imitate these better ones. And this is, like, a way of improving the policy. So in the classic RL framework, you do that by using a policy and a value network. In the new world, you’ll just do that by having a reward model or asking your LLM to just give you feedback on an output it gave you.
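Here is one hedged sketch of that generate, score, and imitate loop in the LLM setting, using a reward model as the evaluator. The `llm.generate`, `reward_model`, and `finetune` names are illustrative stand-ins, and this is just one possible instantiation of the idea, not a description of Q* or any lab's internal method.

```python
def improve_policy_with_reward_model(llm, reward_model, prompts, num_samples=8):
    """Sketch of policy improvement via sampling plus a reward model: generate several
    candidate answers per prompt, keep the best-scored one, and fine-tune the model
    to imitate it. `llm.generate`, `reward_model`, and `finetune` are stand-ins."""
    distillation_data = []
    for prompt in prompts:
        candidates = [llm.generate(prompt) for _ in range(num_samples)]
        best = max(candidates, key=lambda answer: reward_model(prompt, answer))
        distillation_data.append((prompt, best))
    # Distill the search/selection-improved policy back into the raw model.
    return finetune(llm, distillation_data)
```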
Making synthetic data work
Stephanie Zhan: So interesting. You also talked a little bit about synthetic data earlier. I think some folks are very bullish on synthetic data and some folks more skeptical. I also believe that synthetic data is more useful in some domains where outcomes and success are perhaps more deterministic. Can you share a little bit about your perspective on the role of synthetic data and how bullish you are on it?
Ioannis Antonoglou: Yeah. I mean, I think, like, synthetic data is something that we have to solve one way or another. So it's not about, like, whether you know you're bullish or not. This kind of is an obstacle that you have to just find a way around. Like, we will run out of data. Like, you know, there is only so much data that, like, humans can produce. And also, like, it's important that these systems start taking actions, they start learning from their own mistakes. So we need to just find a way to make synthetic data work. Now what people have done is that they've tried, like, the most, I guess, like, naive approach where you just, like, take the models, they produce something and you try to just, like, train on that. And of course, like, you know, they've seen model collapse, and this just, like, doesn't work out of the box. But, you know, new methods never work out of the box. You just need to invest in it and just, like, take your time and, you know, really kind of think of what's the best way of doing it.
So I'm really optimistic that we'll just definitely find ways to improve these models. And I think that, like, actually there are a number of methods out there, like Q* and its equivalents, that, in this new world where people don't really share their research breakthroughs the way they used to, are probably hidden behind some company trade secrets.
Sonya Huang: I want to ask about reasoning and novel scientific discoveries. Do you think that that can naturally come out of just scaling LLMs if you have enough data? Or do you think that kind of like the ability to reason and come up with net new ideas requires kind of doing reinforcement learning and deeper compute at inference time?
Ioannis Antonoglou: So I think you need reinforcement learning to get better reasoning because the distribution of, like—it's also about the distribution of data, right? Like, you have a lot of data out in the wild on the internet, but at the same time you don't always have, like, the right type of data. So you don't have the data where, like, someone reasons and they just explain the reasoning in detail. You have some of it. And it's incredible that the models have actually managed to pick it up and just imitate it. But if you want to just, like, improve on that capability, then you need to do that through reinforcement learning. You need to just show the model how this kind of emerging capability can further be improved by just having it generate synthetic data, interact with the environment, just tell it when it's doing something right and when it's not doing something right. So yeah, I think that, like, reinforcement learning is definitely part of the answer for that.
Lessons for building AI agents today
Stephanie Zhan: AlphaGo, AlphaZero and MuZero are the most powerful agents we’ve ever built. Can you share a little bit about how some of the lessons and learnings unlocked from that are relevant to how we’re pursuing building AI agents today?
Ioannis Antonoglou: Yeah, so I think, like, AlphaGo and MuZero, you know, they've actually fundamentally transformed our approach to AI agents because they highlight the importance of planning and scale, in my opinion. If you actually look at the charts of, like, different models and how they scale, you can see that, like, AlphaGo and AlphaZero were kind of really ahead of their time. Like, they were kind of outliers. You had these curves of how compute scaled, and then you had AlphaZero somewhere standing on its own. So it shows that if you can scale and you can really push on that, then you can get incredible results.
At the same time, it also showed that you don't just have training; you can also get better performance during inference, during test, during evaluation, by just using planning. And I think that this is something that we'll start seeing more and more in the near future, where these methods will just start thinking more, planning more, before they make any decisions. So I'd say that the significance of AlphaGo and AlphaZero and MuZero is mostly in the basic principles. And the basic principles are that scale matters and planning matters. These methods can really solve problems that we thought were insanely complex or, like, you know, beyond what we can solve on our own. And similar problems to the ones that we actually observe today with these large language models are things that we saw back then, like back in 2016. We actually saw that these models can hallucinate, but that, like, at the same time they're also creative, that they'll just come up with solutions that we hadn't thought of. But they can also have blind spots or hallucinate or be susceptible to adversarial attacks, which I guess everyone knows now that these neural networks suffer from. So I think that these are the main lessons drawn from this line of work.
Stephanie Zhan: Super exciting. Thank you.
Sonya Huang: What do you think are the biggest open questions from this line of work for the field going forward?
Ioannis Antonoglou: So the main question is: we had, like, AlphaGo and MuZero, and we just, like, managed to have these insanely robust and reliable systems that would just always play Go at, you know, the highest possible kind of level, and they'll just, like, achieve that consistently. They will just, like, be top of the leaderboard, will just, like, never lose a game. So AlphaGo Master actually, like, played against 60 people in online matches and just, like, literally won every single one of them. So these methods were, like, incredibly robust and reliable. And I think, like, this is exactly what we're missing now with these LLM-based agents. Sometimes they get it, sometimes they don't. You cannot trust them. You have, like, some amazing demos but, like, you know, they happen maybe once every two times, or, like, once every 10 times you have, like, something amazing, and the remaining nine they just lose their way and don't do anything. So I think what we need to do is just find a way to make these LLM-based agents as robust as the ones that we had with AlphaGo and MuZero and AlphaZero. This is, like, the new open question: how do we actually do that?
Stephanie Zhan: We’d love to move into some of your thoughts on the broader ecosystem today. You’ve touched on a few really core problems that people are working on right now. One, the data wall problem that will hit eventually, perhaps by 2028 or so as some folks predict. Another being the idea of planning as an area that AI agents need to get better at. And then, you know, a third idea that you just described was around robustness and reliability. Can you share a little bit about maybe some of these areas that you think the whole field needs to solve that you are most excited about, to help us unlock this vision of really getting to the AI agents that we want?
Ioannis Antonoglou: Yeah. I mean, I'll just, like, also add another one to the list. So I think, like, another major challenge is, like, how to improve the in-context learning capabilities of these models or, like, you know, how do we make sure that, like, these systems can learn on the fly, and how they can adapt to new context, like, quickly. So this is another thing that I think is going to be really important. It's going to happen in the next few years, a couple of years, actually.
Stephanie Zhan: Ioannis, what’s the term that you used for that?
Ioannis Antonoglou: In-context learning?
Stephanie Zhan: Oh, in-context learning.
Ioannis Antonoglou: In-context learning, yeah. So it’s the idea that a system can actually learn how to do a new task with, like, few shot prompting. Like, it kind of like, sees a few examples, and on the fly it kind of like, learns how to adapt to the new environment. It learns how to use the new tools that were provided to it. Or, like, it kind of like, learns. It’s not just all the knowledge it has stored in its weights, but, like, it can also, like, acquire new knowledge by just, like, interacting with the real world, interacting with the environment. So I think that this is like another place where there is a lot of work happening at the moment, and we’re going to have, like, amazing progress in the next couple of years. And I’m really excited about that.
So yeah, I mean, to recap, I think planning is important, in-context learning is important, and reliability. So the best way to achieve reliability is to just ensure that these models somehow know how to learn from their mistakes. So if they just made a mistake somewhere, they can just see that and they're like, "Okay, I made a mistake. I'll just correct for it." The same way that humans make mistakes all the time but can correct for them. So these are the three areas which I'm really excited to see progress on.
How startups can compete against the big research labs
Sonya Huang: Now that you’ve kind of embarked on your own entrepreneurial journey, how do you think about the areas where startups can compete against the big research labs? And how do you motivate yourself for that journey?
Ioannis Antonoglou: Yeah. I mean, it's completely like—it's a new world for me, but at the same time, it's not that new because when I joined DeepMind, it was literally a startup, and I was, like, literally among the first few employees. So I actually, like, saw that firsthand. But, you know, one of the benefits of, like, working for a startup is, you know, the agility and the focus. So everyone really cares, everyone just moves really fast, and there's, like, a clear focus on what we want to build. So building is, like, the most important kind of motivation for people, just, like, building. And I think that, like, this is one of the big advantages that, like, startups have over more established businesses.
At the same time, you know, it's easier to just pivot, to adapt to new findings, new technologies. You're not kind of, like, tied to some pre-existing solutions or, like, some projects that you don't want to deprecate because they bring a lot of revenue to you. Whereas if you're a startup, you have no such chains. You can just move fast and be innovative and just break conventions. And at the same time it allows you to leverage open source resources, things that are out of reach for the big labs. And yeah, you don't have, like, the red tape that, like, big places tend to have.
Stephanie Zhan: I love the term that you use sometimes, Ioannis, "main quest versus side quest."
Ioannis Antonoglou: Yeah, it's the idea of, like, having a main focus. Like, you know, in big places, in big labs, they have, like, many different projects that, like, people are working on. And it usually happens that they have, like, the main quest, the main thing that, like, everyone's working on. And there are, like, many smaller side quests. The idea is that they just feed into the bigger quest, but usually they don't get as much—they don't get as many resources or as much focus from the leadership, so they tend to atrophy.
Stephanie Zhan: In the broader field, what are some of the most defining projects that you admire the most, and maybe who are some of the most influential researchers that you admire the most?
Ioannis Antonoglou: Yeah, absolutely. So I actually, like, started my AI research journey back in 2012, and I've actually, like, seen some milestones. So I'll just give a list of, like, what I think are the main milestones in AI in the past, like, 12 years that I've been around. So the first one I'd say is, like, AlexNet. This is the first paper that kind of, like, showed that deep learning is the answer. I mean, back then it didn't feel like it, it just felt like, you know, a kind of curiosity. But, like, now I think that most people are convinced that, like, deep learning is part of the answer.
Then it was DQN. I had the pleasure to actually work on DQN and just, like, see firsthand how it started. It was actually developed by a friend of mine, Volodymyr Mnih. And it was the first system that showed that you can actually combine deep learning with reinforcement learning to achieve human performance or, like, superhuman performance in really complex environments.
Then there was AlphaGo. Again, I was really lucky to just, like, work on that. And it showed that, you know, scale and planning are really important ingredients. And if you just do that right, then you get huge success in an incredibly complex environment. AlphaFold, another one. This is again by DeepMind. It shows that these methods are not just things that you can use to solve games, but they actually will make this world a better place. They will just ensure that healthcare is improved, that scientific discoveries are being realized, that we'll just make sure this world is a better place by using AI.
Stephanie Zhan: Yeah.
Ioannis Antonoglou: Then ChatGPT. It kind of like, brought AI to everyone, just, like, made it accessible to a broad audience. Like, everyone knows what AI is now. It has made my life of explaining my job much easier.
Stephanie Zhan: [laughs]
Ioannis Antonoglou: And finally GPT-4. And I think that, yeah, probably GPT-4 is, like, the latest kind of big advancement in AI, because it kind of, like, showed that, you know, artificial general intelligence is a matter of years, it's within reach. Yeah, we are getting there. Like, I think that, you know, most people now believe that we are, like, a few years away from, like, AGI. And, you know, that's because of, like, the incredible breakthrough that GPT-4 was.
Now in terms of, like, some people I really admire—before I forget. So I'd say first, like, David Silver. He was my PhD supervisor, he was my mentor at DeepMind. He's an incredible researcher. He worked—he led AlphaGo and AlphaZero and, you know, he has an unyielding dedication to the field of reinforcement learning. He's, you know, probably one of the smartest people, or maybe the smartest person, I know, and an amazing guy and an amazing reinforcement learning engineer.
And the second one I'd say is Ilya Sutskever. And, you know, he was a co-founder of OpenAI. I had the opportunity to work with him just a little bit in the really early days of AlphaGo. But I think, like, his commitment to scaling AI methods and pushing the boundaries of what these systems can achieve is remarkable, and he made sure that GPT-3 and GPT-4 happened. So yeah, immense respect towards him.
Stephanie Zhan: Thank you for sharing that.
Lightning round
Sonya Huang: Let’s close out with some rapid-fire questions. Maybe first, what do you think will be the next big milestones in AI, let’s say in the next one, five and ten years?
Ioannis Antonoglou: So I feel like in the next five to ten years the world will be a different place. I actually really believe that. I think that in the next few years we'll see models becoming powerful and reliable agents that can actually independently execute tasks. And I think that AI agents will be massively adopted across industries, especially in science and healthcare. So in that sense I'm really excited about what's coming in AI. And what I'm most excited about is AI agents, systems that can actually, like, do tasks for you. And, you know, this is exactly what we're building at Reflection.
Stephanie Zhan: In what year do you think we’ll pass the 50 percent threshold on SWE-bench?
Ioannis Antonoglou: So I think we are one to three years away from the 50 percent threshold for SWE agents and three to five years from achieving 90 percent. The reason is that while progress is amazing, I think we still need reliable agents to hit these milestones. And when it comes to research, it's, like, really hard to make precise predictions.
Sonya Huang: When do you think we’ll hit the data wall for scaling LLMs? And do you think all the research in RL is mature enough to keep up our slope of progress? Or do you think there will be a bit of a lull as we try to figure out what happens when we hit the wall?
Ioannis Antonoglou: So I think, based on what I've read, we have at least one more year for text before we hit the wall. And then we have, like, these extra modalities, which might actually buy us maybe an extra year. And I think we are in a really good place to just, like, start using synthetic data. So in the next few years we'll just, like, figure out the synthetic data problem. So I think that we won't really hit the wall—or, like, we'll hit the wall, but no one will realize it because we'll have, like, new methods in place.
Sonya Huang: Do you think LLMs will have their AlphaGo moment, and if so, when?
Ioannis Antonoglou: I think LLMs had their AlphaGo moment with the initial release of ChatGPT, where they showcased the power and the progress made over the past decade. I think what they haven't had yet is their AlphaZero moment. And that's the moment where more compute directly translates to increased intelligence without human intervention. And I feel like this breakthrough is still on the horizon.
Sonya Huang: When do you think that will happen?
Ioannis Antonoglou: I think it’s going to happen in the next five years.
Stephanie Zhan: Wow, amazing. Ioannis, thank you so much for joining us and taking us through the awesome history of AlphaGo, AlphaZero, MuZero, your own journey through DeepMind, and then many of the core research problems that the whole industry is tackling today around data and building for reliability and robustness and planning and in-context learning. We’re really excited for the future that you’re helping us build, and that you’re pushing forward in the field as well, so thank you so much, Ioannis.
Ioannis Antonoglou: Thank you so much for having me.
Mentioned in this episode
- PPO: Proximal Policy Optimization, an algorithm developed by OpenAI in game environments. Also used by OpenAI for RLHF in ChatGPT.
- MuJoCo: Open source physics engine used to develop PPO
- Monte Carlo Tree Search: Heuristic search algorithm used in AlphaGo as well as video compression for YouTube and the self-driving system at Tesla
- AlphaZero: The DeepMind model that taught itself from scratch how to master the games of chess, shogi and Go
- MuZero: The DeepMind follow-up to AlphaZero that mastered games without knowing the rules and was able to plan winning strategies in unknown environments
- AlphaChem: Chemical Synthesis Planning with Tree Search and Deep Neural Network Policies
- DQN: Deep Q-Network, introduced in the 2013 paper Playing Atari with Deep Reinforcement Learning
- AlphaFold: DeepMind model for predicting protein structures for which Demis Hassabis, John Jumper and David Baker won the 2024 Nobel Prize in Chemistry