
Reflection AI’s Misha Laskin on the AlphaGo Moment for LLMs

Reflection AI co-founders Misha Laskin and Ioannis Antonoglou are leveraging their unique experiences from Google DeepMind—where they worked on groundbreaking projects like AlphaGo, AlphaZero and Gemini—to build universal superhuman agents. We talk to Misha about his vision to build the best agent models by bringing the search capabilities of RL together with LLMs.

Summary

LLMs are democratizing digital intelligence, but we’re all waiting for AI agents to take this to the next level by planning tasks and executing actions to actually transform the way we work and live our lives. What do we need to truly unlock agentic capability for LLMs? To find out, we’re talking to Misha Laskin, former research scientist at DeepMind. Misha and his co-founder Ioannis, co-creator of AlphaGo and AlphaZero and RLHF lead for Gemini, are leveraging their unique insights to train the most reliable models for developers building agentic workflows.

  • Agents are still far from a “tipping point.” Best-in-class coding agents today score only in the high teens (%) on the SWE-bench benchmark for resolving GitHub issues. That far exceeds previous baselines, but we’ve got a long way to go.
  • Depth is the missing piece in AI agents. While current language models excel in breadth, they lack the depth necessary for reliable task completion. Laskin argues that solving the “depth problem” is crucial for creating truly capable AI agents that can plan and execute complex tasks over multiple steps.
  • Combining learning and search is key to superhuman performance. Drawing from the success of AlphaGo, Laskin emphasizes that the most profound idea in AI is the combination of learning (from data) and search (planning multiple steps ahead). This approach is essential for creating agents that can outperform humans in complex tasks.
  • Post-training and reward modeling present significant challenges. Unlike games with clear reward functions, real-world tasks often lack ground truth rewards. Developing reliable reward models that can’t be “gamed” by increasingly clever AI systems is a critical challenge in creating dependable AI agents.
  • Universal agents may be closer than we think. Laskin estimates that we could be just three years away from “digital AGI”—AI systems with both breadth and depth of capabilities. This accelerated timeline underscores the urgency of addressing safety and reliability concerns alongside capability development.
  • The path to universal agents requires a methodical approach. Reflection AI is focusing on expanding agent capabilities concentrically, starting with more constrained environments like web browsing, coding and computer operating systems. Their goal is to develop general recipes for enabling agency that don’t rely on task-specific heuristics.

Transcript

Contents

Misha Laskin: I think someone needs to solve the depth problem. The field as a whole, or the large labs, have really been working on the breadth. That’s amazing, and there’s a big market for that and a lot of very useful things that get unlocked. But someone needs to solve the depth problem too.

Stephanie Zhan: Hi everyone. Welcome to Training Data. Today we’re hosting Misha Laskin, CEO and co-founder of Reflection AI. Misha is a former research scientist at DeepMind, and his co-founder Ioannis was a co-creator of AlphaGo and the RLHF lead for Gemini. Together they are building universal superhuman agents. We’ll chat about why we’re so far from the promise of AI agents even with best-in-class models today, what we need to truly unlock agentic capabilities for LLMs, and what we can learn from those who have built both the most powerful agents in the world, like AlphaGo and AlphaZero, and the most powerful LLMs in the world, like Gemini.

So Misha, to kick things off, we’d love to learn a little bit more about your personal background. You were born in Russia, moved to Israel when you were one, then to the United States, to Washington state, when you were nine. Your parents were pushing forward the frontier of technology and research in chemistry, and I think that inspired a lot of your love for pushing forward the frontier of technology and getting into the world of AI today as well. Can you share a little bit more about what inspired you to get into this field, and what has inspired you throughout your childhood and adulthood so far?

Leaving Russia, discovering science

Misha Laskin: Yeah, definitely. You know, when my parents emigrated from Russia to Israel, it was when the Soviet Union collapsed, and they came to Israel with basically nothing. They had, I think, $300 in their pocket, which was then stolen from them as soon as they landed, because they put down a deposit for an apartment and it just disappeared. I don’t even know if there was an apartment. And they didn’t speak Hebrew. So they decided to pursue PhDs in chemistry at the Hebrew University of Jerusalem. But that’s not because of some internal passion for academia. At that time, it was more that Israel was giving stipends to Russian immigrants to get further educated. It was interesting asking my parents about this, because they grew to love their craft as they got excellent at it. I think that might be the thing I took away from them most. It’s not that they were particularly impassioned about chemistry to start, but as they learned about it, got curious about it and really went deeply into it, I think they became masters of their craft. And that’s something that I found really important myself. Moving from there to the States. Well, my parents promised that we were moving to this beautiful state with all these mountains, this Washington.

And I remember, like, we’re taking the plane flight. I mean, I bragged to all my friends in Israel, and, you know, I was really excited. So, yeah, we’re flying. I do see mountains in the distance. But then the plane does a sort of U-turn. And maybe people don’t know this about Washington state, but it’s kind of half desert and half, you know, mountains and forests. And the plane turned to the desert. So I see it landing in the middle of nowhere. And I asked my parents, like, where are the mountains? Like, well, you saw them from the plane. The reason I’m saying this is because I basically moved to a very boring place. There’s this area in Washington state called the Tri-Cities, and it has some pretty interesting history. The reason it exists is because it was one of the sites for the Manhattan Project. This is where the plutonium was enriched, at this place called the Hanford Site, which is a sister site to Los Alamos. So it’s a town that was basically built for that in the 1940s and is in the middle of nowhere, kind of like Los Alamos is. And there’s not much to do. I remember seeing my first tumbleweeds there. You literally saw tumbleweeds rolling across the highway.

I found myself in a place where I didn’t really speak the language that well, English. I was in this very rural place that was different from where I grew up, didn’t have many friends, and had a lot of time on my hands. And the way I got interested in science at the time, it was physics. After getting video games out of my system, I got bored again. And I found these Feynman lectures that my parents had, the Feynman Lectures on Physics. And they were so interesting, because Feynman had this way of explaining incredibly complex things in a way that a person who’s, I mean, not that educated mathematically at the time, can really understand something fundamental about how the world works. And that is probably the thing that inspired me most. I just got really interested in this idea of understanding how things work at this sort of root level and working on problems that are sort of root node problems. There are all these examples I was reading about, like the invention of the transistor, which was invented by John Bardeen, a theoretical physicist, or how GPS works. It turns out you need to make relativistic calculations, which come from Einstein’s theory of relativity. And I wanted to work on things like that.

So that’s why I got into physics. I pursued it for a while, got educated in it, got a PhD in it. And I think maybe the critical bit of information that I did not have in my context then was that you don’t just want to work on root node problems, you want to work on the root node problems of your time. You want to work on the things that really can be unlocked now. And it’s no surprise, you know, when you’re being trained as a physicist, you’re doing these really interesting problems and learning very interesting things about how people thought about physics basically a hundred years ago. A hundred years ago, physics was the root node problem of the time. And that’s why I decided to not pursue it professionally. I kind of did a 180 and wanted to do something very practical. So I started a startup, but as I was working on that, I started noticing deep learning as a field taking off, and in particular AlphaGo. When AlphaGo came out, something just felt very profound about it. How do they get this system, a computer, to not just perform at a higher level than a human, but do so creatively? In AlphaGo there was this famous move called move 37.

The neural network made this move that just looked like a bad move. Lee Sedol was really perplexed by it. Everyone was perplexed by it. It just looked like a mistake. And it turned out that ten moves later, this was actually the optimal move to put AlphaGo in a winning position for the game. And so you could tell that this is not just a brute force thing. Obviously the system does a lot of search, but it’s able to find creative solutions that people hadn’t thought of before. And so that made me feel pretty viscerally that solving agency mattered. This was the first real large-scale superhuman agent. That seemed profound. That’s why I got into AI, and I got into AI to build agents from the first day. So there was a kind of non-linear path where, you know, I was an outsider. It was competitive back then, too. And OpenAI released these requests for research at the time. This was maybe 2018 or ’19. The requests for research were just things that they wanted other people to work on. I think by the time that I was looking at this list, it was actually already stale.

So I don’t think they really cared about those problems, but it gave me something concrete to work on. And I started making progress against one of these problems. I felt like I was making progress. I don’t know how much progress I actually made. I was peppering a few research scientists from OpenAI with questions, effectively cold emailing them, until maybe I got, you know, too annoying. But they responded, I’d say, rather graciously, and I built some relationships there. And one of them introduced me to Pieter Abbeel, who is a PI at Berkeley and one of the greatest researchers of our time, I think, in the field of reinforcement learning and robotics, but his lab kind of does everything. They have some of the most impactful generative model research as well; one of the key generative model papers came out of there. And honestly, I got lucky. He took a chance on me and brought me into his group. He really had no reason to. After I was on the other side and looking at applicants coming into the group, there was really no reason for him to take someone who was not vetted. So he kind of took a chance, and that, I think, was my foot into the field.

Getting into AI with Ioannis Antonoglou

Sonya Huang: You and your co-founder, Ioannis, have worked on, I think, some of the most incredible projects out of DeepMind and Google. Can you give the folks here a taste of some of the projects that you both worked on, like Gemini and AlphaGo? What were the key learnings from each, and how have they propelled your thinking forward to the present day?

Misha Laskin: Yeah. So Ioannis was basically the reason I got into AI. He was one of the key engineers on AlphaGo. He was there in Seoul when they played against Lee Sedol. And before AlphaGo, he worked on this paper called the Deep Q-Network, DQN. This was actually the first successful agent of the deep learning era. It was an agent that was able to play Atari video games, and it catalyzed this whole field of deep reinforcement learning back then, which was, hey, AI systems are autonomously learning to act in, I mean, mostly video game and robotics environments. But it was the first agent that was kind of a proof point that you can learn to act in an environment in a reliable way from just raw sensory inputs coming in. This was, I think, a big unlock that was completely unclear at the time, the same way that neural networks working on ImageNet was an unlock in 2012. And then, yeah, Ioannis worked on AlphaGo and the subsequent line of work there: AlphaGo, AlphaZero, and there’s a paper called MuZero. And I think that really showed how far you can take this idea. Like, it really scales. Relative to the models we have today, large language models, the AlphaGo model is actually really small, and it was so smart at this one thing.

I think the key lessons for me from AlphaGo were encapsulated in this famous essay that Rich Sutton, a reinforcement learning researcher, or, I guess, kind of a father of a lot of reinforcement learning research, put forth: this idea of the bitter lesson. In that essay, he basically says that if you’re building systems that are based on your internal heuristics, those things will likely get washed away by systems that just learn on their own, or rather, systems that leverage compute in a scalable way. And he argued that there are two ways to leverage compute. One is learning, so training. When we think of language models today, they’re leveraging compute mostly through learning, by training them on the internet. The other way is search, which is leveraging compute to unroll a bunch of plans and then pick the best one. And AlphaGo is actually both ideas in one. I still think it’s the most profound idea in AI, that combining learning and search together is the optimal way to leverage compute in a scalable sense.
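To make the two ways of spending compute concrete, here is a minimal sketch of the distinction, assuming a hypothetical `model` object with training and rollout methods; it illustrates the general idea only, not AlphaGo’s or Gemini’s actual internals.

```python
# Illustrative sketch of the two ways to leverage compute.
# `model` is a hypothetical object; nothing here is a real library API.

def learning_step(model, batch):
    # Leverage compute through learning: fit the model to data (training).
    loss = model.loss(batch)
    model.apply_gradients(loss)

def search_step(model, state, num_plans=64, horizon=10):
    # Leverage compute through search: unroll many candidate plans from the
    # current state and pick the one the model's value estimate likes best.
    plans = [model.rollout(state, horizon) for _ in range(num_plans)]
    return max(plans, key=model.value_estimate)

# AlphaGo combines both: a learned policy/value network (learning)
# guides Monte Carlo tree search (search) at decision time.
```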

And those two things together are what produced a superhuman agent at Go. The issue with AlphaGo was that it was only good at one thing, and I remember being in the field then, and the field of deep reinforcement learning did feel kind of stuck, because the goal everyone set out for themselves was to build general agents, superhuman general agents. And where the field landed was superhuman, very narrow agents, and there was no clear path to make them general, because they were so data inefficient. If it takes 6 billion steps to train on one task, then where are you going to get the data to train on all these others? And that was the big unlock of the language model era. One way to think about the internet, or all the data on the internet, is as a collection of many tasks. Wikipedia is a task of describing historical events. Stack Overflow is a task of Q&A on coding. You can think of the internet as a massive multitask dataset. And that is interesting, because the reason we get generality from language models is that they’re basically systems trained on tons of tasks.

Those tasks aren’t particularly, let’s say, directed, and there’s no notion of reliability or agency on the internet. So it’s no surprise that the language models that come out of that aren’t particularly good agents. They’re obviously incredible and they do incredible things. But one of the fundamental problems in agency is that you need to think over many steps, and you have some error rate at each step, and that error accumulates. It’s called error accumulation. It means that if you have some percent chance of being wrong at the first step, that will compound very quickly over a few steps, to the point where it’s basically impossible to be reliable on a task that is meaningful. The key thing that I think is missing is that we have language models, systems that leverage learning, but they’re not yet systems that leverage search or planning in a scalable way. And that, I think, is the missing piece. Okay, now we have general agents, but they’re not very competent. And so you want to move up the competence axis. And the only existence proof for that has been AlphaGo, and it’s been done through search.
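A quick back-of-the-envelope illustration of the error accumulation described here: even a high per-step success rate collapses over a long task. The numbers below are illustrative only.

```python
# Error accumulation: per-step reliability compounds over a multi-step task.
for per_step_success in (0.99, 0.95, 0.90):
    for steps in (10, 20, 50):
        task_success = per_step_success ** steps
        print(f"{per_step_success:.0%} per step over {steps:2d} steps "
              f"-> {task_success:.0%} chance of finishing the whole task")
# e.g. 95% per step over 20 steps -> roughly a 36% chance of completing the task.
```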

Reflection AI and agents

Stephanie Zhan: I really love that, and the encapsulation of how you just shared that. I think that sets the stage wonderfully for Reflection. Can you share a little bit more about the original inspiration, this problem space that you’re going after, and your long term vision for Reflection?

Misha Laskin: I mean, the original inspiration came very much from our work. Ioannis and I collaborated very closely on Gemini. Ioannis led the RLHF effort, and I led reward model training, which was a key part of what we were working on. What everyone is working on with these language models is that in post-training, you align them for chat, so you align them to be good interactive experiences for some end user. That comes through products like ChatGPT or Bard, which has now been renamed Gemini. These language models, the pre-trained ones, are very adaptive. With the right data mix, you can adapt them to be highly interactive chatbots. And I think our key insight from working on that was that there was nothing specific being done for chat. You’re just collecting data for chat. But if you collect the data for anything, for another capability, you’d be able to unlock that as well. Of course, it’s not so simple. A lot of things change. I mean, one key thing is that chat is subjective.

So the algorithms that you train are different from the algorithms that you train for something that has an objective outcome, like whether a task was done or not. There are all sorts of issues, but the main thing was that we think the architectures and models work. A lot of the things that I thought were bottlenecks have been washed away with compute and scale. Long context length is something I thought would need a research breakthrough, and now all the players are releasing models with extremely long context lengths relative to what we thought was possible even a year or two ago. The methods for training these things and aligning them in post-training are pretty stable. And it’s really a data problem, and a problem of how you enable planning and search on top of these objects. And we thought we’d move faster against this problem if we did it on our own. I think we just wanted to move very quickly against it.

Sonya Huang: So you’ve described agents as kind of the dream, both for you and Ioannis as researchers, but also for Reflection. Can we pause on the word agents for a little bit? Now that it’s become the term of 2024, everybody is calling themselves an agent, and the word is starting to lose its meaning a little bit. And I imagine that you have probably a more pure definition of what an agent is. Could you just explain that? How do you think about what an agent is? And when we look at some of the agents that everyone’s gotten really excited about recently, it seems like they’re still very early in terms of being reliable enough to be kind of colleague-level true agents. So where do you think we are on that curve? What is an agent? And how do we get to the promised land?

Misha Laskin: Yeah. It’s an interesting question. The term agents has been floating around within the research community for a while, since the start of AI, but I have primarily been thinking of agents in the context of the deep learning era, so starting with DQNs. And the definition is pretty simple: it’s an AI system that is able to reason on its own and take however many steps it needs to accomplish some goal that’s been specified to it. That’s kind of it. Now, the way that goal is specified has changed over time. In the deep reinforcement learning era, the goal was usually specified through a reward function. So for AlphaGo, it’s whether you won the game of Go or not. There’s no one who wrote “go win the game of Go” via text. So that’s how people usually thought about agents. They thought of agents as things that are optimizing a reward function. But there’s a whole area of research, even then, before language models, on goal-conditioned agents. These would be in robotics or video games, where you set a goal for the robot, which could be you give it an image of, you know, an apple having been moved somewhere, and you ask it to reproduce that image, and it has to act on the world and pick up an apple and move it somewhere in order to accomplish the goal. So it’s a short definition. It’s AI systems that have to act in an environment to accomplish some goal. That’s an agent.
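That definition maps onto the standard agent-environment loop from reinforcement learning. Here is a minimal sketch; the `agent` and `env` interfaces are illustrative (loosely Gym-style), not any particular framework’s API.

```python
# Minimal agent-environment loop matching the definition above: an AI system that
# acts in an environment, over however many steps it needs, to reach a specified goal.

def run_episode(agent, env, goal, max_steps=1000):
    observation = env.reset(goal)              # goal: a reward spec, a text instruction,
    total_reward = 0.0                         # or a goal image, depending on the setting
    for _ in range(max_steps):
        action = agent.act(observation, goal)  # the agent reasons and plans on its own
        observation, reward, done = env.step(action)
        total_reward += reward
        if done:                               # stop once the goal is accomplished
            break
    return total_reward
```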

Sonya Huang: And then I guess as a follow up, take, for example, coding agents as one potential domain of agents where there’s been a lot of activity recently. You could say the goal is, you know, create a calculator app for me, and the agent has to go and accomplish the task. When I look at what SWE-agent and what Devin have done, is that, in your mind, agentic reasoning? And does kind of scaling that up get us to the promised land, or do you think there are different approaches that you need, more on the RL side or whatever other techniques you might need to use to get us to the promised land? Because I think those agents are still in the 13, 14% kind of task completion rate range. I’m curious how we get them to 99%.

Misha Laskin: By the definition of agency, these are definitely agents. They’re just on a spectrum of capability, maybe not at a high level of reliability yet. I think the way most people think about agents today in the context of language models is prompted agents. So you take a model, you prompt it, or you set up some flow of several prompts, to get it to accomplish a task. That allows anyone to take a language model and take it from zero to something that’s kind of working somewhere. So I think that’s quite interesting, but I think it can only go so far. This is a kind of example of what I think the bitter lesson would apply to, because prompting things and really directing them to go in these specific ways, those are exactly the kinds of heuristics that we’re baking into these models to try to achieve higher intelligence. Every major advance in agency since the deep learning era has shown that with learning and search, a lot of that gets washed away. I think that the purpose of a prompt is to specify the goal. So you’ll always need a prompt. You always need to tell an agent what to do. But once you start deviating from that, and the purpose of the prompt is actually to put the agent on rails, where you’re doing the thinking for it, right? You’re telling it, okay, now just go here and do this thing. That, I think, is going to disappear. I think that’s a local thing that’s happening today. Future systems, I don’t think, will have that.

Sonya Huang: So the key is that the thinking and the planning needs to happen in the AI system, not in the prompt layer in order to kind of not hit a wall.

Misha Laskin: I think you want to offload as much as possible to the AI system itself. Again, these language models were never trained for agency. They’re trained for chat interaction and predicting things on the internet. So it’s almost a miracle that you can prompt your way to getting something that kind of works. But what’s interesting is that once you’re able to prompt your way to something that kind of works, that’s actually the best place to start for a reinforcement learning algorithm. All a reinforcement learning algorithm does is reinforce good behavior and downplay bad behavior. If you have an agent that is at zero, that is just doing nothing, there’s no good behavior to upweight, and so the algorithm doesn’t work. This is known as the sparse reward problem. If you’re never hitting your reward, if you’re never accomplishing your task, then there’s nothing to learn from. But if you’ve prompted your way to an agent that is kind of working, like SWE-agent or something like this, it’s getting 13% or something like this, then you have something that is minimally capable where you can reinforce really good behavior. The challenge becomes a data challenge: where do you get the set of prompts to train on? Where do you get the environment to run these things through? I guess SWE-agent does come with an environment, but for many problems you need to think about that. And then perhaps the biggest challenge is how you verify that a thing has been done correctly or not, in a scalable way. If you can solve where the tasks come from, which usually is through products, that’s solvable; what environment you run them through, what algorithm you use, but it’s really what environment you run them through; and then, critically, how you verify whether a thing has been done correctly or not in a scalable way, I think that’s a recipe for agency.
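A minimal sketch of this point: reinforce only the trajectories that actually accomplished the task. If the prompted agent never succeeds, the filtered set is empty and there is nothing to learn from, which is the sparse reward problem in miniature. All names and the fine-tuning call here are hypothetical placeholders, not Reflection’s recipe.

```python
# Reward-filtered reinforcement: upweight behavior that worked, ignore the rest.
# If the starting agent has a 0% success rate, `successes` is empty and no
# learning can happen -- the sparse reward problem.

def improve_agent(agent, tasks, env, verifier):
    trajectories = [agent.attempt(task, env) for task in tasks]
    successes = [t for t in trajectories if verifier(t)]  # scalable verification is the hard part
    if not successes:
        return agent              # nothing to reinforce
    agent.finetune(successes)     # reinforce (imitate) the behavior that succeeded
    return agent
```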

The current state of AI agents

Stephanie Zhan: I think that gets to the crux of the problem space in AI agents today. Just to set the stage a little bit for the problem that Reflection is going after, what do you think is the current state of the market broadly in AI agents? Many assume that we are capable of more than we actually are with the models that exist today. So what do you think the problem is, and why do you think the current attempts at AI agents are failing us today?

Misha Laskin: There’s one way to categorize or classify what it means to be a general agent, and maybe I’ll use the term universal agent, since I’ll use the term generality to apply to breadth. A universal agent needs to be a very broad, general agent that can do many things and handle many inputs, but it also needs to have depth in the kind of task complexity it can achieve. AlphaGo is probably the deepest agent that has ever been built, but it can do one task, so it’s not that useful. It can play Go, but not tic-tac-toe. The current language model systems like Gemini, Claude, ChatGPT and the GPT series of models lean the other way. They’re very broad, and extremely impressive and capable broadly, but they’re not very capable depth-wise. I think that’s one of the things that’s been honestly miraculous. As I said, the field felt like we did not have an answer to generality, and then these objects came along. But now we’re at the opposite end of the spectrum, where we have, I think, more or less de-risked progress towards breadth as a field. That’s especially evident with the latest generations of models like GPT-4o and the latest family of Gemini models that are multimodal in the sense that they understand other modalities at the same base layer that they understand language.

You don’t need to translate one modality into language. So that’s what I would call breadth. But nowhere along this process were things trained for depth. The internet doesn’t have real data around how to think sequentially. The way people try to solve this problem is to work on datasets that might have that structure and hope it generalizes: math datasets, coding datasets, what people refer to as reasoning, which usually means reasoning along the lines of can you solve a mathematical problem. But that’s still not really addressing the problem head on. I think we need methods, recipes, let’s say, that are general, in that you can take any task category, have a bunch of prompts for it as your training data, and make a language model iteratively more capable on those things. I think someone needs to solve the depth problem. The field as a whole, or the large labs, have really been working on the breadth. That’s amazing, and there’s a big market for that and a lot of very useful things that get unlocked. But someone needs to solve the depth problem too.

AlphaGo, AlphaZero and Gemini

Stephanie Zhan: I think that takes us really nicely into the unique insight that you and Ioannis have from working on AlphaGo, AlphaZero and Gemini, and the importance of post-training and data. Can you share a little bit more about how those experiences have shaped the unique perspective that gets us to the unlock with agentic capabilities?

Misha Laskin: One of the things I found very surprising about language models is how close they often are: even if they’re not working on something that you want them to, they’re actually quite close. They feel like a nudge away. I feel like they need to be grounded in the thing a bit better. And I think that was the insight that led them to be good at chat. You could play with them and they’re a bit unreliable, and they kind of go off the rails sometimes, but they’re almost good chat companions. And so there’s a recipe for how you take a pre-trained language model and make it a reliable chatbot. By reliability there, the way you measure that is with human preferences. Do people interacting with this chatbot prefer it over other chatbots, or over previous versions of itself? If the current version is much more preferred than the one from a few iterations ago, then you know you made progress, and that progress is made by collecting data for it. So you collect the kind of queries that users input into a chat box, the outputs that the models provide, and effectively a ranking between those outputs, so that you push the model to index on the more preferred outputs. So when we say ranking, where does that ranking come from? Well, it comes from humans. There are either human labelers, or it’s something that’s embedded into the product. You sometimes might see thumbs up or thumbs down in ChatGPT; it’s harvesting your thumbs to know what your preference is.

And that data is used to align the model with the user preferences. That’s a very general algorithm. It’s a reinforcement learning algorithm, and that’s why it’s called reinforcement learning from human feedback, or RLHF. If human feedback is just expressing preferences, there’s no reason why this same approach would not be possible for enabling more reliable agency. There’s a whole sequence of other problems you need to solve. I think the reason this is so hard is that as soon as you go into agent territory, you have more than just the language outputs. You have the tools that agents interact with, the tools being, suppose you want it to send an email or work in an IDE; anything that an agent does, it does in an environment, and that requires tools and it requires the environment. And everyone who’s deploying agents is deploying agents in different environments. So there’s a challenge of how you integrate with environments and how you onboard agency onto them. I think that’s why it’s a bit of a schlep if you get into this line of work, and you have to be careful about the environments and the way you structure it, because you don’t want to overfit to some particular environment. But conceptually it looks very similar to aligning a model for chat. There are just some more integration challenges that need to be solved along the way.
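A minimal sketch of the preference-ranking step described above: a reward model is trained so the human-preferred output scores higher than the less-preferred one, using a standard Bradley-Terry-style pairwise loss. This illustrates the general recipe only; the names and shapes are assumptions, not Gemini’s actual implementation.

```python
import torch.nn.functional as F

def pairwise_preference_loss(reward_model, prompt, chosen, rejected):
    # reward_model scores a (prompt, response) pair with a single scalar.
    r_chosen = reward_model(prompt, chosen)      # score of the human-preferred output
    r_rejected = reward_model(prompt, rejected)  # score of the less-preferred output
    # Maximize the probability that the preferred output ranks higher.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```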

LLMs don’t have a ground truth reward

Sonya Huang: Since you view AlphaGo as kind of the pinnacle of building an agent that was truly capable, I imagine you’re trying to usher in an AlphaGo moment with LLMs. What do you think are the differences? To me, with gameplay, you have a very clear reward function and you have the ability to self-play. With LLMs, it’s doing kind of the reinforcement learning from human feedback. Do you think that’s enough to get us to an AlphaGo moment in LLMs? Or, I guess, how should I think about the differences here?

Misha Laskin: I think what you said around not having a ground truth reward is a key, and maybe the key, thing. What we learned from the previous era of reinforcement learning research is that if you have a ground truth reward, you’re kind of guaranteed success. There have been so many very impressive projects that showed this at a really unprecedented scale. Aside from AlphaGo, there was OpenAI Five for Dota, or AlphaStar. Let’s say AlphaStar and OpenAI Five are a bit more niche in the sense that you kind of have to play those games to understand. But as a former StarCraft player, I was, and still am, completely blown away by AlphaStar. The group strategies that the AI discovered, it just looked like a smarter-than-us alien came upon the earth, decided to play this game and completely outcompeted us. So that’s due to a number of things, but a ground truth reward is extremely important for tightening that behavior. Now, both for human preferences and for agency with these very general objects, we don’t have ground truth rewards for whether something is accomplished or not, even for a coding task. What’s the ground truth of whether this was done the right way? It could pass some unit tests, but it can still be wrong.

So it’s a really hard problem, and I think it’s the fundamental problem for agency. There are others as well, but this is the big one. The way you get around this problem for chat is again through RLHF: you train reward models. A reward model is a language model that predicts whether something was done correctly or not. So, first, it works well. The challenge is that in the absence of a ground truth, when you have this kind of noisy thing that can be wrong, your policy, the agent, quickly gets smart enough that it finds holes in the reward model and exploits them. To give a concrete example in chat: suppose you notice that your chatbot was outputting some, let’s say, harmful content, or there are some topics that you don’t want it to talk about because they might be sensitive. So you put in some data with examples of the chatbot saying, “I’m sorry, as a language model, I cannot answer this.” What can happen is that you now train a reward model against this, and suppose in your data mix you only put in data points that showed instances of this happening, but not instances of the chatbot taking something sensitive and actually answering it.

What that means is that the reward model could think that it’s actually a good thing when you just don’t answer the user’s query. Ever. Because it’s only seen positive examples of that. And when you train against that, the policy, the language model, will at some point get smart enough to discover that this reward model gives high reward whenever it just doesn’t answer, whenever it punts the question. And it can collapse into a language model that just never answers your questions. This is why it’s very finicky and very difficult. For this reason, I’m sure a lot of users who have interacted with ChatGPT or Gemini or these kinds of models have found, through interacting with them, that they sometimes degrade and all of a sudden don’t answer questions as often as they used to, get slightly worse at something, or are politically biased in some way. And I think a lot of that is artifacts of the data, but those artifacts in the data get amplified by bad reward functions. So that is the hardest problem, I think.
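One standard mitigation for this kind of reward model exploitation, widely used in RLHF though not attributed here to any particular lab, is to penalize the policy for drifting too far from a reference model, so it cannot wander into regions where the learned reward is unreliable. A hedged sketch:

```python
# KL-regularized reward: maximize the learned reward while staying close to a
# reference model (e.g. the pre-trained or supervised fine-tuned model).
# Illustrative only; not a description of any particular lab's training objective.

def kl_regularized_reward(reward_model_score, logprob_policy, logprob_reference, beta=0.1):
    kl_estimate = logprob_policy - logprob_reference    # per-sample KL estimate
    return reward_model_score - beta * kl_estimate      # trade off reward vs. drift
```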

The importance of post-training

Sonya Huang: If I view the rough large-model training pipeline, or large AI system training pipeline, as pre-training and post-training, I kind of think pre-training seems largely solved, like the techniques are solved and we’re just in the race-to-scale moment on the pre-training side. Post-training still feels a little bit in the research phase of the market, where people are still figuring out what techniques will work in a general way. I’m curious if you agree with that. And in an ideal state, what is pre-training responsible for doing? How should we as laymen think about it? And what is post-training responsible for accomplishing, and how should we think about that from the perspective of a five year old?

Misha Laskin: Yeah, I would generally agree with that statement. With pre-training, there are a lot of details that you need to get right, and it’s by no means easy, so it’s a very hard endeavor. But it’s a better understood endeavor at this point. One way that I think about pre-training is through the lens of something like AlphaGo, which is quite simple and clear because, rather than thinking about this massive internet thing, you just think about a very clean setting, which is this game. You can think about AlphaGo as two phases. It has an imitation learning phase, where a neural network imitates a bunch of expert and strong amateur Go players, and then it has this reinforcement learning phase. And you can think about pre-training as the imitation learning phase of AlphaGo. You’re just acquiring the basic skill of playing the game. Maybe your neural network then is not the best in the world, but it’s pretty good. It goes from zero to pretty good. And pre-training for a language model is getting from zero to pretty good on everything, which is why it’s so powerful. Post-training, I think about as hardening good behavior.

What that means is, with AlphaGo, you start off with imitation learning. You start off at a place where you have a neural network that can do something; it can play the game pretty well. Then you apply this other recipe to it, which is reinforcement learning, where the network starts generating its own plans, acting through the game, getting feedback, and basically good actions get reinforced. That, I would say, is post-training. And from a chat perspective, you’re hardening the good behavior of the model along the chat axis. So it’s actually quite interesting that the high level recipe for training AlphaGo and for training Gemini is the same. You have an imitation learning phase and then you have a reinforcement learning phase. The reinforcement learning phase in AlphaGo is just much more sophisticated than what we have now, and the reason comes back to reward models. If you have a reward model that is fairly noisy and exploitable, then there’s only so much you can do before the policy gets smart and finds a way to trick it.

And so even if you threw the fanciest RL algorithm at it, like Monte Carlo tree search in AlphaGo, it may not be that effective, because it collapses into this kind of degenerate state where the policy hacks the reward model before it can even do any interesting search. Suppose you’re playing chess and you’re thinking about what to do multiple moves ahead, but your judgment is really bad at every move. Then there’s no point in planning ten moves ahead. And I think that’s where we are with RLHF today. There’s this wonderful paper that I think is very underrated called Scaling Laws for Reward Model Overoptimization. It’s a paper from OpenAI studying this phenomenon. What’s interesting about it, well, a number of things, but it showed that this phenomenon happens at all scales. They tried a couple of different RLHF algorithms, and it happens at all scales, for all algorithms that were tried in that paper. And I think it’s an interesting paper because it captures the fundamental problem of post-training.
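A schematic of the shared two-phase recipe described above. The function and method names are hypothetical placeholders for illustration, not the actual AlphaGo or Gemini pipelines.

```python
# Phase 1: imitation learning (pre-training) -- go from zero to "pretty good"
#          by predicting what experts (Go games) or the internet did.
# Phase 2: reinforcement learning (post-training) -- the model generates its own
#          behavior, gets feedback, and good actions are reinforced.

def train(demonstrations, tasks, reward_fn):
    model = imitation_learning(demonstrations)      # hypothetical helper
    for task in tasks:
        trajectory = model.generate(task)           # the model acts on its own
        # reward_fn is ground truth for Go; for chat it is a learned
        # (and therefore exploitable) reward model.
        model.reinforce(trajectory, reward_fn(task, trajectory))
    return model
```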

Sonya Huang: Just to pull on the thread a little bit. If you follow the results from AlphaZero, though, then we may not need pre-training at all. Is that a fair conclusion of what to make of this?

Misha Laskin: I think, at least in my mental model, the AlphaGo part, the imitation learning, is necessary, more from a practicality standpoint. When DeepMind went from AlphaGo to AlphaStar, there was no AlphaZero of AlphaStar. There was no AlphaStarZero that was released after that or anything like this. And a big part of AlphaStar was imitation learning across a lot of games. I think Go was this kind of special place where you not only have a zero-sum game, but you can get to the end of that game fairly quickly. And so, again, you can get that feedback about whether what you did was right or not.

Sonya Huang: Got it. Okay. So it’s just way too unconstrained of a problem to throw it out generally.

Misha Laskin: Yeah, I think in practice, AlphaZero would work generally for everything if we had ground truth reward functions for everything. But because we don’t, you need to do the imitation learning piece, almost as this practical thing: we need to get into the game somehow.

Task categories for agents

Stephanie Zhan: You described earlier the importance, from a technical perspective, of having an agent in its environment, and also, from a product distribution perspective, of getting the product into users’ hands. It’s important to think about what the right task categories are for users to first interact with the most powerful agents. What are some of the task categories that are on your mind, and what do you imagine are some of the possibilities for users’ daily workflows?

Misha Laskin: If you want to make progress along the depth axis, you could go for AlphaGo first, which is a really hard thing, or you could expand concentrically in the sort of complexity of the tasks you’re able to handle. We are focused on enabling depth, but in this concentric way, and we care a lot about having a general recipe that does not inherit heuristics that are special to some task. So from a research perspective, we’re building general recipes for this. Now, you have to ground those recipes in something to show progress, and at least for us it’s important to show diversity of environments. So we’re thinking about a number of different types of agents: web agents, coding agents, OS or computer agents. The important thing for us is to show that you can have a general recipe for enabling agency.

Attracting talent

Stephanie Zhan: Switching gears a little bit, you’ve attracted a stellar team already. Who else are you looking to recruit onto your team?

Misha Laskin: Yeah, we’ve been fortunate to be able to draw some talent from the top AI labs in the industry. And I think a lot of that has to do with the work that both Ioannis and I did, but definitely a lot of credit goes to Ioannis and his reputation. I was watching the Michael Jordan documentary, and one of the reasons Michael Jordan was so effective is that he was such an incredible individual contributor to the game, maybe the best, that he really inspired people on his team to get to his level, even if they couldn’t get there. And Ioannis has this effect on people. I worked very closely with him on Gemini, and he had that effect on me. I don’t know if I ever got to Ioannis’ level, but I aspired to, and I definitely became a much better engineer and researcher through the process. And I think that’s a lot of the draw: you get to learn a lot from him. We’re primarily continuing to look for… So we’re not hiring quickly. We’re hiring, I think, more methodically.

We’re looking for other researchers and engineers to join us on this mission. I’d say a commonality between everyone who’s joined is that we’re all very hungry. Maybe that’s how I’d put it. Ioannis and I could have stayed and tried pushing agents at DeepMind, and as I said, the reason we decided to do it our own way is because we think we can move much faster against this goal. And some of this urgency is driven by a real belief that we are three or so years away from something that resembles a digital AGI. That’s what I’d been referring to as a universal agent: something that has both this breadth and depth of knowledge. And that means we’re actually on a very accelerated timeline. You’re a few months in and you’re already kind of 5% of the way through that timeline. And maybe some of this is also driven by how quickly AlphaGo went from experts in the field doubting it was possible, thinking that expert human-level Go play was decades away, to how effectively they were able to solve that problem within months.

I think we’re seeing a similar kind of acceleration happening with language models. One viewpoint you can have is that we’ve saturated a lot of what we can work on, that we’re sort of at the tail end of an S-curve. We don’t view it that way. We think we’re still on an exponential. Part of the reason is that these things are so bulky and slow to train that there’s no way that collectively, as a field of researchers and engineers, we’ve optimized it yet. If it takes a few months and a few billion dollars to run the biggest model, then how many experiments can you really run? So yeah, we see things going at an accelerated pace, and we think solving the depth and reliability problem is something that is not getting the kind of attention it needs. There are groups that are pursuing this, but as, I would call it, more of a side quest within these big companies. I think you need a player that is focused entirely on it to solve this problem.

Stephanie Zhan: I love the framing of main quest versus side quest, and I love the hunger and the zero complacency and impatience, in a healthy way, that you and the rest of the team have. And the only other thing I’d highlight is that the revered reputation you described for Ioannis, inspiring and motivating other people, is true for both you and Ioannis, from everyone we know at DeepMind.

How far away are capable agents?

Sonya Huang: So, three years until I have an agent that will write my memos for me? Hopefully, like, three years.

Misha Laskin: Yeah, I think the memos might be coming, you know, sooner.

Sonya Huang: Because that was one of my burning questions. Is this like decades away? Is this months away? It sounds like you’re closer to months or small number of years away.

Misha Laskin: I think a small number of years. It’s honestly kind of alarming, the speed at which the field is moving. And part of depth and reliability, it’s also, I mean, reliability is safety. So you want these systems to be safe. I think there’s a lot of very interesting research here. There’s a recent paper from Anthropic on mechanistic interpretability, and that whole line of work is really interesting, and I think it’s starting to get to the point where there’s utility in it as well, in terms of finding neurons in the model, lying neurons, that you can suppress. But to me, safety is reliability. If the thing is running around your computer breaking all sorts of things, that’s an unsafe system. Maybe it’s a utilitarian safety: you just want these things to work and do what you intended them to do, what you asked them to do.

Sonya Huang: So a few years to find another hobby other than my memo writing, then.

Misha Laskin: Yeah. Well, or maybe you’ll just have an army of AI interns that you know, will do all the research work for you.

Sonya Huang: Can’t wait.

Stephanie Zhan: Wrapping up our topic around Reflection. If everything goes right, what is your dream for Reflection?

Misha Laskin: I think there are two angles to this question. One is, we’re working on this because this is the kind of scientific root node problem of our time. We’re scientists. That’s why we’re so interested in it and committed to it. There’s a world where you get to be part of one of the most exciting journeys in science ever made, and you’ve accomplished your goal of building a universal agent. You have highly safe, reliable digital agents running around on your computer, basically doing the tedious work that you don’t necessarily want to do. Rather than people going and spending less time working, I don’t think the human need to be productive and to contribute is going to change. I just think the capacity of each human’s ability to produce and impact the world is going to dramatically increase. In my line of work as a researcher, there are so many things that I spend time on that a smarter AI could help me out with to make faster progress towards our own goal. I mean, this is kind of circular, but if we had something close to a digital AGI, we’d get much faster to solving the problem of digital AGI. That’s one angle.

The other angle is from the user perspective. You can think about the computer as the first digital tool that we’ve been introduced to as people, in the same way that there were hammers and chisels and sickles that people used. And I think we’re moving towards the layer beyond that, where instead of you having to learn how to use all these tools with great precision and spend all your time on this, which is kind of time taken away from achieving whatever personal goals people have, you have these incredibly helpful AI agents that can help you bring any goal that you have to fruition. And I think it’s very exciting, because the ambition of our individual goals is already increasing in this local sense, like a software engineer can get a lot more done today with these tools. But this is just the beginning. I think we’ll be able to set dramatically more ambitious goals for ourselves, for the sorts of things we want to achieve, simply because we can offload a lot of the work that’s needed to get there to these systems. So these are some things that I’m really excited about.

Lightning round

Sonya Huang: We’ll close it out with a few questions that we like to ask everybody about the state of AI. The first question: what are you most excited about in AI, in your field or more broadly, in the next one, five and ten years?

Misha Laskin: I think there are a number of things. The local one that comes to mind, because the paper is fresh, is this work on mechanistic interpretability. These models are largely black boxes, and it’s unclear how to study them. What’s the neuroscience of language models, if you think about them as brains? This seems like a really interesting line of work that is now starting to show signs of working beyond toy settings. So maybe the sort of neuroscience of language models is, I think, a really interesting field in AI to get into. And more generally, if I was in academia, I’d probably be looking a lot at the science of AI. The neuroscience of AI is one thing, but there are all sorts of things one can investigate in terms of what really determines the scaling laws that these models have, both from a theoretical perspective and from an empirical perspective of how you change data mixes. Maybe taking a step back, we’re basically in the equivalent of what the late 1800s looked like for physics. Electricity was being discovered. No one knew why it worked or how. There were a lot of empirical results, but there was no theory behind them, which just meant that they were not very well understood. And then this very rich set of theoretical models was developed that made it simple to understand these phenomena, and that gave rise to basically the next wave of empirical breakthroughs. And so I think the science of AI is in that state right now, and I’m very excited to see where that goes.

Stephanie Zhan: So interesting.

Stephanie Zhan: Who do you admire most in the world of AI?

Misha Laskin: I think most people, when getting this question, might put someone forward; maybe I’ll take that back and say I want to emphasize people that I admire based on having worked with them and seen how they operate, because over my last number of years in the field, there are a handful of people like this who’ve inspired me. Pieter Abbeel is certainly one of them. I’ve never seen anyone operate as efficiently as Pieter, since meeting him and to date. You think a lot about research as a creative pursuit, oftentimes, and I think what I learned from Pieter is a sort of operational competence and efficiency around it. He’s very creative as well, and his lab does a lot of creative work, but these things are very hard and they need to be pushed hard and with great focus. And he ran his lab like it was the tightest ship that I’ve ever been on, and really helped focus all the projects. So, yeah, I look up to him a lot in terms of the work that he’s done, which is remarkably cross-field. He’s done incredible breakthrough work in reinforcement learning and unsupervised learning and generative modeling.

And a lot of it has been, I think, from recognizing and enabling talent. The group was a bunch of independent thinkers, students, PhDs, people pursuing what was interesting to them. But the way I saw Pieter is as sort of a great amplifier. He helped people amplify and focus on the thing that really mattered within their pursuit. A few other people come to mind. My manager at DeepMind, Vlad Mnih, is, I think, also a very incredible, very creative scientist. He was first author of the DQN paper, and then there were actually two papers at the time, the A2C and A3C papers. These are basically the two algorithms that defined deep reinforcement learning, and he really pioneered both. I think his strength is that he was extremely kind and people oriented, very humble despite his accomplishments. Ioannis as well. Ioannis has the Michael Jordan effect. He just wanted to be the best he could be when you worked with him, and the RLHF team was quite small, and people pushed really hard, largely, I think, inspired by him. These are some people I really look up to.

Stephanie Zhan: Thank you for sharing that. It’s so interesting to hear you say that about everyone. And one comment on Pieter Abbeel: I tell him all the time that he’s also just created a mafia of founders in the last couple of years, and it’s probably because he’s taught them how to do many things. There’s a self-selection that’s naturally happening with the creative thinkers and the independent thinkers who come into his lab, but he’s also taught them a lot about how to run a tight ship and how to focus incredibly well. So I’m sure that doesn’t come without intention on his part.

Sonya Huang: Very last question. Any advice you have for founders building in AI? You’re just starting your journey right now, and I’m sure you’ve asked others for a lot of advice. What advice would you pass on to the next generation?

Misha Laskin: I think I’ll be in a better position to answer that question in a few years, in a way that is much more meaningful. But I’ll provide a piece of advice that I lived through with my previous startup, which had nothing to do with AI: just work on things that internally really matter to you, in a way that is almost independent from what’s happening around you, in a way that when things go bad, it’s still interesting to you. There’s just some fundamental drive around this problem that, independent of everything else that’s happening, is just really interesting to you. Maybe I say that about AI because it’s such an interesting, highly capable, cool technology, and so there’s this appeal of taking it and just kind of, oh, let’s just see what we can do. I think you inevitably find yourself in a hard place without having a very strong internal compass, independent of AI, about the core of what’s important to you and what you want to do. Having been in that position previously, that’s what I would have done differently and what I would advise people to do.

Stephanie Zhan: I really love that. The line that I like to think about is playing in your own stadium and not getting distracted by the glitz and glamor of someone else’s stadium. You need that internal drive and grit and obsession with the problem to get you through all the tough times.

Misha Laskin: Yeah, and I think there are things that come with it. Like, if you really care about some problem, you will care about the customers who you’re solving the problem for. Having customers that you don’t care about is a terrible place to be.

And so, I think, yeah, it has to come from that. It’s actually kind of hard to control who you care about and who you don’t care about. That’s a personal thing. So it’s really hard to force yourself to care about something out of necessity if it’s not aligned with something inside you already.

Stephanie Zhan: So no more shopping and retail for Misha.

Misha Laskin: Yeah. So, you know, I was building software for inventory prediction for retailers. And some people would really care about that problem. There’s a reason they would: they’ve seen that problem. Maybe if you’re a merchant at one of these retail companies, you’ve really felt it viscerally. And in my case, I hadn’t. We were just trying to build a revenue-generating business, almost independent of an internal compass.

Stephanie Zhan: Misha, thank you so much for joining us today. You’re working on the most ambitious problem of our time. I love your framing of the root node problem of our time, and I think that today, that is agents. And it’s very clear that both your and Ioannis’s experiences make you the very best team at what you do: Ioannis obviously from an RLHF perspective, and you from a reward model training perspective, plus the insights and experiences that you’ve both had working on AlphaGo, AlphaZero and Gemini. We’re so excited for the future of Reflection.

Misha Laskin: Yeah, thank you for having me.
