Noam Brown, Ilge Akkaya & Hunter Lightman of OpenAI’s o1 Research Team on Teaching LLMs to Reason Better by Thinking Longer
Training Data: Ep15
Combining LLMs with AlphaGo-style deep reinforcement learning has been a holy grail for many leading AI labs, and with o1 (aka Strawberry) we are seeing the most general merging of the two modes to date. o1 is admittedly better at math than essay writing, but it has already achieved SOTA on a number of math, coding and reasoning benchmarks. Deep RL legend and now OpenAI researcher Noam Brown and teammates Ilge Akkaya and Hunter Lightman discuss the ah-ha moments on the way to the release of o1, how it uses chains of thought and backtracking to think through problems, the discovery of strong test-time compute scaling laws and what to expect as the model gets better.
Summary
Noam Brown, Hunter Lightman and Ilge Akkaya are researchers at OpenAI who worked on Project Strawberry, which led to the development of o1, OpenAI’s first major foray into general inference-time compute. In this episode, they discuss the significance of o1’s reasoning capabilities and what it means for the future of AI.
- “Thinking” longer enables the model to tackle more complex problems. o1 represents a major advancement in machine intelligence, combining inference-time compute with LLMs to create general AI reasoning capabilities.
- o1 uses human-interpretable “chains of thought” to frame out problems and explore them. The model demonstrates emergent abilities like backtracking and self-correction when given more time to think—capabilities that were previously difficult to enable in language models. This allows it to approach problems in novel ways, sometimes surprising the OpenAI researchers.
- o1 shows particular strength in STEM fields, outperforming previous models on math and coding tasks. In some cases, it has solved mathematical proofs that AI had never cracked before, hinting at its potential as a research assistant for mathematicians and scientists.
- The inference-time scaling laws discovered with o1 suggest a new dimension for improving AI capabilities. The GPT models have made steady progress by simply increasing model size and training data. The combination of train-time and test-time compute indicates the ceiling for AI performance may be higher than previously thought.
- While impressive in many areas, o1 still has limitations. The researchers emphasize it’s not universally better than humans or previous models at all tasks. They are eager to see how developers and users discover new applications and push the boundaries of what’s possible.
- The team sees o1 as an important step toward artificial general intelligence (AGI). They acknowledge the path forward remains unclear, but are excited to continue iterating on this approach, enabling even more powerful reasoning capabilities in future versions.
Transcript
Chapters
- Conviction in o1 (1:33)
- How o1 works (4:24)
- What is reasoning? (5:04)
- Lessons from gameplay (7:02)
- Generation vs verification (9:14)
- What is surprising about o1 so far (10:31)
- The trough of disillusionment (11:37)
- Applying deep RL (14:03)
- o1’s AlphaGo moment? (14:45)
- A-ha moments (17:38)
- Why is o1 good at STEM? (21:10)
- What will it take to be good at humanities? (22:30)
- Capabilities vs usefulness (24:10)
- Defining AGI (25:29)
- The importance of reasoning (26:13)
- Chain of thought (28:39)
- Implication of inference-time scaling laws (30:41)
- Long-term thinking (31:53)
- Bottlenecks to scaling test-time compute (35:10)
- Biggest misunderstanding about o1? (38:46)
- Criticism about o1? (40:10)
- o1-mini (41:13)
- How should founders think about o1? (42:15)
- What’s underappreciated about o1? (43:13)
Contents
Noam Brown: One way to think about reasoning is there are some problems that benefit from being able to think about it for longer. You know, there’s this classic notion of System 1 versus System 2 thinking in humans. System 1 is the more automatic, instinctive response and System 2 is the slower, more process-driven response. And for some tasks, you don’t really benefit from more thinking time. So if I ask you, what’s the capital of Bhutan, you know, you can think about it for two years. It’s not gonna help you get it right with higher accuracy.
Pat Grady: What is the capital of Bhutan?
Noam Brown: I actually don’t know. [laughs] But, you know, there’s—there’s some problems where there’s clearly a benefit from being able to think for longer. So one classic example that I point to is a Sudoku puzzle. You could, in theory, just go through a lot of different possibilities for, like, what the Sudoku puzzle might be, what the solution might be, and it’s really easy to recognize when you have the correct solution. So in theory, if you just had, like, tons and tons of time to solve a puzzle, you would eventually figure it out.
Sonya Huang: We’re excited to have Noam, Hunter and Ilge with us today, who are three of the researchers on Project Strawberry, or o1, at OpenAI. o1 is OpenAI’s first major foray into general inference-time compute, and we’re excited to talk to the team about reasoning, chain of thought, inference-time scaling laws and more.
Conviction in o1
Sonya Huang: Ilge, Hunter, Noam, thank you so much for joining us and congratulations on releasing o1 into the wild. I want to start by asking, did you always have conviction this was gonna work?
Noam Brown: I think that we had conviction that something in this direction was promising, but the actual path to get here was never clear. And you look at o1, it’s not like this is an overnight thing. Actually, there’s a lot of years of research that goes into this, and a lot of that research didn’t actually pan out. But I think that there was conviction from OpenAI and a lot of the leadership that something in this direction had to work, and they were willing to keep investing in it despite the initial setbacks. And I think that eventually paid off.
Hunter Lightman: I’ll say that I did not have as much conviction as Noam from the very beginning. I’ve been staring at language models, trying to teach them to do math and other kinds of reasoning for a while. And I think there’s a lot to research that’s ebb and flow—sometimes things work, sometimes things don’t work. When we saw that the methods we were pursuing here started to work, I think it was a kind of aha moment for a lot of people, myself included, where I started to read some outputs from the models that were approaching the problem solving in a different way. And that was this moment, I think, for me, where my conviction really set in. I think that OpenAI in general takes a very empirical, data-driven approach to a lot of these things. And when the data starts to speak to you, when the data starts to make sense, when the trends start to line up and we see something that we want to pursue, we pursue it. And that, for me, was when I think the conviction really set in.
Sonya Huang: What about you, Ilge? You’ve been at OpenAI for a very long time.
Ilge Akkaya: Five and a half years, yeah.
Sonya Huang: Five and a half years. What did you think? Did you have conviction from the beginning that this approach was going to work?
Ilge Akkaya: No, I’ve been wrong several times since joining about the path to AGI. We originally—well, I originally thought that robotics was the way forward. That’s why I joined the robotics team first. Embodied AI, AGI, that’s where we thought things were gonna go. But yeah, I mean, things hit roadblocks. I would say, like, during my time here, ChatGPT—well, I guess that’s kind of obvious now—that was a paradigm shift. We were able to share very broadly with the world something that is a universal interface. And I’m glad that now we have a new path potentially forward to push this reasoning paradigm. But yeah, it was definitely not obvious to me for the longest time. Yeah.
How o1 works
Pat Grady: I realize there’s only so much that you’re able to say publicly, for very good reasons, about how it works, but what can you share about how it works, even in general terms?
Ilge Akkaya: So the o1 model series is trained with RL to be able to think—you could maybe also call it reasoning. And it is fundamentally different from what we’re used to with LLMs. And we’ve seen it really generalize to a lot of different reasoning domains, as we’ve also shared recently. So we’re very excited about this paradigm shift with this new model family.
What is reasoning?
Pat Grady: And for people who may not be as familiar with what’s state of the art in the world of language models today, what is reasoning? How would you define reasoning? And maybe a couple words on what makes it important.
Noam Brown: Good question. I mean, I think one way to think about reasoning is there are some problems that benefit from being able to think about it for longer. You know, there’s this classic notion of System 1 versus System 2 thinking in humans. System 1 is the more automatic, instinctive response and System 2 is the slower, more process-driven response. And for some tasks, you don’t really benefit from more thinking time. So if I ask you, like, what’s the capital of Bhutan, you know, you can think about it for two years. It’s not gonna help you get it right with higher accuracy.
Pat Grady: What is the capital of Bhutan?
Noam Brown: I actually don’t know. [laughs] But there’s some problems where there’s clearly a benefit from being able to think for longer. So one classic example that I point to is a Sudoku puzzle. You could, in theory, just go through a lot of different possibilities for what the Sudoku puzzle might be, what the solution might be, and it’s really easy to recognize when you have the correct solution. So in theory, if you just had tons and tons of time to solve a puzzle, you would eventually figure it out.
And so that’s what I consider to be—I think a lot of people in the AI community have different definitions of reasoning, and I’m not claiming that this is the canonical one. I think everybody has their own opinions, but I view it as the kinds of problems where there is a benefit from being able to consider more options and think for longer. You might call it, like, a generator-verifier gap, where it’s really hard to generate a correct solution, but it’s much easier to recognize when you have one. I think all problems exist on a spectrum from really easy to verify relative to generation, like a Sudoku puzzle, versus just as hard to verify as it is to generate a solution, like naming the capital of Bhutan.
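As a minimal sketch of the generator-verifier gap Noam describes (an editorial illustration, not code from the o1 team): verifying a completed Sudoku grid takes only a handful of set checks, while generating one by blind search is exponentially harder.

```python
# Minimal sketch of the generator-verifier gap (illustration only).
from itertools import product

def is_valid_sudoku(grid):
    """Verify a completed 9x9 grid: every row, column and 3x3 box must
    contain the digits 1-9 exactly once -- a cheap, mechanical check."""
    digits = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [set(col) for col in zip(*grid)]
    boxes = [
        {grid[r + dr][c + dc] for dr, dc in product(range(3), repeat=2)}
        for r in range(0, 9, 3)
        for c in range(0, 9, 3)
    ]
    return all(group == digits for group in rows + cols + boxes)

# Generation by brute force, by contrast, means searching up to 9**k
# candidate fillings for k empty cells: easy to *recognize* a solution,
# hard to *produce* one without a smarter search or more thinking time.
```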
Lessons from gameplay
Sonya Huang: I want to ask about AlphaGo and Noam, your background, having done a lot of great work in poker and other games, to what extent are the lessons from gameplay analogous to what you guys have done with o1, and how are they different?
Noam Brown: So I think one thing that’s really cool about o1 is that it does clearly benefit by being able to think for longer. And when you look back at many of the AI breakthroughs that have happened, I think AlphaGo is the classic example. One of the things that was really noticeable about the bot, though, I think underappreciated at the time, was that it thought for a very long time before acting. It would take 30 seconds to make a move, and if you try to have it act instantly, it actually wasn’t better than top humans, it was noticeably worse than them. And so it clearly benefited a lot by that extra thinking time.
Now the problem is that the extra thinking time that it had, it was running Monte Carlo tree search, which is a particular form of reasoning that worked well for Go, but for example, doesn’t work in a game like poker, which my early research was on. And so a lot of the methods that existed for being able to reason, for being able to think for longer, were still specific to the domains, even though the neural nets behind it, the System 1 part of the AI was very general. And I think one thing that’s really cool about o1 is that it is so general. The way that it’s thinking for longer is actually quite general and can be used for a lot of different domains. And we’re seeing that by giving it to users and seeing what they are able to do with it.
Hunter Lightman: One of the things that’s always been really compelling to me about language models, and this is nothing new, is just that because their interface is the text interface, they can be adapted to work on all different kinds of problems. And so what’s exciting, I think, about this moment for us is that we think we have a way to do something, to do reinforcement learning on this general interface, and then we’re excited to see what that can lead to.
Generation vs verification
Pat Grady: One question on that. You mentioned—I thought that was well put. I forget exactly how you phrased it, but the gap between generation and verification, and there’s sort of a spectrum in terms of how easy things are to verify. Does the method for reasoning remain consistent at various points in that spectrum, or are there different methods that apply to various points in that spectrum?
Hunter Lightman: One thing I’m excited about for this release has been to get o1 in the hands of so many new people to play with it, to see how it works, what kinds of problems it’s good at and what kinds of problems it’s bad at. I think this is, like, something really core to OpenAI’s strategy of iterative deployment. We put the technology that we build, the research that we develop out into the world so that we can see—we do it safely, and we do it so that we can see how the world interacts with it and what kinds of things we might not always understand fully ourselves. And so in thinking about what are the limits of our approaches here, I think it’s been really enlightening to see, like, Twitter show what it can and what it can’t do. I hope that is enlightening for the world, that it’s useful for everyone to figure out what these new tools are useful for. And then I also hope we’re able to take back that information and use it effectively to understand our processes, our research, our products better.
What is surprising about o1 so far
Pat Grady: Speaking of which, is there anything in particular that you all have seen in the Twitterverse that surprised you? You know, ways that people have figured out how to use o1 that you hadn’t anticipated.
Ilge Akkaya: There’s one thing I’m super excited about. I’ve seen a lot of MDs and researchers use the model as a brainstorming partner. And what they are talking about is that they’ve been in cancer research for so many years, and they’ve been just running these ideas by the model about what they can do, about these gene discovery, gene therapy types of applications, and they are able to get these really novel ways of research to pursue from the model. Clearly, the model cannot do the research itself, but it can just be a very nice collaborator with humans in this respect. So I’m super excited about seeing the model just advance this scientific path forward. That’s not what we’re doing in our team, but that is the thing, I guess, we want to see in the world—the domains that are outside ours, that get real benefit from this model.
The trough of disillusionment
Sonya Huang: Noam, I think you tweeted that deep RL is out of the trough of disillusionment. Can you say more about what you meant by that?
Noam Brown: I mean, I think there was definitely a period starting with, I think, Atari, the DeepMind Atari results, where deep RL was the hot thing. I mean, I was in a PhD program. I remember what it was like in, like, you know, 2015 to 2018, 2019, and deep RL was the hot thing. And in some ways, I think that was—I mean, a lot of research was done, but certainly some things were overlooked. And I think one of the things that was kind of overlooked was the power of just training on tons and tons of data using something like the GPT approach.
And in many ways, it’s kind of surprising, because if you look at AlphaGo, which was in many ways, like the crowning achievement of deep RL, yes, there was this RL step, but there was also—I mean, first of all, there was also this reasoning step, but even before that, there was this large process of learning from human data, and that’s really what got AlphaGo off the ground. And so then there was this increasing shift. There was, I guess, like a view that this was an impurity in some sense, that—so a lot of deep RL is really focused on learning without human data, just learning from scratch. Yeah, AlphaZero, which was a great—which was an amazing result, and actually ended up doing a lot better than AlphaGo. But I think partly because of this focus on learning from scratch, this GPT paradigm kind of flew under the radar for a while, except for OpenAI, which saw some initial results for it and again had the conviction to double down on that investment.
Yeah, so there was definitely this period where deep RL was the hot thing. And then I think, you know, when GPT-3 came out and some of these other, like, large language models, and there was so much success without deep RL, there was like, yeah, a period of disillusionment where a lot of people switched away from it or kind of lost faith in it. And what we’re seeing now with o1 is that actually there is a place for it, and it can be quite powerful when it’s combined with these other elements as well.
Applying deep RL
Sonya Huang: And I think a lot of the deep RL results were in kind of well-defined settings like gameplay. Is o1 one of the first times that you’ve seen deep RL used in a much more general, kind of unbounded setting? Is that the right way to think about it?
Noam Brown: Yeah, I think it’s a good point that a lot of the highlight deep RL results were really cool, but also very narrow in their applicability. I mean, I think there were a lot of quite useful deep RL results and also quite general RL results, but there wasn’t anything comparable to something like GPT-4 in its impact. So I think we will see that kind of level of impact from deep RL in this new paradigm going forward.
o1’s AlphaGo moment?
Sonya Huang: One more question in this general train of thought: I remember the AlphaGo results. You know, at some point in the Lee Sedol tournament, there was move 37. And, you know, that move surprised everybody. Have you seen something of that sort, where o1 tells you something and it’s surprising, and you think that it’s actually right and it’s better than anything a top human could think of? Have you had that moment yet with the model, or do you think it’s o2, o3?
Hunter Lightman: One of the ones that comes to mind is we spent a lot of the time preparing for the IOI competition that we put the model into, looking at its responses to programming competition problems. And there was one problem where o1 was really insistent on solving the problem this kind of weird way with some weird method—I don’t know exactly what the details were. And our colleagues who are much more into competitive programming were trying to figure out why it was doing it like this. I don’t think it was quite a, like, this is a stroke of genius moment. I think it was just like the model didn’t know the actual way to solve it, and so it just, like, banged its head until it found something else.
Pat Grady: Did it get there?
Hunter Lightman: Yeah, yeah. It solved the problem. It just used some method that would have been really easy if you saw something else. I wish I had the specific one, but I remember that being kind of interesting. There’s a lot of the things in the programming competition results. I think somewhere we have the IOI competition programs published, where you can start to see that the model doesn’t approach thinking quite like a human does, or doesn’t approach these problems quite like a human does. It has slightly different ways of solving them. For the actual IOI competition, there was one problem that humans did really poorly on that the model was able to get half credit on, and then another problem that humans did really well on that the model was, like, barely able to get off the ground on. Just showing that it kind of has a different way of approaching these things than maybe a human would.
Ilge Akkaya: I’ve seen the model solve some geometry problems, and the way of thinking was quite surprising to me. You’re asking the model, just like, here’s this sphere, and there are some points on the sphere, and asking for the probability of some event or something, and the model would go, “Let’s visualize this. Let’s put the points, and then if I think about it that way—” So I’m like, oh, you’re just using words and visualizing something that really helps you contextualize. Like, I would do that as a human, and seeing o1 do it, too, just really surprises me.
Pat Grady: Interesting.
Sonya Huang: That’s fascinating. So it’s stuff that’s actually understandable to a human, and would actually kind of expand the boundaries of how humans would think about problems, versus some undecipherable machine language. That’s really fascinating.
Hunter Lightman: Yeah. I definitely think one of the cool things about our o1 result is that these chains of thoughts the model produces are human interpretable, and so we can look at them and we can kind of poke around at how the model is thinking.
A-ha moments
Pat Grady: Were there a-ha moments along the way? Or were there moments where, Hunter, you mentioned that you were not as convinced at the outset that this is the direction that was going to work. Was there a moment when that changed where you said, “Oh, my gosh, this is actually going to work?”
Hunter Lightman: Yeah. So I’ve been in OpenAI about two and a half years, and most of that time I’ve been working on trying to get the models better at solving math problems. And we’ve done a bunch of work in that direction. We’ve built various different bespoke systems for that, and there was a moment on the o1 trajectory where we had just trained this model with this method, with a bunch of fixes and changes and whatnot, and it was scoring higher on the math evals than any of our other attempts, any of our bespoke systems.
And then we were reading the chains of thought, and you could see that they had a different character. In particular, you could see that when it got stuck, it would say, “Wait, this is wrong. Let me take a step back, let me figure out the right path forward.” And we called this backtracking. And I think for a long time I’d been waiting to see an instance of the models backtracking, and I felt like I wasn’t going to get to see an autoregressive language model backtrack, because they just kind of predict next token, predict next token, predict next token. And so when we saw this score on the math test and we saw the trajectory that had the backtracking, that was the moment for me where I was like, “Wow, this is—like something is coming together that I didn’t think was going to come together, and I need to update.” And I think that was when I grew a lot of my conviction.
Noam Brown: I think the story is the same for me. I think it was probably around the same time, actually. I definitely—I joined with this idea of, like, ChatGPT doesn’t really think before responding. Like, it’s very, very fast, and there was this powerful paradigm of like, in these games of AIs being able to think for longer and getting much better results.
But—and there’s a question about how do you bring that into language models that I was really interested in. And, you know, that’s like, it’s easy to say that, but then there’s a difference between just saying that, “Oh, there should be a way for it to think for longer,” than actually, like, delivering on that.
Pat Grady: Yeah.
Noam Brown: And so we—I tried a few things, and other people were trying a few different things. And in particular, yeah, one of the things we wanted to see was this ability to backtrack or to recognize when it made a mistake, or to try different approaches. And we had a lot of discussions around how do you enable that kind of behavior? And at some point we just felt like, okay, well, one of the things we should try, at least as a baseline, is just to have the AI think for longer. And we saw that once it’s able to think for longer, it develops these abilities almost emergently that were very powerful and contain things like backtracking and self correction—all these things that we were wondering how to enable in the models. And to see it come from such a clean, scalable approach, that was, for me, the big moment when I was like, okay, it’s very clear that we can push this further, and it’s so clear to see where things are going.
Hunter Lightman: Noam, I think, is understating how strong and effective his conviction in test-time compute was. I feel like all of our early one-on-ones when he joined were about test-time compute and its power. And at multiple points throughout the project, Noam would just say, “Why don’t we let the model think for longer?” And then we would. And it would get better, and he would just be—he would just look at us kind of funny, like we hadn’t done it until that point.

Noam Brown: [laughs]
Why is o1 good at STEM?
Sonya Huang: One thing we noticed in your evals is that o1 is noticeably good at STEM. It’s better at STEM than the previous models. Is there a rough intuition for why that is?
Noam Brown: I mentioned before that there’s some tasks that are like reasoning tasks that are easier to verify than they are to generate a solution for, and there’s some tasks that don’t really fall into that category. And I think STEM problems tend to fall into the, like, what we would consider hard reasoning problems. And so I think that’s a big factor for why we’re seeing a lift on STEM kind of subjects.
Sonya Huang: Makes sense. I think relatedly, we saw that in the research paper that you guys released, that o1 passes your research engineer interview with pretty high pass rates. What do you make of that? And does that mean at some point in the future, OpenAI will be hiring o1 instead of human engineers?
Hunter Lightman: I don’t think we’re quite at that level yet. I think that there’s more …
Sonya Huang: It’s hard to beat 100 percent, though. [laughs]
Hunter Lightman: Maybe the interviews need to be better.
Sonya Huang: Okay.
Hunter Lightman: I’m not sure. I think that o1 does feel—at least to me, and I think to other people on our team—like a better coding partner than the other models. I think it’s already authored a couple of PRs in our repo, and so in some ways, it is acting like a software engineer, because I think software engineering is another one of these STEM domains that benefits from longer reasoning. I don’t know. I think that the kinds of rollouts that we’re seeing from the model are thinking for a few minutes at a time. The kind of software engineering job that I do, when I go and write code, I think for more than a few minutes at a time. And so maybe as we start to scale these things further, as we start to follow this trendline and let o1 think for longer and longer, it’ll be able to do more and more of those tasks. And we’ll see.
Noam Brown: You’ll be able to tell that we’ve achieved AGI internally when we take down all the job listings and either the company’s doing really well or really poorly.
What will it take to be good at humanities?
Sonya Huang: [laughs] What do you think it’s going to take for o1 to get great at the humanities? Do you think being good at reasoning and logic and STEM kind of naturally will extend to being good at the humanities as you scale up inference-time, or how do you think that plays out?
Noam Brown: Like we said, we released the models and we were kind of curious to see what they were good at and what they weren’t as good at and what people end up using it for. And I think there’s clearly a gap between the raw intelligence of the model and how useful it is for various tasks. Like, in some ways it’s very useful, but I think that it could be a lot more useful in a lot more ways. And I think there’s still some iterating to do to be able to unlock that more general usefulness.
Capabilities vs usefulness
Pat Grady: Well, can I ask you on that? Do you view—I’m curious if there’s a philosophy at OpenAI, or maybe just a point of view that you guys have on how much of the gap between the capabilities of the model and whatever real world job needs to be done, how much of that gap do you want to make part of the model, and how much of that gap is sort of the job of the ecosystem that exists on top of your APIs, like their job to figure out? Do you have a thought process internally for kind of figuring out what are the jobs to be done that we want to be part of the model, versus where do we want our boundaries to be so that there’s an ecosystem that sort of exists around us?
Noam Brown: So I’d always heard that OpenAI was very focused on AGI, and I was honestly skeptical of that before I joined the company.
Pat Grady: [laughs]
Noam Brown: And basically, like, the first day that I started, there was an all hands of the company, and Sam got up in front of the whole company and basically, like, laid out the priorities going forward for, like, the short term and the long term. It became very clear that AGI was the actual priority. And so I think the clearest answer to that is, you know, AGI is the goal. There’s no single, like, application that is the priority other than getting us to AGI.
Defining AGI
Pat Grady: Do you have a definition for AGI? [laughs]
Noam Brown: Everybody has their own definition for AGI.
Pat Grady: Exactly. That’s why I’m curious.
Hunter Lightman: I don’t know if I have a concrete definition. I just think that it’s something about the proportion of economically valuable jobs that our models and our AI systems are able to do. I think it’s gonna ramp up a bunch over the course of the next however many years. I don’t know. It’s one of those, you’ll feel it when you feel it, and we’ll move the goalpost back and be like this isn’t that for however long, until one day we’re just working alongside these AI coworkers and they’re doing large parts of the jobs that we do now, and we’re doing different jobs, and the whole ecosystem of what it means to do work has changed.
The importance of reasoning
Pat Grady: One of your colleagues had a good articulation of the importance of reasoning on the path to AGI, which I think paraphrases as something like, “Any job to be done is going to have obstacles along the way, and the thing that gets you around those obstacles is your ability to reason through them.” And I thought that was a pretty nice connection between the importance of reasoning and the objective of AGI and sort of being able to accomplish economically useful tasks. Is that the best way to think about what reasoning is and why it matters? Or are there other frameworks that you guys tend to use?
Hunter Lightman: I think this is a TBD thing, just because I think at a lot of the stages of the development of these AI systems, of these models, we’ve seen different shortcomings, different failings of them. I think we’re learning a lot of these things as we develop the systems, as we evaluate them, as we try to understand their capabilities and what they’re capable of. Other things that come to mind that I don’t know how they relate to reasoning or not are like strategic planning, ideating or things like this, where to make an AI model that’s as good as an excellent product manager, you need to do a lot of brainstorming ideation on what users need, what all these things are. Is that reasoning, or is that a different kind of creativity that’s not quite reasoning and needs to be addressed differently? Then afterwards, when you think about operationalizing those plans into action, you have to strategize about how to move an organization towards getting things done. Is that reasoning? There’s parts of it that are probably reasoning, and then there’s maybe parts that are something else. And maybe eventually it’ll all look like reasoning to us, or maybe we’ll come up with a new word and there will be new steps we need to take to get there.
Ilge Akkaya: I don’t know how long we’ll be able to push this forward, but whenever I think about this general reasoning problem, it helps to think about the domain of math. We’ve spent a lot of time reading what the model is thinking when you ask it a math problem, and it’s clearly doing this thing where, like, it hits an obstacle, it has a problem, and then it backtracks: “Oh, wait. Maybe I should try this other thing.” So when you see that thinking process, you can imagine that it might generalize to things that are beyond math. That’s what gives me hope. I don’t know the answer, but hopefully.
Hunter Lightman: The thing that gives me pause is that o1 is already better than me at math, but it’s not as good as me at being a software engineer. And so there’s some mismatch here.
Pat Grady: There’s still a job to be done. Good.
Hunter Lightman: There’s still some work to do. If my whole job were doing AIME problems and doing high school competition math, I’d be out of work. There’s still some stuff for me for right now.
Chain of thought
Pat Grady: Since you mentioned sort of the chain of thought and being able to watch the reasoning behind the scenes, I have a question that might be one of those questions you guys can’t answer. But just for fun, was it—first off, I give you props for in the blog that you guys published with the release of o1, explaining why chain of thought is actually hidden, and literally saying, like, partly it’s for competitive reasons. I’m curious if that was a contentious decision or, like, how controversial that decision was, because I could see it going either way. And it’s a logical decision to hide it, but I could also imagine a world in which you decide to expose it. So I’m just curious if that was a contentious decision.
Noam Brown: I don’t think it was contentious. I mean, I think for the same reason that you don’t want to share the model weights, necessarily, for a frontier model. I think there’s a lot of risks to sharing the thinking process behind the model, and I think it’s a similar decision, actually.
Sonya Huang: Can you explain, from a layman’s perspective, maybe to a layman, what is a chain of thought and what’s an example of one?
Ilge Akkaya: So for instance, if you’re asked to solve an integral, most of us would need a piece of paper and a pencil. And we would kind of lay out the steps for getting from a complex expression, through steps of simplification, to a final answer. The answer could be one, but how do I get there? That is the chain of thought in the domain of math.
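As a small worked example of the kind of chain of thought Ilge describes (a hypothetical illustration, not an actual model output), the steps for a simple integral might be written out like this:

```latex
% Hypothetical chain of thought for a simple integral (illustration only).
\begin{align*}
\int \frac{2x}{x^{2}+1}\,dx
  &\quad\text{notice the numerator is the derivative of the denominator}\\
u = x^{2}+1,\qquad du = 2x\,dx
  &\quad\text{substitute}\\
\int \frac{du}{u} = \ln\lvert u\rvert + C
  &\quad\text{standard form}\\
\ln\!\left(x^{2}+1\right) + C
  &\quad\text{back-substitute; } x^{2}+1>0\text{, so the absolute value drops}
\end{align*}
```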
Sonya Huang: Let’s talk about that path forward. Inference-time scaling laws—to me, that was the most important chart from the research that you guys published. And it seems to me like a monumental result, similar to the scaling laws from pre-training. I’m sorry to be hype-y.
Pat Grady: [laughs]
Implication of inference-time scaling laws
Sonya Huang: Do you agree that the implications here I think are pretty profound, and what does it mean for the field as a whole?
Noam Brown: I think it’s pretty profound, and I think one of the things that I wondered when we were preparing to release o1 is whether people would recognize its significance. We—you know, we included it, but it’s kind of a subtle point, and I was actually really surprised and impressed that so many people recognized what this meant. There have been a lot of concerns that, like, AI might be hitting a wall or plateauing because pre-training is so expensive and becoming so expensive, and there’s all these questions around, like, is there enough data to train on. And I think one of the major takeaways about o1, especially the o1 preview, is not what the model is capable of today, but what it means for the future. The fact that we’re able to have this different dimension for scaling that is so far pretty untapped, I think is a big deal. And I think it means that the ceiling is a lot higher than a lot of people have appreciated.
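One generic way to picture a test-time compute axis is a hedged sketch of best-of-n (self-consistency) sampling: draw more candidate answers from a fixed model and keep the most common one. This is not how o1 scales its single chain of thought; it only illustrates the general compute-for-accuracy trade-off, and the `sample_answer` callable below is a hypothetical stand-in for one stochastic model call.

```python
# Editorial sketch: best-of-n / self-consistency sampling, one well-known
# way to trade extra inference compute for accuracy. o1 itself scales a
# single chain of thought instead; this only illustrates the general idea
# of a test-time compute knob.
from collections import Counter
from typing import Callable, List

def best_of_n(sample_answer: Callable[[], str], n: int) -> str:
    """Spend roughly n times the compute of a single call and return the
    majority-vote answer across the samples."""
    answers: List[str] = [sample_answer() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Hypothetical usage, where sample_answer() wraps one model call:
#   best_of_n(sample_answer, n=1)   # cheap, noisier
#   best_of_n(sample_answer, n=64)  # ~64x the compute, typically more accurate
```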
Long-term thinking
Sonya Huang: What happens when you let the model think for hours or months or years? What do you think happens?
Hunter Lightman: We haven’t had o1 for years, so we haven’t been able to let it think that long yet.
Pat Grady: Is there a job just running in the background right now that it’s just still thinking about? “Solve world peace.” “Okay, I’m thinking. I’m thinking.”
Hunter Lightman: Yeah, there’s an Asimov story like that called “The Last Question.”
Pat Grady: Oh really?
Hunter Lightman: Where they ask this big computer-sized AI something like, how do we reverse entropy? And it says, “I need to think longer for that.” And the story goes on: 10 years later they check and it’s still thinking. And then 100 years later, and then 1,000 years later, and then 10,000 years later. Yeah.
Ilge Akkaya: “There is as yet insufficient data for a meaningful answer,” something like that?
Hunter Lightman: Yeah. Like it’s still—yeah.
Sonya Huang: Do you have a guess empirically on what’ll happen? Or I guess right now I think the model has—I’ve seen some reports like 120 IQ, so, like, very, very smart. Is there a ceiling on that as you scale up inference-time compute? Do you think you get to infinite IQ?
Hunter Lightman: One of the important things is that it’s 120 IQ on some test someone gave it. This doesn’t mean that it’s got, like, 120 IQ-level reasoning in all the different domains that we care about. I think we even talk about how it is below 4o on some things like creative writing and whatnot. So there’s definitely—it’s confusing to think about how we extrapolate this model.
Noam Brown: I think it’s an important point that we talk about these benchmarks, and one of the benchmarks that we highlighted in our results was GPQA, which is these questions that are given to PhD students, and typically PhD students can answer. And the AI is outperforming a lot of PhDs on this benchmark right now. That doesn’t mean that it’s smarter than a PhD in every single way imaginable. There’s a lot of things that a PhD can do that—you know, there’s a lot of things that a human can do, period, that the AI can’t do. And so you always have to look at these evals with some understanding that it’s measuring a certain thing that is typically a proxy for human intelligence when you measure—when humans take that test, but means something different when the AI takes that test.
Hunter Lightman: Maybe a way of framing that as an answer to the question is that I hope that we can see that letting the model think longer on the kinds of things that it’s already showing it’s good at will continue to get it better. So one of my big Twitter moments was I saw a professor that I had in school, a math professor, was tweeting about how he was really impressed with o1 because he had given it a proof that had been solved before by humans but never by an AI model, and it just took it and ran with it and figured it out. And that, to me, feels like we’re at the cusp of something really interesting, where it’s close to being a useful tool for doing novel math research, where if it can do some small lemmas and some proofs for, like, real math research, that would be really a breakthrough.
And so I hope by letting it think longer we can get better at that particular task of being a really good math research assistant. It’s harder for me to extrapolate what it’s going to look like. Will it get better at the things that it’s not good at now? What would that path forward look like, and then what would the infinite IQ or whatever look like then when it thinks forever on problems that it’s not good at? But instead, I think you can kind of ground yourself in a “Here are the problems it’s good at. If we let it think longer at these, oh, it’s going to be useful for math research. Oh, it’s going to be really useful for software engineering. Oh, it’s going to be really—” and you can start to play that game and start to see how I hope the future will evolve.
Bottlenecks to scaling test-time compute
Pat Grady: What are the bottlenecks to scaling test-time compute? I mean, for pre-training, it’s pretty clear you need enormous amounts of compute, you need enormous amounts of data. This stuff requires enormous amounts of money. It’s pretty easy to imagine the bottlenecks on scaling pre-training. What constrains sort of the scaling of inference-time compute?
Noam Brown: When GPT-2 came out and GPT-3 came out, it was pretty clear that, like, okay, if you just throw more data and more GPUs at it, it’s gonna get a lot better. And it still took years to get from GPT-2 to GPT-3 to GPT-4. And there’s just a lot that goes into taking an idea that sounds very simple, and then actually scaling it up to a very large scale. And I think that there’s a similar challenge here where okay, it’s a simple idea, but there’s a lot of work that has to go into actually scaling it up. So I think that’s the challenge.
Hunter Lightman: Yeah, I think one thing that maybe doesn’t surprise people anymore, but that might have used to surprise more academically oriented researchers who join OpenAI, is how many of the problems we solve are engineering problems versus research problems. Building large-scale systems, training large-scale systems, running algorithms that have never been invented before on systems that are brand new at a scale no one’s ever thought of is really hard. And so there’s always a lot of just, like, hard engineering work to make these systems scale up.
Ilge Akkaya: Also, one needs to know what to test the model on. So we do have these standard evals as benchmarks, but perhaps there are ones that we are not yet testing the model on. So we’re definitely looking for those where we can just spend more compute at test time and get better results.
Sonya Huang: One of the things I’m having a hard time wrapping my head around is, you know, what happens when you give the model near-infinite compute? Because as a human, I am—you know, even if I’m Terence Tao, I am limited at some point by my brain, whereas you can just throw more and more compute at inference time. And so does that mean that, for example, all math theorems will eventually be solvable through this approach? Or, like, where is the limit, do you think?
Hunter Lightman: Infinite compute’s a lot of compute.
Pat Grady: [laughs]
Sonya Huang: Near infinite.
Hunter Lightman: It goes back to the Asimov story. If you’re waiting 10,000 years, maybe. But I said that just to ground it in the fact that we don’t quite know yet what the scaling of this looks like when it comes to solving really hard math theorems. It might be that you really do need to let it think for a thousand years to solve some of the unsolved core math problems. Yeah.
Noam Brown: Yeah. I mean, I think it is true that, like, if you let it think for long enough, then in theory, you could just go through—you formalize everything in Lean or something, and you go through every single possible Lean proof, and eventually stumble upon the theorem.
Hunter Lightman: Yeah, we have algorithms already that can solve any math problem, is maybe what you were about to get at, right?
Noam Brown: Yeah. Like, given infinite time, you can do a lot of things. But yeah, so, you know, it clearly runs into diminishing returns as you think for longer.
Biggest misunderstanding about o1?
Sonya Huang: Yeah, very fair. What do you think is the biggest misunderstanding about o1?
Noam Brown: I think a big one was, like, when the name Strawberry leaked, people assumed that it’s because of this popular question online of, like, the models can’t answer how many r’s are in strawberry?
Pat Grady: [laughs]
Noam Brown: And that’s actually not the case. When we saw that question, actually, we were really concerned that there was some internal leak about the model. And as far as we know, there wasn’t. It was just, like, a complete coincidence that our project was named Strawberry, and there was also this …
Hunter Lightman: As far as I can tell, the only reason it’s called Strawberry is because at some point, at some time, someone needed to come up with a code name, and someone in that room was eating a box of strawberries. And I think that’s really the end of it.
Pat Grady: It’s more relatable than Q-Star.
Noam Brown: I think I was pretty impressed with how well understood it was, actually. Yeah. We were actually not sure how it was going to be received when we launched. There was a big debate internally about, like, are people just going to be disappointed that it’s like, you know, not better at everything? Are people going to be impressed by the crazy math performance? And what we were really trying to communicate was that it’s not really about the model that we’re releasing. It’s more about where it’s headed. And I think I was—yeah, I wasn’t sure if that would be well understood, but it seems like it was. And so I think I was actually very, very happy to see that.
Criticism about o1?
Sonya Huang: Is there any criticism of o1 that you think is fair?
Hunter Lightman: It’s absolutely not better at everything. It’s a funky model to play with. I think people on the internet are finding new ways to prompt it to do better. So there’s still a lot of weird edges to work with. I don’t know. I’m really excited to see—someone had alluded earlier to, like, letting the ecosystem work with our platform to make more intelligent products, to make more intelligent things. I’m really interested to see how that goes with o1. I think we’re in the very early days. It’s kind of like, I don’t know, at some point a year ago, people started to really figure out these LMPs, these language model programs, with GPT-4 or whatever, and it was enabling smarter software engineering tools and things like that. Maybe we’ll see some similar kinds of developments with people building on top of o1.
o1-mini
Pat Grady: Speaking of which, one of the things that we have not talked about is o1-mini. And I’ve heard a lot of excitement about o1-mini because people are generally excited about small models. And if you can preserve the reasoning and extract some of the world knowledge for which deep neural nets are not exactly the most efficient mechanism, that’s a pretty decent thing to end up with. So I’m curious, what’s your level of excitement about o1-mini and kind of the general direction that that represents?
Ilge Akkaya: It’s a super exciting model also for us as researchers. If a model is fast, it’s universally useful, so yeah, we also like it. Yeah, they kind of serve different purposes. And also yeah, we are excited to have a cheaper, faster version and then kind of like a heavier, slower one as well. Yeah, they are useful for different things. So yeah, definitely excited that we ended up with a good trade off there.
Hunter Lightman: I really like that framing, because I think it highlights how much progress is how much you can move forward, times how much you can iterate. And at least for our research, like Ilge gets at, o1-mini lets us iterate faster. Hopefully, for the broader ecosystem of people playing with these models, o1-mini will also allow them to iterate faster. And so it should be a really useful and exciting artifact, at least for that reason.
How should founders think about o1?
Sonya Huang: For founders who are building in the AI space, how should they think about when they should be using GPT-4 versus o1? Like, do they have to be doing something STEM related, coding related, math related, to use o1? Or how should they think about it?
Hunter Lightman: I’d love if they could figure that out for us.
Pat Grady: [laughs]
Noam Brown: One of the motivations that we had for releasing o1 preview is to see what people end up using it for and how they end up using it. There was actually some question about whether it’s even worth releasing o1 preview. But yeah, I think one of the reasons why we wanted to release it was so that we can get it into people’s hands early and see what use cases it’s really useful for, what it’s not useful for, what people like to use it for, and how to improve it for the things that people find it useful for.
What’s underappreciated about o1?
Sonya Huang: Anything you think people most underappreciated about o1 right now?
Hunter Lightman: It’s, like, somewhat proof that we’re getting a little better at naming things. We didn’t call it, like, GPT-4.5 Thinking Mode or whatever.
Sonya Huang: Well, I thought it was Strawberry. I thought it was Q-Star. So I don’t know.
Pat Grady: I don’t know. “Thinking mode” kind of has a ring to it. What are you guys most excited about for o2, o3, whatever may come next?
Hunter Lightman: o3.5 whatever. Yeah.
Ilge Akkaya: We’re not at a point where we are out of ideas, so I’m excited to see how it plays out. Just keep doing our research. But yeah, most excited about getting the feedback, because as researchers we are clearly biased towards the domains that we can understand. But we’ll receive a lot of different use cases from the usage of the product, and we’re going to say maybe like, “Oh yeah, this is an interesting thing to push for.” And yeah, like, beyond our imagination, it might get better at different fields.
Hunter Lightman: I think it’s really cool that we have a trend line which we’ll post in that blog post, and I think it’ll be really interesting to see how that trend line extends.
Sonya Huang: Wonderful. That’s a good note to end on. Thank you guys so much for joining us today.