
Training General Robots for Any Task: Physical Intelligence’s Karol Hausman and Tobi Springenberg

Physical Intelligence’s Karol Hausman and Tobi Springenberg believe that robotics has been held back not by hardware limitations, but by an intelligence bottleneck that foundation models can solve. Their end-to-end learning approach enables robots to learn generalizable behaviors rather than task-specific programs. The team prioritizes real-world deployment and uses RL from experience to push beyond what imitation learning alone can achieve. Their philosophy—that a single general-purpose model can handle diverse physical tasks across different robot embodiments—represents a fundamental shift in how we think about building intelligent machines for the physical world.


Summary

Insights from this episode:

Scaling general-purpose models is the path to physical intelligence: Karol and Tobi believe that, just as in language and vision, scaling diverse data and model capacity will unlock robust, general-purpose intelligence for robots.

Real-world robot experience is essential for reinforcement learning: Their approach relies heavily on learning from actual robot interactions, arguing that simulated data alone cannot prepare models for the complexities of physical environments.

End-to-end learning, from pixels to actions, drives performance: They advocate for architectures that map raw visual inputs directly to robot actions, enabling more adaptive and efficient learning compared to modular pipelines.

Data diversity is critical for generalization: The team stresses that exposing models to a wide variety of tasks and conditions is crucial for building systems that can generalize across new and unstructured scenarios.

Deployment reveals the true intelligence bottleneck: Real-world deployment, not just benchmarks, exposes the limitations of robotic models—highlighting the need for continual improvement in model robustness and throughput.

Transcript

Introduction

Karol Hausman: Just, like, the fact that this whole thing works, it’s kind of mind blowing, right? Like, you build this, like, loosely brain inspired thing that has very general purpose learning algorithm. You feed it data, and it somehow gets it, and gets it way better than anything we’ve ever had before. And this applies to robots and it applies to vision and language and sound and all kinds of other things. And, like, I think if you stop for a second and just think about it, how it works and that it works, it’s just, like, absolutely mind blowing.

Sonya Huang: Karol, Tobi, thank you so much for joining us here today.

Main Conversation

Karol Hausman: Thank you for having us.

Sonya Huang: Excited to talk everything Physical Intelligence, general robotics, et cetera. Maybe before we get into it, just for our audience, can you share a little bit about what Physical Intelligence is and the mission that you’re after?

Karol Hausman: Yeah. So at Physical Intelligence, we are building robotic foundation models. These are models that in principle should be able to have any robot do any task. And over the past one and a half years or so, we started building the—we created the right building blocks that show how these models could scale. So we’ve shown that they’re able to control many different robotic form factors, many different types of robots. We’ve also shown that they’re able to generalize, so you can bring it to completely new environments and what it takes for them to generalize.

And this last release that we just had, called π*0.6, which we also wanted to tell you more about, shows how we can get them to good performance so that they’re starting to become deployable.

And this is really important to us because we want to see this technology actually deployed in the real world, but also because we don’t have the benefit of free data on the internet. There is no data of robot actions, so we need to create the datasets ourselves. So we are after the problem of physical intelligence, after the problem of creating foundation models for robots. And we’ve made quite a lot of progress.

Sonya Huang: Wonderful. And can I ask why the decision to build foundation models as opposed to, you know, there are companies that are building fully vertically integrated robotic products right now. The Sunday launch last month is in the back of my head. You can buy a cute little robot helper for your household. There’s companies working on cooking robots, there’s obviously the humanoid companies. Why build a foundation model versus build a robot yourselves?

Karol Hausman: Yeah. So I think if you look at the history of robotics, it’s very, very clear to me, and I think to many roboticists, that we’ve been always bottlenecked on intelligence. We’ve had robots that are capable of doing incredible things, whether it’s in the home or in industrial settings. We saw robots more than a decade ago that, if teleoperated, can clean an entire house. And the really important caveat is “if teleoperated.” So if there is a human mind behind it, it’s clear that the hardware is capable of doing lots of different things.

And for a very long time, robotics companies have been structured the way you described, where you kind of think of creating a specific robot that’s designed to do just a single task or a single application. And instead, what we thought would really help the field is to focus on the bottleneck, on the intelligence. So we created a company to focus on that bottleneck, because we think that if we address that bottleneck, we can actually make robots happen. And if you do it any other way, you’re basically not making as much progress on the bottleneck as you could be. So we thought we would just target this problem head on, focus on the intelligence, and if we can do that, that would lead to many different vertical products, it will lead to robots being deployed in the home, in industrial settings, basically anywhere.

Sonya Huang: Can I just pressure test that a little bit? So on the hardware side, I’ve seen the latest videos, for example, of the Optimus Hand. It’s exquisite, it’s a piece of art. And I hadn’t seen the videos of teleoperated robots cleaning houses 10 years ago, but I’m wondering if there’s a set of tasks that’s maybe now just on the cusp of becoming possible. For example, cooking or being able to peel and dice an onion that you couldn’t have done with hardware prior to where we currently are. So how much of a “why now?” do you think hardware is or isn’t?

Karol Hausman: So there’s a lot of progress in hardware, especially in humanoid hardware, like dexterous hands, for instance, as you mentioned. I think they’re much better now than they were even a few years ago. But that still doesn’t address the bottleneck. We could have had robots chopping vegetables or cooking even with simple grippers before. The problem is that we don’t have the intelligence to operate these robots. And making the hardware more complex doesn’t really resolve that bottleneck, right? Like, it allows you to do more potentially, but you’re still bottlenecked by the fundamental challenge of robots not being intelligent.

Sonya Huang: I see. So hardware may raise the ceiling on what you’re able to do but, like, the capability floor, we’re not even there yet.

Karol Hausman: That’s right. So even with simple robots, we are not yet at the level of a human operator.

Alfred Lin: So the limit being the intelligence layer, what’s the limit to developing the intelligence? Is that collecting data? Is it doing it cheaply? Because you’ve broken down the problem, we’re going to keep asking you why, why, why and just drill down further. So what’s the next layer of the—okay, what’s the bottleneck for solving intelligence? Generalization?

Karol Hausman: It’s a good question. So we thought about it in terms of three factors. We refer to them as capability, generalization and performance. With capability, our idea was that we want to get to the point where as long as you can collect data for something, for a task or for a robot, you should have a model that should be able to replicate that, to automate that task.

This is something that we’ve gotten to fairly quickly. This was our π0 release around a year ago or so, showing that it’s basically possible, that if you can collect data for any task for any robot, you should be able to automate it and the model should be able to learn it.

The next challenge is around generalization, and this is still an open challenge. So we wanted to get to the point where the robots can just work zero shot, and you can just bring them to a new home, for instance, and they should know how to operate in that home. And this is a really, really difficult problem, right? Like, if you put a robot in a new home, it needs to understand where different items are. The counters look different, the lighting is different than what you’ve seen in the past and so on.

And I wouldn’t say that this problem is solved, but I think we start to get a handle on how to solve it and how it scales. And the only answer to generalization that we know in machine learning is through diversity of data. So if you see a lot of different, diverse data sets, you should be able to generalize to a setting that is similar to the one you’ve seen.

And this is something that we’ve seen with our π0.5 release in April of this year, that we got to the point where we can bring a robot to a new home that it’s never been to before, and it’s able to operate in that home. It’s not perfect yet, but at least it has some kind of common sense on how to go about simple tasks like cleaning up the kitchen and things like that.

And then the last challenge that is also not fully solved yet is performance. So how can we get these models to the point where the performance is good enough so you can actually deploy them?

Alfred Lin: Right.

Karol Hausman: And deployments here are really, really important, because as I mentioned before, we also need to gather data. And I think that is going to be the most scalable way of collecting data, because you’ll have robots out there in the world doing economically valuable tasks, and that way the cost of that data collection is basically negative. And the more broadly you can deploy this technology, the more data you’ll be getting. And I think in the limit, that will be the biggest source of the data you can imagine. Much bigger than internet data, for instance.

Alfred Lin: And how far away do you think we are from generalization, or from a performance level that maybe it’s a controlled environment, maybe it’s a general environment in homes or offices, but not the whole world? If you could limit that, where do you think generalization and performance will need to be before we can deploy these kind of robots?

Karol Hausman: I think we are actually fairly close to deploying these robots. We started deploying them ourselves already. We thought this was something that was going to take something like five years to get to the point where the technology is actually ready to deploy a robot in a commercial setting and have it do something valuable. But we’ve done it, I think, two months ago or something like that. So I think we’re now getting to that threshold that the models are useful enough, they’re performant enough, and they can do enough variety of tasks to be actually useful. So that’s a really, really exciting moment. I think we just crossed that threshold.

I think it’s still to be determined how wide is the aperture of where we can deploy? There are some tasks where the failure can be really catastrophic. Maybe these are not the best tasks to deploy just yet. There are some tasks that require a ton of generalization, like deploying in the homes or that have privacy concerns or safety concerns and so on. Maybe these are not the best places to deploy just yet, but I think that the aperture is growing as we collect more data, as these models get better, we can deploy them in more and more settings. So I think we’re starting to get there.

Alfred Lin: Where is the current aperture that you’re deploying right now?

Karol Hausman: So this is a really difficult question to answer, because with these foundation models, sometimes you don’t fully know. So kind of similarly to how with large language models, you train this model, you kind of cook it in house, you try to do the best job possible, and then at the very end, you get this artifact, and you can’t really predict how good the artifact is going to be. You kind of have to test it.

And that’s where we are with these models as well. So for instance, we open source them so that we are not the only ones testing it and we’re not the bottleneck team knowing what their capabilities are. And by open sourcing them, we see them being applied to actually many more applications than we could have imagined. Things like driving or surgical robots or agriculture and places like that. So I don’t have a very good estimate of what the aperture is. I think it’s wider than what I had expected, and I think it will be growing over time. The more data these models get, the more mature they get, and the aperture will continue to grow.

Tobi Springenberg: I would add maybe, like, on the performance level, as you said, the aperture is probably wider, the starting point is wider than we thought. But at the same time, of course, if you actually want each of those starting points for each of those applications to be at a level where people would want to use this day to day, driving their businesses, there’s probably still quite a bit of hill climbing to do in terms of performance, right? So with this release that we’re going to talk about a little bit, I guess, the π*0.6, we’ve made progress on learning from experience data, getting that back and making the models better when they are deployed. Still, for a lot of applications I can naïvely imagine, there’s a really, really long tail of things that can go wrong, or that you can encounter, that we don’t yet have a great grasp on how to completely solve, I would say.

Sonya Huang: And you guys have been really great about publishing your results with a lot of transparency, releasing them open source. So whatever you’re comfortable sharing, can you talk about what your overall technical architecture, so to speak, is? And do you think that the architecture to kind of get to this promised land is pretty much baked and it’ll be variations on the theme of where we are and we just need to collect a ton of data, or do you think that the architecture is still being figured out?

Tobi Springenberg: I would say—so we can maybe start with, like, a little bit discussing where we’re at now, and then we can go into the details of how that might change. So at the moment the architecture is very analogous to how VLMs are built today that probably most of you interact with on a day-to-day basis, right? Type something in and put an image in and ask it to read what’s on the image and so on.

And we’ve kind of started from the same standpoint of there’s a model that’s trained on internet-scaled data and it’s ingested image data and text. And we’re adding all this robotics data. And our training actually predominantly now is on robotics data, on data that we have collected ourselves. We have a little bit of that internet data in the mix, but the majority of it is robotics data. The architecture is kind of this vision language model, and we add something on the side, which is what we call the action model, the action expert, the part of the model that actually then has to drive the robot, that basically looks at the image and the instruction it’s getting and has to perform the task, has to send commands to the robot.

And so broadly, it’s like a transformer model that is a fairly large model, up to, like, seven billion parameters at this point that we use, that we pre-train on our robotics data and on internet data. And it is trained largely initially from human demonstration data—Karol mentioned this earlier a little bit. We have this demonstration data, teleoperated data of humans trying to get the robot to do stuff.

So that’s what the architecture looks like now, and roughly the scaling that we’re getting is from scaling our data. And we use models similar to what comes from the VLM world. How that might change, I think, is an open question. I think there’s lots of opportunities in adding more capabilities to these models that we’re also exploring. You can imagine that you might want more context in these models. You might want more cameras added to the robots that the model then needs to be able to use. You might want to have a better understanding of the physical world in the sense of understanding exactly what’s in the room, what can break, what is easily movable, and so on.

So there’s lots to be done, I think, in both capabilities and also changing the architecture around. And I wouldn’t be surprised if in, like, five, six years, we look back and we say, “Oh, you know, maybe the backbone of the model that we used at the time—which currently comes from this VLM—has changed, maybe we’ve moved on and we use something slightly different.” I think that will evolve over time, but I think the foundation of the data and how we bring it into the model will probably stay.

Sonya Huang: Got it. And should I think about it as it’s pixels or signals in and then actions out?

Karol Hausman: Yeah.

Sonya Huang: Is that like a single big neural net?

Tobi Springenberg: It’s one big model, yeah. It’s really just basically images in, text in, text out, and actions out at this point.
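[To make that contract concrete, here is a minimal, purely illustrative sketch of the interface just described: images and a language instruction in, a short chunk of actions out. This is not Physical Intelligence’s code; the function name, the 50-step chunk (mentioned later in the conversation), and the 14-dimensional action space are assumptions for illustration.]

```python
# A minimal, illustrative sketch of the interface described above: images and
# a language instruction go in, a short chunk of continuous actions comes out.
# A real VLA is a ~7B-parameter transformer (VLM backbone plus an "action
# expert" head); the stand-in below only shows the input/output contract.
import numpy as np

CHUNK_LEN = 50    # roughly one to two seconds of actions
ACTION_DIM = 14   # e.g. joint targets for two arms (illustrative)

def vla_policy(images: list, instruction: str) -> np.ndarray:
    """Map camera images plus a text instruction to a chunk of actions."""
    _ = (images, instruction)  # a real model conditions on both
    return np.zeros((CHUNK_LEN, ACTION_DIM))

frame = np.zeros((224, 224, 3), dtype=np.uint8)  # one camera frame
actions = vla_policy([frame], "pick up the glass and put it in the sink")
print(actions.shape)  # (50, 14)
```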

Sonya Huang: And are you—I guess, do you have a separate kind of locomotion versus manipulation stack? And this might be a good time to talk about kind of just the historical evolution in robotics, and the various different waves of learning and how it pertains to your stack.

Karol Hausman: Yeah. So for a long time, even before learning arrived here, people thought that robotics is one of these problems where if you put enough people on it, enough engineers, they can think really hard about it and eventually write the code that will have the robot do anything in the world. And people have tried really, really hard to do it this way. And then it turned out that the world is just way too complex.

Sonya Huang: Yup.

Karol Hausman: Right? Like, you can’t just write every single case you’ll encounter in the real world. So that doesn’t work. And also, as we were trying to work on that version of the problem, what ended up happening is people did what they usually do, they tried to break down this problem into smaller sub-problems. So rather than working on the full robotics problem, you would say there was a perception aspect of the problem, there was a control aspect of the problem, there is the planning part of the problem. And this almost grew into different communities. There’s a planning community, there’s controls community, they have their own conferences, their own problems and all of that.

So then as we realized that it’s not really possible to handwrite all of these rules, people thought that we should learn them, we should learn them from data, which seems like a really good idea. This is how we learn, too. But what ended up happening is that they started learning each one of those broken-down components separately. So you would have a perception layer that is fully learned, maybe you’ll have a control layer that is learned, maybe you’ll have a planner that is learned.

And that showed some progress. It was better than what we had before, but then it turned out that breaking down this problem into these sub-components actually is the piece that doesn’t work. Because, you know, when I try to pick up this glass, I don’t think about it in terms of perception and then planning and then control. I just go for it. I just pick up the glass, and it’s just all very natural.

So it turned out that this pipeline approach, where you have these predefined interfaces, that perception gives you the position of the object and then the planner gives you the trajectory and the control executes it, those interfaces are the pieces that broke down. So everything that we thought we knew about how we work was wrong.

So then we arrived to the next stage of this where we said, well, maybe just breaking down this problem was a bad idea to begin with. So let’s just train the whole thing end to end. So we’ll take whatever the kind of sensory inputs as input to the network and we’ll have actions as the output. That’s what we refer to as the end-to-end approach, where you try to go straight from pixels to actions. And we’ll have the network figure out, or the learning algorithm figure out how to split it into these different components, if it’s even possible.

And then while we were doing that, we figured that it actually requires a ton of data to do this, and often it breaks when it requires some kind of common sense. And to gather that common sense through first-person action data sets is really, really hard because you would need to experience every single thing in the world to do this.

Sonya Huang: Yeah.

Karol Hausman: And that’s where we stumbled upon vision language action models, where we can use models that were pre-trained on internet data that already have pretty good understanding of how the world works, and we can utilize that knowledge so that we don’t need to experience everything firsthand. You can just add some action components on top of it and have a common world understanding and connect it to how to actually perform things in the world.

And that’s more or less where we’re at today. Now at Physical Intelligence, we figured out a few other things. So how do you start to scale these models? How do you get them to generalize? How do you get them to perform much better? How do you have them move much faster? How do you get them to the point where you can start deploying them? But I think largely we’re still in this era of how do you bring some of the common sense knowledge from the internet pre-training, and how do you make these models very general so that they can work on any robot and perform motions?

Sonya Huang: And can I ask for something like reasoning, right? There’s so much stuff happening in the reasoning side of the large language model space. Do you get the benefits of that as part of your VLA backbone? Do you have reasoning kind of emerge as a consequence of what you’re doing as you train these end to end? Or how should I think about some of the benefits of what’s happening in the LLM world? Do they benefit you or not?

Tobi Springenberg: I mean, I think definitely the models that we have today, they are already planning actions not just at what is the immediate action, but kind of what are the next 50 things I need to do? So, like, the next 50 time steps, in some sense. It’s a very short horizon. Fifty steps means like a second or two, right?

And it also additionally kind of decomposes tasks into subtasks in language space already. So when we ask it, “Oh, clean the kitchen,” the first subtask it might pick out to do is, like, “Oh, I have to drive to the counter and then I have to, like, blah, blah, blah, blah, pick up the glass, move the glass into the sink.” So it already has those aspects in some sense, right? So it decomposes tasks into subtasks, because it gives itself its own subtask, and it predicts a little bit of a horizon of how actions go.

So some of this is already there, I think. I think in the future there will probably be more of it. I do totally expect that, you know, all the advances on RL training for reasoning, all these things will also make their way into robotics. And I think it’s kind of interesting to think about, because it’s maybe a little different than the RL for math problems that people do, for example. Because I think those are very easy for us humans to think of as, like, textual problems, right? You think through them in your head in, like, text. “Okay, if I change this formula this way, I will get this outcome.” And so on.

And I think for the physical intelligence part of it, it will probably be a bit more than that, right? It’s going to be a little bit different. When you try to learn a new sport, for example—when I recently started to try to learn how to play tennis and, you know, I don’t think through in my head of, like, I need to now grab the racket, I need to move it here and I need to do this swing. But it’s more like you think through the motion itself, right? You think about, like, how does your body move, maybe you plan in some sense trajectories of objects around you in your head. And so those things, I think, we’ll see come into the models more over time.
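[As a toy illustration of the two levels described here, the model giving itself a subtask in language and then predicting roughly 50 low-level steps for it, the sketch below uses stub functions standing in for the real model; everything in it is hypothetical.]

```python
# Toy sketch of language-space subtask decomposition followed by action
# chunking. The stubs below stand in for the real model and are hypothetical.

def propose_subtask(observation, instruction):
    # Real model: picks its own next subtask in language
    # ("drive to the counter", "pick up the glass", ...).
    return "pick up the glass and move it to the sink"

def predict_action_chunk(observation, subtask, horizon=50):
    # Real model: the action expert outputs ~50 low-level commands,
    # about one to two seconds of motion.
    return [(subtask, t) for t in range(horizon)]

observation = "camera frame"  # placeholder for real sensor input
subtask = propose_subtask(observation, "clean up the kitchen")
chunk = predict_action_chunk(observation, subtask)
print(subtask, len(chunk))    # ... 50
```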

Karol Hausman: Yeah, I suspect that over time—right now we’re in a place where we benefit quite a bit from vision language models. I think it’s very, very likely that that’s going to reverse, that a lot of the shortcomings that we see in LLMs today are kind of baked in, or because we are focused on the text problem, on problems like math and coding.

And I think robotics will offer this new avenue where you need to kind of rethink how to think about reasoning. Reasoning should probably happen in some kind of abstract space where you can reason a little bit in text, you can reason a little bit in images, maybe you can reason in trajectories or in all kinds of different spaces to arrive at the answer. And robotics provides this really nice testbed where you’re grounded in the physical world. There is not that much data yet, so you kind of need to deal with some of the difficulties that come with that. But I think it will provide for new findings that will then be reapplied to the LLM world.

Alfred Lin: Speaking about data, give us a sense of—I don’t know how you measure the sort of magnitude of data you’ve already collected and how much you would like to collect in the next year—I’m sure more is better, but what is the magnitude we’re talking about?

Karol Hausman: Yeah. Data is one of those things that’s actually fairly nuanced. It’s not just a matter of quantity. Quality obviously matters, but also things like diversity. And even when you think about the quality or diversity of robot data, these are not very strictly defined terms, right? Like, if you go for the same task in, like, 10 different ways, is this diverse data or not? Or how do you compare it to the diversity of the data if you go for, like, 10 different glasses?

Alfred Lin: Right.

Karol Hausman: So this is something that I don’t think we as a community fully understand, like how to characterize the data, how to describe diversity, how to describe the quality of the data, how to make it very, very rigorous. And we’re also finding out that there are some aspects of the data that really, really matter. Like, for instance, if you want to get to a certain performance on a task, you’re not going to get there by just increasing the quantity of the data you already have.

We’ve been working on these three different tasks for the π*0.6 release, and we’ve noticed fairly early on that if we just keep on collecting more and more data the same way that we’ve been collecting so far, the performance plateaus. You’re not going to just keep on getting better. So you need to find either new ways of collecting it, or you need to start thinking about what kind of data will result in better performance. And this is where things like reinforcement learning and things like this can really, really help.

Sonya Huang: Let’s talk reinforcement learning, and let’s talk π*0.6. Is the star a nod to Q-star or …?

Tobi Springenberg: Yeah.

Sonya Huang: Okay.

Tobi Springenberg: Effectively trying to get to policy star, actually optimal.

Sonya Huang: Policy star. Okay, wonderful. Maybe just say a word on what you guys are doing with π*0.6, and then we can dive into what RL means for your world.

Tobi Springenberg: Yeah, for sure. So I mean, I think the main—if we want to contrast that to what we talked about earlier, the main difference is that up to that point, basically all of the robot foundation model learning that we’ve done was teleoperated demonstration data going into the model. The model is trained to kind of like just imitate that data, right? And now with this new model, π*0.6, what we’re using is basically RL from experience that the robot collects itself by actually running a policy.

So we start with the initial policy, this demonstration-trained policy, and then you deploy it, you try to actually have the robot solve the task. And then it additionally gets kind of reward signals given by humans, and it can also get corrections, where the human intervenes and says, “Oh, actually, you know what? This is not right, let’s do this a little differently.”

And that data gets collected, comes back in, and the model kind of uses it to figure out which of the data should I reinforce, do more of, and which of it should I do less of? And basically improve itself over time. That’s kind of the big distinction. And having that stream of real data coming in is kind of the missing piece that Karol was talking about that allows us to now escape this plateau that otherwise we were finding ourselves getting to.
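[As a rough sketch of that loop, viewed as advantage-style data filtering (the actual π*0.6 recipe is more involved): deploy the imitation-trained policy, collect episodes with human reward labels and corrections, keep the data the policy should do more of, and fine-tune on it. All names below are illustrative.]

```python
# Toy sketch of learning from experience: keep human corrections and the
# autonomous episodes that beat a baseline, then fine-tune the policy on them.
# Illustrative only; not the actual pi*0.6 training code.
from dataclasses import dataclass

@dataclass
class Episode:
    observations: list     # camera frames, proprioception, ...
    actions: list          # what the policy (or the correcting human) did
    human_reward: float    # e.g. 1.0 for success, 0.0 for failure
    is_correction: bool    # True if a human intervened on this segment

def select_training_data(episodes, baseline=0.5):
    """Pick the experience the policy should imitate more of."""
    kept = []
    for ep in episodes:
        # Corrections show the right behavior; autonomous episodes are kept
        # only when they beat the current baseline estimate of success.
        if ep.is_correction or ep.human_reward > baseline:
            kept.append(ep)
    return kept

# Each round: deploy, label, filter, then fine-tune the policy on `kept`
# with ordinary supervised learning on its actions.
```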

Sonya Huang: Yeah. And I guess in my brain, I think of RL as, you know, you’re hill climbing on your reward signal. And so how do you make sure you’re generalizing as you hill climb on these specific tasks?

Tobi Springenberg: The way we are thinking about this for this specific kind of problem is like, you have this sort of general model and it achieves some performance that isn’t great. And now your first goal actually isn’t to further generalize. You want to kind of solve this specific task first, right? So we deploy it and we’ve picked, like, three, four tasks. So it has to generalize across tasks, nonetheless. The method has to generalize. But when you’re actually kind of deploying it and trying to start this RL process, you really care about let’s make sure I nail down this task, and I kind of nail it down in a way where I can solve it from many different positions, and I can deal with all the long tail of failures that I will encounter, right?

So in some sense, the generalization and the performance here may seem at odds when you look at it from, like, oh, wait, but now you’re, like, just doing this one task. But really, at the end of the day, what we want to do is we have the same method, the same process that deploys to each of these tasks, and then kind of gets the performance high, and then we can have all of that data across all of these tasks and we can bring that data back basically. So in that sense it’s not actually at odds, if that makes sense.

Sonya Huang: Yeah, that makes sense. How much of the RL are you doing? It sounds like there’s a real life RL. Can you talk a little bit about the approach to how much RL you’re doing in sim versus in real life?

Tobi Springenberg: So we have taken a quite real-world-first approach as opposed to using sim. We are exploring sim, of course, as well as a research tool, but all the RL we’ve done for the π*0.6 paper is actually on real systems in the real world. And the reason for that is that it’s actually really, really hard to model—again, we can get back to the long tail of failures that you see when you do deployment. I can give you a lot of examples from the tasks that we’ve actually looked at for this release where there were failure modes that we saw that if you had just done a simulation of it, you might not have seen it.

So to give you an example, we have this one task which is you have to build a box, right? So this is an actual deployment task where the goal is we build these little cardboard boxes to put chocolate into such that they can then be packaged up and sent out, basically. So that’s building a chocolate box, basically. And building this box initially, you know, worked great. And then there were these new shipments of boxes coming in. And they come in as like a flattened sheet of cardboard, and then these cardboards that came in in this new shipment were kind of not perfectly perforated, so they were sticking together, right? And then the robot starts, like, grabbing them, puts them on the table to try to build this box, and it has two boxes suddenly on the table, right?

And this is something that wouldn’t happen in sim if you had written a nice simulator where you would just get individual cardboards and, like, fold them. And so now you have to deal with this problem, and if you just learn everything in sim and then try to deploy it, you wouldn’t encounter it. So we encounter it, and then our kind of method can kind of figure out that, oh, actually what I need to do is I need to separate this and I need to move that second piece back and build the box, basically.

Karol Hausman: And we see a lot of successes for RL being applied in sim and transferred to the real world, especially in locomotion. And we haven’t really seen that kind of success in manipulation for these kind of methods.

And I think maybe one reason for that is that with locomotion, with trying to move around, it seems that the biggest part of the problem is modeling your own body. So if you can figure out how to model you yourself as a robot, you’re basically almost there. So you can do this modeling simulation exercise once, because you only have to do it for you yourself, for this one robot, and then you’re basically done. If you do it really, really well, it should transfer.

With manipulation, however, the problem is not how you move your own body, it’s how the world reacts to it. You’re actually changing the world around you. It’s not difficult to figure out how to move your hand from A to B. It’s difficult to figure out how this affects the objects you’re interacting with. And now the problem is no longer just modeling your own robot, you have to model the entire world, every single object that you might be interacting with, every single task you can think of. And that’s where we see scaling problems. And that’s I think why we haven’t seen those kind of methods be as effective in manipulation.

Sonya Huang: What was the headline of the results from π*0.6? And where did you see the model get after RL on the tasks that you cared about? And what do you think that means about your overall training recipe going forward?

Tobi Springenberg: Yeah, so I think the most impressive thing honestly for me personally to see was just having these models run for hours at a time, recover from lots of different failures and basically just keep going, and at the same time do that at a rate that is actually much better than the initial model that we started with, right?

So the headline figures were we increased kind of the throughput of the policies by over 2x on these three tasks. One task was this box-building task I already talked about. One was making coffee with an actual kind of industrial-scale espresso machine. And the other one was kind of like folding laundry.

And so for each of them we managed to, like, make the base policy that was trained just from demonstrations much, much faster, and also make it be able to recover from failures much, much better.

And so seeing that actually in action when you sit there—if you go to our website, you can look at the videos. We have the robot serve coffee for 13 hours in a row, or fold laundry for four hours, things like that. Actually seeing that live changes the way you think about these models, you know? It changes the way at least I think about it actually being realistic that we can deploy them, that we can do it in a way where it’s not just a toy demo which is shown once, but is actually kind of doing the real thing fully.

Karol Hausman: And that’s been really a challenge in robotics that I don’t think many people are aware of. You know, you see so many videos of robots doing cool things. And, you know, we post these videos, too, basically like anything you want a robot to do, there’s probably already a video of a robot doing that. But, you know, you can take as many takes as you will, as you want. You can keep on recording until you get the perfect shot. And the problem that I think everybody encounters is the reliability of these models—how performant they are, how fast they can go about the task, for how long you can actually deploy them without failure.

And I think this is the biggest bottleneck in terms of deploying these models in the real world, because if they break every other trial, they’re not really deployable.

Alfred Lin: Right.

Tobi Springenberg: And this is, I think, the most important breakthrough for us with this π*0.6 release is that we can actually start getting to a place where they are deployable, where we use these robots in our office to serve us coffee, or we can give them to people at PI to fold laundry in their home, or we can deploy them and have them fold boxes for real. And that is really, really exciting.

Sonya Huang: Should we think about what you guys are doing with reinforcement learning as primarily customer deployment reliability points then? Like, you can now make sure that you can, you know, go reliably deploy the coffee-making model on a customer site and it’s going to be fast enough, it’s not going to fail over long time horizons. So is it more of a customer deployment innovation versus, like, a fundamental kind of capability innovation? Or is it both?

Tobi Springenberg: I think it’s both. I think—I mean, Karol, you said this a little bit earlier. I think to some extent, the robots that we really, really want, right? The robot that you want at home, which can do your laundry, do your dishes, cook for you, drive around, and also the robot that people want in these smaller businesses, maybe solving a real problem that they have that they don’t want to automate through the classical way because it’s too expensive, like building a chocolate box, those are things where the robot has to be reliable, it has to be good, and it has to have the capability to do a new task that it hasn’t seen in initial training stages.

I think it’s unrealistic for us to assume that, you know, we can just go with more and more human data collection, go bigger, bigger, bigger. We will do that, but there is always going to be a limit to how good and how much data you can get and how good the initial policy is going to be. So I think it is that, what you said in terms of if we want deployment we need this, but also I think increasingly over the next years I expect we will see that we will do these deployments, and that data will actually become really valuable as a source for pre-training, for making our models better themselves. And we will rely more and more on autonomous data collection, is my prediction, at least, over the next coming years, to kind of build that host of data, that convex hull of all the tasks that we want robots eventually to do such that the models, like, ingest this and become good at doing them and interpolating [inaudible].

Karol Hausman: And I think of it as a new capability. We haven’t so far figured out how to learn from your own experience—or there’s been many attempts, but I don’t think we’ve seen it done at scale to the extent that actually shows a convincing result that allows you to deploy something.

Sonya Huang: Yeah.

Karol Hausman: And this is why this result was really, really important to us. We wanted to get to the point where they can learn from their own experience, because, you know, similarly to how we learn, you can learn a little bit from watching videos and, you know, maybe learning from others, but at some point you need to learn on the job. You need to try the thing yourself. You need to see how your actions impact what you actually want to achieve, and make your own conclusions and try to learn that way.

Sonya Huang: Yeah.

Karol Hausman: And I think this is the first step towards that.

Sonya Huang: You’re reminding me of the—did you guys read the Rich Sutton “Era of Experience” paper this year? I thought it was very profound. Do you think that this unlocks kind of continual learning in robotics for you all? Will this be part of that?

Karol Hausman: It kind of depends what people mean by continual learning. I think it’s definitely more continual than what we’ve done in the past where, you know, you have a big pre-training mixture and maybe a post-training mixture, and you sit down, you work really, really hard, and then you come up with an artifact and that’s it. Right? Like, the artifact is done, and there’s not much you can do to change it.

Now this is much more of a living thing, right? Like, we start with a process similar to this, but then you deploy it and then it keeps on learning, right? So it’s much more continual in that sense in that it tries new things, it tries to learn from its own experience, and it keeps on getting better.

Sonya Huang: Yeah.

Karol Hausman: Now I think there is still room for it to be much more continual, where it can acquire new skills that way, or it can be even much faster in doing this. It can probably reason throughout this process. So I think there’s a spectrum of how much you can learn on the job. And this is really promising because it shows that you can do it, but I think we can make it much, much better.

Tobi Springenberg: Yeah, I would agree. I would say we’re at the very beginning of this, and it’s definitely not continual learning in the classical sense that people would have thought about it of, like, data streams and then the whole thing churns and it just ultimately leads all the way to, I don’t know, AGI or something like this yet.

But, you know, it’s a first step. I would say we’re moving in the right direction there, and there’s lots more to be done. And I think I will say even from this release, like, I was personally impressed and to some extent, you know, shocked how good these models actually are at picking up little things that you put back into the data. I was surprised that even with just, like, human corrections for—there was one example for tamping when we do—so tamping is a specific part of making an espresso, right? You, like, put the beans in and you have to tamp down.

Sonya Huang: The best part. [laughs]

Tobi Springenberg: Yeah, the best part. You have to tamp down.

Alfred Lin: I don’t get [inaudible] myself, so …

Tobi Springenberg: There you go. See, I’m not a coffee expert.

Alfred Lin: It’s a skill issue. Gotta get it just right.

Tobi Springenberg: That’s right. And so our robot in the beginning, like, tamped way too hard because it just happened to be the case that, you know, the initial human demonstrations were just making sure that, you know, let’s make sure the coffee grounds are flat so we can put it in. And then the robot was, like, tamping really hard and almost lifting itself off the table. And when we looked at it, we were like, “Ooh, that’s a bit much.”

And so with just, I don’t know, it was, I think, 30 to 50 episodes—it’s a really small range of corrections that humans did. And we feed that data back, and the model actually starts being much more gentle and doing the correct thing. And I was really surprised by that, because you think this model has been pre-trained on these millions and millions of episodes, and now you’re just doing a little correction and that actually works. So seeing that happen was a thing that I think is pointing towards this continual learning part, which I find impressive.

Sonya Huang: Can I ask though—and the thing I’m still hung up on is generalization. So as I learn how to tamp better, does that make me better at folding boxes or not?

Tobi Springenberg: In this specific case, no. But the mechanism is the same that you can also employ to fix the oh, I have two boxes in front of me that are sort of stuck together and I need to pull them apart, right? Because you can get 30 corrections for the tamping part. You get 30 corrections for pulling boxes apart. And you get 30 corrections for oh, you know, this box wasn’t neatly folded together, and all of this accumulates together to then give you this more generalized improvement, I would say.

Sonya Huang: Okay, so it’s a repeatable recipe, but they don’t necessarily cross pollinate.

Tobi Springenberg: Yeah. I mean, I would expect that as we scale this up, we might also see things actually kind of transfer from A to B if there are motions that are kind of similar across tasks. But at this point yeah, I would say it’s more like a repeatable recipe.

Sonya Huang: Yeah.

Karol Hausman: And we see a lot of generalization from pre-training, where you train on more and more tasks, more and more data, you see that it’s much easier to onboard any new task, or you see tasks that appear zero shot that you didn’t expect before. And this keeps on improving. We kick off a pre-training run at a certain cadence, and every single time we start seeing that the model keeps on getting better, because there’s more data being put in, there are more improvements that we are making to the pre-training process and so on. And I also suspect that as we have more and more of these models deployed doing all kinds of different tasks, they also bring data back in. And I think one way where I’m quite certain where we’ll see more generalization is from that process, that as you deploy these models, the data comes back, the models get better, you can deploy them more, then the models get better. You can deploy them more, and so on.

Sonya Huang: Yeah.

Tobi Springenberg: And I think maybe it’s worthwhile, for this point that you brought up, to mention that we haven’t really talked about one crucial aspect of this π*0.6 recipe, which is that the model has kind of two parts. One is the policy that it’s trying to improve via corrections and RL feedback. And the other part is how do you actually get this RL feedback? So we’ve talked a little bit—I’ve mentioned, like, you know, humans might correct and that’s the human correction part.

And the RL feedback part is a little different, and it already has some of these aspects of generalization that I think you’re trying to search for, which is that the way we do this is we first basically get humans to tell us whether a specific attempt of making the coffee or doing the box was successful or not. So there will be, like, human labels provided with these episodes. And then we train something which is called a value function to try and predict basically from my given point of where I am in the task, will I likely be succeeding or failing, basically. And this value function is then used as kind of a baseline to decide whether for this data point, should I bump that up or should I bump that down, depending on whether I expect that I will be moving towards success or more likely to move towards failure.

And one thing that we saw when we trained these value functions—so those are trained basically from the same kind of backbone, the same kind of model, but they’re pre-trained before the actual policy is trained that actually runs the task. When we trained these value functions, we see that adding more data from different tasks actually helps there, and the model starts being actually really quite good, at least for certain tasks, at knowing when it will fail beforehand and before it is obvious for me. For example, when I look at a video of it trying to insert the …

Karol Hausman: The portafilter?

Tobi Springenberg: The portafilter. Thank you. See? I’m not good at making coffee.

Sonya Huang: [laughs]

Tobi Springenberg: And it’s trying to insert a portafilter into the coffee machine. It kind of knows that it doesn’t quite have the right angle before that happens. So, like, 30, 40 steps before that actually happens, the value function kind of—if you look at the prediction—drops and saying, “Oh, this is not good in this specific episode, so I shouldn’t include this data.”

Sonya Huang: Interesting.

Tobi Springenberg: And this gets better with more data and more tasks.
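[A toy illustration of that value function idea, assuming a simple advantage-style use of it: the real value function shares the VLA backbone and is trained on human success and failure labels, while the features, weights, and numbers below are made up.]

```python
# Toy value function: predict the probability of eventual task success from
# the current state, and use the change in that prediction to decide whether
# a step should be reinforced. Features, weights, and numbers are made up.
import numpy as np

def value_estimate(features, weights):
    """Predicted probability that this attempt ends in success."""
    return 1.0 / (1.0 + np.exp(-features @ weights))

def advantage_weight(features_now, features_next, weights):
    """Positive if the step moved us toward success, negative otherwise."""
    return value_estimate(features_next, weights) - value_estimate(features_now, weights)

w = np.array([1.5, -2.0])           # toy parameters
before = np.array([0.2, 0.1])       # state just before the insertion
after = np.array([0.1, 0.5])        # state with a mis-angled portafilter
print(advantage_weight(before, after, w))  # negative: down-weight these actions
```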

Sonya Huang: So this is an interesting counterpoint to the Karpathy, like, slurping bits from a straw thing, right? Because you’re not waiting for that final bit at the end. You’re actually getting a lot of signal along the way.

Karol Hausman: I think RL is just such a vast field, and there’s so many different approaches to it. And people often associate RL with something like a policy gradient method, or with very specific on-policy learning approaches. And to me, RL is more of a problem definition, and there is many, many approaches that get around the problem that you’re referring to, which is that you only get the reward at the very, very end, and it’s not really scalable for very long horizon tasks.

There are things like value functions, there are things like temporal difference learning that try to get around this problem, where you constantly make predictions and you do it in a sequential way. And this is maybe another one of these things where I think robotics can really help the broader AI community, because we don’t have the advantage of having a perfect language simulator where you can run as many simulations as you’d like. Instead, you need to do it in the real world. So you need to make more efficient methods, and therefore you need to learn value functions and things like this. And I think these will be really valuable everywhere.
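[For the temporal-difference point specifically, here is the textbook TD(0) update, which improves the value estimate at every step instead of waiting for the reward at the very end; the numbers are arbitrary.]

```python
# Standard TD(0) update: move V(s_t) toward r + gamma * V(s_{t+1}), so the
# value estimate improves at every step rather than only at episode end.
def td_update(v_now, v_next, reward, alpha=0.1, gamma=0.99):
    target = reward + gamma * v_next
    return v_now + alpha * (target - v_now)

print(td_update(v_now=0.2, v_next=0.6, reward=0.0))  # ~0.2394
```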

Sonya Huang: Yeah. Can I push a little bit on—I’d love to understand, you know, internet video seems like it’s part of the recipe, but not a huge focus right now. As I see it, like, do you think that there’s gold left to be mined in internet video? And then if you look at what’s happening in video models right now, world models, to what extent do you think that’s going to be a discontinuous jump in model capabilities and an important part of your model pipeline?

Karol Hausman: Yeah, I think maybe there are two questions there. One is about the data, like how do you bootstrap yourself to the point where you can start deploying? And the other question is, you know, what about video models and kind of the world model aspects of it?

So on the data point, I think we are now in this bootstrap phase, where basically anything goes. Like, whatever you can figure out how to add to the model to its benefit, I think it’s good, whether you can add sim, whether you can add human videos, some kind of handheld devices, human teleoperations, I think it kind of doesn’t matter. You just need to figure out some way to bootstrap yourself to the point where you can deploy these models. Because I think in the long term there’s going to be this bootstrap phase, but then there’s going to be the deployment phase. And I think the deployment phase will provide much, much more data than anything you could do in the bootstrap phase. So we’re in this kind of, like, weird spot right now where we try many different things and see what sticks, just to get us to the deployment threshold.

Sonya Huang: I see.

Karol Hausman: And once you can deploy, I think that will be vastly greater than anything you can do before that. So that’s also what we are sprinting towards. That’s why we want to start deploying these models. That’s why we want to do this with many different tasks in many different environments, so that we can just have this very powerful data engine.

Now on the world modeling side of things, I think the world models and RL approaches are kind of targeting the same problem, the problem of counterfactuals, of how do you—or a credit assignment problem, right? Like, how do you figure out which actions were the ones that actually matter for your success? And how would the world have evolved had you taken a different action?

And one way you can do this is by predicting what would have happened. Like, rolling out a full video of, you know, if I put this portafilter a little bit differently, where would I end up? And would this be a failure or a success?

Or you can do this through reinforcement learning, and it does it through a slightly different mechanism, a little bit more implicitly, but it fundamentally targets a very similar problem. We are exploring all of those approaches and trying to see how to really solve the counterfactual problem. I don’t think there is an answer yet, but we see a lot of progress with reinforcement learning, as we’ve just shown with π*0.6, and I think there is probably room for many other approaches, too.

Sonya Huang: Awesome. Can we talk about—once you guys get past that bootstrap phase, let’s talk about customer deployments a little bit. What do you bring to a customer? What do you sell them? And then how do you imagine that’s going to evolve over time? Are you selling them a fully vertically integrated robotic solution? Are you selling them a model that they have to figure out how to integrate into their operations? How does this all work?

Karol Hausman: The real answer is we don’t know yet.

Alfred Lin: [laughs]

Karol Hausman: We are still figuring that out. We are still quite early in the technology. As you can tell, we are just starting to even get to the threshold where we can start deploying these things. So we believe we should focus on the technology first to figure out how to get it to the point where it’s actually easy to deploy, and expand this aperture that we were talking about initially. And in robotics, the history of robotics startups very often gets to this point where you develop a technology for some period of time. You start with this grand vision of what it should be able to enable, how general purpose it will be, and as soon as you pick an application that you want to apply it to, you’re kind of stuck. You start cutting corners, you start figuring out very special purpose solutions just for this application, and very quickly you become, you know, an application company that just focuses on, let’s say, warehouse pick-and-place robots and that’s it.

And we really want to avoid that future. We think we have a chance to really solve physical intelligence. And the benefits of doing this will far outweigh any single application that we could focus on now. So we want to make sure that the technology is as general as possible, as easily deployable as possible, that this aperture is as wide as possible. And then we’ll start figuring out how to commercialize it. And as you said, there could be many different ways of doing this. There are probably ways that we can’t think of just yet, because they all depend on how the technology goes. Whether you can be a model provider, a fully vertical solution, or you sell robots or whatever else. But I think it’s a little too premature to answer this question. It would give you a lot of comfort just to pick one of those.

Sonya Huang: It’ll give Alfred a lot of comfort.

Karol Hausman: Alfred will be happy with us, but I think it’s just too early.

Alfred Lin: No, you guys have a grand, grand vision. So thank you for working on Physical Intelligence. π*0.6 is a wonderful, wonderful improvement, just a huge sort of breakthrough. And so congratulations on all the success you’ve had.

Tobi Springenberg: Thank you.

Alfred Lin: Can I follow up with a spicy question?

Tobi Springenberg: Sure.

Alfred Lin: So as you said, this vision is so grand, so broad, you’re doing all these different things. I’m sure you’ve studied all previous robotics efforts, and they’ve largely, as you said, applied it to an application, and they get narrower and narrower. And one of the most successful cases of a large application is self driving. And Waymo or Tesla have done enormously well, but if I had to go back in history, you know, I learned about self driving when Sebastian Thrun was on the stage of TED in, I think, 2009, 2010. And he talked about the thing where they won the DARPA challenge. That was 2007, and we’re in 2025 and the thing barely goes from San Francisco down here. They kind of can do it now, but they take local roads. They can’t even get on the freeway. If you do such a generalized job, how long is the runway or the timeline that you’re thinking about to build for generalization and performance?

Karol Hausman: Yeah. So there are some aspects of the problem that make it easier than self driving and some that make it harder.

Alfred Lin: Yeah.

Karol Hausman: One thing that makes it easier is that we don't need to deploy it only when it's a hundred percent reliable, right? There are many, many tasks out there where even if you're at 95 percent reliability, you're totally fine. If you have a robot in your home folding your laundry and every hundredth item, you know, doesn't get folded perfectly, you'll be totally fine.

Alfred Lin: You just call your child to go fold the laundry. We still need chores.

Karol Hausman: Exactly. And with self driving, that’s not the case, right? Like, if you fail every hundredth time, catastrophically, that’s a big problem.

Alfred Lin: Yeah.

Karol Hausman: So I think in terms of deploying this technology, it might be easier. We also benefit from the fact that this is a different era of technology. We are in the era of vision-language models, of foundation models that have some common sense, and we've learned a lot of lessons between, what was it, 2009 and 2025? And we can benefit from all of those. So I think that also really, really helps. And these are much more general purpose solutions than what we had in the past.

At the same time, there are some things that will be very challenging. This isn't just a single application. This is a very general purpose solution that can be applied to driving, but also to manipulation and locomotion and flying and all kinds of other things. And it remains to be seen how much harder this is. So far, based on what we've experienced, it doesn't seem to be that much harder, to be honest. It seems that if you tackle this with a very general purpose mindset from the get-go, it turns out that it can generalize fairly well.

And there is something about physical intelligence that we don't fully understand that allows these models to generalize between driving and making coffee and flying a drone and operating a surgical robot. Even though these seem so far apart from each other, and it seems like they should all be different models and different applications, these models somehow can make sense out of all of that data. And that gives me a lot of hope that maybe the problem is not that much harder, and it might actually be easier. So I think it's a fair question, but I also don't want to draw the wrong conclusions from what we've seen in self-driving.

Alfred Lin: That’s beautiful. Congratulations. What results have impressed you the most outside of results that you [inaudible]?

Karol Hausman: That’s a great question.

Tobi Springenberg: Yeah, it’s a good question, actually.

Karol Hausman: I can start. I've been really impressed by the video models that you mentioned earlier. I saw them, and worked on aspects of them, a few years ago, and I didn't expect the improvement trajectory to be so steep. They're basically indistinguishable from reality right now, and they can do incredible things. So that's been really, really impressive and really surprising to me.

Tobi Springenberg: Yeah. I would say I'm still in awe, to some extent, that we've gotten to this place where we get models that seem generally intelligent to a level that I really didn't foresee coming out of just next-token prediction. I'm still amazed by this, and by every little advance that I see, you know, winning IMO maths challenges or finding new things in science. There were so many moments this year where I thought, wow, there's still a lot of progress to be made, even though at the beginning of the year it felt like maybe this whole pre-training business with LLMs was petering out a bit. And then realizing that there's this whole second breath of fresh air coming in.

Karol Hausman: Yeah. I would maybe just add the fact that this whole thing works. It's kind of mind blowing.

Alfred Lin: [laughs]

Karol Hausman: I don't think we fully realize how ridiculous this is, right? Like, you build this loosely brain-inspired thing that has a very general purpose learning algorithm. You feed it data, and it somehow gets it, and gets it way better than anything we've ever had before. And this applies to robots, and it applies to vision and language and sound and all kinds of other things. And I think if you stop for a second and just think about how it works and that it works, it's just absolutely mind blowing. The fact that we can have a robot that you can put in a home and it kind of knows what to do in a home it's never been to before, or it can make coffee for 13 hours straight, or, you know, things like that. And this comes from this very general purpose thing that trains fully end to end, that we don't fully understand, but that seems to start to get it. That to me is just mind blowing.

Sonya Huang: We’re in a simulation. [laughs]

Alfred Lin: That's what Sonya believes, that we're living in a simulation. But it is interesting. In science, they teach you to take a big problem and break it up into smaller and smaller problems. And then basically somebody realized that's maybe not the best way to train machines or robots of any kind.

Tobi Springenberg: And to be honest, the whole machine learning and AI field made that same mistake, to some extent, right? For a long time, people were working on solving individual problems very deeply. And then over time there was this notion of, oh, if we could put it all together and do multitask learning really, really well, we'd do much better. But the fact that that all happened just because we switched to this general pre-training objective, and then it all just falls out, that's the surprising bit, right?

Alfred Lin: Do you think it’s like an accordion where we go from one framework to the other framework? We take big problems, break them up into smaller and smaller ones. That worked for a period of time, then it stopped working and then we’re like, “All right, let’s go back to the big problem and try to solve it more generally.” And go back and forth.

Tobi Springenberg: I don’t see us going back.

Karol Hausman: Yeah, I don't see us going back. There are a lot of approaches, or a lot of people saying, that you need the best of both worlds, that you need some way of incorporating the rules we already know, like Newtonian physics. You don't need to learn that. We already know how it works, so can you just put it into the weights somehow?

But from what we've seen so far, it doesn't work. If you try to do this, you kind of limit the ability to learn new things. And I don't think there's a best of both worlds. I think we just go all the way with learning.

And it's kind of interesting how similar this is to how we learn. You would think that if there were a way to pre-bake all of the intelligence, evolution would have figured it out. You would have just been born knowing everything there is to know. And we do see this with some other species, right? Deer, I think, are basically as smart as they'll ever be when they're born. They don't really learn much throughout their lifetime. But intelligent species like humans, and also, I think, crows, for instance, have these childhood and adolescence periods where they're not very smart to begin with, and they have to learn from their own experience. It doesn't come pre-baked; you kind of have to earn it on your own. And I think there is something to that. You need to just experience the world and learn from it, and I think that's the lesson we are learning in machine learning and AI as well: we think we know how we think, but we actually don't, and we just need to let the algorithm learn it from data.

Alfred Lin: Same thing with raising a child. I think I know how my son is thinking but I don’t. [laughs]

Karol Hausman: Yeah. I have a small daughter and yeah, it’s just so surprising.

Alfred Lin: They learn so fast.

Karol Hausman: They learn so fast and you don’t know where they get it from.

Alfred Lin: Hopefully from the parents.

Karol Hausman: Hopefully. She definitely knows some things that I didn’t teach her.

Alfred Lin: Thank you guys so much.

Sonya Huang: It's a really beautiful mission you're going after. Thank you for coming to share.

Karol Hausman: Thank you. Thanks for having us.
