Jim Fan on Nvidia’s Embodied AI Lab and Jensen Huang’s Prediction that All Robots will be Autonomous

AI researcher Jim Fan has had a charmed career, and he now leads NVIDIA's Embodied AI "GEAR" group. The lab's current work ranges from foundation models for humanoid robots to agents for virtual worlds. Jim describes a three-pronged data strategy for robotics, combining internet-scale data, simulation data and real-world robot data. He believes that in the next few years it will be possible to create a "foundation agent" that can generalize across skills, embodiments and realities, both physical and virtual.

Summary

Jim Fan, senior research scientist at NVIDIA, leads the company's Generalist Embodied Agent Research (GEAR) team, focusing on developing AI agents for both physical robotics and virtual environments. The team's flagship project, GR00T, aims to create foundation models for humanoid robots that are as general-purpose as today's LLMs. Jim envisions a future where intelligent robots are as ubiquitous as smartphones, and says we need to start building towards this vision today.

  • Diverse data integration is crucial: NVIDIA’s approach combines internet-scale data, simulation data and real-world robot data to create robust AI models. Each data type has unique strengths and limitations, necessitating a unified strategy.
  • Simulation is a key advantage: NVIDIA leverages its graphics expertise to create advanced simulations, accelerating synthetic data generation and helping bridge the sim-to-real gap. LLMs play a key role in writing reward functions on the fly during agent self-play.
  • Humanoid form factor has strategic benefits: The human-centric design of our world makes humanoid robots potentially more versatile and able to utilize existing human-oriented data for training. NVIDIA, and others like Tesla, have adopted a controversial, long-term strategy because of their deep belief in the eventual market dominance of humanoid robots.
  • Generalist approach over specialists: Jim believes in developing generalist models that can be fine-tuned for specific tasks, similar to the evolution seen in natural language processing with models like GPT-3. Once we have a generalist model for robotics, we can distill it into specialized generalists, which are stronger than the original specialists and easier to maintain.
  • Hardware and software integration: NVIDIA’s development of both chips (Jetson Orin) and AI models (GR00T) allows for optimized, full-stack solutions for robotics. Their ability to tap vast amounts of compute helps both on the model side and eventually on the chip design side.

Transcript

Jim Fan: So from the chip level, which is the Jetson Orin family, to the foundation model project route, and also to the simulation and the utilities that we built along the way, it will become a platform, a computing platform for humanoid robots, and then also for intelligent robots in general. So I want to quote Jensen here. One of my favorite quotes from him is that “Everything that moves will eventually be autonomous,” and I believe in that as well. It’s not right now, but let’s say 10 years or more from now. If we believe that there will be as many intelligent robots as iPhones, then we’d better start building that today.

Sonya Huang: Hi, and welcome to Training Data. We have with us today Jim Fan, senior research scientist at NVIDIA. Jim leads NVIDIA’s embodied AI agent research with a dual mandate spanning robotics in the physical world and gameplay agents in the virtual world. Jim’s group is responsible for Project GR00T, NVIDIA’s humanoid robots that you may have seen on stage with Jensen at this year’s GTC. We’re excited to ask Jim about all things robotics: why now, why humanoids, and what’s required to unlock a GPT-3 moment for robotics.

Sonya Huang: Welcome to Training Data.

Jim Fan: Thank you for having me.

Jim’s journey to embodied intelligence

Sonya Huang: We’re so excited to dig in today, and learn about everything you have to share with us around robotics and embodied AI. Before we get there, you have a fascinating personal story. I think you were the first intern at OpenAI. Maybe walk us through some of your personal story and how you got to where you are.

Jim Fan: Absolutely. I would love to share the stories with the audience. So back in the summer of 2016, some of my friends said there's a new startup in town and you should check it out. And I'm like, "Huh, I don't have anything else to do," because I had been accepted to a PhD program, and that summer I was idle. So I decided to join this startup, and that turned out to be OpenAI. And during my time at OpenAI, we were already talking about AGI back in 2016. And back then, my intern mentors were Andrej Karpathy and Ilya Sutskever, and we discussed a project together. It's called World of Bits.

So the idea is very simple: we want to build an AI agent that can read computer screens, read the pixels from the screens, and then control the keyboard and mouse. If you think about it, this interface is as general as it can get, right? Like, all the things that we do on a computer, like replying to emails or playing games or browsing the web, it can all be done in this interface, mapping pixels to keyboard mouse control. So that was actually my first kind of attempt at AGI at OpenAI, and also my first journey, the first chapter of my journey in AI agents.

Stephanie Zhan: I remember World of Bits, actually. I didn’t know that you were a part of that. That’s really interesting.

Jim Fan: Yeah. Yeah, it was a very fun project, and was part of a bigger initiative called OpenAI Universe.

Stephanie Zhan: Yeah.

Jim Fan: Which was like a bigger platform on, like, integrating all the applications and games into this framework.

Stephanie Zhan: What do you think were some of the unlocks then? And then also, what do you think were some of the challenges that you had with agents back then?

Jim Fan: Yes. So back then, the main method that we used was reinforcement learning. There was no LLM, no transformer back in 2016. And the thing is, reinforcement learning, it works on specific tasks, but it doesn’t generalize. Like, we can’t give the agent arbitrary language, and instruct it to do things, to do arbitrary things that we can do with a keyboard and mouse. So back then, it kind of worked on the tasks that we designed, but it doesn’t really generalize. So that started my next chapter, which is I went to Stanford, and I started my PhD with Professor Fei-Fei Li, and we started working on computer vision and also embodied AI.

And during my time at Stanford, which was from 2016 to 2021, I kind of witnessed the transition of the Stanford Vision Lab, led by Fei-Fei, from static computer vision, like recognizing images and videos, to more embodied computer vision, where an agent learns perception and takes actions in an interactive environment. And this environment can be virtual, as in simulation, or it can be the physical world. So that was my PhD, like, transitioning to embodied AI. And then after I graduated from PhD, I joined NVIDIA and have stayed there ever since. So I carried over my work from my PhD thesis to NVIDIA and still work on embodied AI to this day.

The GEAR Group

Sonya Huang: So you oversee the embodied AI initiative at NVIDIA. Maybe say a word on what that means and what you all are hoping to accomplish.

Jim Fan: Yes. So the team that I am co-leading right now is called GEAR, which is G-E-A-R. And that stands for Generalist Embodied Agent Research. To summarize what our team works on in three words: we generate actions, because we build embodied AI agents, and those agents take actions in different worlds. And if the actions are taken in the virtual world, that would be gaming AI and simulation. And if the actions are taken in the physical world, that would be robotics. Actually, earlier this year, at the March GTC, during Jensen's keynote, he unveiled something called Project GR00T, which is NVIDIA's moonshot effort at building foundation models for humanoid robotics. And that's basically what the GEAR team is focusing on right now. We want to build the AI brain for humanoid robots and even beyond.

Stephanie Zhan: What do you think is NVIDIA’s competitive advantage in building that?

Jim Fan: Yeah, that’s a great question. So well, one is for sure, like, compute resources. All of these foundation models require a lot of compute to scale up, and we do believe in scaling law. There are scaling laws for, like, LLMs, but the scaling laws for embodied AI and robotics are yet to be studied, so we’re working on that. And the second strength of NVIDIA is actually simulation. So NVIDIA, before it was an AI company, was a graphics company. So NVIDIA has many years of expertise in building simulation, like physics simulation and rendering, and also real-time acceleration on GPUs. So we are using simulation heavily in our approach to build robotics.

Stephanie Zhan: The simulation strategy is super interesting. Why do you think most of the industry is still very focused on real world data? The opposite strategy?

Jim Fan: Yeah, I think we need all kinds of data, and simulation and real-world data by themselves are not enough. So at GEAR, we divide this data strategy into roughly three buckets. One is the internet-scale data, like all the text and videos online. And the second is simulation data, where we use NVIDIA simulation tools to generate lots of synthetic data. And the third is the real robot data, where we collect the data by tele-operating the robot, and then just collecting and recording those data on the robot platforms. And I believe a successful robotics strategy will involve the effective use of all three kinds of data, mixing them, and delivering a unified solution.

Three kinds of data for robotics

Sonya Huang: Can you say more about—we were talking earlier about how data is fundamentally the key bottleneck in making a robotics foundation model actually work. Can you say more about kind of your conviction in that idea? And then what exactly does it take to make great data to break through this problem?

Jim Fan: Yes, I think the three different kinds of data that I just mentioned have different strengths and weaknesses. So for internet data, they’re the most diverse. They encode a lot of common sense priors, right? Like, for example, most of the videos online are human centered, because humans, we love to take selfies, we love to record each other doing all kinds of activities. And there are also a lot of instructional videos online. So we can use that to kind of learn how humans interact with objects and how objects behave under different situations. So that kind of provides a common sense prior for the robot foundation model.

But the internet scale data, they don’t come with actions. We cannot download the motor control signals of the robots from the internet. And that goes to the second part of the data strategy, which is using simulation. So in simulation, you can have all the actions, and you can also observe the consequences of the actions in that particular environment. And the strength of simulation is that it’s basically infinite data. And, you know, the data scales with compute—the more GPUs you put into the simulation pipeline, the more data that you will get.

And also the data is super real time. So if you collect data only on the real robot, then you are limited by 24 hours per day. But in simulation, like the GPU-accelerated simulators, we can actually accelerate real time by 10,000X. So we can collect the data at much higher throughput given the same amount of time. So that’s the strength. But the weakness is that for simulation, no matter how good the graphics pipeline is, there will always be this simulation-to-reality gap. Like, the physics will be different from the real world. The visuals will still be different. They will not look exactly as realistic as the real world. And also there is a diversity issue. Like, the contents in the simulation will not be as diverse as all the scenarios that we encounter in the real world. So these are the weaknesses.

And then going to the real robot data: those data don’t have the sim-to-real gap because they’re collected on the real robot. But it’s much more expensive to collect, because you need to hire people to operate the robots. And again, you’re limited by the speed of the world of atoms. You only have 24 hours per day, and you need humans to collect those data, which is also very expensive. So we see these three types of data as having complementary strengths. And I think a successful strategy is to combine their strengths and then to remove their weaknesses.
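
To make the three-bucket strategy concrete, here is a minimal sketch, in Python, of how the three data sources might be mixed into one training stream. The dataset names, the `make_mixed_sampler` helper, and the sampling weights are illustrative assumptions, not NVIDIA's actual pipeline.

```python
# A toy sketch of mixing internet-scale, simulation, and real-robot data
# into a single training stream with configurable sampling weights.
import random

def make_mixed_sampler(internet, simulation, real_robot, weights=(0.5, 0.4, 0.1)):
    """Yield training examples drawn from the three buckets in proportion to `weights`."""
    buckets = [internet, simulation, real_robot]
    while True:
        bucket = random.choices(buckets, weights=weights, k=1)[0]
        yield random.choice(bucket)

# Hypothetical placeholders: in practice these would be video clips,
# simulated trajectories, and teleoperated robot trajectories.
internet_data   = ["human_video_1", "human_video_2"]
sim_data        = ["sim_trajectory_1", "sim_trajectory_2"]
real_robot_data = ["teleop_trajectory_1"]

sampler = make_mixed_sampler(internet_data, sim_data, real_robot_data)
print([next(sampler) for _ in range(8)])
```

The mixing weights themselves then become a tuning knob: plentiful simulation data can dominate early training, while scarce, expensive real-robot data can be upweighted later for fine-tuning.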

A GPT-3 moment for robotics

Sonya Huang: So the cute GR00T robots that were on stage with Jensen, that was such a cool moment. If you had to help us dream in one, five, ten years, like, what do you think your group will have accomplished?

Jim Fan: Yeah, so this is pure speculation, but I hope that we can see a research breakthrough in robot foundation models maybe in the next two to three years. So that’s what we call a GPT-3 moment for robotics. And then after that, it’s a bit uncertain, because to have the robots enter the daily lives of people, there are a lot more things than just the technical side. The robots need to be affordable and mass produced, and we also need safety for the hardware, and also privacy and regulations. And those will take longer for the robots to be able to hit a mass market. So that’s a bit harder to predict. But I do hope that the research breakthrough will come in the next two to three years.

Stephanie Zhan: What do you think will define what a GPT-3 moment in AI robotics looks like?

Jim Fan: Yeah, that’s a great question. So I would like to think about robotics as consisting of two systems—system one and system two. So that comes from the book Thinking Fast and Slow, where system one means this low-level motor control that’s unconscious and fast. Like, for example, when I’m grasping this cup of water, I don’t really think about how I move the fingertip at every millisecond. So that would be system one. And then system two is slow and deliberate, and it’s more like reasoning and planning that actually uses the conscious brain power that we have. So I think the GPT-3 moment will be on the system one side.

And my favorite example is the verb 'open.' So just think about the complexity of the word 'open,' right? Like, opening the door is different from opening a window. It's also different from opening a bottle or opening a phone. But for humans, we have no trouble understanding that 'open' means different things, different motions, when you're interacting with different objects. But so far, we have not seen a robotics model that can generalize low-level motor control across these verbs. So I hope to see a model that can understand these verbs in their abstract sense, and can generalize to all kinds of scenarios that make sense to humans. And we haven't seen that yet, but I'm hopeful that this moment could come in the next two to three years.

Sonya Huang: What about system two thinking? Like, how do you think we get there? Do you think that some of the reasoning efforts in the LLM world will be relevant as well in the robotics world?

Jim Fan: Yeah, absolutely. I think for system two, we have already seen very strong models that can do reasoning and planning and also coding. So these are the LLMs and the frontier models that we have already seen these days. But to integrate the system two models with system one is another research challenge in itself. So the question is: for a robot foundation model, do we have a single monolithic model, or do we have some kind of cascaded approach where the system two and the system one models are separate and can communicate with each other in some ways? I think that’s an open question.

And again, they have pros and cons. For the first idea, the monolithic model, it’s cleaner—there’s just one model, one API to maintain, but also it’s a bit harder to kind of control because you have different control frequencies. Like, the system two models will operate on a slower control frequency, let’s say 1 Hz, like one decision per second, while the system one, like the motor control of me grasping this cup of water, that will likely be 1,000 Hz, where I need to make these minor, like these tiny muscle decisions at 1,000 times per second. It’s really hard to encode them both in a single model. So maybe a cascaded approach will be better, but again, how do we communicate between system one and two? Do they communicate through text or through some latent variables? It’s unclear, and I think it’s a very exciting new research direction.
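
As a rough illustration of the frequency mismatch in the cascaded approach Jim describes, here is a minimal sketch in which a slow "System 2" planner refreshes the plan at about 1 Hz while a fast "System 1" controller issues motor commands at about 1 kHz. The `system2_plan` and `system1_control` functions and the string-valued plan interface are hypothetical placeholders, not how GR00T is actually structured.

```python
# Illustrative cascade: a slow planner (~1 Hz) and a fast controller (~1 kHz)
# running in one loop, communicating through a shared "plan".
import time

def system2_plan(observation: str) -> str:
    """Slow, deliberate reasoning (in a real system this might be an LLM call)."""
    return f"plan_for({observation})"

def system1_control(plan: str, proprioception: float) -> float:
    """Fast, low-level motor control conditioned on the current plan."""
    return 0.01 * proprioception  # placeholder torque command

PLAN_PERIOD_S = 1.0       # System 2 updates roughly once per second
CONTROL_PERIOD_S = 0.001  # System 1 issues ~1,000 commands per second

plan = system2_plan("cup_on_table")
last_plan_time = time.monotonic()
for step in range(3000):                        # roughly three seconds of control
    now = time.monotonic()
    if now - last_plan_time >= PLAN_PERIOD_S:   # refresh the plan at ~1 Hz
        plan = system2_plan("cup_on_table")
        last_plan_time = now
    torque = system1_control(plan, proprioception=float(step % 10))
    # a real controller would send `torque` to the actuators here
    time.sleep(CONTROL_PERIOD_S)
```

Whether the plan is passed as text, as in this toy loop, or as latent variables is exactly the open question Jim raises.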

Sonya Huang: Is your instinct that we’ll get there in that breakthrough on system one thinking, like through scale and transformers? Like, is this going to work? Or is it cross your fingers and hope and see?

Jim Fan: I certainly hope that the data strategy I described will kind of get us there, because I feel that we have not pushed the limit of transformers yet. At the essential level, transformers take tokens in and output tokens, and ultimately, the quality of the tokens determines the quality of the model, the quality of those large transformers. And for robotics, as I mentioned, the data strategy is very complex. We have all the internet data, and also we need simulation data and the real robot data. And once we’re able to scale up the data pipeline with all those high-quality actions, then we can tokenize them and we can send them to a transformer to compress. So I feel we have not pushed transformers to the limit yet. And once we figure out the data strategy, we may be able to see some emergent properties as we scale up the data and scale up the model size. And that’s why I’m calling it the scaling law for embodied AI. And it’s just getting started.
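
As a toy example of what "tokenize them and send them to a transformer" could mean for action data, the sketch below quantizes continuous motor commands into discrete tokens. The 256-bin uniform quantizer and the [-1, 1] normalization range are assumptions made for illustration, not the scheme GR00T uses.

```python
# Quantize continuous actions into discrete tokens a transformer can model,
# and map tokens back to approximate actions at execution time.
import numpy as np

NUM_BINS = 256
ACTION_LOW, ACTION_HIGH = -1.0, 1.0   # assume actions are normalized to [-1, 1]

def tokenize_actions(actions: np.ndarray) -> np.ndarray:
    """Map each continuous action dimension to an integer token in [0, NUM_BINS)."""
    clipped = np.clip(actions, ACTION_LOW, ACTION_HIGH)
    scaled = (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)
    return np.minimum((scaled * NUM_BINS).astype(np.int64), NUM_BINS - 1)

def detokenize_actions(tokens: np.ndarray) -> np.ndarray:
    """Map tokens back to bin centers, recovering approximate actions."""
    return ACTION_LOW + (tokens + 0.5) / NUM_BINS * (ACTION_HIGH - ACTION_LOW)

trajectory = np.array([[0.12, -0.87, 0.40],    # toy 3-DoF action at t=0
                       [0.10, -0.85, 0.42]])   # toy 3-DoF action at t=1
tokens = tokenize_actions(trajectory)
print(tokens)                       # integer tokens, ready to interleave with other modalities
print(detokenize_actions(tokens))   # approximately the original actions
```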

Stephanie Zhan: I’m very optimistic that we will get there. I’m curious to hear, what are you most excited about personally? When we do get there, what’s the industry or application or use case that you’re really excited to see this completely transform the world of robotics today?

Choosing the humanoid robot form factor

Jim Fan: Yes. So there are actually a few reasons that we chose humanoid robots as kind of the main research thesis to tackle. One reason is that the world is built around the human embodiment, the human form factor. All our restaurants, factories, hospitals, and all our equipment and tools are designed for the human form and the human hands. So, in principle, sufficiently good humanoid hardware should be able to support any task that a reasonable human can do. And the humanoid hardware is not there yet today, but I feel in the next two to three years, the humanoid hardware ecosystem will mature and we will have affordable humanoid hardware to work on, and then it will be a problem about the AI brain, about how we drive that humanoid hardware.

And once we have that, once we’re able to have the GR00T foundation model that can take an instruction in language and then perform any tasks that a reasonable human can do, then we unlock a lot of economic value. Like, we can have robots in our households helping us with daily chores like laundry, dishwashing and cooking, or like elderly care. And we will also have them in restaurants, in hospitals, in factories, helping with all the tasks that humans do. And I hope that will come in the next decade, but again, as I mentioned in the beginning, this is not just a technical problem; there are many things beyond the technology. So I’m looking forward to that.

Sonya Huang: Any other reasons you’ve chosen to go after humanoid robots specifically?

Jim Fan: Yeah. So there are also a bit more practical reasons in terms of the training pipeline. So there are lots of data online about humans, right? It’s all human centered. All the videos are humans doing daily tasks or having fun. And the humanoid robot form factor is closest to the human form factor, which means that the model we train using all of those data will have an easier time transferring to the humanoid form factor rather than to other form factors. So let’s say for robot arms, right? Like, how many videos do we see online about robot arms and grippers? Very few. But there are many videos of people using their five-fingered hands to work with objects. So it might be easier to train for humanoid robots. And then once we have that, we’ll be able to specialize them to the robot arms and more kind of specific robot forms. So that’s why we’re aiming for the full generality first.

Stephanie Zhan: I didn’t realize—so are you exclusively training on humanoids today, versus robot arms and robot dogs as well?

Jim Fan: Yeah. So for Project GR00T?

Stephanie Zhan: Yeah.

Jim Fan: Yes. So for Project GR00T, we are aiming more towards humanoid right now, but the pipeline that we’re building, including the simulation tools, the real robot tools, those are general purpose enough that we can also adapt to other platforms in the future. So yeah, we’re building these tools to be generally applicable.

Sonya Huang: Yeah, you’ve used the term ‘general’ quite a few times now. I think there are some folks, especially from the robotics world, who think that a general approach won’t work, and you have to be domain environment specific. Why have you chosen to go after a generalist approach? And the Richard Sutton bitter lesson stuff has been a recurring theme on our podcast. I’m curious if you think it holds in robotics as well.

Specialized generalists

Jim Fan: Absolutely. So I would like to first talk about the success story in NLP that we have all seen. So before the ChatGPT and the GPT-3, in the world of NLP, there were a lot of different models and pipelines for different applications, like translation and coding and doing math and doing creative writing. Like, they all use very different models and completely different training pipelines.

But then ChatGPT came and unified everything into a single model. So before ChatGPT, we called those specialists, and then the GPT-3s and ChatGPTs, we call them the generalists. And once we have the generalists, we can prompt them, distill them, and fine-tune them back to the specialized tasks. And we call those the specialized generalists. And according to the historical trend, it’s almost always the case that the specialized generalists are just far stronger than the original specialists, and they’re also much easier to maintain because you have a single API that takes text in and then spits text out.

So I think we can follow the same success story from the world of NLP, and it will be the same for robotics. So right now, in 2024, most of the robotics applications we have seen are still in the specialist stage, right? They have specific robot hardware for specific tasks, collecting specific data using specific pipelines. But Project GR00T aims to build this general-purpose foundation model that works on humanoids first, but later will generalize to all kinds of different robot forms or embodiments. And that will be the generalist moment that we are pursuing. And then once we have that generalist, we’ll be able to prompt it, fine-tune it, distill it down to specific robotics tasks. And those are the specialized generalists. But that will only happen after we have the generalist. So it will be easier in the short run to pursue the specialist, because it’s easier to show results when you focus on a very narrow set of tasks. But we at NVIDIA believe that the future belongs to generalists, even though it will take longer to develop and there are more difficult research problems to solve. That’s what we’re aiming for first.

GR00T gets its own chip

Stephanie Zhan: The interesting thing about NVIDIA building GR00T, to me, is also what you mentioned earlier, which is that NVIDIA owns both the chip and the model itself. What do you think are some of the interesting things that NVIDIA could do to optimize GR00T on its own chip?

Jim Fan: Yes. So at the March GTC, Jensen also unveiled the next generation of the edge computing chips. It’s called the Jetson Orin chip, and it was actually co-announced with Project GR00T. So the idea is we will have kind of the full stack as a unified solution to the customers. So from the chip level, which is the Jetson Orin family, to the foundation model, Project GR00T, and also to the simulation and the utilities that we built along the way, it will become a platform, a computing platform for humanoid robots, and then also for intelligent robots in general.

So I want to quote Jensen here. One of my favorite quotes from him is that “Everything that moves will eventually be autonomous,” and I believe in that as well. It’s not right now, but let’s say 10 years or more from now. If we believe that there will be as many intelligent robots as iPhones, then we’d better start building that today.

Sonya Huang: That’s awesome. Are there any particular results from your research so far that you want to highlight? Anything that kind of gives you optimism or conviction in the approach that you’re taking?

Eureka and Isaac Sim

Jim Fan: Yes. We can talk about some prior works that we have done. So one work that I was really happy about was called Eureka. And for this work, we did a demo where we trained a five finger robot hand to do pen spinning.

Sonya Huang: Very useful.

Jim Fan: [laughs] And it’s superhuman with respect to myself, because I gave up pen spinning back in childhood.

Sonya Huang: You’re not going to do a live demo?

Jim Fan: I will fail miserably at this live demo. So yeah, I’m not able to do this, but the robot hand is able to. And the idea that we use to train this is that we prompt an LLM to write code in the simulator API that NVIDIA has built. So it’s called the Isaac Sim API. And the LLM outputs the code for a reward function. So a reward function is basically a specification of the desirable behavior that we want the robot to do. The robot will be rewarded if it’s on the right track or penalized if it’s doing something wrong. So that’s a reward function. And traditionally the reward function is engineered by a human expert, typically a roboticist who really knows the API. It takes a lot of specialized knowledge, and reward function engineering is by itself a very tedious and manual task.

So what Eureka did was we designed this algorithm that uses an LLM to automate this reward function design, so that the reward function can instruct a robot to do very complex things like pen spinning. So it is a general-purpose technique that we developed, and we do plan to scale this up beyond just pen spinning. It should be able to design reward functions for all kinds of tasks, or it can even generate new tasks using the NVIDIA simulation API. So that gives us a lot of space to grow.
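
The loop Jim describes can be sketched roughly as follows: an LLM proposes reward-function code, the code is scored by training a policy against it in simulation, and the score is fed back as context for the next proposal. The `query_llm` and `train_policy_with_reward` functions and the reward signature below are hypothetical stand-ins, not the real Eureka or Isaac Sim APIs.

```python
# Sketch of an LLM-in-the-loop reward designer: propose reward code,
# evaluate it by training in simulation, feed the result back, repeat.

def query_llm(prompt: str) -> str:
    """Placeholder for a call to a code-writing LLM."""
    return (
        "def reward(state):\n"
        "    # reward an upright pen, penalize dropping it\n"
        "    return state['pen_upright'] - 10.0 * state['pen_dropped']\n"
    )

def train_policy_with_reward(reward_fn) -> float:
    """Placeholder: train an RL policy against `reward_fn` and return a task score."""
    return reward_fn({"pen_upright": 1.0, "pen_dropped": 0.0})  # toy evaluation

best_score, feedback = float("-inf"), "no feedback yet"
for iteration in range(3):
    prompt = f"Write a Python reward function for pen spinning. Feedback: {feedback}"
    source = query_llm(prompt)
    namespace = {}
    exec(source, namespace)                      # turn the generated source into a callable
    score = train_policy_with_reward(namespace["reward"])
    feedback = f"iteration {iteration} scored {score:.2f}"
    best_score = max(best_score, score)
print("best reward design score:", best_score)
```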

Why now for robotics?

Sonya Huang: Why do you think—I mean, I remember five years ago, there were people that were—research labs working on solving Rubik’s cubes with the robot hand and things like that. And it felt like robotics kind of went through maybe a trough of disillusionment. And in the last year or so, it feels like the space has really heated up again. Do you think there is a ’why now’ around robotics this time around? And what’s different? And we’re reading that OpenAI is getting back into robotics. Everybody is now spinning up their efforts. What do you think is different now?

Jim Fan: Yeah, I think there are quite a few key factors that are different now. One is the robot hardware. Actually, since the end of last year, we have seen a surge of new robot hardware in the ecosystem. There are companies like Tesla working on Optimus, Boston Dynamics, and so on, and a lot of startups as well. So we are seeing better and better hardware. So that’s number one. And that hardware is becoming more and more capable, with better dexterous hands and better whole-body reliability.

And the second factor is the pricing. So we also see a significant drop in the price and the manufacturing cost for the humanoid robots. So back in 2001, NASA had a humanoid developed, called Robonaut. If I recall correctly, it cost north of $1.5 million per robot. And then most recently, there are companies that are able to put a price tag of about $30,000 on a full-fledged humanoid, and that’s roughly comparable to the price of a car. And also, there’s always this trend in manufacturing where the price of a mature product tends towards its raw material cost. And a humanoid typically takes only four percent of the raw material of a car. So it’s possible that we can see the cost trending downwards even more, and there could be an exponential decrease in the price in the next couple of years. And that makes this state-of-the-art hardware more and more affordable. That’s the second factor of why I think humanoids are gathering momentum.

And the third one is on the foundation model side, right? We are able to see the system two problem, the reasoning and planning part, being addressed very well by the frontier models like the GPTs and the Claudes and the Llamas of the world. And these LLMs are able to generalize to new scenarios. They’re able to write code. And actually, the Eureka project I just mentioned leverages these coding abilities of the LLMs to help develop new robot solutions. And there is also a surge in multimodal models, improving the computer vision, the perception side of it. So I think these successes also encourage us to pursue robot foundation models, because we think we can ride on the generalizability of these frontier models and then add actions on top of them, so we can generate action tokens that will ultimately drive these humanoid robots.

Stephanie Zhan: I completely agree with all that. I also think so much of what we’ve been trying to tackle to date in the field has been how to unlock the scale of data that you need to build this model. And all the research advancements that we’ve made, many of which you’ve contributed to yourself around sim-to-real and other things, and the tools that NVIDIA has built with Isaac Sim and others, have really accelerated the field, alongside teleoperation and cheaper teleoperation devices and things like that. And so I think it’s a really, really exciting time to be building here.

Jim Fan: Yeah, I agree. Yeah.

Exploring virtual worlds

Sonya Huang: I’d love to transition to talking about virtual worlds, if that’s okay with you.

Jim Fan: Yeah, absolutely. Yeah.

Sonya Huang: So I think you started your research more in the virtual world arena. Maybe say a word on what got you interested in Minecraft versus robotics. Is it all kind of related in your world? What got you interested in virtual worlds?

Jim Fan: Yeah, that’s a great question. So, for me, my personal mission is to solve embodied AI. And for AI agents embodied in the virtual world, that would be things like gaming and simulation. And that’s why I also have a very soft spot for gaming. I also enjoy gaming myself.

Stephanie Zhan: [laughs] What did you play?

Jim Fan: Yeah, so I play Minecraft. At least I try to. I’m not a very good gamer, and that’s why I also want my AI to avenge my poor skills.

Stephanie Zhan: Yeah, yeah.

Jim Fan: So I worked on a few gaming projects before. The first one was called MineDojo, where we developed a platform for general-purpose agents in the game of Minecraft. And for those audiences who are not familiar, Minecraft is this 3D voxel world where you can do whatever you want: you can craft all kinds of recipes, different tools, and you can also go on adventures. It’s an open-ended game with no particular score to maximize and no fixed storylines to follow.

So we collected a lot of data from the internet. There are videos of people playing Minecraft. There are also wiki pages that explain every concept and every mechanism in the game, which are multimodal documents, and also forums like Reddit. The Minecraft subreddit has a lot of people talking about the game in natural language. So we collected these multimodal datasets and were able to train models to play Minecraft. So that was the first work, MineDojo, and later the second work was called Voyager. So we had the idea of Voyager after GPT-4 came along, because at that time it was the best coding model out there. So we thought about, hey, what if we use coding as action?

And building on that insight, we were able to develop the Voyager agent, where it writes code to interact with the Minecraft world. So we use an API to first convert the 3D Minecraft world into a text representation, and then have the agent write code using the action APIs. But just like human developers, the agent is not always able to write code correctly on the first try, so we kind of give it a self-reflection loop where it tries out something, and if it runs into an error or if it makes some mistakes in the Minecraft world, it gets the feedback and it can correct its program. And once it has written the correct program, that’s what we call a skill. We’ll have it saved to a skill library, so that in the future, if the agent faces a similar situation, it doesn’t have to go through that trial-and-error loop again. It can retrieve the skill from the skill library. So you can think of that skill library as a codebase that the LLM interactively authored all by itself, right? There’s no human intervention. The whole codebase is developed by Voyager. So that’s the second mechanism, the skill library.

And the third one is what we call an automated curriculum. So basically, the agent knows what it knows and what it doesn’t know, so it’s able to propose the next task that’s neither too difficult nor too easy for it to solve, and then it’s able to just follow that path and discover all kinds of different skills, different tools, and also travel along in the vast world of Minecraft. And because it travels so much, that’s why we call it Voyager. So yeah, that was one of our team’s earliest attempts at building AI agents in the embodied world using foundation models.
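
A highly simplified sketch of the Voyager mechanisms described here, code as action, a self-reflection retry loop, and a skill library of verified programs, might look like the following. `llm_write_program` and `run_in_minecraft` are hypothetical placeholders for the real LLM and Minecraft APIs.

```python
# Toy Voyager-style loop: write a program as the action, retry on errors with
# feedback (self-reflection), and cache working programs in a skill library.
from __future__ import annotations

skill_library: dict[str, str] = {}   # task name -> verified program source

def llm_write_program(task: str, error: str | None) -> str:
    """Placeholder LLM call: returns program source, improved when given an error."""
    body = "result = 'ok'" if error else "result = undefined_name"  # first attempt fails on purpose
    return f"def act():\n    {body}\n    return result\n"

def run_in_minecraft(source: str) -> tuple[bool, str | None]:
    """Placeholder environment: execute the program and report success or the error."""
    try:
        namespace = {}
        exec(source, namespace)
        namespace["act"]()
        return True, None
    except Exception as exc:
        return False, repr(exc)

def solve(task: str, max_attempts: int = 3) -> bool:
    if task in skill_library:                    # reuse a previously verified skill
        return True
    error = None
    for _ in range(max_attempts):                # self-reflection loop
        source = llm_write_program(task, error)
        ok, error = run_in_minecraft(source)
        if ok:
            skill_library[task] = source         # save the verified skill for later retrieval
            return True
    return False

print(solve("craft_wooden_pickaxe"), sorted(skill_library))
```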

Sonya Huang: Talk about the curriculum thing more. I think that’s really interesting because it feels like it’s one of the more unsolved problems in kind of the reasoning in LLM world generally. Like, how do you make these models self aware so that they know kind of how to take that next step to improve? Maybe say a little bit more about what you built on the curriculum and the reasoning side.

Jim Fan: Absolutely. I think a very interesting emergent property of those frontier models is that they can reflect on their own actions, and they kind of know what they know and what they don’t know, and they’re able to propose tasks accordingly. So for the automated curriculum in Voyager, we gave the agent a high-level directive, that is, to find as many novel items as possible. And that’s just the one sentence of goal that we gave. We didn’t give any instruction on which objects to discover first or which tools to unlock first. We didn’t specify. And the agent was able to discover that all by itself using this kind of coding and prompting and the skill library. So it’s kind of amazing that the whole system just works. I would say it’s an emergent property once you have a very strong reasoning engine that can generalize.

Sonya Huang: Why do you think so much of this kind of agent research has been done in the virtual world? And I’m sure it’s not entirely because a lot of deep learning researchers like playing video games, although I’m sure it doesn’t hurt either. But I guess, what are the connections between solving stuff in the virtual world and in the physical world, and how do the two interplay?

Jim Fan: Yeah, so as different as gaming and robotics seem to be, I just see a lot of similar principles shared across these two domains. For the embodied agents, they take as input the perception, which can be a video stream along with some sensory input, and then they output actions. And in the case of gaming, it will be like keyboard and mouse actions. And for robotics, it would be low-level motor controls. So ultimately the API looks like this. And these agents, they need to explore in the world, they have to collect their own data in some ways. So that’s what we call reinforcement learning and also self-exploration. And that part, that principle is again shared among the physical agents and the virtual agents.

But the difference is robotics is harder because you also have a simulation-to-reality gap to bridge, because in simulation, the physics and the rendering will never be perfect, so it’s really hard to kind of transfer what you’re learning in simulation to the real world. And that is by itself an open-ended research problem. So for robotics, it’s got a sim-to-real issue, but for gaming it doesn’t. You are training and testing in the same environment. So I would say that will be the difference between them.

And last year I proposed a concept called ’foundation agent,’ where I believe ultimately we’ll have one model that can work on both, you know, virtual agents and also physical agents. So for the foundation agent, there are three axes over which it will generalize. Number one is the skills that it can do. Number two is the embodiments or like the body form, the form factor it can control. And number three is the world, the realities it can master. So in the future, I think a single model will be able to do a lot of different skills on a lot of different robot forms or agent forms, and then generalize across many different worlds, virtual or real. And that’s the ultimate vision that the GEAR team wants to pursue, the foundation agent.

Implications for games

Stephanie Zhan: Pulling on the thread of virtual worlds and gaming in particular, and what you’ve unlocked already with some reasoning, some emergent behavior, especially working in an open-ended environment, what are some of your own personal dreams for what is now possible in the world of games? Where would you like to see AI agents innovate in the world of games today?

Jim Fan: Yes. So I’m very excited by two aspects. One is intelligent agents inside the games. So the NPCs that we have these days have fixed scripts to follow, and they’re all manually authored. What if we have NPCs, the non-player characters, that are actually alive, and you can interact with them, they can remember what you told them before, and they can also take actions in the gaming world that will change the narrative and change the story for you? This is something that we haven’t seen yet, but I feel there’s a huge potential there, so that when everyone plays the game, everybody will have a different experience. And even for one person, if you play the game twice, you don’t have the same story. So each game will have infinite replay value.

So that’s one aspect. And the second aspect is that the game itself can be generated. And we already see many different tools kind of doing subsets of this grand vision I just mentioned, right? There are text-to-3D models generating assets. There are also text-to-video models. And of course there are language agents that can generate storylines. What if we put all of them together so that the game world is generated on the fly as you are playing and interacting with it? That would be just truly amazing and a truly open-ended experience.

Stephanie Zhan: Super interesting. For the agent vision in particular, do you think you need GPT-4-level capabilities, or do you think you can get there with Llama-8B, for example, alone?

Jim Fan: Yeah, I think the agent needs the following capabilities: one is, of course, it needs to hold an interesting conversation, it needs to have a consistent personality, and it needs to have long-term memory and also take actions in the world. So for these aspects, I think currently the Llama models are pretty good for that, but also not good enough to produce very diverse behaviors and really engaging behaviors. So I do think there’s still a gap to reach. And the other thing is about inference cost. So if we want to deploy these agents to the gamers, then either it’s very low cost, hosted on the cloud, or it runs locally on the device. Otherwise it’s kind of unscalable in terms of cost. So that’s another factor to be optimized.

Is the virtual world in the service of the physical world?

Sonya Huang: Do you think all this work in the virtual world space, is it in service of, like, you’re learning things from it, that way you can accomplish things in the physical world? Does the virtual world stuff exist in service of the physical world ambitions? Or I guess said differently, like, is it enough of a prize in its own right? And how do you think about prioritizing your work between the physical and virtual worlds?

Jim Fan: Yes. So I just think the virtual world and the physical world ultimately will just be different realities on a single axis. So let me give one example: so there is a technique called domain randomization, and how it works is that you train a robot in simulation, but you train it in 10,000 different simulations in parallel. And for each simulation, they have slightly different physical parameters. Like the gravity is different, the friction, the weight, everything is a bit different, right? So it’s actually 10,000 different worlds.

And let’s assume if we have an agent that can master all the 10,000 different configurations of reality all at once, then our real physical world is just the 10,001st virtual simulation. And in this way, we’re able to generalize from sim to real directly. So that’s actually exactly what we did in a follow up work to Eureka, where we’re able to train agents using all kinds of different randomizations in the simulation, and then transfer zero shot to the real world without further fine tuning.
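
A minimal sketch of the domain randomization idea: one policy is trained across many simulated worlds whose physics parameters are each drawn from a range, so that, ideally, the real world looks like just one more sample from the training distribution. The parameter ranges and the `train_step` placeholder are illustrative, not DrEureka's actual randomization settings.

```python
# Train one policy across many randomized simulated worlds so that the real
# world falls inside (or near) the distribution the policy has already seen.
import random

def sample_world(rng: random.Random) -> dict:
    """Draw one simulated world's physics parameters."""
    return {
        "gravity":  rng.uniform(9.0, 10.6),   # m/s^2
        "friction": rng.uniform(0.2, 1.2),
        "mass_kg":  rng.uniform(0.8, 1.5),
    }

def train_step(policy: dict, world: dict) -> None:
    """Placeholder for one simulated rollout plus a policy update inside `world`."""
    policy["updates"] = policy.get("updates", 0) + 1

rng = random.Random(0)
worlds = [sample_world(rng) for _ in range(10_000)]   # 10,000 parallel "realities"
policy: dict = {}
for world in worlds:
    train_step(policy, world)
print("trained across", policy["updates"], "randomized worlds")
```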

Stephanie Zhan: And that’s DrEureka.

Jim Fan: That’s DrEureka work. And I do believe that if we have all kinds of different virtual worlds, including from games, and if we have a single agent that can master all kinds of skills in all the worlds, then the real world just becomes part of this bigger distribution.

Stephanie Zhan: Do you want to share a little bit about DrEureka to ground the audience in that example?

Jim Fan: Oh, yeah. Absolutely. So for the DrEureka work, we built upon Eureka and still use LLMs as kind of a robot developer. So the LLM is writing code, and the code specifies the simulation parameters, like the domain randomization parameters. And after a while, after a few iterations, the policy that we train in simulation is able to generalize to the real world. So one specific demo that we showed is that we can have a robot dog walk on a yoga ball, and it’s able to stay balanced and even walk forward. And one very funny comment that I saw was that someone actually asked his real dog to do this task, and his dog wasn’t able to do it. So in some sense, our neural network has super-dog performance.

Stephanie Zhan: [laughs] I’m pretty sure my dog would not be able to do it.

Sonya Huang: [laughs] Call it ADI.

Jim Fan: Yeah, artificial dog intelligence. Yeah, that’s the next benchmark.

Alternative architectures to transformers

Sonya Huang: In the virtual world sphere, I think there’s been a lot of just incredible models that have come out on both the 3D and the video side recently, all of them kind of transformer based. Do you think we’re there in terms of, like, okay, this is the architecture that’s going to take us to the promised land and let’s scale it up, or do you think there’s fundamental breakthroughs that are still required on the model side there?

Jim Fan: Yes, I think for robot foundation models, we haven’t pushed the limit of the architecture yet. So the data is a harder problem right now, and it’s the bottleneck, because as I mentioned earlier, we can’t download those action data from the internet. They don’t come with those motor control data. We have to collect it either in simulation or on the real robots. And once we have that, we have a very mature data pipeline. Then we’ll just push the tokens to the transformers and have them compress those tokens, just like transformers predicting the next word on Wikipedia. And we’re still testing these hypotheses, but I don’t think we have pushed the transformers to their limit yet.

There’s also a lot of research going on right now on alternative architectures to transformers. I’m super interested in those personally. Like, there are Mamba and, recently, test-time training. There are a few alternatives, and some of them have very promising ideas. They haven’t really scaled to frontier-model performance yet, but I’m looking forward to seeing alternatives to transformers.

Stephanie Zhan: Have any of them caught your eye in particular and why?

Jim Fan: Yeah, I think I mentioned the Mamba work and also test-time training. Like, these models are more efficient at inference time, so instead of transformers attending to all the past tokens, these models have inherently more efficient mechanisms. And so I see them holding a lot of promise. But we need to scale them up to the size of the frontier models and really see how they compare head-to-head with the transformer.

Lightning round

Sonya Huang: Awesome. Should we close out with some rapid-fire questions?

Stephanie Zhan: Yeah.

Jim Fan: Oh, yeah.

Sonya Huang: Okay, let’s see. Number one, what outside the embodied AI world are you most interested in within AI?

Jim Fan: Yeah, so I’m super excited about video generation, because I see video generation as a kind of world simulator. So we learn physics and rendering from data alone. We have seen OpenAI’s Sora, and later there were a lot of new models catching up to Sora. So this is an ongoing research topic.

Sonya Huang: What does the world simulator get you?

Jim Fan: I think it’s going to get us a data-driven simulation in which we can train embodied AI. That would be amazing.

Sonya Huang: Nice.

Stephanie Zhan: What are you most excited about in AI on a longer-term horizon, 10 years or more?

Jim Fan: Yeah. So on a few fronts, like one is for the reasoning side, I’m super excited about models that code. I think coding is such a fundamental reasoning task that also has huge economic value. I think maybe 10 years from now, we’ll have coding agents that are as good as human-level software engineers, and then we’ll be able to accelerate a lot of development using the LLMs themselves. And the second aspect is, of course, robotics. I think 10 years from now, we’ll have humanoid robots that are at the reliability and agility of humans, or even beyond. And I hope at that time, Project GR00T will be a success, and that we’re able to have humanoids helping us in our daily lives. I just want robots to do my laundry. That’s always been my dream.

Sonya Huang: What year are robots going to do our laundry?

Jim Fan: As soon as possible. I can’t wait.

Sonya Huang: Who do you admire most in the field of AI? And you’ve had the opportunity to work with some greats dating back to your internship days, but who do you admire most these days?

Jim Fan: I have too many heroes in AI to count. So I admire my PhD advisor, Fei-Fei. I think she taught me how to develop good research taste. So sometimes it’s not about how to solve a problem, but about identifying what problems are worth solving. And actually, a ‘what’ problem is much harder than a ‘how’ problem. And during my PhD years with Fei-Fei, I transitioned to embodied AI. And in retrospect, this was the right direction to work on. I believe the future of AI agents will be embodied, for robotics or for the virtual world. I also admire Andrej Karpathy. He’s a great educator. I think he writes code like poetry, so I look up to him. And then I admire Jensen a lot. I think Jensen cares a lot about AI research, and he also knows a lot about even the technical details of the models, and I’m super impressed. So I look up to him a lot.

Stephanie Zhan: Pulling on the thread of having great research taste, what advice do you have for founders building in AI in terms of finding the right problems to solve?

Jim Fan: Yeah, I think: read recent research papers. I feel that the research papers these days are becoming more and more accessible, and they have some really good ideas, and they’re more and more practical instead of just theoretical machine learning. So I would recommend keeping up with the latest literature and also just trying out all the open-source tools that people have built. For example, at NVIDIA, we built simulator tools that everyone can access: just download them and try them out. And you can train your own robots in the simulations. Just get your hands dirty.

Stephanie Zhan: And maybe pulling on the thread of Jensen as an icon, what is some practical, tactical advice you’d give to founders building in AI? What could they learn from him?

Jim Fan: Yeah, I think identifying the right problem to work on, right? So NVIDIA bets on humanoid robotics, and on embodied AI more broadly, because we believe this is the future: if we believe that, let’s say, 10 years from now, there will be as many intelligent robots in the world as iPhones, then we’d better start working on that today. So yeah, just long-term future visions.

Sonya Huang: I think that’s a great note to end on. Jim, thank you so much for joining us. We love learning about everything your group is doing, and we can’t wait for the future of laundry-folding robots.

Jim Fan: Awesome. Yeah. Thank you so much for having me. Yeah.

Stephanie Zhan: Thank you.

Sonya Huang: Thank you.

Jim Fan: Thanks.

Mentioned in this episode:

  • World of Bits: Early OpenAI project Jim worked on as an intern with Andrej Karpathy. Part of a bigger initiative called Universe
  • Fei-Fei Li: Jim’s PhD advisor at Stanford who founded the ImageNet project in 2010 that revolutionized the field of visual recognition, led the Stanford Vision Lab and just launched her own AI startup, World Labs
  • Project GR00T: NVIDIA’s “moonshot effort” at a robotic foundation model, premiered at this year’s GTC
  • Thinking, Fast and Slow: Influential book by Daniel Kahneman that popularized some of his teachings from behavioral economics
  • Jetson Orin chip: The dedicated series of edge computing chips NVIDIA is developing to power Project GR00T
  • Eureka: Project by Jim’s team that trained a five finger robot hand to do pen spinning
  • MineDojo: A project Jim did when he first got to NVIDIA that developed a platform for general purpose agents in the game of Minecraft. Won the NeurIPS 2022 Outstanding Paper Award
  • DrEureka: Language Model Guided Sim-To-Real Transfer, a robot developer based on the Eureka work
  • ADI: artificial dog intelligence
  • Mamba: Selective State Space Models, an alternative architecture to Transformers that Jim is interested in
  • Test-Time Training: An alternative architecture to Transformers that Jim mentioned, from the paper “Learning to (Learn at Test Time): RNNs with Expressive Hidden States”