OpenAI Sora 2 Team: How Generative Video Will Unlock Creativity and World Models
The OpenAI Sora 2 team (Bill Peebles, Thomas Dimson, Rohan Sahai) discuss how they compressed filmmaking from months to days, enabling anyone to create compelling video. They explain how spacetime tokens enable object permanence and physics understanding in generative video, and why Sora 2 represents a GPT-3.5 moment for the medium. The conversation goes beyond video into the team’s vision for world simulators that could one day run scientific experiments, their perspective on co-evolving society alongside the technology, and how digital simulations in alternate realities may become the future of knowledge work.
Listen Now
Summary
The team believes video will lead to simulated worlds and scientific discoveries:
Why diffusion transformers enable superior video quality: Unlike autoregressive models that generate frame-by-frame, diffusion transformers generate entire videos simultaneously by removing noise iteratively. This approach solves quality degradation over time and allows space-time patches to communicate globally, enabling properties like object permanence and physics understanding to emerge naturally at scale.
How cameos transformed Sora from a tool into a social platform: The team initially didn’t expect the cameo feature—which lets users insert themselves into AI-generated videos—to become the killer feature. When they tested it internally, the feed became entirely cameos within days, revealing that the human element was essential for making AI video generation feel personal and social rather than just static, beautiful scenes.
Optimizing for creation over consumption prevents algorithmic decay: Drawing from his Instagram experience, Thomas explains that Sora’s feed is intentionally designed to inspire creation rather than mindless scrolling. With nearly 100% of users creating on day one and 70% creating when they return, the team actively builds features to push users out of passive consumption mode—a deliberate departure from traditional social media economics.
Video data provides richer world models than text: While video contains lower intelligence per bit than text, the total available data is vastly larger and Bill believes the data won’t be exhausted anytime soon. The models are developing internal world simulators that respect physics.
Iterative deployment prepares society for alternate realities: OpenAI views Sora 2 as necessary preparation for a future where simulations of people interact autonomously in digital environments. By releasing now rather than waiting for a more powerful model, they’re co-evolving society with the technology and establishing norms before copies of ourselves are “running around in Sora, doing tasks and reporting back.”
Transcript
Chapters
Introduction
Bill Peebles: For OpenAI across the board, it’s really important that we kind of like iteratively deploy technology in a way where we’re not just, like, dropping bombshells on the world when there’s some big research breakthrough. We want to co-evolve society with the technology. And so that’s why we really thought it was important to do this now, and do it in a way where, you know, we’ve hit this—again, this kind of GPT-3.5 moment for video. Let’s make sure the world is kind of aware of what’s possible now, and also start to get society comfortable in figuring out the rules of the road for this kind of longer-term vision for where there are just copies of yourself running around in Sora, in the ether, like, just doing tasks and reporting back in the physical world, because that is where we are headed long term.
Konstantine Buhler: Today on Training Data, we sit down with the team behind OpenAI’s Sora—Bill Peebles, Thomas Dimson, and Rohan Sahai. You’ll hear about spacetime tokens, how video models build internal world simulators, and how optimizing for creation instead of consumption is just better for social platforms. This conversation goes way beyond video generation and into questions about how society will co-evolve with powerful simulation technologies.
We promise that this was an actual real-world conversation and not a video generation—but we don’t know how to prove that to you. Let’s jump in.
Main conversation
Konstantine Buhler: Hey guys, thank you for being here at Sequoia. Congratulations on Sora.
Thomas Dimson: Thank you.
Konstantine Buhler: Maybe you could tell us a little bit about yourselves, and how you got to OpenAI and Sora.
Bill Peebles: Yeah. I’m Bill, I’m the head of the Sora team at OpenAI. I had a pretty traditional path, came through undergrad doing research on video generation, then continued that work at Berkeley, and then started at OpenAI working on Sora from the first day I joined.
Thomas Dimson: And I’m Thomas, I work as an engineering lead inside of Sora. Have a bit of a longer story, but I worked at Instagram for about seven years doing some of the early kind of machine-learning systems and recommender systems there, but it was a very tiny company, it was about 40 people. Then I quit, did my own startup for a while which was Minecraft in the browser, which we’ve talked about a couple times. And I think that OpenAI noticed that we had a very crack product team there, and so they acquired our company. And I’ve been bouncing around different products inside of OpenAI and on the research side as well on post training, but super happy we landed kind of together on Sora to bring this thing to life.
Konstantine Buhler: It was a really cool product in between, too, like the global illumination product.
Thomas Dimson: Oh yeah. I still believe in it.
Konstantine Buhler: Yeah, me too.
Rohan Sahai: Awesome. I’m Rohan, I’ve been at OpenAI for about two and a half years. Started as an IC on ChatGPT, but then as soon as I saw the video gen research I quickly got Sora-pilled and made my way over there, and so currently lead the Sora product team. Before that, just startups, big companies within the Valley. Bunch of random stuff. Yeah.
Konstantine Buhler: Cool. Well Bill, you are the inventor of the diffusion transformer. Can you tell us what that is?
Bill Peebles: Yeah. So most people are pretty familiar with autoregressive transformers, which is the core tech that powers a lot of language models that are out there. So there, you generate tokens one at a time, and you condition on all the previous ones to generate the future. Diffusion transformers are a little bit different. So instead of using autoregressive modeling as kind of the core objective, you’re using this technique called “diffusion,” which at a very high level basically involves taking some signal, for example, video, adding a ton of noise to it, and then training neural networks to predict the noise that you applied.
And this is kind of a different kind of iterative generative modeling. So instead of generating token by token as you do in autoregressive models, diffusion models generate by gradually removing noise one step at a time. And in Sora 1, we really kind of popularized this technique for video generation models. So if you look at all the other competitor models that are out there, both in the States and in China, most of them are based on DiTs—diffusion transformers. And a big part of that is because DiTs are a really powerful inductive bias for video. So because you’re generating the whole video simultaneously, you really solve issues where quality can degrade or change over time, which was a big problem for prior video generation systems that DiTs ended up fixing. So that’s kind of why you’re seeing them proliferate within video generation stacks.
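To make that contrast concrete, here is a minimal, hypothetical sketch of the denoising objective Bill describes: corrupt a clean clip with noise, then train a network to predict the noise that was added, with every frame handled jointly rather than one at a time. The toy tensor shapes, the tiny denoiser, and the linear noise schedule are illustrative assumptions, not Sora’s actual architecture.

```python
import torch
import torch.nn as nn

# Toy "video" batch: (batch, frames, channels, height, width).
video = torch.randn(2, 8, 3, 32, 32)

class ToyDenoiser(nn.Module):
    """Stand-in for a diffusion transformer: predicts the noise added to its input."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * 32 * 32, 256), nn.GELU(), nn.Linear(256, 3 * 32 * 32)
        )

    def forward(self, noisy, t):
        b, f, c, h, w = noisy.shape
        x = noisy.reshape(b, f, -1) + t.view(b, 1, 1)   # crude noise-level conditioning
        return self.net(x).reshape(b, f, c, h, w)

model = ToyDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    # Sample a noise level per clip and corrupt the whole clip at once:
    # the entire video is denoised jointly, not frame by frame.
    t = torch.rand(video.shape[0])                       # noise level in [0, 1]
    noise = torch.randn_like(video)
    noisy = (1 - t).view(-1, 1, 1, 1, 1) * video + t.view(-1, 1, 1, 1, 1) * noise

    pred = model(noisy, t)                               # predict the applied noise
    loss = ((pred - noise) ** 2).mean()                  # simple denoising loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```

At sampling time the same network is applied repeatedly, starting from pure noise and removing a little of it at each step, so the whole clip is refined together rather than emitted token by token.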
Konstantine Buhler: Mm-hmm. When I try to visualize it, I mean, for each diffusion, you have a matrix of pixels, and then you do the entire video at the same time, which you can basically see at different frames, I imagine. Can you visualize that as, you know, a matrix of matrices that basically transforms over time?
Bill Peebles: Yeah, it’s a good question. So we really kind of consider things at the granularity of, like, spacetime tokens, which is sort of an insane phrase, but, you know, whereas characters are a very fundamental building block for language, for vision it’s really this notion of a spacetime patch, right? You can just imagine this little cuboid that spans both X and Y, the spatial dimensions, as well as a temporal locale. And that really is kind of the minimal building block that you can build visual generative models out of.
And so diffusion transformers sort of consider these almost voxel by voxel, you can think of it that way. And in the traditional versions of these diffusion transformer models, you have all of these little spacetime patches talking with all of the other ones. And that’s how you actually are able to get properties like object permanence to fall out, because basically you have full global context of everything going on in the video at every position in spacetime, which is a very powerful property for a neural network to have.
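As a rough illustration of the “cuboid” idea, here is a hypothetical sketch of patchifying a clip into spacetime tokens; the patch sizes and the reshape layout are made-up choices for the example, not Sora’s actual tokenizer.

```python
import torch

# Toy clip: (frames, channels, height, width).
video = torch.randn(16, 3, 64, 64)

# A spacetime patch is a small cuboid: a few frames deep, a few pixels tall and wide.
pt, ph, pw = 4, 8, 8   # patch extent in time, height, width (illustrative values)

T, C, H, W = video.shape
tokens = (
    video.reshape(T // pt, pt, C, H // ph, ph, W // pw, pw)
         .permute(0, 3, 5, 1, 2, 4, 6)        # group by (time block, row, column)
         .reshape(-1, pt * C * ph * pw)       # flatten each cuboid into one token
)
print(tokens.shape)  # (256, 768): 256 spacetime tokens, each a 768-dim vector
```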
Konstantine Buhler: And is that the equivalent of the attention mechanism is the object’s movement throughout the video?
Bill Peebles: Yeah, that’s right. So in our Sora 1 blogpost, “Video Generation Models as World Simulators,” we kind of laid out some visuals which sort of go into exactly your point here, which is really attention is like a very powerful mechanism for sharing communication, like, sharing information across spacetime. And if you represent data in this way where you patch-ify it into a bunch of these spacetime tokens, as long as you’re properly using the attention mechanism, that allows you to transfer information throughout the entire video all at once.
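And a minimal sketch of the global attention Bill is describing, using a stock PyTorch attention layer over the tokens from the sketch above rather than anything Sora-specific: with no causal or local mask, every spacetime token can exchange information with every other one.

```python
import torch
import torch.nn as nn

num_tokens, dim = 256, 768                     # matching the patchify sketch above
tokens = torch.randn(1, num_tokens, dim)       # (batch, tokens, channels)

attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)

# No mask is passed: unlike next-token language modeling, every position can
# attend to every other position across space and time, which is what lets
# global properties like object permanence fall out.
out, weights = attn(tokens, tokens, tokens)
print(out.shape)      # (1, 256, 768)
print(weights.shape)  # (1, 256, 256): each token attends over all 256 tokens
```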
Sonya Huang: What are the biggest differences between Sora 1 and 2? And I remember with the original Sora 1, you were already seeing kind of emergent properties where the more you scale, the more it’s able to do things like understand physics. Is Sora 2 purely a function of scaling, or what are the biggest differences?
Bill Peebles: Yeah, that’s a great question. You know, we’ve spent a long time really just doing, like, core generative modeling research since the Sora 1 launch to really figure out how we get the next step function improvement in video generation capabilities. We really kind of operated from first principles, right? So we really want these models to be extremely good at physics. We want them to kind of feel intelligent in a way that I’d say, like, most prior video generation models don’t. So by that I really mean if you look at any of the previous set of models that were out there, you’ll notice a lot of these kinds of effects that happen if you try to do any sort of complicated sequence of physical interactions, right? For example, like, spiking—gymnastics, classic.
Konstantine Buhler: Riding a dragon like you did.
Bill Peebles: Riding a dragon. That was fun. That happened for real, actually, Konstantine. Not generated.
Konstantine Buhler: [laughs]
Bill Peebles: You know, there are very clear problems with the past generation of models that we really set out to solve with Sora 2. And I think one thing that’s really cool about this model compared to the prior one is that when the model makes a mistake, it actually fails in a very unique way that we haven’t seen before. So concretely, for example, let’s say the text input to Sora is a basketball star shooting a free throw. If he misses in the model, Sora will not just magically guide the basketball into the hoop to be over-optimistic about respecting what the user asks for. It will actually defer to the laws of physics most of the time, and the basketball will actually rebound off the backboard. And so this is a very interesting distinction between model failure and agent failure—agent as in the agent that Sora is implicitly simulating as it’s generating video. And we haven’t really seen this very unique kind of semantic failure case in prior video models. This is really new with Sora 2. It’s kind of a result of the investment we put in really doing the core generative modeling research to get this massive improvement in capability.
Sonya Huang: Okay, so not purely a function of scale. There’s some concept of agents implicit in this. There’s things you’re doing beyond just scaling up the model.
Bill Peebles: Well, the notion of agents, I’d say, is actually mostly implicit from scale in the same way where we showed that object permanence begins to emerge in Sora 1 pre-training once you hit some, like, critical flops threshold. We see similar kinds of things happen as we push the next frontier, right? So you begin to see these agents act more intelligently. You begin to see the laws of physics be respected in a way that they aren’t at lower compute scales.
Konstantine Buhler: How does the concept of a spacetime latent patch relate to a spacetime token, relate to object permanence and how things move through the physical world?
Bill Peebles: Yeah, that’s a great question. So I’d say spacetime patch and spacetime token are more or less synonymous with one another—I’ll use them interchangeably. You know, what’s really beautiful is when people started scaling up language models from GPT-1 to GPT-2 to GPT-3, we really began to see the emergence of world models internally in these systems. And what’s kind of beautiful about this is there are incredibly simple tokenizers that actually go into creating the data that we train these systems on. But despite this very simple representation, BPE, characters, what have you, when you put enough compute and data into these systems, in order to actually solve this task of predicting the next token, you need to develop an internal representation of how the world functions, right? You need to simulate things.
And the models will make lots of mistakes right now at low compute scales, but as you continue pushing from 3 to 4 to 5, you just see these internal world models get more and more robust. And it’s really analogous for video, and in many ways more explicit. So I think it’s easier to picture what a world model or a world simulator looks like with video data, because it is literally representing the raw observational bits of, like, all of reality. But what’s really remarkable is because these spacetime patches are just this very simple and highly reusable representation that can apply to, like, any type of data, whether it’s just like video footage of this set, whether it’s like anime, cartoons, like, whatever it is, you’re just able to build one neural network that can operate on this vast, extremely diverse set of data, and really build these incredibly powerful representations that model very generalizable properties of the world. It’s useful to have a world simulator to predict how a cartoon will unfold, and likewise it’s useful for predicting how this conversation might unfold. And so that really puts a lot of optimization pressure on Sora to grok these core fundamental concepts in a very data-efficient way.
Konstantine Buhler: Did you have to put effort into selecting the data such that it reflected the physical world? For example, I’d imagine if you have data from the physical world, it all abides by the laws of physics. But you mentioned anime. That might not always abide by the laws of physics. Did you have to be selective, or did it naturally find patterns that separated that out?
Bill Peebles: That’s a really great question. We did spend a lot of time, you know, really thinking about what does the optimal data mix for a world simulator look like. And to your point, I think in some cases we’ll make decisions that maybe are for making the model really fun—for example, people love generating anime—but do not necessarily perfectly represent the laws of physics that are directly useful for real world applications. To put it another way, I think in anime there are certain primitives that are simplified, that are actually probably useful for understanding the real world. People still locomote through scenes, for example.
Konstantine Buhler: Yeah.
Bill Peebles: But if there’s some crazy dragon that’s, like, flying around, that’s probably not so useful for grokking aerodynamics or something.
Konstantine Buhler: Dragon Ball Z is more or less how I learned athletics, you know?
Bill Peebles: There you go.
Konstantine Buhler: Motion and Super Saiyan.
Bill Peebles: I think it is an interesting question, like, that I do not know the answer to, whether somehow pre-training on simplified representations of, like, the visual world, whether that’s sketches or some other modality, you know, makes you more efficient at grokking these concepts. I think it’s actually a very interesting scientific question that we need to understand better.
Sonya Huang: Do you think we’re close to exhausting the number of pre-training tokens there are out there, or do you think video is just so massive, and it’s actually one of the more untapped vats of data?
Bill Peebles: Yeah. The way I kind of think about this is the intelligence per bit of video is much lower than something like text data. But if you integrate over all of the data that really exists out there, the total is much higher. So to directly answer your question, you know, I think it’s hard to imagine ever fully running out of video data. There’s just so many ways that it exists in the world that, like, you know, you will be in a regime where you can continue to just add more and more data to these pre-training runs and continue to see gains for a very long time, I suspect.
Sonya Huang: You think we’ll ever discover new physics? There’s the LLM world of, you know, Einstein thinking at the whiteboard.
Bill Peebles: Right. Right.
Sonya Huang: It’s equivalent to the LLM thinking. There’s also just the—if you develop a perfect simulator and you just simulate physics better and better, you might learn things about the world that we haven’t learned yet.
Bill Peebles: I totally think that this is bound to happen one day. And, like, you know, I think we probably need to even—we probably need one more step function change, I’d say, in model quality to really get to a point where for example, you can think about doing scientific experiments in the models. But you could imagine one day you have a world simulator that is generalized so well to the laws of physics that you don’t even need a wet lab in the real world anymore, right? You can just run biological experiments within Sora itself.
And again, this needs a lot of work to, like, really get to the point where you have a system that’s robust enough to do this reliably. But internally, again, we’ve viewed Sora 1 as kind of being the GPT-1 moment for video. It was really the first time things started working for that modality. Sora 2 we really view as GPT-3.5 in terms of it really being able to kickstart the world’s creative juices and really break through this kind of usability barrier where we’re seeing mass adoption of these models. And we’re going to need a GPT-4 breakthrough to really get this to the point where this is useful for the sciences, as we’re seeing now with GPT-5, right? I feel like every day on Twitter I see another convex optimization lower bound get improved by GPT-5 Pro. And I think eventually we’re going to see the same thing happening for the sciences with Sora.
Sonya Huang: Do you think you need physical world embodiment to get there, or do you think a lot of it can be done effectively in sim?
Bill Peebles: You know, I am always amazed every time we push another 10x compute into these models, what just magically falls out of it with, like, very limited changes in kind of like what we’re training on and the fundamental approach to what we’re doing. I suspect some amount of physical agency will certainly help. I have a hard time believing it will make you worse at modeling collisions or something else. Video only is quite remarkable though, and I wouldn’t be surprised if it’s actually kind of AGI complete for building a general purpose world simulator.
Konstantine Buhler: So for this concept of a general purpose world simulator, a world model where you can do science experiments in that world, do you think that video is the sole—or some combination of video and text are the combined data inputs, and you train it on this type of model? Or is it going to be—does it have to be based on more structured laws of physics that are understood and laws of biology that are understood?
Bill Peebles: I think it probably depends a lot on the specific use case you’re kind of envisioning for the world simulator. Like, for example, if you just really want to build an accurate model of how, like, a basketball game is played, I actually think only video data and maybe audio as well are kind of sufficient to build that system.
Konstantine Buhler: Not of me playing basketball. That would be an inaccurate, very bad player of basketball.
Bill Peebles: You know, actually Sora’s current understanding of how people play basketball, Konstantine, may be at your level.
Konstantine Buhler: Wow. Okay, that makes sense. That checks out.
Bill Peebles: It’s possible. It’s possible.
Sonya Huang: I think he just dissed you. [laughs]
Konstantine Buhler: It’s accurate.
Bill Peebles: But it’s better than mine, Konstantine. That was like a Sora 1 situation. You’re at Sora 2.
Konstantine Buhler: We’ll toss some hoops. Is that what they’ll say?
Bill Peebles: You know, I’m down. I’m down.
Thomas Dimson: Let’s shoot some hoops.
Konstantine Buhler: Thomas’s first statement in the podcast.
Thomas Dimson: But I’m also at your level.
Sonya Huang: [laughs]
Bill Peebles: You know, I think it is an interesting question, like, what are all of the modalities that should be present in, like, this kind of general purpose system? Certainly if you add more modalities, I have a hard time believing it will decrease the intelligence. I also think there’s an argument to be made that just adding more and more does not provide significant marginal value compared to full mastery of video and audio, for example. I think it’s an interesting open question. I’m not actually sure right now. And it’s something we need to understand more.
Konstantine Buhler: So cool. Sonya a minute ago mentioned Einstein at a whiteboard. And obviously that makes me think of you, Thomas, and your hair.
Bill Peebles: Me, too.
Konstantine Buhler: It had to come. Like, if any hair gives the feeling of spacetime tokens, it’s definitely, definitely yours. At some point, you know, Bill, you are the creator of this revolutionary technology that has changed the way that AI video is created. And at some point, going from Sora 1 to Sora 2, you all said, there needs to be an application around this, there’s some benefit to an application. You brought together some of the best product people in the world. How did that crew come together at OpenAI?
Thomas Dimson: Yeah. I mean, the story is never as linear as you might think it is. So I think that—I mean, we’ve had a product team on Sora since the get go. Rohan was, like, spearheading that effort in the Sora 1 days. But I think Bill’s right when he says it was really like a GPT-1 kind of moment. We were seeing pockets of very interesting things there, but the models were not there yet—like, models without sound, videos without sound. It’s a very different kind of environment.
And so we were working on that surface mostly targeted at kind of a prosumer demographic. And separately—I mean, Rohan will probably go into more details on all that—separately we were also just kind of exploring different social applications of AI inside of OpenAI and what that could look like. We had a lot of prototypes, most of which were quite bad. And where we started to see some of the magic was actually with ImageGen before it had been released. We were playing with it internally in a social context, and it was really interesting to see that what people were doing is they’d sort of take an image and then you’d have, like, a chain of remixes of that image where it was like, I don’t know, it’s a duck. And then now the duck’s on somebody’s head, and now everything’s upside down and they’re smoking a cigarette. Like, just a lot of weird things.
And we were seeing this and we were like, “Oh, this is kind of like a very interesting thing that, like, you know, nobody can really do that with social media because it’s so hard to create something or riff on something.” It’s such a high barrier to entry action. Maybe you have to go get a camera set up. And it’s not just like, thinking of the idea, there’s actually a lot of things involved.
And so we were like, okay, this is a very magical behavior. How can we kind of productize that behavior? And we were mostly thinking about it away from Sora. Some of the Sora research was still ongoing, and I mean, there were signs of life, but it wasn’t, like, quite there yet in productized form. Bill probably had it in his head somewhere—he can see the future, but that’s fine. I’m a little bit more—can’t quite see the future yet. So we were just exploring that. I think we tried a few things, and then at some point, the research was really just showing very clear value, even iterative-deployment-style value, of, like, oh, this is something that people will really want. And so we went into this project, like, two or three months ago? It wasn’t very long.
Rohan Sahai: Yeah, it was, like, July 4.
Konstantine Buhler: Wow. Wow. That’s when you disappear, Thomas.
Thomas Dimson: That’s when I disappeared. Yeah, exactly. And we just kind of locked in, like, okay, we’re finally doing it. You know, that’s always a moment. And we started without any magical features, just like, okay, let’s just try to get a native video environment where you can hear the audio full screen.
And we did some quick generations. Things were showing very—they’re very cool, very fun, very interesting. And because of that ImageGen experience, we sort of had thought, like, okay, the magical thing here is that, like, barrier to entry is very, very low for creation. Coming from Instagram, that’s like, it’s impossible to get people to create on Instagram, and that’s the most valuable thing that people do. So what does that unlock? And it’s like, okay, well, that remix thing from ImageGen, that kind of could still apply here.
And so we brainstormed all these things about how could remixes work and, like, what does a remix mean here? One of those was this, like, cameo thing, which I think also Bill had in his head.
Bill Peebles: It was in the ether.
Thomas Dimson: It was in the Ether, for sure. But we just were, like, hacking together things on the product. Let’s see if this works. I didn’t think it would work at all, but it was on the list. And there were a few other things on the list, some of them were pretty crazy.
Konstantine Buhler: Why didn’t you think it would work?
Thomas Dimson: I am bad at predicting technology. [laughs] It wasn’t super clear to me that you could take a likeness of a person and have that kind of imagined into a video form, and whether it would work or not. And so we had early prototypes of different things, of people reacting in the video corner or stuff like that. But when we saw cameo just start to work and even playing internally, like, Rohan, do you remember that day where we were like, feed is entirely cameos.
Rohan Sahai: Yeah, entirely. It just went from, you know, we didn’t have that feature. Once we had that feature, product market [inaudible], everything we were generating was all of each other.
Sonya Huang: You must have seen the meme potential.
Rohan Sahai: I mean, yeah. I think at first we were just like, “This is amazing.” And then like, a week later, we were like, “This is still all we do. There’s something here.”
Thomas Dimson: Yeah. I mean, at first we were actually a little bit like, “Is this good? Like, hey, the cameos—it’s just all cameos now.”
Rohan Sahai: Does anyone else care about those?
Thomas Dimson: Do people care about other people doing stuff? And we kind of got to the point where we were like, “No, this is actually good.” Like, it actually feels like I’m coming back to see. And it really humanized it a lot, where a lot of AI video is just kind of static scenes that are quite beautiful, quite interesting, might have extremely complicated things going on, but they lose that human touch. And it really felt like it was coming back into it.
Rohan Sahai: Another learning from ImageGen, too. Like, ImageGen took off and had viral moments, because I think you could put yourselves in these scenes in accessible ways that weren’t possible before. Obviously, this massive, like, put me in a Ghibli scene, people taking selfies with their idols and stuff like that. And so once you actually kind of thought about it, it’s like, yeah, the cameo feature makes a lot of sense. You put yourself in all these scenes, that’s way more exciting, you and your friends. It’s novel. It’s not something you could do before.
Thomas Dimson: Yeah. And then that combined with remixes. I mean, cameo is kind of remixed to begin with, but then you start to think about, “Okay, well, now I can riff on Rohan doing something,” or whatever it is. Like, Bill had you wrapped in an action figure package, and it was.
Bill Peebles: It’s been remixed, like, an insane number of times.
Thomas Dimson: Thousands of times. Yeah. So, like, just very, very crazy things that kind of go on. And very emergent. A lot of stuff that I would have never thought of, actually.
Bill Peebles: How many generations have you guys, like, publicly posted at this point?
Rohan Sahai: I have no idea.
Thomas Dimson: I know I’m 11,000 or so.
Rohan Sahai: I was like a little less than that.
Sonya Huang: Wow.
Rohan Sahai: Yeah.
Bill Peebles: Crazy.
Sonya Huang: What has surprised you about the types of users that are really sticking with Sora? Who is it really a hit with?
Rohan Sahai: If you just go to the latest feed, which is just like the fire hose …
Thomas Dimson: Astronaut mode.
Rohan Sahai: … of everything. Yeah, it’s Spacetime Thomas mode. It’s wild out there. But that gives you a pretty good snapshot into just everything happening. I mean, I think we have almost seven million generations happening a day, so you can imagine there’s just a ton of information there. It’s one of my favorite ways to just get product feedback. It is so diverse, the type of stuff people are doing, the type of people. There’ll be, like, a complete variety of ages. Some people just envisioning themselves in scenes that seem motivation oriented, people just memeing with their friends, people cameoing some of, like, the public figures on the platform that have done cameos.
So I think the diversity has surprised me more. I was kind of expecting this sort of, you know, Twitter AI crowd to heavily dominate the feed. They definitely dominate, like, the press cycles, at least the ones that we’re most exposed to. But in terms of people actually using this, it’s quite a wide variety. And the last thing I’ll say is it’s a bigger departure from, like, the sort of niche AI film crowd that existed before. Which is great, they were early adopters, but I thought it would start there, and it felt like it started with just a way wider range of people. I think getting to the top of the app store helps with that. You just get people who are browsing and see this thing.
Konstantine Buhler: My mother keeps cameoing Thomas.
Thomas Dimson: Is that right?
[CROSSTALK]
Konstantine Buhler: You have 11,000. She’s done 10,000 of them.
Sonya Huang: Thomas, you wrote the original algorithm, if I’m right, for the Instagram ranking algo. There was a lot in the Sora 2 blogpost about how you guys are clearly being very intentional about how you want to do ranking in the algo. Can you talk a little bit about lessons learned from Instagram, and how you’re approaching it over at Sora?
Thomas Dimson: Yeah. I mean, there’s a lot to cover in that. I think that the first thing to think about when we think about these platforms or think about Sora specifically is it’s the same thing I was mentioning before about creation. So Sora enables basically everybody to be a creator on this platform, and that is a very, very different environment than something like Instagram where you have this, like, extreme power law of the people that are creating. And the power law just naturally gets more—narrow? What’s the right word there? But more head heavy, yes.
So sometimes I feel like I have to defend myself on the Instagram algorithm side. I mean, we did it for a reason. It was to solve a problem. It wasn’t just kind of like a random decision to, you know, optimize for ads or something like that. And the reason we did that was that we noticed that, like, what was happening on Instagram over time was because it was chronologically ordered, every single person that posted was guaranteed to have the top slot of all their followers.
And so if you think about that for a second, the incentive for somebody in that environment is actually to create constantly, because they are guaranteed distribution when they create. And over time, because of this power law becoming heavier and heavier or, like, more head heavy, those types of people—which are great, they provide a lot of value to the ecosystem, but they start to crowd out people that you really care about. And so maybe you follow National Geographic or something—not to dunk on National Geographic, I love them, but if they’re posting 20 times a day, your friend’s not. They don’t have the same optimization objective. They’re probably just posting a picture of their coffee or something. And so you’d have 20 Nat Geo posts and then one picture that you actually really cared about that you never really scrolled to.
And there’s not too many solutions to that problem if you have a guaranteed ordering. One of them is that you have to unfollow all these accounts that you maybe care about, but care about not as much as the person that posts once a day. And the other is that you have to permute the feed. And so we went with that path, we tried it, we tested it out internally. It was very kind of controversial to do, but I think that you can actually kind of, like, math this out. It’s like a proof that basically over time, you’re going to have to take control over distribution on the platform in order to prevent these kind of issues and show people what they actually care about.
So that’s why we did it. And it actually showed a lot of value. I remember the early tests—I won’t get into the numbers on them, but they were pretty unambiguous, actually, about this was showing more people that you cared about. It was improving your experience with the platform. It actually moved creation, which is unusual. It made people create more because they were seeing more content that was accessible to them. But I also think that these things can go astray over time. And I won’t say, like, the Instagram algorithm is unequivocally bad or unequivocally good, but when we started to open up to more unconnected content and ad pressure was very strong, there’s also a natural company incentive to optimize for just blind consumption because it’s how you make money. So maybe cheaper content or maybe just like, get people to scroll more and more and more and more. And that also can encourage people to create less because it’s just like a more mindless scrolling mode.
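As a toy illustration of the crowding-out dynamic Thomas describes (not Instagram’s or Sora’s actual ranking), here is a hypothetical sketch: under a chronological feed a high-volume account fills every top slot, while even a simple score that trades off recency, affinity, and repeat appearances lets a close friend’s single post surface. The affinity values and decay constants are made up for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Post:
    author: str
    created: datetime
    affinity: float  # how much this viewer cares about the author (0..1), assumed given

now = datetime(2025, 10, 1, 12, 0)
posts = (
    # A high-volume account posts 20 times in the last couple of hours...
    [Post("natgeo", now - timedelta(minutes=5 * i), affinity=0.2) for i in range(20)]
    # ...while a close friend posts once, three hours ago.
    + [Post("close_friend", now - timedelta(hours=3), affinity=0.9)]
)

# Chronological ordering: the friend's single post is buried under the flood.
chronological = sorted(posts, key=lambda p: p.created, reverse=True)

def ranked(posts):
    """Toy ranking: trade recency off against affinity, and damp repeated
    appearances by the same author so one account can't occupy every slot."""
    seen, scored = {}, []
    for p in sorted(posts, key=lambda p: p.created, reverse=True):
        age_hours = (now - p.created).total_seconds() / 3600
        repeat_penalty = 0.5 ** seen.get(p.author, 0)   # each repeat halves the score
        scored.append((p.affinity * repeat_penalty / (1 + age_hours), p))
        seen[p.author] = seen.get(p.author, 0) + 1
    return [p for _, p in sorted(scored, key=lambda s: s[0], reverse=True)]

print([p.author for p in chronological[:5]])  # all 'natgeo'
print([p.author for p in ranked(posts)[:5]])  # 'close_friend' rises to the top
```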
Konstantine Buhler: But you guys have very concretely committed to doing things to prevent that kind of behavior.
Thomas Dimson: We have. We have a lot of mitigations there in place. But I think what it really comes down to is just like, what are we trying to do as a platform? And I think the magic of this technology is that everybody is a creator. And so we want this feed to be optimized for you to create, to inspire you to create. And that can be—like, sometimes when you think of inspiration, you think of, like, oh, it’s this beautiful, crazy scene that’s so elegant. When I think about that, I think about a meme culture or something really funny or like, oh, that’s cool, I’ve got to riff on that. And I think that’s a very different brain mode when you’re browsing the feed.
And of course, we have lots of other things in place. So, like, I think it starts with an incentive. Our incentive right here is to encourage more creation in the ecosystem. But there are certainly use cases we want to prevent. We’re not going to get them right all the time. It’s very challenging, it’s a very living system. It’s also very hard to write a recommender system when you have no data and you don’t know what to recommend or you don’t know how the platform’s going to evolve. But that’s, like, basically how I kind of think about the incentives of feed.
And then, Rohan, we have a lot of mitigations in place that I think you’ve been kind of, like, thinking about and maybe even more deeply than I have, about preventing maybe the extreme cases. And so I don’t know if you want to talk a little bit.
Rohan Sahai: Yeah, happy to. But one thing before you—I mean, just one thing to add is that the stated intent of, like, optimizing for creation is working really well.
Thomas Dimson: Yep.
Rohan Sahai: It’s almost a hundred percent of people who get past the invite code and all that on the app end up creating on day one. When they come back, it’s like 70 percent of the time they come back they’re creating, and 30 percent of people are actually even posting to the feed. So not just, like, generating for themselves, they’re actually posting into the ecosystem, which is an incredible testament to the model, to how fun it is, and to how what we’re optimizing for is actually working pretty well right now.
But yeah, beyond that, I mean, one of the top of mind things is I think we don’t want this just to be a mindless scroll. And beyond just optimizing for creation in the ranking algorithm, there are things we can do, like, trying to just get you out of this sort of flow state of just consumption and push you into, like, creative mode, I think. There’s a great article on this called, like, “The Curvilinear Nature of Casinos,” where they design it so you never have to make any decisions. It’s just like you walk in a circle, there’s no windows, all that kind of stuff. We can be very intentional about not doing that. And, like, you know, whether it’s an in-feed unit that’s like, “Hey, you just kind of viewed a couple of videos in this domain, why don’t you try creating something?” Or other ways to just kind of like push you out of that. We actually have things like that in the product. Yeah, those are some of the things that come to mind.
Sonya Huang: I really commend you guys for what you’ve done to, you know, make sure that there’s a version of the world where video model as world simulator could have just ended up with us, you know, each retreating into our own computer screens and just becoming addicted and just retreating into ourselves. And I think the amount to which you’re, you know, prioritizing the human element and the social element, I think that the care you’ve put into that really shows.
Rohan Sahai: I don’t think we would have launched a feed of just, like, AI content that didn’t have a human feel. Like, I don’t think that excited us. And as soon as we had the product, we had Cameo and we had that feeling internally, we were like, “Okay, this is actually a little different than …”
Thomas Dimson: Yeah. I don’t think it was totally obvious. Again, it was like a pretty crazy sprint to go through this. And it wasn’t super obvious to us what would emerge, but I think that the idea, it makes sense in retrospect, but it was a completely not obvious product decision that cameos would be the thing.
Rohan Sahai: Yeah.
Thomas Dimson: Where it’s like of course you just want to see your friends doing cool things, so it’s like that makes sense. But I was never actually that afraid of competitive pressure in that crazy product phase because I was like we sort of had these non-trivial decisions that are obvious in retrospect but were not obvious at the time that were sort of building on top of each other. It’s like, okay, cameos. Well, there’s also a version of cameo where you have a crazy flow that’s just for you, and it’s a one-player mode cameo and you, like, go through this onboarding flow and do your stuff. But we were already seeing these interesting dynamics where it’s like, “Oh, I could tag Rohan in my video. That’s crazy!” Like, and then we can have an argument or, like, have an anime fight. Doesn’t matter. And I was like, okay, so that’s actually the human element, that’s the magic of this. This is actually strangely more social than a lot of social networks, even though it’s all AI generated content. So very unintuitive.
Sonya Huang: Is it a fine-tuned version of Sora 2, or is it a separate model from what’s available over the API or is it the same?
Bill Peebles: Between the app and products? So we are currently exposing the models in the same state across API and the app.
Sonya Huang: Okay, really interesting. What are you seeing people do on the API side, and is it different from the types of things people are doing on the consumer app?
Rohan Sahai: The motivation behind even launching an API is just support of these long-tail use cases. We have this vision of enabling, you know, a ChatGPT-scale consumer audience with this tech, but there’s tons of very niche things out there. You can imagine people who are much—you know, with Sora 1 we went out and talked to a lot of these studios, and what we heard from them is that they want to integrate this in this specific part of their stack in this specific way. And we’d love to support all these long-tail use cases, but we don’t want to build a thousand different kinds of interfaces for this stuff. So that’s the kind of stuff we’re excited to see with the API. So far it’s been those kinds of companies, a little bit more niche, not trying to build a first-party social app, but maybe, you know, with some filmmaking kind of audience or people they’re supporting. Or even just, you know, I think there was some company doing something with CAD where they were using Sora.
Bill Peebles: Oh, Mattel.
Rohan Sahai: Yeah, yeah, yeah. So there’s cool use cases out there. I think we’re still getting a sense of what they are.
Thomas Dimson: Yeah, I think there’s a lot that can be done with these things. I think about gaming all the time, just based on my background. AI and gaming is always a very controversial subject, but it’s very clear that there’s a place and there’s a role. Maybe it doesn’t have to interrupt the creative process, it can enhance it. And I’m pretty excited to see some of those use cases emerge.
Sonya Huang: Do you think the video models are good enough now for people to be able to build video games on top of the API, or do you think we’re still another rev or two away?
Thomas Dimson: I have my own take on this.
Rohan Sahai: I was gonna say never bet against the ways people can be creative with technology to build. Like, someone will be able to build a game—and maybe has built a game already. Will it look and feel like—obviously there’s latency with this model, so you’d have to do all sorts of crazy stuff to get around that.
Thomas Dimson: Like, I think that your mind immediately goes to kind of the obvious sort of things that you would do in gaming, and we’ve seen some of that sort of stuff certainly in research blogs and that kind of thing. My mind often goes to, like, okay, this is like a creative tool that’s a little bit different. And the types of games that really excite me there—I’ll just go off on one, which is there’s a game called Infinite Craft, which is the world’s simplest game. It’s a web game where you just take elements. It’s like fire, air, water, earth. You have four elements to start and you just drag them.
Bill Peebles: I love this game.
Thomas Dimson: It combines into something new. And the thing it combines with is LLM-based. So it’s like fire and earth might be a volcano, and then volcano plus water might be an underwater volcano or Godzilla or something like that. You always end up in Godzilla for some reason, I don’t know why. But that’s a game that it’s like, oh, it kind of makes sense where it’s like, yeah, you don’t really need a crafting tree. The LLM can derive this crafting tree, and it’s a process of discovery.
And so I think there’s a lot of untapped stuff in that space where again, I like the idea of a process discovery. In fact, my philosophical view on LLMs and video models to some extent is that it is a process of discovery. These are all in the weights, you’re just unlocking it with a secret code, which is your prompt. And I love that. That is very magical. In gaming, that was the thing that excited me the most was discovering something new, especially if it was a true discovery, it wasn’t put there by somebody else. Maybe they just enabled the mechanics around it. So I think there’s a huge opportunity in that space of gaming, when you think about games and just the different things and embrace this technology in a very different way.
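For what it’s worth, the mechanic Thomas describes can be sketched in a few lines; this is a hypothetical toy in the spirit of Infinite Craft, not that game’s actual code, and the prompt, model name, and caching scheme are assumptions.

```python
from openai import OpenAI

client = OpenAI()   # assumes OPENAI_API_KEY is set in the environment
recipes = {}        # cache so combining the same pair always yields the same result

def combine(a: str, b: str) -> str:
    """Ask a language model what two elements craft into; the crafting tree is
    discovered from the model's weights rather than hand-authored."""
    key = tuple(sorted((a.lower(), b.lower())))
    if key not in recipes:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",   # any small chat model would do here
            messages=[{
                "role": "user",
                "content": f"In a crafting game, what single thing do you get by "
                           f"combining '{a}' and '{b}'? Answer in one or two words.",
            }],
        )
        recipes[key] = resp.choices[0].message.content.strip()
    return recipes[key]

# e.g. combine("fire", "earth") might return "Volcano"; combine("volcano", "water")
# might return "Island". The player's prompt is the key that unlocks the tree.
```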
Sonya Huang: It reminds me of how some of the earliest use cases for GPT-3 were kind of these text games. So it’s different from how you think of a playable video game, but actually a lot of these mechanics are very game-like.
Thomas Dimson: Exactly. Yeah. I think there’s still constraints, and I think that’s going to be like the mechanism design. That’s still very human. Like a lot of the early games with GPT-3, they’re kind of like, yeah, it was fun for a minute, and then it kind of went off the rails and you’re like, “I don’t really know what I’m doing anymore.”
Sonya Huang: [laughs]
Thomas Dimson: But again, in some ways, Sora feels like a little bit of that where it’s got a little bit of gaming DNA inside of it, where it feels very fun and different and exploratory. So I like things like that, and I think there’s going to be more use cases that we can’t even think of. It’s too creative.
Sonya Huang: What are you guys seeing on the creative filmmaking side? Like, is that an important target market? Do you want to empower with the long tail or do you want to empower the head, so to speak, of the creative market?
Bill Peebles: I think it’s a really good question. You know, we’ve benefited a lot from creatives who are really willing to, like, go all in on, you know, even like the early technology like DALL-E 1 and DALL-E 2, and really help steer us along the path. And I think it’s important that we continue to build things for those folks, and we are working on some things that are more targeted towards creative power users long term. At the same time, you know, I do think AI is a very democratizing tool at its best, and so what’s kind of beautiful about the Sora platform in general is whenever someone kind of strikes gold, you see one of these beautiful anime prompts that goes to the very top of the feed for everyone. Anybody can go and remix that, right? Everyone has the power to build on top of that and learn from all of these people who come in with this incredible knowledge about how to really get the most out of these tools. And so I am really excited just to see the net creativity of humanity just increase as a result of this, but I think a big part of that is continuing to empower people who are always at the frontier, which are these more pro-oriented creator type folks. And so we want to keep investing in them as well.
Konstantine Buhler: We’ve nerded out for a while, like, almost a couple years now, about that vision of feature film-length content. Like, yes, you have these amazing cameos and shorter content, but at some point the individual creator making feature-length work, that’s been something that you’ve been excited about for a very long time.
Bill Peebles: Yeah.
Konstantine Buhler: When do we get there? You know, is there a point where we have a feature film that is created on Sora 2, and how do we consume it? Is it in the Sora app? Is it posted somewhere else online? Do you go to a movie theater and watch it?
Bill Peebles: Yeah, it’s a great question. I mean, I think this will happen in stages to some extent. So, like, if you guys watch the launch video, I mean that was made by Daniel Frieden, who’s on the Sora team. And he already, with these tools, is able to pump out these, like, incredibly compelling short stories within days at most. I mean, he literally made that all by himself in almost no time. And he’s been continuing to put new ones out there on the OpenAI Twitter since. Clearly this is massively compressing the latency that’s associated with filmmaking.
I think to get to the point where really anybody can do this, any kid in their home can just fire up the app, or Sora.com or something, and go and make this, it’s really an economics problem of the video models. Video is the most compute-intensive modality to work with. It’s extremely expensive. And we’re making good progress on the research team really continuing to figure out ways to make this affordable for everyone long term. Like, right now, for example, the Sora app is totally free. In the future there will probably be ways where people can pay money to get more access to the models, just because that’s the only way we can really scale this further. But I think we are not far off from this world where anybody can really, like, have the tools to make amazing content.
You know, I think there are going to be, like, a lot of bad movies that get created by this, but likewise there’s probably the next great film director who is just kind of sitting in their parents’ house, like, still in high school or something, and just has not had the investment or the tools to be able to, like, really see their vision come to life. And we’re going to find absolutely amazing things from giving this technology to the whole world.
Sonya Huang: I’m looking forward to the feature film-length Konstantine’s Greek Odyssey coming to theaters near you.
Konstantine Buhler: It’s gonna be a banger. We’re all in it together, different characters. I play the Cyclops. It’s a good one.
Thomas Dimson: I think just to touch on that. One more thing—something I’ve learned from recommender systems over and over again. So the tool is going to be a huge unlock for just making people more creative in general, because you don’t need access to this filmmaking equipment and all that sort of stuff. But we do consistently see that content is also a social phenomenon in a way. And movies and all that, everything you see out there is kind of a bit of a social phenomenon in addition to the actual content itself. And so I think we’re going to enter a very interesting world where, you know, there’s so many people creating and so much content out there that even the idea that people are paying attention to it and watching it is going to become more and more important. And I think that’s actually going to make the quality of content just kind of elevate, because anybody can create, and actually it’s going to be the consumption that’s going to be quite limited, which is very different than the world we live in today.
Sonya Huang: So I have the flip question, and it’s a little bit chicken and egg, but you guys were very thoughtful and intentional about how you treated IP holders. Can you say a word on that?
Bill Peebles: You know, we’ve been in close partnership with, like, a bunch of folks across the industry, and really trying to, like, both show them kind of this new technology that is actually like a huge value proposition for rightsholders across the board, right? And, like, we’re hearing so much excitement from the folks we’re talking with. Like, they really see this as being, like, you know, a new frontier for, again, every kid in the world having the ability to go and use some of this beloved IP, and really bring it into their lives in a way that feels much more personal and custom than what’s been possible before.
At the same time, you know, we really want to make sure that we’re doing this, like, in the right way. So we’ve been really trying to take feedback and really steer our roadmap in a way where we know that, you know, both users are going to have an awesome experience getting to use this IP, but also the rightsholders are going to get properly monetized and rewarded in a way that everyone wins, basically. So we’re right now actively working on trying to scope out the exact details about how we’re going to, you know, for example, make it so if you want to cameo your favorite character from some, like, beloved film or something, you can do that in a way where you have access to it but, like, monetization will flow back to the rights holder, right? So really trying to figure out this kind of, like, new economy for creators. We kind of just have to create this from scratch right now. There’s a lot of deep questions about how to do this the right way. And, you know, as with everything with this app, we come into it with an open mind, and we hear feedback and we iterate quickly. You know, we’re not sure where this is going to totally converge, but we’re working closely with people to figure it out.
Sonya Huang: Really cool. What’s ahead?
Rohan Sahai: Pets.
Bill Peebles: Yeah, I think—I mean, one …
Sonya Huang: Sorry, what?
Thomas Dimson: Pets cameo.
[CROSSTALK]
Sonya Huang: Is that one of the most demanded features?
Bill Peebles: For me it is.
Konstantine Buhler: Bill’s demanding. I will remind us we were just talking about curing diseases and world models, and now we’re to the future [inaudible].
Thomas Dimson: This is something—no, it’s actually—so that’s definitely true. We’ve committed to that. It’s coming, I promise. We actually had Bill’s dog as—like, when we were playing around with it.
Bill Peebles: Rocket. The goodest boy.
Thomas Dimson: Yeah. And actually, it was very, very cool to actually feature a pet. You can imagine where that goes. It doesn’t have to necessarily be a pet. Could be anything—a clock or whatever you have.
Konstantine Buhler: Clock?
Thomas Dimson: Well, yeah.
Konstantine Buhler: Do you have a special clock?
Bill Peebles: Actually, it’s really compelling. I didn’t think it could be so compelling until Thomas showed me this clock. It was like a sentient clock. It’s, like, based on a real clock.
Thomas Dimson: Yeah, I had a clock. My father was a technology person for a while. This company Veritas gave him a clock for his, like, whatever anniversary. Anyway, so I have it on my table somewhere. And there’s this old Simpsons episode where they talk about a walking clock. And for some reason that’s just been an earworm in my head for the last 30 years. And so, you know, they’re telling some joke and it’s like, “Is it a walking clock? There’s a walking clock? It’s like walking clock.” And then it’s like, “No, man. It’s my dog.” And so it connected in my brain where I was like, “Okay, Rocket, walking clock.” And then so I tried it.
[CROSSTALK]
Thomas Dimson: Yeah. So it connected in my brain, and we’ve been playing around with this just to see if we can get it to work and whether there’s something special there, which is part of the fun of being on the Sora team: you get to play with this emergent, crazy technology. And, like, maybe it does something you wouldn’t even have expected. So I recorded a two-second video of my clock, and then I gave it some cameo instructions and I said, “You’re just a walking clock. You’re a walking clock, you talk like—you talk your character.” And then I generated my first video. And it was insane. It was crazy. It was a walking clock. And then I had one where it was talking to Bill, and Bill was like, “I didn’t think it would ever land, the pet cameo feature.” And walking clock’s, like, “Here I am. You know, I just landed.” So it’s coming.
Bill Peebles: It’s all internal memes.
Sonya Huang: Talk about emergent IP. Who needs Pokemon when you can have a walking clock?
Thomas Dimson: Well, that’s the greatest IP.
Rohan Sahai: One thing to add in terms of the feature, I think on the feature film question, something I think about all the time is like, you know, what will that actually look like? I think my—I mean, caveat. Bill’s the only one who’s good at predicting the future here.
Bill Peebles: Questionable.
Rohan Sahai: But my sense is that, you know, as we get to longer forms, what our equivalent of a feature film will look and feel like is very, very different from what a feature film is today. You know, I don’t know exactly what that looks like, but I think on the subject of creators and what’s coming in the world, I think there will be a new medium and a new class of creators. That new class could include a lot of existing creators, and support existing sorts of mediums and stuff like that. But I think we’re just in the early innings of what I imagine will be the next film industry, rather than thinking about this as being a feature film. But I think there’ll be something new.
There’s some anecdote—I hope this is true because I say it all the time, but apparently when the recording camera hit the world, the first thing people did was record plays. This is the least interesting thing you could do with a recording camera. It’s like, what’s the big idea? Oh, people don’t have to travel around acting. We can just film them and distribute it. And then someone was like, “Wait a minute. We can make a film, and film in all these different areas.” And I feel like we’re in, like, the first inning of so many different sorts of things that people will do with this technology, especially as the constraints change with latency and length and all that kind of stuff.
Konstantine Buhler: So cool. And a fun film-history nerd fact: one of the original videos—and we should check this as well—was made just down the peninsula to settle a bet on whether all four of a horse’s legs leave the ground when it gallops. That is an example of a new scientific discovery. People didn’t actually have an answer to that. Now that you have a new simulation format, what are we going to be able to discover in that?
Bill Peebles: It will be crazy. I think one broader point here is this app right now feels very familiar in a lot of ways. It’s like a social media network at its core. But fundamentally, the way that we really view it internally is with Cameo, we’ve kind of introduced the lowest bandwidth way to give information to Sora about yourself, right? Aspects about your appearance, about your voice, et cetera. You can imagine over time that that bandwidth will greatly increase. So the model deeply understands your relationships with other people. It understands more than just how you look on any given day. It’s seen your full—like how you’ve grown up, all of these details about yourself, and will really be able to almost function as like a digital clone, right?
So, like, there’s really a world where the Sora app almost becomes this mini alternate reality that’s running on your phone. You have versions of yourself that can go off and interact with other people’s digital clones. You can do knowledge work—it’s not just for entertainment, right? And it really evolves more into a platform which is really aligned with kind of where these world simulation capabilities are headed long term. And I think when that happens, the kind of immersion things we will see are crazy.
And for OpenAI across the board, it’s really important that we kind of iteratively deploy technology in a way where we’re not just dropping bombshells on the world when there’s some big research breakthrough. We want to co-evolve society with the technology. And so that’s why we really thought it was important to do this now and do it in a way where we’ve hit this—again, this kind of GPT-3.5 moment for video. Let’s make sure the world is aware of what’s possible now, and also start to get society comfortable in figuring out the rules of the road for this kind of longer-term vision for where again, there are just copies of yourself running around in Sora, in the ether, just doing tasks and reporting back in the physical world. Because that is where we are headed long term.
Konstantine Buhler: So cool.
Sonya Huang: So you’re building the multiverse.
Bill Peebles: Actually kind of, yeah.
Sonya Huang: Okay, well, can simulated me go and find my soulmate somewhere in there?
Bill Peebles: I mean, anything is possible in the multiverse.
Sonya Huang: [laughs]
Konstantine Buhler: That’s a call for action, everyone.
Sonya Huang: It is kind of crazy, though, because now I’m going to sound totally cuckoo, but if we’re in a compute environment and you’re building the perfect simulator, that kind of is the way you ultimately understand and break out of the compute environment, right? Like, are we getting closer to the heart of the Matrix?
[CROSSTALK]
Bill Peebles: There’s some very deep existential questions. Yeah. Yeah. What’s your guys’ [inaudible]
Sonya Huang: Oh, rising.
Bill Peebles: Yeah, me too.
Sonya Huang: What’s your P?
Konstantine Buhler: I’m low. Yeah.
Bill Peebles: Oh man. You’re really—okay.
Konstantine Buhler: I’m just like, you know what? It’s gotta be real.
Bill Peebles: Yeah. I feel like I’m at, like, a solid 60 percent. I don’t know, like more likely than not at this point.
Sonya Huang: I’m there, too.
Konstantine Buhler: Zero.
Sonya Huang: Should we make a [inaudible] on it?
[CROSSTALK]
Sonya Huang: What do you think are the theoretical limits to Sora?
Bill Peebles: Yeah, it’s actually a great question. I thought a little bit about this. Like, I think there’s like a question, can you eventually simulate a GPU cluster in Sora or something? And I assume there are some very well-defined limits on, like, the amount of computation you can run within one of these systems, like, given the amount of compute you’re actually running it on. I’ve not thought deeply enough about this, but I think there are some, like, existential questions there that need to get resolved. Yeah.
Sonya Huang: See, this is why his P(sim) is so high.
Bill Peebles: [laughs]
Konstantine Buhler: Fascinating.
Sonya Huang: Wow.
Konstantine Buhler: Got a few lightning round questions for the team that we just kind of generated on the fly here. And take your time. Jump in whenever you have an answer. Your favorite cameo on Sora to date. And what happened.
Bill Peebles: That is so tough.
Thomas Dimson: I have a hot one.
Sonya Huang: Let’s hear it!
Thomas Dimson: Okay. So there was this TikTok trend of—and I got obsessed with it. I don’t know why, but these Chinese factory tours where they’re like, “Hello, I’m the chili—this is the chili factory.” They get one like and it’s me, and it’s like they’re showing their chili factory and they’re like, “It’s the chili factory,” and I’m like, “This is amazing.”
Or like there’s an industrial chemical one. Yeah, I’ve lost the name, but there’s an industrial chemical factory. And the first day I had my cameo options open just because I was like, “I just want to see what happens.” And the first day, late at night, I opened my cameos and I was starting to get tagged in factory tour cameos that were all in Chinese. And I was like, “I’m in the chili factory,” and I was so excited. It got zero likes. I liked it. It was just me, but I was like, “I’m the chili factory guy now. I’m, like, doing the ribbon cutting at the chili factory.” Amazing. That’s too deep of a cut, though.
Konstantine Buhler: Congratulations. Fun fact. I actually have done Chinese factory tours in real life, and they are truly epic.
Rohan Sahai: There’s this one I saw of Mark Cuban in jorts dancing around.
Bill Peebles: That was pretty good.
Rohan Sahai: That got me. But I mean, my—just scrolling the latest feed and just seeing, like, the wholesome content of people, like, doing things with their friends actually I think brings me the most joy. They’re not, like, super liked, but it’s like, people just, like, getting a lot of, you know, value, obviously, from just, like, making videos with their friends.
Bill Peebles: Sam has so many bangers. I like the one of him doing, like, this K-pop dance routine about, like, GPUs or something. It’s very good. Actually, I would put it on my Spotify if, like, we had the full song.
Konstantine Buhler: Wow.
Bill Peebles: It’s very good. It was generated by Sora. It’s very compelling. Yeah.
Konstantine Buhler: All right. Well, that leads to the next one, because you mentioned Spotify. What does a fully AI-generated work win first—an Oscar, a Grammy, or an Emmy?
Rohan Sahai: I think the, like, logical answer is, like, a short winning an Oscar.
Thomas Dimson: I think that’s probably right.
Bill Peebles: What would we win it for? Like, for, like …
Konstantine Buhler: The jorts?
Bill Peebles: Yeah, the jorts. It’ll be the jorts.
Rohan Sahai: The jorts trilogy.
Bill Peebles: Yeah. We need new content.
Thomas Dimson: I do think if people stitch things together in interesting ways, I think there’s a—you can actually start to make some very compelling storytelling in that. And I don’t think it’s like—it doesn’t really feel like AI anymore, the content I’m seeing. That was actually something I noticed with Sora as well. Just, like, I wasn’t even noticing it was AI. It was just kind of interesting.
Rohan Sahai: That’s a more interesting question. Will we know?
Thomas Dimson: Oh, yeah.
Rohan Sahai: Maybe it’s already happened.
Thomas Dimson: Maybe it’s already happened. [laughs]
Konstantine Buhler: I feel like for the Oscars, one of the cool things that’ll be unlocked is this long tail of epic stories in history, stories of heroism and struggle, and all of these things that have been locked up because of the cost of creating. You know, as a history enthusiast, I cannot wait for AI to unlock all of those stories.
Sonya Huang: Have you seen the Bible video app?
Konstantine Buhler: No, I haven’t.
Sonya Huang: Oh, it’s really good. I’ll show it to you after.
Konstantine Buhler: Perfect example. Or there’s this movie from a few years ago, The Last Duel, about a really terrible crime committed in medieval France that was historically relevant and basically says a lot about humanity. Hollywood eventually picked up that important story, but how many more are there in human history? That’s going to be really cool. Favorite character from any film or TV show.
Rohan Sahai: I have a really random one.
Bill Peebles: Go for it.
Rohan Sahai: You guys see Madagascar? King Julien? Played by Sacha Baron Cohen. He’s a lemur. He’s a lemur cat.
Konstantine Buhler: Absolutely.
Bill Peebles: It’s a banger.
Rohan Sahai: It’s his humor meets kid-friendly storytelling. It’s perfect.
Thomas Dimson: I play a lot of video games, so I mean your classic answer is gonna be, like, Mario or something like that. Although I’ll do the deeper cut: we were always joking about PaRappa the Rapper.
Rohan Sahai: Yeah.
Thomas Dimson: Yeah, PaRappa the Rapper. Like, an old PlayStation game, one of the original rhythm games. And it’s got a great artistic style and it’s got great IP of just this little guy.
Rohan Sahai: What is he? A dog?
Thomas Dimson: He’s a dog, yeah.
Bill Peebles: It’s a good pick. When I was a kid, I played the Pokemon trading card game competitively for a while. So I was like, really in the Pokemon rabbit hole. So, like, I don’t know. Pikachu.
Konstantine Buhler: Nice.
[CROSSTALK]
Konstantine Buhler: Super non-contentious. Like a fringe deep cut.
Konstantine Buhler: Okay, first world-model scientific discovery. Be as specific as possible. Obviously you’re not going to name the exact discovery.
Bill Peebles: I suspect it will be something related to, like, classical physics, like a better theory of turbulence or something. That would be my guess.
Thomas Dimson: I was guessing it was going to be something like that. I was like, Navier–Stokes. I don’t know. Yeah, some fluid dynamics thing that’s maybe hard to understand now. There’s a lot of, like, unsolved problems there. I think sometimes they call it continuum mechanics where it’s, like, in between, and we don’t have good models of them.
Rohan Sahai: Something that lends itself to simulation. Just like the amount of iterations you can do of a simulation unlocking something which I don’t—yeah, something in that realm.
Konstantine Buhler: The last thing we’ll be able to accurately simulate?
Bill Peebles: I do think there’s a set of physical phenomena for which video data is, like, a poor choice of representation, right? So, like, for example, is it really efficient to learn about high-speed particle collisions or something from, like, video footage? Maybe. I really think video is at its best when, you know, the phenomenon that you’re trying to learn about is just natively represented in the physical world. And so when you need to do, like, quantum mechanics or some other discipline where, you know, it’s more theoretical, we don’t have video footage beyond …
Konstantine Buhler: You can’t see it.
Bill Peebles: Yeah. Things that we’ve, like, manually rendered for, like, educational purposes. It feels like a weaker medium for understanding those things. So I suspect those would come last.
Konstantine Buhler: I guess it’s the things we don’t have sensors for.
Bill Peebles: Right. Right.
Rohan Sahai: Yeah. Maybe the last things we care to simulate is another way of thinking about the answer. I don’t know. I mean, people aren’t doing much with smell right now.
Bill Peebles: True. Green fields.
Konstantine Buhler: I’ve been meaning to tell you about that. Kind of awkward.
Bill Peebles: We’re still trying to figure out how to simulate Thomas with bad hair.
Thomas Dimson: Oh, yeah.
Bill Peebles: It remains an unsolved problem. Not even Sora can do it.
Konstantine Buhler: Thomas’s hair flows.
Bill Peebles: Guzzling ketchup.
Thomas Dimson: There was a good round of people being bald. We were all doing bald.
Rohan Sahai: Bald gems were good.
Thomas Dimson: That was actually kind of cool. That’s a use case I don’t really talk about very much, but it’s like …
Konstantine Buhler: Visualization?
Thomas Dimson: When you’re bald. Yeah, everybody wants to be bald. No, it’s just like you just see yourself in some different context. I think it can be quite powerful, even therapeutic in some ways where you just see yourself in some context that you either want or don’t want yourself to kind of be in, and just see yourself.
Rohan Sahai: It’s a real use case.
Thomas Dimson: Yeah.
Konstantine Buhler: Guys, thank you so much for coming. From space-time tokens to object permanence, world models that will enable scientific discovery, the democratization of creation, all the way to walking clocks. You guys have covered it all. Thank you so much, and the future is being created by you.
Bill Peebles: Thanks, Konstantine. Thanks, Sonya.
Sonya Huang: Thank you.