How Google’s Nano Banana Achieved Breakthrough Character Consistency
Nicole Brichtova and Hansa Srinivasan, the product and engineering leads behind Nano Banana, share the story behind the model’s creation and what it means for the future of visual AI. Nicole and Hansa discuss how they achieved breakthrough character consistency, why human evaluation remains critical for models that aim to feel right, and how “fun” became a gateway to utility.
Summary
The conversation reveals how a combination of technical craft, human evaluation, and accessible design transformed a powerful capability into a viral consumer product that’s opening new pathways to utility.
Character consistency requires obsessive attention to data quality, not just scale: While Gemini’s multimodal foundation and long context window enabled new capabilities, the breakthrough in making faces actually look like the person came from meticulous data curation and having team members who became “obsessed” with specific problems like text rendering or identity preservation.
Human evaluation is essential for subjective capabilities: The team discovered that character consistency is nearly impossible to evaluate quantitatively because only you can judge if an image truly looks like you. They built robust human eval processes, including internal artist testing and executive reviews, to capture the emotional and qualitative aspects that benchmarks miss.
Fun is a gateway to utility, not a distraction: The playful Nano Banana name and red carpet selfie use cases lowered barriers to entry, particularly for older users intimidated by AI. Once people tried the fun features, they discovered practical applications like photo editing, math problem solving, and information visualization they wouldn’t have explored otherwise.
The craft of AI matters as much as the architecture: Small design decisions throughout the development process—from inference speed to enable conversational editing to the philosophical shift toward generalization over narrow optimization—compounded into capabilities that felt magical. The team emphasized that “detail orientedness of high quality” separates good models from breakthrough ones.
Specialized models are proving grounds for unified multimodal systems: Image generation advances faster than video because single frames are cheaper to train and serve, creating a six-to-twelve month preview of capabilities coming to other modalities. The goal remains a single model that can transform any input into any output, with specialized releases like Veo and Nano Banana as stepping stones.
Transcript
Introduction
Hansa Srinivasan: There’s something about, like, visual media that really excites people. It’s like the fun thing, but it’s not just fun, it’s exciting, it’s intuitive. The visual space is so much of how we as humans experience life that I think I’ve loved how much it’s moved people.
Nicole Brichtova: I think we’re really now making it possible to, like, tell stories that you never could. And in a way where, like, the camera allowed anyone to capture reality when it became very accessible, you’re kind of capturing people’s imagination. Like, you’re giving them the tools to be able to, like, get the stuff that’s in their brain out on paper visually in a way that they just couldn’t before, because they didn’t have the tools or they didn’t have knowledge of the tools. Like, that’s been really awesome.
Stephanie Zhan: Today we’re talking with Nicole Brichtova and Hansa Srinivasan, the team behind Google’s Nano Banana image model, which started as a 2:00 a.m. code name and has since become a cultural phenomenon. They walk us through the technical leaps that made character consistency from a single photo possible, how high-quality data, long multimodal context windows, and disciplined human evals came together to make it reliable, and why craft and infrastructure matter as much as scale. We discuss the trade-offs between pushing the frontier and broad accessibility, and where this technology is headed: multimodal creation, personalized learning, and specialized UIs that marry fine-grained control with hands-off automation.
Finally, we’ll touch on what’s still missing for true AGI and white spaces where startups should be building now. Enjoy the show.
Main Conversation
Stephanie Zhan: Nicole and Hansa, thank you so much for joining us today. We’re so excited to be here to chat a little bit more about Nano Banana, which has taken the world by storm. We thought we’d start off with a fun question: What have been some of your own personal creations using Nano Banana, or some of the most creative things you’ve seen from the community?
Hansa Srinivasan: Yeah. So I think for me, one of the most exciting things I’ve been seeing is, like, the—it didn’t occur to me, but this is very obvious in hindsight, is the use with video models to get actually consistent cross-scene character and scene preservation.
Pat Grady: How fluid is that workflow today? How hard is it to do that?
Hansa Srinivasan: So what I’ve been seeing is people are really mixing the tools, and using different video models from different sources. And so I think it’s probably not very fluid. I know there’s some products out there that are trying to, like, integrate with multiple models to make this more fluid. But I think the difference in the videos I’ve been seeing from before and after the Nano Banana launch has been pretty remarkable. And it’s, like, much, much smoother and much more like what you’d want in the video creating process with scene cuts that feel natural. So that’s been cool. And I don’t know why it didn’t totally occur to me that people would immediately do that.
Pat Grady: [laughs]
Nicole Brichtova: One of my favorite ways that I didn’t expect is how people have hacked around the model to use it for learning new things or digesting information. I met somebody last week who has been using it to create sketch notes of these varied topics. And it’s surprising because text rendering is not something that—it’s not where we want it to be. But this person has hacked around these massive prompts that get the model to output something that’s coherent. And he’s used it to try to understand the work that his father’s doing.
Stephanie Zhan: Wow!
Nicole Brichtova: He’s a chemist at a university, and it’s a super technical topic. And so he’s been feeding his lectures to Gemini with Nano Banana, and then getting these sketch notes that are, like, very coherent and visually digestible. And for the first time, I think, in decades, they’ve been able to have a conversation with each other about his dad’s work. And that’s really fun and something that I didn’t see coming.
Pat Grady: That’s very cool.
Hansa Srinivasan: I think people are really working around—you know, like, this model is amazing, but obviously it’s not perfect. We have a lot of things we want to improve, and I think I’ve been astounded by the ways people have found to work with the model in ways we didn’t anticipate, and give inputs to the models in ways we didn’t anticipate to bring out the best performance and unlock these things that are kind of mind blowing.
Pat Grady: Did you guys—in the building of it, was there a moment, like an a-ha moment where you kind of felt, wow, this thing’s going to be pretty good?
Nicole Brichtova: We just talked about this.
Hansa Srinivasan: Yeah, I think Nicole had the a-ha moment.
Nicole Brichtova: I had one where—so we always have an internal demo where we play with the models as we’re developing them. And I had one where I just took an image of myself, and then I said, like, “Hey, put me on the red carpet.”
Pat Grady: [laughs]
Nicole Brichtova: And, like, just total vanity prompt, right? And then it came out and it looked like me. And then I compared it to, like, all the models that we had before, and no other model actually looked like me. And I was, like, so excited.
Stephanie Zhan: Wow!
Nicole Brichtova: And then people looked at it and they were like, “Okay. Yeah, we get it. Like, you’re on the red carpet.”
Pat Grady: [laughs]
Nicole Brichtova: And then I think it took a couple of weeks of other people being able to take their own photos and play with it, and just kind of realize how magical that is when you get it to work. And that’s kind of the main thing that people have been actually doing with the model, right? Turning yourself into a 3D figurine where it’s like you on a computer, you on a toy box, and then you as the figurine. So, like, you three times. Like, that way to be able to kind of like express yourself and see yourself in new ways and almost kind of like enhance your own identity has just been really fun. And that for me was like, “Oh, man, this is awesome.”
Stephanie Zhan: What was it about what Nano Banana did with you on the red carpet that was miles better than what everyone else has?
Nicole Brichtova: It looked like me. And it’s very—it’s very difficult for you to be able to judge character consistency on people’s faces you don’t know.
Pat Grady: Yeah.
Nicole Brichtova: And so if I saw, you know, a version of you that’s like an AI version of you, I might be okay with it. But you would say, like, “Oh, no, you know, like, parts of my face are not quite right.” And you can really only do it on yourself, which is why we now have evals on many team members where it’s like their own faces, and they’re looking at the model’s output with their own faces on it. Because it’s really the only way that you can judge whether or not someone looks like you.
Hansa Srinivasan: Yourself and faces you’re familiar with. I think, like, when we started doing it on ourselves and it’s like, I see Nicole a lot, so Nicole versus random person we might eval on, it’s just a very big difference in terms of judging the model capabilities. Yeah, I think it’s one of those things that preservation of the identity is so fundamental to these models actually being useful and exciting, but is surprisingly tricky. And that’s why we see a lot of other models not quite hitting it.
Pat Grady: Well, I was going to ask you, I would imagine that character consistency is not just an emergent property of scale. And so maybe two questions. One, I’m sure there’s stuff you can’t tell us, but what can you tell us about how you achieved it? And then two, was that an explicit goal heading into the development of this model?
Hansa Srinivasan: Yeah. So I would say—I mean, yeah, I think there’s definitely things that are tricky to say here, but I would say there’s, like, sort of different genres of ways to do image generation. And so that definitely plays a part in how good it is. And I think it was definitely a goal from the beginning.
Nicole Brichtova: It was definitely a goal, because we knew it was a gap with the models that we released in the past. And generally, consistency for us was a goal, because every time you’re editing images, like, you want to preserve some parts of it and then you want to change something. And prior models just weren’t very good at that, and that makes it not very useful in professional workflows. But it also means it’s not useful for things like character consistency.
And we’ve heard this for years from even advertisers who are, you know, trying to advertise their products and, like, putting them in lifestyle shots. It has to look like your product, like a hundred percent, otherwise you can’t put it in an ad. So we knew there was demand for it, we knew the models had a gap, and we felt like we had the right recipe, both in terms of the model architecture and the data to finally make it happen. I think what surprised us was just how good it was when we actually finally built the model.
Hansa Srinivasan: Yeah. Right. Because I think we felt like we had the recipe, exactly as Nicole said. But there’s still always—until you’re seeing the model, you finish training, you’re actually using it, you don’t know how close you’re going to get to that goal. And I think we were all surprised by that. Yeah, and I think the other thing is if we think about, like, what people expect out of editing when you edit on your phone apps or, like, Photoshop, you expect a high degree of preservation of things you’re not touching.
Pat Grady: Yeah.
Hansa Srinivasan: And depending on how the models are made and the design decisions behind them, that’s very tricky to do. But it’s something people really—like, it’s one of those things where, like, it’s shockingly technically difficult, even though it’s something I think a layperson who’s using the models would expect it to be, like, the basic thing about editing. It’s like, you don’t mess with the things you don’t want to be messed with.
Pat Grady: Yeah. Back to that moment where you saw yourself on the red carpet and, “Wow, that’s actually me.” And it took some of your colleagues a couple weeks to have the same experience, once they tried it with their own photos. The question is beyond, “Hey, that’s actually me,” you know, the qualitative test, is there some sort of an eval that you can put against that to make it quantitative, that we have achieved the thing that we set out to achieve here?
Hansa Srinivasan: Yeah. So I actually think face consistency, exactly for the reason Nicole said, is quite hard. It’s quite hard for other people to do.
Pat Grady: Yeah.
Hansa Srinivasan: I will say, in general, I think what we found with image generation in particular that’s unlocked a lot for us, is human evals are important. And so I think they’re foundational. We have a team that works on helping us build sort of good tooling and good practices for evals, and having humans actually eval these things that are very subtle. Like, if you think about image generation, like faces, aesthetic quality, these are things that are very hard to quantify. And so I think human evals have been a big game changer for us.
Nicole Brichtova: I think it’s a combination of there’s human evals, there is a very technical term, eyeballing, of the model results by different people. And there’s also just community testing. And when we do community testing, we start internally, and we have artists at Google and at Google DeepMind who play with these models. Our execs will play with these models. And that really helps, I think, kind of build that qualitative narrative around, like, why is this model actually awesome? Because if you just look at the quantitative benchmarks, you could say, like, “Oh, it’s 10 percent better than this model that we had before.” And that doesn’t quite grok that emotional aspect of, like, “Oh, I can now see myself in new ways,” or “I can now finally edit this family photo that I cut up when I was five years old.”
Pat Grady: Yeah.
Nicole Brichtova: “And I probably shouldn’t have—” people have done that. “And, like, I’m able to restore it.” Like, I think you really need that qualitative user feedback in order to be able to tell that emotional story.
Hansa Srinivasan: I think this is probably true of many of the gen AI and AI capabilities, but I think it’s especially true of visual media where it’s very subjective, versus if you think about something like math reasoning, logic reasoning where, like, you can really ground it in an answer, right? And so it’s more easy to have these very objective, automated, you know, quantitative evals.
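To make the side-by-side human eval idea concrete, here is a purely illustrative sketch of a pairwise comparison harness. It is not Google DeepMind’s actual tooling, and every name in it is hypothetical: raters see two models’ outputs for the same prompt with the sides shuffled, pick whichever better preserves the reference identity, and the harness tallies win rates afterward.

```python
# Purely illustrative pairwise human-eval harness (hypothetical names,
# not Google DeepMind's internal tooling).
import random
from collections import Counter

def run_pairwise_eval(items, ask_rater):
    """items: (prompt, model_a_image, model_b_image) triples.
    ask_rater(prompt, left, right) shows a human rater two images and
    returns "left", "right", or "tie" for whichever better preserves
    the reference identity. Sides are shuffled per item so raters
    cannot learn which slot belongs to which model."""
    tallies = Counter()
    for prompt, a, b in items:
        left, right = (a, b) if random.random() < 0.5 else (b, a)
        choice = ask_rater(prompt, left, right)
        if choice == "tie":
            tallies["tie"] += 1
        else:
            winner = left if choice == "left" else right
            tallies["model_a" if winner == a else "model_b"] += 1
    return tallies

def win_rate(tallies):
    """Fraction of decided (non-tie) comparisons won by model A."""
    decided = tallies["model_a"] + tallies["model_b"]
    return tallies["model_a"] / decided if decided else float("nan")
```

The hard part, as Nicole points out, is that the most important judgment, whether an output really looks like you, can only come from the person in the photo, so a harness like this still depends on having the right raters.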
Stephanie Zhan: To get to that level of character consistency from just one 2D image of someone is really, really hard. Can you walk us through maybe a little bit, what are the technical breakthroughs that helped you drive to that level of character consistency that we actually haven’t seen anywhere else?
Hansa Srinivasan: I mean, I think a key thing is having good data that teaches the models to generalize, right? And the fact that this is a base, it’s a Gemini model, it’s a multimodal foundational model that’s seen a lot of data and has good generalization capabilities. And I think that’s kind of the secret sauce is, like, you really need models that generalize well to be able to take advantage of that for this, right?
Nicole Brichtova: Yeah. And I think the other nice part about doing this in a model like Gemini is that you also get this really long context window. So, like, yes, you can provide one image of yourself, but you can also provide multiple. And then on the output side, you can also iterate across multiple turns and actually have a conversation with the model, which wasn’t possible before, right? One, two years ago, we were fine tuning on 10 images of you, and it took 20 minutes to actually get something that looked like you. And that’s why it never took off in the mainstream, right? Because it’s just too hard, and you don’t have that many images of yourself. It’s like too much work.
And so I think it’s both kind of the general, like, Gemini gets better, you benefit from that multimodal context window, and you benefit from the, like, long output and ability to, like, maintain context over a long conversation. And then you also benefit from actually paying attention to the data, focusing on the problem. A lot of the things we get better at come down to there’s a person on the team who’s, like, obsessed with making them work. Like, we have people on the team who are obsessed with text rendering, and so our text rendering just keeps getting better because that person just, like, is obsessed with the problem.
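As a rough sketch of the conversational editing Nicole describes, the example below uses the public Gemini API through the google-genai Python SDK: reference photos go in as parts of a chat turn, and later turns refine the result without re-uploading anything because the session keeps them in context. The model id “gemini-2.5-flash-image”, the exact SDK surface, and the file names are assumptions for illustration, not a description of the team’s internal setup; check the current Gemini API documentation before relying on it.

```python
# Minimal sketch of conversational image editing with the Gemini API.
# Assumptions (not from the conversation): the google-genai Python SDK,
# the model id "gemini-2.5-flash-image", and file names like "selfie.jpg".
from google import genai
from google.genai import types

client = genai.Client()  # expects an API key in the environment (e.g. GEMINI_API_KEY)
MODEL_ID = "gemini-2.5-flash-image"  # assumed Nano Banana model id; check the docs

def save_images(response, prefix):
    """Write any inline image parts of a response to disk."""
    for i, part in enumerate(response.candidates[0].content.parts):
        if part.inline_data is not None:  # image bytes live in inline_data
            with open(f"{prefix}_{i}.png", "wb") as f:
                f.write(part.inline_data.data)

# A chat session keeps earlier reference photos and edits in context,
# which is what makes multi-turn, conversational editing possible.
chat = client.chats.create(model=MODEL_ID)

# Turn 1: one or more reference photos plus a natural-language instruction.
with open("selfie.jpg", "rb") as f:
    selfie = types.Part.from_bytes(data=f.read(), mime_type="image/jpeg")
first = chat.send_message([selfie, "Put me on the red carpet at a film premiere."])
save_images(first, "turn1")

# Turn 2: refine the previous output without re-uploading anything.
second = chat.send_message("Keep my face exactly the same, but make it nighttime.")
save_images(second, "turn2")
```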
Hansa Srinivasan: Yeah, it’s like it’s not just about throwing high quantities of data in, right? I think that’s one thing that’s really important is there’s this, like, attention to detail, and quality of all the things you’re doing with the model. There’s a lot of small design decisions and decision points at every point. And I think that, like, detail orientedness of high quality of data and selections are really important.
Nicole Brichtova: It’s the craft part of, I think, the AI, which we don’t talk about a lot, but I think it’s super important.
Pat Grady: How big was the team that worked on it?
Nicole Brichtova: To ship it, it took a village.
Hansa Srinivasan: Yeah. Especially because we ship across many products, so I think there’s sort of the core sort of modeling team, and then there’s our close collaborators across all the surfaces.
Pat Grady: Yeah.
Nicole Brichtova: When you put them all together, you easily get into, like, dozens and hundreds. But the team who works on the model is much smaller. And then the people who actually make all the magic happen, we had a lot of infrastructure teams, like, optimizing every part of the stack to be able to serve the demand that we were seeing, which was really awesome. But really, like, to ship it, we’re joking that it takes a small country.
Pat Grady: When you build something like this, do you build it with particular personas or particular use cases in mind, or do you build it more with a capability-first mindset, and then once the capabilities emerge, you can map it to personas?
Nicole Brichtova: It’s a little bit of both, I would say. Like, before we start training any new model, we kind of have an idea of what we want the capabilities to be, and some design decisions like how fast is it at inference time, right? They also impact which persona you’re going after.
Pat Grady: Yeah.
Nicole Brichtova: So this model, because it’s kind of a conversational editor, we wanted it to be really snappy, because you can’t really have a conversation with a model if it takes, like, a minute or two to generate. That’s really nice about image models versus video models. Like, you just don’t have to wait that long. And so to us, from the beginning it felt like a very consumer-centric model. But obviously, we also have developer products and enterprise products, and all of these capabilities end up being useful to them. But really, we’ve seen a ton of excitement on the consumer side in a way that I think we haven’t before with our image models, because it was very snappy and it kind of made these pro-level capabilities just really easily accessible through a text prompt. And so that’s kind of how we started it out, but then obviously it ends up being useful in other domains as well.
Hansa Srinivasan: Yeah. And I think one of the, like, differences in philosophy—so, like, previously we’d worked on the Imagen line of models which were straight image generation. And I think one of the, like, big philosophical goal changes in these Gemini image generation models is generalization is a more foundational capability. So I think there is also a lot of—like, there’s things where we want this model to be able to be good at this, like, representing people and letting them edit their images and have it look like themselves. But I think there’s also a lot of things that are emergent from the goal of just having a baseline capable model that reasons about visual information. Like, I think one thing that’s surprised me, I guess, as a callback to your earlier conversation, is people can put in math problems, like a drawing of a math problem and, like, ask it to, like, render the solution, right? So, like, you can put in a geometry problem and say, like, “What is this angle?” And that’s like an emergent thing of a foundationally-capable model that has both, like, reasoning, mathematical understanding and visual understanding.
Pat Grady: Yeah.
Hansa Srinivasan: So I think it’s both. Yeah.
Stephanie Zhan: Can you maybe share, just out of curiosity, what’s a good way to understand maybe the family mapping and the relationship between Gemini powering Nano Banana, Veo, you know, all these other adjacent products and models that are all driven and benefit from the generalization and the scale of Gemini itself, how you co-develop, and then where you want to take it from here?
Nicole Brichtova: Our goal has always been to build the single most powerful model that can do all these things, right? You can take in any modality and you can transform it into any modality. And that’s the North Star. We’re obviously not quite there yet. And so on the way there, we had a lot of sort of specialized models that just got you great results in a specific domain. So Imagen was an example of that for image generation. Veo is an example of that for video generation and editing. And so I think we’re both kind of developing these models to push the frontier of that modality. And you get really useful outputs out of that, right? A lot of filmmakers are using Veo in their creative process, but you’re also learning a lot that you can then bring back into Gemini, and then make it good at that modality.
Image is always a little bit, I think, ahead of the curve, because you just have one frame, right? It’s cheaper both to train and at inference time. So I think kind of a lot of the developments you see in image, I expect you to see in video, like, six to twelve months down the line.
And so that’s always kind of been the goal. And so we have separate teams kind of developing these. And then I think with image, we’re now moving closer to Gemini and to that vision of that single most powerful model. And you will see that, I think, with some of the other modalities, and along the way we’ll release these experiences that are just really powerful and really exciting in that modality. So, like, Veo 3 was really awesome because it brought audio into video generation in a way that we haven’t seen before. Genie 3 was really awesome because it let you, in real time, kind of navigate a world. And so in order to push that frontier, it’s very hard to, like, do all of that at the same time right now in one model. And so to some extent, these specialized models are kind of a testing ground. But I would expect that over time, Gemini should be able to do all these things.
Stephanie Zhan: That’s so interesting.
Pat Grady: Okay, we gotta ask you about the name.
Nicole Brichtova: Ah.
Pat Grady: I suspect that the name was a bit of a—it’s an amazing product. I suspect that the name gave it a little bit of a boost because it’s so easy to remember and so distinct. So was it a happy accident, or is there some creative genius who knew that this is going to be just the right name?
Hansa Srinivasan: It was a happy accident. So I think, as many people know, the model went out on LMArena, where many models do. And part of that is you give it a code name. And if anyone hasn’t used LMArena, you get to put in your prompt. You’ll get back two responses from two models. They have code names until they’re publicly released. And I think it was like we had to—we were going out at, like, 2:00 a.m., and—Nicole’s our wonderful PM. There’s another PM we have, Nina. And someone messaged her being like, “What do we name it?” And she was really tired and exhausted and she was like—this was the name, the stroke of genius that came to her at 2:00 a.m.
Pat Grady: This is you?
Nicole Brichtova: It was not me. It was somebody on my team who named the model. I can’t take credit for this.
Hansa Srinivasan: Who works with Nicole. Another one of our PMs.
Nicole Brichtova: But what was really awesome is like, A) it was really fun. I think that really helps. It’s easy to pronounce. It has an emoji, which is critical for branding.
Hansa Srinivasan: She didn’t overthink it.
Nicole Brichtova: But she didn’t overthink it. And what was awesome is everybody just went with it once it went live. And I think it just, like, felt very googly and very organic, and ended up looking like the stroke of marketing genius. But no, it was a happy accident, and it just sort of worked out and people loved it. And so we leaned into it, and now there’s, you know, bananas everywhere when you go into the Gemini app, which we did because people were complaining that they were having a really hard time finding the model when they came into the app.
Hansa Srinivasan: Yeah.
Nicole Brichtova: And so we just made it easier.
Hansa Srinivasan: Yeah. Yeah, exactly. I think publicly, people were like, “Nano Banana. Nano Banana. How do I use Nano Banana?”
Stephanie Zhan: [laughs]
Hansa Srinivasan: I had someone at Google I work with be like, “How do I use Nano Banana?” And I was like, “It’s Gemini. It’s right there. Just ask for an image.”
Pat Grady: [laughs]
Hansa Srinivasan: Yeah, but I think that’s the thing is, like, I think Google’s always had this really fun brand, right? Like, it’s been a consumer-oriented company since its inception. And, like, I think it was really nice to play on that image people have of Google as a fun place, fun company and have this fun name.
Nicole Brichtova: It’s also just like a really nice path to fun being a gateway to utility, right? I think Nano Banana, and just the model in general and what you can do with it, like put yourself on the red carpet, do all the childhood dream professions you had, it’s like a really fun entry point. But what’s been awesome to see is that once people are in the app and they are using Gemini, they start to use it for other things.
Pat Grady: Yeah.
Nicole Brichtova: That then become useful in their day-to-day life. Like, you use it to study and solve math problems, or you use it to learn about something else. And so I think it’s maybe a little bit undervalued sometimes to, like, have a little fun, not just with the naming but also just like with the products that we build, because it kind of gets people in, gets them excited, and then it helps them discover other things that, you know, the models are awesome at.
Hansa Srinivasan: Yeah. I think other users, like my parents and their friends are using it. I think it’s because it, like, had this reputation. It was really easy, it was really fun. It felt unintimidating to try, and you try it and you’re like, actually, this is very easy to work. This works very easily. It’s very easy to interact with. There’s no, like—you know, technology I think can sometimes be intimidating to people, especially AI right now.
Pat Grady: Yeah.
Hansa Srinivasan: And I think the chatbot naturalness has broken a lot of the barriers, but maybe more so with younger people. And I think this, like, fun—like, yeah, my mom was making these images and having a great time, and then realized she can use it to, like, remove people from the background of her images. Like, these very practical things, right? It started very silly, turned very practical. Then people can use it to realize, like, actually, they can give you then diagrams or help them understand stuff. So I think there’s also, like, a big accessibility component.
Pat Grady: Yeah.
Stephanie Zhan: Where do you want to take it from here? Maybe both from a model side and from a product side.
Nicole Brichtova: On the product side, I think there’s kind of a couple areas. Like, on the consumer side I still think we have a long way to go to just, like, make these things easier to use, right? You will notice that a lot of the Nano Banana prompts are, like, a hundred words long, and people actually go in and copy-paste them into the Gemini app, and go through the work to make it work because the payoff is worth it. But I think we have to get past this prompt engineering phase for consumers and just, like, make things really easy for them to use.
I think on the professional side, we need to get into, like, much more precise control, kind of robustness, like, reproducibility to make it useful in actual professional workflows, right? So, like, yes, we’re very good at editing consistency and not changing pixels, but we’re not a hundred percent there. And when you’re a professional, you need to be a hundred percent there, right? Like, you really need kind of these precise, maybe even like gesture-based controls, like over every single pixel in the frame. So we definitely need to go in that direction.
And then I think there’s a general direction that I’m really excited about, which is just about visualizing information. So the example I had about sketch notes at the beginning, and somebody kind of hacking their way around using Nano Banana for that use case, you could just imagine being able to do that for anything, right? And a lot of people are visual learners. I think we haven’t really exhausted the potential of LLMs to be able to, like, help you digest and visualize information in whatever way is most natural for you to consume, right? So sometimes it’s a diagram, sometimes it’s an image, and sometimes maybe it’s a short video that you want to learn about some concept that you’re learning in a biology class or something like that. So I think that’s like a completely new domain that I’m really excited about, just these models getting better, and getting past the point where, you know, 95 percent of the outputs that you get out of these models are just text. Which is useful, but it’s not how we consume information in the real world right now.
Stephanie Zhan: That’s really interesting. So on the product side then, are you alluding to the fact that you might want to vertically integrate and build a little bit more product around it? And also, are you alluding to the fact that maybe the way you interact with some of these models isn’t just through pure language and prompting over time, but more UI?
Nicole Brichtova: Yeah. Yeah, I definitely think the chatbots, I think, are an easy entry point for people, because you don’t have to learn a new UI, you just talk to it and then you say whatever you want it to do, right? I think it starts to become a little bit limiting for the visual modalities, and I think there’s a ton of headroom to think about, like, what is the new visual creation canvas for the future, and how do you build that in a way that doesn’t become overwhelming, right? Because as these models can do more and more things, it’s very hard to explain to the user in something that’s very open ended, like what the constraints are, and how do you work around that and how do you actually use it in a productive way? So I’m really excited about people kind of building products in those directions.
And for us, you know, we have a team called Labs at Google that’s led by Josh Woodward, and they do a lot of this kind of like, frontier-thinking experimentation. They work with us really closely, where they take our frontier models and they think about, like, what’s the future of entertainment? What’s the future of creation? What’s the future of productivity? And so they’ve built products like NotebookLM and Flow on the video side. And I’m excited that maybe Flow could kind of become this place where you could do, you know, some of this creation and think about what that looks like in the future.
Hansa Srinivasan: I think in the short term, it’s very clear that this model has things that it’s not perfect at. And so in the short term, it’s obviously—it should work the way you expect it to every time, not just a lot of the time. And really make it so seamless, and fix all these small things where it’s just like a little bit inconsistent in its performance. I think long term, I think Nicole covered that, which is, to me, it’s in order to have that reality of really rich multimodal generation—so right now, if you ask Gemini to explain something, it’ll usually just explain in text, unless you ask it for images. But if you think about, like, the platforms that have really taken off in the last 10, 20 years for learning, we think of Khan Academy, it started on YouTube. We think about Wikipedia, it has a lot of images. Like, it’s very image focused. If you look up any math thing, you get diagrams. And so that should become more a natural part of the flow and part of the way you use these models. And to enable that from a modeling point of view, it goes back to, like we were talking about, this multimodal understanding and seamless generalization between modalities.
Nicole Brichtova: Maybe the other interesting area as we think about kind of, you know, these models being more proactive at pulling in, you know, whether it’s code or images or video, when it’s appropriate for the user intent, I think this other exciting—I started out as a consultant in my career, and so obviously I made a lot of slide decks in my time. I still do. And I think there are some of these use cases where you don’t actually really want to be in the weeds of creation. Like, what you really want is let’s say you’re updating your stakeholders on how a project is going, right? You want to pull in some context, maybe it’s meeting notes, maybe it’s a couple of bullet points, maybe it’s, you know, some other deck that you’ve created in the past. And then you maybe just want Gemini to go off and, like, do all the work for you, right? Like, pull that deck together, format it, create appropriate visuals to make it really easy to digest. And that’s something that you probably don’t want to be involved in, and it gets more into these agentic behaviors. Versus, I think, for some of these creative workflows, like, you actually want to be creating, you want to be in the weeds, you want to think about what the UI looks like that makes it easy for a user to accomplish the goal. And so, like, if I’m designing my house and I’m actually into designing my house, then I probably actually want to play with it and, like, play with textures and different colors and, like, what would happen if I remove this wall?
And so I think there’s kind of this spectrum of, like, very hands off, like just let the model go off and pull in relevant visuals, materials for a task that makes sense, all the way to how do you actually make a creative process more fun and remove the tedious parts and remove the technical barriers that exist today with tools that we have?
Hansa Srinivasan: This mix of giving the user fine-grained control, like the precision control they want, but also at the other extreme, having the model be able to understand the user request and anticipate the need and the outcome that it should be, and do all the intervening work in between.
Nicole Brichtova: It’s almost like when you actually hire a professional for something today, right? Like, when you hire a designer, you give them a spec and then they go off and then they do all that awesome work that they do because they have all this expertise. And so these models should be able to do that, and they can’t really do that in many domains today.
Pat Grady: What do you think the next competitive battleground is in this world?
Nicole Brichtova: I think there’s still work to be done on making these models more capable. And so this idea of having a single model that can take anything and transform it into anything else, I think nobody has really figured that out. But I do think in order to actually drive adoption, there’s probably two things. One is user interfaces. Like, we still rely very heavily on the chatbots—and we talked about this. It’s useful for some things and it’s a great entry point, but it maybe isn’t useful for all the things. And so I think starting to think much more deeply about who are the users, what are they trying to do, how can the technology be helpful, and then what product do you build around it to make that happen is probably one.
Pat Grady: Do you think five or ten years from now, the frontier will be advancing as quickly as it has advanced over the last few years?
Nicole Brichtova: Five to ten years from now feels like twenty years from now. [laughs] Just the space—and you guys probably see this, too. Like, the space is moving really quickly.
Pat Grady: Yeah.
Nicole Brichtova: And, you know, if you asked me two years ago, I would have told you the space is moving really quickly. If you ask me today, I will tell you it’s moving faster than it was two years ago.
Pat Grady: Okay, I’m going to ask you a very different question. [laughs] So I know Google’s very sort of careful and very concerned about deepfakes and that sort of thing. And I have to imagine when you saw how capable this model was, there’s a big conversation about okay, well how are we going to make sure people don’t use it in the wrong sorts of ways? How does that sort of a conversation go inside of Google? And are you guys sort of like happy with where it ended up?
Nicole Brichtova: I think it’s an ever-evolving frontier also, because it’s this mix of you want to give people the creative freedom to be able to use these tools, right? And you want to give users control to be able to use these tools in a way that doesn’t feel overly restrictive. And you want to prevent the worst harm, right? I think that’s always the balance that we spend a lot of time talking about.
And so obviously, when you look at the outputs of the model, there’s a visible watermark that says it’s been generated with Gemini. So that immediately indicates that it’s AI content. And then we also, in every output that we produce with our models—image, video, audio—there’s SynthID embedded, which is invisible watermarking. And so those are kind of the visible or invisible ways in which we verify that a piece of content is AI generated. We’re very invested in it, and we believe that it is really important to give users those tools to be able to understand that when they’re seeing something, it’s not a real video or it’s not a real image.
And then obviously, when we develop these models, we do a ton of testing internally and also with external partners to kind of find—as the models get more capable, you find new attack vectors and new ways that you have to mitigate for. And so that is like a very important part of model development for us, and one we continue to invest in. And as the models get better and as there’s new things that you can do with them, we also have to develop kind of new mitigations for making sure that we don’t create harm, but also still give users the creativity and the control in order to make these models usable in a product.
Hansa Srinivasan: I mean, I think it’s a very, very hard balance to strike, right? Because you will always have people using a tool in good faith, you’ll also always have people using it in bad faith. And I think it’s hard. It’s like, is it a tool? Is it something that has responsibility? So I think we take this very seriously. Users obviously are also responsible for what they do with the model. But SynthID really is an important technology that lets us release these capabilities to people, and have some faith that we can still verify and have a tool to combat the risk of misinformation. But it’s a super tricky conversation, and I think it’s one that I’ve seen everyone take very seriously. There’s a lot of conversations about how to balance both.
Stephanie Zhan: Is that the standard now across the industry, SynthID?
Nicole Brichtova: Yeah, it’s a Google standard.
Hansa Srinivasan: It’s the Google standard. I believe it’s in, like, every Google model: the Imagen line, Veo. They all have SynthID when you use them in any product surface.
Pat Grady: All right. You told us we can’t go five to ten years down the road because things are moving too fast, so we’ll go one to three years down the road.
Nicole Brichtova: [laughs] Thank you.
Pat Grady: Two questions. One, what will be possible that we can only dream about today? And two, what will the resulting change be to the way that we all live our lives?
Nicole Brichtova: I really hope that a year or two from now you could really get, like, personalized tutors, personalized textbooks in a way, right?
Pat Grady: Love it.
Hansa Srinivasan: Yeah.
Nicole Brichtova: There’s no reason why you and I should be learning from the same textbook if we have different learning styles and different starting points, but that’s what we do now. That’s how our learning environment is set up. And I think across all these breakthroughs, like, that should be very possible where you have an LLM tutor that just figures out your learning style, what are the things you like. Maybe you’re into basketball, and so I need to explain physics to you with basketball analogies. And so I’m really excited about learning just becoming way more personalized. And that feels very achievable. And we obviously have to make sure that we don’t hallucinate, and there’s like a high bar for factuality. And so we need to ground in sort of real world content. But that I’m really excited about. And that really, I think just removes a lot of barriers for people, right? To your question of, like, what the impact is going to be. I think it just becomes much more—it becomes much easier to learn basically anything in a way that’s very tailored to you that you just can’t do right now.
Pat Grady: Could that be a Google product surface?
Nicole Brichtova: Somebody should look into it.
Pat Grady: [laughs]
Hansa Srinivasan: Yeah. And I think for the way it’ll change how we live and work, I think working on these technologies, I’ve already seen how it changes the way we work, right? Because we obviously use them a lot. I’m getting married. We made our “save the dates” with our model. And so what I really think we’ll see is—and just work—the amount. Part of, I think, the reason that the innovation has accelerated is we have these models. You have, like, code assistants, you can use models to, like, filter things, to analyze huge amounts of data. Like, it’s drastically increased our own workflows. Like, what I can do this year versus two years ago is just like an order of magnitude more work.
And I think that’s true of the tech industry. It’s not true of a lot of other industries, just because that integration into their workflows or into their tooling hasn’t happened. So I think, you know, some people are like, “Oh, it’s gonna replace me,” but at least what I’ve seen is it really just actually changes the amount of work an individual can get done. What that means, like, for businesses or economically, I’m not sure, but I think it means we will just see people be more empowered to hopefully do more in the same amount of time. Like, maybe you don’t have to—you know, I have friends who are in consulting and they’re like, “I just spent a lot of time, like two hours, making slides, tweaking …”
Pat Grady: Moving logos around.
Hansa Srinivasan: Moving logos around. And, like, hopefully they won’t have to do that. They can actually spend time thinking about what the content of the slides should be, thinking, working with clients. And I think that’s hopefully what we will see in one to two years.
Stephanie Zhan: Given the trajectory that you see in these capabilities, are there some interesting areas that you think startups should go do that Google itself might not get into?
Nicole Brichtova: I think there’s a ton of spaces even just in the creative tools. Like, I think there’s a ton of room for people to figure out, like, what do these UIs of the future look like? Like, what is the creative control? How do you bring everything together? We see a lot of people in the creative field work across LLMs, image, video and music in a way where they have to go to four separate tools to be able to do that. So, like, a lot of people ideate with LLMs, right? Like, give me some concepts, like, here’s an idea that I have. Once you’re happy with that, you take it to an image model, you start to think about where are the key frames that I want to have in my video. You spend a lot of time iterating there, then you take it to a video model which is yet another surface. And then at some point, you want to have sound and music and mix it all together, and then you actually want to do maybe some heavy-handed editing and you go to some of the traditional software tools. That feels like these kind of workflow-based tools are probably going to spin up for a lot of different verticals. So creative activity is just one example of it but, you know, maybe there might be one for consultants so that you can more efficiently make slide decks and presentations and pitch decks to clients. And so I think there’s a lot of opportunity there that, you know, some of the companies may not go into.
Hansa Srinivasan: There’s a lot of how do we make this technology useful for X workflow, right? Like, sales. Like, I’m saying a lot of things I don’t know about in companies, like financial workflows, but I imagine there’s a lot of tasks that could be automated, could be made much more efficient. And I think startups are in a good position to really, like, go understand the specific client use case, need, that niche need, and do that application layer versus what we really focus on is the fundamental technology. I think I’m just really excited by the number of people who’ve been excited by this model.
Pat Grady: Yeah.
Hansa Srinivasan: If that makes sense. Like, a lot of people in my life, like, a lot of aunts, uncles, my parents, like, friends, like, they’ve used chatbots. They ask it things, they get information. My mom loves to ask chatbots about health information. But there’s something about, like, visual media that really excites people, that it’s the fun thing but it’s not just fun, it’s exciting, it’s intuitive. The visual space is so much of how we as humans experience life that I think I’ve loved how much it’s moved people, like, emotionally and excitement wise. Like, I think that’s been the most exciting part of this for me.
Stephanie Zhan: My kids love it.
Hansa Srinivasan: Yeah.
Stephanie Zhan: My three-year-old son tied our dog leash, which is this, like, fraying, you know, brown rope, over himself so he looked like a warrior. I took a picture of him and turned him into this warrior superhero.
Hansa Srinivasan: Yeah, exactly.
Stephanie Zhan: And it makes him feel superhuman.
Pat Grady: Yeah.
Stephanie Zhan: And my husband will read to him, so he uses Google Storybook to read him these stories about lessons that he learned in school. You know, if there was, like, an incident on the playground with another kid, or adjusting to a new school. And I mean, it’s made these characters that look like him and my husband and me and our dog and our daughter in these fun stories and lessons that we’re trying to teach him, to the personalization that you talked about. So I really, really love this future. It’s going to be totally different for him growing up.
Nicole Brichtova: And it’s awesome, right? Because this is a story for, you know, one or five people that you would have never had made, right? And other people probably don’t want to read it—I would love to, if you want to.
Stephanie Zhan: Yeah. [laughs]
Nicole Brichtova: But I think we’re really now making it possible to, like, tell stories that you never could, and in a way where, like, the camera allowed anyone to capture reality when it became very accessible. You’re kind of capturing people’s imagination. Like, you’re giving them the tools to be able to, like, get the stuff that’s in their brain out on paper visually in a way that they just couldn’t before, because they didn’t have the tools or they didn’t have the knowledge of the tools. Like, that’s been really awesome.
Pat Grady: That’s a nice way to put it.
Stephanie Zhan: Thank you so much.
Nicole Brichtova: Thank you for having us.
Stephanie Zhan: It’s awesome to have you.