How End-to-End Learning Created Autonomous Driving 2.0: Wayve CEO Alex Kendall
Alex Kendall founded Wayve in 2017 with a contrarian vision: replace the hand-engineered autonomous vehicle stack with end-to-end deep learning. Wayve built a generalization-first approach that can adapt to new vehicles and cities in weeks. Alex explains how world models enable reasoning in complex scenarios, why partnering with automotive OEMs creates a path to scale beyond robo-taxis, and how language integration opens up new product possibilities. Wayve demonstrates how the same AI breakthroughs powering LLMs are transforming the physical economy.
Listen Now
Summary
Alex’s insights center on building scalable AI systems that generalize across diverse environments, providing a blueprint for deploying embodied AI systems in the real world.
End-to-end learning beats hand-engineered systems: Alex built Wayve on the conviction that one large neural network would outperform modular, hand-coded robotics stacks. Despite years of skepticism about safety and interpretability, this approach, now called AV 2.0, has proven superior, enabling Wayve to deploy in hundreds of cities without requiring high-definition maps or excessive infrastructure.
Generalization is the key to scale: Training on diverse data from multiple countries, vehicles, and sensor configurations allows Wayve’s AI to adapt to new environments in weeks rather than years.
World models enable reasoning in physical space: Wayve trains generative world models that simulate how environments will evolve, creating emergent behaviors like cautiously nudging forward at occluded turns. These models don’t replace real-world data but recombine and magnify it, dramatically improving data efficiency while enabling the vehicle to reason through novel scenarios it has never encountered before.
Product culture matters as much as research excellence: Transitioning from an AI research lab to an automotive supplier required Wayve to adopt the reliability, quality standards, and brand differentiation focus of manufacturers building millions of vehicles.
Collaborating with industry accelerates distribution: Deep integration with automotive OEMs including leveraging their software-defined vehicles, infrastructure, and standards, enables faster, more affordable, and globally scalable deployment of autonomous technology.
Transcript
Chapters
Introduction
Alex Kendall: You know, if you’re building a vertically-integrated robotic solution, maybe you can go deep, but our ambition is to be the embodied AI foundation model for all of the best fleets and manufacturers around the world. And to do that, unless we want to overload the company by building a separate neural network for each application, we need to be able to generalize, we need to be able to amortize our costs over one large intelligence and to be able to very quickly adapt to each different application that our customers care about. That’s what we’re trying to push.
Sonya Huang: Today we’re talking with Alex Kendall, CEO of Wayve, about the shift from software 1.0 to 2.0, or from classical machine learning to end-to-end neural networks in autonomous driving. Wayve sells an autonomous driving stack to auto OEMs, similar to Tesla FSD, but for non-Tesla automobiles. Major car manufacturers globally, like Nissan, are choosing Wayve to power their AV stacks. Alex started Wayve back in 2017 when most self-driving software stacks were massive hand-coded C++ code bases covering every possible edge case, like navigating around double-parked cars. Alex bet the farm from the beginning on an end-to-end neural net approach to self-driving and on the use of synthetic data and world models as the ultimate path to generalization and scaling. Today, that architecture is reshaping AV and all of physical AI, including robotics. Enjoy the show.
Main conversation
Pat Grady: Alex, thanks for joining us on the show.
Alex Kendall: Hey, Pat. Hey, Sonya.
Pat Grady: One of the things that is very special about your company is that it sort of typifies AV 2.0, meaning a new architectural approach that I think is kind of demonstrated to be superior to the AV 1.0 approach that people toiled with for so many years. Can we just start by defining what was AV 1.0? What is AV 2.0?
Alex Kendall: For sure. When we started the company in 2017, the opening pitch in our seed deck was all about how the classical robotics approach at the time was to take perception, planning, mapping, control, essentially break the autonomy problem down into a bunch of different components and largely hand-engineer them. And our pitch was that, okay, we think the future of robotics is not going to be a system that’s hand-engineered to drive with a lot of infrastructure like high-definition maps; instead, we thought the future of robots would be intelligent machines that have the onboard intelligence to make their own decisions. And of course, the best way we know how to build an AI system is with end-to-end deep learning.
So for the last 10 years we’ve been promoting a next-generation approach, AV 2.0, that replaces that stack with one end-to-end neural network. Now of course, that may seem more obvious today, but it was contrarian for many, many years. And I think today it’s maybe unfair to make that basic distinction, because anyone who’s worth their salt will use deep learning in various parts of the stack. But what you see in more incumbent solutions to autonomous driving is deep learning for perception and maybe for each different component, but still a lot of hand-engineered interfaces between them, still a lot of high-definition map infrastructure, and perhaps reliance on a lot of hardware.
And our solution has moved on as well: today, rather than just being an end-to-end network, we start to talk about foundation models, about more of a general-purpose intelligence, one that can understand not just how to drive that car, but many cars with different sensor architectures, with different use cases. And so really, it all boils down to: how do we build the most intelligent robot that can scale without needing onerous infrastructure?
Pat Grady: So Wayve is sensor inputs, motion output, gigantic neural net in the middle.
Alex Kendall: That’s right, at a very simple level. But some of the things that are maybe different from the story we’ve all heard with large language models is that with autonomous driving, of course, there are some interesting new factors. One is, of course, safety. We need to make sure the system is safe by design, and what that means is that we can’t just pump more data in and hope that hallucinations go away. We need to design an architecture that is still end-to-end and data-driven, but is functionally safe and lets us build a robust behavioral safety case.
So that introduces some interesting architectural challenges. And then, of course, we also need to run in real time onboard a robot, onboard a vehicle. And so dealing with the onboard compute and onboard sensor limitations makes it an interesting challenge. But yes, it’s the same narrative we’re seeing play out in robotics that we’ve seen play out in all these other AI fields like language or game-playing agents: an end-to-end, data-driven learning solution is outcompeting anything we can hand-code. And what we’re excited to be pioneering is that exact same narrative here in robotics and autonomous vehicles.
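To make “sensor inputs, motion output, gigantic neural net in the middle” concrete, here is a minimal PyTorch-style sketch of an end-to-end driving policy. It is an illustration only, not Wayve’s architecture: the camera count, route-command encoding, network sizes and imitation loss are all assumptions made for the example.

```python
# Minimal end-to-end driving policy sketch (illustrative only, not Wayve's architecture).
import torch
import torch.nn as nn

class EndToEndDrivingPolicy(nn.Module):
    """Cameras (and a route command) in, future trajectory out; everything in between is learned."""

    def __init__(self, num_cameras: int = 6, horizon: int = 20):
        super().__init__()
        # Shared image encoder applied to every camera view.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Fuse per-camera features with a simple route command (e.g. left / straight / right).
        self.fusion = nn.Sequential(
            nn.Linear(num_cameras * 64 + 3, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        # Decode directly to a future trajectory: (x, y, speed) per future timestep.
        self.trajectory_head = nn.Linear(256, horizon * 3)
        self.horizon = horizon

    def forward(self, images: torch.Tensor, route_command: torch.Tensor) -> torch.Tensor:
        # images: (batch, num_cameras, 3, H, W); route_command: (batch, 3) one-hot.
        b, n, c, h, w = images.shape
        feats = self.encoder(images.view(b * n, c, h, w)).view(b, -1)
        fused = self.fusion(torch.cat([feats, route_command], dim=-1))
        return self.trajectory_head(fused).view(b, self.horizon, 3)

# Trained end to end, e.g. by imitating logged expert drives.
policy = EndToEndDrivingPolicy()
images = torch.randn(2, 6, 3, 128, 128)
command = torch.tensor([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
predicted = policy(images, command)                   # (2, 20, 3)
expert = torch.zeros_like(predicted)                  # stand-in for real expert trajectories
imitation_loss = nn.functional.l1_loss(predicted, expert)
```

The point is the shape of the system: raw sensor tensors go in, a future trajectory comes out, and everything in between is learned from data rather than hand-coded.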
Pat Grady: And when you guys started this in 2017, and it was a very contrarian approach, when people from the industry said, “Well, that’ll never work because …” how did they finish that sentence?
Alex Kendall: I could count hundreds of those meetings.
Sonya Huang: [laughs]
Alex Kendall: Yeah, typical arguments were, look, it’s not safe, it’s not interpretable, can’t understand what it’s doing, or even simply, it doesn’t make sense—we haven’t heard of this AI thing. And look, I think five, ten years ago it was probably reasonable to say end-to-end deep learning wasn’t interpretable, but I don’t think that’s true today. I think today we have a lot of really great tools for understanding and responding to insights about the way these deep learning systems reason. But moreover, I think if you have the ambition to build any intelligent machine, I think it’s naive to think you can build a complex intelligent machine and actually make it, let’s say, strictly interpretable to the point where you can point to a single line of code or a single thing that causally made the outcome occur. The beauty of intelligent machines is that they are so wonderfully complex, and there, I think, the way that we’re going to not just design them but understand them is through a data-driven structure.
Sonya Huang: Can you say more about the before and after of the AV 1.0 stack and the billions of lines of code that go into those systems versus the 2.0 systems today? And how quickly is that changing? Because my sense is that deep learning and large neural nets hitting the physical economy is a much more recent phenomenon than people might appreciate.
Alex Kendall: Well, especially when you think about the path to distribution and deploying these systems. I mean, the automotive industry has just gone through a seismic shift in bringing out software-defined vehicles and putting the right hardware on these cars to be able to make them drive. Maybe one common point of debate is camera-only versus camera-radar-LiDAR as a sensor approach to autonomy.
And just to be clear on our position at Wayve, we want to build an AI that can understand all kinds of different sensor architectures. There are going to be some cases where a camera-only solution makes sense and some where camera-radar-LiDAR does, and we train our embodied AI model on all of those permutations from very diverse data sources. The car we just drove in is a camera-only stack. We’ve got other cars that we work on with partners that have radar and LiDAR. And of course, there are different trade-offs that you take there.
But more generally, we are seeing mass-produced cars from the best manufacturers around the world that have a GPU on board, surround cameras, surround radar and sometimes a front LiDAR. And what’s beautiful about that is there’s now the opportunity to see this AI come out and benefit people around the world. That kind of software-defined infrastructure is happening in automotive. It has perhaps not yet happened to the same degree in other robotics verticals, but I’m sure the market’s going to move that way as well. And in general, having the right level of compute and infrastructure in a scalable way and opening up these platforms to AI is, I think, what’s really making this possible. And that’s gone through a tipping point in the last couple of years.
Sonya Huang: Hmm. And your AV 2.0 approach has flipped from contrarian to, I’d say, consensus, maybe in the last two or three years. Do you think it was FSD 12 that did it? Or when did that mindset start to shift?
Alex Kendall: I miss the contrarian days, but even today—I was in a conversation this morning where a lot of folks still say, “Yes, we need end-to-end AI.” They’ve bought into the big tech narrative around the future of AI. But they say things like, “We need end-to-end AI with hard constraints or with safety guarantees.”
And there can still be some belief that a hybrid approach is the way to go, where you want to try and combine a rules-based stack and an end-to-end learned stack. But often, these approaches get the worst of both worlds, or just add cost and complexity. So I still think there is a distribution in the market between those that are leaning in and moving fast and those that perhaps have some catching up to do.
But of course, in terms of the breakthrough that made this world-changing and mainstream for all of us who have been working in deep learning, we’ve got to credit the large language model breakthroughs. I think they’ve inspired the world and opened up the market’s mind to be curious about this technology.
But also what we’ve been doing at Wayve, you know, a year ago we were just driving in central London. Central London, I think, is a great proving ground, because it’s this unstructured, incredibly complex and dynamic city that our AIs learn to navigate very smoothly, safely and reliably. But in the last year we’ve taken it to highways, to Europe, Japan, North America. Our cars were in New York City last week, driving around there. And so bringing it global, being able to take it to different manufacturers’ vehicles and show a product-like experience: this growth has, I think, also really opened up a lot of inspiration around the world.
Sonya Huang: Why is it that you’re able to launch in hundreds of cities worldwide, while some of the AV 1.0 companies need to actually go out and build an HD map? Just say a word on how the technical differences actually lead to differences in how the machine is able to learn and how you’re able to roll out.
Alex Kendall: Autonomous driving is all about generalization. Generalization means being able to reason about or understand something you’ve never seen before. Every time you go for a drive, you’re going to see something new for the first time. What did we see today? We saw a road worker rolling out some carpet thing in front of the road, on a pedestrian crossing, but not wanting to step out. And we had to reason about whether we could pass them without yielding, for example. That’s just an example from earlier today, but you could think about all the new things you see on the roads every time you drive. You’re never going to see every experience in your training data, so that means you have to be able to reason and generalize to things you haven’t seen before to be safe, to be useful around the world.
And that’s what has motivated our entire approach. Take a manufacturer giving us one of their vehicles and, within a couple of months, us being able to drive it on the road. A couple of weeks ago, in September this year, we unveiled a vehicle to media with Nissan in Tokyo. Just four months earlier was the first time we’d even driven in Tokyo and got hands on this vehicle. Four months later, we had media driving the car and experiencing it. And that was a new country and a new vehicle for us.
So what that showed is that our AI was able to generalize. It’s trained on very diverse data from around the world, it’s trained on diverse sensor sets and vehicles, and so it was able to understand that vehicle’s new sensor distribution and, of course, the complexity of driving around in central Tokyo.
So I think that’s a really great demonstration of generalization. And if we think about if you’re building a vertically-integrated robotic solution, maybe you can go deep, but our ambition is to be the embodied AI foundation model for all of the best fleets and manufacturers around the world. And to do that, unless we want to overload the company by building a separate neural network for each application, we need to be able to generalize, we need to be able to amortize our costs over one large intelligence, and to be able to very quickly adapt to each different application that our customers care about. That’s what we’re trying to push.
Sonya Huang: You mentioned reasoning in there, in terms of how the model is reasoning through this construction worker: what do I do now? In the LLM world, obviously, reasoning is its own separate track, with lots of techniques for scaling inference-time compute. Are you deliberately training your models to reason? Is it an emergent behavior of the models? Say more about what you mean by reasoning.
Alex Kendall: We are. And I think reasoning in the physical world can be really well expressed as a world model. In 2018, we put our very first world model approach on the road. It was a very small, 100,000-parameter neural network that could simulate a 30×3 pixel image of a road in front of us. But we were able to use it as this internal simulator to train a model-based reinforcement learning algorithm. There’s a fun blog post if you want to see the history on that.
But fast forward to today, and we’ve developed GAIA. It’s a full generative world model that’s able to simulate multiple cameras and sensors in very rich and diverse environments. You can control it and prompt the different agents or the scene in it.
And that’s an example of reasoning, where we can train in the ability to simulate how the world works and what’s going to happen next. What happens when you bring this kind of representation onto the road is you get some really nice emergent behavior, like today when we were driving around unprotected turns that were occluded: you saw the car nudge forward until it could see for itself and then complete the turn.
Sonya Huang: Yeah.
Alex Kendall: Or when it’s foggy in London, you see the car slow down and drive to what it can reason about. And by training it with that level of understanding, it gives that level of emergent behavior that helps it really understand particularly complex multi-agent scenarios. I think that’s key for getting safe and smooth autonomous driving.
Sonya Huang: So the world models are really key to teaching the model how to reason through new scenarios.
Alex Kendall: A hundred percent.
Pat Grady: You mentioned earlier the diversity of your data. Say a word about where all the data comes from.
Alex Kendall: It’s becoming an enormous amount of data because, of course, unlike the language domain or image domain, we’re dealing with a typical self-driving car that has a dozen multi-megapixel cameras, radar, maybe LiDAR. When you aggregate that up, it’s very quickly tens or hundreds of petabytes of data. So it’s an enormous amount of data you have to train on, but it’s the diversity that’s really key.
And we’ve solved for diversity in two ways. The first one is by becoming a trusted partner across the industry and aggregating data across many different sources, from dash cams to fleets to manufacturers to robot operators. And the second one is being able to filter and really understand the data. Here we’ve worked hard to develop different unsupervised learning techniques to be able to cluster the data and find unusual or anomalous experiences and, of course, find the scenarios that our system is performing poorly at, and then drive the learning curriculum on those.
But yeah, today we learn from a diverse set of vehicles, sensor architectures and countries, and that’s really one of the key things that drives the level of generalization.
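As a rough illustration of that curation loop (cluster the fleet data, surface anomalies, and prioritize scenarios the current model handles poorly), here is a small sketch. The embeddings, error scores and weighting are invented for the example; this is not Wayve’s pipeline.

```python
# Toy data-curation loop: cluster fleet clips, find anomalies, prioritise weak spots (illustrative only).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
clip_embeddings = rng.normal(size=(10_000, 256))   # stand-in for unsupervised clip embeddings
model_error = rng.random(10_000)                   # stand-in for per-clip error of the current model

# Cluster the data to understand its structure (what is common vs. rare).
kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(clip_embeddings)
distance_to_centroid = np.linalg.norm(
    clip_embeddings - kmeans.cluster_centers_[kmeans.labels_], axis=1
)

# Clips far from any cluster centre are unusual; combine that with where the model struggles.
anomaly_score = (distance_to_centroid - distance_to_centroid.mean()) / distance_to_centroid.std()
priority = 0.5 * anomaly_score + 0.5 * model_error

# Top-priority clips get upweighted in the next training run.
curriculum = np.argsort(priority)[::-1][:1_000]
print(f"selected {len(curriculum)} clips for the next curriculum")
```

The design choice worth noting is that the curriculum is driven jointly by rarity and by current model weakness, so the training mix keeps shifting toward whatever the system currently handles worst.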
Sonya Huang: Does the growth of world models and simulated data mean that you just don’t need as many actual on-road miles?
Alex Kendall: I think there are two sides to that question, right? On the one side, yes, learning efficiency really matters. On the other, you can’t rely on that alone. At the limit, if we take our current approach and just scale it up, I’m sure it’ll produce generic level five driving. If you have unlimited training data, this is really just a lookup table with some prior experience, but that’s not economically or technically feasible.
And so the question is: how can you train this to be the most efficient, data-efficient system? Because I think efficiency will lead to not just improved cost, but faster time to market and more intelligence.
So efficiency comes from a number of different factors. Most importantly, there’s the data curriculum you put in place, but then also the learning algorithms: how do you magnify the learning you have? And I think world models are a really great opportunity for that. They generate synthetic data and synthetic understanding that doesn’t replace real-world data, but recombines it and magnifies it in new ways. It lets you pull in interesting insights. And I think these kinds of approaches can really improve data efficiency.
But across the board, I think, working under resource constraints has forced our team to develop so many innovations. I’d also call out just the workflow, because in traditional robotics, when you’re tuning parameters or algorithms or designing geometric maps and things like this, there’s very well-established cultures and workflows.
But our team, when we have 50 model developers working on one main production model, or when we have an end-to-end net that we need to understand and introspect, or even the way that we deploy these systems to simulation or to the road and feed back the results, the entire culture at Wayve has been developed from the ground up for embodied AI, for end-to-end deep learning for driving. The data infrastructure, the simulation, the safety licensing before we put systems on the road: this has not been a hedge or a side bet for us, this is the entire essence of our culture. And I think doing this under resource constraints and with full mission-driven conviction has led to a bunch of interesting innovations, because look, getting to where we are today, everything is about iteration speed.
Pat Grady: Speaking of your culture, I’m picturing a bunch of AI research types, machine learning engineers, that sort of thing. How does the culture of your organization differ from similar applied lab-type environments, given the customer base that you serve, given that you’re going after the automotive industry specifically with all of its quirks around, you know, supply chain and all of its requirements around safety? And so how does that influence the culture of your business?
Alex Kendall: Hugely. In fact, for the first few years of Wayve, you know, we were really a group of passionate embodied AI researchers. But in the last couple of years, I’m really, really proud of how our team has built out deep expertise in understanding the automotive industry, but also the ability to reliably deliver to our partners there. And that’s a different culture. It’s a culture I’ve really grown to respect, because when you’re building millions of cars, you know, the level of reliability and MTTF you need there is extraordinary.
Pat Grady: What have you all learned from them? I mean, I’m sure part of your job is to teach them about what’s going on in the world of AI. What have you learned from them?
Alex Kendall: I think some of the main things I’ll call out have been efficiency and reliability, and the difference between technology and a product. I mean, the level of reliability required, but also the level of quality needed to really robustly prove these systems out before deployment, and the pride that these companies take in that, has been exceptional.
Another thing has been perhaps the sense of brand differentiation, and the desire for, you know, how do you want your car to drive? How can the driving personality really match the brand’s preferences? How can you provide an experience that really gives brand differentiation?
And the great news is that I think we’ve been able to riff and brainstorm off these and come up with some really neat technical ideas down that vein. But yeah, ultimately safe, high quality and personalizable AI has been some great feedback we’ve got from the industry.
Sonya Huang: Can you talk about your path to market actually in partnering with the other OEMs? How did you decide to do that? And then how do you think the market landscape will play out for how autonomy rolls out?
Alex Kendall: Yeah, of course. Great question, Sonya, because since the beginning of Wayve, we’ve been focused on the pitch I gave around end-to-end deep learning being the approach to autonomy. But we’ve tried a number of different go-to-market approaches over the years.
But in the last couple of years I’ve been hugely energized about working and partnering with the biggest and best consumer automotive manufacturers around the world. Why is that? Well, I mentioned how they’ve begun to introduce software-defined vehicles. So they have the infrastructure to work with autonomy. There’s the market belief that this is a technology that can really thrive. And also, it’s the chance to get to scale far beyond what we’re seeing with the city-by-city robo-taxis we’re seeing right now.
But moreover, these are OEMs that are investing in the right infrastructure to go from not just driver assistance but to eyes-off autonomy, where you can actually take liability for the drive and give the user a safe—and give them time back from their driving experience. So that’s awesome.
I think when you think about the market, there are 90 million cars built each year, and some manufacturers that are building the autonomy systems themselves, like Tesla, build a couple of million of those. But for the vast majority of the market, I think there’s an opportunity to partner, to work with some of these innovative platforms, and to bring our AI to market to make these autonomous products possible.
And it will only grow from there. These manufacturers don’t want to stop at driver assistance. We’re working together to build eyes off and driverless robo-taxi products. But the key thing is that by avoiding retrofitting our own hardware on these vehicles, by putting them in natively as a software integration, we can move fast at scale, we can build low-cost vehicles that can be homologated all around the world. And this is going to be the path to see tens and hundreds of thousands of robo-taxis rolled out around the world at an affordable price. And of course, this is all possible because of the level of generalization that this AI enables.
Sonya Huang: Tesla FSD is just such a game-changing product, and my friends who have it, just they can’t imagine driving any other way. And so it’s really cool that you’re going to empower the 88 million other vehicles sold every year to be able to sell that experience as well.
Alex Kendall: A hundred percent. It’s one of those things that a lot of people would jump in our car and come for a drive being skeptical about autonomy, but without exception they step out with a smile on their face. It’s a magical experience. And yeah, I can’t wait for people to be able to try it around the world and make autonomy not just a robo-taxi tourism experience, but bring this experience to people in eventually every city.
Sonya Huang: What do you make of the sensor fusion confusion debate? The one that plays out on Twitter every year or so, about Tesla getting confused if there’s both camera and LiDAR coming in. Sorry, radar.
Alex Kendall: I think it’s the wrong debate to be having. It’s not the frontier question. The industry, I guess outside of Tesla, has really coalesced around a common architecture of surround cameras, surround radar and a front-facing LiDAR. Now this costs under $2,000, because it’s automotive-grade components, not the retrofit robo-taxi components you see today.
But having frontier, automotive-grade GPU compute on the car and that kind of sensor architecture is a really great platform to build L3, L4 autonomy, eyes-off or driverless. It gives you the necessary redundancy, and it lets you deal with edge cases that cameras alone can’t. I agree cameras can get you to human level, but we want to go beyond human level.
And so I think this kind of architecture is affordable, scalable, it’s got the supply chain for mass manufacturers, and it can eliminate all accidents and really drive superhuman levels of performance.
So that’s what we’re seeing many manufacturers bring out on their vehicles and where we’re integrating our AI. Of course, for a driver-assistance system, camera-only can work, and it can get you to a human-level driverless system too. And I should clarify: you can look at different stats, but 90-something, 95 percent or above, of accidents are unfortunately caused by human error. So not only can you be human level, but you can eliminate a lot of the inattention and the accidents caused by it. But there are still accidents that would require perception capabilities beyond vision to solve. And if we want to tackle that long tail, there are many ways to solve it. One of them would be to bring in other sensing modalities like radar and LiDAR. So we’re excited to be working with those kinds of platforms, but crucially, natively integrated into the OEMs’ vehicles themselves.
Sonya Huang: Is it the same neural net that can drive on one OEM’s car and another’s car? And how does that even work? Because I imagine each vehicle has slightly different position cameras, things like that.
Alex Kendall: It comes from the same family. So we regularly train very large-scale models. Of course, we iterate them month on month, but that’s one model that’s common to all of the fleets that we work with. As you optimize to a specific sensor set or a specific embedded target, of course, you can start to specialize the model. But the beauty is that 99 percent-plus of the cost and the time and the effort is training the base model, and then we can build very efficient personalization for the specific customer. And so this lets us scale, but gives us the ability to squeeze it onto very efficient real-time platforms and make it adapted to a specific use case.
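The pattern described here, one expensive shared base model plus cheap per-vehicle specialization, resembles adapter-style fine-tuning from large-model practice. Below is a minimal sketch under that assumption; the frozen backbone, adapter module, dimensions and loss are illustrative, not a description of Wayve’s actual mechanism.

```python
# Toy "shared base model + per-vehicle adapter" fine-tuning (illustrative, not Wayve's mechanism).
import torch
import torch.nn as nn

class BaseDrivingModel(nn.Module):
    """The large shared model: generic driving features in, future trajectory out."""

    def __init__(self, feature_dim: int = 512, horizon: int = 20):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(feature_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
        )
        self.head = nn.Linear(1024, horizon * 3)

    def forward(self, features):
        return self.head(self.backbone(features))

class VehicleAdapter(nn.Module):
    """Small per-vehicle module: maps this vehicle's sensor features into the base model's input space."""

    def __init__(self, vehicle_sensor_dim: int, feature_dim: int = 512):
        super().__init__()
        self.project = nn.Sequential(
            nn.Linear(vehicle_sensor_dim, feature_dim), nn.ReLU(),
            nn.Linear(feature_dim, feature_dim),
        )

    def forward(self, raw_sensor_features):
        return self.project(raw_sensor_features)

base = BaseDrivingModel()
for p in base.parameters():
    p.requires_grad = False                       # the expensive shared model stays frozen

adapter = VehicleAdapter(vehicle_sensor_dim=640)  # a new OEM vehicle with a different sensor set
optimizer = torch.optim.Adam(adapter.parameters(), lr=1e-4)

# One illustrative adaptation step on that vehicle's own driving data.
raw = torch.randn(8, 640)
target_trajectories = torch.randn(8, 60)
loss = nn.functional.l1_loss(base(adapter(raw)), target_trajectories)
loss.backward()
optimizer.step()
```

Because only the small adapter is trained, the bulk of the cost stays in the shared base model, which mirrors the economics Alex describes: most of the effort amortized across every fleet, with a thin, efficient layer per customer.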
Sonya Huang: Are you going to let Pat personalize a super aggressive driver model?
Pat Grady: [laughs]
Sonya Huang: You’re gonna need to.
Alex Kendall: What driving style would you like, Pat?
Pat Grady: Yeah, pretty aggressive. Safe, very safe. But you know …
Alex Kendall: We can do that. Yeah, we find it’s really funny when you build distributions around driving behavior. Yeah, you can really tell—from the human training data we have, you can really tell when it goes from being helpfully assertive, let’s say, to unhelpfully aggressive. And we can draw a clean line there.
Pat Grady: There you go.
Alex Kendall: What about you, Sonya? How was the drive we just had?
Sonya Huang: Fantastic. It was comfortable, it was safe. And it felt very human, actually. Like, the way it was kind of nudging up when it couldn’t see on the turn. It was very human.
Alex Kendall: Yeah. Well, that’s about as complex as it gets in Silicon Valley, but come to Tokyo or London, or downtown San Francisco where I was over the weekend, and yeah, you really need the ability to predict and reason about other folks around you to be able to drive in a human-like way. Think about smoothly going around double-parked vehicles, dealing with other dynamic obstacles, or the prevailing flow of traffic not being aligned to the specific lane when there’s still a human-like way of driving. What’s awesome about the intelligence that we’ve built is that it’s able to reason about these things and keep the traffic flowing, keep interacting with road users in a very human-like way. I think this is going to be key for societies to accept and love robo-taxis. I can’t wait to make that a reality.
Pat Grady: Are there any specific corner cases that your cars have a hard time with today?
Alex Kendall: There are loads. And it’s really hard to generically talk about one because they’re so rare. It’s very hard to say, “Oh, Alex, it’s always these types.” You know, a corner case is a couple of edge cases coming together in a corner. And it’s always confounding factors when you get something really obscure.
But we’ve driven in over 500 cities this year, and when you’re driving at that level of scale, of course, you see things that you’ve never seen before: road signs written in a new language, for example. Actually, maybe one way to break it down: we often talk about driving in terms of safety, utility and flow.
Pat Grady: Yeah.
Alex Kendall: Safety being, of course, safety-critical behavior. Flow being the style of driving: is it smooth, is it enjoyable? And then utility being the navigation and road semantics. And safety and flow, we’ve found, generalize exceptionally well throughout the world. We get almost uniform metrics in every country we operate in, in terms of safety and flow or comfort of the drive.
But utility has been the really interesting one as we’ve gone global. How do you navigate, how do you deal with road signs? How do you read different languages? How do you deal with different driving cultures? And so that’s the one that’s been interesting. We published some results about this. When we went from the UK to the US, we needed hundreds of hours of data to be able to drive within 10 percent of our frontier performance.
But then when we went to Europe, into Germany, of course, we’d already learned to drive on the right side of the road. Coming to the US, we’d learned to do right turns at red lights. Then coming to Germany, we had to learn to still drive on the right side of the road but, of course, you can’t turn right at a red light there. But then on the Autobahn—you’d like this—we drive today up to 140 kilometers an hour, so pretty fast there. But yeah, it gets more efficient each time, with exponentially less data needed in each new market, because you’ve seen some of those things before.
Pat Grady: Yeah.
Sonya Huang: You mentioned at the beginning that large language models were part of what flipped your approach from contrarian to consensus. Are you integrating large language models at all into your models? And I know some of the robotics companies that are getting started now are starting from this VLA/VLM base. Is that part of your architecture?
Alex Kendall: A hundred percent. In 2021, we started working on language for driving. I remember my team came to me at the time and said, “Hey, we should start a project on language.” I said, “No, no, no, guys. A startup’s all about focus. Keep focus.” But they actually gave some pretty compelling arguments, so we started to play around with these things. And a year or so later we released Lingo, which is the first vision-language-action model in autonomous driving.
And what was special about this model was it could not only see the world and drive a car, but also converse in language. It lets you talk to it, ask it questions: you know, what are you finding that’s risky? What’s going to happen next? It could even commentate your drive.
And what’s interesting about this is that there are a few benefits. One is that bringing language into pre-training just improves the representation’s power; it gives more interesting information to learn from than imagery alone. Second, aligning the representation with language opens up a ton of interesting product features. It enables you to create a chauffeur experience where you can actually talk to your driver. No longer do you need a PhD in robotics to understand the system; you can just talk to it and ask it to drive. Pat, if you want to race around the commute super fast, then you can demand that. And third, it gives you a really nice introspection tool: you could imagine regulators or our engineering team conversing with the system in language to diagnose why it’s doing what it’s doing or get it to explain its reasoning. So I think these are really clear benefits which we’re really excited to be pushing.
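For a concrete picture of the vision-language-action idea, here is a toy sketch of one network that takes camera frames plus a language prompt and emits both a driving trajectory and language logits for a reply. It is an assumption-laden illustration, not Lingo or any Wayve model; the tokenizer, fusion module and output heads are placeholders.

```python
# Toy vision-language-action model: one network, two heads (illustrative, not Lingo).
import torch
import torch.nn as nn

class ToyVLADriver(nn.Module):
    def __init__(self, vocab_size: int = 1000, dim: int = 256, horizon: int = 10):
        super().__init__()
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, dim),
        )
        self.token_embed = nn.Embedding(vocab_size, dim)
        # Joint fusion over the image token and the prompt tokens.
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.action_head = nn.Linear(dim, horizon * 3)    # future trajectory
        self.language_head = nn.Linear(dim, vocab_size)   # next-token logits for a spoken reply

    def forward(self, image, prompt_tokens):
        img = self.image_encoder(image).unsqueeze(1)         # (B, 1, dim)
        txt = self.token_embed(prompt_tokens)                # (B, T, dim)
        fused = self.fusion(torch.cat([img, txt], dim=1))    # shared image + language representation
        summary = fused.mean(dim=1)
        return self.action_head(summary), self.language_head(summary)

model = ToyVLADriver()
image = torch.randn(1, 3, 128, 128)
prompt = torch.randint(0, 1000, (1, 12))   # e.g. a tokenised "what looks risky ahead?"
trajectory, reply_logits = model(image, prompt)
```

The structural point is that action and language share one fused representation, which is the property that lets the same model that drives also answer questions about the drive.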
Sonya Huang: That’s super cool. And you’re running it on the embedded compute.
Alex Kendall: We are. So we’ve put out demos that run offboard. Onboard’s challenging with what’s in the automotive market today. But some of the next generation compute, for example the Nvidia Thor that our next gen development vehicle is going to be built with, will be large enough to run it on board. That’s going to be cool.
Sonya Huang: Very cool.
Pat Grady: You’ve talked about how autonomous driving sort of provides a path to more generalized embodied AI. Can you paint that picture for us how you go from autonomous driving to, I don’t know, humanoid robots or whatever other things you might want to embody AI?
Alex Kendall: I think in the future we’re going to be looking at a ton of interesting use cases for robotics. What we’re seeing is that mobility is becoming possible, I think, well before manipulation. Manipulation is challenging in terms of access to data, global supply chains for hardware, and actually even the hardware designs themselves. Like, I think tactile sensing is still a really hard challenge. Inevitably it’ll be a massively transformative thing, but maybe it’s at the maturity of where self-driving was in 2015.
But today, our system is rapidly becoming a general-purpose navigation agent. Given an arbitrary sensor view and a goal condition, it’s able to produce a safe trajectory. So I think we’re going to see rapid advancement not just in consumer automotive and robo-taxis; you can think about trucking and other applications.
But this AI will enable manufacturers and fleets who want to build robots in any kind of mobility application. And, of course, we’re really excited to be working with frontier developers and applications over time as you go out across that robotics stack. And I expect we’ll see more maturity in the coming years from manufacturing and manipulation use cases as well.
But in the end, I think there are real benefits to having a large foundation model. Certainly in automotive, I think we have access to the largest robot and data supply chain, and so we’re really lucky in that regard to be able to push forward the intelligence there. And in generalizing that intelligence to new applications, I think there’ll be benefits from the model being able to experience multiple different verticals; it will only make it more general purpose. Any applications you’re excited about?
Pat Grady: I mean, I’m psyched to have humanoid robots walking around.
Alex Kendall: Yeah, me too. I think they’re going to be neat. You know, whichever form factor: I think humanoids will play a big part, and other forms of locomotion as well, and then manipulation. There are some really interesting challenges in those spaces, but I think the same story is going to play out. Working on a narrow application, like when self-driving went to Phoenix, Arizona and put in a ton of infrastructure and expensive hardware to make it work, is going to have limited runway, I think. But working on general-purpose, lean, low-cost hardware stacks that really focus on making the system as intelligent and robust as possible, I think this is the recipe for scale. So yeah, let’s watch that space.
Pat Grady: Yeah.
Sonya Huang: Do you think there are major research breakthroughs needed to reach kind of physical AGI, so to speak? And if so, what do you think is the most promising direction?
Alex Kendall: Absolutely, I do. I think there’s so much more room to scale up the current approaches, and we’ll do that. But I think we will get compounding returns from—I usually talk about four factors that drive performance. There’s, of course, data and compute, but then also the algorithmic capabilities and the embodiment, meaning the hardware and capabilities on the robot. And I think we need to push all four.
And on the algorithmic side, there are so many opportunities for growth. I think a key one is measurement. How do you actually measure and quantify these systems? How do you respond quickly, find regressions, and have a simulator that closes the real-world gap at scale and runs efficiently? I mean, it’s no secret that these generative world models are very compute-intensive, but having a good measurement system will just drive efficiency and iteration speed. So that’s a key one.
People often talk about it being a chicken-and-egg problem: if you have a perfect simulator, you’ve solved self-driving, and vice versa. And I really believe that. I think AlphaGo showed that when you have a perfect simulator, you can just solve problems through Monte Carlo tree search. And so I think that’s going to be the case in robotics as well.
So yeah, one is measurement. Another pillar is building more generality into the model. How can you build out more modalities and align those different modalities in their reasoning? I think this is going to open up new use cases, particularly when it comes to human-robot interaction and navigation. I was going back to the utility problem before. Some of these things I’m super excited about.
And then the last one is just engineering efficiency. I mean, training these systems and the data requirements are extraordinary. And so I wouldn’t understate it: I think the most sexy part of this problem is the efficient infrastructure to train and serve these models. And getting that right, I think, is a real competitive advantage or disadvantage.
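Picking up the earlier point about simulators: with a perfect simulator, even very simple search solves the control problem. Here is a toy illustration using flat Monte Carlo search (a much-simplified relative of the MCTS used in AlphaGo) on a made-up one-dimensional driving game. It has nothing to do with Wayve’s stack; every function and number is invented for the example.

```python
# Flat Monte Carlo search in a "perfect simulator" for a made-up 1-D driving game (illustrative only).
import random

ACTIONS = [-1.0, 0.0, 1.0]   # brake, hold, accelerate

def step(position, speed, action, obstacle=12.0):
    """Perfect simulator for the toy world: forward progress is rewarded,
    being on top of the obstacle is heavily penalised."""
    speed = max(0.0, speed + action)
    position = position + speed
    reward = speed - (100.0 if abs(position - obstacle) < 1.0 else 0.0)
    return position, speed, reward

def rollout_value(position, speed, first_action, horizon=15):
    """Play first_action, then random actions, entirely inside the simulator; return total reward."""
    total = 0.0
    action = first_action
    for _ in range(horizon):
        position, speed, reward = step(position, speed, action)
        total += reward
        action = random.choice(ACTIONS)
    return total

def choose_action(position, speed, num_rollouts=200):
    """Estimate each action's value by averaging simulated rollouts; pick the best."""
    best_action, best_value = None, float("-inf")
    for a in ACTIONS:
        value = sum(rollout_value(position, speed, a) for _ in range(num_rollouts)) / num_rollouts
        if value > best_value:
            best_action, best_value = a, value
    return best_action

print(choose_action(position=0.0, speed=1.0))
```

The catch, as Alex notes, is that for driving the simulator itself has to be learned and is expensive to run, which is why measurement and efficient world models are pillars in their own right.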
Pat Grady: We started by talking about AV 2.0. Someday I imagine we might be talking about AV 3.0. What could AV 3.0 look like? If you go five, ten, fifteen years in the future, are there any other big leaps in this industry that you think we’ll see?
Alex Kendall: You said that with such deadpan, Pat.
Pat Grady: [laughs]
Alex Kendall: So the whole premise of AV 2.0 was all about putting the intelligence on the car, and not needing infrastructure and a ton of overcooked hardware, but really making the system intelligent. And so I think we’re seeing that emerge now with the system that can generalize to the world with all of the onboard scalable intelligence and compute.
If I were to speculate where AV 3.0 goes—we haven’t sort of thought about it in depth lately, but one idea could be taking the intelligence outside the car. I mean, once the majority of vehicles on the road are autonomous, you could imagine a ton of new things you could do when they start to communicate, when they start to interact with each other. You know, why do we need traffic lights in the future if they can coordinate? Why do we need all these sensors if you can actually just communicate with the AV in front of you to be able to see around corners? Of course, I’m speculating here. It opens up tons of interesting cybersecurity questions, communication latency questions, things like that. But I don’t know, I’m all up for embodied AI. And if we can build a safer and more accessible system by taking the intelligence not only into the car but beyond it, maybe that’s a path. Let’s see.
Pat Grady: I think that’s really interesting. If AV 3.0 is the point at which it’s sort of a mesh network and, you know, at that point, maybe humans aren’t allowed to drive because they can’t communicate with the mesh network the same way that the robots can. Or maybe there are special places that humans go to drive just for recreational purposes, but transportation, you know, it’s all autonomous. Yeah, interesting.
Sonya Huang: How do you hire, and how do you attract people with how hot the AI market is these days?
Alex Kendall: I love that question, because at the end of the day, our team is our product. Our team is the most important thing in making this possible. And we talk a lot at Wayve about being a place where you can do the best work of your career. What that means for me in embodied AI is having a set of colleagues around you who inspire and excite and are world class in what they do, and having the right resources and the right culture to unblock you.
But I think uniquely at Wayve, we are able to bring together really a frontier AI environment with a near-term product opportunity in automotive. So if you want to work on intelligent machines and see your system brought out with the scale of impact of ChatGPT in robotics, I think this is a place where we can do it.
The other thing is that we’ve gone global. I mean, we have teams in London, Stuttgart, Tel Aviv, Vancouver, Tokyo and Silicon Valley, some of the major AI and automotive hubs. And, you know, we’re really looking to build a global culture that can bring this product to the world, work with customers around the world, and most importantly, collaborate with the very, very best people. Yeah, so anyone who’s interested in pioneering embodied AI, pushing the frontiers and actually turning it into a game-changing product, come chat. We’d love to speak.
Sonya Huang: Wonderful. Alex, you’ve believed in the future of end-to-end neural nets, in self-driving and in the physical economy, for longer than just about anybody. And it must be incredibly fulfilling to see that vision start to come to life. Congratulations, and thank you for joining us.
Alex Kendall: Thank you Sonya. Thank you Pat. It’s such a privilege.