Dylan Patel of SemiAnalysis: Why Hardware-Software Co-Design Is AI’s Real 100x
Dylan Patel, founder of SemiAnalysis, argues the biggest gains in AI don’t come from faster chips; they come from software-hardware co-design. Optimizing the model, the kernels, and the silicon together turns a 2x here and a 2x there into 100x. He explains why DeepSeek’s experts were shaped for Nvidia’s Hopper (and why TPUs struggle to run it), why OpenAI’s sparser models and Anthropic’s denser ones pull them toward different hardware, and why the so-called CUDA moat was never really about CUDA. He makes the case that inference will be a bigger market than oil, and explains why Jensen Huang is bankrolling neoclouds to engineer a multipolar world.
Transcript
Intro
Dylan Patel: I think it’s really fun inside of SemiAnalysis because we have 90 people and a big chunk of them are technologists, engineers across the whole supply chain. And then a big chunk is people who were formerly at hedge funds. And you see these arguments, people are like, “Oh, that doesn’t matter.” And then someone’s like, “Well, but cost.” And then the engineer’s like, “No, no, no, but this technology is the coolest.” And you see this organically fight it out. And we’re pretty informal. And given the fact that I was a forum moderator, you can imagine what the tone is like.
Shaun Maguire: You’re enjoying it.
Dylan Patel: You don’t wrestle with it because the pig enjoys it, right?
Sonya Huang: [laughs]
Main conversation
Shaun Maguire: We’re here in the SemiAnalysis office with Dylan Patel. I’m Shaun from Sequoia. My partner is Sonya Huang. It’s pretty insane what you’ve done. Semis five years ago were not very sexy in the West. They were sexy in the East, but people here in the West had kind of forgotten about them. You did not forget about them, though. You went very long. You created probably the premier research company in the space that’s been educating the world in the state of the art from very technical details to supply chain, you know, to the bigger picture. There’s rumors that SemiAnalysis recently passed $100 million of revenue. I don’t know how accurate those are, but whatever the numbers are, you guys are crushing.
Dylan Patel: It’s as accurate as the information is. You never know.
Shaun Maguire: There’s also rumors that you might start a venture fund. I hear all the time in the ecosystem people wanting affiliation with SemiAnalysis. You’ve built this trusted brand, and so whatever you do it’s working. It’s clearly just the beginning of the journey for you. Congratulations on all of that. But how did this happen? First question is, like, what is the background? How did you kind of get to where you are now?
Dylan Patel: Well, when I was a young boy coming out of the womb—no, I’m just kidding. So I grew up in a small business. My parents had a motel. We lived in the motel. We later had a gas station. So I was selling—I joke a lot of times the first neural network I trained was racially and visually profiling people based on when they entered the gas station, which cigarette to pick.
Shaun Maguire: Oh my God.
Dylan Patel: Basically, the cigarettes were all extruded across the top. And I was too short to actually reach them. And technically I wasn’t legal to sell cigarettes at that age, but whatever. I had to move the stepstool over to the right area.
Shaun Maguire: I started working my first job before it was legal, too. But it’s good experience.
Dylan Patel: Well, I didn’t get paid, right? It’s a family business.
Shaun Maguire: Same.
Dylan Patel: But yeah, we had our motel, and then across the street was our gas station. So sometimes someone would walk in, and so if an old white lady with curly hair walked in, I’d move the ladder or the stepstool over to where the Camels are. And if a different age demographic, profession, race, et cetera, I wouldn’t move the stepstool over. And I joke this is the first neural network I trained, because if I waited for them to tell me, I’d have to move it over and then I’d step up versus just being ready. So menthols versus 100 slims and all these things. I joke that’s the first neural network I trained.
But I grew up in family businesses, lived in a motel, and it all really goes back to when I was like—you know, it was my eighth birthday. My birthday’s in May, and it was April when the Xbox 360 was announced. For my birthday, I didn’t ask for the Xbox—or I didn’t ask for a birthday gift. My parents asked what I wanted. I asked for it for Christmas. We celebrated Christmas, but at least at the time, I thought there was no way they would give me the Xbox 360 for Christmas. And so I asked for my birthday tab for Christmas.
Anyways, Christmas comes around, I get it. Fast forward a couple months, my cousin who lives in Alabama—they also lived in a motel—was going to come over for spring break, for his spring break, and we were going to hang out at my house. And he’s in between me and my older brother in age. Brother’s a bit more jock-y, so he didn’t really care too much about the Xbox. He played sometimes, but he didn’t really care. But my cousin, I wanted him to think I was cool. So I bragged many times on the phone. I was like, “Yeah, I got an Xbox!”
And then the Xbox broke. There was a hardware defect called “the red ring of death.” But long story short, I had to open it up and short the temperature sensor and it fixed it. But there were many other tricks I tried first and none of them worked. And so that’s sort of how I got into hardware is I opened Pandora’s box. By the time I was 12, I was on these forums a lot, reading, posting a lot. And this is around the time when Reddit ate all other forums. And so I became a moderator of Android and Apple and Google, as well as hardware and was looking at Intel, NVIDIA, and AMD and all these other forums—Build a PC, all these forums I was watching, reading, posting a lot, but some of them I was moderating a lot.
And so smartphones, watching smartphones develop from very simple to speed racing to being technologically more advanced than PCs in many ways architecturally. And same with all the GPUs, just tracking and watching that, reading every comment, always having the economic tinge because I grew up in a small business.
So I was always looking at the economics. There was a time where all the neckbeards on the internet loved AMD GPUs. And I personally had bought an AMD GPU too, because price-performance. But then when it came down to what’s technically better, I’d always be like, no, no, no, NVIDIA’s better because they use a smaller chip to get better performance at better power efficiencies and their margin’s better. And so I would always talk about how NVIDIA’s margins were better than AMD’s in the GPU landscape. And so it was very fun.
Shaun Maguire: And you were 12 at the time?
Dylan Patel: I started moderating when I was 12, but this is all through my teenage—tweenage and high school years, right?
Sonya Huang: Did you have any other weird hobbies or was it just semis?
Dylan Patel: I played a ton of StarCraft. At one point I was grandmaster on the North American ladder for StarCraft II.
Shaun Maguire: Are you serious?
Sonya Huang: So you’ve been just obsessively good at multiple things?
Shaun Maguire: Yeah.
Dylan Patel: I mean, obsession is hell.
Shaun Maguire: How were your grades?
Dylan Patel: They were decent. I would say I had mostly As, but there were classes that I thought were really boring or I just didn’t enjoy. Like Spanish, I got not the greatest grades. But it was like, I speak fluent Spanish, by the way, so it’s really dumb. But it’s just sort of like …
Shaun Maguire: But maybe that’s why you didn’t get a good grade.
Dylan Patel: No, I didn’t learn Spanish till later, to be fair. But my grades were fine—they were fine enough for Asian parents. I was better than most of school, but it wasn’t like try-hard maxing for all As.
Sonya Huang: Okay, so you’re very much a student of the internet then. This is how you developed this expertise. At what point did you decide to start SemiAnalysis, and what’s been the biggest surprise since starting the company?
Dylan Patel: Yeah, so I went to school, I got a few degrees in stuff that wasn’t related to semiconductors. I was a quant for two years at a small quant risk firm, and then basically there was a culmination of events that happened. One was that I got screwed out of a bonus. I’d made my company many millions of risk-free revenue because I exploited a risk thing in the market. I think well over $10 million. And then someone else took credit for my work and all this sort of stuff. But eventually I did get right-sized, but I lost the social contract with the company I was working with.
My grandparents grew up in my house with us—or in the motel with us. They lived with us. And so I was very close with them. And my grandmother got dementia and she forgot who I was, and she fell down some stairs and had a tragic accident and passed away. So all of that happened in early 2020. Additionally, there were some girl things. And so there’s a few things that happened that made me very sad. And so all of those things sort of culminated. Then COVID happened, and my brother’s like, “Dude, just come stay with me.” He lived in Nashville, so I came and stayed with him in Nashville. We were like, “Oh, lockdowns will be a few weeks. You can stay with me while they happen and then you can go back home and whatever.” Famous last words. Lockdowns lasted much longer.
But living with my brother for a few months was sort of like, okay, didn’t know what I was doing. I was now at my brother’s home. Everything was his rules. Him and his fiancée at the time, now wife, were there. And so I basically had to tiptoe around, but I didn’t care about my job. And so I was posting even more than normal. I’d always been posting a lot on the internet. I’d always been trading stocks a lot. But I made a lot of money shorting COVID and long in COVID and all this stuff. Semiconductor shortages happened around then, too.
Anyways, I was very much obsessed with posting and things like that, and eventually around that time, I got into an argument with someone on the internet and they doxxed me, right? They publicly revealed my identity for my anonymous account. And at the time I was like, “Oh no!” I was scared. I stopped posting for, like, three weeks and I was like, “What am I doing? Why do I care?” So then I just started posting under—I had blogs and stuff as well. I made a real blog, SemiAnalysis, and on my 24th birthday I posted two blogs. And then from there it just, like—it was not a newsletter, but I got so much traction, because now instead of posting on an anonymous name, it was a real name. And I put a lot more effort into those two posts than I usually did. Instead of shitposting on the internet, it was like real effort into the blog. You can actually go back and read those if you want. They’re not that great, but they were good for the time. They were the best stuff you could find on the internet about semis.
And I just kept posting, posting, posting. I started getting a lot of consulting business. You know, 2020, I also sort of—I was again crashing out. Didn’t know what I wanted to do. So I packed everything up—or sort of I took my truck. I bought a tent that fits on the back of the truck, bought an air mattress and drove around all these national parks all around America. And so three or four days of the week I’d stay in a random motel where I negotiated the price to be, like, $30 a night for a room and I would work on somebody else’s stuff. And then on the weekends I’d read books, and oftentimes read textbooks while in some random national park or hiking and listen to audiobooks about semiconductors, about AI, about all the things that I cared a lot about and got way more educated over these six months where I’m just going to every national park.
Shaun Maguire: And you were alone?
Dylan Patel: I was alone the whole time. I was posting blogs. Everyone was like, “Dylan, what the f- are you doing?”
Shaun Maguire: This is pre-Starlink or the very early days of Starlink?
Dylan Patel: Pre-Starlink. Pre-Starlink. So it was very much like, “What are you doing?” I traveled around LATAM again for a year initially with my friend and then with my ex for about a year. And then end of ’21, ’22, ’23, and ’24—I’m still completely homeless since mid-2020, but I’m traveling around to every conference in the world. I go to 40-plus conferences a year. No matter where in the supply chain it is, I’m like, “Oh, that looks interesting. I guess I’ll go to that.” I went to one conference and I was like, “Wow, this is amazing! You get to talk to the experts and they’re going to talk to you because you’re so excited.” And in the case of semiconductors, everyone’s a boomer, so they don’t see young people who are excited about it. So they’re really happy to tell stuff.
Shaun Maguire: I have to ask on this: Was there a part of the supply chain or one of these conferences that particularly changed your view of the semi world, or that you felt then or feel now is particularly underrated?
Dylan Patel: I think the trade shows and conferences range really widely. Obviously, the ones I have the most fun at include NeurIPS. Why is that? Because it’s 20,000 AI researchers, and they’re generally in my distribution of age range. So it’s a lot of fun, but they’re also leading AI researchers and it’s a lot of fun and you learn a lot. There’s also a lot of parties. And then it ranges all the way to this random chemical conference in Japan where it’s 300 Japanese dudes, it’s like 20 guys from ASML, 20 guys from TSMC, 20 guys from Intel, and those are the only people who speak English. Everyone else speaks only Japanese and you’re like, I guess it’s still pretty interesting and fun.
I think one thing that I have a skill set of is I’m able to bond with anyone regardless of their background and who they are. I’m able to talk to them and find something interesting to talk about. Oftentimes it’s tech stuff. And so I think the most interesting conferences are oftentimes the really big ones, because that’s where the biggest stuff is happening. But I think the niches that are really exciting is SPIE. There’s IEEE, which is International Electrical Engineering something, and there’s SPIE, which is another ecosystem. SPIE conferences are super, super deep in details. Every single one that I went to, especially, like, SPIE Advanced Lithography or SPIE Photomask, I went to them the first time, I didn’t even understand 90 percent of what I heard. And then I read, read, read, read. I’d made some contacts, of course. And then the next time I went, I understood half of what I went to. Third time I went, I understood 75 percent of what I went to. Even now I went and I was like, I still don’t understand everything that’s going on.
Whereas you go to NeurIPS a couple times, you can understand, okay, what’s neural symbolic reasoning? Okay, what’s this? What’s that? You can kind of get a mapping of what everything is pretty quickly. But some parts of the supply chain are so arcane and so deep and so technical it takes a lot of times for you to even understand what’s happening on everything, for every research paper. It doesn’t necessarily mean you didn’t—you go to a conference for a few reasons, right? You understand the research, but it’s all the research that’s being published. But what you really care about is understanding how does that research intersect with technology? Also, how does that research differ from what’s there today? And none of these research papers tell you what’s happening today. But then you just ask people and you build contacts and you learn. And then you learn about the supply chain and oh, this company supplies this company even though it’s not publicly stated anywhere. You learn that this chemical costs about this much and a tool uses about this much. And you just learn all these things.
Shaun Maguire: You hear the horror stories of, like, this chemical had a shortage and it totally threw off this part of the supply chain. And then it turns out there’s only three companies in the world that make that chemical.
Dylan Patel: My favorite one I learned is a Japanese guy at that specific Japanese conference that I went to where almost no one spoke English, in very broken English, he told me about how his father worked in this industry in the 1980s, that the only factory in the world that built this chemical burned down, and that caused memory prices to double or triple. And I was like, wow, not too different from today. [laughs]
Shaun Maguire: Not at all.
Sonya Huang: That’s crazy. Inference, going to be the biggest market on Earth, the biggest market beyond Earth. Agree or disagree?
Dylan Patel: I mean, obviously, use of tokens is going to be the biggest market, and the value that’s created from tokens is going to be the biggest market. But I think tokenomics, sort of the use of tokens, adoption of AI, sort of is the most important thing that’s happening. And inference, whether it’s open models or closed models, will be one of the biggest markets in the world. Much bigger than oil, I think. Much bigger than many other parts. Like, inference of AI will be many percentage points of the GDP.
Sonya Huang: Yeah. What you’ve done with InferenceX, I think, is industry standard. Maybe say a word on why you started it, what it does, and what do people misunderstand about performance benchmarking on inference?
Dylan Patel: Yeah, so to zoom back, SemiAnalysis, we do a lot of stuff that’s research for institutional clients and our subscription-based products, but a lot of it is also like, hey, this would just be cool to figure out. Let’s figure out how to figure it out and just post it publicly. And that begets more and more scale.
And so we’ve done this with a lot of GPU benchmarking and testing and training performance and inference performance, but ultimately we saw inference benchmarking was point in time. You know, you test it and you take some time and you release it and it’s slow and arcane and outdated, because models change all the time. I feel like every week there’s a new model, whether it’s a Chinese model or today Mythos 5, Fable dropped. New models are coming out all the time. On the software layer, PyTorch, vLLM, SGLang, new drivers, new something drops. In fact, the update cycle for most of these libraries is twice a week. So you basically have the software updating all the time, and therefore performance changing. New inference optimizations are coming out and those get updated, and so I feel like it’s a relentless breakthrough after breakthrough after breakthrough that keeps driving efficiency and cost down, which is why we’ve seen model costs drop for equivalent quality by, like, 60x a year.
It’s incredible. But to stay on top of that, you can’t have point-in-time benchmarking. You need to have benchmarks be living and breathing, i.e., constantly running on the latest hardware, on the latest models. And so we embarked on a project and we got a lot of buy-in from the ecosystem. This was only possible because we had enough aura with some of the ecosystem where we were able to get CoreWeave and Crusoe and Nebius and Oracle and Microsoft and Amazon and Google and OpenAI to contribute compute to us. And then we were able to work with SGLang and vLLM and now RadixArk and Inferact, which are the private companies who are leading those efforts, the open source efforts, to collaborate with us. We were able to get NVIDIA and AMD and Google and Amazon now because we’re adding TPUs and Trainium to collaborate.
Now we’ve got all these people collaborating. We’ve got over $50 million of hardware donated to us. Once we launch TPUs and Trainium, it’ll actually be over $100 million of hardware. You know, maybe about 15 different chip types all running these benchmarks every single day on all the latest models—the best model from Moonshot, the best model from Alibaba, the best model from—there’s about five different Chinese models, the best open source models, the best Chinese labs there. We run benchmarks on their models every day, and then also the best US open source models, GPT-OSS, Nemotron, et cetera. So we’re running these benchmarks every day in an automated fashion, and they run on these servers that are dedicated to us for inference benchmarking. And we sweep across so many different configurations and optimization types.
And then what it creates is—and all the results are public and all the configurations are public, so now we have the Pareto optimal curve, because a lot of times when people are comparing inference performance, they’re taking a suboptimal curve or point for someone else and comparing it to their optimal one. It’s like, well, yeah, if I drove a Porsche versus some race car driver, obviously I’d drive it slower. It’s the same thing with inference benchmarking. And so what we did is we created open-source containers for the optimal points across every point on the curve of interactivity (i.e., how fast is it responding to me) versus batch size (i.e., how many users am I simultaneously serving). And so now anyone who wants the optimal point can just go to InferenceX, download it, and run that as the optimal point. They can check every day if they want or they can even auto-download the most optimal point for that model, and their inference performance will be near peak.
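As a rough illustration of the Pareto-optimal curve described here, the sketch below filters a set of benchmark results down to the configurations that no other configuration beats on both axes at once. The `BenchmarkPoint` type, the config labels, and every number are hypothetical; this is not InferenceX code or data.

```python
from typing import NamedTuple


class BenchmarkPoint(NamedTuple):
    config: str           # hypothetical label for a serving configuration
    interactivity: float  # tokens/sec seen by each user (higher is better)
    throughput: float     # total tokens/sec per unit of hardware (higher is better)


def pareto_frontier(points: list[BenchmarkPoint]) -> list[BenchmarkPoint]:
    """Return the points that are not dominated on both axes."""
    # Walk from the most interactive point to the least interactive one.
    # A point stays only if it beats the best throughput seen so far,
    # because every earlier point already matches or beats its interactivity.
    ordered = sorted(points, key=lambda p: (-p.interactivity, -p.throughput))
    frontier: list[BenchmarkPoint] = []
    best_throughput = float("-inf")
    for point in ordered:
        if point.throughput > best_throughput:
            frontier.append(point)
            best_throughput = point.throughput
    return frontier


# Made-up sweep results for one model on one chip.
results = [
    BenchmarkPoint("A", interactivity=10, throughput=1000),
    BenchmarkPoint("B", interactivity=50, throughput=700),
    BenchmarkPoint("C", interactivity=50, throughput=400),   # dominated by B
    BenchmarkPoint("D", interactivity=100, throughput=500),
    BenchmarkPoint("E", interactivity=250, throughput=250),
    BenchmarkPoint("F", interactivity=80, throughput=300),   # dominated by D
]

frontier = pareto_frontier(results)
print([p.config for p in frontier])
```

Comparing one vendor's point on this frontier against another vendor's dominated point (such as C or F) is the kind of mismatched comparison the passage warns about.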
Sonya Huang: Is that curve the most important curve in your opinion? The throughput-interactivity curve is the most important one?
Dylan Patel: Yeah. I think most things in hardware infrastructure, model application layer, everything is downstream of that curve, right? Is it something that needs to be super, super fast, super low latency, and I don’t really care about the cost so I make batch size very low and I use techniques like speculative decoding or multi-token prediction heavily? And there’s so many possible techniques there. Or is it something where actually I’m batch processing a ton of documents, and I don’t really care about all these things, so I don’t use these techniques that actually are worse on cost efficiency but help you with speed for an individual user because I just want to pack a bunch of users? I don’t care if the document takes all night to process.
And right now, the way we treat AI infrastructure is it’s like one size fits all. But over time, we’re going to get to the point where there’s stuff where you have batch workloads or you need instant response. And there’s the whole curve that’s going to matter for users. And so we see this with Anthropic, right? Claude Code fast mode costs way more than regular mode. And same with OpenAI’s priority queue thing.
Sonya Huang: Sorry, dumb question. How does cost factor into this chart?
Dylan Patel: Let’s say—imaginary example, I have a batch size of 100, and I can do 10 tokens per second per user. So in total, I’m doing 1,000 tokens per second off of that one piece of compute. That’s one side of the curve: super slow, 10 tokens per second. The other side is I have 500 tokens per second, but I can only have one user, or maybe 250 tokens per second, one user. And then there’s points in the middle that are more Pareto optimal, right? The average person actually wants 50 or 100 tokens a second, and maybe the number of users I can batch together.
So the curve is, okay, 1,000 tokens total per second or 250 tokens total per second depending on how many users I batch. And there’s a curve in the middle. And so ultimately, some workloads will actually want the 4x cost decrease, because the same unit of hardware can do 1,000 versus 250. And some users, I’ll pay 4x more because I don’t care about the price, I care about time because the person using the tokens is expensive or the feedback loop that I have here is expensive.
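The arithmetic in this imaginary example can be written out as a short sketch. The batch sizes and token rates are the figures from the conversation; the hourly hardware cost of $3.60 is an assumption chosen only so the numbers come out round.

```python
def aggregate_throughput(batch_size: int, tokens_per_sec_per_user: float) -> float:
    """Total tokens/sec produced by one unit of hardware."""
    return batch_size * tokens_per_sec_per_user


def cost_per_million_tokens(hourly_cost_usd: float, total_tokens_per_sec: float) -> float:
    """Hardware cost attributed to each million generated tokens."""
    tokens_per_hour = total_tokens_per_sec * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


HOURLY_COST_USD = 3.60  # assumption, not a figure from the conversation

# High-batch end of the curve: 100 users at 10 tokens/sec each.
batched = aggregate_throughput(batch_size=100, tokens_per_sec_per_user=10)
# Low-latency end of the curve: 1 user at 250 tokens/sec.
low_latency = aggregate_throughput(batch_size=1, tokens_per_sec_per_user=250)

batched_cost = cost_per_million_tokens(HOURLY_COST_USD, batched)
low_latency_cost = cost_per_million_tokens(HOURLY_COST_USD, low_latency)

print(f"batched: {batched:.0f} tok/s, ${batched_cost:.2f} per 1M tokens")
print(f"low latency: {low_latency:.0f} tok/s, ${low_latency_cost:.2f} per 1M tokens")
print(f"cost ratio: {low_latency_cost / batched_cost:.1f}x")
```

Because the hardware cost is fixed per hour, the cost ratio depends only on total throughput, which is where the 4x in the example comes from.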
Shaun Maguire: If you had to guess—you choose the timeframe, 10 years or 15 years—what percent of inference compute do you think will happen in space? It can be zero percent, 50 percent.
Sonya Huang: Shaun coming in hot.
Shaun Maguire: 99 percent.
Dylan Patel: This is a tough one.
Shaun Maguire: So choose the timeframe, whatever timeframe.
Dylan Patel: So I think the non-consensus, or at least against SpaceX thing—you know, I love SpaceX, by the way, and I totally would buy the IPO if I could buy stocks.
Shaun Maguire: Not investment advice.
Dylan Patel: Not investment advice. Thank you. Not investment advice from Sequoia. I don’t think that space data centers will really matter in the next three to five years. With that said, I think in 20 years, I think the vast majority of compute will be going in space. And so the real factor there is sort of what’s the cost?
Shaun Maguire: The timeframe, yeah.
Dylan Patel: It’s the timeframe. It’s the cost of building power on terrestrial land and how much power are you going to be able to do on terrestrial land. And I think obviously my views of where inference, you know, how many gigawatts or terawatts are devoted to inference, it’s a crazy curve for me personally.
Sonya Huang: What’s your forecast? How many gigawatts?
Dylan Patel: Yeah, I think by 2030, just OpenAI and Anthropic will have over 100 gigawatts combined. And then you’ll add Meta and Google and so on and so forth. It’s a humongous amount of compute that will be dedicated to inference. And by 2040, it’ll be terawatts. The curve of productivity that we’re going to get in inference deployments is going to be huge. And so if you look at, like, 2040, I think probably more than half of incremental compute will be going in space. But if you look at 2030, I think it’s sub-one percent.
Sonya Huang: Do you think intelligence per watt has been increasing? It seems like there’s still a giant gap between where we are intelligence per watt versus human biology. Do you think we are going to close that gap? And if so, where is that gain going to come from?
Dylan Patel: I think it often depends on what you’re doing, too. Like, a TI-84 is way more intelligence per watt in terms of doing math than us. And it’s like 30 years old. So obviously this is a dumb …
Sonya Huang: General intelligence.
Dylan Patel: Yeah, general intelligence-wise. So one of the things InferenceX does is we also measure the power and cost of all of this hardware. And so we offer not just throughput versus interactivity, we offer cost versus interactivity. We offer power versus interactivity. As far as whether intelligence per watt has been increasing, I mentioned it’s been a 60x cost decrease for the same benchmark level. We’ve also seen the same on intelligence per watt. It’s not been exactly 60x; it’s been closer to 40x. Some of the efficiencies are in non-power ways, but there’s been a humongous improvement in intelligence per watt on an annual basis, at least so far this year, last year, year before, year before. And I expect that to continue. As far as where we are from the human brain, we’re many orders of magnitude away. Thankfully, it doesn’t really matter. We can devote a lot of power to computers. It’s much easier to power computers than human brains. We have sickness, disease, food preferences.
Shaun Maguire: Sleep.
Dylan Patel: Exactly.
Shaun Maguire: Let me just ask one more question on this general theme. In my opinion, in terms of intelligence per watt or intelligence per dollar, any of these metrics, I think there’s kind of three levels of input. You can get hardware improvements where the hardware is more efficient. You can get low-level systems optimizations like kernel-level improvements, matrix multiplication libraries, things like that. Or you can get high-level model-level or algorithmic improvements at the highest level. To me, it seems like in the last three years, most of the gains have come from hardware level and some from the model level. Do you agree with that? Do you think that’s what it’ll look like in the future? Do you think there’s a bunch of juice to squeeze in the kernel level?
Dylan Patel: Shaun, I completely disagree with you, by the way.
Shaun Maguire: Great. Great. That’s why I’m asking the question.
Dylan Patel: Okay, so I think one way is to look at it as these three different layers. And in that sense, like, from Hopper to Blackwell, which is all we’ve had over the last three years, roughly 30x improvement on DeepSeek, on the most optimized deployment, which is, you can see on InferenceX, there’s about a 30x improvement. But over the last three years, we’ve had way more improvement in intelligence per watt, and a lot of that is coming from the model layer. If you look back three years, it’s GPT-4. Now it’s maybe one of the smaller Qwen models that’s like 27B parameters total and two billion active. It’s way better. And so you’ve got this huge improvement on model layer. You’ve got this pretty sizable improvement on hardware, but it’s that co-design layer. And I think that’s what’s important, right? If you look at the architecture of any of these models—but DeepSeek is the most famous one, at least that’s public and people have seen.
Shaun Maguire: DeepSeek got huge efficiency gains from co-optimization or kernel-level optimizing memory.
Dylan Patel: Yes, I think it’s kernels, of course, but it’s actually you build the hardware architecture for the chip. So if you look at the shapes of all the experts in DeepSeek V3, they were all optimized for Hopper. And if you look at V4, they’re optimized for Blackwell and Huawei’s chip.
And what’s interesting is despite the fact that TPUs are objectively an amazing chip, and they run all of DeepMind and they do all the training for Anthropic as well, on the pre-training side at least, TPUs suck at running DeepSeek. But they are really, really great at running other kinds of models that don’t run well on NVIDIA. There is some level of such deep optimization that has been done, whether it be shapes, network I/O patterns, how you do the collectives, how you do things around the arithmetic intensity of the attention mechanism. All these different things are co-optimized between the model and the hardware and the infra software in between. And it’s hard to say you can disentangle the gains there.
Shaun Maguire: My understanding is that China has done this a lot better than the West the last few years. DeepSeek was one of the first models to really do this.
Dylan Patel: I don’t necessarily think so. I think it’s more so that the West doesn’t tell people what they do, right? Like, OpenAI didn’t tell people how sparse GPT-4o was, what the shape size was, all these things. But GPT-4o is roughly the same size, slightly smaller than DeepSeek V3, and 4o came out a little bit earlier if I recall correctly.
Shaun Maguire: So is your view that, like, all three of these things have been happening simultaneously at roughly the same rate, and the biggest gains are when you just co-optimize?
Dylan Patel: Yeah, I would say there’s been more gains on the model layer than on the software infrastructure layer and the hardware layer. But there’s been innovations on every layer. And really, the biggest gain and the beauty of the best labs is when they co-optimize all three. And that’s what Anthropic is—even though they use many different kinds of hardware, they don’t really inference too much on TPUs. They mostly train on TPUs and they inference a lot on Trainium and GPUs. And GPUs are more of a jack of all trades, but they’ve optimized their hardware, optimized their model, they’ve optimized everything so they can do that.
Whereas OpenAI, their prior models were optimized for Hopper more. Now they’re more optimized for Blackwell. You step forward through time, these labs—and the same with Google, right? Gemini 2 was really optimized for the TPU v6e—or Gemini 3 was. And then the next Gemini that’s coming out is really optimized for TPU v7. And so a lot of these things are being co-optimized, and actually when you pull that model and run it on the old hardware, it’s really not that great.
And so I think a lot of this co-optimization is the most important thing. It’s called “software-hardware co-design.” And that’s what’s really exciting about what I think my day-to-day is: it’s great, you get to look at one layer. There’s all these innovations happening here. There’s all these innovations happening on every layer. But the real breakthrough innovation is when you leapfrog a few layers, you co-optimize and co-design them, and now all of a sudden you’ve taken what could have been a 2x here, a 2x here, a 2x here, and instead of being multiplicative to 8x, it’s actually 100x because you’ve optimized it across all three layers. That’s what’s really exciting about what you see at the labs, what you see at a company like NVIDIA who’s not co-optimizing on the model layer per se, but a little bit from the model layer all the way downstream to silicon. Or you look at a company like TSMC, they’re co-optimizing not just fabrication, but all the way from the components and the consumables and the tools all the way upstream to what the designs, the chips, the customers are telling them. It’s this co-optimization across many layers of the abstraction stack.
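To make the arithmetic in that answer concrete, here is a minimal sketch. The 2x and 100x figures are the speaker's; the function name `compounded_gain` is hypothetical, and treating the three layers as independent multipliers is an assumption made only for illustration.

```python
import math

def compounded_gain(per_layer_gains):
    """Multiply per-layer speedups into one end-to-end speedup,
    assuming the layers improve independently of each other."""
    return math.prod(per_layer_gains)

# Model, infra software, and hardware each tuned in isolation at 2x.
independent = compounded_gain([2, 2, 2])

# Average per-layer gain that three layers would need to reach 100x overall.
implied_per_layer = 100 ** (1 / 3)

print(independent)                  # 8
print(round(implied_per_layer, 2))  # 4.64
```

Read this way, the claim is that co-design lifts each layer's effective gain from about 2x to roughly 4.6x, because an optimization in one layer unlocks optimizations in the others.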
Shaun Maguire: There will always be bottlenecks somewhere in that optimization, though, that are lagging behind and then need to get pulled forward.
Dylan Patel: And Band-Aids to cover them up.
Shaun Maguire: Exactly. If you had to predict—at any level of the stack, it can be literally anywhere—what are some of the bottlenecks you’re tracking most acutely in the next year? And not necessarily in the supply chain, not in scale, but in terms of the actual—and it can be in the supply chain, too, but just like, is it memory improvements? Is it just scaling up?
Dylan Patel: Memory is an easy one that everyone’s talked about, but I’m not going to talk about it from a supply chain angle. I’m talking about it from a technology angle, right? Memory capacity and bandwidth have been improving very slowly. The NAND cell was invented, like, 25 years ago. The DRAM cell was invented, like, 40 years ago, and there’s been no major breakthrough in, like, what a NAND cell is. Obviously, NAND is a very simple gate. Or a DRAM cell.
There is stuff that could come down the pipeline that could be hugely innovative, but even over the last five years, all we’ve really done is make the HBM, you know, more stacks, faster. But actually, there’s new innovations coming in the next few years where instead of stacking the HBM separately from the chip, you stack the memory directly on the chip and that makes your bandwidth explode.
And so there’s interesting companies in that space and interesting POCs that companies are trying to do there. I think, like, memory bandwidth is one of the biggest. Another one is for the history of silicon, basically for the last two decades at least, how many watts a chip is can be easily predicted just by looking at it for a data center or a desktop chip. It peaks up at one watt per millimeter squared. And so if a chip is 100 millimeters squared, generally the power consumption is around 100 watts or a little bit less. And if you look at the newest NVIDIA silicon, the newest TPU silicon, it’s still on that range of one watt per millimeter squared. So chips are now getting to 1,400 watts, next generation is 2,000 watts for NVIDIA with Rubin and such. And you move forward to Rubin Ultra, it’s going to be, like, 4,000 watts or something like that. But really, they’re just increasing the amount of silicon. What’s exciting is we’re now finally doing things—and it’s in development right now—where you actually can pump the amount of power into the silicon to be way more, more than one watt per millimeter squared. And now that all of a sudden means you need less silicon. Obviously, it’s running at higher power and it’s less efficient in some cases, but you reduce the amount of silicon and you’re able to open up the silicon volume.
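The one-watt-per-square-millimeter rule of thumb described above can be sketched as simple arithmetic. The helper name `silicon_area_mm2` and the 2 W/mm² density are hypothetical, chosen only to illustrate the trade-off; the 1 W/mm² baseline and the 2,000-watt figure come from the conversation.

```python
def silicon_area_mm2(power_watts, watts_per_mm2=1.0):
    """Silicon area implied by a power budget at a given power density."""
    return power_watts / watts_per_mm2

# At the historical ~1 W/mm^2, a 2,000 W package implies ~2,000 mm^2 of silicon.
baseline_area = silicon_area_mm2(2000)

# Hypothetical: if density could be pushed to 2 W/mm^2, the same power budget
# would need half the silicon.
denser_area = silicon_area_mm2(2000, watts_per_mm2=2.0)

print(baseline_area, denser_area)
```

This ignores the efficiency loss the speaker mentions at higher power density, so it shows only the area side of the trade.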
Sonya Huang: Will that run into thermal issues?
Dylan Patel: Thermal issues. There’s electrical interference issues, there’s all sorts of different issues that crop up, and that’s why it’s a hard engineering problem. That’s why we’ve stuck at about one. But what’s exciting is the world is trying to change these things. I think what’s interesting in a different part of the supply chain is people talk about energy is hard and we have energy bottlenecks. And it’s like, yeah, but there’s actually very simple solutions one could think of. Take the millions of diesel engines for trucks that the US has the capacity to make. You can very trivially convert them on the assembly line to run on gas and then stick them up to an electrical motor, like, back-driving it so the electrical motor generates electricity rather than the electrical motor causing the rotation of the wheel, for example, but doing it the opposite direction. And now you’ve generated electricity by pumping gas into something that the US can make millions of.
And then okay, well, that sounds like a pain in the ass to service, because now you have to have hundreds of these on a data center site. Well actually, you can just pull people out of car mechanic shops and have them run around and repair truck engines. It’s actually pretty trivial to—I don’t want to say it’s trivial. I couldn’t do it.
Shaun Maguire: I think you’re making a really good point, which is that because the West wasn’t really thinking about semiconductor industry or even hardware more broadly the last 20, 30 years, we didn’t have much innovation. We didn’t have the best minds thinking about how to improve these things.
Dylan Patel: Why would you want to go work in hardware when you can …
Shaun Maguire: Make ads, sell ads.
Dylan Patel: Yeah, exactly.
Sonya Huang: Okay, I’m dying to ask. NVIDIA versus TPU. What are your thoughts?
Dylan Patel: I think everyone wants to pick one or the other for this, but it’s really a function of, like, look, you look two years from now, Google’s going to make 10-plus million TPUs through their supply chain and NVIDIA is going to make many more tens of millions of GPUs and both are going to be $100-plus billion—well, Google’s going to be $100-plus billion of TPU created a year, and NVIDIA will be $500-plus or whatever. I’m not making a specific estimate.
Shaun Maguire: This is not a revenue forecast, this is just a thought experiment.
[CROSSTALK]
Sonya Huang: Shaun’s been media trained well. [laughs]
Shaun Maguire: Absolutely. Getting ready for SpaceX IPO.
Dylan Patel: Are you guys big in SpaceX?
Shaun Maguire: Yes.
Dylan Patel: Okay. So that makes sense.
Shaun Maguire: We’re very lucky to be very large investors.
Dylan Patel: Awesome. So I would say the case of Google TPUs versus NVIDIA GPUs, they both have points that are really in their favor. NVIDIA will be like, oh, well, we have switches and we’re general purpose. And TPUs will be like, well, we’re more optimized, we’re actually more energy efficient and our network is actually more optimized for certain types of network architectures. And so you have these counterpoints that both would really get into, and I could with a straight face argue with you that GPUs are way better than TPUs or TPUs are way better than GPUs, but it comes down to hardware-software co-design. So actually, the way OpenAI’s models are headed, it would be a terrible decision for them to use TPUs, potentially. And the way that Anthropic and Google’s models are headed, it’s actually a terrible decision potentially for them to train with GPUs.
Sonya Huang: What’s the fundamental difference there?
Dylan Patel: There’s various things. The size of the matrix multiply unit is different as a very simple thing. And therefore the shape of the matrix multiply you do, the attention mechanism you use, the way that attention mechanism is structured, the way the experts are structured.
Sonya Huang: So OpenAI and Anthropic are converging to very different model architectures?
Dylan Patel: I think they have quite different model architectures, in fact.
Sonya Huang: Huh! Interesting.
Dylan Patel: OpenAI’s are much more sparse—and that has benefits. And then Anthropic’s are—they’re still sparse but more dense in general, and that has different benefits. And there’s many other things. The network topology, right? NVIDIA, all of their chips are connected to switches, NVLink’s switches. For Google, they have no switch, but what they’ve done is they’ve been able to—NVIDIA, the NVLink can only connect 72 GPUs. For Google, their ICI can connect 8,000 chips at super high bandwidth, but you have to pass through other chips to get there because there’s no switch. And so there’s trade-offs there. There’s positives and negatives, and that influences the model architecture. It’s not necessarily that you should claim one is better than the other, because at the end of the day, how do you say that this is better than that when you can’t measure them in isolation because it also extends up to the model layer?
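The switch-versus-no-switch trade-off in that answer can be illustrated with a hop count. This is a sketch under stated assumptions: it models the switchless fabric as a 3D torus, and the 20 x 20 x 20 shape is hypothetical, chosen only because it yields 8,000 chips. The function names are also illustrative.

```python
def torus_hops(src, dst, dims):
    """Minimum chip-to-chip hops between two coordinates in a torus,
    where each axis wraps around."""
    return sum(min(abs(s - d), n - abs(s - d)) for s, d, n in zip(src, dst, dims))

def torus_max_hops(dims):
    """Worst-case hop count between any two chips in the torus."""
    return sum(n // 2 for n in dims)

dims = (20, 20, 20)                     # hypothetical shape: 8,000 chips
chips = dims[0] * dims[1] * dims[2]

neighbor = torus_hops((0, 0, 0), (19, 0, 0), dims)    # wraparound neighbor
farthest = torus_hops((0, 0, 0), (10, 10, 10), dims)  # opposite corner of the torus

print(chips, neighbor, farthest, torus_max_hops(dims))
```

In a switched domain, any two of the 72 GPUs are one switch traversal apart. In the torus, nearby chips are a single hop away but distant ones need many, which is one reason the model's communication pattern has to be designed around the topology.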
Sonya Huang: But I remember for a long time thinking one, the programmability of NVIDIA and then just CUDA as such a big moat. It seems to me that the narrative has kind of changed, at least in my mind, for the last three to six months. Like, model companies no longer care about, if we have to write custom kernels for this other chip, so be it. We’ll work with four or five chips if we have to. Claude and Codex are actually quite good at doing a lot of that optimization work. And then it’s not like there’s 10,000 model companies that each need programmability. There’s on the order of 10, maybe model companies. And so it seems to me that the fundamental premise of tens of thousands of big customers that need CUDA compatibility, like, it seems that kind of thesis is changing in the last few years.
Dylan Patel: Certainly the CUDA moat and software moat is at least partially disentangled, because models are just great at coding, and all software gets commoditized in that case. I do think there is some level of open source, and what people call the CUDA moat is not actually anything to do with CUDA, but it’s the fact that DeepSeek, Kimi and Zhipu AI, Alibaba, Tencent, all these companies—Xiaomi had an awesome model recently—their models are co-designed for GPUs, and therefore, if I want to run them on TPUs—actually, in some cases they don’t run really well on TPUs. Now Google just has to create their own open source model ecosystem or open source models themselves: They have the Gemma models.
So you end up with, well, that’s not really CUDA as a moat, it’s that the downstream product is more optimized for NVIDIA. And in these cases, these companies are just open sourcing them or, like, Nemotron is just open sourcing it. And then the users of it, for example, the inference API providers, the RL companies that are trying to take open models and customize them for companies’ business use cases, all these different companies are downstream of the fact that, like, okay, well I guess I need to use NVIDIA because the ecosystem uses NVIDIA, even though I don’t particularly care about writing CUDA kernels, because the models are great at that, but it’s like the shape of, like, well, this expert, the d_model is this and the hidden dimension blah, blah, blah is this, right? And so therefore, it’s better to run on NVIDIA GPUs than it is on TPUs.
And vice versa, right? If Google were to actually open source really good models, this would be the same thing. People would take their models and they’d be like, “Oh wow, these don’t run that well on NVIDIA GPUs. I should actually just rent TPUs or buy TPUs and do it on there.” For small teams, you’re going to want to use all the open-source software like vLLM, SGLang, PyTorch, all that stuff. But the big labs, they don’t necessarily need to use all that. OpenAI forked PyTorch long ago, and Anthropic and all these other people don’t necessarily rely heavily on the open-source implementation of these things. They forked things or built it on their own already, and so they don’t need to rely on the open source. And therefore now it’s more like, I’ll choose the best hardware and I’ll co-design my model and infrastructure software through and through for that hardware that is the best and most cost-efficient, and I’ll have AI help me write all that software.
Sonya Huang: What do you think of Cerebras?
Dylan Patel: I think Cerebras is a really innovative company. I think in some spots of the market they’re really, really good. Very fast inference—I think that’s a big market. We use fast mode almost exclusively at SemiAnalysis.
Sonya Huang: By the way, I love how disciplined you’ve been about accounting for—I don’t know if that was one exhibit you did or if you do it consistently, but accounting for the dollars spent and the ROI on each task. It’s awesome analysis.
Dylan Patel: Yeah, we do it pretty diligently. So thank you. That was the Dark GDP article that we wrote. And also track everyone’s token spend by day, and if someone’s spiked up, I’m like, “What did you do?” It’s like, “Okay, thank you for telling me, but that seems worth it. Cool.” On with my day. I think fast mode is obviously worth a lot for high-end tasks. I can just see so many different use cases where super fast tokens are worth it. I can also see the flip side where there’s a lot of use cases where super fast tokens aren’t needed, and therefore the market won’t pay for them and they’ll use GPUs and TPUs instead.
I think the big risk for Cerebras is I mostly think the best models are the ones that you want to use fast mode on and small models you might not necessarily use fast mode on. I could see that being wrong with financial markets maybe or something like that, like a Jane Street high-frequency trading or something like that or medium-frequency trading. But ultimately, running really large models at really long context is very difficult on SRAM-based chips like Cerebras, like Groq. So now it all of a sudden is like, what happens then if the models get too big? If OpenAI’s model is not on the order of hundreds of billions of parameters or low trillion parameters, but it’s actually 10-plus trillion parameters, now all of a sudden I don’t think that that will fit on Cerebras—with a long context length, right? If you have a million context length, now that’s really difficult to justify.
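A back-of-the-envelope sketch of why parameter count matters for SRAM-based chips. Every number here except the 10-trillion-parameter figure is an assumption for illustration: one byte per parameter (8-bit weights) and 44 GB of on-chip SRAM per device are placeholders, and the function name is hypothetical. KV-cache memory for long contexts is left out and would only add to the total.

```python
def devices_for_weights(params, bytes_per_param, sram_bytes_per_device):
    """Minimum number of devices needed just to hold the weights in SRAM."""
    total_bytes = params * bytes_per_param
    return -(-total_bytes // sram_bytes_per_device)  # ceiling division

SRAM_PER_DEVICE = 44 * 10**9   # assumed on-chip SRAM per device, in bytes
BYTES_PER_PARAM = 1            # assumed 8-bit weights

mid_size = devices_for_weights(500 * 10**9, BYTES_PER_PARAM, SRAM_PER_DEVICE)
very_large = devices_for_weights(10 * 10**12, BYTES_PER_PARAM, SRAM_PER_DEVICE)

print(mid_size, very_large)
```

Under these assumptions the device count grows from about a dozen to a couple of hundred before any context is stored, which is the scaling concern raised above.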
So far we’ve seen the bulk of revenue and usage at the labs be on their best model. Even when the model price has gone up, we’ve seen that. There’s some data that shows that even though Fable just released today, they’ve had incredible amounts of people switch to Fable and Mythos, sort of that next-tier model, even though it’s way more expensive.
Sonya Huang: And that’s volume by dollars totally, but what about volume by tokens?
Dylan Patel: Well, I guess who cares about volume by tokens? It’s about the dollars.
Sonya Huang: Fair enough.
Dylan Patel: Right? I don’t care that there’s, I don’t know, 200,000 Mini Coopers or Toyota Camrys sold if, you know, I don’t know, Ford F-150s are 5x the ASP and they sell only half as much.
Sonya Huang: Okay, fair enough.
Dylan Patel: And therefore, the most lucrative market is pickup trucks in America, right? Mostly being facetious, but like …
Shaun Maguire: Do you think this is one of the things that you’ve done so well and differentiates you from almost everyone else is that you care so much about the economics in addition to the technology? I think very few people bridge those two things well in this space.
Dylan Patel: I think it’s really fun inside of SemiAnalysis, because we have 90 people and a big chunk of them are technologists and engineers across the whole supply chain. And then a big chunk is people who were formerly at hedge funds. And you see these arguments. People are like, “Oh, well, that doesn’t matter.” And then someone’s like, “Well, but cost.” And then the engineer’s like, “No, no, no, but this technology is the coolest.” And you see this organically fight it out. And we’re pretty informal, and given the fact that I was a forum moderator, you can imagine what the situation was like.
Shaun Maguire: Well, you’re enjoying it. You’re enjoying it.
Dylan Patel: You don’t wrestle with a pig because the pig enjoys it, right?
Shaun Maguire: Exactly.
Sonya Huang: [laughs]
Shaun Maguire: Just on this topic, before I go into the next question, are there trigger topics in semis for you? If someone’s—which is such a meme. You think this person must be a moron, you know, like, memory is the bottleneck.
Dylan Patel: I mean, it’s true. But I think moreover, the one that really gets me is people are like, “AI has no ROI.” It infuriates me. There’s like, “What’s the ROI?” They’re denying model progress. There’s these people that are like, “Models aren’t getting better. They’re not reasoning. They can’t think. They’re going to dead end and plateau.” And it’s like, “Bro, the line has been up and to the right in terms of capabilities this entire time.” And they’re like, “Look, this benchmark didn’t improve.” That’s because it’s at 90 percent. Look at the new benchmarks.
Shaun Maguire: You’re saturated.
Dylan Patel: Yeah, you’re saturated. Now [inaudible] is skyrocketing, right? I think that’s more so the issue and challenge. Semis are really complex, and I don’t fault people for lacking understanding of it. Like, I learn stuff every day about the semiconductor supply chain from people, and I’ve been studying it for arguably 18 years since I started moderating the forums when I was 12. Arguably been studying it for that long. But even then, it’s like live, breathe, and that’s all I care about. But there’s so many layers of the abstraction stack. It’s like I learned about a new chemical that does $100 million of sales yesterday. And I’m like, whoa, didn’t know this one existed and what process it did. But it’s like you learn about things all the time. It’s like, okay, $100 million of sales in a couple hundred billion dollar industries, whatever.
Shaun Maguire: But it’s essential.
Dylan Patel: It’s essential. And it’s like, actually, every chip requires it. It’s like, wow, I guess there are 1,000 process steps. It’s like, oh yeah, you like semiconductors? Name every process step. It’s like, no, come on. What I think is the most funny is when people have all the facts in front of them and then they get the conclusion completely wrong.
Shaun Maguire: That happens in our job all the time, too.
Dylan Patel: Yeah. I think my attitude is not to be mad that you do that, it’s to do it as fast as possible.
Shaun Maguire: I think the industry, because it’s just like AI is the most important thing in the world right now and there’s so many near-term bottlenecks, we talk a lot about the near term. Are there longer-term things that you’re really excited about, like, say on a 10-year timeframe? We talked about orbital data centers, but silicon photonics, do you think they’re underrated or overrated on a 10-year timeframe? Are there other things that on a 10-year timeframe you’re excited about?
Dylan Patel: I think space is super crazy awesome in the 10-year timeframe for space data centers and all these mining asteroids and all these things, which I’m super excited about the vision of SpaceX, right? Again, not investment advice before you hop in. I think on the semiconductor side, tremendous market movements and tremendous things can happen just when things happen one year later or sooner. And so that’s all technology that, in terms of co-packaged optics, everyone knows it’s going to happen by the end of the decade. The debate is like ’27, ’28, ’29, 2030? But at some point along there it’s going to happen. I think the more interesting thing is there’s companies like—did you guys invest in Naveen Rao’s company?
Shaun Maguire: We did.
Dylan Patel: Okay, yeah. So I think, like, he’s trying to innovate on the silicon layer, on the software abstraction layer and the model layer simultaneously. He fully understands that it’s not like, we’re going to do this in a few years.
Shaun Maguire: It’s not a two-year timeframe.
Dylan Patel: Yeah, it’s not a few-year timeframe. It’s a long-term bet. And stuff like that is like, okay, we’re going to bring potentially analog compute with energy-based models and all this crazy shit all at once. It’s like, that’s exciting, probably won’t work, but that’s exciting and I really look forward to it.
Shaun Maguire: It definitely won’t work quickly.
Dylan Patel: Yeah, it definitely won’t work quickly is what I should say. I believe in Naveen, and I met him very—I think he’s one of the first people I met in the industry, funnily enough, in 2020 or 2021. Actually, 2020.
Shaun Maguire: It says something about him. I think he’s someone in my experience, he’s always trying to …
Dylan Patel: I baited him on the internet. I baited him on the internet.
[CROSSTALK]
Shaun Maguire: He’s always trying to help the younger generation. He’s trying to identify talent.
Sonya Huang: He was also so ahead of his time with Mosaic. I remember getting pitched Mosaic.
Dylan Patel: No, it was 2019. I was still anonymous then, actually. I baited him on the internet, and he started replying, and then I just took it to DMs and then took it to a call. And that was the first person who was really important that I talked to in the entire semiconductor industry.
Sonya Huang: That’s funny.
Dylan Patel: But yeah, sorry to interrupt.
Sonya Huang: That’s funny. What do you think is the end state of the ecosystem? Do you think every lab and every hyperscaler just has its own chips? Trainium seems like it’s now working. Do you think we end up with every lab and every hyperscaler has its own chips, at least for inference, and then maybe for training you go to NVIDIA or whoever? What do you think is the end state?
Dylan Patel: I think everyone will try and they won’t stop trying. I think ultimately supply chains matter. What technology you can bring in matters. And more and more as the industry gets bigger, supply chain diversification happens. Right now everyone’s chip more or less looks the same. It’s a big logic compute die in the center and there’s some HBM on the right and left and on the top and bottom. The top side is networking and then the bottom side is PCIe and other I/O. And that is the exact same structure for Trainium, TPU, NVIDIA chips, and most of the startups—not Groq and Cerebras, because they’re doing weird shit, but that’s cool, you know?
But I think, like, as you step forward, we’re going to get more bifurcation of hardware architecture and model architecture, and therefore people are going to co-optimize them. Some of them will end up in local minima. If this is gradient descent, people are trying to go to the most optimized solution. Some people will race to a local minima, and then the question is how do you scoot back over to the absolute minima? To some extent, NVIDIA will always be more general purpose than anyone else’s chip in general, at least on a parallel AI compute basis, because they have so many customers who care about different things and who will always give them feedback in the design. You know, the minima will always be better than them, but is that minima a local minima? Is the TPU or Trainium or Groq or Cerebras or whoever’s design optimized awesomely for here, but in the end state, actually, you’ve got to go over here and so they’re wrong?
Sonya Huang: Yeah.
Dylan Patel: And maybe they’re great for a little bit of time, but then they end up being wrong. That’s the real question. And so I think there will be a big market for general-purpose AI compute, because you talk to people at labs, they don’t even know what architecture they’re going to be doing in a year. They literally don’t know what architecture they’re going to be doing in a year. They have bets. They have many research bets and that’s this exciting thing, but they don’t know where it’s going. Generally, they know what hardware they have and they’re trying to co-optimize, but ultimately, like, if a new breakthrough happens on model architecture, it’s like you just replace the attention mechanism with something else, or all of a sudden something happens, the best hardware will change. And therefore, are people going to make five-year investments on hardware solely on an ASIC that is more specialized, or are they going to have some bucket of more general-purpose compute?
And so you see this with, like, Google’s paying $11 an hour per GPU to xAI for GPUs, right? Like, that’s insane! Obviously, compute is limited and so on and so forth, but it’s very insane, despite the fact that they have TPUs. So there’s some questions there, like why did they do that? Google actually has three different design programs for TPUs. They’re making a TPU with Broadcom that’s a different architecture than the TPU with MediaTek, that’s a different TPU than the architecture that is—I won’t disclose my research. But they’re making different architectures. It’s not just like, oh, they’re making TPUs with a couple vendors and it’s the same architecture. It’s different architectures. And the third one is a very different architecture from the first two.
And so I think people recognize that the local minima can happen, and therefore I think everyone will have their own ASIC program. I think everyone will deploy billions of dollars of their own ASICs, tens of billions of dollars, in the case of Google, hundreds of billions of dollars a year of their own ASICs. But ultimately, they’re also going to have workloads that don’t use TPUs, right? Some of the Google bets that are not Gemini DeepMind actually primarily use GPUs. They don’t use TPUs. Some of them also primarily use TPUs. It’s a bit of a broad thing, but maybe for drug discovery or for Waymo, you might not want to use TPUs. I won’t say which one it is, but there are different architecture bets and different paths for AI. AI for science may have different algorithmic patterns than general intelligence AGI models. So I think we’ll see diversity continue to proliferate.
Sonya Huang: Yeah.
Dylan Patel: And because the market has gotten so big, niches will be carved out. And so that makes it possible for companies to have their niche and actually make money even if the majority of the pie goes to NVIDIA and TPU and Trainium.
Sonya Huang: Okay. Love that. Can we talk about the data center buildouts? By all accounts, if you look at the charts, dollars per compute hour, we are in the middle of a crazy compute crunch. And it seems like it’s both a demand and supply side crunch, right? Demand for long horizon agents skyrocketing, supply, all these data center buildouts are delayed. Do you think we’re in a compute crunch for the foreseeable future or do you think it alleviates at some point?
Dylan Patel: Yes. Every quarter we’re deploying vastly more compute than the prior quarter, and there’s more data centers built than the prior quarter. This year there’s going to be 20 gigawatts, even accounting for the delays. And next year there’s going to be more than 30 gigawatts, accounting for the delays. Of course, delays happen on everything, right? Anything hardware can have a delay. That’s just the reality of life. Are we going to have a compute crunch for the rest of our lives? It depends on what happens with models but, like, the TAM for Mythos 5, Fable 5 is not just 2x that of Opus, right? The model is so much better and it can do so many more tasks that the TAM for it is way larger than that. And yet compute in the world did not double in the last six months, or maybe, like, seven or eight months since Opus 4.5 was launched to now. 4.6, 4.7, 4.8 were improvements, but Fable and Mythos were a huge step function improvement. The world’s compute did not double or quadruple or whatever in that same time frame, but the demand for useful tasks that can be done by AI, the number of useful tasks and the value of them that can be done by AI has.
So now the question is, what happens? Well, obviously Anthropic in Q2 is profitable. They’re net income profitable excluding stock-based compensation. And I think by Q3 they may even be profitable including stock-based compensation. That’s how profitable they’re getting, and their margins on an Opus token, at least Opus 4.8 token, is north of 80 percent for the API price. They’ve got a lot of deals where their total corporate gross margins get clawed down a little bit because of how they do Bedrock deals and Vertex deals and things like that. But ultimately, their per-token margin is so high. Well, then if you don’t have—they have the capability to pay. Ultimately, every GPU they buy at an above-market rate—they also bought GPUs at above-market rate from SpaceX, which is below the rate of Google, but that’s because they signed earlier. It’s something that other companies, maybe a venture-backed company or a company that’s not really got positive margins, can’t necessarily do. What is the cost-benefit ratio? It’s like, every GPU I rent, because I’m out of compute capacity, I can immediately turn around and sell tokens on it—or every TPU or every Trainium, I can immediately sell tokens on it at a positive margin. And if I’m running 75 percent gross margin and I double the cost of the compute, it’s fine. I’m still running 50 percent gross margin. And spinning up more compute nodes is not really necessarily a human-requiring task for them if they’re renting them. And so ultimately it’s like, well, my NOI still goes up, right? And so I’m going to rent GPUs at whatever price, at some level, whatever price I want to pay, I can pay.
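The margin arithmetic in that answer works out as follows. This is a minimal sketch: revenue is normalized to 100 and the function name `gross_margin` is hypothetical, while the 75 percent starting margin and the doubling of compute cost are the speaker's figures.

```python
def gross_margin(revenue, compute_cost):
    """Gross margin as a fraction of revenue."""
    return (revenue - compute_cost) / revenue

revenue = 100.0
base_cost = 25.0   # 75 percent gross margin at normal compute prices

base = gross_margin(revenue, base_cost)
doubled = gross_margin(revenue, 2 * base_cost)     # paying 2x for compute
breakeven_multiple = revenue / base_cost           # cost multiple where margin hits zero

print(base, doubled, breakeven_multiple)
```

On these numbers, each rented accelerator stays profitable until compute costs four times the baseline, which is why a high-margin lab can outbid lower-margin buyers for scarce capacity.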
Sonya Huang: I have almost the reverse question of, like, at some point, does this compute buildout go bump in the night? Earlier today, I think there was a tweet, like, Crusoe publicly said one of their customers had asked to halt construction on one of their data center buildouts. Like, it seems like everybody in the ecosystem is so levered right now to, like, we gotta build, we gotta build, we gotta build. High leverage, high growth, to me, is like, makes me very, very nervous as an investor.
Dylan Patel: Wait, hold on. High leverage, high growth means small amount of equity, has huge upside. You’re not a debt investor. You’re not a credit investor. You’re an equity investor.
[CROSSTALK]
Dylan Patel: You got to go to the school of private equity. Levered buyouts only.
Sonya Huang: I actually come from the school of private equity.
Dylan Patel: Oh, awesome.
Shaun Maguire: She forgot the school. She’s been a VC for too long.
Sonya Huang: Now I just do revenue multiples. Do you see any signs of that? Are you worried about that?
Dylan Patel: I see what you mean. And that sort of goes back to the model point, right? Obviously, if the model’s expanding the total economic valuable work—that’s sort of the Dark GDP report that we did and that you mentioned earlier. If the work that these models can do does not expand faster than the compute capacity, then that tide turns, right? And over the last six months, that tide has been very much levered in this direction of the models can do more work or are expanding their TAM of work they can do faster than the compute is increasing, and so prices go up.
It’s very possible that all of a sudden model progress stops. You talk to anyone at Anthropic or OpenAI, maybe they’re drinking the Kool-Aid, but you talk to basically all of them and they’re like, “No, no, no, model progress still goes up.” And so ultimately, current methods could stall somewhere. I’m not sure where that would be. It seems like we have line of sight to rapid model improvement. And in fact, models are improving faster than they were six months ago or a year ago because there’s—I wouldn’t call it recursive self-improvement, but basically the models are helping write all the infra and launch the next model sooner and sooner and sooner. So you’ve got this pseudo-recursive self-improvement loop going. And so the models are getting better and better and better faster. But ultimately, capital is a big problem, which is why Google raised capital. They’ve got an ungodly amount of SpaceX, right? They own, like, five percent of the company.
Shaun Maguire: I think a little more.
Dylan Patel: I think at one point they had like 10 percent.
Shaun Maguire: Larry Page invested $1 billion at a $10 billion valuation, got 10 percent of the company. It got diluted, all this, but that was one of the greatest investments of all time. Good job, Larry.
Dylan Patel: So they know they have $100 billion in the bank that they can sell in nine months or whatever from the lockup, and they have all the gross profit they do and yet they still modeled that and they were like, “We need to raise capital,” and so they did an offering. And it’s like, that’s insane. So that tells you how much they think they need to spend. But capital is really—Meta announced that they were going to do a raise. Stock tanked, people don’t like it. But all these companies are going to raise capital, whether it be debt or equity. At some point, money spigots will have to slow down, but right now, every GPU that Amazon adds, they’re making higher revenue. Or every TPU or Trainium, whoever anyone adds, is making gross profit.
Shaun Maguire: I’m going to do a little bit of a tee-up on this to turn it into a question for you. But as we talk about this, for me, the thing that’s going through my head is almost an alternative hypothesis for the Crusoe example. I’m going to use an analogy in oil. Like, in oil, Saudi Arabia has way lower cost per barrel to produce oil than a lot of other countries. There’s also the purity of the oil. Saudi has generally very low contaminants in their oil, which makes refining easier, all of this. The question from me is like, when you look at for every gigawatt that’s being put in the ground, call it the 20 gigawatts coming online today, how much homogeneity do you see in those gigawatts? Is it something like—and you can tell me whatever metric you think is right, but are Google’s gigawatts two times more valuable than, say, most neoclouds because they have optical switches and they’ve been doing it for a long time and they know how to do power smoothing? Because I think this could be the alternative hypothesis that some of the people that are good at building data centers, they should just do it to the max, because there’s so much demand and they’re so much better at it. But then maybe we’re starting to see the early signs of the people that are not as good at it kind of getting hit a little. So I don’t know the reality. I’m just curious how you think about that.
Dylan Patel: So far—there are metrics for this, right? So Trainium sells at sub-$10 billion per gigawatt rental rate to Anthropic and to OpenAI. GPUs, at least before the craziness of the last six months, usually went around $12 to $13 billion per gigawatt. So the rental rate—and this is from a neocloud versus Amazon even, and now when Amazon sells GPUs, they’d also be $13 or so.
Shaun Maguire: And my understanding of that also is that those numbers, like, Amazon subsidized that a little bit so that it’s like, I actually think the numbers were even—I think the disparity is even more extreme.
Dylan Patel: It’s less than 10. It’s less than 10, but there’s some weird, basically, how much …
Shaun Maguire: And look, my understanding, Anthropic played a big role in making Trainium useful in terms of writing all the libraries, et cetera. And so everything I hear is that Trainium’s really freaking good hardware and it’s getting way better. And obviously Anthropic’s now using it a lot. So hopefully we would see that price go up per gigawatt.
Dylan Patel: The deal they did was actually like there’s a floor mechanism, and if it didn’t do well, it would be cheaper and then to the point where it’s cancelable. And if it did really well, the price is kind of higher. But effectively less than $10 is where Trainium shakes out at. Whereas GPUs, I mean, the SpaceX deal again was like $25 million per megawatt a year rental rate with Google. I was like, that’s a crazy divergence.
Now obviously, if Amazon was selling Trainium today, it’d probably be more expensive than $10 because of the compute shortages. But you do see this already in the sense of with data centers, oftentimes the rental price of a data center, if you’re doing colocation, not compute in there, but just power, here’s the data center, you price it generally on a dollars-per-kilowatt-per-month basis. And so they used to be $60 per kilowatt per month, and now you see things transacting at anywhere from, like, $120 to $160. But different quality data centers, I’ve seen data centers go as high as $200 when the customer has not such a great credit rating and then the data center is a pretty good one. And I’ve seen stuff go as low as $100 still, or in India go as low as $80 because the grid’s not reliable, the internet connection’s not great and it’s a pretty mid data center, but at least it’s a data center. And so you see this huge discrepancy there already.
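The rates in this exchange are quoted in three different units (billions of dollars per gigawatt, millions of dollars per megawatt per year, and dollars per kilowatt per month), which makes them hard to compare by ear. The following is a minimal sketch that puts them on one scale. It assumes the "per gigawatt" figures are annual rental rates, which the conversation implies but does not state; the function names and labels are illustrative only.

```python
# Sketch: put the rental rates quoted above on one scale,
# USD billions per gigawatt per year.
# Assumption: the "$ billion per gigawatt" figures are annual rates;
# the conversation does not state the period explicitly.

KW_PER_GW = 1_000_000
MW_PER_GW = 1_000
MONTHS_PER_YEAR = 12


def mw_year_to_gw_year(usd_millions_per_mw_year: float) -> float:
    """Convert USD millions per MW-year to USD billions per GW-year."""
    return usd_millions_per_mw_year * MW_PER_GW / 1_000


def kw_month_to_gw_year(usd_per_kw_month: float) -> float:
    """Convert USD per kW-month (colocation pricing) to USD billions per GW-year."""
    return usd_per_kw_month * MONTHS_PER_YEAR * KW_PER_GW / 1e9


# Compute rentals and power-only colocation, all in USD billions per GW-year.
rates = {
    "Trainium (quoted as under 10)": 10.0,
    "GPUs, earlier typical (midpoint of 12 to 13)": 12.5,
    "SpaceX deal ($25M per MW-year)": mw_year_to_gw_year(25.0),
    "Colocation, old price ($60/kW-month)": kw_month_to_gw_year(60.0),
    "Colocation, recent low ($120/kW-month)": kw_month_to_gw_year(120.0),
    "Colocation, recent high ($160/kW-month)": kw_month_to_gw_year(160.0),
}

for label, value in rates.items():
    print(f"{label}: {value:.2f} $B per GW-year")
```

On these assumptions, $25 million per megawatt per year is $25 billion per gigawatt per year, roughly 2.5 times the Trainium ceiling, while power-only colocation at $120 to $160 per kilowatt per month comes to about $1.4 to $1.9 billion per gigawatt per year.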
In the case of data center construction, usually the pitfalls are they just fail. There’s a lot of people who claim they’re going to—they’re, like, four guys. They’re like, “Yeah, I bought some turbines. I put the money down for them. I’m going to build a data center.” And then they get delayed, delayed, delayed, and fail. So you have to probability weight, time weight, time lag the teams that suck versus don’t. And our data center model does that. We kind of track every data center and try and do this for every single one based on equipment that they’re using and all these things. One of the things you mentioned about Google is in a gigawatt data center, they’ll actually put, like, 1.5 gigawatts of hardware. And because they have such understanding all the way from workload to—they’re able to slosh the power around. And so instead of constantly a gigawatt of compute, which typically runs at, like, 60 or 70 percent utilization—in terms of power consumption, not utilization of the hardware, someone’s always renting it—they’re now running it at—like, that 60 to 70 percent means it’s at a gigawatt. And they’re using the full gigawatt.
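The oversubscription arithmetic described here can be checked with a short sketch. It treats the 60 to 70 percent figure as an average power draw relative to installed hardware capacity, which is a simplification: it ignores peaks, and in practice the workload management described in the conversation is what keeps the site under its limit. The function names are illustrative.

```python
# Sketch of power oversubscription, assuming hardware draws a fixed
# average fraction of its installed (nameplate) power.
# Simplification: peaks above the average are ignored here.


def average_draw_gw(installed_gw: float, power_utilization: float) -> float:
    """Average power actually drawn by the installed hardware, in GW."""
    return installed_gw * power_utilization


def installable_gw(site_limit_gw: float, power_utilization: float) -> float:
    """Hardware that can be installed so average draw equals the site limit."""
    return site_limit_gw / power_utilization


site_limit = 1.0  # a "gigawatt data center"

for utilization in (0.60, 0.70):
    plain = average_draw_gw(site_limit, utilization)
    ceiling = installable_gw(site_limit, utilization)
    print(
        f"utilization {utilization:.0%}: "
        f"1 GW of hardware draws {plain:.2f} GW on average; "
        f"about {ceiling:.2f} GW of hardware fills the site"
    )
```

Under this simplification, a site limited to one gigawatt can host somewhere between about 1.4 and 1.7 gigawatts of hardware at 70 and 60 percent power utilization respectively, which brackets the 1.5 gigawatt figure mentioned.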
You see people doing deals with—including Google—with utilities where they’re like, “Oh, well, I know this grid can sustainably take a gigawatt, but except for three days of the year, you can actually do two gigawatts. So give me two gigawatts and then just tell me to turn off.” And so they’ll do that. And so these sorts of tricks—and then you need to have supreme management of workload, backup power, all these things, generators on site to figure out how to actually keep it two gigawatts sustainably.
When people do this, they’re able to charge more, whether it be I’m actually selling two gigawatts despite only having one gigawatt, because those three days I’m able to deal with via battery, gas, et cetera, or I figured out how to build power on site and now I have a gigawatt where no one else does, and so I’m able to do it quickly. It’s not necessarily transacting for a higher price, it’s that I’m selling more gigawatts. And sometimes there are levers where you’re selling more gigawatts, where each gigawatt is selling at a different price.
On the data center and energy layer, it’s more about just having it versus not and then that being delayed or not. It’s more binary. But on the compute side, I do think there’s a lot more interesting work there, right? A gigawatt given to Anthropic is objectively worth more revenue than a gigawatt given to OpenAI. And it seems that both of them could sell every gigawatt that they have right now given rate limit problems and token max limit and all these sorts of things at OpenAI and Anthropic, especially since Codex 5.5 came out. It’s much better. And then likewise, if you gave a gigawatt to SpaceX, they’d turn it into …
Shaun Maguire: My guess, like, my suspicion is that they probably make better use of the hardware than most people. I think people underestimate how much networking experience they have from Starlink in particular, and also how much power management experience they have from Tesla and …
Dylan Patel: Yeah, people like Brent Mayo are incredible.
Shaun Maguire: They’re good. For me, that’s actually probably the thing that might—I don’t actually know the answer, but I think that might be missing from the analysis a lot of people are doing.
Dylan Patel: I think it’s also the fact that when CoreWeave builds a gigawatt, even though their GPU compute is objectively better than Amazon or Google or Microsoft’s in terms of performance—we’ve tested the performance and reliability—the problem is CoreWeave sells it six months before they have it up, and they need to turn around and take that paper that they signed to get debt with that credit backing and then turn around so they can actually pay for the PO that they’ve already issued, for the order that they’ve already issued. Whereas SpaceX was like, no, no, no, this is running now. Buy it. And it’s a big discrepancy when you have a balance sheet to do that versus not. And that also helps your revenue per megawatt be much higher.
Sonya Huang: Why does the neocloud opportunity even exist? Because if you had asked me five years ago, I would’ve said hyperscalers are going to own this. And, you know, you mentioned just now CoreWeave has better performance than the hyperscalers. Why does this opportunity exist maybe at the macro level and then in the execution level?
Dylan Patel: Yeah, so in 2023, I wrote a report that had Amazon really hate me. It was called “Amazon Cloud Crisis.” I talked about how Amazon was the best cloud because they had their Nitro NICs, which offered tenant isolation. All the hypervisor ran on the NIC, and then you could sell all the cores. And they had custom SSDs that they made, and they’d buy the raw NAND and they’d have lower cost because they’d buy the raw NAND and build their own SSDs. They had their custom Graviton CPUs, and that drove down cost per core.
And so they had all these things that enabled them to sell more cores, have better security, good networking, but this was all for the traditional CPU, better storage for the traditional cloud world. But in the AI cloud, a lot of this stuff hurt performance. These Nitro NICs were bad for performance. They still are worse for performance, although they’ve caught up a lot because they’ve had a couple of iterations to improve them, but they’re still worse for performance. A lot of the security stuff doesn’t matter, because it’s not like I’m time-slicing users or slicing a socket into many users, right? No one rents a single GPU in an eight-GPU server. No one rents a single GPU in a 72-GPU rack. They rent the whole rack. And in fact, they rent many of the racks. And then there’s no, like, oh, I rent for six hours and I give it back. Everyone has these long-term contracts.
So the mechanics of the GPU rental market meant that a lot of the expertise of the hyperscalers fell away, and a lot of the expertise that they did have, some of them were detrimental. Network performance, for Google and Amazon, they had custom networks that were better for traditional CPU and for the stuff that they were doing, but actually worse for AI.
And then in other cases it’s like, Microsoft would save money by building their own data centers, but their data center teams were not actually that great. And so when it came time to run—when it was predictable building, it was fine, but when it came time to actually double your forecast for the year, it’s like they fell on their face and they had to go get a bunch of neocloud capacity.
So I think performance, I think time to market is another one. These massive organizations, no one’s getting rich from building this data center faster. But you look at Crusoe, for example, Chase and all the other people on the team—I was going to name some people on the team, but I’d rather not—all these people are getting rich if they fucking deliver this compute faster. They’re hyper-levered equity owners.
Shaun Maguire: Hey look, they’re also all coming from Bitcoin and you’re not supposed to say that.
Dylan Patel: I mean, a lot of the data center—like their main data center guy came from Microsoft.
Shaun Maguire: I know. I’m just teasing. But it’s like you learn a lot when you’re in a very high fluctuation market.
Sonya Huang: How much of this do you think was Jensen playing 4D chess?
Dylan Patel: Jensen absolutely hates a world where all the hyperscalers have all the power. There’s a reason he’s blowing money on random AI labs that—like, I don’t even know if it makes sense to, but he’s blowing money and pumping them up and going to everyone around the world and saying, “You should invest in this company,” because he wants to create a multipolar world. That’s why he loves Chinese labs, because he wants to create a multipolar world. A world where OpenAI, Anthropic, and Google models are the only models is one in which he’s screwed.
Sonya Huang: Yep.
Dylan Patel: A world in which the hyperscalers are the only ones building compute is one he’s screwed in. And so of course he needs to point the allocation gun at neoclouds, help backstop their clusters, do anything and everything, because while today a GPU sold to Crusoe, and a GPU sold to CoreWeave, and a GPU sold to Google and Amazon are all the same price for him, five years from now, Crusoe and CoreWeave existing means Google TPU will be weaker, and means Amazon Trainium will be weaker. And more inference being done with non-closed-source model labs is better for them.
So I think the neocloud ecosystem is these people that are Wild West, these neolabs as well. A lot of them have investments from NVIDIA. It’s the Wild West. Some will fail, many will fail, but some will emerge as really great teams, whether it be Crusoe, who’s a bunch of crypto guys who then started building data centers and doing flared gas stuff, or CoreWeave, who initially was a bunch of …
[CROSSTALK]
Shaun Maguire: Hedge fund guys and [inaudible].
Dylan Patel: But then they built—there were a lot of people who didn’t bubble up like them. Started around the same time and just failed.
Shaun Maguire: I got to say both those teams are phenomenal. They deserve a lot of credit. That’s your point.
Dylan Patel: Yeah. My point is it’s like you throw a bunch of bait into the water and the best fish will figure out and survive. And sort of the same way with the neoclouds, and he hopes the neolabs as well. We’ll see if any of the neolabs really bubble up. But Thinking Machines has a few hundred million dollars of ARR. That’s pretty impressive even though they’ve had—in the media, it’s like, “Oh, they’ve lost all this talent.” It’s like, well, but Tinker is doing a few hundred million dollars of ARR. That’s pretty impressive for out of the gate, a product that’s less than six months old or whatever. We hope the same happens to other neolabs. And so he wants a multipolar world.
Shaun Maguire: Truly, congratulations on the success.
Dylan Patel: Thank you.
Shaun Maguire: Just the last thing I’ll say is I’ve seen a little bit of this. I think the public, they can probably tell from listening to you how hard you work but, like, it’s clear you’ve just been working your ass off for more than a decade. And it, you know, led to the last few years of being in the right place, right time. But I think it’s unbelievable what you’ve accomplished. And I know it’s just the beginning. So …
Dylan Patel: Thank you so much.
Shaun Maguire: Thank you for doing this.
Dylan Patel: Awesome.