
Microsoft CTO Kevin Scott on How Far Scaling Laws Will Extend

The largest platform companies are continuing to invest in scaling as the prime driver of AI innovation. Are they right, or will marginal returns level off soon, leaving hyperscalers with too much hardware and too few customer use cases? To find out, we talk to Microsoft CTO Kevin Scott, who has led the company's AI strategy for the past seven years.

Summary

The current LLM era is the result of scaling the size of models in successive waves (and the compute to train them). Microsoft’s Kevin Scott describes himself as a “short-term pessimist, long-term optimist” and he sees the scaling trend as durable for the industry and critical for the establishment of Microsoft’s AI platform. Scott believes there will be a shift across the compute ecosystem from training to inference as the frontier models continue to improve, serving wider and more reliable use cases. He also discusses the coming business models for training data, and even what ad units might look like for autonomous agents.

  • Scale remains the primary driver of AI innovation. Scott emphasizes that we are not yet seeing diminishing returns on scaling up AI models. Each new generation of models brings improvements in capabilities, cost-effectiveness, and reduced fragility. He advises companies to architect their AI applications flexibly, allowing them to easily incorporate new advancements as they emerge.
  • The shift from training to inference is reshaping the compute landscape. While there’s much focus on the costs of training large models, Scott predicts that inference—the actual use of these models in applications—will soon dwarf training in terms of compute demand. This shift is driving innovations in data center architecture and software optimization for inference.
  • Quality of training data is becoming increasingly crucial. As the field matures, the quality of data used to train AI models is proving to be as important as—if not more important than—sheer volume. This trend is leading to new business models and partnerships centered around high-quality training data, as well as the development of more efficient training techniques.
  • AI assistants like Microsoft’s Copilots represent a strategic focus on augmentation rather than replacement. Scott emphasizes Microsoft’s deliberate choice to develop AI as a tool to enhance human capabilities rather than substitute them entirely. This approach aims to build user trust and avoid the pitfalls of introducing unreliable autonomous systems prematurely.
  • The potential for AI to solve zero-sum problems and create abundance is a source of long-term optimism. Scott envisions AI as a transformative force in addressing critical challenges in healthcare, education, and scientific research. He urges a focus on deploying AI to tackle these pressing issues, emphasizing the high cost of not utilizing AI’s potential to improve lives and solve global problems.

Transcript

Contents

Kevin Scott: The things that are brittle right now, where you’re like, oh, my God, this is a little too expensive, or it’s a little too fragile for me to use, all of that gets better. It’ll get cheaper and, you know, things will become less fragile. And then more complicated things will become possible. That is the story of each generation of these models as we’ve scaled up.

Pat Grady: On any given day, Microsoft may be the most valuable company in the world, and arguably no one has been more ambitious, more strategic, or more effective in its AI strategy than Microsoft. The key architect behind that strategy is Kevin Scott, CTO of Microsoft. We’ve had the pleasure of knowing Kevin for a couple of decades now, dating back to his time at Google, where he overlapped with our partner Bill Coughran, who joins us today for a very special episode of Training Data. We hope you enjoy!

Kevin, thank you for being here on Training Data.

Kevin Scott: Yeah, glad to be here. 

Kevin’s backstory

Pat Grady: So just to start, I know you’ve talked about this before, but for our listeners who might not be familiar with your story, how does a kid from rural Virginia end up becoming the CTO of Microsoft?

Kevin Scott: Who knows? Certainly not a repeatable plan, I don’t think. The thing, when I reflect back on my story, is that it’s just a lot of being at the right place at the right time.

So I’m 52 years old. So I was, you know, 10, 11, 12 years old, when the personal computing revolution started to hit full steam. And so like, right at that moment, when you’re a kid trying to figure out what you’re about, like what you’re gonna latch on to, like, I had this really convenient thing that captured my interest, that was a good place for me to ground my curiosity on. And, you know, and I think that’s one of the object lessons in general is if you happen to be interested and like really motivated to learn more and do more with something that is, at the same time growing really, really quickly, like you probably are going to end up in a reasonable place. 

And so, you know, I was interested in computers. I was the first kid in my, or the first person in my family, neither my mom nor my dad went to university. So I was the first one to graduate with a bachelor’s degree. Like I majored in computer science and minored in English literature. I had this moment, when I was trying to decide what I was going to go do after I got my undergraduate degree, where my two advisors were arguing about whether it should be a PhD in computer science or a PhD in literature, and I was very seriously considering both, but I was so broke. And just so tired of being busted all the time that I picked the pragmatic path. Like, I still imagine what my life would have been like as a person with a PhD in English literature. And I think it would have been just fine. But like, I chose one of my two equal interests.

And then, you know, for a while, I thought I was going to be a computer science professor. And at the last minute, and this is where Bill and I intersected, like I decided I was a compiler optimization and programming languages person, through years and years in grad school, and I got almost all the way to the end, and I was like, I don’t think I want to be a professor anymore. Like, I would work on these things where it was six months of effort to write a paper and you make some synthetic benchmark 3% better. And I was like, this doesn’t feel to me like the way to have a lot of impact in the world. And I don’t want to do this over and over and over again for the next 25 years of my life.

And so I sent my resume cold into Google in 2003. And I got an email from this guy, Craig Nevill-Manning, who had just gone off to New York to open up Google’s first remote engineering office. And like, I had an amazing interview at Google. I don’t know whether this was on purpose or not, or like, I just got the luck of the draw. But like, it seemed like every compiler person who was working at Google was on my interview slate. And I was like, This is amazing, like, all these people know all of this stuff that I know. And yeah, we can have easy conversations. I worked on nothing that was even remotely close to compilers at Google. And I was confused why all of these people were there. But it was a great interview. And I was super stoked. 

And I joined Google, and Google was yet another one of those things, just like the PC and just like the internet, like it was a phenomenon that was growing crazy fast with a bunch of smart people working there. And that resulted in this opportunity I had to go join this startup AdMob when it was very early on, like right at this pivotal moment when mobile was taking off and you needed things in mobile like advertising infrastructure. And I helped build the seminal company, I think, in mobile advertising, and then was back at Google.

And then I helped LinkedIn go public, running its engineering and operations team. And then I was at LinkedIn when we got acquired by Microsoft. So like, none of that, I think you can plan. It’s just like a lot of the right place at the right time. And you’re trying at every point you can to do the most interesting thing you can do on the thing that’s growing really fast.

The role of PhDs in AI engineering

Bill Coughran: You know, when you talk about your personal history, Kevin, I guess, you know, the focus nowadays is on AI and machine learning. A lot of the practitioners are people with PhDs. How do you think about practical teams for AI? Since you’re obviously doing a lot of that work at Microsoft and involved with partnerships with OpenAI and others?

Kevin Scott: Yeah, I mean, I think if you are building the really complicated platform pieces of AI, so like the big distributed systems for training and inference, the big networking and silicon and system software components, or the algorithms that you’re using to do training and inference, I think a PhD is super helpful. Like, there’s just a huge amount of prior knowledge that you need to have in order to jump into the problem space and be able to go quickly. And, you know, you need to be clever, but you don’t need a PhD to be clever. I know you have a PhD, Bill, and are far cleverer than I am. But like, usually folks with PhDs are clever, but they’re not the only people in the universe who are clever.

So like, I think it’s mostly helpful in the sense that you’ve gone through a pretty rigorous training regimen where you get a whole bunch of prior art stuffed into your skull, and like you demonstrably can do a very complicated project. And you know, the PhD projects look kinda like AI platform systems projects, except the AI platform and systems projects are lots and lots of people working together. Whereas, yeah, when you’re getting your PhD, you often are like working in relative isolation on, like, your particular thing. So like, that’s the, you know, one of the things people have to learn is how to get yourself docked into a group and to be able to collaborate effectively with a bunch of other people like yourself. 

So useful, but you know, there’s so much else in AI that needs to be done other than building the platform. Yeah, and for those things, a PhD is helpful, but certainly not necessary. Yeah, like figuring out, how do I apply this to education? How do I apply this to healthcare? How do I build developer tools around this? How do I do all of the million things that happen when new platforms emerge that you know, sort of complete the whole platform into like a portfolio of products and a portfolio of middleware and a portfolio of like all of the other stuff you need.

Microsoft’s AI strategy

Pat Grady: Well, speaking of which, Microsoft seems like it has about the most far-reaching or ambitious AI strategy of anybody out there. Can you just kind of say in a couple words, what is the AI strategy for Microsoft? And then just for fun, if you’re going to grade yourself, what have you done particularly well? What have you done maybe not as well as you could have?

Kevin Scott: Yeah, so I mean, we’ve been sort of talking about this strategy. Microsoft is a platform company. Like we, I think, participated in, or helped drive, a handful of the big platform waves in computing. Like we were certainly one of the pillar companies in the personal computing revolution. We had an important part to play in the Internet revolution, although I think that one was a far more diversely contributed-to revolution than personal computing. Yeah, we kind of missed the mobile computing revolution.

But like each one of those things like we have thought about, how do you go build a technology platform for this particular era of technology that allows other people to go build on top of that platform to make useful things for other people. 

And so that is our AI strategy. It is, like, how do you go, from frontier models to small language models to highly optimized inference infrastructure, you know, hyperscale on both training and inference, with economies of scale making the entire platform more accessible because it’s cheaper and more powerful with every turn of the crank.

And like all of the developer tools, and safety infrastructure and testing and everything that has to be there in order to have robustly built AI applications, like go build that, and listen to developers and listen to people building AI as intently as you possibly can, so that you are filling in all of the gaps that you can for them as they are encountering problems deploying this technology to users. So that is our strategy. And yeah, I think we’re doing a reasonable job of it. And like, I hate to grade myself, it seems a little bit disingenuous, right?

Highlights and lowlights?

Pat Grady: Highlights and lowlights. 

Kevin Scott: Well, so, maybe before I do that, like, you know, let me describe something about my own psychology. So I am an engineer. And I think most engineers are, like, short-term pessimists, long-term optimists. So the short-term pessimism is like you come in every day, and you’re like, oh, my God, this is like a batch of crap. Like, I just don’t like any of this. And like, everything’s broken. And I’ve got so much stuff to fix, and I’m so frustrated. But you work on all of those things anyway, because you’re optimistic that all of the problems can be fixed, and that they’re going to be worth fixing at the end of the day.

So yeah, I mean, there’s a bunch of stuff that I think we’re doing really well. Like, I think we have absolutely, along with OpenAI, made very powerful AI dramatically more accessible than it otherwise would have been to a larger group of people. I think because of that work that we’ve been doing alongside OpenAI, we’re just seeing lots and lots of customers who otherwise wouldn’t be building powerful AI applications.

And so like, I feel like we’re doing a good job in the way that we’re partnering. I think we’re doing a good job in having a really particular point of view. And it’s not an immutable point of view. But it’s a point of view about what an AI platform ought to look like. And we’re trying to make it as complete as we can.

You know, the lowlight is I think we were a little bit late to some of the basic AI stuff. It wasn’t that we were not investing in AI at all. And like you can sort of look at some of the work that Microsoft Research had done over the years, and MSR was an early AI leader. And I think, you know, Bill knows this just as well as I do, just from his time at Google, where we overlapped for a number of years, but many, you know, maybe most of the really important advancements in AI over the past 20 years have been a function of some kind of scale. Usually you’ve got data scale and compute scale, which in combination let you do things that weren’t possible at lower scale points.

And yeah, at some point that scaling of data and compute is so exponential that you get past the point where you can have fragmented bets, where it just becomes economically impossible to bet on 10 different things that are all exponentially scaling, or have the ambition or the need to exponentially scale, simultaneously.

And so I think one of the things that we were a little bit late to is, like, we didn’t put all of our eggs into the right basket soon enough. Like, we were spending a lot on AI, but it was fragmented across a whole bunch of different things. And because we didn’t want to hurt any of the feelings of smart people, or, yeah, whatever, right. Like I don’t even know what the diagnosis was, because a lot of that was before I was at Microsoft. We just weren’t as quick as we should have been at saying, “nope, scale is what matters,” and, like, here’s how we’re going to focus our investments on scale in a principled way.

Accelerating investments

Pat Grady: When did that, when did you get religion that scale is what matters? Was there a particular event or a moment that really crystallized that for you?

Kevin Scott: Yeah, I mean, so I’ve been at Microsoft for about seven and a half years now. And, like, when I became CTO, my job was to take a scan left to right across both Microsoft and the entire industry and try to see where we just had holes in execution, where we were not doing things at that point in time, which was, I guess, early 2017. Like, what are we not doing today that we’re going to deeply regret in 2019 or 2020, so two, three years out? And the biggest thing on the list was, you know, our rate of progress on AI was not fast enough.

So I’d say mid-2017, I had religion that that was going to be a big part of my job: helping us figure out what the strategy was going to be. And then in 2018, if anything, the publication of the BERT paper from Google was a real crystallization of that belief. So like, everything that I had in my analysis, I was like, this is as fine an example as anything of why we have to really, really accelerate on getting more serious here.

And so very shortly after that, I restructured a whole bunch of stuff inside of Microsoft to get us more focused on AI. And then about a year later, we did that first deal with OpenAI. And yeah, we have been accelerating our investments and like trying to get more focused, more crisp, more purposeful, since then.

The OpenAI partnership

Pat Grady: You were very early to appreciate the potential of OpenAI. What did you see in them at that time, when that first partnership was struck?

Kevin Scott: Well, we had, or at least I had, this real belief that what was happening with these models, as they scaled, is that they actually became a basis for building a platform. The big shift that was happening is it wasn’t just, you know, like I ran one of these teams at Google, where you had a pool of data and a bunch of machines and an algorithm, and you were training a model. And the model was for a specific thing. In the case of the things I was doing at Google it was, like, click-through rate prediction for advertising, and, you know, a handful of other things. And, like, just outrageously effective, right.

But most of the work before this, before GPT was about those sort of narrow use cases, like you were purpose-building models for narrow things. And it was just tough to scale, like you couldn’t, you’d invest a bunch of compute, and like you couldn’t amortize the cost of the compute across anything more than just the narrow thing that you were building the model for. And you had to have a lot of expertise, that, you know, if you wanted to replicate all of this, it’s like you had to have different data and like different AI PhDs and you know, different processes every time you wanted to go build AI into an application.

And what was happening was you had these big, large language models that were useful for lots of different things. So you didn’t have to, you know, have a separate model for machine translation and sentiment analysis and like, all of the different text things that you were doing and I was like, Okay, this is extraordinary and like they were also becoming more platform-like as a function of scale. So transfer learning was working better as things scaled up. 

And yeah, this is still the general pattern. So like everything, everything that we understand that large language models can do, plus or minus, will get better when you get to the next scale point. And on top of that, they will become slightly or maybe dramatically more general in the sense that that capability set broadens. And OpenAI had that same belief. And they also had a very principled analysis of how those platform characteristics emerged over time as a function of scale, and a bunch of experimental validation that said that their forecasts were right. And so it’s like you sort of look at what the forecast says, and you’re like, this is how much money it costs to run the experiment to see if you’re going to be on forecast for the next turn of the crank.

And, yeah, it felt like a big number at the time, like it was a billion dollars. But relative to what was happening, it just wasn’t a large amount of money. And then GPT-3 was on forecast, and GPT-4 was on… So like, it just came down to finding a partner that had the same platform belief that you did, and a track record of being able to execute through the scale points. Like, I’ve done a bunch of things in the past that I had way more reservations about, just in terms of investments. Yeah, there were a bunch of people who didn’t agree with me, but I had pretty high conviction.

Soon inference will dwarf training

Bill Coughran: I guess, you know, there’s a lot of trade publications now speculating about the cost of doing training and so forth. And, you know, rumors of billions and billions of dollars being spent, and so forth. And I guess, based on my own background, I think training is going to get dwarfed by inference here soon. How do you see this shifting? Otherwise, we’re building models that nobody knows what to do with, which might not be a great investment. How do you see the computing landscape evolving? And where’s it going? You know, I think people are joking that all the money’s going to Nvidia.

Kevin Scott: Well, look, I think Nvidia is doing a good job. So the two interesting things that are happening with these models, just in terms of the efficiency of the scale-up: each hardware generation is better price-performance-wise, usually by an extent greater than Moore’s law used to deliver for general-purpose computing. So yeah, the A100 was about, you know, three, three and a half times better price performance than the V100. The H100, not quite that much, but, you know, close. On paper, the next generation looks very good as well.

And so you’ve got hardware improving for a variety of reasons. Part of it is process technology, but part of it is architecture, and a lot of it is being able to leverage narrower word sizes in the computation. So instead of needing 64-bit arithmetic, you’re doing arithmetic with much less precision right now. And, you know, there’s just an embarrassing amount of parallelism there, and we’re getting better and better at extracting that architecturally in the hardware.
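
To make the word-size point concrete, here is a minimal sketch, in Python with NumPy rather than anything Microsoft actually runs, of why narrower numeric formats matter: the same weights stored in 16-bit floats take a quarter of the memory of 64-bit floats (and accelerators also get far more arithmetic throughput at the narrower width), at the cost of a small rounding error that neural-net workloads generally tolerate.

```python
# Minimal illustration of narrower word sizes: storage footprint and rounding
# error only. Real training and inference use GPU kernels and mixed precision;
# this is just a back-of-the-envelope sketch.
import numpy as np

weights64 = np.random.randn(4096, 4096)      # "model weights" in 64-bit floats
weights16 = weights64.astype(np.float16)     # same values, 4x smaller word size

print(f"float64 footprint: {weights64.nbytes / 2**20:.0f} MiB")   # ~128 MiB
print(f"float16 footprint: {weights16.nbytes / 2**20:.0f} MiB")   # ~32 MiB

# The price of the narrower word is a small per-element rounding error,
# which neural-net training and inference usually tolerate well.
rounding_error = np.abs(weights64 - weights16.astype(np.float64)).max()
print(f"max rounding error: {rounding_error:.2e}")
```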

And yeah, there’s a bunch of innovative stuff happening with networking as well. Like, we’re well past the point, for the frontier models at least, where you can do anything interesting on a single GPU. So for years and years now, both training and inference have been multi-GPU, multi-compute-node problems. And so there’s a bunch of innovation happening on the network side as well, which allows you to strap all of the compute together, at the chassis level, the rack level, the row level, the data center level, more effectively. Which is great, because, you know, for the nerds listening, we haven’t had effective power scaling, or Dennard scaling, since 2012 or so. So, yeah, we’re getting more transistors, but they’re not getting cooler. And we just have a lot of density issues with power dissipation that we have to go deal with.

Bill Coughran: Do you see inference driving different data center architecture?

Kevin Scott: Yeah, I mean, look, we already architect our training environments and our inference environments differently. They just need different things. And, like I think, you know, all the way down to silicon and through the network hierarchy, you need different things for inference. And inference is kind of easier than training. Like training, the way that we’re doing it now, is like, we go build big environments that take, you know, a few years to build.

And, yeah, with inference, like if somebody came along with, like, a better silicon architecture, or better network architecture, like a better cooling technology, like it’s a much easier experiment to go run, you just go swap some racks out. I mean, my data center people will yell at me, like, it’s not quite that easy. But it is, it is easier than having to go do a big capital project like a training environment looks like. 

And so, you know, intuitively, you would think that that is going to result in more diversity in the inferencing environments and more competition and, like, a faster rate of improvement. And on the software side, that’s certainly what we see with the inference stack, just because it’s such a large fraction of the overall compute footprint, and it’s constrained. Because we have more demand than supply at the moment, you just have very, very powerful incentives to go optimize the software stack to squeeze more performance out of it.

Will the demand/supply balance change?

Pat Grady: Do you think we’ll be in an environment anytime soon where that demand/supply balance changes? Not necessarily at Microsoft, but it feels like we’re seeing that at the market level as well.

Kevin Scott: Yeah, I don’t know. I mean, if we continue to see the platform expand capability-wise, and it just becomes more useful, the demand increases, if anything. Now, the shape of the demand is probably going to move around, and you’re already seeing a little bit of that. Yeah, building a frontier model is a very, very resource-intensive thing. And as long as people are building frontier models and making them accessible, and like maybe they’re not accessible quite the way that people want, you know, like they’re only API accessible, and there isn’t an open source thing you can go instantiate and muck around with, but it’s way more accessible than it was, you know, six or seven years ago, where the only way to access some of this stuff was you had to go work for, like, you know, two or three tech companies.

And so anyway, I think you do have to ask yourself, and somebody else should do the asking, right? Because I have all kinds of bias, right? But like, I don’t know how many frontier models you actually need, if, yeah, they’re all roughly speaking in the same tier of capability. That’s an awful lot of money to spend for, you know, things that are roughly equivalent.

It’s sort of like, you know, if you’re starting a company right now, and you believe that you have to build your very own frontier model in order to go deliver an application to someone, that’s almost the same thing as saying, like, I gotta go build my own smartphone, hardware and operating system in order to deliver this mobile app. Like, maybe you need it, but like, probably you don’t. 

And so I mean, to Bill’s point, like, I think the thing that makes sense for the market is like you just you’re gonna want to see lots of people doing lots of inference, because that means you’ve got lots of products that have found product market fit, and those things are scaling. But like, lots of speculative dollars flowing into infrastructure R&D, like probably ends the same way that many speculative infrastructure booms have ended.

Business models for data

Bill Coughran: I’m, excuse me, on the scaling front. You know, Microsoft published a paper some time ago pointing out that the quality of training data may be at least as important as volume, and I think one of the speculations you see now in the industry is that we’re running out of sources of high-quality training data. And you’re reading at least some articles claiming that various partnerships are being structured to get access to training data that might be behind paywalls, and so forth. How do you see that evolving? Because it feels like we have more and more computation, but we may not have more and more training data.

Kevin Scott: Yeah, I mean, I think that was almost inevitable. And it is, you know, in my opinion, a good thing that quality of data matters more than quantity of data, because it gives you an economic framework to go do the partnerships that you need to go do to make sure that you are feeding your AI training algorithm a curriculum that, you know, is going to result in smarter models. And, like, honestly, not wasting a whole bunch of compute feeding it a bunch of things that are not.

And I think, from an infrastructure perspective, one of the things people have been very confused about is that a large language model is not a database; it’s not a repository of facts. You know, it’s important for it to, quote unquote, know some factual things. But it is, like, the world’s crappiest database, honestly, if you need it to be your retrieval engine. And so you just shouldn’t think about it as, like, hey, I’ve got this thing, and it has to have everything baked into it, into the model weights themselves, so that you can recall a bunch of stuff. Because, as you’ve seen, the recall is imprecise in the same way that human recall is imprecise. There are just much more efficient ways to do recall.
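
A minimal sketch of the "retrieve, then reason" pattern Scott is describing, with a toy keyword-overlap retriever standing in for a real embedding index and a hypothetical prompt-building step: keep the facts in an external store and hand the model only what it needs to reason over, rather than relying on recall baked into the weights.

```python
# Toy retrieval-augmented prompt construction. The retrieval here is keyword
# overlap purely for illustration; production systems use embedding indices.
documents = [
    "A TSH panel measures thyroid-stimulating hormone levels in the blood.",
    "Graves disease causes an overactive thyroid and is managed with hormone therapy.",
    "Dennard scaling has not held since roughly 2012.",
]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents sharing the most words with the query."""
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Ground the model's answer in retrieved text instead of its weights."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("What does a TSH panel measure?", documents))
```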

Yeah, so look, the way that we see things developing is, like, you have data that is valuable for training models, and then you have data that you need to have access to for an application, in order for the model to reason over it. And those are two different things, and I think there are probably two different business models around those things. And, like, at the end of the day, this is all about business models, right? Like, people who produce data want to be compensated for the use of that data.

And so yeah, we have all of this data sitting inside of search engines right now, like not in, you know, randomized weights, but quite explicitly, sitting in indices in, you know, Bing and Google and whatnot, just waiting to be retrieved. And, plus or minus, everybody’s okay with that, because there’s a business model there that makes sense. So you enter a query, and you’re either sending traffic or, you know, like, there’s, yeah, SEO and advertising and a whole bunch of business models that surround that.

I think we’ll figure out a business model for that referral data so that like when an agent or an AI application needs to retrieve some information from someone so that it can reason over it and give the user an answer. Like, we’ll figure out the business model for that. Like, it’ll either be subscription rev share, it’ll be licensing, it’ll be like some new flavor of advertising. 

Yeah, I was just telling someone the other day, like, if I was in my 20s right now, if you’re an entrepreneur, somebody ought to be out right now figuring out what the new ad unit is for agents, and just building the company. Because it will have the same characteristics and qualities as previous ad units. Like, you have people with information and products and services who are going to want to get to the attention of someone who might want those data and products and services, and quality is going to matter and relevance is going to matter and a bunch of other things. And I would be shocked if there isn’t an auction model for that; you know, that’s got to be the right way to value everything. And, you know, maybe those referrals and re-referrals will have some economic value.
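
Purely as a speculative illustration of the auction idea Scott floats (the provider names and bid values below are hypothetical, and real ad auctions also weight quality and relevance, as he notes), here is a tiny second-price auction for an agent's referral slot:

```python
# Hypothetical second-price auction for a single agent referral slot.
# The winner pays the runner-up's bid, the rule that keeps truthful bidding
# attractive in classic ad auctions.
bids = {
    "provider_a": 0.42,   # dollars bid for the agent to surface this provider
    "provider_b": 0.35,
    "provider_c": 0.50,
}

ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
winner, winning_bid = ranked[0]
price_paid = ranked[1][1]        # second-price rule

print(f"referral goes to {winner}, bid {winning_bid:.2f}, pays {price_paid:.2f}")
```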

I think with training it’s just a little bit different, because it’s very, very hard, when you’re building that model, to really be able to ascribe monetary value to a particular token of input at the time that you’re doing the building, just because its contribution to the model is diffuse, the same way that a word from, you know, Moby Dick has a very diffuse contribution to your own human intelligence, even though you definitely read it at some point in your career, or your life. Like, how valuable that is to forming, you know, Bill Coughran’s or Kevin Scott’s useful intelligence, who knows?

The value function

Pat Grady: Speaking of which, one of the things that we hear a lot of times is that the value function is, in some ways, the bottleneck to broader reasoning capabilities. It’s easy enough to construct a value function when you’re playing a game with a known winner and loser, like Go or chess or poker or Diplomacy. But it becomes a lot harder to construct the value function when you’re going into broader domains. It seems like assigning the value of Moby Dick to Kevin Scott’s life, you know, that sort of thing. Are there practical solutions to this? Are there practical implications to this? I guess the broader question would be, where do you see the overall field of reasoning and LLMs going?

Kevin Scott: Well, look, I think people are trying to get at this. So you’ve got a bunch of benchmarks, like GPQA and MMLU. And yeah, we’re just sort of rolling through a bunch of benchmarking paradigms, trying to come up with scores of performance for these models, whether it’s reasoning capability or, yeah… And I think one of the interesting things we’ve seen over the past handful of years is that we just are very quickly saturating these benchmarks, where one emerges, and then, within a model generation, you’ll completely, or very nearly, saturate the particular benchmark, and then you’ve got to go find something else to be your guiding light.

So, let’s just sort of assume that, you know, you’ll have some interesting benchmarks that are correlated with the reasoning capabilities you want models to have. You know, then the question is just an expensive experiment to run. Like, you can run an experiment where it’s, okay, I’m going to train a model with this information in it or out, and does its performance get better or worse on these reasoning benchmarks? And, you know, I think all of us have done different versions of those experiments; they’re just extraordinarily expensive to run at the most granular scale you can imagine running them.
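
As a rough sketch of the kind of experiment he is describing, assuming a hypothetical train_and_score stand-in for the genuinely expensive train-then-evaluate step, a data ablation looks roughly like this: train once with a candidate data slice included and once with it held out, then compare scores on a fixed benchmark.

```python
# Toy data-ablation harness. train_and_score is a placeholder: in practice this
# step is a full training run plus benchmark evaluation, which is why these
# experiments are so expensive to run at fine granularity.
def train_and_score(corpus: list[str], benchmark: str) -> float:
    """Pretend to train on `corpus` and return a benchmark score in [0, 1]."""
    vocab = set(" ".join(corpus).split())
    return min(1.0, 0.5 + len(vocab) / 1000)   # stand-in scoring, not real

base_corpus = ["web text shard A", "web text shard B", "textbook-quality shard"]
candidate = "low-quality scraped shard"

with_slice = train_and_score(base_corpus + [candidate], benchmark="MMLU")
without_slice = train_and_score(base_corpus, benchmark="MMLU")

print(f"score with candidate slice:    {with_slice:.3f}")
print(f"score without candidate slice: {without_slice:.3f}")
print("keep the slice" if with_slice > without_slice else "drop the slice")
```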

Part of that paper that Bill references, Textbooks Are All You Need, is not the full story, but it’s part of a story that is, you know, sort of evaluating token contribution quality to model performance. So yeah, I think everybody’s got every incentive in the world right now to try to figure out what that is, if for no other reason than, in a world where you’re synthesizing data, you’re literally spending compute to generate synthetic tokens for training. You really want to make sure that the tokens that you’re generating are actually useful.

Copilots

Bill Coughran: Where do you think the models are at the moment? I think Microsoft has introduced a whole bunch of Copilots to try to help end users with your products, and so forth. On the other hand, I see lots of companies trying to build agents that can be kind of autonomous actors. Now that’s a wide spectrum of expected performance of what the models can do. Where do you think we are? Where do you think we’ll be in a couple of years?

Kevin Scott: Well, yeah, I think that’s a super good question. And there’s, you know, there’s even a philosophical thing there about you know, what it is we should want…

Bill Coughran: Well, there is the specter of everybody’s job getting replaced, right? 

Kevin Scott: So, yeah, you know, we chose the name Copilot for the things that we were doing relatively deliberately, because we want to, at the very least, encourage everyone who’s building these things inside of Microsoft to think about how can I help augment someone who’s doing some form of cognitive work. So we want to build, you know, assistive, not substitutive, tech. And, you know, the good news is it is also easier to think about how to go from, you know, rough frontier model capability to useful tool when you’re narrowing it down to a domain. And so I think that’s been a reasonable deployment path. And we’ve got a handful of our Copilots that have real market traction right now and are, you know, in daily use by a lot of people doing real, non-trivial cognitive work. And I think that will expand over time.

Pat Grady: Just on that, what are some examples of Copilots that have really, like, hit the bull’s eye already, versus maybe Copilots where the technology’s not quite ready, in terms of like, jobs to be done?

Kevin Scott: Yeah, I think GitHub Copilot is probably the thing we’ve talked most about, and there’s the most public conversation around it. Yeah, it’s been a hit. You know, it is genuinely useful. We’ve got some other Copilots like that, that are super useful. But, you know, I think the thing that Bill was getting at is, the more general the Copilot is, the harder it is to have it actually take very high-precision action on your behalf autonomously.

Yeah, particularly if, you know, it’s doing something where it’s representing you, where there are stakes, and there are consequences and accountability back to you if this agent makes a mistake. And we’re trying to be very deliberate there, because I think one of the things that you don’t want to do is introduce a thing that’s going to make a whole bunch of the sorts of errors where the user’s first reaction is, this doesn’t work, and I’m not going to try it again for a good long while. So we’d rather have it be very good before we introduce it. Yeah. Which, again, means you’re sort of optimizing for use cases, not for, like, you know, super, super broad things.

I mean, we did a partnership recently with Devin, which I think is another one of these very interesting, use-case-specific things, where it’s, you know, a frontier model plus a whole bunch of other stuff that is optimized for giving humans high-quality recommendations for actions that they can take. And then when you click accept on the action, you have reasonably high confidence that, you know, it’s going to work and you haven’t made another set of problems for yourself. Incidentally, I’m guessing you all see this in your portfolio companies; there seem to be a bunch of companies out there right now that are doing exactly this, and it’s useful and working?

The 98/2 rule

Pat Grady: Well, it’s interesting, because one of the things that we hear fairly consistently from the companies that are further on in their AI journey, you know, everybody kind of starts in the same way, where they start playing with OpenAI, and then maybe they start using some of the other proprietary foundation models and incorporate some open source models, and maybe they have some of their own stuff, and there’s a vector database in there somewhere. From an architecture standpoint, it feels like people tend to go on not quite the same journey, but journeys that sort of rhyme. But then what we hear from them when they’re 12 or 18 months down the road is there’s kind of this massive 80/20 rule at play, and maybe it’s a 98/2 rule, where you can automate most of the task pretty quickly and pretty effectively. But getting it to the point where it’s actually running end to end autonomously in a way that is compelling and consistent enough that you can actually trust it, kind of that last mile, that last couple percent that makes you really trust it, seems like that’s been pretty elusive for a lot of tasks. And so one of the things that we’re really curious about is, okay, well, when do the foundation models themselves, you know, get good enough to knock out that last 2%? Or is that a domain-specific thing, and is it really the job of the software vendor who lives on top of the platform to figure out that last 2%?

Kevin Scott: Look, I think it’s gonna be both for a while. Like, the two things that I think you can trust are these. Yeah, I know you guys were probably going to ask this question at some point, but, despite what other people think, we’re not at diminishing marginal returns on scale-up. And I try to help people understand, you know, there is an exponential here. And the unfortunate thing is, you only get to sample it every couple of years, because it just takes a while to build supercomputers and then train models on top of them. And so the next sample is coming.

And like, I can’t tell you when and I can’t predict exactly how good it’s going to be. But it will almost certainly be better at the things that are brittle right now, where you’re like, oh, my God, this is a little too expensive, or it’s a little too fragile for me to use. All of that gets better. It’ll get cheaper and, you know, things will become less fragile. And then more complicated things will become possible. That is the story of each generation of these models as we’ve scaled up.

And so, you know, we even think about this inside of Microsoft. And one of the category errors that our own developers who are building these AI products can make is to get convinced that the only way to solve my problem is to take the current frontier and supplement it with a whole bunch of things. Which you do have to do, but you want to be very careful, architecturally, that when you’re doing that, it doesn’t prevent you from taking the next sample when it arrives.

So you just want to architect these applications so that when the new goodness comes, you can go plug it in, and, like, you’ll have to go optimize that as well. I think that’s just sort of the grind that we’re all on. But the thing that was killing us internally is I would have teams inside of the company who would look at a frontier model and say, oh my god, there’s no way that we can ever deploy a product on top of this, because this is fragile, and this is too expensive. And so, you know, please give me giant pools of GPUs, and let me go spin up a big team doing, you know, a very tailored version of this, and we’re gonna build a specific model. And yeah, they would go off and spend a whole bunch of money, and the thing would be a little bit better cost-wise at the same level of performance as the current frontier, and then the frontier would snap to the new point, and it would just be doomed.

And so you just, architecturally, don’t want to get trapped by that, I don’t think. I mean, that’d be my advice to everyone: just give yourself the flexibility to snap to the new frontier when it emerges. And that lets you preserve all of your skepticism. You can believe all you want that the new frontier isn’t coming, and read your favorite Twitter troll that says it’s all over and a sham, but just give yourself the option that, you know, maybe what’s been happening for six years now is going to continue.
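
One way to read that architectural advice in code: put a thin interface between the application and whichever model is current, so that snapping to the new frontier is a configuration change rather than a rewrite. The sketch below is a minimal, hypothetical illustration in Python; the model names and client are placeholders, not any particular vendor's API.

```python
from dataclasses import dataclass
from typing import Protocol

class ChatModel(Protocol):
    """The only thing the application is allowed to depend on."""
    def complete(self, prompt: str) -> str: ...

@dataclass
class HostedModel:
    name: str   # e.g. "frontier-model-v4", a placeholder, not a real model ID
    def complete(self, prompt: str) -> str:
        # In a real app this would call your provider's API.
        return f"[{self.name}] response to: {prompt}"

class SummarizerApp:
    def __init__(self, model: ChatModel):
        self.model = model          # depends only on the interface above

    def summarize(self, text: str) -> str:
        return self.model.complete(f"Summarize: {text}")

app = SummarizerApp(HostedModel("frontier-model-v4"))
print(app.summarize("quarterly sales report"))

# When the next generation ships, swap the model and re-tune prompts;
# nothing else in the application needs to change.
app.model = HostedModel("frontier-model-v5")
print(app.summarize("quarterly sales report"))
```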

Solving non-zero-sum games

Pat Grady: Hearing that we are not at diminishing returns to scale, I’m gonna count that as good news. And so let’s stay on the theme of good news. I know that you’re a short-term pessimist, long-term optimist. Can you give us some of the optimistic point of view for where we’re heading in this world of AI? Like, what are some of the things that you’re most excited to see in the world in five or 10 or 15 years, or whatever you count as a longer-term horizon?

Kevin Scott: Well, look, I think the thing that everybody ought to spend some time thinking about is where are the gnarliest zero-sum problems that we have in society? Like where are the things where like, we just are fighting with one another or you know, like we are immiserating people, because whatever it is that people need, there doesn’t appear to be enough of it. 

And I think for a good number of those things like what you have to have to really turn them into non-zero-sum games to create abundance and to relax some of these constraints is you have to have technological breakthroughs. It’s like the only thing that’s reliably ever turned zero-sum to non-zero-sum in human history is like some tech has to come along that like, you know, lets us, lets us have more. 

You know, like, you know, whenever tech comes along and creates more, it doesn’t mean that the more gets, you know equitably and uniformly distributed. And like, I think there are real conversations to go have about that. But like, what you do want is the more and you want it to be directed at things where like, we’re just having a tough time right now. 

Like, I’ll tell this story that I’ve told a couple of other times recently. You know, I grew up in rural Central Virginia, my mom’s 74 years old, and she’s suffered from this thyroid condition called Graves disease for 26 years. And so, you know, when you have Graves disease, your thyroid is hyperactive; you’re generating too much thyroid hormone. And so they go ahead and irradiate your thyroid gland to reduce its activity level, and then you take hormone replacement therapy to upregulate your hormones for the rest of your life. And so she was having some blood pressure issues, and her doctor dorked around with her dosage of this hormone medicine, and then she just had some serious health issues as a consequence of that, that landed her in the ER in rural Central Virginia, like, six times in a pretty short number of weeks.

And yeah, the interesting thing there was, the first time she went to the ER, she was presenting all of these cardiac symptoms, and it was pretty clear they hadn’t even read her chart; it hadn’t registered on them that she had Graves disease, and the thing that they needed to do was go order a TSH panel to see what her thyroid hormone levels were. If they had done that right away, they would have said, okay, we gotta go, you know, adjust your medicine.

Yeah, and I am not even ascribing ill intent. You know, this is a healthcare system that is egregiously overburdened. This is not a place where you’ve got this influx of talent coming into this part of the country, and it’s a lovely part of the country, like, I love it. Yeah, so I’m not criticizing anything. It’s just, you know, they have an aging population, and they don’t have enough young people coming in to be things like doctors, to help this healthcare system keep up with all of the challenges that they have.

If some of those doctors had had access to GPT-4, and it was an approved product use, all they would have needed to do was put in the symptoms that she was presenting and her medical record. And it would have said, hey, she needs a TSH test. And if you put the TSH test result in, the recommendation it would add was: look at the dosage of the hormone replacement therapy that she’s on.

And this isn’t theoretical; I did this. It could have helped alleviate a massive amount of her suffering. And I think the only reason that she got out of this tough situation that she was in was I had to intervene. Like, I sent her to a specialist that was, you know, 400 miles away, that she couldn’t have gotten into without, you know, special… And like, it’s ridiculous. There are so many 74-year-old Southern ladies, or old Midwestern ladies, or folks who are going through similar sorts of things, who do not have someone who’s gonna go in and intervene on their behalf, who are suffering unnecessarily because we’re not even deploying the technology that we’ve got right now. And it’s just going to get better.

And so that’s the thing that I’m excited about. Let’s go. Let’s go give kids a leg up in education, and let’s go fix some of these crazy problems we’ve got in the healthcare system, which, absent technological intervention, is just going to get more strained over time. You know, let’s equip our scientists with better tools so that they can, you know, find better carbon capture catalysts, so that we can design safer modes of transportation, so we can more quickly get to, you know, a post-carbon economy. There are just so many things we can go do with this stuff. I’m super, super optimistic about it.

And so, you know, it just kills me. What we don’t want to do is get distracted by things that are just, effectively, noise in the ecosystem right now. And, you know, we’re getting so sideways sometimes with either “this model said something that hurt my feelings” or, you know… I don’t want to dismiss that; people’s feelings matter. And, you know, I’m not trying to be a jerk here. But I do want to make sure that as we’re thinking about how we develop and deploy the technology, we are always remembering what the cost of not deploying the good is. Because that is a high, high cost.

Pat Grady: Very well put.

Bill Coughran: Yeah, very. Probably a good note to end on, I think.

Lightning round

Pat Grady: We have a set of questions that we like to ask people, and I’m gonna ask just one quick one. Who do you admire most in the world of AI?

Kevin Scott: You know, I was thinking about this. So I think it’s Ray Solomonoff, who was one of the folks who was at that Dartmouth workshop in the 50s, where, you know, Marvin Minsky and [Herbert] Simon and a whole bunch of folks convened that summer, all interested in machine intelligence. And they coined the term artificial intelligence at that workshop.

And the reason that Solomonoff is so interesting, and not many people outside of computer science, I think, know who he is, is that he was the one from the very beginning who was pushing on this whole notion that probabilistic methods were going to be very important for the development of AI.

And, yeah, when I was in grad school in the 90s, the prevailing academic theories about how we were going to get to AI were all about, like, okay, well, you know, there’s some magic, minimalist calculus about human intelligence, and we’ve just got to figure it out. And it’s got to be, you know, rule-based systems and ontologies and symbolic reasoning and, like, a bunch of stuff where, like we do in physics, we were gonna have to, you know, divine the inherent simplicity in the system, figuring out what the rules are. And as soon as we understood the rules, we’d be able to make software emulate human intelligence.

And Solomonoff was like, none of that. Intelligence is an extraordinarily complicated phenomenon, and the only way that we’re ever going to really get there is modeling it with probabilistic methods. And he was right. And he was judged wrong for a very long time. So I really admire his contrarianism, all the way back in the 1950s. And he stuck with his beliefs his entire career. And I don’t know whether Ray actually lived to see how right he actually was.

Pat Grady: That’s a great answer, and a great story. Thank you, Kevin. 

Kevin Scott: You’re very welcome.

Bill Coughran: Thank you.

Mentioned in this episode

  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, the 2018 Google paper that convinced Kevin that Microsoft wasn’t moving fast enough on AI. 
  • Dennard scaling: The scaling law that describes the proportional relationship between transistor size and power use; has not held since 2012 and is often confused with Moore’s Law.
  • Textbooks Are All You Need: Microsoft paper that introduces a new large language model for code, phi-1, that achieves smaller size by using higher quality “textbook” data.
  • GPQA and MMLU: Benchmarks for reasoning
  • Copilot: Microsoft product line of GPT consumer assistants from general productivity to design, vacation planning, cooking and fitness.
  • Devin: Autonomous AI code agent from Cognition Labs that Microsoft recently announced a partnership with.
  • Ray Solomonoff: Participant in the 1956 Dartmouth Summer Research Project on Artificial Intelligence that named the field; Kevin admires his prescience about the importance of probabilistic methods decades before anyone else.