
Meta’s Joe Spisak on Llama 3.1 405B and the Democratization of Frontier Models

Joe Spisak leads the team behind Llama. We spoke with Joe just two days after the release of the new 3.1 405B model to ask what’s new, what it enables, and how Meta sees the role of open source in the AI ecosystem. 

Summary

Head of Product Management for Generative AI at Meta, Joe Spisak discusses the latest advancements in open-source AI models, Meta’s strategy for pushing the boundaries of scale and performance, and his views on the future of AI development. Spisak’s insights reveal a vision for democratizing access to frontier models while emphasizing the importance of execution, scale and continuous innovation in the rapidly evolving AI landscape.

  • Open-source models are rapidly closing the gap with closed-source alternatives. Spisak predicts that the quality of open and closed-source models will converge, particularly in the 7-70 billion parameter range. This trend suggests that AI founders should consider leveraging open-source models as a foundation for their products, as they offer flexibility, customization options, and potentially lower costs.
  • The value is shifting from model development to application and customization. As base models become commoditized, Spisak advises founders to focus on building unique applications, fine-tuning models for specific use cases, and developing efficient inference strategies. The real differentiator will be how well companies can tailor models to their specific domains and user needs.
  • Scale and execution are critical factors in pushing AI capabilities forward. Meta’s approach with Llama 3.1 405B focused on scaling up training data (15 trillion tokens), compute resources (16,000 GPUs), and refining post-training techniques. AI founders should consider how they can leverage scale within their constraints and focus on efficient execution to maximize the performance of their models.
  • Synthetic data and efficient fine-tuning techniques are becoming increasingly important. Spisak highlights the role of synthetic data in improving model performance, especially for smaller models. AI founders should explore techniques like distillation, synthetic data generation, and efficient fine-tuning methods to enhance their models’ capabilities without requiring massive pre-training runs.
  • On-device AI and privacy-preserving techniques present new opportunities. Spisak points out the potential for smaller, efficient models to enable on-device AI applications, such as local summarization or privacy-preserving RAG architectures. Founders should consider how they can leverage these approaches to create unique value propositions, especially in areas where data privacy is a concern.

Transcript


Joe Spisak: If I was a founder right now, I would absolutely adopt open source. It forces me, though, to look at the engineering complexion of my org and think: I’m going to need people doing LLMOps and things like data fine-tuning and how to build RAG. And APIs. And there’s plenty of APIs that allow you to do this, but ultimately you want control. Like, your moat is your data, your moat is your interaction with users.

Stephanie Zhan: Hi, everyone. Welcome to Training Data. Today we’re excited to welcome Joe Spisak, Director of PM for Generative AI at Meta, where he leads Llama and third-party ecosystem efforts. Joe spent the last decade in AI, leading product at PyTorch and working on initiatives that span protein folding and AI for math, many of which have spun out of Meta into their own startups. We’re speaking to Joe just two days after the Llama 3.1 405B launch, and we’re excited to get his view on questions like: where is the open source ecosystem headed? Will models commoditize even at the frontier? Is model development becoming more like software development? And what’s next in agents and reasoning, small models, data and more?

The Llama 3.1 405B launch

Stephanie Zhan: Joe, thank you so much for being here today. We’re so excited to have you just two days after the Llama 3.1 405B launch. It’s an incredible gift to the ecosystem. We’d love to learn a little bit more about what specific capabilities you think the 405B is particularly strong at, especially in comparison to the other state-of-the-art models.

Joe Spisak: Oh, thanks so much for having me. This is so much fun. I haven’t done a podcast like this since, I think, pre-COVID.

Stephanie Zhan: [laughs]

Joe Spisak: So it’s, like, fun to be in the same room and just, like, chatting about this cool stuff. Yeah, I mean, we’re, like, beyond excited at Meta. This was something that I think a lot of us have been working on for such a long time—months and months and months. And, you know, we kind of put out that nice little, like, appetizer, I’ll call it, in April.

Stephanie Zhan: Yeah.

Joe Spisak: Like, Llama 3. And, like, I was actually, like, “Are people really gonna, like, be that excited about these models?” And, like, the response was through the roof. Like, oh my God! Like, everyone’s excited, but they really don’t know what’s really coming. And so you kind of hold that—kind of had to hold that back for a while and kind of keep it to ourselves and, like—and then kind of build up for this launch.

And the 405B is a monster. It’s a great model. And I think the biggest thing we’ve learned about the 405B is it’s just a great—it’s like a massive teacher for other models. And we kind of had that plan all along because when you have a big model, you can use it for, like, improving small models or just distillation. And that’s how the 8 and 70Bs became the great models that they are.
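To make the teacher-student idea concrete, here is a minimal sketch of logit distillation in PyTorch. The toy linear layers, dimensions and temperature are illustrative assumptions, not Meta’s actual recipe; the point is just that the student is trained to match the teacher’s softened output distribution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # Scaling by T^2 keeps gradient magnitudes roughly constant across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2

# Toy stand-ins for a large frozen teacher and a small trainable student.
vocab, teacher_dim, student_dim = 1000, 512, 64
teacher_head = nn.Linear(teacher_dim, vocab)   # pretend: the 405B-class teacher
student_head = nn.Linear(student_dim, vocab)   # pretend: the small student

with torch.no_grad():                          # teacher scores the batch, no gradients
    t_logits = teacher_head(torch.randn(8, teacher_dim))
s_logits = student_head(torch.randn(8, student_dim))

loss = distillation_loss(s_logits, t_logits)
loss.backward()                                # only the student is updated
```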

And I mean, in terms of capabilities, we listen to the community. We listen obviously to our own product teams, right? Because we’ve got to build products for Meta. And long context was one of the biggest things people wanted. And we have much longer context internally, even, than what we released. But we saw just the use cases start to build up. Multilingual. I mean, we’re a global company, so we released more languages, many, many more to come, because obviously Meta has billions of people on the platform in hundreds of countries.

And so I think that was like—to me, those are, like, table-stakes things, but they’re done really well in these models. Like, I think we spent a lot of time in post-training on our different languages, improving them, and on safety. They’re just really, really high quality. So we don’t just pre-train on, like, a ton of data and say, “Look at us, we’re multilingual!” You know, we actually did a lot of work in our SFT phase, in supervised fine-tuning, and a lot of safety work.

I think one of the coolest things that I’m excited about—well, there’s a couple things I’m excited about, but one is tool use. Like, I think, oh my God, zero-shot tool use. This is going to be crazy for the community. I mean, we show a few examples. Like, we can call Wolfram, or we can call Brave search or Google search, and it works really great. But zero-shot tool use is going to be a game changer. The ability to kind of call the code interpreter and actually run code, or kind of build your own kind of plugin for things like RAG and other things, and have that really be state of the art, I think it’s going to be a really big game changer.
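As a rough illustration of what zero-shot tool use looks like from the application side, here is a hedged sketch: the tool is described only in the system prompt, the model replies with a structured call, and the host executes it and feeds the result back. The `chat` callable, the prompt wording and the JSON convention are hypothetical stand-ins, not Llama 3.1’s actual special-token tool format.

```python
import json
import re

# Hypothetical tool registry; brave_search here is a stub, not a real client.
TOOLS = {"brave_search": lambda query: f"(search results for {query!r})"}

SYSTEM = (
    "You can call tools. To call one, reply with exactly one line of JSON:\n"
    '{"tool": "<name>", "arguments": {...}}\n'
    "Available: brave_search(query: str). Otherwise answer directly."
)

def run_turn(chat, user_msg):
    """`chat` is any (system, user) -> text completion function."""
    reply = chat(system=SYSTEM, user=user_msg)
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match:  # the model chose to call a tool it never saw a worked example of
        call = json.loads(match.group())
        result = TOOLS[call["tool"]](**call["arguments"])
        # Feed the tool output back so the model can compose a final answer.
        return chat(system=SYSTEM, user=f"Tool result: {result}\nNow answer: {user_msg}")
    return reply
```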

And I think just the fact that we released the 405 itself, and that we changed our license so you can actually use our outputs, like, that was a big deal. That was a big discussion. We had many meetings with Mark on that, and ultimately we landed in a place where—this was a pain point for the community for so long, these closed models. Like, I can’t use the outputs, or maybe I can use them, but maybe I’m using them slightly unscrupulously or whatever. We’re actually encouraging people to do it.

The open source license

Stephanie Zhan: I’m sure that was a tough decision to make. Walk us through the things that you had to consider in actually making that leap to open up the licensing in that way, making it so permissive.

Joe Spisak: Yeah, licensing is a huge topic in itself, obviously. We could probably spend the whole podcast talking about it. I don’t want to, but we could. I think we wanted, number one, to just unlock new things.

Stephanie Zhan: Yeah.

Joe Spisak: Like, I think we wanted to have the 405 and our Llama 3.1 models differentiate and give people new capabilities. Like, we just looked at what people were really excited about in the community, not only in enterprise and products, but also in the research community, because we obviously have a research team, and we work with academia and we talked to folks.

I mean, Percy Liang at Stanford texts me all the time saying, you know, “When are you going to release it? When are you going to release it? Can I use it? Can I use it?” Percy, like, you know, stay patient. But I think we heard them, and we knew kind of what they wanted. And I think ultimately we wanted Llama everywhere. We wanted just adoption, you know, maximal adoption, really the world using it and building on it. And I think Mark even used words like, you know, “the new standard” or “standardized” in the letter he put out. So I think, like, to do that, you kind of have to enable stuff like that, or you kind of have to unblock all these different use cases and really look at what the community wants to do and make sure that you don’t have these kind of artificial barriers. And that’s what the discussion really was.

And so actually even beyond that, we started working with partners like Nvidia and AWS, and they started building distillation recipes, and even synthetic data generation services, which is pretty cool. I mean, you can start to use those and actually create specialized models from it. And the data that—I mean, we know how good the data is because we used it in our smaller models. It’s really good and it improved our model significantly.

What’s in it for Meta?

Sonya Huang: I’m going to pull on the open source thread a little bit more.

Joe Spisak: Sure.

Sonya Huang: And I’ve read Zuck’s manifesto. It was great. But I’m trying to wrap my head around, like, what’s in it for Meta? This is a massive investment to open source. In some ways, you’re leaving a lot of money on the table because you now have a state of the art model that you’re offering to everybody for free. And so I guess, is this an offensive move? Is this a defensive move? What’s in it for Meta?

Joe Spisak: I mean, we’ve—well, first of all, our business model doesn’t depend on this model to make us money directly.

Sonya Huang: Yeah.

Joe Spisak: So we’re not selling a cloud service. We’ve never been a cloud company. We’ve always worked, I would say, with a partner ecosystem, all the way back to the five years I was helping to lead PyTorch and the ecosystem and the community we built around that. Like, we never built a service. We probably could have in some way, but it would have been weird. We saw basically—going back to PyTorch, we kind of saw it as this kind of lingua franca, a kind of bridge to this, like, area of high entropy. This is a kind of weird way to say it, but there’s all this innovation happening. How do we kind of build a bridge to it and actually be able to harness all that innovation? And the way to do that is to be open, and it’s to kind of get the world building on your stuff.

And I think that ethos has kind of carried over into Llama. And, you know, if you look at PyTorch, like, that was a huge way for us to kind of pull in—you know, at the time when we really started working on PyTorch in earnest, computer vision and, like, CNNs and all that, right? If you remember that. Old, old times now, but we actually would see these architectures come, like, constantly. People would write code and they’d publish it in PyTorch, and we’d take it internally, we’d evaluate it. People would open source models and put them out on model zoos, and we’d evaluate them, and we’d see just how quickly the community was improving things. And we’d actually leverage that, especially for, like, integrity applications, where we released Hateful Memes and some of these other datasets. We just saw the improvements week over week, month over month. And it was built on something that we were using internally, so it was very easy for us to just take it inside.

So I think Llama is definitely similar in that regard where when academia and when companies start to red team these models or, you know, try and jailbreak them, we want people to do that to our models and so we can improve. And I think that’s a big reason.

Sonya Huang: [laughs]

Joe Spisak: And it’s like, be careful what you wish for, right? Of course. But, like, it’s the same with Linux, right? Linux is open source and the kernel is open source, and, you know, it’s much more secure when things are transparent and bug fixes can be pushed faster. And so that helps us a lot. I think there’s also the angle of, you know, we don’t want this to turn into kind of a completely closed environment. Just like today, if you look at Linux and Windows, in my opinion there’s room for both, right? There’s room for closed, room for open, and people choose depending on what they need and their applications. I think there’s going to be a world of open models, and I think there’s going to be a world of closed models. And I think that’s totally fine.

And I think we want to—because of our global footprint and everything, we want to also democratize this technology and make sure that it’s available to everyone. And of course, if we can help, because we’re a global company and because this technology is global, if that can help us also localize the technology as well, because localization isn’t just like doing NMT and, like, then post training on that data, right? Like, machine translation is fantastic, but when you actually support a language, you actually have to tie a locale to it.

And so, you know, take French, for example. Like, how many different versions of French, or locales of French, are there? There’s, of course, Canadian, which—I love Canada. I used to live there. There’s French Canadian, there’s France, there’s Cameroon. There’s, like, all these different versions. And so, like, having the world build on your technology and help to, like, localize it is a really big benefit. So for us, it’s really a bridge to the community and a bridge to really make the technology available to everyone. And again, it’s not something that we’re building a business around. We fundamentally believe that doing this will actually help build better Meta products. So that’s our ultimate goal, obviously.

Sonya: I love that.

Joe Spisak: Yeah.

Why not open source?

Sonya Huang: What was the primary argument against open sourcing? Was there one?

Joe Spisak: I mean, there were definitely, like, competitive concerns we talked through. You know, do you want to give your technology away—you know, put it out there and all that? And I think we’re, like, less concerned about that because we’re moving really fast.

Sonya Huang: Yeah.

Joe Spisak: Like, if you look back—I mean, I’ve been at Meta, like, close to, what, six or seven years now. And, like, in the last year or so, we’ve done—you know, we had a Connect launch. We released Purple Llama last December. We released Llama 3.1. Before that, we released Llama 2 in July. Llama 1 was, like, in February. So just think about the pace.

Sonya Huang: The velocity’s incredible.

Joe Spisak: The pace of innovation that’s, like, coming out of our team and our company is just crazy right now. So I’m not too worried about it. I don’t think we’re that worried about it.

Will frontier models commoditize?

Stephanie Zhan: I’d love to kind of move into your personal views on the broader ecosystem. I think a lot of the questions that people have center around what happens to the value of all these models, especially as Meta open sources more of them at the state of the art level. With Llama 3.1, with OpenAI launching GPT-4o mini, what is your view on—do models commoditize even at the state of the art frontier?

Joe Spisak: Well, this is a great question. I mean, I think if you look at just even the last two weeks, I mean, 4o mini is a really, really good model. I think it’s something like fifteen cents per million tokens in, sixty cents out. So it’s incredibly cheap to run, but it’s also an excellent model. They’ve done an incredible job in distilling and getting to something that’s, like, really, really performant, yet really, really cheap. So I think Sam is definitely pushing on that.

And then if you look at what we’ve done in the last week in pushing out, I would say pretty compelling state of the art models across the spectrum, I do think, like, it’s rapidly getting to a place where the model is going to be kind of a commodity. I mean, I think there’s this frontier of data where we can certainly gather data from the internet, we can license data, but at some point there is kind of like some frontier of limitations, I think, that we’re all going to have.

And this goes back to our conversation this week on The Bitter Lesson and data and scale and compute. Is that enough? It’s probably not quite enough, but what we’ve seen is that if you have enough compute and data, you can kind of get a first-order approximation of the state of the art without much else. So I do think models are commoditizing. I think the value is elsewhere. And I look at Meta and I look at our products and I look at what we’re building. Like, that’s honestly where the value is for us. It’s Meta AI, it’s our agent. It’s all the technology that we’re going to put into Instagram and WhatsApp and into all of our end products, where we actually are going to monetize, where we’re actually going to add real value. The model itself, I think definitely we’ll keep innovating—new modalities, new languages, new capabilities. That’s what research is, right? It’s pushing the frontier in emerging capabilities, and then we can leverage those in products. But the models are definitely pushing in that direction.

What about startups?

Stephanie Zhan: Yeah. If that’s the case, and all these existing companies that have massive distribution and wonderful applications that are already out in the wild can just adopt these state-of-the-art models, what advice would you give to the whole wave of new startups that are trying to make it out there, either building their own models or using other state-of-the-art models and then trying to build applications on top?

Joe Spisak: Yeah, I mean, there’s definitely, like, some model companies, or companies that are building—you know, they’re pre-training foundation models. And it’s expensive. I can’t say how much Llama 3 cost, but it was very expensive, and, you know, Llama 4 is going to be even more expensive. And so, given kind of the state of play, to me it doesn’t make that much sense, if I was a startup, to try and go in and do a pre-training run. Like, I think the Llama models are absolutely incredible as foundations to build on.

And so I do think, like, there is—you know, if I was a founder right now, I would absolutely adopt open source. It forces me, though, to look at the engineering complexion of my org and think: I’m going to need people doing LLMOps and things like data fine-tuning and how to build RAG. And APIs. And there’s plenty of APIs that allow you to do this, but ultimately you want control. Like, your moat is your data, your moat is your interaction with users.

Joe Spisak: And you may want to deploy these things onto a device at some point and have a mixed interaction or something. You might want to have simpler queries running on your device and have very low latency interactions with your users. You might want to split and have a more cloud-based approach for more complex queries, more complex interactions. And I think, like, the open source approach gives you that flexibility. It gives you the ability to modify the models directly. You own the weights, you can run the weights, you can distill them yourself. There’s going to be distillation services that allow you to take your weights, distill them down to something smaller. That’s pretty awesome. We’re just now seeing the beginnings of that.

So I think, in my mind, control matters a lot. And ownership of the weights. There are a lot of API services where you’ll do fine-tuning on your model. So you’re bringing your own data, you’re fine-tuning, and they use something called low-rank adaptation, or LoRA. And unfortunately, you don’t actually have access to those LoRA weights at the end of it. You’re kind of, like, forced to use their inference. So you’re like, “Hmm. Let’s see, I’m kind of, like, held hostage here. I’ve given my data, I don’t have access to the actual IP that was generated from that data, and now I’m forced to use their inference service. Like, that’s not a good deal.” So I think open source brings an inherent freedom that that approach doesn’t.
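For context on what those withheld LoRA weights actually are, here is a minimal sketch of low-rank adaptation in PyTorch, with illustrative rank and dimensions: the frozen base weight W gets a trainable update (alpha/r)·BA, and the small A and B matrices are the entire fine-tuning artifact a hosted service may decline to hand back.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: y = Wx + (alpha/r)·B(Ax)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # the original weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no-op at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 65536 trainable params, versus ~16.8M in the full layer
```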

The Mistral team

Sonya Huang: What do you think of Mistral Large, which was announced, I think, maybe a day after Llama 3.1? What do you think of them? And I guess, more broadly, for everybody at the frontier: is everyone kind of pursuing the same recipes, the same techniques, the same kind of compute, scale and data, et cetera? And so, like, is everyone kind of going to be roughly similar at the frontier? Or do you think you guys are doing something very different?

Joe Spisak: So first of all, on Mistral, I mean, amazing team. It was one of my old teams in FAIR. They were working on AI for mathematics. So Guillaume and Tim and the team—and Marianne—they’re incredible.

Stephanie Zhan: Joe was just talking about fun banter with Guillaume last night. [laughs]

Joe Spisak: So I mean, this was like one of the scrappiest teams that I’ve ever worked with. I mean, the team I don’t think ever slept. So it was like basically by day they’re doing …

Stephanie Zhan: Probably even less now. [laughs]

Joe Spisak: Probably even less now. I mean, they would push the state of the art in, like, AI and math, and they were improving it, you know, during the day. And we published some work on that, you know, I think a couple years ago now, geez! But by night, you know, they were basically scrappily, like, grabbing compute to train Llama 1.

Joe Spisak: And so they were—you know, we were building large language models several years ago in FAIR. And, you know, that team basically just—like, they were just really ambitious, and they were kind of working by night. And that’s really where Llama 1 came from.

Joe Spisak: So the team is great. I mean, I think they’re doing really good work. I think they’re definitely challenged in that they’re trying to, like, open source models but also make money. And models like 4o mini are not helping them. And this is, I think, why they changed their license, for example, to a research-only license, which kind of makes sense, because they were open sourcing models and their own ecosystem is competing with them in a lot of ways. They’ll release a model, they’ll host it—“use this model”—but then they have Together and Fireworks and Lepton and all these companies that provide a sometimes lower cost-per-million-token offering.

So it’s a really tough business right now. In terms of Large 2, I think it’s a really good model. I mean, just on paper. I haven’t evaluated it. We haven’t looked at it internally yet. I think if you look at Artificial Analysis, they had it coming in, I think, under the 70B model in terms of quality. But, you know, that’s a blend—you know, they blend a bunch of benchmarks to make that distinction. But on paper, it looks really good. We’re gonna evaluate it. I think, you know, for me anyway, the more the merrier. Like, the more models that are out there, the more companies doing this, the better. We’re not gonna be the only one, and I think that’s good that we’re not the only one. And I think, like, more generally, in the gen AI space, you wake up every single day and you kind of expect something like this, right? You expect a model to be released or something groundbreaking to happen, and that’s kind of the fun of being in it.

Are all frontier strategies comparable?

Sonya Huang: Yeah, totally. Do you think everyone at the frontier is comparable, though? Like, are you all pursuing comparable strategies?

Joe Spisak: Yeah, this is actually a good question because, you know, if you read the Llama 3 paper, which was, I think, 96 pages. Lots of citations, obviously.

Sonya Huang: Lots of sharing.

Joe Spisak: Lots of sharing, lots of, like, you know, contributors and core contributors and that. So, like, it was a detailed paper. And Lawrence and Angela on the team spearheaded writing that. And I think that was, like, one of the hardest things. Like, developing the model was relatively easy compared to writing the paper. It was a lot of work pulling that paper together. I think if you look at Llama 3, there was a lot of, I would say, innovation that happened, but we also didn’t, I would say, take on a lot of research risk either.

Sonya Huang: Yeah.

Joe Spisak: So I would say, like, the primary thing we really did with Llama, with the 405B especially, was really pushing scale. I mean, it was still, you know—we use grouped-query attention, for example, so, you know, GQA. And that improves inference time and, you know, kind of helps with the quadratic attention computational challenge. We trained on over 15 trillion tokens. We did post-training. We used synthetic data, which improved the models, the smaller models, quite a bit. We trained on over 16,000 GPUs on our training runs, which is something we hadn’t done before. It’s really, really hard to do that, because GPUs fail and …

Sonya Huang: I saw the table.

Joe Spisak: Yeah, I mean, everyone’s like, “Oh, I’m just going to train on 100,000 GPUs.” Like, good luck, right? You better have a really, really great infra team, a really great MLSys team. You better be ready to innovate at that level, because this is non-trivial. Everyone says it’s easy or says you can do it; it’s non-trivial. So I almost look at Llama 3 as very similar to the GPT-3 paper. So if you ever talk to Tom, he was the lead author, Tom Brown, now at Anthropic. And there’s a reason why Tom was the first author on that paper: a lot of the innovation was really scale. It was really, how do I take an architecture and push it as hard as we can push it? And that involves a lot at the MLSys layer and the infra layer, and, like, how do I scale the algorithm?

And so I think that was really, like, the mentality we had with Llama 3 and Llama 3.1. And I mean, internally, obviously we have a great research team, we have FAIR, we have research in our org, and we’re looking at lots of different architectures and MoE and other things. And so, you know, who knows what Llama 4 will be? We have a lot of candidate architectures, and we’re looking at it. But it’s kind of a trade-off. It’s a trade-off between how much risk you take on for research, and potentially how much reward, or the ceiling of the potential improvements, versus just taking something that’s relatively known and, like, pushing scale and getting that to improve even more. So ultimately this becomes a trade-off.
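The grouped-query attention Spisak mentions is easy to see in code. A minimal sketch with illustrative head counts: many query heads share a smaller set of key/value heads, so the KV cache, the dominant memory cost at inference time, shrinks by the sharing factor.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """q: (batch, n_q_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim).
    Each group of n_q_heads // n_kv_heads query heads shares one K/V head."""
    group = q.shape[1] // k.shape[1]
    k = k.repeat_interleave(group, dim=1)   # broadcast shared K/V to every query head
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

batch, seq, head_dim = 1, 128, 64
q = torch.randn(batch, 32, seq, head_dim)
k = torch.randn(batch, 8, seq, head_dim)    # only 8 K/V heads are cached, not 32
v = torch.randn(batch, 8, seq, head_dim)
out = grouped_query_attention(q, k, v)      # shape: (1, 32, 128, 64)
```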

Is model development becoming more like software development?

Stephanie Zhan: I think this is such an interesting point. I actually also think it makes Llama and Meta quite unique in the strategy they’re taking. The words that I liked that you used yesterday were, “Is model development becoming more like software development?” I’m curious to hear: unlike what many of the other labs have been doing in pushing more on the research, you guys have been focused on just executing on strategies that you know work. Do you see that as representative of the continuing strategy as you extend Llama out to 4, 5, 6, 7, 8? And then also, how do you think the other research labs, and maybe some of the other startups in the ecosystem, will react? Will they switch and veer a little bit more toward the strategy that you’ve been taking?

Joe Spisak: I mean, it’s a really great question. We don’t have all the answers, for sure. But somewhere in the middle is kind of where I see things landing, where, you know, we’ll continue to push on execution, we’ll continue to push models out, because we want our products to iteratively improve as well. So we want Meta AI improving constantly. And so there’s definitely a software engineering analog here, where you can imagine something like a Llama release train: new features and new capabilities get on that train, and we have a model release.

It’s actually much easier when you start to componentize the capabilities, too. We’re doing that with safety right now. And you saw it in the release. We released Prompt Guard and a new Llama Guard, and you can iterate on those components externally, and it’s great. Obviously, the core model is much more difficult. I do think we’ll start to push on the research side as well, because the architecture is going to evolve. You’ve seen what AI21, for example, has done with Jamba, which builds on Mamba. And everyone kind of thinks Mamba is a new architecture that could have promise.

I think what’s interesting, though, is that to truly understand, like, the capabilities of an architecture, you kind of have to push the scale. And I think that’s what’s missing right now in the ecosystem. You know, if you look at academia, there are a lot of absolutely brilliant people there, but they don’t have a lot of access to compute.

Sonya: Yeah.

Joe Spisak: And that’s a problem, because they have these great ideas but no way to truly execute them at the level that’s needed to really understand: will this actually scale? Because, like, the Jamba paper and model were really interesting and the benchmarks are great, but they didn’t scale it beyond, I think, under 10 billion parameters. So you’re like, okay, what happens when, you know, we train this at 100 billion? Do you still see those improvements or not? And no one, at least outside of these labs, really knows the answer yet. So I think that’s one challenge. So to me, we’re going to get into this hybrid space where we are definitely going to push on architecture. We have a very, very smart and well-accomplished research team, but we also are going to be executing, and I think that’s when we start to get, like, a recipe.

You know, we’re going to push it to the limits and, you know, we are going to start, you know, release—we’re going to continue to release more models on it, but in parallel to that, we have to push on architecture. And I think it just makes sense, because at some point you’re going to reach, like, a kind of theoretical limit for the next breakthrough, and you need to evolve the architecture, right? So I see kind of a little bit of an in-between. And obviously we’re really good at execution. I think we’re pretty good at execution, but we’re also good in research. And we just need to marry those two so it makes sense. Because, like, research and products are very different, right?

Joe Spisak: Like, one should be pretty deterministic, the product side. And one is inherently non-deterministic, right? It’s like, is this going to work? I don’t know. It’s a really big bet. If it fails, it’s research. Like, it should have a non-zero chance of completely blowing up in our face. We just need to go in another direction. But that’s what research is.

Agentic reasoning

Sonya Huang: I’m curious about one branch where a lot of, I think, model research is happening right now, agentic reasoning. And I think you all have announced really great results in reasoning. I’m curious, maybe at a very basic level, how do you define reasoning? And then are you all seeing reasoning fall out of kind of scale during pre training or is it post training? And is there a lot of work left to do on the reasoning side?

Joe Spisak: Yeah, reasoning is a bit of a loaded area. I mean, you could argue it’s things like multi-step. And I think, unfortunately, the best examples we have are kind of, like, the sort of semi-gimmicky “Bob is driving the bus and he picks”—you know, like those kinds of things, right? If you trawl LocalLLaMA, you’ll see a billion of those, right? But those actually force the model to take multiple steps to respond to you and think through and logically kind of respond.

I think coding is actually really—like, you know, when you look at pre-training. So, to answer your question directly, reasoning improvements come in both post-training and pre-training. So what we’ve learned, and now everyone’s like, “Oh, of course this is the case,” but definitely in the last year or so everyone’s kind of learned that, you know, having a lot of code in your pre-training corpus really improves reasoning. But when you think about it, like, of course, duh. It’s step by step. It’s very logical. You know, code is just logical by nature and kind of step by step. And, like, if you incorporate a lot of that in your pre-training, your model will reason better. And then we, of course, look at examples in post-training and SFT to improve as well. So we look at the pre-trained model, and it kind of depends on how you balance things as well, because you can balance how well your model reasons against how well it responds in different languages.

Ultimately, in post-training, everything’s a little bit of a trade-off. Like, you can super-optimize things for coding if you want to. And we did that with Code Llama. It was really great. But of course, the model will suffer in other areas. And so ultimately it becomes: what kind of Pareto frontier of capabilities do we want to bring out, if it’s a general model? And I think, yeah, ultimately it’s a trade-off. So anyone can pick a benchmark or some capability and say, “I’m going to super-optimize for it,” and say, “By the way, I’m better than GPT-4.” Well, great. Anyone can do that. But is your model as generally capable as GPT-4 or Llama 3.1 or whatever? That, I think, is a different story.

What future levers will unlock reasoning?

Stephanie Zhan: What do you think are the future levers to unlock reasoning for anyone going forward?

Joe Spisak: I mean, the obvious answer is data. The more data, the more code and supervised data that you can get, I think, is the natural answer. I mean, I think we need to find applications as well, for how we, like, define it. And that would help us. Like, once you start finding those kind of killer applications, then you kind of know where to focus in terms of your data.

Stephanie Zhan: What you need to solve for.

Joe Spisak: Exactly, what you’re solving for. And this goes back to, like, evals and, like, what is your eval? Because we’re starting to saturate evals.

Sonya: Yeah. Yeah.

Joe Spisak: And so we tend to, as a community, like, define a benchmark or a metric and just, like, optimize the living hell out of it. And it’s great, but then you actually look at the model in an actual environment and you’re like, “Oh. Well, that model has a better MMLU score.” Great. But, like, how does it actually respond? Well, it doesn’t respond as well, but it has a better MMLU score. And so I think we need better evals and better benchmarks that allow us to, you know, I would say, like, find a clear line of sight to actual interactions.

And I think, like, you know, the live—what is it called, the Abacus benchmark? LiveBench, I think it’s called—it’s pretty good. I was looking at that. And of course, like, LMSYS and Chatbot Arena—these are more natural. Even though, you know, it’s still not perfect, it’s, like, moving in the right direction: more humanlike interactions versus, like, a static dataset or a static prompt set that is not that helpful.

So I think, like, once we start to find these other—like, what reasoning use cases make sense, we’re gonna start to generate more data, and you’re gonna start to improve the model there. And hopefully that, again, has line of sight to a benchmark or an eval that actually feels like it improves an end product. And a lot of this actually depends on the end product, of course. What is my application?

Will coding and math lead to unlocks?

Stephanie Zhan: Yeah. Out of curiosity, I think within large research labs, coding and math have always been two primary categories in trying to unlock reasoning. In the startup ecosystem now, we’re seeing more folks who really want to go at it from the math angle. Do you have a perspective on whether or not that has led to interesting unlocks?

Joe Spisak: I mean, the answer is yeah. I think if you look at our data, or at least at our models, coding and math have been, I would say, the primary levers. So, yeah, I mean, having more obviously is better, because math is also very logical and very, like, stepwise. So you can see the pattern here: the more data you have that follows that sort of pattern, the more your model is going to be able to reason.

And you can see that in how models actually respond. Like, if you ask them to respond and, like, step me through your thinking process, right? It’ll actually do that. And some models do it better than others. So anything like that. I think scientific papers, too—you know, we had some projects out of FAIR that trained on, you know, like, arXiv papers. And you can see it’s not only code and math, like pure mathematics, but also scientific papers. Scientists are very logical in how they write things, how they work stepwise, and how they create their, like, charts and figures. And, like, we’ve seen that just general scientific information helps as well.

Sonya: Interesting.

Joe Spisak: Galactica was our project. Yeah. So Robert and Ross from the Papers with Code team led that. Still, in my opinion, like, one of the coolest projects ever.

Stephanie Zhan: [laughs]

Joe Spisak: It got a lot of bad press, but wow, they were ahead of their time, in my opinion.

Small Models

Stephanie Zhan: I’d love to talk a little bit about small models.

Joe Spisak: Sure.

Sonya: Given the scale of capital and the compute that many startups have, the 8B and 70B models are an incredible gift to the ecosystem.

Joe Spisak: Totally.

Stephanie Zhan: And it’s funny that you called them ’appetizers’ at the start because I think they’re super powerful for that set, but they’re also really powerful for a number of different applications where you want smaller models. And so I’m curious to hear: what do you hope to see developers use the 8B and 70B models for, given that they are best in class for their size of model?

Joe Spisak: So it’s interesting, though: when we released Llama 3 in April, we released an 8 and a 70, the ’appetizers,’ as we call them. You know, the 8B was actually better than the Llama 2 70B, by leaps. So we were—you know, I had to look at the chart and I was like, “Is this right? Like, is that really the case?” And we’re like, “Yeah, it really is.” It was that much better.

7X more data

Sonya Huang: What’s the intuition for how that happened?

Joe Spisak: I mean, it was more data. We had, what, 7X more data. Obviously, like, we put a bunch more compute at it as well. So, you know, going back to compute and data, we’re pushing on those. So I think we saw just—it’s almost like every generation, and the generations are accelerating, you start to see the benchmarks for, like, a large model basically get pushed down into the smaller size regime. And so, you know, a 70 becomes an 8, and, you know, internally we have models that are much smaller than 8, actually. We’re starting to see really nice benchmarks on even smaller models.

So you continue to see that the models improve at smaller scale. And that, I think, is just us pushing the architecture. We’re pushing scale, and we haven’t quite saturated it yet. And I think that’s really interesting. So, you know, for me, one of the biggest reasons that I think a small architecture is useful is obviously on-device. Everyone loves to talk about on-device, and Apple’s talking about that. And Google has Gemma models and Gemini running on Android devices. So I think on-device makes sense.

I think safety is kind of interesting, because one of the things—we have our own internal versions of Llama Guard, which we use, orchestrated for our applications internally at Meta. And today they’re built on an 8B model, which is kind of expensive to run if you think about a safety model that’s kind of a secondary model. And so internally, we’ve been experimenting with much smaller models in that regard. And it creates efficiency, it lowers latency, because really, those models are just classifiers. They’re not really autoregressive, chat-like interfaces. They really just classify the input prompt—does it violate whatever category in the taxonomy—and the output: when the model generates, does that violate it? So you can actually push those even further.

Sonya: Yeah.

Joe Spisak: I think there are also really interesting cases, though, for on-device, where you almost have—when you think about privacy and you think about data, you want to have your data stay on-device. You can think about a RAG architecture on-device. So you have data, even your chat history, that’s on, say, WhatsApp or other things. You can imagine that model having access to the data, aggregating it, and then running some type of almost mini vector database, right, where you’re using RAG and doing your kind of fuzzy search or fuzzy matching with your small model, and that becomes its own system in itself. And you can basically do things like local summarization. Like, I don’t know, I get so many text messages. Like, you know, “Hey, summarize my last 15 messages, please, because I’ve been in meetings and I haven’t looked at my phone.” And that’s super useful. And then I don’t have to send data up to the cloud or anywhere else. So those are the kinds of use cases, I think, where small models are actually going to be really compelling. And then for super-complex queries and things, obviously you have a big model in the cloud that can always service those. But for many things, I think on-device, or even at the edge and on-prem, these small models can actually do pretty well.
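A minimal sketch of the on-device pattern he describes, with `embed` and `generate` as hypothetical placeholders for a small local embedding model and a small local LLM: messages are embedded into a tiny in-memory “vector database,” retrieval is cosine similarity, and the retrieved context goes to the local model for summarization, so nothing leaves the device.

```python
import numpy as np

messages = ["Dinner moved to 7pm", "Flight delayed to 9:40", "Call mom back"]

def embed(texts):
    # Stand-in: deterministic pseudo-embeddings derived from the text,
    # where a real app would run a small on-device encoder.
    return np.stack([
        np.random.default_rng(abs(hash(t)) % (2 ** 32)).normal(size=128)
        for t in texts
    ])

# The entire "mini vector database": an array of normalized embeddings.
index = embed(messages)
index /= np.linalg.norm(index, axis=1, keepdims=True)

def retrieve(query, k=2):
    q = embed([query])[0]
    scores = index @ (q / np.linalg.norm(q))   # cosine similarity
    return [messages[i] for i in np.argsort(scores)[::-1][:k]]

def summarize(query, generate):
    """`generate` is any local text-generation callable (the small on-device model)."""
    context = "\n".join(retrieve(query))
    return generate(f"Summarize these messages:\n{context}")
```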

Are we going to hit a wall?

Sonya Huang: You talked about scaling up compute and data as, you know, the two fundamental vectors to improve performance. I guess there’s been a lot of chatter about how we are going to hit a wall, or maybe we’re not going to hit a wall, on data, and maybe synthetic data is the answer, et cetera. I’m curious about your perspective on that. Like, is there an impending wall that we’re going to hit, most likely on, you know, cheap, accessible data? What do you think? How do we scale beyond that?

Joe Spisak: I mean, I think we’ve shown with this release that synthetic data does help a lot. I mean, in pre-training, you know, we trained on 15 trillion tokens, give or take. And in post-training, we generated, you know, millions of synthetic annotations, a lot of them generated by the 405B. We obviously paid for annotations as well. I do think synthetic data is a potential path forward. Like, we know now, and the proof is in the models, right? It’s great to be able to talk about it that way.

I do think data is going to be a challenge at some point for us. And this is why I think companies are licensing a lot of data these days to get access. I mean, OpenAI is licensing data. We’re certainly licensing data. I think having access to services that generate data to improve models is important. So that inherently is an advantage for a lot of companies. I mean, Google has YouTube, right? I’m sure it’s of value to them. Which kind of implies that, you know, bigger companies have an advantage, which is not anything new, right? We’ve been talking about this for a long time. In terms of a data wall, I don’t know. I mean, I think we’re not there yet. I would say, let’s schedule this for, like, a year out and see where we are next year. I’ll mark my calendar in Meta AI for exactly one year from now. But let’s talk in a year and see where we are. We haven’t hit it yet. We’re still scaling, we’re still gathering a lot of data, we’re generating data, and our models are still continuing to improve.
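As a rough sketch of the synthetic-data loop he is gesturing at, assuming only a generic text-generation callable: a strong teacher model drafts several answers per prompt, a crude filter keeps plausible ones, and the survivors become SFT pairs for a smaller model. The filtering heuristics here are deliberately simplistic placeholders; real pipelines use reward models, execution checks for code, and human review.

```python
def generate_sft_pairs(prompts, teacher, n_samples=4, min_len=40):
    """`teacher` is any prompt -> text callable (e.g. a large hosted model)."""
    pairs = []
    for prompt in prompts:
        # Sample several candidate responses from the strong model.
        candidates = [teacher(prompt) for _ in range(n_samples)]
        # Toy quality gate: drop refusals and very short answers.
        kept = [c for c in candidates if len(c) >= min_len and "I can't" not in c]
        if kept:
            # Keep the longest survivor as a crude stand-in for "best candidate."
            pairs.append({"prompt": prompt, "response": max(kept, key=len)})
    return pairs  # ready to format as supervised fine-tuning examples
```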

Lightning round

Sonya: Let’s close it out with some rapid fire questions.

Joe Spisak: Sure. Sounds great.

Stephanie Zhan: In what year do you think we’ll surpass the 50 percent threshold on SWE-bench?

Joe Spisak: [laughs] Good question. If I’ve learned anything, it’s that it’ll be faster than whatever answer I give you, because with any benchmark, as soon as we zero in on it, people are going to go and figure it out. So I don’t have an answer.

Sonya: [laughs]

Joe Spisak: It’ll be fast, I’m sure.

Sonya: You know, one of the questions we have been asking people is, in what year will an open-source model surpass the other models on the frontier? And we have to take out that question now, thanks to you all this week.

Joe Spisak: I mean, it’s true. We’re almost there. I mean, I think the 405B is incredible. It’s definitely in that class.

Sonya: Yeah, absolutely.

Joe Spisak: Which is incredible.

Sonya: Will Meta always open source Llama?

Joe Spisak: I mean, I think Mark’s pretty committed. You saw his letter. I mean, we’ve open sourced, you know, for years and years now, back to PyTorch, to FAIR, to the Llama models. This isn’t something that’s a flash in the pan for the company. The company’s been committed to open source for a long time. So, never say never, but I mean, the company and Mark are really committed.

Sonya: Amazing. Joe, thank you so much for being here today, and also for everything that you’re giving to the entire ecosystem. I think the entire AI community is very grateful for all the work that you’ve done in pushing out Llama, and for the advancements to come.

Joe Spisak: It’s a huge team. Check out the paper. Look at all the acknowledgments.

Sonya: I mean, we spent all of yesterday reading it.

Joe Spisak: We need, like, the Star Wars, like, scrolling text of all the contributors because it was an incredibly big team.

Sonya: I was thinking about that same thing. [laughs]

Joe Spisak: So my hat’s off to the team. This was a—I mean, this absolutely took a village to get Llama out there, and I’m so proud and excited to represent the team here. So thank you.

Sonya: Thank you.
