
Fireworks Founder Lin Qiao on How Fast Inference and Small Models Will Benefit Businesses

After leading the PyTorch team at Meta, Lin Qiao started Fireworks with the mission to compress the timeframe of training and inference and democratize access to GenAI to let a diversity of AI applications thrive. Lin predicts when open and closed source models will converge and reveals her goal to build simple API access to the totality of knowledge.

Summary

Fireworks Founder and CEO Lin Qiao led the PyTorch team at Meta that rebuilt the whole stack to meet the complex needs of the world’s largest B2C company. In this episode, she discusses how Fireworks is compressing the timeframe of training and inference to democratize access to GenAI beyond the hyperscalers. Qiao’s insights reveal a vision for simple API access to the totality of knowledge, emphasizing the power of small, customizable models and the convergence of open and closed-source AI capabilities.

  • Simplicity scales. Qiao emphasizes that the success of PyTorch stemmed from its focus on simplicity for researchers, which then flowed downstream to production environments. At Fireworks, they’ve embraced enormous complexity behind the scenes to provide a simple API for developers. This approach allows customers to focus on innovation and product design rather than grappling with technical intricacies.
  • The AI customer journey is shifting from training to inference. As companies move from experimentation to scaling their AI applications, they encounter challenges with latency and cost. Qiao notes that most GenAI applications are consumer-facing, requiring high responsiveness and low latency for product viability. AI founders should anticipate this shift and design their solutions to address these critical production concerns early in their development process.
  • Open-source and closed-source model quality is converging. Qiao predicts that for models between 7 and 70 billion parameters, the quality gap between open and closed-source models will shrink significantly. This trend suggests that the key differentiator will be customization—how well models can be tailored to specific use cases and workloads. AI founders should consider leveraging open-source models and investing in customization capabilities to create unique value propositions.
  • Small, specialized models are becoming increasingly powerful. Rather than relying solely on massive, general-purpose models, Qiao advocates for a “thousand flowers bloom” approach. This strategy involves using smaller, more easily tunable models that can be customized for specific problem spaces. AI founders should explore how they can leverage or create specialized models that excel in particular domains or tasks.
  • Function calling and API blending are emerging as critical capabilities. Qiao envisions a future where AI systems can seamlessly blend information from various models and APIs, both public and private. This approach could provide access to a broader range of knowledge and capabilities than any single model can offer. AI founders should consider how they can incorporate function calling and API integration into their products to enhance their offerings’ versatility and power.

Transcript


Lin Qiao: The other interesting thing is we thought replacing other frameworks as libraries with PyTorch was simple. Just swap the library. How hard can that be? But we realized—we thought it was just a six-month project. It turned out to be a five-year project for us to support Meta’s entire AI workload on top of PyTorch, because we had to rebuild the whole entire stack from scratch, from the ground up: how to load data efficiently, how to do distributed inference in PyTorch efficiently, how to scale training efficiently. And we ended up rebuilding the whole entire inference and training stack on top of PyTorch. When we left, it was sustaining more than five trillion inferences per day. So that’s a massive scale. But it took five years. And the Fireworks mission is to significantly accelerate time to market for the whole entire industry, compressing it from five years to five weeks or even five days of time to market. So that’s our mission.

Sonya Huang: Joining us today is Lin Qiao, founder and CEO of Fireworks. Lin is an AI infrastructure heavyweight who previously led PyTorch at Meta, which is the backbone of the entire global machine learning ecosystem. She’s taken her experiences at PyTorch in order to build Fireworks, an inference platform for Generative AI. We’re excited to ask Lin about the market trends behind AI inference and how she plans to support and even accelerate the market shift to compound AI systems at Fireworks.

We’re thrilled to have Lin, CEO and founder of Fireworks, with us today. Thanks for joining us, Lin.

Lin Qiao: Thanks for having me.

What is Fireworks?

Sonya Huang: We’re really excited to talk about a lot of things with you today, from PyTorch to the small model stack that you’re building to what you’re seeing in terms of enterprises building production deployments. But before we get there, can you maybe say a sentence or two on what you’re building at Fireworks?

Lin Qiao: Yeah, so we started Fireworks in 2022, and Fireworks is a SaaS platform first and foremost for GenAI inference and high-quality tuning. Especially using our small model stack, we can get to very low latency for real-time applications, very low cost for sustainable business growth, and customization, automated customization for tailored high quality for enterprises. So that is Fireworks.

Leading PyTorch

Sonya Huang: Wonderful. I wanted to maybe start with the PyTorch story. PyTorch is kind of the foundation upon which the entire AI industry runs today. You and Dima and some of your other co-founders were integral and, you know, leaders of that project at Meta. Maybe can we start with—pretend I’m a five year old, can you explain PyTorch to me like I’m five? Like, what is PyTorch?

Lin Qiao: So think about PyTorch as a programming language for digital brains, okay? And it’s designed for researchers to very easily create those digital brains and experiment with it. But the challenge of PyTorch is it’s very fast for people to create various different deep learning models, the digital brains, but the brains don’t think fast enough. So that’s a challenge I took on to address while I was at PyTorch.

Pat Grady: And you’ve mentioned before that most of the companies that are trying to build something similar to what you’re building in Fireworks have chosen to be framework agnostic, whereas you very much made a big bet on PyTorch. Can you say why make the big bet on PyTorch and what benefits that brings to your customers?

Lin Qiao: That is really based on what I saw when I operated PyTorch at Meta, and also across the industry. And I clearly see a funnel effect: because PyTorch started as a tool for researchers, it starts to dominate the top of the funnel for model creation, and then the next stage of the funnel is people doing applied production work. They take those research models, test them out in a production setting, try to validate the hypothesis, and then feed into production. So that’s a clear funnel effect that’s happening.

And as PyTorch is designed for researchers, it takes over the top of the funnel, and it’s really hard for people to rewrite into other frameworks for production. And naturally it just flows down towards the bottom of the funnel. And that’s how PyTorch becomes dominant. And I’m starting to see more and more models, especially more nascent models, are all built in PyTorch and run in PyTorch in production, including the GenAI models. That’s why we only bet on PyTorch and we don’t want to distract ourselves to support other frameworks.

What do researchers like about PyTorch?

Pat Grady: So researchers like it and it flows downstream from there. What do researchers like so much about PyTorch?

Lin Qiao: Simplicity. Simplicity scales. And that’s kind of a lesson learned through the journey of PyTorch at Meta and also building on the community. It has been a relentless journey to focus on simplicity. We have been constantly seeking how to make the user experience simpler and simpler, and hiding more and more complexity in the backend. For example, when I started this journey at Meta, there were three different frameworks: Caffe2 for mobile, ONNX for server-side production, PyTorch for researchers. It’s too complicated. And the mission was to reduce three frameworks into one to simplify, but it was actually a mission impossible.

After I consolidated all three teams, there was no consensus on how to simplify and build this one stack. And we took a very idealistic approach: take the PyTorch frontend and the Caffe2 backend, and we said we’re going to zip them together. It seems simple, but it’s very hard to do because these two frameworks were never designed to work together, and the integration complexity is even higher than building a framework from scratch. So, too complex.

And then we said, forget about it, we’re going to bet only on PyTorch, keep its beautiful, simple frontend and rebuild the backend. So we built TorchScript—that’s PyTorch 1.0. So that’s really the key: a focus on simplicity wins over time.

The other interesting thing is we thought replacing other frameworks as libraries with PyTorch was simple. Just swap the library. How hard can that be? But we realized—we thought it was just a six-month project. It turned out to be a five-year project for us to support Meta’s entire AI workload on top of PyTorch, because we had to rebuild the whole entire stack from scratch, from the ground up: how to load data efficiently, how to do distributed inference in PyTorch efficiently, how to scale training efficiently. And we ended up rebuilding the whole entire inference and training stack on top of PyTorch. When we left, it was sustaining more than five trillion inferences per day. So that’s a massive scale. But it took five years. And the Fireworks mission is to significantly accelerate time to market for the whole entire industry, compressing it from five years to five weeks or even five days of time to market. So that’s our mission.

How Fireworks compares to open source

Sonya Huang: Wonderful. Maybe when you look at the open source landscape, there are a lot of people trying to do this using vLLM or TensorRT-LLM. How do you think about how Fireworks compares to what’s in the open source?

Lin Qiao: I really like both projects, and because my heart is in open source based on my PyTorch experience, I would say both are great projects for the community. I think our biggest differentiation is, first of all, Fireworks off the shelf is faster than both of those offerings. And second is we’re building a system, we’re not just a library. And our system can autotune towards our developers’ or enterprises’ workloads to be much, much faster and much, much higher quality. And that cannot be achieved by just a library. And we are building in all this complexity. Going back again to our journey with PyTorch, we are providing a very simple API, but hiding a lot of automation—the complexity of automation, the complexity of auto-tuning—behind the scenes.

For example, when we deliver our inference with high performance—high performance here means low latency and low cost—we’ve handwritten CUDA kernels. We’ve implemented distributed inference across nodes, and disaggregated inference across GPUs, where we chop models into pieces and scale them differently. We also implemented semantic caching, where given the content, we don’t have to recompute. We capture application workload patterns specifically, and then build them into our inference stack.

We have many, many other optimizations, with specific designs for different use cases rather than just general-purpose or horizontal approaches. So that is all encapsulated. We also have complex optimizations for quantization. You can think, oh, quantization, it’s just one technology, how hard can that be? But you can quantize so many different things. You can quantize the KV cache, you can quantize weights, you can quantize communication across GPUs and across nodes, and they yield different performance gains and quality trade-offs. We also automate quality optimization. There are many things we are doing behind the scenes to deliver a very simple experience to the app developers, so they can concentrate their cognitive bandwidth on innovating on the application side.
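
One of those optimizations, semantic caching, is easy to picture: if a new request is close enough in meaning to one already answered, return the stored answer instead of recomputing. Below is a minimal sketch of that pattern; the embed() and run_inference() helpers are hypothetical placeholders, and this is not Fireworks’ actual implementation.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical placeholder: return a unit-length embedding for `text`.
    A real system would call an embedding model here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

def run_inference(prompt: str) -> str:
    """Hypothetical placeholder for an actual LLM call."""
    return f"<model answer for: {prompt}>"

class SemanticCache:
    """Reuse a previous answer when a new prompt is semantically close to an old one."""

    def __init__(self, threshold: float = 0.92):
        self.threshold = threshold
        self.entries: list[tuple[np.ndarray, str]] = []  # (embedding, answer) pairs

    def query(self, prompt: str) -> str:
        q = embed(prompt)
        # Cosine similarity against cached prompts (vectors are unit length).
        for vec, answer in self.entries:
            if float(np.dot(q, vec)) >= self.threshold:
                return answer           # cache hit: skip recomputation
        answer = run_inference(prompt)  # cache miss: pay for a real model call
        self.entries.append((q, answer))
        return answer

cache = SemanticCache()
print(cache.query("What is the capital of France?"))
# With a real embedding model, a paraphrase like the one below would likely
# hit the cache; with the random placeholder embed() above it will not.
print(cache.query("What's the capital city of France?"))
```

In practice the similarity threshold has to be tuned per workload, since the trade-off is saved compute versus the risk of returning a stale or subtly wrong answer.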

Simplicity scales

Pat Grady: I like your comment earlier about simplicity scales. And as you’re talking through everything that you’ve built to make this such a simple and delightful experience for your customers, it reminds me of the idea of conservation of complexity, you know? Like the amount of complexity required to deliver any given task can be neither created nor destroyed. It’s just a question of who takes the complexity.

Lin Qiao: That’s right.

Pat Grady: And it feels like yours is a business where you have embraced an enormous amount of complexity to make life simple for your customers. And actually my question is about your customers. So where in the AI journey of your customers, where in their AI journey do they say, “Wait a minute, we need something better,” and then what brings them to you?

Lin Qiao: Yeah, so we’ve seen a pretty consistent pattern: last year, people all started with OpenAI because they were in heavy experimentation/exploration mode. Many startups have some creative ideas, application and product ideas, and they want to explore product market fit, so they want to start with the most powerful model, which OpenAI provides. And then when they feel confident they’ve hit product market fit, they want to scale their business. And then the problem comes in because, as I mentioned, most GenAI applications are B2C—consumer- and developer-facing. They require very high responsiveness. Low latency is a critical part of product viability. Without that it’s not a viable product. People are not patient enough to wait half a minute for a response. That’s not going to work. So they are actively seeking low latency. And then another key factor is they want to build a sustainable, viable business that won’t go bankrupt quickly. And the weird thing is …

Pat Grady: Not in this market, they can’t.

Lin Qiao: The weird thing is if they have a viable product, that means they can scale quickly. And if they’re losing money at a small scale, they’re gonna go bankrupt quickly, right? So bringing down the total cost of ownership is critical for them. So that’s why they come to us.

From training to inference

Pat Grady: So it sounds like—I remember you had this insight a year or so ago, and we spoke about, you know, training tends to scale in proportion to the number of researchers that you have, whereas inference tends to scale in proportion to the number of customers that you have. And in the long term, there are probably gonna be more customers of AI products than AI researchers out there, and therefore inference is the place to be. It sounds like you’re kind of—the customer journey sort of begins as people are going from training into inference. What sort of applications, what sort of companies are at that point where they’re starting to really go into production?

Lin Qiao: There are so many ways to answer this question. It’s a very interesting question. So first of all, my hypothesis when I started the company was we’re going to take on startups first because they’re the most tech advanced. There will be a ton of startups built on top of GenAI. Then we’ll go to digital-native enterprises because they are tech forward, and then we’ll go to traditional enterprises because they are, like, tech conservative. They want to observe and adopt when the technology and the product ideas are mature. So that’s kind of my hypothesis. And it totally blew my mind what’s happening right now, because we have a lot of inbound from startups, we are working with digital-native enterprises, and we’re also simultaneously working with traditional enterprises, including health insurance companies, healthcare companies and banks.

And especially for those traditional enterprises, usually I adjust my pitch to be very business oriented because, hey, you know, that’s kind of my—maybe my bias—and I kind of want to strike up a meaningful conversation with them. But they quickly dive into very low-level technical details with me, and it’s very, very engaging.

Pat Grady: Who are the people doing—like, at a traditional enterprise, who are the people that you’re engaging with?

Lin Qiao: So we …

Pat Grady: Like is it an innovation person, AI person, or is it more business line leader, somebody who owns a production application?

Lin Qiao: Yeah, so I think it’s starting to shift. We are engaging more with CTOs at the start. I feel like this business is shifting towards innovation-driven business transformation, and that’s why we encounter more CTOs than, like, COOs or CISOs. So that’s kind of an interesting shift. But yeah, across the board, I think there are multiple fundamental reasons why that’s happening. That’s my hypothesis. One is all the leaders realize the current GenAI wave is similar to the cloud-first shift, or similar to the mobile-first shift. It’s going to remap the landscape of the industry. Startups are growing really fast, and the incumbents feel threatened: if they are not innovating fast enough they will be obsolete, they will be irrelevant. But also across the incumbents, they are heavily competing with each other. They’re competing on how fast they can transition their business—to create more revenue, to be more efficient using GenAI. So that’s one phenomenon.

The second phenomenon is generative AI is different from traditional AI. I would say this is very different. Traditional AI gives a lot of power to hyperscalers, right? Because with traditional AI you always have to train from scratch. There’s no concept of a foundation model you build on top of, and that means you have to go off and curate all the data. And the data-rich companies usually are hyperscalers, and you need a lot of resource investment to train your own models and so on. So that is before GenAI: it’s less affordable, it’s concentrated in the hyperscalers. Post-GenAI, because of this concept of foundation models, people build on top of foundation models and you don’t train from scratch. It’s not meaningful. It’s all the same data—it’s all the internet data you can crawl. It’s a more or less similar model architecture. It’s a waste of resources if you train from scratch—you fine tune. You tune based on your own small, high-quality data set. So it becomes a small data, small model problem. And it makes it so much more affordable for everyone to access this technology, and that’s why everyone is jumping in to embrace it.

Sonya Huang: How many of your customers are using you for fine tuning versus just using a base model? And what do you think goes into building a great fine tuning product?

Lin Qiao: It really depends on the problem they are trying to solve. We actually see the open source models becoming better and better. The quality differences between open source models and closed source models are shrinking, and my prediction is they are going to converge within the same model bucket—the same model size.

Will open and closed source converge?

Pat Grady: You think open and closed will converge?

Lin Qiao: The open and closed will converge.

Pat Grady: Do you think there will be a time lag where closed is always six months ahead? Or do you think they’ll just be neck and neck?

Lin Qiao: For the same model size, especially between seven and 70 billion, or even within the 100-billion range, the quality will converge. That’s my prediction. We’ll see. We’ll see after a couple of years, and we will come back to this podcast and see how it goes. [laughs] So the key here is customization, right? If this trend is true, then the key differentiation is how we customize those models towards individuals’ use cases, towards individuals’ workloads.

Sonya Huang: And is it easier to customize an open source model than a closed source model? Or how do you think about that?

Lin Qiao: So I wouldn’t say it’s easier. It’s just the open source models tend to have a much richer community, and there are a lot more people working on building on top of those models. For example, a Llama 3 model, it’s a very, very good base model. It is very strong in instruction following. It follows instructions very well, so it’s very easy to tune it, to align the model to solve a specific problem really well. And for example, we have been investing in function calling strategically as a direction. We can talk more about that, it’s its own topic by itself, but we find fine tuning a function calling model on top of Llama 3 is so much easier compared with fine tuning based on mixture models or previous Llama 2 and other models. So that’s just kind of—the base model, open source base model is becoming very, very strong in instruction following, in logical reasoning, and many other base capabilities, so it’s very easy to morph it to become a high performance model for solving specific business tasks. That’s the power of small models.

Pat Grady: If we think about just open source software, open source infrastructure software, 20 years ago, open source was thought of as a fast follower sort of thing, you know, Red Hat being a canonical example. And then more recently, open source is not the fast follower, it’s actually the innovator, if you think about Mongo or Confluent or some of these other great open source businesses that have been created. Do you think there are areas in the world of models where open source is actually going to lead closed source and is actually going to be ahead of the proprietary models?

Lin Qiao: So I think the dynamics are very interesting right now, because the proprietary or closed source model providers are betting on very few models, right? So OpenAI’s LLM models are maybe three models, right? Or you can think about it as one model, because what is a model? A model is the model architecture and the data, the training data, right? That defines the model. So I’m pretty sure across all their models they have more or less similar training data. The model architecture is more or less similar, so it’s kind of scale, parameters and so on.

It’s not just OpenAI. I think Anthropic or Mistral, and all these model builders, they have to concentrate their effort to focus on a specific model segment. That’s the best ROI; that’s the business model. But open source pushes a different dynamic, because it enables so many researchers to build on top of it. So that’s kind of the small model phenomenon. It’s smaller and it’s easier to tune, easier to improve quality, easier to focus on a specific problem space, so it enables a thousand flowers to blossom. So that’s the direction we believe in. To solve an enterprise problem, a thousand flowers blossoming is much better, because, you know, you just have so many problems. And in fact at a given… at any problem space there is a solution for you. And we further customize towards your use case and your workload. And what you get is better quality, much lower latency for real-time applications, much lower cost for business sustainability and growth. So we believe in that direction.

Can you match OpenAI on the Fireworks stack?

Sonya Huang: Maybe to that point, have you seen that your customers so far are able to match the quality that they got with OpenAI when they move over to the Fireworks stack? And how are you enabling what I call the small-but-mighty stack to compete?

Lin Qiao: Yeah, so it really depends on the domain. For some domains, people don’t even fine tune; they use an off-the-shelf model as is, and it’s already very, very, very good. For example, in the domain of coding copilots, code generation, transcription, translation, OCR, it’s just phenomenal. Those models are really, really good. So that’s kind of off-the-shelf and ready to go. But some areas require business logic, because every company defines what is good differently. And then of course, off-the-shelf models will not work off the shelf, because they don’t understand the business logic.

For example, classification. Different companies want to classify different things—some marketplace wants to classify whether something is furniture or a dish or something else. That completely depends on their domain. Or summarization. You think summarization is a very general task, but, for example, an insurance company wants to summarize into a very specific template, right? And there are specific business tasks and many other things. We work across the board on various different problems, and those require fine tuning.

And I want to call out that fine tuning sounds simple, but it’s actually not simple at all. The end-to-end process requires enterprises or developers to collect data: first, to trace. After they trace, they need to label. After they label, they need to pick and choose which fine-tuning algorithm to use. There’s supervised fine tuning, there’s DPO, there’s a slew of preference-based fine tuning, where they don’t label absolutely good results—they basically say, “I prefer this over that.” They need to pick whether they want to use parameter-efficient fine tuning like LoRA or full-model fine tuning. And for some tasks, they need to tune hyperparameters, not just the model weights themselves.
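
As a concrete illustration of one item on that menu, here is a minimal sketch of parameter-efficient fine tuning with LoRA using the open-source Hugging Face transformers and peft libraries; the model name and hyperparameters are placeholder assumptions, not Fireworks’ tuning pipeline.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Placeholder base model: any causal LM checkpoint you have access to.
base_model_name = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

# LoRA freezes the base weights and trains small low-rank adapter matrices
# attached to selected layers, instead of updating all model weights.
lora_config = LoraConfig(
    r=16,                                 # rank of the adapter matrices
    lora_alpha=32,                        # scaling factor for the adapters
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```

Training would then proceed with supervised fine tuning (or a preference-based method such as DPO) over the labeled data, with only the adapter weights being updated.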

So among these many technologies, they have to figure out when to use what, and so on. It’s very deep. And usually those app developers haven’t even touched AI yet, so there’s a lot for them to pick up. And then once they’ve tuned and they test it, it’s improving in some dimensions but it’s still not good in some other cases, and then they need to capture those failure cases and analyze: should I collect more data and go through this cycle again, or is it actually a product design question? It’s very interesting. Some failure cases are not really failure cases; it’s just that they haven’t designed how the product should react.

For example, people are building a system to autogenerate content while people type. And if you’re in a table and your cursor is in a cell—what does autogenerate mean? Do you auto-extend what you typed in the cell? Do you generate more rows, or do you do nothing? So it’s actual product design. So that requires a PM to be in the loop to think about the failure cases. With all this complexity, what we want to do is take away the rudimentary stuff: take away the complexity of figuring out which tuning approach to use, how to automatically label data, how to automatically collect data from production. We want to take away all of this and keep a simple API for people to use, but leave the design part to our end users. For example, how the product should respond should completely be in their realm to figure out and solve, so we want to create that separation there. And we started working in this direction, and hopefully we’ll announce our product there soon.

Pat Grady: I love that you’re kind of liberating people to not have to think from the tech out and to actually think from the customer back, and sort of use all the stuff that you’ve built to deal with the underlying technology and really focus on, to your point, the design patterns and the usability, and making sure that they’re actually solving an important problem end to end in a compelling way.

What is your vision for the Fireworks platform?

Sonya Huang: What is your vision for the Fireworks platform? And, like, to Pat’s point earlier on, conservation of complexity, you know, we started this podcast talking about how you’re conserving complexity for your customers on the inference stack. You just now talked about how you’re conserving complexity for your customers in terms of the fine tuning workflows. Like, what are the other pieces that have to come together, and what is your ultimate vision for what Fireworks the product is?

Pat Grady: If everything works, five years from now, what will you have built?

Lin Qiao: [laughs] So the north star for Fireworks is simple API access to the totality of knowledge, right? So right now we’re building towards that direction. We already provide more than a hundred models across large language models, image generation models, audio generation models, video generation models, embedding models and multimodal models—image as the input to extract information. So that’s one side: the foundation model coverage.

But even putting all the foundation models together, they will still have limited knowledge, because their training data is limited. The training data has a starting time and an ending time. All the information they can crawl on the Internet is still limited, because there is a lot of knowledge hidden behind APIs—hidden behind public APIs that you don’t have access to, or you just cannot get real-time information. There are a ton of private APIs hosted by enterprises. There’s no way anybody outside those companies will have access. So the way we get access to the totality of knowledge for the enterprises is to have a layer that blends across many different models and public and private APIs.

So that’s the vision. And the tool, the vehicle to get there, is function calling—the function calling model. Basically, this model is capable of understanding which APIs you want to access and for what reason. It can automatically be the router that most precisely calls out to those APIs, whether it’s accessing models or non-model APIs, in the most accurate way.

So thinking about it strategically, that’s extremely important for building this simplified user experience, because then our customers don’t need to scratch their heads and figure out, “Oh, I need to fine tune to be able to access those APIs, and how do I even do that myself? It’s kind of a tall order for me.” Many people are familiar with a notion called mixture of experts. OpenAI is providing mixture of experts, and mixture of experts has become a very popular model architecture. The concept is it has a router sitting on top of a few very big experts, and each is specialized in doing its own thing. And our vision is to build a mixture of experts that accesses hundreds of those experts, where each expert is much smaller, more agile, but with high quality at solving specific problems.
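
To make the function-calling pattern concrete, here is a minimal sketch using the generic OpenAI-style chat-completions tools interface; the endpoint, model name, and get_weather function are illustrative assumptions, not Fireworks’ announced API.

```python
import json
from openai import OpenAI

# Any OpenAI-compatible endpoint; the base_url, key, and model are placeholders.
client = OpenAI(base_url="https://example-inference-provider.com/v1", api_key="...")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical public/private API the model can route to
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="placeholder-function-calling-model",
    messages=[{"role": "user", "content": "Do I need an umbrella in Seattle today?"}],
    tools=tools,
)

# Instead of answering directly, the model can act as a router and emit a
# structured call to whichever API is best suited to the request.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
```

The same mechanism extends naturally to private enterprise APIs, which is what makes it a route to knowledge that no single model’s training data contains.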

Pat Grady: In that vision, real quick, do those experts live in Fireworks, in AWS, in Hugging Face? Like, where do those experts come from that get put together with Fireworks as the overarching framework?

Lin Qiao: Yeah. Our ambition is that those experts live in Fireworks. That’s why we want to curate the models we serve towards that, and that’s why today we already have more than 100 models. It will take some time to build this layer in a very solid way, but we’re going to release our next-generation function calling model. It’s really, really good—a little preview on that. It has multiple layers of breakthroughs, and we’re going to announce it together with demos and examples, and people can leverage it and build on top of it.

Pat Grady: Very cool.

Competition for Nvidia?

Pat Grady: Do you see any viable competition for Nvidia on the horizon?

Lin Qiao: [laughs] That’s a very interesting question. First of all, I think Nvidia is operating in a very lucrative market, and any lucrative market invites competition. That’s just the economics here. And also, from the whole entire industry point of view, in general the industry doesn’t like a monopoly. So that’s another kind of pressure coming from the industry. So it’s not a question of whether there will be competition for Nvidia, it’s just a question of when.

Pat Grady: Do you think it’s coming soon?

Lin Qiao: I think it’s coming soon. I think it’s coming soon. I mean, obviously we can look at Nvidia’s competition in multiple segments. In the general-purpose GPU segment, AMD is coming up. That’s interesting. And I think also in specific AI segments, where the AI model space has stabilized, there’s no more innovation, the problem is well defined and this is the model, custom ASICs will have their own role. So I would look at the market that way, and I do think there will be competition coming soon.

Are returns to scale starting to slow down?

Pat Grady: Can I ask you about that, by the way? Because you guys are in this part of the market where you are model agnostic to some degree, and it’s really about the optimization of those models when it comes to putting them into production. Do you think that the returns to scale on the frontier, the models that are out on the bleeding edge, do you think the returns to scale are starting to slow down? Do you think that we’re going to go into a phase where capabilities have started to mature or asymptote, and the race is more about the optimization and tuning and application of those capabilities?

Lin Qiao: I think both will happen at the same time. One is it will start to stabilize and plateau from a model-applicability point of view, and we’ll heavily customize—our strategy is to heavily customize towards the use cases and workloads. So that’s one direction. And the second is I want to add a caution, because at Meta we also thought for a certain period of time that that was the model for ranking and recommendation, right? And that we should heavily, like, index on that assumption. But then after a few years it’s not the case. There was a significant amount of model innovation in seemingly stabilized modeling spaces, and that pushed the S-curve for Meta. I think the same phenomenon will happen in the GenAI space—a new model architecture will emerge. And we’re kind of overdue.

Competition

Sonya Huang: So you mentioned—we’ve talked about competition from other vendors, other direct competitors. What about OpenAI? Does OpenAI keep you up at night? Like, they drop prices on their APIs all the time, they’re making their models smaller and cheaper. They’re also trying to win the better, faster, cheaper race. Like, how do you think about them, and how do you think about ultimately what you’re gonna build that’s different from where they are going?

Lin Qiao: Right. So again, they are actually going smaller, right? And cheaper. I think for the same model size, for the same model bucket, whether it’s closed source or open source, the quality is going to converge. Again, that’s my prediction. And the really meaningful way to push the boundary here is heavy customization—automated customization tailored towards individual use cases and individual workloads. I’m not sure if OpenAI has the appetite to do that, because their mission is AGI. If they hold to their mission—which is a great mission, actually—it’s kind of solving a different problem than solving an enterprise problem. Which basically means there are a lot of specific problems that are really well suited for small models to be customized towards. And that’s where we want to focus our energy and build on top of open source models, assuming the trend that they are going to converge in quality.

Sonya Huang: Love that. Our partner Roelof, last time you were here, made the point that in prior technology waves—internet, mobile—it was the people that did all the hard work driving down the marginal cost of running this stuff that actually enabled all the application development on top and all the end use cases that we get to enjoy every day. And I love that you are taking that exact approach with AI, where it’s still so cost prohibitive for most people to run in production. And by just dramatically bringing down that cost curve, you’re actually helping the whole industry blossom. So it’s really wonderful. Should we close out with some rapid fire questions?

Lightning round

Pat Grady: Yeah, let’s do it.

Sonya Huang: Okay. Do you want to go first?

Pat Grady: No, go for it.

Sonya Huang: Okay. Let’s see. Favorite AI app?

Lin Qiao: We do a lot of video conferencing, and the notes taker for video conferencing is a game changer for us, whatever it is. There are so many different varieties, but I just love that.

Pat Grady: Which one do you use?

Lin Qiao: I think we use Fathom. Yeah, our sales team uses that. It’s really good for training and also summarization. It significantly shortens our time.

Sonya Huang: Nice. What will be the best performing models in 2024?

Lin Qiao: My prediction is there will be many, given the rate that every week there’s a new model coming up. And on the LMSYS arena, they keep competing with each other. So this is all good news for the whole entire industry. It’s really hard to predict which one, but the one prediction I’m pretty confident about is that model quality will keep improving.

Pat Grady: In the world of AI, who do you admire most?

Lin Qiao: I would say Meta. It’s not one person, but Meta’s commitment to open source. I think Meta has been the most brilliant in the GenAI journey by continuously open sourcing the series of Llama models, continuing to push the boundary, continuing to shrink the quality differences. What Meta is doing is basically decentralizing power from the hyperscalers to everybody who has a dream to innovate on foundation models, GenAI models. And I think that’s really brilliant.

Sonya Huang: Love that. Okay. Will agents perform or disappoint this year?

Lin Qiao: I’m very bullish on agents. I think it’s going to blossom.

Pat Grady: That’s all we got.

Lin Qiao: All right!

Pat Grady: Thank you.

Lin Qiao: It’s really fun to have this conversation. Thanks for having me.

 

Mentioned in this episode


  • PyTorch: the leading framework for building deep learning models, originated at Meta and now under the Linux Foundation umbrella
  • Compound AI Systems: AI composed of multiple components instead of calls to a single model
  • Caffe2 and ONNX: ML frameworks Meta used that PyTorch eventually replaced
  • Conservation of complexity: the idea that every computer application has inherent complexity that cannot be reduced but merely moved between the backend and frontend, originated by Xerox PARC researcher Larry Tesler
  • Mixture of Experts: a class of transformer models that route requests between different subsets of a model based on use case
  • Fathom: a product the Fireworks team uses for video conference summarization
  • LMSYS Chatbot Arena: crowdsourced open platform for LLM evals hosted on Hugging Face