Databricks Founder Ion Stoica: Turning Academic Open Source into Startup Success

Berkeley professor Ion Stoica, co-founder of Databricks and Anyscale, transformed the open source projects Spark and Ray into successful AI infrastructure companies. He talks about the importance of partnerships both in academia and at Databricks. The Microsoft partnership in particular accelerated Databricks’ growth and contributed to Spark’s dominance among data scientists and AI engineers. He also emphasizes how compound AI systems and open source models help enterprise customers get the most value out of their data.

Summary

UC Berkeley Professor Ion Stoica, cofounder of both Databricks and Anyscale, is unique in bridging academic research and commercial success by solving fundamental infrastructure problems in data processing and AI. His insights draw from decades of experience building systems that became industry standards—from Apache Spark to Ray—while maintaining his position as a leading computer science researcher.

  • Focus on solving real production problems, not demos: Build systems that work reliably at scale—demos can inspire but production systems must deliver consistently. The gap between demo and production is where many AI companies falter.
  • Data quality and control are paramount for enterprise AI adoption: Companies value their proprietary data as a key differentiator and require security, auditability and control over their AI systems. Open source models deployed in private environments often serve these needs better than closed API services.
  • Help customers get value out of their data with AI: Ion’s original motivation was to scale classical machine learning systems to handle more data. Now the challenges are to improve accuracy and eliminate hallucinations so businesses can rely on these systems and find the use cases which are going to provide the biggest value.
  • Compound AI systems are emerging as the dominant architecture: Like traditional software engineering, complex AI applications require breaking down problems into smaller, more manageable pieces. Compound AI systems call specialized models and agents that can be independently optimized and reliably composed.
  • Cost becomes critical at scale, but only after proving value: Early AI adoption focuses on capabilities and value creation, with less emphasis on costs. As use cases prove valuable and scale up, optimization for cost and efficiency becomes essential. At this juncture, open source models and specialized infrastructure provide advantages.
  • Infrastructure complexity is increasing faster than individual hardware capabilities: The growing gap between AI compute demands and single-node performance means systems must efficiently orchestrate heterogeneous, distributed resources. This creates opportunities for new infrastructure solutions that abstract this complexity away from developers.

Transcript

Ion Stoica: In general, this was our approach. We are going to be aggressive also about partnerships, even though the partners could compete and overlap. Because you have to trust yourself that—at least when it comes to Spark—you can build the best products. We are kind of saying internally well, if someone else is building a better product for Spark, then we deserve to lose, right? So that kind of always was the confidence that we can build the best product for Spark. And eventually, if Spark wins, we are going to win.

Stephanie Zhan: Hi, everyone. Welcome to Training Data. Today we’re excited to welcome Ion Stoica, professor of computer science at UC Berkeley and cofounder of both Databricks and Anyscale. He has a uniquely exceptional career as a leading professor and founder of companies at truly legendary scale. Today we dig into questions like Databricks’ positioning in AI, how research projects like Spark and Ray led to the founding of Databricks and Anyscale, how he ties his research projects closely to industry from day one, new projects out of his lab like vLLM, MemGPT, LMSYS and Vicuna, and what research fields he’s thinking about next.

Ion, thank you so much for joining us today. We’re really excited to have you on the pod. To kick things off, we’d love to hear a little bit about where Databricks aspires to fit into the overall ecosystem, especially with some of the recent launches. What are you personally most excited about?

Getting value out of data

Ion Stoica: First, thanks for having me here. So with Databricks, we always wanted to provide an end-to-end platform which helps our customers get the most value out of their data. And one of the best ways today to get value out of the data is using these kinds of new developments in AI, including large language models and everything else.

And the one thing to note is that this was always our vision from day one. Actually, one of the main reasons to create Spark when we built it was to speed up classical machine learning algorithms, right? And to scale them up, right? So in some sense, for us it’s full circle. We started with AI, with classic machine learning, and right now we are back and doing more and more AI to take advantage of and create value out of the data.

Sonya Huang: So you mentioned that you were founded around, you know, enabling classical machine learning. How do you see the current AI moment as the same and how do you see it as different? And I’m curious, what are the specific things you’re doing for this moment in time, like the Mosaic acquisition, things like that?

Ion Stoica: Yeah. So certainly the momentum around AI is on a different level today, right? You just look at the investments out there, right? That should tell a big part of the story. The way we are looking at it is that taking advantage of AI and being successful with it is not easy, right? The AI ecosystem is growing in complexity. It’s not just a simple call into a model. You have so many techniques now, like RAG and RAFT, to improve the accuracy of your application using AI, right?

Obviously everyone is excited about AI because it solves so many problems and generates so many headlines, but still it’s not what you want. You want a product out of AI. A lot of what we are seeing today is still fantastic demos, right? And demos are inspirational. And when people see demos, like ChatGPT solving an Olympiad math problem, it’s very easy to think, “Wow, if it’s doing that, it’s going to do everything,” right? But going from demo to production, like I said, is a big step. A demo means you found at least one instance that is really impressive; it shows such instances exist, right? But when you go from the demo to the product, it has to work for all cases, right? So that’s kind of the big gap.

So that’s why the effort is to improve the accuracy, improve reliability and, of course, eliminate hallucinations as much as you can. And also to find where it provides the best value, because you can apply AI to 1,000 use cases, but which are the use cases which are going to provide you the biggest value? I think that’s what we are also trying to help our customers with: to navigate how to successfully apply AI to their products, their services, their business.

Why train your own models?

Stephanie Zhan: In that vein, one of the most interesting things that I thought came with the Databricks AI launch was the new open general-purpose LLM created by Databricks, DBRX. What was the reasoning behind training your own models and open sourcing them? And what do you think are some of the best use cases for that model?

Ion Stoica: Yeah, so if you look at our main market, it’s enterprise customers. They have a lot of concerns about data privacy and confidentiality, and obviously they want control. It’s not only that, they want auditability, right? They want to be able to audit what data you used, what results were produced, and what decisions were made with what data. And so that’s why in general enterprises, everything being equal, migrate towards open-source models they can host on their own machines or in their VPC and so forth.

So one of the reasons for releasing DBRX is to help our customers. And in many of these cases the customers start from this model and then fine-tune it with their own data to optimize for their particular use case, right? Again, our enterprise customers want open-source models, they want control, as much visibility as they can get. They want privacy and confidentiality, especially in light of the recent breaches which are widely publicized. The other thing is that the DNA of Databricks has been open source. It’s not only Spark, but Delta and MLflow and many others.

Stephanie Zhan: The other thing I thought was fascinating about DBRX is its excellent programming abilities. What do you think has made it such a capable code model, especially compared to something like CodeLlama-70B?

Ion Stoica: Look, obviously it’s about data and how you train it. And one of Databricks’ major advantages is Mosaic, right? We have not only the data, but the entire infrastructure for training and fine-tuning. And that makes it much easier and more cost effective to optimize models for different use cases. And obviously, the copilot, programmability, is one of the very important use cases in these enterprises, because software engineers are still very expensive.

Stephanie Zhan: Yes. Interesting.

Ion Stoica: So making the people you have productive—and it’s not only about that. Hiring top software engineers is very difficult if you are a large company like, I don’t know, Ford or things like that. So making those people productive is extremely important and critical for their business.

Sonya Huang: You mentioned Mosaic as a key part of the strategy. What do you think are your most important chess pieces in this AI battleground? I imagine Mosaic is one of them. Are most of your enterprise customers looking to train their own models? And how do Mosaic and your other acquisitions fit in with your customers’ needs?

Ion Stoica: Yeah, so I think there are a few options for enterprises: again, it’s pre-training, then fine-tuning, and then using the model on your own hardware, in your own VPC, on your own machines, whether you rent them or own them.

As you may expect, there are only a few which are doing pre-training, but there are still a few doing pre-training on their own data, if they have enough data. A lot of them want to do fine-tuning, right? Because look, if you are an enterprise and you want to improve your business, what do you have which others don’t? The thing you have which others don’t is the data: the data about your business, about your users, right? Because this is something you have and others do not, you want to take advantage of that, right? So how do you take advantage of it? Again, there are many ways, and you try different ways to do it, right?

One way is fine-tuning: you have an open source model and you fine-tune it on your data, right? The other one is to use RAG and things like that. But whichever of those you do, like I said, you want to do it in your own VPC to preserve security, to have your own security perimeter, and to protect the confidentiality of your users. And obviously right now, with GDPR, the California Consumer Privacy Act and so forth, there are a lot of regulations, and the number of regulations is going to increase. So the fact that you can have an open source model you can fine-tune, and that you can use RAG in your own VPC, is a very compelling value proposition.
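To make the pattern concrete, here is a minimal sketch of the RAG flow Ion describes: rank your proprietary documents against the question, then feed the best one to a model hosted inside your own VPC. The `embed` and `generate` helpers are hypothetical placeholders, not real Databricks APIs.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical placeholder: an embedding model running in your own VPC."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Hypothetical placeholder: an open source LLM, possibly fine-tuned on your data."""
    raise NotImplementedError

def rag_answer(question: str, docs: list[str]) -> str:
    q = embed(question)
    doc_vecs = [embed(d) for d in docs]
    # Rank proprietary documents by cosine similarity to the question.
    scores = [float(q @ v) / (np.linalg.norm(q) * np.linalg.norm(v)) for v in doc_vecs]
    best = docs[int(np.argmax(scores))]
    # Both retrieval and generation stay inside your security perimeter.
    return generate(f"Context:\n{best}\n\nQuestion: {question}")
```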

Sonya Huang: Mm-hmm. And are you finding that most of your customers want to go that approach versus, you know, OpenAI and the very powerful closed source models in the market?

Ion Stoica: I think enterprises—and we are still early on, and there are many enterprises obviously using OpenAI for different use cases, different applications. And I’m sure OpenAI and Microsoft Azure will come out with new products to provide better confidentiality, better security. But at the end of the day, what I was saying is that, everything being equal, as an enterprise I will prefer more control and more security, and it’s strategic, right? There’s less lock-in and things like that. So if open source models catch up with the proprietary models, in particular on the use cases which matter for the enterprises—they don’t need to be perfect for every use case, they just need to be very competitive where it matters—then enterprises will prefer the solutions over which they have more control and which are more secure.

Have open source models caught up?

Sonya Huang: And how far away do you think we are from that moment? Like, do you think we’re there today where all else is equal or when do you think we cross that?

Ion Stoica: In terms of the open source versus proprietary?

Sonya Huang: Yeah, open source being, you know, on par, all else being equal, for the core use cases.

Ion Stoica: So you have a lot of use cases. Again, it’s the open source model plus the data. And right now, the applications are more complex. It’s not just a call to a large language model. You compose: you have an application which is built from many components, and that’s what we call compound AI. And it actually turns out that if you build an application for doing recommendations or things like that, or a copilot for programming, for a particular tool, you can actually do better than even something like OpenAI or the latest ChatGPT, because you have more data.

And the other thing about Databricks, which I think there have also been announcements about, is that with Unity Catalog and things like that, you have access to the data and you also know about its structure, which helps you tremendously to improve the accuracy of your applications, right? So it’s not only the model. It’s everything else you have around it, and the quality of the data you feed into it. So I think that, at the end of the day, “it’s the data, stupid,” as they say, right?

Stephanie Zhan: It sounds like control and security are two primary areas of things that enterprises really care about that you’ve noticed. And Databricks obviously has a tremendous advantage with having access to data as well to help these companies use more …

Ion Stoica: Control and security for the customers.

Stephanie Zhan: Exactly. What are some other factors you’ve noticed that they also care about? How much does cost matter? How much does diversity of models matter?

Ion Stoica: I think obviously cost is important. And it’s like this, right? Initially, the important thing is the value.

Stephanie Zhan: Yeah.

Ion Stoica: Right? Can you provide the value? That’s the first thing. And at that stage, cost is not as important, right? In some of these early stages, people also try to use the most powerful models, like OpenAI’s. But once you cross that point and you have a use case which you conclude adds value to your business, now you want to scale it up, right? And now you are talking about having more control and more security and all of these things: protecting the confidentiality of the data, the privacy of your users and so forth.

And now basically people consider how they are going to deploy it. And this is where having multiple choices and, like you said, more control and security matters. And this is where open source models and platforms like Databricks are very valuable, right? And again, it’s also all the other components which come together in Databricks, like I mentioned, Unity Catalog and everything else, that increase the value of your applications.

Stephanie Zhan: Yeah, super interesting.

Sonya Huang: I’d love to talk to you about compound AI systems. I think you guys probably coined or popularized the term, and it seems like that’s a lot of what the industry is latching onto now. Maybe for our audience, can you explain what is a compound AI system, and what are enterprises thinking about when they’re building these?

Ion Stoica: Yeah. So a compound AI system basically consists of multiple components, multiple calls to large language models or agents. You can think about when you write a program: you have multiple components, different functions and procedures to do different things, and then you put them together to create the program. It’s very similar here. You can use maybe one model to parse the data, to extract the data you can use. And then, depending on the prompt, you may use different models: if the prompt is about math problems, you can use one model; if the prompt is about programming, you may use a different model, right?

And then you may use another model, for instance, for formatting the result, right? That’s another one. And now more and more we are talking about agents, and with agents you call external services or functions like search, or you can use a calculator and things like that. With different models you can do a better job, and there are actually small models whose job is just to take the prompt and convert it to a function call, right? So that’s kind of what it is. But the way to think about it conceptually: just as you write a program from different components to make it easier to develop, deploy and manage, you want to apply the same thing to AI applications.
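For illustration, here is a minimal sketch of the routing pattern Ion describes: one small model classifies the prompt, a specialist handles it, and another small model formats the result. All model names and the `call_model` helper are hypothetical placeholders, not a real Databricks or OpenAI API. The structural point is that each component can be swapped or optimized independently, like functions in an ordinary program.

```python
def call_model(model: str, prompt: str) -> str:
    """Hypothetical placeholder for a call to a hosted LLM."""
    raise NotImplementedError

def route(prompt: str) -> str:
    # A small, cheap model classifies the prompt so a specialist can handle it.
    label = call_model("router-small", f"Answer with one word, math|code|general: {prompt}")
    return label.strip().lower()

def answer(prompt: str) -> str:
    specialists = {
        "math": "math-specialist",
        "code": "code-specialist",
    }
    model = specialists.get(route(prompt), "general-model")
    draft = call_model(model, prompt)
    # A final small model formats the specialist's draft for the end user.
    return call_model("formatter-small", f"Format this answer cleanly: {draft}")
```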

Sonya Huang: So like a collection of smaller components that work together where the—you know, the sum of the parts is greater than the one monolith that you’re replacing.

Databricks: from data scientists to AI workloads

Stephanie Zhan: I’d love to dive a little bit into the Databricks story, which I think is an incredible, legendary journey over the last decade, but with a lot of nuances that maybe many folks don’t yet understand. From today, it looks like you were building the right company at the right place at the right time, but the actual nuance is that Databricks was started originally for data scientists. It happened to cater well to machine learning workloads because of all the work that you were doing in data. And over time you made the right strategic decisions to actually grow with the AI market. Can you share some of the learnings and the journey that got Databricks to where it is today?

Ion Stoica: Very happy to. I do think that a lot of it is also about being at the right time and the right place, about being lucky; all of this is true. There are many things that need to go right to be successful. Some of them you control, some you do not. When we started, indeed, we were focused on building a product, a cloud-hosted product for data scientists, right? We had this kind of notebook, we provided hosted Spark, and we targeted data scientists. One of the reasons we targeted data scientists was because, like I mentioned, Spark early on was also targeting machine learning workloads. And at that time there were not many data scientists. It was, like, 2013.

However, when you looked around, most of the universities already had data science programs, right? They were starting to offer data science degrees, and we said, “Okay, it seems a good path forward, a good market.” And I remember we were looking at LinkedIn to see how many data scientists were there, because they were our users. Initially, there were not many.

Stephanie Zhan: Thousands.

Ion Stoica: Especially when you compare the numbers with database analysts and engineers and so forth. We started to build, and I think it was a reasonably good product. And then we started to grow quickly in terms of the number of customers. Initially, we had small customers. And still, interactive analysis—data science—has for a long time been one of the biggest workloads, in particular in terms of revenue, because interactive workloads are priced higher than batch workloads.

But I remember that we sold Databricks to these companies, and they were aspirationally buying it to do data science and AI. After a few months we would go to them; we didn’t go much earlier because we saw that they were doing well, their usage was growing and everything seemed good, so no reason to worry. So we went back to see what they were doing, maybe to write a blog post about it, right? Marketing and everything. The surprise was that very few were actually doing machine learning at that time. And we asked them what happened.

Well, it turns out that in order to do machine learning you need data, like we discussed early on. And they realized that for the particular applications they wanted, they didn’t have the data they needed. So they needed to add logging, to collect new logs in their products and things like that. They also needed to clean up the data, curate the data and so forth. Now we were lucky, because Spark was also very good for data engineering, for data processing, right? It’s a data processing tool at the end of the day. So they were using Spark for data engineering, and we started focusing on serving data engineers much more than before. So now we had data engineering, and we still had all these data scientists exploring and starting to build models. And then it was a very natural extension to add more products for our users, for our customers, like I mentioned, to get more value out of their data. And that meant building machine learning models and using the open source models.

Stephanie Zhan: Very interesting.

The Databricks-Microsoft partnership

Sonya Huang: I want to talk about the Databricks-Microsoft partnership. I think that was the stuff of legends, and still probably one of the only case studies of a truly transformative partnership. Maybe walk us through what that partnership was, and the company moment at the time. Do you think Databricks would have become what it became if you hadn’t struck that partnership?

Ion Stoica: Look, obviously the partnership with Microsoft was a great partnership for us. The one thing I want to say, and this is most visible, is that from day one we were actually very focused on partnerships. Our idea was always: we make Spark successful, hopefully the de facto standard for data processing, and then we make Databricks the best place to run Spark. So early on, a few months into the life of the company, we had this kind of partnership with Cloudera, and then with Hortonworks. These partnerships were mainly to advance Spark, right? Because Spark was created in the Hadoop ecosystem, and these are Hadoop companies, right? And we had a similar partnership with DataStax and so forth.

So in the first year or so we had all these partnerships, despite the fact that we knew some of these companies could become our competitors, right? Because we were helping them to deploy, to manage and to sell Spark-based services, right? So in some sense, Microsoft fits our approach of being aggressive about striking partnerships with other organizations in the ecosystem, even though in some cases it was not clear cut whether they were going to compete with us or not. Growing the ecosystem and growing Spark, that was the priority.

We even had at some point a partnership with Snowflake. Now, a partnership like the Microsoft one requires a lot of heavy lifting. We always look for partnerships which are meaningful, right? And it was a great negotiation; Ali and so forth did a fantastic job there. But at the end of the day we also needed to commit to it. To build Azure Databricks (we were on AWS before that) took tens of engineers one year. And we were a small company at that point. So it was a huge commitment and a huge bet from our perspective. I think engineering and everyone executed very well, and it was a successful product, right? And Microsoft were great partners. And yeah, this is what happened.

We were obviously a bit lucky, but in general this was our approach: we are going to be aggressive about partnerships even though the partners could compete and overlap. Because you have to trust yourself that, at least when it comes to Spark, you can build the best products. We were kind of saying internally, “Well, if someone else is building a better product for Spark, then we deserve to lose,” right? So there was always the confidence that we can build the best products for Spark, and eventually, if Spark wins, we are going to win.

Sonya Huang: Do you think that Databricks would have become the company it is today without that Microsoft partnership?

Ion Stoica: I think so. It maybe would have taken a little bit longer, but yeah, I think so. We would still have a very good offering on Azure, like we have with GCP. It would have taken a bit longer, but I don’t see any fundamental change in the dynamics. Because one of the advantages of Databricks, of course, is that once Spark won and we could provide the best product for Spark, we were in a very strong position. And compared with the clouds, remember that one of our advantages, like Confluent’s and so forth, is that we can provide a service on multiple clouds, right? And multi-cloud has been more and more strategic, especially for large enterprises. They do not necessarily want to be locked in or to have only one choice.

Stephanie Zhan: I love the confidence and conviction that you have in Spark and your own execution abilities internally, but that married with the practicality and aggressiveness of winning as a business and pursuing the right partnerships and doing whatever it takes to win.

Ion Stoica: Yeah. Yeah, I mean, you try to simplify things. That’s what I was saying. Initially we said, “Look, we have to make Spark win.” I remember I looked at all the combinations. It’s like: Spark wins, the product fails. Or Spark loses, but we have a product which is successful. Or both fail; no, that’s not very interesting. Or both are successful, right? And we convinced ourselves that we needed to bet on Spark winning, because that’s the most likely way for the product to win as well. Again, for better or worse, right? Sometimes there are many ways to success, and in retrospect you cannot go back and try other alternatives. Maybe there were better alternatives at that point, but it’s important to commit to one thing which hopefully is a reasonably good solution, a good path forward, right? There are many paths to the peak, right? The most important thing is to commit to one which leads to the peak. It may not be the shortest one or the easiest one, but it has to be one that gets there.

Stephanie Zhan: And to the highest peak.

Ion Stoica: And that’s why we said, “Okay, Spark has to win, and we have to be the best place for Spark.” And then, to be the best place for data and AI—eventually we knew, and we assumed, that to be hugely successful we would go beyond Spark, right? That’s why the name of the company is Databricks, not SparkLabs or something like that. So you try to simplify. Once you do that, then you start to execute, okay? You want to make Spark successful as open source, so you want everyone to use it. That’s why you do the Cloudera and Hortonworks partnerships and so forth.

Because at that time there were other solutions. People knew that Hadoop MapReduce’s time had passed, so to speak. So they were talking about new systems. There was, for example, Tez; Hortonworks had this project, and so forth. So that was very important. And then there was data science: when we bet on it, it was a niche, and we thought that we could build the best product for it. So you need to have, ultimately, confidence in what you bet on, right? You have to bet, because you are a small company. If you don’t bet, how are you going to win? And then you need to have some level of conviction to do it.

From academia to industry

Stephanie Zhan: I’d love to pull on that thread and switch gears a little bit into tying your entrepreneurial path to the academic and research background that was the root of these companies. You have a very unique career as both a leading professor and a founder of multiple unicorn and decacorn companies. I don’t think there’s anyone who comes close to pursuing both disciplines at the scale of success that you have. Maybe specifically for Ray to Anyscale, or Spark to Databricks: take us into your head. What is the process by which these research fields start to ruminate in your mind? When do you continue to give them resources to develop, and when do you know that it’s time to start a company to pursue them in a better, more open and faster way?

Ion Stoica: That’s a good but hard question. I like to preface this by saying that there’s obviously also a lot of luck involved, and being in a place like Berkeley and having fantastic students and colleagues around you; you couldn’t do it without that, right? It’s the environment’s achievement as much as mine. But one thing is that I’ve always tried to focus on the problem. I even tell my students: one of the most important things you need to do is figure out what problems you are going to work on, right? Because everyone who comes to Berkeley or these top schools has one thing in common: they’re good problem solvers. They have good grades, good scores, they write papers, and a paper is about solving a problem. So if all of them are good problem solvers, the differentiator is the problem you are working on, right?

Stephanie Zhan: Yes.

Ion Stoica: So you start with that. And also, especially at Berkeley, you get exposed not only to new ideas, but to a willingness to take risks and get into new areas. And this is what I like about Berkeley. Traditionally, among top schools, Berkeley has been among the first to open new areas. Of course the RISC processor, which was with Stanford as well. But databases, networking, sensor networks, even open source with Unix BSD, you know, TCP/IP, right? Part of the CIB. So they are always trying to experiment a little bit more. That kind of culture, I really resonated with it.

And then the other thing at Berkeley: we have these five-year labs, where basically each lab has a vision, and a group of faculty who believe in that vision come together with their students and try to make it happen over five years. And this has had a lot of great impact all the way through. The tradition started 40, 50 years ago with Dave Patterson, Randy Katz and others. They built RISC, RAID (Redundant Array of Inexpensive Disks) and the Network of Workstations: commodity machines. Everyone is now building these huge clusters of commodity machines, servers, and there are many more examples. So there are these kinds of elements, and these labs have a very strong relationship with industry, right? Connections. They are funded by it. When I came, actually, I saw one change happen. Before, these labs were also supported by the government, in particular DARPA. But that was the point at which at least that particular DARPA funding dried up.

Stephanie Zhan: Yeah, yeah.

Ion Stoica: So when I came to Berkeley, there was this sense that now we needed to get more money from industry. I remember the first time we got money from Google: it was unheard of back then, because we were asking for $500,000 per year, right? For four years. And this is what we got. So now you have this very tight connection with the industry. It’s a very good environment to see the problems, right? To understand the problems. And then you try to think about trends, obviously. Because trends are important, right? You have to be aligned with these secular trends. You need to bet on the right trends, because these are things you cannot change, or it’s very hard to change. So if you are not aligned, it’s not good.

And there are actually multiple trends, and the multiple trends open gaps between them. These gaps are opportunities for problems. Like, for instance, with big data, it was clear you had more and more data, and the amount of data people were collecting was just growing. It was pretty clear, right? Google had seen that years before, and they built all these systems. But now everyone wanted to emulate that, right? That’s why Hadoop was created.

And then you start to see: you look at that and you are working in that area. We had all these Hadoop people coming to the retreats of these labs, and we were friends with them. And we started to see problems. Two things happened with Hadoop, for instance. One thing was about a group in our lab. By the way, the other thing about these labs is that they are interdisciplinary, right? In this lab, the RAD Lab, there were people from machine learning, systems, databases, networking, right? So there was this group of Michael Jordan’s students who wanted to compete in the Netflix challenge: Netflix had released some data and asked people to build recommendation systems that beat their own recommendations.

So they came to us: okay, it’s a lot of data, what can we do about it? And we told them to use Hadoop. But Hadoop was very slow, right? And then Matei put together something quickly for solving this problem, in which the data was kept in memory.

The other thing I’d seen was at my previous company, Conviva, which is an analytics company. It was very slow when we tried to do ad hoc queries; there was no way to do it. And again, keeping the data in memory was a solution, one solution. That’s kind of how we started. And you look at the trends. It’s obvious, right? On one hand you have more and more data, growing faster than Moore’s Law. It’s not going to fit on one machine, and therefore you need to use multiple machines. And then the only other question is: are the important datasets going to fit in memory, right? That’s the first question. And they are, because when we looked at the data from different Hadoop clusters, from Yahoo and Microsoft and others, we noticed that in a lot of cases, when you do queries and analytics, you very rarely run them on all the data. You run them on the most recent data: you want to see what happened yesterday, what happened last week, something like that.

So once you see that there are a lot of cases in which the data fits in memory, and memory is still growing quite quickly, you connect the dots, right? And then it’s about solving the problem. And the other thing, why academia and industry are related here, is that some people in academia push back and say, “You know, there’s a lot of engineering here. This is not what you should be doing in academia.”

But one thing I’ve always found very satisfying is that if you build a system in a new area, and that system is used by other people, then you are in the best position to understand the new problems in that area. Because people are going to use your system in different ways, right? And if you know the problem, you are also in a good position to solve it, right? So it directly helps you stay ahead in your research. Because otherwise, what is the choice? Where do you find the problems? Of course there are very good problems in theory which have not been solved for decades and so forth. But the other thing people do is go to Google and Microsoft and so forth, spend time there, and understand what problems they have to solve. That’s a little bit unsatisfying, right? Because you go to someone else to learn about their problems. And the question is, why don’t those people solve the problems themselves? Maybe they don’t solve them because they’re not as important at the given time, or maybe, for the right reasons, they’re too far in the future. But this is the thing: you have to focus on the problem, and you have to focus on the trends and the way they’re connected. Ideally you want to solve a problem which is going to be more important tomorrow than today.

Sonya Huang: What are the problems that you’re most excited about right now? Like Spark, Ray, like, what’s going to be the next Databricks or Anyscale?

Ion Stoica: So a few things. I still think there will be a lot of work on the software stack. Right now, we need to rethink most of it. Why? Because, again going back to the trends, the demands of these applications, in particular AI applications, are growing much quicker than the capabilities of a single processor or a single node, even if you consider accelerators, right? So on one hand this happens. On the other hand, the infrastructure becomes much more complex. You need to run the application not only on one node but on many nodes. It’s distributed. But it’s not only that, it’s becoming very heterogeneous, right? Because in order to bridge the gap between the demand and the capabilities of the hardware, people build accelerators. That’s why Nvidia is a trillion-dollar company, right?

But now the infrastructure becomes even more complex, right? It’s not only distributed, it’s heterogeneous. When we started Spark, it was homogeneous: all the nodes were the same, some storage, some CPUs, that sort of thing. But right now, look at the heterogeneity. You have Nvidia and you have many others. You have TPUs from Google. Every cloud is building its own chip. Now you have AMD and Intel saying that everything is about AI, right? So that’s kind of what happens.

So now you have a huge gap between the applications and this very complex infrastructure, which just keeps growing in complexity. And it’s not only about compute, it’s about networking. You have InfiniBand, you have RDMA and so forth, right? Huge heterogeneity. And the software stack has to abstract away that complexity for the developers. There is no way around it. On a single machine, you have the operating system to abstract away the complexity, right? That’s what makes it easy to develop all these applications. Now it’s extremely hard. So something is going to happen there.
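One concrete way to see what such an abstraction layer looks like is Ray, Ion’s own project: the developer declares what resources each task needs, and the scheduler places tasks across a heterogeneous cluster. A minimal sketch, assuming a cluster that actually has a GPU node; this is an illustration, not a description of Databricks’ stack.

```python
import ray

ray.init()

@ray.remote(num_cpus=2)
def preprocess(shard):
    # CPU-bound data preparation; the scheduler places it on any CPU node.
    return [x * 2 for x in shard]

@ray.remote(num_gpus=1)
def gpu_step(batch):
    # Declaring num_gpus=1 makes Ray route this task to a node with a free
    # GPU; on a CPU-only cluster this task would simply wait for resources.
    return sum(batch)

shards = [[1, 2], [3, 4], [5, 6]]
prepared = [preprocess.remote(s) for s in shards]       # runs in parallel
results = ray.get([gpu_step.remote(p) for p in prepared])
print(results)  # [6, 14, 22]
```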

I think the other one is about building applications; we’re talking about compound AI and things like that. Right now these AI applications, in particular large language model applications (everyone is talking about large language models), are assistants for humans, right? Humans are in the loop. If you are thinking about customer support, about Copilot, about Q&A, question answering, even summarization: you have an assistant helping a human to be much more productive. Which is a fantastic application. But they are not yet autonomous. And to go from having the human in the loop to being autonomous is a huge gap, right? Because to have something autonomous, you need something which is more deterministic, more reliable, far more accurate. And you need to get there, because if you don’t, you are still limited to having the human in the loop, and the human will become the bottleneck; there are only a certain number of people on the planet. So I think a lot of work will be about how to make at least some aspects of building large language model applications more like an engineering discipline, where you can much more easily build systems from smaller components.

Sonya Huang: Okay, so the next two Databricks will be distributed compute across heterogeneous hardware, and autonomous compound AI systems.

Stephanie Zhan: Noted.

The AI brain drain in academia

Sonya Huang: I’d love to tug on another thread that you mentioned just now of kind of how funding constraints drive what you’re working on. I’m curious what you think right now of—there’s been a lot spoken about kind of the almost brain drain in AI right now, because the universities just don’t have the funding that you could get if you went to work at one of the big research labs. How do you think about that? What do you think is ideal? Does operating under constraints force creativity for you? What do you make of all that?

Ion Stoica: Yeah, that’s a great question. And it’s true. I mean, it’s very challenging. When I came to the United States, I came to do my PhD; I graduated from Carnegie Mellon University, and I am originally from Romania. One thing people admired about the United States was how well the three-way partnership between academia, government and industry was working, right? And a prime example at that time was obviously the internet, which was a DARPA project. Academia had a huge impact, and industry as well; people were calling it the third industrial revolution.

When it comes to AI today, that kind of partnership is broken. In industry, every company is doing this research in silos, right? They don’t talk to each other as much. Academia, like you said, doesn’t have resources, and the government doesn’t invest as much. So I think that’s something to be very concerned about, right? And that’s why I’m also a big proponent of open-source models. The US and California will lose in the long term if this is not fixed.

So what happens now? Unfortunately, one thing that happens in academia is that, of course, there are some bigger universities and labs which can still afford to spend maybe $1 or $2 million to train some models. But still, they are not in the same league as OpenAI, right? OpenAI and Microsoft are talking about building data centers for about $100 billion. And I think the danger is that some academics are going to give up and only try to innovate around the edges. Now, you can still innovate in applications and things like that, and I think there’s a lot of innovation there, but clearly it will be harder to come up with new model architectures and innovate there. Harder, not impossible. Nothing is impossible.

So yes, it’s a challenge right now. There are people in a more fortunate position (maybe I’m one of them) who have access to resources outside academia, right? But you do want to level the playing field in order to maximize innovation. Innovation comes from everywhere. This being said, it’s true that scarcity has always spurred innovation in the past as well. But the concern about not having access to the resources to play the same game as industry is a real one.

Lightning Round

Stephanie Zhan: I’d love to switch gears into some rapid fire questions if you’re ready for it.

Ion Stoica: Yeah, go ahead.

Stephanie Zhan: Will anyone take meaningful market share from Nvidia over the next five years?

Ion Stoica: I think there will be, at a minimum because Nvidia will not want to be accused of monopoly behavior. So their market share has to decrease under some percentage, whatever, 70 or 80 percent. That will be one of the reasons. But if I have to name one company: of course there are the clouds, and for strategic reasons they are going to push their agendas and build their own chips, but probably the biggest competitor right now in terms of market share is Google with TPUs.

Stephanie Zhan: Yeah, yeah.

Ion Stoica: Probably that will continue for a while.

Sonya Huang: What’s one project or student in your lab right now that you’d want to highlight?

Ion Stoica: You know, I’m going to cheat here, because I think that both vLLM and Chatbot Arena have been tremendous. I’m not talking about SkyPilot, because the SkyPilot team started a company, so I’m leaving that aside. But I think vLLM has been amazing. It’s a one-year-old project and I’ve never seen such rapid growth. Of course, it’s also partly about AI; AI kind of compresses time, so there is something about that. And the other one is Chatbot Arena, because it’s fascinating to see the development in the space, to see where these different models are strong and where they are weaker. Having a front seat to watch the development of the ecosystem is fascinating.

Sonya Huang: Do you think the foundation models will commoditize?

Ion Stoica: The foundation models which are?

Sonya Huang: Like a GPT-4 or a Claude. Do you think there’s a market to be made in providing these models over time, or do you think it commoditizes?

Ion Stoica: I think people will continue to build larger and larger models. When it comes to serving, it looks like model distillation works quite well. By the way, distillation is where you train a smaller model on the outputs from the bigger model. There has been a lot of success with that. In some way, this shows how important the data is for training a model (going back to earlier in our conversation), because you have higher quality data from the big model and you use that to train the small model. And it’s working very well. So I think using distilled models to reduce the cost of inference is going to be a way forward. But yes, for advancing and pushing the frontier, so to speak—pun intended—you are still going to see a lot of effort on bigger and bigger models.
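To make the recipe concrete, here is a minimal sketch of the distillation Ion describes: the big model’s outputs become the training data for a small model. `teacher_generate` and `fine_tune` are hypothetical placeholders for whatever model APIs you actually use, not a real library.

```python
def teacher_generate(prompt: str) -> str:
    """Hypothetical placeholder: a call to the large, expensive model."""
    raise NotImplementedError

def fine_tune(student_model: str, examples: list[dict]) -> None:
    """Hypothetical placeholder: ordinary supervised fine-tuning of a small model."""
    raise NotImplementedError

prompts = [
    "Summarize this support ticket ...",
    "Extract the parties from this contract ...",
]

# The teacher's high-quality outputs are exactly the "data" advantage Ion
# mentions: they become the labels the cheap student model is trained on.
dataset = [{"prompt": p, "completion": teacher_generate(p)} for p in prompts]
fine_tune("small-student-model", dataset)
```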

Stephanie Zhan: What are you most excited to see in the world of AI in one, five and ten years?

Ion Stoica: What I’m most excited about AI?

Stephanie Zhan: Yeah.

Ion Stoica: Look, there is no question it’s transformational, right? It will change a lot of things, everything maybe. What I’m most excited about is how you make these AI systems more predictable, more accurate, verifiable. How can you debug these systems? All of this is in the realm of software engineering-like techniques. This is what I think is exciting.

Stephanie Zhan: What advice do you have for founders building in AI?

Ion Stoica: It’s the same thing. Focus on the problem, don’t focus on the hype. Hype is emotional, it’s not reliable. Just look at the facts, right? Look at the problem, try to understand the problem, and try to be truthful to yourself. If you build an application, it’s about production, not about the demo; it’s beyond the demo. Of course the demos are important, don’t get me wrong, they are very important. But production: this is the mindset you have to have.

And the dangerous thing is there’s so much hype, and you think you can solve everything and do everything in a certain number of years. But for now, just focus on exactly what problems you are going to solve. Convince yourself it’s a good problem, convince yourself that you can solve it, or at least that you can build an MVP, right? You can solve a smaller version of that problem which is still very valuable for your customers. That’s what I would say. Yeah, no silver bullet.

Stephanie Zhan: Amazing. Thank you so much, Ion, for joining us today. I’ve loved hearing a lot about your own thinking and reasoning behind your own journey, a lot of the thought process behind finding the right problem to solve, building the right systems to actually be in a position to understand the best problems, and then applying that even to many of the bold decisions that you’ve had to make in founding multiple companies from research into commercialization, and the incredible success of Databricks today. Thank you.

Ion Stoica: Thank you for having me.

Mentioned in this episode

  • Spark: The open source platform for data engineering that Databricks was originally based on.
  • Ray: Open source framework that manages, executes and optimizes compute needs across AI workloads, now productized through Anyscale.
  • MosaicML: Generative AI startup founded by Naveen Rao that Databricks acquired in 2023.
  • Unity Catalog: Data and AI governance solution from Databricks.
  • CIB Berkeley: Multi-strategy hedge fund at UC Berkeley that commercializes research in the UC system.
  • Hadoop: A long-time leading platform for large-scale distributed computing.
  • vLLM and Chatbot Arena: Two of Ion’s students’ projects that he wanted to highlight.