The Rise of Generative Media: fal’s Bet on Video, Infrastructure, and Speed
Episode 72


fal is building the infrastructure layer for the generative media boom. In this episode, founders Gorkem Yurtseven, Burkay Gur, and Head of Engineering Batuhan Taskaya explain why video models present a completely different optimization problem than LLMs, one that is compute-bound, architecturally volatile, and changing every 30 days. The team also shares what they’re seeing from the demand side: AI-native studios, personalized education, programmatic advertising, and early engagement from Hollywood.


Summary

How AI will transform media, education, and entertainment:

The generative video market is rapidly expanding and highly differentiated from LLMs: Video models demand far more compute, have different optimization bottlenecks, and address industry use cases that are only now emerging as quality and predictability improve.

Success in generative media relies on deep technical focus and infrastructure innovation: fal’s edge comes from relentless optimization at both the inference and infrastructure levels, obsessively tuning kernels and scheduling workloads across a broad, distributed GPU fleet.

Model variety and turnover are defining characteristics of the space: Unlike LLMs, video and image models have a thriving long tail; the most popular models have a half-life of just 30 days, and workflows often chain together many specialized models for specific creative goals.

Open source and ecosystem effects matter more in visual domains: Open-sourcing foundational models jumpstarted vibrant ecosystems, enabling fine-tuning, adaptation, and a combinatorial explosion of aesthetics—features less pronounced in text model ecosystems.

Demand is surging across new and traditional media, with education poised for transformation: Use cases include AI-native studios, dynamic security training, personalized ads, and especially education—where generative video could unlock highly effective, scalable learning experiences once model quality matures.

Transcript

Introduction

Gorkem Yurtseven: We recently had our first generative media conference and Jeffrey Katzenberg, former CEO of DreamWorks, was there. And he made a comparison. He said this is playing out exactly like animation when it first came out: people revolted against it. It was all hand drawn before that. Computer graphics was new, and there was a lot of rebellion against computer-driven animation. And something very similar is happening with AI right now. But there’s no way of stopping technology. It’s just going to happen. You’re either going to be part of it or not.

Sonya Huang: In this episode, we sit down with a team from fal, the developer platform and infrastructure powering generative video at scale. fal is a place that developers can go to access more than 600 generative media models simultaneously, from OpenAI Sora and Google Veo to open weight models like Kling. We’ll discuss why video models present fundamentally different optimization challenges than LLMs, why the open-source ecosystem for video has a thriving long tail in ways that text models never did, and why the top video models have a half life of just 30 days. The team also shares insights from the demand side of the video model equation. We discuss what’s happening in the app layer, from AI-native studios to personalized education, what’s happening in Hollywood and more. Enjoy the show.

Main conversation

Sonya Huang: Burkay, Gorkem, Batuhan, thank you so much for joining us today. I want to start with the problem space that you decided to tackle. So fal is a developer API and platform for generative video and image models. Video is massive, obviously. It is more than 80 percent of the internet’s bandwidth, and it follows that generative video is going to be similarly massive. But there are not that many companies focused on this problem. Why do you think that is?

Gorkem Yurtseven: Yeah. In a way, generative image and then video was an overlooked market in this current phase of AI, in my opinion for two reasons. Number one, there wasn’t a very clear industry use case that people were going after. There wasn’t vibe coding that automates software engineering, or search, which the LLM market seems to be going after, or customer support, anything like that. Also, number two, the investment on the research side wasn’t as big three years ago, and then it ramped up a little bit slower than LLMs, but still considerably since then. And now the models are much more capable, much more useful, and there are real industry use cases. Three years ago it felt like a toy use case: this was just going to be for fun on the side, and it was going to be a small market in the end. And now we can see that it’s going to be a massive market with very unique use cases and customers compared to the LLM market.

Burkay Gur: Like, if you actually go back to as we were experiencing it, I think that was an interesting time. We were working on some Python compute infrastructure, and then these models like DALL-E 2 had just come out. And then soon after that, ChatGPT had come out and then Llama had come out. And we were just like—initially, we didn’t know that the image and video market was going to get that big. We were actually just curious about running image models much faster. That was our initial entry point. And then we saw the initial growth. We had a few customers and they were growing really fast. We were like, “What the heck is going on?” And then a few customers later, we actually thought, “Hey, we should double down here.”

And around that time also the other thing that was happening was people were over-indexed on language models. This story of AGI was being told, and that attracted all the dollars, that attracted all the talent. So everyone was just working on that, where we thought we had something niche growing fast, don’t tell anyone. And then we just, like, started focusing on that. And soon after, as we got more familiar with the models, we thought the—I remember, I think we changed our website copy to say “generative media,” like generative media platform. And then it was only, like, two or three months after that Sora was announced. So we were definitely ahead, but we really saw the whole future kind of coming with better image models, video models, et cetera. So yeah, we made this early bet.

Sonya Huang: I mean, you guys have a front row seat to the sorts of new experiences people are building. I think the market’s only going to expand from the media market that we know today.

Burkay Gur: Yeah, absolutely. I think, like, I’ll quote a Karpathy tweet. You know, no good podcast without it. He did one, like, recently where he was talking about, like, why he’s excited about the, you know, media models. And one of the things he said was that—like, he also mentioned that people are visual and we have so much more video than text, like wall of text. And he was saying, like—he was making a point around education and a lot of the content you consume just to learn things. I think right now the model quality is just like relatively so much worse than what it can be, where you could actually have—you know, I do a lot of learning on ChatGPT, but it’s through text. But if it actually rendered a video where it could compress a concept instead of 10,000 characters, if it could do it in 15 seconds, it’d be so much better. I think, like, there’s sort of like the quality bar where it’s going to go up, and once we have that we’re going to have even more penetration. So it’s really a function of the quality right now. And we’re just like in the very early beginnings.

Sonya Huang: Totally.

Gorkem Yurtseven: The education market is almost untouched right now by video generation. There’s so much potential there, and it’s just waiting for the quality and the predictability to get there. And I think it’s going to have a lot of potential.

Sonya Huang: Totally. I mean, you guys sent me that generative video Bible app.

Gorkem Yurtseven: [laughs]

Sonya Huang: I think it’s a much better way to learn some of the lessons from the Bible. And, you know, you’re capturing consumers’ attention right where they are. I agree with you. We’re just at the beginning. So fal is an infrastructure company, and so we’re going to structure today’s interview—I love infrastructure companies—in terms of the technical layer cake. So we’re going to start from the core inference engine, compilers and kernels that you’ve built. We’re going to go up to the model layer, and then the workflows, and then end with some observations on the markets and what people are building. Sound good?

Gorkem Yurtseven: Sounds good. Sounds exciting.

Sonya Huang: Okay, let’s do it. The inference engine. Batuhan, how old are you?

Batuhan Taskaya: Twenty-two.

Sonya Huang: You’re 22 years old? Okay. Say a word on your background. I think it’s super badass, and makes complete sense why this company is so hardcore at AI.

Batuhan Taskaya: I started working on compilers when I was 14, so in a way I have a lot of experience on that front. It’s not just that, but I started working on open-source projects. So my first contributions were around tooling around the Python language. And then I started to slowly contribute back to the Python language—core compiler, core parser and the core interpreter itself, and became, like, one of the core maintainers of it. I think at the time I was the youngest core maintainer of the language. And this kind of gave me, like, a unique appreciation of compilers and how flexible they are.

So when we first started working on serving these image models at fal, the main idea was, okay, there are these, like, three different image models, three different architectures, but this is surely going to explode. There’s upscalers, there’s going to be video models—we were predicting that. And we didn’t want to go optimize a single model, put our eggs into a single basket, and then go become invalidated when the next model comes.

So we started building this inference engine, which is a tracing compiler: it traces the execution and essentially tries to find common patterns that fit within the templated kernels that we write. Our bread and butter is—we have a 10-person performance team that spends all its effort writing kernels that are, like, 95 percent there, but generalized with templates. So we trace the execution of a model and find common patterns where we can swap these semi-generic templated kernels for specialized kernels at runtime, and optimize the performance of these models. And we found this technique to yield superior results to pretty much anything that’s out there in the market. And this led us to claim, like, the number one spot on performance, on all the benchmarks.
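To make the idea concrete, here is a minimal toy sketch of runtime specialization against templated kernels. It is purely illustrative: the op names, the KERNEL_TEMPLATES registry, and the kernel strings it emits are hypothetical, not fal's actual compiler.

```python
# Toy sketch only, not fal's engine: take a traced list of ops from one forward
# pass and swap known patterns for pre-written templated kernels, specializing
# them with the concrete shapes captured at trace time.
from dataclasses import dataclass
from typing import List

@dataclass
class Op:
    name: str     # e.g. "linear", "gelu", "attention"
    params: dict  # concrete shapes/dtypes captured while tracing one forward pass

# Hypothetical template registry: an op-name pattern maps to a factory that
# stamps out a specialized kernel once concrete shapes are known.
KERNEL_TEMPLATES = {
    ("linear", "gelu"): lambda p: f"fused_linear_gelu<{p['in']}x{p['out']}>",
    ("attention",):     lambda p: f"flash_attention<heads={p['heads']},dim={p['dim']}>",
}

def specialize(traced_ops: List[Op]) -> List[str]:
    """Greedily match traced ops against templates; fall back to generic kernels."""
    plan, i = [], 0
    while i < len(traced_ops):
        for pattern, make_kernel in KERNEL_TEMPLATES.items():
            window = tuple(op.name for op in traced_ops[i:i + len(pattern)])
            if window == pattern:
                plan.append(make_kernel(traced_ops[i].params))
                i += len(pattern)
                break
        else:
            plan.append(f"generic_{traced_ops[i].name}")
            i += 1
    return plan

# Example: a traced (linear -> gelu -> attention) block becomes two fused kernels.
trace = [
    Op("linear", {"in": 4096, "out": 16384}),
    Op("gelu", {}),
    Op("attention", {"heads": 24, "dim": 128}),
]
print(specialize(trace))
# ['fused_linear_gelu<4096x16384>', 'flash_attention<heads=24,dim=128>']
```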

And another big thing about this is we specialize in these kinds of kernel-level, mathematically correct and sound abstractions that let us maintain the same quality from these models, which is a very high bar when you’re in the media industry and you really care about the output that you’re getting.

Sonya Huang: What’s different between optimizing a diffusion model versus an autoregressive LLM?

Batuhan Taskaya: In autoregressive LLMs, your bottleneck is how fast you can move all those giant weights from memory to SRAM. Because you have a 600-billion parameter model, and you’re trying to predict the next token: you’re doing attention over all the tokens that came before, but only producing a couple of tokens at a time. In diffusion models, you’re trying to denoise, like, thousands, tens of thousands of tokens for a video at the same time, doing attention over all of it. So you’re essentially saturating all the compute throughput of these GPUs. You’re not necessarily bound on memory bandwidth, but the compute units you’re using are fully saturated. So you’re trying to find better ways to execute around the GPU. This could be writing more efficient kernels, or this could be overlapping [inaudible] that you do. Essentially you’re trying to use all of the power of the GPU and leverage it in a way that gets you all the capabilities.

Sonya Huang: So it’s a different binding constraint. It’s on the compute versus the memory. And what’s the intuition for why LLMs are relatively memory constrained and why video models are by comparison relatively compute constrained, but not as large in terms of just sheer number of parameters?

Batuhan Taskaya: I think it’s a scaling issue, right? If you scaled these video models to 600 billion parameters with the same dense architecture, you’re going to have to do attention with all those tokens—like, let’s say a single video is 100,000 tokens. And you do this denoising step 50 times, and every one of those 50 times you do attention over all 100,000 tokens. It’s insanely, insanely expensive. So I think the constraint there is just how fast you can do the inference. And the same applies to LLMs at larger batch sizes, but at the traffic patterns that people actually run, the batch sizes are not that big and you’re mainly constrained by memory bandwidth. So people do optimizations like speculative decoding and other techniques to reduce that overhead.
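To put rough numbers on why this saturates compute, here is a back-of-the-envelope calculation using the figures from the conversation plus a hypothetical DiT shape. The layer count and hidden size are assumptions for illustration, not any specific model's.

```python
# Back-of-envelope (not a measurement): why video denoising is compute-bound.
tokens = 100_000        # latent tokens for one clip, per Batuhan's example
steps = 50              # denoising steps
layers = 40             # assumed transformer depth
hidden = 3072           # assumed model width

# The quadratic attention term alone is roughly 4 * n^2 * d FLOPs per layer
# (QK^T scores and the attention-weighted values each cost about 2 * n^2 * d).
attn_flops_per_layer = 4 * tokens**2 * hidden
total_flops = attn_flops_per_layer * layers * steps

print(f"{total_flops / 1e15:.0f} PFLOPs per clip, attention term only")  # ~246
# Even at more than 1 PFLOP/s of sustained compute per GPU, that is minutes of
# pure arithmetic, none of it spent waiting on weight loads the way
# single-token LLM decoding does.
```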

Sonya Huang: Yeah. What exactly goes into being at the top of the leaderboard in terms of performance? Because I would imagine there’s other teams that also have very smart people. And, you know, this is my Olympics, and so what exactly goes into—I imagine people have, like, very similar ideas on the techniques and different optimizations they can do.

Batuhan Taskaya: I don’t think anyone cares about it as much as us. We are literally obsessed with generative media. We are literally obsessed with these models. We have a team that’s just focusing on this. So far, it seems like from Nvidia to other inference players, everyone is super obsessed with language models, everyone’s trying to get one more token per second on, like, DeepSeek benchmarks, whatever.

And we are in a different lane. We have competitors, but no one close to us, because I think we assembled one of the best teams. We found the best way to optimize these generative models. And we just focus on this. This is purely a focus thing, right? Like, at the end of the day, you’re constrained by the hardware, there’s nothing unique about it. But we’re just, like, three months ahead, six months ahead. Like, when we benchmark the latest version of Torch against our inference engine from a year ago, we are clearly underperforming, because Torch caught up. The same thing is going to happen with other players. The lead that you can maintain is three months, six months ahead at most. The thing that matters is just focus. If you focus on it, if you purely put all your energy into it, I think it’s very hard to get out-competed by others.

Gorkem Yurtseven: Because models are slightly changing each month, each release. It’s still the same general architecture, but there are slight differences where we can go in and optimize for what’s different. And no one else is paying that much attention to it. Also, hardware is changing as well. We were able to adapt to B200s earlier than anyone else, and we were able to run video models much faster, basically throughout the year, because of that obsession with running video models on the latest hardware.

Sonya Huang: Yeah, got it. What are the hardest technical problems that you think you’re solving?

Gorkem Yurtseven: So one thing people don’t appreciate as much is that we are running 600 different models at the same time. We have to be so good at running them that we run any single one of them better than someone who only runs that single model. Because when a foundational lab is running models, maybe they have a single version of the model, maybe they have, like, a couple other different versions, and that’s all they care about. We have to be better than them at running those models, and we have to be doing all 600 at the same time.

So on top of the inference optimizations that happen on the GPU, a lot of optimizations at the infrastructure level need to happen. We need to manage the GPU cluster in a way that’s efficient, loading and unloading these models at the right times. We need to route traffic to the right GPUs that have a warm cache of these models. We need to be smart about choosing the right kinds of machines, which kinds of chips are running which kinds of models, and customer traffic is changing all the time, so we need to adapt to that. So on top of the inference engine, the overall infrastructure is also a really, really hard beast to manage. And so far we’ve done an incredible job at that. Would you add anything to that?

Batuhan Taskaya: I think that’s a pretty fair explanation of what we do. Like, I call this distributed supercomputing. I don’t know why people don’t like that name. But, like, you know …

Sonya Huang: I like it. [laughs]

Batuhan Taskaya: But, like, the idea is we were at, like, 28—this was a month ago. Now we are probably at 35 different data centers, and you have these heterogeneous pools of compute split across them, with their own different specs, different networking, whatever, and you’re trying to schedule workloads as if it’s a homogeneous cluster that you got from a hyperscaler. It doesn’t work like that. So we spent the last three years building the abstractions over it, from our own orchestrator to building our own CDN. Like, we went back to the fundamentals of the web and built our own CDN service, deploying racks to colos, just routing traffic. So we built all these technologies to essentially make sure that we can tap into capacity wherever it is and schedule our workloads, which is very different than a traditional enterprise LLM-usage pattern. The use cases that we have are so much more spread out, so much more consumer facing. And when you consider how scarce GPU capacity is, there is a lot of investment going into making sure we can tap into it.
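As a concrete illustration of the kind of scheduling decision described above (warm model caches, heterogeneous machines), here is a toy routing sketch. It is an assumption-laden simplification, not fal's orchestrator: the two-model VRAM limit, the data fields, and the eviction policy are all made up for illustration.

```python
# Toy scheduler sketch: prefer a GPU that already has the requested model's
# weights resident ("warm"); otherwise pick the least-loaded GPU and evict its
# stalest model to make room.
import time
from dataclasses import dataclass, field

@dataclass
class Gpu:
    gpu_id: str
    resident: dict = field(default_factory=dict)  # model name -> last-used timestamp
    queue_depth: int = 0                          # pending requests, proxy for load

def route(gpus: list[Gpu], model: str) -> Gpu:
    warm = [g for g in gpus if model in g.resident]
    if warm:
        # Warm hit: skip the multi-gigabyte weight load entirely.
        chosen = min(warm, key=lambda g: g.queue_depth)
    else:
        # Cold start: least-loaded GPU, evicting the least-recently-used model
        # if we pretend only two models fit in VRAM at once.
        chosen = min(gpus, key=lambda g: g.queue_depth)
        if len(chosen.resident) >= 2:
            stalest = min(chosen.resident, key=chosen.resident.get)
            del chosen.resident[stalest]
    chosen.resident[model] = time.time()
    chosen.queue_depth += 1
    return chosen

fleet = [Gpu("h100-a"), Gpu("h100-b"), Gpu("b200-a")]
print(route(fleet, "flux-dev").gpu_id)   # cold start on some GPU
print(route(fleet, "flux-dev").gpu_id)   # warm hit on the same GPU
```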

Sonya Huang: Yeah. You mentioned hyperscalers, and I hear distributed compute and I hear managing giant clusters, and I naturally think that’s somewhere where hyperscalers should have the incumbent advantage. Why do you think that you’ve been able to out-execute them so far on the core engine?

Batuhan Taskaya: There’s two things about the core engine, right? There’s the inference part, where none of the hyperscalers have any expertise. This is a net new field. Inference optimization has only been happening for the past three years, so it’s a brand new lane where we have been outcompeting anyone in our field. I think that’s pretty much an answer of its own. And the second one is infrastructure. I think right now, hyperscalers are very busy with their traditional pattern of oh, we have this data center capacity, we’ll just deploy GPUs and we don’t care about the rest. This has been changing recently. You know, even, like, Microsoft is going and buying from [inaudible]. There’s an interesting pattern happening, because the demand and the growth of GPUs doesn’t fit the growth patterns these hyperscalers expect. So I think at this stage, not even hyperscalers have that big of an advantage of scale, because they’re going and buying GPUs from neoclouds. Like, the tables have turned a bit.

Burkay Gur: Yeah, it almost helps to be, like, slightly earlier in the company journey, right? Like, if you’re a public company, you also have to kind of abide by what the market’s expecting of you. The other thing is that there’s a huge price discrepancy between hyperscalers and neoclouds, right? It’s maybe sometimes 2x, 3x more expensive to go through hyperscalers.

Sonya Huang: What’s driving that?

Burkay Gur: Well, I think one is like market pressure, right? And also there’s added kind of operational expenses that hyperscalers have for, like, having, you know, better—they just have a better service, better uptime and better SLAs. And all of these things add up. And then on top of that there’s kind of an established cloud margin, right? And, you know, the market expects the cloud margin to be a certain level. Whereas, like, if you have a three-year-old neocloud, you know, you’re a private company, maybe you don’t have as much pressure. And, like, assuming infinite demand and limited capacity, you can actually—hyperscalers can keep their prices high, and they will fill out the capacity and also get slightly better economics. Whereas neoclouds compete over the whole, like, infinite demand and that pushes the prices down.

Sonya Huang: Perfect price competition. What does it take to run image versus video models well? Like, you guys started the company around the Stable Diffusion moment. The field was mostly image at the time. How does running video models compare to image?

Gorkem Yurtseven: Let’s actually do text, image, video. Let’s compare all three of them. So for, let’s say, a SOTA LLM, like DeepSeek or something where we know the numbers, running a single prompt is, like, 200 tokens. Let’s say that takes 1x of teraflops—I think it’s tens of teraflops, but let’s call that unit one. One image is around 100x of that. And if you are doing a five-second video at 24 FPS, that is around 120 frames, so roughly another 100x from one image. So you’re already at 100x of 100x, or 10,000x, for a standard-definition video compared to a single 200-token LLM request. And if you want to do 4K, that’s another 10x on top. So it is a lot more compute intensive in terms of the amount of flops you are doing.
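Written out as arithmetic (the unit is the relative one Gorkem uses, with "1" standing in for the tens of teraflops of a roughly 200-token LLM request):

```python
# Gorkem's rough scaling, written out in relative units.
llm_request = 1                # ~200-token LLM completion, the baseline unit
image = 100 * llm_request      # one generated image is roughly 100x that
frames = 5 * 24                # a 5-second clip at 24 FPS is 120 frames
sd_video = 100 * image         # ~120 frames rounds to another ~100x -> 10,000x
uhd_4k_video = 10 * sd_video   # 4K adds roughly another 10x -> 100,000x

print(image, sd_video, uhd_4k_video)   # 100 10000 100000
```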

Batuhan Taskaya: Yeah. In general, when we started with image, the infrastructure was relatively easier to do, because it takes three sec—or it took fifteen seconds back in the day to generate an image. You don’t necessarily need to shave the 50ms, 100ms of overhead you have overall in the system. And then when we went to video it’s, like, even easier, because it takes, like, 20 seconds, 30 seconds to generate the video.

The shift that has been happening in the past couple months is real-time video, where you need to stream 24 FPS video over a network link from these GPUs. That’s where we actually spend some of our time now. We started down this path with speech-to-speech models a year ago. We started optimizing them, and we were able to reduce the latency of our system with a globally distributed GPU fleet: when you send a request, we route it to the closest GPU, minimize our own overhead, and then do stuff like picking the best run, stuff like that. So we are now applying those same optimizations to real-time video, and we actually see really good, interesting demand there, where people want to experience this stuff as they type, as they prompt. And that’s where some of the infrastructure technical challenges differ from traditionally running image and video models. Because image and video are similar-ish, you know, just more compute expensive. But you actually need to care about infrastructure stuff when you get to less than a second of generation time for some of these models.

Burkay Gur: Yeah. Another interesting thing is, with image models especially, you were able to run them on a single GPU. The parameter count is actually much smaller, so that makes it a little bit easier for us, as opposed to LLMs. And then with video, parameter count is going up. Right now, for the open-source ones, I think we’re around, I don’t know, 30 billion parameters, whereas we hear rumors about GPT-4 being in the trillions, GPT-5 maybe more. So on the flip side it’s a little bit easier, but it doesn’t mean video models are not going to grow, right? There are rumors around numbers for Veo, numbers around Sora. So there’s also an increase in parameter count, so you’re going to have to use more distributed computing. But if you’re just on one node or eight nodes, you kind of have a slight advantage.

Sonya Huang: Yeah, totally. Okay, let’s pop one layer up the stack to the models.

Gorkem Yurtseven: Let’s do it.

Sonya Huang: So one thing I think people don’t fully appreciate about the media space, and you mentioned this, you alluded to this before, is that there’s a very, very long tail of models that are actually used in practice. And so I was hoping you’d give people a sense of on your platform, how many models are people actively using, how’s it distributed and, like, why do you think there’s such a long tail of models being used compared to the LLM space?

Gorkem Yurtseven: This is actually one of the things I would say people got wrong three years ago. I mean, the jury is still out, but right after ChatGPT, people started talking about omni models: there were going to be these giant models that could generate video, audio, image and code, text, every type of token. This might still happen, I think, but it’s more clear that you are better off if you optimize for a certain type of output. This is true even for code generation, and definitely true for image or video output.

So that’s one thing. When we were pitching three years ago, like, that’s one feedback we got. There’s going to be omni models, and there’s going to be a single way of running these. It’s going to be hard to create an edge on the modality. But turns out it’s not true, and it actually makes sense to have a technical edge on the modality. And this is one of the reasons why there is also a variety of models, because still the best upscaling model is just doing upscaling. And the best image editing model, even the best text-to-image model is different from the image editing model. So all these special tasks require their own model. It might be the similar model family or similar architecture, but at the end of the day it has its own weights that needs to be deployed independently, and that creates the variety in the ecosystem.

Batuhan Taskaya: I think this also applies to language models, where even in the same modality there are different families of models with different tastes, different characteristics, different personas. And this still happens with language models: the code that Claude writes is very different than the code GPT-5 writes, right? And we see this happening here too, but the thing here is there are these three, four different personas on top of different categories: upscaling, editing, video, text-to-video, stuff like that. So it gets you close to 50 models that are active at any point in time. And then you have a very long tail of models that people still choose because they might like the persona of that model better.

Sonya Huang: Yeah, totally. Speaking of model personalities, what are some of the most popular models on your platform? What do you think are the personalities of them?

Gorkem Yurtseven: So one thing that’s been true since the beginning, the popular models change all the time. So there’s always new releases from different labs that take over the other, and it’s always a moving target. But that being said, there’s two types of models usually preferred by our customers. Usually there’s one big expensive model that has the best quality on video generation. This could be Veo, this could be Kling. This could be Sora. And then there’s usually a workhorse model which is cheaper, smaller, but good enough, and people usually use that at higher volumes. I would say this has been true for the past almost two years that there’s an expensive, high-quality model that keeps changing. There’s a cheaper, good enough model that keeps changing. But overall this has been constant.

Sonya Huang: And is the workhorse model for prototyping, and then you run it through the big expensive model for the final product? Or what do people use the workhorse for?

Gorkem Yurtseven: It’s for higher volume use cases. And depending on the application you are building, you might encourage different—lots of variations of the same output maybe. But it’s very application specific, I would say.

Burkay Gur: Yeah. There’s also another dimension, I think, that’s kind of happening in real time right now, which is based on the different use case you want to use the model for. So when OpenAI released GPT image editing, that model had just superior text generation and editing capabilities. And for things that require a lot of text, people started going and choosing that model versus the other models. So it also tends to correlate with different capabilities models are bringing and also what they’re good at. So Kling, for example, people really like it for visual effects types of workflows, because they had that kind of data in their data set, as opposed to, you know, some other models. For example, Seedance is very good at detailed textures and artistic diversity, things like that. So it’s really a matter of this use case dimension that models excel at.

Batuhan Taskaya: An interesting metric that we saw in Q2 and Q3 was that the half-life of a top-five model was 30 days.

Sonya Huang: Wow!

Batuhan Taskaya: It’s very, very interesting to me how these models are continuously shifting. The top five models are continuously shifting.

Sonya Huang: Tough depreciation schedule for the model providers.

Gorkem Yurtseven: Hopefully they are building on top of the work that they’ve already done, so it’s additive in the end. But yeah.

Sonya Huang: Yeah, I’m teasing. And the model space is probably in a more turbulent state right now than what the end state will be. What do you guys think is the most underrated model? Like, what’s your personal favorite?

Gorkem Yurtseven: I usually like the Kling models for video. But this has kind of been changing because they don’t have sound. For sound we have Veo 3 and Sora; they are the only ones. A lot of people are working on it, so I would love to have more variety there as well.

Batuhan Taskaya: For image models, I like ReV’s model, and Flux still holds very nostalgic value for me, even though it’s been a year. I still go back to Flux. There are variations of Flux models now that I like.

Burkay Gur: I’ll go with Midjourney, which is not on fal. It’s not available on API. I just like how they navigated the space, I think, is very interesting. Like, they kind of brought this, like, photorealism which was—you know, that was like a very big deal at the time. You know, no model could do it. And then now they’re more like this artsy model, right? Like, photorealism is kind of cracked and, like, no one cares about it. So now they have this, like, niche, very artistic visuals, which is very cool.

Sonya Huang: Yeah. I’d love to chat about the marketplace dynamics a little bit. So I understand your business is a little bit of a marketplace where you aggregate developers on one side of the market—that’s the demand side—and you aggregate model vendors on the other side of the market—that’s the supply side. And the model vendors are both proprietary APIs, model labs that view you as a distribution partner, and then also open models that you host and run yourselves. And so maybe talk a little bit about for the closed model providers, you have partnerships with OpenAI, Sora, with DeepMind on Veo. What’s in it for them? Why did they choose to partner with you?

Gorkem Yurtseven: We were one of the first platforms that accumulated the developer love. And following from that, these developers work at big companies, so they started working with us. And we really built the platform for simplicity and being able to get going really fast. And because of the thing Batuhan mentioned, that the half-life of these models is really short, people usually work with many different models at the same time. So we were able to claim that we have this big developer base that loves the platform, isn’t tied to any single model, and is here for the platform.

And model research labs see this, and they use the platform as a distribution channel and tap into the developer ecosystem that we built. On the other side, this helps us with the next model provider, because they see all the developers, they want to be on the platform as well, which attracts more developers on the platform and creates a very nice positive flywheel for us.

Sonya Huang: Yeah, it very much is a marketplace business, and for developers it’s a single choke point to be able to access multiple model vendors. And to your point on, like, the model space is changing so quickly, I think they really do value that choice.

Gorkem Yurtseven: Yeah, we call it marketplace-plus-plus, because we get to provide infrastructure to the research labs as well, also to the developers. So there’s additional benefits. It ties into the flywheel effect that we are creating. Marketplace plus other services next to it.

Sonya Huang: How do you position yourselves to get, in some cases, day-zero launch access, sometimes exclusive launch access, to models like Kling and MiniMax? How have you done that?

Gorkem Yurtseven: Yeah. Throughout the last two years, we were able to build a very robust marketing machine as well. And this is our connection point with the developers who are on the platform. Every time we release something, this creates another opportunity for us to introduce a new capability, introduce a new model. And model developers also see that. And we usually do co-marketing together. And part of that co-marketing, we get exclusive release access for a certain period of time, sometimes forever. We have a couple competitors that are on the smaller side, so model developers want to work with the biggest platform out there, and increasingly that platform is ours, and we get to have these exclusive benefits with the model providers.

Sonya Huang: That’s awesome. Why do you think it is that the open source model ecosystem has been so vibrant for video models? You know, it almost feels like the text models are just consistently a generation behind. Whereas in video, you know, there’s so much that’s happening in the open source realm.

Gorkem Yurtseven: Video and also image editing as well.

Sonya Huang: Why do you think that is?

Gorkem Yurtseven: It started with Stability. They first open-sourced Stable Diffusion and got insane adoption. And almost the same team then started Black Forest Labs. And they knew the power of open source, how it helps them create the ecosystem. And with image and media models, the ecosystem actually matters. When developers are training LoRAs, they are building adapters, they are building on top of your model. It really brings free marketing, but it also creates stickiness for the developer: there are still people who are using Stable Diffusion models because they like that ecosystem, because it was so open.

Sonya Huang: Yeah.

Gorkem Yurtseven: So the Flux team saw this from their experience at Stability, and they had a very smart strategy of having at least some models that are open source and some that are closed source. And a lot of the video model providers that came after are following the same playbook, because you can have a very robust ecosystem, and it gives you a lot of advantages in terms of marketing, in terms of developer love. And I think it’s going to keep going like this.

Sonya Huang: Yeah, totally.

Burkay Gur: I want to add on to that that the domain is also very interesting. Like, I think in the visual domain, the ecosystem actually matters more. Like, I think when Llama 2 first came out, there were, like, many fine-tunes out there. But if you actually downloaded it and started using one, like, you can’t …

Gorkem Yurtseven: You can’t tell it’s a fine tune.

Burkay Gur: You can’t tell the difference. Like, you can’t really—you know, if you’re using a, I don’t know, like a ControlNet—like, the concept doesn’t even exist. Language models are a lot more general, more generalized, so you can’t really see the difference if you were to actually fine-tune one. So it kind of just ends up being very monolithic. Whereas in the visual realm, any small adjustment you make to the model can actually have huge implications, right? And so it’s just a very fertile ground for a lot of customization.

Sonya Huang: Yeah. I mean, speaking of Midjourney, David Holz, one of his quotes that I like is, you know, he’s curating the aesthetic space with Midjourney. And I very much think you just have this combinatorial explosion of styles aesthetically. I think that’s the reason why some of the—I think some of the models on your platform are fine tunes of other models, right?

Burkay Gur: Yes, yes. And the thing is, even if you add a lot of diversity of aesthetics, if you actually train on everything, if you train on too many, you may not be able to get the exact aesthetic you want. There are so many times you want an exact aesthetic, and then you may still have to fine-tune the model to get exactly the output you want. Whereas with LLMs, that’s not really how you operate. You don’t exactly want a particular outcome. It’s a different problem. This is a lot more subjective, so you kind of have to do these post-training things on top of the models.

Sora is another good example. Like, Sora 2 is very fine tuned on, like, social-looking stuff, right? And so, you know, you could probably—you can have tens of different styles, and you still want to probably push the model towards that direction with post training.

Sonya Huang: Yeah, absolutely.

Gorkem Yurtseven: It all depends on the use case, too. Like a customer support chatbot does not need personality. You want it to be as vanilla as possible. But we are talking about filmmakers, marketing teams. They all want to add the personality of their style or their brand, so they want to have greater control over the outputs. Whereas maybe in LLMs that’s not necessarily true all the time. If you have an agent, if you are doing code generation, there’s no equivalent of style and personality.

Sonya Huang: Yeah. Okay, that’s a good segue for us to go one more layer up the stack. Let’s go to workflows. What does the average developer workflow inside fal look like today?

Gorkem Yurtseven: They are using many different models, first of all. I looked this up recently: our top 100 customers are using 14 different models at the same time. These are sometimes chained to each other, so one text-to-image model, one upscaler, one image-to-video model, all part of the same workflow, or a more complicated combination of these as part of the same workflow, or different models used for different use cases. I think that’s the most interesting part, the variety of the models people use on the platform. We do have a no-code workflow builder as well. We built this in collaboration with Shopify, and it’s usually very good for their PMs, their marketing teams, the non-technical members of the team who are playing with these models. It’s really good for trying different things, really good for comparing different models, but eventually this makes it into the product as well; you can reach the workflow through an API. It’s been very popular recently, and more and more people in a typical software engineering organization are now interested in image and video models. So the users of the platform have been increasing.
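As a rough illustration of that kind of chaining, here is a minimal sketch using fal's Python client (pip install fal-client). The specific endpoint IDs, arguments, and result fields below are assumptions for illustration; the real model IDs and response schemas on the platform may differ.

```python
# Sketch of a chained text-to-image -> upscale -> image-to-video workflow.
import fal_client

# 1) Text-to-image: generate a keyframe from the prompt.
t2i = fal_client.subscribe(
    "fal-ai/flux/dev",  # assumed text-to-image endpoint
    arguments={"prompt": "a lighthouse at dusk, cinematic, 35mm film look"},
)
keyframe_url = t2i["images"][0]["url"]  # assumed result shape

# 2) Upscaler: sharpen the keyframe before animating it.
upscaled = fal_client.subscribe(
    "fal-ai/aura-sr",  # assumed upscaler endpoint
    arguments={"image_url": keyframe_url},
)
upscaled_url = upscaled["image"]["url"]  # assumed result shape

# 3) Image-to-video: animate the upscaled keyframe into a short clip.
clip = fal_client.subscribe(
    "fal-ai/kling-video/v2/master/image-to-video",  # assumed i2v endpoint
    arguments={"image_url": upscaled_url, "prompt": "slow cinematic push-in"},
)
print(clip["video"]["url"])  # assumed result shape
```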

Sonya Huang: Okay, so the average workflow is not just a text prompt. It’s not “create a five-minute commercial” and it does it. If I wanted to create a five-minute commercial, what would the workflow be?

Gorkem Yurtseven: Yeah, so for this reason people actually prefer open—like, that’s one of the reasons why people prefer open-source models, because they get to have more control over the model, and they can add things here and there to steer the model towards the outputs they want.

Sonya Huang: Yeah.

Gorkem Yurtseven: When we go talk to studios or more professional marketing teams, they all love working with the open-source models because of the pieces they can replace and the control they can add. And these workflows usually resemble the big ComfyUI workflows you may have seen, with many different nodes, where each different piece can be replaced to create more control for the creator.

Sonya Huang: Got it.

Burkay Gur: Yeah. And I think what we have, like our workflow tool, it’s not the final form of—like, there’s almost another layer of abstraction maybe on top in terms of workflow. And as we talk to these studios, we actually figure out there are so many ways of—just like there are so many ways of using Photoshop, there is no single workflow. In fact, it’s probably based on your role, right? Like, you’re a marketing person or you’re an animator or whatever, you have different workflows, right? And so I think that is also emerging as more and more professionals are actually starting to use these tools. Like, you see the emergence of very particular workflows, right? One of our favorite creators is PJ Ace. He actually shares his workflows online, and every time he posts things—you know, every month he actually has a different kind of workflow. It’s really driven by the new models; based on the new model, he may have a completely new workflow next time. I think once we reach, I guess, some sort of productivity, with professionals actually adopting these tools, there will probably be more standardized best practices around using these abstractions. But I don’t think anyone knows the final, final form yet. And every day we see new things, and we try to update our product to make sure it caters to those people.

Sonya Huang: Totally. One of the workflows I’m seeing somewhat commonly is, you know, you have an idea for high level what you want and you type that in, and then—and the aesthetics that you want, and you iterate on the aesthetics from an image model, and then use that image model with the aesthetics you want to then generate a series of images which then form the storyboard, so to speak.

Gorkem Yurtseven: And then it cascades down from there.

Sonya Huang: Exactly. And then the video models kind of interpolate in between them. And it’s funny because that’s actually how, you know, Pixar and all these companies work in terms of storyboards.

Burkay Gur: I think it was a cost thing in the beginning. Like, that’s why they had to do it like that. But it actually also makes sense, right? Like, it makes sense in so many ways to do it like that. And yeah, they call that stuff pre-production, and then production and post. So pre-production is all the tooling around storyboarding, et cetera. That’s what everyone does even today. Back then it was driven by cost; now it’s more of a speed thing.

Gorkem Yurtseven: And AI makes the workflow very interesting, where you have everything laid out, and let’s say a new text-to-image model comes out. They’ve built it in such a way that, okay, you can press a button and now all the different combinations are going to be generated with this other model, and then you can generate all the videos again. We’ve seen those insane workflows. You want to update one thing, and the whole thing is going to cost, like, a thousand dollars to rerun. But these individuals, they spend a ton of money on creator platforms. I’ve seen bills of, like, half a million dollars just from a single individual, maybe even more when it’s a small production studio and stuff like that. So it’s pretty incredible.

Sonya Huang: Totally. Wonderful. Okay, speaking of studios who are building on your platform, let’s go our final layer up the stack. Let’s talk about customers and markets and then what the future might hold. Maybe what are the coolest things that people are building on your platform today? And are they what we would think of as traditional media businesses or are they net new businesses?

Burkay Gur: It’s all over the place. Like, what’s so exciting about this space is that it just goes across all of the, you know, markets you can possibly imagine. Like, I’ll give you some more, I guess, long tail stuff first, because it’s super fun and interesting. There’s a security company that’s building on top of fal, and they basically have these, like, trainings, and the trainings are generated on the fly. And the content is all dynamic. Obviously, they have some scripts, I’m guessing, to kind of fit the curriculum, but the content you get per person is all dynamic.

Sonya Huang: This is Brian Long’s company?

Burkay Gur: Yeah, this is Adaptive Security. Yeah, they do some really cool stuff. I think that’s one of the most unique use cases. You can see how that translates into, like, the rest of education. I think that market is kind of picking up. Another one, I think, like, you know, this is a more common use case, I guess, is AI-native studios. You mentioned, like, the Bible app. That was one of my favorites. It’s called Faith. It’s one of the highest-ranked apps on the App Store. And yeah, they have, like, stories for each of the stories from the Bible, and they’re, like, really well produced.

And this sort of category of AI-native studios, either in the form of applications or doing feature films and series and things like that, that’s a huge category. So I would call this maybe new media, or AI-native media and entertainment. There is also a lot of design and productivity among our public customers. Canva is one of those, Adobe is one of those. They’re integrating new models into their older tooling. Ads is a big one, and ads come in many flavors. Basically there’s the UGC-style ads, like the stuff you see where there’s a person, you know, demoing a product. That’s a very big category, so AI-generated versions of those. There are also older styles of ads, right? More professional looking, higher production. Maybe you saw the Coca-Cola ad that came out recently.

Gorkem Yurtseven: That’s some controversy.

Burkay Gur: Yeah. Yeah, so that’s like a kind of a higher production, you know, style of ads. But, you know, what we’re excited about is also, like, programmatic ads, right? So where you can do personalized to the degree of literally individuals, you know, yourself being in the ad or in the movies, whatever. So that’s also a big growing use case.

Sonya Huang: Yeah, I’m most excited for the education use case. I think that ads is, you know, the backbone of commerce and the internet. And so, like, super-compelling business case. But education is a market that’s, like, so important, and has never really had that many compelling business cases behind it.

Burkay Gur: Yes.

Sonya Huang: And part of the challenge with education, I mean the challenge has been the bottleneck to creating high-quality content at scale that’s actually ideal for the learner. And so I’m personally most excited about education.

Burkay Gur: Same. Like, I really love the education use cases, and I actually think that, like, ChatGPT or just LLMs in general, I think they are already solving it in a way, but it’s not the right form factor. If you actually want to fully realize the power that these models are bringing, you actually need to go into the visual space, because then it’s so much more compact, it’s more approachable. And yeah, I think once we actually crack visual learning, like, through these video models, that’s when it’s going to really just impact people.

Sonya Huang: Do you think that the advent of generative media is going to increase the value of existing IP? So, like, Mario Brothers, Nintendo, Disney, Pikachu, all these things? Or do you think it’s going to lead to the democratization of the creation of IP?

Gorkem Yurtseven: I love this question, because it felt like—I would say six months ago, it felt like this was all happening too fast for Hollywood, the IP holders, to adapt and be part of it. And from our viewpoint, we thought, all right, these AI-native studios are just going to take over, Hollywood is just going to be too slow, and this is going to go right past them and they’re going to be left behind. But this summer something changed, and we’ve been talking to a lot of the usual suspects from Hollywood. We recently had our first generative media conference, and Jeffrey Katzenberg, former CEO of DreamWorks, was there, and he made a comparison. He said this is playing out exactly like animation when it first came out: people revolted against it. It was all hand-drawn before that. Computer graphics was new, and there was a lot of rebellion against computer-driven animation.

And something very similar is happening with AI right now. But there’s no way of stopping technology. It’s just going to happen. You are either going to be part of it or not. So we are seeing a lot of existing IP holders now taking this very seriously. And at least for the medium term, I think they are pretty well positioned, because they have the technical people behind the scenes who are actually really interested in this technology. They have the IP, but they also have storytelling and filmmaking know-how. You still need quite large budgets. Maybe things are going to get cheaper, but in the medium term, filmmaking is still going to be expensive. Yes, AI is going to make it maybe a little bit cheaper, but we need these deeply technical people who know filmmaking, who have the IP, who know storytelling, to actually be part of this in the beginning. And I think they’re going to play a big role in the coming years in the AI ecosystem.

Sonya Huang: Yeah. When there’s infinite content generation, it almost puts a value on the things that are finite. And I think, you know, for those of us who grew up with Power Rangers or Neopets or whatever, there is just this nostalgia element and this finite supply of IP that really resonates with us.

Gorkem Yurtseven: The opposite is true too, also. There’s a lot of new—like, we had little toys of these Italian Brainrot characters. These are characters with no IP; no one owns them. They are completely AI generated from, like, the internet community. And once you have cheap generation of content and very different permutations of it, the things that people like catch on and become part of the zeitgeist.

Sonya Huang: Totally.

Gorkem Yurtseven: So there are signs of the opposite being true as well.

Sonya Huang: Yeah, both are true. How do you—related question. How do we prevent, like, the infinite slop machine state of the world? You know, there’s this version where we’re just connected to this machine that knows how to personalize stuff for us, and we’re just hooked up to the infinite slop machine. And there’s a version where there’s, you know, human creativity and artistry and things like that involved. Like, how do you think the world plays out?

Burkay Gur: I think humans eventually converge on the things that are more meaningful in general. I don’t know, like, no matter how much slop we fill the world with, I think taste prevails, and people are drawn to experiences that are personal and human. I just think that that’s going to happen. One interesting example of this was, like, when Meta announced Vibes and then OpenAI announced Sora 2. The reception was very different, and one of the reasons in my mind was Vibes was positioned as this slot machine kind of thing where they didn’t have the product out at the time, but it was just like these AI generated—like, you have no relation to the characters, et cetera, right? Like, it was kind of this, like, detached thing. Whereas, like, Sora really made it about friends, right? Like Cameo and you know, they were very …

Sonya Huang: And now you can Cameo your pets.

Burkay Gur: There you go.

Sonya Huang: [laughs]

Burkay Gur: It’s huge, right? So yeah, I think this connection to friends and pets and things like that, that actually made—and Sora was also being very personal about it. They were very adamant about, like, hey, we want to make this about friends, we want to make this about these connections, as opposed to an infinite slop machine. So I think that reception was also a good signal that there are ways to make this technology work in a good way.

Sonya Huang: Absolutely. Okay, I’m going to get your perspective on timelines, and what’s feasible today versus what’s to come. I guess, do you think that we’ll see Hollywood-grade, feature-length films entirely generated by AI? And if so, on what timeline?

Batuhan Taskaya: Well, what does “entirely generated by AI” mean? Is it like no human involvement or …

Sonya Huang: No human filming. So human involvement is fine.

Gorkem Yurtseven: But editing is okay.

Sonya Huang: Yes, absolutely human editing, but no human filming.

Batuhan Taskaya: I think in less than a year we’ll have, like, you know, advanced video models combined with the storyboarding that people have been doing, and you’ll have feature-grade short films, like, less than 20 minutes. I think that’s a fair estimation. Even today I think you can do really great films; it’s just that not enough investment of time is going into these. But with enough investment of time, I think the model quality will be there.

Sonya Huang: You think we’re already there. Okay, and you think it’s photorealistic? You think it’s anime? Like, what categories do you think are more likely to happen sooner?

Batuhan Taskaya: I think photorealistic is like what everyone is, like, targeting. But, like, anime would be a cool one, right? Like, it’s like you don’t see that many anime-specialized models. Why not? I think there needs to be a market for that, clearly.

Gorkem Yurtseven: I think it’s going to be animation or anime or cartoon, like, not photorealistic. Like, as far away from photorealistic as possible. Maybe even as fantasy as possible, because filming photorealism is cheap and doable already. Like, that’s not what costs money when people are making movies. It’s the non-photorealistic stuff that’s actually expensive. And, you know, even if you look at the animated movies, some of my favorite movies are animated: the Toy Story series, How to Train Your Dragon, Shrek, Ratatouille. And people like these things not because they remind them of photorealism; it’s the storytelling that matters. And this created a new medium. I think AI is going to be similar to animation in how that brought a whole different angle to filmmaking.

Burkay Gur: Yeah, I think feature films are hard, like, because with photorealism you typically—I mean people usually like the movies that their favorite actors are in, whatever actors, actresses. So it’s like one step removed from …

Gorkem Yurtseven: That’s the thing that costs money to get the actors.

Burkay Gur: Yeah. Yeah, exactly. So that’s the—you know, we first need to build a connection to this AI-generated character before we can turn it into a film. But I think, like—yeah, I think it’s among, like, different kinds of content like shorts, you know? I think Italian brainrot is an amazing example, right? It was first like these characters, and then it became a Roblox game making—I don’t even know, like, you know, a lot of revenue. So yeah, I think AI-native stuff and shorter-form content is probably going to be very, very big.

Batuhan Taskaya: We saw this with VFX, where visual effects, one of the most expensive parts of producing these videos or films, got, like, AI-fied very, very quickly, because it’s very easy for AI to do, like, explosions or a building collapse. It’s almost perfect now, and I think it’s just going to continue along that dimension.

Gorkem Yurtseven: And maybe facial expressions are going to be hard.

Burkay Gur: Yes. They’re very hard.

Gorkem Yurtseven: You don’t have to do facial expressions. That’s going to be okay.

Sonya Huang: But now they can do gymnastics.

Gorkem Yurtseven: Yeah, gymnastics are important.

Sonya Huang: [laughs]

Gorkem Yurtseven: Good thing we have a lot of footage of Olympics.

Sonya Huang: What about—you mentioned Roblox. At what point—do you think we’ll have interactive video games that are generated in real time?

Burkay Gur: Yes, I think so. I’m very excited about it, actually. In one world, I think the sort of next reasonable step for text-to-video—like, if you think text-to-video is the continuation of text-to-image, I would say text-to-game is the continuation of text-to-video. Because, you know, with a game you would essentially be making the video interactive, right? That’s kind of what that means. And I actually think there is a world where these hyper-casual games exist. But this is another level of hyper-casual, where it’s actually discardable. I think we’re not too far away from that. I actually feel pretty bullish on having these one-time playable games, like very short games. I think that’s probably going to happen. I think that’s a good use case for world models, among many other great use cases. But I think it’s going to happen.

Sonya Huang: What about AAA quality games? Will these models at least assist and change the development pipeline of those games?

Burkay Gur: Yeah, I think they’re already impacting—like, at least LLMs are impacting conversations. There’s, like, dynamic conversations, things like that. I think pre-production stuff is impacted already. I think kind of side quests, like IP stuff, is impacted, right? Like, where you have the assets and you can make a mini game. I think people are using it actually—it’s not very public, but that is already happening. I think using it for AAA production, or generating that with a model, that’s, like, I don’t know, at least three, four years ahead for me. And yeah, I mean that would be insane if we can actually do that. But along the way to AAA, just like in the video space, there are many other things, and I think those are going to be very big.

Sonya Huang: Yeah. The video model space has just exploded in terms of options, quality, et cetera. As you look ahead towards what’s needed to get us to the promised land for everything that generative media can be, do you think that there’s future R&D breakthroughs that are needed on the horizon, like fundamental R&D breakthroughs? Or do you think we’re very much in the engineering scale-up leg of the race?

Batuhan Taskaya: I think the architecture needs to change at least slightly. If you think about scaling these models by 10x, 100x, I think the architecture is a big bottleneck right now in terms of inference efficiency. More compression of the video latent space is definitely needed. We saw this with image models: they used to be much less compressed—or, like, you were operating in pixel space, and then we introduced latent space, and then even inside that latent space you took, like, 64 pixels and made them a single latent pixel.

And now, with video, we are compressing on the time dimension, where we are seeing, like, 4x ratios. Why not, like, 24x or whatever? You need to increase that compression, and I think that’s going to be a big driver of improving both inference efficiency and training efficiency. But at this stage, any model you take on the generative media side, we are far from being scaled up engineering-wise. I think there’s not enough investment being put in, or, like, it just started happening within the past six months. Google showed this with their models and how quickly they were able to catch up. They didn’t need to innovate that much; they have the resources, and they can put more effort into it. But at the same time, smaller labs are able to demonstrate this because there’s so much unique and novel stuff that you can do at the data level to train these models. So I think that’s also helping, contributing. And there’s the factor of, you know, outside mid-tier labs that raise, like, $100 million to $1 billion that are also trying to come up with models, releasing them open source or contributing to the ecosystem.
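To see why compression ratios dominate the cost, here is a back-of-envelope sketch. The resolution, downsampling factors, and patch size are illustrative assumptions, not any particular model's VAE.

```python
# Back-of-envelope on why compression ratios matter so much.
width, height, fps, seconds = 1280, 720, 24, 5
frames = fps * seconds                                    # 120 frames

def latent_tokens(spatial_down: int, temporal_down: int, patch: int = 2) -> int:
    """Tokens a transformer attends over after VAE downsampling + patchification."""
    lw, lh = width // spatial_down, height // spatial_down
    lt = frames // temporal_down
    return (lw // patch) * (lh // patch) * lt

today = latent_tokens(spatial_down=8, temporal_down=4)    # ~8x spatial, 4x temporal
deeper = latent_tokens(spatial_down=8, temporal_down=24)  # the "why not 24x?" case
print(today, deeper)  # 108000 18000

# Attention cost grows with the square of the token count, so cutting tokens by
# 6x cuts the attention compute of every denoising step by roughly 36x.
```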

Sonya Huang: Yeah.

Gorkem Yurtseven: That’s what’s so exciting about this space. There’s so much more work to do. Like, so far, the research community did the simplest thing possible: they captioned images and trained the model on text-to-image. And now we are doing video and image editing, which requires a lot more data engineering to create the data sets. But luckily, it seems we have a lot of abundant free video data. We are going to run out of compute before we run out of video data, so that means there’s a lot more work to do and a lot more room for improvement.

Burkay Gur: I mean, earlier on, Gorkem’s math also indicates that if you want to get to real-time 4K video, that means, I don’t know, 100x, maybe more, in compute or architecture. Something has to give to get us there, right?

And yeah, right now a lot of models are not that usable for professionals especially, right? Or even for consumers. Like, if you’re sitting there—like, for the best models you still have to wait, like, 40 seconds. I don’t know, sometimes you have to wait two minutes, three minutes. Like, that’s not really acceptable in a world where, like, we want everything on demand. So yeah, I think something needs to change.

Sonya Huang: Yeah.

Burkay Gur: And probably the pace of hardware getting faster is not enough. I think if that’s the case, you know, it’ll take much longer, we’ll have longer timelines. So I think the architecture needs to get better.

Sonya Huang: Awesome. Thank you, guys. You made a very high conviction bet on generative media as a theme, I think way before it was obvious. I think we are just at the start of, I think, what’s going to be an explosion of generative media. And it’s been really cool to hear about everything you’ve built from the kernel optimizations and the compiler all the way up to the workflows and what you’re seeing from customers with new and old media alike. And so thank you for joining us on the show today.

Gorkem Yurtseven: Thank you. Thank you so much. This was a lot of fun.

More Episodes