How Cursor Trained Composer on Fireworks: Distributed Infrastructure for High-Performance RL
Cursor’s Federico Cassano and Fireworks’ Dmytro Dzhulgakov explain how they collaborated to build Composer as a specialized foundation model. The core insight: models have finite capacity in their weights, and allocating all those bits to the singular task of software engineering in Cursor frees the model to be both better at the task and far more efficient at inference. Rather than start from pre-training and work up, they took an unconventional top-down approach — mid-training and RL on top of an open-source base to get a useful model into users’ hands fast, then specializing the model around real Cursor usage. With Fireworks providing distributed infrastructure, Composer delivers frontier-class coding performance with the speed of a much smaller model.
Federico Cassano: You need all the infrastructure to run these environments that have to mimic as closely as possible what a user’s computer would look like. And it’s very important, as closely as possible, because sometimes the model can actually figure out when it’s being run in a fake environment or a real one. And it has different behaviors during RL than in production.
Sonya Huang: Are you saying it’s conscious of being in a fake environment, so it starts behaving differently?
Federico Cassano: Yes. Yes.
Sonya Huang: Interesting.
Federico Cassano: Like, it’s like, “Oh, I’m in a fake environment? I will learn a few tricks to, like, get a better reward in this environment. Let me try them out.”
Dmytro Dzhulgakov: Models love to cheat. RL is really good at encouraging cheating.
Sonya Huang: I’m delighted to welcome Federico from Cursor and Dima from Fireworks to the podcast today. Federico, you are the research lead on Composer 2 at Cursor—Cursor’s new agentic coding model. And Dima, you spent several of the last few months moonlighting at Cursor in order to support a lot of the infrastructure required to make this gargantuan training task happen. And so I’m excited to talk to both of you today about how the training of Composer 2 came together, what hard problems you solved together, and what you think it means for the future of AI and foundation model companies.
Federico Cassano: Exciting.
Dmytro Dzhulgakov: Yeah, exciting. Thank you for having us.
Sonya Huang: Thanks for joining. Okay, let’s dive right in. For those who haven’t been following us closely, Cursor recently announced Composer 2, which is an agentic coding model meant for long-horizon coding tasks. Federico, up till now, Cursor was mostly enabling other people’s coding agents. What was the impetus for Cursor to lean so heavily into Composer 2, and how existential is it for you to become not just an application company, but also a foundation model company yourselves?
Federico Cassano: The reason why we started looking into training our own models is you can sort of think about the model as sort of like a storage drive. It has a certain amount of bits that it can store in its weights.
And the idea is very simple: We care about only one task. We don’t even care about coding or programming necessarily. We care about software engineering inside Cursor and inside Cursor only. And so what if we were to allocate all of the bits of information that can be stored inside the model weights to that one particular task?
Also, as people may have noticed, Composer is an order of magnitude less expensive than Opus and other coding models because we can simply specialize all of the model weights to that particular task. And so we can serve a smaller model or something of that sort.
Sonya Huang: So it’s about let’s make sure every single bit of weight or information we have is dedicated toward the specific problem that we have at hand.
Federico Cassano: Exactly.
Sonya Huang: Got it. That seems like an almost generalizable problem. Dima, I’m curious about your perspective. Do you think that every application company should be looking at Cursor as a harbinger of what’s to come? Like, should they all be looking to do the same thing?
Dmytro Dzhulgakov: Yeah, absolutely. I mean, we actually generally see it as a pattern of evolution of the applications. You maybe start prototyping, you might be using an off-the-shelf model to get something running, maybe do some prompt engineering, figure out how your harness works. But the most leveraged attribute of your application is actual usage of user data or particular specific aspects of how the application works. Maybe some aspects of your harness, which tools do you provide, how the application works—kind of really important bits which are important for your application. And the right way to capture that: you can do a little bit of that through prompting, but really the right way to do this is craft your model to act in your environment.
Federico Cassano: Yeah, absolutely. There are certain tools the agent calls that it’s very hard to succinctly describe exactly the behavior of that tool to the model. And with just post-training, we can bake in the optimal way to use those tools. Like, Composer, we do serve a prompt to Composer, but I think the way we are training it is it would work even without a prompt and it would know what to do just because we are intrinsically pushing the model to the right direction of how it should act throughout our training.
Dmytro Dzhulgakov: Basically, there’s kind of an upper bound of how far you can get with prompt engineering. And if you want to craft really great AI products, you have to go through fine-tuning and influencing model behavior. That’s one reason. I mean, reason number two is what Federico mentioned is the cost trade-off or speed trade-off. The way we view it at Fireworks is that when you’re trying to do optimization, you have this three-dimensional trade-off between quality, speed and cost.
And you can go quite far—and we are doing it with a lot of customers initially—you can go quite far with just optimizing infrastructure. But when you start getting into model training, you can really push this trade-off much further, and you can get a better model at a fraction of the cost running much faster. And Composer is a great example of that.
Sonya Huang: Can I push on this a little bit? I want to ask if this approach is bitter-lesson-pilled. And we were actually all talking about Tabnine on the walk in. I’m remembering before the LLM era, there were these small specialized coding models. And one of the things that was, I think, surprising to a lot of people was that as you scaled up just training on the internet and a bunch of English text and other languages, the models themselves actually got inherently better at coding as well. And so at least the trend line I’ve seen so far is that bigger models perform better on everything, including on coding. Does what you guys are saying go against the grain of the bitter lesson?
Federico Cassano: I think no, but one thing to point out is that the big models trained by the labs train on a lot of code as well. Like, code is one of the main tasks the labs are interested in pushing. And so they don’t just generalize to it—they’re a bit specialized as well.
I think for our case, actually, if we believe in the bitter lesson, we are just pushing very hard on the data dimension and we know that the models inherently have finite capacity. And so if we want to saturate all that capacity, we need to scale data. And in order to ingest more data, we need to free up the weights from distractions the model may have.
Sonya Huang: Okay, got it. Super interesting. Okay, let’s dig into the training of Composer 2. You launched a couple weeks ago, immediately grabbed attention, strong benchmark numbers, much lower cost to run inference on. What’s the short version of how Composer 2 works, and what you guys did to make it so performant?
Federico Cassano: We started from a very strong base, which is Kimi 2.5. That’s a one trillion parameter MoE that’s 30B active—so very sparse, actually. We sort of looked at the stack and realized there are two axes. Mainly, Composer 1 was just pushing on one of these axes, which is reinforcement learning, but Composer 2 pushes in two different axes. One is continual pre-training and the other is reinforcement learning. So the thing that made Composer 2 very good is pushing in both of these directions.
So we started off the training round by doing lots of mid-training on code tokens—almost sort of pre-training scale, actually. And then coming out of that mid-training run, we took the checkpoints and we did very large-scale RL on lots and lots of tasks.
Sonya Huang: Okay, and then the premise here would be that because Cursor sits in the middle of so many interesting coding tokens, you actually pretty uniquely have access to data to be able to train at almost pre-training scale.
Federico Cassano: Yeah.
Sonya Huang: Why not pre-train your own model then?
Federico Cassano: We just think about our approach from the top down instead of bottom up. So how do we get a model that’s useful to users in the least time possible? If we were to start from the bottom—figure out how we do pre-training and then scale it up to mid-training and then okay, now we figured out mid-training, do reinforcement learning—that would take a very long time to get a model out to our users. By doing it the other way around, we were able to give a useful model to our users in very little time. So hopefully the next Composer versions are going to be our own model instead of basing it off an open source base.
Sonya Huang: And what is the model roughly learning in the mid-training step?
Federico Cassano: Yeah.
Sonya Huang: And what is the model learning in the post-training step for you?
Federico Cassano: Yeah, so in mid-training, it’s sort of just learning about libraries of code and learning about specific code patterns that are very common, like just world knowledge as well. There is web data there as well. And this is sort of just creating a wider distribution that then reinforcement learning can then sharpen on.
And so during reinforcement learning, you know, the model gets to play directly with the Cursor harness, and so it gets to learn about the world the model is going to live in for the rest of its life in some way. And so then during reinforcement learning, that’s where it learns how to call tools properly, how to navigate its environment, how to write correct code. Because during mid-training, it learns how to write code, but that doesn’t necessarily mean it learns how to write correct code. We try to train on code that is largely only correct, but the model doesn’t actually know how to differentiate between the two. While in RL, one of the key things that we are doing is we’re kind of tuning the feature of the model saying, “Hey, now you’ve got to write correct code all the time.”
Dmytro Dzhulgakov: Exactly.
Sonya Huang: Interesting. And is the model after mid-training, is that similar to the model that you guys have on tab autocomplete, or is that a different core competency?
Federico Cassano: Yeah, I mean, I think I would put it like that, because during mid-training we are just doing next token prediction, you know, like, how well you predict the next token and then the token after that. So yeah.
Sonya Huang: So why not just post-train on your tab autocomplete model then? Why mid-train different models together?
Federico Cassano: Yeah, Tab is a very small model because it’s a super low-latency model. We want it to be very fast. So the core distinction between the base models here is that Tab is small and Composer is quite large.
Sonya Huang: I see. Okay. So it seems like a lot of the focus of what you guys did for Composer 2 was this large-scale reinforcement learning run. Can you break that down for us? What goes into that and what are the various hard problems you solved along the way?
Dmytro Dzhulgakov: When you do RL, it’s quite different from pre-training or mid-training because you’re not just trying to predict the next token. You’re actually running the entire harness, the entire experiment. You’re letting the model act in the environment, seeing how it performs for a given rollout (that’s the terminology), and assigning it a reward based on whether it did something correctly or not. That might use an LLM as a judge, or maybe something verifiable like whether the code compiles.
Which actually means that compared with regular training, you need a bunch of other components. You still need large-scale training, you still need to orchestrate tens of thousands of GPUs to do forward-backward propagation, do all the stuff you do in mid-training and pre-training. But now you also need to orchestrate a bunch of environments. You need to run model inference, because when you do this rollout, you’re effectively running a real Cursor session in some sense.
Sonya Huang: So every rollout is like a forward pass?
Dmytro Dzhulgakov: No, rollout is basically your entire agent session from Cursor. So it basically means it might take something like 50 turns. The model will take your initial prompt, then decide to call some tools. You execute those tools, then the model generates a bunch of other code. The entire session, when you interact with the agent in Cursor, you kind of simulate this entire session as a part of your training run. You get the final reward and you use that signal to now go back to the trainer and incorporate it in the model weights. So you have this very big update loop, which is very heterogeneous because you have all these different components working together. And now you’re trying to orchestrate all of this to work efficiently and work with high throughput because GPUs are expensive and you want to get your model trained quickly in an economic fashion.
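To make the "rollout is a whole session" point concrete, here is a minimal Python sketch of one multi-turn rollout that ends in a single reward. Everything in it is hypothetical: `ToyEnv`, `ToyPolicy`, the two tool names and the reward rule are illustrative stand-ins, not Cursor's harness or reward.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEnv:
    """Hypothetical sandbox: one file the agent must fix, plus two tools."""
    files: dict = field(default_factory=lambda: {"main.py": "print('helo')"})

    def call_tool(self, name: str, arg: str) -> str:
        if name == "read_file":
            return self.files.get(arg, "")
        if name == "write_file":
            path, _, content = arg.partition("\n")
            self.files[path] = content
            return "ok"
        return f"unknown tool: {name}"

    def reward(self) -> float:
        # A verifiable check standing in for "did the session do the task".
        return 1.0 if self.files["main.py"] == "print('hello')" else 0.0


class ToyPolicy:
    """Hypothetical scripted 'model': read the file, write a fix, then stop."""

    def act(self, transcript: list) -> tuple:
        turn = sum(1 for role, _ in transcript if role == "assistant")
        if turn == 0:
            return ("read_file", "main.py")
        if turn == 1:
            return ("write_file", "main.py\nprint('hello')")
        return ("done", "")


def run_rollout(policy, env, prompt: str, max_turns: int = 50):
    """One rollout is one whole agent session, not a single forward pass."""
    transcript = [("user", prompt)]
    for _ in range(max_turns):
        tool, arg = policy.act(transcript)
        transcript.append(("assistant", f"{tool}({arg!r})"))
        if tool == "done":
            break
        transcript.append(("tool", env.call_tool(tool, arg)))
    # The reward is only known once the session is over.
    return transcript, env.reward()


transcript, reward = run_rollout(ToyPolicy(), ToyEnv(), "Fix the typo in main.py")
print(len(transcript), reward)
```

The trainer only ever sees the finished transcript and its reward, which is why a slow or long-horizon session delays the learning signal for every token in it.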
So that by itself is a very interesting problem at the intersection of algorithms and infrastructure, because there are a lot of trade-offs in how you can optimize and co-design the system. One aspect is what people call async RL or pipeline RL.
The idea is basically okay, you’re trying to update this model in steps, right? So you have your current model version, and you are trying to do a bunch of rollouts with it. What does your trainer do while you’re doing these rollouts? A naive approach would say that now I’m going to stop my trainer and do a bunch of sessions (and those sessions might run for five to ten minutes, or even longer if they’re longer-horizon tasks). I’m going to get those outcomes, and now I’m going to pause my inference and go back to training, trying to do updates. That’s very theoretically and algorithmically robust because you are precisely simulating everything, but it’s very system-inefficient because half of your capacity is sitting idle all the time. So you can do all the clever algorithmic tricks allowing you to …
Sonya Huang: What do you do instead?
Dmytro Dzhulgakov: You can pipeline all of this. So imagine this as a gigantic factory, right? You have this trainer building and you have rollouts building. They’re always churning, right? So rollouts always take the latest model version and try to do new sessions and simulate new agent sessions. And the trainer always takes new outcomes as they come and tries to compute updates. So everything is moving along all the time.
The trade-off is that algorithmically it’s different because now by the time you finished some test rollout in your simulated environment, maybe the model weights were already updated on some other data. So you have this kind of staleness, like a delay between how quickly models can learn updates, because by the time you process through some interaction session with a simulated environment, your model weights changed.
And that introduces interesting training dynamics, and there are clever ways you can address this. But the flip side of that is that all your GPUs, all your compute is loaded and churning all the time, which means you’re using more flops. And to your bitter lesson example, if you have higher compute efficiency you can get to a better model in a smaller amount of time. Maybe you’re losing a few percent from being asynchronous and not doing perfect mathematical updates, but you way more than compensate for that by effectively not leaving half your capacity on the table. And there are a lot of interesting interactions in that part.
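The trade-off between the stop-and-go loop and the pipelined one can be sketched with a toy timing model. This is a simplification under stated assumptions: rollout and training phases have fixed durations, the samplers pick up new weights once per training step, and the function names are made up for illustration.

```python
def synchronous(rollout_min, train_min, steps):
    """Stop-the-world loop: sample, then train, then sample again."""
    wall_clock = steps * (rollout_min + train_min)
    staleness = [0] * steps  # every batch was sampled from the current weights
    return wall_clock, staleness


def pipelined(rollout_min, train_min, steps):
    """Samplers and trainer run concurrently; batches arrive one version late."""
    # The first batch has to be sampled before the trainer can start. After
    # that, each step costs whichever side is slower.
    wall_clock = rollout_min + steps * max(rollout_min, train_min)
    staleness = []
    sampler_version = 0
    for trainer_version in range(steps):
        # How many updates behind the trainer was the model that sampled this batch?
        staleness.append(trainer_version - sampler_version)
        # The batch consumed at the next step is sampled now, from the weights
        # the trainer holds at the start of this step.
        sampler_version = trainer_version
    return wall_clock, staleness


sync_time, sync_stale = synchronous(rollout_min=10, train_min=10, steps=6)
pipe_time, pipe_stale = pipelined(rollout_min=10, train_min=10, steps=6)
print(sync_time, sync_stale)  # 120 [0, 0, 0, 0, 0, 0]
print(pipe_time, pipe_stale)  # 70 [0, 1, 1, 1, 1, 1]
```

In this toy setting the pipelined loop finishes the same six updates in 70 minutes instead of 120, at the price of training on data that is one weight version old.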
Federico Cassano: And we’re very serious about performance at Cursor, because unlike the big labs, we have tens of thousands of GPUs, not millions. And so we do all sorts of tricks to get the most out of each GPU. We train in production with FP4 even. We work with Fireworks to push on inference as well. Because the thing about RL infrastructure is it’s just inherently more complex than pre-training because you need all the pre-training infrastructure. That’s just one of the requirements. Then you need all the infrastructure to run these environments that have to mimic as closely as possible what a user’s computer would look like. And it’s very important as closely as possible, because sometimes the model can actually figure out when it’s being run in a fake environment or a real one, and it has different behaviors during RL than in production.
Sonya Huang: Are you saying it’s conscious that it’s in a fake environment and it starts behaving differently?
Federico Cassano: Yes.
Sonya Huang: Interesting.
Federico Cassano: Like, it’s like, “Oh, I’m in a fake environment. I’ve learned a few tricks to get a better reward in this environment. Let me try them out.”
Dmytro Dzhulgakov: Models love to cheat. RL is really good at encouraging cheating.
Sonya Huang: Yeah.
Federico Cassano: And then we need really efficient inference. This is really important. There is actually this kind of myth that during RL you spend way more inference flops than training flops. That is mostly because the open source inference engines are very unoptimized, rather than actually being a property of RL. Roughly, the ratio is kind of the same. In theory, if you push the GPUs to the maximum, you should have one-third of your training GPUs allocated to inference, right? Because training is effectively three forward passes. You have the forward pass, you have the data gradient, the weight gradient. While if you really hit the critical batch size on inference, you should only have a single forward-pass worth of flops.
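The one-third figure follows from a simple count. A back-of-the-envelope sketch, assuming every sampled token is trained on exactly once and that training and inference reach the same hardware efficiency (both are simplifying assumptions, not claims from the conversation):

```python
# Cost per token in units of "one forward pass".
FORWARD = 1.0          # forward pass
DATA_GRADIENT = 1.0    # backward pass with respect to activations
WEIGHT_GRADIENT = 1.0  # backward pass with respect to weights


def ideal_inference_gpus(training_gpus: int) -> float:
    """Inference GPUs needed to keep `training_gpus` busy in this idealised model."""
    train_cost = FORWARD + DATA_GRADIENT + WEIGHT_GRADIENT  # about 3 forward passes
    sample_cost = FORWARD                                   # sampling is forward only
    return training_gpus * sample_cost / train_cost


print(ideal_inference_gpus(3000))  # 1000.0
```

Any inference fleet much larger than this, relative to the trainer, points at an inefficient inference engine rather than something inherent to RL, which is the claim being made above.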
Sonya Huang: So that’s why you guys use Fireworks instead of using an open inference engine?
Federico Cassano: Yeah. I mean, the other alternative is we would build one in-house, but we have finite engineers like everybody else. We would prefer to have engineers make training more efficient and more precise rather than spin up an inference effort.
Sonya Huang: Okay, that’s super hardcore. What about—I think you mentioned in your technical paper that you were doing this in a kind of globally distributed way. Why globally distributed? And then what makes that hard?
Federico Cassano: Yeah. Well, there are various reasons. One, very large contiguous clusters are hard to find in the market. And so what we can do instead is we have one cluster that’s going to run all of the training. We can’t do a global training cluster. But then the inference component of reinforcement learning, we can globally distribute that across small clusters all over the world.
So I think for the Composer 2 run, we used four clusters in total that were all over the world, very far away from each other. And we even used some of our production traffic when it was least used. So we had Composer 1.5, the previous model served, and when it was least used by people, we just grabbed some inference GPUs and put them to speed up training. And so we can do these sorts of things and easily scale up our training run without having one large contiguous cluster. And the thing that enables it—maybe Dima can talk more about it.
Dmytro Dzhulgakov: To reframe what Federico said is basically RL training is very heterogeneous, right? And by leveraging heterogeneity, how different components and what infrastructures they need, you can actually drive efficiency. And you see this pattern across the board everywhere. Specifically for training, you have all these highly interconnected clusters, you need a high-speed network, you need to work in lockstep. So those clusters are expensive, and actually it’s really hard to find big ones. Basically at the scale at which Composer was trained, finding a 2x larger cluster is significantly harder than finding the current sized one.
And that’s why if you can disaggregate these components and put them in different places, one, you don’t need to find such a big cluster. Two, you can actually find different trade-offs of hardware, because for inference you don’t need that kind of wide interconnect. You can have smaller groups of GPUs interconnected together. You can have heterogeneous types of GPUs. You can have different generations of GPUs. You can play all these games of optimization. And finally, inference is much easier to scale up and down as you go. And it’s very conventional. Like, when you have off-peak hours, you can view all your inference pool as one set of GPUs serving production traffic for real users or serving simulated environments for RL purposes and balance between these.
Of course, it’s a very interesting systems problem. So you mentioned the Kimi model is one terabyte. A training step takes somewhere between five and ten minutes. So basically every five to ten minutes you are producing a new one-terabyte snapshot of weights. So the question is how are you going to ship it to a different cluster on the other side of the world very efficiently? And you want to do it quickly because, remember, you don’t want this staleness to get out of hand.
Federico Cassano: Yeah.
Dmytro Dzhulgakov: So I think that’s the most fun part, which we figured out together, is that despite the full model being one terabyte, not all the weights change every step, because RL does a lot of very precise adjustments, especially as the training is going along. So actually there are very regular patterns in which subset of weights gets changed. Maybe not all of them change every time. So if you were to look at how the model changes within one training step after 10 minutes, there is a relatively small delta between those. You can write a compression algorithm which leverages this property, and now you end up with a database systems problem, which is okay, I have my delta and I just want to ship it across the world. My delta might be like 20 times smaller than shipping the full model. And that makes it practical.
But of course, now you need to build all this machinery, from storage systems of full snapshots and deltas and recovery and reconciliation, et cetera. We were able to build it kind of in a lossless fashion, which basically means that you always end up with a bit-equivalent model on the other side, so you don’t need to worry about any math aspects of this. And you can do it really fast, too. You can do it under a few minutes even in the worst conditions—usually it’s under a minute. And most importantly, you pause only for maybe 30 seconds to swap the weights in your actual inference server.
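The conversation does not describe the actual delta format, so the following is only a minimal sketch of the general idea: record just the weights whose bit pattern changed, ship that, and rebuild a bit-identical snapshot on the other side. The index/value encoding, the use of `zlib`, and the function names are all assumptions made for illustration.

```python
import struct
import zlib


def to_bits(weights):
    """View float32 weights as raw 32-bit patterns so comparisons are bit-exact."""
    n = len(weights)
    return list(struct.unpack(f"<{n}I", struct.pack(f"<{n}f", *weights)))


def make_delta(old_bits, new_bits):
    """Keep only (index, new_value) pairs for positions that changed."""
    changed = [(i, n) for i, (o, n) in enumerate(zip(old_bits, new_bits)) if o != n]
    flat = [x for pair in changed for x in pair]
    return zlib.compress(struct.pack(f"<{len(flat)}I", *flat))


def apply_delta(old_bits, delta):
    """Rebuild the new snapshot from the old one plus the delta."""
    raw = zlib.decompress(delta)
    flat = struct.unpack(f"<{len(raw) // 4}I", raw)
    new_bits = list(old_bits)
    for i in range(0, len(flat), 2):
        new_bits[flat[i]] = flat[i + 1]
    return new_bits


old = to_bits([0.1 * i for i in range(1000)])
new = list(old)
for i in (3, 500, 999):
    new[i] ^= 1  # pretend one training step nudged three weights by one bit

delta = make_delta(old, new)
restored = apply_delta(old, delta)
print(restored == new, len(delta), 4 * len(new))
```

Because the comparison and the reconstruction both work on raw bit patterns, the receiver ends up with exactly the sender's weights, and the payload scales with the number of changed weights rather than with model size.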
Federico Cassano: And we also fully saturated the egress of the cluster by sharding the upload and the download as well.
Dmytro Dzhulgakov: So you can do all these system tricks to bring this time down. It is quite a lot of complexity, but you can abstract it out and just make it work great. It doesn’t interfere with your training algorithm. And on the flip side, you have this power to disaggregate, to leverage other clusters to do that. And that kind of goes against conventional wisdom of how you should do RL infrastructure, because conventional wisdom is like, okay, you’re going to have this really huge one cluster connected with RDMA, and it’s going to be very expensive and you’re going to probably allocate one-third to training and two-thirds to inference. And sure, if you have a very expensive network, it’s much easier to copy this one terabyte quickly. But now you have a three times larger cluster. And now if your inference engine is more optimized, then maybe you’re going to save one-third of that cluster in terms of GPUs anyway, because you’re just more efficient, and you can take half of this cluster somewhere else on maybe cheaper hardware in a different region. So your cost comes down quite a bit.
Sonya Huang: I love that you guys are just grinning as you describe this, because it’s so hard and this is a systems engineer’s dream, and so it’s just like an amazing system you guys have built.
Federico Cassano: Yeah. We spent a bunch of nights working on this.
Sonya Huang: Yeah, you look like you’ve spent a lot of time together. You mentioned at the beginning that Kimi is a very large sparse MoE model. Does that make the RL run tricky in any way?
Federico Cassano: Mm-hmm. Yeah.
Sonya Huang: How so?
Federico Cassano: Well, when you do inference, you’re essentially doing a forward pass that is autoregressive. And in this forward pass, it produces log probabilities of the tokens it has sampled. When we ship the model’s generations back to the trainer, we have to rerun that forward pass, because as we mentioned, we are doing asynchronous training. So the model that produced the generations may actually have been a few steps behind where the trainer is. And so we have to rerun that forward pass and recompute the log probabilities. Now the problem is, in theory, these log probabilities should be exactly the same if it’s the same model version. But even with the same model version, you get slightly or sometimes very different log probability values for the same tokens. This is often called a numerical mismatch between training and inference. You hear about it all the time these days.
Sonya Huang: Why is that? Why does that happen?
Dmytro Dzhulgakov: Primarily because fundamentally floating point arithmetic, which is doing this, is non-deterministic. So if you …
Sonya Huang: Sorry, floating point arithmetic is non-deterministic?
Dmytro Dzhulgakov: So you learned in school that if you take A+B+C and C+B+A, it’s going to be the same result. If you’re doing this with integers, with whole numbers on the computer, that’s always going to be true. If you do it with floating point numbers (which are actually approximations; you have this mantissa and exponent, et cetera), A+B+C and C+B+A can give you different results, because each intermediate sum gets rounded. So fundamentally it’s the accumulation order of all these operations which models do. It’s basically multiplications and additions, and addition order matters to your final result. They’re all small differences, but they get amplified through millions and billions of operations. When you do inference on models, usually it doesn’t matter that much, because a pre-trained model is actually pretty robust. If you flip some bits, it’s still going to produce good results; your benchmarks are not going to change. But in RL in particular, because you’re using this very, very weak signal to teach the model, the noise from these numerical differences can make or break your training. And that’s particularly important.
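A short Python check of the accumulation-order effect, using ordinary 64-bit floats. The precise property is that floating-point addition is not associative: how a sum of three or more terms is grouped changes the rounding, while swapping the two operands of a single addition does not. The `accumulate` helper is written out by hand so the left-to-right order is explicit.

```python
a, b, c = 0.1, 0.2, 0.3

# Same three numbers, different grouping, different rounding.
print((a + b) + c == a + (b + c))  # False
print((a + b) + c, a + (b + c))    # 0.6000000000000001 0.6

# A single addition is order-independent.
print(a + b == b + a)              # True


def accumulate(values):
    """Plain left-to-right running sum, like a naive reduction kernel."""
    total = 0.0
    for v in values:
        total += v
    return total


# With very different magnitudes the reordering changes the answer entirely.
print(accumulate([1e16, 1.0, -1e16]))  # 0.0
print(accumulate([1e16, -1e16, 1.0]))  # 1.0
```

A GPU reduction that splits the same sum across threads in a batch-dependent way is doing exactly this kind of regrouping, which is why the sampler and the trainer can disagree on log probabilities for identical weights.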
And again, it’s an interesting intersection between the algorithmic and systems parts, because you can write beautiful math and it just doesn’t work in practice. There are ways you can drive this difference to pretty much zero. There are all these batch-invariant ways where you can be very, very careful and write all your GPU kernels so they always add numbers in the same order, so you always do A+B+C and not a different order. It’s possible, but it always has trade-offs, right? Basically, your system becomes maybe 2x or 3x slower. Again, it becomes an interesting trade-off, like okay, what is the 10 percent of slowdown which we can take? Or in fact, it’s probably a few percent of slowdown we can take to address 90 percent of this difference. That’s the right trade-off, which we found together through iteration.
And you mentioned that particularly for MoEs, sparsity is hard. The reason for that is the way MoEs work is that you would take your activations at every layer and run them through a gating layer, which basically decides—okay, for this token, I’m going to run—out of 384 experts, I’m going to run these eight. So it’s going to do some math and pick the top eight scores. Those eight experts are going to be activated; other ones will not be activated for this token. This operation amplifies your small numerical differences quite a bit, because maybe your hidden states were different by the fifth digit after the dot—doesn’t really matter—but this difference made it so you picked expert number seven versus expert number nine kind of as a cutoff. And suddenly you went and activated a totally different part of the model and your difference got amplified quite a bit. And MoE models by definition are more sensitive to this mismatch. Again, when you do inference or when you do regular work, it usually averages out. But now if you’re trying to make this model learn, this difference is huge, because your inference activated expert number seven. Now in your training, you are trying to update expert number nine, which didn’t even contribute to that during inference.
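The expert-flip effect is easy to reproduce with a toy router. The sketch below uses six experts and top-3 routing instead of 384 and top-8, and the scores are invented so that two experts sit almost exactly at the cutoff.

```python
def top_k_experts(scores, k):
    """Indices of the k highest router scores, returned in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical router scores for one token. Experts 2 and 4 are nearly tied.
inference_scores = [0.90, 0.10, 0.50001, 0.80, 0.50000, 0.20]

# The trainer recomputes the same scores with a different kernel and lands
# a hair away in the fifth decimal place.
training_scores = list(inference_scores)
training_scores[4] += 0.00002

print(top_k_experts(inference_scores, k=3))  # [0, 2, 3]
print(top_k_experts(training_scores, k=3))   # [0, 3, 4]
```

A difference of 0.00002 in one score swaps expert 2 for expert 4, so the gradient would flow into an expert that never contributed to the sampled token.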
Sonya Huang: So were you guys handwriting GPU kernels then to help get around this problem?
Federico Cassano: Yes.
Dmytro Dzhulgakov: Again, you can address a lot of this through GPU kernels, and there’s always a trade-off. Specifically for MoE, you can do this interesting trick which people call “router replay.” Basically you can have your inference just pass extra information to the trainer and say, “Hey, I activated expert seven for this token.” This very small piece of information is just one integer saying that, like, okay, this is the expert which I activated. So the trainer can be aligned with that. And a lot of this numerical alignment is basically doing tricks like that (matching quantization levels, matching kernels, et cetera) to drive the divergence between the training and inference implementations down. And that makes a huge difference between your run maybe diverging completely or being multiple times less compute-efficient because you’ll need much more data to address this mismatch.
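A minimal sketch of the router replay idea, assuming a toy MoE layer where each "expert" just multiplies the input by a constant. The `replayed` argument and both function names are hypothetical; the point is only that the trainer reuses the expert indices recorded by the sampler instead of trusting its own slightly different router scores.

```python
def route(scores, k, replayed=None):
    """Pick top-k experts from the scores, unless the sampler's choice is replayed."""
    if replayed is not None:
        return list(replayed)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def moe_layer(x, expert_scales, scores, k, replayed=None):
    """Toy MoE layer: average the outputs of the chosen experts."""
    chosen = route(scores, k, replayed)
    output = sum(expert_scales[i] * x for i in chosen) / len(chosen)
    return chosen, output


expert_scales = [1.0, 2.0, 3.0, 4.0]
sampler_scores = [0.2, 0.50001, 0.50000, 0.1]
trainer_scores = [0.2, 0.50001, 0.50003, 0.1]  # tiny numerical drift

sampled_experts, sampled_out = moe_layer(1.5, expert_scales, sampler_scores, k=1)
naive_experts, naive_out = moe_layer(1.5, expert_scales, trainer_scores, k=1)
replay_experts, replay_out = moe_layer(
    1.5, expert_scales, trainer_scores, k=1, replayed=sampled_experts
)

print(sampled_experts, sampled_out)  # [1] 3.0
print(naive_experts, naive_out)      # [2] 4.5
print(replay_experts, replay_out)    # [1] 3.0
```

With replay, the trainer's forward pass goes through the same expert the sampler used, so the update lands on the weights that actually produced the token.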
Sonya Huang: I’d love to maybe chat a little bit more about the RL recipe. Can you say a word about the reward signal you’re using? Okay. Can’t say. Got it. Top secret stuff, top secret stuff. Okay. That makes sense. It seems like there’s almost like the equivalent of learning in sim—these are simulated rollouts, versus you have so much actual user data that you could be learning on. Why not just do RL on your actual user data and your actual user harness versus doing this in sim?
Federico Cassano: Yeah, we’re also doing that. That’s what we call real-time RL. And we use the same technology to do the inference weight sync. We have Fireworks to do this. We find user signals where the user was happy or sad about a particular model generation, and we are able to update that model live and then ship a new version of the model continuously every few hours. We are working on decreasing that time. Actually, at some point we’ll have to increase that time, because as the horizon of the model gets longer and longer, we’ll have to re-extend that time. So it’s an interesting play. Right now we are trying to decrease the time for stability, because we are figuring out the right hyperparameters. And then after we figure it out, we have to re-extend it again just because we want to lengthen the horizon of these models.
Sonya Huang: Do you need to do any of the kind of pre-training simulated RL? You have so much actual user data. I imagine that’s just much more valuable to train and tune on. Like, why not just go straight to the online RL step? Why do you have to do the offline RL?
Federico Cassano: The online RL currently is pretty inefficient. So to do each of these steps, we have to take very large steps. And it’s fully on policy. So we suffer from this problem that the GPUs are offline for a long time, essentially.
Dmytro Dzhulgakov: And besides that, there are also different trade-offs, both in terms of efficiency and user experience. If you do a simulation, you actually do multiple rollouts from the same prompt, right? You effectively take a task and ask a model to do 16 tries at a task, or 128 tries on a task—different rollouts from the same prompt. Some of them are going to go well, some of them are not going to go well. And by doing multiple rollouts in parallel, you are able to get much more precise signals. Maybe the model is very good and it does it well 90 percent of the time. Maybe it’s not very good. Losses like GRPO, like group policy gradient, kind of work by doing multiple rollouts at the same time. If you’re doing online, you have only one rollout coming back, so the trade-offs of how you do it algorithmically are different. And most importantly, if a simulated rollout goes wrong, it’s not too bad—you just maybe spend some time on GPU. If it’s an actual user, you have a much higher minimum bar on that, because effectively you’re doing A/B tests. So if the model produces something weird, that’s a bad user experience.
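Illustrative sketch (not from the conversation): the reason multiple rollouts per prompt give a more precise signal can be seen in how a GRPO-style loss scores each rollout relative to its group. The version below normalizes rewards by the group mean and standard deviation, which is a common formulation; the function name, the `eps` constant, and the use of the population standard deviation are assumptions, and nothing here describes Composer's actual recipe.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """Score each rollout relative to the other rollouts of the same prompt.

    Rollouts that beat the group average get a positive advantage, the rest
    get a negative one. eps avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts from one prompt: three succeeded, one failed.
rewards = [1.0, 0.0, 1.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)

# If every rollout gets the same reward, the group carries no signal.
print(group_relative_advantages([1.0, 1.0]))
```

With a single rollout per prompt, as in the online setting, there is no group to compare against, which is why the algorithmic trade-offs differ.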
Sonya Huang: Okay, so you can go off-policy more often when it’s not a real user, because you can experiment with crazy things without affecting the user experience. You can do a lot more rollouts, you can do GRPO, and then you can basically bootstrap some level of performance that’s good enough to even put in front of users.
Federico Cassano: Yeah, we teach reasoning through the offline RL—which is actually called online RL. Offline RL is more like a DPO kind of technique to sort of reinforce RL online. And then there we teach reasoning to the model. We give it some kind of input of the behavior it should have. We try to give it new information about the world, and we teach it tool calling, and then we put it live to users. Because you could imagine if the model is bad, users don’t want to use it, they’re not going to give us any feedback, right? So the model has to meet some kind of bar to even be put into online RL. We want to be really happy with the model, and this is the model we ship. That’s kind of the paradox of online RL—or how we like to call it, real-time—is that we can’t use this to really create the model from scratch, because users need to be using the model. And so it has to be good already. And we can only make it better.
Sonya Huang: Yeah.
Dmytro Dzhulgakov: So it’s kind of like the cherry on top to really get this super delightful experience.
Sonya Huang: Yeah, totally.
Federico Cassano: Hopefully one day it will be like a big, big cherry, you know?
Sonya Huang: Yeah. Dan Roberts presented at our conference last year. I think you were there. Traditionally it was the big cake and a little cherry.
Dmytro Dzhulgakov: Yann LeCun’s cherry. Yeah.
Sonya Huang: Little cake, big cherry. I’m curious. The Andrej Karpathy line of right now RL is still super inefficient, you do a big long rollout and then you kind of get a little bit of information at the end, and it’s still slurping bits from a straw. What do you think? And have you been able to figure out how to get more bits out of that path?
Federico Cassano: I can’t talk about that.
Sonya Huang: Okay, got it. We’re back on the secret stuff. Good. That’s how I know I’m asking the right questions.
Federico Cassano: [laughs]
Sonya Huang: You mentioned the rollouts are a few minutes at a time. It seems like the whole field is pushing towards making long-horizon agents, agents that can work for a long period of time uninterrupted and generally not failing. I love that METR scaling chart. What goes into the RL process to try to get the agent to run for longer?
Federico Cassano: Several things. So one problem about reinforcement learning is that the longer the trajectory is, the harder it is to do credit assignment. You can imagine we are giving thumbs up or thumbs down to the model right at the end of its work. And to simplify the problem, the model asks itself, okay, where did I do right and where did I do wrong? That’s basically the problem called credit assignment. It gets harder as this gets longer. So you have to do a bunch of tricks there.
The other problem is just like you just run out of space, right? These models have a finite context window, and at some point they’re going to reach that. So actually the way we solve this at Cursor is we put compaction inside the RL loop. So we call this “self-summarization.” So during reinforcement learning, the agent actually learns how to continue and go on forever. So in practice, our model is a 200,000 context window model, but in reality it can go on for millions of tokens, and we push it to millions of tokens during RL just because of this ability that it can summarize its work and then take that summary to restart its context window while still trying to accomplish the task. And through RL—because RL pushes the model to do things correctly towards the goal—at the same time jointly we are training the model to produce a good summary, and then we’re training the model to listen to that summary very well at the same time. And so this is kind of like a continuation to reasoning almost, I feel like.
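Illustrative sketch (not from the conversation): the control flow of compaction inside an agent loop might look roughly like the following. Everything here is a stand-in: `step` and `summarize` would be model calls in a real system, sizes are counted in words rather than tokens, and the limit of 20 is arbitrary. It shows only the loop structure, not how Cursor trains it.

```python
from typing import Callable, Dict, List


def run_with_compaction(
    task: str,
    step: Callable[[List[str]], str],
    summarize: Callable[[List[str]], str],
    context_limit: int,
    max_steps: int,
) -> List[str]:
    """Run an agent loop, compacting the context whenever it overflows."""

    def size(messages: List[str]) -> int:
        # Word count is a crude stand-in for a token count.
        return sum(len(m.split()) for m in messages)

    context = [task]
    transcript: List[str] = []
    for _ in range(max_steps):
        action = step(context)
        transcript.append(action)
        if action == "DONE":
            break
        context.append(action)
        if size(context) > context_limit:
            # Restart the window from the task plus the agent's own summary.
            context = [task, summarize(context)]
    return transcript


calls: Dict[str, int] = {"steps": 0, "summaries": 0}


def toy_step(context: List[str]) -> str:
    calls["steps"] += 1
    return "DONE" if calls["steps"] == 6 else "edited a file and ran the tests"


def toy_summarize(context: List[str]) -> str:
    calls["summaries"] += 1
    return f"summary #{calls['summaries']} of earlier work"


transcript = run_with_compaction(
    "fix the failing build", toy_step, toy_summarize,
    context_limit=20, max_steps=10,
)
print(len(transcript), calls["summaries"])
```

Because the reward arrives only after the whole chain of windows finishes, a summary that drops something important lowers the final reward, which is how writing summaries and acting on them can be trained jointly.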
Dmytro Dzhulgakov: I find it fascinating, because usually context management is considered part of the harness, right? In this case, you’re effectively co-optimizing how part of the harness and the model itself work together, and throwing all of that into the optimization loop. And we’ve seen this again and again in AI: the more you throw compute at the problem, the more you can solve the problem end to end. The magic of compute and the bitter lesson works, and you get a much better system which can work together.
Sonya Huang: Totally. Totally. Do you think every company is going to be RLing their own harnesses? Do you think that every company has the same shape of problem as Cursor?
Federico Cassano: If they are using AI and they’re producing lots of tokens and they have a product to optimize against, I think it’s the right move and the right direction to train models.
Sonya Huang: Yeah, interesting. And so it seems like most of the reinforcement learning you guys did then was on the harness/tool use part rather than on getting good at completing the next token for code. Is that roughly the pattern that other founders should have in mind when they’re trying to think about where should I use reinforcement learning? So if you’re trying to get an agent to perform tasks with tools over a long horizon, you need RL. If you’re trying to create a model that’s good at summarization or next token or whatever, you probably don’t need RL. Is that a good framework for when you need RL?
Federico Cassano: I think RL fits everywhere. So even for Tab, we used RL. Personally—this is just my theory and it’s not backed up by anything—when you pre-train a model, the models are just ingesting the totality of human knowledge. Let’s say you’re training a model for math. The model sort of like learns all the math on Stack Exchange. The model, when it’s presented with a math problem—and this is a model that hasn’t gone through RL—the model needs to wonder what kind of person it is. Is it the expert or is it the student that’s trying to learn?
And so one of the things that I think happens during RL is that we are tuning this knob, letting the model know, hey, you are the expert, you need to do things correctly. So that’s like one thing that happens is we are sharpening this distribution. Sort of like RL has a few phases. So there is a very first phase where the model learns and becomes very good very quickly. And then there is a second phase where, like, it takes a lot of compute to continuously improve the model. And you see the model starts reasoning and has this pattern. So in the very first phase of the curve, I think that’s where we’re just tuning the knob, telling the model, “Hey, you should do things correctly here.” And so RL in the small compute case is also very useful just to let the model know that it has to do things correctly. That’s sort of like my case for this.
Dmytro Dzhulgakov: Yeah, I second that. I mean, we see this pattern across many use cases. We help with RL fine-tuning generally for many customers, and we see this usually—continual pre-training, basically mid-training, regular supervised fine-tuning—you can say this is the transfer of new knowledge in an abstract way. And RL is kind of sharpening the behavior or particular qualities you would want from the model. And usually you end up needing both.
And even to your example of summarization, RL may actually be very useful for this, because sometimes if you want a particular style out of summarization, it’s really hard to come up with examples of good and bad summarization that describe this precisely. But if you use, for example, LLM as a judge, you can actually state very precise rubrics. You can prompt the eval saying, okay, these are the criteria for how I’m going to evaluate whether a summarization is good or not, throw it into an RL loop, and let the model experiment with different summarization styles and figure out what you actually want from it, while maybe another LLM evaluates whether it’s matching a particular rubric or not. And that’s the type of pattern you see a lot, not just in coding.
Sonya Huang: I see. Okay. I’m going to ask this question to Dima because Federico is going to plead the fifth. You’ve mentioned LLM as judge a couple of times. Do you think that ultimately companies will be more successful having experts hand-examining RL rollouts and hand-coaching the model behavior in some way? Or do you think LLM as judge, other automated rubrics are likely to get us there?
Dmytro Dzhulgakov: You don’t really put experts directly in judging RL rollouts. I mean, that would be some kind of real-time RL if it’s actual users, or some form of RLHF or DPO. Generally, the more verifiable your reward is the better, because it allows you to scale the compute and just get a better outcome. And by “verifiable,” it basically means, okay, can you automatically produce it without the human? Of course, if it’s math or coding, and you can craft something very deterministic, that’s the best. The reason why LLM as a judge works is that it’s actually a kind of generator-discriminator distinction. It’s much easier to judge. I mean, it’s the same for humans, right? It’s easier to judge than to create.
Sonya Huang: That’s why I’m a VC. [laughs]
Dmytro Dzhulgakov: Yeah, no implication there. But it’s much easier judging. You can craft precisely the different criteria you want to rank some answer. And you see this pattern where you might have a very complicated eval from multiple aspects, right? Because if you dump multiple aspects to a single LLM, it might get confused about how to judge, right? You might break it down: okay, you’re going to judge rubric based on style, based on some different aspects, based on factuality. Kind of really craft these rewards. Some of them will be generative, some of them will be LLM-based. And that’s what guides your model behavior. Then you just turn on more compute and see the graph go up.
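Illustrative sketch (not from the conversation): breaking one complicated evaluation into separate rubrics, each judged on its own and then combined, could be wired up as below. The `Judge` signature, the weights, and the rubric strings are hypothetical, and `keyword_judge` is a deterministic stand-in for what would be an LLM call in practice.

```python
from typing import Callable, Dict

# A judge maps (rubric, answer) to a score in [0, 1].
Judge = Callable[[str, str], float]


def rubric_reward(answer: str, rubrics: Dict[str, float], judge: Judge) -> float:
    """Weighted average of per-rubric scores, one judge call per rubric."""
    total_weight = sum(rubrics.values())
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive number")
    weighted = sum(w * judge(rubric, answer) for rubric, w in rubrics.items())
    return weighted / total_weight


def keyword_judge(rubric: str, answer: str) -> float:
    """Toy judge; a real one would prompt an LLM with the rubric text."""
    if rubric == "is at most 12 words":
        return 1.0 if len(answer.split()) <= 12 else 0.0
    if rubric == "mentions the root cause":
        return 1.0 if "because" in answer.lower() else 0.0
    return 0.0


rubrics = {"is at most 12 words": 1.0, "mentions the root cause": 3.0}
good = rubric_reward(
    "Build failed because the lockfile was stale.", rubrics, keyword_judge
)
bad = rubric_reward("The build failed.", rubrics, keyword_judge)
print(good, bad)
```

Scoring each rubric separately keeps any single judge prompt narrow, which matches the point above that a single LLM handed many aspects at once may get confused.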
Sonya Huang: Do you think that we’re going to see RL be more effective in the harder-to-verify domains? Like, do you think LLM as judge is just sufficient?
Dmytro Dzhulgakov: That’s one of the techniques you would start with, right? Ideally, you want to figure out what is the actual outcome, what is the actual metric you want to get. So trying to approximate this through LLM is one way, trying to get bigger simulated environments is another. If you can simulate more of your product, if you can simulate more of your environment, usually you have a final metric which you care about, it’s just harder to capture. If you can figure out how to capture this, that’s great. And to your point about experts, I mean, experts are still needed, because crafting this task and actually encoding the product experience you want, that’s what matters. We went through software 1.0, 2.0, 3.0. Instead of crafting software directly, we went to crafting training data. Right now you’re effectively crafting the evaluation rules, but that’s still very important. You need to look at examples, you need to look at the data, you need to look at where your product fails and how to nudge the model in the right behavior.
Sonya Huang: I want to ask about RL environments, which is maybe related to what you were talking about. It seems like there’s been a huge explosion in just the revenue scale that some of these RL environment companies are reaching. What do they provide that’s actually useful? Because I think Cursor, for example, you have so much data on, like, how your customers are actually using your environments. What do the RL environment vendors offer you on top of what you already have?
Federico Cassano: Yeah, we don’t actually use any of the environment vendors. It’s very difficult to construct working environments. It’s a valuable product for people that do not have access to this. However, for coding particularly, there is a very large amount of working coding environments available to everybody—that’s GitHub, right? You can go in and maybe, like, have a model just install all of the dependencies for a repository, and that’s a working environment. I think a lot of the difficulty comes from the infrastructure as well. So you can imagine that an environment that works well for a particular task may need, like, services up. You’re, like, making a change that’s let’s say a database migration, to test that it’s actually working, you need the database up, right? And those kinds of things are very tricky. I think these environment companies are quite helpful for that kind of stuff.
Dmytro Dzhulgakov: There are kind of two aspects to this, right? First, if you look at frontier labs, they’re trying to build a generic model which is good at everything. So they need to cover all these different tasks underneath, package it up in one model, and encourage it to generalize. So that’s one pattern, and that’s very helpful. In cases like Composer, you have your actual product. And I think that’s what also [inaudible] with Fireworks—if you have your actual product, you should do RL against it.
Sonya Huang: The most powerful environment is your own product.
Dmytro Dzhulgakov: Exactly. Because that’s where your model is actually going to be used. And of course, if you have a frontier lab, you’re not going to do it across all products. But if you’re trying to build the best model for your product, specialize and tailor it, you should just use your production environment. Of course, you want to isolate it properly, right? You don’t want to wreak havoc on your production database. You want to clone it, et cetera. And there are some tools from environment companies, such as from general infrastructure, which makes it easier. But generally, you want your RL environment to be as close to real production as possible.
And as an example, that’s what we see if you look at toy RL examples and toy RL frameworks, they always start as, oh, there’s this toy environment, I’m going to spin up a docker container and run everything in it, which is great for toy examples if you’re trying to teach a model how to play Atari or whatever. But if you actually transition to production cases, you can’t just put your real production application in a docker container. And we found this out pretty early ourselves working with many folks. In the case of Cursor, the trainer is on their side; some other customers run the trainer on our training platform. But for environments, we actually default to running them on the customer side, because that’s where the actual implementation is. And you effectively have the same setup of trainer—even if it’s part of the Fireworks platform or on the customer side—calling the actual production environment, not trying to kind of wrap it and componentize it on the hosted platform, because that’s really hard and that introduces differences.
Federico Cassano: Yeah. What we call RL environments is really three components. One is the harness. So the harness is where the model can submit tools and the tools get executed. And the second thing is, let’s call it the operating system. So what is the actual world and the state where the model is interacting with? And then there is the reward component, which needs to check at the end that the work is done correctly. And generally the harness is pretty portable—you can take the harness and put it in many different environments. The thing that’s key is the operating system. And to replicate this, just normal containers don’t really work very well. So at Cursor, we actually built, like, a whole virtual machine stack. And so we can spin up virtual machines really quickly and it has to be super bursty, because you can imagine, like, we are asking this system please give me 100,000 virtual machines now, and it has to all come up. Yeah.
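Illustrative sketch (not from the conversation): the three components named above (harness, operating system, reward) can be separated in code roughly like this. The class names, the two tools, and the `VERSION` file check are invented for illustration, and an in-memory dict stands in for what the conversation describes as a full virtual machine.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class World:
    """The 'operating system': the state the model acts on."""
    files: Dict[str, str] = field(default_factory=dict)

Tool = Callable[[World, str, str], str]

@dataclass
class Harness:
    """Receives tool calls from the model and executes them on a World."""
    tools: Dict[str, Tool]

    def call(self, world: World, tool: str, path: str, content: str = "") -> str:
        if tool not in self.tools:
            return f"error: unknown tool {tool}"
        return self.tools[tool](world, path, content)

def write_file(world: World, path: str, content: str) -> str:
    world.files[path] = content
    return "ok"

def read_file(world: World, path: str, content: str) -> str:
    return world.files.get(path, "error: no such file")

def reward(world: World) -> float:
    """Checks at the end whether the work was done correctly."""
    return 1.0 if world.files.get("VERSION") == "2.0" else 0.0

harness = Harness(tools={"write_file": write_file, "read_file": read_file})
world = World(files={"VERSION": "1.0"})

before = reward(world)
harness.call(world, "write_file", "VERSION", "2.0")
after = reward(world)
print(before, after)
```

The harness here knows nothing about the task, so it could be reused with a different `World` and a different `reward`, which is the portability described above.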
Sonya Huang: Awesome. I really enjoyed this conversation today. I think Cursor is such an inspiration in what you all are doing as a company towards going from application company to really a frontier model lab. And I think the work you did with Composer 2 really leads that charge. So really special to hear about it. And Dima, really cool to hear about the hardcore infrastructure problems that the two of you solved together in the trenches over many, many late nights to make it all possible. So thank you. Thank you guys for joining today.
Federico Cassano: Thank you so much for having us.
Dmytro Dzhulgakov: Thank you.