Training Data | Episode 78

Building the GitHub for RL Environments: Prime Intellect’s Will Brown & Johannes Hagemann

Will Brown and Johannes Hagemann of Prime Intellect discuss the shift from static prompting to “environment-based” AI development, and their Environments Hub, a platform designed to democratize frontier-level training. The conversation highlights how AI progress is moving toward Recursive Language Models that manage their own context and agentic RL that scales through trial and error. They describe their vision for a future in which every company becomes an AI research lab.


Summary

Insights from this episode:

Every AI company should become an AI research lab: The winning applications won’t use generic off-the-shelf models—they’ll use models deeply customized for specific workflows through post-training. Just as Cursor built Composer 1 optimized specifically for its IDE, companies need product-model optimization loops where institutional knowledge and best practices compound over time in the model weights themselves.

Trade compute for expertise through reinforcement learning: RL allows companies to use compute to extract maximum value from limited high-quality human data and domain expertise. When you need a model to excel at a specific task where no golden examples exist, RL’s exploration capabilities let you venture into uncharted territory beyond what supervised fine-tuning or distillation can achieve.

Environments are the unifying abstraction for AI optimization: An environment encompasses everything needed for model improvement: the task, the harness for model interaction, and the grading rubric. The same environment used for evaluation can be used for RL training, synthetic data generation, prompt optimization, or distillation, making it a versatile tool beyond just reinforcement learning.

Environment construction is the new data labeling: Creating high-quality RL environments with proper tasks, simulators, and reward functions is becoming critical, similar to how human data labeling was essential in the pre-RL era. The key is identifying which pieces of complex systems to mock versus simulate fully to balance realism with computational efficiency.

Context management will be learned, not scaffolded: Future models will learn to manage their own context through approaches like Recursive Language Models, where models access persistent state in Python REPLs and call sub-models as needed. This represents a more fundamental solution than hand-crafted scaffolds for handling long-horizon tasks and large amounts of information.

Transcript

Introduction

Will Brown: If data is the bottleneck, if having the real expertise is the bottleneck, like, would you rather have the smartest person in history work at your company or someone who’s been there for 30 years? Sometimes you really want the person who’s been there for 30 years. There’s a lot of expertise that comes from really understanding a problem deeply and interacting with it over a long time. And this is really what happens in training that is almost impossible to replicate in a short prompt. You really want the ability for institutional knowledge to compound over time, for best practices to compound over time. And this is how institutions and companies grow to be really powerful and successful is they stand on the shoulders of what they’ve done before rather than kind of resetting every day. And we want to have this be accessible to any company that wants to do this. And I think that’s how we’ve thought about approaching it, especially as software becomes easier for people to manipulate, as the barrier to entry for coding becomes easier, we see the same happening for AI research.

Sonya Huang: Will and Johannes work at Prime Intellect, which is one of the coolest neolabs in AI right now. Your mission is to make frontier lab training accessible to everyone, which I think is a very noble mission. You have really, really strong taste and just developer feel and just understanding how to—you know, that intuition for what developers care about. And then what you all launched with the Reinforcement Learning Environments Hub was really, really differentiated, and people were very excited about it. And so I’m excited to chat about many topics with you all today: post-training, reinforcement learning, agent harnesses, your platform, the RL Hub, and then big-picture questions on what’s coming next in post-training and RL. Does that work?

Johannes Hagemann: Sounds great.

Will Brown: Absolutely.

Main conversation

Sonya Huang: Maybe to get started, you are one of the leading research labs enabling customers to post-train their agents. Can you tell me what is that? What does that mean?

Johannes Hagemann: Yeah, for sure. Happy to take that one, and give a bit of a higher-level overview of what our platform does as well as our research at Prime Intellect. As you already mentioned at the beginning, we try to make frontier infrastructure available to any startup, enterprise, and neolab as well, and basically the infrastructure that is currently locked behind the walls of the big labs where nobody really has access to it. And yeah, we really start from, like, the compute layer and the compute orchestration layer and go all the way up then to the entire full post-training stack. So everything from the training frameworks that are needed to do large-scale reinforcement learning to the environments, with a bit more of a community approach through our Environment Hub, to other pieces that are actually needed to do this, like sandboxes for secure code execution and evaluations as part of our Environment Hub as well. And yeah, to offer this as an end-to-end product in a sense.

Sonya Huang: And what’s the intuition for why even pursue that mission statement of making all that infrastructure available to everyone?

Will Brown: Yeah, there’s a lot of reasons. I think one is—and I think something that we are very passionate about is just open science as a way that humanity moves forward, where a lot of the big scientific discoveries historically have been things that we talk about and as a world we can kind of build on top of, but more practically speaking as well, there’s a lot of value in model customization where, like, the winning applications are using AI for a specific thing, for some agent, for some workflow, where you also want to be able to deliver this at scale and with cost-effective performance. And so really the way to kind of really optimize these systems end to end is to be able to have access to the model weights directly where you can then craft the model to be the best model for your problem rather than some model off the shelf where the crafting happens just in a prompt. And so it’s really just allowing a deeper layer of customization than what you can do at the prompt level.

Sonya Huang: And then is your vision for the future then that every company will be pre-training their own models, post-training their own models, fine-tuning their— like, what do you think the future holds?

Will Brown: We definitely think that every company will be an AI company, and we think most AI companies will want to have an AI research lab. And research can look like many different things. It can look like pre-training, especially if you’re in a domain where you maybe don’t just want text in, text out, if you want something more bespoke. In a lot of cases, it will be post-training agentic or otherwise focused models for specific tasks and workflows. And I think that’s getting to the point where it’s productionizable and cost-effective, where you can actually make this very practical for people to do at scale for the right shape of problems that people want to solve.

Sonya Huang: Awesome. And then can you say at a high level what your platform does?

Will Brown: Yeah. So we have a full-stack research platform called Lab, and Lab is about giving everybody the ability to do the things that a frontier research lab can do internally, but for anyone in the world who wants to do this kind of research. And a big focus of Lab is the notion of an environment. And so I think a lot of people have heard the term “environment” in the context of reinforcement learning, and that’s a big focus of it for us, which is that an RL environment is encapsulating the things you need to do to do reinforcement learning, where you can have a model improve via trial and error. But it’s also more general beyond just reinforcement learning.

And I think if you haven’t heard of an environment before, it’s essentially the same thing as the evals that get reported when people talk about new model releases. So SWE-bench and AIME and Terminal-Bench, these are all examples of environments where there’s a dataset of tasks, there’s a harness for the model to be in, and there’s something called a rubric or reward function, which is responsible for grading the quality of the outputs. And so the same thing you’d use as an eval offline as kind of your test set, you can use this in reinforcement learning as your train set. And this is a way to improve model performance interactively.

And so our platform is really enabling people to use environments in their workflows for post-training, for evaluation, for synthetic data, for reinforcement learning. And it’s also very much focused as a community platform in the same way as things like GitHub are, where this stuff is new, it’s complicated, and we want people to have lots of building blocks they can draw from and lots of examples, and for it to be collaborative and for people to have reasons to kind of show off ways of using models or different tools in workflows as environments to allow other people to post-train models in those environments to kind of have this way of sharing the ability to improve performance across different tasks.
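
To make the three-part structure Will describes concrete (a dataset of tasks, a harness for the model, and a rubric or reward function), here is a minimal illustrative sketch. It is hypothetical and not the actual Prime Intellect or verifiers API; all names are made up and the model call is left as a stub.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Task:
    prompt: str       # what the model is asked to do
    reference: str    # whatever information the rubric needs to grade an attempt

@dataclass
class Environment:
    tasks: List[Task]                        # the dataset of tasks
    harness: Callable[[str], str]            # how the model interacts: prompt -> completion
    rubric: Callable[[Task, str], float]     # reward function: grades a completion in [0, 1]

    def rollout(self, task: Task) -> Tuple[str, float]:
        completion = self.harness(task.prompt)
        return completion, self.rubric(task, completion)

    def evaluate(self) -> float:
        """Used as an offline eval: average reward over the task set."""
        return sum(self.rollout(t)[1] for t in self.tasks) / len(self.tasks)

# The same object doubles as an RL train set: a trainer would run rollout()
# many times per task and push the model toward higher-reward completions.
```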

Johannes Hagemann: I think the general idea is to give more companies the actual ability and advantage that currently only the big labs have in a sense of, like, this product model optimization loop where they can optimize their models for their specific product in a sense. And we see it as a thing where yeah, that’s the kind of reason why, like, a ChatGPT was created by OpenAI or like a Claude Code was created by Anthropic. They actually have the capabilities to optimize models for their specific scaffolds in a sense. And yeah, have their models work way better in their products. And the more popular those kinds of products also become in a sense, like a Claude Code becoming extremely popular right now, the big labs have naturally less of an incentive to actually make it work better for other coding startups in a sense, right?

And the idea there is to give them the tools to have their own model-product optimization loop. And I think there are early adopters on that front. One great example that I always give in this case is Cursor, which in my opinion realized that quite early on. They built their own Composer 1 model where they did large-scale post-training in a sense, and really optimized a model where the environment was actually Cursor itself. So yeah, the model gets all the tools that you have in Cursor as well, and they optimize the model inside of Cursor. And yeah, we believe there’s a lot more startups that will go in this direction to, on the one side, optimize their current products, but also build completely new products that are really not possible right now without having this product-model optimization loop.

Sonya Huang: Awesome. And can we say a word on—you said something interesting. You said environments are just evals. Can we dissect that statement? In my head, an environment is a state, it’s a description of world state through which, you know, you observe what actions you take, you observe how world state changes, and therefore you update your world model. That to me is distinct from an eval, which is like, you should have gotten this answer on this set of questions. And so can you help me merge those two realities?

Will Brown: Sure. Yeah. So I think there’s a version of eval that’s kind of where we were maybe a year or two ago, where a lot of evals were like question and answer, and it’s like this big bank of questions. And then maybe there’s other notions of environment that people think about when they talk about, like, an old school RL with, like, Atari that is much more about this kind of long-running state interaction loop. And I think where we are now is it’s both in the same thing, where especially the environments that you want to do large-scale training on, they do have this complex state. They maybe are simulating a web app. It’s a full-fledged kind of coding platform where you have an agent doing these things. But in the original RL games, there was always a reward. There’s a goal. And so the notion of there being a goal of this problem where it’s not just running through some system and a human is going to kind of vibe check it, there actually is something that can measure progress and performance. And that’s kind of what I mean by it’s an eval in that there is a system that it starts at some state, it interacts with the system, the environment, the harness, the agent, whatever you want to call it. But there is some goal, and there’s a way to measure whether it’s doing well or not.

Sonya Huang: Okay, got it. And then is there a difference between, you mentioned kind of Cursor as a great example of somebody doing really great frontier work in reinforcement learning. Is there a good way to think about when should you be kind of constructing RL environments versus when should your actual application and your application states be the environment, so to speak?

Will Brown: Yeah, I think there’s definitely reasons where you want both, especially, like, let’s say you’re training a model to be a really good Rust coding model. Here you might want it to be good for lots of different applications, where you’d have different environments that are focused on some, like, domain task. Maybe you’re a company that wants a model that’s going to be really good at calling your tools or using your specific domain language that then you can provide as a service to people who are building around that.

And there’s also ones where the product very much is like a user interface for an end user where the user’s interacting with an agent, in which case it might make the most sense for the product to very directly become the environment. I think the companies who might want to be doing this are the same ones who care about whether they’re using Claude or GPT-5 or Gemini, about the ability to choose models, and about having internal systems to evaluate whether a certain system prompt is good, whether changing out a model endpoint is good, or whether using the mini version of a model is a better cost-performance trade-off. The infrastructure to do that, which a lot of people have already been building at these kinds of advanced agent companies, is the same infrastructure you use to do reinforcement learning. And so I think that’s this kind of convenient world we ended up in where the training paradigm that makes the most sense for improving model capabilities is the same sort of thing that a lot of people have been building up the muscle for, just without doing the training piece. And so that’s kind of where we see RL being a very useful tool for people to then have as an option in their toolkit for system optimization.

Sonya Huang: Got it. Okay. I want to talk about agent harnesses as they apply to RL. I feel like harnesses are like the theme of the moment, especially with, you mentioned Claude Code getting so much love. I think one of the things that they do exceptional engineering around is the harness. Harness and RL, are those things orthogonal, mutually exclusive? How do they relate?

Will Brown: They definitely relate. I think of a harness as like a piece of the environment. And so for any eval or environment task, there’s some input, a bunch of stuff is going to happen, then there’s some output state, which is then going to be graded. And so this whole intermediate piece, whether it’s interacting with some simulator or interacting with another agent or physical world sim, this is the environment and the harness is very much a piece of that, where it couples how the model interacts with any other pieces of the system.

And I think depending on the application area, like I think for coding agents, we have a pretty clear definition of what the—you could say the harness is the CLI coding agent and the terminal is the environment. But this isn’t necessarily going to be universal across all different types of agents. In some cases it’s a system prompt and some tools is the harness. In some cases it’s something that is going to be spawning subagents, and those subagents also have their own harnesses. And so there’s a lot of complexity. And I think the way we’ve thought about it is harnesses are going to keep evolving. There’s going to be this Cambrian explosion of ways people want to use models, and we want to take a pretty general approach in defining what you could do with a harness. And so what we are really thinking about and why we use the term “environment” as the abstraction is “agent” is too narrow, “harness” is kind of too narrow. And you can do all of these things within an environment, but the environment as an abstraction on the whole allows this. Any sort of system-model interaction is in scope.

Sonya Huang: I see. I see. Super interesting. And then do you think all companies should be post-training their models with environments? Are there specific kinds of—you know, where’s the bullseye where, like, you absolutely need to be using environments, or is this like you could be post-training with a different method, versus you should just be prompting your thing?

Will Brown: I mean, I think environments are tools you can use to do all of these things. And so I think that when I talk about kind of environments beyond RL, I think part of the reason why it’s a useful abstraction is because it doesn’t tie you to RL. Let’s say I want to have a small model, and I want it to be distilled from a big model. The way you can do this is you take the big model, you plug it in your environment, you let it run a bunch of times. Now you have all this data that comes from this same interaction protocol. You can use the same grader at the end to filter for the best examples, and then do SFT fine-tuning on that. You could do prompt optimization with an environment. You could A/B test different models with an environment just as you would with an eval. And so I think the idea is that every AI company should be optimizing their AI systems. I think that’s a less controversial take.
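
As a rough illustration of the distillation workflow Will describes (run a big model through the environment, keep only the rollouts the grader scores highly, then fine-tune a smaller model on them), here is a hedged sketch. The `teacher` call, the environment object, and the threshold are placeholders reusing the hypothetical Environment shape sketched earlier, not a real API.

```python
def build_distillation_set(env, teacher, samples_per_task=8, min_score=0.8):
    """Collect (prompt, completion) pairs from a strong teacher model, keeping only
    rollouts that the environment's own grader scores highly.

    `env` follows the hypothetical Environment shape sketched above (tasks + rubric),
    and `teacher(prompt) -> completion` is a placeholder for a large-model call."""
    sft_examples = []
    for task in env.tasks:
        for _ in range(samples_per_task):
            completion = teacher(task.prompt)       # teacher runs through the same protocol
            score = env.rubric(task, completion)    # same grader used for evaluation
            if score >= min_score:
                sft_examples.append({"prompt": task.prompt, "completion": completion})
    return sft_examples

# The filtered examples can then feed supervised fine-tuning of a smaller model,
# or serve as a synthetic-data set for other optimization methods.
```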

Sonya Huang: Okay. And maybe there’s another way to frame it. Like, the eval is almost how your agent performs in the set of environments that you expect your customers to face?

Will Brown: Yeah, I think it depends on what you want to call—like, I mean, in kind of traditional machine learning terms, you don’t want to overfit to the test set. And so in some ways, we kind of are already accepting that we’re going to be using the test set or the eval to measure, to have that kind of filter back into the model. And so I think it’s a little tricky to distinguish even, like, what is the eval, what is the environment. We kind of think of them as one and the same. And we use the “environment” term very generically, where an eval is a type of environment that’s used for measuring performance but not training on. And that’s kind of how we see a lot of people thinking about it: when they’re doing evals, what they really mean is they have some way of measuring current performance and they’re iterating on it. And in some ways, RL is this iteration applied at scale, where you’re automating the process of changing the model a little bit, changing the prompt a little bit, and having this be the way that you can hill climb on some goal.

Sonya Huang: Maybe another question. Do you see reinforcement learning as synonymous functionally with post-training? Meaning, like, if you’re post-training models, are you doing reinforcement learning, or are you doing other things?

Will Brown: There’s definitely a lot of things. I think reinforcement learning is the big thing now where it’s like, in many cases, practically speaking, if you’re doing a large-scale model, RL will be where you spend most of your time and focus and compute. But it’s not the only thing you want to do. There’s a lot of things involved in the whole process of going from some initial model to the system you want to deploy. This can be prompt tuning, this can be SFT, this can be online distillation. There’s a number of algorithms that all kind of fall under this umbrella, where RL is like the big one in the middle, but there’s a lot of stuff around it. And really, I think, exposing this toolkit to people and letting them have all these knobs they can play with is the way to unlock this.

Sonya Huang: And are you finding that it’s the really smart AI researchers that actually know how to make this stuff happen in practice? Towards your goal of democratizing AI development, does your average Fortune 500 company know how to use the platform and get value out of it?

Will Brown: I think most companies have people who can. I think any Fortune 500 company will likely have a team of AI engineers who are capable of following the latest tools, who are good at using Claude Code, who have a lot of opinions about models and prompting. And those people certainly can do this. That’s kind of the audience we see as the target customer for this.

Johannes Hagemann: Especially if you give them the right tools to actually do it. Like, maybe some of those large companies don’t really have anybody who can debug a GPU cluster to actually kick off such a run, or handle the other components that are needed there. So it’s about just making it easier to do large-scale agentic reinforcement learning with tool use, with code execution and pieces like this, by abstracting away the entire infrastructure for them.

Sonya Huang: Yeah, got it. Awesome. Actually, on that note, do you guys have any favorite customer stories that you want to share?

Johannes Hagemann: Yeah, one of our favorite customers that I would like to point out is RCI, a neolab working on, like, frontier open models, who have been working with us on the entire stack, right? We’ve been talking a lot about reinforcement learning, right? But yeah, we’ve also been doing large-scale pre-training in the past. That’s basically where our history is coming from as well. So yeah, we’ve been training some of the largest mixture-of-experts models with them, and were actually able to open source them as well. And yeah, we’ve worked with them on the post-training side as well. Maybe Will, do you want to share some more?

Will Brown: Sure. Yeah. So they’re a very close collaborator of ours that we’ve, I think—we’ve all been friends for a long time, but also I think they’ve had a lot of things that they want infra for. And that’s been a way that we’ve been able to kind of force us to build out a lot of the pieces, both from compute orchestration to post-training to pre-training to inference around just making everything that is needed to be a frontier lab. And I think they’re very aligned with us in the openness mission, but I think their focus, they are more targeted at enterprises and the end user, where they are going to work more directly with customers in terms of end-to-end delivery of a certain artifact. I think where we come in is we are really focused on the developer experience, the infra layer, and the ways that we can put these tools into more people’s hands, where the process of going from idea to deployable model can become as seamless and quick as possible and efficient as possible.

Sonya Huang: Awesome. And then any other favorite customer stories, maybe on people that are using the environments product specifically?

Will Brown: Yeah, so we are definitely really focused on the research community. And some of this is, like, a lot of grad students use it, a lot of students and people who are early in their career learning how to do this stuff and using it. But also a lot of labs who are focused on a very specific domain, where let’s say you’re starting a medical AI lab. We work closely with a number of groups in the medical AI space who want to create benchmarks to understand how good models are at medical capabilities, whether in terms of diagnosis or patient interactions or question-answering about medical literature or agentic search over certain medical tasks. Sophont and OpenMed are two that we work closely with, and the focus there is that they really care about earning trust from medical professionals. And so for them, they really want to focus on the domain and not as much on the generic LLM infra that is kind of a headache for a lot of people. And so for them, being able to have this platform for creating evaluations, for showing them off, and for using them to improve model capabilities that could then be deployed locally in a hospital or for some end user is very key, because it gives them this customization, this kind of end-to-end trust, an understanding of the data provenance for the tasks at hand, and the ability to customize models very directly.

Sonya Huang: Do you have any customers that are using you for more of what you call the old-school kind of Atari style, you know, learning from your environment type?

Will Brown: By old, do you mean, like—so when I think of Atari, I think of, like, non-LLMs. And so we definitely are focused on LLMs and foundation models that look like LLMs. There’s definitely a lot of researchers who use our platform for these things that are more—like, there’s some examples we have on the platform that are much more like games. So I mean, kind of one fun one that I use as a demo a lot is the game Wordle from the New York Times, where this kind of ended up just being a great “hello world” environment for people, because the infra is really simple, but it’s very expressive where you can kind of get a feel for it and you get this a-ha moment of seeing a model learn to think about the game as you give it rewards for doing better. And you can do it with a really tiny model, too. Like, it’s in this sweet spot of difficulty where you can do these runs on, like, a couple GPUs in an hour and see a model actually learn how to get better at the game.
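
To give a flavor of why Wordle works as a “hello world” environment, the reward can be a small, fully verifiable function over the hidden word and the model’s guesses. The sketch below is illustrative only (it ignores duplicate-letter edge cases) and is not the environment as published on the hub.

```python
def wordle_feedback(guess: str, answer: str) -> str:
    """Simplified Wordle feedback: G = right letter in the right spot, Y = letter is in
    the word, _ = absent. (Ignores duplicate-letter edge cases for brevity.)"""
    return "".join(
        "G" if g == a else ("Y" if g in answer else "_")
        for g, a in zip(guess, answer)
    )

def wordle_reward(guesses: list, answer: str) -> float:
    """Reward for a rollout: 1.0 for solving the puzzle, partial credit for
    green letters in the final guess so a small model gets a learnable signal."""
    if answer in guesses:
        return 1.0
    last = wordle_feedback(guesses[-1], answer)
    return 0.5 * last.count("G") / len(answer)

# Example: wordle_reward(["crane", "slate"], "slate") -> 1.0
```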

Johannes Hagemann: I would say, like, the more toy-ish game examples are usually the ones that people are going for for actually learning how to build those reinforcement learning environments. And yeah, that’s also what people are heavily using the Environment Hub for in a sense, just because we have all this infrastructure built around it to be able to actually test in your environments. And yeah, that’s usually how they start out, and then go to actually building more complex environments later on.

Another group I would love to give a shout-out to, who have been building some of the more complex environments on the hub, are the people who are part of our reinforcement learning residency. We initially started out with, like, eight to ten people; I think 14 to 16 are now in the group. You have grad students as well as people working full-time who are part-time building reinforcement learning environments as well as doing novel research on top of the Environment Hub. And those folks have built amazing environments in all kinds of verticals, everything from verifiable software engineering to medical physics environments to some cybersecurity environments. And then yeah, we also now give them the tools, obviously, to actually do the training in those as well.

Sonya Huang: Yeah, awesome. Can you help demystify for me, I’ve heard that some of the foundation model companies are spending millions of dollars each on some of these environments. And so you mentioned cybersecurity at the end. What goes into constructing a cybersecurity environment?

Will Brown: Yeah, so there’s someone in our residency program who actually does a lot of this stuff. And there actually is a lot of tooling in the cybersecurity world around these capture-the-flag games, where there’s some system that has some hidden bug in it. These are challenges that were originally built for programmers, where programmers will have, like, a little hackathon where they try to go find the bug in some system. But you can adapt these challenges to LLMs as well, where it then is a full software environment: the agent lives in a terminal and has access to tools for running bash commands. Maybe it’s using Claude Code or some other wrapper for an agent harness, but it’s in a terminal full of files and can interact with these files. And then at some point, it marks that it’s finished with the rollout, and then you can grade the state of the environment using other pieces of software, other code that executes.

But we actually have a lot of people we work with who are in the data and environment space, where we’ve found that there’s a lot of interest in using reinforcement learning as a way to evaluate data quality, where the ability to measure what happens when you train a model on a set of tasks, set of rubrics, set of environments, allows you to understand bugs in the environments, because there are issues that come up in reinforcement learning where maybe if your environment has a backdoor, a model can exploit this and kind of game the system. And so I think there’s a lot of interest in people using RL, like, in the pipeline from idea to environment that ends up in some frontier lab’s foundational training run for the next GPT or Claude model. There’s a lot of vetting that goes into this, and doing these smaller- or medium-scale runs in let’s say one environment allows you to really poke and see where the problems might be.

Sonya Huang: Okay. Super interesting. And if we take the cybersecurity environment example one step further—by the way, I have no particular affection for cybersecurity, but it’s something that, like, I love having a specific example in my head. I could see how you could construct a toy example, right? These capture the flag examples, I would imagine they’re toy examples. I would imagine they look nothing like the actual kind of corporate network environment of a real company with all of these cybersecurity products. And so I guess how do you construct an actual—you know, does what people can construct, does it actually scale up towards imitating and reflecting the complexities of the real world? Almost like, you know, the robotics sim-to-real gap?

Will Brown: Yeah.

Sonya Huang: Is there a sim-to-real gap in crafting these environments? And is the solution just to train on, like, real kind of corporate security environments?

Will Brown: Yeah, so I would say there’s less of a barrier in terms of the actual complexity. Like, in principle, these can be as complex as you want. It’s anything that could be on a computer. So just think of anything that’s on a computer or network of computers as potentially an environment. What really kind of becomes the bottleneck in many ways is cost of the simulator, where I think there’s a lot of focus on identifying clever ways to kind of mock the right piece of the system where let’s say you have an agent that you want to—like, let’s say the internet. Like, let’s say you want to have an agent that does a lot of web search. In some cases you actually want to do the real web search. In some cases you want to find ways where you can design tasks where you actually don’t need the full thing.

It’s like this kind of—I don’t know. I think of, like, the map games where there’s this world you want to explore where there might be this whole map and, like, some of it’s dark, but as you walk around, like, the light shines and then you now see this part of it. And so if you know which pieces of this map might actually be explored on a given task, then this kind of allows you to decide which pieces are important or not. So one example: there’s a benchmark called Tau2-bench that is very popular in the eval world, where it’s about customer service agents involving a database. And so the database is something where in a real system, you might need the full database, but if you know that the agent is only going to ask certain types of questions for a task, you don’t need a full database with millions or billions of records. You can have kind of a mock database that is a cheaper-to-run, maybe in-memory piece of software that only has the things that are expected by the agent or expected in the scope for a given task. And so identifying the pieces of the system where the complexity is kind of overkill is part of this task design process to enable efficiency.
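
The database-mocking idea can be made concrete with a tiny in-memory stand-in: each task seeds only the records the agent could plausibly touch, instead of standing up a full production database. This is a hypothetical sketch, not how Tau2-bench itself is implemented.

```python
class MockCustomerDB:
    """In-memory stand-in for a production database. Each task seeds only the handful
    of records the agent could plausibly need, which keeps rollouts cheap to run."""

    def __init__(self, records):
        self._by_id = {r["order_id"]: r for r in records}

    def lookup_order(self, order_id):
        return self._by_id.get(order_id)

    def update_status(self, order_id, status):
        if order_id in self._by_id:
            self._by_id[order_id]["status"] = status
            return True
        return False

# Per-task setup: only the orders mentioned in this task's scenario exist at all.
db = MockCustomerDB([
    {"order_id": "A-1001", "status": "shipped", "item": "headphones"},
])
```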

Sonya Huang: And does your platform help with that? That almost seems like, you know, if creating environments is the bottleneck to training these systems, then being able to efficiently create these environments where you’re lighting up the part of the map that has to be lit up is like a core kind of almost platform competency. Do you guys assist with that?

Will Brown: Yeah. I mean, it’s definitely a way that we think about the design of everything, and we kind of go down a lot of these different rabbit holes as we have to focus on different tasks when working with different people. And so coding agents maybe is one example where, like, there’s a lot of complexity that comes up when working with sandboxes and terminal states, and ensuring that you have good snapshotting and all of these things and protocols to interact with different agent harnesses.

And so we’ve built a lot around that, but it’s also the sort of thing that we kind of know that the space of complexity is going to grow arbitrarily over time as people start getting more deep into these different domains. And I think we’ve tried to design everything in a way where we keep a lot of doors open, where we have room to build features that kind of like, there’s base layers of like generic environments. Then you can go, I want a coding agent environment. I want a coding agent environment with a sandbox that’s global across the run, or I want one per rollout. There’s lots of these different branches you can go down. And so we kind of have anticipated, like, there will be a lot of these branches. There’s some that we’ve built for, there’s some that I think we’re kind of ready to build for when we need to.

And we also, like, think a lot about how do you make a good developer experience, where let’s say, think about just documentation or skill files for agents. People are going to be using coding agents when they’re building these. And so there’s a lot of institutional knowledge that gets built up when you’re doing this kind of research, both for a specific project or as a research team doing large-scale training runs. And being able to surface this information, some of it is directly in the product, some of it is in the way we design the libraries, some of it is in documentation or skill files that get shown to agents. And this is the sort of thing where we’ve designed it to be something that can evolve over time as the research literature evolves, as the best practices for different types of complex agents become more clear.

Sonya Huang: Yeah. Awesome. I assume you guys read Sutton’s—what is it? The Era of Experience.

Will Brown: The Era of Experience. Yeah.

Sonya Huang: My question is about Scale AI-era data labeling: do you think that constructing environments is sort of the natural successor to that?

Will Brown: Yeah. I mean, it very much seems like it kind of already is, where it does seem like a lot of the focus from the major labs has shifted to—they’re still using a lot of human data. And so human data doesn’t seem to be going away, because creating these environments kind of by definition is for things that models aren’t good enough at yet, which means the humans are better than the models in some capacity. And so identifying which pieces of information the human can most uniquely assist the model in improving its skill on, I think, is really the key to target, which is how do you create this information flow from the human who really has the expert knowledge on how to do some sort of thing, what does a job well done look like, and get it into the model? And it does seem to be that RL is the most effective way to do this right now, where having tasks with prompts to grade with an LLM judge, a rubric for what success looks like on a task, is kind of the paradigm that’s emerging for a lot of these cases where a human—sometimes it’s the reward model, but kind of the most direct visceral version of it is there’s a set of questions about yes, no, was this done in the LLM’s answer, and that turns into the reward score. And so that requires a lot of human data.
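
A sketch of the “set of yes/no questions graded by a judge” pattern Will describes: each rubric item becomes one judge call, and the fraction of yes answers becomes the reward. The `judge` function is a placeholder for whatever model endpoint would actually be used, and the prompt format is illustrative rather than a known recipe.

```python
def rubric_reward(answer: str, rubric_questions: list, judge) -> float:
    """Average of binary judgments from an LLM judge.

    `judge(prompt) -> str` is a placeholder for a model call. Each rubric item is
    typically a human-written yes/no question, e.g. 'Did the answer cite a source?'"""
    yes_count = 0
    for question in rubric_questions:
        prompt = (
            f"Rubric question about the answer below: {question}\n"
            f"Answer:\n{answer}\n"
            "Reply with exactly YES or NO."
        )
        if judge(prompt).strip().upper().startswith("YES"):
            yes_count += 1
    return yes_count / len(rubric_questions)
```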

Sonya Huang: Okay. Awesome. Why create a hub for environments?

Johannes Hagemann: Yeah, I think the hub idea started by seeing a lot of different open source [inaudible] out there that had overlapping implementations of all those environments in a sense. And yeah, in general, Will had already created a nice verifiers framework before we even started the Environment Hub, right? Where you had different examples for environments in there, and it was very much an approach to standardize the whole process even more.

And beyond that, beyond just sharing those environments and having an open source platform for it, it’s also about having a place where you can build a lot of infrastructure around them, which you can’t really do if you just upload them to a GitHub repo or something like this. So having proper evaluations integrated with your environments so you can immediately test them across all frontier models is one of the features people are heavily using the Environment Hub for, right? We make it extremely easy to install one of those environments inside of a different trainer. So we obviously have our own large-scale trainer with [inaudible], which we’ve been heavily optimizing for this. But yeah, we are very open there on the open-source side to integrate with a bunch of different trainers, because people have different needs on the trainer side as well. And yeah, that’s how the whole idea of the environment hub …

Sonya Huang: And what has the community behavior been? Do you see people forking, modifying these environments? Do you see them, you know, putting something 80/20 out into the public domain and then, like, you know, we’re going to keep our secrets for ourself and, you know, not share that back. Like, what’s the community behavior?

Will Brown: Yeah. I mean, there’s definitely a lot of people we work with who want to keep their environments private, as you understand, but the value for them of it being a hub is that they can do ablations on ones that are known to be—they can compare their private one versus some public one that might be on a similar type of task. Or for evals, there’s a lot of value in having uniform implementations of popular benchmarks in a way that if you’re doing a training run, you can plug in some known eval as a way to monitor the progress of your own. And so maybe you’re doing well in your environment. You can see, does your environment also generalize to other tasks? And so having all of these other tasks available is a really helpful way for people to be able to understand not just their own task, but other things they might want the models to be good at as well.

Sonya Huang: Yeah. Awesome. What are the most popular environments on the hub today?

Will Brown: Yeah, I think it tends to be the ones that we use as the examples in the documentation, which in software naturally tend to be the most popular ones.

Sonya Huang: So the Wordle stuff?

Will Brown: The Wordle one’s popular, but I think one that we see a lot of interest around, and one where there has been the most branching and forking and turning it into different versions, is one that we call WikiSearch, which is doing search over Wikipedia pages, but it’s designed to be this template you could use for agentic search more broadly. So there’s a lot of applications where people want agents that know how to search over their internal documents or documents for a specific type of information. And having this kind of template for it, you just have to swap out the documents and now you have this environment ready to go where the rest of it is already kind of set up. That tends to be the sort of thing where we see a lot of value in people being able to bring other types of documents that they’d want to do agentic search over.
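
The templating point is that only the corpus and the question set change, while the search tool, harness, and grading stay fixed. A hypothetical sketch of that swap (with a toy keyword retriever standing in for real search, and nothing here taken from the actual WikiSearch implementation):

```python
def make_search_env(documents, qa_pairs):
    """Hypothetical: the same agentic-search template re-pointed at a private corpus.
    Only `documents` and `qa_pairs` change; the tool, harness, and grading stay fixed."""

    def search(query: str, k: int = 3):
        # Toy keyword scorer standing in for a real retriever.
        words = query.lower().split()
        scored = sorted(
            documents.items(),
            key=lambda kv: -sum(w in kv[1].lower() for w in words),
        )
        return [title for title, _ in scored[:k]]

    tasks = [{"prompt": qa["question"], "answer": qa["answer"]} for qa in qa_pairs]
    return tasks, search
```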

Sonya Huang: Yeah, got it. Okay, awesome. I want to maybe shift towards future research, big blue sky questions. Maybe the first one, Andrej, I think is one of your angels as well. He kind of has that infamous quote of, you know, RL is amazing, but it’s quite inefficient and it’s like sucking bits from a straw. I guess, do you agree with that? And what do you think is going to happen on the research side to make RL more efficient?

Will Brown: Yeah. I mean, I think it’s definitely true that RL is using a lot of compute to get a pretty small signal in terms of pure information. But I think in some ways that’s part of the value of it as well. And I think one of the reasons that a lot of the labs have really focused on it is that one of the bottlenecks that’s hard to scale is human data, especially high-quality human data. And RL allows you to kind of trade off compute for data in a sense, where you can get a lot of value out of a smaller amount of data by using more compute. And so the supervision coming from this data is small, but you can get a lot out of it more so than you can via pre-training or supervised fine-tuning alone, as well as it’s useful in cases where you don’t necessarily have golden examples. So, like, if you have a bigger model to distill from, that’s great, but if you’re already at the biggest model size that you have access to, then you kind of need to go into uncharted territory. And exploration is really the heart of RL. It’s how do you explore and try out different things. And maybe there are ways to do exploration that are more efficient than RL that people will discover. But this is kind of currently the frontier of using compute to explore and improve capabilities. And so that’s what we got for now.

Johannes Hagemann: For sure. And I think—like, I can’t speak for Andrej, obviously, but yeah, I would be curious to hear his views, like, two months after the latest [inaudible] podcast, especially on the coding side for large-scale reinforcement learning and how it actually helps there. If we look at something like Claude Code, which definitely was already popular before, but definitely popped up more over the last month, I would say, then his views might change on the specific piece of how useful reinforcement learning can be in the coding domain.

But yeah, generally we don’t think that’s going to be the end in a sense, right? We generally think we want to always be at the frontier of what comes next in terms of paradigms. Yeah, there’s definitely lots of low-hanging fruit still in pushing agentic AI capabilities even further. But yeah, we also know the limitations, right? Like, some of the pieces we’ve been working on where we definitely see limitations is on the context side. So yeah, I think we just have a hard limit right now on how many tokens we can fit into a context. And we have been thinking there about ways to actually improve that.

Sonya Huang: Yeah. Awesome. Switching gears a little bit. Open source, open weight models. What do you see the role open weight models play? Does your infrastructure kind of work only on or work optimally on open weight models? Could you kind of help people do post-training around closed-weight models? How does that all work?

Will Brown: Yeah. So in many ways, the trainer itself is going to require having access to the weights. And so if anyone who has closed-weight models wants to use it, we’re happy to chat. But more broadly, the idea of the environment is general across different types of optimization. And so, using the same infrastructure at the environment level, we can do evals on closed models, prompt tuning on closed models, model selection, and evaluation of agent harnesses. There’s a lot of research that can be done around closed models just by having a way to do this experimentation. And whether you’re using open or closed models, you can create data, where some platforms might let you upload examples so that you’re distilling from a particular model into another, using the environment as this data engine. So there’s a lot of different ways you can use the tools to optimize models.

Sonya Huang: And do you need-need-need the weights? Like, for example, can you kind of LoRA?

Will Brown: Yeah. So I mean, the LoRA I would consider still part of the fine-tuning process, and that’s what we recommend a lot of people do for RL anyways. And so, like, you don’t need necessarily the full weights, but, like, I can’t upload my own LoRA for GPT-5, but I could bring an environment to the platform, potentially. And so I think there’s a lot of different ways you can do partial customization, and I would imagine that in many cases, the degree of how many—do you need a LoRA adapter, do you need full fine-tuning is going to depend on the training recipe as well as the goal of your optimization.

Sonya Huang: What about can you do reinforcement learning on, like, the agent harness around a closed model?

Will Brown: Yeah. So I would definitely consider this in the domain of, like, there’s a world of prompt optimization that some people have been exploring in the research world that is in some ways kind of an analog of RL but in prompt space where you’ve got …

Sonya Huang: Like DSPy?

Will Brown: Yes, exactly. And so the GEPA algorithm got a lot of attention late last year as kind of what seemed to be a better way of doing this. And so we have support for that as well with our environments, where you can do this around different pieces of the harness where this might be what’s the prompt used for a certain tool? What’s the agent skill? What’s the system prompt? There’s a lot of these things that you can apply different types of optimization to.
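
This is not GEPA itself, but the basic shape of prompt-space optimization around a closed model is easy to sketch: propose candidate system prompts, score each against the environment’s grader on a held-out task set, and keep the best. All function names below are placeholders for whatever harness and rubric an environment actually defines.

```python
def hill_climb_prompt(tasks, run_agent, grade, candidate_prompts):
    """Pick the system prompt with the highest average score on a held-out task set.

    `run_agent(system_prompt, task)` and `grade(task, output)` are placeholders for the
    environment's harness and rubric; no model weights are touched at any point."""
    def avg_score(system_prompt):
        return sum(grade(t, run_agent(system_prompt, t)) for t in tasks) / len(tasks)

    return max(candidate_prompts, key=avg_score)
```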

Sonya Huang: Awesome. While we’re on the topic of DSPy, I think the DSPy authors also have this new thing that is the current thing. What is it? Recursive language models?

Johannes Hagemann: Yes.

Sonya Huang: What do you guys think?

Johannes Hagemann: Yeah, we are definitely very interested in that. As I already said earlier, in a sense, we are very interested in longer horizon agents and so on, and actually solving things for those type of use cases. And yeah, I’ve been internally thinking for a long time about how can we have models learn how to manage their own context. So right now, people are building a lot of scaffolds for context management. And yeah, we believe something that is a bit more [inaudible] in a sense is to have the model learn how to manage its own context. And yeah, I’ve been searching for different research in this kind of domain for a pretty long time. And yeah, the recursive language model research direction is one of the most promising ones in our opinion.

We’ve been very interested in this kind of work since Alex Zhang, who’s the original author of the RLM work, published it, and we’ve been exploring it as part of our research as well. And we had a blog post out a couple of weeks ago that basically showed using this RLM harness where you pretty much give a language model access to a variable in a persistent Python REPL. So it doesn’t have this whole context or the whole data as input; instead it has it in this variable that it can retrieve and transform, and it manages its own context through that. And then it can also call other sub-LLMs, which is where the recursive part comes from, to actually manage it. And that’s the whole idea behind the recursive language model. We’ve been doing some exploration there on this front to actually just give current frontier language models access to this specific RLM harness. So not necessarily training in this harness, but just giving them access to this specific way of dealing with their own context. And yeah, that’s already been shown to improve benchmarks on very long-horizon reasoning quite a bit. And yeah, we are very excited, as a new frontier in a sense, to actually train in this as well, to train the model to actually use this harness. And yeah, that’s what we’re going to work on over the next couple of months.
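
A rough sketch of the RLM-style harness described here: the long input lives as a variable in a persistent Python namespace rather than in the prompt, and the root model emits code that slices it and calls a sub-model on the pieces. All names are illustrative and this is not the published RLM implementation; `root_model` and `sub_model` are placeholder callables.

```python
def rlm_rollout(root_model, sub_model, long_input: str, question: str, max_steps: int = 8) -> str:
    """The root model never sees `long_input` directly: it only sees the question plus
    REPL output, and it manipulates the data by emitting Python code each turn."""
    namespace = {"data": long_input, "llm": sub_model}   # persistent REPL state
    transcript = (
        f"Question: {question}\n"
        f"A variable `data` ({len(long_input)} chars) and a helper `llm(prompt)` are available.\n"
        "Emit Python code that sets `result`, or reply 'FINAL: <answer>' when done."
    )
    for _ in range(max_steps):
        action = root_model(transcript)                  # model emits code or a final answer
        if action.startswith("FINAL:"):
            return action[len("FINAL:"):].strip()
        try:
            exec(action, namespace)                      # run the code in the persistent namespace
            output = str(namespace.get("result", ""))[:2000]
        except Exception as e:
            output = f"Error: {e}"
        transcript += f"\n>>> {action}\n{output}"
    return ""
```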

Sonya Huang: Super exciting. What else in the research domain? You guys have great research taste. What do you think is on the horizon?

Will Brown: I’m really excited about synthetic data research. It feels like there’s a lot of stuff that we should be able to do, but you haven’t really seen it emerge in the open in terms of creative ways of doing kind of self-reflection. And I think people talk a lot about continual learning as this kind of idea that we’re going to have to get better at having models kind of learn things on their own. And I think the idea of using other tricks that we already know in conjunction in different ways, things like prompt optimization and distillation in conjunction with synthetic data, it feels like there’s a lot of—I don’t want to go too deep into different directions, but it seems like there’s a lot of room for exploration around having models curate their own training data, maybe curate their own environments, and understanding which versions of this are most effective for, like, lifelong learning.

Sonya Huang: Love it. Okay, we’re going to close on an optimistic note. If everything goes right, what does the world look like, and what is the role that Prime Intellect serves in that world?

Johannes Hagemann: Yeah, great question. To answer that at a high level, I would say we don’t want to have a world where all the future value of AI in all kinds of verticals is just owned by the big labs. We want something where we empower entrepreneurs and enterprises and so on to actually not get steamrolled and, like, to optimize their products better than they have the tools for right now. And yeah, just enabling this: a lot more Claude Code moments, a lot more Cursor-for-X type moments that will be enabled through this.

Sonya Huang: Every company is a NeoLab?

Johannes Hagemann: Kind of.

Will Brown: In some ways. I mean, if data is the bottleneck, if having the real expertise is the bottleneck, like, would you rather have the smartest person in history work at your company or someone who’s been there for 30 years? And in some ways, sometimes you really want the person who’s been there for 30 years. There’s a lot of expertise that comes from really understanding a problem deeply and interacting with it over a long time. And this is really what happens in training that is almost impossible to replicate in a short prompt, where you really want the ability for institutional knowledge to compound over time, for best practices to compound over time. And this is how institutions and companies grow to be really powerful and successful is they stand on the shoulders of what they’ve done before rather than kind of resetting every day. And we want to have this be accessible to any company that wants to do this. And I think that’s how we’ve thought about approaching it, especially as software becomes easier for people to manipulate, as the barrier to entry for coding becomes easier, we see the same happening for AI research.

Sonya Huang: It’s a really inspiring vision for the world. Thank you guys so much for joining today. You’ve really paved the way on environments and your environment hub. And thank you for taking the time to demystify what an environment is and share your vision for the future.

Will Brown: Thanks.

Johannes Hagemann: Thank you.
