Making the Case for the Terminal as AI’s Workbench: Warp’s Zach Lloyd
Zach Lloyd built Warp to modernize the terminal for professional developers, but the rise of coding agents transformed his company’s trajectory. He discusses the convergence of IDEs and terminals into new workbenches built for prompting and agent orchestration, and why he thinks “coding will be solved” within a few years, making human expression of intent the ultimate bottleneck. Zach explains why the terminal’s format makes it perfect for managing swarms of cloud agents.
Summary
Insights from this episode
The terminal is becoming the hub for agentic development: The terminal’s time-based, text-driven interface is ideal for orchestrating agents, multitasking, and logging, making it well-suited as the workbench for AI-powered software development.
Coding interfaces are converging into new hybrid workbenches: Traditional distinctions between terminals and IDEs are fading—future tools will merge prompting, context management, and code editing to support both human and agent collaboration.
The next wave is ambient, cloud-based agents: The future isn’t just developers typing prompts; it’s agents running autonomously in the cloud, triggered by system events and integrated into team workflows, requiring orchestration platforms that track and manage agent activity.
Product differentiation and sustainable business models are critical: In a market dominated by well-funded model providers, survival depends on building a superior product experience for pro developers and adopting pricing models that align value with usage, rather than racing to the bottom on cost.
The real bottleneck will be expressing intent, not coding itself: As models continue to improve, the main challenge shifts from generating code to clearly conveying what should be built—meaning the ultimate constraint will be human communication, not AI capability.
Transcript
Introduction
Zach Lloyd: Just the general form factor of the terminal is perfect for agentic work, because everything is, like, time based. It’s all about input of text and output of text. You get to log what you’re doing. You can multitask agents in the terminal really easily, and so I think it’s been like actually a great stroke of luck for us in a lot of ways that the terminal has become the center of agentic development. It’s a huge opportunity for us.
Sonya Huang: In this episode, Zach Lloyd, founder of Warp, reveals why the terminal is becoming the center of AI-powered development. Zach shares how coding interfaces are converging into a new workbench built for prompting and agent orchestration, and why the next frontier isn’t developers typing prompts, but ambient agents running in the background that autonomously respond to system events like server crashes or security incidents.
We discuss the brutal competitive dynamics of the coding market, and why model providers are racing into the application layer. And finally, Zach shares his thesis that coding is nearly solved and that the ultimate bottleneck for AI will be humans’ ability to clearly express intent. Enjoy the show.
Main conversation
Sonya Huang: Zach, thanks so much for taking the time to join today.
Zach Lloyd: Thanks for having me on. Good to be here.
Sonya Huang: Before we get started, can you tell our audience a little bit about yourself, and what is Warp, and what company did you set out to build and why?
Zach Lloyd: Yep. So I am Zach. I’m the CEO and founder of Warp. Warp is a developer-focused startup. Our goal has always been just, like, help pro developers ship better software more quickly. The product that we’ve built, it has an interesting history. We’re, like, five years old. We started off building a modern reimagination of the terminal, and today the product has evolved into—it’s sort of a terminal with agents built in is one way of thinking about it, probably the simplest. It’s a workbench for building software with agents is kind of the more general way of framing it.
Sonya Huang: Awesome. Let’s dive right in. What made you decide that the terminal was the right place to build?
Zach Lloyd: So I’ve been a developer for a really long time. I’ve always used the terminal. In a prior life I was a principal engineer at Google. I used to run engineering on Google Docs. I’m not a good terminal user. I always worked with people who were good at using it, and I saw that you get just a ton of stuff done as a developer, because of where it sits in the stack. So it’s like a super duper powerful thing if you know how to use it right, but the sort of stock version or classic version of the terminal, I think, is like a horrible product. It’s hard to learn, it’s easy to make mistakes in. The mouse doesn’t work. And so I was interested in how you build something that’s impactful for developers, how do you build something that helps more good software exist in the world. And trying to reimagine the terminal felt like a cool thing to take on.
Sonya Huang: Hmm. And how much of the thesis was around making the terminal great for single player versus multiplayer?
Zach Lloyd: It’s a good question. So the multiplayer part was going to be what the business model was going to be. So it’s like, you know, I came from the Google Docs world, I built collaborative software. I think the closest analogy would be something like Postman, where, you know, they have, like, a collaborative API platform. We were going to do that around the terminal where you could share commands, you could share sort of like runbooks, share incident response manuals.
And Warp actually has all that stuff. And it’s super—it’s super useful, not just for people, but for agents at this point to have all that knowledge baked into the product. So that was going to be the business model, but where we actually started was just like the hands on keys interaction with the terminal itself. Could we reimagine the developer experience of that? And so we spent the first year, year and a half of just, like, how would we like this thing to work, how do we want the input into the terminal to work, how do we want the output to work, and how can we make it just easier without diminishing the power of the tool?
Sonya Huang: Yeah. Awesome. And you made the decision to focus on rebuilding, reimagining the terminal, pre-generative AI, pre-coding models taking off.
Zach Lloyd: Yeah.
Sonya Huang: Do coding models and agents, do they change your answer to the question of, like, how important is the terminal as the workbench?
Zach Lloyd: Terminal is ironically more important now. The terminal has become, I think, the preferred form factor for working with agents. I mean, basically you can work with them in the IDE or you can work with them in the terminal or you can create some other workbench which, you know, Warp, you can see Warp that way. Actually, Warp started as a terminal, as a broader workbench now for agents. But just the general form factor of the terminal is perfect for agentic work because everything is, like, time based. It’s all about input of text and output of text. You get a log of what you’re doing. You can multitask agents in the terminal really easily. And so I think it’s been, like, actually a great stroke of luck for us in a lot of ways that the terminal has become the center of agentic development. It’s a huge opportunity for us.
Sonya Huang: I’m curious if you thought that we were headed towards a world where people just weren’t going to spend time in the IDE, and do you think that’s been accelerated now?
Zach Lloyd: I think that the kind of tools are morphing. And so, you know, pre-agent world you had a pretty clear distinction between terminals and IDEs. Today you have tools like Warp, which are—you know, we’ve grown from the terminal and added a bunch of IDE features like code editor and code review features and the file tree. And, like, we get yelled at on Twitter for having a file tree in Warp because it’s not like a pure terminal thing.
But then if you look at, like, the latest iteration of Cursor, which, you know, started as an IDE, it looks a lot more like Warp. Like, the primary interface is now more of a chat interface and talking to your computer, but you still have all the file editing things.
So I don’t know if I would be like, terminal is going to die or the IDE is going to die. What I do feel strongly about is that there’s going to be innovation and there is innovation happening where the form factor is changing to match what the agentic workflow should be. So the form factor, it should be geared towards prompting, it should be geared towards adding context, it should be geared towards reviewing agent-generated code diffs. I think actually now, like, team is even more important, like, especially as you have more and more agents that are not just launched locally by people, but are coming to be launched by system events. And so I think the workbench is changing, and I actually think it will end up looking more like a terminal than an IDE, but probably won’t, strictly speaking, be a traditional version of either.
Sonya Huang: And is the rough framing of, you know, the reason to use each that, you know, terminal is roughly equivalent to the chatbot. Like, you can chat with a coding agent and hand off tasks, and then IDE is roughly equivalent to, like, a GUI for actually editing and writing code. Is that the right mental model?
Zach Lloyd: Yeah, that’s definitely where things have started. Yeah, the IDE is like Microsoft Word for your code, and the terminal is like chatting with your computer. And if you’re doing professional agentic development, which I would distinguish from vibe coding, you kind of want both those.
Like, I don’t think for the pro use case, we’re at a spot yet where you can be so disconnected from the code that you don’t need some way of, like, falling into hand editing it. I would think of it as like the hand editing has almost become like a fallback interface or a secondary interface, and the primary interface now is the prompting interface. And so yeah, I basically think that that is the right distinction, but I think it’s all kind of merging product wise.
Sonya Huang: Yeah, interesting. You mentioned pro coders a few times. What was the decision to focus on pro coders about? And how do you think that plays out over the next decade? Like, will there be pro developers left? Will everybody be a pro developer?
Zach Lloyd: So okay, it’s a great question. So I think what I really care about is helping build software that I use every day. Like, there’s probably 10 apps or whatever in my Mac dock and pinned as Chrome tabs, apps like Google Docs or Spotify or Notion or Figma or Warp. These are hard-to-build apps, and I think it’s those hard-to-build apps that the world spends most of its time using. And those are built, I think, more by pros, and they’re definitely built more by enterprises.
And I just want to be a part of, like, creating that kind of software. Whereas I do feel like the non-pro segment is cool, and I do think it’s actually really empowering that kind of anyone can make an app at this point. But I just think, like, the sort of economic value of the apps that you build with a vibe coding tool like Lovable or Replit or whatever is lower than the economic value of the apps built with the tool that’s geared towards pros. And it’s also, it’s just like I don’t think I’ve ever spent my day using an app that’s been built in a no code, low code, vibe code tool. Whereas I spend all of my days, like, literally living in software that is built by pros. It’s really built for these immersive, hard, important, economically valuable use cases.
Sonya Huang: Totally. The world is a museum of passion projects, and I think that includes the software we choose to use every day.
Zach Lloyd: Yeah.
Sonya Huang: Let’s talk about competition in the coding market. This is the most brutally competitive subset of software I have ever, ever seen. And you’re playing in interesting waters, right? You’re competing with a lot of folks, you’re collaborating with a lot of folks. Maybe just help orient our audience. Where do you see yourselves in the broader competitive landscape in the coding market?
Zach Lloyd: Yeah, it is competitive out there, because it’s such a big, important market. A lot of people want to play in it and, you know, where do we sit? So we are a sort of general purpose agentic development workbench, which means you can use us like Cursor, you can use us like Claude Code. I think we have a unique product approach to doing agentic development where we are truly the only platform out there that has grown out of the terminal. So there’s a lot that have grown out of, like, IDEs, specifically forking VS Code, and they’re all very similar products. And then there’s a lot that are just apps that run within the terminal. And so those are like text-based apps, and those are also basically all the same.
And so Warp has a very differentiated product approach. I think one area where our product approach really shines is for people who are doing, like, traditionally terminal heavy workflows. And so that would be things like stuff beyond coding, so stuff like the software development life cycle. It could be setting up projects, it could be deployment, it could be working with, like, Docker and Kubernetes, it could be incident response. So, like, backend DevOps, SRE people who do production work, I think Warp is an amazing tool for them because it integrates so well with all of these non-coding terminal workflows. But the truth is, I don’t know, Warp is like, at any given moment we’re one of the top five agents on SWE-bench. We’re typically number one or two on Terminal-Bench. And so it’s a great general purpose coding agent. And so we’re in the market, but it is really competitive.
Sonya Huang: Yeah.
Zach Lloyd: We’re trying to compete on the quality of the product. It’s like we are in competitive—there’s competitive pressure around cost, which is like a really challenging thing for us.
Sonya Huang: Let’s talk about that directly. And just to hit it head on, like, how do you compete when Anthropic, you know, can subsidize their tool with model profits?
Zach Lloyd: It’s Anthropic, it’s OpenAI and it’s Google. So we have to compete based on the quality of the product, for one thing. And so I think, like, we can be a little bit in the more premium part of the market here. Like, coding agents and developer experience, it’s not bananas, it’s not like a commodity. These aren’t like totally fungible things. Like, the product experience does actually matter.
And so we can get people who care about that. I think that matters in that basically—I also think you want to stay away from certain user segments who are most cost conscious and cost shopping. And so that would be, like, vibe coders, people who are running agents, like, 24 hours a day, you know, making prototypes. And that’s just not actually the usage pattern of a pro developer.
And so for a pro developer, I think you can make a pretty strong argument that the actual holistic experience of using the tool might be worth, like, 20 bucks more, 40 bucks more, 80 bucks. Like, these are tiny sums compared to the amount of productivity that people are gaining and the amount of software that’s being produced.
I think there’s also a way that we are trying to sit above the model providers. And I think there’s been a positive development here, which is that for about, I don’t know, three months, maybe six months, I think Anthropic was kind of like the main show in town when it came to frontier coding. And now I think Gemini 3 and even the latest Codex models are basically on par with the latest Claude model. And so there is advantage to being able to let people choose between those or model route amongst them, or to model route with cheaper open-source models. So it’s not easy, and if you view it as just like we’re in a cost race and all these coding tools are the same, then I think we have to differentiate way more on the product and also just the orchestration of these agents. But I don’t think that’s quite the situation.
Sonya Huang: Got it. Okay, so you’re winning people because they love the overall Warp product and that includes [inaudible].
Zach Lloyd: I think so.
Sonya Huang: But also products like the actual terminal, the better terminal you set out to build.
Zach Lloyd: Yeah, it is kind of a funnel. You know, we have, like, 700,000 developers actively in Warp. And there’s a bit of a funnel from the terminal use cases into the coding use cases.
Sonya Huang: Yeah. You mentioned being one or two on Terminal-Bench. What goes into that? Like, are you training your own terminal models, or is this harnesses on top of the existing foundation models?
Zach Lloyd: So for us it’s a harness on a mix of models. And then, like, the actual capabilities of the app kind of matter, which is interesting. And what I mean by that is, like, Terminal-Bench, it’s not just coding tasks, it’s like all sorts of things that you might do in the terminal. And so we have some intrinsic advantage there by actually being the terminal and not like an app running within the terminal.
And so, like, for example, one of the Terminal-Bench tasks was, like, playing Zork or something, which is like an interactive terminal game. And so we can use the terminal—we can do computer use in the terminal is probably the easiest way to think of it. So just like there’s companies out there doing browser use, we can do terminal use at the layer of the terminal as opposed to at the layer of, like, a web page, which is what the equivalent would be for the browser analogy. And so that helps us do certain tasks on that particular eval that are hard for other harnesses to do.
Sonya Huang: Got it. You recently redid your pricing. What’d you learn about developers and how they want to pay for AI?
Zach Lloyd: Oh my God. So we’re still, like, not fully out of this. Yeah, I mean, I could just explain, like, the whole thing here. So our initial pricing was basically you do a subscription, and you get a fixed amount of AI credits every month. And we priced it so that—this is when we were at smaller scale—if you, like, fully utilized your plan, it would cost us money, but the hope was that on the sort of average utilization that we would make money, right? So it’s like you have a plan that gives people 50,000 credits and most people only use 20,000. You can kind of price it around that.
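As an aside, the flat-rate plan economics Zach describes can be sketched in a few lines of Python. The numbers below are made up for illustration; they are not Warp’s actual prices or costs:

```python
# Illustrative sketch (hypothetical numbers, not Warp's real pricing) of why a
# flat subscription with a credit allowance works on *average* utilization but
# loses money on heavy users.

def monthly_margin(price: float, cost_per_credit: float, credits_used: int) -> float:
    """Profit on one subscriber for one month under a flat-rate plan."""
    return price - cost_per_credit * credits_used

PRICE = 50.0              # hypothetical monthly subscription price
COST_PER_CREDIT = 0.0012  # hypothetical provider cost per AI credit
PLAN_LIMIT = 50_000       # credits included in the plan

# Priced around average usage: a 20,000-credit user is profitable...
assert monthly_margin(PRICE, COST_PER_CREDIT, 20_000) > 0
# ...but a user who fully utilizes the plan costs the company money.
assert monthly_margin(PRICE, COST_PER_CREDIT, PLAN_LIMIT) < 0
```

As Zach notes next, once usage crept upward the average drifted toward the loss-making end of this curve, which is what forced the move to consumption-based pricing.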
What happened was, like, the people just used more and more. And so we got to a point where we were losing more and more money. And so from a company strategy standpoint, we had a choice. We talked to Andrew a bunch about this. Like, we could either kind of play the, like—and, like, we’re growing really fast. Like, the revenue is growing. You know, we’re adding, like, a million in revenue every—it’s since slowed down a little. It was like every five days or something.
And it’s like we could play the game, go raise more money, but the margins were really bad. And so we decided that wasn’t the smart, long term, strategic thing to do. And also not like a race we can win, to the earlier conversation here. We just can’t beat people if the thing is cost. And so we wanted to know, like, are people paying for value? Will they pay if we are margin positive?
And so the way that we have changed the pricing is so that it’s much more consumption based. So you now pay for, like, a base plan of 20 bucks a month, and then you buy credits on top of that. And we ensure that it’s margin positive. You know, in the old world we didn’t want people fully utilizing their AI because it would cost us money. Now it’s much better if people use more AI.
It is more expensive, for sure, and we’ve had a lot of user complaints around that, which sucks. If any Warp customers are listening, like, it’s a bummer. It really does suck, but it’s like we just could not afford to keep subsidizing the way that we were. And all in all, I would say it’s gone pretty well. We’re still growing pretty well, and now it’s like a growth that is sustainable, not like an unsustainable subsidized revenue growth. So tricky thing to do.
Sonya Huang: Would you ever train your own models?
Zach Lloyd: I think we would definitely do, like, what some of our competitors are doing where we would fine tune models and do RL and that type of stuff. It’s hard for me to imagine us competing with training, like, a full frontier level model, just the amount of capital that costs. We do have a ton of interesting data. Like, I think it’s actually a really interesting strategic asset for us in terms of the workflows people are doing in the terminal, how to improve them, how people are interacting with our agent.
So I think it’s likely that we will do some sort of RL, and I think it’s also very likely that we are going to lean more into a mixture of models and more model routing to try to give users the best experience when it comes to sort of latency, cost and quality, which are the three vectors here.
Sonya Huang: And do you see your role as kind of like optimizing that on behalf of users? Do you want to see yourselves as giving all options to users for them to pick?
Zach Lloyd: Yeah. So our philosophy has been, like, make a great default, but then because these are developers, they want control. So we actually have a couple variants of default. So we have a default that’s geared towards, like, efficiency and one that’s geared towards performance. And then after that we give people the raw choice.
The raw choice is a little weird because, like, increasingly we really want to use different models for different things internally, and it doesn’t map that cleanly onto using, say, you know, GPT-5.2 for everything. So it’s a little complicated, but I think the control is actually something that developers like. And so I don’t see us moving away from that right now.
Sonya Huang: Yeah. One of the most interesting parts about where you sit is that you can actually see which models different developers are using. And so I’m curious, in your user base, which models are most popular? Has it evened out a lot, and are there different flavors or personalities of what the different models are good at?
Zach Lloyd: Like, 70 to 80 percent of our user base will use whatever we set our auto to and not touch it. And what we set the auto to currently is it’s a different one for efficient, a different one for performance. It’s a mix of Codex—sorry, GPT-5.2, but it’s related to Codex, and then Sonnet 4.5.
When people are opting into choosing a model, lately, Gemini 3 Pro has been very popular. It’s a really good model. What we will do is we will test different variants in our auto model and see how people respond, how they engage with them. And I think we’ll probably test Gemini 3. I’ve been impressed with it. So yeah, I would say if I had to, like, stack rank, I would probably say, like, the Anthropic models are probably still most popular. And then between Gemini and OpenAI, there’s a decent amount of people opting into each of those.
Sonya Huang: What about Grok?
Zach Lloyd: Grok is not in Warp. It could be in Warp. They’ve reached out a bunch of times to put it in Warp. I’m not at all opposed to putting it in Warp, but every time we put a model in Warp, I would like some concrete benefit to users, because it’s a bunch of work to tune our harness to work well with the model.
Sonya Huang: Hmm, I see. Interesting. Can you say a word more about that harness? Like, what do you do to make your harness good?
Zach Lloyd: So harness is like, how you prompt, what tools you make available, how you manage context. And so, like, the big things that are determining quality of harness are, like, it’s literally like the language of the prompting. It’s the tool set definition. It’s things like handling the context window, so specifically when do you use something like a sub agent where you go out and have something as a separate context window? When do you summarize? When do you truncate?
Like, we have things that have, like—you know, you might run a terminal command that has gigantic output, and you don’t want that all in your context window, but you might want some part of it in your context window. So how do you sort of pick out the right stuff? How do you do RAG, how do you integrate with MCP?
And so there’s just, like, some engineering and, like, alpha that goes into that. The way that you make that good is by measuring. Like, you can start in a pretty naive way and just give it a bunch of prompts. But the way that you make it really good is by measuring. And by measuring you can do it with sort of like a fixed set of evals where we know what the results should be and, like, not all of them should work. And so we have internal evals. You can do it on public benchmarks, and that’s actually been really good for getting our harness to be awesome. It’s just like going through the exercise of making our agent perform well on the public benchmarks.
And then you can do it by looking at, you know, user data. And so we use Braintrust. So there’s various platforms you can use to, like, sort of look for patterns and failure modes in the agent interaction, and then try to tune the harness and sort of replay them as evals. So, you know, that was a big mindset shift for us to get to doing that, but that was a hundred percent necessary to do it all data driven to get something that was good.
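As a rough illustration of one harness trick Zach mentions, a terminal command can produce gigantic output, and you only want part of it in the context window. The function name and limits below are hypothetical, not Warp’s implementation:

```python
# Minimal sketch of context-window truncation for large command output:
# keep the head and tail of the output and elide the middle with a marker.
# (Hypothetical helper; not Warp's actual code.)

def truncate_for_context(output: str, max_lines: int = 40) -> str:
    """Keep the first and last max_lines/2 lines of long output."""
    lines = output.splitlines()
    if len(lines) <= max_lines:
        return output
    head = lines[: max_lines // 2]
    tail = lines[-(max_lines // 2):]
    elided = len(lines) - len(head) - len(tail)
    return "\n".join(head + [f"[... {elided} lines elided ...]"] + tail)

# A 1,000-line command output shrinks to 41 lines for the model:
big_output = "\n".join(f"line {i}" for i in range(1000))
snippet = truncate_for_context(big_output)
```

A real harness would be smarter about *which* part to keep (error messages, stack traces, matches for the user’s query), which is where the RAG and summarization pieces come in.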
Sonya Huang: Yeah. Got it. And do you have your own tab autocomplete models? Is that just not even relevant for your product?
Zach Lloyd: We don’t have that. It’s not super duper relevant at the moment. I think there would be incremental benefit if we were doing tab completion in the terminal for people. And for the hand-editing parts of Warp, by far the more typical use case is code review of an agent’s code rather than typing code. So it would be nice to have, it’s just not an area of high priority for us.
Sonya Huang: Yeah. Got it. Awesome. I’d love to talk a little bit about how you see the future of coding interfaces and see the future workbench evolving. So it seems like very much like you believe there’s a convergence happening between kind of the traditional—call it Microsoft Word IDE GUI approach to writing code, and then the chat with your computer agentic kind of style of, you know, terminal first approach. And those two are starting to merge. What other kind of UI innovations do you think are happening in terms of how people work with coding agents?
Zach Lloyd: I think the biggest change that we’re going to see in the next year is more and more kind of like cloud agents, or we call them [inaudible] agents, where—and this is already happening. Like, we’re investing in this at Warp, where rather than a developer sitting at a keyboard giving a prompt, there’s some system event that triggers an agent to do something. And that system event could be like you have a server that’s crashing, you have a cluster of user reports, someone has reported, like, a security incident against you. And all those things are basically going to serve as context into an agent that gets launched and runs not on some individual’s machine, but somewhere in the cloud.
And so what I think that implies is that you’re going to want the sort of like workbench to become more of an orchestration platform, more of a kind of cockpit for managing not just your own agents, but your team’s agents. I really think it implies you’re going to need, like, a strong team concept because, you know, these things aren’t—it’s not going to be the normal workflow of, like, I’m sitting at my desk, I’m writing a coding change and then I push a PR. It’s like agents are going to push PRs, and then agents are probably going to leave an initial round of reviews on PRs, and they’re going to file tasks in your task tracking system.
And so all this stuff needs tracking and it needs coordination, and you need different ways of integrating it into existing systems. And so whoever—like, this is like what Warp’s—probably our biggest product focus for next year is on this type of evolution off of just, like, interactive agents into the cloud agents. Because I think it’s going to be pretty transformative.
Sonya Huang: And I imagine that’s like a massive infrastructure push to be able to kind of, you know, run up and spin up.
Zach Lloyd: It is. Yeah. So, like, for us, it’s turning us much more from, like a product to a platform. And so the way that we think about building this out is building it on different layers of the stack where you have, like, an agent SDK, you have agent hosting if you want it. So, you know, if you’re a smaller company and you don’t want to set up spots in the cloud for your agents to do their work, Warp will host that for you.
There’s a whole category of startups that are going into this business of agent hosting, which I think is really interesting. It speaks to, like, this is like a real thing that’s happening. There’s an API layer for once you have the agent running, how do you get its status and how do you maybe take it over or see its progress? Where does it write its logs? And then there’s, like, a management layer of, like, what are all these things doing? What states are they in? What’s the log—like, who started them? When? Did they produce PRs?
And so I think it’s cool because a lot of the most impactful ways to use these agents will just not have a person driving them. I don’t think that this means that the person driving the agent is going to go away, either. I think that the sorts of tasks that are going to start with these ambient cloud agents are more like toil tasks or things that are one-shottable, and I think harder, more interesting software engineering is still going to be done by a developer at their workbench. But yeah, this is how I see things evolving in the next year.
Sonya Huang: That’s awesome. And I think one of the more important scaling law charts is the meter, like, how long can the agent run for.
Zach Lloyd: Yeah.
Sonya Huang: What are you seeing in terms of how kind of long horizon these agents can be?
Zach Lloyd: I don’t know. At its max, I would say, like, doing real coding tasks for us now is, like, 20, 30 minutes, something like that. Like, you can have it run longer, just to be clear, but you …
Sonya Huang: Yeah, you can have it run in circles.
Zach Lloyd: The problem is it will start going in circles still. There’s still context limitations. And, like, it’s a costly proposition to have agents running without people checking in and guiding them. And just by far, you get the best results when the agent is really steered. So when you do, like, an upfront plan with the agent, when you check in on the agent’s work. And so yeah, I think this will just keep on going up. But the—I don’t know, it’s like hours and hours of work. It needs a really clearly defined task in order for that to even make sense to me. It needs to be doing some big code migration or some big, big task.
Sonya Huang: Got it. What do you think the product might look like when it’s good at kind of being this cockpit to manage, you know, swarms of these agents?
Zach Lloyd: Yeah, so we’re building this out right now and, like, we are having all sorts of internal debates on whether it should be one product or two products. The way that we’re doing it right now, and how I think other people will probably approach this is a sort of like area of our app which is about agent orchestration.
The reason I wonder if it should be a whole separate product for us is because, like, you know, it very much feels more web-centric to me, and we can make Warp work on the web, but it’s not the primary interface. It feels like it potentially has, like, a different user some of the time. But the advantage of having it bundled into Warp is that it makes the handoff from one of these cloud tasks to a developer extremely seamless, and that’s a very, very common workflow.
Like, we have this thing running in our Slack, in our Linear, and so what will often happen is you’ll tag something in Slack and you’ll be like, you know, can you make this fix for me? Change this button position or whatever. And right now you need a developer to, like, tie the loop on that. So it’ll do the work in the cloud, and then you’ll bring it onto your local machine. And it’s very nice to have that be in one environment where you can just keep working on it seamlessly.
So short answer is I don’t know, but it’ll be a little bit more like a task management UI. I don’t think it’s going to quite, like—I know there’s thoughts of, like, is task management the primary primitive for developers to be working with? I don’t buy that either. Like, I don’t think every developer’s going to be doing all their work out of Linear or Jira, but I do think there’s some aspect of seeing what the agents are doing across various systems that developers are going to want.
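The cloud-to-local handoff Zach describes (tag a task in Slack, an agent works on it remotely, a developer pulls the result down to finish it) can be sketched as a toy state machine. This is purely illustrative: every state, field, and method name here is invented, not Warp’s actual design.

```python
# Toy model of the cloud-agent -> local-developer handoff described above.
# All names are hypothetical; nothing here reflects Warp internals.

from dataclasses import dataclass
from enum import Enum, auto

class State(Enum):
    QUEUED = auto()         # tagged in Slack/Linear, waiting for an agent
    AGENT_RUNNING = auto()  # agent working on it in the cloud
    NEEDS_REVIEW = auto()   # agent done; branch ready for a human
    LOCAL = auto()          # developer pulled it onto their machine

@dataclass
class AgentTask:
    prompt: str            # e.g. "change this button position"
    state: State = State.QUEUED
    branch: str = ""

    def start(self) -> None:
        self.state = State.AGENT_RUNNING

    def finish(self, branch: str) -> None:
        self.branch = branch
        self.state = State.NEEDS_REVIEW

    def hand_off(self) -> str:
        # The "tie the loop" step: the developer checks out the agent's
        # branch and keeps working in the same environment.
        assert self.state is State.NEEDS_REVIEW and self.branch
        self.state = State.LOCAL
        return f"git checkout {self.branch}"
```

The point of keeping it in one product is visible in `hand_off`: the transition from cloud work to local work is a single step rather than a context switch between tools.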
Sonya Huang: Yeah. Awesome. I’d love to close by maybe talking a bit about the state of agentic development, and how the software engineering market will play out. Sound good?
Zach Lloyd: Sure.
Sonya Huang: I guess maybe for starters, where do you think we are in terms of, like, the frontier model capability? Like where are the models good today? Where are they not? Are they still producing all the errors? Like, where are we?
Zach Lloyd: So I’m constantly using this stuff. I’m somewhat biased because I use it on Warp’s code base, which is, like, a very custom, big Rust code base. But I think that’s still an interesting perspective. The agents can do what I would think of as, like, medium complexity tasks pretty well if you give them a bunch of guidance. They can’t do whole big projects—at least we haven’t had success doing that. I don’t trust them to make, like, very fundamental architecture decisions for us.
So it’s like, you want pretty constrained tasks, but they’re well beyond doing trivial tasks like change the button color, change the text. Like, they can make apps. They’re very good at 0 to 1. They can solve, like, kind of hard bugs. We have a medium-sized feature—like, I don’t know what a good example would be. Like, I was adding a new slash command to Warp the other day, and it’s like I just tagged the agent to do that, you know, in Slack, and it made a 300-line PR and it was basically right.
And so I think there’s a bunch of headroom at the upper end. If I had to put it on a scale of zero to ten, I think we’re at, like, a six maybe. So I think it’s real, it’s game changing for how people work, but it’s not at the level of doing what a full time engineer on a hard product needs to do.
Sonya Huang: And where do you think the bottlenecks are? Like, is it just, you know, the models don’t have enough context? We need to get better at giving them instructions? Is it just we need to keep scaling these things up? Like, what are the biggest bottlenecks?
Zach Lloyd: So I think context window is still a big issue. And even with the bigger context windows, having attention over the whole context window in reasonable ways is hard. I think there’s an issue of it always having to relearn everything. Like, memory is not there—it just seems like a slow, inefficient process of repopulating the whole thing with a bunch of files. Like, there’s no continuous learning with it. So it’s like this big stateless thing where you’re kind of always starting from scratch and have to fill it up before you can set it loose. That sort of stinks. I would like to see that solved.
There’s still the question of how you use it effectively as a developer. We’re very early. Like, this stuff didn’t exist a year ago, and so how should you be doing context engineering? How should you be setting up your project so that agents can work well with it? That’s a problem. If you were to look across how people on our team use Warp to build, it’s, like, high variance. And that’s not great because it’s like we have very, very rigorous standards around writing code and almost no standards—I mean, we’ve tried—around how to use the agents. No one has been taught how to use the agents. There aren’t even agreed best practices on how to use the agents. And so I think that’s pretty nascent.
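One common "context engineering" practice the passage gestures at is keeping agent guidance in checked-in files and assembling a bounded context from them before every run, since the agent starts stateless each time. A minimal sketch, with invented file names and an arbitrary character budget:

```python
# Toy sketch of assembling agent context from checked-in guidance files.
# The file list and budget are assumptions, not any tool's actual behavior.

from pathlib import Path

GUIDANCE_FILES = ["AGENTS.md", "ARCHITECTURE.md", "CONTRIBUTING.md"]

def build_context(repo: Path, task: str, budget_chars: int = 8_000) -> str:
    """Concatenate whatever guidance files exist, truncate to a budget,
    then append the task. Because agents have no persistent memory, this
    repopulation step runs from scratch on every invocation."""
    parts = []
    for name in GUIDANCE_FILES:
        f = repo / name
        if f.is_file():
            parts.append(f"## {name}\n{f.read_text()}")
    guidance = "\n\n".join(parts)[:budget_chars]
    return f"{guidance}\n\n## Task\n{task}"
```

Checking a file like this into the repo is one way to reduce the "high variance" Zach mentions: every teammate's agent runs start from the same guidance instead of whatever each person happens to prompt.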
Sonya Huang: Yeah, got it. My experience, whenever I try to vibe code a little bit, is that the coding models still produce a lot of errors.
Zach Lloyd: Yes. That’s true.
Sonya Huang: Is that getting better over time? And that seems to me in the category of stuff of, like, if you can verify it, like, did it work or did it not, you should be able to RL it. And it’s like, where are we today in terms of the state of how frequently are errors coming out? And can we actually RL that, or am I misunderstanding something?
Zach Lloyd: No, I think that they’re still definitely producing errors. It’s interesting. So it’s pretty infrequent that the agent at this point will produce something that doesn’t compile for me, which I think is an interesting milestone. So I don’t know, not that long ago, four or five months ago, that was a problem, like, getting to a compiling version of the thing. It compiles for me about a hundred percent of the time right now, which is amazing. It produces stuff with bugs and errors relatively frequently. I don’t think it has a good way of closing the loop in terms of does the thing work? And so I think some version of browser use or computer use where the agent can not only make the change, but verify the change from the user’s perspective, not the code perspective, is pretty important.
Sonya Huang: Are people doing that yet?
Zach Lloyd: Yeah, we’re working on stuff like that. All of the model providers have beta versions of computer-use and browser-use APIs, for sure. Computer use we’re looking at. Like, I would be surprised if this wasn’t a thing. And I think it becomes even more important of a thing pretty soon as more work is done remotely, because the real pain in the ass with the remote work is verifying that it works from a user perspective. And so I think that’s a big part of it. And then I think if you have that loop, it’s probably easier to do RL and get to things that are behaviorally correct, not just compile-correct.
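The "closing the loop" idea above, checking not just that the change compiles but that it behaves correctly from the user's perspective, can be sketched as a two-stage verifier. All function names are invented; the behavioral check is abstracted to any callable (in practice it might drive a browser), since the conversation doesn't specify an implementation.

```python
# Hypothetical two-stage verifier: static (does it build?) then behavioral
# (does it work for the user?). Names and structure are illustrative only.

import subprocess
from dataclasses import dataclass

@dataclass
class VerifyResult:
    compiled: bool
    behavior_ok: bool

def compiles(build_cmd: list) -> bool:
    """Static check: run the build command and inspect the exit code."""
    return subprocess.run(build_cmd, capture_output=True).returncode == 0

def behaves(check) -> bool:
    """Behavioral check: any zero-arg callable returning truthy on success,
    e.g. a browser-driven assertion that the button actually moved."""
    try:
        return bool(check())
    except Exception:
        return False

def verify(build_cmd, behavior_check) -> VerifyResult:
    ok = compiles(build_cmd)
    # Only run the user-perspective check if the build passes.
    return VerifyResult(ok, ok and behaves(behavior_check))
```

The `behavior_ok` signal is the piece Zach says would make RL easier: it rewards changes that are behaviorally correct rather than merely compile-correct.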
Sonya Huang: Yep. Yep, absolutely. Okay. Well, looking forward to that. And then I guess, do you think that we’re going to reach a super intelligence moment here, like, where the models are better at coding than the best human coders?
Zach Lloyd: I have no idea.
Sonya Huang: No idea?
Zach Lloyd: What I do think is going to happen—and I don’t know if this is super intelligence—is that, like, coding will be solved by models. And what I mean by that is, like, I think that the limiting factor we’re going to come up against is just the expression of intent from humans in terms of, like, what do you want built? How do you build it? Like, how do you express that clearly? Like, English is ambiguous.
Sonya Huang: Isn’t coding the truest expression of intent, though?
Zach Lloyd: Yeah, but the problem is we’re moving from a world where people speak in code to one where they just speak in English to try to build apps. And so we’re, like, reintroducing ambiguity, because developers, people building apps are no longer actually directly expressing what they want. They’re going through this translation layer of saying to a model what they want and then the model produces the code. So it’s like an interesting step backwards there in a sense, but it’s also way, way, way more efficient to do it this way.
But yeah, I think we’ll get to a point where you actually don’t need to be on the frontier to have something that produces code that is as well matched to a person’s intent as possible. And I think that actually is an interesting thing from a competitive perspective. I wouldn’t want to be in the API business for coding tokens because I do think, like, at some point you just won’t need to be on the frontier, and you’re not going to be able to charge a huge margin on top of it. Which is why I think actually you see Anthropic and OpenAI and Google going so hard at the application layer, because there’s huge risk at the API layer. It’s just that, for this vertical in particular, I think things are basically solved within a few years.
Sonya Huang: Yeah.
Zach Lloyd: I don’t know that. That’s just—I’m prognosticating.
Sonya Huang: That’s awesome. Do you think that people will ever—are people already thinking about the amount they spend on coding tools being, you know, the replacement of what they would be spending on, you know, hiring a few software engineers, or are they thinking about [inaudible] as buying a tool still?
Zach Lloyd: So when we talk to enterprises, it is still viewed by and large as, like, a productivity boost, and that’s the way that it’s being evaluated. In fact, it’s really hard to even measure, like, the effectiveness of this stuff. And so it tends to fall back to subjective measurements from engineers. Like, do you feel like you’re getting a bunch of value out of this or not? Or maybe you look at, like, DORA metrics. It’s really hard to know. So I don’t think that they’re viewing it yet, by and large, at least as labor spend. And I think today, if you pitch, like, here’s a $200,000 agent to replace your $200,000 engineer or whatever, they would be like, “What? Like, no. Like, not even close.” But I would expect that this starts to change.
Sonya Huang: What do you think will change that?
Zach Lloyd: It’s a great question. I think it’s increasing the automation use cases. Or maybe another way of thinking of it is, like, if companies start to launch products without engineers, I think that will be, like, a major proof point. And, to be clear, I don’t want this to happen. I’m, like, an engineer at heart and I don’t want people losing their jobs, but there will be projects, products that are launched where there’s very, very minimal engineering involved. And you’re going to look at the spend for that and be like, “Okay, this was the cost of delivering the product.” And you’re going to be like, “Okay, with and without engineers, what’s that like?” So I think you need more of that to happen. I don’t think that’s happening very much yet.
Sonya Huang: Got it. And then maybe last question. I’d love to chat about how you see coding as an art form, and therefore, you know, your role in the world evolving. You wrote this blog post I loved back in 2023, I think. Everyone should go give it a read. It’s called—I think it’s about the future of productivity interfaces being ask and adjust. Maybe say a word on that, and how you think, you know, three years in, how you think that’s evolved.
Zach Lloyd: Yeah. So I wrote this, like, pretty shortly after ChatGPT came out, and we started, like, trying to deeply integrate it into Warp. And the idea was—and this sounds really obvious right now, but the way that, like, productivity interfaces have always worked in the past was that they were geared towards hand editing, right? And by hand editing, it could be like you go into Figma and you’re drawing vectors, or you go into Google Sheets and you’re entering cells, or you go into VS Code and you’re typing code. And my thesis in that article was, like, that’s going to change to a point where the primary interface is—I didn’t have the word “agentic” at the time, but it was like AI based, where you would ask the app to do the thing for you, and then you as a human author would be responsible for adjusting. And adjusting might mean, like, reprompting, or if reprompting failed, it might mean, like, going into the prior hand-editing interface and using that to, like, complete your change. And I kind of think that’s where we’re at right now for a lot of, like—especially for coding, it’s really transitioning to you start by asking for something and then you adjust it.
And another thing I said in that article, which I don’t know if it’s right or not, was that I was thinking about whether you’re going to be able to get rid of the adjustment piece. And my thesis was that the area where we’re going to need the adjustment piece the least is in areas where there’s, like, a lot of acceptable solutions. So that would be like creative domains. You know, if you ask for an image of something, there’s probably a thousand images that might work for you. And so you can just reprompt, reprompt, reprompt until you get what you want. Whereas for something like code or a spreadsheet, where there’s one thing that needs to be right, you would have to keep that ability to, like, get it perfect with a hand-editing interface. So that was the thesis. I think it wasn’t bad. I think it’s held up okay.
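The "ask and adjust" loop from the blog post being discussed can be sketched as a few lines of control flow: ask first, adjust by reprompting, and fall back to hand editing only when reprompting fails. This is a toy paraphrase of the thesis, not anything from the post itself; all parameter names are invented.

```python
# Toy sketch of the "ask and adjust" interface thesis. The callables are
# hypothetical stand-ins: ask(prompt) -> artifact, accept(artifact) -> bool,
# hand_edit(artifact) -> artifact.

def ask_and_adjust(ask, accept, hand_edit, max_reprompts=3):
    artifact = ask("initial request")
    for _ in range(max_reprompts):
        if accept(artifact):
            return artifact                      # reprompting was enough
        artifact = ask("adjust: fix what's wrong")
    # Reprompting failed: fall back to the prior hand-editing interface
    # to get the one thing that needs to be right exactly right.
    return hand_edit(artifact)
```

The thesis about creative versus exact domains shows up in which branch terminates: for images, `accept` is permissive and the loop almost always returns early; for code or spreadsheets, the `hand_edit` fallback has to stay available.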
Sonya Huang: Not bad. Yeah, I guess you didn’t coin “agentic editing” back then.
Zach Lloyd: No.
Sonya Huang: Yeah, the thesis was spot on.
Zach Lloyd: You want to know something we coined at Warp?
Sonya Huang: What did you coin?
Zach Lloyd: Which we should have, like, trademarked, is “agent mode.” So we were the first product to launch a branded thing called agent mode. And if you look this up on ChatGPT and just ask, like, where did this come from? It came from Warp. And now that’s, like, a very common way of describing the feature. I wish we were getting some kickbacks for that or something.
Sonya Huang: Totally. I love it. Well, thanks so much for coming on to share what you’re doing, and your observations on the coding market as a whole. It’s such a white hot competitive market, and the way that you think the terminal will be the workbench of the future and how it’s going to evolve. It was awesome to have this chat today. Thanks, Zach.
Zach Lloyd: Thanks, Sonya. It was awesome to be here.