
Zapier’s Mike Knoop Launches ARC Prize to Jumpstart New Ideas for AGI

As impressive as LLMs are, the growing consensus is that language, scale, and compute won’t get us to AGI. Although many AI benchmarks have quickly been saturated to human-level performance, there is one eval that has barely budged since it was created in 2019. Google researcher François Chollet wrote a paper that year defining intelligence as skill-acquisition efficiency: the ability to learn new skills as humans do, from a small number of examples. To make it testable he proposed a new benchmark, the Abstraction and Reasoning Corpus (ARC), designed to be easy for humans but hard for AI. Notably, it doesn’t rely on language. Zapier co-founder Mike Knoop read Chollet’s paper as the LLM wave was rising. He worked quickly to integrate generative AI into Zapier’s product, but kept coming back to the lack of progress on the ARC benchmark. In June, Knoop and Chollet launched the ARC Prize, a public competition offering more than $1M to beat and open-source a solution to the ARC-AGI eval. In this episode Mike talks about the new ideas required to solve ARC, shares updates from the first two weeks of the competition, and explains why he’s excited for AGI systems that can innovate alongside humans.

Transcript


Mike Knoop: Right now, what I see happening is there's this mythical story of a very bad outcome once we get to superintelligence, right? It's a very theory-driven story. It's not grounded in empirical evidence; it's based on reasoning our way to this outcome. And I think the only way we can truly set good policy is to look at what the systems can and can't do, and then the regulator makes decisions at that point about what it can or can't do. I think anything else is cutting off potential really, really good futures way too early.

Sonya Huang: Hi, and welcome to Training Data. We have with us today Mike Knoop, co-founder of Zapier. Mike has recently stepped up to co-found and sponsor the ARC Prize, one of the most distinctive benchmarks in AI: it measures a machine's ability to truly learn new things versus just parroting patterns in the training data. We're excited to ask Mike for an update on how things are going with ARC Prize two weeks in, and to hear his views on why we need radically different approaches and benchmarks to achieve true general intelligence.

Pat Grady: Mike, thanks for being here today. So we’re excited to talk about the ARC AGI initiative. Before we get into that, I’d love to spend a few minutes on your background at Zapier because, I mean, Zapier has emerged as probably one of the best examples of what an existing application company can do with the power of AI. And the way that you guys were so early to that and the way that it’s now kind of interwoven in the product has been really interesting to watch. So maybe can you just say a few words on what is Zapier, and what has your approach to AI been at Zapier?

AI at Zapier

Mike Knoop: Yeah, Zapier is a workflow automation platform. We support 6,000 different integrations, everything from Salesforce to Gmail to basically any SaaS software you can imagine. And I think the unique thing about Zapier is that it's intended to be very easy to use for non-technical users. The majority of Zapier customers would not self-identify as programmers or engineers or technical people. I kind of think of myself as an engineer, and I find lots of interesting use cases for Zapier, but the large majority of our users don't.

And I think what's quite special about Zapier, and why people tend to fall in love with it, is the feeling of power and leverage you get as a non-technical user being able to have software do work for you. In an interesting way, that's the exact same promise of AI, right? That's what people want from AI: software that's just going to do more work for them. So in many ways, the mission of Zapier and the mission and purpose of AI intersect.

And I've been, I guess, call me AI curious, all the way back to college. I gave an all-hands at Zapier, I can't remember what year, when the GPT-3 paper came out, and showed that to the company. So I've been tracking and following along with the progress.

But it really wasn't until January of 2022, when the Chain of Thought paper came out. I saw that and it really surprised me, because I thought I had priced in everything that AI language models could do up to that point. And this idea of "let's think step by step," this chain-of-thought technique, treating these language models as tools for reasoning instead of just a one-shot completion or chat engine, felt very special, and something I think most people didn't expect they could do, even though the technology had been out there for over a year at that point.

And so that moment caused me to give up my exec team role. I was running all of our product engineering at the time, and I went back to basically being an individual contributor at the company, as an AI researcher alongside my co-founder, Bryan. I've talked more about that journey elsewhere, but I think that's what caused Zapier to be relatively early in terms of AI.

Pat Grady: What are some of the things that you’ve put into the product that you’re most proud of at Zapier in terms of AI features?

Mike Knoop: You know, at this point I'd say there are probably two main places where we've gotten a lot of value from AI. The first is that over half the company now, at the individual level, uses AI on a daily basis. And I know this because we're actually measuring our own company's usage of Zapier's platform. So over half the company is building Zaps, building automations, that use an AI step, like a ChatGPT step in the middle, either to do content generation or data extraction over unstructured text, all sorts of really interesting use cases we can talk about. In fact, one of the top internal use cases is probably getting us about a 100x labor enhancement rate right now, which has been phenomenal.

Pat Grady: And what is that? 

Mike Knoop: Yeah, you want to talk about it?

Pat Grady: Yeah. I mean, 100x improvement. Yeah, I want to talk about that. 

Mike Knoop: I think it's our personal high watermark for what we've been able to achieve using AI internally from an operational perspective. Zapier has these things on our website called Zap Templates. They're effectively recipes that help users figure out what Zapier can do and help them get started. These templates have all historically been handmade, because they require a bit of right brain and a bit of left brain. They're very creative: you have to inspire the end user, the would-be customer, about what Zapier could do for them, what the outcome is, what ROI they might get. And then there's also a very technical side: they have to be crafted in an exact way, mapping JSON fields from one integration app to another to make sure it actually works. Together, that's what helps users get started and activates them in the product.

And we had a backlog of maybe a million of these things that we knew we wanted to build but hadn't yet, because they take so much effort; the rate of production for contractors was about 10 a day up until last summer. And we had a member of, I think it was our partner marketing team, with a background in freelance writing, who built a system of several Zaps using OpenAI. Whenever a new integration launched on Zapier, it would automatically try to figure out the most interesting Zap Templates that could be built, write the inspirational use case behind each one (there are millions of these things already today, so there's lots of training data), and then also do the exacting field mapping as well.

And we moved the human in that workflow from the do loop to the review loop. Instead of contractors generating templates, thinking really hard about what the use cases should be and building the field mappings, they're now basically reviewing output from the system in a spreadsheet and saying: yes, this one's good, this one's bad. And the funny thing is, because the cost of generating these things is so low, we don't even try to fix the bad ones. We just throw them away and generate another stack on top. That rate of production is about 1,000 a day. So we've gone from 10 a day to 1,000 a day per contractor, and we've been chipping away steadily at that million-template backlog while keeping up with the launch of new app integrations on Zapier.

I think one of the main things that showed me is where you want to look in a business if you're thinking about how to deploy AI: top of funnel or bottom of funnel, something really close to an important conversion rate for your business. Then, if you can identify any high-volume manual work your organization does where a human is doing the work, those are opportunities to introspect and ask: is there an opportunity to get that human out of the do loop, craft a system that does the work, and put the human in the review loop? The review is still quite needed at the maturity level of the technology today, but it's phenomenal from an ROI perspective. A rough sketch of the pattern is below.
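As a minimal sketch of that do-loop-to-review-loop pattern (the generator, the acceptance check, and all names here are illustrative stand-ins, not Zapier's actual system):

from dataclasses import dataclass, field

# Hypothetical sketch: the model drafts candidates in bulk, a human only
# accepts or rejects, and rejected drafts are discarded rather than fixed.

@dataclass
class Candidate:
    title: str                                          # inspirational use case
    field_mapping: dict = field(default_factory=dict)   # exacting app-to-app mapping

def generate_candidates(integration: str, n: int) -> list[Candidate]:
    """Stand-in for an LLM call that drafts n template candidates."""
    return [Candidate(title=f"{integration} template #{i}") for i in range(n)]

def human_review(c: Candidate) -> bool:
    """Stand-in for the reviewer's yes/no call in a spreadsheet."""
    return bool(c.title)  # placeholder acceptance check

def produce_templates(integration: str, batch: int = 1000) -> list[Candidate]:
    drafts = generate_candidates(integration, batch)
    # Bad drafts are cheap, so throw them away instead of fixing them.
    return [c for c in drafts if human_review(c)]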

Sonya Huang: Are there any metrics you can share on the impact Zapier AI has had on the overall Zapier business?

Mike Knoop: The biggest one today is we're just about to hit a run rate of 10 million AI tasks per month. And I would love to be shown wrong or corrected if you know examples, but I think at this point Zapier might be the biggest automated AI platform in the world, in the sense that there are a lot of researchers, entrepreneurs, and builders trying to build these agentic AI systems where the AI works without a human in the loop. And at 10 million tasks a month, Zapier may be the biggest example of that in the world right now.

What is ARC AGI?

Sonya Huang: Really cool. Want to talk about ARC AGI?

Mike Knoop: Let’s do it. 

Sonya Huang: Maybe start with a recap of what is ARC AGI? Why did you and François set out to establish this prize? 

Mike Knoop: Yeah, so this was a follow-on to my AI curiosity. The reason I gave up my exec role back in 2022 was that I wanted to know for myself: are we on the path to AGI or not? I felt it was very important to know for Zapier's mission, but also, just as a human, I was very curious. Is this going to happen? There's definitely some interesting scaling happening; is that sufficient to get to what I naively had in my head, this superintelligence, AGI?

And surprisingly, what I learned is the answer is no. I first heard about François Chollet, who is my co-founder on ARC Prize, and got exposed to his research back during COVID, in 2020. He did another podcast where he was explaining his 2019 paper, On the Measure of Intelligence, in which he tried to formalize a definition of what AGI actually is.

I thought it was interesting at the time, but I kind of parked it, with lots of other stuff going on at Zapier and our own AI product building. As I got more into building with AI, I built my intuition of what language models could do and what the apparent limits were, and I started getting more into AI evals, trying to understand where the limits were and what we could expect from a product-building perspective. Where are our products going to tap out? Where should we invest our engineering and research effort versus just waiting for the technology to keep scaling and maturing?

And the thing I found was that most AI evals were saturating up to human-level performance, and progress was accelerating. When I went back to look at the ARC eval from 2019, I expected to see a similar trend. Instead, what I found was basically the opposite: not only had it not reached human performance yet, it was actually decelerating over time. And this was supremely surprising to me.

And maybe it's worth defining: we use this term AGI, but what's the actual correct definition of it? There are probably two schools of thought. One school of thought I see is that AGI is undefinable and we shouldn't even try; this is a quite popular perspective. The other school of thought is that AGI is a system that can do the majority of economically useful work humans do. This was popularized by OpenAI and the Microsoft deal; it's actually in their deal together that once this is achieved, OpenAI retains all the future IP. I think Vinod Khosla might actually get credit for coining that definition, but because of OpenAI's success it has become accepted by a lot of people as a target and goal we should shoot for.

And I think it's a fine goal, by the way, and current model architectures may be within spitting distance of it. But if it's a true goal, I think it says way more about what the majority of humans do for work than about what AGI actually is.

François defines AGI as the efficiency of acquiring new skills. That's it. And here's a quick thought experiment you can use to grok this. We've had AI systems for many years now, five plus years, that can beat humans at games like Go, chess, poker, even Diplomacy. And the fact remains that you cannot take any one of the systems built to beat one of those games and simply retrain it, with new data and new experience, to beat humans at another game.

Instead, what researchers and builders and engineers have to do is go back to zero. They have to tear it all down and rethink new algorithms, new architectures, new ideas, of course new training data as well, and often massive new scale, in order to beat that next game.

And yet this is in complete contrast to how you two learn, right? I could sit you both down here and teach you a new card game in probably about an hour; I could probably show you a new board game and get you up to proficiency within a couple of hours. That fact is highly representative of what makes you generally intelligent: your ability to very quickly and efficiently gain skill in order to accomplish some open-ended or novel task you've never encountered before in your life. And that's what's special about ARC.

So ARC AGI is an eval that tries to take that definition and actually measure it. It was designed specifically to resist memorization of the benchmark, which is very different from most other AI evals out there. Every task is completely novel, and there's a private test set that no one has seen outside of a handful of people who have taken it to verify that all the puzzles are solvable. That degree of novelty, of never having been seen before, is what makes ARC a really strong benchmark for distinguishing between narrow AI, which can be beaten largely through memorization techniques, and AGI, a system that can very rapidly and efficiently acquire skill at test time.
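For reference, the public ARC tasks are distributed as JSON files (for example in the fchollet/ARC repository on GitHub): each task has a few train input/output grid pairs plus one or more test inputs, where a grid is a list of rows of integers 0 through 9. A minimal loader, with an illustrative file path:

import json

def load_task(path):
    # Returns {"train": [{"input": grid, "output": grid}, ...],
    #          "test":  [{"input": grid, ...}, ...]}
    with open(path) as f:
        return json.load(f)

task = load_task("ARC/data/training/0520fde7.json")  # path is illustrative
print(len(task["train"]), "demonstration pairs")
print(task["train"][0]["input"])  # e.g. [[1, 0, 0, 5, 0, 1, 0], ...]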

What does it mean to efficiently acquire a new skill?

Pat Grady: What is the definition of efficiency? I imagine there's a compute component and a data component. What's the definition of an efficiently acquired new skill?

Mike Knoop: Yeah. I'll probably do a bad job trying to summarize François' research; if you want to read more, his On the Measure of Intelligence paper is the source of truth for all of this stuff. It's really, really good. And before I get to the answer, one other important thing to see is that ARC has been unbeaten since 2019. I think its endurance to date is probably the strongest empirical evidence that the underlying concepts of the definition are correct, which is why it's worth paying attention to and why it's such a special eval and special body of research.

I think François would describe efficiency as the ability of a system to translate from core knowledge priors to attacking the search space, or task space, around it. A weakly generalizable system is only going to be able to take on tasks very adjacent to the core knowledge it was trained on, whereas a highly generalizable system will have a much larger field of tasks and novelty that it can attack and do effectively with a small amount of training data. And that's what we hope to see with the eventual solution to ARC AGI: someone beats it, with the goal being 85% on the eval.

Today, state of the art is, I think, 39% as we record this. What's special is that if someone can actually beat ARC at 85%, that would mean they've created a computer program that can be trained on a very small set of core knowledge priors, things like goal-directedness, objectness, symmetry, rotation, things that emerge very early in childhood development, and can recombine and synthesize those priors into new programs to solve, with exacting accuracy, tasks that the system has never seen before and was never exposed to in its training data. That'll be a really important thing, particularly at the application layer, where the number one problem today is hallucination, accuracy, and consistency, which lowers trust, which lowers deployment of real AI right now.

The constraints on competing for ARC Prize

Sonya Huang: You have some peculiar rules for competing for the prize. I think there's a limit on how much compute you can use, you can't use the internet, and I don't know if you can use GPT-4 and closed models. Why put those limits in place?

Mike Knoop: Yeah, so the two big ones, you're right: the competition sits on Kaggle, and Kaggle enforces no internet, and you have limited compute. Specifically, you get one P100 for 12 hours. And no internet means you can't use the frontier closed models that are only available through APIs, like Claude Sonnet or Gemini or GPT-4o.

Maybe I'll take them in order; I think the compute one is more interesting. The reason for the compute limit is to target efficiency first and foremost. If there weren't any compute limit at all, then you could simply define AGI as a system that can acquire skill, with no degree of efficiency attached. And if that were the goal, a system could brute force basically every possible program, think through every possible future outcome, generate every possible ARC-style puzzle, and use that to win the challenge.

And we know that's not what happens in human general intelligence. You can read more in François' paper about why, but the way I think about it is: you can introspect while you're taking the ARC puzzles and see that when you're trying to solve one, you're not brute forcing every possible transformation and trying to apply it to the test. Instead, you're using your intuition and your prior experience to identify maybe three, four, five possibilities for what the pattern is, and then you check them in your head.

And I think this shows the sort of efficiency humans have: rather than brute forcing every possible solution and checking it, there's something far more efficient going on. So the compute limit forces researchers to reckon with that definition.

Now, it's worth acknowledging that we don't know exactly how much compute is necessary to beat ARC yet, and I expect we'll keep upping the compute bar over time. For example, we already more than 2x'd it from prior versions of the competition: you previously got somewhere between two and five hours on the GPU, and we bumped that up to 12. Interestingly, all of the state-of-the-art techniques are maxing out that 12 hours as well. So I do expect we'll continue to increase it over time, but I think it's an important tool to force the generality of the full solution we're looking for.

And the no-internet rule is more of a practical one. We're trying to reduce cheating, contamination, and overfitting, and avoid leaking the private test set, and largely just increase confidence that when we reach the 85% grand prize mark, someone has actually beaten ARC, so we can say with some authority and confidence that it's a true statement.

One of François' goals for ARC Prize is to establish a public benchmark of progress, or maybe the lack of progress, towards AGI, and have it be a trusted public tool that policymakers, students, entrepreneurs, venture capitalists, employees, everyone can look at to get a sense of how close or far we are from this important technology existing. And then use that insight to help drive more AI researchers to work on exploring new ideas again, which is something that has unfortunately fallen out of favor in the last several years as LLMs have taken off.

What approaches will succeed?

Pat Grady: What have you seen, or maybe what do you expect to be true, about the efforts that are successful, or more successful, toward ARC AGI? What makes them different from what we're seeing out of the frontier models and the big research labs?

Mike Knoop: Yeah. So that gets into the details of how an LLM works, because that's the bet most frontier AI labs have been making over the last several years: we're going to scale up language models, and more scale, more data, is going to get us AGI. And even though that's the dominant story, I actually don't think it's what most of the AI labs believe internally; most of them are working on new ideas.

So I think there's an interesting story there. But it is definitely in their interest to promote a very strong narrative of scale is all you need, don't compete with us, we're just going to steamroll you. There are real competitive dynamics that have emerged in the market that are, unfortunately, shaping a lot of attention and investment away from exploring new ideas. And if it is true that new ideas are needed, which I believe, and I think ARC AGI and ARC Prize show that at least some new idea is needed, then due to those competitive dynamics we're headed in the wrong direction, right?

All the frontier research has basically gone closed source. The GPT-4 paper had no technical detail shared in it; the Gemini paper had no technical detail shared on its longer-context innovation, things like that. And this is in direct contrast to the history of how we even got here, right? The chain of research that led from Ilya's Sequence to Sequence paper at Google, out to Jacobs University, back to Google, then to Alec Radford, and back to Ilya at OpenAI. There's a six or seven year chain of research that only happened because of open sharing, open progress, and open science. And I think it's a bit unfortunate that we don't really have that right now, somewhat just due to the market dynamics and the commercial success of language models pushing a lot of that frontier research closed. So yeah, one of the goals of ARC Prize is to help counterbalance those things. You were asking about what the difference in solutions might look like?

A little bit of a different shape

Pat Grady: What you said resonates, because it seems like a lot of the foundation model companies are going down very similar, somewhat clearly defined paths. And I'm sure that internally there's all sorts of work being done to find the next breakthrough in architecture. But in terms of what's working today, they're all on fairly similar paths.

Mike Knoop: It’s all based on LLMs. 

Pat Grady: And I imagine that what works for the sake of ARC AGI is going to be a little bit of a different shape. And I’m wondering if you’re starting to see what shape that may take, and have a sense for what may be different about this more general architecture than what we’re seeing out of the foundation models?

Mike Knoop: Great. So I think a useful shortcut for thinking about language models is that they are effectively doing very high-dimensional memorization: they're able to train on and memorize tons and tons of examples and apply them in slightly adjacent contexts. I don't want to dismiss language models too much, because I think they are very special, very magical, and something with lots of economic utility; Zapier is existence proof of that fact alone. So I don't want to throw them under the bus; there are some really good things the technology has unlocked.

But there are limits. The limits are not being able to effectively leverage the training data, to compose it or combine it at test time, to attack and accomplish novel tasks never seen in that training data. That's what ARC shows: this is a skill these language models don't possess. I think it's useful to look at the history of the high score so far, and where we expect it to go.

When the eval was first introduced in 2019-2020, there was a small Kaggle competition that ran to get a baseline, and it was 20%. From 20% to 30%, the technique that worked was effectively this: researchers handcrafted a domain-specific language by looking at the puzzles in the public test set. (There are two test sets, a public one and a private one; the private one is what ARC measures on.) They looked at the public tasks and tried to infer and write down, in Python or C# or whatever, the individual transformations you do in your head to go from one puzzle to the next. They call this a DSL. Then they wrote a brute-force search through all possible permutations and combinations of those sub-programs to find the general pattern and apply it at test time. And that got to about 30%.
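As a toy sketch of that DSL-plus-brute-force idea (the primitives below are a tiny illustrative set; real ARC DSLs have dozens of handcrafted transformations):

from itertools import product

# A toy DSL of grid transformations.
def rotate90(g):  return [list(r) for r in zip(*g[::-1])]
def flip_h(g):    return [row[::-1] for row in g]
def flip_v(g):    return g[::-1]
def identity(g):  return g

PRIMITIVES = [identity, rotate90, flip_h, flip_v]

def apply_program(program, grid):
    for fn in program:
        grid = fn(grid)
    return grid

def solve(task, max_depth=3):
    # Brute-force search over compositions of DSL primitives: return the
    # first program consistent with every train pair, applied to the tests.
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            if all(apply_program(program, ex["input"]) == ex["output"]
                   for ex in task["train"]):
                return [apply_program(program, t["input"]) for t in task["test"]]
    return None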

What's gotten from 30% up to close to 40% now is a slightly different technique. This is Jack Cole, and his approach is effectively using a code-based open source language model and doing test-time fine-tuning. He does some pre-training on the code-gen model, and then at test time he takes the novel puzzles, which have never been seen before, permutes variations of them, and trains the code-gen-based model on those variations in order to find a program that fits the pattern and apply it to the test. And that's gotten to 40%.
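A minimal sketch of the test-time fine-tuning idea, assuming a small causal LM and a text serialization of the grids (the model name, serialization, and hyperparameters here are illustrative, not Jack Cole's actual setup):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def augment(pair):
    # Simple variants of a train pair (rotations here; color permutations
    # and reflections are other common choices).
    rot = lambda g: [list(r) for r in zip(*g[::-1])]
    variants, g_in, g_out = [], pair["input"], pair["output"]
    for _ in range(4):
        variants.append({"input": g_in, "output": g_out})
        g_in, g_out = rot(g_in), rot(g_out)
    return variants

def serialize(pair):
    fmt = lambda g: "\n".join("".join(str(c) for c in row) for row in g)
    return f"INPUT:\n{fmt(pair['input'])}\nOUTPUT:\n{fmt(pair['output'])}\n"

def test_time_finetune(task, model_name="gpt2", steps=30):
    # Briefly adapt the model on permuted demonstration pairs of this one
    # task, then sample the test output from the adapted model.
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
    examples = [v for p in task["train"] for v in augment(p)]
    model.train()
    for step in range(steps):
        batch = tok(serialize(examples[step % len(examples)]), return_tensors="pt")
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward(); opt.step(); opt.zero_grad()
    return model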

I suspect we already have the ideas in the air to get to the 50% mark, maybe even a little beyond it, without a lot of new innovation. I bet just ensembling or combining the existing idea sets that have already worked probably gets you about halfway.

To get to the 85% mark and beyond, I think the ultimate solution, at least to solve ARC, probably looks more like a deep-learning-guided DSL generator. Instead of hand-coding and handcrafting the DSL ahead of time by trying to infer from the public tasks what those sub-programs should be, you need some way to generate that DSL dynamically, by looking at the puzzles in real time, and to learn from past puzzles and apply that toward future puzzles.

This is another important thing humans do when they're going through the ARC set. Sometimes the first or second puzzle is actually a little trickier, because you're orienting yourself: what am I doing, what tasks are these, what does the possible solution space look like? And as you get further into the task set, they tend to get easier, because the space of possible transformations is finite and you start recognizing patterns.

And then you combine that with some sort of deep-learning-based program synthesis engine: something that doesn't brute force all possible programs for combining those DSL primitives, but instead uses a deep learning approach to shape which program traces you generate and test against the pattern. That goes back to the human introspection of how we take the puzzles: we're not brute forcing all possible programs in our heads. Instead, we identify just a handful of likely candidates, test them deterministically in our heads, and apply the one that works.

The role of code generation and program synthesis

Sonya Huang: Really interesting that code generation and program synthesis underlie all of the methods you just talked about. And there's something very special there: program synthesis is very general and gets you close to that definition of generalized intelligence you mentioned at the beginning.

Mike Knoop: It's very exacting. And I think this is one of the reasons the solution to ARC AGI is going to be useful very quickly. We were talking about this before: there's a history of toy AI benchmarks over the last 10 to 15 years that kind of looked like ARC. There were games, there were puzzles. They all got handily beaten as scale emerged, and the solutions really never amounted to much; they didn't add to our understanding of how to build useful AI systems.

So one of the common questions I've been fielding the last couple of weeks is: what's different about ARC? Isn't that likely to just happen here again? I think the reason we're likely to see something much more useful from the first grand prize win, assuming we get a really good solution to ARC, comes down to the number one problem at the application layer. We see this at Zapier too, with the new AI bots we launched a couple of months ago.

It's been surprising to me how that has been adopted by our users. The way I'd describe it is that there are concentric rings of use cases for AI and AI automation. What we're seeing is that people restrict the fully automated, totally hands-off use of the AI bots to cases where there's a very low need for user trust, or, to say it a different way, where if it goes wrong, it's not catastrophically bad. So they deploy it for use cases like personal productivity or team-based workflow automation, where if it's only right nine out of ten times, or it takes a couple of days of prompt engineering to steer it toward 95 or 99% reliability, that's acceptable, because the risk of being wrong is quite low.

To expand the concentric rings to moderate-risk and high-risk deployment scenarios where we want the systems working autonomously, the main thing missing is user confidence in the exact nature of what it can and can't do. This is what classic Zapier gives us, right? It's a deterministic engine for executing automation: once you build and set it up, it's going to do the exact same thing every single time. That's also what makes it fragile and hard to use.

In contrast, these LLM-cored AI systems that are totally autonomous have the opposite set of trade-offs: they're much easier to use, and you can steer them, guide them, and fix them entirely through natural language. But because the accuracy is still inexact, confidence is low.

And I think that's what ARC gets us. A solution to ARC at 85 or 100% means you've written a computer program that can generalize from very simple core knowledge priors to solve, with exacting accuracy and reliability, these sight-unseen puzzles. That will be a new tool in the programmer's toolkit for building products and systems that can achieve that same thing.

What types of people are working on this?

Sonya Huang: We're two weeks in, I think, from when you launched the ARC AGI prize. What have you learned so far? What types of people are working and competing on this? Is it pedigreed researchers or the big labs? Is it scrappy hustler types? Who's competing, and how many teams are submitting solutions?

Mike Knoop: Yeah, let's see. The response after launch was phenomenal, much bigger than we expected. I think we were trending on Twitter twice during launch week, it was the number one Kaggle competition in the world, and we had over a million social views across all the launch channels. So just phenomenal. I'm really thankful to everyone who helped promote and share ARC; hopefully we can actually get a solution in short order. As for the folks working on it, there's probably a historical answer and then what I've seen over the last two weeks. The historical answer is that most of the people who have worked on ARC are outsiders to the field.

This is not actually the first year there's been a contest about it. There was a past competition called ARCathon; it was much smaller, hosted at Lab42, an AI lab in Switzerland. Last year 300 teams worked on trying to beat ARC, and again, no one beat it. To my knowledge, almost all of those teams were individuals, or outsiders in some way. They're not people from big AI labs; they're folks with backgrounds in mechanical engineering or video game programming or physics, folks who just got curious and interested in the problem at hand.

And I actually think that's more likely than not where the breakthrough for ARC is going to come from: an outsider, somebody who thinks a little bit differently, has a different set of life experiences, or is able to cross-pollinate a couple of really important ideas across fields.

That's one of the reasons I put as much money as we did into ARC Prize. I felt the progress was idea-rate-limited, and one of the best ways to increase the number of ideas is to blow up awareness, which is what the launch did. Over the last two weeks I've seen probably two camps of people emerge, at least on Twitter. One camp is in it for the mission: they agree with the underlying concept, they think we do need some new ideas, and they're excited to try and figure out what those are.

And then there's a second group of people who are saying: I'm going to prove you wrong. LLMs are definitely enough, scale is definitely what we need, and I'm going to do my best to beat this benchmark using existing off-the-shelf technology. I'm actually quite happy for both camps to exist.

Trying to prove you wrong

Sonya Huang: One of those approaches is currently up on the leaderboard?

Mike Knoop: So we can break some news here. This week, it's Thursday as we record, we're launching (I guess by the time this comes out it will have launched a couple of days in the past) a brand new public task leaderboard. We talked about how ARC doesn't allow internet access and has compute limits.

I know personally how unsatisfying it is not to be able to use frontier models; I also want to know how well GPT-4o or Claude Sonnet can do against this benchmark. And the compute limit is also a bit of a barrier to entry, right? You have to use open source models and do quite a bit of engineering work before you can even start testing and experimenting.

So we're launching a new public task leaderboard. It's a secondary leaderboard, and we're committing about $150,000 to a reproducibility fund for it. It won't be officially part of the competition this year, because we want to maintain the assurance against cheating, contamination, and overfitting that comes with the private test set. That's also the test set with the most empirical evidence behind it over the last four years.

The secondary leaderboard will let folks submit scores against the 400-task public set, and we'll verify and reproduce the scores locally before publishing them. And you're right, I think the top score, or one of the top scores, on that is this guy Ryan Greenblatt. He came out a couple of days after the competition launched with a pretty interesting, novel approach. He's using GPT-4o, but not just GPT-4o. The interesting thing is that he created an outer loop around 4o: he samples from GPT-4o lots of candidate programs, these reasoning traces, to identify the pattern, then tests those programs against the demonstration set, finds the one that works, and applies it.

That approach seems to be getting into the low 40s, maybe 40 or 41%, somewhere in that range. And it's pretty interesting, because someone might look at that at first blush and say: well, isn't that evidence that scale is all you need? I do think there's something interesting there; it shows that the more training data these things have, the more programs they can spit out that might be roughly right. But I think it also shows that new ideas are still needed. This outer loop is novel; that might actually be frontier LLM reasoning that Ryan published. And similar to ARC Prize, we're going to open source all the code for the reproducible solutions, so folks can take these approaches and try to reproduce them, with closed or open source models, against the private test set. I think it's pretty interesting how much innovation we've gotten over the last few weeks just from putting awareness against the public test set out there.
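A minimal sketch of that sample-and-verify outer loop: ask the model for many candidate programs, keep only the ones that reproduce every demonstration pair, and apply a survivor to the test input. The OpenAI client calls are real API usage, but the prompt and selection logic are simplified illustrations, not Ryan Greenblatt's actual pipeline:

from openai import OpenAI

client = OpenAI()

def propose_programs(task, n_samples=64):
    # Ask the model for n candidate solve() functions as Python source.
    prompt = (
        "Here are input/output grid pairs from a puzzle:\n"
        f"{task['train']}\n"
        "Write a Python function solve(grid) implementing the transformation. "
        "Return only code."
    )
    out = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        n=n_samples,
        temperature=1.0,
    )
    return [choice.message.content for choice in out.choices]

def passes_demos(src, task):
    # A candidate survives only if it reproduces every train pair.
    scope = {}
    try:
        exec(src, scope)  # real use needs sandboxing of untrusted code
        return all(scope["solve"](p["input"]) == p["output"]
                   for p in task["train"])
    except Exception:
        return False

def solve_task(task):
    for src in propose_programs(task):
        if passes_demos(src, task):
            scope = {}
            exec(src, scope)
            return [scope["solve"](t["input"]) for t in task["test"]]
    return None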

Where are the big labs?

Sonya Huang: What about the folks at the big research labs? Why are they not working towards this benchmark? Because when you explain the benchmark, it seems so clear that this is obviously the thing you want to solve: you want to get beyond the memorizing-the-textbook use case. Why do you think the folks from the big research labs aren't trying to solve this benchmark? Or are they?

Mike Knoop: I am aware of a handful of big AI labs that have tried in the past, several years ago, perhaps at a smaller scale with weaker models. One thing I would hope is that more do in the future. I'd actually love to see ARC AGI become an actual measure on the model cards that get reported against future models; I think it'd be a really cool thing, and we want to make it happen. So if anyone listening to this wants to reach out, I'm more than happy to work with them and find some way to do that.

If I had to guess, well, let me start with what I can say with more confidence. Once I got exposed to François' work again and started thinking deeply about ARC AGI, I started surveying a lot of my friends and researchers in SF and the Bay Area: had they heard of François, and had they heard of ARC? François has pretty good name recognition, because he's really big on Twitter and has been for many years; probably nine out of ten people I talked to knew who François Chollet was. Maybe one or two in ten had heard of the ARC AGI eval, and probably half of those were confused, because there are like five other AI evals called ARC, so I had to do some disambiguation. So it had really low awareness.

This was one of the first things I asked François when I met him in person for the first time this year: why do you think that is? Why do you have such high awareness, while ARC has such low awareness? His answer, effectively, was that it's hard. The way benchmarks gain popularity and notoriety is that we make progress on them, right? Researchers work against a benchmark, someone has an idea, they have a breakthrough, they publish it in a paper, that paper gets picked up and cited by others, and that generates awareness and attention. Other researchers say: ooh, interesting, something might be possible now on this really hard benchmark. So you get a snowball effect of attention. Because ARC has endured with very low rates of progress, in fact decelerating progress, over the last four years, I think anyone in a lab looking at it would say: maybe the time is not right yet, maybe we don't have the idea set in the world, maybe we don't have the scale we need to beat this thing. And it looks like a toy; I don't fully understand why it's qualitatively different; I haven't spent that much time on it; I've got a million other benchmarks I could use. I think that's somewhat the dynamic that has existed in the past, and it's one of the main reasons we launched ARC Prize.

I think there are lots of market tools you can use to shape markets and shape innovation, and there's a narrow spot where prizes can be outrageously effective: where progress is idea-rate-limited, the idea is small enough that one person or a small team can make the breakthrough, and the result is quickly and easily inspectable and reproducible, so you can build on top of it rapidly. All of those boxes got checked. That's one of the reasons I decided to put together ARC Prize.

The world post-AGI

Pat Grady: You've mentioned a curiosity around AI or AGI dating back to college, that was sparked again a few years ago in the context of Zapier, and that has been nurtured ever since. I'm curious why this is important. Meaning: if you could paint the picture of what life looks like for the world post-AGI, where we've defined AGI as the ability to efficiently acquire new skills, what do you think that version of the future looks like? Why is this an important thing to solve?

Mike Knoop: I think the thing I have a unique insight into at this point, having spent a lot of time thinking about ARC and this AGI definition, is that I suspect the advent of AGI is going to look very different than most people expect, especially to the group in the camp that says AGI is undefinable, because it's so mythical and scary or big or awesome that we can't even hope to define it, that it's just going to be this magical, special thing.

It turns out, and this is something I believe quite deeply, that definitions are really important, because definitions allow us to create benchmarks, and benchmarks allow us as a society to measure progress and set goals toward things we care about and want to happen. And from this idea of efficiently acquiring skill, which we've talked about a handful of times today, one of the direct near-term things you get is systems that can generalize with exacting accuracy from a small set of core priors and apply that to novel tasks. That is, again, the number one problem that rate-limits AI adoption for more real-world use cases today.

So that's what you're going to see: the application layer of AI getting amazingly good on accuracy, consistency, and low hallucination rates, which is going to allow us to use it in a much more unfettered, much more trusted way, because of the underlying way in which it's built. And I think the reason that's important is that we don't know what that set of capabilities will build into in the future. There are lots of unknowns about how AGI evolves beyond the actual inception moment of a system that can efficiently acquire skill.

But I think it's going to be a much more gradual and incremental rollout, where there's a lot of contact with reality as we build and engineer these systems. That's going to give us as a society a lot of time to update based on those capabilities, what it can do and what it can't do, and to make decisions at that point about how we want to deploy this technology, and where we as a society might say we don't want to deploy it for a given set of use cases.

Yeah, I think that's one of the reasons I've been such a proponent of open source AGI progress with ARC Prize. Right now, what I see happening is there's this mythical story of a very bad outcome once we get to superintelligence, right? It's a very theory-driven story. It's not grounded in empirical evidence; it's based on reasoning our way to this outcome. And I think the only way we can truly set good policy is to look at what the systems can and can't do, and then the regulator makes decisions at that point about what it can or can't do.

I think anything else is cutting off potential really, really good futures way too early. And that's what's happening, I think, with a lot of this early AI regulation. I'll try to paint the good side of the picture: maybe the risk of this bad outcome is so high in the future that we should pause here. But the risk of pausing is that you've trimmed off every possible good path to the future way too early. And the reason it's way too early is that we still need new ideas. We need new ideas from researchers, from students, from young people, from labs. Otherwise there's a chance we never actually reach the degree of useful AGI that we want.

So that's my nuanced take on what the advent of AGI probably looks like. I think it's much less likely to be a moment in time, and much more likely to be a stair-step of technology that we build on top of past technology, which creates a lot of moments to update beliefs based on what it can and can't do.

When will we cross 85% on ARC-AGI?

Sonya Huang: Do you have any predictions on when we’ll cross 85% on ARC Prize?

Mike Knoop: You know, the first data scientist we hired at Zapier gave me an idea a long time ago that has stuck with me. He said, "the longer it goes, the longer it goes." It's the idea that the longer something takes, the more you should update your priors toward it taking even longer. So coming into this year, my expectation was at least three or four more years, probably, before we get to the grand prize mark, based on the past track record.

Having seen what we've seen over the last two or three weeks, though, I think it's quite likely we get to 50% during this competition period. I would be surprised, in a good way, if we actually got to the 85% grand prize this competition period. But I think it is not unlikely that we cross the 50% mark before the middle of November, which is when the 2024 contest period ends.

Sonya Huang: And is there a good "why now"? People have been trying at this for five years, you're galvanizing interest around it, and a lot more researchers around the world are now interested in AI and solving hard problems. But is there a good "why now" in terms of enabling techniques and technologies that's different now than five years ago, when François first defined the benchmark?

Mike Knoop: If it is true that deep learning is an important part of the solution, right, a deep-learning-guided program synthesis engine, or a DSL generated on the fly through a deep learning technique, then the world has gained a lot of experience building, engineering, and scaling such systems over the last three or four years. And there's a lot more compute online, which brings the cost down into a territory where some of these things may simply have been impractical on cost before.

For example, Ryan Greenblatt's solution right now is maxing out the cost limits we're going to have for the public leaderboard. It costs $10,000 to generate the 8,000 reasoning sample traces from GPT-4o that he then deterministically checks. That's a technique that would not have been possible in any way three or four years ago.

So if it is true that some minimum amount of scale is necessary to beat ARC, well, we've gotten more of that in the last three or four years than we had when the first competition ran. And then the other factor is largely awareness. Actually, let me answer the opposite way: I think the risk is that there isn't a "why now." The reason we launched ARC Prize is that it's not a case of the ideas already being out in the world and us just needing people to work on the problem. The "not why now" is, I think, the more interesting story: the LLM-driven focus on LLM solutions only, and the closed research due to competitive dynamics, have shifted attention away from new ideas and toward scale, toward LLMs, toward application-layer AI. We think we need some reshaping back toward the new idea set. So hopefully the "why now" is that ARC now has lots of attention. To be seen.

Will LLMs be part of the solution?

Sonya Huang: So you think LLMs will be part of the solution? It seems like in the big research labs right now, a lot of the frontier research is around merging LLMs with the insights you get from inference-time compute, in the Q-Star, AlphaGo style. I'm curious what you think of that direction of research.

Mike Knoop: There's some pretty interesting research I've come across showing that transformers, as an architecture, are capable of representing very deep deductive reasoning chains with 100% accuracy. I think this is interesting. The challenge is that we just don't have the learning algorithm: backpropagation is ineffective for teaching a transformer architecture a set of network weights that can do deductive reasoning with 100% accuracy. So I think it's possible that the core concepts underlying language models have sufficient capability to do this type of reasoning.

We just have not yet discovered the algorithm that can train the model in the right way, and we haven't quite discovered the right outer loop around the transformer that will do the program synthesis engine or the DSL generator.

I feel more confident saying that deep learning is almost certainly going to be part of the grand prize solution in some way; I'm pretty confident it won't be solved by a pure deterministic program. I think transformers are the technology with the highest degree of awareness in the research literature, and there's a lot of hardware now going toward accelerating transformers. Miles, I think, actually just wrote about an ASIC announced recently that's trying to accelerate the transformer architecture. So to the degree that some amount of compute scale is necessary to beat ARC, I'd say I'm bullish on the transformer architecture.

Though I would point out that the search space of alternative architectures is quite rich, right? We've had maybe nine or ten mainline architectures now, from transformers to LSTMs, CNNs, RNNs, state space models, and they all have slightly different properties, which suggests the search space of architectures really is rich. So I think it's certainly possible someone comes up with an innovation there. I'm less confident, less bullish, that LLMs in their exact form are going to be part of the 85% solution, though. I think it would probably be a sub-component architecture rather than the entire system itself.

Pat Grady: When somebody does ultimately hit the 85% threshold, what do you hope they do with the solution? What would you like to see from that person, other than submitting it to the leaderboard?

Mike Knoop: So this is one thing we didn't talk about a ton, but one of ARC Prize's goals is to accelerate open progress towards AGI. So we're going to require that, in order to claim the prize money, you publicly share reproducible code and put it into the public domain. And this goes for both the public leaderboard and the official competition leaderboard.

This is in the spirit of trying to re-accelerate open progress, so that we have research out in public that other researchers can build on top of, and hopefully stair-step our way to actually building AGI rather than getting stuck in the plateau we're in today. That's probably my first pick. I've actually seen a handful of people online saying: hey, if you've got a solution to ARC AGI, I'll give you a million-dollar offer to come work at my company. On one hand, I'm like, okay, that's kind of interesting. On the other hand, I think it's great awareness, and it shows the importance of solving this. More people are becoming aware of the lack of frontier progress, and I think ARC is becoming a lightning rod for folks who want an actual measure of it, which is a growing sentiment in the field today.

Lightning round

Sonya Huang: So we close out with some rapid fire questions. 

Mike Knoop: Yeah, let’s do it. 

Sonya Huang: Okay, who do you admire most in AI?

Mike Knoop: I mean, François Chollet is a bit of a cop-out answer; I wouldn't have co-founded ARC Prize if I didn't admire and respect his work over the last four years. I think the two people I have learned the most from directly, who have inspired my own beliefs and work, are Rich Sutton and Chollet. Both of them published papers in 2019, right? Rich Sutton published The Bitter Lesson, which is fairly well known in the industry at this point.

I think his idea set is right, with maybe one asterisk: the one aspect that has not yet had scale applied to it is architecture search itself. We've certainly applied search and learning on the inference side and the training side, but every architecture still has a very human, handcrafted story and journey behind it, which is an important insight, I think.

And then there's On the Measure of Intelligence from 2019. Maybe Sutton's was more of a blog post, but both of these pieces of writing are very important, I think, because history has proven them right as time has gone on. Language models, transformers, and scale have shown Sutton's ideas to be even more true than they were in 2019, and the endurance of ARC has shown François' definition of AGI to be more and more true as time has gone on.

Pat Grady: What is your most contrarian point of view in AI?

Mike Knoop: I feel like everything we’ve talked about today was contrarian!

Pat Grady: Alright, we’ll count it. 

Mike Knoop: Scale is not all you need. New ideas are needed. 

Sonya Huang: What’s your favorite AI app, other than Zapier?

Mike Knoop: Let me look and see; I have a handful. I'm not going to surprise you with anything: I've got ChatGPT, Perplexity, and Claude, and I'm a paying user of all three of those services.

One interesting thing I'll add: I have gotten way more value out of language-model-based tooling over the last six months than I ever did in the first epoch, when I started working on AI at Zapier. And it's because the thing they're perhaps best at is summarizing tons of unstructured text and acting as an educational tutor for you.

It's significantly ramped up my learning rate on actually building with AI: learning these fundamentally different architectures, and starting to actually do model training. That's something Zapier hasn't done, but I've started doing it myself over the last six months to get a sense of that type of work. And yeah, language models and AI tooling have definitely accelerated my learning process.

Pat Grady: Awesome. All right, last question. Let’s do something optimistic, something that we can all dream about. What change in the world are you most excited to see over the next five or 10 years as a result of AI?

Mike Knoop: I've always wanted to live in the future. I think that's maybe what has always driven me toward working on frontier tech: I've always bought the latest gadget, always tried the latest app. I think it led me to work on Zapier and on AI, and it's one of the reasons I'm working on AGI right now, because I think it's the biggest thing I can potentially have an influence on in trying to pull that future forward.

Personally, I think one of the things that feels very limiting in AI right now is that with the narrow form of AI we have, if we never get AGI, we will always be rate-limited on developing things by the human in the loop. And that means we will never have AI systems that can invent and discover and innovate alongside humans, and really help push forward the frontier in a lot of really interesting ways: understanding more about the universe, discovering new pharmaceuticals, discovering new physics, discovering how to build AI.

We're always going to be rate-limited by the human today. And if you care about living in the future, and you want to pull forward the good aspects of that future, some form of AGI is necessary to do it.

Pat Grady: Awesome, thank you, Mike.

Mike Knoop: Thank you both for having me.
