Could pixels hold the keys to training useful agents?

The race to scale language models — and the agent ecosystem around them — is white-hot. Coding agents, which reason through problems and write code to solve them, have already taken us very far.

But one ambitious young team is making a different bet: that the most promising path to general computer agents may not run through language, screenshots, and tool calls, but through scaling raw video.

Standard Intelligence’s thesis is that the best way to build a general agent is through full video pre-training on computer use, because it is the only approach that can truly scale action data. Instead of predicting text tokens, the model learns to use a computer from raw screen data, predicting the next mouse movement, click, and keystroke from the pixels in front of it. 

It is the Tesla FSD approach applied to knowledge work on computer screens.

That makes the bet both deeply contrarian and deeply “bitter lesson”-pilled. Rather than hand-engineering workflows or wrapping language models in increasingly elaborate harnesses, Standard Intelligence is betting on a new pre-training paradigm: feed the model the raw stream of computer use, scale it aggressively, and let the generality emerge from the data.

“We’re not video people”

Video is unwieldy. It is computationally expensive, economically expensive, and technically unforgiving. Prior attempts to scale video toward AGI have often died on the vine.

The Standard Intelligence team is emphatically “not video people.” They did not arrive with a decade of inherited assumptions about how to work with video as a medium. Instead, they have had to reason through each challenge from first principles, and have met those challenges with unusual optimism, creativity, and scrappiness.

The results are striking. An 11-million-hour computer action dataset — the largest in the industry. A video encoder that is roughly 50× more token-efficient than competing approaches, enabling nearly two hours of 30 FPS video to fit inside a 1-million-token context window. A 30-petabyte storage cluster racked in San Francisco for under $500K, roughly 20× cheaper than hyperscaler alternatives.
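
For the curious, here is a rough back-of-envelope check of those context-window numbers. This is only a sketch using the figures quoted above; the actual tokenizer parameters are not something we are privy to, so the values below are assumptions for illustration.

    # Back-of-envelope check, using only the figures quoted above
    # (assumed values, not Standard Intelligence's actual tokenizer settings).
    hours = 2              # "nearly two hours" of video
    fps = 30               # stated frame rate
    context_tokens = 1_000_000

    frames = hours * 3600 * fps            # 216,000 frames
    tokens_per_frame = context_tokens / frames
    print(f"{frames:,} frames -> ~{tokens_per_frame:.1f} tokens per frame")
    # Roughly 4-5 tokens per frame, versus the hundreds of tokens per image
    # that typical vision tokenizers use, which is in the ballpark of the
    # ~50x efficiency claim.
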

FDM-1, their first foundation model trained directly on computer-use video at scale, offers an early glimpse of what this paradigm could become. It is a general model that can extrude a CAD gear in Blender, drive a car around a San Francisco block after an hour of fine-tuning, and find bugs in software by exploring its state space the way a curious human might.

Conscientious young founders

Founders Galen Mead and Devansh Pandey met as teenagers in 2022 through the Atlas Fellowship, a selective program for high-school students interested in AI alignment and AGI.

Galen and Devansh are unusually serious about reaching AGI, and unusually conscientious about doing so safely. Both founders are wise beyond their years (21 and 20 respectively), and both left their undergraduate programs out of a sense of urgency to work on this problem.

Galen and Devansh stand out for their combination of taste, scrappiness, technical courage, and ambition. It shows up in the product thinking, in the research direction, and in the FDM-1 report itself.

The full team of six is small but mighty. Neel, Yudhi, Ulisse, and Ryan are each quirky and exceptional. They have chosen to turn down the conventional path (fancy degrees and offers from Big Token) and pursue this courageous mission together.

A new pre-training regime

Video has long been a powerful training ground for AI. DQN showed that agents could learn rich behavior directly from pixels in Atari environments. Tesla scaled video models so that self-driving cars and robots could navigate the physical world.

But in the race toward general knowledge agents, video-first pre-training remains an unconventional idea.

Standard Intelligence is betting that it will not stay unconventional for long.

We are thrilled to lead Standard Intelligence’s Series A alongside Miko and Yasmin from Spark Capital.