Illustrations: Andreas Weiland

For most of computing history, intelligence flowed through a narrow channel: text in, text out.

If you could code or write, you had power. If you couldn’t, you were boxed out. Digital tools were built for verbal, logical thinkers. Everyone else—visual thinkers, spatial reasoners, musicians, creatives who think in movement or metaphor—had to translate their intelligence into text to be understood by machines.

That era is ending.

We’re entering a new phase of AI: one where models natively understand and generate across modalities—text, image, code, video, audio, voice—not as plugins, but as representations in a unified latent space. This isn’t just a technical upgrade. It’s a shift in the cognitive interface between humans and machines.

Multimodal AI is not just about making cooler images. It’s about unlocking intelligence that’s been latent, unexpressed, or trapped in the wrong modality. It’s about turning the act of thinking itself into a richer, more fluid medium.

This is AI’s synesthesia moment.

Multimodality: The new superpower 

Multimodal models don’t literally “think” in pictures or sounds—they encode and generate meaning through a universal latent representation.

That’s the leap. Not a patchwork of plugins or stitched-together APIs, but a single system that learns and expresses concepts through a shared semantic embedding space, where a sentence, a sketch, a snippet of code, and a melody are all interconnected representations of meaning.

When Google announced Gemini, it described the model as “built from the ground up to be multimodal.” So was GPT-4o. These aren’t add-ons—they’re foundational systems, each built around its own unified cross-modal representation.

Why does that matter? Because it fundamentally changes how intelligence is represented inside the model. Instead of translating across separate sensory domains, multimodal models encode information into a shared latent space—enabling seamless transitions between forms of expression.
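To make the idea of a shared latent space concrete, here is a minimal sketch using OpenAI’s open CLIP model via the Hugging Face transformers library. CLIP is a far simpler system than Gemini or GPT-4o, and this illustrates only the general principle, not their internals; the model name, example image URL, and captions are placeholders you can swap freely.

```python
# A minimal sketch of a shared text-image embedding space, using CLIP via
# Hugging Face transformers. This illustrates the general idea of a shared
# latent space; it is not how Gemini or GPT-4o work internally.
from PIL import Image
import requests
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Any publicly accessible image URL works here; this one is just an example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = ["a photo of two cats on a couch", "a diagram of a neural network"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Both modalities land in the same embedding space, so a plain cosine
# similarity tells us which caption sits closest to the image in meaning.
image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
similarity = image_emb @ text_emb.T
print(similarity)  # higher score = closer in the shared space
```

The point of the toy example is simply that image and sentence both end up as vectors in one space, so relating them is a vector comparison rather than a translation step between separate systems.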

Under the hood, this unified approach enables a more expressive and controllable generation process. Consider the two dominant paradigms in image generation: diffusion and next-token prediction.

Diffusion starts with noise and gradually refines it—like a sculptor carving form from chaos. Next-token prediction works more sequentially, adding elements one at a time. While GPT-4o’s precise internal workings aren’t fully public, its image generation appears to integrate both approaches: creating a rough semantic outline autoregressively, then refining it through diffusion-style processes. The outputs aren’t “thought” in images or audio—they’re decoded from the latent space. That unified representation is what makes the outputs more coherent: generation is conditioned on a single shared encoding of the prompt, rather than handed off between loosely coupled systems.
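As a rough sketch of the difference in control flow, consider the toy loops below (plain Python with NumPy, with the neural networks replaced by stand-in functions). They illustrate the two paradigms in the abstract, not GPT-4o’s actual pipeline, which has not been published.

```python
# Toy contrast of the two generation loops described above. The real models
# are replaced by stand-in functions: this shows the control flow (iterative
# denoising vs. sequential prediction), nothing more.
import numpy as np

rng = np.random.default_rng(0)

def fake_denoiser(canvas, step, total):
    """Stand-in for a diffusion model's denoising step: nudge the whole
    canvas a little closer to a pretend 'clean' target at every step."""
    target = np.linspace(0.0, 1.0, canvas.size)
    blend = (step + 1) / total
    return (1 - blend) * canvas + blend * target

def fake_next_token(prefix):
    """Stand-in for an autoregressive model: choose the next element given
    everything generated so far (here, just count upward)."""
    return (prefix[-1] + 1) if prefix else 0

# Diffusion-style: start from pure noise, refine the entire canvas each step.
canvas = rng.normal(size=8)
for step in range(10):
    canvas = fake_denoiser(canvas, step, total=10)

# Next-token-style: start empty, append one element at a time.
tokens = []
for _ in range(8):
    tokens.append(fake_next_token(tokens))

print("diffusion-style result:", np.round(canvas, 2))
print("autoregressive result: ", tokens)
```

The practical difference is that the diffusion loop revisits every part of the output at every step, while the autoregressive loop commits to each element as it goes—which is why a hybrid that sketches globally and then refines is an appealing design.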

Previously, image generation meant rewriting prompts, piping them through DALL-E, and hoping for the best. With GPT-4o, prompting feels collaborative. You can speak in fragments or metaphors, and the model effortlessly decodes these into coherent results.

The same latent-space shift is happening with voice. Models like Sesame’s voice engine capture tone, emotion, pacing, and style—not because they “think” in sound, but because they decode nuanced semantic representations directly into expressive audio outputs. Voice becomes a fully expressive medium, with vast implications for storytelling, education and interaction.

Translating intelligence with AI synesthesia

Digital tools have long favored those who express themselves verbally. But intelligence isn’t limited to words. Some think visually, rhythmically or spatially.

Multimodal AI bridges these gaps. Visual thinkers no longer need paragraphs. Conceptual thinkers no longer need to wrestle with interfaces. Musicians and speakers no longer depend solely on editors or engineers. Intelligence can now fluidly express—and translate—across mediums via a shared latent space.

This is AI synesthesia: converting strengths in one cognitive domain into capabilities in another. If you’re adept at prose but not code, AI bridges the gap through semantic representations. If you’re a gifted designer but struggle to pitch verbally, AI transforms your sketches into narratives.

Alexander Scriabin, the Russian composer and mystic, experienced music as color. His neurological synesthesia enabled fluent transitions across sensory modalities. What was once rare neurological wiring is now becoming our shared digital capability, mediated through latent semantic representations.

Your intelligence is no longer siloed—it’s fluidly transferable.

Raising both floor and ceiling of human capability

The implications for work are profound. Multimodality raises both the floor and ceiling of human capability.

The floor rises as specialized skills become accessible. No design degree is needed to create beautiful visuals, nor coding expertise to automate workflows. AI fills the gaps—fast.

The ceiling rises, too. Specialists can explore across modalities. Data scientists can craft visuals without designers. Strategists can prototype without engineers. Composers can narrate stories. More time is spent at the forefront of their expertise rather than on technical minutiae.

This doesn’t make everyone an expert—it amplifies those who already are. Meanwhile, routine “middle skill” tasks—basic design, simple data analysis, boilerplate content—may be automated away or morph into new roles focused on prompt design and output supervision. Demand at the high end, for originality and domain-specific insight, may surge.

We’ve seen this before. Word processors didn’t end writing—they transformed who writes and how. Multimodal AI will similarly redefine every knowledge field.

Fluid intelligence

We’re just beginning to understand what it means to think through a unified latent space across modalities.

As multimodal AI becomes native to the tools we use every day, intelligence starts to feel less like a fixed trait and more like a transferable asset. We are now entering the creative new world we predicted two-and-a-half years ago: Visual thinkers can speak in paragraphs. Coders can sketch in pixels. Writers can prototype products. Musicians can narrate. The barriers between disciplines, roles and modes of thought begin to dissolve.

This isn’t just a UX improvement—it’s a new mental operating system.
Your strengths are no longer limited to the format you were trained in.
Your ideas are no longer trapped in the mode you happen to be best at.

In the synesthesia era, creativity becomes translation.
Expression becomes multidimensional.
And intelligence becomes fluid.
