The AI Race Isn’t About Models. It’s About Tokens.

Winning isn’t determined by model power, but by who extracts the most value from every token spent.

Stan Sedberry

The biggest AI companies aren't winning because they have better models. They're winning because they spend tokens better.

Foundation models are commoditized infrastructure. Everyone rents from the same handful of providers: OpenAI, Anthropic, Google. The model is just the substrate. And by tokens, I mean the atomic unit of cost in AI apps: the thing you pay for when you call an API, the resource you burn through every time a user asks a question.

The real battle is at the application layer, where companies decide how to spend those tokens. Two startups can use identical models and get radically different outcomes. One wastes tokens on unfocused queries and general-purpose flailing. The other extracts maximum value from every token through tight framing, specialized workflows, and architectural choices most people don't see.

The companies actually making money right now aren't competing on model capability. They're competing on token efficiency.

Framing isn't just UX, it's token economics

Ask GPT-5 to "summarize this contract" and you'll get 400 words of boilerplate. Ask Harvey the same thing and you'll get a legally scoped answer with embedded precedent and redlines. Same model. Different interface. Different outcome.

When you send a prompt to GPT-5, you're not just asking a question: you're constructing a frame. The system message, the few-shot examples, the context you inject, the structure you impose: these determine what comes back. The product is the interface, not the model. Harvey isn't valuable because it uses Claude. It's valuable because it frames legal queries in ways that compress research workflows into fewer, more targeted completions. GitHub Copilot doesn't win because it has exclusive model access. It wins because it understands code context deeply enough to suggest completions without burning tokens on irrelevant explorations.

Framing manifests in concrete techniques: chain-of-thought prompting to structure reasoning, tool calling to offload computation, role-based system messages to set behavioral constraints, dynamic context injection to minimize waste. The companies that master these patterns extract more signal per token. In AI, every architectural decision is a cost decision.
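To make that concrete, here's a rough sketch in plain Python of what tight framing looks like next to a naive prompt. No real API call is made, and the token counts are crude estimates, not measurements.

```python
# A rough sketch of "framing as token economics" — no real API calls;
# token counts are illustrative estimates, not measured values.

def estimate_tokens(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token. Real tokenizers differ.
    return max(1, len(text) // 4)

# Naive framing: one open-ended prompt, which invites a verbose completion.
naive_prompt = "Summarize this contract: <contract text>"

# Tight framing: role, constraints, and output structure are explicit,
# so the completion can be short and directly usable.
framed_messages = [
    {"role": "system",
     "content": ("You are a contracts analyst. Return only: "
                 "(1) parties, (2) term, (3) termination triggers, "
                 "(4) liability caps. Max 120 words.")},
    {"role": "user", "content": "<contract text>"},
]

framed_input = " ".join(m["content"] for m in framed_messages)
print("naive prompt tokens: ", estimate_tokens(naive_prompt))
print("framed prompt tokens:", estimate_tokens(framed_input))
# The framed prompt costs a bit more up front, but it constrains the
# completion, which is where most of the spend (and the waste) usually is.
```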

Specialization compounds efficiency

General-purpose AI assistants face a structural problem. They compete directly with ChatGPT, a product with effectively infinite capital behind it, no margin pressure, and a brand that's become synonymous with "AI." Worse, generalist apps dilute their token spend across every possible use case, which means they can't afford to optimize for any single one.

Specialized apps flip this dynamic. By constraining the domain (contract review, code search, executive assistance), they can pre-load context that generalist apps must repeatedly fetch. A legal AI embeds case law and precedent structures upfront. A coding assistant indexes your repository architecture. An executive assistant learns your calendar patterns, communication style, and decision-making context. After that initial cost, every query becomes cheaper because the foundation is already in place.

Specialized apps amortize token cost. The first query might cost 1,000 tokens. The tenth costs 50 because the system already knows the user, the domain, the task. Specialization lets you spend like a founder, not like a tourist. This is how efficiency becomes a moat.
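A toy model of that amortization, with made-up numbers that mirror the 1,000-token first query and cheap follow-ups described above:

```python
# Toy model of token amortization in a specialized app.
# All numbers are illustrative, not drawn from any real product.

class SpecializedAssistant:
    def __init__(self, domain_context: str):
        # Domain context (case law structures, repo index, calendar patterns)
        # is loaded once and reused, not re-sent with every query.
        self.domain_context = domain_context
        self.memory = []  # prior questions, kept as lightweight memory
        self.context_loaded = False

    def query_cost(self, question: str) -> int:
        cost = len(question) // 4            # crude token estimate for the question
        if not self.context_loaded:
            cost += len(self.domain_context) // 4  # one-time context cost
            self.context_loaded = True
        cost += 10                           # small marginal overhead per query
        self.memory.append(question)
        return cost

assistant = SpecializedAssistant(domain_context="x" * 4000)  # ~1,000 tokens of domain setup
costs = [assistant.query_cost("What are the termination terms?") for _ in range(10)]
print(costs[0], costs[-1])  # first query is expensive, later ones are cheap
```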

It's like the difference between a Swiss Army knife and a scalpel. The knife tries to do everything; the scalpel does one thing with minimal waste. In a market where tokens have cost, specialization isn't just a positioning choice. It's an efficiency engine.

What token efficiency actually means

Token efficiency is the ratio of user value to tokens spent. High efficiency means you deliver outcomes with minimal API calls, short completion chains, and little wasted context. If App A delivers the right answer in 3 prompts and App B takes 10, App A is roughly 3x more efficient. Same outcome, lower spend, and that gap compounds at scale.
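As a back-of-the-envelope check with hypothetical numbers:

```python
# Back-of-the-envelope token efficiency comparison (hypothetical numbers).
tokens_per_prompt = 500  # assume similar prompt + completion sizes

app_a_tokens = 3 * tokens_per_prompt    # reaches the answer in 3 prompts
app_b_tokens = 10 * tokens_per_prompt   # needs 10 prompts for the same outcome

# Same user value delivered, so efficiency scales inversely with spend.
print(app_b_tokens / app_a_tokens)  # ~3.3x efficiency gap
```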

Think of it like energy per bit in computing. Early computers consumed enormous power to flip a single bit. Modern chips do it for almost nothing because engineers spent decades optimizing at every layer: transistor design, circuit layout, instruction sets, cooling systems. Early programmers paid per instruction in a very literal sense. You couldn't afford to waste machine time. As computing got cheaper, people stopped counting cycles. But the programs that won were still the ones that used resources efficiently, even if users didn't see it.

Token efficiency will follow the same arc: critical now, invisible later, but always the difference between a product that scales and one that doesn't.

The levers are concrete: compression of repeated queries through caching or memory systems, tighter scoping so prompts stay focused, tool augmentation to handle structured tasks without language model overhead, architectural choices that minimize roundtrips. Companies that master these levers get better margins and faster products with the same underlying models.
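Here's one of those levers, tool augmentation, in sketch form: structured sub-tasks get routed to plain code instead of burning completions on them. The router and function names below are hypothetical, not any real framework.

```python
# Sketch of tool augmentation: structured tasks are handled by plain code,
# and only genuinely open-ended requests reach the language model.
from datetime import date

def days_until(deadline_iso: str) -> str:
    # Deterministic computation — zero tokens, exact answer.
    remaining = (date.fromisoformat(deadline_iso) - date.today()).days
    return f"{remaining} days remaining"

def call_llm(prompt: str) -> str:
    # Placeholder for a real completion call; this is where tokens are spent.
    return f"[LLM completion for: {prompt!r}]"

def route(request: dict) -> str:
    if request["type"] == "deadline_check":
        return days_until(request["deadline"])   # no tokens spent
    return call_llm(request["text"])             # tokens spent only when needed

print(route({"type": "deadline_check", "deadline": "2026-01-01"}))
print(route({"type": "open_question", "text": "Summarize the indemnity clause."}))
```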

Why generalism fails

Building a general-purpose AI assistant is startup poison. Users already have ChatGPT. They already have Claude. What they don't have is a tool that understands their specific workflow deeply enough to save time rather than create a new cognitive burden.

Generalism burns tokens because you can't pre-bake anything. Every user brings a different context, a different goal, a different mental model. By the time you've spent 10,000 tokens clarifying, a specialized tool would have delivered an answer in 1,000 because it knew the domain well enough to skip the back-and-forth.

There's an edge case where generalism works: if you have distribution so overwhelming that you can subsidize token costs indefinitely. ChatGPT can do this. Most startups can't.

Who wins

Winning apps don’t just minimize token usage, they design entire systems around it. They encode domain knowledge upfront, structure workflows around repeatable tasks, and scope tightly enough that every token spent pushes toward a clear outcome. Efficiency isn’t an afterthought. It’s baked into the architecture.

These companies don’t try to do everything. They pick a high-leverage workflow, embed context at the system level, and build memory so the model doesn’t have to start from scratch. They augment with tools where language models are wasteful, and they reduce roundtrips through tighter orchestration. Every decision compounds efficiency.

This is the real game: not building a smarter chatbot, but engineering a smaller loop between intent and result. The tighter the loop, the more value you extract per token and the stronger your economic advantage.

What this means in practice

Most founders still talk about models as if that's the differentiator. It's not. The model is the substrate. What you build on top (the frame, the specialization, the memory, the tools) determines whether you extract value or waste money.

If you're building a general-purpose assistant without the distribution of ChatGPT, you're playing a losing game. Find a workflow where token efficiency compounds, where tight scoping and pre-loaded context let you do more with less, and where the value you deliver per token is higher than anyone else can match with the same model. Build that, or get priced out.

Even if tokens become free, time won't be. Focus still wins. The product that gets to the right answer in three steps will beat the one that takes ten, regardless of cost. Efficiency isn't just economic, it's experiential.

The next $100B companies won't have the best models. They'll just use them better.
