Less attention is all you need
A new startup is challenging the quadratic cost of Transformer attention.
I came across Subquadratic recently, a new AI startup that just launched with a long list of bold claims: a 12-million-token context window, around 50 times faster on million-token inputs, and roughly a fifth of the cost of running Claude or GPT. The interesting part isn’t the numbers; it’s the idea behind them. They are not training a bigger Transformer. They are changing the attention mechanism itself.
A quick refresher on attention
In 2017, a Google team published “Attention Is All You Need,” the paper that introduced the Transformer architecture. Every modern LLM (GPT, Claude, Gemini, all of them) is a descendant of that work.
Before Transformers, the dominant approach was the recurrent neural network (RNN), and later the LSTM. Both processed text one token at a time, passing a running summary forward through the sequence. This worked, but was slow, and the model tended to lose track of things from earlier in the input. The Transformer dropped the chain. Instead, it lets every token in the input look at every other token, in parallel, and decide for itself what to focus on.
That looking-at-each-other step is attention. Mechanically, the model turns each token into three vectors: a query (what am I looking for), a key (what do I offer), and a value (what would I contribute if picked). For every query, the model compares it against every key, produces a similarity score, normalizes those scores into weights, then sums the values weighted accordingly. The result is a fresh representation of that token, now informed by everything in the input that turned out to be relevant. Stack a few of these layers and the model can resolve which “it” in a sentence refers to the dog mentioned six words ago, not the bone mentioned three words ago. For a deeper, visual walkthrough of how query, key, and value vectors fit together, Jay Alammar’s The Illustrated Transformer is the classic.
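The query/key/value steps above can be sketched in a few lines of plain Python. This is a toy, not how any production model is written: real models produce the queries, keys, and values with learned projection matrices and run many heads in parallel on a GPU. Here the vectors are passed in directly, and the function names (`softmax`, `dot`, `attention`) are just illustrative choices.

```python
import math

def softmax(scores):
    """Normalize raw similarity scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d = len(keys[0])
    dim = len(values[0])
    outputs = []
    for q in queries:
        # Compare this query against every key to get similarity scores.
        scores = [dot(q, k) / math.sqrt(d) for k in keys]
        # Turn the scores into weights.
        weights = softmax(scores)
        # Sum the values, weighted by how relevant each key was.
        out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
        outputs.append(out)
    return outputs

# Toy input: one query, two keys that match it equally well.
result = attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0], [4.0, 2.0]])
print(result)
```

Because both keys score the same against the query, the two values get equal weight and the output is their average. Note the loop structure: every query touches every key, which is where the cost discussed next comes from.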
Why this gets expensive fast
Notice the every-pair-of-tokens part. If your input has n tokens, attention does roughly n² comparisons. Double the input, quadruple the work. Ten times the input, a hundred times the work. That’s the quadratic cost everyone in the field has been working around for years.
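To make the arithmetic concrete, here is a tiny sketch that counts query-key comparisons for a single attention pass. It deliberately ignores everything else that affects real cost (number of heads and layers, vector width, hardware), so treat the outputs as relative sizes, not as timings. The function name is made up for this example.

```python
def attention_comparisons(n_tokens):
    """Query-key comparisons in one full attention pass: every token vs. every token."""
    return n_tokens * n_tokens

for n in (1_000, 2_000, 10_000, 1_000_000):
    print(f"{n:>9,} tokens -> {attention_comparisons(n):,} comparisons")
```

Going from 1,000 to 2,000 tokens multiplies the count by four, and going from 1,000 to 10,000 multiplies it by a hundred, which is the doubling and ten-times pattern described above.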
You feel this every day, even if you don’t think about it in those terms. It’s why context windows used to cap at a few thousand tokens, why even today most models start choking on long PDFs, why coding agents need to chunk and retrieve instead of just reading the whole repo, and why every LLM API charges per input token. The bill grows with what you send in, not just what comes out.
The standard answer has been to optimize the implementation: FlashAttention and similar kernel-level tricks. These are real wins, but the underlying scaling stays the same. Quadratic is quadratic. To go further, someone has to change the math, not just the kernel.
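A rough illustration of what “optimizing the implementation” means: the sketch below computes attention for one query in two ways. The first builds the whole row of scores before normalizing. The second streams through the keys block by block, keeping only a running maximum, a running normalizer, and a running weighted sum. That streaming-softmax trick is the core idea FlashAttention builds on, but this is a simplified single-query Python sketch, not the actual GPU kernel, and it omits the usual 1/sqrt(d) scaling for brevity. The function names and `block_size` parameter are assumptions for the example.

```python
import math

def naive_attention_row(q, keys, values):
    """Reference version: hold all scores for one query, then softmax."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e * v[i] for e, v in zip(exps, values)) / total for i in range(dim)]

def streaming_attention_row(q, keys, values, block_size=2):
    """Same result, but keys and values are consumed one block at a time."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        block_k = keys[start:start + block_size]
        block_v = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in block_k]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, block_v):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions still compute one dot product per key for every query. The streaming version saves memory traffic, not comparisons, which is why the scaling does not change.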
The usual workarounds
Most of today’s long-context tricks try to dodge the quadratic cost rather than remove it.
- FlashAttention: a smarter kernel that computes the same attention without ever building the full n × n matrix in memory. Big practical wins. Same scaling.
- Sliding-window attention (Longformer, BigBird): each token only attends to a fixed neighborhood plus a handful of global tokens. Linear cost, but a token can’t directly see anything outside its window other than those global tokens.
- State-space and recurrent models (Mamba, RWKV): drop attention altogether and carry a compressed running state through the sequence. Linear cost, but exact recall of something said far back gets fuzzy.
- Retrieval-Augmented Generation (RAG) and chunking pipelines: don’t put everything in the context. Chunk your documents into smaller pieces, embed them into a vector database, and either retrieve the relevant chunks for each query (RAG) or run the model chunk by chunk and combine the outputs (map-reduce summarization). Most production LLM apps live here. Cheap and effective, but lossy.
- Hybrid architectures: mix cheap linear layers with a few full-attention layers to keep some of the recall. Helps, but the full-attention layers still dominate the cost as inputs grow.
Each one is a tradeoff. You give up recall, you give up flexibility, or you skip reading the whole input.
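The sliding-window tradeoff is the easiest one to see in code. The sketch below lists which positions a token is allowed to attend to under a fixed window plus optional global tokens, then counts total comparisons. It is a simplification: real implementations such as Longformer also let the global tokens attend to everything, which is left out here. All names and parameters are made up for the example.

```python
def window_positions(i, n_tokens, window, global_positions=()):
    """Positions token i may attend to: a local neighborhood plus global tokens."""
    lo = max(0, i - window)
    hi = min(n_tokens - 1, i + window)
    allowed = set(range(lo, hi + 1))
    allowed.update(p for p in global_positions if 0 <= p < n_tokens)
    return allowed

def windowed_comparisons(n_tokens, window, global_positions=()):
    """Total query-key comparisons when every token uses the window above."""
    return sum(
        len(window_positions(i, n_tokens, window, global_positions))
        for i in range(n_tokens)
    )

# Token 500 in a 1,000-token input, window of 4 on each side, no global tokens.
print(sorted(window_positions(500, 1_000, 4)))
print(windowed_comparisons(1_000, 4), windowed_comparisons(2_000, 4))
```

Doubling the input roughly doubles the comparisons instead of quadrupling them. The price is visible in the first print: token 500 has no direct view of position 10, however relevant it might be, because the selection depends on distance and not on content.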
What SubQ does differently
SubQ starts from a quiet observation: in practice, most attention weights are near zero. For any given query, only a small fraction of the other tokens actually matter. The full n² computation spends most of its time confirming things don’t.
Their architecture, Subquadratic Sparse Attention (SSA), is built on what they call content-dependent selection. For each query, the model picks which positions in the sequence are worth attending to, then computes attention exactly over only those. The selection is decided by meaning, not position, which is what sets it apart from earlier sparse approaches like sliding windows. They are also careful to contrast it with DeepSeek’s “lightning indexer,” which routes attention sparsely but uses a quadratic selector to do so. SubQ claims the whole pipeline, selection included, runs in sub-quadratic time. How they actually do the selection is not in the public docs; the technical report is “coming soon.”
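Since the actual selection mechanism is unpublished, the sketch below is not SubQ’s method. It is a generic top-k sparse attention toy that shows the shape of “select by content, then attend exactly over the selection.” It also cheats in an important way: it scores every key to pick the top k, so the selector itself is quadratic overall, which is exactly the kind of selector SubQ says it avoids. The function name and the `k` parameter are assumptions for illustration.

```python
import math

def topk_sparse_attention_row(q, keys, values, k):
    """Attend exactly, but only over the k keys that score highest for this query.

    Illustration only: scoring every key to choose the top k costs O(n) per
    query, so this stand-in selector is still quadratic across all queries.
    """
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, key)) / math.sqrt(d) for key in keys]
    # Content-dependent selection: positions are chosen by score, not by distance.
    selected = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # Exact softmax attention restricted to the selected positions.
    m = max(scores[i] for i in selected)
    exps = {i: math.exp(scores[i] - m) for i in selected}
    total = sum(exps.values())
    dim = len(values[0])
    out = [sum(exps[i] * values[i][j] for i in selected) / total for j in range(dim)]
    return out, sorted(selected)

q = [1.0, 0.0]
keys = [[5.0, 0.0], [0.0, 1.0], [-3.0, 0.0], [4.0, 0.0]]
values = [[1.0], [10.0], [100.0], [3.0]]
output, picked = topk_sparse_attention_row(q, keys, values, k=2)
print(picked, output)
```

In the toy input, positions 0 and 3 are picked even though they sit at opposite ends of the sequence, because they match the query best. A sliding window would never pair them. The open question for SubQ is how to find those positions without first scoring everything.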
If their numbers hold up, the result is the recall of full attention with linear scaling. The model can still pull information from anywhere in the context. It just doesn’t pay to compare every pair. They report attention compute reduced by about 1,000x at 12 million tokens, prefill roughly 50x faster than FlashAttention at 1 million tokens, and 95.6% on the RULER long-context benchmark at 128K (above Claude Opus 4.6’s 94.8%).
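As a back-of-envelope sanity check on the 1,000x figure, here is what it would imply under one simple assumption: that dense attention costs n × n comparisons, sparse attention costs n × k, and the whole reduction comes from attending to k positions per query instead of n. That assumption is mine, not the company’s. They have not said how the number was computed, and the sketch ignores whatever the selection step itself costs.

```python
def implied_positions_per_query(n_tokens, reduction_factor):
    """If dense costs n*n and sparse costs n*k, then (n*n) / (n*k) = n / k.

    Solving n / k = reduction_factor for k gives the attended positions per query.
    """
    return n_tokens / reduction_factor

k = implied_positions_per_query(12_000_000, 1_000)
print(f"about {k:,.0f} attended positions per query")
```

Under that reading, each query would be looking at on the order of twelve thousand positions out of twelve million, which fits the observation that most attention weights are near zero. It says nothing about whether the right positions are being picked, which is what the accuracy benchmarks are supposed to show.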
A few caveats
Worth pausing on the “if their numbers hold up” part. As of writing, none of this has been independently verified. Subquadratic has published two blog posts and a marketing site. The promised technical report is “coming soon.” The only third-party material is an Appen whitepaper benchmarking the speed of their attention kernel, not the model’s accuracy. A separate evaluation through LayerLens / Stratix is announced but pending publication. VentureBeat’s headline captures the mood: “researchers demand independent proof.”
That doesn’t mean it’s wrong. It means we’re looking at company-reported numbers from a team with credentials (ex-Meta, ex-Google, Oxford, Cambridge) and a $29M seed round, but no peer review yet. Treat it as a thesis worth tracking, not a settled result.
If it works
Here’s the part that’s fun to think about. Most of what’s been packaged as “AI engineering” over the last two years (RAG pipelines, chunking strategies, vector DB selection, agent orchestration to fetch the right context) exists because models can’t fit much in at once. Take that constraint away and a lot of that machinery becomes unnecessary, or at least optional. You don’t need to retrieve the relevant five chunks if you can just give the model the whole codebase. You don’t need a clever agent framework to swap in the right files if every file is already in scope.
Whether SubQ specifically is the thing that changes this, or whether someone else gets there first, the direction looks real. Worth watching either way.