Study Tests Whether Transformers Need All Three QKV Projections

A new paper systematically evaluates three variants of projection sharing in transformer attention: shared key-value, shared query-key, and a single projection. The authors find that sharing the key and value projections performs on par with standard QKV attention, halving the KV cache at the cost of a 3.1% increase in perplexity. Combining this with grouped-query or multi-query attention can cut the cache by up to 96.9%, which the authors argue makes on-device inference more practical.

Neura News

Neura Market Editorial

June 5, 2026

A research paper accepted at the International Conference on Machine Learning (ICML) 2026 questions whether the standard three-projection design in transformer attention is always necessary. The study systematically examined what happens when some of the query, key, and value projections are shared or merged.

The work, titled "Do Transformers Need Three Projections? Systematic Study of QKV Variants," was authored by Ali Kayyam, Anusha Madan Gopal, and M Anthony Lewis. It is available on arXiv and has been accepted for publication in the PMLR proceedings (volume 306). The paper runs 26 pages, includes 12 figures and 16 tables, and the code is publicly released.

Testing Three Projection Sharing Strategies

The researchers evaluated three constraints on how the query (Q), key (K), and value (V) projections are tied together. The first variant, Q-K=V, shares the key and value projections while keeping the query separate. The second, Q=K-V, shares query and key but keeps value separate. The third, Q=K=V, uses a single projection for all three.
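The three tying schemes can be illustrated with a minimal sketch. This is not the authors' released code: the helper names (`make_projections`, `count_unique`), the toy dimensions, and the "Q-K-V" label for the untied baseline are assumptions made for illustration, and weight tying is modeled simply by reusing the same matrix.

```python
import random

def random_matrix(rows, cols, rng):
    """Build a rows x cols matrix of small random weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def make_projections(variant, d_model, d_head, rng):
    """Return (W_q, W_k, W_v); tied projections reuse one matrix object."""
    def new():
        return random_matrix(d_model, d_head, rng)
    if variant == "Q-K-V":   # untied baseline (label is hypothetical)
        return new(), new(), new()
    if variant == "Q-K=V":   # key and value share one projection
        w_q, w_kv = new(), new()
        return w_q, w_kv, w_kv
    if variant == "Q=K-V":   # query and key share one projection
        w_qk, w_v = new(), new()
        return w_qk, w_qk, w_v
    if variant == "Q=K=V":   # a single projection for all three
        w = new()
        return w, w, w
    raise ValueError(f"unknown variant: {variant}")

def count_unique(projections):
    """Count distinct weight matrices among the three projections."""
    return len({id(w) for w in projections})

rng = random.Random(0)
x = random_matrix(3, 4, rng)  # 3 tokens, d_model = 4 (toy sizes)
w_q, w_k, w_v = make_projections("Q-K=V", d_model=4, d_head=2, rng=rng)
q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
print(k == v)                         # keys and values coincide when tied
print(count_unique((w_q, w_k, w_v)))  # two distinct matrices remain
```

Under Q-K=V the key and value activations are identical, which is what allows a single tensor to be cached for both.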

The last two variants produce symmetric attention maps, meaning they lose the directional distinction between queries and keys. To address this, the authors experimented with asymmetric attention using 2D positional encodings.
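The symmetry follows directly from the algebra: if queries and keys come from the same projection W, the pre-softmax scores are (XW)(XW)^T, which equals its own transpose. The toy check below assumes small integer matrices chosen only for illustration, and it looks at raw scores before softmax or any positional encoding is applied.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(row) for row in zip(*m)]

def attention_scores(x, w_q, w_k):
    """Pre-softmax scores Q K^T with Q = x W_q and K = x W_k."""
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    return matmul(q, transpose(k))

def is_symmetric(m):
    return m == transpose(m)

# Illustrative values only: 3 tokens with d_model = 2.
x = [[1, 2], [3, -1], [0, 4]]
w_shared = [[2, 1], [0, 3]]
w_other = [[1, -1], [2, 0]]

tied = attention_scores(x, w_shared, w_shared)    # Q = K
untied = attention_scores(x, w_shared, w_other)   # separate Q and K

print(is_symmetric(tied))    # True
print(is_symmetric(untied))  # False
```

With tied projections, token i scores token j exactly as token j scores token i, so the scores alone cannot encode direction.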

Experiments spanned a range of tasks. Synthetic tasks tested basic attention behavior. Vision tasks included MNIST, CIFAR, TinyImageNet, and an anomaly detection benchmark. Language modeling experiments trained models of 300 million and 1.2 billion parameters on 10 billion tokens.

Language Models Show Minimal Quality Loss

In language modeling, the Q-K=V variant (shared key-value) performed on par with the standard three-projection approach and occasionally better. It achieved a 50% reduction in the key-value (KV) cache, which is a major memory bottleneck during inference, with only a 3.1% increase in perplexity. The Q=K-V and Q=K=V variants did not fare as well; the authors found that sharing query and key breaks attention directionality, which harms quality.
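A back-of-the-envelope calculation shows where the 50% figure comes from: a standard cache stores one key tensor and one value tensor per layer, while a shared key-value projection needs only one. The model configuration below (24 layers, 16 heads, head dimension 64, 4096-token context, 2-byte elements) is a hypothetical example, not a configuration reported in the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_element, tensors_per_layer):
    """Approximate cache size for one sequence.

    tensors_per_layer is 2 for separate K and V, and 1 when K and V
    share a projection and therefore a single cached tensor.
    """
    per_layer = num_kv_heads * head_dim * seq_len * bytes_per_element
    return num_layers * tensors_per_layer * per_layer

# Hypothetical configuration, chosen only to make the arithmetic concrete.
config = dict(num_layers=24, num_kv_heads=16, head_dim=64,
              seq_len=4096, bytes_per_element=2)

standard = kv_cache_bytes(**config, tensors_per_layer=2)
shared_kv = kv_cache_bytes(**config, tensors_per_layer=1)

print(standard // 2**20, "MiB")   # 384 MiB
print(shared_kv // 2**20, "MiB")  # 192 MiB
print(1 - shared_kv / standard)   # 0.5
```

The saving is independent of the configuration: whatever the model size or context length, dropping one of the two cached tensors halves the total.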

The paper explains that sharing key and value works because keys and values naturally occupy similar representational spaces in many transformer layers, and attention often operates in a low-rank regime where the distinction between keys and values is less critical.

Memory Gains for On-Device Inference

Crucially, projection sharing is complementary to existing head-sharing techniques such as grouped-query attention (GQA) and multi-query attention (MQA). Combining Q-K=V with GQA-4 yielded an 87.5% reduction in the KV cache. Combining it with MQA achieved a 96.9% reduction, making it feasible to run large language models on devices with limited memory.
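The reported percentages are consistent with simple multiplication of the two savings. The sketch below assumes a baseline of standard multi-head attention with 16 query heads; that head count is an inference that reproduces the article's numbers, not a figure stated here, and `kv_cache_reduction` is a hypothetical helper.

```python
from fractions import Fraction

def kv_cache_reduction(num_query_heads, num_kv_heads, share_kv):
    """Fractional cache reduction relative to multi-head attention
    with separate K and V tensors."""
    head_factor = Fraction(num_kv_heads, num_query_heads)       # head sharing
    tensor_factor = Fraction(1, 2) if share_kv else Fraction(1)  # K = V
    return 1 - head_factor * tensor_factor

HEADS = 16  # assumed baseline head count

gqa4_shared = kv_cache_reduction(HEADS, num_kv_heads=4, share_kv=True)
mqa_shared = kv_cache_reduction(HEADS, num_kv_heads=1, share_kv=True)

print(float(gqa4_shared))  # 0.875
print(float(mqa_shared))   # 0.96875
```

Under these assumptions, GQA-4 with a shared key-value projection leaves one eighth of the original cache, and MQA leaves one thirty-second, which rounds to the 96.9% reduction quoted above.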

The authors describe projection sharing as an underexplored form of weight tying in attention mechanisms. They argue that the direct, quantifiable memory benefits make it particularly valuable for edge deployment, where memory and bandwidth are constrained.

The paper does not claim that three projections are never needed. Instead, it systematically characterizes scenarios where reducing them is possible without significant quality loss, and it provides practical guidance for model designers.
