A research paper accepted at the International Conference on Machine Learning (ICML) 2026 questions whether the standard three-projection design in transformer attention is always necessary. The study systematically examined what happens when some of the query, key, and value projections are shared or merged.
The work, titled "Do Transformers Need Three Projections? Systematic Study of QKV Variants," was authored by Ali Kayyam, Anusha Madan Gopal, and M Anthony Lewis. It is available on arXiv and has been accepted for publication in the PMLR proceedings (volume 306). The paper runs 26 pages, includes 12 figures and 16 tables, and the code is publicly released.
Testing Three Projection Sharing Strategies
The researchers evaluated three constraints on how the query (Q), key (K), and value (V) projections are tied together. The first variant, Q-K=V, shares the key and value projections while keeping the query separate. The second, Q=K-V, shares query and key but keeps value separate. The third, Q=K=V, uses a single projection for all three.
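The last two variants tie queries and keys together, so the pre-softmax score between tokens i and j equals the score between j and i. The resulting attention maps are symmetric, meaning they lose the directional distinction between queries and keys. To address this, the authors experimented with asymmetric attention using 2D positional encodings.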
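The symmetry that comes from tying query and key can be checked numerically. The following is a minimal sketch in plain Python, not the authors' released code: the helper names (matmul, random_matrix, attention_scores, is_symmetric), the sequence length of 4, and the model width of 8 are all assumptions chosen for illustration. It only inspects the pre-softmax score matrix; a row-wise softmax applied afterwards would generally not preserve exact symmetry.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def random_matrix(rows, cols, rng):
    """Build a rows x cols matrix with entries drawn uniformly from [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def attention_scores(x, w_q, w_k):
    """Pre-softmax attention scores (X W_q)(X W_k)^T, without scaling."""
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    return matmul(q, transpose(k))


def is_symmetric(m, tol=1e-9):
    """True if the square matrix m equals its transpose within tol."""
    n = len(m)
    return all(abs(m[i][j] - m[j][i]) <= tol for i in range(n) for j in range(n))


rng = random.Random(0)          # fixed seed so the run is repeatable
x = random_matrix(4, 8, rng)    # 4 tokens, model width 8 (illustrative sizes)
w_q = random_matrix(8, 8, rng)  # query projection
w_k = random_matrix(8, 8, rng)  # separate key projection

standard_scores = attention_scores(x, w_q, w_k)  # Q and K projected separately
tied_scores = attention_scores(x, w_q, w_q)      # Q = K: one shared projection

print("separate Q and K symmetric:", is_symmetric(standard_scores))
print("tied Q = K symmetric:", is_symmetric(tied_scores))
```

With tied projections the score matrix has the form Q Q^T, which is symmetric by construction, so token i attends to token j exactly as strongly as j attends to i before normalization.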
Experiments spanned a range of tasks. Synthetic tasks tested basic attention behavior. Vision tasks included MNIST, CIFAR, TinyImageNet, and an anomaly detection benchmark. Language modeling experiments trained models of 300 million and 1.2 billion parameters on 10 billion tokens.
Language Models Show Minimal Quality Loss
In language modeling, the Q-K=V variant (shared key-value) performed on par with the standard three-projection approach and occasionally better. It halved the key-value (KV) cache, a major memory bottleneck during inference, at the cost of only a 3.1% increase in perplexity. The Q=K-V and Q=K=V variants did not fare as well; the authors found that sharing query and key breaks attention directionality, which harms quality.
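The 50% figure follows from simple bookkeeping if one assumes the cache stores a single tensor per token when keys and values are shared, instead of two. The symbols below are introduced here for illustration and do not come from the paper: L is the number of layers, H_kv the number of key-value heads, d_h the head dimension, T the number of cached tokens, and b the bytes per element.

```latex
\begin{align}
M_{\text{standard}} &= 2 \cdot L \cdot H_{kv} \cdot d_h \cdot T \cdot b \\
M_{\text{shared}}   &= 1 \cdot L \cdot H_{kv} \cdot d_h \cdot T \cdot b \\
\frac{M_{\text{shared}}}{M_{\text{standard}}} &= \frac{1}{2}
\end{align}
```

Under this assumption the saving does not depend on model size or context length, since every other factor cancels in the ratio.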
The paper explains that sharing key and value works because keys and values naturally occupy similar representational spaces in many transformer layers, and attention often operates in a low-rank regime where the distinction between keys and values is less critical.
Memory Gains for On-Device Inference
Crucially, projection sharing is complementary to existing head-sharing techniques such as grouped-query attention (GQA) and multi-query attention (MQA). Combining Q-K=V with GQA-4 yielded an 87.5% reduction in the KV cache. Combining it with MQA achieved a 96.9% reduction, making it feasible to run large language models on devices with limited memory.
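The combined figures can be reproduced with a back-of-the-envelope calculation. This is a sketch under stated assumptions, not the paper's accounting: it assumes 16 query heads, reads GQA-4 as four query heads sharing each KV head, and assumes that sharing key and value stores one cached tensor per token instead of two. The function names kv_cache_fraction and reduction_percent are hypothetical.

```python
def kv_cache_fraction(query_heads, kv_heads, shared_kv):
    """KV cache size relative to multi-head attention with separate K and V."""
    if query_heads % kv_heads != 0:
        raise ValueError("query_heads must be a multiple of kv_heads")
    tensors_per_token = 1 if shared_kv else 2  # shared K=V caches one tensor
    return (kv_heads / query_heads) * (tensors_per_token / 2)


def reduction_percent(fraction):
    """Convert a relative cache size into a percentage reduction."""
    return 100.0 * (1.0 - fraction)


QUERY_HEADS = 16  # assumption: chosen because it matches the reported numbers

# name -> (number of KV heads, whether K and V share one projection)
configs = {
    "MHA, separate K and V": (16, False),
    "MHA + Q-K=V": (16, True),
    "GQA-4, separate K and V": (4, False),
    "GQA-4 + Q-K=V": (4, True),
    "MQA, separate K and V": (1, False),
    "MQA + Q-K=V": (1, True),
}

for name, (kv_heads, shared_kv) in configs.items():
    fraction = kv_cache_fraction(QUERY_HEADS, kv_heads, shared_kv)
    print(f"{name}: {reduction_percent(fraction):.1f}% smaller KV cache")
```

With these assumptions, GQA-4 plus shared key-value leaves one eighth of the original cache (87.5% smaller) and MQA plus shared key-value leaves one thirty-second (96.875%, which rounds to the reported 96.9%). The two savings multiply because head sharing reduces the number of cached heads while projection sharing reduces the number of tensors cached per head.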
The authors describe projection sharing as an underexplored form of weight tying in attention mechanisms. They argue that the direct, quantifiable memory benefits make it particularly valuable for edge deployment, where memory and bandwidth are constrained.
The paper does not claim that three projections are never needed. Instead, it systematically characterizes scenarios where reducing them is possible without significant quality loss, and it provides practical guidance for model designers.