SpectralQuant: KV Cache Compression via Spectral Rotation and Selective Error Correction

Executive Summary

SpectralQuant achieves 5.95x KV cache compression (vs 5.02x for TurboQuant) with identical perplexity and 4.5x faster attention decoding. It does so by exploiting a striking empirical finding: only ~4 of the 128 dimensions in KV cache keys carry meaningful signal. The method replaces random rotation with a learned spectral rotation, applies error correction only to the signal dimensions, and allocates bits non-uniformly. The core insight resolves a puzzle about when error correction helps and when it hurts: correcting noise dimensions with noisy estimates adds noise.


1. Background: TurboQuant (ICLR 2026)

TurboQuant from Google Research established the data-oblivious baseline for KV cache compression. Its pipeline:

  1. Random rotation (Hadamard or random orthogonal matrix) to spread outliers across dimensions
  2. Uniform scalar quantization at low bit-width
  3. QJL error correction — Johnson-Lindenstrauss sketches to correct quantization error

Key theoretical result: TurboQuant is provably within 2.7x of the information-theoretic optimum for data-oblivious methods (methods that don’t look at the data distribution before choosing the compression scheme).

Achieved: 5.02x compression with acceptable quality.

The “data-oblivious” constraint is both a strength (no calibration, no distribution shift risk) and a limitation (leaves signal structure on the table).


2. The Spectral Gap Finding: d_eff ~ 4

Participation Ratio

SpectralQuant’s foundational observation comes from computing the participation ratio of the key covariance matrix eigenvalue spectrum:

d_eff = (sum(lambda_i))^2 / sum(lambda_i^2)

where lambda_i are the eigenvalues of the key covariance matrix K^T K (or equivalently the squared singular values of the key matrix).

The participation ratio is a standard measure from statistical physics and random matrix theory. It answers: “How many dimensions actually participate in representing the signal?” A uniform spectrum (all eigenvalues equal) gives d_eff = d. A single dominant eigenvalue gives d_eff = 1.
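The participation ratio is simple to compute. A minimal NumPy sketch (the `keys` matrix here is a hypothetical (num_tokens, head_dim) sample of cached keys, not the paper's data):

```python
import numpy as np

def participation_ratio(keys: np.ndarray) -> float:
    """d_eff = (sum lambda_i)^2 / sum(lambda_i^2) for the key covariance."""
    cov = keys.T @ keys / keys.shape[0]   # (d, d) covariance (uncentered)
    lam = np.linalg.eigvalsh(cov)         # eigenvalues, ascending order
    return lam.sum() ** 2 / np.square(lam).sum()

# Sanity checks on synthetic data:
rng = np.random.default_rng(0)
iso = rng.standard_normal((4096, 128))                      # flat spectrum
spike = rng.standard_normal((4096, 1)) * np.ones((1, 128))  # rank-1 spectrum
```

An isotropic sample gives d_eff near d = 128; a rank-1 sample gives d_eff near 1, matching the two limiting cases described above.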

The Result

Model family                          Model     d_eff (keys)  d_eff (values)
Qwen 2.5                              14B       ~4            ~50
(5 other models across 4 families)    Various   ~4            ~50

d_eff ~ 4 for keys holds across all 6 tested models in 4 families, a remarkable consistency. It means the eigenvalue spectrum of the key covariance is extremely steep: a handful of directions capture almost all the variance.

What This Means Physically

Keys implement a narrow matching/selection operation. When you compute Q @ K^T during attention, you’re asking “which stored keys match my current query?” This matching operation lives in a very low-dimensional subspace. The query-key dot product is dominated by a few principal directions that encode positional, syntactic, and semantic selectors.

The other ~124 dimensions contain essentially noise — residual variance that doesn’t contribute meaningfully to the attention pattern.


3. SpectralQuant’s Three Modifications

3.1 Spectral Rotation (replacing random rotation)

TurboQuant: Apply a random orthogonal rotation R to keys before quantization. This spreads outlier energy across all dimensions, making uniform quantization work better.

SpectralQuant: Apply the eigenvector matrix of the key covariance as the rotation. This aligns coordinate axes with the principal signal directions.

Calibration cost: ~15 seconds on a small calibration set. Compute the covariance of keys, take the eigendecomposition, use the eigenvector matrix as the rotation. This is a one-time cost per model (not per-request).

Why this helps: After spectral rotation, the signal is concentrated in the first few coordinates (those corresponding to large eigenvalues). After random rotation, signal is smeared uniformly. Concentration enables the next two optimizations.
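The calibration step described above is just a covariance eigendecomposition. A sketch of what it might look like (the anisotropic `keys` sample is synthetic, with 4 strong directions standing in for the measured d_eff ~ 4):

```python
import numpy as np

def fit_spectral_rotation(calib_keys: np.ndarray) -> np.ndarray:
    """Return an orthogonal (d, d) rotation whose columns are covariance
    eigenvectors, ordered so large-eigenvalue (signal) directions come first."""
    cov = calib_keys.T @ calib_keys / calib_keys.shape[0]
    lam, vecs = np.linalg.eigh(cov)     # ascending eigenvalues
    order = np.argsort(lam)[::-1]       # descending: signal first
    return vecs[:, order]

rng = np.random.default_rng(1)
# Hypothetical anisotropic keys: 4 strong directions out of 128.
scales = np.concatenate([np.full(4, 10.0), np.full(124, 0.1)])
keys = rng.standard_normal((2048, 128)) * scales
R = fit_spectral_rotation(keys)
rotated = keys @ R                      # signal now concentrated in first coords
```

After this rotation the first few coordinates carry nearly all the variance, which is exactly the concentration the next two optimizations rely on.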

3.2 Selective Error Correction

TurboQuant: Apply QJL (quantized Johnson-Lindenstrauss) error correction to all 128 dimensions. QJL stores a random sketch of the quantization error and uses it to correct dot-product estimates.

SpectralQuant: Apply QJL only to the ~4 signal dimensions.

Savings: 124 fewer bits per key for the QJL sketch.

The critical insight — and arguably the paper’s most important contribution:

Correcting noise dimensions with noisy estimates adds noise.

QJL error correction is itself noisy (it’s a random projection). When the true quantization error on a dimension is near zero (because the key component in that dimension is near zero — it’s a noise dimension), the QJL “correction” introduces error that wasn’t there before. The correction noise dominates the original signal.

Removing QJL entirely (even with random rotation) improves cosine similarity by +3.0 percentage points over TurboQuant. This is a remarkable finding: the error correction was net harmful on noise dimensions.
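The "correction noise dominates" argument can be checked numerically. The toy simulation below is not QJL itself; it stands in for the sketch with an error estimate corrupted by Gaussian noise (`sigma_corr`, `d_sig`, and the scales are all illustrative assumptions), and shows that correcting every dimension does worse than correcting only the signal dimensions:

```python
import numpy as np

rng = np.random.default_rng(2)
d_sig, d_noise, n = 4, 124, 4096
sigma_corr = 0.2                       # the correction sketch is itself noisy

# Toy keys after spectral rotation: large signal coords, tiny noise coords.
x = np.concatenate([rng.standard_normal((n, d_sig)) * 10.0,
                    rng.standard_normal((n, d_noise)) * 0.05], axis=1)

# Uniform scalar quantization with a shared step.
step = 2.0
xq = np.round(x / step) * step
err = x - xq                           # true per-coordinate quantization error

# Noisy estimate of the error (stand-in for the QJL sketch).
err_hat = err + rng.normal(0, sigma_corr, size=err.shape)

mse = lambda a, b: np.mean((a - b) ** 2, axis=0)   # per-dimension MSE

corrected_all = xq + err_hat                       # correct every dimension
corrected_sel = xq.copy()
corrected_sel[:, :d_sig] += err_hat[:, :d_sig]     # correct signal dims only
```

On the signal dimensions the true error is large relative to `sigma_corr`, so correction helps; on the noise dimensions the true error is smaller than the correction noise, so correction makes things worse — the principle stated above.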

3.3 Non-Uniform Bit Allocation

TurboQuant: Same number of bits for every dimension.

SpectralQuant: More bits for signal dimensions, fewer for noise dimensions.

This is the water-filling algorithm from information theory / rate-distortion theory. Given a total bit budget, dimension i receives b_i = max(0, 0.5 * log2(lambda_i / theta)) bits, where theta is the "water level" chosen to exhaust the budget. Dimensions with large eigenvalues (signal) get more bits; dimensions whose eigenvalues fall below the water level get zero bits.

After spectral rotation, dimensions are ordered by importance (eigenvalue magnitude), so non-uniform allocation is straightforward — just assign different quantization step sizes per coordinate.
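A minimal water-filling sketch, solving for the water level by bisection (the steep spectrum below is synthetic, chosen to mimic the 4-strong-directions regime):

```python
import numpy as np

def water_fill_bits(lam: np.ndarray, total_bits: float) -> np.ndarray:
    """Reverse water-filling: b_i = max(0, 0.5*log2(lam_i/theta)), with the
    water level theta chosen by bisection so that sum(b_i) == total_bits."""
    lo, hi = 1e-12, float(lam.max())
    for _ in range(200):
        theta = np.sqrt(lo * hi)       # geometric midpoint: theta spans decades
        bits = np.maximum(0.0, 0.5 * np.log2(lam / theta))
        if bits.sum() > total_bits:
            lo = theta                 # too many bits: raise the water level
        else:
            hi = theta
    return bits

# Hypothetical steep spectrum: 4 dominant eigenvalues out of 128.
lam = np.concatenate([np.full(4, 100.0), np.full(124, 0.01)])
bits = water_fill_bits(lam, total_bits=32.0)
```

Under this spectrum the budget is spent almost entirely on the four dominant dimensions, with near-zero bits elsewhere.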


4. Results

Compression and Quality

Method              Compression   Cosine sim (Qwen 2.5-14B)  Perplexity  Notes
TurboQuant          5.02x         baseline                   9.51        Random rotation + QJL on all dims
TurboQuant w/o QJL  ~5.5x (est.)  +3.0pp                     ~9.51       Random rotation, no QJL at all
SpectralQuant       5.95x         +2.59pp                    9.51        Spectral rotation + selective QJL

Task Performance

  • Perplexity: Identical to TurboQuant (9.51) — no quality regression
  • Needle-in-haystack: Perfect retrieval — long-context performance preserved
  • Attention decoding: 4.5x faster (on NVIDIA B200) — fewer bits to move through memory hierarchy

The 4.5x decoding speedup is particularly notable. KV cache is the memory bandwidth bottleneck during autoregressive decoding. Compressing it by 5.95x means proportionally less data movement, and the B200’s 8 TB/s HBM3e bandwidth is used more efficiently.

Statistical Reliability

The comparison was repeated across five independent random seeds. SpectralQuant won all five, with mean cosine similarity 0.8681 +/- 0.0013 versus TurboQuant’s 0.8404 +/- 0.0043. SpectralQuant is not only higher on average but 3.3x more consistent across runs.

Distribution Shift

Calibrated on WikiText-2, evaluated cross-domain (code, chat, legal). SpectralQuant’s advantage held across all four domain combinations, with gains from +2.1 to +3.6 percentage points. The calibration captures structural properties of the model itself, not surface features of any particular text domain.


5. Keys vs Values: A Fundamental Asymmetry

This is the most conceptually interesting finding in the work.

The Asymmetry

Property              Keys                                     Values
d_eff                 ~4                                       ~50
Role                  Selectors (matching)                     Information carriers
Spectrum              Extremely steep                          Relatively flat
Low-rank truncation   Works okay                               Catastrophic
Quantization          All dims needed (dot-product accuracy)   All dims needed (information diversity)

Why Keys Are Low-Dimensional

Keys answer the question: “Does this token match the query?” This is fundamentally a classification/matching task. The matching criteria are:

  • Positional proximity (a few dimensions)
  • Syntactic role agreement (a few dimensions)
  • Semantic similarity along a few axes

A 4-dimensional subspace is sufficient to encode these selectors with high fidelity. The attention pattern softmax(Q @ K^T / sqrt(d)) depends on relative dot products, and those dot products are dominated by the top-4 principal components.
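The "dot products dominated by the top-4 components" claim can be illustrated on synthetic data. The sketch below assumes queries and keys share the same 4 strong directions (an illustrative setup, not the paper's measurement), and compares full attention logits with a rank-4 reconstruction:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 128, 1024
# Hypothetical keys/queries with 4 dominant principal directions.
scales = np.concatenate([np.full(4, 10.0), np.full(124, 0.1)])
K = rng.standard_normal((n, d)) * scales
Q = rng.standard_normal((16, d)) * scales

# Principal directions of the key covariance, descending.
lam, V = np.linalg.eigh(K.T @ K / n)
V = V[:, np.argsort(lam)[::-1]]

logits_full = Q @ K.T
logits_top4 = (Q @ V[:, :4]) @ (K @ V[:, :4]).T   # rank-4 approximation

corr = np.corrcoef(logits_full.ravel(), logits_top4.ravel())[0, 1]
```

With a spectrum this steep, the rank-4 logits are almost perfectly correlated with the full logits, so the attention pattern is essentially unchanged.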

Why Values Are High-Dimensional

Values answer a different question: “Given that this token was selected, what information does it contribute to the output?” Each token carries different information:

  • Token A might contribute syntactic features along dimensions 10-20
  • Token B might contribute entity information along dimensions 50-60
  • Token C might contribute tonal information along dimensions 100-110

No single token uses all 50 effective dimensions, but across the token population, many different subsets of dimensions are used. The population covariance looks high-dimensional because of this diversity.

The Failed Experiment

An obvious idea: if keys live in a 4-dimensional subspace, project them there. This would give 25.6x compression (128/4 = 32x on keys, blended with values).

This fails because:

  1. Values need all dimensions for their diverse information — at rank 4, values reconstruct with only 0.15 cosine similarity (catastrophic)
  2. Even at rank 64 (only 2x compression), attention output quality (0.66) is still far below TurboQuant’s 0.84 at 5x compression
  3. The spectral concentration is a property of keys specifically, not of the KV cache in general

The solution is what SpectralQuant does: keep all dimensions, but allocate bits and error correction budget non-uniformly.
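The key/value asymmetry behind the failed experiment can be reproduced in a toy model. The simulation below is illustrative only (the specific ranks, sparsity, and scales are assumptions, not the paper's numbers): keys share one low-dimensional subspace, while each "value" token writes into its own small set of dimensions, so the value population looks high-dimensional even though each token is simple.

```python
import numpy as np

def rank_r_cosine(X: np.ndarray, r: int) -> float:
    """Mean per-row cosine similarity after projecting X onto its
    top-r principal directions."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    Xr = (X @ Vt[:r].T) @ Vt[:r]
    num = np.sum(X * Xr, axis=1)
    den = np.linalg.norm(X, axis=1) * np.linalg.norm(Xr, axis=1) + 1e-12
    return float(np.mean(num / den))

rng = np.random.default_rng(4)
n, d = 2048, 128
# Keys: one shared 4-dim signal subspace plus small residual noise.
keys = (rng.standard_normal((n, 4)) @ rng.standard_normal((4, d))
        + 0.05 * rng.standard_normal((n, d)))
# Values: each token activates its own small subset of dimensions,
# so the population covariance is near-flat.
values = np.zeros((n, d))
idx = rng.integers(0, d, size=(n, 8))
values[np.arange(n)[:, None], idx] = rng.standard_normal((n, 8))
```

Rank-4 truncation preserves the toy keys almost perfectly but destroys the toy values, mirroring the "works okay" vs "catastrophic" rows in the asymmetry table.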

Connection to “Homogeneous Keys, Heterogeneous Values”

This asymmetry connects to the observation (from the HKHV line of work, Cui and Xu 2025) that:

  • Keys are homogeneous: similar structure across layers and tokens, cluster tightly
  • Values are heterogeneous: diverse across tokens, layer-specific distributions

SpectralQuant provides a quantitative characterization of this qualitative observation: the participation ratio gives a single number measuring heterogeneity of the spectrum.


6. Connections to Prior Work

KIVI (ICML 2024)

KIVI quantizes keys and values separately, noting that keys are easier to quantize (tolerant of aggressive quantization). SpectralQuant explains why: with d_eff ~ 4, most key dimensions contain noise that can be aggressively quantized without quality loss. KIVI didn’t have the spectral analysis to exploit this asymmetry optimally.

Loki (2024)

Loki identifies “lazy” key heads that barely participate in attention and skips them entirely. This is a head-level version of SpectralQuant’s dimension-level observation. Some heads have even lower d_eff; Loki’s lazy heads likely have d_eff ~ 1-2.

KV-CoRE (2026)

KV-CoRE uses core tokens (high-attention tokens) for cache eviction, and independently measures that keys have substantially lower effective rank than values. It is orthogonal to SpectralQuant: KV-CoRE decides which tokens to keep, SpectralQuant decides how precisely to store them. The two could compose.

RoPE and Key Structure

Rotary position embeddings (RoPE) may partly explain the low d_eff of keys. RoPE applies frequency-dependent rotations to pairs of dimensions. The dominant frequencies concentrate key energy in a few dimension pairs, while high-frequency pairs contribute mostly noise for typical sequence lengths. The spectral rotation may be partially learning to undo RoPE’s mixing.

Random Rotation Literature (QuIP, QuIP#, AQLM)

The idea of rotating weights before quantization appears in weight quantization (QuIP, QuIP#). SpectralQuant applies the same principle to activations (KV cache) and adds the insight that learned rotation >> random rotation when the spectrum is steep.

Rate-Distortion Theory

The water-filling bit allocation is classical (Shannon, 1959; Gallager). SpectralQuant’s contribution is showing it applies to KV cache: the spectrum is steep enough that non-uniform allocation gives meaningful gains over uniform allocation at the same total bit budget.


7. Practical Implications for Inference Serving

Memory Savings

For a 14B-parameter model with 128K context:

  • FP16 KV cache: ~8 GB
  • TurboQuant (5.02x): ~1.6 GB
  • SpectralQuant (5.95x): ~1.34 GB
  • Savings vs TurboQuant: ~260 MB per request

At scale (thousands of concurrent requests), this compounds. A serving cluster with 1000 concurrent 128K-context sessions saves ~260 GB of HBM.
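The per-request arithmetic above is straightforward to verify:

```python
def kv_cache_gb(fp16_gb: float, compression: float) -> float:
    """Compressed KV cache size for a given fp16 baseline and ratio."""
    return fp16_gb / compression

fp16 = 8.0                              # ~8 GB KV cache at 128K context (fp16)
turbo = kv_cache_gb(fp16, 5.02)         # ~1.59 GB
spectral = kv_cache_gb(fp16, 5.95)      # ~1.34 GB
saved_mb = (turbo - spectral) * 1024    # ~255-260 MB per request
```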

Throughput

The 4.5x attention decoding speedup is the headline number for serving operators. In memory-bandwidth-bound decoding (the dominant regime for long contexts), compressing KV cache directly translates to higher tokens/second.

On B200 at 8 TB/s HBM bandwidth:

  • FP16 KV read for 128K context: ~1 ms
  • SpectralQuant KV read: ~0.17 ms
  • This moves the bottleneck toward compute, especially at longer contexts
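The read-time figures follow directly from dividing cache size by bandwidth (idealized streaming, ignoring overlap and kernel overheads):

```python
def kv_read_ms(cache_gb: float, bandwidth_tb_s: float) -> float:
    """Time to stream the whole KV cache from HBM, in milliseconds."""
    return cache_gb / (bandwidth_tb_s * 1000) * 1000  # GB / (GB/ms)

fp16_ms = kv_read_ms(8.0, 8.0)          # 8 GB at 8 TB/s -> ~1 ms
sq_ms = kv_read_ms(8.0 / 5.95, 8.0)     # ~1.34 GB -> ~0.17 ms
```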

Calibration Overhead

15 seconds of calibration is negligible for production deployment. It can be done once per model version offline. The spectral rotation matrices are small (128x128 per layer per head group) and add no runtime latency — they’re fused into the projection weights.
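The "fused into the projection weights" point follows from associativity of matrix products: rotating the keys after projection is the same as projecting with pre-rotated weights. A sketch with hypothetical shapes (`W_k` and `R` are stand-ins, not real model weights):

```python
import numpy as np

rng = np.random.default_rng(5)
d_model, d_head = 256, 128
W_k = rng.standard_normal((d_model, d_head)) * 0.02          # toy key projection
R = np.linalg.qr(rng.standard_normal((d_head, d_head)))[0]   # stand-in rotation

x = rng.standard_normal((10, d_model))  # hidden states
rotated_after = (x @ W_k) @ R           # rotate keys at runtime
fused = x @ (W_k @ R)                   # fold rotation into the weights offline
```

Both paths produce identical keys, so the rotation costs nothing at decode time.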

Compatibility

SpectralQuant operates on the KV cache post-projection, so it’s compatible with:

  • GQA (grouped-query attention) — apply per group
  • MQA (multi-query attention) — apply to the shared KV head
  • Paged attention (vLLM, SGLang) — compress each page independently
  • Speculative decoding — compressed KV cache works for verification too



8. Open Questions and Limitations

Tested Scale

Six models across four families is good but not exhaustive. Key questions:

  • Does d_eff ~ 4 hold for MoE models (Mixtral, DBRX, DeepSeek-V3)?
  • Does it hold for very large models (70B+, 405B)?
  • What about non-transformer architectures with KV caches (Mamba-2 hybrid, RWKV)?

Layer Variation

The reported d_eff values are aggregates; d_eff likely varies across layers:

  • Early layers (pattern matching on surface features): possibly d_eff ~ 2
  • Middle layers (semantic processing): possibly d_eff ~ 6-8
  • Late layers (output preparation): unclear

Per-layer adaptive compression could squeeze additional gains.

Training-Time Implications

If keys only need 4 effective dimensions, should we train models with lower-dimensional key projections? A 4-dimensional key projection would be a radical architectural change but could dramatically reduce both compute and memory. The “failed experiment” section suggests this doesn’t work naively because of the key-value coupling, but modified architectures (e.g., decoupled key-value dimensions, as in some linear attention variants) might enable it.

Dynamic d_eff

Does d_eff change with:

  • Sequence length? (Longer contexts might need more dimensions for finer selection)
  • Task type? (Code generation vs creative writing vs factual QA)
  • Prompt structure? (Few-shot examples vs zero-shot)

If d_eff is task-dependent, adaptive compression (measure d_eff online, adjust bit allocation) could be valuable.

Interaction with GQA

GQA already reduces the KV cache by sharing keys/values across query head groups. Does GQA affect d_eff? Shared keys might have higher d_eff (serving multiple query heads) or lower (forced to find a common low-dimensional subspace). This interaction hasn’t been characterized.

Theoretical Gap

TurboQuant has a 2.7x optimality guarantee for data-oblivious methods. SpectralQuant is data-aware (uses calibration). What’s the theoretical optimum for data-aware methods? Is 5.95x close to fundamental limits, or is there substantial room for improvement?


9. Why This Matters

SpectralQuant is important not primarily for its compression ratio (5.95x vs 5.02x is an 18.5% improvement) but for what it reveals about transformer internals:

  1. Attention is lower-dimensional than we thought. The key-query matching mechanism operates in a ~4-dimensional subspace despite having 128 coordinate dimensions. This has implications for architecture design, pruning, and theoretical understanding of what attention heads learn.

  2. Error correction can be harmful. The finding that QJL on noise dimensions hurts quality is a general principle: don’t correct errors that are smaller than your correction noise. This applies beyond KV cache to any quantization or compression scheme with error correction.

  3. Keys and values have fundamentally different information geometry. This asymmetry should inform every KV cache optimization going forward. Treating keys and values identically leaves performance on the table.

  4. 15 seconds of calibration buys a lot. The gap between data-oblivious (TurboQuant) and data-aware (SpectralQuant) methods is substantial but cheap to bridge. This argues against purely random/universal approaches when a tiny calibration budget is available.

The participation ratio as a diagnostic tool for understanding transformer representations is perhaps the most transferable contribution. Measuring d_eff per layer, per head, per token position could yield a rich characterization of what different parts of the model are doing.