SpectralQuant Implementation Analysis
Companion to SpectralQuant KV Cache. See also: Inference Stack Synthesis.
Note: The GitHub repository at https://github.com/Dynamis-Labs/spectralquant was not accessible at the time of this analysis (April 2026). This document reconstructs what the implementation should look like, based on the SpectralQuant paper and standard practices in KV cache compression research code. When the repo becomes available, this analysis should be updated with actual code references.
1. Paper Recap (Implementation-Relevant Points)
SpectralQuant’s core claim: replace the random rotation matrix used in QuIP#/QJL-style KV cache compression with a data-driven spectral rotation derived from the eigendecomposition of the key covariance matrix. This separates “signal” dimensions (high-variance eigenvectors) from “noise” dimensions (low-variance), enabling:
- Selective QJL: Apply JL random projection only to noise dimensions, preserve signal dimensions at full precision
- Non-uniform bit allocation: Water-filling algorithm assigns more bits to high-variance dimensions
- Headline result: 5.95x compression ratio with improved cosine similarity vs uniform quantization
2. Expected Implementation Architecture
```
spectralquant/
  calibrate.py             # Offline calibration: collect activations, eigendecompose
  spectral_rotation.py     # Build rotation matrix from eigenvectors
  qjl_selective.py         # QJL with signal/noise split
  bit_allocation.py        # Water-filling non-uniform quantization
  compress.py              # Full compression pipeline
  decompress.py            # Decompression / dequantization
  eval/
    cosine_sim.py          # Cosine similarity measurement
    perplexity.py          # Perplexity evaluation on WikiText-2 / C4
  scripts/
    calibrate.py           # CLI entry point for calibration
    evaluate.py            # CLI entry point for benchmarks
  configs/
    qwen2.5_14b.yaml       # Per-model calibration configs
    llama3_8b.yaml
```
3. Implementation Details (Reconstructed)
3.1 Calibration Data Collection
Expected implementation:
```python
# scripts/calibrate.py (reconstructed)
import torch

def collect_key_activations(model, dataloader, num_samples=128):
    """Run forward passes, collect K projections per layer via forward hooks."""
    key_accum = {}  # {layer_idx: list of key tensors}

    def make_hook(layer_idx):
        def hook(module, inputs, output):
            key_accum.setdefault(layer_idx, []).append(output.detach().cpu())
        return hook

    # HF-style module path (model.model.layers[i].self_attn.k_proj); adjust per model
    hooks = [layer.self_attn.k_proj.register_forward_hook(make_hook(i))
             for i, layer in enumerate(model.model.layers)]
    with torch.no_grad():
        for batch_idx, batch in enumerate(dataloader):
            if batch_idx >= num_samples:
                break
            model(batch, output_attentions=False)
    for h in hooks:
        h.remove()
    return key_accum
```

Key parameters:
- Dataset: Likely WikiText-2 or C4 validation split (128-512 sequences)
- Sequence length: 2048 tokens (matching model context)
- Batch size: 1 (calibration is not throughput-sensitive)
- Sampling: First N sequences, no shuffling (deterministic)
Collection mechanism: Register forward hooks on each attention layer’s key projection
(k_proj). Capture the output tensor after the linear projection but before any RoPE or
reshaping. This gives raw key vectors in [batch, seq_len, num_heads, head_dim] format.
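The reshape from the raw `k_proj` output to per-head key vectors can be sketched as follows. This is a minimal sketch with synthetic data; the shape values are illustrative, not taken from a specific model:

```python
import numpy as np

# Illustrative shapes (assumptions, not a specific model's config)
batch, seq_len, num_heads, head_dim = 2, 16, 8, 128

# k_proj emits a flat [batch, seq_len, num_heads * head_dim] tensor
k_flat = np.random.randn(batch, seq_len, num_heads * head_dim).astype(np.float32)

# Recover the per-head layout, then merge (batch, seq) into one sample axis
k_heads = k_flat.reshape(batch, seq_len, num_heads, head_dim)
per_head_samples = k_heads.transpose(2, 0, 1, 3).reshape(num_heads, -1, head_dim)

print(per_head_samples.shape)  # → (8, 32, 128): num_heads x samples x head_dim
```

Each `per_head_samples[h]` is then the `[N, head_dim]` matrix that a per-head calibration routine would consume.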
3.2 Eigendecomposition
```python
def compute_spectral_basis(key_activations, per_head=True):
    """
    Compute eigenbasis for key covariance.

    Args:
        key_activations: [N, head_dim] concatenated key vectors
        per_head: If True, separate basis per attention head
    Returns:
        eigenvectors: [head_dim, head_dim] orthogonal rotation matrix
        eigenvalues: [head_dim] variance along each component
    """
    # Center the data
    mean = key_activations.mean(dim=0)
    centered = key_activations - mean
    # Covariance matrix: [head_dim, head_dim]
    # For head_dim=128, this is 128x128 -- trivially small
    cov = (centered.T @ centered) / (centered.shape[0] - 1)
    # Eigendecomposition (symmetric positive semi-definite)
    eigenvalues, eigenvectors = torch.linalg.eigh(cov)
    # Sort descending by eigenvalue
    idx = eigenvalues.argsort(descending=True)
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]
    return eigenvectors, eigenvalues
```

Critical design decisions:
- Per-head vs per-layer: Should be per-head, because different heads attend to different subspaces. The covariance structure varies significantly across heads (some are "positional," some are "semantic").
- torch.linalg.eigh vs numpy: Use torch.linalg.eigh (symmetric eigendecomposition) because the covariance matrix is guaranteed symmetric PSD. For head_dim=128, the 128x128 eigendecomposition takes under 1 ms; the bottleneck is collecting activations.
- Storage: The eigenvector matrix is [num_layers, num_heads, head_dim, head_dim]. For Qwen 2.5-14B: ~32 MB in FP16. This is a one-time cost, stored alongside the model.
- RoPE interaction: Keys have RoPE applied. Calibration must eigendecompose post-RoPE keys, because that is the representation stored in the KV cache. However, RoPE makes the covariance position-dependent; the paper likely averages across positions.
3.3 Spectral Rotation at Inference
```python
class SpectralKVCache:
    def __init__(self, eigenvectors, eigenvalues, d_eff, bit_config):
        self.R = eigenvectors         # rotation matrices
        self.lambdas = eigenvalues    # for bit allocation
        self.d_eff = d_eff            # signal/noise boundary
        self.bit_config = bit_config  # per-dim bit widths

    def compress_keys(self, keys, layer_idx, head_idx):
        """
        keys: [batch, seq_len, head_dim]
        Returns: compressed representation
        """
        R = self.R[layer_idx, head_idx]  # [head_dim, head_dim]
        # Rotate into eigenbasis
        rotated = keys @ R  # [batch, seq_len, head_dim]
        # Split signal and noise
        d = self.d_eff[layer_idx, head_idx]
        signal = rotated[..., :d]  # high-variance dims
        noise = rotated[..., d:]   # low-variance dims
        # Quantize signal with more bits (e.g., 4-8 bit)
        signal_q = self.quantize(signal, self.bit_config[..., :d])
        # Quantize noise aggressively, or skip QJL
        noise_q = self.low_bit_quantize(noise, bits=2)
        return (signal_q, noise_q, d)
```

The rotation replaces random projection: in QJL, keys are multiplied by a random Gaussian matrix. SpectralQuant replaces this with the eigenvector matrix, which concentrates variance into the first few dimensions.
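This variance-concentration effect is easy to check on synthetic data: a spectral (PCA) rotation packs most variance into the leading dimensions, while a random rotation spreads it roughly uniformly. A minimal sketch using a synthetic anisotropic Gaussian rather than real key activations:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, d_eff = 64, 10_000, 4

# Synthetic "keys": a few high-variance directions plus isotropic noise
scales = np.ones(d)
scales[:d_eff] = 10.0
keys = rng.standard_normal((n, d)) * scales

# Spectral rotation: eigenvectors of the key covariance, sorted descending
cov = np.cov(keys, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
R_spec = eigvecs[:, np.argsort(eigvals)[::-1]]

# Random rotation: QR of a Gaussian matrix (QJL-style, data-oblivious)
R_rand, _ = np.linalg.qr(rng.standard_normal((d, d)))

def top_var_fraction(X, R, k):
    """Fraction of total variance captured by the first k rotated dims."""
    v = (X @ R).var(axis=0)
    return v[:k].sum() / v.sum()

print(top_var_fraction(keys, R_spec, d_eff))  # large: signal is concentrated
print(top_var_fraction(keys, R_rand, d_eff))  # small: roughly d_eff / d
```

With the spectral rotation, the first `d_eff` dimensions carry nearly all the variance, which is what makes the signal/noise split meaningful.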
3.4 QJL Selective Application
d_eff cutoff determination:
```python
def compute_d_eff(eigenvalues, threshold=0.99):
    """
    Find effective dimensionality via participation ratio
    or cumulative variance threshold.
    """
    # Participation ratio method (as described in paper)
    total = eigenvalues.sum()
    d_eff_pr = (total ** 2) / (eigenvalues ** 2).sum()
    # Or: cumulative variance threshold
    cumvar = eigenvalues.cumsum(dim=0) / total
    d_eff_cv = (cumvar >= threshold).nonzero()[0][0].item() + 1
    return int(d_eff_pr)  # paper uses participation ratio
```

The paper reports d_eff ≈ 4 via participation ratio for keys. This is much lower than a 99% variance threshold would give. The participation ratio weights by squared eigenvalues, making it sensitive to the spectral gap, which is what makes it the right metric here.
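The gap between the two estimators is easy to see numerically. A minimal sketch with an illustrative spectrum (four dominant eigenvalues plus a flat noise floor; the numbers are invented for illustration):

```python
import numpy as np

# Illustrative spectrum: 4 strong components, 124 weak ones (head_dim = 128)
eigenvalues = np.concatenate([np.full(4, 10.0), np.full(124, 0.05)])
total = eigenvalues.sum()

# Participation ratio: (sum lambda)^2 / sum lambda^2 -- driven by the spectral gap
d_eff_pr = total ** 2 / (eigenvalues ** 2).sum()

# 99% cumulative variance threshold -- forced to count many weak dims too
cumvar = np.cumsum(eigenvalues) / total
d_eff_cv = int(np.argmax(cumvar >= 0.99)) + 1

print(int(d_eff_pr), d_eff_cv)  # → 5 119
```

The participation ratio lands near the number of dominant components, while the cumulative-variance cutoff is inflated by the long tail of small eigenvalues.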
Key insight: SpectralQuant’s “selective” QJL means NO QJL on noise dimensions, not reduced QJL. The paper shows removing QJL entirely from noise dims improves quality by +3.0pp cosine similarity, because QJL correction on near-zero signals injects noise.
3.5 Non-Uniform Bit Allocation (Water-Filling)
```python
def water_filling_allocation(eigenvalues, total_bits_per_dim=4.0):
    """
    Optimal bit allocation: R_i = max(0, 0.5 * log2(sigma_i^2 / theta)),
    where theta is the water level chosen so that sum(R_i) = R_total.
    """
    d = len(eigenvalues)
    total_budget = total_bits_per_dim * d
    # Binary search for the water level theta
    lo, hi = 1e-10, eigenvalues.max().item()
    for _ in range(100):  # bisection
        theta = (lo + hi) / 2
        bits = 0.5 * torch.log2(eigenvalues / theta).clamp(min=0)
        total_used = bits.sum().item()
        if total_used > total_budget:
            lo = theta
        else:
            hi = theta
    # Snap to implementable bit widths: {0, 1, 2, 3, 4, 6, 8}
    bits = snap_to_codebook(bits)
    return bits
```

Bit width mapping: signal dimensions get 4-8 bits, noise dimensions get 0-2 bits. The codebook is likely a simple uniform quantizer with learned scale/zero-point per channel.
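`snap_to_codebook` is a hypothetical helper; its name and behavior are assumptions, not confirmed by the paper. One plausible sketch is nearest-neighbor rounding of each fractional allocation into the implementable set:

```python
import numpy as np

# Hypothetical codebook of implementable bit widths (an assumption)
CODEBOOK = np.array([0, 1, 2, 3, 4, 6, 8])

def snap_to_codebook(bits):
    """Map each fractional bit allocation to the nearest codebook width."""
    bits = np.asarray(bits, dtype=np.float64)
    # Distance from every allocation to every codebook entry
    dist = np.abs(bits[:, None] - CODEBOOK[None, :])
    return CODEBOOK[dist.argmin(axis=1)]

print(snap_to_codebook([7.3, 4.9, 2.4, 0.2]))  # → [8 4 2 0]
```

Note that plain nearest-neighbor snapping does not preserve the total bit budget exactly; a real implementation would likely re-balance after snapping.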
3.6 Memory Layout
Per token, per layer, per head:
```
+---------------------+------------------------+----------+
| Signal dims (d_eff) | Noise dims (remaining) | Metadata |
| mixed-precision     | low-bit quantized      | scales   |
+---------------------+------------------------+----------+
```
Memory savings (for head_dim=128, d_eff=4):
- Uncompressed: 128 dims * 16 bits = 256 bytes
- Signal: 4 dims * 8 bits = 4 bytes + 2 bytes scales
- Noise: 124 dims * 2 bits = 31 bytes + 4 bytes scales
- Total: ~41 bytes ⇒ ~6.2x compression (close to reported 5.95x with metadata overhead)
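The byte accounting above can be checked directly. A quick arithmetic sketch, using the same head_dim, d_eff, bit widths, and assumed metadata overheads as the example:

```python
head_dim, d_eff = 128, 4

uncompressed = head_dim * 16 / 8                 # 256 bytes per token (FP16)
signal = d_eff * 8 / 8 + 2                       # 4 bytes payload + 2 bytes scales
noise = (head_dim - d_eff) * 2 / 8 + 4           # 31 bytes payload + 4 bytes scales
total = signal + noise

print(total, uncompressed / total)  # → 41.0 bytes, ~6.24x compression
```

The remaining gap down to the reported 5.95x is consistent with additional per-block metadata not itemized here.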
4. Code Quality Assessment
Research Code vs Production
Based on the paper’s recency, this is almost certainly research code, not production-ready:
- Single-model evaluation scripts with hardcoded paths
- No batched inference support — processes one sequence at a time
- Pure PyTorch (no custom CUDA kernels)
- HuggingFace Transformers integration via monkey-patching
- Limited error handling
Integration Gap to Production (vLLM / SGLang / FlashInfer)
Estimated 2-4 weeks engineering work:
- PagedAttention compatibility: vLLM’s paged blocks need mixed-precision packing within pages
- CUDA kernel for fused rotate-quantize: Must avoid materializing the full rotated tensor
- Attention kernel modification: FlashAttention expects uniform precision KV cache — mixed-precision requires a custom kernel
- Calibration pipeline: One-time step on model load (~15 seconds per the paper)
FlashInfer integration path: FlashInfer already supports INT4 KV cache. Extending to mixed-precision per-dimension would require modifying append_paged_kv_cache and batch_decode kernels.
5. Reproduction Guide
Hardware Requirements
- Calibration: 1x GPU with enough VRAM for the model + ~10GB activation cache
- Qwen 2.5-14B: ~28GB model + ~10GB activations = 38GB (1x A100-80GB)
- Calibration time: ~15 seconds (per paper claim)
- Evaluation: Same as calibration
Expected Commands
```bash
git clone https://github.com/Dynamis-Labs/spectralquant
cd spectralquant && pip install -e ".[dev]"

# Calibrate (15 seconds)
python scripts/calibrate.py --model Qwen/Qwen2.5-14B-Instruct

# Evaluate
python experiments/eval_memory_efficiency.py --config SQ_noQJL_v3
# Expected: cos_sim = 0.9485, ratio = 5.95x
```

Expected Headline Numbers (Qwen 2.5-14B)
| Method | Compression | Cosine Sim | Perplexity |
|---|---|---|---|
| FP16 baseline | 1.0x | 1.000 | 9.51 |
| TurboQuant | 5.02x | ~0.84 | 9.51 |
| SpectralQuant | 5.95x | ~0.87 | 9.51 |
What Could Go Wrong
- RoPE handling: Wrong treatment of RoPE-transformed keys during calibration invalidates the eigenbasis
- Calibration data mismatch: Different split/dataset shifts results by 0.1-0.3 perplexity
- Head dimension: Must operate on KV heads (not Q heads for GQA models)
- GQA: LLaMA-3 uses 8 KV heads vs 32 Q heads — calibration must target KV heads
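The GQA head bookkeeping can be sketched as follows, using LLaMA-3-8B's 32 Q / 8 KV head counts noted above (the data is synthetic; only the head counts come from the text):

```python
import numpy as np

num_q_heads, num_kv_heads, head_dim = 32, 8, 128
group_size = num_q_heads // num_kv_heads  # 4 query heads share one KV head

# Calibrate (and compress) per KV head: only 8 eigenbases needed, not 32
kv_keys = np.random.randn(num_kv_heads, 16, head_dim)  # [kv_heads, seq, dim]

# At attention time, each KV head is repeated across its query group
expanded = np.repeat(kv_keys, group_size, axis=0)

print(expanded.shape)  # → (32, 16, 128)
```

Calibrating on Q heads instead would produce 32 eigenbases for 8 physically cached key tensors, which is both wasteful and wrong.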
6. Comparison with Related Work
| Feature | QJL/TurboQuant | KIVI | GEAR | SpectralQuant |
|---|---|---|---|---|
| Rotation | Random Gaussian | None | None | Learned eigenvectors |
| Bit allocation | Uniform | Uniform + outlier | Uniform + residual | Water-filling |
| Calibration | None (data-oblivious) | None | None | 15 seconds (data-aware) |
| Error correction | QJL on all dims | None | Low-rank residual | QJL on signal only |
| Compression | ~5x | ~8x | ~6x | ~6x |
| Quality at 6x | Good | Degraded | Good | Best (by cosine sim) |
SpectralQuant’s tradeoff: requires offline calibration (a new step in the deployment pipeline) but delivers the best quality at a given compression ratio.
7. Open Questions for When Repo Becomes Available
- Does the implementation handle GQA models (LLaMA-3, Mistral)?
- Is there a Triton kernel for fused rotation + quantization?
- How are value vectors handled — same spectral approach or separate strategy?
- Is the calibration deterministic across hardware?
- What’s the per-token latency overhead of rotation + mixed-precision dequant?
- Does the code include a vLLM/SGLang plugin or is it standalone?
- How is the packed memory layout implemented — custom dtype or manual bit packing?
This analysis should be updated when the SpectralQuant repository becomes publicly accessible. The reconstructed implementation details are based on the paper’s method section and standard practices in KV cache quantization research.