SpectralQuant Implementation Analysis

Companion to SpectralQuant KV Cache. See also: Inference Stack Synthesis.

Note: The GitHub repository at https://github.com/Dynamis-Labs/spectralquant was not accessible at the time of this analysis (April 2026). This document reconstructs what the implementation should look like based on the SpectralQuant paper and standard practices in KV cache compression research code. When the repo becomes available, this analysis should be updated with actual code references.

1. Paper Recap (Implementation-Relevant Points)

SpectralQuant’s core claim: replace the random rotation matrix used in QuIP#/QJL-style KV cache compression with a data-driven spectral rotation derived from the eigendecomposition of the key covariance matrix. This separates “signal” dimensions (high-variance eigenvectors) from “noise” dimensions (low-variance), enabling:

  • Selective QJL: Apply the QJL correction only to signal dimensions; noise dimensions are quantized directly, with no QJL
  • Non-uniform bit allocation: Water-filling algorithm assigns more bits to high-variance dimensions
  • Headline result: 5.95x compression ratio with improved cosine similarity vs uniform quantization

2. Expected Implementation Architecture

spectralquant/
  calibrate.py          # Offline calibration: collect activations, eigendecompose
  spectral_rotation.py  # Build rotation matrix from eigenvectors
  qjl_selective.py      # QJL with signal/noise split
  bit_allocation.py     # Water-filling non-uniform quantization
  compress.py           # Full compression pipeline
  decompress.py         # Decompression / dequantization
  eval/
    cosine_sim.py       # Cosine similarity measurement
    perplexity.py       # Perplexity evaluation on WikiText-2 / C4
  scripts/
    calibrate.py        # CLI entry point for calibration
    evaluate.py         # CLI entry point for benchmarks
  configs/
    qwen2.5_14b.yaml    # Per-model calibration configs
    llama3_8b.yaml

3. Implementation Details (Reconstructed)

3.1 Calibration Data Collection

Expected implementation:

# scripts/calibrate.py (reconstructed)
import torch

def collect_key_activations(model, dataloader, num_samples=128):
    """Run forward passes; collect K projections per layer via forward hooks."""
    key_accum = {}  # {layer_idx: list of key tensors; split per head later}

    def make_hook(layer_idx):
        def hook(module, inputs, output):
            key_accum.setdefault(layer_idx, []).append(output.detach().cpu())
        return hook

    # Register a hook on each attention layer's key projection
    # (assumes a HF LLaMA/Qwen-style module tree)
    handles = [
        layer.self_attn.k_proj.register_forward_hook(make_hook(i))
        for i, layer in enumerate(model.model.layers)
    ]
    try:
        with torch.no_grad():
            for batch_idx, batch in enumerate(dataloader):
                if batch_idx >= num_samples:
                    break
                model(batch, output_attentions=False)
    finally:
        for h in handles:
            h.remove()
    return key_accum

Key parameters:

  • Dataset: Likely WikiText-2 or C4 validation split (128-512 sequences)
  • Sequence length: 2048 tokens (matching model context)
  • Batch size: 1 (calibration is not throughput-sensitive)
  • Sampling: First N sequences, no shuffling (deterministic)

Collection mechanism: Register forward hooks on each attention layer’s key projection (k_proj) and capture the output of the linear projection, reshaped to [batch, seq_len, num_heads, head_dim]. Note that k_proj output is pre-RoPE; since the KV cache stores post-RoPE keys, RoPE must be applied to the captured tensors before covariance estimation (see the RoPE note in 3.2).

3.2 Eigendecomposition

def compute_spectral_basis(key_activations):
    """
    Compute eigenbasis for key covariance.
    Called once per (layer, head) with that head's keys.
 
    Args:
        key_activations: [N, head_dim] key vectors for a single head
 
    Returns:
        eigenvectors: [head_dim, head_dim] orthogonal rotation matrix
        eigenvalues: [head_dim] variance along each component, sorted descending
    """
    # Center the data
    mean = key_activations.mean(dim=0)
    centered = key_activations - mean
 
    # Covariance matrix: [head_dim, head_dim]
    # For head_dim=128, this is 128x128 -- trivially small
    cov = (centered.T @ centered) / (centered.shape[0] - 1)
 
    # Eigendecomposition (symmetric positive semi-definite)
    eigenvalues, eigenvectors = torch.linalg.eigh(cov)
 
    # Sort descending by eigenvalue
    idx = eigenvalues.argsort(descending=True)
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]
 
    return eigenvectors, eigenvalues

Critical design decisions:

  1. Per-head vs per-layer: Should be per-head because different heads attend to different subspaces. The covariance structure varies significantly across heads (some are “positional,” some are “semantic”).

  2. torch.linalg.eigh vs numpy: Use torch.linalg.eigh (symmetric eigendecomposition) because the covariance matrix is guaranteed symmetric PSD. For head_dim=128, the 128x128 eigendecomposition takes <1ms — the bottleneck is collecting activations.

  3. Storage: The eigenvector matrix is [num_layers, num_heads, head_dim, head_dim]. For Qwen 2.5-14B: ~32 MB in FP16. One-time cost, stored alongside the model.

  4. RoPE interaction: Keys have RoPE applied. Calibration must eigendecompose post-RoPE keys because that’s the representation stored in the KV cache. However, RoPE makes the covariance position-dependent. The paper likely averages across positions.
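Under that reading, calibration would apply RoPE to captured keys and pool covariance across positions. The sketch below is an assumption on both counts: `apply_rope` is a simplified interleaved-pair rotary embedding, and pooling all positions into one covariance estimate is the paper's likely (but unconfirmed) averaging strategy.

```python
import torch

def apply_rope(x, positions, base=10000.0):
    """Simplified rotary embedding: rotate interleaved pairs (x[2i], x[2i+1])."""
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angles = positions.float()[:, None] * inv_freq[None, :]  # [seq_len, d/2]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def position_averaged_covariance(keys, positions):
    """Covariance of post-RoPE keys, pooling every position into one estimate."""
    k_rope = apply_rope(keys, positions)        # [seq_len, head_dim]
    centered = k_rope - k_rope.mean(dim=0)
    return centered.T @ centered / (centered.shape[0] - 1)
```

Because RoPE is a norm-preserving per-token rotation, pooling positions keeps the total variance unchanged; only the directional structure of the covariance is affected.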

3.3 Spectral Rotation at Inference

class SpectralKVCache:
    def __init__(self, eigenvectors, eigenvalues, d_eff, bit_config):
        self.R = eigenvectors           # rotation matrices
        self.lambdas = eigenvalues      # for bit allocation
        self.d_eff = d_eff              # signal/noise boundary
        self.bit_config = bit_config    # per-dim bit widths
 
    def compress_keys(self, keys, layer_idx, head_idx):
        """
        keys: [batch, seq_len, head_dim]
        Returns: compressed representation
        """
        R = self.R[layer_idx, head_idx]  # [head_dim, head_dim]
 
        # Rotate into eigenbasis
        rotated = keys @ R  # [batch, seq_len, head_dim]
 
        # Split signal and noise
        d = self.d_eff[layer_idx, head_idx]
        signal = rotated[..., :d]     # high-variance dims
        noise = rotated[..., d:]      # low-variance dims
 
        # Quantize signal with more bits (e.g., 4-8 bit)
        signal_q = self.quantize(signal, self.bit_config[..., :d])
 
        # Quantize noise aggressively, with no QJL correction (see 3.4)
        noise_q = self.low_bit_quantize(noise, bits=2)
 
        return (signal_q, noise_q, d)

The rotation replaces random projection: In QJL, you multiply keys by a random Gaussian matrix. SpectralQuant replaces this with the eigenvector matrix, which concentrates variance into the first few dimensions.
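This concentration effect is easy to check on synthetic data. The toy comparison below (not from the paper) measures how much energy the first four dimensions capture under a random orthogonal rotation versus the eigenbasis:

```python
import torch

torch.manual_seed(0)
d = 32
# Synthetic keys with four dominant directions (anisotropic covariance)
scales = torch.tensor([10.0, 5.0, 3.0, 2.0] + [0.1] * (d - 4))
X = torch.randn(5000, d) * scales

# Data-driven spectral rotation: eigenbasis of the sample covariance
cov = X.T @ X / (X.shape[0] - 1)
evals, evecs = torch.linalg.eigh(cov)
evecs = evecs[:, evals.argsort(descending=True)]

# Data-oblivious rotation: random orthogonal matrix (QR of a Gaussian)
Q, _ = torch.linalg.qr(torch.randn(d, d))

total = X.pow(2).sum()
frac_spectral = (X @ evecs)[:, :4].pow(2).sum() / total  # nearly all energy
frac_random = (X @ Q)[:, :4].pow(2).sum() / total        # roughly 4/d on average
```

The eigenbasis packs nearly all of the variance into the first four coordinates, while a random rotation spreads it evenly across all d dimensions.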

3.4 QJL Selective Application

d_eff cutoff determination:

def compute_d_eff(eigenvalues, threshold=0.99):
    """
    Find effective dimensionality via participation ratio
    or cumulative variance threshold.
    """
    # Participation ratio method (as described in paper)
    total = eigenvalues.sum()
    d_eff_pr = (total ** 2) / (eigenvalues ** 2).sum()
 
    # Or: cumulative variance threshold
    cumvar = eigenvalues.cumsum(dim=0) / total
    d_eff_cv = (cumvar >= threshold).nonzero()[0][0].item() + 1
 
    return int(d_eff_pr)  # paper uses participation ratio

The paper reports d_eff ≈ 4 via participation ratio for keys. This is much lower than a 99% variance threshold would give. The participation ratio weights by squared eigenvalues, making it sensitive to the spectral gap — which is what makes it the right metric here.
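A toy spectrum (synthetic, chosen to resemble the shape described, not taken from the paper) shows how far apart the two criteria land:

```python
import torch

# Four dominant eigenvalues plus a long low-variance tail
lam = torch.tensor([1.0] * 4 + [0.001] * 124)

# Participation ratio: (sum lambda)^2 / sum lambda^2
pr = lam.sum() ** 2 / (lam ** 2).sum()              # ~4.25 -> d_eff = 4

# 99% cumulative-variance threshold, for comparison
cumvar = lam.cumsum(0) / lam.sum()
d_cv = int((cumvar >= 0.99).nonzero()[0, 0]) + 1    # 87 dims for this spectrum
```

The tail contributes almost nothing to the squared-eigenvalue sum, so the participation ratio stays near 4, while the cumulative-variance rule keeps absorbing tail dimensions.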

Key insight: SpectralQuant’s “selective” QJL means NO QJL on noise dimensions, not reduced QJL. The paper shows removing QJL entirely from noise dims improves quality by +3.0pp cosine similarity, because QJL correction on near-zero signals injects noise.

3.5 Non-Uniform Bit Allocation (Water-Filling)

def water_filling_allocation(eigenvalues, total_bits_per_dim=4.0):
    """
    Optimal bit allocation: R_i = max(0, 0.5 * log2(sigma_i^2 / theta))
    where theta is the water level chosen so sum(R_i) = R_total.
    """
    d = len(eigenvalues)
    total_budget = total_bits_per_dim * d
 
    # Binary search for the water level theta
    lo, hi = 1e-10, eigenvalues.max().item()
 
    for _ in range(100):  # bisection
        theta = (lo + hi) / 2
        bits = 0.5 * torch.log2(eigenvalues / theta).clamp(min=0)
        total_used = bits.sum().item()
 
        if total_used > total_budget:
            lo = theta
        else:
            hi = theta
 
    # Snap to implementable bit widths: {0, 1, 2, 3, 4, 6, 8}
    # (snap_to_codebook is a helper, not shown: round each entry to the
    # nearest width in the codebook)
    bits = snap_to_codebook(bits)
 
    return bits

Bit width mapping: Signal dimensions get 4-8 bits, noise dimensions get 0-2 bits. The codebook is likely a simple uniform quantizer with learned scale/zero-point per channel.
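A minimal per-channel symmetric uniform quantizer consistent with that description is sketched below; zero-points are omitted for brevity (an assumption, since the actual codebook may be asymmetric):

```python
import torch

def quantize_per_channel(x, bits):
    """Symmetric uniform quantization, one scale per channel (last dim of x)."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().amax(dim=0).clamp(min=1e-8) / qmax  # [num_channels]
    q = (x / scale).round().clamp(-qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize_per_channel(q, scale):
    """Invert the quantizer; reconstruction error is at most scale / 2."""
    return q.float() * scale
```

Signal dimensions would call this with bits=8 (or a per-dimension width from the water-filling step), noise dimensions with bits=2.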

3.6 Memory Layout

Per token, per layer, per head:
+---------------------+------------------------+----------+
| Signal dims (d_eff) | Noise dims (remaining) | Metadata |
| mixed-precision     | low-bit quantized      | scales   |
+---------------------+------------------------+----------+

Memory savings (for head_dim=128, d_eff=4):

  • Uncompressed: 128 dims * 16 bits = 2,048 bits = 256 bytes
  • Signal: 4 dims * 8 bits = 4 bytes, + 2 bytes for scales
  • Noise: 124 dims * 2 bits = 31 bytes, + 4 bytes for scales
  • Total: ~41 bytes → ~6.2x compression (close to the reported 5.95x once metadata overhead is counted)
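The same arithmetic, spelled out (the 2-byte and 4-byte scale overheads are assumptions carried over from the bullet list):

```python
head_dim, d_eff = 128, 4

uncompressed = head_dim * 16 // 8        # 256 bytes per token at FP16
signal = d_eff * 8 // 8 + 2              # 4 bytes payload + 2 bytes scale
noise = (head_dim - d_eff) * 2 // 8 + 4  # 31 bytes payload + 4 bytes scales
total = signal + noise                   # 41 bytes
ratio = uncompressed / total             # ~6.24x
```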

4. Code Quality Assessment

Research Code vs Production

Based on the paper’s recency, this is almost certainly research code, not production-ready:

  • Single-model evaluation scripts with hardcoded paths
  • No batched inference support — processes one sequence at a time
  • Pure PyTorch (no custom CUDA kernels)
  • HuggingFace Transformers integration via monkey-patching
  • Limited error handling

Integration Gap to Production (vLLM / SGLang / FlashInfer)

Estimated 2-4 weeks engineering work:

  1. PagedAttention compatibility: vLLM’s paged blocks need mixed-precision packing within pages
  2. CUDA kernel for fused rotate-quantize: Must avoid materializing the full rotated tensor
  3. Attention kernel modification: FlashAttention expects uniform precision KV cache — mixed-precision requires a custom kernel
  4. Calibration pipeline: One-time step on model load (~15 seconds per the paper)

FlashInfer integration path: FlashInfer already supports INT4 KV cache. Extending to mixed-precision per-dimension would require modifying append_paged_kv_cache and batch_decode kernels.

5. Reproduction Guide

Hardware Requirements

  • Calibration: 1x GPU with enough VRAM for the model + ~10GB activation cache
    • Qwen 2.5-14B: ~28GB model + ~10GB activations = 38GB (1x A100-80GB)
    • Calibration time: ~15 seconds (per paper claim)
  • Evaluation: Same as calibration

Expected Commands

git clone https://github.com/Dynamis-Labs/spectralquant
cd spectralquant && pip install -e ".[dev]"
 
# Calibrate (15 seconds)
python scripts/calibrate.py --model Qwen/Qwen2.5-14B-Instruct
 
# Evaluate
python scripts/evaluate.py --config SQ_noQJL_v3
# Expected: cos_sim = 0.9485, ratio = 5.95x

Expected Headline Numbers (Qwen 2.5-14B)

| Method        | Compression | Cosine Sim | Perplexity |
|---------------|-------------|------------|------------|
| FP16 baseline | 1.0x        | 1.000      | 9.51       |
| TurboQuant    | 5.02x       | ~0.84      | 9.51       |
| SpectralQuant | 5.95x       | ~0.87      | 9.51       |

What Could Go Wrong

  1. RoPE handling: Wrong treatment of RoPE-transformed keys during calibration invalidates the eigenbasis
  2. Calibration data mismatch: Different split/dataset shifts results by 0.1-0.3 perplexity
  3. Head dimension: Must operate on KV heads (not Q heads for GQA models)
  4. GQA: LLaMA-3 uses 8 KV heads vs 32 Q heads — calibration must target KV heads

6. Comparison to Alternative Methods

| Feature          | QJL/TurboQuant        | KIVI              | GEAR               | SpectralQuant           |
|------------------|-----------------------|-------------------|--------------------|-------------------------|
| Rotation         | Random Gaussian       | None              | None               | Learned eigenvectors    |
| Bit allocation   | Uniform               | Uniform + outlier | Uniform + residual | Water-filling           |
| Calibration      | None (data-oblivious) | None              | None               | 15 seconds (data-aware) |
| Error correction | QJL on all dims       | None              | Low-rank residual  | QJL on signal only      |
| Compression      | ~5x                   | ~8x               | ~6x                | ~6x                     |
| Quality at 6x    | Good                  | Degraded          | Good               | Best (by cosine sim)    |

SpectralQuant’s tradeoff: requires offline calibration (a new step in the deployment pipeline) but delivers the best quality at a given compression ratio.

7. Open Questions for When Repo Becomes Available

  1. Does the implementation handle GQA models (LLaMA-3, Mistral)?
  2. Is there a Triton kernel for fused rotation + quantization?
  3. How are value vectors handled — same spectral approach or separate strategy?
  4. Is the calibration deterministic across hardware?
  5. What’s the per-token latency overhead of rotation + mixed-precision dequant?
  6. Does the code include a vLLM/SGLang plugin or is it standalone?
  7. How is the packed memory layout implemented — custom dtype or manual bit packing?

This analysis should be updated when the SpectralQuant repository becomes publicly accessible. The reconstructed implementation details are based on the paper’s method section and standard practices in KV cache quantization research.