MINT: Budget-Aware Data-Free Mixed-Precision Quantization for Large Language Models via Rate-Distortion Optimization

baa.ai
Abstract. We present MINT (Memory-Informed N-bit Tuning), a data-free mixed-precision quantization framework that formulates per-tensor bit-width and group-size allocation as a budget-constrained optimization problem. Given a user-specified memory budget, MINT jointly selects the (bit-width, group-size) configuration for each weight tensor by solving a Multiple-Choice Knapsack Problem (MCKP) over per-tensor rate-distortion curves. The framework introduces three key innovations: (1) budget-targeted quantization—users specify an exact memory target (e.g., “fit in 4 GB for iPhone” or “fit in 24 GB for RTX 4090”) and MINT produces a near-optimal allocation for that budget, with a fitted prediction curve that estimates output quality before running the pipeline; (2) joint bit-width and group-size optimization that treats group size as a first-class allocation variable, revealing that group-size selection provides larger quality gains than bit-width changes; and (3) an SQNR safety veto with an empirically validated 9 dB threshold that exploits the natural gap between catastrophic 2-bit quantization (SQNR < 9 dB, PPL triples) and usable 3-bit quantization (SQNR > 10 dB). We evaluate MINT on six model families spanning 8B–109B parameters across dense and Mixture-of-Experts architectures. On Qwen3-30B-A3B, MINT achieves +0.6% perplexity degradation versus BF16 at 19 GB (33% of full size) and matches prior heuristic quality at 2.6% less storage. On Llama-4-Scout (109B MoE), MINT enables hardware-targeted deployment of the same model across memory tiers: PPL 7.703 at 58 GB for a 64 GB device versus PPL 7.359 at 163 GB for a 192 GB device, both outperforming uniform 4-bit (PPL 7.899). In matched-size comparisons against GPTQ—a calibration-based method—across three MoE families (Qwen3-30B-A3B, Qwen2-57B-A14B, Mixtral-8x7B), MINT consistently outperforms GPTQ despite being entirely data-free: −1.7%, −0.95%, and −4.6% perplexity respectively at identical model sizes. 
We also demonstrate that standard mean perplexity can be misleading due to outlier sequences, and recommend reporting median perplexity and tail statistics alongside standard corpus perplexity. The entire pipeline requires no calibration data, no gradient computation, and completes in under 50 minutes on commodity hardware.

1. Introduction

Post-training quantization (PTQ) has become the primary means of deploying large language models on consumer hardware. Methods such as GPTQ [1], AWQ [2], and SqueezeLLM [3] achieve remarkable compression, but they share a common requirement: a representative calibration dataset. This introduces practical concerns—calibration data may be unavailable for proprietary models, the chosen distribution may not generalize to deployment domains, and calibration demands substantial compute.

Existing data-free approaches [6, 7, 19] typically apply uniform bit-widths or rely on single sensitivity metrics with hand-tuned thresholds. These approaches face two fundamental limitations. First, threshold-based allocation produces fixed bit-width decisions regardless of the deployment memory budget—the user cannot specify “quantize this model to fit in 6 GB” and receive a budget-optimal allocation. Second, single-point error proxies create circularity: using 4-bit reconstruction error to decide 4-bit allocation means the method partly predicts its own label.

We address both limitations with MINT (Memory-Informed N-bit Tuning), which reformulates mixed-precision quantization as a constrained optimization problem:

min_{(b_i, g_i)}  Σ_{i=1..T} π_i · NRMSE_i(b_i, g_i)   s.t.   Σ_{i=1..T} size_i(b_i, g_i) ≤ B    (1)

where bi and gi are the bit-width and group size for tensor i, B is the user’s memory budget, and πi is a soft protection prior (§3.3). The loss function is the normalized root mean squared error (NRMSE) from the rate-distortion curve at the chosen configuration. The key insight is that both bit-width and group size are allocation variables—prior work optimizes bit-width alone, but our evidence shows group-size selection often provides larger quality improvements than bit-width changes.

Our contributions are:

  1. Budget-targeted quantization: the user specifies an exact memory budget, and MINT solves a Multiple-Choice Knapsack Problem over per-tensor rate-distortion curves to produce a near-optimal (bit-width, group-size) allocation for that budget, with a fitted prediction curve that estimates output quality before the pipeline runs.
  2. Joint bit-width and group-size optimization: group size is treated as a first-class allocation variable; empirically, group-size selection often yields larger quality gains than bit-width changes.
  3. An SQNR safety veto with an empirically validated 9 dB threshold that blocks catastrophic 2-bit allocations while permitting usable 3-bit configurations.
  4. A matched-size evaluation across six model families (8B–109B, dense and MoE) showing that MINT consistently outperforms calibration-based GPTQ while remaining entirely data-free, together with a recommendation to report median perplexity and tail statistics alongside standard corpus perplexity.

2. Related Work

Calibration-based PTQ.

GPTQ [1] applies layerwise optimal brain quantization using Hessian information from calibration data. AWQ [2] identifies salient weight channels via activation-aware scaling. SqueezeLLM [3] combines sensitivity-based non-uniform quantization with dense-and-sparse decomposition. SpQR [4] isolates outlier weights for full-precision storage. QuIP [5] applies random orthogonal transformations before quantization. SmoothQuant [18] migrates quantization difficulty from activations to weights. All require calibration data.

Data-free quantization.

EasyQuant [6] proposes data-free PTQ via weight distribution analysis. MXQ [7] uses Frobenius norm of quantization error as a single sensitivity metric. HQQ [19] uses half-quadratic splitting for fast uniform quantization. HIGGS [8] applies Hadamard rotation via the linearity theorem. These methods typically apply uniform bit-widths or coarse-grained mixed precision. MINT differs by formulating allocation as budget-constrained optimization over joint (bit-width, group-size) configurations using per-tensor rate-distortion curves.

Sensitivity-based mixed-precision allocation.

LLM-MQ [10] formulates mixed-precision as an integer linear program but requires calibration for sensitivity estimation. SliM-LLM [20] achieves group-wise mixed precision but requires calibration data. CherryQ [21] identifies high-impact parameters but also requires calibration. MINT is, to our knowledge, the first method to combine data-free sensitivity estimation with constrained optimization over both bit-width and group-size at per-tensor granularity.

MoE quantization.

MC-MoE [12] combines expert quantization with dynamic pruning. MoEQuant [13] proposes expert-balanced sampling for calibration. These highlight unique MoE challenges—uneven expert activation, sparse routing—but require calibration data. MINT’s data-free approach is particularly advantageous for MoE models where calibration must cover activation patterns across hundreds of experts.

3. Method

MINT operates in two phases: (1) compute a per-tensor rate-distortion curve measuring reconstruction error at each candidate (bit-width, group-size) configuration, and (2) solve a budget-constrained allocation over these curves. The entire pipeline is data-free—it requires only the pretrained weight tensors.

3.1 Rate-Distortion Curves

Prior data-free methods compute reconstruction error at a single operating point (typically 4-bit), creating circularity when this error is used to decide 4-bit allocation.

MINT computes a complete rate-distortion (RD) curve for each weight tensor Wi: the normalized RMSE at every valid (bit-width, group-size) configuration:

NRMSE_i(b, g) = RMS(Q(W_i; b, g) − W_i) / RMS(W_i)    (2)

evaluated at configurations C = {(2,32), (3,64), (4,32), (4,64), (4,128), (8,64), (8,128)}, plus full-precision (16-bit, no grouping) as the zero-distortion anchor. Here Q(·; b, g) denotes group-wise round-to-nearest (RTN) quantization at b bits with group size g. The multi-point curve serves directly as the loss function for the allocator: the NRMSE at the chosen configuration (bi, gi) is the per-tensor loss in Eq. 1. By evaluating all candidate configurations, MINT avoids the circularity of single-point methods.
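As a concrete illustration, the RD curve of Eq. 2 can be computed with plain group-wise round-to-nearest quantization. The sketch below assumes asymmetric min–max RTN; the exact scale/zero-point convention in MINT's implementation may differ.

```python
import numpy as np

def rtn_quantize(w: np.ndarray, bits: int, group_size: int) -> np.ndarray:
    """Group-wise round-to-nearest quantize-dequantize (asymmetric min-max)."""
    flat = w.reshape(-1)
    pad = (-len(flat)) % group_size
    groups = np.pad(flat, (0, pad)).reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-12) / (2**bits - 1)  # one scale per group
    q = np.round((groups - lo) / scale)                 # codes in [0, 2^b - 1]
    deq = q * scale + lo                                # dequantized reconstruction
    return deq.reshape(-1)[: len(flat)].reshape(w.shape)

def nrmse(w: np.ndarray, bits: int, group_size: int) -> float:
    """Eq. 2: RMS reconstruction error normalized by the tensor's own RMS."""
    err = rtn_quantize(w, bits, group_size) - w
    return float(np.sqrt(np.mean(err**2)) / np.sqrt(np.mean(w**2)))

# Rate-distortion curve over the candidate configurations C
CONFIGS = [(2, 32), (3, 64), (4, 32), (4, 64), (4, 128), (8, 64), (8, 128)]
w = np.random.default_rng(0).normal(size=(256, 256))
curve = {cfg: nrmse(w, *cfg) for cfg in CONFIGS}
```

On a Gaussian toy tensor the curve behaves as expected: NRMSE falls monotonically with more bits, and at a fixed bit-width smaller groups give lower error, which is exactly the tradeoff the allocator exploits.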

3.2 SQNR Safety Veto

We compute the signal-to-quantization-noise ratio (SQNR) at each configuration:

SQNR_i(b, g) = 10 · log10( ||W_i||_F^2 / ||W_i − Q(W_i; b, g)||_F^2 ) dB    (3)

Any configuration with SQNR < τfloor (default τfloor = 9 dB) is excluded from the allocator’s candidate set Ci. This provides an absolute safety floor: regardless of budget pressure or relative ranking, tensors will never be assigned configurations that introduce catastrophic distortion. The 9 dB threshold is empirically determined (§4.3): it sits in the natural gap between 2-bit SQNR (max ~8.7 dB) and 3-bit SQNR (min ~10.4 dB), blocking catastrophic 2-bit allocations while permitting usable 3-bit configurations.
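The veto itself is a one-line filter. The sketch below (hypothetical helper names) also notes a useful identity: because the Frobenius ratio in Eq. 3 is the squared NRMSE of Eq. 2, SQNR(dB) = −20·log10(NRMSE), so the 9 dB floor corresponds to NRMSE ≈ 0.355.

```python
import numpy as np

def sqnr_db(w: np.ndarray, w_hat: np.ndarray) -> float:
    """Eq. 3: 10*log10(||W||_F^2 / ||W - Q(W)||_F^2) in dB."""
    return float(10.0 * np.log10(np.sum(w * w) / np.sum((w - w_hat) ** 2)))

def apply_veto(candidates, sqnr_of, floor_db=9.0):
    """Keep only configurations at or above the SQNR safety floor."""
    return [cfg for cfg in candidates if sqnr_of[cfg] >= floor_db]

# Identity check: SQNR in dB equals -20*log10(NRMSE).
rng = np.random.default_rng(0)
w = rng.normal(size=(128, 128))
w_hat = w + 0.1 * rng.normal(size=w.shape)   # synthetic reconstruction error
nrmse = float(np.sqrt(np.mean((w - w_hat) ** 2) / np.mean(w * w)))
assert abs(sqnr_db(w, w_hat) + 20.0 * np.log10(nrmse)) < 1e-9

# With the measured band edges from Sec. 4.3, a 9 dB floor vetoes 2-bit
# but keeps 3-bit and 4-bit:
survivors = apply_veto([(2, 32), (3, 64), (4, 64)],
                       {(2, 32): 8.7, (3, 64): 10.4, (4, 64): 17.0})
```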

3.3 Soft Protection Priors

Rather than binary hard-coded protection rules (“embeddings stay 16-bit”), MINT uses soft multiplicative priors πi that enter the optimization objective (Eq. 1). A prior of πi = 10 means the allocator needs 10× the quality improvement per byte to justify compressing tensor i versus a default tensor.

Table 1: Soft protection priors. Higher values make the allocator less likely to compress the tensor.

| Tensor Type | πi | Rationale |
|---|---|---|
| Embedding | 10.0 | Lookup tables; quantization corrupts rare tokens |
| LM head | 10.0 | Final projection; directly affects output distribution |
| LayerNorm / RMSNorm | ∞ | Tiny (<0.01% of params); critical for training stability |
| MoE router | 8.0 | Controls expert routing; errors cascade |
| Vision encoder | 8.0 | Cross-modal alignment is sensitive |
| First layer | 3.0 | No error correction from prior layers |
| Last layer | 2.0 | Directly precedes output |
| Default | 1.0 | No bias |

Tensors with πi = ∞ are hard-protected at 16-bit and excluded from the allocator. All other priors participate in the optimization—the allocator can compress high-prior tensors if the budget demands it and the quality tradeoff is favorable.

3.4 Budget-Constrained Allocation

The allocation problem is a Multiple-Choice Knapsack Problem (MCKP). Each tensor i has a candidate set Ci (configurations surviving the SQNR veto). For each candidate (b, g) ∈ Ci, the cost is sizei(b, g) and the loss is πi · NRMSEi(b, g).

For quantized tensors (b < 16), the memory cost is:

size_i(b, g) = ⌈n_i · b / 8⌉ + ⌈n_i / g⌉ · 4 bytes    (4)

where ni is the parameter count and the second term accounts for 16-bit scale and zero-point per group. Full-precision tensors (b = 16) have sizei = 2ni bytes with no grouping overhead.
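Eq. 4 translates directly into a size model. The helper below is a sketch under the stated assumption of 2 bytes each for the per-group scale and zero-point:

```python
from math import ceil

def tensor_bytes(n_params: int, bits: int, group_size: int = 0) -> int:
    """Eq. 4: packed weight bytes plus 4 bytes (16-bit scale + 16-bit
    zero-point) per quantization group; bits == 16 means full precision
    with no grouping overhead."""
    if bits == 16:
        return 2 * n_params
    return ceil(n_params * bits / 8) + ceil(n_params / group_size) * 4

# Overhead of finer grouping for a 4096x4096 tensor at 4-bit:
n = 4096 * 4096
b_g128 = tensor_bytes(n, 4, 128)   # 0.5 + 4/128 = 0.53125 bytes/param
b_g32 = tensor_bytes(n, 4, 32)     # 0.5 + 4/32  = 0.625   bytes/param
```

The difference, 3/32 byte per parameter, is the price the allocator weighs against the accuracy gain of 4× more groups (Section 5).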

The budget B is specified by the user as an absolute size (--budget 6GB), an average bits-per-parameter target (--avg-bits 4.0), or a fraction of BF16 size (--budget-ratio 0.25).

We provide three solvers:

  1. Greedy (default): Initialize all tensors at their lowest-cost valid configuration. Compute upgrade options—for each tensor and each higher configuration, compute the efficiency ratio η = Δloss / Δsize. Sort upgrades by η descending and greedily apply them until the budget is exhausted. Runtime: O(T · |C| · log(T · |C|)), under 10 ms for 18,867 tensors. The greedy solver is a heuristic; it is not guaranteed to find the global optimum, but empirically achieves solutions within 0.1% of the LP relaxation bound on all tested models.
  2. LP relaxation: Continuous relaxation solved via scipy.optimize.linprog with HiGHS, followed by rounding. Provides an upper bound on greedy gap.
  3. ILP: Exact solution via scipy.optimize.milp. Practical for models with fewer than 1,000 tensors.
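The greedy solver can be sketched as follows. This is a simplified version: upgrade deltas are measured from each tensor's cheapest configuration, and each candidate list is assumed sorted by size ascending with π-weighted NRMSE as loss; the production solver may differ in detail.

```python
def greedy_mckp(tensors, budget):
    """tensors: list of candidate lists [(loss, size), ...], each sorted by
    size ascending; loss = pi * NRMSE after the SQNR veto. Returns the chosen
    candidate index per tensor and total bytes spent. Heuristic only --
    not guaranteed optimal, but close to the LP bound in practice."""
    choice = [0] * len(tensors)                      # start at cheapest configs
    spent = sum(cands[0][1] for cands in tensors)
    upgrades = []                                    # (efficiency, tensor, cfg)
    for i, cands in enumerate(tensors):
        base_loss, base_size = cands[0]
        for j in range(1, len(cands)):
            d_loss = base_loss - cands[j][0]         # quality gained
            d_size = cands[j][1] - base_size         # bytes paid
            if d_loss > 0 and d_size > 0:
                upgrades.append((d_loss / d_size, i, j))
    upgrades.sort(reverse=True)                      # best efficiency first
    for _, i, j in upgrades:
        cur_loss, cur_size = tensors[i][choice[i]]
        new_loss, new_size = tensors[i][j]
        if new_loss < cur_loss and spent - cur_size + new_size <= budget:
            spent += new_size - cur_size
            choice[i] = j
    return choice, spent
```

On a toy two-tensor instance the solver upgrades the high-sensitivity tensor twice and leaves the other at its cheapest configuration once the budget is exhausted.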

3.5 Joint Bit-Width and Group-Size Optimization

A critical design decision in MINT is that group size is an allocation variable, not a hyperparameter. Prior mixed-precision methods optimize bit-width while keeping group size fixed (typically 128). However, empirical evidence shows that reducing group size from 128 to 64 often yields larger perplexity improvements than bit-width changes [11].

MINT’s configuration space includes multiple group sizes at each bit-width: {(4,32), (4,64), (4,128), (8,64), (8,128)}. The allocator considers both the quality impact (NRMSE) and the memory cost (Eq. 4) of each option, automatically selecting finer granularity for sensitive tensors and coarser granularity for insensitive ones.

3.6 Pipeline Summary

Algorithm 1: MINT Analysis and Allocation Pipeline

Require: Model directory, memory budget B, SQNR floor τ
Ensure: Per-tensor manifest {(bi, gi)} for all tensors

1:  Phase 1: Rate-distortion analysis
2:  for each shard file in model do
3:    for each 2D weight tensor Wi with ni ≥ 1024 do
4:      Compute RD curve: NRMSEi(b,g) for all (b,g) ∈ C
5:      Compute SQNR: SQNRi(b,g) for all (b,g) ∈ C
6:    end for
7:    Free shard memory
8:  end for
9:  Phase 2: Allocate
10: Compute protection priors πi for each tensor
11: Filter candidate sets: Ci ← {(b,g) ∈ C : SQNRi(b,g) ≥ τ}
12: Subtract protected tensor sizes from budget B
13: Run greedy MCKP solver → allocation {(bi, gi)}
14: Write manifest JSON

The pipeline (Algorithm 1) processes safetensor shards sequentially, keeping peak memory well below the model size. The output manifest maps each tensor to its assigned (bi, gi) configuration and is used as a quantization predicate during model conversion via MLX [16].

3.7 Expert Handling for MoE Models

For MoE models with expert tensors stored as 3D arrays [E, din, dout], we analyze each expert slice separately (E ≤ 32) or cluster experts and sample representatives (E > 32), using worst-case SQNR across experts for safety.

3.7.1 Expert-Grouped Allocation

Inference frameworks (MLX, vLLM) implement MoE expert layers as SwitchLinear modules where all E experts within a single (layer, projection) group share the same quantization parameters. The allocator must respect this constraint: it is not possible to quantize expert 0 at 3-bit and expert 1 at 4-bit within the same module.

MINT groups expert tensors by (layer, projection) before running the MCKP solver. For each group G = {W(1), …, W(E)}, the group-level rate-distortion curve takes the parameter-weighted mean NRMSE across the E experts at each configuration, the worst-case (minimum) SQNR across experts for the safety veto, and the summed size as the group's cost.

This reduces the number of allocation variables from T · E to T (e.g., 5,376 expert tensors → 84 groups on Qwen2-57B-A14B), while ensuring that the solver’s output is directly implementable by the runtime. The parameter-weighted mean NRMSE ensures that the upgrade efficiency of expert groups scales proportionally with both quality impact and size, consistent with the additive structure of the global objective.
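The aggregation can be sketched as follows (the per-expert data layout here is hypothetical; only the statistics match the text: parameter-weighted mean NRMSE, worst-case SQNR):

```python
def aggregate_expert_group(experts):
    """Collapse E per-expert RD curves into one SwitchLinear-group curve.
    experts: list of {'n_params': int, 'nrmse': {cfg: x}, 'sqnr': {cfg: dB}}.
    NRMSE is parameter-weighted (consistent with the additive objective);
    SQNR takes the worst case across experts for the safety veto."""
    total = sum(e['n_params'] for e in experts)
    group = {}
    for cfg in experts[0]['nrmse']:
        group[cfg] = {
            'nrmse': sum(e['n_params'] * e['nrmse'][cfg] for e in experts) / total,
            'sqnr': min(e['sqnr'][cfg] for e in experts),
        }
    return group

# Two experts: the larger expert dominates the group NRMSE,
# while the weaker expert sets the group's SQNR for the veto.
g = aggregate_expert_group([
    {'n_params': 100, 'nrmse': {(4, 64): 0.10}, 'sqnr': {(4, 64): 20.0}},
    {'n_params': 300, 'nrmse': {(4, 64): 0.20}, 'sqnr': {(4, 64): 12.0}},
])
```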

4. Experiments

We evaluate MINT on six model families using WikiText-2 perplexity (test split, sequence length 2048, seed 42). All experiments run on an Apple Mac Studio with M2 Ultra (192 GB unified memory). Analysis uses PyTorch CPU; conversion and inference use MLX [16].

Models: Qwen3-8B (dense, 8.2B parameters), Qwen3-30B-A3B (MoE, 30.5B parameters, 128 routed experts), Qwen2-57B-A14B (MoE, 57.4B parameters, 64 experts/layer), Mixtral-8x7B-Instruct (MoE, 46.7B parameters, 8 experts/layer), GLM-4.7-Flash (dense, 31.2B parameters), and Llama-4-Scout-17B-16E-Instruct (MoE, 109B parameters, 16 routed + 1 shared expert/layer).

Baselines: BF16 (unquantized), uniform 4-bit RTN with group size 64 or 128, GPTQ-Int4 where available, and the threshold-based v1 heuristic.

4.1 Main Results: Matched-Budget Perplexity

All comparisons are at matched or near-matched model sizes, addressing the critique that mixed-precision methods win only by spending more bits.

Table 2: Perplexity comparison at matched and varied model sizes. All evaluations use WikiText-2 (test, seq_len=2048, 128 sequences, seed=42). Both mean and median PPL are reported; see §4.6 for discussion of when they diverge. Δ columns are relative to BF16. AWQ and GPTQ values for Qwen3-8B are from [23]. The Llama-4-Scout BF16 model (~203 GB) exceeds available memory, so its Δ values are computed vs uniform 4-bit.

| Model | Condition | Size (GB) | Mean PPL | Median PPL | Δ |
|---|---|---|---|---|---|
| Qwen3-8B | BF16 baseline | 15.26 | 9.727 | — | — |
| | AWQ 4-bit | 4.05 | 10.50 | — | +8.1% |
| | GPTQ 4-bit | 4.05 | 10.30 | — | +6.1% |
| | Uniform 4-bit | 4.05 | 10.249 | — | +5.4% |
| | v1 (best) | 6.05 | 10.097 | — | +3.8% |
| | MINT | 6.00 | 10.039 | — | +3.2% |
| Qwen3-30B-A3B (MoE) | BF16 baseline | 56.87 | 8.733 | 8.789 | — |
| | Uniform 4-bit | 15.11 | 9.629 | — | +10.3% |
| | v1 (best) | 16.73 | 8.924 | 8.974 | +2.8% |
| | MINT (16 GB) | 16.29 | 8.930 | 8.971 | +2.3% |
| | MINT (17 GB) | 17.39 | 8.858 | 8.912 | +1.5% |
| | MINT (19 GB) | 19.01 | 8.782 | 8.798 | +0.6% |
| GLM-4.7-Flash | BF16 baseline | 58.16 | 11.344 | 8.706 | — |
| | Uniform 4-bit | 14.82 | ~11.46 | — | +31.6% |
| | v1 (best) | 15.92 | 9.930 | 9.084 | +4.3% |
| | MINT | 15.82 | 9.427 | 9.210 | +5.8% |
| Llama-4-Scout (109B MoE) | BF16 baseline | ~203 | exceeds 192 GB | — | — |
| | MINT (no safety) | 34.62 | 23.577 | 23.714 | +198% |
| | MINT (min-safe) | 46.93 | 8.675 | 8.786 | +9.8% |
| | MINT (50 GB) | 51.98 | 7.980 | 8.284 | +1.0% |
| | Uniform 4-bit | 56.9 | 7.899 | — | — |
| | v1 (best) | 59.5 | 7.628 | — | −3.4% |
| | MINT (64 GB) | 58.03 | 7.703 | 8.070 | −2.5% |
| | MINT (192 GB) | 163.24 | 7.359 | 7.691 | −6.8% |

Qwen3-8B (dense).

MINT achieves PPL 10.039 at 6.00 GB, outperforming the best v1 configuration (PPL 10.097 at 6.05 GB) by 0.6%. The improvement is attributable to joint group-size optimization: MINT assigns group size 32 to 29 tensors, group size 64 to 46, and group size 128 to 101, directing finer granularity to the most sensitive tensors. MINT closes 41% of the gap between uniform 4-bit and BF16.

Comparison with calibration-based methods (dense).

Published results from an independent evaluation [23] report GPTQ (PPL 10.30) and AWQ (PPL 10.50) for Qwen3-8B at uniform 4-bit (g128). MINT (PPL 10.039) outperforms both despite requiring no calibration data: −2.5% versus GPTQ and −4.4% versus AWQ (Table 2). MINT’s advantage comes at a size cost (6.00 GB vs 4.05 GB), but this is precisely the budget-targeted tradeoff MINT is designed for: given more memory, it delivers better quality than calibration-based methods achieve with less.

Qwen3-30B-A3B (MoE).

This model demonstrates MINT’s budget-targeting capability. At three different user-specified budgets, MINT achieves 100% budget utilization with monotonically improving quality: 16.29 GB (median PPL 8.971), 17.39 GB (8.912), and 19.01 GB (8.798, within +0.6% of BF16).

The greedy solver allocated 18,867 tensors in under 20 ms. The SQNR safety veto blocked all 2-bit configurations on this model (max SQNR: 8.1 dB, below the 9 dB floor), catching unsafe allocations that v1’s heuristic thresholds allowed (4% of parameters at 2-bit with SQNR below 8 dB). 3-bit configurations (min SQNR: 9.5 dB) pass the veto and are available to the allocator at tighter budgets.

GLM-4.7-Flash.

MINT achieves median PPL 9.210 at 15.82 GB, closing 82% of the gap between uniform 4-bit and BF16. On this model, mean PPL is misleading: BF16 exhibits 5 catastrophic outlier sequences (PPL 25,000–81,000), inflating the mean to 11.344, while quantized models stabilize these sequences (see §4.6).

Llama-4-Scout (109B MoE).

This model—the largest in our evaluation at 109 billion parameters with 48 MoE layers—demonstrates MINT’s budget-targeting capability at extreme scale. The BF16 model (~203 GB) exceeds our 192 GB test system, making data-free quantization essential rather than optional. MINT produces two hardware-targeted variants from a single analysis pass: a 58.03 GB model (PPL 7.703) for 64 GB devices and a 163.24 GB model (PPL 7.359) for 192 GB devices, both outperforming uniform 4-bit (PPL 7.899).

The SQNR safety veto blocks all 2-bit configurations on this architecture (SQNR < 8.7 dB), while 3-bit configurations (SQNR 10.4–14.6 dB) pass the 9 dB floor. This enables a minimum safe model of 46.9 GB (all 3-bit, PPL 8.675)—17% smaller than the all-4-bit floor with acceptable quality. A 2-bit model (34.6 GB) was also tested and confirmed catastrophic (PPL 23.6), validating the veto.

4.2 Joint Bit-Width and Group-Size Allocation

Table 3 shows MINT’s allocation decisions at the 19 GB budget on Qwen3-30B-A3B, revealing how the allocator exploits group size as a quality lever.

Table 3: MINT per-tensor allocation at 19 GB budget on Qwen3-30B-A3B. The allocator overwhelmingly selects group size 32 for 4-bit tensors (85% of all tensors), trading scale/bias overhead for quantization accuracy.

| Config (b, g) | Tensors | % | Role |
|---|---|---|---|
| (4, 32) | 15,908 | 85.2% | Default: small groups maximize accuracy |
| (4, 64) | 9 | <0.1% | Transitional |
| (4, 128) | 96 | 0.5% | Low-sensitivity tensors |
| (8, 128) | 2,612 | 14.0% | Sensitive: first/last layer attention, MoE experts |
| (8, 64) | 1 | <0.1% | — |
| (16, 0) | 241 | 1.3% | Hard-protected: norms, biases |
| Total | 18,867 | | |

The dominant pattern is striking: 85% of tensors receive 4-bit quantization with group size 32, not the conventional group size 128. The allocator discovers that reducing group size from 128 to 32 trades 0.125 bytes/parameter of scale/bias overhead for substantially better per-group quantization resolution. This overhead is funded by eliminating all 16-bit allocations for quantizable tensors—v1 allocated 5.6% of parameters to 16-bit, but the knapsack determines this is wasteful: those bytes produce more quality when redistributed as finer group sizes across 4-bit tensors.

The 8-bit allocations (14%) concentrate on first-layer attention tensors (prior π = 3.0), last-layer attention (prior π = 2.0), and the most sensitive MoE expert weights. Embeddings and the LM head remain at 4-bit despite their prior of π = 10.0—they are too large for 8-bit upgrades to be efficient.

4.3 SQNR Safety Veto Validation

To determine the optimal SQNR floor, we conducted a systematic sweep on Llama-4-Scout, measuring perplexity at different floor values. The SQNR distribution reveals a natural gap between bit levels that guides floor selection.

4.3.1 SQNR Distribution Analysis

Across both Llama-4-Scout (691 2D tensors) and Qwen3-30B-A3B (18,674 2D tensors), SQNR values cluster into well-separated bands by bit-width:

Table 4: SQNR distribution (dB) across all 2D tensors on Llama-4-Scout. A natural gap of ~2 dB separates 2-bit (max 8.7 dB) from 3-bit (min 10.4 dB). Setting the floor at 9 dB exploits this gap: all 2-bit configurations are vetoed while all 3-bit configurations are permitted.

| Config | Min | P5 | Median | P95 | Max | #<9 dB | #<15 dB |
|---|---|---|---|---|---|---|---|
| (2, 32) | 5.1 | 7.2 | 8.0 | 8.1 | 8.7 | 691 | 691 |
| (2, 64) | 2.5 | 5.6 | 6.8 | 6.9 | 7.2 | 691 | 691 |
| (3, 64) | 10.4 | 13.0 | 14.2 | 14.3 | 14.6 | 0 | 691 |
| (4, 32) | 19.4 | 21.3 | 22.0 | 22.1 | 22.8 | 0 | 0 |
| (4, 64) | 17.0 | 19.6 | 20.8 | 20.9 | 21.3 | 0 | 0 |
| (4, 128) | 15.1 | 18.2 | 19.8 | 20.0 | 20.0 | 0 | 0 |
| (8, 64) | 41.6 | 44.3 | 45.4 | 45.6 | 45.9 | 0 | 0 |

The same pattern holds on Qwen3-30B-A3B: 2-bit max SQNR is 8.1 dB, 3-bit min SQNR is 9.5 dB. The gap between 2-bit and 3-bit is consistent across architectures.

4.3.2 Floor Sweep: Where Quality Breaks

We evaluated Llama-4-Scout at four SQNR floor settings to determine where model quality transitions from acceptable to catastrophic:

Table 5: SQNR floor sweep on Llama-4-Scout. The quality cliff is between 2-bit and 3-bit (PPL triples), not between 3-bit and 4-bit (+12.5%). The 9 dB floor exploits the natural SQNR gap to permit 3-bit while blocking catastrophic 2-bit.

| SQNR Floor | Avg Bits | Size (GB) | Mean PPL | Median PPL | Effect |
|---|---|---|---|---|---|
| 0 dB (no safety) | 2.00 | 34.62 | 23.577 | 23.714 | Catastrophic (+198%) |
| 9 dB (default) | 3.00 | 46.93 | 8.675 | 8.786 | Usable (+9.8%) |
| 9 dB + 50 GB budget | 3.48 | 51.98 | 7.980 | 8.284 | Good (+1.0%) |
| 15 dB (conservative) | 4.00 | 56.16 | 7.709 | 8.076 | Best (−2.4%) |
| 15 dB + 58 GB budget | 4.01 | 58.03 | 7.703 | 8.070 | Best (−2.5%) |

The results reveal three key findings:

The real cliff is between 2-bit and 3-bit.

PPL jumps from 8.675 (3-bit) to 23.577 (2-bit)—a 2.7× degradation. In contrast, dropping from 4-bit to 3-bit costs only 12.5% (7.709 → 8.675). The 15 dB floor used in earlier experiments was overly conservative: it blocked all 3-bit configurations despite them producing acceptable quality.

The 9 dB floor exploits a natural gap.

On both Llama-4-Scout and Qwen3-30B-A3B, 2-bit SQNR peaks at ~8.7 dB while 3-bit SQNR starts at ~10.4 dB. A 9 dB floor sits cleanly in this gap, vetoing all 2-bit configurations (proven catastrophic) while permitting all 3-bit configurations (proven usable). This reduces the minimum safe model size by 9.2 GB (53.75 → 44.27 GB estimated, 56.16 → 46.93 GB on disk) with only a 12.5% PPL penalty versus the conservative floor.

Mixed 3/4/8-bit fills the gap.

At the 50 GB budget with 9 dB floor, the allocator produces a mixed allocation (88% 3-bit, 2.6% 4-bit, 9.1% 8-bit) that achieves PPL 7.980—only 1.0% worse than uniform 4-bit at nearly 5 GB smaller. This intermediate point was inaccessible with the 15 dB floor.

4.4 Budget-Targeted Deployment

A unique capability of MINT’s constrained formulation is hardware-targeted quantization: the user specifies an exact memory budget matching their deployment target, and MINT produces a near-optimal allocation for that constraint. Table 6 shows the full quality–size tradeoff curve for Qwen3-30B-A3B, and Table 7 demonstrates the same capability on Llama-4-Scout at extreme scale.

Table 6: PPL vs. budget for Qwen3-30B-A3B. The curve enables deployment planning: given a memory target, predict quality before running MINT.

| Budget | Model Size | Mean PPL | Median PPL | Deployment Target |
|---|---|---|---|---|
| 15.1 GB | 15.11 GB | 9.629 | — | (uniform 4-bit floor) |
| 15.3 GB | 16.13 GB | 8.970 | 9.020 | iPhone 16 Pro (16 GB) |
| 15.5 GB | 16.29 GB | 8.930 | 8.971 | — |
| 16.7 GB | 17.39 GB | 8.858 | 8.912 | — |
| 19.2 GB | 19.01 GB | 8.782 | 8.798 | — |
| 20.0 GB | 19.32 GB | 8.784 | 8.803 | RTX 4070 (20 GB headroom) |
| 25.0 GB | 27.39 GB | 8.760 | 8.779 | RTX 4090 (24 GB) |
| 30.0 GB | 30.75 GB | 8.657 | 8.684 | Mac M4 Pro (36 GB) |
| — | 56.87 GB | 8.733 | 8.789 | BF16 (unquantized) |

Several observations emerge from this curve:

Steep initial gains.

The first 1 GB above the 4-bit floor (15.1 → 16.1 GB) produces a 6.8% PPL improvement (9.629 → 8.970). This is because the additional budget funds selective 8-bit upgrades for the most sensitive tensors and finer group sizes for 4-bit tensors. Threshold-based methods cannot exploit this—they produce a fixed allocation regardless of budget.

Diminishing returns.

Beyond ~19 GB, additional budget yields marginal returns: the 19 → 30 GB increase improves mean PPL by only 1.4%. The 30 GB model (mean PPL 8.657) appears to outperform BF16 (mean PPL 8.733), but controlled experiments (§5, “Does quantization act as regularization?”) show this is a distributional artifact rather than genuine regularization.

Fitted prediction curve.

We fit an inverse-power model to the measured points as a local empirical interpolation:

PPL(B) = 8.371 + 0.494 / (B − 15.099)0.135    (5)

with RMSE = 0.025 PPL over the measured range B ∈ [15.2, 30.0] GB (95% CI: ±0.046 PPL). This enables practitioners to estimate output quality for budget targets within this range before running the pipeline. Key thresholds: within +1% of BF16 at 17.3 GB, within +2% at 15.7 GB. We caution that the parametric form is not physically meaningful outside the measured range—in particular, the fitted asymptote lies below BF16 PPL, which is an artifact of the functional form rather than a meaningful prediction.
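Evaluating Eq. 5 for deployment planning is a one-liner; for example, a practitioner with a 20 GB target can estimate quality before running the pipeline (the range guard reflects the caveat above):

```python
def predicted_ppl(budget_gb: float) -> float:
    """Eq. 5, fitted for Qwen3-30B-A3B; valid only for B in [15.2, 30.0] GB."""
    assert 15.2 <= budget_gb <= 30.0, "outside fitted range"
    return 8.371 + 0.494 / (budget_gb - 15.099) ** 0.135

ppl_at_20 = predicted_ppl(20.0)   # close to the measured mean PPL 8.784 at 20 GB
```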

Scaling to 109B parameters.

Table 7 demonstrates budget targeting on Llama-4-Scout, a model too large to run in BF16 on any single consumer device. From a single rate-distortion analysis pass, MINT generates allocations for different hardware tiers.

Table 7: MINT budget-targeted deployment of Llama-4-Scout (109B MoE). The same analysis produces allocations for different hardware targets. BF16 (~203 GB) exceeds all single-device targets.

| Budget | Model Size | Mean PPL | Median PPL | Deployment Target |
|---|---|---|---|---|
| — | 34.62 GB | 23.577 | 23.714 | (no safety — catastrophic) |
| min-safe | 46.93 GB | 8.675 | 8.786 | min safe (all 3-bit) |
| 50 GB | 51.98 GB | 7.980 | 8.284 | Mac Studio M4 (48 GB headroom) |
| 56.9 GB | 56.9 GB | 7.899 | — | (uniform 4-bit baseline) |
| 64 GB | 58.03 GB | 7.703 | 8.070 | Mac Studio M4 Max (64 GB) |
| 192 GB | 163.24 GB | 7.359 | 7.691 | Mac Studio M2 Ultra (192 GB) |
| — | ~203 GB | exceeds 192 GB | — | BF16 (unquantized) |

The Scout budget curve spans a 4.7× size range (35–163 GB) and reveals three regimes. Below the 9 dB safety floor, quality collapses (PPL 23.6 at 2-bit). Above the floor, the --min-safe mode produces the smallest viable model (46.9 GB, all 3-bit, PPL 8.675). With additional budget, the allocator upgrades the most sensitive tensors: the 50 GB mixed model (88% 3-bit, 9% 8-bit) achieves PPL 7.980—only 1.0% worse than uniform 4-bit at 5 GB smaller. The 64 GB and 192 GB allocations use progressively more 4-bit and 8-bit to push quality further. The 3.5× size difference between the minimum safe model and the 192 GB model (47 vs 163 GB) produces only an 18% PPL difference (8.675 vs 7.359), demonstrating diminishing returns—practitioners with constrained hardware sacrifice surprisingly little quality.

4.5 Matched-Size Comparison with GPTQ

To rigorously evaluate MINT against calibration-based quantization, we compare against GPTQ [1] at exactly matched model sizes across three MoE model families spanning different architectures and scales. For each family, we obtain an official GPTQ-Int4 quantized model, evaluate its perplexity, then run MINT’s pipeline (rate-distortion curves → MCKP allocation → conversion) targeting the same byte budget.

A critical engineering detail for MoE models is expert-grouped allocation: frameworks such as MLX implement MoE experts via SwitchLinear modules where all experts within a layer share the same quantization parameters (bit-width and group size). MINT’s allocator groups expert tensors by (layer, projection) and uses the parameter-weighted mean NRMSE across experts within each group, ensuring that the allocation is implementable by the inference runtime and that upgrade efficiency scales consistently with the additive global objective.

Table 8: Matched-size comparison: MINT vs GPTQ across three MoE families. All evaluations use WikiText-2 (test, seq_len=2048, seed=42). MINT outperforms GPTQ on all three families despite being entirely data-free. Mixtral GPTQ uses per-channel quantization (group_size=−1); original compressed size is 22 GB but must be dequantized to fp16 (87 GB) for evaluation since MLX does not support per-channel quantization. MINT comparison is at the uniform 4-bit size (24.5 GB).

| Model | Method | Size (GB) | Mean PPL | Med PPL | Δ vs GPTQ |
|---|---|---|---|---|---|
| Qwen3-30B-A3B (MoE, 30B) | GPTQ Int4 (g128) | 16.0 | 9.122 | 9.160 | — |
| | MINT | 16.1 | 8.970 | 9.020 | −1.5% |
| Qwen2-57B-A14B (MoE, 57B) | GPTQ Int4 (g128) | 29.9 | 6.390 | 6.396 | — |
| | MINT | 29.9 | 6.329 | 6.356 | −0.6% |
| Mixtral-8x7B (MoE, 47B) | GPTQ per-ch | 87.0 | 4.608 | 4.640 | — |
| | Uniform 4-bit (g64) | 24.5 | 4.471 | 4.461 | — |
| | MINT | 24.5 | 4.264 | 4.266 | −4.6% |

Consistent wins across architectures.

MINT outperforms GPTQ on all three model families, with improvements ranging from −0.6% to −4.6% median PPL at matched sizes. These are different MoE architectures (Qwen2-MoE with 64 experts per layer, Qwen3 with shared expert + 128 routed experts, Mixtral with 8 experts) ranging from 30B to 57B total parameters. The consistency across architectures is notable because GPTQ uses Hessian-based calibration with activation data while MINT uses only the weight tensors.

Expert grouping is essential.

Without expert grouping, the allocator treats each expert tensor independently and assigns different bit-widths to experts within the same SwitchLinear module. Since the inference runtime requires uniform quantization within each module, the bridge must aggregate these decisions (typically via mode), creating a mismatch between planned and actual allocation. On Qwen2-57B-A14B, this mismatch inflated the model from 29.9 GB (planned) to 30.2 GB (actual) and degraded PPL from 6.329 to 7.147. Expert grouping with parameter-weighted mean NRMSE eliminates this mismatch entirely: the allocator’s 29.9 GB estimate matches the actual output size to within 0.01 GB.

Why does data-free beat calibration?

We hypothesize three factors: (1) joint group-size optimization: MINT explores a configuration space of 8 (bits, group_size) combinations per tensor, while GPTQ uses a fixed group size; (2) budget-constrained allocation: MINT’s MCKP solver targets a specific memory budget, while GPTQ applies uniform bit-width without budget awareness; and (3) MoE calibration difficulty: calibrating quantization across hundreds of experts requires activating each expert sufficiently, which narrow calibration sets may fail to do. MINT’s data-free approach avoids this coverage problem entirely.

4.6 Mean vs. Median Perplexity

Standard perplexity evaluation computes the geometric mean loss across all evaluation sequences. We observe that this metric can be misleading:

Table 9: Mean vs. median perplexity reveals different quality orderings. GLM-4.7-Flash is the extreme case, but the effect appears across models.

| Model | Condition | Mean PPL | Median PPL | Outliers (>100) |
|---|---|---|---|---|
| Qwen3-30B | MINT (19 GB) | 8.782 | 8.798 | 0 |
| | MINT (16 GB) | 8.930 | 8.971 | 0 |
| | v1 (best) | 8.924 | 8.974 | 0 |
| GLM-4.7 | BF16 | 11.344 | 8.706 | 5 |
| | v1 | 9.930 | 9.084 | 4 |
| | MINT | 9.427 | 9.210 | 0 |
| Llama-4 Scout | Uniform 4-bit | 7.899 | — | — |
| | MINT (64 GB) | 7.703 | 8.070 | 0 |
| | MINT (192 GB) | 7.359 | 7.691 | 0 |

On GLM-4.7-Flash, mean PPL gives a completely inverted quality ordering: BF16 appears worst (11.344) and MINT best (9.427). Median PPL gives the correct ranking: BF16 best (8.706), v1 second (9.084), MINT third (9.210). The inversion is caused by 5 catastrophic outlier sequences where BF16 produces PPL values of 25,000–81,000, while quantization noise acts as implicit regularization that stabilizes these pathological sequences to PPL 100–360.

On Qwen3-30B-A3B, the mean and median are consistent in their ordering but differ in magnitude: MINT (16 GB) has mean 8.930 but median 8.971. The 0.04 difference matters when comparing against v1’s mean of 8.924—mean PPL suggests MINT is worse, while median PPL (8.971 vs 8.974) shows MINT is marginally better at smaller size.

Recommendation: Future quantization evaluations should report standard corpus perplexity (token-weighted cross-entropy) as the primary metric, supplemented by median per-sequence PPL, tail percentiles (P95, P99), and outlier counts as robustness diagnostics. When the per-sequence loss distribution is heavy-tailed, mean per-sequence PPL can produce misleading orderings; reporting both gives a more complete picture of quantization quality.
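The recommended diagnostics can be sketched as a small routine. This is an illustrative implementation, assuming one average cross-entropy loss per evaluation sequence; the outlier threshold of 100 follows Table 9.

```python
# Sketch: robust per-sequence perplexity diagnostics, assuming one average
# cross-entropy loss per evaluation sequence. "Mean PPL" exponentiates the
# mean loss (the standard corpus-level definition); median, tail percentiles,
# and the >100 outlier count follow the recommendation above.
import math
import statistics

def ppl_diagnostics(seq_losses, outlier_ppl=100.0):
    ppls = sorted(math.exp(l) for l in seq_losses)
    n = len(ppls)
    pct = lambda p: ppls[min(n - 1, round(p / 100.0 * n))]
    return {
        "mean_ppl": math.exp(statistics.mean(seq_losses)),  # outlier-sensitive
        "median_ppl": statistics.median(ppls),              # robust ordering
        "p95": pct(95),
        "p99": pct(99),
        "outliers": sum(p > outlier_ppl for p in ppls),
    }

# One pathological sequence (cf. GLM-4.7-Flash in Table 9) inflates the mean
# while leaving the median nearly untouched.
ordinary = [2.10, 2.15, 2.18, 2.20, 2.25]
print(ppl_diagnostics(ordinary)["mean_ppl"])
print(ppl_diagnostics(ordinary + [11.0])["mean_ppl"])    # mean blows up
print(ppl_diagnostics(ordinary + [11.0])["median_ppl"])  # median stays put
```
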

4.7 Analysis Efficiency

Table 10: MINT pipeline timing on M2 Ultra 192 GB.

Model | Tensors | Pass 1 | Allocation | Total
Qwen3-8B | 399 | 3 min | <1s | ~10 min
Qwen3-30B-A3B | 18,867 | 50 min | <1s | ~54 min
GLM-4.7-Flash | 9,703 | 39 min | <1s | ~44 min
Llama-4-Scout | ~1,000 | 45 min | <1s | ~50 min

Pass 1 dominates runtime because it computes rate-distortion curves at 8 configurations per tensor. The allocation phase (greedy solver) completes in under 10 ms regardless of model size. Total pipeline time is 2–5× slower than v1’s single-pass approach but produces strictly better allocations due to joint optimization.
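The greedy allocation phase can be sketched as follows. This is a minimal illustrative MCKP heuristic (start every tensor at its cheapest configuration, then repeatedly apply the upgrade with the best loss reduction per extra byte); MINT's actual solver and its tie-breaking rules may differ, and the tensor names and option lists here are hypothetical.

```python
# Sketch: greedy Multiple-Choice Knapsack allocation. Each tensor has a list
# of (cost_bytes, loss) options sorted by ascending cost; we start everything
# at the cheapest option and repeatedly take the upgrade with the best
# loss-reduction-per-byte ratio that still fits the budget.
import heapq

def greedy_mckp(tensors, budget_bytes):
    choice = {name: 0 for name in tensors}
    spent = sum(opts[0][0] for opts in tensors.values())
    heap = []

    def push(name):
        i = choice[name]
        opts = tensors[name]
        if i + 1 < len(opts):
            dcost = opts[i + 1][0] - opts[i][0]
            dloss = opts[i][1] - opts[i + 1][1]
            if dcost > 0:
                heapq.heappush(heap, (-dloss / dcost, name, i + 1))

    for name in tensors:
        push(name)
    while heap:
        _, name, nxt = heapq.heappop(heap)
        if nxt != choice[name] + 1:
            continue  # stale heap entry
        dcost = tensors[name][nxt][0] - tensors[name][choice[name]][0]
        if spent + dcost <= budget_bytes:
            spent += dcost
            choice[name] = nxt
            push(name)
    return choice, spent

# Hypothetical two-tensor example: "a" is upgraded once, "b" stays cheapest.
opts = {"a": [(100, 1.0), (150, 0.5), (300, 0.4)],
        "b": [(100, 1.0), (200, 0.9)]}
print(greedy_mckp(opts, budget_bytes=260))
```
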

5. Discussion

Group size is the primary quality lever.

The most striking finding is that group-size optimization provides larger quality gains than bit-width selection. On Qwen3-30B-A3B, the allocator chose 4-bit g32 for 85% of tensors instead of the conventional g128. v1 kept all 4-bit tensors at g128 and instead spent its budget on 16-bit and 8-bit allocations. The knapsack solver determines that spending 0.125 bytes/parameter of overhead on 4× more quantization groups (g32 vs. g128) buys more quality than spending 1.5 bytes/parameter to upgrade select tensors to 16-bit. This validates the observation from prior work [11] that group size is a first-order compression variable.

SQNR veto catches unsafe heuristic allocations.

On Qwen3-30B-A3B, v1’s threshold heuristic allocated 4.0% of parameters to 2-bit. The SQNR analysis reveals these allocations had SQNR values of 4.8–8.1 dB—far below the 9 dB safety floor. The SQNR veto provides a model-specific safety check that catches dangerous allocations regardless of how the heuristic scoring ranks them. Our floor sweep (§4.3) confirms that the cliff is between 2-bit and 3-bit, not between 3-bit and 4-bit.
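The SQNR computation and the 9 dB veto can be sketched with a plain round-to-nearest affine quantizer. The quantizer here is an assumption for illustration (MINT follows the inference runtime's scheme), and it requires the tensor size to divide evenly into groups.

```python
# Sketch: per-tensor SQNR and the 9 dB safety veto, using a simple
# round-to-nearest affine quantizer (one scale/offset per group) as a
# stand-in for the runtime's actual quantization scheme.
import numpy as np

def rtn_quantize(w, bits, group_size):
    """Round-to-nearest affine quantization; w.size must divide group_size."""
    flat = w.reshape(-1, group_size)
    lo = flat.min(axis=1, keepdims=True)
    hi = flat.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-12) / (2**bits - 1)
    q = np.round((flat - lo) / scale)
    return (q * scale + lo).reshape(w.shape)

def sqnr_db(w, w_hat):
    noise = np.sum((w - w_hat) ** 2)
    return 10.0 * np.log10(np.sum(w**2) / max(noise, 1e-30))

def veto(w, bits, group_size, floor_db=9.0):
    """Reject configurations below the empirically validated 9 dB floor."""
    return sqnr_db(w, rtn_quantize(w, bits, group_size)) >= floor_db

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
print(sqnr_db(w, rtn_quantize(w, 2, 64)))  # well below 9 dB: vetoed
print(sqnr_db(w, rtn_quantize(w, 4, 64)))  # well above 9 dB: allowed
```

For Gaussian-distributed weights this reproduces the gap the veto exploits: 2-bit lands several dB below the floor while 4-bit clears it comfortably.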

16-bit allocation is targeted.

MINT hard-protects embeddings, the LM head, layer norms, and MoE routers at 16-bit, matching the inference runtime’s constraints (these modules are never quantized in MLX). v1 additionally allocated 5.6% of quantizable parameters to 16-bit via its heuristic thresholds. The MCKP solver never assigns 16-bit to quantizable tensors: the bytes saved by avoiding 16-bit yield a larger quality improvement when spent on finer group sizes across thousands of other tensors.

Budget utilization enables hardware targeting.

The greedy solver achieves 100% budget utilization across all experiments. This is a direct advantage of the constrained formulation: a user with an iPhone (16 GB), an RTX 4090 (24 GB), or a Mac (48 GB) specifies their exact available memory and MINT delivers the best achievable quality. Threshold-based methods produce a fixed model size with no user control. Combined with the PPL prediction curve (Eq. 5), users can estimate quality before running the pipeline, enabling informed deployment decisions. The Llama-4-Scout results demonstrate this at extreme scale: the same 109B model is deployed on a 64 GB device (58 GB, PPL 7.703) and a 192 GB device (163 GB, PPL 7.359) from a single analysis pass, with the quality difference (4.7%) far smaller than the 2.8× size difference.

Scaling to models that exceed available memory.

Llama-4-Scout (109B parameters, ~203 GB in BF16) cannot be loaded or evaluated in full precision on any single consumer device. This makes data-free quantization not merely convenient but necessary—calibration-based methods would require either distributed infrastructure for calibration or reliance on proxy models. MINT’s weight-only analysis processes each safetensor shard independently, never requiring the full model to reside in memory simultaneously. The analysis completes in approximately 50 minutes and produces a manifest from which any number of budget-targeted variants can be generated without re-analysis.

When the allocator disagrees with heuristics.

On GLM-4.7-Flash, v1 achieved slightly better median PPL (9.084 vs 9.210) by assigning 141 tensors to 8-bit via its threshold heuristic. MINT’s allocator determined that within the 16 GB budget, all quantizable tensors are best at 4-bit. This reveals a limitation of NRMSE as the sole loss function: it may undervalue certain tensor-level quality improvements that are important for downstream perplexity. Incorporating activation-aware sensitivity (e.g., Fisher information or Hessian traces) as a per-tensor importance weight could address this gap without requiring calibration data.

Does quantization act as regularization?

At 30.75 GB (97.5% 8-bit), MINT achieves mean PPL 8.657—apparently below BF16’s 8.733. To test whether this reflects genuine regularization, we conducted controlled experiments on Qwen3-30B-A3B comparing BF16, uniform 8-bit quantization at two group sizes, uniform 6-bit, and Gaussian noise injection calibrated to match 8-bit quantization error magnitude (σ matched per tensor, 385 tensors perturbed, average SNR 44.7 dB).

Table 11: Controlled noise experiments on Qwen3-30B-A3B. Mean and median PPL diverge in opposite directions under all noise conditions, including unstructured Gaussian noise.

Condition | Size (GB) | Mean PPL | Median PPL | ΔMed
BF16 (baseline) | 56.87 | 8.733 | 8.789 | —
Gaussian noise (1×) | 56.87 | 8.742 | 8.755 | −0.38%
Uniform 8-bit g64 | 30.21 | 8.750 | 8.765 | −0.27%
Uniform 8-bit g128 | 29.32 | 8.769 | 8.776 | −0.15%
Uniform 6-bit g64 | 23.11 | 8.765 | 8.798 | +0.10%

Three findings emerge. First, the mean PPL “improvement” is a distributional artifact: all quantized conditions produce higher mean PPL than BF16 (+0.11% to +0.42%), ruling out genuine regularization. The earlier observation of mean PPL below BF16 at the 30 GB budget point was likely due to the specific MINT allocation pattern combined with evaluation variance.

Second, the median PPL improvement is real but not regularization-specific. All 8-bit conditions show lower median PPL than BF16 (−0.15% to −0.27%). However, matched-scale Gaussian noise produces the same effect (−0.38%), confirming this is a general noise phenomenon rather than anything specific to quantization’s structured rounding. The mechanism is asymmetric perturbation of the per-sequence loss distribution: BF16 has a left tail of sequences with unusually low PPL, and any noise—structured or not—partially disrupts these, pulling median down while pushing mean up.

Third, per-sequence analysis reveals no selective improvement. Comparing BF16 to uniform 8-bit across 145 WikiText-2 sequences, 68% of sequences degrade and only 32% improve, with the effect uniformly distributed across PPL quartiles (~+0.20% per quartile). This rules out the hypothesis that quantization noise stabilizes pathological sequences on this model, unlike the genuine outlier stabilization observed on GLM-4.7-Flash (§4.6).

These results highlight a subtle evaluation pitfall: when mean and median perplexity diverge, apparent “improvements” from quantization may reflect distributional artifacts rather than genuine quality gains. This reinforces our recommendation (§4.6) to report both metrics.
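The matched-noise control can be sketched as follows, assuming the per-tensor residual scale has already been measured. The 44.7 dB target is the paper's reported average SNR; deriving σ from it here is an illustrative shortcut rather than the paper's exact per-tensor matching procedure.

```python
# Sketch: inject unstructured Gaussian noise whose per-tensor std matches a
# measured quantization-residual scale, so any PPL shift it reproduces cannot
# be specific to quantization's structured rounding.
import numpy as np

def inject_matched_noise(w, sigma, rng):
    """Perturb w with N(0, sigma^2) noise and report the resulting SNR (dB)."""
    noisy = w + rng.normal(0.0, sigma, size=w.shape)
    snr_db = 10.0 * np.log10(np.sum(w**2) / np.sum((noisy - w) ** 2))
    return noisy, snr_db

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512))
# sigma chosen to hit the paper's reported average SNR of 44.7 dB
sigma = w.std() / 10 ** (44.7 / 20)
noisy, snr = inject_matched_noise(w, sigma, rng)
print(round(snr, 1))
```
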

MINT consistently outperforms calibration-based GPTQ.

In matched-size comparisons across three MoE families (§4.5), MINT reduces median PPL by 0.6–4.6% relative to GPTQ despite being entirely data-free. Additionally, on dense Qwen3-8B (at a larger size budget), MINT (PPL 10.039) outperforms both GPTQ (PPL 10.30) and AWQ (PPL 10.50), by 2.5% and 4.4% respectively. The consistency across four model families and two architecture types (dense and MoE) challenges the assumption that calibration is necessary for competitive quantization quality. We attribute this to three factors: (1) joint group-size optimization accesses a richer configuration space than fixed-group GPTQ; (2) the MCKP solver targets quality maximization under an explicit budget constraint rather than applying uniform bit-widths; and (3) for MoE models, calibration must adequately cover expert activation patterns across potentially hundreds of experts, which narrow calibration sets may fail to achieve.

Limitations.

The allocator uses raw NRMSE as its sole loss function without per-tensor importance weighting. NRMSE measures weight reconstruction fidelity, but not all tensors contribute equally to downstream loss—incorporating activation-aware sensitivity (e.g., Fisher information or Hessian traces) could improve allocation quality, particularly on models like GLM-4.7-Flash where the NRMSE-based allocator disagrees with heuristic methods. We investigate several candidate sensitivity features (spectral, kurtosis, output noise amplification) in Appendix C, but in the current system these are not used by the allocator. The PPL prediction curve (Eq. 5) is a local empirical fit valid over the measured budget range; it should not be extrapolated as a universal law. BF16 perplexity is unavailable for Llama-4-Scout as the model (~203 GB) exceeds our test system’s 192 GB memory. We also lack downstream task accuracy comparisons and throughput measurements for mixed group-size overhead. The GPTQ comparison uses publicly available GPTQ-Int4 models with default settings; comparisons against GPTQ with optimized hyperparameters (e.g., act-order with variable group sizes) could narrow the gap.

6. Conclusion

We presented MINT, a data-free mixed-precision quantization framework that replaces ad-hoc heuristic scoring with budget-constrained allocation over per-tensor rate-distortion curves. We evaluated MINT on four model families spanning 8B to 109B parameters, including both dense and MoE architectures. By formulating per-tensor allocation as a Multiple-Choice Knapsack Problem over joint (bit-width, group-size) configurations, MINT enables three capabilities that prior methods lack:

  1. Budget-targeted deployment. Users specify their exact memory constraint—16 GB for iPhone, 24 GB for RTX 4090, 64–192 GB for Mac—and receive a near-optimal allocation. On Llama-4-Scout (109B), a single analysis pass produces deployments from 58 GB (PPL 7.703) to 163 GB (PPL 7.359), both outperforming uniform 4-bit. A fitted prediction curve estimates output quality before running the pipeline.
  2. Joint group-size optimization. Treating group size as an allocation variable (not a fixed hyperparameter) emerges as the single most impactful design decision. On Qwen3-30B-A3B, 85% of tensors receive group size 32 instead of the conventional 128, and this produces more quality improvement than bit-width changes.
  3. Safety guarantees with empirically validated thresholds. The SQNR veto at 9 dB exploits the natural gap between catastrophic 2-bit (max 8.7 dB) and usable 3-bit (min 10.4 dB). On Llama-4-Scout, this enables a --min-safe mode producing the smallest viable model (46.9 GB, PPL 8.675) while blocking 2-bit allocations that would triple perplexity (PPL 23.6).

On Qwen3-30B-A3B, MINT achieves +0.6% perplexity versus BF16 at 19 GB (33% of full size), and matches prior heuristic quality at 2.6% less storage. On Llama-4-Scout (109B MoE), MINT enables deployment of a model too large for BF16 on any consumer device, spanning from 46.9 GB (min-safe, PPL 8.675) through 58 GB (PPL 7.703) to 163 GB (PPL 7.359)—all superior to uniform 4-bit (PPL 7.899) except the min-safe floor which trades 9.8% quality for 17% smaller size. We additionally demonstrate that standard mean perplexity can produce misleading quality orderings due to outlier sequences, and recommend that quantization evaluations report median perplexity and tail statistics alongside standard corpus perplexity.

Contrary to the conventional assumption that calibration-based methods are strictly superior, MINT consistently outperforms GPTQ at matched model sizes across three MoE families (Qwen3-30B-A3B: −1.5%, Qwen2-57B-A14B: −0.6%, Mixtral-8x7B: −4.6% median PPL) and also outperforms both GPTQ and AWQ on dense Qwen3-8B. This consistency across four model families and two architecture types (dense and MoE) suggests that joint group-size optimization and budget-constrained allocation can provide quality gains that compensate for the absence of calibration data. For MoE models specifically, MINT’s data-free approach avoids the coverage problem inherent in calibrating across hundreds of sparsely activated experts. MINT is useful both as a standalone method when calibration is impractical (very large models, proprietary data) and as a front-end for calibration-based refinement.

References

[1] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. In ICLR, 2023.

[2] J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. In MLSys, 2024.

[3] S. Kim, C. Hooper, A. Gholami, Z. Dong, X. Li, S. Shen, M. W. Mahoney, and K. Keutzer. SqueezeLLM: Dense-and-sparse quantization. In ICML, 2024.

[4] T. Dettmers, R. Svirschevski, V. Egiazarian, D. Kuznedelev, E. Frantar, S. Ashkboos, A. Borzunov, T. Hoefler, and D. Alistarh. SpQR: A sparse-quantized representation for near-lossless LLM weight compression. In ICLR, 2024.

[5] J. Chee, Y. Cai, V. Kuleshov, and C. De Sa. QuIP: 2-bit quantization of large language models with guarantees. In NeurIPS, 2024.

[6] J. Tang, Y. Liu, H. Yin, Y. Li, Y. Jiang, and Q. Tan. EasyQuant: An efficient data-free quantization algorithm for LLMs. arXiv preprint arXiv:2403.02775, 2024.

[7] F. Zhang et al. A mixed quantization approach for data-free quantization of LLMs. In ICAART, 2025.

[8] M. Badri, F. Tramer, and T. Hoefler. Pushing the limits of large language model quantization via the linearity theorem. In NAACL, 2025.

[9] Y. Zhao et al. KurTail: Kurtosis-based LLM quantization. In Findings of EMNLP, 2025.

[10] J. Li et al. LLM-MQ: Mixed-precision quantization for efficient LLM deployment. Technical report, Tsinghua University, 2024.

[11] Y. Li et al. MixLLM: Mixed-precision LLM quantization with algorithm-system co-design. In OpenReview, 2025.

[12] W. Wei et al. Mixture compressor for Mixture-of-Experts LLMs gains more. In ICLR, 2025.

[13] M. Huang et al. MoEQuant: Enhancing quantization for Mixture-of-Experts large language models via expert-balanced sampling and affinity guidance. arXiv preprint arXiv:2505.03804, 2025.

[14] Y. Xie et al. Examining post-training quantization for Mixture-of-Experts: A benchmark. In NeurIPS Workshop on Efficient Natural Language and Speech Processing, 2024.

[15] Y. Bondarenko, M. Nagel, and T. Blankevoort. Quantizable transformers: Removing outliers by helping attention heads do nothing. In EMNLP, 2023.

[16] Apple. MLX: An array framework for Apple silicon. https://github.com/ml-explore/mlx, 2023.

[17] Qwen Team. Qwen3.5 technical report. https://qwenlm.github.io/blog/qwen3.5/, 2026.

[18] G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. In ICML, 2023.

[19] M. Badri and A. Shaji. Half-quadratic quantization of large machine learning models. https://github.com/mobiusml/hqq, 2024.

[20] W. Huang et al. SliM-LLM: Salience-driven mixed-precision quantization for large language models. In ICML, 2025.

[21] Y. Shang et al. Cherry on Top: Parameter heterogeneity and quantization in large language models. In NeurIPS, 2024.

[22] J. Park et al. HESTIA: A Hessian-guided differentiable quantization-aware training framework for extremely low-bit LLMs. arXiv preprint arXiv:2601.20745, 2026.

[23] Z. Xu et al. An empirical study of Qwen3 quantization. arXiv preprint arXiv:2505.02214, 2025.

[24] Meta AI. The Llama 4 herd of models. https://ai.meta.com/blog/llama-4-multimodal-intelligence/, 2025.


Appendix A. Gap Closure Summary

Table A1: Perplexity gap closure: fraction of the quality gap between uniform 4-bit and BF16 that MINT recovers at various budget points.

Model | Budget | MINT ΔPPL | Uniform ΔPPL | Gap Closed
Qwen3-8B | 6.0 GB | +3.2% | +5.4% | 41%
Qwen3-30B | 16.3 GB | +2.3% | +10.3% | 78%
Qwen3-30B | 17.4 GB | +1.5% | +10.3% | 86%
Qwen3-30B | 19.0 GB | +0.6% | +10.3% | 94%
GLM-4.7-Flash | 15.8 GB | +5.8% | +31.6% | 82%
Llama-4-Scout | 58.0 GB | −2.5%* | — | —
Llama-4-Scout | 163.2 GB | −6.8%* | — | —

*Llama-4-Scout's BF16 model (~203 GB) exceeds available memory; Δ computed vs uniform 4-bit (PPL 7.899). Gap closure is not computable without BF16 baseline.

On Qwen3-30B-A3B, MINT closes 78–94% of the perplexity gap depending on the budget, with the 19 GB budget achieving near-BF16 quality at 33% of full size.
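Gap closure in Table A1 follows directly from the two degradation percentages; small discrepancies against the table (e.g., the 17.4 GB row, where this formula gives 85% vs. the reported 86%) presumably reflect rounding of the reported percentages.

```python
# Sketch: gap closure = fraction of the uniform-4-bit-to-BF16 perplexity
# degradation that MINT eliminates at a given budget.

def gap_closed(mint_dppl_pct: float, uniform_dppl_pct: float) -> float:
    return 1.0 - mint_dppl_pct / uniform_dppl_pct

# Qwen3-30B at 19 GB: +0.6% (MINT) vs +10.3% (uniform 4-bit)
print(round(100 * gap_closed(0.6, 10.3)))  # -> 94
```
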

Appendix B. Comparison with Threshold-Based Allocation (v1)

MINT’s MCKP formulation addresses specific limitations identified in the threshold-based v1 approach:

Table B1: Systematic comparison of threshold-based (v1) and optimization-based (MINT) design decisions.

Aspect | Threshold-based (v1) | MINT (MCKP)
Objective | Weighted sum + thresholds | Constrained optimization (Eq. 1)
Error metric | Single-point 4-bit NRMSE | Multi-point RD curve (8 configs)
Group size | Fixed hyperparameter | Per-tensor variable (85% chose g32)
Protection | Binary hard-coded rules | Soft priors in objective (Table 1)
Safety floor | None (allows SQNR < 5 dB) | SQNR veto at 9 dB (empirically validated)
Budget | No user control (fixed output) | User-specified (GB, avg bits, or ratio)
16-bit alloc | 5.6% of params | 0% (wasteful; redirected to g32 overhead)
2-bit alloc | 4.0% of params | 0% (blocked by SQNR floor)
Quality predict | Not possible | Fitted curve (Eq. 5)

Appendix C. Exploratory Sensitivity Features

In addition to the rate-distortion curves and SQNR values used by the allocator, MINT computes several per-tensor sensitivity features during analysis. These features were investigated as candidates for a per-tensor importance weight in the allocator objective, but the current system uses only raw NRMSE as the loss function. We document them here as they may be useful for future extensions that incorporate learned importance weighting.

C.1 Spectral Features

We compute three scale-invariant spectral features from the singular values σ_1 ≥ σ_2 ≥ ··· ≥ σ_r (via randomized SVD with rank k = 256):

Stable rank.

The effective dimensionality of W:

r_s(W) = \|W\|_F^2 / \|W\|_2^2 = \sum_{i=1}^{r} \sigma_i^2 / \sigma_1^2    (6)

Lower values indicate that fewer directions carry most of the information, making quantization more likely to corrupt critical components. Note that ‖W‖_F² is computed exactly from the weight tensor (not from the truncated SVD), while σ_1 comes from the top singular value.

Spectral tail mass.

The fraction of energy outside the top k/10 singular values:

\tau(W) = 1 - \sum_{i=1}^{\lfloor k/10 \rfloor} \sigma_i^2 / \|W\|_F^2    (7)

The denominator uses the exact Frobenius norm, not the truncated singular value sum, ensuring the metric is well-defined regardless of the SVD rank parameter.

Approximate log spectral spread.

Based on the top-k truncated SVD (not the true condition number):

\kappa_k(W) = \min\left(10, \log_{10}(\sigma_1 / (\sigma_k + \varepsilon))\right)    (8)

Since only k = 256 singular values are computed, this measures the spectral spread within the top-k subspace rather than the true condition number.
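The three spectral features (Eqs. 6–8) can be sketched as follows. For clarity this uses `np.linalg.svd` in place of the randomized rank-256 SVD the pipeline uses, which is an implementation assumption.

```python
# Sketch: stable rank (Eq. 6), spectral tail mass (Eq. 7), and approximate
# log spectral spread (Eq. 8) from the top-k singular values. Full SVD stands
# in for the randomized rank-256 SVD used in the actual pipeline.
import numpy as np

def spectral_features(w, k=256, eps=1e-12):
    s = np.linalg.svd(w, compute_uv=False)[:k]   # descending singular values
    fro2 = np.sum(w**2)                          # exact Frobenius norm squared
    stable_rank = fro2 / s[0] ** 2
    tail_mass = 1.0 - np.sum(s[: max(1, k // 10)] ** 2) / fro2
    log_spread = min(10.0, np.log10(s[0] / (s[-1] + eps)))
    return stable_rank, tail_mass, log_spread

# Small diagonal example: singular values are simply the diagonal entries.
print(spectral_features(np.diag([4.0, 2.0, 1.0, 1.0])))
```
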

C.2 Per-Group Kurtosis Features

We reshape W into K = ⌈mn / g⌉ groups of size g = 128 and compute the excess kurtosis κ_j for each group j:

\kappa_j = \frac{1}{g} \sum_{i=1}^{g} \left( \frac{w_{j,i} - \bar{w}_j}{s_j} \right)^4 - 3    (9)

Four features are extracted from the distribution {κ_1, …, κ_K}:

f_{\mathrm{kurt90}} = P_{90}(\kappa_1, \ldots, \kappa_K)   (90th percentile)    (10)

f_{\mathrm{spread}} = P_{99} - P_{50}   (tail-heaviness spread)    (11)

f_{\mathrm{outlier}} = \frac{1}{K} \sum_{j=1}^{K} \mathbf{1}[\exists\, i : |w_{j,i} - \bar{w}_j| > 3 s_j]   (outlier group fraction)    (12)

f_{\mathrm{maxmed}} = \max_j \kappa_j - \mathrm{median}_j\, \kappa_j   (max-to-median gap)    (13)

Per-group kurtosis aligns the statistical metric with the quantization granularity, addressing the critique that global kurtosis operates at the wrong level [9].
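Equations 9–13 can be sketched as follows, assuming for simplicity that the tensor size is an exact multiple of the group size (the paper ceils to K = ⌈mn/g⌉ groups).

```python
# Sketch: per-group excess kurtosis (Eq. 9) and the four summary features
# (Eqs. 10-13). Assumes w.size is a multiple of g; MINT pads/ceils to
# K = ceil(mn/g) groups in the general case.
import numpy as np

def kurtosis_features(w, g=128):
    groups = w.reshape(-1, g)
    mean = groups.mean(axis=1, keepdims=True)
    std = np.maximum(groups.std(axis=1, keepdims=True), 1e-12)
    z = (groups - mean) / std
    kappa = (z**4).mean(axis=1) - 3.0   # excess kurtosis per group (Eq. 9)
    return {
        "kurt90": np.percentile(kappa, 90),                    # Eq. 10
        "spread": np.percentile(kappa, 99) - np.percentile(kappa, 50),  # Eq. 11
        "outlier": np.mean(np.any(np.abs(z) > 3.0, axis=1)),   # Eq. 12
        "maxmed": kappa.max() - np.median(kappa),              # Eq. 13
    }

rng = np.random.default_rng(0)
w = rng.standard_normal((128, 256))   # Gaussian weights: kappa near 0
print({k: round(float(v), 3) for k, v in kurtosis_features(w).items()})
```
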

C.3 Norm-Guided Output Noise Amplification

Each linear layer is preceded by a normalization layer with a learned scale parameter γ ∈ ℝ^n. We sample probes x_j ~ N(0, diag(γ²)), encoding the model’s own channel importance without calibration data. Using the actual RTN quantization residual ΔW = Q(W; 4, 128) − W, the output noise amplification is:

f_{\mathrm{out}} = \|\Delta W\, X\|_F / (\|W\, X\|_F + \varepsilon)    (14)

averaged over p = 32 probe vectors. We note that f_out uses a fixed 4-bit residual as the perturbation; unlike single-point allocation methods, however, this feature is auxiliary: the allocator’s actual loss function is the full multi-point RD curve, so the single-point circularity critique does not apply to the allocation decision. If a preceding normalization layer is unavailable, we fall back to isotropic probes x ~ N(0, I).
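Equation 14 can be sketched as follows. For simplicity this version pools the p = 32 probes inside a single Frobenius norm rather than averaging per probe, and uses a self-contained round-to-nearest residual; both are implementation assumptions.

```python
# Sketch: norm-guided output noise amplification (Eq. 14). Probes are drawn
# from N(0, diag(gamma^2)) when a preceding norm scale gamma is available,
# else isotropic N(0, I). The residual uses 4-bit g128 RTN as in the text.
import numpy as np

def rtn_residual(w, bits=4, group_size=128):
    """Round-to-nearest affine quantization residual Q(w) - w, per group."""
    flat = w.reshape(-1, group_size)
    lo = flat.min(axis=1, keepdims=True)
    hi = flat.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-12) / (2**bits - 1)
    q = np.round((flat - lo) / scale)
    return (q * scale + lo).reshape(w.shape) - w

def output_noise_amplification(w, gamma=None, p=32, eps=1e-12, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((w.shape[1], p))
    if gamma is not None:
        x *= gamma[:, None]          # scale each input channel by gamma
    dw = rtn_residual(w)
    return np.linalg.norm(dw @ x) / (np.linalg.norm(w @ x) + eps)

rng = np.random.default_rng(1)
w = rng.standard_normal((256, 512))
gamma = np.abs(rng.standard_normal(512)) + 0.5   # hypothetical norm scales
print(round(float(output_noise_amplification(w, gamma)), 3))
```
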

Appendix D. Reproduction

All code is available at [anonymized]. Experiments use: