Symmetric vs Asymmetric Linear Quantization: A Practical Deep Dive
Following our quantization research plan, this post provides a conceptual deep dive into symmetric and asymmetric linear quantization.
While we haven’t conducted large-scale experiments yet, this article aims to:
- clarify the underlying math and intuition,
- evaluate the trade-offs of each method,
- and outline our planned evaluation strategy.
We will follow up with a dedicated “results” post once our experiments are complete.
1. A Quick Refresher on Number Representation
Before diving into quantization schemes, it’s helpful to review how numbers are represented in hardware.
FP32 (32-bit floating point)
- 1 sign bit + 8 exponent bits + 23 mantissa bits
- Range: roughly ±3.4 × 10^38
- Precision: about 7 decimal digits
- Memory: 4 bytes per value
INT8 (8-bit signed integer)
- Range: -128 to 127
- Memory: 1 byte per value
- 4× memory reduction compared to FP32
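To make the 4× figure concrete, here is a small sketch in plain Python (no PyTorch required). The 512 × 512 shape is an arbitrary illustration and `tensor_bytes` is a hypothetical helper, not a library function. It counts raw value storage only and ignores the few extra bytes needed for a scale or zero-point.

```python
import struct

# Standard (platform-independent) sizes of one FP32 and one INT8 value, in bytes.
FP32_BYTES = struct.calcsize("<f")  # 4
INT8_BYTES = struct.calcsize("<b")  # 1


def tensor_bytes(shape, bytes_per_value):
    """Raw storage for a dense tensor of the given shape (hypothetical helper)."""
    n_values = 1
    for dim in shape:
        n_values *= dim
    return n_values * bytes_per_value


shape = (512, 512)  # illustrative shape, chosen arbitrarily
fp32_size = tensor_bytes(shape, FP32_BYTES)
int8_size = tensor_bytes(shape, INT8_BYTES)
print(f"FP32: {fp32_size} bytes, INT8: {int8_size} bytes, "
      f"ratio: {fp32_size // int8_size}x")
```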
The core question in quantization, therefore, is:
How do we map continuous FP32 values to a discrete set of INT8 values while minimizing the information loss relevant to our task?
2. Symmetric Linear Quantization
2.1. Basic Idea
In symmetric quantization, we force zero in FP32 to map exactly to zero in INT8. This is achieved using a single scale factor applied symmetrically around zero.
Let x be a tensor (weights or activations), and let b be the bit width (e.g., b = 8 for INT8).
The effective positive integer range is:
n_levels = 2^(b - 1) - 1
For INT8, this gives n_levels = 127.
A simple conceptual implementation:
# Quantization (conceptual example)
import torch

def quantize_symmetric(x, bits=8):
    n_levels = 2 ** (bits - 1) - 1  # 127 for INT8
    max_abs = torch.max(torch.abs(x))
    # Guard against an all-zero tensor, which would otherwise give scale = 0
    scale = torch.clamp(max_abs / n_levels, min=1e-12)  # Δ
    # Map FP32 -> integer
    x_int = torch.round(x / scale)
    # Clamp to the representable INT8 range [-128, 127]
    x_int = torch.clamp(x_int, -n_levels - 1, n_levels)
    return x_int, scale

# Dequantization
def dequantize_symmetric(x_int, scale):
    return x_int * scale
Here, scale (often written as Δ) tells us how much real-valued change corresponds to a step of 1 in the integer domain.
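One consequence of this definition is that, for values inside the clamping range, the round-trip error is at most Δ / 2. The following is a minimal, dependency-free sketch of that property. The helper name `quantize_symmetric_list` and the sample values are made up for illustration, and the code clamps to the restricted range [-127, 127] as a simplifying assumption.

```python
def quantize_symmetric_list(values, bits=8):
    """Symmetric quantization of a list of floats (plain-Python sketch)."""
    n_levels = 2 ** (bits - 1) - 1           # 127 for INT8
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:                        # all-zero input: any positive scale works
        return [0 for _ in values], 1.0
    scale = max_abs / n_levels
    # Python's round() rounds ties to even, the same convention as torch.round.
    codes = [max(-n_levels, min(n_levels, round(v / scale))) for v in values]
    return codes, scale


values = [-1.0, -0.25, 0.0, 0.003, 0.6, 0.75]  # made-up sample values
codes, scale = quantize_symmetric_list(values)
restored = [c * scale for c in codes]
max_error = max(abs(v - r) for v, r in zip(values, restored))

print("codes:", codes)
print(f"scale = {scale:.6f}, max round-trip error = {max_error:.6f}, "
      f"bound = {scale / 2:.6f}")
```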
2.2. Toy Example (Weights)
Let’s consider a hypothetical weight tensor of shape [512, 512] with:
min = -0.15
max = 0.18
For INT8 (symmetric):
n_levels = 127
Δ = max_abs / n_levels = 0.18 / 127 ≈ 0.00142
Interpretation:
- The smallest non-zero representable step is about 0.00142.
- Any value between roughly -0.00142 / 2 and +0.00142 / 2 will round to 0.
- Larger values are rounded to the nearest integer multiple of Δ.
This is a constructed example, not a measured result, designed to build intuition for the scale factor.
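The arithmetic for this toy tensor can be checked with a few lines of plain Python. The variable names are our own. Only the min/max values come from the example, so no actual weights are involved.

```python
n_levels = 127                       # symmetric INT8
w_min, w_max = -0.15, 0.18           # hypothetical weight range from the example

delta = max(abs(w_min), abs(w_max)) / n_levels
lowest_code = round(w_min / delta)   # code for the most negative weight
highest_code = round(w_max / delta)  # code for the most positive weight
codes_in_use = highest_code - lowest_code + 1

print(f"delta           = {delta:.5f}")
print(f"rounds to zero  = ({-delta / 2:.5f}, {delta / 2:.5f})")
print(f"code range used = [{lowest_code}, {highest_code}] "
      f"({codes_in_use} of 255 codes)")
```

Because |min| is smaller than max here, the negative side stops at code -106, so even this mildly lopsided weight range leaves some integer codes unused.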
2.3. When Symmetric Quantization Is a Good Fit
Symmetric quantization is generally a good fit when:
- The data distribution is centered around zero (e.g., outputs following batch or layer normalization).
- The distribution is roughly symmetric (e.g., well-initialized weights in many layers).
- Implementation simplicity is a priority — no need to store or handle a separate zero-point.
Typical uses in practice:
- Quantizing weights of convolutional / linear layers.
- Quantizing pre-activation values that have been normalized.
3. Why Activations Are Tricky (The ReLU Problem)
This clean symmetry assumption, however, breaks down significantly for ReLU activations and other non-negative distributions.
For example, imagine a ReLU output with the following range:
min = 0.0
max = 3.24
If we apply symmetric INT8 quantization:
- We are forced to reserve half of the integer range (-128 to -1) for negative values that never occur, because the data is strictly non-negative.
- In effect, only 128 of the 256 available quantization levels (codes 0 to 127) carry useful information.
This leads to two problems:
- Wasted Range & Poor Resolution: The effective resolution in the [0, 3.24] region is unnecessarily coarse, as half the quantization 'bins' are unused.
- Increased Noise: The coarser step increases quantization noise, which can reduce model accuracy.
This problem is the primary motivation for asymmetric quantization.
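The wasted range can be counted directly. The sketch below uses a dense grid of synthetic non-negative values as a stand-in for real ReLU outputs, so it is an illustration and not a measurement. It applies the symmetric scale and counts how many distinct integer codes actually occur.

```python
n_levels = 127                      # symmetric INT8 scale uses 127 positive steps
x_max = 3.24                        # hypothetical ReLU range [0, 3.24]
scale = x_max / n_levels

# Dense grid of non-negative values standing in for real ReLU outputs.
samples = [x_max * i / 9999 for i in range(10_000)]
codes = {round(v / scale) for v in samples}

print(f"distinct codes used: {len(codes)} (from {min(codes)} to {max(codes)})")
print(f"share of the 256 INT8 codes: {len(codes) / 256:.0%}")
```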
4. Asymmetric Linear Quantization
4.1. Basic Idea
Asymmetric quantization addresses this by introducing a zero-point (or offset), which aligns the integer range with the actual min/max of the data, rather than forcing it to be centered at 0.
We define:
- a scale Δ (float),
- a zero-point z (an integer offset).
Conceptually, the mapping is:
q = round(x / Δ) + z
x̂ = (q - z) * Δ
In code:
import torch

def quantize_asymmetric(x, bits=8):
    n_levels = 2 ** bits  # 256 codes for 8 bits, stored as 0..255
    # Extend the range to include 0.0 so the zero-point stays within [0, n_levels - 1]
    x_min = torch.clamp(torch.min(x), max=0.0)
    x_max = torch.clamp(torch.max(x), min=0.0)
    # Degenerate case: after the extension this only happens for an all-zero tensor
    if x_max == x_min:
        return torch.zeros_like(x, dtype=torch.int32), 1.0, 0
    # Compute scale and zero-point
    scale = (x_max - x_min) / (n_levels - 1)
    zero_point = int(torch.round(-x_min / scale))
    # Quantize
    x_int = torch.round(x / scale) + zero_point
    x_int = torch.clamp(x_int, 0, n_levels - 1).to(torch.int32)
    return x_int, scale, zero_point

def dequantize_asymmetric(x_int, scale, zero_point):
    return (x_int - zero_point) * scale
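As a quick sanity check of the mapping, here is a dependency-free round trip on a handful of made-up values with a skewed range. The helper names `quantize_affine_list` and `dequantize_affine_list` are hypothetical. Stretching the range to include 0.0 is an assumption of this sketch (it keeps the zero-point inside the code range) and is not something the definition above requires.

```python
def quantize_affine_list(values, bits=8):
    """Asymmetric (affine) quantization of floats to codes in [0, 2^bits - 1]."""
    q_max = 2 ** bits - 1                      # 255 for 8 bits
    # Assumption: stretch the range to include 0.0 so zero_point stays in [0, q_max].
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    if hi == lo:                               # only possible when every value is 0.0
        return [0 for _ in values], 1.0, 0
    scale = (hi - lo) / q_max
    zero_point = round(-lo / scale)
    codes = [max(0, min(q_max, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize_affine_list(codes, scale, zero_point):
    """Map integer codes back to approximate real values."""
    return [(q - zero_point) * scale for q in codes]


values = [-1.0, -0.3, 0.0, 0.9, 2.2, 3.0]      # made-up, skewed sample values
codes, scale, zero_point = quantize_affine_list(values)
restored = dequantize_affine_list(codes, scale, zero_point)
max_error = max(abs(v - r) for v, r in zip(values, restored))

print("codes:", codes, "zero_point:", zero_point)
print(f"scale = {scale:.5f}, max round-trip error = {max_error:.5f}")
```

In this sketch the real value 0.0 lands exactly on the zero-point code, so it is reconstructed without error. That property matters for operations such as zero padding.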
4.2. Toy Example (ReLU Output)
Let’s revisit our hypothetical ReLU output:
min = 0.0
max = 3.24
For INT8 (asymmetric):
n_levels = 256
Δ = (3.24 - 0.0) / (256 - 1) ≈ 0.0127
z = 0 # because min is already 0
Key differences from the symmetric approach:
- All 256 integer levels are now utilized to represent the [0, 3.24] range.
- The step size is roughly halved (about 0.0127 vs about 0.0255 for symmetric), effectively doubling the resolution in our region of interest.
Again, this is a constructed example, but it clearly illustrates the appeal of asymmetric quantization for non-negative activations.
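To put a rough number on the difference, the sketch below compares the root-mean-square round-trip error of the two step sizes on a uniform grid over [0, 3.24]. The grid is synthetic and `rms_error` is a hypothetical helper, so this illustrates the arithmetic only and says nothing about real activations.

```python
import math

x_max = 3.24                                   # hypothetical ReLU range [0, 3.24]
samples = [x_max * i / 10_000 for i in range(10_001)]


def rms_error(values, scale):
    """RMS round-trip error when values are rounded to multiples of `scale`."""
    total = sum((v - round(v / scale) * scale) ** 2 for v in values)
    return math.sqrt(total / len(values))


scale_sym = x_max / 127                        # symmetric: only codes 0..127 reachable
scale_asym = x_max / 255                       # asymmetric: codes 0..255, zero-point 0

rms_sym = rms_error(samples, scale_sym)
rms_asym = rms_error(samples, scale_asym)

print(f"step size ratio: {scale_sym / scale_asym:.3f}")
print(f"RMS error, symmetric:  {rms_sym:.5f}")
print(f"RMS error, asymmetric: {rms_asym:.5f}")
```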
5. Our Evaluation Plan
While we haven’t run large-scale benchmarks yet, our evaluation plan (based on our research roadmap) is as follows:
- Start with Symmetric INT8: apply to weights and select activations (e.g., post-normalization). Benchmark on a standard vision model (e.g., ResNet-50) and a smaller model.
- Introduce Asymmetric Quantization for Clearly Asymmetric Activations: post-ReLU activations, softmax outputs in attention blocks, and other bounded outputs in [0, 1] (e.g., probabilities).
- Compare Metrics: Top-1 / Top-5 accuracy vs FP32, inference latency (ms per sample), model size and memory footprint, and per-layer sensitivity.
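For the accuracy comparison, a top-k metric can be as simple as the sketch below. The helper `top_k_accuracy` is hypothetical, and the score tables are placeholder values invented to exercise the function. They are not measurements from any model.

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


# Placeholder scores for 4 samples and 3 classes (invented, NOT experimental data).
fp32_scores = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [0.1, 0.4, 2.2], [1.1, 1.0, 0.2]]
int8_scores = [[1.9, 0.6, 0.1], [0.2, 1.4, 0.4], [0.1, 0.5, 2.1], [0.9, 1.0, 0.2]]
labels = [0, 1, 2, 0]

fp32_top1 = top_k_accuracy(fp32_scores, labels, k=1)
int8_top1 = top_k_accuracy(int8_scores, labels, k=1)
int8_top2 = top_k_accuracy(int8_scores, labels, k=2)

print(f"Top-1 drop vs FP32 on the placeholder data: {fp32_top1 - int8_top1:.2f}")
```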
We hypothesize the following outcomes:
- Symmetric quantization will be sufficient (and simpler) for weights and normalized signals.
- Asymmetric quantization will provide a distinct advantage for highly skewed, non-negative activations.
- A mixed-scheme strategy (symmetric for weights, asymmetric for key activations) will likely offer the best balance of accuracy and efficiency.
Once these experiments are complete, we’ll replace this section with actual numbers, plots, and per-layer analysis.
6. Preliminary Practical Recommendations
Even without our own experimental results, theory and established literature suggest some reasonable default choices.
6.1. When to Prefer Symmetric Quantization
Symmetric quantization is generally preferred for:
- Weights of conv/linear layers, especially after standard initialization.
- Outputs of batch/layer normalization, which are often close to zero-mean.
- Cases where implementation simplicity is important and you want to avoid storing a separate zero-point.
6.2. When to Prefer Asymmetric Quantization
Asymmetric quantization is generally preferred for:
- ReLU outputs and other non-negative activations.
- Softmax outputs (e.g., in attention).
- Any activation or signal where the distribution is strongly skewed, or the dynamic range is narrow but not centered at zero.
6.3. Likely Mixed Strategy
A practical default mixed strategy, which we will use as our baseline, is:
- Weights: symmetric INT8 (per-tensor).
- Pre-activation / normalized activations: symmetric INT8.
- Post-ReLU activations, softmax, probabilities: asymmetric INT8.
- Very sensitive layers (e.g., the final classifier) are optionally kept in FP16/FP32 for maximum precision.
This approach aims for simplicity and efficiency for the bulk of the model, with higher precision (via better range utilization) only where it demonstrably matters (i.e., for skewed activations).
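One way to write this baseline down is as a small lookup table. Everything in the sketch below is hypothetical: the `QuantPolicy` class, the tensor-kind labels, and `choose_policy` are names invented for illustration, and a real pipeline would derive the tensor kind from the model graph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuantPolicy:
    scheme: str   # "symmetric", "asymmetric", or "none"
    dtype: str    # "int8" or "fp16"


# Hypothetical labels for the tensor categories listed in the baseline strategy.
BASELINE_POLICY = {
    "weight": QuantPolicy("symmetric", "int8"),
    "normalized_activation": QuantPolicy("symmetric", "int8"),
    "post_relu_activation": QuantPolicy("asymmetric", "int8"),
    "softmax_output": QuantPolicy("asymmetric", "int8"),
}
KEEP_FLOAT = QuantPolicy("none", "fp16")


def choose_policy(tensor_kind, sensitive=False):
    """Return the baseline policy, keeping sensitive layers in floating point."""
    if sensitive:
        return KEEP_FLOAT
    return BASELINE_POLICY[tensor_kind]


print(choose_policy("weight"))
print(choose_policy("post_relu_activation"))
print(choose_policy("weight", sensitive=True))
```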
7. Key Implementation Questions and Future Work
Key questions we plan to investigate experimentally include:
- Calibration Strategy — how many samples are needed for stable min/max estimates? How does percentile-based clipping compare to naive min-max for robustness to outliers?
- Granularity: Per-tensor vs. Per-channel — per-tensor is simpler and faster; per-channel quantization may be crucial for accuracy in layers where weight statistics vary significantly across channels.
- Method: PTQ vs. QAT — start with Post-Training Quantization (PTQ) for its speed and simplicity. If the accuracy drop is unacceptable, escalate to Quantization-Aware Training (QAT) to recover precision.
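The first two questions can be illustrated on synthetic numbers before any real experiment. The sketch below is plain Python with invented data and hypothetical helpers (`symmetric_scale`, `percentile_abs`). It shows how a single outlier inflates a min/max scale, and how per-channel scales differ from one per-tensor scale when channel magnitudes vary.

```python
def symmetric_scale(values, n_levels=127):
    """Symmetric INT8 scale from the largest absolute value."""
    return max(abs(v) for v in values) / n_levels


def percentile_abs(values, pct):
    """Approximate nearest-rank percentile of |values|, with pct in (0, 100]."""
    ordered = sorted(abs(v) for v in values)
    rank = max(1, min(len(ordered), round(pct / 100 * len(ordered))))
    return ordered[rank - 1]


# Calibration: 999 invented values in [-0.99, 0.99] plus one large outlier.
calib = [(-1) ** i * (i % 100) / 100 for i in range(999)] + [50.0]
minmax_scale = symmetric_scale(calib)                 # dominated by the outlier
clipped_scale = percentile_abs(calib, 99.9) / 127     # outlier saturates instead

# How many ordinary values collapse to code 0 under each scale?
zeros_minmax = sum(1 for v in calib[:-1] if round(v / minmax_scale) == 0)
zeros_clipped = sum(1 for v in calib[:-1] if round(v / clipped_scale) == 0)
print(f"min/max scale {minmax_scale:.4f}: {zeros_minmax} values round to 0")
print(f"clipped scale {clipped_scale:.4f}: {zeros_clipped} values round to 0")

# Granularity: two invented weight channels with very different magnitudes.
channels = [[0.01, -0.02, 0.015], [1.5, -2.0, 0.7]]
per_tensor_scale = symmetric_scale([w for ch in channels for w in ch])
per_channel_scales = [symmetric_scale(ch) for ch in channels]
print(f"per-tensor scale:   {per_tensor_scale:.6f}")
print("per-channel scales:", [f"{s:.6f}" for s in per_channel_scales])
```

The trade-off with clipping is that the outlier itself is no longer representable and saturates at the largest code. Whether that is acceptable is one of the things the planned experiments would need to show.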
All concrete choices, hyperparameters, and benchmark results will be documented in our follow-up post.
If you’re interested in the broader roadmap and motivation behind this work, you can read our Neural Network Quantization Research Plan.