Symmetric vs Asymmetric Linear Quantization: A Practical Deep Dive

Following our quantization research plan, this post provides a conceptual deep dive into symmetric and asymmetric linear quantization.

While we haven’t conducted large-scale experiments yet, this article aims to:

We will follow up with a dedicated “results” post once our experiments are complete.


1. A Quick Refresher on Number Representation

Before diving into quantization schemes, it’s helpful to review how numbers are represented in hardware.

FP32 (32-bit floating point)

INT8 (8-bit signed integer)

The core question in quantization, therefore, is:

How do we map continuous FP32 values to a discrete set of INT8 values while minimizing the information loss relevant to our task?
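To make the size of that discrete target set concrete, here is a minimal sketch of the integer grid available at a given bit width. It assumes two's-complement signed integers, and `int_range` is a hypothetical helper name used only for illustration.

```python
def int_range(bits=8, signed=True):
    """Return (lowest, highest) integer representable with `bits` bits."""
    if signed:
        # Two's complement: one more negative value than positive
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


signed_lo, signed_hi = int_range(8, signed=True)        # -128, 127
unsigned_lo, unsigned_hi = int_range(8, signed=False)   # 0, 255
n_codes = signed_hi - signed_lo + 1                     # 256 distinct values
```

Whatever the scheme, every FP32 value in a tensor has to land on one of these 256 codes.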


2. Symmetric Linear Quantization

2.1. Basic Idea

In symmetric quantization, we force zero in FP32 to map exactly to zero in INT8. This is achieved using a single scale factor applied symmetrically around zero.

Let x be a tensor (weights or activations), and let b be the bit width (e.g., b = 8 for INT8).

The largest positive integer level, which we call n_levels, is:

n_levels = 2^(b - 1) - 1

For INT8, this gives n_levels = 127.

A simple conceptual implementation:

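# Quantization (conceptual example)
import torch

def quantize_symmetric(x, bits=8):
    n_levels = 2 ** (bits - 1) - 1  # 127 for INT8
    max_abs = torch.max(torch.abs(x))
    # Guard against an all-zero tensor, which would otherwise give scale = 0
    scale = torch.clamp(max_abs, min=1e-12) / n_levels  # Δ

    # Map FP32 -> integer (torch.round rounds halves to the nearest even value)
    x_int = torch.round(x / scale)

    # Clamp to the representable range, [-128, 127] for INT8
    x_int = torch.clamp(x_int, -n_levels - 1, n_levels)
    return x_int, scale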

# Dequantization
def dequantize_symmetric(x_int, scale):
    return x_int * scale

Here, scale (often written as Δ) tells us how much real-valued change corresponds to a step of 1 in the integer domain.
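One consequence of this definition is that, for values inside the clipping range, the round-trip error is at most Δ / 2. The sketch below checks that on a handful of made-up values. It uses plain Python lists instead of torch tensors to stay dependency-free, and `quantize_symmetric_list` is a hypothetical name.

```python
def quantize_symmetric_list(values, bits=8):
    """Symmetric quantization of a list of floats; returns (ints, scale)."""
    n_levels = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        # All-zero input: any positive scale works
        return [0 for _ in values], 1.0
    scale = max_abs / n_levels
    q = [max(-n_levels - 1, min(n_levels, round(v / scale))) for v in values]
    return q, scale


values = [-1.0, -0.25, 0.0, 0.4, 0.75]   # illustrative values only
q, scale = quantize_symmetric_list(values)
recon = [qi * scale for qi in q]
max_err = max(abs(v - r) for v, r in zip(values, recon))
```

Note that 0.0 maps to the integer 0 and back to exactly 0.0, which is the defining property of the symmetric scheme.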

2.2. Toy Example (Weights)

Let’s consider a hypothetical weight tensor of shape [512, 512] whose largest absolute value is max_abs = 0.18.

For INT8 (symmetric):

n_levels = 127
Δ = max_abs / n_levels = 0.18 / 127 ≈ 0.00142

Interpretation:

This is a constructed example, not a measured result, designed to build intuition for the scale factor.
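To see what this step size means for individual weights, the arithmetic can be carried through for two made-up values. The weights 0.05 and 0.0005 below are assumptions chosen for illustration, not values from a real model.

```python
max_abs = 0.18
n_levels = 127
delta = max_abs / n_levels        # about 0.00142

w = 0.05                          # hypothetical mid-sized weight
q = round(w / delta)              # integer code
w_hat = q * delta                 # reconstructed value, about 0.0496

tiny = 0.0005                     # hypothetical weight smaller than delta / 2
q_tiny = round(tiny / delta)      # collapses to the zero code
```

Any weight with magnitude below Δ / 2 (roughly 0.0007 here) is rounded to zero, which is one way a coarse scale can remove small weights entirely.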

2.3. When Symmetric Quantization Is a Good Fit

Symmetric quantization is generally a good fit when:

  1. The data distribution is centered around zero (e.g., outputs following batch or layer normalization).
  2. The distribution is roughly symmetric (e.g., well-initialized weights in many layers).
  3. Implementation simplicity is a priority — no need to store or handle a separate zero-point.

Typical uses in practice:


3. Why Activations Are Tricky (The ReLU Problem)

This clean symmetry assumption, however, breaks down significantly for ReLU activations and other non-negative distributions.

For example, imagine a ReLU output whose values span the range [0.0, 3.24].

If we apply symmetric INT8 quantization:

This leads to two problems:

This problem is the primary motivation for asymmetric quantization.
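The cost of the symmetry assumption can be quantified with a short calculation. This sketch assumes the [0.0, 3.24] ReLU range used in the toy example of Section 4.2 and counts how many of the 256 INT8 codes a symmetric scheme can ever produce for it.

```python
x_min, x_max = 0.0, 3.24          # assumed non-negative activation range
n_levels = 127

# Symmetric scale is driven by the largest magnitude
delta_sym = max(abs(x_min), abs(x_max)) / n_levels   # about 0.0255

q_min = round(x_min / delta_sym)  # lowest code actually produced
q_max = round(x_max / delta_sym)  # highest code actually produced
codes_used = q_max - q_min + 1
codes_total = 2 ** 8
unused_fraction = 1 - codes_used / codes_total
```

Under these assumptions the negative half of the integer grid is never used, so half of the available codes carry no information.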


4. Asymmetric Linear Quantization

4.1. Basic Idea

Asymmetric quantization addresses this by introducing a zero-point (or offset), which aligns the integer range with the actual min/max of the data, rather than forcing it to be centered at 0.

We define:

Δ = (x_max - x_min) / (2^b - 1)   (scale)
z = round(-x_min / Δ)             (zero-point)

Conceptually, the mapping is:

q = round(x / Δ) + z
x̂ = (q - z) * Δ

In code:

def quantize_asymmetric(x, bits=8):
    n_levels = 2 ** bits        # 256 codes for 8 bits
    x_min, x_max = torch.min(x), torch.max(x)

    # Degenerate case (all values are the same): the range is empty, so
    # return placeholder parameters; note that this dequantizes to all zeros
    if x_max == x_min:
        return torch.zeros_like(x, dtype=torch.int32), 1.0, 0

    # Compute scale and zero-point
    scale = (x_max - x_min) / (n_levels - 1)
    zero_point = torch.round(-x_min / scale)

    # Quantize onto the unsigned grid [0, n_levels - 1], i.e. [0, 255] for 8 bits
    x_int = torch.round(x / scale) + zero_point
    x_int = torch.clamp(x_int, 0, n_levels - 1)

    return x_int, scale, zero_point

def dequantize_asymmetric(x_int, scale, zero_point):
    return (x_int - zero_point) * scale
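To see the zero-point at work, here is a dependency-free round trip on a range that is not centered at zero. The function names and the sample values are hypothetical, and plain Python lists stand in for tensors.

```python
def quantize_asymmetric_list(values, bits=8):
    """Asymmetric quantization of a list of floats onto [0, 2**bits - 1]."""
    q_max = 2 ** bits - 1
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # Degenerate range: placeholder parameters
        return [0 for _ in values], 1.0, 0
    scale = (x_max - x_min) / q_max
    zero_point = round(-x_min / scale)
    q = [max(0, min(q_max, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize_asymmetric_list(q, scale, zero_point):
    return [(qi - zero_point) * scale for qi in q]


values = [-0.5, 0.0, 0.7, 2.0]    # illustrative values only
q, scale, zero_point = quantize_asymmetric_list(values)
recon = dequantize_asymmetric_list(q, scale, zero_point)
max_err = max(abs(v - r) for v, r in zip(values, recon))
```

Here the minimum lands on code 0, the maximum on code 255, and the real value 0.0 lands exactly on the zero-point, so it is still reconstructed without error.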

4.2. Toy Example (ReLU Output)

Let’s revisit our hypothetical ReLU output, with min = 0.0 and max = 3.24.

For INT8 (asymmetric):

n_levels = 256
Δ = (3.24 - 0.0) / (256 - 1) ≈ 0.0127
z = 0   # because min is exactly 0

Key differences from the symmetric approach:

Again, this is a constructed example, but it clearly illustrates the appeal of asymmetric quantization for non-negative activations.
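The two schemes can be compared side by side on this range. In the sketch below, the activation value 1.0 is a made-up probe, and the symmetric scale is taken as max / 127, following Section 2.

```python
x_min, x_max = 0.0, 3.24

# Asymmetric: all 256 codes cover [x_min, x_max]
delta_asym = (x_max - x_min) / (256 - 1)   # about 0.0127
z = round(-x_min / delta_asym)             # zero-point

# Symmetric: only the 127 positive steps cover [0, x_max]
delta_sym = x_max / 127                    # about 0.0255

x = 1.0                                    # hypothetical activation value
q_asym = round(x / delta_asym) + z
q_sym = round(x / delta_sym)
err_asym = abs(x - (q_asym - z) * delta_asym)
err_sym = abs(x - q_sym * delta_sym)
step_ratio = delta_sym / delta_asym        # 255 / 127, about 2
```

For this range the asymmetric step is about half the symmetric one, so the worst-case rounding error is roughly halved. The error for any single value, such as the probe above, depends on where it falls between grid points.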


5. Our Evaluation Plan

While we haven’t run large-scale benchmarks yet, our evaluation plan (based on our research roadmap) is as follows:

  1. Start with Symmetric INT8: apply it to weights and selected activations (e.g., post-normalization). Benchmark on a standard vision model (e.g., ResNet-50) and a smaller model.
  2. Introduce Asymmetric Quantization for Clearly Asymmetric Activations: post-ReLU activations, softmax outputs in attention blocks, and other bounded outputs in [0, 1] (e.g., probabilities).
  3. Compare Metrics: Top-1 / Top-5 accuracy vs FP32, inference latency (ms per sample), model size and memory footprint, and per-layer sensitivity.
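Of these metrics, storage size is the one that can be estimated before running anything. The sketch below does so for a [512, 512] weight tensor like the one in the toy example, assuming a per-tensor scale stored as a single FP32 value. `storage_bytes` is a hypothetical helper, and real deployments may add framework-specific overhead.

```python
def storage_bytes(num_values, bits_per_value, extra_fp32_params=0):
    """Bytes needed for the values plus any FP32 quantization parameters."""
    return num_values * bits_per_value // 8 + extra_fp32_params * 4


num_weights = 512 * 512

fp32_bytes = storage_bytes(num_weights, 32)
# Symmetric INT8, per-tensor: one FP32 scale, no zero-point
int8_sym_bytes = storage_bytes(num_weights, 8, extra_fp32_params=1)

compression = fp32_bytes / int8_sym_bytes   # just under 4x
```

Accuracy and latency, by contrast, have to be measured, which is what the planned experiments are for.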

We hypothesize the following outcomes:

Once these experiments are complete, we’ll replace this section with actual numbers, plots, and per-layer analysis.


6. Preliminary Practical Recommendations

Even without our own experimental results, theory and established literature suggest some reasonable default choices.

6.1. When to Prefer Symmetric Quantization

Symmetric quantization is generally preferred for:

6.2. When to Prefer Asymmetric Quantization

Asymmetric quantization is generally preferred for:

6.3. Likely Mixed Strategy

A practical default mixed strategy, which we will use as our baseline, is:

  1. Weights: symmetric INT8 (per-tensor).
  2. Pre-activation / normalized activations: symmetric INT8.
  3. Post-ReLU activations, softmax, probabilities: asymmetric INT8.
  4. Very sensitive layers (e.g., the final classifier) are optionally kept in FP16/FP32 for maximum precision.

This approach aims for simplicity and efficiency across the bulk of the model, with higher effective precision (via better range utilization) only where we expect it to matter (i.e., for skewed activations).
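As a rough illustration, the baseline above could be written down as a lookup table. The tensor-kind labels and the `choose_scheme` helper are hypothetical names invented for this sketch; they do not correspond to any framework's configuration format.

```python
# Baseline policy: tensor kind -> (quantization scheme, storage type)
DEFAULT_SCHEME = {
    "weight": ("symmetric", "int8"),
    "normalized_activation": ("symmetric", "int8"),
    "post_relu_activation": ("asymmetric", "int8"),
    "softmax_output": ("asymmetric", "int8"),
    # Optional exemption for very sensitive layers
    "final_classifier": ("none", "fp32"),
}


def choose_scheme(tensor_kind):
    """Look up the baseline scheme, failing loudly on unknown kinds."""
    if tensor_kind not in DEFAULT_SCHEME:
        raise ValueError(f"no baseline scheme for tensor kind: {tensor_kind!r}")
    return DEFAULT_SCHEME[tensor_kind]


relu_scheme = choose_scheme("post_relu_activation")
```

Keeping the policy in one table would make it easy to change a single entry when per-layer sensitivity results come in.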


7. Key Implementation Questions and Future Work

Key questions we plan to investigate experimentally include:

  1. Calibration Strategy: how many samples are needed for stable min/max estimates? How does percentile-based clipping compare to naive min-max for robustness to outliers?
  2. Granularity (Per-tensor vs. Per-channel): per-tensor is simpler and faster; per-channel quantization may be crucial for accuracy in layers where weight statistics vary significantly across channels.
  3. Method (PTQ vs. QAT): start with Post-Training Quantization (PTQ) for its speed and simplicity. If the accuracy drop is unacceptable, escalate to Quantization-Aware Training (QAT) to recover precision.
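The first two questions can be illustrated on synthetic data before any real calibration run. The sketch below assumes a nearest-rank percentile with 1% / 99% cut-offs and uses made-up samples with a single outlier. All function names are hypothetical, and the cut-offs are placeholders, not recommendations.

```python
import math


def calibrate_minmax(samples):
    return min(samples), max(samples)


def calibrate_percentile(samples, lower=1.0, upper=99.0):
    """Clip the calibration range to nearest-rank percentiles."""
    ordered = sorted(samples)
    n = len(ordered)

    def nearest_rank(p):
        k = max(1, math.ceil(p * n / 100))
        return ordered[k - 1]

    return nearest_rank(lower), nearest_rank(upper)


def per_channel_scales(rows, bits=8):
    """One symmetric scale per row (channel) instead of one per tensor."""
    n_levels = 2 ** (bits - 1) - 1
    return [max(abs(v) for v in row) / n_levels for row in rows]


# 99 well-behaved samples plus one outlier
samples = [i / 100 for i in range(99)] + [50.0]
minmax_range = calibrate_minmax(samples)
clipped_range = calibrate_percentile(samples)

# Two channels with very different magnitudes
rows = [[0.01, -0.02], [1.0, -0.5]]
channel_scales = per_channel_scales(rows)
tensor_scale = max(abs(v) for row in rows for v in row) / 127
```

With naive min-max, the single outlier stretches the range to 50.0, while percentile clipping keeps it at 0.98 at the cost of saturating that one sample. Likewise, the per-channel scale for the small-magnitude row is 50 times finer than the shared per-tensor scale.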

All concrete choices, hyperparameters, and benchmark results will be documented in our follow-up post.


If you’re interested in the broader roadmap and motivation behind this work, you can read our Neural Network Quantization Research Plan.
