In our previous post, we outlined our research plan for neural network quantization techniques. Today, we're diving deep into two fundamental quantization methods: Symmetric and Asymmetric Linear Quantization.
Why Quantization Matters
Before exploring the technical details, let's consider why quantization has become crucial for efficient AI deployment:
- Reduced Memory Usage: Converting 32-bit floating-point weights to 8-bit or 4-bit integers can shrink model size by 4-8x
- Faster Inference: Low-precision integer arithmetic is typically cheaper and faster than floating-point arithmetic on most hardware
- Power Efficiency: Lower-precision computations consume less energy, critical for edge devices with battery constraints
- Hardware Compatibility: Many specialized AI accelerators are optimized for integer computations
The Basics: Bits, Bytes, and Number Representation
To fully understand quantization, we need to start with how computers represent numbers:
- Bit: The most fundamental unit of information, representing either 0 or 1
- Byte: 8 bits, capable of representing 256 different values (0-255)
- Integer: With n bits, can be unsigned (0 to 2^n - 1) or signed (-2^(n-1) to 2^(n-1) - 1)
- Floating-Point: Represents real numbers with a sign bit, exponent, and mantissa (or significand)
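For readers who like to verify these ranges directly, here is a tiny snippet (our use of NumPy here is purely illustrative; nothing later in the post depends on it):

```python
import numpy as np

# Integer ranges follow the signed/unsigned formulas above (n = 8 here).
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)    # -128 127
print(np.iinfo(np.uint8).min, np.iinfo(np.uint8).max)  # 0 255

# FP32: 1 sign bit, 8 exponent bits, 23 mantissa bits.
print(np.finfo(np.float32).bits // 8, "bytes per FP32 value")  # 4 bytes per FP32 value
```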
In deep learning, we typically work with:
- Weights: Learnable parameters multiplied by inputs during forward passes
- Activations: The outputs of each layer, resulting from computations between inputs and weights
- Biases: Constant values added to outputs to shift the activation function
Most neural networks are trained using 32-bit floating-point (FP32) precision, providing high accuracy but at the cost of substantial memory and computational demands.
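To make that cost concrete, here is a quick back-of-the-envelope estimate of parameter memory at different precisions; the 5-million-parameter count is an arbitrary, illustrative assumption:

```python
# Rough parameter-memory estimate at different precisions.
num_params = 5_000_000                      # illustrative model size (assumption)
bytes_per_value = {"FP32": 4, "INT8": 1, "INT4": 0.5}

for name, nbytes in bytes_per_value.items():
    print(f"{name}: {num_params * nbytes / 2**20:.1f} MiB")
# FP32: 19.1 MiB, INT8: 4.8 MiB, INT4: 2.4 MiB
```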
Quantization Fundamentals
At its core, quantization maps continuous floating-point values to a discrete set of integer values. This process inevitably introduces some approximation error, but when done properly, the impact on model accuracy can be minimal while providing significant efficiency gains.
The key challenge in quantization is determining the optimal mapping between floating-point and integer values that minimizes information loss.
Symmetric Linear Quantization
Symmetric linear quantization is characterized by fixing the zero-point to 0, ensuring that floating-point zero maps exactly to integer zero. This approach is particularly elegant and computationally efficient.
Mathematical Formulation
For a floating-point value x mapped to a quantized value x_Q:
x_int = round(x / Δ)
x_Q = clamp(-N_levels/2, N_levels/2 - 1, x_int) // for signed integers
Where:
- Δ (scale): The step size between adjacent representable values
- N_levels: The number of representable levels (256 for 8-bit)
- clamp(a, b, x): Constrains x to the range [a, b]
For dequantization (converting back to floating-point):
x_float = x_Q * Δ
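These two formulas translate almost directly into code. Here is a minimal NumPy sketch for the 8-bit case; the function names are ours, not taken from any particular library:

```python
import numpy as np

def symmetric_quantize_int8(x):
    """Symmetric 8-bit quantization: the zero-point is fixed at 0."""
    scale = np.max(np.abs(x)) / 127.0                # Δ = max|x| / 127
    x_int = np.round(x / scale)                      # x_int = round(x / Δ)
    x_q = np.clip(x_int, -128, 127).astype(np.int8)  # clamp to the signed 8-bit range
    return x_q, scale

def symmetric_dequantize(x_q, scale):
    """x_float = x_Q * Δ"""
    return x_q.astype(np.float32) * scale
```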
Example Calculation
Let's work through a concrete example with a weight matrix having values in the range [-4.0, 4.0] that we want to quantize to 8-bit signed integers (-128 to 127):
- Calculate the scale factor:
Δ = max(abs(-4.0), abs(4.0)) / 127 = 4.0 / 127 ≈ 0.0315
- Quantize specific values:
- For -4.0: x_int = round(-4.0 / 0.0315) = -127, x_Q = clamp(-128, 127, -127) = -127
- For 0: x_int = round(0 / 0.0315) = 0, x_Q = clamp(-128, 127, 0) = 0
- For 4.0: x_int = round(4.0 / 0.0315) = 127, x_Q = clamp(-128, 127, 127) = 127
- Verify dequantization:
- -127 → -127 × 0.0315 = -4.0005 (original: -4.0)
- 0 → 0 × 0.0315 = 0 (original: 0)
- 127 → 127 × 0.0315 = 4.0005 (original: 4.0)
This demonstrates how symmetric quantization preserves zero exactly and provides good approximations at the extremes of the value range.
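The worked example can be reproduced in a few lines. The results differ very slightly from the hand calculation above because the code keeps the full-precision scale (4.0 / 127 ≈ 0.031496) instead of the rounded 0.0315:

```python
import numpy as np

w = np.array([-4.0, 0.0, 4.0], dtype=np.float32)
scale = np.max(np.abs(w)) / 127.0                   # ≈ 0.031496
w_q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
w_deq = w_q.astype(np.float32) * scale

print(w_q)    # [-127    0  127]
print(w_deq)  # [-4.  0.  4.]  (round-trip error is negligible for these values)
```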
Asymmetric Linear Quantization
While symmetric quantization works well for weight distributions centered around zero, many activation functions (like ReLU) produce outputs that are predominantly positive. For these asymmetric distributions, asymmetric quantization often provides better results by utilizing the full range of integer values.
Mathematical Formulation
For asymmetric quantization:
x_int = round(x / Δ) + z
x_Q = clamp(0, N_levels - 1, x_int) // for unsigned integers
Where:
- z (zero-point): The integer value that represents floating-point zero
- Δ (scale): (max(x) - min(x)) / (N_levels - 1)
For dequantization:
x_float = (x_Q - z) * Δ
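As with the symmetric case, these formulas map directly to a short NumPy sketch for 8-bit unsigned quantization (again, the function names are our own):

```python
import numpy as np

def asymmetric_quantize_uint8(x):
    """Asymmetric 8-bit quantization to the unsigned range [0, 255]."""
    x_min, x_max = float(np.min(x)), float(np.max(x))
    scale = (x_max - x_min) / 255.0                 # Δ = (max - min) / (N_levels - 1)
    zero_point = int(round(-x_min / scale))         # z: the integer that represents 0.0
    x_int = np.round(x / scale) + zero_point        # x_int = round(x / Δ) + z
    x_q = np.clip(x_int, 0, 255).astype(np.uint8)
    return x_q, scale, zero_point

def asymmetric_dequantize(x_q, scale, zero_point):
    """x_float = (x_Q - z) * Δ"""
    return (x_q.astype(np.float32) - zero_point) * scale
```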
Example Calculation
Let's quantize a tensor with values in the range [-2.5, 1.8] to 8-bit unsigned integers (0 to 255):
- Calculate the scale and zero-point:
- Δ = (1.8 - (-2.5)) / (255 - 0) = 4.3 / 255 ≈ 0.0169
- z = round((0 - (-2.5)) / 0.0169) = round(147.93) = 148
- Quantize specific values:
- For -2.5: x_int = round(-2.5 / 0.0169) + 148 = -148 + 148 = 0, x_Q = clamp(0, 255, 0) = 0
- For 0: x_int = round(0 / 0.0169) + 148 = 148, x_Q = clamp(0, 255, 148) = 148
- For 1.8: x_int = round(1.8 / 0.0169) + 148 = 107 + 148 = 255, x_Q = clamp(0, 255, 255) = 255
- Verify dequantization:
- 0 → (0 - 148) × 0.0169 = -2.5012 (original: -2.5)
- 148 → (148 - 148) × 0.0169 = 0 (original: 0)
- 255 → (255 - 148) × 0.0169 = 1.8083 (original: 1.8)
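Reproducing this example in code gives nearly identical numbers; the small differences come from using the full-precision scale (4.3 / 255 ≈ 0.016863) rather than the rounded 0.0169:

```python
import numpy as np

a = np.array([-2.5, 0.0, 1.8], dtype=np.float32)
scale = (1.8 - (-2.5)) / 255.0                      # ≈ 0.016863
zero_point = int(round(2.5 / scale))                # 148
a_q = np.clip(np.round(a / scale) + zero_point, 0, 255).astype(np.uint8)
a_deq = (a_q.astype(np.float32) - zero_point) * scale

print(a_q)    # [  0 148 255]
print(a_deq)  # ≈ [-2.4957  0.      1.8043]
```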
Comparing Symmetric and Asymmetric Quantization
The fundamental difference between these approaches lies in how they handle the zero point:
- Symmetric Quantization: FP32 0 maps directly to INT 0
- Asymmetric Quantization: FP32 0 maps to a computed integer value (like 148 in our example)
When to use each approach:
- Symmetric Quantization: Generally preferable for weights, which typically have distributions centered around zero
- Asymmetric Quantization: Often better for activations, especially after ReLU layers, which have skewed distributions with predominantly positive values
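To see why this guidance holds, here is a small illustrative experiment on a synthetic, ReLU-like (non-negative) tensor; the random data and the mean-absolute-error metric are our own assumptions for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
acts = np.abs(rng.normal(size=10_000)).astype(np.float32)   # ReLU-like: all values >= 0

# Symmetric int8: half of the [-128, 127] range is spent on codes that never occur.
s_scale = np.max(np.abs(acts)) / 127.0
s_deq = np.clip(np.round(acts / s_scale), -128, 127) * s_scale

# Asymmetric uint8: the full [0, 255] range covers the observed values.
a_scale = (acts.max() - acts.min()) / 255.0
a_zp = int(round(-acts.min() / a_scale))
a_deq = (np.clip(np.round(acts / a_scale) + a_zp, 0, 255) - a_zp) * a_scale

print("symmetric  mean abs error:", np.mean(np.abs(acts - s_deq)))
print("asymmetric mean abs error:", np.mean(np.abs(acts - a_deq)))  # typically about half
```

Because the symmetric scheme reserves half of its codes for negative values that never occur, its step size is roughly twice as large, which shows up directly as a larger reconstruction error.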
Implementation Considerations
When implementing quantization in practice, several important factors come into play:
- Per-tensor vs. Per-channel Quantization: Applying different quantization parameters to each channel often yields better results than using the same parameters for the entire tensor (see the sketch after this list)
- Quantization-Aware Training vs. Post-Training Quantization: Incorporating quantization effects during training typically helps reduce accuracy loss compared to quantizing after training
- Hardware Constraints: Some hardware accelerators may only support specific quantization schemes or bit-widths
- Dynamic Range: Carefully analyzing the distribution of values helps in setting optimal quantization parameters to preserve important information
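As a sketch of the per-channel idea from the first point above, the only change from the earlier symmetric function is that the scale is computed per output channel instead of once for the whole tensor (the names and shapes here are illustrative, not a specific framework API):

```python
import numpy as np

def symmetric_quantize_per_channel(w, axis=0):
    """Symmetric int8 quantization with one scale per output channel (along `axis`)."""
    reduce_axes = tuple(i for i in range(w.ndim) if i != axis)
    scales = np.max(np.abs(w), axis=reduce_axes, keepdims=True) / 127.0
    w_q = np.clip(np.round(w / scales), -128, 127).astype(np.int8)
    return w_q, scales

# Two output channels with very different magnitudes.
w = np.array([[0.01, -0.02, 0.03],
              [5.00, -4.00, 3.00]], dtype=np.float32)
w_q, scales = symmetric_quantize_per_channel(w)
print(scales.ravel())   # one scale per row: ≈ [0.000236 0.0394]
```

With a single per-tensor scale (≈ 0.0394 here), the first channel's values could only land on 0 or ±0.0394 and would lose almost all of their detail; per-channel scales keep them distinguishable.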
Next Steps
In our upcoming work, we'll implement both symmetric and asymmetric quantization on MobileViT models and compare their performance across different precisions (FP32 baseline, INT16, INT8, INT4). We'll evaluate these approaches based on:
- Accuracy preservation
- Model size reduction
- Inference speed improvements
- Memory usage
Stay tuned for our implementation details and experimental results!