In our previous post, we outlined our research plan for neural network quantization techniques. Today, we're diving deep into two fundamental quantization methods: Symmetric and Asymmetric Linear Quantization.
Why Quantization Matters
Before exploring the technical details, let's consider why quantization has become crucial for efficient AI deployment:
- Reduced Memory Usage: Converting 32-bit floating-point weights to 8-bit or 4-bit integers can shrink model size by 4-8x
- Faster Inference: Low-precision integer arithmetic is typically cheaper and faster than floating-point arithmetic on most hardware
- Power Efficiency: Lower-precision computations consume less energy, critical for edge devices with battery constraints
- Hardware Compatibility: Many specialized AI accelerators are optimized for integer computations
The Basics: Bits, Bytes, and Number Representation
To fully understand quantization, we need to start with how computers represent numbers:
- Bit: The most fundamental unit of information, representing either 0 or 1
- Byte: 8 bits, capable of representing 256 different values (0-255)
- Integer: With n bits, can be unsigned (0 to 2^n - 1) or signed (-2^(n-1) to 2^(n-1) - 1)
- Floating-Point: Represents real numbers with a sign bit, exponent, and mantissa (or significand)
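For readers who like to verify these ranges directly, here is a tiny snippet (our use of NumPy here is purely illustrative; nothing later in the post depends on it):

```python
import numpy as np

# Integer ranges follow the signed/unsigned formulas above (n = 8 here).
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)    # -128 127
print(np.iinfo(np.uint8).min, np.iinfo(np.uint8).max)  # 0 255

# FP32: 1 sign bit, 8 exponent bits, 23 mantissa bits.
print(np.finfo(np.float32).bits // 8, "bytes per FP32 value")  # 4 bytes per FP32 value
```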
In deep learning, we typically work with:
- Weights: Learnable parameters multiplied by inputs during forward passes
- Activations: The outputs of each layer, resulting from computations between inputs and weights
- Biases: Constant values added to outputs to shift the activation function
Most neural networks are trained using 32-bit floating-point (FP32) precision, providing high accuracy but at the cost of substantial memory and computational demands.
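To make that cost concrete, here is a quick back-of-the-envelope estimate of parameter memory at different precisions; the 5-million-parameter count is an arbitrary, illustrative assumption:

```python
# Rough parameter-memory estimate at different precisions.
num_params = 5_000_000                      # illustrative model size (assumption)
bytes_per_value = {"FP32": 4, "INT8": 1, "INT4": 0.5}

for name, nbytes in bytes_per_value.items():
    print(f"{name}: {num_params * nbytes / 2**20:.1f} MiB")
# FP32: 19.1 MiB, INT8: 4.8 MiB, INT4: 2.4 MiB
```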
Quantization Fundamentals
At its core, quantization maps continuous floating-point values to a discrete set of integer values. This process inevitably introduces some approximation error, but when done properly, the impact on model accuracy can be minimal while providing significant efficiency gains.
The key challenge in quantization is determining the optimal mapping between floating-point and integer values that minimizes information loss.
Symmetric Linear Quantization
Symmetric linear quantization is characterized by fixing the zero-point to 0, ensuring that floating-point zero maps exactly to integer zero. This approach is particularly elegant and computationally efficient.
Mathematical Formulation
For a floating-point value x mapped to a quantized value x_Q:
x_int = round(x / Δ)
x_Q = clamp(-N_levels/2, N_levels/2 - 1, x_int) // for signed integers
Where:
- Δ (scale): The step size between adjacent representable values
- N_levels: The number of representable levels (256 for 8-bit)
- clamp(a, b, x): Constrains x to the range [a, b]
For dequantization (converting back to floating-point):
x_float = x_Q * Δ
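These two formulas translate almost directly into code. Here is a minimal NumPy sketch for the 8-bit case; the function names are ours, not taken from any particular library:

```python
import numpy as np

def symmetric_quantize_int8(x):
    """Symmetric 8-bit quantization: the zero-point is fixed at 0."""
    scale = np.max(np.abs(x)) / 127.0                # Δ = max|x| / 127
    x_int = np.round(x / scale)                      # x_int = round(x / Δ)
    x_q = np.clip(x_int, -128, 127).astype(np.int8)  # clamp to the signed 8-bit range
    return x_q, scale

def symmetric_dequantize(x_q, scale):
    """x_float = x_Q * Δ"""
    return x_q.astype(np.float32) * scale
```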
Example Calculation
Let's work through a concrete example with a weight matrix having values in the range [-4.0, 4.0] that we want to quantize to 8-bit signed integers (-128 to 127):
- Calculate the scale factor:
Δ = max(abs(-4.0), abs(4.0)) / 127 = 4.0 / 127 ≈ 0.0315
- Quantize specific values:
- For -4.0: x_int = round(-4.0 / 0.0315) = -127, x_Q = clamp(-128, 127, -127) = -127
- For 0: x_int = round(0 / 0.0315) = 0, x_Q = clamp(-128, 127, 0) = 0
- For 4.0: x_int = round(4.0 / 0.0315) = 127, x_Q = clamp(-128, 127, 127) = 127
- Verify dequantization:
- -127 → -127 × 0.0315 = -4.0005 (original: -4.0)
- 0 → 0 × 0.0315 = 0 (original: 0)
- 127 → 127 × 0.0315 = 4.0005 (original: 4.0)
This demonstrates how symmetric quantization preserves zero exactly and provides good approximations at the extremes of the value range.
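The worked example can be reproduced in a few lines. The results differ very slightly from the hand calculation above because the code keeps the full-precision scale (4.0 / 127 ≈ 0.031496) instead of the rounded 0.0315:

```python
import numpy as np

w = np.array([-4.0, 0.0, 4.0], dtype=np.float32)
scale = np.max(np.abs(w)) / 127.0                   # ≈ 0.031496
w_q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
w_deq = w_q.astype(np.float32) * scale

print(w_q)    # [-127    0  127]
print(w_deq)  # [-4.  0.  4.]  (round-trip error is negligible for these values)
```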
Asymmetric Linear Quantization
While symmetric quantization works well for weight distributions centered around zero, many activation functions (like ReLU) produce outputs that are predominantly positive. For these asymmetric distributions, asymmetric quantization often provides better results by utilizing the full range of integer values.
Mathematical Formulation
For asymmetric quantization:
x_int = round(x / Δ) + z
x_Q = clamp(0, N_levels - 1, x_int) // for unsigned integers
Where:
- z (zero-point): The integer value that represents floating-point zero
- Δ (scale): (max(x) - min(x)) / (N_levels - 1)
For dequantization:
x_float = (x_Q - z) * Δ
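As with the symmetric case, these formulas map directly to a short NumPy sketch for 8-bit unsigned quantization (again, the function names are our own):

```python
import numpy as np

def asymmetric_quantize_uint8(x):
    """Asymmetric 8-bit quantization to the unsigned range [0, 255]."""
    x_min, x_max = float(np.min(x)), float(np.max(x))
    scale = (x_max - x_min) / 255.0                 # Δ = (max - min) / (N_levels - 1)
    zero_point = int(round(-x_min / scale))         # z: the integer that represents 0.0
    x_int = np.round(x / scale) + zero_point        # x_int = round(x / Δ) + z
    x_q = np.clip(x_int, 0, 255).astype(np.uint8)
    return x_q, scale, zero_point

def asymmetric_dequantize(x_q, scale, zero_point):
    """x_float = (x_Q - z) * Δ"""
    return (x_q.astype(np.float32) - zero_point) * scale
```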
Example Calculation
Let's quantize a tensor with values in the range [-2.5, 1.8] to 8-bit unsigned integers (0 to 255):
- Calculate the scale and zero-point:
- Δ = (1.8 - (-2.5)) / (255 - 0) = 4.3 / 255 ≈ 0.0169
- z = round((0 - (-2.5)) / 0.0169) = round(147.93) = 148
- Quantize specific values:
- For -2.5: x_int = round(-2.5 / 0.0169) + 148 = -148 + 148 = 0, x_Q = clamp(0, 255, 0) = 0
- For 0: x_int = round(0 / 0.0169) + 148 = 148, x_Q = clamp(0, 255, 148) = 148
- For 1.8: x_int = round(1.8 / 0.0169) + 148 = 107 + 148 = 255, x_Q = clamp(0, 255, 255) = 255
- Verify dequantization:
- 0 → (0 - 148) × 0.0169 = -2.5012 (original: -2.5)
- 148 → (148 - 148) × 0.0169 = 0 (original: 0)
- 255 → (255 - 148) × 0.0169 = 1.8083 (original: 1.8)
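Reproducing this example in code gives nearly identical numbers; the small differences come from using the full-precision scale (4.3 / 255 ≈ 0.016863) rather than the rounded 0.0169:

```python
import numpy as np

a = np.array([-2.5, 0.0, 1.8], dtype=np.float32)
scale = (1.8 - (-2.5)) / 255.0                      # ≈ 0.016863
zero_point = int(round(2.5 / scale))                # 148
a_q = np.clip(np.round(a / scale) + zero_point, 0, 255).astype(np.uint8)
a_deq = (a_q.astype(np.float32) - zero_point) * scale

print(a_q)    # [  0 148 255]
print(a_deq)  # ≈ [-2.4957  0.      1.8043]
```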
Comparing Symmetric and Asymmetric Quantization
The fundamental difference between these approaches lies in how they handle the zero point:
- Symmetric Quantization: FP32 0 maps directly to INT 0
- Asymmetric Quantization: FP32 0 maps to a computed integer value (like 148 in our example)
When to use each approach:
- Symmetric Quantization: Generally preferable for weights, which typically have distributions centered around zero
- Asymmetric Quantization: Often better for activations, especially after ReLU layers, which have skewed distributions with predominantly positive values
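To see why this guidance holds, here is a small illustrative experiment on a synthetic, ReLU-like (non-negative) tensor; the random data and the mean-absolute-error metric are our own assumptions for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
acts = np.abs(rng.normal(size=10_000)).astype(np.float32)   # ReLU-like: all values >= 0

# Symmetric int8: half of the [-128, 127] range is spent on codes that never occur.
s_scale = np.max(np.abs(acts)) / 127.0
s_deq = np.clip(np.round(acts / s_scale), -128, 127) * s_scale

# Asymmetric uint8: the full [0, 255] range covers the observed values.
a_scale = (acts.max() - acts.min()) / 255.0
a_zp = int(round(-acts.min() / a_scale))
a_deq = (np.clip(np.round(acts / a_scale) + a_zp, 0, 255) - a_zp) * a_scale

print("symmetric  mean abs error:", np.mean(np.abs(acts - s_deq)))
print("asymmetric mean abs error:", np.mean(np.abs(acts - a_deq)))  # typically about half
```

Because the symmetric scheme reserves half of its codes for negative values that never occur, its step size is roughly twice as large, which shows up directly as a larger reconstruction error.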
Implementation Considerations
When implementing quantization in practice, several important factors come into play:
- Per-tensor vs. Per-channel Quantization: Applying different quantization parameters to each channel often yields better results than using the same parameters for the entire tensor (see the sketch after this list)
- Quantization-Aware Training vs. Post-Training Quantization: Incorporating quantization effects during training typically helps reduce accuracy loss compared to quantizing after training
- Hardware Constraints: Some hardware accelerators may only support specific quantization schemes or bit-widths
- Dynamic Range: Carefully analyzing the distribution of values helps in setting optimal quantization parameters to preserve important information
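As a sketch of the per-channel idea from the first point above, the only change from the earlier symmetric function is that the scale is computed per output channel instead of once for the whole tensor (the names and shapes here are illustrative, not a specific framework API):

```python
import numpy as np

def symmetric_quantize_per_channel(w, axis=0):
    """Symmetric int8 quantization with one scale per output channel (along `axis`)."""
    reduce_axes = tuple(i for i in range(w.ndim) if i != axis)
    scales = np.max(np.abs(w), axis=reduce_axes, keepdims=True) / 127.0
    w_q = np.clip(np.round(w / scales), -128, 127).astype(np.int8)
    return w_q, scales

# Two output channels with very different magnitudes.
w = np.array([[0.01, -0.02, 0.03],
              [5.00, -4.00, 3.00]], dtype=np.float32)
w_q, scales = symmetric_quantize_per_channel(w)
print(scales.ravel())   # one scale per row: ≈ [0.000236 0.0394]
```

With a single per-tensor scale (≈ 0.0394 here), the first channel's values could only land on 0 or ±0.0394 and would lose almost all of their detail; per-channel scales keep them distinguishable.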
Next Steps
In our upcoming work, we'll implement both symmetric and asymmetric quantization on MobileViT models and compare their performance across different precisions (FP32 baseline, INT16, INT8, INT4). We'll evaluate these approaches based on:
- Accuracy preservation
- Model size reduction
- Inference speed improvements
- Memory usage
Stay tuned for our implementation details and experimental results!