
Neural Network Quantization Research Plan

Published: May 8, 2025  |  Author: Efaimo AI Research Team

Welcome to the first blog post from Efaimo AI. In this post, we'll outline our research plan for Neural Network Quantization techniques.

Research Background

As AI models continue to grow in size and complexity, deploying and running them efficiently has become a critical challenge. Particularly in environments with limited resources, such as mobile or embedded devices, model size and inference speed are key considerations.

Neural Network Quantization is an effective approach to address these issues. It converts model parameters stored in 32-bit floating-point (FP32) format to lower bit-precision formats (16-bit, 8-bit, 4-bit, etc.), which reduces model size and can speed up inference, especially on hardware with efficient low-precision arithmetic.
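
To get a rough sense of the savings, the short calculation below estimates parameter storage at each bit width. The 5-million-parameter count is an illustrative assumption (roughly the scale of small mobile models), not a measurement of any specific model.

num_params = 5_000_000  # illustrative assumption, not a measured model size

for bits in (32, 16, 8, 4):
    size_mb = num_params * bits / 8 / 1e6   # bits -> bytes -> megabytes
    print(f"{bits:>2}-bit: {size_mb:5.1f} MB")
# 32-bit: 20.0 MB, 16-bit: 10.0 MB, 8-bit: 5.0 MB, 4-bit: 2.5 MB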

Research Plan

We will proceed with our research in the following phases:

1. Symmetric Linear Quantization

  • Concept Understanding: Mathematical principles and implementation methods of Symmetric Linear Quantization
  • Code Implementation: Analysis and experimentation with open-source implementations from GitHub
  • Model Application: Applying Symmetric Quantization to the MobileViT model

2. Asymmetric Linear Quantization

  • Concept Understanding: Advantages of Asymmetric Linear Quantization and differences from the Symmetric approach
  • Code Implementation: Analysis and customization of open-source implementations
  • Model Application: Applying Asymmetric Quantization to the same MobileViT model

3. Performance Evaluation and Comparison

After applying both quantization techniques, we will compare the following metrics:

  • Accuracy Comparison:
    • FP32 model vs. 16-bit model vs. 8-bit model vs. 4-bit model (Symmetric)
    • FP32 model vs. 16-bit model vs. 8-bit model vs. 4-bit model (Asymmetric)
  • Latency Comparison:
    • FP32 model vs. 16-bit model vs. 8-bit model vs. 4-bit model (Symmetric)
    • FP32 model vs. 16-bit model vs. 8-bit model vs. 4-bit model (Asymmetric)
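
For the latency measurements, we expect to use a simple timing loop along the lines of the sketch below. The model argument, the 256x256 input shape, and the iteration counts are placeholders we chose for illustration; careful measurements would also need to account for the target device (e.g., GPU synchronization or on-device profiling).

import time
import torch

@torch.no_grad()
def measure_latency_ms(model, input_shape=(1, 3, 256, 256), warmup=10, iters=100):
    # Average forward-pass latency in milliseconds on CPU.
    # Warm-up runs are excluded from the timed loop.
    model.eval()
    x = torch.randn(*input_shape)
    for _ in range(warmup):
        model(x)
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    return (time.perf_counter() - start) / iters * 1000.0

# Usage (hypothetical): latency = measure_latency_ms(quantized_mobilevit)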

Quantization Concepts

Symmetric Linear Quantization

Symmetric Linear Quantization is the simplest quantization method: it maps the model's real-valued parameters to integer values at regular intervals. Its defining characteristic is a quantization range that is symmetric around zero.

Mathematically expressed:

q = round(r / scale)
r ≈ q * scale

Where:

  • r is the original real value (FP32)
  • q is the quantized integer value (e.g., INT8)
  • scale is the scale factor (scale = max(abs(r)) / (2^(bits-1) - 1))

The advantage of Symmetric Quantization is its simplicity and computational efficiency. However, it may use the representational range inefficiently if the data distribution is not symmetrical around zero.
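
To make the formulas above concrete, here is a minimal NumPy sketch of per-tensor Symmetric Quantization. The function names, the clipping to [-qmax, qmax], and the random test tensor are illustrative choices of ours, not a reference implementation.

import numpy as np

def symmetric_quantize(r, bits=8):
    # Per-tensor symmetric quantization: scale = max(|r|) / (2^(bits-1) - 1).
    # Assumes r contains at least one nonzero value.
    qmax = 2 ** (bits - 1) - 1                              # e.g., 127 for INT8
    scale = np.max(np.abs(r)) / qmax
    q = np.clip(np.round(r / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def symmetric_dequantize(q, scale):
    # Reconstruction is approximate: r ≈ q * scale.
    return q.astype(np.float32) * scale

# Quick round-trip check on random weight-like values
r = np.random.randn(1000).astype(np.float32)
q, scale = symmetric_quantize(r, bits=8)
print("max abs error:", np.max(np.abs(r - symmetric_dequantize(q, scale))))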

Asymmetric Linear Quantization

Asymmetric Linear Quantization was developed to address this limitation of the Symmetric approach. By introducing a zero-point offset, it shifts the quantization range so that it covers the actual distribution of the data more efficiently.

Mathematically expressed:

q = round(r / scale) + zero_point
r ≈ (q - zero_point) * scale

Where:

  • zero_point is the integer quantized value that corresponds to the real value 0 (zero_point = round(-min(r) / scale))
  • scale = (max(r) - min(r)) / (2^bits - 1)

Asymmetric Quantization can utilize the data range more effectively, making it particularly useful for tensors with asymmetric distributions, such as the outputs of activation functions.
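
As with the symmetric case, the sketch below is only an illustration of the formulas, not a reference implementation; the unsigned clipping range [0, 2^bits - 1] and the skewed, activation-like test tensor are assumptions we chose for the example.

import numpy as np

def asymmetric_quantize(r, bits=8):
    # Per-tensor asymmetric (affine) quantization into the range [0, 2^bits - 1].
    qmin, qmax = 0, 2 ** bits - 1                           # e.g., [0, 255] for UINT8
    r_min, r_max = float(np.min(r)), float(np.max(r))
    scale = (r_max - r_min) / (qmax - qmin)                 # assumes r_max > r_min
    zero_point = int(np.clip(round(qmin - r_min / scale), qmin, qmax))
    q = np.clip(np.round(r / scale) + zero_point, qmin, qmax).astype(np.int32)
    return q, scale, zero_point

def asymmetric_dequantize(q, scale, zero_point):
    # Reconstruction is approximate: r ≈ (q - zero_point) * scale.
    return (q.astype(np.float32) - zero_point) * scale

# Quick round-trip check on a skewed tensor (like a ReLU output)
r = np.abs(np.random.randn(1000)).astype(np.float32)
q, scale, zp = asymmetric_quantize(r, bits=8)
print("max abs error:", np.max(np.abs(r - asymmetric_dequantize(q, scale, zp))))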

Expected Outcomes

Through this research, we expect to achieve the following outcomes:

  1. Establish efficient methods for optimizing the MobileViT model
  2. Understand performance differences between Symmetric and Asymmetric Quantization at various bit precisions
  3. Derive optimal quantization strategies for mobile and embedded environments

We will share our research progress and results through this blog. In the next post, we will cover the specific implementation methods and experimental results of Symmetric Linear Quantization.

Thank you for your interest!