Quantization
Published on: October 09, 2025
Tags: #quantization #ai
1. The Core Quantization and Dequantization Process
```mermaid
graph TD
    %% Define styles for input/output and processing nodes
    classDef inputOutput fill:#cde4ff,stroke:#333,stroke-width:2px;
    classDef process fill:#f9f9f9,stroke:#333,stroke-width:1px;

    subgraph Quantization
        A[FP32 Value] --> B[Divide by Scale];
        B --> C[Add Zero-Point];
        C --> D[Round to Nearest Integer];
        D --> E[INT8/INT4 Value];
    end

    subgraph Dequantization
        F[INT8/INT4 Value] --> G[Subtract Zero-Point];
        G --> H[Multiply by Scale];
        H --> I[Approximated FP32 Value];
    end

    %% Link the two processes
    E --> F;

    %% Apply styles
    class A,I inputOutput;
    class E,F inputOutput;
    class B,C,D,G,H process;
    linkStyle 4 stroke:#ff9999,stroke-width:2px,fill:none;
```
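The same round trip fits in a few lines of code. Below is a minimal NumPy sketch of the quantize/dequantize steps in the diagram; the sample values, scale, and zero-point are made up for illustration.

```python
import numpy as np

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """FP32 -> INT8: divide by scale, add zero-point, round, clamp."""
    q = np.round(x / scale + zero_point)
    return np.clip(q, qmin, qmax).astype(np.int8)

def dequantize(q, scale, zero_point):
    """INT8 -> approximate FP32: subtract zero-point, multiply by scale."""
    return scale * (q.astype(np.float32) - zero_point)

x = np.array([0.1, -0.52, 1.3, 0.0], dtype=np.float32)
scale, zero_point = 0.02, 0      # assumed values, chosen for illustration
q = quantize(x, scale, zero_point)
x_hat = dequantize(q, scale, zero_point)
print(q)      # [  5 -26  65   0]
print(x_hat)  # close to the original values, but not exact in general
```

Rounding is the lossy step: the dequantized tensor only approximates the original FP32 values, which is why the diagram labels the output "Approximated FP32 Value".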
2. Symmetric vs. Asymmetric Quantization
graph TD subgraph "Symmetric Quantization
(for Weights)" direction TB A[FP32 Range: -1.0 to 1.0] --> B["Zero-Point = 0"]; B --> C[INT8 Range: -127 to 127]; style A fill:#cde4ff style C fill:#d5e8d4 end subgraph "Asymmetric Quantization
(for Activations)" direction TB D[FP32 Range: 0.0 to 2.5] --> E["Zero-Point = 60 (Example)"]; E --> F[INT8 Range: 0 to 255]; style D fill:#cde4ff style F fill:#d5e8d4 end
3. Comparison of Quantization Strategies: PTQ vs. QAT
graph TD subgraph "Post-Training Quantization
(PTQ)" direction TB A(FP32 Pre-Trained Model) --> B[Calibration with Data]; B --> C[Calculate Quantization Params]; C --> D(Quantized
INT8/INT4 Model); end subgraph "Quantization-Aware Training
(QAT)" direction TB E(FP32 Pre-Trained Model) --> F{Fine-Tuning Loop}; F -- Forward Pass --> G(Simulate Quantization); G -- Backward Pass --> H(Update FP32 Weights); H --> F; F -- End of Fine-Tuning --> I(Final Quantized
INT8/INT4 Model); end
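The contrast is easiest to see in code. The sketch below assumes PyTorch: PTQ computes the scale once from calibration data, while QAT injects a fake-quantization step into the forward pass and relies on a straight-through estimator so the FP32 weights keep receiving gradients. The tiny linear layer, random data, and placeholder loss exist only to make the loop runnable.

```python
import torch

def fake_quantize(x, scale, zero_point=0, qmin=-128, qmax=127):
    """Simulate INT8 rounding in the forward pass while keeping FP32 values;
    the straight-through estimator passes gradients through unchanged."""
    q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax)
    x_q = (q - zero_point) * scale
    return x + (x_q - x).detach()   # forward: x_q, backward: identity

def calibrate_scale(t, qmax=127):
    """PTQ-style calibration: one pass over sample data fixes the scale."""
    return t.abs().max().item() / qmax          # symmetric, per-tensor

# QAT-style fine-tuning: quantization is simulated in the forward pass,
# but the optimizer keeps updating the underlying FP32 weights.
model = torch.nn.Linear(16, 4)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
for _ in range(10):                             # tiny illustrative loop
    x = torch.randn(8, 16)
    w_scale = calibrate_scale(model.weight.detach())
    w_q = fake_quantize(model.weight, w_scale)
    loss = torch.nn.functional.linear(x, w_q, model.bias).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

PTQ stops after calibration and is cheap; QAT repeats the simulate-and-update cycle over a fine-tuning set, which costs more but lets the model adapt to the rounding error before the final quantized weights are exported.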
4. Overview of Advanced PTQ Techniques
```mermaid
mindmap
  root((Advanced PTQ))
    GPTQ
      ::icon(fa fa-compress)
      Layer-by-layer quantization
      Uses second-order Hessian information
      Updates remaining weights to compensate for quantization error
    AWQ
      ::icon(fa fa-star)
      Activation-Aware Weight Quantization
      Identifies important weights based on activation magnitudes
      Protects salient weights with per-channel scaling
    SmoothQuant
      ::icon(fa fa-sliders)
      Addresses challenging activation outliers
      Shifts quantization difficulty from activations to weights
      Enables accurate W8A8 quantization
```
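These methods differ in their details, but the core SmoothQuant idea is compact enough to sketch: per-channel factors migrate quantization difficulty from outlier-heavy activation channels into the weights, while leaving the layer's output mathematically unchanged. A minimal NumPy sketch follows; alpha = 0.5 and the toy tensors are assumptions, and GPTQ and AWQ involve considerably more machinery than fits here.

```python
import numpy as np

def smoothquant_scales(act_absmax, w_absmax, alpha=0.5):
    """Per-channel factors s_j = max|X_j|**alpha / max|W_j|**(1 - alpha)."""
    return act_absmax**alpha / w_absmax**(1.0 - alpha)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8)).astype(np.float32)   # activations (tokens x channels)
X[:, 3] *= 50.0                                  # one outlier-heavy channel
W = rng.normal(size=(6, 8)).astype(np.float32)   # weights (out x in)

s = smoothquant_scales(np.abs(X).max(axis=0), np.abs(W).max(axis=0))
X_smooth = X / s        # activations become easier to quantize
W_smooth = W * s        # weights absorb the scaling

# The rescaling is mathematically lossless before quantization:
assert np.allclose(X @ W.T, X_smooth @ W_smooth.T, atol=1e-3)
```

After smoothing, both X_smooth and W_smooth have tamer per-channel ranges, which is what makes accurate W8A8 (8-bit weights and 8-bit activations) quantization feasible.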
5. Hardware Acceleration for Quantized Models
```mermaid
graph TD
    subgraph Standard Inference
        direction TB
        A(FP32 Model) --> B("General-Purpose Cores /<br/>CUDA Cores");
        B --> C(Slower Inference);
    end

    subgraph Accelerated Inference
        direction TB
        D(Quantized INT8/INT4 Model) --> E("Specialized Hardware<br/>e.g., NVIDIA Tensor Cores,<br/>Intel AMX");
        E --> F(Faster Inference);
    end

    style F fill:#d5e8d4,stroke:#333,stroke-width:2px
```
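The speedup comes from doing the heavy matrix multiplies in low precision: units such as NVIDIA Tensor Cores and Intel AMX multiply INT8 operands, accumulate in INT32, and apply a single rescale at the end. The NumPy sketch below illustrates that arithmetic pattern only, not the hardware itself; the shapes and scales are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.integers(-128, 128, size=(64, 64), dtype=np.int8)   # quantized activations
B = rng.integers(-128, 128, size=(64, 64), dtype=np.int8)   # quantized weights
a_scale, b_scale = 0.01, 0.02                               # assumed per-tensor scales

# INT8 x INT8 multiplies with INT32 accumulation, then one FP32 rescale.
acc = A.astype(np.int32) @ B.astype(np.int32)
C = acc.astype(np.float32) * (a_scale * b_scale)
```

Because the operands are a quarter the width of FP32, the same memory bandwidth moves four times as many values, and the integer multiply-accumulate units reach much higher throughput, which is where most of the inference speedup comes from.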