Quantization

Published on: October 09, 2025

Tags: #quantization #ai


1. The Core Quantization and Dequantization Process

graph TD
    %% Define styles for input/output and processing nodes
    classDef inputOutput fill:#cde4ff,stroke:#333,stroke-width:2px;
    classDef process fill:#f9f9f9,stroke:#333,stroke-width:1px;

    subgraph Quantization
        A[FP32 Value] --> B[Divide by Scale];
        B --> C[Add Zero-Point];
        C --> D[Round to Nearest Integer];
        D --> E[INT8/INT4 Value];
    end

    subgraph Dequantization
        F[INT8/INT4 Value] --> G[Subtract Zero-Point];
        G --> H[Multiply by Scale];
        H --> I[Approximated FP32 Value];
    end

    %% Link the two processes
    E --> F;

    %% Apply styles
    class A,I inputOutput;
    class E,F inputOutput;
    class B,C,D,G,H process;
    linkStyle 4 stroke:#ff9999,stroke-width:2px,fill:none;
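
In code, both passes reduce to a few element-wise operations. The sketch below is a minimal NumPy illustration of this round trip; the function names, scale, and zero-point are chosen for the example rather than taken from any particular library.

import numpy as np

def quantize(x_fp32, scale, zero_point):
    # FP32 -> INT8: divide by scale, add zero-point, round, clamp to the INT8 range.
    q = np.round(x_fp32 / scale + zero_point)
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize(q_int8, scale, zero_point):
    # INT8 -> approximate FP32: subtract zero-point, multiply by scale.
    return (q_int8.astype(np.float32) - zero_point) * scale

x = np.array([-0.62, 0.0, 0.35, 1.27], dtype=np.float32)
scale, zero_point = 0.01, 0          # toy values chosen by hand for the example
q = quantize(x, scale, zero_point)
x_hat = dequantize(q, scale, zero_point)
print(q, x_hat)                      # x_hat approximates x, up to rounding error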

2. Symmetric vs. Asymmetric Quantization

graph TD
    subgraph "Symmetric Quantization 
(for Weights)" direction TB A[FP32 Range: -1.0 to 1.0] --> B["Zero-Point = 0"]; B --> C[INT8 Range: -127 to 127]; style A fill:#cde4ff style C fill:#d5e8d4 end subgraph "Asymmetric Quantization
(for Activations)" direction TB D[FP32 Range: 0.0 to 2.5] --> E["Zero-Point = 60 (Example)"]; E --> F[INT8 Range: 0 to 255]; style D fill:#cde4ff style F fill:#d5e8d4 end
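
The only real difference between the two schemes is how scale and zero-point are derived from the observed FP32 range. A short sketch of both derivations; the weight and activation values here are illustrative, not taken from a specific model.

import numpy as np

def symmetric_params(x, n_bits=8):
    # Symmetric: scale maps the largest magnitude onto +/-127; zero-point stays 0.
    qmax = 2 ** (n_bits - 1) - 1
    return float(np.max(np.abs(x))) / qmax, 0

def asymmetric_params(x, n_bits=8):
    # Asymmetric: scale covers the full min..max span of an unsigned grid,
    # and the zero-point shifts the FP32 minimum onto integer 0.
    qmax = 2 ** n_bits - 1
    x_min, x_max = float(np.min(x)), float(np.max(x))
    scale = (x_max - x_min) / qmax
    return scale, int(round(-x_min / scale))

weights = np.array([-1.0, -0.3, 0.4, 1.0])      # roughly centered on zero
acts = np.array([-0.2, 0.7, 1.9, 2.3])          # skewed, as activations often are
print(symmetric_params(weights))                # (~0.0079, 0)
print(asymmetric_params(acts))                  # (~0.0098, ~20)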

3. Comparison of Quantization Strategies: PTQ vs. QAT

graph TD
    subgraph "Post-Training Quantization 
(PTQ)" direction TB A(FP32 Pre-Trained Model) --> B[Calibration with Data]; B --> C[Calculate Quantization Params]; C --> D(Quantized
INT8/INT4 Model); end subgraph "Quantization-Aware Training
(QAT)" direction TB E(FP32 Pre-Trained Model) --> F{Fine-Tuning Loop}; F -- Forward Pass --> G(Simulate Quantization); G -- Backward Pass --> H(Update FP32 Weights); H --> F; F -- End of Fine-Tuning --> I(Final Quantized
INT8/INT4 Model); end
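
The practical difference shows up directly in code: PTQ only needs a calibration pass over a small dataset to pick scale and zero-point, while QAT repeatedly applies a "fake quantize" round trip inside the forward pass so fine-tuning can adapt the FP32 weights to the rounding. A minimal sketch of both pieces, with made-up calibration data and helper names of my own choosing:

import numpy as np

def calibrate_asymmetric(calibration_batches, n_bits=8):
    # PTQ-style calibration: observe activation ranges on a small held-out set,
    # then derive scale and zero-point once, with no further training.
    x_min, x_max = np.inf, -np.inf
    for batch in calibration_batches:
        x_min = min(x_min, float(batch.min()))
        x_max = max(x_max, float(batch.max()))
    qmax = 2 ** n_bits - 1
    scale = (x_max - x_min) / qmax
    return scale, int(round(-x_min / scale))

def fake_quantize(x, scale, zero_point, n_bits=8):
    # What QAT's "Simulate Quantization" step does in the forward pass:
    # round-trip through the integer grid while keeping the tensor in FP32,
    # so the fine-tuning loop learns weights that tolerate the rounding.
    q = np.clip(np.round(x / scale + zero_point), 0, 2 ** n_bits - 1)
    return (q - zero_point) * scale

rng = np.random.default_rng(0)
calib = [rng.normal(1.0, 0.5, size=(32, 64)) for _ in range(8)]   # stand-in data
scale, zp = calibrate_asymmetric(calib)
print(scale, zp)
print(fake_quantize(calib[0][:2, :4], scale, zp))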

4. Overview of Advanced PTQ Techniques

mindmap
  root((Advanced PTQ))
    GPTQ
      ::icon(fa fa-compress)
      Layer-by-layer quantization
      Uses second-order info (Hessian)
      Updates remaining weights to compensate for quantization error
    AWQ
      ::icon(fa fa-star)
      Activation-Aware Weight Quantization
      Identifies important weights based on activation magnitudes
      Protects salient weights with per-channel scaling
    SmoothQuant
      ::icon(fa fa-sliders)
      Addresses challenging activation outliers
      Shifts quantization difficulty from activations to weights
      Enables accurate W8A8 quantization
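
SmoothQuant's core idea fits in a few lines: rescale each input channel so activation outliers shrink while the weights absorb the scale, leaving the matrix product mathematically unchanged. The sketch below follows the per-channel formula s_j = max|X_j|^alpha / max|W_j|^(1-alpha) with alpha = 0.5, as described in the SmoothQuant paper; the helper name and toy tensors are mine.

import numpy as np

def smooth(X, W, alpha=0.5):
    # Per-input-channel smoothing factor: move quantization difficulty from the
    # outlier-heavy activations onto the weights. The product is preserved because
    # (X / s) @ (s[:, None] * W) == X @ W.
    s = np.max(np.abs(X), axis=0) ** alpha / np.max(np.abs(W), axis=1) ** (1 - alpha)
    return X / s, W * s[:, None]

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(16, 8))
X[:, 3] *= 50                        # inject an outlier channel, as seen in real activations
W = rng.normal(0, 0.1, size=(8, 4))
X_s, W_s = smooth(X, W)
assert np.allclose(X @ W, X_s @ W_s)             # the math is preserved
print(np.abs(X).max(), np.abs(X_s).max())        # the activation outlier shrinks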

5. Hardware Acceleration for Quantized Models

graph TD
    subgraph Standard Inference
        direction TB
        A(FP32 Model) --> B(General-Purpose Cores / CUDA Cores);
        B --> C(Slower Inference);
    end
    subgraph Accelerated Inference
        direction TB
        D(Quantized INT8/INT4 Model) --> E("Specialized Hardware, e.g. NVIDIA Tensor Cores, Intel AMX");
        E --> F(Faster Inference);
    end
    style F fill:#d5e8d4,stroke:#333,stroke-width:2px
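
The speedup comes from doing the bulk of the multiply-accumulate work in low-precision integers. The NumPy sketch below only mirrors the data-type flow of such units (INT8 operands, INT32 accumulation, one FP32 rescale); real Tensor Core or AMX kernels are far more involved.

import numpy as np

def int8_matmul(q_a, q_b, scale_a, scale_b):
    # INT8 x INT8 multiply with INT32 accumulation, as integer matmul units do,
    # followed by a single FP32 rescale back to real values.
    acc = q_a.astype(np.int32) @ q_b.astype(np.int32)
    return acc.astype(np.float32) * (scale_a * scale_b)

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 8)).astype(np.float32)
b = rng.normal(size=(8, 3)).astype(np.float32)
scale_a = np.abs(a).max() / 127                  # symmetric quantization, zero-point 0
scale_b = np.abs(b).max() / 127
q_a = np.clip(np.round(a / scale_a), -127, 127).astype(np.int8)
q_b = np.clip(np.round(b / scale_b), -127, 127).astype(np.int8)
print(np.abs(int8_matmul(q_a, q_b, scale_a, scale_b) - a @ b).max())  # small error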
