Tokenization
Published on: September 26, 2025
Tags: #tokenization #ai #llm
The Overall NLP Pipeline
```mermaid
graph TD
    A["Raw Text: 'Tokenization is crucial'"] --> B{Tokenization};
    B --> C["Tokens: ['Tokenization', 'is', 'crucial']"];
    C --> D{"Numericalization / Vocabulary Mapping"};
    D --> E["Input IDs: [2345, 16, 6789]"];
    E --> F{Embedding Lookup};
    F --> G["Vector Representations: [[0.1, ...], [0.5, ...], [0.9, ...]]"];
    G --> H{LLM Processing};
```
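The diagram can be traced end to end in a few lines of code. The sketch below is a deliberately tiny illustration, not a real tokenizer: the whitespace split, the three-entry vocabulary, and the random embedding table are stand-ins for components that an actual LLM learns or looks up from a much larger vocabulary.

```python
import random

text = "Tokenization is crucial"

# 1. Tokenization: split the raw text into tokens (naive whitespace split).
tokens = text.split()                      # ['Tokenization', 'is', 'crucial']

# 2. Numericalization: map each token to an integer ID via a vocabulary.
#    The IDs here mirror the illustrative ones in the diagram.
vocab = {"Tokenization": 2345, "is": 16, "crucial": 6789}
input_ids = [vocab[t] for t in tokens]     # [2345, 16, 6789]

# 3. Embedding lookup: each ID indexes one row of an embedding table.
#    A real model learns this table; here it is random for illustration.
embedding_dim = 4
embedding_table = {i: [random.random() for _ in range(embedding_dim)]
                   for i in vocab.values()}
vectors = [embedding_table[i] for i in input_ids]

print(tokens)
print(input_ids)
print(vectors)   # these vectors are what the LLM actually processes
```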
Comparison of Tokenization Methods
```mermaid
graph TD
    subgraph Input
        A["Input Text: 'Learning tokenization'"]
    end
    subgraph Tokenization Methods
        B(Word-Level)
        C(Character-Level)
        D(Subword-Level)
    end
    subgraph Output Tokens
        B_out("['Learning', 'tokenization']")
        C_out("['L','e','a','r','n','i','n','g',' ','t','o','k','e','n','i','z','a','t','i','o','n']")
        D_out("['Learn', '##ing', 'token', '##ization']")
    end
    A --> B --> B_out
    A --> C --> C_out
    A --> D --> D_out
```
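To make the comparison concrete, here is a small Python sketch of all three strategies. The word- and character-level splits are exactly what they look like; the subword split uses a greedy longest-match over a toy WordPiece-style vocabulary that is hard-coded so the output matches the diagram, whereas a real subword vocabulary is learned from data.

```python
text = "Learning tokenization"

# Word-level: one token per whitespace-separated word.
word_tokens = text.split()

# Character-level: one token per character (spaces included).
char_tokens = list(text)

# Subword-level: greedy longest-match against a toy subword vocabulary,
# marking word-internal continuations with '##' (WordPiece convention).
subword_vocab = {"Learn", "##ing", "token", "##ization"}

def wordpiece_split(word, vocab):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:                              # no vocabulary entry matched
            return ["[UNK]"]
        start = end
    return pieces

subword_tokens = [p for w in text.split()
                  for p in wordpiece_split(w, subword_vocab)]

print(word_tokens)     # ['Learning', 'tokenization']
print(char_tokens)     # ['L', 'e', 'a', 'r', 'n', 'i', 'n', 'g', ' ', ...]
print(subword_tokens)  # ['Learn', '##ing', 'token', '##ization']
```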
Byte-Pair Encoding (BPE) Algorithm Flow
```mermaid
graph TD
    A[Start with a corpus of text] --> B{Initialize vocabulary with all individual characters};
    B --> C{"Target vocabulary size reached?"};
    C -- No --> D{"Find the most frequent adjacent pair of tokens (e.g. 'e' + 's')"};
    D --> E{"Merge this pair into a new, single token ('es')"};
    E --> F{Add the new token to the vocabulary};
    F --> G{Update the corpus by replacing all instances of the pair with the new token};
    G --> C;
    C -- Yes --> H["End: Final Vocabulary Generated"];
```
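The loop in the flowchart maps almost one-to-one onto code. The following is a simplified training sketch with a made-up corpus and target vocabulary size; production BPE implementations additionally work over word-frequency counts, handle end-of-word markers, and record the merge order so new text can be tokenized later.

```python
from collections import Counter

corpus = ["lowest", "newest", "widest", "west", "test"]   # toy corpus
target_vocab_size = 20                                     # arbitrary target

# Initialize: every word is a sequence of single characters,
# and the vocabulary starts as the set of individual characters.
words = [list(w) for w in corpus]
vocab = {ch for w in words for ch in w}

while len(vocab) < target_vocab_size:
    # Find the most frequent adjacent pair of tokens (e.g. 'e' + 's').
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    if not pairs:          # nothing left to merge
        break
    (a, b), _ = pairs.most_common(1)[0]
    merged = a + b         # e.g. 'es'

    # Add the new token to the vocabulary.
    vocab.add(merged)

    # Update the corpus: replace every occurrence of the pair with the merge.
    updated = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == (a, b):
                out.append(merged)
                i += 2
            else:
                out.append(w[i])
                i += 1
        updated.append(out)
    words = updated

print(sorted(vocab, key=len))
```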
Vocabulary Size Trade-offs
```mermaid
graph TD
    subgraph Larger Vocabulary
        direction LR
        A_Pro1["+ Shorter sequence lengths"]
        A_Pro2["+ Fewer 'unknown' tokens"]
        A_Con1["- Larger model size (embedding layer)"]
        A_Con2["- Slower training"]
        A_Con3["- May have undertrained embeddings for rare tokens"]
    end
    subgraph Smaller Vocabulary
        direction LR
        B_Pro1["+ Smaller model size"]
        B_Pro2["+ More computationally efficient"]
        B_Con1["- Longer sequence lengths"]
        B_Con2["- May split words into less meaningful pieces"]
        B_Con3["- Can make it harder for the model to learn long-range dependencies"]
    end
```