Tokenization
Published on: September 26, 2025
Tags: #tokenization #ai #llm
The Overall NLP Pipeline
```mermaid
graph TD
    A["Raw Text: 'Tokenization is crucial'"] --> B{Tokenization};
    B --> C["Tokens: ['Tokenization', 'is', 'crucial']"];
    C --> D{"Numericalization / Vocabulary Mapping"};
    D --> E["Input IDs: [2345, 16, 6789]"];
    E --> F{Embedding Lookup};
    F --> G["Vector Representations: [[0.1, ...], [0.5, ...], [0.9, ...]]"];
    G --> H{LLM Processing};
```
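The diagram can be traced end to end in a few lines of code. The sketch below is a deliberately tiny illustration, not a real tokenizer: the whitespace split, the three-entry vocabulary, and the random embedding table are stand-ins for components that an actual LLM learns or looks up from a much larger vocabulary.

```python
import random

text = "Tokenization is crucial"

# 1. Tokenization: split the raw text into tokens (naive whitespace split).
tokens = text.split()                      # ['Tokenization', 'is', 'crucial']

# 2. Numericalization: map each token to an integer ID via a vocabulary.
#    The IDs here mirror the illustrative ones in the diagram.
vocab = {"Tokenization": 2345, "is": 16, "crucial": 6789}
input_ids = [vocab[t] for t in tokens]     # [2345, 16, 6789]

# 3. Embedding lookup: each ID indexes one row of an embedding table.
#    A real model learns this table; here it is random for illustration.
embedding_dim = 4
embedding_table = {i: [random.random() for _ in range(embedding_dim)]
                   for i in vocab.values()}
vectors = [embedding_table[i] for i in input_ids]

print(tokens)
print(input_ids)
print(vectors)   # these vectors are what the LLM actually processes
```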
Comparison of Tokenization Methods
```mermaid
graph TD
    subgraph Input
        A["Input Text: 'Learning tokenization'"]
    end
    subgraph Tokenization Methods
        B(Word-Level)
        C(Character-Level)
        D(Subword-Level)
    end
    subgraph Output Tokens
        B_out("['Learning', 'tokenization']")
        C_out("['L','e','a','r','n','i','n','g',' ','t','o','k','e','n','i','z','a','t','i','o','n']")
        D_out("['Learn', '##ing', 'token', '##ization']")
    end
    A --> B --> B_out
    A --> C --> C_out
    A --> D --> D_out
```
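To make the comparison concrete, here is a small Python sketch of all three strategies. The word- and character-level splits are exactly what they look like; the subword split uses a greedy longest-match over a toy WordPiece-style vocabulary that is hard-coded so the output matches the diagram, whereas a real subword vocabulary is learned from data.

```python
text = "Learning tokenization"

# Word-level: one token per whitespace-separated word.
word_tokens = text.split()

# Character-level: one token per character (spaces included).
char_tokens = list(text)

# Subword-level: greedy longest-match against a toy subword vocabulary,
# marking word-internal continuations with '##' (WordPiece convention).
subword_vocab = {"Learn", "##ing", "token", "##ization"}

def wordpiece_split(word, vocab):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:                              # no vocabulary entry matched
            return ["[UNK]"]
        start = end
    return pieces

subword_tokens = [p for w in text.split()
                  for p in wordpiece_split(w, subword_vocab)]

print(word_tokens)     # ['Learning', 'tokenization']
print(char_tokens)     # ['L', 'e', 'a', 'r', 'n', 'i', 'n', 'g', ' ', ...]
print(subword_tokens)  # ['Learn', '##ing', 'token', '##ization']
```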
Byte-Pair Encoding (BPE) Algorithm Flow
```mermaid
graph TD
    A[Start with a corpus of text] --> B{Initialize vocabulary with all individual characters};
    B --> C{"Target vocabulary size reached?"};
    C -- No --> D{"Find the most frequent adjacent pair of tokens (e.g. 'e' + 's')"};
    D --> E{"Merge this pair into a new, single token ('es')"};
    E --> F{Add the new token to the vocabulary};
    F --> G{Update the corpus by replacing all instances of the pair with the new token};
    G --> C;
    C -- Yes --> H["End: Final Vocabulary Generated"];
```
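The loop in the flowchart maps almost one-to-one onto code. The following is a simplified training sketch with a made-up corpus and target vocabulary size; production BPE implementations additionally work over word-frequency counts, handle end-of-word markers, and record the merge order so new text can be tokenized later.

```python
from collections import Counter

corpus = ["lowest", "newest", "widest", "west", "test"]   # toy corpus
target_vocab_size = 20                                     # arbitrary target

# Initialize: every word is a sequence of single characters,
# and the vocabulary starts as the set of individual characters.
words = [list(w) for w in corpus]
vocab = {ch for w in words for ch in w}

while len(vocab) < target_vocab_size:
    # Find the most frequent adjacent pair of tokens (e.g. 'e' + 's').
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    if not pairs:          # nothing left to merge
        break
    (a, b), _ = pairs.most_common(1)[0]
    merged = a + b         # e.g. 'es'

    # Add the new token to the vocabulary.
    vocab.add(merged)

    # Update the corpus: replace every occurrence of the pair with the merge.
    updated = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == (a, b):
                out.append(merged)
                i += 2
            else:
                out.append(w[i])
                i += 1
        updated.append(out)
    words = updated

print(sorted(vocab, key=len))
```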
Vocabulary Size Trade-offs
```mermaid
graph TD
    subgraph Larger Vocabulary
        direction LR
        A_Pro1["+ Shorter sequence lengths"]
        A_Pro2["+ Fewer 'unknown' tokens"]
        A_Con1["- Larger model size (embedding layer)"]
        A_Con2["- Slower training"]
        A_Con3["- May have undertrained embeddings for rare tokens"]
    end
    subgraph Smaller Vocabulary
        direction LR
        B_Pro1["+ Smaller model size"]
        B_Pro2["+ More computationally efficient"]
        B_Con1["- Longer sequence lengths"]
        B_Con2["- May split words into less meaningful pieces"]
        B_Con3["- Can make it harder for the model to learn long-range dependencies"]
    end
```