In Part 1, I mentioned that LLMs predict the next token, not the next word. And I said this matters because "small wording changes can change behavior."
This post answers the obvious next question: What exactly is a token, and why should you care?
Tokenization is one of those "boring" details that quietly controls:
- why prompts behave differently after a tiny rephrase
- why some languages consume more context window
- why your RAG system retrieves the "right" chunk but still answers badly
What Tokenization Actually Does
LLMs do not read text the way humans do.
Before the model sees anything, your text is converted into tokens - small pieces of text represented as IDs.
So the model doesn't see:
"I love machine learning."
It sees something closer to:
[token_1837, token_91, token_5021, token_7723, token_13]
Different text → different token sequence → different internal computation → potentially different output.
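To make the text-to-IDs step concrete, here is a toy sketch. The IDs and the vocabulary below are invented for illustration; a real tokenizer has tens of thousands of learned entries and model-specific splits.

```python
# Made-up token IDs for illustration; real IDs come from a trained,
# model-specific vocabulary.
toy_vocab = {"I": 1837, " love": 91, " machine": 5021, " learning": 7723, ".": 13}

def encode(pieces, vocab):
    """Map already-split text pieces to their integer IDs."""
    return [vocab[p] for p in pieces]

ids = encode(["I", " love", " machine", " learning", "."], toy_vocab)
print(ids)  # [1837, 91, 5021, 7723, 13]
```

Note that the spaces live inside the pieces (" love", not "love") — that detail comes back later when we look at leading-space gotchas.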
Tokens Aren't Words
This is the key insight. Consider the word:
"unbelievable"
Depending on the tokenizer, it might become:
"un"+"believable""un"+"believ"+"able"- or even smaller chunks
The model learns patterns at the token level, not the "word" level.
This is why:
- Spelling tasks can be unreliable
- Counting letters doesn't work well
- Tiny edits can change behavior more than you'd expect
Engineer takeaway: Depending on the tokenizer, the model may not see "tokenization" as a single unit. It sees token chunks that happen to appear together.
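The takeaway above can be made concrete with a toy longest-match splitter. The vocabulary here is invented; real subword inventories are learned from data, and real tokenizers use more sophisticated matching.

```python
def greedy_split(word, vocab):
    """Split a word into the longest known subword pieces, left to right.
    Falls back to single characters, much as byte-level tokenizers
    fall back to raw bytes."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest piece first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])
            i += 1
    return pieces

toy_vocab = {"un", "believ", "able", "token", "ization"}
print(greedy_split("unbelievable", toy_vocab))  # ['un', 'believ', 'able']
print(greedy_split("tokenization", toy_vocab))  # ['token', 'ization']
```

From the model's point of view, "unbelievable" is three units that happen to co-occur — it never sees the twelve letters as one object, which is exactly why letter-level tasks are shaky.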
How Tokenizers Work (No Math)
Tokenization is a compression strategy. If a model tried to store every possible word in every language as a separate unit, it would explode in size.
Instead, modern tokenizers use subword algorithms:
| Algorithm | Common In |
|---|---|
| BPE (incl. byte-level variants) | GPT-style models |
| WordPiece | BERT-family models |
| SentencePiece (tokenizer library) | Often Unigram LM or BPE; used in many seq2seq/multilingual setups |
The Core Idea (BPE-style, simplified)
- Start with every character as its own token
- Count which pairs of tokens appear most often
- Merge the most frequent pair into a new token
- Repeat until you hit your vocabulary size (typically 32K-100K tokens)
The result: common words become single tokens, rare words get split into pieces.
- "the" → single token (very common)
- "tokenization" → multiple tokens (less common)
- "supercalifragilisticexpialidocious" → many tokens (rare)
Why Rephrasing Changes Model Behavior
You've probably experienced this:
Prompt A: "Explain RAG in simple terms."
Prompt B: "Can you explain RAG like I'm new to it?"
To a human, these are "the same." To an LLM, they're different token sequences with different learned associations.
This can change:
- which concepts become "activated"
- the style the model defaults to
- the level of detail it produces
- the probability distribution of the next tokens
Engineer takeaway: When a prompt fails, try controlled rephrases before assuming "the model is bad."
Context Window = Token Budget
A model's context window is measured in tokens, not characters, not words.
That means two prompts of the same character length can consume very different token budgets.
Practical consequences:
- Your long system prompt might be eating most of your window
- Your retrieved RAG chunks might crowd out the question
- Multi-language inputs can reduce how much context fits
Rule of thumb for English: about 3–5 characters per token, but it varies a lot (numbers, code, URLs, and non-English scripts can change it dramatically).
Engineer takeaway: Treat context like memory in a constrained system. Budget it.
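Here is one way to budget it. The character-based estimate and the window/reserve numbers below are illustrative assumptions; for real budgeting, count with the model's own tokenizer.

```python
def rough_token_estimate(text, chars_per_token=4):
    """Very rough heuristic: ~3-5 chars per English token.
    Only a ballpark -- use the model's tokenizer for real counts."""
    return max(1, len(text) // chars_per_token)

def fits_budget(system_prompt, context_chunks, question,
                window=8192, reserve_for_output=1024):
    """Check whether the prompt parts likely fit the context window,
    leaving headroom for the model's reply. Numbers are assumptions."""
    parts = [system_prompt, question] + list(context_chunks)
    used = sum(rough_token_estimate(p) for p in parts)
    return used <= window - reserve_for_output
```

The key design point is the reserve: if you fill the whole window with input, there is nothing left for output, which is a common cause of truncated answers.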
Why Some Languages Cost More Tokens
Tokenizers are trained on whatever text was in their training data.
If a tokenizer's vocabulary is optimized mostly for English scripts, other languages get split into smaller pieces, meaning more tokens for the same content.
This causes:
- Higher token count for equivalent content
- Higher latency/cost in paid APIs
- Earlier truncation in long conversations
This isn't a "language quality" problem. It's a vocabulary coverage problem.
Engineer takeaway: If building for multilingual users, test token counts across your target languages.
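As a quick first check, you can compare UTF-8 byte lengths: byte-level tokenizers fall back to raw bytes for text their vocabulary covers poorly, so byte length is a rough upper bound on token count for under-covered scripts. This is only a proxy; a real audit should count tokens with your model's actual tokenizer across your target languages.

```python
# Byte length as a crude proxy for worst-case token count in
# byte-level tokenizers. Sample phrases are illustrative only.
samples = {
    "English": "Hello, how are you?",
    "Greek": "Γεια σου, τι κάνεις;",
    "Hindi": "नमस्ते, आप कैसे हैं?",
}
for lang, text in samples.items():
    n_bytes = len(text.encode("utf-8"))
    print(f"{lang}: {len(text)} chars, {n_bytes} UTF-8 bytes")
```

ASCII text is one byte per character, while Greek is mostly two and Devanagari mostly three — the same greeting can cost several times the byte budget before the tokenizer's learned merges claw any of it back.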
Common Tokenization Gotchas
1. Leading Spaces Matter
This is tokenizer-dependent, but it's common in GPT-style tokenizers for whitespace to be part of the token.
"Hello" → ["Hello"]
" Hello" → [" Hello"]
These are different tokens. Inconsistent spacing = inconsistent tokens.
2. Numbers Are Weird
"2024" → ["202", "4"] or ["2", "024"] (depends on tokenizer)
"$1,234.56" → multiple tokens
Numbers don't tokenize predictably. Tokenization is one reason arithmetic can be brittle, but the bigger reason is that next-token prediction isn't the same thing as running an exact calculator.
3. Code Has Special Patterns
def calculate_total(items):
Might tokenize as:
["def", " calculate", "_", "total", "(", "items", "):", "\n"]
Code-optimized models (Codex, CodeLlama) handle these patterns better.
4. URLs and Emails Fragment
"user@example.com" → ["user", "@", "example", ".", "com"]
URLs, emails, and paths fragment heavily, using up context and potentially confusing the model.
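A naive punctuation split mimics this fragmentation well enough to eyeball. This is illustrative only — real splits are vocabulary-dependent and differ between tokenizers.

```python
import re

def naive_fragments(text):
    """Split on runs of non-alphanumeric characters, keeping the
    separators -- a rough stand-in for how identifiers fragment."""
    return [t for t in re.split(r"([^A-Za-z0-9]+)", text) if t]

print(naive_fragments("user@example.com"))
# ['user', '@', 'example', '.', 'com']
print(naive_fragments("$1,234.56"))
# ['$', '1', ',', '234', '.', '56']
```

One short email address becomes five pieces, and a formatted price becomes six — multiply that across a page of URLs or log lines and fragmentation eats context fast.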
Token-Aware Prompting Patterns
Here are habits that reduce tokenization surprises:
1. Put Constraints Early
If you want JSON output, put it near the top, not after 10 paragraphs. Early tokens influence what follows.
2. Use Consistent Terminology
Switching terms ("customer" vs "user" vs "client") changes tokens and can drift behavior.
3. Remove Fluff in System Prompts
System prompts are expensive. Every token spent there is budget you don't have for task context.
4. Prefer Explicit Structure
Headings, bullet points, and schemas make it easier for the model to "lock onto" your intent.
What to Shorten vs Keep
When you're hitting context limits:
Shorten
- Verbose instructions - "I would like you to please..." → "Please..."
- Repeated context - Don't paste the same info multiple times
- Excessive examples - Three examples often work as well as ten
Keep
- Specific nouns - Names, technical terms, identifiers
- Structure - Headers, formatting that guides the model
- Key constraints - Requirements the model should follow
Debug Checklist: Tokenization Issues
When your prompt behaves unexpectedly:
- Check token count - Are you near the context limit?
- Look for fragmentation - Are important terms being split weirdly?
- Test rephrasing - Does a simpler wording work better?
- Consider language - For non-English, are you token-inefficient?
- Inspect special characters - URLs, code, numbers fragmenting?
Try This Yourself
Experiment 1: See Tokenization
Use any tokenizer visualizer:
- OpenAI's tokenizer: platform.openai.com/tokenizer
- Hugging Face tokenizer tools
- tiktoken or sentencepiece locally
- Paste: "The tokenizer splits text into tokens."
- Note the colored token splits
- Try the same sentence in another language
- Compare token counts
Experiment 2: Rephrasing Effects
- Ask a model: "Summarize this in 3 bullets: [paragraph]"
- Rephrase: "Give me exactly 3 bullet points summarizing: [same paragraph]"
- Compare: Did structure, tone, or detail change?
Small token-level changes. Noticeable output changes.
Key Takeaways
- LLMs don't read words - they process tokens (subword chunks)
- Small rephrases change token sequences - which can shift behavior
- Context windows are token budgets - manage them like memory
- Tokenization explains "LLM weirdness" - spelling, counting, arithmetic issues
- Token-aware prompting is an engineering skill - not superstition
Key Terms
| Term | Meaning |
|---|---|
| Tokenization | Splitting text into tokens before the model processes it |
| BPE | Byte Pair Encoding - algorithm that merges frequent character pairs |
| WordPiece | Similar to BPE, used by BERT-family models |
| SentencePiece | Language-agnostic tokenizer, good for multilingual |
| Vocabulary | The set of all tokens a model knows (typically 32K-100K) |
| Subword | A token that's part of a word, not a complete word |
What's Next
Now you understand how text becomes tokens. But what happens when the model produces output?
In the next post, we'll cover Decoding & Sampling - how the model chooses which token to output next, and why temperature and top-p settings change the results so dramatically.
In This Series
- What is an LLM? - the fundamentals
- Tokenization (You are here) - why wording matters
- Decoding & Sampling - temperature, top-p, determinism
- Embeddings - how text becomes searchable geometry (coming soon)
Further Reading
- Subword units with BPE in NLP (Sennrich et al., 2015/2016): https://arxiv.org/abs/1508.07909
- SentencePiece tokenizer (Kudo & Richardson, 2018): https://arxiv.org/abs/1808.06226
- BERT (WordPiece pipelines are common in this family): https://arxiv.org/abs/1810.04805
- Token counting / tokenizer visualizers (OpenAI): https://platform.openai.com/tokenizer