Have you ever wondered how those incredibly smart Large Language Models (LLMs) like GPT-4 or Phi-3 work their magic, generating human-like text right before your eyes? It might seem like pure wizardry, but at their core, LLMs rely on two fundamental concepts: tokens and embeddings. Understanding these two building blocks isn’t just for AI researchers; it’s key to grasping how LLMs are built, how they function, and even where the future of Language AI is headed.
So, let’s pull back the curtain and explore these fascinating elements that give us more control over how models are trained and how they generate text.
Tokens: The AI’s Bite-Sized Chunks of Understanding
Imagine text not as a continuous flow of words, but as a sequence of small, manageable pieces. That’s essentially how LLMs “see” and process language. These small chunks are called tokens.
You might have noticed that when an LLM responds to your query in a chat interface, it doesn’t just blurt out the entire answer all at once. Instead, it generates its output response one token at a time. This isn’t just for output; tokens are also the primary way the model interprets your input. When you send a text prompt to an LLM, the very first thing that happens is that your prompt is broken down into these tokens. Before the language model even begins to process your request, it hands your text over to a specialized tool called a tokenizer. This tokenizer’s job is to chop up your prompt into these essential tokens. For instance, if you were to feed a prompt into the GPT-4 tokenizer on the OpenAI Platform, you’d see the text visually broken down, with each token highlighted in a different color.
Let’s look at a practical example to really drive this home. When we interact with an LLM programmatically, like the Phi-3-mini-4k-instruct model, the actual text prompt isn’t what the model directly receives. Instead, the tokenizer takes the prompt, like “Draft an email to Alex apologizing for missing the team presentation and briefly explaining the reason for your absence. <|assistant|>”. It then processes this text and transforms it into something the model can understand: a series of integers. These integers are unique IDs for each specific token (which can be a character, a word, or even just a part of a word). Each ID refers to an internal table within the tokenizer that holds all the tokens it’s familiar with.
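Here’s a minimal sketch of that input side, assuming the Hugging Face transformers library and the publicly available Phi-3-mini-4k-instruct tokenizer (the exact IDs you see will depend on the tokenizer version):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

prompt = ("Draft an email to Alex apologizing for missing the team presentation "
          "and briefly explaining the reason for your absence. <|assistant|>")

# The model never sees this string directly; it sees these integer token IDs.
input_ids = tokenizer(prompt).input_ids

# Map each ID back to the token it stands for, to see how the text was split.
tokens = tokenizer.convert_ids_to_tokens(input_ids)
print(list(zip(tokens, input_ids)))
```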
Just as tokenizers prepare the input, they are also crucial on the output side. When an LLM generates its response, it produces a sequence of token IDs, which then need to be translated back into readable text using the tokenizer.decode method. This process is seamlessly integrated, ensuring that what the model “thinks” in numbers is converted into text we can understand.
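And here’s a hedged sketch of the output side: the model emits token IDs, and tokenizer.decode turns them back into text. The checkpoint name and chat formatting below are assumptions based on the public Phi-3 release, not a specific production setup.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "<|user|>\nWrite a one-line apology for missing a meeting.<|end|>\n<|assistant|>"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# The model produces a sequence of token IDs, not text.
output_ids = model.generate(input_ids, max_new_tokens=40)

# tokenizer.decode translates those IDs back into readable text.
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```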
Different Ways to Tokenize: A Spectrum of Granularity
While subword tokenization, which mixes full words and partial words, is the most common scheme for LLMs today, it’s not the only way text can be broken down. There are four notable approaches to tokenization, each with its own advantages and disadvantages (a short code sketch after the list compares them on the same text):
- Word Tokens: This method breaks text into full words. It was common in older natural language processing (NLP) techniques like word2vec, but is used less frequently in modern LLMs. A significant challenge here is that if a new word appears that wasn’t in the tokenizer’s training data, it simply won’t know how to deal with it. Plus, a vocabulary of only full words can be quite large and include many subtly different tokens (e.g., “apology,” “apologize”).
- Subword Tokens: As the name suggests, this method includes both complete and partial words. Its key benefits are a more expressive vocabulary and the ability to represent new, unseen words by breaking them down into smaller, known subword characters. This is the widely adopted approach for models like GPT and BERT.
- Character Tokens: With this method, every single character (like ‘p’, ‘l’, ‘a’, ‘y’) becomes its own token. This approach excels at handling new words because it always has the “raw letters” to fall back on. However, while tokenizing is easier, modeling becomes more difficult for the LLM. Instead of understanding “play” as one concept, the model has to process “p-l-a-y,” which adds complexity. Additionally, subword tokens are more efficient in terms of context length; a model using subword tokenization can typically fit about three times more text within the same context length limit compared to character tokens.
- Byte Tokens: This advanced method breaks down text into the individual bytes that represent Unicode characters. Sometimes referred to as “tokenization-free encoding,” this can be a very competitive approach, especially in multilingual scenarios, as it can represent any character. It’s worth noting that some subword tokenizers, like those used by GPT-2 and RoBERTa, incorporate bytes as fallback tokens for characters they cannot otherwise represent, but this doesn’t make them purely byte-level tokenizers.
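To make these trade-offs concrete, here’s a small comparison sketch. It assumes the transformers library and the public bert-base-uncased and gpt2 checkpoints; the phrase itself is arbitrary.

```python
from transformers import AutoTokenizer

text = "Apologizing unnecessarily"

# Word-level: naive whitespace split.
print("word tokens:     ", text.split())

# Subword-level: two released tokenizers split the same phrase differently.
for name in ["bert-base-uncased", "gpt2"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(f"subword ({name}):", tok.tokenize(text))

# Character- and byte-level views for comparison.
print("character tokens:", list(text))
print("byte tokens:     ", list(text.encode("utf-8")))
```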
The DNA of a Tokenizer: What Shapes Its Behavior?
How a tokenizer breaks down text isn’t arbitrary; it’s determined by the following major design factors, decided during the model’s creation:
- The Chosen Tokenization Method: This refers to the underlying algorithm that determines how tokens are selected to efficiently represent a text dataset. Popular methods include:
- Byte Pair Encoding (BPE): Widely used by GPT models and many others.
- WordPiece: Utilized by BERT models.
- SentencePiece: A flexible method that supports BPE and the unigram language model, used by models like Flan-T5.
- Tokenizer Parameters and Special Tokens: Beyond the core method, designers make choices about the tokenizer’s initialization:
- Vocabulary Size: How many unique tokens should the tokenizer know? Common sizes range from 30,000 to 50,000, but newer models are pushing to 100,000 or more.
- Special Tokens: These are unique tokens that serve specific functions beyond representing text. They are crucial for communication between the model and the system or for particular tasks. Common examples include:
- Beginning of text (e.g., <BOT>)
- End of text (e.g., <EOT>), signaling that generation is complete.
- Padding token ([PAD]) for filling unused input positions.
- Unknown token ([UNK]) for characters the tokenizer doesn’t recognize.
- Classification token ([CLS]) used for classification tasks.
- Masking token ([MASK]) used to hide tokens during training.
- Separator token ([SEP]) to distinguish different segments of text, like a query and a candidate result in search applications.
- Fill-in-the-middle tokens (e.g., <|fim_prefix|>, <|fim_middle|>, <|fim_suffix|>) enable LLMs to complete text based not only on what comes before but also on what comes after.
- Domain-specific tokens: Models trained for particular domains, like scientific research or code, include specialized tokens. Galactica, for instance, has tokens for citations ([START_REF], [END_REF]), reasoning, mathematics, and even biological sequences. Phi-3 and Llama 2, popular for chat, include chat tokens (<|user|>, <|assistant|>, <|system|>) to distinguish conversational turns and roles.
- Tokenizer Behavior Depends on Training Data: Even with identical methods and parameters, a tokenizer’s behavior varies significantly based on the dataset it was trained on, as the vocabulary is optimized for that specific data.
- Tokenizer Comparisons Across Models:
- BERT Base (uncased): Uses WordPiece. Lowercases text, removes newlines, uses ## for subwords, [UNK] for unknowns (e.g., emojis, Chinese), and wraps input with [CLS] and [SEP].
- BERT Base (cased): Same as above, but preserves capitalization, leading to different token splits for capitalized words.
- GPT-2: Uses BPE. Preserves capitalization and newlines. Breaks emojis and non-English characters into multiple tokens. Represents whitespace and tabs with specific tokens.
- Flan-T5: Uses SentencePiece. Removes newlines and whitespaces, replaces emojis and Chinese characters with unknown tokens—less suitable for code.
- GPT-4: BPE with a large vocab (~100K). Similar to GPT-2 but more efficient—fewer tokens per word, supports long whitespace sequences as single tokens and includes code-specific tokens like elif.
- StarCoder2: BPE, code-focused. Encodes whitespace sequences as single tokens, assigns individual tokens to digits, and includes special tokens for filenames and repos.
- Galactica: BPE (50K vocab), science-focused. Has tokens for citations, math, DNA, etc. Encodes whitespace and tabs as single tokens.
- Phi-3 / LLaMA 2: BPE (32K vocab). Phi-3 adds chat-specific tokens (<|user|>, <|assistant|>, etc.) for conversational context.
These examples clearly show that newer tokenizers have evolved to improve model performance and specialize in tasks like code generation or conversational AI. The specific method, parameters, and training data all contribute to a tokenizer’s unique “personality” and how it prepares text for the LLM.
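To see these design factors in action, here’s a hedged sketch in two parts. The first inspects the vocabulary size, special tokens, and splitting behavior of a few released tokenizers; the second trains two tiny BPE tokenizers on different toy corpora to show how the training data shapes the learned vocabulary. Model names, corpora, and the sample strings are all illustrative assumptions.

```python
# Part 1: inspect design choices baked into released tokenizers.
from transformers import AutoTokenizer

for name in ["bert-base-cased", "gpt2", "google/flan-t5-base"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, "| vocab:", tok.vocab_size, "| special:", tok.all_special_tokens)
    print("  split:", tok.tokenize("def add(a, b):\n\treturn a + b  # comment"))

# Part 2: the same BPE algorithm and vocabulary size, trained on different data,
# learns different merges (toy corpora, purely illustrative).
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def train_tiny_bpe(corpus, vocab_size=200):
    tok = Tokenizer(models.BPE(unk_token="[UNK]"))
    tok.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])
    tok.train_from_iterator(corpus, trainer)
    return tok

prose = ["The committee apologized for the scheduling mistake."] * 100
code = ["def apologize(recipient): return f'Sorry, {recipient}!'"] * 100

sample = "def apologize(team): return 'Sorry'"
print(train_tiny_bpe(prose).encode(sample).tokens)
print(train_tiny_bpe(code).encode(sample).tokens)
```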
My Implementation: Precision with Special Tokens
My own work incorporates these learnings to improve the generation of fine-tuned models. Initially, I used a string-only formatting approach for training data, like ### Prompt: {instruction}\n### Response:{output}.
However, this method had several limitations for instruction-tuned models:
- Inconsistent Stopping Behavior: The model relied solely on learned text patterns to decide when to stop, leading to truncated or overly verbose responses.
- Prompt-Response Boundary Issues: Without clear markers, the model struggled to differentiate between the prompt and its response.
- Limited Control: Basic string formatting offered minimal control over generation boundaries.
- Imprecise Outputs: A Llama 3 8B Instruct model trained this way failed to generate precise answers and kept generating beyond the intended length when the maximum token limit was increased.
To overcome these, I implemented a specialized token-based approach using Llama 3.1’s special tokens. The training data formatting now looks like this: <|begin_of_text|>Prompt:{instruction}<|eom_id|>Response: {output}<|eot_id|><|end_of_text|>. This transition significantly improved the inference process, allowing the model to provide precise answers and stop its generation even when max_new_tokens was set to a large number like 1000. This shift represents a major improvement in reliability and controllability, enhancing the user experience by ensuring consistently formatted and appropriately terminated responses.
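As a hedged illustration, the helper below mirrors the template above, and the snippet that follows shows one common way to make generation stop on <|eot_id|>. The checkpoint name and the commented generation call are assumptions rather than my exact training pipeline.

```python
from transformers import AutoTokenizer

# Build a training example in the special-token format described above.
def format_example(instruction: str, output: str) -> str:
    return (
        "<|begin_of_text|>Prompt:" + instruction + "<|eom_id|>"
        "Response: " + output + "<|eot_id|><|end_of_text|>"
    )

print(format_example(
    "Summarize the meeting notes in two sentences.",
    "The team agreed on the Q3 roadmap and assigned owners to each milestone.",
))

# At inference time, treating <|eot_id|> as the end-of-sequence token lets the
# fine-tuned model stop cleanly even when max_new_tokens is large.
# (Assumes a Llama 3.1 style checkpoint; the name below is illustrative.)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
eot_id = tokenizer.convert_tokens_to_ids("<|eot_id|>")
# outputs = model.generate(**inputs, max_new_tokens=1000, eos_token_id=eot_id)
```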
Influencing LLM Generation: The Strategic “Wait”
Beyond training data formatting, one powerful technique to influence an LLM’s generation process, which I have also incorporated into my fine-tuned model, involves the strategic use of a “Wait” string during output. This method is a core part of a “budget forcing” approach, designed to control the amount of “thinking tokens” a model generates at test time.
Here’s how it works and its impact:
- Encouraging Continued Reasoning: When you want the model to spend more compute on a problem or to continue its reasoning process, you can suppress the generation of the end-of-thinking token delimiter and instead append “Wait” to the model’s current reasoning trace (see the sketch after this list). This intervention encourages the model to reflect on its current generation and explore further.
- Self-Correction and Improved Answers: This forced continuation can lead the model to double-check its answer, often resulting in self-correction and a better answer. For example, the s1-32B model, when forced to continue with “Wait”, corrected an initial incorrect answer of “2” to “3” for a “How many ‘r’ in raspberry?” question.
- Enhanced Reliability and Controllability: The capability of forcing the model to continue its thinking, rather than stopping prematurely, significantly improves the reliability and controllability of the model’s responses by ensuring they are appropriately terminated or extended as needed. This “budget forcing” method has demonstrated perfect control over test-time compute.
- Performance Extrapolation: Experiments have shown that “Wait” generally provides the best performance for extrapolating model performance by encouraging continued reasoning. For instance, using budget forcing with s1-32B allowed for performance extrapolation on the AIME24 math competition benchmark, improving accuracy from 50% to 57%. It also leads to a positive scaling slope, meaning performance generally increases with more thinking tokens.
- Comparison to Other Methods: Budget forcing with “Wait” is superior to other methods like token-conditional control, step-conditional control, or class-conditional control, which often struggle with precise control or can lead to suboptimal reasoning. For example, models often fail to reliably count tokens, making token-conditional control difficult without forcing mechanisms. Rejection sampling, surprisingly, showed an inverse scaling trend where shorter generations tended to be more accurate, suggesting longer generations might involve more backtracking or errors.
- Limitations and Future Directions: While budget forcing can improve performance, it eventually reaches a plateau, and its effectiveness is constrained by the underlying language model’s context window. Future work could explore using a rotation of different strings instead of solely “Wait”, or combining it with frequency penalties or higher temperatures to prevent repetitive loops and push the limits of test-time scaling further. Parallel scaling methods like majority voting or tree search via REBASE also complement sequential scaling methods like budget forcing, offering avenues for scaling beyond fixed context windows.
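Here’s a minimal, hypothetical sketch of that budget-forcing loop. The end-of-thinking delimiter and the generate_fn wrapper are placeholders; this is not the s1 authors’ exact implementation.

```python
# Hypothetical sketch of budget forcing with a "Wait" continuation.
END_OF_THINKING = "<|end_of_thinking|>"  # placeholder delimiter, not a real token name

def budget_forced_generate(generate_fn, prompt: str, forced_continuations: int = 2) -> str:
    """generate_fn(text) returns the newly generated text for the given prefix."""
    trace = prompt
    for _ in range(forced_continuations):
        continuation = generate_fn(trace)
        if END_OF_THINKING in continuation:
            # Suppress the early stop and nudge the model to keep reasoning.
            continuation = continuation.split(END_OF_THINKING)[0] + " Wait"
        trace += continuation
    # Final pass: let the model terminate its reasoning naturally.
    return trace + generate_fn(trace)
```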
Final Thoughts: Tokens—The Quiet Architects of Language AI
This exploration into Large Language Models (LLMs) reveals that tokens and the process of tokenization are foundational to language AI. Tokenizers—using methods like BPE, WordPiece, or SentencePiece—break down raw text into pieces that models can understand and generate, whether those pieces are words, subwords, characters, or even bytes. The choices made in tokenizer design, such as vocabulary size, special tokens, and the nature of the training data, directly influence a model’s flexibility, reliability, and domain expertise.
We’ve seen how special tokens and thoughtful prompt formatting can dramatically improve the precision and control of model outputs, especially for instruction-tuned models. Strategic techniques like “budget forcing” (using the “Wait” string) empower us to influence generation length and reasoning quality, yielding more accurate and reliable responses.
In sum, understanding tokens—their creation, management, and strategic use—is essential for anyone looking to master LLMs. These quiet architects shape how language models interpret, generate, and control text, leading to ever-greater sophistication and user-centricity in AI-powered applications.
References
- Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting; Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation. Transactions of the Association for Computational Linguistics 2022; 10 73–91. doi: https://doi.org/10.1162/tacl_a_00448
- Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel; ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models. Transactions of the Association for Computational Linguistics 2022; 10 291–306. doi: https://doi.org/10.1162/tacl_a_00461
- Neural Machine Translation of Rare Words with Subword Units arXiv:1508.07909 [cs.CL]
- SentencePiece: A simple and language-independent subword tokenizer and detokenizer for Neural Text Processing arXiv:1808.06226 [cs.CL]
- Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates arXiv:1804.10959 [cs.CL]
- Efficient Training of Language Models to Fill in the Middle arXiv:2207.14255 [cs.CL]
- StarCoder 2 and The Stack v2: The Next Generation arXiv:2402.19173 [cs.SE]
- StarCoder: May the source be with you! arXiv:2305.06161 [cs.CL]
- Galactica: A large language model for science https://galactica.org/static/paper.pdf
- https://huggingface.co/docs/transformers/tokenizer_summary
- Efficient Estimation of Word Representations in Vector Space arXiv:1301.3781 [cs.CL]
- Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks arXiv:1908.10084 [cs.CL]
- s1: Simple test-time scaling arXiv:2501.19393 [cs.CL]
- Hands-On Large Language Models by Jay Alammar, Maarten Grootendorst https://www.oreilly.com/library/view/hands-on-large-language/9781098150952/