Tokens are the LEGO bricks of language models. Context windows are their working memory. If you don’t manage both, your AI will crash or cost you a fortune.
You are designing a customer support chatbot for a global e-commerce platform. It handles short queries like “Where’s my order?” flawlessly but crashes when users paste entire 10-page product manuals. The cause is not a bug in the code — it is the fundamental limits of tokens and context windows that govern how AI models process text.
The actual job is to understand what tokens and context windows are, why they matter, and how to design around their constraints. If you don’t, your AI product will either fail silently or cost far more than expected.
Tokens are the smallest units AI understands — the LEGO bricks of language
Tokens are the atomic pieces of text that AI models process. They can be whole words, subwords, or even characters. Imagine building with LEGO bricks — the model assembles meaning by combining these small units.
For example, the word “ChatGPT” is split by OpenAI’s Byte-Pair Encoding tokenizer into three tokens: ["Chat", "G", "PT"]. Similarly, “Unsubscribe” might be tokenized as ["Un", "subscribe"]. This subword tokenization helps the model handle unknown or rare words efficiently.
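To make the subword idea concrete, here is a minimal toy sketch of how Byte-Pair Encoding learns merges. It is an illustration only, not OpenAI's tokenizer: the training words, the function names (`merge_pair`, `learn_bpe_merges`, `apply_merges`), and the number of merges are all assumptions chosen for the example, and real tokenizers work on bytes over very large corpora.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of training words."""
    # Every word starts out as a sequence of single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best_pair = max(pair_counts, key=pair_counts.get)
        merges.append(best_pair)
        # Rewrite the corpus with the most frequent pair merged into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best_pair)] += freq
        corpus = new_corpus
    return merges


def apply_merges(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus: frequent words end up as fewer, longer tokens.
training_words = ["subscribe"] * 5 + ["unsubscribe"] * 3 + ["unhappy"] * 2
merges = learn_bpe_merges(training_words, num_merges=8)
print(apply_merges("unsubscribe", merges))
```

The exact split depends on the corpus and on how ties between equally frequent pairs are broken, which is one reason the same word can tokenize differently across models.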
Tokenization methods matter
Two widely used tokenization methods are:
- Byte-Pair Encoding (BPE): Used by GPT-4 and many other models. It merges frequent pairs of characters into tokens, balancing vocabulary size and computational efficiency.
- WordPiece: Used by BERT. It prioritizes whole words but breaks down rare words into subword tokens, e.g., “tokenization” becomes ["token", "##ization"].
Why does this matter? Tokenization affects model performance and cost in several ways:
- A phrase like “San Francisco” can be tokenized as a single token or two tokens ("San" + "Francisco"), affecting how the model processes its meaning.
- Non-English languages often require more tokens to represent the same content, increasing API usage costs. For example, Hindi or Tamil text may produce more tokens than English for the same sentence. This is a major cost driver for Indian products serving vernacular users.
You can experiment with tokenization yourself using tools like OpenAI's Tokenizer.
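One reason non-English scripts tend to cost more is visible without any tokenizer at all: byte-level BPE tokenizers start from UTF-8 bytes, and Devanagari characters take three bytes each where ASCII letters take one. The sketch below measures only that starting point. It is a rough proxy under that assumption, not a token count; real counts depend on the merges a given tokenizer has learned. The function name and sample sentences are illustrative.

```python
def utf8_profile(text):
    """Return (code points, UTF-8 bytes, bytes per code point) for a string."""
    code_points = len(text)
    byte_count = len(text.encode("utf-8"))
    ratio = byte_count / code_points if code_points else 0.0
    return code_points, byte_count, ratio


# Illustrative sentences with the same meaning ("Where is my order?").
samples = {
    "English": "Where is my order?",
    "Hindi": "मेरा ऑर्डर कहाँ है?",
}

for language, sentence in samples.items():
    code_points, byte_count, ratio = utf8_profile(sentence)
    print(f"{language}: {code_points} code points, {byte_count} bytes, "
          f"{ratio:.2f} bytes per code point")
```

For a real cost estimate, run the target tokenizer on representative text in each language you plan to support.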
Context windows are the AI’s working memory — the maximum tokens it can process at once
A context window defines how many tokens a model can consider in a single interaction. If your input exceeds this limit, the model truncates the excess or fails.
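As a minimal sketch of what "truncates the excess" means in practice, the helper below trims a token list to a budget. The function name and the `reserved_for_output` parameter are assumptions for illustration; they reflect the common case where the prompt and the model's reply share the same window.

```python
def fit_to_window(token_ids, context_window, reserved_for_output=0):
    """Keep only as many tokens as fit, and report how many were dropped."""
    budget = context_window - reserved_for_output
    if budget <= 0:
        raise ValueError("reserved_for_output must be smaller than context_window")
    kept = token_ids[:budget]
    dropped = len(token_ids) - len(kept)
    return kept, dropped


# A 5,000-token input against a 4,096-token window, keeping 500 tokens for the reply.
kept, dropped = fit_to_window(list(range(5000)), context_window=4096, reserved_for_output=500)
print(len(kept), dropped)  # 3596 1404
```

Silent truncation like this is exactly what loses information at the end of a long input, so logging the `dropped` count is a cheap safeguard.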
Here are some example context windows for popular models:
| Model | Context Window (tokens) | What It Can Handle |
|---|---|---|
| GPT-4 Turbo | 128,000 | War and Peace (~588,000 words) in several passes |
| LLaMA 2 | 4,096 | About a 6-page document |
| Claude 2.1 | 200,000 | A 500-page novel in one go |
Longer context windows enable coherent conversations over lengthy documents, useful in legal, financial, or academic settings.
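But there is a trade-off: longer contexts cost more. At about $0.06 per 1,000 input tokens for GPT-4 (rates vary by model variant and change over time), processing all of War and Peace costs around $35 if each word counts as one token, and closer to $46 at a more typical 1.3 tokens per word.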
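The cost arithmetic is simple enough to script. In this sketch the $0.06 rate and the 588,000-word count come from the text above; the 1.3 tokens-per-word ratio is a rough rule of thumb for English and an assumption here, as are the function names.

```python
def estimate_tokens(word_count, tokens_per_word=1.3):
    """Rough token estimate from a word count (assumed ratio, English text)."""
    return round(word_count * tokens_per_word)


def input_cost_usd(num_tokens, usd_per_1k_tokens):
    """Cost of sending `num_tokens` input tokens at a per-1,000-token rate."""
    return num_tokens / 1000 * usd_per_1k_tokens


RATE = 0.06  # USD per 1,000 input tokens, the figure quoted above

# One token per word reproduces the ~$35 figure; 1.3 tokens per word is higher.
for ratio in (1.0, 1.3):
    tokens = estimate_tokens(588_000, ratio)
    print(f"{ratio} tokens/word: {tokens:,} tokens, ${input_cost_usd(tokens, RATE):.2f}")
```

Check current pricing before budgeting; the rate is the one number in this calculation most likely to be out of date.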
Real-world impact in India
Anthropic’s Claude 2.1 analyzes 200k-token contracts for Indian law firms, reducing manual review time by 70%. This kind of capability is transformative for high-stakes document processing.
A common pitfall is neglecting context limits. One Indian startup’s chatbot failed because it truncated critical user feedback at 4,000 tokens, losing vital information and frustrating users.
Scaling laws reveal why bigger AI models are not always better
A key insight from Hoffmann et al. (2022), the Chinchilla paper, is that training large AI models is not just about increasing the number of parameters (model size). For a given compute budget, the amount of training data should scale roughly in proportion to model size, which works out to about 20 training tokens per parameter for Chinchilla.
| Model | Parameters | Training Tokens | Performance |
|---|---|---|---|
| GPT-3 | 175B | 300B | Good |
| Chinchilla | 70B | 1.4T | Better |
Chinchilla used fewer parameters (70 billion vs. 175 billion) but trained on about 4.7 times more data (1.4 trillion vs. 300 billion tokens). This resulted in superior performance compared to GPT-3.
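The numbers in the table are easier to compare as a ratio of training tokens to parameters. A minimal sketch, using only the figures from the table above (the function name is illustrative):

```python
def tokens_per_parameter(params, training_tokens):
    """How many training tokens the model saw for each parameter."""
    return training_tokens / params


# (parameters, training tokens) taken from the table above.
models = {
    "GPT-3": (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}

for name, (params, tokens) in models.items():
    print(f"{name}: {tokens_per_parameter(params, tokens):.1f} tokens per parameter")

# Ratio of the two training sets: how much more data Chinchilla saw.
data_ratio = models["Chinchilla"][1] / models["GPT-3"][1]
print(f"Chinchilla saw {data_ratio:.1f}x the training data of GPT-3")
```

GPT-3 sits below 2 tokens per parameter while Chinchilla sits at about 20, which is the gap the paper argues matters more than raw parameter count.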
Why this matters for Indian startups
Training large models like GPT-4 costs an estimated $100 million, far beyond most startups’ budgets. LLaMA 2’s 70B parameter model cost about $20 million to train.
The takeaway: startups should prioritize data quality and volume over blindly increasing model size. This approach yields better performance for less cost.
Solving the chatbot crash: practical steps to work within token and context limits
The chatbot crashed when processing long product manuals because the input exceeded the model’s context window.
Here are three effective solutions:
- Token counting: Use tools like OpenAI’s tiktoken to count tokens before sending input to the model.
    import tiktoken

    # Load the tokenizer that matches the target model
    encoder = tiktoken.encoding_for_model("gpt-4")
    tokens = encoder.encode("Your text here")
    print(len(tokens))
    # Ensure len(tokens) stays below the context window (e.g., 128,000),
    # leaving room for the model's reply
- Chunking: Split long documents into smaller chunks that fit within the context window. For example, split a 10,000-token manual into 4,000-token overlapping sections to avoid losing context.
- Summarization: Generate summaries of each chunk using GPT-4 before performing full analysis. This reduces tokens sent and focuses on essential information.
Using these methods, the e-commerce platform reduced chatbot errors by 90% and cut API costs by 40%.
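The chunking step can be sketched in a few lines. This is a minimal illustration, not the platform's actual implementation: the function name and the 500-token overlap are assumptions, and the token list here is a stand-in for real token IDs.

```python
def chunk_tokens(token_ids, chunk_size, overlap):
    """Split a token list into chunks of up to `chunk_size`, sharing `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            break  # the last chunk already reaches the end of the input
    return chunks


# A 10,000-token manual split into 4,000-token sections with a 500-token overlap.
manual = list(range(10_000))  # stand-in for real token IDs
chunks = chunk_tokens(manual, chunk_size=4_000, overlap=500)
print([len(chunk) for chunk in chunks])  # [4000, 4000, 3000]
```

With tiktoken, `token_ids` would come from `encoder.encode(text)` and each chunk can be turned back into text with `encoder.decode(chunk)`. Note that overlap means some tokens are sent, and billed, more than once.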
Common pitfalls and debugging token issues
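- Ignoring token limits: Feeding inputs larger than the context window leads to truncation or errors. For example, a 150,000-token input overflows the 128,000-token window cited above. Even when a large input fits, size drives cost: at $0.06 per 1,000 input tokens, 150,000 tokens would cost $9 per query versus $0.36 for 6,000 tokens.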
- Poor chunking: Splitting text at arbitrary points can break sentences and confuse the model. Use structure-aware splitters like langchain.text_splitter.RecursiveCharacterTextSplitter, which tries paragraph and line breaks before falling back to smaller separators, for cleaner splits.
- Overlooking multilingual token costs: Some languages require twice as many tokens as English, doubling costs. Test tokenization early for your target languages.
- Using expensive models unnecessarily: For tasks requiring over 10,000 tokens, consider Claude 2.1 or pre-summarization instead of GPT-4 to optimize costs.
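To illustrate the difference between arbitrary and boundary-aware splitting, here is a minimal standard-library sketch that packs whole sentences into chunks. It is not LangChain's implementation: the regex sentence splitter is naive, the function names are hypothetical, and the default whitespace word count is only a stand-in for a real token counter.

```python
import re


def split_sentences(text):
    """Naive sentence split: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def pack_sentences(text, max_tokens, count_tokens=lambda s: len(s.split())):
    """Group whole sentences into chunks of at most `max_tokens` each.

    A single sentence longer than `max_tokens` is kept intact as its own
    oversized chunk rather than being cut mid-sentence.
    """
    chunks, current, current_len = [], [], 0
    for sentence in split_sentences(text):
        n = count_tokens(sentence)
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


manual_text = "Press the power button. Wait for the light. Then open the app."
print(pack_sentences(manual_text, max_tokens=8))
```

Passing a real counter, for example `lambda s: len(encoder.encode(s))` with a tiktoken encoder, makes the budget match what the model will actually see.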
Quiz: Test your knowledge
- Tokenization splits text into:
  a) Paragraphs
  b) Subwords or symbols
- Which model has the largest context window?
  a) LLaMA 2
  b) Claude 2.1
- Chinchilla outperformed GPT-3 by prioritizing:
  a) More training data
  b) Larger model size
Homework: Hands-on practice
Task 1: Tokenize Your Favorite Book
- Use OpenAI’s Tokenizer Tool to tokenize a paragraph from Pride and Prejudice.
- Compare the token count with a Spanish paragraph of similar length (e.g., from Cien años de soledad).
Task 2: Cost Calculation
- Calculate the cost to process a 50,000-token document with GPT-4.
- Explore how chunking it into 4,000-token sections affects total cost.
Reflection:
How might token limits impact your current projects? What trade-offs would you make between model size and data quality?
Alignment with the broader curriculum
- Prior knowledge: Lesson 1.1 (Introduction to Generative AI) explains that hallucinations are partly caused by token limits truncating critical context.
- Upcoming lessons build on this:
  - Lesson 3.2 explores optimizing Retrieval-Augmented Generation (RAG) pipelines using chunking strategies.
  - Lesson 4.3 applies scaling laws to balance cost and performance in production AI systems.
  - Lesson 5.1 discusses sector-specific token challenges, such as handling lengthy legal documents in finance.
Where to go next
- Learn how to optimize long-form AI interactions: Retrieval-Augmented Generation (RAG)
- Understand model architectures and customization: Transformer Architecture Deep Dive
- Explore open vs. closed-source AI models: Open vs. Closed Source Models
- Master cost-effective AI product design: AI Product Strategy