Tokens are the building blocks, and context windows are the AI’s working memory. Without mastering both, your AI system will crash or cost you a fortune.
You are designing a customer support chatbot for a global e-commerce platform. It handles short queries like “Where’s my order?” flawlessly. But when users paste entire product manuals—10 pages or more—the bot crashes or returns incomplete answers. The reason lies in the fundamental concepts of tokens and context windows—the smallest units AI models process and their maximum working memory.
Ignoring these concepts leads to AI failures, inflated costs, and poor user experience. This lesson teaches you the technical core behind these limits and how they shape practical AI product decisions.
Tokens are the AI’s LEGO bricks
Tokens are the smallest pieces of text an AI model processes. Think of them as LEGO bricks that the model uses to build understanding. Tokens can be whole words, parts of words, or symbols.
For example:
- “ChatGPT” is split into `["Chat", "G", "PT"]` by OpenAI’s Byte-Pair Encoding (BPE) tokenizer.
- “Unsubscribe” breaks down into `["Un", "subscribe"]`. Splits like this are even more frequent in languages that build long words from prefixes and suffixes.
You can experiment with tokenization yourself at https://token-calculator.net/.
How tokenization works
Two main tokenization methods dominate:
- Byte-Pair Encoding (BPE), used by GPT-4 and its family. It merges frequent character pairs to balance vocabulary size and efficiency. For instance, “ch” + “at” become “chat.” This method allows the model to handle rare words by breaking them down into subwords.
- WordPiece, used by BERT. It prioritizes whole words but breaks off suffixes with markers like `##`. For example, “tokenization” becomes `["token", "##ization"]`.
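To make the BPE merge idea concrete, here is a minimal sketch of the training loop on a made-up three-word corpus. The corpus, its frequencies, and the function names are illustrative assumptions; production tokenizers work on bytes and learn tens of thousands of merges.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the most common one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Hypothetical toy corpus: each word starts as a tuple of characters.
corpus = {tuple("chat"): 5, tuple("chart"): 2, tuple("hat"): 3}

merges = []
for _ in range(3):  # learn three merges
    pair = most_frequent_pair(corpus)
    if pair is None:
        break
    merges.append(pair)
    corpus = merge_pair(corpus, pair)

print(merges)  # [('h', 'a'), ('ha', 't'), ('c', 'hat')]
```

After three merges the frequent word "chat" is a single symbol, while the rarer "chart" is still split into smaller pieces. That is the behavior that lets BPE cover rare words with subwords.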
The choice of tokenizer affects how many tokens a piece of text consumes. For example, “San Francisco” might be treated as one token or two ("San" + "Francisco"), impacting both cost and model performance.
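The effect of vocabulary on token count can be sketched with WordPiece-style greedy longest-match splitting. The two vocabularies below are tiny, invented examples, and the function is a simplified illustration rather than BERT's actual implementation.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split; continuation pieces carry a ## prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical vocabularies: the second one also contains the full word.
small_vocab = {"token", "##ization", "##ize"}
large_vocab = small_vocab | {"tokenization"}

print(wordpiece_tokenize("tokenization", small_vocab))  # ['token', '##ization']
print(wordpiece_tokenize("tokenization", large_vocab))  # ['tokenization']
```

The same word costs two tokens with one vocabulary and one token with the other, which is why token counts should be measured with the tokenizer of the model you actually call.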
Why tokens matter in India
Indian languages and code-mixed text often require more tokens than English. This increases the input size and, consequently, the processing cost. If you don’t monitor tokens carefully, your AI product’s cloud bills can spiral out of control.
The actual job is to keep token counts in check to balance cost and user experience.
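One rough way to build intuition before running a real tokenizer is to compare UTF-8 byte lengths. Byte-level BPE tokenizers start from bytes, and Devanagari characters take three bytes each where ASCII letters take one. The sketch below is only a proxy under that assumption, not a token count, and the sample sentences are illustrative.

```python
def utf8_bytes(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))

english = "Where is my order?"
hindi = "मेरा ऑर्डर कहाँ है?"  # the same question in Hindi

ratio = utf8_bytes(hindi) / utf8_bytes(english)
print(f"English: {utf8_bytes(english)} bytes")
print(f"Hindi:   {utf8_bytes(hindi)} bytes")
print(f"Ratio:   {ratio:.1f}x")
```

The real token ratio depends on how many merges the tokenizer learned for each script, so treat this as a warning sign and confirm with the model's own tokenizer.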
Context windows: the AI’s limited working memory
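A context window is the maximum number of tokens an AI model can process in a single interaction, counting both the input and the generated output. If your input exceeds this window, the request fails with an error or the text has to be truncated, dropping whatever falls outside the limit.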
Consider these examples:
| Model | Context Window | What It Can Handle |
|---|---|---|
| GPT-4 | 128,000 tokens | War and Peace (~588,000 words, roughly 780,000 tokens) in about 6 passes |
| LLaMA 2 | 4,096 tokens | Approximately a 6-page document |
| Claude 2.1 | 200,000 tokens | A 500-page novel in one go |
Longer context windows enable more coherent understanding of lengthy documents or conversations. For example, Anthropic’s Claude 2.1 analyzes 200,000-token contracts for law firms, reducing manual review time by 70%.
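A pre-flight check against these limits can be sketched in a few lines. The dictionary keys are illustrative labels rather than API model identifiers, and the 1,000-token reserve for the response is an assumed default you would tune.

```python
# Context window sizes from the table above; keys are illustrative labels.
CONTEXT_WINDOWS = {
    "gpt-4": 128_000,
    "llama-2": 4_096,
    "claude-2.1": 200_000,
}

def fits_in_window(prompt_tokens, model, reserved_for_output=1_000):
    """True if the prompt plus a reserve for the response fits the window."""
    limit = CONTEXT_WINDOWS[model]
    return prompt_tokens + reserved_for_output <= limit

# A pasted manual of about 5,000 tokens:
print(fits_in_window(5_000, "llama-2"))  # False
print(fits_in_window(5_000, "gpt-4"))    # True
```

Reserving space for the output matters because the window is shared: a prompt that fills it completely leaves the model no room to answer.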
Cost implications
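Larger context windows come at a price. GPT-4 charges roughly $0.06 per 1,000 input tokens. War and Peace (~588,000 words) comes to roughly 780,000 tokens at about 0.75 words per token, so feeding it through the model would cost about $47 in input tokens, spread across several requests because it does not fit in one context window.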
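The arithmetic behind such estimates is simple enough to script. This sketch assumes the common rule of thumb of about 0.75 English words per token and the $0.06 rate quoted in this lesson; both are approximations, and treating each word as one token would give a noticeably lower figure.

```python
WORDS_PER_TOKEN = 0.75     # rule-of-thumb assumption for English text
PRICE_PER_1K_INPUT = 0.06  # USD per 1,000 input tokens, as quoted in this lesson

def estimate_tokens(word_count):
    """Convert a word count to an approximate token count."""
    return round(word_count / WORDS_PER_TOKEN)

def input_cost(tokens, price_per_1k=PRICE_PER_1K_INPUT):
    """Input cost in USD for a given number of tokens."""
    return tokens / 1000 * price_per_1k

novel_tokens = estimate_tokens(588_000)
print(novel_tokens)                         # 784000
print(f"${input_cost(novel_tokens):.2f}")   # $47.04
```

Output tokens are billed separately and usually at a different rate, so a full estimate needs a second term for the response length.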
In practice, many startups hit a wall because their chatbot truncates critical context at 4,096 tokens, missing key information and delivering poor answers.
The trap is ignoring context window limits until your system crashes or costs explode.
The lesson from scaling laws: bigger models aren’t always better
It is tempting to assume that larger AI models always perform better. The Chinchilla paper (Hoffmann et al., 2022) showed otherwise.
| Model | Parameters | Training Tokens | Performance |
|---|---|---|---|
| GPT-3 | 175B | 300B | Good |
| Chinchilla | 70B | 1.4T | Better |
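Chinchilla used 40% of GPT-3’s parameters (70B vs. 175B) but roughly 4.7 times as many training tokens (1.4T vs. 300B), and outperformed it. The key insight: many large models were undertrained for their size, so scaling training data alongside parameters can matter more than parameter count alone.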
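The figures in the table can be turned into two quick ratios. The sketch below uses the widely cited approximation that training compute is about 6 FLOPs per parameter per token; that formula and the helper names are assumptions for illustration, not numbers reported in this lesson.

```python
def tokens_per_parameter(params, tokens):
    """How many training tokens the model saw per parameter."""
    return tokens / params

def approx_training_flops(params, tokens):
    """Rough training compute, assuming about 6 FLOPs per parameter per token."""
    return 6 * params * tokens

# (parameters, training tokens) from the table above.
models = {
    "GPT-3": (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}

for name, (params, tokens) in models.items():
    ratio = tokens_per_parameter(params, tokens)
    flops = approx_training_flops(params, tokens)
    print(f"{name}: {ratio:.1f} tokens/param, ~{flops:.2e} FLOPs")
# GPT-3: 1.7 tokens/param, ~3.15e+23 FLOPs
# Chinchilla: 20.0 tokens/param, ~5.88e+23 FLOPs
```

Chinchilla saw about 20 tokens per parameter against fewer than 2 for GPT-3. Under this approximation it also used more total training compute than GPT-3, so the table is not a compute-matched comparison; the paper's compute-matched baseline was Gopher (280B parameters).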
Training costs are staggering:
- GPT-4’s training cost is estimated at around $100 million, with unconfirmed reports putting the model at 1.8 trillion parameters trained on 13 trillion tokens.
- Meta’s LLaMA 2 reportedly cost around $20 million for 70 billion parameters.
For startups, this means it’s smarter to prioritize data quality and efficient compute rather than chasing giant models.
Solving the chatbot crash: practical steps
Your chatbot failed because it received product manuals exceeding the model’s context window. Here is a practical approach to fix this:
- Token counting: Use OpenAI’s `tiktoken` library to count tokens before sending input to the model.

  ```python
  import tiktoken

  # Load the tokenizer that matches the target model.
  encoder = tiktoken.encoding_for_model("gpt-4")
  # Encode the text into token IDs and count them.
  tokens = encoder.encode("Your text here")
  print(len(tokens))
  ```

  Ensure your input stays within the model’s context window (for GPT-4, under 128,000 tokens), leaving room for the response.
- Chunking: Split large documents into smaller sections, such as 4,000-token chunks with 10% overlap to avoid losing context between chunks.
- Summarization: Use GPT-4 to summarize each chunk, then run the comprehensive analysis over the summaries. This reduces token usage while preserving essential information.
By applying these methods, the e-commerce platform reduced chatbot errors by 90% and cut costs by 40%.
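The chunking and summarization steps can be sketched as follows. `chunk_tokens` works on any list of token IDs, and `summarize` is a hypothetical caller-supplied function standing in for a model call; the 4,000-token size and 400-token overlap mirror the 10% overlap suggested above.

```python
def chunk_tokens(tokens, chunk_size=4000, overlap=400):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks

def summarize_in_stages(text_chunks, summarize):
    """Summarize each chunk, then summarize the combined partial summaries.

    `summarize` is a placeholder for a model call that maps text to shorter text.
    """
    partials = [summarize(chunk) for chunk in text_chunks]
    return summarize("\n".join(partials))

# Example with stand-in token IDs for a 10,000-token manual.
manual_tokens = list(range(10_000))
chunks = chunk_tokens(manual_tokens)
print([len(chunk) for chunk in chunks])  # [4000, 4000, 2800]
```

In a real pipeline each token chunk would be decoded back to text before being passed to `summarize_in_stages`. Note that staged summarization multiplies the number of model calls, so the token savings should be weighed against the extra requests.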
Common pitfalls to avoid
- Ignoring token limits: A developer once fed 150,000 tokens to GPT-4, incurring $9 per query instead of $0.36 for 6,000 tokens. The fix is to monitor token counts rigorously.
- Poor chunking strategies: Splitting text arbitrarily can break sentences or lose meaning. Use boundary-aware splitters such as `langchain.text_splitter.RecursiveCharacterTextSplitter` to split at natural boundaries like paragraphs and sentences.
- Overlooking multilingual token costs: Japanese or Hindi text may require twice as many tokens as English for the same content, doubling costs unexpectedly. Test tokenization early for your target languages.
- Using large models unnecessarily: For tasks needing less context, smaller models or summarization can optimize speed and cost.
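To show what splitting at natural boundaries means, here is a simplified standard-library sketch: break on paragraphs first, fall back to sentence ends for long paragraphs, then pack the pieces into chunks. It measures size in characters for simplicity and is not LangChain's implementation; the function names and limits are illustrative.

```python
import re

def split_at_boundaries(text, max_chars=1000):
    """Split on blank lines; split overlong paragraphs at sentence ends."""
    pieces = []
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        if len(paragraph) <= max_chars:
            pieces.append(paragraph)
        else:
            # Break after ., ! or ? when followed by whitespace.
            pieces.extend(re.split(r"(?<=[.!?])\s+", paragraph))
    return pieces

def pack(pieces, max_chars=1000):
    """Greedily join consecutive pieces into chunks of at most max_chars.

    A single piece longer than max_chars is kept whole rather than cut.
    """
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current} {piece}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

text = "First sentence. Second sentence.\n\nShort paragraph."
print(pack(split_at_boundaries(text, max_chars=20), max_chars=20))
# ['First sentence.', 'Second sentence.', 'Short paragraph.']
```

No chunk ends in the middle of a sentence, which is the property arbitrary fixed-width slicing cannot guarantee.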
Field Exercise: Measuring tokens and context in your project (15 min)
- Pick a typical text input your AI product processes (e.g., customer emails, chat logs, documents).
- Use OpenAI’s `tiktoken` or Hugging Face’s `tokenizers` to count tokens for this input.
- Estimate the cost of processing this input with GPT-4 at $0.06 per 1,000 tokens.
- If your input exceeds 80% of your model’s context window, plan a chunking strategy.
- Reflect: How does token count impact your product’s cost and performance? What trade-offs could you make?
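Once you have a token count, the remaining exercise steps reduce to a small worksheet. The sketch below assumes the $0.06 rate and the 80% threshold from the steps above; the function name and return format are illustrative.

```python
def review_input(token_count, context_window, price_per_1k=0.06, threshold=0.8):
    """Report window usage, estimated input cost, and a suggested action."""
    usage = token_count / context_window
    if usage > 1:
        status = "exceeds window"
    elif usage > threshold:
        status = "plan chunking"
    else:
        status = "ok"
    return {
        "usage": usage,
        "cost_usd": token_count / 1000 * price_per_1k,
        "status": status,
    }

# A 3,500-token customer email thread against a 4,096-token window:
report = review_input(3_500, 4_096)
print(report["status"])  # plan chunking
```

Running it over a sample of real inputs, not just one typical case, shows how often your product would cross the threshold.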
Where to go next
- If you want to optimize AI prompts and reduce hallucinations: Prompt Engineering for RAG
- If you want to learn about open vs. closed-source AI models: Open vs. Closed-Source Models
- If you want to understand AI product strategy: AI Product Strategy
- If you want to explore AI latency and real-time applications: Real-Time LLM Applications
You are the PM for a customer support chatbot at a Series A startup in Bangalore. The bot crashes when users paste large product manuals exceeding 5,000 tokens. The engineering lead proposes switching to GPT-4 with a 128k-token context window but warns costs will triple. Your CEO wants a quick fix without budget increases.
The call: What is your recommended approach to fix the chatbot crashes while controlling costs?
Your reasoning: