Tokens, Context Windows, and Scaling Laws

Tokens are the building blocks, and context windows are the AI’s working memory. Without mastering both, your AI system will crash or cost you a fortune.

Talvinder Singh, from a Pragmatic Leaders AI session

You are designing a customer support chatbot for a global e-commerce platform. It handles short queries like “Where’s my order?” flawlessly. But when users paste entire product manuals—10 pages or more—the bot crashes or returns incomplete answers. The reason lies in the fundamental concepts of tokens and context windows—the smallest units AI models process and their maximum working memory.

Ignoring these concepts leads to AI failures, inflated costs, and poor user experience. This lesson teaches you the technical core behind these limits and how they shape practical AI product decisions.

Tokens are the AI’s LEGO bricks

Tokens are the smallest pieces of text an AI model processes. Think of them as LEGO bricks that the model uses to build understanding. Tokens can be whole words, parts of words, or symbols.

For example:

“ChatGPT” is split into ["Chat", "G", "PT"] by OpenAI’s Byte-Pair Encoding (BPE) tokenizer.
“Unsubscribe” may break down into ["Un", "subscribe"]. This kind of subword split is even more common in languages that build long words from prefixes and suffixes.

You can experiment with tokenization yourself at https://token-calculator.net/.

How tokenization works

Two main tokenization methods dominate:

Byte-Pair Encoding (BPE) — used by GPT-4 and its family. It merges frequent character pairs to balance vocabulary size and efficiency. For instance, “ch” + “at” become “chat.” This method allows the model to handle rare words by breaking them down into subwords.
WordPiece — used by BERT. It prioritizes whole words but breaks off suffixes with markers like ##. For example, “tokenization” becomes ["token", "##ization"].
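
To make the BPE merge idea concrete, here is a minimal sketch of the training loop on a toy corpus. It is an illustration only: the corpus, the word counts, and the function names are assumptions, and production tokenizers such as the GPT family's operate on bytes with far larger vocabularies.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the most common pair."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus_counts, num_merges):
    """Start from single characters and apply `num_merges` greedy merges."""
    words = {tuple(word): freq for word, freq in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

# Hypothetical toy corpus: word -> how often it appears
toy_corpus = {"chat": 5, "chap": 2, "bat": 3}
merges, segmented = learn_bpe(toy_corpus, num_merges=3)
print(merges)     # learned merge rules, in order
print(segmented)  # how each word is segmented after the merges
```

On this toy corpus the third merge joins "ch" and "at" into "chat", while the rarer word "chap" stays split into smaller pieces. That is the behavior that lets BPE represent rare words with subwords.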

The choice of tokenizer affects how many tokens a piece of text consumes. For example, “San Francisco” might be treated as one token or two ("San" + "Francisco"), impacting both cost and model performance.
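
A small sketch can show why the vocabulary drives the token count. The greedy longest-match segmenter below is a simplification in the spirit of WordPiece-style inference, and both vocabularies are hypothetical.

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match segmentation; characters not in vocab become single tokens."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate starting at position i first
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Nothing in the vocabulary matched: emit one character
            tokens.append(text[i])
            i += 1
    return tokens

# Two hypothetical vocabularies for the same input
vocab_with_phrase = {"San Francisco"}
vocab_with_words = {"San", " Francisco"}

one_token = greedy_tokenize("San Francisco", vocab_with_phrase)
two_tokens = greedy_tokenize("San Francisco", vocab_with_words)
char_tokens = greedy_tokenize("San Francisco", set())

print(len(one_token), len(two_tokens), len(char_tokens))
```

The same 13-character string costs 1, 2, or 13 tokens depending only on what the vocabulary contains, which is why token counts should be measured with the tokenizer of the model you actually call.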

Why tokens matter in India

Indian languages and code-mixed text often require more tokens than English. This increases the input size and, consequently, the processing cost. If you don’t monitor tokens carefully, your AI product’s cloud bills can spiral out of control.
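
One contributing factor can be checked without any tokenizer, assuming a byte-level BPE scheme: the tokenizer starts from UTF-8 bytes, and Devanagari characters take 3 bytes each where ASCII letters take 1. The sketch below compares raw byte counts; the sample sentences are illustrative, and byte counts are a starting point, not a token count.

```python
def utf8_profile(text):
    """Return character count, UTF-8 byte count, and bytes per character."""
    chars = len(text)
    raw_bytes = len(text.encode("utf-8"))
    return {"chars": chars, "bytes": raw_bytes, "bytes_per_char": raw_bytes / chars}

english = "Where is my order?"
hindi = "मेरा ऑर्डर कहाँ है?"  # roughly the same question in Hindi

english_profile = utf8_profile(english)
hindi_profile = utf8_profile(hindi)

# Print only the numbers so the output is safe on any terminal encoding
print("english bytes/char:", round(english_profile["bytes_per_char"], 2))
print("hindi bytes/char:", round(hindi_profile["bytes_per_char"], 2))
```

A byte-level tokenizer needs learned merges to compress those extra bytes back into few tokens, and vocabularies trained mostly on English tend to have fewer such merges for Indian scripts. Measure real inputs with your model's tokenizer before estimating costs.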

The actual job is to keep token counts in check to balance cost and user experience.

Context windows: the AI’s limited working memory

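A context window is the maximum number of tokens an AI model can process in a single interaction, counting both the input and the generated output. If your input exceeds this window, the API typically rejects the request with an error, or your application has to truncate the text and silently drop content.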

Consider these examples:

Model	Context Window	What It Can Handle
GPT-4 Turbo	128,000 tokens	War and Peace (~588,000 words, roughly 780,000 tokens) in about 7 passes
LLaMA 2	4,096 tokens	Approximately a 6-page document
Claude 2.1	200,000 tokens	A 500-page novel in one go
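
The "what it can handle" column can be estimated with simple arithmetic. The sketch below assumes about 0.75 English words per token, a common rule of thumb rather than a property of any specific tokenizer; the function names are illustrative, and the result shifts noticeably with the assumed ratio.

```python
import math

def estimate_tokens(word_count, words_per_token=0.75):
    """Rough token estimate for English text (the ratio is an assumption)."""
    return math.ceil(word_count / words_per_token)

def passes_needed(token_count, context_window, reserved_for_output=0):
    """How many separate requests are needed to push all tokens through the window."""
    usable = context_window - reserved_for_output
    if usable <= 0:
        raise ValueError("reserved_for_output must be smaller than the context window")
    return math.ceil(token_count / usable)

novel_tokens = estimate_tokens(588_000)  # War and Peace word count from the table
for name, window in [("128k window", 128_000), ("4k window", 4_096), ("200k window", 200_000)]:
    print(name, passes_needed(novel_tokens, window))
```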

Longer context windows enable more coherent understanding of lengthy documents or conversations. For example, Anthropic’s Claude 2.1 analyzes 200,000-token contracts for law firms, reducing manual review time by 70%.

Cost implications

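Larger context windows come at a price. This lesson assumes a GPT-4 rate of roughly $0.06 per 1,000 input tokens; actual rates differ by model variant and change over time. At that rate, a novel-length document like War and Peace (~588,000 words, or roughly 780,000 tokens at about 0.75 words per token) would cost about $47 to process once, and it would have to be split across several requests because it does not fit in a single 128,000-token window.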
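
The cost arithmetic is a one-line formula: tokens divided by 1,000, times the per-1,000-token rate. A minimal sketch, using this lesson's assumed rate of $0.06 per 1,000 input tokens as the default (output tokens are billed separately and are ignored here):

```python
def input_cost_usd(token_count, usd_per_1k_tokens=0.06):
    """Input-side cost of one request; the default rate is the lesson's assumption."""
    return round(token_count / 1000 * usd_per_1k_tokens, 4)

short_query = input_cost_usd(6_000)    # a focused prompt with some context
full_window = input_cost_usd(128_000)  # a request that fills a 128k window

print(f"6,000 tokens:   ${short_query:.2f}")
print(f"128,000 tokens: ${full_window:.2f}")
```

Because the cost is linear in token count, every token you trim from a prompt that runs thousands of times a day is paid back on each call.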

In practice, many startups hit a wall because their chatbot truncates critical context at 4,096 tokens, missing key information and delivering poor answers.

The trap is ignoring context window limits until your system crashes or costs explode.

The lesson from scaling laws: bigger models aren’t always better

Most assume that larger AI models perform better. The Chinchilla paper (Hoffmann et al., 2022) showed that, for a fixed training compute budget, this is not necessarily true.

Model	Parameters	Training Tokens	Performance
GPT-3	175B	300B	Good
Chinchilla	70B	1.4T	Better

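Chinchilla has 40% of GPT-3’s parameters but was trained on roughly 4.7 times as many tokens, and it outperformed GPT-3 on a wide range of benchmarks. The key insight: many large models were undertrained for their size. For a given compute budget, parameters and training tokens should grow together (about 20 tokens per parameter in Chinchilla’s case), so more data often matters more than more parameters.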
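
The comparison in the table can be checked with two quick calculations: tokens per parameter, and an estimate of training compute. The sketch below uses the widely cited approximation of about 6 FLOPs per parameter per training token; that approximation and the function names are assumptions added here, not figures from this lesson.

```python
def tokens_per_parameter(params, tokens):
    """Training tokens seen per model parameter."""
    return tokens / params

def approx_training_flops(params, tokens):
    """Rule-of-thumb training compute: about 6 * parameters * tokens."""
    return 6 * params * tokens

# (parameters, training tokens) taken from the table above
models = {"GPT-3": (175e9, 300e9), "Chinchilla": (70e9, 1.4e12)}

for name, (params, tokens) in models.items():
    ratio = tokens_per_parameter(params, tokens)
    flops = approx_training_flops(params, tokens)
    print(f"{name}: {ratio:.1f} tokens/param, ~{flops:.2e} training FLOPs")
```

GPT-3 saw under 2 tokens per parameter while Chinchilla saw 20, which is the imbalance the paper argues against.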

Training costs are staggering:

Unconfirmed public estimates put GPT-4’s training cost at around $100 million, with a reported 1.8 trillion parameters and 13 trillion training tokens; OpenAI has not published these figures.
Meta’s LLaMA 2 reportedly cost around $20 million for 70 billion parameters.

For startups, this means it’s smarter to prioritize data quality and efficient compute rather than chasing giant models.

Solving the chatbot crash: practical steps

Your chatbot failed because it received product manuals exceeding the model’s context window. Here is a practical approach to fix this:

Token counting: Use OpenAI’s tiktoken library to count tokens before sending input to the model.

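import tiktoken

# Look up the tokenizer that matches the target model
encoder = tiktoken.encoding_for_model("gpt-4")
# encode() returns a list of integer token IDs
tokens = encoder.encode("Your text here")
# The length of that list is the token count the model will see
print(len(tokens))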

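Ensure your input stays within the model’s context window (128,000 tokens in this lesson’s GPT-4 example), leaving headroom for the system prompt and the model’s reply, which count against the same window.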
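
A pre-flight check might look like the sketch below. It takes token counts as plain integers (for example, from the tiktoken call above), and the parameter names and the 1,000-token default reply reservation are assumptions to tune for your own system.

```python
def remaining_budget(prompt_tokens, context_window=128_000,
                     system_tokens=0, reserved_for_reply=1_000):
    """Tokens left after the system prompt, the user input, and the reply reservation."""
    return context_window - system_tokens - reserved_for_reply - prompt_tokens

def fits_in_window(prompt_tokens, **limits):
    """True when the request leaves a non-negative token budget."""
    return remaining_budget(prompt_tokens, **limits) >= 0

# A 5,000-token manual against a 4,096-token model vs. a 128k model
print(fits_in_window(5_000, context_window=4_096))
print(fits_in_window(5_000, context_window=128_000))
```

Running this check before the API call lets the application respond with a helpful message, or fall back to chunking, instead of failing inside the request.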

Chunking: Split large documents into smaller sections, such as 4,000-token chunks with 10% overlap to avoid losing context between chunks.
Summarization: Use GPT-4 to generate summaries of each chunk before a comprehensive analysis. This reduces token usage while preserving essential information.
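
The chunking step can be sketched as a sliding window over a token list. This is a minimal illustration: it works on any list (token IDs from a tokenizer, or plain integers as in the demo), and the function name and defaults mirror the 4,000-token, 10% overlap example above rather than any library's API.

```python
def chunk_tokens(tokens, chunk_size=4000, overlap_ratio=0.10):
    """Split a token list into fixed-size chunks that share an overlapping tail."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    overlap = round(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # how far the window advances each time
    if step <= 0:
        raise ValueError("overlap_ratio must be smaller than 1")
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reached the end of the document
    return chunks

# Stand-in for a 10,000-token document
document = list(range(10_000))
chunks = chunk_tokens(document)
print(len(chunks), [len(c) for c in chunks])
```

Each chunk repeats the last 400 tokens of the previous one, so a sentence that straddles a boundary appears whole in at least one chunk.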

By applying these methods, the e-commerce platform reduced chatbot errors by 90% and cut costs by 40%.

// thread: #product-ai — Team discussing fixes for chatbot crashes

Neha (PM): The chatbot crashes when users paste the full manuals. We need a fix.

Rahul (ML Engineer): Let’s add token counting using tiktoken to block inputs over 128k tokens.

Neha (PM): Great. Also, chunk the manuals into 4k-token sections with overlap to keep context.

Rahul (ML Engineer): We can generate summaries for each chunk before full processing to save tokens.

Neha (PM): Perfect. That should reduce crashes and lower our API costs.
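
The plan from the thread (block oversized input, chunk, then summarize each chunk) could be wired together roughly as follows. This is a sketch under stated assumptions: words stand in for tokens, `summarize` is a hypothetical callable that would wrap a model call in a real system, and the stub used here only keeps the first few words of each chunk.

```python
def build_context(manual_words, summarize, chunk_size=4000, overlap=400,
                  hard_limit=128_000):
    """Gate the input, then return either the raw text or per-chunk summaries."""
    if len(manual_words) > hard_limit:
        raise ValueError("input exceeds the hard limit; ask for a smaller excerpt")
    if len(manual_words) <= chunk_size:
        return " ".join(manual_words)  # small enough to send as-is
    step = chunk_size - overlap
    summaries = []
    for start in range(0, len(manual_words), step):
        summaries.append(summarize(manual_words[start:start + chunk_size]))
        if start + chunk_size >= len(manual_words):
            break
    return "\n".join(summaries)

def first_words_stub(chunk, keep=20):
    """Placeholder summarizer: a real one would call a language model."""
    return " ".join(chunk[:keep])

manual = [f"w{i}" for i in range(10_000)]
context = build_context(manual, first_words_stub)
print(len(context.split("\n")), "chunk summaries")
```

The final answer is then generated from the much shorter `context` rather than from the full manual, which is where the token savings come from.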

Common pitfalls to avoid

Ignoring token limits
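A developer once fed 150,000 tokens of documents to GPT-4 (more than fits in a single 128,000-token window, so spread across requests), paying about $9 per query instead of the $0.36 that a focused 6,000-token prompt would have cost. The fix is to monitor token counts rigorously.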
Poor chunking strategies
Splitting text at arbitrary positions can break sentences or lose meaning. Use a structure-aware splitter such as langchain.text_splitter.RecursiveCharacterTextSplitter, which tries paragraph, line, and word boundaries in turn, so that chunks end at natural boundaries.
Overlooking multilingual token costs
Japanese or Hindi text may require twice as many tokens as English for the same content, doubling costs unexpectedly. Test tokenization early for your target languages.
Using large models unnecessarily
For tasks needing less context, smaller models or summarization can optimize speed and cost.
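
For the chunking pitfall, the idea behind boundary-aware splitting can be sketched in plain Python. This is not LangChain's implementation, only a simplified illustration of the same strategy: try the coarsest separator first and fall back to finer ones when a piece is still too long. Sizes are in characters, and the separator list is an assumption.

```python
def recursive_split(text, max_chars=200, separators=("\n\n", "\n", ". ", " ")):
    """Split text into pieces of at most max_chars, preferring coarse boundaries."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No boundaries left to try: fall back to a hard cut
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:
        return recursive_split(text, max_chars, finer)
    pieces, current = [], ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if len(candidate) <= max_chars:
            current = candidate  # keep packing parts into the current piece
            continue
        if current:
            pieces.append(current)
        if len(part) <= max_chars:
            current = part
        else:
            # This part alone is too long: retry it with finer separators
            pieces.extend(recursive_split(part, max_chars, finer))
            current = ""
    if current:
        pieces.append(current)
    return pieces

sample = "Para one sentence A. Sentence B.\n\nPara two is here."
pieces = recursive_split(sample, max_chars=40)
print(pieces)
```

The separator at each cut point is dropped, so a production splitter would also decide whether to keep punctuation and how much overlap to add between pieces.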

Field Exercise: Measuring tokens and context in your project (15 min)

Pick a typical text input your AI product processes (e.g., customer emails, chat logs, documents).
Use OpenAI’s tiktoken or Hugging Face’s tokenizers to count tokens for this input.
Estimate the cost of processing this input with GPT-4 at $0.06 per 1,000 tokens.
If your input exceeds 80% of your model’s context window, plan a chunking strategy.
Reflect: How does token count impact your product’s cost and performance? What trade-offs could you make?

Where to go next

If you want to optimize AI prompts and reduce hallucinations: Prompt Engineering for RAG
If you want to learn about open vs. closed-source AI models: Open vs. Closed-Source Models
If you want to understand AI product strategy: AI Product Strategy
If you want to explore AI latency and real-time applications: Real-Time LLM Applications

// learn the judgment

You are the PM for a customer support chatbot at a Series A startup in Bangalore. The bot crashes when users paste large product manuals exceeding 5,000 tokens. The engineering lead proposes switching to GPT-4 with a 128k-token context window but warns costs will triple. Your CEO wants a quick fix without budget increases.

The call: What is your recommended approach to fix the chatbot crashes while controlling costs?

Your reasoning:

