Lesson 1: Tokens & Context
- The Illusion: Why len("text") is useless for AI engineering.
- The Currency: Tokens as the fundamental unit you pay for.
- The Math: How to calculate exact costs before sending a request.
- The Limit: The Context Window and why "memory" is finite.
- The Tool: Using tiktoken to budget tokens programmatically.
You build a document summarizer. It works great in testing. Then a user uploads a 200-page contract, and you get a $4.50 bill for a single API call—or worse, a cryptic error about "context length exceeded."
This lesson prevents both disasters.
To build with AI, you must stop thinking in words and start thinking in tokens. When you send a prompt to ChatGPT or Claude, the model doesn't see your text. It sees a sequence of numbers. Understanding this conversion is the difference between a system that's cost-effective and reliable, and one that's expensive and prone to crashing.
1. The Tokenization Pipeline
Computers cannot process language directly; they process numbers. Before your text reaches the model, it passes through a Tokenizer—a dictionary that converts text chunks into integers.
Example:
| Step | Data |
|---|---|
| Input | "AI is amazing" |
| Tokenizer breaks into chunks | ["AI", " is", " amazing"] |
| Maps to Token IDs | [9552, 374, 10419] |
| Model receives | [9552, 374, 10419] |
Notice that " is" includes the leading space; whitespace matters. The model never sees the string "AI is amazing"; it only sees [9552, 374, 10419].
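You can reproduce this pipeline in a few lines. A minimal sketch using tiktoken's cl100k_base encoding (exact IDs vary by encoder, so your numbers may differ):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("AI is amazing")        # the integer sequence the model sees
chunks = [enc.decode([i]) for i in ids]  # map each ID back to its text chunk
print(ids)                               # e.g. [9552, 374, 10419]
print(chunks)                            # note the leading spaces: ['AI', ' is', ' amazing']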
2. Tokens ≠ Words
A token is not a word. It's a chunk of characters that the tokenizer recognizes from its dictionary.
The patterns:
- Common English words → usually 1 token ("the", "apple", "code")
- Compound or rare words → 2-4 tokens ("Counterintuitively" → 3 tokens)
- Punctuation → separate tokens ("Hello!" → 2 tokens: "Hello" + "!")
- Numbers → unpredictable ("100" = 1 token, "101" = 1 token, "$100.00" = 4 tokens)
- Non-English text → often more tokens per word
Rule of Thumb: 1,000 tokens ≈ 750 English words.
For quick estimates, assume the token count runs about a third higher than the word count (1,000 / 750 ≈ 1.33). But always verify with actual tokenization for precision.
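As a sanity check, here is a small sketch comparing the rule-of-thumb estimate against an exact count (the test sentence is arbitrary):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def estimate_tokens(text: str) -> int:
    # Rule of thumb from above: ~1,000 tokens per 750 words, i.e. words * 4/3
    return round(len(text.split()) * 4 / 3)

text = "The quick brown fox jumps over the lazy dog."
print("estimate:", estimate_tokens(text))
print("actual:  ", len(enc.encode(text)))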
Surprising examples you'll encounter:
| Text | Tokens | Why |
|---|---|---|
| ChatGPT | 1 | It's in the dictionary as a single unit |
| chat GPT | 2 | Space and case change split it |
| $100 | 2 | $ and 100 are separate |
| $100.00 | 4 | $, 100, ., 00 |
| 2024-01-15 | 5 | Dates are expensive |
| [email protected] | 5 | Emails fragment badly |
| Zażółć | 4 | Non-English words split into subwords |
(Exact counts depend on the tokenizer; run the script below to verify against cl100k_base.)
This is why you never estimate token counts by counting words or characters.
3. The Context Window
Every model has a maximum Context Window: the total number of tokens it can process in a single request. This is a hard limit that includes everything:
Context Window = System Prompt + Conversation History + User Input + Model's Response
If you try to send 128,001 tokens to a 128K model, the request fails immediately. You can't pay extra for more space. You must engineer your context to fit.
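A quick worked example (the component sizes are hypothetical): with a 128K window, a large document can leave almost no room for the answer.

CONTEXT_WINDOW = 128_000   # e.g. GPT-4o

system_prompt = 1_200      # token counts for each component (hypothetical)
history = 18_000
user_input = 104_000

remaining = CONTEXT_WINDOW - (system_prompt + history + user_input)
print(remaining)           # 4,800 tokens left for the model's response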
4. The Model Landscape
Different models offer different trade-offs between context size, capability, and cost:
| Model | Context Window | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|---|
| GPT-4o | 128K | $2.50 | $10.00 |
| GPT-4o-mini | 128K | $0.15 | $0.60 |
| Claude Sonnet 4 | 200K | $3.00 | $15.00 |
| Claude Haiku | 200K | $0.80 | $4.00 |
| Gemini 1.5 Flash | 1M | $0.075 | $0.30 |
| Gemini 1.5 Pro | 2M | $1.25 | $5.00 |
Prices as of 2025. Always verify at provider pricing pages.
Key insight: A 10x difference in token price means a 10x difference in your bill. Choosing GPT-4o-mini over GPT-4o for simple tasks can reduce costs by 95%.
The Lost-in-the-Middle Effect
Research shows that LLMs perform worse on information placed in the middle of long contexts. When working with large documents (a sketch follows this list):
- Put critical information at the start or end of your prompt
- Don't fill 100% of the context window—aim for 80% max for best reasoning quality
- If you must use the full window, consider chunking and multiple calls
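One way to apply the placement advice, sketched below: instructions at the top, question at the bottom, and the bulky document in between. The tag format is just an illustrative convention, not a provider requirement.

def build_prompt(instructions: str, document: str, question: str) -> str:
    # Critical content at the edges; the bulky document sits in the middle
    return (
        f"{instructions}\n\n"
        f"<document>\n{document}\n</document>\n\n"
        f"{question}"
    )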
5. Hands-On Exercise: Token Economics Calculator
We'll build a single script that grows in three parts: inspect tokens, calculate costs, and check context budgets.
Setup
mkdir ai3c-training
cd ai3c-training
uv init
uv add tiktoken
touch token_economics.py
The Complete Script
"""
Token Economics Calculator
==========================
A practical toolkit for understanding and budgeting LLM token usage.
Part 1: Inspect how text becomes tokens
Part 2: Calculate costs across different models
Part 3: Check if your request fits the context window
"""
import tiktoken
# Load the cl100k_base encoder (GPT-4's tokenizer; a close approximation for other providers)
encoder = tiktoken.get_encoding("cl100k_base")
# ============================================================================
# PART 1: TOKEN INSPECTOR
# ============================================================================
def inspect_tokens(text: str) -> dict:
"""
Reveal exactly how text is tokenized.
Returns a dict with counts and the actual token chunks.
"""
tokens = encoder.encode(text)
chunks = [encoder.decode([t]) for t in tokens]
return {
"text": text,
"char_count": len(text),
"word_count": len(text.split()),
"token_count": len(tokens),
"token_ids": tokens,
"chunks": chunks,
"ratio": f"{len(tokens) / max(len(text.split()), 1):.2f} tokens/word"
}
def print_inspection(text: str):
"""Pretty-print token inspection results."""
result = inspect_tokens(text)
print(f"\n{'─' * 60}")
print(f"Text: \"{result['text']}\"")
print(f"{'─' * 60}")
print(f" Characters: {result['char_count']}")
print(f" Words: {result['word_count']}")
print(f" Tokens: {result['token_count']} ({result['ratio']})")
print(f" Chunks: {result['chunks']}")
# ============================================================================
# PART 2: COST CALCULATOR
# ============================================================================
# Model pricing (USD per 1 Million tokens) - Update as needed
MODEL_PRICING = {
"gpt-4o": {"input": 2.50, "output": 10.00, "context": 128_000},
"gpt-4o-mini": {"input": 0.15, "output": 0.60, "context": 128_000},
"claude-sonnet": {"input": 3.00, "output": 15.00, "context": 200_000},
"claude-haiku": {"input": 0.80, "output": 4.00, "context": 200_000},
"gemini-flash": {"input": 0.075, "output": 0.30, "context": 1_000_000},
"gemini-pro": {"input": 1.25, "output": 5.00, "context": 2_000_000},
}
def calculate_cost(
input_text: str,
expected_output_tokens: int = 500,
model: str = "gpt-4o"
) -> dict:
"""
Calculate the cost of an API call before making it.
Args:
input_text: The full prompt (system + user + history)
expected_output_tokens: Estimated response length
model: Model name from MODEL_PRICING
Returns:
Dict with token counts and costs
"""
if model not in MODEL_PRICING:
raise ValueError(f"Unknown model: {model}. Choose from: {list(MODEL_PRICING.keys())}")
pricing = MODEL_PRICING[model]
input_tokens = len(encoder.encode(input_text))
input_cost = (input_tokens / 1_000_000) * pricing["input"]
output_cost = (expected_output_tokens / 1_000_000) * pricing["output"]
total_cost = input_cost + output_cost
return {
"model": model,
"input_tokens": input_tokens,
"output_tokens": expected_output_tokens,
"total_tokens": input_tokens + expected_output_tokens,
"input_cost": input_cost,
"output_cost": output_cost,
"total_cost": total_cost,
"context_limit": pricing["context"],
"utilization": (input_tokens + expected_output_tokens) / pricing["context"]
}
def print_cost_comparison(input_text: str, expected_output_tokens: int = 500):
"""Compare costs across all models."""
print(f"\n{'═' * 70}")
print("COST COMPARISON ACROSS MODELS")
print(f"{'═' * 70}")
input_tokens = len(encoder.encode(input_text))
print(f"Input: {input_tokens:,} tokens | Expected Output: {expected_output_tokens:,} tokens\n")
print(f"{'Model':<16} {'Input Cost':>12} {'Output Cost':>12} {'Total':>12} {'Utilization':>12}")
print(f"{'-' * 16} {'-' * 12} {'-' * 12} {'-' * 12} {'-' * 12}")
for model in MODEL_PRICING:
result = calculate_cost(input_text, expected_output_tokens, model)
print(f"{model:<16} ${result['input_cost']:>10.4f} ${result['output_cost']:>10.4f} "
f"${result['total_cost']:>10.4f} {result['utilization']:>11.1%}")
# ============================================================================
# PART 3: CONTEXT BUDGET CHECKER
# ============================================================================
def check_context_budget(
system_prompt: str,
user_input: str,
conversation_history: str = "",
expected_output_tokens: int = 500,
model: str = "gpt-4o"
) -> dict:
"""
Before making an API call, verify everything fits in the context window.
This is the function you'll actually use in production.
"""
pricing = MODEL_PRICING.get(model, MODEL_PRICING["gpt-4o"])
system_tokens = len(encoder.encode(system_prompt))
history_tokens = len(encoder.encode(conversation_history)) if conversation_history else 0
input_tokens = len(encoder.encode(user_input))
total_needed = system_tokens + history_tokens + input_tokens + expected_output_tokens
context_limit = pricing["context"]
# Calculate safety margins
fits = total_needed <= context_limit
fits_safely = total_needed <= (context_limit * 0.8) # 80% rule
return {
"breakdown": {
"system_prompt": system_tokens,
"conversation_history": history_tokens,
"user_input": input_tokens,
"reserved_for_output": expected_output_tokens,
},
"total_needed": total_needed,
"context_limit": context_limit,
"utilization": total_needed / context_limit,
"fits": fits,
"fits_safely": fits_safely,
"tokens_remaining": context_limit - total_needed,
"recommendation": (
"✅ Good to go" if fits_safely else
"⚠️ Fits but may degrade quality (>80% utilization)" if fits else
"❌ EXCEEDS CONTEXT LIMIT - will fail"
)
}
def print_budget_check(
system_prompt: str,
user_input: str,
conversation_history: str = "",
expected_output_tokens: int = 500,
model: str = "gpt-4o"
):
"""Pretty-print a context budget check."""
result = check_context_budget(
system_prompt, user_input, conversation_history,
expected_output_tokens, model
)
print(f"\n{'═' * 60}")
print(f"CONTEXT BUDGET CHECK ({model})")
print(f"{'═' * 60}")
print("\nToken Breakdown:")
for component, tokens in result["breakdown"].items():
bar_length = int((tokens / result["context_limit"]) * 40)
bar = "█" * bar_length if bar_length > 0 else "▏"
print(f" {component:<24} {tokens:>8,} tokens {bar}")
print(f"\n {'TOTAL':<24} {result['total_needed']:>8,} tokens")
print(f" {'Context Limit':<24} {result['context_limit']:>8,} tokens")
print(f" {'Utilization':<24} {result['utilization']:>8.1%}")
print(f" {'Remaining':<24} {result['tokens_remaining']:>8,} tokens")
print(f"\n{result['recommendation']}")
# ============================================================================
# DEMO: Run all parts
# ============================================================================
if __name__ == "__main__":
print("\n" + "=" * 70)
print(" TOKEN ECONOMICS CALCULATOR - DEMO")
print("=" * 70)
# ─────────────────────────────────────────────────────────────────────
# PART 1: Inspect surprising tokenizations
# ─────────────────────────────────────────────────────────────────────
print("\n\n📊 PART 1: TOKEN INSPECTION")
print("See how text is actually tokenized (it's not what you expect!)")
test_cases = [
"Hello, world!",
"ChatGPT",
"chat GPT",
"$100.00",
"2024-01-15",
"[email protected]",
"Zażółć gęślą jaźń", # Polish pangram
"The quick brown fox jumps over the lazy dog.",
]
for text in test_cases:
print_inspection(text)
# ─────────────────────────────────────────────────────────────────────
# PART 2: Calculate costs for a realistic document
# ─────────────────────────────────────────────────────────────────────
print("\n\n💰 PART 2: COST COMPARISON")
print("How much does it cost to process a ~10,000 word document?")
# Simulate a business document (~10,000 words ≈ 13,000 tokens)
sample_document = ("This is a sample business document. " * 1000)
print_cost_comparison(sample_document, expected_output_tokens=1000)
# ─────────────────────────────────────────────────────────────────────
# PART 3: Check if a complex request fits
# ─────────────────────────────────────────────────────────────────────
print("\n\n🎯 PART 3: CONTEXT BUDGET CHECK")
print("Will this request fit? Should we proceed?")
system_prompt = """You are a legal document analyzer.
Extract key clauses, identify risks, and summarize in plain English.
Always cite the specific section numbers."""
# Simulate a long contract
user_document = "AGREEMENT made this day... " * 2000 # ~8000 tokens
# Simulate some conversation history
history = "User asked about Section 5. Assistant explained indemnification." * 20
print_budget_check(
system_prompt=system_prompt,
user_input=user_document,
conversation_history=history,
expected_output_tokens=2000,
model="gpt-4o"
)
# Try with a larger context model
print_budget_check(
system_prompt=system_prompt,
user_input=user_document,
conversation_history=history,
expected_output_tokens=2000,
model="claude-sonnet"
)
print("\n" + "=" * 70)
print(" END OF DEMO")
print("=" * 70)
Run the Demo
uv run token_economics.py
What You'll See
The script demonstrates all three capabilities:
- Token Inspection — See exactly how text fragments into tokens, including surprising cases like emails and dates
- Cost Comparison — A table showing what the same request costs across different models (spoiler: GPT-4o-mini is ~17x cheaper than GPT-4o)
- Budget Check — A visual breakdown of where your tokens are going, with a clear pass/fail indicator
6. Try It Yourself
Challenge 1: The Language Tax
Add more languages to the inspection demo:
# Add these test cases
language_tests = [
("English", "The meeting is at three o'clock."),
("Polish", "Spotkanie jest o trzeciej."),
("German", "Das Treffen ist um drei Uhr."),
("Japanese", "会議は3時です。"),
("Arabic", "الاجتماع في الساعة الثالثة"),
]
for lang, text in language_tests:
result = inspect_tokens(text)
print(f"{lang}: {result['token_count']} tokens for {result['word_count']} words")
Question: Which language is most "expensive" in tokens? Why do you think that is?
Challenge 2: Build a Pre-Flight Check
Create a function that should run before every API call in production:
def preflight_check(prompt: str, model: str) -> bool:
"""
Returns True if safe to proceed, False if request should be blocked.
Requirements:
- Must fit in context window
- Must be under 80% utilization
- Must cost less than $0.10 per request
If any check fails, print a warning and return False.
"""
# Your implementation here
pass
Challenge 3: Token Budget Dashboard
Extend check_context_budget to return data suitable for a monitoring dashboard:
def get_budget_metrics(requests: list[dict]) -> dict:
"""
Given a list of requests, return aggregate metrics:
- Total tokens consumed
- Total cost
- Average utilization
- Number of requests that exceeded 80% utilization
- Most expensive request
"""
# Your implementation here
pass
7. Production Patterns
Pattern 1: Always Check Before Calling
# ❌ Bad: Hope it fits
response = client.chat.completions.create(
model="gpt-4o",
messages=messages
)
# ✅ Good: Verify first
budget = check_context_budget(system, user_input, history, model="gpt-4o")
if not budget["fits"]:
raise ValueError(f"Request exceeds context limit: {budget['total_needed']} > {budget['context_limit']}")
if not budget["fits_safely"]:
logger.warning(f"High context utilization: {budget['utilization']:.1%}")
response = client.chat.completions.create(...)
Pattern 2: Automatic Model Fallback
def smart_model_select(input_tokens: int, required_output: int = 1000) -> str:
"""Pick the cheapest model that fits the request."""
total_needed = input_tokens + required_output
# Prefer cheaper models when possible
if total_needed < 100_000:
return "gpt-4o-mini" # Cheapest, good for most tasks
elif total_needed < 180_000:
return "claude-sonnet" # Larger context
else:
return "gemini-pro" # Massive 2M context
Pattern 3: Chunking for Large Documents
def chunk_document(text: str, max_tokens: int = 50_000, overlap: int = 500) -> list[str]:
"""
Split a document into chunks that fit the context window.
Overlap ensures we don't lose context at boundaries.
"""
tokens = encoder.encode(text)
chunks = []
start = 0
    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))
        chunk_tokens = tokens[start:end]
        chunks.append(encoder.decode(chunk_tokens))
        if end == len(tokens):
            break  # Finished; stepping back by the overlap here would repeat the last chunk forever
        start = end - overlap  # Overlap for continuity
    return chunks
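A usage sketch (contract.txt is a hypothetical file): split the document once, then process each chunk in its own call.

with open("contract.txt", encoding="utf-8") as f:  # hypothetical input file
    contract = f.read()

chunks = chunk_document(contract, max_tokens=50_000, overlap=500)
print(f"Split into {len(chunks)} chunks")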
8. Common Pitfalls
| Symptom | Cause | Solution |
|---|---|---|
| Unexpected $50 bill | Didn't count tokens before large batch job | Always run cost calculation before production jobs |
| "Context length exceeded" errors | User uploaded huge document | Implement preflight_check that rejects oversized inputs |
| Model gives worse answers on long docs | "Lost in middle" effect | Put critical info at start/end; stay under 80% utilization |
| Non-English users complain about errors | Same word count = more tokens | Test with target languages; budget 2x tokens for non-English |
| Costs vary wildly per request | Didn't account for output tokens | Include expected output in cost estimates |
9. Key Takeaways
- Tokens ≠ Words. They're chunks of characters. Always measure, never assume.
- You pay twice. Input tokens AND output tokens both cost money. Budget for both.
- Context is finite. System prompt + history + input + output must all fit. No exceptions.
- Model selection matters. GPT-4o-mini can be 95% cheaper than GPT-4o for simple tasks.
- Check before you call. Build preflight_check into every production system.
- Non-English costs more. Budget extra tokens for multilingual applications.
10. What's Next
Now that you understand the currency of AI (tokens), you're ready to learn how to spend it wisely. In Lesson 2, we'll cover the three fundamental prompting patterns—Zero-Shot, Few-Shot, and Chain-of-Thought—and when to use each.
In Lesson 5, we'll actually call the APIs and watch these token calculations turn into real streaming responses.
11. Additional Resources
- OpenAI Tokenizer Visualizer — Interactive tool to see tokenization in real-time
- Anthropic Token Counter — as above, but Claude-specific (unofficial)
- Tiktoken GitHub Repository — The library we used in this lesson
- Lost in the Middle (Research Paper) — Why context position matters