Lesson 1: Tokens & Context

Topics Covered
  • The Illusion: Why len("text") is useless for AI engineering.
  • The Currency: Tokens as the fundamental unit you pay for.
  • The Math: How to calculate exact costs before sending a request.
  • The Limit: The Context Window and why "memory" is finite.
  • The Tool: Using tiktoken to budget tokens programmatically.

You build a document summarizer. It works great in testing. Then a user uploads a 200-page contract, and you get a $4.50 bill for a single API call—or worse, a cryptic error about "context length exceeded."

This lesson prevents both disasters.

To build with AI, you must stop thinking in words and start thinking in tokens. When you send a prompt to ChatGPT or Claude, the model doesn't see your text. It sees a sequence of numbers. Understanding this conversion is the difference between a system that's cost-effective and reliable, and one that's expensive and prone to crashing.

1. The Tokenization Pipeline

Computers cannot process language directly; they process numbers. Before your text reaches the model, it passes through a Tokenizer—a dictionary that converts text chunks into integers.

Example:

Step                           Data
Input                          "AI is amazing"
Tokenizer breaks into chunks   ["AI", " is", " amazing"]
Maps to Token IDs              [9552, 374, 10419]
Model receives                 [9552, 374, 10419]

Notice that " is" includes the leading space; whitespace matters. The model never sees the string "AI is amazing"; it only sees [9552, 374, 10419].
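
You can reproduce this mapping yourself with tiktoken. A minimal sketch (exact IDs depend on the encoding; the values shown assume cl100k_base, matching the table above):

import tiktoken

encoder = tiktoken.get_encoding("cl100k_base")

ids = encoder.encode("AI is amazing")
print(ids)                                 # [9552, 374, 10419] under cl100k_base
print([encoder.decode([i]) for i in ids])  # ['AI', ' is', ' amazing']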

2. Tokens ≠ Words

A token is not a word. It's a chunk of characters that the tokenizer recognizes from its dictionary.

The patterns:

  • Common English words → usually 1 token ("the", "apple", "code")
  • Compound or rare words → 2-4 tokens ("Counterintuitively" → 3 tokens)
  • Punctuation → separate tokens ("Hello!" → 2 tokens: "Hello" + "!")
  • Numbers → unpredictable ("100" = 1 token, "101" = 1 token, "$100.00" = 4 tokens)
  • Non-English text → often more tokens per word

The Golden Ratio

Rule of Thumb: 1,000 tokens ≈ 750 English words.

When estimating costs quickly, remember that the token count runs roughly a third higher than the word count. But always verify with actual tokenization when precision matters.
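
If you just need a quick mental check, that ratio is easy to encode (an estimate only, never a substitute for real tokenization):

def estimate_tokens(word_count: int) -> int:
    """Rough English-only estimate: ~4 tokens for every 3 words."""
    return round(word_count * 4 / 3)

print(estimate_tokens(750))  # ~1000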

Surprising examples you'll encounter:

Text               Tokens   Why
ChatGPT            1        It's in the dictionary as a single unit
chat GPT           2        Space and case change split it
$100               2        "$" and "100" are separate
$100.00            4        "$", "100", ".", "00"
2024-01-15         5        Dates are expensive
[email protected]   5        Emails fragment badly
Zażółć             4        Non-English words split into subwords

This is why you never estimate token counts by counting words or characters.

3. The Context Window

Every model has a maximum Context Window: the total number of tokens it can process in a single request. This is a hard limit that includes everything:

Context Window = System Prompt + Conversation History + User Input + Model's Response

If you try to send 128,001 tokens to a 128K model, the request fails immediately. You can't pay extra for more space. You must engineer your context to fit.
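
Here is what that arithmetic looks like for a hypothetical request against a 128K model (all numbers invented for illustration):

# Hypothetical token budget for a 128,000-token context window
system_prompt = 1_500    # instructions and role definition
history       = 40_000   # prior conversation turns
user_input    = 80_000   # a large uploaded document
response      = 4_000    # space reserved for the model's answer

total = system_prompt + history + user_input + response
print(f"{total:,} / 128,000")  # 125,500 / 128,000 — fits, with only 2,500 tokens to spare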

4. The Model Landscape

Different models offer different trade-offs between context size, capability, and cost:

Model              Context Window   Input (per 1M tokens)   Output (per 1M tokens)
GPT-4o             128K             $2.50                   $10.00
GPT-4o-mini        128K             $0.15                   $0.60
Claude Sonnet 4    200K             $3.00                   $15.00
Claude Haiku       200K             $0.80                   $4.00
Gemini 1.5 Flash   1M               $0.075                  $0.30
Gemini 1.5 Pro     2M               $1.25                   $5.00

Prices as of 2025. Always verify at provider pricing pages.

Key insight: A 10x difference in token price means a 10x difference in your bill. Choosing GPT-4o-mini over GPT-4o for simple tasks can reduce costs by 95%.
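
To make that concrete, here is the arithmetic for a hypothetical daily workload of 1,000 requests, each with 5,000 input tokens and 500 output tokens, at the prices above:

requests, input_tok, output_tok = 1_000, 5_000, 500

def daily_cost(price_in: float, price_out: float) -> float:
    """Workload cost in USD, given per-1M-token prices."""
    return requests * (input_tok / 1e6 * price_in + output_tok / 1e6 * price_out)

print(f"gpt-4o:      ${daily_cost(2.50, 10.00):.2f}")  # $17.50/day
print(f"gpt-4o-mini: ${daily_cost(0.15, 0.60):.2f}")   # $1.05/day — about 17x cheaper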

The "Lost in the Middle" Effect

Research shows that LLMs perform worse on information placed in the middle of long contexts. When working with large documents:

  • Put critical information at the start or end of your prompt (see the sketch after this list)
  • Don't fill 100% of the context window—aim for 80% max for best reasoning quality
  • If you must use the full window, consider chunking and multiple calls
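
A minimal sketch of the first guideline: repeat the task instruction after the document, so the critical text sits at both ends. The exact layout is an assumption, not a requirement:

def build_prompt(instruction: str, document: str) -> str:
    """Place the instruction at the start AND end; the bulky document stays in the middle."""
    return f"{instruction}\n\n---\n{document}\n---\n\nReminder: {instruction}"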

5. Hands-On Exercise: Token Economics Calculator

We'll build a single script that grows in three parts: inspect tokens, calculate costs, and check context budgets.

Setup

mkdir ai3c-training
cd ai3c-training
uv init
uv add tiktoken
touch token_economics.py

The Complete Script

token_economics.py
"""
Token Economics Calculator
==========================
A practical toolkit for understanding and budgeting LLM token usage.

Part 1: Inspect how text becomes tokens
Part 2: Calculate costs across different models
Part 3: Check if your request fits the context window
"""

import tiktoken

# Load a tokenizer. cl100k_base is the GPT-4-family encoding; token counts for
# Claude and Gemini are approximations, since those models use their own tokenizers.
encoder = tiktoken.get_encoding("cl100k_base")

# ============================================================================
# PART 1: TOKEN INSPECTOR
# ============================================================================

def inspect_tokens(text: str) -> dict:
    """
    Reveal exactly how text is tokenized.

    Returns a dict with counts and the actual token chunks.
    """
    tokens = encoder.encode(text)
    chunks = [encoder.decode([t]) for t in tokens]

    return {
        "text": text,
        "char_count": len(text),
        "word_count": len(text.split()),
        "token_count": len(tokens),
        "token_ids": tokens,
        "chunks": chunks,
        "ratio": f"{len(tokens) / max(len(text.split()), 1):.2f} tokens/word"
    }


def print_inspection(text: str):
    """Pretty-print token inspection results."""
    result = inspect_tokens(text)

    print(f"\n{'─' * 60}")
    print(f"Text: \"{result['text']}\"")
    print(f"{'─' * 60}")
    print(f" Characters: {result['char_count']}")
    print(f" Words: {result['word_count']}")
    print(f" Tokens: {result['token_count']} ({result['ratio']})")
    print(f" Chunks: {result['chunks']}")


# ============================================================================
# PART 2: COST CALCULATOR
# ============================================================================

# Model pricing (USD per 1 Million tokens) - Update as needed
MODEL_PRICING = {
    "gpt-4o":        {"input": 2.50,  "output": 10.00, "context": 128_000},
    "gpt-4o-mini":   {"input": 0.15,  "output": 0.60,  "context": 128_000},
    "claude-sonnet": {"input": 3.00,  "output": 15.00, "context": 200_000},
    "claude-haiku":  {"input": 0.80,  "output": 4.00,  "context": 200_000},
    "gemini-flash":  {"input": 0.075, "output": 0.30,  "context": 1_000_000},
    "gemini-pro":    {"input": 1.25,  "output": 5.00,  "context": 2_000_000},
}


def calculate_cost(
    input_text: str,
    expected_output_tokens: int = 500,
    model: str = "gpt-4o"
) -> dict:
    """
    Calculate the cost of an API call before making it.

    Args:
        input_text: The full prompt (system + user + history)
        expected_output_tokens: Estimated response length
        model: Model name from MODEL_PRICING

    Returns:
        Dict with token counts and costs
    """
    if model not in MODEL_PRICING:
        raise ValueError(f"Unknown model: {model}. Choose from: {list(MODEL_PRICING.keys())}")

    pricing = MODEL_PRICING[model]
    input_tokens = len(encoder.encode(input_text))

    input_cost = (input_tokens / 1_000_000) * pricing["input"]
    output_cost = (expected_output_tokens / 1_000_000) * pricing["output"]
    total_cost = input_cost + output_cost

    return {
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": expected_output_tokens,
        "total_tokens": input_tokens + expected_output_tokens,
        "input_cost": input_cost,
        "output_cost": output_cost,
        "total_cost": total_cost,
        "context_limit": pricing["context"],
        "utilization": (input_tokens + expected_output_tokens) / pricing["context"]
    }


def print_cost_comparison(input_text: str, expected_output_tokens: int = 500):
    """Compare costs across all models."""
    print(f"\n{'═' * 70}")
    print("COST COMPARISON ACROSS MODELS")
    print(f"{'═' * 70}")

    input_tokens = len(encoder.encode(input_text))
    print(f"Input: {input_tokens:,} tokens | Expected Output: {expected_output_tokens:,} tokens\n")

    print(f"{'Model':<16} {'Input Cost':>12} {'Output Cost':>12} {'Total':>12} {'Utilization':>12}")
    print(f"{'-' * 16} {'-' * 12} {'-' * 12} {'-' * 12} {'-' * 12}")

    for model in MODEL_PRICING:
        result = calculate_cost(input_text, expected_output_tokens, model)
        print(f"{model:<16} ${result['input_cost']:>10.4f} ${result['output_cost']:>10.4f} "
              f"${result['total_cost']:>10.4f} {result['utilization']:>11.1%}")


# ============================================================================
# PART 3: CONTEXT BUDGET CHECKER
# ============================================================================

def check_context_budget(
    system_prompt: str,
    user_input: str,
    conversation_history: str = "",
    expected_output_tokens: int = 500,
    model: str = "gpt-4o"
) -> dict:
    """
    Before making an API call, verify everything fits in the context window.

    This is the function you'll actually use in production.
    """
    pricing = MODEL_PRICING.get(model, MODEL_PRICING["gpt-4o"])

    system_tokens = len(encoder.encode(system_prompt))
    history_tokens = len(encoder.encode(conversation_history)) if conversation_history else 0
    input_tokens = len(encoder.encode(user_input))

    total_needed = system_tokens + history_tokens + input_tokens + expected_output_tokens
    context_limit = pricing["context"]

    # Calculate safety margins
    fits = total_needed <= context_limit
    fits_safely = total_needed <= (context_limit * 0.8)  # 80% rule

    return {
        "breakdown": {
            "system_prompt": system_tokens,
            "conversation_history": history_tokens,
            "user_input": input_tokens,
            "reserved_for_output": expected_output_tokens,
        },
        "total_needed": total_needed,
        "context_limit": context_limit,
        "utilization": total_needed / context_limit,
        "fits": fits,
        "fits_safely": fits_safely,
        "tokens_remaining": context_limit - total_needed,
        "recommendation": (
            "✅ Good to go" if fits_safely else
            "⚠️ Fits but may degrade quality (>80% utilization)" if fits else
            "❌ EXCEEDS CONTEXT LIMIT - will fail"
        )
    }


def print_budget_check(
    system_prompt: str,
    user_input: str,
    conversation_history: str = "",
    expected_output_tokens: int = 500,
    model: str = "gpt-4o"
):
    """Pretty-print a context budget check."""
    result = check_context_budget(
        system_prompt, user_input, conversation_history,
        expected_output_tokens, model
    )

    print(f"\n{'═' * 60}")
    print(f"CONTEXT BUDGET CHECK ({model})")
    print(f"{'═' * 60}")

    print("\nToken Breakdown:")
    for component, tokens in result["breakdown"].items():
        bar_length = int((tokens / result["context_limit"]) * 40)
        bar = "█" * bar_length if bar_length > 0 else "▏"
        print(f" {component:<24} {tokens:>8,} tokens {bar}")

    print(f"\n {'TOTAL':<24} {result['total_needed']:>8,} tokens")
    print(f" {'Context Limit':<24} {result['context_limit']:>8,} tokens")
    print(f" {'Utilization':<24} {result['utilization']:>8.1%}")
    print(f" {'Remaining':<24} {result['tokens_remaining']:>8,} tokens")

    print(f"\n{result['recommendation']}")


# ============================================================================
# DEMO: Run all parts
# ============================================================================

if __name__ == "__main__":
    print("\n" + "=" * 70)
    print(" TOKEN ECONOMICS CALCULATOR - DEMO")
    print("=" * 70)

    # ─────────────────────────────────────────────────────────────────────
    # PART 1: Inspect surprising tokenizations
    # ─────────────────────────────────────────────────────────────────────
    print("\n\n📊 PART 1: TOKEN INSPECTION")
    print("See how text is actually tokenized (it's not what you expect!)")

    test_cases = [
        "Hello, world!",
        "ChatGPT",
        "chat GPT",
        "$100.00",
        "2024-01-15",
        "[email protected]",
        "Zażółć gęślą jaźń",  # Polish pangram
        "The quick brown fox jumps over the lazy dog.",
    ]

    for text in test_cases:
        print_inspection(text)

    # ─────────────────────────────────────────────────────────────────────
    # PART 2: Calculate costs for a realistic document
    # ─────────────────────────────────────────────────────────────────────
    print("\n\n💰 PART 2: COST COMPARISON")
    print("How much does it cost to process a ~6,000 word document?")

    # Simulate a business document (6 words x 1,000 repetitions ≈ 6,000 words ≈ 7,000 tokens)
    sample_document = "This is a sample business document. " * 1000

    print_cost_comparison(sample_document, expected_output_tokens=1000)

    # ─────────────────────────────────────────────────────────────────────
    # PART 3: Check if a complex request fits
    # ─────────────────────────────────────────────────────────────────────
    print("\n\n🎯 PART 3: CONTEXT BUDGET CHECK")
    print("Will this request fit? Should we proceed?")

    system_prompt = """You are a legal document analyzer.
Extract key clauses, identify risks, and summarize in plain English.
Always cite the specific section numbers."""

    # Simulate a long contract (~7 tokens per repetition, on the order of 14,000 tokens)
    user_document = "AGREEMENT made this day... " * 2000

    # Simulate some conversation history
    history = "User asked about Section 5. Assistant explained indemnification." * 20

    print_budget_check(
        system_prompt=system_prompt,
        user_input=user_document,
        conversation_history=history,
        expected_output_tokens=2000,
        model="gpt-4o"
    )

    # Try with a larger context model
    print_budget_check(
        system_prompt=system_prompt,
        user_input=user_document,
        conversation_history=history,
        expected_output_tokens=2000,
        model="claude-sonnet"
    )

    print("\n" + "=" * 70)
    print(" END OF DEMO")
    print("=" * 70)

Run the Demo

uv run token_economics.py

What You'll See

The script demonstrates all three capabilities:

  1. Token Inspection — See exactly how text fragments into tokens, including surprising cases like emails and dates

  2. Cost Comparison — A table showing what the same request costs across different models (spoiler: GPT-4o-mini is 17x cheaper than GPT-4o)

  3. Budget Check — A visual breakdown of where your tokens are going, with a clear pass/fail indicator

6. Try It Yourself

Challenge 1: The Language Tax

Add more languages to the inspection demo:

# Add these test cases
language_tests = [
    ("English", "The meeting is at three o'clock."),
    ("Polish", "Spotkanie jest o trzeciej."),
    ("German", "Das Treffen ist um drei Uhr."),
    ("Japanese", "会議は3時です。"),
    ("Arabic", "الاجتماع في الساعة الثالثة"),
]

for lang, text in language_tests:
    result = inspect_tokens(text)
    print(f"{lang}: {result['token_count']} tokens for {result['word_count']} words")

Question: Which language is most "expensive" in tokens? Why do you think that is?

Challenge 2: Build a Pre-Flight Check

Create a function that should run before every API call in production:

def preflight_check(prompt: str, model: str) -> bool:
    """
    Returns True if safe to proceed, False if request should be blocked.

    Requirements:
    - Must fit in context window
    - Must be under 80% utilization
    - Must cost less than $0.10 per request

    If any check fails, print a warning and return False.
    """
    # Your implementation here
    pass

Challenge 3: Token Budget Dashboard

Extend check_context_budget to return data suitable for a monitoring dashboard:

def get_budget_metrics(requests: list[dict]) -> dict:
    """
    Given a list of requests, return aggregate metrics:
    - Total tokens consumed
    - Total cost
    - Average utilization
    - Number of requests that exceeded 80% utilization
    - Most expensive request
    """
    # Your implementation here
    pass

7. Production Patterns

Pattern 1: Always Check Before Calling

# ❌ Bad: Hope it fits
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages
)

# ✅ Good: Verify first
budget = check_context_budget(system, user_input, history, model="gpt-4o")
if not budget["fits"]:
    raise ValueError(f"Request exceeds context limit: {budget['total_needed']} > {budget['context_limit']}")
if not budget["fits_safely"]:
    logger.warning(f"High context utilization: {budget['utilization']:.1%}")

response = client.chat.completions.create(...)

Pattern 2: Automatic Model Fallback

def smart_model_select(input_tokens: int, required_output: int = 1000) -> str:
    """Pick the cheapest model that fits the request."""
    total_needed = input_tokens + required_output

    # Prefer cheaper models when possible
    if total_needed < 100_000:
        return "gpt-4o-mini"    # Cheapest, good for most tasks
    elif total_needed < 180_000:
        return "claude-sonnet"  # Larger context
    else:
        return "gemini-pro"     # Massive 2M context

Pattern 3: Chunking for Large Documents

def chunk_document(text: str, max_tokens: int = 50_000, overlap: int = 500) -> list[str]:
    """
    Split a document into chunks that fit the context window.
    Overlap ensures we don't lose context at boundaries.
    """
    tokens = encoder.encode(text)
    chunks = []

    start = 0
    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))
        chunks.append(encoder.decode(tokens[start:end]))
        if end == len(tokens):
            break  # Final chunk; stepping back by `overlap` here would loop forever
        start = end - overlap  # Overlap for continuity

    return chunks
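
A typical driver for chunk_document might look like this (summarize() and contract_text are hypothetical stand-ins for your actual API wrapper and input):

# Summarize each chunk separately, then summarize the combined summaries.
chunks = chunk_document(contract_text, max_tokens=50_000)
partial_summaries = [summarize(chunk) for chunk in chunks]  # hypothetical API call
final_summary = summarize("\n\n".join(partial_summaries))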

8. Common Pitfalls

Symptom                                   Cause                                          Solution
Unexpected $50 bill                       Didn't count tokens before a large batch job   Always run cost calculation before production jobs
"Context length exceeded" errors          User uploaded a huge document                  Implement preflight_check to reject oversized inputs
Model gives worse answers on long docs    "Lost in the middle" effect                    Put critical info at start/end; stay under 80% utilization
Non-English users complain about errors   Same word count = more tokens                  Test with target languages; budget 2x tokens for non-English
Costs vary wildly per request             Didn't account for output tokens               Include expected output in cost estimates

9. Key Takeaways

  1. Tokens ≠ Words. They're chunks of characters. Always measure, never assume.

  2. You pay twice. Input tokens AND output tokens both cost money. Budget for both.

  3. Context is finite. System prompt + history + input + output must all fit. No exceptions.

  4. Model selection matters. GPT-4o-mini can be 95% cheaper than GPT-4o for simple tasks.

  5. Check before you call. Build preflight_check into every production system.

  6. Non-English costs more. Budget extra tokens for multilingual applications.

10. What's Next

Now that you understand the currency of AI (tokens), you're ready to learn how to spend it wisely. In Lesson 2, we'll cover the three fundamental prompting patterns—Zero-Shot, Few-Shot, and Chain-of-Thought—and when to use each.

In Lesson 5, we'll actually call the APIs and watch these token calculations turn into real streaming responses.

11. Additional Resources