Lesson 1: Tokens & Context
- The Illusion: Why len("text") is useless for AI engineering.
- The Currency: Tokens as the fundamental unit you pay for.
- The Math: How to calculate exact costs before sending a request.
- The Limit: The Context Window and why "memory" is finite.
- The Tool: Using tiktoken to budget tokens programmatically.
You build a document summarizer. It works great in testing. Then a user uploads a 200-page contract, and you get a $4.50 bill for a single API call—or worse, a cryptic error about "context length exceeded."
This lesson prevents both disasters.
To build with AI, you must stop thinking in words and start thinking in tokens. When you send a prompt to ChatGPT or Claude, the model doesn't see your text. It sees a sequence of numbers. Understanding this conversion is the difference between a system that's cost-effective and reliable, and one that's expensive and prone to crashing.
1. The Tokenization Pipeline
Computers cannot process language directly; they process numbers. Before your text reaches the model, it passes through a Tokenizer—a dictionary that converts text chunks into integers.
Example:
| Step | Data |
|---|---|
| Input | "AI is amazing" |
| Tokenizer breaks into chunks | ["AI", " is", " amazing"] |
| Maps to Token IDs | [9552, 374, 10419] |
| Model receives | [9552, 374, 10419] |
Notice that " is" includes the leading space; whitespace matters. The model never sees the string "AI is amazing"; it only sees [9552, 374, 10419].
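You can reproduce this pipeline in a few lines. A minimal sketch using tiktoken's cl100k_base encoding (exact IDs vary by encoder, so your numbers may differ):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("AI is amazing")        # the integer sequence the model sees
chunks = [enc.decode([i]) for i in ids]  # map each ID back to its text chunk
print(ids)                               # e.g. [9552, 374, 10419]
print(chunks)                            # note the leading spaces: ['AI', ' is', ' amazing']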
2. Tokens ≠ Words
A token is not a word. It's a chunk of characters that the tokenizer recognizes from its dictionary.
The patterns:
- Common English words → usually 1 token ("the", "apple", "code")
- Compound or rare words → 2-4 tokens ("Counterintuitively" → 3 tokens)
- Punctuation → separate tokens ("Hello!" → 2 tokens: "Hello" + "!")
- Numbers → unpredictable ("100" = 1 token, "101" = 1 token, "$100.00" = 4 tokens)
- Non-English text → often more tokens per word
Rule of Thumb: 1,000 tokens ≈ 750 English words.
For quick estimates, assume the token count runs about a third higher than the word count (1,000 / 750 ≈ 1.33). But always verify with actual tokenization for precision.
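As a sanity check, here is a small sketch comparing the rule-of-thumb estimate against an exact count (the test sentence is arbitrary):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def estimate_tokens(text: str) -> int:
    # Rule of thumb from above: ~1,000 tokens per 750 words, i.e. words * 4/3
    return round(len(text.split()) * 4 / 3)

text = "The quick brown fox jumps over the lazy dog."
print("estimate:", estimate_tokens(text))
print("actual:  ", len(enc.encode(text)))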
Surprising examples you'll encounter:
| Text | Tokens | Why |
|---|---|---|
| ChatGPT | 1 | It's in the dictionary as a single unit |
| chat GPT | 2 | Space and case change split it |
| $100 | 2 | $ and 100 are separate |
| $100.00 | 4 | $, 100, ., 00 |
| 2024-01-15 | 5 | Dates are expensive |
| [email protected] | 5 | Emails fragment badly |
| Zażółć | 4 | Non-English words split into subwords |
(Exact counts depend on the tokenizer; run the script below to verify against cl100k_base.)
This is why you never estimate token counts by counting words or characters.
3. The Context Window
Every model has a maximum Context Window: the total number of tokens it can process in a single request. This is a hard limit that includes everything:
Context Window = System Prompt + Conversation History + User Input + Model's Response
If you try to send 128,001 tokens to a 128K model, the request fails immediately. You can't pay extra for more space. You must engineer your context to fit.
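A quick worked example (the component sizes are hypothetical): with a 128K window, a large document can leave almost no room for the answer.

CONTEXT_WINDOW = 128_000   # e.g. GPT-4o

system_prompt = 1_200      # token counts for each component (hypothetical)
history = 18_000
user_input = 104_000

remaining = CONTEXT_WINDOW - (system_prompt + history + user_input)
print(remaining)           # 4,800 tokens left for the model's response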
4. The Model Landscape
Different models offer different trade-offs between context size, capability, and cost:
| Model | Context Window | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|---|
| GPT-4o | 128K | $2.50 | $10.00 |
| GPT-4o-mini | 128K | $0.15 | $0.60 |
| Claude Sonnet 4 | 200K | $3.00 | $15.00 |
| Claude Haiku | 200K | $0.80 | $4.00 |
| Gemini 1.5 Flash | 1M | $0.075 | $0.30 |
| Gemini 1.5 Pro | 2M | $1.25 | $5.00 |
Prices as of 2025. Always verify at provider pricing pages.
Key insight: A 10x difference in token price means a 10x difference in your bill. Choosing GPT-4o-mini over GPT-4o for simple tasks can reduce costs by 95%.
The Lost-in-the-Middle Effect
Research shows that LLMs perform worse on information placed in the middle of long contexts. When working with large documents (a sketch follows this list):
- Put critical information at the start or end of your prompt
- Don't fill 100% of the context window—aim for 80% max for best reasoning quality
- If you must use the full window, consider chunking and multiple calls
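One way to apply the placement advice, sketched below: instructions at the top, question at the bottom, and the bulky document in between. The tag format is just an illustrative convention, not a provider requirement.

def build_prompt(instructions: str, document: str, question: str) -> str:
    # Critical content at the edges; the bulky document sits in the middle
    return (
        f"{instructions}\n\n"
        f"<document>\n{document}\n</document>\n\n"
        f"{question}"
    )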
5. Hands-On Exercise: Token Economics Calculator
We'll build a single script that grows in three parts: inspect tokens, calculate costs, and check context budgets.
Setup
mkdir ai3c-training
cd ai3c-training
uv init
uv add tiktoken
touch token_economics.py
The Complete Script
"""
Token Economics Calculator
==========================
A practical toolkit for understanding and budgeting LLM token usage.
Part 1: Inspect how text becomes tokens
Part 2: Calculate costs across different models
Part 3: Check if your request fits the context window
"""
import tiktoken
# Load the cl100k_base encoder (GPT-4's tokenizer; a close approximation for other providers)
encoder = tiktoken.get_encoding("cl100k_base")
# ============================================================================
# PART 1: TOKEN INSPECTOR
# ============================================================================
def inspect_tokens(text: str) -> dict:
"""
Reveal exactly how text is tokenized.
Returns a dict with counts and the actual token chunks.
"""
tokens = encoder.encode(text)
chunks = [encoder.decode([t]) for t in tokens]
return {
"text": text,
"char_count": len(text),
"word_count": len(text.split()),
"token_count": len(tokens),
"token_ids": tokens,
"chunks": chunks,
"ratio": f"{len(tokens) / max(len(text.split()), 1):.2f} tokens/word"
}
def print_inspection(text: str):
"""Pretty-print token inspection results."""
result = inspect_tokens(text)
print(f"\n{'─' * 60}")
print(f"Text: \"{result['text']}\"")
print(f"{'─' * 60}")
print(f" Characters: {result['char_count']}")
print(f" Words: {result['word_count']}")
print(f" Tokens: {result['token_count']} ({result['ratio']})")
print(f" Chunks: {result['chunks']}")
# ============================================================================
# PART 2: COST CALCULATOR
# ============================================================================
# Model pricing (USD per 1 Million tokens) - Update as needed
MODEL_PRICING = {
"gpt-4o": {"input": 2.50, "output": 10.00, "context": 128_000},
"gpt-4o-mini": {"input": 0.15, "output": 0.60, "context": 128_000},
"claude-sonnet": {"input": 3.00, "output": 15.00, "context": 200_000},
"claude-haiku": {"input": 0.80, "output": 4.00, "context": 200_000},
"gemini-flash": {"input": 0.075, "output": 0.30, "context": 1_000_000},
"gemini-pro": {"input": 1.25, "output": 5.00, "context": 2_000_000},
}
def calculate_cost(
input_text: str,
expected_output_tokens: int = 500,
model: str = "gpt-4o"
) -> dict:
"""
Calculate the cost of an API call before making it.
Args:
input_text: The full prompt (system + user + history)
expected_output_tokens: Estimated response length
model: Model name from MODEL_PRICING
Returns:
Dict with token counts and costs
"""
if model not in MODEL_PRICING:
raise ValueError(f"Unknown model: {model}. Choose from: {list(MODEL_PRICING.keys())}")
pricing = MODEL_PRICING[model]
input_tokens = len(encoder.encode(input_text))
input_cost = (input_tokens / 1_000_000) * pricing["input"]
output_cost = (expected_output_tokens / 1_000_000) * pricing["output"]
total_cost = input_cost + output_cost
return {
"model": model,
"input_tokens": input_tokens,
"output_tokens": expected_output_tokens,
"total_tokens": input_tokens + expected_output_tokens,
"input_cost": input_cost,
"output_cost": output_cost,
"total_cost": total_cost,
"context_limit": pricing["context"],
"utilization": (input_tokens + expected_output_tokens) / pricing["context"]
}
def print_cost_comparison(input_text: str, expected_output_tokens: int = 500):
"""Compare costs across all models."""
print(f"\n{'═' * 70}")
print("COST COMPARISON ACROSS MODELS")
print(f"{'═' * 70}")
input_tokens = len(encoder.encode(input_text))
print(f"Input: {input_tokens:,} tokens | Expected Output: {expected_output_tokens:,} tokens\n")
print(f"{'Model':<16} {'Input Cost':>12} {'Output Cost':>12} {'Total':>12} {'Utilization':>12}")
print(f"{'-' * 16} {'-' * 12} {'-' * 12} {'-' * 12} {'-' * 12}")
for model in MODEL_PRICING:
result = calculate_cost(input_text, expected_output_tokens, model)
print(f"{model:<16} ${result['input_cost']:>10.4f} ${result['output_cost']:>10.4f} "
f"${result['total_cost']:>10.4f} {result['utilization']:>11.1%}")
# ============================================================================
# PART 3: CONTEXT BUDGET CHECKER
# ============================================================================
def check_context_budget(
system_prompt: str,
user_input: str,
conversation_history: str = "",
expected_output_tokens: int = 500,
model: str = "gpt-4o"
) -> dict:
"""
Before making an API call, verify everything fits in the context window.
This is the function you'll actually use in production.
"""
pricing = MODEL_PRICING.get(model, MODEL_PRICING["gpt-4o"])
system_tokens = len(encoder.encode(system_prompt))
history_tokens = len(encoder.encode(conversation_history)) if conversation_history else 0
input_tokens = len(encoder.encode(user_input))
total_needed = system_tokens + history_tokens + input_tokens + expected_output_tokens
context_limit = pricing["context"]
# Calculate safety margins
fits = total_needed <= context_limit
fits_safely = total_needed <= (context_limit * 0.8) # 80% rule
return {
"breakdown": {
"system_prompt": system_tokens,
"conversation_history": history_tokens,
"user_input": input_tokens,
"reserved_for_output": expected_output_tokens,
},
"total_needed": total_needed,
"context_limit": context_limit,
"utilization": total_needed / context_limit,
"fits": fits,
"fits_safely": fits_safely,
"tokens_remaining": context_limit - total_needed,
"recommendation": (
"✅ Good to go" if fits_safely else
"⚠️ Fits but may degrade quality (>80% utilization)" if fits else
"❌ EXCEEDS CONTEXT LIMIT - will fail"
)
}
def print_budget_check(
system_prompt: str,
user_input: str,
conversation_history: str = "",
expected_output_tokens: int = 500,
model: str = "gpt-4o"
):
"""Pretty-print a context budget check."""
result = check_context_budget(
system_prompt, user_input, conversation_history,
expected_output_tokens, model
)
print(f"\n{'═' * 60}")
print(f"CONTEXT BUDGET CHECK ({model})")
print(f"{'═' * 60}")
print("\nToken Breakdown:")
for component, tokens in result["breakdown"].items():
bar_length = int((tokens / result["context_limit"]) * 40)
bar = "█" * bar_length if bar_length > 0 else "▏"
print(f" {component:<24} {tokens:>8,} tokens {bar}")
print(f"\n {'TOTAL':<24} {result['total_needed']:>8,} tokens")
print(f" {'Context Limit':<24} {result['context_limit']:>8,} tokens")
print(f" {'Utilization':<24} {result['utilization']:>8.1%}")
print(f" {'Remaining':<24} {result['tokens_remaining']:>8,} tokens")
print(f"\n{result['recommendation']}")
# ============================================================================
# DEMO: Run all parts
# ============================================================================
if __name__ == "__main__":
print("\n" + "=" * 70)
print(" TOKEN ECONOMICS CALCULATOR - DEMO")
print("=" * 70)
# ─────────────────────────────────────────────────────────────────────
# PART 1: Inspect surprising tokenizations
# ─────────────────────────────────────────────────────────────────────
print("\n\n📊 PART 1: TOKEN INSPECTION")
print("See how text is actually tokenized (it's not what you expect!)")
test_cases = [
"Hello, world!",
"ChatGPT",
"chat GPT",
"$100.00",
"2024-01-15",
"[email protected]",
"Zażółć gęślą jaźń", # Polish pangram
"The quick brown fox jumps over the lazy dog.",
]
for text in test_cases:
print_inspection(text)
# ─────────────────────────────────────────────────────────────────────
# PART 2: Calculate costs for a realistic document
# ─────────────────────────────────────────────────────────────────────
print("\n\n💰 PART 2: COST COMPARISON")
print("How much does it cost to process a ~10,000 word document?")
# Simulate a business document (~10,000 words ≈ 13,000 tokens)
sample_document = ("This is a sample business document. " * 1000)
print_cost_comparison(sample_document, expected_output_tokens=1000)
# ─────────────────────────────────────────────────────────────────────
# PART 3: Check if a complex request fits
# ─────────────────────────────────────────────────────────────────────
print("\n\n🎯 PART 3: CONTEXT BUDGET CHECK")
print("Will this request fit? Should we proceed?")
system_prompt = """You are a legal document analyzer.
Extract key clauses, identify risks, and summarize in plain English.
Always cite the specific section numbers."""
# Simulate a long contract
user_document = "AGREEMENT made this day... " * 2000 # ~8000 tokens
# Simulate some conversation history
history = "User asked about Section 5. Assistant explained indemnification." * 20
print_budget_check(
system_prompt=system_prompt,
user_input=user_document,
conversation_history=history,
expected_output_tokens=2000,
model="gpt-4o"
)
# Try with a larger context model
print_budget_check(
system_prompt=system_prompt,
user_input=user_document,
conversation_history=history,
expected_output_tokens=2000,
model="claude-sonnet"
)
print("\n" + "=" * 70)
print(" END OF DEMO")
print("=" * 70)
Run the Demo
uv run token_economics.py
What You'll See
The script demonstrates all three capabilities:
- Token Inspection — See exactly how text fragments into tokens, including surprising cases like emails and dates
- Cost Comparison — A table showing what the same request costs across different models (spoiler: GPT-4o-mini is ~17x cheaper than GPT-4o)
- Budget Check — A visual breakdown of where your tokens are going, with a clear pass/fail indicator
6. Try It Yourself
Challenge 1: The Language Tax
Add more languages to the inspection demo:
# Add these test cases
language_tests = [
("English", "The meeting is at three o'clock."),
("Polish", "Spotkanie jest o trzeciej."),
("German", "Das Treffen ist um drei Uhr."),
("Japanese", "会議は3時です。"),
("Arabic", "الاجتماع في الساعة الثالثة"),
]
for lang, text in language_tests:
result = inspect_tokens(text)
print(f"{lang}: {result['token_count']} tokens for {result['word_count']} words")
Question: Which language is most "expensive" in tokens? Why do you think that is?
Challenge 2: Build a Pre-Flight Check
Create a function that should run before every API call in production:
def preflight_check(prompt: str, model: str) -> bool:
"""
Returns True if safe to proceed, False if request should be blocked.
Requirements:
- Must fit in context window
- Must be under 80% utilization
- Must cost less than $0.10 per request
If any check fails, print a warning and return False.
"""
# Your implementation here
pass
Challenge 3: Token Budget Dashboard
Extend check_context_budget to return data suitable for a monitoring dashboard:
def get_budget_metrics(requests: list[dict]) -> dict:
"""
Given a list of requests, return aggregate metrics:
- Total tokens consumed
- Total cost
- Average utilization
- Number of requests that exceeded 80% utilization
- Most expensive request
"""
# Your implementation here
pass
7. Production Patterns
Pattern 1: Always Check Before Calling
# ❌ Bad: Hope it fits
response = client.chat.completions.create(
model="gpt-4o",
messages=messages
)
# ✅ Good: Verify first
budget = check_context_budget(system, user_input, history, model="gpt-4o")
if not budget["fits"]:
raise ValueError(f"Request exceeds context limit: {budget['total_needed']} > {budget['context_limit']}")
if not budget["fits_safely"]:
logger.warning(f"High context utilization: {budget['utilization']:.1%}")
response = client.chat.completions.create(...)
Pattern 2: Automatic Model Fallback
def smart_model_select(input_tokens: int, required_output: int = 1000) -> str:
"""Pick the cheapest model that fits the request."""
total_needed = input_tokens + required_output
# Prefer cheaper models when possible
if total_needed < 100_000:
return "gpt-4o-mini" # Cheapest, good for most tasks
elif total_needed < 180_000:
return "claude-sonnet" # Larger context
else:
return "gemini-pro" # Massive 2M context
Pattern 3: Chunking for Large Documents
def chunk_document(text: str, max_tokens: int = 50_000, overlap: int = 500) -> list[str]:
"""
Split a document into chunks that fit the context window.
Overlap ensures we don't lose context at boundaries.
"""
tokens = encoder.encode(text)
chunks = []
start = 0
    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))
        chunk_tokens = tokens[start:end]
        chunks.append(encoder.decode(chunk_tokens))
        if end == len(tokens):
            break  # Finished; stepping back by the overlap here would repeat the last chunk forever
        start = end - overlap  # Overlap for continuity
    return chunks
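A usage sketch (contract.txt is a hypothetical file): split the document once, then process each chunk in its own call.

with open("contract.txt", encoding="utf-8") as f:  # hypothetical input file
    contract = f.read()

chunks = chunk_document(contract, max_tokens=50_000, overlap=500)
print(f"Split into {len(chunks)} chunks")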
8. Common Pitfalls
| Symptom | Cause | Solution |
|---|---|---|
| Unexpected $50 bill | Didn't count tokens before large batch job | Always run cost calculation before production jobs |
| "Context length exceeded" errors | User uploaded huge document | Implement preflight_check that rejects oversized inputs |
| Model gives worse answers on long docs | "Lost in middle" effect | Put critical info at start/end; stay under 80% utilization |
| Non-English users complain about errors | Same word count = more tokens | Test with target languages; budget 2x tokens for non-English |
| Costs vary wildly per request | Didn't account for output tokens | Include expected output in cost estimates |
9. Key Takeaways
- Tokens ≠ Words. They're chunks of characters. Always measure, never assume.
- You pay twice. Input tokens AND output tokens both cost money. Budget for both.
- Context is finite. System prompt + history + input + output must all fit. No exceptions.
- Model selection matters. GPT-4o-mini can be 95% cheaper than GPT-4o for simple tasks.
- Check before you call. Build preflight_check into every production system.
- Non-English costs more. Budget extra tokens for multilingual applications.
10. What's Next
Now that you understand the currency of AI (tokens), you're ready to learn how to spend it wisely. In Lesson 2, we'll cover the three fundamental prompting patterns—Zero-Shot, Few-Shot, and Chain-of-Thought—and when to use each.
In Lesson 5, we'll actually call the APIs and watch these token calculations turn into real streaming responses.
11. Additional Resources
- OpenAI Tokenizer Visualizer — Interactive tool to see tokenization in real-time
- Anthropic Token Counter — as above, but Claude-specific (unofficial)
- Tiktoken GitHub Repository — The library we used in this lesson
- Lost in the Middle (Research Paper) — Why context position matters