
Lesson 4: Generation Parameters — Controlling Creativity

Topics Covered
  • The Probability Engine: Why LLMs work with logits, not probabilities directly.
  • Temperature: The "chaos dial" that scales logits before softmax.
  • Sampling Methods: Top-p, Top-k, Min-p, and Tail-Free Sampling.
  • Mirostat: An alternative approach using perplexity targeting.
  • Repetition Control: Frequency penalty, presence penalty, and repeat windows.
  • Output Controls: Max tokens, stop sequences, and context size.
  • Local LLMs: Ollama-specific parameters like num_ctx and num_predict.

You've crafted the perfect prompt. You've written an airtight system message. You send the request and get... a wildly creative response when you needed precise JSON. Or a robotic, repetitive answer when you wanted engaging copy. The problem isn't your prompt—it's your generation parameters.

These are the "knobs" that control how the model selects its next token. Get them wrong, and even perfect prompts produce wrong outputs.

1. The Probability Engine

Before we touch any settings, we need to understand what's actually happening when an LLM generates text.

At each step, the model doesn't "know" what to say next. It calculates a probability distribution over its entire vocabulary (50,000+ tokens). Every token gets a probability score.

Prompt: "The capital of France is"

Token Probabilities:
"Paris" → 92.3%
"the" → 2.1%
"a" → 1.4%
"located" → 0.8%
"definitely"→ 0.4%
... (50,000 more tokens with tiny probabilities)

The model then samples from this distribution to pick the next token. This is where generation parameters come in—they control how this sampling happens.
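
To make "sampling" concrete, here's a tiny self-contained sketch using made-up probabilities (a real model works over the full vocabulary, and with logits, as described next):

import random

# Hypothetical next-token distribution for "The capital of France is"
token_probs = {
    "Paris": 0.923,
    "the": 0.021,
    "a": 0.014,
    "located": 0.008,
    "definitely": 0.004,
    # ... tens of thousands more tokens with tiny probabilities in a real model
}

# Sampling: pick one token, weighted by its probability
tokens = list(token_probs.keys())
weights = list(token_probs.values())
next_token = random.choices(tokens, weights=weights, k=1)[0]
print(next_token)  # Usually "Paris", occasionally something else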

2. Temperature: The Chaos Dial

Temperature controls how "sharp" or "flat" the probability distribution is before sampling. But to really understand it, we need to look at what happens under the hood.

Under the Hood: Logits

The model doesn't actually compute probabilities directly. It works with logits—raw, unnormalized scores that typically land somewhere around -10 to +10.

Before softmax (raw logits):
"Paris" → 8.2
"the" → 2.1
"a" → 1.4
"located" → -0.8
"banana" → -9.5

These logits are converted to probabilities using the softmax function, which:

  1. Exponentiates each logit
  2. Divides by the sum of all exponentials
  3. Results in numbers between 0 and 1 that sum to 1

Temperature scales the logits BEFORE softmax:

# Simplified: how temperature affects logits
scaled_logits = original_logits / temperature
probabilities = softmax(scaled_logits)

The Math (Simplified)

  • Temperature < 1 → Dividing by a number less than 1 stretches the logits apart, so differences get BIGGER and the top tokens dominate after softmax.
  • Temperature = 1 → Use the logits as-is. (Default behavior)
  • Temperature > 1 → Dividing by a number greater than 1 squashes the logits together, so differences get SMALLER and low-probability tokens get boosted.
  • Temperature = 0 → Special case: always pick the highest logit. Deterministic.
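
You can see this effect with a few lines of Python. This is a self-contained sketch using the toy logits from the example above:

import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by temperature, then apply softmax."""
    if temperature == 0:
        # Special case: greedy decoding — all probability on the highest logit
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    scaled = [x / temperature for x in logits]
    exps = [math.exp(x) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [8.2, 2.1, 1.4, -0.8]  # "Paris", "the", "a", "located"
for t in (0.3, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
# Lower temperature → the top logit dominates even more;
# higher temperature → the distribution flattens out.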

Visual Intuition

Prompt: "The weather today is"

┌─────────────────────────────────────────────────────────────────┐
│ Temperature = 0.0 (Deterministic) │
│ ████████████████████████████████████████ "sunny" (100%) │
│ │
│ Temperature = 0.3 (Focused) │
│ ████████████████████████████████ "sunny" (85%) │
│ ████ "nice" (10%) │
│ █ "warm" (5%) │
│ │
│ Temperature = 1.0 (Default) │
│ ██████████████████ "sunny" (45%) │
│ ████████ "nice" (20%) │
│ ██████ "warm" (15%) │
│ ████ "beautiful" (10%) │
│ ██ "perfect" (5%) │
│ █ others (5%) │
│ │
│ Temperature = 1.5 (Creative) │
│ ██████████ "sunny" (25%) │
│ ██████ "nice" (15%) │
│ █████ "warm" (12%) │
│ ████ "beautiful" (10%) │
│ ████ "absolutely" (8%) │
│ ███ "quite" (7%) │
│ ██████████ others (23%) │
└─────────────────────────────────────────────────────────────────┘

When to Use What

| Temperature | Behavior | Best For |
|---|---|---|
| 0 | Always pick the most likely token | JSON generation, code, factual Q&A, deterministic outputs |
| 0.1 - 0.3 | Very focused, minimal variation | Data extraction, classification, structured output |
| 0.5 - 0.7 | Balanced creativity and coherence | General chat, explanations, summarization |
| 0.8 - 1.0 | More variety, occasionally surprising | Creative writing, brainstorming, marketing copy |
| 1.2 - 1.5 | High creativity, risk of incoherence | Poetry, experimental content, breaking writer's block |
| > 1.5 | Chaos mode | Almost never useful in production |

Temperature 0 Isn't Truly Deterministic

Even at temperature 0, you might see slight variations due to floating-point math and GPU parallelism. For truly reproducible outputs, also set a seed parameter (if the API supports it).
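
For example, with the OpenAI client you can combine temperature 0 with a fixed seed. This is a sketch; seed support is best-effort and not available from every provider or model:

from openai import OpenAI

client = OpenAI()

# Pin down as much randomness as the API allows
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Give me one word for 'happy'."}],
    temperature=0,
    seed=42,        # best-effort reproducibility across identical requests
    max_tokens=10,
)
print(response.choices[0].message.content)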

3. Top-p (Nucleus Sampling): The Candidate Filter

Top-p (also called nucleus sampling) takes a different approach: instead of reshaping probabilities, it limits which tokens are even considered.

How It Works

  1. Sort all tokens by probability (highest first)
  2. Add tokens to the "candidate pool" until their cumulative probability reaches p
  3. Sample only from this pool

Top-p = 0.9 means: "Only consider tokens that together account for 90% of probability mass"

Prompt: "The weather today is"

All tokens (sorted by probability):
"sunny" → 45% ✓ (cumulative: 45%)
"nice" → 20% ✓ (cumulative: 65%)
"warm" → 15% ✓ (cumulative: 80%)
"beautiful" → 10% ✓ (cumulative: 90%) ← Stop here
"perfect" → 5% ✗ (excluded)
"cloudy" → 3% ✗ (excluded)
... rest excluded

Sample only from: ["sunny", "nice", "warm", "beautiful"]
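
Here's a minimal sketch of that filtering step with toy probabilities (a real implementation would also renormalize the pool before sampling from it):

def top_p_filter(token_probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    sorted_tokens = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)
    pool, cumulative = [], 0.0
    for token, prob in sorted_tokens:
        pool.append(token)
        cumulative += prob
        if cumulative >= p:
            break
    return pool

probs = {"sunny": 0.45, "nice": 0.20, "warm": 0.15, "beautiful": 0.10,
         "perfect": 0.05, "cloudy": 0.03, "purple": 0.001}
print(top_p_filter(probs, p=0.9))
# ['sunny', 'nice', 'warm', 'beautiful'] — then sample only from this pool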

Top-p Values

| Top-p | Effect | Use Case |
|---|---|---|
| 0.1 | Only the very top tokens | Maximum focus, almost deterministic |
| 0.5 | Top ~50% probability mass | Focused but some variety |
| 0.9 | Top ~90% probability mass | Good default, filters obvious nonsense |
| 0.95 | Almost everything included | Creative tasks |
| 1.0 | All tokens considered | Full randomness (use temperature to control) |

4. Top-k: The Simple Filter

Top-k is the simplest sampling method: only consider the top K most likely tokens.

top_k = 40 (common default)

All tokens sorted by probability:
#1 "sunny" → 45% ✓ included
#2 "nice" → 20% ✓ included
...
#40 "adequate" → 0.1% ✓ included
#41 "purple" → 0.05% ✗ excluded (beyond top 40)

Top-k vs Top-p:

  • Top-k: Fixed number of candidates (always exactly K tokens)
  • Top-p: Variable number of candidates (depends on probability distribution)

Top-p is generally preferred because it adapts to the situation. If the model is very confident (one token has 95% probability), top-p will include fewer candidates. Top-k would still include 40 tokens even when most of them are irrelevant.

5. Min-p: The Relative Threshold

Min-p is a newer alternative to top-p. Instead of a cumulative-probability cutoff, it uses a threshold relative to the probability of the most likely token.

min_p = 0.1 means: "Only include tokens with probability ≥ 10% of the top token's probability"

Example:
Top token "Paris" has probability 80%
Threshold = 80% × 0.1 = 8%

"Paris" → 80% ✓ (above 8%)
"the" → 12% ✓ (above 8%)
"located" → 5% ✗ (below 8%)
"banana" → 0.1% ✗ (below 8%)

Why use min-p?

  • More intuitive than top-p for some use cases
  • Automatically adapts to confidence levels
  • Available in Ollama and some local LLM frameworks (not OpenAI API)

6. Tail-Free Sampling (TFS)

Tail-free sampling takes a statistical approach: it analyzes the probability distribution's "tail" (the long list of unlikely tokens) and cuts it off.

tfs_z = 0.95 means: "Cut off tokens in the tail based on second derivative analysis"

Values:
- 1.0 = Disabled (no tail cutting)
- 0.99-0.95 = Light tail trimming (good starting range)
- < 0.9 = Aggressive trimming

When to use TFS:

  • When top-p still includes too many unlikely tokens
  • For more coherent long-form generation
  • Mainly available in local LLM tools (Ollama, llama.cpp)

Sampling Method Priority

These samplers run as a pipeline, and the exact order varies by framework (in llama.cpp it's even configurable). If you enable several at once, their effects compound. Start with just temperature and add others only if needed.

7. Mirostat: Adaptive Perplexity Control

Mirostat is a completely different approach to sampling. Instead of manually tuning temperature and top-p, it automatically adjusts sampling to maintain a target perplexity level.

What's Perplexity?

Perplexity measures how "surprised" the model is by its own output:

  • Low perplexity → Model is confident, output is predictable/coherent
  • High perplexity → Model is uncertain, output is diverse/creative

Mirostat targets a specific perplexity level and adjusts sampling on-the-fly to maintain it.

Mirostat Parameters

mirostat = 0 (default): Disabled, use traditional sampling
mirostat = 1: Mirostat v1
mirostat = 2: Mirostat v2 (generally preferred)

mirostat_tau = 5.0 (default): Target perplexity level
- Higher tau → More diverse/creative output
- Lower tau → More coherent/focused output
- Range: typically 3.0 to 5.0

mirostat_eta = 0.1 (default): Learning rate
- Higher eta → Faster adaptation to target perplexity
- Lower eta → More stable, slower adaptation
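
Here's a minimal sketch of enabling Mirostat v2 through Ollama's REST API. It assumes a local Ollama server on the default port with a llama3.1 model pulled; see the availability note below:

import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1",
        "prompt": "Write a short story about a lighthouse keeper.",
        "stream": False,
        "options": {
            "mirostat": 2,        # use Mirostat v2 instead of temperature/top-p
            "mirostat_tau": 5.0,  # target perplexity
            "mirostat_eta": 0.1,  # learning rate
        },
    },
)
print(response.json()["response"])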

When to Use Mirostat

Use Mirostat when:

  • You want consistent "creativity level" across different prompts
  • Manual temperature tuning isn't giving consistent results
  • You're generating long-form content and want stable quality

Don't use Mirostat when:

  • You need deterministic output (use temperature=0 instead)
  • You're using APIs that don't support it (OpenAI, Anthropic)
  • You need fine-grained control over specific parameters
Mirostat Availability

Mirostat is primarily available in local LLM tools like Ollama and llama.cpp. Cloud APIs (OpenAI, Anthropic, Google) typically don't offer it—they rely on temperature and top-p.

8. Choosing Your Sampling Strategy

Here's the thing: you usually don't need all of these. Most production systems use just temperature, or temperature + top-p.

OpenAI's recommendation: Adjust one or the other, not both. If you use temperature, set top_p to 1.0. If you use top_p, set temperature to 1.0.

In practice:

  • Most developers use temperature because it's more intuitive
  • Use top_p when you specifically want to exclude the long tail of unlikely tokens
  • Use mirostat (if available) when you want consistent creativity across varied prompts

9. Repetition Control

These parameters fight repetition—a common LLM problem, especially in longer outputs.

How Penalties Work on Logits

Penalties adjust the logits (not the probabilities) of tokens that have already appeared. Local runners like llama.cpp and Ollama use a multiplicative scheme for their repeat_penalty:

If the logit is NEGATIVE: logit = logit × penalty
If the logit is POSITIVE: logit = logit / penalty

With penalty > 1 (Ollama's default is 1.1):
- Positive logits get smaller (less likely)
- Negative logits get more negative (even less likely)

Result: previously used tokens become less likely to appear again.

You can also set the penalty below 1, which has the opposite effect—making repeated tokens MORE likely. This is rarely useful but exists for edge cases.

OpenAI's frequency and presence penalties (covered next) work differently: instead of scaling the logit, they subtract a value from it for each token that has already appeared.
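
A toy sketch of both schemes with made-up logit values (the subtractive formula follows OpenAI's documented definition of frequency and presence penalties):

def repeat_penalty(logit: float, penalty: float = 1.1) -> float:
    """Multiplicative scheme (llama.cpp / Ollama repeat_penalty)."""
    return logit * penalty if logit < 0 else logit / penalty

def frequency_presence_penalty(
    logit: float, count: int,
    frequency_penalty: float = 0.0, presence_penalty: float = 0.0,
) -> float:
    """Subtractive scheme (OpenAI-style frequency/presence penalties)."""
    return logit - count * frequency_penalty - (1 if count > 0 else 0) * presence_penalty

# A token with logit 2.0 that has already appeared 5 times
print(repeat_penalty(2.0))                                              # ~1.82
print(frequency_presence_penalty(2.0, count=5, frequency_penalty=0.5))  # -0.5
print(frequency_presence_penalty(2.0, count=5, presence_penalty=0.5))   # 1.5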

Frequency Penalty

Reduces the probability of tokens proportional to how often they've appeared.

frequency_penalty = 0.0 (default): No penalty
frequency_penalty = 1.0: Strong penalty against repetition
frequency_penalty = 2.0: Very strong penalty (can cause incoherence)

Example with frequency_penalty = 0.5:
- Token "the" appeared 5 times → penalty applied 5× (cumulative)
- Token "AI" appeared 2 times → penalty applied 2× (cumulative)

Use case: Long-form content where you want varied vocabulary.

Presence Penalty

Reduces the probability of tokens that have appeared at all (binary: appeared or not).

presence_penalty = 0.0 (default): No penalty
presence_penalty = 1.0: Moderate push toward new topics
presence_penalty = 2.0: Strong push toward new topics

Example with presence_penalty = 0.5:
- Token "the" appeared (any number of times) → one-time penalty of 0.5
- Token "AI" appeared (any number of times) → one-time penalty of 0.5

Use case: When you want the model to explore new topics rather than dwelling on what's already been mentioned.

Repeat Window (repeat_last_n)

In local LLM tools like Ollama, you can control how far back to look for repetitions:

repeat_last_n = 64 (default): Look at last 64 tokens
repeat_last_n = 128: Larger window, catch more distant repetition
repeat_last_n = 0: Disable repetition penalty entirely
repeat_last_n = -1: Use the entire context as the window

Why this matters: A short window (64) only penalizes recent repetition. A long window (or -1) catches patterns that repeat across the entire conversation—useful for long documents but more expensive computationally.

Comparison

| Parameter | Penalizes | Effect | Use When |
|---|---|---|---|
| frequency_penalty | Repeat count | Varied vocabulary | Long documents, avoiding word repetition |
| presence_penalty | Existence (yes/no) | Topic diversity | Brainstorming, exploring new directions |
| repeat_last_n | Window size | Scope of penalty | Ollama/local: control how far back to look |

Start Conservative

Both penalties default to 0. Start there. If you see repetition, try 0.3-0.5. Values above 1.0 often cause erratic output.

10. Output Controls

Max Tokens / num_predict

The hard limit on response length. When reached, the model stops immediately—even mid-sentence.

# OpenAI/Anthropic: max_tokens
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    max_tokens=500  # Stop after ~375 words
)

# Ollama: num_predict
# In modelfile or API
num_predict = 500
# Special values:
# -1 = Generate until done (no limit)
# -2 = Fill the entire context window

Important considerations:

  • max_tokens counts output only, not input
  • Set it based on your use case, not "just in case"
  • Lower values = faster responses and lower cost
  • If the model stops mid-thought, you'll see finish_reason: "length" instead of "stop"

| Use Case | Suggested max_tokens |
|---|---|
| Classification (one word) | 10-50 |
| Short answer | 100-200 |
| Paragraph response | 300-500 |
| Long-form content | 1000-2000 |
| Maximum (let it finish naturally) | 4096+ (model dependent) |

Context Size (num_ctx) — Ollama Specific

When you see a model advertised with "128K context," that's the maximum supported context size. But in Ollama, models default to only 2,048 tokens to save memory.

# Why the default is small:
- 128K context requires significant GPU memory
- Many users have GPUs with only 8GB VRAM
- Ollama prioritizes working on modest hardware

# To use a model's full context in Ollama:
# Create a modelfile:
FROM llama3.1
PARAMETER num_ctx 131072 # 128K tokens

# Then create the model:
ollama create my-big-llama -f modelfile

To find a model's maximum context:

ollama show llama3.1
# Look for "context length" near the top
Context Size and Memory

Larger context = more memory required. A 128K context model might need 20GB+ VRAM. Start with smaller contexts (4K-8K) and increase only if needed.
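
If you'd rather not bake the context size into a modelfile, Ollama's REST API also accepts it per request. A sketch, assuming a local Ollama server on the default port:

import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1",
        "prompt": "Summarize the following document: ...",
        "stream": False,
        "options": {
            "num_ctx": 8192,      # context window for this request
            "num_predict": 500,   # cap the output length too
        },
    },
)
print(response.json()["response"])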

Stop Sequences

Tell the model to stop generating when it produces a specific string.

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    stop=["```", "\n\n", "END"]  # Stop at any of these
)

Use cases:

  • Stop at code block end: stop=["```"]
  • Stop at double newline: stop=["\n\n"] (useful for single paragraphs)
  • Stop at custom delimiter: stop=["---END---"]
  • Prevent runaway lists: stop=["\n6."] (stop after 5 items)

Stop sequences are powerful for controlling models that tend to ramble or repeat patterns. If you notice your model outputting a strange repeating symbol, add it as a stop sequence.

Response Format (JSON Mode)

Force the model to output valid JSON:

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    response_format={"type": "json_object"}  # OpenAI
)

Requirements:

  • You must mention "JSON" in your prompt (OpenAI requirement)
  • The model will always produce valid JSON (syntax guaranteed)
  • The structure/schema is NOT guaranteed—use prompt engineering for that, and validate the parsed result yourself (see the sketch below)
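
A minimal sketch of that validation step, continuing from the call above (the expected keys here are hypothetical—check for whatever schema your prompt asks for):

import json

raw = response.choices[0].message.content  # valid JSON syntax is guaranteed
data = json.loads(raw)

# The schema is NOT guaranteed, so validate the fields you rely on
required_keys = {"name", "sentiment", "confidence"}  # hypothetical schema
missing = required_keys - data.keys()
if missing:
    raise ValueError(f"Model omitted expected keys: {missing}")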

11. Putting It Together: Parameter Presets

Here are battle-tested presets for common use cases:

Preset: Deterministic/Structured (JSON, Code, Classification)

STRUCTURED_PRESET = {
    "temperature": 0,
    "top_p": 1,
    "frequency_penalty": 0,
    "presence_penalty": 0,
    "max_tokens": 1000,
}

Why: You want the same input to produce the same output. No creativity needed.

Preset: Balanced (Chat, Q&A, Explanations)

BALANCED_PRESET = {
    "temperature": 0.7,
    "top_p": 1,
    "frequency_penalty": 0,
    "presence_penalty": 0,
    "max_tokens": 2000,
}

Why: Some variety keeps responses engaging, but not so much that accuracy suffers.

Preset: Creative (Marketing, Brainstorming, Writing)

CREATIVE_PRESET = {
    "temperature": 0.9,
    "top_p": 1,
    "frequency_penalty": 0.3,
    "presence_penalty": 0.3,
    "max_tokens": 3000,
}

Why: Higher temperature for creativity, light penalties to avoid repetition in longer outputs.

Preset: Exploratory (Idea Generation, Breaking Blocks)

EXPLORATORY_PRESET = {
    "temperature": 1.2,
    "top_p": 0.95,
    "frequency_penalty": 0.5,
    "presence_penalty": 0.5,
    "max_tokens": 2000,
}

Why: Maximum variety, strong push toward new territory. Review outputs carefully.

12. Hands-On Exercise: The Parameter Playground

Let's build a tool to visualize how parameters affect output.

Setup

mkdir parameter-playground
cd parameter-playground
uv init
uv add openai python-dotenv
touch parameter_playground.py

The Code

parameter_playground.py
"""
Parameter Playground
====================
Visualize how generation parameters affect LLM output.

Run the same prompt with different settings and compare results.
"""

import os
from dataclasses import dataclass
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()


@dataclass
class GenerationConfig:
    """Configuration for generation parameters."""
    name: str
    temperature: float = 1.0
    top_p: float = 1.0
    frequency_penalty: float = 0.0
    presence_penalty: float = 0.0
    max_tokens: int = 500


# Define presets
PRESETS = {
    "deterministic": GenerationConfig(
        name="Deterministic (T=0)",
        temperature=0,
    ),
    "focused": GenerationConfig(
        name="Focused (T=0.3)",
        temperature=0.3,
    ),
    "balanced": GenerationConfig(
        name="Balanced (T=0.7)",
        temperature=0.7,
    ),
    "creative": GenerationConfig(
        name="Creative (T=1.0)",
        temperature=1.0,
        frequency_penalty=0.3,
    ),
    "experimental": GenerationConfig(
        name="Experimental (T=1.3)",
        temperature=1.3,
        frequency_penalty=0.5,
        presence_penalty=0.5,
    ),
}


def generate_with_config(
    prompt: str,
    config: GenerationConfig,
    system_prompt: str = "You are a helpful assistant.",
    client: OpenAI = None
) -> dict:
    """Generate a response with specific parameters."""

    if client is None:
        client = OpenAI()

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # Using mini for cost efficiency
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ],
        temperature=config.temperature,
        top_p=config.top_p,
        frequency_penalty=config.frequency_penalty,
        presence_penalty=config.presence_penalty,
        max_tokens=config.max_tokens,
    )

    return {
        "config": config.name,
        "content": response.choices[0].message.content,
        "finish_reason": response.choices[0].finish_reason,
        "tokens_used": response.usage.completion_tokens,
    }


def run_comparison(prompt: str, num_runs: int = 3):
    """Run the same prompt across all presets, multiple times each."""

    print("=" * 70)
    print(f"PROMPT: {prompt}")
    print("=" * 70)

    client = OpenAI()

    for preset_name, config in PRESETS.items():
        print(f"\n{'─' * 70}")
        print(f"CONFIG: {config.name}")
        print(f"  temperature={config.temperature}, top_p={config.top_p}")
        print(f"  frequency_penalty={config.frequency_penalty}, presence_penalty={config.presence_penalty}")
        print(f"{'─' * 70}")

        for i in range(num_runs):
            result = generate_with_config(prompt, config, client=client)

            # Truncate long outputs for display
            content = result["content"]
            if len(content) > 200:
                content = content[:200] + "..."

            print(f"\n  Run {i + 1}: {content}")

        # Show consistency indicator
        if config.temperature == 0:
            print(f"\n  📊 Consistency: HIGH (deterministic)")
        elif config.temperature < 0.5:
            print(f"\n  📊 Consistency: MEDIUM-HIGH")
        elif config.temperature < 1.0:
            print(f"\n  📊 Consistency: MEDIUM")
        else:
            print(f"\n  📊 Consistency: LOW (high variance expected)")


def demonstrate_penalties():
    """Show the effect of frequency and presence penalties."""

    print("\n" + "=" * 70)
    print("DEMONSTRATION: Repetition Penalties")
    print("=" * 70)

    # A prompt that tends to produce repetitive output
    prompt = "List 10 reasons why exercise is good for you. Be detailed."

    configs = [
        GenerationConfig(name="No penalties", temperature=0.7),
        GenerationConfig(name="Frequency penalty=0.5", temperature=0.7, frequency_penalty=0.5),
        GenerationConfig(name="Presence penalty=0.5", temperature=0.7, presence_penalty=0.5),
        GenerationConfig(name="Both penalties=0.5", temperature=0.7, frequency_penalty=0.5, presence_penalty=0.5),
    ]

    client = OpenAI()

    for config in configs:
        print(f"\n{'─' * 70}")
        print(f"CONFIG: {config.name}")
        print(f"{'─' * 70}")

        result = generate_with_config(prompt, config, client=client)
        print(f"\n{result['content'][:500]}...")
        print(f"\n  Tokens used: {result['tokens_used']}")


def demonstrate_stop_sequences():
    """Show how stop sequences work."""

    print("\n" + "=" * 70)
    print("DEMONSTRATION: Stop Sequences")
    print("=" * 70)

    client = OpenAI()
    prompt = "Write a short list of 10 programming languages."

    # Without stop sequence
    print("\n--- Without stop sequence ---")
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=500,
    )
    print(response.choices[0].message.content)
    print(f"Finish reason: {response.choices[0].finish_reason}")

    # With stop sequence (stop after 5 items)
    print("\n--- With stop=['\\n6.'] (stop after 5 items) ---")
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=500,
        stop=["\n6."],
    )
    print(response.choices[0].message.content)
    print(f"Finish reason: {response.choices[0].finish_reason}")


# ═══════════════════════════════════════════════════════════════════════════
# MAIN
# ═══════════════════════════════════════════════════════════════════════════

if __name__ == "__main__":
print("\n🎛️ PARAMETER PLAYGROUND\n")

# Check for API key
if not os.getenv("OPENAI_API_KEY"):
print("❌ Error: OPENAI_API_KEY not found in environment")
print(" Create a .env file with: OPENAI_API_KEY=sk-...")
exit(1)

# Test 1: Compare temperature effects
print("\n" + "=" * 70)
print(" TEST 1: Temperature Comparison")
print("=" * 70)
run_comparison(
"Write a one-sentence description of a sunset.",
num_runs=3
)

# Test 2: Penalties demonstration
demonstrate_penalties()

# Test 3: Stop sequences
demonstrate_stop_sequences()

print("\n" + "=" * 70)
print(" EXPERIMENTS COMPLETE")
print("=" * 70)
print("""
Key observations:
1. Temperature 0 produces identical outputs every time
2. Higher temperature → more variation between runs
3. Frequency penalty reduces word repetition
4. Presence penalty encourages topic diversity
5. Stop sequences give you precise control over output length
""")

Run It

# Create .env file with your API key
echo "OPENAI_API_KEY=sk-your-key-here" > .env

# Run the playground
uv run parameter_playground.py

What to Observe

  1. Temperature 0: Every run produces identical output
  2. Temperature 0.7: Slight variations, but similar structure
  3. Temperature 1.3: Wildly different outputs each time
  4. Penalties: Notice vocabulary variety in the exercise list
  5. Stop sequences: Clean cutoff exactly where you specify

13. Provider Differences

Not all providers use the same parameter names or ranges:

Cloud APIs

| Parameter | OpenAI | Anthropic (Claude) | Google (Gemini) |
|---|---|---|---|
| temperature | 0-2 | 0-1 | 0-2 |
| top_p | 0-1 | 0-1 | 0-1 |
| top_k | — | 0-500 | 1-40 |
| frequency_penalty | -2 to 2 | — | — |
| presence_penalty | -2 to 2 | — | — |
| max_tokens | Yes | Yes (max_tokens) | Yes (max_output_tokens) |
| stop sequences | Yes (stop) | Yes (stop_sequences) | Yes (stop_sequences) |
| seed | Yes | — | — |
| JSON mode | Yes | Yes | Yes |

Claude's Temperature Range

Anthropic's Claude uses temperature 0-1 (not 0-2). A temperature of 1.0 in Claude is already quite creative. Don't port OpenAI settings directly without adjustment.

Local LLMs (Ollama / llama.cpp)

Ollama exposes many more parameters since you have full control over the model:

| Parameter | Range | Default | Notes |
|---|---|---|---|
| temperature | 0-2+ | 0.8 | Same concept as cloud APIs |
| top_p | 0-1 | 0.9 | Nucleus sampling |
| top_k | 1-100+ | 40 | Fixed candidate count |
| min_p | 0-1 | 0 | Relative threshold filter |
| tfs_z | 0-1 | 1 | Tail-free sampling (1 = disabled) |
| mirostat | 0/1/2 | 0 | Alternative sampling mode |
| mirostat_tau | 0-10 | 5.0 | Target perplexity |
| mirostat_eta | 0-1 | 0.1 | Learning rate |
| repeat_penalty | 0-2 | 1.1 | Repetition penalty |
| repeat_last_n | -1 to context | 64 | Penalty window size |
| num_ctx | 1 to model max | 2048 | Context window size |
| num_predict | -2 to max | 128 | Max output tokens |
| seed | any int | random | For reproducibility |

Setting parameters in Ollama:

# In a modelfile:
FROM llama3.1
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
PARAMETER top_k 40

# At runtime (limited parameters):
/set parameter temperature 0.7

14. Common Pitfalls

| Symptom | Likely Cause | Fix |
|---|---|---|
| Output varies wildly between requests | Temperature too high for the task | Lower to 0-0.3 for structured output |
| Output is robotic and repetitive | Temperature too low for creative tasks | Raise to 0.7-1.0 |
| Same words keep appearing | No frequency penalty on long output | Add frequency_penalty=0.3-0.5 |
| Model talks in circles about same topic | No presence penalty | Add presence_penalty=0.3-0.5 |
| Response cuts off mid-sentence | max_tokens too low | Increase limit or check finish_reason |
| JSON sometimes invalid | Temperature > 0 | Use temperature=0 and response_format=json_object |
| Can't reproduce results for debugging | Temperature > 0, no seed | Set temperature=0 or use seed parameter |

15. Decision Framework

16. Key Takeaways

  1. Temperature is your primary control. It scales logits before softmax. Start with 0 for structured tasks, 0.7 for general use, 1.0+ for creativity.

  2. Pick ONE sampling method. Temperature, top-p, top-k, min-p, mirostat—choose one and master it. Most developers stick with temperature.

  3. Penalties fight repetition. Use frequency_penalty for varied vocabulary, presence_penalty for topic diversity. Start at 0.3-0.5.

  4. Context size matters for local LLMs. Ollama defaults to 2K tokens even if the model supports 128K. Set num_ctx explicitly.

  5. Temperature 0 isn't magic. It's deterministic but not perfect—use seed for true reproducibility.

  6. Stop sequences give surgical control. Perfect for limiting list length or stopping at delimiters.

  7. Mirostat is underrated. If you're running local models and want consistent creativity, try mirostat mode 2.

  8. Test empirically. Theory only gets you so far. Run the same prompt 10 times and observe variance.

17. What's Next

Congratulations! You've completed Part 1: Foundations of Prompt Engineering.

You now understand:

  • Lesson 1: Token economics (the currency)
  • Lesson 2: Prompting patterns (the techniques)
  • Lesson 3: System prompts (the programming layer)
  • Lesson 4: Generation parameters (the control knobs)

In Part 2: Building Your First AI Features, we'll put all of this into practice, starting with Lesson 5: Text Generation & Streaming UIs, where we'll build a real-time chat interface.

Time to build something real.

18. Additional Resources