Lesson 4: Generation Parameters — Controlling Creativity
- The Probability Engine: Why LLMs work with logits, not probabilities directly.
- Temperature: The "chaos dial" that scales logits before softmax.
- Sampling Methods: Top-p, Top-k, Min-p, and Tail-Free Sampling.
- Mirostat: An alternative approach using perplexity targeting.
- Repetition Control: Frequency penalty, presence penalty, and repeat windows.
- Output Controls: Max tokens, stop sequences, and context size.
- Local LLMs: Ollama-specific parameters like num_ctx and num_predict.
You've crafted the perfect prompt. You've written an airtight system message. You send the request and get... a wildly creative response when you needed precise JSON. Or a robotic, repetitive answer when you wanted engaging copy. The problem isn't your prompt—it's your generation parameters.
These are the "knobs" that control how the model selects its next token. Get them wrong, and even perfect prompts produce wrong outputs.
1. The Probability Engine
Before we touch any settings, we need to understand what's actually happening when an LLM generates text.
At each step, the model doesn't "know" what to say next. It calculates a probability distribution over its entire vocabulary (50,000+ tokens). Every token gets a probability score.
Prompt: "The capital of France is"
Token Probabilities:
"Paris" → 92.3%
"the" → 2.1%
"a" → 1.4%
"located" → 0.8%
"definitely"→ 0.4%
... (50,000 more tokens with tiny probabilities)
The model then samples from this distribution to pick the next token. This is where generation parameters come in—they control how this sampling happens.
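To make that sampling step concrete, here is a minimal sketch using the toy numbers above (real models do this over their entire vocabulary at every step):

```python
import random

# The toy distribution from the example above (the long tail collapsed into "<other>")
probs = {"Paris": 0.923, "the": 0.021, "a": 0.014,
         "located": 0.008, "definitely": 0.004, "<other>": 0.030}

# Sampling: pick the next token at random, weighted by its probability.
# Run this a few times: "Paris" wins most of the time, but not always.
next_token = random.choices(list(probs), weights=list(probs.values()), k=1)[0]
print(next_token)
```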
2. Temperature: The Chaos Dial
Temperature controls how "sharp" or "flat" the probability distribution is before sampling. But to really understand it, we need to look at what happens under the hood.
Under the Hood: Logits
The model doesn't actually store probabilities directly. It works with logits (short for "logistic units")—raw, unnormalized scores that can be any real number, though they typically land somewhere around -10 to +10.
Before softmax (raw logits):
"Paris" → 8.2
"the" → 2.1
"a" → 1.4
"located" → -0.8
"banana" → -9.5
These logits are converted to probabilities using the softmax function, which:
- Exponentiates each logit
- Divides by the sum of all exponentials
- Results in numbers between 0 and 1 that sum to 1
Temperature scales the logits BEFORE softmax:
# Simplified: how temperature affects logits
scaled_logits = original_logits / temperature
probabilities = softmax(scaled_logits)
The Math (Simplified)
- Temperature < 1 → Dividing by a small number makes differences BIGGER. High logits become dominant.
- Temperature = 1 → Use logits as-is. (Default behavior)
- Temperature > 1 → Dividing by a large number makes differences SMALLER. Low-probability tokens get boosted.
- Temperature = 0 → Special case: Always pick the highest logit. Deterministic. (The sketch below shows these effects with real numbers.)
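A minimal runnable version of that scaling; the logits are made up for illustration, and temperature 0 is treated as a special case rather than a division:

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

logits = [8.2, 2.1, 1.4, -0.8]  # illustrative logits for four candidate tokens

for temperature in (0.3, 1.0, 1.5):
    scaled = [l / temperature for l in logits]
    print(temperature, [round(p, 3) for p in softmax(scaled)])

# Lower temperature sharpens the distribution toward the top logit;
# higher temperature flattens it, giving low-probability tokens a boost.
# Temperature 0 is handled as argmax, since dividing by 0 is undefined.
```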
Visual Intuition
Prompt: "The weather today is"
┌─────────────────────────────────────────────────────────────────┐
│ Temperature = 0.0 (Deterministic) │
│ ████████████████████████████████████████ "sunny" (100%) │
│ │
│ Temperature = 0.3 (Focused) │
│ ████████████████████████████████ "sunny" (85%) │
│ ████ "nice" (10%) │
│ █ "warm" (5%) │
│ │
│ Temperature = 1.0 (Default) │
│ ██████████████████ "sunny" (45%) │
│ ████████ "nice" (20%) │
│ ██████ "warm" (15%) │
│ ████ "beautiful" (10%) │
│ ██ "perfect" (5%) │
│ █ others (5%) │
│ │
│ Temperature = 1.5 (Creative) │
│ ██████████ "sunny" (25%) │
│ ██████ "nice" (15%) │
│ █████ "warm" (12%) │
│ ████ "beautiful" (10%) │
│ ████ "absolutely" (8%) │
│ ███ "quite" (7%) │
│ ██████████ others (23%) │
└─────────────────────────────────────────────────────────────────┘
When to Use What
| Temperature | Behavior | Best For |
|---|---|---|
| 0 | Always pick the most likely token | JSON generation, code, factual Q&A, deterministic outputs |
| 0.1 - 0.3 | Very focused, minimal variation | Data extraction, classification, structured output |
| 0.5 - 0.7 | Balanced creativity and coherence | General chat, explanations, summarization |
| 0.8 - 1.0 | More variety, occasionally surprising | Creative writing, brainstorming, marketing copy |
| 1.2 - 1.5 | High creativity, risk of incoherence | Poetry, experimental content, breaking writer's block |
| > 1.5 | Chaos mode | Almost never useful in production |
Even at temperature 0, you might see slight variations due to floating-point math and GPU parallelism. For truly reproducible outputs, also set a seed parameter (if the API supports it).
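With the OpenAI Python SDK, for example, that looks like the sketch below. Note that seed is a best-effort feature: identical inputs should usually give identical outputs, but it isn't strictly guaranteed across model updates.

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Return the capital of France as one word."}],
    temperature=0,
    seed=42,  # best-effort reproducibility; compare system_fingerprint across runs
)
print(response.choices[0].message.content)
print(response.system_fingerprint)
```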
3. Top-p (Nucleus Sampling): The Candidate Filter
Top-p (also called nucleus sampling) takes a different approach: instead of reshaping probabilities, it limits which tokens are even considered.
How It Works
- Sort all tokens by probability (highest first)
- Add tokens to the "candidate pool" until their cumulative probability reaches p
- Sample only from this pool
Top-p = 0.9 means: "Only consider tokens that together account for 90% of probability mass"
Prompt: "The weather today is"
All tokens (sorted by probability):
"sunny" → 45% ✓ (cumulative: 45%)
"nice" → 20% ✓ (cumulative: 65%)
"warm" → 15% ✓ (cumulative: 80%)
"beautiful" → 10% ✓ (cumulative: 90%) ← Stop here
"perfect" → 5% ✗ (excluded)
"cloudy" → 3% ✗ (excluded)
... rest excluded
Sample only from: ["sunny", "nice", "warm", "beautiful"]
Top-p Values
| Top-p | Effect | Use Case |
|---|---|---|
| 0.1 | Only the very top tokens | Maximum focus, almost deterministic |
| 0.5 | Top ~50% probability mass | Focused but some variety |
| 0.9 | Top ~90% probability mass | Good default, filters obvious nonsense |
| 0.95 | Almost everything included | Creative tasks |
| 1.0 | All tokens considered | Full randomness (use temperature to control) |
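Here's a minimal sketch of the cumulative cutoff described in "How It Works" above (probabilities are illustrative; real implementations do this over the full sorted vocabulary):

```python
def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    kept = {}
    cumulative = 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize so the surviving probabilities sum to 1 before sampling
    total = sum(kept.values())
    return {token: prob / total for token, prob in kept.items()}

probs = {"sunny": 0.45, "nice": 0.20, "warm": 0.15,
         "beautiful": 0.10, "perfect": 0.05, "cloudy": 0.03}
print(top_p_filter(probs, p=0.9))  # keeps sunny, nice, warm, beautiful
```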
4. Top-k: The Simple Filter
Top-k is the simplest sampling method: only consider the top K most likely tokens.
top_k = 40 (common default)
All tokens sorted by probability:
#1 "sunny" → 45% ✓ included
#2 "nice" → 20% ✓ included
...
#40 "adequate" → 0.1% ✓ included
#41 "purple" → 0.05% ✗ excluded (beyond top 40)
Top-k vs Top-p:
- Top-k: Fixed number of candidates (always exactly K tokens)
- Top-p: Variable number of candidates (depends on probability distribution)
Top-p is generally preferred because it adapts to the situation. If the model is very confident (one token has 95% probability), top-p will include fewer candidates. Top-k would still include 40 even when most are irrelevant.
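A quick sketch of that difference: given a very confident distribution, top-p keeps a single candidate while top-k keeps a fixed 40 (the probabilities are invented):

```python
def top_k_candidates(probs: dict[str, float], k: int) -> list[str]:
    """Keep the K most likely tokens, regardless of how confident the model is."""
    return sorted(probs, key=probs.get, reverse=True)[:k]

def top_p_candidates(probs: dict[str, float], p: float) -> list[str]:
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for token in sorted(probs, key=probs.get, reverse=True):
        kept.append(token)
        cumulative += probs[token]
        if cumulative >= p:
            break
    return kept

# A "confident" distribution: one dominant token plus a long tail of 60 fillers
confident = {"Paris": 0.95, **{f"tok{i}": 0.05 / 60 for i in range(60)}}

print(len(top_p_candidates(confident, p=0.9)))  # 1 candidate
print(len(top_k_candidates(confident, k=40)))   # 40 candidates, mostly irrelevant
```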
5. Min-p: The Relative Threshold
Min-p is a newer alternative to top-p. Instead of a cumulative-probability cutoff, it uses a threshold relative to the probability of the most likely token.
min_p = 0.1 means: "Only include tokens with probability ≥ 10% of the top token's probability"
Example:
Top token "Paris" has probability 80%
Threshold = 80% × 0.1 = 8%
"Paris" → 80% ✓ (above 8%)
"the" → 12% ✓ (above 8%)
"located" → 5% ✗ (below 8%)
"banana" → 0.1% ✗ (below 8%)
Why use min-p?
- More intuitive than top-p for some use cases
- Automatically adapts to confidence levels
- Available in Ollama and some local LLM frameworks (not the OpenAI API); see the sketch below
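A minimal sketch of the min-p rule from the example above (probabilities are illustrative):

```python
def min_p_filter(probs: dict[str, float], min_p: float) -> dict[str, float]:
    """Keep tokens whose probability is at least min_p times the top token's probability."""
    threshold = min_p * max(probs.values())
    return {token: prob for token, prob in probs.items() if prob >= threshold}

probs = {"Paris": 0.80, "the": 0.12, "located": 0.05, "banana": 0.001}
print(min_p_filter(probs, min_p=0.1))  # threshold 0.08 -> keeps "Paris" and "the"
```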
6. Tail-Free Sampling (TFS)
Tail-free sampling takes a statistical approach: it analyzes the probability distribution's "tail" (the long list of unlikely tokens) and cuts it off.
tfs_z = 0.95 means: "Cut off tokens in the tail based on second derivative analysis"
Values:
- 1.0 = Disabled (no tail cutting)
- 0.99-0.95 = Light tail trimming (good starting range)
- < 0.9 = Aggressive trimming
When to use TFS:
- When top-p still includes too many unlikely tokens
- For more coherent long-form generation
- Mainly available in local LLM tools (Ollama, llama.cpp)
Frameworks that support several of these samplers chain them in a fixed pipeline, and the exact order varies by implementation. If you're using multiple, be aware they compound: each filter operates on whatever the previous one left. Start with just temperature and add others only if needed.
7. Mirostat: Adaptive Perplexity Control
Mirostat is a completely different approach to sampling. Instead of manually tuning temperature and top-p, it automatically adjusts sampling to maintain a target perplexity level.
What's Perplexity?
Perplexity measures how "surprised" the model is by its own output:
- Low perplexity → Model is confident, output is predictable/coherent
- High perplexity → Model is uncertain, output is diverse/creative
Mirostat targets a specific perplexity level and adjusts sampling on-the-fly to maintain it.
Mirostat Parameters
mirostat = 0 (default): Disabled, use traditional sampling
mirostat = 1: Mirostat v1
mirostat = 2: Mirostat v2 (generally preferred)
mirostat_tau = 5.0 (default): Target perplexity level
- Higher tau → More diverse/creative output
- Lower tau → More coherent/focused output
- Range: typically 3.0 to 5.0
mirostat_eta = 0.1 (default): Learning rate
- Higher eta → Faster adaptation to target perplexity
- Lower eta → More stable, slower adaptation
When to Use Mirostat
Use Mirostat when:
- You want consistent "creativity level" across different prompts
- Manual temperature tuning isn't giving consistent results
- You're generating long-form content and want stable quality
Don't use Mirostat when:
- You need deterministic output (use temperature=0 instead)
- You're using APIs that don't support it (OpenAI, Anthropic)
- You need fine-grained control over specific parameters
Mirostat is primarily available in local LLM tools like Ollama and llama.cpp. Cloud APIs (OpenAI, Anthropic, Google) typically don't offer it—they rely on temperature and top-p.
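Here's a sketch of turning it on in Ollama through its REST API, assuming Ollama is running locally on its default port with llama3.1 pulled:

```python
import requests

# Ask a local Ollama server to generate with Mirostat v2 enabled.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1",
        "prompt": "Write a short paragraph about autumn.",
        "stream": False,
        "options": {
            "mirostat": 2,        # enable Mirostat v2
            "mirostat_tau": 5.0,  # target perplexity
            "mirostat_eta": 0.1,  # learning rate
        },
    },
    timeout=120,
)
print(response.json()["response"])
```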
8. Choosing Your Sampling Strategy
Here's the thing: you usually don't need all of these. Most production systems use just temperature, or temperature + top-p.
OpenAI's recommendation: Adjust one or the other, not both. If you use temperature, set top_p to 1.0. If you use top_p, set temperature to 1.0.
In practice:
- Most developers use temperature because it's more intuitive
- Use top_p when you specifically want to exclude the long tail of unlikely tokens
- Use mirostat (if available) when you want consistent creativity across varied prompts
9. Repetition Control
These parameters fight repetition—a common LLM problem, especially in longer outputs.
How Penalties Work on Logits
Penalties adjust the logits (not the probabilities) of tokens that have already appeared. The repeat penalty used by local tools (Ollama, llama.cpp) is multiplicative:
If the logit is NEGATIVE: logit = logit × penalty
If the logit is POSITIVE: logit = logit / penalty
With penalty > 1 (the default direction):
- Positive logits get smaller (less likely)
- Negative logits get more negative (even less likely)
Result: Previously used tokens become less likely to appear again.
You can also set the penalty below 1, which has the opposite effect—making repeated tokens MORE likely. This is rarely useful but exists for edge cases.
OpenAI's frequency and presence penalties (covered next) work differently: instead of scaling the logit, they subtract from it. Either way, repeated tokens become less likely.
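OpenAI documents its frequency and presence penalties as a subtractive adjustment to the logits. Here's a sketch of that formula (the counts and logits are made up):

```python
from collections import Counter

def penalized_logit(logit: float, count: int,
                    frequency_penalty: float, presence_penalty: float) -> float:
    """Subtract a per-occurrence amount (frequency) and a one-time amount (presence)
    from the logit of any token that has already appeared in the output."""
    return logit - count * frequency_penalty - (presence_penalty if count > 0 else 0.0)

generated_so_far = "the model said the model said the".split()
counts = Counter(generated_so_far)

# Candidate logits for the next token (made-up numbers)
logits = {"the": 4.0, "model": 3.5, "answer": 2.0}

for token, logit in logits.items():
    adjusted = penalized_logit(logit, counts[token],
                               frequency_penalty=0.5, presence_penalty=0.5)
    print(f"{token!r}: {logit} -> {adjusted}")
# "the" appeared 3 times: 4.0 - 3*0.5 - 0.5 = 2.0
# "answer" never appeared: unchanged
```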
Frequency Penalty
Reduces the probability of tokens proportional to how often they've appeared.
frequency_penalty = 0.0 (default): No penalty
frequency_penalty = 1.0: Strong penalty against repetition
frequency_penalty = 2.0: Very strong penalty (can cause incoherence)
Example with frequency_penalty = 0.5:
- Token "the" appeared 5 times → penalty applied 5× (cumulative)
- Token "AI" appeared 2 times → penalty applied 2× (cumulative)
Use case: Long-form content where you want varied vocabulary.
Presence Penalty
Reduces the probability of tokens that have appeared at all (binary: appeared or not).
presence_penalty = 0.0 (default): No penalty
presence_penalty = 1.0: Moderate push toward new topics
presence_penalty = 2.0: Strong push toward new topics
Example with presence_penalty = 0.5:
- Token "the" appeared (any number of times) → one-time penalty of 0.5
- Token "AI" appeared (any number of times) → one-time penalty of 0.5
Use case: When you want the model to explore new topics rather than dwelling on what's already been mentioned.
Repeat Window (repeat_last_n)
In local LLM tools like Ollama, you can control how far back to look for repetitions:
repeat_last_n = 64 (default): Look at last 64 tokens
repeat_last_n = 128: Larger window, catch more distant repetition
repeat_last_n = 0: Disable repetition penalty entirely
repeat_last_n = -1: Use the entire context as the window
Why this matters: A short window (64) only penalizes recent repetition. A long window (or -1) catches patterns that repeat across the entire conversation—useful for long documents but more expensive computationally.
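A sketch of setting the window per request via the same Ollama API (the model name and values are just examples):

```python
import requests

# Penalize repetition over the last 256 tokens instead of the default 64
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1",
        "prompt": "Write a long product description for a hiking backpack.",
        "stream": False,
        "options": {
            "repeat_penalty": 1.15,  # >1 discourages repeats
            "repeat_last_n": 256,    # how far back to look (-1 = whole context)
        },
    },
    timeout=120,
)
print(response.json()["response"])
```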
Comparison
| Parameter | Penalizes | Effect | Use When |
|---|---|---|---|
| frequency_penalty | Repeat count | Varied vocabulary | Long documents, avoiding word repetition |
| presence_penalty | Existence (yes/no) | Topic diversity | Brainstorming, exploring new directions |
| repeat_last_n | Window size | Scope of penalty | Ollama/local: control how far back to look |
Both penalties default to 0. Start there. If you see repetition, try 0.3-0.5. Values above 1.0 often cause erratic output.
10. Output Controls
Max Tokens / num_predict
The hard limit on response length. When reached, the model stops immediately—even mid-sentence.
# OpenAI/Anthropic: max_tokens
response = client.chat.completions.create(
model="gpt-4o",
messages=messages,
max_tokens=500 # Stop after ~375 words
)
# Ollama: num_predict
# In modelfile or API
num_predict = 500
# Special values:
# -1 = Generate until done (no limit)
# -2 = Fill the entire context window
Important considerations:
- max_tokens counts output only, not input
- Set it based on your use case, not "just in case"
- Lower values = faster responses and lower cost
- If the model stops mid-thought, you'll see finish_reason: "length" instead of "stop" (a handling sketch follows the table below)
| Use Case | Suggested max_tokens |
|---|---|
| Classification (one word) | 10-50 |
| Short answer | 100-200 |
| Paragraph response | 300-500 |
| Long-form content | 1000-2000 |
| Maximum (let it finish naturally) | 4096+ (model dependent) |
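Here's a minimal sketch of checking for truncation (the model and prompt are placeholders):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize the benefits of unit testing."}],
    max_tokens=150,
)

choice = response.choices[0]
if choice.finish_reason == "length":
    # The model hit max_tokens mid-thought: raise the limit, tighten the prompt,
    # or ask it to continue in a follow-up message.
    print("⚠️ Truncated output:", choice.message.content)
else:
    print(choice.message.content)
```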
Context Size (num_ctx) — Ollama Specific
When you see a model advertised with "128K context," that's the maximum supported context size. But in Ollama, models default to only 2,048 tokens to save memory.
# Why the default is small:
- 128K context requires significant GPU memory
- Many users have GPUs with only 8GB VRAM
- Ollama prioritizes working on modest hardware
# To use a model's full context in Ollama:
# Create a modelfile:
FROM llama3.1
PARAMETER num_ctx 131072 # 128K tokens
# Then create the model:
ollama create my-big-llama -f modelfile
To find a model's maximum context:
ollama show llama3.1
# Look for "context length" near the top
Larger context = more memory required. A 128K context model might need 20GB+ VRAM. Start with smaller contexts (4K-8K) and increase only if needed.
Stop Sequences
Tell the model to stop generating when it produces a specific string.
response = client.chat.completions.create(
model="gpt-4o",
messages=messages,
stop=["```", "\n\n", "END"] # Stop at any of these
)
Use cases:
- Stop at code block end: stop=["```"]
- Stop at double newline: stop=["\n\n"] (useful for single paragraphs)
- Stop at custom delimiter: stop=["---END---"]
- Prevent runaway lists: stop=["\n6."] (stop after 5 items)
Stop sequences are powerful for controlling models that tend to ramble or repeat patterns. If you notice your model outputting a strange repeating symbol, add it as a stop sequence.
Response Format (JSON Mode)
Force the model to output valid JSON:
response = client.chat.completions.create(
model="gpt-4o",
messages=messages,
response_format={"type": "json_object"} # OpenAI
)
Requirements:
- You must mention "JSON" in your prompt (OpenAI requirement)
- The model will always produce valid JSON (syntax guaranteed)
- The structure/schema is NOT guaranteed—use prompt engineering for that, and validate the parsed result (see the sketch below)
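For example, a minimal sketch that mentions JSON in the prompt, turns on JSON mode, and then checks the keys it expects (the key names are just an example):

```python
import json
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": "Extract the name and age from: 'Ada is 36.' Respond in JSON with keys 'name' and 'age'.",
    }],
    response_format={"type": "json_object"},
    temperature=0,
)

data = json.loads(response.choices[0].message.content)  # syntax is guaranteed valid JSON
assert {"name", "age"} <= data.keys(), f"Unexpected schema: {data}"  # the schema is not, so check it
print(data)
```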
11. Putting It Together: Parameter Presets
Here are battle-tested presets for common use cases:
Preset: Deterministic/Structured (JSON, Code, Classification)
STRUCTURED_PRESET = {
"temperature": 0,
"top_p": 1,
"frequency_penalty": 0,
"presence_penalty": 0,
"max_tokens": 1000,
}
Why: You want the same input to produce the same output. No creativity needed.
Preset: Balanced (Chat, Q&A, Explanations)
BALANCED_PRESET = {
"temperature": 0.7,
"top_p": 1,
"frequency_penalty": 0,
"presence_penalty": 0,
"max_tokens": 2000,
}
Why: Some variety keeps responses engaging, but not so much that accuracy suffers.
Preset: Creative (Marketing, Brainstorming, Writing)
CREATIVE_PRESET = {
"temperature": 0.9,
"top_p": 1,
"frequency_penalty": 0.3,
"presence_penalty": 0.3,
"max_tokens": 3000,
}
Why: Higher temperature for creativity, light penalties to avoid repetition in longer outputs.
Preset: Exploratory (Idea Generation, Breaking Blocks)
EXPLORATORY_PRESET = {
"temperature": 1.2,
"top_p": 0.95,
"frequency_penalty": 0.5,
"presence_penalty": 0.5,
"max_tokens": 2000,
}
Why: Maximum variety, strong push toward new territory. Review outputs carefully.
12. Hands-On Exercise: The Parameter Playground
Let's build a tool to visualize how parameters affect output.
Setup
mkdir parameter-playground
cd parameter-playground
uv init
uv add openai python-dotenv
touch parameter_playground.py
The Code
"""
Parameter Playground
====================
Visualize how generation parameters affect LLM output.
Run the same prompt with different settings and compare results.
"""
import os
from dataclasses import dataclass
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()
@dataclass
class GenerationConfig:
"""Configuration for generation parameters."""
name: str
temperature: float = 1.0
top_p: float = 1.0
frequency_penalty: float = 0.0
presence_penalty: float = 0.0
max_tokens: int = 500
# Define presets
PRESETS = {
"deterministic": GenerationConfig(
name="Deterministic (T=0)",
temperature=0,
),
"focused": GenerationConfig(
name="Focused (T=0.3)",
temperature=0.3,
),
"balanced": GenerationConfig(
name="Balanced (T=0.7)",
temperature=0.7,
),
"creative": GenerationConfig(
name="Creative (T=1.0)",
temperature=1.0,
frequency_penalty=0.3,
),
"experimental": GenerationConfig(
name="Experimental (T=1.3)",
temperature=1.3,
frequency_penalty=0.5,
presence_penalty=0.5,
),
}
def generate_with_config(
prompt: str,
config: GenerationConfig,
system_prompt: str = "You are a helpful assistant.",
client: OpenAI = None
) -> dict:
"""Generate a response with specific parameters."""
if client is None:
client = OpenAI()
response = client.chat.completions.create(
model="gpt-4o-mini", # Using mini for cost efficiency
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": prompt}
],
temperature=config.temperature,
top_p=config.top_p,
frequency_penalty=config.frequency_penalty,
presence_penalty=config.presence_penalty,
max_tokens=config.max_tokens,
)
return {
"config": config.name,
"content": response.choices[0].message.content,
"finish_reason": response.choices[0].finish_reason,
"tokens_used": response.usage.completion_tokens,
}
def run_comparison(prompt: str, num_runs: int = 3):
"""Run the same prompt across all presets, multiple times each."""
print("=" * 70)
print(f"PROMPT: {prompt}")
print("=" * 70)
client = OpenAI()
for preset_name, config in PRESETS.items():
print(f"\n{'─' * 70}")
print(f"CONFIG: {config.name}")
print(f" temperature={config.temperature}, top_p={config.top_p}")
print(f" frequency_penalty={config.frequency_penalty}, presence_penalty={config.presence_penalty}")
print(f"{'─' * 70}")
for i in range(num_runs):
result = generate_with_config(prompt, config, client=client)
# Truncate long outputs for display
content = result["content"]
if len(content) > 200:
content = content[:200] + "..."
print(f"\n Run {i + 1}: {content}")
# Show consistency indicator
if config.temperature == 0:
print(f"\n 📊 Consistency: HIGH (deterministic)")
elif config.temperature < 0.5:
print(f"\n 📊 Consistency: MEDIUM-HIGH")
elif config.temperature < 1.0:
print(f"\n 📊 Consistency: MEDIUM")
else:
print(f"\n 📊 Consistency: LOW (high variance expected)")
def demonstrate_penalties():
"""Show the effect of frequency and presence penalties."""
print("\n" + "=" * 70)
print("DEMONSTRATION: Repetition Penalties")
print("=" * 70)
# A prompt that tends to produce repetitive output
prompt = "List 10 reasons why exercise is good for you. Be detailed."
configs = [
GenerationConfig(name="No penalties", temperature=0.7),
GenerationConfig(name="Frequency penalty=0.5", temperature=0.7, frequency_penalty=0.5),
GenerationConfig(name="Presence penalty=0.5", temperature=0.7, presence_penalty=0.5),
GenerationConfig(name="Both penalties=0.5", temperature=0.7, frequency_penalty=0.5, presence_penalty=0.5),
]
client = OpenAI()
for config in configs:
print(f"\n{'─' * 70}")
print(f"CONFIG: {config.name}")
print(f"{'─' * 70}")
result = generate_with_config(prompt, config, client=client)
print(f"\n{result['content'][:500]}...")
print(f"\n Tokens used: {result['tokens_used']}")
def demonstrate_stop_sequences():
"""Show how stop sequences work."""
print("\n" + "=" * 70)
print("DEMONSTRATION: Stop Sequences")
print("=" * 70)
client = OpenAI()
prompt = "Write a short list of 10 programming languages."
# Without stop sequence
print("\n--- Without stop sequence ---")
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
max_tokens=500,
)
print(response.choices[0].message.content)
print(f"Finish reason: {response.choices[0].finish_reason}")
# With stop sequence (stop after 5 items)
print("\n--- With stop=['\\n6.'] (stop after 5 items) ---")
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
max_tokens=500,
stop=["\n6."],
)
print(response.choices[0].message.content)
print(f"Finish reason: {response.choices[0].finish_reason}")
# ═══════════════════════════════════════════════════════════════════════════
# MAIN
# ═══════════════════════════════════════════════════════════════════════════
if __name__ == "__main__":
print("\n🎛️ PARAMETER PLAYGROUND\n")
# Check for API key
if not os.getenv("OPENAI_API_KEY"):
print("❌ Error: OPENAI_API_KEY not found in environment")
print(" Create a .env file with: OPENAI_API_KEY=sk-...")
exit(1)
# Test 1: Compare temperature effects
print("\n" + "=" * 70)
print(" TEST 1: Temperature Comparison")
print("=" * 70)
run_comparison(
"Write a one-sentence description of a sunset.",
num_runs=3
)
# Test 2: Penalties demonstration
demonstrate_penalties()
# Test 3: Stop sequences
demonstrate_stop_sequences()
print("\n" + "=" * 70)
print(" EXPERIMENTS COMPLETE")
print("=" * 70)
print("""
Key observations:
1. Temperature 0 produces identical outputs every time
2. Higher temperature → more variation between runs
3. Frequency penalty reduces word repetition
4. Presence penalty encourages topic diversity
5. Stop sequences give you precise control over output length
""")
Run It
# Create .env file with your API key
echo "OPENAI_API_KEY=sk-your-key-here" > .env
# Run the playground
uv run parameter_playground.py
What to Observe
- Temperature 0: Every run produces identical output
- Temperature 0.7: Slight variations, but similar structure
- Temperature 1.3: Wildly different outputs each time
- Penalties: Notice vocabulary variety in the exercise list
- Stop sequences: Clean cutoff exactly where you specify
13. Provider Differences
Not all providers use the same parameter names or ranges:
Cloud APIs
| Parameter | OpenAI | Anthropic (Claude) | Google (Gemini) |
|---|---|---|---|
| temperature | 0-2 | 0-1 | 0-2 |
| top_p | 0-1 | 0-1 | 0-1 |
| top_k | ❌ | 0-500 | 1-40 |
| frequency_penalty | -2 to 2 | ❌ | ❌ |
| presence_penalty | -2 to 2 | ❌ | ❌ |
| max_tokens | Yes | Yes (max_tokens) | Yes (max_output_tokens) |
| stop sequences | Yes (stop) | Yes (stop_sequences) | Yes (stop_sequences) |
| seed | Yes | ❌ | ❌ |
| JSON mode | Yes | Yes | Yes |
Anthropic's Claude uses temperature 0-1 (not 0-2). A temperature of 1.0 in Claude is already quite creative. Don't port OpenAI settings directly without adjustment.
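One way to handle this is a small translation layer that maps your own config onto each provider's parameter names and ranges. A sketch, assuming you only care about temperature, top_p, and a token limit (the function names and the clamping choice are illustrative, not a library API):

```python
from dataclasses import dataclass

@dataclass
class GenConfig:
    temperature: float = 0.7   # expressed on OpenAI's 0-2 scale
    top_p: float = 1.0
    max_tokens: int = 500

def to_openai_kwargs(cfg: GenConfig) -> dict:
    return {"temperature": cfg.temperature, "top_p": cfg.top_p, "max_tokens": cfg.max_tokens}

def to_anthropic_kwargs(cfg: GenConfig) -> dict:
    # Claude's temperature range is 0-1, so clamp rather than passing 0-2 values through unchanged
    return {"temperature": min(cfg.temperature, 1.0), "top_p": cfg.top_p, "max_tokens": cfg.max_tokens}

cfg = GenConfig(temperature=1.2)
print(to_openai_kwargs(cfg))
print(to_anthropic_kwargs(cfg))
```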
Local LLMs (Ollama / llama.cpp)
Ollama exposes many more parameters since you have full control over the model:
| Parameter | Range | Default | Notes |
|---|---|---|---|
| temperature | 0-2+ | 0.8 | Same concept as cloud APIs |
| top_p | 0-1 | 0.9 | Nucleus sampling |
| top_k | 1-100+ | 40 | Fixed candidate count |
| min_p | 0-1 | 0 | Relative threshold filter |
| tfs_z | 0-1 | 1 | Tail-free sampling (1=disabled) |
| mirostat | 0/1/2 | 0 | Alternative sampling mode |
| mirostat_tau | 0-10 | 5.0 | Target perplexity |
| mirostat_eta | 0-1 | 0.1 | Learning rate |
| repeat_penalty | 0-2 | 1.1 | Repetition penalty |
| repeat_last_n | -1 to context | 64 | Penalty window size |
| num_ctx | 1-model max | 2048 | Context window size |
| num_predict | -2 to max | 128 | Max output tokens |
| seed | any int | random | For reproducibility |
Setting parameters in Ollama:
# In a modelfile:
FROM llama3.1
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
PARAMETER top_k 40
# At runtime (limited parameters):
/set parameter temperature 0.7
14. Common Pitfalls
| Symptom | Likely Cause | Fix |
|---|---|---|
| Output varies wildly between requests | Temperature too high for the task | Lower to 0-0.3 for structured output |
| Output is robotic and repetitive | Temperature too low for creative tasks | Raise to 0.7-1.0 |
| Same words keep appearing | No frequency penalty on long output | Add frequency_penalty=0.3-0.5 |
| Model talks in circles about same topic | No presence penalty | Add presence_penalty=0.3-0.5 |
| Response cuts off mid-sentence | max_tokens too low | Increase limit or check finish_reason |
| JSON sometimes invalid | Temperature > 0 | Use temperature=0 and response_format=json_object |
| Can't reproduce results for debugging | Temperature > 0, no seed | Set temperature=0 or use seed parameter |
15. Decision Framework
16. Key Takeaways
- Temperature is your primary control. It scales logits before softmax. Start with 0 for structured tasks, 0.7 for general use, 1.0+ for creativity.
- Pick ONE sampling method. Temperature, top-p, top-k, min-p, mirostat—choose one and master it. Most developers stick with temperature.
- Penalties fight repetition. Use frequency_penalty for varied vocabulary, presence_penalty for topic diversity. Start at 0.3-0.5.
- Context size matters for local LLMs. Ollama defaults to 2K tokens even if the model supports 128K. Set num_ctx explicitly.
- Temperature 0 isn't magic. It's deterministic but not perfect—use seed for true reproducibility.
- Stop sequences give surgical control. Perfect for limiting list length or stopping at delimiters.
- Mirostat is underrated. If you're running local models and want consistent creativity, try mirostat mode 2.
- Test empirically. Theory only gets you so far. Run the same prompt 10 times and observe variance.
17. What's Next
Congratulations! You've completed Part 1: Foundations of Prompt Engineering.
You now understand:
- Lesson 1: Token economics (the currency)
- Lesson 2: Prompting patterns (the techniques)
- Lesson 3: System prompts (the programming layer)
- Lesson 4: Generation parameters (the control knobs)
In Part 2: Building Your First AI Features, we'll put all of this into practice, starting with Lesson 5: Text Generation & Streaming UIs, where we'll build a real-time chat interface.
Time to build something real.
18. Additional Resources
- LLM Parameters Explained (Video) — The video this lesson incorporates
- OpenAI: Parameter Documentation — Official reference
- Mirostat Paper — Original research on perplexity-controlled sampling
- The Illustrated GPT-2 — Visual explanation of token generation