
Lesson 4: Generation Parameters — Controlling Creativity

Topics Covered
  • The Probability Engine: Why LLMs work with logits, not probabilities directly.
  • Temperature: The "chaos dial" that scales logits before softmax.
  • Sampling Methods: Top-p, Top-k, Min-p, and Tail-Free Sampling.
  • Mirostat: An alternative approach using perplexity targeting.
  • Repetition Control: Frequency penalty, presence penalty, and repeat windows.
  • Output Controls: Max tokens, stop sequences, and context size.
  • Local LLMs: Ollama-specific parameters like num_ctx and num_predict.

You've crafted the perfect prompt. You've written an airtight system message. You send the request and get... a wildly creative response when you needed precise JSON. Or a robotic, repetitive answer when you wanted engaging copy. The problem isn't your prompt—it's your generation parameters.

These are the "knobs" that control how the model selects its next token. Get them wrong, and even perfect prompts produce wrong outputs.

1. The Probability Engine

Before we touch any settings, we need to understand what's actually happening when an LLM generates text.

At each step, the model doesn't "know" what to say next. It calculates a probability distribution over its entire vocabulary (50,000+ tokens). Every token gets a probability score.

Prompt: "The capital of France is"

Token Probabilities:
"Paris" → 92.3%
"the" → 2.1%
"a" → 1.4%
"located" → 0.8%
"definitely"→ 0.4%
... (50,000 more tokens with tiny probabilities)

The model then samples from this distribution to pick the next token. This is where generation parameters come in—they control how this sampling happens.
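
To make "sampling" concrete, here's a tiny self-contained sketch using made-up probabilities (a real model works over the full vocabulary, and with logits, as described next):

import random

# Hypothetical next-token distribution for "The capital of France is"
token_probs = {
    "Paris": 0.923,
    "the": 0.021,
    "a": 0.014,
    "located": 0.008,
    "definitely": 0.004,
    # ... tens of thousands more tokens with tiny probabilities in a real model
}

# Sampling: pick one token, weighted by its probability
tokens = list(token_probs.keys())
weights = list(token_probs.values())
next_token = random.choices(tokens, weights=weights, k=1)[0]
print(next_token)  # Usually "Paris", occasionally something else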

2. Temperature: The Chaos Dial

Temperature controls how "sharp" or "flat" the probability distribution is before sampling. But to really understand it, we need to look at what happens under the hood.

Under the Hood: Logits

The model doesn't actually compute probabilities directly. It works with logits—raw, unnormalized scores that typically land somewhere around -10 to +10.

Before softmax (raw logits):
"Paris" → 8.2
"the" → 2.1
"a" → 1.4
"located" → -0.8
"banana" → -9.5

These logits are converted to probabilities using the softmax function, which:

  1. Exponentiates each logit
  2. Divides by the sum of all exponentials
  3. Results in numbers between 0 and 1 that sum to 1

Temperature scales the logits BEFORE softmax:

# Simplified: how temperature affects logits
scaled_logits = original_logits / temperature
probabilities = softmax(scaled_logits)

The Math (Simplified)

  • Temperature < 1 → Dividing by a number less than 1 stretches the logits apart, so differences get BIGGER and the top tokens dominate after softmax.
  • Temperature = 1 → Use the logits as-is. (Default behavior)
  • Temperature > 1 → Dividing by a number greater than 1 squashes the logits together, so differences get SMALLER and low-probability tokens get boosted.
  • Temperature = 0 → Special case: always pick the highest logit. Deterministic.
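
You can see this effect with a few lines of Python. This is a self-contained sketch using the toy logits from the example above:

import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by temperature, then apply softmax."""
    if temperature == 0:
        # Special case: greedy decoding — all probability on the highest logit
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    scaled = [x / temperature for x in logits]
    exps = [math.exp(x) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [8.2, 2.1, 1.4, -0.8]  # "Paris", "the", "a", "located"
for t in (0.3, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
# Lower temperature → the top logit dominates even more;
# higher temperature → the distribution flattens out.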

Visual Intuition

Prompt: "The weather today is"

┌─────────────────────────────────────────────────────────────────┐
│ Temperature = 0.0 (Deterministic) │
│ ████████████████████████████████████████ "sunny" (100%) │
│ │
│ Temperature = 0.3 (Focused) │
│ ████████████████████████████████ "sunny" (85%) │
│ ████ "nice" (10%) │
│ █ "warm" (5%) │
│ │
│ Temperature = 1.0 (Default) │
│ ██████████████████ "sunny" (45%) │
│ ████████ "nice" (20%) │
│ ██████ "warm" (15%) │
│ ████ "beautiful" (10%) │
│ ██ "perfect" (5%) │
│ █ others (5%) │
│ │
│ Temperature = 1.5 (Creative) │
│ ██████████ "sunny" (25%) │
│ ██████ "nice" (15%) │
│ █████ "warm" (12%) │
│ ████ "beautiful" (10%) │
│ ████ "absolutely" (8%) │
│ ███ "quite" (7%) │
│ ██████████ others (23%) │
└─────────────────────────────────────────────────────────────────┘

When to Use What

| Temperature | Behavior | Best For |
|---|---|---|
| 0 | Always pick the most likely token | JSON generation, code, factual Q&A, deterministic outputs |
| 0.1 - 0.3 | Very focused, minimal variation | Data extraction, classification, structured output |
| 0.5 - 0.7 | Balanced creativity and coherence | General chat, explanations, summarization |
| 0.8 - 1.0 | More variety, occasionally surprising | Creative writing, brainstorming, marketing copy |
| 1.2 - 1.5 | High creativity, risk of incoherence | Poetry, experimental content, breaking writer's block |
| > 1.5 | Chaos mode | Almost never useful in production |

Temperature 0 Isn't Truly Deterministic

Even at temperature 0, you might see slight variations due to floating-point math and GPU parallelism. For truly reproducible outputs, also set a seed parameter (if the API supports it).
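
For example, with the OpenAI client you can combine temperature 0 with a fixed seed. This is a sketch; seed support is best-effort and not available from every provider or model:

from openai import OpenAI

client = OpenAI()

# Pin down as much randomness as the API allows
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Give me one word for 'happy'."}],
    temperature=0,
    seed=42,        # best-effort reproducibility across identical requests
    max_tokens=10,
)
print(response.choices[0].message.content)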

3. Top-p (Nucleus Sampling): The Candidate Filter

Top-p (also called nucleus sampling) takes a different approach: instead of reshaping probabilities, it limits which tokens are even considered.

How It Works

  1. Sort all tokens by probability (highest first)
  2. Add tokens to the "candidate pool" until their cumulative probability reaches p
  3. Sample only from this pool

Top-p = 0.9 means: "Only consider tokens that together account for 90% of probability mass"

Prompt: "The weather today is"

All tokens (sorted by probability):
"sunny" → 45% ✓ (cumulative: 45%)
"nice" → 20% ✓ (cumulative: 65%)
"warm" → 15% ✓ (cumulative: 80%)
"beautiful" → 10% ✓ (cumulative: 90%) ← Stop here
"perfect" → 5% ✗ (excluded)
"cloudy" → 3% ✗ (excluded)
... rest excluded

Sample only from: ["sunny", "nice", "warm", "beautiful"]
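
Here's a minimal sketch of that filtering step with toy probabilities (a real implementation would also renormalize the pool before sampling from it):

def top_p_filter(token_probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    sorted_tokens = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)
    pool, cumulative = [], 0.0
    for token, prob in sorted_tokens:
        pool.append(token)
        cumulative += prob
        if cumulative >= p:
            break
    return pool

probs = {"sunny": 0.45, "nice": 0.20, "warm": 0.15, "beautiful": 0.10,
         "perfect": 0.05, "cloudy": 0.03, "purple": 0.001}
print(top_p_filter(probs, p=0.9))
# ['sunny', 'nice', 'warm', 'beautiful'] — then sample only from this pool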

Top-p Values

| Top-p | Effect | Use Case |
|---|---|---|
| 0.1 | Only the very top tokens | Maximum focus, almost deterministic |
| 0.5 | Top ~50% probability mass | Focused but some variety |
| 0.9 | Top ~90% probability mass | Good default, filters obvious nonsense |
| 0.95 | Almost everything included | Creative tasks |
| 1.0 | All tokens considered | Full randomness (use temperature to control) |

4. Top-k: The Simple Filter

Top-k is the simplest sampling method: only consider the top K most likely tokens.

top_k = 40 (common default)

All tokens sorted by probability:
#1 "sunny" → 45% ✓ included
#2 "nice" → 20% ✓ included
...
#40 "adequate" → 0.1% ✓ included
#41 "purple" → 0.05% ✗ excluded (beyond top 40)

Top-k vs Top-p:

  • Top-k: Fixed number of candidates (always exactly K tokens)
  • Top-p: Variable number of candidates (depends on probability distribution)

Top-p is generally preferred because it adapts to the situation. If the model is very confident (one token has 95% probability), top-p will include fewer candidates. Top-k would still include 40 tokens even when most of them are irrelevant.

5. Min-p: The Relative Threshold

Min-p is a newer alternative to top-p. Instead of a cumulative-probability cutoff, it uses a threshold relative to the probability of the most likely token.

min_p = 0.1 means: "Only include tokens with probability ≥ 10% of the top token's probability"

Example:
Top token "Paris" has probability 80%
Threshold = 80% × 0.1 = 8%

"Paris" → 80% ✓ (above 8%)
"the" → 12% ✓ (above 8%)
"located" → 5% ✗ (below 8%)
"banana" → 0.1% ✗ (below 8%)

Why use min-p?

  • More intuitive than top-p for some use cases
  • Automatically adapts to confidence levels
  • Available in Ollama and some local LLM frameworks (not OpenAI API)

6. Tail-Free Sampling (TFS)

Tail-free sampling takes a statistical approach: it analyzes the probability distribution's "tail" (the long list of unlikely tokens) and cuts it off.

tfs_z = 0.95 means: "Cut off tokens in the tail based on second derivative analysis"

Values:
- 1.0 = Disabled (no tail cutting)
- 0.99-0.95 = Light tail trimming (good starting range)
- < 0.9 = Aggressive trimming

When to use TFS:

  • When top-p still includes too many unlikely tokens
  • For more coherent long-form generation
  • Mainly available in local LLM tools (Ollama, llama.cpp)

Sampling Method Priority

These samplers run as a pipeline, and the exact order varies by framework (in llama.cpp it's even configurable). If you enable several at once, their effects compound. Start with just temperature and add others only if needed.

7. Mirostat: Adaptive Perplexity Control

Mirostat is a completely different approach to sampling. Instead of manually tuning temperature and top-p, it automatically adjusts sampling to maintain a target perplexity level.

What's Perplexity?

Perplexity measures how "surprised" the model is by its own output:

  • Low perplexity → Model is confident, output is predictable/coherent
  • High perplexity → Model is uncertain, output is diverse/creative

Mirostat targets a specific perplexity level and adjusts sampling on-the-fly to maintain it.

Mirostat Parameters

mirostat = 0 (default): Disabled, use traditional sampling
mirostat = 1: Mirostat v1
mirostat = 2: Mirostat v2 (generally preferred)

mirostat_tau = 5.0 (default): Target perplexity level
- Higher tau → More diverse/creative output
- Lower tau → More coherent/focused output
- Range: typically 3.0 to 5.0

mirostat_eta = 0.1 (default): Learning rate
- Higher eta → Faster adaptation to target perplexity
- Lower eta → More stable, slower adaptation
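
Here's a minimal sketch of enabling Mirostat v2 through Ollama's REST API. It assumes a local Ollama server on the default port with a llama3.1 model pulled; see the availability note below:

import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1",
        "prompt": "Write a short story about a lighthouse keeper.",
        "stream": False,
        "options": {
            "mirostat": 2,        # use Mirostat v2 instead of temperature/top-p
            "mirostat_tau": 5.0,  # target perplexity
            "mirostat_eta": 0.1,  # learning rate
        },
    },
)
print(response.json()["response"])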

When to Use Mirostat

Use Mirostat when:

  • You want consistent "creativity level" across different prompts
  • Manual temperature tuning isn't giving consistent results
  • You're generating long-form content and want stable quality

Don't use Mirostat when:

  • You need deterministic output (use temperature=0 instead)
  • You're using APIs that don't support it (OpenAI, Anthropic)
  • You need fine-grained control over specific parameters
Mirostat Availability

Mirostat is primarily available in local LLM tools like Ollama and llama.cpp. Cloud APIs (OpenAI, Anthropic, Google) typically don't offer it—they rely on temperature and top-p.

8. Choosing Your Sampling Strategy

Here's the thing: you usually don't need all of these. Most production systems use just temperature, or temperature + top-p.

OpenAI's recommendation: Adjust one or the other, not both. If you use temperature, set top_p to 1.0. If you use top_p, set temperature to 1.0.

In practice:

  • Most developers use temperature because it's more intuitive
  • Use top_p when you specifically want to exclude the long tail of unlikely tokens
  • Use mirostat (if available) when you want consistent creativity across varied prompts

9. Repetition Control

These parameters fight repetition—a common LLM problem, especially in longer outputs.

How Penalties Work on Logits

Penalties adjust the logits (not the probabilities) of tokens that have already appeared. Local runners like llama.cpp and Ollama use a multiplicative scheme for their repeat_penalty:

If the logit is NEGATIVE: logit = logit × penalty
If the logit is POSITIVE: logit = logit / penalty

With penalty > 1 (Ollama's default is 1.1):
- Positive logits get smaller (less likely)
- Negative logits get more negative (even less likely)

Result: previously used tokens become less likely to appear again.

You can also set the penalty below 1, which has the opposite effect—making repeated tokens MORE likely. This is rarely useful but exists for edge cases.

OpenAI's frequency and presence penalties (covered next) work differently: instead of scaling the logit, they subtract a value from it for each token that has already appeared.
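
A toy sketch of both schemes with made-up logit values (the subtractive formula follows OpenAI's documented definition of frequency and presence penalties):

def repeat_penalty(logit: float, penalty: float = 1.1) -> float:
    """Multiplicative scheme (llama.cpp / Ollama repeat_penalty)."""
    return logit * penalty if logit < 0 else logit / penalty

def frequency_presence_penalty(
    logit: float, count: int,
    frequency_penalty: float = 0.0, presence_penalty: float = 0.0,
) -> float:
    """Subtractive scheme (OpenAI-style frequency/presence penalties)."""
    return logit - count * frequency_penalty - (1 if count > 0 else 0) * presence_penalty

# A token with logit 2.0 that has already appeared 5 times
print(repeat_penalty(2.0))                                              # ~1.82
print(frequency_presence_penalty(2.0, count=5, frequency_penalty=0.5))  # -0.5
print(frequency_presence_penalty(2.0, count=5, presence_penalty=0.5))   # 1.5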

Frequency Penalty

Reduces the probability of tokens proportional to how often they've appeared.

frequency_penalty = 0.0 (default): No penalty
frequency_penalty = 1.0: Strong penalty against repetition
frequency_penalty = 2.0: Very strong penalty (can cause incoherence)

Example with frequency_penalty = 0.5:
- Token "the" appeared 5 times → penalty applied 5× (cumulative)
- Token "AI" appeared 2 times → penalty applied 2× (cumulative)

Use case: Long-form content where you want varied vocabulary.

Presence Penalty

Reduces the probability of tokens that have appeared at all (binary: appeared or not).

presence_penalty = 0.0 (default): No penalty
presence_penalty = 1.0: Moderate push toward new topics
presence_penalty = 2.0: Strong push toward new topics

Example with presence_penalty = 0.5:
- Token "the" appeared (any number of times) → one-time penalty of 0.5
- Token "AI" appeared (any number of times) → one-time penalty of 0.5

Use case: When you want the model to explore new topics rather than dwelling on what's already been mentioned.

Repeat Window (repeat_last_n)

In local LLM tools like Ollama, you can control how far back to look for repetitions:

repeat_last_n = 64 (default): Look at last 64 tokens
repeat_last_n = 128: Larger window, catch more distant repetition
repeat_last_n = 0: Disable repetition penalty entirely
repeat_last_n = -1: Use the entire context as the window

Why this matters: A short window (64) only penalizes recent repetition. A long window (or -1) catches patterns that repeat across the entire conversation—useful for long documents but more expensive computationally.

Comparison

| Parameter | Penalizes | Effect | Use When |
|---|---|---|---|
| frequency_penalty | Repeat count | Varied vocabulary | Long documents, avoiding word repetition |
| presence_penalty | Existence (yes/no) | Topic diversity | Brainstorming, exploring new directions |
| repeat_last_n | Window size | Scope of penalty | Ollama/local: control how far back to look |

Start Conservative

Both penalties default to 0. Start there. If you see repetition, try 0.3-0.5. Values above 1.0 often cause erratic output.

10. Output Controls

Max Tokens / num_predict

The hard limit on response length. When reached, the model stops immediately—even mid-sentence.

# OpenAI/Anthropic: max_tokens
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    max_tokens=500  # Stop after ~375 words
)

# Ollama: num_predict
# In modelfile or API
num_predict = 500
# Special values:
# -1 = Generate until done (no limit)
# -2 = Fill the entire context window

Important considerations:

  • max_tokens counts output only, not input
  • Set it based on your use case, not "just in case"
  • Lower values = faster responses and lower cost
  • If the model stops mid-thought, you'll see finish_reason: "length" instead of "stop"

| Use Case | Suggested max_tokens |
|---|---|
| Classification (one word) | 10-50 |
| Short answer | 100-200 |
| Paragraph response | 300-500 |
| Long-form content | 1000-2000 |
| Maximum (let it finish naturally) | 4096+ (model dependent) |

Context Size (num_ctx) — Ollama Specific

When you see a model advertised with "128K context," that's the maximum supported context size. But in Ollama, models default to only 2,048 tokens to save memory.

# Why the default is small:
- 128K context requires significant GPU memory
- Many users have GPUs with only 8GB VRAM
- Ollama prioritizes working on modest hardware

# To use a model's full context in Ollama:
# Create a modelfile:
FROM llama3.1
PARAMETER num_ctx 131072 # 128K tokens

# Then create the model:
ollama create my-big-llama -f modelfile

To find a model's maximum context:

ollama show llama3.1
# Look for "context length" near the top
Context Size and Memory

Larger context = more memory required. A 128K context model might need 20GB+ VRAM. Start with smaller contexts (4K-8K) and increase only if needed.
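
If you'd rather not bake the context size into a modelfile, Ollama's REST API also accepts it per request. A sketch, assuming a local Ollama server on the default port:

import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1",
        "prompt": "Summarize the following document: ...",
        "stream": False,
        "options": {
            "num_ctx": 8192,      # context window for this request
            "num_predict": 500,   # cap the output length too
        },
    },
)
print(response.json()["response"])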

Stop Sequences

Tell the model to stop generating when it produces a specific string.

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    stop=["```", "\n\n", "END"]  # Stop at any of these
)

Use cases:

  • Stop at code block end: stop=["```"]
  • Stop at double newline: stop=["\n\n"] (useful for single paragraphs)
  • Stop at custom delimiter: stop=["---END---"]
  • Prevent runaway lists: stop=["\n6."] (stop after 5 items)

Stop sequences are powerful for controlling models that tend to ramble or repeat patterns. If you notice your model outputting a strange repeating symbol, add it as a stop sequence.

Response Format (JSON Mode)

Force the model to output valid JSON:

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    response_format={"type": "json_object"}  # OpenAI
)

Requirements:

  • You must mention "JSON" in your prompt (OpenAI requirement)
  • The model will always produce valid JSON (syntax guaranteed)
  • The structure/schema is NOT guaranteed—use prompt engineering for that, and validate the parsed result yourself (see the sketch below)
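
A minimal sketch of that validation step, continuing from the call above (the expected keys here are hypothetical—check for whatever schema your prompt asks for):

import json

raw = response.choices[0].message.content  # valid JSON syntax is guaranteed
data = json.loads(raw)

# The schema is NOT guaranteed, so validate the fields you rely on
required_keys = {"name", "sentiment", "confidence"}  # hypothetical schema
missing = required_keys - data.keys()
if missing:
    raise ValueError(f"Model omitted expected keys: {missing}")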

11. Putting It Together: Parameter Presets

Here are battle-tested presets for common use cases:

Preset: Deterministic/Structured (JSON, Code, Classification)

STRUCTURED_PRESET = {
    "temperature": 0,
    "top_p": 1,
    "frequency_penalty": 0,
    "presence_penalty": 0,
    "max_tokens": 1000,
}

Why: You want the same input to produce the same output. No creativity needed.

Preset: Balanced (Chat, Q&A, Explanations)

BALANCED_PRESET = {
    "temperature": 0.7,
    "top_p": 1,
    "frequency_penalty": 0,
    "presence_penalty": 0,
    "max_tokens": 2000,
}

Why: Some variety keeps responses engaging, but not so much that accuracy suffers.

Preset: Creative (Marketing, Brainstorming, Writing)

CREATIVE_PRESET = {
    "temperature": 0.9,
    "top_p": 1,
    "frequency_penalty": 0.3,
    "presence_penalty": 0.3,
    "max_tokens": 3000,
}

Why: Higher temperature for creativity, light penalties to avoid repetition in longer outputs.

Preset: Exploratory (Idea Generation, Breaking Blocks)

EXPLORATORY_PRESET = {
    "temperature": 1.2,
    "top_p": 0.95,
    "frequency_penalty": 0.5,
    "presence_penalty": 0.5,
    "max_tokens": 2000,
}

Why: Maximum variety, strong push toward new territory. Review outputs carefully.

12. Hands-On Exercise: The Parameter Playground

Let's build a tool to visualize how parameters affect output.

Setup

mkdir parameter-playground
cd parameter-playground
uv init
uv add openai python-dotenv
touch parameter_playground.py

The Code

parameter_playground.py
"""
Parameter Playground
====================
Visualize how generation parameters affect LLM output.

Run the same prompt with different settings and compare results.
"""

import os
from dataclasses import dataclass
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()


@dataclass
class GenerationConfig:
    """Configuration for generation parameters."""
    name: str
    temperature: float = 1.0
    top_p: float = 1.0
    frequency_penalty: float = 0.0
    presence_penalty: float = 0.0
    max_tokens: int = 500


# Define presets
PRESETS = {
    "deterministic": GenerationConfig(
        name="Deterministic (T=0)",
        temperature=0,
    ),
    "focused": GenerationConfig(
        name="Focused (T=0.3)",
        temperature=0.3,
    ),
    "balanced": GenerationConfig(
        name="Balanced (T=0.7)",
        temperature=0.7,
    ),
    "creative": GenerationConfig(
        name="Creative (T=1.0)",
        temperature=1.0,
        frequency_penalty=0.3,
    ),
    "experimental": GenerationConfig(
        name="Experimental (T=1.3)",
        temperature=1.3,
        frequency_penalty=0.5,
        presence_penalty=0.5,
    ),
}


def generate_with_config(
    prompt: str,
    config: GenerationConfig,
    system_prompt: str = "You are a helpful assistant.",
    client: OpenAI = None
) -> dict:
    """Generate a response with specific parameters."""

    if client is None:
        client = OpenAI()

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # Using mini for cost efficiency
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ],
        temperature=config.temperature,
        top_p=config.top_p,
        frequency_penalty=config.frequency_penalty,
        presence_penalty=config.presence_penalty,
        max_tokens=config.max_tokens,
    )

    return {
        "config": config.name,
        "content": response.choices[0].message.content,
        "finish_reason": response.choices[0].finish_reason,
        "tokens_used": response.usage.completion_tokens,
    }


def run_comparison(prompt: str, num_runs: int = 3):
    """Run the same prompt across all presets, multiple times each."""

    print("=" * 70)
    print(f"PROMPT: {prompt}")
    print("=" * 70)

    client = OpenAI()

    for preset_name, config in PRESETS.items():
        print(f"\n{'─' * 70}")
        print(f"CONFIG: {config.name}")
        print(f"  temperature={config.temperature}, top_p={config.top_p}")
        print(f"  frequency_penalty={config.frequency_penalty}, presence_penalty={config.presence_penalty}")
        print(f"{'─' * 70}")

        for i in range(num_runs):
            result = generate_with_config(prompt, config, client=client)

            # Truncate long outputs for display
            content = result["content"]
            if len(content) > 200:
                content = content[:200] + "..."

            print(f"\n  Run {i + 1}: {content}")

        # Show consistency indicator
        if config.temperature == 0:
            print(f"\n  📊 Consistency: HIGH (deterministic)")
        elif config.temperature < 0.5:
            print(f"\n  📊 Consistency: MEDIUM-HIGH")
        elif config.temperature < 1.0:
            print(f"\n  📊 Consistency: MEDIUM")
        else:
            print(f"\n  📊 Consistency: LOW (high variance expected)")


def demonstrate_penalties():
    """Show the effect of frequency and presence penalties."""

    print("\n" + "=" * 70)
    print("DEMONSTRATION: Repetition Penalties")
    print("=" * 70)

    # A prompt that tends to produce repetitive output
    prompt = "List 10 reasons why exercise is good for you. Be detailed."

    configs = [
        GenerationConfig(name="No penalties", temperature=0.7),
        GenerationConfig(name="Frequency penalty=0.5", temperature=0.7, frequency_penalty=0.5),
        GenerationConfig(name="Presence penalty=0.5", temperature=0.7, presence_penalty=0.5),
        GenerationConfig(name="Both penalties=0.5", temperature=0.7, frequency_penalty=0.5, presence_penalty=0.5),
    ]

    client = OpenAI()

    for config in configs:
        print(f"\n{'─' * 70}")
        print(f"CONFIG: {config.name}")
        print(f"{'─' * 70}")

        result = generate_with_config(prompt, config, client=client)
        print(f"\n{result['content'][:500]}...")
        print(f"\n  Tokens used: {result['tokens_used']}")


def demonstrate_stop_sequences():
    """Show how stop sequences work."""

    print("\n" + "=" * 70)
    print("DEMONSTRATION: Stop Sequences")
    print("=" * 70)

    client = OpenAI()
    prompt = "Write a short list of 10 programming languages."

    # Without stop sequence
    print("\n--- Without stop sequence ---")
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=500,
    )
    print(response.choices[0].message.content)
    print(f"Finish reason: {response.choices[0].finish_reason}")

    # With stop sequence (stop after 5 items)
    print("\n--- With stop=['\\n6.'] (stop after 5 items) ---")
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=500,
        stop=["\n6."],
    )
    print(response.choices[0].message.content)
    print(f"Finish reason: {response.choices[0].finish_reason}")


# ═══════════════════════════════════════════════════════════════════════════
# MAIN
# ═══════════════════════════════════════════════════════════════════════════

if __name__ == "__main__":
print("\n🎛️ PARAMETER PLAYGROUND\n")

# Check for API key
if not os.getenv("OPENAI_API_KEY"):
print("❌ Error: OPENAI_API_KEY not found in environment")
print(" Create a .env file with: OPENAI_API_KEY=sk-...")
exit(1)

# Test 1: Compare temperature effects
print("\n" + "=" * 70)
print(" TEST 1: Temperature Comparison")
print("=" * 70)
run_comparison(
"Write a one-sentence description of a sunset.",
num_runs=3
)

# Test 2: Penalties demonstration
demonstrate_penalties()

# Test 3: Stop sequences
demonstrate_stop_sequences()

print("\n" + "=" * 70)
print(" EXPERIMENTS COMPLETE")
print("=" * 70)
print("""
Key observations:
1. Temperature 0 produces identical outputs every time
2. Higher temperature → more variation between runs
3. Frequency penalty reduces word repetition
4. Presence penalty encourages topic diversity
5. Stop sequences give you precise control over output length
""")

Run It

# Create .env file with your API key
echo "OPENAI_API_KEY=sk-your-key-here" > .env

# Run the playground
uv run parameter_playground.py

What to Observe

  1. Temperature 0: Every run produces identical output
  2. Temperature 0.7: Slight variations, but similar structure
  3. Temperature 1.3: Wildly different outputs each time
  4. Penalties: Notice vocabulary variety in the exercise list
  5. Stop sequences: Clean cutoff exactly where you specify

13. Provider Differences

Not all providers use the same parameter names or ranges:

Cloud APIs

| Parameter | OpenAI | Anthropic (Claude) | Google (Gemini) |
|---|---|---|---|
| temperature | 0-2 | 0-1 | 0-2 |
| top_p | 0-1 | 0-1 | 0-1 |
| top_k | — | 0-500 | 1-40 |
| frequency_penalty | -2 to 2 | — | — |
| presence_penalty | -2 to 2 | — | — |
| max_tokens | Yes | Yes (max_tokens) | Yes (max_output_tokens) |
| stop sequences | Yes (stop) | Yes (stop_sequences) | Yes (stop_sequences) |
| seed | Yes | — | — |
| JSON mode | Yes | Yes | Yes |

Claude's Temperature Range

Anthropic's Claude uses temperature 0-1 (not 0-2). A temperature of 1.0 in Claude is already quite creative. Don't port OpenAI settings directly without adjustment.

Local LLMs (Ollama / llama.cpp)

Ollama exposes many more parameters since you have full control over the model:

| Parameter | Range | Default | Notes |
|---|---|---|---|
| temperature | 0-2+ | 0.8 | Same concept as cloud APIs |
| top_p | 0-1 | 0.9 | Nucleus sampling |
| top_k | 1-100+ | 40 | Fixed candidate count |
| min_p | 0-1 | 0 | Relative threshold filter |
| tfs_z | 0-1 | 1 | Tail-free sampling (1 = disabled) |
| mirostat | 0/1/2 | 0 | Alternative sampling mode |
| mirostat_tau | 0-10 | 5.0 | Target perplexity |
| mirostat_eta | 0-1 | 0.1 | Learning rate |
| repeat_penalty | 0-2 | 1.1 | Repetition penalty |
| repeat_last_n | -1 to context | 64 | Penalty window size |
| num_ctx | 1 to model max | 2048 | Context window size |
| num_predict | -2 to max | 128 | Max output tokens |
| seed | any int | random | For reproducibility |

Setting parameters in Ollama:

# In a modelfile:
FROM llama3.1
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
PARAMETER top_k 40

# At runtime (limited parameters):
/set parameter temperature 0.7

14. Common Pitfalls

| Symptom | Likely Cause | Fix |
|---|---|---|
| Output varies wildly between requests | Temperature too high for the task | Lower to 0-0.3 for structured output |
| Output is robotic and repetitive | Temperature too low for creative tasks | Raise to 0.7-1.0 |
| Same words keep appearing | No frequency penalty on long output | Add frequency_penalty=0.3-0.5 |
| Model talks in circles about same topic | No presence penalty | Add presence_penalty=0.3-0.5 |
| Response cuts off mid-sentence | max_tokens too low | Increase limit or check finish_reason |
| JSON sometimes invalid | Temperature > 0 | Use temperature=0 and response_format=json_object |
| Can't reproduce results for debugging | Temperature > 0, no seed | Set temperature=0 or use seed parameter |

15. Decision Framework

16. Key Takeaways

  1. Temperature is your primary control. It scales logits before softmax. Start with 0 for structured tasks, 0.7 for general use, 1.0+ for creativity.

  2. Pick ONE sampling method. Temperature, top-p, top-k, min-p, mirostat—choose one and master it. Most developers stick with temperature.

  3. Penalties fight repetition. Use frequency_penalty for varied vocabulary, presence_penalty for topic diversity. Start at 0.3-0.5.

  4. Context size matters for local LLMs. Ollama defaults to 2K tokens even if the model supports 128K. Set num_ctx explicitly.

  5. Temperature 0 isn't magic. It's deterministic but not perfect—use seed for true reproducibility.

  6. Stop sequences give surgical control. Perfect for limiting list length or stopping at delimiters.

  7. Mirostat is underrated. If you're running local models and want consistent creativity, try mirostat mode 2.

  8. Test empirically. Theory only gets you so far. Run the same prompt 10 times and observe variance.

17. What's Next

Congratulations! You've completed Part 1: Foundations of Prompt Engineering.

You now understand:

  • Lesson 1: Token economics (the currency)
  • Lesson 2: Prompting patterns (the techniques)
  • Lesson 3: System prompts (the programming layer)
  • Lesson 4: Generation parameters (the control knobs)

In Part 2: Building Your First AI Features, we'll put all of this into practice, starting with Lesson 5: Text Generation & Streaming UIs, where we'll build a real-time chat interface.

Time to build something real.

18. Additional Resources