Lesson 7: Vision & Multimodal Inputs
- Multimodal Models: What they can (and can't) see.
- Image Encoding: Base64 vs URLs—when to use which.
- Provider APIs: OpenAI, Claude, and Gemini vision side-by-side.
- Practical Patterns: Screenshot analysis, document OCR, diagram understanding.
- Cost Optimization: Image sizing, tiling, and token budgets.
- Building Tools: Screenshot-to-code, receipt scanner, UI analyzer.
Your LLM can read. Now it's time to teach it to see. Modern models like OpenAI's GPT-5, Anthropic's Claude, and Google's Gemini can process images alongside text, opening up use cases from document scanning to UI analysis to visual Q&A. In this lesson, you'll learn to build applications that truly understand what they're looking at.
1. What Vision Models Can (and Can't) Do
Modern multimodal models can:
| Capability | Examples |
|---|---|
| Describe images | "A sunset over mountains with orange and purple clouds" |
| Read text (OCR) | Extract text from screenshots, documents, signs |
| Analyze charts | Interpret bar charts, line graphs, pie charts |
| Understand diagrams | Read flowcharts, architecture diagrams, wireframes |
| Identify objects | "There are 3 people and 2 dogs in this photo" |
| Compare images | Spot differences between two screenshots |
| Answer questions | "What color is the car?" → "Red" |
What They Can't Do (Well):
| Limitation | Details |
|---|---|
| Precise counting | "How many windows?" often wrong for complex images |
| Spatial reasoning | "Is the cup to the left or right of the plate?" can be unreliable |
| Small text | Text under ~12px is often missed or misread |
| Handwriting | Messy handwriting recognition is hit-or-miss |
| Faces | Models intentionally limit facial recognition for privacy |
| Real-time video | Mostly image-only; Gemini accepts video files, but other APIs require you to sample frames yourself |
2. Image Encoding: Base64 vs URLs
There are two ways to send images to LLMs:
Option 1: Base64 Encoding
Embed the image data directly in your request.
import base64
def encode_image_to_base64(image_path: str) -> str:
"""Convert an image file to base64 string."""
with open(image_path, "rb") as f:
return base64.standard_b64encode(f.read()).decode("utf-8")
# Usage
image_data = encode_image_to_base64("screenshot.png")
# Returns: "iVBORw0KGgoAAAANSUhEUgAA..."
Pros:
- Works with local files
- No external dependencies
- Image is guaranteed to be available
Cons:
- Increases request size significantly
- Slower uploads for large images
- Base64 adds ~33% overhead
Option 2: URL Reference
Point to an image hosted online.
image_url = "https://example.com/image.png"
# Model fetches the image directly
Pros:
- Smaller request payload
- Faster for already-hosted images
- Good for public images
Cons:
- URL must be publicly accessible
- Image might change or disappear
- Some models have URL restrictions
When to Use Which
| Scenario | Recommended |
|---|---|
| User uploads in your app | Base64 |
| Processing local files | Base64 |
| Referencing public images | URL |
| Reproducible/archived requests | Base64 |
| Large batch processing | URL (if hosted) |
3. OpenAI Vision API
OpenAI's multimodal chat models, including GPT-4o, GPT-4o-mini, and the newer GPT-5 family, all accept image input.
Basic Image Analysis
"""
OpenAI Vision API
=================
Send images to GPT for analysis.
"""
import base64
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()
client = OpenAI()
def encode_image(image_path: str) -> str:
"""Encode image to base64."""
with open(image_path, "rb") as f:
return base64.standard_b64encode(f.read()).decode("utf-8")
def analyze_image(
image_path: str,
prompt: str = "What's in this image?",
detail: str = "auto", # "low", "high", or "auto"
) -> str:
"""
Analyze an image with GPT-4o.
Args:
image_path: Path to image file
prompt: Question or instruction about the image
detail: Image quality ("low" = faster/cheaper, "high" = better quality)
"""
base64_image = encode_image(image_path)
# Detect media type
suffix = image_path.lower().split(".")[-1]
media_types = {"png": "image/png", "jpg": "image/jpeg", "jpeg": "image/jpeg",
"gif": "image/gif", "webp": "image/webp"}
media_type = media_types.get(suffix, "image/png")
response = client.chat.completions.create(
model="gpt-5.2",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": prompt},
{
"type": "image_url",
"image_url": {
"url": f"data:{media_type};base64,{base64_image}",
"detail": detail,
}
}
]
}
],
max_completion_tokens=1000,
)
return response.choices[0].message.content
def analyze_image_url(image_url: str, prompt: str = "What's in this image?") -> str:
"""Analyze an image from URL."""
response = client.chat.completions.create(
model="gpt-5.2",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": prompt},
{
"type": "image_url",
"image_url": {"url": image_url}
}
]
}
],
max_completion_tokens=1000,
)
return response.choices[0].message.content
if __name__ == "__main__":
# Analyze a local image
result = analyze_image(
"warszawa.jpg",
prompt="Describe what's in the image. Where was this picture taken?"
)
print(result)
Detail Parameter
OpenAI offers a detail parameter that controls image processing:
| Detail | Resolution | Tokens | Use Case |
|---|---|---|---|
| low | 512×512 | 85 tokens | Quick classification, simple questions |
| high | Up to 2048×2048 (tiled) | 85-1,105 tokens | OCR, detailed analysis, small text |
| auto | Model decides | Varies | Default, usually good |
# Low detail for simple questions
analyze_image("photo.jpg", "Is this a cat or a dog?", detail="low")
# High detail for text extraction
analyze_image("document.png", "Extract all text from this image.", detail="high")
4. Practical Pattern: Screenshot-to-Code
One of the most powerful vision applications is generating code from UI screenshots:
"""
Screenshot-to-Code Generator
============================
Turn UI screenshots into HTML/React code.
"""
import base64
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()
client = OpenAI()
SYSTEM_PROMPT = """You are an expert frontend developer. When given a screenshot of a UI,
you generate clean, semantic HTML with Tailwind CSS that recreates the design.
Rules:
1. Use semantic HTML5 elements (header, nav, main, section, footer)
2. Use Tailwind CSS for all styling - no custom CSS
3. Make it responsive (mobile-first)
4. Include realistic placeholder content
5. Add appropriate hover/focus states
6. Use modern design patterns
Output ONLY the HTML code, no explanation."""
def screenshot_to_html(image_path: str) -> str:
"""Convert a screenshot to HTML + Tailwind code."""
with open(image_path, "rb") as f:
image_data = base64.standard_b64encode(f.read()).decode("utf-8")
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": SYSTEM_PROMPT},
{
"role": "user",
"content": [
{
"type": "text",
"text": "Recreate this UI as HTML with Tailwind CSS."
},
{
"type": "image_url",
"image_url": {
"url": f"data:image/png;base64,{image_data}",
"detail": "high" # Need high detail for UI elements
}
}
]
}
],
max_tokens=4000,
temperature=0.2, # Lower temperature for code
)
return response.choices[0].message.content
def screenshot_to_react(image_path: str, component_name: str = "Component") -> str:
"""Convert a screenshot to a React component."""
react_prompt = f"""You are an expert React developer. When given a screenshot of a UI,
you generate a clean React functional component with Tailwind CSS.
Rules:
1. Create a functional component named {component_name}
2. Use TypeScript with proper types
3. Use Tailwind CSS for styling
4. Make it responsive
5. Use realistic placeholder data
6. Include proper accessibility attributes (aria-labels, roles)
7. Export the component as default
Output ONLY the code, no explanation."""
with open(image_path, "rb") as f:
image_data = base64.standard_b64encode(f.read()).decode("utf-8")
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": react_prompt},
{
"role": "user",
"content": [
{"type": "text", "text": "Recreate this UI as a React component."},
{
"type": "image_url",
"image_url": {
"url": f"data:image/png;base64,{image_data}",
"detail": "high"
}
}
]
}
],
max_tokens=4000,
temperature=0.2,
)
return response.choices[0].message.content
if __name__ == "__main__":
# Generate HTML from a screenshot
html = screenshot_to_html("landing_page.png")
with open("output.html", "w") as f:
f.write(f"""<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<script src="https://cdn.tailwindcss.com"></script>
<title>Generated UI</title>
</head>
<body>
{html}
</body>
</html>""")
print("Generated output.html")
The same technique works for:
- Receipt/document scanning: Extract structured data (vendor, items, totals) using Pydantic schemas
- UI analysis: Assess accessibility, UX issues, and design feedback
- Diagram understanding: Parse flowcharts, architecture diagrams, wireframes
Combine vision with JSON mode (response_format={"type": "json_object"}) for reliable structured extraction.
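As a minimal sketch of that combination (the Receipt and LineItem schemas, the field names, and the prompt wording are illustrative assumptions, not a fixed API), a receipt scanner might look like this:

```python
"""
Receipt Scanner (sketch)
========================
Vision + JSON mode + Pydantic validation. The schema below is illustrative;
adjust the fields to whatever your documents actually contain.
"""
import base64
import json

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()


class LineItem(BaseModel):
    description: str
    quantity: float
    price: float


class Receipt(BaseModel):
    vendor: str
    date: str
    items: list[LineItem]
    total: float


def scan_receipt(image_path: str) -> Receipt:
    """Extract structured data from a receipt photo."""
    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": (
                            "Extract the vendor, date (YYYY-MM-DD), line items "
                            "(description, quantity, price), and total from this "
                            "receipt. Respond with JSON only, using exactly those keys."
                        ),
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{image_data}",
                            "detail": "high",  # small printed text needs high detail
                        },
                    },
                ],
            }
        ],
        response_format={"type": "json_object"},  # JSON mode
        max_tokens=1000,
    )

    # Pydantic raises a ValidationError if the model returned a malformed payload
    return Receipt.model_validate(json.loads(response.choices[0].message.content))
```

If validation fails, a simple recovery is to re-prompt with the validation error appended; the schema plus one retry catches most extraction slips.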
5. Cost Optimization
Vision API calls can be expensive. Here's how to optimize:
Image Sizing
from PIL import Image
import io
import base64
def optimize_image_for_vision(
image_path: str,
max_size: int = 1024,
quality: int = 85,
) -> str:
"""
Resize and compress image before sending to API.
Returns base64-encoded optimized image.
"""
with Image.open(image_path) as img:
# Convert to RGB if necessary (handles PNG with alpha)
if img.mode in ("RGBA", "P"):
img = img.convert("RGB")
# Resize if larger than max_size
if max(img.size) > max_size:
ratio = max_size / max(img.size)
new_size = (int(img.size[0] * ratio), int(img.size[1] * ratio))
img = img.resize(new_size, Image.Resampling.LANCZOS)
# Compress to JPEG
buffer = io.BytesIO()
img.save(buffer, format="JPEG", quality=quality, optimize=True)
return base64.standard_b64encode(buffer.getvalue()).decode("utf-8")
# Usage
optimized = optimize_image_for_vision("large_screenshot.png", max_size=1024)
# Much smaller payload, still good for most analysis
When to Use Low vs High Detail
| Task | Detail Setting | Why |
|---|---|---|
| "Is this a cat or dog?" | low | Simple classification |
| "What's the general layout?" | low | Overview questions |
| "Extract all text" | high | OCR needs pixel-level detail |
| "What's the phone number?" | high | Small text extraction |
| "Describe the colors" | low | Color doesn't need high res |
| "Count the items" | high | Precision matters |
Token Estimation
OpenAI charges per image based on size and detail:
| Detail | Image Size | Tokens |
|---|---|---|
| low | Any | 85 tokens |
| high | ≤ 512×512 | 85 + 170 = 255 tokens |
| high | 1024×1024 | 85 + (170 × 4) = 765 tokens |
| high | 2048×4096 | 85 + (170 × 6) = 1,105 tokens |
At high detail, the image is first scaled to fit within 2048×2048, then scaled so its shortest side is 768px, and billed at 170 tokens per 512px tile plus an 85-token base.
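If you want to budget before sending anything, you can approximate this accounting yourself. A sketch assuming the 85-token base and 170-tokens-per-tile rates above (rates vary by model, so verify the constants against current pricing):

```python
import math


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Approximate vision token cost for one image.

    Assumes the accounting described above: 85 tokens flat for low detail;
    for high detail, scale to fit 2048x2048, then scale the shortest side
    down to 768px, then charge 170 tokens per 512px tile plus an 85 base.
    """
    if detail == "low":
        return 85

    # Fit within a 2048 x 2048 square
    if max(width, height) > 2048:
        scale = 2048 / max(width, height)
        width, height = int(width * scale), int(height * scale)

    # Scale so the shortest side is at most 768px
    if min(width, height) > 768:
        scale = 768 / min(width, height)
        width, height = int(width * scale), int(height * scale)

    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles


print(estimate_image_tokens(1024, 1024))  # 765
print(estimate_image_tokens(2048, 4096))  # 1105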
6. Common Pitfalls
| Symptom | Cause | Fix |
|---|---|---|
| "I cannot see the image" | Wrong base64 encoding | Check media type prefix |
| Text extraction misses words | Image too small or low detail | Use detail: "high", increase resolution |
| Slow responses | Very large images | Resize before sending |
| High costs | Always using high detail | Use low for simple questions |
| URL images fail | Not publicly accessible | Use base64 for private images |
| Colors described wrong | Model limitation | Ask targeted questions about specific regions |
| Counts are wrong | Vision models struggle with counting | Use traditional computer vision (e.g., object detection) for counting |
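For the first pitfall, the usual culprit is a data-URL prefix that doesn't match the file. A small sketch of a helper (the to_data_url name and the PNG fallback are assumptions) that derives the media type with Python's standard mimetypes module:

```python
import base64
import mimetypes


def to_data_url(image_path: str) -> str:
    """Build a data URL whose media type matches the file extension.

    A mismatched prefix is a common cause of "I cannot see the image"
    responses, so derive it rather than hard-coding image/png everywhere.
    """
    media_type, _ = mimetypes.guess_type(image_path)
    if media_type is None or not media_type.startswith("image/"):
        media_type = "image/png"  # assumption: PNG as a fallback default

    with open(image_path, "rb") as f:
        encoded = base64.standard_b64encode(f.read()).decode("utf-8")

    return f"data:{media_type};base64,{encoded}"
```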
7. Key Takeaways
- Base64 for uploads, URLs for public images. Match encoding to your use case.
- Use detail: "high" for OCR. Small text needs high resolution.
- Resize before sending. 1024px is usually enough; larger just costs more.
- Vision + JSON mode = structured extraction. Combine for best results.
- Models can't count reliably. Don't trust "how many X" questions.
- Abstract across providers. Build interfaces, swap implementations (a sketch follows below).
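On that last point, a thin interface keeps provider-specific payload formats out of your application code. A sketch using typing.Protocol (the VisionProvider and OpenAIVision names are illustrative, not from any SDK):

```python
from typing import Protocol


class VisionProvider(Protocol):
    """Anything that can answer a question about one image."""

    def analyze(self, image_base64: str, media_type: str, prompt: str) -> str:
        ...


class OpenAIVision:
    """Adapter that satisfies VisionProvider using the OpenAI SDK."""

    def __init__(self, client, model: str = "gpt-4o"):
        self.client = client
        self.model = model

    def analyze(self, image_base64: str, media_type: str, prompt: str) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:{media_type};base64,{image_base64}"},
                    },
                ],
            }],
        )
        return response.choices[0].message.content


# Application code depends only on the Protocol, so a Claude or Gemini
# adapter with the same analyze() signature can be swapped in later.
def describe(provider: VisionProvider, image_base64: str, media_type: str) -> str:
    return provider.analyze(image_base64, media_type, "What's in this image?")
```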
8. What's Next
You've completed the core API integration patterns: streaming chat, structured extraction, and vision. In Part 3: Production & Operations, you'll learn to deploy these features reliably—handling errors, managing costs, implementing security, and monitoring production AI systems.
9. Additional Resources
- OpenAI Vision Guide — Official documentation
- Anthropic Vision Docs — Claude vision capabilities
- Gemini Vision Tutorial — Google's vision guide