Lesson 7: Vision & Multimodal Inputs

Topics Covered
  • Multimodal Models: What they can (and can't) see.
  • Image Encoding: Base64 vs URLs—when to use which.
  • Provider APIs: OpenAI, Claude, and Gemini vision side-by-side.
  • Practical Patterns: Screenshot analysis, document OCR, diagram understanding.
  • Cost Optimization: Image sizing, tiling, and token budgets.
  • Building Tools: Screenshot-to-code, receipt scanner, UI analyzer.

Your LLM can read. Now it's time to teach it to see. Modern models like OpenAI's GPT-5, Anthropic's Claude, and Google's Gemini can process images alongside text, opening up use cases from document scanning to UI analysis to visual Q&A. In this lesson, you'll learn to build applications that truly understand what they're looking at.

1. What Vision Models Can (and Can't) Do

Modern multimodal models can:

| Capability | Examples |
| --- | --- |
| Describe images | "A sunset over mountains with orange and purple clouds" |
| Read text (OCR) | Extract text from screenshots, documents, signs |
| Analyze charts | Interpret bar charts, line graphs, pie charts |
| Understand diagrams | Read flowcharts, architecture diagrams, wireframes |
| Identify objects | "There are 3 people and 2 dogs in this photo" |
| Compare images | Spot differences between two screenshots |
| Answer questions | "What color is the car?" → "Red" |

What They Can't Do (Well):

| Limitation | Details |
| --- | --- |
| Precise counting | "How many windows?" is often answered wrong for complex images |
| Spatial reasoning | "Is the cup to the left or right of the plate?" can be unreliable |
| Small text | Text under ~12px is often missed or misread |
| Handwriting | Messy handwriting recognition is hit-or-miss |
| Faces | Models intentionally limit facial recognition for privacy |
| Real-time video | Images only; no native video understanding (yet) |

2. Image Encoding: Base64 vs URLs

There are two ways to send images to LLMs:

Option 1: Base64 Encoding

Embed the image data directly in your request.

import base64

def encode_image_to_base64(image_path: str) -> str:
    """Convert an image file to base64 string."""
    with open(image_path, "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8")

# Usage
image_data = encode_image_to_base64("screenshot.png")
# Returns: "iVBORw0KGgoAAAANSUhEUgAA..."

Pros:

  • Works with local files
  • No external dependencies
  • Image is guaranteed to be available

Cons:

  • Increases request size significantly
  • Slower uploads for large images
  • Base64 adds ~33% overhead
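
The ~33% figure falls out of how base64 works: every 3 bytes of image data become 4 characters of text. A quick check on a local file of your own (the filename below is just a placeholder):

import base64
from pathlib import Path

raw = Path("screenshot.png").read_bytes()   # any local image
encoded = base64.standard_b64encode(raw)

print(f"raw bytes:    {len(raw):,}")
print(f"base64 bytes: {len(encoded):,}")
print(f"overhead:     {len(encoded) / len(raw) - 1:.0%}")   # roughly 33%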

Option 2: URL Reference

Point to an image hosted online.

image_url = "https://example.com/image.png"
# Model fetches the image directly

Pros:

  • Smaller request payload
  • Faster for already-hosted images
  • Good for public images

Cons:

  • URL must be publicly accessible
  • Image might change or disappear
  • Some models have URL restrictions

When to Use Which

| Scenario | Recommended |
| --- | --- |
| User uploads in your app | Base64 |
| Processing local files | Base64 |
| Referencing public images | URL |
| Reproducible/archived requests | Base64 |
| Large batch processing | URL (if hosted) |
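
If you want a single code path that handles both cases, a small helper like the sketch below can build the image part of a message from either a URL or a local file. The helper name and the PNG default are illustrative assumptions, not part of the lesson's code:

import base64

def image_content_part(source: str, detail: str = "auto") -> dict:
    """Build an OpenAI-style image_url content part from a URL or a local file."""
    if source.startswith(("http://", "https://")):
        url = source  # public image: let the API fetch it by URL
    else:
        with open(source, "rb") as f:  # local file: embed as a base64 data URL
            encoded = base64.standard_b64encode(f.read()).decode("utf-8")
        url = f"data:image/png;base64,{encoded}"  # assumes PNG; adjust the media type as needed
    return {"type": "image_url", "image_url": {"url": url, "detail": detail}}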

3. OpenAI Vision API

OpenAI's GPT-4o and GPT-4o-mini both support vision.

Basic Image Analysis

openai_vision.py
"""
OpenAI Vision API
=================
Send images to GPT for analysis.
"""

import base64
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

client = OpenAI()


def encode_image(image_path: str) -> str:
    """Encode image to base64."""
    with open(image_path, "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8")


def analyze_image(
    image_path: str,
    prompt: str = "What's in this image?",
    detail: str = "auto",  # "low", "high", or "auto"
) -> str:
    """
    Analyze a local image with an OpenAI vision-capable model.

    Args:
        image_path: Path to image file
        prompt: Question or instruction about the image
        detail: Image quality ("low" = faster/cheaper, "high" = better quality)
    """
    base64_image = encode_image(image_path)

    # Detect media type from the file extension
    suffix = image_path.lower().split(".")[-1]
    media_types = {"png": "image/png", "jpg": "image/jpeg", "jpeg": "image/jpeg",
                   "gif": "image/gif", "webp": "image/webp"}
    media_type = media_types.get(suffix, "image/png")

    response = client.chat.completions.create(
        model="gpt-5.2",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:{media_type};base64,{base64_image}",
                            "detail": detail,
                        },
                    },
                ],
            }
        ],
        max_completion_tokens=1000,
    )

    return response.choices[0].message.content


def analyze_image_url(image_url: str, prompt: str = "What's in this image?") -> str:
    """Analyze an image hosted at a public URL."""
    response = client.chat.completions.create(
        model="gpt-5.2",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        max_completion_tokens=1000,
    )

    return response.choices[0].message.content


if __name__ == "__main__":
    # Analyze a local image
    result = analyze_image(
        "warszawa.jpg",
        prompt="Describe what's in the image. Where was this picture taken?",
    )
    print(result)

Detail Parameter

OpenAI offers a detail parameter that controls image processing:

| Detail | Resolution | Tokens | Use Case |
| --- | --- | --- | --- |
| low | 512×512 | 85 tokens | Quick classification, simple questions |
| high | Up to 2048×2048 (tiled) | 255-1,105 tokens | OCR, detailed analysis, small text |
| auto | Model decides | Varies | Default, usually good |

# Low detail for simple questions
analyze_image("photo.jpg", "Is this a cat or a dog?", detail="low")

# High detail for text extraction
analyze_image("document.png", "Extract all text from this image.", detail="high")

4. Practical Pattern: Screenshot-to-Code

One of the most powerful vision applications is generating code from UI screenshots:

screenshot_to_code.py
"""
Screenshot-to-Code Generator
============================
Turn UI screenshots into HTML/React code.
"""

import base64
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

client = OpenAI()


SYSTEM_PROMPT = """You are an expert frontend developer. When given a screenshot of a UI,
you generate clean, semantic HTML with Tailwind CSS that recreates the design.

Rules:
1. Use semantic HTML5 elements (header, nav, main, section, footer)
2. Use Tailwind CSS for all styling - no custom CSS
3. Make it responsive (mobile-first)
4. Include realistic placeholder content
5. Add appropriate hover/focus states
6. Use modern design patterns

Output ONLY the HTML code, no explanation."""


def screenshot_to_html(image_path: str) -> str:
    """Convert a screenshot to HTML + Tailwind code."""
    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Recreate this UI as HTML with Tailwind CSS.",
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{image_data}",
                            "detail": "high",  # Need high detail for UI elements
                        },
                    },
                ],
            },
        ],
        max_tokens=4000,
        temperature=0.2,  # Lower temperature for code
    )

    return response.choices[0].message.content


def screenshot_to_react(image_path: str, component_name: str = "Component") -> str:
    """Convert a screenshot to a React component."""
    react_prompt = f"""You are an expert React developer. When given a screenshot of a UI,
you generate a clean React functional component with Tailwind CSS.

Rules:
1. Create a functional component named {component_name}
2. Use TypeScript with proper types
3. Use Tailwind CSS for styling
4. Make it responsive
5. Use realistic placeholder data
6. Include proper accessibility attributes (aria-labels, roles)
7. Export the component as default

Output ONLY the code, no explanation."""

    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": react_prompt},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Recreate this UI as a React component."},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{image_data}",
                            "detail": "high",
                        },
                    },
                ],
            },
        ],
        max_tokens=4000,
        temperature=0.2,
    )

    return response.choices[0].message.content


if __name__ == "__main__":
    # Generate HTML from a screenshot
    html = screenshot_to_html("landing_page.png")

    with open("output.html", "w") as f:
        f.write(f"""<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<script src="https://cdn.tailwindcss.com"></script>
<title>Generated UI</title>
</head>
<body>
{html}
</body>
</html>""")

    print("Generated output.html")

Other Vision Patterns

The same technique works for:

  • Receipt/document scanning: Extract structured data (vendor, items, totals) using Pydantic schemas
  • UI analysis: Assess accessibility, UX issues, and design feedback
  • Diagram understanding: Parse flowcharts, architecture diagrams, wireframes

Combine vision with JSON mode (response_format={"type": "json_object"}) for reliable structured extraction.
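
As an illustration, here is a minimal receipt-scanning sketch that combines an image with JSON mode. The field names in the prompt are assumptions chosen for the example, not a fixed schema:

import base64
import json
from openai import OpenAI

client = OpenAI()

def scan_receipt(image_path: str) -> dict:
    """Extract vendor, line items, and total from a receipt photo as JSON."""
    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},  # force a valid JSON response
        messages=[
            {"role": "system",
             "content": "Return JSON with keys: vendor, date, items (name, price), total."},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Extract the receipt data as JSON."},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{image_data}",
                            "detail": "high",  # small printed text needs high detail
                        },
                    },
                ],
            },
        ],
    )
    return json.loads(response.choices[0].message.content)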

5. Cost Optimization

Vision API calls can be expensive. Here's how to optimize:

Image Sizing

from PIL import Image
import io
import base64


def optimize_image_for_vision(
    image_path: str,
    max_size: int = 1024,
    quality: int = 85,
) -> str:
    """
    Resize and compress image before sending to API.

    Returns base64-encoded optimized image.
    """
    with Image.open(image_path) as img:
        # Convert to RGB if necessary (handles PNG with alpha)
        if img.mode in ("RGBA", "P"):
            img = img.convert("RGB")

        # Resize if larger than max_size
        if max(img.size) > max_size:
            ratio = max_size / max(img.size)
            new_size = (int(img.size[0] * ratio), int(img.size[1] * ratio))
            img = img.resize(new_size, Image.Resampling.LANCZOS)

        # Compress to JPEG
        buffer = io.BytesIO()
        img.save(buffer, format="JPEG", quality=quality, optimize=True)

    return base64.standard_b64encode(buffer.getvalue()).decode("utf-8")


# Usage
optimized = optimize_image_for_vision("large_screenshot.png", max_size=1024)
# Much smaller payload, still good for most analysis

When to Use Low vs High Detail

| Task | Detail Setting | Why |
| --- | --- | --- |
| "Is this a cat or dog?" | low | Simple classification |
| "What's the general layout?" | low | Overview questions |
| "Extract all text" | high | OCR needs pixel-level detail |
| "What's the phone number?" | high | Small text extraction |
| "Describe the colors" | low | Color doesn't need high res |
| "Count the items" | high | Precision matters |

Token Estimation

OpenAI charges per image based on size and detail:

| Detail | Image Size | Tokens |
| --- | --- | --- |
| low | Any | 85 tokens |
| high | ≤ 512×512 | 85 + 170 = 255 tokens |
| high | 1024×1024 | 85 + (170 × 4) = 765 tokens |
| high | 2048×4096 | 85 + (170 × 6) = 1,105 tokens |
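
If you want to budget ahead of time, the numbers above can be approximated in code. The sketch below follows the scale-then-tile rule OpenAI describes for GPT-4o-class vision pricing (85 base tokens plus 170 per 512×512 tile after scaling); treat it as an estimate rather than exact billing, since accounting can differ between models:

import math

def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Rough per-image token estimate (OpenAI GPT-4o-style accounting)."""
    if detail == "low":
        return 85

    # Scale down to fit within 2048x2048 (never upscale).
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale

    # Scale down so the shortest side is at most 768px.
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale

    # 85 base tokens plus 170 per 512x512 tile.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles

print(estimate_image_tokens(512, 512))     # 255
print(estimate_image_tokens(1024, 1024))   # 765
print(estimate_image_tokens(2048, 4096))   # 1105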

6. Common Pitfalls

| Symptom | Cause | Fix |
| --- | --- | --- |
| "I cannot see the image" | Wrong base64 encoding | Check the media type prefix |
| Text extraction misses words | Image too small or low detail | Use detail: "high", increase resolution |
| Slow responses | Very large images | Resize before sending |
| High costs | Always using high detail | Use low for simple questions |
| URL images fail | Not publicly accessible | Use base64 for private images |
| Colors described wrong | Model limitation | Ask specific, targeted color questions |
| Counts are wrong | Vision models struggle with counting | Use traditional computer vision for counting |

7. Key Takeaways

  1. Base64 for uploads, URLs for public images. Match encoding to your use case.

  2. Use detail: "high" for OCR. Small text needs high resolution.

  3. Resize before sending. 1024px is usually enough; larger just costs more.

  4. Vision + JSON mode = structured extraction. Combine for best results.

  5. Models can't count reliably. Don't trust "how many X" questions.

  6. Abstract across providers. Build interfaces, swap implementations.
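
For that last point, here is a minimal sketch of such an interface, using only the OpenAI client shown in this lesson; Claude and Gemini implementations would plug in behind the same Protocol:

from typing import Protocol
from openai import OpenAI

class VisionProvider(Protocol):
    """Anything that can answer a prompt about a base64-encoded image."""
    def analyze(self, image_b64: str, prompt: str) -> str: ...

class OpenAIVision:
    """OpenAI-backed implementation; a Claude or Gemini class would mirror it."""
    def __init__(self) -> None:
        self.client = OpenAI()

    def analyze(self, image_b64: str, prompt: str) -> str:
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                ],
            }],
        )
        return response.choices[0].message.content

def describe(provider: VisionProvider, image_b64: str) -> str:
    """Calling code depends only on the interface, not on any one vendor."""
    return provider.analyze(image_b64, "Describe this image.")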

8. What's Next

You've completed the core API integration patterns: streaming chat, structured extraction, and vision. In Part 3: Production & Operations, you'll learn to deploy these features reliably—handling errors, managing costs, implementing security, and monitoring production AI systems.

9. Additional Resources