Lesson 7: Vision & Multimodal Inputs
- Multimodal Models: What they can (and can't) see.
- Image Encoding: Base64 vs URLs—when to use which.
- Provider APIs: OpenAI, Claude, and Gemini vision side-by-side.
- Practical Patterns: Screenshot analysis, document OCR, diagram understanding.
- Cost Optimization: Image sizing, tiling, and token budgets.
- Building Tools: Screenshot-to-code, receipt scanner, UI analyzer.
Your LLM can read. Now it's time to teach it to see. Modern models like OpenAI's GPT-5, Anthropic's Claude, and Google's Gemini can process images alongside text, opening up use cases from document scanning to UI analysis to visual Q&A. In this lesson, you'll learn to build applications that truly understand what they're looking at.
1. What Vision Models Can (and Can't) Do
Modern multimodal models can:
| Capability | Examples |
|---|---|
| Describe images | "A sunset over mountains with orange and purple clouds" |
| Read text (OCR) | Extract text from screenshots, documents, signs |
| Analyze charts | Interpret bar charts, line graphs, pie charts |
| Understand diagrams | Read flowcharts, architecture diagrams, wireframes |
| Identify objects | "There are 3 people and 2 dogs in this photo" |
| Compare images | Spot differences between two screenshots |
| Answer questions | "What color is the car?" → "Red" |
What They Can't Do (Well):
| Limitation | Details |
|---|---|
| Precise counting | "How many windows?" often wrong for complex images |
| Spatial reasoning | "Is the cup to the left or right of the plate?" can be unreliable |
| Small text | Text under ~12px is often missed or misread |
| Handwriting | Messy handwriting recognition is hit-or-miss |
| Faces | Models intentionally limit facial recognition for privacy |
| Real-time video | Mostly image-only; Gemini accepts video files, but other APIs require you to sample frames yourself |
2. Image Encoding: Base64 vs URLs
There are two ways to send images to LLMs:
Option 1: Base64 Encoding
Embed the image data directly in your request.
import base64
def encode_image_to_base64(image_path: str) -> str:
"""Convert an image file to base64 string."""
with open(image_path, "rb") as f:
return base64.standard_b64encode(f.read()).decode("utf-8")
# Usage
image_data = encode_image_to_base64("screenshot.png")
# Returns: "iVBORw0KGgoAAAANSUhEUgAA..."
Pros:
- Works with local files
- No external dependencies
- Image is guaranteed to be available
Cons:
- Increases request size significantly
- Slower uploads for large images
- Base64 adds ~33% overhead
Option 2: URL Reference
Point to an image hosted online.
image_url = "https://example.com/image.png"
# Model fetches the image directly
Pros:
- Smaller request payload
- Faster for already-hosted images
- Good for public images
Cons:
- URL must be publicly accessible
- Image might change or disappear
- Some models have URL restrictions
When to Use Which
| Scenario | Recommended |
|---|---|
| User uploads in your app | Base64 |
| Processing local files | Base64 |
| Referencing public images | URL |
| Reproducible/archived requests | Base64 |
| Large batch processing | URL (if hosted) |
3. OpenAI Vision API
OpenAI's multimodal chat models, including GPT-4o, GPT-4o-mini, and the newer GPT-5 family, all accept image input.
Basic Image Analysis
"""
OpenAI Vision API
=================
Send images to GPT for analysis.
"""
import base64
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()
client = OpenAI()
def encode_image(image_path: str) -> str:
"""Encode image to base64."""
with open(image_path, "rb") as f:
return base64.standard_b64encode(f.read()).decode("utf-8")
def analyze_image(
image_path: str,
prompt: str = "What's in this image?",
detail: str = "auto", # "low", "high", or "auto"
) -> str:
"""
Analyze an image with GPT-4o.
Args:
image_path: Path to image file
prompt: Question or instruction about the image
detail: Image quality ("low" = faster/cheaper, "high" = better quality)
"""
base64_image = encode_image(image_path)
# Detect media type
suffix = image_path.lower().split(".")[-1]
media_types = {"png": "image/png", "jpg": "image/jpeg", "jpeg": "image/jpeg",
"gif": "image/gif", "webp": "image/webp"}
media_type = media_types.get(suffix, "image/png")
response = client.chat.completions.create(
model="gpt-5.2",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": prompt},
{
"type": "image_url",
"image_url": {
"url": f"data:{media_type};base64,{base64_image}",
"detail": detail,
}
}
]
}
],
max_completion_tokens=1000,
)
return response.choices[0].message.content
def analyze_image_url(image_url: str, prompt: str = "What's in this image?") -> str:
"""Analyze an image from URL."""
response = client.chat.completions.create(
model="gpt-5.2",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": prompt},
{
"type": "image_url",
"image_url": {"url": image_url}
}
]
}
],
max_completion_tokens=1000,
)
return response.choices[0].message.content
if __name__ == "__main__":
# Analyze a local image
result = analyze_image(
"warszawa.jpg",
prompt="Describe what's in the image. Where was this picture taken?"
)
print(result)
Detail Parameter
OpenAI offers a detail parameter that controls image processing:
| Detail | Resolution | Tokens | Use Case |
|---|---|---|---|
| low | 512×512 | 85 tokens | Quick classification, simple questions |
| high | Up to 2048×2048 (tiled) | 85-1,105 tokens | OCR, detailed analysis, small text |
| auto | Model decides | Varies | Default, usually good |
# Low detail for simple questions
analyze_image("photo.jpg", "Is this a cat or a dog?", detail="low")
# High detail for text extraction
analyze_image("document.png", "Extract all text from this image.", detail="high")
4. Practical Pattern: Screenshot-to-Code
One of the most powerful vision applications is generating code from UI screenshots:
"""
Screenshot-to-Code Generator
============================
Turn UI screenshots into HTML/React code.
"""
import base64
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()
client = OpenAI()
SYSTEM_PROMPT = """You are an expert frontend developer. When given a screenshot of a UI,
you generate clean, semantic HTML with Tailwind CSS that recreates the design.
Rules:
1. Use semantic HTML5 elements (header, nav, main, section, footer)
2. Use Tailwind CSS for all styling - no custom CSS
3. Make it responsive (mobile-first)
4. Include realistic placeholder content
5. Add appropriate hover/focus states
6. Use modern design patterns
Output ONLY the HTML code, no explanation."""
def screenshot_to_html(image_path: str) -> str:
"""Convert a screenshot to HTML + Tailwind code."""
with open(image_path, "rb") as f:
image_data = base64.standard_b64encode(f.read()).decode("utf-8")
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": SYSTEM_PROMPT},
{
"role": "user",
"content": [
{
"type": "text",
"text": "Recreate this UI as HTML with Tailwind CSS."
},
{
"type": "image_url",
"image_url": {
"url": f"data:image/png;base64,{image_data}",
"detail": "high" # Need high detail for UI elements
}
}
]
}
],
max_tokens=4000,
temperature=0.2, # Lower temperature for code
)
return response.choices[0].message.content
def screenshot_to_react(image_path: str, component_name: str = "Component") -> str:
"""Convert a screenshot to a React component."""
react_prompt = f"""You are an expert React developer. When given a screenshot of a UI,
you generate a clean React functional component with Tailwind CSS.
Rules:
1. Create a functional component named {component_name}
2. Use TypeScript with proper types
3. Use Tailwind CSS for styling
4. Make it responsive
5. Use realistic placeholder data
6. Include proper accessibility attributes (aria-labels, roles)
7. Export the component as default
Output ONLY the code, no explanation."""
with open(image_path, "rb") as f:
image_data = base64.standard_b64encode(f.read()).decode("utf-8")
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": react_prompt},
{
"role": "user",
"content": [
{"type": "text", "text": "Recreate this UI as a React component."},
{
"type": "image_url",
"image_url": {
"url": f"data:image/png;base64,{image_data}",
"detail": "high"
}
}
]
}
],
max_tokens=4000,
temperature=0.2,
)
return response.choices[0].message.content
if __name__ == "__main__":
# Generate HTML from a screenshot
html = screenshot_to_html("landing_page.png")
with open("output.html", "w") as f:
f.write(f"""<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<script src="https://cdn.tailwindcss.com"></script>
<title>Generated UI</title>
</head>
<body>
{html}
</body>
</html>""")
print("Generated output.html")
The same technique works for:
- Receipt/document scanning: Extract structured data (vendor, items, totals) using Pydantic schemas
- UI analysis: Assess accessibility, UX issues, and design feedback
- Diagram understanding: Parse flowcharts, architecture diagrams, wireframes
Combine vision with JSON mode (response_format={"type": "json_object"}) for reliable structured extraction.
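As a minimal sketch of that combination (the Receipt and LineItem schemas, the field names, and the prompt wording are illustrative assumptions, not a fixed API), a receipt scanner might look like this:

```python
"""
Receipt Scanner (sketch)
========================
Vision + JSON mode + Pydantic validation. The schema below is illustrative;
adjust the fields to whatever your documents actually contain.
"""
import base64
import json

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()


class LineItem(BaseModel):
    description: str
    quantity: float
    price: float


class Receipt(BaseModel):
    vendor: str
    date: str
    items: list[LineItem]
    total: float


def scan_receipt(image_path: str) -> Receipt:
    """Extract structured data from a receipt photo."""
    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": (
                            "Extract the vendor, date (YYYY-MM-DD), line items "
                            "(description, quantity, price), and total from this "
                            "receipt. Respond with JSON only, using exactly those keys."
                        ),
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{image_data}",
                            "detail": "high",  # small printed text needs high detail
                        },
                    },
                ],
            }
        ],
        response_format={"type": "json_object"},  # JSON mode
        max_tokens=1000,
    )

    # Pydantic raises a ValidationError if the model returned a malformed payload
    return Receipt.model_validate(json.loads(response.choices[0].message.content))
```

If validation fails, a simple recovery is to re-prompt with the validation error appended; the schema plus one retry catches most extraction slips.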
5. Cost Optimization
Vision API calls can be expensive. Here's how to optimize:
Image Sizing
from PIL import Image
import io
import base64
def optimize_image_for_vision(
image_path: str,
max_size: int = 1024,
quality: int = 85,
) -> str:
"""
Resize and compress image before sending to API.
Returns base64-encoded optimized image.
"""
with Image.open(image_path) as img:
# Convert to RGB if necessary (handles PNG with alpha)
if img.mode in ("RGBA", "P"):
img = img.convert("RGB")
# Resize if larger than max_size
if max(img.size) > max_size:
ratio = max_size / max(img.size)
new_size = (int(img.size[0] * ratio), int(img.size[1] * ratio))
img = img.resize(new_size, Image.Resampling.LANCZOS)
# Compress to JPEG
buffer = io.BytesIO()
img.save(buffer, format="JPEG", quality=quality, optimize=True)
return base64.standard_b64encode(buffer.getvalue()).decode("utf-8")
# Usage
optimized = optimize_image_for_vision("large_screenshot.png", max_size=1024)
# Much smaller payload, still good for most analysis
When to Use Low vs High Detail
| Task | Detail Setting | Why |
|---|---|---|
| "Is this a cat or dog?" | low | Simple classification |
| "What's the general layout?" | low | Overview questions |
| "Extract all text" | high | OCR needs pixel-level detail |
| "What's the phone number?" | high | Small text extraction |
| "Describe the colors" | low | Color doesn't need high res |
| "Count the items" | high | Precision matters |
Token Estimation
OpenAI charges per image based on size and detail:
| Detail | Image Size | Tokens |
|---|---|---|
| low | Any | 85 tokens |
| high | ≤ 512×512 | 85 + 170 = 255 tokens |
| high | 1024×1024 | 85 + (170 × 4) = 765 tokens |
| high | 2048×4096 | 85 + (170 × 6) = 1,105 tokens |
At high detail, the image is first scaled to fit within 2048×2048, then scaled so its shortest side is 768px, and billed at 170 tokens per 512px tile plus an 85-token base.
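If you want to budget before sending anything, you can approximate this accounting yourself. A sketch assuming the 85-token base and 170-tokens-per-tile rates above (rates vary by model, so verify the constants against current pricing):

```python
import math


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Approximate vision token cost for one image.

    Assumes the accounting described above: 85 tokens flat for low detail;
    for high detail, scale to fit 2048x2048, then scale the shortest side
    down to 768px, then charge 170 tokens per 512px tile plus an 85 base.
    """
    if detail == "low":
        return 85

    # Fit within a 2048 x 2048 square
    if max(width, height) > 2048:
        scale = 2048 / max(width, height)
        width, height = int(width * scale), int(height * scale)

    # Scale so the shortest side is at most 768px
    if min(width, height) > 768:
        scale = 768 / min(width, height)
        width, height = int(width * scale), int(height * scale)

    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles


print(estimate_image_tokens(1024, 1024))  # 765
print(estimate_image_tokens(2048, 4096))  # 1105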
6. Common Pitfalls
| Symptom | Cause | Fix |
|---|---|---|
| "I cannot see the image" | Wrong base64 encoding | Check media type prefix |
| Text extraction misses words | Image too small or low detail | Use detail: "high", increase resolution |
| Slow responses | Very large images | Resize before sending |
| High costs | Always using high detail | Use low for simple questions |
| URL images fail | Not publicly accessible | Use base64 for private images |
| Colors described wrong | Model limitation | Ask targeted questions about specific regions |
| Counts are wrong | Vision models struggle with counting | Use traditional computer vision (e.g., object detection) for counting |
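For the first pitfall, the usual culprit is a data-URL prefix that doesn't match the file. A small sketch of a helper (the to_data_url name and the PNG fallback are assumptions) that derives the media type with Python's standard mimetypes module:

```python
import base64
import mimetypes


def to_data_url(image_path: str) -> str:
    """Build a data URL whose media type matches the file extension.

    A mismatched prefix is a common cause of "I cannot see the image"
    responses, so derive it rather than hard-coding image/png everywhere.
    """
    media_type, _ = mimetypes.guess_type(image_path)
    if media_type is None or not media_type.startswith("image/"):
        media_type = "image/png"  # assumption: PNG as a fallback default

    with open(image_path, "rb") as f:
        encoded = base64.standard_b64encode(f.read()).decode("utf-8")

    return f"data:{media_type};base64,{encoded}"
```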
7. Key Takeaways
- Base64 for uploads, URLs for public images. Match encoding to your use case.
- Use detail: "high" for OCR. Small text needs high resolution.
- Resize before sending. 1024px is usually enough; larger just costs more.
- Vision + JSON mode = structured extraction. Combine for best results.
- Models can't count reliably. Don't trust "how many X" questions.
- Abstract across providers. Build interfaces, swap implementations (a sketch follows below).
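On that last point, a thin interface keeps provider-specific payload formats out of your application code. A sketch using typing.Protocol (the VisionProvider and OpenAIVision names are illustrative, not from any SDK):

```python
from typing import Protocol


class VisionProvider(Protocol):
    """Anything that can answer a question about one image."""

    def analyze(self, image_base64: str, media_type: str, prompt: str) -> str:
        ...


class OpenAIVision:
    """Adapter that satisfies VisionProvider using the OpenAI SDK."""

    def __init__(self, client, model: str = "gpt-4o"):
        self.client = client
        self.model = model

    def analyze(self, image_base64: str, media_type: str, prompt: str) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:{media_type};base64,{image_base64}"},
                    },
                ],
            }],
        )
        return response.choices[0].message.content


# Application code depends only on the Protocol, so a Claude or Gemini
# adapter with the same analyze() signature can be swapped in later.
def describe(provider: VisionProvider, image_base64: str, media_type: str) -> str:
    return provider.analyze(image_base64, media_type, "What's in this image?")
```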
8. What's Next
You've completed the core API integration patterns: streaming chat, structured extraction, and vision. In Part 3: Production & Operations, you'll learn to deploy these features reliably—handling errors, managing costs, implementing security, and monitoring production AI systems.
9. Additional Resources
- OpenAI Vision Guide — Official documentation
- Anthropic Vision Docs — Claude vision capabilities
- Gemini Vision Tutorial — Google's vision guide