Lesson 6: Structured Data Extraction
- The Problem: Why free-text responses break production systems.
- Document Preprocessing: Using MarkItDown to convert files to LLM-friendly Markdown.
- JSON Mode: Forcing valid JSON output from LLMs.
- Schema Definition: Describing exactly what structure you want.
- Validation: Pydantic for type safety and error handling.
- Retry Strategies: Handling malformed output gracefully.
- Real-World Pipeline: From raw documents to structured data.
Your LLM returns beautiful prose. Your database expects {"name": "string", "amount": number}. Something has to give. In this lesson, you'll learn to reliably extract structured data from LLMs—the foundation of every AI-powered automation.
1. The Structured Output Problem
LLMs are trained to generate natural language. When you ask for JSON, you might get:
Prompt: "Extract the person's name and age from: 'John is 25 years old'"
❌ Bad outputs you might receive:
- "The person's name is John and they are 25 years old."
- "Name: John\nAge: 25"
- "```json\n{\"name\": \"John\", \"age\": 25}\n```"
- "{name: John, age: 25}" // Invalid JSON (unquoted keys)
- "{"name": "John", "age": "25"}" // Age is string, not number
The core tension: LLMs want to be helpful and conversational. Your code wants exact, parseable data.
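To see the failure concretely, here is a minimal sketch of what a naive json.loads call does with outputs like these:

import json

bad_outputs = [
    "The person's name is John and they are 25 years old.",  # prose
    "Name: John\nAge: 25",                                   # key-value text
    "{name: John, age: 25}",                                 # unquoted keys
]

for output in bad_outputs:
    try:
        json.loads(output)
    except json.JSONDecodeError as exc:
        print(f"Parse failed: {exc.msg}")

# Even output that parses can still be wrong:
parsed = json.loads('{"name": "John", "age": "25"}')
print(type(parsed["age"]))  # <class 'str'>: parsed["age"] + 1 raises TypeError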
2. Document Preprocessing with MarkItDown
Before you can extract structured data, you need to get the raw content into a format LLMs can process. Docling (by IBM) or MarkItDown (by Microsoft) convert virtually any document format into clean Markdown—the format LLMs understand best.
The code examples in this lesson use MarkItDown.
Why Markdown?
LLMs like GPT-4 and Claude were trained on massive amounts of Markdown. They understand headings, lists, tables, and links natively. Markdown is also token-efficient; it has a minimal markup overhead compared to HTML or XML.
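As a rough illustration of that overhead, compare the same one-row table in both formats (illustrative only; exact token counts vary by tokenizer):

HTML:

<table>
  <tr><th>Qty</th><th>Price</th></tr>
  <tr><td>1</td><td>$5,000.00</td></tr>
</table>

Markdown:

| Qty | Price |
|---|---|
| 1 | $5,000.00 |

The Markdown version carries the same data with a fraction of the markup.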
The extraction pipeline: File → MarkItDown → Markdown → LLM → Pydantic → Typed Object.
Supported Formats
MarkItDown handles an impressive range of formats:
| Category | Formats |
|---|---|
| Documents | PDF, Word (.docx), PowerPoint (.pptx), Excel (.xlsx, .xls) |
| Web | HTML, XML, RSS |
| Data | CSV, JSON, JSONL |
| Media | Images (with OCR), Audio (with transcription) |
| Archives | ZIP files (extracts and converts contents) |
| Other | YouTube URLs (transcripts), EPub, Outlook (.msg) |
Setup
# Install with all converters
uv add 'markitdown[all]'
# Or install only what you need
uv add 'markitdown[pdf,docx,xlsx,pptx]'
Basic Usage
"""
MarkItDown: Convert Any Document to Markdown
=============================================
The first step in any document extraction pipeline.
"""
from markitdown import MarkItDown
# Initialize the converter
md = MarkItDown()
# Convert a PDF
result = md.convert("invoice.pdf")
print(result.text_content)
# Convert a Word document
result = md.convert("contract.docx")
print(result.text_content)
# Convert an Excel spreadsheet
result = md.convert("sales_data.xlsx")
print(result.text_content) # Tables become Markdown tables!
# Convert a PowerPoint presentation
result = md.convert("quarterly_report.pptx")
print(result.text_content) # Each slide becomes a section
Converting from URLs and Streams
"""
MarkItDown: URLs, Streams, and Remote Files
============================================
"""
from markitdown import MarkItDown
from io import BytesIO
md = MarkItDown()
# Convert from URL (HTML pages, PDFs, etc.)
result = md.convert_url("https://example.com/article.html")
print(result.text_content)
# Convert YouTube video (extracts transcript)
result = md.convert_url("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
print(result.text_content)
# Convert from bytes (e.g., file uploads)
with open("document.pdf", "rb") as f:
pdf_bytes = f.read()
result = md.convert_stream(BytesIO(pdf_bytes), file_extension=".pdf")
print(result.text_content)
Adding LLM-Powered Image Descriptions
LLM-powered image descriptions are optional. Basic MarkItDown works without any API key. This feature only enhances image extraction and requires OPENAI_API_KEY in your environment.
For images and presentations with images, MarkItDown can use an LLM to generate descriptions:
"""
MarkItDown with LLM Image Descriptions
======================================
Useful for extracting data from screenshots, diagrams, and presentations.
Requires: OPENAI_API_KEY environment variable
"""
from markitdown import MarkItDown
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()
# Initialize with LLM support (requires API key)
client = OpenAI()
md = MarkItDown(
llm_client=client,
llm_model="gpt-4o-mini",
llm_prompt="Describe this image in detail, focusing on any text, numbers, or data visible."
)
# Now images get AI-generated descriptions
result = md.convert("screenshot.png")
print(result.text_content)
Example: What MarkItDown Produces
Input: A PDF invoice
Output:
# INVOICE
**Invoice Number:** INV-2024-0042
**Date:** January 15, 2024
**Due Date:** February 15, 2024
## From
Acme Software Solutions
123 Tech Street
San Francisco, CA 94105
## Bill To
TechStart Inc.
456 Innovation Ave
Austin, TX 78701
## Items
| Description | Qty | Unit Price | Total |
|-------------|-----|------------|-------|
| Software License (Annual) | 1 | $5,000.00 | $5,000.00 |
| Implementation Services | 10 | $150.00 | $1,500.00 |
| Training (per person) | 5 | $200.00 | $1,000.00 |
**Subtotal:** $7,500.00
**Tax (8.25%):** $618.75
**Total Due:** $8,118.75
---
*Payment Terms: Net 30*
This structured Markdown is much easier for an LLM to extract data from than raw PDF bytes or OCR output.
3. JSON Mode: The First Line of Defense
Most providers now offer a "JSON mode" that guarantees syntactically valid JSON.
OpenAI JSON Mode
"""
OpenAI JSON Mode
================
Guarantees valid JSON syntax (but not schema).
"""
import json
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()
client = OpenAI()
def extract_with_json_mode(text: str, instruction: str) -> dict:
"""
Extract structured data using JSON mode.
IMPORTANT: You must mention "JSON" in your prompt!
"""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{
"role": "system",
"content": f"""You are a data extraction assistant.
{instruction}
Respond with valid JSON only. No markdown, no explanation."""
},
{
"role": "user",
"content": text
}
],
response_format={"type": "json_object"}, # The magic flag
temperature=0, # Deterministic for structured output
)
# response.choices[0].message.content is guaranteed to be valid JSON
return json.loads(response.choices[0].message.content)
if __name__ == "__main__":
text = "John Smith from Acme Corp called about the Q3 report. He's 35 years old."
instruction = """Extract the following fields:
- name: The person's full name
- company: The company they work for
- topic: What they called about
- age: Their age as a number"""
result = extract_with_json_mode(text, instruction)
print(json.dumps(result, indent=2))
# Output:
# {
# "name": "John Smith",
# "company": "Acme Corp",
# "topic": "Q3 report",
# "age": 35
# }
Claude JSON Mode
"""
Anthropic Claude JSON Mode
==========================
Similar concept, different implementation.
"""
import json
from anthropic import Anthropic
from dotenv import load_dotenv
load_dotenv()
client = Anthropic()
def extract_with_json_mode(text: str, instruction: str) -> dict:
"""
Extract structured data using Claude.
Claude doesn't have a response_format parameter,
but you can get reliable JSON with careful prompting.
"""
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1000,
system=f"""You are a data extraction assistant.
{instruction}
CRITICAL: Respond with ONLY valid JSON. No markdown code fences, no explanation, no preamble. Just the raw JSON object.""",
messages=[
{"role": "user", "content": text}
],
temperature=0,
)
content = response.content[0].text.strip()
# Clean up common issues
if content.startswith("```"):
# Remove markdown code fences
content = content.split("```")[1]
if content.startswith("json"):
content = content[4:]
content = content.strip()
return json.loads(content)
What JSON Mode Guarantees (and Doesn't)
| Aspect | Guaranteed? | Notes |
|---|---|---|
| Valid JSON syntax | ✅ Yes | Will always parse without errors |
| Correct field names | ❌ No | Might use "fullName" instead of "name" |
| Correct types | ❌ No | Might return "25" (string) instead of 25 (number) |
| All fields present | ❌ No | Might omit optional fields |
| No extra fields | ❌ No | Might add fields you didn't ask for |
JSON mode is necessary but not sufficient. You also need schema validation.
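The gap is easy to demonstrate. A minimal illustration, assuming the model returned the payload below: it is valid JSON, so JSON mode is satisfied, yet it breaks several rows of the table above at once.

import json

# Syntactically valid, so JSON mode is satisfied...
data = json.loads('{"fullName": "John", "age": "25", "mood": "helpful"}')

# ...but the schema is still violated:
print(data.get("name"))   # None: model used "fullName" instead of "name"
print(type(data["age"]))  # <class 'str'>: should have been an int
print("mood" in data)     # True: an extra field we never asked for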
4. Schema-First Extraction with Pydantic
Pydantic lets you define exactly what shape your data should have—and validates it at runtime.
Setup
uv add pydantic
Basic Schema Validation
"""
Pydantic Schema Validation
==========================
Define your schema, validate the output.
"""
import json
from pydantic import BaseModel, Field, ValidationError
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()
client = OpenAI()
# ─────────────────────────────────────────────────────────────────────────────
# Define Your Schema
# ─────────────────────────────────────────────────────────────────────────────
class Person(BaseModel):
"""Schema for extracted person data."""
name: str = Field(description="The person's full name")
company: str | None = Field(default=None, description="Company they work for")
age: int | None = Field(default=None, description="Age in years")
email: str | None = Field(default=None, description="Email address")
class ExtractionResult(BaseModel):
"""Wrapper for extraction results."""
people: list[Person] = Field(description="List of people mentioned")
topics: list[str] = Field(description="Topics discussed")
sentiment: str = Field(description="Overall sentiment: positive, negative, or neutral")
# ─────────────────────────────────────────────────────────────────────────────
# Generate Schema Description for the Prompt
# ─────────────────────────────────────────────────────────────────────────────
def schema_to_prompt(model: type[BaseModel]) -> str:
"""
Convert a Pydantic model to a prompt-friendly description.
"""
schema = model.model_json_schema()
return json.dumps(schema, indent=2)
# ─────────────────────────────────────────────────────────────────────────────
# Extraction Function
# ─────────────────────────────────────────────────────────────────────────────
def extract_structured(text: str, schema: type[BaseModel]) -> BaseModel:
"""
Extract structured data matching the given Pydantic schema.
Raises ValidationError if the output doesn't match the schema.
"""
schema_description = schema_to_prompt(schema)
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{
"role": "system",
"content": f"""You are a precise data extraction assistant.
Extract data from the user's text and return it as JSON matching this exact schema:
{schema_description}
Rules:
- Return ONLY valid JSON matching the schema
- Use null for missing optional fields
- Use exact field names from the schema
- Ensure types match (integers for age, strings for names, etc.)"""
},
{"role": "user", "content": text}
],
response_format={"type": "json_object"},
temperature=0,
)
raw_json = json.loads(response.choices[0].message.content)
# Validate against schema - raises ValidationError if invalid
return schema.model_validate(raw_json)
# ─────────────────────────────────────────────────────────────────────────────
# Example Usage
# ─────────────────────────────────────────────────────────────────────────────
if __name__ == "__main__":
text = """
Had a great meeting today! John Smith ([email protected]) from Acme Corp
presented the Q3 results. He's been with the company for 10 years
and is turning 35 next month. Sarah Johnson from TechStart also joined -
she seemed very enthusiastic about the partnership opportunity.
"""
try:
result = extract_structured(text, ExtractionResult)
print("=== Extraction Result ===")
print(f"Sentiment: {result.sentiment}")
print(f"Topics: {result.topics}")
print(f"\nPeople found: {len(result.people)}")
for person in result.people:
print(f"\n Name: {person.name}")
print(f" Company: {person.company}")
print(f" Age: {person.age}")
print(f" Email: {person.email}")
except ValidationError as e:
print(f"Validation failed: {e}")
OpenAI Structured Outputs (Native Schema Support)
OpenAI now supports native schema enforcement using Pydantic directly:
"""
OpenAI Structured Outputs
=========================
Native schema enforcement - even stricter than JSON mode.
"""
from pydantic import BaseModel, Field
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()
client = OpenAI()
class CalendarEvent(BaseModel):
"""Schema for a calendar event."""
title: str = Field(description="Event title")
date: str = Field(description="Date in YYYY-MM-DD format")
time: str | None = Field(default=None, description="Time in HH:MM format")
duration_minutes: int = Field(default=60, description="Duration in minutes")
attendees: list[str] = Field(default_factory=list, description="List of attendee names")
def extract_calendar_event(text: str) -> CalendarEvent:
"""
Extract a calendar event using OpenAI's native structured outputs.
This guarantees the response matches the schema exactly.
"""
completion = client.beta.chat.completions.parse(
model="gpt-4o-mini",
messages=[
{
"role": "system",
"content": "Extract calendar event details from the user's text."
},
{"role": "user", "content": text}
],
response_format=CalendarEvent, # Pass the Pydantic model directly!
)
# Already parsed and validated
return completion.choices[0].message.parsed
if __name__ == "__main__":
text = """
Let's schedule a team sync for next Tuesday (2024-01-15) at 2pm.
It should be about 30 minutes. Invite Alice, Bob, and Charlie.
"""
event = extract_calendar_event(text)
print(f"Title: {event.title}")
print(f"Date: {event.date}")
print(f"Time: {event.time}")
print(f"Duration: {event.duration_minutes} minutes")
print(f"Attendees: {', '.join(event.attendees)}")
Which approach should you use?
- JSON mode + manual validation: works with any provider and gives you more control.
- OpenAI Structured Outputs: stricter guarantees and less code, but OpenAI-only.
5. End-to-End Pipeline: Document to Structured Data
Let's combine everything into a complete extraction pipeline that handles real documents:
"""
Complete Document Extraction Pipeline
=====================================
From raw files (PDF, DOCX, XLSX) to typed Python objects.
"""
import json
from pathlib import Path
from typing import TypeVar, Type
from pydantic import BaseModel, Field, ValidationError
from markitdown import MarkItDown
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()
T = TypeVar('T', bound=BaseModel)
class DocumentExtractor:
"""
Extract structured data from any document format.
Pipeline: File → MarkItDown → Markdown → LLM → Pydantic → Typed Object
"""
def __init__(
self,
model: str = "gpt-4o-mini",
enable_image_descriptions: bool = False,
):
self.llm_client = OpenAI()
self.model = model
# Configure MarkItDown
if enable_image_descriptions:
self.converter = MarkItDown(
llm_client=self.llm_client,
llm_model=model,
llm_prompt="Describe this image in detail, focusing on text, numbers, and data."
)
else:
self.converter = MarkItDown()
def extract(
self,
file_path: str | Path,
schema: Type[T],
extraction_prompt: str | None = None,
) -> T:
"""
Extract structured data from a document.
Args:
file_path: Path to the document (PDF, DOCX, XLSX, etc.)
schema: Pydantic model defining the expected structure
extraction_prompt: Optional custom instructions for extraction
Returns:
Validated instance of the schema
"""
# Step 1: Convert document to Markdown
result = self.converter.convert(str(file_path))
markdown_content = result.text_content
# Step 2: Generate schema description
schema_json = json.dumps(schema.model_json_schema(), indent=2)
# Step 3: Build the extraction prompt
if extraction_prompt:
system_content = extraction_prompt + f"\n\nOutput schema:\n{schema_json}"
else:
system_content = f"""You are a precise data extraction assistant.
Extract structured data from the provided document and return it as JSON matching this exact schema:
{schema_json}
Rules:
- Extract all visible information that matches the schema fields
- Use null for fields that aren't present in the document
- Ensure types match exactly (integers for numbers, strings for text)
- Do not invent or hallucinate data not present in the document"""
# Step 4: Call the LLM
response = self.llm_client.chat.completions.create(
model=self.model,
messages=[
{"role": "system", "content": system_content},
{"role": "user", "content": f"Extract data from this document:\n\n{markdown_content}"}
],
response_format={"type": "json_object"},
temperature=0,
)
# Step 5: Parse and validate
raw_json = json.loads(response.choices[0].message.content)
return schema.model_validate(raw_json)
def extract_with_context(
self,
file_path: str | Path,
schema: Type[T],
) -> tuple[T, str]:
"""
Extract data and also return the intermediate Markdown.
Useful for debugging or showing users what was extracted.
"""
        result = self.converter.convert(str(file_path))
        markdown_content = result.text_content
        # Note: extract() converts the file again internally. Fine for a demo;
        # in production, refactor to reuse markdown_content and convert once.
        extracted = self.extract(file_path, schema)
        return extracted, markdown_content
# ─────────────────────────────────────────────────────────────────────────────
# Example: Multi-Format Invoice Processing
# ─────────────────────────────────────────────────────────────────────────────
class LineItem(BaseModel):
description: str = Field(description="Item description")
quantity: float = Field(description="Quantity")
unit_price: float = Field(description="Price per unit")
total: float = Field(description="Line total")
class Invoice(BaseModel):
invoice_number: str = Field(description="Invoice number/ID")
invoice_date: str = Field(description="Date in YYYY-MM-DD format")
vendor_name: str = Field(description="Vendor/seller name")
customer_name: str = Field(description="Customer/buyer name")
line_items: list[LineItem] = Field(description="List of items")
subtotal: float = Field(description="Subtotal before tax")
tax_amount: float = Field(default=0, description="Tax amount")
total: float = Field(description="Total amount due")
currency: str = Field(default="USD", description="Currency code")
if __name__ == "__main__":
extractor = DocumentExtractor()
# Works with any format MarkItDown supports!
for invoice_file in ["invoice.pdf", "invoice.docx", "invoice.xlsx"]:
if Path(invoice_file).exists():
print(f"\n=== Processing {invoice_file} ===")
invoice = extractor.extract(invoice_file, Invoice)
print(f"Invoice #: {invoice.invoice_number}")
print(f"Vendor: {invoice.vendor_name}")
print(f"Customer: {invoice.customer_name}")
print(f"Total: {invoice.currency} {invoice.total}")
The same DocumentExtractor pattern works for Excel (tables become Markdown tables), PowerPoint (slides become sections), and any other format MarkItDown supports. No separate code needed.
6. Retry Strategies for Malformed Output
Even with JSON mode and schemas, things can go wrong. Here's how to handle failures:
"""
Robust Extraction with Retries
==============================
Handle failures gracefully with multiple strategies.
"""
import json
from typing import TypeVar, Type
from pydantic import BaseModel, ValidationError
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()
client = OpenAI()
T = TypeVar('T', bound=BaseModel)
class ExtractionError(Exception):
"""Raised when extraction fails after all retries."""
pass
def extract_with_retry(
text: str,
schema: Type[T],
max_retries: int = 3,
model: str = "gpt-4o-mini",
) -> T:
"""
Extract structured data with retry logic.
Strategies:
1. First attempt: Standard extraction
2. On validation error: Include the error in retry prompt
3. On JSON error: Ask for cleaner output
"""
schema_json = json.dumps(schema.model_json_schema(), indent=2)
messages = [
{
"role": "system",
"content": f"""Extract data as JSON matching this schema:
{schema_json}
Return ONLY valid JSON. No explanation."""
},
{"role": "user", "content": text}
]
last_error = None
for attempt in range(max_retries):
try:
response = client.chat.completions.create(
model=model,
messages=messages,
response_format={"type": "json_object"},
temperature=0,
)
raw_content = response.choices[0].message.content
raw_json = json.loads(raw_content)
# Validate against schema
return schema.model_validate(raw_json)
except json.JSONDecodeError as e:
last_error = e
# Add correction prompt
messages.append({
"role": "assistant",
"content": raw_content
})
messages.append({
"role": "user",
"content": f"That was not valid JSON. Error: {e}. Please return ONLY valid JSON."
})
except ValidationError as e:
last_error = e
# Add correction prompt with specific errors
error_details = []
for error in e.errors():
field = ".".join(str(x) for x in error["loc"])
error_details.append(f"- {field}: {error['msg']}")
messages.append({
"role": "assistant",
"content": raw_content
})
messages.append({
"role": "user",
"content": f"""The JSON didn't match the required schema. Errors:
{chr(10).join(error_details)}
Please fix these issues and return valid JSON."""
})
raise ExtractionError(f"Failed after {max_retries} attempts. Last error: {last_error}")
# ─────────────────────────────────────────────────────────────────────────────
# Alternative: Fallback Chain
# ─────────────────────────────────────────────────────────────────────────────
def extract_with_fallback(
text: str,
schema: Type[T],
    models: tuple[str, ...] = ("gpt-4o-mini", "gpt-4o"),
) -> T:
"""
Try multiple models in sequence.
Useful when cheaper models fail on complex extractions.
"""
errors = []
for model in models:
try:
return extract_with_retry(text, schema, model=model, max_retries=2)
except ExtractionError as e:
errors.append(f"{model}: {e}")
continue
raise ExtractionError(f"All models failed:\n" + "\n".join(errors))
7. Batch Processing
For multiple documents, use async with semaphores to limit concurrent API calls:
import asyncio
import json

from openai import AsyncOpenAI
async def extract_batch(texts: list[str], schema_json: str, max_concurrent: int = 5):
client = AsyncOpenAI()
semaphore = asyncio.Semaphore(max_concurrent)
async def extract_one(text: str):
async with semaphore:
response = await client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": f"Extract as JSON:\n{schema_json}"},
{"role": "user", "content": text}
],
response_format={"type": "json_object"},
temperature=0,
)
return json.loads(response.choices[0].message.content)
return await asyncio.gather(*[extract_one(t) for t in texts])
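A short usage sketch; the sample texts are placeholders and Person is the Pydantic schema from Section 4:

if __name__ == "__main__":
    # Person is the schema defined in Section 4
    schema_json = json.dumps(Person.model_json_schema(), indent=2)
    texts = ["John is 25 years old.", "Sarah Johnson, 41, works at TechStart."]
    results = asyncio.run(extract_batch(texts, schema_json))
    for row in results:
        print(row)

Note that extract_batch returns raw dicts; run each through schema.model_validate (or the retry helper from Section 6) to get validated objects.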
8. Common Pitfalls
| Symptom | Cause | Fix |
|---|---|---|
| "I cannot extract data as JSON" | Model refusing to comply | Use JSON mode, simplify schema |
| Wrong field names | Schema not in prompt | Include full schema with field names |
| Missing fields | Optional fields not specified | Use Field(default=None) for optional |
| Type mismatches | String "25" instead of int 25 | Be explicit about types in schema description |
| Extra markdown | Model wrapping in ```json | Use JSON mode or strip markdown |
| Hallucinated data | Model inventing values | Add "use null if not found" to prompt |
| Inconsistent results | Temperature > 0 | Use temperature=0 for extraction |
| PDF extraction fails | MarkItDown missing deps | Install markitdown[pdf] |
| Tables not extracted | Complex PDF layout | Try Azure Document Intelligence |
| Images ignored | No LLM configured | Pass llm_client to MarkItDown |
9. Try It Yourself
Challenge 1: Email Parser
Create a schema and parser for email metadata:
class Email(BaseModel):
sender: str
recipients: list[str]
subject: str
date: str
is_reply: bool
has_attachments: bool
sentiment: str # positive, negative, neutral
action_items: list[str]
Challenge 2: Product Review Analyzer
Extract structured sentiment from product reviews:
class Review(BaseModel):
product_name: str
rating: int # 1-5
pros: list[str]
cons: list[str]
would_recommend: bool
key_quotes: list[str]
Challenge 3: Resume Parser
Build a resume parser that handles various formats:
class Resume(BaseModel):
name: str
email: str | None
phone: str | None
education: list[Education]
experience: list[Experience]
skills: list[str]
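The nested Education and Experience models are left for you to define; a minimal sketch (the field choices are suggestions, not requirements):

class Education(BaseModel):
    institution: str
    degree: str | None = None
    graduation_year: int | None = None

class Experience(BaseModel):
    company: str
    title: str
    start_date: str | None = None  # e.g. "2021-03"
    end_date: str | None = None    # None for a current role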
10. Key Takeaways
- MarkItDown converts anything to Markdown. PDF, Word, Excel, PowerPoint, images, audio—all become LLM-friendly text.
- JSON mode guarantees syntax, not schema. Always validate with Pydantic.
- Temperature 0 is essential. Structured extraction needs determinism.
- Include the full schema in your prompt. Models can't read your Python code.
- Plan for failures. Implement retry logic with helpful error messages.
- Use native structured outputs when available. OpenAI's response_format with Pydantic is the gold standard.
- Batch with concurrency limits. Don't overwhelm the API—use semaphores for LLM calls.
11. What's Next
You can now turn messy documents into typed objects. In Lesson 7: Vision & Multimodal Inputs, we'll learn to process images directly with LLMs—analyzing screenshots, extracting data from photos, and building tools that can "see."
12. Additional Resources
- Docling — converts complex documents into structured Markdown (by IBM)
- MarkItDown (GitHub) — Microsoft's document-to-Markdown converter
- OpenAI Structured Outputs Guide — official documentation
- Pydantic Documentation — Python validation library
- Azure Document Intelligence — for complex document layouts