Lesson 6: Structured Data Extraction
- The Problem: Why free-text responses break production systems.
- Document Preprocessing: Using MarkItDown to convert files to LLM-friendly Markdown.
- JSON Mode: Forcing valid JSON output from LLMs.
- Schema Definition: Describing exactly what structure you want.
- Validation: Pydantic for type safety and error handling.
- Retry Strategies: Handling malformed output gracefully.
- Real-World Pipeline: From raw documents to structured data.
Your LLM returns beautiful prose. Your database expects {"name": "string", "amount": number}. Something has to give. In this lesson, you'll learn to reliably extract structured data from LLMs—the foundation of every AI-powered automation.
1. The Structured Output Problem
LLMs are trained to generate natural language. When you ask for JSON, you might get:
Prompt: "Extract the person's name and age from: 'John is 25 years old'"
❌ Bad outputs you might receive:
- "The person's name is John and they are 25 years old."
- "Name: John\nAge: 25"
- "```json\n{\"name\": \"John\", \"age\": 25}\n```"
- "{name: John, age: 25}" // Invalid JSON (unquoted keys)
- "{"name": "John", "age": "25"}" // Age is string, not number
The core tension: LLMs want to be helpful and conversational. Your code wants exact, parseable data.
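To see the failure concretely, here is a minimal sketch of what a naive json.loads call does with outputs like these:

import json

bad_outputs = [
    "The person's name is John and they are 25 years old.",  # prose
    "Name: John\nAge: 25",                                   # key-value text
    "{name: John, age: 25}",                                 # unquoted keys
]

for output in bad_outputs:
    try:
        json.loads(output)
    except json.JSONDecodeError as exc:
        print(f"Parse failed: {exc.msg}")

# Even output that parses can still be wrong:
parsed = json.loads('{"name": "John", "age": "25"}')
print(type(parsed["age"]))  # <class 'str'>: parsed["age"] + 1 raises TypeError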
2. Document Preprocessing with MarkItDown
Before you can extract structured data, you need to get the raw content into a format LLMs can process. Docling (by IBM) or MarkItDown (by Microsoft) convert virtually any document format into clean Markdown—the format LLMs understand best.
The code examples in this lesson use MarkItDown.
Why Markdown?
LLMs like GPT-4 and Claude were trained on massive amounts of Markdown. They understand headings, lists, tables, and links natively. Markdown is also token-efficient; it has a minimal markup overhead compared to HTML or XML.
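As a rough illustration of that overhead, compare the same one-row table in both formats (illustrative only; exact token counts vary by tokenizer):

HTML:

<table>
  <tr><th>Qty</th><th>Price</th></tr>
  <tr><td>1</td><td>$5,000.00</td></tr>
</table>

Markdown:

| Qty | Price |
|---|---|
| 1 | $5,000.00 |

The Markdown version carries the same data with a fraction of the markup.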
The extraction pipeline: File → MarkItDown → Markdown → LLM → Pydantic → Typed Object.
Supported Formats
MarkItDown handles an impressive range of formats:
| Category | Formats |
|---|---|
| Documents | PDF, Word (.docx), PowerPoint (.pptx), Excel (.xlsx, .xls) |
| Web | HTML, XML, RSS |
| Data | CSV, JSON, JSONL |
| Media | Images (with OCR), Audio (with transcription) |
| Archives | ZIP files (extracts and converts contents) |
| Other | YouTube URLs (transcripts), EPub, Outlook (.msg) |
Setup
# Install with all converters
uv add 'markitdown[all]'
# Or install only what you need
uv add 'markitdown[pdf,docx,xlsx,pptx]'
Basic Usage
"""
MarkItDown: Convert Any Document to Markdown
=============================================
The first step in any document extraction pipeline.
"""
from markitdown import MarkItDown
# Initialize the converter
md = MarkItDown()
# Convert a PDF
result = md.convert("invoice.pdf")
print(result.text_content)
# Convert a Word document
result = md.convert("contract.docx")
print(result.text_content)
# Convert an Excel spreadsheet
result = md.convert("sales_data.xlsx")
print(result.text_content) # Tables become Markdown tables!
# Convert a PowerPoint presentation
result = md.convert("quarterly_report.pptx")
print(result.text_content) # Each slide becomes a section
Converting from URLs and Streams
"""
MarkItDown: URLs, Streams, and Remote Files
============================================
"""
from markitdown import MarkItDown
from io import BytesIO
md = MarkItDown()
# Convert from URL (HTML pages, PDFs, etc.)
result = md.convert_url("https://example.com/article.html")
print(result.text_content)
# Convert YouTube video (extracts transcript)
result = md.convert_url("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
print(result.text_content)
# Convert from bytes (e.g., file uploads)
with open("document.pdf", "rb") as f:
pdf_bytes = f.read()
result = md.convert_stream(BytesIO(pdf_bytes), file_extension=".pdf")
print(result.text_content)
Adding LLM-Powered Image Descriptions
LLM-powered image descriptions are optional. Basic MarkItDown works without any API key. This feature only enhances image extraction and requires OPENAI_API_KEY in your environment.
For images and presentations with images, MarkItDown can use an LLM to generate descriptions:
"""
MarkItDown with LLM Image Descriptions
======================================
Useful for extracting data from screenshots, diagrams, and presentations.
Requires: OPENAI_API_KEY environment variable
"""
from markitdown import MarkItDown
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()
# Initialize with LLM support (requires API key)
client = OpenAI()
md = MarkItDown(
llm_client=client,
llm_model="gpt-4o-mini",
llm_prompt="Describe this image in detail, focusing on any text, numbers, or data visible."
)
# Now images get AI-generated descriptions
result = md.convert("screenshot.png")
print(result.text_content)
Example: What MarkItDown Produces
Input: A PDF invoice
Output:
# INVOICE
**Invoice Number:** INV-2024-0042
**Date:** January 15, 2024
**Due Date:** February 15, 2024
## From
Acme Software Solutions
123 Tech Street
San Francisco, CA 94105
## Bill To
TechStart Inc.
456 Innovation Ave
Austin, TX 78701
## Items
| Description | Qty | Unit Price | Total |
|-------------|-----|------------|-------|
| Software License (Annual) | 1 | $5,000.00 | $5,000.00 |
| Implementation Services | 10 | $150.00 | $1,500.00 |
| Training (per person) | 5 | $200.00 | $1,000.00 |
**Subtotal:** $7,500.00
**Tax (8.25%):** $618.75
**Total Due:** $8,118.75
---
*Payment Terms: Net 30*
This structured Markdown is much easier for an LLM to extract data from than raw PDF bytes or OCR output.
3. JSON Mode: The First Line of Defense
Most providers now offer a "JSON mode" that guarantees syntactically valid JSON.
OpenAI JSON Mode
"""
OpenAI JSON Mode
================
Guarantees valid JSON syntax (but not schema).
"""
import json
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()
client = OpenAI()
def extract_with_json_mode(text: str, instruction: str) -> dict:
"""
Extract structured data using JSON mode.
IMPORTANT: You must mention "JSON" in your prompt!
"""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{
"role": "system",
"content": f"""You are a data extraction assistant.
{instruction}
Respond with valid JSON only. No markdown, no explanation."""
},
{
"role": "user",
"content": text
}
],
response_format={"type": "json_object"}, # The magic flag
temperature=0, # Deterministic for structured output
)
# response.choices[0].message.content is guaranteed to be valid JSON
return json.loads(response.choices[0].message.content)
if __name__ == "__main__":
text = "John Smith from Acme Corp called about the Q3 report. He's 35 years old."
instruction = """Extract the following fields:
- name: The person's full name
- company: The company they work for
- topic: What they called about
- age: Their age as a number"""
result = extract_with_json_mode(text, instruction)
print(json.dumps(result, indent=2))
# Output:
# {
# "name": "John Smith",
# "company": "Acme Corp",
# "topic": "Q3 report",
# "age": 35
# }
Claude JSON Mode
"""
Anthropic Claude JSON Mode
==========================
Similar concept, different implementation.
"""
import json
from anthropic import Anthropic
from dotenv import load_dotenv
load_dotenv()
client = Anthropic()
def extract_with_json_mode(text: str, instruction: str) -> dict:
"""
Extract structured data using Claude.
Claude doesn't have a response_format parameter,
but you can get reliable JSON with careful prompting.
"""
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1000,
system=f"""You are a data extraction assistant.
{instruction}
CRITICAL: Respond with ONLY valid JSON. No markdown code fences, no explanation, no preamble. Just the raw JSON object.""",
messages=[
{"role": "user", "content": text}
],
temperature=0,
)
content = response.content[0].text.strip()
# Clean up common issues
if content.startswith("```"):
# Remove markdown code fences
content = content.split("```")[1]
if content.startswith("json"):
content = content[4:]
content = content.strip()
return json.loads(content)
What JSON Mode Guarantees (and Doesn't)
| Aspect | Guaranteed? | Notes |
|---|---|---|
| Valid JSON syntax | ✅ Yes | Will always parse without errors |
| Correct field names | ❌ No | Might use "fullName" instead of "name" |
| Correct types | ❌ No | Might return "25" (string) instead of 25 (number) |
| All fields present | ❌ No | Might omit optional fields |
| No extra fields | ❌ No | Might add fields you didn't ask for |
JSON mode is necessary but not sufficient. You also need schema validation.
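The gap is easy to demonstrate. A minimal illustration, assuming the model returned the payload below: it is valid JSON, so JSON mode is satisfied, yet it breaks several rows of the table above at once.

import json

# Syntactically valid, so JSON mode is satisfied...
data = json.loads('{"fullName": "John", "age": "25", "mood": "helpful"}')

# ...but the schema is still violated:
print(data.get("name"))   # None: model used "fullName" instead of "name"
print(type(data["age"]))  # <class 'str'>: should have been an int
print("mood" in data)     # True: an extra field we never asked for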
4. Schema-First Extraction with Pydantic
Pydantic lets you define exactly what shape your data should have—and validates it at runtime.
Setup
uv add pydantic
Basic Schema Validation
"""
Pydantic Schema Validation
==========================
Define your schema, validate the output.
"""
import json
from pydantic import BaseModel, Field, ValidationError
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()
client = OpenAI()
# ─────────────────────────────────────────────────────────────────────────────
# Define Your Schema
# ─────────────────────────────────────────────────────────────────────────────
class Person(BaseModel):
"""Schema for extracted person data."""
name: str = Field(description="The person's full name")
company: str | None = Field(default=None, description="Company they work for")
age: int | None = Field(default=None, description="Age in years")
email: str | None = Field(default=None, description="Email address")
class ExtractionResult(BaseModel):
"""Wrapper for extraction results."""
people: list[Person] = Field(description="List of people mentioned")
topics: list[str] = Field(description="Topics discussed")
sentiment: str = Field(description="Overall sentiment: positive, negative, or neutral")
# ─────────────────────────────────────────────────────────────────────────────
# Generate Schema Description for the Prompt
# ─────────────────────────────────────────────────────────────────────────────
def schema_to_prompt(model: type[BaseModel]) -> str:
"""
Convert a Pydantic model to a prompt-friendly description.
"""
schema = model.model_json_schema()
return json.dumps(schema, indent=2)
# ─────────────────────────────────────────────────────────────────────────────
# Extraction Function
# ─────────────────────────────────────────────────────────────────────────────
def extract_structured(text: str, schema: type[BaseModel]) -> BaseModel:
"""
Extract structured data matching the given Pydantic schema.
Raises ValidationError if the output doesn't match the schema.
"""
schema_description = schema_to_prompt(schema)
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{
"role": "system",
"content": f"""You are a precise data extraction assistant.
Extract data from the user's text and return it as JSON matching this exact schema:
{schema_description}
Rules:
- Return ONLY valid JSON matching the schema
- Use null for missing optional fields
- Use exact field names from the schema
- Ensure types match (integers for age, strings for names, etc.)"""
},
{"role": "user", "content": text}
],
response_format={"type": "json_object"},
temperature=0,
)
raw_json = json.loads(response.choices[0].message.content)
# Validate against schema - raises ValidationError if invalid
return schema.model_validate(raw_json)
# ─────────────────────────────────────────────────────────────────────────────
# Example Usage
# ─────────────────────────────────────────────────────────────────────────────
if __name__ == "__main__":
text = """
Had a great meeting today! John Smith ([email protected]) from Acme Corp
presented the Q3 results. He's been with the company for 10 years
and is turning 35 next month. Sarah Johnson from TechStart also joined -
she seemed very enthusiastic about the partnership opportunity.
"""
try:
result = extract_structured(text, ExtractionResult)
print("=== Extraction Result ===")
print(f"Sentiment: {result.sentiment}")
print(f"Topics: {result.topics}")
print(f"\nPeople found: {len(result.people)}")
for person in result.people:
print(f"\n Name: {person.name}")
print(f" Company: {person.company}")
print(f" Age: {person.age}")
print(f" Email: {person.email}")
except ValidationError as e:
print(f"Validation failed: {e}")
OpenAI Structured Outputs (Native Schema Support)
OpenAI now supports native schema enforcement using Pydantic directly:
"""
OpenAI Structured Outputs
=========================
Native schema enforcement - even stricter than JSON mode.
"""
from pydantic import BaseModel, Field
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()
client = OpenAI()
class CalendarEvent(BaseModel):
"""Schema for a calendar event."""
title: str = Field(description="Event title")
date: str = Field(description="Date in YYYY-MM-DD format")
time: str | None = Field(default=None, description="Time in HH:MM format")
duration_minutes: int = Field(default=60, description="Duration in minutes")
attendees: list[str] = Field(default_factory=list, description="List of attendee names")
def extract_calendar_event(text: str) -> CalendarEvent:
"""
Extract a calendar event using OpenAI's native structured outputs.
This guarantees the response matches the schema exactly.
"""
completion = client.beta.chat.completions.parse(
model="gpt-4o-mini",
messages=[
{
"role": "system",
"content": "Extract calendar event details from the user's text."
},
{"role": "user", "content": text}
],
response_format=CalendarEvent, # Pass the Pydantic model directly!
)
# Already parsed and validated
return completion.choices[0].message.parsed
if __name__ == "__main__":
text = """
Let's schedule a team sync for next Tuesday (2024-01-15) at 2pm.
It should be about 30 minutes. Invite Alice, Bob, and Charlie.
"""
event = extract_calendar_event(text)
print(f"Title: {event.title}")
print(f"Date: {event.date}")
print(f"Time: {event.time}")
print(f"Duration: {event.duration_minutes} minutes")
print(f"Attendees: {', '.join(event.attendees)}")
Which approach should you use?
- JSON mode + manual validation: works with any provider and gives you more control.
- OpenAI Structured Outputs: stricter guarantees and less code, but OpenAI-only.
5. End-to-End Pipeline: Document to Structured Data
Let's combine everything into a complete extraction pipeline that handles real documents:
"""
Complete Document Extraction Pipeline
=====================================
From raw files (PDF, DOCX, XLSX) to typed Python objects.
"""
import json
from pathlib import Path
from typing import TypeVar, Type
from pydantic import BaseModel, Field, ValidationError
from markitdown import MarkItDown
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()
T = TypeVar('T', bound=BaseModel)
class DocumentExtractor:
"""
Extract structured data from any document format.
Pipeline: File → MarkItDown → Markdown → LLM → Pydantic → Typed Object
"""
def __init__(
self,
model: str = "gpt-4o-mini",
enable_image_descriptions: bool = False,
):
self.llm_client = OpenAI()
self.model = model
# Configure MarkItDown
if enable_image_descriptions:
self.converter = MarkItDown(
llm_client=self.llm_client,
llm_model=model,
llm_prompt="Describe this image in detail, focusing on text, numbers, and data."
)
else:
self.converter = MarkItDown()
def extract(
self,
file_path: str | Path,
schema: Type[T],
extraction_prompt: str | None = None,
) -> T:
"""
Extract structured data from a document.
Args:
file_path: Path to the document (PDF, DOCX, XLSX, etc.)
schema: Pydantic model defining the expected structure
extraction_prompt: Optional custom instructions for extraction
Returns:
Validated instance of the schema
"""
# Step 1: Convert document to Markdown
result = self.converter.convert(str(file_path))
markdown_content = result.text_content
# Step 2: Generate schema description
schema_json = json.dumps(schema.model_json_schema(), indent=2)
# Step 3: Build the extraction prompt
if extraction_prompt:
system_content = extraction_prompt + f"\n\nOutput schema:\n{schema_json}"
else:
system_content = f"""You are a precise data extraction assistant.
Extract structured data from the provided document and return it as JSON matching this exact schema:
{schema_json}
Rules:
- Extract all visible information that matches the schema fields
- Use null for fields that aren't present in the document
- Ensure types match exactly (integers for numbers, strings for text)
- Do not invent or hallucinate data not present in the document"""
# Step 4: Call the LLM
response = self.llm_client.chat.completions.create(
model=self.model,
messages=[
{"role": "system", "content": system_content},
{"role": "user", "content": f"Extract data from this document:\n\n{markdown_content}"}
],
response_format={"type": "json_object"},
temperature=0,
)
# Step 5: Parse and validate
raw_json = json.loads(response.choices[0].message.content)
return schema.model_validate(raw_json)
def extract_with_context(
self,
file_path: str | Path,
schema: Type[T],
) -> tuple[T, str]:
"""
Extract data and also return the intermediate Markdown.
Useful for debugging or showing users what was extracted.
"""
        result = self.converter.convert(str(file_path))
        markdown_content = result.text_content
        # Note: extract() converts the file again internally. Fine for a demo;
        # in production, refactor to reuse markdown_content and convert once.
        extracted = self.extract(file_path, schema)
        return extracted, markdown_content
# ─────────────────────────────────────────────────────────────────────────────
# Example: Multi-Format Invoice Processing
# ─────────────────────────────────────────────────────────────────────────────
class LineItem(BaseModel):
description: str = Field(description="Item description")
quantity: float = Field(description="Quantity")
unit_price: float = Field(description="Price per unit")
total: float = Field(description="Line total")
class Invoice(BaseModel):
invoice_number: str = Field(description="Invoice number/ID")
invoice_date: str = Field(description="Date in YYYY-MM-DD format")
vendor_name: str = Field(description="Vendor/seller name")
customer_name: str = Field(description="Customer/buyer name")
line_items: list[LineItem] = Field(description="List of items")
subtotal: float = Field(description="Subtotal before tax")
tax_amount: float = Field(default=0, description="Tax amount")
total: float = Field(description="Total amount due")
currency: str = Field(default="USD", description="Currency code")
if __name__ == "__main__":
extractor = DocumentExtractor()
# Works with any format MarkItDown supports!
for invoice_file in ["invoice.pdf", "invoice.docx", "invoice.xlsx"]:
if Path(invoice_file).exists():
print(f"\n=== Processing {invoice_file} ===")
invoice = extractor.extract(invoice_file, Invoice)
print(f"Invoice #: {invoice.invoice_number}")
print(f"Vendor: {invoice.vendor_name}")
print(f"Customer: {invoice.customer_name}")
print(f"Total: {invoice.currency} {invoice.total}")
The same DocumentExtractor pattern works for Excel (tables become Markdown tables), PowerPoint (slides become sections), and any other format MarkItDown supports. No separate code needed.
6. Retry Strategies for Malformed Output
Even with JSON mode and schemas, things can go wrong. Here's how to handle failures:
"""
Robust Extraction with Retries
==============================
Handle failures gracefully with multiple strategies.
"""
import json
from typing import TypeVar, Type
from pydantic import BaseModel, ValidationError
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()
client = OpenAI()
T = TypeVar('T', bound=BaseModel)
class ExtractionError(Exception):
"""Raised when extraction fails after all retries."""
pass
def extract_with_retry(
text: str,
schema: Type[T],
max_retries: int = 3,
model: str = "gpt-4o-mini",
) -> T:
"""
Extract structured data with retry logic.
Strategies:
1. First attempt: Standard extraction
2. On validation error: Include the error in retry prompt
3. On JSON error: Ask for cleaner output
"""
schema_json = json.dumps(schema.model_json_schema(), indent=2)
messages = [
{
"role": "system",
"content": f"""Extract data as JSON matching this schema:
{schema_json}
Return ONLY valid JSON. No explanation."""
},
{"role": "user", "content": text}
]
last_error = None
for attempt in range(max_retries):
try:
response = client.chat.completions.create(
model=model,
messages=messages,
response_format={"type": "json_object"},
temperature=0,
)
raw_content = response.choices[0].message.content
raw_json = json.loads(raw_content)
# Validate against schema
return schema.model_validate(raw_json)
except json.JSONDecodeError as e:
last_error = e
# Add correction prompt
messages.append({
"role": "assistant",
"content": raw_content
})
messages.append({
"role": "user",
"content": f"That was not valid JSON. Error: {e}. Please return ONLY valid JSON."
})
except ValidationError as e:
last_error = e
# Add correction prompt with specific errors
error_details = []
for error in e.errors():
field = ".".join(str(x) for x in error["loc"])
error_details.append(f"- {field}: {error['msg']}")
messages.append({
"role": "assistant",
"content": raw_content
})
messages.append({
"role": "user",
"content": f"""The JSON didn't match the required schema. Errors:
{chr(10).join(error_details)}
Please fix these issues and return valid JSON."""
})
raise ExtractionError(f"Failed after {max_retries} attempts. Last error: {last_error}")
# ─────────────────────────────────────────────────────────────────────────────
# Alternative: Fallback Chain
# ─────────────────────────────────────────────────────────────────────────────
def extract_with_fallback(
text: str,
schema: Type[T],
    models: tuple[str, ...] = ("gpt-4o-mini", "gpt-4o"),
) -> T:
"""
Try multiple models in sequence.
Useful when cheaper models fail on complex extractions.
"""
errors = []
for model in models:
try:
return extract_with_retry(text, schema, model=model, max_retries=2)
except ExtractionError as e:
errors.append(f"{model}: {e}")
continue
raise ExtractionError(f"All models failed:\n" + "\n".join(errors))
7. Batch Processing
For multiple documents, use async with semaphores to limit concurrent API calls:
import asyncio
import json

from openai import AsyncOpenAI
async def extract_batch(texts: list[str], schema_json: str, max_concurrent: int = 5):
client = AsyncOpenAI()
semaphore = asyncio.Semaphore(max_concurrent)
async def extract_one(text: str):
async with semaphore:
response = await client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": f"Extract as JSON:\n{schema_json}"},
{"role": "user", "content": text}
],
response_format={"type": "json_object"},
temperature=0,
)
return json.loads(response.choices[0].message.content)
return await asyncio.gather(*[extract_one(t) for t in texts])
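A short usage sketch; the sample texts are placeholders and Person is the Pydantic schema from Section 4:

if __name__ == "__main__":
    # Person is the schema defined in Section 4
    schema_json = json.dumps(Person.model_json_schema(), indent=2)
    texts = ["John is 25 years old.", "Sarah Johnson, 41, works at TechStart."]
    results = asyncio.run(extract_batch(texts, schema_json))
    for row in results:
        print(row)

Note that extract_batch returns raw dicts; run each through schema.model_validate (or the retry helper from Section 6) to get validated objects.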
8. Common Pitfalls
| Symptom | Cause | Fix |
|---|---|---|
| "I cannot extract data as JSON" | Model refusing to comply | Use JSON mode, simplify schema |
| Wrong field names | Schema not in prompt | Include full schema with field names |
| Missing fields | Optional fields not specified | Use Field(default=None) for optional |
| Type mismatches | String "25" instead of int 25 | Be explicit about types in schema description |
| Extra markdown | Model wrapping in ```json | Use JSON mode or strip markdown |
| Hallucinated data | Model inventing values | Add "use null if not found" to prompt |
| Inconsistent results | Temperature > 0 | Use temperature=0 for extraction |
| PDF extraction fails | MarkItDown missing deps | Install markitdown[pdf] |
| Tables not extracted | Complex PDF layout | Try Azure Document Intelligence |
| Images ignored | No LLM configured | Pass llm_client to MarkItDown |
9. Try It Yourself
Challenge 1: Email Parser
Create a schema and parser for email metadata:
class Email(BaseModel):
sender: str
recipients: list[str]
subject: str
date: str
is_reply: bool
has_attachments: bool
sentiment: str # positive, negative, neutral
action_items: list[str]
Challenge 2: Product Review Analyzer
Extract structured sentiment from product reviews:
class Review(BaseModel):
product_name: str
rating: int # 1-5
pros: list[str]
cons: list[str]
would_recommend: bool
key_quotes: list[str]
Challenge 3: Resume Parser
Build a resume parser that handles various formats:
class Resume(BaseModel):
name: str
email: str | None
phone: str | None
education: list[Education]
experience: list[Experience]
skills: list[str]
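The nested Education and Experience models are left for you to define; a minimal sketch (the field choices are suggestions, not requirements):

class Education(BaseModel):
    institution: str
    degree: str | None = None
    graduation_year: int | None = None

class Experience(BaseModel):
    company: str
    title: str
    start_date: str | None = None  # e.g. "2021-03"
    end_date: str | None = None    # None for a current role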
10. Key Takeaways
- MarkItDown converts anything to Markdown. PDF, Word, Excel, PowerPoint, images, audio—all become LLM-friendly text.
- JSON mode guarantees syntax, not schema. Always validate with Pydantic.
- Temperature 0 is essential. Structured extraction needs determinism.
- Include the full schema in your prompt. Models can't read your Python code.
- Plan for failures. Implement retry logic with helpful error messages.
- Use native structured outputs when available. OpenAI's response_format with Pydantic is the gold standard.
- Batch with concurrency limits. Don't overwhelm the API—use semaphores for LLM calls.
11. What's Next
You can now turn messy documents into typed objects. In Lesson 7: Vision & Multimodal Inputs, we'll learn to process images directly with LLMs—analyzing screenshots, extracting data from photos, and building tools that can "see."
12. Additional Resources
- Docling — converts complex documents into structured Markdown (by IBM)
- MarkItDown (GitHub) — Microsoft's document-to-Markdown converter
- OpenAI Structured Outputs Guide — official documentation
- Pydantic Documentation — Python validation library
- Azure Document Intelligence — for complex document layouts