Introduction: The Data Chaos of LLMs
Tired of appending "please output JSON"
to every LLM prompt like a digital prayer? Large Language Models (LLMs) are revolutionizing how we interact with unstructured data, but integrating their outputs into traditional software workflows remains unpredictable and hard to scale.
Enter Pydantic, the Python library that’s quietly become the backbone of modern AI development.
In this guide, we’ll build two real-world tools that show why Pydantic is non-negotiable for LLM projects:
- A web scraper to extract product data from an e-commerce site.
- A PDF parser to decode complex technical schematics.
By the end, you’ll learn:
- How to define self-validating data contracts
- Why nested schemas matter for real-world complexity
- The 3-line validation pattern every LLM pipeline needs
Why Should I Care About Pydantic?
Pydantic pairs strict data validation with LLM-friendly schema definitions, solving the #1 problem in AI workflows: trusting your LLM’s output. It lets you do three things (a minimal sketch follows this list):
- Define strict data models (e.g., “A product must have a price and SKU”).
- Automatically validate LLM outputs against those models.
- Ensure type-safe validation that catches errors early.
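Here is the whole pattern in miniature. This is a sketch with a made-up Invoice model, and the raw_output string stands in for whatever your LLM actually returns:

from pydantic import BaseModel, ValidationError

class Invoice(BaseModel):
    invoice_id: str
    total: float

raw_output = '{"invoice_id": "INV-1042", "total": 199.99}'  # pretend this came from an LLM

try:
    invoice = Invoice.parse_raw(raw_output)  # parse and validate in one call
    print(invoice.total)                     # safe to use: total is guaranteed to be a float
except ValidationError as e:
    print(e)                                 # tells you exactly which field failed, and why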
Let’s see it in action.
Example 1: Web Scraping an E-Commerce Site
The Problem:
You scrape a product page, but the raw HTML is a jungle of nested divs and inconsistent classes. While LLMs can extract data, their outputs are frustratingly unpredictable. One moment the price is “$10.00”, the next it’s “ten dollars” - and sometimes the SKU is missing entirely.
The Pydantic Fix:
Step 1: Define Your Data Contract
from pydantic import BaseModel, Field
class Product(BaseModel):
    """Schema for e-commerce product data extraction."""
    name: str = Field(..., description="Product name from the webpage")
    price: str = Field(..., description="Current price including currency")
    rating: float | None = Field(None, description="User rating between 1-5 stars")
    features: list[str] = Field(..., description="Key product features")
This model acts as a “data contract” - it explicitly defines what you expect from the LLM. Even optional fields like rating are declared upfront.
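To see the contract at work, here is a quick check you can run once Product is defined. The dictionaries are made-up examples, not real scraper output:

from pydantic import ValidationError

good = {"name": "Frozen Pizza Poppers", "price": "$4.59", "features": ["Three Cheese", "6.8 oz"]}
bad = {"price": "$4.59", "features": "not a list"}  # name missing, features has the wrong type

product = Product(**good)
print(product.rating)   # None - the optional field quietly falls back to its default

try:
    Product(**bad)
except ValidationError as e:
    print(e)            # reports both problems: missing name, features is not a valid list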
Step 2: Configure Your LLM-Powered Scraper
import os
from crawl4ai import LLMConfig, AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy
llm_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="openai/gpt-4o",
        api_token=os.getenv("OPENAI_API_KEY")
    ),
    schema=Product.schema(),
    extraction_type="schema",
    instruction="Extract product details from the e-commerce page",
    chunk_token_threshold=2048,
    verbose=True
)
By passing Product.schema() to crawl4ai, we’re “locking in” the LLM’s output format to match our Pydantic model. No more guessing games.
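If you want to see exactly what gets handed to the LLM, print the schema yourself. The output in the comment below is abbreviated by hand, but the shape is what Pydantic generates (schema() is the Pydantic v1 spelling; v2 calls it model_json_schema()):

import json

# Product.schema() returns a plain JSON Schema dict built from the model
print(json.dumps(Product.schema(), indent=2))

# Abbreviated output, trimmed by hand:
# {
#   "title": "Product",
#   "type": "object",
#   "properties": {
#     "name": {"description": "Product name from the webpage", "type": "string"},
#     ...
#   },
#   "required": ["name", "price", "features"]
# }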
Step 3: Execute & Validate in One Shot
import asyncio
async def main():
    # Set up crawler configuration
    browser_config = BrowserConfig(
        headless=True,
        verbose=True,
        extra_args=["--disable-gpu", "--no-sandbox"]
    )

    crawl_config = CrawlerRunConfig(
        extraction_strategy=llm_strategy,
        cache_mode=CacheMode.BYPASS,
        word_count_threshold=100
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.amazon.com/alm/storefront/?almBrandId=VUZHIFdob2xlIEZvb2Rz",
            config=crawl_config
        )

        if result.success:
            try:
                # Validate extraction with Pydantic
                product_data = Product.parse_raw(result.extracted_content)
                print(f"Extracted product: {product_data.json(indent=2)}")
            except Exception as e:
                print(f"Validation error: {str(e)}")
                print(f"Raw content: {result.extracted_content}")
        else:
            print("Extraction failed:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
Output:
{
"name": "Annie's Frozen Pizza Poppers",
"price": "$4.59",
"rating": null,
"features": ["Three Cheese", "Snacks", "6.8 oz", "15 ct"]
}
Why This Approach Wins
- Self-Documenting Schemas: The Field(..., description="...") syntax acts as embedded documentation for future developers (or your future self).
- Fail-Fast Validation: The moment the LLM breaks the contract (e.g., returns a rating of “five stars” instead of a number), Product.parse_raw() throws an error. No silent failures.
- Toolchain Agnostic: Swap out crawl4ai for BeautifulSoup or Scrapy - your validation logic stays the same (see the sketch after this list).
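That last point is worth making concrete: the same few validation lines work no matter which tool produced the JSON. A minimal sketch, reusing the Product model from Step 1; fetch_with_bs4 is a hypothetical stand-in for any non-LLM extraction path:

from pydantic import ValidationError

def fetch_with_bs4(url: str) -> str:
    """Hypothetical alternative extractor (BeautifulSoup, Scrapy, a cached file...).
    All that matters is that it returns a JSON string shaped like our schema."""
    return '{"name": "Example Item", "price": "$9.99", "rating": "five stars", "features": []}'

try:
    product = Product.parse_raw(fetch_with_bs4("https://example.com/item"))
    print(product.name)
except ValidationError as e:
    print(e)  # fails fast: "five stars" is not a valid number for rating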
Without Pydantic, you’d be stuck writing:
# Manual validation hell
raw_data = json.loads(llm_output)
if "name" not in raw_data:
    raise ValueError("Missing field: name")
if not isinstance(raw_data.get("features"), list):
    raise TypeError("Features must be a list")
...
Result: Code that’s 40% longer, brittle to schema changes, and impossible to maintain at scale.
Pydantic doesn’t just validate data - it enforces structure on LLM outputs. By defining schemas upfront, you turn unpredictable text generation into a reliable data pipeline.
Web scraping is just the start. Let’s tackle a harder problem: extracting domain-specific technical data where errors have real-world consequences.
Example 2: Parsing Technical PDFs
The Problem:
You’re handed a 31-page PDF datasheet for the LM317 voltage regulator. Traditional PDF parsers choke on the mix of tables, diagrams, and technical jargon. Even LLMs struggle - they might extract current ratings as strings (“1.5A”) instead of numbers, or miss nested specifications entirely.
The Pydantic Power Play:
Step 1: Model the Electronics Domain
from pydantic import BaseModel, Field
from typing import List
class VoltageRange(BaseModel):
    """Voltage specification with min/max range."""
    min_voltage: float = Field(..., description="Minimum voltage in volts")
    max_voltage: float = Field(..., description="Maximum voltage in volts")
    unit: str = Field("V", description="Voltage unit")

class PinConfiguration(BaseModel):
    """Pin layout specification."""
    pin_count: int = Field(..., description="Number of pins")
    layout: str = Field(..., description="Detailed pin layout description")

class LM317Spec(BaseModel):
    """Complete specification for LM317 voltage regulator."""
    component_name: str = Field(..., description="Name of the component")
    output_voltage: VoltageRange  # ← Nested model!
    dropout_voltage: float = Field(..., description="Dropout voltage in volts")
    max_current: float = Field(..., description="Maximum current rating in amperes")
    pin_configuration: PinConfiguration  # ← Another nested model
These models act like technical blueprints. The nested VoltageRange ensures even complex specs stay structured.
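To confirm the blueprint behaves, you can feed the models a hand-written dictionary before wiring up any PDF tooling. The sample data here is invented for illustration:

sample = {
    "component_name": "LM317T Voltage Regulator",
    "output_voltage": {"min_voltage": 1.2, "max_voltage": 37.0},  # unit falls back to "V"
    "dropout_voltage": 2.0,
    "max_current": 1.5,
    "pin_configuration": {"pin_count": 3, "layout": "TO-220: 1=Adjust, 2=Output, 3=Input"},
}

spec = LM317Spec(**sample)
print(spec.output_voltage.max_voltage)  # 37.0 - nested specs become typed attributes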
Step 2: Configure Your PDF Extraction Pipeline
from llama_cloud_services import LlamaExtract
from llama_cloud import ExtractConfig
from dotenv import load_dotenv
# Load environment variables
load_dotenv(override=True)

# Initialize with your API credentials
llama_extract = LlamaExtract(
    project_id="your-project-id",
    organization_id="your-org-id"
)

# Create AI agent with our schema
agent = llama_extract.create_agent(
    name="lm317-datasheet",
    data_schema=LM317Spec,  # ← Our Pydantic model becomes the extraction template
    config=ExtractConfig(extraction_mode="BALANCED")
)
Step 3: Extract & Validate in One Move
import json
# Process PDF and get validated output
lm317_extract = agent.extract("./data/lm317.pdf")
print(json.dumps(lm317_extract.data, indent=2))
Output:
{
"component_name": "LM317T Voltage Regulator",
"output_voltage": {
"min_voltage": 1.2,
"max_voltage": 37.0,
"unit": "V"
},
"dropout_voltage": 2.0,
"max_current": 1.5,
"pin_configuration": {
"pin_count": 3,
"layout": "TO-220: 1=Adjust, 2=Output, 3=Input"
}
}
Why This Matters for Our Clients (Hardware Engineers)
- Nested Validation: Pydantic checks both top-level fields and their nested structures. If the LLM returns “1.5 amps” instead of 1.5, you’ll get an error.
- Domain-Specific Modeling: The VoltageRange class encodes electronics-specific logic that generic schemas miss.
- Future-Proofing: Adding thermal characteristics? Just extend the model - no parser rewrite needed.
Without Pydantic, you’d be stuck with:
raw_data = json.loads(llm_output)
try:
    min_v = float(raw_data["output_voltage"]["min"].replace("V", ""))
except KeyError:
    # Which level failed? We can't tell!
    pass
Result: Brittle code that breaks when document formats change.
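With the Pydantic models, the same failure diagnoses itself: the ValidationError names the exact nested path that broke. A minimal sketch, assuming the LM317Spec model from Step 1 is in scope:

from pydantic import ValidationError

bad_payload = {
    "component_name": "LM317T",
    "output_voltage": {"min_voltage": "1.2V", "max_voltage": 37.0},  # string where a float belongs
    "dropout_voltage": 2.0,
    "max_current": "1.5 amps",
    "pin_configuration": {"pin_count": 3, "layout": "TO-220"},
}

try:
    LM317Spec(**bad_payload)
except ValidationError as e:
    for err in e.errors():
        print(err["loc"], err["msg"])
# ('output_voltage', 'min_voltage') and ('max_current',) show up in the error locations,
# so you know exactly which level of the document failed.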
Pydantic transforms LLMs from text generators into reliable sources of structured data. By combining domain modeling with automatic validation, you get production-ready specs from day one.
Summary
This guide demonstrated Pydantic’s power through two practical examples: an LLM-powered web scraper and a technical PDF parser.
Key Benefits
- Type Safety: Catch data inconsistencies before they break your pipeline
- Self-Documentation: Schemas serve as living documentation
- Tool Agnostic: Switch LLM providers without rewriting validation logic
Next Steps
Try applying these patterns to your own data sources:
- Emails → CRM entries
- Meeting transcripts → Action items
- Research papers → Citation graphs
Conclusion
Pydantic turns the biggest challenge in LLM projects, unreliable outputs, into a solved problem. By defining schemas upfront, you get production-ready data validation from day one.
Your schemas outlive your tools. Switch LLM providers? Redesign your UI? Your validated data pipeline remains untouched.
This post is just the tip of the iceberg. The Pydantic Documentation and Real Python’s Pydantic Guide are great places to start learning more about all that Pydantic has to offer. I also highly recommend checking out PydanticAI, the Pydantic team’s framework for building LLM-powered applications.