PydanticAI Fundamentals: From Web Scraping to PDF Parsing

A beginner-friendly introduction to structured data validation with LLMs
Python
AI
Pydantic
Document Parsing
Web Scraping
Published

April 15, 2025

Introduction: The Data Chaos of LLMs

Tired of appending "please output JSON" to every LLM prompt like a digital prayer? Large Language Models (LLMs) have revolutionized how we interact with unstructured data, but plugging their free-form outputs into traditional software workflows is unpredictable and hard to scale.

Enter Pydantic, the Python library that’s quietly become the backbone of modern AI development.

In this guide, we’ll build two real-world tools that show why PydanticAI is a non-negotiable for LLM projects:

  • A web scraper to extract product data from an e-commerce site.
  • A PDF parser to decode complex technical schematics.

By the end, you’ll learn:

  • How to define self-validating data contracts
  • Why nested schemas matter for real-world complexity
  • The 3-line validation pattern every LLM pipeline needs

Why Should I Care About Pydantic?

PydanticAI (a marriage of Pydantic’s data validation and LLM-friendly features) solves the #1 problem in AI workflows: trusting your LLM’s output. It lets you:

  • Define strict data models (e.g., “A product must have a price and SKU”).
  • Automatically validate LLM outputs against those models.
  • Ensure type-safe validation that catches errors early.
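
Before diving in, here's the three-line validation pattern in miniature - a toy sketch where the hard-coded llm_response string stands in for a real model call:

from pydantic import BaseModel, ValidationError

class Invoice(BaseModel):
    """A toy schema - any BaseModel works the same way."""
    total: float
    currency: str

llm_response = '{"total": 42.5, "currency": "USD"}'  # stand-in for a real LLM call

try:
    invoice = Invoice.model_validate_json(llm_response)  # 1. parse + validate
    print(invoice.total)                                 # 2. use typed data
except ValidationError as e:                             # 3. or fail loudly
    print(e)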

Let’s see it in action.


Example 1: Web Scraping an E-Commerce Site

The Problem:

You scrape a product page, but the raw HTML is a jungle of nested divs and inconsistent classes. While LLMs can extract data, their outputs are frustratingly unpredictable. One moment the price is “$10.00”, the next it’s “ten dollars” - and sometimes the SKU is missing entirely.

The PydanticAI Fix:

Step 1: Define Your Data Contract

from pydantic import BaseModel, Field

class Product(BaseModel):
    """Schema for e-commerce product data extraction."""
    name: str = Field(..., description="Product name from the webpage")
    price: str = Field(..., description="Current price including currency")
    rating: float | None = Field(None, description="User rating between 1-5 stars")
    features: list[str] = Field(..., description="Key product features")
Why this matters

This model acts as a “data contract” - it explicitly defines what you expect from the LLM. Even optional fields like rating are declared upfront.
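
To see the contract at work, hand it a payload with the required price missing (a quick, hypothetical example - not part of the scraper itself):

from pydantic import ValidationError

bad_payload = '{"name": "Widget", "rating": 4.5, "features": ["small"]}'  # no price

try:
    Product.model_validate_json(bad_payload)
except ValidationError as e:
    print(e)  # names the exact failure: price -> Field required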

Step 2: Configure Your LLM-Powered Scraper

import os
from crawl4ai import LLMConfig, AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy

llm_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="openai/gpt-4o",
        api_token=os.getenv("OPENAI_API_KEY")
    ),
    schema=Product.model_json_schema(),  # Pydantic v2; replaces the deprecated .schema()
    extraction_type="schema",
    instruction="Extract product details from the e-commerce page",
    chunk_token_threshold=2048,
    verbose=True
)
Key Insight

By passing Product.model_json_schema() to crawl4ai, we’re “locking in” the LLM’s output format to match our Pydantic model. No more guessing games.
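
If you're curious what the LLM is actually constrained by, you can print the generated JSON Schema yourself:

import json

# The JSON Schema derived from our Pydantic model - this is what guides the LLM
print(json.dumps(Product.model_json_schema(), indent=2))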

Step 3: Execute & Validate in One Shot

import asyncio

from pydantic import ValidationError

async def main():
    # Set up crawler configuration
    browser_config = BrowserConfig(
        headless=True,
        verbose=True,
        extra_args=["--disable-gpu", "--no-sandbox"]
    )

    crawl_config = CrawlerRunConfig(
        extraction_strategy=llm_strategy,
        cache_mode=CacheMode.BYPASS,
        word_count_threshold=100
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.amazon.com/alm/storefront/?almBrandId=VUZHIFdob2xlIEZvb2Rz",
            config=crawl_config
        )
        
        if result.success:
            try:
                # Validate the extraction with Pydantic (v2 API)
                product_data = Product.model_validate_json(result.extracted_content)
                print(f"Extracted product: {product_data.model_dump_json(indent=2)}")
            except ValidationError as e:
                print(f"Validation error: {e}")
                print(f"Raw content: {result.extracted_content}")
        else:
            print("Extraction failed:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())

Output:

{
  "name": "Annie's Frozen Pizza Poppers",
  "price": "$4.59",
  "rating": null,
  "features": ["Three Cheese", "Snacks", "6.8 oz", "15 ct"]
}

Why This Approach Wins

  • Self-Documenting Schemas: The Field(..., description="...") syntax acts as embedded documentation for future developers (or your future self).
  • Fail-Fast Validation: The moment the LLM hallucinates a field or mangles a type (e.g., returns a rating of “five stars” instead of 4.5), Product.model_validate_json() raises a ValidationError. No silent failures.
  • Toolchain Agnostic: Swap out crawl4ai for BeautifulSoup or Scrapy - your validation logic stays the same.
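
To make the last point concrete, here's the same Product contract validating output from a completely different toolchain (a hedged sketch - ask_llm is a hypothetical stand-in for whatever LLM client you use):

import requests
from bs4 import BeautifulSoup

def extract_product(url: str, ask_llm) -> Product:
    """Same validation contract, different toolchain."""
    html = requests.get(url, timeout=30).text
    text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
    raw_json = ask_llm(f"Extract product JSON from: {text[:4000]}")  # hypothetical LLM call
    return Product.model_validate_json(raw_json)  # the validation step never changes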

Without PydanticAI, you’d be stuck writing:

# Manual validation hell  
raw_data = json.loads(llm_output)  
if "name" not in raw_data:  
    raise ValueError("Missing field: name")  
if not isinstance(raw_data.get("features"), list):  
    raise TypeError("Features must be a list")  
...  

Result: Code that’s far longer, brittle to schema changes, and painful to maintain at scale.

Key Takeaway

PydanticAI doesn’t just validate data - it enforces structure on LLMs. By defining schemas upfront, you turn unpredictable text generation into a reliable data pipeline.

Web scraping is just the start. Let’s tackle a harder problem: extracting domain-specific technical data where errors have real-world consequences.


Example 2: Parsing Technical PDFs

The Problem:

You’re handed a 31-page PDF datasheet for the LM317 voltage regulator. Traditional PDF parsers choke on the mix of tables, diagrams, and technical jargon. Even LLMs struggle - they might extract current ratings as strings (“1.5A”) instead of numbers, or miss nested specifications entirely.

The PydanticAI Power Play:

Step 1: Model the Electronics Domain

from pydantic import BaseModel, Field

class VoltageRange(BaseModel):
    """Voltage specification with min/max range."""
    min_voltage: float = Field(..., description="Minimum voltage in volts")
    max_voltage: float = Field(..., description="Maximum voltage in volts")
    unit: str = Field("V", description="Voltage unit")

class PinConfiguration(BaseModel):
    """Pin layout specification."""
    pin_count: int = Field(..., description="Number of pins")
    layout: str = Field(..., description="Detailed pin layout description")

class LM317Spec(BaseModel):
    """Complete specification for LM317 voltage regulator."""
    component_name: str = Field(..., description="Name of the component")
    output_voltage: VoltageRange  # ← Nested model!
    dropout_voltage: float = Field(..., description="Dropout voltage in volts")
    max_current: float = Field(..., description="Maximum current rating in amperes")
    pin_configuration: PinConfiguration  # ← Another nested model
Why this matters

These models act like technical blueprints. The nested VoltageRange ensures even complex specs stay structured.
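
A quick illustration of what that buys you - feed the model a payload with a bad nested value (hypothetical numbers, not from the datasheet) and Pydantic flags the exact inner field:

from pydantic import ValidationError

try:
    LM317Spec.model_validate({
        "component_name": "LM317",
        "output_voltage": {"min_voltage": 1.25, "max_voltage": "thirty-seven"},  # bad nested value
        "dropout_voltage": 3.0,
        "max_current": 1.5,
        "pin_configuration": {"pin_count": 3, "layout": "1=ADJ, 2=OUT, 3=IN"},
    })
except ValidationError as e:
    print(e)  # points at output_voltage.max_voltage specifically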

Step 2: Configure Your PDF Extraction Pipeline

from llama_cloud_services import LlamaExtract
from llama_cloud import ExtractConfig
from dotenv import load_dotenv

# Load environment variables
load_dotenv(override=True)

# Initialize with your API credentials
llama_extract = LlamaExtract(
    project_id="your-project-id", 
    organization_id="your-org-id"
)

# Create AI agent with our schema
agent = llama_extract.create_agent(
    name="lm317-datasheet",
    data_schema=LM317Spec,  # ← Our Pydantic model becomes the extraction template
    config=ExtractConfig(extraction_mode="BALANCED")
)

Step 3: Extract & Validate in One Move

import json

# Process PDF and get validated output
lm317_extract = agent.extract("./data/lm317.pdf")

print(json.dumps(lm317_extract.data, indent=2))

Output:

{
  "component_name": "LM317T Voltage Regulator",
  "output_voltage": {
    "min_voltage": 1.2,
    "max_voltage": 37.0,
    "unit": "V"
  },
  "dropout_voltage": 2.0,
  "max_current": 1.5,
  "pin_configuration": {
    "pin_count": 3,
    "layout": "TO-220: 1=Adjust, 2=Output, 3=Input"
  }
}
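
The extraction result arrives as plain data; if you want it back as a typed object, re-validate it against the same schema (a small follow-up, assuming lm317_extract.data is the dict shown above):

# Re-validate the raw dict into a typed, attribute-accessible model
spec = LM317Spec.model_validate(lm317_extract.data)
print(f"Max current: {spec.max_current} A across {spec.pin_configuration.pin_count} pins")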

Why This Matters for Hardware Engineers

  • Nested Validation: PydanticAI checks both top-level fields and their nested structures. If the LLM returns “1.5 amps” instead of 1.5, you’ll get an error.
  • Domain-Specific Modeling: The VoltageRange class encodes electronics-specific logic that generic schemas miss.
  • Future-Proofing: Adding thermal characteristics? Just extend the model - no parser rewrite needed.
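
For the last point, extending the schema is just a subclass (a sketch - the thermal field names are hypothetical, not from the datasheet):

from pydantic import Field

class LM317SpecV2(LM317Spec):
    """Adds thermal limits; every existing field and validation carries over."""
    min_operating_temp_c: float = Field(..., description="Minimum operating temperature in °C")
    max_operating_temp_c: float = Field(..., description="Maximum operating temperature in °C")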

Without PydanticAI, you’d be stuck with:

raw_data = json.loads(llm_output)
try:
    min_v = float(raw_data["output_voltage"]["min"].replace("V", ""))
except KeyError:
    # Which level failed? We can't tell!
    pass

Result: Brittle code that breaks when document formats change.
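
Pydantic, by contrast, tells you exactly which level failed: every entry in ValidationError.errors() carries a loc path into the nested structure. A small demonstration with a deliberately broken payload:

from pydantic import ValidationError

try:
    LM317Spec.model_validate({"component_name": "LM317", "output_voltage": {}})
except ValidationError as e:
    for err in e.errors():
        print(err["loc"], err["msg"])
        # e.g. ('output_voltage', 'min_voltage') Field required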

Key Takeaway

PydanticAI transforms LLMs from text generators into structured data engineers. By combining domain modeling with automatic validation, you get production-ready specs from day one.


Summary

This guide demonstrated PydanticAI’s power through two practical examples:

  1. Web Scraping (Example 1): Extracting structured product data from messy HTML
  2. PDF Parsing (Example 2): Converting technical documents into validated schemas

Key Benefits

  • Type Safety: Catch data inconsistencies before they break your pipeline
  • Self-Documentation: Schemas serve as living documentation
  • Tool Agnostic: Switch LLM providers without rewriting validation logic

Next Steps

Try applying these patterns to your own data sources:

  • Emails → CRM entries (a starter schema is sketched below)
  • Meeting transcripts → Action items
  • Research papers → Citation graphs
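
As a starting point for the first idea, the pattern is identical - only the schema changes (a hypothetical sketch; the field names are illustrative):

from pydantic import BaseModel, Field

class CRMEntry(BaseModel):
    """Schema for turning a raw email into a CRM record."""
    contact_name: str = Field(..., description="Sender's full name")
    company: str | None = Field(None, description="Company, if one is mentioned")
    intent: str = Field(..., description="What the sender wants, in one sentence")
    follow_up_required: bool = Field(..., description="Whether a reply is needed")

# entry = CRMEntry.model_validate_json(llm_output)  # same 3-line pattern as before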

Conclusion

PydanticAI transforms the biggest challenge in LLM projects - unreliable outputs - into a solved problem. By defining schemas upfront, you get production-ready data validation from day one.

The Ultimate Benefit

Your schemas outlive your tools. Switch LLM providers? Redesign your UI? Your validated data pipeline remains untouched.

Dive Deeper into Pydantic

This post is just the tip of the iceberg. The Pydantic Documentation and Real Python’s Pydantic Guide are great places to keep learning about everything Pydantic has to offer. I also highly recommend checking out the PydanticAI framework itself - the Pydantic team’s toolkit for building full LLM-powered applications.