OpenAI introduced strict Structured Outputs in August 2024, relegating JSON Mode to legacy status, according to OpenAI, 2024, and within months every major LLM provider followed with schema enforcement of its own. The shift reflects a broader industry pattern: production LLM applications need guaranteed output schemas, not “probably valid JSON.” This tutorial covers the schema-first development pattern, implementation with Zod (TypeScript) and Pydantic (Python), integration with the OpenAI, Anthropic, and Google APIs, and the critical step most teams skip: verifying the factual content inside the valid structure.
- JSON Mode is now legacy; strict JSON schema mode is the production default across OpenAI, Anthropic, and Google.
- Defining the output schema before writing prompts reduces parsing errors by up to 90%.
- Zod (TypeScript) and Pydantic (Python) serve as the contract layer between your application and the LLM.
- Structural validity does not equal factual accuracy; structured output with hallucinated values is still wrong.
- The production pattern is: define schema, generate, validate structure, verify facts with Webcite.
Why Did JSON Mode Become Legacy?
JSON Mode, introduced by OpenAI in November 2023, solved one problem: it guaranteed the model would output valid JSON. But it created another: the JSON could be any valid JSON. The model might return {"answer": "42"}, or it might return {"result": {"data": [1,2,3], "meta": null}}, for the same prompt. Your application had to parse an unpredictable structure and handle every possible shape the model might produce.
This led to brittle code. Developers wrapped every JSON Mode call in try/catch blocks, added fallback parsing logic, and built retry mechanisms for when the model returned valid JSON in the wrong shape. Approximately 15% to 25% of JSON Mode responses required post-processing or retries due to schema mismatches, according to internal benchmarks shared by teams on the OpenAI developer forum, OpenAI Community, 2024.
Structured Outputs, launched by OpenAI in August 2024, changed the contract. You provide a JSON schema, and the model is constrained during token generation to only produce outputs matching that schema. Not “usually” matching, not “mostly” matching, but provably matching. OpenAI uses constrained decoding, which modifies the token sampling process to zero out the probability of any token that would violate the schema.
The result is 100% schema conformance. No parsing errors. No retry logic. No fallback handling. The model either returns a schema-conforming response or an explicit refusal on content-policy grounds, according to OpenAI Structured Outputs documentation, 2024.
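Conceptually, the mechanism looks like the sketch below. This is an illustration, not any provider’s actual implementation; validNextTokens is a hypothetical helper standing in for the grammar machinery a real constrained decoder derives from the schema:

// Illustrative only: mask out every token that would violate the schema
// before sampling. softmax(-Infinity) assigns those tokens zero probability.
function maskLogits(logits: number[], allowed: Set<number>): number[] {
  return logits.map((logit, tokenId) =>
    allowed.has(tokenId) ? logit : -Infinity
  )
}

// At each generation step (pseudocode):
// const allowed = validNextTokens(schema, tokensSoFar)  // hypothetical helper
// const nextToken = sample(softmax(maskLogits(logits, allowed)))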
Anthropic followed with schema-enforced tool use. Claude generates arguments matching a defined JSON schema when calling tools, according to Anthropic documentation, 2025.
Google Gemini added its own constraint mechanism, response_schema, a generation config parameter that forces output to match a JSON schema, according to Google AI documentation, 2025. The industry consensus is clear: unstructured LLM output is for prototypes; structured output is for production.
What Does This Development Pattern Look Like?
This approach inverts the typical LLM application workflow. Instead of starting with prompts and parsing the output, you start with the output schema and build everything else around it.
Step 1: Define the Schema
The schema is the single source of truth for your LLM output. In TypeScript, use Zod. In Python, use Pydantic. Both libraries generate JSON Schema definitions that LLM providers accept directly.
Here is a Zod schema for a product review analysis:
import { z } from "zod"

const ReviewAnalysis = z.object({
  sentiment: z.enum(["positive", "negative", "neutral"]),
  confidence: z.number().min(0).max(1),
  key_themes: z.array(z.string()).min(1).max(5),
  summary: z.string().max(200),
  factual_claims: z.array(
    z.object({
      claim: z.string(),
      requires_verification: z.boolean()
    })
  ),
  recommendation: z.enum(["highlight", "flag", "ignore"])
})

type ReviewAnalysis = z.infer<typeof ReviewAnalysis>
And the equivalent Pydantic schema in Python:
from pydantic import BaseModel, Field
from typing import Literal

class FactualClaim(BaseModel):
    claim: str
    requires_verification: bool

class ReviewAnalysis(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    confidence: float = Field(ge=0, le=1)
    key_themes: list[str] = Field(min_length=1, max_length=5)
    summary: str = Field(max_length=200)
    factual_claims: list[FactualClaim]
    recommendation: Literal["highlight", "flag", "ignore"]
The schema defines types, constraints, enumerations, and nesting. Every downstream function in your application can rely on this structure being present and valid. No null checks, no optional chaining, no defensive parsing.
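For example, a downstream function can accept the inferred type directly; a sketch using the ReviewAnalysis type above (routeReview and the queue names are illustrative):

function routeReview(analysis: ReviewAnalysis): string {
  // Every field is guaranteed present and correctly typed by the schema,
  // so there is no defensive parsing here
  if (analysis.recommendation === "flag") return "moderation-queue"
  return analysis.sentiment === "positive" ? "highlights" : "archive"
}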
Step 2: Write Prompts That Populate the Schema
With the schema defined, prompts become instructions for populating specific fields rather than open-ended requests. The prompt does not need to describe the output format; the schema handles that. The prompt focuses on the task:
Analyze the following product review. Identify the overall sentiment, extract up to 5 key themes, write a summary under 200 characters, and flag any factual claims that should be independently verified.
Review: {review_text}
This separation of concerns (schema for structure, prompt for semantics) makes both easier to maintain and iterate on.
Step 3: Generate with Schema Enforcement
The API call includes the schema as a parameter, and the model is constrained to produce conforming output.
How Do You Implement Structured Outputs with Each Provider?
Each major LLM provider supports structured outputs through slightly different API surfaces. Here is the implementation pattern for each.
OpenAI (GPT-4o, GPT-4o-mini)
OpenAI’s Structured Outputs require the response_format parameter (type: “json_schema”, strict: true) for constrained decoding:
import OpenAI from "openai"
import { zodResponseFormat } from "openai/helpers/zod"

const openai = new OpenAI()

const response = await openai.chat.completions.create({
  model: "gpt-4o-2024-08-06",
  messages: [
    { role: "system", content: "Analyze product reviews." },
    { role: "user", content: `Analyze this review: ${reviewText}` }
  ],
  response_format: zodResponseFormat(ReviewAnalysis, "review_analysis")
})

const parsed = JSON.parse(response.choices[0].message.content ?? "")
const validated = ReviewAnalysis.parse(parsed) // Zod validation; throws on mismatch
zodResponseFormat converts the Zod schema to OpenAI’s JSON Schema format, the model generates output matching that schema, and the final .parse() call adds runtime type safety on the application side, according to OpenAI Node.js SDK documentation, 2024.
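The SDK also ships a beta parse helper that returns an already-validated object, removing the manual JSON.parse step. A minimal sketch, assuming the openai-node beta helpers available at the time of writing:

const completion = await openai.beta.chat.completions.parse({
  model: "gpt-4o-2024-08-06",
  messages: [{ role: "user", content: `Analyze this review: ${reviewText}` }],
  response_format: zodResponseFormat(ReviewAnalysis, "review_analysis")
})

// .parsed is typed against the Zod schema; it is null if the model refused
const analysis = completion.choices[0].message.parsed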
Anthropic (Claude 3.5 Sonnet, Claude 3 Opus)
Anthropic supports structured output through tool use. You define a tool with the desired output schema, and Claude generates arguments matching that schema:
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[{
        "name": "analyze_review",
        "description": "Analyze a product review and return structured results",
        "input_schema": ReviewAnalysis.model_json_schema()
    }],
    # Force the tool call so a tool_use block is guaranteed to be present
    tool_choice={"type": "tool", "name": "analyze_review"},
    messages=[{
        "role": "user",
        "content": f"Use the analyze_review tool on this review: {review_text}"
    }]
)

# Extract the tool use block from the response content
tool_result = next(
    block for block in response.content
    if block.type == "tool_use"
)
validated = ReviewAnalysis.model_validate(tool_result.input)
Anthropic’s tool use constrains the model to generate valid arguments that match the tool’s schema. .model_json_schema() from Pydantic produces the JSON Schema definition that the API accepts, according to Anthropic documentation, 2025.
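For TypeScript, the same pattern works through the Anthropic Node SDK; a sketch assuming the zod-to-json-schema package to convert the Zod schema from Step 1 into the JSON Schema the tool definition expects:

import Anthropic from "@anthropic-ai/sdk"
import { zodToJsonSchema } from "zod-to-json-schema"

const client = new Anthropic()

const message = await client.messages.create({
  model: "claude-3-5-sonnet-20241022",
  max_tokens: 1024,
  tools: [{
    name: "analyze_review",
    description: "Analyze a product review and return structured results",
    input_schema: zodToJsonSchema(ReviewAnalysis) as any // JSON Schema for the tool
  }],
  tool_choice: { type: "tool", name: "analyze_review" }, // force the tool call
  messages: [{ role: "user", content: `Analyze this review: ${reviewText}` }]
})

// Find the tool_use block and re-validate its arguments with the same schema
const toolUse = message.content.find((block) => block.type === "tool_use")
const validated = ReviewAnalysis.parse((toolUse as any)?.input)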
Google Gemini
Google Gemini constrains structured output via response_schema (inside generation_config):
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

model = genai.GenerativeModel(
    "gemini-1.5-pro",
    generation_config=genai.GenerationConfig(
        response_mime_type="application/json",
        # Note: Gemini accepts a subset of JSON Schema; deeply nested models
        # whose model_json_schema() output uses $ref/$defs may need flattening
        response_schema=ReviewAnalysis.model_json_schema()
    )
)

response = model.generate_content(
    f"Analyze this product review: {review_text}"
)

validated = ReviewAnalysis.model_validate_json(response.text)
Gemini constrains generation to match the provided schema, producing output that Pydantic can validate directly, according to Google AI documentation, 2025.
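For TypeScript, the @google/generative-ai SDK expresses the schema with its own SchemaType enum rather than raw JSON Schema. An abbreviated sketch (only two of the ReviewAnalysis fields shown, for brevity):

import { GoogleGenerativeAI, SchemaType } from "@google/generative-ai"

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!)

const model = genAI.getGenerativeModel({
  model: "gemini-1.5-pro",
  generationConfig: {
    responseMimeType: "application/json",
    responseSchema: {
      type: SchemaType.OBJECT,
      properties: {
        sentiment: { type: SchemaType.STRING },
        confidence: { type: SchemaType.NUMBER }
      },
      required: ["sentiment", "confidence"]
    }
  }
})

const result = await model.generateContent(`Analyze this review: ${reviewText}`)
const analysis = JSON.parse(result.response.text()) // then validate with Zod as before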
Why Structural Validity Is Not Enough
Schema enforcement solves the parsing problem. The output is always valid JSON in the right shape. But it does not solve the accuracy problem. Consider this perfectly valid structured output:
{
  "sentiment": "positive",
  "confidence": 0.92,
  "key_themes": ["durability", "value", "customer service"],
  "summary": "Customers report 98% satisfaction rate and average product lifespan of 7 years.",
  "factual_claims": [
    {
      "claim": "98% customer satisfaction rate",
      "requires_verification": true
    },
    {
      "claim": "Average product lifespan of 7 years",
      "requires_verification": true
    }
  ],
  "recommendation": "highlight"
}
The JSON is valid. The types are correct. Every field conforms to the schema. But the “98% satisfaction rate” and “7 year lifespan” claims may be completely fabricated. The model generated plausible-sounding statistics that fit the schema constraints (they are strings, so they pass type checking) but have no basis in reality.
This is the structured hallucination problem: well-formatted lies. It is arguably more dangerous than unstructured hallucination because the clean structure creates an illusion of reliability. A developer seeing valid typed output is less likely to question the content than one parsing messy text.
Research from Vectara found that even the best LLMs hallucinate 3% to 5% of the time on straightforward factual questions, and the rate increases significantly on domain-specific or numerical queries, according to Vectara Hallucination Leaderboard, 2025. Structured output does not change this rate; it changes the packaging.
How to Add Verification to the Output Pipeline
The complete pipeline has four stages: define, generate, validate, verify. Most teams implement the first three and skip the fourth. Here is the full pattern.
Stage 1: Define
Define your schema in Zod or Pydantic, and include a factual_claims array that explicitly asks the model to extract verifiable claims from its own output. This makes the verification step simpler because the model pre-identifies what needs checking.
Stage 2: Generate
Call the LLM API with schema enforcement. The model produces conforming JSON.
Stage 3: Validate Structure
Parse the response with your schema library. Call .safeParse() (Zod) or .model_validate() (Pydantic) to confirm structural validity and catch edge cases like values outside allowed ranges.
Stage 4: Verify Facts
Send each factual claim to the Webcite verification API. Replace or flag claims that are not supported by real-world sources.
Here is the full pipeline in TypeScript:
import { z } from "zod"
import OpenAI from "openai"
import { zodResponseFormat } from "openai/helpers/zod"

// Stage 1: Define
const ResearchOutput = z.object({
  topic: z.string(),
  summary: z.string().max(500),
  claims: z.array(
    z.object({
      statement: z.string(),
      source_hint: z.string().optional()
    })
  ),
  confidence: z.number().min(0).max(1)
})

// Stage 2: Generate
const openai = new OpenAI()
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    { role: "system", content: "Research the given topic. Extract specific factual claims." },
    { role: "user", content: "Enterprise AI adoption rates in 2025" }
  ],
  response_format: zodResponseFormat(ResearchOutput, "research_output")
})

// Stage 3: Validate
const parsed = ResearchOutput.safeParse(
  JSON.parse(response.choices[0].message.content ?? "")
)
if (!parsed.success) {
  // Stop here: parsed.data is only defined when validation succeeds
  throw new Error(`Schema validation failed: ${parsed.error}`)
}

// Stage 4: Verify
const verifiedClaims = await Promise.all(
  parsed.data.claims.map(async (claim) => {
    const verifyResponse = await fetch("https://api.webcite.co/api/v1/verify", {
      method: "POST",
      headers: {
        "x-api-key": process.env.WEBCITE_API_KEY!,
        "Content-Type": "application/json"
      },
      body: JSON.stringify({
        claim: claim.statement,
        include_stance: true,
        include_verdict: true
      })
    })
    const result = await verifyResponse.json()
    return {
      ...claim,
      verdict: result.verdict?.result,
      confidence: result.verdict?.confidence,
      citations: result.citations
    }
  })
)

// Filter to only supported claims
const supportedClaims = verifiedClaims.filter(
  (c) => c.verdict === "supported" && c.confidence > 80
)
And the equivalent in Python:
import requests
from openai import OpenAI
from pydantic import BaseModel, Field

# Stage 1: Define
class Claim(BaseModel):
    statement: str
    source_hint: str | None = None

class ResearchOutput(BaseModel):
    topic: str
    summary: str = Field(max_length=500)
    claims: list[Claim]
    confidence: float = Field(ge=0, le=1)

# Stage 2: Generate
client = OpenAI()
response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Research the given topic. Extract specific factual claims."},
        {"role": "user", "content": "Enterprise AI adoption rates in 2025"}
    ],
    response_format=ResearchOutput
)

# Stage 3: Validate (the .parse() helper validates against the Pydantic model)
output = response.choices[0].message.parsed

# Stage 4: Verify
verified_claims = []
for claim in output.claims:
    verify_response = requests.post(
        "https://api.webcite.co/api/v1/verify",
        headers={
            "x-api-key": "your-api-key",
            "Content-Type": "application/json"
        },
        json={
            "claim": claim.statement,
            "include_stance": True,
            "include_verdict": True
        }
    )
    result = verify_response.json()
    verified_claims.append({
        "statement": claim.statement,
        "verdict": result.get("verdict", {}).get("result"),
        "confidence": result.get("verdict", {}).get("confidence"),
        "citations": result.get("citations", [])
    })
Webcite’s free tier includes 50 credits per month for testing this pipeline, and each verification uses 4 credits. The Builder plan at $20 per month provides 500 credits, enough for 125 verifications; enterprise plans start at 10,000+ credits with custom pricing. Authentication uses the x-api-key header.
What Patterns Work Best for Production Schema Design?
Five schema design patterns have emerged as best practices across the LLM application ecosystem:
Pattern 1: Explicit Claim Extraction
Include a dedicated field for factual claims in every schema. This forces the model to separate opinions from facts and makes downstream verification straightforward. Use requires_verification (boolean) to let the model flag its own uncertain claims.
Pattern 2: Confidence Scoring
Add a confidence field (0 to 1) that the model self-reports. Research from Anthropic found that LLM confidence scores correlate with actual accuracy at r=0.72 when the model is explicitly asked to assess its own certainty, according to Kadavath et al., Anthropic, 2022. This is not reliable enough to replace verification, but it provides useful signal for prioritizing which claims to verify first.
Pattern 3: Source Hints
Include an optional source field where the model indicates where it believes its information comes from. This hint helps the verification API find relevant sources faster and helps developers debug hallucination patterns.
Pattern 4: Enum-Constrained Categories
Use enums for any categorical field. Instead of letting the model generate free-text categories (which creates inconsistency), constrain it to predefined values. This eliminates an entire class of downstream parsing issues.
Pattern 5: Nested Validation
Use nested objects with per-field constraints rather than flat schemas with string fields. Validating a date field as an ISO 8601 string is more reliable than using a generic string field with a prompt instruction to “use ISO 8601 format.” Zod and Pydantic both support rich constraint definitions that the LLM providers honor during constrained decoding.
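A sketch combining Patterns 2 through 5 in a single Zod schema (the Event schema and its fields are illustrative):

const Event = z.object({
  category: z.enum(["launch", "acquisition", "funding"]), // Pattern 4: enum-constrained
  occurred_at: z.string().datetime(),                     // Pattern 5: ISO 8601 constraint
  source_hint: z.string().optional(),                     // Pattern 3: source hint
  confidence: z.number().min(0).max(1)                    // Pattern 2: self-reported confidence
})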
The Instructor library by Jason Liu, which has over 7,000 GitHub stars, provides a higher-level abstraction over these patterns for both Python and TypeScript, according to Instructor documentation, 2025. It wraps OpenAI, Anthropic, and other providers with Pydantic schema support, automatic retries on validation failures, and streaming support for structured outputs.
Schema-Centric Development and AI Verification Together
Schema-centric development and external verification solve complementary problems. Schemas guarantee structure; verification guarantees accuracy. Together, they eliminate the two most common failure modes in production LLM applications: unparseable output and factually incorrect output.
The workflow becomes predictable. Your application code never handles malformed responses because the schema prevents them. Your users never see hallucinated facts because verification catches them. The error surface shrinks to the narrow space between “structurally valid” and “factually supported,” and even that space is monitored.
For teams building production AI applications, this pattern (schemas plus external verification) is becoming the standard. For a broader look at how verification fits into AI content pipelines, see our guide on building a citation pipeline.
Frequently Asked Questions
What is structured output in LLMs?
Structured output is a mode where an LLM generates responses that conform to a predefined JSON schema rather than free-form text. The model is constrained during generation to only produce tokens that result in valid JSON matching the schema. This eliminates parsing errors and ensures the output can be programmatically consumed without post-processing.
What is the difference between JSON Mode and Structured Outputs?
JSON Mode guarantees valid JSON but does not enforce a specific schema. The model might return any valid JSON object. Structured Outputs (strict mode) guarantees both valid JSON and conformance to a specific schema you define. OpenAI introduced Structured Outputs in August 2024 as a replacement for the older JSON Mode, which is now considered legacy.
What is the schema-driven approach for LLM applications?
The schema-driven approach means defining the output schema before writing any other code. You define the schema (using Zod, Pydantic, or JSON Schema) before writing prompts or LLM integration code, and it becomes the contract between your application and the LLM. Prompts are written to populate the schema, not the other way around. This approach reduces parsing errors by up to 90% compared to unstructured text generation.
Which LLM providers support structured outputs?
OpenAI (GPT-4o, GPT-4o-mini) supports strict JSON schema mode since August 2024. Anthropic Claude supports tool use with JSON schemas. Google Gemini supports response schemas in the generationConfig. Open-source models via Outlines, Instructor, or vLLM also support constrained decoding. The feature is now available across all major providers.
How do I validate structured LLM output?
Use runtime validation libraries matching your schema: .parse() / .safeParse() (Zod, TypeScript) or .model_validate() (Pydantic, Python). These libraries check that the LLM output conforms to the expected types, required fields, and value constraints. After structural validation, use a verification API like Webcite to check factual claims.
Why should I verify structured output if the schema is already enforced?
Schema enforcement guarantees structure, not accuracy. A structured output can have perfectly valid JSON with correct types and required fields while containing fabricated statistics, hallucinated citations, or incorrect factual claims. Verification checks the content of the values, not just the shape of the data.