AI Engineering Architecture
Overview
This page describes the production architecture for foundation model applications, synthesized from Chip Huyen's AI Engineering (O'Reilly, 2025, Chapter 10). The architecture follows a progressive approach: start with the simplest possible system and add components incrementally as production needs justify them, rather than over-engineering from the start.
Despite the diversity of AI applications, they share a common set of components. This five-step architecture has been validated across multiple production deployments.
The Simplest Architecture
The minimal viable architecture receives a query and sends it directly to a model API, which returns the response to the user. No context augmentation, no guardrails, no optimization.
[User Query] → [Model API] → [Response]
The Model API box covers both third-party APIs (OpenAI, Anthropic, Google) and self-hosted models (requiring an inference server, discussed under Inference Optimization).
Starting here keeps early development fast. Complexity is added only when evidence from users or production metrics demands it.
Step 1: Enhance Context
The first expansion adds mechanisms for dynamic context construction — giving the model relevant information it doesn't have in its weights.
Context can be constructed via:
- Text retrieval (RAG): Search a vector database or keyword index to retrieve relevant document chunks
- Image/multimodal retrieval: Retrieve relevant images, tables, or structured data
- Tabular data retrieval: SQL query execution to retrieve structured records
- API-based augmentation: Real-time data via web search, weather, news, or internal API calls
Context construction is analogous to feature engineering in traditional ML: it provides the model with the data it needs to produce an accurate output.
Key consideration: Different model API providers differ in context construction support — upload limits, chunk sizes, retrieval algorithms, tool execution modes (sequential vs. parallel). When the generic API is insufficient, a specialized RAG solution may allow unlimited document upload bounded only by vector database capacity.
Step 2: Put in Guardrails
Guardrails protect the system and its users from risks. They apply to both inputs and outputs.
Input Guardrails
Two primary risks: 1. Leaking private information to external APIs: Employees or applications may inadvertently include sensitive data (PII, trade secrets, internal policies) in prompts sent to third-party APIs. 2. Prompt injection and jailbreaking: Attackers craft inputs that override system instructions or extract information the system should not reveal.
Mitigation: - PII/sensitive data detection: Automated tools scan prompts for PII (ID numbers, bank accounts, faces), IP keywords, or privileged phrases. If detected, the prompt can be blocked, redacted, or routed to a local model. - Injection detection: Pattern-based and model-based classifiers detect injection attempts.
Output Guardrails
Three primary output risks: 1. Empty or malformed responses: Model fails or returns invalid structure 2. Hallucinations: Factually incorrect content presented as fact 3. Unsafe content: Toxic, harmful, or policy-violating outputs
Output guardrails can include: - Model-based scorers: Small models evaluate output quality (relevance, faithfulness, toxicity) and trigger remediation (regenerate, escalate to human, return fallback) - Format validators: Ensure structured outputs (JSON, XML) conform to expected schemas - Content classifiers: Flag unsafe, off-topic, or legally risky content
Tradeoff: Output guardrails require additional model calls per response, adding latency and cost.
Step 3: Add Model Router and Gateway
As applications grow, managing multiple models and entry points requires additional infrastructure.
Router
A router uses an intent classifier or next-action predictor to route each query to the optimal solution:
- Model routing: Route simple queries to cheap/fast models; complex queries to frontier models
- Action routing: For agents, predict the next action (code interpreter, web search, human escalation)
- Memory routing: Determine which memory tier (in-context, semantic cache, vector store, database) to pull from
Routers are typically small, fast models (fine-tuned BERT, Llama-7B, or even rule-based classifiers) that add minimal latency while enabling significant cost savings.
Routing order: Routing often happens before retrieval (determine if the query is in scope, if retrieval is needed). Post-retrieval routing is also used (route to human when retrieval confidence is low).
Gateway
A model gateway provides a unified interface to multiple models, abstracting away provider-specific APIs:
- Unified interface: One API call regardless of underlying model (OpenAI, Google, Anthropic, self-hosted)
- Access control: Centralized management of API keys and per-team/user access permissions
- Cost monitoring: Track and limit API consumption per user, project, or model
- Fallback policies: Automatic failover when primary model is rate-limited or unavailable
- Versioning: Pin specific model versions to prevent unexpected behavior changes from provider updates
Step 4: Reduce Latency with Caches
Caching prevents redundant computation and dramatically reduces both latency and cost for repeated or similar queries.
Exact Caching
Cache results keyed by exact query or embedding hash:
- Response cache: If the same query (or the same product summary request) was answered before, return the cached result directly
- Embedding cache: If a query was already embedded and vector-searched, return cached retrieval results
- Tool result cache: Cache deterministic tool outputs (API responses, SQL results) with appropriate TTLs
Implementation: Redis (for fast in-memory lookup), PostgreSQL or tiered storage for larger caches. Eviction policies: LRU, LFU, or FIFO. Train a classifier to predict which queries are worth caching (avoid user-specific or time-sensitive queries).
Security risk: Improper caching can cause data leaks. User-specific responses must never be cached and returned to a different user.
Semantic Caching
Return cached results for semantically similar (not just identical) queries using embedding similarity: - Embed the incoming query and search a cache of past query embeddings - Return the cached response if cosine similarity exceeds a threshold
Particularly effective for FAQ agents, classification pipelines, and any high-volume use case with predictable query patterns.
Step 5: Add Agent Patterns
Simple sequential query-response flows can be upgraded to agentic patterns for complex tasks:
- Loops: Generated output is fed back into the model when the task is not yet complete (e.g., a search agent that determines its first retrieval was insufficient)
- Parallel execution: Multiple retrieval or tool calls dispatched concurrently
- Conditional branching: Different execution paths based on intermediate model outputs
- Write actions: Model outputs trigger real-world changes (compose email, place order, update database, initiate transfer)
Critical caution on write actions: Write actions dramatically expand what a system can accomplish, but they also introduce irreversible consequences. Write action capabilities should be granted only when necessary, with explicit human-in-the-loop checkpoints for high-impact operations, strict idempotency guarantees to prevent duplicate side effects, and audit logging for all write events.
Monitoring and Observability
Observability should be designed into the application, not added after. The more complex the system, the more critical this is.
Key distinctions: - Monitoring: Tracking external outputs of a system to detect when something goes wrong - Observability: Instrumenting the system so its internal state can be inferred from external outputs — enabling root cause diagnosis without deploying new code
Three DevOps-derived metrics for evaluating observability quality:
| Metric | Definition | Goal |
|---|---|---|
| MTTD (Mean Time to Detection) | How long to detect a failure | Minimize |
| MTTR (Mean Time to Response) | How long to resolve after detection | Minimize |
| CFR (Change Failure Rate) | % of deployments causing failures requiring rollback | Track and reduce |
What to instrument specifically for foundation model apps: - Response quality metrics: Sampled AI-as-judge scores for relevance, faithfulness, toxicity - User feedback signals: Thumbs up/down, explicit ratings, follow-up questions (implicit dissatisfaction signal) - Latency breakdowns: Time in retrieval, model inference, guardrails, caching - Cost per request: Token counts, model tier used, cache hit rate - Failure modes: Guardrail rejections, empty responses, tool call errors, routing misclassifications
Evaluation and monitoring pipelines must stay synchronized: a model that performs well in offline evaluation should also perform well in production. Issues detected in monitoring should feed back into the evaluation pipeline.
AI Pipeline Orchestration
For complex multi-step pipelines, an orchestration layer chains all components together:
- Sequential chaining: Retrieval → generation → scoring → output
- Parallel fan-out: Multiple retrievals or tool calls dispatched simultaneously, results merged
- Retry logic: Re-invoke failed steps with exponential backoff
- State management: Track partial execution state for long-running agentic workflows (durable execution)
Popular orchestration frameworks: LangChain/LangGraph, AWS Strands, Google ADK, Temporal (for durable execution), and custom orchestration for simple pipelines.
Architecture Evolution Summary
| Step | Component Added | Problem Solved |
|---|---|---|
| Base | Model API | Basic query-response |
| Step 1 | Context construction (RAG, tools) | Model lacks knowledge of private/recent data |
| Step 2 | Input/output guardrails | Security risks, quality control |
| Step 3 | Router + Gateway | Multi-model management, cost control |
| Step 4 | Exact + semantic caching | Latency and cost at scale |
| Step 5 | Agent patterns + write actions | Complex multi-step tasks |
| + | Monitoring & observability | Production reliability |
| + | Orchestration layer | Chaining all components |
See Also
- AI Engineering (Concepts)
- RAG Architecture
- Agent Definition and Planning
- Prompt Engineering
- AI as a Judge
- Observability Solutions
- Cost Management
- Context Engineering
References
- AI Engineering: Building Applications with Foundation Models — Chip Huyen, O'Reilly, 2024. Chapter 10.