Outcome focus: Defined a production Python layout for AI and data systems that separates experimentation, evaluation, domain logic, infrastructure adapters, and deployable service code.
The notebook did exactly what it was supposed to do.
It explored.
Then it became a job. Then the job became the feature pipeline. Then the feature pipeline became the model scoring path. Then the scoring path needed retries, secrets, audit logs, schema changes, and rollback. By that point the notebook was no longer a notebook. It was production architecture with no architectural affordances.
This is the Python failure mode I see most often in AI and data systems: experimentation becomes production by sediment.
The fix is not banning notebooks. The fix is giving production code a better place to live before the notebook gets promoted accidentally.
Modern Python practices help, but only if they are assembled as a system: typed boundaries, src/ layout, explicit layers, dataclasses for trusted value objects, runtime validation at edges, pytest and property tests, reproducible environments, and evaluation artifacts that live outside ad hoc analysis.
Separate the Work Modes
AI and data projects usually contain at least four different work modes:
| Mode | Purpose | Failure if mixed with production |
|---|---|---|
| Exploration | learn the data, test ideas, inspect errors | unreviewed assumptions become dependencies |
| Experimentation | compare models, prompts, retrieval, features | winning run cannot be reproduced |
| Evaluation | define release and regression gates | demos replace measurable behavior |
| Production | serve, schedule, monitor, retry, roll back | notebooks become operational systems |
Those modes can live in one repository. They should not live in one layer.
The Python Packaging User Guide's src/ layout discussion is useful here because it forces importable production code into a package boundary. It does not solve AI architecture by itself. It gives the repo a spine.
A Production Layout
For an AI/data/backend Python system, I want this shape:
```
project/
├── pyproject.toml
├── uv.lock
├── README.md
├── .env.example
├── src/
│   └── app/
│       ├── __init__.py
│       ├── py.typed
│       ├── api/
│       ├── config/
│       ├── domain/
│       ├── services/
│       ├── repositories/
│       ├── infrastructure/
│       ├── agents/
│       ├── workflows/
│       ├── prompts/
│       ├── retrieval/
│       ├── models/
│       ├── evaluation/
│       └── main.py
├── tests/
│   ├── unit/
│   ├── integration/
│   ├── contract/
│   └── e2e/
├── evals/
│   ├── datasets/
│   ├── rubrics/
│   └── reports/
├── notebooks/
├── experiments/
├── docs/
├── scripts/
└── migrations/
```

The names can change. The separation should not.
notebooks/ is allowed to be messy. experiments/ can contain exploratory runners, sweep configs, and temporary result files. evals/ holds assets that decide whether behavior improved. src/app/ contains importable production code. tests/ protects boundaries.
If the model-serving API imports from notebooks/, the architecture has already lost.
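One way to make that rule enforceable is a small test that parses production modules and fails on imports from exploratory directories. The sketch below is a minimal illustration using the standard library `ast` module; the names `FORBIDDEN_ROOTS`, `forbidden_imports`, and `violations_in_tree` are hypothetical, and the forbidden list is an assumption based on the layout above.

```python
import ast
from pathlib import Path

# Assumption: top-level packages that production code must never import.
FORBIDDEN_ROOTS = frozenset({"notebooks", "experiments"})


def forbidden_imports(
    source: str, forbidden: frozenset[str] = FORBIDDEN_ROOTS
) -> list[str]:
    """Return forbidden module names imported by a piece of Python source."""
    found: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 means an absolute import; relative imports stay
            # inside the package and are ignored here.
            names = [node.module]
        else:
            continue
        found.extend(name for name in names if name.split(".")[0] in forbidden)
    return found


def violations_in_tree(root: Path) -> dict[str, list[str]]:
    """Map each offending file under root to the forbidden modules it imports."""
    result: dict[str, list[str]] = {}
    for path in sorted(root.rglob("*.py")):
        hits = forbidden_imports(path.read_text(encoding="utf-8"))
        if hits:
            result[str(path)] = hits
    return result
```

In a pytest suite this could be a single assertion such as `assert violations_in_tree(Path("src")) == {}`. It only catches static imports, so dynamic imports would need a separate check.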
The Layer Responsibilities
I would divide the production package into layers with these responsibilities:
The domain layer should not know about FastAPI, LangGraph, Google Cloud, S3, Postgres, OpenAI, Snowflake, or a vector database. It should know the rules of the decision.
The services layer coordinates use cases. It can call repositories, retrieval policies, model providers, and evaluators through interfaces. The infrastructure layer adapts those interfaces to actual systems.
The agent layer is not a place to hide business logic. It is where tool policies, orchestration prompts, memory boundaries, and action permissions live. If the agent decides whether a refund is allowed, that rule belongs in the domain or service layer where tests can exercise it without an LLM.
Typed Payloads, Validated Edges
AI systems pass around suspiciously shaped data: tool arguments, model messages, retrieval results, eval cases, embeddings, document chunks, model outputs, safety decisions, and human feedback.
Static typing helps, but it must be paired with runtime validation at trust boundaries.
```python
from dataclasses import dataclass
from typing import Protocol, TypedDict


# Wire shape as it arrives from the retrieval backend; not yet trusted.
class RetrievalHitPayload(TypedDict):
    document_id: str
    chunk_id: str
    score: float
    text: str


# Trusted, immutable value object used inside the system.
@dataclass(frozen=True, slots=True, kw_only=True)
class RetrievalHit:
    document_id: str
    chunk_id: str
    score: float
    text: str


# Interface the service layer depends on; adapters implement it.
class Retriever(Protocol):
    def search(self, query: str, *, limit: int) -> tuple[RetrievalHit, ...]: ...


def parse_hit(payload: RetrievalHitPayload) -> RetrievalHit:
    # Validate at the boundary so downstream code can rely on non-empty text.
    if not payload["text"].strip():
        raise ValueError("retrieval hit text cannot be empty")
    return RetrievalHit(
        document_id=payload["document_id"],
        chunk_id=payload["chunk_id"],
        score=payload["score"],
        text=payload["text"],
    )
```

The tradeoff is more conversion code. The benefit is that the rest of the system can stop defending itself against every possible wire shape.
Evaluation Is a First-Class Package
Evaluation should not be a folder of screenshots and vibes.
The src/app/evaluation/ package should contain code for scoring behavior. The top-level evals/ folder should contain datasets, rubrics, reports, and fixtures that are allowed to be inspected by product and domain reviewers.
```
src/app/evaluation/
├── __init__.py
├── cases.py
├── metrics.py
├── graders.py
├── reports.py
└── regression.py

evals/
├── datasets/
│   ├── refund-policy-golden.jsonl
│   └── retrieval-hard-negatives.jsonl
├── rubrics/
│   └── support-agent-rubric.md
└── reports/
    └── 2026-06-16-agent-regression.md
```

The failure this prevents is the demo-driven release. A prompt looks better in five hand-picked examples, gets promoted, and then loses a behavior the old prompt handled. The evaluator should be able to say: this version improved retrieval grounding, regressed refusal precision, and exceeded the latency budget.
Use pytest for deterministic checks. Use Hypothesis where invariants matter, such as chunk normalization, date parsing, deduplication, or schema round-trips. Use coverage.py as a signal, not as a substitute for meaningful assertions.
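To show what an invariant check looks like for chunk normalization, here is a property-style sketch. To stay dependency-free it uses a seeded `random.Random` generator as a stand-in. With Hypothesis, the hand-rolled loop would typically be replaced by a test decorated with `@given(st.text())`. The function `normalize_chunk` and its whitespace-collapsing behavior are assumptions, not the article's implementation.

```python
import random
import string


def normalize_chunk(text: str) -> str:
    """Collapse every run of whitespace to one space and trim both ends."""
    return " ".join(text.split())


def random_text(rng: random.Random, max_len: int = 40) -> str:
    """Generate short strings that mix letters with spaces, tabs, and newlines."""
    alphabet = string.ascii_letters + " \t\n"
    return "".join(rng.choice(alphabet) for _ in range(rng.randint(0, max_len)))


def check_normalize_invariants(samples: int = 500, seed: int = 0) -> None:
    """Assert properties that should hold for every input, not just examples."""
    rng = random.Random(seed)
    for _ in range(samples):
        text = random_text(rng)
        once = normalize_chunk(text)
        assert normalize_chunk(once) == once  # idempotent
        assert once == once.strip()           # no leading or trailing whitespace
        assert "  " not in once               # no doubled spaces
        assert "\t" not in once and "\n" not in once
        assert once.split() == text.split()   # tokens are preserved in order
```

The useful part is that the properties (idempotence, token preservation) are stated once and exercised on inputs nobody hand-picked.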
Data Libraries Belong Behind Policies
For data-heavy systems, NumPy, Polars, and PyArrow are often the right tools. The architecture mistake is letting library calls become business policy.
Bad shape:

```
api route -> dataframe code -> feature policy -> storage write
```

Better shape:

```
api route -> service -> feature policy -> dataframe adapter -> storage write
```

The service should ask for "active customer feature rows for this scoring window." The infrastructure adapter can decide whether the efficient implementation is Polars lazy frames, Arrow tables, SQL, or a warehouse query. If the policy is buried in dataframe expressions inside the adapter, the behavior becomes harder to test without the data engine.
Prompts Are Artifacts
Prompts should be versioned artifacts with owners, inputs, outputs, and tests. They are not comments.
```yaml
name: "support_triage"
version: "2026-06-16"
owner: "support-platform"
inputs:
  - "ticket_text"
  - "customer_plan"
  - "retrieval_hits"
outputs:
  schema: "SupportTriageDecision"
release_gate:
  eval_dataset: "evals/datasets/support-triage-golden.jsonl"
  min_grounded_answer_rate: 0.92
  max_invalid_action_rate: 0.01
  p95_latency_ms: 2500
```

If an agent can take an action, the prompt alone is not enough. The service layer needs a permission check. The tool implementation needs validation. The evaluation suite needs action safety cases. The audit log needs the prompt version and tool arguments.
The Quality Gates
For an AI/data Python system, I want these gates:
| Gate | Catches |
|---|---|
| Ruff lint and format | style drift, import drift, simple bug patterns |
| Type checker | contract drift between services, repositories, tools |
| Unit tests | domain rules and pure transforms |
| Integration tests | database, queue, object store, model-provider adapters |
| Contract tests | external payload shape, tool args, event schemas |
| Evaluation regression | model, prompt, retrieval, and agent behavior |
| Build/package check | import and artifact shape |
The build should fail when the repo cannot prove the behavior it claims.
The Tradeoff
This architecture costs more than a notebook and a script.
It creates more files. It asks engineers to name boundaries. It makes prompt changes go through evaluation. It prevents a convenient dataframe expression from living directly inside a route handler. It turns one-off model improvements into release candidates with evidence.
The alternative is a system where the production path is assembled from whatever explored fastest. That is fun until the first incident asks where the policy lives.
Python is excellent for AI and data work because it lets teams move from exploration to production in one language. That strength is also the trap. The language makes promotion easy. Architecture decides whether promotion is safe.
Keep notebooks for learning. Keep experiments for discovery. Keep evaluations for release truth. Keep production code in src/, behind typed boundaries, with tests that can run when the notebook is closed.