Platform & AI Engineering

Software, data, and AI systems built with a product manager's discipline — GCP architecture, BigQuery, Dataform governance, ML pipelines, agent workflows, and the evaluation that makes them trustworthy in production.

May 14, 2026 · 17 min — Platform & AI

dbt on BigQuery Ingestion, Snapshots, and Cost Gates

A dbt on BigQuery starter kit for the parts that usually fail after the demo: raw loads without partition filters, snapshots with weak change detection, and CI that lets expensive SQL promote.

Outcome: Reader can scaffold a dbt and BigQuery project with manifest-backed incremental loads, timestamp-first snapshots, partitioned models, and a dry-run bytes gate before production promotion.

May 8, 2026 · 14 min — Platform & AI

Data Governance with AI in 2026: A Current Map for Operators

Half the 2025 AI-governance recipes still in production cite documents that were rescinded, delayed, or replaced in the last twelve months. The current map: what got retired, what's still authoritative, and what an operating governance program actually has to cover in 2026.

Outcome: Reader can audit their AI/data governance program against the actual 2026 regulatory and standards stack — including the federal rescissions, the EU AI Act timeline shift agreed May 7, 2026, the ISO/IEC 5259 Part 5 publication, and the OWASP Agentic Top 10 — and retire stale references with confidence.

Why custom LLM logging leaves you flying blind in production, and how OpenTelemetry's GenAI semantic conventions turn every model call, tool invocation, and agent step into a traceable, cost-accountable span.

Outcome: Reader can instrument an LLM pipeline or agent workflow with OTEL GenAI conventions, export spans and cost metrics to any compatible backend, and build alerts on real token spend and latency instead of inferring from flat logs.

Why standard code review misses capability escalation in skill manifests, and how to wire a pre-merge conftest policy gate and post-merge SLSA provenance chain that actually work — correcting three common mistakes in the recipes that circulate online.

Outcome: Reader can wire a working pre-merge OPA/conftest gate on skill manifests, add a correct post-merge SLSA L2 provenance workflow using the SLSA GitHub Generator reusable workflow (not the nonexistent slsa CLI), and align OTel instrumentation with the GenAI semantic conventions.

May 2, 2026 · 13 min — Platform & AI

Comprehension Debt: When Code Ships Without Theory

Why a two-day debug session on a one-month-old AI-generated bug is not a debugging problem but a theory-building problem you skipped, and the operating discipline that makes the missing theory recoverable.

Outcome: Reader has a working definition of comprehension debt distinct from technical debt, three questions to test whether a theory exists for an AI-generated component, a PR comprehension scoring rubric, and a deliberate-practice tactic set that prevents the doom loop.

Apr 30, 2026 · 8 min — Platform & AI

The Go and gRPC Version of the SaaS Stack

When a SaaS product should graduate from a flexible Python-first backend into Go, gRPC, Cloud Run, and Google Cloud service boundaries.

Outcome: Mapped a Go and gRPC adoption path for SaaS teams that need stronger service contracts, concurrency, latency discipline, and Google Cloud operations without premature rewrites.

Apr 29, 2026 · 15 min — Platform & AI

Your Repo Needs an Agent Harness, Not More Prompt Paste

A critical guide to README.md, AGENTS.md, CLAUDE.md, SKILL.md, .agents, and .claude patterns for teams that want coding agents to follow repo rules without stuffing every workflow into one giant prompt.

Outcome: Defined a repo documentation harness that separates human orientation, always-loaded agent rules, tool-specific compatibility files, on-demand skills, dynamic docs, and deterministic enforcement.

Apr 28, 2026 · 9 min — Platform & AI

What ADK 2.0 Adds, and Where the Approval Path Still Breaks

Why an ADK 2.0 ToolConfirmation flow paired with VertexAiSessionService re-presented the same approval to a reviewer on Monday morning and ran the tool twice, and what the gap tells you about how to evaluate harness primitives at different maturity levels.

Outcome: Reader can map ADK 2.0 primitives onto a session-service backing store and decide which combinations are production-ready, which are beta-with-known-gaps, and which require waiting.

Apr 28, 2026 · 11 min — Platform & AI

Why I Reach for DuckDB When Reading Parquet from Swift or Zig

What an oversized iOS binary, a Linux linker error, and a SQL boundary teach about embedding DuckDB as the Parquet reader for languages without a mature native library.

Outcome: Reader can decide when DuckDB is the right Parquet path for a Swift or Zig project, configure the SPM and build.zig integrations correctly the first time, and avoid the binary-size and linker failures that the unconfigured path produces.

How a compact Python ML cheatsheet becomes useful when synthetic demos, metrics, pipelines, and version drift are tied to the model-review decisions they can actually defend.

Outcome: Reader can use minimal scikit-learn examples as smoke tests for task framing, metric choice, pipeline boundaries, and environment drift instead of treating them as production recipes.

Apr 25, 2026 · 12 min — Platform & AI

Every Engineer Is a Manager Now

AI coding agents are turning software work into management work: engineers now have to manage intent, context, agent output, teammate coordination, stakeholder evidence, and long-term maintenance.

Outcome: Defined a public operating model for engineers and consultants who need to coordinate human teammates and AI agents without producing artifacts that create hidden technical debt.

Why a precompiled-NIF fall-through on a less-common Linux target adds quiet minutes to a deploy, and what the borrowed-runtime pattern actually looks like for Elixir and Mojo.

Outcome: Reader can ship Parquet-reading Elixir without surprise source compilation in CI, recognize where Mojo's Python interop boundary is the bottleneck rather than Mojo itself, and know which DataFrame guarantees leak at the BEAM and PyArrow boundaries.

Apr 22, 2026 · 8 min — Platform & AI

Building an NPS Classifier You Can Actually Act On

A scikit-learn NPS ordinal classifier with SMOTE, probability calibration, utility-based thresholding, and PSI drift checks: the parts that make it useful to the retention team, not just accurate on a dashboard.

Outcome: Shipped a calibrated multiclass NPS model with a utility-driven operating threshold and a PSI-based drift loop, giving the retention team a per-customer detractor probability they can act on and a rule for when to retrain.
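For readers unfamiliar with PSI, the drift check named above can be sketched in a few lines. This is a minimal illustration, not the article's implementation: the function names, the shared bin edges, and the `eps` floor are assumptions made here for the example.

```python
import math
from bisect import bisect_right


def bin_shares(values, edges, eps=1e-6):
    """Share of values per bin; values outside the edges are ignored."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if edges[0] <= v <= edges[-1]:
            # The last bin is closed on the right so v == edges[-1] is counted.
            i = min(bisect_right(edges, v) - 1, len(counts) - 1)
            counts[i] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no values fall inside the bin edges")
    # Floor each share at eps (an assumption) so the log below stays finite.
    return [max(c / total, eps) for c in counts]


def psi(expected, actual, edges):
    """Population Stability Index between a baseline and a recent sample."""
    e = bin_shares(expected, edges)
    a = bin_shares(actual, edges)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
recent = [0.1, 0.6, 0.7, 0.8, 0.9, 0.95]
drift = psi(baseline, recent, edges=[0.0, 0.5, 1.0])
```

A commonly cited rule of thumb treats a PSI above roughly 0.25 as a significant shift. The retrain rule the article actually uses is not stated in this summary.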

Apr 21, 2026 · 14 min — Platform & AI

Coding Assistants Work Best When the Blast Radius Is Small

An Android-first operating pattern for using GitHub Copilot, Amazon Q Developer, Android CLI, and Android skills without letting coding assistants rewrite Gradle, manifests, architecture, and security posture by accident.

Outcome: Defined a repeatable assistant workflow for Android teams that combines sliced prompts, repo instructions, Android skills, screenshots, atomic commits, tests, and GHAS gates into one controlled development loop.

Apr 20, 2026 · 11 min — Platform & AI

How I Read Parquet in Rust and Go Without an OOM

Why a default Go parquet.Read[T] call slurped a 1.4 GB file into 11 GB of resident memory, and the column-native Rust and Go patterns that replaced it.

Outcome: Reader can pick the streaming Parquet read path in Rust and Go, configure the compression-codec features explicitly, and avoid the eager-load anti-patterns that look fine in benchmarks and break in production.

Apr 18, 2026 · 13 min — Platform & AI

Treat Agent Skills Like Supply-Chain Dependencies

A repo-ready operating contract for agent skills that prevents prompt bundles from drifting into unsigned, over-permissioned, unreviewed production dependencies.

Outcome: Defined a hardened-by-default skill contract covering version pins, manifest provenance, prompt review, IO tests, least-privilege tools, runtime isolation, observability, rotation, and decommissioning.

Apr 17, 2026 · 15 min — Platform & AI

AI Coding Assistants Expose Process Debt

Why teams using Claude, GPT-style coding agents, Cursor, and Copilot often get unstable app work when requirements, versions, conventions, tests, and handoffs are implicit.

Outcome: Defined a docs-first assistant workflow that turns requirements, pinned stack choices, task slices, review loops, tests, and Git checkpoints into a repeatable way to ship with AI without surrendering architecture control.

Apr 14, 2026 · 4 min — Platform & AI

What AI Researchers Do That I Do Not

A short, honest read on what AI researchers actually do day to day, written from outside the role by an applied engineer who reads papers when the work demands it.

Outcome: Reader can distinguish AI research work from applied AI engineering work, decide which research outputs change their quarter and which do not, and avoid hiring or being hired against the wrong role description.

Apr 1, 2026 · 19 min — Platform & AI

A Software Architecture Reading Path for Working Engineers

A practical reading path through software design, architecture, system design interviews, data-intensive applications, and systems analysis for engineers who want to grow beyond implementation.

Outcome: Reviewed the architecture and system design books from the DEV Community list, corrected the list count, summarized each book, and arranged them into a practical learning path.

Mar 20, 2026 · 8 min — Platform & AI

Fine-Tuning GPT-OSS 20B on a 64GB MacBook Pro

A practical MLX-first recipe for experimenting with openai/gpt-oss-20b on a 64GB Apple Silicon Mac without confusing local LoRA work for CUDA-scale training.

Outcome: Defined a local 64GB MacBook Pro fine-tuning path for GPT-OSS 20B that prioritizes Harmony formatting, MLX quantized LoRA, small evals, and a clear fallback to NVIDIA when scale is required.

Mar 16, 2026 · 7 min — Platform & AI

Fine-Tuning LLMs on a MacBook Pro With MPS and MLX

Why Apple Silicon is useful for local LLM prototyping and LoRA experiments, but still has sharp boundaries compared with CUDA-scale NeMo or Hugging Face training.

Outcome: Separated Mac-local MPS and MLX fine-tuning paths from NVIDIA-only training features so local experiments can start with realistic hardware expectations.

Mar 4, 2026 · 11 min — Platform & AI

NVFP4 and the Infrastructure Meaning of Precision

A grounded read of NVIDIA's NVFP4 training post and why 4-bit pretraining matters for model quality, token throughput, cost, and AI infrastructure strategy.

Outcome: Explained NVIDIA's NVFP4 training recipe, separated the credible technical signal from the marketing surface, and connected low-precision training to practical AI infrastructure decisions.

Feb 24, 2026 · 6 min — Platform & AI

DSPy + RAG Evaluation Ops in Production

How to turn DSPy and RAG evaluation into a production release loop with golden sets, retrieval checks, generation rubrics, regression thresholds, and versioned prompt programs.

Outcome: Promoted the note into an essay by defining a repeatable RAG evaluation workflow that separates retrieval quality from generation quality and blocks prompt-program regressions before release.

Feb 10, 2026 · 6 min — Platform & AI

Evaluating Multi-Agent Workflows for Enterprise Reliability

A practical evaluation loop for multi-agent workflows that catches demo-friendly failures in task handoff, tool use, permissions, latency, and completion criteria before release.

Outcome: Established a repeatable evaluation workflow that gates multi-agent releases on task completion, handoff quality, tool correctness, latency, and recoverability instead of demo impressions.

Feb 4, 2026 · 18 min — Platform & AI

Machine Learning Terms That Make Model Reviews Better

A practical ML terminology guide for model reviews where feature definitions, data splits, task type, optimization behavior, overfitting risk, regularization, ensembles, and embeddings need to be discussed precisely.

Outcome: Gave peers a review-ready vocabulary for inspecting ML systems by connecting core terms to design choices, failure modes, and release questions.

Jan 27, 2026 · 18 min — Platform & AI

Local MCP and Private Open Model Infrastructure

A practical guide to running MCP servers locally, choosing affordable clients, and deploying private open models with Cloud Run, Ollama, and Open WebUI.

Outcome: Separated local agent tool access from private model serving, then defined a safer setup for MCP clients, local servers, and Cloud Run GPU sidecars.

Jan 15, 2026 · 11 min — Platform & AI

When 0.3 Does Not Mean 30 Percent

How imbalanced classifiers can keep a strong AUC while producing probabilities that break thresholds, alerts, and cost-sensitive decisions in production.

Outcome: Defined a production calibration gate that logs Brier score, ECE, reliability diagrams, cost-sensitive thresholds, run metadata, and promotion criteria for imbalanced classifiers.
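Two of the logged quantities, Brier score and expected calibration error (ECE), are short enough to sketch directly. The version below is a plain-Python illustration for a binary classifier under assumed choices (equal-width bins, ten by default); it is not the article's gate, and the function names are chosen here for the example.

```python
def brier_score(y_true, y_prob):
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted average of |observed rate - mean predicted prob| per bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        # Equal-width bins over [0, 1]; p == 1.0 goes into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((y, p))
    n = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for _, p in members) / len(members)
        observed = sum(y for y, _ in members) / len(members)
        ece += (len(members) / n) * abs(observed - mean_prob)
    return ece


# Four cases all scored 0.3, but only one in four is actually positive.
labels = [0, 0, 0, 1]
scores = [0.3, 0.3, 0.3, 0.3]
brier = brier_score(labels, scores)
ece = expected_calibration_error(labels, scores)
```

In this toy sample the model says 30 percent while the observed rate is 25 percent, so the ECE is the 5-point gap. Ranking metrics such as AUC cannot see that gap because they ignore the probability scale.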

Jan 12, 2026 · 5 min — Platform & AI

Compliant GCP Platform Playbook for Analytics and ML

A sanitized GCP platform case study where compliance, analytics delivery, and ML feature access had to be designed as one release path instead of three disconnected workstreams.

Outcome: Reduced governed dataset onboarding from weeks to days in the sanitized pattern while preserving auditability, cost visibility, and promotion rules for analytics and ML use cases.

Jan 11, 2026 · 12 min — Platform & AI

scikit-learn Pipelines That Survive Tuning and Deployment

Why tabular models drift between notebooks and production when preprocessing, sample metadata, hyperparameter search, and persistence are not treated as one scikit-learn pipeline contract.

Outcome: Defined a scikit-learn pipeline contract that keeps column preprocessing, metadata routing, hyperparameter search, evaluation, and deployment artifacts reproducible across dev, stage, and production.

Jan 7, 2026 · 20 min — Platform & AI

Statistics for Data Science, Written for Software Developers

A software-developer guide to the statistics that actually change data-science decisions: samples, estimates, uncertainty, effect size, bias, probability, distributions, and model metrics.

Outcome: Defined a practical estimate-review workflow that helps software developers report effect size, confidence intervals, p-values, sampling bias, and classification metrics without treating statistics as glossary trivia.

Dec 30, 2025 · 12 min — Platform & AI

Vertex AI Feature Store Is the Production Loop

A production-focused Vertex AI post on turning raw data, BigQuery features, online feature serving, model endpoints, monitoring, and retraining into one governed ML loop instead of another platform checklist.

Outcome: Defined a concrete Vertex AI feature-serving loop with source contracts, BigQuery feature views, point-in-time training exports, endpoint serving rules, monitoring thresholds, and retraining triggers.

Dec 26, 2025 · 10 min — Platform & AI

Vertex AI Makes More Sense as an MLOps Map

A Vertex AI architecture map for teams that need to decide which Google Cloud AI services belong in the ML lifecycle, where ownership changes hands, and which older assumptions are now unsafe.

Outcome: Gave teams an operating contract for using Vertex AI across data, features, training, deployment, monitoring, and generative AI without confusing a product menu for a production ML system.

Dec 22, 2025 · 15 min — Platform & AI

Correlation Is a Feature Screen, Not a Feature Strategy

A long-form feature-screening workflow that uses correlation for quick linear checks, then adds redundancy clustering, mutual information, chi-squared tests, L1 models, tree importances, permutation importance, and domain review.

Outcome: Defined a practical feature review loop that prevents teams from dropping useful nonlinear signals or keeping redundant features just because a correlation heatmap looked convincing.

Nov 20, 2025 · 14 min — Platform & AI

Agent Memory Is an Operating Boundary

A practical look at Google ADK memory, Vertex AI Memory Bank, session state, retrieval, retention, access control, and why durable agent memory needs production discipline.

Outcome: Clarified the difference between short-term session state and durable agent memory, then mapped the operational risks around retrieval, security, retention, cost, and memory poisoning.

Nov 16, 2025 · 8 min — Platform & AI

The Question About Your AI Agent Has Changed

Capability is no longer the hard question about AI agents. The hard questions now are what the agent is permitted to do and whether it will do it successfully. Here is why that distinction matters architecturally.

Outcome: Reframed agent deployment decisions around permission scope and blast radius rather than capability, reducing the risk of production failures from over-permissioned agentic systems.

Nov 12, 2025 · 13 min — Platform & AI

Codex Plugins Extend Agents, Not Interfaces

Why Codex plugins point toward a different software design mindset: fewer UI extensions, more safe agent capabilities, system access points, and operational boundaries.

Outcome: Framed plugins as reusable agent capability bundles that require structured systems, permissions, predictable workflows, and safer operational surfaces.

Nov 4, 2025 · 15 min — Platform & AI

AI Strategy Starts Before the Model

A practical AI strategy framework with a worked example that connects business levers, data readiness, pilots, evaluation, governance, deployment, and operating metrics.

Outcome: Defined an end-to-end AI strategy playbook and worked example that ties data readiness, use-case selection, model development, governance, deployment, and operating ownership to measurable business outcomes.

Oct 31, 2025 · 14 min — Platform & AI

Cloud Run GPU Sidecars Need Deployment Discipline

A practical deployment guide for running Ollama behind Open WebUI on Cloud Run GPUs without mixing service specs, model storage modes, sidecar startup order, or auth assumptions.

Outcome: Clarified Cloud Run GPU sidecar deployment choices so model storage, service YAML, startup ordering, authentication, and billing constraints are explicit before launch.

How to add coverage-guaranteed prediction sets, temperature scaling calibration, and risk-coverage curves to a classifier using MAPIE — the pieces that make uncertainty quantification operationally useful rather than decorative.

Outcome: Added coverage-guaranteed prediction sets and operational abstention gates to a classification pipeline, cutting acted-upon error rate without retraining the model.
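The idea behind coverage-guaranteed prediction sets can be sketched without any library. The example below is a hand-rolled split-conformal illustration using one minus the true-class probability as the nonconformity score. It does not use or reproduce the MAPIE API, and every name in it is hypothetical.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Score cutoff from a held-out calibration set for target coverage 1 - alpha."""
    # Nonconformity score: 1 - probability assigned to the true class.
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: include every class.
        return 1.0
    return scores[k - 1]


def prediction_set(probs, qhat):
    """All classes whose nonconformity score is within the cutoff."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= qhat]


def should_act(probs, qhat):
    """Abstention gate (an assumed rule): act only on single-class sets."""
    return len(prediction_set(probs, qhat)) == 1


cal_probs = [[0.9, 0.05, 0.05], [0.3, 0.6, 0.1], [0.5, 0.3, 0.2]]
cal_labels = [0, 1, 2]
qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.5)
```

The guarantee is marginal and assumes calibration and production data are exchangeable. With this score, an uncertain input can also yield an empty set, which the gate above treats as an abstention.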

Oct 15, 2025 · 18 min — Platform & AI

Fine-Tuning Open Source LLMs With NVIDIA NeMo

A practical map of NVIDIA NeMo for teams that want to curate data, fine-tune open-source LLMs, evaluate them, and move from research checkpoints to production inference.

Outcome: Separated data curation, fine-tuning, alignment, evaluation, export, and serving concerns so open-source LLM customization could move from experiments to governed production workflows.

Oct 11, 2025 · 16 min — Platform & AI

Plain-Language Machine Learning Metrics for Real Decisions

A practical explanation of ML metrics with decision tables for regression tolerance, rare-event classification, threshold tradeoffs, and the failure case where accuracy looked good but the decision failed.

Outcome: Clarified how metric choice, threshold design, tree-based pattern discovery, and logit interpretation affect whether ML outputs are useful for action.

Oct 3, 2025 · 7 min — Platform & AI

The Three-Run Lab: How I Triage Slow PyTorch Training

A repeatable triage routine — the three-run baseline, DataLoader diagnosis, five profiler signatures, and a copy-paste scaffold — for finding where training time actually goes before touching the model.

Outcome: Identified and resolved training bottlenecks in under an hour by running the three-run baseline and reading profiler signatures before changing any model code.