Writing

Long-form and short-form notes across data platforms, ML agents, evaluation, system design, and game development.

May 16, 2026 · 11 min · Systems Notes

Offline Claims PWA MVP for Field Adjusters

A pilot plan for the claims app failure that usually arrives late: photos captured offline, model scores nobody trusts, sync queues with no proof, and evidence bundles that cannot defend chain of custody.

Outcome: Reader can scope an adjuster-focused claims MVP around offline capture, on-device triage, sync recovery, audit receipts, and acceptance tests that produce pilot evidence instead of demo theater.

Why the May 2026 state-machine story is not finite automata becoming fashionable again, but durable execution becoming the reliability boundary for agents, workflow engines, and document-heavy product systems.

Outcome: Reader can distinguish statechart formalism from durable execution, pick the right runtime for agents and long-running workflows, and model document-heavy product lifecycles with explicit states, transitions, checkpoints, and ownership.

Why teams get stuck between tangled monoliths and premature microservices, and how to choose the next boundary with delivery metrics, ownership, and blast radius.

Outcome: Reader can write a decision memo that chooses modular monolith, service extraction, or boundary repair based on team ownership, delivery metrics, scaling pressure, and observable blast radius.

A de-duplicated taxonomy for system design papers: what to read first, what each paper teaches, and how to move from classic distributed systems into modern databases, observability, serverless, and AI infrastructure.

Outcome: Reader can turn scattered system-design paper lists into a practical reading path, identify duplicates and mixed source types, and choose papers by the design question they answer instead of by prestige.

May 14, 2026 · 17 min · Platform & AI

dbt on BigQuery Ingestion, Snapshots, and Cost Gates

A dbt on BigQuery starter kit for the parts that usually fail after the demo: raw loads without partition filters, snapshots with weak change detection, and CI that lets expensive SQL promote.

Outcome: Reader can scaffold a dbt and BigQuery project with manifest-backed incremental loads, timestamp-first snapshots, partitioned models, and a dry-run bytes gate before production promotion.

May 8, 2026 · 14 min · Platform & AI

Data Governance with AI in 2026: A Current Map for Operators

Half the 2025 AI-governance recipes still in production cite documents that were rescinded, delayed, or replaced in the last twelve months. The current map: what got retired, what's still authoritative, and what an operating governance program actually has to cover in 2026.

Outcome: Reader can audit their AI/data governance program against the actual 2026 regulatory and standards stack — including the federal rescissions, the EU AI Act timeline shift agreed May 7, 2026, the ISO/IEC 5259 Part 5 publication, and the OWASP Agentic Top 10 — and retire stale references with confidence.

Why custom LLM logging leaves you flying blind in production, and how OpenTelemetry's GenAI semantic conventions turn every model call, tool invocation, and agent step into a traceable, cost-accountable span.

Outcome: Reader can instrument an LLM pipeline or agent workflow with OTEL GenAI conventions, export spans and cost metrics to any compatible backend, and build alerts on real token spend and latency instead of inferring from flat logs.

Why standard code review misses capability escalation in skill manifests, and how to wire a pre-merge conftest policy gate and post-merge SLSA provenance chain that actually work — correcting three common mistakes in the recipes that circulate online.

Outcome: Reader can wire a working pre-merge OPA/conftest gate on skill manifests, add a correct post-merge SLSA L2 provenance workflow using the SLSA GitHub Generator reusable workflow (not the nonexistent slsa CLI), and align OTel instrumentation with the GenAI semantic conventions.

May 2, 2026 · 13 min · Platform & AI

Comprehension Debt: When Code Ships Without Theory

Why a two-day debug session on a one-month-old AI-generated bug is not a debugging problem but a theory-building problem you skipped, and the operating discipline that makes the missing theory recoverable.

Outcome: Reader has a working definition of comprehension debt distinct from technical debt, three questions to test whether a theory exists for an AI-generated component, a PR comprehension scoring rubric, and a deliberate-practice tactic set that prevents the doom loop.

Apr 30, 2026 · 8 min · Platform & AI

The Go and gRPC Version of the SaaS Stack

When a SaaS product should graduate from a flexible Python-first backend into Go, gRPC, Cloud Run, and Google Cloud service boundaries.

Outcome: Mapped a Go and gRPC adoption path for SaaS teams that need stronger service contracts, concurrency, latency discipline, and Google Cloud operations without premature rewrites.

Apr 29, 2026 · 15 min · Platform & AI

Your Repo Needs an Agent Harness, Not More Prompt Paste

A critical guide to README.md, AGENTS.md, CLAUDE.md, SKILL.md, .agents, and .claude patterns for teams that want coding agents to follow repo rules without stuffing every workflow into one giant prompt.

Outcome: Defined a repo documentation harness that separates human orientation, always-loaded agent rules, tool-specific compatibility files, on-demand skills, dynamic docs, and deterministic enforcement.

Apr 28, 2026 · 9 min · Platform & AI

What ADK 2.0 Adds, and Where the Approval Path Still Breaks

Why an ADK 2.0 ToolConfirmation flow paired with VertexAiSessionService re-presented the same approval to a reviewer on Monday morning and ran the tool twice, and what the gap tells you about how to evaluate harness primitives at different maturity levels.

Outcome: Reader can map ADK 2.0 primitives onto a session-service backing store and decide which combinations are production-ready, which are beta-with-known-gaps, and which require waiting.

Apr 28, 2026 · 11 min · Platform & AI

Why I Reach for DuckDB When Reading Parquet from Swift or Zig

What an oversized iOS binary, a Linux linker error, and a SQL boundary teach about embedding DuckDB as the Parquet reader for languages without a mature native library.

Outcome: Reader can decide when DuckDB is the right Parquet path for a Swift or Zig project, configure the SPM and build.zig integrations correctly the first time, and avoid the binary-size and linker failures that the unconfigured path produces.

Apr 27, 2026 · 13 min · Systems Notes

State Machines in Go, Elixir, Swift, and Zig

Why a Go retry loop ran forever because the attempt counter lived on the loop instead of the state, and what the runtime guarantees of Elixir, Swift, and Zig change about which state-machine idioms are honest in each.

Outcome: Reader can pick the right state-machine idiom for their language by recognizing which runtime guarantees the language ships, distinguish a true finite-state machine from unidirectional data flow, and avoid the cross-language mistake of treating one language's idiom as the universal pattern.

How a compact Python ML cheatsheet becomes useful when synthetic demos, metrics, pipelines, and version drift are tied to the model-review decisions they can actually defend.

Outcome: Reader can use minimal scikit-learn examples as smoke tests for task framing, metric choice, pipeline boundaries, and environment drift instead of treating them as production recipes.

Apr 25, 2026 · 12 min · Platform & AI

Every Engineer Is a Manager Now

AI coding agents are turning software work into management work: engineers now have to manage intent, context, agent output, teammate coordination, stakeholder evidence, and long-term maintenance.

Outcome: Defined a public operating model for engineers and consultants who need to coordinate human teammates and AI agents without producing artifacts that create hidden technical debt.

Why a precompiled-NIF fall-through on a less-common Linux target adds quiet minutes to a deploy, and what the borrowed-runtime pattern actually looks like for Elixir and Mojo.

Outcome: Reader can ship Parquet-reading Elixir without surprise source compilation in CI, recognize where Mojo's Python interop boundary is the bottleneck rather than Mojo itself, and know which DataFrame guarantees leak at the BEAM and PyArrow boundaries.

Apr 23, 2026 · 11 min · Systems Notes

State Machines in Python: from xstate-python to LangGraph

Why an agent harness re-fired a half-finished tool after a worker restart, the four Python libraries that solve different parts of the problem, and a concrete contribution roadmap for xstate-python.

Outcome: Reader can map a Python workflow to the right state-machine library, distinguish statechart formalism from durable execution, and know where to start contributing to xstate-python with file paths and named missing features.

Apr 22, 2026 · 8 min · Platform & AI

Building an NPS Classifier You Can Actually Act On

A scikit-learn NPS ordinal classifier with SMOTE, probability calibration, utility-based thresholding, and PSI drift checks. The parts that make it useful to the retention team, not just accurate on a dashboard.

Outcome: Shipped a calibrated multiclass NPS model with a utility-driven operating threshold and a PSI-based drift loop, giving the retention team a per-customer detractor probability they can act on and a rule for when to retrain.

Apr 21, 2026 · 14 min · Platform & AI

Coding Assistants Work Best When the Blast Radius Is Small

An Android-first operating pattern for using GitHub Copilot, Amazon Q Developer, Android CLI, and Android skills without letting coding assistants rewrite Gradle, manifests, architecture, and security posture by accident.

Outcome: Defined a repeatable assistant workflow for Android teams that combines sliced prompts, repo instructions, Android skills, screenshots, atomic commits, tests, and GHAS gates into one controlled development loop.

Apr 20, 2026 · 11 min · Platform & AI

How I Read Parquet in Rust and Go Without an OOM

Why a default Go parquet.Read[T] call slurped a 1.4 GB file into 11 GB of resident memory, and the column-native Rust and Go patterns that replaced it.

Outcome: Reader can pick the streaming Parquet read path in Rust and Go, configure the compression-codec features explicitly, and avoid the eager-load anti-patterns that look fine in benchmarks and break in production.

Apr 19, 2026 · 11 min · Systems Notes

XState, Actors, and What the Stately Argument Actually Buys

Why a hand-rolled retry double-charged a Stripe customer because the cancel state was implicit, and what XState 5's setup-plus-actors model gives you that useReducer does not.

Outcome: Reader can write an XState 5 machine using the setup pattern, distinguish invoked from spawned actors, decide when to graduate from useReducer to a state machine library, and read XState code as a structured argument rather than a configuration object.

Apr 18, 2026 · 13 min · Platform & AI

Treat Agent Skills Like Supply-Chain Dependencies

A repo-ready operating contract for agent skills that prevents prompt bundles from drifting into unsigned, over-permissioned, unreviewed production dependencies.

Outcome: Defined a hardened-by-default skill contract covering version pins, manifest provenance, prompt review, IO tests, least-privilege tools, runtime isolation, observability, rotation, and decommissioning.

Apr 17, 2026 · 15 min · Platform & AI

AI Coding Assistants Expose Process Debt

Why teams using Claude, GPT-style coding agents, Cursor, and Copilot often get unstable app work when requirements, versions, conventions, tests, and handoffs are implicit.

Outcome: Defined a docs-first assistant workflow that turns requirements, pinned stack choices, task slices, review loops, tests, and Git checkpoints into a repeatable way to ship with AI without surrendering architecture control.

Apr 15, 2026 · 8 min · Systems Notes

When the State Chart Pays Off

Why a React form with seven boolean flags shipped a flicker bug that statecharts would have surfaced before the first render, and the decision rule that says when this discipline earns its place.

Outcome: Reader can decide when a workflow is state-machine-shaped, replace boolean-flag explosion with a small statechart that names guards and transitions, and recognize statecharts as an architectural discipline rather than a UI utility.

Apr 14, 2026 · 4 min · Platform & AI

What AI Researchers Do That I Do Not

A short, honest read on what AI researchers actually do day to day, written from outside the role by an applied engineer who reads papers when the work demands it.

Outcome: Reader can distinguish AI research work from applied AI engineering work, decide which research outputs change their quarter and which do not, and avoid hiring or being hired against the wrong role description.

Apr 1, 2026 · 19 min · Platform & AI

A Software Architecture Reading Path for Working Engineers

A practical reading path through software design, architecture, system design interviews, data-intensive applications, and systems analysis for engineers who want to grow beyond implementation.

Outcome: Reviewed the architecture and system design books from the DEV Community list, corrected the list count, summarized each book, and arranged them into a practical learning path.

Mar 24, 2026 · 13 min · Systems Notes

A Software Developer Job Description Is an Operating Contract

Why generic software developer job descriptions over-index on writing code and under-specify the ownership, testing, maintenance, communication, and judgment that make software engineering work.

Outcome: Provided a practical role model and responsibility checklist that teams can use to write clearer software developer expectations and evaluate engineering work beyond code output.

Mar 20, 2026 · 8 min · Platform & AI

Fine-Tuning GPT-OSS 20B on a 64GB MacBook Pro

A practical MLX-first recipe for experimenting with openai/gpt-oss-20b on a 64GB Apple Silicon Mac without confusing local LoRA work for CUDA-scale training.

Outcome: Defined a local 64GB MacBook Pro fine-tuning path for GPT-OSS 20B that prioritizes Harmony formatting, MLX quantized LoRA, small evals, and a clear fallback to NVIDIA when scale is required.

Mar 16, 2026 · 7 min · Platform & AI

Fine-Tuning LLMs on a MacBook Pro With MPS and MLX

Why Apple Silicon is useful for local LLM prototyping and LoRA experiments, but still has sharp boundaries compared with CUDA-scale NeMo or Hugging Face training.

Outcome: Separated Mac-local MPS and MLX fine-tuning paths from NVIDIA-only training features so local experiments can start with realistic hardware expectations.

Mar 8, 2026 · 7 min · Systems Notes

Why Data Platforms Fail as Systems, Not Tools

A data platform failure pattern where tool replacement looked like the fix, but the real problem was ownership, release discipline, metric mismatch, and governance outside the workflow.

Outcome: Reframed platform recovery around ownership contracts, operating metrics, and release discipline so teams could fix the system instead of replacing another tool.

Mar 4, 2026 · 11 min · Platform & AI

NVFP4 and the Infrastructure Meaning of Precision

A grounded read of NVIDIA's NVFP4 training post and why 4-bit pretraining matters for model quality, token throughput, cost, and AI infrastructure strategy.

Outcome: Explained NVIDIA's NVFP4 training recipe, separated the credible technical signal from the marketing surface, and connected low-precision training to practical AI infrastructure decisions.

A game economy design essay for Hippi Kingdom covering currency loops, sinks and sources, telemetry, a rejected progression model, and the balancing mistake that made hoarding look like engagement.

Outcome: Created an economy balancing framework that separates progression health from currency hoarding, making pacing, reward pressure, and retention tradeoffs easier to test.

Feb 24, 2026 · 6 min · Platform & AI

DSPy + RAG Evaluation Ops in Production

How to turn DSPy and RAG evaluation into a production release loop with golden sets, retrieval checks, generation rubrics, regression thresholds, and versioned prompt programs.

Outcome: Promoted the note into an essay by defining a repeatable RAG evaluation workflow that separates retrieval quality from generation quality and blocks prompt-program regressions before release.

Feb 20, 2026 · 18 min · Systems Notes

An Enterprise Data Governance Glossary Operators Can Use

A practical enterprise data governance glossary that turns business intelligence, stewardship, metadata, security, privacy, quality, and lifecycle terms into usable review language.

Outcome: Created a shared vocabulary and term-entry contract that helps governance, data engineering, analytics, security, and business teams align definitions before certifying data products.

Feb 16, 2026 · 15 min · Systems Notes

Data Governance Roles Need Decision Rights

A data governance operating model for assigning owners, stewards, custodians, and SMEs without leaving quality rules, access decisions, retention, source-of-truth choices, and incident closure ambiguous.

Outcome: Defined a role-and-cadence contract that lets governance teams assign decision rights, artifacts, escalation paths, and success measures before a data product is certified.

Feb 10, 2026 · 6 min · Platform & AI

Evaluating Multi-Agent Workflows for Enterprise Reliability

A practical evaluation loop for multi-agent workflows that catches demo-friendly failures in task handoff, tool use, permissions, latency, and completion criteria before release.

Outcome: Established a repeatable evaluation workflow that gates multi-agent releases on task completion, handoff quality, tool correctness, latency, and recoverability instead of demo impressions.

Feb 4, 2026 · 18 min · Platform & AI

Machine Learning Terms That Make Model Reviews Better

A practical ML terminology guide for model reviews where feature definitions, data splits, task type, optimization behavior, overfitting risk, regularization, ensembles, and embeddings need to be discussed precisely.

Outcome: Gave peers a review-ready vocabulary for inspecting ML systems by connecting core terms to design choices, failure modes, and release questions.

Jan 27, 2026 · 18 min · Platform & AI

Local MCP and Private Open Model Infrastructure

A practical guide to running MCP servers locally, choosing affordable clients, and deploying private open models with Cloud Run, Ollama, and Open WebUI.

Outcome: Separated local agent tool access from private model serving, then defined a safer setup for MCP clients, local servers, and Cloud Run GPU sidecars.

Jan 15, 2026 · 11 min · Platform & AI

When 0.3 Does Not Mean 30 Percent

How imbalanced classifiers can keep a strong AUC while producing probabilities that break thresholds, alerts, and cost-sensitive decisions in production.

Outcome: Defined a production calibration gate that logs Brier score, ECE, reliability diagrams, cost-sensitive thresholds, run metadata, and promotion criteria for imbalanced classifiers.

Jan 12, 2026 · 5 min · Platform & AI

Compliant GCP Platform Playbook for Analytics and ML

A sanitized GCP platform case study where compliance, analytics delivery, and ML feature access had to be designed as one release path instead of three disconnected workstreams.

Outcome: Reduced governed dataset onboarding from weeks to days in the sanitized pattern while preserving auditability, cost visibility, and promotion rules for analytics and ML use cases.

Jan 11, 2026 · 12 min · Platform & AI

scikit-learn Pipelines That Survive Tuning and Deployment

Why tabular models drift between notebooks and production when preprocessing, sample metadata, hyperparameter search, and persistence are not treated as one scikit-learn pipeline contract.

Outcome: Defined a scikit-learn pipeline contract that keeps column preprocessing, metadata routing, hyperparameter search, evaluation, and deployment artifacts reproducible across dev, stage, and production.

Jan 7, 2026 · 20 min · Platform & AI

Statistics for Data Science, Written for Software Developers

A software-developer guide to the statistics that actually change data-science decisions: samples, estimates, uncertainty, effect size, bias, probability, distributions, and model metrics.

Outcome: Defined a practical estimate-review workflow that helps software developers report effect size, confidence intervals, p-values, sampling bias, and classification metrics without treating statistics as glossary trivia.

Dec 30, 2025 · 12 min · Platform & AI

Vertex AI Feature Store Is the Production Loop

A production-focused Vertex AI post on turning raw data, BigQuery features, online feature serving, model endpoints, monitoring, and retraining into one governed ML loop instead of another platform checklist.

Outcome: Defined a concrete Vertex AI feature-serving loop with source contracts, BigQuery feature views, point-in-time training exports, endpoint serving rules, monitoring thresholds, and retraining triggers.

Dec 26, 2025 · 10 min · Platform & AI

Vertex AI Makes More Sense as an MLOps Map

A Vertex AI architecture map for teams that need to decide which Google Cloud AI services belong in the ML lifecycle, where ownership changes hands, and which older assumptions are now unsafe.

Outcome: Gave teams an operating contract for using Vertex AI across data, features, training, deployment, monitoring, and generative AI without confusing a product menu for a production ML system.

Dec 22, 2025 · 15 min · Platform & AI

Correlation Is a Feature Screen, Not a Feature Strategy

A long-form feature-screening workflow that uses correlation for quick linear checks, then adds redundancy clustering, mutual information, chi-squared tests, L1 models, tree importances, permutation importance, and domain review.

Outcome: Defined a practical feature review loop that prevents teams from dropping useful nonlinear signals or keeping redundant features just because a correlation heatmap looked convincing.

Dec 2, 2025 · 16 min · Systems Notes

TypeScript Concepts Make More Sense Inside React

A practical TypeScript and React guide to the event loop, hoisting, throttling, debouncing, timers, closures, callbacks, IIFEs, promises, async, and await through code patterns that show up in real components.

Outcome: Provided a React-centered runtime map and reusable TypeScript examples for debugging async UI behavior, timer cleanup, stale closures, callback flow, and promise-based rendering work.

Nov 20, 2025 · 14 min · Platform & AI

Agent Memory Is an Operating Boundary

A practical look at Google ADK memory, Vertex AI Memory Bank, session state, retrieval, retention, access control, and why durable agent memory needs production discipline.

Outcome: Clarified the difference between short-term session state and durable agent memory, then mapped the operational risks around retrieval, security, retention, cost, and memory poisoning.

Nov 16, 2025 · 8 min · Platform & AI

The Question About Your AI Agent Has Changed

Capability is no longer the hard question about AI agents. The hard questions are what the agent is permitted to do and whether it will do it successfully. Here is why that distinction matters architecturally.

Outcome: Reframed agent deployment decisions around permission scope and blast radius rather than capability, reducing the risk of production failures from over-permissioned agentic systems.

Nov 12, 2025 · 13 min · Platform & AI

Codex Plugins Extend Agents, Not Interfaces

Why Codex plugins point toward a different software design mindset: fewer UI extensions, more safe agent capabilities, system access points, and operational boundaries.

Outcome: Framed plugins as reusable agent capability bundles that require structured systems, permissions, predictable workflows, and safer operational surfaces.

Nov 4, 2025 · 15 min · Platform & AI

AI Strategy Starts Before the Model

A practical AI strategy framework with a worked example that connects business levers, data readiness, pilots, evaluation, governance, deployment, and operating metrics.

Outcome: Defined an end-to-end AI strategy playbook and worked example that ties data readiness, use-case selection, model development, governance, deployment, and operating ownership to measurable business outcomes.

Oct 31, 2025 · 14 min · Platform & AI

Cloud Run GPU Sidecars Need Deployment Discipline

A practical deployment guide for running Ollama behind Open WebUI on Cloud Run GPUs without mixing service specs, model storage modes, sidecar startup order, or auth assumptions.

Outcome: Clarified Cloud Run GPU sidecar deployment choices so model storage, service YAML, startup ordering, authentication, and billing constraints are explicit before launch.

How to add coverage-guaranteed prediction sets, temperature scaling calibration, and risk-coverage curves to a classifier using MAPIE — the pieces that make uncertainty quantification operationally useful rather than decorative.

Outcome: Added coverage-guaranteed prediction sets and operational abstention gates to a classification pipeline, cutting acted-upon error rate without retraining the model.

Oct 15, 2025 · 18 min · Platform & AI

Fine-Tuning Open Source LLMs With NVIDIA NeMo

A practical map of NVIDIA NeMo for teams that want to curate data, fine-tune open-source LLMs, evaluate them, and move from research checkpoints to production inference.

Outcome: Separated data curation, fine-tuning, alignment, evaluation, export, and serving concerns so open-source LLM customization could move from experiments to governed production workflows.

Oct 11, 2025 · 16 min · Platform & AI

Plain-Language Machine Learning Metrics for Real Decisions

A practical explanation of ML metrics with decision tables for regression tolerance, rare-event classification, threshold tradeoffs, and the failure case where accuracy looked good but the decision failed.

Outcome: Clarified how metric choice, threshold design, tree-based pattern discovery, and logit interpretation affect whether ML outputs are useful for action.

Oct 3, 2025 · 7 min · Platform & AI

The Three-Run Lab: How I Triage Slow PyTorch Training

A repeatable triage routine — the three-run baseline, DataLoader diagnosis, five profiler signatures, and a copy-paste scaffold — for finding where training time actually goes before touching the model.

Outcome: Identified and resolved training bottlenecks in under an hour by running the three-run baseline and reading profiler signatures before changing any model code.

Sep 17, 2025 · 11 min · Systems Notes

Why Teams Miss Goals They Actually Care About

The four reasons goal execution breaks down, the 4DX framework that addresses them, and why the apparent tension between goal-thinking and systems-thinking resolves the moment you understand lead measures.

Outcome: Provided a clearer framework for translating organizational goals into team-level execution through lead measures, visible scoreboards, and accountable weekly cadences.