Tag: agents

27 entries tagged "agents" — 15 posts, 12 links.

Posts

Jun 29, 2026 — 10 min — Games & Sim

Customer Experience Simulation Is Agent-Driven User Research

Why plausible synthetic customer behavior can turn an onboarding simulation into false evidence unless every journey claim has a trace and a real-world validation gate.

Outcome: Reader can design an agent-driven CX simulation as a hypothesis generator, with validation gates that keep synthetic behavior from being mistaken for customer evidence.

agents simulation customer experience user research product systems

Jun 18, 2026 — 8 min — Platform & AI

Agent Frameworks Are Infrastructure Now

A practical 2026 map of AI agent frameworks by infrastructure primitive: orchestration, tools, state, multi-agent delegation, approvals, observability, evaluation, and deployment.

Outcome: Reframed agent-framework selection away from tier lists and toward an operating contract for which primitives the framework owns, which ones the team must own, and where bespoke orchestration is still justified.

agents agent frameworks mcp orchestration observability

May 16, 2026 — 15 min — Systems Notes

State Machines in 2026: Durable Execution for Agents and Workflows

Why the May 2026 state-machine story is not finite automata becoming fashionable again, but durable execution becoming the reliability boundary for agents, workflow engines, and document-heavy product systems.

Outcome: Reader can distinguish statechart formalism from durable execution, pick the right runtime for agents and long-running workflows, and model document-heavy product lifecycles with explicit states, transitions, checkpoints, and ownership.

state machines durable execution agents workflow engines xstate temporal langgraph

May 3, 2026 — 7 min — Platform & AI

LLM and Agent Observability with OpenTelemetry GenAI Conventions

Why custom LLM logging leaves you flying blind in production, and how OpenTelemetry's GenAI semantic conventions turn every model call, tool invocation, and agent step into a traceable, cost-accountable span.

Outcome: Reader can instrument an LLM pipeline or agent workflow with OTEL GenAI conventions, export spans and cost metrics to any compatible backend, and build alerts on real token spend and latency instead of inferring from flat logs.

observability opentelemetry llms agents monitoring mlops ai engineering

May 2, 2026 — 12 min — Platform & AI

Agent Repo Trust Gates: Conftest Policies, SLSA Provenance, and SBOM in GitHub Actions

Why standard code review misses capability escalation in skill manifests, and how to wire a pre-merge conftest policy gate and post-merge SLSA provenance chain that actually work — correcting three common mistakes in the recipes that circulate online.

Outcome: Reader can wire a working pre-merge OPA/conftest gate on skill manifests, add a correct post-merge SLSA L2 provenance workflow using the SLSA GitHub Generator reusable workflow (not the nonexistent slsa CLI), and align OTel instrumentation with the GenAI semantic conventions.

supply chain github actions slsa conftest opa sbom agents ci/cd security

Apr 28, 2026 — 9 min — Platform & AI

What ADK 2.0 Adds, and Where the Approval Path Still Breaks

Why an ADK 2.0 ToolConfirmation flow paired with VertexAiSessionService re-presented the same approval to a reviewer on Monday morning and ran the tool twice, and what the gap tells you about how to evaluate harness primitives at different maturity levels.

Outcome: Reader can map ADK 2.0 primitives onto a session-service backing store and decide which combinations are production-ready, which are beta-with-known-gaps, and which require waiting.

agents adk agent harness human in the loop memory

Apr 18, 2026 — 13 min — Platform & AI

Treat Agent Skills Like Supply-Chain Dependencies

A repo-ready operating contract for agent skills that prevents prompt bundles from drifting into unsigned, over-permissioned, unreviewed production dependencies.

Outcome: Defined a hardened-by-default skill contract covering version pins, manifest provenance, prompt review, IO tests, least-privilege tools, runtime isolation, observability, rotation, and decommissioning.

agent skills supply chain agents security developer workflow

Feb 28, 2026 — 15 min — Platform & AI

Context Engineering Keeps Long Context Useful

A practical synthesis of Drew Breunig, Simon Willison, and Anthropic on how long contexts fail, how to fix them, and why multi-agent systems need context discipline.

Outcome: Turned long-context failure modes into an engineering playbook for selecting, isolating, pruning, summarizing, offloading, and evaluating context in agent systems.

agents context engineering llm evaluation tool use multi-agent systems

Feb 24, 2026 — 6 min — Platform & AI

DSPy + RAG Evaluation Ops in Production

How to turn DSPy and RAG evaluation into a production release loop with golden sets, retrieval checks, generation rubrics, regression thresholds, and versioned prompt programs.

Outcome: Promoted the note into an essay by defining a repeatable RAG evaluation workflow that separates retrieval quality from generation quality and blocks prompt-program regressions before release.

dspy rag evaluation mlops agents

Feb 10, 2026 — 6 min — Platform & AI

Evaluating Multi-Agent Workflows for Enterprise Reliability

A practical evaluation loop for multi-agent workflows that catches demo-friendly failures in task handoff, tool use, permissions, latency, and completion criteria before release.

Outcome: Established a repeatable evaluation workflow that gates multi-agent releases on task completion, handoff quality, tool correctness, latency, and recoverability instead of demo impressions.

agents evaluation reliability enterprise ai observability

Jan 27, 2026 — 18 min — Platform & AI

Local MCP and Private Open Model Infrastructure

A practical guide to running MCP servers locally, choosing affordable clients, and deploying private open models with Cloud Run, Ollama, and Open WebUI.

Outcome: Separated local agent tool access from private model serving, then defined a safer setup for MCP clients, local servers, and Cloud Run GPU sidecars.

mcp agents cloud run ollama open webui

Nov 20, 2025 — 14 min — Platform & AI

Agent Memory Is an Operating Boundary

A practical look at Google ADK memory, Vertex AI Memory Bank, session state, retrieval, retention, access control, and why durable agent memory needs production discipline.

Outcome: Clarified the difference between short-term session state and durable agent memory, then mapped the operational risks around retrieval, security, retention, cost, and memory poisoning.

agents google cloud adk memory rag

Nov 16, 2025 — 8 min — Platform & AI

The Question About Your AI Agent Has Changed

Capability is no longer the hard question about AI agents. The hard questions are what the agent is permitted to do and whether it will do it successfully. Here is why that distinction matters architecturally.

Outcome: Reframed agent deployment decisions around permission scope and blast radius rather than capability, reducing the risk of production failures from over-permissioned agentic systems.

agents ai governance enterprise ai authorization security

Nov 12, 2025 — 13 min — Platform & AI

Codex Plugins Extend Agents, Not Interfaces

Why Codex plugins point toward a different software design mindset: fewer UI extensions, more safe agent capabilities, system access points, and operational boundaries.

Outcome: Framed plugins as reusable agent capability bundles that require structured systems, permissions, predictable workflows, and safer operational surfaces.

codex agents plugins mcp software architecture

Nov 8, 2025 — 13 min — Platform & AI

Sandboxed Agents and the Production Automation Boundary

OpenAI's April 2026 Agents SDK update matters because sandboxed execution, manifests, resumable state, and memory move agents closer to real production automation.

Outcome: Framed sandboxed agent execution as an architecture boundary for safer, stateful, long-running automation instead of another demo-layer SDK feature.

agents openai sandboxing automation enterprise ai

Links

Article | anthropic.com | Apr 29, 2026

Building Effective AI Agents

Erik Schluntz and Barry Zhang, Anthropic

This is one of the cleanest public pieces on agent design because it separates workflows from agents and keeps repeating the uncomfortable production lesson: start simple, add autonomy only when the task needs it, and make the tool interface inspectable.

Worth keeping next to any agent architecture work because it gives language for the tradeoff. Agents can improve task performance, but they buy that with latency, cost, and new failure modes.

Article | huyenchip.com | Apr 28, 2026

Agents

Chip Huyen

Chip Huyen's agent writeup is useful as a systems map rather than a hype post. It treats agents as planning, tool use, memory, feedback, and environment design, which is the level where most production mistakes actually happen.

I would pair this with Anthropic's piece when deciding whether a workflow should stay deterministic or earn the complexity of agentic control.

agents ai engineering llm evaluation product engineering

Podcast | latent.space | Apr 27, 2026

Language Agents: From Reasoning to Acting

Swyx and Alessio Fanelli, Latent Space

This is a good companion link for grounding the agent conversation in the reasoning-to-acting lineage instead of treating tool use as a brand-new product category.

The useful angle for this site is not the vocabulary alone. It is the reminder that agent systems have to cross the boundary from thought traces to actions, and that boundary needs evaluation, tools, and constraints.

agents reasoning ai engineering llm evaluation

Article | pedramnavid.com | Apr 26, 2026

Evaluating and Optimizing LLM Applications with DSPy

Pedram Navid

This is the DSPy link I would hand to someone who keeps saying "prompt engineering" when they really mean eval-driven optimization. The example is concrete, costed, and honest about train, validation, and holdout splits.

Especially useful because it makes LLM application work feel closer to data science: define a task, build examples, pick a metric, optimize, and check whether the gains survive outside the optimizer.

dspy llm evaluation agents ai engineering

Article | nicolasbustamante.com | Apr 24, 2026

The RAG Obituary: Killed by Agents, Buried by Context Windows

Nicolas Bustamante

The thesis is deliberately provocative, which makes it useful. RAG is not simply dead, but the old chunk-embed-rerank default is under pressure from larger contexts, agentic retrieval loops, and document-native workflows.

I would keep this as a debate starter, not a doctrine. The production question is still which retrieval shape gives the model the right evidence at the right cost with inspectable failure modes.

rag agents context engineering ai engineering

Article | quesma.com | Apr 22, 2026

Tau2 Benchmark: How a Prompt Rewrite Boosted GPT-5-mini by 22%

Przemyslaw Hejman, Quesma

This is a good reminder that eval improvements do not always come from a larger model. Sometimes the gain comes from rewriting the task boundary so the model can actually see the job.

The caveat is obvious but important: benchmark gains need translation into product behavior. Still, a 22 percent lift from prompt structure is worth pinning next to any agent eval work.

llm evaluation prompting agents ai engineering

Thread | discuss.google.dev | Apr 19, 2026

Vertex AI Agent Engine Networking Overview

Google Developer forums

This forum post is useful because Agent Engine networking is exactly where cloud AI demos turn into platform engineering. Connectivity, controls, private access, and service boundaries are not side quests.

Keeping it here because Google Cloud agent work needs operational references, not only model and orchestration references.

vertex ai gcp agents ai engineering

Article | developer.nvidia.com | Apr 18, 2026

Build Enterprise AI Agents with NVIDIA Llama Nemotron Reasoning Models

NVIDIA Technical Blog

This is NVIDIA's enterprise agent argument for reasoning models: model size choices, test-time scaling, open deployment, and the operational need to turn reasoning on or off.

The signal here is not the benchmark marketing by itself. It is the systems question underneath: when does agent work need explicit reasoning compute, and when does that cost hurt more than it helps?

agents reasoning nvidia ai engineering

Article | developer.nvidia.com | Apr 17, 2026

Build More Accurate and Efficient AI Agents with NVIDIA Llama Nemotron Super v1.5

NVIDIA Technical Blog

This follow-up is useful because it keeps the agent-model conversation tied to efficiency, not only capability. Smaller or better-routed reasoning models matter when an agent system has to run repeatedly, not just impress once.

I would read it beside the earlier Nemotron post and ask where model improvements change the architecture, not only the benchmark table.

agents reasoning nvidia ai engineering

Article | philschmid.de | Mar 25, 2026

The New Skill in AI is Not Prompting, It's Context Engineering

Philipp Schmid

This is a useful framing link because it shifts the work from clever prompts to the operational discipline of selecting, shaping, and evaluating context. That maps directly to agent systems, RAG, long-context workflows, and product features that must explain why the model saw what it saw.

The caution is that "context engineering" can become a new vague label. It is only valuable when it produces inspectable inputs, measurable outcomes, and clearer failure modes.

context engineering agents rag ai engineering

Article | lilianweng.github.io | Mar 21, 2026

LLM Powered Autonomous Agents

Lilian Weng

This is one of the foundational agent explainers: planning, memory, tool use, reflection, and the early patterns that still show up in current products. It belongs in the stream because the best agent discussions need a shared vocabulary.

The piece is also a useful time capsule. Some implementation details have moved quickly, but the core system questions have not gone away.

agents llms tool use ai engineering

Article | medium.com | Mar 19, 2026

Scaling Inference To Billions of Users And Agents

Federico Iezzi, Google Cloud

This is useful because it connects agent adoption to inference architecture. Agents do not make one call; they fan out across planning, retrieval, tool use, retries, and evaluation, which changes the serving math quickly.

Worth keeping as a scale reference for Google Cloud AI work. The product question is where inference cost becomes a feature constraint rather than a backend detail.

model serving gcp agents inference
