Outcome focus: Reader can instrument an LLM pipeline or agent workflow with OTEL GenAI conventions, export spans and cost metrics to any compatible backend, and build alerts on real token spend and latency instead of inferring from flat logs.
Tags: observability, opentelemetry, llms, agents, monitoring, mlops, ai engineering
The billing alert fired on day eleven, not the observability dashboard. The agent had passed every eval and every integration test, but nobody had wired token counting to a real metric before launch. The alert said we were over budget. It could not say which step was responsible — the planner call, the retrieval step, or the generator. The answer was somewhere in the logs, but the logs were a flat stream of JSON events with no trace graph and no causal chain between steps.
That is the failure pattern. You can reconstruct what happened after the fact, but you cannot catch it while it is happening, and you cannot prevent it next time without a different instrumentation model.
What custom logging misses
Most teams instrument LLM calls the same way they instrument REST API calls: log the request, log the response, pull usage.prompt_tokens and usage.completion_tokens into a custom metric. That works for a single call. It breaks in a multi-step agent pipeline.
When five steps execute in sequence (intent classification, document retrieval, context assembly, generation, output validation) and each one logs independently, the result is a flat stream of events with no parent-child relationship. You cannot answer which step took 3.8 seconds. You cannot tell whether the retry on step three was caused by a timeout in step two's tool call or by a model capacity issue. Token volume sits in aggregate metrics, so cost attribution by pipeline step is guesswork.
The gap is not log verbosity. It is trace topology. Flat logs record what happened; spans record what caused what, in what order, and at what cost.
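To make the difference concrete, here is a minimal sketch of the two questions above answered from span records. It uses a plain-Python stand-in for exported spans, not the OTEL SDK types; the `SpanRecord` class, the span IDs, the step names, and the timings are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpanRecord:
    """Plain-Python stand-in for an exported span (not the OTEL SDK type)."""
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def slowest_child(spans: list[SpanRecord], parent_id: str) -> SpanRecord:
    """Return the direct child of `parent_id` with the longest duration."""
    children = [s for s in spans if s.parent_id == parent_id]
    return max(children, key=lambda s: s.duration_ms)


def causal_chain(spans: list[SpanRecord], span_id: str) -> list[str]:
    """Walk parent links from a span up to the root, returned root-first."""
    by_id = {s.span_id: s for s in spans}
    chain: list[str] = []
    current = by_id.get(span_id)
    while current is not None:
        chain.append(current.name)
        current = by_id.get(current.parent_id) if current.parent_id else None
    return chain[::-1]


# Hypothetical trace: IDs, names, and timings are invented for illustration.
SPANS = [
    SpanRecord("root", None, "invoke_agent", 0, 5200),
    SpanRecord("classify", "root", "classify_intent", 0, 400),
    SpanRecord("retrieve", "root", "retrieve_documents", 400, 4200),
    SpanRecord("tool", "retrieve", "execute_tool search_docs", 450, 4150),
    SpanRecord("generate", "root", "generate", 4200, 5200),
]

slow = slowest_child(SPANS, "root")
print(slow.name, slow.duration_ms)
print(" -> ".join(causal_chain(SPANS, "tool")))
```

Both lookups depend only on the parent link. A flat event stream has no equivalent field, which is why the same questions turn into timestamp guesswork there.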
What GenAI semantic conventions define
OpenTelemetry's GenAI semantic conventions are a CNCF-backed vocabulary for AI workload instrumentation. They define span names, attribute keys, and metric names so that every model call, tool invocation, and agent workflow emits the same shape of data regardless of provider.
The core attribute set covers what every observability query needs:
| Attribute | Span type | Example |
|---|---|---|
| gen_ai.system | All | openai, anthropic, vertex_ai |
| gen_ai.operation.name | All | chat, embeddings, execute_tool |
| gen_ai.request.model | LLM call | gpt-4o, claude-sonnet-4-6 |
| gen_ai.response.model | LLM call | actual model version served |
| gen_ai.usage.input_tokens | LLM call | 342 |
| gen_ai.usage.output_tokens | LLM call | 89 |
| gen_ai.request.max_tokens | LLM call | 1024 |
| gen_ai.tool.name | Tool call | search_docs, execute_code |
| gen_ai.agent.name | Agent span | support-triage-agent |
The contract is that every conformant system uses the same keys. Your backend dashboard, alert rule, and cost query stay stable as models and providers change. When the model name rotates from gpt-4o to the next version, the attribute value changes; the attribute name does not.
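As a small illustration of that contract, the sketch below maps two differently shaped provider responses onto the same attribute keys. The payloads are trimmed, hypothetical dicts rather than real SDK objects, and `genai_usage_attributes` is a helper name invented here. The one provider-specific detail it relies on is that OpenAI chat completions report `prompt_tokens` and `completion_tokens`, while Anthropic messages report `input_tokens` and `output_tokens`.

```python
def genai_usage_attributes(system: str, request_model: str, response: dict) -> dict:
    """Map a provider response (as a plain dict) onto GenAI attribute keys."""
    usage = response.get("usage", {})
    # Accept either naming scheme for the usage block.
    input_tokens = usage.get("input_tokens", usage.get("prompt_tokens"))
    output_tokens = usage.get("output_tokens", usage.get("completion_tokens"))

    attributes = {
        "gen_ai.system": system,
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": request_model,
        # Fall back to the requested model if the response omits the served one.
        "gen_ai.response.model": response.get("model", request_model),
    }
    if input_tokens is not None:
        attributes["gen_ai.usage.input_tokens"] = input_tokens
    if output_tokens is not None:
        attributes["gen_ai.usage.output_tokens"] = output_tokens
    return attributes


# Hypothetical, trimmed payloads for illustration only.
openai_response = {
    "model": "gpt-4o-2024-08-06",
    "usage": {"prompt_tokens": 342, "completion_tokens": 89},
}
anthropic_response = {
    "model": "claude-sonnet-4-6",
    "usage": {"input_tokens": 342, "output_tokens": 89},
}

openai_attrs = genai_usage_attributes("openai", "gpt-4o", openai_response)
anthropic_attrs = genai_usage_attributes("anthropic", "claude-sonnet-4-6", anthropic_response)
print(sorted(openai_attrs) == sorted(anthropic_attrs))
```

The values differ per provider, but a dashboard query keyed on `gen_ai.usage.input_tokens` does not need to know which provider produced the span.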
Datadog, Grafana, and Uptrace increasingly ingest GenAI-conformant spans and surface them natively. You pay the instrumentation cost once, and every backend that understands the conventions builds on it.
How a trace hierarchy works for an agent
A conformant trace for a multi-step support triage agent looks like this:
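A rough stand-in for such a trace, modeled with plain Python objects rather than the OTEL SDK, might be sketched as follows. The `Span` class, the internal step names (`retrieve_documents`, `assemble_context`, `validate_output`), and every duration and token count are assumptions chosen for illustration, apart from the 1,204-token generation input. The span names follow the `{operation} {target}` pattern the conventions recommend.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Illustrative span node: a name, a duration, attributes, and children."""
    name: str
    duration_ms: int
    attributes: dict = field(default_factory=dict)
    children: list["Span"] = field(default_factory=list)


def render(span: Span, depth: int = 0) -> list[str]:
    """Render a span and its descendants as indented text lines."""
    line = f"{'  ' * depth}{span.name} [{span.duration_ms} ms]"
    tokens_in = span.attributes.get("gen_ai.usage.input_tokens")
    tokens_out = span.attributes.get("gen_ai.usage.output_tokens")
    if tokens_in is not None:
        line += f" in={tokens_in} out={tokens_out}"
    lines = [line]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


def total_tokens(span: Span) -> int:
    """Sum input and output tokens over a span and all of its descendants."""
    own = (span.attributes.get("gen_ai.usage.input_tokens", 0)
           + span.attributes.get("gen_ai.usage.output_tokens", 0))
    return own + sum(total_tokens(child) for child in span.children)


# Hypothetical support-triage trace; all numbers are illustrative.
TRACE = Span(
    "invoke_agent support-triage-agent", 5600,
    {"gen_ai.agent.name": "support-triage-agent"},
    [
        # Step 1: intent classification
        Span("chat gpt-4o", 400,
             {"gen_ai.usage.input_tokens": 342, "gen_ai.usage.output_tokens": 89}),
        # Step 2: document retrieval, which wraps a tool call
        Span("retrieve_documents", 3800, {}, [
            Span("execute_tool search_docs", 3700, {"gen_ai.tool.name": "search_docs"}),
        ]),
        # Step 3: context assembly (no model call)
        Span("assemble_context", 50),
        # Step 4: generation, fed by what retrieval returned
        Span("chat gpt-4o", 1200,
             {"gen_ai.usage.input_tokens": 1204, "gen_ai.usage.output_tokens": 256}),
        # Step 5: output validation
        Span("validate_output", 150),
    ],
)

print("\n".join(render(TRACE)))
print("total tokens:", total_tokens(TRACE))
```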
What this makes visible that flat logs cannot: step ordering, latency per hop, token cost per step, and which tool branch triggered the expensive model call. The 1,204-token input on the generation step is traceable back to what the retrieval step returned. That chain is invisible in aggregate metrics.
Turning token attributes into cost metrics
The gen_ai.usage.input_tokens and gen_ai.usage.output_tokens span attributes are readable after the fact in a trace viewer. To build real-time cost dashboards and alerts, you also need to emit them as OTEL counters.
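The standard metric name is gen_ai.client.token.usage, dimensioned by gen_ai.system, gen_ai.request.model, and gen_ai.token.type. The conventions specify this instrument as a histogram; the pattern here records it as a counter, which is enough for summing spend, so use create_histogram and record instead if you need strict conformance. Here is the minimal instrumentation pattern using opentelemetry-sdk and opentelemetry-semantic-conventions: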
    from openai import OpenAI
    from opentelemetry import trace, metrics
    from opentelemetry.semconv._incubating.attributes import gen_ai_attributes

    # Assumes the OpenAI v1 SDK; the client reads OPENAI_API_KEY from the environment.
    client = OpenAI()
    tracer = trace.get_tracer(__name__)
    meter = metrics.get_meter(__name__)

    token_counter = meter.create_counter(
        name="gen_ai.client.token.usage",
        description="Token usage by operation, model, and token type",
        unit="{token}",
    )

    def traced_chat(model: str, messages: list[dict]) -> str:
        # The conventions recommend "{operation} {model}" as the span name.
        with tracer.start_as_current_span(f"chat {model}") as span:
            span.set_attribute(gen_ai_attributes.GEN_AI_OPERATION_NAME, "chat")
            span.set_attribute(gen_ai_attributes.GEN_AI_SYSTEM, "openai")
            span.set_attribute(gen_ai_attributes.GEN_AI_REQUEST_MODEL, model)

            response = client.chat.completions.create(model=model, messages=messages)

            # Record token usage on the span for trace-level inspection.
            input_tokens = response.usage.prompt_tokens
            output_tokens = response.usage.completion_tokens
            span.set_attribute(gen_ai_attributes.GEN_AI_USAGE_INPUT_TOKENS, input_tokens)
            span.set_attribute(gen_ai_attributes.GEN_AI_USAGE_OUTPUT_TOKENS, output_tokens)

            # Emit the same counts as a metric, one data point per token type.
            for token_type, count in [("input", input_tokens), ("output", output_tokens)]:
                token_counter.add(
                    count,
                    {
                        gen_ai_attributes.GEN_AI_OPERATION_NAME: "chat",
                        gen_ai_attributes.GEN_AI_SYSTEM: "openai",
                        gen_ai_attributes.GEN_AI_REQUEST_MODEL: model,
                        "gen_ai.token.type": token_type,
                    },
                )
            return response.choices[0].message.content

The span carries the token counts for trace-level visibility. The counter carries them for metric-level aggregation: cost by model, cost by agent step, cost by environment. Both come from the same call. The counter attributes are stable across provider and model changes because the attribute names are part of the convention, not your code.
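Token counts only become a cost signal once they are multiplied by a price. The sketch below shows that arithmetic over data points shaped like the counter's output. It rests on several assumptions: the prices are placeholders rather than real rates, `pipeline.step` is a hypothetical custom attribute that is not part of the GenAI conventions, and in practice this calculation usually lives in the backend's query layer rather than in service code.

```python
from collections import defaultdict

# Placeholder prices in USD per one million tokens. These are NOT real rates;
# substitute your provider's current price sheet.
PRICE_PER_MILLION_USD = {
    "gpt-4o": {"input": 2.00, "output": 8.00},
}


def token_cost_usd(model: str, token_type: str, count: int) -> float:
    """Convert a token count into USD using the placeholder price table."""
    return count * PRICE_PER_MILLION_USD[model][token_type] / 1_000_000


def cost_by_step(data_points: list[dict]) -> dict[str, float]:
    """Sum cost per pipeline step from counter-style data points."""
    totals: dict[str, float] = defaultdict(float)
    for point in data_points:
        totals[point["pipeline.step"]] += token_cost_usd(
            point["gen_ai.request.model"],
            point["gen_ai.token.type"],
            point["value"],
        )
    return dict(totals)


# Hypothetical data points; "pipeline.step" is a custom attribute added for attribution.
DATA_POINTS = [
    {"pipeline.step": "classify", "gen_ai.request.model": "gpt-4o",
     "gen_ai.token.type": "input", "value": 342},
    {"pipeline.step": "classify", "gen_ai.request.model": "gpt-4o",
     "gen_ai.token.type": "output", "value": 89},
    {"pipeline.step": "generate", "gen_ai.request.model": "gpt-4o",
     "gen_ai.token.type": "input", "value": 1204},
    {"pipeline.step": "generate", "gen_ai.request.model": "gpt-4o",
     "gen_ai.token.type": "output", "value": 256},
]

for step, cost in cost_by_step(DATA_POINTS).items():
    print(f"{step}: ${cost:.6f}")
```

Keeping `gen_ai.token.type` as a dimension matters here because input and output tokens are typically priced differently, so a single combined count cannot be converted to cost.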
The tradeoff
Bespoke per-model logging gives more flexibility — custom log shapes, non-standard cost fields, pipeline-specific context that is hard to express in a general schema. What it costs is portability (each new model or tool integration needs new instrumentation logic), cross-service correlation (no trace context propagation), and dashboard stability (attribute names drift as teams make different choices).
Semantic conventions trade that flexibility for portability, consistency, and first-class tool support. The right time to make that trade is before the second model integration, not after the fourth. At one model, custom logging is a working shortcut. At four models, two tools, and three agent variants, it is the reason your on-call dashboard has five separate panels with incompatible schemas.
The failure mode worth naming
Exporting via the OTEL Collector
Spans and token counters flow out of your service via the OTEL SDK and into a Collector that routes them to your backend. A minimal Collector config that targets GCP Monitoring and Managed Prometheus:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318

    processors:
      batch:

    exporters:
      googlecloud:
        project: ${GCP_PROJECT_ID}
      prometheusremotewrite:
        endpoint: http://managed-prometheus:9090/api/v1/write

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [googlecloud]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [googlecloud, prometheusremotewrite]

Because traces and metrics flow through the same Collector, Google Cloud receives both the trace topology and the cost counters. The gen_ai.client.token.usage counter becomes an alertable signal sitting next to CPU and memory in the same dashboard. The trace view and the cost graph share the same gen_ai.* attributes, so you can move from one to the other without assembling the picture manually from separate log queries.
The same Collector config routes to Datadog or Grafana by swapping the exporter block. The service code does not change.
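Once the counter reaches a backend, an alert on token spend reduces to comparing the counter's increase over a time window against a budget. The sketch below shows that logic in plain Python so the arithmetic is visible; real alert rules would be written in the backend's own rule language. The function names, sample values, and budget are assumptions, samples are assumed to be sorted by time, and counter resets are deliberately not handled.

```python
def window_increase(samples: list[tuple[int, int]], now_s: int, window_s: int) -> int:
    """Increase of a cumulative counter over the trailing window.

    `samples` is a time-sorted list of (timestamp_seconds, cumulative_tokens).
    Counter resets are not handled in this sketch.
    """
    in_window = [value for ts, value in samples if now_s - window_s <= ts <= now_s]
    if len(in_window) < 2:
        return 0  # not enough data to compute an increase
    return in_window[-1] - in_window[0]


def over_budget(samples: list[tuple[int, int]], now_s: int,
                window_s: int, budget_tokens: int) -> bool:
    """True when token spend in the window exceeds the budget."""
    return window_increase(samples, now_s, window_s) > budget_tokens


# Hypothetical cumulative token counts scraped once per minute.
SAMPLES = [(0, 0), (60, 12_000), (120, 30_000), (180, 95_000), (240, 180_000)]

print(over_budget(SAMPLES, now_s=120, window_s=120, budget_tokens=100_000))
print(over_budget(SAMPLES, now_s=240, window_s=120, budget_tokens=100_000))
```

Alerting on the windowed increase rather than the running total is what lets the rule fire while spend is accelerating, instead of after the monthly budget is already gone.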
If you are shipping an agent this sprint, add the OTEL SDK and the token counter before the demo, not after. The cost signal you skip in testing is the one that fires a billing alert in week two. The trace you skip during integration is the one you reconstruct by hand at 2am when latency doubles.
The conventions are stable enough to instrument once. That is the actual payoff: you stop rewriting dashboards every time the model name changes in the response.