Agent Trace Sampling: When 100% Capture Stops Being Worth It
Summary
AI agent traces generate far more spans than ordinary HTTP requests, which drives up operating costs (storage and indexing in particular) sharply. Recording 100% of interactions is therefore inefficient; the budget should go to the "interesting" traces that actually carry value. This post lays out sampling strategies and the supporting math for managing the data volume that an agent's architecture (model calls, multi-tool use, and so on) produces, so you can debug cost-effectively.
Key points
- Agent traces generate far more spans than ordinary HTTP requests, driving storage costs up sharply.
- Recording 100% of agent interactions quickly becomes poor value for money.
- Trace value is uneven, so focus on the "interesting" traces that matter for debugging (failed attempts, complex tool calls).
- To manage cost, build a sampling strategy and understand the OpenTelemetry (OTel) configuration behind it.
Book: LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team
Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub
You ship an agent. Traffic ramps. The traces look great in Honeycomb for the first week. Every turn, every tool call, every token sits there to scrub when something breaks.
Then the bill arrives and the trace storage line is bigger than the inference line.
The question that follows is the one nobody wants to answer on a Friday: what do we drop, and how do we drop it without losing the traces we actually need?
The honest answer is that 100% capture of an agent fleet stops being worth it surprisingly early.
Traces are still the only thing that lets you debug a tool-call loop.
The problem is the value of a trace is wildly uneven.
A clean three-second turn that returned the right answer is statistical filler.
A 47-second turn that ran 11 tool calls and ended in MAX_TOKENS is gold.
Pay full price for both and you are subsidising the boring traces with the budget you need for the interesting ones.
This post is the math and the OTel config for getting that ratio right.
Why agent traces blow up the storage bill faster than HTTP traces
A normal HTTP request trace is maybe 10 spans.
A request comes in, hits a handler, talks to a database, calls one downstream service, returns.
Honeycomb, Datadog, and the rest were priced around that shape.
An agent turn is not that shape.
A single user message can generate:
- 1 agent.turn parent span
- 4 to 8 model.input / model.output spans (one per loop iteration)
- 5 to 30 tool.execute child spans, sometimes nested two deep when tools call other tools
- A gen_ai.usage.input_tokens attribute on every model span, often with the full prompt as a span event for replay
Add Model Context Protocol (MCP) tool boundaries and you double the count again because every MCP call is its own client + server span pair.
A chatty agent doing code search across three repos lands in the 60-to-120 span range per turn.
Multiply by 50,000 turns a day and you are at 3 to 6 million spans.
At the low-single-digit dollars per million indexed spans quoted on the public APM pricing pages cited below, that is real money before you have routed anything anywhere.
Per Datadog's public APM pricing page (observed 2026-05), indexed spans list around the low-single-digit dollar range per million spans retained for 15 days — call it roughly $1.27/M as an illustrative figure for the math below.
Honeycomb's pricing page publishes a billion-events-per-month-class quota on its paid tiers, which sounds like a lot until you do the per-turn math above and realise a mid-volume agent fleet eats it in two weeks.
What matters is cost per useful trace, and at 100% sampling that ratio is dominated by the boring 90%.
Head sampling: the wrong knob first
The first instinct is head sampling: make a coin-flip decision at the start of the trace and either record it or do not.
OTel's TraceIdRatioBased sampler does exactly that.
The agent SDK creates the root span, the sampler decides "you are in the 10% bucket, record everything downstream; you are in the 90% bucket, drop everything."
```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

# Coin-flip at root-span creation time: keep ~10% of traces,
# decided deterministically from the trace ID.
provider = TracerProvider(sampler=TraceIdRatioBased(0.1))
```

Cheap, simple, deterministic. Also wrong for agents.

The defect is that head sampling decides before the trace exists. You cannot ask "did this turn fail?" or "did this turn run more than five tool calls?" at root-span creation time, because none of that has happened yet. So the 90% you drop contains the same proportion of broken turns as the 10% you keep. When a customer pings support saying "the agent looped forever yesterday at 14:32," there is a 90% chance the trace is gone. That is the trace you needed.

Head sampling is fine for a high-volume read endpoint where every request looks like every other request. Agent turns do not look like each other. Use head sampling here and you save money on storage by throwing away exactly the traces that would have paid for the bill.

Tail sampling: keep the interesting ones
Tail sampling is the opposite move. Buffer the whole trace in the OTel Collector, wait until the trace is complete, then decide based on what actually happened. The OTel Collector ships a tail_sampling processor for this. The shape is a list of policies; if any policy says "keep," the trace is kept. The minimum useful config for an agent fleet looks like this:
```yaml
processors:
  tail_sampling:
    decision_wait: 30s
    num_traces: 100000
    expected_new_traces_per_sec: 200
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 10000
      - name: keep-tool-heavy
        type: span_count
        span_count:
          min_spans: 20
      - name: keep-baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```
Four rules, in plain English: keep every trace that errored, keep every trace longer than 10 seconds, keep every trace with 20 or more spans, and keep a 5% baseline of everything else for when you need to compute "what does normal look like?" That mix typically retains traces in the 10–20% range depending on traffic mix while keeping close to 100% of the broken turns. Your numbers will vary. The ratio of useful-trace-to-cost flips. You stop paying full freight for the boring 80%, and the on-call rotation stops opening the trace UI to find the trace they need has been thrown away.
Two things to know before you ship that config. The first is decision_wait: the collector buffers each trace in memory until it has been quiet for that long, then decides. Set it shorter than your slowest agent turn and you make the decision before all the spans arrive. Set it longer than you need and the collector's memory grows. 30 seconds is a sensible default for agent fleets where most turns finish under 15. The second: tail sampling assumes every span of a trace lands at the same collector instance. That is why production deployments put a load-balancing exporter in front of a fixed pool of tail-sampling collectors. Skip the load balancer and the policy decisions get random because each collector only sees half the trace.
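A sketch of that front tier, assuming the contrib collector's loadbalancing exporter with trace-ID routing (the hostnames are placeholders for your own tail-sampling pool):

```yaml
# Tier 1: stateless collectors that route whole traces by trace ID,
# so every span of a trace lands on the same tail-sampling instance.
exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      static:
        hostnames:
          - tail-collector-1:4317
          - tail-collector-2:4317
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]
```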
Reservoir sampling: the baseline that does not lie to you
The 5% probabilistic policy above is good enough for most teams. There is a sharper version when you care about statistical questions like "what is the p95 token count of a normal turn?" The problem with naive 5% sampling is that during a traffic spike (a marketing email lands, traffic 10×s for an hour) you keep 5% of a much bigger pool, and your "baseline" sample is dominated by an unusual hour. The p95 you compute from those traces is the p95 of that hour, not of normal operation.
Reservoir sampling fixes this. Pick a fixed reservoir size (say, 10,000 traces per hour) and admit traces with probability that decreases as the reservoir fills. Every trace has equal probability of being in the final reservoir, regardless of whether it arrived during a spike or a quiet patch. The OTel tail_sampling processor does not ship a true reservoir policy. The closest practical approximation is a rate_limiting policy, which is a flow cap rather than a reservoir, but it has the property you actually want here: it bounds baseline volume regardless of input rate.
```yaml
- name: keep-baseline-rate-cap
  type: rate_limiting
  rate_limiting:
    spans_per_second: 50
```
At 80 spans per turn that is roughly 180k spans an hour, or about 2,000 baseline traces an hour, regardless of whether the upstream traffic doubled.
Combined with the keep-errors and keep-slow policies above, it gives you a baseline whose volume is stable across spikes. That is what you need when you compute SLOs and want the p95 to mean something.
If you need true reservoir semantics for statistical accuracy, do the reservoir admission yourself in a custom processor or upstream of the collector.
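A minimal sketch of that admission logic (classic Algorithm R), assuming you key it on root trace IDs and reset it at each window boundary. Nothing here is a collector API; it is the piece you would wrap in a custom processor:

```python
import random

class TraceReservoir:
    """Uniform sample of k traces per window, regardless of
    how many traces arrive during the window."""

    def __init__(self, k: int):
        self.k = k
        self.seen = 0
        self.sample: list[str] = []

    def offer(self, trace_id: str) -> None:
        self.seen += 1
        if len(self.sample) < self.k:
            self.sample.append(trace_id)
        else:
            # Replace a kept trace with probability k / seen, so every
            # trace ends the window in the reservoir with equal probability.
            j = random.randrange(self.seen)
            if j < self.k:
                self.sample[j] = trace_id

    def reset(self) -> None:
        # Call at each window boundary (e.g. hourly) before reuse.
        self.seen = 0
        self.sample.clear()
```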
Parent-based propagation across MCP and tool boundaries
The configs above only work if the trace is connected. The instant you cross a tool-call boundary into an MCP server or a downstream microservice, you need parent-based sampling, or the child spans get their own sampling decision and the trace fragments.
OTel's ParentBased sampler is the standard wrapper:
```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import (
    ALWAYS_OFF,
    ALWAYS_ON,
    ParentBased,
    TraceIdRatioBased,
)

provider = TracerProvider(
    sampler=ParentBased(
        # Root spans: record everything; the tail sampler in the
        # collector makes the real keep/drop decision later.
        root=TraceIdRatioBased(1.0),
        # Child spans: inherit whatever the remote parent decided.
        remote_parent_sampled=ALWAYS_ON,
        remote_parent_not_sampled=ALWAYS_OFF,
    ),
)
```
Translation: at the root, capture everything (let the tail sampler in the collector decide later). For any span with a remote parent, defer to whatever the parent decided. If the parent's trace was sampled, sample me; if not, drop me.
That is the only configuration where MCP tool calls show up under the agent turn that triggered them.
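None of that works unless the trace context actually crosses the wire. A minimal sketch of manual propagation across a tool boundary, assuming you control both sides of the call; send_mcp_request and the handler shape are hypothetical stand-ins for your transport:

```python
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("agent")

def call_tool(payload: dict) -> None:
    # Client side: copy the active trace context into the outgoing
    # request's metadata (W3C traceparent/tracestate keys).
    headers: dict[str, str] = {}
    inject(headers)
    send_mcp_request(payload, headers)  # hypothetical transport call

def handle_tool_request(payload: dict, headers: dict) -> None:
    # Server side: extract the remote context so this span becomes a
    # child of the caller's span, and ParentBased defers to the root.
    ctx = extract(headers)
    with tracer.start_as_current_span("tool.execute", context=ctx):
        ...  # tool body
```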
The mistake to avoid is independent samplers in each service.
If your agent runs TraceIdRatioBased(0.1) and the MCP server it calls runs TraceIdRatioBased(0.1) independently, the joint probability of a complete trace is 1%, not 10%.
After three hops you are at 0.1%.
Always wrap with ParentBased once you have downstream services, and let one decision propagate.
Sample-by-tenant when one customer is the problem
The last policy is the one that pays for itself the moment a single customer files a support ticket.
Sample-by-tenant: keep a higher fraction of traces from a named subset of tenants.
The OTel collector supports it via the string_attribute policy:
```yaml
- name: keep-debug-tenants
  type: string_attribute
  string_attribute:
    key: tenant.id
    values: [acme-corp, big-bank-inc]
    enabled_regex_matching: false
```
Set tenant.id as a span attribute on the root span of every turn.
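A sketch of stamping that attribute at turn start, assuming the root span is created in your own code (tenant_id is whatever your auth layer resolves):

```python
from opentelemetry import trace

tracer = trace.get_tracer("agent")

def handle_turn(tenant_id: str, message: str) -> None:
    # Root span for the whole turn; tail-sampling policies can only
    # match attributes that are actually present on the spans.
    with tracer.start_as_current_span("agent.turn") as span:
        span.set_attribute("tenant.id", tenant_id)
        ...  # model loop + tool calls
```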
When support pings you about acme-corp, flip them onto the debug list and you get 100% of their traces from that moment forward without changing global retention.
When their issue is fixed, take them off.
The tenant who is not the problem stays on the default mix.
This is also the policy that makes per-tenant SLOs honest.
If your enterprise tier promises a different latency SLO than free, sample those tenants at a higher rate so the p95 you report on their dashboard is not statistical noise. Apply a 5% baseline to a small tenant and you end up computing a p95 from roughly 50 traces a day, which is not a number you can trust.
The cost math in a single paragraph: run all four policies (keep-errors, keep-slow, keep-tool-heavy, 5% baseline) on a typical agent fleet and you typically retain 10–20% of traces while keeping close to 100% of the traces an on-call engineer will ever actually open. At list prices, that is roughly an 80% cut in trace storage spend.
For a fleet doing 50,000 turns a day at ~80 spans per turn, the policies above take you from 4 million ingested spans a day (≈120M/month) at 100% capture down to about 600k a day (≈18M/month).
Using the $1.27 per million indexed spans figure quoted earlier from Datadog's public APM pricing page (observed 2026-05), the monthly delta on the trace line is roughly $150 vs $23. The broken turns are all still there, so nothing is lost operationally.
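The same arithmetic as a quick sanity check, using the figures from the paragraphs above (adjust the constants for your own volumes):

```python
TURNS_PER_DAY = 50_000
SPANS_PER_TURN = 80
PRICE_PER_M = 1.27    # $/million indexed spans, list price cited above
KEEP_RATIO = 0.15     # ~10-20% retained under the four policies

spans_month = TURNS_PER_DAY * SPANS_PER_TURN * 30  # 120M spans/month
kept_month = spans_month * KEEP_RATIO              # 18M spans/month

print(f"100% capture: ${spans_month / 1e6 * PRICE_PER_M:,.0f}/mo")  # ~$152
print(f"tail-sampled: ${kept_month / 1e6 * PRICE_PER_M:,.0f}/mo")   # ~$23
```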
Treat the dollar figures as arithmetic on public list prices, not as your actual bill. Real effective rates depend on commits, retention tiers, and whatever your account team negotiated.
What to ship this week: if the storage line is climbing on an agent service running at 100% sampling, the deployment order is:
- Ship the parent-based sampler in every service first (so traces stay connected).
- Deploy a tail-sampling collector pool with the four policies above.
- Add the tenant attribute to the root span.
Move to reservoir sampling only if you compute statistics off the baseline. Most teams never do, and the rate-limiting policy is enough.
The real payoff is knowing that the traces you keep are the ones you need. Ship the parent-based sampler tomorrow morning, and pick your policies before the next bill arrives.
If this was useful: The LLM Observability Pocket Guide covers the GenAI semantic conventions these configs depend on, the collector deployment topology that keeps tail sampling honest, and the tenant-attribute work that makes per-customer debugging possible without changing global retention. The agent-tracing chapter pairs directly with this post.