
© 2026 Molayo

Dev.to Headlines | 2026-05-08 04:41

Eval Set Sizing: The Statistical Power Math Behind LLM A/B Tests

Summary

Setting the evaluation dataset size (eval set sizing) appropriately is critical in LLM A/B testing. A score difference alone does not mean an improvement; you have to verify that the difference is statistically significant. The post points out that gains measured on small samples may be nothing but noise, and shows how to use confidence intervals to work out mathematically whether a score change is real or chance.

Key points

  • Performance gains measured on small eval sets are likely statistical noise. (Example: a 4-point difference on 100 samples can sit entirely inside the confidence interval.)
  • Pass-rate evals follow a binomial distribution, so computing an adequate sample size is essential.
  • The precision of the verdict scales with the square root of the sample size (precision scales with sqrt(n)), while the cost of a run scales linearly with n.
  • The post gives the exact sample-size formula for a two-proportion comparison needed to detect a difference between arms A and B.

Book: LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team
Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

A team ships a prompt change. The eval set has 100 questions. The old prompt scored 78, the new prompt scores 82. Slack lights up; the numbers land in the deploy note; the change ships.
Two weeks later, customer support tickets are flat. The "win" was four examples flipping out of a hundred. With a sample that small, four is well inside the noise floor. The same prompt re-run on the same model on a different day moves by more than that. The team did not measure an improvement; they measured a coin flip and treated the side it landed on as a result.
This is the cheapest, most common eval mistake in production LLM work.
The fix is not better judges or more sophisticated metrics. The fix is making the eval set large enough that a 4-point delta means something. That is a math problem with a known answer.

Why 100 examples is a coin flip
Pass-rate evals are a binomial. Each example is a Bernoulli trial: the answer is correct or it is not. The score is the sample proportion. The standard error of a proportion at n=100 and p=0.80 is about sqrt(0.80 * 0.20 / 100) = 0.04.
A 95% confidence interval is roughly ±2 * SE = ±8 percentage points.
Read that again. With 100 examples at an 80% pass rate, the true pass rate sits between 72 and 88 with 95% confidence.
A 4-point delta between two prompts is less than half the noise. That is for a single run. To detect a delta between prompt A and prompt B, you need both confidence intervals not to overlap. The standard error of the difference grows with the variances of both arms, so the threshold is stricter.
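That stricter threshold is easy to check numerically. A short sketch (`se_diff` is a helper name introduced here, not from the post) computing the standard error of the difference between two independent pass rates at n=100 per arm:

```python
import math

def se_diff(p1: float, n1: int, p2: float, n2: int) -> float:
    """Standard error of the difference between two independent proportions."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# A 4-point observed delta at n=100 per arm is well under 2 SEs of the difference.
se = se_diff(0.78, 100, 0.82, 100)
print(f"SE(diff) = {se:.4f}, z for a 4-point delta = {0.04 / se:.2f}")
```

The z score of roughly 0.7 is nowhere near the 1.96 needed for 95% confidence, which is the "coin flip" claim in numbers.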

import math

def se_proportion(p: float, n: int) -> float:
    return math.sqrt(p * (1 - p) / n)

def ci_95(p: float, n: int) -> tuple[float, float]:
    se = se_proportion(p, n)
    return (p - 1.96 * se, p + 1.96 * se)

print(ci_95(0.80, 100))   # (~0.72, ~0.88)
print(ci_95(0.80, 1000))  # (~0.78, ~0.82)
print(ci_95(0.80, 5000))  # (~0.789, ~0.811)

At n=1000, the 95% interval shrinks to roughly ±2.5 points.
At n=5000, ±1.1 points.
The cost of an evaluation run scales linearly with n, but the precision of your verdict scales with sqrt(n). Pay for the samples or skip the claim.
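The sqrt(n) scaling can be seen directly: each 4x increase in sample size only halves the interval. A small sketch (`ci_half_width` is a hypothetical helper reusing the standard-error formula above):

```python
import math

def ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """95% CI half-width for a pass rate p estimated from n examples."""
    return z * math.sqrt(p * (1 - p) / n)

# Quadrupling n halves the half-width: precision scales with sqrt(n).
for n in (100, 400, 1600, 6400):
    print(f"n={n:5d}: ±{ci_half_width(0.80, n) * 100:.2f} pts")
```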

The sample-size formula nobody runs
The number you actually want is: how many examples do I need to detect a delta of size Δ with significance α=0.05 and power 1−β=0.80.
The closed-form for two-proportion comparison is:
n_per_arm ≈ ( z(α/2) * sqrt(2 * p̄ * (1 − p̄)) + z(β) * sqrt(p1*(1−p1) + p2*(1−p2)) )² / Δ²
where p̄ = (p1 + p2) / 2, Δ = |p2 − p1|, z(0.025) = 1.96, z(0.20) = 0.84.

In code:
import math
from statistics import NormalDist

def n_per_arm(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    pooled = math.sqrt(2 * p_bar * (1 - p_bar))
    split = math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    delta = abs(p2 - p1)
    n = ((z_alpha * pooled + z_beta * split) / delta) ** 2
    return math.ceil(n)

print(n_per_arm(0.78, 0.80))  # ~6,510
print(n_per_arm(0.78, 0.82))  # ~1,569
print(n_per_arm(0.78, 0.85))  # ~482
print(n_per_arm(0.78, 0.90))  # ~146

A 2-point delta at the 80% range needs about 6,500 examples per arm to detect with 95% confidence and 80% power. A 4-point delta needs about 1,570 per arm. A 7-point delta needs about 480 per arm. That is the first number where a 500-example eval is honest. A 12-point delta needs about 145 per arm. The 100-example eval from the opening can detect roughly a 14-point delta. Below that, every "win" is statistically a flip.
How many examples you need depends on the size of the change you want to catch. If you only care about catching regressions of 10 points or more, 250 examples is fine. If you want to catch a 2-point lift, you are in the thousands per arm or you are guessing.
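Running the formula in reverse answers the question most teams actually face: given the eval set I already have, what is the smallest delta it can honestly detect? The scan below is a sketch; `min_detectable_delta` is a helper name introduced here, reusing the same two-proportion formula and stepping deltas in 0.1-point increments.

```python
import math
from statistics import NormalDist

def n_per_arm(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    # Same two-proportion sample-size formula as above.
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    pooled = math.sqrt(2 * p_bar * (1 - p_bar))
    split = math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return math.ceil(((z_alpha * pooled + z_beta * split) / abs(p2 - p1)) ** 2)

def min_detectable_delta(p1: float, n: int) -> float:
    """Smallest uplift over baseline p1 that n examples per arm can detect."""
    delta = 0.001
    while p1 + delta < 1.0:
        if n_per_arm(p1, p1 + delta) <= n:
            return delta
        delta += 0.001
    return float("nan")

for n in (100, 250, 500, 1000):
    print(f"n={n:4d}: smallest detectable delta ~ {min_detectable_delta(0.78, n) * 100:.1f} pts")
```

At a 78% baseline this lands around 14 points for n=100 and around 5 points for n=1000, which is the same message as the forward calculation.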

You do not actually need both arms full
Most prompt-change A/B tests are paired: the same eval question runs through both prompts. That structure gives you a sharper test for free. The relevant statistic is McNemar's test on the discordant pairs (cases where one prompt passed and the other failed), not two independent proportions.

import math
from statistics import NormalDist

def mcnemar_n(p_disc: float, delta_disc: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Sample size (number of pairs) for a paired binary outcome.

    p_disc: total fraction of discordant pairs (one passes, one fails)
    delta_disc: difference between the two discordant cells (p10 - p01 in McNemar terms)
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n = ((z_alpha * math.sqrt(p_disc) + z_beta * math.sqrt(p_disc - delta_disc ** 2)) / delta_disc) ** 2
    return math.ceil(n)

# 8% of pairs disagree, of which the new prompt wins 5% net
print(mcnemar_n(0.08, 0.05))  # ~250

For the same effect size, the paired design needs an order of magnitude fewer examples than two independent arms. If both prompts pass on the easy questions and both fail on the hard ones, those rows tell you nothing. The signal lives entirely in the discordant pairs. Counting the rest is wasted compute.
If the eval harness already runs both prompts on the same questions (and it should), the paired test is free. Switch to it.
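The sizing function above says how many pairs you need; actually running the test is a few more lines. A sketch under the normal approximation (`mcnemar_z` is a name introduced here, and for very small discordant counts you would use the exact binomial form instead):

```python
import math
from statistics import NormalDist

def mcnemar_z(n10: int, n01: int) -> tuple[float, float]:
    """Normal-approximation McNemar test on the discordant counts.

    n10: questions the new prompt passes and the old prompt fails
    n01: questions the old prompt passes and the new prompt fails
    Returns (z, two-sided p-value).
    """
    z = (n10 - n01) / math.sqrt(n10 + n01)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical run: 1,000 paired questions, 52 flips toward new, 28 toward old.
z, p = mcnemar_z(52, 28)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Note that the concordant 920 pairs never enter the statistic, which is exactly why counting them buys nothing.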

Sequential testing: stop early when the answer is obvious
Running 6,000 examples per arm is expensive. If the first 800 already show a 10-point gap, the remaining 5,200 are wasted compute. Sequential testing lets you peek at the data and stop early without inflating the false-positive rate, provided you correct for the peeks.
The naive version is wrong: "I'll check after every 100 examples and stop when p < 0.05." Each peek is another shot at a false positive. Five peeks at α=0.05 each pushes the true α to roughly 0.14, and it keeps climbing past 0.20 as you add more looks.
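The inflation is easy to verify with a null simulation. Everything below is made up for illustration: both arms are truly identical at an 80% pass rate, and we peek after every 200 examples per arm with a naive p < 0.05 stop rule. The measured false-positive rate lands well above the nominal 0.05.

```python
import math
import random

def two_prop_z(x1: int, x2: int, n: int) -> float:
    """Pooled two-proportion z statistic for equal arm sizes n."""
    p1, p2 = x1 / n, x2 / n
    p = (x1 + x2) / (2 * n)
    se = math.sqrt(p * (1 - p) * 2 / n)
    return (p2 - p1) / se if se > 0 else 0.0

random.seed(0)
SIMS, PEEKS, STEP, P_TRUE = 2000, 5, 200, 0.80
hits = 0  # simulations that ever "reject" despite no real difference
for _ in range(SIMS):
    a = b = 0
    for k in range(1, PEEKS + 1):
        a += sum(random.random() < P_TRUE for _ in range(STEP))
        b += sum(random.random() < P_TRUE for _ in range(STEP))
        if abs(two_prop_z(a, b, k * STEP)) > 1.96:  # naive p < 0.05 stop rule
            hits += 1
            break
print(f"false-positive rate with 5 naive peeks: {hits / SIMS:.3f}")
```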
The correct version uses an alpha-spending function. Pocock and O'Brien-Fleming boundaries are the standard. The shape: at the first interim look, require a much stricter p -value than 0.05; loosen it as more data comes in; the cumulative type-I error stays at 0.05 across all peeks.

import math
from statistics import NormalDist

def obrien_fleming_threshold(k: int, K: int, alpha: float = 0.05) -> float:
    """O'Brien-Fleming z-threshold at peek k of K total peeks.

    Stricter early, loose late. Returns the critical |z|.
    """
    z_full = NormalDist().inv_cdf(1 - alpha / 2)
    return z_full * math.sqrt(K / k)

for k in range(1, 6):
    z = obrien_fleming_threshold(k, K=5)
    print(f"peek {k}/5: |z| > {z:.3f}")

peek 1/5: |z| > 4.382
peek 2/5: |z| > 3.099
peek 3/5: |z| > 2.530
peek 4/5: |z| > 2.191
peek 5/5: |z| > 1.960

The first peek requires |z| > 4.4. That is a delta so large you would not need a test. The final peek is the usual 1.96. When the effect is real and large, the early peeks catch it. Small effects mean you run to the end. The overall false-positive rate stays at the α you advertised.
For LLM evals, this matters most when each example costs API tokens. A regression eval at every PR running 6,000 examples is a real cost line. Stopping at 1,500 when the answer is clearly negative saves the rest. The Optimizely sequential-testing glossary covers the practical theory; the statsmodels interim-analysis tooling implements the standard boundaries if you want a library instead of rolling it.

Stratify by query type or eat Simpson's paradox

The scariest result in eval analysis is the one where the new prompt wins the aggregate by 3 points, loses every individual subgroup, and ships anyway. This is Simpson's paradox, and it shows up in LLM evals because eval sets are usually mixtures: factual queries, math queries, refusal probes, multi-turn dialogues, code questions, summarization, and so on. Each subgroup has a different baseline accuracy and a different difficulty curve. When the new prompt shifts the distribution of which queries the judge sampled, even subtly, the aggregate average can move in the opposite direction of every subgroup average.
The classic epidemiology example (kidney stones, Charig et al., 1986) is identical in structure: treatment A wins overall, treatment B wins on small stones and on large stones. The aggregate flipped the conclusion.
The defense is not exotic. Stratify the eval set into well-defined query types, hold the per-type sample counts fixed across runs, and report per-stratum deltas alongside the aggregate. If three out of four strata regressed and one improved, the aggregate "win" is not a win.
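A toy illustration with made-up counts: prompt B wins inside both strata, but because the B run drew far more hard queries, the pooled number flips in A's favor.

```python
# Invented counts: the new prompt (B) beats the old (A) inside every stratum,
# yet loses the pooled aggregate because the two runs sample strata differently.
strata = {
    "easy": {"A": (80, 100), "B": (45, 50)},   # A 80%, B 90%
    "hard": {"A": (10, 50),  "B": (30, 100)},  # A 20%, B 30%
}
for name, cells in strata.items():
    pa = cells["A"][0] / cells["A"][1]
    pb = cells["B"][0] / cells["B"][1]
    print(f"{name}: A={pa:.0%}  B={pb:.0%}  (B wins)")

a_pass = sum(c["A"][0] for c in strata.values())
a_n = sum(c["A"][1] for c in strata.values())
b_pass = sum(c["B"][0] for c in strata.values())
b_n = sum(c["B"][1] for c in strata.values())
print(f"aggregate: A={a_pass / a_n:.0%}  B={b_pass / b_n:.0%}  (A 'wins')")
```

Holding per-stratum counts fixed across runs removes exactly this failure mode.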

from collections import defaultdict

def stratified_summary(rows, p1_col="old", p2_col="new"):
    """rows: iterable of dicts with keys 'stratum', 'old', 'new' (0/1)."""
    by_stratum = defaultdict(list)
    for r in rows:
        by_stratum[r["stratum"]].append(r)
    out = []
    for s, rs in sorted(by_stratum.items()):
        n = len(rs)
        old = sum(r[p1_col] for r in rs) / n
        new = sum(r[p2_col] for r in rs) / n
        out.append({"stratum": s, "n": n, "old": old, "new": new, "delta": new - old})
    return out

The harness should print per-stratum results before the aggregate. If the stratum table makes a reviewer flinch, do not call the aggregate a result. The corollary for sample-size planning: budget examples per stratum, not just overall. A 2,000-example eval evenly split across 8 strata gives 250 per stratum. That is fine for catching 8-point per-stratum effects and useless for 2-point ones. If a specific query type is the one you are trying to move, oversample it.
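Budgeting per stratum is the same formula pointed at each stratum. A hypothetical plan (stratum names and baseline rates invented for illustration): oversample the query type you are trying to move, and give the rest only enough samples to catch large regressions.

```python
import math
from statistics import NormalDist

def n_per_arm(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    # Same two-proportion sample-size formula as above.
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    pooled = math.sqrt(2 * p_bar * (1 - p_bar))
    split = math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return math.ceil(((z_alpha * pooled + z_beta * split) / abs(p2 - p1)) ** 2)

# Hypothetical budget: we want a 5-pt lift on math (70% baseline) and only
# need to catch 10-pt regressions on the other strata (85% baseline).
plan = {
    "math": n_per_arm(0.70, 0.75),
    **{s: n_per_arm(0.85, 0.75) for s in ("factual", "refusal", "code")},
}
for stratum, n in plan.items():
    print(f"{stratum:8s}: {n} examples per arm")
print(f"total per arm: {sum(plan.values())}")
```

The target stratum ends up roughly five times larger than the guard strata, which is the point: the budget follows the claim you intend to make.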

A short checklist before you ship the deploy note
  • Compute n_per_arm for the smallest delta you actually care to detect, before running the eval. If your eval set is smaller than that, the result cannot back the claim.
  • Use the paired McNemar test when both prompts run on the same questions. It is one function call away and frees up an order of magnitude of compute.
  • Sequential testing with O'Brien-Fleming boundaries lets you stop early on obvious wins or obvious losses without inflating false positives.
  • Stratify by query type, report per-stratum deltas, and treat any aggregate win that hides a subgroup loss as a regression.

The numbers in the deploy note get a confidence interval, not just a point estimate. 82% (95% CI: 80.0-84.0) is honest; 82% on its own is a coin flip with branding. The upfront cost is a power calculation, strata, and CIs in the harness. After that, the team stops shipping noise as a feature. The math above is a small slice of what the LLM Observability Pocket Guide covers: picking the right eval tooling, building a harness that scales past the prototype stage, and reading traces well enough to catch the bugs your dashboards hide. If the sample-size walk-through was useful, the book works through the same principle in detail: every claim you ship to a stakeholder comes with a confidence interval.

AI-generated content

This content is an AI-produced summary, translation, and analysis of the original post from the Dev.to AI tag. Copyright belongs to the original author; please refer to the original for the authoritative text.
