Promptfoo is a CI gate, not an eval framework. Treating it like one cost us $4,200.
Eight weeks of fixing the layer I was missing between regression tests and production reality.
What you will learn in this post:
Why a passing Promptfoo regression suite does not mean your eval is working
How to wire a parallel judge-validation pipeline against production traces
The structural changes to a GPT-4 judge prompt that move Cohen's kappa from 0.47 to 0.68
Position bias and verbosity bias measurement, with mitigations
The 20-hour, $180-per-month total cost of the fix
Last Monday I logged into our billing dashboard and saw a $4,200 LangSmith spike from the weekend. Our auto-eval pipeline had been running overnight against a fresh prompt change. The Promptfoo regression suite passed 91% of its 300 questions. The release went out Monday at 9am.
By Tuesday evening, our on-call channel had 14 customer escalations about wrong refund amounts.
That is when I stopped treating Promptfoo as an eval framework.
The category error
I had built what looked like a real evaluation pipeline. 300 frozen test cases. Pass-fail thresholds. CI gate that blocked merges on any drop below 85%.
It still missed the bugs that hit production.
Promptfoo is a regression test runner. It tells you "your prompt change did not break the cases you had already thought to test." That is useful. It is not eval. Eval requires a judge that has been validated against humans on your task.
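Our judge was a GPT-4 call. When I hand-labeled 200 production traces over a weekend and compared them against the judge's scores, Cohen's kappa was 0.47. Kappa already corrects for chance agreement (0 is chance, 1 is perfect), so 0.47 is only moderate agreement on the usual Landis and Koch bands. That is too weak to trust on a 5-class scoring task.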
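To make the kappa number concrete, here is a minimal sketch of the calculation in plain Python. The labels are made-up toy data, not our production traces. It shows how raw agreement can look decent while the chance-corrected score stays moderate.

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    # Observed agreement: fraction of items where the raters match.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if each rater labeled independently,
    # following their own marginal label frequencies.
    freq_a, freq_b = Counter(a), Counter(b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (p_o - p_e) / (1 - p_e)

# Toy 1-5 scores, invented for illustration only.
human = [5, 5, 5, 5, 4, 4, 3, 2, 1, 5]
judge = [5, 5, 5, 4, 4, 5, 3, 3, 1, 5]

raw_agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
kappa = cohen_kappa(human, judge)
print(f"raw agreement={raw_agreement:.2f}, kappa={kappa:.2f}")
# raw agreement=0.70, kappa=0.56
```

Here the two raters match on 7 of 10 items, but because both lean heavily on the score 5, a fair share of that agreement is expected by chance and kappa lands well below 0.70.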
The fix is two pieces
The fix took 8 weeks. Most teams have piece 1 and are missing piece 2.
Piece 1: Promptfoo stays as the CI gate
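# .promptfoo.yaml (excerpt)
prompts: [file://refund_agent_v3.txt]
providers: [openai:gpt-4]
tests: file://tests.yaml
defaultTest:
  assert:
    # Model-graded check: a grader model scores the output against this rubric
    - type: llm-rubric
      value: "Matches expected refund amount and reason"
    # Fail any case that takes longer than 3000 ms
    - type: latency
      threshold: 3000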
Piece 2: A separate judge-validation pipeline against production traces
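# weekly_judge_validation.py
from datadog import statsd
from sklearn.metrics import cohen_kappa_score

# pull_traces, run_judge, await_human_labels, and pagerduty are internal
# helpers that are not shown in this excerpt.

def run():
    # Sample 50 production traces from the last 7 days
    traces = pull_traces(days=7, n=50)
    judge_scores = [run_judge(t) for t in traces]
    # Collect human labels for the same traces (48-hour timeout)
    human_scores = await_human_labels(traces, timeout="48h")

    # Chance-corrected agreement between judge and humans
    kappa = cohen_kappa_score(judge_scores, human_scores)
    statsd.gauge("eval.judge.kappa", kappa)

    # Page on-call when the judge drifts below the agreement floor
    if kappa < 0.55:
        pagerduty.trigger(
            "judge-drift",
            details=f"kappa={kappa:.2f}, threshold=0.55",
        )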
When we wired this up 8 weeks ago, kappa was 0.47. Today it is 0.68. Acceptable.
What we changed in the judge
Three structural changes:
1. Score each criterion separately (refund amount, denial reason, customer-facing tone). Kappa per criterion runs 0.65 to 0.74.
2. Force the judge to quote the portion of the expected answer that justifies its score.
3. Score against a written 4-page rubric instead of vibes.
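The citation requirement is only useful if something checks it. A minimal sketch of such a check follows. It assumes the judge returns JSON with a `score` and an `evidence` quote per criterion. The criterion keys, field names, and 1-5 range here are illustrative assumptions, not our exact schema.

```python
import json

# Hypothetical criterion keys, mirroring the three criteria listed above.
CRITERIA = ("refund_amount", "denial_reason", "tone")

def _norm(text):
    # Collapse whitespace and case so trivial formatting differences still match.
    return " ".join(text.lower().split())

def validate_judgment(raw_json, expected_answer):
    """Return {criterion: score}, or raise ValueError if the judgment is unusable."""
    data = json.loads(raw_json)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("judgment must be a JSON object")
    scores = {}
    for name in CRITERIA:
        entry = data.get(name)
        if not isinstance(entry, dict):
            raise ValueError(f"missing criterion: {name}")
        score, evidence = entry.get("score"), entry.get("evidence")
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"{name}: score must be an integer from 1 to 5")
        if not isinstance(evidence, str) or not evidence.strip():
            raise ValueError(f"{name}: evidence quote is required")
        # The quote must actually appear in the expected answer.
        if _norm(evidence) not in _norm(expected_answer):
            raise ValueError(f"{name}: evidence not found in expected answer")
        scores[name] = score
    return scores

expected = (
    "Refund $42.50 to the original card. Reason: item arrived damaged. "
    "We apologize for the inconvenience."
)
good_dict = {
    "refund_amount": {"score": 5, "evidence": "Refund $42.50"},
    "denial_reason": {"score": 4, "evidence": "item arrived damaged"},
    "tone": {"score": 4, "evidence": "We apologize for the inconvenience"},
}
scores = validate_judgment(json.dumps(good_dict), expected)
print(scores)
```

A judgment that fails this check can be retried or dropped before it reaches the kappa calculation, so an invented justification never counts as a score.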
Position bias and verbosity bias
Position bias: the judge agreed with itself only 71% of the time when the answer order was swapped, so 29% of judgments flipped on order alone.
Verbosity bias: padded responses scored 0.4 points higher on average.
Mitigations: randomize the order and average the scores across orderings. Truncate responses to a maximum length before judging.
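A minimal sketch of both mitigations, plus the flip-rate measurement, is below. It swaps random ordering for the deterministic version: run both orders every time and average. The `judge` callable, its `response_first` flag, and the character-based cap are assumptions for illustration. A real pipeline would more likely truncate by tokens.

```python
def debiased_score(judge, response, reference, max_chars=2000):
    """Score with both presentation orders and return (mean score, flipped)."""
    clipped = response[:max_chars]  # crude length cap against verbosity bias
    first = judge(clipped, reference, response_first=True)
    second = judge(clipped, reference, response_first=False)
    return (first + second) / 2, first != second

def flip_rate(judge, pairs, max_chars=2000):
    """Fraction of (response, reference) pairs whose score changes with order."""
    if not pairs:
        raise ValueError("need at least one pair")
    flips = sum(debiased_score(judge, resp, ref, max_chars)[1] for resp, ref in pairs)
    return flips / len(pairs)

def toy_judge(response, reference, response_first):
    # Deliberately biased stand-in for a real judge call: +1 point when the
    # response is shown first, plus a small bonus per character of length.
    return 3.0 + (1.0 if response_first else 0.0) + len(response) / 10000

score, flipped = debiased_score(toy_judge, "x" * 5000, "expected answer")
print(f"score={score:.1f}, flipped={flipped}")
# score=3.7, flipped=True
```

The toy judge flips on every pair by construction. Against a real judge, `flip_rate` over a sample of traces gives the kind of number reported above.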
The lesson
Promptfoo is a CI gate, not an eval framework. The actual eval is the judge-validation pipeline that lives next to it.
If you only have Promptfoo, you are flying on uncalibrated faith. Most teams I talk to are missing piece 2: a judge-validation step against production traces.
Total cost of the fix: about 20 engineer-hours and $180 per month in API calls. The $4,200 weekend was the bigger number.
Three things I am still working on
The first is calibration set size. I use 200 traces per week. I suspect 100 with tighter stratification gives the same confidence interval on kappa, but I have not run the variance experiment yet.
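One way to start that variance experiment is a percentile bootstrap on kappa at different sample sizes. The sketch below uses synthetic labels with an assumed 75% agreement rate, so the printed intervals say nothing about real traces. It also ignores stratification entirely and only shows the mechanics.

```python
import random
from collections import Counter

def cohen_kappa(a, b):
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1 - p_e)

def bootstrap_kappa_ci(judge, human, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for kappa, resampling traces with replacement."""
    rng = random.Random(seed)
    n = len(judge)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(cohen_kappa([judge[i] for i in idx], [human[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def synthetic_labels(n, agree=0.75, seed=1):
    # Made-up data: the judge copies the human label with probability `agree`,
    # otherwise it picks a random score from 1 to 5.
    rng = random.Random(seed)
    human = [rng.randint(1, 5) for _ in range(n)]
    judge = [h if rng.random() < agree else rng.randint(1, 5) for h in human]
    return judge, human

for n in (100, 200):
    judge, human = synthetic_labels(n)
    lo, hi = bootstrap_kappa_ci(judge, human)
    print(f"n={n}: 95% interval for kappa = [{lo:.2f}, {hi:.2f}]")
```

Running the same function on real human and judge labels, subsampled to 100 and 200, would show directly how much interval width the smaller set costs.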
The second is whether cross-judge agreement can stand in as a noisy proxy for human labels. It works for obvious cases but breaks at the margin, where you most need eval.
The third, and the one I find hardest, is putting a dollar value on lost user trust when production breaks on cases the judge passed.
If you have solved any of these, I would like to compare notes.
