Promptfoo is a CI gate, not an eval framework. Treating it like one cost us $4,200.

Eight weeks of fixing the layer I was missing between regression tests and production reality.

What you will learn in this post:

  • Why a passing Promptfoo regression suite does not mean your eval is working

  • How to wire a parallel judge-validation pipeline against production traces

  • The structural changes to a GPT-4 judge prompt that move Cohen's kappa from 0.47 to 0.68

  • Position bias and verbosity bias measurement, with mitigations

  • The 20-hour, $180-per-month total cost of the fix

One Monday morning I logged into our billing dashboard and saw a $4,200 LangSmith spike from the weekend. Our auto-eval pipeline had been running overnight against a fresh prompt change. The Promptfoo regression suite passed 91% of its 300 questions. The release went out that Monday at 9am.

By Tuesday evening, our on-call channel had 14 customer escalations about wrong refund amounts.

That is when I stopped treating Promptfoo as an eval framework.

The category error

I had built what looked like a real evaluation pipeline. 300 frozen test cases. Pass-fail thresholds. CI gate that blocked merges on any drop below 85%.

It still missed the bugs that hit production.

Promptfoo is a regression test runner. It tells you "your prompt change did not break the cases you had already thought to test." That is useful. It is not eval. Eval requires a judge that has been validated against humans on your task.

Our judge was a GPT-4 call. When I hand-labeled 200 production traces over a weekend and compared them against the judge's scores, Cohen's kappa was 0.47. Kappa is already chance-corrected (0 is chance, 1 is perfect agreement), so that is moderate agreement at best for a 5-class problem.
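For readers who have not computed kappa by hand, here is a minimal sketch of the statistic: observed agreement minus the agreement expected from each rater's label frequencies, rescaled so that 0 means chance and 1 means perfect agreement. The ten scores are made up for illustration and are not our production labels.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Made-up 1-5 scores for ten traces, purely for illustration.
human = [5, 4, 1, 3, 5, 2, 4, 1, 3, 5]
judge = [5, 4, 2, 3, 4, 2, 4, 1, 5, 5]
print(round(cohens_kappa(human, judge), 2))
```

In this toy data the two raters agree on 7 of 10 items, yet kappa comes out noticeably lower than 0.7 because some of that agreement is expected by chance alone.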

The fix is two pieces

The fix took 8 weeks. Most teams have piece 1 and are missing piece 2.

Piece 1: Promptfoo stays as the CI gate

# .promptfoo.yaml (excerpt)
prompts: [file://refund_agent_v3.txt]
providers: [openai:gpt-4]
tests: file://tests.yaml
defaultTest:
  assert:
    # LLM-graded check of the output against a free-text rubric
    - type: llm-rubric
      value: "Matches expected refund amount and reason"
    # Fail any test case that takes longer than 3000 ms
    - type: latency
      threshold: 3000

Piece 2: A separate judge-validation pipeline against production traces

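# weekly_judge_validation.py
from datadog import statsd
from sklearn.metrics import cohen_kappa_score

# pull_traces, run_judge, await_human_labels, and pagerduty are project
# helpers that are not shown in this excerpt.

KAPPA_ALERT_THRESHOLD = 0.55

def run():
    # Sample last week's production traces and score them with the judge
    traces = pull_traces(days=7, n=50)
    judge_scores = [run_judge(t) for t in traces]
    # Assumed to return one human label per trace, in the same order
    human_scores = await_human_labels(traces, timeout="48h")

    # Chance-corrected agreement between judge and human labels
    kappa = cohen_kappa_score(judge_scores, human_scores)
    statsd.gauge("eval.judge.kappa", kappa)

    # Page on-call when the judge drifts away from the human labels
    if kappa < KAPPA_ALERT_THRESHOLD:
        pagerduty.trigger(
            "judge-drift",
            details=f"kappa={kappa:.2f}, threshold={KAPPA_ALERT_THRESHOLD}",
        )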

When we wired this up 8 weeks ago, kappa was 0.47. Today it is 0.68. Acceptable.

What we changed in the judge

Three structural changes:

  1. Score criteria separately (refund amount, denial reason, customer-facing tone). Kappa per criterion runs 0.65 to 0.74.

  2. Force the judge to cite the expected answer portion that justifies its score.

  3. Score against a 4-page rubric instead of vibes.
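The three changes above can be sketched roughly as follows. This is an illustration under assumptions, not our production prompt: the criterion keys, the 1-5 scale, the JSON reply shape, and both function names are hypothetical. The idea is that the prompt asks for one score per criterion plus a verbatim quote from the expected answer, and the parser rejects any judgment whose quote does not actually appear there.

```python
import json

# Assumed criterion names and a 1-5 scale; adapt both to your own rubric.
CRITERIA = ("refund_amount", "denial_reason", "tone")

def build_judge_prompt(rubric: str, expected: str, response: str) -> str:
    """Ask for one score per criterion, each backed by a quote from `expected`."""
    schema = {
        c: {"score": "integer 1-5", "evidence": "verbatim quote from EXPECTED"}
        for c in CRITERIA
    }
    return (
        "Score RESPONSE against EXPECTED using the RUBRIC.\n"
        "Score each criterion separately and quote the part of EXPECTED "
        "that justifies the score.\n"
        f"Reply with JSON only, shaped like: {json.dumps(schema)}\n\n"
        f"RUBRIC:\n{rubric}\n\nEXPECTED:\n{expected}\n\nRESPONSE:\n{response}\n"
    )

def parse_judgment(raw: str, expected: str) -> dict:
    """Return {criterion: score}; reject bad scores or quotes not in `expected`."""
    data = json.loads(raw)
    scores = {}
    for criterion in CRITERIA:
        entry = data[criterion]  # KeyError if the judge skipped a criterion
        score = int(entry["score"])
        if not 1 <= score <= 5:
            raise ValueError(f"{criterion}: score {score} outside 1-5")
        evidence = entry["evidence"]
        if not evidence or evidence not in expected:
            raise ValueError(f"{criterion}: evidence is not a quote from EXPECTED")
        scores[criterion] = score
    return scores
```

Keeping the scores separate per criterion is also what makes a per-criterion kappa possible: each key gets its own judge-versus-human comparison.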

Position bias and verbosity bias

Position bias: the judge agreed with itself only 71% of the time when the answer order was swapped. The other 29% of judgments flipped on order alone.

Verbosity bias: padded responses scored 0.4 points higher on average.

Mitigations: randomize order and average. Truncate to max length before judging.
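A rough sketch of how the position-bias number and both mitigations could be computed. It assumes a judge callable that takes the expected answer, the response, and a flag for which of the two appears first in the prompt; that signature, the function names, and the word-based truncation are illustrative assumptions. Instead of picking an order at random, this version scores both orders and averages them, which is a deterministic variant of the same idea.

```python
from typing import Callable, Sequence, Tuple

# (expected, response, expected_first) -> score; hypothetical judge signature.
Judge = Callable[[str, str, bool], int]

def order_self_agreement(judge: Judge, pairs: Sequence[Tuple[str, str]]) -> float:
    """Fraction of pairs where the score is identical under both prompt orders."""
    same = sum(judge(e, r, True) == judge(e, r, False) for e, r in pairs)
    return same / len(pairs)

def debiased_score(judge: Judge, expected: str, response: str,
                   max_words: int = 200) -> float:
    """Truncate the response, then average the judge's score over both orders."""
    clipped = " ".join(response.split()[:max_words])
    return (judge(expected, clipped, True) + judge(expected, clipped, False)) / 2

def toy_judge(expected: str, response: str, expected_first: bool) -> int:
    """Stand-in judge with a deliberate order-and-length bias, for demonstration."""
    score = 4 if expected in response else 2
    if not expected_first and len(response.split()) > 5:
        score += 1  # rewards long responses, but only in one order
    return score

pairs = [
    ("$20", "Refund is $20"),
    ("$35", "We will refund $35 to your card today"),
    ("$10", "No refund"),
    ("$50", "Your refund of $50 has been approved now"),
]
print(order_self_agreement(toy_judge, pairs))
print(debiased_score(toy_judge, "$35", "We will refund $35 to your card today"))
```

Swap `toy_judge` for a wrapper around your real judge call to get the self-agreement rate on your own traces.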

The lesson

Promptfoo is a CI gate, not an eval framework. The actual eval is the judge-validation pipeline that lives next to it.

If you only have Promptfoo, you are flying on uncalibrated faith. Most teams I talk to are missing piece 2: a judge-validation step against production traces.

Total cost of the fix: about 20 engineer-hours and $180 per month in API calls. The $4,200 weekend was the bigger number.

Three things I am still working on

The first is calibration set size. I use 200 traces per week. I suspect 100 with tighter stratification gives the same confidence interval on kappa, but I have not run the variance experiment yet.
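One way the variance experiment could look, as a sketch only: bootstrap the (human, judge) label pairs at two sample sizes and compare the width of the percentile interval on kappa. The data below is synthetic, the function names are hypothetical, and the sketch ignores stratification entirely, so it covers only the "how much does halving n widen the interval" half of the question.

```python
import random
from collections import Counter

def kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1 - p_e)

def bootstrap_kappa_ci(pairs, sample_size, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for kappa when labeling `sample_size` traces."""
    rng = random.Random(seed)
    stats = sorted(
        kappa(*zip(*rng.choices(pairs, k=sample_size))) for _ in range(n_boot)
    )
    lo = stats[int(n_boot * alpha / 2)]
    hi = stats[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

# Synthetic labels: the judge copies the human score about 70% of the time.
rng = random.Random(1)
humans = [rng.randint(1, 5) for _ in range(200)]
judges = [h if rng.random() < 0.7 else rng.randint(1, 5) for h in humans]
pairs = list(zip(humans, judges))

for n in (100, 200):
    lo, hi = bootstrap_kappa_ci(pairs, sample_size=n)
    print(f"n={n}: kappa 95% interval = [{lo:.2f}, {hi:.2f}], width = {hi - lo:.2f}")
```

To test the stratification idea, the resampling step would need to draw within strata (for example by refund type) rather than uniformly.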

The second is whether cross-judge agreement can stand in as a noisy proxy for human labels. Works for obvious cases, breaks at the margin where you most need eval.

The third, and the one I find hardest, is putting a dollar value on lost user trust when production breaks on cases the judge passed.

If you have solved any of these, I would like to compare notes.
