How to Build a Golden Dataset for LLM Evals (2026 Playbook)
Every team that runs LLM evaluations eventually discovers the same hard truth. The eval framework does not matter. The rules do not matter. The model does not matter. What matters is the golden dataset underneath all of it. If your dataset is bad, every number you generate is lying to you. If your dataset is good, a simple contains check can save your product.
This guide walks through how to build a golden dataset that catches real bugs, how to scale it without burning out your labelers, and what to do when the dataset disagrees with production.
What a golden dataset actually is
A golden dataset is a set of paired inputs and expected behaviors. Inputs are what your users send to your LLM. Expected behaviors are what any acceptable response must satisfy. The pair is your ground truth.
The word "golden" is load-bearing. These rows are trusted. They represent what your product must do correctly. If the model fails on a golden row, you ship a fix. If the expected behavior changes, you update the row with a deliberate commit. You do not silently accept drift.
Golden datasets sit at the center of every serious LLM evaluation stack. DeepEval calls them "goldens". Confident AI, Arize, and Datadog all converge on the same concept with slightly different names. The implementation details vary. The idea does not.
Five properties of a useful dataset
Practitioners have converged on five properties that any useful golden dataset should have.
Defined scope. The dataset targets one task. Not "our whole product". Not "everything the model does". One prompt, one use case, one intended behavior. If you have three prompts, you need three datasets.
Demonstrative of production. The inputs look like what real users send. Not synthetic examples written by an engineer in ten minutes. Real typos, real language mix, real edge cases.
Diverse. The dataset covers the variety of the problem space. Easy cases. Hard cases. Ambiguous cases. Cases at the boundary of the prompt's intended scope.
Decontaminated. The dataset is distinct from any data used to train or fine-tune your model. Otherwise your numbers are meaningless.
Dynamic. The dataset evolves. You add rows when users report bugs. You remove rows when your product pivots. It is a living artifact, not a frozen snapshot from the day the project started.
Miss any of these and the dataset gives misleading signal. Hit all five and you have a dataset that catches regressions early and gives you confidence when it passes.
Sourcing your first 30 rows
Pull from production. This is the single most important rule. Do not hand-author test inputs. Engineers cannot simulate the distribution of real user behavior. They try. They fail. The dataset that results is easy for the model and therefore useless.
If you have production logs, export the last thousand calls to the prompt you care about. Sample thirty rows. Weight the sample toward the tail. Include the weird ones. A uniform random sample gives you thirty average inputs. You want thirty inputs that together span the range of things users actually do.
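The tail-weighted sample above can be sketched in a few lines. This is a minimal illustration, not a prescribed method: it assumes logs are dicts with an "input" field and uses input length as a crude proxy for "weird", taking half the sample from the longest ten percent.

```python
import random

def sample_for_golden(logs, n=30, tail_fraction=0.5, seed=0):
    """Sample n inputs from production logs, weighting toward the tail.

    Crude proxy: half the sample comes from the longest 10 percent of
    inputs, the rest uniformly from the remainder.
    """
    rng = random.Random(seed)
    by_length = sorted(logs, key=lambda row: len(row["input"]))
    cutoff = max(1, len(by_length) // 10)     # longest 10 percent
    tail, body = by_length[-cutoff:], by_length[:-cutoff]
    n_tail = min(len(tail), int(n * tail_fraction))
    n_body = min(len(body), n - n_tail)
    sample = rng.sample(tail, n_tail) + rng.sample(body, n_body)
    rng.shuffle(sample)
    return sample
```

In practice you would swap the length proxy for whatever signals "unusual" in your domain: rare language, long conversations, inputs that triggered retries.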
If you do not have production logs yet, source from adjacent data. Support tickets, customer emails, forum questions. Anything that looks like the user language your prompt will eventually face. Synthetic data is a last resort and needs to be marked as synthetic.
Labeling each row
For each row you need either an expected output or a set of rules the output must satisfy. Start with rules. Hardcoded expected outputs break as soon as the model changes wording, and you end up maintaining a huge set of fragile strings.
Rules look like:
- The response must contain a product name from this list
- The response must not contain the words "I cannot help" or "I am just an AI"
- The response must be valid JSON matching this schema
- The response must answer the user's question in under three sentences
These are all eval rules in the EmberLM sense. Each rule fires a pass or fail on the generated output.
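The four rules above can each be expressed as a small predicate over the response string. A minimal sketch, assuming a hypothetical product-name list and a single required JSON key; each function returns True for pass, False for fail:

```python
import json

PRODUCT_NAMES = {"Widget", "Gadget"}            # hypothetical product list
REFUSALS = ("I cannot help", "I am just an AI")

def contains_product(resp):
    """Rule 1: response must contain a product name from the list."""
    return any(name in resp for name in PRODUCT_NAMES)

def no_refusal(resp):
    """Rule 2: response must not contain a canned refusal phrase."""
    return not any(phrase in resp for phrase in REFUSALS)

def valid_json_with_keys(resp, required=("answer",)):
    """Rule 3: response must be valid JSON with the required keys."""
    try:
        data = json.loads(resp)
    except ValueError:
        return False
    return isinstance(data, dict) and all(k in data for k in required)

def under_three_sentences(resp):
    """Rule 4: crude sentence count via terminal punctuation."""
    return sum(resp.count(p) for p in ".!?") <= 3
```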
Have a domain expert, not an engineer, write the rules. Engineers write rules that match what the model happens to generate. Domain experts write rules that match what the user actually needs. The difference matters.
Labeling tools
You do not need a specialized tool to start. A Google Sheet with three columns works: input, expected, notes. Paste thirty rows, fill in the columns, export to CSV.
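Reading that three-column CSV back takes only the standard library. A minimal sketch, assuming the header row is exactly input, expected, notes:

```python
import csv

def load_golden_csv(path):
    """Load a golden dataset from a three-column CSV: input, expected, notes."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    for row in rows:
        assert row.get("input"), "every golden row needs an input"
    return rows
```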
If you are using EmberLM, the Datasets feature imports CSV directly and lets you attach eval rules per row. Skip the sheet entirely.
Once you pass a hundred rows, invest in a proper labeling workflow. Confident AI, Scale, and Label Studio all work. The specific tool matters less than having one place where labelers see inputs and attach rules.
How many rows is enough
Thirty rows is the minimum for a useful regression. Under thirty the pass rate has too much variance to be meaningful. At thirty rows a single failure already moves the score by more than three points.
Fifty rows is a comfortable starting point for most teams.
Two hundred rows is where coverage starts to feel complete for a single-task prompt.
Above five hundred rows you start paying real money and time on every run. Most teams do not need this scale.
If you are covering a complex prompt with many distinct modes of use, you may want a dataset per mode. One dataset for the "summarize document" use case, one for the "answer follow-up question" use case. Each at thirty to two hundred rows. Run them separately.
Keeping the dataset distinct from training data
If you fine-tune the model, your golden dataset must not overlap with your training data. If it overlaps, the model has seen the inputs, learned the expected outputs, and your eval is measuring memorization rather than capability. You will ship a model that scores great on evals and fails in production.
Keep golden rows in a separate store. Never include them in fine-tune batches. When you expand the dataset, source from production, not from prior training examples.
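A cheap way to enforce the separation is to fingerprint normalized inputs and check for overlap before every fine-tune. A minimal sketch, assuming both sides are plain strings:

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace so trivial edits still match."""
    return " ".join(text.lower().split())

def fingerprint(text):
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()

def contaminated_rows(golden_inputs, training_inputs):
    """Return golden inputs that also appear (after normalization) in training data."""
    train = {fingerprint(t) for t in training_inputs}
    return [g for g in golden_inputs if fingerprint(g) in train]
```

Exact-match fingerprints miss paraphrases; for stronger decontamination you would add fuzzy matching, but this catches the common copy-paste overlap.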
For closed-model users who do not fine-tune, this is less of a worry. The concern reappears when you move to hosted fine-tunes or self-hosted models later.
Keeping the dataset dynamic
The most common failure mode is the frozen dataset. A team builds it once, ships it, and never touches it again. Six months later the product has changed, the users have changed, and the dataset tests behaviors the product no longer cares about.
Schedule a quarterly dataset review. Every user-reported bug that survives triage becomes a golden row. Every row that no longer reflects the product gets removed or archived. Tag rows with the date added. Rows older than a year get flagged for review.
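The age flagging is trivial to automate once rows carry a date tag. A sketch, assuming each row stores date_added as an ISO date string:

```python
from datetime import date, timedelta

def rows_needing_review(rows, today=None, max_age_days=365):
    """Flag golden rows whose date_added tag is older than a year."""
    today = today or date.today()
    cutoff = today - timedelta(days=max_age_days)
    return [r for r in rows if date.fromisoformat(r["date_added"]) < cutoff]
```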
Treat the dataset like a production database. It has a schema, it has migrations, it has a changelog.
Handling open-ended outputs
Rules work for structured outputs. They get harder on open-ended generation. A customer support reply is not a JSON object. It is three paragraphs of English. What rule captures "helpful, polite, factually correct"?
For open-ended generation, pair rules with an LLM judge. The judge gets the input, the response, and a rubric. It returns a score or a boolean. Simple rubrics work best. "Does the response answer the user's question and stay under 200 words? Return pass or fail."
Judges have their own failure modes. They are stochastic. They can be biased toward longer responses. They can disagree with each other. Accept the noise and run them multiple times if you need tight confidence intervals, or fall back to human review for the final ten percent that judges cannot decide.
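The judge itself is just a prompt plus a parser for its verdict. A minimal sketch with the rubric from above; the actual model call is whatever client you use and is deliberately left out:

```python
RUBRIC = ("Does the response answer the user's question "
          "and stay under 200 words? Return pass or fail.")

def build_judge_prompt(user_input, response, rubric=RUBRIC):
    return (f"User input:\n{user_input}\n\n"
            f"Model response:\n{response}\n\n"
            f"Rubric: {rubric}\nVerdict:")

def parse_verdict(judge_output):
    """Map the judge's free-text reply onto pass/fail; None means undecided."""
    text = judge_output.strip().lower()
    if text.startswith("pass"):
        return True
    if text.startswith("fail"):
        return False
    return None  # route to human review
```

The None branch is the point: rather than forcing an ambiguous judge reply into pass or fail, send it to the human-review queue mentioned above.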
Versioning the dataset with the prompt
Your prompt has versions. Your dataset should have versions. When you change a prompt in a way that changes the expected output shape, you update the dataset in the same commit. When you change the dataset to add new cases, you increment its version.
Prompt v7 paired with dataset v3 might score 92 percent. Prompt v8 paired with dataset v3 might score 87 percent. That is a real regression. Prompt v8 paired with dataset v4 might score 93 percent. That tells you dataset v4 has easier rows, not that the prompt improved.
Always pin dataset version when comparing scores. Always log which dataset version produced which score.
When the dataset disagrees with production
You will run a regression, get a passing score, deploy, and a user will immediately hit a case your dataset did not cover. That is not a failure. That is the dataset doing its job. Add the case, update the rules, run the regression, ship again.
The goal is not a dataset that covers every possible input. The goal is a dataset that covers every input mode your product has seen. Production feeds the dataset. The dataset catches regressions on production modes. The loop closes.
Starting today
Practical plan:
- Pick one prompt
- Export thirty real production inputs
- Write one contains rule and one not-contains rule per input
- Run the regression
- Review the failures
- Commit the dataset to version control
That is a full working golden dataset in under an hour. Expand from there.
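The whole plan fits in one small loop. A sketch of the regression step, assuming each row carries one contains rule and one not-contains rule, and that generate is your model call (stubbed with a dict in the test):

```python
def run_regression(dataset, generate):
    """Score every golden row against its contains / not-contains rules.

    Each row: {"input": str, "contains": str, "not_contains": str}.
    Returns (pass_rate, list of failing inputs).
    """
    failures = []
    for row in dataset:
        out = generate(row["input"])
        if row["contains"] not in out or row["not_contains"] in out:
            failures.append(row["input"])
    passed = len(dataset) - len(failures)
    return passed / len(dataset), failures
```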
EmberLM includes a full Datasets feature with CSV import, per-row eval rules, and direct wiring into regression runs and scheduled tests. The Pro plan is twenty dollars per month. Start at emberlm.dev/signup.