LLM Judge vs JSON Schema vs Contains: Picking the Right Eval Rule
Picking the wrong eval rule is one of the most expensive mistakes a team can make when setting up LLM evaluation. The wrong rule means false passes in production, false failures on every regression run, and a growing pile of flaky tests that nobody trusts. This post gives you a decision tree for picking the right rule type for every common LLM task, with examples, costs, and honest trade-offs.
Five rule types cover almost everything: contains, not-contains, regex, JSON schema, and LLM judge. That list runs from cheapest, simplest, and clearest signal to most expensive, most complex, and noisiest.
Rule type 1: contains
What it does. Checks whether the output string contains a specific substring.
Cost. Effectively zero. Runs in microseconds. No LLM call.
When to use. Fixed keyword presence. "The response must include the product name." "The response must mention the return policy." "The response must start with a greeting." Any time a specific word, phrase, or boilerplate must appear.
Example. A customer support prompt must always include the sentence "Our team will follow up within 24 hours." A contains rule on that exact string catches any response that drops the SLA.
Limitations. Case sensitivity bites. Partial matches are fragile. "Thanks for reaching out" and "thank you for reaching out" are both polite, but a contains rule on one will miss the other. Mitigate by lowercasing before checking, or by picking short stable substrings.
Verdict. Always the first rule to try. If contains covers your requirement, stop there. Fast, cheap, zero variance.
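As a sketch, a contains check with the lowercasing mitigation is a few lines of Python. The function name is illustrative, not any particular framework's API:

```python
def contains(output: str, needle: str) -> bool:
    """Pass if the needle appears anywhere in the output.
    Lowercasing both sides avoids the 'Thanks' vs 'thanks' miss."""
    return needle.lower() in output.lower()
```

With the SLA example above, `contains(reply, "24 hours")` fails any response that drops the follow-up promise, regardless of capitalization.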
Rule type 2: not-contains
What it does. Checks that the output does not contain a specific substring.
Cost. Same as contains.
When to use. Forbidden content. "The response must not contain refusal phrases." "The response must not leak the system prompt." "The response must not reference competitors." Anywhere an explicit negative catches regressions.
Example. An assistant that is not supposed to apologize for AI limitations. Not-contains rules on "As an AI", "I am just a language model", "I cannot help" catch responses that slip into those patterns.
Limitations. Variants of the same concept are endless. "Cannot assist", "unable to help", "not able to do that" all say the same thing. You end up writing many not-contains rules for one conceptual check. When variants dominate, move up to LLM judge.
Verdict. Always the second rule to try. Pair with contains for belt-and-suspenders checks.
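The variant problem looks like this in practice: one conceptual check ("no AI disclaimers") expands into a list of forbidden substrings. The phrase list here is illustrative:

```python
FORBIDDEN_PHRASES = [
    "as an ai",
    "i am just a language model",
    "i cannot help",
]

def not_contains(output: str, phrases=FORBIDDEN_PHRASES) -> bool:
    """Fail if any forbidden phrase appears, case-insensitively."""
    lowered = output.lower()
    return not any(phrase in lowered for phrase in phrases)
```

When this list grows past a handful of entries, that is the signal to move the check up to an LLM judge.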
Rule type 3: regex
What it does. Checks whether the output matches a regular expression.
Cost. Cheap. Microseconds. No LLM call.
When to use. Format checks. Dates in a specific format. Phone numbers. Invoice IDs. Currency amounts. Anywhere the content follows a defined syntax.
Example. A prompt that extracts a date from user text must output it as YYYY-MM-DD. A regex rule on ^\d{4}-\d{2}-\d{2}$ catches any output that drifts to "January 3rd, 2026" or "01/03/2026".
Limitations. Regex is powerful but ugly. Complex regex is write-only. If your rule is more than forty characters of regex, consider whether you actually want JSON schema instead, or whether an LLM judge is the right call.
Verdict. The right tool when format matters. Keep them short. Comment every regex you write.
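The date check above, written with the anchor-and-comment discipline the verdict recommends:

```python
import re

# Anchored ISO date: exactly YYYY-MM-DD, nothing before or after.
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def matches_iso_date(output: str) -> bool:
    return bool(ISO_DATE.match(output.strip()))
```

The anchors matter: without `^` and `$`, the pattern would pass an output like "The date is 2026-01-03, probably", which is not the extraction contract.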
Rule type 4: JSON schema
What it does. Parses the output as JSON and validates against a schema. Rule fails if the output is not valid JSON, or if the JSON does not match the schema.
Cost. Milliseconds. No LLM call. One parse plus one validation per check.
When to use. Any prompt that returns structured data. API responses. Extraction tasks. Classification with confidence scores. Tool call arguments.
Example. A prompt that extracts structured metadata from a document must return { title: string, author: string, date: YYYY-MM-DD, topics: string[] }. A JSON schema rule validates structure, required fields, and types, all in one check.
Limitations. Only applies when the output is JSON. If your prompt generates free-form text, JSON schema is the wrong rule. Also, the schema has to be maintained. When the prompt changes to add a field, the schema changes too.
Verdict. The most powerful cheap rule. If your prompt returns JSON, every field should have schema coverage. This single rule replaces ten contains rules.
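A minimal hand-rolled version of the metadata check above, using only the standard library. A real pipeline would use a proper JSON Schema validator, but the shape of the check is the same:

```python
import json
import re

def validate_metadata(output: str) -> bool:
    """Pass only if the output parses as JSON and matches
    { title: str, author: str, date: YYYY-MM-DD, topics: [str] }."""
    try:
        doc = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(doc, dict):
        return False
    if not isinstance(doc.get("title"), str):
        return False
    if not isinstance(doc.get("author"), str):
        return False
    date = doc.get("date")
    if not isinstance(date, str) or not re.fullmatch(r"\d{4}-\d{2}-\d{2}", date):
        return False
    topics = doc.get("topics")
    return isinstance(topics, list) and all(isinstance(t, str) for t in topics)
```

Note the single rule covers parse failure, missing fields, wrong types, and the date format all at once, which is exactly why it replaces a pile of contains and regex rules.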
Rule type 5: LLM judge
What it does. Sends the output and a rubric to a secondary LLM that scores the response. Returns pass/fail or a numeric score.
Cost. The price of one extra LLM call per row, usually with a smaller model. On Haiku-level models you pay roughly a tenth of a cent per row. On Sonnet-level judges you pay a few cents.
When to use. Open-ended generation where correctness is fuzzy. Tone checks. "Does the response answer the user's question?" "Is the tone professional?" "Is the response under 200 words and on topic?" Any check that cannot be expressed as a string pattern.
Example. A customer support reply that must be "polite, accurate, and helpful" cannot be checked by contains. An LLM judge with a rubric "Score the response 1-5 on: politeness, factual accuracy, and helpfulness. Return pass if all three are 4 or higher" captures the intent.
Limitations. Judges are stochastic. Two runs on the same input can give different verdicts, so budget for the noise: run each judge check multiple times if you need tight confidence. Judges also have biases. They often prefer longer responses, they can be sycophantic, and they can be fooled by confidence in the response itself. Mitigate with careful rubric design and spot-check calibration against human reviewers.
Verdict. The right tool for open-ended evaluation. Expensive, noisy, but captures what rules cannot.
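A sketch of the judge's pass/fail side. The rubric string and the three-integer reply format are assumptions for illustration, and the actual model call is elided since it depends on your provider's client:

```python
JUDGE_RUBRIC = (
    "Score the response 1-5 on politeness, factual accuracy, and "
    "helpfulness. Reply with three integers separated by spaces."
)

def judge_verdict(raw_reply: str, threshold: int = 4) -> bool:
    """Parse the judge model's reply and pass only if every
    criterion meets the threshold, per the rubric above.
    Unparseable replies count as failures, not crashes."""
    try:
        scores = [int(tok) for tok in raw_reply.split()]
    except ValueError:
        return False
    return len(scores) == 3 and all(s >= threshold for s in scores)
```

Treating a malformed judge reply as a failure rather than an exception keeps one flaky judge call from taking down the whole regression run.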
Decision tree
Go through these in order for each requirement.
Does the output have a fixed required phrase? Use contains.
Does the output have a fixed forbidden phrase? Use not-contains.
Does the output match a format? Use regex.
Is the output JSON? Use JSON schema.
None of the above? Use LLM judge.
Ninety percent of requirements are covered by the first four. The remaining ten percent go to LLM judge.
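The tree, written out in order. The requirement flags here are illustrative, not a real configuration format; the point is that the checks short-circuit from cheapest rule to most expensive:

```python
def pick_rule(requirement: dict) -> str:
    """Walk the decision tree top to bottom, cheapest rule first."""
    if requirement.get("required_phrase"):
        return "contains"
    if requirement.get("forbidden_phrase"):
        return "not-contains"
    if requirement.get("format_pattern"):
        return "regex"
    if requirement.get("output_is_json"):
        return "json-schema"
    return "llm-judge"
```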
Combining rules
Real prompts need multiple rules. A customer support prompt might have:
- Contains: "24 hours" (SLA must be mentioned)
- Not-contains: "As an AI" (no AI disclaimer drift)
- JSON schema: full response shape, if the prompt returns JSON
- LLM judge: politeness and accuracy rubric
Pass rate is then the fraction of responses that pass all rules. A response that misses one rule out of four fails the row.
The stacking is cheap. Contains, not-contains, regex, and JSON schema together cost under a millisecond. Only the LLM judge adds real cost. Even if you run the cheap rules first and skip the judge on rows that already failed, most rows pass the cheap rules, so budget one judge call per row.
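The all-rules-must-pass semantics fits in one function. Rules here are plain predicates; a real harness would carry names and metadata for reporting:

```python
def pass_rate(responses, rules):
    """Fraction of responses that pass every rule.
    One failed rule out of four fails the whole row."""
    passed = sum(
        1 for response in responses
        if all(rule(response) for rule in rules)
    )
    return passed / len(responses)
```

With a contains rule and a not-contains rule stacked, a response that mentions "24 hours" but also opens with "As an AI" still fails its row.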
Common mistakes
Only using one rule type. Teams often pick LLM judge and put everything behind it. Expensive, noisy, and misses the specific format requirements a cheap regex would catch for free. Mix rule types.
Over-fitting rules to the first successful run. A team runs the prompt once, sees it produced "The support team will reply within 24 hours", and writes a contains rule on that exact string. Next run, the prompt says "Our support team will reply within 24 hours" and fails. Write rules on the smallest stable substring: "24 hours" is the contract, not the full sentence.
Not versioning rules with prompts. When a prompt's output shape changes intentionally, the rules need to change in the same commit. Otherwise you get phantom regressions on every change.
Judge rubrics that are too long. A five-sentence rubric full of edge cases produces unstable judge output. Keep rubrics to three sentences. Two criteria. Pass or fail. Simpler rubrics are more deterministic.
Cost budget example
A 50-row regression with four rules: two contains, one JSON schema, one LLM judge.
Contains and schema: effectively free. Sub-millisecond times, no API calls.
LLM judge: 50 rows × one Haiku call ≈ $0.01 per regression run.
If you run this regression on every PR, every night, and every manual trigger, you are looking at maybe $3 per month in judge costs. For a team paying engineer salaries, that is a rounding error.
Scale up to LLM judges on Sonnet or Opus and the math changes. A 200-row regression with Sonnet judging can cost $0.50 per run. Nightly that is $15 per month. Still cheap relative to what it catches.
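The budget arithmetic above as a sketch. The per-call prices are rough assumptions backed out of the estimates in the text, not published rates:

```python
# Rough assumed per-call judge prices in USD, from the estimates above.
HAIKU_PER_CALL = 0.0002    # "roughly a tenth of a cent" order of magnitude
SONNET_PER_CALL = 0.0025

# 50-row regression, Haiku judge, one call per row.
haiku_run = 50 * HAIKU_PER_CALL       # about $0.01 per run
haiku_month = haiku_run * 300         # ~300 runs/month -> about $3

# 200-row regression, Sonnet judge, nightly.
sonnet_run = 200 * SONNET_PER_CALL    # about $0.50 per run
sonnet_month = sonnet_run * 30        # about $15 per month
```

The run count of 300 per month is an assumption standing in for "every PR, every night, and every manual trigger"; swap in your own CI numbers.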
Shipping it
Start with contains and not-contains. If that covers your requirements, stop. Add regex when format matters. Add JSON schema when you return JSON. Add LLM judge last, for the requirements nothing else captures.
EmberLM supports all five rule types in one interface. Mix and match per row in your dataset. Write a rule once, apply to every run automatically. The natural-language rule builder compiles "the response should mention our SLA" into the right underlying rule type without you picking.
Start at emberlm.dev/signup.
Summary
Five rule types. A decision tree. A cost budget. Most prompts need three of the five. Do not start with LLM judge. Do not skip cheap rules. Mix and match. Ship.
The right rule is the cheapest rule that catches the bug you care about. Anything more expensive is paying for signal you already had.