What is LLM evaluation?

LLM evaluation is the systematic measurement of whether a large language model or the application built on it produces outputs that are accurate, safe, and reliable enough for real use. Because the same prompt can yield different answers on different runs and quality is a matter of degree, evaluation replaces subjective impressions with repeatable, comparable scores you can regression-test across versions.

Why are BLEU and ROUGE not enough for evaluating LLMs?

BLEU and ROUGE score text by n-gram overlap with a reference answer, so they measure surface-level word matching rather than meaning. A response can be accurate, fluent, and helpful while sharing few words with the reference, and these metrics will penalise it. Research consistently finds they correlate poorly with human judgement on open-ended tasks, which is why teams add LLM-as-a-judge and human evaluation.

What is LLM-as-a-judge and what biases does it have?

LLM-as-a-judge uses a strong model to grade or compare outputs against a rubric, giving human-like nuance at machine speed and cost. Its main biases are position bias, where the judge favours a response based on its order in a comparison, and verbosity bias, where it rewards longer, more fluent answers regardless of substance. Swapping presentation order, using length-controlled rubrics, choosing a judge from a different model family, and calibrating against human labels all help mitigate these.

How do you evaluate a RAG system?

RAG evaluation decomposes the pipeline to separate retrieval failures from generation failures. Frameworks like RAGAS measure faithfulness (whether the answer is grounded in retrieved context), answer relevance (whether it addresses the question), context precision (how much retrieved material was relevant), and context recall (whether retrieval pulled in all needed evidence). Low faithfulness despite good context recall points at the generation prompt or model, while low context recall points at chunking, embeddings, or the retriever.

What is the difference between offline and online LLM evaluation?

Offline evaluation runs against curated, pre-production datasets to iterate quickly and block regressions before release. Online evaluation runs in production against live, messy, sometimes adversarial traffic to catch novel failures and distribution shifts the offline suite never anticipated. The rule of thumb is to use offline evals to iterate and prevent regressions, and online monitoring to confirm real-world value and watch for drift over time.

What is a golden dataset in LLM evaluation?

A golden dataset is a small but sharp, curated set of representative inputs paired with known-good expected outputs or grading criteria. Drawn from real usage and hard edge cases rather than synthetic filler and versioned like code, it is the anchor of an evaluation pipeline. Every prompt, model, or fine-tune change is scored against it in CI, and newly discovered production failures are fed back in so the suite grows sharper over time.

LLM Evaluation: How to Test GenAI Apps for Quality

LLM evaluation is the practice of systematically measuring whether a large language model or the application built on top of it produces outputs that are accurate, safe, and reliable enough for real use. Unlike traditional software, where the same input reliably produces the same output, an LLM can answer the same prompt differently on two consecutive runs, and quality is a matter of degree rather than a passing or failing assertion. That is precisely why evaluation matters so much: without a repeatable way to score outputs, teams ship on vibes, and the failure modes surface only when real users hit them. Evals, as practitioners call them, turn a subjective sense that the model seems good into evidence you can regression-test, compare across versions, and defend to a governance board.

Why traditional metrics fall short

The first instinct is to reach for reference-based metrics such as BLEU and ROUGE, which score generated text by how much its n-grams overlap with a human-written reference answer. These metrics were built for machine translation and summarisation, and they still have narrow uses, but they measure surface-level word overlap rather than meaning. A response can be factually perfect, fluent, and helpful while sharing almost no n-grams with the reference, and BLEU or ROUGE will punish it. Research consistently finds these scores correlate poorly with human judgement and cannot capture fluency, coherence, or coverage in open-ended tasks, which makes them inadequate for the conversational and reasoning-heavy work modern LLMs are asked to do.
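
To make the overlap problem concrete, here is a minimal sketch of an n-gram overlap score in the spirit of ROUGE-N recall. It is a simplified illustration, not the official BLEU or ROUGE implementation; the function names and the example sentences are made up for this demonstration.

```python
from collections import Counter


def ngrams(text: str, n: int) -> Counter:
    """Count the n-grams in a lowercased, whitespace-tokenised string."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_recall(candidate: str, reference: str, n: int = 2) -> float:
    """Fraction of reference n-grams that also appear in the candidate."""
    ref = ngrams(reference, n)
    cand = ngrams(candidate, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clip each match at the number of times the n-gram occurs in the candidate.
    matched = sum(min(count, cand[gram]) for gram, count in ref.items())
    return matched / total


reference = "the refund will arrive within five business days"
paraphrase = "you should get your money back in about a week"
copy_with_error = "the refund will arrive within fifty business days"

print(overlap_recall(paraphrase, reference))       # 0.0: no shared bigrams
print(overlap_recall(copy_with_error, reference))  # about 0.71: 5 of 7 bigrams match
```

The reasonable paraphrase scores zero, while the near-copy with the wrong number scores about 0.71. That inversion is the failure mode described above.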

Task-specific benchmarks are the other traditional pillar. MMLU tests reasoning and knowledge across 57 academic subjects with close to 16,000 multiple-choice questions, and it remains one of the most widely cited general-capability benchmarks. HELM (Holistic Evaluation of Language Models) takes a broader stance, scoring models across accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency to produce a multi-dimensional profile rather than a single leaderboard number. Benchmarks like these are invaluable for comparing base models, but they tell you almost nothing about your application. Your users do not ask MMLU questions; they ask about your product, your documents, and your edge cases. Benchmark performance is a starting filter, not a substitute for evaluating the system you actually built.

The core evaluation methods

In practice, mature teams combine several complementary methods rather than betting on one. Each trades off cost, speed, and how closely it tracks what humans actually care about.

Reference-based metrics (BLEU, ROUGE, exact match) — cheap and fast, useful for narrow tasks with a single correct answer, but weak on open-ended generation
Human evaluation — the gold standard for nuanced quality, empathy, and safety judgement, but slow, expensive, and hard to scale to every release
LLM-as-a-judge — using a strong model to grade outputs against a rubric, offering human-like nuance at machine speed and cost
Task-specific benchmarks (MMLU, HELM, GSM8K) — good for comparing base models, poor as a proxy for your own application quality
Programmatic and rule-based checks — deterministic assertions for format, schema, banned terms, or presence of required fields, ideal as fast guardrails
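
As an illustration of the last item, a programmatic check can be a few lines of deterministic code. The sketch below assumes the application is supposed to return a JSON object with `answer` and `sources` fields; that schema, the banned phrase, and the `check_output` name are hypothetical choices for this example.

```python
import json
import re

# Hypothetical output contract for this example.
REQUIRED_FIELDS = {"answer", "sources"}
BANNED_PATTERNS = [re.compile(r"\bas an ai language model\b", re.IGNORECASE)]


def check_output(raw: str) -> list[str]:
    """Return a list of rule violations; an empty list means the output passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    for field in sorted(REQUIRED_FIELDS - set(data)):
        problems.append(f"missing field: {field}")

    answer = data.get("answer", "")
    if not isinstance(answer, str):
        problems.append("answer is not a string")
    elif any(pattern.search(answer) for pattern in BANNED_PATTERNS):
        problems.append("banned phrase in answer")
    return problems


print(check_output('{"answer": "42", "sources": ["doc1"]}'))  # []
print(check_output("plain text, not JSON"))                   # ['not valid JSON']
```

Checks like this are cheap enough to run on every output, so they tend to sit in front of the slower judged or human evaluations.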

LLM-as-a-judge and its pitfalls

LLM-as-a-judge has become the workhorse of modern evaluation because it approximates human grading at a fraction of the cost and time. You give a capable model a rubric, the input, and the candidate output, and ask it to score or compare. The technique scales beautifully, but it is not a neutral oracle, and treating its verdicts as ground truth is a mistake that quietly corrupts entire eval suites. Judges carry systematic biases that skew results in predictable directions, and if you do not correct for them, you optimise your product toward the judge rather than toward your users.

Two biases dominate, and a third compounds them. Position bias means a judge favours a response based on the order in which it appears in a pairwise comparison; in code-judging experiments, simply swapping which candidate is shown first can shift accuracy by more than ten percent. Verbosity bias means judges reward longer, more fluent, more confident-sounding answers regardless of whether the extra words add substance, an artefact of how these models were pretrained and tuned. The third, self-preference (or self-enhancement) bias, is a model's tendency to rate outputs from its own family more highly, and it bites hardest when the judge and the system under test share a lineage.

Practical mitigations are well established: randomise or swap the presentation order and average across both positions to cancel position bias; enforce length-controlled or rubric-anchored scoring to blunt verbosity bias; use a judge from a different model family than the system you are grading; and periodically calibrate the judge against a human-labelled sample so you know its true sensitivity and specificity rather than assuming it is perfect.
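
Two of these mitigations are easy to sketch in code: averaging a pairwise verdict over both presentation orders, and measuring a judge's sensitivity and specificity against human labels. The `Judge` signature, the function names, and the toy judge below are assumptions for illustration; in a real pipeline the judge would wrap a model call.

```python
import math
from typing import Callable

# Assumed interface: (question, first_answer, second_answer) -> "first" or "second".
Judge = Callable[[str, str, str], str]


def score_a_over_b(judge: Judge, question: str, a: str, b: str) -> float:
    """Average A's win indicator over both orders: 1.0, 0.5, or 0.0."""
    forward = judge(question, a, b)    # A is shown first
    backward = judge(question, b, a)   # B is shown first
    wins = (forward == "first") + (backward == "second")
    return wins / 2


def sensitivity_specificity(judge_flags: list[bool],
                            human_flags: list[bool]) -> tuple[float, float]:
    """Compare judge verdicts with human labels, where True means 'bad output'."""
    if len(judge_flags) != len(human_flags):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(judge_flags, human_flags))
    tp = sum(j and h for j, h in pairs)
    fn = sum((not j) and h for j, h in pairs)
    tn = sum((not j) and (not h) for j, h in pairs)
    fp = sum(j and (not h) for j, h in pairs)
    sensitivity = tp / (tp + fn) if tp + fn else math.nan
    specificity = tn / (tn + fp) if tn + fp else math.nan
    return sensitivity, specificity


def always_first(question: str, first: str, second: str) -> str:
    """Toy judge with pure position bias: it always picks the first answer."""
    return "first"


print(score_a_over_b(always_first, "q", "answer A", "answer B"))  # 0.5
print(sensitivity_specificity(
    [True, False, False, False, True, False],   # judge flags
    [True, True, False, False, False, False],   # human flags
))  # (0.5, 0.75)
```

A purely position-biased judge collapses to 0.5 under this scheme, which is the point: its preference cancels out instead of being counted as a win. The calibration numbers tell you how many real failures the judge misses and how many good outputs it wrongly flags.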

Evaluating RAG systems

Retrieval-augmented generation adds a moving part that generic quality metrics miss entirely: the retrieval step. A RAG answer can be fluent and wrong because the model hallucinated beyond its sources, or it can be poor because retrieval never surfaced the right context in the first place. RAG-specific metrics decompose the pipeline so you can tell which half failed. The RAGAS framework popularised four widely used measures: faithfulness (also called groundedness), which checks whether the answer is actually supported by the retrieved context rather than invented; answer relevance, which checks whether the response directly addresses the user question; context precision, which measures how much of the retrieved material was actually relevant; and context recall, which measures whether retrieval pulled in all the evidence needed to answer.
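
For intuition, the two retrieval-side measures can be approximated with plain set arithmetic when you have human-labelled relevant chunks. This is a simplified sketch, not the RAGAS API: RAGAS, as far as we understand it, estimates these scores with model-based judgements, while the version below assumes hypothetical chunk IDs and a hand-labelled relevant set.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Share of retrieved chunks that were actually relevant."""
    if not retrieved:
        return 0.0
    hits = sum(1 for chunk_id in retrieved if chunk_id in relevant)
    return hits / len(retrieved)


def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Share of the needed chunks that retrieval managed to surface."""
    if not relevant:
        return 1.0  # nothing was needed, so nothing was missed
    return len(relevant & set(retrieved)) / len(relevant)


retrieved = ["c1", "c4", "c7", "c9"]   # what the retriever returned
relevant = {"c1", "c2", "c4"}          # what a human says was needed

print(context_precision(retrieved, relevant))  # 0.5
print(context_recall(retrieved, relevant))     # about 0.67: c2 was never retrieved
```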

The power of this decomposition is diagnostic. Low faithfulness with high context recall points at the generation prompt or model. Low context recall points at chunking, embeddings, or the retriever. This is also where evaluation and security overlap, because a poisoned or manipulated knowledge base degrades faithfulness in ways that look like a quality bug until you investigate; our deep dive on rag-security-risks walks through how retrieval pipelines get attacked, and faithfulness scoring is one of the earliest signals that something is wrong.
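
The diagnostic rule can be written down as a small triage function over a batch of scored examples. This is a sketch under assumptions: the 0.8 threshold, the `RagScore` record, and the bucket labels are illustrative choices, not values from any framework.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class RagScore:
    faithfulness: float
    context_recall: float


def diagnose(score: RagScore, threshold: float = 0.8) -> str:
    """Blame retrieval first: a generator cannot be faithful to missing evidence."""
    if score.context_recall < threshold:
        return "retrieval"    # look at chunking, embeddings, or the retriever
    if score.faithfulness < threshold:
        return "generation"   # look at the prompt or the model
    return "ok"


def triage(scores: list[RagScore]) -> Counter:
    """Count how many examples fall into each bucket."""
    return Counter(diagnose(score) for score in scores)


batch = [
    RagScore(faithfulness=0.40, context_recall=0.90),
    RagScore(faithfulness=0.90, context_recall=0.30),
    RagScore(faithfulness=0.95, context_recall=0.90),
    RagScore(faithfulness=0.20, context_recall=0.10),
]
print(triage(batch))
```

Checking recall before faithfulness is a deliberate ordering: when the evidence never arrived, a low faithfulness score says little about the generator.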

Offline evals versus online production monitoring

Evaluation is not a single gate before launch; it is two loops running at different speeds. Offline evaluation runs against curated, pre-production datasets so you can iterate quickly and block regressions before they reach users. Online evaluation runs in production against live, messy, sometimes adversarial traffic, catching the novel failures and distribution shifts your curated suite never anticipated. The rule of thumb practitioners converge on is simple: use offline evals to iterate fast and prevent regressions, and use online monitoring to confirm real-world value and watch the system drift over time. The two are complementary, and a common mistake is treating a strong offline score as proof the system is safe in production.
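
One simple way to connect the two loops is to compare a rolling mean of production scores with the offline baseline. The sketch below is a minimal illustration; the `DriftMonitor` class, the window size, and the tolerance are hypothetical, and a production system would likely use a more careful statistical test.

```python
from collections import deque


class DriftMonitor:
    """Flag drift when the rolling mean falls below baseline minus tolerance."""

    def __init__(self, baseline: float, window: int = 100, tolerance: float = 0.05):
        self.baseline = baseline
        self.tolerance = tolerance
        self.window = window
        self.scores = deque(maxlen=window)   # old scores drop off automatically

    def record(self, score: float) -> bool:
        """Add one production score; return True if the window has drifted."""
        self.scores.append(score)
        if len(self.scores) < self.window:
            return False   # not enough samples yet to judge
        rolling_mean = sum(self.scores) / len(self.scores)
        return rolling_mean < self.baseline - self.tolerance


monitor = DriftMonitor(baseline=0.9, window=4, tolerance=0.05)
alerts = [monitor.record(s) for s in [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]]
print(alerts)  # [False, False, False, False, True, True]
```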

Online monitoring is also where guardrails and safety testing live at runtime. LLMs deployed in sensitive settings need real-time checks that detect and mitigate risky behaviour before it reaches a user, and mature online systems fold those guardrails directly into the evaluation stream. This is adjacent to adversarial testing: our guide to ai-red-teaming-guide covers stress-testing a model against deliberate abuse, and the safety evals you run continuously in production are the standing counterpart to a point-in-time red-team exercise. Because judged scores can themselves be biased, teams increasingly track fairness and demographic performance separately, a discipline we cover in measuring-ai-bias-and-fairness.

Building an evaluation pipeline

Turning these ideas into a repeatable pipeline follows a consistent sequence. The single most important asset is the golden dataset: a small but sharp set of representative inputs with known-good expected outputs or grading criteria, curated from real usage and hard edge cases rather than synthetic filler.

Build a golden dataset — representative inputs plus expected outputs or rubric criteria, drawn from real traffic and known failure modes, versioned like code
Define metrics per task — pick faithfulness and context recall for RAG, task-completion and safety scores for agents, format checks for structured output
Choose evaluators — combine cheap programmatic assertions with an LLM-as-a-judge from a different model family, and reserve human review for the highest-stakes cases
Run offline evals in CI — gate every prompt, model, or fine-tune change against the golden set so regressions never merge
Add safety and guardrail tests — probe for prompt injection, jailbreaks, toxic output, and policy violations before and during production
Monitor in production — sample live traffic, score it with the same evaluators, and alert on drift so offline wins are confirmed to hold up
Close the loop — feed newly discovered production failures back into the golden dataset so the suite grows sharper with every incident
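
The first four steps can be sketched as a small gate that a CI job runs on every change. Everything here is illustrative: the `GoldenCase` fields, the substring-based grading, the 0.9 pass-rate threshold, and the `fake_app` stand-in for your real application are assumptions, and real suites usually grade with richer evaluators.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    prompt: str
    must_contain: tuple[str, ...]   # simple stand-in for richer grading criteria


def passes(case: GoldenCase, output: str) -> bool:
    """A case passes when every required phrase appears in the output."""
    text = output.lower()
    return all(term.lower() in text for term in case.must_contain)


def run_gate(cases: list[GoldenCase], generate: Callable[[str], str],
             min_pass_rate: float = 0.9) -> tuple[bool, list[str]]:
    """Run every golden case; return (gate passed, IDs of failing cases)."""
    if not cases:
        raise ValueError("golden dataset is empty")
    failures = [c.case_id for c in cases if not passes(c, generate(c.prompt))]
    pass_rate = 1 - len(failures) / len(cases)
    return pass_rate >= min_pass_rate, failures


GOLDEN = [
    GoldenCase("refund-window", "How long do refunds take?", ("5 business days",)),
    GoldenCase("cancel", "How do I cancel?", ("settings", "cancel subscription")),
]


def fake_app(prompt: str) -> str:
    """Hypothetical system under test, with canned answers for the demo."""
    canned = {"How long do refunds take?": "Refunds arrive within 5 business days."}
    return canned.get(prompt, "Sorry, I am not sure.")


ok, failed = run_gate(GOLDEN, fake_app)
print(ok, failed)  # False ['cancel']
```

In CI, a non-passing gate would typically exit with a non-zero status so the change cannot merge. When production surfaces a new failure, closing the loop means appending another `GoldenCase` and committing it alongside the fix.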

Several mature frameworks bundle these capabilities so you do not build the harness from scratch. Tools such as Arize Phoenix, LangSmith, Braintrust, and OpenAI Evals let teams score offline experiments and stream online metrics through one interface, and reusing the same evaluator across both modes keeps your offline and production judgements consistent. RAGAS specialises in the retrieval metrics above. Which tool you pick matters far less than committing to the loop itself.

A pattern we see repeatedly: a team ships on impressive benchmark numbers and a handful of manual spot-checks, then spends the next quarter firefighting quality complaints they have no way to reproduce. The teams that stay calm are the ones that built a golden dataset on day one and treated every production incident as a new test case rather than a one-off apology.

Evaluation is ultimately what separates a GenAI demo from a GenAI product. The model is a fast but unreliable component, and evals are the measurement instrument that tells you, release after release, whether the system wrapped around it is getting better or quietly worse. If you want the mental model for why these systems behave probabilistically in the first place, our explainer on how-llms-actually-work-without-math is a useful companion; and if you are evaluating multi-step agents rather than single responses, the reliability patterns in agentic-design-patterns-production pair naturally with the evaluation discipline described here. Build the loop early, keep the golden set sharp, and let the evidence, not the demo, decide what ships.