
Key terms

This page provides an overview of key terms and concepts that apply to Eval Studio.

Large language model (LLM)

Large language models (LLMs) are artificial intelligence systems trained on vast amounts of text data to generate human-like responses in natural language processing tasks. In the context of Eval Studio, an LLM can be treated as a RAG without a corpus and with perfect search context.

Retrieval-augmented generation (RAG)

In Eval Studio, a RAG is a retrieval-augmented generation product. In a more general sense, RAG refers to a technique that combines retrieving relevant information from an external corpus with a pre-trained language model to generate more accurate and contextually rich responses.

Eval Studio evaluator

  • Code that evaluates an LLM or RAG.
  • A Python class that inherits from the Evaluator abstract base class.
  • Executed by the backend Python library.
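As a rough illustration of the points above, the sketch below shows what a class inheriting from an abstract base class can look like. The Evaluator class defined here is a hypothetical stand-in written for this example, not the actual Eval Studio base class, and the evaluate method, its parameters, and MaxWordsEvaluator are assumptions.

```python
from abc import ABC, abstractmethod


class Evaluator(ABC):
    """Hypothetical stand-in for an evaluator abstract base class."""

    @abstractmethod
    def evaluate(self, prompt: str, answer: str) -> float:
        """Return a score between 0.0 and 1.0 for one model answer."""


class MaxWordsEvaluator(Evaluator):
    """Illustrative evaluator: passes answers that stay within a word limit."""

    def __init__(self, max_words: int) -> None:
        self.max_words = max_words

    def evaluate(self, prompt: str, answer: str) -> float:
        # Split on whitespace and compare the word count to the limit.
        return 1.0 if len(answer.split()) <= self.max_words else 0.0


evaluator = MaxWordsEvaluator(max_words=5)
score = evaluator.evaluate("What is the capital of France?", "The capital is Paris.")
print(score)
```

Because Evaluator is abstract, it cannot be instantiated directly; only subclasses that implement evaluate can be created and run.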


Test

An Eval Studio test is a collection of documents (that is, a corpus) together with prompts relevant to that corpus, ground truth, constraints, and other parameters used to evaluate a RAG or LLM.

Test suite

An Eval Studio test suite is a collection of tests.

Test lab

An Eval Studio test lab is a set of resolved prompts, ground truth, constraints, and other parameters used to evaluate a RAG or LLM. Eval Studio creates test labs from a test suite.
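One way to picture how tests, test suites, and test labs relate is the data-structure sketch below. All class and function names (PromptCase, EvalTest, EvalTestSuite, build_test_lab) are hypothetical, and treating "resolving" as flattening every prompt of every test into one list is an assumption made for illustration, not a description of how Eval Studio builds a test lab.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PromptCase:
    """One prompt with its ground truth and constraints (hypothetical)."""
    prompt: str
    ground_truth: str
    constraints: List[str] = field(default_factory=list)


@dataclass
class EvalTest:
    """A corpus of documents plus the prompt cases relevant to it."""
    corpus: List[str]
    cases: List[PromptCase]


@dataclass
class EvalTestSuite:
    """A collection of tests."""
    tests: List[EvalTest]


def build_test_lab(suite: EvalTestSuite) -> List[PromptCase]:
    """Assumed resolution step: gather every prompt case in the suite."""
    return [case for test in suite.tests for case in test.cases]


suite = EvalTestSuite(tests=[
    EvalTest(corpus=["geography.pdf"],
             cases=[PromptCase("What is the capital of France?", "Paris")]),
    EvalTest(corpus=["handbook.pdf"],
             cases=[PromptCase("How many vacation days are granted?", "25",
                               constraints=["answer must cite the handbook"])]),
])
test_lab = build_test_lab(suite)
print(len(test_lab))
```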


Report

An Eval Studio report is a collection of metrics and visualizations that describe the validity of a RAG or LLM on a given test configuration.

Ground truth

In the context of LLM and RAG evaluation, ground truth refers to the actual or correct answer to a given question or prompt. It is used as a standard of comparison to evaluate the performance of a model by measuring how closely its outputs match the ground truth.

For example, if a model is asked the question "What is the capital of France?", the ground truth would be "Paris". If the model's output is also "Paris", then it has correctly answered the question. However, if the model's output is "London", then it has made an error, and the difference between its output and the ground truth can be used to measure the model's accuracy or performance.
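The comparison described above can be sketched as an exact-match accuracy calculation. This is a minimal illustration, not the metric Eval Studio uses; the function name exact_match_accuracy and the choice to ignore case and surrounding whitespace are assumptions.

```python
from typing import List


def exact_match_accuracy(outputs: List[str], ground_truths: List[str]) -> float:
    """Fraction of outputs equal to their ground truth (case-insensitive)."""
    if len(outputs) != len(ground_truths):
        raise ValueError("outputs and ground_truths must have the same length")
    if not outputs:
        return 0.0
    matches = sum(
        out.strip().lower() == truth.strip().lower()
        for out, truth in zip(outputs, ground_truths)
    )
    return matches / len(outputs)


# The same question answered by two models: one correct, one wrong.
outputs = ["Paris", "London"]
ground_truths = ["Paris", "Paris"]
accuracy = exact_match_accuracy(outputs, ground_truths)
print(accuracy)  # 0.5
```

Exact matching is strict: a correct answer phrased as a full sentence would count as an error, which is why practical evaluators often use softer similarity measures.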

In the case of RAG pipelines, the ground truth may consist of both the retrieved documents and the final generated answer. The model's ability to retrieve relevant documents and generate accurate answers based on those documents can be evaluated by comparing its outputs to the ground truth.
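For the retrieval half of a RAG pipeline, one common way to compare retrieved documents against ground truth is precision and recall over document identifiers. The sketch below assumes documents are identified by plain strings; the function name and the sample identifiers are hypothetical, and this is not presented as the method Eval Studio applies.

```python
from typing import List, Tuple


def retrieval_precision_recall(retrieved: List[str],
                               relevant: List[str]) -> Tuple[float, float]:
    """Compare retrieved document IDs with the ground-truth relevant IDs."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Precision: share of retrieved documents that are relevant.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    # Recall: share of relevant documents that were retrieved.
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


retrieved = ["doc_a", "doc_c", "doc_d"]  # what the pipeline returned
relevant = ["doc_a", "doc_b"]            # ground-truth documents
precision, recall = retrieval_precision_recall(retrieved, relevant)
print(precision, recall)
```

The generated answer can then be scored separately against its own ground truth, so that retrieval errors and generation errors are reported independently.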