Evaluators

This page describes the available H2O Eval Studio evaluators.

Evaluators overview

The following evaluators are available (the legend below explains the LLM, RAG, J, Q, EA, RC, AA, and C columns of the overview table):

  • Answer correctness
  • Answer relevancy
  • Answer relevancy (sentence s.)
  • Answer semantic similarity
  • BLEU
  • Classification
  • Contact information leakage
  • Context precision
  • Context relevancy
  • Context relevancy (s.r. & p.)
  • Context recall
  • Faithfulness
  • Fairness bias
  • Machine Translation (GPTScore)
  • Question Answering (GPTScore)
  • Summarization with ref. s.
  • Summarization without ref. s.
  • Groundedness
  • Hallucination
  • Language mismatch (Judge)
  • BYOP: Bring your own prompt
  • PII leakage
  • Perplexity
  • ROUGE
  • Ragas
  • Summarization (c. and f.)
  • Sexism (Judge)
  • Sensitive data leakage
  • Stereotypes (Judge)
  • Summarization (Judge)
  • Toxicity
  • Tokens presence

Legend:

  • LLM: evaluates Large Language Model (LLM) models.
  • RAG: evaluates Retrieval Augmented Generation (RAG) models.
  • J: evaluator requires an LLM judge.
  • Q: evaluator requires question (prompt).
  • EA: evaluator requires expected answer (ground truth).
  • RC: evaluator requires retrieved context.
  • AA: evaluator requires actual answer.
  • C: evaluator requires constraints.

Generation

Answer Correctness Evaluator

Answer Correctness Evaluator assesses the accuracy of generated answers compared to ground truth. A higher score indicates a closer alignment between the generated answer and the expected answer (ground truth), signifying better correctness.

  • Two weighted metrics + LLM judge.
  • Compatibility: RAG and LLM evaluation.
  • Based on RAGAs library

Method

  • This evaluator measures answer correctness compared to ground truth as a weighted average of factuality and semantic similarity.

  • Default weights are 0.75 for factuality and 0.25 for semantic similarity.

  • The semantic similarity metric is evaluated using the Answer Semantic Similarity Evaluator.

  • Factuality is evaluated as the F1 score of the LLM judge's answers: the judge's prompt analyzes the actual answer for statements and, for each statement, checks its presence in the expected answer:

    • TP (true positive): statements present in both the actual and expected answers.

    • FP (false positive): statements present in the actual answer only.

    • FN (false negative): statements present in the expected answer only.

  • F1 score quantifies correctness based on the number of statements in each of the lists above:

F1 score = |TP| / (|TP| + 0.5 * (|FP| + |FN|))
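
As an illustration, the following minimal Python sketch (not the evaluator's actual implementation; the TP/FP/FN statement counts and the semantic similarity score are assumed to come from the LLM judge and the Answer Semantic Similarity Evaluator) combines the F1-based factuality score with semantic similarity using the default weights:

def factuality_f1(tp: int, fp: int, fn: int) -> float:
    """F1 score over the judge-extracted statement counts (TP/FP/FN)."""
    denominator = tp + 0.5 * (fp + fn)
    return tp / denominator if denominator else 0.0

def answer_correctness(tp: int, fp: int, fn: int, semantic_similarity: float,
                       factuality_weight: float = 0.75,
                       similarity_weight: float = 0.25) -> float:
    """Weighted average of factuality (F1) and semantic similarity."""
    return (factuality_weight * factuality_f1(tp, fp, fn)
            + similarity_weight * semantic_similarity)

# Example: 3 shared statements, 1 extra, 1 missing, semantic similarity 0.9
print(answer_correctness(tp=3, fp=1, fn=1, semantic_similarity=0.9))  # ~0.79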

For more information, see the page on Answer Correctness in the official Ragas documentation.

Metrics calculated by the evaluator

  • Answer correctness (float)
    • The assessment of the answer correctness metric involves gauging the accuracy of the generated answer when compared to the ground truth. This evaluation relies on the ground truth and the answer, with scores ranging from 0 to 1. A higher score indicates a closer alignment between the generated answer and the ground truth, signifying better correctness. The answer correctness metric encompasses two critical aspects: semantic similarity between the generated answer and the ground truth, as well as factual similarity. These aspects are combined using a weighted scheme to formulate the answer correctness score.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.

  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.

  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Answer Relevancy Evaluator

Answer Relevancy (retrieval+generation) evaluator assesses how pertinent the actual answer is to the given question. A lower score indicates an actual answer that is incomplete or contains redundant information.

  • Mean cosine similarity of the original question and questions generated by the LLM judge.

  • Compatibility: RAG and LLM evaluation.

  • Based on RAGAs library.

Method

  • The LLM judge is prompted to generate an appropriate question for the actual answer multiple times, and the mean cosine similarity of generated questions with the original question is measured.

  • The score will range between 0 and 1 most of the time, but this is not mathematically guaranteed, because cosine similarity ranges from -1 to 1.

answer relevancy = mean(cosine_similarity(question, generated_questions))
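
A minimal sketch of this computation, assuming a generic sentence-transformers embedding model (all-MiniLM-L6-v2 here is an illustrative choice, not necessarily the evaluator's model) and with the judge-generated questions passed in as a list:

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

def answer_relevancy(question: str, generated_questions: list[str]) -> float:
    """Mean cosine similarity between the original question and the questions
    the LLM judge generated from the actual answer."""
    q_emb = embedder.encode([question])[0]
    g_embs = embedder.encode(generated_questions)
    cosines = [float(np.dot(q_emb, g) / (np.linalg.norm(q_emb) * np.linalg.norm(g)))
               for g in g_embs]
    return float(np.mean(cosines))

# The generated questions would come from prompting the LLM judge with the
# actual answer several times, e.g. "What question does this text answer?".
print(answer_relevancy(
    "What is the capital of France?",
    ["Which city is the capital of France?", "What is France's capital city?"],
))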

Metrics calculated by the evaluator

  • Answer relevancy (float)
    • Answer relevancy metric (retrieval+generation) assesses how pertinent the generated answer is to the given prompt. A lower score indicates answers which are incomplete or contain redundant information. This metric is computed using the question and the answer. Higher is better. An answer is deemed relevant when it directly and appropriately addresses the original question. To calculate this score, the LLM is prompted to generate an appropriate question for the generated answer multiple times, and the mean cosine similarity of generated questions with the original question is measured.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Answer Relevancy (Sentence Similarity)

The Answer Relevancy (Sentence Similarity) evaluator assesses how relevant the actual answer is by computing the similarity between the question and the actual answer sentences.

  • Compatibility: RAG and LLM evaluation.

Method

  • The metric is calculated as the maximum similarity between the question and the actual answer sentences:
answer relevancy = max( {S(emb(question), emb(a)): for all a in actual answer} )
  • Where:
    • A is the actual answer.
    • a is a sentence in the actual answer.
    • emb(a) is a vector embedding of the actual answer sentence.
    • emb(question) is a vector embedding of the question.
    • S(q, a) is 1 - cosine distance between the question q and the actual answer sentence a.
  • The evaluator uses the BAAI/bge-small-en embeddings, where BGE stands for "BAAI General Embedding", a suite of open-source text embedding models developed by the Beijing Academy of Artificial Intelligence (BAAI).
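
A minimal sketch of the metric, under the assumptions that the BAAI/bge-small-en embeddings are loaded via sentence-transformers and that NLTK is used for sentence splitting (the evaluator's internal tokenizer may differ):

import nltk
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("BAAI/bge-small-en")

def answer_relevancy_sentence_similarity(question: str, actual_answer: str) -> float:
    # nltk.download("punkt") may be required on first use.
    sentences = nltk.sent_tokenize(actual_answer)
    if not sentences:
        return 0.0
    q_emb = embedder.encode(question, convert_to_tensor=True)
    s_embs = embedder.encode(sentences, convert_to_tensor=True)
    # S(q, a) is the cosine similarity; the metric is the maximum over answer sentences.
    return float(util.cos_sim(q_emb, s_embs).max())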

Metrics calculated by the evaluator

  • Answer relevancy (float)
    • Answer Relevancy metric determines whether the RAG outputs relevant information by comparing the actual answer sentences to the question.
    • A higher score is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Answer Semantic Similarity Evaluator

Answer Semantic Similarity Evaluator assesses the semantic resemblance between the generated answer and the expected answer (ground truth).

  • Cross-encoder model or embeddings + cosine similarity.
  • Compatibility: RAG and LLM evaluation.
  • Based on RAGAs library

Method

  • Evaluator utilizes a cross-encoder model to calculate the semantic similarity score between the actual answer and expected answer. A cross-encoder model takes two text inputs and generates a score indicating how similar or relevant they are to each other.
  • The method is configurable; the evaluator defaults to the BAAI/bge-small-en-v1.5 embeddings (where BGE stands for "BAAI General Embedding", a suite of open-source text embedding models developed by the Beijing Academy of Artificial Intelligence (BAAI)) with cosine similarity as the similarity metric. In this case, the evaluator vectorizes the ground truth and generated answers and calculates the cosine similarity between them.
  • In general, cross-encoder models (like HuggingFace Sentence Transformers) tend to have higher accuracy on complex tasks but are slower. Embeddings with cosine similarity tend to be faster and more scalable, but less accurate for nuanced similarities.
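
A minimal sketch of the default embeddings-plus-cosine-similarity path, assuming the BAAI/bge-small-en-v1.5 model is loaded via sentence-transformers (a cross-encoder variant would instead score the answer pair with a cross-encoder model):

from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")

def answer_semantic_similarity(actual_answer: str, expected_answer: str) -> float:
    embeddings = embedder.encode([actual_answer, expected_answer], convert_to_tensor=True)
    # Cosine similarity between the actual and expected answer embeddings.
    return float(util.cos_sim(embeddings[0], embeddings[1]))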

Metrics calculated by the evaluator

  • Answer similarity (float)
    • The concept of answer semantic similarity pertains to the assessment of the semantic resemblance between the generated answer and the ground truth. This evaluation is based on the ground truth and the answer, with values falling within the range of 0 to 1. A higher score signifies a better alignment between the generated answer and the ground truth. Semantic similarity between answers can offer valuable insights into the quality of the generated response. This evaluation utilizes a cross-encoder model to calculate the semantic similarity score.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Faithfulness Evaluator

Faithfulness Evaluator measures the factual consistency of the generated answer with the given context.

  • The LLM judge finds claims in the actual answer and checks that these claims are present in the retrieved context.
  • Compatibility: RAG only evaluation.
  • Based on RAGAs library

Method

  • Faithfulness is calculated based on the actual answer and retrieved context.
  • The evaluation assesses whether the claims made in the actual answer can be inferred from the retrieved context, avoiding any hallucinations.
  • The score is determined by the ratio of the actual answer's claims present in the context to the total number of claims in the answer.
faithfulness = number of claims inferable from the context / claims in the answer

Metrics calculated by the evaluator

  • Faithfulness (float)
    • Faithfulness (generation) metric measures the factual consistency of the generated answer against the given context. It is calculated from the answer and retrieved context. Higher is better. The generated answer is regarded as faithful if all the claims that are made in the answer can be inferred from the given context: (number of claims inferable from the context / claims in the answer).
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Groundedness Evaluator

Groundedness (Semantic Similarity) Evaluator assesses the groundedness of the base LLM model in a Retrieval Augmented Generation (RAG) pipeline. It evaluates whether the actual answer contains factually correct information by comparing the actual answer to the retrieved context, as the actual answer generated by the LLM model must be based on the retrieved context.

Method

  • The groundedness metric is calculated as:
groundedness = min( { max( {S(emb(a), emb(c)): for all c in C} ): for all a in A } )
  • Where:
    • A is the actual answer and a is a sentence in the actual answer.
    • emb(a) is a vector embedding of the actual answer sentence a.
    • C is the context retrieved by the RAG model and c is a sentence of a retrieved context chunk.
    • emb(c) is a vector embedding of the context chunk sentence c.
    • S(a, c) is 1 - cosine distance between the actual answer sentence a and the retrieved context sentence c.
  • The evaluator uses the BAAI/bge-small-en embeddings, where BGE stands for "BAAI General Embedding", a suite of open-source text embedding models developed by the Beijing Academy of Artificial Intelligence (BAAI).
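
A minimal sketch of this computation, assuming the BAAI/bge-small-en embeddings via sentence-transformers and NLTK sentence splitting (both are assumptions; the evaluator's internals may differ):

import nltk
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("BAAI/bge-small-en")

def groundedness(actual_answer: str, retrieved_context: list[str]) -> float:
    # nltk.download("punkt") may be required on first use.
    answer_sentences = nltk.sent_tokenize(actual_answer)
    context_sentences = [s for chunk in retrieved_context
                         for s in nltk.sent_tokenize(chunk)]
    if not answer_sentences or not context_sentences:
        return 0.0
    a_embs = embedder.encode(answer_sentences, convert_to_tensor=True)
    c_embs = embedder.encode(context_sentences, convert_to_tensor=True)
    similarities = util.cos_sim(a_embs, c_embs)  # answer sentences x context sentences
    # For each answer sentence take its best-supported context sentence (max),
    # then score the answer by its least grounded sentence (min).
    return float(similarities.max(dim=1).values.min())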

Metrics calculated by the evaluator

  • Groundedness (float)
    • Groundedness metric determines whether the RAG outputs factually correct information by comparing the actual answer to the retrieved context. If there are facts in the output that are not present in the retrieved context, then the model is considered to be hallucinating - fabricates facts that are not supported by the context.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
    • Primary metric.

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.
  • If the actual answer is so short that the embedding ends up empty, then the evaluator will report a problem.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, where most of the evaluated LLM models hallucinated.
  • The least grounded actual answer sentence (in case the output metric score is below the threshold).

Evaluator parameters

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Hallucination Evaluator

Hallucination Evaluator assesses the hallucination of the base LLM model in a Retrieval Augmented Generation (RAG) pipeline. It evaluates whether the actual output contains factually correct information by comparing the actual output to the retrieved context, as the actual output generated by the LLM model must be based on the retrieved context. If there are facts in the output that are not present in the retrieved context, then the model is considered to be hallucinating, i.e., fabricating facts that are not supported by the context.

  • Cross-encoder model assessing retrieved context and actual answer similarity.
  • Compatibility: RAG evaluation only.

Method

  • The evaluation uses the Vectara hallucination evaluation cross-encoder model to calculate a score that measures the extent of hallucination in the generated answer with respect to the retrieved context.
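
A minimal sketch using the sentence-transformers CrossEncoder interface that the original Vectara model release documented (newer revisions of the model may require a different loading path, so treat this as an assumption rather than the evaluator's actual code):

from sentence_transformers import CrossEncoder

model = CrossEncoder("vectara/hallucination_evaluation_model")

def hallucination_score(retrieved_context: str, actual_answer: str) -> float:
    # The model returns a factual-consistency score in [0, 1]; values close to 1
    # mean the answer is consistent with the retrieved context.
    return float(model.predict([[retrieved_context, actual_answer]])[0])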

Metrics calculated by the evaluator

  • Hallucination (float)
    • Hallucination metric determines whether the RAG outputs factually correct information by comparing the actual output to the retrieved context. If there are facts in the output that are not present in the retrieved context, then the model is considered to be hallucinating - fabricates facts that are not supported by the context.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
    • Primary metric.

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, where most of the evaluated LLM models hallucinated.

Evaluator parameters

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Language Mismatch Evaluator

Language mismatch evaluator tries to determine whether the language of the question (prompt/input) and the actual answer is the same.

  • LLM judge based language detection.
  • Compatibility: RAG and LLM models.

Method

  • The evaluator prompts the LLM judge to compare languages in the question and actual answer.
  • Evaluator checks every test case. The result of the test case evaluation is a boolean.
  • LLM models are compared based on the number of test cases where they succeeded.

Metrics calculated by the evaluator

  • Same language (pass) (float)
    • Percentage of successfully evaluated RAG/LLM outputs for language mismatch metric which detects whether the language of the input and output is the same.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
    • Primary metric.
  • Language mismatch (fail) (float)
    • Percentage of RAG/LLM outputs that failed to pass the evaluator check for language mismatch metric which detects whether the language of the input and output is the same.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Language mismatch retrieval failures (float)
    • Percentage of RAG's retrieved contexts that failed to pass the evaluator check for language mismatch metric which detects whether the language of the input and output is the same.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Language mismatch generation failures (float)
    • Percentage of outputs generated by RAG from the retrieved contexts that failed to pass the evaluator check (equivalent to the model failures) for language mismatch metric which detects whether the language of the input and output is the same.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Language mismatch parsing failures (float)
    • Percentage of RAG/LLM outputs that evaluator's judge (LLM, RAG or model) was unable to parse, and therefore unable to evaluate and provide a metrics score for language mismatch metric which detects whether the language of the input and output is the same.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Most accurate, least accurate, fastest, slowest, most expensive, and cheapest LLM models based on the evaluated primary metric.
  • LLM models with best and worst context retrieval performance.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-bool-leaderboard

    • LLM failure leaderboard with data and formats for boolean metrics.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Looping Detection Evaluator

Looping detection evaluator tries to find out whether the LLM generation went into a loop.

  • Compatibility: RAG and LLM models.

Method

  • This evaluator provides three metrics:

unique sentences = number of unique sentences / number of all sentences

longest repeated substring = (length of the longest repeated substring * frequency of this substring) / length of the text

compression ratio = length in bytes of compressed string / length in bytes of original string

Where:

  • unique sentences omits sentences shorter than 10 characters.
  • compression ratio is calculated using Python's zlib and the maximum compression level (9).
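
A minimal sketch of the unique-sentences and compression-ratio metrics (the longest-repeated-substring metric is omitted); the regex-based sentence splitter is a simplification of whatever tokenizer the evaluator actually uses:

import re
import zlib

def unique_sentences_ratio(text: str, min_len: int = 10) -> float:
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text)
                 if len(s.strip()) >= min_len]
    return len(set(sentences)) / len(sentences) if sentences else 1.0

def compression_ratio(text: str) -> float:
    original = text.encode("utf-8")
    compressed = zlib.compress(original, level=9)  # maximum compression level
    return len(compressed) / len(original) if original else 1.0

looping_text = "I am stuck. " * 50
print(unique_sentences_ratio(looping_text))  # low: the same sentence repeats
print(compression_ratio(looping_text))       # low: repetitive text compresses well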

Metrics calculated by the evaluator

  • Unique Sentences (float)
    • Unique sentences metric is the ratio of the number of unique sentences to the number of all sentences, where sentences shorter than 10 characters are omitted.
    • Higher score is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
    • Primary metric.
  • Longest Repeated Substring (float)
    • Longest repeated substring metric is the ratio (length of the longest repeated substring * frequency of this substring) / length of the text.
    • Lower score is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
  • Compression Ratio (float)
    • Ratio length in bytes of compressed string / length in bytes of original string. Compression is done using Python's zlib and the maximum compression level (9).
    • Higher score is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Machine Translation (GPTScore) Evaluator

GPT Score evaluator family is based on a novel evaluation framework specifically designed for RAGs and LLMs. It utilizes the inherent abilities of LLMs, particularly their ability to understand and respond to instructions, to assess the quality of generated text.

  • LLM judge based evaluation.
  • Compatibility: RAG and LLM models.

Method

  • The core idea of GPTScore is that a generative pre-trained model will assign a higher probability to high-quality generated text that follows a given instruction and context. The score corresponds to the average negative log likelihood of the generated tokens. In this case, the average negative log likelihood is calculated from the tokens that follow "In other words,".
  • Instructions used by the evaluator are:
    • Accuracy:
      Rewrite the following text with its core information and consistent facts: {ref_hypo} In other words, {hypo_ref}
    • Fluency:
      Rewrite the following text to make it more grammatical and well-written: {ref_hypo} In other words, {hypo_ref}
    • Multidimensional quality metrics:
      Rewrite the following text into high-quality text with its core information: {ref_hypo} In other words, {hypo_ref}
  • Each instruction is evaluated twice - first it uses the expected answer for {ref_hypo} and the actual answer for {hypo_ref}, and then it is reversed. The calculated scores are then averaged.
  • The lower the metric value, the better.
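
A self-contained sketch of the scoring idea, using GPT-2 via the Hugging Face transformers library as a stand-in scoring model and the accuracy instruction above (this illustrates the method and is not the evaluator's actual implementation):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in scoring model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

TEMPLATE = ("Rewrite the following text with its core information and "
            "consistent facts: {ref_hypo} In other words, {hypo_ref}")

def gpt_score(ref_hypo: str, hypo_ref: str, template: str = TEMPLATE) -> float:
    """Average negative log likelihood of the tokens after 'In other words,'."""
    prefix = template.format(ref_hypo=ref_hypo, hypo_ref="").rstrip()
    full_text = template.format(ref_hypo=ref_hypo, hypo_ref=hypo_ref)
    prefix_len = tokenizer(prefix, return_tensors="pt").input_ids.shape[1]
    input_ids = tokenizer(full_text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits
    # Log probability of each token given the tokens before it.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = input_ids[0, 1:]
    token_log_probs = log_probs[torch.arange(targets.shape[0]), targets]
    # Keep only the continuation after the prefix (token-boundary effects at
    # the split are ignored in this sketch).
    return float(-token_log_probs[prefix_len - 1:].mean())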

Metrics calculated by the evaluator

  • Accuracy (float)
    • Are there inaccuracies, missing, or unfactual content in the generated text?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
    • This is the primary metric.
  • Fluency (float)
    • Is the generated text well-written and grammatical?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
  • Multidimensional Quality Metrics (float)
    • How is the overall quality of the generated text?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Parameterizable BYOP Evaluator

Bring Your Own Prompt (BYOP) evaluator uses a user-supplied custom prompt and an LLM judge to evaluate LLMs/RAGs. The current BYOP implementation supports only binary problems, so the prompt has to guide the judge to output either "true" or "false".

Method

  • User provides a custom prompt and an LLM judge.
  • The custom prompt may use the question, expected answer, retrieved context, and/or actual answer.
  • The evaluator prompts the LLM judge using the custom prompt provided by the user.
  • Evaluator checks every test case. The result of the test case evaluation is a boolean.
  • LLM models are compared based on the number of test cases where they succeeded.

Metrics calculated by the evaluator

  • Model passes (float)
    • Percentage of successfully evaluated RAG/LLM outputs.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
    • Primary metric.
  • Model failures (float)
    • Percentage of RAG/LLM outputs that failed to pass the evaluator check.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Model retrieval failures (float)
    • Percentage of RAG's retrieved contexts that failed to pass the evaluator check.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Model generation failures (float)
    • Percentage of outputs generated by RAG from the retrieved contexts that failed to pass the evaluator check (equivalent to the model failures).
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Model parse failures (float)
    • Percentage of RAG/LLM outputs that evaluator's judge (LLM, RAG or model) was unable to parse, and therefore unable to evaluate and provide a metrics score.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Most accurate, least accurate, fastest, slowest, most expensive and cheapest LLM models based on the evaluated primary metric.
  • LLM models with best and worst context retrieval performance.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-bool-leaderboard

    • LLM failure leaderboard with data and formats for boolean metrics.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Perplexity Evaluator

Perplexity measures how well a model predicts the next word based on what came before. The lower the perplexity score, the better the model is at predicting the next word.

Lower perplexity indicates that the model is more certain about its predictions. In comparison, higher perplexity suggests the model is more uncertain. Perplexity is a crucial metric for evaluating the performance of language models in tasks like machine translation, speech recognition, and text generation.

  • Evaluator uses the distilgpt2 language model to calculate the perplexity of the actual answer using the lmppl package.
  • Compatibility: RAG and LLM models.

Method

  • The evaluator utilizes the distilgpt2 language model to calculate the perplexity of the actual answer using the lmppl package. The calculation is as follows:
perplexity = exp(mean(cross-entropy loss))
  • Where the cross-entropy loss corresponds to cross-entropy loss of distilgpt2 calculated on the actual answer.
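
A minimal usage sketch, assuming the lmppl package's causal language-model interface:

import lmppl

scorer = lmppl.LM("distilgpt2")  # causal-LM perplexity scorer

actual_answers = [
    "The quick brown fox jumps over the lazy dog.",
    "Dog dog dog dog dog dog dog dog dog.",
]
# get_perplexity returns one perplexity value per input text; lower is better.
for answer, ppl in zip(actual_answers, scorer.get_perplexity(actual_answers)):
    print(f"{ppl:10.2f}  {answer}")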

Metrics calculated by the evaluator

  • Perplexity (float)
    • Perplexity measures how well a model predicts the next word based on what came before (sliding window). The lower the perplexity score, the better the model is at predicting the next word. Perplexity is calculated as exp(mean(-log likelihood)), where log-likelihood is computed using the distilgpt2 language model as the probability of predicting the next word.
    • Lower is better.
    • Range: [0, inf]
    • Default threshold: 0.5
    • Primary metric.

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Question Answering (GPTScore) Evaluator

GPT Score evaluator family is based on a novel evaluation framework specifically designed for RAGs and LLMs. It utilizes the inherent abilities of LLMs, particularly their ability to understand and respond to instructions, to assess the quality of generated text.

  • LLM judge based evaluation.
  • Compatibility: RAG and LLM models.

Method

  • The core idea of GPTScore is that a generative pre-trained model will assign a higher probability to high-quality generated text that follows a given instruction and context. The score corresponds to the average negative log likelihood of the generated tokens. In this case, the average negative log likelihood is calculated from the tokens of "Answer: Yes".

  • Instructions used by the evaluator are:

    • Interest:
      Answer the question based on the conversation between a human and AI.
      Question: Are the responses of AI interesting? (a) Yes. (b) No.
      Conversation: {history}
      Answer: Yes
    • Engagement:
      Answer the question based on the conversation between a human and AI.
      Question: Are the responses of AI engaging? (a) Yes. (b) No.
      Conversation: {history}
      Answer: Yes
    • Understandability:
      Answer the question based on the conversation between a human and AI.
      Question: Are the responses of AI understandable? (a) Yes. (b) No.
      Conversation: {history}
      Answer: Yes
    • Relevance:
      Answer the question based on the conversation between a human and AI.
      Question: Are the responses of AI relevant to the conversation? (a) Yes. (b) No.
      Conversation: {history}
      Answer: Yes
    • Specific:
      Answer the question based on the conversation between a human and AI.
      Question: Are the responses of AI generic or specific to the conversation? (a) Yes. (b) No.
      Conversation: {history}
      Answer: Yes
    • Correctness:
      Answer the question based on the conversation between a human and AI.
      Question: Are the responses of AI correct to conversations? (a) Yes. (b) No.
      Conversation: {history}
      Answer: Yes
    • Semantically appropriate:
      Answer the question based on the conversation between a human and AI.
      Question: Are the responses of AI semantically appropriate? (a) Yes. (b) No.
      Conversation: {history}
      Answer: Yes
    • Fluency:
      Answer the question based on the conversation between a human and AI.
      Question: Are the responses of AI fluently written? (a) Yes. (b) No.
      Conversation: {history}
      Answer: Yes
  • Where {history} corresponds to the conversation - question and actual answer.

  • The lower the metric value, the better.

Metrics calculated by the evaluator

  • Interest (float)
    • Is the generated text interesting?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
    • This is the primary metric.
  • Engagement (float)
    • Is the generated text engaging?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
  • Understandability (float)
    • Is the generated text understandable?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
  • Relevance (float)
    • How well is the generated text relevant to its source text?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
  • Specific (float)
    • Is the generated text generic or specific to the source text?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
  • Correctness (float)
    • Is the generated text correct or was there a misunderstanding of the source text?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
  • Semantically Appropriate (float)
    • Is the generated text semantically appropriate?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
  • Fluency (float)
    • Is the generated text well-written and grammatical?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

RAGAS Evaluator

RAGAs (RAG Assessment) is a framework that helps you evaluate your Retrieval Augmented Generation (RAG) pipelines. RAG refers to LLM applications that use external data to enhance the context. Evaluating and quantifying the performance of your pipeline can be hard, and this is where RAGAs comes in. The RAGAs score covers the performance of both the retrieval and generation components of the RAG pipeline, and therefore represents the overall quality of the answer, considering both the retrieval and the answer generation itself.

  • Harmonic mean of Faithfulness, Answer Relevancy, Context precision, and Context Recall metrics.
  • Compatibility: RAG evaluation only.
  • Based on RAGAs library

Method

  • RAGAs metric score is calculated as harmonic mean of the four metrics calculated by the following evaluators:
    • Faithfulness Evaluator (generation)
    • Answer Relevancy Evaluator (retrieval+generation)
    • Context Precision Evaluator (retrieval)
    • Context Recall Evaluator (retrieval)
  • Faithfulness covers the generation (answer) quality, Answer Relevancy covers both answer generation and retrieval quality, and Context Precision and Context Recall evaluate the retrieval quality.
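
A minimal sketch of the aggregation (the four per-test-case metric values are assumed to come from the evaluators listed above):

from statistics import harmonic_mean

def ragas_score(faithfulness: float, answer_relevancy: float,
                context_precision: float, context_recall: float) -> float:
    return harmonic_mean(
        [faithfulness, answer_relevancy, context_precision, context_recall]
    )

# A weak retrieval score pulls the overall RAGAs score down sharply:
print(ragas_score(0.90, 0.85, 0.40, 0.80))  # ~0.66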

Metrics calculated by the evaluator

  • RAGAS (float)
    • RAGAs (RAG Assessment) metric is a harmonic mean of the following metrics: faithfulness, answer relevancy, context precision and context recall.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
    • Primary metric.
  • Faithfulness (float)
    • Faithfulness (generation) metric measures the factual consistency of the generated answer against the given context. It is calculated from the answer and the retrieved context. Higher is better. The generated answer is regarded as faithful if all the claims that are made in the answer can be inferred from the given context: (number of claims inferable from the context / claims in the answer).
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
  • Answer relevancy (float)
    • Answer relevancy metric (retrieval+generation) assesses how pertinent the generated answer is to the given prompt. A lower score indicates answers which are incomplete or contain redundant information. This metric is computed using the question and the answer. Higher is better. An answer is deemed relevant when it directly and appropriately addresses the original question. To calculate this score, the LLM is prompted to generate an appropriate question for the generated answer multiple times, and the mean cosine similarity of generated questions with the original question is measured.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
  • Context precision (float)
    • Context precision metric (retrieval) evaluates whether all of the ground-truth relevant items present in the contexts are ranked high - ideally, all the relevant chunks must appear at the top of the context.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
  • Context recall (float)
    • Context recall metric (retrieval) measures the extent to which the retrieved context aligns with the answer, treated as the ground truth. It is computed based on the ground truth and the retrieved context. Higher is better. Each sentence in the ground truth answer is analyzed to determine whether it can be attributed to the retrieved context or not: (answer sentences that can be attributed to context / answer sentences count)
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Tokens Presence Evaluator

Tokens Presence Evaluator assesses whether both the retrieved context (in the case of RAG hosted models) and the generated answer contain/match a specified set of required strings. The evaluation is based on the match/no match of the required strings, using substring and/or regular expression-based search in the retrieved context and actual answer.

  • Boolean expression where operands are strings or regular expressions.
  • Compatibility: RAG and LLM evaluation.

Constraints are defined as a list of strings, regular expressions, and lists:

  • In the case of a string, the context and/or answer must contain the string.
  • In the case of a string with the REGEXP: prefix, the context and/or answer is checked to match the given regular expression. Use Python regular expression notation - for example REGEXP:^[Aa]nswer:? B$.
  • In the case of a list, the context and/or answer is checked to contain/match at least one of the list items (be it a string or a regular expression).

Method

  • Evaluator checks every test case - actual answer and retrieved context - for the presence of the required strings and regular expressions. The result of the test case evaluation is a boolean.
  • LLM models are compared based on the number of test cases where they succeeded.

Examples

Test Suite Constraints Example 1:

"output_constraints": [
  "15,969",
  "REGEXP:[Mm]illion",
  "REGEXP:^15,969 [Mm]illion$",
  ["either", "or"]
]

The preceding constraints indicate the following:

  1. must contain 15,969, and

  2. must match regular expression [Mm]illion, and

  3. must match regular expression ^15,969 [Mm]illion$, and

  4. must contain either "either" or "or".

Test Suite Constraints Example 2:

"output_constraints": [
["either", "or", "REGEXP:[Mm]illion"]
]

The preceding constraint indicates the following:

  1. must contain "either" or "or", or match regular expression [Mm]illion.
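
A minimal sketch of how these constraint semantics can be checked in Python (the evaluator's actual matching logic, e.g. its handling of anchors and case, may differ):

import re

def satisfies(text: str, constraint) -> bool:
    """Check a single constraint (string, REGEXP: string, or list) against the text."""
    if isinstance(constraint, list):
        # A list means "at least one of these items must contain/match".
        return any(satisfies(text, item) for item in constraint)
    if constraint.startswith("REGEXP:"):
        return re.search(constraint[len("REGEXP:"):], text) is not None
    return constraint in text

def passes(text: str, output_constraints: list) -> bool:
    # All top-level constraints must hold (logical AND).
    return all(satisfies(text, c) for c in output_constraints)

constraints = ["15,969", "REGEXP:[Mm]illion", ["either", "or"]]
print(passes("Revenue was 15,969 million, either way you count it.", constraints))  # True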

Metrics calculated by the evaluator

  • Model passes (float)
    • Percentage of successfully evaluated RAG/LLM outputs.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
    • Primary metric.
  • Model failures (float)
    • Percentage of RAG/LLM outputs that failed to pass the evaluator check.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Model retrieval failures (float)
    • Percentage of RAG's retrieved contexts that failed to pass the evaluator check.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Model generation failures (float)
    • Percentage of outputs generated by RAG from the retrieved contexts that failed to pass the evaluator check (equivalent to the model failures).
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Model parse failures (float)
    • Percentage of RAG/LLM outputs that evaluator's judge (LLM, RAG or model) was unable to parse, and therefore unable to evaluate and provide a metrics score.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Most accurate, least accurate, fastest, slowest, most expensive, and cheapest LLM models based on the evaluated primary metric.
  • LLM models with best and worst context retrieval performance.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-bool-leaderboard

    • LLM failure leaderboard with data and formats for boolean metrics.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Retrieval

Context Precision Evaluator

Context Precision Evaluator assesses the quality of the retrieved context by evaluating the order and relevance of text chunks on the context stack - the precision of the context retrieval. Ideally, all relevant chunks (ranked higher) should appear at the top of the context.

Method

  • The evaluator calculates a score based on the presence of the expected answer (ground truth) in the text chunks at the top of the retrieved context chunk stack.
  • Irrelevant chunks and unnecessarily large context decrease the score.
  • Top of the stack is defined as n top-most chunks at the top of the stack.
  • Chunk relevance is determined by the LLM judge as a [0, 1] value. The chunk precision at each position (depth) in the stack is multiplied by the chunk's relevance, summed over depths, and normalized to calculate the score:
context precision = sum( chunk precision (depth) * relevance (depth)) / number of relevant items at the top of the chunk stack
chunk precision (depth) = true positives (depth) / (true positives (depth) + false positives (depth))
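
A minimal sketch of this calculation, under the simplifying assumption that the judge's [0, 1] chunk relevance values are used directly as true-positive weights at each depth:

def context_precision(chunk_relevance: list[float]) -> float:
    """chunk_relevance[i] is the judged relevance of the chunk at depth i + 1,
    with the top of the retrieved-context stack first."""
    score = 0.0
    true_positives = 0.0
    for depth, relevance in enumerate(chunk_relevance, start=1):
        true_positives += relevance
        precision_at_depth = true_positives / depth  # TP / (TP + FP) at this depth
        score += precision_at_depth * relevance
    relevant_items = sum(chunk_relevance)
    return score / relevant_items if relevant_items else 0.0

# Relevant chunks ranked at the top score higher than the same chunks ranked low:
print(context_precision([1.0, 1.0, 0.0]))  # 1.0
print(context_precision([0.0, 1.0, 1.0]))  # ~0.58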

Metrics calculated by the evaluator

  • Context precision (float)
    • Context precision metric (retrieval) evaluates whether all of the ground-truth relevant items present in the contexts are ranked high - ideally, all the relevant chunks must appear at the top of the context.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Context Recall Evaluator

Context Recall Evaluator measures the alignment between the retrieved context and the expected answer (ground truth).

  • LLM judge is checking ground truth sentences' presence in the retrieved context.
  • Compatibility: RAG evaluation only.
  • Based on RAGAs library

Method

  • Metric is computed based on the ground truth and the retrieved context.
  • The LLM judge analyzes each sentence in the expected answer (ground truth) to determine if it can be attributed to the retrieved context.
  • The score is calculated as the ratio of the number of sentences in the expected answer that can be attributed to the context to the total number of sentences in the expected answer (ground truth).

Score formula:

context recall = (expected answer sentences that can be attributed to context) / (expected answer sentences count)

Metrics calculated by the evaluator

  • Context recall (float)
    • Context recall metric (retrieval) measures the extent to which the retrieved context aligns with the answer, treated as the ground truth. It is computed based on the ground truth and the retrieved context. Higher is better. Each sentence in the ground truth answer is analyzed to determine whether it can be attributed to the retrieved context or not: (answer sentences that can be attributed to context / answer sentences count)
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Context Relevancy Evaluator

Context Relevancy Evaluator measures the relevancy of the retrieved context based on the question and contexts.

  • Extraction and relevance assessment by an LLM judge.
  • Compatibility: RAG evaluation only.
  • Based on RAGAs library

Method

  • The evaluator uses an LLM judge to identify relevant sentences within the retrieved context to compute the score using the formula:
context relevancy = (number of relevant context sentences) / (total number of context sentences)
  • Total number of sentences is determined by a sentence tokenizer.

Metrics calculated by the evaluator

  • Context relevancy (float)
    • Context relevancy metric gauges the relevancy of the retrieved context, calculated based on both the question and contexts. The values fall within the range of (0, 1), with higher values indicating better relevancy. Ideally, the retrieved context should exclusively contain essential information to address the provided query. To compute this, evaluator initially estimates the value by identifying sentences within the retrieved context that are relevant for answering the given question. The final score is determined by the following formula: context relevancy = (number of relevant sentences / total number of sentences).
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Context Relevancy Evaluator (Soft Recall and Precision)

Question | Expected answer | Retrieved context | Actual answer | Constraints

Context Relevancy (Soft Recall and Precision) Evaluator measures the relevancy of the retrieved context based on the question and context sentences and produces two metrics - precision and recall relevancy.

  • Compatibility: RAG evaluation only.

Method

  • The evaluator computes two metrics as:
chunk context relevancy(ch) = max( {S(emb(q), emb(s)): for all s in ch} )

recall relevancy = max( {chunk context relevancy(ch): for all ch in rc} )
precision relevancy = avg( {chunk context relevancy(ch): for all ch in rc} )
  • Where:
    • rc is the retrieved context.
    • ch is a chunk of the retrieved context.
    • s is a sentence of a retrieved context chunk.
    • emb(s) is the vector embedding of the sentence s, and emb(q) is the vector embedding of the question q.
    • S(emb(q), emb(s)) is 1 minus the cosine distance (i.e., the cosine similarity) between the question embedding and the sentence embedding.
  • The evaluator uses the BAAI/bge-small-en embedding model, where BGE stands for "BAAI General Embedding", a suite of open-source text embedding models developed by the Beijing Academy of Artificial Intelligence (BAAI). A minimal sketch of this computation follows.
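
A minimal sketch of these formulas using the sentence-transformers package to load BAAI/bge-small-en (sentence splitting and chunking are simplified here; the production pipeline may differ):

# Sketch: soft recall/precision relevancy over retrieved-context chunks.
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-small-en")

def chunk_relevancy(question_emb, chunk: str) -> float:
    sentences = sent_tokenize(chunk) or [chunk]
    sentence_embs = model.encode(sentences, convert_to_tensor=True)
    # S(emb(q), emb(s)) = cosine similarity = 1 - cosine distance
    return float(util.cos_sim(question_emb, sentence_embs).max())

def soft_recall_precision(question: str, retrieved_chunks: list[str]) -> tuple[float, float]:
    question_emb = model.encode(question, convert_to_tensor=True)
    scores = [chunk_relevancy(question_emb, ch) for ch in retrieved_chunks]
    return max(scores), sum(scores) / len(scores)  # recall relevancy, precision relevancy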

Metrics calculated by the evaluator

  • Recall Relevancy (float)
    • Maximum retrieved context chunk relevancy.
    • Higher score is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
  • Precision Relevancy (float)
    • Average retrieved context chunk relevancy.
    • Higher score is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Privacy

Contact Information Evaluator

Question | Expected answer | Retrieved context | Actual answer | Constraints

Contact Information Evaluator checks for potential leakages of contact information in the text generated by RAG/LLM models. It assesses whether the generated answer contains contact information such as names, addresses, phone numbers, medical information, user names, and emails.

  • LLM judge based contact information detection.
  • Compatibility: RAG and LLM models.

Method

  • The evaluator prompts the LLM judge to detect contact information in the actual answer.
  • The evaluator checks every test case for the presence of contact information. The result of each test case evaluation is a boolean.
  • LLM models are compared based on the number of test cases where they succeeded, as aggregated in the sketch below.
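
A minimal sketch of how such boolean verdicts can be aggregated into the pass/fail percentages listed below (function and field names are illustrative, not Eval Studio's actual schema):

# Sketch: aggregate per-test-case judge verdicts into rates for one model.
def leakage_rates(judge_flags: list[bool]) -> dict:
    """judge_flags[i] is True when the LLM judge found contact information
    in the i-th actual answer (illustrative structure)."""
    n = len(judge_flags)
    fails = sum(judge_flags)
    return {
        "no_contact_information_leakages_pass": (n - fails) / n,
        "contact_information_leakages_fail": fails / n,
    }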

Metrics calculated by the evaluator

  • No contact information leakages (pass) (float)
    • Percentage of successfully evaluated RAG/LLM outputs for contact information leakage metric which detects privacy sensitive information like names, addresses, phone numbers, medical information, and emails.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
    • Primary metric.
  • Contact information leakages (fail) (float)
    • Percentage of RAG/LLM outputs that failed to pass the evaluator check for contact information leakage metric which detects privacy sensitive information like names, addresses, phone numbers, medical information, and emails.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Contact information retrieval failures (float)
    • Percentage of RAG's retrieved contexts that failed to pass the evaluator check for contact information leakage metric which detects privacy sensitive information like names, addresses, phone numbers, medical information, and emails.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Contact information generation failures (float)
    • Percentage of outputs generated by RAG from the retrieved contexts that failed to pass the evaluator check (equivalent to the model failures) for contact information leakage metric which detects privacy sensitive information like names, addresses, phone numbers, medical information, and emails.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Contact information parsing failures (float)
    • Percentage of RAG/LLM outputs that evaluator's judge (LLM, RAG or model) was unable to parse, and therefore unable to evaluate and provide a metrics score for contact information leakage metric which detects privacy sensitive information like names, addresses, phone numbers, medical information, and emails.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Most accurate, least accurate, fastest, slowest, most expensive, and cheapest LLM models based on the evaluated primary metric.
  • LLM models with best and worst context retrieval performance.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-bool-leaderboard

    • LLM failure leaderboard with data and formats for boolean metrics.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

PII Leakage Evaluator

Question | Expected answer | Retrieved context | Actual answer | Constraints

PII Leakage Evaluator checks for potential leakages of personally identifiable information (PII) - such as credit card numbers, social security numbers, and email addresses - in the text generated by the LLM/RAG model.

  • Regular expressions suite to detect PII in the retrieved context and actual answer.
  • Compatibility: RAG and LLM.

Method

  • PII Leakage Evaluator checks for potential personally identifiable information (PII) leakages in the text generated by LLM/RAG models.
  • The evaluation utilizes a regex suite that can quickly and reliably detect formatted PII, including credit card numbers, SSNs, and emails.
  • The evaluator checks every test case - actual answer and retrieved context - for the presence of PII. The result of each test case evaluation is a boolean.
  • LLM models are compared based on the number of test cases where they succeeded. A simplified sketch of the regex approach follows.
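
A minimal sketch of the regex-based check (the patterns below are simplified illustrations, not the evaluator's actual regex suite):

import re

# Simplified, illustrative PII patterns (a production suite is far more thorough).
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def pii_leaks(text: str) -> dict[str, bool]:
    return {name: bool(pattern.search(text)) for name, pattern in PII_PATTERNS.items()}

# The check runs on both the actual answer and the retrieved context, e.g.:
# leaked = any(pii_leaks(actual_answer).values()) or any(pii_leaks(retrieved_context).values())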

Metrics calculated by the evaluator

  • No PII leakages (pass) (float)
    • Percentage of successfully evaluated RAG/LLM outputs for PII leakage metric which detects privacy sensitive information like credit card numbers, social security numbers, and email addresses.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
    • Primary metric.
  • PII leakages (fail) (float)
    • Percentage of RAG/LLM outputs that failed to pass the evaluator check for PII leakage metric which detects privacy sensitive information like credit card numbers, social security numbers, and email addresses.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • PII retrieval leakages (float)
    • Percentage of RAG's retrieved contexts that failed to pass the evaluator check for PII leakage metric which detects privacy sensitive information like credit card numbers, social security numbers, and email addresses.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • PII generation leakages (float)
    • Percentage of outputs generated by RAG from the retrieved contexts that failed to pass the evaluator check (equivalent to the model failures) for PII leakage metric which detects privacy sensitive information like credit card numbers, social security numbers, and email addresses.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Most accurate, least accurate, fastest, slowest, most expensive, and cheapest LLM models based on the evaluated primary metric.
  • LLM models with best and worst context retrieval performance.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-bool-leaderboard

    • LLM failure leaderboard with data and formats for boolean metrics.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Sensitive Data Leakage Evaluator

Question | Expected answer | Retrieved context | Actual answer | Constraints

Sensitive Data Leakage Evaluator checks for potential leakages of security-related and/or sensitive data in the text generated by LLM/RAG models. It assesses whether the generated answer contains security-related information such as activation keys, passwords, API keys, tokens, or certificates.

  • Regular expressions suite to detect sensitive data in the retrieved context and actual answer.
  • Compatibility: RAG and LLM.

Method

  • The evaluator utilizes a regex suite that can quickly and reliably detect formatted sensitive data, including certificates in SSL/TLS PEM format, API keys for H2O.ai and OpenAI, and activation keys for Windows.
  • The evaluator checks every test case - actual answer and retrieved context - for the presence of sensitive data. The result of each test case evaluation is a boolean.
  • LLM models are compared based on the number of test cases where they succeeded.

Metrics calculated by the evaluator

  • No sensitive data leakages (pass) (float)
    • Percentage of successfully evaluated RAG/LLM outputs for sensitive data leakage metric which detects sensitive data like security certificates (SSL/TLS PEM), API keys (H2O.ai and OpenAI) and activation keys (Windows).
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
    • Primary metric.
  • Sensitive data leakages (fail) (float)
    • Percentage of RAG/LLM outputs that failed to pass the evaluator check for sensitive data leakage metric which detects sensitive data like security certificates (SSL/TLS PEM), API keys (H2O.ai and OpenAI) and activation keys (Windows).
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Sensitive data retrieval leakages (float)
    • Percentage of RAG's retrieved contexts that failed to pass the evaluator check for sensitive data leakage metric which detects sensitive data like security certificates (SSL/TLS PEM), API keys (H2O.ai and OpenAI) and activation keys (Windows).
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Sensitive data generation leakages (float)
    • Percentage of outputs generated by RAG from the retrieved contexts that failed to pass the evaluator check (equivalent to the model failures) for sensitive data leakage metric which detects sensitive data like security certificates (SSL/TLS PEM), API keys (H2O.ai and OpenAI) and activation keys (Windows).
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Sensitive data parsing failures (float)
    • Percentage of RAG/LLM outputs that evaluator's judge (LLM, RAG or model) was unable to parse, and therefore unable to evaluate and provide a metrics score for sensitive data leakage metric which detects sensitive data like security certificates (SSL/TLS PEM), API keys (H2O.ai and OpenAI) and activation keys (Windows).
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Most accurate, least accurate, fastest, slowest, most expensive, and cheapest LLM models based on the evaluated primary metric.
  • LLM models with best and worst context retrieval performance.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-bool-leaderboard

    • LLM failure leaderboard with data and formats for boolean metrics.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Fairness

Fairness Bias Evaluator

Question | Expected answer | Retrieved context | Actual answer | Constraints

Fairness bias evaluator assesses whether the LLM/RAG output contains gender, racial, or political bias. This information can then be used to improve the development and deployment of LLMs/RAGs by identifying and mitigating potential biases.

  • Compatibility: RAG and LLM models.

Method

Metric calculated by the evaluator

  • Fairness bias (float)
    • Fairness bias metric indicates the level of gender, racial, or political bias in the generated text. High score indicates high fairness bias.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
    • Primary metric.

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator

  • llm-eval-results
    • Frame with the evaluation results.
  • llm-heatmap-leaderboard
    • Leaderboards with models and prompts by metric values.
  • work-dir-archive
    • Zip archive with evaluator artifacts.

Sexism Evaluator

Question | Expected answer | Retrieved context | Actual answer | Constraints

Sexism evaluator evaluates input and LLM output to find possible instances of sexism.

  • LLM judge based sexism detection.
  • Compatibility: RAG and LLM models.

Method

  • The evaluator prompts the LLM judge to detect sexism in the actual answer.
  • Evaluator checks every test case for the presence of sexism. The result of the test case evaluation is a boolean.
  • LLM models are compared based on the number of test cases where they succeeded.

Metrics calculated by the evaluator

  • No Sexism (pass) (float)
    • Percentage of successfully evaluated RAG/LLM outputs for sexism metric which detects possible instances of sexism.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
    • Primary metric.
  • Sexist (fail) (float)
    • Percentage of RAG/LLM outputs that failed to pass the evaluator check for sexism metric which detects possible instances of sexism.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Sexism in retrieval (float)
    • Percentage of RAG's retrieved contexts that failed to pass the evaluator check for sexism metric which detects possible instances of sexism.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Sexism in generation (float)
    • Percentage of outputs generated by RAG from the retrieved contexts that failed to pass the evaluator check (equivalent to the model failures) for sexism metric which detects possible instances of sexism.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Sexism parsing failures (float)
    • Percentage of RAG/LLM outputs that evaluator's judge (LLM, RAG or model) was unable to parse, and therefore unable to evaluate and provide a metrics score for sexism metric which detects possible instances of sexism.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Most accurate, least accurate, fastest, slowest, most expensive and cheapest LLM models based on the evaluated primary metric.
  • LLM models with best and worst context retrieval performance.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Explanations created by the evaluator

  • llm-eval-results
    • Frame with the evaluation results.
  • llm-bool-leaderboard
    • LLM failure leaderboard with data and formats for boolean metrics.
  • work-dir-archive
    • Zip archive with evaluator artifacts.

Stereotypes Evaluator

Question | Expected answer | Retrieved context | Actual answer | Constraints

Stereotypes evaluator tries to determine whether the LLM output contains stereotypes - it assesses whether the answer adds information about gender or race that has no reference in the question.

  • LLM judge based gender stereotypes detection.
  • Compatibility: RAG and LLM models.

Method

  • The evaluator prompts the LLM judge to detect gender stereotypes in the actual answer and also to check the question.
  • Evaluator checks every test case for the presence of stereotypes. The result of the test case evaluation is a boolean.
  • LLM models are compared based on the number of test cases where they succeeded.

Metrics calculated by the evaluator

  • Stereotype-free (pass) (float)
    • Percentage of successfully evaluated RAG/LLM outputs for gender stereotypes metric which detects the presence of gender and/or race stereotypes.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
    • Primary metric.
  • Stereotyped (fail) (float)
    • Percentage of RAG/LLM outputs that failed to pass the evaluator check for gender stereotypes metric which detects the presence of gender and/or race stereotypes.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Stereotypes in retrieval (float)
    • Percentage of RAG's retrieved contexts that failed to pass the evaluator check for gender stereotypes metric which detects the presence of gender and/or race stereotypes.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Stereotypes in generation (float)
    • Percentage of outputs generated by RAG from the retrieved contexts that failed to pass the evaluator check (equivalent to the model failures) for gender stereotypes metric which detects the presence of gender and/or race stereotypes.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Stereotypes parsing failures (float)
    • Percentage of RAG/LLM outputs that evaluator's judge (LLM, RAG or model) was unable to parse, and therefore unable to evaluate and provide a metrics score for gender stereotypes metric which detects the presence of gender and/or race stereotypes.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Most accurate, least accurate, fastest, slowest, most expensive, and cheapest LLM models based on the evaluated primary metric.
  • LLM models with best and worst context retrieval performance.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Explanations created by the evaluator

  • llm-eval-results
    • Frame with the evaluation results.
  • llm-bool-leaderboard
    • LLM failure leaderboard with data and formats for boolean metrics.
  • work-dir-archive
    • Zip archive with evaluator artifacts.

Toxicity Evaluator

Question | Expected answer | Retrieved context | Actual answer | Constraints

Toxicity evaluator assesses the level of toxicity in the actual answers. RAGs/LLMs can generate human-quality text, but they can also be prone to generating toxic content, such as hate speech, offensive language, and discriminatory language.

The value of the toxicity evaluator is twofold: it helps ensure that LLMs are not used to generate toxic content that could harm individuals or groups, and it helps improve the accuracy and reliability of RAGs/LLMs by identifying and mitigating the generation of toxic content.

  • Compatibility: RAG and LLM models.

Method

  • Toxicity evaluator uses the Detoxify library to calculate toxicity scores, as in the sketch below.
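
A minimal sketch using the Detoxify package (the "original" model variant is an assumption; Eval Studio's configuration may differ):

# Sketch: score an actual answer with Detoxify.
from detoxify import Detoxify

scorer = Detoxify("original")  # assumed model variant
scores = scorer.predict("the actual answer to evaluate")
# `scores` is a dict with keys such as 'toxicity', 'severe_toxicity',
# 'obscene', 'threat', 'insult', and 'identity_attack', all in [0, 1].
print(scores["toxicity"])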

Metrics calculated by the evaluator

  • Toxicity (float)
    • Toxicity score indicates the level of toxicity in the generated text, e.g., hate speech, offensive language, threatening, insulting, and identity attacking language. High score indicates high toxicity.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
    • Primary metric.
  • Severe Toxicity (float)
    • Toxicity score indicates the level of severe toxicity in the generated text. High score indicates high toxicity.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
  • Obscene (float)
    • Toxicity score indicates the level of obscenity in the generated text. High score indicates high toxicity.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
  • Threat (float)
    • Toxicity score indicates the level of threat in the generated text. High score indicates high toxicity.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
  • Insult (float)
    • Toxicity score indicates the level of insults in the generated text. High score indicates high toxicity.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
  • Identity Attack (float)
    • Toxicity score indicates the level of identity attacks in the generated text. High score indicates high toxicity.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator

  • llm-eval-results
    • Frame with the evaluation results.
  • llm-heatmap-leaderboard
    • Leaderboards with models and prompts by metric values.
  • work-dir-archive
    • Zip archive with evaluator artifacts.

Summarization

BLEU Evaluator

Question | Expected answer | Retrieved context | Actual answer | Constraints

BLEU (Bilingual Evaluation Understudy) measures the quality of machine-generated texts by comparing them to reference texts. BLEU calculates a score between 0.0 and 1.0, where a higher score indicates a better match with the reference text.

  • Compatibility: RAG and LLM models.

Method

  • BLEU is based on the concept of n-grams, which are contiguous sequences of words. The different variations of BLEU, such as BLEU-1, BLEU-2, BLEU-3, and BLEU-4, differ in the size of the n-grams considered for evaluation.
  • BLEU-n measures the precision of n-grams (n consecutive words) in the generated text compared to the reference text. It calculates the precision score by counting the number of overlapping n-grams and dividing it by the total number of n-grams in the generated text.
  • The NLTK library is used to tokenize the text with the punkt tokenizer and to calculate the BLEU scores, as in the sketch below.
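
A minimal sketch of the BLEU computation with NLTK (individual n-gram weights and the smoothing choice are assumptions):

# Sketch: BLEU-1..BLEU-4 as individual n-gram precision scores.
from nltk.tokenize import word_tokenize
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = word_tokenize("the expected answer text")
candidate = word_tokenize("the actual answer text")
smooth = SmoothingFunction().method1  # assumption: smoothing helps on short texts

weights = {
    "BLEU-1": (1, 0, 0, 0),
    "BLEU-2": (0, 1, 0, 0),
    "BLEU-3": (0, 0, 1, 0),
    "BLEU-4": (0, 0, 0, 1),
}
# Note: sentence_bleu also applies a brevity penalty on top of the n-gram precision.
scores = {
    name: sentence_bleu([reference], candidate, weights=w, smoothing_function=smooth)
    for name, w in weights.items()
}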

Metrics calculated by the evaluator

  • BLEU-1 (float)
    • BLEU-1 metric is typically used for summary evaluation - it measures the precision of n-grams (n consecutive words) in the generated text compared to the reference text. It calculates the precision score by counting the number of overlapping unigrams and dividing it by the total number of unigrams in the generated text.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
    • Primary metric.
  • BLEU-2 (float)
    • BLEU-2 metric is typically used for summary evaluation - it measures the precision of n-grams (n consecutive words) in the generated text compared to the reference text. It calculates the precision score by counting the number of overlapping bigrams and dividing it by the total number of bigrams in the generated text.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
  • BLEU-3 (float)
    • BLEU-3 metric is typically used for summary evaluation - it measures the precision of n-grams (n consecutive words) in the generated text compared to the reference text. It calculates the precision score by counting the number of overlapping trigrams and dividing it by the total number of trigrams in the generated text.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
  • BLEU-4 (float)
    • BLEU-4 metric is typically used for summary evaluation - it measures the precision of n-grams (n consecutive words) in the generated text compared to the reference text. It calculates the precision score by counting the number of overlapping 4-grams and dividing it by the total number of 4-grams in the generated text.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Explanations created by the evaluator

  • llm-eval-results
    • Frame with the evaluation results.
  • llm-heatmap-leaderboard
    • Leaderboards with models and prompts by metric values.
  • work-dir-archive
    • Zip archive with evaluator artifacts.

ROUGE Evaluator

Question | Expected answer | Retrieved context | Actual answer | Constraints

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of evaluation metrics used to assess the quality of generated summaries compared to reference summaries. There are several variations of ROUGE metrics, including ROUGE-1, ROUGE-2, and ROUGE-L.

  • Compatibility: RAG and LLM models.

Method

  • The evaluator reports the F1 score between the generated and reference n-grams.
  • ROUGE-1 measures the overlap of 1-grams (individual words) between the generated and the reference summaries.
  • ROUGE-2 extends the evaluation to 2-grams (pairs of consecutive words).
  • ROUGE-L considers the longest common subsequence (LCS) between the generated and reference summaries.
  • These ROUGE metrics provide a quantitative evaluation of the similarity between the generated and reference texts to assess the effectiveness of text summarization algorithms; a minimal sketch follows.
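
A minimal sketch using the rouge-score package (the exact library used by the evaluator is not stated here, so treat this as one possible implementation):

# Sketch: ROUGE-1, ROUGE-2, and ROUGE-L F1 scores.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score("the reference summary", "the generated summary")
# Each entry is a Score(precision, recall, fmeasure); the evaluator reports the F1 score.
rouge_l_f1 = scores["rougeL"].fmeasure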

Metrics calculated by the evaluator

  • ROUGE-1 (float)
    • ROUGE-1 metric measures the overlap of 1-grams (individual words) between the generated and the reference summaries.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
  • ROUGE-2 (float)
    • ROUGE-2 metric measures the overlap of 2-grams (pairs of consecutive words) between the generated and the reference summaries.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
  • ROUGE-L (float)
    • ROUGE-L metric considers the longest common subsequence (LCS) between the generated and reference summaries.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
    • Primary metric.

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, the evaluator will report a problem for each perturbed test case and LLM model whose metric flips (moved above or below the threshold) after perturbation.

Insights diagnosed by the evaluator

  • The best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt that most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator

  • llm-eval-results
    • Frame with the evaluation results.
  • llm-heatmap-leaderboard
    • Leaderboards with models and prompts by metric values.
  • work-dir-archive
    • Zip archive with evaluator artifacts.

Summarization (Completeness and Faithfulness) Evaluator

Question | Expected answer | Retrieved context | Actual answer | Constraints

This summarization evaluator, which does not require a reference summary, uses two faithfulness metrics based on SummaC (Conv and ZS) and one completeness metric.

  • Compatibility: RAG and LLM models.

Method

  • SummaC Conv is a trained model consisting of a single learned convolution layer compiling the distribution of entailment scores of all document sentences into a single score.
  • SummaC ZS performs zero-shot aggregation by combining sentence-level scores using max and mean operators. This metric is more sensitive to outliers than SummaC Conv.
  • Completeness metric is calculated using the distance of embeddings between the reference and faithful parts of the summary.

Metrics calculated by the evaluator

  • Completeness (float)
    • Completeness metric is calculated using the distance of embeddings between the reference and faithful parts of the summary.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
    • Primary metric.
  • Faithfulness (SummaC Conv) (float)
    • The faithfulness metric measures how well the summary preserves the meaning and factual content of the original text. SummaC Conv is a trained model consisting of a single learned convolution layer compiling the distribution of entailment scores of all document sentences into a single score.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
  • Faithfulness (SummaC ZS) (float)
    • The faithfulness metric measures how well the summary preserves the meaning and factual content of the original text. SummaC ZS performs zero-shot aggregation by combining sentence-level scores using max and mean operators. This metric is more sensitive to outliers than SummaC Conv.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Explanations created by the evaluator

  • llm-eval-results
    • Frame with the evaluation results.
  • llm-heatmap-leaderboard
    • Leaderboards with models and prompts by metric values.
  • work-dir-archive
    • Zip archive with evaluator artifacts.

Summarization (Judge) Evaluator

Question | Expected answer | Retrieved context | Actual answer | Constraints

Summarization evaluator uses an LLM judge to assess the quality of the summary made by the evaluated model using a reference summary.

  • LLM judge based summarization evaluation.
  • Requires a reference summary.
  • Compatibility: RAG and LLM models.

Method

  • The evaluator prompts the LLM judge to compare the actual answer (the evaluated RAG/LLM's summary) with the expected answer (the reference summary).
  • The evaluator checks every test case for summary quality. The result of each test case evaluation is a boolean.
  • LLM models are compared based on the number of test cases where they succeeded.

Metrics calculated by the evaluator

  • Good summary (pass) (float)
    • Percentage of successfully evaluated RAG/LLM outputs for the summarization quality metric, which uses a language model judge to determine whether the summary is correct or not.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
    • Primary metric.
  • Bad summary (fail) (float)
    • Percentage of RAG/LLM outputs that failed to pass the evaluator check for the summarization quality metric, which uses a language model judge to determine whether the summary is correct or not.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Summarization retrieval failures (float)
    • Percentage of RAG's retrieved contexts that failed to pass the evaluator check for the summarization quality metric, which uses a language model judge to determine whether the summary is correct or not.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Summarization generation failures (float)
    • Percentage of outputs generated by RAG from the retrieved contexts that failed to pass the evaluator check (equivalent to the model failures) for the summarization quality metric, which uses a language model judge to determine whether the summary is correct or not.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Summarization parsing failures (float)
    • Percentage of RAG/LLM outputs that the evaluator's judge (LLM, RAG or model) was unable to parse, and therefore unable to evaluate and provide a metric score, for the summarization quality metric, which uses a language model judge to determine whether the summary is correct or not.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Most accurate, least accurate, fastest, slowest, most expensive and cheapest LLM models based on the evaluated primary metric.
  • LLM models with best and worst context retrieval performance.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-bool-leaderboard

    • LLM failure leaderboard with data and formats for boolean metrics.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Summarization with reference (GPTScore) Evaluator

Input | Expected answer | Retrieved context | Actual answer | Constraints

GPT Score evaluator family is based on a novel evaluation framework specifically designed for RAGs and LLMs. It utilizes the inherent abilities of LLMs, particularly their ability to understand and respond to instructions, to assess the quality of generated text.

  • LLM judge based evaluation.
  • Compatibility: RAG and LLM models.

Method

  • The core idea of GPTScore is that a generative pre-trained model will assign a higher probability to high-quality generated text that follows a given instruction and context. The score corresponds to the average negative log-likelihood of the generated tokens. In this case, the average negative log-likelihood is calculated from the tokens that follow the phrase "In other words,".
  • Instructions used by the evaluator are:
    • Semantic coverage:
      Rewrite the following text with the same semantics. {ref_hypo} In other words, {hypo_ref}
    • Factuality:
      Rewrite the following text with consistent facts. {ref_hypo} In other words, {hypo_ref}
    • Informativeness:
      Rewrite the following text with its core information. {ref_hypo} In other words, {hypo_ref}
    • Coherence:
      Rewrite the following text into a coherent text. {ref_hypo} In other words, {hypo_ref}
    • Relevance:
      Rewrite the following text with consistent details. {ref_hypo} In other words, {hypo_ref}
    • Fluency:
      Rewrite the following text into a fluent and grammatical text. {ref_hypo} In other words, {hypo_ref}
  • Each instruction is evaluated twice - first using the expected answer for {ref_hypo} and the actual answer for {hypo_ref}, and then with the two reversed. The calculated scores are then averaged.
  • The lower the metric value, the better (see the sketch below).
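
A minimal sketch of the average negative log-likelihood computation with a Hugging Face causal LM (gpt2 is only a stand-in for the judge model; the example uses the semantic coverage instruction):

# Sketch: GPTScore as the average negative log-likelihood of the continuation tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # stand-in judge model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def gptscore(prefix: str, continuation: str) -> float:
    """Average negative log-likelihood of `continuation` given `prefix`
    (tokenizing the two parts separately is a simplification)."""
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    cont_ids = tokenizer(continuation, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, cont_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)    # predictors of tokens 1..N-1
    cont_log_probs = log_probs[0, prefix_ids.shape[1] - 1 :, :].gather(
        -1, cont_ids[0].unsqueeze(-1)
    )
    return float(-cont_log_probs.mean())

# Semantic coverage: expected answer as {ref_hypo}, actual answer as {hypo_ref}.
score = gptscore(
    "Rewrite the following text with the same semantics. <expected answer> In other words,",
    " <actual answer>",
)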

Metrics calculated by the evaluator

  • Semantic Coverage (float)
    • How many semantic content units from the reference text are covered by the generated text?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
    • This is the primary metric.
  • Factuality (float)
    • Does the generated text preserve the factual statements of the source text?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
  • Informativeness (float)
    • How well does the generated text capture the key ideas of its source text?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
  • Coherence (float)
    • How much does the generated text make sense?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
  • Relevance (float)
    • How well is the generated text relevant to its source text?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
  • Fluency (float)
    • Is the generated text well-written and grammatical?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Explanations created by the evaluator

  • llm-eval-results
    • A frame with the evaluation results.
  • llm-heatmap-leaderboard
    • Leaderboards with models and prompts by metric values.
  • work-dir-archive
    • A ZIP archive with evaluator artifacts.

Summarization without reference (GPTScore) Evaluator

Input | Expected answer | Retrieved context | Actual answer | Constraints

GPT Score evaluator family is based on a novel evaluation framework specifically designed for RAGs and LLMs. It utilizes the inherent abilities of LLMs, particularly their ability to understand and respond to instructions, to assess the quality of generated text.

  • LLM judge based evaluation.
  • Compatibility: RAG and LLM models.

Method

  • The core idea of GPTScore is that a generative pre-trained model will assign a higher probability to high-quality generated text that follows a given instruction and context. The score corresponds to the average negative log-likelihood of the generated tokens. In this case, the average negative log-likelihood is calculated from the tokens that follow "Tl;dr".
  • Instructions used by the evaluator are:
    • Semantic coverage:
      Generate a summary with as much semantic coverage as possible for the following text: {src}
      Tl;dr
      {target}
    • Factuality:
      Generate a summary with consistent facts for the following text: {src}
      Tl;dr
      {target}
    • Consistency:
      Generate a factually consistent summary for the following text: {src}
      Tl;dr
      {target}
    • Informativeness:
      Generate an informative summary that captures the key points of the following text: {src}
      Tl;dr
      {target}
    • Coherence:
      Generate a coherent summary for the following text: {src}
      Tl;dr
      {target}
    • Relevance:
      Generate a relevant summary with consistent details for the following text: {src}
      Tl;dr
      {target}
    • Fluency:
      Generate a fluent and grammatical summary for the following text: {src}
      Tl;dr
      {target}
  • Where {src} corresponds to the question and {target} to the actual answer.
  • The lower the metric value, the better (see the usage note below).
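
For illustration, the gptscore sketch from the with-reference evaluator above can be reused here; only the instruction template changes (the function and the judge model remain assumptions):

# Sketch: semantic coverage without a reference; {src} is the question, {target} the actual answer.
score = gptscore(
    "Generate a summary with as much semantic coverage as possible "
    "for the following text: <question>\nTl;dr\n",
    "<actual answer>",
)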

Metrics calculated by the evaluator

  • Semantic Coverage (float)
    • How many semantic content units from the reference text are covered by the generated text?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
    • This is the primary metric.
  • Factuality (float)
    • Does the generated text preserve the factual statements of the source text?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
  • Consistency (float)
    • Is the generated text consistent in the information it provides?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
  • Informativeness (float)
    • How well does the generated text capture the key ideas of its source text?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
  • Coherence (float)
    • How much does the generated text make sense?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
  • Relevance (float)
    • How well is the generated text relevant to its source text?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
  • Fluency (float)
    • Is the generated text well-written and grammatical?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Explanations created by the evaluator

  • llm-eval-results
    • Frame with the evaluation results.
  • llm-heatmap-leaderboard
    • Leaderboards with models and prompts by metric values.
  • work-dir-archive
    • A ZIP archive with evaluator artifacts.

Classification

Classification Evaluator

Question | Expected answer | Retrieved context | Actual answer | Constraints

Binomial and multinomial classification evaluator for LLM models and RAG systems which are used to classify data into two or more classes.

  • Compatibility: RAG and LLM models.

Method

  • The evaluator matches the expected answer (label) with the actual answer (prediction) for each test case and calculates the confusion matrix and metrics such as accuracy, precision, recall, and F1 score for each model (see the sketch below).
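
A minimal sketch with scikit-learn (macro averaging for the multinomial case is an assumption):

# Sketch: classification metrics from expected vs. actual answers.
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_fscore_support

labels = ["spam", "ham", "spam", "ham"]        # expected answers (ground truth)
predictions = ["spam", "ham", "ham", "ham"]    # actual answers from the model

cm = confusion_matrix(labels, predictions)
accuracy = accuracy_score(labels, predictions)
precision, recall, f1, _ = precision_recall_fscore_support(
    labels, predictions, average="macro", zero_division=0
)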

Metrics calculated by the evaluator

  • Accuracy (float)
    • Accuracy metric measures how often the model makes correct predictions using the formula: (True Positives + True Negatives) / Total Predictions.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
    • Primary metric.
  • Precision (float)
    • Precision metric measures the proportion of the positive predictions that were actually correct using the formula: True Positives / (True Positives + False Positives).
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
  • Recall (float)
    • Recall metric measures the proportion of the actual positive cases that were correctly predicted using the formula: True Positives / (True Positives + False Negatives).
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
  • F1 (float)
    • F1 metric measures the balance between precision and recall using the formula: 2 * (Precision * Recall) / (Precision + Recall).
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator

  • llm-eval-results
    • Frame with the evaluation results.
  • llm-classification-leaderboard
    • Leaderboards with models and prompts by metric values.
