Evaluators
This page describes the available H2O Eval Studio evaluators.
- Answer Correctness Evaluator
- Answer Semantic Similarity Evaluator
- Context Relevancy Evaluator
- Hallucination Evaluator
- RAGAS Evaluator
- Tokens Presence Evaluator
- Context Precision Evaluator
- Faithfulness Evaluator
- Context Recall Evaluator
- Answer Relevancy Evaluator
- PII Leakage Evaluator
- Sensitive Data Leakage Evaluator
- Toxicity Evaluator
- Fairness Bias Evaluator
- Contact Information Evaluator
- Language Mismatch Evaluator
- Parameterizable BYOP Evaluator
- Sexism Evaluator
- Stereotypes Evaluator
- Summarization Evaluator
- BLEU Evaluator
- ROUGE Evaluator
- Classification Evaluator
Answer Correctness Evaluator
Answer Correctness Evaluator assesses the accuracy of generated answers compared to ground truth, scoring from 0 to 1. A higher score indicates a closer alignment between the generated answer and the ground truth, signifying better correctness.
Evaluation considers semantic and factual similarities, combined using a weighted scheme for the answer correctness score.
For more information, see the page on Answer Correctness in the official Ragas documentation.
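The weighted combination can be sketched as follows. This is an illustrative reimplementation, not Eval Studio's actual code, and the default weight of 0.75 for factual similarity is an assumption:

```python
def answer_correctness(factual_score: float, semantic_score: float,
                       factual_weight: float = 0.75) -> float:
    """Combine factual and semantic similarity into one score in [0, 1].

    A higher factual_weight emphasizes factual overlap with the ground
    truth over embedding-based semantic similarity.
    """
    return factual_weight * factual_score + (1 - factual_weight) * semantic_score

# e.g. strong factual overlap, moderate semantic similarity
score = answer_correctness(0.8, 0.6)
```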
Evaluator parameters:
metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.
Explanations created by the evaluator:
llm-eval-results
- Frame with the evaluation results.
llm-heatmap-leaderboard
- Leaderboards with models and prompts by metric values.
work-dir-archive
- Zip archive with evaluator artifacts.
Answer Semantic Similarity Evaluator
Answer Semantic Similarity Evaluator assesses the semantic resemblance between the generated answer and the ground truth. The score ranges from 0 to 1, with a higher score indicating better alignment. This evaluation utilizes a cross-encoder model to calculate the semantic similarity score between the answer and expected answer (OpenAI embeddings are used by default, but the evaluator is reconfigured to a fallback embeddings provider when a custom judge is used).
Evaluator parameters:
metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.
Explanations created by the evaluator:
llm-eval-results
- Frame with the evaluation results.
llm-heatmap-leaderboard
- Leaderboards with models and prompts by metric values.
work-dir-archive
- Zip archive with evaluator artifacts.
Context Relevancy Evaluator
Context Relevancy Evaluator measures the relevancy of the retrieved context based on the question and contexts. The score ranges from 0 to 1, with higher values indicating better relevancy. The evaluation identifies relevant sentences within the retrieved context to compute the score using the formula:
ctx relevancy = (number of relevant sentences) / (total number of sentences)
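As a minimal sketch of the formula above (the relevant-sentence identification itself is done by the evaluator and is not reproduced here):

```python
def context_relevancy(relevant_sentences: int, total_sentences: int) -> float:
    """ctx relevancy = relevant sentences / total sentences, in [0, 1]."""
    if total_sentences == 0:
        return 0.0
    return relevant_sentences / total_sentences
```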
Evaluator parameters:
metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.
Explanations created by the evaluator:
llm-eval-results
- Frame with the evaluation results.
llm-heatmap-leaderboard
- Leaderboards with models and prompts by metric values.
work-dir-archive
- Zip archive with evaluator artifacts.
Hallucination Evaluator
Hallucination Evaluator assesses the hallucination of the base LLM model in a Retrieval Augmented Generation (RAG) pipeline. It evaluates whether the LLM application outputs factually correct information by comparing the actual output to the provided context. The evaluation uses the Vectara hallucination evaluation cross-encoder model to calculate a score that measures the extent of hallucination in the generated answer from the retrieved context.
Evaluator parameters:
metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.
metric
- Metric to calculate - e.g. `bias`, `toxicity`, `ragas`, `answer_relevancy`, or `hallucination`.
Explanations created by the evaluator:
llm-eval-results
- Frame with the evaluation results.
llm-heatmap-leaderboard
- Leaderboards with models and prompts by metric values.
work-dir-archive
- Zip archive with evaluator artifacts.
RAGAS Evaluator
Ragas is a framework that helps you evaluate your Retrieval Augmented Generation (RAG) pipelines. RAG refers to LLM applications that use external data to enhance the context. There are existing tools and frameworks that help you build these pipelines, but evaluating them and quantifying pipeline performance can be hard. This is where Ragas (RAG Assessment) comes in.
Ragas score is a harmonic mean of the following metrics:
Faithfulness (generation) measures the factual consistency of the generated answer with the given context. It evaluates if all claims in the answer can be inferred from the context.
Answer Relevancy (retrieval+generation) assesses the pertinence of the generated answer to the prompt. It measures how well the answer addresses the original question.
Context Precision (retrieval) evaluates if all relevant items in the ground truth are ranked higher in the retrieved context.
Context Recall (retrieval) measures the alignment between the retrieved context and the answer. It determines the extent to which the context represents the ground truth.
The RAGAS Evaluator provides a composite metric score, which is the harmonic mean of Faithfulness, Answer Relevancy, Context Precision, and Context Recall. This score represents the overall quality of the answer considering both the context and the answer itself.
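The composite score can be sketched as the harmonic mean of the four component metrics; this is an illustrative reimplementation, not the Ragas library's own code:

```python
from statistics import harmonic_mean

def ragas_score(faithfulness: float, answer_relevancy: float,
                context_precision: float, context_recall: float) -> float:
    """Harmonic mean of the four RAGAS component metrics.

    The harmonic mean penalizes any single weak component more strongly
    than an arithmetic mean would.
    """
    return harmonic_mean(
        [faithfulness, answer_relevancy, context_precision, context_recall]
    )
```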
Evaluator parameters:
metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.
Explanations created by the evaluator:
llm-eval-results
- Frame with the evaluation results.
llm-heatmap-leaderboard
- Leaderboards with models and prompts by metric values.
work-dir-archive
- Zip archive with evaluator artifacts.
Tokens Presence Evaluator
Tokens Presence Evaluator assesses whether both the retrieved context (in the case of RAG hosted models) and the generated answer contain a specified set of required strings. The evaluation is based on the match/no match of the required strings, using substring and/or regular expression-based search in the retrieved context and answer.
Constraints are defined as a list of strings, regular expressions, and lists:
- a plain string: the context and/or answer must contain the string
- a string with the `REGEXP:` prefix: the context and/or answer must match the given regular expression. Use Python regular expression notation - for example `REGEXP:^[Aa]nswer:? B$`.
- a list: the context and/or answer must contain/match at least one of the list items (each item being a string or a regular expression)
Test Suite Constraints Example 1:
"output_constraints": [
    "15,969",
    "REGEXP:[Mm]illion",
    "REGEXP:^15,969 [Mm]illion$",
    ["either", "or"]
]
The preceding constraints indicate the output:
- must contain `15,969`
- and must match regular expression `[Mm]illion`
- and must match regular expression `^15,969 [Mm]illion$`
- and must contain either `either` or `or`
Test Suite Constraints Example 2:
"output_constraints": [
    ["either", "or", "REGEXP:[Mm]illion"]
]
The preceding constraints indicate the output:
- must contain `either`, or contain `or`, or match regular expression `[Mm]illion`
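The constraint-matching scheme can be sketched in Python as follows; this is an illustrative reimplementation of the semantics described above, not Eval Studio's actual code:

```python
import re

REGEXP_PREFIX = "REGEXP:"

def matches_one(text: str, constraint: str) -> bool:
    """Match one constraint: regex search if REGEXP:-prefixed, else substring."""
    if constraint.startswith(REGEXP_PREFIX):
        return re.search(constraint[len(REGEXP_PREFIX):], text) is not None
    return constraint in text

def satisfies(text: str, constraints: list) -> bool:
    """All top-level constraints must hold; a nested list means 'any of'."""
    for constraint in constraints:
        if isinstance(constraint, list):
            if not any(matches_one(text, item) for item in constraint):
                return False
        elif not matches_one(text, constraint):
            return False
    return True
```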
Explanations created by the evaluator:
llm-eval-results
- Frame with the evaluation results.
llm-heatmap-leaderboard
- Leaderboards with models and prompts by metric values.
work-dir-archive
- Zip archive with evaluator artifacts.
Context Precision Evaluator
Context Precision Evaluator assesses the quality of the retrieved context by evaluating the order and relevance of text chunks on the context stack. The goal is to have all relevant chunks ranked higher, ideally appearing at the top of the context. The evaluation calculates a score based on the presence of the expected answer (ground truth) in the text chunks at the top of the retrieved context chunk stack. Irrelevant chunks, deep stacks, and unnecessarily large context decrease the score.
Evaluator parameters:
metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.
Explanations created by the evaluator:
llm-eval-results
- Frame with the evaluation results.
llm-heatmap-leaderboard
- Leaderboards with models and prompts by metric values.
work-dir-archive
- Zip archive with evaluator artifacts.
Faithfulness Evaluator
Faithfulness Evaluator measures the factual consistency of the generated answer with the given context. It is calculated based on the answer and retrieved context, with a higher score indicating better consistency. The evaluation assesses whether the claims made in the answer can be inferred from the context, avoiding any hallucinations. The score is determined by the ratio of the answer’s claims present in the context to the total number of claims in the answer (number of claims inferable from the context / claims in the answer).
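The ratio described above can be sketched as follows (claim extraction and inference checking are performed by the evaluator's judge and are not reproduced here):

```python
def faithfulness(claims_supported_by_context: int, total_claims: int) -> float:
    """Faithfulness = claims inferable from the context / claims in the answer."""
    if total_claims == 0:
        return 0.0
    return claims_supported_by_context / total_claims
```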
Evaluator parameters:
metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.
Explanations created by the evaluator:
llm-eval-results
- Frame with the evaluation results.
llm-heatmap-leaderboard
- Leaderboards with models and prompts by metric values.
work-dir-archive
- Zip archive with evaluator artifacts.
Context Recall Evaluator
Context Recall Evaluator measures the alignment between the retrieved context and the answer (ground truth). It is computed based on the ground truth and the retrieved context, with a higher score indicating better alignment. The evaluation analyzes each sentence in the ground truth answer to determine if it can be attributed to the retrieved context. The score is calculated as the ratio of the number of sentences in the ground truth that can be attributed to the context to the total number of sentences in the ground truth.
Score formula:
ctx recall = (answer sentences that can be attributed to context) / (answer sentences count)
Evaluator parameters:
metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.
Explanations created by the evaluator:
llm-eval-results
- Frame with the evaluation results.
llm-heatmap-leaderboard
- Leaderboards with models and prompts by metric values.
work-dir-archive
- Zip archive with evaluator artifacts.
Answer Relevancy Evaluator
Answer Relevancy (retrieval+generation) evaluator assesses how pertinent the generated answer is to the given prompt; the higher the score, the better. A lower score indicates answers that are incomplete or contain redundant information. This metric is computed using the question and the answer. An answer is deemed relevant when it directly and appropriately addresses the original question. To calculate this score, the LLM is prompted multiple times to generate an appropriate question for the generated answer, and the mean cosine similarity of the generated questions with the original question is measured.
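The final aggregation step can be sketched as follows, assuming the question embeddings have already been computed (the LLM-generated questions and the embedding model are outside this sketch):

```python
import math

def cosine(u: list, v: list) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def answer_relevancy(original_question_vec: list,
                     generated_question_vecs: list) -> float:
    """Mean cosine similarity between the original question embedding and
    the embeddings of questions generated from the answer."""
    sims = [cosine(original_question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)
```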
Evaluator parameters:
metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.
Explanations created by the evaluator:
llm-eval-results
- Frame with the evaluation results.
llm-heatmap-leaderboard
- Leaderboards with models and prompts by metric values.
work-dir-archive
- Zip archive with evaluator artifacts.
PII Leakage Evaluator
PII Leakage Evaluator checks for potential personally identifiable information (PII) leakages in the text generated by LLM/RAG models. It assesses whether the generated answer contains PII such as credit card numbers, phone numbers, social security numbers (SSN), street addresses, email addresses, and employee names. The evaluation utilizes a regex suite that can quickly and reliably detect formatted PII, including credit card numbers, SSNs, and emails. In the future, additional models may be added to detect addresses, names, and other types of PII.
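A regex-based detection pass can be sketched as follows. The patterns below are simplified illustrations; the actual Eval Studio regex suite is more extensive and more precise:

```python
import re

# Illustrative patterns only -- real PII detection needs stricter validation
# (e.g. Luhn checks for card numbers).
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def find_pii(text: str) -> dict:
    """Return the PII categories detected in the text with their matches."""
    return {name: pattern.findall(text)
            for name, pattern in PII_PATTERNS.items()
            if pattern.search(text)}
```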
Explanations created by the evaluator:
llm-eval-results
- Frame with the evaluation results.
llm-bool-leaderboard
- LLM failure leaderboard with data and formats for boolean metrics.
work-dir-archive
- Zip archive with evaluator artifacts.
Sensitive Data Leakage Evaluator
Sensitive Data Leakage Evaluator checks for potential leakages of security-related and/or sensitive data in the text generated by LLM/RAG models. It assesses whether the generated answer contains security-related information such as activation keys, passwords, API keys, tokens, or certificates. The evaluation utilizes a regex suite that can quickly and reliably detect formatted sensitive data, including certificates in SSL/TLS PEM format, API keys for H2O.ai and OpenAI, and activation keys for Windows.
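A sketch of the same regex-based approach for sensitive data follows. The patterns are illustrative assumptions, not Eval Studio's actual detection suite:

```python
import re

# Illustrative patterns only; a production suite covers many more key formats.
SENSITIVE_PATTERNS = {
    "pem_block": re.compile(r"-----BEGIN (?:CERTIFICATE|PRIVATE KEY)-----"),
    "openai_api_key": re.compile(r"\bsk-[A-Za-z0-9]{20,}\b"),
}

def find_sensitive(text: str) -> list:
    """Return the names of sensitive-data categories detected in the text."""
    return [name for name, pattern in SENSITIVE_PATTERNS.items()
            if pattern.search(text)]
```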
Explanations created by the evaluator:
llm-eval-results
- Frame with the evaluation results.
llm-bool-leaderboard
- LLM failure leaderboard with data and formats for boolean metrics.
work-dir-archive
- Zip archive with evaluator artifacts.
Toxicity Evaluator
Toxicity evaluator assesses the level of toxicity in the text generated by large language models (LLMs). LLMs can generate human-quality text, but they can also be prone to generating toxic content, such as hate speech, offensive language, and discriminatory language. The value of the LLM toxicity evaluator is twofold: it can help ensure that LLMs are not used to generate toxic content that could harm individuals or groups, and it can help improve the accuracy and reliability of LLMs by identifying and mitigating the generation of toxic content.
Toxicity evaluator calculates the following metrics, each scored from 0 (not present) to 1 (severe):
toxicity
- Overall toxicity of the generated text.
severe_toxicity
- Severe toxicity of the generated text.
obscene
- Obscenity of the generated text.
threat
- Threatening language in the generated text.
insult
- Insulting language in the generated text.
identity_attack
- Identity-based hate in the generated text.
See also: https://github.com/unitaryai/detoxify
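Given per-metric scores (e.g. as produced by a Detoxify-style model), threshold-based problem reporting can be sketched as follows; the default threshold of 0.5 is an assumption for illustration:

```python
def flag_toxicity(scores: dict, metric_threshold: float = 0.5) -> list:
    """Return the names of toxicity metrics whose score exceeds the threshold."""
    return [name for name, value in scores.items() if value > metric_threshold]

# Example: flag the metrics that would be reported as problems
flags = flag_toxicity({"toxicity": 0.91, "threat": 0.02, "insult": 0.64})
```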
Evaluator parameters:
metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.
Explanations created by the evaluator:
llm-eval-results
- Frame with the evaluation results.
llm-heatmap-leaderboard
- Leaderboards with models and prompts by metric values.
work-dir-archive
- Zip archive with evaluator artifacts.
Fairness Bias Evaluator
Fairness bias evaluator assesses whether the LLM/RAG output contains gender, racial, or political bias. This information can then be used to improve the development and deployment of LLMs/RAGs by identifying and mitigating potential biases. A high score indicates strong fairness bias.
See also: https://huggingface.co/d4data/bias-detection-model
Evaluator parameters:
metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.
Explanations created by the evaluator:
llm-eval-results
- Frame with the evaluation results.
llm-heatmap-leaderboard
- Leaderboards with models and prompts by metric values.
work-dir-archive
- Zip archive with evaluator artifacts.
Contact Information Evaluator
Contact Information Evaluator checks for potential leakages of contact information in the text generated by LLM/RAG models. It assesses whether the generated answer contains contact information such as names, addresses, phone numbers, medical information, user names, and emails.
Explanations created by the evaluator:
llm-eval-results
- Frame with the evaluation results.
llm-bool-leaderboard
- LLM failure leaderboard with data and formats for boolean metrics.
work-dir-archive
- Zip archive with evaluator artifacts.
Language Mismatch Evaluator
Language mismatch evaluator determines whether the user input and the LLM output are in the same language.
Explanations created by the evaluator:
llm-eval-results
- Frame with the evaluation results.
llm-bool-leaderboard
- LLM failure leaderboard with data and formats for boolean metrics.
work-dir-archive
- Zip archive with evaluator artifacts.
Parameterizable BYOP Evaluator
Bring Your Own Prompt (BYOP) evaluator uses a user-supplied prompt and a judge to evaluate LLM or RAG output. The current BYOP implementation supports only binary problems, so the prompt must guide the judge to output either "true" or "false".
Explanations created by the evaluator:
llm-eval-results
- Frame with the evaluation results.
llm-bool-leaderboard
- LLM failure leaderboard with data and formats for boolean metrics.
work-dir-archive
- Zip archive with evaluator artifacts.
Sexism Evaluator
Sexism evaluator evaluates input and LLM output to find possible instances of sexism.
Explanations created by the evaluator:
llm-eval-results
- Frame with the evaluation results.
llm-bool-leaderboard
- LLM failure leaderboard with data and formats for boolean metrics.
work-dir-archive
- Zip archive with evaluator artifacts.
Stereotypes Evaluator
Stereotypes evaluator tries to determine whether the LLM output contains stereotypes - it assesses whether the answer adds information about gender or race that has no reference in the prompt.
Explanations created by the evaluator:
llm-eval-results
- Frame with the evaluation results.
llm-bool-leaderboard
- LLM failure leaderboard with data and formats for boolean metrics.
work-dir-archive
- Zip archive with evaluator artifacts.
Summarization Evaluator
Summarization evaluator uses a judge to assess the quality of the summary made by the evaluated model.
Explanations created by the evaluator:
llm-eval-results
- Frame with the evaluation results.
llm-bool-leaderboard
- LLM failure leaderboard with data and formats for boolean metrics.
work-dir-archive
- Zip archive with evaluator artifacts.
BLEU Evaluator
BLEU (Bilingual Evaluation Understudy) measures the quality of machine-generated texts by comparing them to reference texts. BLEU calculates a score between 0 and 1, where a higher score indicates a better match with the reference text.
BLEU is based on the concept of n-grams, which are contiguous sequences of words. The different variations of BLEU such as BLEU-1, BLEU-2, BLEU-3, and BLEU-4 differ in the size of the n-grams considered for evaluation.
BLEU-n measures the precision of n-grams (n consecutive words) in the generated text compared to the reference text. It calculates the precision score by counting the number of overlapping n-grams and dividing it by the total number of n-grams in the generated text.
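The BLEU-n precision component can be sketched in pure Python; this illustrative version uses a single reference and omits the brevity penalty of full BLEU:

```python
from collections import Counter

def ngram_precision(candidate: list, reference: list, n: int) -> float:
    """BLEU-n modified precision: clipped overlapping n-grams divided by the
    total number of n-grams in the candidate."""
    cand = Counter(tuple(candidate[i:i + n])
                   for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n])
                  for i in range(len(reference) - n + 1))
    # Clip each candidate n-gram count by its count in the reference.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0
```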
Explanations created by the evaluator:
llm-eval-results
- Frame with the evaluation results.
llm-heatmap-leaderboard
- Leaderboards with models and prompts by metric values.
ROUGE Evaluator
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of evaluation metrics used to assess the quality of generated summaries compared to reference summaries. There are several variations of ROUGE metrics, including ROUGE-1, ROUGE-2, and ROUGE-L.
This evaluator reports F1 score between the generated and reference n-grams.
ROUGE-1 measures the overlap of 1-grams (individual words) between the generated and the reference summaries.
ROUGE-2 extends the evaluation to 2-grams (pairs of consecutive words).
ROUGE-L considers the longest common subsequence (LCS) between the generated and reference summaries.
These ROUGE metrics provide a quantitative evaluation of the similarity between the generated and reference texts to assess the effectiveness of text summarization algorithms.
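The reported F1 score for ROUGE-N can be sketched as follows; this is an illustrative pure-Python version (ROUGE-L, which uses the longest common subsequence, is not shown):

```python
from collections import Counter

def rouge_n_f1(candidate: list, reference: list, n: int = 1) -> float:
    """ROUGE-N F1: harmonic mean of n-gram precision and recall."""
    cand = Counter(tuple(candidate[i:i + n])
                   for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n])
                  for i in range(len(reference) - n + 1))
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    if not overlap:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```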
Evaluator parameters:
metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.
Explanations created by the evaluator:
llm-eval-results
- Frame with the evaluation results.
llm-heatmap-leaderboard
- Leaderboards with models and prompts by metric values.
Classification Evaluator
Binomial and multinomial classification evaluator for LLM models and RAG systems which are used to classify data into two or more classes. The evaluator calculates the confusion matrix and metrics such as accuracy, precision, recall, and F1 score for each model.
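For the binary case, the confusion-matrix counts and the derived metrics can be sketched as follows (an illustrative reimplementation, with labels assumed to be 0/1):

```python
def binary_classification_metrics(y_true: list, y_pred: list) -> dict:
    """Confusion counts and accuracy/precision/recall/F1 for a binary task."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```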
Evaluator parameters:
metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.
Explanations created by the evaluator:
llm-eval-results
- Frame with the evaluation results.
llm-heatmap-leaderboard
- Leaderboards with models and prompts by metric values.