
Evaluators

This page describes the available H2O Eval Studio evaluators.

Answer Correctness Evaluator

Answer Correctness Evaluator assesses the accuracy of generated answers compared to ground truth, scoring from 0 to 1. A higher score indicates a closer alignment between the generated answer and the ground truth, signifying better correctness.

The evaluation considers both semantic and factual similarity, which are combined with a weighted scheme to produce the answer correctness score.
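
As an illustration of the weighting idea, here is a minimal sketch; the weights and the combination scheme are assumptions, not the exact Ragas implementation:

# Illustrative sketch only - the weights and the combination scheme are assumptions,
# not the exact Ragas implementation.
def answer_correctness(factual: float, semantic: float, weights=(0.75, 0.25)) -> float:
    """Combine factual and semantic similarity (both in [0, 1]) into one score."""
    w_factual, w_semantic = weights
    return w_factual * factual + w_semantic * semantic

print(answer_correctness(factual=0.9, semantic=0.6))  # 0.825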

For more information, see the page on Answer Correctness in the official Ragas documentation.

Evaluator parameters:

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator:

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Answer Semantic Similarity Evaluator

Answer Semantic Similarity Evaluator assesses the semantic resemblance between the generated answer and the ground truth. The score ranges from 0 to 1, with a higher score indicating better alignment. The evaluation uses a cross-encoder model to calculate the semantic similarity score between the answer and the expected answer (OpenAI embeddings are used by default; the evaluator switches to a fallback embeddings provider when a custom judge is used).
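
A minimal sketch of the similarity computation, using the sentence-transformers library and a local embedding model as a stand-in for the default OpenAI embeddings:

# Sketch: cosine similarity between embeddings of the answer and the expected answer.
# The model name is an assumption; the evaluator uses OpenAI embeddings by default.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
answer = "The Eiffel Tower is located in Paris."
expected = "The Eiffel Tower stands in Paris, France."
embeddings = model.encode([answer, expected])
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()  # roughly in [-1, 1]
print(round(similarity, 3))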

Evaluator parameters:

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator:

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Context Relevancy Evaluator

Context Relevancy Evaluator measures the relevancy of the retrieved context based on the question and contexts. The score ranges from 0 to 1, with higher values indicating better relevancy. The evaluation identifies relevant sentences within the retrieved context to compute the score using the formula:

ctx relevancy = (number of relevant sentences) / (total number of sentences)
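
A minimal sketch of the scoring step, assuming the relevant sentences have already been identified (in practice, an LLM judge selects them from the retrieved context):

# Sketch: the relevancy score is the fraction of context sentences judged relevant.
def context_relevancy(relevant_sentences: list[str], all_sentences: list[str]) -> float:
    if not all_sentences:
        return 0.0
    return len(relevant_sentences) / len(all_sentences)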

Evaluator parameters:

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator:

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Hallucination Evaluator

Hallucination Evaluator assesses the hallucination of the base LLM model in a Retrieval Augmented Generation (RAG) pipeline. It evaluates whether the LLM application outputs factually correct information by comparing the actual output to the provided context. The evaluation uses the Vectara hallucination evaluation cross-encoder model to calculate a score that measures the extent of hallucination in the generated answer from the retrieved context.
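
A minimal sketch of how such a score can be obtained with the publicly available Vectara model, assuming the sentence-transformers CrossEncoder interface; newer model versions may require different loading options (for example trust_remote_code):

# Sketch: score (context, answer) pairs with the Vectara hallucination evaluation model.
# Higher scores indicate the answer is consistent with the context; lower scores
# suggest hallucination. Exact loading options may differ by model version.
from sentence_transformers import CrossEncoder

model = CrossEncoder("vectara/hallucination_evaluation_model")
pairs = [
    ("The capital of France is Paris.", "Paris is the capital of France."),
    ("The capital of France is Paris.", "The capital of France is Berlin."),
]
scores = model.predict(pairs)
print(scores)  # one consistency score per (context, answer) pair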

Evaluator parameters:

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

  • metric
    • Metric to calculate - e.g. `bias`, `toxicity`, `ragas`, `answer_relevancy`, or `hallucination`.

Explanations created by the evaluator:

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

RAGAS Evaluator

Ragas is a framework that helps you evaluate your Retrieval Augmented Generation (RAG) pipelines. RAG refers to LLM applications that use external data to enhance the context. There are existing tools and frameworks that help you build these pipelines, but evaluating them and quantifying pipeline performance can be hard. This is where Ragas (RAG Assessment) comes in.

The Ragas score is the harmonic mean of the following metrics:

Faithfulness (generation) measures the factual consistency of the generated answer with the given context. It evaluates if all claims in the answer can be inferred from the context.

Answer Relevancy (retrieval+generation) assesses the pertinence of the generated answer to the prompt. It measures how well the answer addresses the original question.

Context Precision (retrieval) evaluates whether all of the ground-truth relevant items in the retrieved context are ranked near the top.

Context Recall (retrieval) measures the alignment between the retrieved context and the answer. It determines the extent to which the context represents the ground truth.

The RAGAS Evaluator provides a composite metric score, which is the harmonic mean of Faithfulness, Answer Relevancy, Context Precision, and Context Recall. This score represents the overall quality of the answer considering both the context and the answer itself.
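
A minimal sketch of the composite score, assuming the four metric values have already been computed; the numbers are illustrative:

# Sketch: the composite Ragas score as the harmonic mean of the four metrics.
from statistics import harmonic_mean

metrics = {
    "faithfulness": 0.9,
    "answer_relevancy": 0.8,
    "context_precision": 0.7,
    "context_recall": 0.85,
}
ragas_score = harmonic_mean(metrics.values())
print(round(ragas_score, 3))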

Evaluator parameters:

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator:

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Tokens Presence Evaluator

Tokens Presence Evaluator assesses whether both the retrieved context (in the case of RAG hosted models) and the generated answer contain a specified set of required strings. The evaluation is based on the match/no match of the required strings, using substring and/or regular expression-based search in the retrieved context and answer.

Constraints are defined as a list of strings, regular expressions, and lists:

  • in the case of a string, the context and/or answer must contain the string

  • in the case of a string with the REGEXP: prefix, the context and/or answer is checked against the given regular expression. Use Python regular expression notation - for example REGEXP:^[Aa]nswer:? B$.

  • in the case of a list, the context and/or answer must contain or match at least one of the list items (be it a string or a regular expression)

Test Suite Constraints Example 1:

"output_constraints": [
["either", "or", "REGEXP:[Mm]illion"]
]

The preceding constraints indicate the following:

  1. must contain 15,969 and

  2. must match the regular expression [Mm]illion and

  3. must match the regular expression ^15,969 [Mm]illion$ and

  4. must contain either `either` or `or`

Test Suite Constraints Example 2:

"output_constraints": [
["either", "or", "REGEXP:[Mm]illion"]
]

The preceding constraints indicate the following:

  1. must contain `either` or `or`, or match the regular expression [Mm]illion
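
A minimal sketch of how constraints with these semantics can be checked against a piece of text; the helper below is hypothetical, not the evaluator's actual implementation:

# Sketch: check a text (context or answer) against the constraint types described above.
# Hypothetical helper, not the evaluator's actual implementation.
import re

def satisfies(text: str, constraint) -> bool:
    if isinstance(constraint, list):  # list: at least one alternative must match
        return any(satisfies(text, item) for item in constraint)
    if constraint.startswith("REGEXP:"):  # regular expression constraint
        return re.search(constraint[len("REGEXP:"):], text) is not None
    return constraint in text  # plain string: substring match

constraints = ["15,969", "REGEXP:[Mm]illion", ["either", "or"]]
text = "Revenue was 15,969 million dollars, or roughly 16 billion."
print(all(satisfies(text, c) for c in constraints))  # True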

Explanations created by the evaluator:

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Context Precision Evaluator

Context Precision Evaluator assesses the quality of the retrieved context by evaluating the order and relevance of text chunks on the context stack. The goal is to have all relevant chunks ranked higher, ideally appearing at the top of the context. The evaluation calculates a score based on the presence of the expected answer (ground truth) in the text chunks at the top of the retrieved context chunk stack. Irrelevant chunks, deep stacks, and unnecessarily large context decrease the score.
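
A minimal sketch of one common formulation of this idea (mean precision@k over the relevant positions, similar to the Ragas context precision metric); the relevance flags are assumed to come from checking chunks against the expected answer:

# Sketch: reward relevant chunks that appear near the top of the context stack.
# relevance[i] is True when the i-th retrieved chunk supports the expected answer.
def context_precision(relevance: list[bool]) -> float:
    if not any(relevance):
        return 0.0
    precisions_at_relevant = []
    hits = 0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precisions_at_relevant.append(hits / k)  # precision@k at this position
    return sum(precisions_at_relevant) / len(precisions_at_relevant)

print(context_precision([True, False, True, False]))  # relevant chunks ranked high score well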

Evaluator parameters:

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator:

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Faithfulness Evaluator

Faithfulness Evaluator measures the factual consistency of the generated answer with the given context. It is calculated based on the answer and retrieved context, with a higher score indicating better consistency. The evaluation assesses whether the claims made in the answer can be inferred from the context, avoiding any hallucinations. The score is the ratio of the answer's claims that can be inferred from the context to the total number of claims in the answer (claims inferable from the context / total claims in the answer).
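
A minimal sketch of the final scoring step, assuming the claims have already been extracted from the answer and checked against the context (in practice an LLM judge does both steps):

# Sketch: faithfulness as the fraction of answer claims supported by the context.
def faithfulness(claims_supported_by_context: int, total_claims: int) -> float:
    if total_claims == 0:
        return 0.0
    return claims_supported_by_context / total_claims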

Evaluator parameters:

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator:

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Context Recall Evaluator

Context Recall Evaluator measures the alignment between the retrieved context and the answer (ground truth). It is computed based on the ground truth and the retrieved context, with a higher score indicating better alignment. The evaluation analyzes each sentence in the ground truth answer to determine if it can be attributed to the retrieved context. The score is calculated as the ratio of the number of sentences in the ground truth that can be attributed to the context to the total number of sentences in the ground truth.

Score formula:

ctx recall = (ground truth sentences attributable to the context) / (total ground truth sentences)
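
A minimal sketch of the scoring step, assuming each ground truth sentence has already been classified as attributable to the retrieved context or not (in practice an LLM judge does the classification):

# Sketch: context recall as the fraction of ground truth sentences attributable to the context.
def context_recall(attributable: list[bool]) -> float:
    if not attributable:
        return 0.0
    return sum(attributable) / len(attributable)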

Evaluator parameters:

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator:

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Answer Relevancy Evaluator

Answer Relevancy (retrieval+generation) evaluator assesses how pertinent the generated answer is to the given prompt. A lower score indicates answers that are incomplete or contain redundant information; higher is better. The metric is computed from the question and the answer. An answer is deemed relevant when it directly and appropriately addresses the original question. To calculate the score, the LLM is prompted multiple times to generate an appropriate question for the generated answer, and the mean cosine similarity between these generated questions and the original question is measured.
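
A minimal sketch of the similarity step, assuming the judge has already produced several candidate questions for the answer and using a local embedding model as a stand-in for the default embeddings:

# Sketch: mean cosine similarity between the original question and questions
# generated from the answer. The model name and generated questions are illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
original_question = "Where is the Eiffel Tower located?"
generated_questions = [
    "In which city does the Eiffel Tower stand?",
    "Where can the Eiffel Tower be found?",
    "What country is the Eiffel Tower in?",
]
q_emb = model.encode(original_question)
g_emb = model.encode(generated_questions)
similarities = util.cos_sim(q_emb, g_emb)[0]  # one similarity per generated question
answer_relevancy = float(similarities.mean())
print(round(answer_relevancy, 3))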

Evaluator parameters:

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator:

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

PII Leakage Evaluator

PII Leakage Evaluator checks for potential personally identifiable information (PII) leakages in the text generated by LLM/RAG models. It assesses whether the generated answer contains PII such as credit card numbers, phone numbers, social security numbers (SSN), street addresses, email addresses, and employee names. The evaluation utilizes a regex suite that can quickly and reliably detect formatted PII, including credit card numbers, SSNs, and emails. In the future, additional models may be added to detect addresses, names, and other types of PII.
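
A simplified sketch of the regex-based approach; the patterns below are illustrative and far less thorough than the evaluator's actual regex suite:

# Sketch: simplified regex checks for a few PII types. Illustrative patterns only.
import re

PII_PATTERNS = {
    "email": r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
    "credit_card": r"\b(?:\d[ -]?){13,16}\b",
}

def detect_pii(text: str) -> dict[str, list[str]]:
    return {name: re.findall(pattern, text) for name, pattern in PII_PATTERNS.items()}

print(detect_pii("Contact jane.doe@example.com, SSN 123-45-6789."))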

Explanations created by the evaluator:

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-bool-leaderboard

    • LLM failure leaderboard with data and formats for boolean metrics.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Sensitive Data Leakage Evaluator

Sensitive Data Leakage Evaluator checks for potential leakages of security-related and/or sensitive data in the text generated by LLM/RAG models. It assesses whether the generated answer contains security-related information such as activation keys, passwords, API keys, tokens, or certificates. The evaluation utilizes a regex suite that can quickly and reliably detect formatted sensitive data, including certificates in SSL/TLS PEM format, API keys for H2O.ai and OpenAI, and activation keys for Windows.
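
A simplified sketch of the same regex-based approach for sensitive data; the patterns below are illustrative assumptions, not the evaluator's actual suite:

# Sketch: simplified regex checks for a few sensitive-data formats. Illustrative only.
import re

SENSITIVE_PATTERNS = {
    "pem_block": r"-----BEGIN [A-Z ]+-----",       # e.g. certificates and private keys
    "openai_api_key": r"\bsk-[A-Za-z0-9]{20,}\b",  # assumed key format
}

def detect_sensitive(text: str) -> dict[str, list[str]]:
    return {name: re.findall(pattern, text) for name, pattern in SENSITIVE_PATTERNS.items()}

print(detect_sensitive("-----BEGIN CERTIFICATE----- MIIC..."))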

Explanations created by the evaluator:

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-bool-leaderboard

    • LLM failure leaderboard with data and formats for boolean metrics.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Toxicity Evaluator

Toxicity evaluator assesses the level of toxicity in the text generated by large language models (LLMs). LLMs can generate human-quality text, but they can also be prone to generating toxic content, such as hate speech, offensive language, and discriminatory language. The value of an LLM toxicity evaluator is twofold: it can help ensure that LLMs are not used to generate toxic content that could harm individuals or groups, and it can also help improve the accuracy and reliability of LLMs by identifying and mitigating the generation of toxic content.

Toxicity evaluator calculates the following metrics:

toxicity - The toxicity score of the generated text. The toxicity score is a value between 0 and 1, where 0 indicates that the text is not toxic and 1 indicates that the text is highly toxic.

severe_toxicity - The severe toxicity score of the generated text. The severe toxicity score is a value between 0 and 1, where 0 indicates that the text is not severely toxic and 1 indicates that the text is highly severely toxic.

obscene - The obscenity score of the generated text. The obscenity score is a value between 0 and 1, where 0 indicates that the text is not obscene and 1 indicates that the text is highly obscene.

threat - The threat score of the generated text. The threat score is a value between 0 and 1, where 0 indicates that the text is not threatening and 1 indicates that the text is highly threatening.

insult - The insult score of the generated text. The insult score is a value between 0 and 1, where 0 indicates that the text is not insulting and 1 indicates that the text is highly insulting.

identity_attack - The identity hate score of the generated text. The identity hate score is a value between 0 and 1, where 0 indicates that the text is not hateful and 1 indicates that the text is highly hateful.

See also: https://github.com/unitaryai/detoxify
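
The metric names above correspond to the Detoxify model outputs; a minimal sketch of a Detoxify call (assuming the detoxify package is installed):

# Sketch: score a generated text with Detoxify; returns the per-metric scores listed above.
from detoxify import Detoxify

results = Detoxify("original").predict("You are a wonderful person.")
for metric, score in results.items():
    print(f"{metric}: {score:.4f}")  # toxicity, severe_toxicity, obscene, threat, insult, identity_attack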

Evaluator parameters:

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator:

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Fairness Bias Evaluator

Fairness bias evaluator assesses whether the LLM/RAG output contains gender, racial, or political bias. This information can then be used to improve the development and deployment of LLMs/RAGs by identifying and mitigating potential biases. A high score indicates a high degree of bias.

See also: https://huggingface.co/d4data/bias-detection-model

Evaluator parameters:

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator:

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Contact Information Evaluator

Contact Information Evaluator checks for potential leakages of contact information in the text generated by LLM/RAG models. It assesses whether the generated answer contains contact information such as names, addresses, phone numbers, medical information, user names, and emails.

Explanations created by the evaluator:

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-bool-leaderboard

    • LLM failure leaderboard with data and formats for boolean metrics.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Language Mismatch Evaluator

Language mismatch evaluator tries to determine whether the language of the user input and the LLM output is the same.

Explanations created by the evaluator:

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-bool-leaderboard

    • LLM failure leaderboard with data and formats for boolean metrics.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Parameterizable BYOP Evaluator

Bring Your Own Prompt (BYOP) evaluator uses a user-supplied prompt and a judge to evaluate LLM or RAG output. The current BYOP implementation supports only binary problems, so the prompt has to guide the judge to output either "true" or "false".

Explanations created by the evaluator:

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-bool-leaderboard

    • LLM failure leaderboard with data and formats for boolean metrics.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Sexism Evaluator

Sexism evaluator analyzes the input and LLM output to find possible instances of sexism.

Explanations created by the evaluator:

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-bool-leaderboard

    • LLM failure leaderboard with data and formats for boolean metrics.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Stereotypes Evaluator

Stereotypes evaluator tries to guess whether the LLM output contains stereotypes - it assesses whether the answer adds information about gender or race that has no reference in the prompt.

Explanations created by the evaluator:

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-bool-leaderboard

    • LLM failure leaderboard with data and formats for boolean metrics.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Summarization Evaluator

Summarization evaluator uses a judge to assess the quality of the summary made by the evaluated model.

Explanations created by the evaluator:

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-bool-leaderboard

    • LLM failure leaderboard with data and formats for boolean metrics.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

BLEU Evaluator

BLEU (Bilingual Evaluation Understudy) measures the quality of machine-generated texts by comparing them to reference texts. BLEU calculates a score between 0 and 1, where a higher score indicates a better match with the reference text.

BLEU is based on the concept of n-grams, which are contiguous sequences of words. The different variations of BLEU such as BLEU-1, BLEU-2, BLEU-3, and BLEU-4 differ in the size of the n-grams considered for evaluation.

BLEU-n measures the precision of n-grams (n consecutive words) in the generated text compared to the reference text. It calculates the precision score by counting the number of overlapping n-grams and dividing it by the total number of n-grams in the generated text.
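
A minimal sketch of a BLEU computation with NLTK, as a stand-in for the evaluator's internal implementation:

# Sketch: BLEU-1 through BLEU-4 with NLTK; tokenization here is naive whitespace splitting.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()
smoothing = SmoothingFunction().method1  # avoid zero scores when higher-order n-grams are missing

for n in range(1, 5):
    weights = tuple(1.0 / n if i < n else 0.0 for i in range(4))
    score = sentence_bleu([reference], candidate, weights=weights, smoothing_function=smoothing)
    print(f"BLEU-{n}: {score:.3f}")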

Explanations created by the evaluator:

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.

ROUGE Evaluator

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of evaluation metrics used to assess the quality of generated summaries compared to reference summaries. There are several variations of ROUGE metrics, including ROUGE-1, ROUGE-2, and ROUGE-L.

This evaluator reports F1 score between the generated and reference n-grams.

ROUGE-1 measures the overlap of 1-grams (individual words) between the generated and the reference summaries.

ROUGE-2 extends the evaluation to 2-grams (pairs of consecutive words).

ROUGE-L considers the longest common subsequence (LCS) between the generated and reference summaries.

These ROUGE metrics provide a quantitative evaluation of the similarity between the generated and reference texts to assess the effectiveness of text summarization algorithms.
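
A minimal sketch of these metrics using the rouge-score package, as a stand-in for the evaluator's internal implementation:

# Sketch: ROUGE-1, ROUGE-2, and ROUGE-L F1 scores between a reference and a generated summary.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "The cat sat on the mat near the window."
generated = "A cat was sitting on the mat by the window."
scores = scorer.score(reference, generated)
for name, score in scores.items():
    print(f"{name}: F1={score.fmeasure:.3f}")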

Evaluator parameters:

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator:

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.

Classification Evaluator

Binomial and multinomial classification evaluator for LLM models and RAG systems that are used to classify data into two or more classes. The evaluator calculates the confusion matrix and metrics such as accuracy, precision, recall, and F1 score for each model.
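
A minimal sketch of these metrics with scikit-learn, assuming the models' answers have already been mapped to class labels:

# Sketch: confusion matrix and per-model metrics for labels extracted from LLM/RAG answers.
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_fscore_support

expected = ["positive", "negative", "positive", "neutral", "negative"]
predicted = ["positive", "negative", "neutral", "neutral", "positive"]

print(confusion_matrix(expected, predicted, labels=["positive", "neutral", "negative"]))
accuracy = accuracy_score(expected, predicted)
precision, recall, f1, _ = precision_recall_fscore_support(
    expected, predicted, average="macro", zero_division=0
)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")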

Evaluator parameters:

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator:

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
