Evaluators

This page describes the available H2O Eval Studio evaluators.

Evaluators overview

The following evaluators are available (the legend below explains the LLM, RAG, J, Q, EA, RC, AA, and C columns of the overview table):

  • Answer correctness
  • Answer relevancy
  • Answer relevancy (sentence s.)
  • Answer semantic similarity
  • BLEU
  • Classification
  • Contact information leakage
  • Context precision
  • Context relevancy
  • Context relevancy (s.r. & p.)
  • Context recall
  • Faithfulness
  • Fairness bias
  • Machine Translation (GPTScore)
  • Question Answering (GPTScore)
  • Summarization with ref. s.
  • Summarization without ref. s.
  • Groundedness
  • Hallucination
  • Language mismatch (Judge)
  • BYOP: Bring your own prompt
  • PII leakage
  • Perplexity
  • ROUGE
  • Ragas
  • Summarization (c. and f.)
  • Sexism (Judge)
  • Sensitive data leakage
  • Stereotypes (Judge)
  • Summarization (Judge)
  • Toxicity
  • Tokens presence

Legend:

  • LLM: evaluates Large Language Model (LLM) models.
  • RAG: evaluates Retrieval Augmented Generation (RAG) models.
  • J: evaluator requires an LLM judge.
  • Q: evaluator requires question (prompt).
  • EA: evaluator requires expected answer (ground truth).
  • RC: evaluator requires retrieved context.
  • AA: evaluator requires actual answer.
  • C: evaluator requires constraints.

Generation

Answer Correctness Evaluator

Answer Correctness Evaluator assesses the accuracy of generated answers compared to ground truth. A higher score indicates a closer alignment between the generated answer and the expected answer (ground truth), signifying better correctness.

  • Two weighted metrics + LLM judge.
  • Compatibility: RAG and LLM evaluation.
  • Based on RAGAs library

Method

  • This evaluator measures answer correctness compared to ground truth as a weighted average of factuality and semantic similarity.

  • Default weights are 0.75 for factuality and 0.25 for semantic similarity.

  • The semantic similarity metric is evaluated using the Answer Semantic Similarity Evaluator.

  • Factuality is evaluated as the F1 score of the LLM judge's answers: the judge's prompt analyzes the actual answer for statements and, for each statement, checks its presence in the expected answer:

    • TP (true positive): statements present in both the actual and expected answers.

    • FP (false positive): statements present in the actual answer only.

    • FN (false negative): statements present in the expected answer only.

  • F1 score quantifies correctness based on the number of statements in each of the lists above:

F1 score = |TP| / (|TP| + 0.5 * (|FP| + |FN|))
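
As an illustration, the following minimal Python sketch (not the evaluator's actual implementation; the TP/FP/FN statement counts and the semantic similarity score are assumed to come from the LLM judge and the Answer Semantic Similarity Evaluator) combines the F1-based factuality score with semantic similarity using the default weights:

def factuality_f1(tp: int, fp: int, fn: int) -> float:
    """F1 score over the judge-extracted statement counts (TP/FP/FN)."""
    denominator = tp + 0.5 * (fp + fn)
    return tp / denominator if denominator else 0.0

def answer_correctness(tp: int, fp: int, fn: int, semantic_similarity: float,
                       factuality_weight: float = 0.75,
                       similarity_weight: float = 0.25) -> float:
    """Weighted average of factuality (F1) and semantic similarity."""
    return (factuality_weight * factuality_f1(tp, fp, fn)
            + similarity_weight * semantic_similarity)

# Example: 3 shared statements, 1 extra, 1 missing, semantic similarity 0.9
print(answer_correctness(tp=3, fp=1, fn=1, semantic_similarity=0.9))  # ~0.79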

For more information, see the page on Answer Correctness in the official Ragas documentation.

Metrics calculated by the evaluator

  • Answer correctness (float)
    • The assessment of the answer correctness metric involves gauging the accuracy of the generated answer when compared to the ground truth. This evaluation relies on the ground truth and the answer, with scores ranging from 0 to 1. A higher score indicates a closer alignment between the generated answer and the ground truth, signifying better correctness. The answer correctness metric encompasses two critical aspects: semantic similarity between the generated answer and the ground truth, as well as factual similarity. These aspects are combined using a weighted scheme to formulate the answer correctness score.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.

  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.

  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Answer Relevancy Evaluator

Answer Relevancy (retrieval+generation) evaluator assesses how pertinent the actual answer is to the given question. A lower score indicates an actual answer that is incomplete or contains redundant information.

  • Mean cosine similarity of the original question and questions generated by the LLM judge.

  • Compatibility: RAG and LLM evaluation.

  • Based on RAGAs library.

Method

  • The LLM judge is prompted to generate an appropriate question for the actual answer multiple times, and the mean cosine similarity of generated questions with the original question is measured.

  • The score will range between 0 and 1 most of the time, but this is not mathematically guaranteed, because cosine similarity ranges from -1 to 1.

answer relevancy = mean(cosine_similarity(question, generated_questions))
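
A minimal sketch of this computation, assuming a generic sentence-transformers embedding model (all-MiniLM-L6-v2 here is an illustrative choice, not necessarily the evaluator's model) and with the judge-generated questions passed in as a list:

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

def answer_relevancy(question: str, generated_questions: list[str]) -> float:
    """Mean cosine similarity between the original question and the questions
    the LLM judge generated from the actual answer."""
    q_emb = embedder.encode([question])[0]
    g_embs = embedder.encode(generated_questions)
    cosines = [float(np.dot(q_emb, g) / (np.linalg.norm(q_emb) * np.linalg.norm(g)))
               for g in g_embs]
    return float(np.mean(cosines))

# The generated questions would come from prompting the LLM judge with the
# actual answer several times, e.g. "What question does this text answer?".
print(answer_relevancy(
    "What is the capital of France?",
    ["Which city is the capital of France?", "What is France's capital city?"],
))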

Metrics calculated by the evaluator

  • Answer relevancy (float)
    • Answer relevancy metric (retrieval+generation) assesses how pertinent the generated answer is to the given prompt. A lower score indicates answers which are incomplete or contain redundant information. This metric is computed using the question and the answer. Higher is better. An answer is deemed relevant when it directly and appropriately addresses the original question. To calculate this score, the LLM is prompted to generate an appropriate question for the generated answer multiple times, and the mean cosine similarity of generated questions with the original question is measured.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Answer Relevancy (Sentence Similarity)

The Answer Relevancy (Sentence Similarity) evaluator assesses how relevant the actual answer is by computing the similarity between the question and the actual answer sentences.

  • Compatibility: RAG and LLM evaluation.

Method

  • The metric is calculated as the maximum similarity between the question and the actual answer sentences:
answer relevancy = max( {S(emb(question), emb(a)): for all a in actual answer} )
  • Where:
    • A is the actual answer.
    • a is a sentence in the actual answer.
    • emb(a) is a vector embedding of the actual answer sentence.
    • emb(question) is a vector embedding of the question.
    • S(q, a) is 1 - cosine distance between the question q and the actual answer sentence a.
  • The evaluator uses the BAAI/bge-small-en embeddings, where BGE stands for "BAAI General Embedding", a suite of open-source text embedding models developed by the Beijing Academy of Artificial Intelligence (BAAI).
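
A minimal sketch of the metric, under the assumptions that the BAAI/bge-small-en embeddings are loaded via sentence-transformers and that NLTK is used for sentence splitting (the evaluator's internal tokenizer may differ):

import nltk
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("BAAI/bge-small-en")

def answer_relevancy_sentence_similarity(question: str, actual_answer: str) -> float:
    # nltk.download("punkt") may be required on first use.
    sentences = nltk.sent_tokenize(actual_answer)
    if not sentences:
        return 0.0
    q_emb = embedder.encode(question, convert_to_tensor=True)
    s_embs = embedder.encode(sentences, convert_to_tensor=True)
    # S(q, a) is the cosine similarity; the metric is the maximum over answer sentences.
    return float(util.cos_sim(q_emb, s_embs).max())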

Metrics calculated by the evaluator

  • Answer relevancy (float)
    • Answer Relevancy metric determines whether the RAG outputs relevant information by comparing the actual answer sentences to the question.
    • A higher score is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Answer Semantic Similarity Evaluator

Answer Semantic Similarity Evaluator assesses the semantic resemblance between the generated answer and the expected answer (ground truth).

  • Cross-encoder model or embeddings + cosine similarity.
  • Compatibility: RAG and LLM evaluation.
  • Based on RAGAs library

Method

  • Evaluator utilizes a cross-encoder model to calculate the semantic similarity score between the actual answer and expected answer. A cross-encoder model takes two text inputs and generates a score indicating how similar or relevant they are to each other.
  • The method is configurable; the evaluator defaults to the BAAI/bge-small-en-v1.5 embeddings (where BGE stands for "BAAI General Embedding", a suite of open-source text embedding models developed by the Beijing Academy of Artificial Intelligence (BAAI)) with cosine similarity as the similarity metric. In this case, the evaluator vectorizes the ground truth and generated answers and calculates the cosine similarity between them.
  • In general, cross-encoder models (like HuggingFace Sentence Transformers) tend to have higher accuracy on complex tasks but are slower. Embeddings with cosine similarity tend to be faster and more scalable, but less accurate for nuanced similarities.
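
A minimal sketch of the default embeddings-plus-cosine-similarity path, assuming the BAAI/bge-small-en-v1.5 model is loaded via sentence-transformers (a cross-encoder variant would instead score the answer pair with a cross-encoder model):

from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")

def answer_semantic_similarity(actual_answer: str, expected_answer: str) -> float:
    embeddings = embedder.encode([actual_answer, expected_answer], convert_to_tensor=True)
    # Cosine similarity between the actual and expected answer embeddings.
    return float(util.cos_sim(embeddings[0], embeddings[1]))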

Metrics calculated by the evaluator

  • Answer similarity (float)
    • The concept of answer semantic similarity pertains to the assessment of the semantic resemblance between the generated answer and the ground truth. This evaluation is based on the ground truth and the answer, with values falling within the range of 0 to 1. A higher score signifies a better alignment between the generated answer and the ground truth. Semantic similarity between answers can offer valuable insights into the quality of the generated response. This evaluation utilizes a cross-encoder model to calculate the semantic similarity score.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Faithfulness Evaluator

Faithfulness Evaluator measures the factual consistency of the generated answer with the given context.

  • The LLM judge finds claims in the actual answer and checks that these claims are present in the retrieved context.
  • Compatibility: RAG only evaluation.
  • Based on RAGAs library

Method

  • Faithfulness is calculated based on the actual answer and retrieved context.
  • The evaluation assesses whether the claims made in the actual answer can be inferred from the retrieved context, avoiding any hallucinations.
  • The score is determined by the ratio of the actual answer's claims present in the context to the total number of claims in the answer.
faithfulness = number of claims inferable from the context / claims in the answer

Metrics calculated by the evaluator

  • Faithfulness (float)
    • Faithfulness (generation) metric measures the factual consistency of the generated answer against the given context. It is calculated from the answer and retrieved context. Higher is better. The generated answer is regarded as faithful if all the claims that are made in the answer can be inferred from the given context: (number of claims inferable from the context / claims in the answer).
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Groundedness Evaluator

Groundedness (Semantic Similarity) Evaluator assesses the groundedness of the base LLM model in a Retrieval Augmented Generation (RAG) pipeline. It evaluates whether the actual answer contains factually correct information by comparing the actual answer to the retrieved context, as the actual answer generated by the LLM model must be based on the retrieved context.

Method

  • The groundedness metric is calculated as:
groundedness = min( { max( {S(emb(a), emb(c)): for all c in C} ): for all a in A } )
  • Where:
    • A is the actual answer and a is a sentence in the actual answer.
    • emb(a) is a vector embedding of the actual answer sentence a.
    • C is the context retrieved by the RAG model and c is a sentence of a retrieved context chunk.
    • emb(c) is a vector embedding of the context chunk sentence c.
    • S(a, c) is 1 - cosine distance between the actual answer sentence a and the retrieved context sentence c.
  • The evaluator uses the BAAI/bge-small-en embeddings, where BGE stands for "BAAI General Embedding", a suite of open-source text embedding models developed by the Beijing Academy of Artificial Intelligence (BAAI).
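
A minimal sketch of this computation, assuming the BAAI/bge-small-en embeddings via sentence-transformers and NLTK sentence splitting (both are assumptions; the evaluator's internals may differ):

import nltk
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("BAAI/bge-small-en")

def groundedness(actual_answer: str, retrieved_context: list[str]) -> float:
    # nltk.download("punkt") may be required on first use.
    answer_sentences = nltk.sent_tokenize(actual_answer)
    context_sentences = [s for chunk in retrieved_context
                         for s in nltk.sent_tokenize(chunk)]
    if not answer_sentences or not context_sentences:
        return 0.0
    a_embs = embedder.encode(answer_sentences, convert_to_tensor=True)
    c_embs = embedder.encode(context_sentences, convert_to_tensor=True)
    similarities = util.cos_sim(a_embs, c_embs)  # answer sentences x context sentences
    # For each answer sentence take its best-supported context sentence (max),
    # then score the answer by its least grounded sentence (min).
    return float(similarities.max(dim=1).values.min())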

Metrics calculated by the evaluator

  • Groundedness (float)
    • Groundedness metric determines whether the RAG outputs factually correct information by comparing the actual answer to the retrieved context. If there are facts in the output that are not present in the retrieved context, then the model is considered to be hallucinating - fabricates facts that are not supported by the context.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
    • Primary metric.

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.
  • If the actual answer is so short that the embedding ends up empty, then the evaluator will report a problem.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, where most of the evaluated LLM models hallucinated.
  • The least grounded actual answer sentence (in case the output metric score is below the threshold).

Evaluator parameters

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Hallucination Evaluator

Hallucination Evaluator assesses the hallucination of the base LLM model in a Retrieval Augmented Generation (RAG) pipeline. It evaluates whether the actual output contains factually correct information by comparing the actual output to the retrieved context, as the actual output generated by the LLM model must be based on the retrieved context. If there are facts in the output that are not present in the retrieved context, then the model is considered to be hallucinating, i.e., fabricating facts that are not supported by the context.

  • Cross-encoder model assessing retrieved context and actual answer similarity.
  • Compatibility: RAG evaluation only.

Method

  • The evaluation uses the Vectara hallucination evaluation cross-encoder model to calculate a score that measures the extent of hallucination in the generated answer with respect to the retrieved context.
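
A minimal sketch using the sentence-transformers CrossEncoder interface that the original Vectara model release documented (newer revisions of the model may require a different loading path, so treat this as an assumption rather than the evaluator's actual code):

from sentence_transformers import CrossEncoder

model = CrossEncoder("vectara/hallucination_evaluation_model")

def hallucination_score(retrieved_context: str, actual_answer: str) -> float:
    # The model returns a factual-consistency score in [0, 1]; values close to 1
    # mean the answer is consistent with the retrieved context.
    return float(model.predict([[retrieved_context, actual_answer]])[0])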

Metrics calculated by the evaluator

  • Hallucination (float)
    • Hallucination metric determines whether the RAG outputs factually correct information by comparing the actual output to the retrieved context. If there are facts in the output that are not present in the retrieved context, then the model is considered to be hallucinating - fabricates facts that are not supported by the context.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
    • Primary metric.

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, where most of the evaluated LLM models hallucinated.

Evaluator parameters

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Language Mismatch Evaluator

Language mismatch evaluator tries to determine whether the language of the question (prompt/input) and the actual answer is the same.

  • LLM judge based language detection.
  • Compatibility: RAG and LLM models.

Method

  • The evaluator prompts the LLM judge to compare languages in the question and actual answer.
  • Evaluator checks every test case. The result of the test case evaluation is a boolean.
  • LLM models are compared based on the number of test cases where they succeeded.

Metrics calculated by the evaluator

  • Same language (pass) (float)
    • Percentage of successfully evaluated RAG/LLM outputs for language mismatch metric which detects whether the language of the input and output is the same.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
    • Primary metric.
  • Language mismatch (fail) (float)
    • Percentage of RAG/LLM outputs that failed to pass the evaluator check for language mismatch metric which detects whether the language of the input and output is the same.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Language mismatch retrieval failures (float)
    • Percentage of RAG's retrieved contexts that failed to pass the evaluator check for language mismatch metric which detects whether the language of the input and output is the same.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Language mismatch generation failures (float)
    • Percentage of outputs generated by RAG from the retrieved contexts that failed to pass the evaluator check (equivalent to the model failures) for language mismatch metric which detects whether the language of the input and output is the same.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Language mismatch parsing failures (float)
    • Percentage of RAG/LLM outputs that evaluator's judge (LLM, RAG or model) was unable to parse, and therefore unable to evaluate and provide a metrics score for language mismatch metric which detects whether the language of the input and output is the same.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Most accurate, least accurate, fastest, slowest, most expensive, and cheapest LLM models based on the evaluated primary metric.
  • LLM models with best and worst context retrieval performance.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-bool-leaderboard

    • LLM failure leaderboard with data and formats for boolean metrics.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Looping Detection Evaluator

Looping detection evaluator tries to find out whether the LLM generation went into a loop.

  • Compatibility: RAG and LLM models.

Method

  • This evaluator provides three metrics:

unique sentences = number of unique sentences / number of all sentences

longest repeated substring = (length of the longest repeated substring * frequency of this substring) / length of the text

compression ratio = length in bytes of compressed string / length in bytes of original string

Where:

  • unique sentences omits sentences shorter than 10 characters.
  • compression ratio is calculated using Python's zlib and the maximum compression level (9).
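
A minimal sketch of the unique-sentences and compression-ratio metrics (the longest-repeated-substring metric is omitted); the regex-based sentence splitter is a simplification of whatever tokenizer the evaluator actually uses:

import re
import zlib

def unique_sentences_ratio(text: str, min_len: int = 10) -> float:
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text)
                 if len(s.strip()) >= min_len]
    return len(set(sentences)) / len(sentences) if sentences else 1.0

def compression_ratio(text: str) -> float:
    original = text.encode("utf-8")
    compressed = zlib.compress(original, level=9)  # maximum compression level
    return len(compressed) / len(original) if original else 1.0

looping_text = "I am stuck. " * 50
print(unique_sentences_ratio(looping_text))  # low: the same sentence repeats
print(compression_ratio(looping_text))       # low: repetitive text compresses well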

Metrics calculated by the evaluator

  • Unique Sentences (float)
    • Unique sentences metric is the ratio of the number of unique sentences to the number of all sentences, where sentences shorter than 10 characters are omitted.
    • Higher score is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
    • Primary metric.
  • Longest Repeated Substring (float)
    • Longest repeated substring metric is the ratio (length of the longest repeated substring * frequency of this substring) / length of the text.
    • Lower score is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
  • Compression Ratio (float)
    • Ratio length in bytes of compressed string / length in bytes of original string. Compression is done using Python's zlib and the maximum compression level (9).
    • Higher score is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Machine Translation (GPTScore) Evaluator

GPT Score evaluator family is based on a novel evaluation framework specifically designed for RAGs and LLMs. It utilizes the inherent abilities of LLMs, particularly their ability to understand and respond to instructions, to assess the quality of generated text.

  • LLM judge based evaluation.
  • Compatibility: RAG and LLM models.

Method

  • The core idea of GPTScore is that a generative pre-trained model will assign a higher probability to high-quality generated text that follows a given instruction and context. The score corresponds to the average negative log likelihood of the generated tokens. In this case, the average negative log likelihood is calculated from the tokens that follow "In other words,".
  • Instructions used by the evaluator are:
    • Accuracy:
      Rewrite the following text with its core information and consistent facts: {ref_hypo} In other words, {hypo_ref}
    • Fluency:
      Rewrite the following text to make it more grammatical and well-written: {ref_hypo} In other words, {hypo_ref}
    • Multidimensional quality metrics:
      Rewrite the following text into high-quality text with its core information: {ref_hypo} In other words, {hypo_ref}
  • Each instruction is evaluated twice - first it uses the expected answer for {ref_hypo} and the actual answer for {hypo_ref}, and then it is reversed. The calculated scores are then averaged.
  • The lower the metric value, the better.
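
A self-contained sketch of the scoring idea, using GPT-2 via the Hugging Face transformers library as a stand-in scoring model and the accuracy instruction above (this illustrates the method and is not the evaluator's actual implementation):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in scoring model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

TEMPLATE = ("Rewrite the following text with its core information and "
            "consistent facts: {ref_hypo} In other words, {hypo_ref}")

def gpt_score(ref_hypo: str, hypo_ref: str, template: str = TEMPLATE) -> float:
    """Average negative log likelihood of the tokens after 'In other words,'."""
    prefix = template.format(ref_hypo=ref_hypo, hypo_ref="").rstrip()
    full_text = template.format(ref_hypo=ref_hypo, hypo_ref=hypo_ref)
    prefix_len = tokenizer(prefix, return_tensors="pt").input_ids.shape[1]
    input_ids = tokenizer(full_text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits
    # Log probability of each token given the tokens before it.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = input_ids[0, 1:]
    token_log_probs = log_probs[torch.arange(targets.shape[0]), targets]
    # Keep only the continuation after the prefix (token-boundary effects at
    # the split are ignored in this sketch).
    return float(-token_log_probs[prefix_len - 1:].mean())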

Metrics calculated by the evaluator

  • Accuracy (float)
    • Are there inaccuracies, missing, or unfactual content in the generated text?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
    • This is the primary metric.
  • Fluency (float)
    • Is the generated text well-written and grammatical?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
  • Multidimensional Quality Metrics (float)
    • How is the overall quality of the generated text?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Parameterizable BYOP Evaluator

Bring Your Own Prompt (BYOP) evaluator uses a user-supplied custom prompt and an LLM judge to evaluate LLMs/RAGs. The current BYOP implementation supports only binary problems, so the prompt has to guide the judge to output either "true" or "false".

Method

  • User provides a custom prompt and an LLM judge.
  • The custom prompt may use the question, expected answer, retrieved context, and/or actual answer.
  • The evaluator prompts the LLM judge using the custom prompt provided by the user.
  • Evaluator checks every test case. The result of the test case evaluation is a boolean.
  • LLM models are compared based on the number of test cases where they succeeded.

Metrics calculated by the evaluator

  • Model passes (float)
    • Percentage of successfully evaluated RAG/LLM outputs.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
    • Primary metric.
  • Model failures (float)
    • Percentage of RAG/LLM outputs that failed to pass the evaluator check.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Model retrieval failures (float)
    • Percentage of RAG's retrieved contexts that failed to pass the evaluator check.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Model generation failures (float)
    • Percentage of outputs generated by RAG from the retrieved contexts that failed to pass the evaluator check (equivalent to the model failures).
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Model parse failures (float)
    • Percentage of RAG/LLM outputs that evaluator's judge (LLM, RAG or model) was unable to parse, and therefore unable to evaluate and provide a metrics score.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Most accurate, least accurate, fastest, slowest, most expensive and cheapest LLM models based on the evaluated primary metric.
  • LLM models with best and worst context retrieval performance.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-bool-leaderboard

    • LLM failure leaderboard with data and formats for boolean metrics.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Perplexity Evaluator

Perplexity measures how well a model predicts the next word based on what came before. The lower the perplexity score, the better the model is at predicting the next word.

Lower perplexity indicates that the model is more certain about its predictions. In comparison, higher perplexity suggests the model is more uncertain. Perplexity is a crucial metric for evaluating the performance of language models in tasks like machine translation, speech recognition, and text generation.

  • Evaluator uses the distilgpt2 language model to calculate the perplexity of the actual answer using the lmppl package.
  • Compatibility: RAG and LLM models.

Method

  • The evaluator utilizes the distilgpt2 language model to calculate the perplexity of the actual answer using the lmppl package. The calculation is as follows:
perplexity = exp(mean(cross-entropy loss))
  • Where the cross-entropy loss corresponds to cross-entropy loss of distilgpt2 calculated on the actual answer.
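
A minimal usage sketch, assuming the lmppl package's causal language-model interface:

import lmppl

scorer = lmppl.LM("distilgpt2")  # causal-LM perplexity scorer

actual_answers = [
    "The quick brown fox jumps over the lazy dog.",
    "Dog dog dog dog dog dog dog dog dog.",
]
# get_perplexity returns one perplexity value per input text; lower is better.
for answer, ppl in zip(actual_answers, scorer.get_perplexity(actual_answers)):
    print(f"{ppl:10.2f}  {answer}")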

Metrics calculated by the evaluator

  • Perplexity (float)
    • Perplexity measures how well a model predicts the next word based on what came before (sliding window). The lower the perplexity score, the better the model is at predicting the next word. Perplexity is calculated as exp(mean(-log likelihood)), where log-likelihood is computed using the distilgpt2 language model as the probability of predicting the next word.
    • Lower is better.
    • Range: [0, inf]
    • Default threshold: 0.5
    • Primary metric.

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Question Answering (GPTScore) Evaluator

GPT Score evaluator family is based on a novel evaluation framework specifically designed for RAGs and LLMs. It utilizes the inherent abilities of LLMs, particularly their ability to understand and respond to instructions, to assess the quality of generated text.

  • LLM judge based evaluation.
  • Compatibility: RAG and LLM models.

Method

  • The core idea of GPTScore is that a generative pre-trained model will assign a higher probability to high-quality generated text that follows a given instruction and context. The score corresponds to the average negative log likelihood of the generated tokens. In this case, the average negative log likelihood is calculated from the tokens of "Answer: Yes".

  • Instructions used by the evaluator are:

    • Interest:
      Answer the question based on the conversation between a human and AI.
      Question: Are the responses of AI interesting? (a) Yes. (b) No.
      Conversation: {history}
      Answer: Yes
    • Engagement:
      Answer the question based on the conversation between a human and AI.
      Question: Are the responses of AI engaging? (a) Yes. (b) No.
      Conversation: {history}
      Answer: Yes
    • Understandability:
      Answer the question based on the conversation between a human and AI.
      Question: Are the responses of AI understandable? (a) Yes. (b) No.
      Conversation: {history}
      Answer: Yes
    • Relevance:
      Answer the question based on the conversation between a human and AI.
      Question: Are the responses of AI relevant to the conversation? (a) Yes. (b) No.
      Conversation: {history}
      Answer: Yes
    • Specific:
      Answer the question based on the conversation between a human and AI.
      Question: Are the responses of AI generic or specific to the conversation? (a) Yes. (b) No.
      Conversation: {history}
      Answer: Yes
    • Correctness:
      Answer the question based on the conversation between a human and AI.
      Question: Are the responses of AI correct to conversations? (a) Yes. (b) No.
      Conversation: {history}
      Answer: Yes
    • Semantically appropriate:
      Answer the question based on the conversation between a human and AI.
      Question: Are the responses of AI semantically appropriate? (a) Yes. (b) No.
      Conversation: {history}
      Answer: Yes
    • Fluency:
      Answer the question based on the conversation between a human and AI.
      Question: Are the responses of AI fluently written? (a) Yes. (b) No.
      Conversation: {history}
      Answer: Yes
  • Where {history} corresponds to the conversation - question and actual answer.

  • The lower the metric value, the better.

Metrics calculated by the evaluator

  • Interest (float)
    • Is the generated text interesting?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
    • This is the primary metric.
  • Engagement (float)
    • Is the generated text engaging?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
  • Understandability (float)
    • Is the generated text understandable?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
  • Relevance (float)
    • How well is the generated text relevant to its source text?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
  • Specific (float)
    • Is the generated text generic or specific to the source text?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
  • Correctness (float)
    • Is the generated text correct or was there a misunderstanding of the source text?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
  • Semantically Appropriate (float)
    • Is the generated text semantically appropriate?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
  • Fluency (float)
    • Is the generated text well-written and grammatical?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

RAGAS Evaluator

RAGAs (RAG Assessment) is a framework that helps you evaluate your Retrieval Augmented Generation (RAG) pipelines. RAG refers to LLM applications that use external data to enhance the context. Evaluating and quantifying the performance of your pipeline can be hard, and this is where RAGAs comes in. The RAGAs score covers the performance of both the retrieval and generation components of the RAG pipeline, and therefore represents the overall quality of the answer, considering both the retrieval and the answer generation itself.

  • Harmonic mean of Faithfulness, Answer Relevancy, Context precision, and Context Recall metrics.
  • Compatibility: RAG evaluation only.
  • Based on RAGAs library

Method

  • RAGAs metric score is calculated as harmonic mean of the four metrics calculated by the following evaluators:
    • Faithfulness Evaluator (generation)
    • Answer Relevancy Evaluator (retrieval+generation)
    • Context Precision Evaluator (retrieval)
    • Context Recall Evaluator (retrieval)
  • Faithfulness covers the generation (answer) quality, Answer Relevancy covers both answer generation and retrieval quality, and Context Precision and Context Recall evaluate the retrieval quality.
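
A minimal sketch of the aggregation (the four per-test-case metric values are assumed to come from the evaluators listed above):

from statistics import harmonic_mean

def ragas_score(faithfulness: float, answer_relevancy: float,
                context_precision: float, context_recall: float) -> float:
    return harmonic_mean(
        [faithfulness, answer_relevancy, context_precision, context_recall]
    )

# A weak retrieval score pulls the overall RAGAs score down sharply:
print(ragas_score(0.90, 0.85, 0.40, 0.80))  # ~0.66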

Metrics calculated by the evaluator

  • RAGAS (float)
    • RAGAs (RAG Assessment) metric is a harmonic mean of the following metrics: faithfulness, answer relevancy, context precision and context recall.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
    • Primary metric.
  • Faithfulness (float)
    • Faithfulness (generation) metric measures the factual consistency of the generated answer against the given context. It is calculated from the answer and the retrieved context. Higher is better. The generated answer is regarded as faithful if all the claims that are made in the answer can be inferred from the given context: (number of claims inferable from the context / claims in the answer).
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
  • Answer relevancy (float)
    • Answer relevancy metric (retrieval+generation) assesses how pertinent the generated answer is to the given prompt. A lower score indicates answers which are incomplete or contain redundant information. This metric is computed using the question and the answer. Higher is better. An answer is deemed relevant when it directly and appropriately addresses the original question. To calculate this score, the LLM is prompted to generate an appropriate question for the generated answer multiple times, and the mean cosine similarity of generated questions with the original question is measured.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
  • Context precision (float)
    • Context precision metric (retrieval) evaluates whether all of the ground-truth relevant items present in the contexts are ranked high - ideally, all the relevant chunks must appear at the top of the context.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
  • Context recall (float)
    • Context recall metric (retrieval) measures the extent to which the retrieved context aligns with the answer, treated as the ground truth. It is computed based on the ground truth and the retrieved context. Higher is better. Each sentence in the ground truth answer is analyzed to determine whether it can be attributed to the retrieved context or not: (answer sentences that can be attributed to context / answer sentences count)
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Tokens Presence Evaluator

Tokens Presence Evaluator assesses whether both the retrieved context (in the case of RAG hosted models) and the generated answer contain/match a specified set of required strings. The evaluation is based on the match/no match of the required strings, using substring and/or regular expression-based search in the retrieved context and actual answer.

  • Boolean expression where operands are strings or regular expressions.
  • Compatibility: RAG and LLM evaluation.

Constraints are defined as a list of strings, regular expressions, and lists:

  • In the case of a string, the context and/or answer must contain the string.
  • In the case of a string with the REGEXP: prefix, the context and/or answer is checked to match the given regular expression. Use Python regular expression notation - for example REGEXP:^[Aa]nswer:? B$.
  • In the case of a list, the context and/or answer is checked to contain/match at least one of the list items (be it a string or a regular expression).

Method

  • Evaluator checks every test case - actual answer and retrieved context - for the presence of the required strings and regular expressions. The result of the test case evaluation is a boolean.
  • LLM models are compared based on the number of test cases where they succeeded.

Examples

Test Suite Constraints Example 1:

"output_constraints": [
  "15,969",
  "REGEXP:[Mm]illion",
  "REGEXP:^15,969 [Mm]illion$",
  ["either", "or"]
]

The preceding constraints indicate the following:

  1. must contain 15,969, and

  2. must match regular expression [Mm]illion, and

  3. must match regular expression ^15,969 [Mm]illion$, and

  4. must contain either "either" or "or".

Test Suite Constraints Example 2:

"output_constraints": [
["either", "or", "REGEXP:[Mm]illion"]
]

The preceding constraint indicates the following:

  1. must contain "either" or "or", or match regular expression [Mm]illion.
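
A minimal sketch of how these constraint semantics can be checked in Python (the evaluator's actual matching logic, e.g. its handling of anchors and case, may differ):

import re

def satisfies(text: str, constraint) -> bool:
    """Check a single constraint (string, REGEXP: string, or list) against the text."""
    if isinstance(constraint, list):
        # A list means "at least one of these items must contain/match".
        return any(satisfies(text, item) for item in constraint)
    if constraint.startswith("REGEXP:"):
        return re.search(constraint[len("REGEXP:"):], text) is not None
    return constraint in text

def passes(text: str, output_constraints: list) -> bool:
    # All top-level constraints must hold (logical AND).
    return all(satisfies(text, c) for c in output_constraints)

constraints = ["15,969", "REGEXP:[Mm]illion", ["either", "or"]]
print(passes("Revenue was 15,969 million, either way you count it.", constraints))  # True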

Metrics calculated by the evaluator

  • Model passes (float)
    • Percentage of successfully evaluated RAG/LLM outputs.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
    • Primary metric.
  • Model failures (float)
    • Percentage of RAG/LLM outputs that failed to pass the evaluator check.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Model retrieval failures (float)
    • Percentage of RAG's retrieved contexts that failed to pass the evaluator check.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Model generation failures (float)
    • Percentage of outputs generated by RAG from the retrieved contexts that failed to pass the evaluator check (equivalent to the model failures).
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Model parse failures (float)
    • Percentage of RAG/LLM outputs that evaluator's judge (LLM, RAG or model) was unable to parse, and therefore unable to evaluate and provide a metrics score.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Most accurate, least accurate, fastest, slowest, most expensive, and cheapest LLM models based on the evaluated primary metric.
  • LLM models with best and worst context retrieval performance.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-bool-leaderboard

    • LLM failure leaderboard with data and formats for boolean metrics.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Retrieval

Context Precision Evaluator

Context Precision Evaluator assesses the quality of the retrieved context by evaluating the order and relevance of text chunks on the context stack - the precision of the context retrieval. Ideally, all relevant chunks (ranked higher) should appear at the top of the context.

Method

  • The evaluator calculates a score based on the presence of the expected answer (ground truth) in the text chunks at the top of the retrieved context chunk stack.
  • Irrelevant chunks and unnecessarily large context decrease the score.
  • Top of the stack is defined as n top-most chunks at the top of the stack.
  • Chunk relevance is determined by the LLM judge as a [0, 1] value. The chunk precision at each position (depth) in the stack is multiplied by the chunk's relevance, summed over depths, and normalized to calculate the score:
context precision = sum( chunk precision (depth) * relevance (depth)) / number of relevant items at the top of the chunk stack
chunk precision (depth) = true positives (depth) / (true positives (depth) + false positives (depth))
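
A minimal sketch of this calculation, under the simplifying assumption that the judge's [0, 1] chunk relevance values are used directly as true-positive weights at each depth:

def context_precision(chunk_relevance: list[float]) -> float:
    """chunk_relevance[i] is the judged relevance of the chunk at depth i + 1,
    with the top of the retrieved-context stack first."""
    score = 0.0
    true_positives = 0.0
    for depth, relevance in enumerate(chunk_relevance, start=1):
        true_positives += relevance
        precision_at_depth = true_positives / depth  # TP / (TP + FP) at this depth
        score += precision_at_depth * relevance
    relevant_items = sum(chunk_relevance)
    return score / relevant_items if relevant_items else 0.0

# Relevant chunks ranked at the top score higher than the same chunks ranked low:
print(context_precision([1.0, 1.0, 0.0]))  # 1.0
print(context_precision([0.0, 1.0, 1.0]))  # ~0.58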

Metrics calculated by the evaluator

  • Context precision (float)
    • Context precision metric (retrieval) evaluates whether all of the ground-truth relevant items present in the contexts are ranked high - ideally, all the relevant chunks must appear at the top of the context.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Context Recall Evaluator

Context Recall Evaluator measures the alignment between the retrieved context and the expected answer (ground truth).

  • LLM judge is checking ground truth sentences' presence in the retrieved context.
  • Compatibility: RAG evaluation only.
  • Based on RAGAs library

Method

  • Metric is computed based on the ground truth and the retrieved context.
  • The LLM judge analyzes each sentence in the expected answer (ground truth) to determine if it can be attributed to the retrieved context.
  • The score is calculated as the ratio of the number of sentences in the expected answer that can be attributed to the context to the total number of sentences in the expected answer (ground truth).

Score formula:

context recall = (expected answer sentences that can be attributed to context) / (expected answer sentences count)

Metrics calculated by the evaluator

  • Context recall (float)
    • Context recall metric (retrieval) measures the extent to which the retrieved context aligns with the answer, treated as the ground truth. It is computed based on the ground truth and the retrieved context. Higher is better. Each sentence in the ground truth answer is analyzed to determine whether it can be attributed to the retrieved context or not: (answer sentences that can be attributed to context / answer sentences count)
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Context Relevancy Evaluator

Context Relevancy Evaluator measures the relevancy of the retrieved context based on the question and contexts.

  • Extraction and relevance assessment by an LLM judge.
  • Compatibility: RAG evaluation only.
  • Based on RAGAs library

Method

  • The evaluator uses an LLM judge to identify relevant sentences within the retrieved context to compute the score using the formula:
context relevancy = (number of relevant context sentences) / (total number of context sentences)
  • Total number of sentences is determined by a sentence tokenizer.

Metrics calculated by the evaluator

  • Context relevancy (float)
    • Context relevancy metric gauges the relevancy of the retrieved context, calculated based on both the question and contexts. The values fall within the range of (0, 1), with higher values indicating better relevancy. Ideally, the retrieved context should exclusively contain essential information to address the provided query. To compute this, evaluator initially estimates the value by identifying sentences within the retrieved context that are relevant for answering the given question. The final score is determined by the following formula: context relevancy = (number of relevant sentences / total number of sentences).
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Context Relevancy Evaluator (Soft Recall and Precision)

Question | Expected answer | Retrieved context | Actual answer | Constraints

Context Relevancy (Soft Recall and Precision) Evaluator measures the relevancy of the retrieved context based on the question and context sentences and produces two metrics - precision and recall relevancy.

  • Compatibility: RAG evaluation only.

Method

  • The evaluator computes two metrics as:
chunk context relevancy(ch) = max( {S(emb(q), emb(s)): for all s in ch} )

recall relevancy = max( {chunk context relevancy(ch): for all ch in rc} )
precision relevancy = avg( {chunk context relevancy(ch): for all ch in rc} )
  • Where:
    • rc is the retrieved context.
    • ch is a chunk of the retrieved context.
    • s is a sentence of a retrieved context chunk.
    • emb(s) is the vector embedding of the sentence s, and emb(q) is the vector embedding of the question q.
    • S(emb(q), emb(s)) is 1 minus the cosine distance (i.e., the cosine similarity) between the question embedding and the sentence embedding.
  • The evaluator uses the BAAI/bge-small-en embedding model, where BGE stands for "BAAI General Embedding", a suite of open-source text embedding models developed by the Beijing Academy of Artificial Intelligence (BAAI). A minimal sketch of this computation follows.
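
A minimal sketch of these formulas using the sentence-transformers package to load BAAI/bge-small-en (sentence splitting and chunking are simplified here; the production pipeline may differ):

# Sketch: soft recall/precision relevancy over retrieved-context chunks.
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-small-en")

def chunk_relevancy(question_emb, chunk: str) -> float:
    sentences = sent_tokenize(chunk) or [chunk]
    sentence_embs = model.encode(sentences, convert_to_tensor=True)
    # S(emb(q), emb(s)) = cosine similarity = 1 - cosine distance
    return float(util.cos_sim(question_emb, sentence_embs).max())

def soft_recall_precision(question: str, retrieved_chunks: list[str]) -> tuple[float, float]:
    question_emb = model.encode(question, convert_to_tensor=True)
    scores = [chunk_relevancy(question_emb, ch) for ch in retrieved_chunks]
    return max(scores), sum(scores) / len(scores)  # recall relevancy, precision relevancy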

Metrics calculated by the evaluator

  • Recall Relevancy (float)
    • Maximum retrieved context chunk relevancy.
    • Higher score is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
  • Precision Relevancy (float)
    • Average retrieved context chunk relevancy.
    • Higher score is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-heatmap-leaderboard

    • Leaderboards with models and prompts by metric values.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Privacy

Contact Information Evaluator

Question | Expected answer | Retrieved context | Actual answer | Constraints

Contact Information Evaluator checks for potential leakages of contact information in the text generated by RAG/LLM models. It assesses whether the generated answer contains contact information such as names, addresses, phone numbers, medical information, user names, and emails.

  • LLM judge based contact information detection.
  • Compatibility: RAG and LLM models.

Method

  • The evaluator prompts the LLM judge to detect contact information in the actual answer.
  • The evaluator checks every test case for the presence of contact information. The result of each test case evaluation is a boolean.
  • LLM models are compared based on the number of test cases where they succeeded, as aggregated in the sketch below.
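
A minimal sketch of how such boolean verdicts can be aggregated into the pass/fail percentages listed below (function and field names are illustrative, not Eval Studio's actual schema):

# Sketch: aggregate per-test-case judge verdicts into rates for one model.
def leakage_rates(judge_flags: list[bool]) -> dict:
    """judge_flags[i] is True when the LLM judge found contact information
    in the i-th actual answer (illustrative structure)."""
    n = len(judge_flags)
    fails = sum(judge_flags)
    return {
        "no_contact_information_leakages_pass": (n - fails) / n,
        "contact_information_leakages_fail": fails / n,
    }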

Metrics calculated by the evaluator

  • No contact information leakages (pass) (float)
    • Percentage of successfully evaluated RAG/LLM outputs for contact information leakage metric which detects privacy sensitive information like names, addresses, phone numbers, medical information, and emails.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
    • Primary metric.
  • Contact information leakages (fail) (float)
    • Percentage of RAG/LLM outputs that failed to pass the evaluator check for contact information leakage metric which detects privacy sensitive information like names, addresses, phone numbers, medical information, and emails.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Contact information retrieval failures (float)
    • Percentage of RAG's retrieved contexts that failed to pass the evaluator check for contact information leakage metric which detects privacy sensitive information like names, addresses, phone numbers, medical information, and emails.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Contact information generation failures (float)
    • Percentage of outputs generated by RAG from the retrieved contexts that failed to pass the evaluator check (equivalent to the model failures) for contact information leakage metric which detects privacy sensitive information like names, addresses, phone numbers, medical information, and emails.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Contact information parsing failures (float)
    • Percentage of RAG/LLM outputs that evaluator's judge (LLM, RAG or model) was unable to parse, and therefore unable to evaluate and provide a metrics score for contact information leakage metric which detects privacy sensitive information like names, addresses, phone numbers, medical information, and emails.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Most accurate, least accurate, fastest, slowest, most expensive, and cheapest LLM models based on the evaluated primary metric.
  • LLM models with best and worst context retrieval performance.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-bool-leaderboard

    • LLM failure leaderboard with data and formats for boolean metrics.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

PII Leakage Evaluator

Question | Expected answer | Retrieved context | Actual answer | Constraints

PII Leakage Evaluator checks for potential leakages of personally identifiable information (PII) - such as credit card numbers, social security numbers, and email addresses - in the text generated by the LLM/RAG model.

  • Regular expressions suite to detect PII in the retrieved context and actual answer.
  • Compatibility: RAG and LLM.

Method

  • PII Leakage Evaluator checks for potential personally identifiable information (PII) leakages in the text generated by LLM/RAG models.
  • The evaluation utilizes a regex suite that can quickly and reliably detect formatted PII, including credit card numbers, SSNs, and emails.
  • The evaluator checks every test case - actual answer and retrieved context - for the presence of PII. The result of each test case evaluation is a boolean.
  • LLM models are compared based on the number of test cases where they succeeded. A simplified sketch of the regex approach follows.
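
A minimal sketch of the regex-based check (the patterns below are simplified illustrations, not the evaluator's actual regex suite):

import re

# Simplified, illustrative PII patterns (a production suite is far more thorough).
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def pii_leaks(text: str) -> dict[str, bool]:
    return {name: bool(pattern.search(text)) for name, pattern in PII_PATTERNS.items()}

# The check runs on both the actual answer and the retrieved context, e.g.:
# leaked = any(pii_leaks(actual_answer).values()) or any(pii_leaks(retrieved_context).values())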

Metrics calculated by the evaluator

  • No PII leakages (pass) (float)
    • Percentage of successfully evaluated RAG/LLM outputs for PII leakage metric which detects privacy sensitive information like credit card numbers, social security numbers, and email addresses.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
    • Primary metric.
  • PII leakages (fail) (float)
    • Percentage of RAG/LLM outputs that failed to pass the evaluator check for PII leakage metric which detects privacy sensitive information like credit card numbers, social security numbers, and email addresses.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • PII retrieval leakages (float)
    • Percentage of RAG's retrieved contexts that failed to pass the evaluator check for PII leakage metric which detects privacy sensitive information like credit card numbers, social security numbers, and email addresses.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • PII generation leakages (float)
    • Percentage of outputs generated by RAG from the retrieved contexts that failed to pass the evaluator check (equivalent to the model failures) for PII leakage metric which detects privacy sensitive information like credit card numbers, social security numbers, and email addresses.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Most accurate, least accurate, fastest, slowest, most expensive, and cheapest LLM models based on the evaluated primary metric.
  • LLM models with best and worst context retrieval performance.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-bool-leaderboard

    • LLM failure leaderboard with data and formats for boolean metrics.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Sensitive Data Leakage Evaluator

Question | Expected answer | Retrieved context | Actual answer | Constraints

Sensitive Data Leakage Evaluator checks for potential leakages of security-related and/or sensitive data in the text generated by LLM/RAG models. It assesses whether the generated answer contains security-related information such as activation keys, passwords, API keys, tokens, or certificates.

  • Regular expressions suite to detect sensitive data in the retrieved context and actual answer.
  • Compatibility: RAG and LLM.

Method

  • The evaluator utilizes a regex suite that can quickly and reliably detect formatted sensitive data, including certificates in SSL/TLS PEM format, API keys for H2O.ai and OpenAI, and activation keys for Windows.
  • The evaluator checks every test case - actual answer and retrieved context - for the presence of sensitive data. The result of each test case evaluation is a boolean.
  • LLM models are compared based on the number of test cases where they succeeded.

Metrics calculated by the evaluator

  • No sensitive data leakages (pass) (float)
    • Percentage of successfully evaluated RAG/LLM outputs for sensitive data leakage metric which detects sensitive data like security certificates (SSL/TLS PEM), API keys (H2O.ai and OpenAI) and activation keys (Windows).
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
    • Primary metric.
  • Sensitive data leakages (fail) (float)
    • Percentage of RAG/LLM outputs that failed to pass the evaluator check for sensitive data leakage metric which detects sensitive data like security certificates (SSL/TLS PEM), API keys (H2O.ai and OpenAI) and activation keys (Windows).
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Sensitive data retrieval leakages (float)
    • Percentage of RAG's retrieved contexts that failed to pass the evaluator check for sensitive data leakage metric which detects sensitive data like security certificates (SSL/TLS PEM), API keys (H2O.ai and OpenAI) and activation keys (Windows).
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Sensitive data generation leakages (float)
    • Percentage of outputs generated by RAG from the retrieved contexts that failed to pass the evaluator check (equivalent to the model failures) for sensitive data leakage metric which detects sensitive data like security certificates (SSL/TLS PEM), API keys (H2O.ai and OpenAI) and activation keys (Windows).
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Sensitive data parsing failures (float)
    • Percentage of RAG/LLM outputs that evaluator's judge (LLM, RAG or model) was unable to parse, and therefore unable to evaluate and provide a metrics score for sensitive data leakage metric which detects sensitive data like security certificates (SSL/TLS PEM), API keys (H2O.ai and OpenAI) and activation keys (Windows).
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Most accurate, least accurate, fastest, slowest, most expensive, and cheapest LLM models based on the evaluated primary metric.
  • LLM models with best and worst context retrieval performance.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-bool-leaderboard

    • LLM failure leaderboard with data and formats for boolean metrics.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Fairness

Fairness Bias Evaluator

Question | Expected answer | Retrieved context | Actual answer | Constraints

Fairness bias evaluator assesses whether the LLM/RAG output contains gender, racial, or political bias. This information can then be used to improve the development and deployment of LLMs/RAGs by identifying and mitigating potential biases.

  • Compatibility: RAG and LLM models.

Method

Metric calculated by the evaluator

  • Fairness bias (float)
    • Fairness bias metric indicates the level of gender, racial, or political bias in the generated text. High score indicates high fairness bias.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
    • Primary metric.

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator

  • llm-eval-results
    • Frame with the evaluation results.
  • llm-heatmap-leaderboard
    • Leaderboards with models and prompts by metric values.
  • work-dir-archive
    • Zip archive with evaluator artifacts.

Sexism Evaluator

Question | Expected answer | Retrieved context | Actual answer | Constraints

Sexism evaluator evaluates input and LLM output to find possible instances of sexism.

  • LLM judge based sexism detection.
  • Compatibility: RAG and LLM models.

Method

  • The evaluator prompts the LLM judge to detect sexism in the actual answer.
  • Evaluator checks every test case for the presence of sexism. The result of the test case evaluation is a boolean.
  • LLM models are compared based on the number of test cases where they succeeded.

Metrics calculated by the evaluator

  • No Sexism (pass) (float)
    • Percentage of successfully evaluated RAG/LLM outputs for sexism metric which detects possible instances of sexism.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
    • Primary metric.
  • Sexist (fail) (float)
    • Percentage of RAG/LLM outputs that failed to pass the evaluator check for sexism metric which detects possible instances of sexism.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Sexism in retrieval (float)
    • Percentage of RAG's retrieved contexts that failed to pass the evaluator check for sexism metric which detects possible instances of sexism.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Sexism in generation (float)
    • Percentage of outputs generated by RAG from the retrieved contexts that failed to pass the evaluator check (equivalent to the model failures) for sexism metric which detects possible instances of sexism.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Sexism parsing failures (float)
    • Percentage of RAG/LLM outputs that evaluator's judge (LLM, RAG or model) was unable to parse, and therefore unable to evaluate and provide a metrics score for sexism metric which detects possible instances of sexism.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Most accurate, least accurate, fastest, slowest, most expensive and cheapest LLM models based on the evaluated primary metric.
  • LLM models with best and worst context retrieval performance.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Explanations created by the evaluator

  • llm-eval-results
    • Frame with the evaluation results.
  • llm-bool-leaderboard
    • LLM failure leaderboard with data and formats for boolean metrics.
  • work-dir-archive
    • Zip archive with evaluator artifacts.

Stereotypes Evaluator

Question | Expected answer | Retrieved context | Actual answer | Constraints

Stereotypes evaluator tries to determine whether the LLM output contains stereotypes - it assesses whether the answer adds information about gender or race that has no reference in the question.

  • LLM judge based gender stereotypes detection.
  • Compatibility: RAG and LLM models.

Method

  • The evaluator prompts the LLM judge to detect gender stereotypes in the actual answer and also to check the question.
  • Evaluator checks every test case for the presence of stereotypes. The result of the test case evaluation is a boolean.
  • LLM models are compared based on the number of test cases where they succeeded.

Metrics calculated by the evaluator

  • Stereotype-free (pass) (float)
    • Percentage of successfully evaluated RAG/LLM outputs for gender stereotypes metric which detects the presence of gender and/or race stereotypes.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
    • Primary metric.
  • Stereotyped (fail) (float)
    • Percentage of RAG/LLM outputs that failed to pass the evaluator check for gender stereotypes metric which detects the presence of gender and/or race stereotypes.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Stereotypes in retrieval (float)
    • Percentage of RAG's retrieved contexts that failed to pass the evaluator check for gender stereotypes metric which detects the presence of gender and/or race stereotypes.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Stereotypes in generation (float)
    • Percentage of outputs generated by RAG from the retrieved contexts that failed to pass the evaluator check (equivalent to the model failures) for gender stereotypes metric which detects the presence of gender and/or race stereotypes.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Stereotypes parsing failures (float)
    • Percentage of RAG/LLM outputs that evaluator's judge (LLM, RAG or model) was unable to parse, and therefore unable to evaluate and provide a metrics score for gender stereotypes metric which detects the presence of gender and/or race stereotypes.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Most accurate, least accurate, fastest, slowest, most expensive, and cheapest LLM models based on the evaluated primary metric.
  • LLM models with best and worst context retrieval performance.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Explanations created by the evaluator

  • llm-eval-results
    • Frame with the evaluation results.
  • llm-bool-leaderboard
    • LLM failure leaderboard with data and formats for boolean metrics.
  • work-dir-archive
    • Zip archive with evaluator artifacts.

Toxicity Evaluator

Question | Expected answer | Retrieved context | Actual answer | Constraints

Toxicity evaluator assesses the level of toxicity in the actual answers. RAGs/LLMs can generate human-quality text, but they can also be prone to generating toxic content, such as hate speech, offensive language, and discriminatory language.

The value of the toxicity evaluator is twofold: it helps ensure that LLMs are not used to generate toxic content that could harm individuals or groups, and it helps improve the accuracy and reliability of RAGs/LLMs by identifying and mitigating the generation of toxic content.

  • Compatibility: RAG and LLM models.

Method

  • Toxicity evaluator uses the Detoxify library to calculate toxicity scores, as in the sketch below.
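
A minimal sketch using the Detoxify package (the "original" model variant is an assumption; Eval Studio's configuration may differ):

# Sketch: score an actual answer with Detoxify.
from detoxify import Detoxify

scorer = Detoxify("original")  # assumed model variant
scores = scorer.predict("the actual answer to evaluate")
# `scores` is a dict with keys such as 'toxicity', 'severe_toxicity',
# 'obscene', 'threat', 'insult', and 'identity_attack', all in [0, 1].
print(scores["toxicity"])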

Metrics calculated by the evaluator

  • Toxicity (float)
    • Toxicity score indicates the level of toxicity in the generated text, e.g., hate speech, offensive language, threatening, insulting, and identity attacking language. High score indicates high toxicity.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
    • Primary metric.
  • Severe Toxicity (float)
    • Toxicity score indicates the level of severe toxicity in the generated text. High score indicates high toxicity.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
  • Obscene (float)
    • Toxicity score indicates the level of obscenity in the generated text. High score indicates high toxicity.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
  • Threat (float)
    • Toxicity score indicates the level of threat in the generated text. High score indicates high toxicity.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
  • Insult (float)
    • Toxicity score indicates the level of insults in the generated text. High score indicates high toxicity.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
  • Identity Attack (float)
    • Toxicity score indicates the level of identity attacks in the generated text. High score indicates high toxicity.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator

  • llm-eval-results
    • Frame with the evaluation results.
  • llm-heatmap-leaderboard
    • Leaderboards with models and prompts by metric values.
  • work-dir-archive
    • Zip archive with evaluator artifacts.

Summarization

BLEU Evaluator

Question | Expected answer | Retrieved context | Actual answer | Constraints

BLEU (Bilingual Evaluation Understudy) measures the quality of machine-generated texts by comparing them to reference texts. BLEU calculates a score between 0.0 and 1.0, where a higher score indicates a better match with the reference text.

  • Compatibility: RAG and LLM models.

Method

  • BLEU is based on the concept of n-grams, which are contiguous sequences of words. The different variations of BLEU, such as BLEU-1, BLEU-2, BLEU-3, and BLEU-4, differ in the size of the n-grams considered for evaluation.
  • BLEU-n measures the precision of n-grams (n consecutive words) in the generated text compared to the reference text. It calculates the precision score by counting the number of overlapping n-grams and dividing it by the total number of n-grams in the generated text.
  • The NLTK library is used to tokenize the text with the punkt tokenizer and to calculate the BLEU scores, as in the sketch below.
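
A minimal sketch of the BLEU computation with NLTK (individual n-gram weights and the smoothing choice are assumptions):

# Sketch: BLEU-1..BLEU-4 as individual n-gram precision scores.
from nltk.tokenize import word_tokenize
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = word_tokenize("the expected answer text")
candidate = word_tokenize("the actual answer text")
smooth = SmoothingFunction().method1  # assumption: smoothing helps on short texts

weights = {
    "BLEU-1": (1, 0, 0, 0),
    "BLEU-2": (0, 1, 0, 0),
    "BLEU-3": (0, 0, 1, 0),
    "BLEU-4": (0, 0, 0, 1),
}
# Note: sentence_bleu also applies a brevity penalty on top of the n-gram precision.
scores = {
    name: sentence_bleu([reference], candidate, weights=w, smoothing_function=smooth)
    for name, w in weights.items()
}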

Metrics calculated by the evaluator

  • BLEU-1 (float)
    • BLEU-1 metric is typically used for summary evaluation - it measures the precision of n-grams (n consecutive words) in the generated text compared to the reference text. It calculates the precision score by counting the number of overlapping unigrams and dividing it by the total number of unigrams in the generated text.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
    • Primary metric.
  • BLEU-2 (float)
    • BLEU-2 metric is typically used for summary evaluation - it measures the precision of n-grams (n consecutive words) in the generated text compared to the reference text. It calculates the precision score by counting the number of overlapping bigrams and dividing it by the total number of bigrams in the generated text.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
  • BLEU-3 (float)
    • BLEU-3 metric is typically used for summary evaluation - it measures the precision of n-grams (n consecutive words) in the generated text compared to the reference text. It calculates the precision score by counting the number of overlapping trigrams and dividing it by the total number of trigrams in the generated text.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
  • BLEU-4 (float)
    • BLEU-4 metric is typically used for summary evaluation - it measures the precision of n-grams (n consecutive words) in the generated text compared to the reference text. It calculates the precision score by counting the number of overlapping 4-grams and dividing it by the total number of 4-grams in the generated text.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Explanations created by the evaluator

  • llm-eval-results
    • Frame with the evaluation results.
  • llm-heatmap-leaderboard
    • Leaderboards with models and prompts by metric values.
  • work-dir-archive
    • Zip archive with evaluator artifacts.

ROUGE Evaluator

Question | Expected answer | Retrieved context | Actual answer | Constraints

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of evaluation metrics used to assess the quality of generated summaries compared to reference summaries. There are several variations of ROUGE metrics, including ROUGE-1, ROUGE-2, and ROUGE-L.

  • Compatibility: RAG and LLM models.

Method

  • The evaluator reports the F1 score between the generated and reference n-grams.
  • ROUGE-1 measures the overlap of 1-grams (individual words) between the generated and the reference summaries.
  • ROUGE-2 extends the evaluation to 2-grams (pairs of consecutive words).
  • ROUGE-L considers the longest common subsequence (LCS) between the generated and reference summaries.
  • These ROUGE metrics provide a quantitative evaluation of the similarity between the generated and reference texts to assess the effectiveness of text summarization algorithms; a minimal sketch follows.
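
A minimal sketch using the rouge-score package (the exact library used by the evaluator is not stated here, so treat this as one possible implementation):

# Sketch: ROUGE-1, ROUGE-2, and ROUGE-L F1 scores.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score("the reference summary", "the generated summary")
# Each entry is a Score(precision, recall, fmeasure); the evaluator reports the F1 score.
rouge_l_f1 = scores["rougeL"].fmeasure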

Metrics calculated by the evaluator

  • ROUGE-1 (float)
    • ROUGE-1 metric measures the overlap of 1-grams (individual words) between the generated and the reference summaries.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
  • ROUGE-2 (float)
    • ROUGE-2 metric measures the overlap of 2-grams (pairs of consecutive words) between the generated and the reference summaries.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
  • ROUGE-L (float)
    • ROUGE-L metric considers the longest common subsequence (LCS) between the generated and reference summaries.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
    • Primary metric.

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, the evaluator will report a problem for each perturbed test case and LLM model whose metric flips (moved above or below the threshold) after perturbation.

Insights diagnosed by the evaluator

  • The best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt that most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator

  • llm-eval-results
    • Frame with the evaluation results.
  • llm-heatmap-leaderboard
    • Leaderboards with models and prompts by metric values.
  • work-dir-archive
    • Zip archive with evaluator artifacts.

Summarization (Completeness and Faithfulness) Evaluator

Question | Expected answer | Retrieved context | Actual answer | Constraints

This summarization evaluator, which does not require a reference summary, uses two faithfulness metrics based on SummaC (Conv and ZS) and one completeness metric.

  • Compatibility: RAG and LLM models.

Method

  • SummaC Conv is a trained model consisting of a single learned convolution layer compiling the distribution of entailment scores of all document sentences into a single score.
  • SummaC ZS performs zero-shot aggregation by combining sentence-level scores using max and mean operators. This metric is more sensitive to outliers than SummaC Conv.
  • Completeness metric is calculated using the distance of embeddings between the reference and faithful parts of the summary.

Metrics calculated by the evaluator

  • Completeness (float)
    • Completeness metric is calculated using the distance of embeddings between the reference and faithful parts of the summary.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
    • Primary metric.
  • Faithfulness (SummaC Conv) (float)
    • The faithfulness metric measures how well the summary preserves the meaning and factual content of the original text. SummaC Conv is a trained model consisting of a single learned convolution layer compiling the distribution of entailment scores of all document sentences into a single score.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
  • Faithfulness (SummaC ZS) (float)
    • The faithfulness metric measures how well the summary preserves the meaning and factual content of the original text. SummaC ZS performs zero-shot aggregation by combining sentence-level scores using max and mean operators. This metric is more sensitive to outliers than SummaC Conv.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Explanations created by the evaluator

  • llm-eval-results
    • Frame with the evaluation results.
  • llm-heatmap-leaderboard
    • Leaderboards with models and prompts by metric values.
  • work-dir-archive
    • Zip archive with evaluator artifacts.

Summarization (Judge) Evaluator

Question | Expected answer | Retrieved context | Actual answer | Constraints

Summarization evaluator uses an LLM judge to assess the quality of the summary made by the evaluated model using a reference summary.

  • LLM judge based summarization evaluation.
  • Requires a reference summary.
  • Compatibility: RAG and LLM models.

Method

  • The evaluator prompts the LLM judge to compare the actual answer (the evaluated RAG/LLM's summary) with the expected answer (the reference summary).
  • The evaluator checks every test case for summary quality. The result of each test case evaluation is a boolean.
  • LLM models are compared based on the number of test cases where they succeeded.

Metrics calculated by the evaluator

  • Good summary (pass) (float)
    • Percentage of successfully evaluated RAG/LLM outputs for the summarization quality metric, which uses a language model judge to determine whether the summary is correct or not.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
    • Primary metric.
  • Bad summary (fail) (float)
    • Percentage of RAG/LLM outputs that failed to pass the evaluator check for the summarization quality metric, which uses a language model judge to determine whether the summary is correct or not.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Summarization retrieval failures (float)
    • Percentage of RAG's retrieved contexts that failed to pass the evaluator check for the summarization quality metric, which uses a language model judge to determine whether the summary is correct or not.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Summarization generation failures (float)
    • Percentage of outputs generated by RAG from the retrieved contexts that failed to pass the evaluator check (equivalent to the model failures) for the summarization quality metric, which uses a language model judge to determine whether the summary is correct or not.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Summarization parsing failures (float)
    • Percentage of RAG/LLM outputs that the evaluator's judge (LLM, RAG or model) was unable to parse, and therefore unable to evaluate and provide a metric score, for the summarization quality metric, which uses a language model judge to determine whether the summary is correct or not.
    • Lower is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Most accurate, least accurate, fastest, slowest, most expensive and cheapest LLM models based on the evaluated primary metric.
  • LLM models with best and worst context retrieval performance.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Explanations created by the evaluator

  • llm-eval-results

    • Frame with the evaluation results.
  • llm-bool-leaderboard

    • LLM failure leaderboard with data and formats for boolean metrics.
  • work-dir-archive

    • Zip archive with evaluator artifacts.

Summarization with reference (GPTScore) Evaluator

Input | Expected answer | Retrieved context | Actual answer | Constraints

GPT Score evaluator family is based on a novel evaluation framework specifically designed for RAGs and LLMs. It utilizes the inherent abilities of LLMs, particularly their ability to understand and respond to instructions, to assess the quality of generated text.

  • LLM judge based evaluation.
  • Compatibility: RAG and LLM models.

Method

  • The core idea of GPTScore is that a generative pre-trained model will assign a higher probability to high-quality generated text that follows a given instruction and context. The score corresponds to the average negative log-likelihood of the generated tokens. In this case, the average negative log-likelihood is calculated from the tokens that follow the phrase "In other words,".
  • Instructions used by the evaluator are:
    • Semantic coverage:
      Rewrite the following text with the same semantics. {ref_hypo} In other words, {hypo_ref}
    • Factuality:
      Rewrite the following text with consistent facts. {ref_hypo} In other words, {hypo_ref}
    • Informativeness:
      Rewrite the following text with its core information. {ref_hypo} In other words, {hypo_ref}
    • Coherence:
      Rewrite the following text into a coherent text. {ref_hypo} In other words, {hypo_ref}
    • Relevance:
      Rewrite the following text with consistent details. {ref_hypo} In other words, {hypo_ref}
    • Fluency:
      Rewrite the following text into a fluent and grammatical text. {ref_hypo} In other words, {hypo_ref}
  • Each instruction is evaluated twice - first using the expected answer for {ref_hypo} and the actual answer for {hypo_ref}, and then with the two reversed. The calculated scores are then averaged.
  • The lower the metric value, the better (see the sketch below).
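
A minimal sketch of the average negative log-likelihood computation with a Hugging Face causal LM (gpt2 is only a stand-in for the judge model; the example uses the semantic coverage instruction):

# Sketch: GPTScore as the average negative log-likelihood of the continuation tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # stand-in judge model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def gptscore(prefix: str, continuation: str) -> float:
    """Average negative log-likelihood of `continuation` given `prefix`
    (tokenizing the two parts separately is a simplification)."""
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    cont_ids = tokenizer(continuation, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, cont_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)    # predictors of tokens 1..N-1
    cont_log_probs = log_probs[0, prefix_ids.shape[1] - 1 :, :].gather(
        -1, cont_ids[0].unsqueeze(-1)
    )
    return float(-cont_log_probs.mean())

# Semantic coverage: expected answer as {ref_hypo}, actual answer as {hypo_ref}.
score = gptscore(
    "Rewrite the following text with the same semantics. <expected answer> In other words,",
    " <actual answer>",
)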

Metrics calculated by the evaluator

  • Semantic Coverage (float)
    • How many semantic content units from the reference text are covered by the generated text?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
    • This is the primary metric.
  • Factuality (float)
    • Does the generated text preserve the factual statements of the source text?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
  • Informativeness (float)
    • How well does the generated text capture the key ideas of its source text?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
  • Coherence (float)
    • How much does the generated text make sense?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
  • Relevance (float)
    • How well is the generated text relevant to its source text?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
  • Fluency (float)
    • Is the generated text well-written and grammatical?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Explanations created by the evaluator

  • llm-eval-results
    • A frame with the evaluation results.
  • llm-heatmap-leaderboard
    • Leaderboards with models and prompts by metric values.
  • work-dir-archive
    • A ZIP archive with evaluator artifacts.

Summarization without reference (GPTScore) Evaluator

Input | Expected answer | Retrieved context | Actual answer | Constraints

GPT Score evaluator family is based on a novel evaluation framework specifically designed for RAGs and LLMs. It utilizes the inherent abilities of LLMs, particularly their ability to understand and respond to instructions, to assess the quality of generated text.

  • LLM judge based evaluation.
  • Compatibility: RAG and LLM models.

Method

  • The core idea of GPTScore is that a generative pre-trained model will assign a higher probability to high-quality generated text that follows a given instruction and context. The score corresponds to the average negative log-likelihood of the generated tokens. In this case, the average negative log-likelihood is calculated from the tokens that follow "Tl;dr".
  • Instructions used by the evaluator are:
    • Semantic coverage:
      Generate a summary with as much semantic coverage as possible for the following text: {src}
      Tl;dr
      {target}
    • Factuality:
      Generate a summary with consistent facts for the following text: {src}
      Tl;dr
      {target}
    • Consistency:
      Generate a factually consistent summary for the following text: {src}
      Tl;dr
      {target}
    • Informativeness:
      Generate an informative summary that captures the key points of the following text: {src}
      Tl;dr
      {target}
    • Coherence:
      Generate a coherent summary for the following text: {src}
      Tl;dr
      {target}
    • Relevance:
      Generate a relevant summary with consistent details for the following text: {src}
      Tl;dr
      {target}
    • Fluency:
      Generate a fluent and grammatical summary for the following text: {src}
      Tl;dr
      {target}
  • Where {src} corresponds to the question and {target} to the actual answer.
  • The lower the metric value, the better (see the usage note below).
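
For illustration, the gptscore sketch from the with-reference evaluator above can be reused here; only the instruction template changes (the function and the judge model remain assumptions):

# Sketch: semantic coverage without a reference; {src} is the question, {target} the actual answer.
score = gptscore(
    "Generate a summary with as much semantic coverage as possible "
    "for the following text: <question>\nTl;dr\n",
    "<actual answer>",
)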

Metrics calculated by the evaluator

  • Semantic Coverage (float)
    • How many semantic content units from the reference text are covered by the generated text?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
    • This is the primary metric.
  • Factuality (float)
    • Does the generated text preserve the factual statements of the source text?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
  • Consistency (float)
    • Is the generated text consistent in the information it provides?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
  • Informativeness (float)
    • How well does the generated text capture the key ideas of its source text?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
  • Coherence (float)
    • How much does the generated text make sense?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
  • Relevance (float)
    • How well is the generated text relevant to its source text?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf
  • Fluency (float)
    • Is the generated text well-written and grammatical?
    • Lower score is better.
    • Range: [0, inf]
    • Default threshold: inf

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Explanations created by the evaluator

  • llm-eval-results
    • Frame with the evaluation results.
  • llm-heatmap-leaderboard
    • Leaderboards with models and prompts by metric values.
  • work-dir-archive
    • A ZIP archive with evaluator artifacts.

Classification

Classification Evaluator

Question | Expected answer | Retrieved context | Actual answer | Constraints

Binomial and multinomial classification evaluator for LLM models and RAG systems which are used to classify data into two or more classes.

  • Compatibility: RAG and LLM models.

Method

  • The evaluator matches the expected answer (label) with the actual answer (prediction) for each test case and calculates the confusion matrix and metrics such as accuracy, precision, recall, and F1 score for each model (see the sketch below).
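
A minimal sketch with scikit-learn (macro averaging for the multinomial case is an assumption):

# Sketch: classification metrics from expected vs. actual answers.
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_fscore_support

labels = ["spam", "ham", "spam", "ham"]        # expected answers (ground truth)
predictions = ["spam", "ham", "ham", "ham"]    # actual answers from the model

cm = confusion_matrix(labels, predictions)
accuracy = accuracy_score(labels, predictions)
precision, recall, f1, _ = precision_recall_fscore_support(
    labels, predictions, average="macro", zero_division=0
)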

Metrics calculated by the evaluator

  • Accuracy (float)
    • Accuracy metric measures how often the model makes correct predictions using the formula: (True Positives + True Negatives) / Total Predictions.
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
    • Primary metric.
  • Precision (float)
    • Precision metric measures the proportion of the positive predictions that were actually correct using the formula: True Positives / (True Positives + False Positives).
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
  • Recall (float)
    • Recall metric measures the proportion of the actual positive cases that were correctly predicted using the formula: True Positives / (True Positives + False Negatives).
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75
  • F1 (float)
    • F1 metric measures the balance between precision and recall using the formula: 2 * (Precision * Recall) / (Precision + Recall).
    • Higher is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.75

Problems reported by the evaluator

  • If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
  • If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

  • Best performing LLM model based on the evaluated primary metric.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

  • metric_threshold
    • Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator

  • llm-eval-results
    • Frame with the evaluation results.
  • llm-classification-leaderboard
    • Leaderboards with models and prompts by metric values.
