Problems and insights
Overview
In H2O Eval Studio, evaluations can reveal various problems and insights about the performance of the evaluated models. The Problems and Insights tabs help users understand the strengths and weaknesses of their RAGs/LLMs, identify areas for improvement, make informed decisions about RAGs/LLMs, and act to mitigate risks.
Problems
Problems are specific issues or challenges identified during the evaluation of a RAG/LLM. They can include:
- Performance issues: Problems related to the RAG/LLM's ability to generate relevant and accurate responses.
- Bias and fairness: Issues related to bias in the RAG/LLM's predictions, which can lead to unfair treatment of certain groups or individuals.
- Robustness: Problems related to the RAG/LLM's ability to handle variations in input data, such as adversarial attacks or unexpected inputs.
Problems are categorized into different types, such as:
- Privacy problems: Issues related to the handling of sensitive data, such as compliance with data protection regulations or potential data leaks.
- Security problems: Issues related to the RAG/LLM's vulnerability to attacks, such as adversarial inputs or data poisoning.
- Fairness problems: Issues related to bias and discrimination in the RAG/LLM's predictions, which can lead to unfair treatment of certain groups or individuals.
- Storage problems: Issues related to the storage requirements of the RAG/LLM, such as disk space.
- Accuracy problems: Issues related to the RAG/LLM's accuracy.
- Retrieval problems: Issues related to the RAG/LLM's ability to retrieve relevant information from a knowledge base or dataset.
- Data quality problems: Issues related to the quality of the input data, such as missing values or incorrect labels.
- Stability problems: Issues related to the RAG/LLM's consistency and reliability over time or across different and/or perturbed datasets.
- Efficiency problems: Issues related to the RAG/LLM's resource usage, such as execution convergence, space consumption or processing time.
- Cost problems: Issues related to the financial cost of using the RAG/LLM, such as exceeding API usage fee limits.
- Runtime problems: Issues that occur during the execution of the RAG/LLM, such as crashes or timeouts.
Problems have the following attributes:
- Description: A short description of the problem.
- Severity: The severity of the problem, such as high, medium, or low.
- Problem type: The type of the problem, such as privacy, security, fairness, or storage, as described above.
- Problem code: The AVID taxonomy code of the problem.
- Problem attributes: The attributes of the problem, such as model name, evaluator name, and test case.
- Actions: The actions that can be taken to mitigate the problem.
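The attributes above map naturally onto a simple record type. The following Python sketch shows one plausible way to represent such a problem report; the class and field names are illustrative assumptions, not H2O Eval Studio's actual API:

```python
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    """Severity levels as described above."""
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass
class Problem:
    """Hypothetical problem report mirroring the attributes above."""
    description: str    # short description of the problem
    severity: Severity  # high, medium, or low
    problem_type: str   # e.g. "privacy", "security", "fairness", "storage"
    problem_code: str   # AVID taxonomy code, e.g. "AVID/P0200"
    # model name, evaluator name, test case reference, ...
    attributes: dict = field(default_factory=dict)
    # suggested mitigation actions
    actions: list[str] = field(default_factory=list)
```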
Problems detected by the evaluation can be found in the Problems tab of the evaluation, where they are grouped by evaluator and severity.
Problem example
H2O Eval Studio is able to detect whether a metric score flipped, from pass to fail or vice versa, between an original and a perturbed prompt. A test case (question) passes the evaluation if the metric score calculated by the evaluator is above the user-defined threshold (when a higher metric score is better). A flip is reported as a stability problem because the RAG/LLM's performance is inconsistent and the perturbation influenced it:
- Description: Model robustness problem detected in case of prompt perturbation: metric 'groundedness' value flipped from pass to fail in case of answers generated by the model 'llama-3.2'. ORIGINAL prompt: 'What is effective challenge of models?', PERTURBED prompt: 'What, effective is challenge, of models?'.
- Severity: high
- Problem type: robustness
- Problem code: AVID/P0200 ("Ability for the AI to perform as intended")
- Problem attributes:
  - Model name: llama-3.2
  - Evaluator name: Groundedness
  - Test case: reference to the test case with question, expected answer, actual answer, retrieved context, and other metadata.
- Actions: Perform sensitivity analysis on various perturbation types and intensities to explore the model's robustness with regard to the specified perturbations. Please refer to the explanation for more details.
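Mechanically, the flip check reduces to comparing the pass/fail outcome of the original and the perturbed prompt against the threshold. Below is a minimal sketch of that comparison; the function name and its convention that a higher score is better are assumptions for illustration:

```python
def metric_flipped(original_score: float,
                   perturbed_score: float,
                   threshold: float,
                   higher_is_better: bool = True) -> bool:
    """Return True if the pass/fail outcome differs between the original
    and the perturbed prompt, i.e. a stability/robustness problem."""
    def passes(score: float) -> bool:
        return score > threshold if higher_is_better else score < threshold

    return passes(original_score) != passes(perturbed_score)


# Example: groundedness passes (0.82 > 0.75) on the original prompt but
# fails (0.41 <= 0.75) on the perturbed one, so the metric flipped.
assert metric_flipped(0.82, 0.41, threshold=0.75)
```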
The problem above can be used to perform sensitivity analysis, for example using the MRM workflow, on various perturbation types and intensities to explore the model's robustness with regard to the specified perturbations. Based on the results of the sensitivity analysis, the user can make informed decisions and act to mitigate the problem, for example by improving the system prompt, choosing a different LLM, or extending the RAG corpus with documents that are more diverse and help the model generate grounded responses regardless of the quality of the question.
Insights
Insights are potentially valuable observations or findings derived from the evaluation of a RAG/LLM. They can include:
- Performance comparison: H2O Eval Studio can be used to compare the performance of the different LLMs used by RAG systems, or to compare the performance of the RAG systems themselves. It reports the best-performing and worst-performing LLMs (see the sketch after this list).
- Accuracy: H2O Eval Studio can identify test cases (questions) where the RAG/LLMs' ability to generate accurate responses is suboptimal - the most difficult questions for the RAG/LLMs to answer.
- Performance: H2O Eval Studio can identify the fastest agents, RAGs, and LLMs, which answer questions in the shortest time, as well as the slowest systems.
- Cost: H2O Eval Studio can identify the cheapest agents, RAGs, and LLMs, which answer questions at the lowest cost. The evaluation can also reveal expensive agents, RAGs, and LLMs that answer questions at the highest cost and are better avoided in production.
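As a rough illustration of the performance-comparison insight, ranking models by their mean metric score across test cases is enough to surface the best and worst performers. The model names and score layout below are hypothetical:

```python
from statistics import mean

# Hypothetical evaluation results: metric scores per model, one per test case.
scores = {
    "llama-3.2":   [0.91, 0.78, 0.85],
    "mistral-7b":  [0.74, 0.69, 0.81],
    "gpt-4o-mini": [0.88, 0.90, 0.79],
}

# Rank models by mean score: the first entry is the best performer,
# the last is the worst.
ranked = sorted(scores, key=lambda model: mean(scores[model]), reverse=True)
print("best:", ranked[0], "worst:", ranked[-1])
```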
Insights are categorized into the same categories as problems.
Insights have the following attributes:
- Description: A short description of the insight.
- Insight type: The type of the insight, such as performance comparison, accuracy, performance, or cost.
- Insight attributes: The attributes of the insight, such as model name, evaluator name, and test case.
- Actions: The actions that can be taken to capitalize on the insight.
Insights identified by the evaluation can be found in the Insights tab of the evaluation, where they are grouped by type.
Insight example
H2O Eval Studio can identify the most difficult questions for RAG/LLMs to answer — the test cases where their ability to generate accurate responses is suboptimal. Such test cases are valuable because they are the root cause of suboptimal scores across all evaluated systems. Therefore, this is where the choice of the best model should be made — choosing a model based on test cases where all models perform well makes no difference.
- Description: Question 'What is informed conservatism?' is the most difficult question to be correctly answered by RAG systems according to Groundedness evaluator.
- Insight type: weak point
- Actions: A detailed description of the failures, questions, and answers, which helps identify the weaknesses and strengths of the model and their root causes, can be found in the explanation. Check the prompt, expected answer, and condition: are they correct? Check the model's answers in failed cases and look for a common denominator and/or root cause of these failures.
The insight above can help the user make the final choice of the LLM used by the evaluated RAG system in the production environment.
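Finding such weak points amounts to ranking the questions by how poorly all evaluated systems score on them. Below is a minimal sketch of that aggregation, with a hypothetical score layout:

```python
from statistics import mean

# Hypothetical metric scores: question -> one score per evaluated system.
scores_by_question = {
    "What is informed conservatism?": [0.31, 0.28, 0.40],
    "What is effective challenge?":   [0.85, 0.91, 0.88],
    "How are model risks tiered?":    [0.72, 0.66, 0.70],
}

# The most difficult question is the one with the lowest mean score
# across all evaluated RAG/LLM systems.
hardest = min(scores_by_question, key=lambda q: mean(scores_by_question[q]))
print("most difficult question:", hardest)
```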
Conclusion
By identifying problems and insights during the evaluation process, users can make informed decisions about their RAG/LLMs, improve their performance, and mitigate risks associated with their deployment. H2O Eval Studio provides a comprehensive framework for evaluating RAG/LLMs and uncovering these important aspects.