Tutorial 1A: Creating an evaluation using your own model host and test suite

Overview

This tutorial walks you through the process of evaluating large language models (LLMs) using H2O Eval Studio. You will learn how to set up a model host, import a test suite, and create an evaluation to assess an LLM’s performance. By following these steps, you will gain hands-on experience in configuring and running evaluations for LLMs.

Objectives

  • Learn how to add an h2oGPTe RAG-type model host by configuring its URL and API key.
  • Understand the process of importing a test suite that contains predefined test cases for LLM evaluation.
  • Learn how to create an evaluation by selecting the model host, test suite, and evaluators to analyze model performance.

Prerequisites

  • Access to H2O Eval Studio
  • Enterprise h2oGPTe API Key
  • Basic understanding of LLMs and evaluation metrics

Step 1: Add a model host

Let's add a model host for the LLMs we want to evaluate; an optional way to verify the host URL and API key is sketched after these steps.

  1. On the H2O Eval Studio navigation menu, click Model hosts.
  2. On the Model hosts page, click the New model host button.
  3. In the New model host panel, enter the following name in the Model host name box:
    Tutorial 1A model host
  4. In the Description box, enter the following description:
    Test model host for the tutorial 1A
  5. From the Type drop-down menu, select h2oGPTe RAG as the model host type.
    • Enterprise h2oGPTe is a RAG (Retrieval-Augmented Generation) product that utilizes LLMs to generate responses. You can use H2O Eval Studio to evaluate the performance of LLMs hosted by Enterprise h2oGPTe with the RAG functionality.
  6. In the Host URL box, enter the URL address of Enterprise h2oGPTe:
    https://h2ogpte.genai.h2o.ai
  7. In the API key box, enter the API key for Enterprise h2oGPTe.
  8. Click Create.
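
Before clicking Create, you can optionally confirm that the host URL and API key are valid. The following is a minimal sketch, assuming the official h2ogpte Python client is installed and that the API key is exported in a hypothetical H2OGPTE_API_KEY environment variable; method names such as `get_llms()` can differ between client versions, so treat this as an illustration rather than the documented setup.

```python
# Optional sanity check of the Enterprise h2oGPTe URL and API key.
# Assumes `pip install h2ogpte`; H2OGPTE_API_KEY is a hypothetical variable name.
import os

from h2ogpte import H2OGPTE

client = H2OGPTE(
    address="https://h2ogpte.genai.h2o.ai",    # same URL entered in the Host URL box
    api_key=os.environ["H2OGPTE_API_KEY"],     # same key entered in the API key box
)

# Listing the available LLMs succeeds only if the URL and key are correct.
for llm in client.get_llms():
    print(llm.get("base_model", llm))
```

If the call raises an authentication or connection error, recheck the URL and API key before creating the model host.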

Step 2: Import a test suite

Let's import a collection of test cases and documents from the prompt library using the provided URL.

  1. On the H2O Eval Studio navigation menu, click Tests.
  2. Click the Import test suite button.
  3. Enter the following name prefix for the tests in the Test Suite:
    Tutorial 1A test suite
  4. Enter the following description of the tests in the Test Suite:
    Tutorial 1A test suite for the summarization evaluation
  5. In the Import Test Suite dialog, enter the following URL in the JSON or URL field:
    https://eval-studio-artifacts.s3.eu-central-1.amazonaws.com/h2o-eval-studio-suite-library/summarization_frank_test_suite_7p.json
    For more details on importing a test suite in JSON format, see Import Test Suite. An optional way to preview this JSON before importing it is sketched after these steps.
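
If you want to see what the test suite contains before importing it, you can download and inspect the JSON yourself. This is a minimal sketch using the requests package; the exact schema of the suite file may vary between Eval Studio releases, so only the top-level structure is printed.

```python
# Optional: preview the test suite JSON before importing it into H2O Eval Studio.
import json

import requests

SUITE_URL = (
    "https://eval-studio-artifacts.s3.eu-central-1.amazonaws.com/"
    "h2o-eval-studio-suite-library/summarization_frank_test_suite_7p.json"
)

response = requests.get(SUITE_URL, timeout=30)
response.raise_for_status()
suite = response.json()

# Show the top-level structure and a rough size so you know what will be imported.
print("Top-level keys:", list(suite.keys()) if isinstance(suite, dict) else type(suite))
print("Approximate size:", len(json.dumps(suite)), "bytes")
```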

Step 3: Create an evaluation

Let's evaluate the model using the model host we added in Step 1 and the test suite we imported in Step 2.

  1. On the H2O Eval Studio navigation menu, click Evaluations.
  2. Click New evaluation.
  3. In the Create evaluation panel, enter the following name in the Evaluation name box:
    Tutorial 1A model host evaluation
  4. In the Description box, enter the following description:
    Tutorial 1A model evaluation
  5. From the Model host drop-down menu, select Tutorial 1A model host.
  6. From the Tests drop-down menu, select Tutorial 1A test suite.
  7. From the LLM models drop-down menu, select h2oai/h2o-danube3-4b-chat.
  8. In the Evaluators section, add the Tokens presence evaluator. This evaluator checks whether both the retrieved context (for RAG models) and the generated response contain specific required strings; a simplified illustration of this check follows these steps. For more information on different types of evaluators, see Evaluators.
  9. Click Evaluate.
    The new evaluation is displayed under the Evaluations tab.
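
To build intuition for what the Tokens presence evaluator reports, here is a simplified, hypothetical sketch of the underlying idea: for each test case, check whether every required string appears in both the retrieved context and the generated answer. This is not Eval Studio's implementation, and the function and field names are illustrative only.

```python
# Simplified illustration of a tokens-presence style check
# (not the actual H2O Eval Studio evaluator; names are hypothetical).
from typing import Dict, List


def tokens_present(required: List[str], context: str, answer: str) -> Dict[str, bool]:
    """Flag each required string as present in both the context and the answer."""
    return {
        token: token.lower() in context.lower() and token.lower() in answer.lower()
        for token in required
    }


# Example with made-up data:
result = tokens_present(
    required=["net revenue", "2023"],
    context="Retrieved passage: net revenue grew in 2023 ...",
    answer="The report states that net revenue increased in 2023.",
)
print(result)                 # {'net revenue': True, '2023': True}
print(all(result.values()))   # True only if every required string was found
```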

Clicking the new evaluation opens the interactive evaluation dashboard, which includes a summary, leaderboards, evaluation techniques, and more. In Tutorial 2A, you will learn how to interpret evaluation results using this dashboard.

Summary

You have successfully set up a model host, imported a test suite, and conducted an evaluation using H2O Eval Studio. This process allows you to measure LLM performance based on specific evaluation criteria, such as token presence in responses.

Next

Now that you know how to create an evaluation using your own model host and test suite, Tutorial 2A will guide you through interpreting the results of the new evaluation.
