Tutorial 1B: Convert documents into Q&A pairs for data preparation

Overview

In this tutorial, you'll learn how to convert documents into question-answer pairs using H2O LLM DataStudio. These pairs can be used for data preparation tasks, such as transforming and cleaning datasets.

Prerequisites

Before you start, ensure you have the following:

Step 1: Create a new project

To begin curating data, create a new project:

  1. On the H2O LLM DataStudio left navigation menu, click Curate.
  2. On the All Projects / Curate Data for LLMs page, click New.
    Note: If this is your first time creating a project, you must first integrate h2oGPTe by providing the required credentials. You cannot create a project without configuring h2oGPTe. For more information, see Settings.

  3. In this tutorial, we will upload a research paper on chronic migraine diagnosis and treatment to generate question-answer pairs. In the Project name text box, let's enter migraine-treatment-curate as the name for the new project.
  4. In the Description text box, let's enter chronic migraine diagnosis and treatment as the description for the project.
  5. In the Document description text box, let's provide a brief description of the file's content. This label helps you quickly categorize the document and identify its purpose. Since we are uploading a research paper, type Research Paper.
  6. From the Task type dropdown menu, select Question-answer task.
  7. Download the research paper on chronic migraine diagnosis and treatment to your computer. Once the file is downloaded, follow these steps:
    1. Click Browse to open the file selection dialog.
    2. Locate and select the downloaded file from your computer.
    3. Alternatively, you can drag and drop the file into the designated area.
  8. Click Upload to upload the document.

Step 2: Configure settings

After uploading your document, configure the following settings. For this tutorial, let's keep the default settings as specified in each step:

  1. In the LLM selection section, select your preferred h2oGPTe LLM from the available options. For this tutorial, we will keep the default LLM selection.
  2. Choose your preferred relevance score from the dropdown menu. For this tutorial, use the default relevance score, which is the Bert approach.
  3. Use the slider labeled Number of tokens per chunk to set the maximum number of tokens per chunk of text processed by the model. The default value is 1,000 tokens. Keep this default value.
  4. Keep the More customization settings at their default values.
  5. Leave the Perform Smart Chunking (Fast) option disabled, which is the default. Enable it only when you are processing large documents (hundreds or thousands of pages). This feature speeds up chunking, but it may produce too few records for fine-tuning.
  6. Leave the Sampling ratio for smart chunking at its default value of 0. At 0, LLM DataStudio automatically selects the sampling ratio based on the document length.
  7. Leave the Use h2oGPTe's ingestion pipeline option enabled, which is the default. This toggle switches between h2oGPTe's ingestion pipeline and the default LLM DataStudio pipeline.
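
To make the Number of tokens per chunk setting more concrete, here is a minimal Python sketch of fixed-size chunking. It is an illustration only, not LLM DataStudio's implementation: the function name `chunk_by_tokens` is hypothetical, and tokens are approximated by whitespace-separated words, whereas a real pipeline would presumably count tokens with the selected LLM's tokenizer.

```python
from typing import List


def chunk_by_tokens(text: str, max_tokens: int = 1000) -> List[str]:
    """Split text into consecutive chunks of at most max_tokens tokens.

    Assumption: a "token" here is a whitespace-separated word. This is a
    rough stand-in for a model tokenizer, used only for illustration.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    words = text.split()
    # Step through the word list in strides of max_tokens.
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]


# Usage: a 2,500-word document yields three chunks at the default limit.
document = " ".join(["word"] * 2500)
chunks = chunk_by_tokens(document)
chunk_sizes = [len(chunk.split()) for chunk in chunks]
```

Under this approximation, a smaller limit produces more, shorter chunks, and a larger limit produces fewer, longer ones.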

Step 3: Run pipeline

Now that you’ve configured all the necessary settings, it’s time to execute the pipeline and begin the data processing.

  1. Click Run pipeline to start the process.

Step 4: View the project

To view and interact with your project:

  1. In the H2O LLM DataStudio left navigation menu, click Curate.
  2. Select the newly created project by clicking on its name.

You can view the table of question-answer pairs along with other details, such as the project status, project details, and the number of pairs. For a complete list of what you can view, see View a specific Curate project.
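
If you want to picture what a question-answer pair looks like as data, the following Python sketch models each pair as a small record and serializes a list of pairs to JSON Lines. This is an assumption for illustration: the `QAPair` class, its `question` and `answer` field names, and the helper functions are hypothetical and do not describe LLM DataStudio's actual storage or export format. The sample strings are placeholders, not output from the tutorial document.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable, List


@dataclass
class QAPair:
    """One question-answer record (hypothetical field names)."""
    question: str
    answer: str


def to_jsonl(pairs: Iterable[QAPair]) -> str:
    """Serialize pairs as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(p), ensure_ascii=False) for p in pairs)


def from_jsonl(text: str) -> List[QAPair]:
    """Parse JSON Lines back into QAPair records, skipping blank lines."""
    return [QAPair(**json.loads(line)) for line in text.splitlines() if line.strip()]


# Usage with placeholder content.
pairs = [
    QAPair(question="Placeholder question 1?", answer="Placeholder answer 1."),
    QAPair(question="Placeholder question 2?", answer="Placeholder answer 2."),
]
serialized = to_jsonl(pairs)
restored = from_jsonl(serialized)
```

One record per line keeps each pair independent, which makes it easy to filter or deduplicate pairs in a later preparation step.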

Step 5: Publish the dataset as a preparation project

To use the generated question-answer pairs in the data preparation process:

  1. In the Output section, select Publish as Preparation Project from the dropdown menu.
  2. Click Execute.
  3. In the Publish as Prepare Project window, let's keep the Project name and Project description as they are, and select question answering as the task type.
  4. Click Publish to use the generated question-answer pairs in the Data preparation flow.

Summary

In this tutorial, we learned how to convert a document into question-answer pairs using H2O LLM DataStudio. We created a new project, configured the settings, and ran the pipeline. Finally, we published the curated dataset as a preparation project so that the generated pairs can be used in the data preparation flow.

