Data curation flow

Overview

The Curate component of H2O LLM DataStudio is a no-code capability for building structured LLM datasets from unstructured data. You can import documents in PDF, DOC, audio, and video file formats and convert them into question-answer pairs, summarization pairs, and file summaries for downstream tasks.

H2O LLM DataStudio uses the Llama2 family of LLMs to generate 10 high-quality question-answer pairs from each chunk of text within a document. By default, each chunk is 4000 characters long and overlaps slightly with adjacent chunks.
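To make the chunking behavior concrete, the following is a minimal Python sketch of fixed-size character chunking with overlap. It is an illustration only, not the H2O LLM DataStudio implementation: the function name chunk_text is hypothetical, and the 200-character overlap is an assumed value, since the documentation only states that the overlap is small.

```python
def chunk_text(text: str, chunk_size: int = 4000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character chunks that overlap their neighbours.

    chunk_size=4000 mirrors the documented default; overlap=200 is an
    assumption made for this sketch only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # distance between consecutive chunk starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With these assumed values, a 10,000-character document yields three chunks, and the last 200 characters of each chunk are repeated at the start of the next one.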

note

The dataset curation component supports documents in multiple languages, including English, Spanish, French, German, Italian, and Portuguese.

The data curation process of H2O LLM DataStudio can be summarized in the following sequential steps:

Step 1: Create a new curate project
Step 2: Upload documents
Step 3: Configure and run pipeline
Step 4: Use the output for data preparation

Each of these steps is described in the sections below.

note

Before starting data curation, integrate h2ogpte by providing the required credentials. For more information, see Settings.

Step 1: Create a new curate project

The first step in the data curation process is to create a new Curate project. On the New Project / Curate Data for LLMs page, click New, then provide a project name and description. To learn how to create a new curate project, see Create a new project for data curation.

Step 2: Upload documents

As the second step in data curation, select the task type of the experiment (question-answer, summarization, or file summary), then upload the document or enter a webpage URL. H2O LLM DataStudio supports the PDF, DOCX, TXT, and MD document formats, the MP3, M4A, and WAV audio formats, and video files. If you have multiple documents, you can upload them together in a ZIP file. For more information, see Create a new project for data curation.
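If your documents sit in a single folder, one way to prepare the ZIP file is with Python's standard zipfile module. The sketch below is an illustration under assumptions: the helper name bundle_documents is hypothetical, only the extensions named on this page are included (video extensions are not listed here, so they are left out), and files are stored at the top level of the archive because this page does not say whether nested folders are accepted.

```python
import zipfile
from pathlib import Path

# Extensions named on this page; video extensions are not listed, so none are assumed.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".txt", ".md", ".mp3", ".m4a", ".wav"}


def bundle_documents(source_dir: str, archive_path: str) -> list[str]:
    """Write every supported file directly inside source_dir to a ZIP archive.

    Returns the names of the files that were added, in sorted order.
    """
    added = []
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(Path(source_dir).iterdir()):
            if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS:
                # Store each file flat, under its own name, at the archive root.
                archive.write(path, arcname=path.name)
                added.append(path.name)
    return added
```

Calling bundle_documents("my_docs", "my_docs.zip") would produce an archive that you can then upload in this step. Files with other extensions are skipped rather than reported as errors.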

Step 3: Configure and run pipeline

After uploading the document, configure Smart chunking (Fast) and Sampling ratio.

If you turn on Smart chunking (Fast), H2O LLM DataStudio identifies the unique information within the dataset, keeps a subset of the best chunks, and generates question-answer pairs from those chunks only. Smart chunking is recommended when you have large documents and want to generate question-answer pairs quickly.

You can specify the sampling ratio manually, or let H2O LLM DataStudio set it automatically based on the length of the document. The sampling ratio determines what proportion of the document's chunks is used: H2O LLM DataStudio selects the best chunks out of all available chunks, which speeds up question-answer pair generation. If the sampling ratio is set to 0, H2O LLM DataStudio chooses the ratio automatically. Otherwise, a value of 0.5 or higher is recommended.
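The effect of an explicit sampling ratio can be sketched in a few lines of Python. This is a rough illustration, not the product's algorithm: the function select_chunks and the per-chunk scores are hypothetical, because this page does not describe how chunks are ranked. The sketch also treats the ratio as a fraction between 0 and 1 and does not model the automatic mode that a value of 0 triggers.

```python
import math


def select_chunks(chunks: list[str], scores: list[float], sampling_ratio: float) -> list[str]:
    """Keep the highest-scoring fraction of chunks, preserving document order.

    scores is a hypothetical per-chunk quality score; how H2O LLM DataStudio
    actually ranks chunks is not documented on this page.
    """
    if len(chunks) != len(scores):
        raise ValueError("chunks and scores must have the same length")
    if not 0 < sampling_ratio <= 1:
        # A ratio of 0 means "choose automatically" in the product; not modelled here.
        raise ValueError("sampling_ratio must be in the range (0, 1]")
    if not chunks:
        return []

    keep = max(1, math.ceil(sampling_ratio * len(chunks)))
    # Rank chunk indices by score, best first, and keep the top ones.
    best = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:keep]
    # Return the kept chunks in their original order.
    return [chunks[i] for i in sorted(best)]
```

Under these assumptions, a ratio of 0.5 applied to four chunks keeps the two with the highest scores, so roughly half as many chunks are sent for question-answer generation.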

After configuring the uploaded document, click Run pipeline. For more information, see the instructions for creating a new project for data curation.

Step 4: Use the output for data preparation

Once Curate has generated the output, you can feed the dataset into the data preparation flow. Inside the new Curation project, click Publish as Preparation Project to publish the Curation project as a Preparation project in the data preparation flow. For more information, see View a specific Curation project.

Video guide

Watch this video guide for a walkthrough of the H2O LLM DataStudio interface and to learn more about its data curation process.

