
Data curation flow

Overview

The Curate component of H2O LLM DataStudio is a no-code capability for building structured LLM datasets from unstructured data. You can import documents in PDF, DOC, audio, and video file formats and convert them into question-answer pairs and summarization pairs for downstream tasks.

note

The dataset curation component supports documents in multiple languages, including English, Spanish, French, German, Italian, and Portuguese.

The data curation process of H2O LLM DataStudio can be summarized in the following sequential steps:

1. Create a new Curate project
2. Upload documents
3. Configure and run pipeline
4. Use the output for data preparation

Each of these steps is described in the sections below.

note

Before you start data curation, integrate h2ogpte or Gradio Client by providing the required credentials. For more information, see Settings.

Step 1: Create a new curate project

The first step in the data curation process is to create a new Curate project. On the New Project / Curate Data for LLMs page, click New, then provide the project name and description. For detailed instructions, see Create a new project for data curation.

Step 2: Upload documents

As the second step in data curation, select the task type of the experiment (question-answer, summarization, or file summary), then upload the document or enter the webpage URL. H2O LLM DataStudio supports PDF, DOCX, TXT, and MD documents, MP3, M4A, and WAV audio files, and video files. If you have multiple documents, you can upload them together in a ZIP file. For more information, see Create a new project for data curation.
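If your documents are spread across a folder, bundling them into a ZIP file before upload can be scripted. The following is a minimal sketch using only the Python standard library. The function name `bundle_documents` and the `SUPPORTED` set are illustrative assumptions, not part of H2O LLM DataStudio. The set lists only the extensions named above; video extensions are not named in this page, so none are included.

```python
import zipfile
from pathlib import Path

# Extensions named in the text above. Video extensions are not listed
# there, so this illustrative set leaves them out.
SUPPORTED = {".pdf", ".docx", ".txt", ".md", ".mp3", ".m4a", ".wav"}


def bundle_documents(source_dir, archive_path):
    """Zip every supported file under source_dir and return the archived names.

    Hypothetical helper for illustration; it is not a DataStudio API.
    """
    source = Path(source_dir)
    added = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        # Sort for a deterministic archive order.
        for path in sorted(source.rglob("*")):
            if path.is_file() and path.suffix.lower() in SUPPORTED:
                # Store paths relative to the source folder.
                arcname = path.relative_to(source).as_posix()
                zf.write(path, arcname)
                added.append(arcname)
    return added
```

For example, `bundle_documents("my_docs", "my_docs.zip")` would produce an archive that you can then upload in this step.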

Step 3: Configure and run pipeline

After uploading the document, configure Smart chunking and Sampling ratio.

If you turn smart chunking on, H2O LLM DataStudio finds the unique information within the dataset, filters for some of the best chunks, and generates question-answer pairs from those chunks. Smart chunking is recommended when you have large documents and want to generate question-answer pairs quickly.
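To give a feel for what "finding the unique information" could mean, here is a rough sketch that drops near-duplicate chunks using Jaccard similarity over word sets. This is an assumed stand-in technique for illustration only; the page does not describe how smart chunking is actually implemented, and the names `unique_chunks` and `threshold` are hypothetical.

```python
def word_set(text):
    """Lowercased set of whitespace-separated words in a chunk."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def unique_chunks(chunks, threshold=0.8):
    """Keep a chunk only if it is not too similar to any chunk already kept.

    Illustrative stand-in for a uniqueness filter; the threshold value
    is an assumption, not a DataStudio setting.
    """
    kept = []
    kept_sets = []
    for chunk in chunks:
        words = word_set(chunk)
        if all(jaccard(words, seen) < threshold for seen in kept_sets):
            kept.append(chunk)
            kept_sets.append(words)
    return kept
```

Filtering repeated content this way leaves fewer chunks to process, which is consistent with the speed benefit described above for large documents.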

You can specify the sampling ratio manually, or let H2O LLM DataStudio set it automatically based on the length of the document. The sampling ratio determines what proportion of the available chunks is sampled: the best chunks are selected so that question-answer pairs are generated faster. If the sampling ratio is set to 0, H2O LLM DataStudio automatically chooses the best ratio. Otherwise, a value greater than or equal to 0.5 is recommended.
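The effect of a sampling ratio can be sketched as keeping the top-scoring fraction of chunks. The sketch below assumes the ratio is a fraction between 0 and 1 and that each chunk already has a quality score. Both are assumptions: the page does not say how chunks are scored, and the automatic mode (ratio 0) is not modeled here. The function name `sample_chunks` is hypothetical.

```python
import math


def sample_chunks(chunks, scores, ratio):
    """Keep the top `ratio` fraction of chunks by score, in document order.

    Illustrative only: the scoring and the (0, 1] range are assumptions,
    and the automatic ratio selection used for ratio 0 is not modeled.
    """
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in the range (0, 1]")
    if len(chunks) != len(scores):
        raise ValueError("chunks and scores must have the same length")
    if not chunks:
        return []
    # Round up so that a non-empty input always keeps at least one chunk.
    keep = math.ceil(ratio * len(chunks))
    # Indices of the highest-scoring chunks.
    top = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:keep]
    # Restore the original document order.
    return [chunks[i] for i in sorted(top)]
```

With four chunks and a ratio of 0.5, this keeps the two highest-scoring chunks. A lower ratio leaves fewer chunks to generate question-answer pairs from, trading coverage for speed.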

After configuring these settings, click Run pipeline. For more information, see Create a new project for data curation.

Step 4: Use the output for data preparation

Once Curate has generated the output, you can use the dataset as input to the data preparation flow. Inside the Curation project, click (Publish as Preparation Project) to publish it as a Preparation project in the data preparation flow. For more information, see View a specific Curation project.

