Augment

Augmentation lets you blend your datasets with publicly available datasets preloaded in H2O LLM DataStudio to add variety. In some cases, you can combine your datasets with RLHF-related datasets to cover more aspects of a domain. The Augment tab displays a catalog of datasets that can be used immediately in the Prepare pipeline. You can also bring your own datasets for augmentation. Augmentation settings are configured in the Configuration step of the Prepare pipeline.
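Conceptually, blending means adding rows from a compatible public dataset to your own rows. The following is a minimal sketch of that idea in plain Python, not H2O LLM DataStudio's implementation: the `augment` function, its parameters, and the sample rows are all hypothetical, and the sketch assumes both datasets share the same column names.

```python
import random

def augment(own_rows, public_rows, extra=2, seed=0):
    """Return own_rows plus up to `extra` rows sampled from public_rows.

    Hypothetical helper for illustration only; it is not part of DataStudio.
    """
    if not own_rows:
        return []
    # Keep only public rows whose columns match the user's dataset.
    columns = set(own_rows[0])
    compatible = [row for row in public_rows if set(row) == columns]
    # A seeded generator keeps the selection reproducible.
    rng = random.Random(seed)
    picked = rng.sample(compatible, min(extra, len(compatible)))
    return own_rows + picked

# Invented example rows.
own = [{"question": "What is our refund window?", "answer": "30 days"}]
public = [
    {"question": "Who wrote Hamlet?", "answer": "William Shakespeare"},
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "How many days are in a week?", "answer": "7"},
]

blended = augment(own, public, extra=2)
print(len(blended))  # 3
```

The user's rows come first and the sampled public rows are appended, so the original data is never modified.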

Instructions

  1. On the H2O LLM DataStudio left navigation menu, click Augment.
  2. Click the name of a dataset to preview it.

Augment Datasets (RLHF, Improve Content, Extra Rows)

H2O LLM DataStudio provides 18 datasets for different workflow types that you can use to augment your input datasets during the Configuration step of data preparation. The catalog also contains RLHF-related datasets for the question answering, text summarization, instruct tuning, and human-bot conversation problem types.

The DataCatalog table contains the following information about augmentation datasets:

  • Name: The name of the dataset.
  • Description: A brief description of the content of the dataset and its purpose.
  • Workflows: The target workflow/task type of the dataset.
  • Rows: The number of rows in the dataset.
  • Cols: The number of columns in the dataset.
  • URL: The source of the dataset.
  • License: The Open Data Commons license under which the dataset owner released the dataset.
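The catalog fields above can be pictured as one record per dataset. The sketch below models that in Python purely for illustration; the `CatalogEntry` class and `by_workflow` helper are hypothetical and do not correspond to any DataStudio API. The Self Instruct values come from this page, and fields this page does not state are left as `None`.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class CatalogEntry:
    """One row of the catalog table (hypothetical model, not a DataStudio class)."""
    name: str
    description: str
    workflows: str
    rows: Optional[int] = None      # None when the count is not stated here
    cols: Optional[int] = None
    url: Optional[str] = None       # source of the dataset
    license: Optional[str] = None

def by_workflow(entries: List[CatalogEntry], workflow: str) -> List[CatalogEntry]:
    """Return the entries whose workflow type matches, ignoring case."""
    return [e for e in entries if e.workflows.lower() == workflow.lower()]

catalog = [
    CatalogEntry(
        name="Self Instruct",
        description="The Self Instruct dataset contains prompt-reply pairs.",
        workflows="Instruct Tuning",
        rows=448,
        cols=2,
    ),
    CatalogEntry(
        name="RLHF EE QA",
        description="The RLHF dataset for Q&A problems.",
        workflows="Question Answering",  # assumed label; check the catalog for the exact value
    ),
]

for entry in by_workflow(catalog, "instruct tuning"):
    print(entry.name, entry.rows, entry.cols)  # Self Instruct 448 2
```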

Preloaded datasets

Stanford Q&A Dataset

The Stanford Question Answering Dataset (SQuAD) consists of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question comes from the corresponding reading passage.

QG-Bench Subset by SQuAD

The SQuAD dataset for the question generation task.

Tweet base Q&A

A Q&A dataset in which each entry has a short tweet, a question, and a text phrase as the answer.

RLHF EE QA

The RLHF dataset for Q&A problems.

News Article Summary

The News Article Summary dataset contains summaries of news articles from different newspapers.

Costco Article Summary

The Costco article text summarization dataset.

Dialogue Summary

The dialogue summarization dataset.

RLHF OpenAI Summaries

The RLHF OpenAI Summaries dataset contains a sample of 5000 items from the CarperAI RLHF summarization dataset, which is based on Reddit thread summaries.

Code QA

The Code QA dataset contains prompt-reply pairs where the prompt is to create a Python function which satisfies the functionality described in a specified docstring. The responses are the generated functions.

Python QA

The Python QA dataset contains prompt-reply pairs where the prompt is to create a Python unit test which tests for the functionality described in a specific docstring. The responses are the generated unit tests.
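To make the two formats concrete, here is a sketch of what a Code QA pair and a Python QA pair might look like. Both records are invented for illustration and are not rows from either dataset, and the `prompt` and `reply` column names are assumptions. The last lines run the generated test against the generated function to show how the two kinds of reply relate.

```python
# Invented examples; not taken from the Code QA or Python QA datasets.
code_qa_pair = {
    "prompt": (
        "Write a Python function that satisfies this docstring:\n"
        '"""Return the sum of two integers a and b."""'
    ),
    "reply": "def add(a, b):\n    return a + b\n",
}

python_qa_pair = {
    "prompt": (
        "Write a Python unit test for the function described by this docstring:\n"
        '"""Return the sum of two integers a and b."""'
    ),
    "reply": (
        "def test_add():\n"
        "    assert add(2, 3) == 5\n"
        "    assert add(-1, 1) == 0\n"
    ),
}

# Load both replies into one namespace so the test can see the function.
namespace = {}
exec(code_qa_pair["reply"], namespace)
exec(python_qa_pair["reply"], namespace)

# Raises AssertionError if the generated function fails the generated test.
namespace["test_add"]()
print("generated test passed")
```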

Self Instruct

The Self Instruct dataset contains prompt-reply pairs.

  • Workflow type: Instruct Tuning
  • Number of rows: 448
  • Number of columns: 2

RLHF Instruct Tuning

The RLHF Instruct Tuning dataset contains a technical Q&A set based on an RLHF dataset.

Human Assistance Dataset

The Human Assistance Dataset contains Human-Assistance style conversations, sampled to 5000 rows.

Biomedical Human Assistance

The Biomedical Human Assistance dataset contains User-Assistant style conversations on biomedical topics.

User Assistant Conversations

The User Assistant Conversations dataset contains User-Assistant style conversations, sampled to 5000 rows.

Anthropic RLHF Dataset sample

The Anthropic RLHF Dataset sample contains human preference data about helpfulness and harmlessness, meant for training preference (or reward) models for subsequent RLHF training. This dataset is a sample of 1000 entries.
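Several preloaded datasets are described as fixed-size samples of larger sources (5000 rows, or 1000 entries in this case). How those samples were drawn is not documented here; the sketch below shows one common approach, seeded uniform sampling without replacement. The `sample_rows` helper, the seed, and the synthetic rows are hypothetical.

```python
import random

def sample_rows(rows, n=5000, seed=42):
    """Return n rows drawn uniformly without replacement (hypothetical helper).

    If the dataset already has n rows or fewer, return a copy unchanged.
    """
    if len(rows) <= n:
        return list(rows)
    # A seeded generator makes the sample reproducible across runs.
    return random.Random(seed).sample(rows, n)

# Synthetic stand-in for a large conversation dataset.
conversations = [{"id": i, "text": f"conversation {i}"} for i in range(20000)]

subset = sample_rows(conversations)
print(len(subset))  # 5000
```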

BERT Pre-training

The BERT Dataset for pretraining.

TWT Eval

The TWT Eval pretraining dataset.

