Augment

Augmentation lets you blend your datasets with publicly available datasets that are preloaded in H2O LLM DataStudio, adding variety to your data. In some cases, you can also integrate your datasets with RLHF-related datasets to cover more aspects of a domain. The Augment tab displays a catalog of datasets that can be used immediately in the Prepare pipeline. You can also bring your own datasets for augmentation. Augmentation settings are configured in the Configuration step of the Prepare pipeline.
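Conceptually, blending amounts to appending rows from a public dataset to your own rows. The following is a minimal sketch of that idea in plain Python, not the H2O LLM DataStudio implementation or API: the function name `augment_rows`, its parameters, and the toy rows are all hypothetical and exist only for illustration. In the product itself, augmentation is configured through the Configuration step of the Prepare pipeline.

```python
import random


def augment_rows(own_rows, public_rows, max_extra, seed=0):
    """Return own_rows followed by up to max_extra rows sampled from public_rows.

    Hypothetical helper: both inputs are lists of dicts that share column names.
    """
    if own_rows and public_rows:
        own_cols = set(own_rows[0])
        public_cols = set(public_rows[0])
        if own_cols != public_cols:
            raise ValueError(
                f"column mismatch: {sorted(own_cols)} vs {sorted(public_cols)}"
            )
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = min(max_extra, len(public_rows))
    extra = rng.sample(public_rows, k)  # sample without replacement
    return list(own_rows) + extra


# Toy data, invented for this example.
own = [{"question": "own question", "answer": "own answer"}]
public = [
    {"question": "public question 1", "answer": "public answer 1"},
    {"question": "public question 2", "answer": "public answer 2"},
    {"question": "public question 3", "answer": "public answer 3"},
    {"question": "public question 4", "answer": "public answer 4"},
]

blended = augment_rows(own, public, max_extra=2)
print(len(blended))  # 3: one own row plus two sampled public rows
```

The column check reflects the assumption that a blended dataset is only useful when both sources share the same schema, for example a question column and an answer column for a question answering workflow.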

Instructions

On the H2O LLM DataStudio left navigation menu, click Augment.
Click the name of a dataset to preview it.

Augment Datasets (RLHF, Improve Content, Extra Rows)

H2O LLM DataStudio provides 18 datasets for different workflow types that you can use to augment your input datasets during the Configuration step of data preparation. The catalog also includes RLHF-related datasets for the question answering, text summarization, instruct tuning, and human-bot conversation problem types.

The DataCatalog table contains the following information about augmentation datasets:

Name: The name of the dataset.
Description: A brief description of the content of the dataset and its purpose.
Workflows: The target workflow/task type of the dataset.
Rows: The number of rows in the dataset.
Cols: The number of columns in the dataset.
URL: The source of the dataset.
License: The license under which the dataset owner released the dataset (for example, cc-by-4.0 or mit).
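The fields above map naturally onto a small record type. The sketch below models a few catalog entries and filters them by workflow type. It is illustrative only: the `CatalogEntry` class and the `by_workflow` helper are hypothetical names, not part of H2O LLM DataStudio, and the three entries are copied from the dataset listings on this page.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CatalogEntry:
    """One row of the catalog table (hypothetical model of the fields above)."""
    name: str
    description: str
    workflow: str
    rows: int
    cols: int
    url: Optional[str] = None      # some entries list no source URL
    license: Optional[str] = None  # some entries list no license


CATALOG = [
    CatalogEntry(
        "RLHF EE QA", "The RLHF dataset for Q&A problems.",
        "Question Answering", 180, 3,
        "https://huggingface.co/datasets/kastan/EE_QA_for_RLHF", "mit",
    ),
    CatalogEntry(
        "Costco Article Summary", "The Costco article text summarization dataset.",
        "Text Summarization", 86, 2,
        "https://huggingface.co/datasets/awinml/costco_long_practice", "mit",
    ),
    CatalogEntry(
        "Self Instruct", "The Self Instruct dataset contains prompt-reply pairs.",
        "Instruct Tuning", 448, 2,
    ),
]


def by_workflow(catalog: List[CatalogEntry], workflow: str) -> List[CatalogEntry]:
    """Return the entries whose workflow type matches exactly."""
    return [entry for entry in catalog if entry.workflow == workflow]


for entry in by_workflow(CATALOG, "Question Answering"):
    print(entry.name, entry.rows, entry.cols)
```

Making `url` and `license` optional mirrors the listings on this page, where some datasets omit one or both fields.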

Preloaded datasets

Stanford Q&A Dataset

The Stanford Question Answering Dataset (SQuAD) consists of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is taken from the corresponding reading passage.

Workflow type: Question Answering
Number of rows: 5000
Number of columns: 3
URL: https://huggingface.co/datasets/squad
License: cc-by-4.0

QG-Bench Subset by SQuAD

The SQuAD dataset adapted for the question generation task.

Workflow type: Question Answering
Number of rows: 6283
Number of columns: 3
URL: https://huggingface.co/datasets/lmqg/qg_squad
License: cc-by-4.0

Tweet base Q&A

A Q&A dataset in which each entry contains a short tweet, a question, and a text phrase as the answer.

Workflow type: Question Answering
Number of rows: 10692
Number of columns: 3
URL: https://huggingface.co/datasets/tweet_qa
License: cc-by-sa-4.0

RLHF EE QA

The RLHF dataset for Q&A problems.

Workflow type: Question Answering
Number of rows: 180
Number of columns: 3
URL: https://huggingface.co/datasets/kastan/EE_QA_for_RLHF
License: mit

News Article Summary

The news article summary dataset contains summaries of news articles from different newspapers.

Workflow type: Text Summarization
Number of rows: 4515
Number of columns: 2
URL: https://www.kaggle.com/datasets/sunnysai12345/news-summary
License: gpl-2.0

Costco Article Summary

The Costco article text summarization dataset.

Workflow type: Text Summarization
Number of rows: 86
Number of columns: 2
URL: https://huggingface.co/datasets/awinml/costco_long_practice
License: mit

Dialogue Summary

The dialogue summarization dataset.

Workflow type: Text Summarization
Number of rows: 12460
Number of columns: 2
URL: https://huggingface.co/datasets/knkarthick/dialogsum
License: mit

RLHF OpenAI Summaries

The RLHF OpenAI Summaries dataset contains a sample of 5000 rows from the CarperAI RLHF summarization dataset, which is based on Reddit thread summaries.

Workflow type: Text Summarization
Number of rows: 5000
Number of columns: 2
URL: https://huggingface.co/datasets/CarperAI/openai_summarize_comparisons

Code QA

The Code QA dataset contains prompt-reply pairs where the prompt is to create a Python function which satisfies the functionality described in a specified docstring. The responses are the generated functions.

Workflow type: Instruct Tuning
Number of rows: 591
Number of columns: 2
URL: https://huggingface.co/datasets/OllieStanley/humaneval-mbpp-codegen-qa

Python QA

The Python QA dataset contains prompt-reply pairs where the prompt is to create a Python unit test which tests for the functionality described in a specific docstring. The responses are the generated unit tests.

Workflow type: Instruct Tuning
Number of rows: 591
Number of columns: 2
URL: https://huggingface.co/datasets/OllieStanley/humaneval-mbpp-testgen-qa

Self Instruct

The Self Instruct dataset contains prompt-reply pairs.

Workflow type: Instruct Tuning
Number of rows: 448
Number of columns: 2

RLHF Instruct Tuning

The RLHF Instruct Tuning dataset contains a technical Q&A set based on an RLHF dataset.

Workflow type: Instruct Tuning
Number of rows: 337
Number of columns: 2
URL: https://huggingface.co/datasets/kastan/rlhf-qa-comparisons

Human Assistance Dataset

The Human Assistance Dataset contains Human-Assistance style conversations, sampled to 5000 rows.

Workflow type: Human Bot Conversations
Number of rows: 33143
Number of columns: 1
URL: https://huggingface.co/datasets/Dahoas/first-instruct-human-assistant-prompt

Biomedical Human Assistance

The Biomedical Human Assistance dataset contains User-Assistant style conversations on biomedical topics.

Workflow type: Human Bot Conversations
Number of rows: 10000
Number of columns: 1
URL: https://huggingface.co/datasets/ericyu3/openassistant_inpainted_dialogs_5k_biomedical
License: apache-2.0

User Assistant Conversations

The User Assistant Conversations dataset contains User-Assistant style conversations, sampled to 5000 rows.

Workflow type: Human Bot Conversations
Number of rows: 126287
Number of columns: 2
URL: https://huggingface.co/datasets/birgermoell/open_assistant_dataset

Anthropic RLHF Dataset sample

The Anthropic RLHF Dataset sample contains human preference data about helpfulness and harmlessness, meant to train preference (or reward) models for subsequent RLHF training. This dataset takes a sample of 1000 entries.

Workflow type: Human Bot Conversations
Number of rows: 2332
Number of columns: 1
URL: https://huggingface.co/datasets/Anthropic/hh-rlhf
License: mit

BERT Pre-training

The BERT Dataset for pretraining.

Workflow type: Continued PreTraining
Number of rows: 20000
Number of columns: 1
URL: https://huggingface.co/datasets/nthngdy/bert_dataset_202203/viewer/nthngdy--bert_dataset_202203/train
License: apache-2.0

TWT Eval

The TWT Eval pretraining dataset.

Workflow type: Continued PreTraining
Number of rows: 20000
Number of columns: 1
URL: https://huggingface.co/datasets/ArnavL/TWTEval-Pretraining-Processed

