Augment
Augmentation lets you blend your datasets with publicly available datasets preloaded in H2O LLM DataStudio to add variety to your data. In some cases, you can integrate your datasets with RLHF-related datasets to cover more aspects of a domain. The Augment tab displays a catalog of datasets that can be used immediately in the Prepare pipeline. You can also bring your own datasets for the augmentation process. Augmentation settings are configured in the Configuration step of the Prepare pipeline.
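As a rough illustration of what blending means, the following Python sketch appends a random sample of rows from a public dataset to your own rows. It is not H2O LLM DataStudio's implementation: the function name `augment`, the `extra_rows` parameter, and the sample rows are all hypothetical and exist only to show the idea.

```python
import random


def augment(own_rows, public_rows, extra_rows, seed=0):
    """Return own_rows plus up to extra_rows rows sampled from public_rows.

    Hypothetical helper for illustration only; it does not reflect how
    H2O LLM DataStudio performs augmentation internally.
    """
    rng = random.Random(seed)
    # Never ask for more rows than the public dataset contains.
    k = min(extra_rows, len(public_rows))
    sampled = rng.sample(public_rows, k)
    # Build a new list so the caller's data is left untouched.
    blended = own_rows + sampled
    rng.shuffle(blended)
    return blended


# Placeholder rows standing in for your dataset and a preloaded one.
own = [{"question": "Example question from my data", "answer": "Example answer"}]
public = [
    {"question": f"Public question {i}", "answer": f"Public answer {i}"}
    for i in range(10)
]

blended = augment(own, public, extra_rows=3)
print(len(blended))  # 4: one own row plus three sampled public rows
```

Both datasets need the same columns for a blend like this to make sense, which is why each preloaded dataset is tied to a workflow type.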
Instructions
- On the H2O LLM DataStudio left navigation menu, click Augment.
- Click on the name of the dataset to get a preview of the dataset.
Augment Datasets (RLHF, Improve Content, Extra Rows)
H2O LLM DataStudio provides 18 datasets for different workflow types that you can use to augment your input datasets during the Configuration step of data preparation. The augmentation datasets also include RLHF-related datasets for the question answering, text summarization, instruct tuning, and human-bot conversation problem types.
The DataCatalog table contains the following information about augmentation datasets:
- Name: The name of the dataset.
- Description: A brief description of the content of the dataset and its purpose.
- Workflows: The target workflow/task type of the dataset.
- Rows: The number of rows in the dataset.
- Cols: The number of columns in the dataset.
- URL: The source of the dataset.
- License: The license under which the dataset owner has issued the dataset.
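The catalog fields above can be pictured as a simple record. The sketch below models a few entries in Python and filters them by workflow type. The `CatalogEntry` class and the `by_workflow` helper are hypothetical names used for illustration and are not part of any H2O LLM DataStudio API. The values are copied from the preloaded dataset list on this page, and the URL and license are optional because some entries do not list them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CatalogEntry:
    """One row of the catalog table (hypothetical model, not a product API)."""
    name: str
    description: str
    workflow: str
    rows: int
    cols: int
    url: Optional[str] = None      # some entries list no source URL
    license: Optional[str] = None  # some entries list no license


def by_workflow(entries, workflow):
    """Return the entries whose workflow type matches exactly."""
    return [entry for entry in entries if entry.workflow == workflow]


catalog = [
    CatalogEntry(
        "Stanford Q&A Dataset", "Questions on Wikipedia articles",
        "Question Answering", 5000, 3,
        "https://huggingface.co/datasets/squad", "cc-by-4.0",
    ),
    CatalogEntry(
        "News Article Summary", "Summarized news articles",
        "Text Summarization", 4515, 2,
        "https://www.kaggle.com/datasets/sunnysai12345/news-summary", "gpl-2.0",
    ),
    CatalogEntry(
        "Self Instruct", "Prompt-reply pairs",
        "Instruct Tuning", 448, 2,
    ),
]

qa_names = [entry.name for entry in by_workflow(catalog, "Question Answering")]
print(qa_names)  # ['Stanford Q&A Dataset']
```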
Preloaded datasets
Stanford Q&A Dataset
The Stanford Question Answering Dataset (SQuAD) consists of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is taken from the corresponding reading passage.
- Workflow type: Question Answering
- Number of rows: 5000
- Number of columns: 3
- URL: https://huggingface.co/datasets/squad
- License: cc-by-4.0
QG-Bench Subset by SQuAD
The SQuAD dataset for the question generation task.
- Workflow type: Question Answering
- Number of rows: 6283
- Number of columns: 3
- URL: https://huggingface.co/datasets/lmqg/qg_squad
- License: cc-by-4.0
Tweet base Q&A
A Q&A dataset in which each entry contains a short tweet, a question, and a text phrase as the answer.
- Workflow type: Question Answering
- Number of rows: 10692
- Number of columns: 3
- URL: https://huggingface.co/datasets/tweet_qa
- License: cc-by-sa-4.0
RLHF EE QA
The RLHF dataset for Q&A problems.
- Workflow type: Question Answering
- Number of rows: 180
- Number of columns: 3
- URL: https://huggingface.co/datasets/kastan/EE_QA_for_RLHF
- License: mit
News Article Summary
The News Article Summary dataset contains summaries of news articles from different newspapers.
- Workflow type: Text Summarization
- Number of rows: 4515
- Number of columns: 2
- URL: https://www.kaggle.com/datasets/sunnysai12345/news-summary
- License: gpl-2.0
Costco Article Summary
The Costco article text summarization dataset.
- Workflow type: Text Summarization
- Number of rows: 86
- Number of columns: 2
- URL: https://huggingface.co/datasets/awinml/costco_long_practice
- License: mit
Dialogue Summary
The dialogue summarization dataset.
- Workflow type: Text Summarization
- Number of rows: 12460
- Number of columns: 2
- URL: https://huggingface.co/datasets/knkarthick/dialogsum
- License: mit
RLHF OpenAI Summaries
The RLHF OpenAI Summaries dataset contains a sample (5000 rows) of the CarperAI RLHF summarization dataset, which is based on Reddit thread summaries.
- Workflow type: Text Summarization
- Number of rows: 5000
- Number of columns: 2
- URL: https://huggingface.co/datasets/CarperAI/openai_summarize_comparisons
Code QA
The Code QA dataset contains prompt-reply pairs where the prompt is to create a Python function which satisfies the functionality described in a specified docstring. The responses are the generated functions.
- Workflow type: Instruct Tuning
- Number of rows: 591
- Number of columns: 2
- URL: https://huggingface.co/datasets/OllieStanley/humaneval-mbpp-codegen-qa
Python QA
The Python QA dataset contains prompt-reply pairs where the prompt is to create a Python unit test which tests for the functionality described in a specific docstring. The responses are the generated unit tests.
- Workflow type: Instruct Tuning
- Number of rows: 591
- Number of columns: 2
- URL: https://huggingface.co/datasets/OllieStanley/humaneval-mbpp-testgen-qa
Self Instruct
The Self Instruct dataset contains prompt-reply pairs.
- Workflow type: Instruct Tuning
- Number of rows: 448
- Number of columns: 2
RLHF Instruct Tuning
The RLHF Instruct Tuning dataset contains a technical Q&A set based on an RLHF dataset.
- Workflow type: Instruct Tuning
- Number of rows: 337
- Number of columns: 2
- URL: https://huggingface.co/datasets/kastan/rlhf-qa-comparisons
Human Assistance Dataset
The Human Assistance Dataset contains Human-Assistance style conversations, sampled to 5000 rows.
- Workflow type: Human Bot Conversations
- Number of rows: 33143
- Number of columns: 1
- URL: https://huggingface.co/datasets/Dahoas/first-instruct-human-assistant-prompt
Biomedical Human Assistance
The Biomedical Human Assistance dataset contains User-Assistant style conversations on biomedical topics.
- Workflow type: Human Bot Conversations
- Number of rows: 10000
- Number of columns: 1
- URL: https://huggingface.co/datasets/ericyu3/openassistant_inpainted_dialogs_5k_biomedical
- License: apache-2.0
User Assistant Conversations
The User Assistant Conversations dataset contains the User-Assistant style conversations, sampled to 5000 rows.
- Workflow type: Human Bot Conversations
- Number of rows: 126287
- Number of columns: 2
- URL: https://huggingface.co/datasets/birgermoell/open_assistant_dataset
Anthropic RLHF Dataset sample
The Anthropic RLHF Dataset sample contains human preference data about helpfulness and harmlessness, meant to train preference (or reward) models for subsequent RLHF training. This dataset takes a sample of 1000 entries.
- Workflow type: Human Bot Conversations
- Number of rows: 2332
- Number of columns: 1
- URL: https://huggingface.co/datasets/Anthropic/hh-rlhf
- License: mit
BERT Pre-training
The BERT Dataset for pretraining.
- Workflow type: Continued PreTraining
- Number of rows: 20000
- Number of columns: 1
- URL: https://huggingface.co/datasets/nthngdy/bert_dataset_202203/viewer/nthngdy--bert_dataset_202203/train
- License: apache-2.0
TWT Eval
The TWT Eval pretraining dataset.
- Workflow type: Continued PreTraining
- Number of rows: 20000
- Number of columns: 1
- URL: https://huggingface.co/datasets/ArnavL/TWTEval-Pretraining-Processed