Key functionalities

H2O LLM DataStudio converts documents into question-answer pairs and summarization pairs for fine-tuning LLMs. It supports the following key functionalities:

Variety of data types

With H2O LLM DataStudio, you can convert a variety of data types into question-answer pairs or summarization pairs. Supported data types include:

  • Documents (.pdf, .docx, .md, .txt)
  • Audio (.wav, .m4a, .mp3)
  • Markdown and HTML
  • Collections of the above in .zip format
  • Web URLs and PDFs

LLM-Based question-answer pair generation

H2O LLM DataStudio uses H2OGPT, a large open-source LLM, to generate question-answer pairs with the input documents as the reference. This capability handles the complete end-to-end pipeline: breaking documents into chunks, applying intelligent prompting techniques, and ensuring a consistent output format.
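The three pipeline stages named above (chunking, prompting, and output normalization) can be illustrated with a minimal sketch. This is not the DataStudio implementation: the function names, the word-based chunking with overlap, the prompt wording, and the "question"/"answer" keys are all assumptions made for illustration, and the call to the LLM itself is left out.

```python
import json
from typing import Dict, List


def chunk_text(text: str, chunk_words: int = 200, overlap: int = 20) -> List[str]:
    """Split text into word-based chunks that overlap by `overlap` words (hypothetical helper)."""
    if chunk_words <= overlap:
        raise ValueError("chunk_words must be larger than overlap")
    words = text.split()
    step = chunk_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        if start + chunk_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def build_qa_prompt(chunk: str, n_pairs: int = 3) -> str:
    """Build a prompt that asks for QA pairs grounded in one chunk (wording is an assumption)."""
    return (
        f"Using only the reference text below, write {n_pairs} question-answer pairs.\n"
        'Return a JSON list of objects with the keys "question" and "answer".\n\n'
        f"Reference text:\n{chunk}"
    )


def parse_pairs(raw: str) -> List[Dict[str, str]]:
    """Keep only well-formed pairs so that every row has the same shape."""
    data = json.loads(raw)
    if not isinstance(data, list):
        return []
    pairs = []
    for item in data:
        if not isinstance(item, dict):
            continue
        question, answer = item.get("question"), item.get("answer")
        if isinstance(question, str) and isinstance(answer, str) and question.strip() and answer.strip():
            pairs.append({"question": question.strip(), "answer": answer.strip()})
    return pairs
```

In this sketch each chunk would be sent to the model with `build_qa_prompt`, and the model's reply would be passed through `parse_pairs` before being added to the dataset.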

LLM-Based context summarization pair generation

H2O LLM DataStudio's dataset curation capability can be used to generate context summarization pairs. It allows you to curate a dataset for another LLM fine-tuning workflow. This workflow uses the same smart chunking and prompting techniques to generate article-summary pairs. These article-summary pairs can be propagated to Prepare pipelines and LLM Studio for fine-tuning.

Fast QA Mode

The fast QA mode allows you to configure what proportion of input documents to use for question-answer generation. It identifies sections (chunks) of input documents that are diverse and content-rich, ensuring refined and diverse question-answer generation.
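One way to picture this selection is a greedy pass that keeps a fixed proportion of chunks, always taking the chunk that adds the most words not yet covered. This is only a sketch under assumptions: the documentation does not describe how DataStudio scores chunks, so the word-coverage heuristic and the names `select_chunks` and `proportion` are hypothetical.

```python
import math
from typing import List, Set


def _tokens(chunk: str) -> Set[str]:
    """Lower-cased set of whitespace-separated words in a chunk."""
    return set(chunk.lower().split())


def select_chunks(chunks: List[str], proportion: float) -> List[str]:
    """Keep roughly `proportion` of the chunks, favoring those that add new vocabulary.

    Assumed heuristic, not the DataStudio algorithm: a chunk is "content-rich"
    if it has many distinct words and "diverse" if those words are not already
    covered by previously selected chunks.
    """
    if not 0 < proportion <= 1:
        raise ValueError("proportion must be in the interval (0, 1]")
    budget = min(len(chunks), math.ceil(proportion * len(chunks)))
    token_sets = [_tokens(chunk) for chunk in chunks]
    covered: Set[str] = set()
    remaining = list(range(len(chunks)))
    picked: List[int] = []
    while remaining and len(picked) < budget:
        # Choose the chunk that contributes the most unseen words.
        best = max(remaining, key=lambda i: len(token_sets[i] - covered))
        picked.append(best)
        covered |= token_sets[best]
        remaining.remove(best)
    # Return the selected chunks in their original document order.
    return [chunks[i] for i in sorted(picked)]
```

With a proportion of 0.5 and four chunks, two are kept; a verbatim duplicate of an already selected chunk contributes no new words and is chosen last.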

View and customize the output

  • Reference Check: Explore the original document's text chunks to see which chunk each question-answer pair was generated from.
  • Flag: Mark whether a row is relevant or irrelevant. Rows flagged as irrelevant can be filtered out during data preparation.
  • Edit: Customize and update any question-answer pair.
  • Download Dataset: Download the datasets in either JSON or CSV format.
  • Send to Prepare: Easily send the curated dataset to a Prepare project. This integration allows users to configure a data preparation workflow by augmenting the dataset with other curated collections or selecting public datasets, while also removing sensitive or toxic content and filtering out irrelevant rows.
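If you work with a downloaded copy of the dataset, the effect of the Flag option can be reproduced with a simple filter. The sketch below assumes each row is a dictionary with a `flag` field whose value is `"irrelevant"` for rejected rows; the actual field name and values in a DataStudio export may differ.

```python
from typing import Dict, List


def drop_irrelevant(
    rows: List[Dict[str, str]],
    flag_key: str = "flag",                # hypothetical column name
    irrelevant_value: str = "irrelevant",  # hypothetical flag value
) -> List[Dict[str, str]]:
    """Return only the rows that are not flagged as irrelevant."""
    return [row for row in rows if row.get(flag_key) != irrelevant_value]


rows = [
    {"question": "What is chunking?", "answer": "Splitting a document.", "flag": "relevant"},
    {"question": "Page 3?", "answer": "N/A", "flag": "irrelevant"},
    {"question": "What is a QA pair?", "answer": "A question and its answer."},
]
kept = drop_irrelevant(rows)
```

Rows without a flag are kept, on the assumption that unreviewed rows should not be discarded silently.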

Use the new structured dataset to fine-tune an LLM in H2O LLM Studio

You can get your dataset as a CSV for easy import into LLM Studio for fine-tuning. For more information, see Effortless Fine-Tuning of Large Language Models with Open-Source H2O LLM Studio.
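If you downloaded the dataset as JSON instead, it can be converted to CSV with the Python standard library. The sketch below assumes the JSON file is a list of objects with `question` and `answer` keys; these column names are assumptions, so adjust them to match your export.

```python
import csv
import io
import json
from typing import Sequence


def qa_json_to_csv(json_text: str, columns: Sequence[str] = ("question", "answer")) -> str:
    """Convert a JSON list of QA objects into CSV text with a header row."""
    rows = json.loads(json_text)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(columns), lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Missing keys become empty cells; keys outside `columns` are dropped.
        writer.writerow({column: row.get(column, "") for column in columns})
    return buffer.getvalue()


csv_text = qa_json_to_csv('[{"question": "What is X?", "answer": "A, B"}]')
```

The `csv` module quotes any field that contains a comma, so answers with punctuation remain in a single column.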

