Supported problem types


H2O LLM DataStudio supports several problem types and workflows, each with tools for preparing datasets and training models for a specific task. This page lists the supported problem types, explains why each one matters, and describes how the application assists with dataset preparation and model training.

Question and Answer

  • Description: H2O LLM DataStudio simplifies dataset preparation for question-answering models. Each dataset row consists of a context passage, a question, and the corresponding answer. The application's features help you create the well-structured datasets needed to train models that answer user queries accurately from the provided context.

  • Expected Columns: 'Question', 'Answer', and 'Context'.

  • Example:

    Question: What are the cookies used for?
    Answer: Cookies: In order to offer and provide a customized, personal service, uses cookies to store and help track your information as you travel throughout the site. For example, we may use cookies to help remind us who you are and to deliver content and services based upon your account information. In addition, third party advertising networks may issue cookies when serving advertisements.
    Context: All Enthusiast, Inc.'s Privacy Policy All Enthusiast, Inc.'s respects the privacy and security of its users. Our goal is to provide you with a personalized Internet experience that delivers the information, resources, and services that are most relevant and helpful to you. In order to achieve this goal, we sometimes collect information during your visits to understand what differentiates you from each of our millions of other users....We welcome any questions or comments you have about please direct them to our contact form.
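As an illustration of the three-column layout described above, the sketch below writes one question-answering row to CSV text and reads it back, checking that the expected columns are present. It assumes the dataset is stored as CSV, which this page does not state; the helper names `write_qa_csv` and `read_qa_csv` are hypothetical and not part of H2O LLM DataStudio, and the sample values are shortened from the example above.

```python
import csv
import io

# Column names taken from the "Expected Columns" entry above.
QA_COLUMNS = ["Question", "Answer", "Context"]

# One sample row; values are shortened from the example above.
rows = [
    {
        "Question": "What are the cookies used for?",
        "Answer": "Cookies: In order to offer and provide a customized, personal service, ...",
        "Context": "All Enthusiast, Inc.'s Privacy Policy ...",
    }
]


def write_qa_csv(rows):
    """Serialize rows to CSV text using the three expected columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=QA_COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def read_qa_csv(text):
    """Parse CSV text and fail early if an expected column is missing."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in QA_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


csv_text = write_qa_csv(rows)
loaded = read_qa_csv(csv_text)
print(loaded[0]["Question"])
```

The `csv` module quotes fields that contain commas, so long context passages survive the round trip unchanged.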

Text Summarization

  • Description: The Text Summarization workflow is designed for datasets of articles paired with their summaries. H2O LLM DataStudio tools simplify extracting the key information from each article so that you can create summaries that capture its main points. The resulting datasets are useful for training text summarization models that produce concise, informative summaries of lengthy text.

  • Expected Columns: 'Article' and 'Summary'.

  • Example:

    Article: Sally Forrest, an actress-dancer who graced the silver screen throughout the '40s and '50s in MGM musicals and films such as the 1956 noir While the City Sleeps died on March 15 at her home in Beverly Hills, California. Forrest, whose birth name was Katherine Feeney, was 86 and had long battled cancer. Her publicist, Judith Goffin, announced the news Thursday....Forrest married writer-producer Milo Frank in 1951. He died in 2004. She is survived by her niece, Sharon Durham, and nephews, Michael and Mark Feeney. Career: A San Diego native, Forrest became a protege of Hollywood trailblazer Ida Lupino, who cast her in starring roles in films
    Summary: "Sally Forrest, an actress-dancer who graced the silver screen throughout the '40s and '50s in MGM musicals and films died on March 15 . Forrest, whose birth name was Katherine Feeney, had long battled cancer . A San Diego native, Forrest became a protege of Hollywood trailblazer Ida Lupino, who cast her in starring roles in films ."
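Because a summary should be shorter than its article, one simple sanity check on an 'Article'/'Summary' dataset is to compare word counts. The sketch below is only an illustration of that idea: the function `flag_suspect_rows`, the `max_ratio` threshold of 0.5, and the sample rows are all assumptions, not H2O LLM DataStudio features or recommended values.

```python
def flag_suspect_rows(rows, max_ratio=0.5):
    """Return indices of rows whose summary is empty or not clearly shorter.

    max_ratio is a hypothetical threshold: a row is flagged when the summary
    has more than max_ratio times as many words as the article.
    """
    suspect = []
    for index, row in enumerate(rows):
        article_words = len(row["Article"].split())
        summary_words = len(row["Summary"].split())
        if article_words == 0 or summary_words == 0:
            suspect.append(index)
        elif summary_words / article_words > max_ratio:
            suspect.append(index)
    return suspect


# Hypothetical rows, invented for illustration only.
sample_rows = [
    {
        "Article": "The city council met on Tuesday and voted to extend "
                   "library opening hours on weekends.",
        "Summary": "Council extends weekend library hours.",
    },
    {
        "Article": "Short note about the weather.",
        "Summary": "Short note about the weather.",  # same length as article
    },
    {
        "Article": "Another article body with several words in it.",
        "Summary": "",  # empty summary
    },
]

print(flag_suspect_rows(sample_rows))
```

Word counts are a crude proxy, so flagged rows are candidates for manual review rather than automatic removal.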

Instruct Tuning

  • Description: H2O LLM DataStudio assists in preparing datasets that include prompts or instructions along with their corresponding responses. These datasets are essential for training models to understand and follow provided instructions, enabling accurate responses to user prompts.

  • Expected Columns: 'Prompt' and 'Response'.

  • Example:

    Prompt: Translate the phrase "Good Morning" to French
    Response: Bonjour

Human - Bot Conversations

  • Description: This workflow deals with datasets containing dialogues between human users and chatbots. These datasets are crucial for training models to comprehend user intents and deliver appropriate responses, thereby improving conversational experiences. H2O LLM DataStudio helps you structure and organize the conversational data efficiently, including user queries and bot responses.

  • Expected Columns: 'Message_id', 'Parent_id', 'Text', and 'Role'.

  • Example:

    Message_id: 384ad8e0-8fc2-4dfd-bf48-0c417f6c5f0f
    Parent_id: 7d05acb7-9360-458c-8a1d-c0b6492b8f8a
    Text: "What are your thoughts on the censorship of ChatGPT's output and its liberal biases?"
    Role: prompter
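The 'Message_id' and 'Parent_id' columns link each message to the one it replies to, so a conversation can be rebuilt by following parent links from a message back to the start of its thread. The sketch below shows one way to do that under stated assumptions: a root message is taken to have an empty or unknown 'Parent_id', the role name "assistant" is assumed (only "prompter" appears in the example above), and the short ids and the function `build_thread` are hypothetical.

```python
def build_thread(rows, leaf_id):
    """Return the messages from the thread root down to leaf_id, in order.

    Follows Parent_id links upward until a message has no known parent.
    The `seen` set guards against malformed data that contains a cycle.
    """
    by_id = {row["Message_id"]: row for row in rows}
    thread = []
    seen = set()
    current = by_id.get(leaf_id)
    while current is not None and current["Message_id"] not in seen:
        seen.add(current["Message_id"])
        thread.append(current)
        current = by_id.get(current["Parent_id"])
    thread.reverse()  # collected leaf-to-root, so flip to root-to-leaf
    return thread


# Hypothetical rows with short ids instead of UUIDs, for readability.
messages = [
    {"Message_id": "m1", "Parent_id": None, "Role": "prompter",
     "Text": "Translate the phrase \"Good Morning\" to French"},
    {"Message_id": "m2", "Parent_id": "m1", "Role": "assistant",
     "Text": "Bonjour"},
    {"Message_id": "m3", "Parent_id": "m2", "Role": "prompter",
     "Text": "Thanks!"},
    {"Message_id": "m4", "Parent_id": "m1", "Role": "assistant",
     "Text": "Bonjour !"},  # an alternative reply to m1
]

for message in build_thread(messages, "m3"):
    print(f'{message["Role"]}: {message["Text"]}')
```

Because several messages can share one parent, the rows form a tree; calling `build_thread` once per leaf message yields every distinct conversation path.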

Continued PreTraining

  • Description: In this workflow, H2O LLM DataStudio helps prepare datasets containing extensive texts for further pretraining of language models. The dataset preparation process focuses on organizing long text data, allowing language models to learn from a diverse range of linguistic patterns. This enhances their language understanding and generation capabilities.

  • Expected Column: 'Text'.

  • Example:

    Chrysaethe amoena Chrysaethe amoena is a species of beetle in the family Cerambycidae. It was described by Gounelle in 1911.
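The expected columns listed on this page can be collected into a single lookup table and used to check a dataset's header before upload. The sketch below is a standalone illustration, not part of H2O LLM DataStudio: the function names are hypothetical, and it assumes column names must match exactly, including case, which this page does not confirm.

```python
# Expected columns per problem type, as listed on this page.
EXPECTED_COLUMNS = {
    "Question and Answer": ["Question", "Answer", "Context"],
    "Text Summarization": ["Article", "Summary"],
    "Instruct Tuning": ["Prompt", "Response"],
    "Human - Bot Conversations": ["Message_id", "Parent_id", "Text", "Role"],
    "Continued PreTraining": ["Text"],
}


def missing_columns(problem_type, columns):
    """Return the expected columns that are absent from `columns`."""
    present = set(columns)
    return [name for name in EXPECTED_COLUMNS[problem_type] if name not in present]


def matching_problem_types(columns):
    """Return every problem type whose expected columns are all present."""
    return [
        problem_type
        for problem_type in EXPECTED_COLUMNS
        if not missing_columns(problem_type, columns)
    ]


print(missing_columns("Question and Answer", ["Question", "Answer"]))
print(matching_problem_types(["Prompt", "Response"]))
```

A header can satisfy more than one problem type: any dataset with a 'Text' column, including a conversation dataset, also meets the single-column requirement for Continued PreTraining.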