Supported functionalities

H2O LLM DataStudio supports a range of functions for preparing datasets for various task types. The primary goal is to structure the data so that models trained on it perform as well as possible. The following is an overview of the key functions available:

  • Data Object: Allows input of datasets for all task types.
  • Data Augmentation: Enables the mixing or augmentation of multiple datasets for all task types.
  • Text Cleaning: Offers a range of methods to clean text data for all task types.
  • Profanity Check: Identifies and removes texts containing profanity, applicable for question and answer, instruct tuning, human-bot conversations, and continued pretraining tasks.
  • Text Quality Check: Checks and filters out low-quality texts for question and answer, instruct tuning, human-bot conversations, and continued pretraining tasks. The app uses a text grade technique, keeping only texts within the desired grade range (school age), to ensure text quality. A lower grade means the text is too simple, and a higher grade means it is too complex.
  • Length Checker: Filters the dataset based on user-defined minimum and maximum length parameters for all task types.
  • Valid Question: Uses a range of techniques to determine whether the question for question-answer pairs is actually a question. If it finds that there is no question in the pair, the app filters out that particular question-answer pair from further processing.
  • Pad Sequence: Enables the padding of sequences based on a maximum length parameter for all task types.
  • Truncate Sequence by Score: Truncates sequences based on a score and a max length parameter, both required, for all task types.
  • Compression Ratio Filter: Filters text summarization data by comparing the compression ratio of the summaries.
  • Boundary Marking: Adds start and end tokens at the boundaries of the summary text, specifically for text summarization tasks.
  • Sensitive Info Checker: Identifies and removes any texts containing sensitive information, critical for instruct tuning tasks.
  • RLHF Protection: Appends datasets to facilitate RLHF for all task types.
  • Language Understanding: Checks the language of the text and allows filtering based on user inputs or a threshold, beneficial for all task types.
  • Data Deduplication: Calculates text similarity within the dataset and removes text based on a duplicate score threshold for all task types.
  • Toxicity Detection: Calculates toxicity scores for text objects and filters them according to a threshold, beneficial for all task types.
  • Output: Converts the transformed dataset to an output object, such as JSON, applicable for all task types.
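
To make a few of the functions above more concrete, the following is a minimal Python sketch of a length check, a similarity-based deduplication step, sequence padding, and JSON output. It is not H2O LLM DataStudio's implementation or API: the function names (`length_filter`, `deduplicate`, `pad_tokens`, `to_json`), the `text` field, the use of character counts for length, whitespace tokenization, and `difflib.SequenceMatcher` as the similarity measure are all assumptions chosen for illustration.

```python
import json
from difflib import SequenceMatcher


def length_filter(records, min_len, max_len):
    """Keep records whose text length in characters is within [min_len, max_len]."""
    return [r for r in records if min_len <= len(r["text"]) <= max_len]


def deduplicate(records, threshold):
    """Drop a record when its similarity to an already kept record is >= threshold."""
    kept = []
    for r in records:
        is_duplicate = any(
            SequenceMatcher(None, r["text"], k["text"]).ratio() >= threshold
            for k in kept
        )
        if not is_duplicate:
            kept.append(r)
    return kept


def pad_tokens(tokens, max_len, pad_token="<pad>"):
    """Right-pad a token list to max_len; longer lists are returned unchanged."""
    return tokens + [pad_token] * max(0, max_len - len(tokens))


def to_json(records):
    """Serialize the transformed dataset to a JSON string."""
    return json.dumps(records, ensure_ascii=False, indent=2)


# Hypothetical sample data, for illustration only.
records = [
    {"text": "What is the capital of France?"},
    {"text": "What is the capital of France ?"},  # near-duplicate of the first
    {"text": "Hi"},                               # too short
    {"text": "Explain gradient descent in one sentence."},
]

filtered = length_filter(records, min_len=5, max_len=200)
unique = deduplicate(filtered, threshold=0.9)
padded = [pad_tokens(r["text"].split(), max_len=8) for r in unique]
output = to_json(unique)
```

The steps are written as independent functions so they can be applied in any order, mirroring how the functions listed above are selected per task type. The pairwise comparison in `deduplicate` is quadratic in the number of records, so it is only suitable for small datasets.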
