Supported functionalities

H2O LLM DataStudio provides a range of functions for preparing datasets for different task types. The primary goal is to structure the data so that models trained on it perform as well as possible. The following is an overview of the key functions available:

Text Cleaning: Offers a range of methods for cleaning text data. It applies to all task types.
- Removes unwanted characters (e.g., emojis)
- Removes unnecessary whitespace
- Converts text to lowercase
- Standardizes the handling of URLs and emails
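
The cleaning steps listed above can be sketched in a few lines of Python. This is a minimal illustration only, not DataStudio's implementation: the function name `clean_text`, the regular expressions, and the `<URL>` and `<EMAIL>` placeholders are all assumptions made for the example.

```python
import re

# Hypothetical patterns for illustration; not DataStudio's actual rules.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Rough emoji coverage: a few common pictograph blocks only.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def clean_text(text: str) -> str:
    """Lowercase, standardize emails and URLs, drop emojis, tidy whitespace."""
    text = text.lower()                  # lowercase first so placeholders keep their case
    text = EMAIL_RE.sub("<EMAIL>", text) # replace email addresses with a placeholder
    text = URL_RE.sub("<URL>", text)     # replace URLs with a placeholder
    text = EMOJI_RE.sub("", text)        # remove emoji characters
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
```

For example, `clean_text("Visit  HTTPS://Example.com/Page 😀 or mail Bob@Example.com now")` yields `visit <URL> or mail <EMAIL> now`.
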
Profanity Check: Uses a profanity model to identify and remove texts containing profanity. It applies to question and answer, instruct tuning, human-bot conversation, and continued pretraining tasks.
Text Quality Check: Checks for and filters out low-quality texts for question and answer, instruct tuning, human-bot conversation, and continued pretraining tasks. To ensure text quality, the app computes a text grade (corresponding to school age) and keeps only texts within the desired grade range. A grade below the range indicates text that is too simple, and a grade above it indicates text that is too complex.
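
To make the grade-range idea concrete, here is a rough sketch that scores text with the Flesch-Kincaid grade level formula. The source does not say which grading technique DataStudio uses, so the formula, the crude syllable heuristic, the function names, and the default range of 3 to 12 are all assumptions.

```python
import re


def count_syllables(word: str) -> int:
    # Crude heuristic: each run of vowels counts as one syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_kincaid_grade(text: str) -> float:
    """Approximate US school grade needed to read the text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


def in_grade_range(text: str, min_grade: float = 3.0, max_grade: float = 12.0) -> bool:
    # Hypothetical default range; texts outside it are treated as too simple or too complex.
    return min_grade <= flesch_kincaid_grade(text) <= max_grade
```

Short sentences made of one-syllable words score below the range, while long sentences packed with polysyllabic words score above it.
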
Length Checker: Filters the dataset based on user-defined minimum and maximum length parameters for all task types. By default, contexts and answers must be 10 to 5,000 characters long, and questions must be 10 to 3,000 characters long.
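
A length filter with these defaults might look like the following sketch. The column names (`context`, `question`, `answer`), the function names, and the choice of inclusive bounds are assumptions for illustration.

```python
# Default (min, max) character limits taken from the description above.
# The column names are hypothetical.
DEFAULT_LIMITS = {
    "context": (10, 5000),
    "question": (10, 3000),
    "answer": (10, 5000),
}


def within_limits(row: dict, limits: dict = DEFAULT_LIMITS) -> bool:
    """Return True if every limited column has a length inside its range (inclusive)."""
    return all(lo <= len(row.get(col, "")) <= hi for col, (lo, hi) in limits.items())


def filter_by_length(rows: list, limits: dict = DEFAULT_LIMITS) -> list:
    """Keep only the rows that satisfy all length limits."""
    return [row for row in rows if within_limits(row, limits)]
```

Passing a different `limits` dictionary corresponds to the user-defined minimum and maximum parameters.
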
Valid Question: Uses a range of techniques to determine whether the question in a question-answer pair is actually a question. If it is not, the app filters that question-answer pair out of further processing.
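
The source does not list the techniques used. As one simple possibility, the sketch below treats a text as a question if it ends with a question mark or starts with a common interrogative word. The word list, the function names, and the `question` key are assumptions.

```python
# Hypothetical list of words that commonly open an English question.
QUESTION_WORDS = {
    "who", "what", "when", "where", "why", "how", "which",
    "is", "are", "do", "does", "can", "could", "should", "would", "will",
}


def is_question(text: str) -> bool:
    """Heuristic check: trailing '?' or a leading interrogative word."""
    stripped = text.strip()
    if not stripped:
        return False
    if stripped.endswith("?"):
        return True
    first_word = stripped.split()[0].lower().strip(".,:;!\"'")
    return first_word in QUESTION_WORDS


def filter_valid_questions(pairs: list) -> list:
    """Drop question-answer pairs whose question does not look like a question."""
    return [pair for pair in pairs if is_question(pair["question"])]
```

A heuristic like this misclassifies statements that happen to start with words such as "is" or "will", which is one reason a production check would combine several techniques.
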
Pad Sequence: Adds padding to the auto-generated question-and-answer pairs so that all texts have the same length.
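
Padding to a common length can be sketched as follows. The example pads lists of tokens on the right with a `<pad>` token up to the length of the longest sequence; the token, the padding side, and the function name are assumptions, since the source does not specify them.

```python
def pad_sequences(sequences: list, pad_token: str = "<pad>") -> list:
    """Right-pad every token list to the length of the longest one."""
    max_len = max((len(seq) for seq in sequences), default=0)
    return [list(seq) + [pad_token] * (max_len - len(seq)) for seq in sequences]
```

For example, `pad_sequences([["a"], ["b", "c"]])` returns `[["a", "<pad>"], ["b", "c"]]`.
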
Truncate Sequence by Score: Truncates sequences based on a score and a maximum length parameter, for all task types. By default, it truncates auto-generated text that is longer than 10,000 characters, applying summarization to reduce the text's length.
Compression Ratio Filter: Filters text summarization data based on the compression ratio of each summary. It removes rows whose compression ratio is greater than 35%. This is only relevant for summarization tasks.
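
As an illustration, the sketch below defines the ratio as summary length divided by source text length, both in characters, and drops rows above 35%. That definition, the `text` and `summary` keys, and the function names are assumptions; the source gives only the threshold.

```python
def compression_ratio(text: str, summary: str) -> float:
    """Summary length as a fraction of source length (in characters)."""
    if not text:
        return float("inf")  # an empty source cannot be meaningfully summarized
    return len(summary) / len(text)


def filter_by_compression(rows: list, max_ratio: float = 0.35) -> list:
    """Keep rows whose summary is at most max_ratio of the source length."""
    return [
        row for row in rows
        if compression_ratio(row["text"], row["summary"]) <= max_ratio
    ]
```

With this definition, a 100-character text with a 30-character summary is kept, while the same text with a 40-character summary is removed.
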
Boundary Marking: Adds _START_ and _END_ tokens at the boundaries of the summary text. This is only relevant for summarization tasks.
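
Boundary marking amounts to wrapping each summary in the two tokens. In this sketch, the function name and the single space between token and text are assumptions.

```python
def mark_boundaries(summary: str, start: str = "_START_", end: str = "_END_") -> str:
    """Wrap a summary in start and end tokens, separated by single spaces."""
    return f"{start} {summary.strip()} {end}"
```

For example, `mark_boundaries("Sales rose in Q2.")` returns `_START_ Sales rose in Q2. _END_`.
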
Sensitive Info Checker: Identifies and removes any texts containing sensitive information (e.g., email addresses, phone numbers). This is critical for instruct tuning tasks.
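
A regex-based check for the two examples mentioned, emails and phone numbers, could be sketched like this. The patterns are simplified assumptions (the phone pattern targets North American style numbers and will flag any bare 10-digit run), and they are not DataStudio's detection logic.

```python
import re

# Simplified, hypothetical patterns for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(
    r"(?:\+?\d{1,3}[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}"
)


def contains_sensitive_info(text: str) -> bool:
    """Return True if the text appears to contain an email or phone number."""
    return bool(EMAIL_RE.search(text) or PHONE_RE.search(text))


def drop_sensitive(texts: list) -> list:
    """Remove texts that appear to contain sensitive information."""
    return [t for t in texts if not contains_sensitive_info(t)]
```
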
RLHF Protection: Appends datasets to facilitate reinforcement learning from human feedback (RLHF) for all task types.
Language Understanding: Detects the language of each text and allows filtering based on user input or a threshold. It is beneficial for all task types.
Data Deduplication: Calculates text similarity within the dataset and removes text based on a duplicate score threshold for all task types.
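
Threshold-based deduplication can be sketched with a simple similarity measure. The example below uses Jaccard similarity over lowercased word sets and keeps the first occurrence of each near-duplicate group. The similarity measure, the default threshold of 0.8, and the function names are assumptions, since the source does not say how the duplicate score is computed.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercased word sets of two texts."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty texts are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


def deduplicate(texts: list, threshold: float = 0.8) -> list:
    """Keep a text only if it is below the threshold against every kept text."""
    kept = []
    for text in texts:
        if all(jaccard(text, other) < threshold for other in kept):
            kept.append(text)
    return kept
```

This pairwise comparison is quadratic in the number of texts, so it suits small datasets; larger datasets typically need an approximate method such as MinHash.
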
Toxicity Detection: Calculates a toxicity score for each text and filters according to a threshold. It is beneficial for all task types.

Video guide

Watch this video guide to learn more about the key functions in data preparation for LLMs.
