
Configuration

Overview

For each data preparation step, you can customize the behavior of the function by configuring parameters. You can use default parameters or configure them based on your specific requirements and the characteristics of your dataset.

The configuration settings can be different from one problem type to another according to the required data preparation steps.

Instructions

  1. On the H2O LLM DataStudio left navigation menu, click Projects.
  2. Click the name of the project that you created before.
  3. On the H2O LLM DataStudio left navigation menu inside the project, click Configuration.
  4. Once all the parameters are configured, click Review to move to the next step.

General configurations

The following configurations can be used for all the workflow/task types.

Source datasets / Augmentation

Select the dataset(s) you need to use. If you select multiple datasets, the datasets will undergo data augmentation. Data augmentation involves mixing the current training dataset with additional datasets. It helps increase the diversity and size of the training dataset and enhances the model’s performance. If none of the datasets are chosen, the first dataset will be used by default.

  1. Click Select to choose an augmentation dataset from Augment to combine with your input dataset.
  2. Click Add to add the dataset.
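
As a rough illustration of what mixing datasets means, the sketch below concatenates a primary dataset with additional ones and shuffles the result. The function name `augment`, the list-of-dicts row format, and the fixed shuffle seed are assumptions made for this example, not the application's actual implementation.

```python
import random


def augment(primary, extra_datasets, seed=0):
    """Mix the primary dataset with additional datasets and shuffle the rows."""
    mixed = list(primary)
    for dataset in extra_datasets:
        mixed.extend(dataset)
    # A seeded generator keeps the shuffle reproducible between runs.
    random.Random(seed).shuffle(mixed)
    return mixed


primary = [{"question": "q1", "answer": "a1"}]
extra = [[{"question": "q2", "answer": "a2"}, {"question": "q3", "answer": "a3"}]]
mixed = augment(primary, extra)
```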

Filter by column

Use this configuration to filter rows of the dataset based on the values in a categorical column. For example, you can filter out curation pairs that are marked as irrelevant.
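
The idea can be sketched as follows. The column name `label` and its values are hypothetical, and rows are represented as plain dictionaries for simplicity.

```python
def filter_by_column(rows, column, keep_values):
    """Keep only the rows whose value in `column` is one of `keep_values`."""
    keep = set(keep_values)
    return [row for row in rows if row.get(column) in keep]


rows = [
    {"text": "How do I reset my password?", "label": "relevant"},
    {"text": "asdf", "label": "irrelevant"},
]
relevant_rows = filter_by_column(rows, "label", ["relevant"])
```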

Text cleaning

This configuration handles text cleaning and preprocessing. Configure the cleaning steps by selecting or deselecting the tags. H2O LLM DataStudio provides options to:

  • remove newline characters,
  • remove whitespaces,
  • lowercase capital letters,
  • remove URLs,
  • remove HTML characters, and
  • remove ASCII characters.
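
The following is a minimal sketch of how most of these options could behave, using only the Python standard library. The function name, the flag names, and the regular expressions are assumptions for illustration. The ASCII option is left out because its exact behavior is not described here.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+|www\.\S+")


def clean_text(text, *, newlines=True, whitespace=True, lowercase=True,
               urls=True, html_chars=True):
    """Apply the selected cleaning steps to a single text value."""
    if html_chars:
        # Drop tags, then turn entities such as &amp; back into characters.
        text = html.unescape(TAG_RE.sub("", text))
    if urls:
        text = URL_RE.sub("", text)
    if newlines:
        text = text.replace("\r", " ").replace("\n", " ")
    if lowercase:
        text = text.lower()
    if whitespace:
        # Collapse runs of whitespace and trim both ends.
        text = " ".join(text.split())
    return text


cleaned = clean_text("<p>Hello\nWORLD</p> see https://example.com  now &amp; then")
```

Each step is controlled by its own flag, which mirrors selecting or deselecting a tag in the configuration.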

Profanity check

Adjust the slider to control how sensitive the detection of offensive language in the text is. This configuration helps filter out content that may be offensive or inappropriate for certain applications. For example, if the threshold is set to 0.9, any text whose profanity score exceeds 0.9 is filtered out.

Detoxify

The Detoxify configuration checks the texts for toxicity and filters them based on the thresholds. It includes four sub-configurations.

  • Acceptable toxicity threshold: Adjust the slider to control the acceptable level of toxicity within the text. For example, if the value is set to 0.9, any text with a score above 0.9 is dropped.
  • Acceptable identity attack threshold: Adjust the slider to control the acceptable level of ‘identity attack’ within the text. For example, if the value is set to 0.9, any text with a score above 0.9 is dropped.
  • Acceptable insult threshold: Adjust the slider to control the acceptable level of ‘insult’ within the text. For example, if the value is set to 0.9, any text with a score above 0.9 is dropped.
  • Acceptable threat threshold: Adjust the slider to control the acceptable level of ‘threat’ within the text. For example, if the value is set to 0.9, any text with a score above 0.9 is dropped.

Note: If no GPU is available, the Detoxify function takes a long time to run.
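
To make the threshold semantics concrete, the sketch below assumes that per-category scores between 0 and 1 have already been computed for each text. The dictionary keys and the `keep_row` helper are illustrative names, not the application's API. A text is kept only if none of its scores is above the matching threshold.

```python
THRESHOLDS = {
    "toxicity": 0.9,
    "identity_attack": 0.9,
    "insult": 0.9,
    "threat": 0.9,
}


def keep_row(scores, thresholds=THRESHOLDS):
    """Return True if no score is above its threshold (missing scores count as 0)."""
    return all(scores.get(name, 0.0) <= limit for name, limit in thresholds.items())


rows = [
    {"text": "Thanks for the help.", "scores": {"toxicity": 0.01, "insult": 0.02}},
    {"text": "(an insulting message)", "scores": {"toxicity": 0.95, "insult": 0.97}},
]
kept = [row for row in rows if keep_row(row["scores"])]
```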

Length check

Adjust the sliders to set the minimum and maximum text length for each column of the dataset. This configuration ensures that the input data meets specific length requirements, so that the text can be truncated or padded to a desired length for model compatibility.
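
A per-column length check could be sketched like this. The sketch assumes that length is measured in characters and that rows outside the range are dropped. Both are assumptions for illustration, as are the column names.

```python
def within_length(row, limits):
    """Check that every listed column has a length inside its (minimum, maximum) range."""
    return all(low <= len(row[column]) <= high
               for column, (low, high) in limits.items())


limits = {"question": (5, 200), "answer": (1, 1000)}
rows = [
    {"question": "What is padding?", "answer": "Adding filler tokens."},
    {"question": "Hm?", "answer": "Too short a question."},
]
in_range = [row for row in rows if within_length(row, limits)]
```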

Text quality check

Adjust the slider to set the minimum and maximum text grade. Only texts within this grade range are included. This configuration assesses the quality or appropriateness of the data by evaluating criteria such as grammar, relevance, or coherence to identify potential issues or areas for improvement in the dataset.

Sensitive info check

Add the sensitive information you wish to drop from the text. The selected sensitive or confidential information will be removed from the text to ensure privacy and data protection.

The available options for the sensitive info check are:

  • Email address
  • Phone number
  • Crypto wallet number
  • The International Bank Account Number (IBAN)
  • IP address
  • Named entity removal

Data Anonymization

Turn the toggle On to anonymize sensitive information. When enabled, all the sensitive data will be transformed into a format that cannot be traced back to the original data.

If the toggle is turned Off, the system will completely remove the sensitive information instead of anonymizing it.
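
The difference between the two toggle states can be sketched with two simplified patterns. The regular expressions and the placeholder format (`<EMAIL>`, `<IP_ADDRESS>`) are assumptions made for this example and are far less thorough than a production detector.

```python
import re

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def scrub(text, anonymize=True):
    """Replace sensitive values with placeholders, or remove them entirely."""
    for label, pattern in PATTERNS.items():
        replacement = f"<{label}>" if anonymize else ""
        text = pattern.sub(replacement, text)
    return text


sample = "Mail jane.doe@example.com from 10.0.0.1"
anonymized = scrub(sample, anonymize=True)
removed = scrub(sample, anonymize=False)
```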

Bias check

Use the slider to set the desired bias threshold. The threshold determines the level of bias that is acceptable within the text.

Example: If you set the threshold to 0.9, any text with a bias score above 0.9 is automatically dropped from the dataset.

Add your own code

Upload your own Python cleaning function in a .py file. The code must be wrapped in the following function to work:

def custom_function(df, text_columns):

You can refer to the sample Python files in the H2O LLM DataStudio GitHub repository to create your own Python code. Additionally, you can download those examples and upload them to the application according to your problem type.
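
A minimal sketch of such a file is shown below. Only the function name and its two parameters come from the required signature. The sketch assumes that `df` is a pandas DataFrame, that `text_columns` is a list of column names, and that the function should return the cleaned DataFrame. The cleaning step itself (collapsing repeated punctuation) is just an example.

```python
import re


def custom_function(df, text_columns):
    """Collapse repeated '!', '?' and '.' characters in the given text columns."""

    def squeeze(value):
        # Leave missing or non-text values untouched.
        if not isinstance(value, str):
            return value
        return re.sub(r"([!?.])\1+", r"\1", value)

    df = df.copy()  # assumption: work on a copy so the input is not modified
    for column in text_columns:
        df[column] = df[column].map(squeeze)
    return df
```

Under these assumptions, calling `custom_function(df, ["text"])` on a DataFrame with a `text` column returns a copy in which a value such as "Wow!!!" becomes "Wow!".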

Pad sequence

Define the maximum padding length for sequences. This configuration is used for sequence padding. It adds padding tokens to sequences to make them equal in length. It is often necessary for efficient batch processing in neural networks.
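
Padding can be illustrated with lists of token IDs. The pad token value `0` and the choice to pad on the right are assumptions for this sketch. Sequences that are already longer than the maximum length are left unchanged here, because shortening them is the job of truncation.

```python
def pad_sequences(sequences, max_length, pad_token=0):
    """Right-pad each sequence with pad_token up to max_length."""
    # A negative repeat count yields an empty list, so long sequences stay as they are.
    return [seq + [pad_token] * (max_length - len(seq)) for seq in sequences]


padded = pad_sequences([[5, 8], [3], [1, 2, 3, 4]], max_length=3)
```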

Truncate sequence

This configuration truncates the input text to a specific length. It removes excess text beyond the desired length, ensuring consistency and compatibility with model requirements. It includes three sub-configurations.

  • Truncate max length: Define the maximum truncating length so that the sequences longer than the specified length will be truncated.
  • Truncate ratio: Set the truncation ratio used to summarize the sequence with TextRank, so that the most informative parts of the sequence are kept rather than truncated.
  • Model based: Toggle the button to enable or disable the model-based truncation.
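
To give an intuition for ratio-based truncation, here is a simplified TextRank-style sketch using only the standard library. It scores sentences by word overlap, ranks them with a PageRank-style iteration, and keeps the top share given by `ratio` in their original order. The function names, the sentence splitter, and the default parameters are assumptions, and the application's own implementation may differ.

```python
import math
import re


def _words(sentence):
    return set(re.findall(r"\w+", sentence.lower()))


def _similarity(a, b):
    """Word overlap normalized by the log of both sentence sizes."""
    wa, wb = _words(a), _words(b)
    if len(wa) < 2 or len(wb) < 2:
        return 0.0
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def textrank_truncate(text, ratio=0.5, damping=0.85, iterations=30):
    """Keep roughly `ratio` of the sentences, preferring the highest ranked ones."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    n = len(sentences)
    if n <= 1:
        return text
    weights = [[_similarity(a, b) if i != j else 0.0
                for j, b in enumerate(sentences)]
               for i, a in enumerate(sentences)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [(1 - damping) + damping * sum(
                      weights[j][i] / out_sum[j] * scores[j]
                      for j in range(n) if out_sum[j] > 0)
                  for i in range(n)]
    keep = max(1, math.ceil(ratio * n))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)[:keep]
    return " ".join(sentences[i] for i in sorted(ranked))


text = ("Cats chase mice in the barn. Mice hide from cats in the barn. "
        "The weather was sunny. Cats and mice live in the barn.")
shortened = textrank_truncate(text, ratio=0.5)
```

In this example the sentence about the weather shares few words with the others, so it ranks low and is dropped first.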

Configurations for question-answering

The following configuration can only be used for the Question and answer task type.

Question relevance check

Toggle the button to check whether the question in each question-answer pair is actually a question. If no question is found in a pair, the app filters out that question-answer pair from further processing.
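
How the check decides that a text is a question is not described here. The sketch below uses a simple, hypothetical heuristic (a trailing question mark or a leading interrogative word) only to show how such a filter fits into a pipeline.

```python
QUESTION_WORDS = {
    "who", "what", "when", "where", "why", "how", "which",
    "is", "are", "do", "does", "did", "can", "could", "should", "would", "will",
}


def looks_like_question(text):
    """Heuristic: ends with '?' or starts with an interrogative word."""
    stripped = text.strip()
    if not stripped:
        return False
    if stripped.endswith("?"):
        return True
    return stripped.split()[0].lower() in QUESTION_WORDS


pairs = [
    {"question": "How does padding work?", "answer": "It adds filler tokens."},
    {"question": "Padding adds tokens.", "answer": "Yes."},
]
valid_pairs = [pair for pair in pairs if looks_like_question(pair["question"])]
```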

Configurations for text summarization

The following configurations can only be used for the Text summarization task type.

Filter compression

Adjust the slider to set the threshold for the compression ratio between the article and its summary. Article-summary pairs with a compression ratio above the threshold are removed, which helps in creating high-quality article-summary pairs for text summarization models.
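
The sketch below assumes the compression ratio is the number of words in the summary divided by the number of words in the article. This definition, the field names, and the threshold value are assumptions for illustration. The application may compute the ratio differently.

```python
def compression_ratio(article, summary):
    """Summary length divided by article length, both counted in words."""
    article_words = len(article.split())
    if article_words == 0:
        return float("inf")  # an empty article can never pass the filter
    return len(summary.split()) / article_words


def filter_by_compression(pairs, threshold):
    """Drop pairs whose compression ratio is above the threshold."""
    return [pair for pair in pairs
            if compression_ratio(pair["article"], pair["summary"]) <= threshold]


pairs = [
    {"article": "one two three four five six seven eight", "summary": "one two"},
    {"article": "one two three four", "summary": "one two three four"},
]
compact_pairs = filter_by_compression(pairs, threshold=0.5)
```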

Add special tokens

Toggle the button to determine whether to add start and end tokens to the texts to mark the beginning and end of each text sequence. This is commonly used in sequence-to-sequence models or language generation tasks.
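
As a small illustration, the tokens can be thought of as markers wrapped around each text. The token strings `<s>` and `</s>` are assumptions here, since the actual tokens depend on the model and tokenizer in use.

```python
def add_special_tokens(text, start_token="<s>", end_token="</s>"):
    """Wrap a text in start and end tokens."""
    return f"{start_token}{text}{end_token}"


wrapped = add_special_tokens("Summarize the article.")
```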

Configurations for human-bot conversation

The following configuration can only be used for the Human-Bot conversation task type.

Flatten conversation

Toggle the button to determine whether to flatten the human-bot conversation dataset. When enabled, the dataset will be flattened into a single sequence, disregarding the individual turns.
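
Flattening can be sketched as joining the text of all turns into one string. The turn format (a list of dictionaries with `speaker` and `text` keys) and the separator are assumptions made for this example.

```python
def flatten_conversation(turns, separator=" "):
    """Join the text of every turn into a single sequence, ignoring who spoke."""
    return separator.join(turn["text"] for turn in turns)


conversation = [
    {"speaker": "human", "text": "Hi, can you help me?"},
    {"speaker": "bot", "text": "Sure, what do you need?"},
    {"speaker": "human", "text": "A summary of this page."},
]
flat = flatten_conversation(conversation)
```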

