FAQs

The sections below provide answers to frequently asked questions. If you have additional questions, please send them to cloud-feedback@h2o.ai.


General

What are the main workflows supported by H2O LLM DataStudio?

H2O LLM DataStudio supports several workflows and task types, including Question and Answer models, Text Summarization, Instruct Tuning, Human-Bot Conversations, and Continued PreTraining of language models. It provides tailored functionalities to optimize data preparation for each task type. For more information, see Supported problem types.

What are some of the main features of H2O LLM DataStudio?

H2O LLM DataStudio offers a wide range of data preprocessing and preparation functions, such as text cleaning, text quality detection, tokenization, truncation, and data augmentation with external and RLHF datasets. It also has tools for length checking, relevance checking, profanity checking, and more. For more information, see Supported functionalities.

What is the importance of good data for training Large Language Models?

Good data is crucial for training LLMs because it influences the accuracy, reliability, and effectiveness of the trained models. High-quality, well-prepared data leads to models that understand and generate language more effectively and provide more accurate responses in real-world applications. For more information, see Impact of good data vs bad data in Downstream NLP tasks.

Is the LLM DataStudio multi-user or single user?

LLM DataStudio was designed for a single user at a time, although multiple users can use the same instance. There is, however, no support for concurrent labeling: if multiple people label at the same time, you run the risk of having your labels overridden.

Is there an API for LLM DataStudio?

No, there is no API for LLM DataStudio. It is a GUI-based application.

If GPT is used, is LLM DataStudio still necessary, or is filtering handled automatically?

H2O LLM DataStudio manages tasks beyond question-answer pair generation, including QA based on RAG, QA based on a pure LLM, and QA diversity, as well as the Prepare and Augment components. DataStudio provides various data-related functions, including filtering, in the Prepare section of the workflow.

What does the relevance mean for the question-answer pairs?

Relevance in question-answer pairs refers to how relevant a question is to the given context (chunk) of the document from which it was generated.

What are the models used in LLM DataStudio?

LLM DataStudio uses the whisper-tiny model to transcribe audio files, and BERT models to compute relevance scores and to create the text embeddings used for clustering. When Smart Chunking is turned on, text chunks are clustered on these embeddings to identify a diverse sample of chunks.

Prepare

How is detoxifying performed in LLM DataStudio?

Detoxifying uses a BERT model trained on the public Jigsaw Unintended Bias in Toxicity dataset. This model returns predictions for general toxicity and for the toxicity subtypes identity attack, insult, and threat. The thresholds used to flag a question-answer pair are configurable in the application during the Preparation phase. By default, any record with a prediction greater than 0.9 for any toxicity feature is dropped.
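The default dropping rule above can be sketched as a simple threshold filter. This is a minimal illustration, not DataStudio's actual implementation; the field names and record layout are assumptions.

```python
# Sketch of the default detox rule: drop any record whose predicted score
# for any toxicity feature exceeds the threshold (0.9 by default).
# The feature/field names below are illustrative, not DataStudio's schema.
TOXICITY_FEATURES = ["toxicity", "identity_attack", "insult", "threat"]

def filter_toxic(records, threshold=0.9):
    """Keep only records whose toxicity scores are all <= threshold."""
    return [
        r for r in records
        if all(r.get(f, 0.0) <= threshold for f in TOXICITY_FEATURES)
    ]

pairs = [
    {"question": "What is H2O?", "answer": "...", "toxicity": 0.02, "insult": 0.01},
    {"question": "...", "answer": "...", "toxicity": 0.95, "insult": 0.10},
]
clean = filter_toxic(pairs)  # the second record exceeds 0.9 and is dropped
```

In the application itself, the threshold is configurable per run during the Preparation phase rather than hard-coded.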

What are the file formats supported for exporting the processed data?

After the data preparation process is completed, the resulting dataset can be exported in various formats, such as JSON, CSV, and Parquet. The choice of export file format may depend on your specific requirements and the nature of your downstream tasks.
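As a rough illustration of what the JSON and CSV export shapes look like for a question-answer dataset, the standard library is enough (Parquet would typically require pandas/pyarrow and is omitted here). The column names are assumptions for the sketch.

```python
import csv
import json

# Hypothetical prepared QA dataset; the "question"/"answer" column names
# are illustrative assumptions, not DataStudio's exact export schema.
pairs = [
    {"question": "What is H2O LLM DataStudio?", "answer": "A data prep tool for LLMs."},
    {"question": "What export formats exist?", "answer": "JSON, CSV, and Parquet."},
]

# JSON export: one list of records.
with open("dataset.json", "w") as f:
    json.dump(pairs, f, indent=2)

# CSV export: one row per question-answer pair.
with open("dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "answer"])
    writer.writeheader()
    writer.writerows(pairs)
```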

How does H2O LLM DataStudio handle data quality?

H2O LLM DataStudio offers several tools to ensure data quality. It provides functionalities for text cleaning, text quality checking, profanity checking, and sensitive information detection. These tools help refine raw datasets and ensure their quality and suitability for training LLMs.

Can I manage multiple data tasks in H2O LLM DataStudio?

Yes, H2O LLM DataStudio allows users to manage their data tasks effectively. It provides a user-friendly Projects tab where users can create, organize, and track their data preparation projects. For more information, see View projects.

How does H2O LLM DataStudio ensure data privacy and safety?

H2O LLM DataStudio has several features to ensure data privacy and safety. It has tools to identify and remove texts containing sensitive information or profanity. By using these features, users can ensure that the data used for training LLMs is clean, safe, and suitable for the task at hand.

Are there redundancy checks in LLM DataStudio? (e.g., two question-answer pairs that are very similar)

The LLM is prompted to create high-quality questions and answers, and Smart Chunking (for sampling) ensures that diverse chunks are sampled rather than a random sample: clusters are identified from the vector embeddings using k-means, and a sample of chunks from each cluster is used to create question-answer pairs. There is also a Deduplication check in the Prepare section (the data cleaning pipeline), which removes question-answer pairs based on a duplicate score threshold. It uses MinHashing to convert each text into a hash signature. Jaccard similarity is then calculated on the hash signatures, and every question-answer pair above the similarity threshold (which defaults to 0.9) is considered a duplicate and dropped.
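The MinHash-plus-Jaccard deduplication step can be sketched in plain Python. This is a minimal, illustrative implementation of the general technique, assuming character shingles and a 64-function signature; DataStudio's actual shingling, hash count, and comparison strategy are not documented here.

```python
import hashlib
import random

def shingles(text, k=3):
    """Character k-shingles of the whitespace-normalized, lowercased text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def minhash_signature(shingle_set, num_hashes=64, seed=0):
    """MinHash signature: for each salted hash function, the minimum hash
    value over all shingles. Similar sets yield similar signatures."""
    salts = [random.Random(seed + i).getrandbits(32) for i in range(num_hashes)]
    return [
        min(
            int.from_bytes(hashlib.md5(f"{salt}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set
        )
        for salt in salts
    ]

def signature_similarity(a, b):
    """Fraction of matching positions, which estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def dedupe(pairs, threshold=0.9):
    """Drop any pair whose estimated similarity to an earlier kept pair
    exceeds the threshold (mirroring the 0.9 default described above)."""
    kept, sigs = [], []
    for p in pairs:
        sig = minhash_signature(shingles(p["question"] + " " + p["answer"]))
        if all(signature_similarity(sig, s) <= threshold for s in sigs):
            kept.append(p)
            sigs.append(sig)
    return kept

qa = [
    {"question": "What is H2O DataStudio?", "answer": "A data preparation tool."},
    {"question": "What is H2O DataStudio?", "answer": "A data preparation tool."},
    {"question": "How does chunking work?", "answer": "Chunks are 4000 characters."},
]
unique = dedupe(qa)  # the exact duplicate is dropped
```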

Curate and Custom Eval

When converting documents to question-answer pairs, is there a way of seeing the progress of the data curation process in percentage?

Yes. On the Logs: Doc2QA Project panel, click Refresh to review the progress and view the percentage of completion for each file. For more information, see Create a new project for data curation.

Can we stop the process midway and download the question-answer pairs generated so far, before it is fully complete?

Yes, you can. To stop the data curation process midway, click Terminate on the Logs: Doc2QA Project panel. This terminates the running process, and the question-answer pairs generated so far remain available to download.

Can you create question-answer pairs on multiple files in one go?

Yes, you can create question-answer pairs for multiple files in one go by compressing the files into a ZIP file and uploading it to LLM DataStudio.

How does LLM DataStudio work with audio files?

LLM DataStudio supports audio files. Uploaded audio files are converted to a transcript using the whisper-tiny model, and LLM DataStudio then continues with the curation process as it would for a PDF or text document.

For question and answer generation in LLM DataStudio, can you select the LLM to use to create question-answer pairs?

When creating a new project for data curation or creating your own evaluation dataset, LLM DataStudio lets you select either the Mixtral or Llama 2 LLM.

Can I merge datasets in LLM DataStudio?

You can upload a ZIP file containing multiple files, and question-answer pairs will be created for each file. You can determine which file each pair came from using the File name column. You can also use the Augment option when you Prepare a dataset, which lets you combine your question-answer dataset with one or more Augment datasets. The Augment datasets are preloaded, publicly available fine-tuning datasets that are known to be helpful for fine-tuning.

How does chunking work in LLM DataStudio?

Each chunk is 4000 characters long, and adjacent chunks have a small overlap with each other. To speed up the creation of question-answer pairs, you can turn on Smart Chunking mode, in which clusters are identified from the vector embeddings using k-means and a sample of chunks from each cluster is used to create question-answer pairs.
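The fixed-size overlapping chunking described above can be sketched as follows. The 4000-character chunk size comes from the answer; the 200-character overlap is an illustrative assumption, since the exact overlap used by the application is not stated.

```python
def chunk_text(text, chunk_size=4000, overlap=200):
    """Split text into fixed-size chunks where adjacent chunks overlap.

    chunk_size matches the 4000-character chunks described above; the
    overlap value is an assumed example, not DataStudio's exact setting.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # final chunk reached the end of the text
    return chunks

# A 9000-character document produces three chunks, each sharing
# its first 200 characters with the tail of the previous chunk.
doc = "".join(str(i % 10) for i in range(9000))
chunks = chunk_text(doc)
```

With Smart Chunking on, such chunks would then be embedded, clustered with k-means, and sampled per cluster rather than processed exhaustively.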

Augment

Does H2O LLM DataStudio support data augmentation?

Yes, H2O LLM DataStudio supports data augmentation. It enables mixing or augmenting multiple datasets together for all task types, which can be important for improving the robustness and performance of the trained models. In some cases, you can integrate your datasets with RLHF-related datasets to provide more domain aspects. The Augment tab shows a catalog of rich datasets that can be used immediately in the Prepare pipeline. You can also bring your own datasets for the augmentation process. For more information, see Augment.

