
FAQs

The sections below provide answers to frequently asked questions. If you have additional questions, please send them to cloud-feedback@h2o.ai.


What are the main workflows supported by H2O LLM DataStudio?

H2O LLM DataStudio supports several workflows and task types, including Question and Answer models, Text Summarization, Instruct Tuning, Human-Bot Conversations, and Continued PreTraining of language models. It provides tailored functionalities to optimize data preparation for each task type. For more information, see Supported problem types.

What are some of the main features of H2O LLM DataStudio?

H2O LLM DataStudio offers a wide range of data preprocessing and preparation functions, such as text cleaning, text quality detection, tokenization, truncation, and data augmentation with external and RLHF datasets. It also has tools for length checking, relevance checking, profanity checking, and more. For more information, see Supported functionalities.
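To make the length and profanity checks more concrete, here is a minimal Python sketch of that kind of filter. It is not DataStudio's implementation: the function names, the `answer` field, the word-count thresholds, and the placeholder blocklist are all assumptions chosen for illustration.

```python
def passes_checks(text, min_words=3, max_words=200, blocklist=frozenset({"badword"})):
    """Return True if text is within the word-count range and has no blocked word.

    The thresholds and the blocklist are illustrative assumptions.
    """
    words = text.lower().split()
    # Length check: reject texts that are too short or too long.
    if not (min_words <= len(words) <= max_words):
        return False
    # Profanity check: compare each word, minus trailing punctuation, to the blocklist.
    return not any(w.strip(".,!?;:") in blocklist for w in words)


def filter_records(records, field="answer", **kwargs):
    """Keep only the records whose `field` text passes both checks."""
    return [r for r in records if passes_checks(r[field], **kwargs)]
```

Calling `filter_records(records)` on a list of dictionaries returns only the rows whose `answer` text is within the length range and free of blocked words.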

What are the file formats supported for exporting the processed data?

After the data preparation process is complete, you can export the resulting dataset in formats such as JSON, CSV, and Parquet. Which export format to choose depends on your specific requirements and the nature of your downstream tasks.
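As an illustration of a downstream step, the following Python sketch loads an exported file into a list of records using only the standard library. The `load_records` helper is hypothetical, and it assumes that a JSON export is a top-level list of records and that a CSV export has a header row; check the layout of your own export before relying on it.

```python
import csv
import json
from pathlib import Path


def load_records(path):
    """Load an exported dataset as a list of dicts, chosen by file extension."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".json":
        # Assumption: the JSON export is a top-level list of records.
        with path.open(encoding="utf-8") as f:
            return json.load(f)
    if suffix == ".csv":
        # Assumption: the first row of the CSV export holds the column names.
        with path.open(newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))
    # Parquet is not handled here because it needs a third-party reader.
    raise ValueError(f"unsupported export format: {suffix}")
```

Parquet files require a third-party reader (for example, pandas with a Parquet engine installed), so this sketch leaves them out.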

What is the importance of good data for training Large Language Models?

Good data is crucial for training LLMs because it influences the accuracy, reliability, and effectiveness of the trained models. High-quality, well-prepared data leads to models that understand and generate language more effectively and give more accurate responses in real-world applications. For more information, see Impact of good data vs bad data in Downstream NLP tasks.

How does H2O LLM DataStudio handle data quality?

H2O LLM DataStudio offers several tools to ensure data quality. It provides functionalities for text cleaning, text quality checking, profanity checking, and sensitive information detection. These tools help refine raw datasets so that they are of sufficient quality and suitable for training LLMs.

Can I manage multiple data tasks in H2O LLM DataStudio?

Yes, H2O LLM DataStudio allows users to manage their data tasks effectively. It provides a user-friendly Projects tab where users can create, organize, and track their data preparation projects. For more information, see View projects.

Does H2O LLM DataStudio support data augmentation?

Yes, H2O LLM DataStudio supports data augmentation. It lets you mix multiple datasets together for all task types, which can improve the robustness and performance of the trained models.

How does H2O LLM DataStudio ensure data privacy and safety?

H2O LLM DataStudio has several features to ensure data privacy and safety. It has tools to identify and remove texts containing sensitive information or profanity. By using these features, users can ensure that the data used for training LLMs are clean, safe, and suitable for the task at hand.

What does augmentation mean?

Augmentation allows you to blend your datasets with other publicly available datasets to add variety. In some cases, you can also integrate your datasets with RLHF-related datasets to cover more domain aspects. The Augment tab shows a catalog of rich datasets that can be used immediately in the Prepare pipeline. You can also bring your own datasets for the augmentation process. For more information, see Augment.
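Conceptually, blending amounts to combining your own records with a sample drawn from another dataset. The Python sketch below illustrates the idea only; it is not how DataStudio performs augmentation, and the `blend` function, the `external_fraction` parameter, and the fixed seed are assumptions made for the example.

```python
import random


def blend(own, external, external_fraction=0.25, seed=0):
    """Return `own` plus a random sample of `external`, shuffled together.

    external_fraction is the target share of the blended set that comes
    from `external` (an illustrative parameter, not a DataStudio setting).
    """
    if not 0 <= external_fraction < 1:
        raise ValueError("external_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Number of external rows needed to reach the target share of the total.
    n_ext = round(len(own) * external_fraction / (1 - external_fraction))
    # Never ask for more rows than the external dataset contains.
    n_ext = min(n_ext, len(external))
    mixed = list(own) + rng.sample(list(external), n_ext)
    rng.shuffle(mixed)
    return mixed
```

With six records of your own and `external_fraction=0.25`, the blended set holds eight records: your six plus two sampled from the external dataset.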

When converting documents to question-answer pairs, is there a way of seeing the progress of the data curation process in percentage?

Yes. On the Logs: Doc2QA Project panel, click Refresh to review the progress and view the percentage of completion for each file in your document set. For more information, see Create a new project for data curation.

Can we stop the process midway and download the question-answer pairs generated so far, before it is fully complete?

Yes, you can. To stop the data curation process midway, click Terminate on the Logs: Doc2QA Project panel. This terminates the running process, and the question-answer pairs generated so far remain available to download.

If GPT is used, is DataStudio still necessary, or is filtering handled automatically?

H2O LLM DataStudio manages tasks beyond QA pair generation. These include QA based on RAG, QA based on pure LLM, and QA Diversity, as well as management of the Prep Component and the Augment Component. DataStudio also provides various data-related functions, including filtering, which is available in the Prepare section of the workflow.

What does the relevance mean for the Q&A pairs? Is it about the question's relevance to the document or how relevant the answer is to the question?

The relevance in Q&A pairs pertains to how relevant the question is within the given contexts (chunks) of the document.
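The distinction can be illustrated with a toy scorer that compares the question against the document chunks and never looks at the answer. This is only a sketch: the source does not say how DataStudio computes relevance, and the token-overlap measure and the `tokens` and `relevance` names are stand-ins invented for the example.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def relevance(question, chunks):
    """Highest fraction of question tokens found in any single chunk.

    Token overlap is an illustrative stand-in, not DataStudio's metric.
    Note that the answer is never consulted.
    """
    q = tokens(question)
    if not q:
        return 0.0
    return max((len(q & tokens(c)) / len(q) for c in chunks), default=0.0)
```

A question that shares no vocabulary with any chunk scores 0.0, regardless of how well its answer matches the question.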

