
Create a new project for data curation

Overview

H2O LLM DataStudio supports the conversion of documents to question-answer pairs, article summarization, and file summarization. To do this, start by creating a new project.

note

Before you create a project, you must configure h2ogpte or Gradio Client by providing the required credentials. A project cannot be created until one of them is configured. For more information, see Settings.

Instructions

To create a new project for data curation, follow these steps:

  1. On the H2O LLM DataStudio left navigation menu, click Curate.

  2. On the All Projects / Curate Data for LLMs page, click New.

  3. In the Project name text box, enter a name for the project (for example, My new curate project).

  4. In the Description text box, enter a description for the project.

  5. In the Document type text box, provide a brief description of your file's content. This label helps you quickly categorize and identify the document's purpose. Examples of document types include Quarterly Financial Report, Purchase Order, User Guide, Product Brief, Research Paper, and Meeting Recording.

  6. From the task type drop-down menu, select the appropriate task type for the experiment.

    • Question-answer task: Generates question-answer pairs from documents for LLM fine-tuning.
    • Summarization task: Generates chunk summaries for LLM fine-tuning.
    • File Summary task: Generates a summary of the complete file. (Note: This task type cannot be used for fine-tuning due to context limitations.)
  7. Click Browse and choose the file you want to upload, or add the webpage URL and click Run.

    note
    • H2O LLM DataStudio supports PDF, DOCX, TXT, MD, and RST documents, as well as audio files (MP3, M4A, WAV).
    • If you have multiple documents, you can upload them in a ZIP file.
  8. Click Upload to upload the document.

  9. After you upload the documents, choose how to trigger the pipeline for question-answer pair generation:

    1. Click Run pipeline to trigger the pipeline directly, or
    2. Select the Smart chunking option to filter the best chunks of data and generate question-answer pairs only from those chunks.
    note
    • Smart chunking is recommended when a document has hundreds or thousands of pages and you want to generate question-answer pairs quickly.
    • With the Smart chunking option, you can specify a Sampling ratio.
    • By default, the sampling ratio is 0. When the ratio is 0, LLM DataStudio automatically selects the sampling ratio based on the length of the document.
    • If you set the sampling ratio manually, a value greater than or equal to 0.5 is recommended.
  10. You can check the logs of the curation process from Logs: Doc2QA Project. Click Refresh to review the progress and the percentage of completion. Click Terminate to stop the data curation process midway. This ends the running process, and the question-answer pairs generated so far remain available to view and download.
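
The note under step 7 says that multiple documents can be uploaded as a single ZIP file. The following is a minimal sketch of one way to prepare such an archive locally before uploading it through the UI. It uses only the Python standard library and does not call any H2O LLM DataStudio API. The function name `bundle_documents`, the flat archive layout, and the choice to skip unsupported file types are assumptions made for illustration, not documented requirements.

```python
from pathlib import Path
import zipfile

# File extensions listed as supported in the note under step 7.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".txt", ".md", ".rst", ".mp3", ".m4a", ".wav"}


def bundle_documents(source_dir, zip_path):
    """Zip the supported files found directly in source_dir.

    Hypothetical helper: subdirectories are not searched, and files are
    stored at the top level of the archive (an assumed layout).
    Returns the list of file names that were added.
    """
    source = Path(source_dir)
    added = []
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source.iterdir()):
            # Compare extensions case-insensitively so "REPORT.PDF" is included.
            if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS:
                archive.write(path, arcname=path.name)
                added.append(path.name)
    return added
```

For example, `bundle_documents("reports", "reports.zip")` would collect the supported files from a local `reports` folder (a placeholder name), and the resulting `reports.zip` can then be selected with Browse in step 7.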

