Dataset preparation for Text Summarization

Overview

This tutorial describes how to prepare a dataset of articles and their associated summaries. A well-structured dataset of this kind is essential for training text summarization models that generate succinct, informative summaries of lengthy text.

Prerequisites

Step 1: Explore the project

For this tutorial, we use the prebuilt CNN-DailyMail project, which contains the CNN/DailyMail news dataset. Let's explore the project:

  1. On the H2O LLM DataStudio navigation menu, click Prepare.
  2. Explore the table with detailed information about both prebuilt projects and projects that you created. For more information on viewing the projects, see View projects.
  3. Click on the CNN-DailyMail project name and navigate to the corresponding data preparation steps.

Step 2: Ingest data

The first step of data preparation is ingesting data by uploading your datasets. In this tutorial, we are proceeding with the preloaded CNN/DailyMail news dataset. Let's configure the preloaded dataset.

  1. To preview the dataset, click Preview under the Configure datasets section. The preview shows the top 100 rows of the dataset.
  2. Under the Configure columns section, select the relevant columns for article and summary from the given options.
  3. Click Save.

Step 3: Build the workflow

Using the workflow builder tool, let's configure the order of data preparation steps.

  1. Inside the CNN-DailyMail project, click Workflow on the left navigation menu.
  2. Drag and drop the required steps in the following sequence. To learn more about the workflow builder tool, see Workflow builder.
    Workflow

    Augmentation > Text cleaning > Profanity Check > Detoxify > Length Check > Text Quality Check > Sensitive Info Check > Filter Compression Ratio > Language Understanding > Deduplication > Boundary marking > Padding sequence > Truncate sequence > Output

    note

    If no GPU is available, the Detoxify step takes a long time to run.

  3. After configuring the order of data preparation steps, click Configure to run the workflow.
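
To make the effect of a few of these steps concrete, here is a minimal Python sketch that chains three of them (Length Check, Filter Compression Ratio, and Deduplication) and records how many rows survive each one. It is not H2O LLM DataStudio's implementation: the function names, the `article` and `summary` column names, the thresholds, and the definition of compression ratio as summary words divided by article words are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]
Step = Callable[[List[Row]], List[Row]]


def length_check(rows: List[Row], min_article_words: int = 5) -> List[Row]:
    """Keep rows whose article has at least min_article_words words (assumed threshold)."""
    return [r for r in rows if len(r["article"].split()) >= min_article_words]


def filter_compression_ratio(rows: List[Row], max_ratio: float = 0.5) -> List[Row]:
    """Keep rows where summary words / article words <= max_ratio (assumed definition)."""
    kept = []
    for r in rows:
        article_words = len(r["article"].split())
        summary_words = len(r["summary"].split())
        if article_words > 0 and summary_words / article_words <= max_ratio:
            kept.append(r)
    return kept


def deduplicate(rows: List[Row]) -> List[Row]:
    """Drop rows whose article repeats an earlier one, ignoring case and extra spaces."""
    seen = set()
    kept = []
    for r in rows:
        key = " ".join(r["article"].lower().split())
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept


def run_workflow(rows: List[Row], steps: List[Step]) -> Tuple[List[Row], List[Tuple[str, int]]]:
    """Apply the steps in order and record the row count after each step."""
    counts = []
    for step in steps:
        rows = step(rows)
        counts.append((step.__name__, len(rows)))
    return rows, counts
```

Because each step only sees the rows that the previous step kept, the order chosen in the workflow builder affects both the run time and the per-step row counts.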

Step 4: Configure the parameters

H2O LLM DataStudio allows you to customize the behavior of the function for each data preparation step by setting parameters. Let's use the default parameter configurations for this tutorial.

  1. Inside the CNN-DailyMail project, click Configuration on the left navigation menu.
  2. Go through each parameter configuration. To learn more about the available configurations, see Configuration.
  3. Once all the parameters are configured, click Review to move to the next step.

Step 5: Review the configured parameters and execute the workflow

Let's review the configured parameters before executing the data preparation workflow to ensure accuracy.

  1. Inside the CNN-DailyMail project, click Review on the left navigation menu.
  2. Once you are satisfied with the configured parameters, click Run pipeline to initiate the execution of the workflow.

Step 6: Review and analyze the output

After the data preparation process is completed, a resulting dataset is generated. Let's take time to review and analyze the output.

  1. Inside the CNN-DailyMail project, click Output on the left navigation menu.
    • The Output page graphically represents the number of rows and the percentage of rows against input rows in each data preparation step.
    • The Output page includes a preview of the final dataset with the top 100 rows.
  2. Click Download CSV to download the output dataset in the CSV file format.
  3. Click Export to H2O Drive to export the output dataset to H2O Drive.
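
Once the CSV is downloaded, you may want to inspect it outside the UI. The following Python sketch reads a CSV into rows and reproduces the "percentage of rows against input rows" arithmetic shown on the Output page. The helper names, the file name mentioned in the comment, and the `article` and `summary` column names are assumptions; the actual column names depend on what you selected in Step 2.

```python
import csv
import io
from typing import Dict, List, TextIO


def load_rows(handle: TextIO) -> List[Dict[str, str]]:
    """Read a CSV file object into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))


def retention_percent(rows_after_step: int, input_rows: int) -> float:
    """Percentage of input rows still present after a step, rounded to one decimal."""
    if input_rows == 0:
        return 0.0
    return round(100.0 * rows_after_step / input_rows, 1)


# In-memory stand-in for a downloaded file. With a real download you would use
# open("output.csv", newline="", encoding="utf-8") instead (file name is hypothetical).
sample = io.StringIO("article,summary\nA long news article,A short summary\n")
rows = load_rows(sample)
print(len(rows), sorted(rows[0].keys()))
print(retention_percent(len(rows), 4))  # assumes a hypothetical input of 4 rows
```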

Step 7: Compare input and output datasets

As the final step, we can compare and see the differences between the input and output datasets. Let's take a look at text cleaning differences, selected columns and dataset row differences.

  1. Inside the CNN-DailyMail project, click Insights on the left navigation menu.
  2. Compare the input dataset with the output dataset.
  3. Click View dropped rows to view the rows dropped from the input dataset.
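
If you have both the input and output datasets as files, a similar dropped-rows comparison can be approximated offline. The sketch below assumes both datasets share a stable identifier column, here hypothetically named `id`; matching on the article text itself would be unreliable because text cleaning may change it. This illustrates the idea only and is not how the Insights page is implemented.

```python
from typing import Dict, List

Row = Dict[str, str]


def find_dropped_rows(input_rows: List[Row], output_rows: List[Row], key: str = "id") -> List[Row]:
    """Return input rows whose key value does not appear in the output rows."""
    kept_keys = {row[key] for row in output_rows}
    return [row for row in input_rows if row[key] not in kept_keys]
```

For example, given input rows with ids `a`, `b`, and `c` and output rows with ids `a` and `c`, the function returns only the row with id `b`.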

Summary

In this tutorial, we learned how to prepare a dataset with article and summary columns for the Text Summarization problem type. We also saw that H2O LLM DataStudio lets you upload, prepare, and analyze your datasets to achieve your desired data transformation goals.

