Version: v0.3.0

Tutorial 4A: Annotation task: Text summarization

Overview

This tutorial walks through the process of annotating and specifying an annotation task rubric for a text summarization annotation task. To highlight the process, we are going to annotate a dataset that contains human-generated abstractive summaries from news stories published on the Cable News Network (CNN) and Daily Mail websites. This tutorial also briefly explores how you can download the fully annotated dataset in a format that H2O Hydrogen Torch supports.

Step 1: Explore dataset

We are going to use the preloaded cnn-dailymail-sample demo dataset for this tutorial. The dataset contains 100 samples (text), each containing a summary of a CNN or Daily Mail article. Let's quickly explore the dataset.

  1. On the H2O Label Genie navigation menu, click Datasets.
  2. In the datasets table, click cnn-dailymail-sample.

Step 2: Create an annotation task

Now that we have seen the dataset, let's create an annotation task that enables you to annotate it. An annotation task refers to the process of labeling data. For this tutorial, the text summarization annotation task refers to writing a summary for each text input.

  1. Click New annotation task.
  2. In the Task name box, enter tutorial-4a.
  3. In the Task description box, enter Annotate a dataset containing summaries from news stories from CNN and the Daily Mail websites.
  4. In the Select dataset list, select cnn-dailymail-sample.
  5. In the Select task list, select Summarization.
  6. In the Select text column box, select text.
  7. Click Create task.

Step 3: Specify annotation task rubric

Before we can start annotating our dataset, we need to specify an annotation task rubric. An annotation task rubric refers to the labels (for example, object classes) you want to use when annotating your dataset.

  1. In the Select model box, select sshleifer/distilbart-cnn-12-6.
    • The Select model value refers to the zero-shot learning model to use in your annotation task. To learn more, see Text summarization.
  2. In the Min target length box, enter 32.
    • The Min target length value refers to the minimum character length of your summaries.
  3. In the Max target length box, enter 128.
    • The Max target length value refers to the maximum character length of your summaries.
  4. Click Continue to annotate.
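
As a rough illustration of how the two length bounds constrain a summary, the following Python sketch checks a candidate summary against the Min target length and Max target length values entered above. It takes the descriptions above at face value and counts characters by default. The function name `within_target_length` and its `measure` parameter are hypothetical and are not part of H2O Label Genie; if the tool measures length differently (for example, in tokens), a different `measure` can be passed in.

```python
from typing import Callable


def within_target_length(
    summary: str,
    min_target_length: int = 32,   # value entered in the Min target length box
    max_target_length: int = 128,  # value entered in the Max target length box
    measure: Callable[[str], int] = len,  # assumption: length counted in characters
) -> bool:
    """Return True if the measured length of `summary` lies within both bounds."""
    if min_target_length > max_target_length:
        raise ValueError("min_target_length must not exceed max_target_length")
    return min_target_length <= measure(summary) <= max_target_length


# A 31-character summary falls below the minimum of 32.
too_short = within_target_length("x" * 31)

# The same check with a hypothetical whitespace-based word count instead.
by_words = within_target_length(
    "one two three", min_target_length=2, max_target_length=5,
    measure=lambda s: len(s.split()),
)
```

Keeping the length measure as a parameter avoids hard-coding an assumption about how the tool counts length.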

Step 4: Annotate dataset

Now that we have specified the annotation task rubric, let's annotate the dataset. In the Annotate tab, you can individually annotate each summary in the dataset. Let's annotate the first summary.

  1. A zero-shot learning model is on by default when you annotate a text summarization annotation task. The model accelerates the annotation process by summarizing a given original text (sample).

    You can immediately start annotating in the Annotate tab or wait until the zero-shot model is ready to provide annotation suggestions. H2O Label Genie notifies you to Refresh the instance when zero-shot predictions (suggestions) are available.

    For example, in this tutorial, after refreshing the instance, the model generated the following summary for the first original text (sample).

    U.N. inspectors leave Syria, carrying evidence that will determine whether chemical weapons were used in an attack early last week in a Damascus suburb. "The issues are too big for business as usual," President Obama says in a televised address. U.S. officials have said there's no doubt that the Syrian government was behind the attack, while Syrian officials have denied responsibility. The inspectors will share their findings with Ban Ki-moon Ban, who has said he wants to wait until their final report is completed.

    Note
    • To learn about the utilized model for a text summarization annotation task, see Zero-shot learning models: Text summarization.
    • During the annotation process of a text summarization dataset, you can download generated zero-shot predictions in the Export tab. To download all generated zero-shot predictions, consider the following instructions:
      caution
      • If the Enable zero-shot predictions setting is turned Off, the zero-shot learning model for a text summarization annotation task is not available during the annotation process, and no zero-shot predictions are generated. To turn On the Enable zero-shot predictions setting, see Enable zero-shot predictions.
      • The time it takes H2O Label Genie to generate zero-shot predictions depends on the computational resources of the instance.
      1. Click the Export tab.
      2. In the Export zero-shot predictions list, select Download ZIP.
  2. Click Save and next.

    Note
    • Save and next saves the annotated text.
    • To skip a text and annotate it later, click Skip.
      • Skipped summaries reappear after all non-skipped summaries are annotated.
    • To download all annotated samples so far, consider the following instructions:
      1. Click the Export tab.
      2. In the Export approved samples list, select Download ZIP.
        Note

        H2O Label Genie downloads a zip file containing the annotated dataset in a format that is supported in H2O Hydrogen Torch. To learn more, see Downloaded dataset formats: Text summarization.

Step 5: Download annotated dataset

After annotating all the summaries, you can download the dataset in a format that H2O Hydrogen Torch supports. Let's download the annotated dataset.

  1. In the Annotate tab, click Export approved samples.
  2. In the Export approved samples list, select Download ZIP.
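
Once the ZIP file is downloaded, it can be useful to check its contents before importing it elsewhere. The following Python sketch lists the first rows of any CSV files found in an archive, using only the standard library. It assumes the export contains at least one CSV file, which this tutorial does not confirm; the actual layout is described in Downloaded dataset formats: Text summarization. The function name `preview_export` and the file name in the usage comment are hypothetical.

```python
import csv
import io
import itertools
import zipfile
from typing import BinaryIO, Dict, List, Union


def preview_export(
    archive: Union[str, BinaryIO], max_rows: int = 3
) -> Dict[str, List[List[str]]]:
    """Return up to `max_rows` rows from each CSV file inside a ZIP archive."""
    previews: Dict[str, List[List[str]]] = {}
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            # Assumption: tabular data is stored as .csv; other members are skipped.
            if not name.lower().endswith(".csv"):
                continue
            with zf.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                reader = csv.reader(text)
                previews[name] = list(itertools.islice(reader, max_rows))
    return previews


# Usage (hypothetical file name for the downloaded archive):
# for member, rows in preview_export("tutorial-4a.zip").items():
#     print(member, rows)
```

The function accepts either a path or an open binary file object, so the same check works on an archive held in memory.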

Summary

In this tutorial, we learned the process of annotating and specifying an annotation task rubric for a text summarization task. We also learned how to download a fully annotated dataset in a format that H2O Hydrogen Torch supports.

Next

To learn the process of annotating and specifying an annotation task rubric for other annotation tasks in computer vision (CV), natural language processing (NLP), and audio, see Tutorials.

