
Tutorial 4A: Text summarization annotation task

Overview

This tutorial describes the process of creating a text summarization annotation task, including specifying an annotation task rubric for it. To illustrate the process, we will annotate a dataset that contains human-written abstractive summaries of news stories published on the Cable News Network (CNN) and Daily Mail websites.

Step 1: Explore dataset

We are going to use the preloaded CNN Daily Mail sample demo dataset for this tutorial. The dataset contains 100 text samples, each a CNN or Daily Mail news article accompanied by a human-written summary. Let's quickly explore the dataset.

note

The dataset already contains a summary column. For the purposes of this tutorial, we will ignore that column and write our own summaries to see how one can create a summarization annotation task.

  1. On the H2O Label Genie navigation menu, click Datasets.
  2. In the Datasets table, click cnn-dailymail-sample.
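
If you want to peek at similar data outside the H2O Label Genie UI, here is a minimal sketch that loads the public CNN/DailyMail corpus with the Hugging Face datasets library and prints one article and its reference summary. This is an illustration only: the public corpus uses article and highlights columns, whereas the preloaded cnn-dailymail-sample demo dataset exposes a text (and summary) column.

```python
# Minimal sketch: explore the public CNN/DailyMail corpus with the
# Hugging Face `datasets` library (pip install datasets).
# Note: this is the public corpus, not the Label Genie demo sample;
# column names differ (`article`/`highlights` vs. `text`/`summary`).
from datasets import load_dataset

ds = load_dataset("cnn_dailymail", "3.0.0", split="train[:100]")

print(ds)                      # features and number of rows
print(ds[0]["article"][:500])  # first 500 characters of the first article
print(ds[0]["highlights"])     # the human-written reference summary
```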

Step 2: Create an annotation task

Now that we have seen the dataset, let's create an annotation task for it. For this tutorial, the text summarization annotation task consists of writing a summary for each text input.

  1. Click New annotation task.
  2. In the Task name box, enter tutorial-4a.
  3. In the Task description box, enter Annotate a dataset containing summaries from news stories from CNN and the Daily Mail websites.
  4. In the Select task list, select Summarization.
  5. In the Select text column box, select text.
  6. Click Create task.

Step 3: Specify an annotation task rubric

Before we can start annotating our dataset, we need to specify an annotation task rubric. In general, an annotation task rubric refers to the labels (for example, object classes) or settings you want to use when annotating your dataset.

In the case of a summarization annotation task, you need to specify two settings in the Rubric tab: the model and the maximum target length.

  1. For this tutorial, keep the default model.
  2. For this tutorial, keep the default maximum target length. A short sketch at the end of this step illustrates what these two settings control.
  3. Click Continue to annotate.

(Image: the summarization annotation task rubric)
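
To build intuition for what these two rubric settings control, here is a minimal sketch of zero-shot summarization with a capped summary length, using the Hugging Face transformers pipeline. The model name and the max_length value below are placeholders chosen for illustration; the default model Label Genie actually uses for this task is listed in Zero-shot learning models: Text summarization.

```python
# Minimal sketch: zero-shot summarization with a capped summary length,
# using the Hugging Face `transformers` pipeline (pip install transformers).
# The model name is a placeholder, not necessarily the one Label Genie uses.
from transformers import pipeline

summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

article = "Replace this string with the original text you want to summarize ..."

# `max_length` plays the role of the rubric's maximum target length:
# it bounds how many tokens the generated summary may contain.
suggestion = summarizer(article, max_length=128, min_length=30, do_sample=False)
print(suggestion[0]["summary_text"])
```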

Step 4: Annotate dataset

In the Annotate tab, you can annotate each text sample in the dataset individually. Let's annotate (summarize) the first text.

A zero-shot learning model is used by default for a text summarization annotation task. The model accelerates the annotation process by generating a summary suggestion for each original text. You can start annotating in the Annotate tab immediately, or wait until the zero-shot model is ready to provide annotation suggestions. H2O Label Genie notifies you to refresh the instance when zero-shot predictions (suggestions) are available.

note

To learn which model is used for a text summarization annotation task, see Zero-shot learning models: Text summarization.

  1. Click Refresh. (Image: annotation task progress with the text and its summary)
  2. Click Save and next.
    note
    • Save and next saves the annotated text.
    • To skip a text and annotate it later, click Skip.
      • Skipped text samples reappear after all non-skipped samples are annotated.
  3. Annotate all dataset samples.
    note

    At any point in an annotation task, you can download the already annotated (approved) samples; you do not need to fully annotate an imported dataset first. To learn more, see Download an annotated dataset. A short sketch for inspecting the downloaded file follows these steps.
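
As a quick sanity check, you can load the downloaded file and inspect your annotations with pandas. The sketch below assumes the export is a CSV named tutorial-4a-annotations.csv with text and summary columns; adjust the filename and column names to match the file you actually download from H2O Label Genie.

```python
# Minimal sketch: inspect a downloaded annotated dataset with pandas.
# The filename and column names are assumptions; adjust them to match
# the file exported from H2O Label Genie.
import pandas as pd

annotations = pd.read_csv("tutorial-4a-annotations.csv")

print(annotations.shape)             # number of annotated rows and columns
print(annotations.columns.tolist())  # available columns (e.g., text, summary)
print(annotations.head())            # first few annotated samples
```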

Summary

In this tutorial, we learned how to specify an annotation task rubric for a text summarization task and annotate a dataset with it.

Next

To learn how to specify an annotation task rubric and annotate datasets for other annotation tasks in computer vision (CV), natural language processing (NLP), and audio, see Tutorials.

