Version: v0.2.0

Tutorial 1A: Annotation task: Text classification

This tutorial outlines the process of specifying an annotation task rubric and annotating a dataset for a text classification annotation task. It also briefly shows how to download the fully annotated dataset in a format that H2O Hydrogen Torch supports.

Step 1: Explore dataset

We will use the preloaded amazon-reviews-demo dataset for this tutorial. The dataset contains 180 samples (text), each containing a review of an Amazon product. Let's quickly explore the dataset.

  1. On the H2O Label Genie navigation menu, click Datasets.
  2. In the datasets table, click amazon-reviews-demo.

The first two reviews are as follows:

Stars | Comment
3.0 out of 5 stars | I've used other devices to track my fitness/health and I really thought that the Charge 5 was going to be the answer. I did research on various brand devices through various sources. The Fitbit gets great ratings. I'm disappointed. I've been using it faithfully for about 3 weeks now. I'm of the opinion that it just may be buggy, and a future firmware update can improve it in many ways, hence the overall three-star rating. The Premium service is hype and doesn't seem that it is worth the money, so I'm debating if I will subscribe once the free period expires. When I'm walking the actual steps are fewer than what the Fitbit reports, When I'm doing my cardio, while my heart rate is steady, the Fitbit continually reports inaccurate reading, usually much lower than the actual heartbeat and at times showing peak heart rates. It does this inconsistently like it operates on a sliding scale. Different location placement of the Fitbit on either arm makes no difference. Also, I have programed the high/low settings for notification, which I have exceeded, the Fitbit for some reason has not warned/notified/signaled or whatever it is supposed to do. It does an excellent job accurately displaying my heart rhythm for the 30 second period when I do an ECG. It does report when I'm experiencing Atrial fabulation. However, unlike many of the other features, it doesn't record for record the readings. It does a good job in tracking my resting heart rate. I'm not impressed with the Sp02 percentage process. It only does an average during sleep. It doesn't provide data as to levels during the period that it monitors. The ECG, the Sp02% should have a feature like the ECG so I can manually check my oxygen level when desired.
1.0 out of 5 stars | Bought this yesterday because it looked like a good fitness tracker; it's shipped and sold by Amazon, so I thought it would be a simple process. However, the battery drains shortly after charging; how can you use this? GPS took several minutes to work, then stopped recording. Missing ECG -- which was the main reason for choosing this model. (Fitbit promises all these apps and options for Charge 5, but they aren't available.) Tried to return it -- it says right on the page that you can return this item until January 31, 2022 -- but Amazon would not let me. Sent me to chat, that chat person said I had to contact the manufacturer myself to get a case number, and then go back to Amazon to maybe return it if they approved it. What?? I don't have time for all this! I simply want to return a defective Fitness Charge 5 tracker sold by Amazon, and Amazon won't let me. What a waste of time; so much better to buy at Costco, where at least they take back defective items without hassling you and wasting your time.

Step 2: Create annotation task

Now that we have seen the dataset, let's create an annotation task to annotate it. An annotation task refers to the process of labeling data. In this tutorial, it is a text classification annotation task, which assigns one or more categorical target labels to an input text.

  1. Click New annotation task.
  2. In the Task name box, enter Tutorial 1A.
  3. In the Task description box, enter Annotate dataset containing reviews from Amazon products.
  4. In the Select task list, select Classification.
  5. In the Select text column box, select comment.
  6. Click Create task.

Step 3: Specify annotation task rubric

For this tutorial, we will use the comment of each review to determine whether the customer was happy or unhappy with the product (purchase).

  1. In the Class name box, enter Happy.
  2. Click Add class.
  3. In the Class name box, enter Unhappy.
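Outside the UI, the rubric defined above amounts to a fixed set of class names. The following is a minimal Python sketch of that idea, for illustration only: the `Rubric` class, its methods, and the rule that empty or duplicate names are rejected are all assumptions and are not part of H2O Label Genie.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Rubric:
    """Hypothetical stand-in for a text classification rubric (not a Label Genie API)."""

    classes: List[str] = field(default_factory=list)

    def add_class(self, name: str) -> None:
        # Assumed behavior: reject empty and duplicate class names.
        name = name.strip()
        if not name:
            raise ValueError("class name must not be empty")
        if name in self.classes:
            raise ValueError(f"class {name!r} already exists")
        self.classes.append(name)

    def validate(self, label: str) -> str:
        # A label is only usable if it was defined in the rubric.
        if label not in self.classes:
            raise ValueError(f"{label!r} is not in the rubric: {self.classes}")
        return label


# Mirror the two classes entered in the steps above.
rubric = Rubric()
rubric.add_class("Happy")
rubric.add_class("Unhappy")
```

Keeping the class names in one place like this makes it easy to check, later on, that every label in an annotated dataset is one the rubric actually defines.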

Step 4: Annotate dataset

Now that we have specified the annotation task rubric, let's annotate the dataset.

  1. Click Continue to annotate.

In the Annotate tab, you can individually annotate each review (text) in the dataset. Let's annotate the first review.

Based on the first review's comment, the product did not meet the customer's expectations. Observe:

I’ve used other devices to track my fitness/health and I really thought that the Charge 5 was going to be the answer. I did research on various brand devices through various sources. The Fitbit gets great ratings. ==I’m disappointed==. I’ve been using it faithfully for about 3 weeks now. I’m of the opinion that it just may be ==buggy==, and a future firmware update ==can improve it in many ways==, hence the overall three-star rating. ==The Premium service is hype and doesn’t seem that it is worth the money==, so I’m debating if I will subscribe once the free period expires. When I’m walking the actual steps are fewer than what the Fitbit reports, When I’m doing my cardio, while my heart rate is steady, the Fitbit continually reports ==inaccurate reading==, usually much lower than the actual heartbeat and at times showing peak heart rates. It does this ==inconsistently== like it operates on a sliding scale. Different location placement of the Fitbit on either arm makes no difference. Also, I have programed the high/low settings for notification, which I have exceeded, the Fitbit for some reason ==has not warned/notified/signaled or whatever it is supposed to do==. It does an excellent job accurately displaying my heart rhythm for the 30 second period when I do an ECG. It does report when I’m experiencing Atrial fabulation. However, unlike many of the other features, it doesn’t record for record the readings. It does a good job in tracking my resting heart rate. ==I’m not impressed with the Sp02 percentage process==. It only does an average during sleep. ==It doesn’t provide data as to levels during the period that it monitors.== The ECG, the Sp02% should have a feature like the ECG so I can manually check my oxygen level when desired.

With the above in mind, let's label the first review as Unhappy.

  1. Click Unhappy.

    Note
    • As you annotate the dataset, you choose which label to apply to each review. To apply a label, click Happy or Unhappy.

    • A zero-shot learning model is on by default when you annotate a text classification annotation task. The model accelerates annotation (labeling) by showing, as a percentage, the probability that a text (in this case, a review) belongs to each label created in the Rubric tab.

  2. Click Save and next.

Note
  • Save and next saves the annotated review.
  • To skip a review and annotate it later, click Skip.
    • Skipped reviews (samples) reappear after all non-skipped reviews are annotated.
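
The earlier note on the zero-shot model mentions a percentage probability per label. How H2O Label Genie computes these values is not described here. The sketch below only illustrates one common way to turn per-label scores into percentages (a softmax); the scores are made up and the `to_percentages` helper is hypothetical.

```python
import math
from typing import Dict


def to_percentages(scores: Dict[str, float]) -> Dict[str, float]:
    """Apply a softmax to raw per-label scores and express the result as percentages."""
    # Subtract the largest score before exponentiating, for numerical stability.
    top = max(scores.values())
    exps = {label: math.exp(score - top) for label, score in scores.items()}
    total = sum(exps.values())
    return {label: 100.0 * value / total for label, value in exps.items()}


# Made-up scores for illustration only; not produced by any real model.
raw_scores = {"Happy": 0.3, "Unhappy": 2.1}
percentages = to_percentages(raw_scores)

# The label with the highest percentage is the one a reviewer would see suggested.
suggested = max(percentages, key=percentages.get)

for label, pct in percentages.items():
    print(f"{label}: {pct:.1f}%")
print("Suggested label:", suggested)
```

The percentages are a suggestion to speed up labeling; the annotator still makes the final choice by clicking a label.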

Step 5: Export annotated dataset

After annotating all the reviews, you can download the dataset in a format that H2O Hydrogen Torch supports. Let's download the annotated dataset.

  1. In the Annotate tab, click Export annotated samples.

Note
  • In the Dashboard tab, you can download all annotated reviews at any point in the annotation process. To download all reviews annotated so far:
    1. Click the Dashboard tab.
    2. Click Export approved samples.
  • H2O Label Genie downloads a .zip file containing the annotated dataset in a format that matches the dataset's problem type (annotation task type). To learn more, see Downloaded dataset formats: Text classification.
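
Once downloaded, the .zip file can be inspected with Python's standard library. The following is a sketch under stated assumptions: the file name `annotations.csv` and the `comment` and `label` columns are hypothetical, and the sample rows are made up. Check Downloaded dataset formats: Text classification for the actual layout. The archive is built in memory so the example runs on its own.

```python
import csv
import io
import zipfile
from collections import Counter
from typing import Dict, List


def read_labeled_rows(zip_bytes: bytes, csv_name: str) -> List[Dict[str, str]]:
    """Read one CSV file out of a .zip archive into a list of row dictionaries."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(csv_name) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))


# Build a tiny stand-in archive in memory.
# The file name, column names, and rows are assumptions, not the documented export layout.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "annotations.csv",
        "comment,label\n"
        '"Battery drains shortly after charging.",Unhappy\n'
        '"Tracks my resting heart rate well.",Happy\n',
    )

rows = read_labeled_rows(buffer.getvalue(), "annotations.csv")

# Count how many reviews received each label.
label_counts = Counter(row["label"] for row in rows)
print(label_counts)
```

For a real export, read the downloaded file's bytes (for example with `pathlib.Path(...).read_bytes()`) and pass them to `read_labeled_rows` along with the actual file name inside the archive.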

Summary

In this tutorial, we learned the process of specifying an annotation task rubric and annotating a dataset for a text classification task. We also learned how to download the fully annotated dataset in a format that H2O Hydrogen Torch supports.

Next

To learn the process of annotating and specifying an annotation task rubric for other annotation tasks in computer vision (CV), natural language processing (NLP), and audio, see Tutorials.
