Version: v0.4.0

Labeling flow

Overview

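The flow of labeling a dataset in H2O Label Genie can be summarized in the following sequential steps:

  • Step 1: (Optional) Explore dataset
  • Step 2: Create an annotation task
  • Step 3: Specify an annotation task rubric
  • Step 4: Annotate dataset (with AI assistance)

The sections below summarize each step in turn.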

Step 1: (Optional) Explore dataset

As an optional first step in the labeling flow, before you start labeling, you can explore patterns and groups in your dataset using unsupervised methods. What you learn can help you improve the quality of your data. In particular, H2O Label Genie supports the analysis of image and text datasets through clustering tasks that, for example, generate 2D and 3D embeddings to help you understand the structure of the dataset.

Clustering task: A clustering task refers to finding and exploring groups in a dataset.
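To make the idea of "finding groups" concrete, the following is a minimal sketch of k-means clustering over 2D embeddings. It is an illustration only, not H2O Label Genie's implementation: the `kmeans` function, the `embeddings_2d` values, and the choice of k-means itself are assumptions made for this example.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Group 2D points into k clusters with plain k-means (Lloyd's algorithm)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the data points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return assignments, centroids


# Hypothetical 2D embeddings forming two visibly separate groups.
embeddings_2d = [
    (0.1, 0.2), (0.0, 0.1), (0.2, 0.0),
    (5.0, 5.1), (5.2, 4.9), (4.9, 5.0),
]
labels, centers = kmeans(embeddings_2d, k=2)
print(labels)  # the first three points share one cluster id, the last three the other
```

Inspecting which items fall into the same cluster is one way to spot groups, outliers, or near-duplicates before you decide on labels.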


Step 2: Create an annotation task

As the second step in the labeling flow (or the first one if you skipped the optional step 1), you need to create an annotation task.

Annotation task: An annotation task refers to labeling data so that it is suitable for an array of deep learning problem types. This process can, for example, involve drawing bounding boxes on images and assigning a label to each box.

H2O Label Genie supports various annotation tasks for computer vision (CV, image), natural language processing (NLP, text), and audio data.


Step 3: Specify an annotation task rubric

As the third step in the labeling flow, you need to specify an annotation task rubric.

Annotation task rubric: An annotation task rubric refers to the labels (for example, object classes) you want to use when annotating your dataset.
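As a rough illustration of how a rubric constrains annotations, the sketch below models a rubric as a list of allowed labels and checks bounding-box annotations against it. The `BoundingBox` schema, the `validate_annotations` helper, and the example labels are hypothetical and do not reflect H2O Label Genie's internal data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """One labeled box in pixel coordinates (hypothetical schema)."""
    label: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int


def validate_annotations(boxes, rubric):
    """Return the boxes whose label is not part of the rubric."""
    allowed = set(rubric)
    return [box for box in boxes if box.label not in allowed]


# The rubric: the object classes annotators may use.
rubric = ["car", "pedestrian", "bicycle"]

boxes = [
    BoundingBox("car", 10, 20, 110, 90),
    BoundingBox("truck", 200, 40, 320, 150),  # not in the rubric
]

invalid = validate_annotations(boxes, rubric)
print([box.label for box in invalid])  # prints ['truck']
```

Fixing the label set up front keeps annotations consistent across annotators, which is the purpose the rubric serves in this step.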


Step 4: Annotate dataset (with AI assistance)

As the fourth step in the labeling flow, annotate your dataset. The annotation process differs from one annotation task to another. To learn more, see Tutorials.

note

Many datasets labeled in H2O Label Genie are supported in H2O Hydrogen Torch and H2O LLM Studio. To learn more, see Download an annotated dataset.

H2O Label Genie offers the following major features to speed up the labeling process:

  • Zero-shot learning models: By default, H2O Label Genie uses zero-shot learning models to accelerate the labeling process for several supported annotation tasks.
  • Hotkeys: H2O Label Genie supports several hotkeys (keyboard shortcuts) designed to speed up the annotation (labeling) of a dataset.
  • Real-time multi-user support and collaboration: By default, H2O Label Genie lets multiple users work on the same annotation task at the same time, so you can complete it together with others.

Zero-shot learning models: Labeled data is crucial for supervised learning problem types in computer vision (CV), natural language processing (NLP), and audio. Producing high-quality labeled data usually requires extensive manual labeling, which can raise costs and delay production.

One way to accelerate the labeling process is to use zero-shot learning models. Zero-shot learning models are pre-trained models that have been trained on a large and diverse set of classes. Drawing on this prior knowledge, they can suggest labels for unlabeled data without task-specific training examples, which lets data scientists label data quickly and often with high accuracy.
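The following toy sketch shows the general shape of zero-shot labeling: each candidate label is described in words, and an item receives the label whose description is most similar to it, with no labeled training examples involved. Everything here is an assumption made for illustration. In particular, the bag-of-words `embed` function stands in for a real pre-trained encoder, and the code does not represent the models H2O Label Genie uses.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a pre-trained encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def zero_shot_label(text, label_descriptions):
    """Pick the label whose description is most similar to the text."""
    text_vec = embed(text)
    scores = {
        label: cosine(text_vec, embed(description))
        for label, description in label_descriptions.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical labels, each described in words instead of by examples.
label_descriptions = {
    "sports": "football match team goal player score",
    "finance": "bank stock market price investor money",
}

label, score = zero_shot_label(
    "the team scored a late goal in the match", label_descriptions
)
print(label)  # prints sports
```

In an assisted-labeling workflow, a suggestion like this would typically be shown to the annotator for confirmation or correction rather than accepted blindly.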


