Version: v1.4.0

Dataset format: Text classification

The data for a text classification experiment can be formatted following format 1 or 2.

A CSV file.

csv_name.csv (1)(2)
  1. The available dataset connectors require the data for a text classification experiment to be in a zip or CSV file.
    Note

    To learn how to upload your zip or CSV file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A CSV file containing the following columns:
    • A text column containing the texts for the experiment
    • One or more label columns containing either one-hot encoded multi-class labels or multiple multi-class/multi-label labels. For multi-class and binary classification, a single label column is sufficient.
      Note
      • H2O Hydrogen Torch can solve both multi-class and multi-label classification problems. In multi-class problems, the classes are mutually exclusive; in multi-label problems, each class is a separate label, and a text can carry several labels at once. With N label columns, exactly one column can be set to 1 per row in a multi-class problem, while in a multi-label problem any number of columns, from none to all, can be set to 1.
      • For binary classification experiments that use precision, recall, F1, F05, or F2 as a metric, the label column must contain 0/1 values. If the column contains string values, it is transformed into multiple columns by one-hot encoding. The experiment is then treated as a multi-class classification experiment, and the precision, recall, F1, F05, or F2 metric is calculated incorrectly.
    • An optional fold column containing cross-validation fold indexes
      Note

      The fold column can include integers (0, 1, 2, …, N-1 or 1, 2, 3, …, N) or categorical values.
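The column layout described above can be sketched with Python's standard library. This is a minimal illustration, not an official H2O Hydrogen Torch utility: the column names (`text`, `sports`, `politics`, `tech`, `fold`), the sample rows, and the helper functions are all hypothetical. It writes one text column, three one-hot label columns for a multi-label problem, and an integer fold column, and it shows one way to map string labels to 0/1 for a binary experiment.

```python
import csv
import io

# Hypothetical column names, chosen for illustration only.
TEXT_COLUMN = "text"
LABEL_COLUMNS = ["sports", "politics", "tech"]
FOLD_COLUMN = "fold"


def one_hot(labels, classes):
    """Return a list of 0/1 values, one per class, marking the labels present."""
    present = set(labels)
    unknown = present - set(classes)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if c in present else 0 for c in classes]


def binary_to_int(value, positive):
    """Map a string label to 0/1 so a binary label column holds integers."""
    return 1 if value == positive else 0


def build_csv(samples, n_folds=3):
    """Serialize (text, labels) pairs to CSV text with a round-robin fold index."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([TEXT_COLUMN, *LABEL_COLUMNS, FOLD_COLUMN])
    for i, (text, labels) in enumerate(samples):
        # Folds are numbered 0, 1, ..., n_folds - 1.
        writer.writerow([text, *one_hot(labels, LABEL_COLUMNS), i % n_folds])
    return buffer.getvalue()


# Multi-label rows: a text may carry one label, several, or none.
samples = [
    ("The team won the final.", ["sports"]),
    ("New chip export rules announced.", ["politics", "tech"]),
    ("Weather stays mild this week.", []),
]
csv_text = build_csv(samples)
print(csv_text)
```

For a multi-class dataset, the same layout applies, with the added constraint that each row sets exactly one label column to 1. For a binary experiment scored with precision, recall, F1, F05, or F2, a single label column produced with something like `binary_to_int(value, positive="spam")` keeps the values at 0/1 rather than strings.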

