Version: v1.4.0

Dataset format: Text token classification

The data for a text token classification experiment can be formatted following format 1 or 2.

A Parquet file.

parquet_name.pq (1)(2)
  1. The available dataset connectors require the data for a text token classification experiment to be in a zip or Parquet file.
    Note

    To learn how to upload your zip or Parquet file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A Parquet file containing the following columns:
    • A text column containing tokenized text; each sample should be a list of string tokens.
    • A label column containing the token labels for the tokenized text; each sample should be a list of token labels, represented as categorical string values.
    • An optional fold column containing cross-validation fold indexes.
      Note

      The fold column can include integers (0, 1, 2, …, N-1 or 1, 2, 3, …, N) or categorical values.
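
The column layout described above can be sketched in Python. This is a minimal illustration, not an official H2O Hydrogen Torch utility: the column names `text`, `label`, and `fold`, the helper names `build_rows` and `write_parquet`, the BIO-style label strings, and the check that every token has exactly one label are all assumptions made for the example. Writing the file assumes pandas is installed together with a Parquet engine such as pyarrow.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical column names: the format only asks for a text column, a label
# column, and an optional fold column, without fixing what they are called.
TEXT_COL, LABEL_COL, FOLD_COL = "text", "label", "fold"

Sample = Tuple[Sequence[str], Sequence[str]]


def build_rows(samples: Sequence[Sample], n_folds: Optional[int] = None) -> List[Dict[str, object]]:
    """Turn (tokens, labels) pairs into one row dict per sample."""
    rows: List[Dict[str, object]] = []
    for i, (tokens, labels) in enumerate(samples):
        # Assumption: one label per token, so both lists must have equal length.
        if len(tokens) != len(labels):
            raise ValueError(f"sample {i}: {len(tokens)} tokens but {len(labels)} labels")
        row: Dict[str, object] = {
            TEXT_COL: [str(t) for t in tokens],    # list of string tokens
            LABEL_COL: [str(l) for l in labels],   # list of categorical string labels
        }
        if n_folds is not None:
            row[FOLD_COL] = i % n_folds            # integer fold indexes 0 .. n_folds-1
        rows.append(row)
    return rows


def write_parquet(rows: List[Dict[str, object]], path: str) -> None:
    """Write the rows to a Parquet file (requires pandas and a Parquet engine)."""
    import pandas as pd  # imported lazily so build_rows works without pandas

    pd.DataFrame(rows).to_parquet(path, index=False)


# Illustrative samples only; the labels follow a BIO-style scheme by assumption.
samples = [
    (["John", "lives", "in", "Berlin"], ["B-PER", "O", "O", "B-LOC"]),
    (["H2O", "ships", "software"], ["B-ORG", "O", "O"]),
    (["Paris", "is", "sunny"], ["B-LOC", "O", "O"]),
]
rows = build_rows(samples, n_folds=2)
```

Calling `write_parquet(rows, "parquet_name.pq")` would then produce a file with one list-valued text cell and one list-valued label cell per sample, plus the integer fold column. Passing `n_folds=None` omits the fold column, which the format treats as optional.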

