Version: v1.2.0

Import dataset settings: Text classification

Dataset name

Name of the dataset.

Problem type

Defines the problem type of the experiment, which also defines the settings H2O Hydrogen Torch displays for the experiment.

note
  • The selected problem type and experience level determine the settings H2O Hydrogen Torch displays for the experiment
  • The From experiment option allows you to use the settings from a previously run experiment

Train dataframe

Defines a .csv or .pq file containing a dataframe with training records that H2O Hydrogen Torch will use to train the model.

note
  • The records will be combined into mini-batches when training the model.
  • If a validation dataframe is provided, a fold column is not needed in the train dataframe.
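
As a rough illustration of what a train file with a fold column might look like, the sketch below builds a small CSV using only the Python standard library. The column names (`text`, `label`, `fold`), the sample records, and the round-robin fold assignment are assumptions made for this example; they are not names or behavior required by H2O Hydrogen Torch.

```python
import csv
import io

# Hypothetical training records: (text, label) pairs.
records = [
    ("great product, works as described", "positive"),
    ("arrived broken and late", "negative"),
    ("does the job", "positive"),
    ("would not buy again", "negative"),
    ("exceeded my expectations", "positive"),
    ("stopped working after a week", "negative"),
]

N_FOLDS = 3  # assumed number of cross-validation folds


def build_train_csv(rows, n_folds):
    """Return CSV text with text, label, and a round-robin fold column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["text", "label", "fold"])  # assumed column names
    for index, (text, label) in enumerate(rows):
        writer.writerow([text, label, index % n_folds])
    return buffer.getvalue()


train_csv = build_train_csv(records, N_FOLDS)
print(train_csv)
```

Writing `train_csv` to a `.csv` file would give a file that could be selected as the Train dataframe. A stratified fold assignment may be preferable to round-robin when classes are imbalanced.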

Validation dataframe

Defines a .csv or .pq file containing a dataframe with validation records that H2O Hydrogen Torch will use to evaluate the model during training.

note
  • Setting a Validation dataframe requires the Validation strategy to be set to Custom holdout validation. In this case, H2O Hydrogen Torch uses the separate validation dataframe as provided and does not perform any internal cross-validation: the model is trained on the full train dataframe, and its performance is evaluated on the validation dataframe.
  • The validation dataframe should have the same format as the train dataframe but does not require a fold column.
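
One way to prepare a custom holdout outside the tool is to split the labeled records before import. The sketch below is only an illustration under stated assumptions: the record keys (`text`, `label`), the 20% holdout fraction, and the helper `holdout_split` are hypothetical and not part of H2O Hydrogen Torch.

```python
import random

# Hypothetical labeled records; the keys are assumed column names.
records = [{"text": f"example sentence {i}", "label": i % 2} for i in range(10)]


def holdout_split(rows, validation_fraction, seed):
    """Shuffle rows deterministically and split off a validation holdout."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


train_rows, validation_rows = holdout_split(records, validation_fraction=0.2, seed=0)

# Both parts keep the same columns; neither needs a fold column.
assert train_rows[0].keys() == validation_rows[0].keys()
print(len(train_rows), "train rows,", len(validation_rows), "validation rows")
```

Each part would then be saved to its own file and selected as the Train dataframe and Validation dataframe, respectively.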

Test dataframe

Defines a .csv or .pq file containing a dataframe with test records that H2O Hydrogen Torch will use to test the model.

note

The test dataframe should have the same format as the train dataframe but does not require a label column.
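
To illustrate "same format, label column optional", the following sketch derives a test file from train-style CSV content by dropping the label column. The column names and the helper `drop_columns` are assumptions for this example only.

```python
import csv
import io

# Hypothetical train-style content; column names are assumptions.
train_csv = "text,label\nloved it,positive\nnot for me,negative\n"


def drop_columns(csv_text, columns_to_drop):
    """Return CSV text with the given columns removed, other columns kept in order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = [name for name in reader.fieldnames if name not in columns_to_drop]
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=kept, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)  # columns not in `kept` are ignored
    return buffer.getvalue()


test_csv = drop_columns(train_csv, {"label"})
print(test_csv)
```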

Unlabeled dataframe

Defines a separate .csv or .pq file containing a dataframe with unlabeled records that H2O Hydrogen Torch uses to generate pseudo labels. H2O Hydrogen Torch first trains the model on the provided labeled data (Train dataframe). The trained model then predicts pseudo labels for the records in the unlabeled dataframe, and a second training run combines the original labels with the pseudo labels.

note
  • Image regression | Image classification | Image object detection
    • The unlabeled dataframe just needs to contain a single image column
  • Text regression | Text classification
    • The unlabeled dataframe just needs to contain a single text column
  • Audio regression | Audio classification | Speech recognition
    • The unlabeled dataframe just needs to contain a single audio column
  • Image regression | Image classification | Image object detection | Audio regression | Audio classification | Speech recognition
    • Assets (e.g., images or audio) need to be located in the folder specified by the Data folder setting
  • The training time can significantly increase depending on the size of the unlabeled data
tip

Because labeling can be expensive, it is common to have additional unlabeled data. Providing this unlabeled data lets H2O Hydrogen Torch train the model in a semi-supervised manner, which can improve model quality compared with training on labeled data alone.
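
The three-step pseudo-labeling flow described in this section (train, predict pseudo labels, retrain on the combined data) can be sketched with a toy word-count classifier. This is a minimal illustration of the general technique, not H2O Hydrogen Torch's implementation; the `train` and `predict` helpers and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def train(examples):
    """Toy classifier: count how often each word appears with each label."""
    word_counts = defaultdict(Counter)
    for text, label in examples:
        word_counts[label].update(text.lower().split())
    return word_counts


def predict(model, text):
    """Pick the label whose training words overlap most with the text."""
    words = text.lower().split()
    scores = {label: sum(counts[w] for w in words) for label, counts in model.items()}
    # Sorting the labels makes tie-breaking deterministic.
    return max(sorted(scores), key=scores.get)


labeled = [
    ("great film loved it", "positive"),
    ("terrible film hated it", "negative"),
]
unlabeled = ["loved the great soundtrack", "hated the terrible ending"]

# Step 1: train on the labeled data only.
first_model = train(labeled)

# Step 2: predict pseudo labels for the unlabeled records.
pseudo_labeled = [(text, predict(first_model, text)) for text in unlabeled]

# Step 3: train again on original labels plus pseudo labels.
final_model = train(labeled + pseudo_labeled)

print(pseudo_labeled)
```

After the second run, the toy model has also seen words that occur only in the unlabeled texts, which is the intuition behind the potential quality gain. The extra prediction pass and second training run are also why training time grows with the size of the unlabeled data.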

Label columns

Defines the name(s) of the dataframe column(s) that refer to the target value(s) H2O Hydrogen Torch will aim to predict.

note
  • More than one label column can be selected, so the target to predict can span a single column or multiple columns.
  • Image classification supports multi-class and multilabel classification.
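
As one possible reading of a multi-column target, the sketch below expands a list of tags per record into one 0/1 column per label. Both the layout (one binary column per label) and all names used here are assumptions for illustration; the source does not specify the exact encoding H2O Hydrogen Torch expects.

```python
# Hypothetical records where each text can carry several tags at once.
records = [
    {"text": "budget phone with a great camera", "tags": ["camera", "price"]},
    {"text": "battery drains quickly", "tags": ["battery"]},
    {"text": "average in every respect", "tags": []},
]

# One candidate label column per distinct tag, in a stable order.
label_columns = sorted({tag for record in records for tag in record["tags"]})


def to_multilabel_row(record, columns):
    """Expand a tag list into one 0/1 column per label."""
    row = {"text": record["text"]}
    for column in columns:
        row[column] = int(column in record["tags"])
    return row


rows = [to_multilabel_row(record, label_columns) for record in records]
for row in rows:
    print(row)
```

In this layout, every name in `label_columns` would be selected under Label columns.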

Text column

Defines the dataset column(s) containing the input text H2O Hydrogen Torch will use during model training. H2O Hydrogen Torch will concatenate multiple text columns with a specific separator token.
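
The concatenation of several text columns can be pictured roughly as follows. The separator string `" [SEP] "`, the column names, and the helper `join_text_columns` are assumptions for this sketch; the actual separator token used by H2O Hydrogen Torch is not stated here and likely depends on the selected model's tokenizer.

```python
# Assumed separator for illustration; the real token depends on the tokenizer.
SEPARATOR = " [SEP] "

# Hypothetical record with two text columns.
record = {
    "title": "Battery life",
    "body": "Lasts two days on a single charge.",
    "label": "positive",
}


def join_text_columns(row, text_columns, separator=SEPARATOR):
    """Concatenate the selected text columns in order, skipping empty values."""
    parts = [str(row[name]).strip() for name in text_columns if row.get(name)]
    return separator.join(parts)


model_input = join_text_columns(record, ["title", "body"])
print(model_input)
```

Skipping empty values is a choice made for this sketch so that missing text does not leave a dangling separator; the tool itself may handle missing values differently.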

