Version: v1.4.0

Dataset format: Text metric learning

Formats
Example

The data for a text metric learning experiment can be formatted following format 1 or 2.

Format 1

A CSV file.

csv_name.csv (1)(2)

Format 2

A zip file containing a CSV file.

folder_name.zip (1)
│   └───csv_name.csv (2)

Note

The zip file can contain multiple CSV files to use as train, validation, and test dataframes:

- A train CSV file needs to follow the CSV format described on this page.
- A validation CSV file needs to follow the same format as a train CSV file.
- A test CSV file needs to follow the same format as a train CSV file but does not require a label column.
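
As a minimal sketch of such a zip file, the following Python snippet packs three CSV files into an in-memory archive using only the standard library. The file names (train.csv, validation.csv, test.csv), the column names (text, label), the helper names, and the sample rows are all assumptions made for illustration; they do not come from H2O Hydrogen Torch.

```python
import csv
import io
import zipfile


def rows_to_csv(header, rows):
    """Serialize a header and a list of rows into CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def build_dataset_zip(files):
    """Pack a mapping of file name -> CSV text into zip archive bytes."""
    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, csv_text in files.items():
            zf.writestr(name, csv_text)
    return archive.getvalue()


# Invented sample rows: similar texts share the same label.
train = rows_to_csv(
    ["text", "label"],
    [
        ["how do i install a package ?", "3"],
        ["what is the command to install software ?", "3"],
        ["my wifi keeps disconnecting", "7"],
    ],
)
validation = rows_to_csv(["text", "label"], [["wifi drops every few minutes", "7"]])
# The test file has no label column.
test = rows_to_csv(["text"], [["how can i add a new program ?"]])

zip_bytes = build_dataset_zip(
    {"train.csv": train, "validation.csv": validation, "test.csv": test}
)
```

Writing zip_bytes to a file opened in binary mode (for example, folder_name.zip) produces an archive with the three CSV files at its top level.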

(1) The available dataset connectors require the data for a text metric learning experiment to be in a zip or CSV file.
    Note
    To learn how to upload your zip or CSV file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

(2) A CSV file containing the following columns:
    - A text column containing the input texts
    - A label column containing the class names
      Note
      Texts that are similar should have the same class name.
    - An optional fold column containing cross-validation fold indexes
      Note
      The fold column can include integers (0, 1, 2, …, N-1 values or 1, 2, 3, …, N values) or categorical values.
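
The column requirements above can be checked before uploading. The following is a minimal sketch, assuming the columns are named text, label, and fold (hypothetical names chosen to match the example on this page) and that the file is comma-separated. The function check_metric_learning_csv is not part of H2O Hydrogen Torch, and the sample rows are invented.

```python
import csv
import io


def check_metric_learning_csv(csv_text, text_col="text", label_col="label", fold_col="fold"):
    """Return a list of problems found in the CSV text (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    problems = []
    # The text and label columns are required; the fold column is optional.
    for required in (text_col, label_col):
        if required not in columns:
            problems.append(f"missing required column: {required}")
    if problems:
        return problems
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if not (row[text_col] or "").strip():
            problems.append(f"line {line_no}: empty text")
        if not (row[label_col] or "").strip():
            problems.append(f"line {line_no}: empty label")
        # Fold values may be integers or categorical, so only emptiness is checked.
        if fold_col in columns and not (row[fold_col] or "").strip():
            problems.append(f"line {line_no}: empty fold")
    return problems


sample = (
    "text,label,fold\n"
    "how do i install a package ?,3,0\n"
    "what is the command to install software ?,3,1\n"
)
problems = check_metric_learning_csv(sample)
```

For a train or validation file, an empty problems list means the required columns are present and no row has an empty value. A test file without a label column would need the label check relaxed.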

The ubuntu_text_metric_learning.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a text metric learning problem. The structure of the zip file is as follows:

ubuntu_text_metric_learning.zip
│   └───train.csv
│   └───test.csv

The following is a random row from the train.csv file:

text	label	fold
what is the easiest way to strip a desktop edition to a server edition ?	16	1
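
Because texts that are similar share a label, grouping rows by label is a quick way to inspect a dataset like this one. The following sketch reads a CSV file from a zip archive and collects the texts for each label. It assumes a comma-separated train.csv with text and label columns; texts_by_label is a hypothetical helper, and the demo archive is built from invented rows rather than from the actual ubuntu_text_metric_learning.zip contents.

```python
import csv
import io
import zipfile
from collections import defaultdict


def texts_by_label(zip_bytes, csv_name="train.csv", text_col="text", label_col="label"):
    """Map each label to the list of texts that carry it."""
    groups = defaultdict(list)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        with zf.open(csv_name) as raw:
            # Wrap the binary stream so the csv module receives text.
            text_stream = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            for row in csv.DictReader(text_stream):
                groups[row[label_col]].append(row[text_col])
    return dict(groups)


# Build a small in-memory archive with invented rows for demonstration.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr(
        "train.csv",
        "text,label,fold\n"
        "how do i install a package ?,3,0\n"
        "what is the command to install software ?,3,1\n"
        "my wifi keeps disconnecting,7,0\n",
    )

groups = texts_by_label(demo.getvalue())
```

Labels are kept as strings here because the label column holds class names, which need not be numeric.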

Note

To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Demo (preprocessed) datasets.
