Version: v1.4.0

Dataset format: Image and text classification

Dataset format

The data for an image and text classification experiment needs to be in a ZIP file (1) containing a csv file (2) and an image folder (3).

folder_name.zip (1)
│
└───csv_name.csv (2)
│
└───image_folder_name (3)
    └───name_of_image.image_extension
    └───name_of_image.image_extension
    └───name_of_image.image_extension
    ...
Note

The ZIP file can contain multiple csv files, which you can use as train, validation, and test dataframes:

  • A train csv file needs to follow the format described above
  • A validation csv file needs to follow the same format as a train csv file
  • A test csv file needs to follow the same format as a train csv file, but does not require label columns
  1. The available dataset connectors require the data for an image and text classification experiment to be in a ZIP file.
    Note

    To learn how to upload your ZIP file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A csv file containing the following columns:
    • An image column containing the names of the images for the experiment, where each image has an image extension specified
      Note
      • The images can use a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
      • The names of the image files do not specify the data directory (location of the images in the ZIP file). You can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
    • A text column containing the texts for the experiment
    • One or more label columns containing either one-hot encoded multi-class labels or multiple multi-class/multi-label labels. For multi-class and binary classification, a single label column is sufficient.
      Note
      • H2O Hydrogen Torch can solve both multi-class and multi-label classification problems. In multi-class problems, the classes are mutually exclusive. In multi-label problems, each class is a separate label, so a sample can belong to several classes at once. With N label columns, only a single column can be set to 1 per row in a multi-class problem, while in a multi-label problem any number of columns, from none to all, can be set to 1.
      • For binary classification experiments that use precision, recall, F1, F05, or F2 as a metric, the label column needs to contain 0/1 values. If the column contains string values, it is transformed into multiple columns by one-hot encoding. The experiment is then treated as a multi-class classification experiment, and the precision, recall, F1, F05, or F2 metric is calculated incorrectly.
    • An optional fold column containing cross-validation fold indexes
      Note

      The fold column can contain integers (0, 1, 2, …, N-1 or 1, 2, 3, …, N) or categorical values.

  3. An image folder that contains all the images specified in the image column. H2O Hydrogen Torch uses the images in this folder to run the image and text classification experiment.
    Note

    All image file names need to specify an image extension. The images can use a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
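
The label rules described above can be checked before uploading a dataset. The following is a minimal sketch in Python and is not part of H2O Hydrogen Torch. The helper names is_one_hot_multiclass and to_binary_labels, the column names, and the sample rows are hypothetical, and rows are assumed to be dictionaries such as those produced by csv.DictReader.

```python
def is_one_hot_multiclass(rows, label_columns):
    """Return True if every row has exactly one label column set to 1."""
    for row in rows:
        values = [int(row[column]) for column in label_columns]
        if any(value not in (0, 1) for value in values):
            raise ValueError(f"label values must be 0 or 1, got {values}")
        if sum(values) != 1:
            return False  # zero or several 1s: only valid as multi-label
    return True


def to_binary_labels(rows, label_column, positive_class):
    """Return copies of the rows with the string label replaced by 0/1."""
    return [
        {**row, label_column: 1 if row[label_column] == positive_class else 0}
        for row in rows
    ]


# Hypothetical sample rows, for illustration only.
multi_class = [{"pizza": "1", "salad": "0"}, {"pizza": "0", "salad": "1"}]
multi_label = [{"pizza": "1", "salad": "1"}, {"pizza": "0", "salad": "0"}]
print(is_one_hot_multiclass(multi_class, ["pizza", "salad"]))  # True
print(is_one_hot_multiclass(multi_label, ["pizza", "salad"]))  # False
```

Converting a string label column to 0/1 with a helper such as to_binary_labels keeps a binary experiment from being one-hot encoded and treated as multi-class, which matters when precision, recall, F1, F05, or F2 is the metric.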

Example

The food_101_imageandtext_classification.zip file is a preprocessed dataset in H2O Hydrogen Torch that is formatted to solve a multi-class image and text classification problem. The structure of the ZIP file is as follows:

food_101_imageandtext_classification.zip
│
└───train.csv
│
└───images
    └───caesar_salad_371.jpg
    └───seaweed_salad_539.jpg
    └───caesar_salad_822.jpg
    ...

The first three rows of the csv file are as follows:

| image | text | label |
|---|---|---|
| caesar_salad_371.jpg | Salmon Caesar Salad Recipe - Kraft Recipes | caesar_salad |
| seaweed_salad_539.jpg | Nutrition advice - Prevention Magazine - Yahoo!7 Lifestyle | seaweed_salad |
| caesar_salad_822.jpg | Hail Caesar Salad Recipes Food Network Canada | caesar_salad |

Note
  • In this example, the image column does not specify the data directory. Therefore, specify it when uploading the dataset by setting the images folder as the value of the Data folder setting. For more information, see Import dataset settings.
  • To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Demo (preprocessed) datasets.
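
A ZIP file with the same layout as the example can be assembled with the Python standard library. The following is a minimal sketch, not an H2O Hydrogen Torch API. The function name build_dataset_zip and its parameters are hypothetical, and the image content is placeholder bytes that stand in for real image files.

```python
import csv
import io
import zipfile


def build_dataset_zip(target, rows, images, csv_name="train.csv", image_folder="images"):
    """Write a csv file and an image folder into a ZIP archive.

    target: a file path or a writable binary file object.
    rows:   dictionaries with "image", "text", and "label" keys.
    images: mapping of image file name to the raw bytes of that image.
    """
    missing = [row["image"] for row in rows if row["image"] not in images]
    if missing:
        raise ValueError(f"images listed in the csv but not provided: {missing}")

    # Build the csv in memory; newline="" lets the csv module control line endings.
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=["image", "text", "label"])
    writer.writeheader()
    writer.writerows(rows)

    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(csv_name, buffer.getvalue())
        for name, data in images.items():
            # The csv stores only the file name; the folder is given separately.
            archive.writestr(f"{image_folder}/{name}", data)


if __name__ == "__main__":
    rows = [
        {
            "image": "caesar_salad_371.jpg",
            "text": "Salmon Caesar Salad Recipe - Kraft Recipes",
            "label": "caesar_salad",
        },
    ]
    # Placeholder bytes, not a real JPEG; replace with the actual file content.
    images = {"caesar_salad_371.jpg": b"placeholder"}
    build_dataset_zip("my_dataset.zip", rows, images)
```

Because the image column holds only file names, the images folder still needs to be set as the Data folder when the resulting ZIP file is uploaded.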
