Version: v1.4.0

Import dataset settings: Image classification

Before importing a dataset to H2O Hydrogen Torch, you need to define a set of settings based on the problem type of the dataset. These settings are referred to as import dataset settings.

Dataset name

This setting defines the name of the dataset.

Problem category

This setting defines the general problem category of the dataset, for example, image.

note
  • The selected problem category (for example, image) determines the options in the Problem type setting.
  • The From experiment option enables you to reuse the settings of another experiment.

Problem type

This setting defines the problem type of the experiment, which also defines the settings H2O Hydrogen Torch displays for the experiment.

note
  • The selected problem category (in the Problem category setting) determines the available problem types.
  • The selected problem type and experience level determine the settings H2O Hydrogen Torch displays for the experiment.

Train dataframe

This setting specifies the path to a file containing a dataframe of training records that H2O Hydrogen Torch uses to train the model in the experiment. The file must follow the dataset format for the problem type of the experiment. To learn more, see Dataset formats.

note
  • The records are combined into mini-batches when training the model.
  • If a validation dataframe is provided, a fold column is not needed in the train dataframe.
  • To import datasets for inference only, when defining the settings for an experiment, set the Train dataframe setting to None while setting the Test dataframe setting to the relevant dataframe (as a result, H2O Hydrogen Torch utilizes the relevant dataset for predictions and not for training).
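As an illustration of the points above, the following minimal sketch builds a small train dataframe as CSV text. The column names (image, label, fold), the file names, and the round-robin fold assignment are assumptions made for this example only; the required layout for a given problem type is described in Dataset formats.

```python
import csv
import io


def build_train_rows(image_names, labels, n_folds=5):
    """Build train records with a hypothetical 'fold' column (round-robin assignment)."""
    rows = []
    for i, (name, label) in enumerate(zip(image_names, labels)):
        rows.append({"image": name, "label": label, "fold": i % n_folds})
    return rows


def rows_to_csv(rows):
    """Serialize a list of dict records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical file names and labels, for illustration only.
train_rows = build_train_rows(
    ["cat_001.jpg", "dog_001.jpg", "cat_002.jpg"],
    ["cat", "dog", "cat"],
    n_folds=2,
)
train_csv = rows_to_csv(train_rows)
print(train_csv)
```

If a separate validation dataframe is provided, the fold column in this sketch can be left out.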

Data folder

Defines the location of the folder containing assets (for example, images or audio clips) the model utilizes for training. H2O Hydrogen Torch loads assets from this folder during training.

Validation dataframe

This setting defines a file containing a dataframe with validation records that H2O Hydrogen Torch uses to evaluate the model during training.

note
  • Setting a Validation dataframe requires the Validation strategy to be set to Custom holdout validation. When you provide a validation dataframe, H2O Hydrogen Torch uses it as given and does not perform any internal cross-validation. In other words, the model is trained on the full train dataframe, and model performance is evaluated on the provided validation dataframe.
  • The validation dataframe should have the same format as the train dataframe but does not require a fold column.

Setting dependency

The Validation dataframe setting is only available when you select Custom holdout validation in the Validation strategy setting.
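One way to prepare a separate validation dataframe is to hold out a fraction of the labeled records before import. The sketch below is an illustration only: the helper name holdout_split, the 20% fraction, and the record layout are assumptions, and H2O Hydrogen Torch does not require this particular split.

```python
import random


def holdout_split(rows, valid_fraction=0.2, seed=0):
    """Shuffle records deterministically and split them into (train, validation)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_valid = max(1, int(round(len(shuffled) * valid_fraction)))
    return shuffled[n_valid:], shuffled[:n_valid]


# Hypothetical records; neither split needs a fold column.
records = [{"image": f"img_{i:03d}.jpg", "label": i % 2} for i in range(10)]
train_records, valid_records = holdout_split(records, valid_fraction=0.2, seed=42)
print(len(train_records), len(valid_records))
```

Each split would then be saved to its own file and selected in the Train dataframe and Validation dataframe settings, respectively.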

Test dataframe

This setting defines a file containing a dataframe with test records that H2O Hydrogen Torch uses to test the model.

note
  • The test dataframe should have the same format as the train dataframe but does not require a label column.
  • To import datasets for inference only, when defining the settings for an experiment, set the Train dataframe setting to None while setting the Test dataframe setting to the relevant dataframe (as a result, H2O Hydrogen Torch utilizes the relevant dataset for predictions and not for training).

Data folder test

Defines the location of the folder containing assets (for example, images, texts, or audio clips) H2O Hydrogen Torch utilizes to test the model. H2O Hydrogen Torch loads the assets from this folder when testing the model.

Setting dependency
  • This setting is only available when you specify a test dataframe in the Test dataframe setting.

Unlabeled dataframe

Defines a separate CSV or Parquet file (depending on the problem type) containing a dataframe with unlabeled records that H2O Hydrogen Torch uses to generate pseudo labels. H2O Hydrogen Torch first trains the model on the provided labeled data (Train dataframe). The trained model then predicts pseudo labels for the unlabeled dataframe, and a second training run combines the original labels with the pseudo labels.

note
  • Image regression | Image classification | Image object detection
    • The unlabeled dataframe just needs to contain a single image column
  • Text regression | Text classification
    • The unlabeled dataframe just needs to contain a single text column
  • Audio regression | Audio classification | Speech recognition
    • The unlabeled dataframe just needs to contain a single audio column
  • Image regression | Image classification | Image object detection | Audio regression | Audio classification | Speech recognition
    • Assets (images or audio clips) need to be located in the folder specified in the Data folder setting
  • All supported problem types
    • The training time can significantly increase depending on the size of the unlabeled data

tip

Because labeling can be expensive, additional unlabeled data is often available. When you provide this unlabeled data, H2O Hydrogen Torch trains the model in a semi-supervised manner, which can improve model quality compared to training on labeled data alone.
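The three-step pseudo-labeling flow described in this section (train on labeled data, predict pseudo labels, retrain on the combination) can be sketched with a toy model. This is not how H2O Hydrogen Torch implements it: the nearest-centroid classifier, the one-dimensional features, and all names below are assumptions chosen to keep the example small.

```python
def fit_centroids(features, labels):
    """Toy 'training': compute the mean feature value for each label."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, features):
    """Assign each feature value the label of the nearest centroid."""
    return [min(centroids, key=lambda y: abs(x - centroids[y])) for x in features]


# Step 1: train on the labeled data only (hypothetical values).
labeled_x = [0.0, 0.2, 1.0, 1.2]
labeled_y = ["cat", "cat", "dog", "dog"]
model = fit_centroids(labeled_x, labeled_y)

# Step 2: predict pseudo labels for the unlabeled records.
unlabeled_x = [0.1, 0.3, 0.9, 1.4]
pseudo_y = predict(model, unlabeled_x)

# Step 3: retrain on the original labels combined with the pseudo labels.
final_model = fit_centroids(labeled_x + unlabeled_x, labeled_y + pseudo_y)
print(pseudo_y, final_model)
```

Because the second run sees both the labeled and the unlabeled records, its cost grows with the size of the unlabeled data, which matches the note on training time above.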

Label columns

This setting defines the name(s) of the dataframe column(s) that refer to the target value(s) an H2O Hydrogen Torch experiment can aim to predict.

Image column

Defines the dataframe column storing the names of images that H2O Hydrogen Torch loads from the Data folder and Data folder test when training and testing the model.
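The relationship between the image column and the data folders can be pictured as a simple path join. The sketch below assumes a column named image and folder names data/train_images and data/test_images; these names are hypothetical, and how H2O Hydrogen Torch resolves paths internally is not specified here.

```python
from pathlib import PurePosixPath


def resolve_image_paths(rows, image_column, data_folder):
    """Join each file name in the image column with the given data folder."""
    return [str(PurePosixPath(data_folder) / row[image_column]) for row in rows]


# Hypothetical train and test records; the test records carry no label column.
train_rows = [{"image": "cat_001.jpg", "label": "cat"}, {"image": "dog_001.jpg", "label": "dog"}]
test_rows = [{"image": "unknown_001.jpg"}]

train_paths = resolve_image_paths(train_rows, "image", "data/train_images")
test_paths = resolve_image_paths(test_rows, "image", "data/test_images")
print(train_paths, test_paths)
```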

Validate sample files

This setting determines whether H2O Hydrogen Torch validates sample files during the dataset import process. When enabled, H2O Hydrogen Torch checks the sample files for formatting errors and inconsistencies before importing them, which helps catch unreadable or malformed files before they affect an experiment.

When this setting is disabled, H2O Hydrogen Torch skips the validation step on the sample files. This can speed up the import process, but it increases the risk that errors or inconsistencies in the dataset go undetected. Therefore, it is recommended to enable this setting.

note
  • Audio and speech problem types
    • For audio and speech problem types, the librosa.load function from the librosa library is called on each sample file. This function reads and decodes the audio file, returning a tuple containing the audio time series as a NumPy array and the sample rate. A successful call confirms that the audio data is in a readable format and can be loaded and processed for a model (experiment).
  • Image problem types
    • For image problem types, the np.load function from the NumPy library is called on sample files with the .npy extension. This function loads previously saved NumPy arrays from disk. For other image file formats, the cv2.imread and cv2.cvtColor functions from the OpenCV library are used to read the file and convert its color space, respectively. A successful call confirms that the image data is in a readable format and can be loaded and processed for a model (experiment).
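A check of this kind for image files could look like the following sketch. It only mirrors the calls named above and is not the actual H2O Hydrogen Torch code: the function name validate_image_file, the (ok, message) return shape, and the BGR-to-RGB conversion flag are assumptions. OpenCV is imported lazily, so the .npy branch runs without it.

```python
import os
import tempfile

import numpy as np


def validate_image_file(path):
    """Try to load one sample file; return (ok, message) instead of raising."""
    try:
        if path.lower().endswith(".npy"):
            array = np.load(path)
        else:
            import cv2  # lazy import; requires the opencv-python package

            array = cv2.imread(path)
            if array is None:  # cv2.imread returns None when a file cannot be read
                return False, "cv2.imread could not read the file"
            # Assumed conversion: OpenCV loads images in BGR channel order.
            array = cv2.cvtColor(array, cv2.COLOR_BGR2RGB)
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, f"shape={array.shape}"


# Demonstration with one valid and one corrupt .npy file in a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    good_path = os.path.join(folder, "good.npy")
    bad_path = os.path.join(folder, "bad.npy")
    np.save(good_path, np.zeros((4, 4, 3), dtype=np.uint8))
    with open(bad_path, "wb") as handle:
        handle.write(b"not a numpy array")

    good_ok, good_msg = validate_image_file(good_path)
    bad_ok, bad_msg = validate_image_file(bad_path)

print(good_ok, good_msg)
print(bad_ok, bad_msg)
```

Returning a message instead of raising lets a caller collect every failing file in one pass over the samples.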
