Version: v1.4.0

Import dataset settings: Text span prediction

Before importing a dataset to H2O Hydrogen Torch, you need to define a set of settings based on the problem type of the dataset. These settings are referred to as import dataset settings.

Dataset name

This setting defines the name of the dataset.

Problem category

This setting defines the general problem category of the dataset, for example, image.

Note
  • The selected problem category (for example, image) determines the options in the Problem type setting.
  • The From experiment option lets you reuse the settings of another experiment.

Problem type

This setting defines the problem type of the experiment, which also defines the settings H2O Hydrogen Torch displays for the experiment.

Note
  • The selected problem category (in the Problem category setting) determines the available problem types.
  • The selected problem type and experience level determine the settings H2O Hydrogen Torch displays for the experiment.

Train dataframe

This setting specifies the path to a file containing a dataframe with the training records that H2O Hydrogen Torch uses to train the model in the experiment. The file must follow the dataset format for the problem type of the experiment. To learn more, see Dataset formats.

Note
  • The records are combined into mini-batches when training the model.
  • If a validation dataframe is provided, a fold column is not needed in the train dataframe.
  • To import a dataset for inference only, set the Train dataframe setting to None and the Test dataframe setting to the relevant dataframe. H2O Hydrogen Torch then uses the dataset for predictions only, not for training.
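
As an illustration of what the training records might look like, the following minimal sketch serializes question, context, and answer records to CSV text. It assumes a CSV layout, and the column names question, context, and answer are hypothetical; the Dataset formats page remains the authoritative description of the expected file.

```python
import csv
import io

# Hypothetical column names (assumption). Whatever names you choose are the
# ones you would select in the Question, Context, and Answer column settings.
FIELDNAMES = ["question", "context", "answer"]


def build_train_csv(records):
    """Serialize question/context/answer records to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = [
    {
        "question": "Where is the Eiffel Tower?",
        "context": "The Eiffel Tower is in Paris.",
        "answer": "Paris",
    },
]
train_csv = build_train_csv(records)
```

Because a validation dataframe could be supplied separately, this sketch does not include a fold column.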

Validation dataframe

This setting defines a file containing a dataframe with validation records that H2O Hydrogen Torch uses to evaluate the model during training.

Note
  • Setting a validation dataframe requires the Validation strategy setting to be set to Custom holdout validation. When a validation dataframe is provided, H2O Hydrogen Torch uses it as is and does not perform any internal cross-validation. In other words, the model is trained on the full train dataframe, and model performance is evaluated on the provided validation dataframe.
  • The validation dataframe should have the same format as the train dataframe but does not require a fold column.

Setting dependency

The Validation dataframe setting is only available when you select Custom holdout validation in the Validation strategy setting.
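
If you do not already have a separate validation file, one way to prepare one before import is to hold out a share of your records. The following is a minimal sketch that runs outside H2O Hydrogen Torch; the function name holdout_split, the 20% fraction, and the seed are all assumptions chosen for illustration.

```python
import random


def holdout_split(records, validation_fraction=0.2, seed=0):
    """Split records into (train, validation) lists after a seeded shuffle.

    This is a hypothetical helper, not part of H2O Hydrogen Torch.
    """
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(records)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


train_records, validation_records = holdout_split(list(range(10)))
```

Each of the two resulting sets would then be saved in the same format, with no fold column needed in either.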

Test dataframe

This setting defines a file containing a dataframe with test records that H2O Hydrogen Torch uses to test the model.

Note
  • The test dataframe should have the same format as the train dataframe but does not require a label column.
  • To import a dataset for inference only, set the Train dataframe setting to None and the Test dataframe setting to the relevant dataframe. H2O Hydrogen Torch then uses the dataset for predictions only, not for training.
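
Since a label column is optional in the test dataframe, test records can be derived from labeled records by dropping the label fields. The sketch below assumes hypothetical column names (answer and answer_start) and a hypothetical helper name; dropping the columns is optional, not required.

```python
def to_test_records(records, label_columns=("answer", "answer_start")):
    """Return copies of the records without the label columns.

    The column names are assumptions; use the names from your own dataset.
    """
    return [
        {key: value for key, value in record.items() if key not in label_columns}
        for record in records
    ]


labeled = [
    {
        "question": "Where is the Eiffel Tower?",
        "context": "The Eiffel Tower is in Paris.",
        "answer": "Paris",
        "answer_start": 23,
    },
]
test_records = to_test_records(labeled)
```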

Question column

Defines the dataset column containing the question text H2O Hydrogen Torch uses during model training.

Context column

Defines the dataset column containing the context text that holds the answer to the question in the question column. H2O Hydrogen Torch uses the context column during model training.

Answer column

Defines the dataset column containing the answer text that H2O Hydrogen Torch uses during model training.

Answer start column

Defines the dataset column that contains the start position of the answer text within the context text. If not set, H2O Hydrogen Torch uses the first occurrence of the answer text in the context text as the start of the answer.
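
The fallback to the first occurrence can be sketched in a few lines of Python. This is an illustration only, not the H2O Hydrogen Torch implementation: the function name infer_answer_start is hypothetical, and measuring the start as a character offset is an assumption.

```python
def infer_answer_start(context, answer):
    """Return the character offset of the first occurrence of answer in
    context, or -1 if the answer text does not appear in the context."""
    return context.find(answer)


context = "The capital of France is Paris, and Paris hosts the Louvre."
answer = "Paris"
start = infer_answer_start(context, answer)
recovered = context[start:start + len(answer)]
```

When the answer text appears more than once, as in this example, the first match may not be the intended one, which is the case where an explicit answer start column would be useful.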

