Dataset Details

To view a summary of a dataset or to preview it, either click the dataset or click the [Click for Actions] button beside it, and then select Details from the submenu that appears. This opens the Dataset Details page.

Datasets page

Dataset Details Page

The Dataset Details page provides a summary of the dataset. This summary lists each of the dataset’s columns and displays accompanying rows for logical type, format, storage type (see note below), count, number of missing values, mean, minimum, maximum, standard deviation, frequency, and number of unique values.
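As a rough illustration of what these summary rows represent, the sketch below computes similar statistics for a single column in plain Python. It is not Driverless AI code. The function name `summarize_column` is hypothetical, and several details are assumptions: missing values are represented as `None`, "count" is taken to mean the number of non-missing values, "freq" is taken to mean the count of the most common value, and the standard deviation is the sample standard deviation.

```python
import statistics
from collections import Counter


def summarize_column(values):
    """Summarize one column of values; None marks a missing value (assumption)."""
    present = [v for v in values if v is not None]
    summary = {
        "count": len(present),                  # non-missing values (assumption)
        "missing": len(values) - len(present),
        "unique": len(set(present)),
        # Count of the most common value (assumed meaning of "frequency").
        "freq": Counter(present).most_common(1)[0][1] if present else 0,
    }
    # Numeric statistics only make sense when every present value is a number.
    is_numeric = bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if is_numeric:
        summary["mean"] = statistics.mean(present)
        summary["min"] = min(present)
        summary["max"] = max(present)
        # Sample standard deviation needs at least two values.
        summary["std"] = statistics.stdev(present) if len(present) > 1 else 0.0
    return summary


print(summarize_column([1, 2, 2, None, 5]))
```

For a string column, the same function returns only the count, missing, unique, and frequency entries, because the numeric statistics do not apply.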

Note: Driverless AI recognizes the following storage types: integer, string, real, boolean, and time.

Hover over the top of a column to view a summary of the first 20 rows of that column.

Hover text for Dataset Details

To view information for a specific column, type the column name in the field above the graph.

Filter column

Changing a Column Type

Driverless AI also allows you to change a column type. If a column’s data type or distribution does not match how you want the column to be handled during an experiment, change its Logical Type. For example, an integer zip code can be changed to a categorical so that it is used only with categorical-related feature engineering. For Date and Datetime columns, use the Format option.

To change the Logical Type or Format of a column, click the group of square icons located to the right of the words Auto-detect. (The squares light up when you hover over them with your cursor.) Then select the new column type for that column.
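The zip code example can be made concrete with a short sketch. It does not show how Driverless AI handles the column internally; it only illustrates, with made-up sample values, why a zip code stored as an integer is better treated as a category. The variable names are hypothetical.

```python
from collections import Counter

# Zip codes stored as integers (sample values are made up for illustration).
zip_codes = [94043, 2139, 94043, 60601]

# As integers, arithmetic is possible but the result has no real meaning.
mean_zip = sum(zip_codes) / len(zip_codes)

# As categories, each code is only a label. Padding to five digits restores
# the leading zero that integer storage drops.
as_categories = [str(z).zfill(5) for z in zip_codes]
counts = Counter(as_categories)

print(mean_zip)
print(counts.most_common())
```

Counting occurrences per label is a meaningful operation on the categorical form, whereas the average of the integer form does not correspond to any real zip code property.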

Change a column's type or format

Dataset Rows Page

To switch the view and preview the dataset, click the Dataset Rows button in the top right portion of the UI. Then click the Dataset Overview button to return to the original view.

Dataset Rows

Modify By Recipe

You can also create a new dataset from this page by modifying an existing dataset with a custom recipe. To create a scoring pipeline from the new dataset, build an experiment on it. This feature is useful when you want to make changes to the training data that do not need to be made to the new data that you predict on. For example, you can change the target column from regression to classification, add a weight column to mark specific training rows as being more important, or remove outliers that you do not want to model on. Refer to the Adding a Data Recipe section for more information.
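The three example modifications above can be sketched in plain Python as follows. This is only an illustration of the transformations themselves and does not use the Driverless AI recipe interface; actual data recipes follow the structure of the sample recipes in the driverlessai-recipes repository. The function name `modify_training_rows`, the column names `price` and `recent`, and the threshold values are all hypothetical.

```python
def modify_training_rows(rows, target="price", threshold=100.0, outlier_cap=1000.0):
    """Return a new list of modified rows; the input rows are left untouched."""
    modified = []
    for row in rows:
        value = row[target]
        # Remove outliers that should not be modeled.
        if value > outlier_cap:
            continue
        new_row = dict(row)  # copy, so the original row is not changed
        # Replace the regression target with a binary classification target.
        new_row[target + "_high"] = int(value >= threshold)
        del new_row[target]
        # Add a weight column that marks some rows as more important.
        new_row["weight"] = 2.0 if row.get("recent") else 1.0
        modified.append(new_row)
    return modified


rows = [
    {"price": 50.0, "recent": False},
    {"price": 150.0, "recent": True},
    {"price": 5000.0, "recent": True},  # dropped as an outlier
]
new_rows = modify_training_rows(rows)
print(new_rows)
```

The function builds new rows rather than editing the input in place, which mirrors the behavior described on this page: the original dataset is kept and the modified data becomes a separate dataset.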

Click the Modify by Recipe button in the top right portion of the UI and select from the following options:

  • Data Recipe URL: Load a custom recipe from a URL to use to modify the dataset. The URL must point to a raw file, a GitHub repository or tree, or a local file. Sample custom data recipes are available at https://github.com/h2oai/driverlessai-recipes/tree/rel-1.8.10/data.

  • Upload Data Recipe: If you have a custom recipe available on your local system, click this button to upload that recipe.

  • Live Code: Manually enter custom recipe code to use to modify the dataset. Click the Get Preview button to preview the code’s effect on the dataset, then click Save to create a new dataset.

Notes:

  • These options are enabled by default. You can disable them by removing recipe_file and recipe_url from the enabled_file_systems configuration option.

  • Modifying a dataset with a recipe will not overwrite the original dataset. The dataset that is selected for modification will remain in the list of available datasets in its original form, and the modified dataset will appear in this list as a new dataset.

  • Changes made through this feature apply only to the new dataset. They will not be applied to new data that is scored.
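The first note above amounts to dropping two entries from the enabled_file_systems value. The sketch below shows that edit as a small string operation. It assumes the option is written as a comma-separated string, and the example value passed in is an assumption for illustration, so check the value in your own configuration. The function name `disable_recipe_sources` is hypothetical.

```python
def disable_recipe_sources(enabled_file_systems):
    """Remove recipe_file and recipe_url from a comma-separated option value."""
    entries = [entry.strip() for entry in enabled_file_systems.split(",")]
    kept = [e for e in entries if e and e not in ("recipe_file", "recipe_url")]
    return ", ".join(kept)


# Example value only (an assumption); your configuration may list other entries.
current = "upload, file, hdfs, s3, recipe_file, recipe_url"
print(disable_recipe_sources(current))
```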

Modify By Recipe menu