The Datasets Page¶
The Datasets Overview page is the Driverless AI Home page. It lists all datasets that have been imported. Note that the first time you log in, this list will be empty.
Supported File Types¶
Driverless AI supports the following dataset file formats:
- arff
- bin
- bz2
- csv
- dat
- feather
- gz
- jay
- nff
- parquet (See notes below)
- tgz
- tsv
- txt
- xls
- xlsx
- xz
- zip
Parquet Notes:
- For Parquet file formats, if you select multiple Parquet files to import, those files are imported as multiple datasets. If you select a folder of Parquet files, the folder is imported as a single dataset. Tools like Spark/Hive export data as multiple Parquet files that are stored in a directory with a user-defined name. For example, if you export with Spark dataFrame.write.parquet("/data/big_parquet_dataset"), Spark creates a folder /data/big_parquet_dataset, which contains multiple Parquet files (depending on the number of partitions in the input dataset) plus metadata.
- You may receive a “Failed to ingest binary file with Parquet: lists with structs are not supported” error when ingesting a Parquet file that has a struct as an element of an array. This is because PyArrow cannot handle a struct that’s an element of an array. In Sparkling Water, we provide a workaround to flatten the Parquet file. Refer to our Sparkling Water solution for more information.
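The file-versus-folder rule in the first note can be sketched in a few lines of Python. This is only an illustration of the rule as described, not Driverless AI code: the helper name plan_parquet_import is hypothetical, and the assumption that only files ending in .parquet inside a folder carry data (with everything else treated as metadata) is ours.

```python
from pathlib import Path


def plan_parquet_import(selected):
    """Map each selected path to the Parquet files that would form one dataset.

    Hypothetical helper: it only mirrors the rule in the notes above.
    """
    datasets = {}
    for path in map(Path, selected):
        if path.is_dir():
            # A folder becomes a single dataset built from all of its part files.
            # Assumption: non-.parquet entries (for example _SUCCESS) are metadata.
            datasets[path.name] = sorted(path.glob("*.parquet"))
        else:
            # Each individually selected file becomes its own dataset.
            datasets[path.name] = [path]
    return datasets
```

Selecting one Spark output folder therefore yields one entry, while selecting its part files one by one yields one entry per file.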
Adding Datasets¶
You can add datasets using one of the following methods:
Drag and drop files from your local machine directly onto this page. Note that this method currently works for files that are less than 10 GB.
or
Click the Add Dataset (or Drag and Drop) button to upload or add a dataset.
Notes:
- Upload File, File System, HDFS, and S3 are enabled by default. These can be disabled by removing them from the enabled_file_systems setting in the config.toml file. (Refer to the Using the config.toml File section for more information.)
- If File System is disabled, Driverless AI opens a local file browser by default.
- If Driverless AI was started with data connectors enabled for HDFS, BlueData Datatap, S3, Google Cloud Storage, Google Big Query, Minio, Snowflake, KDB+, and/or Azure Blob Store, then a dropdown will appear allowing you to specify where to begin browsing for the dataset. Refer to Enabling Data Connectors for more information.
Notes:
- Datasets must be in delimited text format.
- Driverless AI can detect the following separators: comma (,), pipe (|), semicolon (;), and tab (\t).
- When importing a folder, the entire folder and all of its contents are read into Driverless AI as a single file.
- When importing a folder, all of the files in the folder must have the same columns.
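Before importing a folder, it can be useful to check locally that every file uses one of the detectable separators and that all files share the same columns. The following is a minimal sketch that uses Python's standard csv.Sniffer. It is not how Driverless AI performs detection internally, and the helper names are hypothetical.

```python
import csv
import io

# Separators listed in the notes above: comma, pipe, semicolon, tab.
CANDIDATE_SEPARATORS = ",|;\t"


def detect_separator(sample):
    """Guess the delimiter of a text sample, restricted to the candidates above."""
    return csv.Sniffer().sniff(sample, delimiters=CANDIDATE_SEPARATORS).delimiter


def read_header(text):
    """Return the column names from the first row of a delimited text blob."""
    sep = detect_separator(text)
    return next(csv.reader(io.StringIO(text), delimiter=sep))


def same_columns(files):
    """True when every file (a mapping of name to contents) has the same header row."""
    headers = [read_header(text) for text in files.values()]
    return all(header == headers[0] for header in headers)
```

For example, same_columns({"a.csv": "id,age\n1,30\n2,41\n", "b.csv": "id,age\n3,25\n4,52\n"}) returns True, so these two files could be placed in one folder and imported together.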
Upon completion, the datasets appear on the Datasets Overview page. Click a dataset to open a submenu, from which you can select Details, Split, Visualize, Predict, or Delete. You can also delete an unused dataset by hovering over it, clicking the X button or the Delete option, and then confirming the deletion. Note: You cannot delete a dataset that was used in an active experiment. You must delete the experiment first.
Dataset Details¶
To view a summary of a dataset or to preview the dataset, click on the dataset or select the [Click for Actions] button beside the dataset that you want to view, and then click Details from the submenu that appears. This opens the Dataset Details page.
Dataset Details Page¶
The Dataset Details page provides a summary of the dataset. This summary lists each column that is included in the dataset along with the type (see note below), the count, the mean, minimum, maximum, standard deviation, frequency, and the number of unique values.
Note: Driverless AI recognizes the following types: integer, string, real, boolean, and time.
Hover over the top of a column to view a summary of the first 20 rows of that column.
To view information for a specific column, type the column name in the field above the graph.
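To make the per-column summary concrete, the sketch below computes the same kinds of statistics for a single numeric column in plain Python. It rests on two assumptions that the text above does not confirm: that "frequency" means the number of occurrences of the most common value, and that the standard deviation is the sample standard deviation. The function name is hypothetical.

```python
import statistics
from collections import Counter


def summarize_column(values):
    """Summarize one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "count": len(present),
        "mean": statistics.fmean(present),
        "min": min(present),
        "max": max(present),
        # Assumption: sample (n - 1) standard deviation.
        "std": statistics.stdev(present),
        # Assumption: frequency of the most common value.
        "freq": counts.most_common(1)[0][1],
        "unique": len(counts),
    }
```

Calling summarize_column([1, 2, 2, 3, None]) reports a count of 4, a mean of 2.0, and 3 unique values.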
Dataset Rows Page¶
To switch the view and preview the dataset, click the Dataset Rows button in the top right portion of the UI. Then click the Dataset Overview button to return to the original view.
Splitting Datasets¶
In Driverless AI, you can split a training dataset into test and validation datasets.
Perform the following steps to split a dataset.
- On the Datasets page, select the [Click for Actions] button beside the dataset that you want to split, and then select Split from the submenu that appears.
- The Dataset Splitter form displays. Specify an Output Name 1 and an Output Name 2 for the first and second part of the split. (For example, you can name one test and one valid.)
- Optionally specify a Target column (for stratified sampling), a Fold column (to keep rows belonging to the same group together), and/or a Time column.
- Use the slider to select a split ratio, or enter a value in the Train/Valid Split Ratio field.
- Click Save when you are done.
Upon completion, the split datasets will be available on the Datasets page.
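The effect of choosing a Target column for stratified sampling can be illustrated with a short sketch. This is a generic stratified split written for illustration, not the Driverless AI implementation. The function name, the default ratio of 0.8, and the fixed seed are all assumptions.

```python
import random
from collections import defaultdict


def stratified_split(rows, target, ratio=0.8, seed=42):
    """Split rows (dicts) into two parts while preserving the target's class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[target]].append(row)

    first, second = [], []
    for group in by_class.values():
        # Shuffle within each class, then cut each class at the same ratio.
        rng.shuffle(group)
        cut = round(len(group) * ratio)
        first.extend(group[:cut])
        second.extend(group[cut:])
    return first, second
```

Because each class is cut separately, both output parts keep roughly the same proportion of each target value as the original dataset.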
Visualizing Datasets¶
Perform one of the following steps to visualize a dataset:
On the Datasets page, select the [Click for Actions] button beside the dataset that you want to view, and then click Visualize from the submenu that appears.
Click the Autoviz top menu link to go to the Visualizations list page, click the New Visualization button, then select or import the dataset that you want to visualize.
The Visualization Page¶
The Visualization page shows all available graphs for the selected dataset. Note that the graphs on the Visualization page can vary based on the information in your dataset. You can also view and download logs that were generated during the visualization.
The following is a complete list of available graphs.
- Correlated Scatterplots: Correlated scatterplots are 2D plots with large values of the squared Pearson correlation coefficient. In most cases, all possible scatterplots based on pairs of features (variables) are examined for correlations. However, if there are more than 50 numerical columns inside the dataset, Driverless AI randomly selects 50 of them and only examines all pairs of these 50. The displayed plots are ranked according to the correlation. Some of these plots may not look like textbook examples of correlation. The only criterion is that they have a large value of Pearson’s r. (Only variables with Pearson R^2 > 0.95^2 are displayed.) When modeling with these variables, you may want to leave out variables that are perfectly correlated with others.
Note that points in the scatterplot can have different sizes. Because Driverless AI aggregates the data and does not display all points, the larger a point is, the larger the number of exemplars (aggregated points) it covers.
- Spikey Histograms: Spikey histograms are histograms with huge spikes. This often indicates an inordinate number of single values (usually zeros) or highly similar values. The measure of “spikeyness” is a bin frequency that is ten times the average frequency of all the bins. You should be careful when modeling (particularly regression models) with spikey variables.
- Skewed Histograms: Skewed histograms are ones with especially large skewness (asymmetry). The robust measure of skewness is derived from Groeneveld, R.A. and Meeden, G. (1984), “Measuring Skewness and Kurtosis.” The Statistician, 33, 391-399. Highly skewed variables are often candidates for a transformation (e.g., logging) before use in modeling. The histograms in the output are sorted in descending order of skewness.
- Varying Boxplots: Varying boxplots reveal unusual variability in a feature across the categories of a categorical variable. The measure of variability is computed from a robust one-way analysis of variance (ANOVA). Sufficiently diverse variables are flagged in the ANOVA. A boxplot is a graphical display of the fractiles of a distribution. The center of the box denotes the median, the edges of a box denote the lower and upper quartiles, and the ends of the “whiskers” denote the range of values. Sometimes outliers occur, in which case the adjacent whisker is shortened to the next lower or upper value. For variables (features) having only a few values, the boxes can be compressed, sometimes into a single horizontal line at the median.
- Heteroscedastic Boxplots: Heteroscedastic boxplots reveal unusual variability in a feature across the categories of a categorical variable. Heteroscedasticity is calculated with a Brown-Forsythe test: Brown, M. B. and Forsythe, A. B. (1974), “Robust tests for equality of variances.” Journal of the American Statistical Association, 69, 364-367. Plots are ranked according to their heteroscedasticity values. A boxplot is a graphical display of the fractiles of a distribution. The center of the box denotes the median, the edges of a box denote the lower and upper quartiles, and the ends of the “whiskers” denote the range of values. Sometimes outliers occur, in which case the adjacent whisker is shortened to the next lower or upper value. For variables (features) having only a few values, the boxes can be compressed, sometimes into a single horizontal line at the median.
- Biplots: A Biplot is an enhanced scatterplot that uses both points and vectors to represent structure simultaneously for rows and columns of a data matrix. Rows are represented as points (scores), and columns are represented as vectors (loadings). The plot is computed from the first two principal components of the correlation matrix of the variables (features). You should look for unusual (non-elliptical) shapes in the points that might reveal outliers or non-normal distributions. You should also look for purple vectors that are well-separated. Overlapping vectors can indicate a high degree of correlation between variables.
- Outliers: Variables with anomalous or outlying values are displayed as red points in a dot plot. Dot plots are constructed using an algorithm in Wilkinson, L. (1999). “Dot plots.” The American Statistician, 53, 276–281. Not all anomalous points are outliers. Sometimes the algorithm will flag points that lie in an empty region (i.e., they are not near any other points). You should inspect outliers to see if they are miscodings or if they are due to some other mistake. Outliers should ordinarily be eliminated from models only when there is a reasonable explanation for their occurrence.
- Correlation Graph: The correlation network graph is constructed from all pairwise squared correlations between variables (features). For continuous-continuous variable pairs, the statistic used is the squared Pearson correlation. For continuous-categorical variable pairs, the statistic is based on the squared intraclass correlation (ICC). This statistic is computed from the mean squares from a one-way analysis of variance (ANOVA). The formula is (MSbetween - MSwithin)/(MSbetween + (k - 1)MSwithin), where k is the number of categories in the categorical variable. For categorical-categorical pairs, the statistic is computed from Cramer’s V squared. If the first variable has k1 categories and the second variable has k2 categories, then a k1 x k2 table is created from the joint frequencies of values. From this table, we compute a chi-square statistic. Cramer’s V squared statistic is then (chi-square / n) / min(k1,k2), where n is the total of the joint frequencies in the table. Variables with large values of these respective statistics appear near each other in the network diagram. The color scale used for the connecting edges runs from low (blue) to high (red). Variables connected by short red edges tend to be highly correlated.
- Parallel Coordinates Plot: A Parallel Coordinates Plot is a graph used for comparing multiple variables. Each variable has its own vertical axis in the plot. Each profile connects the values on the axes for a single observation. If the data contain clusters, these profiles will be colored by their cluster number.
- Radar Plot: A Radar Plot is a two-dimensional graph that is used for comparing multiple variables. Each variable has its own axis that starts from the center of the graph. The data are standardized on each variable between 0 and 1 so that values can be compared across variables. Each profile, which usually appears in the form of a star, connects the values on the axes for a single observation. Multivariate outliers are represented by red profiles. The Radar Plot is the polar version of the popular Parallel Coordinates plot. The polar layout enables us to represent more variables in a single plot.
- Data Heatmap: The heatmap graphic is constructed from the transposed data matrix. Rows of the heatmap represent variables, and columns represent cases (instances). The data are standardized before display so that small values are yellow and large values are red. The rows and columns are permuted via a singular value decomposition (SVD) of the data matrix so that similar rows and similar columns are near each other.
- Missing Values Heatmap: The missing values heatmap graphic is constructed from the transposed data matrix. Rows of the heatmap represent variables and columns represent cases (instances). The data are coded into the values 0 (missing) and 1 (nonmissing). Missing values are colored red and nonmissing values are left blank (white). The rows and columns are permuted via a singular value decomposition (SVD) of the data matrix so that similar rows and similar columns are near each other.
- Gaps Histogram: The gaps index is computed using an algorithm of Wainer and Schacht based on work by John Tukey. (Wainer, H. and Schacht, S. (1978), “Gapping.” Psychometrika, 43, 203-212.) Histograms with gaps can indicate a mixture of two or more distributions based on possible subgroups not necessarily characterized in the dataset.
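The pairwise statistics described under Correlated Scatterplots and Correlation Graph can be written out as a short sketch. The functions below follow the formulas exactly as stated in the list above and are not taken from the Driverless AI source. All function names are hypothetical. Note that the textbook definition of Cramer's V squared divides by min(k1 - 1, k2 - 1), whereas this sketch keeps min(k1, k2) as written in the text.

```python
from collections import Counter

# Display threshold quoted above: Pearson R^2 > 0.95^2.
R2_THRESHOLD = 0.95 ** 2


def squared_pearson(x, y):
    """Squared Pearson correlation for a continuous-continuous pair."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


def icc_statistic(values, categories):
    """(MSbetween - MSwithin) / (MSbetween + (k - 1) * MSwithin), k = number of categories."""
    groups = {}
    for v, c in zip(values, categories):
        groups.setdefault(c, []).append(v)
    n, k = len(values), len(groups)
    grand = sum(values) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


def cramers_v_squared(a, b):
    """(chi-square / n) / min(k1, k2), following the formula as written in the text."""
    n = len(a)
    joint = Counter(zip(a, b))
    row, col = Counter(a), Counter(b)
    chi2 = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n
            chi2 += (joint[(r, c)] - expected) ** 2 / expected
    return (chi2 / n) / min(len(row), len(col))
```

A pair of numeric columns would qualify for a correlated scatterplot when squared_pearson(x, y) > R2_THRESHOLD. The sketch does not guard against constant columns, which would cause a division by zero.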
The images on this page are thumbnails. You can click on any of the graphs to view and download a full-scale image. You can also view an explanation for each graph by clicking the Help button in the lower-left corner of each expanded graph.
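The "spikeyness" criterion given in the Spikey Histograms entry (a bin frequency that is ten times the average frequency of all the bins) is simple enough to check by hand. The sketch below assumes equal-width bins and a default of 20 bins. Neither detail is stated in the text above, and the function names are hypothetical.

```python
def histogram(values, bins=20):
    """Count values into equal-width bins (assumed binning, for illustration only)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant columns
    counts = [0] * bins
    for v in values:
        # The maximum value falls into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


def is_spikey(bin_counts, factor=10):
    """True when some bin holds at least `factor` times the average bin count."""
    average = sum(bin_counts) / len(bin_counts)
    return max(bin_counts) >= factor * average
```

A column that is mostly zeros, such as [0] * 1000 + list(range(1, 96)), is flagged, while an evenly spread column such as list(range(100)) is not.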