Supported File Types

Driverless AI supports the following dataset file formats:

  • arff

  • avro

  • bin

  • bz2

  • csv (See note below)

  • dat

  • feather

  • gz

  • jay (See note below)

  • orc (See notes below)

  • parquet (See notes below)

  • pickle / pkl (See note below)

  • tgz

  • tsv

  • txt

  • xls

  • xlsx

  • xz

  • zip

Notes:

  • Compressed Parquet files are typically the most efficient file type to use with Driverless AI.

  • By default, Driverless AI uses a file's extension to determine its type before importing it. If no file extension is provided when adding data, Driverless AI attempts to import the data according to the list of file types defined by the files_without_extensions_expected_types configuration setting. For example, if the list is specified as ["parquet", "orc"] (the default value), Driverless AI first attempts to import the data as a Parquet file. If this is unsuccessful, it then attempts to import the data as an ORC file, and it continues down the list until the data is successfully imported. This setting can be configured in the config.toml file, as shown in the excerpt below. (See Using the config.toml File for more info.)

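For reference, this is what the entry looks like in config.toml (shown here with its default value; adjust the list to the extensionless file types you expect):

# File types to try, in order, for files imported without an extension.
files_without_extensions_expected_types = ["parquet", "orc"]
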
  • CSV in UTF-16 encoding is only supported when implemented with a byte order mark (BOM). If a BOM is not present, the dataset is read as UTF-8.

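If you need to import a UTF-16 file that lacks a BOM, one option is to re-encode it as UTF-8 first. The following is a minimal Python sketch, assuming a little-endian UTF-16 file and placeholder file names:

# Re-encode a UTF-16 CSV (little-endian, no BOM) as UTF-8.
# "utf-16-le" is an assumption; use "utf-16-be" for big-endian files.
with open("input_utf16.csv", encoding="utf-16-le") as src:
    text = src.read()
with open("input_utf8.csv", "w", encoding="utf-8") as dst:
    dst.write(text)
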
  • For ORC and Parquet file formats, if you select multiple files to import, those files are imported as multiple datasets. If you select a folder of ORC or Parquet files, the folder is imported as a single dataset. This matters because tools like Spark and Hive export data as multiple ORC or Parquet files stored in a directory with a user-defined name. For example, if you export with Spark dataFrame.write.parquet("/data/big_parquet_dataset"), Spark creates the folder /data/big_parquet_dataset, which contains multiple Parquet files (the number depends on the number of partitions in the input dataset) and metadata, as illustrated below. Exporting ORC files produces a similar result.

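As an illustration, the following PySpark sketch (assuming dataFrame is an existing Spark DataFrame and the paths are placeholders) writes such a directory; coalescing to a single partition first is one way to produce a single part file instead:

# Writing normally produces a directory of part files plus metadata.
dataFrame.write.parquet("/data/big_parquet_dataset")

# Coalescing to one partition before writing yields a single part file.
dataFrame.coalesce(1).write.parquet("/data/big_parquet_dataset_single")
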
  • For ORC and Parquet file formats, you may receive a “Failed to ingest binary file with ORC / Parquet: lists with structs are not supported” error when ingesting a file that has a struct as an element of an array. This is because PyArrow cannot handle structs that are elements of arrays. You can check a file's schema for this pattern before importing, as shown below.

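The following is a minimal PyArrow sketch for checking a Parquet file's schema for this pattern ("test.parquet" is a placeholder; ORC files would need the pyarrow.orc module instead):

import pyarrow as pa
import pyarrow.parquet as pq

# Read only the schema, without loading any data.
schema = pq.read_schema("test.parquet")
for field in schema:
    # Flag columns whose type is a list with struct elements.
    if pa.types.is_list(field.type) and pa.types.is_struct(field.type.value_type):
        print(f"Column '{field.name}' is a list of structs and cannot be ingested.")
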
  • A workaround to flatten Parquet files is provided in Sparkling Water. Refer to our Sparkling Water solution for more information.

  • You can create new datasets from Python script files (custom recipes) by selecting Data Recipe URL or Upload Data Recipe from the Add Dataset (or Drag & Drop) dropdown menu. If you select the Data Recipe URL option, the URL must point to either a raw file, a GitHub repository or tree, or a local file. In addition, you can create a new dataset by modifying an existing dataset with a custom recipe. Refer to Modify By Recipe for more information. Datasets created or added from recipes are saved as .jay files; a minimal recipe skeleton is sketched below.

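For reference, a data recipe is a Python script that returns one or more datatable Frames. The skeleton below follows the pattern used in H2O's public driverlessai-recipes examples; treat it as a sketch, since the exact import path, base class, and method signature can vary between Driverless AI versions:

import datatable as dt
from h2oaicore.data import CustomData  # import path as used in the public recipe examples


class ExampleData(CustomData):  # hypothetical recipe name
    @staticmethod
    def create_data(X: dt.Frame = None):
        # Return one or more Frames; each is saved as a .jay dataset.
        return dt.Frame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
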
  • To avoid potential errors, it is recommended that you convert pickle files to CSV or .jay format. The following is an example of how to convert a pickle file to a CSV file using datatable:

import datatable as dt
import pandas as pd

# Load the pickle with pandas, then convert it to a datatable Frame.
df = pd.read_pickle("test.pkl")
frame = dt.Frame(df)  # use a new name so the datatable module "dt" is not shadowed
frame.to_csv("test.csv")
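
The same Frame can be written to a .jay file instead using datatable's to_jay method:

frame.to_jay("test.jay")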