Datasets

Interact with datasets in the Driverless AI server.

create

create(
    data: Union[str, DataFrame],
    data_source: str = "upload",
    data_source_config: Dict[str, str] = None,
    force: bool = False,
    name: str = None,
    description: Optional[str] = None,
) -> Dataset

Creates a dataset in the Driverless AI server.

Parameters:

  • data (Union[str, DataFrame]) –

Path to the data file(s) or a directory, a SQL query, or a pandas DataFrame.

  • data_source (str, default: 'upload' ) –

    Name of the connector to import data from. Use driverlessai.connectors.list() to get enabled connectors in the server.

  • data_source_config (Dict[str, str], default: None ) –

    A dictionary of configuration options for advanced connectors.

  • force (bool, default: False ) –

    Whether to create the dataset even when a dataset already exists with the same name.

  • name (str, default: None ) –

    Name for the created dataset.

  • description (Optional[str], default: None ) –

    Description for the created dataset. (only available from Driverless AI version 1.10.7 onwards)

Returns:

  • Dataset

    The created dataset.

Example
dataset = client.datasets.create(
    data="s3://h2o-public-test-data/smalldata/iris/iris.csv",
    data_source="s3",
    name="iris-data",
    description="My Iris dataset",
)

create_async

create_async(
    data: Union[str, DataFrame],
    data_source: str = "upload",
    data_source_config: Dict[str, str] = None,
    force: bool = False,
    name: str = None,
    description: Optional[str] = None,
) -> DatasetJob

Launches the creation of a dataset in the Driverless AI server.

Parameters:

  • data (Union[str, DataFrame]) –

Path to the data file(s) or a directory, a SQL query, or a pandas DataFrame.

  • data_source (str, default: 'upload' ) –

    Name of the connector to import data from. Use driverlessai.connectors.list() to get enabled connectors in the server.

  • data_source_config (Dict[str, str], default: None ) –

    A dictionary of configuration options for advanced connectors.

  • force (bool, default: False ) –

    Whether to create the dataset even when a dataset already exists with the same name.

  • name (str, default: None ) –

    Name for the created dataset.

  • description (Optional[str], default: None ) –

    Description for the created dataset. (only available from Driverless AI version 1.10.7 onwards)

Returns:

  • DatasetJob

    The started dataset creation job.

Example
dataset_job = client.datasets.create_async(
    data="SELECT * FROM creditcard",
    data_source="jdbc",
    data_source_config=dict(
        jdbc_jar="/opt/jdbc-drivers/mysql/mysql-connector-java-8.0.23.jar",
        jdbc_driver="com.mysql.cj.jdbc.Driver",
        jdbc_url="jdbc:mysql://localhost:3306/datasets",
        jdbc_username="root",
        jdbc_password="root"
    ),
    name="creditcard",
    description="Sample creditcard data",
    force=True,
)
dataset = dataset_job.result()

get

get(key: str) -> Dataset

Retrieves a dataset in the Driverless AI server. If the dataset only exists in H2O Storage then it will be imported into the server first.

Parameters:

  • key (str) –

    The unique ID of the dataset.

Returns:

  • Dataset

    The dataset corresponding to the key.

Example
key = "e7de8630-dbfb-11ea-9f69-0242ac110002"
dataset = client.datasets.get(key=key)

get_by_name

get_by_name(name: str) -> Optional[Dataset]

Retrieves a dataset by its display name from the Driverless AI server.

Parameters:

  • name (str) –

    Name of the dataset.

Returns:

  • Optional[Dataset]

    The dataset with the specified name if it exists, otherwise None.

Beta API

A beta API that is subject to future changes.
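
Example
A minimal sketch; the dataset name "iris-data" is illustrative and assumes such a dataset was imported earlier.
dataset = client.datasets.get_by_name(name="iris-data")
if dataset is not None:
    print(dataset.key)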

gui

gui() -> Hyperlink

Returns the full URL to the Datasets page in the Driverless AI server.

Returns:

  • Hyperlink

    The full URL to the Datasets page.
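
Example
A minimal sketch that prints the returned URL.
print(client.datasets.gui())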

list

list(start_index: int = 0, count: int = None) -> Sequence[Dataset]

Retrieves datasets in the Driverless AI server.

Parameters:

  • start_index (int, default: 0 ) –

    The index of the first dataset to retrieve.

  • count (int, default: None ) –

    The maximum number of datasets to retrieve. If None, retrieves all available datasets.

Returns:

  • Sequence[Dataset]

    The datasets in the server.
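
Example
A minimal sketch; retrieves the first 10 datasets and prints their keys and names.
for dataset in client.datasets.list(start_index=0, count=10):
    print(dataset.key, dataset.name)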

DatasetJob

Monitor the creation of a dataset in the Driverless AI server.

key property

key: str

Universally unique key of the entity.

Returns:

name property

name: str

Name of the entity.

Returns:

is_complete

is_complete() -> bool

Whether the job has been completed successfully.

Returns:

  • bool

    True if the job has been completed successfully, otherwise False.

is_running

is_running() -> bool

Whether the job has been scheduled or is running, finishing, or syncing.

Returns:

  • bool

    True if the job has not completed yet, otherwise False.

result

result(silent: bool = False) -> Dataset

Awaits the job's completion before returning the created dataset.

Parameters:

  • silent (bool, default: False ) –

    Whether to display status updates or not.

Returns:

  • Dataset

    Created dataset by the job.
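
Example
A minimal sketch; launches an import with create_async and waits for the resulting dataset. The S3 path mirrors the earlier examples.
job = client.datasets.create_async(
    data="s3://h2o-public-test-data/smalldata/iris/iris.csv",
    data_source="s3",
)
dataset = job.result()
print(dataset.name)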

status

status(verbose: int = 0) -> str

Returns the status of the job.

Parameters:

  • verbose (int, default: 0 ) –
    • 0: A short description.
    • 1: A short description with a progress percentage.
    • 2: A detailed description with a progress percentage.

Returns:

  • str

    Current status of the job.
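
Example
A minimal sketch; assumes job is a DatasetJob returned by create_async.
print(job.status(verbose=2))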

Dataset

A dataset in the Driverless AI server.

columns property

columns: List[str]

Column names of the dataset.

Returns:

creation_timestamp property

creation_timestamp: float

Creation timestamp of the dataset in seconds since the epoch (POSIX timestamp).

Returns:

data_source property

data_source: str

Original data source of the dataset.

Returns:

description property

description: Optional[str]

Description of the dataset.

Returns:

file_path property

file_path: str

Path to the dataset bin file in the Driverless AI server.

Returns:

file_size property

file_size: int

Size in bytes of the dataset bin file in the Driverless AI server.

Returns:

key property

key: str

Universally unique key of the entity.

Returns:

log property

log: DatasetLog

Log of the dataset.

Returns:

name property

name: str

Name of the entity.

Returns:

shape property

shape: Tuple[int, int]

Dimensions of the dataset in (rows, cols) format.

Returns:

column_summaries

column_summaries(columns: List[str] = None) -> DatasetColumnSummaryCollection

Returns a collection of column summaries.

Parameters:

  • columns (List[str], default: None ) –

    Names of columns to include.

Returns:

  • DatasetColumnSummaryCollection

    A collection of column summaries for the requested columns.
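
Example
A minimal sketch; the column name "C1" is illustrative.
summaries = dataset.column_summaries()
print(summaries["C1"])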

delete

delete() -> None

Permanently deletes the dataset from the Driverless AI server.

download

download(
    dst_dir: str = ".",
    dst_file: Optional[str] = None,
    file_system: Optional[AbstractFileSystem] = None,
    overwrite: bool = False,
) -> str

Downloads the dataset as a CSV file.

Parameters:

  • dst_dir (str, default: '.' ) –

    The path where the CSV file will be saved.

  • dst_file (Optional[str], default: None ) –

    The name of the CSV file (overrides the default file name).

  • file_system (Optional[AbstractFileSystem], default: None ) –

    FSSPEC-based file system to download to instead of the local file system.

  • overwrite (bool, default: False ) –

    Whether to overwrite or not if a file already exists.

Returns:

  • str

    Path to the downloaded CSV file.

Example
dataset = client.datasets.create(
    data="s3://h2o-public-test-data/smalldata/parser/avro/weather_snappy-compression.avro",
    data_source="s3",
)
dataset.download()

export

export(**kwargs: Any) -> str

Exports the dataset as a CSV file from the Driverless AI server. Note that the export location is configured in the server. Refer to the Driverless AI docs for more information.

Other Parameters:

  • storage_destination (str) –

    Exporting destination. Possible values are file_system, s3, bitbucket, or azure.

  • username (str) –

    BitBucket username.

  • password (str) –

    BitBucket password.

  • branch (str) –

    BitBucket branch.

  • user_note (str) –

    BitBucket commit message.

Returns:

  • str

    Relative path to the exported CSV in the export location.
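
Example
A minimal sketch; assumes the server is configured to export to its local file system.
path = dataset.export(storage_destination="file_system")
print(path)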

get_used_in_experiments

get_used_in_experiments() -> Dict[str, List[Experiment]]

Retrieves the completed experiments where the dataset has been used as the training, testing, or validation dataset.

Returns:

  • Dict[str, List[Experiment]]

    A dictionary with three keys, train, test, and validation, each containing a list of experiments.

Driverless AI version requirement

Requires Driverless AI server 1.10.6 or higher.
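
Example
A minimal sketch that lists the experiments which used the dataset for training.
experiments = dataset.get_used_in_experiments()
print(experiments["train"])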

gui

gui() -> Hyperlink

Returns the full URL to the dataset details page in the Driverless AI server.

Returns:

  • Hyperlink

    URL to the dataset details page.

head

head(num_rows: int = 5) -> Table

Returns the column headers and the first n rows of the dataset.

Parameters:

  • num_rows (int, default: 5 ) –

    Number of rows to retrieve.

Returns:

  • Table

    A Table containing the retrieved rows.

Example
dataset = client.datasets.create(
    data="s3://h2o-public-test-data/smalldata/iris/iris.csv",
    data_source="s3",
)
# Print the headers and first 10 rows
print(dataset.head(num_rows=10))

merge_by_rows

merge_by_rows(other_dataset: Dataset, new_dataset_name: str) -> Dataset

Merges the specified dataset into this dataset. Note that the other dataset must have the same columns.

Parameters:

  • other_dataset (Dataset) –

The dataset that will be merged into this dataset.

  • new_dataset_name (str) –

    Name of the resulting dataset.

Returns:

  • Dataset

    The merged dataset.

Driverless AI version requirement

Requires Driverless AI server 1.10.6 or higher.
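
Example
A minimal sketch; assumes dataset_a and dataset_b were imported earlier and have identical columns.
merged = dataset_a.merge_by_rows(
    other_dataset=dataset_b,
    new_dataset_name="merged_dataset",
)
print(merged.shape)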

modify_by_code

modify_by_code(code: str, names: List[str] = None) -> Dict[str, Dataset]

Creates new dataset(s) by modifying the dataset using a Python script.

In the Python script

  • The original dataset is available as variable X with type datatable.Frame.
  • Newly created dataset(s) should be returned as a datatable.Frame, a pandas.DataFrame, a numpy.ndarray, or a list of those.

Parameters:

  • code (str) –

    Python script that modifies X.

  • names (List[str], default: None ) –

    Names for the new dataset(s).

Returns:

  • Dict[str, Dataset]

    A dictionary of newly created datasets with names as keys.

Example
# Import the iris dataset.
dataset = client.datasets.create(
    data="s3://h2o-public-test-data/smalldata/iris/iris.csv",
    data_source="s3",
)

# Create a new dataset only with the first 4 columns.
new_dataset = dataset.modify_by_code(
    code="return X[:, :4]",
    names=["new_dataset"],
)

# Split on 5th column to create 2 datasets.
new_datasets = dataset.modify_by_code(
    code="return [X[:, :5], X[:, 5:]]",
    names=["new_dataset_1", "new_dataset_2"],
)

modify_by_code_preview

modify_by_code_preview(code: str) -> Table

Returns a preview of new dataset(s) created by modifying the dataset using a Python script.

In the Python script

  • The original dataset is available as variable X with type datatable.Frame.
  • Newly created dataset(s) should be returned as a datatable.Frame, or a pandas.DataFrame, or a numpy.ndarray, or a list of those (only the first dataset in the list is shown in the preview).

Parameters:

  • code (str) –

    Python script that modifies X.

Returns:

  • Table

    A table containing the first 10 rows of the new dataset.

Example
# Import the iris dataset.
dataset = client.datasets.create(
    data="s3://h2o-public-test-data/smalldata/iris/iris.csv",
    data_source="s3",
)

# A new dataset only with the first 4 columns.
table = dataset.modify_by_code_preview("return X[:, :4]")
print(table)

modify_by_recipe

modify_by_recipe(
    recipe: Union[str, DataRecipe] = None, names: List[str] = None
) -> Dict[str, Dataset]

Creates new dataset(s) by modifying the dataset using a data recipe.

Parameters:

  • recipe (Union[str, DataRecipe], default: None ) –

Path to a recipe file, a URL to a recipe, or a DataRecipe object.

  • names (List[str], default: None ) –

    Names for the new dataset(s).

Returns:

  • Dict[str, Dataset]

    A dictionary of newly created datasets with names as keys.

Example
# Import the airlines dataset.
dataset = client.datasets.create(
    data="s3://h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip",
    data_source="s3",
)

# Modify the original dataset with a recipe.
new_datasets = dataset.modify_by_recipe(
    recipe="https://github.com/h2oai/driverlessai-recipes/blob/master/data/airlines_multiple.py",
    names=["new_airlines1", "new_airlines2"],
)

redescribe

redescribe(description: str) -> Dataset

Changes the description of the dataset.

Parameters:

  • description (str) –

    New description.

Returns:

  • Dataset

    The dataset with the updated description.

Driverless AI version requirement

Requires Driverless AI server 1.10.7 or higher.
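
Example
A minimal sketch; the new description is illustrative.
dataset = dataset.redescribe(description="Cleaned copy of the iris data")
print(dataset.description)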

rename

rename(name: str) -> Dataset

Changes the display name of the dataset.

Parameters:

  • name (str) –

    New display name.

Returns:

  • Dataset

    The dataset with the updated display name.
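
Example
A minimal sketch; the new name is illustrative.
dataset = dataset.rename(name="iris-renamed")
print(dataset.name)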

set_datetime_format

set_datetime_format(columns: Dict[str, str]) -> None

Sets/updates the datetime format for columns of the dataset.

Parameters:

  • columns (Dict[str, str]) –

    The dictionary where the key is the column name and the value is a valid datetime format.

Example
# Import the Eurodate dataset.
dataset = client.datasets.create(
    data="s3://h2o-public-test-data/smalldata/jira/v-11-eurodate.csv",
    data_source="s3",
)

# Set the datetime format for column 'ds5'
dataset.set_datetime_format({"ds5": "%d-%m-%y %H:%M"})

set_logical_types

set_logical_types(columns: Dict[str, Union[str, List[str]]]) -> None

Sets/updates the logical data types of the columns of the dataset. The logical type of columns is primarily used to determine which transformers to try on the column's data.

Possible logical types:

  • categorical
  • date
  • datetime
  • id
  • numerical
  • text

Parameters:

  • columns (Dict[str, Union[str, List[str]]]) –

A dictionary where the key is the column name and the value is the logical type or a list of logical types for the column. Use None to unset all logical types.

Example
# Import the prostate dataset.
dataset = client.datasets.create(
    data="s3://h2o-public-test-data/smalldata/prostate/prostate.csv",
    data_source="s3",
)

# Set the logical types
dataset.set_logical_types(
    {"ID": "id", "AGE": ["categorical", "numerical"], "RACE": None}
)

split_to_train_test

split_to_train_test(
    train_size: float = 0.5,
    train_name: str = None,
    test_name: str = None,
    target_column: str = None,
    fold_column: str = None,
    time_column: str = None,
    seed: int = 1234,
) -> Dict[str, Dataset]

Splits the dataset into train and test datasets in the Driverless AI server.

Parameters:

  • train_size (float, default: 0.5 ) –

Proportion of rows to put in the train dataset. The rest go to the test dataset.

  • train_name (str, default: None ) –

    Name for the train dataset.

  • test_name (str, default: None ) –

    Name for the test dataset.

  • target_column (str, default: None ) –

    Column to use for stratified sampling.

  • fold_column (str, default: None ) –

    Column to ensure grouped splitting.

  • time_column (str, default: None ) –

    Column for time-based splitting.

  • seed (int, default: 1234 ) –

    A random seed for reproducibility.

Note

Only one of target_column, fold_column, or time_column can be passed at a time.

Returns:

  • Dict[str, Dataset]

    A dictionary with keys train_dataset and test_dataset, containing the respective dataset.

Example
# Import the iris dataset.
dataset = client.datasets.create(
    data="s3://h2o-public-test-data/smalldata/iris/iris.csv",
    data_source="s3",
)

# Split the iris dataset into train/test sets.
splits = dataset.split_to_train_test(train_size=0.7)
train_dataset = splits["train_dataset"]
test_dataset = splits["test_dataset"]

split_to_train_test_async

split_to_train_test_async(
    train_size: float = 0.5,
    train_name: str = None,
    test_name: str = None,
    target_column: str = None,
    fold_column: str = None,
    time_column: str = None,
    seed: int = 1234,
) -> DatasetSplitJob

Launches the splitting of the dataset into train and test datasets in the Driverless AI server.

Parameters:

  • train_size (float, default: 0.5 ) –

Proportion of rows to put in the train dataset. The rest go to the test dataset.

  • train_name (str, default: None ) –

    Name for the train dataset.

  • test_name (str, default: None ) –

    Name for the test dataset.

  • target_column (str, default: None ) –

    Column to use for stratified sampling.

  • fold_column (str, default: None ) –

    Column to ensure grouped splitting.

  • time_column (str, default: None ) –

    Column for time-based splitting.

  • seed (int, default: 1234 ) –

    A random seed for reproducibility.

Note

Only one of target_column, fold_column, or time_column can be passed at a time.

Returns:

  • DatasetSplitJob

    The started dataset split job.
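
Example
A minimal sketch; launches the split and waits for both resulting datasets.
job = dataset.split_to_train_test_async(train_size=0.8)
splits = job.result()
train_dataset = splits["train_dataset"]
test_dataset = splits["test_dataset"]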

summarize

summarize() -> DatasetSummary

Summarizes the dataset using a GPT configured in the Driverless AI server.

Returns:

  • DatasetSummary

    The generated dataset summary.

Driverless AI version requirement

Requires Driverless AI server 1.10.6 or higher.

Beta API

A beta API that is subject to future changes.
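
Example
A minimal sketch; assumes a GPT provider is configured in the server.
summary = dataset.summarize()
print(summary.provider)
print(summary.summary)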

summarize_async

summarize_async() -> DatasetSummarizeJob

Launches the summarization of the dataset using a GPT configured in the Driverless AI server.

Returns:

  • DatasetSummarizeJob

    The started dataset summarization job.

Driverless AI version requirement

Requires Driverless AI server 1.10.6 or higher.

Beta API

A beta API that is subject to future changes.

tail

tail(num_rows: int = 5) -> Table

Returns the column headers and the last n rows of the dataset.

Parameters:

  • num_rows (int, default: 5 ) –

    Number of rows to retrieve.

Returns:

  • Table

    A Table containing the retrieved rows.

Example
dataset = client.datasets.create(
    data="s3://h2o-public-test-data/smalldata/iris/iris.csv",
    data_source="s3",
)
# Print the headers and last 10 rows
print(dataset.tail(num_rows=10))

DatasetLog

A dataset log file in the Driverless AI server.

file_name property

file_name: str

Filename of the log file.

Returns:

download

download(
    dst_dir: str = ".",
    dst_file: Optional[str] = None,
    file_system: Optional[AbstractFileSystem] = None,
    overwrite: bool = False,
) -> str

Downloads the log file.

Parameters:

  • dst_dir (str, default: '.' ) –

    The path where the log file will be saved.

  • dst_file (Optional[str], default: None ) –

    The name of the log file (overrides the default file name).

  • file_system (Optional[AbstractFileSystem], default: None ) –

    FSSPEC-based file system to download to instead of the local file system.

  • overwrite (bool, default: False ) –

    Whether to overwrite or not if a file already exists.

Returns:

  • str

    Path to the downloaded log file.
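
Example
A minimal sketch; assumes dataset.log returns this DatasetLog, and the destination directory is illustrative.
log_path = dataset.log.download(dst_dir="./logs", overwrite=True)
print(log_path)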

head

head(num_lines: int = 50) -> str

Returns the first n lines of the log file.

Parameters:

  • num_lines (int, default: 50 ) –

    Number of lines to retrieve.

Returns:

  • str

    The first num_lines lines of the log file.

tail

tail(num_lines: int = 50) -> str

Returns the last n lines of the log file.

Parameters:

  • num_lines (int, default: 50 ) –

    Number of lines to retrieve.

Returns:

  • str

    The last num_lines lines of the log file.
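
Example
A minimal sketch; assumes dataset.log returns this DatasetLog.
print(dataset.log.head(num_lines=20))
print(dataset.log.tail(num_lines=20))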

DatasetColumnSummaryCollection

A collection of column summaries of a dataset.

A column summary can be retrieved:

  • by column index: dataset.column_summaries()[0]
  • by column name: dataset.column_summaries()["C1"]
  • or by slicing to get multiple summaries: dataset.column_summaries()[0:3]

DatasetColumnSummary

Summary of a column in a dataset.

Example

c1_summary = dataset.column_summaries()["C1"]
# Print the summary for a histogram along with column statistics.
print(c1_summary)
Sample output:
--- C1 ---

 4.3|███████
    |█████████████████
    |██████████
    |████████████████████
    |████████████
    |███████████████████
    |█████████████
    |████
    |████
 7.9|████

Data Type: real
Logical Types: ['categorical', 'numerical']
Datetime Format:
Count: 150
Missing: 0
Mean: 5.84
SD: 0.828
Min: 4.3
Max: 7.9
Unique: 35
Freq: 10

count property

count: int

Non-missing values count in the column.

Returns:

data_type property

data_type: str

Raw data type of the column as detected by the Driverless AI server when the dataset was imported.

Returns:

datetime_format property

datetime_format: str

Datetime format of the column. See also Dataset.set_datetime_format.

Returns:

freq property

freq: int

Count of most frequent value in the column.

Returns:

logical_types property

logical_types: List[str]

User-defined data types for the column, used by the Driverless AI server. These take precedence over data_type. See also Dataset.set_logical_types.

Returns:

max property

Maximum value in the column if it contains binary/numeric data.

Returns:

mean property

mean: Optional[float]

Mean value of the column if it contains binary/numeric data.

Returns:

min property

Minimum value in the column if it contains binary/numeric data.

Returns:

missing property

missing: int

Missing values count in the column.

Returns:

name property

name: str

Column name.

Returns:

sd property

Standard deviation of the column if it contains binary/numeric data.

Returns:

unique property

unique: int

Unique values count of the column.

Returns:

DatasetSplitJob

Monitor splitting of a dataset in the Driverless AI server.

key property

key: str

Universally unique key of the entity.

Returns:

name property

name: str

Name of the entity.

Returns:

is_complete

is_complete() -> bool

Whether the job has been completed successfully.

Returns:

  • bool

    True if the job has been completed successfully, otherwise False.

is_running

is_running() -> bool

Whether the job has been scheduled or is running, finishing, or syncing.

Returns:

  • bool

    True if the job has not completed yet, otherwise False.

result

result(silent: bool = False) -> Dict[str, Dataset]

Awaits the job's completion before returning the split datasets.

Parameters:

  • silent (bool, default: False ) –

    Whether to display status updates or not.

Returns:

  • Dict[str, Dataset]

    A dictionary with keys train_dataset and test_dataset, containing the respective dataset created by the job.
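
Example
A minimal sketch; assumes job is a DatasetSplitJob returned by split_to_train_test_async.
splits = job.result()
print(splits["train_dataset"].name)
print(splits["test_dataset"].name)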

status

status(verbose: int = None) -> str

Returns the status of the job.

Parameters:

  • verbose (int, default: None ) –

    Ignored.

Returns:

  • str

    Current status of the job.

DatasetSummarizeJob

Monitor the creation of a dataset summary in the Driverless AI server.

key property

key: str

Universally unique key of the entity.

Returns:

name property

name: str

Name of the entity.

Returns:

is_complete

is_complete() -> bool

Whether the job has been completed successfully.

Returns:

  • bool

    True if the job has been completed successfully, otherwise False.

is_running

is_running() -> bool

Whether the job has been scheduled or is running, finishing, or syncing.

Returns:

  • bool

    True if the job has not completed yet, otherwise False.

result

result(silent: bool = False) -> DatasetSummary

Awaits the job's completion before returning the created dataset summary.

Parameters:

  • silent (bool, default: False ) –

    Whether to display status updates or not.

Returns:

  • DatasetSummary

    Dataset summary created by the job.

status

status(verbose: int = 0) -> str

Returns the status of the job.

Parameters:

  • verbose (int, default: 0 ) –
    • 0: A short description.
    • 1: A short description with a progress percentage.
    • 2: A detailed description with a progress percentage.

Returns:

  • str

    Current status of the job.

DatasetSummary dataclass

A summary of a dataset.

provider instance-attribute

provider: str

GPT provider that generated the dataset summary.

Returns:

summary instance-attribute

summary: str

Dataset summary.

Returns: