Datasets ¶
Interact with datasets in the Driverless AI server.
create ¶
create(
    data: Union[str, DataFrame],
    data_source: str = "upload",
    data_source_config: Dict[str, str] = None,
    force: bool = False,
    name: str = None,
    description: Optional[str] = None,
) -> Dataset
Creates a dataset in the Driverless AI server.
Parameters:
- data (Union[str, DataFrame]) – Path to the data file(s), path to a directory, a SQL query, or a pandas DataFrame.
- data_source (str, default: 'upload') – Name of the connector to import data from. Use driverlessai.connectors.list() to get the connectors enabled in the server.
- data_source_config (Dict[str, str], default: None) – A dictionary of configuration options for advanced connectors.
- force (bool, default: False) – Whether to create the dataset even when a dataset with the same name already exists.
- name (str, default: None) – Name for the created dataset.
- description (Optional[str], default: None) – Description for the created dataset (only available from Driverless AI version 1.10.7 onwards).
Returns:
- Dataset – Created dataset.
Example
dataset = client.datasets.create(
    data="s3://h2o-public-test-data/smalldata/iris/iris.csv",
    data_source="s3",
    name="iris-data",
    description="My Iris dataset",
)
create_async ¶
create_async(
    data: Union[str, DataFrame],
    data_source: str = "upload",
    data_source_config: Dict[str, str] = None,
    force: bool = False,
    name: str = None,
    description: Optional[str] = None,
) -> DatasetJob
Launches the creation of a dataset in the Driverless AI server.
Parameters:
- data (Union[str, DataFrame]) – Path to the data file(s), path to a directory, a SQL query, or a pandas DataFrame.
- data_source (str, default: 'upload') – Name of the connector to import data from. Use driverlessai.connectors.list() to get the connectors enabled in the server.
- data_source_config (Dict[str, str], default: None) – A dictionary of configuration options for advanced connectors.
- force (bool, default: False) – Whether to create the dataset even when a dataset with the same name already exists.
- name (str, default: None) – Name for the created dataset.
- description (Optional[str], default: None) – Description for the created dataset (only available from Driverless AI version 1.10.7 onwards).
Returns:
- DatasetJob – Started dataset job.
Example
# Launch the import without blocking; a DatasetJob is returned.
dataset_job = client.datasets.create_async(
    data="SELECT * FROM creditcard",
    data_source="jdbc",
    data_source_config=dict(
        jdbc_jar="/opt/jdbc-drivers/mysql/mysql-connector-java-8.0.23.jar",
        jdbc_driver="com.mysql.cj.jdbc.Driver",
        jdbc_url="jdbc:mysql://localhost:3306/datasets",
        jdbc_username="root",
        jdbc_password="root",
    ),
    name="creditcard",
    description="Sample creditcard data",
    force=True,
)
# Wait for the job to finish and get the created dataset.
dataset = dataset_job.result()
get ¶
Retrieves a dataset in the Driverless AI server. If the dataset only exists in H2O Storage then it will be imported into the server first.
Parameters:
- key (str) – The unique ID of the dataset.
Returns:
- Dataset – The dataset corresponding to the key.
Example
key = "e7de8630-dbfb-11ea-9f69-0242ac110002"
dataset = client.datasets.get(key=key)
get_by_name ¶
gui ¶
gui() -> Hyperlink
Returns the full URL to the Datasets page in the Driverless AI server.
Returns:
- Hyperlink – The full URL to the Datasets page.
list ¶
DatasetJob ¶
Monitor the creation of a dataset in the Driverless AI server.
is_complete ¶
is_complete() -> bool
Whether the job has been completed successfully.
Returns:
- bool – True if the job has been completed successfully, otherwise False.
is_running ¶
is_running() -> bool
Whether the job has been scheduled or is running, finishing, or syncing.
Returns:
- bool – True if the job has not completed yet, otherwise False.
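The two status methods above can be combined into a simple polling loop. The following is a minimal sketch, not part of the client library: wait_for_job is a hypothetical helper, and FakeJob is a stand-in object used only so the snippet runs without a server. Any object exposing is_running() and is_complete() with the semantics described above should work the same way.

```python
import time


def wait_for_job(job, poll_interval=1.0, timeout=60.0):
    """Poll job.is_running() until it returns False or the timeout expires.

    Returns job.is_complete(), which is True only if the job succeeded.
    """
    deadline = time.monotonic() + timeout
    while job.is_running():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still running after {timeout} seconds")
        time.sleep(poll_interval)
    return job.is_complete()


class FakeJob:
    """Hypothetical stand-in for a job; reports 'running' a fixed number of times."""

    def __init__(self, polls_until_done):
        self._remaining = polls_until_done

    def is_running(self):
        if self._remaining > 0:
            self._remaining -= 1
            return True
        return False

    def is_complete(self):
        return self._remaining == 0


finished = wait_for_job(FakeJob(polls_until_done=3), poll_interval=0.01)
print(finished)
```

With a real job returned by create_async, the same helper could be called as wait_for_job(dataset_job). When no custom timeout handling is needed, calling result() on the job is the simpler option.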
result ¶
Dataset ¶
A dataset in the Driverless AI server.
creation_timestamp property ¶
creation_timestamp: float
Creation timestamp of the dataset in seconds since the epoch (POSIX timestamp).
Returns:
- float
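Because the value is a POSIX timestamp, it can be converted to a timezone-aware datetime with the standard library. This is a small sketch: timestamp_to_utc is a hypothetical helper name, and the literal timestamp below is a made-up sample value rather than one returned by a server.

```python
from datetime import datetime, timezone


def timestamp_to_utc(creation_timestamp: float) -> datetime:
    """Convert seconds since the epoch into a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(creation_timestamp, tz=timezone.utc)


# Sample value for illustration; in practice pass dataset.creation_timestamp.
created = timestamp_to_utc(1577836800.0)
print(created.isoformat())
```

Passing tz=timezone.utc avoids depending on the local timezone of the machine running the client.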
description property ¶
file_path property ¶
file_path: str
Path to the dataset bin file in the Driverless AI server.
Returns:
- str
file_size property ¶
file_size: int
Size in bytes of the dataset bin file in the Driverless AI server.
Returns:
- int
shape property ¶
column_summaries ¶
column_summaries(columns: List[str] = None) -> DatasetColumnSummaryCollection
Returns a collection of column summaries.
Parameters:
Returns:
- DatasetColumnSummaryCollection – Summaries of the columns of the dataset.
download ¶
download(
    dst_dir: str = ".",
    dst_file: Optional[str] = None,
    file_system: Optional[AbstractFileSystem] = None,
    overwrite: bool = False,
    timeout: float = 30,
) -> str
Downloads the dataset as a CSV file.
Parameters:
- dst_dir (str, default: '.') – The path where the CSV file will be saved.
- dst_file (Optional[str], default: None) – The name of the CSV file (overrides the default file name).
- file_system (Optional[AbstractFileSystem], default: None) – FSSPEC-based file system to download to instead of the local file system.
- overwrite (bool, default: False) – Whether to overwrite the file if it already exists.
- timeout (float, default: 30) – Connection timeout in seconds.
Returns:
- str – Path to the downloaded CSV file.
Example
dataset = client.datasets.create(
    data="s3://h2o-public-test-data/smalldata/parser/avro/weather_snappy-compression.avro",
    data_source="s3",
)
dataset.download()
export ¶
Exports the dataset as a CSV file from the Driverless AI server. Note that the export location is configured in the server. Refer to the Driverless AI docs for more information.
Other Parameters:
- storage_destination (str) – Exporting destination. Possible values are file_system, s3, bitbucket, or azure.
- username (str) – BitBucket username.
- password (str) – BitBucket password.
- branch (str) – BitBucket branch.
- user_note (str) – BitBucket commit message.
Returns:
- str – Relative path to the exported CSV in the export location.
get_used_in_experiments ¶
get_used_in_experiments() -> Dict[str, List[Experiment]]
Retrieves the completed experiments where the dataset has been used as the training, testing, or validation dataset.
Returns:
- Dict[str, List[Experiment]] – A dictionary with three keys, train, test, and validation, each containing a list of experiments.
Driverless AI version requirement
Requires Driverless AI server 1.10.6 or higher.
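As an illustration of working with the returned dictionary, the sketch below counts experiments per role and checks whether a dataset is unused. The helper names are hypothetical, and the sample dictionary uses plain strings as stand-ins for Experiment objects so that the snippet runs without a server.

```python
from typing import Dict, List

ROLES = ("train", "test", "validation")


def count_experiments_by_role(used_in: Dict[str, List[object]]) -> Dict[str, int]:
    """Return how many experiments used the dataset in each role."""
    return {role: len(used_in.get(role, [])) for role in ROLES}


def is_unused(used_in: Dict[str, List[object]]) -> bool:
    """True when no completed experiment used the dataset in any role."""
    return all(count == 0 for count in count_experiments_by_role(used_in).values())


# Stand-in for the result of dataset.get_used_in_experiments().
sample = {"train": ["exp-a", "exp-b"], "test": ["exp-a"], "validation": []}
counts = count_experiments_by_role(sample)
print(counts, is_unused(sample))
```

A check like is_unused can be useful before removing a dataset, since only completed experiments are reported.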
gui ¶
gui() -> Hyperlink
Returns the full URL to the dataset details page in the Driverless AI server.
Returns:
- Hyperlink – URL to the dataset details page.
head ¶
Returns the column headers and the first n rows of the dataset.
Parameters:
- num_rows (int, default: 5) – Number of rows to retrieve.
Returns:
- Table – A Table containing the retrieved rows.
Example
dataset = client.datasets.create(
    data="s3://h2o-public-test-data/smalldata/iris/iris.csv",
    data_source="s3",
)
# Print the headers and first 10 rows
print(dataset.head(num_rows=10))
merge_by_rows ¶
Merges the specified dataset into this dataset. Note that the other dataset must have the same columns.
Parameters:
Returns:
- Dataset – Merged dataset.
Driverless AI version requirement
Requires Driverless AI server 1.10.6 or higher.
modify_by_code ¶
Creates new dataset(s) by modifying the dataset using a Python script.
In the Python script:
- The original dataset is available as variable X with type datatable.Frame.
- Newly created dataset(s) should be returned as a datatable.Frame, a pandas.DataFrame, a numpy.ndarray, or a list of those.
Parameters:
Returns:
Example
# Import the iris dataset.
dataset = client.datasets.create(
    data="s3://h2o-public-test-data/smalldata/iris/iris.csv",
    data_source="s3",
)
# Create a new dataset only with the first 4 columns.
new_dataset = dataset.modify_by_code(
    code="return X[:, :4]",
    names=["new_dataset"],
)
# Split on 5th column to create 2 datasets.
new_datasets = dataset.modify_by_code(
    code="return [X[:, :5], X[:, 5:]]",
    names=["new_dataset_1", "new_dataset_2"],
)
modify_by_code_preview ¶
Returns a preview of new dataset(s) created by modifying the dataset using a Python script.
In the Python script:
- The original dataset is available as variable X with type datatable.Frame.
- Newly created dataset(s) should be returned as a datatable.Frame, a pandas.DataFrame, a numpy.ndarray, or a list of those (only the first dataset in the list is shown in the preview).
Parameters:
- code (str) – Python script that modifies X.
Returns:
- Table – A table containing the first 10 rows of the new dataset.
Example
# Import the iris dataset.
dataset = client.datasets.create(
    data="s3://h2o-public-test-data/smalldata/iris/iris.csv",
    data_source="s3",
)
# A new dataset only with the first 4 columns.
table = dataset.modify_by_code_preview("return X[:, :4]")
print(table)
modify_by_recipe ¶
modify_by_recipe(
    recipe: Union[str, DataRecipe] = None, names: List[str] = None
) -> Dict[str, Dataset]
Creates new dataset(s) by modifying the dataset using a data recipe.
Parameters:
- recipe (Union[str, DataRecipe], default: None) – The path to the recipe file, the URL of the recipe, or the data recipe itself.
- names (List[str], default: None) – Names for the new dataset(s).
Returns:
Example
# Import the airlines dataset.
dataset = client.datasets.create(
    data="s3://h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip",
    data_source="s3",
)
# Modify the original dataset with a recipe.
new_datasets = dataset.modify_by_recipe(
    recipe="https://github.com/h2oai/driverlessai-recipes/blob/master/data/airlines_multiple.py",
    names=["new_airlines1", "new_airlines2"],
)
redescribe ¶
rename ¶
set_datetime_format ¶
Sets/updates the datetime format for columns of the dataset.
Parameters:
Example
# Import the Eurodate dataset.
dataset = client.datasets.create(
    data="s3://h2o-public-test-data/smalldata/jira/v-11-eurodate.csv",
    data_source="s3",
)
# Set the datetime format for column 'ds5'
dataset.set_datetime_format({"ds5": "%d-%m-%y %H:%M"})
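The format string in the example uses strftime-style directives. Before sending a format to the server, it can help to check it locally against a few sample values. The sketch below assumes the server interprets the directives the same way Python's datetime.strptime does, which this page does not state; matches_format is a hypothetical helper and the sample values are made up.

```python
from datetime import datetime

DATETIME_FORMAT = "%d-%m-%y %H:%M"  # same format string as in the example above


def matches_format(value: str, fmt: str = DATETIME_FORMAT) -> bool:
    """Return True if value can be parsed with fmt by datetime.strptime."""
    try:
        datetime.strptime(value, fmt)
    except ValueError:
        return False
    return True


# Day-month-two-digit-year followed by hours and minutes.
parsed = datetime.strptime("25-12-20 14:30", DATETIME_FORMAT)
print(parsed)
print(matches_format("2020-12-25 14:30"))
```

A value written year-first fails the check because the format expects the day first and a two-digit year.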
set_logical_types ¶
Sets/updates the logical data types of the columns of the dataset. The logical type of columns is primarily used to determine which transformers to try on the column's data.
Possible logical types:
- categorical
- date
- datetime
- id
- numerical
- text
Parameters:
Example
# Import the prostate dataset.
dataset = client.datasets.create(
    data="s3://h2o-public-test-data/smalldata/prostate/prostate.csv",
    data_source="s3",
)
# Set the logical types
dataset.set_logical_types(
    {"ID": "id", "AGE": ["categorical", "numerical"], "RACE": None}
)
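The mapping passed in the example accepts a single type name, a list of type names, or None per column. A local check against the logical types listed above can catch typos before the call. This is a minimal sketch: validate_logical_types is a hypothetical helper, and treating None as "no logical types" is an assumption drawn from the example, not a documented behavior.

```python
from typing import Dict, List, Optional, Union

ALLOWED_LOGICAL_TYPES = {"categorical", "date", "datetime", "id", "numerical", "text"}

LogicalTypes = Optional[Union[str, List[str]]]


def validate_logical_types(columns: Dict[str, LogicalTypes]) -> Dict[str, List[str]]:
    """Normalize each value to a list and reject unknown logical type names."""
    normalized = {}
    for column, types in columns.items():
        if types is None:
            as_list = []  # assumption: None means no logical types for the column
        elif isinstance(types, str):
            as_list = [types]
        else:
            as_list = list(types)
        unknown = [t for t in as_list if t not in ALLOWED_LOGICAL_TYPES]
        if unknown:
            raise ValueError(f"unknown logical types for column {column!r}: {unknown}")
        normalized[column] = as_list
    return normalized


checked = validate_logical_types(
    {"ID": "id", "AGE": ["categorical", "numerical"], "RACE": None}
)
print(checked)
```

The original mapping, not the normalized one, is what would be passed to set_logical_types.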
split_to_train_test ¶
split_to_train_test(
    train_size: float = 0.5,
    train_name: str = None,
    test_name: str = None,
    target_column: str = None,
    fold_column: str = None,
    time_column: str = None,
    seed: int = 1234,
) -> Dict[str, Dataset]
Splits the dataset into train and test datasets in the Driverless AI server.
Parameters:
- train_size (float, default: 0.5) – Proportion of the rows to put in the train dataset. The rest go to the test dataset.
- train_name (str, default: None) – Name for the train dataset.
- test_name (str, default: None) – Name for the test dataset.
- target_column (str, default: None) – Column to use for stratified sampling.
- fold_column (str, default: None) – Column to ensure grouped splitting.
- time_column (str, default: None) – Column for time-based splitting.
- seed (int, default: 1234) – A random seed for reproducibility.
Note
Only one of target_column, fold_column, or time_column can be passed at a time.
Returns:
Example
# Import the iris dataset.
dataset = client.datasets.create(
    data="s3://h2o-public-test-data/smalldata/iris/iris.csv",
    data_source="s3",
)
# Split the iris dataset into train/test sets.
splits = dataset.split_to_train_test(train_size=0.7)
train_dataset = splits["train_dataset"]
test_dataset = splits["test_dataset"]
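The note above says that target_column, fold_column, and time_column are mutually exclusive. A small client-side guard can enforce this before the call is made. The sketch below is illustrative only: check_split_columns is a hypothetical helper, not part of the client, and the column name is a placeholder.

```python
from typing import Optional


def check_split_columns(
    target_column: Optional[str] = None,
    fold_column: Optional[str] = None,
    time_column: Optional[str] = None,
) -> Optional[str]:
    """Return the name of the single split argument in use, or None if none is given."""
    given = [
        name
        for name, value in (
            ("target_column", target_column),
            ("fold_column", fold_column),
            ("time_column", time_column),
        )
        if value is not None
    ]
    if len(given) > 1:
        raise ValueError(
            f"pass only one of target_column, fold_column, time_column; got {given}"
        )
    return given[0] if given else None


mode = check_split_columns(target_column="species")
print(mode)
```

Failing early on the client keeps the error message close to the code that built the arguments.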
split_to_train_test_async ¶
split_to_train_test_async(
    train_size: float = 0.5,
    train_name: str = None,
    test_name: str = None,
    target_column: str = None,
    fold_column: str = None,
    time_column: str = None,
    seed: int = 1234,
) -> DatasetSplitJob
Launches the splitting of the dataset into train and test datasets in the Driverless AI server.
Parameters:
- train_size (float, default: 0.5) – Proportion of the rows to put in the train dataset. The rest go to the test dataset.
- train_name (str, default: None) – Name for the train dataset.
- test_name (str, default: None) – Name for the test dataset.
- target_column (str, default: None) – Column to use for stratified sampling.
- fold_column (str, default: None) – Column to ensure grouped splitting.
- time_column (str, default: None) – Column for time-based splitting.
- seed (int, default: 1234) – A random seed for reproducibility.
Note
Only one of target_column, fold_column, or time_column can be passed at a time.
Returns:
- DatasetSplitJob – Started dataset split job.
summarize ¶
summarize() -> DatasetSummary
Summarizes the dataset using a GPT configured in the Driverless AI server.
Returns:
- DatasetSummary – Dataset summary.
Driverless AI version requirement
Requires Driverless AI server 1.10.6 or higher.
Beta API
A beta API that is subject to future changes.
summarize_async ¶
summarize_async() -> DatasetSummarizeJob
Launches the summarization of the dataset using a GPT configured in the Driverless AI server.
Returns:
- DatasetSummarizeJob – Started summarization job.
Driverless AI version requirement
Requires Driverless AI server 1.10.6 or higher.
Beta API
A beta API that is subject to future changes.
tail ¶
Returns the column headers and the last n rows of the dataset.
Parameters:
- num_rows (int, default: 5) – Number of rows to retrieve.
Returns:
- Table – A Table containing the retrieved rows.
Example
dataset = client.datasets.create(
    data="s3://h2o-public-test-data/smalldata/iris/iris.csv",
    data_source="s3",
)
# Print the headers and last 10 rows
print(dataset.tail(num_rows=10))
DatasetLog ¶
A dataset log file in the Driverless AI server.
download ¶
download(
    dst_dir: str = ".",
    dst_file: Optional[str] = None,
    file_system: Optional[AbstractFileSystem] = None,
    overwrite: bool = False,
    timeout: float = 30,
) -> str
Downloads the log file.
Parameters:
- dst_dir (str, default: '.') – The path where the log file will be saved.
- dst_file (Optional[str], default: None) – The name of the log file (overrides the default file name).
- file_system (Optional[AbstractFileSystem], default: None) – FSSPEC-based file system to download to instead of the local file system.
- overwrite (bool, default: False) – Whether to overwrite the file if it already exists.
- timeout (float, default: 30) – Connection timeout in seconds.
Returns:
- str – Path to the downloaded log file.
head ¶
DatasetColumnSummaryCollection ¶
A collection of column summaries of a dataset.
A column summary can be retrieved:
- by the column index: dataset.column_summaries()[0]
- by the column name: dataset.column_summaries()["C1"]
- or by slicing to get multiple summaries: dataset.column_summaries()[0:3]
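The three access patterns above (index, name, slice) can be supported by a single __getitem__ method. The sketch below shows one way such a container might behave. ColumnSummaries is a hypothetical stand-in written for illustration, not the library's DatasetColumnSummaryCollection, and the real class may return a different type for slices.

```python
from typing import Dict, List, Union


class ColumnSummaries:
    """Hypothetical container that can be indexed by position, column name, or slice."""

    def __init__(self, summaries: Dict[str, str]):
        self._names = list(summaries)  # preserves column order
        self._summaries = dict(summaries)

    def __getitem__(self, key: Union[int, str, slice]) -> Union[str, List[str]]:
        if isinstance(key, str):
            return self._summaries[key]  # lookup by column name
        if isinstance(key, slice):
            return [self._summaries[name] for name in self._names[key]]
        return self._summaries[self._names[key]]  # lookup by position

    def __len__(self) -> int:
        return len(self._names)


summaries = ColumnSummaries(
    {"C1": "summary of C1", "C2": "summary of C2", "C3": "summary of C3", "C4": "summary of C4"}
)
print(summaries[0], summaries["C1"], summaries[0:3])
```

Keeping an ordered list of names next to the name-keyed mapping lets positional and named lookups share the same data.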
DatasetColumnSummary ¶
Summary of a column in a dataset.
Example
c1_summary = dataset.column_summaries()["C1"]
# Print the summary for a histogram along with column statistics.
print(c1_summary)
--- C1 ---
4.3|███████
|█████████████████
|██████████
|████████████████████
|████████████
|███████████████████
|█████████████
|████
|████
7.9|████
Data Type: real
Logical Types: ['categorical', 'numerical']
Datetime Format:
Count: 150
Missing: 0
Mean: 5.84
SD: 0.828
Min: 4.3
Max: 7.9
Unique: 35
Freq: 10
data_type property ¶
data_type: str
Raw data type of the column as detected by the Driverless AI server when the dataset was imported.
Returns:
- str
datetime_format property ¶
datetime_format: str
Datetime format of the column. See also Dataset.set_datetime_format.
Returns:
- str
logical_types property ¶
User-defined data types for the column, to be used by the Driverless AI server. These take precedence over data_type. See also Dataset.set_logical_types.
Returns:
max property ¶
mean property ¶
min property ¶
sd property ¶
DatasetSplitJob ¶
Monitor splitting of a dataset in the Driverless AI server.
is_complete ¶
is_complete() -> bool
Whether the job has been completed successfully.
Returns:
- bool – True if the job has been completed successfully, otherwise False.
is_running ¶
is_running() -> bool
Whether the job has been scheduled or is running, finishing, or syncing.
Returns:
- bool – True if the job has not completed yet, otherwise False.
result ¶
DatasetSummarizeJob ¶
Monitor the creation of a dataset summary in the Driverless AI server.
is_complete ¶
is_complete() -> bool
Whether the job has been completed successfully.
Returns:
- bool – True if the job has been completed successfully, otherwise False.
is_running ¶
is_running() -> bool
Whether the job has been scheduled or is running, finishing, or syncing.
Returns:
- bool – True if the job has not completed yet, otherwise False.
result ¶
result(silent: bool = False) -> DatasetSummary
Awaits the job's completion before returning the created dataset summary.
Parameters:
- silent (bool, default: False) – Whether to suppress status updates.
Returns:
- DatasetSummary – Dataset summary created by the job.