H2O Module¶

h2o – module for using H2O services.

h2o.api(endpoint, data=None, json=None, filename=None, save_to=None)[source]¶

Perform a REST API request to a previously connected server.

This function is mostly for internal use, but may occasionally be useful for direct access to the backend H2O server. It has the same parameters as H2OConnection.request.

The list of available endpoints can be obtained using:

endpoints = [' '.join([r.http_method, r.url_pattern]) for r in h2o.api("GET /3/Metadata/endpoints").routes]

For each route, the available parameters (passed as data or json) can be obtained using:

parameters = {f.name: f.help for f in h2o.api("GET /3/Metadata/schemas/{route.input_schema}").fields}

Examples

>>> res = h2o.api("GET /3/NetworkTest")
>>> res["table"].show()
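
The endpoint argument in the calls above packs an HTTP method and a URL path into one string. As a minimal sketch of that convention, the helper below splits such a string into its two parts. `split_endpoint` is a hypothetical name used only for illustration; it is not part of the h2o package and does not contact a server.

```python
def split_endpoint(endpoint):
    """Split a string such as "GET /3/NetworkTest" into (method, path)."""
    method, _, path = endpoint.strip().partition(" ")
    if not path.startswith("/"):
        raise ValueError("expected '<METHOD> /<path>', got %r" % endpoint)
    return method.upper(), path


method, path = split_endpoint("GET /3/Metadata/endpoints")
print(method, path)
```

A check like this can catch a malformed endpoint string before it is passed to h2o.api.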

h2o.as_list(data, use_pandas=True, header=True)[source]¶

Convert an H2O data object into a python-specific object.

WARNING! This will pull all of the data into the local Python session!

If Pandas is available (and use_pandas is True), then pandas will be used to parse the data frame. Otherwise, a list-of-lists populated by character data will be returned (so the types of data will all be str).

Parameters

data – an H2O data object.
use_pandas – If True, try to use pandas for reading in the data.
header – If True, return the column names as the first element of the list.

Returns

List of lists (Rows x Columns).

Examples

>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
>>> from h2o.utils.typechecks import assert_is_type
>>> res1 = h2o.as_list(iris, use_pandas=False)
>>> assert_is_type(res1, list)
>>> res1 = list(zip(*res1))
>>> assert abs(float(res1[0][9]) - 4.4) < 1e-10 and abs(float(res1[1][9]) - 2.9) < 1e-10 and \
...     abs(float(res1[2][9]) - 1.4) < 1e-10, "incorrect values"
>>> res1
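
When pandas is not used, the result is a list of rows whose values are all strings, so numeric columns have to be converted explicitly. The sketch below works on a small hand-written list that only imitates that shape; the column names and values are illustrative stand-ins, not real output from an H2O cluster.

```python
# Hypothetical stand-in for h2o.as_list(frame, use_pandas=False, header=True):
# the first row holds the column names and every value is a str.
rows = [
    ["sepal_len", "sepal_wid", "class"],
    ["5.1", "3.5", "Iris-setosa"],
    ["4.9", "3.0", "Iris-setosa"],
]

header, body = rows[0], rows[1:]

# Transpose rows into columns, keyed by column name.
columns = {name: list(values) for name, values in zip(header, zip(*body))}

# Convert one numeric column from str to float.
sepal_len = [float(v) for v in columns["sepal_len"]]
print(sepal_len)
```

Splitting off the header row first keeps the column names out of the float conversion.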

h2o.assign(data, xid)[source]¶

(internal) Assign new id to the frame.

Parameters

data – an H2OFrame whose id should be changed
xid – new id for the frame.

Returns

the passed frame.

Examples

>>> old_name = "prostate.csv"
>>> new_name = "newProstate.csv"
>>> training_data = h2o.import_file(("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip"),
...                                  destination_frame=old_name)
>>> temp=h2o.assign(training_data, new_name)

h2o.cluster()[source]¶

Return H2OCluster object describing the backend H2O cluster.

Examples

>>> import h2o
>>> h2o.init()
>>> h2o.cluster()

h2o.cluster_info()[source]¶

Deprecated, use h2o.cluster().show_status().


h2o.cluster_status()[source]¶

Deprecated, use h2o.cluster().show_status(True).


h2o.connect(server=None, url=None, ip=None, port=None, https=None, verify_ssl_certificates=None, cacert=None, auth=None, proxy=None, cookies=None, verbose=True, config=None, strict_version_check=False)[source]¶

Connect to an existing H2O server, remote or local.

There are two ways to connect to a server: either pass a server parameter containing an instance of an H2OLocalServer, or specify ip and port of the server that you want to connect to.

Parameters

server – An H2OLocalServer instance to connect to (optional).
url – Full URL of the server to connect to (can be used instead of ip + port + https).
ip – The ip address (or host name) of the server where H2O is running.
port – Port number that the H2O service is listening on.
https – Set to True to connect via https:// instead of http://.
verify_ssl_certificates – When using https, setting this to False will disable SSL certificates verification.
cacert – Path to a CA bundle file or a directory with certificates of trusted CAs (optional).
auth – Either a (username, password) pair for basic authentication, an instance of h2o.auth.SpnegoAuth or one of the requests.auth authenticator objects.
proxy – Proxy server address.
cookies – Cookie (or list of cookies) to add to the request.
verbose – Set to False to disable printing connection status messages.
config – Connection configuration object encapsulating connection parameters.
strict_version_check – If True, an error will be raised if the client and server versions don’t match.

Returns

the new H2OConnection object.

Examples

>>> import h2o
>>> ipA = "127.0.0.1"
>>> portN = "54321"
>>> urlS = "http://127.0.0.1:54321"
>>> connect_type=h2o.connect(ip=ipA, port=portN, verbose=True)
>>> # or, using a full URL instead of ip + port + https
>>> connect_type2 = h2o.connect(url=urlS, verbose=True)
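
The url parameter is described above as a replacement for ip + port + https. The sketch below shows one way those three values could be combined into such a URL. `build_url` is a hypothetical helper for illustration, and the composition rule is an assumption based on the parameter descriptions, not h2o's internal code.

```python
def build_url(ip, port, https=False):
    """Combine ip, port and the https flag into a single server URL."""
    scheme = "https" if https else "http"
    return "%s://%s:%s" % (scheme, ip, port)


url = build_url("127.0.0.1", 54321)
print(url)
```

Passing either the composed URL or the separate ip and port values, but not both, keeps the connection settings unambiguous.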

h2o.connection()[source]¶

Return the current H2OConnection handler.

Examples

>>> temp = h2o.connection()
>>> temp

h2o.create_frame(frame_id=None, rows=10000, cols=10, randomize=True, real_fraction=None, categorical_fraction=None, integer_fraction=None, binary_fraction=None, time_fraction=None, string_fraction=None, value=0, real_range=100, factors=100, integer_range=100, binary_ones_fraction=0.02, missing_fraction=0.01, has_response=False, response_factors=2, positive_response=False, seed=None, seed_for_column_types=None)[source]¶

Create a new frame with random data.

Creates a data frame in H2O with real-valued, categorical, integer, and binary columns specified by the user.

Parameters

frame_id – the destination key. If empty, this will be auto-generated.
rows – the number of rows of data to generate.
cols – the number of columns of data to generate. Excludes the response column if has_response is True.
randomize – If True, data values will be randomly generated. This must be True if either categorical_fraction or integer_fraction is non-zero.
value – if randomize is False, then all real-valued entries will be set to this value.
real_range – the range of randomly generated real values.
real_fraction – the fraction of columns that are real-valued.
categorical_fraction – the fraction of total columns that are categorical.
factors – the number of (unique) factor levels in each categorical column.
integer_fraction – the fraction of total columns that are integer-valued.
integer_range – the range of randomly generated integer values.
binary_fraction – the fraction of total columns that are binary-valued.
binary_ones_fraction – the fraction of values in a binary column that are set to 1.
time_fraction – the fraction of randomly created date/time columns.
string_fraction – the fraction of randomly created string columns.
missing_fraction – the fraction of total entries in the data frame that are set to NA.
has_response – A logical value indicating whether an additional response column should be prepended to the final H2O data frame. If set to True, the total number of columns will be cols + 1.
response_factors – if has_response is True, then this variable controls the type of the “response” column: setting response_factors to 1 will generate a real-valued response, any value greater than or equal to 2 will create a categorical response with that many categories.
positive_response – when the response variable is present and of real type, this controls whether it contains positive values only, or both positive and negative values.
seed – a seed used to generate random values when randomize is True.
seed_for_column_types – a seed used to generate random column types when randomize is True.

Returns

an H2OFrame object

Examples

>>> import random
>>> dataset_params = {}
>>> dataset_params['rows'] = random.sample(list(range(50,150)),1)[0]
>>> dataset_params['cols'] = random.sample(list(range(3,6)),1)[0]
>>> dataset_params['categorical_fraction'] = round(random.random(),1)
>>> left_over = (1 - dataset_params['categorical_fraction'])
>>> dataset_params['integer_fraction'] = round(
...     left_over - round(random.uniform(0,left_over),1),1)
>>> if dataset_params['integer_fraction'] + dataset_params['categorical_fraction'] == 1:
...     if dataset_params['integer_fraction'] > dataset_params['categorical_fraction']:
...         dataset_params['integer_fraction'] = dataset_params['integer_fraction'] - 0.1
...     else:
...         dataset_params['categorical_fraction'] = dataset_params['categorical_fraction'] - 0.1
>>> dataset_params['missing_fraction'] = random.uniform(0,0.5)
>>> dataset_params['has_response'] = False
>>> dataset_params['randomize'] = True
>>> dataset_params['factors'] = random.randint(2,5)
>>> print("Dataset parameters: {0}".format(dataset_params))
>>> distribution = random.sample(['bernoulli','multinomial',
...                               'gaussian','poisson','gamma'], 1)[0]
>>> if   distribution == 'bernoulli': dataset_params['response_factors'] = 2
... elif distribution == 'gaussian':  dataset_params['response_factors'] = 1
... elif distribution == 'multinomial': dataset_params['response_factors'] = random.randint(3,5)
... else:
...     dataset_params['has_response'] = False
>>> print("Distribution: {0}".format(distribution))
>>> train = h2o.create_frame(**dataset_params)
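
Because each *_fraction parameter is a share of the total columns, it can help to check the shares before calling create_frame. The sketch below assumes that the fractions should each lie in [0, 1] and together sum to at most 1. That constraint is an assumption inferred from the parameter descriptions, and `check_fractions` is a hypothetical helper, not an h2o function.

```python
def check_fractions(**fractions):
    """Return the sum of the given column-type fractions, raising if invalid."""
    for name, value in fractions.items():
        if not 0 <= value <= 1:
            raise ValueError("%s must be between 0 and 1, got %r" % (name, value))
    total = sum(fractions.values())
    # Small tolerance for floating-point rounding.
    if total > 1 + 1e-9:
        raise ValueError("column-type fractions sum to %.3f, which exceeds 1" % total)
    return total


params = {"categorical_fraction": 0.3, "integer_fraction": 0.2, "binary_fraction": 0.1}
total = check_fractions(**params)
print(round(total, 3))
```

The validated dictionary can then be merged into the keyword arguments passed to h2o.create_frame.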

h2o.deep_copy(data, xid)[source]¶

Create a deep clone of the frame data.

Parameters

data – an H2OFrame to be cloned
xid – (internal) id to be assigned to the new frame.

Returns

new H2OFrame which is the clone of the passed frame.

Examples

>>> training_data = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> new_name = "new_frame"
>>> training_copy = h2o.deep_copy(training_data, new_name)
>>> training_copy

h2o.demo(funcname, interactive=True, echo=True, test=False)[source]¶

H2O built-in demo facility.

Parameters

funcname – A string that identifies the h2o python function to demonstrate.
interactive – If True, the user will be prompted to continue the demonstration after every segment.
echo – If True, the python commands that are executed will be displayed.
test – If True, h2o.init() will not be called (used for pyunit testing).

Example

>>> import h2o
>>> h2o.demo("gbm")

h2o.download_all_logs(dirname='.', filename=None, container=None)[source]¶

Download H2O log files to disk.

Parameters

dirname – a character string indicating the directory that the log file should be saved in.
filename – a string indicating the name of the output file. Note that the default container format is .zip, so the file name must include the .zip extension.
container – a string indicating how to archive the logs; one of “ZIP” (default) or “LOG”:
- ZIP: individual log files archived in a ZIP package
- LOG: all log files will be concatenated together in one text file

Returns

path of logs written in a zip file.

Examples

The following code saves the zip file ‘h2o_log.zip’ in a directory named your_directory_name, located one level below the current working directory. (Replace your_directory_name with the name of a directory that you have created and that already exists.)

>>> h2o.download_all_logs(dirname='./your_directory_name/', filename = 'h2o_log.zip')
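
Since the target directory has to exist beforehand, it can be created from Python first. The sketch below uses only the standard library and a temporary base directory so that it is safe to run anywhere. The directory and file names are placeholders, and the h2o call is left as a comment because it needs a running cluster.

```python
import os
import tempfile

# Use a throwaway base directory for the illustration.
base = tempfile.mkdtemp()
log_dir = os.path.join(base, "your_directory_name")

# Create the directory if it does not exist yet.
os.makedirs(log_dir, exist_ok=True)

log_name = "h2o_log.zip"
# With a running cluster:
# h2o.download_all_logs(dirname=log_dir, filename=log_name)
print(os.path.join(log_dir, log_name))
```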

h2o.download_csv(data, filename)[source]¶

Download an H2O data set to a CSV file on the local disk.

Warning: Files located on the H2O server may be very large! Make sure you have enough hard drive space to accommodate the entire file.

Parameters

data – an H2OFrame object to be downloaded.
filename – name for the CSV file where the data should be saved to.

Examples

>>> iris = h2o.load_dataset("iris")
>>> h2o.download_csv(iris, "iris_delete.csv")
>>> iris2 = h2o.import_file("iris_delete.csv")

h2o.download_model(model, path='', export_cross_validation_predictions=False, filename=None)[source]¶

Download an H2O Model object to the machine this python session is currently connected to. The saved file is owned by the user who started the python session.

Parameters

model – The model object to download.
path – a path to the directory where the model should be saved.
export_cross_validation_predictions – logical, indicates whether the exported model artifact should also include CV Holdout Frame predictions. Default is not to include the predictions.
filename – a filename for the saved model

Returns

the path of the downloaded model

Examples

>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> h2o_df = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> my_model = H2OGeneralizedLinearEstimator(family = "binomial")
>>> my_model.train(y = "CAPSULE",
...                x = ["AGE", "RACE", "PSA", "GLEASON"],
...                training_frame = h2o_df)
>>> h2o.download_model(my_model, path='')

h2o.download_pojo(model, path='', get_jar=True, jar_name='')[source]¶

Download the POJO for this model to the directory specified by path; if path is “”, then dump to screen.

Parameters

model – the model whose scoring POJO should be retrieved.
path – an absolute path to the directory where POJO should be saved.
get_jar – retrieve the h2o-genmodel.jar also (will be saved to the same folder path).
jar_name – Custom name of genmodel jar.

Returns

location of the downloaded POJO file.

Examples

>>> h2o_df = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> h2o_df['CAPSULE'] = h2o_df['CAPSULE'].asfactor()
>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> binomial_fit = H2OGeneralizedLinearEstimator(family = "binomial")
>>> binomial_fit.train(y = "CAPSULE",
...                    x = ["AGE", "RACE", "PSA", "GLEASON"],
...                    training_frame = h2o_df)
>>> h2o.download_pojo(binomial_fit, path='', get_jar=False)

h2o.enable_expr_optimizations(flag)[source]¶

Enable expression tree local optimizations.

Examples

>>> h2o.enable_expr_optimizations(True)

h2o.estimate_cluster_mem(ncols, nrows, num_cols=0, string_cols=0, cat_cols=0, time_cols=0, uuid_cols=0)[source]¶

Computes an estimate for cluster memory usage in GB.

The number of columns and the number of rows are required. For a better estimate, you can provide counts of the different types of columns in the dataset.

Parameters

ncols – (Required) total number of columns in a dataset. Integer, can’t be negative
nrows – (Required) total number of rows in a dataset. Integer, can’t be negative
num_cols – number of numeric columns in a dataset. Integer, can’t be negative.
string_cols – number of string columns in a dataset. Integer, can’t be negative.
cat_cols – number of categorical columns in a dataset. Integer, can’t be negative.
time_cols – number of time columns in a dataset. Integer, can’t be negative.
uuid_cols – number of uuid columns in a dataset. Integer, can’t be negative.

Returns

A memory estimate in GB.

Example

>>> from h2o import estimate_cluster_mem
>>> ### I will load a parquet file with 18 columns and 2 million lines
>>> estimate_cluster_mem(18, 2000000)
>>> ### I will load another parquet file with 16 columns and 2 million lines; I ask for a more precise estimate 
>>> ### because I know 12 of 16 columns are categorical and 1 of 16 columns consist of uuids.
>>> estimate_cluster_mem(16, 2000000, cat_cols=12, uuid_cols=1)
>>> ### I will load a parquet file with 8 columns and 31 million lines; I ask for a more precise estimate 
>>> ### because I know 4 of 8 columns are categorical and 4 of 8 columns consist of numbers.
>>> estimate_cluster_mem(ncols=8, nrows=31000000, cat_cols=4, num_cols=4)
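
The per-type counts describe subsets of the ncols columns, so together they cannot sensibly exceed ncols. The sketch below checks that before an estimate is requested. `untyped_columns` is a hypothetical helper, and whether h2o itself performs the same check is an assumption.

```python
def untyped_columns(ncols, nrows, **typed_counts):
    """Return how many columns are left without an explicit type count."""
    if ncols < 0 or nrows < 0:
        raise ValueError("ncols and nrows can't be negative")
    for name, count in typed_counts.items():
        if count < 0:
            raise ValueError("%s can't be negative" % name)
    typed = sum(typed_counts.values())
    if typed > ncols:
        raise ValueError("typed columns (%d) exceed ncols (%d)" % (typed, ncols))
    return ncols - typed


# 16 columns in total, of which 12 are categorical and 1 holds UUIDs.
remaining = untyped_columns(16, 2000000, cat_cols=12, uuid_cols=1)
print(remaining)
```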

h2o.explain(models, frame, columns=None, top_n_features=5, include_explanations='ALL', exclude_explanations=[], plot_overrides={}, figsize=(16, 9), render=True, qualitative_colormap='Dark2', sequential_colormap='RdYlBu_r', background_frame=None)[source]¶

Generate model explanations on frame data set.

The H2O Explainability Interface is a convenient wrapper to a number of explainability methods and visualizations in H2O. The function can be applied to a single model or group of models and returns an object containing explanations, such as a partial dependence plot or a variable importance plot. Most of the explanations are visual (plots). These plots can also be created by individual utility functions/methods.

Parameters

models – a list of H2O models, an H2O AutoML instance, or an H2OFrame with a ‘model_id’ column (e.g. H2OAutoML leaderboard).
frame – H2OFrame.
columns – either a list of columns or column indices to show. If specified, parameter top_n_features will be ignored.
top_n_features – a number of columns to pick using variable importance (where applicable).
include_explanations – if specified, return only the specified model explanations (mutually exclusive with exclude_explanations).
exclude_explanations – exclude specified model explanations.
plot_overrides – overrides for individual model explanations.
figsize – figure size; passed directly to matplotlib.
render – if True, render the model explanations; otherwise model explanations are just returned.
qualitative_colormap – used for setting qualitative colormap, that is passed to individual plots.
sequential_colormap – used for setting sequential colormap, that is passed to individual plots.
background_frame – optional frame, that is used as the source of baselines for the marginal SHAP. Setting it enables calculating SHAP in more models but it can be more time and memory consuming.

Returns

H2OExplanation containing the model explanations including headers and descriptions.

Examples

>>> import h2o
>>> from h2o.automl import H2OAutoML
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train an H2OAutoML
>>> aml = H2OAutoML(max_models=10)
>>> aml.train(y=response, training_frame=train)
>>>
>>> # Create the H2OAutoML explanation
>>> aml.explain(test)
>>>
>>> # Create the leader model explanation
>>> aml.leader.explain(test)

h2o.explain_row(models, frame, row_index, columns=None, top_n_features=5, include_explanations='ALL', exclude_explanations=[], plot_overrides={}, qualitative_colormap='Dark2', figsize=(16, 9), render=True, background_frame=None)[source]¶

Generate model explanations on frame data set for a given instance.

Explain the behavior of a model or group of models with respect to a single row of data. The function returns an object containing explanations, such as a partial dependence plot or a variable importance plot. Most of the explanations are visual (plots). These plots can also be created by individual utility functions/methods.

Parameters

models – H2OAutoML object, supervised H2O model, or list of supervised H2O models.
frame – H2OFrame.
row_index – row index of the instance to inspect.
columns – either a list of columns or column indices to show. If specified, parameter top_n_features will be ignored.
top_n_features – a number of columns to pick using variable importance (where applicable).
include_explanations – if specified, return only the specified model explanations (mutually exclusive with exclude_explanations).
exclude_explanations – exclude specified model explanations.
plot_overrides – overrides for individual model explanations.
qualitative_colormap – a colormap name.
figsize – figure size; passed directly to matplotlib.
render – if True, render the model explanations; otherwise model explanations are just returned.
background_frame – optional frame, that is used as the source of baselines for the marginal SHAP. Setting it enables calculating SHAP in more models but it can be more time and memory consuming.

Returns

H2OExplanation containing the model explanations including headers and descriptions.

Examples

>>> import h2o
>>> from h2o.automl import H2OAutoML
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train an H2OAutoML
>>> aml = H2OAutoML(max_models=10)
>>> aml.train(y=response, training_frame=train)
>>>
>>> # Create the H2OAutoML explanation
>>> aml.explain_row(test, row_index=0)
>>>
>>> # Create the leader model explanation
>>> aml.leader.explain_row(test, row_index=0)

h2o.export_file(frame, path, force=False, sep=', ', compression=None, parts=1, header=True, quote_header=True, parallel=False, format='csv', write_checksum=True, tz_adjust_from_local=False)[source]¶

Export a given H2OFrame to a path on the machine this python session is currently connected to.

Parameters

frame – the Frame to save to disk.
path – the path to the save point on disk.
force – if True, overwrite any preexisting file with the same path.
sep – field delimiter for the output file.
compression – how to compress the exported dataset (default none; gzip, bzip2 and snappy available)
parts – enables export to multiple ‘part’ files instead of just a single file. Convenient for large datasets that take too long to store in a single file. Use parts = -1 to instruct H2O to determine the optimal number of part files or specify your desired maximum number of part files. Path needs to be a directory when exporting to multiple files, also that directory must be empty. Default is parts = 1, which is to export to a single file.
header – if True, write out column names in the header line.
quote_header – if True, quote column names in the header.
parallel – use a parallel export to a single file (doesn’t apply when parts != 1; might create temporary files in the destination directory).
format – one of ‘csv’ or ‘parquet’. Defaults to ‘csv’. Export to parquet is multipart and H2O itself determines the optimal number of files (1 file per chunk).
write_checksum – if supported by the format (e.g. ‘parquet’), export will include a checksum file for each exported data file.

Examples

>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> h2o_df = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv")
>>> h2o_df['CAPSULE'] = h2o_df['CAPSULE'].asfactor()
>>> rand_vec = h2o_df.runif(1234)
>>> train = h2o_df[rand_vec <= 0.8]
>>> valid = h2o_df[(rand_vec > 0.8) & (rand_vec <= 0.9)]
>>> test = h2o_df[rand_vec > 0.9]
>>> binomial_fit = H2OGeneralizedLinearEstimator(family = "binomial")
>>> binomial_fit.train(y = "CAPSULE",
...                    x = ["AGE", "RACE", "PSA", "GLEASON"],
...                    training_frame = train, validation_frame = valid)
>>> pred = binomial_fit.predict(test)
>>> h2o.export_file(pred, "/tmp/pred.csv", force = True)
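
For a multi-part export, the parts description above requires path to be an empty directory. The sketch below checks that condition with the standard library before an export is attempted. `is_empty_dir` is a hypothetical helper, the temporary directory is a placeholder, and the h2o call is left as a comment because it needs a running cluster and a frame.

```python
import os
import tempfile


def is_empty_dir(path):
    """Return True if path is an existing directory with no entries."""
    return os.path.isdir(path) and not os.listdir(path)


export_dir = tempfile.mkdtemp()
ready = is_empty_dir(export_dir)
print(ready)
# With a running cluster and an H2OFrame named pred:
# h2o.export_file(pred, export_dir, parts=-1)
```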

h2o.flow()[source]¶

Open H2O Flow in your browser.

Examples

>>> import h2o
>>> h2o.init()
>>> h2o.flow()

h2o.frames()[source]¶

Retrieve all the Frames.

Returns

Meta information on the frames.

Examples

>>> arrestsH2O = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/USArrests.csv")
>>> h2o.frames()

h2o.get_frame(frame_id, **kwargs)[source]¶

Obtain a handle to the frame in H2O with the frame_id key.

Parameters

frame_id (str) – id of the frame to retrieve.

Returns

an H2OFrame object

Examples

>>> from h2o.frame import H2OFrame
>>> frame1 = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> frame2 = h2o.get_frame(frame1.frame_id)

h2o.get_grid(grid_id)[source]¶

Return the specified grid.

Parameters

grid_id – The grid identification in H2O.

Returns

an H2OGridSearch instance.

Examples

>>> from h2o.grid.grid_search import H2OGridSearch
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> x = ["DayofMonth", "Month"]
>>> hyper_parameters = {'learn_rate':[0.1,0.2],
...                     'max_depth':[2,3],
...                     'ntrees':[5,10]}
>>> search_crit = {'strategy': "RandomDiscrete",
...                'max_models': 5,
...                'seed' : 1234,
...                'stopping_metric' : "AUTO",
...                'stopping_tolerance': 1e-2}
>>> air_grid = H2OGridSearch(H2OGradientBoostingEstimator,
...                          hyper_params=hyper_parameters,
...                          search_criteria=search_crit)
>>> air_grid.train(x=x,
...                y="IsDepDelayed",
...                training_frame=airlines,
...                distribution="bernoulli")
>>> fetched_grid = h2o.get_grid(str(air_grid.grid_id))
>>> fetched_grid

h2o.get_model(model_id)[source]¶

Load a model from the server.

Parameters

model_id – The model identification in H2O.

Returns

Model object, a subclass of H2OEstimator

Examples

>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"]= airlines["Year"].asfactor()
>>> airlines["Month"]= airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> model = H2OGeneralizedLinearEstimator(family="binomial",
...                                       alpha=0,
...                                       Lambda=1e-5)
>>> model.train(x=predictors,
...             y=response,
...             training_frame=airlines)
>>> model2 = h2o.get_model(model.model_id)

h2o.get_timezone()[source]¶

Deprecated, use h2o.cluster().timezone.


h2o.import_file(path=None, destination_frame=None, parse=True, header=0, sep=None, col_names=None, col_types=None, na_strings=None, pattern=None, skipped_columns=None, force_col_types=False, custom_non_data_line_markers=None, partition_by=None, quotechar=None, escapechar=None, tz_adjust_to_local=False)[source]¶

Import files into an H2O cluster. The default behavior is to pass-through to the parse phase automatically.

The path to the data must be a valid path for each node in the H2O cluster. If some node in the H2O cluster cannot see the file, then an exception will be thrown by the H2O cluster. The import does a parallel/distributed multi-threaded pull of the data. The main difference between this method and upload_file() is that the latter works with local files, whereas this method imports remote files (i.e. files local to the server). If you are running the H2O server on your own machine, then both methods behave the same.

Parameters

path – path(s) specifying the location of the data to import or a path to a directory of files to import
destination_frame – The unique hex key assigned to the imported file. If none is given, a key will be automatically generated.
parse – If True, the file should be parsed after import. If False, then a list is returned containing the file path.
header – -1 means the first line is data, 0 means guess, 1 means first line is header.
sep – The field separator character. Values on each line of the file are separated by this character. If not provided, the parser will automatically detect the separator.
col_names – A list of column names for the file.
col_types – A list of types or a dictionary of column names to types to specify whether columns should be forced to a certain type upon import parsing. If a list, the types for elements that are None will be guessed. The possible types a column may have are:
- “unknown” - this will force the column to be parsed as all NA
- “uuid” - the values in the column must be true UUID or will be parsed as NA
- “string” - force the column to be parsed as a string
- “numeric” - force the column to be parsed as numeric. H2O will handle the compression of the numeric data in the optimal manner.
- “enum” - force the column to be parsed as a categorical column.
- “time” - force the column to be parsed as a time column. H2O will attempt to parse the following list of date time formats:
  - “yyyy-MM-dd” (date),
  - “yyyy MM dd” (date),
  - “dd-MMM-yy” (date),
  - “dd MMM yy” (date),
  - “HH:mm:ss” (time),
  - “HH:mm:ss:SSS” (time),
  - “HH:mm:ss:SSSnnnnnn” (time),
  - “HH.mm.ss” (time),
  - “HH.mm.ss.SSS” (time),
  - “HH.mm.ss.SSSnnnnnn” (time).
  Times can also contain “AM” or “PM”.
partition_by – Names of the columns the persisted dataset has been partitioned by.
na_strings – A list of strings, or a list of lists of strings (one list per column), or a dictionary of column names to strings which are to be interpreted as missing values.
pattern – Character string containing a regular expression to match file(s) in the folder if path is a directory.
skipped_columns – an integer list of column indices to skip; these columns are not parsed into the final frame.
force_col_types – If True, will force the column types to be either the ones in the Parquet schema for Parquet files or the ones specified in col_types. This parameter is used for numerical columns only. Other column settings will happen without setting this parameter. Defaults to False.
custom_non_data_line_markers – If a line in the imported file starts with any character in the given string, it will NOT be imported. An empty string means all lines are imported; None means that the default behaviour for the given format will be used.
quotechar – A hint for the parser which character to expect as quoting character. Only single quote, double quote or None (default) are allowed. None means automatic detection.
escapechar – (Optional) One ASCII character used to escape other characters.

Returns

a new H2OFrame instance.

Examples

>>> birds = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/birds.csv")
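
When col_types is given as a dictionary, each value should be one of the type names listed in its description. The sketch below validates such a dictionary before an import. `check_col_types` and the column names are hypothetical, and the set of names is limited to the six listed above; the installed h2o version may accept others.

```python
# The six type names listed in the col_types description.
KNOWN_TYPES = {"unknown", "uuid", "string", "numeric", "enum", "time"}


def check_col_types(col_types):
    """Raise ValueError if a dict of column name -> type uses an unlisted type."""
    unknown = {col: t for col, t in col_types.items() if t not in KNOWN_TYPES}
    if unknown:
        raise ValueError("unrecognized column types: %r" % unknown)
    return col_types


# Hypothetical column names, for illustration only.
col_types = check_col_types({"species": "enum", "observed_on": "time", "count": "numeric"})
print(sorted(col_types))
# With a running cluster:
# h2o.import_file(path, col_types=col_types)
```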

h2o.import_hive_table(database=None, table=None, partitions=None, allow_multi_format=False)[source]¶

Import Hive table to H2OFrame in memory.

Make sure to start H2O with Hive on the classpath. Uses hive-site.xml on the classpath to connect to Hive. When database is specified as a JDBC URL, the Hive JDBC driver is used to obtain table metadata, and direct HDFS access is then used to import the data.

Parameters

database – Name of the Hive database (the default database is used if not specified); can also be a JDBC URL.
table – name of Hive table to import
partitions – a list of lists of strings - partition key column values of partitions you want to import.
allow_multi_format – enable import of partitioned tables with different storage formats used. WARNING: this may fail on out-of-memory for tables with a large number of small partitions.

Returns

an H2OFrame containing data of the specified Hive table.

Examples

>>> basic_import = h2o.import_hive_table("default",
...                                      "table_name")
>>> jdbc_import = h2o.import_hive_table("jdbc:hive2://hive-server:10000/default",
...                                      "table_name")
>>> multi_format_enabled = h2o.import_hive_table("default",
...                                              "table_name",
...                                              allow_multi_format=True)
>>> with_partition_filter = h2o.import_hive_table("jdbc:hive2://hive-server:10000/default",
...                                               "table_name",
...                                               [["2017", "02"]])

h2o.import_mojo(mojo_path, model_id=None)[source]¶

Imports an existing MOJO model as an H2O model.

Parameters

mojo_path – Path to the MOJO archive on H2O’s filesystem
model_id – Model ID, default is None

Returns

An H2OGenericEstimator instance embedding given MOJO

Examples

>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> model = H2OGradientBoostingEstimator(ntrees = 1)
>>> model.train(x = ["Origin", "Dest"],
...             y = "IsDepDelayed",
...             training_frame=airlines)
>>> import tempfile
>>> mojo_dir = tempfile.mkdtemp()
>>> original_model_filename = model.download_mojo(mojo_dir)
>>> mojo_model = h2o.import_mojo(original_model_filename)

h2o.import_sql_select(connection_url, select_query, username, password, optimize=True, use_temp_table=None, temp_table_name=None, fetch_mode=None, num_chunks_hint=None)[source]¶

Import the SQL table that is the result of the specified SQL query to H2OFrame in memory.

Creates a temporary SQL table from the specified select_query. Runs multiple SELECT SQL queries on the temporary table concurrently for parallel ingestion, then drops the table. Be sure to start h2o.jar in the terminal with your downloaded JDBC driver in the classpath:

java -cp <path_to_h2o_jar>:<path_to_jdbc_driver_jar> water.H2OApp

Also see h2o.import_sql_table. Currently supported SQL databases are MySQL, PostgreSQL, MariaDB, Hive, Oracle and Microsoft SQL Server.
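
The launch command above can also be assembled programmatically. The following is a minimal sketch, assuming hypothetical jar locations; the helper h2o_launch_command is illustrative and not part of h2o.

```python
import os


def h2o_launch_command(h2o_jar, jdbc_driver_jar, sep=os.pathsep):
    """Build the argument list for starting H2O with a JDBC driver on the classpath."""
    classpath = sep.join([h2o_jar, jdbc_driver_jar])
    return ["java", "-cp", classpath, "water.H2OApp"]


# Hypothetical jar locations, used only for illustration.
cmd = h2o_launch_command("/opt/h2o/h2o.jar",
                         "/opt/drivers/mysql-connector.jar",
                         sep=":")
print(" ".join(cmd))
```

The default separator is os.pathsep, which is ":" on Linux and macOS and ";" on Windows. The returned list can be handed to subprocess.Popen as is.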

Parameters

connection_url – URL of the SQL database connection as specified by the Java Database Connectivity (JDBC) Driver. For example, “jdbc:mysql://localhost:3306/menagerie?&useSSL=false”
select_query – SQL query starting with SELECT that returns rows from one or more database tables.
username – username for SQL server
password – password for SQL server
optimize – DEPRECATED. Ignored - use fetch_mode instead. Optimize import of SQL table for faster imports.
use_temp_table – whether a temporary table should be created from select_query
temp_table_name – name of temporary table to be created from select_query
fetch_mode – Set to DISTRIBUTED to enable distributed import. Set to SINGLE to force a sequential read by a single node from the database.
num_chunks_hint – Desired number of chunks for the target Frame.

Returns

an H2OFrame containing data of the specified SQL query.

Examples

>>> conn_url = "jdbc:mysql://172.16.2.178:3306/ingestSQL?&useSSL=false"
>>> select_query = "SELECT bikeid from citibike20k"
>>> username = "root"
>>> password = "abc123"
>>> my_citibike_data = h2o.import_sql_select(conn_url, select_query,
...                                          username, password)

h2o.import_sql_table(connection_url, table, username, password, columns=None, optimize=True, fetch_mode=None, num_chunks_hint=None)[source]¶

Import SQL table to H2OFrame in memory.

Assumes that the SQL table is not being updated and is stable. Runs multiple SELECT SQL queries concurrently for parallel ingestion. Be sure to start the h2o.jar in the terminal with your downloaded JDBC driver in the classpath:

java -cp <path_to_h2o_jar>:<path_to_jdbc_driver_jar> water.H2OApp

Also see import_sql_select(). Currently supported SQL databases are MySQL, PostgreSQL, MariaDB, Hive, Oracle and Microsoft SQL Server.

Parameters

connection_url –
URL of the SQL database connection as specified by the Java Database Connectivity (JDBC) Driver. For example:
```
"jdbc:mysql://localhost:3306/menagerie?&useSSL=false"
```
table – name of SQL table
columns – a list of column names to import from SQL table. Default is to import all columns.
username – username for SQL server
password – password for SQL server
optimize – DEPRECATED. Ignored - use fetch_mode instead. Optimize import of SQL table for faster imports.
fetch_mode – Set to DISTRIBUTED to enable distributed import. Set to SINGLE to force a sequential read by a single node from the database.
num_chunks_hint – Desired number of chunks for the target Frame.

Returns

an H2OFrame containing data of the specified SQL table.

Examples

>>> conn_url = "jdbc:mysql://172.16.2.178:3306/ingestSQL?&useSSL=false"
>>> table = "citibike20k"
>>> username = "root"
>>> password = "abc123"
>>> my_citibike_data = h2o.import_sql_table(conn_url, table, username, password)

h2o.init(url=None, ip=None, port=None, name=None, https=None, cacert=None, insecure=None, username=None, password=None, cookies=None, proxy=None, start_h2o=True, nthreads=-1, ice_root=None, log_dir=None, log_level=None, max_log_file_size=None, enable_assertions=True, max_mem_size=None, min_mem_size=None, strict_version_check=None, ignore_config=False, extra_classpath=None, jvm_custom_args=None, bind_to_localhost=True, verbose=True, **kwargs)[source]¶

Attempt to connect to a local server, or if not successful start a new server and connect to it.

Parameters

url – Full URL of the server to connect to (can be used instead of ip + port + https).
ip – The ip address (or host name) of the server where H2O is running.
port – Port number that H2O service is listening to.
name – Cluster name. If None when connecting to an existing cluster, the cluster name is not checked. If set, the connection is made only if the target cluster name matches. If no instance is found and a local one is started, this value is used as the cluster name, or a random name is generated if set to None.
https – Set to True to connect via https:// instead of http://.
cacert – Path to a CA bundle file or a directory with certificates of trusted CAs (optional).
insecure – When using https, setting this to True will disable SSL certificates verification.
username – Username for basic authentication.
password – Password for basic authentication.
cookies – Cookie (or list of) to add to each request.
proxy – Proxy server address.
start_h2o – If False, do not attempt to start an h2o server when connection to an existing one failed.
nthreads – “Number of threads” option when launching a new h2o server.
ice_root – Directory for temporary files for the new h2o server.
log_dir – Directory for H2O logs to be stored if a new instance is started. Ignored if connecting to an existing node.
log_level –
The logger level for H2O if a new instance is started. One of:
- TRACE
- DEBUG
- INFO
- WARN
- ERRR
- FATA
Default is INFO. Ignored if connecting to an existing node.
max_log_file_size – Maximum size of INFO and DEBUG log files. The file is rolled over after a specified size has been reached. (The default is 3MB. Minimum is 1MB and maximum is 99999MB)
enable_assertions – Enable assertions in Java for the new h2o server.
max_mem_size –
Maximum memory to use for the new h2o server. Integer input will be evaluated as gigabytes. Other units can be specified by passing in a string (e.g. “160M” for 160 megabytes).
- Note: If max_mem_size is not defined, then the amount of memory that H2O allocates will be determined by the default memory of the Java Virtual Machine (JVM). This amount depends on the Java version, but it will generally be 25% of the machine’s physical memory.
min_mem_size – Minimum memory to use for the new h2o server. Integer input will be evaluated as gigabytes. Other units can be specified by passing in a string (e.g. “160M” for 160 megabytes).
strict_version_check – If True, an error will be raised if the client and server versions don’t match.
ignore_config – If True, the .h2oconfig file is not processed. Default value is False.
extra_classpath – List of paths to libraries that should be included on the Java classpath when starting H2O from Python.
kwargs – (all other deprecated attributes)
jvm_custom_args – Custom, user-defined arguments for the JVM in which H2O is instantiated. Ignored if there is an instance of H2O already running and the client connects to it.
bind_to_localhost – A flag indicating whether access to the H2O instance should be restricted to the local machine (default) or if it can be reached from other computers on the network.
verbose – Set to False to disable printing connection status and info messages.
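
The unit rule described for max_mem_size and min_mem_size (integers are read as gigabytes, strings carry their own unit) can be sketched as below. This is only an illustration of the documented rule, not h2o's actual implementation; the helper mem_size_to_string is hypothetical.

```python
def mem_size_to_string(value):
    """Apply the documented rule: integers mean gigabytes, strings keep their unit."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool):
        raise TypeError("memory size must be an int or a string")
    if isinstance(value, int):
        return "%dG" % value
    if isinstance(value, str):
        return value
    raise TypeError("memory size must be an int or a string")


print(mem_size_to_string(4))       # integer input is read as gigabytes
print(mem_size_to_string("160M"))  # string input keeps its explicit unit
```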

Examples

>>> import h2o
>>> h2o.init(ip="localhost", port=54323)

h2o.interaction(data, factors, pairwise, max_factors, min_occurrence, destination_frame=None)[source]¶

Categorical Interaction Feature Creation in H2O.

Creates a frame in H2O with n-th order interaction features between categorical columns, as specified by the user.

Parameters

data – the H2OFrame that holds the target categorical columns.
factors – factor columns (either indices or column names).
pairwise – If True, create pairwise interactions between factors (otherwise create one higher-order interaction). Only applicable if there are 3 or more factors.
max_factors – Max. number of factor levels in pair-wise interaction terms (if enforced, one extra catch-all factor will be made).
min_occurrence – Min. occurrence threshold for factor levels in pair-wise interaction terms
destination_frame – a string indicating the destination key. If empty, this will be auto-generated by H2O.
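
The difference between pairwise=True and pairwise=False can be sketched with plain Python. This is a minimal illustration of which factor columns are grouped together, assuming three factors; the helper interaction_groups is hypothetical and does not reproduce how H2O builds the interaction levels.

```python
from itertools import combinations


def interaction_groups(factors, pairwise):
    """List the groups of factor columns that are combined into interaction features."""
    if pairwise:
        # One interaction per unordered pair of factors.
        return list(combinations(factors, 2))
    # A single higher-order interaction over all factors.
    return [tuple(factors)]


print(interaction_groups([4, 5, 6], pairwise=True))
print(interaction_groups([4, 5, 6], pairwise=False))
```

With only two factors both settings yield the same single pair, which is why pairwise is only applicable for 3 or more factors.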

Returns

H2OFrame

Examples

>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
>>> iris = iris.cbind(iris[4] == "Iris-setosa")
>>> iris[5] = iris[5].asfactor()
>>> iris.set_name(5,"C6")
>>> iris = iris.cbind(iris[4] == "Iris-virginica")
>>> iris[6] = iris[6].asfactor()
>>> iris.set_name(6, name="C7")
>>> two_way_interactions = h2o.interaction(iris,
...                                        factors=[4,5,6],
...                                        pairwise=True,
...                                        max_factors=10000,
...                                        min_occurrence=1)
>>> from h2o.frame import H2OFrame
>>> from h2o.utils.typechecks import assert_is_type
>>> assert_is_type(two_way_interactions, H2OFrame)
>>> levels1 = two_way_interactions.levels()[0]
>>> levels2 = two_way_interactions.levels()[1]
>>> levels3 = two_way_interactions.levels()[2]
>>> two_way_interactions

h2o.is_expr_optimizations_enabled()[source]¶

Examples

>>> h2o.enable_expr_optimizations(True)
>>> h2o.is_expr_optimizations_enabled()
>>> h2o.enable_expr_optimizations(False)
>>> h2o.is_expr_optimizations_enabled()

h2o.lazy_import(path, pattern=None)[source]¶

Import a single file or collection of files.

Parameters

path – A path to a data file (remote or local).
pattern – Character string containing a regular expression to match file(s) in the folder.

Returns

either an H2OFrame with the content of the provided file, or a list of such frames if importing multiple files.

Examples

>>> iris = h2o.lazy_import("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")

h2o.list_timezones()[source]¶

Deprecated, use h2o.cluster().list_timezones().

h2o.load_dataset(relative_path)[source]¶

Imports a data file within the ‘h2o_data’ folder.

Examples

>>> fr = h2o.load_dataset("iris")

h2o.load_frame(frame_id, path, force=True)[source]¶

Load frame previously stored in H2O’s native format.

This will load a data frame from a file-system location. Stored data can be loaded only with a cluster of the same size and same version as the one which wrote the data. The provided directory must be accessible from all nodes (HDFS, NFS). The provided frame_id must be the same as the one used when writing the data.

Parameters

frame_id – the frame ID of the original frame
path – a filesystem location where to look for frame data
force – overwrite an already existing frame (defaults to true)

Returns

A Frame object.

Examples

>>> iris = h2o.load_frame("iris_weather.hex", "hdfs://namenode/h2o_data")

h2o.load_grid(grid_file_path, load_params_references=False)[source]¶

Loads a previously saved grid with all its models from the same folder.

Parameters

grid_file_path – A string containing the path to the file with grid saved. Grid models are expected to be in the same folder.
load_params_references – If True, will attempt to reload saved objects referenced by grid parameters (e.g. training frame, calibration frame); will fail if the grid was saved without the referenced objects.

Returns

An instance of H2OGridSearch

Examples

>>> from collections import OrderedDict
>>> from h2o.grid.grid_search import H2OGridSearch
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> train = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
>>> # Run GBM Grid Search
>>> ntrees_opts = [1, 3]
>>> learn_rate_opts = [0.1, 0.01, .05]
>>> hyper_parameters = OrderedDict()
>>> hyper_parameters["learn_rate"] = learn_rate_opts
>>> hyper_parameters["ntrees"] = ntrees_opts
>>> import tempfile
>>> export_dir = tempfile.mkdtemp()
>>> gs = H2OGridSearch(H2OGradientBoostingEstimator, hyper_params=hyper_parameters)
>>> gs.train(x=list(range(4)), y=4, training_frame=train)
>>> grid_id = gs.grid_id
>>> old_grid_model_count = len(gs.model_ids)
>>> # Save the grid search to the export directory
>>> saved_path = h2o.save_grid(export_dir, grid_id)
>>> h2o.remove_all()
>>> train = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
>>> # Load the grid search
>>> grid = h2o.load_grid(saved_path)
>>> grid.train(x=list(range(4)), y=4, training_frame=train)

h2o.load_model(path)[source]¶

Load a saved H2O model from disk. (Note that ensemble binary models can now be loaded using this method.)

Parameters

path – the full path of the H2O Model to be imported.

Returns

an H2OEstimator object

Examples

>>> from h2o.estimators import H2OGeneralizedLinearEstimator
>>> training_data = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> model = H2OGeneralizedLinearEstimator(family="binomial",
...                                       alpha=0,
...                                       Lambda=1e-5)
>>> model.train(x=predictors,
...             y=response,
...             training_frame=training_data)
>>> model_path = h2o.save_model(model, path='', force=True)
>>> h2o.load_model(model_path)

h2o.log_and_echo(message='')[source]¶

Log a message on the server-side logs.

This is helpful when running several pieces of work one after the other on a single H2O cluster and you want to make a notation in the H2O server side log where one piece of work ends and the next piece of work begins.

Sends a message to H2O for logging. Generally used for debugging purposes.

Parameters

message – message to write to the log.

Examples

>>> ret = h2o.log_and_echo("Testing h2o.log_and_echo")

h2o.ls()[source]¶

List keys on an H2O Cluster.

Examples

>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
>>> h2o.ls()

h2o.make_leaderboard(object, leaderboard_frame=None, sort_metric='AUTO', extra_columns=[], scoring_data='AUTO')[source]¶

Create a leaderboard from a list of models, grids and/or automls.

Parameters

object – List of models, automls, or grids; or just single automl/grid object.
leaderboard_frame – Frame used for generating the metrics (optional).
sort_metric – Metric used for sorting the leaderboard.
extra_columns – What extra columns should be calculated (might require leaderboard_frame). Use “ALL” for all available or list of extra columns.
scoring_data – Metrics to be reported in the leaderboard (“xval”, “train”, or “valid”). Used if no leaderboard_frame is provided.

Returns

H2OFrame

Examples

>>> import h2o
>>> from h2o.grid.grid_search import H2OGridSearch
>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> h2o.init()
>>> training_data = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/logreg/benign.csv")
>>> hyper_parameters = {'alpha': [0.01,0.5],
...                     'lambda': [1e-5,1e-6]}
>>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'),
...                    hyper_parameters)
>>> gs.train(y=3, training_frame=training_data)
>>> h2o.make_leaderboard(gs, training_data)

h2o.make_metrics(predicted, actual, domain=None, distribution=None, weights=None, treatment=None, auc_type='NONE', auuc_type='AUTO', auuc_nbins=-1, custom_auuc_thresholds=None)[source]¶

Create Model Metrics from predicted and actual values in H2O.

Parameters

predicted (H2OFrame) – an H2OFrame containing predictions.
actual (H2OFrame) – an H2OFrame containing actual values.
domain – list of response factors for classification.
distribution – distribution for regression.
weights (H2OFrame) – an H2OFrame containing observation weights (optional).
treatment (H2OFrame) – an H2OFrame containing treatment information for uplift binomial classification only.
auc_type –
For multinomial classification you have to specify which type of aggregated AUC/AUCPR will be used to calculate this metric. Possibilities are:
- MACRO_OVO
- MACRO_OVR
- WEIGHTED_OVO
- WEIGHTED_OVR
- NONE
- AUTO
(OVO = One vs. One, OVR = One vs. Rest). Default is “NONE” (AUC and AUCPR are not calculated).
auuc_type –
For uplift binomial classification you have to specify which type of AUUC will be used to calculate this metric. Choose from:
- qini
- lift
- gain
- AUTO (default, uses qini)
auuc_nbins – For uplift binomial classification you have to specify the number of bins to be used for calculating the AUUC. Default is -1, which means 1000.
custom_auuc_thresholds – For uplift binomial classification you can specify exact thresholds to calculate AUUC. Default is None. If the thresholds are not defined, auuc_nbins will be used to calculate the new thresholds from the predicted data.
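
The MACRO and WEIGHTED variants of auc_type differ in how per-class AUC values are averaged. The sketch below assumes the usual definitions (macro: unweighted mean; weighted: mean weighted by class frequency) and uses made-up per-class values purely for illustration; it is not H2O's implementation.

```python
def macro_average(per_class_auc):
    """Unweighted mean of the per-class AUC values."""
    return sum(per_class_auc.values()) / len(per_class_auc)


def weighted_average(per_class_auc, class_counts):
    """Mean of the per-class AUC values weighted by class frequency."""
    total = sum(class_counts[c] for c in per_class_auc)
    return sum(per_class_auc[c] * class_counts[c] for c in per_class_auc) / total


# Made-up one-vs-rest AUC values and class counts, for illustration only.
aucs = {"setosa": 1.0, "versicolor": 0.5, "virginica": 0.75}
counts = {"setosa": 50, "versicolor": 25, "virginica": 25}

print(macro_average(aucs))
print(weighted_average(aucs, counts))
```

When classes are imbalanced, the two averages can differ noticeably, which is why the aggregation type has to be chosen explicitly.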

Examples

>>> fr = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> fr["CAPSULE"] = fr["CAPSULE"].asfactor()
>>> fr["RACE"] = fr["RACE"].asfactor()
>>> response = "AGE"
>>> predictors = list(set(fr.names) - {"ID", response})
>>> for distr in ["gaussian", "poisson", "laplace", "gamma"]:
...     print("distribution: %s" % distr)
...     model = H2OGradientBoostingEstimator(distribution=distr,
...                                          ntrees=2,
...                                          max_depth=3,
...                                          min_rows=1,
...                                          learn_rate=0.1,
...                                          nbins=20)
...     model.train(x=predictors,
...                 y=response,
...                 training_frame=fr)
...     predicted = h2o.assign(model.predict(fr), "pred")
...     actual = fr[response]
...     m0 = model.model_performance(train=True)
...     m1 = h2o.make_metrics(predicted, actual, distribution=distr)
...     m2 = h2o.make_metrics(predicted, actual)
>>> print(m0)
>>> print(m1)
>>> print(m2)

h2o.model_correlation_heatmap(models, frame, top_n=None, cluster_models=True, triangular=True, figsize=(13, 13), colormap='RdYlBu_r', save_plot_path=None)[source]¶

Model Prediction Correlation Heatmap

This plot shows the correlation between the predictions of the models. For classification, frequency of identical predictions is used. By default, models are ordered by their similarity (as computed by hierarchical clustering).
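
For classification, the "frequency of identical predictions" can be sketched as the fraction of rows on which two models predict the same label. The example below is a minimal stdlib illustration with made-up predictions and hypothetical model names; it does not call H2O and does not perform the clustering step.

```python
def agreement_frequency(preds_a, preds_b):
    """Fraction of rows on which two classifiers predict the same label."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    same = sum(1 for a, b in zip(preds_a, preds_b) if a == b)
    return same / len(preds_a)


def similarity_matrix(predictions):
    """Pairwise agreement for a dict of {model_name: list_of_predicted_labels}."""
    names = sorted(predictions)
    return {(m, n): agreement_frequency(predictions[m], predictions[n])
            for m in names for n in names}


# Made-up predictions from two hypothetical models.
preds = {
    "gbm": ["red", "white", "white", "red"],
    "glm": ["red", "white", "red", "red"],
}
matrix = similarity_matrix(preds)
print(matrix[("gbm", "glm")])
```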

Parameters

models – a list of H2O models, an H2O AutoML instance, or an H2OFrame with a ‘model_id’ column (e.g. H2OAutoML leaderboard)
frame – H2OFrame
top_n – DEPRECATED. show just top n models (applies only when used with H2OAutoML).
cluster_models – if True, cluster the models
triangular – make the heatmap triangular
figsize – figure size; passed directly to matplotlib
colormap – colormap to use
save_plot_path – a path to save the plot to, using matplotlib’s savefig function

Returns

object that contains the resulting figure (can be accessed using result.figure())

Examples

>>> import h2o
>>> from h2o.automl import H2OAutoML
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train an H2OAutoML
>>> aml = H2OAutoML(max_models=10)
>>> aml.train(y=response, training_frame=train)
>>>
>>> # Create the model correlation heatmap
>>> aml.model_correlation_heatmap(test)

h2o.models()[source]¶

Retrieve the IDs of all the models.

Returns

Handles of all the models present in the cluster

Examples

>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"]= airlines["Year"].asfactor()
>>> airlines["Month"]= airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> from h2o.estimators import H2OGeneralizedLinearEstimator, H2OXGBoostEstimator
>>> response = "IsDepDelayed"
>>> model1 = H2OGeneralizedLinearEstimator(family="binomial")
>>> model1.train(y=response, training_frame=airlines)
>>> model2 = H2OXGBoostEstimator(distribution="bernoulli")
>>> model2.train(y=response, training_frame=airlines)
>>> model_list = h2o.models()

h2o.mojo_predict_csv(input_csv_path, mojo_zip_path, output_csv_path=None, genmodel_jar_path=None, classpath=None, java_options=None, verbose=False, setInvNumNA=False, predict_contributions=False, predict_calibrated=False, extra_cmd_args=None)[source]¶

MOJO scoring function that scores a CSV file using a MOJO model stored as a zip file.

Parameters

input_csv_path – Path to input CSV file.
mojo_zip_path – Path to MOJO zip downloaded from H2O.
output_csv_path – Optional, name of the output CSV file with computed predictions. If None (default), then predictions will be saved as prediction.csv in the same folder as the MOJO zip.
genmodel_jar_path – Optional, path to genmodel jar file. If None (default) then the h2o-genmodel.jar in the same folder as the MOJO zip will be used.
classpath – Optional, specifies custom user defined classpath which will be used when scoring. If None (default) then the default classpath for this MOJO model will be used.
java_options – Optional, custom user defined options for Java. By default -Xmx4g -XX:ReservedCodeCacheSize=256m is used.
verbose – Optional, if True, then additional debug information will be printed. False by default.
predict_contributions – if True, then return prediction contributions instead of regular predictions (only for tree-based models).
predict_calibrated – if true, then return calibrated probabilities in addition to the predicted probabilities.
extra_cmd_args – Optional, a list of additional arguments to append to genmodel.jar’s command line.
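
The documented defaults for output_csv_path and genmodel_jar_path both resolve to the folder that contains the MOJO zip. A minimal sketch of that path logic, assuming a hypothetical MOJO location; the helper names are illustrative and not part of h2o.

```python
import os


def default_output_csv(mojo_zip_path):
    """Location of prediction.csv when output_csv_path is None."""
    return os.path.join(os.path.dirname(mojo_zip_path), "prediction.csv")


def default_genmodel_jar(mojo_zip_path):
    """Location of h2o-genmodel.jar when genmodel_jar_path is None."""
    return os.path.join(os.path.dirname(mojo_zip_path), "h2o-genmodel.jar")


# Hypothetical MOJO location, used only for illustration.
mojo = os.path.join("models", "gbm", "model.zip")
print(default_output_csv(mojo))
print(default_genmodel_jar(mojo))
```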

Returns

List of computed predictions

h2o.mojo_predict_pandas(dataframe, mojo_zip_path, genmodel_jar_path=None, classpath=None, java_options=None, verbose=False, setInvNumNA=False, predict_contributions=False, predict_calibrated=False)[source]¶

MOJO scoring function that scores a Pandas frame using a MOJO model stored as a zip file.

Parameters

dataframe – Pandas frame to score.
mojo_zip_path – Path to MOJO zip downloaded from H2O.
genmodel_jar_path – Optional, path to genmodel jar file. If None (default) then the h2o-genmodel.jar in the same folder as the MOJO zip will be used.
classpath – Optional, specifies custom user defined classpath which will be used when scoring. If None (default) then the default classpath for this MOJO model will be used.
java_options – Optional, custom user defined options for Java. By default -Xmx4g is used.
verbose – Optional, if True, then additional debug information will be printed. False by default.
predict_contributions – if True, then return prediction contributions instead of regular predictions (only for tree-based models).
predict_calibrated – if true, then return calibrated probabilities in addition to the predicted probabilities.

Returns

Pandas frame with predictions

h2o.network_test()[source]¶

Deprecated, use h2o.cluster().network_test().

h2o.no_progress()[source]¶

Disable the progress bar from flushing to stdout.

The completed progress bar is printed when a job is complete so as to demarcate a log file.

Examples

>>> h2o.no_progress()
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> x = ["DayofMonth", "Month"]
>>> model = H2OGeneralizedLinearEstimator(family="binomial",
...                                       alpha=0,
...                                       Lambda=1e-5)
>>> model.train(x=x, y="IsDepDelayed", training_frame=airlines)  

h2o.parse_raw(setup, id=None, first_line_is_header=0)[source]¶

Parse dataset using the parse setup structure.

Parameters

setup – Result of h2o.parse_setup()
id – an id for the frame.
first_line_is_header – -1 means the first line is data, 0 means guess, 1 means the first line is a header.

Returns

an H2OFrame object.

Examples

>>> fraw = h2o.import_file(("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip"),
...                         parse=False)
>>> fhex = h2o.parse_raw(h2o.parse_setup(fraw),
...                      id='prostate.csv',
...                      first_line_is_header=0)
>>> fhex.summary()

h2o.parse_setup(raw_frames, destination_frame=None, header=0, separator=None, column_names=None, column_types=None, na_strings=None, skipped_columns=None, force_col_types=False, custom_non_data_line_markers=None, partition_by=None, quotechar=None, escapechar=None, tz_adjust_to_local=False)[source]¶

Retrieve H2O’s best guess as to what the structure of the data file is.

During parse setup, the H2O cluster will make several guesses about the attributes of the data. This method allows a user to perform corrective measures by updating the returning dictionary from this method. This dictionary is then fed into parse_raw to produce the H2OFrame instance.

Parameters

raw_frames – a collection of imported file frames
destination_frame – The unique hex key assigned to the imported file. If none is given, a key will automatically be generated.
header – -1 means the first line is data, 0 means guess, 1 means first line is header.
separator – The field separator character. Values on each line of the file are separated by this character. If not provided, the parser will automatically detect the separator.
column_names – A list of column names for the file. If skipped_columns are specified, only list column names of columns that are not skipped.
column_types –
A list of types or a dictionary of column names to types to specify whether columns should be forced to a certain type upon import parsing. If a list, the types for elements that are None will be guessed. If skipped_columns are specified, only list column types of columns that are not skipped. The possible types a column may have are:
- “unknown” - this will force the column to be parsed as all NA
- “uuid” - the values in the column must be true UUID or will be parsed as NA
- “string” - force the column to be parsed as a string
- “numeric” - force the column to be parsed as numeric. H2O will handle the compression of the numeric data in the optimal manner.
- “enum” - force the column to be parsed as a categorical column.
- “time” - force the column to be parsed as a time column. H2O will attempt to parse the following list of date time formats:
  - “yyyy-MM-dd” (date),
  - “yyyy MM dd” (date),
  - “dd-MMM-yy” (date),
  - “dd MMM yy” (date),
  - “HH:mm:ss” (time),
  - “HH:mm:ss:SSS” (time),
  - “HH:mm:ss:SSSnnnnnn” (time),
  - “HH.mm.ss” (time),
  - “HH.mm.ss.SSS” (time),
  - “HH.mm.ss.SSSnnnnnn” (time).
  Times can also contain “AM” or “PM”.
na_strings – A list of strings, or a list of lists of strings (one list per column), or a dictionary of column names to strings which are to be interpreted as missing values.
skipped_columns – a list of integer column indices to skip; these columns are not parsed into the final frame from the import file.
force_col_types – If True, will force the column types to be either the ones in Parquet schema for Parquet files or the ones specified in column_types. This parameter is used for numerical columns only. Other column settings will happen without setting this parameter. Defaults to False.
custom_non_data_line_markers – If a line in imported file starts with any character in given string it will NOT be imported. Empty string means all lines are imported, None means that default behaviour for given format will be used
partition_by – A list of columns the dataset has been partitioned by. None by default.
quotechar – A hint for the parser which character to expect as quoting character. Only single quote, double quote or None (default) are allowed. None means automatic detection.
escapechar – (Optional) One ASCII character used to escape other characters.
tz_adjust_to_local – (Optional) Adjust the imported data timezone from GMT to cluster timezone.
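
When skipped_columns is given, column_names and column_types should list only the columns that are kept. A minimal sketch of deriving those shortened lists, assuming zero-based indices in skipped_columns; the helper drop_skipped is hypothetical and not part of h2o.

```python
def drop_skipped(values, skipped_columns):
    """Keep only the entries whose column index is not skipped."""
    skipped = set(skipped_columns)
    return [v for i, v in enumerate(values) if i not in skipped]


all_names = ["ID", "CAPSULE", "AGE", "RACE"]
all_types = ["enum", "enum", "numeric", "enum"]
skipped = [0, 3]  # assumed to be zero-based column indices

column_names = drop_skipped(all_names, skipped)
column_types = drop_skipped(all_types, skipped)
print(column_names, column_types)
```

The two shortened lists stay aligned with each other, so each remaining name still pairs with its intended type.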

Returns

a dictionary containing parse parameters guessed by the H2O backend.

Examples

>>> col_headers = ["ID","CAPSULE","AGE","RACE",
...                "DPROS","DCAPS","PSA","VOL","GLEASON"]
>>> col_types=['enum','enum','numeric','enum',
...            'enum','enum','numeric','numeric','numeric']
>>> hex_key = "training_data.hex"
>>> fraw = h2o.import_file(("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip"),
...                         parse=False)
>>> setup = h2o.parse_setup(fraw,
...                         destination_frame=hex_key,
...                         header=1,
...                         separator=',',
...                         column_names=col_headers,
...                         column_types=col_types,
...                         na_strings=["NA"])
>>> setup

h2o.pd_multi_plot(models, frame, column, best_of_family=True, row_index=None, target=None, max_levels=30, figsize=(16, 9), colormap='Dark2', markers=['o', 'v', 's', 'P', '*', 'D', 'X', '^', '<', '>', '.'], save_plot_path=None, show_rug=True)[source]¶

Plot partial dependencies of a variable across multiple models.

The partial dependence plot (PDP) provides a graph of the marginal effect of a variable on the response. The effect of a variable is measured by the change in the mean response. The PDP assumes independence between the feature for which the PDP is computed and the rest.
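
The computation behind a partial dependence curve can be sketched without H2O: for each grid value, the column of interest is overwritten in every row and the predictions are averaged. The example below is a minimal illustration, assuming a hypothetical linear function as a stand-in for a trained model and made-up rows.

```python
def partial_dependence(predict, rows, column, grid):
    """Mean prediction when `column` is forced to each grid value in every row."""
    result = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)      # copy so the original data is untouched
            modified[column] = value  # override only the feature of interest
            total += predict(modified)
        result.append((value, total / len(rows)))
    return result


# Hypothetical stand-in for a trained model: a fixed linear function.
def toy_model(row):
    return 2.0 * row["alcohol"] + row["sulphates"]


# Made-up rows, for illustration only.
data = [
    {"alcohol": 9.0, "sulphates": 0.5},
    {"alcohol": 11.0, "sulphates": 1.5},
]
pdp = partial_dependence(toy_model, data, "alcohol", grid=[10.0, 12.0])
print(pdp)
```

Passing a single row instead of the whole dataset gives the individual conditional expectation curve that row_index selects.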

Parameters

models – a list of H2O models, an H2O AutoML instance, or an H2OFrame with a ‘model_id’ column (e.g. H2OAutoML leaderboard)
frame – H2OFrame
column – string containing column name
best_of_family – if True, show only the best models per family
row_index – if None, do partial dependence, if integer, do individual conditional expectation for the row specified by this integer
target – (only for multinomial classification) for what target should the plot be done
max_levels – maximum number of factor levels to show
figsize – figure size; passed directly to matplotlib
colormap – colormap name
markers – List of markers to use for factors, when it runs out of possible markers the last in this list will get reused
save_plot_path – a path to save the plot to, using matplotlib’s savefig function
show_rug – Show rug to visualize the density of the column

Returns

object that contains the resulting matplotlib figure (can be accessed using result.figure()).

Examples

>>> import h2o
>>> from h2o.automl import H2OAutoML
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train an H2OAutoML
>>> aml = H2OAutoML(max_models=10)
>>> aml.train(y=response, training_frame=train)
>>>
>>> # Create a partial dependence plot
>>> aml.pd_multi_plot(test, column="alcohol")

h2o.print_mojo(mojo_path, format='json', tree_index=None)[source]¶

Generates string representation of an existing MOJO model.

Parameters

mojo_path – Path to the MOJO archive on the user’s local filesystem
format – Output format. Possible values: json (default), dot, png
tree_index – Index of tree to print

Returns

A string representation of the MOJO for text output formats, a path to a directory with the rendered images for image output formats (or a path to a file if only a single tree is outputted)

Example

>>> import json
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")
>>> prostate["CAPSULE"] = prostate["CAPSULE"].asfactor()
>>> gbm_h2o = H2OGradientBoostingEstimator(ntrees = 5,
...                                        learn_rate = 0.1,
...                                        max_depth = 4,
...                                        min_rows = 10)
>>> gbm_h2o.train(x = list(range(1,prostate.ncol)),
...               y = "CAPSULE",
...               training_frame = prostate)
>>> mojo_path = gbm_h2o.download_mojo()
>>> mojo_str = h2o.print_mojo(mojo_path)
>>> mojo_dict = json.loads(mojo_str)

h2o.rapids(expr)[source]¶

Execute a Rapids expression.

Parameters: expr – The rapids expression (ascii string).
Returns: The JSON response (as a python dictionary) of the Rapids execution.
Examples

>>> rapidTime = h2o.rapids("(getTimeZone)")["string"]
>>> print(str(rapidTime))

h2o.remove(x, cascade=True)[source]¶

Remove object(s) from H2O.

Parameters

x – H2OFrame, H2OEstimator, or string, or a list of those things: the object(s) or unique id(s) pointing to the object(s) to be removed.
cascade – boolean, if set to True (default), the object dependencies (e.g. submodels) are also removed.

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> h2o.remove(airlines)
>>> airlines
# Should receive error: "This H2OFrame has been removed."

h2o.remove_all(retained=None)[source]¶

Removes all objects from H2O, with the option to specify models and frames to retain. Retained keys must be keys of models and frames only. For retained models, the training and validation frames are retained as well. Cross-validation models of a retained model are NOT retained automatically; they must be specified explicitly.

Parameters: retained – Keys of models and frames to retain
Examples

>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> gbm = H2OGradientBoostingEstimator(ntrees = 1)
>>> gbm.train(x = ["Origin", "Dest"],
...           y = "IsDepDelayed",
...           training_frame=airlines)
>>> h2o.remove_all([airlines.frame_id,
...                 gbm.model_id])
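
Because cross-validation models are not retained automatically, it can help to build the retained list programmatically. The following is a minimal sketch rather than part of the h2o package: `retained_keys` is a hypothetical helper, and it assumes the models passed in expose `model_id` and `cross_validation_models()` and the frames expose `frame_id`.

```python
def retained_keys(models=(), frames=()):
    """Collect keys to pass to h2o.remove_all(retained=...).

    Hypothetical helper, not part of the h2o package. It relies only on the
    attributes `model_id` and `frame_id` and on the method
    `cross_validation_models()` of the objects it is given.
    """
    keys = [frame.frame_id for frame in frames]
    for model in models:
        keys.append(model.model_id)
        # Cross-validation models are not retained automatically, so their
        # ids are listed explicitly. The call is assumed to return None (or
        # an empty list) for models trained without cross-validation.
        for cv_model in model.cross_validation_models() or []:
            keys.append(cv_model.model_id)
    return keys
```

With the objects from the example above, the call would read `h2o.remove_all(retained_keys(models=[gbm], frames=[airlines]))`.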

h2o.resume(recovery_dir=None)[source]¶

Triggers auto-recovery resume: looks into the configured recovery directory and resumes any tasks that were interrupted by an unexpected cluster stop.

Parameters: recovery_dir – A path to where cluster recovery data is stored; if blank, the cluster’s configuration is used.

h2o.save_grid(grid_directory, grid_id, save_params_references=False, export_cross_validation_predictions=False)[source]¶

Export a Grid and all of its models into the given folder.

Parameters

grid_directory – A string containing the path to the folder for the grid to be saved to.
grid_id – A character string with identification of the Grid in H2O.
save_params_references – True if objects referenced by grid parameters (e.g. training frame, calibration frame) should also be saved.
export_cross_validation_predictions – A boolean flag indicating whether the models exported from the grid should be saved with CV Holdout Frame predictions. Default is not to export the predictions.

Examples

>>> import tempfile
>>> from collections import OrderedDict
>>> from h2o.grid.grid_search import H2OGridSearch
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> train = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
>>> # Run GBM Grid Search
>>> ntrees_opts = [1, 3]
>>> learn_rate_opts = [0.1, 0.01, .05]
>>> hyper_parameters = OrderedDict()
>>> hyper_parameters["learn_rate"] = learn_rate_opts
>>> hyper_parameters["ntrees"] = ntrees_opts
>>> # Any directory H2O can write to works; a temporary one is used here (assumes a local cluster)
>>> export_dir = tempfile.mkdtemp()
>>> gs = H2OGridSearch(H2OGradientBoostingEstimator, hyper_params=hyper_parameters)
>>> gs.train(x=list(range(4)), y=4, training_frame=train)
>>> grid_id = gs.grid_id
>>> old_grid_model_count = len(gs.model_ids)
>>> # Save the grid search to the export directory
>>> saved_path = h2o.save_grid(export_dir, grid_id)
>>> h2o.remove_all()
>>> train = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
>>> # Load the grid search
>>> grid = h2o.load_grid(saved_path)
>>> grid.train(x=list(range(4)), y=4, training_frame=train)

h2o.save_model(model, path='', force=False, export_cross_validation_predictions=False, filename=None)[source]¶

Save an H2O Model object to disk. (Note that ensemble binary models can now be saved using this method.) The saved file is owned by the user who started the H2O cluster.

Parameters

model – The model object to save.
path – a path to save the model at (hdfs, s3, local)
force – if True, overwrite the destination directory if it exists; if False, throw an exception instead.
export_cross_validation_predictions – logical, indicates whether the exported model artifact should also include CV Holdout Frame predictions. Default is not to export the predictions.
filename – a filename for the saved model

Returns

the path of the saved model

Examples

>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> h2o_df = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> my_model = H2OGeneralizedLinearEstimator(family = "binomial")
>>> my_model.train(y = "CAPSULE",
...                x = ["AGE", "RACE", "PSA", "GLEASON"],
...                training_frame = h2o_df)
>>> h2o.save_model(my_model, path='', force=True)
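
Since save_model returns the path of the saved model, that path can be handed straight to h2o.load_model to restore the model later. The sketch below is illustrative only: `save_and_reload` is a hypothetical helper, and it assumes the target directory is reachable by the H2O cluster.

```python
def save_and_reload(model, directory, filename=None):
    """Save `model` under `directory` and load it back from the returned path.

    Hypothetical helper, not part of the h2o package. h2o is imported lazily
    so the function can be defined before a cluster connection exists.
    """
    import h2o

    # force=True overwrites an existing destination instead of raising.
    saved_path = h2o.save_model(model, path=directory, force=True, filename=filename)
    # save_model returns the path of the saved model; load_model reads it back.
    return saved_path, h2o.load_model(saved_path)
```

Passing `filename` gives the artifact a predictable name, which is convenient when the path is stored elsewhere for later loading.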

h2o.set_timezone(value)[source]¶

Deprecated, set h2o.cluster().timezone instead.


h2o.show_progress()[source]¶

Enable the progress bar (it is enabled by default).

Examples

>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> h2o.no_progress()
>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> x = ["DayofMonth", "Month"]
>>> model = H2OGeneralizedLinearEstimator(family="binomial",
...                                       alpha=0,
...                                       Lambda=1e-5)
>>> model.train(x=x, y="IsDepDelayed", training_frame=airlines)
>>> h2o.show_progress()
>>> model.train(x=x, y="IsDepDelayed", training_frame=airlines)
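
When progress bars should be silenced only for a short stretch of code, the no_progress() / show_progress() pair can be wrapped in a context manager. This is a sketch, not an h2o feature: `progress_disabled` is a hypothetical name, and the helper assumes the bars were enabled beforehand (the default), since it does not inspect the previous state.

```python
from contextlib import contextmanager


@contextmanager
def progress_disabled():
    """Temporarily silence H2O progress bars (hypothetical helper).

    Assumes progress bars were enabled before entering the block.
    """
    import h2o  # imported lazily; the h2o package is needed only at call time

    h2o.no_progress()
    try:
        yield
    finally:
        # Re-enable the bars even if the wrapped code raises.
        h2o.show_progress()
```

Used as `with progress_disabled(): model.train(x=x, y="IsDepDelayed", training_frame=airlines)`, the training call runs quietly and later calls show progress again.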

h2o.shutdown(prompt=False)[source]¶

Deprecated, use h2o.cluster().shutdown().


h2o.upload_custom_metric(func, func_file='metrics.py', func_name=None, class_name=None, source_provider=None)[source]¶

Upload the given metric function into the H2O cluster.

The metric can be supplied in one of two representations:

class: needs to implement map(pred, act, weight, offset, model), reduce(l, r) and metric(l) methods
string: the same as the class case, but the class source is given as a string

Parameters

func – metric representation: string, class
func_file – internal name of file to save given metrics representation
func_name – name for h2o key under which the given metric is saved
class_name – name of class wrapping the metrics function (when supplied as string)
source_provider – a function which provides a source code for given function

Returns

reference to uploaded metrics function

Examples

>>> class CustomMaeFunc:
...     def map(self, pred, act, w, o, model):
...         return [abs(act[0] - pred[0]), 1]
...
...     def reduce(self, l, r):
...         return [l[0] + r[0], l[1] + r[1]]
...
...     def metric(self, l):
...         return l[0] / l[1]
...
>>> custom_func_str = '''class CustomMaeFunc:
...     def map(self, pred, act, w, o, model):
...         return [abs(act[0] - pred[0]), 1]
...
...     def reduce(self, l, r):
...         return [l[0] + r[0], l[1] + r[1]]
...
...     def metric(self, l):
...         return l[0] / l[1]'''
>>>
>>> h2o.upload_custom_metric(custom_func_str, class_name="CustomMaeFunc", func_name="mae")
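
Before uploading, the map / reduce / metric contract can be checked locally on a few rows. The sketch below is an illustration of how the three methods fit together, not a description of how the cluster schedules them: `evaluate_locally` is a hypothetical helper, each row is passed as single-element lists, and the weight, offset and model arguments are placeholders.

```python
from functools import reduce


class CustomMaeFunc:
    def map(self, pred, act, w, o, model):
        # Per-row partial result: [absolute error, row count].
        return [abs(act[0] - pred[0]), 1]

    def reduce(self, l, r):
        # Combine two partial results element-wise.
        return [l[0] + r[0], l[1] + r[1]]

    def metric(self, l):
        # Final value: summed error divided by the number of rows.
        return l[0] / l[1]


def evaluate_locally(func, predictions, actuals):
    """Drive map -> reduce -> metric over in-memory rows (illustration only).

    Weight, offset and model are passed as 1.0, 0.0 and None; these
    placeholder values are assumptions made for this sketch.
    """
    partials = [func.map([p], [a], 1.0, 0.0, None)
                for p, a in zip(predictions, actuals)]
    return func.metric(reduce(func.reduce, partials))


# Absolute errors are 0.5, 0.5 and 0.0, so the mean absolute error is 1/3.
mae = evaluate_locally(CustomMaeFunc(), [2.5, 0.0, 2.0], [3.0, -0.5, 2.0])
```

Keeping `reduce` associative, as the element-wise sum here is, means the result does not depend on the order in which partial results are combined.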

h2o.upload_file(path, destination_frame=None, header=0, sep=None, col_names=None, col_types=None, na_strings=None, skipped_columns=None, force_col_types=False, quotechar=None, escapechar=None)[source]¶

Upload a dataset from the provided local path to the H2O cluster.

Does a single-threaded push to H2O. Also see import_file().

Parameters

path – A path specifying the location of the data to upload.
destination_frame – The unique hex key assigned to the imported file. If none is given, a key will be automatically generated.
header – -1 means the first line is data, 0 means guess, 1 means first line is header.
sep – The field separator character. Values on each line of the file are separated by this character. If not provided, the parser will automatically detect the separator.
col_names – A list of column names for the file.
col_types –
A list of types or a dictionary of column names to types to specify whether columns should be forced to a certain type upon import parsing. If a list, the types for elements that are None will be guessed. The possible types a column may have are:
- "unknown" - this will force the column to be parsed as all NA
- "uuid" - the values in the column must be true UUID or will be parsed as NA
- "string" - force the column to be parsed as a string
- "numeric" - force the column to be parsed as numeric. H2O will handle the compression of the numeric data in the optimal manner.
- "enum" - force the column to be parsed as a categorical column.
- "time" - force the column to be parsed as a time column. H2O will attempt to parse the following list of date time formats:
  - "yyyy-MM-dd" (date),
  - "yyyy MM dd" (date),
  - "dd-MMM-yy" (date),
  - "dd MMM yy" (date),
  - "HH:mm:ss" (time),
  - "HH:mm:ss:SSS" (time),
  - "HH:mm:ss:SSSnnnnnn" (time),
  - "HH.mm.ss" (time),
  - "HH.mm.ss.SSS" (time),
  - "HH.mm.ss.SSSnnnnnn" (time).
  Times can also contain "AM" or "PM".
na_strings – A list of strings, or a list of lists of strings (one list per column), or a dictionary of column names to strings which are to be interpreted as missing values.
skipped_columns – a list of integer column indices to skip; these columns are not parsed into the final frame.
force_col_types – If True, forces the column types to be either the ones in the Parquet schema (for Parquet files) or the ones specified in col_types. This parameter applies to numerical columns only; other column type settings take effect without it. Defaults to False.
quotechar – A hint for the parser which character to expect as quoting character. Only single quote, double quote or None (default) are allowed. None means automatic detection.
escapechar – (Optional) One ASCII character used to escape other characters.

Returns

a new H2OFrame instance.

Examples

>>> iris_df = h2o.upload_file("~/Desktop/repos/h2o-3/smalldata/iris/iris.csv")
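
The parsing parameters are easiest to see on a small file. The sketch below writes a made-up CSV and prepares the options described above; the file name, column names and values are invented for illustration, and `upload_measurements` is a hypothetical wrapper that calls h2o.upload_file only when invoked.

```python
import csv
import os
import tempfile

# Write a tiny CSV to upload; all names and values here are made up.
csv_path = os.path.join(tempfile.mkdtemp(), "measurements.csv")
with open(csv_path, "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["id", "species", "measured_on", "length"])
    writer.writerow(["a1", "setosa", "2020-01-31", "5.1"])
    writer.writerow(["a2", "missing", "2020-02-01", "4.9"])

parse_options = {
    "header": 1,  # the first line is a header
    "sep": ",",
    # Dictionary form of col_types: maps column names to forced types.
    # "measured_on" uses the yyyy-MM-dd date format from the list above.
    "col_types": {"id": "string", "species": "enum", "measured_on": "time"},
    # List form of na_strings: strings to interpret as missing values.
    "na_strings": ["missing"],
}


def upload_measurements(path=csv_path, **overrides):
    """Upload the CSV with the options above (hypothetical wrapper)."""
    import h2o  # imported lazily; requires a connected cluster at call time

    return h2o.upload_file(path, **{**parse_options, **overrides})
```

Calling `upload_measurements(destination_frame="measurements.hex")` would add or override individual options without editing the shared dictionary.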

h2o.upload_model(path)[source]¶

Upload a binary model from the provided local path to the H2O cluster. (An H2O model can be saved in binary form by either the save_model() or the download_model() function.)

Parameters: path – A path on the machine this python session is currently connected to, specifying the location of the model to upload.
Returns: a new H2OEstimator object.
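
A typical pairing is download_model() followed by upload_model(), for example to carry a binary model from one cluster to another through the client machine. The following is a sketch under assumptions: `roundtrip_binary_model` is a hypothetical helper, and reconnecting to a different cluster between the two calls is not shown.

```python
import tempfile


def roundtrip_binary_model(model):
    """Download `model` in binary form to this machine, then upload it.

    Hypothetical helper, not part of the h2o package. h2o is imported lazily
    so the function can be defined without a cluster connection.
    """
    import h2o

    local_dir = tempfile.mkdtemp()
    # download_model writes the binary model locally and returns its path.
    local_path = h2o.download_model(model, path=local_dir)
    # upload_model pushes the local file to the connected cluster and
    # returns a new estimator object for it.
    return h2o.upload_model(local_path)
```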

h2o.upload_mojo(mojo_path, model_id=None)[source]¶

Uploads an existing MOJO model from the local filesystem into H2O and imports it as an H2O Generic Model.

Parameters

mojo_path – Path to the MOJO archive on the user’s local filesystem
model_id – Model ID, default None

Returns

An H2OGenericEstimator instance embedding given MOJO

Examples

>>> import tempfile
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> model = H2OGradientBoostingEstimator(ntrees = 1)
>>> model.train(x = ["Origin", "Dest"],
...             y = "IsDepDelayed",
...             training_frame=airlines)
>>> original_model_filename = tempfile.mkdtemp()
>>> original_model_filename = model.download_mojo(original_model_filename)
>>> mojo_model = h2o.upload_mojo(original_model_filename)

h2o.varimp_heatmap(models, top_n=None, num_of_features=20, figsize=(16, 9), cluster=True, colormap='RdYlBu_r', save_plot_path=None)[source]¶

Variable Importance Heatmap across a group of models

Variable importance heatmap shows variable importance across multiple models. Some models in H2O return variable importance for one-hot (binary indicator) encoded versions of categorical columns (e.g. Deep Learning, XGBoost). In order for the variable importance of categorical columns to be compared across all model types, we compute a summarization of the variable importance across all one-hot encoded features and return a single variable importance for the original categorical feature. By default, the models and variables are ordered by their similarity.

Parameters

models – a list of H2O models, an H2O AutoML instance, or an H2OFrame with a ‘model_id’ column (e.g. H2OAutoML leaderboard)
top_n – DEPRECATED. Use only the top n models (applies only when used with H2OAutoML)
num_of_features – limit the number of features to plot based on the maximum variable importance across the models. Use None for unlimited.
figsize – figure size; passed directly to matplotlib
cluster – if True, cluster the models and variables
colormap – colormap to use
save_plot_path – a path to save the plot to, using the matplotlib function savefig

Returns

object that contains the resulting figure (can be accessed using result.figure())

Examples

>>> import h2o
>>> from h2o.automl import H2OAutoML
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train an H2OAutoML
>>> aml = H2OAutoML(max_models=10)
>>> aml.train(y=response, training_frame=train)
>>>
>>> # Create the variable importance heatmap
>>> aml.varimp_heatmap()
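
The summarization of one-hot encoded importances described above can be pictured with a small pure-Python sketch. It rests on two assumptions that this page does not confirm: that encoded features are named "<column>.<level>", and that the summary is a plain sum. `collapse_one_hot` is a hypothetical helper, not an h2o function.

```python
from collections import defaultdict


def collapse_one_hot(varimp, categorical_columns, sep="."):
    """Sum importances of encoded features back onto their source column.

    Assumptions for this sketch: encoded features are named
    "<column><sep><level>" and the summary statistic is a plain sum.
    """
    collapsed = defaultdict(float)
    for feature, importance in varimp.items():
        column = feature  # features that are not encoded keep their own name
        for candidate in categorical_columns:
            if feature.startswith(candidate + sep):
                column = candidate
                break
        collapsed[column] += importance
    return dict(collapsed)


encoded = {"Origin.SFO": 0.25, "Origin.JFK": 0.125, "Distance": 0.625}
collapsed = collapse_one_hot(encoded, categorical_columns=["Origin"])
# collapsed == {"Origin": 0.375, "Distance": 0.625}
```

After collapsing, every model reports one value per original column, which is what makes a side-by-side heatmap across model types meaningful.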