I tried to install H2O in Python but ``pip install scikit-learn`` failed - what should I do?

Use the following commands (prepending with sudo if necessary):

easy_install pip
pip install numpy
brew install gcc
pip install scipy
pip install scikit-learn

If you are still encountering errors and you are using OS X, you may be running the default system version of Python. We recommend installing the Homebrew version of Python instead:

brew install python

If you are encountering errors related to missing Python packages when using H2O, refer to the following complete list of Python packages, including dependencies:

  • grip
  • tabulate
  • wheel
  • jsonlite
  • ipython
  • numpy
  • scipy
  • pandas
  • gensim
  • jupyter
  • PIL
  • nltk
  • beautifulsoup4
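
To see which of the packages listed above are missing from the current environment, a short check along these lines may help. It is a minimal sketch that assumes Python 3.8 or later (for importlib.metadata). The helper name missing_packages is hypothetical and not part of H2O, and the lookup uses pip distribution names, so PIL is checked as Pillow, which is an assumption about how it was installed.

```python
from importlib import metadata

# Distribution names as pip knows them. "Pillow" stands in for PIL here,
# which is an assumption about how the imaging library was installed.
REQUIRED = ["grip", "tabulate", "wheel", "jsonlite", "ipython", "numpy", "scipy",
            "pandas", "gensim", "jupyter", "Pillow", "nltk", "beautifulsoup4"]

def missing_packages(names):
    """Return the names that have no installed distribution (hypothetical helper)."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing

if __name__ == "__main__":
    for name in missing_packages(REQUIRED):
        print("missing:", name)
```

Any name that is reported as missing can then be installed with pip install.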

How do I specify a value as an enum in Python? Is there a Python equivalent of ``as.factor()`` in R?

Use .asfactor() on an H2OFrame column to convert it to an enum (categorical) column.

I received the following error when I tried to install H2O using the Python instructions on the downloads page - what should I do to resolve it?

  Downloading h2o- (43.1Mb): 43.1Mb downloaded
  Running egg_info for package from
    Traceback (most recent call last):
      File "<string>", line 14, in <module>
    IOError: [Errno 2] No such file or directory: '/tmp/pip-nTu3HK-build/'
    Complete output from command python egg_info:
    Traceback (most recent call last):
      File "<string>", line 14, in <module>
    IOError: [Errno 2] No such file or directory: '/tmp/pip-nTu3HK-build/'
  Command python egg_info failed with error code 1 in /tmp/pip-nTu3HK-build

Python does not update installed packages automatically, so you must upgrade them manually. Additionally, the package distribution method recently changed from distutils to wheel. Try the following procedure first if you are having trouble installing the H2O package, particularly if the error messages mention bdist_wheel or eggs.

# this gets the latest setuptools
# see
wget -O - | sudo python

# platform dependent ways of installing pip are at
# but the above should work on most linux platforms?

# on ubuntu
# if you already have some version of pip, you can skip this.
sudo apt-get install python-pip

# the package manager doesn't install the latest. upgrade to latest
# we're not using easy_install any more, so don't care about checking that
pip install pip --upgrade

# I've seen pip not install to the final version ..i.e. it goes to an almost
# final version first, then another upgrade gets it to the final version.
# We'll cover that, and also double check the install.

# after upgrading pip, the path name may change from /usr/bin to /usr/local/bin
# start a new shell, just to make sure you see any path changes


# Also: I like double checking that the install is bulletproof by reinstalling.
# Sometimes it seems like things say they are installed, but have errors during the install. Check for no errors or stack traces.

pip install pip --upgrade --force-reinstall

# distribute should be at the most recent now. Just in case
# don't do --force-reinstall here, it causes an issue.

pip install distribute --upgrade

# Now check the versions
pip list | egrep '(distribute|pip|setuptools)'
distribute (0.7.3)
pip (7.0.3)
setuptools (17.0)

# Re-install wheel
pip install wheel --upgrade --force-reinstall

After completing this procedure, go to Python and use h2o.init() to start H2O in Python.


  • If you use gradlew to build the jar yourself, you have to start the jar yourself before you call h2o.init().
  • If you download the jar and the H2O package, h2o.init() will work like R and you don’t have to start the jar yourself.

How should I specify the datatype during import in Python?

Refer to the following example:

#Let's say you want to change the second column "CAPSULE" of prostate.csv
#to categorical. You have 3 options.

#Option 1. Use a dictionary of column names to types.
fr = h2o.import_file("smalldata/logreg/prostate.csv", col_types = {"CAPSULE":"Enum"})

#Option 2. Use a list of column types.
c_types = [None]*9
c_types[1] = "Enum"
fr = h2o.import_file("smalldata/logreg/prostate.csv", col_types = c_types)

#Option 3. Use parse_setup().
fraw = h2o.import_file("smalldata/logreg/prostate.csv", parse = False)
fsetup = h2o.parse_setup(fraw)
fsetup["column_types"][1] = '"Enum"'
fr = h2o.parse_raw(fsetup)
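
For Option 2, each list index has to match the column position, which is easy to get wrong by hand. A small helper can build the list from column names instead. This is only a sketch: the helper name col_types_by_name is hypothetical, and the column names are assumed to be the header of prostate.csv (only CAPSULE, in second position, and the count of 9 columns come from the example above).

```python
def col_types_by_name(columns, overrides):
    """Build a positional col_types list; None is used as in Option 2 above."""
    unknown = set(overrides) - set(columns)
    if unknown:
        raise KeyError("unknown columns: %s" % sorted(unknown))
    return [overrides.get(name) for name in columns]

# Assumed header of prostate.csv (9 columns, CAPSULE second).
columns = ["ID", "CAPSULE", "AGE", "RACE", "DPROS", "DCAPS", "PSA", "VOL", "GLEASON"]
c_types = col_types_by_name(columns, {"CAPSULE": "Enum"})
print(c_types)
```

The resulting c_types list can be passed as col_types in the same way as in Option 2.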

How do I view a list of variable importances in Python?

Use model.varimp(return_list=True) as shown in the following example:

model = h2o.gbm(y = "IsDepDelayed", x = ["Month"], training_frame = df)
vi = model.varimp(return_list=True)
[(u'Month', 69.27436828613281, 1.0, 1.0)]
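
Each tuple in the returned list appears to hold the variable name followed by its relative importance, scaled importance, and percentage. That field order is an assumption based on H2O's variable importance table, not something stated above, so check it against your H2O version. Under that assumption, the list can be unpacked into labelled rows as sketched here (the sample value is copied from the output above):

```python
from collections import namedtuple

# Field order is an assumption; check it against your H2O version.
VarImp = namedtuple("VarImp", ["variable", "relative_importance",
                               "scaled_importance", "percentage"])

# Sample value copied from the output above; in practice use
# vi = model.varimp(return_list=True)
vi = [("Month", 69.27436828613281, 1.0, 1.0)]

# Sort by relative importance, largest first, and print one row per variable.
rows = sorted((VarImp(*t) for t in vi),
              key=lambda r: r.relative_importance, reverse=True)
for r in rows:
    print("%-10s %10.4f %6.1f%%" % (r.variable, r.relative_importance,
                                     100 * r.percentage))
```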

How can I get the H2O Python Client to work with third-party plotting libraries for plotting metrics outside of Flow?

In Flow, plots are created by the H2O UI using specific RESTful commands that the UI issues. You can obtain the same plotting data in Python and graph it with third-party libraries such as Pandas and Matplotlib. Every metric that H2O displays in Flow is calculated on the backend and stored with each model, so you can retrieve any metric from H2O and then use a Python plotting library to create the graphs.

The example below plots the training and validation logloss, using Pandas both to store the data and to generate the plot. Pandas offers a simplified but limited plotting API that is built on Matplotlib.

# import pandas and matplotlib
import pandas as pd
import matplotlib.pyplot as plt
# IPython/Jupyter magic: render plots inline in the notebook
%matplotlib inline

# get the scoring history for the model
scoring_history = pd.DataFrame(model.scoring_history())

# plot the validation and training logloss
scoring_history.plot(x='number_of_trees', y = ['validation_logloss', 'training_logloss'])

What is PySparkling? How can I use it for grid search or early stopping?

PySparkling calls H2O Python functions for all operations on H2O data frames. You can perform all H2O Python operations available in H2O Python version or later from PySparkling.

For help on a function within an IPython Notebook, append ? to its name. For example, run H2OGridSearch?

Here is an example of grid search in PySparkling:

from h2o.grid.grid_search import H2OGridSearch
from h2o.estimators.gbm import H2OGradientBoostingEstimator

iris = h2o.import_file("/Users/nidhimehta/h2o-dev/smalldata/iris/iris.csv")

ntrees_opt = [5, 10, 15]
max_depth_opt = [2, 3, 4]
learn_rate_opt = [0.1, 0.2]
hyper_parameters = {"ntrees": ntrees_opt, "max_depth": max_depth_opt, "learn_rate": learn_rate_opt}

gs = H2OGridSearch(H2OGradientBoostingEstimator(distribution='multinomial'), hyper_parameters)
gs.train(x=list(range(0, iris.ncol-1)), y=iris.ncol-1, training_frame=iris, nfolds=10)
print(gs.sort_by('logloss', increasing=True))
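
A Cartesian grid search trains one model per combination of the listed values. The following sketch uses only the standard library and does not call H2O. It enumerates the combinations for the three option lists above to show how many models to expect, assuming all three lists (including learn_rate_opt) are part of the grid.

```python
from itertools import product

# Same option lists as in the grid search example above.
hyper_parameters = {
    "ntrees": [5, 10, 15],
    "max_depth": [2, 3, 4],
    "learn_rate": [0.1, 0.2],
}

# Build one dict per point in the Cartesian product of the option lists.
names = sorted(hyper_parameters)
combos = [dict(zip(names, values))
          for values in product(*(hyper_parameters[n] for n in names))]

print(len(combos))   # 3 * 3 * 2 combinations
print(combos[0])
```

Adding another value to any list multiplies the number of models, so it can be worth running this count before starting a large grid.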

Here is an example of early stopping in PySparkling:

from h2o.grid.grid_search import H2OGridSearch
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

hidden_opt = [[32,32],[32,16,8],[100]]
l1_opt = [1e-4,1e-3]
hyper_parameters = {"hidden":hidden_opt, "l1":l1_opt}

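# x, y, train, and test are assumed to be defined already
# (predictor columns, response column, and two H2O frames)
model_grid = H2OGridSearch(H2ODeepLearningEstimator, hyper_params=hyper_parameters)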
model_grid.train(x=x, y=y, distribution="multinomial", epochs=1000, training_frame=train,
   validation_frame=test, score_interval=2, stopping_rounds=3, stopping_tolerance=0.05, stopping_metric="misclassification")
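
The stopping_rounds and stopping_tolerance arguments ask H2O to stop training once the stopping metric stops improving. The sketch below illustrates the general idea with a moving average over the last stopping_rounds scoring events. It is a simplified, hypothetical rule written for illustration and is not H2O's actual implementation; the function name should_stop and the sample histories are made up.

```python
def should_stop(history, stopping_rounds=3, stopping_tolerance=0.05):
    """Simplified early-stopping check for a metric where lower is better.

    Stops when the moving average of the last `stopping_rounds` scores is not
    at least `stopping_tolerance` (relative) better than the best moving
    average of any earlier, non-overlapping window.
    """
    k = stopping_rounds
    if len(history) < 2 * k:
        return False  # not enough scoring events to compare two windows

    def moving_avg(end):
        return sum(history[end - k:end]) / k

    recent = moving_avg(len(history))
    best_prior = min(moving_avg(end) for end in range(k, len(history) - k + 1))
    return recent > best_prior * (1 - stopping_tolerance)

# Made-up misclassification rates, one per scoring event.
still_improving = [1.0, 0.8, 0.6, 0.5, 0.4, 0.3]
plateaued = [0.5, 0.4, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3]

print(should_stop(still_improving))
print(should_stop(plateaued))
```

A larger stopping_rounds smooths out noisy scoring events at the cost of training for longer before the check can trigger.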

Do you have a tutorial for grid search in Python?

Yes, a notebook that demonstrates the use of grid search in Python is available.