Driverless AI Standalone Python Scoring Pipeline

A standalone Python scoring pipeline is available after successfully completing an experiment. This package contains an exported model and Python 3.8 source code examples for productionizing models built using H2O Driverless AI.

The files in this package let you transform and score on new data in several different ways:

  • From Python 3.8, you can import a scoring module and use it to transform and score on new data.

  • From other languages and platforms, you can use the TCP/HTTP scoring service bundled with this package to call into the scoring pipeline module through remote procedure calls (RPC).


Before You Begin

Refer to the following notes for important information regarding the Python Scoring Pipeline.

Note

If you use the virtualenv or pip run methods of Python scoring, CUDA, OpenCL, and cuDNN must be manually installed. For more information, see CUDA, OpenCL, and cuDNN Install Instructions.

Note

The downloaded scorer zip file contains a shell script called run_example.sh, which is used to set up a virtual environment and run an example Python script. If you use the pip-virtualenv mode for the run_example.sh shell script, refer to the following examples to install prerequisites for Python scoring:

To install the necessary prerequisites and activate a virtual environment using the run_example.sh shell script with Docker, refer to the following examples:

Ubuntu 18.04 or later

# replace <KEY> with your license key
docker run -ti --entrypoint=bash --runtime nvidia -e DRIVERLESS_AI_LICENSE_KEY=<KEY> -v /home/$USER/scorers:/scorers docker.io/nvidia/cuda:11.2.2-base-ubuntu18.04
apt-get update
apt-get install python3.8 virtualenv unzip git -y
apt-get install libgomp1 libopenblas-base ocl-icd-libopencl1 -y  # required at runtime
apt install build-essential libssl-dev libffi-dev python3-dev python3.8-dev -y  # to compile some packages
apt install language-pack-en -y  # for proper encoding support
apt-get install libopenblas-dev -y  # for runtime
mkdir -p /etc/OpenCL/vendors && echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd
export LANG="en_US.UTF-8"
export LC_ALL="en_US.UTF-8"
unzip /scorers/scorer.zip
cd scoring-pipeline
# if you don't need the h2o-3 recipe server, add dai_enable_h2o_recipes=0 before bash below
bash run_example.sh

Red Hat Enterprise Linux (Red Hat Universal Base Image 8 without GPUs)

docker run -ti --entrypoint=bash -v /home/$USER/scorers:/scorers registry.access.redhat.com/ubi8/ubi:8.4
dnf -y install python38 unzip virtualenv openblas libgomp
unzip /scorers/scorer.zip
cd scoring-pipeline
bash run_example.sh

CentOS 8

docker run -ti --entrypoint=bash -v /home/$USER/Downloads/scorers:/scorers centos:8
dnf -y install python38 unzip virtualenv openblas libgomp procps
unzip /scorers/scorer.zip
cd scoring-pipeline
bash run_example.sh

To install the necessary prerequisites and activate a virtual environment using the run_example.sh shell script on Ubuntu 16.04, run the following commands:

sudo apt-get update
sudo apt-get install software-properties-common # Ubuntu 16.04 only
sudo add-apt-repository ppa:deadsnakes/ppa # Ubuntu 16.04 only
sudo apt-get update
sudo apt-get install python3.8 virtualenv unzip -y
sudo apt-get install libgomp1 libopenblas-base ocl-icd-libopencl1 -y  # required at runtime
unzip scorer.zip
cd scoring-pipeline
bash run_example.sh

If you need to be able to compile, also run the following command:

sudo apt install build-essential libssl-dev libffi-dev python3-dev -y

To run a scoring job using the example.py file after the virtual environment has been activated, run the following command:

export DRIVERLESS_AI_LICENSE_FILE="/path/to/license.sig"
python example.py

To install the necessary prerequisites and activate a virtual environment using the run_example.sh shell script on Ubuntu 18.04 or later, run the following commands:

sudo apt-get update
sudo apt-get install python3.8 virtualenv unzip -y
sudo apt-get install libgomp1 libopenblas-base ocl-icd-libopencl1 -y  # required at runtime
unzip scorer.zip
cd scoring-pipeline
bash run_example.sh

If you need to be able to compile, also run the following command:

sudo apt install build-essential libssl-dev libffi-dev python3-dev -y

To run a scoring job using the example.py file after the virtual environment has been activated, run the following command:

export DRIVERLESS_AI_LICENSE_FILE="/path/to/license.sig"
python example.py

To install the necessary prerequisites and activate a virtual environment using the run_example.sh shell script on Red Hat Enterprise Linux 8, run the following commands:

dnf -y install python38 unzip virtualenv openblas libgomp
unzip /rpms/scorer.zip
cd scoring-pipeline
bash run_example.sh

To install the necessary prerequisites and activate a virtual environment using the run_example.sh shell script on CentOS 8, run the following commands:

dnf -y install python38 unzip virtualenv openblas libgomp procps
unzip /rpms/scorer.zip
cd scoring-pipeline
bash run_example.sh

Note

Custom Recipes and the Python Scoring Pipeline

By default, if a custom recipe has been uploaded into Driverless AI but is not used in the experiment, the Python Scoring Pipeline still contains the H2O recipe server. If this pipeline is then deployed in a container, the H2O recipe server makes the pipeline much larger. In addition, Java has to be installed in the container, which further increases the runtime storage and memory requirements. A workaround is to set the following environment variable before running the Python Scoring Pipeline:

export dai_enable_custom_recipes=0

CUDA, OpenCL, and cuDNN Install Instructions

Refer to the following sections for instructions on installing CUDA, OpenCL, and cuDNN when using the virtualenv or pip run methods of Python scoring.

Installing CUDA with NVIDIA Drivers

Before installing CUDA, make sure you have already installed wget, gcc, make, and elfutils-libelf-devel:

sudo yum -y install wget
sudo yum -y install gcc
sudo yum -y install make
sudo yum -y install elfutils-libelf-devel

Next, visit https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html for instructions on installing CUDA. It is recommended that you use the runfile method of installation.

If prompted to select what tools you would like to install, select Drivers only.

Installing OpenCL

Run the following commands to install OpenCL with yum on x86_64 systems based on CentOS 7 or RHEL 7.

sudo yum -y clean all
sudo yum -y makecache
sudo yum -y update
wget http://dl.fedoraproject.org/pub/epel/7/x86_64/Packages/c/clinfo-2.1.17.02.09-1.el7.x86_64.rpm
wget http://dl.fedoraproject.org/pub/epel/7/x86_64/Packages/o/ocl-icd-2.2.12-1.el7.x86_64.rpm
sudo rpm -if ocl-icd-2.2.12-1.el7.x86_64.rpm
sudo rpm -if clinfo-2.1.17.02.09-1.el7.x86_64.rpm
clinfo

mkdir -p /etc/OpenCL/vendors && \
    echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd

Installing cuDNN

For information on installing cuDNN on Linux, refer to https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html.

Note

cuDNN 8 or later is required.

Python Scoring Pipeline Files

The scoring-pipeline folder includes the following notable files:

  • example.py: An example Python script demonstrating how to import and score new records.

  • run_example.sh: Runs example.py (also sets up a virtualenv with prerequisite libraries). For more information, refer to the second note in the Before You Begin section.

  • tcp_server.py: A standalone TCP server for hosting scoring services.

  • http_server.py: A standalone HTTP server for hosting scoring services.

  • run_tcp_server.sh: Runs TCP scoring service (runs tcp_server.py).

  • run_http_server.sh: Runs HTTP scoring service (runs http_server.py).

  • example_client.py: An example Python script demonstrating how to communicate with the scoring server.

  • run_tcp_client.sh: Demonstrates how to communicate with the scoring service via TCP (runs example_client.py).

  • run_http_client.sh: Demonstrates how to communicate with the scoring service via HTTP (using curl).

Quick Start

There are two methods for starting the Python Scoring Pipeline.

Quick Start - Alternative Method

This section describes an alternative method for running the Python Scoring Pipeline. This version requires Internet access.

Note

If you use a scorer from a version prior to 1.10.4.1, you need to add export SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True before creating the new scorer Python environment, either in run_example.sh or in the same terminal where the shell scripts are executed. This change is not necessary when installing the scorer on top of an already built tar.sh, which is the recommended way.

Prerequisites

  • The scoring module and scoring service are supported only on Linux with Python 3.8 and OpenBLAS.

  • The scoring module and scoring service download additional packages at install time and require Internet access. Depending on your network environment, you might need to set up internet access via a proxy.

  • Valid Driverless AI license. Driverless AI requires a license to be specified in order to run the Python Scoring Pipeline.

  • Apache Thrift (to run the scoring service in TCP mode)

  • Linux environment

  • Python 3.8

  • libopenblas-dev (required for H2O4GPU)

  • OpenCL
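Some of the prerequisites above can be checked with a short preflight script before running any of the shell scripts. The following is a minimal sketch and is not part of the scoring package: the function name check_prerequisites and its parameters are hypothetical, and it only inspects the operating system, the Python version, and the two license environment variables described in this document.

```python
import os
import platform
import sys


def check_prerequisites(system=None, version=None, environ=None):
    """Return a list of human-readable problems (empty if none were found)."""
    # Default to the running interpreter; arguments exist to ease testing.
    system = platform.system() if system is None else system
    version = tuple(sys.version_info[:2]) if version is None else tuple(version)
    environ = os.environ if environ is None else environ

    problems = []
    if system != "Linux":
        problems.append("Linux is required, found %s" % system)
    if version != (3, 8):
        problems.append("Python 3.8 is required, found %d.%d" % version)
    if not (environ.get("DRIVERLESS_AI_LICENSE_FILE")
            or environ.get("DRIVERLESS_AI_LICENSE_KEY")):
        problems.append("no Driverless AI license is configured")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

The script does not check for OpenBLAS, OpenCL, or Thrift, which still have to be verified separately.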

For info on how to install these prerequisites, refer to the following examples.

Installing Python 3.8 and OpenBLAS on Ubuntu 16.10 or Later:

sudo apt install python3.8 python3.8-dev python3-pip python3-dev \
  python-virtualenv python3-virtualenv libopenblas-dev

Installing Python 3.8 and OpenBLAS on Ubuntu 16.04:

sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt-get update
sudo apt-get install python3.8 python3.8-dev python3-pip python3-dev \
  python-virtualenv python3-virtualenv libopenblas-dev

Installing Conda 3.6:

You can install Conda using either Anaconda or Miniconda.

Installing OpenCL:

Install OpenCL on RHEL:

yum -y clean all
yum -y makecache
yum -y update
wget http://dl.fedoraproject.org/pub/epel/7/x86_64/Packages/c/clinfo-2.1.17.02.09-1.el7.x86_64.rpm
wget http://dl.fedoraproject.org/pub/epel/7/x86_64/Packages/o/ocl-icd-2.2.12-1.el7.x86_64.rpm
rpm -if ocl-icd-2.2.12-1.el7.x86_64.rpm
rpm -if clinfo-2.1.17.02.09-1.el7.x86_64.rpm
clinfo

mkdir -p /etc/OpenCL/vendors && \
    echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd

Install OpenCL on Ubuntu:

sudo apt install ocl-icd-libopencl1

mkdir -p /etc/OpenCL/vendors && \
    echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd

License Specification

Driverless AI requires a license to be specified in order to run the Python Scoring Pipeline. The license can be specified via an environment variable. For example, in a Jupyter or IPython session, use the %env magic:

# Set DRIVERLESS_AI_LICENSE_FILE, the path to the Driverless AI license file
# (%env takes the value literally, so it is not quoted)
%env DRIVERLESS_AI_LICENSE_FILE=/home/ubuntu/license/license.sig


# Set DRIVERLESS_AI_LICENSE_KEY, the Driverless AI license key (Base64 encoded string)
%env DRIVERLESS_AI_LICENSE_KEY=oLqLZXMI0y...

The examples that follow use DRIVERLESS_AI_LICENSE_FILE. Using DRIVERLESS_AI_LICENSE_KEY would be similar.
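In a plain Python script, the same variables can be set through os.environ. The following is a minimal sketch under two assumptions: that the scoring module reads the variable when it is imported or instantiated, so the variable is set first, and that exactly one of the two variables should be provided (a choice made for this sketch, not a documented requirement). The helper name set_license is hypothetical.

```python
import os


def set_license(license_file=None, license_key=None, environ=None):
    """Export exactly one of the two license variables and return its name."""
    environ = os.environ if environ is None else environ
    # Sketch-level rule: reject "both" and "neither" to avoid ambiguity.
    if (license_file is None) == (license_key is None):
        raise ValueError("provide either license_file or license_key")
    if license_file is not None:
        environ["DRIVERLESS_AI_LICENSE_FILE"] = license_file
        return "DRIVERLESS_AI_LICENSE_FILE"
    environ["DRIVERLESS_AI_LICENSE_KEY"] = license_key
    return "DRIVERLESS_AI_LICENSE_KEY"


# Call this before importing the scoring module (the path is a placeholder).
set_license(license_file="/path/to/license.sig")
```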

Installing the Thrift Compiler

Thrift is required to run the scoring service in TCP mode, but it is not required to run the scoring module. The following steps are available on the Thrift documentation site at: https://thrift.apache.org/docs/BuildingFromSource.

sudo apt-get install automake bison flex g++ git libevent-dev \
  libssl-dev libtool make pkg-config libboost-all-dev ant
wget https://github.com/apache/thrift/archive/0.10.0.tar.gz
tar -xvf 0.10.0.tar.gz
cd thrift-0.10.0
./bootstrap.sh
./configure
make
sudo make install

Run the following to refresh the shared library cache after installing Thrift:

sudo ldconfig /usr/local/lib

Running the Python Scoring Pipeline - Alternative Method

  1. On the completed Experiment page, click on the Download Python Scoring Pipeline button to download the scorer.zip file for this experiment onto your local machine.

  2. Extract the scoring pipeline.

You can run the scoring module and the scoring service after downloading and extracting the pipeline.

Score from a Python Program

If you intend to score from a Python program, run the scoring module example. (Requires Linux and Python 3.8.)

export DRIVERLESS_AI_LICENSE_FILE="/path/to/license.sig"
bash run_example.sh

Score Using a Web Service

If you intend to score using a web service, run the HTTP scoring server example. (Requires Linux x86_64 and Python 3.8.)

export DRIVERLESS_AI_LICENSE_FILE="/path/to/license.sig"
bash run_http_server.sh
bash run_http_client.sh

Score Using a Thrift Service

If you intend to score using a Thrift service, run the TCP scoring server example. (Requires Linux x86_64, Python 3.8 and Thrift.)

export DRIVERLESS_AI_LICENSE_FILE="/path/to/license.sig"
bash run_tcp_server.sh
bash run_tcp_client.sh

Note: By default, the run_*.sh scripts mentioned above create a virtual environment using virtualenv and pip, within which the Python code is executed. The scripts can also use Conda (Anaconda/Miniconda) to create a Conda virtual environment and install the required package dependencies. The package manager to use is provided as an argument to the script.

# to use conda package manager
export DRIVERLESS_AI_LICENSE_FILE="/path/to/license.sig"
bash run_example.sh --pm conda

# to use pip package manager
export DRIVERLESS_AI_LICENSE_FILE="/path/to/license.sig"
bash run_example.sh --pm pip

If you experience errors while running any of the above scripts, check that Python 3.8 is properly installed and configured on your system. Refer to the Troubleshooting Python Environment Issues section that follows to see how to set up and test the scoring module using a cleanroom Ubuntu 16.04 virtual machine.

The Python Scoring Module

The scoring module is a Python module bundled into a standalone wheel file (named scoring_*.whl). All the prerequisites for the scoring module to work correctly are listed in the requirements.txt file. To use the scoring module, create a Python virtualenv, install the prerequisites, and then import and use the scoring module as follows:

# See 'example.py' for complete example.
from scoring_487931_20170921174120_b4066 import Scorer
scorer = Scorer()       # Create instance.
score = scorer.score([  # Call score()
    7.416,              # sepal_len
    3.562,              # sepal_wid
    1.049,              # petal_len
    2.388,              # petal_wid
])

The scorer instance provides the following methods (and more):

  • score(list): Score one row (list of values).

  • score_batch(df): Score a Pandas dataframe.

  • fit_transform_batch(df): Transform a Pandas dataframe.

  • get_target_labels(): Get target column labels (for classification problems).
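Because score(list) takes a plain list, the values have to be passed in a fixed column order. The sketch below converts dictionary records into ordered lists before scoring them one row at a time. It rests on assumptions: the column order is taken from the comments in the example above, the helper name score_records is hypothetical, and StubScorer is only a stand-in so that the sketch runs without the scoring package installed.

```python
COLUMNS = ["sepal_len", "sepal_wid", "petal_len", "petal_wid"]  # assumed order


def score_records(scorer, records, columns=COLUMNS):
    """Score dict records one row at a time via scorer.score(list)."""
    results = []
    for record in records:
        missing = [name for name in columns if name not in record]
        if missing:
            raise KeyError("record is missing columns: %s" % ", ".join(missing))
        # Build the positional row in the expected column order.
        results.append(scorer.score([record[name] for name in columns]))
    return results


class StubScorer:
    """Stand-in for the real Scorer so this sketch runs without the package."""

    def score(self, row):
        return [sum(row)]  # placeholder output, not a real prediction


rows = [{"sepal_len": 7.416, "sepal_wid": 3.562,
         "petal_len": 1.049, "petal_wid": 2.388}]
print(score_records(StubScorer(), rows))
```

With the real package, pass a Scorer() instance instead of StubScorer(). For larger inputs, score_batch(df) is likely the better fit.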

The process of importing and using the scoring module is demonstrated by the bash script run_example.sh, which effectively performs the following steps:

# See 'run_example.sh' for complete example.
virtualenv -p python3.8 env
source env/bin/activate
pip install --use-deprecated=legacy-resolver -r requirements.txt
export DRIVERLESS_AI_LICENSE_FILE="/path/to/license.sig"
python example.py

The Scoring Service

The scoring service hosts the scoring module as an HTTP or TCP service. Doing this exposes all the functions of the scoring module through remote procedure calls (RPC). In effect, this mechanism lets you invoke scoring functions from languages other than Python on the same computer or from another computer on a shared network or on the Internet.

The scoring service can be started in two ways:

  • In TCP mode, the scoring service provides high-performance RPC calls via Apache Thrift (https://thrift.apache.org/) using a binary wire protocol.

  • In HTTP mode, the scoring service provides JSON-RPC 2.0 calls served by Tornado (http://www.tornadoweb.org).

Scoring operations can be performed on individual rows (row-by-row) or in batch mode (multiple rows at a time).

Scoring Service - TCP Mode (Thrift)

The TCP mode lets you use the scoring service from any language supported by Thrift, including C, C++, C#, Cocoa, D, Dart, Delphi, Go, Haxe, Java, Node.js, Lua, Perl, PHP, Python, Ruby, and Smalltalk.

To start the scoring service in TCP mode, you will need to generate the Thrift bindings once, then run the server:

# See 'run_tcp_server.sh' for complete example.
thrift --gen py scoring.thrift
python tcp_server.py --port=9090

Note that the Thrift compiler is only required at build time. It is not a runtime dependency: once the scoring services are built and tested, you do not need to repeat this installation process on the machines where the scoring services are intended to be deployed.

To call the scoring service, generate the Thrift bindings for your language of choice, then make RPC calls via TCP sockets using Thrift’s buffered transport in conjunction with its binary protocol.

# See 'run_tcp_client.sh' for complete example.
thrift --gen py scoring.thrift

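# See 'example_client.py' for complete example.
from thrift.protocol import TBinaryProtocol
from thrift.transport import TSocket, TTransport
# ScoringService and Row come from the bindings generated by 'thrift --gen py'
# above; see 'example_client.py' for the exact import statements.

# Connect using a buffered transport and the binary protocol.
socket = TSocket.TSocket('localhost', 9090)
transport = TTransport.TBufferedTransport(socket)
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = ScoringService.Client(protocol)
transport.open()

# Fill in one row and score it.
row = Row()
row.sepalLen = 7.416  # sepal_len
row.sepalWid = 3.562  # sepal_wid
row.petalLen = 1.049  # petal_len
row.petalWid = 2.388  # petal_wid
scores = client.score(row)
transport.close()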
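In the client example above, the transport stays open if the scoring call raises an exception. One way to guard against this is a small context manager that always closes the transport. This is a sketch rather than part of the scoring package: the name opened is hypothetical, and FakeTransport is a stand-in that exposes the same open() and close() methods as a Thrift transport so the example runs without Thrift installed.

```python
from contextlib import contextmanager


@contextmanager
def opened(transport):
    """Open a transport, yield it, and always close it afterwards."""
    transport.open()
    try:
        yield transport
    finally:
        transport.close()


class FakeTransport:
    """Stand-in with the same open()/close() methods as a Thrift transport."""

    def __init__(self):
        self.events = []

    def open(self):
        self.events.append("open")

    def close(self):
        self.events.append("close")


transport = FakeTransport()
with opened(transport):
    pass  # with a real transport: scores = client.score(row)
print(transport.events)  # ['open', 'close']
```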

You can call the scoring service in the same way from other languages, e.g. Java:

thrift --gen java scoring.thrift

// Dependencies:
// commons-codec-1.9.jar
// commons-logging-1.2.jar
// httpclient-4.4.1.jar
// httpcore-4.4.1.jar
// libthrift-0.10.0.jar
// slf4j-api-1.7.12.jar

import ai.h2o.scoring.Row;
import ai.h2o.scoring.ScoringService;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;
import java.util.List;

public class Main {
  public static void main(String[] args) {
    try {
      TTransport transport = new TSocket("localhost", 9090);
      transport.open();

      ScoringService.Client client = new ScoringService.Client(
        new TBinaryProtocol(transport));

      Row row = new Row(7.642, 3.436, 6.721, 1.020);
      List<Double> scores = client.score(row);
      System.out.println(scores);

      transport.close();
    } catch (TException ex) {
      ex.printStackTrace();
    }
  }
}

Scoring Service - HTTP Mode (JSON-RPC 2.0)

The HTTP mode lets you use the scoring service using plaintext JSON-RPC calls. This is usually slower than Thrift, but has the advantage of being usable from any HTTP client library in your language of choice, without any dependency on Thrift.

For JSON-RPC documentation, see http://www.jsonrpc.org/specification.

To start the scoring service in HTTP mode:

# See 'run_http_server.sh' for complete example.
export DRIVERLESS_AI_LICENSE_FILE="/path/to/license.sig"
python http_server.py --port=9090

To invoke scoring methods, compose a JSON-RPC message and make an HTTP POST request to http://host:port/rpc as follows:

# See 'run_http_client.sh' for complete example.
curl http://localhost:9090/rpc \
  --header "Content-Type: application/json" \
  --data @- <<EOF
 {
  "id": 1,
  "method": "score",
  "params": {
    "row": [ 7.486, 3.277, 4.755, 2.354 ]
  }
 }
EOF

Similarly, you can use any HTTP client library to reproduce the above result. For example, from Python, you can use the requests module as follows:

import requests
row = [7.486, 3.277, 4.755, 2.354]
req = dict(id=1, method='score', params=dict(row=row))
# json= serializes the request as a JSON body and sets the Content-Type header
res = requests.post('http://localhost:9090/rpc', json=req)
print(res.json()['result'])
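If the requests package is not available, the same call can be made with the Python standard library. The sketch below mirrors the message shape of the curl example (id, method, params). The function names are hypothetical, and the error handling assumes that the server reports failures through an "error" member, as the JSON-RPC 2.0 specification describes.

```python
import json
import urllib.request


def build_score_request(row, request_id=1):
    """Build the JSON body for a 'score' call, mirroring the curl example."""
    message = {"id": request_id, "method": "score", "params": {"row": list(row)}}
    return json.dumps(message).encode("utf-8")


def extract_result(response_body):
    """Return the 'result' member, or raise if the server reported an error."""
    reply = json.loads(response_body)
    if reply.get("error") is not None:
        raise RuntimeError("scoring service error: %r" % (reply["error"],))
    return reply["result"]


def score_over_http(row, url="http://localhost:9090/rpc", timeout=10):
    """POST one row to the scoring service and return the decoded result."""
    request = urllib.request.Request(
        url,
        data=build_score_request(row),  # supplying data makes this a POST
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_result(response.read().decode("utf-8"))
```

With the HTTP scoring server running locally, score_over_http([7.486, 3.277, 4.755, 2.354]) sends the same request as the curl example.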

Python Scoring Pipeline Shapley values support

The Python Scoring Pipeline supports Shapley contributions for transformed features and original features. The following example demonstrates how to retrieve Shapley contributions for transformed and original features when making predictions:

from scoring_487931_20170921174120_b4066 import Scorer

# Transformed Features Shapley Values
scorer = Scorer()       # Create instance.
score = scorer.score([  # Call score()
    7.416,              # sepal_len
    3.562,              # sepal_wid
    1.049,              # petal_len
    2.388,              # petal_wid
], pred_contribs=True, pred_contribs_original=False)

# Original Features Shapley Values
scorer = Scorer()       # Create instance.
score = scorer.score([  # Call score()
    7.416,              # sepal_len
    3.562,              # sepal_wid
    1.049,              # petal_len
    2.388,              # petal_wid
], pred_contribs=True, pred_contribs_original=True)

Note

  • Setting pred_contribs_original=True requires that pred_contribs is also set to True.

  • Presently, Shapley contributions for transformed features and original features are available for XGBoost (GBM, GLM, RF, DART), LightGBM, Zero-Inflated, Imbalanced, and DecisionTree models (and their ensembles). For ensembles with an ExtraTrees meta learner (ensemble_meta_learner='extra_trees'), we suggest using the Python scoring packages.

  • Shapley values for original features are approximated from the accompanying Shapley values for transformed features with the Naive Shapley (even split) method.

  • The Shapley fast approximation uses only one model (from the first fold) with no more than the first 50 trees. For details see fast_approx_num_trees and fast_approx_do_one_fold_one_model config.toml settings.
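To illustrate the even-split idea behind the Naive Shapley method mentioned in the note above, the sketch below divides each transformed feature's contribution equally among the original features it was derived from. It illustrates the general approach only and is not the Driverless AI implementation. The function name, the feature names, the contribution values, and the mapping from transformed to original features are all hypothetical.

```python
from collections import defaultdict


def naive_shapley(transformed_contribs, sources):
    """Split each transformed contribution evenly across its source features."""
    original = defaultdict(float)
    for feature, contribution in transformed_contribs.items():
        parents = sources[feature]
        share = contribution / len(parents)  # even split
        for parent in parents:
            original[parent] += share
    return dict(original)


# Hypothetical contributions for two transformed features.
contribs = {"petal_len*petal_wid": 0.5, "sepal_len": 0.25}
sources = {
    "petal_len*petal_wid": ["petal_len", "petal_wid"],
    "sepal_len": ["sepal_len"],
}
print(naive_shapley(contribs, sources))
# {'petal_len': 0.25, 'petal_wid': 0.25, 'sepal_len': 0.25}
```

The contributions sum to 0.75 both before and after the split, so the total attribution is preserved.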

Frequently asked questions

I’m getting GCC compile errors on Red Hat / CentOS when not using tar and SCORING_PIPELINE_INSTALL_DEPENDENCIES = 0. How do I fix this?

To fix this issue, run the following command:

sudo yum -y install gcc

Why am I getting a “TensorFlow is disabled” message when I run the Python Scoring Pipeline?

If you ran an experiment when TensorFlow was enabled and then attempt to run the Python Scoring Pipeline, you may receive a message similar to the following:

TensorFlow is disabled. To enable, export DRIVERLESS_AI_ENABLE_TENSORFLOW=1 or set enable_tensorflow=true in config.toml.

To successfully run the Python Scoring Pipeline, you must enable the DRIVERLESS_AI_ENABLE_TENSORFLOW flag. For example:

export DRIVERLESS_AI_LICENSE_FILE="/path/to/license.sig"
DRIVERLESS_AI_ENABLE_TENSORFLOW=1 bash run_example.sh

Troubleshooting Python Environment Issues

The following instructions describe how to set up a cleanroom Ubuntu 16.04 virtual machine to test that this scoring pipeline works correctly.

Prerequisites:

  1. Create configuration files for Vagrant.

    • bootstrap.sh: contains commands to set up Python 3.8 and OpenBLAS.

    • Vagrantfile: contains virtual machine configuration instructions for Vagrant and VirtualBox.

----- bootstrap.sh -----

#!/usr/bin/env bash

sudo apt-get -y update
sudo apt-get -y install apt-utils build-essential python-software-properties software-properties-common zip libopenblas-dev
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt-get update -yqq
sudo apt-get install -y python3.8 python3.8-dev python3-pip python3-dev python-virtualenv python3-virtualenv

# end of bootstrap.sh

----- Vagrantfile -----

# -*- mode: ruby -*-
# vi: set ft=ruby :

Vagrant.configure(2) do |config|
  config.vm.box = "ubuntu/xenial64"
  config.vm.provision :shell, path: "bootstrap.sh", privileged: false
  config.vm.hostname = "h2o"
  config.vm.provider "virtualbox" do |vb|
    vb.memory = "4096"
  end
end

# end of Vagrantfile

  2. Launch the VM and SSH into it. Note that we’re also placing the scoring pipeline in the same directory so that we can access it later inside the VM.

cp /path/to/scorer.zip .
vagrant up
vagrant ssh

  3. Test the scoring pipeline inside the virtual machine.

cp /vagrant/scorer.zip .
unzip scorer.zip
cd scoring-pipeline/
export DRIVERLESS_AI_LICENSE_FILE="/path/to/license.sig"
bash run_example.sh

At this point, you should see scores printed out on the terminal. If not, contact us at support@h2o.ai.