Scoring Spark Data Frames
Score a Spark data frame in parallel, in mini-batches, against an MLOps deployment.
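To make "parallel scoring in mini-batches" concrete, here is a minimal, library-free sketch of the general idea: split the rows into small batches, score the batches concurrently, and collect the results. It does not show how h2o_mlops_scoring_client is implemented (the client runs on Spark, not a thread pool), and `mini_batches`, `fake_score_batch`, and `score_in_parallel` are hypothetical names invented for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def mini_batches(rows, batch_size):
    """Yield successive lists of at most batch_size rows."""
    iterator = iter(rows)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def fake_score_batch(batch):
    """Hypothetical stand-in for one request to a scoring endpoint."""
    # Each score keeps the row's ID so it can be matched back to its input.
    return [{"ID": row["ID"], "score": row["v1"] * 2} for row in batch]


def score_in_parallel(rows, batch_size=2, workers=4):
    """Score every mini-batch on a worker pool and flatten the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scored = pool.map(fake_score_batch, mini_batches(rows, batch_size))
        return [score for batch in scored for score in batch]


rows = [{"ID": i, "v1": float(i)} for i in range(5)]
scores = score_in_parallel(rows)
```

Because every score carries its row's ID, the results can be matched to the input rows regardless of which worker handled which batch. This is the role the unique ID column plays in the example usage.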
Prerequisites
- h2o_mlops_scoring_client-*-py3-none-any.whl file or access to the Python Package Index (PyPI)
- Java
Setup
Install the h2o_mlops_scoring_client with pip.
Notes
- If running locally, the number of cores used (and thus parallel processes) can be overridden with:
num_cores = 10
h2o_mlops_scoring_client.spark_master = f"local[{num_cores}]"
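The value assigned above is a Spark local-mode master string of the form `local[N]`. As a sketch, the string could be built by a small helper that also caps the request at the number of CPUs the operating system reports. `local_master` is a hypothetical helper, not part of h2o_mlops_scoring_client, and the cap is a design choice made here rather than something Spark requires.

```python
import os


def local_master(num_cores=None):
    """Build a Spark local-mode master string such as 'local[4]'.

    Hypothetical helper: defaults to, and never exceeds, the CPU count
    reported by the operating system.
    """
    available = os.cpu_count() or 1
    if num_cores is None:
        num_cores = available
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    return f"local[{min(num_cores, available)}]"


spark_master = local_master(2)
```

The result would then be assigned in the same way as in the note, for example `h2o_mlops_scoring_client.spark_master = local_master(10)`.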
Example Usage
import h2o_mlops_scoring_client
Choose the MLOps scoring endpoint.
MLOPS_ENDPOINT_URL = "https://model.internal.dedicated.h2o.ai/d4d36117-c94a-4182-8b75-5f5abbd1c28b/model/score"
Load the data frame to score and name the unique ID column that identifies each score.
spark = h2o_mlops_scoring_client.get_spark_session()
DATA_FRAME = spark.read.csv("/Users/jgranados/datasets/BNPParibas.csv", header=True, inferSchema=True)
ID_COLUMN = "ID"
And now we score.
spark_df = h2o_mlops_scoring_client.score_data_frame(
    mlops_endpoint_url=MLOPS_ENDPOINT_URL,
    id_column=ID_COLUMN,
    data_frame=DATA_FRAME,
)
23/08/21 14:22:19 INFO h2o_mlops_scoring_client: Connecting to H2O.ai MLOps scorer at 'https://model.internal.dedicated.h2o.ai/d4d36117-c94a-4182-8b75-5f5abbd1c28b/model/score'
Optionally merge the scores into the original data frame.
DATA_FRAME.join(spark_df, on=ID_COLUMN).toPandas()
| | ID | target | v1 | v2 | v3 | v4 | v5 | v6 | v7 | v8 | ... | v124 | v125 | v126 | v127 | v128 | v129 | v130 | v131 | target.0 | target.1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 471 | 1 | NaN | NaN | C | NaN | NaN | NaN | NaN | NaN | ... | NaN | BJ | NaN | NaN | NaN | 0 | NaN | NaN | 0.146024 | 0.853976 |
1 | 1238 | 0 | 0.762258 | 2.672957 | C | 2.900483 | 8.021613 | 2.323856 | 1.672848 | 2.385984 | ... | 0.798897 | AK | 2.309768 | 2.657601 | 1.414868 | 0 | 3.408867 | 0.578036 | 0.618661 | 0.381339 |
2 | 1591 | 0 | NaN | NaN | C | NaN | NaN | NaN | NaN | NaN | ... | NaN | BM | NaN | NaN | NaN | 0 | NaN | NaN | 0.463295 | 0.536705 |
3 | 2866 | 0 | 2.009504 | 6.397093 | C | 4.778735 | 11.223940 | 3.014256 | 3.414799 | 0.132427 | ... | 0.206987 | X | 1.725845 | 2.800408 | 2.287892 | 0 | 1.145128 | 2.222223 | 0.214818 | 0.785182 |
4 | 3918 | 1 | 1.200612 | 10.720488 | C | 6.038627 | 12.636180 | 2.527201 | 2.740981 | 0.203916 | ... | 0.224938 | G | 1.592580 | 3.292614 | 4.548198 | 0 | 0.679665 | 2.786885 | 0.131239 | 0.868761 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
114316 | 225722 | 0 | NaN | NaN | C | NaN | NaN | NaN | NaN | NaN | ... | NaN | E | NaN | NaN | NaN | 0 | NaN | NaN | 0.380844 | 0.619156 |
114317 | 225842 | 0 | 0.628183 | 7.858950 | C | 3.737692 | 7.437793 | 2.105262 | 4.244482 | 0.023728 | ... | 0.001574 | AV | 2.324041 | 4.393040 | 1.806239 | 0 | 0.640000 | 1.000001 | 0.261797 | 0.738203 |
114318 | 226147 | 1 | 2.890626 | 9.125501 | C | 3.921754 | 9.334230 | 3.359375 | 3.554688 | 0.020339 | ... | 0.070830 | BK | 1.527745 | 2.197267 | 3.129845 | 0 | 0.747252 | 4.705882 | 0.406256 | 0.593744 |
114319 | 227244 | 0 | NaN | NaN | C | NaN | NaN | NaN | NaN | NaN | ... | NaN | BN | NaN | NaN | NaN | 0 | NaN | NaN | 0.628416 | 0.371584 |
114320 | 227894 | 1 | 10.857788 | 3.696549 | C | 4.161768 | 7.925563 | 2.148985 | 3.006773 | 2.458015 | ... | 0.458877 | AK | 1.647989 | 2.302482 | 1.380174 | 0 | 3.171172 | 4.924242 | 0.348519 | 0.651481 |
114321 rows × 135 columns
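Called with only `on=`, the PySpark `join` used for the merge performs an inner join, so a row appears in the merged result only when its ID exists in both data frames. The plain-Python sketch below mimics that behavior on toy data to show the consequence. `inner_join` and the sample values are illustrative assumptions, not output from the scorer.

```python
def inner_join(left, right, key):
    """Merge rows from left and right that share the same key value."""
    right_by_key = {row[key]: row for row in right}
    return [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]


# Toy input rows and toy scores; ID 3 has no score.
inputs = [{"ID": 1, "v1": 0.5}, {"ID": 2, "v1": 1.5}, {"ID": 3, "v1": 2.5}]
toy_scores = [{"ID": 1, "score": 0.9}, {"ID": 2, "score": 0.1}]

merged = inner_join(inputs, toy_scores, "ID")
unscored_ids = sorted({r["ID"] for r in inputs} - {r["ID"] for r in merged})
```

Here `merged` holds two rows and `unscored_ids` is `[3]`. Comparing row counts before and after the merge is a quick way to confirm that every input row received a score.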