Version: v0.68.0

Scoring Spark Data Frames

Score a Spark data frame in parallel, in mini-batches, against an MLOps deployment.

Prerequisites

  • h2o_mlops_scoring_client-*-py3-none-any.whl file or access to the Python Package Index (PyPI)
  • Java

Setup

Install the h2o_mlops_scoring_client with pip.
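
Before scoring, it can help to confirm that both prerequisites are in place. The sketch below is an illustration and not part of the client: check_prerequisites is a hypothetical helper that uses only the Python standard library to look for a java executable on the PATH and to check that the h2o_mlops_scoring_client module can be imported.

```python
import importlib.util
import shutil


def check_prerequisites():
    """Report whether Java and the scoring client appear to be available.

    Hypothetical helper for illustration; it is not provided by
    h2o_mlops_scoring_client.
    """
    return {
        # shutil.which returns None when no `java` executable is on the PATH.
        "java_on_path": shutil.which("java") is not None,
        # find_spec returns None when the top-level module is not installed.
        "client_importable": importlib.util.find_spec("h2o_mlops_scoring_client")
        is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

A False value for either entry suggests revisiting the Prerequisites or Setup step before continuing.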

Notes

  • If running locally, the number of cores used (and thus the number of parallel processes) can be overridden with:
import h2o_mlops_scoring_client
num_cores = 10
h2o_mlops_scoring_client.spark_master = f"local[{num_cores}]"
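
The value assigned above follows Spark's local[N] master URL form, where N is the number of worker threads. As a minimal sketch, the hypothetical helper local_master below (not part of the client) builds that string and falls back to the machine's CPU count when no number is given.

```python
import os


def local_master(num_cores=None):
    """Build a Spark local master URL such as "local[10]".

    Hypothetical helper for illustration; when num_cores is omitted it
    falls back to os.cpu_count() (or 1 if that cannot be determined).
    """
    if num_cores is None:
        num_cores = os.cpu_count() or 1
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    return f"local[{num_cores}]"


# Usage with the client, as in the note above:
# h2o_mlops_scoring_client.spark_master = local_master(10)
```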

Example Usage

import h2o_mlops_scoring_client

Choose the MLOps scoring endpoint.

MLOPS_ENDPOINT_URL = "https://model.internal.dedicated.h2o.ai/d4d36117-c94a-4182-8b75-5f5abbd1c28b/model/score"

Load the data frame to score and choose a unique ID column, which is used to identify each score.

spark = h2o_mlops_scoring_client.get_spark_session()

DATA_FRAME = spark.read.csv("/Users/jgranados/datasets/BNPParibas.csv", header=True, inferSchema=True)
ID_COLUMN = "ID"
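
Because each score is identified by the ID column, duplicate or missing IDs would make the later join ambiguous. The following is a minimal sketch of a uniqueness check, assuming a hypothetical helper named id_column_is_unique that is not part of the client. It relies only on the standard Spark DataFrame methods count, select, dropna, and distinct.

```python
def id_column_is_unique(data_frame, id_column):
    """Return True when every row has a distinct, non-null value in id_column.

    Hypothetical helper for illustration. `data_frame` is expected to be a
    Spark DataFrame; only count(), select(), dropna() and distinct() are used.
    """
    total_rows = data_frame.count()
    distinct_ids = data_frame.select(id_column).dropna().distinct().count()
    # Equal counts mean no ID is repeated and no ID is null.
    return total_rows == distinct_ids


# Usage with the names defined above:
# if not id_column_is_unique(DATA_FRAME, ID_COLUMN):
#     raise ValueError(f"Column {ID_COLUMN!r} does not uniquely identify rows")
```

Note that the check triggers two Spark jobs, so on very large inputs it may be worth running once rather than before every scoring call.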

And now we score.

spark_df = h2o_mlops_scoring_client.score_data_frame(
    mlops_endpoint_url=MLOPS_ENDPOINT_URL,
    id_column=ID_COLUMN,
    data_frame=DATA_FRAME,
)

23/08/21 14:22:19 INFO h2o_mlops_scoring_client: Connecting to H2O.ai MLOps scorer at 'https://model.internal.dedicated.h2o.ai/d4d36117-c94a-4182-8b75-5f5abbd1c28b/model/score'

Optionally merge the scores into the original data frame.

DATA_FRAME.join(spark_df, on=ID_COLUMN).toPandas()
            ID  target         v1         v2   v3        v4         v5        v6        v7        v8  ...      v124  v125      v126      v127      v128  v129      v130      v131  target.0  target.1
     0     471       1        NaN        NaN    C       NaN        NaN       NaN       NaN       NaN  ...       NaN    BJ       NaN       NaN       NaN     0       NaN       NaN  0.146024  0.853976
     1    1238       0   0.762258   2.672957    C  2.900483   8.021613  2.323856  1.672848  2.385984  ...  0.798897    AK  2.309768  2.657601  1.414868     0  3.408867  0.578036  0.618661  0.381339
     2    1591       0        NaN        NaN    C       NaN        NaN       NaN       NaN       NaN  ...       NaN    BM       NaN       NaN       NaN     0       NaN       NaN  0.463295  0.536705
     3    2866       0   2.009504   6.397093    C  4.778735  11.223940  3.014256  3.414799  0.132427  ...  0.206987     X  1.725845  2.800408  2.287892     0  1.145128  2.222223  0.214818  0.785182
     4    3918       1   1.200612  10.720488    C  6.038627  12.636180  2.527201  2.740981  0.203916  ...  0.224938     G  1.592580  3.292614  4.548198     0  0.679665  2.786885  0.131239  0.868761
   ...     ...     ...        ...        ...  ...       ...        ...       ...       ...       ...  ...       ...   ...       ...       ...       ...   ...       ...       ...       ...       ...
114316  225722       0        NaN        NaN    C       NaN        NaN       NaN       NaN       NaN  ...       NaN     E       NaN       NaN       NaN     0       NaN       NaN  0.380844  0.619156
114317  225842       0   0.628183   7.858950    C  3.737692   7.437793  2.105262  4.244482  0.023728  ...  0.001574    AV  2.324041  4.393040  1.806239     0  0.640000  1.000001  0.261797  0.738203
114318  226147       1   2.890626   9.125501    C  3.921754   9.334230  3.359375  3.554688  0.020339  ...  0.070830    BK  1.527745  2.197267  3.129845     0  0.747252  4.705882  0.406256  0.593744
114319  227244       0        NaN        NaN    C       NaN        NaN       NaN       NaN       NaN  ...       NaN    BN       NaN       NaN       NaN     0       NaN       NaN  0.628416  0.371584
114320  227894       1  10.857788   3.696549    C  4.161768   7.925563  2.148985  3.006773  2.458015  ...  0.458877    AK  1.647989  2.302482  1.380174     0  3.171172  4.924242  0.348519  0.651481

114321 rows × 135 columns
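
The score columns in the merged output (target.0 and target.1) contain a dot, which Spark normally reads as struct field access when a column is referenced by name. Such names generally need to be wrapped in backticks when selected on the Spark side. A minimal sketch, assuming a hypothetical helper quote_column that is not part of the client:

```python
def quote_column(name):
    """Wrap a column name in backticks so Spark treats dots literally.

    Hypothetical helper for illustration. A backtick inside the name is
    escaped by doubling it, which is how Spark SQL quotes identifiers.
    """
    return "`" + name.replace("`", "``") + "`"


# Usage with the names defined above (requires a running Spark session):
# merged = DATA_FRAME.join(spark_df, on=ID_COLUMN)
# merged.select(ID_COLUMN, quote_column("target.1")).show(5)
```

After toPandas(), the quoting is not needed; the columns can be read as ordinary pandas labels such as "target.1".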

