Version: v0.64.0

Scoring with Feature Store

Score a Spark data frame from H2O.ai Feature Store against an MLOps deployment, in parallel and in mini-batches.

Prerequisites

  • h2o_mlops_scoring_client-*-py3-none-any.whl file or access to the Python Package Index (PyPI)
  • Java

Setup

Install h2o_mlops_scoring_client and featurestore with pip.
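After installing, a quick preflight check can confirm that both packages are importable and that a Java runtime is on the PATH. The following is a minimal sketch, not part of either library: `missing_modules` and `preflight` are hypothetical helper names, and the check only looks for a `java` executable without validating its version.

```python
import importlib.util
import shutil


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def preflight(modules=("h2o_mlops_scoring_client", "featurestore")):
    """Collect human-readable problems; an empty list means nothing was flagged."""
    problems = [f"Python module not found: {name}" for name in missing_modules(modules)]
    # Spark needs a Java runtime; this only checks that `java` is on PATH.
    if shutil.which("java") is None:
        problems.append("java executable not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print(problem)
```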

Notes

  • If running locally, the number of cores used (and thus parallel processes) can be overridden with:
num_cores = 10
h2o_mlops_scoring_client.spark_master = f"local[{num_cores}]"
  • featurestore will have its own Spark configuration and dependency requirements, depending on how it was installed.
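The local override above hard-codes the core count. A small helper can derive the value from the machine instead. This is a sketch under assumptions: `local_master` is a hypothetical function, and capping at the detected core count is a design choice made here (Spark itself accepts a larger number).

```python
import os


def local_master(num_cores=None, available=None):
    """Build a Spark master URL such as "local[4]", capped at the available cores."""
    # `available` can be passed explicitly; otherwise ask the OS, falling back to 1.
    available = available or os.cpu_count() or 1
    if num_cores is None:
        num_cores = available
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    return f"local[{min(num_cores, available)}]"


# Hypothetical usage, following the override shown in the note above:
# h2o_mlops_scoring_client.spark_master = local_master(10)
```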

Example Usage

import featurestore
import h2o_mlops_scoring_client
import os

Point the scorer to the conf directory you want to use.

h2o_mlops_scoring_client.spark_conf_dir = os.path.expanduser("~/.h2o_mlops_scoring_client/fs-conf")

Get the Spark session for the scoring client to pass to Feature Store.

spark = h2o_mlops_scoring_client.get_spark_session()

Default Feature Store Spark configuration applied.
:: loading settings :: url = jar:file:/Users/jgranados/miniconda3/envs/h2o/lib/python3.9/site-packages/pyspark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /Users/jgranados/.ivy2/cache
The jars for the packages stored in: /Users/jgranados/.ivy2/jars
org.apache.hadoop#hadoop-aws added as a dependency
io.delta#delta-core_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-cc7b6061-aa08-4e2e-ad63-1e457ee799b4;1.0
    confs: [default]
    found org.apache.hadoop#hadoop-aws;3.3.4 in central
    found com.amazonaws#aws-java-sdk-bundle;1.12.262 in central
    found org.wildfly.openssl#wildfly-openssl;1.0.7.Final in central
    found io.delta#delta-core_2.12;2.2.0 in central
    found io.delta#delta-storage;2.2.0 in central
    found org.antlr#antlr4-runtime;4.8 in central
:: resolution report :: resolve 127ms :: artifacts dl 6ms
    :: modules in use:
    com.amazonaws#aws-java-sdk-bundle;1.12.262 from central in [default]
    io.delta#delta-core_2.12;2.2.0 from central in [default]
    io.delta#delta-storage;2.2.0 from central in [default]
    org.antlr#antlr4-runtime;4.8 from central in [default]
    org.apache.hadoop#hadoop-aws;3.3.4 from central in [default]
    org.wildfly.openssl#wildfly-openssl;1.0.7.Final from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   6   |   0   |   0   |   0   ||   6   |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-cc7b6061-aa08-4e2e-ad63-1e457ee799b4
    confs: [default]
    0 artifacts copied, 6 already retrieved (0kB/8ms)

Choose the MLOps scoring endpoint.

MLOPS_ENDPOINT_URL = "https://model.internal.dedicated.h2o.ai/981f5909-a1b2-4d1a-831b-e65675036934/model/score"
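A malformed endpoint URL is easy to catch before any Spark work starts. The sketch below assumes scoring endpoints follow the shape of the example URL (HTTPS, path ending in `/model/score`); `check_endpoint_url` is a hypothetical helper and the suffix is an assumption you may need to adjust for your deployment.

```python
from urllib.parse import urlparse


def check_endpoint_url(url, expected_suffix="/model/score"):
    """Return a list of problems found with a scoring endpoint URL."""
    problems = []
    parsed = urlparse(url)
    if parsed.scheme != "https":
        problems.append(f"expected https scheme, got {parsed.scheme!r}")
    if not parsed.netloc:
        problems.append("URL has no host")
    # Assumption: the path ends like the example URL in this guide.
    if not parsed.path.endswith(expected_suffix):
        problems.append(f"path does not end with {expected_suffix!r}")
    return problems
```

An empty list means none of these checks flagged the URL; it does not confirm that the deployment is reachable.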

Get a Spark data frame from Feature Store along with a unique ID column used to identify each score.

fs = featurestore.Client("feature-store-api.internal.dedicated.h2o.ai", secure=True)
fs.auth.login()
project = fs.projects.get("bnpparibas")

DATA_FRAME = project.feature_sets.get("BNPParibas.csv").retrieve().as_spark_frame(spark)
ID_COLUMN = "ID"

17-07-2023 08:35:38 : INFO : Connecting to the server feature-store-api.internal.dedicated.h2o.ai ...
17-07-2023 08:35:40 : INFO : Server version: 0.18.1
17-07-2023 08:35:40 : INFO : Client version: 0.18.1
17-07-2023 08:35:40 : INFO : Opening browser to visit: https://auth.internal.dedicated.h2o.ai/auth/realms/hac/protocol/openid-connect/auth?client_id=hac-feature-store&code_challenge=RqAp-ffJCZAEARbrwjsHrRRHJ_ljTtu66siPzFRm0L0&code_challenge_method=S256&redirect_uri=https://feature-store.internal.dedicated.h2o.ai/Callback&response_type=code&scope=openid%20offline_access&state=pLykY74sux
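Because each score is identified by the ID column, it can be worth confirming that the column really is unique before scoring. A minimal sketch, assuming the frame is a PySpark DataFrame as in the example; `has_unique_ids` is a hypothetical helper, not part of the scoring client or Feature Store.

```python
def has_unique_ids(data_frame, id_column):
    """Return True when no value in `id_column` appears more than once."""
    total = data_frame.count()
    # select/distinct/count are standard PySpark DataFrame methods.
    distinct = data_frame.select(id_column).distinct().count()
    return total == distinct


# Hypothetical usage with the names from this guide:
# assert has_unique_ids(DATA_FRAME, ID_COLUMN), "ID column contains duplicates"
```

Both counts trigger a Spark job, so on large frames this check has a real cost.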

And now we score.

spark_df = h2o_mlops_scoring_client.score_data_frame(
mlops_endpoint_url=MLOPS_ENDPOINT_URL,
id_column=ID_COLUMN,
data_frame=DATA_FRAME,
)

spark_df.show()

[Stage 10:> (0 + 1) / 1]

+------+-----------+----------+
|    ID|   target.0|  target.1|
+------+-----------+----------+
|218057|  0.5960199|0.40398008|
|204563|  0.0965969| 0.9034031|
|126602| 0.18404573|0.81595427|
| 31077| 0.14552522| 0.8544748|
|184547|0.050189316| 0.9498107|
|215697|  0.1665029| 0.8334971|
|134328|  0.2938105| 0.7061895|
|174066| 0.72826254|0.27173743|
|100775| 0.20617479| 0.7938252|
|195239| 0.15678006|0.84321994|
|203262|  0.2985583| 0.7014417|
|109920| 0.25398308| 0.7460169|
| 76561| 0.12035632| 0.8796437|
|110243| 0.12437302|  0.875627|
|188865| 0.15772098|  0.842279|
|125723| 0.08976406|0.91023594|
| 25447|  0.5709034|0.42909658|
|175111| 0.09511238| 0.9048876|
|172872| 0.21316737|0.78683263|
|  2957| 0.38719475|0.61280525|
+------+-----------+----------+
only showing top 20 rows

Optionally merge the scores into the original data frame.

DATA_FRAME.join(spark_df, on=ID_COLUMN).toPandas()
            ID  target        v1         v2   v3        v4         v5        v6        v7        v8  ...  v125      v126      v127      v128  v129      v130      v131  time_travel_column_auto_generated  target.0  target.1
     0  148267       0       NaN        NaN    C       NaN        NaN       NaN       NaN       NaN  ...    AC       NaN       NaN       NaN     0       NaN       NaN                2023-07-12 13:51:02  0.413667  0.586333
     1  145203       0  1.182451  11.789324    C  4.323275  14.250168  2.579895  3.126089  0.145522  ...     Q  1.830652  2.494916  2.307815     0  0.907064  1.803278                2023-07-12 13:51:02  0.390397  0.609603
     2  156749       1       NaN        NaN    C       NaN        NaN       NaN       NaN       NaN  ...    BM       NaN       NaN       NaN     1       NaN       NaN                2023-07-12 13:51:02  0.027944  0.972056
     3  139024       1  0.718955  10.119158    C  3.687911   9.791527  2.246952  2.017312  0.243591  ...    BH  1.301143  2.881558  1.942082     1  2.045535  0.753425                2023-07-12 13:51:02  0.100611  0.899389
     4  144475       1  1.265389   9.579234    C  4.491259   8.399462  2.681258  2.325580  0.057701  ...    AA  1.709107  2.282832  2.295695     0  1.105883  2.127660                2023-07-12 13:51:02  0.347554  0.652446
   ...     ...     ...       ...        ...  ...       ...        ...       ...       ...       ...  ...   ...       ...       ...       ...   ...       ...       ...                                ...       ...       ...
114316   35768       1  1.659937   6.722593    C  6.856028  10.310723  2.763571  3.633916  0.579856  ...    AK  1.949568  2.792732  3.539050     0  0.829629  2.380952                2023-07-12 13:51:02  0.178249  0.821751
114317  132428       1  1.089197   6.767900    C  3.708279        NaN  1.836915  2.437444       NaN  ...     P  1.643828  4.128643       NaN     0  1.719808  1.123596                2023-07-12 13:51:02  0.127764  0.872236
114318   38816       1  0.289063   7.806124    C  4.981398   8.174613  2.984375  3.164062  1.953577  ...    BM  1.935144  4.204101  1.505429     1  0.849383  0.465115                2023-07-12 13:51:02  0.028165  0.971835
114319  147732       1  0.404814   9.926242    C  4.654258   9.132481  2.450765  2.844639  0.033742  ...     P  1.616521  1.353939  2.889227     0  0.461538  1.333333                2023-07-12 13:51:02  0.312323  0.687677
114320   86572       0       NaN        NaN    C       NaN        NaN       NaN       NaN       NaN  ...    AI       NaN       NaN       NaN     0       NaN       NaN                2023-07-12 13:51:02  0.352657  0.647343

114321 rows × 136 columns
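The join above does not pass `how`, and PySpark's `DataFrame.join` defaults to an inner join, so any source row without a matching score is dropped silently. The sketch below wraps the join and reports how many rows were lost. `join_scores` is a hypothetical helper, and the extra `count()` calls each run a Spark job.

```python
def join_scores(data_frame, scores, id_column, how="inner"):
    """Join scores onto the source frame and report how many source rows were lost.

    A negative count would mean the join added rows, for example because
    the scores frame contains duplicate IDs.
    """
    merged = data_frame.join(scores, on=id_column, how=how)
    dropped = data_frame.count() - merged.count()
    return merged, dropped


# Hypothetical usage with the names from this guide:
# merged, dropped = join_scores(DATA_FRAME, spark_df, ID_COLUMN, how="left")
# print(f"{dropped} rows lost in the join")
```

Passing `how="left"` keeps unscored rows with null score columns. Note also that `toPandas()` collects the whole result to the driver, which is fine for a frame of this size but may not be for much larger ones.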

