Scoring with Feature Store
Score a Spark data frame from H2O.ai Feature Store against an MLOps deployment, in parallel and in mini-batches.
Prerequisites
- h2o_mlops_scoring_client-*-py3-none-any.whl file or access to the Python Package Index (PyPI)
- Java
Setup
Install h2o_mlops_scoring_client and featurestore with pip.
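Before moving on, it can help to confirm that the prerequisites are visible from the current Python environment. The following is a minimal sketch and not part of either package: `missing_packages` and `java_available` are hypothetical helper names, and the check only looks for importable modules and a `java` executable on the PATH.

```python
import importlib.util
import shutil


def missing_packages(module_names):
    """Return the module names that cannot be found in this environment."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


def java_available():
    """Return True if a `java` executable is on the PATH."""
    return shutil.which("java") is not None


if __name__ == "__main__":
    required = ["h2o_mlops_scoring_client", "featurestore"]
    print("Missing packages:", missing_packages(required))
    print("Java found:", java_available())
```

An empty list of missing packages and a found `java` executable suggest the setup step succeeded; this does not verify versions or Spark compatibility.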
Notes
- If running locally, the number of cores used (and thus parallel processes) can be overridden with:
num_cores = 10
h2o_mlops_scoring_client.spark_master = f"local[{num_cores}]"
- featurestore will have its own Spark configuration and dependency requirements, depending on how it was installed.
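Rather than hard-coding the core count, the value can be derived from the machine. This is a sketch under assumptions: `local_spark_master` and `reserved_cores` are hypothetical names introduced here, and it is assumed that `spark_master` should be set before the Spark session is created.

```python
import os


def local_spark_master(num_cores=None, reserved_cores=1):
    """Build a Spark master string such as "local[7]".

    When num_cores is not given, use all detected cores minus
    reserved_cores, but never fewer than one.
    """
    if num_cores is None:
        detected = os.cpu_count() or 1
        num_cores = max(1, detected - reserved_cores)
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    return f"local[{num_cores}]"


try:
    import h2o_mlops_scoring_client

    # Assumption: this takes effect only if set before the session exists.
    h2o_mlops_scoring_client.spark_master = local_spark_master()
except ImportError:
    # The scoring client is not installed; the helper still works alone.
    pass
```

Leaving one core free is only a convention to keep the machine responsive; pass `num_cores` explicitly to reproduce the fixed value from the note above.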
Example Usage
import featurestore
import h2o_mlops_scoring_client
import os
Point the scorer to the conf directory you want to use.
h2o_mlops_scoring_client.spark_conf_dir = os.path.expanduser("~/.h2o_mlops_scoring_client/fs-conf")
Get the Spark session for the scoring client to pass to Feature Store.
spark = h2o_mlops_scoring_client.get_spark_session()
Default Feature Store Spark configuration applied.
:: loading settings :: url = jar:file:/Users/jgranados/miniconda3/envs/h2o/lib/python3.9/site-packages/pyspark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /Users/jgranados/.ivy2/cache
The jars for the packages stored in: /Users/jgranados/.ivy2/jars
org.apache.hadoop#hadoop-aws added as a dependency
io.delta#delta-core_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-cc7b6061-aa08-4e2e-ad63-1e457ee799b4;1.0
confs: [default]
found org.apache.hadoop#hadoop-aws;3.3.4 in central
found com.amazonaws#aws-java-sdk-bundle;1.12.262 in central
found org.wildfly.openssl#wildfly-openssl;1.0.7.Final in central
found io.delta#delta-core_2.12;2.2.0 in central
found io.delta#delta-storage;2.2.0 in central
found org.antlr#antlr4-runtime;4.8 in central
:: resolution report :: resolve 127ms :: artifacts dl 6ms
:: modules in use:
com.amazonaws#aws-java-sdk-bundle;1.12.262 from central in [default]
io.delta#delta-core_2.12;2.2.0 from central in [default]
io.delta#delta-storage;2.2.0 from central in [default]
org.antlr#antlr4-runtime;4.8 from central in [default]
org.apache.hadoop#hadoop-aws;3.3.4 from central in [default]
org.wildfly.openssl#wildfly-openssl;1.0.7.Final from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 6 | 0 | 0 | 0 || 6 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-cc7b6061-aa08-4e2e-ad63-1e457ee799b4
confs: [default]
0 artifacts copied, 6 already retrieved (0kB/8ms)
Choose the MLOps scoring endpoint.
MLOPS_ENDPOINT_URL = "https://model.internal.dedicated.h2o.ai/981f5909-a1b2-4d1a-831b-e65675036934/model/score"
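A mistyped endpoint tends to surface only once scoring starts, so a quick sanity check up front may save a round trip. This sketch assumes that scoring endpoints end in `/model/score`, as the example URL above does; `check_scoring_url` is a hypothetical helper, not part of the scoring client.

```python
from urllib.parse import urlparse


def check_scoring_url(url):
    """Return url unchanged if it looks like a scoring endpoint.

    Assumption: endpoints use http(s) and end in "/model/score",
    matching the example in this page.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Not an absolute http(s) URL: {url!r}")
    if not parsed.path.rstrip("/").endswith("/model/score"):
        raise ValueError(f"URL does not end in /model/score: {url!r}")
    return url
```

For example, `MLOPS_ENDPOINT_URL = check_scoring_url(MLOPS_ENDPOINT_URL)` raises early if the path was copied incompletely. It does not contact the server.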
Get a Spark data frame from Feature Store along with a unique ID column used to identify each score.
fs = featurestore.Client("feature-store-api.internal.dedicated.h2o.ai", secure=True)
fs.auth.login()
project = fs.projects.get("bnpparibas")
DATA_FRAME = project.feature_sets.get("BNPParibas.csv").retrieve().as_spark_frame(spark)
ID_COLUMN = "ID"
17-07-2023 08:35:38 : INFO : Connecting to the server feature-store-api.internal.dedicated.h2o.ai ...
17-07-2023 08:35:40 : INFO : Server version: 0.18.1
17-07-2023 08:35:40 : INFO : Client version: 0.18.1
17-07-2023 08:35:40 : INFO : Opening browser to visit: https://auth.internal.dedicated.h2o.ai/auth/realms/hac/protocol/openid-connect/auth?client_id=hac-feature-store&code_challenge=RqAp-ffJCZAEARbrwjsHrRRHJ_ljTtu66siPzFRm0L0&code_challenge_method=S256&redirect_uri=https://feature-store.internal.dedicated.h2o.ai/Callback&response_type=code&scope=openid%20offline_access&state=pLykY74sux
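Because each score is matched back to its row through the ID column, it is worth confirming that the column exists and holds no duplicates before scoring. The sketch below uses only standard Spark data frame methods (`columns`, `count`, `select`, `distinct`); `id_column_is_unique` is a hypothetical helper name.

```python
def id_column_is_unique(data_frame, id_column):
    """Return True if every row has a distinct value in id_column.

    Raises ValueError if the column is not present in the data frame.
    """
    if id_column not in data_frame.columns:
        raise ValueError(f"Column {id_column!r} not found in data frame")
    total_rows = data_frame.count()
    distinct_ids = data_frame.select(id_column).distinct().count()
    return total_rows == distinct_ids
```

With the names used above, `id_column_is_unique(DATA_FRAME, ID_COLUMN)` should return `True`. Note that this triggers two Spark jobs, which may be noticeable on large feature sets.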
And now we score.
spark_df = h2o_mlops_scoring_client.score_data_frame(
mlops_endpoint_url=MLOPS_ENDPOINT_URL,
id_column=ID_COLUMN,
data_frame=DATA_FRAME,
)
spark_df.show()
+------+-----------+----------+
| ID| target.0| target.1|
+------+-----------+----------+
|218057| 0.5960199|0.40398008|
|204563| 0.0965969| 0.9034031|
|126602| 0.18404573|0.81595427|
| 31077| 0.14552522| 0.8544748|
|184547|0.050189316| 0.9498107|
|215697| 0.1665029| 0.8334971|
|134328| 0.2938105| 0.7061895|
|174066| 0.72826254|0.27173743|
|100775| 0.20617479| 0.7938252|
|195239| 0.15678006|0.84321994|
|203262| 0.2985583| 0.7014417|
|109920| 0.25398308| 0.7460169|
| 76561| 0.12035632| 0.8796437|
|110243| 0.12437302| 0.875627|
|188865| 0.15772098| 0.842279|
|125723| 0.08976406|0.91023594|
| 25447| 0.5709034|0.42909658|
|175111| 0.09511238| 0.9048876|
|172872| 0.21316737|0.78683263|
| 2957| 0.38719475|0.61280525|
+------+-----------+----------+
only showing top 20 rows
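The score columns above contain a dot in their names (`target.0`, `target.1`). Spark normally reads a dot in a column reference as struct-field access, so these names need backticks when referenced by name. The following sketch derives a class label from the scores; the helper names, the 0.5 threshold, and the choice of `target.1` as the positive class are illustrative assumptions, and the score columns are assumed to be numeric.

```python
def quoted(column_name):
    """Wrap a column name in backticks so Spark reads any dots literally."""
    return f"`{column_name}`"


def add_predicted_label(scores, positive_column="target.1", threshold=0.5):
    """Add a 0/1 `predicted_label` column from the positive-class score.

    The threshold and the positive class are assumptions for illustration.
    """
    # Imported here so `quoted` can be used without pyspark installed.
    from pyspark.sql import functions as F

    is_positive = F.col(quoted(positive_column)) >= threshold
    return scores.withColumn("predicted_label", is_positive.cast("int"))
```

For example, `add_predicted_label(spark_df).show()` displays the scores with the extra label column.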
Optionally merge the scores into the original data frame.
DATA_FRAME.join(spark_df, on=ID_COLUMN).toPandas()
 | ID | target | v1 | v2 | v3 | v4 | v5 | v6 | v7 | v8 | ... | v125 | v126 | v127 | v128 | v129 | v130 | v131 | time_travel_column_auto_generated | target.0 | target.1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 148267 | 0 | NaN | NaN | C | NaN | NaN | NaN | NaN | NaN | ... | AC | NaN | NaN | NaN | 0 | NaN | NaN | 2023-07-12 13:51:02 | 0.413667 | 0.586333 |
1 | 145203 | 0 | 1.182451 | 11.789324 | C | 4.323275 | 14.250168 | 2.579895 | 3.126089 | 0.145522 | ... | Q | 1.830652 | 2.494916 | 2.307815 | 0 | 0.907064 | 1.803278 | 2023-07-12 13:51:02 | 0.390397 | 0.609603 |
2 | 156749 | 1 | NaN | NaN | C | NaN | NaN | NaN | NaN | NaN | ... | BM | NaN | NaN | NaN | 1 | NaN | NaN | 2023-07-12 13:51:02 | 0.027944 | 0.972056 |
3 | 139024 | 1 | 0.718955 | 10.119158 | C | 3.687911 | 9.791527 | 2.246952 | 2.017312 | 0.243591 | ... | BH | 1.301143 | 2.881558 | 1.942082 | 1 | 2.045535 | 0.753425 | 2023-07-12 13:51:02 | 0.100611 | 0.899389 |
4 | 144475 | 1 | 1.265389 | 9.579234 | C | 4.491259 | 8.399462 | 2.681258 | 2.325580 | 0.057701 | ... | AA | 1.709107 | 2.282832 | 2.295695 | 0 | 1.105883 | 2.127660 | 2023-07-12 13:51:02 | 0.347554 | 0.652446 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
114316 | 35768 | 1 | 1.659937 | 6.722593 | C | 6.856028 | 10.310723 | 2.763571 | 3.633916 | 0.579856 | ... | AK | 1.949568 | 2.792732 | 3.539050 | 0 | 0.829629 | 2.380952 | 2023-07-12 13:51:02 | 0.178249 | 0.821751 |
114317 | 132428 | 1 | 1.089197 | 6.767900 | C | 3.708279 | NaN | 1.836915 | 2.437444 | NaN | ... | P | 1.643828 | 4.128643 | NaN | 0 | 1.719808 | 1.123596 | 2023-07-12 13:51:02 | 0.127764 | 0.872236 |
114318 | 38816 | 1 | 0.289063 | 7.806124 | C | 4.981398 | 8.174613 | 2.984375 | 3.164062 | 1.953577 | ... | BM | 1.935144 | 4.204101 | 1.505429 | 1 | 0.849383 | 0.465115 | 2023-07-12 13:51:02 | 0.028165 | 0.971835 |
114319 | 147732 | 1 | 0.404814 | 9.926242 | C | 4.654258 | 9.132481 | 2.450765 | 2.844639 | 0.033742 | ... | P | 1.616521 | 1.353939 | 2.889227 | 0 | 0.461538 | 1.333333 | 2023-07-12 13:51:02 | 0.312323 | 0.687677 |
114320 | 86572 | 0 | NaN | NaN | C | NaN | NaN | NaN | NaN | NaN | ... | AI | NaN | NaN | NaN | 0 | NaN | NaN | 2023-07-12 13:51:02 | 0.352657 | 0.647343 |
114321 rows × 136 columns
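The join above uses Spark's default inner join, so any row that did not receive a score would be dropped from the result. If such rows should be kept, with null scores, a left join is one option. This is a sketch: `merge_scores` and `keep_unscored` are hypothetical names, while `join(..., on=..., how=...)` is the standard Spark data frame method.

```python
def merge_scores(data_frame, scores, id_column, keep_unscored=True):
    """Join scores onto the original data frame by id_column.

    keep_unscored=True uses a left join, so rows without a score are
    retained with null score columns; False reproduces the inner join.
    """
    how = "left" if keep_unscored else "inner"
    return data_frame.join(scores, on=id_column, how=how)
```

For example, `merge_scores(DATA_FRAME, spark_df, ID_COLUMN).limit(5).toPandas()` previews the merged result. Since `toPandas()` collects everything to the driver, limiting the rows first may be preferable for large feature sets.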