Version: v0.67.0

Scoring with Snowflake

Use Snowflake as source and sink for MLOps batch scoring.

Prerequisites

  • h2o_mlops_scoring_client-*-py3-none-any.whl file or access to the Python Package Index (PyPI)
  • Java
  • Snowflake Credentials

Setup

  • Install the h2o_mlops_scoring_client with pip.
  • Create a conf folder for default configuration files. In the conf folder, create a spark-defaults.conf file with the following contents:
spark.jars.packages net.snowflake:snowflake-jdbc:3.13.29,net.snowflake:spark-snowflake_2.12:2.11.3-spark_3.3
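The conf folder and file described above can also be created from Python. The following is a minimal sketch: the helper name `write_spark_defaults` is hypothetical and not part of h2o_mlops_scoring_client, and the package coordinates are copied from the file contents shown above.

```python
import os

# Package coordinates copied from the spark-defaults.conf contents above.
SNOWFLAKE_PACKAGES = (
    "net.snowflake:snowflake-jdbc:3.13.29,"
    "net.snowflake:spark-snowflake_2.12:2.11.3-spark_3.3"
)


def write_spark_defaults(conf_dir):
    """Create conf_dir if needed and write spark-defaults.conf into it.

    Hypothetical helper for illustration; not part of the scoring client.
    Returns the path of the file that was written.
    """
    conf_dir = os.path.expanduser(conf_dir)
    os.makedirs(conf_dir, exist_ok=True)
    conf_path = os.path.join(conf_dir, "spark-defaults.conf")
    with open(conf_path, "w") as conf_file:
        conf_file.write(f"spark.jars.packages {SNOWFLAKE_PACKAGES}\n")
    return conf_path
```

For example, `write_spark_defaults("~/.h2o_mlops_scoring_client/snowflake-conf")` would create the folder that the Example Usage section points the scorer to.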

Notes

  • If running locally, the number of cores used (and thus parallel processes) can be overridden with:
num_cores = 10
h2o_mlops_scoring_client.spark_master = f"local[{num_cores}]"
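Rather than hard-coding the core count, the master string can be derived from the machine. This is a sketch only: `local_master` is a hypothetical helper, and falling back to `os.cpu_count()` is an assumption about what a sensible default looks like, not documented client behavior.

```python
import os


def local_master(num_cores=None):
    """Build a Spark local master string such as "local[10]".

    Hypothetical helper; when num_cores is omitted it falls back to the
    machine's CPU count (or 1 if that cannot be determined).
    """
    if num_cores is None:
        num_cores = os.cpu_count() or 1
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    return f"local[{num_cores}]"
```

The result would then be assigned as in the override above, for example `h2o_mlops_scoring_client.spark_master = local_master(10)`.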

Example Usage

import h2o_mlops_scoring_client
import json
import os

Point the scorer to the conf directory you want to use.

h2o_mlops_scoring_client.spark_conf_dir = os.path.expanduser("~/.h2o_mlops_scoring_client/snowflake-conf")

Create a dictionary of Snowflake options. Here it is loaded from a JSON file. The required options are 'sfURL', 'sfUser', 'sfPassword', 'sfDatabase', and 'sfWarehouse'.

sf_options_path = os.path.expanduser("~/.h2o_mlops_scoring_client/snowflake-conf/sfOptions.json")
with open(sf_options_path) as sf_options_file:
    sf_options = json.loads(sf_options_file.read())
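Before scoring, it can help to confirm that the loaded dictionary contains every required option. The check below is a minimal sketch: `missing_sf_options` is a hypothetical helper, and the option values are placeholders, not real credentials.

```python
import json

# Required option names as listed in this guide.
REQUIRED_SF_OPTIONS = ("sfURL", "sfUser", "sfPassword", "sfDatabase", "sfWarehouse")


def missing_sf_options(options):
    """Return the required Snowflake option names absent from options."""
    return [key for key in REQUIRED_SF_OPTIONS if key not in options]


# Placeholder values only; a real sfOptions.json would hold actual credentials.
example_options = json.loads(
    '{"sfURL": "<account>.snowflakecomputing.com", "sfUser": "<user>", '
    '"sfPassword": "<password>", "sfDatabase": "<database>"}'
)
print(missing_sf_options(example_options))
```

With the placeholder dictionary above, the check reports that 'sfWarehouse' is missing; an empty list means all required options are present.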

Choose the MLOps scoring endpoint.

MLOPS_ENDPOINT_URL = "https://model.internal.dedicated.h2o.ai/f325d002-3c3f-4283-9585-1569afc5dd12/model/score"

Set the Snowflake query or table to use along with a unique ID column used to identify each score.

ID_COLUMN = "ID"
SOURCE_DATA = "select * from BNPPARIBAS.PUBLIC.CSV"
SINK_LOCATION = "BNPPARIBAS.PUBLIC.SCORES"

Set the source and sink formats. Here the source is a Snowflake query and the sink is a Snowflake table.

SOURCE_FORMAT = h2o_mlops_scoring_client.Format.SNOWFLAKE_QUERY
SINK_FORMAT = h2o_mlops_scoring_client.Format.SNOWFLAKE_TABLE

Set the sink write mode. The value of each WriteMode member describes its behavior.

h2o_mlops_scoring_client.WriteMode.OVERWRITE.value

'Overwrite existing files'

SINK_WRITE_MODE = h2o_mlops_scoring_client.WriteMode.OVERWRITE

If the file count of the source is small (fewer files than cores), preprocess the table into partitions to take advantage of parallel scoring. The number of partitions should equal three times the number of cores, for example 30 partitions for 10 cores. If the file count is larger than the number of cores, repartitioning may slow down scoring, because each individual file already counts as a partition.

def preprocess(spark_df):
    return spark_df.repartition(30)
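To keep the partition count tied to the core count instead of hard-coding 30, the preprocess function can be generated. This is a sketch under the three-partitions-per-core guideline stated above; `make_preprocess` is a hypothetical helper, not part of the scoring client.

```python
def make_preprocess(num_cores, partitions_per_core=3):
    """Return a preprocess function that repartitions a Spark DataFrame.

    Hypothetical helper: the target is num_cores * partitions_per_core
    partitions, following the guideline of three partitions per core.
    """
    num_partitions = num_cores * partitions_per_core

    def preprocess(spark_df):
        # DataFrame.repartition(n) returns a new DataFrame with n partitions.
        return spark_df.repartition(num_partitions)

    return preprocess
```

With 10 cores, `make_preprocess(10)` returns a function equivalent to the `preprocess` defined above, and it can be passed as the `preprocess_method` argument in the same way.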

And now we score.

h2o_mlops_scoring_client.score_source_sink(
    mlops_endpoint_url=MLOPS_ENDPOINT_URL,
    id_column=ID_COLUMN,
    source_data=SOURCE_DATA,
    source_format=SOURCE_FORMAT,
    source_config=sf_options,
    sink_location=SINK_LOCATION,
    sink_format=SINK_FORMAT,
    sink_config=sf_options,
    sink_write_mode=SINK_WRITE_MODE,
    preprocess_method=preprocess,
)

23/05/09 14:51:27 INFO h2o_mlops_scoring_client: Starting Spark context

Default Snowflake Spark configuration applied.

23/05/09 14:51:28 WARN Utils: Your hostname, M16Max-100638.local resolves to a loopback address: 127.0.0.1; using 192.168.1.8 instead (on interface en0)
23/05/09 14:51:28 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Ivy Default Cache set to: /Users/jgranados/.ivy2/cache
The jars for the packages stored in: /Users/jgranados/.ivy2/jars
net.snowflake#snowflake-jdbc added as a dependency
net.snowflake#spark-snowflake_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-ddb817c2-6af8-415e-93f1-2d11bbaae1c5;1.0
confs: [default]

:: loading settings :: url = jar:file:/Users/jgranados/miniconda3/envs/h2o/lib/python3.9/site-packages/pyspark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml

found net.snowflake#snowflake-jdbc;3.13.29 in central
found net.snowflake#spark-snowflake_2.12;2.11.3-spark_3.3 in central
found net.snowflake#snowflake-ingest-sdk;0.10.8 in central
found net.snowflake#snowflake-jdbc;3.13.30 in central
:: resolution report :: resolve 90ms :: artifacts dl 3ms
:: modules in use:
net.snowflake#snowflake-ingest-sdk;0.10.8 from central in [default]
net.snowflake#snowflake-jdbc;3.13.30 from central in [default]
net.snowflake#spark-snowflake_2.12;2.11.3-spark_3.3 from central in [default]
:: evicted modules:
net.snowflake#snowflake-jdbc;3.13.29 by [net.snowflake#snowflake-jdbc;3.13.30] in [default]

|                  |            modules            ||   artifacts   |
|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
|      default     |   4   |   0   |   0   |   1   ||   3   |   0   |

:: retrieving :: org.apache.spark#spark-submit-parent-ddb817c2-6af8-415e-93f1-2d11bbaae1c5
confs: [default]
0 artifacts copied, 3 already retrieved (0kB/6ms)
23/05/09 14:51:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/05/09 14:51:29 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
23/05/09 14:51:30 INFO h2o_mlops_scoring_client: Connecting to H2O.ai MLOps scorer at 'https://model.internal.dedicated.h2o.ai/f325d002-3c3f-4283-9585-1569afc5dd12/model/score'
23/05/09 14:51:33 INFO h2o_mlops_scoring_client: Applying preprocess method
23/05/09 14:51:33 INFO h2o_mlops_scoring_client: Starting scoring from 'select * from BNPPARIBAS.PUBLIC.CSV' to 'BNPPARIBAS.PUBLIC.SCORES'
23/05/09 14:51:34 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
23/05/09 14:52:53 INFO h2o_mlops_scoring_client: Scoring complete
23/05/09 14:52:53 INFO h2o_mlops_scoring_client: Total run time: 0:01:26
23/05/09 14:52:53 INFO h2o_mlops_scoring_client: Scoring run time: 0:01:20
23/05/09 14:52:53 INFO h2o_mlops_scoring_client: Stopping Spark context
23/05/09 14:52:53 WARN SparkConnectorContext$: Finish cancelling all queries for local-1683669089802

