Batch scoring
This page describes how to use the H2O MLOps Python client for batch scoring.
For more information about batch scoring and the supported source and sink types, see Batch scoring.
Configure the input source
To list available source connectors, run:
mlops.batch_connectors.source_specs.list()
Use the following code to configure the input source:
- Amazon S3
- GCP
- Azure
- MinIO
- JDBC
source = h2o_mlops.options.BatchSourceOptions(
    spec_uid="s3",
    config={
        "region": "us-west-2",
        "accessKeyID": credentials['AccessKeyId'],
        "secretAccessKey": credentials['SecretAccessKey'],
        "sessionToken": credentials['SessionToken'],
    },
    mime_type=h2o_mlops.types.MimeType.CSV,
    location="s3://<bucket-name>/<path-to-input-file>.csv",
)
Public S3 buckets are also supported as an input source. To read from a public S3 bucket, leave the access key and secret key fields empty. Public buckets are supported only as an input source, not as an output sink.
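For example, a minimal configuration for reading from a public bucket could look like the following sketch (it assumes the credential fields can simply be omitted from config rather than set to empty strings):

source = h2o_mlops.options.BatchSourceOptions(
    spec_uid="s3",
    config={
        "region": "us-west-2",  # no credentials required for a public bucket
    },
    mime_type=h2o_mlops.types.MimeType.CSV,
    location="s3://<public-bucket-name>/<path-to-input-file>.csv",
)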
source = h2o_mlops.options.BatchSourceOptions(
    spec_uid="gcp",
    config={
        "projectID": credentials['projectID'],
        "credentials": credentials['credentials'],
    },
    mime_type=h2o_mlops.types.MimeType.CSV,
    location="<location>",
)
source = h2o_mlops.options.BatchSourceOptions(
    spec_uid="azure",
    config={
        "accountKey": credentials['accountKey'],
        "sasToken": credentials['sasToken'],
        "containerName": credentials['containerName'],
    },
    mime_type=h2o_mlops.types.MimeType.CSV,
    location="https://<storage-account-name>.blob.core.windows.net/<path-to-file>.csv",
)
source = h2o_mlops.options.BatchSourceOptions(
    spec_uid="s3",
    config={
        "region": "us-west-2",
        "accessKeyID": credentials['AccessKeyId'],
        "secretAccessKey": credentials['SecretAccessKey'],
        "sessionToken": credentials['SessionToken'],
        "pathStyle": True,
        "endpoint": "https://s3.minio.location",
    },
    mime_type=h2o_mlops.types.MimeType.CSV,
    location="s3://<bucket-name>/<path-to-input-file>.csv",
)
source = h2o_mlops.options.BatchSourceOptions(
    spec_uid="jdbc",
    config={
        "table": "table_with_data",
        "driver": "postgres",
        # Read the table in parallel: split it into 8 partitions over
        # partitionColumn, bounded by lowerBound and upperBound.
        "numPartitions": 8,
        "lowerBound": "2023-01-01 00:00:00",
        "upperBound": "2024-01-01 00:00:00",
        "partitionColumn": "created_at",
        # secretParams values fill the {{username}} and {{password}}
        # placeholders in the location string.
        "secretParams": {
            "username": credentials["username"],
            "password": credentials["password"],
        },
    },
    mime_type=h2o_mlops.types.MimeType.JDBC,
    location="postgres://h2oai-postgresql.default:5432/db_name?user={{username}}&password={{password}}&sslmode=disable",
)
Configure the output location
To list available sink connectors, run:
mlops.batch_connectors.sink_specs.list()
This command returns schema details, supported paths, and MIME types.
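For example, you can iterate over the returned specifications to inspect them (a minimal sketch, assuming the returned collection is iterable and each entry has a readable string representation):

for spec in mlops.batch_connectors.sink_specs.list():
    print(spec)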
Set up the output location where the batch scoring results will be stored:
- Amazon S3
- GCP
- Azure
- MinIO
- JDBC
from datetime import datetime

# Append a timestamp so each run writes to a unique output directory.
output_location = "s3://<bucket-name>/<path-to-output-directory>/" + datetime.now().strftime("%Y%m%d-%H%M%S")
sink = h2o_mlops.options.BatchSinkOptions(
    spec_uid="s3",
    config={
        "region": "us-west-2",
        "accessKeyID": credentials['AccessKeyId'],
        "secretAccessKey": credentials['SecretAccessKey'],
        "sessionToken": credentials['SessionToken'],
    },
    mime_type=h2o_mlops.types.MimeType.JSONL,
    location=output_location,
)
from datetime import datetime

output_location = "<location>" + datetime.now().strftime("%Y%m%d-%H%M%S")
sink = h2o_mlops.options.BatchSinkOptions(
    spec_uid="gcp",
    config={
        "projectID": credentials['projectID'],
        "credentials": credentials['credentials'],
    },
    mime_type=h2o_mlops.types.MimeType.JSONL,
    location=output_location,
)
from datetime import datetime

output_location = "https://<storage-account-name>.blob.core.windows.net/<path-to-output-directory>/" + datetime.now().strftime("%Y%m%d-%H%M%S")
sink = h2o_mlops.options.BatchSinkOptions(
    spec_uid="azure",
    config={
        "accountKey": credentials['accountKey'],
        "sasToken": credentials['sasToken'],
        "containerName": credentials['containerName'],
    },
    mime_type=h2o_mlops.types.MimeType.JSONL,
    location=output_location,
)
from datetime import datetime

output_location = "s3://<bucket-name>/" + datetime.now().strftime("%Y%m%d-%H%M%S")
sink = h2o_mlops.options.BatchSinkOptions(
    spec_uid="s3",
    config={
        "region": "us-west-2",
        "accessKeyID": credentials['AccessKeyId'],
        "secretAccessKey": credentials['SecretAccessKey'],
        "sessionToken": credentials['SessionToken'],
        "pathStyle": True,
        "endpoint": "https://s3.minio.location",
    },
    mime_type=h2o_mlops.types.MimeType.JSONL,
    location=output_location,
)
sink = h2o_mlops.options.BatchSinkOptions(
    spec_uid="jdbc",
    config={
        "driver": "postgres",
        "table": "new_table",
        "secretParams": {
            "username": credentials["username"],
            "password": credentials["password"],
        },
    },
    mime_type=h2o_mlops.types.MimeType.JDBC,
    location="postgres://h2oai-postgresql.default:5432/db_name?user={{username}}&password={{password}}&sslmode=disable",
)
Create a batch scoring job
First, retrieve the scoring runtime for the model:
scoring_runtime = model.experiment().scoring_runtimes[0]
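Indexing with [0] simply selects the first runtime in the list; if the experiment supports multiple scoring runtimes, choose the entry you need instead.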
To retrieve the available source and sink connector specifications for job creation, use:
mlops.batch_connectors.source_specs.list()
and
mlops.batch_connectors.sink_specs.list()
Create the batch scoring job:
job = workspace.batch_scoring_jobs.create(
    source=source,
    sink=sink,
    model=model,
    scoring_runtime=scoring_runtime,
    kubernetes_options=h2o_mlops.options.BatchKubernetesOptions(
        replicas=2,
        min_replicas=1,
    ),
    mini_batch_size=100,  # number of rows sent per request during batch processing
    name="DEMO JOB",
)
Retrieve the job ID:
job.uid
Wait for job completion
During the execution of the following code, you can view the log output from both the scorer and the batch scoring job.
job.wait()
By default, this command prints logs while waiting. To wait for job completion without printing any logs, use:
job.wait(logs=False)
List all jobs
workspace.batch_scoring_jobs.list()
Retrieve a job by ID
workspace.batch_scoring_jobs.get(uid=...)
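For example, you can reload a job later from the UID captured earlier via job.uid (the placeholder below stands in for a real job UID):

job = workspace.batch_scoring_jobs.get(uid="<job-uid>")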
Cancel a job
job.cancel()
By default, this command blocks until the job is fully canceled. If you want to cancel without waiting for completion, use:
job.cancel(wait=False)
Delete a job
job.delete()