Batch scoring
Batch scoring is the process of making predictions on a large set of data all at once, instead of one by one in real time. It is available through both the UI and the H2O MLOps Python client.
Batch scoring jobs in H2O MLOps create a dedicated Kubernetes runtime that reads data from an input source and stores the predicted results in an output location.
To run a batch scoring job, you must define the source of the input data and the location (sink) for the scored output.
H2O MLOps supports the following source and sink types:
- Azure Blob Storage
- Amazon S3
- Google Cloud Storage (GCS)
- MinIO
- JDBC
- JDBC tables, CSV files (without header), and JSON lines are supported as input.
- Output can be stored in CSV format (without header), JSON lines format, or written directly to a JDBC table.
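As a quick illustration of these file formats (not specific to H2O MLOps), the sketch below writes the same small dataset as a header-less CSV file and as JSON Lines using pandas; the column names and file paths are placeholders.

```python
# Illustrative only: the two supported file formats for input data.
# Column names and file paths are placeholders.
import pandas as pd

df = pd.DataFrame({"age": [34, 51, 28], "income": [52000, 87000, 41000]})

# CSV input must not contain a header row.
df.to_csv("scoring_input.csv", index=False, header=False)

# JSON Lines input: one JSON object per line.
df.to_json("scoring_input.jsonl", orient="records", lines=True)
```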
Batch scoring with the UI
This section describes how to start a batch scoring job using the H2O MLOps UI.
To batch score a model using the UI, follow these steps:
1. From Manage projects, select the project that contains the model you want to batch score.
2. In the left navigation bar, click Batch scoring jobs.
3. Click Start new job.
4. On the Start new job page, enter a name for the batch scoring job in the Job name field.
5. Select the model from the Model drop-down menu.
6. Choose the artifact type and runtime from the Artifact type and runtime drop-down menu.
7. Under Advanced settings, configure the batch size and Kubernetes options, such as the number of replicas and resource requests and limits.
8. Specify the source and sink configuration. Select the appropriate spec type (for example, S3 spec) from the Source spec drop-down menu and fill out the configuration fields.
Source spec
Note: The MinIO specification uses the same configuration fields as the S3 specification. To select MinIO as the source spec type, choose S3 spec from the Source spec drop-down menu.
For S3 Spec, provide the following details:
- accessKeyID (required): The unique identifier for AWS authentication. Not required for public S3 buckets.
- secretAccessKey (required): The private password for AWS authentication. Not required for public S3 buckets.
- sessionToken: The temporary security token for time-limited access to AWS resources.
- pathStyle: Select this option to enable path-style URL construction for the S3 bucket.
- region (required): The AWS geographical region where resources or services will be accessed.
- endpoint: The custom URL to override default AWS service endpoint for specialized configurations.
- partSize: The size of each partition in bytes for reading data.
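If you want to use a temporary sessionToken instead of long-lived credentials, one common way to obtain one (outside of H2O MLOps) is through AWS STS, for example with boto3. The sketch below is illustrative; the duration is an arbitrary example, and your AWS credentials must already be configured for boto3.

```python
# Illustrative only: requesting temporary AWS credentials through STS.
# The returned values map to accessKeyID, secretAccessKey, and sessionToken
# in the S3 spec. The duration is an arbitrary example.
import boto3

sts = boto3.client("sts")
creds = sts.get_session_token(DurationSeconds=3600)["Credentials"]

access_key_id = creds["AccessKeyId"]
secret_access_key = creds["SecretAccessKey"]
session_token = creds["SessionToken"]
```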
For Azure Blob Storage spec, provide the following details:
- accountKey (required): The Azure storage account key.
- sasToken (required): The shared access signature (SAS) token for accessing the storage account.
- containerName (required): The name of the blob storage container.
- partitionSize: The size of each partition in bytes for reading data.
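If you need to generate a SAS token for the container, one option (outside of H2O MLOps) is the azure-storage-blob package, as sketched below; the account name, account key, container name, permissions, and expiry are placeholders chosen for illustration.

```python
# Illustrative only: generating a container-level SAS token with
# azure-storage-blob. All values are placeholders.
from datetime import datetime, timedelta, timezone

from azure.storage.blob import ContainerSasPermissions, generate_container_sas

sas_token = generate_container_sas(
    account_name="mystorageaccount",
    container_name="scoring-data",
    account_key="<account-key>",
    permission=ContainerSasPermissions(read=True, list=True),
    expiry=datetime.now(timezone.utc) + timedelta(hours=1),
)
```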
For GCS spec, provide the following details:
- credentials (required): The service account JSON credentials.
- projectID (required): The Google Cloud Project ID.
- endpoint: The custom endpoint URL.
- partSize: The size of each partition in bytes for reading data.
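The credentials field expects the full service account key JSON. As a small sanity-check sketch (not part of H2O MLOps), you can read the key file to confirm the values that the spec asks for; the file path is a placeholder.

```python
# Illustrative only: reading a Google service account key file to pull out
# the values the GCS spec asks for. The file path is a placeholder.
import json

with open("service-account.json") as f:
    sa = json.load(f)

credentials_json = json.dumps(sa)  # full JSON for the credentials field
project_id = sa["project_id"]      # value for the projectID field
```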
For JDBC spec, provide the following details:
- secretParams: The set of key-value pairs that contain sensitive parameters (e.g., passwords) used to dynamically construct the JDBC connection string. Each key and value must be a string. For example, you can use postgres://user:{{pass}}, where pass is defined in secretParams.
- driver (required): The JDBC driver to use. Supported values include mysql, postgres, mssql, and oracle.
- table (required): The table to read from. You can also use any valid SQL expression for a FROM clause, such as a subquery enclosed in parentheses.
- numPartitions: The number of partitions to divide the table into for parallel reads. Required if partitioning is enabled. This setting determines the level of read parallelism (see the sketch after this list).
- lowerBound: The lower boundary value used to compute partition strides. It is not used to filter rows and must match the data type of the partitionColumn.
- upperBound: The upper boundary value used to compute partition strides. Like lowerBound, it is only used for partitioning and must match the data type of the partitionColumn.
- partitionColumn: The column used to determine how the data is partitioned. This must be a numeric, date, or timestamp column.
- Source MIME type (required): The MIME type (media type) of the input data. Select an appropriate option from the drop-down menu.
- Source location (required): The path to the input data source.
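To show how the partitioning fields fit together: the range between lowerBound and upperBound on partitionColumn is split into numPartitions strides, and each stride becomes a separate parallel read. The sketch below mimics the typical (Spark-style) stride computation; it is an illustration only, and the column name and bounds are made up. Note how the first and last partitions are left open-ended, so rows outside the bounds are still read.

```python
# Illustrative only: how numPartitions, lowerBound, upperBound, and
# partitionColumn typically split a JDBC read into parallel queries.
# The column name and bounds are made up for the example.
def partition_predicates(column, lower, upper, num_partitions):
    stride = (upper - lower) // num_partitions
    predicates = []
    for i in range(num_partitions):
        lo = lower + i * stride
        hi = lower + (i + 1) * stride
        if i == 0:
            predicates.append(f"{column} < {hi} OR {column} IS NULL")
        elif i == num_partitions - 1:
            predicates.append(f"{column} >= {lo}")
        else:
            predicates.append(f"{column} >= {lo} AND {column} < {hi}")
    return predicates

# Four parallel reads over customer_id values 0..1000:
for predicate in partition_predicates("customer_id", 0, 1000, 4):
    print(predicate)
```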
Now, select the appropriate spec type (for example, S3 Spec) from the Sink spec drop-down menu and fill out the configuration fields.
Sink spec
Note: The MinIO specification uses the same configuration fields as the S3 specification. To select MinIO as the sink spec type, choose S3 spec from the Sink spec drop-down menu.
For S3 Spec, provide the following details:
- accessKeyID (required): The unique identifier for AWS authentication.
- secretAccessKey (required): The private password for AWS authentication.
- sessionToken: The temporary security token for time-limited access to AWS resources.
- pathStyle: Select this option to enable path-style URL construction for the S3 bucket.
- region (required): The AWS geographical region where resources or services will be accessed.
- endpoint: The custom URL to override default AWS service endpoint for specialized configurations.
- writeConcurrency: The number of concurrent write operations.
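For context on the pathStyle option, the sketch below shows the difference between the default virtual-hosted S3 URL form and the path-style form; the bucket, region, and key are placeholders. Path-style addressing is commonly needed for S3-compatible stores such as MinIO.

```python
# Illustrative only: the two S3 URL forms the pathStyle option refers to.
# Bucket, region, and key are placeholders.
bucket, region, key = "my-bucket", "us-east-1", "output/predictions.csv"

virtual_hosted = f"https://{bucket}.s3.{region}.amazonaws.com/{key}"  # default
path_style = f"https://s3.{region}.amazonaws.com/{bucket}/{key}"      # pathStyle enabled

print(virtual_hosted)
print(path_style)
```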
For Azure Blob Storage spec, provide the following details:
- accountKey (required): The Azure storage account key.
- sasToken (required): The shared access signature (SAS) token for accessing the storage account.
- containerName (required): The name of the blob storage container.
- writeConcurrency: The number of concurrent write operations allowed.
For GCS spec, provide the following details:
- credentials (required): The service account JSON credentials.
- projectID (required): The Google Cloud Project ID.
- endpoint: The custom endpoint URL.
- writeConcurrency: The number of concurrent write operations.
For JDBC spec, provide the following details:
- secretParams: The set of key-value pairs that contain sensitive parameters (e.g., passwords) used to dynamically construct the JDBC connection string. Each key and value must be a string. For example, you can use postgres://user:{{pass}}, where pass is defined in secretParams (see the sketch after this list).
- driver (required): The JDBC driver to use. Supported values include mysql, postgres, mssql, and oracle.
- table (required): The JDBC table to write to.
- Sink MIME type (required): The MIME type (media type) of the output data. Select an appropriate option from the drop-down menu.
- Sink location (required): The destination path where the output data will be written.
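To make the secretParams behavior concrete, the sketch below shows the placeholder-substitution pattern that a {{key}} template implies; the connection string and secret value are examples only, and the actual substitution is performed by the batch scoring job.

```python
# Illustrative only: how {{key}} placeholders in a JDBC connection string are
# filled from secretParams. The real substitution is done by the batch
# scoring job; this just demonstrates the pattern.
import re

connection_string = "postgres://user:{{pass}}@db.example.com:5432/scoring"
secret_params = {"pass": "s3cr3t"}  # example value

resolved = re.sub(
    r"\{\{(\w+)\}\}",
    lambda match: secret_params[match.group(1)],
    connection_string,
)
print(resolved)  # postgres://user:s3cr3t@db.example.com:5432/scoring
```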
9. After filling out the configuration fields, click Start job to initiate the batch scoring job.
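Once the job completes, the scored results in the sink location can be read with any tool that understands the chosen output format. For example, a JSON Lines or header-less CSV result downloaded locally could be inspected with pandas; the file names below are placeholders.

```python
# Illustrative only: loading batch scoring output downloaded from the sink.
# File names are placeholders.
import pandas as pd

# JSON Lines output: one prediction per line.
predictions = pd.read_json("predictions.jsonl", lines=True)

# CSV output has no header row, so supply header=None.
predictions_csv = pd.read_csv("predictions.csv", header=None)

print(predictions.head())
```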
Batch scoring with Python client
To learn how to perform batch scoring using the H2O MLOps Python client, see the Batch scoring tutorial in the Python client tutorials section.