Batch scoring
Batch scoring is the process of making predictions on a large set of data all at once, instead of one by one in real time. You can run batch scoring through both the H2O MLOps UI and the H2O MLOps Python client.
Batch scoring jobs in H2O MLOps create a dedicated Kubernetes runtime that reads data from an input source and stores the predicted results in an output location.
To run a batch scoring job, you must define the source of the input data and the location (sink) for the scored output.
H2O MLOps supports the following source and sink types:
- Azure Blob Storage
- Amazon S3
- Google Cloud Storage (GCS)
- MinIO
- JDBC
JDBC tables, CSV files (without a header row), and JSON Lines files are supported as input. Output can be stored in CSV format (without a header row), JSON Lines format, or written directly to a JDBC table.
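To make the supported input formats concrete, the following sketch writes the same records as a headerless CSV file and as a JSON Lines file. The feature names, values, and file paths are illustrative only; your columns must match what the scored model expects.

```python
import csv
import json

# Example records to score; the feature names are illustrative only and
# must match the columns the model was trained on.
records = [
    {"age": 34, "income": 52000, "state": "CA"},
    {"age": 51, "income": 78000, "state": "TX"},
    {"age": 28, "income": 43000, "state": "NY"},
]

# CSV input (no header row): one record per line, columns in a fixed order.
with open("input.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for r in records:
        writer.writerow([r["age"], r["income"], r["state"]])

# JSON Lines input: one JSON object per line.
with open("input.jsonl", "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")
```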
Batch scoring with the UI
This section describes how to start a batch scoring job using the H2O MLOps UI.
To batch score a model using the UI, follow these steps:
1. From Manage projects, select the project that contains the model you want to batch score.
2. In the left navigation bar, click Batch scoring jobs.
3. Click Start new job.
4. On the Start new job page, enter a name for the batch scoring job in the Job name field.
5. Select the model from the Model drop-down menu.
6. Choose the artifact type and runtime from the Artifact type and runtime drop-down menu.
7. Under Advanced settings, configure the batch size and Kubernetes options, such as the number of replicas and resource requests and limits.
8. Specify the source and sink configuration. Select the appropriate spec type (for example, S3 Spec) from the Source spec drop-down menu and fill out the configuration fields.
Source spec
Note: The MinIO specification uses the same configuration fields as the S3 specification. To select MinIO as the source spec type, choose S3 spec from the Source spec drop-down menu.
- S3 spec
- Azure Blob Storage spec
- GCS spec
- JDBC spec
For S3 Spec, provide the following details:
- accessKeyID (required): The unique identifier for AWS authentication. Not needed for public S3 buckets.
- secretAccessKey (required): The secret access key for AWS authentication. Not needed for public S3 buckets.
- sessionToken: The temporary security token for time-limited access to AWS resources.
- pathStyle: Select this option to enable path-style URL construction for the S3 bucket (see the sketch after this list).
- region (required): The AWS geographical region where resources or services will be accessed.
- endpoint: The custom URL to override default AWS service endpoint for specialized configurations.
- partSize: The size of each partition in bytes for reading data.
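As a rough illustration of how the pathStyle and endpoint fields interact, the sketch below builds the two S3 URL styles for a hypothetical bucket and key. The bucket name, region, and MinIO endpoint are placeholders; self-hosted S3-compatible stores such as MinIO are typically addressed with path-style URLs together with a custom endpoint.

```python
bucket = "example-bucket"      # placeholder bucket name
key = "data/input.csv"         # placeholder object key
region = "us-east-1"

# Default virtual-hosted-style URL (pathStyle disabled):
virtual_hosted = f"https://{bucket}.s3.{region}.amazonaws.com/{key}"

# Path-style URL (pathStyle enabled); the bucket appears in the path:
path_style = f"https://s3.{region}.amazonaws.com/{bucket}/{key}"

# With a custom endpoint (for example, a MinIO server), path-style
# addressing is usually combined with the endpoint override:
minio_endpoint = "https://minio.internal.example.com"   # placeholder
minio_path_style = f"{minio_endpoint}/{bucket}/{key}"

print(virtual_hosted)
print(path_style)
print(minio_path_style)
```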
For Azure Blob Storage spec, provide the following details:
- accountKey (required): The Azure storage account key.
- sasToken (required): The shared access signature (SAS) token for accessing the storage account.
- containerName (required): The name of the blob storage container.
- partitionSize: The size of each partition in bytes for reading data.
For GCS spec, provide the following details:
- credentials (required): The service account JSON credentials (see the example after this list).
- projectID (required): The Google Cloud Project ID.
- endpoint: The custom endpoint URL.
- partSize: The size of each partition in bytes for reading data.
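The credentials field above expects a Google Cloud service account key in JSON form. The snippet below sketches the typical shape of such a key file; every value is a placeholder, and the projectID field should match the key's project_id entry.

```python
# Typical structure of a service-account key file downloaded from the
# Google Cloud console; every value below is a placeholder.
service_account_key = {
    "type": "service_account",
    "project_id": "my-gcp-project",
    "private_key_id": "0123456789abcdef",
    "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
    "client_email": "batch-scoring@my-gcp-project.iam.gserviceaccount.com",
    "client_id": "123456789012345678901",
    "token_uri": "https://oauth2.googleapis.com/token",
}

# The projectID configuration field should match this entry.
print(service_account_key["project_id"])   # my-gcp-project
```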
For JDBC spec, provide the following details:
- secretParams: A set of key-value pairs containing sensitive parameters (for example, passwords) that are used to dynamically construct the JDBC connection string. Each key and value must be a string. For example, you can use `postgres://user:{{pass}}`, where `pass` is defined in `secretParams` (see the sketch after this list for how the placeholder is resolved).
- driver (required): The JDBC driver to use. Supported values are `mysql`, `postgres`, `mssql`, and `oracle`.
- table (required): The table to read from. You can also use any valid SQL expression for a `FROM` clause, such as a subquery enclosed in parentheses.
- numPartitions: The number of partitions to divide the table into for parallel reads. Required if partitioning is enabled. This setting determines the level of read parallelism.
- lowerBound: The lower boundary value used to compute partition strides. It is not used to filter rows and must match the data type of the `partitionColumn`.
- upperBound: The upper boundary value used to compute partition strides. Like `lowerBound`, it is only used for partitioning and must match the data type of the `partitionColumn`.
- partitionColumn: The column used to determine how the data is partitioned. This must be a numeric, date, or timestamp column. For example, `numPartitions: 4` with `lowerBound: 0` and `upperBound: 1000` on a numeric id column splits the read into four roughly equal id ranges.
- Source MIME type (required): The MIME type (media type) of the input data. Select an appropriate option from the drop-down menu.
- Source location (required): The path to the input data source.
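To show how the secretParams templating described above fits together, here is a minimal sketch that resolves a {{pass}} placeholder against the secret key-value pairs. H2O MLOps performs this substitution internally; the helper below only mimics it for illustration, and the user, host, database, and secret values are placeholders.

```python
import re

# Connection-string template as you would enter it in the UI; the
# {{pass}} placeholder refers to a key defined in secretParams.
connection_template = "postgres://scoring_user:{{pass}}@db.example.com:5432/analytics"

# Sensitive values kept out of the template itself; keys and values
# must both be strings.
secret_params = {"pass": "s3cr3t-value"}

def resolve(template: str, secrets: dict) -> str:
    """Replace each {{key}} placeholder with its value from `secrets`."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: secrets[m.group(1)], template)

print(resolve(connection_template, secret_params))
# postgres://scoring_user:s3cr3t-value@db.example.com:5432/analytics
```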
Now, select the appropriate spec type (for example, S3 Spec) from the Sink spec drop-down menu and fill out the configuration fields.
Sink spec
Note: The MinIO specification uses the same configuration fields as the S3 specification. To select MinIO as the sink spec type, choose S3 spec from the Sink spec drop-down menu.
- S3 spec
- Azure Blob Storage spec
- GCS spec
- JDBC spec
For S3 Spec, provide the following details:
- accessKeyID (required): The unique identifier for AWS authentication.
- secretAccessKey (required): The secret access key for AWS authentication.
- sessionToken: The temporary security token for time-limited access to AWS resources.
- pathStyle: Select this option to enable path-style URL construction for the S3 bucket.
- region (required): The AWS geographical region where resources or services will be accessed.
- endpoint: The custom URL to override default AWS service endpoint for specialized configurations.
- writeConcurrency: The number of concurrent write operations.
For Azure Blob Storage spec, provide the following details:
- accountKey (required): The Azure storage account key.
- sasToken (required): The shared access signature (SAS) token for accessing the storage account.
- containerName (required): The name of the blob storage container.
- writeConcurrency: The number of concurrent write operations allowed.
For GCS spec, provide the following details:
- credentials (required): The service account JSON credentials.
- projectID (required): The Google Cloud Project ID.
- endpoint: The custom endpoint URL.
- writeConcurrency: The number of concurrent write operations.
For JDBC spec, provide the following details:
- secretParams: A set of key-value pairs containing sensitive parameters (for example, passwords) that are used to dynamically construct the JDBC connection string. Each key and value must be a string. For example, you can use `postgres://user:{{pass}}`, where `pass` is defined in `secretParams`.
- driver (required): The JDBC driver to use. Supported values are `mysql`, `postgres`, `mssql`, and `oracle`.
- table (required): The JDBC table to write the scored output into.
- Sink MIME type (required): The MIME type (media type) of the output data. Select an appropriate option from the drop-down menu.
- Sink location (required): The destination path where the output data will be written.
9. After filling out the configuration fields, click Start job to initiate the batch scoring job.
Batch scoring with the Python client
To learn how to perform batch scoring using the H2O MLOps Python client, see the Batch scoring tutorial in the Python client tutorials section.