Batch scoring
Batch scoring is the process of making predictions on a large set of data all at once, instead of one by one in real time. It is available through both the UI and the H2O MLOps Python client.
Batch scoring jobs in H2O MLOps create a dedicated Kubernetes runtime that reads data from an input source and stores the predicted results in an output location.
To run a batch scoring job, you must define the source of the input data and the location (sink) for the scored output.
H2O MLOps supports the following source and sink types:
- Azure Blob Storage
- Amazon S3
- Google Cloud Storage (GCS)
- MinIO
- JDBC
- JDBC tables, CSV files (without header), and JSON lines are supported as input.
- Output can be stored in CSV format (without header), JSON lines format, or written directly to a JDBC table.
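As a quick illustration of these file formats (not specific to H2O MLOps), the sketch below writes the same small dataset as a header-less CSV file and as JSON Lines using pandas; the column names and file paths are placeholders.

```python
# Illustrative only: the two supported file formats for input data.
# Column names and file paths are placeholders.
import pandas as pd

df = pd.DataFrame({"age": [34, 51, 28], "income": [52000, 87000, 41000]})

# CSV input must not contain a header row.
df.to_csv("scoring_input.csv", index=False, header=False)

# JSON Lines input: one JSON object per line.
df.to_json("scoring_input.jsonl", orient="records", lines=True)
```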
Batch scoring with the UI
This section describes how to start a batch scoring job using the H2O MLOps UI.
To batch score a model using the UI, follow these steps:
1. From Manage projects, select the project that contains the model you want to batch score.
2. In the left navigation bar, click Batch scoring jobs.
3. Click Start new job.
4. On the Start new job page, enter a name for the batch scoring job in the Job name field.
5. Select the model from the Model drop-down menu.
6. Choose the artifact type and runtime from the Artifact type and runtime drop-down menu.
7. Under Advanced settings, configure the batch size and Kubernetes options, such as the number of replicas and resource requests and limits.
8. Specify the source and sink configuration. Select the appropriate spec type (for example, S3 spec) from the Source spec drop-down menu and fill out the configuration fields.
Source spec
Note: The MinIO specification uses the same configuration fields as the S3 specification. To select MinIO as the source spec type, choose S3 spec from the Source spec drop-down menu.
For S3 Spec, provide the following details:
- accessKeyID (required): The unique identifier for AWS authentication. Not required for public S3 buckets.
- secretAccessKey (required): The private password for AWS authentication. Not required for public S3 buckets.
- sessionToken: The temporary security token for time-limited access to AWS resources.
- pathStyle: Select this option to enable path-style URL construction for the S3 bucket.
- region (required): The AWS geographical region where resources or services will be accessed.
- endpoint: The custom URL to override default AWS service endpoint for specialized configurations.
- partSize: The size of each partition in bytes for reading data.
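If you want to use a temporary sessionToken instead of long-lived credentials, one common way to obtain one (outside of H2O MLOps) is through AWS STS, for example with boto3. The sketch below is illustrative; the duration is an arbitrary example, and your AWS credentials must already be configured for boto3.

```python
# Illustrative only: requesting temporary AWS credentials through STS.
# The returned values map to accessKeyID, secretAccessKey, and sessionToken
# in the S3 spec. The duration is an arbitrary example.
import boto3

sts = boto3.client("sts")
creds = sts.get_session_token(DurationSeconds=3600)["Credentials"]

access_key_id = creds["AccessKeyId"]
secret_access_key = creds["SecretAccessKey"]
session_token = creds["SessionToken"]
```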
For Azure Blob Storage spec, provide the following details:
- accountKey (required): The Azure storage account key.
- sasToken (required): The shared access signature (SAS) token for accessing the storage account.
- containerName (required): The name of the blob storage container.
- partitionSize: The size of each partition in bytes for reading data.
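If you need to generate a SAS token for the container, one option (outside of H2O MLOps) is the azure-storage-blob package, as sketched below; the account name, account key, container name, permissions, and expiry are placeholders chosen for illustration.

```python
# Illustrative only: generating a container-level SAS token with
# azure-storage-blob. All values are placeholders.
from datetime import datetime, timedelta, timezone

from azure.storage.blob import ContainerSasPermissions, generate_container_sas

sas_token = generate_container_sas(
    account_name="mystorageaccount",
    container_name="scoring-data",
    account_key="<account-key>",
    permission=ContainerSasPermissions(read=True, list=True),
    expiry=datetime.now(timezone.utc) + timedelta(hours=1),
)
```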
For GCS spec, provide the following details:
- credentials (required): The service account JSON credentials.
- projectID (required): The Google Cloud Project ID.
- endpoint: The custom endpoint URL.
- partSize: The size of each partition in bytes for reading data.
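The credentials field expects the full service account key JSON. As a small sanity-check sketch (not part of H2O MLOps), you can read the key file to confirm the values that the spec asks for; the file path is a placeholder.

```python
# Illustrative only: reading a Google service account key file to pull out
# the values the GCS spec asks for. The file path is a placeholder.
import json

with open("service-account.json") as f:
    sa = json.load(f)

credentials_json = json.dumps(sa)  # full JSON for the credentials field
project_id = sa["project_id"]      # value for the projectID field
```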
For JDBC spec, provide the following details:
- secretParams: The set of key-value pairs that contain sensitive parameters (e.g., passwords) used to dynamically construct the JDBC connection string. Each key and value must be a string. For example, you can use postgres://user:{{pass}}, where pass is defined in secretParams.
- driver (required): The JDBC driver to use. Supported values include mysql, postgres, mssql, and oracle.
- table (required): The table to read from. You can also use any valid SQL expression for a FROM clause, such as a subquery enclosed in parentheses.
- numPartitions: The number of partitions to divide the table into for parallel reads. Required if partitioning is enabled. This setting determines the level of read parallelism (see the sketch after this list).
- lowerBound: The lower boundary value used to compute partition strides. It is not used to filter rows and must match the data type of the partitionColumn.
- upperBound: The upper boundary value used to compute partition strides. Like lowerBound, it is only used for partitioning and must match the data type of the partitionColumn.
- partitionColumn: The column used to determine how the data is partitioned. This must be a numeric, date, or timestamp column.
- Source MIME type (required): The MIME type (media type) of the input data. Select an appropriate option from the drop-down menu.
- Source location (required): The path to the input data source.
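To show how the partitioning fields fit together: the range between lowerBound and upperBound on partitionColumn is split into numPartitions strides, and each stride becomes a separate parallel read. The sketch below mimics the typical (Spark-style) stride computation; it is an illustration only, and the column name and bounds are made up. Note how the first and last partitions are left open-ended, so rows outside the bounds are still read.

```python
# Illustrative only: how numPartitions, lowerBound, upperBound, and
# partitionColumn typically split a JDBC read into parallel queries.
# The column name and bounds are made up for the example.
def partition_predicates(column, lower, upper, num_partitions):
    stride = (upper - lower) // num_partitions
    predicates = []
    for i in range(num_partitions):
        lo = lower + i * stride
        hi = lower + (i + 1) * stride
        if i == 0:
            predicates.append(f"{column} < {hi} OR {column} IS NULL")
        elif i == num_partitions - 1:
            predicates.append(f"{column} >= {lo}")
        else:
            predicates.append(f"{column} >= {lo} AND {column} < {hi}")
    return predicates

# Four parallel reads over customer_id values 0..1000:
for predicate in partition_predicates("customer_id", 0, 1000, 4):
    print(predicate)
```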
Now, select the appropriate spec type (for example, S3 Spec) from the Sink spec drop-down menu and fill out the configuration fields.
Sink spec
Note: The MinIO specification uses the same configuration fields as the S3 specification. To select MinIO as the sink spec type, choose S3 spec from the Sink spec drop-down menu.
For S3 Spec, provide the following details:
- accessKeyID (required): The unique identifier for AWS authentication.
- secretAccessKey (required): The private password for AWS authentication.
- sessionToken: The temporary security token for time-limited access to AWS resources.
- pathStyle: Select this option to enable path-style URL construction for the S3 bucket.
- region (required): The AWS geographical region where resources or services will be accessed.
- endpoint: The custom URL to override default AWS service endpoint for specialized configurations.
- writeConcurrency: The number of concurrent write operations.
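For context on the pathStyle option, the sketch below shows the difference between the default virtual-hosted S3 URL form and the path-style form; the bucket, region, and key are placeholders. Path-style addressing is commonly needed for S3-compatible stores such as MinIO.

```python
# Illustrative only: the two S3 URL forms the pathStyle option refers to.
# Bucket, region, and key are placeholders.
bucket, region, key = "my-bucket", "us-east-1", "output/predictions.csv"

virtual_hosted = f"https://{bucket}.s3.{region}.amazonaws.com/{key}"  # default
path_style = f"https://s3.{region}.amazonaws.com/{bucket}/{key}"      # pathStyle enabled

print(virtual_hosted)
print(path_style)
```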
For Azure Blob Storage spec, provide the following details:
- accountKey (required): The Azure storage account key.
- sasToken (required): The shared access signature (SAS) token for accessing the storage account.
- containerName (required): The name of the blob storage container.
- writeConcurrency: The number of concurrent write operations allowed.
For GCS spec, provide the following details:
- credentials (required): The service account JSON credentials.
- projectID (required): The Google Cloud Project ID.
- endpoint: The custom endpoint URL.
- writeConcurrency: The number of concurrent write operations.
For JDBC spec, provide the following details:
- secretParams: The set of key-value pairs that contain sensitive parameters (e.g., passwords) used to dynamically construct the JDBC connection string. Each key and value must be a string. For example, you can use postgres://user:{{pass}}, where pass is defined in secretParams (see the sketch after this list).
- driver (required): The JDBC driver to use. Supported values include mysql, postgres, mssql, and oracle.
- table (required): The JDBC table to write to.
- Sink MIME type (required): The MIME type (media type) of the output data. Select an appropriate option from the drop-down menu.
- Sink location (required): The destination path where the output data will be written.
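To make the secretParams behavior concrete, the sketch below shows the placeholder-substitution pattern that a {{key}} template implies; the connection string and secret value are examples only, and the actual substitution is performed by the batch scoring job.

```python
# Illustrative only: how {{key}} placeholders in a JDBC connection string are
# filled from secretParams. The real substitution is done by the batch
# scoring job; this just demonstrates the pattern.
import re

connection_string = "postgres://user:{{pass}}@db.example.com:5432/scoring"
secret_params = {"pass": "s3cr3t"}  # example value

resolved = re.sub(
    r"\{\{(\w+)\}\}",
    lambda match: secret_params[match.group(1)],
    connection_string,
)
print(resolved)  # postgres://user:s3cr3t@db.example.com:5432/scoring
```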
9. After filling out the configuration fields, click Start job to initiate the batch scoring job.
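Once the job completes, the scored results in the sink location can be read with any tool that understands the chosen output format. For example, a JSON Lines or header-less CSV result downloaded locally could be inspected with pandas; the file names below are placeholders.

```python
# Illustrative only: loading batch scoring output downloaded from the sink.
# File names are placeholders.
import pandas as pd

# JSON Lines output: one prediction per line.
predictions = pd.read_json("predictions.jsonl", lines=True)

# CSV output has no header row, so supply header=None.
predictions_csv = pd.read_csv("predictions.csv", header=None)

print(predictions.head())
```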
Batch scoring with Python client
To learn how to perform batch scoring using the H2O MLOps Python client, see the Batch scoring tutorial in the Python client tutorials section.