
Using the H2O Document AI - Bulk Scorer

Scoring documents in the UI can be time consuming. To score a large number of documents, use the H2O Document AI - Bulk Scorer instead. Because the H2O Document AI - Bulk Scorer is currently separate from the main H2O Document AI UI, you run it from the command line in either a Docker or Python environment.

info

When using the H2O Document AI - Bulk Scorer, you can run the same job multiple times. Each run creates a new output file (that is, no files are overwritten).

Download the H2O Document AI - Bulk Scorer

Use either of the following download links to gain access to the H2O Document AI - Bulk Scorer:

You can also contact the H2O Document AI team.

Retrieve your scoring URL

Access your scoring URL from the H2O Document AI - Publisher Published Pipelines panel.

[Figure: the scoring URL copy button, located on each published pipeline, which provides the URL to use with the H2O Document AI - Bulk Scorer.]

Start the H2O Document AI - Bulk Scorer

# Using H2O Document AI - Bulk Scorer with Docker
# Step 1: Download the Docker image archive
wget https://s3.amazonaws.com/artifacts.h2o.ai/releases/ai/h2o/document-ai-bulk-scorer/rel-v0.2.5/docker/docai-scorer_0.2.5_docker.tar.gz

# Step 2: Load the Docker image from the archive
docker load < docai-scorer_0.2.5_docker.tar.gz

# Step 3: Test your installation
docker run \
-it \
--rm \
-u "$(id -u)":"$(id -g)" \
-v "$(pwd)"/local/test-nem-5p:/home/appuser/app/test-nem-5p \
-v "$(pwd)"/local/results:/home/appuser/app/results \
-v "$(pwd)"/.env:/home/appuser/app/.env \
docai-scorer:0.2.5 --help

Retrieve a list of all available commands

If you don't know where to begin after starting the H2O Document AI - Bulk Scorer, the help command retrieves a list of all the available commands.

--help

Configuration

The H2O Document AI - Bulk Scorer can be configured using a YAML file. The file location defaults to ./config.yaml. To use a different YAML file as the config, provide its path with --config.

Configuration options provided in the config.yaml can be overridden using environmental variables or command line options. For example, if the location of the files option is provided in multiple places, the option set in the command line takes the highest priority. Environmental variables have second highest priority. Options set in the config.yaml file have the lowest priority.
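The precedence described above can be sketched in a few lines of Python. This is only an illustration of the lookup order, not the scorer's actual implementation: the function resolve_option is hypothetical, the config file is represented as an already-parsed dictionary, and the mapping from an option name to a DOCAI_-prefixed environment variable is an assumption based on the variable names shown on this page.

```python
import os


def resolve_option(name, cli_args, config, environ=None):
    """Return the effective value for one option (hypothetical helper).

    Precedence, highest first: command line, environment variable, config file.
    """
    environ = os.environ if environ is None else environ
    # 1. A value given on the command line wins.
    if cli_args.get(name) is not None:
        return cli_args[name]
    # 2. Otherwise fall back to the DOCAI_-prefixed environment variable.
    env_key = "DOCAI_" + name.upper()
    if env_key in environ:
        return environ[env_key]
    # 3. Finally use the value from config.yaml (already parsed into a dict).
    return config.get(name)


# Illustrative values only.
config = {"pipeline": "from-config", "log_level": "INFO", "name": "run-a"}
environ = {"DOCAI_PIPELINE": "from-env", "DOCAI_LOG_LEVEL": "DEBUG"}
cli_args = {"pipeline": "from-cli"}

print(resolve_option("pipeline", cli_args, config, environ))   # from-cli
print(resolve_option("log_level", cli_args, config, environ))  # DEBUG
print(resolve_option("name", cli_args, config, environ))       # run-a
```

Here pipeline is set in all three places, so the command line value is used; log_level is set in the environment and the config file, so the environment value is used; name is only set in the config file.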

The H2O Document AI - Bulk Scorer example section contains an example config.yaml file.

Authentication

Authentication can be done in two ways:

  1. Through username/password authentication using docai_user and docai_password + Keycloak.
  2. Through Managed Cloud authentication for SSO support. These values can be obtained from the "Accessing H2O AI Cloud APIs" section of the Managed Cloud.

If both options are provided, then the second option will be used.
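As a rough illustration of that rule, the following Python sketch picks an authentication method from a dictionary of options. The function choose_auth_method is hypothetical, and it assumes that an option group counts as "provided" only when every key in the group has a value; the scorer may apply a different check.

```python
# Option names are taken from this page; the grouping logic is an assumption.
USERPASS_KEYS = (
    "docai_user",
    "docai_password",
    "auth_base_url",
    "keycloak_client_id",
    "keycloak_realm",
)
MANAGED_CLOUD_KEYS = ("platform_token", "token_endpoint_url", "platform_client_id")


def has_all(options, keys):
    """True when every key in the group has a non-empty value."""
    return all(options.get(key) for key in keys)


def choose_auth_method(options):
    """Prefer Managed Cloud authentication when both groups are complete."""
    if has_all(options, MANAGED_CLOUD_KEYS):
        return "managed_cloud"
    if has_all(options, USERPASS_KEYS):
        return "username_password"
    raise ValueError("no complete set of authentication options found")


userpass = {
    "docai_user": "your_username",
    "docai_password": "your_password",
    "auth_base_url": "https://auth.1234567.h2o.ai/auth",
    "keycloak_client_id": "kc-id",
    "keycloak_realm": "kc-realm",
}
managed = {
    "platform_token": "your1platform2token3here4...",
    "token_endpoint_url": "https://auth.1234567.h2o.ai/token/endpoint/url",
    "platform_client_id": "your-client-id",
}

print(choose_auth_method(userpass))                 # username_password
print(choose_auth_method({**userpass, **managed}))  # managed_cloud
```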

Authentication with username/password

You can specify your authentication credentials in the config.yaml file, in the environmental variables, or from the command line.

DocAI user

# setting from the config.yaml file:
docai_user: "<>"

# setting from environmental variables:
-e DOCAI_USER=<>

# setting from the command line:
--docai_user <>

DocAI password

# setting from the config.yaml file:
docai_password: "<>"

# setting from environmental variables:
-e DOCAI_PASSWORD=<>

# setting from the command line:
--docai-password <>

Auth base URL

# setting from the config.yaml file:
auth_base_url: "<>"

# setting from environmental variables:
-e DOCAI_AUTH_BASE_URL=<>

# setting from the command line:
--auth-base-url <>

Keycloak client ID

# setting from the config.yaml file:
keycloak_client_id: "<>"

# setting from environmental variables:
-e DOCAI_KEYCLOAK_CLIENT_ID=<>

# setting from the command line:
--keycloak-client-id <>

Keycloak realm

# setting from the config.yaml file:
keycloak_realm: "<>"

# setting from environmental variables:
-e DOCAI_KEYCLOAK_REALM=<>

# setting from the command line:
--keycloak-realm <>

Authentication for Managed Cloud

The following authentication commands provide SSO support for the Managed Cloud. These commands can be specified in the config.yaml file, in the environmental variables, or from the command line.

info

If you provide username/password authentication as well as authentication for the Managed Cloud, then the authentication for the Managed Cloud will be used.

Platform token

# setting from the config.yaml file:
platform_token: "<>"

# setting from environmental variables:
-e DOCAI_PLATFORM_TOKEN=<>

# setting from the command line:
--platform_token <>

Token endpoint URL

# setting from the config.yaml file:
token_endpoint_url: "<>"

# setting from environmental variables:
-e DOCAI_TOKEN_ENDPOINT_URL=<>

# setting from the command line:
--token_endpoint_url <>

Platform client ID

# setting from the config.yaml file:
platform_client_id: "<>"

# setting from environmental variables:
-e DOCAI_PLATFORM_CLIENT_ID=<>

# setting from the command line:
--platform_client_id <>

Input and output

The following input/output commands can be specified in the config.yaml file, in the environmental variables, or from the command line.

Inputting the location of the file(s)

This command queues a path to an image file, a directory with images, or a zip file with images for the H2O Document AI - Bulk Scorer. This command can be given multiple times, and you can mix the input types (for example, -i <image_file> -i <image_directory> -i <image_file>).

# setting in the config.yaml file:
images: <>

# setting from environmental variables:
-e DOCAI_IMAGES=<>

# setting from the command line:
--images <>
-i <>
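To make the three input types concrete, here is a minimal Python sketch of how a mixed list of inputs could be expanded into a flat list of files. It is not the scorer's code: the helpers expand_input and expand_inputs are hypothetical, directories are assumed to be searched recursively, and zip files are assumed to be unpacked into a temporary directory before their contents are queued.

```python
import tempfile
import zipfile
from pathlib import Path


def expand_input(path, temp_dir):
    """Return the files behind one input: a single file, a directory, or a zip."""
    path = Path(path)
    if path.is_dir():
        # Assumption: directories are walked recursively.
        return sorted(p for p in path.rglob("*") if p.is_file())
    if zipfile.is_zipfile(path):
        # Unpack the archive into its own folder under the temporary directory.
        target = Path(temp_dir) / path.stem
        with zipfile.ZipFile(path) as archive:
            archive.extractall(target)
        return sorted(p for p in target.rglob("*") if p.is_file())
    return [path]


def expand_inputs(paths, temp_dir):
    """Expand every input in order and return one flat list of files."""
    files = []
    for path in paths:
        files.extend(expand_input(path, temp_dir))
    return files


# Build a throwaway file, directory, and zip to demonstrate the mix.
with tempfile.TemporaryDirectory() as root:
    root = Path(root)
    single = root / "a.png"
    single.write_bytes(b"placeholder, not a real image")
    folder = root / "scans"
    folder.mkdir()
    (folder / "b.png").write_bytes(b"x")
    (folder / "c.pdf").write_bytes(b"x")
    archive_path = root / "batch.zip"
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("d.jpeg", b"x")

    queued = expand_inputs([single, folder, archive_path], root / "tmp")
    queued_names = [p.name for p in queued]
    print(queued_names)  # ['a.png', 'b.png', 'c.pdf', 'd.jpeg']
```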

Set the allowed file types

By default, the H2O Document AI - Bulk Scorer can read the following file types:

  • PDF
  • PNG
  • JPEG
  • JPG
  • BMP
  • TIFF
  • GIF

However, if you want the H2O Document AI - Bulk Scorer to only score certain file types (for example, JPEG and PNG only) you can specify that.

# setting from the config.yaml file:
valid_image_file_extensions:
- "<>"

# setting from environmental variables:
-e DOCAI_VALID_IMAGE_FILE_EXTENSIONS=<>

# setting from the command line:
--valid-image-file-extensions <>
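The effect of restricting the allowed file types can be pictured with a short Python sketch. The function filter_by_extension is hypothetical, the extensions are written with a leading dot as in the example config.yaml on this page, and the case-insensitive comparison is an assumption rather than documented behavior.

```python
from pathlib import Path

# Default file types listed above, written with a leading dot.
DEFAULT_EXTENSIONS = (".pdf", ".png", ".jpeg", ".jpg", ".bmp", ".tiff", ".gif")


def filter_by_extension(paths, allowed=DEFAULT_EXTENSIONS):
    """Keep only the paths whose extension is in the allowed list."""
    allowed_lower = {ext.lower() for ext in allowed}
    # Assumption: extensions are compared case-insensitively.
    return [p for p in paths if Path(p).suffix.lower() in allowed_lower]


files = ["scan1.PDF", "scan2.png", "notes.txt", "photo.jpeg", "page.tiff"]

print(filter_by_extension(files))
# ['scan1.PDF', 'scan2.png', 'photo.jpeg', 'page.tiff']
print(filter_by_extension(files, [".jpeg", ".png"]))
# ['scan2.png', 'photo.jpeg']
```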

Input an encrypted zip file

This command lets you input an encrypted zip file.

# setting from the config.yaml file:
zip_password: "<>"

# setting from environmental variables:
-e DOCAI_ZIP_PASSWORD=<>

# setting from the command line:
--zip-password <>

Input multiple encrypted zip files

This command lets you input multiple encrypted zip files.

# setting from the config.yaml file:
zip_password_file: "<>"

# setting from environmental variables:
-e DOCAI_ZIP_PASSWORD_FILE=<>

# setting from the command line:
--zip-password-file <>

Filter input image list

This command can filter the inputted image list. For example, if you set --regex ".*1[.](jpeg|pdf|png)$" in the command line, then the H2O Document AI - Bulk Scorer only selects images that end in *1.pdf, *1.png, or *1.jpeg.

# setting from the config.yaml file:
regex: "<>"

# setting from environmental variables:
-e DOCAI_REGEX=<>

# setting from the command line:
--regex "<>"
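Before starting a long job, it can help to check a pattern against a few file names. The following Python sketch does that with the standard re module, using the example pattern from above. Whether the scorer applies the pattern with a search or a full match is not stated here, so the use of re.search is an assumption; for this particular pattern both give the same result. The file names are made up for illustration.

```python
import re

# Example pattern from this section: names ending in 1.jpeg, 1.pdf, or 1.png.
pattern = re.compile(r".*1[.](jpeg|pdf|png)$")

candidates = [
    "invoice_01.pdf",
    "invoice_02.pdf",
    "scan1.png",
    "photo11.jpeg",
    "page1.tiff",
    "form1.pdf.bak",
]

selected = [name for name in candidates if pattern.search(name)]
print(selected)  # ['invoice_01.pdf', 'scan1.png', 'photo11.jpeg']
```

Note that form1.pdf.bak is rejected because the $ anchor requires the extension to be at the very end of the name.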

Provide an output directory

This command provides a directory to save the results of the H2O Document AI - Bulk Scorer job.

# setting from the config.yaml file:
out_dir: "<>"

# setting from environmental variables:
-e DOCAI_OUTPUT_DIR=<>

# setting from the command line:
--output-dir <>
-o <>

Provide a temporary image directory

This command provides a temporary image directory for unzipped files.

# setting from the config.yaml file:
temp_image_dir: "<>"

# setting from environmental variables:
-e DOCAI_TEMP_IMAGE_DIR=<>

# setting from the command line:
--temp-image-dir <>

Pipeline

The following pipeline commands can be specified in the config.yaml file, in the environmental variables, or from the command line.

Scorer base URL

In order to access a pipeline, you need to provide the base URL for the scorer. You can access the base scorer URL from the H2O Document AI - Publisher UI on the Published Pipelines page.

# setting from the config.yaml file:
scorer_base_url: "<>"

# setting from environmental variables:
-e DOCAI_SCORER_BASE_URL=<>

# setting from the command line:
--scorer-base-url <>

Provide a pipeline to use for scoring

This command provides the pipeline to use for scoring the new documents.

# setting from the config.yaml file:
pipeline: "<>"

# setting from environmental variables:
-e DOCAI_PIPELINE=<>

# setting from the command line:
--pipeline <>
-p <>

Number of replicas

This command provides the number of replicas. This value should match what you set when you published your pipeline.

# setting from the config.yaml file:
num_replicas: <>

# setting from environmental variables:
-e DOCAI_NUM_REPLICAS=<>

# setting from the command line:
--num-replicas <>

Options for the run

The following options can be specified in the config.yaml file, in the environmental variables, or from the command line.

Providing a name for the run

This command provides a name for the scoring job. You can use the name to outline what is run in the job. For example, if you run a job that scores 5 images, is a benchmark with 10 total requests, and uses 1 pod, you could name that job "<ImageFileName>-5page-10rps-1pod" to reflect the settings you established.

# setting from the config.yaml file:
name: <>

# setting from environmental variables:
-e DOCAI_NAME=<>

# setting from the command line:
--name <>
-n <>

Verbosity

This command prints the entire configuration (from the config.yaml file, the environmental variables, and the command line) that will run.

# setting from the config.yaml file:
verbose: true/false

# setting from environmental variables:
-e DOCAI_VERBOSE=TRUE/FALSE

# setting from the command line:
--verbose / --no-verbose

Request a dry run of the information

This command performs a dry run of the provided commands. No scoring requests are made; instead, the H2O Document AI - Bulk Scorer tells you what will happen when you actually run the task.

# setting from the config.yaml file:
dry_run: true/false

# setting from environmental variables:
-e DOCAI_DRY_RUN=TRUE/FALSE

# setting from the command line:
--dry-run / --no-dry-run

Requesting a list of the given images

This command returns a list of all the inputted images without running the scorer. This command is useful when you pass multiple images into one job.

# setting from the config.yaml file:
list_images: true/false

# setting from environmental variables:
-e DOCAI_LIST_IMAGES=TRUE/FALSE

# setting from the command line:
--list-images / --no-list-images

Request logs for the scorer

This command requests logs for the run.

# setting from the config.yaml file:
scorer_logs: true/false

# setting from environmental variables:
-e DOCAI_SCORER_LOGS=TRUE/FALSE

# setting from the command line:
--scorer-logs / --no-scorer-logs

Log level

This command sets the log level for the run (one of: "INFO" or "DEBUG").

# setting from the config.yaml file:
log_level: "<>"

# setting from environmental variables:
-e DOCAI_LOG_LEVEL=<>

# setting from the command line:
--log-level <>

Dynamic page subsetting

This command passes an extra field through the corresponding HTTP server handler to the underlying scorer. It lets you dynamically state a page range for any given API call.

# setting from the config.yaml file:
extra: "<>"

# setting from environmental variables:
-e DOCAI_EXTRA=<>

# setting from the command line:
--extra <>

Benchmark a run

The following benchmark commands can be specified in the config.yaml file, in the environmental variables, or from the command line.

Benchmark a job

This command specifies whether to run this job as a benchmark.

# setting from the config.yaml file:
benchmark: true/false

# setting from environmental variables:
-e DOCAI_BENCHMARK=TRUE/FALSE

# setting from the command line:
--benchmark / --no-benchmark

Provide a path for the benchmark results

This command provides a path for the benchmark results.

# setting from the config.yaml file:
benchmark_results: "<>"

# setting from environmental variables:
-e DOCAI_BENCHMARK_RESULTS=<>

# setting from the command line:
--benchmark-results <>

H2O Document AI - Bulk Scorer example

The following is an example of how to use the H2O Document AI - Bulk Scorer.

First, create a config.yaml file with the following information:

# config.yaml
# Configuration example file for H2O Document AI - Bulk Scorer

version: 0.2.5

# Authentication can be done in two ways. If both are specified,
# then the SSO support authentication will be used.

# 1. Authentication using docai_user, docai_password + Keycloak
docai_user: "your_username"
docai_password: "your_password"
auth_base_url: "https://auth.1234567.h2o.ai/auth"
keycloak_client_id: "kc-id"
keycloak_realm: "kc-realm"

# Or

# 2. Authentication for SSO support using platform_token on Managed Cloud.
# The following values can be obtained through the "Accessing H2O AI Cloud APIs" section
# of the Managed Cloud.
platform_token: "your1platform2token3here4..."
token_endpoint_url: "https://auth.1234567.h2o.ai/token/endpoint/url"
platform_client_id: "your-client-id"

# Input & Output
images: null
out_dir: "./results"

# Pipeline
scorer_base_url: "https://document-ai-scorer.tester.h2o.ai"
pipeline: "tester-pipeline"
num_replicas: 4

# Options for this run
name: nem-2x-5page-1rps-1pod
verbose: true
dry_run: false
list_images: false
log_level: "INFO"

# Benchmark
benchmark: false
num_requests: 1
benchmark_results: null

valid_image_file_extensions:
- ".pdf"
- ".jpeg"

temp_image_dir: null

Next, load the H2O Document AI - Bulk Scorer Docker image and run it in a new container. The -u option passes your user and group IDs. The -v lines mount, in order:

  • the files to be ingested into the H2O Document AI - Bulk Scorer (local/test-nem-5p is mounted at /home/appuser/app/test-nem-5p),
  • the output directory (local/results is mounted at /home/appuser/app/results), and
  • the configuration file (config.yaml is mounted at /home/appuser/app/config.yaml).

The last line names the image to run and passes the remaining options to the H2O Document AI - Bulk Scorer. These options use the file locations established by the -v lines and in the config.yaml file.

The following options are not provided in the config.yaml file, so they are added on the command line:

  • The location of the input files (-i) is "test-nem-5p"; this file location is set in the first -v command
  • This example is set as a benchmark test for this run only (--benchmark)
  • The benchmark results (--benchmark-results) are written to "results/nem-2x-5page-1rps-1pod.csv"

$ docker load < docai-scorer_0.2.5_docker.tar.gz
$ docker run \
-it \
--rm \
-u "$(id -u)":"$(id -g)" \
-v "$(pwd)"/local/test-nem-5p:/home/appuser/app/test-nem-5p \
-v "$(pwd)"/local/results:/home/appuser/app/results \
-v "$(pwd)"/config.yaml:/home/appuser/app/config.yaml \
docai-scorer:0.2.5 -i test-nem-5p --benchmark --benchmark-results results/nem-2x-5page-1rps-1pod.csv
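If you launch many jobs from a script, it may be convenient to assemble the docker run command programmatically. The Python sketch below builds the same command as an argument list and prints it without running it. The function build_docker_command is hypothetical, the host directory /work and the user value 1000:1000 are placeholders, and the -it flags are left out on the assumption that a scripted run has no interactive terminal.

```python
import os
import shlex
from pathlib import PurePosixPath

# Directory inside the container, as used in the example above.
APP_DIR = PurePosixPath("/home/appuser/app")


def build_docker_command(workdir, scorer_args, image="docai-scorer:0.2.5", user=None):
    """Build the docker run argument list for one scoring job (hypothetical helper)."""
    workdir = PurePosixPath(workdir)
    if user is None:
        # Same effect as -u "$(id -u)":"$(id -g)" in the shell example.
        user = f"{os.getuid()}:{os.getgid()}"
    mounts = [
        (workdir / "local" / "test-nem-5p", APP_DIR / "test-nem-5p"),  # input files
        (workdir / "local" / "results", APP_DIR / "results"),          # output directory
        (workdir / "config.yaml", APP_DIR / "config.yaml"),            # configuration file
    ]
    cmd = ["docker", "run", "--rm", "-u", user]
    for host_path, container_path in mounts:
        cmd += ["-v", f"{host_path}:{container_path}"]
    cmd.append(image)
    cmd.extend(scorer_args)
    return cmd


cmd = build_docker_command(
    "/work",  # placeholder for the directory that holds local/ and config.yaml
    ["-i", "test-nem-5p", "--benchmark",
     "--benchmark-results", "results/nem-2x-5page-1rps-1pod.csv"],
    user="1000:1000",  # placeholder user and group IDs
)
print(shlex.join(cmd))
```

Keeping the command as a list avoids shell quoting problems; the list can be passed to subprocess.run once the printed command looks right.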
