
Using published pipelines

This page describes how to publish a scoring pipeline, how to use the pipeline to score new documents, and how to make curl requests to the published pipeline.

Publishing a scoring pipeline

The following steps describe how to publish a scoring pipeline.

  1. After you have successfully trained a model, navigate to the Published Pipelines page through the left navigation bar. Note that there is no Published Pipelines panel on the main project page; the page is only accessible from the left navigation bar.
  2. Click Publish Pipeline on the upper navigation bar (the yellow button in the top right of the screen).
  3. Select which optical character recognition (OCR) method you want your pipeline to use for scoring. In most cases, use the same OCR method for scoring that you used for training.
  4. Select the model(s) to be used for the Pipeline. You can publish a pipeline using just a Token Labeling model, just a Page Classification model, or using both.
  5. Provide a name for your pipeline. The name has to be unique and only supports lowercase letters, digits, and the dash ("-") symbol.
  6. (Optional) Toggle to mute your logs if you need to protect sensitive information.
  7. Select a post-processor: one of the built-in post-processors (Generic or Supply-chain) or a custom post-processor. See the Post-processors section below for details.
  8. Configure how your scoring pipeline is scheduled:
    • Replicas:
      • Min: the minimum number of replicas required
      • Max: set higher than the minimum to allow for fluctuations in traffic
    • CPU:
      • Request: the amount of CPU required to schedule the pipeline; the pipeline can be provisioned as long as the requested amount is available
      • Limit: if the limit is reached, the pipeline is throttled down to the limit
    • Memory:
      • Request: amount of memory required to schedule the pipeline
      • Limit: if the limit is exceeded, the pod will be killed and the pipeline will be restarted
    • Tolerations: (Optional) tolerations that determine which (tainted) nodes your scoring pipeline can be scheduled on
    • Node selector: (Optional) restrict scheduling to nodes that have specific labels
  9. Click Publish

Post-processors

When publishing a pipeline, you need to add a post-processor. You can either use one of the built-in post-processors (i.e. Generic or Supply-chain) or write your own custom post-processor. Both built-in post-processors merge individual tokens that are predicted as the same label. For example, if token John and token Smith are both predicted as label customer_name, they are merged into the single prediction John Smith.

To access additional post-processor recipes, see the H2O Document AI Recipe Repository.

Generic

The Generic post-processor includes Top N and the ability to produce an image snippet. Top N delivers a second view of the data that shows the top predicted class for each prediction along with the second and third most likely classes. Image snippet returns a cropped image of each prediction rectangle as a byte string. Note that these byte strings can be huge, which can make it difficult to find the real predictions in the output.
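
For illustration only, a Top N view for a single prediction might look something like the following. The field names and structure here are a hypothetical sketch, not the exact output schema:

{
  "text": "John Smith",
  "topN": [
    {"label": "customer_name", "probability": 0.91},
    {"label": "vendor_name", "probability": 0.06},
    {"label": "other", "probability": 0.03}
  ],
  "imageSnippet": "<base64 byte string of the cropped prediction rectangle>"
}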

Supply-chain

The Supply-chain post-processor includes line ID groupings, which group related predictions (for example, item ID, price, and quantity) that belong to the same line item.
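
For illustration only (the field names below are hypothetical, not the exact output schema), line ID groupings conceptually associate the predictions that belong to the same line item:

{
  "lineId": 1,
  "predictions": [
    {"label": "item_id", "text": "A-1042"},
    {"label": "quantity", "text": "3"},
    {"label": "price", "text": "14.99"}
  ]
}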

Current workaround

Avoid publishing pipelines with long names. Try to keep pipeline names to around 3-4 characters; otherwise, the pipeline might fail to deploy.

Submitting new documents to the pipeline

After your pipeline has been published, you can use it to score new documents. Track the progress of your submitted file under background processes. You access background processes by clicking the bell icon next to your account name.

When the document you submitted to the pipeline has finished processing, you can access the result (a JSON response) from background processes by clicking the "View" button. This brings you to the Scoring Results page. You can also access the logs from the Scoring Results page.

Universal scoring pipeline

You can construct your own custom pipeline from within the UI. This is known as the universal scoring pipeline (USP). The USP grants you more flexibility than the standard publishing options alone. All available functionalities exist as processors, and the pipeline module orchestrates them by following the configuration you set via YAML. This lets you chain together many processors to perform a single scoring task, thus creating a single pipeline.

The USP affords several interesting functionalities that traditional pipelines do not, such as:

  • conditional processing: run a page classifier first and send documents in different classes to separate token classifiers,
  • non-text object detection with regular text OCR methods,
  • OCR-only pipelines,
  • and many more possibilities!

Pipeline flow

The pipeline is broken down into individual tasks with a single processor per task. The tasks are carried out sequentially. The Intake processor ingests the input documents and returns an annotation set with the documents to be processed further downstream. Each subsequent processor then ingests a list of annotation sets, performs its process, and outputs a list of annotation sets for the next processor. At the end of the pipeline, a post-processor is needed to translate the annotation sets into a JSON file for consumption.

Some processors, such as Intake and PdfExtract, are common to most pipelines, but the beauty of the USP is its highly customizable nature.

Accessing the universal scoring pipeline

To use the USP:

  1. Navigate to the Published Pipelines page.
  2. Click Publish Pipeline.
  3. Toggle on Use custom pipeline.

This gives you access to the YAML config file that allows you to construct your custom pipeline. When you publish the pipeline, it reads this YAML file at run time.

If you set options in the UI panel, they are reflected in the YAML file. For example, if you select Tesseract as the OCR method, it is preset as the OCR option for your pipeline in the custom pipeline YAML. To update the USP with changes you make in the Publish Pipeline panel, toggle Use custom pipeline off and back on; the changes you select are then reflected in the YAML file.

Of course, you do not have to use any of the options in the Publish Pipeline panel. You can set all of your processors directly in the YAML file, as in the sketch below.
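
For example, if you select Tesseract as the OCR method in the panel, the OCR task in the YAML would look roughly like the sketch below, following the GenericOcr task format shown in the full example later on this page. The exact ocr_style value is whatever the UI prefills; "Tesseract" here is an assumption:

- tasks:
    - name: "OCR"
      class: argus.processors.ocr_processors.GenericOcr
      parameters:
        ocr_style: Tesseract # assumed value; the UI prefills the exact string for your selected OCR method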

USP and custom post processors

The USP can utilize your custom post-processors as well as the ones that come pre-baked into the UI. To use a custom post-processor, toggle on Use custom post-processor. Any custom post-processor you write or upload is directly linked to the YAML file as soon as you toggle on Use custom pipeline. Your post-processor is the last task executed.

H2O provided pipelines and processors

H2O provides several premade pipelines and post-processors for you to use.

Manipulating artifacts

When you open a new Publish Pipeline panel and toggle on Use custom pipeline without selecting a model, the artifacts section of the pipeline is empty. Selecting a model for the pipeline fills the artifacts with the source URL of the model and the unique name of that model. This artifact information also populates the Predict task, which is the processor task that runs predictions with your model. You can fill in this information yourself if you have a model you want to use instead of having the UI prefill it for you.
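
Conceptually, the prefilled artifact information looks something like the following hypothetical sketch. The key names and URL are illustrative assumptions, not the exact schema; these are the values the Predict task references when it runs predictions:

artifacts:
  model: # illustrative artifact entry
    source: https://storage.example.com/artifacts/ab12cd # source URL of the model, prefilled by the UI
    name: my-token-labeling-model # unique name of the model, prefilled by the UI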

Universal scoring pipeline example

Let's look at an example of a custom pipeline and walk through how it works. The following pipeline is a USP-exclusive feature: an OCR-only pipeline.

spec: # A general pipeline example that only does OCR on a set of data
  pipeline:
    steps:
      - tasks:
          - name: "Intake" # name of the task
            type: PipelineTask # if absent, defaults to PipelineTask. Can also be PipelineReorderInputs or InputCommand. InputCommand cannot be used in the scorer.
            class: argus.processors.ocr_processors.Intake # fully qualified name of the processor class
            parameters:
              root_docs_path: /input/path
              follow_symlinks: true
      - tasks:
          - name: "PdfExtract"
            type: PipelineTask
            class: argus.processors.ocr_processors.PdfTextExtract
      - tasks:
          - name: "ImageNormalize"
            class: argus.processors.ocr_processors.NormalizeImages
            parameters:
              resample_to_dpi: 300
              normalize_image_format: .jpg
      - tasks:
          - name: "OCR"
            class: argus.processors.ocr_processors.GenericOcr
            parameters:
              ocr_style: DocTROcr
      - tasks:
          - name: "PostProcess"
            class: argus.processors.post_processors.ocr_only_post_processor.PostProcessor
            parameters:
              output_format: 'json'

The first task is Intake, which lists all the documents that need to be processed by processors further downstream. The second task is PdfExtract, which attempts to extract the text on a page and, if it can't, leaves the page untouched. The third task is ImageNormalize, which normalizes the format and DPI and validates the image. The fourth task is OCR, which in this case runs the docTR OCR method. The final task is the post-processor: you need a post-processor to translate the data from an annotation set to a JSON file for use.

The MiniProgram processor

The MiniProgram processor is a special type of processor that lets you write small "in-between" steps without implementing a full-fledged processor. It runs once for every page and modifies all aspects of the annotation set (both the current document and the current page) by doing the following (a hypothetical configuration sketch follows the list):

  1. Defining a local variable for each page, which gives it access to the current page/document/annotation set data structure;
  2. Running the MiniProgram processor (which possibly modifies the local variables);
  3. Updating the output annotation set from the local variables after the MiniProgram processor finishes.
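
The exact class path and mini-program syntax are not covered on this page, so the following is only a hypothetical sketch of where such a task could sit in the pipeline YAML; every identifier below is an assumption:

- tasks:
    - name: "AdjustPredictions" # hypothetical in-between step
      class: argus.processors.post_processors.mini_program.MiniProgram # assumed class path, for illustration only
      parameters:
        program: | # assumed parameter name
          # mini-program body (syntax not shown in this guide); it runs once per page
          # and can read and modify the current page, document, and annotation set
          # through the local variables described above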

The when parameter

All processors take a when parameter (in addition to their processor-specific parameters). The when parameter's value is a slightly extended mini program that allows you to:

  • run processors conditionally on some pages or documents depending on the information contained in the input, or to drop them from the output annotation sets.
  • manipulate an annotation set in the same way as a regular mini program.
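
As a hypothetical sketch (the mini-program syntax itself is not shown on this page, and placing when under parameters is an assumption), a condition could be attached to a task from the OCR-only example above like this:

- tasks:
    - name: "OCR"
      class: argus.processors.ocr_processors.GenericOcr
      parameters:
        ocr_style: DocTROcr
        when: | # run this processor only on pages/documents that satisfy the condition
          # mini program that inspects the current page or document,
          # for example to skip pages that already contain extracted PDF text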

Accessing a scoring pipeline via curl

The following sections describe how to access published pipelines by sending API requests using curl.

Authentication

The following is a sample curl command to retrieve an access token from an identity provider.

info
  • The access token has a short lifetime of approximately five minutes and may expire while a document is being processed. If this occurs, rerun the curl command to retrieve a new access token. The processing of the document is not affected by the token expiry.
  • The access token that is returned by this command must be included in all requests to the proxy.
ACCESS_TOKEN=$(curl -X POST 'http://keycloak.34.211.115.161.nip.io/auth/realms/wave/protocol/openid-connect/token' \
-H 'Content-Type: application/x-www-form-urlencoded' \
-d 'password=REPLACE_ME' \
-d 'username=REPLACE_ME' \
-d 'grant_type=password' \
-d 'response_type=token' \
-d 'client_id=admin-cli' | jq -r .access_token)

Listing published pipelines

To retrieve a list of published pipelines, run the following curl command:

curl https://document-ai-scorer.cloud-qa.h2o.ai/pipeline -H "Authorization: Bearer ${ACCESS_TOKEN}"

The following is the expected response to the preceding command:

[{"name":"ak-pipeline-test","scoringUrl":"https://document-ai-scorer.cloud-qa.h2o.ai/async/model/ak-pipeline-test/score"}]

Submitting documents to a scoring pipeline

To submit a document to a scoring pipeline, run the following curl command:

curl -v https://document-ai-scorer.cloud-qa.h2o.ai/async/model/bd-trial/score \
-F documentGuid=cdbd0e44-2672-4d63-94dd-9afb110547ec \
-F document=@./test_api_svcs.pdf \
-H "Authorization: Bearer ${ACCESS_TOKEN}"

The following is the expected response to the preceding command:

{"jobUri":"https://document-ai-scorer.cloud-qa.h2o.ai/job/adde851e-b9b4-11ec-9ceb-fe652bf3bfb4","jobId":"adde851e-b9b4-11ec-9ceb-fe652bf3bfb4"}

Checking pipeline status

To check the status of a submitted scoring job, run the following curl command:

curl https://document-ai-scorer.cloud-qa.h2o.ai/job/adde851e-b9b4-11ec-9ceb-fe652bf3bfb4 -H "Authorization: Bearer ${ACCESS_TOKEN}"

The following is the expected response to the preceding command:

{"status":"succeeded","resultUri":"https://document-ai-scorer.cloud-qa.h2o.ai/job/adde851e-b9b4-11ec-9ceb-fe652bf3bfb4/result"}

Retrieving a prediction response

To retrieve a prediction response from a published scoring pipeline, run the following curl command after the status of a pipeline has changed to succeeded:

curl https://document-ai-scorer.cloud-qa.h2o.ai/job/adde851e-b9b4-11ec-9ceb-fe652bf3bfb4/result -H "Authorization: Bearer ${ACCESS_TOKEN}"

The following is the expected response to the preceding command:

{"documentGuid":"some-id","entityConfidences":[],"modelMetadata":{"version":"370f1c"},"pageFailures":[],"pages":{"0":{"metadata":{"dpi":200,"size":[10334,14617]}}}}

Template method

The template method lets you easily access the target text in a document when you know its coordinates ahead of time. The template method is best suited to cases where you have many documents from the same vendor (or of the same format) and expect many more in the future. Because these documents are repetitive, the location of the target text is predictable, so you don't need a model to predict its location; instead, the template reads the text at the assigned coordinates. This can make a template more accurate than a model for such documents.

To use the template method, contact the H2O Document AI team to help you set up your template.

Using the template method

After your template has been created, you need to navigate to the Published Pipelines page to utilize it.

  1. Click Publish Pipeline.
  2. Select an OCR method that uses PDF text extraction (for example, E3 Best).
  3. Add a model (either a finetuned or a dummy model).
    tip

    If you encounter a document that was not templated, a finetuned model will provide you with reasonable output. A dummy model will provide you with junk output.

  4. Provide a pipeline name.
  5. Toggle on Use custom post-processor. Paste your template code into the exposed post-processor.
  6. Click Publish.

This will create a pipeline that utilizes your template. You can submit documents to your published pipeline by clicking the Submit Document button at the end of the row, or you can use the bulk scorer to submit many documents at once.

