
Tutorial 1A: Introduction to H2O Document AI - Publisher

This tutorial will walk you through the basics of H2O Document AI - Publisher and will follow the recommended beginner workflow. For this tutorial, we will be building a Token Labeling model (which focuses on labeling your files' tokens).

Prerequisites

Step 1: Creating the project

Start by creating a new project. Click the “Create new Project” button on H2O Document AI - Publisher’s homepage.

Create a new Project button.

This takes you to the Create a Project panel.

  1. Provide Docs_Tutorial as your project name
  2. (Optional) Provide This Project will be for the tutorial. as the description
  3. Add the zip file 5pdfs-clean-documents.zip which contains the set of documents, PDFs, and images you will work with
  4. Click Create project.

Create Project launch panel with fields to provide a name, description, and a zip file with your images. Gold create project button at the bottom right.
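
If your own documents are not yet bundled, the zip file for the upload step can be prepared with a short script. The following is a minimal sketch using only Python's standard library. The function name `build_document_zip`, the folder layout, and the list of file extensions are assumptions made for illustration; they are not taken from H2O Document AI, so check the product documentation for the formats it accepts.

```python
import zipfile
from pathlib import Path

# Extensions treated as documents here; this set is an assumption for illustration.
DOCUMENT_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def build_document_zip(source_dir, zip_path):
    """Bundle the document files found directly in source_dir into one zip archive.

    Returns the list of file names that were added, in sorted order.
    """
    source = Path(source_dir)
    added = []
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source.iterdir()):
            if path.is_file() and path.suffix.lower() in DOCUMENT_SUFFIXES:
                # Store only the file name so the archive has a flat layout.
                archive.write(path, arcname=path.name)
                added.append(path.name)
    return added
```

For example, `build_document_zip("my_documents", "my_documents.zip")` (both names hypothetical) writes an archive that you can then select in the Create a Project panel.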

Project main page

After creating your new project, you will be taken to your project's main page. Your project will now also be accessible from the left navigation bar. From your project’s main page, you can see each available page: Document sets, Annotation sets, Models, and Jobs. Published Pipelines is only accessible from the left navigation bar.

Project Main Page. Shows the following Project panels: Document sets, Annotation sets, Models, and Jobs. The top navigation bar is populated with options for your project.

You can access these pages in two ways:

  • clicking the page name on the left navigation bar
  • clicking “See All” on the desired panel from the project’s main page

Your uploaded files will appear on the Document sets page and the Annotation sets page. You can see these files from the project’s main page or by going directly into the Document sets or Annotation sets page.

Step 2: Running OCR

Navigate to the Document sets page. You will now run optical character recognition (OCR) on your files. This extracts the tokens from your files.

Click on the checkbox next to the files you uploaded. Selecting this checkbox will make the OCR button available in the upper navigation bar.

Click OCR.

Shows how to access OCR panel by selecting your Document set to make the OCR button available.

On the OCR launch page:

  1. Select best for the OCR method
  2. Provide Tutorial_OCR as the Result name
  3. (Optional) Provide Running OCR on the 5 imported PDFs. as the description
  4. Click OCR

OCR launch page where you select your OCR method and provide a resulting name. Gold OCR button in the bottom right.

When your OCR job has finished running, you can access your tokens from the Annotation sets page. The resulting annotation set has the attributes text and confidence.

Step 3: Annotating your files

Navigate to the Annotation sets page. Your two annotation sets should be the OCR tokens set you just created and the blank labeling set that got uploaded to this page when you created your project.

Find your blank annotation set. Go to the end of its row and click the drop-down arrow next to the Info button. Then, click Edit in Page View from the drop-down options.

Shows how to access Edit in Page View from the Annotation sets page by selecting your annotation set, clicking the drop-down arrow at the end of the annotation set row, and clicking Edit in Page View.

This will take you to the annotations page. Here, you can add labeled regions to your files and classify the page type of each file in your set.

Shows the Page View screen. You will start on the first image of the zip file.

Info: Creating labels for your files can be time-consuming.

Creating region attributes

The following steps describe how to add a new attribute.

  1. In the Attributes box, ensure that Region attributes is selected.
  2. In the Attribute name field, enter "label" and click the + button to add the new attribute. (Note: For this step, the name of the attribute must be set to "label" in order for H2O Document AI to work correctly.)

Region attributes on the attributes panel. Has a field to provide a name for the attribute type, a name id, a description, and lets you choose the type of attribute list (option shown here is drop-down).

Create option ids

You can now create the option ids (that is, the labels that can be assigned to regions when annotating documents) you want to use to label your files. For this tutorial, select dropdown as the Type, and then enter the option ids you want to use. In this tutorial, the following labels are used:

List of option IDs to use in this tutorial.

You can set one of these ids as your default value ("def."). This means that when you start assigning these ids to regions, that id will be the first choice provided.

Applying bounding boxes

To create a labeling region on your image, left-click and hold from one corner of the area you want bounded and drag to the opposite corner. Release the left-click to create the bounding box.

While the box is still highlighted, select the option id for that region from the drop-down menu on the annotation editor.

GIF of how to use a bounding box in Page View. Start by drawing the box, then provide the label ID.
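
Because you can drag from any corner to the opposite one, the same region can be drawn in four different directions. The sketch below shows one way to turn a press point and a release point into a consistent box. It is an illustration of the geometry only: the `Box` type and the `box_from_drag` function are hypothetical names, and nothing here describes how H2O Document AI stores regions internally.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box in image pixel coordinates."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def box_from_drag(press, release):
    """Return the box spanned by the press and release points.

    The result is the same whichever corner the drag started from.
    """
    (x0, y0), (x1, y1) = press, release
    return Box(min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))
```

For instance, `box_from_drag((120, 80), (40, 200))` and `box_from_drag((40, 200), (120, 80))` both give `Box(x_min=40, y_min=80, x_max=120, y_max=200)`.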

Create regions on all five documents, using the option ids as guidance for what needs to be labeled. After creating each bounding box, select the option id that fits that region. Each page should have a handful of labeled regions.

After you have created all of your regions and labeled them, save your work by clicking the save button on the top tool bar.

Step 4: Applying labels

Navigate back to the Annotation sets page. Your labeled set now has the label attribute. Combining it with your token set (which has the text attribute) gives you an annotation set that has both the label and text attributes. You need both of these attributes to create a Token Labeling model.

Click Apply Labels in the upper navigation bar. We will now combine the information we extracted from OCR and created in Edit in Page View by applying labels to our tokens.

Shows how to access the Apply labels panel by clicking "Apply Labels" in the top navigation bar.

On the Apply Labels launch page:

  1. Select Tutorial_OCR for your text annotation set
  2. Select 5pdfs-clean-documents Labels for your labeled annotation set
  3. Provide TutorialLabeling as the name
  4. (Optional) Provide Running Apply Labels using the tokens in Tutorial_OCR and the labels created on 5pdfs-clean-documents as the description
  5. Select Assume there are no labels in the labels page for what to do when the labels page is missing
  6. Click Apply Labels

Apply Labels launch panel. Has fields for selecting OCR set and labels set, providing the resulting name, adding a description, and what should be done when a labels page is missing. Gold Apply Labels button in the bottom right corner.

The labeled token set will appear on your Annotation sets page when the job is finished running.
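
To make the idea of combining the two sets concrete, here is a minimal sketch of how region labels could be transferred onto OCR tokens. It assumes a simple rule: a token receives the label of the first region that contains the token's center point. That rule, the dictionary layout, and the function name `apply_labels` are all assumptions for illustration; the tutorial does not describe the matching logic H2O Document AI actually uses.

```python
def apply_labels(tokens, regions, default_label=""):
    """Give each OCR token the label of the first region containing its center.

    tokens:  list of dicts with "text" and "box" = (x_min, y_min, x_max, y_max)
    regions: list of dicts with "label" and "box" in the same coordinate system
    Tokens that fall outside every region get default_label.
    """
    labeled = []
    for token in tokens:
        x_min, y_min, x_max, y_max = token["box"]
        center_x = (x_min + x_max) / 2
        center_y = (y_min + y_max) / 2
        label = default_label
        for region in regions:
            rx_min, ry_min, rx_max, ry_max = region["box"]
            if rx_min <= center_x <= rx_max and ry_min <= center_y <= ry_max:
                label = region["label"]
                break
        # Copy the token and attach the label, leaving the input untouched.
        labeled.append({**token, "label": label})
    return labeled
```

Each returned token carries both a text and a label value, which mirrors the two attributes the resulting annotation set needs for training.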

Step 5: Training your Token Labeling model

Now that you have a set with both the text and label attributes, you can build a Token Labeling model!

On the Annotation sets page, select the check box next to your Labeled Tokens set. Click Train Model.

Shows how to access the Train Model panel by selecting your labeled tokens set to make the Train Model button in the top navigation bar available.

On the Train Model launch page:

  1. Select Token Labeling as your model type
  2. Ensure that TutorialLabeling is set as your training annotation set
  3. Provide Model4Tutorial as the name
  4. (Optional) Provide Using the TutorialLabeling set to create a Token Labeling model as the description
  5. Toggle Evaluate off since we have no validation set
  6. Click Train

Train Model launch panel. Has fields for selecting your model type, choosing your training annotation set, toggling whether to specify batch size and epochs, providing the name of the resulting model, providing a description, and toggling whether to provide a validation annotation set to evaluate the model. Gold Train button in the bottom right corner.

Once your model training job has finished running, your model will be available on the Models page.

Step 6: Publishing a pipeline with your token labeling model

Navigate to the Published Pipelines page. You can now publish a scoring pipeline using the token labeling model you built.

Click Publish Pipeline in the upper navigation bar.

This is the Published Pipelines page. Here you can publish and use pipelines for scoring new documents.

On the Publish Pipeline launch page:

  1. Select DocTR EfficientNet B3 as the OCR method you want your pipeline to use
  2. Select your token labeling model
  3. Provide pipeline4tutorial as the name for your Pipeline
  4. Select Supply-chain as the built-in post-processor
  5. Keep the rest of the default values
  6. Click Publish

The top part of the Publish Pipeline panel.

The bottom part of the Publish Pipeline panel.

When your pipeline has finished publishing, you can access its scoring URL if you want to use the bulk scorer. Otherwise, you can submit new, uniquely named documents directly to the pipeline in the UI by clicking the Submit Document button at the end of the pipeline row.

Submitting a new file to be scored

Try submitting your extra medical referral document to your published pipeline. Click Submit document to submit a new document to be scored.

The button at the end of a published pipeline that allows you to submit new, uniquely named documents for scoring.

On the submit document panel:

  1. Add the aurora-new.pdf
  2. Click Submit file to pipeline

The panel to submit new documents to be scored.

Your file will be scored by your pipeline. To access your results:

  1. Click the background processes bell icon next to your account name on the top banner
  2. Click View next to the Scoring aurora-new.pdf finished process

How to access your scored document JSON information.

A JSON of your scoring results will be available.

The JSON information for your scored document.
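
If you save a copy of the scoring JSON to a file, a short script can help you get oriented before you read it in full. The sketch below outlines the keys and value types of any JSON document without assuming a particular schema, since this tutorial does not spell out the result format. The function names and the file name in the usage note are hypothetical.

```python
import json


def summarize_structure(value, max_depth=3, _depth=0):
    """Return a nested outline of keys and value types, without assuming any schema."""
    if _depth >= max_depth:
        return type(value).__name__
    if isinstance(value, dict):
        return {
            key: summarize_structure(item, max_depth, _depth + 1)
            for key, item in value.items()
        }
    if isinstance(value, list):
        # Describe a list by its length and the outline of its first element.
        first = summarize_structure(value[0], max_depth, _depth + 1) if value else None
        return {"list_length": len(value), "first_item": first}
    return type(value).__name__


def summarize_json_file(path):
    """Load a JSON file and return the outline of its contents."""
    with open(path, encoding="utf-8") as handle:
        return summarize_structure(json.load(handle))
```

Calling `summarize_json_file("aurora-new-result.json")` (a hypothetical file name) returns a small dictionary showing which fields exist and how deeply they nest.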

Summary

In this tutorial, you learned the basic workflow of H2O Document AI - Publisher. You created a project, ran OCR, annotated your images, applied labels to your tokens, built a Token Labeling model, and published a pipeline.

Next

If you would like to try building another model with validation using the information you have learned from this tutorial, test out Tutorial 1B: Creating an evaluation model in H2O Document AI - Publisher.

