
Recommended workflow

The following is the recommended workflow for users who are new to H2O Document AI.

Step 1: Ingest

When you first start using H2O Document AI, you begin with a blank Project repository. Create a project and import the files you want to work with. Once the files have been imported, they are accessible from the Document sets page, and a blank annotation set appears on the Annotation sets page.

Step 2: Label

From the Document sets page, run OCR on the document set to retrieve the tokens for the files. The completed OCR file appears on the Annotation sets page.

On the Annotation sets page, begin annotating your blank annotation set by using edit in page view. You can apply regional attributes (labels) to the bounding boxes you create, file attributes (classes) to each individual file in your set of documents, or both.

Once you've finished annotating your files, run the apply labels job to combine the tokens retrieved by OCR with the annotations you've just created.
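To make the idea of combining tokens with annotations concrete, the following is a minimal sketch, not the product's actual implementation. It assumes each OCR token and each annotation has a rectangular bounding box, and that a token takes the label of the first annotation box containing the token's center. The names `Box`, `Token`, `Annotation`, and `apply_labels` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Box:
    """Axis-aligned rectangle in page coordinates (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, x: float, y: float) -> bool:
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1


@dataclass
class Token:
    """A word recovered by OCR, optionally carrying a label."""
    text: str
    box: Box
    label: Optional[str] = None


@dataclass
class Annotation:
    """A labeled bounding box drawn by an annotator."""
    label: str
    box: Box


def apply_labels(tokens: List[Token], annotations: List[Annotation]) -> List[Token]:
    """Return new tokens, each labeled by the first annotation box
    that contains the token's center (assumed matching rule)."""
    labeled = []
    for tok in tokens:
        cx = (tok.box.x0 + tok.box.x1) / 2
        cy = (tok.box.y0 + tok.box.y1) / 2
        label = next(
            (a.label for a in annotations if a.box.contains(cx, cy)), None
        )
        labeled.append(Token(tok.text, tok.box, label))
    return labeled


# Illustrative data only: three tokens and one annotated region.
tokens = [
    Token("Invoice", Box(10, 10, 60, 20)),
    Token("1234", Box(70, 10, 100, 20)),
    Token("Total", Box(10, 200, 40, 210)),
]
annotations = [Annotation("invoice_number", Box(65, 5, 105, 25))]

labeled = apply_labels(tokens, annotations)
for tok in labeled:
    print(tok.text, tok.label)
```

In this sketch, tokens that fall outside every annotation box keep a label of `None`, which mirrors the idea that only annotated regions contribute labels to the training data.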

Step 3: Train models

After you've retrieved your fully labeled tokens, you can build your model. Select your labeled annotation set and run the Train Model job. If you want to classify your document pages, train a Page Classification model (requires class and text attributes). If you want to label the tokens in your documents, train a Token Labeling model (requires label and text attributes). Note that a model can take some time to build. The finished model appears on the Models page.
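The attribute requirements above can be summarized as a small lookup. The following is an illustrative sketch only: the dictionary and the `trainable_model_types` helper are hypothetical and are not part of any H2O Document AI API. It simply encodes the two requirements stated above and reports which model types a given annotation set could support.

```python
from typing import Iterable, List

# Attribute requirements as stated in this workflow.
REQUIRED_ATTRIBUTES = {
    "Page Classification": {"class", "text"},
    "Token Labeling": {"label", "text"},
}


def trainable_model_types(available: Iterable[str]) -> List[str]:
    """Return the model types whose required attributes are all present
    in the annotation set (hypothetical helper)."""
    have = set(available)
    return [
        name
        for name, needed in REQUIRED_ATTRIBUTES.items()
        if needed <= have  # subset check: every required attribute exists
    ]


# An annotation set with OCR text and region labels, but no file classes.
print(trainable_model_types({"text", "label"}))
```

An annotation set that carries class, label, and text attributes would qualify for both model types, while one with text alone would qualify for neither.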

Info: You can build a model with or without a validation set.

Step 4: Deployment and post-processing

You can now use this model to run predictions on other data, or publish a scoring pipeline that new documents can be fed into.

