H2O Document AI is an H2O AI Cloud (HAIC) engine that lets you build accurate AI models that:

  • Classify documents
  • Extract text, tables, and images from documents
  • Group, label, and refine extracted information from documents

H2O Document AI supports various documents and use cases to help organizations understand, process, and manage large amounts of unstructured data. Upload your documents to H2O Document AI using the H2O Document AI web interface (in HAIC) or API. H2O Document AI lets you handle a wide variety of documents, including:

  • Image scans (faxes in PDF or other formats, pictures with text, and non-editable forms)
  • Documents with embedded text that have text and layout metadata (PDF docs, Word docs, HTML pages)
  • Documents with regular text “left to right/top to bottom” (CSVs, emails, editable forms)

H2O Document AI uses a combination of:

  • Intelligent Character Recognition (ICR), which leverages learning algorithms for generalizable character and word recognition,
  • Document layout understanding, and
  • Natural Language Processing (NLP) to develop highly accurate models rapidly.

The following sections provide answers to frequently asked questions. If you have additional questions, please send them to


The following questions cover model training and model functionality.

What is the format of an exported model?

Models are exported as ZIP files containing only the artifacts needed to execute the LayoutLM model.

What is the requirement to run an exported model?

You need H2O Document AI's specific pipeline to run an exported model. The model will not run in H2O MLOps or in any other customer environment unless that environment replicates every part of the pipeline in the same way.

You could execute the model yourself using Microsoft's open-source LayoutLM code; however, doing so is reasonably complex.

Is exporting a model similar to creating a scoring pipeline?

Model export is not the same thing as publishing a scoring pipeline.

Exporting a model can only be done in H2O Document AI's UI, and only for LayoutLM models. You can run the exported model with open-source code, but you still need to handle tokenization, location embeddings, and other elements yourself. The model is essentially a transformer architecture.
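As a concrete example of one of those elements, LayoutLM expects each token's bounding box to be normalized to a 0–1000 coordinate range relative to the page size. A minimal sketch of that step (the function name and page dimensions are illustrative, not part of any H2O Document AI API):

```python
def normalize_box(box, page_width, page_height):
    """Scale an (x0, y0, x1, y1) pixel box to LayoutLM's 0-1000 range."""
    x0, y0, x1, y1 = box
    return (
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    )

# Example: a word box on an 850x1100-pixel page scan
print(normalize_box((85, 110, 170, 132), 850, 1100))  # prints (100, 100, 200, 120)
```

Steps like this, plus tokenization and model-specific input formatting, are what the H2O Document AI pipeline handles for you.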

The pipeline that is deployed when you publish a pipeline contains:

  • the way you ingest documents
  • your chosen method of OCR (which can include checking embedded-text quality, using embedded text, handling rotation, and detecting and recognizing text)
  • the ability to execute a page classification, token labeling model, or both
  • the ability to execute post-processing against the raw predictions of the above models
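The stages listed above can be sketched as follows; all class, function, and stage names here are hypothetical illustrations of the flow, not H2O Document AI's actual API:

```python
# Hypothetical sketch of the stages a published scoring pipeline chains
# together: ingestion + OCR, optional page classification and token
# labeling, and post-processing of the raw predictions.

class ScoringPipeline:
    def __init__(self, ocr, page_classifier=None, token_labeler=None,
                 post_processor=None):
        self.ocr = ocr                  # document ingestion + OCR method
        self.page_classifier = page_classifier
        self.token_labeler = token_labeler
        self.post_processor = post_processor

    def score(self, document):
        tokens = self.ocr(document)     # embedded text or ICR output
        result = {}
        if self.page_classifier:        # optional page classification model
            result["page_classes"] = self.page_classifier(tokens)
        if self.token_labeler:          # optional token labeling model
            result["token_labels"] = self.token_labeler(tokens)
        if self.post_processor:         # post-process raw predictions
            result = self.post_processor(result)
        return result

# Stub components to show the flow end to end
pipeline = ScoringPipeline(
    ocr=lambda doc: doc.split(),
    page_classifier=lambda toks: ["invoice"],
    token_labeler=lambda toks: [(t, "O") for t in toks],
)
print(pipeline.score("total due 42.00"))
```

Each stage is optional except OCR, which mirrors the point above: a published pipeline can run page classification, token labeling, or both.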


The following questions involve pipeline publishing and document scoring.

How do the replica values work when running the bulk scorer?

The number of replicas you request should not exceed the maximum number of replicas you set when you published the pipeline. If the number of replicas you request exceeds the number currently available, scoring will take longer to start because additional replicas must be freed up and allocated first.
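As a simple illustration of that constraint (the function below is hypothetical, not part of the product API):

```python
def effective_replicas(requested, published_max):
    """The bulk scorer can use at most the replica maximum
    set when the pipeline was published."""
    return min(requested, published_max)

# Requesting 8 replicas against a pipeline published with a maximum of 5
print(effective_replicas(8, 5))  # prints 5
```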