Key terms
This documentation uses several terms that are specific to H2O Document AI - Publisher and H2O Document AI - Viewer. This page explains each of them.
Annotation
Before data can be extracted from a document, it must first be annotated. Annotation refers to the process of labeling and organizing documents in a manner that makes them suitable for further analysis. This process can, for example, involve marking images or texts with bounding boxes that have labels attributed to each box.
Annotation set
An annotation set refers to the collection of different types of annotations. For example:
- Text annotations (usually from the OCR process)
- Token entity annotations (labels)
- Page annotations (classes)
Attribute
An attribute is a type of annotation. There are two types of attributes: region attributes (which classify certain regions of a document or file) and file attributes (which classify a whole document or file). Within an annotation set, multiple attributes can be created, each storing a different type of information about each document. For example, you may want to create one attribute for the main entity recognition annotations, and another for grouping line items together.
In H2O Document AI - Publisher, the attributes can be created in Page View below the file list. After attributes are created and the annotation set is saved, the set of attributes is shown in the annotation set list. This is a good way of quickly distinguishing between different annotation sets. In addition, the Apply Labels and Train Model actions require specific attributes, and the choice lists are filtered by the corresponding attribute types required.
AutoML
AutoML, or Automated Machine Learning, is the process of automating algorithm selection, feature generation, hyperparameter tuning, iterative modeling, and model assessment. AutoML tools such as H2O Driverless AI make it easy to train and evaluate machine learning models. Automating the repetitive tasks around machine learning development allows individuals to focus on the data and the business problems they are trying to solve.
Batch size
The batch size defines the number of documents (with their bounding box coordinates) that are passed over to the processing device (with CPUs or GPUs) at a time. It is a hyperparameter that defines the number of samples to work through before updating the LayoutLM model parameters. If more than one GPU is available, then per_device_batch_size is determined by dividing the batch size by the number of GPUs.
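The division described above can be sketched in a few lines. This is only an illustration: the function name is hypothetical, and the use of integer division (and the fallback when there is at most one GPU) is an assumption, not a documented detail of H2O Document AI.

```python
def per_device_batch_size(batch_size: int, num_gpus: int) -> int:
    """Split a global batch size across the available GPUs.

    Hypothetical helper: integer division is an assumption of this sketch.
    """
    if num_gpus <= 1:
        # With zero or one GPU, the single device receives the whole batch.
        return batch_size
    return batch_size // num_gpus


print(per_device_batch_size(16, 4))  # 4
```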
Bounding box
In H2O Document AI - Publisher, a bounding box marks a rectangular region of an image or document. A label can then be attributed to the bounding box.
Class
A type of information (such as "customer address" or "customer phone number"). When applied to an entire document (that is, when a document is assigned a class), the class is the type of document (such as "medical card" or "driver's license").
Concatenated annotation sets
The process of combining two or more annotation sets into a single larger set that includes all documents. Before annotation sets can be combined, they must have the same attributes.
As new documents become available, they can be annotated individually and combined with existing annotation sets to make a larger training set using the concatenate function.
Document sets
A set of documents for H2O Document AI - Publisher that can include PDFs, images (PNG, JPG, GIF), or a zip file containing a collection of the preceding filetypes, including a mix of multiple different types of documents.
Embedded text
Embedded text refers to metadata stored inside a PDF that conveys a precise definition of the text in the page, including the location of the text. When available, this can be used directly in order to more efficiently and accurately obtain the data needed for Document AI models.
Embedded text is usually available in documents created by software systems such as Microsoft Word, an order processing system, or a web browser. Embedded text is often unavailable in images from scanners, phones, or faxes, even if those images are stored as PDFs. When embedded text is not available, OCR using computer vision is used to obtain the text and location data needed for Document AI models.
Embedded text can be added to a PDF by any OCR process, so when Document AI encounters embedded text, it uses an algorithm to detect whether the embedded text is authentic and uses only authentic embedded text.
Entity
An entity is a set of related bounding boxes, or an instance of a class (for example, “54th Avenue NYC, NY” for a customer address or “608-806-1234” for a customer phone number). If a model scores each of three contiguous tokens as address, it is common to group them together as a single multi-token entity. This grouping occurs in the post-processing stage.
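A minimal sketch of grouping contiguous tokens into multi-token entities follows. The token representation (text and label pairs in reading order, with None for unlabeled tokens) and the function name are assumptions made for illustration; the actual post-processor in H2O Document AI may work differently.

```python
from itertools import groupby


def group_entities(tokens):
    """Merge runs of contiguous tokens that share a label into one entity.

    tokens: list of (text, label) pairs in reading order; label is None
    for tokens that were not assigned a class (an assumed convention).
    """
    entities = []
    for label, run in groupby(tokens, key=lambda token: token[1]):
        if label is None:
            continue  # skip unlabeled tokens
        entities.append((label, " ".join(text for text, _ in run)))
    return entities


tokens = [
    ("Ship", None), ("to", None),
    ("54th", "address"), ("Avenue", "address"),
    ("NYC,", "address"), ("NY", "address"),
    ("608-806-1234", "phone"),
]
print(group_entities(tokens))
# [('address', '54th Avenue NYC, NY'), ('phone', '608-806-1234')]
```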
Epochs
An epoch is defined as one pass over all the training documents. The number of epochs is a hyperparameter that defines the number of times that the learning algorithm works through the entire training dataset.
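To see how epochs and batch size interact, the sketch below counts parameter updates for a training run. It assumes one update per batch and that a final partial batch still counts as a step; both are assumptions of this example, and the function name is hypothetical.

```python
import math


def total_update_steps(num_documents: int, batch_size: int, epochs: int) -> int:
    """Count model updates, assuming one update per batch (illustrative only)."""
    # A final, smaller batch is assumed to still trigger an update.
    steps_per_epoch = math.ceil(num_documents / batch_size)
    return steps_per_epoch * epochs


print(total_update_steps(num_documents=100, batch_size=8, epochs=3))  # 39
```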
Ingest
The process of uploading documents to H2O Document AI - Publisher using the web interface or API.
Intelligent Character Recognition
Intelligent Character Recognition (ICR) is an advanced form of Optical Character Recognition (OCR) that recognizes characters beyond font libraries in a generalized manner.
Jobs
An action taken by the H2O Document AI - Publisher system. Some examples include importing documents, performing annotation set operations such as saving, and executing models such as OCR or token classification.
Label
In H2O Document AI, the term label is used specifically for annotating token entities. When a document set is initially uploaded to H2O Document AI - Publisher, an annotation set is created with the region attribute “label”. When a document is added to and processed by H2O Document AI - Viewer, the labels are available on the document results page.
Labeling
In H2O Document AI - Publisher, this is the task of detecting and tagging data with labels in images, videos, audio, and text. Labeling data is an important step in data preparation and preprocessing for building AI.
LayoutLM
LayoutLM is a multi-modal AI modeling architecture that is designed specifically for document understanding tasks, incorporating features of the text and also the locations of the text.
Models
An artifact that has been trained to perform H2O Document AI - Publisher tasks.
Natural language processing (NLP)
NLP is a subfield of linguistics, computer science, and artificial intelligence that is concerned with the interactions between computers and human language. In particular, NLP is concerned with how to program computers to process and analyze large amounts of natural language data.
Optical character recognition (OCR)
Optical character recognition (OCR) recognizes characters in documents or images and provides the text and text location.
Page classification
An H2O Document AI - Publisher model type that learns what type of document a page is by using the text within the page. For example, you can train a page classification model to differentiate between invoices, receipts, and pay stubs.
Post-processing
Modeling stages that occur after the primary AI model(s). In H2O Document AI - Publisher, a common post-processing step is to aggregate contiguous tokens together to create a single entity. Another example is to standardize date text into a standard date format.
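As an illustration of the date standardization step, the sketch below tries a few candidate formats and emits an ISO 8601 date. The list of formats, the choice of ISO 8601 as the target, and the function name are assumptions for this example, not the formats that H2O Document AI actually handles.

```python
from datetime import datetime

# Hypothetical list of input formats; a real pipeline would likely support more.
CANDIDATE_FORMATS = ("%m/%d/%Y", "%d %B %Y", "%B %d, %Y", "%Y-%m-%d")


def standardize_date(text: str):
    """Return the date as YYYY-MM-DD, or None if no candidate format matches."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    return None


print(standardize_date("March 5, 2021"))  # 2021-03-05
```

Note that a format such as "%m/%d/%Y" bakes in a month-first reading of ambiguous dates; a real post-processor would need to decide that per document source.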
Pre-processing
Modeling stages that occur before the primary AI model(s). In H2O Document AI - Publisher, image processing tasks are handled as pre-processing.
Predict
Predict refers to the process of using a model to create annotations against an annotation set. This is typically done using the Predict Using Model option from the annotation sets page. However, it can also be done while training the model by using the evaluation section of the train interface. Either approach creates a new annotation set with the attributes predicted by the model. Predicting is also often referred to as scoring or running inference.
Project
A set of data and models related to a particular data type. You must create a project before you can upload any data. Projects store all document sets, annotation sets, models, and published pipelines.
Publish
The term Publish refers to the process of creating a pipeline of multiple actions that collectively process a document into a result set. In most cases, this describes the end goal of H2O Document AI - Publisher, where the OCR stage, one or more trained models, and post-processors are combined into a single process that is optimized for Document AI MLOps. The H2O Document AI - Publisher user interface works in single batch jobs to create elements of a pipeline, whereas end-to-end processing of documents through a REST API occurs in H2O Document AI MLOps. Publish refers to the action of creating the pipeline.
Quality
You are given a quality score on a prediction annotation set when you train a model using evaluation. The quality is the F1 score of the model that was applied to the dataset.
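For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes it from raw counts of true positives, false positives, and false negatives. How H2O Document AI aggregates the score across classes (for example, micro or macro averaging) is not stated here, so this shows only the basic formula.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """F1 = 2 * precision * recall / (precision + recall), from raw counts."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0  # no correct predictions at all
    return 2 * precision * recall / (precision + recall)


# 8 correct predictions, 2 spurious, 2 missed: precision = recall = 0.8
print(round(f1_score(8, 2, 2), 3))  # 0.8
```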
Result sets
Result sets show the final stage of the data after applying one or more post-processing actions to an annotation set. Converting individual token predictions into multi-token entities is an example that would transform an annotation set into a result set.
Split annotation set (SAS)
The process of dividing a single annotation set into smaller pieces. This is commonly done to divide the data for an AI task into training and validation sets. When training AI models, it is common to use one portion of the labeled data to train the model, where the model sees both the documents and the answers. The other portion of the labeled data is then used to judge accuracy: the model predicts answers it was not shown, and the errors are calculated and analyzed. This helps to ensure that the model works on documents it has never seen.
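The idea of a training and validation split can be sketched as follows. The 80/20 ratio, the random shuffling, and the function name are assumptions for illustration; the split options offered by H2O Document AI - Publisher may differ.

```python
import random


def split_documents(doc_ids, train_fraction=0.8, seed=0):
    """Shuffle document IDs and split them into training and validation lists."""
    shuffled = list(doc_ids)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train_ids, validation_ids = split_documents(range(10))
print(len(train_ids), len(validation_ids))  # 8 2
```

Every document lands in exactly one of the two lists, so the validation set contains only documents the model did not see during training.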
Tagging
The process of making unstructured data more structured by manually or automatically adding tags or annotations to various components of the unstructured data.
Token labeling
The process of adding annotations to tokens or sets of tokens. In H2O Document AI - Publisher, this usually refers to adding entity annotations, or region attributes, of the class “label”.
Train models
The process of training an H2O Document AI - Publisher model with an annotation set. This involves using an annotation set with “text” and “label” to train a token labeling model, or an annotation set with “text” and “class” to train a page classification model. Training models is the H2O Document AI - Publisher task that requires the most time.
Value
In H2O Document AI - Viewer, a value is the predicted token within a labeled region. Values are located on the document results page.