Version: v1.6.27 🚧

Extractors

Overview

Extractors, defined by JSON schemas, convert unstructured document content into structured, actionable data. They allow users to retrieve information from various document types, such as CVs, invoices, Form 10-Ks, or scanned images, without requiring complex setups or extensive annotations.

To use an Extractor, first identify the specific information you want to extract from a document. You specify this information in a JSON schema, which is part of the Extractor and acts as a blueprint for the data: it lists the fields and data types you wish to capture. Once you define the schema, you can apply the Extractor to the document and retrieve the desired information in a structured JSON format. This structured data is useful for individuals and applications that require organized information.
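To make the idea of a schema as a blueprint concrete, the following minimal sketch shows what a schema and a matching result could look like. The field names (`invoice_number`, `issue_date`, `total_amount`), the sample values, and the helper `missing_or_mistyped` are hypothetical and chosen for illustration only; they are not part of Enterprise h2oGPTe, and the check covers only flat schemas with a few basic types.

```python
import json

# Hypothetical schema for an invoice; the field names are illustrative only.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "issue_date": {"type": "string"},
        "total_amount": {"type": "number"},
    },
    "required": ["invoice_number", "total_amount"],
}

# Hypothetical extraction result, as it might look for the schema above.
extracted = json.loads(
    '{"invoice_number": "INV-001", "issue_date": "2024-01-31", "total_amount": 1250.5}'
)

# Map JSON Schema type names to Python types (only the subset used here).
JSON_TYPES = {"string": str, "number": (int, float), "integer": int, "boolean": bool}


def matches_type(value, type_name):
    """Check a value against a JSON Schema type name (bool is not a number)."""
    if isinstance(value, bool):
        return type_name == "boolean"
    return isinstance(value, JSON_TYPES[type_name])


def missing_or_mistyped(schema, data):
    """Return a list of problems found when comparing flat data to a flat schema."""
    problems = []
    for name in schema.get("required", []):
        if name not in data:
            problems.append(f"missing required field: {name}")
    for name, spec in schema["properties"].items():
        if name in data and not matches_type(data[name], spec["type"]):
            problems.append(f"wrong type for field: {name}")
    return problems


problems = missing_or_mistyped(invoice_schema, extracted)
print(problems)
```

An application that consumes extracted data could run a check like this before relying on the values; an empty list means every required field is present and every present field has the declared type.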

Create an Extractor for a document

To create an Extractor for a document, follow these steps:

  1. In the Enterprise h2oGPTe navigation menu, click Documents.

  2. In the Documents grid/list, click the document from which you want to extract information.

  3. Click Document AI.

  4. Click Summarize, Extract, Process.

  5. In the LLM list, select a large language model (LLM) to run the extraction.

  6. Click Output mode and select JSON schema.

  7. Define the labels for the JSON schema with one of the following two options:

    note

    The JSON schema does not require exact label names to align perfectly with document fields, as the collection's large language model (LLM) can interpret and infer label purposes based on context. This allows the model to understand and map various label names, even if there are minor differences in terminology, to their intended data points. Just as a human might deduce what a field intends to capture, the LLM uses its interpretive capability to accurately match schema labels with relevant content, even when exact terms differ.

    You can build the JSON schema for the Extractor using the JSON schema builder: define each field, its type, and whether it is required. For example:

    JSON schema builder

  8. Click Summarize, Extract, Process.

    note

    Enterprise h2oGPTe creates a Job to extract the appropriate information.
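The schema-definition step above asks for three things per field: a name, a type, and whether the field is required. As a rough illustration of how those three inputs map onto a JSON schema, here is a minimal sketch. The helper `build_schema` and the CV field names are hypothetical and are not part of Enterprise h2oGPTe; the schema the product generates may differ in detail.

```python
import json


def build_schema(fields):
    """Build a flat JSON schema from (name, type, required) tuples."""
    properties = {}
    required = []
    for name, json_type, is_required in fields:
        properties[name] = {"type": json_type}
        if is_required:
            required.append(name)
    return {"type": "object", "properties": properties, "required": required}


# Hypothetical labels for a CV; pick names that describe what you want captured.
cv_schema = build_schema([
    ("candidate_name", "string", True),
    ("email", "string", True),
    ("years_of_experience", "number", False),
])

print(json.dumps(cv_schema, indent=2))
```

Because the LLM infers the purpose of each label from context, descriptive names such as `candidate_name` are likely to work even when the document itself uses a different heading for that information.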

View a completed Extractor for a document

To view a completed Extractor for a document, follow these steps:

  1. In the Enterprise h2oGPTe navigation menu, click Documents.
  2. In the Documents grid/list, click the document whose completed Extractor you want to view.
  3. Click Document AI.
  4. The completed Extractor(s) are located in the Recent results section. For example: Extractor card Extractor result
