Version: v1.6.1 🚧

Extractors

Overview

Extractors, defined by JSON schemas, play an important role in document AI by converting unstructured document content into structured, actionable data. They allow users to retrieve information from various document types—such as CVs, invoices, Form 10-Ks, or scanned images—without requiring complex setups or extensive annotations.

Extractor flow

To use an Extractor, first identify the specific information you want to extract from a document. This information is specified in a JSON schema, which is part of an Extractor and acts as a blueprint for the data, detailing the fields and data types you wish to capture. Once you define this schema, you can apply the Extractor to the document, retrieving the desired information in a structured JSON format. This structured data is useful for individuals and applications that require organized information.
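As a rough illustration of the flow above, the following sketch pairs a small JSON schema with the kind of structured result it describes. The invoice field names, the sample values, and the `check_result` helper are all hypothetical and are not part of Enterprise h2oGPTe; the helper only performs a basic check of required fields and top-level types using the Python standard library.

```python
import json

# Hypothetical schema for an invoice; field names are illustrative only.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "issue_date": {"type": "string"},
        "total_amount": {"type": "number"},
    },
    "required": ["invoice_number", "total_amount"],
}

# Hypothetical structured result an Extractor might return for that schema.
extracted = {
    "invoice_number": "INV-0042",
    "issue_date": "2024-03-15",
    "total_amount": 1299.5,
}

# Mapping from JSON schema type names to Python types.
JSON_TO_PYTHON = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
}


def check_result(schema, data):
    """Return a list of problems; an empty list means the result fits the schema."""
    problems = []
    for name in schema.get("required", []):
        if name not in data:
            problems.append(f"missing required field: {name}")
    for name, spec in schema.get("properties", {}).items():
        if name not in data:
            continue
        value = data[name]
        expected = JSON_TO_PYTHON[spec["type"]]
        # bool is a subclass of int in Python, so reject it for non-boolean types.
        is_bool_mismatch = isinstance(value, bool) and spec["type"] != "boolean"
        if is_bool_mismatch or not isinstance(value, expected):
            problems.append(f"wrong type for field: {name}")
    return problems


print(json.dumps(extracted, indent=2))
print(check_result(invoice_schema, extracted))
```

A downstream application could run a check like this before consuming the extracted JSON, so that missing or mistyped fields are caught early.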

Create an Extractor

To create an Extractor, follow these steps:

  1. In the Enterprise h2oGPTe navigation menu, click Extractors.
  2. Click + New extractor.
  3. In the Extractor name box, enter a name for the Extractor.
  4. In the LLM list, select an LLM.
  5. Define the labels for the JSON schema with one of the following two options:

To build a JSON schema for the Extractor using the JSON schema builder, define each field, its type, and whether it is required.

note

Label names in the JSON schema do not need to match the document's field names exactly. The collection's large language model (LLM) infers the purpose of each label from context, much as a human reader would, and maps it to the relevant content even when the terminology differs slightly.

  6. Click Save.
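The field definitions described in step 5 (a name, a type, and whether the field is required) map naturally onto a standard JSON schema object. The sketch below shows that mapping in Python. The `build_schema` helper and the CV field names are hypothetical, and it is an assumption that the schema builder produces a schema of exactly this shape.

```python
import json


def build_schema(fields):
    """Build a JSON schema from (name, json_type, is_required) tuples."""
    properties = {}
    required = []
    for name, json_type, is_required in fields:
        properties[name] = {"type": json_type}
        if is_required:
            required.append(name)
    return {"type": "object", "properties": properties, "required": required}


# Hypothetical labels for extracting data from a CV.
cv_fields = [
    ("full_name", "string", True),
    ("email", "string", True),
    ("years_of_experience", "number", False),
]

cv_schema = build_schema(cv_fields)
print(json.dumps(cv_schema, indent=2))
```

Keeping the field list separate from the schema makes it easy to review which labels are required before saving the Extractor.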

Run an Extractor

To run an Extractor on a document, follow these steps:

  1. In the Enterprise h2oGPTe navigation menu, click Extractors.
  2. In the Extractors table, locate the row of the Extractor you want to run and click Run in that row.
  3. In the Select a collection list, select a Collection.
    note

    The selected Collection must include the document intended for use with the Extractor. The Extractor will retrieve all requested information from the document according to its JSON schema.

  4. Click Run.
    note

    Enterprise h2oGPTe creates a Job to process the Extractor. The Extractor is completed when its Job is completed.

View a completed Extractor

Once the Extractor has finished processing, you can access the extracted information of a document by following these steps:

  1. In the Enterprise h2oGPTe navigation menu, click Collections.
  2. Click the My Collections tab.
  3. In the My Collections table, click the Collection name containing the document used for the Extractor.
  4. In the Documents table, click the document used for the Extractor.
  5. The most recent Extractor is located in the Recent results section.
