Version: v1.6.37-dev1 🚧

Extractors

Overview

Extractors, defined by JSON schemas, convert unstructured document content into structured, actionable data. They let users retrieve information from various document types, such as CVs, invoices, Form 10-Ks, or scanned images, without complex setup or extensive annotation.

To use an Extractor, first identify the specific information you want to extract from a document. You specify this information in a JSON schema, which is part of the Extractor and acts as a blueprint for the data, detailing the fields and data types you want to capture. Once you define the schema, you can apply the Extractor to the document and retrieve the desired information as structured JSON. This structured output is useful for both people and applications that need organized information.
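To make the blueprint idea concrete, the following sketch pairs a small schema with a made-up extraction result and checks that the result contains the required fields with the expected types. The invoice field names, the sample result, and the `check_against_schema` helper are all hypothetical and exist only for illustration; the helper covers just `required` and the `string`/`number` types, not the full JSON Schema specification, and it is not part of Enterprise h2oGPTe.

```python
import json

# Hypothetical schema for an invoice; the field names are illustrative only.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoiceNumber": {"type": "string", "description": "The invoice identifier."},
        "totalAmount": {"type": "number", "description": "The total amount due."},
    },
    "required": ["invoiceNumber", "totalAmount"],
}

# Maps the JSON Schema type names used here to Python types.
_PY_TYPES = {"string": str, "number": (int, float)}


def check_against_schema(schema: dict, data: dict) -> list[str]:
    """Return a list of problems; an empty list means the data fits.

    Covers only `required` and the `string`/`number` types.
    """
    problems = []
    for name in schema.get("required", []):
        if name not in data:
            problems.append(f"missing required field: {name}")
    for name, spec in schema.get("properties", {}).items():
        if name not in data:
            continue
        expected = _PY_TYPES.get(spec.get("type"))
        value = data[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if expected and (not isinstance(value, expected) or isinstance(value, bool)):
            problems.append(f"{name}: expected {spec['type']}")
    return problems


# A made-up result, shaped like the structured JSON an Extractor might return.
result = json.loads('{"invoiceNumber": "INV-001", "totalAmount": 1250.5}')
print(check_against_schema(INVOICE_SCHEMA, result))
```

A check like this can be a quick sanity test before passing extracted data to another application; for complete validation, a dedicated JSON Schema validator is the better choice.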

Create an Extractor for a document

To create an Extractor for a document, consider the following steps:

In the Enterprise h2oGPTe navigation menu, click Documents.
In the Documents grid/list, click the document from which you want to extract information.
Click Document AI.
Click Summarize, Extract, Process.
In the LLM list, select the large language model (LLM) to run the extraction.
Click Output mode and select JSON schema.
Define the labels for the JSON schema with one of the following two options:

note
The JSON schema does not require exact label names to align perfectly with document fields, as the collection's large language model (LLM) can interpret and infer label purposes based on context. This allows the model to understand and map various label names, even if there are minor differences in terminology, to their intended data points. Just as a human might deduce what a field intends to capture, the LLM uses its interpretive capability to accurately match schema labels with relevant content, even when exact terms differ.
- Option 1 (JSON schema builder)
You can build the JSON schema for the Extractor using the JSON schema builder: define each field, its type, and whether it is required.
- Option 2 (JSON schema code)
You can enter the JSON schema code for the Extractor directly. To do so, follow these steps:
1. Click the Input JSON Schema toggle.
2. In the JSON schema box, enter a valid JSON schema. For example:
  
  {
    "type": "object",
    "properties": {
      "revenueGrowthRate": {
        "type": "number",
        "description": "The growth rate of revenue."
      },
      "netProfitMargin": {
        "type": "number",
        "description": "The company's profit margin."
      },
      "currentRatio": {
        "type": "number",
        "description": "The company's liquidity position."
      },
      "returnOnEquity": {
        "type": "number",
        "description": "The efficiency in generating profit from equity."
      },
      "debtToEquityRatio": {
        "type": "number",
        "description": "The proportion of debt to shareholders' equity."
      }
    },
    "required": [
      "revenueGrowthRate",
      "netProfitMargin",
      "currentRatio",
      "returnOnEquity",
      "debtToEquityRatio"
    ]
  }
Click Summarize, Extract, Process.

note
Enterprise h2oGPTe creates a Job to extract the appropriate information.
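When a schema has many fields of the same type, it can be easier to generate the schema text than to type it by hand. The sketch below rebuilds the financial-ratio example schema from a mapping of labels to descriptions and prints JSON that could be pasted into the JSON schema box. The `build_number_schema` helper is a hypothetical name introduced here for illustration, and the sketch assumes every field is a required number, as in the example; it uses only the Python standard library and does not call any Enterprise h2oGPTe API.

```python
import json

# Labels and descriptions taken from the example schema in this section.
FIELDS = {
    "revenueGrowthRate": "The growth rate of revenue.",
    "netProfitMargin": "The company's profit margin.",
    "currentRatio": "The company's liquidity position.",
    "returnOnEquity": "The efficiency in generating profit from equity.",
    "debtToEquityRatio": "The proportion of debt to shareholders' equity.",
}


def build_number_schema(fields: dict[str, str]) -> dict:
    """Build an object schema in which every field is a required number."""
    return {
        "type": "object",
        "properties": {
            name: {"type": "number", "description": text}
            for name, text in fields.items()
        },
        "required": list(fields),
    }


schema = build_number_schema(FIELDS)
schema_text = json.dumps(schema, indent=2)

# Round-trip through the parser to confirm the text is valid JSON.
assert json.loads(schema_text) == schema

print(schema_text)
```

Because the note above says the LLM infers each label's purpose from context, keeping the `description` values short and specific is a reasonable way to make that inference easier.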

View a completed Extractor for a document

To view a completed Extractor for a document, consider the following steps:

In the Enterprise h2oGPTe navigation menu, click Documents.
In the Documents grid/list, click the document whose completed Extractor you want to view.
Click Document AI.
The completed Extractors are listed in the Recent results section.

