Extractors
Overview
Extractors, defined by JSON schemas, play an important role in document AI by converting unstructured document content into structured, actionable data. They allow users to retrieve information from various document types—such as CVs, invoices, Form 10-Ks, or scanned images—without requiring complex setups or extensive annotations.
Extractor flow
To use an Extractor, first identify the specific information you want to extract from a document. This information is specified in a JSON schema, which is part of an Extractor and acts as a blueprint for the data, detailing the fields and data types you wish to capture. Once you define this schema, you can apply the Extractor to the document, retrieving the desired information in a structured JSON format. This structured data is useful for individuals and applications that require organized information.
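The relationship between a schema and the structured output can be illustrated with a short Python sketch. Everything in it is an assumption made for illustration: the invoice field names, the sample result, and the `conforms` helper are hypothetical and are not part of Enterprise h2oGPTe. The helper checks only required fields and basic types, so it is a simplified stand-in for a full JSON Schema validator.

```python
import json

# Hypothetical schema for an invoice; the field names are illustrative only.
SCHEMA = {
    "type": "object",
    "properties": {
        "invoiceNumber": {"type": "string"},
        "totalAmount": {"type": "number"},
    },
    "required": ["invoiceNumber", "totalAmount"],
}

# Subset of JSON schema type names mapped to Python types.
PY_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "object": dict,
    "array": list,
}


def conforms(schema, data):
    """Return a list of problems; an empty list means the simplified check passed."""
    if not isinstance(data, dict):
        return ["result is not a JSON object"]
    problems = []
    for name in schema.get("required", []):
        if name not in data:
            problems.append(f"missing required field: {name}")
    for name, spec in schema.get("properties", {}).items():
        if name not in data:
            continue
        type_name = spec.get("type")
        expected = PY_TYPES.get(type_name)
        if expected is None:
            continue  # unknown or absent type: nothing to check
        value = data[name]
        # bool is a subclass of int in Python, so reject it for numeric fields.
        if isinstance(value, bool) and type_name in ("number", "integer"):
            problems.append(f"{name}: expected {type_name}, got boolean")
        elif not isinstance(value, expected):
            problems.append(f"{name}: expected {type_name}")
    return problems


# Made-up example of the kind of JSON an Extractor might return for this schema.
extracted = json.loads('{"invoiceNumber": "INV-001", "totalAmount": 1250.5}')
print(conforms(SCHEMA, extracted))
```

A downstream application could run a check like this before consuming a result, so that a missing or mistyped field is caught early.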
Create an Extractor
To create an Extractor, follow these steps:
- In the Enterprise h2oGPTe navigation menu, click Extractors.
- Click + New extractor.
- In the Extractor name box, enter a name for the Extractor.
- In the LLM list, select an LLM.
- Define the labels for the JSON schema with one of the following two options:
- Option 1 (UI: JSON schema builder): In the JSON schema builder, define each field, its type, and whether it is required.
- Option 2 (JSON schema code): Enter the JSON schema code for the Extractor directly by following these steps:
- Click the Input JSON Schema toggle.
- In the JSON schema box, enter a valid JSON schema. For example:
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "revenueGrowthRate": {
      "type": "number",
      "description": "The growth rate of revenue."
    },
    "netProfitMargin": {
      "type": "number",
      "description": "The company's profit margin."
    },
    "currentRatio": {
      "type": "number",
      "description": "The company's liquidity position."
    },
    "returnOnEquity": {
      "type": "number",
      "description": "The efficiency in generating profit from equity."
    },
    "debtToEquityRatio": {
      "type": "number",
      "description": "The proportion of debt to shareholders' equity."
    }
  },
  "required": [
    "revenueGrowthRate",
    "netProfitMargin",
    "currentRatio",
    "returnOnEquity",
    "debtToEquityRatio"
  ]
}
Label names in the JSON schema do not need to match the field names in the document exactly. The collection's large language model (LLM) infers the purpose of each label from context, much as a human reader would, and maps it to the relevant content even when the terminology differs slightly.
- Click Save.
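Before pasting schema code into the JSON schema box, it can help to check it locally. The following Python sketch is one possible pre-flight check, not a feature of Enterprise h2oGPTe: the `check_schema_text` helper is hypothetical, and its rules (a top-level object, a non-empty `properties` map, a `type` on every field, and `required` names that exist in `properties`) are assumptions based on the example schema above rather than documented requirements.

```python
import json


def check_schema_text(text):
    """Return a list of problems found in a JSON schema string (empty if none)."""
    try:
        schema = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg} (line {exc.lineno})"]
    if not isinstance(schema, dict):
        return ["top level must be a JSON object"]

    problems = []
    if schema.get("type") != "object":
        problems.append('top-level "type" should be "object"')

    properties = schema.get("properties")
    if not isinstance(properties, dict) or not properties:
        problems.append('"properties" should be a non-empty object')
        properties = {}

    required = schema.get("required", [])
    if not isinstance(required, list):
        problems.append('"required" should be an array of field names')
        required = []

    # Every required name must be declared under "properties".
    for name in required:
        if name not in properties:
            problems.append(f'required field "{name}" is not defined in "properties"')

    # Assumed convention: each field declares a type, as in the schema builder.
    for name, spec in properties.items():
        if not isinstance(spec, dict) or "type" not in spec:
            problems.append(f'field "{name}" has no "type"')
    return problems


# Example: "netProfitMargin" is required but never defined.
example = """
{
  "type": "object",
  "properties": {
    "revenueGrowthRate": {"type": "number"}
  },
  "required": ["revenueGrowthRate", "netProfitMargin"]
}
"""
for problem in check_schema_text(example):
    print(problem)
```

A check like this catches syntax errors and misspelled field names early. It does not replace a full JSON Schema validator.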
Run an Extractor
To run an Extractor on a document, follow these steps:
- In the Enterprise h2oGPTe navigation menu, click Extractors.
- In the Extractors table, locate the row of the Extractor you want to run and click Run in that row.
- In the Select a collection list, select a Collection.
Note: The selected Collection must include the document intended for use with the Extractor. The Extractor will retrieve all requested information from the document according to its JSON schema.
- Click Run.
Note: Enterprise h2oGPTe creates a Job to process the Extractor. The Extractor is completed when its Job is completed.
View a completed Extractor
Once the Extractor has finished processing, you can access the extracted information of a document by following these steps:
- In the Enterprise h2oGPTe navigation menu, click Collections.
- Click the My Collections tab.
- In the My Collections table, click the Collection name containing the document used for the Extractor.
- In the Documents table, click the document used for the Extractor.
- The most recent Extractor is located in the Recent results section.