Skip to main content
Version: v1.6.14-dev3 🚧

Tutorial 8: With an Extractor, transform unstructured document content into structured JSON data

Overview​

This tutorial demonstrates how to utilize Extractors in Enterprise h2oGPTe to convert unstructured document content into structured JSON data. While documents can contain valuable information, their unstructured nature often makes it challenging to analyze efficiently. Extractors address this challenge by transforming the content of these documents into structured formats that can be readily utilized by individuals and applications requiring organized data.

In this tutorial, we will illustrate how Extractors function by extracting specific pieces of information from Alphabet's Form 10-K.

Objective​

  • Learn how to create and configure an Extractor in Enterprise h2oGPTe to transform unstructured document content into a structured JSON format.

Prerequisites​

Step 1: Discuss Extractors​

Extractors, defined by JSON schemas, play an important role in document AI by converting unstructured document content into structured, actionable JSON data. They allow users to retrieve information from various document types—such as CVs, invoices, Form 10-Ks, or scanned images—without requiring complex setups or extensive annotations.

To use an Extractor, first identify the specific information you want to extract from a document. This information is specified in a JSON schema, which is part of an Extractor and acts as a blueprint for the data, detailing the fields and data types you wish to capture. Once you define this schema, you can apply the Extractor to the document, retrieving the desired information in a structured JSON format. This structured data is useful for individuals and applications that require organized information.

Step 2: Create a Collection​

To apply an Extractor to a document, you must first add the document to a Collection. The document we will use for this tutorial will be Alphabet's Form 10-K. A Form 10-K is a comprehensive annual report that publicly traded companies in the United States must file with the Securities and Exchange Commission (SEC). It provides a detailed overview of their financial performance, operations, and risks. This document includes audited financial statements, management's discussion and analysis, and information about the company's business activities, helping investors make informed decisions.

Let's create a Collection with Alphabet's Form 10-K.

  1. In the Enterprise h2oGPTe navigation menu, click Collections.
  2. Click + New collection.
  3. In the Collection name box, enter the following:
    Tutorial 8
  4. Click Browse Files.
  5. Upload Alphabet's Form 10-K (downloaded as goog-10-k-2023.pdf).
  6. Click + Create.

Step 3: Create an Extractor​

Let's create an Extractor that will be in charge of extracting the following metrics from Alphabet's Form 10-K:

  • Revenue growth rate
  • Net profit margin
  • Current ratio
  • Return on equity
  • Debt-to-equity ratio
  1. In the Enterprise h2oGPTe navigation menu, click Extractors.
  2. Click + New extractor.
  3. In the Extractor name box, enter:
    Form 10-k
  4. Click the Input JSON Schema toggle.
  5. In the JSON Schema box, enter the following valid JSON schema:
{
"$schema":"http://json-schema.org/draft-07/schema#",
"type":"object",
"properties":{
"revenueGrowthRate":{
"type":"number",
"description":"The growth rate of revenue."
},
"netProfitMargin":{
"type":"number",
"description":"The company's profit margin."
},
"currentRatio":{
"type":"number",
"description":"The company's liquidity position."
},
"returnOnEquity":{
"type":"number",
"description":"The efficiency in generating profit from equity."
},
"debtToEquityRatio":{
"type":"number",
"description":"The proportion of debt to shareholders' equity."
}
},
"required":[
"revenueGrowthRate",
"netProfitMargin",
"currentRatio",
"returnOnEquity",
"debtToEquityRatio"
]
}
note

The JSON schema does not require exact label names to align perfectly with document fields, as the collection's large language model (LLM) can interpret and infer label purposes based on context. This allows the model to understand and map various label names, even if there are minor differences in terminology, to their intended data points. Just as a human might deduce what a field intends to capture, the LLM uses its interpretive capability to accurately match schema labels with relevant content, even when exact terms differ.

  1. Click Save.

Step 4: Generate a structured response​

Now, let's run the created Extractor to generate a structured response.

  1. In the Extractors table, locate the row of the newly created Extractor and click Run in that row.
  2. In the Select a collection list, select Tutorial 8.
  3. Click Run.
    note

    Enterprise h2oGPTe creates a Job to process the Extractor.

Step 5: View extracted information​

Once the Extractor has finished, you can access the extracted information by following these steps (the Extractor is considered completed when its Job is finished):

  1. In the Enterprise h2oGPTe navigation menu, click Collections.
  2. Click the My collections tab.
  3. In the My collections table, click Tutorial 8.
  4. In the Documents table, click goog-10-k-20.
  5. In the Recent results section, click the card.

Extractor card Extractor result

Summary​

This tutorial illustrated how Extractors can efficiently organize and extract information from a document, simplifying the process for individuals and applications to interact with a document's data in a structured manner.


Feedback