Using document sets

Work with available document sets or upload a new set.

OCR

Optical character recognition (OCR) detects the text on each page. Start by selecting the document set(s) you want to work with, which makes the OCR button available. Then click OCR in the upper navigation bar.

Ensure you selected the correct document set(s)
Select the OCR method, which is the engine used to run OCR. Choose one of:
- Tesseract: works on PDFs and images, but is less accurate than PDF text extraction
- PDF text extraction: works only on PDFs whose text can be safely extracted, and extracts that text exactly
- docTR: works on PDFs and images, but is less accurate than PDF text extraction
- Paddle OCR Latin: uses the PPOcr library to run OCR on Latin-script languages (e.g. Portuguese and Spanish)
- Paddle OCR Arabic: uses the PPOcr library to run OCR on Arabic
- DocTR EfficientNet B3: uses the DocTR detection model with an EfficientNet B3 recognition model trained on in-house generated data, including handwritten datasets
- DocTR EfficientNet B0: slightly lighter and faster than EfficientNet B3, but less accurate
- DocTR EfficientNet V2M: larger and slower than EfficientNet B3, but more accurate
- E3 Best: first runs PDF text extraction and, if it cannot extract the text, runs DocTR EfficientNet B3 OCR
- best: first runs PDF text extraction and, if it cannot extract the text, runs docTR OCR
Provide a Result name
(Optional) Add a description
Click OCR to run the action

After this action finishes running, your OCR object will appear on the Annotation sets page.
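
The "E3 Best" and "best" methods described above follow an extract-first, OCR-second pattern. The following is a minimal sketch of that pattern only, not the product's implementation. The function `run_best` and the two callables passed to it are hypothetical names, and treating empty or whitespace-only output as a failed extraction is an assumption.

```python
from typing import Callable, Optional, Tuple


def run_best(
    document: bytes,
    extract_pdf_text: Callable[[bytes], Optional[str]],
    run_ocr: Callable[[bytes], str],
) -> Tuple[str, str]:
    """Return (method_used, text) using an extract-first, OCR-second strategy.

    Both callables are hypothetical stand-ins for real engines.
    """
    # Step 1: try direct PDF text extraction.
    text = extract_pdf_text(document)
    # Assumption: None or whitespace-only output means extraction failed.
    if text is not None and text.strip():
        return "pdf_text_extraction", text
    # Step 2: extraction produced nothing usable, so fall back to OCR.
    return "ocr", run_ocr(document)
```

Passing the engines in as callables keeps the fallback logic independent of any particular OCR library, so the same function covers both variants by swapping the second argument.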

Supported languages

H2O Document AI - Publisher supports languages written in the Latin and Arabic scripts. The supported Latin-script languages are:

Afrikaans
Azerbaijani
Bosnian
Czech
Welsh
Danish
Spanish
Estonian
French
Irish
Croatian
Hungarian
Indonesian
Icelandic
Italian
Kurdish
Lithuanian
Latvian
Maori
Malay
Maltese
Dutch
Norwegian
Occitan
Polish
Portuguese
Romanian
Serbian (Latin)
Slovak
Slovenian
Albanian
Swedish
Swahili
Tagalog
Turkish
Uzbek
Vietnamese
German

Import document set

This will import a set of documents. The names of these files must be unique for the import to succeed.

Provide a name for the document set
(Optional) Add a description for the document set
Select if you want to copy the attributes from an available attribute set
Upload the desired documents, images, or compressed files. You can either drag and drop the files or browse for them

Once the document set is imported, it appears on the Document sets page and a corresponding entry appears on the Annotation sets page.
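
Because file names must be unique for the import to succeed, it can help to check a batch before uploading it. The following is a minimal sketch of such a check. The helper `find_duplicate_names` is hypothetical, and it assumes that uniqueness is judged on the file name alone (ignoring folders) and is case-sensitive, which the product may or may not do.

```python
from collections import Counter
from pathlib import PurePosixPath
from typing import Iterable, List


def find_duplicate_names(paths: Iterable[str]) -> List[str]:
    """Return file names (without directories) that occur more than once."""
    # Assumption: only the final path component counts, compared case-sensitively.
    counts = Counter(PurePosixPath(p).name for p in paths)
    return sorted(name for name, n in counts.items() if n > 1)


# Example: two files named invoice.pdf in different folders are reported.
duplicates = find_duplicate_names(
    ["batch1/invoice.pdf", "batch2/invoice.pdf", "batch1/scan.png"]
)
```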

Supported file types

Images
PDFs
ZIPs (and nested ZIP files)
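
To preview what a ZIP upload contains, including files inside nested ZIP files, the archive can be walked recursively before importing it. The following is a sketch using Python's standard `zipfile` module. The function `list_files_in_zip` is a hypothetical helper, and identifying nested archives by a `.zip` extension is an assumption.

```python
import io
import zipfile
from typing import BinaryIO, List


def list_files_in_zip(source: BinaryIO, prefix: str = "") -> List[str]:
    """Recursively list non-ZIP members of an archive, descending into nested ZIPs."""
    found: List[str] = []
    with zipfile.ZipFile(source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # Skip directory entries.
            path = prefix + info.filename
            if info.filename.lower().endswith(".zip"):
                # Read the nested archive into memory and recurse into it.
                nested = io.BytesIO(archive.read(info))
                found.extend(list_files_in_zip(nested, prefix=path + "/"))
            else:
                found.append(path)
    return found
```

Nested archives are read fully into memory here, which keeps the sketch short but may not suit very large uploads.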

You can upload multiple files with the same name, even when they belong to the same file sets. You are prompted to select which file to use only when you import an annotation set that references an ambiguous file.

Interacting with a document set

Each document set has an Info button at the end of the row. Clicking Info will give you the details of your document set (e.g. the description or number of pages). You can also find the logs for your document set here. To see the full log, click Expand. You can also download the log by clicking Download.

The drop-down arrow next to the Info button lets you rename, export, or delete your document set.

Rename

Rename the document set and provide a new description.

Export

Export a zip file of the document set to your local computer.

Delete

Delete the document set. You will be prompted to acknowledge that deletion is irreversible before you can delete your document set.

