Using document sets
Work with available document sets or upload a new set.
OCR
The optical character recognition (OCR) method determines what text is on each page. Start by selecting the document set(s) you want to work with; this makes the OCR button available. Click OCR in the upper navigation bar, then complete the following steps:
- Ensure you selected the correct document set(s)
- Select the OCR method, that is, the engine used to run OCR. The available methods are:
  - Tesseract: works on PDFs and images, less accurate than PDF text extraction
  - PDF text extraction: works only on PDFs whose text can be safely extracted; when it applies, it extracts the text exactly
  - docTR: works on PDFs and images, less accurate than PDF text extraction
  - Paddle OCR Latin: uses the PPOcr library to run OCR on Latin-script languages (e.g. Portuguese and Spanish)
  - Paddle OCR Arabic: uses the PPOcr library to run OCR on Arabic-script languages
  - DocTR EfficientNet B3: uses the docTR detection model and an EfficientNet B3 recognition model trained on in-house generated data, including handwritten datasets
  - DocTR EfficientNet B0: slightly lighter and faster, but less accurate than EfficientNet B3
  - DocTR EfficientNet V2M: larger and slower, but more accurate than EfficientNet B3
  - E3 Best: first runs PDF text extraction, and if it cannot extract the text, runs DocTR EfficientNet B3 OCR
  - best: first runs PDF text extraction, and if it cannot extract the text, runs docTR OCR
- Provide a Result name
- (Optional) Add a description
- Click OCR to run the action
After this action finishes running, your OCR object will appear on the Annotation sets page.
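The fallback behavior described for the best and E3 Best methods (try PDF text extraction first, then fall back to an OCR engine) can be pictured with a minimal Python sketch. This is only an illustration of the pattern, not H2O Document AI's implementation: the function name `ocr_with_fallback`, the two callables, and the rule that an empty or missing result counts as "cannot extract the text" are all assumptions made for the example.

```python
from typing import Callable, Optional, Tuple


def ocr_with_fallback(
    document: bytes,
    extract_pdf_text: Callable[[bytes], Optional[str]],  # hypothetical extractor
    run_ocr: Callable[[bytes], str],                      # hypothetical OCR engine
) -> Tuple[str, str]:
    """Return (text, method_used), preferring direct PDF text extraction."""
    text = extract_pdf_text(document)
    # Assumption: None or whitespace-only output means extraction failed.
    if text is not None and text.strip():
        return text, "pdf_text_extraction"
    # Only fall back to the slower, less exact OCR engine when needed.
    return run_ocr(document), "ocr"
```

Under these assumptions, the OCR engine is never invoked for a document whose text can be extracted directly, which matches the ordering the method descriptions give.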
Supported languages
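Here is a list of the languages supported by H2O Document AI - Publisher, grouped by script: Latin-script languages (Afrikaans through German) come first, followed by Arabic-script languages (Arabic through Urdu).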
- Afrikaans
- Azerbaijani
- Bosnian
- Czech
- Welsh
- Danish
- Spanish
- Estonian
- French
- Irish
- Croatian
- Hungarian
- Indonesian
- Icelandic
- Italian
- Kurdish
- Lithuanian
- Latvian
- Maori
- Malay
- Maltese
- Dutch
- Norwegian
- Occitan
- Polish
- Portuguese
- Romanian
- Serbian (Latin)
- Slovak
- Slovenian
- Albanian
- Swedish
- Swahili
- Tagalog
- Turkish
- Uzbek
- Vietnamese
- German
- Arabic
- Persian
- Uyghur
- Urdu
Import document set
Import a new set of documents. Every file name in the set must be unique for the import to succeed. To import a document set:
- Provide a name for the document set
- (Optional) Add a description for the document set
- Select if you want to copy the attributes from an available attribute set
- Upload the documents, images, or compressed (zip) files. You can either drag and drop them or browse for them
Once the document set is imported, it appears on the Document sets page, and a corresponding entry appears on the Annotation sets page.
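Because the import requires unique file names, it can help to check a batch of files locally before uploading. The following Python sketch is one possible way to do that; it is not part of H2O Document AI. The helper names `find_duplicate_names` and `build_upload_zip` are hypothetical, and the sketch assumes that uniqueness is judged on the base file name (ignoring folders) and is case-sensitive, which the product may or may not do.

```python
from collections import Counter
from pathlib import Path
import zipfile


def find_duplicate_names(paths):
    """Return the sorted base file names that occur more than once."""
    counts = Counter(Path(p).name for p in paths)
    return sorted(name for name, n in counts.items() if n > 1)


def build_upload_zip(paths, zip_path):
    """Write the files into a flat zip, refusing if base names collide."""
    duplicates = find_duplicate_names(paths)
    if duplicates:
        raise ValueError(f"Duplicate file names: {duplicates}")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for p in paths:
            # Store each file under its base name only (flat archive).
            archive.write(p, arcname=Path(p).name)
    return zip_path
```

For example, `find_duplicate_names(["scans/a.pdf", "more/a.pdf", "b.png"])` reports `a.pdf`, so one of the two files would need to be renamed before the set is zipped and uploaded.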
Interacting with a document set
Each document set has an Info button at the end of its row. Click Info to see the details of the document set (e.g. its description or number of pages) along with its logs. To see the full log, click Expand; to download the log, click Download.
The drop-down arrow next to the Info button lets you rename, export, or delete the document set.
Rename
Rename the document set and provide a new description.
Export
Export a zip file of the document set to your local computer.
Delete
Delete the document set. You will be prompted to acknowledge that deletion is irreversible before you can delete the document set.