
Using document sets

Work with available document sets or upload a new set.

OCR

Optical character recognition (OCR) identifies the text on each page. Start by selecting the document set(s) you want to work with; the OCR button then becomes available. Click OCR in the upper navigation bar.

  1. Ensure you have selected the correct document set(s)
  2. Select the OCR Method. This is the engine used to run OCR. It can be one of:
    • Tesseract: works on PDFs and images, less accurate than PDF text extraction
    • PDF text extraction: works only on PDFs whose text can be safely extracted; for those PDFs it extracts the text exactly
    • docTR: works on PDFs and images, less accurate than PDF text extraction
    • Paddle OCR Latin: uses the PPOcr library to do OCR on Latin-script languages (e.g. Portuguese and Spanish)
    • Paddle OCR Arabic: uses the PPOcr library to do OCR on Arabic
    • DocTR EfficientNet B3: uses the DocTR detection model and the EfficientNet B3 architecture recognition model trained on in-house generated data including handwritten datasets
    • DocTR EfficientNet B0: slightly lighter and faster, but less accurate than EfficientNet B3
    • DocTR EfficientNet V2M: larger and slower, but more accurate than EfficientNet B3
    • E3 Best: first runs PDF text extraction, and if it cannot extract the text, runs DocTR EfficientNet B3 OCR
    • best: first runs PDF text extraction, and if it cannot extract the text, runs docTR OCR
  3. Provide a Result name
  4. (Optional) Add a description
  5. Click OCR to run the action

After this action finishes running, your OCR object will appear on the Annotation sets page.
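The fallback behavior described for the E3 Best and best methods can be sketched as follows. This is only an illustration of the "try extraction first, then OCR" idea, not the product's implementation: `run_with_fallback`, `stub_extract`, and `stub_ocr` are hypothetical names, and treating an empty or missing result as "cannot extract the text" is an assumption.

```python
from typing import Callable, Optional, Tuple


def run_with_fallback(
    document: bytes,
    extract_pdf_text: Callable[[bytes], Optional[str]],
    run_ocr: Callable[[bytes], str],
) -> Tuple[str, str]:
    """Return (method_used, text) for one document.

    Both callables are hypothetical stand-ins, not H2O Document AI functions.
    """
    # Try the exact path first: direct PDF text extraction.
    text = extract_pdf_text(document)
    # Assumption: None or whitespace-only output means extraction failed.
    if text is not None and text.strip():
        return "pdf_text_extraction", text
    # Extraction produced nothing usable, so fall back to the OCR engine.
    return "ocr", run_ocr(document)


# Stub engines for illustration only.
def stub_extract(document: bytes) -> Optional[str]:
    # Pretend only documents starting with b"%PDF-text" carry extractable text.
    if document.startswith(b"%PDF-text"):
        return "extracted text"
    return None


def stub_ocr(document: bytes) -> str:
    return "recognized text"


print(run_with_fallback(b"%PDF-text ...", stub_extract, stub_ocr))
print(run_with_fallback(b"scanned image bytes", stub_extract, stub_ocr))
```

In this sketch, swapping the `run_ocr` argument is the only difference between the two combined methods: one would pass a DocTR EfficientNet B3 engine, the other a docTR engine.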

Supported languages

Here is a list of the languages supported by H2O Document AI - Publisher:

  • Afrikaans
  • Azerbaijani
  • Bosnian
  • Czech
  • Welsh
  • Danish
  • Spanish
  • Estonian
  • French
  • Irish
  • Croatian
  • Hungarian
  • Indonesian
  • Icelandic
  • Italian
  • Kurdish
  • Lithuanian
  • Latvian
  • Maori
  • Malay
  • Maltese
  • Dutch
  • Norwegian
  • Occitan
  • Polish
  • Portuguese
  • Romanian
  • Serbian (Latin)
  • Slovak
  • Slovenian
  • Albanian
  • Swedish
  • Swahili
  • Tagalog
  • Turkish
  • Uzbek
  • Vietnamese
  • German

Import document set

Import a set of documents. Every file name in the set must be unique for the import to succeed.

  1. Provide a name for the document set
  2. (Optional) Add a description for the document set
  3. Select if you want to copy the attributes from an available attribute set
  4. Upload the desired documents, images, or compressed files. You can either drag and drop the files or browse for them

Once the document set is imported, it appears on the Document sets page, and a corresponding entry appears on the Annotation sets page.
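Because the import requires unique file names, it can help to check a batch locally before uploading. The following is a minimal sketch under stated assumptions: it compares base names only (ignoring folders) and is case-sensitive, which may or may not match how the import itself judges uniqueness. `duplicate_file_names` and `names_in_zip` are hypothetical helpers, not part of H2O Document AI.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath
from typing import Iterable, List


def duplicate_file_names(paths: Iterable[str]) -> List[str]:
    """Return the base names that occur more than once in `paths`."""
    # Assumption: uniqueness is judged on the file name alone, not the folder.
    names = [PurePosixPath(p).name for p in paths]
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)


def names_in_zip(zip_path: str) -> List[str]:
    """List the file entries of a zip archive, skipping directory entries."""
    with zipfile.ZipFile(zip_path) as archive:
        return [n for n in archive.namelist() if not n.endswith("/")]


batch = ["scans/invoice.pdf", "archive/invoice.pdf", "scans/receipt.png"]
print(duplicate_file_names(batch))
```

For a compressed upload, `duplicate_file_names(names_in_zip("documents.zip"))` would run the same check on the archive contents, where `documents.zip` is a placeholder path.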

Interacting with a document set

Each document set has an Info button at the end of the row. Clicking Info will give you the details of your document set (e.g. the description or number of pages). You can also find the logs for your document set here. To see the full log, click Expand. You can also download the log by clicking Download.

The drop-down arrow next to the Info button gives you the option to either rename, export, or delete your document set.

Rename

Rename the document set and provide a new description.

Export

Export a zip file of the document set to your local computer.

Delete

Delete the document set. You will be prompted to acknowledge that deletion is irreversible before you can delete your document set.

