Version: v1.6.40-dev2 🚧

Add a Document(s) to a Collection

Overview

A Collection can contain multiple Documents. Added Documents are indexed and stored in a database. When you ask a question, h2oGPTe searches the indexed Documents in the Collection for relevant content and uses the H2O LLM to summarize that content into a concise answer. You can add Documents while creating a Collection or afterward.

note

To learn how to create a Collection, see Create a Collection.

Available ingestion methods

h2oGPTe supports multiple methods for adding documents to Collections. Each method has specific configuration options and applications:

Local and direct upload methods

| Method | Description | Best for | Additional resource |
|---|---|---|---|
| Upload Documents | Upload files from your local computer to a collection. | Local files, quick uploads, direct control | Learn more |
| Upload Plain Text | Create documents by entering or pasting text content. | Text content, notes, code snippets | Learn more |
| Import from File System | Import documents from local directories using glob patterns. | Bulk workflows, automated processing | Learn more |
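
Glob patterns select files by name pattern instead of one at a time. As a minimal illustration, the sketch below uses Python's standard `pathlib` to preview which files a pattern would match before a bulk import. It does not call h2oGPTe, whose exact pattern semantics may differ; the directory layout and the `preview_glob` and `make_sample_tree` helpers are hypothetical.

```python
from pathlib import Path
import tempfile


def preview_glob(root: Path, pattern: str) -> list[str]:
    """Return sorted paths (relative to root) of files matching a glob pattern."""
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.glob(pattern)
        if p.is_file()
    )


def make_sample_tree(root: Path) -> None:
    """Create a small, hypothetical directory layout for the demonstration."""
    for name in ["report.pdf", "notes.txt", "archive/2023/q1.pdf", "archive/2023/q1.txt"]:
        path = root / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("sample")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        make_sample_tree(root)
        print(preview_glob(root, "*.pdf"))     # PDFs directly under root
        print(preview_glob(root, "**/*.pdf"))  # PDFs at any depth, including root
```

In `pathlib`, `*` matches within a single directory level, while `**` also descends into subdirectories.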

Web and cloud storage methods

| Method | Description | Best for | Additional resource |
|---|---|---|---|
| Import from URL | Crawl and import content from web pages or entire sites. | Web content, online resources | Learn more |
| Import from S3 | Import documents from Amazon S3 buckets. | AWS ecosystem, cloud workflows | Learn more |
| Import from Azure Blob Storage | Import documents from Azure Blob Storage containers. | Azure ecosystem, enterprise storage | Learn more |
| Import from Google Cloud Storage | Import documents from Google Cloud Storage buckets. | GCP ecosystem, cloud workflows | Learn more |

Enterprise and collaboration methods

| Method | Description | Best for | Additional resource |
|---|---|---|---|
| Import from SharePoint Online | Import documents from Microsoft SharePoint Online sites. | Microsoft 365 ecosystem, enterprise collaboration | Learn more |
| Import from SharePoint On-Premise | Import documents from on-premise SharePoint Server sites. | On-premise SharePoint, enterprise environments | Learn more |

Reuse and organization methods

| Method | Description | Best for | Additional resource |
|---|---|---|---|
| Select Existing Document | Import documents that already exist in other collections. | Reusing documents across collections | Learn more |
| Select Existing Collection | Import all documents from an existing collection. | Bulk collection copying, organization | Learn more |

Shared document processing options

All ingestion methods support the following document processing options that control how your documents are analyzed and indexed:

Document processing options

| Option | Default | Description | Use case |
|---|---|---|---|
| Create short document summaries | Disabled (can be toggled on or off) | Automatically generates a concise summary of each document | Enable when you want quick document overviews for better searchability |
| Create sample questions for documents | Disabled (can be toggled on or off) | Generates suggested questions based on the document content | Helpful for discovering what types of questions the document can answer |
| Spoken language in audio files | Auto (dropdown) | Specifies the language for audio file transcription | Required for accurate transcription of audio files |
| OCR model | Automatic (dropdown) | Selects the OCR engine for extracting text from images and non-readable PDFs | Options: Automatic, Disable, docTR, Tesseract, Mississippi-800M, PaddleOCR-Latin, PaddleOCR-English, PaddleOCR-Chinese, PaddleOCR-Arabic, PaddleOCR-Japanese, PaddleOCR-Korean, PaddleOCR-Cyrillic, PaddleOCR-Devanagari, PaddleOCR-Telugu, PaddleOCR-Kannada, PaddleOCR-Tamil |
| Tesseract language | English (eng) (dropdown) | Language setting for Tesseract OCR processing | Required when using the Tesseract OCR model for non-English documents |
| Ingest Mode | Standard (dropdown) | Document processing mode that affects analysis depth and speed | See the mode descriptions below |

Ingest Mode options:

  • Standard: Smart processing with advanced layout analysis and selective OCR:
    • If extractable text is present on a page, OCR is skipped.
    • If a page has no extractable text, OCR is applied to the whole page (unless it's a solid color).
    • For pages with both text and images, OCR is only applied to images that are likely to contain information.
  • Agents Only: No processing – ingests files "as-is" for use by agents.
  • Lite: Processes data up to 2× faster by skipping OCR and advanced layout analysis.
    • For non-image files, both OCR and advanced layout analysis are skipped.
    • For image files, OCR is performed, but advanced layout analysis is still skipped.
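
The differences between the three ingest modes can be summarized as a small decision function. This is only a sketch of the behavior described in the list above, not h2oGPTe's implementation; `ProcessingPlan`, `plan_for`, and the set of image extensions are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from pathlib import Path

# Assumed, illustrative set of image extensions; the product's list may differ.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tiff", ".bmp"}


@dataclass(frozen=True)
class ProcessingPlan:
    parse_content: bool    # False means the file is ingested as-is
    ocr: str               # "never", "selective", or "always"
    layout_analysis: bool  # advanced layout analysis on or off


def plan_for(mode: str, filename: str) -> ProcessingPlan:
    """Map an ingest mode and a file name to the processing described above."""
    is_image = Path(filename).suffix.lower() in IMAGE_EXTENSIONS
    if mode == "Standard":
        # Layout analysis plus OCR only where a page or image needs it.
        return ProcessingPlan(True, "selective", True)
    if mode == "Agents Only":
        # No processing: files are stored as-is for agents.
        return ProcessingPlan(False, "never", False)
    if mode == "Lite":
        # Layout analysis is always skipped; OCR runs only for image files.
        return ProcessingPlan(True, "always" if is_image else "never", False)
    raise ValueError(f"Unknown ingest mode: {mode!r}")
```

For example, `plan_for("Lite", "scan.png")` requests OCR without layout analysis, while `plan_for("Lite", "report.pdf")` skips both.
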
TIP

OCR model options and recommendations:

  • Automatic: Automatically selects the best OCR method
  • Disable: Turns off OCR processing
  • docTR: Deep learning-based OCR, generally best for English
  • Tesseract: Traditional OCR engine
  • Mississippi-800M: Best for handwriting
  • PaddleOCR: OCR for multiple languages/scripts (including Arabic, Chinese, Cyrillic, Devanagari, Japanese, Kannada, Korean, Tamil, Telugu, Latin)

Recommendations:

  • For English: docTR > PaddleOCR-English > Tesseract
  • For Arabic, Chinese, Cyrillic, Devanagari, Japanese, Kannada, Korean, Tamil, Telugu, Latin: PaddleOCR > Tesseract
  • For handwriting: Mississippi-800M is best
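
The recommendations above amount to a simple lookup. The sketch below encodes them as an ordered preference list using the option names from the OCR model dropdown; the `recommended_ocr_models` helper is hypothetical, and falling back to Automatic for scripts not listed is an assumption rather than documented behavior.

```python
# Scripts that have a dedicated PaddleOCR option in the OCR model dropdown.
PADDLE_SCRIPTS = {
    "Latin", "English", "Chinese", "Arabic", "Japanese", "Korean",
    "Cyrillic", "Devanagari", "Telugu", "Kannada", "Tamil",
}


def recommended_ocr_models(script: str, handwriting: bool = False) -> list[str]:
    """Return OCR model options in recommended order, best first."""
    if handwriting:
        return ["Mississippi-800M"]
    if script == "English":
        return ["docTR", "PaddleOCR-English", "Tesseract"]
    if script in PADDLE_SCRIPTS:
        return [f"PaddleOCR-{script}", "Tesseract"]
    # Assumption: let the Automatic option decide for anything not listed.
    return ["Automatic"]
```

For example, `recommended_ocr_models("Korean")` yields PaddleOCR-Korean first and Tesseract second.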

Note: OCR is only needed for images and non-readable PDFs. Readable PDFs (documents with extractable text) are automatically processed without needing OCR settings.

When to use Lite mode: Choose Lite Ingest Mode when your PDFs contain images without useful text, or when you want to avoid the extra time spent on layout analysis. Lite mode is not recommended for scanned documents or PDFs with images containing important text that needs to be extracted for RAG queries.

When to use Standard mode: The Standard ingest mode includes a smart OCR check. For each page, h2oGPTe determines whether text can be extracted. If so, OCR is skipped for that page. If not, OCR is applied. For pages with both text and images, only images that are likely to contain information are OCR'd. This means you don't need to manually check if your document is readable.
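
The per-page check in Standard mode can be sketched as follows. This is an illustration of the rules stated above under simplifying assumptions, not the actual implementation: the `Page` and `PageImage` types and the `likely_informative` flag are hypothetical, and how h2oGPTe decides whether an image is likely to contain information is not described here.

```python
from dataclasses import dataclass, field


@dataclass
class PageImage:
    name: str
    likely_informative: bool  # hypothetical flag standing in for the real check


@dataclass
class Page:
    has_extractable_text: bool
    is_solid_color: bool = False
    images: list[PageImage] = field(default_factory=list)


def ocr_targets(page: Page) -> list[str]:
    """Return what Standard mode would OCR on a page.

    [] means OCR is skipped, ["<whole page>"] means the full page is OCR'd,
    and any other list names the individual images to OCR.
    """
    if not page.has_extractable_text:
        # No text layer: OCR the whole page unless it is a solid color.
        return [] if page.is_solid_color else ["<whole page>"]
    # Text layer present: OCR only images likely to contain information.
    return [img.name for img in page.images if img.likely_informative]
```

A text-only page produces an empty list, so OCR is skipped for it without any manual check.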

Advanced processing options

| Option | Default | Description | Use case |
|---|---|---|---|
| Keep tables as one chunk | Disabled (can be toggled on or off) | Preserves table structure by keeping entire tables as single chunks | Enable when working with tabular data that needs to maintain relationships |
| Chunk by page | Disabled (can be toggled on or off) | Splits documents into chunks based on page boundaries | Useful for maintaining page-level context in multi-page documents |
| Handwriting check | Disabled (can be toggled on or off) | Enables handwriting detection and processing | Enable when documents contain handwritten text that needs to be transcribed |
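
When you track ingestion settings across many collections, it can help to record only what differs from the defaults. The sketch below mirrors the defaults from the document processing and advanced processing options as a plain data structure. The `IngestOptions` class and its field names are hypothetical and are not h2oGPTe API parameters.

```python
from dataclasses import dataclass, asdict


@dataclass
class IngestOptions:
    """Hypothetical container mirroring the UI defaults described above."""
    create_summaries: bool = False
    create_sample_questions: bool = False
    audio_language: str = "Auto"
    ocr_model: str = "Automatic"
    tesseract_language: str = "eng"
    ingest_mode: str = "Standard"
    keep_tables_as_one_chunk: bool = False
    chunk_by_page: bool = False
    handwriting_check: bool = False

    def overrides(self) -> dict:
        """Return only the settings that differ from the defaults."""
        defaults = asdict(IngestOptions())
        return {k: v for k, v in asdict(self).items() if v != defaults[k]}


if __name__ == "__main__":
    scanned_forms = IngestOptions(ocr_model="Tesseract", handwriting_check=True)
    print(scanned_forms.overrides())
```

An unmodified `IngestOptions()` reports no overrides, which makes it easy to spot collections that use non-default processing.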

Steps to add a document

To add one or more Documents to a Collection, follow these steps:

  1. In the Enterprise h2oGPTe navigation menu, click Collections.
  2. In the Collections table, select the name of the Collection you want to add Documents to.
  3. Click + Add documents.
    info

    You can upload certain text, image, and audio file types to a Collection. To learn more, see Supported file types for a Collection.

  4. Select an ingestion method from the list (for example, Upload documents).
  5. Configure the selected method. For detailed instructions and configuration options for each ingestion method, see the Available ingestion methods tables above.
  6. Click Add.
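
The same upload flow can also be scripted. The sketch below assumes the h2ogpte Python client exposes `upload(file_name, file)` and `ingest_uploads(collection_id, upload_ids)`, as in its published examples; verify both against the client version that matches your server. The `add_documents` helper is hypothetical and accepts the client as an argument, so the block itself does not require the package.

```python
from pathlib import Path


def add_documents(client, collection_id: str, paths: list[str]) -> list[str]:
    """Upload local files, then ingest them into an existing Collection.

    `client` is assumed to be an h2ogpte client object providing
    `upload` and `ingest_uploads` (an assumption; check your client version).
    """
    upload_ids = []
    for path in paths:
        with open(path, "rb") as f:
            # Upload the raw file under its base name and keep the returned ID.
            upload_ids.append(client.upload(Path(path).name, f))
    # Index all uploaded files into the Collection in one call.
    client.ingest_uploads(collection_id, upload_ids)
    return upload_ids


# Usage sketch (address, API key, and collection ID are placeholders):
# from h2ogpte import H2OGPTE
# client = H2OGPTE(address="https://<your-instance>", api_key="<your-api-key>")
# add_documents(client, "<collection-id>", ["report.pdf"])
```

Processing options are left at their defaults here; how they map to client parameters is not covered in this guide.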

Next steps

After adding documents to your collection:

  • Explore ingestion methods: Review each ingestion method and choose the one that best fits your needs.
  • Monitor and test: Check job status, verify that documents are searchable, and test content retrieval.
  • Optimize and scale: Adjust settings as needed and add more documents or collections. To learn how to chat with a Collection, see Chat with a Collection.
