Add a Document(s) to a Collection
Overview
A Collection can contain multiple Documents. Added Documents are indexed and stored in a database. When you ask a question, h2oGPTe searches the indexed Documents in the Collection for relevant content and uses the H2O LLM to summarize that content into a concise answer. You can add Documents while creating a Collection or afterward.
To learn how to create a Collection, see Create a Collection.
Available ingestion methods
h2oGPTe supports multiple methods for adding documents to Collections. Each method has specific configuration options and applications:
Local and direct upload methods
Method | Description | Best for | Additional resource |
---|---|---|---|
Upload Documents | Upload files from your local computer to a collection. | Local files, quick uploads, direct control | Learn more |
Upload Plain Text | Create documents by entering or pasting text content. | Text content, notes, code snippets | Learn more |
Import from File System | Import documents from local directories using glob patterns. | Bulk workflows, automated processing | Learn more |
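Import from File System selects files with glob patterns. As a rough illustration of how such patterns match, the sketch below uses Python's standard `pathlib` module to preview which files a pattern would select. The helper name `preview_matches` and the example directory are hypothetical, and h2oGPTe's own pattern matching may differ in details.

```python
from pathlib import Path


def preview_matches(root: str, pattern: str) -> list[str]:
    """Return files under `root` that match a glob pattern, relative to `root`.

    Hypothetical helper for checking a pattern before an import.
    It is not part of h2oGPTe.
    """
    base = Path(root)
    return sorted(
        p.relative_to(base).as_posix()
        for p in base.glob(pattern)
        if p.is_file()
    )
```

For example, `preview_matches("/data/reports", "*.pdf")` lists only the PDFs directly inside the directory, while `"**/*.pdf"` also descends into subdirectories.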
Web and cloud storage methods
Method | Description | Best for | Additional resource |
---|---|---|---|
Import from URL | Crawl and import content from web pages or entire sites. | Web content, online resources | Learn more |
Import from S3 | Import documents from Amazon S3 buckets. | AWS ecosystem, cloud workflows | Learn more |
Import from Azure Blob Storage | Import documents from Azure Blob Storage containers. | Azure ecosystem, enterprise storage | Learn more |
Import from Google Cloud Storage | Import documents from Google Cloud Storage buckets. | GCP ecosystem, cloud workflows | Learn more |
Enterprise and collaboration methods
Method | Description | Best for | Additional resource |
---|---|---|---|
Import from SharePoint Online | Import documents from Microsoft SharePoint Online sites. | Microsoft 365 ecosystem, enterprise collaboration | Learn more |
Import from SharePoint On-Premise | Import documents from on-premise SharePoint Server sites. | On-premise SharePoint, enterprise environments | Learn more |
Reuse and organization methods
Method | Description | Best for | Additional resource |
---|---|---|---|
Select Existing Document | Import documents that already exist in other collections. | Reusing documents across collections | Learn more |
Select Existing Collection | Import all documents from an existing collection. | Bulk collection copying, organization | Learn more |
Shared document processing options
All ingestion methods support the following document processing options that control how your documents are analyzed and indexed:
Document processing options
Option | Default | Description | Use Case |
---|---|---|---|
Create short document summaries | Disabled (can be toggled on or off) | Automatically generates a concise summary of each document | Enable when you want quick document overviews for better searchability |
Create sample questions for documents | Disabled (can be toggled on or off) | Generates suggested questions based on the document content | Helpful for discovering what types of questions the document can answer |
Spoken language in audio files | Auto (dropdown) | Specifies the language for audio file transcription | Required for accurate transcription of audio files |
OCR model | Automatic (dropdown) | Selects the OCR engine for extracting text from images and non-readable PDFs | Options: Automatic, Disable, docTR, Tesseract, Mississippi-800M, PaddleOCR-Latin, PaddleOCR-English, PaddleOCR-Chinese, PaddleOCR-Arabic, PaddleOCR-Japanese, PaddleOCR-Korean, PaddleOCR-Cyrillic, PaddleOCR-Devanagari, PaddleOCR-Telugu, PaddleOCR-Kannada, PaddleOCR-Tamil |
Tesseract language | English (eng) (dropdown) | Language setting for Tesseract OCR processing | Required when using Tesseract OCR model for non-English documents |
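Ingest Mode | Standard (dropdown) | Document processing mode that affects analysis depth and speed | Keep Standard for scanned documents or images with important text; choose Lite to skip layout analysis |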
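The defaults in the table above can be captured as a small settings object, which is handy when you want to record or review the choices made for an ingestion run. This is a minimal sketch: the class and field names are hypothetical and are not h2oGPTe API parameters, and the check in `warnings` assumes the Tesseract language only matters when Tesseract could be the selected engine.

```python
from dataclasses import dataclass


@dataclass
class ProcessingOptions:
    """Defaults mirror the table above; field names are illustrative only."""

    create_summaries: bool = False          # Create short document summaries
    create_sample_questions: bool = False   # Create sample questions for documents
    audio_language: str = "Auto"            # Spoken language in audio files
    ocr_model: str = "Automatic"            # OCR model
    tesseract_language: str = "eng"         # Tesseract language
    ingest_mode: str = "Standard"           # Ingest Mode

    def warnings(self) -> list[str]:
        """Flag a Tesseract language set alongside a different explicit OCR model."""
        notes = []
        # Assumption: with "Automatic", Tesseract might still be chosen,
        # so only an explicit non-Tesseract model is flagged.
        if (self.tesseract_language != "eng"
                and self.ocr_model not in ("Tesseract", "Automatic")):
            notes.append("tesseract_language is set but ocr_model is not Tesseract")
        return notes
```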
OCR model options and recommendations:
- Automatic: Automatically selects the best OCR method
- Disable: Turns off OCR processing
- docTR: Deep learning-based OCR, generally best for English
- Tesseract: Traditional OCR engine
- Mississippi-800M: Best for handwriting
- PaddleOCR: OCR for multiple languages/scripts (including Arabic, Chinese, Cyrillic, Devanagari, Japanese, Kannada, Korean, Tamil, Telugu, Latin)
Recommendations:
- For English: docTR > PaddleOCR-English > Tesseract
- For Arabic, Chinese, Cyrillic, Devanagari, Japanese, Kannada, Korean, Tamil, Telugu, Latin: PaddleOCR > Tesseract
- For handwriting: Mississippi-800M is best
Note: OCR is only needed for images and non-readable PDFs. Readable PDFs (documents with extractable text) are automatically processed without needing OCR settings.
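The recommendations above can be written down as a small lookup, which may help when choosing an OCR model for many documents at once. The function name `recommend_ocr_models` is hypothetical. The sketch encodes only the orderings listed above and falls back to Automatic for anything they do not cover.

```python
# Scripts that the recommendations above list for PaddleOCR.
PADDLE_SCRIPTS = {
    "Arabic", "Chinese", "Cyrillic", "Devanagari", "Japanese",
    "Kannada", "Korean", "Tamil", "Telugu", "Latin",
}


def recommend_ocr_models(script: str, handwriting: bool = False) -> list[str]:
    """Return OCR model names in order of preference, best first."""
    if handwriting:
        return ["Mississippi-800M"]
    if script == "English":
        return ["docTR", "PaddleOCR-English", "Tesseract"]
    if script in PADDLE_SCRIPTS:
        return [f"PaddleOCR-{script}", "Tesseract"]
    # Not covered by the recommendations: let h2oGPTe pick.
    return ["Automatic"]
```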
When to use Lite mode: Choose Lite Ingest Mode when your PDFs contain images without useful text, or when you want to avoid the extra time spent on layout analysis. Lite mode is not recommended for scanned documents or PDFs with images containing important text that needs to be extracted for RAG queries.
When to use Standard mode: The Standard ingest mode includes a smart OCR check. For each page, h2oGPTe determines whether text can be extracted. If so, OCR is skipped for that page. If not, OCR is applied. For pages with both text and images, only images that are likely to contain information are OCR'd. This means you don't need to manually check if your document is readable.
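The per-page decision described for Standard mode can be sketched as follows. This is only an illustration of the described behavior, not h2oGPTe's implementation: the `Page` and `Image` types are hypothetical, and the `likely_informative` flag stands in for whatever check decides that an image probably contains information.

```python
from dataclasses import dataclass, field


@dataclass
class Image:
    name: str
    likely_informative: bool  # assumption: set by some upstream check


@dataclass
class Page:
    extractable_text: str = ""
    images: list[Image] = field(default_factory=list)


def plan_ocr(page: Page) -> list[str]:
    """Decide what to OCR on one page.

    Returns ["page"] when no text can be extracted. Otherwise returns the
    names of images likely to contain information, which may be empty.
    """
    if not page.extractable_text.strip():
        return ["page"]  # no extractable text: OCR the whole page
    return [img.name for img in page.images if img.likely_informative]
```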
Advanced processing options
Option | Default | Description | Use Case |
---|---|---|---|
Keep tables as one chunk | Disabled (can be toggled on or off) | Preserves table structure by keeping entire tables as single chunks | Enable when working with tabular data that needs to maintain relationships |
Chunk by page | Disabled (can be toggled on or off) | Splits documents into chunks based on page boundaries | Useful for maintaining page-level context in multi-page documents |
Handwriting check | Disabled (can be toggled on or off) | Enables handwriting detection and processing | Enable when documents contain handwritten text that needs to be transcribed |
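To make the effect of the two chunking options concrete, the toy example below compares them against a simple line-based splitter. The splitter is a stand-in chosen for illustration, since this page does not describe how h2oGPTe actually splits text, and all function names are hypothetical.

```python
def split_lines(lines: list[str], max_lines: int) -> list[str]:
    """Cut a list of lines into chunks of at most `max_lines` lines."""
    return ["\n".join(lines[i:i + max_lines])
            for i in range(0, len(lines), max_lines)]


def chunk_table(rows: list[str], keep_as_one_chunk: bool,
                max_lines: int = 2) -> list[str]:
    """Chunk a table given as one string per row."""
    if keep_as_one_chunk:
        return ["\n".join(rows)]         # the whole table stays together
    return split_lines(rows, max_lines)  # rows may land in different chunks


def chunk_pages(pages: list[list[str]], by_page: bool,
                max_lines: int = 2) -> list[str]:
    """Chunk a document given as a list of pages, each a list of lines."""
    if by_page:
        return ["\n".join(page) for page in pages]  # one chunk per page
    all_lines = [line for page in pages for line in page]
    return split_lines(all_lines, max_lines)        # chunks may span pages
```

With the options off, a header row can be separated from later rows and a chunk can straddle two pages. With them on, each table or page is retrieved as a unit.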
Steps to add a document
To add one or more Documents to a Collection, follow these steps:
- In the Enterprise h2oGPTe navigation menu, click Collections.
- In the Collections table, select the name of the Collection you want to add Documents to.
- Click + Add documents.
Note: You can upload certain text, image, and audio file types to a Collection. To learn more, see Supported file types for a Collection.
- Select the Upload document option from the list.
- For detailed instructions and configuration options for each ingestion method, see the ingestion method tables above.
- Click Add.
Next steps
After adding documents to your collection:
- Read more about ingestion methods: Try each ingestion method and choose the one that best meets your needs.
- Monitor and test: Check job status, verify that documents are searchable, and test content retrieval.
- Optimize and scale: Adjust settings as needed and add more documents or collections. To learn how to chat with a Collection, see Chat with a Collection.