Version: v1.6.44-dev7 🚧

Architecture

h2oGPTe Architecture Diagram

Overview

h2oGPTe is an enterprise RAG (Retrieval-Augmented Generation) platform that provides secure document processing, conversational AI, agentic workflows, and deep research capabilities. The platform enables autonomous AI agents to perform complex multi-step tasks, conduct thorough research across large document collections, and execute sophisticated reasoning chains. The architecture follows a microservices pattern with PostgreSQL for data storage, Redis for caching, MinIO for object storage, and Keycloak for authentication. In the architecture diagram below, green boxes indicate GPU-accelerated services.

Data Flow Patterns

The h2oGPTe platform implements several key data flow patterns that enable efficient document processing, intelligent retrieval, and real-time AI interactions. These patterns are designed to optimize performance, ensure data consistency, and provide seamless user experiences across different use cases.

1. Document Ingestion Flow

The document ingestion pipeline handles the complete lifecycle of document processing, from initial upload to final indexing. This flow supports multiple document formats (PDF, Word, Excel, images, etc.) and automatically extracts structured and unstructured content while preserving document layout and formatting. The pipeline includes intelligent chunking strategies, metadata extraction, and multi-modal content understanding.

2. RAG Query Flow

The Retrieval-Augmented Generation (RAG) query flow combines semantic search with generative AI to provide accurate, contextual responses. This pattern leverages both vector similarity and lexical search to find relevant content, then uses advanced prompt engineering to generate responses that are grounded in your organization's data. The flow includes caching mechanisms for performance optimization and streaming for real-time user interaction.

3. Agent Execution Flow

The agent execution flow enables autonomous AI agents to perform complex, multi-step tasks using a variety of tools and reasoning capabilities. Agents can plan their approach, execute tools iteratively, and adapt their strategy based on intermediate results. This pattern is essential for workflows that require decision-making, data analysis, or integration with external systems.

Component Responsibilities

Each component in the h2oGPTe architecture has specific responsibilities and capabilities designed to work together as a cohesive system. The platform's modular design allows for independent scaling, maintenance, and enhancement of individual services while maintaining overall system integrity.

Note: Green boxes indicate GPU-accelerated services that leverage specialized hardware for compute-intensive AI operations.

h2oGPT Service

The h2oGPT service acts as the LLM abstraction layer for all AI text generation operations in h2oGPTe.

LLM Routing: All LLM requests from h2oGPTe go to h2oGPT at configurable endpoints:
- H2OGPTE_CORE_LLM_ADDRESS: Primary LLM endpoint
- H2OGPTE_CORE_OPENAI_ADDRESS: OpenAI-compatible API endpoint
- H2OGPTE_CORE_AGENT_SHARED_ADDRESS: Shared agent service endpoint
- H2OGPTE_CORE_AGENT_ISOLATED_ADDRESS: Isolated agent service endpoint
- H2OGPTE_CORE_LITE_LLM_ADDRESS: LiteLLM proxy endpoint
Deployment Flexibility:
- Internal deployment: When h2oGPT runs within the cluster/network, provides access to multiple configured LLMs
- External deployment: When pointing to external h2oGPT instances with pre-configured LLM choices
LLM Abstraction: Abstracts different LLM providers (vLLM, text-generation-inference, Replicate, Azure, OpenAI, AWS Bedrock, H2O MLOps)
Prompt Engineering: Handles prompt optimization and template management for different LLMs
API Support: Provides both text completion and chat APIs with custom context and prompts
Document Processing: Map/reduce API built on langchain for document processing workflows

Frontend (UI)

The React-based frontend provides a modern, responsive user interface built with TypeScript and Tailwind CSS. It serves as the primary interaction point for users, offering an intuitive experience for both technical and non-technical users.

User Interface: Interactive chat interface with markdown rendering, document viewer with highlighting, collection management dashboard, and admin panels
Real-time Communication: WebSocket-based streaming for live AI responses, progress indicators for long-running operations, and collaborative features
File Management: Drag-and-drop file upload with progress tracking, batch document processing, preview and download capabilities
Configuration: User preferences and settings management, theme customization, API key management, and workspace configuration

Mux (API Gateway)

The Mux gateway, written in Go, serves as the API gateway and authentication layer for all client requests.

Authentication: OIDC integration with Keycloak, JWT token validation, session management, guest user support with device fingerprinting
Authorization: Role-based access control (RBAC), API key management, license validation
Database Integration: PostgreSQL connection pooling with trusted/untrusted user separation, row-level security enforcement
Request Routing: HTTP/WebSocket routing to backend services, Redis pub/sub for multi-instance synchronization
File Operations: File upload/download handling, streaming support for large documents

Core Service

The Core service, implemented in Python, acts as the central orchestrator for document processing and LLM interactions.

Orchestration: Coordinates document processing workflows and service interactions
LLM Management: Handles requests to the h2oGPT service for LLM interactions
Configuration: Manages system-wide settings and environment configurations
Encryption: Provides encryption/decryption services for sensitive data
File Server: Built-in file serving capabilities for document access

VEX Service

The VEX service provides vector search and indexing capabilities with support for multiple backends.

Vector Search: Similarity search using embeddings, HNSW algorithm for internal backend
Full-text Search: Text search capabilities alongside vector search
Backend Support: Internal backends (HNSW, SQLite), External backends (Elasticsearch, Milvus, Qdrant, Redis)
Document Processing: Chunking strategies, embedding generation
FastAPI Interface: REST API built with FastAPI and uvicorn

Crawl Service

The Crawl service handles document ingestion from various enterprise sources.

Connectors: SharePoint (On-premise and Online), Azure Blob Storage, Google Cloud Storage, AWS S3, Local file system
Document Processing: Document parsing and metadata extraction, Integration with parse capabilities for text extraction
Chunking: Document chunking for vector indexing
Ingestion Pipeline: Batch and incremental ingestion support

Chat Service

The Chat service manages conversation sessions and RAG pipeline execution.

Session Management: Chat session creation and tracking, conversation history storage in PostgreSQL
RAG Pipeline: Document retrieval from VEX service, context injection for LLM prompts
Real-time Communication: WebSocket support for streaming responses
Integration: Coordinates with Core service for LLM interactions

Parse Service

The Parse service (integrated within Crawl) provides document parsing and text extraction.

Format Support: PDF, Word, Excel, PowerPoint, HTML, images, and various text formats
OCR Capabilities: Optical character recognition for scanned documents, multi-language support
Text Extraction: Structured text extraction from documents, metadata extraction
Integration: Embedded within the crawl service pipeline

Models Service

The Models service provides GPU-accelerated model serving capabilities.

GPU Support: CUDA support for GPU acceleration, configurable GPU device allocation
Resource Management: Memory limits and resource constraints, container-based deployment
Integration: Works with external h2oGPT service for LLM capabilities
PII Detection: Built-in PII detection and redaction capabilities

Deployment Architecture

The h2oGPTe platform is designed for flexible deployment across various environments, from single-node development setups to large-scale production clusters. The architecture supports containerized deployment using Docker and Kubernetes, with comprehensive configuration management and monitoring capabilities.

Container Architecture

The platform utilizes a containerized microservices architecture that ensures consistency across environments and simplifies deployment and scaling. Each service runs in its own container with clearly defined interfaces and dependencies.

Deployment Options

Docker Compose Deployment

Services: h2ogpte-app (Python services), h2ogpte-mux (Go gateway), h2ogpte-ui (React frontend)
Infrastructure: PostgreSQL, Redis, MinIO, Keycloak
Optional Services: vLLM server for local LLM hosting, Milvus/Elasticsearch for vector search

Kubernetes Deployment

Helm Charts: Available for production deployments
Multi-instance Support: Redis pub/sub for service synchronization

Configuration Management

Environment Variables: Extensive configuration via environment variables
Settings Service: Centralized configuration management
Feature Flags: Runtime feature toggling support

Security Architecture

Security is built into every layer of the h2oGPTe platform, implementing defense-in-depth strategies to protect sensitive data and ensure compliance with enterprise security requirements.

Authentication and Identity Management

OIDC/OAuth2: Full OpenID Connect and OAuth 2.0 implementation for enterprise SSO
JWT Tokens: Token-based authentication with JWKS validation and key rotation
OAuth2 Flows: Authorization code flow with PKCE, token refresh, and token exchange
Session Management: Redis-backed session storage with automatic token refresh
API Keys: Database-stored API keys for programmatic access
Guest Users: Device fingerprinting for anonymous access
Enterprise IdP Support: Integration with Okta, Azure AD, Keycloak, and other OIDC providers

Authorization and Access Control

PostgreSQL RLS: Row-level security policies for data isolation
RBAC System: Database-backed roles and permissions
User Types: Regular users, admin users, guest users
Trusted/Untrusted Connections: Separate database connection pools based on trust level
Audit Logging: Database-level audit trails

Data Security

Encryption: Configurable encryption for sensitive data
PII Detection: Built-in PII detection and redaction
Secure Cookies: SameSite policies and secure flag
Object Storage: MinIO with bucket-level access controls

Infrastructure Security

Container Security: Docker-based isolation
Service Communication: Internal service authentication
Rate Limiting: Built-in rate limiting and cost controls
License Validation: Enterprise license enforcement

Storage Architecture

PostgreSQL: Primary database with 150+ migrations, extensive stored procedures
Redis: Caching layer and pub/sub messaging
MinIO Buckets: Documents, collections, user data, shared data, agent tools
Cloud Storage Support: S3, Azure Blob, Google Cloud Storage integration

h2oGPT Services Architecture

When h2oGPT is deployed as part of the h2oGPTe stack, it runs as a distributed set of services that handle different aspects of LLM operations. This microservice architecture allows for better resource allocation, scalability, and fault tolerance.

Docker Container Service Breakdown

h2ogpt-openai Container

Container Role: Primary LLM interface and request routing hub

Internal Services:

OpenAI API Server (:5000): Main LLM endpoint, OpenAI-compatible API
LiteLLM Proxy (:5020): Multi-provider LLM routing (OpenAI, Azure, Bedrock, etc.)
Gradio UI (:7860): Web interface for direct LLM interaction

Resource Profile: CPU-optimized, no GPU requirements Memory Limit: 32GB

h2ogpt-function Container

Container Role: GPU-accelerated compute and specialized AI functions

Internal Services:

Function Server (:5002): Tool execution, function calls, GPU-intensive tasks
Function OpenAI API (:5005): Specialized API for STT and image generation
Function Gradio UI (:7860→7861): Web interface for function-specific operations

Resource Profile: GPU-accelerated with NVIDIA runtime support Memory Limit: 32GB GPU Configuration: Configurable via CUDA_VISIBLE_DEVICES

Deployment Requirements:

When Required: Function server is only needed when agent pods are enabled
CPU Mode: Can run on CPU but not recommended - significantly slower
GPU Mode Minimum: 8GB GPU memory (baseline for standard operations)
Typical GPU Usage: ~5GB GPU memory for base image processing operations (H2OImageCaptionLoader, H2OOCRLoader)

GPU Memory Scaling:

Base image processing (OCR, captions): 5GB
With image generation: Varies by model (can range up to e.g. 48GB for image generation model)
Multi-GPU support: Can distribute models across GPUs (e.g., image generation on GPU 1, other operations on GPU 0)

h2ogpt-agent-shared Container

Container Role: Multi-worker agent execution environment

Internal Services:

Agent Server (:5004): Multi-worker agent operations with configurable worker count

Resource Configuration:

Memory Limit: 64GB
Worker Count: Configurable via ${H2OGPT_WORKERS}
Docker-out-of-Docker support for AutoGen via /var/run/docker.sock mount

Communication Pattern:

Connects to h2ogpt-openai:5000 for LLM requests
Connects to h2ogpt-openai:5020 for multi-provider routing
Connects to h2ogpt-function:5002 for tool execution
Connects to h2ogpt-function:5005 for STT/ImageGen operations
Uses IMAGEGEN_OPENAI_BASE_URL and STT_OPENAI_BASE_URL for specialized AI functions

h2ogpt-agent-isolated Container

Container Role: Single-tenant isolated agent execution

Internal Services:

Agent Server (:5006): Single-request agent with strict isolation

Resource Configuration:

Memory Limit: 64GB
Worker Count: Fixed at 1 (H2OGPT_AGENT_WORKERS: "1")
Docker-out-of-Docker support for AutoGen via /var/run/docker.sock mount

Isolation Configuration:

NUM_CONCURRENT_REQUESTS: "1" enforces single request processing
CONCURRENT_REQUESTS_BEHAVIOR: "reject" prevents request queuing
Dedicated resource allocation ensures no interference between requests

Communication Pattern: Same as agent-shared but with guaranteed isolation

Network Architecture

All h2oGPT services communicate within the h2ogpt-network bridge network, enabling:

Service Discovery: Docker DNS resolution between containers
Internal APIs: Direct service-to-service communication on internal ports
Load Balancing: Automatic failover and request distribution
Security: Isolated network segment with controlled external access

Resource Allocation and Deployment Configuration

Memory Limits

h2ogpt-openai: 32GB memory limit (CPU-optimized for API routing)
h2ogpt-function: 32GB memory limit (GPU-accelerated for compute tasks)
h2ogpt-agent-shared: 64GB memory limit (multi-worker agent processing)
h2ogpt-agent-isolated: 64GB memory limit (single-tenant isolation)

Worker Configuration

OpenAI Service: Configurable workers via ${H2OGPT_WORKERS}
Function Service: Fixed single worker (H2OGPT_FUNCTION_SERVER_WORKERS: "1")
Agent Shared: Configurable workers via ${H2OGPT_WORKERS}
Agent Isolated: Fixed single worker (H2OGPT_AGENT_WORKERS: "1")

Special Capabilities

Browser Integration: Agent containers support browser cookies (H2OGPT_BROWSER_COOKIES: "1")
GPU Access: Function container configurable via CUDA_VISIBLE_DEVICES environment variable

Feedback

Submit and view feedback for this page
Send feedback about Enterprise h2oGPTe to cloud-feedback@h2o.ai

Overview​

Data Flow Patterns​

1. Document Ingestion Flow​

2. RAG Query Flow​

3. Agent Execution Flow​

Component Responsibilities​

h2oGPT Service​

Frontend (UI)​

Mux (API Gateway)​

Core Service​

VEX Service​

Crawl Service​

Chat Service​

Parse Service​

Models Service​

Deployment Architecture​

Container Architecture​

Deployment Options​

Docker Compose Deployment​

Kubernetes Deployment​

Configuration Management​

Security Architecture​

Authentication and Identity Management​

Authorization and Access Control​

Data Security​

Infrastructure Security​

Storage Architecture​

h2oGPT Services Architecture​

Docker Container Service Breakdown​

h2ogpt-openai Container​

h2ogpt-function Container​

h2ogpt-agent-shared Container​

h2ogpt-agent-isolated Container​

Network Architecture​

Resource Allocation and Deployment Configuration​

Memory Limits​

Worker Configuration​

Special Capabilities​

Overview

Data Flow Patterns

1. Document Ingestion Flow

2. RAG Query Flow

3. Agent Execution Flow

Component Responsibilities

h2oGPT Service

Frontend (UI)

Mux (API Gateway)

Core Service

VEX Service

Crawl Service

Chat Service

Parse Service

Models Service

Deployment Architecture

Container Architecture

Deployment Options

Docker Compose Deployment

Kubernetes Deployment

Configuration Management

Security Architecture

Authentication and Identity Management

Authorization and Access Control

Data Security

Infrastructure Security

Storage Architecture

h2oGPT Services Architecture

Docker Container Service Breakdown

h2ogpt-openai Container

h2ogpt-function Container

h2ogpt-agent-shared Container

h2ogpt-agent-isolated Container

Network Architecture

Resource Allocation and Deployment Configuration

Memory Limits

Worker Configuration

Special Capabilities