
Architecture

Overview

The system largely follows a Firebase-like "serverless" architecture: there is no API server to speak of, and the UI fetches data from Postgres by invoking Postgres routines through a thin bridge server. Postgres has row-level security (RLS) enabled and uses the user's OAuth2 credentials to authorize reads and writes. Document ingest, indexing, and question answering are all implemented as background workers written in Python, which pull jobs from Redis queues. Semantic/lexical search runs as a separate service. Raw content is stored in ObjectStore. All subsystems except the databases are stateless and can be scaled independently of one another.
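
As a concrete illustration of the worker pattern described above, a minimal sketch of a queue consumer follows. It assumes the redis-py client; the queue name and job schema are hypothetical, not h2oGPTe's actual internals.

```python
import json

import redis  # assumption: the redis-py package

r = redis.Redis(host="localhost", port=6379)

def handle(job: dict) -> None:
    # A real worker would do ingest, indexing, or question answering here.
    print(f"processing {job['kind']} job {job['id']}")

while True:
    # BLPOP blocks until a job arrives, so idle workers consume almost nothing.
    _queue, payload = r.blpop("jobs:ingest")  # hypothetical queue name
    handle(json.loads(payload))
```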

Architecture

Green nodes in the architecture diagram are GPU-optimized services.

Subsystems

  • GUI: Browser-based user interface. Written in TypeScript using React and Tailwind.
  • macOS: Native app for macOS.
  • Swagger: Auto-generated REST API.
  • Python: Asynchronous and synchronous Python RPC client using WebSockets (see the client sketch after this list).
  • Mux: Pass-through HTTP server that lets the UI communicate with all backend services and send/receive chat messages. Written in Go.
  • VectorDB:
    • hnswlib+MySQL (default) or Milvus/Elastic/Qdrant/Redis for the vector and text indexes.
    • Vector indexes for similarity search.
    • Full-text indexes for lexical search.
  • Workers: Scalable, distributed. Written in Python.
    • Vex: Python server with an HTTP interface that facilitates connections to the internal or external VectorDB.
    • Core: Core HTTP server written in Python. Serves low-latency methods used directly by the UI and clients.
    • Crawl: Coordinates document ingestion, indexing, and Document.AI processing.
    • Chat: Coordinates user chat sessions, RAG, and agentic AI.
    • Models: Scalable service of AI helpers for embeddings, OCR, layout analysis, captions, guardrails, audio/image processing, etc. Highly parallelized and GPU-optimized.
    • h2oGPT:
      • All LLM requests from h2oGPTe go to an h2oGPT instance at H2OGPTE_CORE_LLM_ADDRESS, H2OGPTE_CORE_OPENAI_ADDRESS, or H2OGPTE_CORE_AGENT_ADDRESS (a routing sketch follows this list).
        • (Default for testing, ~10 different LLMs) If H2OGPTE_CORE_LLM_ADDRESS is internal to the Kubernetes cluster (or the Docker Compose network), then h2oGPTe directs LLM requests to the built-in h2oGPT. This gives the flexibility to configure multiple LLMs at various other endpoints, such as an h2oGPT instance running elsewhere, H2O MLOps, Azure, OpenAI, AWS Bedrock, Replicate.com, etc.
        • (Default for installs and for HAIC) If H2OGPTE_CORE_LLM_ADDRESS is external to the h2oGPTe-local network, then h2oGPTe cannot control it, and the LLM choices are fixed by the remote h2oGPT instance. This is fine if the installer/user/customer wants a limited set of h2oGPT/TGI/vLLM/OpenAI/Azure/AWS Bedrock endpoints, for example Replicated installs with a custom vLLM, or Managed Cloud with just a handful of LLMs in H2O MLOps.
      • Performs prompt engineering.
      • Abstracts away different LLMs (local or remote, e.g., vLLM, text-generation-inference, Replicate, Azure, etc.).
      • Provides text completion API and chat API for talking to LLMs, with (or without) custom context and prompts.
      • Provides a map/reduce API, built on top of LangChain, for processing documents.
    • h2oGPT_Agent:
      • Multimodal Agentic AI
        • Planning
        • Code execution
        • Review and Iteration
        • Tool usage
      • GPU optimized helpers for additional capabilities:
        • Image generation
        • Speech to Text
        • Text to Speech
  • Redis: Contains:
    • User sessions.
    • Job queue and job scheduler data/stats.
    • Pub/sub for brokering chat messages (see the pub/sub sketch after this list).
  • ObjectStore: Object storage for raw content and documents.
    • MinIO for on-prem, S3 for Managed Cloud.
    • Documents
    • PDF versions of documents
    • Per-page previews
  • Database:
    • Postgres for on-prem, RDS for Managed Cloud.
    • Metadata about users, collections, and documents.
    • Chat sessions and message history.
  • LLMs:
    • Private LLMs (air-gapped, on-premises)
      • HuggingFace Models + vLLM, GPUs required
    • Private LLMs (hosted in cloud by customer)
      • HuggingFace Models + vLLM, GPUs required
    • External LLMs (third-party)
      • Azure/OpenAI
      • Anthropic
      • Amazon/Bedrock/Sagemaker
      • Google
      • Mistral
      • Grok/Together.ai/Replicate
  • Tools: Access to a code interpreter, a shell, Python, and various internal and external APIs for third-party services such as internet search, math APIs, data connectors, etc.
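
The client sketch referenced in the Python bullet above: a hedged usage outline based on the publicly documented h2ogpte package. Method names and signatures may differ between versions, so treat this as a sketch rather than the definitive API; the endpoint and key are placeholders.

```python
from h2ogpte import H2OGPTE  # pip install h2ogpte

# Hypothetical endpoint and key; substitute your deployment's values.
client = H2OGPTE(
    address="https://h2ogpte.example.com",
    api_key="sk-XXXX",
)

# Create a collection, start a chat session against it, and ask a question.
collection_id = client.create_collection(
    name="Demo",
    description="Example collection",
)
chat_session_id = client.create_chat_session(collection_id)
with client.connect(chat_session_id) as session:
    reply = session.query("Summarize the uploaded documents.")
    print(reply.content)
```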
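
The routing sketch referenced in the h2oGPT bullet: the rule that an internal H2OGPTE_CORE_LLM_ADDRESS routes to the built-in h2oGPT while an external one routes to a remote instance with a fixed LLM menu. This is an illustrative reconstruction, not h2oGPTe's actual code; the internal-address heuristic and default address are assumptions.

```python
import os

# Assumption: in-cluster addresses are recognizable by markers like these.
INTERNAL_MARKERS = (".svc.cluster.local", "://h2ogpt:")

def resolve_llm_backend() -> tuple[str, bool]:
    """Return (address, is_internal) for the h2oGPT serving LLM requests."""
    address = os.environ.get("H2OGPTE_CORE_LLM_ADDRESS", "http://h2ogpt:8888")
    is_internal = any(m in address for m in INTERNAL_MARKERS)
    return address, is_internal

address, internal = resolve_llm_backend()
if internal:
    # Built-in h2oGPT: h2oGPTe controls which LLM endpoints are configured.
    print(f"using built-in h2oGPT at {address}")
else:
    # Remote h2oGPT: the available LLMs are fixed by that instance.
    print(f"using external h2oGPT at {address}")
```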
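
The pub/sub sketch referenced in the Redis bullet: how chat messages can be brokered over a Redis channel using redis-py. The per-session channel naming scheme is hypothetical.

```python
import redis  # assumption: the redis-py package

r = redis.Redis(host="localhost", port=6379)

# e.g., Mux subscribes to a per-session channel to stream replies to the UI...
pubsub = r.pubsub()
pubsub.subscribe("chat:session:1234")  # hypothetical channel name

# ...while a Chat worker publishes messages as the LLM produces them.
r.publish("chat:session:1234", "partial answer...")

for message in pubsub.listen():
    if message["type"] == "message":
        print(message["data"].decode())
        break
```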

Scaling

  • Mux is stateless. Load-balance additional instances to improve concurrency.
  • Background Workers are also stateless. Spin up more instances to improve throughput.
