Version: v1.6.14-dev3 🚧

Architecture

Overview

The system has a largely Firebase-like "serverless" architecture: there is no dedicated API server, and the UI fetches data from Postgres by invoking Postgres routines through a bridge server. Postgres has row-level security (RLS) enabled and uses the user's OAuth2 credentials to read and write data. Document ingestion, indexing, and question answering are all implemented as background workers written in Python, which pull jobs from Redis queues. Semantic and lexical search runs as a separate service. Raw content is stored in Minio. All subsystems except the databases are stateless and can be scaled independently of one another.
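As a rough illustration of the worker pattern described above, the following sketch uses Python's in-process queue.Queue as a stand-in for a Redis queue. The Job type, the job kinds, and the handler functions are hypothetical names chosen for this example and are not part of the product's API.

```python
import queue
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Job:
    kind: str      # hypothetical job kind, e.g. "ingest" or "chat"
    payload: dict  # everything the worker needs travels with the job


def run_worker(jobs: "queue.Queue[Job]",
               handlers: Dict[str, Callable[[dict], str]]) -> List[str]:
    """Drain the queue and return the handler results in processing order.

    The worker keeps no state between jobs, so any number of identical
    workers could consume from the same queue.
    """
    results: List[str] = []
    while True:
        try:
            job = jobs.get_nowait()  # stand-in for popping from a Redis queue
        except queue.Empty:
            return results
        results.append(handlers[job.kind](job.payload))
        jobs.task_done()
```

Because each job carries its full payload, adding workers only requires pointing more processes at the same queue.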

Diagram

Subsystems

  • GUI: Browser-based user interface. Written in TypeScript using React and Tailwind.
  • macOS: Native app for macOS.
  • Swagger: Auto-generated REST API.
  • Python: Asynchronous and synchronous Python client using WebSockets.
  • Mux: Pass-through HTTP server for the UI to communicate with all backend services, and send/receive chat messages. Written in Go.
  • Vex: Multi-modal database written in Python, with HTTP interface. Contains:
    • Content chunks with provenance.
    • Vector indexes for similarity search.
    • Full-text indexes for lexical search.
  • Workers: Scalable, distributed. Written in Python.
    • Core: Core HTTP server written in Python. Serves low-latency methods used directly by the UI and clients.
    • Crawler: Coordinates document ingestion and indexing.
    • Chat: Coordinates user chat sessions, RAG and Agentic AI.
    • Models: Scalable service for AI helpers for embeddings, OCR, layout, captions, guardrails, audio/image processing, etc. Highly parallelized and GPU optimized.
    • h2oGPT:
      • All LLM requests from h2oGPTe go to an h2oGPT instance at H2OGPTE_CORE_LLM_ADDRESS, H2OGPTE_CORE_OPENAI_ADDRESS, or H2OGPTE_CORE_AGENT_ADDRESS.
        • (Default for testing, ~10 different LLMs) If H2OGPTE_CORE_LLM_ADDRESS is internal to the k8s cluster (or the Docker Compose network), h2oGPTe directs LLM requests to the built-in h2oGPT. This gives us the flexibility to configure multiple LLMs at various other endpoints, such as h2oGPT running elsewhere, H2O MLOps, Azure, OpenAI, AWS Bedrock, Replicate.com, etc.
        • (Default for installs and for HAIC) If H2OGPTE_CORE_LLM_ADDRESS is external to the h2oGPTe-local network, we can't control it, and the LLM choices are fixed by the remote h2oGPT instance. This is fine if the installer/user/customer wants to use a limited set of h2oGPT/TGI/vLLM/OpenAI/Azure/AWS Bedrock endpoints, for example Replicated with custom vLLM, or Managed Cloud with just a handful of LLMs in H2O MLOps.
      • Performs prompt engineering.
      • Abstracts away different LLMs (local or remote, e.g., vLLM, text-generation-inference, Replicate, Azure, etc.)
      • Provides text completion API and chat API for talking to LLMs, with (or without) custom context and prompts.
      • Provides map/reduce API built on top of LangChain for processing documents.
    • h2oGPT_Agent:
      • Multimodal Agentic AI
        • Planning
        • Code execution
        • Review and Iteration
        • Tool usage
      • GPU optimized helpers for additional capabilities:
        • Image generation
        • Speech to Text
        • Text to Speech
  • Redis: Contains:
    • User sessions.
    • Job queue and job scheduler data/stats.
    • Pub/sub for brokering chat messages.
  • Minio: Object storage for raw content and documents.
  • Postgres:
    • Metadata about users, collections, and documents.
    • Chat sessions and message history.
  • LLMs:
    • Private LLMs (air-gapped, on-premises)
      • HuggingFace Models + vLLM
    • Private LLMs (hosted in cloud by customer)
      • HuggingFace Models + vLLM
    • External LLMs (third-party)
      • Azure/OpenAI
      • Anthropic
      • Amazon
      • Google
      • Mistral
      • Grok/Together.ai/Replicate
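
The h2oGPT entry above distinguishes between an H2OGPTE_CORE_LLM_ADDRESS that is internal to the cluster or Docker Compose network and one that is external. The sketch below shows one way such a distinction could be made. The heuristic (single-label hostnames and Kubernetes service suffixes count as internal), the suffix list, and both function names are assumptions made for illustration; the product's actual logic is not described here and may differ.

```python
import os
from typing import Mapping
from urllib.parse import urlparse

# Assumed suffixes for in-cluster Kubernetes service names (illustrative only).
INTERNAL_SUFFIXES = (".svc.cluster.local", ".svc")


def is_internal_address(address: str) -> bool:
    """Guess whether an address points inside the local network.

    Hypothetical heuristic: a hostname without dots (for example a Docker
    Compose service name) or one ending in a Kubernetes service suffix is
    treated as internal. IP addresses are treated as external here.
    """
    url = address if "://" in address else "http://" + address
    host = urlparse(url).hostname or ""
    if not host:
        return False
    return "." not in host or host.endswith(INTERNAL_SUFFIXES)


def llm_routing_mode(env: Mapping[str, str] = os.environ) -> str:
    """Return "built-in", "remote", or "unset" for the configured address."""
    address = env.get("H2OGPTE_CORE_LLM_ADDRESS", "")
    if not address:
        return "unset"
    return "built-in" if is_internal_address(address) else "remote"
```

In the "built-in" case the set of LLMs can be configured locally, while in the "remote" case it is determined by the remote h2oGPT instance.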

Scaling

  • Mux is stateless. Load-balance additional instances to improve concurrency.
  • Background Workers are also stateless. Spin up more instances to improve throughput.
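
To illustrate why statelessness makes scaling straightforward, here is a minimal round-robin sketch. Since no instance holds session state, any instance can serve any request, so a simple rotation is enough. The RoundRobinBalancer class and the instance URLs are hypothetical; in a real deployment this job is usually handled by a Kubernetes Service, an ingress, or an external load balancer.

```python
import itertools
from typing import Iterable, List


class RoundRobinBalancer:
    """Rotate requests across interchangeable, stateless instances."""

    def __init__(self, instances: Iterable[str]) -> None:
        self._instances: List[str] = list(instances)
        if not self._instances:
            raise ValueError("at least one instance is required")
        self._cycle = itertools.cycle(self._instances)

    def next_instance(self) -> str:
        # Every call returns the next instance in the rotation.
        return next(self._cycle)


# Hypothetical Mux instance addresses, for illustration only.
balancer = RoundRobinBalancer(["http://mux-0:8080", "http://mux-1:8080"])
picks = [balancer.next_instance() for _ in range(4)]
```

Adding a third instance to the list would spread the same traffic over three servers without any change to the instances themselves.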
