Version: v1.6.40-dev2 🚧

Troubleshooting

Overview​

This troubleshooting hub helps you resolve issues across the h2oGPTe platform. It covers quick diagnostic commands for health checks and log analysis, troubleshooting guidance for each component (clients, gateway, workers, AI and LLM, and storage), recovery procedures for service restarts and data recovery, monitoring and alerting best practices, and common deployment issues with their solutions.

Quick diagnostic commands​

note

Some example commands assume that curl or nslookup are available in the pod. If your pod uses a distroless image, use a debug pod with these tools installed.

Health checks​

# Option 1: Mux health check via port-forward (from your local machine, outside the cluster)
kubectl port-forward <mux-pod> 8888:8888 &
curl -f http://localhost:8888/rpc/health/readiness
curl -f http://localhost:8888/rpc/health/liveness
kill %1

# Option 2: Use a debug pod with curl installed (from within the cluster)
kubectl run -it --rm debug --image=alpine --restart=Never -- sh
# Inside the debug pod, install curl and run the health checks:
apk add --no-cache curl
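# Pod names are generally not resolvable through cluster DNS. Use the pod IP
# (shown by `kubectl get pod <mux-pod> -o wide`) or the service name instead.
curl -f http://<mux-pod-ip>:8888/rpc/health/readiness
curl -f http://<mux-pod-ip>:8888/rpc/health/liveness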

# External health checks (via service endpoints)
kubectl port-forward svc/h2ogpte-mux 8888:8888 &
curl -f http://localhost:8888/rpc/health/readiness
kill %1

kubectl port-forward svc/h2ogpte-core 8890:8890 &
curl -f http://localhost:8890/ping
kill %1

kubectl port-forward svc/h2ogpte-vex 8889:8889 &
curl -f http://localhost:8889/ping
kill %1

# Database health checks (external dependencies)
# Check connection strings in secrets:
kubectl get secret h2ogpte-redis -o jsonpath='{.data.host}' | base64 -d
kubectl get secret h2ogpte-postgres-trusted -o jsonpath='{.data.dsn}' | base64 -d
note

Redis and PostgreSQL are external dependencies configured via secrets.
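
Once you have the connection string, you can test whether the database accepts connections. The following is a minimal sketch, assuming the `dsn` value is a URI of the form `scheme://user:password@host:port/database`; if your DSN uses a different format (for example, `key=value` pairs), this parsing does not apply. `dsn_host_port` is a hypothetical helper name, and `pg_isready` must be available where you run it.

```shell
# Hypothetical helper: extract "host port" from a URI-style DSN.
# Assumes the form scheme://[user[:password]@]host[:port]/database.
dsn_host_port() {
  hostport=$(printf '%s\n' "$1" | sed -E 's#^[a-zA-Z]+://([^@/]*@)?([^/?]+).*#\2#')
  host=${hostport%%:*}
  case "$hostport" in
    *:*) port=${hostport##*:} ;;
    *)   port=5432 ;;   # PostgreSQL default port when none is given
  esac
  printf '%s %s\n' "$host" "$port"
}

# Example usage (requires cluster access and the PostgreSQL client tools):
#   dsn=$(kubectl get secret h2ogpte-postgres-trusted -o jsonpath='{.data.dsn}' | base64 -d)
#   set -- $(dsn_host_port "$dsn")
#   pg_isready -h "$1" -p "$2"
```

The helper prints only the host and port, so credentials in the DSN are not echoed to the terminal.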

Log analysis​

# View component logs
kubectl logs -f deployment/mux
kubectl logs -f deployment/core
kubectl logs -f deployment/crawl
kubectl logs -f deployment/chat

# Search for errors
kubectl logs deployment/mux | grep -i error
kubectl logs deployment/core | grep -i error

# Monitor real-time logs
kubectl logs -f deployment/mux --tail=100
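
To compare error volume across components, the grep step above can be wrapped in a small helper. This is a minimal sketch that reuses the deployment names shown above; `count_errors` is a hypothetical helper name, not part of h2oGPTe.

```shell
# Hypothetical helper: count log lines that mention "error" (case-insensitive).
# Reads log text on stdin and prints a number, including 0 when nothing matches.
count_errors() {
  grep -ci 'error' || true
}

# Example usage against the last hour of logs (requires cluster access):
#   for d in mux core crawl chat; do
#     printf '%s: ' "$d"
#     kubectl logs "deployment/$d" --since=1h | count_errors
#   done
```
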
note

By default, h2oGPTe components log at low verbosity. For detailed debugging output, set the log level to DEBUG using environment variables (for example, H2OGPTE_CHAT_LOG_LEVEL=DEBUG).
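
One way to apply such a variable is `kubectl set env`. The sketch below assumes the chat component runs as `deployment/chat` (as in the log commands above) and that you manage environment variables directly with kubectl; if your installation is managed by Helm or another tool, set the variable there instead, because a later upgrade may revert manual changes. `set_log_level` and the `KUBECTL` override are hypothetical conveniences.

```shell
# Sketch: set a log-level environment variable on a deployment.
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the command.
set_log_level() {
  deployment="$1"        # for example: chat
  variable="$2"          # for example: H2OGPTE_CHAT_LOG_LEVEL
  level="${3:-DEBUG}"    # defaults to DEBUG
  ${KUBECTL:-kubectl} set env "deployment/${deployment}" "${variable}=${level}"
}

# Example usage:
#   set_log_level chat H2OGPTE_CHAT_LOG_LEVEL DEBUG
```

Changing the environment of a deployment triggers a rolling restart of its pods, so expect a brief rollout after running it.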

Resource monitoring​

# Check pod status
kubectl get pods -o wide

# Monitor resource usage
kubectl top pods
kubectl top nodes

# Check GPU usage (run nvidia-smi directly on a GPU node, or inside a GPU pod)
nvidia-smi
kubectl exec -it <gpu-pod> -- nvidia-smi
note

The kubectl top command requires the Kubernetes Metrics Server to be enabled in your cluster.
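
When a namespace has many pods, it can help to filter the pod list down to the ones that need attention. The following is a minimal sketch, assuming the default column layout of `kubectl get pods --no-headers` (NAME, READY, STATUS, RESTARTS, AGE); `unhealthy_pods` is a hypothetical helper name.

```shell
# Hypothetical helper: list pods that are not healthy.
# Reads `kubectl get pods --no-headers` output on stdin and prints pods whose
# status is neither Running nor Completed, or whose restart count is above zero.
unhealthy_pods() {
  awk '$3 != "Running" && $3 != "Completed" || $4 + 0 > 0 { print $1, $3, "restarts=" $4 }'
}

# Example usage (requires cluster access):
#   kubectl get pods --no-headers | unhealthy_pods
```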

Network debugging​

# Test service connectivity (from external)
kubectl port-forward svc/h2ogpte-mux 8888:8888 &
curl -f http://localhost:8888/rpc/health/readiness
kill %1

# Check Redis and PostgreSQL connection details via secrets (values are base64-encoded):
kubectl get secret h2ogpte-redis -o yaml
kubectl get secret h2ogpte-postgres-trusted -o yaml
note

Redis and PostgreSQL are external dependencies.
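
To check in-cluster DNS resolution for a service, you can build its fully qualified name and query it from a debug pod. This is a minimal sketch, assuming the default cluster domain `cluster.local`; `svc_fqdn` is a hypothetical helper name, and `<namespace>` is the namespace where h2oGPTe is installed.

```shell
# Hypothetical helper: build the in-cluster DNS name of a Service.
# Assumes the default cluster domain "cluster.local".
svc_fqdn() {
  service="$1"
  namespace="${2:-default}"
  printf '%s.%s.svc.cluster.local\n' "$service" "$namespace"
}

# Example usage, from a debug pod with nslookup and curl installed:
#   nslookup "$(svc_fqdn h2ogpte-mux <namespace>)"
#   curl -f "http://$(svc_fqdn h2ogpte-mux <namespace>):8888/rpc/health/readiness"
```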

Component-specific troubleshooting​

| Component | Symptoms | Causes | Solutions |
|---|---|---|---|
| Client - Connection | Unable to connect to Mux gateway, WebSocket connection failures | Network connectivity problems, authentication failures, incorrect endpoint configuration | Verify network connectivity to Mux gateway (port 8888), check authentication credentials and session tokens, validate WebSocket endpoint configuration, review firewall and proxy settings |
| Client - Performance | Slow document uploads, timeout errors, unresponsive UI | Large document handling, insufficient timeout configurations, network latency | Increase timeout values for large document processing, optimize document size before upload, check network latency and bandwidth, review client-side caching configuration |
| Client - WebSocket | Connection drops, reconnection failures, message delivery problems | Network instability, proxy interference, WebSocket configuration issues | Implement exponential backoff for reconnection, configure WebSocket keep-alive settings, review proxy and load balancer WebSocket support, check Mux WebSocket configuration |
| Mux Gateway | Gateway timeouts, authentication failures, WebSocket connection problems | High load, authentication configuration errors, upstream service failures | Check Mux logs for authentication and routing errors, verify database and Redis connectivity, review authentication provider configuration, scale Mux replicas for high load |
| Redis | Connection pool exhaustion, memory pressure, queue overflow | High message volume, insufficient memory, connection leaks | Monitor Redis memory usage and connection count, implement connection pooling and retry logic, scale Redis resources or add Redis cluster, review queue processing and cleanup procedures |
| Core | Embedding generation failures, processing timeouts, memory issues | GPU memory exhaustion, model loading failures, insufficient resources | Monitor GPU memory usage and utilization, check model loading and configuration, scale Core replicas for high load, review embedding model configuration |
| Crawl | Document parsing errors, storage access issues, batch processing failures | Unsupported file formats, storage permission problems, resource constraints | Verify file format support and parsing configuration, check S3-compatible object storage access permissions and bucket policies (for example, MinIO, AWS S3, or other), monitor batch processing queue and worker status, review document size limits and processing timeouts |
| Chat | Session management problems, RAG retrieval failures, agent coordination issues | Database connectivity problems, vector search failures, agent configuration errors | Check Postgres connectivity and session storage, verify Vex vector search functionality, review agent configuration and tool availability, monitor chat session state and cleanup procedures |
| Models | GPU memory issues, model loading failures, OCR processing errors | Insufficient GPU memory, model compatibility problems, OCR configuration errors | Monitor GPU memory usage and model loading, verify model compatibility and configuration, check OCR model availability and configuration, review GPU allocation and scheduling |
| h2oGPT_Agent | Tool execution failures, planning errors, memory leaks | Tool configuration problems, agent planning failures, resource exhaustion | Verify tool configuration and availability, check agent planning and reasoning capabilities, monitor agent memory usage and cleanup, review agent workspace and file management |
| h2oGPT | Model inference errors, API rate limiting, prompt engineering issues | External API failures, rate limit exceeded, prompt configuration errors | Check external LLM endpoint availability, implement rate limiting and fallback mechanisms, review prompt templates and engineering, monitor cost tracking and usage analytics |
| LLMs | GPU out-of-memory, model compatibility issues, vLLM configuration problems | Insufficient GPU memory, model version incompatibility, vLLM configuration errors | Monitor GPU memory usage and allocation, verify model compatibility and version requirements, review vLLM configuration and optimization, check model serving and inference performance |
| S3-compatible Storage | Storage quota exceeded, access permission errors, bucket policy issues | Insufficient storage space, incorrect access keys, bucket policy misconfiguration | Monitor storage usage and implement lifecycle policies, verify access keys and bucket permissions, review bucket policies and access control, check S3 API compatibility and configuration |
| Postgres | Connection limits, query performance issues | High connection count, slow queries | Monitor connection pool usage and implement connection limits, optimize slow queries and add database indexes, review user permissions, check database backup and recovery procedures |
| Vex | Index corruption, similarity search failures, bulk operation timeouts | Vector index corruption, search configuration errors, insufficient resources | Monitor vector index health and rebuild if necessary, verify similarity search configuration and thresholds, check bulk operation performance and resource allocation, review vector database backup and recovery |
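
Several of the solutions above recommend retry logic with exponential backoff. The sketch below shows the pattern in shell, applied to any command such as a health check; a client-side WebSocket reconnect would follow the same logic in the client's own language. `retry_with_backoff` is a hypothetical helper name, and the attempt count and delays are illustrative values.

```shell
# Sketch: retry a command with exponential backoff.
# Usage: retry_with_backoff <max_attempts> <initial_delay_seconds> <command...>
retry_with_backoff() {
  max_attempts="$1"
  delay="$2"
  shift 2
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))        # double the wait after every failure
    attempt=$((attempt + 1))
  done
}

# Example usage (with a port-forward to the Mux service already running):
#   retry_with_backoff 5 1 curl -fsS http://localhost:8888/rpc/health/readiness
```

With five attempts and an initial delay of one second, the waits between attempts are 1, 2, 4, and 8 seconds.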

Recovery procedures​

Service restart​

# Graceful restart of components
kubectl rollout restart deployment/mux
kubectl rollout restart deployment/core
kubectl rollout restart deployment/crawl
kubectl rollout restart deployment/chat

# Check rollout status
kubectl rollout status deployment/mux
kubectl rollout status deployment/core
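
To restart several components in sequence and stop at the first failed rollout, the two steps above can be combined. This is a minimal sketch, assuming the deployment names shown above; `restart_and_wait`, the `KUBECTL` override, and the 300-second timeout are illustrative choices.

```shell
# Sketch: restart deployments one at a time and wait for each rollout to finish.
# KUBECTL can be overridden (for example KUBECTL=echo) for a dry run.
restart_and_wait() {
  for name in "$@"; do
    ${KUBECTL:-kubectl} rollout restart "deployment/${name}" || return 1
    ${KUBECTL:-kubectl} rollout status "deployment/${name}" --timeout=300s || return 1
  done
}

# Example usage:
#   restart_and_wait mux core crawl chat
```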

Data recovery​

# Create a database backup (no -t, so a TTY does not alter the redirected dump)
kubectl exec <postgres-pod> -- pg_dump -h localhost -U postgres h2ogpte > backup.sql

# Restore from backup (-i passes backup.sql to psql on stdin)
kubectl exec -i <postgres-pod> -- psql -h localhost -U postgres h2ogpte < backup.sql

# S3-compatible object storage data verification (MinIO, AWS S3, or other)
mc ls <storage-alias>/documents
mc ls <storage-alias>/models
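
Before relying on a dump file such as backup.sql, a quick sanity check can catch empty or truncated files. The sketch below assumes a plain-text dump created by `pg_dump`, which ends with a "PostgreSQL database dump complete" comment; `verify_dump` is a hypothetical helper name. It does not replace a test restore into a scratch database.

```shell
# Hypothetical helper: sanity-check a plain-text pg_dump file.
verify_dump() {
  dump_file="$1"
  if [ ! -s "$dump_file" ]; then
    echo "backup is missing or empty: $dump_file" >&2
    return 1
  fi
  # pg_dump writes this comment at the end of a complete plain-text dump.
  if ! tail -n 10 "$dump_file" | grep -q 'PostgreSQL database dump complete'; then
    echo "backup looks truncated: $dump_file" >&2
    return 1
  fi
  echo "backup looks complete: $dump_file"
}

# Example usage:
#   verify_dump backup.sql
```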

Scaling actions​

# Scale components horizontally
kubectl scale deployment/mux --replicas=3
kubectl scale deployment/core --replicas=2
kubectl scale deployment/crawl --replicas=2

# Check scaling status
kubectl get deployment
kubectl describe deployment/mux

Monitoring and alerting​

Key metrics to monitor​

  • Component health: Pod status, readiness, liveness
  • Resource usage: CPU, memory, GPU utilization
  • Performance: Response times, throughput, error rates
  • Storage: Disk usage, I/O performance, backup status
  • Network: Connectivity, latency, bandwidth usage

Alerting rules

  • Pod failures: Alert on pod crashes or restarts
  • Resource exhaustion: Alert on high CPU/memory usage
  • Service unavailability: Alert on health check failures
  • Storage issues: Alert on disk space or I/O problems
  • Authentication failures: Alert on auth errors or rate limiting

Log aggregation​

note
  • Some h2oGPTe and h2oGPTe Agent logs are written to local file paths (such as /workspace/save), controlled by the H2OGPT_OPENAI_LOG_PATH environment variable. These logs are not collected by standard Kubernetes log scrapers, which typically aggregate logs from container stdout/stderr.

  • For production environments with compliance requirements, ensure that logs in these local paths are made available for aggregation (for example, by symlinking to stdout/stderr, using a sidecar container, or configuring a log shipper).

  • Centralized logging: Use ELK stack or similar for log aggregation
  • Structured logging: Ensure components use structured logging
  • Log retention: Configure appropriate log retention policies
  • Log analysis: Set up automated log analysis and alerting

Best practices​

Preventive maintenance​

  • Regular health checks: Implement automated health check monitoring
  • Resource monitoring: Monitor resource usage and plan for scaling
  • Backup verification: Regularly test backup and recovery procedures
  • Security updates: Keep components updated with security patches

Performance optimization​

  • Resource allocation: Right-size resource requests and limits
  • Caching: Implement appropriate caching strategies
  • Connection pooling: Use connection pooling for database connections
  • Load balancing: Implement proper load balancing for high availability

Security hardening​

  • Network policies: Implement Kubernetes network policies
  • RBAC: Use Role-Based Access Control for all components
  • Secrets management: Use Kubernetes secrets for sensitive data
  • Audit logging: Enable comprehensive audit logging

Common deployment issues​

Kubernetes configuration​

  • Resource limits: Ensure proper CPU and memory limits
  • Storage classes: Verify persistent volume storage classes
  • Network policies: Check network policy configurations
  • Service accounts: Verify service account permissions

GPU configuration​

  • Driver compatibility: Ensure NVIDIA drivers are compatible
  • GPU scheduling: Verify GPU scheduling is enabled
  • Memory allocation: Check GPU memory allocation settings
  • Multi-GPU setup: Verify multi-GPU configuration
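
To check GPU memory allocation across one or more GPUs, you can query `nvidia-smi` in CSV form and convert the result to percentages. This is a minimal sketch; `gpu_memory_percent` is a hypothetical helper name, and `<gpu-pod>` stands for any pod with GPU access.

```shell
# Hypothetical helper: convert "index, used, total" rows into a percentage per GPU.
# Expects the output of:
#   nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits
gpu_memory_percent() {
  awk -F', *' '$3 > 0 { printf "GPU %s: %d%% memory used\n", $1, ($2 * 100) / $3 }'
}

# Example usage (requires cluster access):
#   kubectl exec <gpu-pod> -- nvidia-smi \
#     --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits \
#     | gpu_memory_percent
```

In a multi-GPU setup, each GPU produces one output line, which makes uneven memory allocation easy to spot.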

Network configuration​

  • DNS resolution: Check internal and external DNS
  • Load balancer: Verify ingress controller configuration
  • Firewall rules: Review firewall and security group settings
  • SSL/TLS: Check certificate configuration and validity

Security troubleshooting​

Authentication issues​

  • OIDC configuration: Verify OIDC provider settings
  • Token validation: Check token validation and expiration
  • User permissions: Review user role assignments
  • Session management: Check session timeout and cleanup

Authorization problems​

  • RBAC policies: Verify role-based access control policies
  • Resource permissions: Check resource-level permissions
  • Cross-namespace access: Review namespace isolation
  • API access: Verify API endpoint access controls

Data protection​

  • Encryption: Check encryption at rest and in transit
  • Backup security: Verify backup encryption and access
  • Audit logs: Review audit log configuration and retention
  • Compliance: Check compliance with data protection regulations
