Troubleshooting
Overview
This troubleshooting hub provides comprehensive guidance for resolving issues across the h2oGPTe platform. You can find quick diagnostic commands for health checks and log analysis, detailed troubleshooting guides for each component (clients, gateway, workers, AI and LLM, and storage), recovery procedures for service restarts and data recovery, monitoring and alerting best practices, and common deployment issues with their solutions.
Quick diagnostic commands
Some example commands assume that curl or nslookup are available in the pod. If your pod uses a distroless image, use a debug pod with these tools installed.
Health checks
# Option 1: Mux health check via port-forward (from your local machine or another host outside the cluster)
kubectl port-forward <mux-pod> 8888:8888 &
curl -f http://localhost:8888/rpc/health/readiness
curl -f http://localhost:8888/rpc/health/liveness
kill %1
# Option 2: Use a debug pod with curl installed (from within the cluster)
kubectl run -it --rm debug --image=alpine --restart=Never -- sh
# Inside the debug pod, install curl and run the health checks:
apk add --no-cache curl
curl -f http://<mux-pod>:8888/rpc/health/readiness
curl -f http://<mux-pod>:8888/rpc/health/liveness
# External health checks (via service endpoints)
kubectl port-forward svc/h2ogpte-mux 8888:8888 &
curl -f http://localhost:8888/rpc/health/readiness
kill %1
kubectl port-forward svc/h2ogpte-core 8890:8890 &
curl -f http://localhost:8890/ping
kill %1
kubectl port-forward svc/h2ogpte-vex 8889:8889 &
curl -f http://localhost:8889/ping
kill %1
# Database health checks (external dependencies)
# Check connection strings in secrets:
kubectl get secret h2ogpte-redis -o jsonpath='{.data.host}' | base64 -d
kubectl get secret h2ogpte-postgres-trusted -o jsonpath='{.data.dsn}' | base64 -d
Redis and PostgreSQL are external dependencies configured via secrets.
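Because the health checks above use curl -f, a failure shows up as a non-zero exit status rather than as an error page. The following is a minimal sketch of how those statuses could be interpreted. The exit codes are curl's documented values; the function name explain_curl_exit and the suggested next steps are illustrative assumptions, not part of h2oGPTe.

```shell
# Map a curl exit status to a likely cause.
# The next steps in each message are illustrative suggestions only.
explain_curl_exit() {
  case "$1" in
    0)  echo "healthy: endpoint returned an HTTP status below 400" ;;
    6)  echo "dns: could not resolve host; check the service or pod name" ;;
    7)  echo "connect: failed to connect; check the port-forward and pod status" ;;
    22) echo "http: endpoint returned HTTP 400 or higher; check component logs" ;;
    28) echo "timeout: no response in time; check load and network policies" ;;
    *)  echo "other: curl exit code $1; see the curl manual" ;;
  esac
}

# Usage (with a port-forward to the Mux service already running):
#   curl -f -s -o /dev/null --max-time 5 http://localhost:8888/rpc/health/readiness
#   explain_curl_exit $?
```

Exit code 22 only appears when -f is passed, which is why the commands above include it.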
Log analysis
# View component logs
kubectl logs -f deployment/mux
kubectl logs -f deployment/core
kubectl logs -f deployment/crawl
kubectl logs -f deployment/chat
# Search for errors
kubectl logs deployment/mux | grep -i error
kubectl logs deployment/core | grep -i error
# Monitor real-time logs
kubectl logs -f deployment/mux --tail=100
By default, h2oGPTe components log at reduced verbosity. For detailed debugging output, set the log level to debug through environment variables (for example, H2OGPTE_CHAT_LOG_LEVEL=DEBUG).
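One way to apply such a variable to a running deployment is kubectl set env. The helper below is a hypothetical sketch: it assumes the deployment is named after the component (as in the log commands above) and that every component follows the H2OGPTE_<COMPONENT>_LOG_LEVEL pattern. Only H2OGPTE_CHAT_LOG_LEVEL is confirmed by this page, so verify the variable name for other components before relying on it.

```shell
# Hypothetical helper: set the log level for one component.
# Assumptions: deployment name equals the component name, and the variable
# follows the H2OGPTE_<COMPONENT>_LOG_LEVEL pattern.
set_log_level() {
  component="$1"
  level="${2:-DEBUG}"
  upper=$(printf '%s' "$component" | tr '[:lower:]' '[:upper:]')
  # Changing the pod template this way triggers a rolling restart.
  kubectl set env "deployment/${component}" "H2OGPTE_${upper}_LOG_LEVEL=${level}"
}

# Usage:
#   set_log_level chat          # sets H2OGPTE_CHAT_LOG_LEVEL=DEBUG
#   set_log_level chat INFO     # revert after debugging
```

If the deployment is managed by Helm or another tool, a change made this way is likely to be overwritten on the next upgrade, so it suits short debugging sessions.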
Resource monitoring
# Check pod status
kubectl get pods -o wide
# Monitor resource usage
kubectl top pods
kubectl top nodes
# Check GPU usage (run nvidia-smi directly on a GPU node, or inside a GPU pod)
nvidia-smi
kubectl exec -it <gpu-pod> -- nvidia-smi
The kubectl top command requires the Kubernetes Metrics Server to be installed in your cluster.
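When a namespace contains many pods, it can help to filter the pod list down to those that need attention. The sketch below reads the default table output of kubectl get pods and prints pods that are not Running, or that are Running without all containers ready. The function name unhealthy_pods is an assumption made for illustration.

```shell
# Print NAME, READY, and STATUS for pods that look unhealthy.
# Reads the table output of `kubectl get pods` (or `-o wide`) on stdin.
unhealthy_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")            # READY column, for example 0/1
    if ($3 == "Completed") next      # finished jobs are not failures
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $2, $3
  }'
}

# Usage:
#   kubectl get pods | unhealthy_pods
```

Any pod printed by this filter is a candidate for kubectl describe pod and kubectl logs.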
Network debugging
# Test service connectivity (from external)
kubectl port-forward svc/h2ogpte-mux 8888:8888 &
curl -f http://localhost:8888/rpc/health/readiness
kill %1
# Check Redis and PostgreSQL connection details via secrets:
kubectl get secret h2ogpte-redis -o yaml
kubectl get secret h2ogpte-postgres-trusted -o yaml
Redis and PostgreSQL are external dependencies.
Component-specific troubleshooting
Component | Symptoms | Causes | Solutions |
---|---|---|---|
Client - Connection | Unable to connect to Mux gateway, WebSocket connection failures | Network connectivity problems, authentication failures, incorrect endpoint configuration | Verify network connectivity to Mux gateway (port 8888), check authentication credentials and session tokens, validate WebSocket endpoint configuration, review firewall and proxy settings |
Client - Performance | Slow document uploads, timeout errors, unresponsive UI | Large document handling, insufficient timeout configurations, network latency | Increase timeout values for large document processing, optimize document size before upload, check network latency and bandwidth, review client-side caching configuration |
Client - WebSocket | Connection drops, reconnection failures, message delivery problems | Network instability, proxy interference, WebSocket configuration issues | Implement exponential backoff for reconnection, configure WebSocket keep-alive settings, review proxy and load balancer WebSocket support, check Mux WebSocket configuration |
Mux Gateway | Gateway timeouts, authentication failures, WebSocket connection problems | High load, authentication configuration errors, upstream service failures | Check Mux logs for authentication and routing errors, verify database and Redis connectivity, review authentication provider configuration, scale Mux replicas for high load |
Redis | Connection pool exhaustion, memory pressure, queue overflow | High message volume, insufficient memory, connection leaks | Monitor Redis memory usage and connection count, implement connection pooling and retry logic, scale Redis resources or add Redis cluster, review queue processing and cleanup procedures |
Core | Embedding generation failures, processing timeouts, memory issues | GPU memory exhaustion, model loading failures, insufficient resources | Monitor GPU memory usage and utilization, check model loading and configuration, scale Core replicas for high load, review embedding model configuration |
Crawl | Document parsing errors, storage access issues, batch processing failures | Unsupported file formats, storage permission problems, resource constraints | Verify file format support and parsing configuration, check S3-compatible object storage access permissions and bucket policies (for example, MinIO, AWS S3, or other), monitor batch processing queue and worker status, review document size limits and processing timeouts |
Chat | Session management problems, RAG retrieval failures, agent coordination issues | Database connectivity problems, vector search failures, agent configuration errors | Check Postgres connectivity and session storage, verify Vex vector search functionality, review agent configuration and tool availability, monitor chat session state and cleanup procedures |
Models | GPU memory issues, model loading failures, OCR processing errors | Insufficient GPU memory, model compatibility problems, OCR configuration errors | Monitor GPU memory usage and model loading, verify model compatibility and configuration, check OCR model availability and configuration, review GPU allocation and scheduling |
h2oGPT_Agent | Tool execution failures, planning errors, memory leaks | Tool configuration problems, agent planning failures, resource exhaustion | Verify tool configuration and availability, check agent planning and reasoning capabilities, monitor agent memory usage and cleanup, review agent workspace and file management |
h2oGPT | Model inference errors, API rate limiting, prompt engineering issues | External API failures, rate limit exceeded, prompt configuration errors | Check external LLM endpoint availability, implement rate limiting and fallback mechanisms, review prompt templates and engineering, monitor cost tracking and usage analytics |
LLMs | GPU out-of-memory, model compatibility issues, vLLM configuration problems | Insufficient GPU memory, model version incompatibility, vLLM configuration errors | Monitor GPU memory usage and allocation, verify model compatibility and version requirements, review vLLM configuration and optimization, check model serving and inference performance |
S3-compatible Storage | Storage quota exceeded, access permission errors, bucket policy issues | Insufficient storage space, incorrect access keys, bucket policy misconfiguration | Monitor storage usage and implement lifecycle policies, verify access keys and bucket permissions, review bucket policies and access control, check S3 API compatibility and configuration |
Postgres | Connection limits, query performance issues | High connection count, slow queries | Monitor connection pool usage and implement connection limits, optimize slow queries and add database indexes, review user permissions, check database backup and recovery procedures |
Vex | Index corruption, similarity search failures, bulk operation timeouts | Vector index corruption, search configuration errors, insufficient resources | Monitor vector index health and rebuild if necessary, verify similarity search configuration and thresholds, check bulk operation performance and resource allocation, review vector database backup and recovery |
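The table recommends exponential backoff for reconnection. As a minimal sketch of the idea, the functions below compute a capped, doubling delay and retry an arbitrary command with it. The names backoff_delay and retry_with_backoff, the 1-second base, and the 60-second cap are assumptions chosen for illustration. A real client would implement the same logic in its own language and would usually add random jitter.

```shell
# Delay in seconds before retry number $1 (0-based): base * 2^attempt, capped.
backoff_delay() {
  attempt="$1"
  base="${2:-1}"
  cap="${3:-60}"
  delay="$base"
  i=0
  while [ "$i" -lt "$attempt" ] && [ "$delay" -lt "$cap" ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  if [ "$delay" -gt "$cap" ]; then delay="$cap"; fi
  echo "$delay"
}

# Run a command up to $1 times, sleeping with backoff between failures.
retry_with_backoff() {
  max="$1"
  shift
  n=0
  until "$@"; do
    n=$((n + 1))
    if [ "$n" -ge "$max" ]; then return 1; fi
    sleep "$(backoff_delay "$((n - 1))")"
  done
}

# Usage (with a port-forward to the Mux service already running):
#   retry_with_backoff 5 curl -f -s -o /dev/null http://localhost:8888/rpc/health/readiness
```

With the defaults, the delays between attempts are 1, 2, 4, 8 seconds and so on, never exceeding 60 seconds.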
Recovery procedures
Service restart
# Graceful restart of components
kubectl rollout restart deployment/mux
kubectl rollout restart deployment/core
kubectl rollout restart deployment/crawl
kubectl rollout restart deployment/chat
# Check rollout status
kubectl rollout status deployment/mux
kubectl rollout status deployment/core
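To avoid restarting every component at once, the restart and status commands can be combined so that each rollout finishes before the next one starts. This is a sketch under stated assumptions: the function name restart_and_wait and the 300-second timeout are illustrative, and the deployment names are the ones used in the commands above.

```shell
# Restart each named deployment and wait for its rollout before moving on.
# Stops at the first deployment that fails to roll out.
restart_and_wait() {
  for name in "$@"; do
    kubectl rollout restart "deployment/${name}" || return 1
    # --timeout makes rollout status give up instead of waiting indefinitely.
    kubectl rollout status "deployment/${name}" --timeout=300s || {
      echo "rollout of ${name} did not complete" >&2
      return 1
    }
  done
}

# Usage:
#   restart_and_wait mux core crawl chat
```

Stopping at the first failure keeps the remaining components on their current pods while you investigate.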
Data recovery
# Create a database backup (no -t, so TTY line endings do not alter the dump)
kubectl exec <postgres-pod> -- pg_dump -h localhost -U postgres h2ogpte > backup.sql
# Restore from backup (-i passes backup.sql to psql on stdin)
kubectl exec -i <postgres-pod> -- psql -h localhost -U postgres h2ogpte < backup.sql
# S3-compatible object storage data verification (MinIO, AWS S3, or other)
mc ls <storage-alias>/documents
mc ls <storage-alias>/models
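Before relying on a dump file for a restore, a quick sanity check can catch an empty or truncated file. The sketch below assumes a plain-format pg_dump file created with default options, which ends with a "PostgreSQL database dump complete" comment. Dumps created with --no-comments or in a custom format would not contain this marker. The function name verify_dump is an illustrative assumption, and this check does not replace a test restore into a scratch database.

```shell
# Return 0 if a plain-format pg_dump file is non-empty and ends with the
# completion comment that pg_dump writes by default.
verify_dump() {
  [ -s "$1" ] || { echo "missing or empty: $1" >&2; return 1; }
  tail -n 10 "$1" | grep -q 'PostgreSQL database dump complete' || {
    echo "no completion marker; dump may be truncated: $1" >&2
    return 1
  }
}

# Usage:
#   verify_dump backup.sql && echo "backup.sql looks complete"
```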
Scaling actions
# Scale components horizontally
kubectl scale deployment/mux --replicas=3
kubectl scale deployment/core --replicas=2
kubectl scale deployment/crawl --replicas=2
# Check scaling status
kubectl get deployment
kubectl describe deployment/mux
Monitoring and alerting
Key metrics to monitor
- Component health: Pod status, readiness, liveness
- Resource usage: CPU, memory, GPU utilization
- Performance: Response times, throughput, error rates
- Storage: Disk usage, I/O performance, backup status
- Network: Connectivity, latency, bandwidth usage
Recommended alerts
- Pod failures: Alert on pod crashes or restarts
- Resource exhaustion: Alert on high CPU/memory usage
- Service unavailability: Alert on health check failures
- Storage issues: Alert on disk space or I/O problems
- Authentication failures: Alert on auth errors or rate limiting
Log aggregation
- Some h2oGPTe and h2oGPTe Agent logs are written to local file paths (such as /workspace/save), controlled by the H2OGPT_OPENAI_LOG_PATH environment variable. These logs are not collected by standard Kubernetes log scrapers, which typically aggregate logs from container stdout/stderr.
- For production environments with compliance requirements, ensure that logs in these local paths are made available for aggregation (for example, by symlinking to stdout/stderr, using a sidecar container, or configuring a log shipper).
- Centralized logging: Use ELK stack or similar for log aggregation
- Structured logging: Ensure components use structured logging
- Log retention: Configure appropriate log retention policies
- Log analysis: Set up automated log analysis and alerting
Best practices
Preventive maintenance
- Regular health checks: Implement automated health check monitoring
- Resource monitoring: Monitor resource usage and plan for scaling
- Backup verification: Regularly test backup and recovery procedures
- Security updates: Keep components updated with security patches
Performance optimization
- Resource allocation: Right-size resource requests and limits
- Caching: Implement appropriate caching strategies
- Connection pooling: Use connection pooling for database connections
- Load balancing: Implement proper load balancing for high availability
Security hardening
- Network policies: Implement Kubernetes network policies
- RBAC: Use Role-Based Access Control for all components
- Secrets management: Use Kubernetes secrets for sensitive data
- Audit logging: Enable comprehensive audit logging
Common deployment issues
Kubernetes configuration
- Resource limits: Ensure proper CPU and memory limits
- Storage classes: Verify persistent volume storage classes
- Network policies: Check network policy configurations
- Service accounts: Verify service account permissions
GPU configuration
- Driver compatibility: Ensure NVIDIA drivers are compatible
- GPU scheduling: Verify GPU scheduling is enabled
- Memory allocation: Check GPU memory allocation settings
- Multi-GPU setup: Verify multi-GPU configuration
Network configuration
- DNS resolution: Check internal and external DNS
- Load balancer: Verify ingress controller configuration
- Firewall rules: Review firewall and security group settings
- SSL/TLS: Check certificate configuration and validity
Security troubleshooting
Authentication issues
- OIDC configuration: Verify OIDC provider settings
- Token validation: Check token validation and expiration
- User permissions: Review user role assignments
- Session management: Check session timeout and cleanup
Authorization problems
- RBAC policies: Verify role-based access control policies
- Resource permissions: Check resource-level permissions
- Cross-namespace access: Review namespace isolation
- API access: Verify API endpoint access controls
Data protection
- Encryption: Check encryption at rest and in transit
- Backup security: Verify backup encryption and access
- Audit logs: Review audit log configuration and retention
- Compliance: Check compliance with data protection regulations