Version: v1.6.47-dev1 🚧

Troubleshooting

Overview

This troubleshooting hub covers issues across the h2oGPTe platform. It includes quick diagnostic commands for health checks and log analysis, troubleshooting guidance for each component (clients, gateway, workers, AI and LLM, and storage), recovery procedures for service restarts and data recovery, monitoring and alerting best practices, and common deployment issues with their solutions.

Quick diagnostic commands

Note: Some example commands assume that curl or nslookup is available in the pod. If your pod uses a distroless image, use a debug pod with these tools installed.

Health checks

# Option 1: Mux health check via port-forward (run from your local machine, outside the cluster)
kubectl port-forward <mux-pod> 8888:8888 &
sleep 2  # give the port-forward a moment to establish before probing
curl -f http://localhost:8888/rpc/health/readiness
curl -f http://localhost:8888/rpc/health/liveness
kill %1

# Option 2: Use a debug pod with curl installed (from within the cluster)
kubectl run -it --rm debug --image=alpine --restart=Never -- sh
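# Inside the debug pod, install curl and run the health checks.
# Pod names are not resolvable through cluster DNS, so use the pod IP
# (kubectl get pod <mux-pod> -o jsonpath='{.status.podIP}') or the service name h2ogpte-mux:
apk add --no-cache curl
curl -f http://<mux-pod-ip>:8888/rpc/health/readiness
curl -f http://<mux-pod-ip>:8888/rpc/health/liveness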

# External health checks (via service endpoints)
kubectl port-forward svc/h2ogpte-mux 8888:8888 &
sleep 2  # wait for the port-forward to establish
curl -f http://localhost:8888/rpc/health/readiness
kill %1

kubectl port-forward svc/h2ogpte-core 8890:8890 &
sleep 2
curl -f http://localhost:8890/ping
kill %1

kubectl port-forward svc/h2ogpte-vex 8889:8889 &
sleep 2
curl -f http://localhost:8889/ping
kill %1

# Database health checks (external dependencies)
# Check connection strings in secrets:
kubectl get secret h2ogpte-redis -o jsonpath='{.data.host}' | base64 -d
kubectl get secret h2ogpte-postgres-trusted -o jsonpath='{.data.dsn}' | base64 -d

Note: Redis and PostgreSQL are external dependencies configured via secrets.
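The per-service checks above can be wrapped in a small helper so each endpoint is probed the same way. The following is a minimal sketch, not part of the product: the service names, ports, and paths are the ones used in the commands above, while the function names health_endpoint and check_service are hypothetical.

```shell
# Sketch only: service names, ports, and paths come from the commands above;
# the function names health_endpoint and check_service are hypothetical.

# Print "<port> <path>" for a known service; return 1 for anything else.
health_endpoint() {
  case "$1" in
    h2ogpte-mux)  echo "8888 /rpc/health/readiness" ;;
    h2ogpte-core) echo "8890 /ping" ;;
    h2ogpte-vex)  echo "8889 /ping" ;;
    *) return 1 ;;
  esac
}

# Port-forward to one service, probe it once, then stop the port-forward.
# Returns 0 if healthy, 1 if the probe failed, 2 for an unknown service.
check_service() {
  hc_svc=$1
  hc_endpoint=$(health_endpoint "$hc_svc") || {
    echo "unknown service: $hc_svc" >&2
    return 2
  }
  hc_port=${hc_endpoint%% *}
  hc_path=${hc_endpoint#* }

  kubectl port-forward "svc/$hc_svc" "$hc_port:$hc_port" >/dev/null 2>&1 &
  hc_pid=$!
  sleep 2  # give the port-forward a moment to establish

  if curl -fsS -o /dev/null "http://localhost:$hc_port$hc_path"; then
    echo "$hc_svc: OK"
    hc_status=0
  else
    echo "$hc_svc: FAILED"
    hc_status=1
  fi

  kill "$hc_pid" 2>/dev/null
  return "$hc_status"
}
```

To probe everything in turn, call the helper in a loop, for example: for svc in h2ogpte-mux h2ogpte-core h2ogpte-vex; do check_service "$svc"; done. The fixed 2-second wait is an assumption and may need to be longer on slow clusters.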

Log analysis

# View component logs
kubectl logs -f deployment/mux
kubectl logs -f deployment/core
kubectl logs -f deployment/crawl
kubectl logs -f deployment/chat

# Search for errors
kubectl logs deployment/mux | grep -i error
kubectl logs deployment/core | grep -i error

# Monitor real-time logs
kubectl logs -f deployment/mux --tail=100

Note: By default, h2oGPTe components log at reduced verbosity. For detailed debugging output, set the log level to debug using environment variables (for example, H2OGPTE_CHAT_LOG_LEVEL=DEBUG).
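When several components are misbehaving, a per-component error count can show where to look first. The sketch below assumes the deployment names used in the log commands above; count_errors and error_summary are hypothetical helper names, and the one-hour default window is an arbitrary choice.

```shell
# Sketch only: deployment names are the ones used above; count_errors and
# error_summary are hypothetical helper names.

# Count lines on stdin that contain "error" (case-insensitive).
# grep exits non-zero when nothing matches, so "|| true" keeps a zero count
# from being treated as a failure.
count_errors() {
  grep -ci 'error' || true
}

# Print "<deployment>: <count>" for each deployment, using logs from the
# last hour unless another duration is passed as the first argument.
error_summary() {
  es_since=${1:-1h}
  for es_deploy in mux core crawl chat; do
    es_count=$(kubectl logs "deployment/$es_deploy" --since="$es_since" 2>/dev/null | count_errors)
    echo "$es_deploy: $es_count"
  done
}
```

Run error_summary for the last hour, or error_summary 15m for a shorter window. Note that kubectl logs on a deployment reads from a single pod, so the counts are a rough indicator when a deployment has several replicas.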

Resource monitoring

# Check pod status
kubectl get pods -o wide

# Monitor resource usage
kubectl top pods
kubectl top nodes

# Check GPU usage (run nvidia-smi directly on a GPU node, or inside a GPU pod)
nvidia-smi
kubectl exec -it <gpu-pod> -- nvidia-smi

Note: The kubectl top command requires the Kubernetes Metrics Server to be enabled in your cluster.
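Pods that restart repeatedly are easy to miss in a long pod listing. As a small sketch, the filter below reads the output of kubectl get pods --no-headers and prints only the pods whose RESTARTS column is greater than zero. The function name pods_with_restarts is hypothetical, and the filter assumes the default column order (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# List pods that have restarted at least once.
# Reads `kubectl get pods --no-headers` output on stdin, where the columns
# are NAME READY STATUS RESTARTS AGE. The function name is hypothetical.
pods_with_restarts() {
  awk '$4 + 0 > 0 { print $1, $4 }'
}
```

Typical use is kubectl get pods --no-headers | pods_with_restarts. Each output line holds the pod name followed by its restart count.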

Network debugging

# Test service connectivity (from outside the cluster)
kubectl port-forward svc/h2ogpte-mux 8888:8888 &
sleep 2  # wait for the port-forward to establish
curl -f http://localhost:8888/rpc/health/readiness
kill %1

# Check Redis and PostgreSQL connection details via secrets:
kubectl get secret h2ogpte-redis -o yaml
kubectl get secret h2ogpte-postgres-trusted -o yaml

Note: Redis and PostgreSQL are external dependencies.

Component-specific troubleshooting

Component	Symptoms	Causes	Solutions
Client - Connection	Unable to connect to Mux gateway, WebSocket connection failures	Network connectivity problems, authentication failures, incorrect endpoint configuration	Verify network connectivity to Mux gateway (port 8888), check authentication credentials and session tokens, validate WebSocket endpoint configuration, review firewall and proxy settings
Client - Performance	Slow document uploads, timeout errors, unresponsive UI	Large document handling, insufficient timeout configurations, network latency	Increase timeout values for large document processing, optimize document size before upload, check network latency and bandwidth, review client-side caching configuration
Client - WebSocket	Connection drops, reconnection failures, message delivery problems	Network instability, proxy interference, WebSocket configuration issues	Implement exponential backoff for reconnection, configure WebSocket keep-alive settings, review proxy and load balancer WebSocket support, check Mux WebSocket configuration
Mux Gateway	Gateway timeouts, authentication failures, WebSocket connection problems	High load, authentication configuration errors, upstream service failures	Check Mux logs for authentication and routing errors, verify database and Redis connectivity, review authentication provider configuration, scale Mux replicas for high load
Redis	Connection pool exhaustion, memory pressure, queue overflow	High message volume, insufficient memory, connection leaks	Monitor Redis memory usage and connection count, implement connection pooling and retry logic, scale Redis resources or add Redis cluster, review queue processing and cleanup procedures
Core	Embedding generation failures, processing timeouts, memory issues	GPU memory exhaustion, model loading failures, insufficient resources	Monitor GPU memory usage and utilization, check model loading and configuration, scale Core replicas for high load, review embedding model configuration
Crawl	Document parsing errors, storage access issues, batch processing failures	Unsupported file formats, storage permission problems, resource constraints	Verify file format support and parsing configuration, check S3-compatible object storage access permissions and bucket policies (for example, MinIO, AWS S3, or other), monitor batch processing queue and worker status, review document size limits and processing timeouts
Chat	Session management problems, RAG retrieval failures, agent coordination issues	Database connectivity problems, vector search failures, agent configuration errors	Check Postgres connectivity and session storage, verify Vex vector search functionality, review agent configuration and tool availability, monitor chat session state and cleanup procedures
Models	GPU memory issues, model loading failures, OCR processing errors	Insufficient GPU memory, model compatibility problems, OCR configuration errors	Monitor GPU memory usage and model loading, verify model compatibility and configuration, check OCR model availability and configuration, review GPU allocation and scheduling
h2oGPT_Agent	Tool execution failures, planning errors, memory leaks	Tool configuration problems, agent planning failures, resource exhaustion	Verify tool configuration and availability, check agent planning and reasoning capabilities, monitor agent memory usage and cleanup, review agent workspace and file management
h2oGPT	Model inference errors, API rate limiting, prompt engineering issues	External API failures, rate limit exceeded, prompt configuration errors	Check external LLM endpoint availability, implement rate limiting and fallback mechanisms, review prompt templates and engineering, monitor cost tracking and usage analytics
LLMs	GPU out-of-memory, model compatibility issues, vLLM configuration problems	Insufficient GPU memory, model version incompatibility, vLLM configuration errors	Monitor GPU memory usage and allocation, verify model compatibility and version requirements, review vLLM configuration and optimization, check model serving and inference performance
S3-compatible Storage	Storage quota exceeded, access permission errors, bucket policy issues	Insufficient storage space, incorrect access keys, bucket policy misconfiguration	Monitor storage usage and implement lifecycle policies, verify access keys and bucket permissions, review bucket policies and access control, check S3 API compatibility and configuration
Postgres	Connection limits, query performance issues	High connection count, slow queries	Monitor connection pool usage and implement connection limits, optimize slow queries and add database indexes, review user permissions, check database backup and recovery procedures
Vex	Index corruption, similarity search failures, bulk operation timeouts	Vector index corruption, search configuration errors, insufficient resources	Monitor vector index health and rebuild if necessary, verify similarity search configuration and thresholds, check bulk operation performance and resource allocation, review vector database backup and recovery
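The table recommends exponential backoff for reconnection. The same pattern is useful when a script retries a probe against a service that is still starting. The following is a minimal sketch of the technique in shell: backoff_delay and retry_with_backoff are hypothetical helpers, and the 1-second base and 30-second cap are illustrative defaults rather than values from the product.

```shell
# Sketch only: backoff_delay and retry_with_backoff are hypothetical helpers;
# the 1-second base and 30-second cap are illustrative defaults.

# Print the delay in seconds before retry number N (0-based):
# base * 2^N, capped at the maximum.
backoff_delay() {
  bd_n=$1
  bd_delay=${2:-1}
  bd_cap=${3:-30}
  bd_i=0
  while [ "$bd_i" -lt "$bd_n" ] && [ "$bd_delay" -lt "$bd_cap" ]; do
    bd_delay=$((bd_delay * 2))
    bd_i=$((bd_i + 1))
  done
  if [ "$bd_delay" -gt "$bd_cap" ]; then
    bd_delay=$bd_cap
  fi
  echo "$bd_delay"
}

# Run a command up to MAX times, sleeping with exponential backoff between
# failed attempts. Returns 0 on the first success, 1 if every attempt fails.
retry_with_backoff() {
  rb_max=$1
  shift
  rb_attempt=0
  while [ "$rb_attempt" -lt "$rb_max" ]; do
    if "$@"; then
      return 0
    fi
    rb_wait=$(backoff_delay "$rb_attempt")
    rb_attempt=$((rb_attempt + 1))
    if [ "$rb_attempt" -lt "$rb_max" ]; then
      sleep "$rb_wait"
    fi
  done
  return 1
}
```

For example, with a port-forward to the Mux service already running, retry_with_backoff 5 curl -fsS -o /dev/null http://localhost:8888/rpc/health/readiness probes up to five times, waiting 1, 2, 4, and 8 seconds between attempts. Client applications would typically add random jitter to the delay as well, which this sketch omits.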

Recovery procedures

Service restart

# Graceful restart of components
kubectl rollout restart deployment/mux
kubectl rollout restart deployment/core
kubectl rollout restart deployment/crawl
kubectl rollout restart deployment/chat

# Check rollout status
kubectl rollout status deployment/mux
kubectl rollout status deployment/core
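Restarting every component at once can briefly take the whole platform down. One option is to restart deployments one at a time and wait for each rollout to finish before starting the next. The sketch below assumes the deployment names used above; restart_and_wait is a hypothetical helper, and the 300-second timeout is an arbitrary choice.

```shell
# Restart each named deployment and wait for its rollout to complete before
# moving on to the next one. Stops at the first failure.
# restart_and_wait is a hypothetical helper name.
restart_and_wait() {
  for rw_deploy in "$@"; do
    kubectl rollout restart "deployment/$rw_deploy" || return 1
    kubectl rollout status "deployment/$rw_deploy" --timeout=300s || return 1
  done
}
```

For example, restart_and_wait mux core crawl chat restarts the four deployments in that order and returns a non-zero status if any rollout fails or does not complete within the timeout.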

Data recovery

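# Create a database backup (no -t flag: a TTY would corrupt the redirected dump)
kubectl exec <postgres-pod> -- pg_dump -h localhost -U postgres h2ogpte > backup.sql

# Restore from backup (-i passes backup.sql on stdin; -t must not be used with redirected input)
kubectl exec -i <postgres-pod> -- psql -h localhost -U postgres h2ogpte < backup.sql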

# S3-compatible object storage data verification (MinIO, AWS S3, or another provider)
mc ls <storage-alias>/documents
mc ls <storage-alias>/models
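Before relying on backup.sql for a restore, it is worth checking that the dump is complete. A plain-format pg_dump file begins with a "-- PostgreSQL database dump" comment and ends with "-- PostgreSQL database dump complete", so a truncated file can be detected by looking for both markers. The helper below is a sketch with a hypothetical name (verify_dump). It only checks the markers and does not validate the SQL itself.

```shell
# Check that a plain-format pg_dump file is non-empty and contains both the
# opening and closing markers that pg_dump writes.
# verify_dump is a hypothetical helper name.
verify_dump() {
  vd_file=$1
  [ -s "$vd_file" ] || {
    echo "missing or empty: $vd_file" >&2
    return 1
  }
  grep -q '^-- PostgreSQL database dump$' "$vd_file" || {
    echo "no pg_dump header in $vd_file" >&2
    return 1
  }
  grep -q '^-- PostgreSQL database dump complete$' "$vd_file" || {
    echo "dump looks truncated: $vd_file" >&2
    return 1
  }
  echo "$vd_file: looks complete"
}
```

Run verify_dump backup.sql after creating the dump. A full verification still requires restoring into a scratch database and checking the data.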

Scaling actions

# Scale components horizontally
kubectl scale deployment/mux --replicas=3
kubectl scale deployment/core --replicas=2
kubectl scale deployment/crawl --replicas=2

# Check scaling status
kubectl get deployment
kubectl describe deployment/mux

Monitoring and alerting

Key metrics to monitor

Component health: Pod status, readiness, liveness
Resource usage: CPU, memory, GPU utilization
Performance: Response times, throughput, error rates
Storage: Disk usage, I/O performance, backup status
Network: Connectivity, latency, bandwidth usage

Recommended alerts

Pod failures: Alert on pod crashes or restarts
Resource exhaustion: Alert on high CPU/memory usage
Service unavailability: Alert on health check failures
Storage issues: Alert on disk space or I/O problems
Authentication failures: Alert on auth errors or rate limiting

Log aggregation

Note: Some h2oGPTe and h2oGPTe Agent logs are written to local file paths (such as /workspace/save), controlled by the H2OGPT_OPENAI_LOG_PATH environment variable. These logs are not collected by standard Kubernetes log scrapers, which typically aggregate logs from container stdout/stderr.
For production environments with compliance requirements, ensure that logs in these local paths are made available for aggregation (for example, by symlinking to stdout/stderr, using a sidecar container, or configuring a log shipper).

Centralized logging: Use the ELK stack or a similar tool for log aggregation
Structured logging: Ensure components use structured logging
Log retention: Configure appropriate log retention policies
Log analysis: Set up automated log analysis and alerting

Best practices

Preventive maintenance

Regular health checks: Implement automated health check monitoring
Resource monitoring: Monitor resource usage and plan for scaling
Backup verification: Regularly test backup and recovery procedures
Security updates: Keep components updated with security patches

Performance optimization

Resource allocation: Right-size resource requests and limits
Caching: Implement appropriate caching strategies
Connection pooling: Use connection pooling for database connections
Load balancing: Implement proper load balancing for high availability

Security hardening

Network policies: Implement Kubernetes network policies
RBAC: Use Role-Based Access Control for all components
Secrets management: Use Kubernetes secrets for sensitive data
Audit logging: Enable comprehensive audit logging

Common deployment issues

Kubernetes configuration

Resource limits: Ensure proper CPU and memory limits
Storage classes: Verify persistent volume storage classes
Network policies: Check network policy configurations
Service accounts: Verify service account permissions

GPU configuration

Driver compatibility: Ensure NVIDIA drivers are compatible
GPU scheduling: Verify GPU scheduling is enabled
Memory allocation: Check GPU memory allocation settings
Multi-GPU setup: Verify multi-GPU configuration

Network configuration

DNS resolution: Check internal and external DNS
Load balancer: Verify ingress controller configuration
Firewall rules: Review firewall and security group settings
SSL/TLS: Check certificate configuration and validity

Security troubleshooting

Authentication issues

OIDC configuration: Verify OIDC provider settings
Token validation: Check token validation and expiration
User permissions: Review user role assignments
Session management: Check session timeout and cleanup

Authorization problems

RBAC policies: Verify role-based access control policies
Resource permissions: Check resource-level permissions
Cross-namespace access: Review namespace isolation
API access: Verify API endpoint access controls

Data protection

Encryption: Check encryption at rest and in transit
Backup security: Verify backup encryption and access
Audit logs: Review audit log configuration and retention
Compliance: Check compliance with data protection regulations
User data deletion: Verify users can successfully delete their data

User data deletion troubleshooting

Getting started with data deletion

If you're new to data deletion, see the comprehensive User Data Deletion guide for step-by-step instructions on using both the Python client and REST API.

Common user issues

"Delete request expired" (HTTP 400)

What you see: Error message when trying to confirm deletion
Why it happens: You waited longer than 5 minutes to confirm
How to fix: Start over by requesting a new deletion, then confirm it within 5 minutes

"Unauthorized" (HTTP 401)

What you see: Authentication error during deletion
Why it happens: Invalid API key or session token
How to fix: Check your authentication credentials and ensure you're logged in

"Deletion taking too long"

What you see: Confirmation request seems to hang or time out
Why it happens: Large amounts of data are being processed in batches
How to fix: Wait for completion (can take several minutes) or increase the timeout parameter

