
H2O AI Cloud Observability

Observability in H2O AI Hybrid Cloud provides visibility into the performance and health of workloads running on the platform. This page covers how to access and interpret key metrics for monitoring cloud infrastructure and behavior.

H2O.ai workloads (i.e., tasks or processes running on the platform) export performance metrics over an HTTP endpoint in the Prometheus text format.

Each pod that exports metrics listens on a dedicated port (9089) and requires no additional authentication. To access the metrics, run a curl command from within the cluster.
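As a minimal sketch, the metrics could be fetched with e.g. `curl http://<pod-ip>:9089/metrics` (the `/metrics` path is the common Prometheus convention; verify the exact path in your deployment) and parsed as shown below. The payload here is an illustrative sample, not real H2O output:

```python
# Minimal sketch: parse a Prometheus text-format payload like the one a
# workload pod serves on port 9089. The sample payload is assumed, not
# copied from a real H2O service.
sample = """\
# HELP http_server_duration_milliseconds Duration of HTTP requests.
# TYPE http_server_duration_milliseconds histogram
http_server_duration_milliseconds_bucket{le="500"} 1200
http_server_duration_milliseconds_bucket{le="+Inf"} 1300
http_server_duration_milliseconds_count 1300
"""

def parse_metrics(text):
    """Return {metric name (with labels): value} for non-comment lines."""
    metrics = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        name, value = line.rsplit(" ", 1)
        metrics[name] = float(value)
    return metrics

metrics = parse_metrics(sample)
print(metrics["http_server_duration_milliseconds_count"])  # 1300.0
```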

Instrumentation

H2O.ai workloads are instrumented with [OpenTelemetry](https://opentelemetry.io/) SDKs. This means that the workloads are set up to automatically collect and send performance data about their activities. In general, workloads export metrics for standard HTTP requests and RPC (Remote Procedure Call) traffic.

For more details about the particular metrics, see the HTTP and RPC metric semantic conventions in the OpenTelemetry documentation.

Example: H2O Discovery Service

For instance, let's look at some metrics exposed by the H2O Discovery Service. The H2O Discovery Service is used by most client applications, which makes it an essential part of the H2O platform. It is exposed via a public HTTP endpoint that serves an RPC service; therefore, both the HTTP and RPC metrics mentioned above are exported.

PromQL

The examples below use PromQL expressions.

Example: HTTP latency

You can use the following metrics to check if the latency is within an acceptable limit:

  • http_server_duration_milliseconds_bucket exposes a histogram of the server's HTTP request-handling duration.
  • http_server_duration_milliseconds_count is the total number of requests included in the histogram.

The Prometheus query below calculates the ratio of requests that were served in under 500 ms. You can compare it against a desired ratio to create an alert signal: the condition < 0.95 triggers when more than 5% of the requests took more than 500 ms to handle in the last 5 minutes.

sum(rate(http_server_duration_milliseconds_bucket{le="500"}[5m])) by (net_host_name) / sum(rate(http_server_duration_milliseconds_count[5m])) by (net_host_name)
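To make the arithmetic behind this query concrete, here is a small Python sketch of the same ratio and alert condition, using made-up rate values instead of live Prometheus data:

```python
# Sketch of the arithmetic behind the PromQL above, with assumed numbers:
# per-second rates of the le="500" bucket and of the total request count,
# averaged over a 5-minute window.
rate_bucket_le_500 = 9.5  # requests/s served in under 500 ms (assumed)
rate_count_total = 10.0   # all requests/s (assumed)

ratio = rate_bucket_le_500 / rate_count_total
print(ratio)  # 0.95

# Alerting condition from the text: fire when the ratio drops below 0.95,
# i.e. when more than 5% of requests took longer than 500 ms.
should_alert = ratio < 0.95
print(should_alert)  # False
```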

Example: HTTP Errors

http_server_duration_milliseconds_count has an http_status_code label that can be used to determine how many requests resulted in errors.

The following query calculates how many 500 Internal Server Error responses were returned over the last 5 minutes. The absolute number of errors is rarely useful on its own; as in the previous example, you can check the ratio instead.

sum(increase(http_server_duration_milliseconds_count{http_status_code="500"}[5m]))

This expression calculates what portion of requests returned a 5XX status in the last 5 minutes, which can then be used as a signal for an alert.

sum(increase(http_server_duration_milliseconds_count{http_status_code=~"5.."}[5m])) / sum(increase(http_server_duration_milliseconds_count[5m]))
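The following Python sketch mirrors this error-ratio calculation, substituting assumed increase() values over a 5-minute window for live Prometheus data:

```python
# Sketch of the 5xx error-ratio calculation. The per-status-code
# increase() values over 5 minutes are assumed, not real measurements.
increases = {
    "200": 940.0,
    "404": 40.0,
    "500": 15.0,
    "503": 5.0,
}

total = sum(increases.values())
# The regex "5.." in the PromQL selects all 5xx status codes.
errors_5xx = sum(v for code, v in increases.items() if code.startswith("5"))

error_ratio = errors_5xx / total
print(error_ratio)  # 0.02
```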

Example: RPC

Both examples given above (HTTP errors and HTTP latency) can also be adapted for RPC metrics.

  • rpc_server_duration_milliseconds_bucket exposes a histogram of the server's RPC call-handling duration.
  • rpc_server_duration_milliseconds_count is the total number of RPC requests.
  • rpc_server_duration_milliseconds_sum is the sum of all observed durations.

Although the metrics track similar things for both HTTP and RPC requests, RPC metrics can provide different insights into the service. The H2O Discovery Service is a gRPC service (not plain HTTP), so instead of tracking HTTP status codes (such as 404 or 500), you can calculate the error rate using the rpc_grpc_status_code label. The following query calculates how many UNKNOWN or INTERNAL errors were observed in the last 5 minutes.

sum(increase(rpc_server_duration_milliseconds_count{rpc_grpc_status_code=~"2|13"}[5m]))
gRPC Status Codes

For more gRPC status code labels, see the gRPC Status Codes documentation.
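As an illustration of how the =~"2|13" matcher in the query above behaves, this Python sketch filters a list of assumed status-code values the way PromQL's anchored regex match does:

```python
import re

# PromQL's =~ operator performs a fully anchored regex match, so "2|13"
# matches exactly "2" (UNKNOWN) or "13" (INTERNAL), but not e.g. "12".
observed_codes = ["0", "2", "12", "13", "14"]  # assumed sample values

errors = [c for c in observed_codes if re.fullmatch("2|13", c)]
print(errors)  # ['2', '13']
```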

RPC metrics carry labels with the RPC method name. This allows you, for example, to calculate latency for a single method only. The expression below calculates the average latency over the last 5 minutes for requests to the ai.h2o.cloud.discovery.v1.ServiceService/GetService method.

rate(rpc_server_duration_milliseconds_sum{rpc_method="GetService"}[5m]) / rate(rpc_server_duration_milliseconds_count{rpc_method="GetService"}[5m])
Note

The service label has been omitted for brevity. It's not needed in this example as there's no method name clash in the H2O Discovery service.
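The rate(sum)/rate(count) pattern in the expression above yields the mean request duration; a quick Python sketch with assumed values:

```python
# Sketch of the average-latency calculation for a single RPC method:
# rate(<sum>)/rate(<count>) over the same window gives the mean duration
# in milliseconds. The rate values below are assumed, not measured.
rate_duration_sum_ms = 1250.0  # ms of handling time accrued per second
rate_request_count = 50.0      # GetService requests per second

avg_latency_ms = rate_duration_sum_ms / rate_request_count
print(avg_latency_ms)  # 25.0
```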

