Driverless AI Health API

The following sections describe the Driverless AI Health API.

Overview
Using the DAI Health API
Attribute Definitions

Overview

The Driverless AI Health API is a publicly available API that exposes basic system metrics and statistics. Its primary purpose is to provide information for resource monitoring and auto-scaling of Driverless AI multinode clusters. The API returns its metrics in JSON format so that they can be consumed by tools like KEDA or K8S Autoscaler.

Notes:

The Health API is only available in multinode or singlenode mode. For more information, refer to the worker_mode config.toml option.
For security purposes, the Health API endpoint can be disabled by setting the enable_health_api config.toml option to false. This setting is enabled by default.
The Health API is designed to provide the information users need to write their own autoscaling logic for multinode Driverless AI. It can also be used in tandem with services like Enterprise Puddle to skip the authentication step and retrieve the needed information directly.

Using the DAI Health API

To retrieve Driverless AI's health status, send a GET request:

GET http://{driverless-ai-instance-address}/apis/health/v1

This returns the following JSON response:

    {
      "api_version": "1.1",
      "server_version": "2.2.0",
      "application_id": "dai_1342577",
      "timestamp": "2023-05-25T19:46:19.593386+00:00",
      "last_system_interaction": "2023-05-25T19:46:17.978125+00:00",
      "active_users": 1,
      "is_idle": false,
      "resources": {
        "cpu_cores": 12,
        "gpus": 0,
        "nodes": 2
      },
      "tasks": {
        "running": 0,
        "running_gpu": 0,
        "running_cpu": 0,
        "running_non_experiment": 0,
        "scheduled": 0,
        "scheduled_on_gpu": 0,
        "scheduled_on_cpu": 0
      },
      "utilization": {
        "cpu": 0.12416666666666666,
        "gpu": 0.0,
        "memory": 0.888
      },
      "workers": [
        {
          "name": "NODE:LOCAL2",
          "running_tasks": 0,
          "running_tasks_gpu": 0,
          "running_tasks_cpu": 0,
          "running_tasks_local": 0,
          "scheduled_tasks": 0,
          "is_local": true,
          "have_gpus": false,
          "processors_count": 0,
          "local_processors_count": 0,
          "startup_id": "not_set",
          "cpu": 0.24833333333333332,
          "memory": 0.888,
          "total_memory": 33396789248,
          "available_memory": 3735101440,
          "total_gpus": 0,
          "total_disk_size": 401643327488,
          "available_disk_size": 18929004544,
          "disk_limit_gb": 5368709120
        },
        {
          "name": "NODE:REMOTE1",
          "running_tasks": 0,
          "running_tasks_gpu": 0,
          "running_tasks_cpu": 0,
          "running_tasks_local": 0,
          "scheduled_tasks": 0,
          "is_local": false,
          "have_gpus": false,
          "processors_count": 2,
          "local_processors_count": 3,
          "startup_id": "not_set",
          "cpu": 0.25183333333333335,
          "memory": 0.888,
          "total_memory": 33396789248,
          "available_memory": 3735818240,
          "total_gpus": 0,
          "total_disk_size": 401643327488,
          "available_disk_size": 18929004544,
          "disk_limit_gb": 5368709120
        }
      ]
    }
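
As a rough illustration of the request described above, the following Python sketch builds the endpoint URL and decodes the JSON body using only the standard library. The function names build_health_url and fetch_health, the timeout value, and the injectable opener parameter are assumptions made for this example and are not part of Driverless AI.

```python
import json
import urllib.request


def build_health_url(address: str) -> str:
    """Return the Health API URL for a Driverless AI instance address (host[:port])."""
    return f"http://{address}/apis/health/v1"


def fetch_health(address: str, timeout: float = 5.0, opener=urllib.request.urlopen) -> dict:
    """Send a GET request to the health endpoint and decode the JSON body.

    `opener` is a hypothetical hook (defaulting to urllib's urlopen) so the
    function can be exercised without a live server.
    """
    with opener(build_health_url(address), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling fetch_health with the address of a running instance should return a dictionary whose keys match the response shown above, for example health["is_idle"] or health["tasks"]["scheduled"].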

Attribute Definitions

The following is a list of relevant JSON attribute definitions.

api_version (string): API version
server_version (string): Driverless AI server version
application_id (string): Driverless AI instance ID that is randomly created at startup or overridden with config.application_id
timestamp (string): Current server time in ISO 8601 format
last_system_interaction (string): ISO 8601 timestamp of the last interaction with the Driverless AI server. The following are considered system interactions:

Incoming RPC request from client

Login/Logout of user

A system event, such as a _sync_ message from a running or finished experiment

Initialization of dataset upload

Custom recipe upload

is_idle (boolean): The system is considered idle when no task is running or scheduled and no upload or download is in progress from the user session
active_users (int): Number of active users in the system. A user is considered active if they have interacted with the system within the config.user_activity_timeout period, which by default is 60 minutes
resources.nodes (int): Number of nodes in Driverless AI cluster
resources.gpus (int): Total number of GPUs in Driverless AI cluster
resources.cpu_cores (int): Total number of CPU cores in Driverless AI cluster
tasks.running (int): Total number of jobs running in the system
tasks.scheduled (int): Total number of jobs waiting for execution in scheduling queue
tasks.scheduled_on_gpu (int): The total number of jobs that require a GPU and are awaiting execution in the GPU scheduling queue
tasks.scheduled_on_cpu (int): The total number of jobs that require only CPU and are awaiting execution in the CPU scheduling queue
utilization.cpu (float [0, 1]): CPU utilization as a fraction between 0 and 1, aggregated across all nodes
utilization.gpu (float [0, 1]): GPU utilization as a fraction between 0 and 1, aggregated across all nodes
utilization.memory (float [0, 1]): Memory utilization as a fraction between 0 and 1, aggregated across all nodes
workers (list of objects): Contains a list of active workers in the multinode cluster
workers[].name (string): Name of the worker
workers[].running_tasks (int): The number of tasks that are currently running on the worker node
workers[].scheduled_tasks (int): The number of tasks scheduled specifically (with affinity) for the worker
workers[].cpu (float): A snapshot of the worker's current CPU usage, as a fraction between 0 and 1
workers[].memory (float): A snapshot of the worker's current memory usage, as a fraction between 0 and 1
workers[].total_memory (int): Total memory in bytes
workers[].available_memory (int): Available memory in bytes
workers[].total_gpus (int): Total number of GPUs
workers[].total_disk_size (int): Total disk size in bytes
workers[].available_disk_size (int): Available disk size in bytes
workers[].disk_limit_gb (float): Disk limit defined by the config.toml value for disk_limit_gb
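
To show how these attributes might feed custom autoscaling logic, here is a minimal Python sketch that operates on an already decoded response. The function names idle_seconds and scaling_hint, the three return labels, and the 600-second grace period are hypothetical choices for illustration only; they do not describe a policy shipped with Driverless AI.

```python
from datetime import datetime


def idle_seconds(health: dict) -> float:
    """Seconds between the last system interaction and the server timestamp."""
    now = datetime.fromisoformat(health["timestamp"])
    last = datetime.fromisoformat(health["last_system_interaction"])
    return (now - last).total_seconds()


def scaling_hint(health: dict, idle_grace_seconds: float = 600.0) -> str:
    """Return "scale_up", "scale_down", or "hold" (hypothetical policy).

    - scale_up: at least one job is waiting in the scheduling queue
    - scale_down: the system reports idle and has been quiet for the grace period
    - hold: anything else
    """
    if health["tasks"]["scheduled"] > 0:
        return "scale_up"
    if health["is_idle"] and idle_seconds(health) >= idle_grace_seconds:
        return "scale_down"
    return "hold"
```

Using the server-reported timestamp rather than the local clock avoids clock skew between the monitoring host and the Driverless AI server. A real policy would likely also look at the per-worker fields, such as workers[].running_tasks, before removing a node.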