Driverless AI Health API
The following sections describe the Driverless AI Health API.
Overview
The Driverless AI Health API is a publicly available API that exposes basic system metrics and statistics. Its primary purpose is to provide information for resource monitoring and auto-scaling of Driverless AI multinode clusters. The API outputs a set of metrics in a JSON format so that they can be used by tools like KEDA or K8S Autoscaler.
Notes:
The Health API is only available in multinode or singlenode mode. For more information, refer to the worker_mode config.toml option.
For security purposes, the Health API endpoint can be disabled by setting the enable_health_api config.toml option to false. This setting is enabled by default.
The Health API is designed to provide the information users need to write their own autoscaling logic for Multinode Driverless AI. It can also be used in tandem with services like Enterprise Puddle to skip the authentication step and instead retrieve the needed information directly.
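As an illustration of the kind of autoscaling logic the last note refers to, the following is a minimal sketch that turns a Health API payload into a target node count. It reads only the tasks.running and tasks.scheduled fields of the response. The function name and the tasks_per_worker, min_workers, and max_workers parameters are hypothetical tuning knobs chosen for this example; they are not Driverless AI settings.

```python
import math


def desired_worker_count(health: dict, tasks_per_worker: int = 2,
                         min_workers: int = 1, max_workers: int = 8) -> int:
    """Return a target node count derived from a Health API payload.

    All keyword parameters are assumptions made for this sketch.
    """
    tasks = health["tasks"]
    # Demand = jobs already running plus jobs waiting in the scheduling queue.
    demand = tasks["running"] + tasks["scheduled"]
    needed = math.ceil(demand / tasks_per_worker)
    # Clamp the result to the allowed cluster size.
    return max(min_workers, min(max_workers, needed))


sample = {"tasks": {"running": 3, "scheduled": 4}}
print(desired_worker_count(sample))  # 4
```

A real policy would likely also consider GPU versus CPU queues and add a cool-down period so that the cluster does not oscillate between sizes.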
Using the DAI Health API
To retrieve Driverless AI's health status, create a GET request:
GET http://{driverless-ai-instance-address}/apis/health/v1
This returns the following JSON response:
{
  "api_version": "1.1",
  "server_version": "1.10.6",
  "application_id": "dai_1342577",
  "timestamp": "2023-05-25T19:46:19.593386+00:00",
  "last_system_interaction": "2023-05-25T19:46:17.978125+00:00",
  "active_users": 1,
  "is_idle": false,
  "resources": {
    "cpu_cores": 12,
    "gpus": 0,
    "nodes": 2
  },
  "tasks": {
    "running": 0,
    "running_gpu": 0,
    "running_cpu": 0,
    "running_non_experiment": 0,
    "scheduled": 0,
    "scheduled_on_gpu": 0,
    "scheduled_on_cpu": 0
  },
  "utilization": {
    "cpu": 0.12416666666666666,
    "gpu": 0.0,
    "memory": 0.888
  },
  "workers": [
    {
      "name": "NODE:LOCAL2",
      "running_tasks": 0,
      "running_tasks_gpu": 0,
      "running_tasks_cpu": 0,
      "running_tasks_local": 0,
      "scheduled_tasks": 0,
      "is_local": true,
      "have_gpus": false,
      "processors_count": 0,
      "local_processors_count": 0,
      "startup_id": "not_set",
      "cpu": 0.24833333333333332,
      "memory": 0.888,
      "total_memory": 33396789248,
      "available_memory": 3735101440,
      "total_gpus": 0,
      "total_disk_size": 401643327488,
      "available_disk_size": 18929004544,
      "disk_limit_gb": 5368709120
    },
    {
      "name": "NODE:REMOTE1",
      "running_tasks": 0,
      "running_tasks_gpu": 0,
      "running_tasks_cpu": 0,
      "running_tasks_local": 0,
      "scheduled_tasks": 0,
      "is_local": false,
      "have_gpus": false,
      "processors_count": 2,
      "local_processors_count": 3,
      "startup_id": "not_set",
      "cpu": 0.25183333333333335,
      "memory": 0.888,
      "total_memory": 33396789248,
      "available_memory": 3735818240,
      "total_gpus": 0,
      "total_disk_size": 401643327488,
      "available_disk_size": 18929004544,
      "disk_limit_gb": 5368709120
    }
  ]
}
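The request above can be issued from a script as well. The following is a minimal sketch using only the Python standard library. The path /apis/health/v1 comes from the request shown above; the helper names health_url and fetch_health and the 5-second timeout are assumptions made for this example.

```python
import json
import urllib.request

# Path taken from the GET request shown above.
HEALTH_PATH = "/apis/health/v1"


def health_url(base_url: str) -> str:
    """Build the Health API URL from a Driverless AI instance address."""
    return base_url.rstrip("/") + HEALTH_PATH


def fetch_health(base_url: str, timeout: float = 5.0) -> dict:
    """Send the GET request and decode the JSON response into a dict."""
    with urllib.request.urlopen(health_url(base_url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For example, calling fetch_health with the address of your own instance returns a dict from which fields such as is_idle or tasks can be read directly.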
Attribute Definitions
The following is a list of relevant JSON attribute definitions.
api_version (string): API version
server_version (string): Driverless AI server version
application_id (string): Driverless AI instance ID that is randomly created at startup or overridden with config.application_id
timestamp (string): Current server time in ISO8601 format
last_system_interaction (string): ISO8601 format timestamp of last interaction with the Driverless AI server. The following are considered as system interactions:
Incoming RPC request from client
Login/Logout of user
A system event like _sync_ message from a running or finished experiment
Initialization of dataset upload
Custom recipe upload
is_idle (boolean): System is considered idle when there is no task running or scheduled, and no upload or download going on from the user session
active_users (int): Number of active users in the system. A user is considered active if they have interacted with the system within the config.user_activity_timeout period, which by default is 60 minutes
resources.nodes (int): Number of nodes in Driverless AI cluster
resources.gpus (int): Total number of GPUs in Driverless AI cluster
resources.cpu_cores (int): Total number of CPU cores in Driverless AI cluster
tasks.running (int): Total number of jobs running in the system
tasks.scheduled (int): Total number of jobs waiting for execution in scheduling queue
tasks.scheduled_on_gpu (int): The total number of jobs that require a GPU and are awaiting execution in the GPU scheduling queue
tasks.scheduled_on_cpu (int): The total number of jobs that require only CPU and are awaiting execution in the CPU scheduling queue
utilization.cpu (float [0, 1]): CPU utilization aggregated across all nodes, expressed as a fraction between 0 and 1
utilization.gpu (float [0, 1]): GPU utilization aggregated across all nodes, expressed as a fraction between 0 and 1
utilization.memory (float [0, 1]): Memory utilization aggregated across all nodes, expressed as a fraction between 0 and 1
workers (list of objects): Contains a list of active workers in the multinode cluster
workers[].name (string): Name of the worker
workers[].running_tasks (int): The number of tasks that are currently running on the worker node
workers[].scheduled_tasks (int): The number of tasks scheduled specifically (with affinity) for the worker
workers[].cpu (float): A snapshot of the worker's current CPU usage, expressed as a fraction
workers[].memory (float): A snapshot of the worker's current memory usage, expressed as a fraction
workers[].total_memory (int): Total memory in bytes
workers[].available_memory (int): Available memory in bytes
workers[].total_gpus (int): Total number of GPUs
workers[].total_disk_size (int): Total disk size in bytes
workers[].available_disk_size (int): Available disk size in bytes
workers[].disk_limit_gb (float): Disk limit defined by the config.toml value for disk_limit_gb
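To show how the attributes above can be combined, the following is a small sketch that works on an already decoded response. It measures how long the system has been without interaction, decides whether a scale-down looks safe, and lists workers that are short on memory. The function names, the 30-minute grace period, and the 10% free-memory threshold are hypothetical choices for this example. The sketch also assumes that total_memory and available_memory are both byte counts for the whole node, as the sample response suggests.

```python
from datetime import datetime, timedelta


def idle_for(health: dict) -> timedelta:
    """Time elapsed between the last system interaction and the server timestamp."""
    now = datetime.fromisoformat(health["timestamp"])
    last = datetime.fromisoformat(health["last_system_interaction"])
    return now - last


def safe_to_scale_down(health: dict,
                       grace: timedelta = timedelta(minutes=30)) -> bool:
    """True when the server reports idle and has had no interaction for `grace`."""
    return bool(health["is_idle"]) and idle_for(health) >= grace


def low_memory_workers(health: dict, min_free_fraction: float = 0.1) -> list:
    """Names of workers whose free-memory fraction is below the threshold."""
    return [
        w["name"]
        for w in health["workers"]
        if w["available_memory"] / w["total_memory"] < min_free_fraction
    ]
```

Using the server's own timestamp rather than the local clock keeps the idle calculation independent of clock differences between the monitoring host and the Driverless AI server.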