Driverless AI Health API

The following sections describe the Driverless AI Health API.

Overview

The Driverless AI Health API is a publicly available API that exposes basic system metrics and statistics. Its primary purpose is to provide information for resource monitoring and auto-scaling of Driverless AI multinode clusters. The API outputs a set of metrics in JSON format so that they can be consumed by tools such as KEDA or the Kubernetes autoscaler.

Notes:

  • The Health API is only available in multinode or singlenode mode. For more information, refer to the worker_mode config.toml option.

  • For security purposes, the Health API endpoint can be disabled by setting the enable_health_api config.toml option to false. The endpoint is enabled by default.

  • The Health API is designed to provide the information users need to write their own autoscaling logic for multinode Driverless AI. It can also be used in tandem with services like Enterprise Puddle to skip the authentication step and retrieve the needed information directly.

Using the DAI Health API

To retrieve Driverless AI’s health status, send a GET request:

GET http://{driverless-ai-instance-address}/apis/health/v1

This returns the following JSON response:


{
  "api_version": "1.1",
  "server_version": "1.11.1",
  "application_id": "dai_1342577",
  "timestamp": "2023-05-25T19:46:19.593386+00:00",
  "last_system_interaction": "2023-05-25T19:46:17.978125+00:00",
  "active_users": 1,
  "is_idle": false,
  "resources": {
    "cpu_cores": 12,
    "gpus": 0,
    "nodes": 2
  },
  "tasks": {
    "running": 0,
    "running_gpu": 0,
    "running_cpu": 0,
    "running_non_experiment": 0,
    "scheduled": 0,
    "scheduled_on_gpu": 0,
    "scheduled_on_cpu": 0
  },
  "utilization": {
    "cpu": 0.12416666666666666,
    "gpu": 0.0,
    "memory": 0.888
  },
  "workers": [
    {
      "name": "NODE:LOCAL2",
      "running_tasks": 0,
      "running_tasks_gpu": 0,
      "running_tasks_cpu": 0,
      "running_tasks_local": 0,
      "scheduled_tasks": 0,
      "is_local": true,
      "have_gpus": false,
      "processors_count": 0,
      "local_processors_count": 0,
      "startup_id": "not_set",
      "cpu": 0.24833333333333332,
      "memory": 0.888,
      "total_memory": 33396789248,
      "available_memory": 3735101440,
      "total_gpus": 0,
      "total_disk_size": 401643327488,
      "available_disk_size": 18929004544,
      "disk_limit_gb": 5368709120
    },
    {
      "name": "NODE:REMOTE1",
      "running_tasks": 0,
      "running_tasks_gpu": 0,
      "running_tasks_cpu": 0,
      "running_tasks_local": 0,
      "scheduled_tasks": 0,
      "is_local": false,
      "have_gpus": false,
      "processors_count": 2,
      "local_processors_count": 3,
      "startup_id": "not_set",
      "cpu": 0.25183333333333335,
      "memory": 0.888,
      "total_memory": 33396789248,
      "available_memory": 3735818240,
      "total_gpus": 0,
      "total_disk_size": 401643327488,
      "available_disk_size": 18929004544,
      "disk_limit_gb": 5368709120
    }
  ]
}
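As an illustration, the endpoint can be polled from any HTTP client. The Python sketch below (the `fetch_health` and `summarize_health` helpers and the placeholder hostname are assumptions for this example, not part of the product) fetches the payload and reduces it to a few fields that monitoring tools typically care about:

```python
import json
import urllib.request


def fetch_health(base_url):
    """Fetch the Health API payload from a Driverless AI instance."""
    with urllib.request.urlopen(f"{base_url}/apis/health/v1") as resp:
        return json.load(resp)


def summarize_health(health):
    """Reduce the full payload to fields commonly used for autoscaling."""
    return {
        "is_idle": health["is_idle"],
        "nodes": health["resources"]["nodes"],
        "running": health["tasks"]["running"],
        "scheduled": health["tasks"]["scheduled"],
        "cpu": health["utilization"]["cpu"],
        "memory": health["utilization"]["memory"],
    }


# Example usage (hostname is a placeholder):
#   health = fetch_health("http://driverless-ai-instance-address")
#   summarize_health(health)
```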

Attribute Definitions

The following is a list of relevant JSON attribute definitions.

  • api_version (string): API version

  • server_version (string): Driverless AI server version

  • application_id (string): Driverless AI instance ID that is randomly created at startup or overridden with config.application_id

  • timestamp (string): Current server time in ISO8601 format

  • last_system_interaction (string): ISO8601 format timestamp of last interaction with the Driverless AI server. The following are considered as system interactions:

  1. Incoming RPC request from client

  2. Login/Logout of user

  3. A system event, such as a _sync_ message from a running or finished experiment

  4. Initialization of dataset upload

  5. Custom recipe upload

  • is_idle (boolean): Whether the system is idle. The system is considered idle when no task is running or scheduled and no upload or download is in progress from a user session

  • active_users (int): Number of active users in the system. A user is considered active if they have interacted with the system within the config.user_activity_timeout period, which by default is 60 minutes

  • resources.nodes (int): Number of nodes in Driverless AI cluster

  • resources.gpus (int): Total number of GPUs in Driverless AI cluster

  • resources.cpu_cores (int): Total number of CPU cores in Driverless AI cluster

  • tasks.running (int): Total number of jobs running in the system

  • tasks.scheduled (int): Total number of jobs waiting for execution in scheduling queue

  • tasks.scheduled_on_gpu (int): The total number of jobs that require a GPU and are awaiting execution in the GPU scheduling queue

  • tasks.scheduled_on_cpu (int): The total number of jobs that require only CPU and are awaiting execution in the CPU scheduling queue

  • utilization.cpu (float [0, 1]): CPU utilization aggregated across all nodes, expressed as a fraction of total capacity (0 = idle, 1 = fully utilized)

  • utilization.gpu (float [0, 1]): GPU utilization aggregated across all nodes, expressed as a fraction of total capacity

  • utilization.memory (float [0, 1]): Memory utilization aggregated across all nodes, expressed as a fraction of total capacity

  • workers (list of objects): Contains a list of active workers in the multinode cluster

  • workers[].name (string): Name of the worker

  • workers[].running_tasks (int): The number of tasks that are currently running on the worker node

  • workers[].scheduled_tasks (int): The number of tasks scheduled specifically (with affinity) for the worker

  • workers[].cpu (float): A snapshot of the worker’s current CPU usage, expressed as a fraction

  • workers[].memory (float): A snapshot of the worker’s current memory usage, expressed as a fraction

  • workers[].total_memory (int): Total memory on the worker, in bytes

  • workers[].available_memory (int): Available memory in bytes

  • workers[].total_gpus (int): Total number of GPUs

  • workers[].total_disk_size (int): Total disk size in bytes

  • workers[].available_disk_size (int): Available disk size in bytes

  • workers[].disk_limit_gb (float): Disk limit defined by the config.toml value for disk_limit_gb
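To show how these attributes fit together in practice, here is a minimal sketch of custom autoscaling logic built on the fields above. The `decide_scale` function, its thresholds, and the node limits are illustrative assumptions, not part of Driverless AI; real logic would tune these to the cluster's workload:

```python
def decide_scale(health, min_nodes=1, max_nodes=8):
    """Return a node-count delta (-1, 0, or +1) from a Health API payload.

    Scale up when jobs are queued; scale down when the cluster is idle.
    """
    nodes = health["resources"]["nodes"]
    scheduled = health["tasks"]["scheduled"]

    if scheduled > 0 and nodes < max_nodes:
        return 1   # jobs waiting in the scheduling queue: add a worker
    if health["is_idle"] and health["active_users"] == 0 and nodes > min_nodes:
        return -1  # nothing running and nobody logged in: remove a worker
    return 0       # steady state: leave the cluster size unchanged
```

An external controller (for example, one feeding KEDA or a Kubernetes autoscaler) could evaluate such a function on each polling interval and apply the resulting delta.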