Experiment Queuing In Driverless AI
Driverless AI supports automatic queuing of experiments to avoid system overload. You can launch multiple experiments simultaneously that are automatically queued and run when the necessary resources become available.
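The queuing behavior described above can be sketched with a fixed-size worker pool: submissions beyond the concurrency limit simply wait until a slot frees up. This is a minimal illustration of the concept, not Driverless AI's actual implementation; the names and the limit of 2 are assumptions for the example.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Sketch only: a pool with a fixed number of slots runs "experiments"
# as resources free up; extra submissions wait in the queue.
MAX_CONCURRENT = 2  # analogous to the per-node experiment limit

running = 0
peak = 0
lock = threading.Lock()

def run_experiment(name):
    """Stand-in for launching one experiment."""
    global running, peak
    with lock:
        running += 1
        peak = max(peak, running)   # track how many run at once
    time.sleep(0.05)                # stand-in for the training work
    with lock:
        running -= 1
    return f"{name} done"

with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
    # Five experiments are submitted at once; only two run at a time,
    # the rest sit in the queue until a worker becomes available.
    results = list(pool.map(run_experiment, [f"exp-{i}" for i in range(5)]))

print(peak)  # never exceeds MAX_CONCURRENT
```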
The worker queue indicates the number of experiments that are waiting for their turn on a CPU or GPU + CPU system. Significant jobs like running experiments and making predictions are distinguished from minor tasks. In the following image, 'GPU queue' indicates that two experiments are waiting in the worker queue on a GPU-enabled system, not that two workers are waiting for a GPU:
By default, each node runs two experiments at a time. This is controlled by the worker_remote_processors option in the config.toml file. Starting with version 1.10.4, Driverless AI automatically sets the maximum number of CPU cores to use per experiment and the maximum number of remote tasks to be processed at one time based on the number of CPU cores your system has. For example, if your system has 128 CPU cores, DAI automatically sets worker_remote_processors to 6 and reduces max_cores by about 4. Additional options that control resource allocation can also be configured in the config.toml file.
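As a sketch, the settings mentioned above might look like the following in config.toml. The option names come from the text; the values are illustrative, not recommendations:

```toml
# Illustrative config.toml fragment; values are examples only.

# Number of experiments each node runs at a time (default: 2).
worker_remote_processors = 2

# Maximum number of CPU cores each experiment may use.
# Starting with 1.10.4, DAI derives this automatically from the
# system's core count, so setting it by hand is usually optional.
max_cores = 8
```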
By default, Driverless AI is configured so that each experiment attempts to use every GPU on the system as needed. To limit the number of GPUs per experiment, configure the corresponding num_gpus_per option in the config.toml file. For an optimal setup, take into account how the config.toml options prefixed with num_gpus_per have been configured, as well as how many concurrent tasks have been set for a worker that has GPUs.
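For example, a per-experiment GPU limit might be set as follows. The exact option name is an assumption here (num_gpus_per_experiment is shown for illustration); check your config.toml for the full set of options prefixed with num_gpus_per:

```toml
# Illustrative only: limit the GPUs each experiment may claim.
# num_gpus_per_experiment is assumed; verify the available
# num_gpus_per-prefixed options in your own config.toml.
num_gpus_per_experiment = 1
```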
If GPUs are disabled, ‘#1 in CPU queue’ appears instead of ‘#1 in GPU queue’.
For more information on multinode training in Driverless AI, see Multinode Training (Alpha).