Experiment Queuing In Driverless AI
Driverless AI supports automatic queuing of experiments to avoid system overload. You can launch multiple experiments simultaneously; they are automatically queued and run when the necessary resources become available.
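For instance, with the driverlessai Python client you can submit several experiments back to back and let the server queue them. This is a minimal sketch; the address, credentials, dataset path, target column, and experiment names below are placeholders, not values from this documentation:

```python
# Sketch: submit several experiments at once with the
# `driverlessai` Python client. Driverless AI queues any
# experiments that exceed the available worker slots and runs
# them as resources free up.
import driverlessai

# Placeholder connection details.
dai = driverlessai.Client(
    address="http://localhost:12345",
    username="user",
    password="password",
)

# Placeholder training data.
ds = dai.datasets.create("train.csv")

# create_async() returns immediately instead of blocking, so
# all four experiments are submitted up front; queued ones
# start automatically when a worker slot opens.
experiments = [
    dai.experiments.create_async(
        train_dataset=ds,
        target_column="target",
        task="classification",
        name=f"queued-run-{i}",
    )
    for i in range(4)
]
```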
The worker queue indicates the number of experiments that are waiting for their turn on a CPU or GPU + CPU system. Significant jobs like running experiments and making predictions are distinguished from minor tasks. In the following image, #2 in GPU queue indicates that there are two experiments waiting in the worker queue on a GPU-enabled system, not that two workers are waiting for a GPU:
Notes:
- By default, each node runs two experiments at a time. This is controlled by the worker_remote_processors option in the config.toml file. Starting with version 1.10.4, Driverless AI automatically sets the maximum number of CPU cores to use per experiment and the maximum number of remote tasks to be processed at one time based on the number of CPU cores your system has. For example, if your system has 128 CPU cores, DAI automatically sets worker_remote_processors to 6 and reduces max_cores by about 4. Additional options that control resource allocation can also be configured in the config.toml file.
- By default, Driverless AI is configured so that each experiment attempts to use every GPU on the system as needed. To limit the number of GPUs per experiment, configure the num_gpus_per_experiment config.toml setting.
- For an optimal setup, take into account how config.toml options prefixed with num_gpus_per have been configured, as well as how many concurrent tasks have been set for a worker that has GPUs; the sketch after this list shows how these settings fit together.
- If GPUs are disabled, #1 in CPU queue appears instead of #1 in GPU queue.
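As a concrete illustration, here is a minimal config.toml sketch combining the settings mentioned above. The values are assumptions for a hypothetical 128-core, 4-GPU node, not recommended defaults; tune them for your own hardware:

```toml
# Sketch of queue-related settings in config.toml (assumed
# values for an example 128-core, 4-GPU node).

# Number of significant jobs (experiments, predictions) each
# worker node processes concurrently. On 1.10.4+ this is set
# automatically from the CPU core count unless overridden.
worker_remote_processors = 6

# Upper bound on CPU cores used by a single experiment. Keeping
# worker_remote_processors * max_cores near the physical core
# count helps avoid oversubscribing the node.
max_cores = 20

# Limit each experiment to a subset of GPUs instead of letting
# it claim every GPU on the system (the default behavior).
num_gpus_per_experiment = 2

# Related per-model GPU limit; review the other num_gpus_per_*
# options together with this one for a consistent setup.
num_gpus_per_model = 1
```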
For more information on multinode training in Driverless AI, see Multinode Training (Alpha).