Experiment Queuing In Driverless AI

Driverless AI supports automatic queuing of experiments to avoid system overload. You can launch multiple experiments at once; they are automatically queued and run when the necessary resources become available.

The worker queue indicates the number of experiments that are waiting for their turn on a CPU-only or GPU-enabled (GPU + CPU) system; it counts significant jobs such as running experiments and making predictions, not minor tasks. In the following image, ‘GPU queue’ indicates that two experiments are waiting in the worker queue on a GPU-enabled system, not that two workers are waiting for a GPU:

(Image: Worker Queue)

Notes:

  • By default, each node runs two experiments at a time; this is controlled by the worker_remote_processors option in the config.toml file. Starting with version 1.10.4, Driverless AI automatically sets the maximum number of CPU cores to use per experiment and the maximum number of remote tasks to process at one time based on the number of CPU cores your system has. For example, on a system with 128 CPU cores, DAI automatically sets worker_remote_processors to 6 and reduces each experiment's core usage (the max_cores setting) to roughly a quarter of its default. Additional options that control resource allocation can also be configured in the config.toml file (see the first sketch after these notes).

  • By default, Driverless AI is configured so that each experiment attempts to use every GPU on the system as needed. To limit the number of GPUs available to a single experiment, configure the num_gpus_per_experiment setting in config.toml (see the second sketch after these notes).

  • For an optimal setup, balance the config.toml options prefixed with num_gpus_per against the number of concurrent tasks configured for a worker that has GPUs, so that concurrently running experiments do not oversubscribe the available GPUs (see the third sketch after these notes).

  • If GPUs are disabled, ‘#1 in CPU queue’ appears instead of ‘#1 in GPU queue’.

  • For more information on multinode training in Driverless AI, see Multinode Training (Alpha).
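
The following is a minimal config.toml sketch of the concurrency settings described in the first note above. The values shown are illustrative, not shipped defaults; worker_remote_processors and max_cores are the options named in that note.

    # config.toml (sketch): concurrency on a single node.
    # Number of experiments (and other significant jobs) this node runs at a time.
    worker_remote_processors = 2

    # Maximum number of CPU cores each experiment may use. On systems with many
    # cores, DAI 1.10.4+ lowers this automatically so that concurrent workers do
    # not oversubscribe the CPUs; the value here is illustrative only.
    max_cores = 32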
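To cap the number of GPUs per experiment (second note), a single entry suffices; the value 2 is illustrative:

    # config.toml (sketch): limit each experiment to at most two of the
    # system's GPUs instead of letting it use every GPU as needed.
    num_gpus_per_experiment = 2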
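And here is a minimal sketch of the balance described in the third note, assuming a worker with two GPUs and using num_gpus_per_experiment as the representative num_gpus_per option:

    # config.toml (sketch): a two-GPU worker runs two experiments at a time...
    worker_remote_processors = 2
    # ...and each experiment is limited to one GPU, so the two concurrent
    # experiments never contend for the same GPU (2 workers x 1 GPU = 2 GPUs).
    num_gpus_per_experiment = 1

With this pairing the worker's GPUs stay fully used without contention; raising worker_remote_processors without lowering num_gpus_per_experiment would let concurrent experiments compete for the same GPUs.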