GPUs in Driverless AI

Driverless AI can run on machines with only CPUs or on machines with both CPUs and GPUs. For the best experience, and the one the product is designed for, install Driverless AI on modern data center hardware with GPUs and CUDA support. Feature engineering is performed primarily on CPUs, and model building primarily on GPUs. For this reason, Driverless AI benefits from multi-core CPUs with sufficient system memory and from GPUs with sufficient RAM. For best results, we recommend GPUs that use the Pascal or Volta architectures. Ampere-based NVIDIA GPUs are also supported on x86 machines (requires NVIDIA CUDA Driver 11.8 or later).

Driverless AI ships with NVIDIA CUDA 11.8.0 and cuDNN.

Image and natural language processing (NLP) use cases in H2O Driverless AI benefit significantly from GPU usage.

Model building algorithms, namely XGBoost (GBM/DART/RF/GLM), LightGBM (GBM/DART/RF), PyTorch (BERT models), and TensorFlow (CNN/BiGRU/ImageNet) models, use GPUs. To enable model scoring on GPUs, select a non-zero number of GPUs for prediction/scoring with the num_gpus_for_prediction system expert setting of the experiment. Shapley calculation on GPUs is coming soon. MOJO scoring for productionizing models on GPUs can be enabled for some use cases. See tensorflow_nlp_have_gpus_in_production in config.toml. Driverless AI TensorFlow, BERT, and Image models support C++ MOJO scoring for production.
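
As a rough illustration of the two settings named above, the following Python sketch renders a small set of overrides as `key = value` lines in the style of config.toml. The helper `render_config_overrides` is hypothetical and not part of Driverless AI. The value `1` and the boolean type of tensorflow_nlp_have_gpus_in_production are assumptions made for the example. Check your own config.toml for the accepted values.

```python
def render_config_overrides(settings):
    """Render a flat dict as TOML-style `key = value` lines.

    Hypothetical helper for illustration; not a Driverless AI API.
    """
    lines = []
    for key, value in settings.items():
        # Check bool before int, because bool is a subclass of int in Python.
        if isinstance(value, bool):
            rendered = "true" if value else "false"
        elif isinstance(value, (int, float)):
            rendered = str(value)
        else:
            # Quote strings and escape backslashes and double quotes.
            escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
            rendered = '"' + escaped + '"'
        lines.append(f"{key} = {rendered}")
    return "\n".join(lines)


# Assumed example values: one GPU for scoring, GPU-enabled NLP MOJO scoring.
gpu_scoring = {
    "num_gpus_for_prediction": 1,
    "tensorflow_nlp_have_gpus_in_production": True,
}

print(render_config_overrides(gpu_scoring))
```

The printed lines can be reviewed before they are copied into config.toml or into an experiment's expert settings.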

Feature engineering transformers such as the ClusterDist cuML Transformer, the TruncSVDNum cuML Transformer, and the DBSCAN cuML Transformer run on GPUs.

With a Driverless AI Dask multinode setup, GPUs can be used for extensive model hyperparameter search.

For details see -

GPUs can be enabled or disabled per experiment. The system expert settings of an experiment expose some fine-grained control of GPUs. For all other GPU-related config settings, see config.toml.

NVIDIA MIG support

Driverless AI 2.0 can run GPU training on machines with NVIDIA® Multi-Instance GPU (MIG) enabled. For real-time GPU monitoring, install the NVIDIA® Data Center GPU Manager (DCGM) so that Driverless AI can monitor real-time GPU metrics in the same way as when running on GPUs without MIG. Driverless AI can still run on MIG without DCGM, but real-time GPU metrics, such as memory usage and utilization, will not be available.

The following steps describe how to set up DCGM in Driverless AI. Note that Driverless AI only supports DCGM 3.3.8.

  1. Enable MIG on all GPUs: Ensure that MIG is either enabled on all visible GPUs or disabled on all of them. Driverless AI does not support mixed configurations of GPUs and MIGs. For more information, see the NVIDIA MIG user guide.

  2. Install DCGM: Install DCGM on the same machine as Driverless AI or on a separate machine. For installation instructions, see the DCGM user guide. If running on Kubernetes, use the NVIDIA GPU operator to install DCGM.

  3. Configure Driverless AI to use DCGM: Set DAI_DCGM_DAEMON_ADDRESS as an environment variable or specify dcgm_daemon_address in config.toml to allow Driverless AI to access DCGM. If running on Kubernetes, use the fully qualified domain name (FQDN) of the DCGM service.
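
Steps 1 and 3 above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a supported tool. It checks that the MIG mode reported by `nvidia-smi --query-gpu=mig.mode.current --format=csv,noheader` is the same on every GPU, and it builds a Kubernetes service FQDN for DAI_DCGM_DAEMON_ADDRESS. The helper names, the service name `nvidia-dcgm`, and the namespace `gpu-operator` are hypothetical. The sketch treats any value other than `Enabled` as MIG being off, and it assumes the default cluster domain `cluster.local`. Whether the address also needs a port is not covered here.

```python
import subprocess


def mig_enabled_flags(csv_text):
    """Parse one MIG mode per line into booleans (True means MIG enabled).

    Assumption: any value other than "Enabled" is treated as MIG being off.
    """
    return [line.strip() == "Enabled" for line in csv_text.splitlines() if line.strip()]


def mig_modes_consistent(flags):
    """Return True when every GPU has the same MIG state (no mixed setup)."""
    return len(set(flags)) <= 1


def query_mig_flags():
    """Ask nvidia-smi for the current MIG mode of each visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=mig.mode.current", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    )
    return mig_enabled_flags(result.stdout)


def k8s_service_fqdn(service, namespace, cluster_domain="cluster.local"):
    """Build the in-cluster FQDN of a Kubernetes service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


# Example with sample text instead of a live nvidia-smi call.
print(mig_modes_consistent(mig_enabled_flags("Enabled\nDisabled\n")))  # prints False

# Hypothetical service name and namespace; replace with your own.
env = {"DAI_DCGM_DAEMON_ADDRESS": k8s_service_fqdn("nvidia-dcgm", "gpu-operator")}
print(env["DAI_DCGM_DAEMON_ADDRESS"])
```

On a machine with GPUs, `mig_modes_consistent(query_mig_flags())` runs the same check against the live output of nvidia-smi.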