GPUs in Driverless AI
Driverless AI can run on machines with only CPUs or machines with CPUs and GPUs. For the best (and intended-as-designed) experience, install Driverless AI on modern data center hardware with GPUs and CUDA support. Feature engineering and model building are primarily performed on CPU and GPU respectively. For this reason, Driverless AI benefits from multi-core CPUs with sufficient system memory and GPUs with sufficient RAM. For best results, we recommend GPUs that use the Pascal or Volta architectures. Ampere-based NVIDIA GPUs are also supported on x86 machines (requires NVIDIA CUDA Driver 11.8 or later).
Driverless AI ships with NVIDIA CUDA 11.8.0 and cuDNN.
Image and natural language processing (NLP) use cases in H2O Driverless AI benefit significantly from GPU usage.
Model building algorithms, namely XGBoost (GBM/DART/RF/GLM), LightGBM (GBM/DART/RF), PyTorch (BERT models), and TensorFlow (CNN/BiGRU/ImageNet) models, utilize GPUs. Model scoring on GPUs can be enabled by selecting a non-zero number of GPUs for prediction/scoring via the num_gpus_for_prediction expert setting of the experiment. Shapley calculation on GPUs is coming soon. MOJO scoring on GPUs for productionizing models can be enabled for some use cases; see tensorflow_nlp_have_gpus_in_production in config.toml. Driverless AI TensorFlow, BERT, and Image models support C++ MOJO scoring for production.
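As a minimal sketch, the two settings named above can be set as config.toml overrides (the values shown are illustrative, not recommendations):

    # Use one GPU for prediction/scoring (0 means score on CPU).
    num_gpus_for_prediction = 1
    # Enable GPU usage for TensorFlow NLP models in production (MOJO) scoring, as referenced above.
    tensorflow_nlp_have_gpus_in_production = true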
Feature engineering transformers such as the ClusterDist cuML, TruncSVDNum cuML, and DBSCAN cuML transformers run on GPUs.
With a Driverless AI Dask multinode setup, GPUs can be used for extensive model hyperparameter search.
For details, see the sections below.
Driverless AI & NVIDIA cuDNN
NVIDIA cuDNN is a library for deep neural networks built using CUDA and optimized for GPUs. For NLP modeling and feature engineering, Driverless AI uses cuDNN with PyTorch (BERT models) and with the TensorFlow NLP recipe based on CNN and BiGRU (RNN) deep learning models. For modeling image data, TensorFlow (ImageNet) models are used.
Driverless AI & NVIDIA RAPIDS
NVIDIA RAPIDS provides GPU-accelerated PyData APIs. Driverless AI integrates the RAPIDS cuML (scikit-learn-style) transformers, namely the ClusterDist cuML, TruncSVDNum cuML, and DBSCAN cuML transformers, for feature engineering, and the RAPIDS cuDF extension to XGBoost GBM/DART for building machine learning models on GPUs.
Driverless AI & Dask + RAPIDS
Dask is a Python library that enables out-of-core parallelism and distributed execution of some popular Python libraries. Combining Dask with RAPIDS provides a multinode, multi-GPU framework. Driverless AI supports Dask multinode clusters and offers Dask cuDF multi-GPU XGBoost GBM/DART for modeling, as well as Dask multinode/multi-GPU hyperparameter search for model parameter selection.
Driverless AI & OpenCL and CUDA LightGBM
On x86 architectures, Driverless AI uses the OpenCL implementation of LightGBM for building models on GPUs.
GPUs can be enabled or disabled per experiment. The system expert settings of an experiment expose some fine-grained control over GPUs. For all other GPU-related configuration settings, see config.toml.
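As an illustration, GPU-related entries in config.toml take the usual key = value form. The key names and defaults below are assumptions based on commonly documented Driverless AI settings; consult config.toml for the authoritative names, values, and descriptions:

    # Assumed example entries; verify against your config.toml before use.
    num_gpus_per_experiment = -1   # -1 is assumed to mean "use all available GPUs"
    num_gpus_per_model = 1         # GPUs assigned to a single model build (assumed default)
    num_gpus_for_prediction = 0    # 0 scores on CPU; see the scoring discussion above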
NVIDIA MIG support
Driverless AI 2.0 can run GPU training on machines with NVIDIA® Multi-Instance GPUs (MIGs). For real-time GPU monitoring, install the NVIDIA® Data Center GPU Manager (DCGM) so that Driverless AI can monitor real-time GPU metrics in the same way as when running on regular (non-MIG) GPUs. Driverless AI can still run on MIGs without DCGM, but real-time GPU metrics, such as memory usage and utilization, will not be available.
The following steps describe how to set up DCGM in Driverless AI. Note that Driverless AI only supports DCGM 3.3.8.
1. Enable MIG on all GPUs: Ensure that all visible GPUs are either enabled or disabled for MIG; Driverless AI does not support a mix of MIG-enabled and non-MIG GPUs. For more information, see the NVIDIA MIG user guide. An example command for enabling MIG mode is shown after these steps.
2. Install DCGM: Install DCGM on the same machine as Driverless AI or on a separate machine. For installation instructions, see the DCGM user guide. If running on Kubernetes, use the NVIDIA GPU Operator to install DCGM.
3. Configure Driverless AI to use DCGM: Set DAI_DCGM_DAEMON_ADDRESS as an environment variable or specify dcgm_daemon_address in config.toml to allow Driverless AI to access DCGM. If running on Kubernetes, use the fully qualified domain name (FQDN) of the DCGM service. A configuration example is shown below.
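For step 1, a minimal sketch using standard NVIDIA tooling (this is not Driverless AI specific; required privileges and whether a GPU reset or reboot is needed depend on your system):

    # Enable MIG mode on all GPUs (may require a GPU reset or reboot to take effect).
    sudo nvidia-smi -mig 1
    # Confirm that MIG mode reads "Enabled" for every GPU.
    nvidia-smi --query-gpu=index,mig.mode.current --format=csv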
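For step 3, a minimal sketch of the two configuration options named above (the hostname and port are placeholders; 5555 is DCGM's default nv-hostengine port, so adjust the address to match your deployment):

    # Option A: environment variable for the Driverless AI process
    export DAI_DCGM_DAEMON_ADDRESS="dcgm-host.example.com:5555"

    # Option B: equivalent entry in config.toml
    dcgm_daemon_address = "dcgm-host.example.com:5555"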