Before You Begin¶

Driverless AI can run on machines with only CPUs or machines with CPUs and GPUs. For the best (and intended-as-designed) experience, install Driverless AI on modern data center hardware with GPUs and CUDA support. Feature engineering and model building are primarily performed on CPU and GPU respectively. For this reason, Driverless AI benefits from multi-core CPUs with sufficient system memory and GPUs with sufficient RAM. For best results, we recommend GPUs that use the Pascal or Volta architectures. The older K80 and M60 GPUs available in EC2 are supported and very convenient, but not as fast. Ampere-based NVIDIA GPUs are also supported on x86, as Driverless AI ships with NVIDIA CUDA 11.2.2 toolkit. Image processing and NLP use cases in particular, benefit significantly from GPU usage. For details, see GPUs in Driverless AI.

Driverless AI supports local, LDAP, and PAM authentication. Authentication can be configured by setting environment variables or via a config.toml file. Refer to the Authentication Methods section for more information. Note that the default authentication method is “unvalidated.”

Driverless AI also supports HDFS, S3, Google Cloud Storage, Google Big Query, KDB, MinIO, and Snowflake access. Support for these data sources can be configured by setting environment variables for the data connectors or via a config.toml file. Refer to the Data Connectors section for more information.

Sizing Requirements¶

Sizing Requirements for Native Installs¶

Driverless AI requires a minimum of 5 GB of system memory in order to start experiments and a minimum of 5 GB of disk space in order to run a small experiment. Note that these limits can changed in the config.toml file. We recommend that you have sufficient system CPU memory (64 GB or more) and 1 TB of free disk space available.

Sizing Requirements for Docker Installs¶

For Docker installs, we recommend 1 TB of free disk space. Driverless AI uses approximately 38 GB. In addition, the unpacking/temp files require space on the same Linux mount /var during installation. Once Driverless AI runs, the mounts from the Docker container can point to other file system mount points.

GPU Sizing Requirements¶

If you are running Driverless AI with GPUs, ensure that your GPU has compute capability >=3.5 and at least 4GB of RAM. If these requirements are not met, then Driverless AI switches to CPU-only mode.

Sizing Requirements for Storing Experiments¶

We recommend that your Driverless tmp directory has at least 500 GB to 1 TB of space. The (Driverless) tmp directory holds all experiments and all datasets. We also recommend that you use SSDs (preferably NVMe).

Virtual Memory Settings in Linux¶

If you are running Driverless AI on a Linux machine, we recommend setting the overcommit memory to 0. The setting can be changed with the following command:

sudo sh -c "/bin/echo 0 > /proc/sys/vm/overcommit_memory"

This is the default value that indicates that the Linux kernel is free to overcommit memory. If this value is set to 2, then the Linux kernel does not overcommit memory. In the latter case, the memory requirements of Driverless AI may surpass the memory allocation limit and prevent the experiment from completing.

If Redis runs out of memory, then background save may fail under low memory condition with “0” set.

The value “0” is not required to run DAI, but DAI might erroneously seem to run out of memory even if plenty of memory is available.

Large Pages¶

If Transparent Huge Pages (THP) support is enabled in your kernel, avoiding huge page support is highly recommended.

sudo sh -c "/bin/echo never > /sys/kernel/mm/transparent_hugepage/enabled"

Note that this isn’t required to run DAI. However, if the preceding command isn’t used, older systems may run much slower when allocating memory, and issues relating to latency and memory usage may arise with Redis.

Per-Experiment GPU memory and usage¶

If you’re using Docker, you need to use --pid=host for NVIDIA to be able to access per-process GPU usage for logging. This improves logging, but isn’t required to run DAI.

Docker Shared Memory¶

XGBoost, Torch, and TensorFlow require shared memory allocated larger than the small default. The docker command should contain the following:

--shm-size=2g

Without this option, those packages will fail. Triton inference server also requires this option be set, and if under heavy load, may require even larger values than 2g.

Docker resource limits¶

DAI controls various resources and needs more resources than what systems typically set by default. You can use the following option to ensure that DAI is given enough resources:

--ulimit nofile=131071:131071 --ulimit nproc=16384:16384

Without this option, DAI crashes under load.

Docker NICE¶

As stated in the official Docker documentation, the --cap-add=SYS_NICE option grants the container the CAP_SYS_NICE capability, which lets the container raise process nice values, set real-time scheduling policies, set CPU affinity, and other operations. If this flag isn’t passed when starting the container, DAI isn’t able to control resources and can end up with all processes only using a single core. This is also required to use the built-in NVIDIA Triton Inference Server and its use of non-uniform memory access (NUMA) control.

Memory Requirements per Experiment¶

As a rule of thumb, the memory requirement per experiment is approximately 5 to 10 times the size of the dataset. Dataset size can be estimated as the number of rows x columns x 4 bytes; if text is present in the data, then more bytes per element are needed.

Backup Strategy¶

The Driverless AI tmp directory is used to store all experiment artifacts such as deployment artifacts and MLIs. It also stores the master.db database that tracks users to Driverless artifacts. Note that no files should be added or deleted in the tmp folder outside of what Driverless AI adds automatically.

We recommend periodically stopping Driverless AI and backing up the Driverless AI tmp directory to ensure that a copy of the Driverless AI state is available for instances where you may need to revert to a prior state.

Upgrade Strategy¶

When upgrading Driverless AI, note that:

Image models from version 1.9.x aren’t supported in 1.10.x. All other models from 1.9.x are supported in 1.10.x.
(MLI) Interpretations made in version 1.9.0 are supported in 1.9.x and later.
(MLI) Interpretations made in version 1.8.x aren’t supported in 1.9.x and later. However, interpretations made in 1.8.x can still be viewed and rerun.
We recommend following these steps before upgrading:
- Build MLI models: Before upgrading, run MLI jobs on models that you want to continue to interpret in future Driverless AI releases. If an MLI job appears in the list of Interpreted Models in your current version, then it is retained after upgrading.
- Build MOJO pipelines: Before upgrading, build MOJO pipelines on all desired models.
- Stop Driverless AI and make a backup (copy) of the Driverless AI tmp directory.

The upgrade process inherits the service user and group from /etc/dai/User.conf and /etc/dai/Group.conf. You do not need to manually specify the DAI_USER or DAI_GROUP environment variables during an upgrade.

Note: Driverless AI does not support data migration from a newer version to an older version. If you rollback to an older version of Driverless AI after upgrading, newer versions of the master.db file will not work with the older Driverless AI version. For this reason, we recommend saving a copy of the older ‘tmp’ directory to fully restore the older Driverless AI version’s state.

Other Notes¶

Supported Browsers¶

Driverless AI is tested most extensively on Chrome and Firefox. For the best user experience, we recommend using the latest version of Chrome. You may encounter issues if you use other browsers or earlier versions of Chrome and/or Firefox.

To `sudo` or Not to `sudo`¶

Driverless RPM and DEB installs require sudo access. The TARSH install can be done without sudo access.

Some of the installation steps in the document may show sudo prepending different commands. Note that using sudo may not always be required.

Note about Docker Configuration (`ulimit`)¶

When running Driverless AI with Docker, it is recommended to configure ulimit options by using the --ulimit argument to docker run. The following is an example of how to configure these options:

--ulimit nproc=65535:65535 \
--ulimit nofile=4096:8192 \

Refer to https://docs.docker.com/engine/reference/commandline/run/#set-ulimits-in-container—ulimit for more information on these options.

Note about nvidia-docker 1.0¶

If you have nvidia-docker 1.0 installed, you need to remove it and all existing GPU containers. Refer to https://github.com/NVIDIA/nvidia-docker/blob/master/README.md for more information.

Deprecation of `nvidia-smi`¶

The nvidia-smi command has been deprecated by NVIDIA. Refer to https://github.com/nvidia/nvidia-docker#upgrading-with-nvidia-docker2-deprecated for more information. The installation steps have been updated for enabling persistence mode for GPUs.

Note About CUDA Versions¶

Driverless AI ships with CUDA 11.2.2 for GPUs, but the driver must exist in the host environment. We recommend to have NVIDIA driver >= 471.68 installed in your environment, for a seamless experience on all NVIDIA architectures, including Ampere.

Go to NVIDIA download driver to get the latest NVIDIA Tesla A/T/V/P/K series driver. For reference on CUDA Toolkit and Minimum Required Driver Versions and CUDA Toolkit and Corresponding Driver Versions, see here .

Note

If you are using K80 GPUs, the minimum required NVIDIA driver version is 450.80.02.

Note About Authentication¶

The default authentication setting in Driverless AI is “unvalidated.” In this case, Driverless AI will accept any login and password combination, it will not validate whether the password is correct for the specified login ID, and it will connect to the system as the user specified in the login ID. This is true for all instances, including Cloud, Docker, and native instances.

We recommend that you configure authentication. Driverless AI provides a number of authentication options, including LDAP, PAM, Local, and None. Refer to Authentication Methods for information on how to enable a different authentication method.

Note: Driverless AI is also integrated with IBM Spectrum Conductor and supports authentication from Conductor. Contact sales@h2o.ai for more information about using IBM Spectrum Conductor authentication.

Note About Shared File Systems¶

If your environment uses a shared file system, then you must set the following configuration option:

datatable_strategy='write'

The above can be specified in the config.toml file (for native installs) or specified as an environment variable (Docker image installs).

This configuration is required because, in some cases, Driverless AI can fail to read files during an experiment. The write option lets Driverless AI properly read and write data from shared file systems to disk.

Note About the Master Database File¶

The master.db file keeps track of users to Driverless artifacts in the DAI tmp directory. If you are running two versions of Driverless AI, keep in mind that newer versions of the master.db file will not work with older versions of Driverless AI.