Multinode Training (Alpha)¶
Driverless AI can be configured to run in a multinode worker mode. This section describes the multinode training process and how to configure it.
Understanding Multinode Training¶
Multinode training in Driverless AI can be used to run multiple experiments at the same time. It is effective in situations where you need to run and complete many experiments simultaneously in a short amount of time without having to wait for each individual experiment to finish.
Multinode training uses a load distribution technique in which a set of machines (worker nodes) are used to help a main server node process experiments. These machines can be CPU only or CPU + GPU, with experiments being distributed accordingly.
Jobs (experiments) within the multinode setup are organized into a queue. Jobs remain in this queue when no processor is available. When a worker's processor becomes available, it asks the job queue service to assign it a new job. By default, each worker node processes two jobs at a time (configured with the worker_remote_processors option in the config.toml file). Each worker can process multiple jobs at the same time, but two workers cannot process the same experiment at the same time. Messaging and data exchange services are also implemented to allow the workers to communicate effectively with the main server node.
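The FIFO scheduling described above can be illustrated with a short sketch. This is not DAI's internal code; the function and names below are hypothetical, but the behavior mirrors the description: jobs are consumed in order, and each worker fills up to worker_remote_processors slots (default 2) before jobs back up in the queue.

```python
from collections import deque

# Illustrative sketch only -- not DAI internals.
# Each worker node has worker_remote_processors slots (default 2);
# queued jobs are assigned first in, first out.
WORKER_REMOTE_PROCESSORS = 2

def assign_jobs(queue, workers):
    """Drain the FIFO job queue into free worker slots."""
    assignments = {w: [] for w in workers}
    while queue:
        # Find the first worker with a free slot.
        free = next((w for w in workers
                     if len(assignments[w]) < WORKER_REMOTE_PROCESSORS), None)
        if free is None:
            break  # All slots busy; remaining jobs stay queued.
        assignments[free].append(queue.popleft())
    return assignments

queue = deque(["exp1", "exp2", "exp3", "exp4", "exp5"])
workers = ["worker-a", "worker-b"]
result = assign_jobs(queue, workers)
# With 2 workers and 2 slots each, four experiments run and one stays queued.
```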
Multinode training in Driverless AI is currently in a preview stage. If you are interested in using multinode configurations, contact firstname.lastname@example.org.
Multinode training requires the transfer of data to several different workers. For example, if an experiment is scheduled to run on a remote worker node, the datasets it uses need to be copied to the worker machine by using the MinIO service. The experiment can take longer to initialize depending on the size of the transferred objects.
The number of jobs that each worker node processes is controlled by the worker_remote_processors option in the config.toml file.
Tasks are not distributed to best fit workers. Workers consume tasks from the queue on a first in, first out (FIFO) basis.
A single experiment runs entirely on one machine. For this reason, using a large number of commodity-grade machines is not useful in the context of multinode.
Description of Configuration Attributes¶
worker_mode: Specifies how the long-running tasks are scheduled. Available options include:
multiprocessing: Forks the current process immediately.
singlenode: Shares the task through Redis and needs a worker running.
multinode: Same as singlenode, but also shares the data through MinIO and allows workers to run on different machines.
redis_ip: Redis IP address. Defaults to 127.0.0.1.
redis_port: Redis port. Defaults to 6379.
redis_db: Redis database. Each DAI instance running on the Redis server should have a unique integer. Defaults to 0.
main_server_redis_password: Main Server Redis password. Defaults to empty string.
local_minio_port: The port that MinIO will listen on. This only takes effect if the current system is a multinode main server.
main_server_minio_address: The address of the main server’s MinIO server. Defaults to 127.0.0.1:9000.
main_server_minio_access_key_id: The access key ID of the main server's MinIO server.
main_server_minio_secret_access_key: The secret access key of the main server's MinIO server.
main_server_minio_bucket: The name of the MinIO bucket used for file synchronization.
worker_local_processors: The maximum number of local tasks processed at once. Defaults to 32.
worker_remote_processors: The maximum number of remote tasks processed at once. Defaults to 2.
redis_result_queue_polling_interval: The frequency in milliseconds that the server should extract results from the Redis queue. Defaults to 100.
main_server_minio_bucket_ping_timeout: The number of seconds the worker should wait for the main server MinIO bucket before it fails. Defaults to 30.
worker_healthy_response_period: The number of seconds to wait for the worker to respond before being marked as unhealthy. Defaults to 300.
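To show how these attributes fit together, the following is a minimal config.toml sketch for a main server. All addresses, credentials, and the bucket name are placeholders, not defaults shipped with the product; only the option names come from the list above.

```toml
# Worker scheduling mode: "multiprocessing", "singlenode", or "multinode"
worker_mode = "multinode"

# Redis connection shared by all nodes (placeholder IP and password)
redis_ip = "10.0.0.5"
redis_port = 6379
redis_db = 0
main_server_redis_password = "changeme"

# MinIO file synchronization (placeholder address, keys, and bucket)
main_server_minio_address = "10.0.0.5:9000"
main_server_minio_access_key_id = "minio_access_key"
main_server_minio_secret_access_key = "minio_secret_key"
main_server_minio_bucket = "dai-sync"

# Task processing limits (shown at their documented defaults)
worker_local_processors = 32
worker_remote_processors = 2
```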
Multinode Setup Example¶
The following example configures a two-node multinode Driverless AI cluster on AWS EC2 instances using the bashtar distribution. This example can be expanded to multiple worker nodes. It assumes that you have spun up two EC2 instances (Ubuntu 16.04) within the same VPC on AWS.
In the VPC settings, enable inbound rules to listen to TCP connections on port 6379 for Redis and 9000 for MinIO.
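As a sketch, these inbound rules can also be added with the AWS CLI. The security group ID and CIDR block below are placeholders; substitute your own values.

```shell
# Allow nodes in the VPC to reach Redis (6379) and MinIO (9000)
# on the main server. sg-0123456789abcdef0 and 10.0.0.0/16 are
# placeholders for your security group ID and VPC CIDR block.
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 6379 \
    --cidr 10.0.0.0/16

aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 9000 \
    --cidr 10.0.0.0/16
```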
Install the Driverless AI Natively¶
Install Driverless AI on the server node. Refer to one of the following topics for information on how to perform a native install on Linux systems.
Edit the Driverless AI config.toml¶
After Driverless AI is installed, edit the following config options in the config.toml file.
# Set worker mode to multinode
worker_mode = "multinode"

# Redis settings -- set the IP address of the Redis server to the AWS instance IP
redis_ip = "<host_ip>"

# Redis settings
redis_port = 6379

# Redis settings
main_server_redis_password = "<main_server_redis_pwd>"

# Location of main server's MinIO server.
# Note that you can also use `local_minio_port` to specify a different port.
main_server_minio_address = "<host_ip>:9000"
Start the Driverless AI Server Node¶
cd dai-1.9.0-linux-x86_64
./run-dai.sh
Install the Linux deb/rpm/tar package on the EC2 instance to create a Driverless AI worker node. After the installation is complete, edit the following in the config.toml.
# Redis settings, point to the DAI main server's Redis server IP address
redis_ip = "<dai_main_server_host_ip>"

# Redis settings
redis_port = 6379

# Redis settings, point to the DAI main server's Redis password
main_server_redis_password = "<dai_main_server_host_redis_pwd>"

# Location of the DAI main server's MinIO server.
main_server_minio_address = "<dai_main_server_host>:9000"
Start the Driverless AI Worker Node¶
cd dai-1.9.0-linux-x86_64
./run-dai.sh --worker

# Note that when using rpm/deb you can run the following:
sudo systemctl start dai-worker
Once the worker node starts, use the Driverless AI server IP to log in to Driverless AI. Click Resources > System Info to confirm that the number of workers is "2" when only one worker node is used. (By default, each worker node processes two jobs at a time; this is configured with the worker_remote_processors option in the config.toml file.)
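As a rough capacity check, the number of experiments the cluster can run concurrently is the number of worker nodes times worker_remote_processors. The helper below is a hypothetical sketch, not a DAI API.

```python
# Rough concurrent-experiment capacity of a multinode cluster.
# Each worker node processes worker_remote_processors jobs at a time
# (default 2), and a single experiment never spans two machines.
def cluster_capacity(num_worker_nodes, worker_remote_processors=2):
    return num_worker_nodes * worker_remote_processors

# Two worker nodes at the default setting can run 4 experiments at once.
print(cluster_capacity(2))  # 4
```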