Dask Redis Multinode Example
Dask Multinode Example running Docker
On the main server with public IP address 172.16.2.210:
mkdir -p /home/$USER/docker/data ; chmod u+rwx /home/$USER/docker/data
mkdir -p /home/$USER/docker/log ; chmod u+rwx /home/$USER/docker/log
mkdir -p /home/$USER/docker/tmp ; chmod u+rwx /home/$USER/docker/tmp
mkdir -p /home/$USER/docker/license ; chmod u+rwx /home/$USER/docker/license
mkdir -p /home/$USER/docker/jupyter/notebooks
cp /home/$USER/.driverlessai/license.sig /home/$USER/docker/license/
export server=172.16.2.210
docker run \
--net host \
--runtime nvidia \
--rm \
--init \
--pid=host \
--gpus all \
--ulimit core=-1 \
--shm-size=2g \
-u `id -u`:`id -g` \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-v /home/$USER/docker/license:/license \
-v /home/$USER/docker/data:/data \
-v /home/$USER/docker/log:/log \
-v /home/$USER/docker/tmp:/tmp \
-v /home/$USER/docker/jupyter:/jupyter \
-e dai_dask_server_ip=$server \
-e dai_redis_ip=$server \
-e dai_redis_port=6379 \
-e dai_main_server_minio_address=$server:9001 \
-e dai_local_minio_port=9001 \
-e dai_ip=$server \
-e dai_main_server_redis_password="<REDIS_PASSWORD>" \
-e dai_worker_mode='multinode' \
-e dai_enable_dask_cluster=1 \
-e dai_enable_jupyter_server=1 \
-e dai_enable_jupyter_server_browser=1 \
-e NCCL_SOCKET_IFNAME="enp5s0" \
-e NCCL_DEBUG=WARN \
-e NCCL_P2P_DISABLE=1 \
docker_image
The preceding example launches the following:
DAI main server on 12345
MinIO data server on 9001
Redis server on 6379
H2O-3 MLI server on 12348
H2O-3 recipe server on 50361
Jupyter on 8889
Dask CPU scheduler on 8786
Dask CPU scheduler’s dashboard on 8787
Dask GPU scheduler on 8790
Dask GPU scheduler’s dashboard on 8791
LightGBM Dask listening port on 12400
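Once the main server is up, a quick reachability check from a worker node can confirm that these services are listening. The following is a minimal sketch, not part of Driverless AI: the helper names port_open and check_dai_ports are hypothetical, the port subset is an assumption taken from the list above, and bash plus the coreutils timeout command are assumed to be available.

```shell
# Return 0 if a TCP connection to host:port succeeds within 2 seconds.
# Relies on bash's /dev/tcp pseudo-device (assumption: bash is installed).
port_open() {
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Report the state of a subset of the ports listed above.
# Adjust the list if any port was changed from its default.
check_dai_ports() {
  host="$1"
  for port in 12345 9001 6379 8786 8790; do
    if port_open "$host" "$port"; then
      echo "$port open"
    else
      echo "$port closed"
    fi
  done
}

# Usage: check_dai_ports 172.16.2.210
```

A port reported as closed may simply mean the service has not finished starting, so the check is only a first diagnostic.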
Notes:
$USER in bash gives the username.
Replace <REDIS_PASSWORD> with the default Redis password or a new one.
Replace various ports with alternative values if required.
Replace docker_image with the image (include the repository if it is a remote image).
For GPU usage, --runtime nvidia is required. Systems without GPUs should remove this line.
Dask on cluster can be disabled by passing dai_enable_dask_cluster=0. If Dask on cluster is disabled, then dai_dask_server_ip does not need to be set.
Dask dashboard ports (for example, 8787 and 8791) and H2O-3 ports 12348, 50361, and 50362 are not required to be exposed. These are for user-level access to H2O-3 or Dask behavior.
Jupyter can be disabled by passing dai_enable_jupyter_server=0 and dai_enable_jupyter_server_browser=0.
Dask requires the host network so that the scheduler can tell workers where to find other workers. A separate subnet on a new IP therefore cannot be used, e.g. one created with docker network create --subnet=192.169.0.0/16 dainet.
To isolate access to a single user, instead of passing -v /etc/passwd:/etc/passwd:ro -v /etc/group:/etc/group:ro, one can map user files that contain the same required information. These options ensure the container knows who the user is.
The directories created should either not already exist or come from a prior run by the same user. Pre-existing directories should be moved or renamed to avoid conflicts.
Services like the Procsy server, H2O-3 MLI and Recipe servers, and Vis-data server are only used internally for each node.
The option -p 12400:12400 is only required for LightGBM Dask.
NCCL_SOCKET_IFNAME should specify the actual hardware device to use. This is required because NCCL has issues obtaining the correct device automatically from the IP.
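One way to find the device name for NCCL_SOCKET_IFNAME is to look up which interface carries the node's IP address. The sketch below assumes a Linux host with the iproute2 ip command; the helper name iface_for_ip is hypothetical and simply parses the one-line-per-address output of ip -o -4 addr show.

```shell
# Print the interface name that carries the given IPv4 address.
# Reads `ip -o -4 addr show` output on stdin, where field 2 is the
# interface name and field 4 is the address in CIDR form.
iface_for_ip() {
  awk -v ip="$1" '{ split($4, a, "/"); if (a[1] == ip) print $2 }'
}

# Look up the interface for this node's address, if `ip` is available.
# 172.16.2.210 is the main server address used in this example.
if command -v ip >/dev/null 2>&1; then
  ip -o -4 addr show | iface_for_ip "${server:-172.16.2.210}"
fi
```

On each node, pass that node's own IP address and use the printed name as the value of NCCL_SOCKET_IFNAME. No output means the address is not assigned to any local interface.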
On any number of workers for the server with public IP address 172.16.2.210:
mkdir -p /home/$USER/docker/log ; chmod u+rwx /home/$USER/docker/log
mkdir -p /home/$USER/docker/tmp ; chmod u+rwx /home/$USER/docker/tmp
export server=172.16.2.210
docker run \
--runtime nvidia \
--gpus all \
--rm \
--init \
--pid=host \
--net host \
--ulimit core=-1 \
--shm-size=2g \
-u `id -u`:`id -g` \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-v /home/$USER/docker/log:/log \
-v /home/$USER/docker/tmp:/tmp \
-e dai_dask_server_ip=$server \
-e dai_redis_ip=$server \
-e dai_redis_port=6379 \
-e dai_main_server_minio_address=$server:9001 \
-e dai_local_minio_port=9001 \
-e dai_ip=$server \
-e dai_main_server_redis_password="<REDIS_PASSWORD>" \
-e dai_worker_mode='multinode' \
-e dai_enable_dask_cluster=1 \
-e NCCL_SOCKET_IFNAME="enp4s0" \
-e NCCL_DEBUG=WARN \
-e NCCL_P2P_DISABLE=1 \
docker_image --worker
Notes:
If the same disk is used for the main server and a worker, change "docker" to "docker_w1" in the directory paths for worker 1, etc.
NCCL_SOCKET_IFNAME should specify the actual hardware device name, which is in general different on each node.
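When workers share a disk with the main server, the per-worker directories can be created with a small helper. This is a sketch under the naming scheme from the note above (docker_w1, docker_w2, and so on); the function name make_worker_dirs is hypothetical.

```shell
# Create log and tmp directories for worker N alongside the main
# server's tree, e.g. /home/$USER/docker_w1 for worker 1.
# Prints the worker's base directory.
make_worker_dirs() {
  base="$1"       # main server's directory, e.g. /home/$USER/docker
  worker_id="$2"  # 1, 2, ...
  dir="${base}_w${worker_id}"
  mkdir -p "$dir/log" "$dir/tmp"
  chmod u+rwx "$dir/log" "$dir/tmp"
  echo "$dir"
}

# Usage:
#   wdir=$(make_worker_dirs "/home/$USER/docker" 1)
#   then mount "$wdir/log" and "$wdir/tmp" as /log and /tmp
#   in the worker's docker run command.
```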
Dask Multinode Example running tar
On the main server with public IP address 172.16.2.210:
export DRIVERLESS_AI_LICENSE_FILE=/home/$USER/.driverlessai/license.sig
export server=172.16.2.210
NCCL_SOCKET_IFNAME="enp5s0" \
NCCL_DEBUG=WARN \
NCCL_P2P_DISABLE=1 \
dai_dask_server_ip=$server dai_redis_ip=$server dai_redis_port=6379 \
dai_main_server_minio_address=$server:9001 dai_ip=$server dai_main_server_redis_password="<REDIS_PASSWORD>" \
dai_worker_mode='multinode' dai_enable_dask_cluster=1 \
dai_enable_jupyter_server=1 dai_enable_jupyter_server_browser=1 \
/opt/h2oai/dai/dai-env.sh python -m h2oai &> multinode_main.txt
On each worker node, run the same command with --worker added at the end, without the Jupyter options, and with NCCL_SOCKET_IFNAME set for that node, i.e.:
export DRIVERLESS_AI_LICENSE_FILE=/home/$USER/.driverlessai/license.sig
export server=172.16.2.210
NCCL_SOCKET_IFNAME="enp4s0" \
NCCL_DEBUG=WARN \
NCCL_P2P_DISABLE=1 \
dai_dask_server_ip=$server dai_redis_ip=$server dai_redis_port=6379 \
dai_main_server_minio_address=$server:9001 dai_ip=$server dai_main_server_redis_password="<REDIS_PASSWORD>" \
dai_worker_mode='multinode' dai_enable_dask_cluster=1 \
/opt/h2oai/dai/dai-env.sh python -m h2oai --worker &> multinode_worker.txt
Notes:
In this example, the address 172.16.2.210 needs to be the public IP associated with the network device to use for communication.
$USER in bash gives the username.
Replace <REDIS_PASSWORD> with the default Redis password or a new one.
Replace various ports with alternative values if required.
NCCL_SOCKET_IFNAME should be set to the actual hardware device name to use on each node.
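Because the main server and worker commands above share most of their settings, those settings can be exported once from a helper so every node uses identical values. This is only a sketch: the function name set_multinode_env is hypothetical and not part of Driverless AI, and it assumes that exporting the variables is equivalent to passing them as a prefix on the launch command.

```shell
# Export the settings shared by the main server and every worker.
# Values are the ones used in the example above.
set_multinode_env() {
  server="$1"
  export dai_dask_server_ip="$server"
  export dai_redis_ip="$server"
  export dai_redis_port=6379
  export dai_main_server_minio_address="$server:9001"
  export dai_ip="$server"
  export dai_worker_mode=multinode
  export dai_enable_dask_cluster=1
  export NCCL_DEBUG=WARN
  export NCCL_P2P_DISABLE=1
}

# Usage on a worker (password and interface name are still set per node):
#   set_multinode_env 172.16.2.210
#   export dai_main_server_redis_password="<REDIS_PASSWORD>"
#   export NCCL_SOCKET_IFNAME="enp4s0"
#   /opt/h2oai/dai/dai-env.sh python -m h2oai --worker &> multinode_worker.txt
```

The Redis password and NCCL_SOCKET_IFNAME are deliberately left out of the helper, since the first is a secret and the second differs per node.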