Configuring node affinity and toleration¶
This page describes how to configure node affinity and toleration for MLOps scorers (pods).
Note
For more information on node affinity and toleration, refer to the Assigning Pods to Nodes and Taints and Tolerations pages in the official Kubernetes documentation.
Understanding node affinity and toleration¶
As stated in the official Kubernetes documentation, "node affinity is a property of Pods that attracts them to a set of nodes, either as a preference or a hard requirement. Taints are the opposite—they allow a node to repel a set of pods. Tolerations are applied to pods, and allow (but do not require) the pods to schedule onto nodes with matching taints." In the case of MLOps, these options let you ensure that scorers (pods) are scheduled onto specific machines (nodes) in a cluster that have been set up for machine learning tasks.
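To illustrate the distinction, the following is a minimal sketch in plain Kubernetes YAML of how the two mechanisms pair up. The node name, pod name, image, label, and taint key are all illustrative, not taken from an actual MLOps installation:

```yaml
# A taint on a node repels pods that do not tolerate it.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1            # illustrative node name
  labels:
    gpu-type: tesla-v100      # label that affinity rules can match
spec:
  taints:
  - key: gpu-jobs-only        # pods without a matching toleration are repelled
    effect: NoSchedule
---
# A pod that is both attracted to the node (affinity) and
# allowed onto it despite the taint (toleration).
apiVersion: v1
kind: Pod
metadata:
  name: scorer-example        # illustrative pod name
spec:
  containers:
  - name: scorer
    image: example/scorer:latest   # illustrative image
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu-type
            operator: Exists
  tolerations:
  - key: gpu-jobs-only
    operator: Exists
    effect: NoSchedule
```

Affinity and toleration are complementary: affinity pulls the pod toward nodes it should run on, while the toleration lets it past the taint that keeps other pods away.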
Setup¶
To provide options for selecting node affinity and toleration when deploying a model, an admin must set up node affinity and toleration when installing MLOps.
Note
MLOps supports the full set of affinity and toleration fields defined by the Kubernetes API. For more information, refer to the official Kubernetes API Reference page.
Node affinity¶
The following is an example of how node affinity can be set up when installing MLOps.
kubernetes_node_affinity_shortcuts = [
  {
    name         = "required-gpu-preferred-v100"
    display_name = "GPU (Tesla V100)"
    description  = "Deploys on GPU-enabled nodes only, preferably one with Tesla V100 GPU."
    affinity = {
      required_during_scheduling_ignored_during_execution = {
        node_selector_terms = [
          {
            match_expressions = [
              {
                key      = "gpu-type"
                operator = "Exists"
              }
            ]
          }
        ]
      }
      preferred_during_scheduling_ignored_during_execution = [
        {
          weight = 1
          preference = {
            match_expressions = [
              {
                key      = "gpu-type"
                operator = "In"
                values   = ["tesla-v100"]
              }
            ]
          }
        }
      ]
    }
  }
]
In the preceding example, the first block contains the name, display_name, and description fields that MLOps uses to identify and describe the shortcut. The second block (required_during_scheduling_ignored_during_execution) specifies the required node affinity matches: the node must have a label named gpu-type for the deployed model to be scheduled on it. The third block (preferred_during_scheduling_ignored_during_execution) contains the preferred node affinity matches: any node whose gpu-type label is set to tesla-v100 is preferred, but not required.
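For reference, the shortcut above appears to translate the snake_case fields into the camelCase nodeAffinity stanza of a pod spec. Assuming a direct field-for-field mapping (an assumption on our part, not stated in the MLOps documentation), the resulting fragment would look like this:

```yaml
# Pod-spec fragment the affinity shortcut is assumed to map to.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: gpu-type            # node must carry this label
          operator: Exists
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 1
      preference:
        matchExpressions:
        - key: gpu-type
          operator: In
          values: ["tesla-v100"]   # preferred, not required
```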
Toleration¶
The following is an example of how toleration can be set up when installing MLOps:
kubernetes_toleration_shortcuts = [
  {
    name         = "gpu-jobs-only"
    display_name = "Specialized GPU nodes OK"
    description  = "Tolerates nodes that are meant only for jobs requiring GPUs."
    tolerations = [
      {
        effect   = "NoSchedule"
        key      = "gpu-jobs-only"
        operator = "Exists"
      }
    ]
  },
  {
    name         = "disk-pressure-tolerant"
    display_name = "Disk-pressure tolerant"
    description  = "Tolerates nodes under disk pressure. Useful for short-term models of negligible size."
    tolerations = [
      {
        effect   = "NoSchedule"
        key      = "node.kubernetes.io/disk-pressure"
        operator = "Exists"
      }
    ]
  }
]
In the preceding example, the first toleration (gpu-jobs-only) allows the model to be deployed on any node that carries a taint named gpu-jobs-only. Nodes with this taint normally repel new pods, but applying this toleration allows the model to be scheduled on them.
The second toleration (disk-pressure-tolerant) allows the model to be deployed on a node that is under disk pressure. By default, Kubernetes applies the node.kubernetes.io/disk-pressure taint to any node that is running low on disk space, which prevents new pods from being scheduled on that node. Applying this toleration, however, allows the model to be scheduled on nodes with this taint.
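Assuming the same direct field-for-field mapping as for affinity shortcuts (again an assumption, not confirmed by the MLOps documentation), a deployment that selected both toleration shortcuts would end up with a pod-spec fragment along these lines:

```yaml
# Pod-spec fragment the two toleration shortcuts are assumed to map to.
tolerations:
- key: gpu-jobs-only                       # tolerate the custom GPU-only taint
  operator: Exists
  effect: NoSchedule
- key: node.kubernetes.io/disk-pressure    # tolerate the built-in disk-pressure taint
  operator: Exists
  effect: NoSchedule
```

With operator set to Exists, the toleration matches the taint regardless of its value, so no value field is needed.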