By default, the version of the NVIDIA driver installed in a Container Service for Kubernetes (ACK) cluster varies based on the type and version of the cluster. If your Compute Unified Device Architecture (CUDA) toolkit requires compatibility with a newer version of the NVIDIA driver, you can customize the installation of the NVIDIA driver on GPU nodes. This topic describes how to specify an NVIDIA driver version for GPU-accelerated nodes in a node pool by adding a label.
Precautions
ACK does not guarantee the compatibility between the NVIDIA driver version and the CUDA toolkit version. You need to verify the compatibility between them.
For more information about the NVIDIA driver versions required by different NVIDIA models, see NVIDIA official documentation.
If you use a custom OS image that comes preinstalled with the NVIDIA driver and GPU components such as the NVIDIA Container Runtime, ACK does not guarantee that the NVIDIA driver is compatible with other GPU components, such as the monitoring components.
If you add a label to a node pool to specify an NVIDIA driver version for GPU-accelerated nodes, the driver installation process is triggered only when a node is added. Therefore, the specified driver version applies only to newly added or scaled-out nodes, and existing nodes are not affected. To apply the specified driver version to existing nodes, you must remove the nodes from the node pool and then add them back.
The gn7 and ebmgn7 instance types are incompatible with NVIDIA driver versions 510.xxx and 515.xxx. For these instance types, we recommend that you use a driver version that is earlier than 510.xxx and has the GPU System Processor (GSP) disabled, such as a 470.xxx.xxxx version, or use 525.125.06 or later.
The Elastic Compute Service (ECS) instance types ebmgn7 and ebmgn7e support only NVIDIA driver versions later than 460.32.03.
When you create a node pool, if the driver version that you specify is not in the list of NVIDIA driver versions supported by ACK, ACK automatically installs the default driver version. If you specify a driver version that is incompatible with the latest operating system, nodes may fail to be added. In this case, select the latest supported driver version.
Step 1: Determine the NVIDIA driver version
Select an NVIDIA driver version that is compatible with your applications from the How to choose an NVIDIA driver version list. In this example, the version of the NVIDIA driver is 550.144.03.
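If you are not sure which CUDA toolkit version your workload is built with, you can inspect the application image before you select a driver version. The following commands are a minimal sketch that assumes the image bundles the CUDA toolkit; my-cuda-app:latest is a placeholder image name that you need to replace with your own image.
# Print the CUDA toolkit version that the application image was built with.
# my-cuda-app:latest is a placeholder; replace it with your own image.
docker run --rm my-cuda-app:latest nvcc --version
# Many CUDA base images also record the version in a file.
docker run --rm my-cuda-app:latest cat /usr/local/cuda/version.json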
Step 2: Create a node pool and specify a driver version
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Nodes > Node Pools.
Click Create Node Pool in the upper-left corner. For more information about the parameters, see Create and manage a node pool. The following describes how to configure the node label in this example.
In the Node Label section, click the add icon. Set Key to ack.aliyun.com/nvidia-driver-version and set Value to 550.144.03.
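After the node pool is created and nodes are added to it, you can confirm that the label is applied to the new GPU-accelerated nodes. The following command is a minimal sketch; the -L flag only adds the label value as an extra output column.
# List nodes together with the value of the driver version label.
kubectl get nodes -L ack.aliyun.com/nvidia-driver-version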
Step 3: Check whether the specified NVIDIA driver version is installed
Run the following command to query pods that have the component: nvidia-device-plugin label:
kubectl get po -n kube-system -l component=nvidia-device-plugin -o wide
Expected output:
NAME                             READY   STATUS    RESTARTS   AGE     IP              NODE                       NOMINATED NODE   READINESS GATES
ack-nvidia-device-plugin-fnctc   1/1     Running   0          2m33s   10.117.227.43   cn-qingdao.10.117.XXX.XX   <none>           <none>
The output shows that the name of the pod running on the newly added node is ack-nvidia-device-plugin-fnctc.
Run the following command to query the NVIDIA driver version of the node:
kubectl exec -ti ack-nvidia-device-plugin-fnctc -n kube-system -- nvidia-smi
Expected output:
Mon Mar 24 08:51:55 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03              Driver Version: 550.144.03      CUDA Version: 12.6   |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla P4                       On  |   00000000:00:07.0 Off |                    0 |
| N/A   33C    P8              7W /  75W  |       0MiB /  7680MiB  |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
The output shows that the NVIDIA driver version is 550.144.03. This indicates that the NVIDIA driver is successfully installed with the specified version.
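To further verify that workloads can use the driver, you can schedule a short-lived test pod that requests a GPU and runs nvidia-smi. The following is a minimal sketch; it assumes that the nvidia-device-plugin exposes the nvidia.com/gpu resource, and the image nvidia/cuda:12.4.1-base-ubuntu22.04 is only an example that you can replace with an image that matches your CUDA toolkit version.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-driver-test        # hypothetical pod name used only for this test
  namespace: default
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    # Example CUDA base image; replace it with an image that matches your CUDA toolkit version.
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
# View the nvidia-smi output and delete the test pod.
kubectl logs gpu-driver-test
kubectl delete pod gpu-driver-test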
Other methods
When you use CreateClusterNodePool to create a node pool, you can add a label to the node pool configuration to specify an NVIDIA driver version. The following sample code provides an example:
{
// Other fields are not shown.
......
"tags": [
{
"key": "ack.aliyun.com/nvidia-driver-version",
"value": "550.144.03"
}
],
// Other fields are not shown.
......
}
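If you want to send the request from a terminal, the following is a minimal sketch that assumes the Alibaba Cloud CLI (aliyun) is installed and configured. <cluster_id> is a placeholder for your cluster ID, and nodepool.json is a placeholder file that contains a complete request body like the example above.
# Call the CreateClusterNodePool operation through the CLI.
# <cluster_id> and nodepool.json are placeholders.
aliyun cs POST /clusters/<cluster_id>/nodepools \
  --header "Content-Type=application/json" \
  --body "$(cat nodepool.json)"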