By default, the version of the NVIDIA driver installed in a Container Service for Kubernetes (ACK) cluster varies based on the type and version of the cluster. If your Compute Unified Device Architecture (CUDA) toolkit requires compatibility with a newer version of the NVIDIA driver, you can customize the installation of the NVIDIA driver on GPU nodes. This topic describes how to specify an NVIDIA driver version for GPU-accelerated nodes in a node pool by adding a label.
Precautions
ACK does not guarantee the compatibility between the NVIDIA driver version and the CUDA toolkit version. You need to verify the compatibility between them.
For more information about the NVIDIA driver versions required by different NVIDIA models, see NVIDIA official documentation.
If you use a custom OS image that comes preinstalled with the NVIDIA driver and GPU components such as the NVIDIA Container Runtime, ACK does not guarantee that the NVIDIA driver is compatible with other GPU components, such as the monitoring components.
If you add a label to a node pool to specify an NVIDIA driver version for GPU-accelerated nodes, the driver installation process is triggered only when a node is added. Therefore, the specified driver version applies only to newly added or scaled-out nodes, and existing nodes are not affected. To apply the specified driver version to existing nodes, you must remove the nodes from the node pool and then add them back.
The gn7 and ebmgn7 instance types are incompatible with NVIDIA driver versions 510.xxx and 515.xxx. For these instance types, we recommend that you use a driver version that is earlier than 510.xxx and has the GPU System Processor (GSP) disabled, such as a 470.xxx.xxxx version, or use 525.125.06 or later.
The Elastic Compute Service (ECS) instance types ebmgn7 and ebmgn7e support only NVIDIA driver versions later than 460.32.03.
When you create a node pool, if the driver version that you specify is not in the list of NVIDIA driver versions supported by ACK, ACK automatically installs the default driver version. If you specify a driver version that is incompatible with the latest operating system, nodes may fail to be added. In this case, select the latest supported driver version.
Step 1: Determine the NVIDIA driver version
Select an NVIDIA driver version that is compatible with your applications from the How to choose an NVIDIA driver version list. In this example, the version of the NVIDIA driver is 550.144.03.
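If you are not sure which CUDA toolkit version your workload is built with, you can inspect the application image before you select a driver version. The following commands are a minimal sketch that assumes the image bundles the CUDA toolkit; my-cuda-app:latest is a placeholder image name that you need to replace with your own image.
# Print the CUDA toolkit version that the application image was built with.
# my-cuda-app:latest is a placeholder; replace it with your own image.
docker run --rm my-cuda-app:latest nvcc --version
# Many CUDA base images also record the version in a file.
docker run --rm my-cuda-app:latest cat /usr/local/cuda/version.json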
Step 2: Create a node pool and specify a driver version
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Nodes > Node Pools.
Click Create Node Pool in the upper-left corner. For more information about the parameters, see Create and manage a node pool. The following describes how to configure the node label in this example.
In the Node Label section, click the add icon. Set Key to ack.aliyun.com/nvidia-driver-version and set Value to 550.144.03.
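After the node pool is created and nodes are added to it, you can confirm that the label is applied to the new GPU-accelerated nodes. The following command is a minimal sketch; the -L flag only adds the label value as an extra output column.
# List nodes together with the value of the driver version label.
kubectl get nodes -L ack.aliyun.com/nvidia-driver-version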
Step 3: Check whether the specified NVIDIA driver version is installed
Run the following command to query pods that have the component: nvidia-device-plugin label:
kubectl get po -n kube-system -l component=nvidia-device-plugin -o wide
Expected output:
NAME                             READY   STATUS    RESTARTS   AGE     IP              NODE                       NOMINATED NODE   READINESS GATES
ack-nvidia-device-plugin-fnctc   1/1     Running   0          2m33s   10.117.227.43   cn-qingdao.10.117.XXX.XX   <none>           <none>
The output shows that the name of the pod running on the newly added node is ack-nvidia-device-plugin-fnctc.
Run the following command to query the NVIDIA driver version of the node:
kubectl exec -ti ack-nvidia-device-plugin-fnctc -n kube-system -- nvidia-smi
Expected output:
Mon Mar 24 08:51:55 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03              Driver Version: 550.144.03      CUDA Version: 12.6   |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla P4                       On  |   00000000:00:07.0 Off |                    0 |
| N/A   33C    P8              7W /  75W  |       0MiB /  7680MiB  |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
The output shows that the NVIDIA driver version is 550.144.03. This indicates that the NVIDIA driver is successfully installed with the specified version.
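To further verify that workloads can use the driver, you can schedule a short-lived test pod that requests a GPU and runs nvidia-smi. The following is a minimal sketch; it assumes that the nvidia-device-plugin exposes the nvidia.com/gpu resource, and the image nvidia/cuda:12.4.1-base-ubuntu22.04 is only an example that you can replace with an image that matches your CUDA toolkit version.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-driver-test        # hypothetical pod name used only for this test
  namespace: default
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    # Example CUDA base image; replace it with an image that matches your CUDA toolkit version.
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
# View the nvidia-smi output and delete the test pod.
kubectl logs gpu-driver-test
kubectl delete pod gpu-driver-test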
Other methods
When you use CreateClusterNodePool to create a node pool, you can add a label to the node pool configuration to specify an NVIDIA driver version. The following sample code provides an example:
{
// Other fields are not shown.
......
"tags": [
{
"key": "ack.aliyun.com/nvidia-driver-version",
"value": "550.144.03"
}
],
// Other fields are not shown.
......
}
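If you want to send the request from a terminal, the following is a minimal sketch that assumes the Alibaba Cloud CLI (aliyun) is installed and configured. <cluster_id> is a placeholder for your cluster ID, and nodepool.json is a placeholder file that contains a complete request body like the example above.
# Call the CreateClusterNodePool operation through the CLI.
# <cluster_id> and nodepool.json are placeholders.
aliyun cs POST /clusters/<cluster_id>/nodepools \
  --header "Content-Type=application/json" \
  --body "$(cat nodepool.json)"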