Upgrade an OS image version or change a node OS type - Container Service for Kubernetes

Alibaba Cloud Container Service for Kubernetes (ACK) regularly releases new OS image versions that provide new features, performance optimizations, and bug fixes. You should upgrade the OS images of your node pools in a timely manner. You can also switch OS types as needed. For example, you can replace an operating system that has reached its end of life (EOL) with a supported one.

For more information about the OS types and latest OS image versions supported by ACK and the limits on specific operating systems, see OS image release notes.

Notes

This operation updates the operating system by replacing the system disks of nodes in batches. Do not store important data on the system disk. If you have done so, back up the data before you start. Data disks are not affected by the upgrade. We recommend that you perform this operation during off-peak hours.
When ACK upgrades a node by replacing its system disk, ACK first drains the node. Draining evicts the pods on the node so that they can be rescheduled to other available nodes, and the evictions respect any Pod Disruption Budgets (PDBs) that you have configured. To ensure high availability, deploy your workloads with multiple replicas spread across different nodes. Also configure a PDB for critical services to limit the number of pods that can be disrupted at the same time.
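A minimal sketch of a PDB for a critical service follows. It builds a `policy/v1` PodDisruptionBudget as a Python dict and prints it as JSON, which `kubectl apply -f -` accepts. The name `web-pdb`, the namespace `default`, the label `app: web`, and `maxUnavailable: 1` are hypothetical placeholders. Replace them with values that match your workload.

```python
import json


def build_pdb(name: str, namespace: str, app_label: str, max_unavailable: int) -> dict:
    """Build a policy/v1 PodDisruptionBudget manifest as a plain dict.

    The PDB limits how many pods labeled app=<app_label> can be voluntarily
    evicted at the same time, for example while a node is being drained.
    """
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "maxUnavailable": max_unavailable,
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


if __name__ == "__main__":
    # Hypothetical workload: pods labeled app=web in the default namespace.
    manifest = build_pdb("web-pdb", "default", "web", max_unavailable=1)
    # Pipe this output to `kubectl apply -f -` to create the PDB.
    print(json.dumps(manifest, indent=2))
```

With `maxUnavailable: 1`, a drain evicts at most one matching pod at a time. The next pod is evicted only after a replacement becomes healthy elsewhere.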
The default timeout period for draining a node is 30 minutes. If pod migration is not complete within the timeout period, ACK stops the upgrade to ensure service stability.
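A PDB that is too strict can stall a drain until the 30-minute timeout is reached. The following sketch approximates how Kubernetes computes a PDB's `disruptionsAllowed` value when `minAvailable` or `maxUnavailable` is an integer. Percentage values and unhealthy-pod eviction policies are deliberately left out. When the result is 0, eviction requests for the matching pods are rejected, so the drain typically cannot make progress on those pods.

```python
from typing import Optional


def disruptions_allowed(healthy: int, expected: int,
                        min_available: Optional[int] = None,
                        max_unavailable: Optional[int] = None) -> int:
    """Approximate PDB status.disruptionsAllowed for integer PDB settings.

    desired_healthy is minAvailable, or expected - maxUnavailable;
    the number of allowed disruptions is healthy - desired_healthy, floored at 0.
    Percentage values are not handled in this sketch.
    """
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = min_available
    else:
        desired_healthy = expected - max_unavailable
    return max(healthy - desired_healthy, 0)
```

For example, a Deployment with 3 healthy replicas and `minAvailable: 3` allows 0 disruptions, so the node cannot be drained. The same Deployment with `maxUnavailable: 1` allows 1. On a live cluster, the ALLOWED DISRUPTIONS column of `kubectl get pdb -A` shows the value that Kubernetes actually computed.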
When ACK upgrades a node by replacing its system disk, ACK re-initializes the node based on the current node pool configuration. This includes settings such as the logon method, labels, taints, OS image, and runtime version. To update the node pool configuration, you must edit the node pool. If you modify a node using other methods, the changes are overwritten during the upgrade.
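Node changes made outside the node pool configuration are overwritten during the upgrade, so it can help to list them beforehand. This sketch compares a node object, for example one saved with `kubectl get node <node-name> -o json`, against the labels and taints you expect from the node pool configuration. `SYSTEM_LABEL_PREFIXES` is a hypothetical assumption about which labels Kubernetes or ACK manage themselves. Review it for your cluster before you rely on the output.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical: label prefixes assumed to be managed by Kubernetes or ACK.
SYSTEM_LABEL_PREFIXES = (
    "kubernetes.io/",
    "beta.kubernetes.io/",
    "node.kubernetes.io/",
    "topology.kubernetes.io/",
    "alibabacloud.com/",
)


def unmanaged_labels(node: dict, pool_labels: Dict[str, str],
                     system_prefixes: Iterable[str] = SYSTEM_LABEL_PREFIXES) -> Dict[str, str]:
    """Return node labels that are neither in the node pool config nor system-managed."""
    labels = node.get("metadata", {}).get("labels") or {}
    prefixes = tuple(system_prefixes)
    return {
        key: value
        for key, value in labels.items()
        if pool_labels.get(key) != value and not key.startswith(prefixes)
    }


def unmanaged_taints(node: dict,
                     pool_taints: List[Tuple[str, str, str]]) -> List[Tuple[str, str, str]]:
    """Return (key, value, effect) taints on the node that the node pool config does not define."""
    expected = set(pool_taints)
    found = []
    for taint in node.get("spec", {}).get("taints") or []:
        item = (taint["key"], taint.get("value", ""), taint["effect"])
        if item not in expected:
            found.append(item)
    return found
```

Anything these functions report should be added to the node pool configuration if you want it to survive the upgrade.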
If a pod on a node references a HostPath that points to the system disk, the data in the HostPath directory is lost after the system disk is replaced.
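To find pods that may lose HostPath data, you can scan the output of `kubectl get pods -A -o json`. The following sketch flags every `hostPath` volume that is not under one of the data-disk mount points you pass in. Which directories live on a data disk depends on your node configuration, and the `/mnt/data` path used below is a hypothetical example. Any path outside the listed mounts is treated as residing on the system disk.

```python
from typing import Iterable, List, Tuple


def hostpath_on_system_disk(pod_list: dict,
                            data_disk_mounts: Iterable[str]) -> List[Tuple[str, str, str]]:
    """List (namespace, pod, path) for hostPath volumes outside the given data-disk mounts."""
    # Normalize mounts to end with "/" so that /mnt/data does not match /mnt/database.
    mounts = [m.rstrip("/") + "/" for m in data_disk_mounts]
    hits = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        for volume in pod.get("spec", {}).get("volumes") or []:
            host_path = volume.get("hostPath")
            if not host_path:
                continue
            path = host_path["path"].rstrip("/") + "/"
            if not any(path.startswith(m) for m in mounts):
                hits.append((meta.get("namespace", ""), meta.get("name", ""), host_path["path"]))
    return hits
```

For example, `hostpath_on_system_disk(json.load(open("pods.json")), ["/mnt/data"])` lists the pods whose HostPath data should be backed up or moved before the system disk is replaced.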
If your nodes use other custom configurations, such as swap partitions, kubelet configurations modified from the command line, or custom container runtime configurations, the update may fail or these custom configurations may be overwritten.
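Swap is one of the custom configurations that can be checked directly on a node before the update. The following sketch parses `/proc/swaps`, where the first line is a column header and each following line is one active swap area. It covers only swap. Custom kubelet and runtime settings need to be reviewed separately.

```python
from pathlib import Path
from typing import List


def active_swap_devices(proc_swaps_text: str) -> List[str]:
    """Return the swap devices or files listed in the contents of /proc/swaps."""
    lines = proc_swaps_text.strip().splitlines()
    # Skip the header line; the first column of each entry is the device or file name.
    return [line.split()[0] for line in lines[1:] if line.strip()]


def node_has_swap(path: str = "/proc/swaps") -> bool:
    """Return True if the host that runs this code has any swap area enabled."""
    swaps = Path(path)
    if not swaps.exists():
        return False
    return bool(active_swap_devices(swaps.read_text()))
```

Run `node_has_swap()` on each node. Any node that returns `True` has swap enabled and should be handled before you replace its system disk.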
By default, some ACK OS images use cgroup v2. For more information about cgroup v2, see cgroup versions.
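To confirm which cgroup version a node runs before and after the change, check the cgroup mount point. On a cgroup v2 (unified) hierarchy, the root directory contains a `cgroup.controllers` file. On cgroup v1, it instead holds one directory per controller. The following is a minimal sketch that assumes the usual `/sys/fs/cgroup` mount point.

```python
import os


def uses_cgroup_v2(cgroup_root: str = "/sys/fs/cgroup") -> bool:
    """Return True if cgroup_root is a cgroup v2 (unified) hierarchy.

    cgroup v2 exposes cgroup.controllers at the root of the hierarchy;
    cgroup v1 and hybrid layouts do not have this file at the root.
    """
    return os.path.isfile(os.path.join(cgroup_root, "cgroup.controllers"))
```

From a shell, `stat -fc %T /sys/fs/cgroup/` gives the same answer and prints `cgroup2fs` on cgroup v2. This check matters because some older runtimes and monitoring agents read cgroup v1 files directly.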
If you have worker nodes that are not managed by a node pool, also known as standalone nodes, you must first migrate them to a node pool. For more information, see Migrate standalone nodes to a node pool.
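To find standalone nodes, you can filter the output of `kubectl get nodes -o json` for nodes that lack a node pool label. The sketch below assumes that ACK node pool members carry the `alibabacloud.com/nodepool-id` label. Treat this as an assumption and verify it on your own cluster.

```python
from typing import List

# Assumption: nodes managed by an ACK node pool carry this label.
NODEPOOL_LABEL = "alibabacloud.com/nodepool-id"


def standalone_nodes(node_list: dict, pool_label: str = NODEPOOL_LABEL) -> List[str]:
    """Return the names of nodes that do not have the node pool label."""
    names = []
    for node in node_list.get("items", []):
        meta = node.get("metadata", {})
        if pool_label not in (meta.get("labels") or {}):
            names.append(meta.get("name", ""))
    return names
```

The equivalent one-liner is `kubectl get nodes -l '!alibabacloud.com/nodepool-id'`, which uses a set-based selector to match nodes without that label key.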
Starting with ContainerOS 3.4.0, the system disk is read-only, so at least one data disk must be attached for the system to start. If you upgrade from ContainerOS 3.3 to 3.4 or later, follow the procedure below. Upgrades between other versions are not affected.
Select an upgrade solution based on the data disk attachment status of your node pool:
- A single data disk is attached: The system can start normally. You can complete the upgrade by following the steps in the Procedure section below.
- Multiple data disks are attached: You must create a new node pool and migrate the nodes. Create a node pool that uses ContainerOS 3.4 or a later version, attach one data disk, and then scale out the required number of nodes. Gradually migrate applications to the new node pool, either by setting the nodes in the old node pool to unschedulable or by updating your workloads to use labels that schedule them to the new node pool. After the migration is complete, delete the old node pool.
- No data disk is attached:
  - Keep the current node pool: Update the node pool configuration to attach one data disk and scale out new nodes. After the new nodes are running as expected, drain and remove the old nodes.
  - Create a new node pool and migrate the nodes: Follow the same procedure as when multiple data disks are attached.
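The decision in the list above can be sketched as a small helper. Version strings are assumed to be plain dotted numbers such as `3.3` or `3.4.0`. Image names shown in the console may carry extra suffixes that you would need to strip first. The returned strings only paraphrase the options in the list.

```python
from typing import Tuple

# The first ContainerOS version with a read-only system disk, per this topic.
READ_ONLY_SYSTEM_DISK_SINCE = (3, 4, 0)


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '3.4' or '3.4.0' into a comparable tuple padded to three parts."""
    parts = tuple(int(p) for p in version.split("."))
    return parts + (0,) * (3 - len(parts))


def crosses_read_only_boundary(current: str, target: str) -> bool:
    """True if a ContainerOS upgrade goes from below 3.4.0 to 3.4.0 or later."""
    return parse_version(current) < READ_ONLY_SYSTEM_DISK_SINCE <= parse_version(target)


def upgrade_solution(data_disks: int) -> str:
    """Map the number of data disks attached in the node pool to a solution."""
    if data_disks == 1:
        return "upgrade in place: follow the Procedure section"
    if data_disks > 1:
        return "create a new ContainerOS 3.4+ node pool with one data disk and migrate"
    return ("attach one data disk to the current node pool and replace the nodes, "
            "or create a new node pool and migrate")
```

For example, `crosses_read_only_boundary("3.3", "3.4")` is `True`, so you would then call `upgrade_solution()` with the data disk count of the node pool.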
For more information about how to create and edit a node pool, see Create and manage node pools. For more information about how to set a node to unschedulable, see Drain a node and manage its scheduling status. For more information about how to remove a node, see Remove a node.
If you customize the GPU driver version of a node pool by specifying a version number or an OSS URL, the upgraded OS may be incompatible with that driver. In this case, select the latest driver from the List of NVIDIA driver versions supported by ACK.

Procedure

Follow these steps to update the OS image to the latest version or change the OS type. To avoid compatibility risks, run a precheck scan first.

Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, find the cluster to manage and click its name. In the left navigation pane, choose Nodes > Node Pools.
On the Node Pools page, find the target node pool. In the Actions column, click the icon and select Change Operating System.
Click Precheck to scan for potential risks and view the results.
- Normal: The precheck is successful. You can proceed to the next step.
- Abnormal: The precheck failed. This does not affect the running state of your cluster. Resolve the issues based on the recommended solutions.

After the precheck is successful, configure the following parameters and click Start Replacement.

- Target Version: Select the target OS image and version.
- Current Version: The current OS version.
- Nodes To Update: Specify the nodes whose operating systems you want to replace. You can select all nodes or specific nodes.
- Ignore Warnings: Specifies whether to proceed with the upgrade if the precheck reports node pool-level warnings. An example of a warning is that a pod in the node pool uses a HostPath that points to the system disk.
- Batch Replacement Policy:
  - Maximum Concurrent Nodes Per Batch: The system updates nodes in batches based on the maximum number of concurrent nodes that you specify.
  - Automatic Pause Policy: The pause policy for the OS replacement process.
  - Interval Between Batches: If you do not use the automatic pause policy, you can specify an interval between update batches. The interval can be 5 to 120 minutes.
  - Automatic Snapshot: This upgrade method replaces the system disk. If the system disk contains important business data, create a snapshot for the node before you update the OS. This lets you back up and restore data. You are charged for using snapshots. If a snapshot is no longer needed after the upgrade, delete the snapshot promptly.
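The following sketch gives a rough model of Maximum Concurrent Nodes Per Batch and Interval Between Batches. It splits a node list into batches and computes the total waiting time between them. This is a hypothetical illustration, not how ACK schedules the replacement. The actual batch order, the automatic pause behavior, and the time each batch takes (including drains of up to 30 minutes) are not modeled.

```python
from typing import List, Sequence


def plan_batches(nodes: Sequence[str], max_per_batch: int,
                 interval_minutes: int = 5) -> List[List[str]]:
    """Split nodes into ordered batches of at most max_per_batch nodes.

    interval_minutes models Interval Between Batches and is checked
    against the documented range of 5 to 120 minutes.
    """
    if max_per_batch < 1:
        raise ValueError("max_per_batch must be at least 1")
    if not 5 <= interval_minutes <= 120:
        raise ValueError("interval_minutes must be between 5 and 120")
    return [list(nodes[i:i + max_per_batch]) for i in range(0, len(nodes), max_per_batch)]


def minimum_wait_minutes(batch_count: int, interval_minutes: int) -> int:
    """Total idle time between batches; the upgrade work itself is not included."""
    return max(batch_count - 1, 0) * interval_minutes
```

For example, 5 nodes with a maximum of 2 per batch give 3 batches, which adds 20 minutes of waiting at a 10-minute interval. Smaller batches keep more capacity available during the upgrade but make the upgrade take longer overall.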

Important

To avoid incompatibility risks when you change the OS, review the OS image release notes.

References

For more information about how to upgrade the kubelet and container runtime versions of a node pool, see Upgrade a node pool.
For more information about the process and logic of upgrades by replacing system disks, see Reference: In-place upgrades and upgrades by replacing system disks.