You can manually scale a node pool by adjusting the expected number of nodes. This helps you maintain the desired number of nodes and improves operational efficiency. Scaling out a node pool ensures that you have sufficient nodes to support your business, while scaling in a node pool reduces resource costs.
ACK also supports automatic scaling. You can choose between two elastic solutions: node autoscaling or instant elasticity. These solutions automatically scale node resources to increase scheduling capacity. For more information, see Node scaling.
Introduction to node pool scaling
The expected number of nodes represents the desired state of a node pool. After you specify this value, the node pool automatically triggers a scale-out or scale-in operation based on the current number of nodes to match the expected number. This process occurs without manual intervention.
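The comparison between the expected and current number of nodes can be pictured with a small sketch. It is illustrative only, not ACK's implementation; the function name `plan_scaling` and the action labels are hypothetical.

```python
def plan_scaling(current_nodes: int, expected_nodes: int) -> tuple[str, int]:
    """Return the action and the number of nodes to add or remove.

    Hypothetical helper: it only mirrors the rule that the node pool
    scales toward the expected number of nodes.
    """
    if current_nodes < 0 or expected_nodes < 0:
        raise ValueError("node counts must be non-negative")
    delta = expected_nodes - current_nodes
    if delta > 0:
        return ("scale-out", delta)   # fewer nodes than expected: add nodes
    if delta < 0:
        return ("scale-in", -delta)   # more nodes than expected: remove nodes
    return ("none", 0)                # already at the desired state
```

For example, a pool with 3 nodes and an expected number of 5 yields a scale-out of 2 nodes.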
Scale out a node pool
When you increase the expected number of nodes to a value greater than the current number, the system triggers a scale-out. If the system fails to create a node, it automatically retries until the number of nodes in the node pool reaches the expected number. The configuration of the new nodes, including the specific instance types and zones, is determined by the node pool's configuration and scaling policy. For more information about scaling policies, see Scaling policies.
During a node pool scale-out, you are billed for the instance types that are created and used. For example, a node pool is configured with two instance types. The Billing Method is set to Pay-As-You-Go, and the Scaling Policy is set to Priority. During a scale-out, two nodes of instance type A are added in the zone of the first-priority vSwitch. If the resources for instance type A are insufficient, three nodes of instance type B are added in the zone of the second-priority vSwitch. The fee for one hour is calculated using the formula: Unit price of instance type × Number of nodes × Billing duration. The total fee is (Unit price of node A × 2 × 1) + (Unit price of node B × 3 × 1).
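The billing formula above can be worked through in a few lines. This is a minimal sketch: the hourly unit prices are made-up placeholders, not real ECS prices, and `scale_out_fee` is a hypothetical helper.

```python
from decimal import Decimal


def scale_out_fee(items, hours):
    """Sum unit price x number of nodes x billing duration over instance types.

    items: iterable of (unit_price_per_hour, node_count) pairs.
    """
    return sum((price * count * hours for price, count in items), Decimal("0"))


# Placeholder prices, chosen only to make the arithmetic visible.
price_a = Decimal("0.50")  # assumed hourly unit price of instance type A
price_b = Decimal("0.80")  # assumed hourly unit price of instance type B

# Two nodes of type A and three nodes of type B, billed for one hour.
fee = scale_out_fee([(price_a, 2), (price_b, 3)], hours=1)
print(fee)
```

With these placeholder prices, the fee is (0.50 × 2 × 1) + (0.80 × 3 × 1).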
A node pool scale-out includes two steps.
Create ECS instances: ACK node pools use Auto Scaling (ESS) to create nodes. After you adjust the expected number of nodes, ACK modifies the expected number of instances for the ESS scaling group. This triggers a scale-out based on the node pool configuration. The node pool status changes to Scaling Out. After ESS successfully creates the ECS instances, the node pool status changes to Active. For more information about the expected number of instances, see Expected number of instances.
Important: GPU-accelerated ECS Bare Metal instances of the ebmgn7 and ebmgn7e families do not support automatic multi-instance GPU (MIG) cleanup. When ACK adds nodes of these types, it resets their existing MIG settings. The time required for the reset varies. If the reset takes too long, the automatic node addition might fail.
To troubleshoot the failure, see What do I do if I fail to add ECS Bare Metal instances?.
For more information about ebmgn7e, see GPU-accelerated compute-optimized instances (gn, ebm, and scc series).
Add the ECS instances to the cluster: After ESS creates the ECS instances, the instances automatically run a cloud-init script maintained by ACK. This script initializes the nodes and adds them to the node pool. The execution logs are saved to the /var/log/messages file on the node. You can log on to the node and run grep cloud-init /var/log/messages to view the execution logs.
Note: If a node is successfully added to the node pool, the log information in /var/log/messages is automatically cleared. The log is available for reference only if a node fails to be added to the cluster.
If a node fails to be added to the cluster, key information from the /var/log/messages log is captured in the task result. You can view the reason on the Cluster Tasks tab of the cluster.
Scale in a node pool
If you set the expected number of nodes to a value less than the current number, the system triggers a scale-in and removes nodes.
When scaling in nodes, the removal process is as follows:
If the node pool's scaling policy is set to Priority, the system removes the most recently created instances.
If the node pool's scaling policy is set to Balanced Distribution, the system filters the zones of the ECS instances based on the policy. Then, it removes the most recently created instances to keep the number of ECS instances in each zone of the scaling group approximately equal.
If the node pool's scaling policy is set to Cost Optimization, the system attempts to remove ECS instances in descending order of their vCPU unit price.
When you scale in a node pool by adjusting the expected number of nodes, the nodes are removed even if the draining process fails. If you require nodes to be drained before removal, you must remove the nodes by specifying them individually. For more information, see Remove a node.
When you scale in a node pool, subscription ECS instances are not released. To release subscription instances, log on to the ECS console, convert the subscription instances to pay-as-you-go instances, and then release them. To convert a subscription instance to a pay-as-you-go instance, see Change the billing method from subscription to pay-as-you-go.
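The policy-based removal order described above can be sketched as follows. This is a simplified illustration, not the actual ESS selection logic: the `Instance` fields, the policy strings, and the rule used for Balanced Distribution (always removing from the zone that currently holds the most instances) are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Instance:
    instance_id: str
    zone: str
    created_at: datetime
    vcpu_unit_price: float  # assumed per-vCPU price used for Cost Optimization


def pick_for_removal(instances, count, policy):
    """Return the instances a scale-in would remove first, in order."""
    remaining = list(instances)
    removed = []
    for _ in range(min(count, len(remaining))):
        if policy == "priority":
            # Most recently created instance goes first.
            victim = max(remaining, key=lambda i: i.created_at)
        elif policy == "cost_optimization":
            # Highest vCPU unit price goes first (descending order).
            victim = max(remaining, key=lambda i: i.vcpu_unit_price)
        elif policy == "balanced":
            # Assumed rule: shrink the fullest zone, newest instance first,
            # so that zone counts stay approximately equal.
            per_zone = Counter(i.zone for i in remaining)
            fullest = max(per_zone, key=per_zone.get)
            victim = max((i for i in remaining if i.zone == fullest),
                         key=lambda i: i.created_at)
        else:
            raise ValueError(f"unknown policy: {policy}")
        remaining.remove(victim)
        removed.append(victim)
    return removed
```

The sketch ignores draining and subscription instances, which are handled as described in the preceding paragraphs.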
Procedure
Log on to the ACK console. In the navigation pane on the left, click Clusters.
On the Clusters page, find the cluster to manage and click its name. In the left-side navigation pane, choose .
In the node pool list, find the target node pool and click Scale in the Actions column. Set Scaling Mode to Manual.
(Optional) If you have not granted permissions to Operation Orchestration Service (OOS), create the AliyunOOSLifecycleHook4CSRole role to grant the required permissions to OOS. Click AliyunOOSLifecycleHook4CSRole and follow the on-screen instructions to complete the authorization.
Note: If you are using an Alibaba Cloud account, click AliyunOOSLifecycleHook4CSRole to grant the permissions.
If you are using a RAM user, ensure that the corresponding Alibaba Cloud account has been granted the AliyunOOSLifecycleHook4CSRole role. Then, grant the AliyunRAMReadOnlyAccess system policy to the RAM user. For more information, see Grant permissions to a RAM user.
Enter a value in the Expected Number Of Nodes field and submit the configuration as prompted.
After you submit the configuration, the node pool status changes to Updating, and then to Scaling Out or Removing Node.
In the node pool list, a status of Scaling Out indicates that the node pool is being scaled out. A status of Active indicates that the scale-out is complete.
Important: When you scale out nodes in a cluster, if the security group denies access to 100.64.0.0/10, the nodes cannot be added to the cluster.
In the node pool list, a status of Removing Node indicates that the node pool is being scaled in. A status of Active indicates that the scale-in is complete.
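Before you scale out, it can help to check whether any deny rule in the security group covers part of 100.64.0.0/10. The following sketch uses only the Python standard library. The function name is hypothetical, and it assumes you have already exported the CIDR blocks of your deny rules as strings.

```python
import ipaddress

# The range that nodes must be able to reach, as stated in the note above.
REQUIRED_RANGE = ipaddress.ip_network("100.64.0.0/10")


def deny_rule_blocks_required_range(deny_cidr: str) -> bool:
    """Return True if a deny rule's CIDR overlaps any part of 100.64.0.0/10."""
    rule = ipaddress.ip_network(deny_cidr, strict=False)
    return rule.overlaps(REQUIRED_RANGE)


# Example deny-rule CIDRs (illustrative values only).
for cidr in ["10.0.0.0/8", "100.100.0.0/16", "0.0.0.0/0"]:
    print(cidr, deny_rule_blocks_required_range(cidr))
```

The check covers only CIDR overlap. It does not evaluate rule priority, ports, or protocols, so treat a True result as a prompt to review the rule rather than as proof that access is blocked.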
Unrecommended operations and suggestions
The expected number of nodes is the number of nodes that a node pool should maintain. Some operations may cause the node pool to scale unexpectedly, which can lead to resource loss. The following table describes common unrecommended operations and the corresponding suggestions.
We recommend that you avoid performing the operations described in this section.
Unrecommended operation | Node pool behavior | Suggestion |
Directly remove a node using the | The expected number of nodes monitors the number of ECS instances in the ESS scaling group but does not monitor the number of nodes in the cluster. If you remove a node using the API Server, the ECS instance is not released. Therefore, the number of nodes in the node pool does not change. However, because the node has been removed from the cluster, the node status on the node list of the node pool changes to Unknown. |
|
Release an ECS instance in the ECS console or using an OpenAPI. | The node pool detects the release of the ECS instance and automatically creates a new one to maintain the expected number of nodes. |
|
Use ESS to remove an ECS instance from a scaling group without changing the expected number of instances. | The node pool detects the release of the ECS instance and automatically creates a new one to maintain the expected number of nodes. | Do not directly perform operations on the scaling group that is associated with the node pool. This can cause the node pool to behave abnormally. |
A subscription ECS instance is released upon expiration. | The node pool detects the release of the ECS instance and automatically creates a new one to maintain the expected number of nodes. | Because a replacement instance is created automatically, this can cause resource loss. Manage expiring ECS instances promptly: remove the node or renew the subscription for the ECS instance.
|
Manually enable health checks for instances in an ESS scaling group in the Auto Scaling console or using an OpenAPI. | After health checks are enabled for the scaling group, the scaling group automatically creates a new ECS instance whenever it detects an unhealthy instance, such as a stopped instance. | By default, ACK does not enable ESS health checks. A new ECS instance is added only when a node is released. Do not directly perform operations on the ESS scaling group of the node pool. Otherwise, the node pool may behave abnormally. |
Error codes for scaling failures and solutions
When you scale a node pool, the operation may fail due to reasons such as insufficient inventory. On the Clusters page, click the name of the target cluster. On the Cluster Tasks tab, you can view the list of cluster tasks and click View Reason to identify why the node pool failed to scale.
The following table describes common error codes for scale-out failures.
Error code | Cause | Solution |
RecommendEmpty.InstanceTypeNoStock | The ECS instance inventory is insufficient in the selected zone. | Edit the node pool to add vSwitches in different zones and configure multiple instance types. This increases the success rate of node creation. The node pool list displays pools with low elasticity. This helps you evaluate the availability of your node pool configuration and the health of the instance supply. For more information, see Check the elasticity of a node pool. |
NodepoolScaleFailed.FailedJoinCluster | The node failed to be added to the ACK cluster. | Log on to the node and run |
InvalidAccountStatus.NotEnoughBalance | Your account has an insufficient balance. | Add funds to your account and try again. |
InvalidParameter.NotMatch | The error message | Change the instance type.
|
QuotaExceed.ElasticQuota | The number of ECS instances of the selected instance type in the current region exceeds the quota. | You can perform the following operations.
|
InvalidResourceType.NotSupported | The specified ECS instance type is not supported or is out of stock in the selected zone. | Call the Query instance types operation to check whether the instance type is supported in the zone, and then change the instance type of the node pool. |
InvalidImage.NotSupported | The error message | Change the instance type.
|
InvalidParameter.NotMatch | The error message | Change the instance type.
|
QuotaExceeded.PrivateIpAddress | The number of available private IP addresses in the vSwitch is insufficient. | Configure more available vSwitches for the node pool and retry the operation. |
InvalidParameter.KmsNotEnabled | The specified KMS key is not enabled. | Log on to the Key Management Service (KMS) console to check the key status. |
InvalidInstanceType.NotSupported | The error message | Change the instance type.
|
InsufficientBalance.CreditPay | Your account has an insufficient balance. | Add funds to your account and try again. |
ApiServer.InternalError | The error message | Check whether the API Server of your cluster is available and accessible. For more information, see Troubleshoot issues when you access a cluster from the console. |
RecommendEmpty.InstanceTypeNotAuthorized | The specified instance type requires authorization before it can be used. | Submit a ticket to ECS to request the authorization. |
Account.Arrearage | Your account has an insufficient balance. | Add funds to your account and try again. |
Err.QueryEndpoints | The request to access the API Server of the ACK cluster failed. | Check whether the API Server of your cluster is available and accessible. For more information, see Troubleshoot issues when you access a cluster from the console. |
RecommendEmpty.DiskTypeNoStock | The disk inventory is insufficient in the selected zone. | Add more zones (vSwitches) to the node pool or change the disk type, and then retry the operation. |
InvalidParameter.KMSKeyId.KMSUnauthorized | KMS cannot be accessed due to a lack of authorization. | Log on to the ECS console to grant the AliyunECSDiskEncryptDefaultRole role to ECS to allow access to KMS. For more information, see Encryption-related permissions. |
InvalidParameter.Conflict | The error message | Change the instance type or disk type and retry the operation. |
NotSupportSnapshotEncrypted.DiskCategory | System disk encryption is supported only for enhanced SSDs (ESSDs). | Change the disk type. For more information about disk types and encryption, see Create and manage a node pool. |
ScalingActivityInProgress | The node pool is being scaled. Try the operation again later. | To avoid scaling activity conflicts, do not scale nodes directly using ESS. |
Instance.StartInstanceFailed | The ECS instance failed to start. | Try the operation again later. To troubleshoot the specific cause, submit a ticket to ECS. |
OperationDenied.NoStock | The selected ECS instance type is out of stock in the specified zone. | Select another instance type and retry. Elasticity quantifies the probability of a successful scale-out for your node pool based on real-time inventory. For more information, see Check the elasticity of a node pool. |
NodepoolScaleFailed.WaitForDesiredSizeTimeout | The scale-out task timed out. | Perform the following steps to view the scale-out details.
|
ApiServer.TooManyRequests | The scale-out task was throttled by the API Server. | Reduce the number of requests that you send to the API Server or try the operation again later. |
NodepoolScaleFailed.PartialSuccess | Some nodes were successfully scaled out, but other nodes failed to be scaled out due to inventory issues. | Select other instance types and retry. Elasticity quantifies the probability of a successful scale-out for your node pool based on real-time inventory. For more information, see Check the elasticity of a node pool. |
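If you process failed tasks in a script, you can group the error codes in the preceding table by the kind of fix they call for. The following is a sketch under assumptions: the grouping, the category names, and the `suggest` helper are illustrative and are not part of any ACK API. It includes only codes whose solutions are stated in full in the table.

```python
# Hypothetical grouping of error codes from the table by remediation type.
REMEDIATION = {
    "add_funds": {
        "InvalidAccountStatus.NotEnoughBalance",
        "InsufficientBalance.CreditPay",
        "Account.Arrearage",
    },
    "add_zones_or_change_types": {
        "RecommendEmpty.InstanceTypeNoStock",
        "RecommendEmpty.DiskTypeNoStock",
        "OperationDenied.NoStock",
        "NodepoolScaleFailed.PartialSuccess",
    },
    "retry_later": {
        "ScalingActivityInProgress",
        "ApiServer.TooManyRequests",
        "Instance.StartInstanceFailed",
    },
}


def suggest(error_code: str) -> str:
    """Return the remediation category for an error code, or a fallback."""
    for category, codes in REMEDIATION.items():
        if error_code in codes:
            return category
    # Codes not listed here: check View Reason on the Cluster Tasks tab.
    return "view_reason"
```

For example, `suggest("Account.Arrearage")` returns `"add_funds"`, and a code that is not listed returns `"view_reason"`.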
References
For information about how to remove a node from a cluster, including specific operations and notes, see Remove a node.
For information about node pool O&M operations, such as upgrading a node pool, automatic node recovery, and fixing OS CVE vulnerabilities in a node pool, see Node pool O&M.
For information about best practices for node pools, such as using deployment sets to distribute nodes across different physical servers for high availability and creating node pools from preemptible instances, see Best practices for nodes and node pools.