
Container Service for Kubernetes:FAQ about node instant scaling

Last Updated: Oct 01, 2025

This topic describes common issues and solutions when you use the node instant scaling feature.

Index

  • Scaling behavior of node instant scaling

    • Known limitations

    • Scale-out behavior

    • Scale-in behavior

  • Custom scaling behavior

    • Control scaling behavior using pods: How do I control node scale-in using pods?

    • Control scaling behavior using nodes

  • About the node instant scaling component

Known limitations

Feature limitations

  • Node instant scaling does not support the swift mode.

  • A node pool can contain up to 180 nodes per scale-out batch.

  • Scale-in cannot be disabled for a specific cluster.

    Note

    To disable scale-in for a specific node, see How do I prevent node instant scaling from removing specific nodes?

  • The node instant scaling solution does not support checking the inventory of preemptible instances. If the Billing Method of the node pool is set to preemptible instances and the option to Use Pay-as-you-go Instances to Supplement Preemptible Capacity is enabled for the node pool, the pay-as-you-go instance is scaled out even if there is sufficient inventory of preemptible instances.

Inaccurate node resource estimation

The underlying system of an ECS instance consumes some resources. This means the available memory of an instance is less than the amount defined in its instance type. For more information, see Why is the memory size of a purchased instance different from the memory size defined in its instance type?. As a result, the schedulable resources of a node estimated by the node instant scaling component may be greater than the actual schedulable resources. The estimation is not 100% accurate. Note the following points when you configure pod requests.

  • When you configure pod requests, ensure that the total requested resources, including CPU, memory, and disk, are less than the specifications of the instance type. As a general rule, keep the total requests at or below 70% of the node's resources.

  • When the node instant scaling component checks whether a node has sufficient resources, it considers only Kubernetes pod resources, such as pending pods and DaemonSet pods. If static pods that are not managed by a DaemonSet exist on the node, you must reserve resources for these pods in advance.

  • If a pod requests a large amount of resources, for example, more than 70% of a node's resources, you must test and confirm in advance that the pod can be scheduled to a node of the same instance type.
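As an illustration of the sizing guidance above, the following sketch (names and image are hypothetical) keeps a pod's requests at roughly half of a 4-core 8 GB instance type, well under the 70% guideline:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sized-workload        # hypothetical name
spec:
  containers:
  - name: app
    image: nginx:1.25
    resources:
      requests:
        cpu: "2"              # ~50% of a 4-core instance type
        memory: 4Gi           # ~50% of an 8 GB instance type
      limits:
        cpu: "3"
        memory: 6Gi
```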

Limited simulatable resource types

The node instant scaling component supports only a limited number of resource types for simulating and determining whether to perform scaling operations. For more information, see What resource types can node instant scaling simulate?.

Scale-out behavior

What resource types can node instant scaling simulate?

The following resource types are supported for simulating and determining scaling behavior.

cpu
memory
ephemeral-storage 
aliyun.com/gpu-mem # Only shared GPUs are supported.
nvidia.com/gpu

Does node instant scaling support scaling out nodes of a suitable instance type from a node pool based on pod resource requests?

Yes, it does. For example, suppose you configure two instance types, 4-core 8 GB and 12-core 48 GB, for a node pool with Auto Scaling enabled, and a pod requests 2 CPU cores. When node instant scaling performs a scale-out, it preferentially schedules the pod to a 4-core 8 GB node. If you later replace the 4-core 8 GB instance type with 8-core 16 GB, node instant scaling automatically runs the pod on an 8-core 16 GB node.

If a node pool has multiple instance types, how does node instant scaling select one by default?

Based on the instance types configured in the node pool, node instant scaling periodically excludes instance types with insufficient inventory. It then sorts the remaining types by the number of CPU cores and checks each one to see if it meets the resource requests of unschedulable pods. Once an instance type meets the requirements, node instant scaling selects that instance type and does not check the remaining types.

When using node instant scaling, how can I monitor real-time changes in the instance type inventory of a node pool?

Node instant scaling provides health metrics that periodically update the inventory of instance types in a node pool with Auto Scaling enabled. When the inventory status of an instance type changes, node instant scaling sends a Kubernetes event named InstanceInventoryStatusChanged. You can subscribe to this event notification to monitor the inventory health of the node pool, assess its current status, and adjust the instance type configuration in advance. For more information, see View the health status of node instant scaling.
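For example, assuming you have kubectl access to the cluster, you can list these events by filtering on the event reason:

```shell
# List InstanceInventoryStatusChanged events in all namespaces,
# ordered by the time they were last observed
kubectl get events --all-namespaces \
  --field-selector reason=InstanceInventoryStatusChanged \
  --sort-by=.lastTimestamp
```

This is a one-off query; for continuous monitoring, you can also forward cluster events to an external notification channel.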

How can I optimize the node pool configuration to prevent scale-out failures due to insufficient inventory?

Consider the following configuration suggestions to expand the range of available instance types:

  • Configure multiple optional instance types for the node pool, or use a generalized configuration.

  • Configure multiple zones for the node pool.

Why does node instant scaling fail to add nodes?

Check for the following scenarios.

  • The instance types configured for the node pool have insufficient inventory.

  • The instance types configured for the node pool cannot meet the pod's resource requests. The listed specification of an ECS instance type is its total resource size, but the system reserves some resources at runtime, so the resources actually available for scheduling on a node are smaller than the listed specification.

  • You have not completed the authorization described in Enable instant elasticity for nodes.

  • The node pool with Auto Scaling enabled fails to scale out instances.

To ensure the accuracy of subsequent scaling and system stability, the node instant scaling component does not perform scaling operations until issues with abnormal nodes are resolved.

How do I configure custom resources for a node pool that has node instant scaling enabled?

You can configure ECS tags with the following fixed prefix for a node pool that has node instant scaling enabled. This allows the scaling component to identify the available custom resources in the node pool or the exact values of specified resources.

Note

The version of the node instant scaling component ACK GOATScaler must be v0.2.18 or later. To upgrade the component, see Manage add-ons.

goatscaler.io/node-template/resource/{resource-name}:{resource-size}

Example:

goatscaler.io/node-template/resource/hugepages-1Gi:2Gi
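A pod can then request the advertised resource by name. The following sketch assumes the hugepages-1Gi value from the tag above; note that Kubernetes requires huge page requests to equal their limits and to be accompanied by a cpu or memory request, and actually consuming huge pages also depends on the node OS configuration:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hugepages-consumer    # hypothetical name
spec:
  containers:
  - name: app
    image: busybox:1.36
    command: ["sleep", "3600"]
    resources:
      requests:
        hugepages-1Gi: 2Gi    # matches the custom resource declared via the ECS tag
        memory: 2Gi           # a memory request is required alongside huge pages
      limits:
        hugepages-1Gi: 2Gi    # huge page limits must equal requests
        memory: 2Gi
```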

Scale-in behavior

Why does node instant scaling fail to remove nodes?

Consider the following scenarios.

  • The option to scale in only empty nodes is enabled, but the node being checked is not empty.

  • The ratio of resources requested by the pods on the node is higher than the configured scale-in threshold.

  • Pods from the kube-system namespace are running on the node.

  • The pods on the node have a mandatory scheduling policy that prevents other nodes from running them.

  • The pods on the node have a PodDisruptionBudget, and the minimum number of available pods has been reached.

  • If a new node is added, node instant scaling will not perform a scale-in operation on that node within 10 minutes.

  • Offline nodes exist. An offline node is a running instance that does not have a corresponding node object. The node instant scaling component supports an automatic cleanup feature in v0.5.3 and later. For earlier versions, you must manually delete these residual instances.

    Version v0.5.3 is in phased release. Please submit a ticket to request access. For information about how to upgrade the component, see Components.

    On the Node Pools page, click Sync Node Pool, and then click Details. On the Node Management tab, check whether any nodes are in the offline state.

What types of pods can prevent node instant scaling from removing nodes?

If a pod is not created by a native Kubernetes controller, such as a Deployment, ReplicaSet, Job, or StatefulSet, or if the pods on a node cannot be safely terminated or migrated, the node instant scaling component may be unable to remove the node.

Control scaling behavior using pods

How do I control node scale-in using pods?

You can use the pod annotation goatscaler.io/safe-to-evict to specify whether a pod prevents a node from being scaled in during a node instant scaling scale-in.

  • To prevent the node from being scaled in, add the annotation "goatscaler.io/safe-to-evict": "false" to the pod.

  • To allow the node to be scaled in, add the annotation "goatscaler.io/safe-to-evict": "true" to the pod.
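As a sketch (the pod name and image are hypothetical), the annotation is set in the pod metadata:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: critical-batch-task   # hypothetical name
  annotations:
    goatscaler.io/safe-to-evict: "false"   # prevents the hosting node from being scaled in
spec:
  containers:
  - name: worker
    image: busybox:1.36
    command: ["sleep", "3600"]
```

For pods managed by a controller such as a Deployment, set the annotation in the pod template so that every replica carries it.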

Control scaling behavior using nodes

How do I specify which nodes to delete during a node instant scaling scale-in?

You can add the goatscaler.io/force-to-delete:true:NoSchedule taint to the nodes that you want to remove. After you add this taint, node instant scaling directly deletes the nodes without checking the pod status or draining the pods. Use this feature with caution because it may cause service interruptions or data loss.
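Assuming kubectl access to the cluster, the taint described above can be applied with the standard kubectl taint syntax:

```shell
# Mark a node for forced deletion; node instant scaling removes it
# without checking pod status or draining pods
kubectl taint nodes <node-name> goatscaler.io/force-to-delete=true:NoSchedule
```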

How do I prevent node instant scaling from removing specific nodes?

You can configure the node annotation "goatscaler.io/scale-down-disabled": "true" for the target node to prevent it from being scaled in by the node instant scaling component. The following is a sample command to add the annotation.

kubectl annotate node <nodename> goatscaler.io/scale-down-disabled=true
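To make the node eligible for scale-in again later, remove the annotation using the standard kubectl removal syntax (a trailing hyphen after the annotation key):

```shell
kubectl annotate node <nodename> goatscaler.io/scale-down-disabled-
```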

Can node instant scaling scale in only empty nodes?

You can configure whether to scale in only empty nodes at the node level or cluster level. If you configure this feature at both levels, the node-level configuration takes precedence.

  • Node level: Add the label goatscaler.io/scale-down-only-empty:true or goatscaler.io/scale-down-only-empty:false to a node to enable or disable scaling in only empty nodes.

  • Cluster level: In the Container Service for Kubernetes console, go to the Add-ons page. Find the node instant scaling component ACK GOATScaler and follow the on-screen instructions to set ScaleDownOnlyEmptyNodes to true or false. This enables or disables scaling in only empty nodes.
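The node-level label described above can be applied with kubectl, for example:

```shell
# Allow this node to be scaled in only when it is empty
kubectl label node <nodename> goatscaler.io/scale-down-only-empty=true
```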

About the node instant scaling component

Are there any operations that trigger the automatic update of the node instant scaling component?

No. Except during system maintenance and upgrades, ACK does not automatically update the node instant scaling component ACK GOATScaler. You must manually upgrade the component on the Add-ons page in the Container Service for Kubernetes console.

Why does node scaling still fail after I complete role authorization in the ACK managed cluster?

This may be caused by the absence of addon.aliyuncsmanagedautoscalerrole.token in the Secret in the kube-system namespace of the cluster. If the token is missing, use one of the following methods to add it:

  • Submit a ticket for technical support.

  • Manually add the AliyunCSManagedAutoScalerRolePolicy permission: By default, ACK assumes the worker RAM role to use the relevant capabilities. Use the following steps to manually assign the AliyunCSManagedAutoScalerRolePolicy permission to the worker role:

    1. On the Clusters page, find the cluster that you want to manage and click its name.

    2. In the left-side navigation pane, choose Nodes > Node Pools.

    3. On the Node Pools page, click Enable next to Node Scaling.


    4. Authorize the KubernetesWorkerRole role and the AliyunCSManagedAutoScalerRolePolicy system policy as prompted in the console.

    5. To apply the new RAM policy, manually restart the cluster-autoscaler or ack-goatscaler Deployment in the kube-system namespace. cluster-autoscaler manages node auto scaling, and ack-goatscaler handles node instant scaling.
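The restart in the last step can be performed with kubectl rollout, for example:

```shell
# Restart whichever component your cluster uses so that it picks up the new RAM policy
kubectl -n kube-system rollout restart deployment cluster-autoscaler
# or, for node instant scaling:
kubectl -n kube-system rollout restart deployment ack-goatscaler
```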