Custom elastic resource priority scheduling is an elastic scheduling policy from Alibaba Cloud. It lets you define a ResourcePolicy to control the order in which application instance pods are scheduled to different types of node resources during deployment or scale-out. During a scale-in, pods are terminated in the reverse order.
Important: Do not use system-reserved labels, such as alibabacloud.com/compute-class or alibabacloud.com/compute-qos, in the label selector of a workload, such as the spec.selector.matchLabels of a Deployment. These labels may be modified by the system during custom priority scheduling, which can cause the controller to frequently rebuild Pods and affect application stability.
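For example, a workload selector like the following hypothetical sketch is problematic, because the system may modify the `alibabacloud.com/compute-class` label on the Pods and the Deployment controller would then stop matching them. Select only on labels that you own, such as `app`:

# Anti-pattern (hypothetical example): do not put system-reserved labels in matchLabels.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: demo
      alibabacloud.com/compute-class: general-purpose   # system-reserved; remove from matchLabels
  template:
    metadata:
      labels:
        app: demo
        alibabacloud.com/compute-class: general-purpose
    spec:
      containers:
      - name: app
        image: nginx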
Prerequisites
An ACK managed cluster Pro edition of version 1.20.11 or later is required. To upgrade your cluster, see Manually upgrade an ACK cluster.
The scheduler version must meet the following requirements based on the ACK cluster version. For more information about the features supported by different scheduler versions, see kube-scheduler.
| ACK version | Scheduler version |
| --- | --- |
| 1.20 | v1.20.4-ack-7.0 or later |
| 1.22 | v1.22.15-ack-2.0 or later |
| 1.24 or later | All versions are supported |
To use Elastic Container Instance (ECI) resources, you must deploy ack-virtual-node. For more information, see Use ECI in an ACK cluster.
Precautions
Starting with scheduler version v1.x.x-aliyun-6.4, the default value of the `ignorePreviousPod` field for custom elastic resource priority is `false`, and the default value of `ignoreTerminatingPod` is `true`. Existing ResourcePolicies that use these fields are not affected, and neither are future updates.
This feature conflicts with pod-deletion-cost. The two cannot be used at the same time.
This feature cannot be used with ECI-based elastic scheduling with ElasticResource.
This feature uses a best-effort policy and does not guarantee that pods are terminated in the exact reverse order during a scale-in.
The `max` field is available only in cluster versions 1.22 and later and scheduler versions 5.0 and later.
When used with an elastic node pool, this feature may cause the node pool to scale out unnecessarily. To use this feature, include the elastic node pool in a unit and do not set the `max` field for that unit.
If your scheduler version is earlier than 5.0 or your cluster version is 1.20 or earlier, pods that exist before the ResourcePolicy is created are the first to be terminated during a scale-in.
If your scheduler version is earlier than 6.1 or your cluster version is 1.20 or earlier, do not modify the ResourcePolicy until the pods associated with it are completely deleted.
Usage
Create a ResourcePolicy to define the elastic resource priority:
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: test
  namespace: default
spec:
  selector:
    key1: value1
  strategy: prefer
  units:
  - nodeSelector:
      unit: first
    podLabels:
      key1: value1
    podAnnotations:
      key1: value1
    resource: ecs
  - nodeSelector:
      unit: second
    max: 10
    resource: ecs
  - resource: eci
  # Optional, advanced configurations
  preemptPolicy: AfterAllUnits
  ignorePreviousPod: false
  ignoreTerminatingPod: true
  matchLabelKeys:
  - pod-template-hash
  whenTryNextUnits:
    policy: TimeoutOrExceedMax
    timeout: 1m
- `selector`: Specifies that the ResourcePolicy applies to pods in the same namespace that have the label `key1=value1`. If the `selector` is empty, the policy applies to all pods in the namespace.
- `strategy`: The scheduling policy. Currently, only `prefer` is supported.
- `units`: The user-defined scheduling units. During a scale-out, resources are provisioned in the order defined in `units`. During a scale-in, resources are released in the reverse order.
  - `resource`: The type of elastic resource. The supported types are `eci`, `ecs`, `elastic`, and `acs`. The `elastic` type is available in cluster versions 1.24 and later and scheduler versions 6.4.3 and later. The `acs` type is available in cluster versions 1.26 and later and scheduler versions 6.7.1 and later.
    Note: The `elastic` type is deprecated. Use auto-scaling node pools instead by setting `k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true"` in `podLabels`.
    Note: The `acs` type adds the `alibabacloud.com/compute-class: general-purpose` and `alibabacloud.com/compute-qos: default` labels to the pod by default. You can overwrite the default values by declaring different values in `podLabels`. When `alpha.alibabacloud.com/compute-qos-strategy` is declared in `podAnnotations`, the `alibabacloud.com/compute-qos` label is not added by default.
    Important: Scheduler versions earlier than 6.8.3 do not support using multiple `acs` units at the same time.
  - `nodeSelector`: Uses node labels to identify the nodes that belong to this scheduling unit. This parameter applies only to `ecs` resources.
  - `max` (available in scheduler version 5.0 and later): The maximum number of pod replicas that can be scheduled in this scheduling unit.
  - `maxResources` (available in scheduler version 6.9.5 and later): The maximum amount of pod resources that can be scheduled in this scheduling unit.
  - `podAnnotations`: A map of type `map[string]string{}`. The key-value pairs configured in `podAnnotations` are added to the pod by the scheduler. When counting the number of pods in this unit, only pods with these key-value pairs are counted.
  - `podLabels`: A map of type `map[string]string{}`. The key-value pairs configured in `podLabels` are added to the pod by the scheduler. When counting the number of pods in this unit, only pods with these key-value pairs are counted.
    Note: If a unit's `podLabels` contains `k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true"`, or if the number of pods in the current unit is less than the specified `max` value, the scheduler makes the pod wait in the current unit. You can set the waiting time in `whenTryNextUnits`. The `k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true"` label is not added to the pod and is not required on the pod for the pod count.
- `preemptPolicy` (available in scheduler version 6.1 and later; does not apply to ACS units): Specifies whether to allow preemption when scheduling fails for a unit in a ResourcePolicy with multiple units. `BeforeNextUnit` means the scheduler attempts preemption each time scheduling fails for a unit. `AfterAllUnits` means the scheduler attempts preemption only after scheduling fails for the last unit. The default value is `AfterAllUnits`. You can enable preemption by configuring ACK Scheduler parameters. For more information, see Enable preemption.
- `ignorePreviousPod` (available in scheduler version 6.1 and later): Must be used with the `max` field in `units`. If this field is set to `true`, pods that were scheduled before the ResourcePolicy was created are ignored when counting pods.
- `ignoreTerminatingPod` (available in scheduler version 6.1 and later): Must be used with the `max` field in `units`. If this field is set to `true`, pods in the `Terminating` state are ignored when counting pods.
- `matchLabelKeys` (available in scheduler version 6.2 and later): Must be used with the `max` field in `units`. Pods are grouped based on the values of the listed labels, and the `max` count is applied to each group separately. If a pod is missing a label declared in `matchLabelKeys`, the pod is rejected by the scheduler.
- `whenTryNextUnits` (available in cluster version 1.24 and later and scheduler version 6.4 and later): Describes the conditions under which a pod is allowed to use resources from subsequent units.
  - `policy`: The policy used by the pod. Valid values are `ExceedMax`, `LackResourceAndNoTerminating`, `TimeoutOrExceedMax`, and `LackResourceOrExceedMax` (default).
    - `ExceedMax`: Allows the pod to use resources from the next unit if the `max` and `maxResources` fields of the current unit are not set, if the number of pods in the current unit is greater than or equal to the `max` value, or if the used resources in the current unit plus the current pod's resources exceed `maxResources`. This policy can be used with auto scaling and ECI to prioritize auto scaling for node pools.
      Important: If the auto-scaling node pool cannot scale out nodes for a long time, this policy may cause pods to remain in the `Pending` state. Currently, Cluster Autoscaler is not aware of the `max` limit in the ResourcePolicy, so the actual number of scaled-out instances may exceed the `max` value. This issue will be fixed in a future release.
    - `TimeoutOrExceedMax`: Applies when one of the following conditions is met:
      - The `max` field of the current unit is set, and the number of pods in the unit is less than the `max` value (or `maxResources` is set, and the amount of scheduled resources plus the resources of the current pod is less than `maxResources`).
      - The `max` field of the current unit is not set, and the `podLabels` of the current unit contains `k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true"`.
      Under these conditions, if the current unit has insufficient resources to schedule the pod, the pod waits in the current unit for at most the duration specified by `timeout`. This policy can be used with auto scaling and ECI to prioritize auto scaling for node pools and automatically fall back to ECI resources after the timeout (see the sketch after this parameter list).
      Important: If a node is scaled out during the timeout period but is not yet in the `Ready` state, and the pod does not have a toleration for the `NotReady` taint, the pod is still scheduled to an ECI instance.
    - `LackResourceOrExceedMax`: Allows the pod to use resources from the next unit if the number of pods in the current unit is greater than or equal to the `max` value, or if there are no more available resources in the current unit. This is the default policy and is suitable for most basic scenarios.
    - `LackResourceAndNoTerminating`: Allows the pod to use resources from the next unit if the number of pods in the current unit is greater than or equal to the `max` value or there are no more available resources in the current unit, and there are no pods in the `Terminating` state in the current unit. This policy is suitable for rolling updates because it prevents new pods from being scheduled to subsequent units due to pods that are still terminating.
  - `timeout` (not supported for ACS units, which are limited only by `max`): When the policy is `TimeoutOrExceedMax`, this field specifies the timeout duration. If this field is empty, the default timeout period is 15 minutes.
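The following is a minimal sketch, not a definitive configuration, of the pattern referenced above: a unit backed by an auto-scaling node pool that pods wait on via `k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true"`, followed by an ECI unit that is used only after the `TimeoutOrExceedMax` timeout expires. The node pool ID, namespace, and `app: demo` selector are placeholders for illustration:

apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: prefer-node-pool-then-eci
  namespace: default
spec:
  selector:
    app: demo                                  # placeholder; match the labels of your pods
  strategy: prefer
  units:
  - resource: ecs                              # unit backed by an auto-scaling node pool
    nodeSelector:
      alibabacloud.com/nodepool-id: np-placeholder****                 # placeholder node pool ID
    podLabels:
      k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true"      # wait here while the node pool scales out
  - resource: eci                              # fall back to ECI after the timeout
  whenTryNextUnits:
    policy: TimeoutOrExceedMax
    timeout: 10m                               # example value; default is 15 minutes if omitted

In line with the precautions above, no `max` is set on the auto-scaling unit so that the node pool does not scale out unnecessarily.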
Example scenarios
Scenario 1: Prioritize scheduling based on node pools
Assume that you want to deploy an application. The cluster has two node pools: Node Pool A and Node Pool B. You want to schedule pods to Node Pool A first. If resources are insufficient, schedule them to Node Pool B. When scaling in, you want to terminate pods in Node Pool B first, and then terminate pods in Node Pool A. In this example, cn-beijing.10.0.3.137 and cn-beijing.10.0.3.138 belong to Node Pool A, and cn-beijing.10.0.6.47 and cn-beijing.10.0.6.46 belong to Node Pool B. The node specifications are 2 vCores and 4 GB of memory. Follow these steps to prioritize scheduling based on node pools:
Use the following YAML content to create a ResourcePolicy and customize the node pool scheduling order.
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: nginx
  namespace: default
spec:
  selector:
    app: nginx # This must be associated with the label of the pod you create later.
  strategy: prefer
  units:
  - resource: ecs
    nodeSelector:
      alibabacloud.com/nodepool-id: np7ec79f2235954e879de07b780058****
  - resource: ecs
    nodeSelector:
      alibabacloud.com/nodepool-id: npab2df797738644e3a7b7cbf532bb****
Note: You can obtain the node pool ID from the Nodes > Node Pools page of your cluster. For more information, see Create and manage a node pool.
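If you prefer the command line, you can also read the node pool ID directly from the node labels. This assumes the nodes carry the `alibabacloud.com/nodepool-id` label used in the ResourcePolicy above:

kubectl get nodes -L alibabacloud.com/nodepool-id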
Use the following YAML content to create a deployment with two pods.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      name: nginx
      labels:
        app: nginx # This must be associated with the selector of the ResourcePolicy you created in the previous step.
    spec:
      containers:
      - name: nginx
        image: nginx
        resources:
          limits:
            cpu: 2
          requests:
            cpu: 2
Create the Nginx application and view the deployment result.
Run the following command to create the Nginx application.
kubectl apply -f nginx.yaml
Expected output:
deployment.apps/nginx created
Run the following command to view the deployment result.
kubectl get pods -o wide
Expected output:
NAME                    READY   STATUS        RESTARTS   AGE     IP               NODE                           NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-b****   1/1     Running       0          17s     172.29.112.216   cn-beijing.10.0.3.137          <none>           <none>
nginx-9cdf7bbf9-k****   1/1     Running       0          17s     172.29.113.24    cn-beijing.10.0.3.138          <none>           <none>
The output shows that the first two pods are scheduled to the nodes in Node Pool A.
Scale out the pods.
Run the following command to scale out the number of pods to four.
kubectl scale deployment nginx --replicas 4
Expected output:
deployment.apps/nginx scaled
Run the following command to check the pod status.
kubectl get pods -o wide
Expected output:
NAME                    READY   STATUS        RESTARTS   AGE     IP               NODE                           NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-b****   1/1     Running       0          101s    172.29.112.216   cn-beijing.10.0.3.137          <none>           <none>
nginx-9cdf7bbf9-k****   1/1     Running       0          101s    172.29.113.24    cn-beijing.10.0.3.138          <none>           <none>
nginx-9cdf7bbf9-m****   1/1     Running       0          18s     172.29.113.156   cn-beijing.10.0.6.47           <none>           <none>
nginx-9cdf7bbf9-x****   1/1     Running       0          18s     172.29.113.89    cn-beijing.10.0.6.46           <none>           <none>
The output shows that when the nodes in Node Pool A have insufficient resources, pods are scheduled to the nodes in Node Pool B.
Scale in the pods.
Run the following command to scale in the number of pods from four to two.
kubectl scale deployment nginx --replicas 2
Expected output:
deployment.apps/nginx scaled
Run the following command to check the pod status.
kubectl get pods -o wide
Expected output:
NAME                    READY   STATUS        RESTARTS   AGE     IP               NODE                           NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-b****   1/1     Running       0          2m41s   172.29.112.216   cn-beijing.10.0.3.137          <none>           <none>
nginx-9cdf7bbf9-k****   1/1     Running       0          2m41s   172.29.113.24    cn-beijing.10.0.3.138          <none>           <none>
nginx-9cdf7bbf9-m****   0/1     Terminating   0          78s     172.29.113.156   cn-beijing.10.0.6.47           <none>           <none>
nginx-9cdf7bbf9-x****   0/1     Terminating   0          78s     172.29.113.89    cn-beijing.10.0.6.46           <none>           <none>
The output shows that pods in Node Pool B are terminated first, following the reverse order of scheduling.
Scenario 2: Hybrid scheduling with ECS and ECI
Assume that you want to deploy an application. The cluster has three types of resources: subscription Elastic Compute Service (ECS) instances, pay-as-you-go ECS instances, and ECI instances. To reduce resource costs, you want to schedule pods in the following order of priority: subscription ECS instances, pay-as-you-go ECS instances, and ECI instances. When scaling in, you want to terminate pods on ECI instances first, then pods on pay-as-you-go ECS instances, and finally pods on subscription ECS instances. The example nodes have 2 vCores and 4 GB of memory. Follow these steps for hybrid scheduling with ECS and ECI:
Run the following commands to add different labels to nodes based on their billing method. You can also use the node pool feature to automatically add the labels.
kubectl label node cn-beijing.10.0.3.137 paidtype=subscription
kubectl label node cn-beijing.10.0.3.138 paidtype=subscription
kubectl label node cn-beijing.10.0.6.46 paidtype=pay-as-you-go
kubectl label node cn-beijing.10.0.6.47 paidtype=pay-as-you-go
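To confirm that the labels were applied, you can list the nodes with the `paidtype` label displayed as a column:

kubectl get nodes -L paidtype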
Use the following YAML content to create a ResourcePolicy and customize the node pool scheduling order.
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: nginx
  namespace: default
spec:
  selector:
    app: nginx # This must be associated with the label of the pod you create later.
  strategy: prefer
  units:
  - resource: ecs
    nodeSelector:
      paidtype: subscription
  - resource: ecs
    nodeSelector:
      paidtype: pay-as-you-go
  - resource: eci
Use the following YAML content to create a deployment with two pods.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      name: nginx
      labels:
        app: nginx # This must be associated with the selector of the ResourcePolicy you created in the previous step.
    spec:
      containers:
      - name: nginx
        image: nginx
        resources:
          limits:
            cpu: 2
          requests:
            cpu: 2
Create the Nginx application and view the deployment result.
Run the following command to create the Nginx application.
kubectl apply -f nginx.yaml
Expected output:
deployment.apps/nginx created
Run the following command to view the deployment result.
kubectl get pods -o wide
Expected output:
NAME                    READY   STATUS        RESTARTS   AGE     IP               NODE                           NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-b****   1/1     Running       0          66s     172.29.112.215   cn-beijing.10.0.3.137          <none>           <none>
nginx-9cdf7bbf9-r****   1/1     Running       0          66s     172.29.113.23    cn-beijing.10.0.3.138          <none>           <none>
The output shows that the first two pods are scheduled to nodes with the label paidtype=subscription.
Scale out the pods.
Run the following command to scale out the number of pods to four.
kubectl scale deployment nginx --replicas 4
Expected output:
deployment.apps/nginx scaled
Run the following command to check the pod status.
kubectl get pods -o wide
Expected output:
NAME                    READY   STATUS        RESTARTS   AGE     IP               NODE                           NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-4****   1/1     Running       0          16s     172.29.113.155   cn-beijing.10.0.6.47           <none>           <none>
nginx-9cdf7bbf9-b****   1/1     Running       0          3m48s   172.29.112.215   cn-beijing.10.0.3.137          <none>           <none>
nginx-9cdf7bbf9-f****   1/1     Running       0          16s     172.29.113.88    cn-beijing.10.0.6.46           <none>           <none>
nginx-9cdf7bbf9-r****   1/1     Running       0          3m48s   172.29.113.23    cn-beijing.10.0.3.138          <none>           <none>
The output shows that when nodes with the label paidtype=subscription have insufficient resources, pods are scheduled to nodes with the label paidtype=pay-as-you-go.
Run the following command to scale out the number of pods to six.
kubectl scale deployment nginx --replicas 6
Expected output:
deployment.apps/nginx scaled
Run the following command to check the pod status.
kubectl get pods -o wide
Expected output:
NAME                    READY   STATUS        RESTARTS   AGE     IP               NODE                           NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-4****   1/1     Running       0          3m10s   172.29.113.155   cn-beijing.10.0.6.47           <none>           <none>
nginx-9cdf7bbf9-b****   1/1     Running       0          6m42s   172.29.112.215   cn-beijing.10.0.3.137          <none>           <none>
nginx-9cdf7bbf9-f****   1/1     Running       0          3m10s   172.29.113.88    cn-beijing.10.0.6.46           <none>           <none>
nginx-9cdf7bbf9-r****   1/1     Running       0          6m42s   172.29.113.23    cn-beijing.10.0.3.138          <none>           <none>
nginx-9cdf7bbf9-s****   1/1     Running       0          36s     10.0.6.68        virtual-kubelet-cn-beijing-j   <none>           <none>
nginx-9cdf7bbf9-v****   1/1     Running       0          36s     10.0.6.67        virtual-kubelet-cn-beijing-j   <none>           <none>
The output shows that pods are scheduled to ECI resources when the ECS instances have insufficient resources.
Scale in the pods.
Run the following command to scale in the number of pods from six to four.
kubectl scale deployment nginx --replicas 4
Expected output:
deployment.apps/nginx scaled
Run the following command to check the pod status.
kubectl get pods -o wide
Expected output:
NAME                    READY   STATUS        RESTARTS   AGE     IP               NODE                           NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-4****   1/1     Running       0          4m59s   172.29.113.155   cn-beijing.10.0.6.47           <none>           <none>
nginx-9cdf7bbf9-b****   1/1     Running       0          8m31s   172.29.112.215   cn-beijing.10.0.3.137          <none>           <none>
nginx-9cdf7bbf9-f****   1/1     Running       0          4m59s   172.29.113.88    cn-beijing.10.0.6.46           <none>           <none>
nginx-9cdf7bbf9-r****   1/1     Running       0          8m31s   172.29.113.23    cn-beijing.10.0.3.138          <none>           <none>
nginx-9cdf7bbf9-s****   1/1     Terminating   0          2m25s   10.0.6.68        virtual-kubelet-cn-beijing-j   <none>           <none>
nginx-9cdf7bbf9-v****   1/1     Terminating   0          2m25s   10.0.6.67        virtual-kubelet-cn-beijing-j   <none>           <none>
The output shows that pods on ECI instances are terminated first, in reverse order of scheduling.
Run the following command to scale in the number of pods from four to two.
kubectl scale deployment nginx --replicas 2
Expected output:
deployment.apps/nginx scaled
Run the following command to check the pod status.
kubectl get pods -o wide
Expected output:
NAME                    READY   STATUS        RESTARTS   AGE     IP               NODE                           NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-4****   0/1     Terminating   0          6m43s   172.29.113.155   cn-beijing.10.0.6.47           <none>           <none>
nginx-9cdf7bbf9-b****   1/1     Running       0          10m     172.29.112.215   cn-beijing.10.0.3.137          <none>           <none>
nginx-9cdf7bbf9-f****   0/1     Terminating   0          6m43s   172.29.113.88    cn-beijing.10.0.6.46           <none>           <none>
nginx-9cdf7bbf9-r****   1/1     Running       0          10m     172.29.113.23    cn-beijing.10.0.3.138          <none>           <none>
The output shows that pods on nodes with the label paidtype=pay-as-you-go are terminated first, in reverse scheduling order.
Run the following command to check the pod status.
kubectl get pods -o wide
Expected output:
NAME                    READY   STATUS        RESTARTS   AGE     IP               NODE                           NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-b****   1/1     Running       0          11m     172.29.112.215   cn-beijing.10.0.3.137          <none>           <none>
nginx-9cdf7bbf9-r****   1/1     Running       0          11m     172.29.113.23    cn-beijing.10.0.3.138          <none>           <none>
The output shows that only pods on nodes with the paidtype=subscription label remain.
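If you created the resources in these scenarios only for testing, you can remove them afterwards. This sketch assumes the resource names used above:

kubectl delete deployment nginx
kubectl delete resourcepolicies.scheduling.alibabacloud.com nginx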
References
When you deploy services in an ACK cluster, you can use tolerations and node affinity to use only ECS or ECI resources, or to automatically request ECI resources when ECS resources are insufficient. By configuring scheduling policies, you can meet different requirements for elastic resources in various workload scenarios. For more information, see Specify resource allocation between ECS and ECI.
High availability and high performance are important requirements for distributed tasks. In an ACK managed cluster Pro edition, you can use native Kubernetes scheduling semantics to spread distributed tasks across zones for high availability. You can also use these semantics to implement affinity-based deployment of distributed tasks in a specific zone for high performance. For more information, see Spread ECI pods across zones and schedule ECI pods with affinity.