Custom elastic resource priority scheduling is an elastic scheduling policy from Alibaba Cloud. It lets you define a ResourcePolicy to control the order in which application instance pods are scheduled to different types of node resources during deployment or scale-out. During a scale-in, pods are terminated in the reverse order.
Important: Do not use system-reserved labels, such as alibabacloud.com/compute-class or alibabacloud.com/compute-qos, in the label selector of a workload, such as the spec.selector.matchLabels of a Deployment. These labels may be modified by the system during custom priority scheduling, which can cause the controller to frequently rebuild Pods and affect application stability.
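For example, a workload selector like the following hypothetical sketch is problematic, because the system may modify the `alibabacloud.com/compute-class` label on the Pods and the Deployment controller would then stop matching them. Select only on labels that you own, such as `app`:

# Anti-pattern (hypothetical example): do not put system-reserved labels in matchLabels.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: demo
      alibabacloud.com/compute-class: general-purpose   # system-reserved; remove from matchLabels
  template:
    metadata:
      labels:
        app: demo
        alibabacloud.com/compute-class: general-purpose
    spec:
      containers:
      - name: app
        image: nginx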
Prerequisites
An ACK managed cluster Pro edition of version 1.20.11 or later is required. To upgrade your cluster, see Manually upgrade an ACK cluster.
The scheduler version must meet the following requirements based on the ACK cluster version. For more information about the features supported by different scheduler versions, see kube-scheduler.
| ACK version | Scheduler version |
| --- | --- |
| 1.20 | v1.20.4-ack-7.0 or later |
| 1.22 | v1.22.15-ack-2.0 or later |
| 1.24 or later | All versions are supported |
To use Elastic Container Instance (ECI) resources, you must deploy ack-virtual-node. For more information, see Use ECI in an ACK cluster.
Precautions
Starting with scheduler version v1.x.x-aliyun-6.4, the default value of the `ignorePreviousPod` field for custom elastic resource priority is `false`, and the default value of `ignoreTerminatingPod` is `true`. Existing ResourcePolicies that use these fields are not affected, and neither are future updates.
This feature conflicts with pod-deletion-cost. The two cannot be used at the same time.
This feature cannot be used with ECI-based elastic scheduling with ElasticResource.
This feature uses a best-effort policy and does not guarantee that pods are terminated in the exact reverse order during a scale-in.
The `max` field is available only in cluster versions 1.22 and later and scheduler versions 5.0 and later.
When used with an elastic node pool, this feature may cause the node pool to scale out unnecessarily. To use this feature, include the elastic node pool in a unit and do not set the `max` field for that unit.
If your scheduler version is earlier than 5.0 or your cluster version is 1.20 or earlier, pods that exist before the ResourcePolicy is created are the first to be terminated during a scale-in.
If your scheduler version is earlier than 6.1 or your cluster version is 1.20 or earlier, do not modify the ResourcePolicy until the pods associated with it are completely deleted.
Usage
Create a ResourcePolicy to define the elastic resource priority:
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: test
  namespace: default
spec:
  selector:
    key1: value1
  strategy: prefer
  units:
  - nodeSelector:
      unit: first
    podLabels:
      key1: value1
    podAnnotations:
      key1: value1
    resource: ecs
  - nodeSelector:
      unit: second
    max: 10
    resource: ecs
  - resource: eci
  # Optional, advanced configurations
  preemptPolicy: AfterAllUnits
  ignorePreviousPod: false
  ignoreTerminatingPod: true
  matchLabelKeys:
  - pod-template-hash
  whenTryNextUnits:
    policy: TimeoutOrExceedMax
    timeout: 1m
- `selector`: Specifies that the ResourcePolicy applies to pods in the same namespace that have the label `key1=value1`. If the `selector` is empty, the policy applies to all pods in the namespace.
- `strategy`: The scheduling policy. Currently, only `prefer` is supported.
- `units`: The user-defined scheduling units. During a scale-out, resources are provisioned in the order defined in `units`. During a scale-in, resources are released in the reverse order.
  - `resource`: The type of elastic resource. The supported types are `eci`, `ecs`, `elastic`, and `acs`. The `elastic` type is available in cluster versions 1.24 and later and scheduler versions 6.4.3 and later. The `acs` type is available in cluster versions 1.26 and later and scheduler versions 6.7.1 and later.
    Note: The `elastic` type is deprecated. Use auto-scaling node pools instead by setting `k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true"` in `podLabels`.
    Note: The `acs` type adds the `alibabacloud.com/compute-class: general-purpose` and `alibabacloud.com/compute-qos: default` labels to the pod by default. You can overwrite the default values by declaring different values in `podLabels`. When `alpha.alibabacloud.com/compute-qos-strategy` is declared in `podAnnotations`, the `alibabacloud.com/compute-qos` label is not added by default.
    Important: Scheduler versions earlier than 6.8.3 do not support using multiple `acs` units at the same time.
  - `nodeSelector`: Uses node labels to identify the nodes that belong to this scheduling unit. This parameter applies only to `ecs` resources.
  - `max` (available in scheduler version 5.0 and later): The maximum number of pod replicas that can be scheduled in this scheduling unit.
  - `maxResources` (available in scheduler version 6.9.5 and later): The maximum amount of pod resources that can be scheduled in this scheduling unit.
  - `podAnnotations`: A map of type `map[string]string{}`. The key-value pairs configured in `podAnnotations` are added to the pod by the scheduler. When counting the number of pods in this unit, only pods with these key-value pairs are counted.
  - `podLabels`: A map of type `map[string]string{}`. The key-value pairs configured in `podLabels` are added to the pod by the scheduler. When counting the number of pods in this unit, only pods with these key-value pairs are counted.
    Note: If a unit's `podLabels` contains `k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true"`, or if the number of pods in the current unit is less than the specified `max` value, the scheduler makes the pod wait in the current unit. You can set the waiting time in `whenTryNextUnits`. The `k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true"` label is not added to the pod and is not required on the pod for the pod count.
- `preemptPolicy` (available in scheduler version 6.1 and later; does not apply to ACS units): Specifies whether to allow preemption when scheduling fails for a unit in a ResourcePolicy with multiple units. `BeforeNextUnit` means the scheduler attempts preemption each time scheduling fails for a unit. `AfterAllUnits` means the scheduler attempts preemption only after scheduling fails for the last unit. The default value is `AfterAllUnits`. You can enable preemption by configuring ACK Scheduler parameters. For more information, see Enable preemption.
- `ignorePreviousPod` (available in scheduler version 6.1 and later): Must be used with the `max` field in `units`. If this field is set to `true`, pods that were scheduled before the ResourcePolicy was created are ignored when counting pods.
- `ignoreTerminatingPod` (available in scheduler version 6.1 and later): Must be used with the `max` field in `units`. If this field is set to `true`, pods in the `Terminating` state are ignored when counting pods.
- `matchLabelKeys` (available in scheduler version 6.2 and later): Must be used with the `max` field in `units`. Pods are grouped based on the values of the listed labels, and the `max` count is applied to each group separately. If a pod is missing a label declared in `matchLabelKeys`, the pod is rejected by the scheduler.
- `whenTryNextUnits` (available in cluster version 1.24 and later and scheduler version 6.4 and later): Describes the conditions under which a pod is allowed to use resources from subsequent units.
  - `policy`: The policy used by the pod. Valid values are `ExceedMax`, `LackResourceAndNoTerminating`, `TimeoutOrExceedMax`, and `LackResourceOrExceedMax` (default).
    - `ExceedMax`: Allows the pod to use resources from the next unit if the `max` and `maxResources` fields of the current unit are not set, if the number of pods in the current unit is greater than or equal to the `max` value, or if the used resources in the current unit plus the current pod's resources exceed `maxResources`. This policy can be used with auto scaling and ECI to prioritize auto scaling for node pools.
      Important: If the auto-scaling node pool cannot scale out nodes for a long time, this policy may cause pods to remain in the `Pending` state. Currently, Cluster Autoscaler is not aware of the `max` limit in the ResourcePolicy, so the actual number of scaled-out instances may exceed the `max` value. This issue will be fixed in a future release.
    - `TimeoutOrExceedMax`: Applies when one of the following conditions is met:
      - The `max` field of the current unit is set, and the number of pods in the unit is less than the `max` value (or `maxResources` is set, and the amount of scheduled resources plus the resources of the current pod is less than `maxResources`).
      - The `max` field of the current unit is not set, and the `podLabels` of the current unit contains `k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true"`.
      Under these conditions, if the current unit has insufficient resources to schedule the pod, the pod waits in the current unit for at most the duration specified by `timeout`. This policy can be used with auto scaling and ECI to prioritize auto scaling for node pools and automatically fall back to ECI resources after the timeout (see the sketch after this parameter list).
      Important: If a node is scaled out during the timeout period but is not yet in the `Ready` state, and the pod does not have a toleration for the `NotReady` taint, the pod is still scheduled to an ECI instance.
    - `LackResourceOrExceedMax`: Allows the pod to use resources from the next unit if the number of pods in the current unit is greater than or equal to the `max` value, or if there are no more available resources in the current unit. This is the default policy and is suitable for most basic scenarios.
    - `LackResourceAndNoTerminating`: Allows the pod to use resources from the next unit if the number of pods in the current unit is greater than or equal to the `max` value or there are no more available resources in the current unit, and there are no pods in the `Terminating` state in the current unit. This policy is suitable for rolling updates because it prevents new pods from being scheduled to subsequent units due to pods that are still terminating.
  - `timeout` (not supported for ACS units, which are limited only by `max`): When the policy is `TimeoutOrExceedMax`, this field specifies the timeout duration. If this field is empty, the default timeout period is 15 minutes.
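The following is a minimal sketch, not a definitive configuration, of the pattern referenced above: a unit backed by an auto-scaling node pool that pods wait on via `k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true"`, followed by an ECI unit that is used only after the `TimeoutOrExceedMax` timeout expires. The node pool ID, namespace, and `app: demo` selector are placeholders for illustration:

apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: prefer-node-pool-then-eci
  namespace: default
spec:
  selector:
    app: demo                                  # placeholder; match the labels of your pods
  strategy: prefer
  units:
  - resource: ecs                              # unit backed by an auto-scaling node pool
    nodeSelector:
      alibabacloud.com/nodepool-id: np-placeholder****                 # placeholder node pool ID
    podLabels:
      k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true"      # wait here while the node pool scales out
  - resource: eci                              # fall back to ECI after the timeout
  whenTryNextUnits:
    policy: TimeoutOrExceedMax
    timeout: 10m                               # example value; default is 15 minutes if omitted

In line with the precautions above, no `max` is set on the auto-scaling unit so that the node pool does not scale out unnecessarily.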
Example scenarios
Scenario 1: Prioritize scheduling based on node pools
Assume that you want to deploy an application. The cluster has two node pools: Node Pool A and Node Pool B. You want to schedule pods to Node Pool A first. If resources are insufficient, schedule them to Node Pool B. When scaling in, you want to terminate pods in Node Pool B first, and then terminate pods in Node Pool A. In this example, cn-beijing.10.0.3.137 and cn-beijing.10.0.3.138 belong to Node Pool A, and cn-beijing.10.0.6.47 and cn-beijing.10.0.6.46 belong to Node Pool B. The node specifications are 2 vCores and 4 GB of memory. Follow these steps to prioritize scheduling based on node pools:
Use the following YAML content to create a ResourcePolicy and customize the node pool scheduling order.
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: nginx
  namespace: default
spec:
  selector:
    app: nginx # This must be associated with the label of the pod you create later.
  strategy: prefer
  units:
  - resource: ecs
    nodeSelector:
      alibabacloud.com/nodepool-id: np7ec79f2235954e879de07b780058****
  - resource: ecs
    nodeSelector:
      alibabacloud.com/nodepool-id: npab2df797738644e3a7b7cbf532bb****
Note: You can obtain the node pool ID from the Nodes > Node Pools page of your cluster. For more information, see Create and manage a node pool.
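If you prefer the command line, you can also read the node pool ID directly from the node labels. This assumes the nodes carry the `alibabacloud.com/nodepool-id` label used in the ResourcePolicy above:

kubectl get nodes -L alibabacloud.com/nodepool-id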
Use the following YAML content to create a deployment with two pods.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      name: nginx
      labels:
        app: nginx # This must be associated with the selector of the ResourcePolicy you created in the previous step.
    spec:
      containers:
      - name: nginx
        image: nginx
        resources:
          limits:
            cpu: 2
          requests:
            cpu: 2
Create the Nginx application and view the deployment result.
Run the following command to create the Nginx application.
kubectl apply -f nginx.yaml
Expected output:
deployment.apps/nginx created
Run the following command to view the deployment result.
kubectl get pods -o wide
Expected output:
NAME                    READY   STATUS        RESTARTS   AGE     IP               NODE                           NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-b****   1/1     Running       0          17s     172.29.112.216   cn-beijing.10.0.3.137          <none>           <none>
nginx-9cdf7bbf9-k****   1/1     Running       0          17s     172.29.113.24    cn-beijing.10.0.3.138          <none>           <none>
The output shows that the first two pods are scheduled to the nodes in Node Pool A.
Scale out the pods.
Run the following command to scale out the number of pods to four.
kubectl scale deployment nginx --replicas 4
Expected output:
deployment.apps/nginx scaled
Run the following command to check the pod status.
kubectl get pods -o wide
Expected output:
NAME                    READY   STATUS        RESTARTS   AGE     IP               NODE                           NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-b****   1/1     Running       0          101s    172.29.112.216   cn-beijing.10.0.3.137          <none>           <none>
nginx-9cdf7bbf9-k****   1/1     Running       0          101s    172.29.113.24    cn-beijing.10.0.3.138          <none>           <none>
nginx-9cdf7bbf9-m****   1/1     Running       0          18s     172.29.113.156   cn-beijing.10.0.6.47           <none>           <none>
nginx-9cdf7bbf9-x****   1/1     Running       0          18s     172.29.113.89    cn-beijing.10.0.6.46           <none>           <none>
The output shows that when the nodes in Node Pool A have insufficient resources, pods are scheduled to the nodes in Node Pool B.
Scale in the pods.
Run the following command to scale in the number of pods from four to two.
kubectl scale deployment nginx --replicas 2
Expected output:
deployment.apps/nginx scaled
Run the following command to check the pod status.
kubectl get pods -o wide
Expected output:
NAME                    READY   STATUS        RESTARTS   AGE     IP               NODE                           NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-b****   1/1     Running       0          2m41s   172.29.112.216   cn-beijing.10.0.3.137          <none>           <none>
nginx-9cdf7bbf9-k****   1/1     Running       0          2m41s   172.29.113.24    cn-beijing.10.0.3.138          <none>           <none>
nginx-9cdf7bbf9-m****   0/1     Terminating   0          78s     172.29.113.156   cn-beijing.10.0.6.47           <none>           <none>
nginx-9cdf7bbf9-x****   0/1     Terminating   0          78s     172.29.113.89    cn-beijing.10.0.6.46           <none>           <none>
The output shows that pods in Node Pool B are terminated first, following the reverse order of scheduling.
Scenario 2: Hybrid scheduling with ECS and ECI
Assume that you want to deploy an application. The cluster has three types of resources: subscription Elastic Compute Service (ECS) instances, pay-as-you-go ECS instances, and ECI instances. To reduce resource costs, you want to schedule pods in the following order of priority: subscription ECS instances, pay-as-you-go ECS instances, and ECI instances. When scaling in, you want to terminate pods on ECI instances first, then pods on pay-as-you-go ECS instances, and finally pods on subscription ECS instances. The example nodes have 2 vCores and 4 GB of memory. Follow these steps for hybrid scheduling with ECS and ECI:
Run the following commands to add different labels to nodes based on their billing method. You can also use the node pool feature to automatically add the labels.
kubectl label node cn-beijing.10.0.3.137 paidtype=subscription
kubectl label node cn-beijing.10.0.3.138 paidtype=subscription
kubectl label node cn-beijing.10.0.6.46 paidtype=pay-as-you-go
kubectl label node cn-beijing.10.0.6.47 paidtype=pay-as-you-go
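To confirm that the labels were applied, you can list the nodes with the `paidtype` label displayed as a column:

kubectl get nodes -L paidtype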
Use the following YAML content to create a ResourcePolicy and customize the node pool scheduling order.
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: nginx
  namespace: default
spec:
  selector:
    app: nginx # This must be associated with the label of the pod you create later.
  strategy: prefer
  units:
  - resource: ecs
    nodeSelector:
      paidtype: subscription
  - resource: ecs
    nodeSelector:
      paidtype: pay-as-you-go
  - resource: eci
Use the following YAML content to create a deployment with two pods.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      name: nginx
      labels:
        app: nginx # This must be associated with the selector of the ResourcePolicy you created in the previous step.
    spec:
      containers:
      - name: nginx
        image: nginx
        resources:
          limits:
            cpu: 2
          requests:
            cpu: 2
Create the Nginx application and view the deployment result.
Run the following command to create the Nginx application.
kubectl apply -f nginx.yaml
Expected output:
deployment.apps/nginx created
Run the following command to view the deployment result.
kubectl get pods -o wide
Expected output:
NAME                    READY   STATUS        RESTARTS   AGE     IP               NODE                           NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-b****   1/1     Running       0          66s     172.29.112.215   cn-beijing.10.0.3.137          <none>           <none>
nginx-9cdf7bbf9-r****   1/1     Running       0          66s     172.29.113.23    cn-beijing.10.0.3.138          <none>           <none>
The output shows that the first two pods are scheduled to nodes with the label paidtype=subscription.
Scale out the pods.
Run the following command to scale out the number of pods to four.
kubectl scale deployment nginx --replicas 4
Expected output:
deployment.apps/nginx scaled
Run the following command to check the pod status.
kubectl get pods -o wide
Expected output:
NAME                    READY   STATUS        RESTARTS   AGE     IP               NODE                           NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-4****   1/1     Running       0          16s     172.29.113.155   cn-beijing.10.0.6.47           <none>           <none>
nginx-9cdf7bbf9-b****   1/1     Running       0          3m48s   172.29.112.215   cn-beijing.10.0.3.137          <none>           <none>
nginx-9cdf7bbf9-f****   1/1     Running       0          16s     172.29.113.88    cn-beijing.10.0.6.46           <none>           <none>
nginx-9cdf7bbf9-r****   1/1     Running       0          3m48s   172.29.113.23    cn-beijing.10.0.3.138          <none>           <none>
The output shows that when nodes with the label paidtype=subscription have insufficient resources, pods are scheduled to nodes with the label paidtype=pay-as-you-go.
Run the following command to scale out the number of pods to six.
kubectl scale deployment nginx --replicas 6
Expected output:
deployment.apps/nginx scaled
Run the following command to check the pod status.
kubectl get pods -o wide
Expected output:
NAME                    READY   STATUS        RESTARTS   AGE     IP               NODE                           NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-4****   1/1     Running       0          3m10s   172.29.113.155   cn-beijing.10.0.6.47           <none>           <none>
nginx-9cdf7bbf9-b****   1/1     Running       0          6m42s   172.29.112.215   cn-beijing.10.0.3.137          <none>           <none>
nginx-9cdf7bbf9-f****   1/1     Running       0          3m10s   172.29.113.88    cn-beijing.10.0.6.46           <none>           <none>
nginx-9cdf7bbf9-r****   1/1     Running       0          6m42s   172.29.113.23    cn-beijing.10.0.3.138          <none>           <none>
nginx-9cdf7bbf9-s****   1/1     Running       0          36s     10.0.6.68        virtual-kubelet-cn-beijing-j   <none>           <none>
nginx-9cdf7bbf9-v****   1/1     Running       0          36s     10.0.6.67        virtual-kubelet-cn-beijing-j   <none>           <none>
The output shows that pods are scheduled to ECI resources when the ECS instances have insufficient resources.
Scale in the pods.
Run the following command to scale in the number of pods from six to four.
kubectl scale deployment nginx --replicas 4
Expected output:
deployment.apps/nginx scaled
Run the following command to check the pod status.
kubectl get pods -o wide
Expected output:
NAME                    READY   STATUS        RESTARTS   AGE     IP               NODE                           NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-4****   1/1     Running       0          4m59s   172.29.113.155   cn-beijing.10.0.6.47           <none>           <none>
nginx-9cdf7bbf9-b****   1/1     Running       0          8m31s   172.29.112.215   cn-beijing.10.0.3.137          <none>           <none>
nginx-9cdf7bbf9-f****   1/1     Running       0          4m59s   172.29.113.88    cn-beijing.10.0.6.46           <none>           <none>
nginx-9cdf7bbf9-r****   1/1     Running       0          8m31s   172.29.113.23    cn-beijing.10.0.3.138          <none>           <none>
nginx-9cdf7bbf9-s****   1/1     Terminating   0          2m25s   10.0.6.68        virtual-kubelet-cn-beijing-j   <none>           <none>
nginx-9cdf7bbf9-v****   1/1     Terminating   0          2m25s   10.0.6.67        virtual-kubelet-cn-beijing-j   <none>           <none>
The output shows that pods on ECI instances are terminated first, in reverse order of scheduling.
Run the following command to scale in the number of pods from four to two.
kubectl scale deployment nginx --replicas 2
Expected output:
deployment.apps/nginx scaled
Run the following command to check the pod status.
kubectl get pods -o wide
Expected output:
NAME                    READY   STATUS        RESTARTS   AGE     IP               NODE                           NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-4****   0/1     Terminating   0          6m43s   172.29.113.155   cn-beijing.10.0.6.47           <none>           <none>
nginx-9cdf7bbf9-b****   1/1     Running       0          10m     172.29.112.215   cn-beijing.10.0.3.137          <none>           <none>
nginx-9cdf7bbf9-f****   0/1     Terminating   0          6m43s   172.29.113.88    cn-beijing.10.0.6.46           <none>           <none>
nginx-9cdf7bbf9-r****   1/1     Running       0          10m     172.29.113.23    cn-beijing.10.0.3.138          <none>           <none>
The output shows that pods on nodes with the label paidtype=pay-as-you-go are terminated first, in reverse scheduling order.
Run the following command to check the pod status.
kubectl get pods -o wide
Expected output:
NAME                    READY   STATUS        RESTARTS   AGE     IP               NODE                           NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-b****   1/1     Running       0          11m     172.29.112.215   cn-beijing.10.0.3.137          <none>           <none>
nginx-9cdf7bbf9-r****   1/1     Running       0          11m     172.29.113.23    cn-beijing.10.0.3.138          <none>           <none>
The output shows that only pods on nodes with the paidtype=subscription label remain.
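If you created the resources in these scenarios only for testing, you can remove them afterwards. This sketch assumes the resource names used above:

kubectl delete deployment nginx
kubectl delete resourcepolicies.scheduling.alibabacloud.com nginx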
References
When you deploy services in an ACK cluster, you can use tolerations and node affinity to use only ECS or ECI resources, or to automatically request ECI resources when ECS resources are insufficient. By configuring scheduling policies, you can meet different requirements for elastic resources in various workload scenarios. For more information, see Specify resource allocation between ECS and ECI.
High availability and high performance are important requirements for distributed tasks. In an ACK managed cluster Pro edition, you can use native Kubernetes scheduling semantics to spread distributed tasks across zones for high availability. You can also use these semantics to implement affinity-based deployment of distributed tasks in a specific zone for high performance. For more information, see Spread ECI pods across zones and schedule ECI pods with affinity.