Container Service for Kubernetes: Enable scheduling features

Last Updated: Aug 14, 2025

When you deploy GPU computing jobs in an ACK managed cluster Pro, you can assign scheduling property labels to GPU nodes. These labels enable features such as exclusive scheduling, shared scheduling, topology-aware scheduling, and card model scheduling, which help optimize resource utilization and schedule applications precisely.

Scheduling labels

GPU scheduling labels identify GPU models and resource allocation policies. This enables fine-grained resource management and efficient scheduling.

Exclusive scheduling (Default)

  • Label value: ack.node.gpu.schedule: default

  • Scenarios: High-performance jobs that require exclusive use of an entire GPU card, such as model training and HPC.

Shared scheduling

  • Label values: ack.node.gpu.schedule: cgpu, core_mem, share, or mps

  • Scenarios: Improves GPU utilization. Suitable for scenarios where multiple lightweight jobs run concurrently, such as multitenancy or inference workloads.

    • cgpu: Shares computing power and isolates video memory. Based on Alibaba Cloud cGPU technology.

    • core_mem: Isolates both computing power and video memory.

    • share: Shares both computing power and video memory with no isolation.

    • mps: Shares computing power and isolates video memory. Based on NVIDIA Multi-Process Service (MPS) isolation combined with Alibaba Cloud cGPU technology.

  • Label values: ack.node.gpu.placement: binpack or spread

  • Scenarios: Optimizes the resource allocation policy across multiple GPU cards on a single node after cgpu, core_mem, share, or mps shared scheduling is enabled.

    • binpack: (Default) Schedules pods compactly across cards. Fills one GPU with pods before assigning pods to the next. This reduces resource fragmentation and is ideal for scenarios that prioritize resource utilization or energy savings.

    • spread: Distributes pods across different GPUs. This reduces the impact of a single card failure and is suitable for high availability (HA) jobs.

Topology-aware scheduling

  • Label value: ack.node.gpu.schedule: topology

  • Scenarios: Automatically assigns pods to the GPU combination with the optimal communication bandwidth based on the physical GPU topology within a single node. Suitable for jobs that are sensitive to inter-GPU communication latency.

Card model scheduling

  • Label values:

    aliyun.accelerator/nvidia_name: <GPU_card_name>
    aliyun.accelerator/nvidia_mem: <video_memory_per_card>
    aliyun.accelerator/nvidia_count: <total_number_of_GPU_cards>

  • Scenarios: Schedules jobs to nodes with specified GPU models, or keeps jobs away from nodes with specified models. Use nvidia_mem and nvidia_count together with nvidia_name to specify the video memory capacity per card and the total number of GPU cards for a GPU job.

Enable scheduling features

Exclusive scheduling

If a node has no GPU scheduling labels, exclusive scheduling is enabled by default. In this mode, the node allocates GPU resources to pods in units of a single GPU.
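
In exclusive mode, pods consume GPUs through the standard nvidia.com/gpu resource, as in the Job examples later in this topic. The following is a minimal sketch of such a request; the pod name and image are placeholders rather than values from this topic.

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-exclusive-demo                        # placeholder name
    spec:
      restartPolicy: Never
      containers:
      - name: cuda-app
        image: registry.example.com/cuda-app:latest   # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 1                         # requests one entire GPU card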

If other GPU scheduling features are enabled, removing the labels does not restore exclusive scheduling. You must manually change the label value to ack.node.gpu.schedule: default to restore the exclusive scheduling feature.
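
For example, the following command restores exclusive scheduling on a node:

    kubectl label node <NODE_NAME> ack.node.gpu.schedule=default --overwrite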

Shared scheduling

Shared scheduling is supported only in ACK managed cluster Pro. For more information, see Limits.

  1. Install the ack-ai-installer shared scheduling component.

    1. Log on to the ACK console. In the navigation pane on the left, click Clusters.

    2. On the Clusters page, find the cluster you want and click its name. In the left-side navigation pane, choose Applications > Cloud-native AI Suite.

    3. On the Cloud-native AI Suite page, click Deploy. On the Deploy Cloud-native AI Suite page, select Scheduling Policy Extension (Batch Scheduling, GPU Sharing, GPU Topology Awareness).

      For more information about how to set the computing power scheduling policy for cGPU, see Install and use the cGPU component.
    4. On the Cloud-native AI Suite page, click Deploy Cloud-native AI Suite.

      On the Cloud-native AI Suite page, find the installed shared GPU component ack-ai-installer in the component list.

  2. Enable the shared scheduling feature.

    1. On the Clusters page, click the name of the target cluster. In the navigation pane on the left, choose Node Management > Node Pools.

    2. On the Node Pools page, click Create Node Pool, configure the node labels, and then click Confirm.

      You can keep the default settings for other configuration items. For more information about the scenarios for node labels, see Scheduling labels.
      • Configure basic shared scheduling.

        Click the Node Labels icon, set Key to ack.node.gpu.schedule, and set the label value to one of the following: cgpu, core_mem, share, or mps (mps requires the MPS Control Daemon component to be installed).

      • Configure multi-card shared scheduling.

        If a node has multiple GPUs, you can configure multi-card shared scheduling to optimize resource allocation.

        Click the Node Labels icon, set Key to ack.node.gpu.placement, and set the label value to binpack or spread.

  3. Verify that shared scheduling is enabled.

    cgpu/share/mps

    Replace <NODE_NAME> with the name of your target node and run the following command to verify that cgpu, share, or mps shared scheduling is enabled for the node.

    kubectl get nodes <NODE_NAME> -o yaml | grep "aliyun.com/gpu-mem"

    Expected output:

    aliyun.com/gpu-mem: "60"

    If the value of the aliyun.com/gpu-mem field is not 0, cgpu, share, or mps shared scheduling is enabled. For an example of how a pod requests this shared resource, see the sketch after this procedure.

    core_mem

    Replace <NODE_NAME> with the name of your target node and run the following command to verify that core_mem shared scheduling is enabled for the node.

    kubectl get nodes <NODE_NAME> -o yaml | grep -E 'aliyun\.com/gpu-core\.percentage|aliyun\.com/gpu-mem'

    Expected output:

    aliyun.com/gpu-core.percentage: "80"
    aliyun.com/gpu-mem: "6"

    If the values of the aliyun.com/gpu-core.percentage and aliyun.com/gpu-mem fields are not 0, core_mem shared scheduling is enabled.

    binpack

    Use the GPU resource query tool for shared GPU scheduling and run the following command to query the GPU resource allocation of the node:

    kubectl inspect cgpu

    Expected output:

    NAME                     IPADDRESS    GPU0(Allocated/Total)  GPU1(Allocated/Total)  GPU2(Allocated/Total)  GPU3(Allocated/Total)  GPU Memory(GiB)
    cn-shanghai.192.0.2.109  192.0.2.109  15/15                  9/15                   0/15                   0/15                   24/60
    --------------------------------------------------------------------------------------
    Allocated/Total GPU Memory In Cluster:
    24/60 (40%)

    The output shows that GPU0 is fully allocated (15/15) and GPU1 is partially allocated (9/15). This matches the strategy of filling one GPU before allocating resources to the next, which confirms that the binpack policy is in effect.

    spread

    Use the GPU resource query tool for shared GPU scheduling and run the following command to query the GPU resource allocation of the node:

    kubectl inspect cgpu

    Expected output:

    NAME                     IPADDRESS    GPU0(Allocated/Total)  GPU1(Allocated/Total)  GPU2(Allocated/Total)  GPU3(Allocated/Total)  GPU Memory(GiB)
    cn-shanghai.192.0.2.109  192.0.2.109  4/15                   4/15                   0/15                   4/15                   12/60
    --------------------------------------------------------------------------------------
    Allocated/Total GPU Memory In Cluster:
    12/60 (20%)

    The output shows that 4/15 of the resources are allocated to GPU0, 4/15 to GPU1, and 4/15 to GPU3. This confirms that the spread policy is in effect because the pods are distributed across different GPUs.
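
After shared scheduling is enabled, pods request GPU resources by the shared resource names shown in the verification steps above instead of nvidia.com/gpu. The following is a minimal sketch of such a request; the pod name, image, and requested amounts are placeholders.

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-share-demo                            # placeholder name
    spec:
      restartPolicy: Never
      containers:
      - name: cuda-app
        image: registry.example.com/cuda-app:latest   # placeholder image
        resources:
          limits:
            aliyun.com/gpu-mem: 4                     # GiB of video memory to allocate
            # When core_mem is enabled, computing power can also be limited, for example:
            # aliyun.com/gpu-core.percentage: 30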

Topology-aware scheduling

Topology-aware scheduling is supported only in ACK managed cluster Pro. For more information, see System component version requirements.

  1. Install the ack-ai-installer component. For the installation steps, see Shared scheduling in this topic.

  2. Enable topology-aware scheduling.

    Replace <NODE_NAME> with the name of your target node and run the following command to add a label to the node. This activates the topology-aware scheduling feature for the node.

    kubectl label node <NODE_NAME> ack.node.gpu.schedule=topology

    After you activate topology-aware scheduling for a node, the node no longer supports scheduling of non-topology-aware GPU resources. To restore exclusive scheduling, run the kubectl label node <NODE_NAME> ack.node.gpu.schedule=default --overwrite command to change the label.
  3. Verify that topology-aware scheduling is enabled.

    Replace <NODE_NAME> with the name of your target node and run the following command to verify that topology-aware scheduling is enabled for the node.

    kubectl get nodes <NODE_NAME> -o yaml | grep aliyun.com/gpu

    Expected output:

    aliyun.com/gpu: "2"

    If the value of the aliyun.com/gpu field is not 0, topology-aware scheduling is enabled.
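
A pod consumes topology-aware GPUs through the aliyun.com/gpu resource shown in the verification output. The following is a minimal sketch of such a request; the pod name, image, and requested amount are placeholders.

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-topology-demo                         # placeholder name
    spec:
      restartPolicy: Never
      containers:
      - name: cuda-app
        image: registry.example.com/cuda-app:latest   # placeholder image
        resources:
          limits:
            aliyun.com/gpu: 2                         # number of topology-aware GPU cards to allocate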

Card model scheduling

You can schedule a Job to a node with a specified GPU model or avoid a specific model.

  1. Check the GPU model of the node.

    Run the following command to query the GPU models of the nodes in the cluster. The GPU model name is shown in the NVIDIA_NAME column.

    kubectl get nodes -L aliyun.accelerator/nvidia_name

    The expected output is similar to the following:

    NAME                        STATUS   ROLES    AGE   VERSION            NVIDIA_NAME
    cn-shanghai.192.XX.XX.176   Ready    <none>   17d   v1.26.3-aliyun.1   Tesla-V100-SXM2-32GB
    cn-shanghai.192.XX.XX.177   Ready    <none>   17d   v1.26.3-aliyun.1   Tesla-V100-SXM2-32GB

    More ways to check the GPU model:

    On the Clusters page, click the name of the target cluster. In the navigation pane on the left, choose Workloads > Pods. In the row of the pod that you created (for example, tensorflow-mnist-multigpu-***), click Terminal in the Actions column, select the container that you want to log on to from the drop-down list, and run the following commands.

    • Query the GPU model: nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0 | sed -e 's/ /-/g'

    • Query the video memory capacity of each GPU: nvidia-smi --id=0 --query-gpu=memory.total --format=csv,noheader | sed -e 's/ //g'

    • Query the total number of GPUs on the node: nvidia-smi -L | wc -l


  2. Enable card model scheduling.

    1. On the Clusters page, find the cluster you want and click its name. In the left-side pane, choose Workloads > Jobs.

    2. On the Jobs page, click Create From YAML. Use the following examples to create an application and enable the card model scheduling feature.


      Specify a particular card model

      Use the GPU model label to run your application on nodes with a specific GPU model.

      In the YAML file, replace Tesla-V100-SXM2-32GB in aliyun.accelerator/nvidia_name: "Tesla-V100-SXM2-32GB" with the actual GPU model of your node.


      apiVersion: batch/v1
      kind: Job
      metadata:
        name: tensorflow-mnist
      spec:
        parallelism: 1
        template:
          metadata:
            labels:
              app: tensorflow-mnist
          spec:
            nodeSelector:
              aliyun.accelerator/nvidia_name: "Tesla-V100-SXM2-32GB" # Runs the application on a Tesla V100-SXM2-32GB GPU.
            containers:
            - name: tensorflow-mnist
              image: registry.cn-beijing.aliyuncs.com/acs/tensorflow-mnist-sample:v1.5
              command:
              - python
              - tensorflow-sample-code/tfjob/docker/mnist/main.py
              - --max_steps=1000
              - --data_dir=tensorflow-sample-code/data
              resources:
                limits:
                  nvidia.com/gpu: 1
              workingDir: /root
            restartPolicy: Never

      After the Job is created, you can choose Workloads > Pods in the navigation pane on the left. In the pod list, you can see that an example pod is successfully scheduled to a matching node, which demonstrates flexible scheduling based on the GPU model label.

      Exclude a particular card model

      Use the GPU model label with node affinity and anti-affinity to prevent your application from running on certain GPU models.

      In the YAML file, replace Tesla-V100-SXM2-32GB under values with the actual GPU model of your node.


      apiVersion: batch/v1
      kind: Job
      metadata:
        name: tensorflow-mnist
      spec:
        parallelism: 1
        template:
          metadata:
            labels:
              app: tensorflow-mnist
          spec:
            affinity:
              nodeAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  nodeSelectorTerms:
                  - matchExpressions:
                    - key: aliyun.accelerator/nvidia_name  # Card model scheduling label
                      operator: NotIn
                      values:
                      - "Tesla-V100-SXM2-32GB"            # Prevents the pod from being scheduled to a node with a Tesla-V100-SXM2-32GB card.
            containers:
            - name: tensorflow-mnist
              image: registry.cn-beijing.aliyuncs.com/acs/tensorflow-mnist-sample:v1.5
              command:
              - python
              - tensorflow-sample-code/tfjob/docker/mnist/main.py
              - --max_steps=1000
              - --data_dir=tensorflow-sample-code/data
              resources:
                limits:
                  nvidia.com/gpu: 1
              workingDir: /root
            restartPolicy: Never

      After the Job is created, the application is not scheduled to nodes that have the label key aliyun.accelerator/nvidia_name and the value Tesla-V100-SXM2-32GB. However, it can be scheduled to GPU nodes with other GPU models.
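
      In either case, you can check which node the pods were scheduled to. For example, the following command uses the app label from the YAML files above:

      kubectl get pods -l app=tensorflow-mnist -o wide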