
Container Service for Kubernetes: Customize OS parameters for a node pool

Last Updated: Aug 15, 2025

If the default operating system (OS) parameters for a Linux system do not meet your needs, you can customize the OS parameters for nodes in a node pool to tune system performance. After you customize the parameters, the system applies the changes to nodes in batches. These changes take effect immediately on existing nodes in the node pool, and new nodes will also use the new configurations.

Usage notes

This feature is supported only by ACK managed clusters, ACK dedicated clusters (which can no longer be created), and ACK Edge clusters that run Kubernetes 1.28 or later. To upgrade a cluster, see Manually upgrade a cluster.

Precautions

  • Modifying the OS configurations of a node might change the configurations of existing pods on the node and cause the pods to be recreated. Before you proceed, make sure that your application is highly available.

  • Improper OS parameter adjustments alter the behavior of the Linux kernel and can degrade node performance or make the node unavailable, which can affect your services. Make sure that you fully understand what a parameter does before you modify it, and test changes thoroughly before you apply them in a production environment.

Configure OS parameters for a node pool

You can configure sysctl parameters and Transparent Huge Pages (THP) parameters for a node pool. All of these parameters can be configured by modifying configuration files; THP parameters and some sysctl parameters can also be configured in the console or by using the OpenAPI.

Configure in the console or using the OpenAPI

Console

After you customize OS parameters, the system applies the changes to nodes in batches. The new configurations take effect immediately on existing nodes and are automatically applied to new nodes. Because applying custom OS parameters changes the OS configurations of existing nodes, this operation might affect your services. We recommend that you perform this operation during off-peak hours.

  1. Log on to the ACK console. In the navigation pane on the left, click Clusters.

  2. On the Clusters page, find the cluster to manage and click its name. In the left-side navigation pane, choose Nodes > Node Pools.

  3. In the Actions column for the target node pool, open the more actions menu and click OS Configuration.

  4. Read the precautions on the page. Click + Custom Parameters, select the parameters to configure, and specify the nodes to upgrade. Set Maximum Concurrent Nodes Per Batch to a value up to 10, and then click Submit. Follow the on-screen instructions to complete the operation.

    After you set Maximum Concurrent Nodes Per Batch, the OS configurations are applied to nodes in batches. During this process, you can view the progress and control the operation (for example, pause, resume, or cancel it) in the Event History area. You can use the pause feature to verify the nodes that have already been configured. When the task is paused, configuration continues on nodes that are already being configured until it is complete. Nodes on which configuration has not started do not receive the custom configuration until you resume the task.

    Important

    Complete the custom configuration task as soon as possible. A paused task is automatically canceled after 7 days, and the related events and logs are then cleared.

OpenAPI

In addition to using the console, you can customize OS parameters by calling the ModifyNodePoolNodeConfig API operation.
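For reference, the following is a hedged sketch of such a call made with the Alibaba Cloud CLI (aliyun). The request path and the body fields shown here (os_config, sysctl, rolling_policy, max_parallelism) are assumptions based on the console options described in this topic; confirm the exact schema in the ModifyNodePoolNodeConfig API reference before you use it.

  # Sketch only: replace <cluster_id> and <nodepool_id>, and verify the path and
  # body schema against the ModifyNodePoolNodeConfig API reference.
  aliyun cs PUT /clusters/<cluster_id>/nodepools/<nodepool_id>/node_config \
    --header "Content-Type=application/json" \
    --body '{
      "os_config": {
        "sysctl": { "net.core.somaxconn": 65535 }
      },
      "rolling_policy": { "max_parallelism": 1 }
    }'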

Configure using a configuration file

ACK lets you write custom parameters to the /etc/sysctl.d/99-user-customized.conf file. This file is reserved for custom configurations during node initialization and restart. The sysctl parameters in this file take precedence when the node restarts, overwriting the default OS values and the values set using the custom sysctl configuration feature for the node pool.

Important

Adjusting sysctl parameters alters the behavior of the Linux kernel. This can degrade node performance or make the node unavailable, which can affect your services. Fully assess the risks before you make changes.

  • For existing nodes in the node pool, you can log on to a node to modify this custom parameter file. Then, you must manually run the sysctl -p /etc/sysctl.d/99-user-customized.conf command to apply the configuration.

  • For future nodes that are added to the node pool, you can add a script that writes to the custom parameter file to the user data for the node pool instances. This ensures that new nodes use these custom parameter values by default. The procedure is as follows.

    In the User Data field of the node pool configuration, add the command echo '${sysctl_key}=${sysctl_value}' > /etc/sysctl.d/99-user-customized.conf. Replace ${sysctl_key} and ${sysctl_value} with the actual values. This command writes the custom configuration to the specified configuration file in the /etc/sysctl.d/ directory. An example script is shown below.

    For more information about how to access the configuration page, see Create and manage a node pool.

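    For example, the following snippet is a minimal sketch of what you might add to the User Data field. The parameter keys and values are placeholders; replace them with the parameters that you actually want to customize.

      # Write the custom sysctl settings to the reserved configuration file.
      echo 'net.core.somaxconn=65535' > /etc/sysctl.d/99-user-customized.conf
      echo 'fs.inotify.max_user_instances=8192' >> /etc/sysctl.d/99-user-customized.conf

      # Apply the settings immediately so that the new node does not require a restart.
      sysctl -p /etc/sysctl.d/99-user-customized.conf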

List of sysctl parameters

Note
  • In the following table, Default Value refers to the value that ACK sets by default when a node pool is initialized.

  • For the value ranges of the following parameters, see the Linux Kernel sysctl parameters documentation.

  • For parameters that are not yet supported in the console or using the OpenAPI, and for parameters not listed in the table, see Configure using a configuration file for instructions on how to modify them.

Field name | Description | Default value | Supported in the console or using the OpenAPI
--- | --- | --- | ---
fs.aio-max-nr | The maximum number of asynchronous I/O operations. | 65536 | Yes
fs.file-max | The maximum number of file handles that the system can open. | 2097152 | Yes
fs.inotify.max_user_watches | The maximum number of inotify watches that a single user can create. | 524288 | Yes
fs.nr_open | The maximum number of file descriptors that a single process can open. This value must be less than the value of fs.file-max. | 1048576 | Yes
kernel.pid_max | The maximum number of PIDs that the system can assign. | 4194303 | Yes
kernel.threads-max | The maximum number of threads that the system can create. | 504581 | Yes
net.core.netdev_max_backlog | The maximum number of packets that can be queued on the INPUT side when the interface receives packets faster than the kernel can process them. | 16384 | Yes
net.core.optmem_max | The maximum size of the ancillary buffer for each network socket, in bytes. | 20480 | Yes
net.core.rmem_max | The maximum size of the receive buffer for each network socket, in bytes. | 16777216 | Yes
net.core.wmem_max | The maximum size of the send buffer for each network socket, in bytes. | 16777216 | Yes
net.core.wmem_default | The default size of the send buffer for each network socket, in bytes. | 212992 | Yes
net.ipv4.tcp_mem | The amount of memory that the TCP stack can use, in memory pages (usually 4 KB). The parameter consists of three integer values (the low, pressure, and high thresholds) that must be specified in ascending order. | Dynamically calculated based on the total system memory | Yes
net.ipv4.neigh.default.gc_thresh1 | The minimum number of entries to keep in the ARP cache. Garbage collection is not performed if the number of cached entries is below this value. | System preset | Yes
net.ipv4.neigh.default.gc_thresh2 | The soft maximum number of entries in the ARP cache. When the number of cached entries reaches this value, garbage collection is not triggered immediately; the system waits 5 seconds before it collects. | 1024 | Yes
net.ipv4.neigh.default.gc_thresh3 | The hard maximum number of entries in the ARP cache. When the number of cached entries reaches this value, the system immediately performs garbage collection and continues to clean up as long as the value is exceeded. | 8192 | Yes
user.max_user_namespaces | The maximum number of user namespaces that a single user can create. | 0 | Yes
kernel.softlockup_panic | Specifies whether the kernel triggers a panic and restarts the system when a soft lockup occurs, to quickly restore the system state. | 1 | No
kernel.softlockup_all_cpu_backtrace | Specifies whether to capture backtraces for all CPUs when a soft lockup is detected, to facilitate diagnosis. | 1 | No
vm.max_map_count | The maximum number of memory mapping areas that a single process can have, which prevents excessive memory usage. | 262144 | No
net.core.somaxconn | The maximum length of the socket listen queue, which controls the concurrent connection processing capacity. | 32768 | No
net.ipv4.tcp_wmem | The minimum, default, and maximum sizes of the TCP send buffer, in bytes. This setting directly affects the network throughput and memory usage of TCP connections. | 4096 12582912 16777216 | No
net.ipv4.tcp_rmem | The minimum, default, and maximum sizes of the TCP receive buffer, in bytes. This setting directly affects the network throughput and memory usage of TCP connections. | 4096 12582912 16777216 | No
net.ipv4.tcp_max_syn_backlog | The maximum number of connection requests in the SYN queue that have not completed the three-way handshake. | 8096 | No
net.ipv4.tcp_slow_start_after_idle | Specifies whether a TCP connection re-enters slow start after a long idle period. | 0 | No
net.ipv4.ip_forward | Specifies whether to enable IPv4 packet forwarding, which allows the system to act as a router and forward packets. | 1 | No
net.bridge.bridge-nf-call-iptables | Specifies whether bridge devices apply Layer 3 iptables rules during Layer 2 forwarding, which ensures that network security policies take effect. | 1 | No
fs.inotify.max_user_instances | The maximum number of inotify instances that a single user can create, which prevents resource exhaustion. | 16384 | No
fs.inotify.max_queued_events | The maximum number of file system events that can be queued in the kernel. | 16384 | No
fs.may_detach_mounts | Specifies whether the kernel can detach a mount point from a namespace while it is still being accessed by a process, which prevents the entire namespace from being blocked. | 1 | No
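To check the values that are currently in effect on a node, you can query the parameters directly with the sysctl command. This is standard Linux tooling rather than an ACK-specific interface; the parameter names below are taken from the preceding table.

  # Query a single parameter, for example the system-wide file handle limit.
  sysctl fs.file-max

  # Query several parameters at once.
  sysctl fs.nr_open net.core.somaxconn net.ipv4.tcp_max_syn_backlog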

List of THP parameters

Transparent Huge Pages (THP) is a Linux kernel feature that automatically merges small pages (usually 4 KB) into huge pages (usually 2 MB or larger). This reduces the size of the page table and the number of accesses to Page Table Entries (PTEs), eases the pressure on the Translation Lookaside Buffer (TLB) cache, and improves memory access efficiency.
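On most Linux distributions, the THP settings listed below are exposed through files under /sys/kernel/mm/transparent_hugepage/. As a point of reference (this is standard kernel behavior rather than an ACK-specific interface), you can inspect the current values on a node as follows.

  # Global THP mode (the transparent_enabled setting); the value in brackets is active,
  # for example: always [madvise] never
  cat /sys/kernel/mm/transparent_hugepage/enabled

  # Defragmentation mode (the transparent_defrag setting).
  cat /sys/kernel/mm/transparent_hugepage/defrag

  # khugepaged tunables live in a subdirectory.
  cat /sys/kernel/mm/transparent_hugepage/khugepaged/defrag
  cat /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs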

Note
  • All the following parameters can be configured in the console or using the OpenAPI.

  • The default values of the following parameters vary based on the operating system and its kernel version. For more information, see the Linux Kernel THP parameters documentation.

transparent_enabled

Specifies whether to enable the THP feature globally in the system. Possible values:

  • always: Enables the THP feature globally in the system.

  • never: Disables the THP feature globally in the system.

  • madvise: Enables the THP feature only in memory regions that are marked with MADV_HUGEPAGE using the madvise() system call.

transparent_defrag

Specifies whether to enable defragmentation related to THP. When enabled, small pages in memory can be merged into huge pages, which reduces the page table size and can improve system performance. Possible values:

  • always: When the system fails to allocate a transparent huge page, it pauses the memory allocation and waits for the system to perform direct memory reclaim and direct memory defragmentation. After memory reclaim and defragmentation are complete, if there is enough contiguous free memory, the system allocates a transparent huge page.

  • defer: When the system fails to allocate a transparent huge page, it allocates a normal 4 KB page instead. At the same time, the system wakes up the kswapd daemon process for background memory reclaim and the kcompactd daemon process for background memory defragmentation. After a period of time, if there is enough contiguous free memory, the khugepaged daemon process can merge the previously allocated 4 KB pages into a 2 MB transparent huge page.

  • madvise: For memory regions marked with MADV_HUGEPAGE using the madvise() system call, the memory allocation behavior is the same as always. For other memory regions, a normal 4 KB page is allocated when a page fault occurs.

  • defer+madvise: For memory regions marked with MADV_HUGEPAGE using the madvise() system call, the memory allocation behavior is the same as always. For other memory regions, the memory allocation behavior is the same as defer.

  • never: Disables defragmentation.

khugepaged_defrag

khugepaged is a kernel thread that manages and defragments huge pages to reduce memory fragmentation and improve performance. It monitors memory in the system and, when it finds scattered small pages, attempts to merge them into larger contiguous pages to improve memory utilization and performance.

Because this operation takes locks in the memory allocation path, the khugepaged daemon process might start scanning and converting pages at inopportune times, which can affect application performance.

Possible values:

  • 0: Disables the khugepaged defragmentation feature.

  • 1: The khugepaged daemon process periodically wakes up when the system is idle and attempts to merge contiguous 4 KB pages into 2 MB transparent huge pages.

khugepaged_alloc_sleep_millisecs

The time, in milliseconds, that the khugepaged daemon process waits before the next huge page allocation attempt after a THP allocation fails. This avoids repeated allocation failures in a short period.

For more information, see khugepaged defragmentation.

khugepaged_scan_sleep_millisecs

The interval, in milliseconds, at which the khugepaged daemon process wakes up.

khugepaged_pages_to_scan

The number of memory pages that the khugepaged daemon process scans each time it wakes up.