This topic describes how to install, configure, and use the ACK GPU fault detection component to monitor GPU resources in your ACK clusters, helping you improve cluster reliability and efficiency.
Prerequisites
You have installed ack-node-problem-detector (NPD), and the component version is 1.2.26 or later.
When ack-nvidia-device-plugin 0.17.0 or later is used together with NPD 1.2.26 or later, NPD automatically isolates a GPU card when a GPU fault is detected, and automatically removes the isolation when NPD detects that the GPU has returned to a normal state.
For more information about how to view the version of the ack-nvidia-device-plugin component and upgrade the component, see View the version of Nvidia Device Plugin.
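To confirm the installed component versions from the command line, you can query the images of the corresponding DaemonSets. The names used below (ack-node-problem-detector-daemonset and an nvidia device plugin DaemonSet in kube-system) are assumptions that may differ in your cluster; list the DaemonSets first if unsure.

```bash
# Check the NPD version (assumes the DaemonSet is named
# ack-node-problem-detector-daemonset in the kube-system namespace).
kubectl -n kube-system get daemonset ack-node-problem-detector-daemonset \
  -o jsonpath='{.spec.template.spec.containers[0].image}'; echo

# Find the ack-nvidia-device-plugin workload (its name may differ by
# cluster) and read the version from its image tag.
kubectl -n kube-system get daemonset | grep -i nvidia
```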
ack-node-problem-detector (NPD) is a cluster node anomaly monitoring component that ACK developed based on the open source node-problem-detector project. NPD provides a variety of GPU fault detection items to enhance fault detection capabilities in GPU scenarios. When a fault is detected, the component generates a Kubernetes event or Kubernetes node condition based on the fault type.
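As a minimal sketch of how a detected fault surfaces, you can inspect node conditions and node events with kubectl. The assumption here is that the condition type and event reason match the detection item names listed in the table below (for example, NvidiaXID48Error).

```bash
# List the conditions reported on a node (replace <node-name>); NPD-set
# conditions use the detection item name as the condition type.
kubectl get node <node-name> \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\n"}{end}'

# List recent events for the node across namespaces; the reason field is
# assumed to carry the detection item name.
kubectl get events -A --field-selector involvedObject.name=<node-name>
```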
Notes
NVIDIA Xid and SXid errors are written to /var/log/messages or /var/log/syslog by the GPU driver through the NVRM event mechanism. NPD records whether each Xid and SXid error has been processed. If you restart the node after an Xid or SXid error is detected, NPD considers the error resolved and does not generate an event or node condition for it again, regardless of whether the underlying problem has actually been fixed. For example, Xid 79 indicates that the GPU device must be replaced, which a node restart does not accomplish.

NPD detects NVIDIA Xid and SXid errors by checking the /var/log/messages or /var/log/syslog file on the node. If dmesg logs are redirected to other files, ack-node-problem-detector cannot detect Xid or SXid errors.
Detection items and recovery suggestions
A recovery suggestion of None indicates that no action is required for the hardware. We recommend that you check whether the application configuration is correct.
| Detection item name | Whether a node condition is generated | Whether an event is generated | Description | Whether the GPU card is isolated by default | Recovery suggestion |
| --- | --- | --- | --- | --- | --- |
| NvidiaXID13Error | No | Yes | | No | None |
| NvidiaXID31Error | No | Yes | | No | None |
| NvidiaXID43Error | No | Yes | | No | None |
| NvidiaXID45Error | No | Yes | | No | None |
| NvidiaXID48Error | Yes | Yes | | Yes | Restart the node. |
| NvidiaXID63Error | No | Yes | | No | None |
| NvidiaXID64Error | No | Yes | | No | None |
| NvidiaXID74Error | Yes | Yes | | Yes | Hardware maintenance. |
| NvidiaXID79Error | Yes | Yes | | Yes | Hardware maintenance. |
| NvidiaXID94Error | No | Yes | | No | None |
| NvidiaXID95Error | Yes | Yes | | Yes | Restart the node. |
| NvidiaXID119Error | Yes | Yes | | Yes | Restart the node. |
| NvidiaXID120Error | Yes | Yes | | Yes | Restart the node. |
| NvidiaXID140Error | Yes | Yes | | Yes | Restart the node. |
| NvidiaEccModeNotEnabled | Yes | Yes (events are continuously generated until the issue is resolved) | The ECC mode is disabled for the node. | No | Enable the ECC mode for the node. |
| NvidiaPendingRetiredPages | Yes | Yes (events are continuously generated until the issue is resolved) | | Yes | Restart the node. |
| NvidiaRemappingRowsFailed | Yes | Yes (events are continuously generated until the issue is resolved) | Row remapping fails on the GPU. | Yes | Hardware maintenance. |
| NvidiaRemappingRowsRequireReset | Yes | Yes (events are continuously generated until the issue is resolved) | Uncorrectable and uncontained errors have occurred on the GPU. To fix these errors, you must reset the GPU. We recommend that you reset the GPU at your earliest opportunity to restore normal operations. | Yes | Restart the node. |
| NvidiaDeviceLost | Yes | Yes (events are continuously generated until the issue is resolved) | | Yes | Hardware maintenance. |
| NvidiaInfoRomCorrupted | Yes | Yes (events are continuously generated until the issue is resolved) | | Yes | Hardware maintenance. |
| NvidiaPowerCableErr | Yes | Yes (events are continuously generated until the issue is resolved) | | Yes | Hardware maintenance. |
| NvidiaXID44Error | Yes | Yes | | Yes | Restart the node. |
| NvidiaXID61Error | Yes | Yes | | Yes | Restart the node. |
| NvidiaXID62Error | Yes | Yes | | Yes | Restart the node. |
| NvidiaXID69Error | Yes | Yes | | Yes | Restart the node. |
| NvidiaXID[code]Error | No | Yes (up to three node events can be generated) | Other Xid errors. | No | Contact Alibaba Cloud technical support. |
| NvidiaSXID[code]Error | No | Yes (up to three events can be generated) | Other SXid errors. | No | None |
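If you want to monitor or alert on a specific detection item, a minimal sketch is to filter node events by their reason field and to read the matching node condition; the assumption here is that the reason and condition type equal the detection item name, for example NvidiaXID79Error.

```bash
# List events whose reason matches a detection item name (assumed to be
# the event reason; replace NvidiaXID79Error with the item of interest).
kubectl get events -A --field-selector reason=NvidiaXID79Error

# Check whether the corresponding node condition is currently True.
kubectl get node <node-name> \
  -o jsonpath='{.status.conditions[?(@.type=="NvidiaXID79Error")].status}'
```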
Other related events
In the exclusive GPU scenario, NPD automatically isolates GPU cards by default based on the preceding detection items. After isolation, new GPU application pods are no longer scheduled to that GPU card. You can verify the isolation by checking the nvidia.com/gpu quantity reported in the resources of the Kubernetes node. After the GPU card recovers, ACK automatically removes the isolation.
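To observe the effect of the isolation, compare the node's total and allocatable GPU quantities, as in the sketch below; an isolated card lowers the reported nvidia.com/gpu count.

```bash
# Print the total and allocatable GPU counts on the node (replace
# <node-name>); an isolated card reduces the nvidia.com/gpu value.
kubectl get node <node-name> \
  -o jsonpath='capacity={.status.capacity.nvidia\.com/gpu} allocatable={.status.allocatable.nvidia\.com/gpu}{"\n"}'
```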
| Trigger cause | Whether an event is generated | Description |
| --- | --- | --- |
| GPU Isolation | Yes | The GPU card is isolated because a fault is detected. |
| GPU Isolation Removal | Yes | The GPU card has recovered from the fault, and the isolation is removed. |
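To track isolation and recovery as they happen, you can watch node events and match loosely on the word "isolation"; the exact event reason strings are an assumption, which is why the sketch below does not filter on a specific reason.

```bash
# Watch events across namespaces and surface isolation-related messages;
# the exact reason strings are an assumption, so match loosely.
kubectl get events -A --watch | grep -i "isolat"
```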