This topic describes how to install, configure, and use the ACK GPU fault detection component to monitor GPU resources in your ACK clusters, helping you improve cluster reliability and efficiency.
Prerequisites
You have installed ack-node-problem-detector (NPD), and the component version is 1.2.26 or later.
When ack-nvidia-device-plugin 0.17.0 or later is used together with NPD 1.2.26 or later, NPD automatically isolates a GPU card when a GPU fault is detected, and automatically removes the isolation when NPD detects that the GPU has returned to a normal state.
For more information about how to view the version of the ack-nvidia-device-plugin component and upgrade the component, see View the version of Nvidia Device Plugin.
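To confirm the installed component versions from the command line, you can query the images of the corresponding DaemonSets. The names used below (ack-node-problem-detector-daemonset and an nvidia device plugin DaemonSet in kube-system) are assumptions that may differ in your cluster; list the DaemonSets first if unsure.

```bash
# Check the NPD version (assumes the DaemonSet is named
# ack-node-problem-detector-daemonset in the kube-system namespace).
kubectl -n kube-system get daemonset ack-node-problem-detector-daemonset \
  -o jsonpath='{.spec.template.spec.containers[0].image}'; echo

# Find the ack-nvidia-device-plugin workload (its name may differ by
# cluster) and read the version from its image tag.
kubectl -n kube-system get daemonset | grep -i nvidia
```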
ack-node-problem-detector (NPD) is a cluster node anomaly monitoring component that ACK developed based on the open source node-problem-detector project. NPD provides a variety of GPU fault detection items to enhance fault detection capabilities in GPU scenarios. When a fault is detected, the component generates a Kubernetes event or Kubernetes node condition based on the fault type.
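As a minimal sketch of how a detected fault surfaces, you can inspect node conditions and node events with kubectl. The assumption here is that the condition type and event reason match the detection item names listed in the table below (for example, NvidiaXID48Error).

```bash
# List the conditions reported on a node (replace <node-name>); NPD-set
# conditions use the detection item name as the condition type.
kubectl get node <node-name> \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\n"}{end}'

# List recent events for the node across namespaces; the reason field is
# assumed to carry the detection item name.
kubectl get events -A --field-selector involvedObject.name=<node-name>
```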
Notes
NVIDIA Xid and SXid errors are written to /var/log/messages or /var/log/syslog by the GPU driver through the NVRM event mechanism. NPD records whether each Xid and SXid error has been processed. If you restart the node after an Xid or SXid error is detected, NPD considers the error resolved and does not generate an event or node condition for it again, regardless of whether the underlying problem has actually been fixed. For example, Xid 79 indicates that the GPU device must be replaced, which a node restart does not accomplish.

NPD detects NVIDIA Xid and SXid errors by checking the /var/log/messages or /var/log/syslog file on the node. If dmesg logs are redirected to other files, ack-node-problem-detector cannot detect Xid or SXid errors.
Detection items and recovery suggestions
A recovery suggestion of None indicates that no action is required for the hardware. We recommend that you check whether the application configuration is correct.
| Detection item name | Whether a node condition is generated | Whether an event is generated | Description | Whether the GPU card is isolated by default | Recovery suggestion |
| --- | --- | --- | --- | --- | --- |
| NvidiaXID13Error | No | Yes | | No | None |
| NvidiaXID31Error | No | Yes | | No | None |
| NvidiaXID43Error | No | Yes | | No | None |
| NvidiaXID45Error | No | Yes | | No | None |
| NvidiaXID48Error | Yes | Yes | | Yes | Restart the node. |
| NvidiaXID63Error | No | Yes | | No | None |
| NvidiaXID64Error | No | Yes | | No | None |
| NvidiaXID74Error | Yes | Yes | | Yes | Hardware maintenance. |
| NvidiaXID79Error | Yes | Yes | | Yes | Hardware maintenance. |
| NvidiaXID94Error | No | Yes | | No | None |
| NvidiaXID95Error | Yes | Yes | | Yes | Restart the node. |
| NvidiaXID119Error | Yes | Yes | | Yes | Restart the node. |
| NvidiaXID120Error | Yes | Yes | | Yes | Restart the node. |
| NvidiaXID140Error | Yes | Yes | | Yes | Restart the node. |
| NvidiaEccModeNotEnabled | Yes | Yes (events are continuously generated until the issue is resolved) | The ECC mode is disabled for the node. | No | Enable the ECC mode for the node. |
| NvidiaPendingRetiredPages | Yes | Yes (events are continuously generated until the issue is resolved) | | Yes | Restart the node. |
| NvidiaRemappingRowsFailed | Yes | Yes (events are continuously generated until the issue is resolved) | Row remapping fails on the GPU. | Yes | Hardware maintenance. |
| NvidiaRemappingRowsRequireReset | Yes | Yes (events are continuously generated until the issue is resolved) | Uncorrectable and uncontained errors have occurred on the GPU. To fix these errors, you must reset the GPU. We recommend that you reset the GPU at your earliest opportunity to restore normal operations. | Yes | Restart the node. |
| NvidiaDeviceLost | Yes | Yes (events are continuously generated until the issue is resolved) | | Yes | Hardware maintenance. |
| NvidiaInfoRomCorrupted | Yes | Yes (events are continuously generated until the issue is resolved) | | Yes | Hardware maintenance. |
| NvidiaPowerCableErr | Yes | Yes (events are continuously generated until the issue is resolved) | | Yes | Hardware maintenance. |
| NvidiaXID44Error | Yes | Yes | | Yes | Restart the node. |
| NvidiaXID61Error | Yes | Yes | | Yes | Restart the node. |
| NvidiaXID62Error | Yes | Yes | | Yes | Restart the node. |
| NvidiaXID69Error | Yes | Yes | | Yes | Restart the node. |
| NvidiaXID[code]Error | No | Yes (up to three node events can be generated) | Other Xid errors. | No | Contact Alibaba Cloud technical support. |
| NvidiaSXID[code]Error | No | Yes (up to three events can be generated) | Other SXid errors. | No | None |
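If you want to monitor or alert on a specific detection item, a minimal sketch is to filter node events by their reason field and to read the matching node condition; the assumption here is that the reason and condition type equal the detection item name, for example NvidiaXID79Error.

```bash
# List events whose reason matches a detection item name (assumed to be
# the event reason; replace NvidiaXID79Error with the item of interest).
kubectl get events -A --field-selector reason=NvidiaXID79Error

# Check whether the corresponding node condition is currently True.
kubectl get node <node-name> \
  -o jsonpath='{.status.conditions[?(@.type=="NvidiaXID79Error")].status}'
```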
Other related events
In the exclusive GPU scenario, NPD automatically isolates GPU cards by default based on the preceding detection items. After isolation, new GPU application pods are no longer scheduled to that GPU card. You can verify the isolation by checking the nvidia.com/gpu quantity reported in the resources of the Kubernetes node. After the GPU card recovers, ACK automatically removes the isolation.
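To observe the effect of the isolation, compare the node's total and allocatable GPU quantities, as in the sketch below; an isolated card lowers the reported nvidia.com/gpu count.

```bash
# Print the total and allocatable GPU counts on the node (replace
# <node-name>); an isolated card reduces the nvidia.com/gpu value.
kubectl get node <node-name> \
  -o jsonpath='capacity={.status.capacity.nvidia\.com/gpu} allocatable={.status.allocatable.nvidia\.com/gpu}{"\n"}'
```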
| Trigger cause | Whether an event is generated | Description |
| --- | --- | --- |
| GPU Isolation | Yes | The GPU card is isolated because a fault is detected. |
| GPU Isolation Removal | Yes | The GPU card has recovered from the fault, and the isolation is removed. |
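To track isolation and recovery as they happen, you can watch node events and match loosely on the word "isolation"; the exact event reason strings are an assumption, which is why the sketch below does not filter on a specific reason.

```bash
# Watch events across namespaces and surface isolation-related messages;
# the exact reason strings are an assumption, so match loosely.
kubectl get events -A --watch | grep -i "isolat"
```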