GPU Health

Modal constantly monitors host GPU health, draining Workers with critical issues and surfacing warnings for customer triage.

Application-level observability of GPU health is provided through metrics and event logs emitted to container log streams.

[gpu-health] logging 

Containers with attached NVIDIA GPUs are connected to our gpu-health monitoring system and receive event logs that originate from application software behavior, system software behavior, or hardware failure.

These logs are in the following format: [gpu-health] [LEVEL] GPU-[UUID]: EVENT_TYPE: MSG

  • gpu-health: Name indicating the source is Modal’s observability system.
  • LEVEL: Represents the severity level of the log message.
  • GPU-[UUID]: The unique identifier of the GPU device associated with the event, if any.
  • EVENT_TYPE: The type of event source. Modal monitors for multiple types of errors, including Xid, SXid, and uncorrectable ECC. See below for more details.
  • MSG: The message component is either the original message taken from the event source, or a description provided by Modal of the problem.
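
As an illustration of how a log-processing pipeline might consume these lines, here is a minimal parsing sketch; the regular expression and field handling are assumptions based on the format described above, not part of Modal's API.

    import re

    # Illustrative pattern for the documented format:
    #   [gpu-health] [LEVEL] GPU-[UUID]: EVENT_TYPE: MSG
    # The exact spacing and the optional GPU identifier are assumptions.
    GPU_HEALTH_RE = re.compile(
        r"\[gpu-health\] \[(?P<level>CRITICAL|WARN)\] "
        r"(?:GPU-(?P<uuid>[^:\s]+): )?"
        r"(?P<event_type>[^:]+): "
        r"(?P<msg>.*)"
    )

    def parse_gpu_health_line(line: str):
        """Return a dict of fields if the line is a gpu-health event, else None."""
        match = GPU_HEALTH_RE.search(line)
        return match.groupdict() if match else None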

Level 

The severity level may be CRITICAL or WARN. Modal automatically responds to CRITICAL events by draining the underlying Worker and migrating customer containers. WARN level logs may be benign or may indicate an application or library bug; no automatic action is taken by our system for warnings.

Xid & SXid 

The Xid message is an error report from the NVIDIA driver. The SXid, or “Switch Xid”, is a report from the NVSwitch component used in GPU-to-GPU communication, and is thus only relevant in multi-GPU containers.

A classic critical Xid error is the ‘fallen off the bus’ report, code 79. The gpu-health event log looks like this:

[gpu-health] [CRITICAL] GPU-1234: XID: NVRM: Xid (PCI:0000:c6:00): 79, pid=1101234, name=nvc:[driver], GPU has fallen off the bus.
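
To map a logged message onto the reference below, the numeric code can be pulled out of the driver's NVRM text. A minimal sketch, assuming the code always follows the parenthesized PCI address as in the example above:

    import re

    # Assumed shape of the driver message: "NVRM: Xid (PCI:<bus id>): <code>, ..."
    XID_CODE_RE = re.compile(r"NVRM: Xid \(PCI:[0-9a-fA-F:.]+\): (\d+)")

    def xid_code(msg: str):
        """Extract the numeric Xid code from a driver Xid message, if present."""
        match = XID_CODE_RE.search(msg)
        return int(match.group(1)) if match else None

    # xid_code("NVRM: Xid (PCI:0000:c6:00): 79, pid=1101234, ...")  -> 79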

There are over 100 Xid codes, and they vary widely in frequency, severity, and specificity. NVIDIA’s official documentation provides limited information, so we maintain our own reference information below.

Xid Details

Each entry below gives the Xid code and name, whether Modal treats it as critical, and the known causes.

Xid 1: Invalid or corrupted push buffer stream. Critical: No. Causes: driver error, system memory corruption, bus error, framebuffer corruption.

Unused ID. If you see an error emitted with this ID, contact support.

Xid 2: Invalid or corrupted push buffer stream. Critical: No. Causes: driver error, system memory corruption, bus error, framebuffer corruption.
Xid 3: Invalid or corrupted push buffer stream. Critical: No. Causes: driver error, system memory corruption, bus error, framebuffer corruption.
Xid 4: Invalid or corrupted push buffer stream OR GPU semaphore timeout. Critical: No. Causes: driver error, user app error, system memory corruption, bus error, framebuffer corruption.

This ID (4) is overloaded: it can indicate either an invalid or corrupted push buffer stream or a GPU semaphore timeout. In the latter case, user application error is a potential cause.

Xid 6: Invalid or corrupted push buffer stream. Critical: No. Causes: driver error, system memory corruption, bus error, framebuffer corruption.
Xid 7: Invalid or corrupted push buffer address. Critical: No. Causes: driver error, system memory corruption, bus error, framebuffer corruption.
Xid 8: GPU stopped processing. Critical: No. Causes: driver error, user app error, bus error, thermal issue.
Xid 9: Driver error programming GPU. Critical: Yes. Causes: driver error.

Official NVIDIA documentation implicates only the driver as the cause of this error, so the recommended action is rebooting the system.

Xid 11: Invalid or corrupted push buffer stream. Critical: No. Causes: driver error, system memory corruption, bus error, framebuffer corruption.
Xid 12: Driver error handling GPU exception. Critical: Yes. Causes: driver error.

Official NVIDIA documentation implicates only the driver as the cause of this error, so the recommended action is rebooting the system.

Xid 13: Graphics Engine Exception. Critical: No. Causes: hardware error, driver error, user app error, system memory corruption, bus error, thermal issue, framebuffer corruption.

Marked as non-critical, this error indicates GPU memory anomalies affecting code and data segments, arrays indexed out of their declared ranges, illegal memory accesses by the application, or instruction errors.

Restart the application and check whether the same Xid is returned. To debug, refer to cuda-memcheck or CUDA-GDB. In rare cases it can be caused by hardware degradation. Please contact Modal support if the issue persists.
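
As an illustrative, non-Modal-specific way to re-run a suspect workload under NVIDIA's memory checker, one can shell out to cuda-memcheck (or compute-sanitizer, its successor on newer CUDA toolkits). The "python train.py" command below is a placeholder for your own entrypoint:

    import shutil
    import subprocess

    # Prefer compute-sanitizer if installed; fall back to the older cuda-memcheck.
    tool = shutil.which("compute-sanitizer") or shutil.which("cuda-memcheck")
    if tool is None:
        raise RuntimeError("neither compute-sanitizer nor cuda-memcheck found on PATH")

    # Placeholder command for the workload that triggered the Xid.
    subprocess.run([tool, "--tool", "memcheck", "python", "train.py"], check=False)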

Xid 16: Display engine hung. Critical: Yes. Causes: driver error.

Because this error is attributed only to the driver, the recommended operator action is rebooting the system.

Xid 18: Bus mastering disabled in PCI Config Space. Critical: Yes. Causes: driver error.

Because this error is attributed only to the driver, the recommended operator action is rebooting the system.

Xid 19: Display Engine error. Critical: Yes. Causes: driver error.

Because this error is attributed only to the driver, the recommended operator action is rebooting the system.

Xid 20: Invalid or corrupted Mpeg push buffer. Critical: No.
Xid 21: Invalid or corrupted Motion Estimation push buffer. Critical: No.
Xid 22: Invalid or corrupted Video Processor push buffer. Critical: No.
Xid 24: GPU semaphore timeout. Critical: No.

It is expected that the timeout is recoverable.

Xid 25: Invalid or illegal push buffer stream. Critical: No.
Xid 26: Framebuffer timeout. Critical: Yes. Causes: driver error.

Indicates a framebuffer timeout, and because a driver error is the only possible reason, the recommended operator action is rebooting the system.

Xid 27: Video processor exception. Critical: Yes. Causes: driver error.

Indicates a video processor exception, and because a driver error is the only possible reason, the recommended operator action is rebooting the system.

Xid 28: Video processor exception. Critical: Yes. Causes: driver error.

Indicates a video processor exception, and because a driver error is the only possible reason, the recommended operator action is rebooting the system.

Xid 29: Video processor exception. Critical: Yes. Causes: driver error.

Indicates a video processor exception, and because a driver error is the only possible reason, the recommended operator action is rebooting the system.

Xid 30: GPU semaphore access error. Critical: Yes. Causes: driver error.

Indicates a GPU semaphore access error, and because a driver error is the only possible reason, the recommended operator action is rebooting the system.

Xid 31: GPU memory page fault. Critical: No. Causes: hardware error, driver error, user app error.

Debug the user application, unless the issue is new and there have been no changes to the application but there have been changes to the GPU driver or other GPU system software. Restart the application and check whether the same Xid is returned. To debug, refer to cuda-memcheck or CUDA-GDB. In rare cases it can be caused by hardware degradation. If the issue persists, please contact support for hardware inspection and repair.

The official NVIDIA docs explain Xid 31 as a user application issue, but it can also be caused by driver bugs or hardware issues. This event is logged when the MMU reports a fault after an illegal address access is made by an applicable unit on the chip.

Xid 32: Invalid or corrupted push buffer stream. Critical: Yes. Causes: driver error, system memory corruption, bus error, thermal issue, framebuffer corruption.

The event is reported by the DMA controller of the PCIe bus that manages communication between the NVIDIA driver and the GPU. In most cases it indicates a PCI quality issue.

Xid 33: Internal micro-controller error. Critical: Yes. Causes: driver error.

Indicates an internal micro-controller error, and because a driver error is the only possible reason, the recommended operator action is rebooting the system.

Xid 34: Video processor exception. Critical: Yes. Causes: driver error.

Indicates a video processor exception, and because a driver error is the only possible reason, the recommended operator action is rebooting the system.

Xid 35: Video processor exception. Critical: Yes. Causes: driver error.

Indicates a video processor exception, and because a driver error is the only possible reason, the recommended operator action is rebooting the system.

Xid 36: Video processor exception. Critical: Yes. Causes: driver error.

Indicates a video processor exception, and because a driver error is the only possible reason, the recommended operator action is rebooting the system.

Xid 37: Driver firmware error. Critical: No.
Xid 38: Driver firmware error. Critical: Yes. Causes: driver error.

Marked as critical, this indicates NVIDIA driver firmware issues. Reboot the system to check whether the firmware issue persists.

Xid 42: Video processor exception. Critical: Yes. Causes: driver error.

Indicates a video processor exception, and because a driver error is the only possible reason, the recommended operator action is rebooting the system.

Xid 43: GPU stopped processing. Critical: No. Causes: driver error, user app error.

Marked as non-critical, this indicates the GPU stopped processing due to a user application encountering a software-induced fault. Restart the application and check whether the same Xid is returned, and report it if the issue persists.

Xid 44: Graphics Engine fault during context switch. Critical: Yes. Causes: driver error.

Marked as critical, this indicates uncorrectable GPU errors. Stop existing workloads and reboot the system (or reset the GPUs) to clear this error. If the uncorrectable GPU error persists after rebooting the system, contact the provider for inspection and repair of the hardware.

The official NVIDIA docs explain Xid 44 as a potential driver issue. DeepSeek's Fire-Flyer AI paper explains that Xid 44 indicates uncorrectable GPU errors and recommends a GPU reset or node reboot.

Xid 45: Preemptive cleanup, due to previous errors (most likely seen when running multiple CUDA applications and hitting a Double Bit ECC Error (DBE)). Critical: No. Causes: driver error, user app error.

Robust Channel Preemptive Removal. No action, informative only. Indicates channels affected by another failure. On A100, this error could be seen by itself due to unexpected Fabric Manager shutdown when FM is running in the same OS environment as the GPU. Otherwise, this error is safe to ignore as an informational message.

The official NVIDIA docs explain that Xid 45 is returned when the kernel driver terminates a GPU application as a result of a user or system action. NVIDIA labels Xid 45 as “OS: Preemptive Channel Removal”.

Xid 46: GPU stopped processing. Critical: Yes. Causes: driver error.

Indicates the GPU stopped processing; because a driver error is the only possible reason, we recommend rebooting the system.

Xid 47: Video processor exception. Critical: Yes. Causes: driver error.

Indicates a video processor exception; because a driver error is the only possible reason, we recommend rebooting the system.

Xid 48: Double Bit ECC Error. Critical: Yes. Causes: driver error.

This event is logged when the GPU detects that an uncorrectable error has occurred on the GPU. Marked as critical, it indicates an uncorrectable double bit ECC error (DBE), which is also reported back to the user application. Stop existing workloads and reboot the system (or reset the GPUs) to clear this error.

  • If Xid 48 is followed by Xid 63 or 64: Drain/cordon the node, wait for all work to complete, and reset GPU(s) reporting the XID (refer to GPU reset capabilities/limitations section below).
  • If Xid 48 is not followed by Xid 63 or 64: see Running Field Diagnostics to collect additional debug information. The error is also reported to your application. In most cases, you need to reset the GPU or node to fix this error.
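
When triaging ECC-related events like this one, the GPU's current ECC error counters can be read with nvidia-smi. A minimal sketch; the query fields below are standard but should be confirmed against nvidia-smi --help-query-gpu for your driver version:

    import subprocess

    # Volatile counters reset when the driver reloads; aggregate counters persist.
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=uuid,ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    print(result.stdout.strip())
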
Xid 54: Auxiliary power is not connected to the GPU board. Critical: No.
Xid 56: Display Engine error. Critical: No. Causes: driver error, hardware error.
Xid 57: Error programming video memory interface. Critical: No. Causes: driver error, hardware error, framebuffer corruption.
Xid 58: Unstable video memory interface detected. Critical: No. Causes: driver error, hardware error.

Overloaded ID: could be either unstable video memory interface detected or an EDC error, as clarified in the printout (driver error=false).

Xid 59: Internal micro-controller error (older drivers). Critical: Yes. Causes: driver error.

As a driver error is the only potential cause, the recommended operator action is rebooting the system.

Xid 60: Video processor exception. Critical: Yes. Causes: driver error.

As a driver error is the only potential cause, the recommended operator action is rebooting the system.

Xid 61: Internal micro-controller breakpoint/warning (newer drivers). Critical: Yes.

PMU Breakpoint. Report a GPU issue and reset the GPU(s) reporting the Xid (refer to the GPU reset capabilities/limitations section below). Marked as critical, this indicates an internal micro-controller breakpoint/warning; the GPU internal engine stops working. Stop existing workloads and reboot the system (or reset the GPUs) to clear this error.

Internal micro-controller breakpoint/warning. The GPU internal engine stops working, and running workloads are consequently affected.

  • The official NVIDIA docs explain that Xid 61 indicates an internal micro-controller warning.
  • The official NVIDIA debugging guidelines recommend resetting the GPU that reports the Xid 61.
  • The Alibaba Cloud "Diagnose GPU-accelerated nodes" guide explains that Xid 61 indicates the GPU internal engine has stopped working, affecting running workloads.
  • The DeepSeek Fire-Flyer AI-HPC paper explains that Xid 61 indicates uncorrectable GPU errors, which are reported back to the user application, and recommends a GPU reset or node reboot to clear this error.
Xid 62: Internal micro-controller halt (newer drivers). Critical: Yes. Causes: hardware error, driver error, thermal issue.

This event is similar to Xid 61. PMU Halt Error. Report a GPU Issue to support.

Xid 63: ECC page retirement or row remapping recording event. Critical: No. Causes: hardware error, driver error, framebuffer corruption.

These events are logged when the GPU handles ECC memory errors on the GPU.

A100: Row-remapping recording event. This Xid indicates successful recording of a row-remapping entry to the InfoROM. If associated with Xid 94, the application that encountered the error needs to be restarted. All other applications on the system can keep running as is until there is a convenient time to reset the GPU (refer to the GPU reset capabilities/limitations section below) or reboot for row remapping to activate.

Legacy GPU: ECC page retirement recording event. If associated with Xid 48, drain/cordon the node, wait for all work to complete, and reset the GPU(s) reporting the Xid (refer to the GPU reset capabilities/limitations section below). If not, it is from a single bit error and the system can keep running as is until there is a convenient time to reboot it. Xid 63 indicates that the retirement or remapping information is successfully recorded in the InfoROM.

Xid 64: ECC page retirement or row remapper recording failure. Critical: Yes. Causes: hardware error, driver error.

These events are logged when the GPU handles ECC memory errors on the GPU.

A100 and later: Row-remapping recording failure. This XID indicates a failure in recording a row-remapping entry to the InfoROM. The node should be rebooted immediately since there is a recording failure. If the errors continue, drain, triage, and see Report a GPU Issue for further operator instructions.

Legacy GPU: ECC page retirement recording failure.

See above; however, the node should be monitored closely. If there is no associated Xid 48 error, then these are related to single-bit errors. The GPU(s) reporting the error must be reset immediately (refer to the GPU reset capabilities/limitations section below) since there is a recording failure. If the errors continue, drain, triage, and see Report a GPU Issue, along with the guidelines on when to RMA GPUs based on excessive errors. This event is similar to Xid 63; however, Xid 63 indicates that the retirement or remapping information was successfully recorded in the InfoROM, while Xid 64 indicates that the retirement or remapping information failed to be recorded.
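
To inspect the row-remapping state that Xid 63 and 64 describe, nvidia-smi provides a remapped-rows query on Ampere and later GPUs. A minimal sketch; the field names follow NVIDIA's A100 memory error management documentation and should be confirmed with nvidia-smi --help-query-remapped-rows:

    import subprocess

    # A non-zero "pending" count means a GPU reset or node reboot is needed for the
    # remapping to take effect; "failure" corresponds to the Xid 64 case above.
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-remapped-rows=gpu_bus_id,remapped_rows.correctable,"
            "remapped_rows.uncorrectable,remapped_rows.pending,remapped_rows.failure",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    print(result.stdout.strip())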

Xid 65: Video processor exception. Critical: Yes. Causes: hardware error, driver error.

Triggered when the GPU handles memory ECC errors on the GPU. Most instances can be resolved by simply resetting the GPU to retain optimal performance.

Xid 66: Illegal access by driver. Critical: No. Causes: driver error, user app error.
Xid 67: Illegal access by driver. Critical: No. Causes: driver error, user app error.
Xid 68: NVDEC0 Exception. Critical: Yes. Causes: hardware error, driver error.

Video processor exception.

Xid 69: Graphics Engine class error. Critical: Yes. Causes: hardware error, driver error.

Xid 69 indicates uncorrectable GPU errors. Stop the workloads and reboot the system. If the same Xid is reported again after rebooting the system, the GPU hardware should be inspected and repaired. This error has zero observations on Modal GPUs.

Xid 70: CE3: Unknown Error. Critical: No. Causes: hardware error, driver error.

Indicates an unknown error. It is marked as a warning and does not require immediate action.

Xid 71: CE4: Unknown Error. Critical: No. Causes: hardware error, driver error.
Xid 72: CE5: Unknown Error. Critical: No. Causes: hardware error, driver error.
Xid 73: NVENC2 Error. Critical: No. Causes: hardware error, driver error.
Xid 74: NVLINK Error. Critical: Yes. Causes: hardware error, driver error, bus error.

This event is logged when the GPU detects a problem with a connection from the GPU to another GPU or NVSwitch over NVLink. A GPU reset or node reboot is needed to clear this error.

This event may indicate a hardware failure with the link itself, or may indicate a problem with the device at the remote end of the link. For example, if a GPU fails, another GPU connected to it over NVLink may report an Xid 74 simply because the link went down as a result. The nvidia-smi nvlink command can provide additional details on NVLink errors, and connection information on the links.

If this error is seen repeatedly and GPU reset or node reboot fails to clear the condition, contact your hardware vendor for support.

Extract the hex strings from the Xid error message, e.g. (0x12345678, 0x12345678, 0x12345678, 0x12345678, 0x12345678, 0x12345678, 0x12345678). Look at the first DWORD and take the following paths if particular bits (counting from the LSB side) are set; a decoding sketch follows the list.

  • Bits 4 or 5: Likely HW issue with ECC/Parity. If seen more than 2 times on the same link, report a bug.
  • Bits 21 or 22: Marginal channel SI issue. Check link mechanical connections. If other errors accompany, follow the resolution for those.
  • Bits 8, 9, 12, 16, 17, 24, 28: Could possibly be a HW issue: Check link mechanical connections and re-seat if a field resolution is required. Run diags if issue persists.
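
A minimal Python sketch of that bit check, assuming the first hexadecimal value in the parenthesized list is the DWORD of interest:

    import re

    # Bit positions called out above, counted from the LSB.
    ECC_PARITY_BITS = {4, 5}
    MARGINAL_SI_BITS = {21, 22}
    POSSIBLE_HW_BITS = {8, 9, 12, 16, 17, 24, 28}

    def decode_xid74(msg: str):
        """Classify an Xid 74 message by the set bits in its first DWORD."""
        hex_values = re.findall(r"0x[0-9a-fA-F]+", msg)
        if not hex_values:
            return ["no hex payload found"]
        first_dword = int(hex_values[0], 16)
        set_bits = {bit for bit in range(32) if first_dword & (1 << bit)}
        findings = []
        if set_bits & ECC_PARITY_BITS:
            findings.append("likely HW issue with ECC/Parity; report if seen >2 times on a link")
        if set_bits & MARGINAL_SI_BITS:
            findings.append("marginal channel SI issue; check link mechanical connections")
        if set_bits & POSSIBLE_HW_BITS:
            findings.append("possible HW issue; check/re-seat connections, run diags if it persists")
        return findings or ["no known-pattern bits set"]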

"Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning" says that Xid 74 indicates errors in NVLink. For PCIe A100, it's mainly occurred on the NVLink Bridge between two GPUs. Its occurrence rate is several orders of magnitude higher than other hardware faults. Apart from stress testing to exclude those that are constantly repeating errors, there isn't a good way to avoid the occurrence of Xid 74 issues.

Xid 75: CE6: Unknown Error. Critical: No. Causes: hardware error, driver error.
Xid 76: CE7: Unknown Error. Critical: No. Causes: hardware error, driver error.
Xid 77: CE8: Unknown Error. Critical: No. Causes: hardware error, driver error.
Xid 78: vGPU Start Error. Critical: Yes. Causes: driver error.

vGPU start error. Reboot the system.

Xid 79: GPU has fallen off the bus. Critical: Yes. Causes: hardware error, driver error, system memory corruption, bus error, thermal issue.

This event is logged when the driver attempts to access the GPU over PCIe and finds it is not accessible. Often caused by hardware failures on the PCIe link bringing the link down. May also be failing GPU hardware or other driver issues. Review system and kernel PCI event logs for indications of link failures.

Example: "NVRM: Xid (PCI:0000:b1:00): 79, GPU has fallen off the bus."

Xid 80: Corrupted data sent to GPU. Critical: No. Causes: hardware error, driver error, system memory corruption, bus error, framebuffer corruption.
Xid 81: VGA Subsystem Error. Critical: No. Causes: hardware error.

Xid 81 indicates a VGA subsystem error, with hardware failure as the only possible cause. It is recommended to contact support for hardware inspection. This error has zero observations on Modal GPUs.

Xid 82: NVJPG0 Error. Critical: No. Causes: hardware error, driver error.
Xid 83: NVDEC1 Error. Critical: No. Causes: hardware error, driver error.
Xid 84: NVDEC2 Error. Critical: No. Causes: hardware error, driver error.
Xid 85: CE9: Unknown Error. Critical: No. Causes: hardware error, driver error.
Xid 86: OFA Exception. Critical: No. Causes: hardware error, driver error.
Xid 87: Reserved. Critical: No.
Xid 88: NVDEC3 Error. Critical: No. Causes: hardware error, driver error.
Xid 89: NVDEC4 Error. Critical: No. Causes: hardware error, driver error.
Xid 90: Reserved. Critical: No.
Xid 91: Reserved. Critical: No.
Xid 92: High single-bit ECC error rate. Critical: No. Causes: hardware error, driver error.

A hardware or driver error occurs. This error is marked non-critical as single-bit errors can be handled by the GPU. See Running Field Diagnostics to collect additional debug information.

Xid 93: Non-fatal violation of provisioned InfoROM wear limit. Critical: No. Causes: driver error, user app error.
Xid 94: Contained ECC error. Critical: No. Causes: hardware error, driver error, framebuffer corruption.

This XID indicates a contained ECC error has occurred. These events are logged when GPU drivers handle errors in GPUs that support error containment, starting with NVIDIA® A100 GPUs.

For Xid 94, these errors are contained to one application, and the application that encountered this error must be restarted.

All other applications running at the time of the Xid are unaffected. It is recommended to reset the GPU when convenient. Applications can continue to be run until the reset can be performed.

NOTE: This Xid is only expected on NVIDIA Ampere (A100), Hopper, and later architectures. If observed on earlier architectures (e.g. Turing), contact support for investigation.

Xid 95: Uncontained ECC error. Critical: Yes. Causes: hardware error, driver error, framebuffer corruption.

This XID indicates an uncontained ECC error has occurred.

These events are logged when GPU drivers handle errors in GPUs that support error containment, starting with NVIDIA® A100 GPUs.

For Xid 95, these errors affect multiple applications, and the affected GPU must be reset before applications can restart. Refer to https://docs.nvidia.com/deploy/gpu-debug-guidelines/index.html for GPU reset capabilities & limitations.

A100 only:

  • If MIG is enabled, drain any work on the other GPU instances, wait for all work to complete, and reset GPU(s) reporting the XID (refer to the GPU reset capabilities/limitations section below).
  • If MIG is disabled, the node should be rebooted immediately since there is an uncorrectable uncontained ECC error.

(Modal does not support MIG.)

References:

https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#user-visible-statistics

This event is similar to Xid 94. However, Xid 94 indicates that the error is suppressed. Xid 95 indicates that the error fails to be suppressed. Other applications on the GPU-accelerated node are also affected.

Xid 96: NVDEC5 Error. Critical: No.
Xid 97: NVDEC6 Error. Critical: No.
Xid 98: NVDEC7 Error. Critical: No.
Xid 99: NVJPG1 Error. Critical: No.
Xid 100: NVJPG2 Error. Critical: No.
Xid 101: NVJPG3 Error. Critical: No.
Xid 102: NVJPG4 Error. Critical: No.
Xid 103: NVJPG5 Error. Critical: No.
Xid 104: NVJPG6 Error. Critical: No.
Xid 105: NVJPG7 Error. Critical: No. Causes: hardware error, driver error.
Xid 106: SMBPBI Test Message. Critical: No.
Xid 107: SMBPBI Test Message Silent. Critical: No. Causes: user app error.
Xid 108: Reserved. Critical: No.

This is a reserved Xid and is not expected to be observed. If encountered, contact support for investigation.

Xid 109: Context Switch Timeout Error. Critical: No. Causes: hardware error, driver error, user app error, system memory corruption, bus error, thermal issue, framebuffer corruption.

An error with all possible causes, and usually recoverable.

Xid 110: Security Fault Error. Critical: Yes.

This event should be uncommon unless there is a hardware failure. Modal will drain the worker and reset the GPU, and if the problem persists, contact the hardware vendor for support.

Xid 111: Display Bundle Error Event. Critical: No. Causes: hardware error, driver error, bus error.
Xid 112: Display Supervisor Error. Critical: No. Causes: hardware error, driver error.
Xid 113: DP Link Training Error. Critical: No. Causes: hardware error, driver error.
Xid 114: Display Pipeline Underflow Error. Critical: No. Causes: hardware error, driver error, framebuffer corruption.
Xid 115: Display Core Channel Error. Critical: No. Causes: hardware error, driver error.
Xid 116: Display Window Channel Error. Critical: No. Causes: hardware error, driver error.
Xid 117: Display Cursor Channel Error. Critical: No. Causes: hardware error, driver error.
Xid 118: Display Pixel Pipeline Error. Critical: No. Causes: hardware error, driver error.
Xid 119: GSP RPC Timeout. Critical: Yes. Causes: hardware error, driver error, system memory corruption, bus error, thermal issue, framebuffer corruption.

The official NVIDIA docs explain that Xid 119 indicates the GSP module has failed to respond to RPC messages; a GPU reset or node power cycle is recommended if the issue persists.

Xid 120: GSP Error. Critical: Yes. Causes: hardware error, driver error, system memory corruption, bus error, thermal issue, framebuffer corruption.

The official NVIDIA docs explain that Xid 120 indicates GSP module failures to respond to RPC messages; a GPU reset or node power cycle is recommended if the issue persists.

Xid 121: C2C Link Error. Critical: Yes. Causes: hardware error, bus error.

The official NVIDIA docs explain that Xid 121 indicates corrected errors on the C2C NVLink connection to a Grace CPU, with no operational impact; a GPU reset is recommended to retrain the link.

Xid 122: SPI PMU RPC Read Failure. Critical: No. Causes: hardware error, driver error.
Xid 123: SPI PMU RPC Write Failure. Critical: Yes. Causes: hardware error, driver error.

Refer to the GPU reset capabilities/limitations section provided in Section D.9 of the Fabric Manager User Guide.

Xid 124: SPI PMU RPC Erase Failure. Critical: No. Causes: hardware error, driver error.
Xid 125: Inforom FS Failure. Critical: No. Causes: hardware error, driver error.
Xid 126: Reserved. Critical: No.
Xid 127: Reserved. Critical: No.
Xid 128: Reserved. Critical: No.
Xid 129: Reserved. Critical: No.
Xid 130: Reserved. Critical: No.
Xid 131: Reserved. Critical: No.
Xid 132: Reserved. Critical: No.
Xid 134: Reserved. Critical: No.
Xid 135: Reserved. Critical: No.
Xid 136: Reserved. Critical: No.
Xid 137: NVLink FLA privilege error. Critical: No. Causes: user app error.

This event is logged when a fault is reported by the remote memory management unit (MMU), such as when an illegal NVLink peer-to-peer access is made by an applicable unit on the chip. Typically these are application-level bugs, but can also be driver bugs or hardware bugs.

Xid 138: Reserved. Critical: No.
Xid 139: Reserved. Critical: No.
Xid 140: Unrecovered ECC Error. Critical: Yes. Causes: hardware error, driver error, framebuffer corruption.

This event may occur when the GPU driver has observed uncorrectable errors in GPU memory, in such a way as to interrupt the GPU driver’s ability to mark the pages for dynamic page offlining or row remapping. Modal will drain the worker and reset the GPU, and if the problem persists, contact the hardware vendor for support.

Xid 141: ROBUST_CHANNEL_FAST_PATH_ERROR. Critical: No.
Xid 142: ROBUST_CHANNEL_NVENC3_ERROR. Critical: No.
Xid 143: GPU Initialization Failure. Critical: Yes. Causes: hardware error, driver error, framebuffer corruption.

GPU initialization failure. GPU_INIT_ERROR in driver (e.g. error status while polling for FSP boot complete).

Xid 144: NVLINK_SAW_ERROR. Critical: No.
Xid 145: NVLINK_RLW_ERROR. Critical: No.
Xid 146: NVLINK_TLW_ERROR. Critical: No.
Xid 147: NVLINK_TREX_ERROR. Critical: No.
Xid 148: NVLINK_NVLPW_CTRL_ERROR. Critical: No.
Xid 149: NVLINK_NETIR_ERROR. Critical: No.
Xid 150: NVLINK_MSE_ERROR. Critical: No.

Example error on NVIDIA B200: 'NVRM: Xid (PCI:0000:5b:00): 150, MSE Degraded Fatal XC0 i0 Link -1 (0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000)'

This error is not expected on non-Hopper GPU architectures.

Xid 151: ROBUST_CHANNEL_KEY_ROTATION_ERROR. Critical: No.
Xid 152: RESERVED7_ERROR. Critical: No.
Xid 153: RESERVED8_ERROR. Critical: No.
Xid 154: GPU Recovery Action Changed. Critical: Yes.

Recovery action changed for GPU device. The following state transitions are observed:

  • 0x0 (None) to 0x1 (GPU Reset Required)
  • 0x0 (None) to 0x2 (Node Reboot Required)
  • 0x0 (None) to 0x4 (Drain and Reset)

Unobserved transitions:

  • 0x0 (None) to 0x3

Because all observed transitions require a reset or reboot, this XID is critical.
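
As a small illustrative helper (not part of Modal's tooling), the codes from the transitions above can be mapped to their operator-facing meanings:

    # Recovery-action codes observed in Xid 154 transitions, per the list above.
    RECOVERY_ACTIONS = {
        0x0: "None",
        0x1: "GPU Reset Required",
        0x2: "Node Reboot Required",
        0x4: "Drain and Reset",
    }

    def describe_transition(old: int, new: int) -> str:
        """Human-readable description of a recovery-action change."""
        return (
            f"{RECOVERY_ACTIONS.get(old, hex(old))} -> "
            f"{RECOVERY_ACTIONS.get(new, hex(new))}"
        )

    # describe_transition(0x0, 0x2) -> "None -> Node Reboot Required"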

Xid 155: NVLINK_SW_DEFINED_ERROR. Critical: No.
Xid 156: RESOURCE_RETIREMENT_EVENT. Critical: No.
Xid 157: RESOURCE_RETIREMENT_FAILURE. Critical: No.
Xid 158: GPU_FATAL_TIMEOUT. Critical: No.
Xid 159: ROBUST_CHANNEL_CHI_NON_DATA_ERROR. Critical: No.
Xid 160: CHANNEL_RETIREMENT_EVENT. Critical: No.
Xid 161: CHANNEL_RETIREMENT_FAILURE. Critical: No.
Xid 162: ROBUST_CHANNEL_LAST_ERROR. Critical: No.