GPU Health
Modal constantly monitors host GPU health, draining Workers with critical issues and surfacing warnings for customer triage.
Application-level observability of GPU health is provided through metrics and event logs written to container log streams.
[gpu-health] logging
Containers with attached NVIDIA GPUs are connected to our gpu-health monitoring system and receive event logs that originate from application software behavior, system software behavior, or hardware failure.
These logs are in the following format:
[gpu-health] [LEVEL] GPU-[UUID]: EVENT_TYPE: MSG
gpu-health: Name indicating the source is Modal’s observability system.
LEVEL: Represents the severity level of the log message.
GPU_UUID: A unique identifier for the GPU device associated with the event, if any.
EVENT_TYPE: The type of event source. Modal monitors for multiple types of errors, including Xid, SXid, and uncorrectable ECC. See below for more details.
MSG: The message component is either the original message taken from the event source, or a description provided by Modal of the problem.
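As a rough illustration of the format described above, a consumer of the container log stream could split a line into its fields with a regular expression. This is only a sketch under stated assumptions, not Modal's own parser: the names GpuHealthEvent and parse_gpu_health_line are hypothetical, the GPU identifier is treated as optional, and the event type is assumed to contain no colon.

```python
import re
from typing import NamedTuple, Optional


class GpuHealthEvent(NamedTuple):
    """Hypothetical container for the fields of one gpu-health log line."""
    level: str
    gpu_uuid: Optional[str]
    event_type: str
    msg: str


# Assumed pattern, derived only from the documented format string:
#   [gpu-health] [LEVEL] GPU-[UUID]: EVENT_TYPE: MSG
# The GPU part is optional because the identifier is present "if any".
GPU_HEALTH_RE = re.compile(
    r"^\[gpu-health\] \[(?P<level>[A-Z]+)\] "
    r"(?:(?P<gpu_uuid>GPU-[^:\s]+): )?"
    r"(?P<event_type>[^:]+?): (?P<msg>.*)$"
)


def parse_gpu_health_line(line: str) -> Optional[GpuHealthEvent]:
    """Return the parsed event, or None if the line is not a gpu-health log."""
    match = GPU_HEALTH_RE.match(line.strip())
    if match is None:
        return None
    return GpuHealthEvent(
        level=match.group("level"),
        gpu_uuid=match.group("gpu_uuid"),
        event_type=match.group("event_type"),
        msg=match.group("msg"),
    )
```

Lines that do not carry the [gpu-health] prefix return None, so ordinary application output in the same stream passes through untouched.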
Level
The severity level may be CRITICAL or WARN. Modal automatically responds to CRITICAL level events by draining the underlying Worker and migrating customer containers. WARN level logs may be benign or an indication of an application or library bug. No automatic action is taken by our system for warnings.
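Because warnings are left for customer triage, it can help to separate them from critical events when reading a log stream. The following is a minimal sketch, assuming log lines are available as plain strings; group_by_level is a hypothetical helper and not part of any Modal API.

```python
from typing import Dict, Iterable, List

PREFIX = "[gpu-health] "
LEVELS = ("CRITICAL", "WARN")  # the two severity levels described above


def group_by_level(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Bucket gpu-health log lines by severity; other lines are skipped."""
    grouped: Dict[str, List[str]] = {level: [] for level in LEVELS}
    for line in lines:
        if not line.startswith(PREFIX):
            continue  # not a gpu-health event
        for level in LEVELS:
            if line.startswith(f"{PREFIX}[{level}] "):
                grouped[level].append(line)
                break
    return grouped
```

The WARN bucket is the one that needs human attention, since no automatic action is taken for those events.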
Xid & SXid
The Xid message is an error report from the NVIDIA driver. The SXid, or “Switch Xid”, is a report from the NVSwitch component used in GPU-to-GPU communication, and is thus only relevant in multi-GPU containers.
A classic critical Xid error is the ‘fell off the bus’ report, code 79. The gpu-health event log looks like this:
[gpu-health] [CRITICAL] GPU-1234: XID: NVRM: Xid (PCI:0000:c6:00): 79, pid=1101234, name=nvc:[driver], GPU has fallen off the bus.
There are over 100 Xid codes, and they vary widely in frequency, severity, and specificity. See NVIDIA’s official documentation for more information.
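To look a code up in NVIDIA's documentation, the numeric Xid first has to be pulled out of the message. The sketch below assumes the driver message keeps the "Xid (PCI:...): CODE" shape seen in the example log above; extract_xid_code is a hypothetical helper, and other event types may be formatted differently.

```python
import re
from typing import Optional

# Assumed shape, taken from the example log: "Xid (PCI:0000:c6:00): 79, ..."
XID_RE = re.compile(r"Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<code>\d+)")


def extract_xid_code(msg: str) -> Optional[int]:
    """Return the numeric Xid code in a log message, or None if absent."""
    match = XID_RE.search(msg)
    return int(match.group("code")) if match else None


example = (
    "[gpu-health] [CRITICAL] GPU-1234: XID: NVRM: Xid (PCI:0000:c6:00): 79, "
    "pid=1101234, name=nvc:[driver], GPU has fallen off the bus."
)
print(extract_xid_code(example))  # → 79
```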