AMD admits that the EPYC Rome processor will hang after running for 1044 days
The AMD EPYC 7002 is a server processor family launched in 2019, built on the Zen 2 architecture and codenamed Rome. AMD recently released its errata sheet, which lists erratum 1474: “a core may hang after about 1044 days.” The only remedy is to reboot the server. Notably, AMD has indicated that it will not fix the problem.
That is roughly 34 months, or slightly under three years; the calculated figure is closer to 1042 days and 12 hours. The likely cause is that the CPU counts 10 ns REFCLK ticks in a 54-bit signed integer. After 2^53 ticks, just over nine quadrillion, the counter overflows at about the 1042-day, 12-hour mark. When it overflows, the core hangs and ignores external interrupt requests until the system is powered off and restarted, which resets the counter.
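The overflow arithmetic can be checked in a few lines. This is only a sketch of the calculation described above: the 10 ns tick and the 54-bit signed counter are the figures reported in this article, not values confirmed by AMD, and the variable names are illustrative.

```python
# Assumptions taken from the article, not confirmed by AMD:
TICK_NS = 10        # one REFCLK tick is counted every 10 ns
COUNTER_BITS = 54   # width of the signed counter

# A signed N-bit integer has N-1 magnitude bits,
# so it overflows after 2**(N-1) ticks.
ticks_to_overflow = 2 ** (COUNTER_BITS - 1)

# Convert ticks to seconds, then to days.
seconds_to_overflow = ticks_to_overflow * TICK_NS / 1e9
days_to_overflow = seconds_to_overflow / 86_400

print(f"ticks until overflow: {ticks_to_overflow:,}")
print(f"days until overflow:  {days_to_overflow:.2f}")
```

The result comes out at about 1042.5 days, which matches the 1042 days and 12 hours quoted above and is slightly less than the "about 1044 days" in AMD's wording.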
That the issue was discovered at all suggests that more than one system has run for almost three years without a restart, so uncovering the bug must have taken a long time. According to AMD’s documentation, the root cause is that the core is unable to exit the CC6 power-saving state. In this state, the CPU voltage and clock frequency are reduced. The exact time at which a given system hits the bug may vary with its spread spectrum settings and REFCLK frequency.
AMD does not intend to release a fix for the CC6 erratum. Instead, it advises administrators either to disable CC6 to prevent the core from hanging, or simply to schedule regular reboots before the limit is reached.
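For administrators who choose scheduled reboots, the check could be automated along the following lines. This is a hypothetical sketch, not an AMD tool: the function names and the 30-day safety margin are assumptions, the 1042.5-day limit is the figure calculated in this article, and uptime read from Linux's /proc/uptime is used as an approximation of the time since the last system reset.

```python
LIMIT_DAYS = 1042.5        # overflow point calculated in the article
SAFETY_MARGIN_DAYS = 30.0  # assumed margin; adjust to your maintenance window
SECONDS_PER_DAY = 86_400


def days_until_limit(uptime_seconds: float, limit_days: float = LIMIT_DAYS) -> float:
    """Return how many days remain before the uptime limit is reached."""
    return limit_days - uptime_seconds / SECONDS_PER_DAY


def needs_reboot(uptime_seconds: float, margin_days: float = SAFETY_MARGIN_DAYS) -> bool:
    """Return True once the remaining time is within the safety margin."""
    return days_until_limit(uptime_seconds) <= margin_days


def read_uptime_seconds(path: str = "/proc/uptime") -> float:
    """Read seconds since boot on Linux (first field of /proc/uptime)."""
    with open(path) as f:
        return float(f.read().split()[0])


# Example with a fixed value: a server that has been up for 1020 days
# has 22.5 days left, which is inside the 30-day margin.
print(needs_reboot(1020 * SECONDS_PER_DAY))
```

On a real host, `needs_reboot(read_uptime_seconds())` could be run from a periodic job and its result fed into whatever alerting is already in place.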