Bad Memory

From Leo's Notes
Last edited on 30 December 2021, at 01:21.

Memory Error Logs

Memory Scrubbing Error

This machine check occurs periodically. /var/log/messages shows:

Sep 11 12:11:20 galaxy-dev kernel: mce: [Hardware Error]: Machine check events logged
Sep 11 12:11:20 galaxy-dev mcelog: Hardware event. This is not a software error.
Sep 11 12:11:20 galaxy-dev mcelog: MCE 0
Sep 11 12:11:20 galaxy-dev mcelog: CPU 3 BANK 9
Sep 11 12:11:20 galaxy-dev mcelog: MISC 910a00020000e8c ADDR d20dae4000
Sep 11 12:11:20 galaxy-dev mcelog: TIME 1599847880 Fri Sep 11 12:11:20 2020
Sep 11 12:11:20 galaxy-dev mcelog: MCG status:
Sep 11 12:11:20 galaxy-dev mcelog: MCi status:
Sep 11 12:11:20 galaxy-dev mcelog: Corrected error
Sep 11 12:11:20 galaxy-dev mcelog: MCi_MISC register valid
Sep 11 12:11:20 galaxy-dev mcelog: MCi_ADDR register valid
Sep 11 12:11:20 galaxy-dev mcelog: MCA: MEMORY CONTROLLER MS_CHANNEL1_ERR
Sep 11 12:11:20 galaxy-dev mcelog: Transaction: Memory scrubbing error
Sep 11 12:11:20 galaxy-dev mcelog: MemCtrl: Corrected patrol scrub error
Sep 11 12:11:20 galaxy-dev mcelog:
Sep 11 12:11:20 galaxy-dev mcelog: STATUS 8c000047000800c1 MCGSTATUS 0
Sep 11 12:11:20 galaxy-dev mcelog: MCGCAP 1000c14 APICID 60 SOCKETID 3
Sep 11 12:11:20 galaxy-dev mcelog: PPIN 8800004700800091
Sep 11 12:11:20 galaxy-dev mcelog: CPUID Vendor Intel Family 6 Model 45
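
The flags and transaction type that mcelog prints can be cross-checked against the raw STATUS value. Below is a minimal Python sketch, assuming the architectural IA32_MCi_STATUS layout from the Intel SDM (bit 63 VAL, 62 OVER, 61 UC, 60 EN, 59 MISCV, 58 ADDRV, 57 PCC) and the memory controller error-code format 0000 0000 1MMM CCCC in the low 16 bits. The name decode_mci_status is illustrative and not part of mcelog.

```python
# Flag bit positions in IA32_MCi_STATUS (assumed from the Intel SDM).
FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
         59: "MISCV", 58: "ADDRV", 57: "PCC"}

# MMM field (bits 6:4) of a memory controller error code.
MEM_TRANSACTION = {
    0: "Generic undefined request",
    1: "Memory read error",
    2: "Memory write error",
    3: "Address/Command error",
    4: "Memory scrubbing error",
}


def decode_mci_status(status_hex):
    """Decode the flag bits and memory controller error code of a STATUS value."""
    status = int(status_hex, 16)
    flags = [name for bit, name in FLAGS.items() if (status >> bit) & 1]
    code = status & 0xFFFF  # MCA error code lives in the low 16 bits
    result = {"flags": flags, "mca_code": code}
    # Memory controller errors match the pattern 0000 0000 1MMM CCCC.
    if code & 0xFF80 == 0x0080:
        result["transaction"] = MEM_TRANSACTION.get((code >> 4) & 0x7, "Reserved")
        channel = code & 0xF
        # CCCC == 1111 means the channel is not specified.
        result["channel"] = None if channel == 0xF else channel
    return result


print(decode_mci_status("8c000047000800c1"))
```

For the STATUS value logged above, this reports VAL, MISCV, and ADDRV set with UC clear (a corrected error), a memory scrubbing transaction, and channel 1, which agrees with the mcelog text.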

These corrected errors are most likely due to failing memory. To see how frequently corrections have been made on each memory controller (mc), chip-select row (csrow), and channel, run the following from /sys/devices/system/edac/mc:

# grep '[0-9]' mc*/csrow*/ch*_ce_count
mc0/csrow0/ch0_ce_count:0
mc0/csrow0/ch1_ce_count:0
mc0/csrow0/ch2_ce_count:0
mc0/csrow0/ch3_ce_count:0
mc0/csrow1/ch0_ce_count:0
mc0/csrow1/ch1_ce_count:0
mc0/csrow1/ch2_ce_count:0
mc0/csrow1/ch3_ce_count:0
mc1/csrow0/ch0_ce_count:0
mc1/csrow0/ch1_ce_count:0
mc1/csrow0/ch2_ce_count:0
mc1/csrow0/ch3_ce_count:0
mc1/csrow1/ch0_ce_count:0
mc1/csrow1/ch1_ce_count:0
mc1/csrow1/ch2_ce_count:0
mc1/csrow1/ch3_ce_count:0
mc2/csrow0/ch0_ce_count:0
mc2/csrow0/ch1_ce_count:0
mc2/csrow0/ch2_ce_count:0
mc2/csrow0/ch3_ce_count:0
mc2/csrow1/ch0_ce_count:0
mc2/csrow1/ch1_ce_count:0
mc2/csrow1/ch2_ce_count:0
mc2/csrow1/ch3_ce_count:0
mc3/csrow0/ch0_ce_count:96
mc3/csrow0/ch1_ce_count:2
mc3/csrow0/ch2_ce_count:0
mc3/csrow0/ch3_ce_count:0
mc3/csrow1/ch0_ce_count:0
mc3/csrow1/ch1_ce_count:0
mc3/csrow1/ch2_ce_count:0
mc3/csrow1/ch3_ce_count:0

Or use the edac-utils package:

# edac-util
mc3: csrow0: CPU_SrcID#3_Ha#0_Chan#0_DIMM#0: 96 Corrected Errors
mc3: csrow0: CPU_SrcID#3_Ha#0_Chan#1_DIMM#0: 2 Corrected Errors


Generic undefined request

This is on a ProLiant DL580 G7. The system logs show the following machine check exception errors:

Sep 10 00:31:09 galaxy kernel: [31916012.191742] mce: [Hardware Error]: Machine check events logged
Sep 10 00:31:09 galaxy kernel: mce: [Hardware Error]: Machine check events logged
Sep 10 00:31:09 galaxy mcelog: Hardware event. This is not a software error.
Sep 10 00:31:09 galaxy mcelog: MCE 0
Sep 10 00:31:09 galaxy mcelog: CPU 24 BANK 9
Sep 10 00:31:09 galaxy mcelog: TIME 1599719469 Thu Sep 10 00:31:09 2020
Sep 10 00:31:09 galaxy mcelog: MCG status:
Sep 10 00:31:09 galaxy mcelog: MCi status:
Sep 10 00:31:09 galaxy mcelog: Error overflow
Sep 10 00:31:09 galaxy mcelog: Corrected error
Sep 10 00:31:09 galaxy mcelog: Error enabled
Sep 10 00:31:09 galaxy mcelog: MCA: MEMORY CONTROLLER GEN_CHANNEL0_ERR
Sep 10 00:31:09 galaxy mcelog: Transaction: Generic undefined request
Sep 10 00:31:09 galaxy mcelog: STATUS d000014000310080 MCGSTATUS 0
Sep 10 00:31:09 galaxy mcelog: MCGCAP 1000c18 APICID c0 SOCKETID 3
Sep 10 00:31:09 galaxy mcelog: CPUID Vendor Intel Family 6 Model 47
Sep 10 00:31:09 galaxy mcelog: Hardware event. This is not a software error.
Sep 10 00:31:09 galaxy mcelog: MCE 1
Sep 10 00:31:09 galaxy mcelog: CPU 24 BANK 9
Sep 10 00:31:09 galaxy mcelog: TIME 1599719469 Thu Sep 10 00:31:09 2020
Sep 10 00:31:09 galaxy mcelog: MCG status:
Sep 10 00:31:09 galaxy mcelog: MCi status:
Sep 10 00:31:09 galaxy mcelog: Error overflow
Sep 10 00:31:09 galaxy mcelog: Corrected error
Sep 10 00:31:09 galaxy mcelog: Error enabled
Sep 10 00:31:09 galaxy mcelog: MCA: MEMORY CONTROLLER GEN_CHANNELunspecified_ERR
Sep 10 00:31:09 galaxy mcelog: Transaction: Generic undefined request
Sep 10 00:31:09 galaxy mcelog: STATUS d0000780000a008f MCGSTATUS 0
Sep 10 00:31:09 galaxy mcelog: MCGCAP 1000c18 APICID c0 SOCKETID 3
Sep 10 00:31:09 galaxy mcelog: CPUID Vendor Intel Family 6 Model 47
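
Both events have the overflow flag set, so more errors occurred than were logged individually. The STATUS values may also carry a running count. This is a minimal Python sketch, assuming the architectural layout from the Intel SDM, where bits 52:38 of IA32_MCi_STATUS hold a corrected error count when bit 10 (CMCI_P) of MCG_CAP is set. Whether this bank on this CPU model follows that layout exactly is an assumption, and corrected_error_count is an illustrative name.

```python
def corrected_error_count(status_hex, mcg_cap_hex):
    """Extract the corrected error count (bits 52:38), or None if unsupported."""
    status = int(status_hex, 16)
    mcg_cap = int(mcg_cap_hex, 16)
    # The count field is only defined when MCG_CAP bit 10 (CMCI_P) is set.
    if not (mcg_cap >> 10) & 1:
        return None
    return (status >> 38) & 0x7FFF  # 15-bit field


for status in ("d000014000310080", "d0000780000a008f"):
    print(status, corrected_error_count(status, "1000c18"))
```

Under that assumption, the two STATUS values above decode to counts of 5 and 30.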

After some searching, these errors appear to be a known issue on this HP model with this particular CPU. The server above has an Intel(R) Xeon(R) CPU E7- 4830  @ 2.13GHz. The Intel Xeon E7 family is known to have this issue, and HP has published an advisory about it:

Certain ProLiant servers utilizing Intel Xeon E7 Family processors may experience Correctable Machine Check (CMC) Memory Errors. These CMC errors are not operating system-dependent and may occur with any operating system. With high-speed processor busses, it is normal for a low occurrence of CMC memory error events to occur. However, on rare occasions, a higher than expected number of CMC memory errors may be logged in the IA32_MC8_Status Model-Specific Register (MSR) or IA32_MC9_Status MSR of the processors across a single processor package.

[snip]

This is not an HP product-specific issue but rather a behavior of the Intel Xeon Processor E7 family

One resolution is to enter the BIOS and change the Minimum Processor Idle Power Package State setting from Package C3 State to No Package State.