Bad Memory
Memory Error Logs
Memory Scrubbing Error
This machine check occurs periodically. /var/log/messages shows:
Sep 11 12:11:20 galaxy-dev kernel: mce: [Hardware Error]: Machine check events logged
Sep 11 12:11:20 galaxy-dev mcelog: Hardware event. This is not a software error.
Sep 11 12:11:20 galaxy-dev mcelog: MCE 0
Sep 11 12:11:20 galaxy-dev mcelog: CPU 3 BANK 9
Sep 11 12:11:20 galaxy-dev mcelog: MISC 910a00020000e8c ADDR d20dae4000
Sep 11 12:11:20 galaxy-dev mcelog: TIME 1599847880 Fri Sep 11 12:11:20 2020
Sep 11 12:11:20 galaxy-dev mcelog: MCG status:
Sep 11 12:11:20 galaxy-dev mcelog: MCi status:
Sep 11 12:11:20 galaxy-dev mcelog: Corrected error
Sep 11 12:11:20 galaxy-dev mcelog: MCi_MISC register valid
Sep 11 12:11:20 galaxy-dev mcelog: MCi_ADDR register valid
Sep 11 12:11:20 galaxy-dev mcelog: MCA: MEMORY CONTROLLER MS_CHANNEL1_ERR
Sep 11 12:11:20 galaxy-dev mcelog: Transaction: Memory scrubbing error
Sep 11 12:11:20 galaxy-dev mcelog: MemCtrl: Corrected patrol scrub error
Sep 11 12:11:20 galaxy-dev mcelog:
Sep 11 12:11:20 galaxy-dev mcelog: STATUS 8c000047000800c1 MCGSTATUS 0
Sep 11 12:11:20 galaxy-dev mcelog: MCGCAP 1000c14 APICID 60 SOCKETID 3
Sep 11 12:11:20 galaxy-dev mcelog: PPIN 8800004700800091
Sep 11 12:11:20 galaxy-dev mcelog: CPUID Vendor Intel Family 6 Model 45
This is most likely due to failing memory. To see how many corrections have been made for each memory controller (mc), chip-select row (csrow), and channel, run the following from /sys/devices/system/edac/mc:
# grep '[0-9]' mc*/csrow*/ch*_ce_count
mc0/csrow0/ch0_ce_count:0
mc0/csrow0/ch1_ce_count:0
mc0/csrow0/ch2_ce_count:0
mc0/csrow0/ch3_ce_count:0
mc0/csrow1/ch0_ce_count:0
mc0/csrow1/ch1_ce_count:0
mc0/csrow1/ch2_ce_count:0
mc0/csrow1/ch3_ce_count:0
mc1/csrow0/ch0_ce_count:0
mc1/csrow0/ch1_ce_count:0
mc1/csrow0/ch2_ce_count:0
mc1/csrow0/ch3_ce_count:0
mc1/csrow1/ch0_ce_count:0
mc1/csrow1/ch1_ce_count:0
mc1/csrow1/ch2_ce_count:0
mc1/csrow1/ch3_ce_count:0
mc2/csrow0/ch0_ce_count:0
mc2/csrow0/ch1_ce_count:0
mc2/csrow0/ch2_ce_count:0
mc2/csrow0/ch3_ce_count:0
mc2/csrow1/ch0_ce_count:0
mc2/csrow1/ch1_ce_count:0
mc2/csrow1/ch2_ce_count:0
mc2/csrow1/ch3_ce_count:0
mc3/csrow0/ch0_ce_count:96
mc3/csrow0/ch1_ce_count:2
mc3/csrow0/ch2_ce_count:0
mc3/csrow0/ch3_ce_count:0
mc3/csrow1/ch0_ce_count:0
mc3/csrow1/ch1_ce_count:0
mc3/csrow1/ch2_ce_count:0
mc3/csrow1/ch3_ce_count:0
Or use edac-util from the edac-utils package:
# edac-util
mc3: csrow0: CPU_SrcID#3_Ha#0_Chan#0_DIMM#0: 96 Corrected Errors
mc3: csrow0: CPU_SrcID#3_Ha#0_Chan#1_DIMM#0: 2 Corrected Errors
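The per-channel counters above can also be collected from a script, for example to feed a monitoring check. The following is a minimal sketch, assuming the legacy csrow layout shown in the grep output (mcN/csrowN/chN_ce_count) under the standard EDAC sysfs directory. The function names are illustrative and not part of any existing tool.

```python
from pathlib import Path
import re

# Standard EDAC sysfs location on Linux. The mcN/csrowN/chN_ce_count layout
# is assumed to match the grep output shown above.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")

COUNTER_RE = re.compile(r"^ch(\d+)_ce_count$")


def read_ce_counts(root=EDAC_ROOT):
    """Return {(mc, csrow, channel): corrected_error_count} found under root."""
    counts = {}
    root = Path(root)
    if not root.is_dir():
        # EDAC driver not loaded, or not a Linux host: nothing to report.
        return counts
    for path in sorted(root.glob("mc*/csrow*/ch*_ce_count")):
        match = COUNTER_RE.match(path.name)
        if not match:
            continue
        try:
            value = int(path.read_text().strip())
        except (OSError, ValueError):
            # Skip counters that cannot be read or parsed.
            continue
        key = (path.parent.parent.name, path.parent.name, int(match.group(1)))
        counts[key] = value
    return counts


def nonzero(counts):
    """Keep only the channels that have logged at least one corrected error."""
    return {key: value for key, value in counts.items() if value > 0}


if __name__ == "__main__":
    for (mc, csrow, channel), count in sorted(nonzero(read_ce_counts()).items()):
        print(f"{mc}/{csrow}/ch{channel}: {count} corrected errors")
```

Reporting only the nonzero counters keeps the output focused on the suspect DIMMs, similar to what edac-util prints.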
Generic Undefined Request
This error occurred on an HP ProLiant DL580 G7. The system logs show the following machine check exception errors:
Sep 10 00:31:09 galaxy kernel: [31916012.191742] mce: [Hardware Error]: Machine check events logged
Sep 10 00:31:09 galaxy kernel: mce: [Hardware Error]: Machine check events logged
Sep 10 00:31:09 galaxy mcelog: Hardware event. This is not a software error.
Sep 10 00:31:09 galaxy mcelog: MCE 0
Sep 10 00:31:09 galaxy mcelog: CPU 24 BANK 9
Sep 10 00:31:09 galaxy mcelog: TIME 1599719469 Thu Sep 10 00:31:09 2020
Sep 10 00:31:09 galaxy mcelog: MCG status:
Sep 10 00:31:09 galaxy mcelog: MCi status:
Sep 10 00:31:09 galaxy mcelog: Error overflow
Sep 10 00:31:09 galaxy mcelog: Corrected error
Sep 10 00:31:09 galaxy mcelog: Error enabled
Sep 10 00:31:09 galaxy mcelog: MCA: MEMORY CONTROLLER GEN_CHANNEL0_ERR
Sep 10 00:31:09 galaxy mcelog: Transaction: Generic undefined request
Sep 10 00:31:09 galaxy mcelog: STATUS d000014000310080 MCGSTATUS 0
Sep 10 00:31:09 galaxy mcelog: MCGCAP 1000c18 APICID c0 SOCKETID 3
Sep 10 00:31:09 galaxy mcelog: CPUID Vendor Intel Family 6 Model 47
Sep 10 00:31:09 galaxy mcelog: Hardware event. This is not a software error.
Sep 10 00:31:09 galaxy mcelog: MCE 1
Sep 10 00:31:09 galaxy mcelog: CPU 24 BANK 9
Sep 10 00:31:09 galaxy mcelog: TIME 1599719469 Thu Sep 10 00:31:09 2020
Sep 10 00:31:09 galaxy mcelog: MCG status:
Sep 10 00:31:09 galaxy mcelog: MCi status:
Sep 10 00:31:09 galaxy mcelog: Error overflow
Sep 10 00:31:09 galaxy mcelog: Corrected error
Sep 10 00:31:09 galaxy mcelog: Error enabled
Sep 10 00:31:09 galaxy mcelog: MCA: MEMORY CONTROLLER GEN_CHANNELunspecified_ERR
Sep 10 00:31:09 galaxy mcelog: Transaction: Generic undefined request
Sep 10 00:31:09 galaxy mcelog: STATUS d0000780000a008f MCGSTATUS 0
Sep 10 00:31:09 galaxy mcelog: MCGCAP 1000c18 APICID c0 SOCKETID 3
Sep 10 00:31:09 galaxy mcelog: CPUID Vendor Intel Family 6 Model 47
After some searching, this appears to be a known issue on this HP model with this particular CPU. The CPU in the server above is an Intel(R) Xeon(R) CPU E7- 4830 @ 2.13GHz. The Intel Xeon E7 family is known to have this issue, and HP has an advisory about it:
Certain ProLiant servers utilizing Intel Xeon E7 Family processors may experience Correctable Machine Check (CMC) Memory Errors. These CMC errors are not operating system-dependent and may occur with any operating system. With high-speed processor busses, it is normal for a low occurrence of CMC memory error events to occur. However, on rare occasions, a higher than expected number of CMC memory errors may be logged in the IA32_MC8_Status Model-Specific Register (MSR) or IA32_MC9_Status MSR of the processors across a single processor package.[snip]
This is not an HP product-specific issue but rather a behavior of the Intel Xeon Processor E7 family.
Source: HP Support Center, https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na-c03282091
One resolution is to go into the BIOS and change the Minimum Processor Idle Power Package State setting from Package C3 State to No Package State.
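The flag and MCA lines that mcelog prints in the logs on this page can be cross-checked against the raw STATUS values. The sketch below decodes the architectural flag bits of IA32_MCi_STATUS and the memory controller error code (binary form 0000 0000 1MMM CCCC) as described in the Intel SDM. The function name and returned dictionary are illustrative only. The model-specific bits are ignored, so this is not a replacement for mcelog.

```python
# Architectural flag bits of IA32_MCi_STATUS (bit positions per the Intel SDM).
FLAG_BITS = {
    "VAL": 63,    # register contents valid
    "OVER": 62,   # error overflow
    "UC": 61,     # uncorrected error (0 means corrected)
    "EN": 60,     # error reporting enabled
    "MISCV": 59,  # MCi_MISC register valid
    "ADDRV": 58,  # MCi_ADDR register valid
    "PCC": 57,    # processor context corrupt
}

# MMM field of a memory controller error code (values 5-7 are reserved).
MEM_TRANSACTIONS = {
    0: "GEN",  # generic undefined request
    1: "RD",   # memory read
    2: "WR",   # memory write
    3: "AC",   # address/command
    4: "MS",   # memory scrubbing
}


def decode_status(status_hex):
    """Decode the flags and memory controller error code from a STATUS hex string."""
    status = int(status_hex, 16)
    flags = [name for name, bit in FLAG_BITS.items() if (status >> bit) & 1]
    code = status & 0xFFFF  # MCA error code lives in bits 15:0
    result = {"flags": flags, "corrected": "UC" not in flags, "mca_code": code}
    if code >> 7 == 1:  # matches 0000 0000 1MMM CCCC
        channel = code & 0xF
        result["transaction"] = MEM_TRANSACTIONS.get((code >> 4) & 0x7, "reserved")
        # Channel value 0b1111 means the channel is not specified.
        result["channel"] = None if channel == 0xF else channel
    return result


if __name__ == "__main__":
    for value in ("8c000047000800c1", "d000014000310080", "d0000780000a008f"):
        print(value, decode_status(value))
```

For the three STATUS values logged above, this yields MS on channel 1 with MISCV and ADDRV set, then GEN on channel 0 and GEN with no channel specified, both with OVER and EN set. That agrees with the MS_CHANNEL1_ERR, GEN_CHANNEL0_ERR, and GEN_CHANNELunspecified_ERR lines from mcelog.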