mcelog didn't catch the mce memory #91

Open
mysnoopy opened this issue May 30, 2021 · 3 comments
Comments

@mysnoopy

I found an MCE error in dmesg, but mcelog didn't catch it and /var/log/mcelog is empty:

[root@test ~]#dmesg -T |grep mce
[Tue Apr 21 16:02:26 2020] mce: Using 22 MCE banks
[Sat May 1 08:56:53 2021] mce: [Hardware Error]: Machine check events logged

[root@test ~]# mcelog --client
[root@test ~]# cat /var/log/mcelog
[root@test ~]#

[root@test ~]# cat /etc/mcelog/mcelog.conf

# config file for mcelog
# For further options, see the mcelog manpage and documentation

# by default, disable extended error logging on newer Intel processors
#syslog = yes
logfile = /var/log/mcelog
no-imc-log = yes

# Filter out known broken events by default
filter = yes

# don't log memory errors individually
#filter-memory-errors = yes

# output in undecoded raw format to be easier machine readable
#raw = yes

[server]
# An upstream bug prevents this from being disabled
# Only allow root to connect by default
client-user = root
# Path to socket client uses to connect
socket-path = /var/run/mcelog-client

[dimm]
# Enable DIMM-tracking
dimm-tracking-enabled = yes
# Disable DIMM DMI pre-population unless supported on your system
dmi-prepopulate = no
# execute these triggers when the rate of corrected or uncorrected
# errors per DIMM exceeds the threshold
uc-error-trigger = dimm-error-trigger
uc-error-threshold = 1 / 24h
ce-error-trigger = dimm-error-trigger
ce-error-threshold = 10 / 24h

[socket]
# Memory error accounting per socket
socket-tracing-enabled = yes
mem-uc-error-threshold = 100 / 24h
mem-ce-error-trigger = socket-memory-error-trigger
mem-ce-error-threshold = 100 / 24h
mem-ce-error-log = yes

[cache]
# Attempt to off-line CPUs causing cache errors
cache-threshold-trigger = cache-error-trigger
cache-threshold-log = yes

[page]
# Try to soft-offline a 4K page if it exceeds the threshold
memory-ce-threshold = 10 / 24h
memory-ce-trigger = page-error-trigger
memory-ce-log = yes
memory-ce-action = soft

[trigger]
# Maximum number of running triggers
children-max = 2
directory = /etc/mcelog/triggers
[root@test ~]#

@aegl
Collaborator

aegl commented Jun 16, 2021

Was your kernel built with CONFIG_RAS_CEC=y?
If so, you may have some corrected memory errors that were handled by the RAS_CEC code and not passed through to mcelog.
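
A quick way to answer that question is to look at the kernel config file. This is only a sketch: the helper name `ras_cec_state` is made up for illustration, and it assumes the distro installs the config as /boot/config-$(uname -r), which is common but not universal.

```shell
# Hypothetical helper: report whether a kernel config file enables the
# RAS correctable errors collector (CONFIG_RAS_CEC).
# Usage: ras_cec_state [config-file]
# Defaults to /boot/config-$(uname -r), an assumption about the distro layout.
ras_cec_state() {
    cfg="${1:-/boot/config-$(uname -r)}"
    if [ ! -r "$cfg" ]; then
        # No readable config file, so nothing can be concluded.
        echo "unknown"
        return 0
    fi
    if grep -q '^CONFIG_RAS_CEC=y' "$cfg"; then
        echo "enabled"
    else
        # Covers both "# CONFIG_RAS_CEC is not set" and a missing entry.
        echo "disabled"
    fi
}

ras_cec_state
```

If this prints "enabled", corrected memory errors may be absorbed by the collector before mcelog sees them. As far as I know, booting with ras=cec_disable on the kernel command line turns the collector off, but check the kernel-parameters documentation for your kernel version before relying on that.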

@dimaslv

dimaslv commented Dec 8, 2021

Same problem here.

# grep CONFIG_RAS_CEC /boot/config-5.4.84-2.el7.x86_64 
# CONFIG_RAS_CEC is not set
# lsmod|grep -c edac
0
# mcelog --client
# mcelog --version
mcelog mcelog-144-9.94d853b2ea81.el7
# grep -vE "^#|^$" /etc/mcelog/mcelog.conf 
no-imc-log = yes
filter = yes
filter-memory-errors = yes
[server]
client-user = root
[dimm]
dimm-tracking-enabled = yes
dmi-prepopulate = yes
uc-error-trigger = dimm-error-trigger
ce-error-trigger = dimm-error-trigger
uc-error-threshold = 1 / 24h
ce-error-threshold = 1000 / 24h
[socket]
socket-tracking-enabled = yes
mem-uc-error-threshold = 1000 / 24h
mem-ce-error-threshold = 1000 / 24h
mem-ce-error-log = yes
[cache]
cache-threshold-log = yes
[page]
memory-ce-threshold = 10 / 24h
memory-ce-log = no
memory-ce-action = off
[trigger]
children-max = 2
directory = /etc/mcelog/triggers

CPU Intel E5-2680 v3
/var/log/mcelog contains only "failed to prefill DIMM database from DMI data".
And there are still a lot of errors in the kernel log:

Dec  7 16:01:31 srv kernel: [    6.653750] mce: [Hardware Error]: Machine check events logged
Dec  7 16:01:31 srv kernel: [    6.653820] mce: CMCI storm detected: switching to poll mode
Dec  7 16:01:31 srv kernel: [    6.654715] mce: [Hardware Error]: CPU 12: Machine Check: 0 Bank 7: cc163b8000010090
Dec  7 16:01:31 srv kernel: [    6.657713] mce: [Hardware Error]: TSC 0 ADDR 3f7f308140 MISC 42363686 
Dec  7 16:01:31 srv kernel: [    6.658714] mce: [Hardware Error]: PROCESSOR 0:306f2 TIME 1638882045 SOCKET 1 APIC 20 microcode 36
...
Dec  7 16:01:31 srv kernel: [    6.713718] mce: MCE records pool full!

All of those MCE errors in the log appeared only at boot (probably before mcelog started).
Could it be that the early "CMCI storm detected: switching to poll mode" during boot effectively turns off passing MCEs to mcelog?
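
Even when the daemon never receives these records, the raw lines in the kernel log can be pulled out for manual decoding. The sketch below only strips the syslog prefix; the function name `mce_records` is made up for illustration, and it assumes the log lines carry the "mce: [Hardware Error]: " prefix seen above.

```shell
# Hypothetical helper: read kernel log lines on stdin and print only the
# payload of "mce: [Hardware Error]: ..." lines, with the syslog and
# timestamp prefix removed. Lines without that prefix (for example the
# "CMCI storm detected" notice) are dropped.
mce_records() {
    sed -n 's/^.*mce: \[Hardware Error\]: //p'
}
```

The output can then be fed to the decoder, e.g. `dmesg | mce_records | mcelog --ascii`. The mcelog manpage documents --ascii for decoding kernel log output; whether it also needs a CPU type option on a given system is something I have not verified.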

@aegl
Collaborator

aegl commented Dec 8, 2021

Old kernels didn't pass early errors to mcelog.

There is a fix in v5.15.

See this commit: 3bff147b187d ("x86/mce: Defer processing of early errors")
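
To see which side of that fix a machine is on, the kernel release string can be compared against 5.15. This is a rough sketch: the function name `has_early_mce_fix` is made up, and a version check is only a heuristic, since a distro kernel with a lower number may carry a backport of the commit.

```shell
# Hypothetical helper: succeed (exit 0) if the kernel release is 5.15 or
# newer, fail (exit 1) if it is older, and return 2 if the string cannot
# be parsed. Does not detect vendor backports.
# Usage: has_early_mce_fix [release]   (defaults to `uname -r`)
has_early_mce_fix() {
    ver="${1:-$(uname -r)}"
    major="${ver%%.*}"          # text before the first dot
    rest="${ver#*.}"            # text after the first dot
    minor="${rest%%[!0-9]*}"    # leading digits of the remainder
    case "$major" in ''|*[!0-9]*) return 2 ;; esac
    case "$minor" in ''|*[!0-9]*) return 2 ;; esac
    [ "$major" -gt 5 ] || { [ "$major" -eq 5 ] && [ "$minor" -ge 15 ]; }
}

if has_early_mce_fix; then
    echo "kernel is v5.15 or newer: early errors should reach mcelog"
else
    echo "kernel is older than v5.15 (or unparsable): early errors may be lost"
fi
```

For the 5.4.84-2.el7.x86_64 kernel reported above, the check fails, which matches the observed behaviour.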
