Why does the THRESHOLD env. variable increase with the number of errors? #72

Open
proof88 opened this issue Aug 27, 2018 · 0 comments

Hello,

I would like to ask why mcelog sets the THRESHOLD environment variable the way it does.

One of our systems had a HW-related problem that triggered a lot of correctable errors. The logging looked weird: as you can see below, the logged number of correctable errors was always identical to the logged threshold value, i.e. the threshold value was increasing with the number of correctable errors!
Log snippet (a few minutes elapsed between consecutive lines):

MCE-SOCKET-TRIGGER: Socket  0  , Correctable  19899  , Uncorrectable  0  , Treshold  19899 in 24h
[...]
MCE-SOCKET-TRIGGER: Socket  0  , Correctable  12559069  , Uncorrectable  0  , Treshold  12559069 in 24h
[...]
MCE-SOCKET-TRIGGER: Socket  0  , Correctable  432343640  , Uncorrectable  0  , Treshold  432343640 in 24h
[...]
MCE-SOCKET-TRIGGER: Socket  0  , Correctable  874393634  , Uncorrectable  0  , Treshold  874393634 in 24h

After checking the mcelog source code, I understand that this behavior may be by design. Is it really working as designed? What is the purpose of logging a threshold that keeps increasing with the number of errors? Shouldn't the threshold be logged as a different value, such as the bucket size?

As I understand it, the code works as follows:
When a predefined per-socket threshold is exceeded, mcelog calls the "socket-memory-error-trigger.local" script. This script produces the printout shown above, in which the threshold keeps increasing with the error count:

echo "MCE-SOCKET-TRIGGER: Socket " $SOCKETID " , Correctable " $CECOUNT " , Uncorrectable " $UCCOUNT " , Treshold " $THRESHOLD > $MCE_LOG_FILE

This script prints environment variables such as THRESHOLD. Since the same number is logged for both the correctable count and the threshold, I think that in this case the CECOUNT environment variable equals the THRESHOLD environment variable.
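The observation that both columns carry the same number can be checked mechanically. The following is only a small sketch: the function name `parse_trigger_line` and the regular expression are made up for illustration, and the pattern assumes the exact line layout of the log snippet quoted above (including the "Treshold" spelling printed by the script).

```python
import re

# Pattern for the lines printed by the trigger script, as quoted in the log
# snippet. The field layout is an assumption taken from those four lines only.
LINE_RE = re.compile(
    r"Socket\s+(\d+)\s*,\s*Correctable\s+(\d+)\s*,"
    r"\s*Uncorrectable\s+(\d+)\s*,\s*Treshold\s+(\d+)"
)


def parse_trigger_line(line):
    """Return the numeric fields of one MCE-SOCKET-TRIGGER line, or None."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    socket_id, correctable, uncorrectable, threshold = map(int, match.groups())
    return {
        "socket": socket_id,
        "correctable": correctable,
        "uncorrectable": uncorrectable,
        "threshold": threshold,
    }


if __name__ == "__main__":
    sample = (
        "MCE-SOCKET-TRIGGER: Socket  0  , Correctable  19899  , "
        "Uncorrectable  0  , Treshold  19899 in 24h"
    )
    fields = parse_trigger_line(sample)
    # For every line in the snippet, the two values are identical.
    print(fields["correctable"] == fields["threshold"])
```

Running this over each of the four quoted lines gives a matching correctable count and threshold every time.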

The environment variables are set by memdb.c :: memdb_trigger(). It sets the THRESHOLD env. variable to the value of the thresh variable, which in turn is set by leaky-bucket.c :: bucket_output(). That function sets it to: b->count + b->excess.
So what we see logged as the threshold is the bucket’s count plus the bucket’s excess value.

When an error arrives, the bucket’s leaky-bucket.c :: bucket_account() function is called. It increases the bucket’s count value. If count reaches the capacity of the bucket, excess is increased by count and count is reset to zero.
For example, if the bucket capacity is 10: each incoming error increases count. After 10 errors, count reaches 10, so excess is increased by 10 (it becomes 10) and count is reset to 0. More errors arrive: after another 10 errors, count reaches 10 again, excess is increased by 10 (it becomes 20) and count is reset to 0.
Since the THRESHOLD env. variable is set to count+excess, it always equals the total number of errors registered since the bucket was initialized. It therefore keeps increasing with the number of detected errors, and I currently don’t see in what sense the printed THRESHOLD environment variable is actually a threshold.
Is this value really calculated and set the way mcelog intends?
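To make the arithmetic above concrete, here is a toy model of the accounting exactly as I described it. It is not mcelog's code: the class `Bucket`, its method names and `simulate` are invented for illustration, and the model deliberately ignores any time-based handling that the real leaky-bucket.c may perform.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    """Toy model of the bucket as described in this post (not mcelog's struct)."""

    capacity: int    # number of errors at which the bucket overflows
    count: int = 0   # errors accumulated since the last overflow
    excess: int = 0  # errors moved out of count on each overflow

    def account(self, errors: int = 1) -> bool:
        """Register errors one by one; return True if the bucket overflowed."""
        overflowed = False
        for _ in range(errors):
            self.count += 1
            if self.count >= self.capacity:
                # Overflow: move the whole count into excess and start over.
                self.excess += self.count
                self.count = 0
                overflowed = True
        return overflowed

    def reported_threshold(self) -> int:
        """Value that ends up in THRESHOLD per the description: count + excess."""
        return self.count + self.excess


def simulate(capacity: int, total_errors: int) -> int:
    """Feed total_errors into a fresh bucket and return the reported value."""
    bucket = Bucket(capacity)
    bucket.account(total_errors)
    return bucket.reported_threshold()
```

With a capacity of 10, `simulate(10, 25)` returns 25 (excess is 20, count is 5). In this model the reported value is always the total error count, regardless of the capacity, which matches what the log shows.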

Thanks,
Ádám Szabó
