Why is the THRESHOLD env. variable increasing with the number of errors? #72

proof88 · 2018-08-27T13:17:47Z

Hello,

I would like to ask why mcelog sets the THRESHOLD environment variable the way it does.

On one system, a hardware-related problem triggered a large number of correctable errors. The logging looked odd: as you can see below, the logged threshold was always identical to the number of correctable errors, i.e. the threshold value kept increasing with the number of correctable errors!
Log snippet (a few minutes elapsed between each line):

MCE-SOCKET-TRIGGER: Socket  0  , Correctable  19899  , Uncorrectable  0  , Treshold  19899 in 24h
[...]
MCE-SOCKET-TRIGGER: Socket  0  , Correctable  12559069  , Uncorrectable  0  , Treshold  12559069 in 24h
[...]
MCE-SOCKET-TRIGGER: Socket  0  , Correctable  432343640  , Uncorrectable  0  , Treshold  432343640 in 24h
[...]
MCE-SOCKET-TRIGGER: Socket  0  , Correctable  874393634  , Uncorrectable  0  , Treshold  874393634 in 24h
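To make the pattern in the snippet explicit, here is a minimal Python sketch that parses the four log lines and checks that the correctable count and the logged threshold are the same number in each. The regular expression and the `parse` helper are hypothetical and written only for these lines; they are not part of mcelog.

```python
import re

# Log lines copied from the snippet above ("Treshold" is spelled as logged).
LOG_LINES = [
    "MCE-SOCKET-TRIGGER: Socket  0  , Correctable  19899  , Uncorrectable  0  , Treshold  19899 in 24h",
    "MCE-SOCKET-TRIGGER: Socket  0  , Correctable  12559069  , Uncorrectable  0  , Treshold  12559069 in 24h",
    "MCE-SOCKET-TRIGGER: Socket  0  , Correctable  432343640  , Uncorrectable  0  , Treshold  432343640 in 24h",
    "MCE-SOCKET-TRIGGER: Socket  0  , Correctable  874393634  , Uncorrectable  0  , Treshold  874393634 in 24h",
]

# Hypothetical pattern matching only the format shown in this issue.
PATTERN = re.compile(
    r"Correctable\s+(\d+)\s*,\s*Uncorrectable\s+(\d+)\s*,\s*Treshold\s+(\d+)"
)

def parse(line):
    """Return (correctable, uncorrectable, threshold) as integers."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("unexpected log line: " + line)
    ce, uc, thr = (int(group) for group in match.groups())
    return ce, uc, thr

parsed = [parse(line) for line in LOG_LINES]

# True when the logged threshold equals the correctable count on every line.
all_equal = all(ce == thr for ce, _uc, thr in parsed)
print(all_equal)
```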

After checking the mcelog source code, I understand that this behavior may be by design. Is it really working as designed? What is the purpose of logging a threshold that continuously increases with the number of errors? Shouldn't the threshold be logged as a different value, such as the bucket size?

As I understand it, the code works as follows:
When a predefined per-socket threshold is exceeded, mcelog calls the "socket-memory-error-trigger.local" script. This script produces the printout in which we see the threshold continuously increasing with the error count:

echo "MCE-SOCKET-TRIGGER: Socket " $SOCKETID " , Correctable " $CECOUNT " , Uncorrectable " $UCCOUNT " , Treshold " $THRESHOLD > $MCE_LOG_FILE

This script prints environment variables such as THRESHOLD. Since the same number is logged for both the correctable count and the threshold, I think that in this case CECOUNT equals the THRESHOLD environment variable.
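As a rough illustration of how the printed line is assembled from those variables, the following Python sketch mimics the `echo` command: `echo` joins its arguments with single spaces, which accounts for the double spaces in the log. The function name `format_trigger_line` is hypothetical, and the " in 24h" suffix inside THRESHOLD is an assumption inferred from the logged output, not something confirmed from the mcelog source.

```python
def format_trigger_line(env):
    """Join the pieces the way `echo` joins its arguments: with single spaces."""
    parts = [
        "MCE-SOCKET-TRIGGER: Socket ", env["SOCKETID"],
        " , Correctable ", env["CECOUNT"],
        " , Uncorrectable ", env["UCCOUNT"],
        " , Treshold ", env["THRESHOLD"],
    ]
    return " ".join(parts)

# Example values taken from the first log line in this issue.
# Assumption: THRESHOLD already carries the " in 24h" suffix, since the
# echo command itself does not print it.
example_env = {
    "SOCKETID": "0",
    "CECOUNT": "19899",
    "UCCOUNT": "0",
    "THRESHOLD": "19899 in 24h",
}

line = format_trigger_line(example_env)
print(line)
```

If that assumption holds, the numeric part of THRESHOLD is the value that matches CECOUNT in every logged line.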

The environment variables are set by memdb.c :: memdb_trigger(). It sets the THRESHOLD env. variable to the value of the thresh variable, which in turn is set by leaky-bucket.c :: bucket_output() to b->count + b->excess.
So what we see logged as the threshold is the bucket's count plus the bucket's excess value.

When an error arrives, the bucket's leaky-bucket.c :: bucket_account() function is called. It increases the bucket's count value. If count reaches the capacity of the bucket, excess is increased by count, and count is then reset to zero.
For example, with a bucket capacity of 10: each incoming error increases count. After 10 errors, count reaches 10, so excess is increased by 10 (becoming 10) and count is reset to 0. As more errors arrive, after another 10 errors count reaches 10 again, excess is increased by 10 (becoming 20), and count is reset to 0.
Since the THRESHOLD env. variable is set to count + excess, it always equals the total number of errors registered since the bucket was first initialized, so it increases continuously with the number of detected errors. As a result, I don't see how the printed THRESHOLD environment variable is actually a threshold.
Is this value really calculated and set the way mcelog intends?
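The walkthrough with a capacity of 10 can be reproduced with a small Python model. This is only a sketch of the steps described in this issue, not the actual mcelog code: the `Bucket` class and its method names are hypothetical, and any time-based aging of the bucket is deliberately left out.

```python
from dataclasses import dataclass

@dataclass
class Bucket:
    """Simplified model of the bucket as described in this issue (no aging)."""
    capacity: int
    count: int = 0
    excess: int = 0

    def account(self, inc=1):
        """Register `inc` errors; return True when the capacity is reached."""
        self.count += inc
        if self.count >= self.capacity:
            self.excess += self.count  # move the full count into excess
            self.count = 0             # and start counting from zero again
            return True
        return False

    def reported(self):
        """The value described above as ending up in THRESHOLD."""
        return self.count + self.excess

bucket = Bucket(capacity=10)
triggers = []  # (errors seen so far, reported value) at each trigger

for errors_seen in range(1, 26):
    if bucket.account():
        triggers.append((errors_seen, bucket.reported()))

print(triggers)            # [(10, 10), (20, 20)]
print(bucket.reported())   # 25
print(bucket.capacity)     # 10
```

In this model the reported value at every trigger equals the total number of errors seen so far, while the configured capacity stays at 10, which is the mismatch the question is about.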

Thanks,
Ádám Szabó
