add xor tree repair loop #2309
Conversation
How does this fix the local node if only the local XOR is broken and all other nodes are in sync? Edit: now I (hopefully) understand; the local node's XOR needs to differ from that of all other connected nodes (indicating it is in sync with no other node) for the repair to trigger.
Should this also be an invokable/triggerable operation (e.g. by removing certain files or sending a specific API call), in case the tree is suspected to be broken but the repair doesn't trigger for whatever reason (e.g. another node has the same defect)?
ctx         context.Context
cancel      context.CancelFunc
ticker      *time.Ticker
currentPage uint32
Racy access (by both the ticker and stateOK, for instance).
Tested with -race and no race conditions were found. Both checkPage and stateOK use the mutex.
That's right, I saw that later and forgot to remove the second comment 👍
calculatedXorTree := tree.New(tree.NewXor(), PageSize)

// acquire global lock
err := f.state.graph.db.Write(context.Background(), func(txn stoabs.WriteTx) error {
How fast is this? The funcs called by externals are now blocked by the mutex (although they are blocked by the write TX anyway..?).
takes around 50-100ms per check.
Perhaps good to do a performance test for this? 100 ms per check at an interval of one page per 10 s = ~1% DB downtime. It could have a significant impact during bursts of data (auto-publishing OpenID services), but becomes less relevant as more transactions move off the DAG.
if err != nil {
	return err
}
log.Logger().Warnf("detected XOR tree mismatch for page %d, fixed using recalculated values", f.currentPage)
Could it just be that the node is temporarily out of sync with the other nodes due to "normal" lag (missing a few TXs) and being slow for a moment?
Only when you have also experienced downtime.
@@ -93,6 +93,11 @@ type State interface {
	// - highest lamport clock in the DAG
	// A requested clock of math.MaxUint32 will return the iblt of the entire DAG
	IBLT(reqClock uint32) (tree.Iblt, uint32)

	// IncorrectStateDetected is called when the xor and LC value from a gossip message do NOT match the local state.
	IncorrectStateDetected()
2 node mismatches = repair?
Yes, it could become interesting if 2 nodes are out of sync... Could set it to 5 or even 10.
func newXorTreeRepair(state *state) *xorTreeRepair {
	return &xorTreeRepair{
		state:  state,
		ticker: time.NewTicker(10 * time.Second),
Given almost 30,000 TXs in stable, it would take 30000/512 pages × 10 s ÷ 60 ≈ 10 minutes to fix, worst case. Is that OK?
As opposed to the 6 months of additional network messages every 10 s...
You want different gossip messages to arrive within a single interval so they can set the state to green again.
Another node would have to have the exact same defect, which is extremely unlikely. Then you can always turn off one node or nuke the DAG.
Band-aid time!
closes #2295
The DAG state now starts a goroutine that loops over parts of the DAG when triggered. The trigger is a multiple-XOR comparison failure detected from gossip messages: all connected nodes need to have a different XOR before the repair starts. Once triggered, it handles one page of TXs per 10 seconds.
The fix does not run as a migration on startup. Although performance is good enough (20k TX/s), that would expose healthy nodes to an alternate code path which was previously not present. With this solution only nodes with a broken XOR tree are exposed to this code, reducing the risk of introducing new bugs for healthy nodes.
IBLT trees are not repaired, as discussed in #2295.