
add xor tree repair loop #2309

Merged (3 commits into master, Jul 8, 2023)

Conversation

woutslakhorst (Member)

closes #2295

The DAG state now starts a goroutine that loops over parts of the DAG when triggered. The trigger is repeated XOR comparison failures detected from gossip messages; all connected nodes will need to have a different XOR before it's triggered. Once triggered, it handles one page of transactions every 10 seconds.

The fix does not run as a migration on startup. Although performance is good enough (20k tx/s), a startup migration would expose healthy nodes to a code path that was previously not present. With this solution only nodes with a broken XOR tree run the new code, reducing the risk of introducing new bugs for healthy nodes.

IBLT trees are not repaired, as discussed in #2295.
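To make the description above concrete, here is a minimal sketch of what such a trigger-gated repair loop can look like in Go. It borrows the field and method names quoted later in this review (ctx, cancel, ticker, currentPage, checkPage); the mismatch counter, its threshold, and shouldRepair are assumptions, not the PR's actual code, which lives in network/dag/consistency.go.

```go
package dag

import (
	"context"
	"sync"
	"time"
)

// xorTreeRepair periodically verifies pages of the XOR tree once gossip keeps
// reporting mismatches. Field names follow the snippets quoted in this review;
// the misses counter is an assumption.
type xorTreeRepair struct {
	mutex       sync.Mutex
	ctx         context.Context
	cancel      context.CancelFunc
	ticker      *time.Ticker
	currentPage uint32
	misses      int // consecutive XOR mismatches reported via gossip (name assumed)
}

// shouldRepair gates the loop: it only runs while peers keep disagreeing with us.
func (f *xorTreeRepair) shouldRepair() bool {
	f.mutex.Lock()
	defer f.mutex.Unlock()
	return f.misses >= 2 // threshold is an assumption; see the discussion below
}

// start launches the background loop; every tick it verifies/fixes one page.
func (f *xorTreeRepair) start() {
	f.ctx, f.cancel = context.WithCancel(context.Background())
	go func() {
		for {
			select {
			case <-f.ctx.Done():
				return
			case <-f.ticker.C:
				if f.shouldRepair() {
					f.checkPage() // defined elsewhere: verify one page, then advance currentPage
				}
			}
		}
	}()
}

// stop cancels the loop and releases the ticker.
func (f *xorTreeRepair) stop() {
	f.cancel()
	f.ticker.Stop()
}
```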

network/dag/tree/tree.go (resolved review thread, outdated)
network/dag/consistency.go (resolved review thread)
network/dag/consistency.go (resolved review thread, outdated)
network/transport/v2/handlers.go (resolved review thread)
reinkrul (Member) commented Jul 6, 2023

> All connected nodes will need to have a different XOR before it's triggered

How does this fix the local node if only the local XOR is broken, and all other nodes are in sync?

Edit: now I (hopefully) understand; the local node's XOR needs to differ from all other connected nodes (indicating it is in sync with no other node), for the repair to trigger

reinkrul (Member) commented Jul 6, 2023

Should this be an invokable/triggerable operation as well (e.g. by removing certain files or sending a specific API call), in case it is suspected to be broken but doesn't trigger for whatever reason (e.g. another node has the same defect)?

network/dag/consistency.go (resolved review thread, outdated)
ctx context.Context
cancel context.CancelFunc
ticker *time.Ticker
currentPage uint32
Member:
Racy access (by both ticker and stateOK for instance)

Member Author:
Tested with -race and no race conditions were found. Both checkPage and stateOK use the mutex.

Member:
That's right, I saw that later and forgot to remove the second comment 👍
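To illustrate the point about the shared mutex: both the gossip-driven reset and the counter increment lock the same mutex as checkPage, which is why the -race run stays quiet. A sketch using the method names mentioned above; the bodies and the misses field are illustrative, not the PR's exact code.

```go
// stateOK is called when gossip confirms our XOR matches a peer again;
// checkPage (shown further down) runs from the ticker. Both lock f.mutex,
// so the shared fields are never accessed concurrently.
func (f *xorTreeRepair) stateOK() {
	f.mutex.Lock()
	defer f.mutex.Unlock()
	f.misses = 0      // peers agree with us again: stand down
	f.currentPage = 0 // a future repair run starts from the first page
}

// incrementCount is a hypothetical counterpart fed by detected mismatches.
func (f *xorTreeRepair) incrementCount() {
	f.mutex.Lock()
	defer f.mutex.Unlock()
	f.misses++
}
```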

calculatedXorTree := tree.New(tree.NewXor(), PageSize)

// acquire global lock
err := f.state.graph.db.Write(context.Background(), func(txn stoabs.WriteTx) error {
Member:
How quick is this? The funcs called by externals are now blocked by the mutex (although they are blocked by the write TX anyway?).

Member Author:
takes around 50-100ms per check.

Member:
Perhaps good to do a performance test with this? 100 ms per check on an interval of one page per 10 s is roughly 1% DB downtime. It could have a significant impact during bursts of data (auto-publishing OpenID services), but becomes less relevant as more transactions move off the DAG.
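For context on the cost being discussed, the per-page check is roughly the following shape. The lines quoted from the diff (tree.New, the db.Write call, the Warnf message) are reused as-is; the helpers that read a page of transactions and swap the stored page are hypothetical names, not the PR's actual API.

```go
// Sketch of the per-page check. The write transaction is the "global lock"
// from the quoted snippet, so external calls into the state block for the
// ~50-100 ms the check takes.
func (f *xorTreeRepair) checkPage() {
	f.mutex.Lock()
	defer f.mutex.Unlock()

	calculatedXorTree := tree.New(tree.NewXor(), PageSize)

	// acquire global lock
	err := f.state.graph.db.Write(context.Background(), func(txn stoabs.WriteTx) error {
		// recompute the XOR over all TXs whose lamport clock falls on the current page
		txs, err := transactionsOnPage(txn, f.currentPage) // hypothetical helper
		if err != nil {
			return err
		}
		for _, tx := range txs {
			calculatedXorTree.Insert(tx.Ref(), tx.Clock()) // tree API assumed from the tree.New usage above
		}
		// overwrite the stored page only when it differs from the recalculated one
		if !storedPageMatches(f.state, f.currentPage, calculatedXorTree) { // hypothetical helper
			if err := replaceStoredPage(txn, f.state, f.currentPage, calculatedXorTree); err != nil { // hypothetical helper
				return err
			}
			log.Logger().Warnf("detected XOR tree mismatch for page %d, fixed using recalculated values", f.currentPage)
		}
		return nil
	})
	if err != nil {
		log.Logger().Errorf("XOR tree check failed for page %d: %v", f.currentPage, err)
	}
	f.currentPage++ // the next tick checks the next page
}
```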

network/dag/consistency.go (resolved review thread)
if err != nil {
return err
}
log.Logger().Warnf("detected XOR tree mismatch for page %d, fixed using recalculated values", f.currentPage)
Member:
Could it just be that it's temporarily out of sync with the other nodes due to "normal" lag (missing a few TXs) and being slow for a moment?

woutslakhorst (Member Author) commented Jul 7, 2023:
Only when you have also experienced downtime.

network/dag/consistency.go (resolved review thread, outdated)
@@ -93,6 +93,11 @@ type State interface {
// - highest lamport clock in the DAG
// A requested clock of math.MaxUint32 will return the iblt of the entire DAG
IBLT(reqClock uint32) (tree.Iblt, uint32)

// IncorrectStateDetected is called when the xor and LC value from a gossip message do NOT match the local state.
IncorrectStateDetected()
Member:
2 node mismatches = repair?

Member Author:
Yes, it could become interesting if 2 nodes are out of sync... Could set it to 5 or even 10.
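As a rough illustration of how the new State method could be fed from the gossip handler in network/transport/v2/handlers.go: the handler shape, the XOR signature, the threshold discussion, and the CorrectStateDetected counterpart are assumptions; only IncorrectStateDetected comes from the diff above.

```go
// Sketch of the trigger side. Assumes State.XOR mirrors the IBLT signature
// shown in the diff: XOR(reqClock uint32) (hash.SHA256Hash, uint32).
func (p *protocol) checkGossipXor(peerXor hash.SHA256Hash, peerLC uint32) {
	localXor, localLC := p.state.XOR(peerLC)
	if localLC >= peerLC && !localXor.Equals(peerXor) {
		// we are not simply behind, yet our XOR differs: the local tree may be corrupt
		p.state.IncorrectStateDetected()
	} else {
		p.state.CorrectStateDetected() // hypothetical counterpart that resets the mismatch counter
	}
}
```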

func newXorTreeRepair(state *state) *xorTreeRepair {
return &xorTreeRepair{
state: state,
ticker: time.NewTicker(10 * time.Second),
Member:
Given almost 30000 TXs in stable, it would take 30000/512 pages × 10 s/page ≈ 10 minutes to fix, worst case. Is that OK?

Member Author:
As opposed to the 6 months of additional network messages every 10 s...

You want different gossip messages to arrive within a single interval so they can set the state back to green.

woutslakhorst (Member Author):

> Should this be an invokable/triggerable operation as well (e.g. by removing certain files or sending a specific API call), in case it is suspected to be broken but doesn't trigger for whatever reason (e.g. another node has the same defect)?

Another node would have to have the exact same defect. Extremely unlikely. Then you can always turn off 1 node or nuke the DAG.

reinkrul (Member) commented Jul 7, 2023

Band-aid time!

woutslakhorst merged commit c215dcd into master on Jul 8, 2023; 5 checks passed.
The feature/2295/xor_tree_repair branch was deleted on July 8, 2023 at 08:03.
Merging this pull request closes: Auto-fix incorrect XOR/IBLT values (#2295).