
kubevirt: Force neighbors cache update after live migration #4416

Open

qinqon wants to merge 8 commits into ovn-org:master from qinqon:kubevirt-neighbors-after-live-migration
Conversation

@qinqon (Contributor) commented Jun 4, 2024

What this PR does and why is it needed

Improves connectivity after KubeVirt live migration by forcing a neighbor-cache update at the VM and pods; this removes flakiness in several scenarios. The details are in the design doc added by the first commit.

Which issue(s) this PR fixes

Fixes #4237
Fixes #3986
Fixes https://issues.redhat.com/browse/SDN-4860

Special notes for reviewers

The main change to review is the first commit, which contains the design doc changes; that is where review should focus. The rest of the PR is the PoC code that illustrates the design doc; once the design doc is approved, the rest of the PR will be reworked.

PoC TODO:

  • Replace LSP deletion with mac address removal.

Details to documentation updates

live migration design docs updated

Description for the changelog

kubevirt: Force neighbors cache update after live migration

Does this PR introduce a user-facing change?

NONE

For more information on release notes see: TBD

@qinqon qinqon changed the title from "Kubevirt neighbors after live migration" to "kubevirt: Force neighbors cache update after live migration" on Jun 4, 2024
@qinqon qinqon marked this pull request as ready for review June 4, 2024 11:55
@qinqon qinqon requested a review from a team as a code owner June 4, 2024 11:55
@qinqon qinqon requested a review from ricky-rav June 4, 2024 11:55
@tssurya (Member) commented Jun 4, 2024

docs, docs, please fix the docs issue here as well if possible @qinqon ?

@qinqon (Contributor, Author) commented Jun 4, 2024

> docs, docs, please fix the docs issue here as well if possible @qinqon ?

You mean https://issues.redhat.com/browse/SDN-4904, right ?

@tssurya (Member) commented Jun 4, 2024

I meant #4398 :D

ah they are the same!

@qinqon (Contributor, Author) commented Jun 4, 2024

> I meant #4398 :D
>
> ah they are the same!

Addressed in a separate PR: #4418.

Sorry for the delay.

@tssurya tssurya added feature/kubevirt-live-migration All issues related to kubevirt live migration area/e2e-testing labels Jun 4, 2024
@qinqon qinqon force-pushed the kubevirt-neighbors-after-live-migration branch from 207ac71 to fa77351 on June 11, 2024 10:51
@qinqon (Contributor, Author) commented Jun 11, 2024

Rebased

@github-actions github-actions bot added the kind/documentation All issues related to documentation label Jun 11, 2024
@coveralls commented

Coverage Status: coverage 52.485% (-0.2%) from 52.729% when pulling fa77351 on qinqon:kubevirt-neighbors-after-live-migration into 17dce5c on ovn-org:master.

@ricky-rav (Contributor) left a comment

Pushing my review comments for the first commit only. Will move to the other commits right away.

```text
┌──────────────────────┐ │ │ ┌──────────────────────┐
│ lsp ns-virt-launcher1│──┘ └───│ lsp pod1 │
│ address: │ │ address: │
│ 0a:58:0a:f4:00:01 │ │ 0a:58:0a:f1:00:02 │
```
@ricky-rav (Contributor) commented:

The MAC address for lsp pod1 shouldn't have changed, I suppose? Here I see that the f4 bytes became f1

@qinqon (Contributor, Author) replied:

My bad, will fix it.

```text
┌────────────────────────┐ ┌────────────────────────┐
│logical switch node2 │ │logic switch node1 │
│ (10.244.1.0/24) │ │ (10.244.1.0/24) │
└─────────────────────────────────┘
```

@ricky-rav (Contributor) commented:

So here the subnet for logical switch node1 should be the same as before: 10.244.0.0/24

```text
Now the neighbors tables are incorrect since non of those mac address are
```

@ricky-rav (Contributor) commented:

non -> none, address -> addresses

```text
Now the neighbors tables are incorrect since non of those mac address are
are part of the logical switches, after some time or traffic neighbors cache
get invalidated and updated so they will point to the arp_proxy mac `0a:58:0a:f3:00:00`, problem
is that it may take too much times and connections will be broken.
```

@ricky-rav (Contributor) commented:

times -> time

```text
is that it may take too much times and connections will be broken.

To improve the situation ovn-k sends the following GARPs and unsolicited NA taking
into account tha the arp proxy mac is `0a:58:0a:f3:00:00`:
```

@ricky-rav (Contributor) commented:

tha -> that


```text
At node2 advertise to VM that pod's from the same subnet should go over proxy_arp:
- `GARP(10.244.0.9 -> 0a:58:0a:f3:00:00, vm mac(0a:58:0a:f4:00:01))`
- one per pod at same subnet.
```

@ricky-rav (Contributor) commented:

Just for clarity, here I would say something like `one per pod on the same subnet (in this example, only for pod1)`.
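For readers following the design discussion: a gratuitous ARP is simply an ARP reply nobody asked for, used here to overwrite stale neighbor entries with the arp_proxy MAC. As a minimal illustration only (not the PR's code; the byte layout is the standard ARP-over-Ethernet encoding), such a frame could be assembled in Go like this:

```go
package main

import "net"

// buildGARP assembles an Ethernet frame carrying a gratuitous ARP reply that
// advertises "ip is at srcMAC", sent unicast to dstMAC (the VM's MAC above).
func buildGARP(srcMAC, dstMAC net.HardwareAddr, ip net.IP) []byte {
	b := make([]byte, 0, 42)
	b = append(b, dstMAC...)   // Ethernet destination: the neighbor to refresh
	b = append(b, srcMAC...)   // Ethernet source: the arp_proxy MAC
	b = append(b, 0x08, 0x06)  // EtherType: ARP
	b = append(b, 0x00, 0x01)  // hardware type: Ethernet
	b = append(b, 0x08, 0x00)  // protocol type: IPv4
	b = append(b, 6, 4)        // address lengths: MAC=6, IPv4=4
	b = append(b, 0x00, 0x02)  // opcode: reply (unsolicited)
	b = append(b, srcMAC...)   // sender MAC: 0a:58:0a:f3:00:00 in the example
	b = append(b, ip.To4()...) // sender IP: 10.244.0.9 in the example
	b = append(b, dstMAC...)   // target MAC: the VM's MAC
	b = append(b, ip.To4()...) // target IP == sender IP marks it gratuitous
	return b
}
```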

@qinqon qinqon force-pushed the kubevirt-neighbors-after-live-migration branch from fa77351 to 4f58658 on June 13, 2024 12:30
@qinqon qinqon requested a review from ricky-rav June 13, 2024 12:30
@coveralls commented

Coverage Status: coverage 52.472% (-0.3%) from 52.729% when pulling 4f58658 on qinqon:kubevirt-neighbors-after-live-migration into 17dce5c on ovn-org:master.

@ricky-rav (Contributor) left a comment

Mostly asking for more comments. I haven't looked at the last two commits yet.

```go
WithPolling(polling).
	WithTimeout(timeout).
	Should(Succeed(), func() string { return stage + ": " + podName + ": " + output })
output, err := kubevirt.RunCommand(vmi, fmt.Sprintf("curl http://%s", net.JoinHostPort(podIP, "8000")), 5*time.Second)
```
@ricky-rav (Contributor) commented:

could you briefly explain why there's no need for the call to Eventually?

@qinqon (Contributor, Author) replied:

Eventually was masking the real issue, namely the time it takes for the ARP neighbor cache to be reconciled; removing it exposed the issue and forced it to be solved.
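A hedged sketch of the contrast being discussed, assuming the Ginkgo/Gomega context of the excerpt above (the curl target is a stand-in using the example pod IP and port):

```go
// Polling form: retries absorb the time the ARP neighbor cache needs to
// reconcile, so a slow cache update never fails the test.
Eventually(func() error {
	_, err := kubevirt.RunCommand(vmi, "curl http://10.244.0.9:8000", 5*time.Second)
	return err
}).WithPolling(polling).WithTimeout(timeout).Should(Succeed())

// Direct form: a single attempt surfaces a stale neighbor entry immediately.
output, err := kubevirt.RunCommand(vmi, "curl http://10.244.0.9:8000", 5*time.Second)
Expect(err).NotTo(HaveOccurred(), output)
```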

```go
resource: virtualMachine,
test: liveMigrate,
topology: "localnet",
eastWestTrafficTimeout: 15 * time.Second,
```
@ricky-rav (Contributor) commented:

So we're increasing the timeout from 5 to 15 seconds because we want to allow enough time for the GARPs to be sent on both nodes (old, new) and for ovn-k to remove the VM mac address from the old node LSP, right?

@qinqon (Contributor, Author) replied:

The issue we fix here is for the default pod network; these tests are for multi-homing live migration, where we have lower expectations right now. There are Jira issues tracking improvements to live migration with multi-homing. We set this value so we can reuse the east/west traffic function while still making it pass for multi-homing live migration.
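To make that design choice concrete, a hedged sketch of a per-topology test table (field names follow the excerpt above; the 5-second default-network value is an assumption for contrast):

```go
// One east/west traffic check, parameterized per topology: strict on the
// default pod network, lenient for multi-homing until it catches up.
cases := []struct {
	topology               string
	eastWestTrafficTimeout time.Duration
}{
	{topology: "defaultNetwork", eastWestTrafficTimeout: 5 * time.Second}, // assumed value
	{topology: "localnet", eastWestTrafficTimeout: 15 * time.Second},      // from the excerpt
}
```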

```diff
@@ -146,6 +148,31 @@ func AllVMPodsAreCompleted(client *factory.WatchFactory, pod *corev1.Pod) (bool,
	return true, nil
}

// isLiveMigrationInProgress return true if a live migration is still progressing
```
@ricky-rav (Contributor) commented:

Could you add to the comment how live migration is detected? From what I understand (to continue your comment line), "that is when there's more than one pod associated to the same VM".

@qinqon (Contributor, Author) replied:

This is old code.
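For context, a hypothetical sketch of the detection rule the reviewer spells out; this is not the PR's code, and listing the VM's pods is left to the caller:

```go
import corev1 "k8s.io/api/core/v1"

// vmHasLiveMigrationInProgress reports whether a live migration is still in
// flight for one VM, given all pods owned by it: during migration the source
// and target virt-launcher pods coexist, so more than one pod is not completed.
func vmHasLiveMigrationInProgress(vmPods []*corev1.Pod) bool {
	active := 0
	for _, pod := range vmPods {
		if pod.Status.Phase != corev1.PodSucceeded && pod.Status.Phase != corev1.PodFailed {
			active++
		}
	}
	return active > 1
}
```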

```diff
@@ -199,6 +226,29 @@ func nodeContainsPodSubnet(watchFactory *factory.WatchFactory, nodeName string,
	return false, nil
}

// findNodeOwningSubnet will return the node owning the subnet
// contains the subnets from the argument
```
@ricky-rav (Contributor) commented:

Here the sentence in the comment is broken :) How about `findNodeOwningSubnet returns the node owning the subnet found in the pod annotation passed as input`?
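A speculative sketch of the shape such a lookup could take, reusing the `nodeContainsPodSubnet` helper from the hunk above (its signature is truncated there, so the `subnets` parameter and the `nodeNames` input are assumptions):

```go
// findNodeOwningSubnet returns the name of the first node whose host subnet
// contains the given subnets, or "" when no node owns them. Illustrative only.
func findNodeOwningSubnet(watchFactory *factory.WatchFactory, nodeNames []string, subnets []*net.IPNet) (string, error) {
	for _, nodeName := range nodeNames {
		owns, err := nodeContainsPodSubnet(watchFactory, nodeName, subnets) // assumed signature
		if err != nil {
			return "", err
		}
		if owns {
			return nodeName, nil
		}
	}
	return "", nil
}
```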


```go
// notifyARPProxyMACForIP will send an GARP to force arp caches to clean up
// at pods and reference to the arp proxy gw mac
func notifyARPProxyMACForIPs(ips []*net.IPNet, dstMAC net.HardwareAddr) error {
```
@ricky-rav (Contributor) commented:

This function is one of the core points of this PR. I know you described the global picture in the markdown document, but could you also describe here what's happening in the relevant functions and in the commit message?

@qinqon (Contributor, Author) replied:

Hi, the commits implementing the design doc are from the PoC; if the implementation is "sound", I will refactor them with proper commit messages and structure.
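A speculative sketch of what the body of `notifyARPProxyMACForIPs` might do, based on the design doc quoted earlier; the hard-coded arp_proxy MAC and the `sendGARP`/`sendUnsolicitedNA` helpers are illustration-only assumptions, not the PoC's actual code:

```go
import (
	"fmt"
	"net"
)

func notifyARPProxyMACForIPs(ips []*net.IPNet, dstMAC net.HardwareAddr) error {
	// arp_proxy MAC from the design doc; real code would derive it, not hard-code it.
	arpProxyMAC, err := net.ParseMAC("0a:58:0a:f3:00:00")
	if err != nil {
		return err
	}
	for _, ipNet := range ips {
		if ipNet.IP.To4() != nil {
			// IPv4 neighbors are refreshed with a gratuitous ARP.
			err = sendGARP(ipNet.IP, arpProxyMAC, dstMAC) // hypothetical helper
		} else {
			// IPv6 neighbors need an unsolicited Neighbor Advertisement instead.
			err = sendUnsolicitedNA(ipNet.IP, arpProxyMAC, dstMAC) // hypothetical helper
		}
		if err != nil {
			return fmt.Errorf("failed to advertise %s at %s: %v", ipNet.IP, arpProxyMAC, err)
		}
	}
	return nil
}
```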

@qinqon qinqon requested a review from ricky-rav June 17, 2024 05:36
@qinqon qinqon force-pushed the kubevirt-neighbors-after-live-migration branch from 4f58658 to 4413548 on June 17, 2024 05:38
@coveralls commented

Coverage Status: coverage 52.452% (-0.3%) from 52.716% when pulling 4413548 on qinqon:kubevirt-neighbors-after-live-migration into 701b8e0 on ovn-org:master.

This change adds the design to advertise neighbors after KubeVirt's VM live migration, to improve connectivity.

Signed-off-by: Enrique Llorente <[email protected]>
@qinqon qinqon force-pushed the kubevirt-neighbors-after-live-migration branch from 4413548 to c0d4d35 on June 19, 2024 07:20
@ricky-rav (Contributor) commented:
The approach looks good to me. The pod informer used in the logic is a shared informer, so it shouldn't add overhead to the existing code.
/lgtm

@coveralls commented

Coverage Status: coverage 52.492% (-0.3%) from 52.749% when pulling c0d4d35 on qinqon:kubevirt-neighbors-after-live-migration into 9f1f3f2 on ovn-org:master.

@github-actions bot commented:
This PR is stale because it has been open 90 days with no activity. Remove stale label or comment or reach out to maintainers for code reviews or consider closing this if you do not plan to work on it.

@github-actions github-actions bot added the lifecycle/stale All issues (> 60 days) and PRs (>90 days) with no activity. label Sep 18, 2024