Couldn't deleted snapshot data #7826

Open
rtjdamen opened this issue Jul 11, 2024 · 16 comments · Fixed by #7960

@rtjdamen

Are you using XOA or XO from the sources?

XOA

Which release channel?

latest

Provide your commit number

No response

Describe the bug

A CBT snapshot backup finishes with the warning "Couldn't deleted snapshot data".

Error message

Couldn't deleted snapshot data

error
{
  "code": "VDI_IN_USE",
  "params": ["OpaqueRef:79798f68-e639-494b-8adf-cade435be5fb", "data_destroy"],
  "call": {
    "method": "VDI.data_destroy",
    "params": ["OpaqueRef:79798f68-e639-494b-8adf-cade435be5fb"]
  }
}
vdiRef
"OpaqueRef:79798f68-e639-494b-8adf-cade435be5fb"

To reproduce

Random behavior; it seems to be related to specific VMs, as it reoccurs on the same VMs every time.

Expected behavior

If the VDI.data_destroy call fails, I would expect a retry; a manual retry does work. Maybe a timing issue?

Screenshots

No response

Node

18.20.2

Hypervisor

XCP-ng 8.2

Additional context

It happens on some VMs.

@rtjdamen
Author

The issue is still occurring on the latest version. It seems like a timing issue: the snapshot is still in use, and after a few seconds the command can be processed by hand.

@olivierlambert
Member

Thanks for your feedback @rtjdamen !

@fbeauchamp
Collaborator

Hi @rtjdamen ,

After a careful analysis with the XCP-ng team, we found that this error is raised when the XAPI fails to unplug the VDI within a non-modifiable delay of 4s.

We patched your installation with a retry on the XO side. If it's OK with you, we'll monitor tonight's jobs and see if it's enough to handle this edge case.

Regards

fbeauchamp added a commit that referenced this issue Sep 3, 2024
sometimes the XAPI takes too long to detach the VDI
in this case, the timeout is fixed at 4s and cannot be modified
when the timeout is reached, the XAPI raises a VDI_IN_USE error

this is an internal process of the XAPI

This commit adds a retry on the XO side to give more room for the XAPI
to work through this process, as XO already does when destroying a VDI

fix #7826
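
For readers following the fix, here is a minimal sketch of the retry pattern this commit describes. This is not XO's actual code: `callXapi` is an assumed helper that performs a single XAPI call and rejects with the XAPI error object, and the attempt count and delay are illustrative.

```js
// Sketch only: retry VDI.data_destroy while the XAPI still reports the snapshot as in use.
async function dataDestroyWithRetry(callXapi, vdiRef, { attempts = 5, delayMs = 5000 } = {}) {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      // VDI.data_destroy removes the data of a CBT snapshot while keeping its metadata
      return await callXapi('VDI.data_destroy', vdiRef)
    } catch (error) {
      // the XAPI reports VDI_IN_USE while it is still unplugging the VDI (fixed internal 4s timeout)
      if (error?.code !== 'VDI_IN_USE' || attempt === attempts) throw error
      await new Promise(resolve => setTimeout(resolve, delayMs))
    }
  }
}
```
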
@rtjdamen
Author

rtjdamen commented Sep 3, 2024

Yes, no problem, we will keep an eye on it too!

@sblanchouin

Hi,

Can you patch our installation too, ASAP?

Thanks !

@rtjdamen
Author

rtjdamen commented Sep 4, 2024

@fbeauchamp Unfortunately it seems like the snapshot data is still not destroyed in every case; I think the 4s is still too short. Maybe we need to increase it to 10s to start with?

@fbeauchamp
Collaborator

The patch has been redeployed on the proxy. Waiting for tonight's run to be sure.

@rtjdamen
Author

rtjdamen commented Sep 5, 2024

> The patch has been redeployed on the proxy. Waiting for tonight's run to be sure.

Seems like that did the trick! No more orphan VDIs this morning! Also no VDI_IN_USE destroy messages; are these related or not?

@fbeauchamp
Collaborator

Yes, because the VDIs were not deleted (VDI_IN_USE) and stayed as orphans. Now we purge them correctly.

@rtjdamen
Author

rtjdamen commented Sep 5, 2024

So 2 issues fixed!

julien-f pushed commits that referenced this issue Sep 10, 2024

@rtjdamen
Author

The issue is not completely resolved: the original fix solved it, but the version now active in XOA does not.

@julien-f
Member

@rtjdamen Are you sure both your XOA and your XO Proxies are up to date on the latest channel?

If they are, we need to take a look at them.

@rtjdamen
Author

According to the GUI, they are.

@julien-f
Member

@rtjdamen After looking at your infra, it seems that there is still a major improvement: there are very few VDI_IN_USE errors now 🙂

The only problem we saw comes from the fact that one of your VDIs is still attached to the control domain, and our XCP-ng team is still investigating this issue.

We will continue to monitor this problem.
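
For anyone who wants to check this on their own VDIs, here is a hedged sketch using the public XAPI object model (same assumed `callXapi` helper as in the retry sketch above); it simply walks the VDI's VBDs and looks for one attached to the control domain.

```js
// Sketch only: true if any currently attached VBD of this VDI belongs to the control domain (dom0).
async function isAttachedToControlDomain(callXapi, vdiRef) {
  const vbdRefs = await callXapi('VDI.get_VBDs', vdiRef)
  for (const vbdRef of vbdRefs) {
    if (!(await callXapi('VBD.get_currently_attached', vbdRef))) continue
    const vmRef = await callXapi('VBD.get_VM', vbdRef)
    if (await callXapi('VM.get_is_control_domain', vmRef)) return true
  }
  return false
}
```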

@rtjdamen
Author

rtjdamen commented Sep 24, 2024 via email

@rtjdamen
Author

I just checked, but the one that failed yesterday is not hanging on the control domain, so this is incorrect.
