
Add server-side memory of ComputeServiceRegistrations that consistently fail jobs? #258

Open
dotsdl opened this issue Mar 15, 2024 · 0 comments

dotsdl commented Mar 15, 2024

When a ComputeService is deployed to a problematic compute node, ProtocolDAGs executed on that node can fail randomly or systematically. This can quickly exhaust the Tasks available on the server: the ComputeService claims Tasks and errors out on them in quick succession, leaving healthy ComputeServices idle.

One mitigation is to implement short-term memory within, or associated with, ComputeServiceRegistrations server-side. As a ComputeService submits completed or errored ProtocolDAGResults to the server, each completion or error could be recorded by appending 1 or -1, respectively, to a growing list of values.
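A minimal sketch of what that per-registration memory might look like. `OutcomeHistory`, `record`, and `maxlen` are hypothetical names, not part of alchemiscale, and bounding the list with a `deque` is an assumption added here to keep the memory short-term; the proposal itself only describes a growing list.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class OutcomeHistory:
    """Hypothetical short-term memory for one ComputeServiceRegistration."""

    # Assumption: keep only the most recent `maxlen` outcomes.
    maxlen: int = 20
    outcomes: deque = field(init=False)

    def __post_init__(self) -> None:
        self.outcomes = deque(maxlen=self.maxlen)

    def record(self, ok: bool) -> None:
        """Append 1 for a completed ProtocolDAGResult, -1 for an errored one."""
        self.outcomes.append(1 if ok else -1)


history = OutcomeHistory(maxlen=3)
for ok in (True, False, False, False):
    history.record(ok)
# The oldest entry (the single success) has been dropped by the bound.
```

Where this state would live (on the registration node in the database, or in a separate structure keyed by registration) is left open here.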

This list can then be evaluated server-side when the ComputeService attempts to claim new Tasks, perhaps as a weighted sum of its values, with higher weights on the most recent values and lower weights on older ones. If the resulting sum is negative, the ComputeService may be denied further claims until an expiry time is reached. The expiry duration would be configurable as part of the AlchemiscaleComputeAPI config, with the expiry datetime set as (datetime_denied_attempt() + expiry_seconds). On its first claim attempt after expiry, the ComputeService would be allowed to claim Tasks again, giving it a chance to redeem itself.
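The claim check could be sketched roughly as follows. The function names, the geometric `decay` weighting, and the `denied_until` bookkeeping are all assumptions for illustration; only the idea of a recency-weighted sum and an `expiry_seconds` setting comes from the proposal above.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence, Tuple


def weighted_score(outcomes: Sequence[int], decay: float = 0.8) -> float:
    """Recency-weighted sum: the newest value has weight 1, and each
    older value is weighted by a further factor of `decay` (assumed scheme)."""
    score = 0.0
    weight = 1.0
    for value in reversed(outcomes):
        score += weight * value
        weight *= decay
    return score


def may_claim(
    outcomes: Sequence[int],
    denied_until: Optional[datetime],
    now: datetime,
    expiry_seconds: float,
    decay: float = 0.8,
) -> Tuple[bool, Optional[datetime]]:
    """Return (allowed, new_denied_until) for one claim attempt."""
    if denied_until is not None:
        if now < denied_until:
            # Still inside the denial window.
            return False, denied_until
        # First attempt after expiry: allow it and clear the denial.
        return True, None
    if weighted_score(outcomes, decay) < 0:
        # Deny, with expiry measured from this denied attempt.
        return False, now + timedelta(seconds=expiry_seconds)
    return True, None


now = datetime(2024, 3, 15, 12, 0, 0)
allowed, until = may_claim([1, 1, -1, -1], None, now, expiry_seconds=60, decay=0.5)
# Two recent failures outweigh two older successes, so this attempt is denied.
```

In this sketch the score is re-evaluated on the attempt after the redeeming one, so a ComputeService that keeps failing would be denied again after a single post-expiry claim.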

This should slow Task exhaustion substantially, while also giving ComputeServices a chance to recover from temporary issues, such as transient high load on a shared resource.
