Support SiteWhitelist/SiteBlacklist update for active workflows #8323

Open
4 tasks
amaltaro opened this issue Nov 13, 2017 · 6 comments

Comments

@amaltaro
Contributor

amaltaro commented Nov 13, 2017

Impact of the new feature
ReqMgr2, Global WorkQueue, Local WorkQueue

Is your feature request related to a problem? Please describe.
This is especially important for long-lived workflows, where sites might come and go and further tweaks to the site lists could be important.

Describe the solution you'd like
Support update to the SiteWhitelist and SiteBlacklist for workflows that have already been assigned (being between assigned and running-closed).

There are two steps that can be taken for this:

  1. when a request is updated, only propagate the new site lists to global workqueue elements that are sitting in Available status;
  2. when a request is updated, propagate it to local workqueue and to jobs sitting pending in condor.

The following tickets have been materialized for option 1) above:

Describe alternatives you've considered
In addition to the steps above, I think we will have to update the relational database as well, including jobs already created but queued in JobSubmitter.

Additional context
None

@amaltaro amaltaro changed the title from "Support SiteWhitelist/SiteBlacklist update for assigned/acquired workflows" to "Support SiteWhitelist/SiteBlacklist update for active workflows" May 12, 2023
@amaltaro
Contributor Author

Based on feedback from Hasan, this feature will be useful in diverse scenarios, such as:

  1. Sites that are planning a destructive rebuild of their storage (one that does not preserve any data on it). They could be removed from the site lists well in advance, giving those workflows time to potentially complete.
  2. Workflows that might be misbehaving and/or causing site issues.
  3. Sites going into, or coming out of, a long downtime period.

This is actually a meta issue that will require, at the very least, the following developments:

  • Update input data placement (MSTransferor task)
  • Update all the global workqueue elements, if any.
  • Update local workqueue elements, potentially through JobUpdater component
  • Update all the WMBS relational data/site mask
  • Update jobs in the agent queue for JobSubmitter
  • Update jobs pending in condor

It is important to mention that there are many places where a data race / race condition can happen, such as:

  • during the global workqueue work creation
  • during the WMBS work creation
  • during job creation
  • perhaps during job submission.

I honestly do not see a way to resolve all of those, unless we decide to make it not a single event-triggered action, but rather something that polls for affected work over a given amount of time.
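
To make the polling idea a bit more concrete, here is a minimal sketch. It is not WMCore code: `poll_and_apply`, `fetch_pending` and `apply_update` are hypothetical names, and the clock and sleep functions are injectable only so the loop can be exercised without real waiting. The point is that work created by a racing component after the first pass is still picked up on a later pass.

```python
import time


def poll_and_apply(fetch_pending, apply_update, duration, interval,
                   clock=time.monotonic, sleep=time.sleep):
    """Repeatedly apply an update to whatever work shows up in a time window.

    fetch_pending() returns the items that still carry the old site lists
    (hypothetical callable); apply_update(item) updates a single item.
    Returns the total number of updates applied.
    """
    deadline = clock() + duration
    total = 0
    while True:
        # Each pass catches items created since the previous pass.
        for item in fetch_pending():
            apply_update(item)
            total += 1
        if clock() >= deadline:
            return total
        sleep(interval)
```

The window length and polling interval would have to be tuned against the cycle times of the components involved; this sketch makes no claim about sensible values.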

I will be creating the other issues in the coming days and will set this one as a meta issue.

@amaltaro amaltaro added QPrio: Medium quarter priority and removed QPrio: High quarter priority labels Jul 31, 2023
@amaltaro
Contributor Author

From a P&R discussion last week, this got demoted in favor of wmcore_pileup developments. Now medium prio.

@amaltaro
Contributor Author

Based on the O&C weekly meeting discussion that took place today, it looks like the P&R team would be happy if we could deliver the initial sub-optimal feature in Q3/2024, mentioned in the issue description as:
"""

  1. when a request is updated, only propagate the new site lists to global workqueue elements that are sitting in Available status.
    """

The list of sub-tickets that we need to consider are:

  1. relocation of the relevant input data. At the moment there is no mechanism in MSTransferor to make it re-evaluate a given workflow/input data placement. So we will have to implement a new mechanism to make this step automated. This feature itself can be implemented with two different levels of quality:
    a) trigger a new data placement with the new site list
    b) in addition to the new data placement, also trigger a data deletion for the site that has been removed from the site whitelist - if any.
    c) similar to a), but instead of making a new rule, we could consume the rule ids already persisted in the database (via MSTransferor/MSMonitor) and update their RSE expression accordingly.
    NOTE: this item itself could easily spawn 2 or 3 tickets, depending on the decisions we make and advice from DM experts...
  2. Change ReqMgr2 behavior so that it allows updating the SiteWhitelist and SiteBlacklist fields for workflows that are in a state between assigned and running-closed. The update should be rejected, though, if the state is staged, so that we avoid a data race condition (global workqueue with an outdated workflow spec). This update needs to be reflected in both the JSON document and the workload spec object.
  3. Coupled to the ReqMgr2 action listed in item 2., we also need to make a call to Global WorkQueue and update every single workqueue element that is in status Available, for that given workflow. We should likely also update the workload spec persisted in the workqueue (_inbox) database.
  4. It is not clear to me whether we would have to update the workload spec that has already been downloaded by a given agent, in case the workflow is already running. This requires further investigation. If needed, we will have to implement it in one of the components. How do we detect that the spec file has changed?
  5. Optional: do we want to keep a history of such changes? If so, which information needs to be persisted? Probably DN, timestamp, list of sites added, list of sites removed. Anything else?
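
For item 5, a history entry could be derived by diffing the old and new site lists. The sketch below is only an illustration: the function name and the record keys (`DN`, `Timestamp`, `SitesAdded`, `SitesRemoved`) are hypothetical and simply mirror the fields proposed above, not an existing ReqMgr2 schema.

```python
from datetime import datetime, timezone


def build_site_change_record(dn, old_sites, new_sites, now=None):
    """Build one history entry describing a site list update.

    dn is the user's distinguished name; old_sites/new_sites are iterables
    of site names. `now` can be passed in to make the result reproducible.
    """
    old_set, new_set = set(old_sites), set(new_sites)
    timestamp = now if now is not None else datetime.now(timezone.utc)
    return {
        "DN": dn,
        "Timestamp": int(timestamp.timestamp()),
        "SitesAdded": sorted(new_set - old_set),
        "SitesRemoved": sorted(old_set - new_set),
    }


record = build_site_change_record(
    "/DC=ch/CN=example user",            # placeholder DN
    ["T1_US_FNAL", "T2_CH_CERN"],        # previous SiteWhitelist
    ["T2_CH_CERN", "T2_DE_DESY"],        # updated SiteWhitelist
)
```

One such record per list (SiteWhitelist and SiteBlacklist) would probably be enough to reconstruct what changed and when.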

With potentially the 5 items above, we can deliver a very first version of this feature, which will update site lists in any work that has NOT yet been acquired by any agents. WorkQueue elements and jobs already materialized in the agents would go through the system without considering the site list update.
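
Items 2 and 3 could be sketched roughly as follows. This is a simplified illustration, not the ReqMgr2 or WorkQueue API: the elements are plain dicts standing in for workqueue elements, the function names are hypothetical, and the list of intermediate states between assigned and running-closed is my assumption (only assigned, staged and running-closed are named above).

```python
# Assumed ordering of the active request states; only "assigned", "staged"
# and "running-closed" are taken from the discussion.
ACTIVE_STATES = ["assigned", "staging", "staged", "acquired",
                 "running-open", "running-closed"]
# States where the update is rejected to avoid racing with work creation.
BLOCKED_STATES = {"staged"}


def can_update_site_lists(request_status):
    """Return True if site lists may be updated for a request in this state."""
    return request_status in ACTIVE_STATES and request_status not in BLOCKED_STATES


def update_available_elements(elements, request_name, whitelist, blacklist):
    """Update site lists in place for the Available elements of one request.

    Elements in any other status (e.g. already acquired by an agent) are
    left untouched. Returns the number of elements that were updated.
    """
    updated = 0
    for elem in elements:
        if elem["RequestName"] != request_name or elem["Status"] != "Available":
            continue
        elem["SiteWhitelist"] = list(whitelist)
        elem["SiteBlacklist"] = list(blacklist)
        updated += 1
    return updated
```

In this first version, elements that are not Available are deliberately skipped, which matches the limitation described above: work already pulled down by an agent keeps its original site lists.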

I appreciate any feedback that people might have, especially for functionality/services that I might be missing here.

@hassan11196
Member

Hello @amaltaro,

Thank you for providing the list of sub-tickets.

I just want to confirm my understanding of the description. In point #2, you mentioned that ReqMgr will be modified to allow updates to SiteWhitelist and SiteBlacklist between the 'assigned' and 'running-closed' states but not in the 'staged' state. However, from my understanding, a workflow transitions to 'running-closed' once all its Work Queue Elements (WQEs) are picked up by an agent. This implies that changing the SiteWhitelist when the workflow is in the 'running-closed' state would not actually affect where the jobs run. Is that correct?

I understand that this is something that would be tackled in the second part of the issue description, i.e.:

2. when a request is updated, propagate it to local workqueue and to jobs sitting pending in condor.

@haozturk

Thanks a lot @amaltaro and @anpicci! This is much needed. It's reasonable to approach this request in two steps and focus on the first step in Q3. Probably we'll discuss each step in its own issue, but let me make a quick comment for 1.c: you cannot update (update-rule) the RSE expression of a rule and keep the same rule id. You can change the RSE expression by "moving" (move-rule) a rule, which creates a new rule.

@amaltaro
Contributor Author

Hi @haozturk @hassan11196, thank you (and Andrea) for your prompt feedback. Both of your points are valid and they will be considered when we materialize these 5 points into their own GH tickets. Once that is done, I will also update the initial description of this issue, such that it becomes a meta-issue and we can track all of the sub-items to be developed. Thanks again!
