Update workflow spec files in WMAgents upon site list changes #12039

amaltaro · 2024-07-11T19:00:25Z

Impact of the new feature
WMAgent

Is your feature request related to a problem? Please describe.
This is a sub-task of this meta-issue: #8323
Towards providing a feature that allows workflow site lists to be changed while workflows are active in the system.

Describe the solution you'd like
The expected behavior for this ticket is simple, but how it is supposed to be implemented is still not completely clear.

What is expected from this ticket is: whenever a workflow is active in the system (from assigned to running) and it is updated with new site lists (SiteWhitelist/SiteBlacklist), the agent(s) working on that given workflow need to update the local workflow spec file. A potential component candidate for this would be WorkflowUpdater, if we can make this development simple enough.

Describe alternatives you've considered
How does an agent know whether it has to update the workflow spec? I would say the following are valid options:
a) it does not have to know, it will simply download/update spec files every x hours (12? 24?)
b) if there is a document timestamp, we might be able to use that information.

Otherwise, we would need to keep this record somewhere, and have agents using that information to decide which workflows need to have the spec updated.
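
To make the two options concrete, here is a minimal sketch of the decision each one implies. The function name, its parameters and the 24-hour default are assumptions for illustration only; nothing like this exists in WMCore today, and whether the document actually carries a usable timestamp is still an open question.

```python
def needs_spec_refresh(last_fetch, now, doc_timestamp=None, max_age_hours=24):
    """
    Decide whether the local workflow spec should be downloaded again.

    All arguments are UNIX timestamps in seconds (hypothetical inputs):
      last_fetch    - when this agent last downloaded the spec
      now           - current time
      doc_timestamp - last-update time of the central document, if available
    """
    if doc_timestamp is not None:
        # option (b): refresh only if the document changed after our last fetch
        return doc_timestamp > last_fetch
    # option (a): no timestamp available, refresh blindly every max_age_hours
    return (now - last_fetch) >= max_age_hours * 3600
```

Option (a) needs no extra bookkeeping but re-downloads specs that did not change; option (b) avoids that, provided a reliable timestamp exists.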

Additional context
None


vkuznet · 2024-09-17T14:05:52Z

Alan, in order to proceed with this issue please clarify the following:

Option (a) implies that we need to pull ALL workflows in the agent and compare their state with that of the previous polling cycle; therefore we need to introduce persistent storage for workflows to make such a comparison. The extraction of all workflows comes from:
- https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/WorkflowUpdater/WorkflowUpdaterPoller.py#L286
- https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/WorkflowUpdater/WorkflowUpdaterPoller.py#L251

The list of active workflows is fetched from the underlying DB, see https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMBS/MySQL/Workflow/GetUnfinishedWorkflows.py, which does not have a timestamp in it, see https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMBS/CreateWMBSBase.py#L168

Therefore, solution (b) will require a modification of the WMBS database schema, while solution (a) requires storing the previous state somehow.

Please clarify which approach should be implemented, as I don't want to waste development time on something that will not be required. Since WorkflowUpdater already fetches all workflows in its algorithm, it seems logical to proceed with option (a), but as I mentioned we will need to keep the previous list of workflows around somewhere. If this is the desired option, please clarify where to store this information and how.
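
As a rough sketch of what "keeping the previous state" could look like for option (a): the snippet below persists the site lists of the last polling cycle in a JSON file and reports workflows whose lists differ. The JSON file, the helper names and the dictionary layout are assumptions made only for illustration; the real storage (file, local CouchDB, WMBS table) is exactly what needs to be decided.

```python
import json
import os


def load_previous(path):
    """Load the site lists stored by the previous polling cycle (empty on first run)."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as fobj:
        return json.load(fobj)


def save_current(path, state):
    """Persist the site lists seen in the current polling cycle."""
    with open(path, "w", encoding="utf-8") as fobj:
        json.dump(state, fobj, sort_keys=True)


def changed_workflows(previous, current):
    """Return names of workflows whose SiteWhitelist or SiteBlacklist changed."""
    changed = []
    for name, lists in current.items():
        old = previous.get(name)
        if old is None:
            # first time we see this workflow, nothing to compare against
            continue
        for key in ("SiteWhitelist", "SiteBlacklist"):
            # compare as sorted lists so that ordering does not trigger an update
            if sorted(old.get(key, [])) != sorted(lists.get(key, [])):
                changed.append(name)
                break
    return changed
```

Each polling cycle would then call `load_previous`, compute `changed_workflows`, update the affected spec files, and finish with `save_current`.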

vkuznet · 2024-09-18T17:33:45Z

After discussion with Alan we ended up with two possible solutions:

1. Use a PubSub model and a NATS (or similar) server, where new changes to a workflow are published on the ReqMgr2 side and consumed by WMAgents
   - here ReqMgr2 will be the publisher, while WMAgents will be subscribers
   - we can also extend it to other components
2. Use a polling model, fetching docs from ReqMgr2/CouchDB to the agents. This workflow should compare specs and act upon any changes
   - here we will follow a synchronous model, i.e. someone posts an update to ReqMgr2, the WMAgents run a polling cycle to fetch all docs and compare them with the information present in ReqMgr2, then act upon it.

Solution (1) requires setting up a NATS (or similar) server (this is referred to as infrastructure, see CMS NATS) and developing publisher and subscriber clients. To simplify this, I provided a very basic example which can be run on any laptop using Python. Please find it in this document.
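
For readers unfamiliar with the pattern, the publisher/subscriber roles can be illustrated without any server at all. The sketch below is NOT a NATS client and not the example from the linked document; it is an in-process stand-in, and the class name, subject name and message layout are all hypothetical.

```python
from collections import defaultdict


class InMemoryBroker:
    """Toy in-process broker that mimics publish/subscribe semantics."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, subject, callback):
        """Register a callback to be invoked for every message on subject."""
        self._subscribers[subject].append(callback)

    def publish(self, subject, message):
        """Deliver message to all subscribers; return the number of deliveries."""
        callbacks = self._subscribers.get(subject, [])
        for callback in callbacks:
            callback(message)
        return len(callbacks)


# the WMAgent side (subscriber) collects updates it has to act upon
received = []
broker = InMemoryBroker()
broker.subscribe("workflow.sitelists", received.append)

# the ReqMgr2 side (publisher) announces a site list change
delivered = broker.publish(
    "workflow.sitelists",
    {"RequestName": "wf1", "SiteWhitelist": ["T1_A"], "SiteBlacklist": []},
)
```

With a real broker the agent only reacts when a change is announced, which removes the need to poll and compare all workflows.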

Solution (2) will require careful evaluation of scalability issues such as:

- fetching O(1000) workflows at each polling cycle (concurrency issue and RAM utilization)
  - if we send O(1000) requests to ReqMgr2 from each agent, we must guarantee that it will sustain such load (requests can be sent in chunks of 100)
  - if we pack all docs from CouchDB into a single payload, we will face a RAM spike at the WMAgent consuming such a document, or we must introduce streaming via NDJSON to avoid that possibility
- walking through O(1000) workflows to identify changes and act upon them
  - a sequential loop can take a long time to walk through each spec
  - an async loop requires careful error management, etc.
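
The two mitigations mentioned above (requests in chunks of 100, and NDJSON streaming) can be sketched with the standard library only. The helper names are hypothetical and the HTTP layer is left out on purpose; `stream` is assumed to be any iterable of text lines, e.g. a file-like response body.

```python
import json


def chunks(items, size=100):
    """Yield consecutive slices of items with at most size elements each."""
    for idx in range(0, len(items), size):
        yield items[idx:idx + size]


def iter_ndjson(stream):
    """
    Yield one decoded document per non-empty line of an NDJSON stream.
    Only a single document is held in memory at any time.
    """
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)
```

For example, `chunks(workflow_names)` would give the batches to request from ReqMgr2, and `iter_ndjson(response)` would let the agent process each returned document as it arrives instead of loading one large payload.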

amaltaro added New Feature WMAgent labels Jul 11, 2024

amaltaro changed the title ~~Update workflow spec files in WMAgents~~ Update workflow spec files in WMAgents upon site list changes Jul 11, 2024

amaltaro mentioned this issue Jul 11, 2024

Support SiteWhitelist/SiteBlacklist update for active workflows #8323

Open

4 tasks

vkuznet self-assigned this Sep 17, 2024
