
[FEATURE] Add transport request retry capability for async workflow steps #158

Closed

joshpalis opened this issue Nov 10, 2023 · 4 comments

Labels: enhancement (New feature or request)

@joshpalis (Member) commented Nov 10, 2023

Is your feature request related to a problem?

Coming from PR #155, which adds a GetMLTaskStep. The underlying API is an async operation that is invoked after registering a local model. This step is used to ascertain the local model registration status and to retrieve the model ID once the status is COMPLETED.
Local model registration takes upwards of 30 seconds, depending on the size of the model zip file that needs to be downloaded into the cluster, so the GetMLTaskStep has to call this API repeatedly until the model ID is returned.

This will be a common issue for any WorkflowStep that invokes an async API.
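For reference, a single status check looks roughly like the sketch below (assuming the ml-commons client's getTask(taskId, listener) API and the MLTask/MLTaskState types; names and package locations may vary by version, and this is not the actual GetMLTaskStep code):

```java
// Sketch of one status check. Assumes an injected ml-commons node client (mlClient)
// and a taskId returned by the register-model call.
mlClient.getTask(taskId, ActionListener.wrap(task -> {
    if (task.getState() == MLTaskState.COMPLETED) {
        // registration finished; the model id is now available
        String modelId = task.getModelId();
        // ... pass the model id on to the next workflow step
    } else {
        // registration still running: this is where some form of retry is needed
    }
}, exception -> {
    // ... handle transport or task failure
}));
```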

What alternatives have you considered?

A brute-force approach of retrying requests until a COMPLETED status is returned results in scores of requests sent across the transport layer. Sleeping the thread for some time before retrying may be the right path, but it runs the risk of exhausting the WorkflowStep node_timeout.

Neural plugin has this retry method for executing the predict API:
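Roughly, that pattern applied to GetMLTaskStep would look like the sketch below (hypothetical method and constant names, not the Neural plugin's actual code):

```java
// Bounded, recursion-based retry: re-issue the transport request immediately,
// up to MAX_RETRY times, and fail the step once the budget is exhausted.
// Assumes an injected ml-commons node client (mlClient).
private static final int MAX_RETRY = 5; // static limit; see the dynamic-setting discussion below

void retryableGetMlTask(String taskId, ActionListener<MLTask> listener, int retriesLeft) {
    mlClient.getTask(taskId, ActionListener.wrap(task -> {
        if (task.getState() == MLTaskState.COMPLETED) {
            listener.onResponse(task);
        } else if (retriesLeft > 0) {
            retryableGetMlTask(taskId, listener, retriesLeft - 1); // recurse with one fewer retry left
        } else {
            listener.onFailure(new OpenSearchStatusException(
                "ML task [" + taskId + "] did not complete after " + MAX_RETRY + " retries",
                RestStatus.REQUEST_TIMEOUT));
        }
    }, listener::onFailure));
}
```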

@dbwiddis (Member)

I see at least 3 (good) ways to do this:

  1. I like the approach the Neural plugin has taken, but I'm not fond of recursion.

  2. Personally I'd probably try to use the thread pool capabilities for this. The Reindex API uses the ThreadPool schedule() capability to conduct its retries (a sketch of this pattern follows the list):

  3. We actually use schedule() here in the ProcessNode code just prior to the workflow step execution in order to implement the timeout feature (think of it as the opposite of a retry!):
    https://github.com/opensearch-project/opensearch-ai-flow-framework/blob/1de0ca54ab29c00987f56e982c87855b70e47911/src/main/java/org/opensearch/flowframework/workflow/ProcessNode.java#L163-L167
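A rough sketch of option 2, assuming the step has access to a ThreadPool instance (the delay value and executor name are placeholders, and this is not the Reindex implementation):

```java
// Wait between attempts by scheduling the next check on the thread pool,
// rather than recursing immediately or sleeping the calling thread.
void scheduleNextCheck(String taskId, ActionListener<MLTask> listener, int retriesLeft) {
    threadPool.schedule(
        () -> mlClient.getTask(taskId, ActionListener.wrap(task -> {
            if (task.getState() == MLTaskState.COMPLETED) {
                listener.onResponse(task);
            } else if (retriesLeft > 0) {
                scheduleNextCheck(taskId, listener, retriesLeft - 1); // schedule the next attempt
            } else {
                listener.onFailure(new OpenSearchStatusException(
                    "ML task [" + taskId + "] did not complete in time", RestStatus.REQUEST_TIMEOUT));
            }
        }, listener::onFailure)),
        TimeValue.timeValueSeconds(5),  // delay between attempts (illustrative)
        ThreadPool.Names.GENERIC        // executor the check runs on (illustrative)
    );
}
```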

@owaiskazi19 (Member) commented Nov 13, 2023

  1. If we go with the 1st approach, instead of static RETRY values we should have a dynamic setting so the retry count can be updated if needed.

  2. The RetryListener is a class internal to OpenSearch; we can try building a similar listener for our plugin, since the retry logic should be generic and will be needed in the future for other plugins' APIs as well. Ultimately it's just scheduling on the thread pool, similar to the 3rd option.

  3. scheduleWithFixedDelay is a better option here; we just need to decide on the right interval. I see other plugins have used it as well (a sketch of the setting and the scheduling call follows the links below):

https://github.com/opensearch-project/anomaly-detection/blob/4c6ba48bf9fb9234ad8bc0b9193f3c68409acfb9/src/main/java/org/opensearch/ad/cluster/ClusterManagerEventListener.java#L71-L74

https://github.com/opensearch-project/k-NN/blob/87ddcb60094c50f44f0eb5deb7b8ca9a5381410d/src/main/java/org/opensearch/knn/index/KNNCircuitBreaker.java#L108

https://github.com/opensearch-project/alerting/blob/a26f4c087fbeae4fb79fd598b8c5cf46ecf11f55/alerting/src/main/kotlin/org/opensearch/alerting/alerts/AlertIndices.kt#L183
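Roughly what the 1st and 3rd points could look like (the setting key and the 5-second interval are placeholders, not existing flow-framework code):

```java
// 1) Dynamic cluster setting so the retry budget can be changed without a restart.
public static final Setting<Integer> MAX_GET_TASK_RETRIES = Setting.intSetting(
    "plugins.flow_framework.max_get_task_retries",  // hypothetical setting key
    5,                                               // default
    0,                                               // minimum
    Setting.Property.NodeScope,
    Setting.Property.Dynamic);

// 3) Polling with scheduleWithFixedDelay: the returned Cancellable must be cancelled
//    once the task reaches COMPLETED, fails, or the retry budget is used up.
Scheduler.Cancellable poller = threadPool.scheduleWithFixedDelay(
    () -> mlClient.getTask(taskId, statusListener),
    TimeValue.timeValueSeconds(5),   // interval still to be decided
    ThreadPool.Names.GENERIC);
// ... later, from statusListener once the task is done:
poller.cancel();
```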

@dbwiddis (Member)

> scheduleWithFixedDelay is a better option here

I think integrating this into ProcessNode somehow is the best long-term option. I'm OK with option 1 if it's easily implemented as a short-term fix.

@joshpalis (Member, Author)

Agreed with option 1 as a short-term fix and option 3 for the long term.
