
[FEATURE] Add transport request retry capability for async workflow steps #158

Closed

joshpalis opened this issue Nov 10, 2023 · 4 comments

Labels: enhancement (New feature or request)

@joshpalis (Member) commented Nov 10, 2023

Is your feature request related to a problem?

Coming from PR #155, which adds a GetMLTaskStep. The underlying API is an async operation that is invoked after registering a local model. This step is used to ascertain the local model registration status and to retrieve the model ID once the status is COMPLETED.
Local model registration takes upwards of 30 seconds, depending on the size of the model zip file that needs to be downloaded into the cluster, so the GetMLTaskStep has to call this API repeatedly until the model ID is returned.

This will be a common issue for any WorkflowStep that invokes an async API.
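For reference, a single status check looks roughly like the sketch below (assuming the ml-commons client's getTask(taskId, listener) API and the MLTask/MLTaskState types; names and package locations may vary by version, and this is not the actual GetMLTaskStep code):

```java
// Sketch of one status check. Assumes an injected ml-commons node client (mlClient)
// and a taskId returned by the register-model call.
mlClient.getTask(taskId, ActionListener.wrap(task -> {
    if (task.getState() == MLTaskState.COMPLETED) {
        // registration finished; the model id is now available
        String modelId = task.getModelId();
        // ... pass the model id on to the next workflow step
    } else {
        // registration still running: this is where some form of retry is needed
    }
}, exception -> {
    // ... handle transport or task failure
}));
```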

What alternatives have you considered?

A brute-force approach of retrying requests until a COMPLETED status is returned results in scores of requests sent across the transport layer. Sleeping the thread for some time before retrying may be the right path, but it runs the risk of exhausting the WorkflowStep node_timeout.

Neural plugin has this retry method for executing the predict API:
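Roughly, that pattern applied to GetMLTaskStep would look like the sketch below (hypothetical method and constant names, not the Neural plugin's actual code):

```java
// Bounded, recursion-based retry: re-issue the transport request immediately,
// up to MAX_RETRY times, and fail the step once the budget is exhausted.
// Assumes an injected ml-commons node client (mlClient).
private static final int MAX_RETRY = 5; // static limit; see the dynamic-setting discussion below

void retryableGetMlTask(String taskId, ActionListener<MLTask> listener, int retriesLeft) {
    mlClient.getTask(taskId, ActionListener.wrap(task -> {
        if (task.getState() == MLTaskState.COMPLETED) {
            listener.onResponse(task);
        } else if (retriesLeft > 0) {
            retryableGetMlTask(taskId, listener, retriesLeft - 1); // recurse with one fewer retry left
        } else {
            listener.onFailure(new OpenSearchStatusException(
                "ML task [" + taskId + "] did not complete after " + MAX_RETRY + " retries",
                RestStatus.REQUEST_TIMEOUT));
        }
    }, listener::onFailure));
}
```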

@dbwiddis (Member)

I see at least 3 (good) ways to do this:

  1. I like the approach the Neural plugin has taken, but I'm not fond of recursion.

  2. Personally I'd probably try to use the thread pool capabilities for this. The Reindex API uses the ThreadPool schedule() capability to conduct its retries (a sketch of this pattern follows the list):

  3. We actually use schedule() here in the ProcessNode code just prior to the workflow step execution in order to implement the timeout feature (think of it as the opposite of a retry!):
    https://github.com/opensearch-project/opensearch-ai-flow-framework/blob/1de0ca54ab29c00987f56e982c87855b70e47911/src/main/java/org/opensearch/flowframework/workflow/ProcessNode.java#L163-L167
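A rough sketch of option 2, assuming the step has access to a ThreadPool instance (the delay value and executor name are placeholders, and this is not the Reindex implementation):

```java
// Wait between attempts by scheduling the next check on the thread pool,
// rather than recursing immediately or sleeping the calling thread.
void scheduleNextCheck(String taskId, ActionListener<MLTask> listener, int retriesLeft) {
    threadPool.schedule(
        () -> mlClient.getTask(taskId, ActionListener.wrap(task -> {
            if (task.getState() == MLTaskState.COMPLETED) {
                listener.onResponse(task);
            } else if (retriesLeft > 0) {
                scheduleNextCheck(taskId, listener, retriesLeft - 1); // schedule the next attempt
            } else {
                listener.onFailure(new OpenSearchStatusException(
                    "ML task [" + taskId + "] did not complete in time", RestStatus.REQUEST_TIMEOUT));
            }
        }, listener::onFailure)),
        TimeValue.timeValueSeconds(5),  // delay between attempts (illustrative)
        ThreadPool.Names.GENERIC        // executor the check runs on (illustrative)
    );
}
```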

@owaiskazi19 (Member) commented Nov 13, 2023

  1. If we go with the 1st approach, instead of static RETRY values we should have a dynamic setting so the retry count can be updated if needed.

  2. The RetryListener is a class internal to OpenSearch; we can try building a similar listener for our plugin, since the retry logic should be generic and will be needed in the future for other plugins' APIs as well. Ultimately it's just scheduling on the thread pool, similar to the 3rd option.

  3. scheduleWithFixedDelay is a better option here; we just need to decide on the right interval. I see other plugins have used it as well (a sketch of the setting and the scheduling call follows the links below):

https://github.com/opensearch-project/anomaly-detection/blob/4c6ba48bf9fb9234ad8bc0b9193f3c68409acfb9/src/main/java/org/opensearch/ad/cluster/ClusterManagerEventListener.java#L71-L74

https://github.com/opensearch-project/k-NN/blob/87ddcb60094c50f44f0eb5deb7b8ca9a5381410d/src/main/java/org/opensearch/knn/index/KNNCircuitBreaker.java#L108

https://github.com/opensearch-project/alerting/blob/a26f4c087fbeae4fb79fd598b8c5cf46ecf11f55/alerting/src/main/kotlin/org/opensearch/alerting/alerts/AlertIndices.kt#L183
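Roughly what the 1st and 3rd points could look like (the setting key and the 5-second interval are placeholders, not existing flow-framework code):

```java
// 1) Dynamic cluster setting so the retry budget can be changed without a restart.
public static final Setting<Integer> MAX_GET_TASK_RETRIES = Setting.intSetting(
    "plugins.flow_framework.max_get_task_retries",  // hypothetical setting key
    5,                                               // default
    0,                                               // minimum
    Setting.Property.NodeScope,
    Setting.Property.Dynamic);

// 3) Polling with scheduleWithFixedDelay: the returned Cancellable must be cancelled
//    once the task reaches COMPLETED, fails, or the retry budget is used up.
Scheduler.Cancellable poller = threadPool.scheduleWithFixedDelay(
    () -> mlClient.getTask(taskId, statusListener),
    TimeValue.timeValueSeconds(5),   // interval still to be decided
    ThreadPool.Names.GENERIC);
// ... later, from statusListener once the task is done:
poller.cancel();
```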

@dbwiddis (Member)

> scheduleWithFixedDelay is a better option here

I think integrating this into ProcessNode somehow is the best long-term option. I'm OK with option 1 if it's easily implemented as a short-term fix.

@joshpalis (Member, Author)

Agreed with option 1 as a short-term fix and option 3 for the long term.
