Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

efficiency: investigate bottleneck #547

Open
oliver-sanders opened this issue Jan 4, 2024 · 5 comments
Open

efficiency: investigate bottleneck #547

oliver-sanders opened this issue Jan 4, 2024 · 5 comments
Milestone

Comments

@oliver-sanders
Copy link
Member

oliver-sanders commented Jan 4, 2024

See also: cylc/cylc-ui#1614

This workflow has proven to be remarkably difficult for the UIS & UI to handle:

#!Jinja2

{% set members = 10 %}
{% set hours = 100 %}

[scheduler]
    allow implicit tasks = True

[task parameters]
    member = 0..{{members}}
    fcsthr = 0..{{hours}}
  [[templates]]
    member = member%(member)03d
    fcsthr = _fcsthr%(fcsthr)03d

[scheduling]
  initial cycle point = 2000
  runahead limit = P3
  [[xtriggers]]
    start = wall_clock(offset=PT7H15M)
  [[graph]]
    T00,T06,T12,T18 = """
        @start & prune[-PT6H]:finish => prune & purge
        @start => sniffer:ready_<member,fcsthr> => <member,fcsthr>_process? => finish
        <member,fcsthr>_process:fail? => fault
      """

[runtime]
    [[sniffer]]
        [[[outputs]]]
{% for member in range(0, members + 1) %}
    {% for hour in range(0, hours + 1) %}
            ready_member{{ member | pad(3, 0) }}_fcsthr{{ hour | pad(3, 0) }} = {{ member }}{{ hour }}
    {% endfor %}
{% endfor %}

For more information see: https://cylc.discourse.group/t/slow-load-of-cylc-workflows-disconnects/823/19

Investigation so far has confirmed:

  • The scheduler is not a source of delay.
  • The UIS chokes on the update for several seconds.
    • During this time, updates to other workflows are suspended
  • The UI chokes on the deltas of several seconds.
  • The browser takes a couple of seconds to update.

This issue focuses on the UIS side of things.

Suggested remediation (Flow/UIS only, please update with new suggestions):

@oliver-sanders oliver-sanders added this to the pending milestone Jan 4, 2024
@oliver-sanders
Copy link
Member Author

Early investigations using #546:

  • As expected, there is no single source of the problem :(
  • Resolvers
    • ~10% of CPU time for my example was used by cylc.flow.network.graphql.IgnoreFieldMiddleware.resolve and it's callchain.
    • This is the cost of actually computing the response, but it's only ~10% of the CPU time?!
    • iscoroutinefunction called here took a reasonable amount of time, it's possible that upgrading our dependencies may remove the requirement to check for synchronous functions?
  • Asyncio
    • Asycio is processing a very large number of tasks and taking quite a while doing it.
    • The ensure_future call (now deprecated), which calls create_task which took a whopping ~15% of the CPU time in my example.
    • I think these tasks are resolvers for individual fields, hence why there are so many of them.
    • Uvloop has much better throughput characteristics so may help here async: switch to uvloop #545.
  • Promise
    • It's hard to tell, but it looks like the overheads of the Promise library may be reasonable.
    • Promise is an async library used by graphql-ws (and other dependencies?), I think as a stopgap solution in a time when Python async support wasn't what it is now.
    • I expect there may be an improvement to cutting this out (especially in combination with uvloop).
  • to_snake_case
    • This simple method gets hammered (10's of thousands of calls).
    • Ideally we would reduce the number of calls, but caching would also help.
    • The implementation can be improved too as it's compiling a regex for every call (upstream change).
  • Tokens

oliver-sanders added a commit to oliver-sanders/cylc-uiserver that referenced this issue Jan 10, 2024
* The resolve routine is implmented recursively in the graphql-ws
  library.
* Because the function is async this results in a large number of async
  tasks being created when the library is used with large, deeply nested
  schema.
* Async tasks have an overhead, above that of regular function calls.
* For the example in cylc#547
  this resulted in over 10 seconds of overheads.
oliver-sanders added a commit to oliver-sanders/cylc-uiserver that referenced this issue Jan 10, 2024
* Partially addresses cylc#547
* The resolve routine is implmented recursively in the graphql-ws
  library.
* Because the function is async this results in a large number of async
  tasks being created when the library is used with large, deeply nested
  schema.
* Async tasks have an overhead, above that of regular function calls.
* For the example in cylc#547
  this resulted in over 10 seconds of overheads.
oliver-sanders added a commit to oliver-sanders/cylc-uiserver that referenced this issue Jan 10, 2024
* Partially addresses cylc#547
* The resolve routine is implmented recursively in the graphql-ws
  library.
* Because the function is async this results in a large number of async
  tasks being created when the library is used with large, deeply nested
  schema.
* Async tasks have an overhead, above that of regular function calls.
* For the example in cylc#547
  this resulted in over 10 seconds of overheads.
oliver-sanders added a commit that referenced this issue Jan 30, 2024
graphql-ws: optimise the "resolve" routine

* Partially addresses #547
* The resolve routine is implmented recursively in the graphql-ws
  library.
* Because the function is async this results in a large number of async
  tasks being created when the library is used with large, deeply nested
  schema.
* Async tasks have an overhead, above that of regular function calls.
* For the example in #547
  this resulted in over 10 seconds of overheads.
* websockets: avoid duplicate resolve call
  Due to inheritance, we were calling
  `resolve(execution_result.data)` twice.

---------

Co-authored-by: Tim Pillinger <[email protected]>
@dwsutherland
Copy link
Member

dwsutherland commented Feb 6, 2024

The promise is gone when I get around to (hopefully soon) upgrading Graphene, and rewriting our own graphql-ws (as this uses promises too) ..
It appears, promise was created by the original developer of Graphene who (I assume) also (in addition to async support) found it a useful way of translating graphql libraries of javascript to python equivalent (could be wrong)..

@dwsutherland
Copy link
Member

dwsutherland commented Feb 7, 2024

The other thing we do is recursively sift through the GraphQL query result (in the middleware) and strip out all the null fields... Not exactly sure how expensive this is.. Think I use the promise library for this.

@hjoliver
Copy link
Member

hjoliver commented Feb 9, 2024

to_snake_case

This simple method gets hammered (10's of thousands of calls).

Sorry if I've missed relevant discussions elsewhere (just back from leave) - but can we just rename all affected variables using JS conventions? It'll look ugly in a Python program, but the payoff sounds big.

@oliver-sanders
Copy link
Member Author

We could, but it would be easier to just cache the output of the method or re-write the code so that method doesn't get hammered, there's a separate issue for this in cylc-flow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants