Cloudflare R2 Workers #3461

Open
ovflowd opened this issue Aug 18, 2023 · 38 comments

@ovflowd
Member

ovflowd commented Aug 18, 2023

As part of the ongoing conversation we've been having on Slack (and spread across some of the nodejs/build issues), we have long-term goals for improving the reliability of our Distribution Assets (aka /home/dist) on the DigitalOcean server.

The main goal is to provide a reliable way to serve our assets (binaries, docs, metrics, etc.) to the public.

What's the issue?

As mentioned in a few issues (such as #3424 and #3410), in our March 17th incident report (https://nodejs.org/ko/blog/announcements/node-js-march-17-incident), and in many other places, our DigitalOcean server is unable to serve all the traffic it gets.

Of course, a few mitigations are in place, such as Cloudflare caching, so that, in theory, all requests are served by Cloudflare after the initial load.

This has proven insufficient due to numerous factors, for example, cache purges. Even if the cache purges were tailored to only the affected paths, it would still be a risky approach that creates a gigantic load on our DigitalOcean server, which to date is the same server storing all Node.js binaries and numerous other vital assets.

Not to mention that, even for a short period, the server cannot withstand the immense traffic that goes through it.

Meaning: ideally, this server should never be put under stress.

What's the Plan?

Champions

Proof of Concept Repository

After numerous discussions with the Build Team, the OpenJS Foundation, and Cloudflare, we've concluded that, ideally, the DigitalOcean server should not serve these files at all.

Enter the solution: Cloudflare R2

The idea is that all requests that do not go to Vercel (i.e., selected paths such as /download, /dist, /docs, /api) will go through a Cloudflare Worker.

This worker is responsible for:


  • Serving content from a R2 Bucket
    • The R2 Bucket has the contents of /home/dist from the DO server synced
      • Meaning that the root (/) of the R2 bucket is the contents of /home/dist
    • The syncing of files is done through a script, daemon, or something else sitting inside the DO server (a rough sketch of one possible approach is shown after this list).
      • It should be reactive and only do additions/removals/updates based on fs changes (so it is incremental)
  • Mapping Requests
    • Since the contents are the same as on the DO server, we should respect the same paths originally defined in the NGINX nodejs.org config file.
    • For example, requests to the following paths are mapped as follows:
      • /dist goes to /nodejs/release (originally on DO /home/dist/nodejs/release)
      • /download goes to /nodejs (originally on DO /home/dist/nodejs)
      • /docs goes to /nodejs/docs (originally on DO /home/dist/nodejs/docs)
      • /api goes to /nodejs/docs/latest/api (originally on DO /home/dist/nodejs/docs/latest/api)
      • /metrics goes to /metrics (originally on DO /home/dist/metrics)
    • These mappings ensure that requests are resolved the way originally intended in NGINX.
  • Serving Directory Listings
  • Caching Access (setting cache headers)
  • Serving 404's
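
As referenced in the sync item above, here is a minimal sketch of what such an incremental sync daemon could look like. It assumes Node.js with chokidar for filesystem watching and the S3-compatible R2 endpoint via @aws-sdk/client-s3; the bucket name, credentials, and overall shape are illustrative assumptions, not necessarily the mechanism the Build WG ends up with.

import { readFile } from 'node:fs/promises';
import { relative } from 'node:path';
import chokidar from 'chokidar';
import { S3Client, PutObjectCommand, DeleteObjectCommand } from '@aws-sdk/client-s3';

// Hypothetical values: account ID, credentials, and bucket name are placeholders.
const ROOT = '/home/dist';
const BUCKET = 'dist-prod';
const s3 = new S3Client({
  region: 'auto',
  endpoint: `https://${process.env.R2_ACCOUNT_ID}.r2.cloudflarestorage.com`,
  credentials: {
    accessKeyId: process.env.R2_ACCESS_KEY_ID!,
    secretAccessKey: process.env.R2_SECRET_ACCESS_KEY!,
  },
});

// React only to filesystem changes, so the sync stays incremental.
chokidar.watch(ROOT, { ignoreInitial: true }).on('all', async (event, path) => {
  const key = relative(ROOT, path); // e.g. nodejs/release/v20.5.1/SHASUMS256.txt
  if (event === 'add' || event === 'change') {
    await s3.send(new PutObjectCommand({ Bucket: BUCKET, Key: key, Body: await readFile(path) }));
  } else if (event === 'unlink') {
    await s3.send(new DeleteObjectCommand({ Bucket: BUCKET, Key: key }));
  }
});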

The Worker will never make a single request to the DigitalOcean origin, because the R2 bucket should always be in sync with whatever is on DigitalOcean.
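
To make the mapping and serving described above concrete, here is a minimal Worker sketch. The R2_BUCKET binding name, the mapping table, and the cache policy are illustrative assumptions; the actual implementation lives in the proof-of-concept repository linked above.

// Types come from @cloudflare/workers-types.
interface Env {
  R2_BUCKET: R2Bucket; // hypothetical binding name
}

// URL-path prefix -> R2 key prefix, mirroring the NGINX locations listed above.
const MAPPINGS: Array<[string, string]> = [
  ['/dist', 'nodejs/release'],
  ['/download', 'nodejs'],
  ['/api', 'nodejs/docs/latest/api'],
  ['/docs', 'nodejs/docs'],
  ['/metrics', 'metrics'],
];

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);
    const mapping = MAPPINGS.find(([prefix]) => url.pathname.startsWith(prefix));
    if (mapping === undefined) {
      return new Response('Not Found', { status: 404 });
    }

    const [prefix, bucketPrefix] = mapping;
    const key = bucketPrefix + url.pathname.slice(prefix.length);

    const object = await env.R2_BUCKET.get(key);
    if (object === null) {
      return new Response('Not Found', { status: 404 });
    }

    const headers = new Headers();
    object.writeHttpMetadata(headers);
    headers.set('etag', object.httpEtag);
    headers.set('cache-control', 'public, max-age=3600'); // illustrative cache policy
    return new Response(object.body, { headers });
  },
};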

This also means that:

  • We can disable our Load Balancer configuration on Cloudflare
    • In case of emergencies, we can disable the Worker and re-enable the Load Balancer
  • Cloudflare will never make requests to DigitalOcean itself
  • DigitalOcean Server is still the source of truth
  • Load on the Server will be close to 0

The Long Term

This solution is also future-proof, but down the road we could create ways for Jenkins to upload directly to R2, for example, or other shenanigans.

@mcollina
Member

What is the expected cost of this? The worker would be executed millions of times per day.

@ovflowd
Member Author

ovflowd commented Aug 19, 2023

What is the expected cost of this? The worker would be executed millions of times per day.

0

@MoLow
Member

MoLow commented Aug 19, 2023

What is the expected cost of this? The worker would be executed millions of times per day.

0

can you describe why? is this donated by cloudflare?

@MoLow
Member

MoLow commented Aug 19, 2023

added to the build agenda for visibility and tracking

@flakey5
Member

flakey5 commented Aug 19, 2023

can you describe why? is this donated by cloudflare?

Yep! @ovflowd would know more of the specifics regarding the quotas they gave

@ovflowd
Member Author

ovflowd commented Aug 19, 2023

What is the expected cost of this? The worker would be executed millions of times per day.

0

can you describe why? is this donated by cloudflare?

Yes. To elaborate on that: Cloudflare gave us R2 and Worker quotas/credits that are enough for 2x what our usage would be.

@MoLow
Member

MoLow commented Sep 13, 2023

@nodejs/build should we proceed with this or should we continue waiting for the Build WG meeting?

@ovflowd
Member Author

ovflowd commented Sep 13, 2023

Afaik @flakey5 is close to finishing their implementation.

@MoLow
Member

MoLow commented Sep 26, 2023

Discussed on the Build WG - will work with @ovflowd and @flakey5 on getting this deployed

@MoLow
Member

MoLow commented Sep 27, 2023

worker is up and running, see:
https://dist-worker-prod.nodejs.workers.dev/dist
https://dist-worker-prod.nodejs.workers.dev/download
https://dist-worker-prod.nodejs.workers.dev/docs
https://dist-worker-prod.nodejs.workers.dev/api/
https://dist-worker-prod.nodejs.workers.dev/metrics

  • note CF cache is disabled when behind the workers.dev domain, so this might be slower than the final result

A manual synchronization of the data has been performed; I am now working on automating the deployment of release artifacts into the R2 bucket.

@ovflowd
Member Author

ovflowd commented Sep 27, 2023

I wonder if the Modified field in the directory listing is important? I also wonder if the fact that the new directory listing renders differently from the usual way NGINX does it is somehow problematic.

How do tools around the web use directory listings to index things? Tools such as Artifactory, for example.

@MoLow
Member

MoLow commented Sep 27, 2023

I also wonder if the fact that the new directory listing renders differently from the usual way NGINX does is somehow problematic.

I was actually thinking we can make it look nicer in the future

@ovflowd
Member Author

ovflowd commented Sep 27, 2023

I also wonder if the fact that the new directory listing renders differently from the usual way NGINX does is somehow problematic.

I was actually thinking we can make it look nicer in the future

Sure, but I don't want to break the web 😅

@ovflowd
Member Author

ovflowd commented Sep 27, 2023

Lord knows what sort of tooling out there depends on our directory listing looking the way it does (code-wise)

@MoLow
Member

MoLow commented Sep 27, 2023

I uploaded a cached version to https://r2.nodejs.org/ so we can test caching.
Seeing some issues with that: https://r2.nodejs.org/dist/latest-argon/

@shadowspawn
Member

shadowspawn commented Oct 21, 2023

I did some light testing with n and it Just Works. e.g.

% export N_NODE_MIRROR=https://dist-worker-prod.nodejs.workers.dev/download/release/
% export N_NODE_DOWNLOAD_MIRROR=https://dist-worker-prod.nodejs.workers.dev/download
% n lts
  installing : node-v18.18.1
       mkdir : /usr/local/n/versions/node/18.18.1
       fetch : https://dist-worker-prod.nodejs.workers.dev/download/release/v18.18.1/node-v18.18.1-darwin-arm64.tar.xz
     copying : node/18.18.1
   installed : v18.18.1 (with npm 9.8.1)
% n nightly
  installing : nightly-v21.0.0-nightly20231009387e2929fe
       mkdir : /usr/local/n/versions/nightly/21.0.0-nightly20231009387e2929fe
       fetch : https://dist-worker-prod.nodejs.workers.dev/download/nightly/v21.0.0-nightly20231009387e2929fe/node-v21.0.0-nightly20231009387e2929fe-darwin-arm64.tar.xz
     copying : nightly/21.0.0-nightly20231009387e2929fe
npm WARN cli npm v10.1.0 does not support Node.js v21.0.0-nightly20231009387e2929fe. This version of npm supports the following node versions: `^18.17.0 || >=20.5.0`. You can find the latest version at https://nodejs.org/.
   installed : v21.0.0-nightly20231009387e2929fe (with npm 10.1.0)

@shadowspawn
Member

I ran more tests and narrowed down a failure to using wget --spider to do a preflight for existence of a file before download. It fails the test because wget sees a 304 and concludes the file does not exist. I offer the evidence, but don't know where the fault lies! Downloading the file does succeed.

$ wget --spider https://dist-worker-prod.nodejs.workers.dev/download/release/v4.9.1/node-v4.9.1-linux-arm64.tar.gz
Spider mode enabled. Check if remote file exists.
--2023-10-28 18:54:32--  https://dist-worker-prod.nodejs.workers.dev/download/release/v4.9.1/node-v4.9.1-linux-arm64.tar.gz
Resolving dist-worker-prod.nodejs.workers.dev (dist-worker-prod.nodejs.workers.dev)... 2606:4700:3032::ac43:930c, 2606:4700:3032::6815:1cb1, 172.67.147.12, ...
Connecting to dist-worker-prod.nodejs.workers.dev (dist-worker-prod.nodejs.workers.dev)|2606:4700:3032::ac43:930c|:443... connected.
HTTP request sent, awaiting response... 304 Not Modified
Remote file does not exist -- broken link!!!

$ wget --spider https://nodejs.org/download/release//v4.9.1/node-v4.9.1-linux-arm64.tar.gz
Spider mode enabled. Check if remote file exists.
--2023-10-28 19:02:20--  https://nodejs.org/download/release//v4.9.1/node-v4.9.1-linux-arm64.tar.gz
Resolving nodejs.org (nodejs.org)... 2606:4700:10::6814:172e, 2606:4700:10::6814:162e, 104.20.23.46, ...
Connecting to nodejs.org (nodejs.org)|2606:4700:10::6814:172e|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11889847 (11M) [application/gzip]
Remote file exists.

@ovflowd
Member Author

ovflowd commented Oct 28, 2023

I ran more tests and narrowed down a failure to using wget --spider to do a preflight for existence of a file before download. It fails the test because wget sees a 304 and concludes the file does not exist. I offer the evidence, but don't know where the fault lies! Downloading the file does succeed.

$ wget --spider https://dist-worker-prod.nodejs.workers.dev/download/release/v4.9.1/node-v4.9.1-linux-arm64.tar.gz
Spider mode enabled. Check if remote file exists.
--2023-10-28 18:54:32--  https://dist-worker-prod.nodejs.workers.dev/download/release/v4.9.1/node-v4.9.1-linux-arm64.tar.gz
Resolving dist-worker-prod.nodejs.workers.dev (dist-worker-prod.nodejs.workers.dev)... 2606:4700:3032::ac43:930c, 2606:4700:3032::6815:1cb1, 172.67.147.12, ...
Connecting to dist-worker-prod.nodejs.workers.dev (dist-worker-prod.nodejs.workers.dev)|2606:4700:3032::ac43:930c|:443... connected.
HTTP request sent, awaiting response... 304 Not Modified
Remote file does not exist -- broken link!!!

$ wget --spider https://nodejs.org/download/release//v4.9.1/node-v4.9.1-linux-arm64.tar.gz
Spider mode enabled. Check if remote file exists.
--2023-10-28 19:02:20--  https://nodejs.org/download/release//v4.9.1/node-v4.9.1-linux-arm64.tar.gz
Resolving nodejs.org (nodejs.org)... 2606:4700:10::6814:172e, 2606:4700:10::6814:162e, 104.20.23.46, ...
Connecting to nodejs.org (nodejs.org)|2606:4700:10::6814:172e|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11889847 (11M) [application/gzip]
Remote file exists.

Interesting. @flakey5, we shouldn't answer 304 to preflight (HEAD) requests even if the response is actually a cache hit.

@flakey5
Member

flakey5 commented Oct 29, 2023

we shouldn't answer 304 to preflight requests (HEAD) even if the response is actually a cache hit.

Hmmm okay. We should also get some e2e tests in for HEAD requests in general now that I think about it; will work on both of those tomorrow.
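
For reference, a minimal sketch of that idea, assuming the same hypothetical R2_BUCKET binding and a mapUrlToKey helper (both illustrative, not the worker's actual code):

// Answer HEAD preflights with 200 + metadata instead of a 304, so tools like
// `wget --spider` treat the file as existing.
async function handleHead(request: Request, env: Env): Promise<Response> {
  const key = mapUrlToKey(new URL(request.url)); // hypothetical mapping helper
  const object = await env.R2_BUCKET.head(key);
  if (object === null) {
    return new Response(null, { status: 404 });
  }

  const headers = new Headers();
  object.writeHttpMetadata(headers);
  headers.set('etag', object.httpEtag);
  headers.set('content-length', object.size.toString());
  // Always a 200, regardless of any conditional headers on the request.
  return new Response(null, { status: 200, headers });
}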

@ovflowd
Member Author

ovflowd commented Nov 8, 2023

We're upgrading the Ubuntu server now (DO), and after we verify that the upgrade succeeded and that the sync issue for Cloudflare is gone, we're proceeding with the following changes on Cloudflare for the adoption of R2 and the Workers:

Mandatory Steps:

  • Create Worker Routes for:
    • nodejs.org/api*
    • nodejs.org/dist*
    • nodejs.org/docs*
    • nodejs.org/download*
    • nodejs.org/metrics*
    • nodejs.org/release*
  • Disable Vercel Origin Rule
  • Disable Cache-Non Vercel Paths Cache Rule

Post Cleanup:

  • Disable Load Balancer (Origin)
    • Remove the R2 Origin Pool from the Load Balancers
  • Remove the Temporary Worker R2 Rule
  • Remove r2.nodejs.org domain (temporary)
  • Remove vercel.nodejs.org domain (temporary)

@flakey5
Member

flakey5 commented Nov 8, 2023

* Create Worker Routes for:
  
  * nodejs.org/api*
  * nodejs.org/dist*
  * nodejs.org/docs*
  * nodejs.org/download*
  * nodejs.org/metrics*

I'd be alright with all of these except for https://nodejs.org/dist/* and https://nodejs.org/download/release/* as of now, since I still need to hear back from R2 on whether or not they've fixed the circular dependency in their CI. I definitely think we should start with https://nodejs.org/metrics* as a test before adding the other routes.

@MoLow

This comment was marked as duplicate.

@flakey5
Member

flakey5 commented Nov 8, 2023

you mentioned some internal dependency of the Cloudflare team on node.js/dist - is that resolved?

I'm not sure; right now I think it's safe to assume that it's not, however.

@ovflowd
Member Author

ovflowd commented Nov 8, 2023

FYI all paths besides ^/dist are now being served by the Worker. We're waiting for the Cloudflare R2 team to give us an ack regarding this matter.

@MoLow
Member

MoLow commented Nov 8, 2023

Amazing. nice job y'all!

@trivikr
Member

trivikr commented Mar 22, 2024

I'm visiting this issue from an old issue I'd reported on the n node version manager, tj/n#784.

What are the remaining action items on this request? Is the cache purging issue fixed?
Is there an availability dashboard we can refer to, to verify it's fixed?

@flakey5
Member

flakey5 commented Mar 22, 2024

What are the remaining action items on this request?

Right now we're waiting on reviews for a few PRs (nodejs/node#51394, #3620, nodejs/release-cloudflare-worker#103). I hope that once those land we can move forward with this.

Is the cache purging issue fixed?

With the current infra, no. The cache is still being purged completely with each release of Node. With the worker, selective cache purging isn't necessary since Workers and R2 can most definitely handle the traffic that comes from a complete cache purge and more. It still is something to tackle though to ensure we're being smart with our usage, but it can wait (imo at least).

If there an availability dashboard which we can refer to verify it's fixed?

If you're referring to a Node status page, there's one here https://status.nodejs.org/ but from what I can see it's used for announcing major outages?

@flakey5
Member

flakey5 commented Jun 26, 2024

As an update: we are waiting on #3620 to be merged and deployed onto the www server. Once it is, I'd like us to do a test of the full pipeline (from the release CI through to promoting the release). Once we see that the full pipeline is working, we can continue with rolling the worker out when y'all are ready. I'm still in favor of incrementally rolling it out like we did with the https://nodejs.org/download/* path for a bit.

The worker is ready to be deployed whenever. We're in the middle of refactoring it, but that's more for code quality than anything functionality-wise. It also mimics nginx's directory listing response, so we shouldn't see another nodejs/citgm#1028.
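
For anyone curious how an nginx-autoindex-style listing can be generated from R2, here is a rough sketch using the bucket's list() API with the same hypothetical R2_BUCKET binding; the real worker's template and formatting differ.

// Render an nginx-autoindex-like page for a given key prefix (e.g. 'nodejs/release/').
async function renderListing(env: Env, prefix: string): Promise<Response> {
  const listed = await env.R2_BUCKET.list({ prefix, delimiter: '/' });

  const rows: string[] = [];
  for (const dir of listed.delimitedPrefixes) {
    const name = dir.slice(prefix.length); // subdirectory, ends with '/'
    rows.push(`<a href="${name}">${name}</a>`);
  }
  for (const object of listed.objects) {
    const name = object.key.slice(prefix.length);
    rows.push(`<a href="${name}">${name}</a>  ${object.uploaded.toUTCString()}  ${object.size}`);
  }

  const html =
    `<html><head><title>Index of /${prefix}</title></head><body>` +
    `<h1>Index of /${prefix}</h1><hr><pre>${rows.join('\n')}</pre><hr></body></html>`;
  return new Response(html, { headers: { 'content-type': 'text/html' } });
}

Note that list() is paginated, so a full implementation would loop on the returned cursor until truncated is false.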

@trivikr
Member

trivikr commented Jul 3, 2024

we are waiting on #3620 to be merged and deployed onto the www server

The PR #3620 is merged.

Who's going to deploy it and test the full pipeline, so that the worker can be rolled out?

@targos
Member

targos commented Jul 3, 2024

I deployed it right after the merge. The only way to test it is to wait for a new release to happen.

@flakey5
Member

flakey5 commented Jul 3, 2024

@ovflowd
Member Author

ovflowd commented Jul 3, 2024

So should we do a retroactive sync of the buckets?

@flakey5
Member

flakey5 commented Jul 3, 2024

Yeah!

@flakey5
Member

flakey5 commented Jul 3, 2024

Just to be safe I think we should monitor the worker the next time a regular release is made (v22.5.0 or whatever's next) and make sure everything is updated correctly.

Once we see everything working as intended, I'd be in favor of rolling it out. I'd like it to be a day when I and someone else from @nodejs/web-infra are free, though, so we can monitor it and make any changes necessary.

@trivikr
Member

trivikr commented Jul 3, 2024

we should monitor the worker the next time a regular release is made

Can this be verified with the Security Releases planned for Thu, Jul 4th?

@flakey5
Member

flakey5 commented Jul 3, 2024

Can this be verified with Security Releases planned for Thu, Jul 4th?

Should be able to!

@trivikr
Member

trivikr commented Jul 8, 2024

Security releases are out

I don't see the directory created on the r2 subdomain though: https://r2.nodejs.org/download/release/v18.20.4/

@ovflowd
Member Author

ovflowd commented Jul 8, 2024

Security releases are out

I don't see directory created on r2 subdomain though r2.nodejs.org/download/release/v18.20.4

#3620 (comment)
