Discussion: metro-file-map roadmap ideas that would be fantastic for monorepos #1325

Open
stevenpetryk opened this issue Aug 19, 2024 · 2 comments

stevenpetryk commented Aug 19, 2024

TL;DR: thank you for adding symlink support! It works extremely well for us, but there are some directions we think metro-file-map could take that would work better for large non-Facebook monorepos (non-Facebook as in: on standard filesystems).

These directions include (more below):

  • Lazily computing content.sha1hex for files in node_modules
  • Lazily loading files in node_modules into TreeFS (rather than eagerly loading all node_modules)
  • Tolerating node_modules not being returned in the Watchman query

Our use case

At Discord, we have a large monorepo, whose node_modules are managed by pnpm, with the following structure:

/
  node_modules/
    .pnpm/ (full of highly specific directories containing actual third party code)

  discord_app/
    node_modules/ (full of symlinks to ../../node_modules/.pnpm/* and other parts of the repo)

  discord_common/js/packages/... (full of other code used by the app)

Because of this layout (which we can't really change), Metro ends up issuing a Watchman query that can take around 20 seconds on even the fastest machines: it requests the content.sha1hex of every single file in node_modules, which for us is around 300k files.
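
For reference, the query is roughly of the following shape (an illustrative reconstruction, not the exact output of metro-file-map's planQuery). content.sha1hex is the expensive field: on a standard filesystem, Watchman has to read and hash every matching file to answer it.

["query", "/repo", {
  "expression": ["allof", ["type", "f"], ["suffix", ["js", "json", "ts", "tsx"]]],
  "fields": ["name", "exists", "mtime_ms", "size", "content.sha1hex"]
}]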

If we remove node_modules as a watchFolder, then no node_modules files end up in the file map, and Metro cannot resolve them (which is understandable).

Because of this behavior of metro-file-map, we're unable to exclude node_modules/.pnpm from our Watchman config, which means that watchman watch-project . is also incredibly slow, even though Metro is the only tool we use that actually needs Watchman to read node_modules/.pnpm (our other tools just follow symlinks).

Some directions I've explored

I set out this past week to patch Metro in various ways and see how behavior improved. I have a couple of approaches that I think could be worth upstreaming:

Idea 1: Lazily computing sha1s

I noticed that the Watchman query issued by Metro was taking around 20 seconds, and realized that it went down to <1s when removing content.sha1hex as a field from the query. So I tried the following:

diff --git a/packages/metro-file-map/src/lib/TreeFS.js b/packages/metro-file-map/src/lib/TreeFS.js
index 28971ff3..1038cabb 100644
--- a/packages/metro-file-map/src/lib/TreeFS.js
+++ b/packages/metro-file-map/src/lib/TreeFS.js
@@ -22,6 +22,8 @@ import H from '../constants';
 import {RootPathUtils} from './RootPathUtils';
 import invariant from 'invariant';
 import path from 'path';
+import crypto from 'crypto';
+import fs from 'fs';
 
 type DirectoryNode = Map<string, MixedNode>;
 type FileNode = FileMetaData;
@@ -185,7 +187,15 @@ export default class TreeFS implements MutableFileSystem {
 
   getSha1(mixedPath: Path): ?string {
     const fileMetadata = this._getFileData(mixedPath);
-    return (fileMetadata && fileMetadata[H.SHA1]) ?? null;
+    if (fileMetadata == null) return null;
+
+    let sha1 = fileMetadata[H.SHA1] ?? null;
+    if (sha1 == null) {
+      const content = fs.readFileSync(mixedPath);
+      sha1 = crypto.createHash('sha1').update(content).digest('hex');
+      fileMetadata[H.SHA1] = sha1;
+    }
+    return sha1;
   }
 
   exists(mixedPath: Path): boolean {
diff --git a/packages/metro/src/node-haste/DependencyGraph/createFileMap.js b/packages/metro/src/node-haste/DependencyGraph/createFileMap.js
index 1533b941..37f193ba 100644
--- a/packages/metro/src/node-haste/DependencyGraph/createFileMap.js
+++ b/packages/metro/src/node-haste/DependencyGraph/createFileMap.js
@@ -80,7 +80,7 @@ function createFileMap(
         })),
     perfLoggerFactory: config.unstable_perfLoggerFactory,
     computeDependencies,
-    computeSha1: true,
+    computeSha1: false,
     dependencyExtractor: config.resolver.dependencyExtractor,
     enableHastePackages: config?.resolver.enableGlobalPackages,
     enableSymlinks: config.resolver.unstable_enableSymlinks,

Disgusting. But it made it so that we never get stuck on a slow query. Now, I think that asking Watchman for SHA-1s is actually really reasonable for app code. For node_modules, though, we're asking it to sha1sum hundreds of thousands of files that Metro will likely never actually have to read.

One thing to consider may be a hybrid approach, then: issue two Watchman queries - one for all the app code (with content.sha1hex), and one for all node_modules (without content.sha1hex). Then, at the moment a SHA-1 is needed by the transformer, we compute it and write it to the file map (preferably more elegantly than above).
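
A rough sketch of the split (hand-written Watchman query objects, not Metro's actual planQuery output; partitioning on a dirname expression is my assumption about how it could be expressed):

const appQuery = {
  // App code: keep asking Watchman for hashes.
  expression: ['allof', ['type', 'f'], ['not', ['dirname', 'node_modules']]],
  fields: ['name', 'exists', 'mtime_ms', 'size', 'content.sha1hex'],
};
const nodeModulesQuery = {
  // node_modules: omit content.sha1hex and compute it lazily on demand.
  expression: ['allof', ['type', 'f'], ['dirname', 'node_modules']],
  fields: ['name', 'exists', 'mtime_ms', 'size'],
};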

However, this still has a problem: we are still loading hundreds of thousands of files into the filemap, most of which we'll probably never read, which is probably part of why, at Discord, we have to give Metro 16 gigabytes of RAM.

Idea 2: Excluding node_modules from the query entirely

This would, imo, be the most monorepo-friendly approach. Instead of querying Watchman for node_modules, Metro could resolve imports as they are encountered (by, yes, following symlinks) and just read from the filesystem, caching as it goes. I understand this would probably be a large behavioral change to metro-file-map (much larger than Idea 1).

But the downstream benefits of this would be massive:

  1. Folks would be able to exclude node_modules/.pnpm in .watchmanconfig, which for us at least, takes watch-project down from 500k files to just 200k, speeding it up by the same proportion.
  2. It would prevent Watchman from recrawling whenever node_modules changes (because the most frequently-changing parts would be ignored).
  3. It would drastically reduce metro-file-map's memory footprint for monorepos with many node_modules.
  4. It would reduce the memory footprint of Watchman's cache.

In all this exploration, I've started to feel a little bit comfortable inside of the Metro monorepo, and I'd be interested in contributing some of these ideas—but wanted to talk direction/interest first. I hope this perspective from someone dealing with a hefty repo has been valuable. Thanks :)

@robhogan (Contributor) commented

Hi @stevenpetryk - this is an awesome analysis and we'd love to work with you to make improvements here.

It's probably worth a few words of background context to help explain why things are as they are:

  • At Meta, all production dependencies are checked in to the monorepo, and dev/tooling dependencies live elsewhere (via NODE_PATHS), unwatched by Metro, so the in-memory file map ends up smaller (we don't hit an 8GB old-space-size limit, even though a single Metro config covers all of Meta's RN apps and surfaces).
  • SHA-1 hashes of all checked-in files are stored in source control by Eden, so we get them "for free" when querying Watchman using an Eden backend - no additional IO is performed.
  • Metro is a persistent background process expected to cope with developers moving between revisions, including when the contents of node_modules change, so we do want Metro to be aware of those changes. We don't differentiate between app code and dependencies in that respect.
  • We use CI-populated, network-hosted caches for both transformer and metro-file-map, so Metro is rarely starting cold, and we bias towards doing as much work at startup as possible when the cache is built (even if that means overdoing it), vs repeating work on dev machines at runtime.
  • Because we work on virtualised, remote Eden repositories, and have remote transform caches, Metro can often perform a build without ever even pulling untransformed source over the network - we get the SHA1 in the metadata, use it to compute the transform cache key, then read the transform cache (local then remote), skipping the transformer completely. And because of those remote filesystems, avoiding pulling a file is a bigger perf win than avoiding a local read would be.

That said, I appreciate the pain this behaviour causes for large repos outside Meta - we've heard similar stories from Microsoft, in particular, and I'm keen to work on a design that works well for our largest external users. So let's dig in...

Metro ends up issuing a Watchman query that can take around 20 seconds on even the fastest machines: it requests the content.sha1hex of every single file in node_modules, which for us is around 300k files.

Interesting - there could be a gap in Metro's logic here, where we assume that requesting content.sha1hex is cheap on Watchman but that may only be true of Eden backends (need to confirm this) - do you have a metro-file-map cache available?

Taking advantage of cached SHA1 on non-Eden Watchman

For non-Watchman setups, our fs.readdir-based crawler does not recompute SHA1s unless a file has changed (per stat.mtime) since the last time the file map cache was written; when one has, a Metro worker performs the IO just after the crawl completes and before the cache is written:

const computeSha1 =
  this._options.computeSha1 && !isSymlink && fileMetadata[H.SHA1] == null;

But for Watchman, Metro effectively always asks Watchman to computeSha1, which doesn't take advantage of the file map cache.

This could be as simple as changing:

const {query, queryGenerator} = planQuery({
  since,
  extensions,
  directoryFilters,
  includeSha1: computeSha1,
  includeSymlinks,
});

So that we includeSha1 only if the backing watcher is Eden.
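
For illustration, the gated call might look like this, where watcherIsEden is a hypothetical flag (how to reliably detect an Eden backend is part of the design work):

const {query, queryGenerator} = planQuery({
  since,
  extensions,
  directoryFilters,
  // Only ask Watchman for SHA-1s when they come for free (Eden);
  // otherwise rely on the cached/lazy computation path.
  includeSha1: computeSha1 && watcherIsEden,
  includeSymlinks,
});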

Of course for cold builds this makes no difference, but it ought to be a big win if you do have a cache.

However, this still has a problem: we are still loading hundreds of thousands of files into the filemap, most of which we'll probably never read, which is probably part of why, at Discord, we have to give Metro 16 gigabytes of RAM.

This is believable but worth validating, I think - where is that 16GB actually being used? If you use the "minimal" watchFolders configuration under the current architecture and that assumption holds, the TreeFS.#rootNode Map should account for many GBs of file path strings and stats, which would manifest as the metro-file-map cache also being many gigabytes - is that the case?

Another possibility is that the memory is being used by Graph instances, probably the dependencies Map - there'll typically be one Graph for every entry point and config served since Metro started, and each will already include only the minimal set of node_modules files that actually appear in the dep graph. If that's the case we could potentially do more to dedupe modules appearing in overlapping graphs, and also look at constant-factor storage efficiency, but that's unrelated to metro-file-map.

Lazily hashing

Idea 1: Lazily computing sha1s

IMO, this is very much the right thing to do - there are a few small things to design for:

  • Is SHA1 the right cache key input in all cases? If, say, we configure "node_modules cache keys are computed lazily", it might make sense to have a cheaper cache key (e.g. md5(path + mtime)) that we'd only check the local transform cache for - and what would configuring that look like?
  • Currently, SHA1s are computed on startup and cached, and we never save the file map cache again, so lazily computed keys would have the disadvantage of not making it into the file map cache. We could (periodically / after a build) re-save the file map cache to avoid compromising there.
  • Ideally, we'd avoid reading a file twice (once to compute the cache key lazily, and then immediately again for the transformer if there's no transform cache hit) - see the sketch below.
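
A minimal sketch of that single-read flow (standalone Node code, not Metro's actual internals; fileMetadata and H.SHA1 mirror the metro-file-map structures used in the patch above):

const crypto = require('crypto');
const fs = require('fs');

// Read the file once: memoize the SHA-1 into the file map entry and hand
// the content to the caller (e.g. the transformer), so a transform cache
// miss doesn't trigger a second read.
function readAndHashOnce(absolutePath, fileMetadata, H) {
  const content = fs.readFileSync(absolutePath);
  if (fileMetadata[H.SHA1] == null) {
    fileMetadata[H.SHA1] = crypto
      .createHash('sha1')
      .update(content)
      .digest('hex');
  }
  return {content, sha1: fileMetadata[H.SHA1]};
}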

Excluding node_modules

Idea 2: Excluding node_modules from the query entirely

I think this is where we need to be a bit careful to distinguish, say, the contents of TreeFS from the set of files we're monitoring for changes. I'd be reluctant to ignore changes inside node_modules, as that a) would break one of Metro's core correctness guarantees, and b) would leave us with no way to know when to invalidate cache keys, so we'd need to at least stat every "excluded" file on each traversal.

But, we might not need to keep the whole project in memory - the watching part alone can be cheap, especially if we don't even need to stat files. On macOS, we can subscribe to changes for a whole file tree in constant time and with no application memory cost, though it's more costly and requires a walk on Linux.

A non-trivial complication with some files being in the in-memory map and others not is how to quickly tell one case from the other, so we know whether we need to perform disk IO - especially when symlinks can bridge the two cases. I've got some work in draft somewhere that builds this logic into TreeFS so that it can fall through to file IO. I'll see if I can revive that tomorrow and get back to you here. A second complication is correctly implementing incremental resolution when typical file IO makes symlinks transparent.
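
The fall-through could take roughly this shape (purely illustrative, since the draft isn't public; getFile and addFromDisk are hypothetical names):

const fs = require('fs');

function lookupWithFallthrough(treeFS, mixedPath) {
  // Fast path: the file is already in the in-memory map.
  const cached = treeFS.getFile(mixedPath); // hypothetical accessor
  if (cached != null) {
    return cached;
  }
  // Slow path: consult the real filesystem. lstat rather than stat, so
  // symlinks stay visible instead of being transparently resolved.
  try {
    const stat = fs.lstatSync(mixedPath);
    return treeFS.addFromDisk(mixedPath, stat); // hypothetical helper
  } catch {
    return null; // genuinely absent
  }
}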

But yes, broadly I do think there are good opportunities here and I'd be very grateful for your help!

@stevenpetryk (Author) commented

Thanks for the context! I suspected that Eden would provide a lot of this for free. No judgment on my end btw, I think Metro's watchman integration makes a lot of sense as-is for tons of repos.


Responses to specific questions

we assume that requesting content.sha1hex is cheap on Watchman but that may only be true of Eden backends (need to confirm this)

I can definitely confirm this. When I manually run the same query as Metro against Watchman on a fresh watch, asking for content.sha1hex makes the query take 20 seconds; without content.sha1hex, it takes 1 second. Of course, once this is cached, subsequent queries are much faster, but the recrawls caused by watching so many files cause us to have to restart Watchman a lot.

do you have a metro-file-map cache available?

I do, but I probably can't share them, since they contain a list of file paths. I'm seeing this:

❯ du -h /tmp/metro-file-map-*
29M     /tmp/metro-file-map-c3bccb77fbd577a705afcf67c1144e72-78018780974c22cc67ee8635ff6e7df7
63M     /tmp/metro-file-map-c3bccb77fbd577a705afcf67c1144e72-f22c3655af61476b03650e9a2fecbaa0

Responses to solution responses (lol)

Lazily hashing

Is SHA1 the right cache key input in all cases

In my testing SHA-1 is plenty fast, but for blazing fast™, CRC32 would probably be the most blazin' and still sufficient. That said, it may be easier to use the same algorithm everywhere (even if one hash comes from Watchman and another from metro-file-map).

Re: the other two points, yes and yes. I noticed there was no mechanism to persist the file map cache after startup. I'm glad this is what you proposed, since I felt a little shy suggesting periodic persistence. And the extra read in my patch was just for testing; I'd definitely want to read only once.

Excluding node_modules

Agreed that Metro should/must continue watching node_modules (it would be pretty inconvenient otherwise). My latest thinking on this is that Metro should tolerate symlink targets being missing from the Watchman query. The reason is that it would be a huge boon to Watchman performance if we could ignore node_modules/.pnpm, which is where pnpm stores all of the 300k non-symlinks in our monorepo.
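
Concretely, the .watchmanconfig change would just be:

{
  "ignore_dirs": ["node_modules/.pnpm"]
}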

I think we're aligned but just to clarify, this is the state of the world I'm imagining:

  • We add node_modules/.pnpm to ignore_dirs in .watchmanconfig.
  • Watchman continues to watch everything else in node_modules (which for us, as pnpm users, is entirely symlinks or directories containing symlinks).
  • Something in the app imports react-native-foo/bar, which resolves to:
    • discord_app/node_modules/react-native-foo/bar.js (which metro-file-map is already realpath'ing, it seems)
    • -> node_modules/.pnpm/react-native-foo@<version>/node_modules/react-native-foo/bar.js (Metro realizes this is not in its file map, but sees that it exists, and adds it to the file map cache with its checksum - sketched below)
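
A sketch of that last lazy-addition step (hypothetical names, following the flow imagined above rather than any existing Metro API; exists is the real TreeFS method, addLazily is made up):

const crypto = require('crypto');
const fs = require('fs');

function ensureInFileMap(fileMap, realPath) {
  // The Watchman query never saw this file (node_modules/.pnpm is
  // ignored), so on first resolution we hash it and record it ourselves.
  if (!fileMap.exists(realPath)) {
    const content = fs.readFileSync(realPath);
    const sha1 = crypto.createHash('sha1').update(content).digest('hex');
    // hypothetical API for inserting a lazily discovered file
    fileMap.addLazily(realPath, {sha1, mtimeMs: fs.lstatSync(realPath).mtimeMs});
  }
}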
