Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

/tmp filling up on CI hosts #3864

Open
targos opened this issue Aug 14, 2024 · 24 comments
Open

/tmp filling up on CI hosts #3864

targos opened this issue Aug 14, 2024 · 24 comments

Comments

@targos
Copy link
Member

targos commented Aug 14, 2024

https://ci.nodejs.org/job/node-test-commit-smartos/buildTimeTrend

It's slowing down core development.

@richardlau
Copy link
Member

/tmp on test-equinix_mnx-smartos20-x64-3 is full:

$ ansible -m command -a "df -h" test-equinix_mnx-smartos20-x64-?
test-equinix_mnx-smartos20-x64-3 | CHANGED | rc=0 >>
Filesystem                                  Size  Used Avail Use% Mounted on
zones/356655a2-12e6-e1d7-ac7b-b5188ad37cb0  100G   30G   71G  30% /
/.zonecontrol                                56T   57G   56T   1% /.zonecontrol
/lib                                        314M  292M   23M  93% /lib
/lib/svc/manifest                            56T  1.5M   56T   1% /lib/svc/manifest
/usr                                        433M  394M   40M  91% /usr
swap                                         16G  4.6G   12G  29% /etc/svc/volatile
swap                                        4.0G  4.0G     0 100% /tmp
swap                                         16G  4.6G   12G  29% /var/run
test-equinix_mnx-smartos20-x64-4 | CHANGED | rc=0 >>
Filesystem                                  Size  Used Avail Use% Mounted on
zones/c6e3d47a-1421-ee11-c52d-c3c80c198e95  100G   25G   76G  25% /
/.zonecontrol                                56T   57G   56T   1% /.zonecontrol
/lib                                        314M  292M   23M  93% /lib
/lib/svc/manifest                            56T  1.5M   56T   1% /lib/svc/manifest
/usr                                        433M  394M   40M  91% /usr
swap                                         16G  2.3G   14G  15% /etc/svc/volatile
swap                                        4.0G  1.8G  2.3G  45% /tmp
swap                                         16G  2.3G   14G  15% /var/run

It's full of node-coverage-* directories. I've removed them:

[root@test-equinix_mnx-smartos20-x64-3 ~]# rm -rf /tmp/node-coverage-*
[root@test-equinix_mnx-smartos20-x64-3 ~]# df -h
Filesystem                                  Size  Used Avail Use% Mounted on
zones/356655a2-12e6-e1d7-ac7b-b5188ad37cb0  100G   30G   71G  30% /
/.zonecontrol                                56T   57G   56T   1% /.zonecontrol
/lib                                        314M  292M   23M  93% /lib
/lib/svc/manifest                            56T  1.5M   56T   1% /lib/svc/manifest
/usr                                        433M  394M   40M  91% /usr
swap                                         16G  955M   16G   6% /etc/svc/volatile
swap                                        4.0G   44K  4.0G   1% /tmp
swap                                         16G  955M   16G   6% /var/run
[root@test-equinix_mnx-smartos20-x64-3 ~]#

@richardlau
Copy link
Member

I've cleared out /tmp/node-coverage-* on test-equinix_mnx-smartos20-x64-4 as well:

[root@test-equinix_mnx-smartos20-x64-4 ~]# df -h
Filesystem                                  Size  Used Avail Use% Mounted on
zones/c6e3d47a-1421-ee11-c52d-c3c80c198e95  100G   25G   76G  25% /
/.zonecontrol                                56T   57G   56T   1% /.zonecontrol
/lib                                        314M  292M   23M  93% /lib
/lib/svc/manifest                            56T  1.5M   56T   1% /lib/svc/manifest
/usr                                        433M  394M   40M  91% /usr
swap                                         16G  2.3G   14G  15% /etc/svc/volatile
swap                                        4.0G  1.8G  2.3G  45% /tmp
swap                                         16G  2.3G   14G  15% /var/run
[root@test-equinix_mnx-smartos20-x64-4 ~]# rm -rf /tmp/node-coverage-*
[root@test-equinix_mnx-smartos20-x64-4 ~]# df -h
Filesystem                                  Size  Used Avail Use% Mounted on
zones/c6e3d47a-1421-ee11-c52d-c3c80c198e95  100G   25G   76G  25% /
/.zonecontrol                                56T   57G   56T   1% /.zonecontrol
/lib                                        314M  292M   23M  93% /lib
/lib/svc/manifest                            56T  1.5M   56T   1% /lib/svc/manifest
/usr                                        433M  394M   40M  91% /usr
swap                                         16G  467M   16G   3% /etc/svc/volatile
swap                                        4.0G   44K  4.0G   1% /tmp
swap                                         16G  467M   16G   3% /var/run
[root@test-equinix_mnx-smartos20-x64-4 ~]#

@targos
Copy link
Member Author

targos commented Aug 14, 2024

Good catch!
So there's a bug in our tests. They shouldn't be using the system's tmpdir.

@richardlau
Copy link
Member

richardlau commented Aug 14, 2024

Good catch! So there's a bug in our tests. They shouldn't be using the system's tmpdir.

FWIW I ran make -j88 test-ci on a dev machine I have access to and it created these directories/files in /tmp:

drwx------.  2 rlau   rlau       94 Aug 14 13:48 node-coverage-09AcaL
drwx------.  2 rlau   rlau      138 Aug 14 13:48 node-coverage-6qYENJ
drwx------.  2 rlau   rlau       94 Aug 14 13:48 node-coverage-afMxbz
drwx------.  2 rlau   rlau       94 Aug 14 13:48 node-coverage-e0jO5f
drwx------.  2 rlau   rlau      138 Aug 14 13:48 node-coverage-F6ZUIw
drwx------.  2 rlau   rlau       50 Aug 14 13:48 node-coverage-hHzLhL
drwx------.  2 rlau   rlau       94 Aug 14 13:48 node-coverage-Lc7I1n
drwx------.  2 rlau   rlau       94 Aug 14 13:48 node-coverage-otzELS
drwx------.  2 rlau   rlau       94 Aug 14 13:48 node-coverage-p2PeI0
drwx------.  2 rlau   rlau       94 Aug 14 13:48 node-coverage-q6CVl4
drwx------.  2 rlau   rlau       50 Aug 14 13:48 node-coverage-RdCe2w
drwx------.  2 rlau   rlau       94 Aug 14 13:48 node-coverage-SnPBDM
drwx------.  2 rlau   rlau       94 Aug 14 13:48 node-coverage-tw4y03
drwx------.  2 rlau   rlau       94 Aug 14 13:48 node-coverage-x0GveC
drwx------.  2 rlau   rlau       50 Aug 14 13:48 node-coverage-zBbuSF
-rw-r--r--.  1 rlau   rlau   138954 Aug 14 13:47 perf-426641.map
-rw-r--r--.  1 rlau   rlau   112792 Aug 14 13:47 perf-426677.map

@richardlau
Copy link
Member

richardlau commented Aug 14, 2024

I think some of those are from test/parallel/test-runner-coverage.js which has specific tests checking coverage is produced when the NODE_V8_COVERAGE environment variable is not set.

@richardlau
Copy link
Member

I think some of those are from test/parallel/test-runner-coverage.js which has specific tests checking coverage is produced when the NODE_V8_COVERAGE environment variable is not set.

If I exclude test/parallel/test-runner-coverage.js from parallel there still at least one more test writing out coverage:

$ rm -rf /tmp/node-coverage-* /tmp/perf-*.map
$ tools/test.py -J --skip=parallel/test-runner-coverage parallel
[00:28|% 100|+ 3508|-   0]: Done

All tests passed.
$ ls -al /tmp
...
drwx------.  2 rlau   rlau       51 Aug 14 17:20 node-coverage-8Oj2sa
drwx------.  2 rlau   rlau       96 Aug 14 17:20 node-coverage-Xwd98L
-rw-r--r--.  1 rlau   rlau   138954 Aug 14 17:20 perf-1822721.map
-rw-r--r--.  1 rlau   rlau   112792 Aug 14 17:20 perf-1822738.map
...
$

@targos
Copy link
Member Author

targos commented Aug 14, 2024

Can you try to change the permissions so only root is allowed to write to /tmp. Hopefully we can identify the tests by observing failures.

@richardlau
Copy link
Member

Not on that machine, but should be able to in a container.

@richardlau
Copy link
Member

$ mkdir /tmp/cantwrite
$ chmod u-w /tmp/cantwrite/
$ ls -al /tmp/cantwrite
total 4
dr-xr-xr-x.  2 rlau   rlau      6 Aug 14 19:15 .
drwxrwxrwt. 12 nobody nobody 4096 Aug 14 19:15 ..
$ TMPDIR=/tmp/cantwrite/ tools/test.py
...
[03:29|% 100|+ 4014|-   2]: Done

Failed tests:
out/Release/node /home/rlau/sandbox/github/node/test/parallel/test-runner-output.mjs
out/Release/node /home/rlau/sandbox/github/node/test/parallel/test-runner-coverage.js
$

@richardlau
Copy link
Member

With an actual unwritable /tmp:

[03:26|% 100|+ 4012|-   4]: Done

Failed tests:
out/Release/node /home/nodejs/node/test/parallel/test-cli-node-options.js
out/Release/node /home/nodejs/node/test/parallel/test-runner-coverage.js
out/Release/node /home/nodejs/node/test/parallel/test-runner-output.mjs
out/Release/node /home/nodejs/node/test/parallel/test-trace-events-http.js
$

@targos
Copy link
Member Author

targos commented Aug 15, 2024

@jakecastelli
Copy link

jakecastelli commented Aug 15, 2024

For test-runner-coverage.js - When NODE_V8_COVERAGE is set the report will be written into the dest directory, currently for test-runner-coverage.js is set to tmpdir.path maybe what we need to do is to clear the report directory after the test is finished to prevent the report fill up the disk space, will take another when I wake up.

@richardlau
Copy link
Member

For test-runner-coverage.js - When NODE_V8_COVERAGE is set the report will be written into the dest directory, currently for test-runner-coverage.js is set to tmpdir.path maybe what we need to do is to clear the report directory after the test is finished to prevent the report fill up the disk space, will take another when I wake up.

I think those cases are working as expected -- but that test has variants that deliberately do not set NODE_V8_COVERAGE and those are causing writes to the system tmp dir which are never cleaned up.

targos added a commit to targos/node that referenced this issue Aug 16, 2024
@jakecastelli
Copy link

jakecastelli commented Aug 19, 2024

Thanks Richard, you are right about this, when I commented out the test that has NODE_V8_COVERAGE, it still writes to the temp dir and not being cleaned up. I will take some time to investigate why this happened.

@jakecastelli
Copy link

jakecastelli commented Aug 19, 2024

I spent some time to further investigate and found out whether or not NODE_V8_COVERAGE is set or not, it will always write to the system temp dir and then if NODE_V8_COVERAGE is set, it will copy the coverage report across to that dir. (see here for the implementation detail)

^ this might explain why when the write access is disabled to the temp dir the test failed immediately.

I have a question - after we've done the copying, shouldn't we remove the temp created dir? @cjihrig

@cjihrig
Copy link
Contributor

cjihrig commented Aug 19, 2024

after we've done the copying, shouldn't we remove the temp created dir?

We could. I wouldn't expect to need to in the temp directory though. The test could also clean up those files.

@jakecastelli
Copy link

It is hard to clean those files after the tests - because the way we created them is using mkdtempSync.

  const coverageDirectory = mkdtempSync(join(tmpdir(), 'node-coverage-'));

which is a unique (random) temporary directory, e.g. node-coverage-060f7B. And I don't know how to obtain the unique (random) part.

@jakecastelli
Copy link

I attempted to remove the tmp dir by doing this:

     } finally {
       if (dir) {
         dir.closeSync();
+        rmSync(this.coverageDirectory, {__proto__: null, recursive: true });
       }
     }
   }

in the cleanup function, however even the rmSync was *successful, the temp dir was still there (looks like somehow it was being recreated again)

  • the reason why I said it was successful it is because I tried to verify by using existSync(this.coverageDirectory) and it returned false.

Any idea? 👀 @cjihrig

@cjihrig
Copy link
Contributor

cjihrig commented Aug 21, 2024

Not sure without trying to debug it myself.

@jakecastelli
Copy link

jakecastelli commented Aug 21, 2024

If you could try to cd into $TMPDIR and rm -rf node-coverage-* then you could add the patch into the cleanup function in coverage.js (make sure you import rmSync at the top) and run ./node test/parallel/test-runner-coverage.js.
In theory we shouldn't see any node-covearge-* but in reality they are still there.

@jakecastelli
Copy link

Sorry for being annoying to add another quick note here, the whole reason I wanted to fix this issue is because each time we run the above test (test-runner-coverage.js) it will create the rest (which is a few mb) into the system tmp folder:

292K	node-coverage-0q1ib3
1.3M	node-coverage-2N0qE9
460K	node-coverage-LHyiP2
 68K	node-coverage-OMUlHM
524K	node-coverage-Pz009O
792K	node-coverage-RZckL8
556K	node-coverage-WVJ1fu
460K	node-coverage-gnrE0T
556K	node-coverage-hFiDtO
1.3M	node-coverage-mVUPrO
556K	node-coverage-nuKBuD
800K	node-coverage-s1yPWE
 68K	node-coverage-tVAHAL
528K	node-coverage-teJBAK

This is fine for local but on CI it would easily fill up the space.

nodejs-github-bot pushed a commit to nodejs/node that referenced this issue Aug 21, 2024
Refs: nodejs/build#3864
PR-URL: #54395
Reviewed-By: Richard Lau <[email protected]>
Reviewed-By: Juan José Arboleda <[email protected]>
Reviewed-By: Jake Yuesong Li <[email protected]>
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Yagiz Nizipli <[email protected]>
RafaelGSS pushed a commit to nodejs/node that referenced this issue Aug 25, 2024
Refs: nodejs/build#3864
PR-URL: #54395
Reviewed-By: Richard Lau <[email protected]>
Reviewed-By: Juan José Arboleda <[email protected]>
Reviewed-By: Jake Yuesong Li <[email protected]>
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Yagiz Nizipli <[email protected]>
@targos targos changed the title SmartOS hosts are too flaky /tmp filling up on CI hosts Aug 29, 2024
@targos
Copy link
Member Author

targos commented Aug 29, 2024

test-digitalocean-fedora39-x64-2 is affected too

[root@test-digitalocean-fedora39-x64-2 ~]# ls /tmp/node-coverage-* | wc -l
18601

@targos
Copy link
Member Author

targos commented Aug 29, 2024

I cleaned it up, and -1 too.

RafaelGSS pushed a commit to nodejs/node that referenced this issue Aug 30, 2024
Refs: nodejs/build#3864
PR-URL: #54395
Reviewed-By: Richard Lau <[email protected]>
Reviewed-By: Juan José Arboleda <[email protected]>
Reviewed-By: Jake Yuesong Li <[email protected]>
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Yagiz Nizipli <[email protected]>
cjihrig added a commit to cjihrig/node that referenced this issue Sep 9, 2024
The test runner's code coverage leaves old coverage data in the
temp directory. This commit updates the cleanup logic to:

- Stop code collection. Otherwise V8 would write collection data
  again when the process exits.
- Remove the temp directory containing the coverage data.
- Attempt to clean up the coverage data even if parsing the
  data resulted in an error.

With this change, I no longer see any coverage data left behind
in the system temp directory.

Refs: nodejs/build#3864
Refs: nodejs/build#3887
cjihrig added a commit to cjihrig/node that referenced this issue Sep 9, 2024
The test runner's code coverage leaves old coverage data in the
temp directory. This commit updates the cleanup logic to:

- Stop code collection. Otherwise V8 would write collection data
  again when the process exits.
- Remove the temp directory containing the coverage data.
- Attempt to clean up the coverage data even if parsing the
  data resulted in an error.

With this change, I no longer see any coverage data left behind
in the system temp directory.

Refs: nodejs/build#3864
Refs: nodejs/build#3887
nodejs-github-bot pushed a commit to nodejs/node that referenced this issue Sep 11, 2024
The test runner's code coverage leaves old coverage data in the
temp directory. This commit updates the cleanup logic to:

- Stop code collection. Otherwise V8 would write collection data
  again when the process exits.
- Remove the temp directory containing the coverage data.
- Attempt to clean up the coverage data even if parsing the
  data resulted in an error.

With this change, I no longer see any coverage data left behind
in the system temp directory.

Refs: nodejs/build#3864
Refs: nodejs/build#3887
PR-URL: #54856
Reviewed-By: Yagiz Nizipli <[email protected]>
Reviewed-By: Jake Yuesong Li <[email protected]>
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Moshe Atlow <[email protected]>
Reviewed-By: James M Snell <[email protected]>
@richardlau
Copy link
Member

/tmp on test-equinix_mnx-smartos20-x64-3 was full again. I have deleted all of those node-coverage-* directories.

aduh95 pushed a commit to nodejs/node that referenced this issue Sep 12, 2024
The test runner's code coverage leaves old coverage data in the
temp directory. This commit updates the cleanup logic to:

- Stop code collection. Otherwise V8 would write collection data
  again when the process exits.
- Remove the temp directory containing the coverage data.
- Attempt to clean up the coverage data even if parsing the
  data resulted in an error.

With this change, I no longer see any coverage data left behind
in the system temp directory.

Refs: nodejs/build#3864
Refs: nodejs/build#3887
PR-URL: #54856
Reviewed-By: Yagiz Nizipli <[email protected]>
Reviewed-By: Jake Yuesong Li <[email protected]>
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Moshe Atlow <[email protected]>
Reviewed-By: James M Snell <[email protected]>
aduh95 pushed a commit to nodejs/node that referenced this issue Sep 13, 2024
The test runner's code coverage leaves old coverage data in the
temp directory. This commit updates the cleanup logic to:

- Stop code collection. Otherwise V8 would write collection data
  again when the process exits.
- Remove the temp directory containing the coverage data.
- Attempt to clean up the coverage data even if parsing the
  data resulted in an error.

With this change, I no longer see any coverage data left behind
in the system temp directory.

Refs: nodejs/build#3864
Refs: nodejs/build#3887
PR-URL: #54856
Reviewed-By: Yagiz Nizipli <[email protected]>
Reviewed-By: Jake Yuesong Li <[email protected]>
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Moshe Atlow <[email protected]>
Reviewed-By: James M Snell <[email protected]>
aduh95 pushed a commit to nodejs/node that referenced this issue Sep 13, 2024
The test runner's code coverage leaves old coverage data in the
temp directory. This commit updates the cleanup logic to:

- Stop code collection. Otherwise V8 would write collection data
  again when the process exits.
- Remove the temp directory containing the coverage data.
- Attempt to clean up the coverage data even if parsing the
  data resulted in an error.

With this change, I no longer see any coverage data left behind
in the system temp directory.

Refs: nodejs/build#3864
Refs: nodejs/build#3887
PR-URL: #54856
Reviewed-By: Yagiz Nizipli <[email protected]>
Reviewed-By: Jake Yuesong Li <[email protected]>
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Moshe Atlow <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants