Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI sometimes fails to read swingset config file #10092

Open
gibson042 opened this issue Sep 15, 2024 · 4 comments
Open

CI sometimes fails to read swingset config file #10092

gibson042 opened this issue Sep 15, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@gibson042
Copy link
Member

gibson042 commented Sep 15, 2024

Describe the bug

As seen at https://github.com/Agoric/agoric-sdk/actions/runs/10873209314/job/30169121954?pr=10091#step:5:1

Run cd packages/boot && yarn test | $TEST_COLLECT
  cd packages/boot && yarn test | $TEST_COLLECT
  shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
  env:
    AGORIC_AVA_USE_TAP: true
    TEST_COLLECT: tee -a _testoutput.txt
    NODE_V8_COVERAGE: coverage
    GH_ENGINE: 18.x
    CI_NODE_INDEX: 1
    CI_NODE_TOTAL: 4
    ESM_DISABLE_CACHE: true
yarn run v1.22.22
$ ava
TAP version 13
failed to load bundles/decentral-itest-vaults-config.json
not ok 1 - net-ibc-upgrade › before hook %ava-dur=96ms
#   REJECTED from ava test.before(): (SyntaxError#1)
#   SyntaxError#1: Unexpected end of JSON input
#       at JSON.parse (<anonymous>)
#       at loadSwingsetConfigFile (/home/runner/work/agoric-sdk/agoric-sdk/packages/SwingSet/src/controller/initializeSwingset.js:239:25)
#       at async ensureSwingsetInitialized (file:///home/runner/work/agoric-sdk/agoric-sdk/packages/cosmic-swingset/src/launch-chain.js:146:18)
#       at async buildSwingset (file:///home/runner/work/agoric-sdk/agoric-sdk/packages/cosmic-swingset/src/launch-chain.js:218:32)
#       at makeSwingsetTestKit (/home/runner/work/agoric-sdk/agoric-sdk/packages/boot/tools/supports.ts:501:48)
#       at makeTestContext (/home/runner/work/agoric-sdk/agoric-sdk/packages/boot/test/bootstrapTests/net-ibc-upgrade.test.ts:37:27)
#       at <anonymous> (/home/runner/work/agoric-sdk/agoric-sdk/packages/boot/test/bootstrapTests/net-ibc-upgrade.test.ts:49:15)

I suspect this can be fixed by adding a flush: true option to the writeFile call in packages/boot/tools/supports.ts, but only after raising the Node.js version bar to at least v20.10.0 (cf. nodejs/node#50009 for background). And if so, then we may want to consider using it in even more writeFiles. A possible alternative would be reading back the expected size, but that seems too heavyweight.

Subsequent conversation on Slack identified a more likely culprit as concurrently running tests writing and reading against the same file name, a hypothesis which was locally reproducible:

node --input-type=module -e '
import fs from "node:fs";
import fsp from "node:fs/promises";
const sink = () => {};
const delay = ms => new Promise(resolve => setTimeout(resolve, ms));
const waitFor = thunk => new Promise(async (resolve, reject) => {
  try {
    while (!(await thunk())) await delay(100);
    resolve();
  } catch (err) { reject(err); }
});

const path = "./dummy";
const data = Object.fromEntries(
  Array.from({ length: 10000 },
  (_, i) => [`key${i.toString().padStart(6, "0")}`, i]),
);
const fn = async worker => {
  for (let i = 0; i < 1000; i++) {
    await fsp.unlink(path).then(sink, sink);
    await waitFor(() => fs.statSync(path, { throwIfNoEntry: false }) === undefined);
    await fsp.writeFile(path, JSON.stringify(data), "utf-8");
    try {
      JSON.parse(fs.readFileSync(path, "utf-8"));
    } catch (err) {
      if (err.code === "ENOENT") continue;
      throw Error(`worker ${worker} failed on attempt ${i}`, { cause: err });
    }
  }      
  return;                       
};
for (let i = 0; i < 2; i++) fn(i);
'
file:///tmp/[eval1]:27
      throw Error(`worker ${worker} failed on attempt ${i}`, { cause: err });
            ^

Error: worker 0 failed on attempt 43
    at fn (file:///tmp/[eval1]:27:13) {
  [cause]: SyntaxError: Unexpected end of JSON input
      at JSON.parse (<anonymous>)
      at fn (file:///tmp/[eval1]:24:12)
}

Node.js v18.18.2

We should instead introduce a random component in these Swingset config file names.

@gibson042 gibson042 added the bug Something isn't working label Sep 15, 2024
@mhofman
Copy link
Member

mhofman commented Sep 16, 2024

I am very confused by what nodejs/node#50009 attempts to fix. A fsync should only be necessary if you attempt to read from another OS, e.g. network mounted system, or if your computer crashes. I believe the linux kernel guarantees that the same file being read by another process will use the most recently written data, even if it wasn't committed to disk.

@mhofman
Copy link
Member

mhofman commented Sep 16, 2024

From a conversation on Slack, it's a lot more likely that ava concurrency results in the same config file being written by concurrent tests, potentially causing tears since the file being written may take multiple write syscalls.

The solution is likely to add a random component to testConfigPath

@siarhei-agoric
Copy link
Contributor

siarhei-agoric commented Sep 16, 2024

I am very confused by what nodejs/node#50009 attempts to fix. A fsync should only be necessary if you attempt to read from another OS, e.g. network mounted system, or if your computer crashes. I believe the linux kernel guarantees that the same file being read by another process will use the most recently written data, even if it wasn't committed to disk.

There are several things to consider even when a single file is used by a single process on a single host:

  1. Linux kernel only guarantees consistent visibility of writes which are completed via kernel's syscall interface. However, there may be several other layers of abstraction, transformation, and caching between the fs interface in JS and the actual Kernel syscall. A flush from the highest-level interface would guarantee that all of the data actually makes it to its final backing store before the top-level execution proceeds further.
  2. Some applications (such as databases) often require strict ordering of separate writes to different places of the same file, or even groups of different files. This is typically required to guarantee atomicity of transactions (potentially across a crash/reboot).
  3. A set of different processes running on a same host may use a file as a mail box to send messages to each other. Just like in 1 and 2 above, an explicit flush would provide a mechanism to enforce completeness, ordering, and atomicity of these messages.
  4. An fsync on a file would force all of the other outstanding writes to the same file complete and return back with appropriate success/error codes.

With that in mind, only option 1 could be applicable to a simple config file write/read by a single process, and only in case of application crash/abort.

@mhofman
Copy link
Member

mhofman commented Sep 21, 2024

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants