
feat(ICP-Ledger): FI-1433: Implement V1 for ICP ledger - add ability to read from memory manager in post_upgrade #1020

Open
wants to merge 98 commits into master
Conversation

maciejdfinity (Contributor)

No description provided.

@maciejdfinity maciejdfinity changed the title Maciej icp mem mgr feat(ICP-Ledger): add ability to read from memory manager in post_upgrade Aug 20, 2024
@github-actions github-actions bot added the feat label Aug 20, 2024
@maciejdfinity maciejdfinity changed the title feat(ICP-Ledger): add ability to read from memory manager in post_upgrade feat(ICP-Ledger): FI-1433: add ability to read from memory manager in post_upgrade Aug 20, 2024
@maciejdfinity maciejdfinity changed the title feat(ICP-Ledger): FI-1433: add ability to read from memory manager in post_upgrade feat(ICP-Ledger): FI-1433: V1: add ability to read from memory manager in post_upgrade Oct 1, 2024
@maciejdfinity maciejdfinity changed the title feat(ICP-Ledger): FI-1433: V1: add ability to read from memory manager in post_upgrade feat(ICP-Ledger): FI-1433: Implement V1 for ICP ledger - add ability to read from memory manager in post_upgrade Oct 1, 2024
// In order to read the first bytes we need to use ic_cdk:
// dfn_core assumes the first 4 bytes store the stable memory length
// and returns bytes starting from the 5th byte.
let mut magic_bytes_reader = ic_cdk::api::stable::StableReader::default();
Contributor:
To avoid triggering this code unintentionally during a future upgrade, maybe we should put the code that touches memory behind a boolean that is set in the upgrade args?

Member:

Which code do you mean? For V1, the post-upgrade needs to be able to read stable memory written using StableWriter from dfn_core, or written using the MemoryManager. This code reads the first three bytes to detect which method was used, then reads the data, deserializes it, and initializes the ledger state on the heap. The code here needs to run in all upgrades until we have upgraded successfully to V3 (which will have migrated the allowances to stable structures); at that point we won't support reading data written using StableWriter anymore, and we can remove that part of the code.

Contributor:

I just meant that the code should be protected by a conditional like

if upgrade_args.run_v1_phase {
    ...
}

This way we avoid unintentionally running the part that messes with memory. We have had instances in the past where we executed code we did not mean to execute because of this kind of behaviour.
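A runnable sketch of the gating the reviewer proposes. The struct and field names are illustrative, not taken from the PR; the actual upgrade-args type would come from the ledger's Candid interface:

```rust
// Hypothetical upgrade argument gating the V1 migration path; the field
// name `run_v1_phase` is illustrative, not from the PR.
#[derive(Default)]
struct UpgradeArgs {
    run_v1_phase: bool,
}

// Returns whether the stable-memory-touching code would run, so the gate
// can be exercised in isolation.
fn should_run_v1_phase(args: &UpgradeArgs) -> bool {
    args.run_v1_phase
}

fn main() {
    // By default the migration code is skipped...
    assert!(!should_run_v1_phase(&UpgradeArgs::default()));
    // ...and it only runs when explicitly requested in the upgrade args.
    assert!(should_run_v1_phase(&UpgradeArgs { run_v1_phase: true }));
    println!("ok");
}
```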

Member:

I see. We did discuss different options for handling the different versions during the migration, including upgrade arguments, feature flags, feature branches, etc. The agreed-upon approach was to make small PRs for each version, only commit them to master when we are ready to release, and, once merged, qualify a release and make an upgrade proposal. Only after the changes from one version have been deployed on mainnet would we start committing changes for the subsequent version. After merging V1, I don't see a situation where we wouldn't want this code to run. This new code also doesn't change anything in memory; it only adds support for reading stable memory written using two different approaches.
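The detection described here can be sketched as follows. This is a simplified, self-contained version that reads from any `std::io::Read` source instead of `ic_cdk`'s `StableReader`; the function name is illustrative:

```rust
use std::io::Read;

// The memory manager writes the magic bytes "MGR" (followed by a 1-byte
// layout version) at the start of stable memory; dfn_core's StableWriter
// instead starts with a little-endian u32 length.
const MAGIC_BYTES: &[u8; 3] = b"MGR";

// Returns true if the first three bytes match the memory manager's magic
// bytes; false if they differ or fewer than three bytes could be read
// (e.g. empty stable memory).
fn memory_manager_found<R: Read>(reader: &mut R) -> bool {
    let mut first_bytes = [0u8; 3];
    match reader.read_exact(&mut first_bytes) {
        Ok(()) => first_bytes == *MAGIC_BYTES,
        Err(_) => false,
    }
}

fn main() {
    // Stable memory written by the memory manager: "MGR" + layout version.
    assert!(memory_manager_found(&mut &b"MGR\x01"[..]));
    // Stable memory written by StableWriter: a length prefix instead.
    assert!(!memory_manager_found(&mut &[0x10u8, 0x00, 0x00, 0x00][..]));
    // Empty stable memory: read_exact fails, so no memory manager.
    assert!(!memory_manager_found(&mut &b""[..]));
    println!("ok");
}
```

After this check, the post-upgrade can branch into the MemoryManager-based reading path or the legacy StableWriter path, as the comment above describes.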

const MAGIC_BYTES: &[u8; 3] = b"MGR";
let mut first_bytes = [0u8; 3];
let memory_manager_found = match magic_bytes_reader.read_exact(&mut first_bytes) {
    Ok(_) => first_bytes == *MAGIC_BYTES,
Contributor:

Could it be that dfn_core also writes "MGR" as the first three bytes, because the binary encoding of the stable memory length happens to start with "MGR"?

Member:

Good point! In addition to the magic bytes "MGR", the memory manager also writes the layout version, which is 1u8. So the first four bytes would be [77, 71, 82, 1], and this would correspond to a stable memory size of 25325943 bytes (StableWriter writes the length u32 as le_bytes). Looking at the heap memory usage of the ledger canisters, they all use an even number of bytes. Based only on this, and without having looked into the details of how the serialization is done (whether it could end up being an odd number of bytes), I would say it is very unlikely that dfn_core would end up writing the same four bytes as the memory manager, WDYT?

Contributor:

this would correspond to a stable memory size of 25325943 bytes

I've got a different memory size:

>>> int.from_bytes(b"MGR" + bytes([1]), "little")
22169421

I agree that it is unlikely to hit exactly this value, but I'm also not sure how to rule it out (I'm not sure that CBOR imposes any byte alignment, so it could be a coincidence that the sizes are all even).
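The contributor's Python check above can be reproduced in Rust, matching dfn_core's little-endian interpretation of the length prefix:

```rust
fn main() {
    // dfn_core interprets the first four bytes of stable memory as a
    // little-endian u32 length, so "MGR" + layout version 1 decodes as:
    let len = u32::from_le_bytes(*b"MGR\x01");
    assert_eq!(len, 22_169_421);
    println!("{len}"); // prints 22169421
}
```

So the collision case is a serialized ledger state of exactly 22169421 bytes, not 25325943 as stated earlier in the thread.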

Contributor:

it could be a coincidence that the sizes are all even

Nevertheless, maybe an acceptable risk to take.

Member:

I did some further local testing where I wrote b"MGR" + bytes([1]) to the first bytes of raw stable memory in the pre-upgrade without the memory manager, so that in the V1 post-upgrade the code would detect the memory manager. In this case, when subsequently trying to deserialize the state, expecting it to have been written using the memory manager, the canister trapped. A subsequent downgrade back to the V0 mainnet version was successful, so even if it happened that V0 wrote b"MGR" + bytes([1]) as the stable memory size, it is very likely that the subsequent memory manager initialization would fail. This would still be an issue for SNS ICRC ledgers, in the sense that they would require a fixed version to be able to upgrade further, but at least there wouldn't be any data loss/corruption. This is all extremely unlikely to happen anyway, so as you said, it may be an acceptable risk to take.

Contributor:

A subsequent downgrade back to the V0 mainnet version was successful

I guess this is because the actual size of the stable memory that was read was less than

>>> int.from_bytes(b"MGR" + bytes([1]), "little")
22169421

(otherwise dfn_core would likely fail)

Contributor:

On the other hand, in practice, the actual size would not be overridden by b"MGR" + bytes([1]), so the downgrade would actually be possible.

Member:

If this were a legitimate scenario, b"MGR" + bytes([1]) would be the actual stable memory size, so we should be able to read all of it correctly in a downgrade. In a scenario where we (accidentally or on purpose) set the length to that value, but the actual serialized ledger state is in fact larger, then, as you pointed out, dfn_core would likely fail.

if !downgrade_to_mainnet_should_succeed {
    panic!("Downgrade from memory manager directly to mainnet should fail (since mainnet is V0)!")
}
}
Contributor:

Now we're back at the mainnet version, and in the next step we will go up to the current version again instead of down to it.

Member:

I reread the test, and IIUC, the flow is as follows:

  1. Install the ledger canister with the mainnet version ledger_wasm_mainnet (which is currently V0 for ICP, V1 for the ICRC ledger) (line 2618)
  2. (Apply transactions)
  3. Upgrade the ledger twice with ledger_wasm_current, which is V1 for both ledgers (lines 2651, 2653)
  4. If specified, upgrade the ledger twice to ledger_wasm_nextmigrationversionmemorymanager, which is V2 for both ledgers (built with a feature flag used only for testing) (lines 2658, 2660)
  5. Downgrade from V2 to the mainnet version ledger_wasm_mainnet (V0 for ICP, V1 for ICRC) (line 2663).
    1. In case of the ICP ledger, this downgrade shall fail, since downgrading from V2 to V0 is not supported (writing with the memory manager, reading only from raw stable memory)
    2. In case of the ICRC ledger, this downgrade succeeds, since it is now only downgrading from V2 to V1, and V1 can read data written using the memory manager.

Does this match your understanding? Are there changes needed?

Contributor:

In case of the ICRC ledger, this downgrade succeeds, since it is now only downgrading from V2 to V1,

This is based on the assumption that mainnet = current, right? In the general case, the test should do V0 -> V1 -> V1 (-> V2 -> V2 -> V1, if applicable) -> V0, but the code actually does V0 -> V1 -> V1 -> V2 -> V2 -> V0 -> V1 -> V0 (meaning that V2 -> V1 is untested). If V0 = V1 (as you seem to point out), then there's no difference, but in the general case we seem to be lacking some coverage.

Member:

this is based on the assumption that mainnet = current, right?

For the ICRC ledger this is now the case, yes!

Thanks for spelling it out with the example! The original test was doing V0 -> V1 -> V1 -> V2 -> V2 -> V0 (this would fail, so the ledger would still be at V2) -> V1 -> V0, which would cover all the cases. Now, since mainnet ICRC is at V1, we are in fact missing the V2 -> V1 case for the ICRC ledger - I'll add this. In a separate PR, I will add fixed versions of the ledgers that are deployed on mainnet, so that if the versions currently in mainnet-canisters.bzl are updated, the test doesn't need to be adapted and no new test-coverage gaps are introduced.

Contributor:

if the versions currently in mainnet-canisters.bzl are updated, the test doesn't need to be adapted/no new test coverage gaps are introduced

I don't see how this would make any difference
