Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: [MR-552] Implement callback expiration #1699

Open
wants to merge 44 commits into
base: master
Choose a base branch
from

Conversation

alin-at-dfinity
Copy link
Contributor

@alin-at-dfinity alin-at-dfinity commented Sep 26, 2024

MR-552: Generate compact reject responses upon best-effort callback expiration. And inflate them into SYS_UNKNOWN reject responses when peeking/popping.

…lbacks_with_enqueued_response invariant in CanisterQueues doc comment.
…ry reject response that will result in a critical eror anyway (and the response being dropped).
Assign type parameters to canister queues and references, to designate them as either input/inbound or output/outbound.

Ensures that input queues can only hold inbound references; and output queues can only hold outbound references. Implements separate logic (for inbound and outbound references) for determining staleness, lookup and removal.
…d to InboundReference / OutboundReference, to make it clear that the functions may panic (even though the two types are essentially private to the module).
…e item types declare their own context (inbound vs outbound) and use that when constructing a new reference of the given type.
Generate compact reject responses upon best-effort callback expiration. And inflate them into SYS_UNKNOWN reject responses when peeking/popping.
@alin-at-dfinity alin-at-dfinity marked this pull request as ready for review October 1, 2024 05:53
@alin-at-dfinity alin-at-dfinity requested review from a team as code owners October 1, 2024 05:53
Base automatically changed from alin/MR-603-typed-queues to master October 1, 2024 08:01
Copy link
Contributor

@stiegerc stiegerc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far lgtm. Didn't spend much time on the tests though.

rs/replicated_state/src/canister_state/queues.rs Outdated Show resolved Hide resolved
rs/replicated_state/src/canister_state/queues.rs Outdated Show resolved Hide resolved
Comment on lines 474 to 491
fn get(&self, reference: InboundReference) -> CanisterInput {
assert_eq!(Context::Inbound, reference.context());

if let Some(msg) = self.pool.get(reference) {
debug_assert!(!self.expired_callbacks.contains_key(&reference));
debug_assert!(!self.shed_responses.contains_key(&reference));
return msg.clone().into();
} else if reference.class() == Class::BestEffort && reference.kind() == Kind::Response {
if let Some(callback_id) = self.expired_callbacks.get(&reference) {
debug_assert!(!self.shed_responses.contains_key(&reference));
return CanisterInput::DeadlineExpired(*callback_id);
} else if let Some(callback_id) = self.shed_responses.get(&reference) {
return CanisterInput::ResponseDropped(*callback_id);
}
}

panic!("stale reference at the front of input queue");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a bit odd. get() and take() is almost the same function. For one you could probably refactor that using a helper function to remove the duplication. Why is get() returning a clone though? It's usually Option<&T>.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose it's possible to either pass a function pointer or use generics, but the two functions aren't all that large. And trying to extract the common structure might simply add complexity and make it less readable. If you don't have a strong objection, I'd rather leave it as is.

MessageStore<CanisterInput>::get() returns a clone rather than a reference because we have to build a CanisterInput enum on the fly and Rust (with good reason) won't let you return a reference to a temporary value. MessageStore<RequestOrResponse>::get() OTOH, will return an Option<&RequestOrResponse> because it holds RequestOrResponse values within, so it can just return references to them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. How about a function

fn is_best_effort_response(&self) -> bool {
   self.class() == Class::BestEffort && self.kind() == Kind::Response
}

or something like that? Maybe also something similar for expired_callbacks and shed_responses since both of these expressions also appear in the complicated logical expression @derlerd-dfinity mentioned below?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point, actually. At some point I was thinking of adding fancier methods to Id / Reference, so that you could say BestEffort && Request by applying a single mask and a single binary operation (as opposed to filtering for one bit and then separately for the other). But now that you brought this up, I took a quick look and the two combinations we care about are "inbound best-effort response" (because these can be shed and result in an edge case); and "outbound guaranteed-response requests" (because these are the only non-best-effort messages that expire). So I added explicit methods for the two cases (which are just id & BITMASK == Inbound | BestEffort | Response, which compiles to id & x = y).

rs/replicated_state/src/canister_state/queues.rs Outdated Show resolved Hide resolved
…eStore::callbacks_with_enqueued_response(); fix typo; move inner function definition before use.
Copy link
Contributor

@derlerd-dfinity derlerd-dfinity left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a quick initial pass and left some comments.

rs/protobuf/def/state/queues/v1/queues.proto Show resolved Hide resolved
rs/replicated_state/src/canister_state/queues.rs Outdated Show resolved Hide resolved
…::is_stale() for better readability. Make all asserts for the right context type into debug_asserts().
Copy link
Member

@oggy-dfin oggy-dfin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! LGTM overall, the blocker for me is understanding why the unwrap in system_state is safe (assuming that it is safe).

rs/replicated_state/src/canister_state/queues.rs Outdated Show resolved Hide resolved
@@ -74,6 +74,20 @@ impl CanisterQueuesFixture {
)
}

fn try_push_deadline_expired_input(&mut self) -> Result<(), String> {
self.last_callback_id += 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely not for this PR, but conceptually there should be one callback ID for every successful push_output_request. If you're going in the direction of bundling the CallContextManager together with the queues, I do wonder if we have any practical chance of actually enforcing this, i.e., somehow have CallbackId be only generated through push_output_request. I realize you'd need to thread some canister-global state through to keep the numbers unique across the different queues. And I now also realize that there are two next_callback_id fields, one in SandboxSafeSystemState, and one in CallContextManager, and I have no idea what's going on there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also hope it can be done. My eventual aim is to turn SystemState (at least the parts of it that deal with CanisterQueues plus CallContextManager) into a state machine (in-between message executions; executing a request; executing a response; etc.). It should be possible to automatically create callbacks as requests are being enqueued, although it may require refactoring on the Execution end.

The reason why both CallContextManager and SandboxSafeSystemState have a next_callback_id is that the latter is trying to guess the callback IDs that the former will assign the new callbacks when they are registered (likely in order to populate them in the Request). So bundling callback creation and request enqueuing should also get rid of this weirdness.

rs/replicated_state/src/canister_state/system_state.rs Outdated Show resolved Hide resolved
@@ -276,7 +276,7 @@ impl CallContextManagerStats {

/// Calculates the stats for the given call contexts and callbacks.
///
/// Time complexity: `O(|call_contexts| + |callbacks|)`.
/// Time complexity: `O(n)`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, but I found the previous formulation more informative, it's not really clear to me what n is if I just see that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As explained elsewhere to Christian, my goal with these notes is to give the caller an idea of whether it's OK to use this in a tight loop or not (and ideally not, since anything that isn't constant or logarithmic time has no business being called from the scheduler or Message Routing). If you check, virtually all the functions with this kind of note are invariant checks or "calculate X" sort of things, that should only be called from debug_asserts or while loading the state.

What the exact n is, is irrelevant, since it's more or less under the control of users or an attacker. Saying e.g. O(|shed_responses|) may lead someone to conclude "Oh, this is likely to be a small number, so it's fine to use it for X".

rs/replicated_state/tests/system_state.rs Outdated Show resolved Hide resolved
…_response() and Reference::is_outbound_guaranteed_request() methods and use them to more concisely check for the respective message types.
Copy link
Contributor

@derlerd-dfinity derlerd-dfinity left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just finished my pass over the prod code. Left some more questions comments but overall LGTM.

Will complete my pass by going over the tests; after that I think I should be ready to approve.

Copy link
Contributor

@derlerd-dfinity derlerd-dfinity left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot. LGTM as soon as Ogi and Christian are happy as well.

rs/replicated_state/tests/system_state.rs Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants