Strong consistency, distributed, in-memory grain directory #9103
Fixes #9074
Fixes #2656
Fixes #4217
An improved in-memory distributed grain directory
This PR introduces a new in-memory grain directory for Orleans. We plan to make this the default implementation shortly (possibly in 9.0), but this PR adds it as an experimental, opt-in feature.
The new grain directory has the following attributes, contrasting with the current default implementation:
## Design
The directory is distributed across all silos in the cluster by partitioning the key space into ranges using consistent hashing, where each silo owns some pre-configured number of ring ranges, defaulting to 30 ranges per silo. A hash function is used to associate `GrainId`s with a point on the ring. The silo which owns that point on the ring is responsible for serving any requests for that grain.

Ring ranges are assigned to replicas deterministically based on the current cluster membership. We refer to consistent snapshots of membership as views, and each view has a unique number which is used to identify and order views. A new view is created whenever a silo in the cluster changes state (eg, between `Joining`, `Active`, `ShuttingDown`, and `Dead`), and all non-faulty silos learn of view changes quickly through a combination of gossip and polling of the membership service.

The image below shows a representation of a ring with 3 silos (A, B, & C), each owning 4 ranges on the ring. The grain `user/terry` hashes to `0x4000_000`, which is owned by Silo B.

The directory operates in two modes: normal operation, where a fixed set of processes communicate in the absence of failure, and view change, which punctuates periods of normal operation to transfer state and control from the processes in the preceding view to the processes in the current view.
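The ownership rule above can be sketched as follows. This is an illustrative sketch, not the PR's actual types: the class, the per-silo point derivation, and the FNV-style hash are simplified stand-ins for the real, membership-derived ring.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative sketch of consistent-hash ownership (not the actual Orleans types).
// Each silo contributes several points to the ring; a grain is owned by the silo
// whose range contains the grain's hash.
sealed class RingSketch
{
    // Sorted ring: point -> owning silo. In the real directory, points are derived
    // deterministically from cluster membership, so every silo computes the same ring.
    private readonly SortedDictionary<uint, string> _points = new();

    public void AddSilo(string silo, int rangesPerSilo = 30)
    {
        for (var i = 0; i < rangesPerSilo; i++)
        {
            _points[Hash($"{silo}/{i}")] = silo;
        }
    }

    public string GetOwner(string grainId)
    {
        var h = Hash(grainId);

        // The owner is the silo at the first ring point at or after the grain's
        // hash, wrapping around to the smallest point if none is larger.
        foreach (var (point, silo) in _points)
        {
            if (point >= h) return silo;
        }

        return _points.First().Value;
    }

    private static uint Hash(string s)
    {
        // Stand-in FNV-1a hash; the real implementation uses a stable hash of the GrainId.
        unchecked
        {
            uint h = 2166136261;
            foreach (var c in s) { h = (h ^ c) * 16777619; }
            return h;
        }
    }
}
```

Because every silo derives the same ring from the same membership view, no separate partition-assignment state needs to be stored or agreed upon.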
Once state and control have been transferred from the previous view, normal operation resumes. This coordinated transfer of state and control is performed on a range-by-range basis, and some ranges may resume normal operation before others. For partitions which do not change owners during a view change, normal operation never stops. For partitions which do change ownership during a view change, service is suspended when the previous owner learns of the view change and resumed when the current owner successfully completes the transfer of state and control.
When a view change begins, any ring ranges which were previously owned and are now no-longer owned are sealed, preventing further modifications by the previous owner, and a snapshot is created and retained in-memory. The snapshot is transferred to the new range owners and is deleted once the new owners acknowledge that it has been retrieved or if they are evicted from the cluster.
The image below depicts a view change, where Silo B is shutting down and Silo C must transfer state & control from Silo B for the ranges which it owned. Silo B will not complete shutting down until Silo C acknowledges that it has completed the transfer.
After the transfer has completed, the ring looks like so:
Failures are handled by performing recovery. Partition owners perform recovery by requesting the set of registered grain activations belonging to that partition from all active silos in the cluster. Requests are not served until recovery completes. As an optimization, the new partition owner can serve lookup requests for grain registrations which it has already recovered.
To prevent requests from being served during a view change, range locks are acquired during view change, respected by each request, and released when view change completes.
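The locking scheme can be sketched roughly as below. This is a hypothetical illustration (the type and range identifiers are stand-ins), showing the essential shape: a view change write-locks a range, and each request briefly takes the same lock, so requests wait out any in-progress view change for their range.

```csharp
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Illustrative sketch of per-range locking during view change (hypothetical types).
sealed class RangeLocks
{
    private readonly ConcurrentDictionary<int, SemaphoreSlim> _locks = new();

    private SemaphoreSlim For(int rangeId) =>
        _locks.GetOrAdd(rangeId, _ => new SemaphoreSlim(1, 1));

    // Held by the view-change machinery for the duration of the transfer.
    public Task LockForViewChangeAsync(int rangeId) => For(rangeId).WaitAsync();
    public void ReleaseAfterViewChange(int rangeId) => For(rangeId).Release();

    // Each request acquires and immediately releases the lock, so it blocks
    // until any view change covering its range has completed.
    public async Task WaitForRangeAsync(int rangeId)
    {
        var gate = For(rangeId);
        await gate.WaitAsync();
        gate.Release();
    }
}
```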
All requests and responses include the current membership version. If a client or replica receives a message with a higher membership version, they refresh membership to at least that version before continuing. After refreshing their view, clients retry misdirected requests (requests targeting the wrong replica for a given point on the ring), sending them to the owner for the current view. Replicas reject misdirected requests.
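The replica-side half of that handshake might look like the following sketch. All names here (`DirectoryReplica`, `DirectoryResponse`, the ownership predicate) are hypothetical stand-ins, not the PR's API; the point is the ordering: refresh first if the caller's view is newer, then reject if the request is misdirected in the refreshed view.

```csharp
using System;

// Illustrative sketch of the membership-version handshake (hypothetical types).
enum Status { Ok, Misdirected }

record DirectoryResponse(Status Status, long MembershipVersion);

sealed class DirectoryReplica
{
    private long _membershipVersion;
    private readonly Func<string, long, bool> _ownsGrainInView;

    public DirectoryReplica(long version, Func<string, long, bool> ownsGrainInView)
    {
        _membershipVersion = version;
        _ownsGrainInView = ownsGrainInView;
    }

    public DirectoryResponse Handle(string grainId, long callerVersion)
    {
        if (callerVersion > _membershipVersion)
        {
            // The caller has seen a newer view: refresh membership to at least
            // that version before deciding whether this request is misdirected.
            RefreshMembership(callerVersion);
        }

        if (!_ownsGrainInView(grainId, _membershipVersion))
        {
            // Reject misdirected requests, echoing our version so the caller
            // can refresh its view and retry against the current owner.
            return new DirectoryResponse(Status.Misdirected, _membershipVersion);
        }

        return new DirectoryResponse(Status.Ok, _membershipVersion);
    }

    private void RefreshMembership(long atLeast) =>
        _membershipVersion = Math.Max(_membershipVersion, atLeast);
}
```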
Range partitioning is more complicated than having a fixed number of fixed-size partitions, but it has nice properties such as elastic scaling (adding more hosts increases system capacity), good balancing, minimal data movement during scaling events, and a partition mapping derived directly from cluster membership, with no other state (such as partition assignments) required.
## Enabling `DistributedGrainDirectory`
For now, the new directory is opt-in, so you will need to use the following code to enable it on your silos:
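A sketch of the enablement code follows. The extension method name is my best guess at the opt-in API and may not match the PR exactly; check the PR diff for the actual method.

```csharp
using Microsoft.Extensions.Hosting;

var builder = Host.CreateApplicationBuilder(args);

builder.UseOrleans(siloBuilder =>
{
    // Opt in to the experimental distributed grain directory.
    // NOTE: the method name is an assumption; see the PR for the exact API.
    siloBuilder.AddDistributedGrainDirectory();
});

builder.Build().Run();
```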
If performing a rolling upgrade, you will experience unavailability of the directory until all hosts are running the new directory.
## Does this allow strong single-activation guarantees?
See #2428 for context.
This directory guarantees that a grain will never be concurrently active on more than one silo which is a member of the cluster, but it is possible for a grain to still be active on a silo which was evicted from the cluster but is still operating (eg, it's very slow or has experienced a network partition and has not refreshed membership yet). We could provide some stronger guarantees using leases, as described in #2428. This could be implemented host-wide by making silos call `Environment.FailFast(...)` on themselves after some period of time if they have not managed to successfully refresh membership (or contacted some pre-ordained set of other silos to renew their lease on life). Leases do not guarantee single activation unless you assume bounded clock drift and bounded time between lease checks.
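A host-wide "lease on life" of that shape could look roughly like this. It is an illustrative sketch only (not part of this PR); the lease period, the refresh hook, and the class name are all assumptions.

```csharp
using System;
using System.Threading;

// Illustrative sketch of a "lease on life" watchdog (not part of this PR).
// If the silo cannot refresh cluster membership within the lease period, it
// terminates itself so it cannot keep serving stale directory state.
sealed class LeaseOnLifeWatchdog : IDisposable
{
    private readonly TimeSpan _leasePeriod;
    private readonly Timer _timer;
    private long _lastRefreshTicks;

    public LeaseOnLifeWatchdog(TimeSpan leasePeriod)
    {
        _leasePeriod = leasePeriod;
        _lastRefreshTicks = DateTime.UtcNow.Ticks;
        _timer = new Timer(_ => Check(), null, leasePeriod, leasePeriod);
    }

    // Called whenever the silo successfully refreshes cluster membership.
    public void OnMembershipRefreshed() =>
        Interlocked.Exchange(ref _lastRefreshTicks, DateTime.UtcNow.Ticks);

    private void Check()
    {
        var last = new DateTime(Interlocked.Read(ref _lastRefreshTicks), DateTimeKind.Utc);
        if (DateTime.UtcNow - last > _leasePeriod)
        {
            // Lease expired: fail fast rather than risk a duplicate activation.
            // As noted above, this only helps if clock drift and the interval
            // between checks are themselves bounded.
            Environment.FailFast("Cluster membership lease expired.");
        }
    }

    public void Dispose() => _timer.Dispose();
}
```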
## TODO

- Naming: not `ReplicatedGrainDirectory`, since there is currently only one active replica per range. Eg, `DistributedGrainDirectory`, `InternalGrainDirectory`, `ClusterGrainDirectory`, `RangePartitionedGrainDirectory`.
- Consider requiring `[Experimental]` to opt-in to this implementation.