Create Availability Zone Standard #640

Open. Wants to merge 29 commits into base: main. Changes shown from 9 commits.

Commits:
dc9ad59
Create scs-XXXX-vN-Availability-Zones-Standard.md
josephineSei Jun 17, 2024
336e2dc
Update scs-XXXX-vN-Availability-Zones-Standard.md
josephineSei Jun 18, 2024
3ae46c4
First part for fire zones being through, rest will follow
josephineSei Jun 19, 2024
447d90c
Further work on other factors for AZs than fire zones
josephineSei Jun 21, 2024
eb6e5bc
First complete Draft of Availability Standard
josephineSei Jun 24, 2024
5f8c566
Update scs-XXXX-vN-Availability-Zones-Standard.md
josephineSei Jun 24, 2024
b576580
Merge branch 'main' into availability-zones-standard
josephineSei Jun 24, 2024
856e099
Apply suggestions from code review
josephineSei Jun 25, 2024
1a9140e
Restructuring and adding discussed point from IaaS call.
josephineSei Jun 27, 2024
286ff93
Apply suggestions from code review
josephineSei Aug 16, 2024
89e770a
Update scs-XXXX-vN-Availability-Zones-Standard.md
josephineSei Aug 16, 2024
afec372
Merge branch 'main' into availability-zones-standard
josephineSei Aug 19, 2024
7475746
Update scs-XXXX-vN-Availability-Zones-Standard.md
josephineSei Aug 21, 2024
6c03427
Merge branch 'main' into availability-zones-standard
josephineSei Aug 26, 2024
e660cb4
Update scs-XXXX-vN-Availability-Zones-Standard.md
josephineSei Sep 18, 2024
355e84a
Create scs-XXXX-w1-Availability-Zones-Standard.md
josephineSei Sep 18, 2024
65460b5
Update scs-XXXX-w1-Availability-Zones-Standard.md
josephineSei Sep 18, 2024
fe13def
Merge branch 'main' into availability-zones-standard
josephineSei Sep 18, 2024
8c98249
Update scs-XXXX-vN-Availability-Zones-Standard.md
josephineSei Sep 18, 2024
057d093
Update scs-XXXX-w1-Availability-Zones-Standard.md
josephineSei Sep 19, 2024
1ad5852
Update scs-XXXX-w1-Availability-Zones-Standard.md
josephineSei Sep 20, 2024
f3f76fc
Update scs-XXXX-w1-Availability-Zones-Standard.md
josephineSei Sep 20, 2024
a72ef56
Apply suggestions from code review
josephineSei Sep 25, 2024
79e0428
Update scs-XXXX-vN-Availability-Zones-Standard.md
josephineSei Sep 25, 2024
e3cc2af
Merge branch 'main' into availability-zones-standard
josephineSei Sep 26, 2024
87e3d48
Rename scs-XXXX-vN-Availability-Zones-Standard.md to scs-0119-v1-Avai…
josephineSei Sep 30, 2024
efd2ab5
Update and rename scs-XXXX-w1-Availability-Zones-Standard.md to scs-0…
josephineSei Sep 30, 2024
58d636a
Update scs-0119-w1-Availability-Zones-Standard.md
josephineSei Sep 30, 2024
720b94f
Update scs-0119-w1-Availability-Zones-Standard.md
josephineSei Sep 30, 2024
196 changes: 196 additions & 0 deletions Standards/scs-XXXX-vN-Availability-Zones-Standard.md
@@ -0,0 +1,196 @@
---
title: Availability Zones Standard
type: Standard
status: Draft
track: IaaS
---

## Introduction

At the IaaS level, especially in OpenStack, it is possible to group resources into Availability Zones.
Such zones are often mapped to the physical layer of a deployment, for example to a physical separation of hardware, to redundant power circuits or to fire zones.
However, how CSPs apply Availability Zones to the IaaS layer of a deployment may differ widely.
Therefore, this standard addresses the minimal requirements that need to be met when creating Availability Zones.

## Terminology

| Term | Explanation |
| ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------- |
| Availability Zone  | (also: AZ) Internal representation of a physical grouping of service hosts, which also leads to an internal grouping of resources.      |
| Fire Zone          | A physical separation in a data center that contains a fire within it, effectively stopping the fire from spreading.                     |
| PDU                | Power Distribution Unit, used to distribute the power to all physical machines of a single server rack.                                  |
| Compute            | A generic name for the IaaS service that manages virtual machines (e.g. Nova in OpenStack).                                              |
| Network            | A generic name for the IaaS service that manages network resources (e.g. Neutron in OpenStack).                                          |
| Storage            | A generic name for the IaaS service that manages the storage backends and virtual devices (e.g. Cinder in OpenStack).                    |

## Motivation

Redundancy is a non-trivial but relevant issue for a cloud deployment.
First and foremost it is necessary to increase failure safety through redundancy on the physical layer.
The IaaS layer as the first abstraction layer from the hardware has an important role in this topic, too.
The grouping of redundant physical resources into Availability Zones on the IaaS level gives customers the option to distribute their workload across different AZs, which results in better failure safety.
While CSPs already have some similarities in how they group physical resources into AZs, there are also differences.
This standard aims to reduce those differences and clarifies what customers can expect from Availability Zones in IaaS.

Availability Zones in IaaS can be set up for Compute, Network and Storage, all referring to the same physical separation in a deployment.
This standard elaborates on the necessity of having Availability Zones for each of these classes.
It also examines the requirements customers may have when thinking about Availability Zones in relation to the taxonomy of failure safety levels [^1].
The result should enable CSPs to know when to create AZs to be SCS-compliant.

## Design Considerations

Availability Zones should represent parts of the same physical deployment that are independent of each other.
The maximum of physical independence is achieved through putting physical machines into different fire zones.
In that case a failure case up to level 3 as described in the taxonomy of failure safety levels document[^1] will not lead to a complete outage of the deployment.

Having Availability Zones represent fire zones also means that an AZ can take over workload from another AZ in a failure case of level 3.
Even the destruction of one Availability Zone will then not automatically entail the destruction of the other AZs.

:::caution

Even with fire zones being physically designed to protect parts of a data center from severe destruction in case of a fire, this will not always succeed.
Availability Zones in Clouds are most of the time within the same physical data center.
In case of a big catastrophe like a huge fire or a flood the whole data center could be destroyed.
Availability Zones will not protect customers against these failure cases of level 4 of the taxonomy of failure safety[^1].

:::

Smaller deployments such as edge deployments may not have more than one fire zone in a single location.
To include such deployments, it should not be required to use Availability Zones.

Other physical factors that should be considered are the power supplies, internet connection, cooling and core routing.
Availability Zones have also been configured to reflect redundancy in, for example, the power supply at the level of the PDU.

Contributor

There seems to be something wrong with this sentence. I don't understand it.

Contributor Author

I changed the whole paragraph, please re-read it.

That means there are deployments that have one Availability Zone per rack, because each rack has its own PDU and this PDU was considered to be the single point of failure that an AZ should represent.
While this is also a possible measure of independence, it only provides failure safety for level 2.
Therefore this standard should be very clear about which kind of independence an AZ should represent, and deployments should not be allowed to have Availability Zones that represent different levels of failure safety.

Additionally Availability Zones are available for Compute, Storage and Network services.
They behave differently for each of these resources and also when working across resource-based Availability Zones, e.g. attaching a volume from one AZ to a virtual machine in another AZ.
For each of these IaaS resource classes, it should be defined under which circumstances Availability Zones should be used.

### Scope of the Availability Zone Standard

When elaborating redundancy and failure safety in data centers, it is necessary to also define redundancy on the physical level.
There are already recommendations from the BSI for physical redundancy within a cloud deployment [^2].
This standard considers these recommendations as a basis that is followed by most CSPs.
So this standard will not go into details already provided by the CSP, but will rather concentrate on the IaaS layer and only take a coarse view of the physical layer.
The first assumption derived from the BSI recommendations is that the destruction of one fire zone will not lead to an outage of all power lines (not PDUs), internet connections, core routers or cooling systems.

For the setup of Availability Zones this means that within every AZ there needs to be redundancy in core routers, internet connection and power lines, as well as at least two separate cooling systems.
This should avoid having single points of failure within the Availability Zones.
But all this physical infrastructure can be the same across all Availability Zones in a deployment, as long as it can survive the destruction of one fire zone.

[^2]: [Availability recommendations from the BSI](https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/RZ-Sicherheit/RZ-Verfuegbarkeitsmassnahmen.pdf?__blob=publicationFile&v=9)

### Options considered

#### Physical-based Availability Zones

It would be possible to standardize the usage of Availability Zones across all IaaS resources.
The downside is that the IaaS resources behave so differently that they have different requirements for redundancy and thus for Availability Zones.
This is therefore not the way to go.
Besides that, it is already possible to create two physically separated deployments close to each other, connect them and use regions to distinguish between the IaaS of both deployments.

The question that remains is what an Availability Zone should consist of.
Having one Availability Zone per fire zone gives the best level of failure safety that can be achieved by CSPs.
Building on the relation between fire zones and the physical redundancy recommendations from the BSI, this combination is a good starting point, but its validity needs to be checked for the different IaaS resources.

Another point is where Availability Zones can be instantiated and what the connection between AZs should look like.
To properly handle the outage of one AZ, where a second AZ can step in, a few requirements need to be met for the connection between those two AZs.
The amount of data that needs to be transferred very quickly in a failure case may be enormous, so high bandwidth is required between connected AZs.
To avoid additional failure cases, the latency between those two Availability Zones needs to be low.
With such requirements it is very clear that AZs should only reside within one (physical) region of an IaaS deployment.
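
To give a feeling for the bandwidth requirement, the following minimal sketch computes an idealized transfer time between two AZs. The data volume and link speeds are hypothetical figures chosen only for illustration, and the function name `transfer_time_seconds` is not part of any SCS or OpenStack tooling.

```python
def transfer_time_seconds(data_bytes, bandwidth_bits_per_second, efficiency=1.0):
    """Idealized time to move data_bytes over a link.

    efficiency (0 < efficiency <= 1) is an assumed factor for protocol
    overhead; 1.0 means the full nominal bandwidth is usable.
    """
    if bandwidth_bits_per_second <= 0:
        raise ValueError("bandwidth must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return data_bytes * 8 / (bandwidth_bits_per_second * efficiency)


# Hypothetical example: 10 TB of volume data to be moved after an AZ outage.
TEN_TERABYTES = 10 * 10**12
for gbit in (10, 100):
    seconds = transfer_time_seconds(TEN_TERABYTES, gbit * 10**9)
    print(f"{gbit} Gbit/s link: {seconds / 3600:.2f} hours")
```

Under these assumptions a tenfold increase in bandwidth reduces the transfer time from hours to minutes, which is why the interconnection between AZs matters for recovery.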

#### AZs in Compute

Compute Hosts are physical machines on which the compute service runs.
A single virtual machine is always running on ONE compute host.
Redundancy of virtual machines is either up to the layer above IaaS or up to the customers themselves.
Having Availability Zones gives customers the possibility to run another virtual machine as a backup within another Availability Zone.
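
The idea of placing a backup virtual machine in another AZ can be sketched as a simple round-robin placement. This is only an illustration with hypothetical instance and zone names; in an OpenStack deployment the zone would typically be chosen by the customer at server creation time, for example with the `--availability-zone` option of `openstack server create`.

```python
from itertools import cycle


def spread_across_zones(instance_names, zones):
    """Assign instances to AZs in round-robin order."""
    if not zones:
        raise ValueError("at least one availability zone is required")
    return {name: zone for name, zone in zip(instance_names, cycle(zones))}


def survivors(placement, failed_zone):
    """Return the instances that are not located in the failed AZ."""
    return sorted(name for name, zone in placement.items() if zone != failed_zone)


# Hypothetical names, used only for illustration.
placement = spread_across_zones(["web-1", "web-2", "web-3"], ["az1", "az2"])
print(placement)
print(survivors(placement, "az1"))
```

As long as the instances are spread over at least two AZs, the outage of a single AZ leaves at least one instance running.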

Customers will expect that in case of the failure of one Availability Zone all other AZs are still available.
The highest possible failure safety is achieved here when the Availability Zones for Compute correspond to different fire zones.

When the BSI recommendations are followed, there should already be redundancy in power lines, internet connection and cooling.
An outage of one of these physical resources will not affect the compute host and its resources for more than a minimal timeframe.
But when a single PDU is used for a rack, a failure of that PDU will result in an outage of all compute hosts in this rack.
In such a case it is not relevant whether this rack represents a whole Availability Zone or is only part of a bigger AZ.
None of the virtual machines on the affected compute hosts will be available, and they need to be restarted on other hosts, whether in the same Availability Zone or in another one.

#### AZs in Storage

There are many different backends used for the storage service, with Ceph being one of the most prominent.
The configuration of those backends can already include spanning one storage cluster over physical machines in different fire zones.
In combination with internal replication, a configuration is possible that already distributes the replicas of volumes over different fire zones.
When a deployment has a storage backend configured in this way, it can already provide safety in case of a failure of level 3.
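
The reasoning above can be sketched as a small check: data survives the loss of any single fire zone if its replicas are placed in at least two different fire zones. The host names and the mapping below are hypothetical, and this is a simplified model rather than the placement logic of Ceph or any other backend.

```python
def survives_single_fire_zone_loss(replica_hosts, fire_zone_of_host):
    """True if the replicas are located in at least two different fire zones."""
    zones = {fire_zone_of_host[host] for host in replica_hosts}
    return len(zones) >= 2


# Hypothetical mapping of storage hosts to fire zones.
fire_zone_of_host = {
    "stor-a": "fz1",
    "stor-b": "fz1",
    "stor-c": "fz2",
    "stor-d": "fz3",
}

print(survives_single_fire_zone_loss(["stor-a", "stor-c", "stor-d"], fire_zone_of_host))
print(survives_single_fire_zone_loss(["stor-a", "stor-b"], fire_zone_of_host))
```

The second call illustrates the risky case: two replicas exist, but both sit in the same fire zone, so replication alone does not provide safety for a level 3 failure.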

Using Availability Zones is also possible for the storage service, but configuring AZs on top of a configuration like the one above will not increase safety.
Nevertheless, using AZs when there are different backends in different fire zones will give customers a hint to back up volumes into the storage of other AZs.

Additionally when the BSI recommendations are followed, there should already be redundancy in power lines, internet connection and cooling.
An outage of one of these physical resources will not affect the storage host and its resources for more than a minimal timeframe.
When internal replication is used, either through the IaaS or through the storage backend itself, the outage of a single PDU and thus of a single rack will not affect the availability of the data itself.
None of these physical factors require the usage of an Availability Zone for Storage.
An increase in the level of failure safety will not be reached through AZs in these cases.

Still, deployments with compute AZs but without storage AZs might be confusing.
CSPs may need to communicate clearly up to which failure safety level their storage service can automatically have redundancy and from which level customers are responsible for the redundancy of their data.

#### AZs in Network

Network resources can typically be set up quickly and easily from building instructions.
Those instructions are stored in the database of the networking service.

If a physical machine on which certain network resources are set up is no longer available, the resources can be rolled out on another physical machine without depending on the current state of the lost resources.
There might only be a loss of a few packets within the lost network resources.

With Compute and Storage in a good state (e.g. through having fire zones with a compute AZ each and storage being replicated over the fire zones), there would be no downside to omitting Availability Zones for the network service.
It might even be the opposite: having resources run in certain Availability Zones might prevent them from being scheduled in other AZs[^3].

Contributor

permit [...] from

Did you mean "prevent [...] from" here?

To be honest, I can't really tell as I haven't fully understood why there are no downsides from omitting AZs in network from this whole paragraph.
Maybe one or two details could be added to explain the reasoning behind this general statement?

Contributor Author

I added a few more lines to this paragraph to explain this better. And yes, it was "prevent".

This standard will therefore make no recommendations about Network AZs.

[^3]: [Availability Zones in Neutron for OVN](https://docs.openstack.org/neutron/latest/admin/ovn/availability_zones.html)

### Cross-Attaching volumes from one AZ to another compute AZ

Without the networking AZs we only need to take a closer look into attaching volumes to virtual machines across AZs.

When there is more than one Storage Availability Zone, those AZs normally align with the Compute Availability Zones.
This means that fire zone 1 contains compute AZ 1 and storage AZ 1, fire zone 2 contains compute AZ 2 and storage AZ 2, and the same applies to fire zone 3.
It is possible to allow or forbid cross-attaching volumes from one storage Availability Zone to virtual machines in another AZ.
If it is not allowed, then the creation of volume-based virtual machines will fail, if there is no space left for VMs in the corresponding Availability Zone.
While this may be unfortunate, it gives customers a very clear picture of an Availability Zone.
It clarifies that having a virtual machine in another AZ also requires having a backup or replica of the volumes in the other storage AZ.
Then this backup or replication can be used to create a new virtual machine in the other AZ.

It seems to be a good decision to not encourage CSPs to allow cross-attach.
Currently CSPs also do not seem to widely use it.
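
As an illustration of how forbidding cross-attach could look in practice: Nova has a `cross_az_attach` option in the `[cinder]` section of `nova.conf`, which defaults to allowing cross-attach. The sketch below parses a hypothetical configuration snippet and models the resulting attach decision. The helper names are made up for this example, and the snippet is not taken from any real deployment.

```python
import configparser

# Hypothetical nova.conf fragment that forbids cross-AZ attachment.
NOVA_CONF_SNIPPET = """
[cinder]
cross_az_attach = False
"""


def cross_az_attach_allowed(conf_text):
    """Read the cross_az_attach setting; assume True if it is not set."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return parser.getboolean("cinder", "cross_az_attach", fallback=True)


def may_attach(volume_az, server_az, cross_allowed):
    """Simplified model: attaching works within one AZ, or anywhere if allowed."""
    return cross_allowed or volume_az == server_az


allowed = cross_az_attach_allowed(NOVA_CONF_SNIPPET)
print(may_attach("az1", "az1", allowed))
print(may_attach("az1", "az2", allowed))
```

With cross-attach disabled, a volume in storage AZ 1 can only be used by a virtual machine in compute AZ 1, which matches the clear picture of an Availability Zone described above.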

## Standard

If Compute Availability Zones are used, they MUST be in different fire zones.
Availability Zones for Storage SHOULD be set up if no storage backend is used that can span different fire zones and automatically replicate the data.
Otherwise a single Availability Zone for Storage SHOULD be configured.

If more than one Availability Zone for Storage is set up, the attaching of volumes from one Storage Availability Zone to another Compute Availability Zone (cross-attach) SHOULD NOT be possible.

Within each Availability Zone:

- there MUST be redundancy in the power supply, i.e. in the power lines into the deployment

Contributor

imho we need to more clearly define what "redundancy" means here, that is, redundancy up to what level? Redundancy for every electrical component in the AZ would possibly entail:

  • redundant PDUs and PSUs for each server
  • redundant fuses and electric circuits from each PDU to the redundant online UPS system
  • possibly connecting each redundant UPS (e.g. battery backed) to two independent local power generators for redundant fallback power generation (e.g. diesel generator)
  • and finally external redundant energy providers over redundant external power connections.

also we may want to specify the level of redundancy (2, 3..).

Contributor Author

This is an important topic. But I wonder, if this would fit in a standard about Availability Zones in that kind of detail.

I did not want to discuss all possible physical features that need to be redundant. I think it would be better to refer to a document that does all this (at least within this standard). Maybe something from the BSI like this?

Or do you think that it should be stated here in a detail like:

  • there MUST be at least two redundant power sources (power line or generator)
  • each PDU SHOULD have at least one redundant twin
  • each PDU MUST have at least two redundant electric circuits to the redundant power sources

Contributor

I think it should be in the eyes of the CSP (also giving them a certain freedom of interpretation), while expecting that the user is able to get a certain level of higher availability.

- there MUST be redundancy in external connection (e.g. internet connection or WAN-connection)
- there MUST be redundancy in core routers
- there SHOULD be at least two cooling systems, that are independent of each other

AZs SHOULD only occur within the same region and have a low-latency interconnection with a high bandwidth.
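
The rules of this section could be turned into an audit checklist. The following sketch checks a self-declared description of the Compute AZs against the MUST and SHOULD rules. The `ZoneDescription` schema is hypothetical, and "redundancy" is read here as "at least two", which is an assumption and not something this standard defines.

```python
from dataclasses import dataclass


@dataclass
class ZoneDescription:
    """Self-declared facts about one Compute AZ (hypothetical schema)."""
    name: str
    fire_zone: str
    power_lines: int
    external_connections: int
    core_routers: int
    cooling_systems: int


def audit_zones(zones):
    """Return (violated MUST rules, violated SHOULD rules) as lists of messages."""
    must, should = [], []
    first_zone_in_fire_zone = {}
    for zone in zones:
        other = first_zone_in_fire_zone.setdefault(zone.fire_zone, zone.name)
        if other != zone.name:
            must.append(f"{zone.name} shares fire zone {zone.fire_zone} with {other}")
        # Assumption: redundancy means at least two independent instances.
        redundancy_checks = (
            ("power lines", zone.power_lines),
            ("external connections", zone.external_connections),
            ("core routers", zone.core_routers),
        )
        for label, count in redundancy_checks:
            if count < 2:
                must.append(f"{zone.name}: no redundancy in {label}")
        if zone.cooling_systems < 2:
            should.append(f"{zone.name}: fewer than two cooling systems")
    return must, should


# Hypothetical deployment description, used only for illustration.
zones = [
    ZoneDescription("az1", "fz1", 2, 2, 2, 2),
    ZoneDescription("az2", "fz1", 2, 1, 2, 1),
]
must, should = audit_zones(zones)
print(must)
print(should)
```

Such a check only evaluates what the CSP declares. It does not replace a physical audit, since the underlying facts cannot be verified through the IaaS API.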

## Related Documents

The taxonomy of failsafe levels can be used to get an overview of the levels of failure safety in a deployment (TODO: link after DR is merged.)

The BSI can be consulted for further information about [failure risks](https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/Grundschutz/Kompendium/Elementare_Gefaehrdungen.pdf?__blob=publicationFile&v=4), [risk analysis for a datacenter](https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/Grundschutz/BSI_Standards/standard_200_3.pdf?__blob=publicationFile&v=2) or [measures for availability](https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/RZ-Sicherheit/RZ-Verfuegbarkeitsmassnahmen.pdf?__blob=publicationFile&v=9).

## Conformance Tests

As this standard will not require Availability Zones to be present, we cannot automatically test the conformance.
The other parts of the standard are physical or internal and could only be tested through an audit.
Whether there are fire zones physically available is a criterion that will never change for a single deployment, so this only needs to be audited once.
It might be possible to also use Gaia-X Credentials to provide such information, which then could be tested.