From dc9ad59019bb8170b8073c60f38fa56acd52da00 Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Mon, 17 Jun 2024 15:38:32 +0200 Subject: [PATCH 01/24] Create scs-XXXX-vN-Availability-Zones-Standard.md Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- ...scs-XXXX-vN-Availability-Zones-Standard.md | 79 +++++++++++++++++++ 1 file changed, 79 insertions(+) create mode 100644 Standards/scs-XXXX-vN-Availability-Zones-Standard.md diff --git a/Standards/scs-XXXX-vN-Availability-Zones-Standard.md b/Standards/scs-XXXX-vN-Availability-Zones-Standard.md new file mode 100644 index 000000000..f47a36ad1 --- /dev/null +++ b/Standards/scs-XXXX-vN-Availability-Zones-Standard.md @@ -0,0 +1,79 @@ +--- +title: Availability Zones Standard +type: Standard +status: Draft +track: IaaS +--- + +## Introduction + +Introduction + +## Terminology + +| Term | Explanation | +| ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------- | +| Fire Zone | A physical separation in a data center that will contain fire within it. Effectively stopping spreading of fire. | + +## Motivation + +Motivation + +## Design Considerations + + + AZs should represent parts of the same deployment, that have an independency of each other + AZs should be able to take workload from another AZ in a Failure Case of Level 3 (in other words: the destruction of one AZ will not automatically include destruction of the other AZs) + + Compute: resources are bound to one AZ, replication cannot be guaranteed, downtime or loss of resources is most likely + Storage: highly depended on storage configuration, replication even over different AZs is part of some storage backends + Network: network resources are also stored as configuration pattern in the DB and could be materialized in other parts of a deployment easily as long as the DB is still available. 
+ + We should not require AZs to be present (== allow small deployments and edge use cases) + + +- Availability Zones are available for Compute, Storage and Network. They behave differently there + +### Options considered + +#### AZs in Compute + + + +#### AZs in Storage + + + +#### AZs in Network + + + +### Open questions + +RECOMMENDED + +## Standard + + + AZs should only occur within the same deployment and have an interconnection that represents that (we should not require specific numbers in bandwidth and latency.) + We should separate between AZs for different resources (Compute, Storage, Network) + +Compute needs AZs (because VMs may be single point of failure) if failure case 3 may occur (part of the deployment is destroyed, if the deployment is small there will be no failure case three, as the whole deployment will be destroyed) +Storage should either be replicated over different zones (e.g. fire zones) that are equivalent to compute AZs or also use AZs +Network do not need AZs + + Power supply may be confused with power line in. Maybe a PDU is what we should talk about - those need to exist for each AZ independently. + When we define fire zone == compute AZ, then every AZ of course has to fulfill the guidelines for a single fire zone. Maybe this should be stated implicitly rather than explicitly. + internet uplinks: after the destruction of one AZ, uplink to the internet must still be possible (that can be done without requiring a separate uplinks for each AZ.) + each AZ should be designed with minimal single point of failures (e.g. single core router) to avoid a situation where a failure of class 2 will disable a whole AZ and so lead to a failure of class 3. + + +## Related Documents + +Related Documents, OPTIONAL + +## Conformance Tests + +As this standard will not require Availability Zones to be present, we cannot automatically test the conformance. +The other parts of the standard are physical or internal and could only be tested through an audit. 
+Whether there are fire zones physically available is a criteria that will never change for a single deployment - this only needs to be audited once. From 336e2dcc945cd4440786b9713b345b18c3c98daf Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Tue, 18 Jun 2024 16:05:12 +0200 Subject: [PATCH 02/24] Update scs-XXXX-vN-Availability-Zones-Standard.md Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- ...scs-XXXX-vN-Availability-Zones-Standard.md | 21 ++++++++++++++++--- 1 file changed, 18 insertions(+), 3 deletions(-) diff --git a/Standards/scs-XXXX-vN-Availability-Zones-Standard.md b/Standards/scs-XXXX-vN-Availability-Zones-Standard.md index f47a36ad1..b0e2144c2 100644 --- a/Standards/scs-XXXX-vN-Availability-Zones-Standard.md +++ b/Standards/scs-XXXX-vN-Availability-Zones-Standard.md @@ -7,17 +7,32 @@ track: IaaS ## Introduction -Introduction +On the IaaS-Level especially in OpenStack it is possible to group resources in Availability Zones. +Such Zones often are mapped to the physical layer of a deployment, such as e.g. physical separation of hardware or redundancy of power circuits or fire zones. +But how CSPs apply Availability Zones to the IaaS Layer in one deplyoment may differ widely. +Therefore this standard will address the minimal requirements that need to be met, when creating Avaiability Zones. ## Terminology | Term | Explanation | | ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------- | +| Availability Zone | (also: AZ) internal representation of physical grouping of service hosts, which also lead to internal grouping of resources. | | Fire Zone | A physical separation in a data center that will contain fire within it. Effectively stopping spreading of fire. | +| PDU | Power Distribution Unit, used to distribute the power to all physical machines of a single server rack. 
| +| Compute | A generic name for the IaaS service, that manages virtual machines (e.g. Nova in OpenStack). | +| Network | A generic name for the IaaS service, that manages network resources (e.g. Neutron in OpenStack). | +| Storage | A generic name for the IaaS service, that manages the storage backends and virtual devices (e.g. Cinder in OpenStack). | ## Motivation -Motivation +Redundancy is a non-trivial but relevant issue for a cloud deployment. +The IaaS layer especially as the first virtualization from the hardware has an important role in this topic, because it is possible to provide failure safety through redundancy from failures on the physical layer. +The grouping of physical resources into Availability Zones on the IaaS level, gives customers the option to distribute their workload to different AZs which will result in a better failure safety. +While CSPs already have some similarities in their grouping of physical resources to AZs, there are also differences. +Availability Zones can be set up for Compute, Network and Storage while all referring to the same physical separation in a deployment. +This standard elaborates the necessity of having Availability Zones for each of these classes. +It will also check the requirement customers may have, when thinking about Availability Zones with regard to the taxonomy of failure safety levels [^1]. +The result should enable CSPs to know when to create AZs to be SCS-compliant. ## Design Considerations @@ -70,7 +85,7 @@ Network do not need AZs ## Related Documents -Related Documents, OPTIONAL +The taxonomy of failsafe levels (TODO: link after DR is merged.) 
## Conformance Tests From 3ae46c4d50e90ac126d8b7268df79ed787380033 Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Wed, 19 Jun 2024 15:08:34 +0200 Subject: [PATCH 03/24] First part for fire zones being through, rest will follow Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- ...scs-XXXX-vN-Availability-Zones-Standard.md | 58 ++++++++++++++++--- 1 file changed, 49 insertions(+), 9 deletions(-) diff --git a/Standards/scs-XXXX-vN-Availability-Zones-Standard.md b/Standards/scs-XXXX-vN-Availability-Zones-Standard.md index b0e2144c2..4fe464cba 100644 --- a/Standards/scs-XXXX-vN-Availability-Zones-Standard.md +++ b/Standards/scs-XXXX-vN-Availability-Zones-Standard.md @@ -36,46 +36,86 @@ The result should enable CSPs to know when to create AZs to be SCS-compliant. ## Design Considerations +Availability Zones should represent parts of the same deployment, that have an independency of each other. +The maximum of physical independency is achieved through putting physical machines into different fire zones. +In that case a failure case up to level 3 as described in the taxonomy of failure safety levels document[^1] will not lead to a complete outage of the deployment. + +Havine Availability Zones represent fire zones will also result in AZs being to take workload from another AZ in a Failure Case of Level 3. +So that even the destruction of one Availability Zone will not automatically include the destruction of the other AZs. + +Smaller deplyoments like edge deployments may not have more than one fire zone in a single location. +To include such deployments, it should not be required to use Availability Zones. 
+ - AZs should represent parts of the same deployment, that have an independency of each other - AZs should be able to take workload from another AZ in a Failure Case of Level 3 (in other words: the destruction of one AZ will not automatically include destruction of the other AZs) Compute: resources are bound to one AZ, replication cannot be guaranteed, downtime or loss of resources is most likely Storage: highly depended on storage configuration, replication even over different AZs is part of some storage backends Network: network resources are also stored as configuration pattern in the DB and could be materialized in other parts of a deployment easily as long as the DB is still available. - We should not require AZs to be present (== allow small deployments and edge use cases) - - -- Availability Zones are available for Compute, Storage and Network. They behave differently there +Availability Zones are available for Compute, Storage and Network services. +They behave differently for each of these resources and also when working across resource-based Availability Zones, e.g. attaching a volume from one AZ to a virtual machine in another AZ. ### Options considered #### AZs in Compute +Compute Hosts are physical machines on which the compute service runs. +A single virtual machine is always running on ONE compute host. +Redundancy of virtual machines is either up to the layer above IaaS or up to the customers themself. +Having Availability Zones gives customers the possibility to let another virtual machine as a backup run within another Availability Zone. +Customers will expect that in case of the failure of one Availability Zone all other AZs are still available. +The highest possible failure safety here is achieved, when Availability Zones for Compute are used for different fire Zones. #### AZs in Storage +There are many different backends used for the storage service with Ceph being one of the most prominent backends. 
+Configuring those backends can already include spanning one storage cluster over physical machines in different fire zones. +In combination with internal replication a configuration is possible, that already distributes replicas from volumes over different fire zones. +When a deployment has such a configured storage backend, it already can provide safety in case of a failure of level 3. + +Using Availability Zone is also possible for the storage service, but configuring AZs, when having a configuration like above will not increase safety. +Nevertheless using AZs when having different backends in different fire zones will give customers a hint to backup volumes into storages of other AZs. +Still it might be confusing when having deployments with compute AZs but without storage AZs. +CSPs may need to communicate clearly up to which failure safety level their storage service can automatically have redundancy and from which level customers are responsible for the redundancy of their data. #### AZs in Network +Network resources can typically be set up quickly and easily from building instructions. +Those instructions are stored in the database of the networking service. + +If a physical machine, on which certain network resources are set up, is not available anymore, the resources can be rolled out on another physical machine, without depending on the current situation of the lost resources. +There might only be a loss of a few packages within the los network ressources. +With having Compute and Storage in a good state (e.g. through having fire zones with a compute AZ each and storage being replicated over the fire zones) it would not have downsides to not have Availability Zones for the network service. +It might even be the opposite: Having resources running in certain Availability Zones might permit them from being scheduled in other AZs[^2]. +This standard will therefore make no recommendations about Network AZs. 
+ +[^2]: [Availability Zones in Neutron for OVN](https://docs.openstack.org/neutron/latest/admin/ovn/availability_zones.html) ### Open questions -RECOMMENDED +Without the networking AZs we only need to take a closer look into attaching volumes to virtual machines across AZs. + +It is ## Standard +### Compute + +Compute Availability Zone MUST be in different fire zones. + + AZs should only occur within the same deployment and have an interconnection that represents that (we should not require specific numbers in bandwidth and latency.) We should separate between AZs for different resources (Compute, Storage, Network) Compute needs AZs (because VMs may be single point of failure) if failure case 3 may occur (part of the deployment is destroyed, if the deployment is small there will be no failure case three, as the whole deployment will be destroyed) -Storage should either be replicated over different zones (e.g. fire zones) that are equivalent to compute AZs or also use AZs -Network do not need AZs + +### Storage + +If there are more than one fire zone in a deployment, the storage SHOULD either be configured to automatically replicate volumes over different fire zones OR also have one Availability Zones for each fire zone. Power supply may be confused with power line in. Maybe a PDU is what we should talk about - those need to exist for each AZ independently. When we define fire zone == compute AZ, then every AZ of course has to fulfill the guidelines for a single fire zone. Maybe this should be stated implicitly rather than explicitly. 
From 447d90c87fc497a802a569ed1e19d854056b090c Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Fri, 21 Jun 2024 15:52:13 +0200 Subject: [PATCH 04/24] Further work on other factors for AZs than fire zones Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- ...scs-XXXX-vN-Availability-Zones-Standard.md | 33 ++++++++++++++++--- 1 file changed, 29 insertions(+), 4 deletions(-) diff --git a/Standards/scs-XXXX-vN-Availability-Zones-Standard.md b/Standards/scs-XXXX-vN-Availability-Zones-Standard.md index 4fe464cba..a4ff1a258 100644 --- a/Standards/scs-XXXX-vN-Availability-Zones-Standard.md +++ b/Standards/scs-XXXX-vN-Availability-Zones-Standard.md @@ -46,17 +46,34 @@ So that even the destruction of one Availability Zone will not automatically inc Smaller deplyoments like edge deployments may not have more than one fire zone in a single location. To include such deployments, it should not be required to use Availability Zones. +Other physical factors that should be considered are the power supplies, internet connection, cooling and core routing. +Availability Zones have been also being configured to show redundancy in e.g. Power Supply as in the PDU. +There are deployments, which have Availability Zones per rack as each rack has it's own PDU and this was considered to be the single point of failure and AZ should represent. +While this is also a possible measurement of independency it only provides failure safty for level 2. +Therefore this standard should be very clear about which independency an AZ should represent and it should not be allowed to have different deployments with their Availability Zones representing different levels of failure safety. +There are recommendations from the BSI for physical redundancy within a cloud deployment. +This standard considers these recommendation as a basis for all data centers. 
+This means that the destruction of one fire zone will not lead to an outage of all power lines, internet connections, core routers or cooling systems. - Compute: resources are bound to one AZ, replication cannot be guaranteed, downtime or loss of resources is most likely - Storage: highly depended on storage configuration, replication even over different AZs is part of some storage backends - Network: network resources are also stored as configuration pattern in the DB and could be materialized in other parts of a deployment easily as long as the DB is still available. +For the setup of Availability Zone this means, that within every AZ, there needs to be redundancy in core routers, internet connection, power lines and at least two separate cooling systems. +But all this physical infrastructure can be the same over all Availability Zones in a deployment, when it is possible to survive the destruction of one fire zone. -Availability Zones are available for Compute, Storage and Network services. +Additionally Availability Zones are available for Compute, Storage and Network services. They behave differently for each of these resources and also when working across resource-based Availability Zones, e.g. attaching a volume from one AZ to a virtual machine in another AZ. ### Options considered +#### Physical-based Availability Zones + +It is possible standardize the Usage of Availability Zones over all IaaS resources. +The downside from this is, that the IaaS resources behave so differently, that they have different requirements for redundancy and thus Availability Zones. +This is not the way to go. + +The question that remains is, what an Availability Zone should consist of? +Having one Availability Zone per fire zone gives the best level of failure safety, that can be achieved by CSPs. 
+When building up on the relation between fire zone and physical redundancy recommendations as from the BSI, this combination is a good starting point, but need to be checked for the validity for the different IaaS resources. + #### AZs in Compute Compute Hosts are physical machines on which the compute service runs. @@ -105,7 +122,15 @@ It is ### Compute Compute Availability Zone MUST be in different fire zones. +Availabilty Zones for Storage SHOULD be setup, if there is no storage backend used that can span over different fire zones and automatically replicate the data. + + -- Cross- attaching: If Availability Zones for Storage are used, the attaching of volumes from one Storage Availability +Within each Availability Zone: +- there MUST be redundancy in power supply, as in line into the deployment +- there MUST be redundancy in external connection (e.g. internet connection or WAN-connection) +- there MUST be redundancy in core routers +- there SHOULD be at least two cooling systems, that are independent of each other AZs should only occur within the same deployment and have an interconnection that represents that (we should not require specific numbers in bandwidth and latency.) 
From eb6e5bc73cf4391ef8a53b9d952d6c5a60d7a4fc Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Mon, 24 Jun 2024 11:02:01 +0200 Subject: [PATCH 05/24] First complete Draft of Availability Standard Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- ...scs-XXXX-vN-Availability-Zones-Standard.md | 71 +++++++++++-------- 1 file changed, 43 insertions(+), 28 deletions(-) diff --git a/Standards/scs-XXXX-vN-Availability-Zones-Standard.md b/Standards/scs-XXXX-vN-Availability-Zones-Standard.md index a4ff1a258..d1bc9b5ae 100644 --- a/Standards/scs-XXXX-vN-Availability-Zones-Standard.md +++ b/Standards/scs-XXXX-vN-Availability-Zones-Standard.md @@ -40,7 +40,7 @@ Availability Zones should represent parts of the same deployment, that have an i The maximum of physical independency is achieved through putting physical machines into different fire zones. In that case a failure case up to level 3 as described in the taxonomy of failure safety levels document[^1] will not lead to a complete outage of the deployment. -Havine Availability Zones represent fire zones will also result in AZs being to take workload from another AZ in a Failure Case of Level 3. +Having Availability Zones represent fire zones will also result in AZs being to take workload from another AZ in a Failure Case of Level 3. So that even the destruction of one Availability Zone will not automatically include the destruction of the other AZs. Smaller deplyoments like edge deployments may not have more than one fire zone in a single location. @@ -48,20 +48,22 @@ To include such deployments, it should not be required to use Availability Zones Other physical factors that should be considered are the power supplies, internet connection, cooling and core routing. Availability Zones have been also being configured to show redundancy in e.g. Power Supply as in the PDU. 
-There are deployments, which have Availability Zones per rack as each rack has it's own PDU and this was considered to be the single point of failure and AZ should represent. +That means there are deployments, which have Availability Zones per rack as each rack has its own PDU and this was considered to be the single point of failure an AZ should represent. While this is also a possible measurement of independency it only provides failure safty for level 2. Therefore this standard should be very clear about which independency an AZ should represent and it should not be allowed to have different deployments with their Availability Zones representing different levels of failure safety. -There are recommendations from the BSI for physical redundancy within a cloud deployment. -This standard considers these recommendation as a basis for all data centers. -This means that the destruction of one fire zone will not lead to an outage of all power lines, internet connections, core routers or cooling systems. +There are recommendations from the BSI for physical redundancy within a cloud deployment [^2]. +This standard considers these recommendations to be followed by most CSPs and will thus use them as a basis for all data centers. +From these recommendations this standard assumes that the destruction of one fire zone will not lead to an outage of all power lines (not PDUs), internet connections, core routers or cooling systems. -For the setup of Availability Zone this means, that within every AZ, there needs to be redundancy in core routers, internet connection, power lines and at least two separate cooling systems. +For the setup of Availability Zone this means, that within every AZ, there needs to be redundancy in core routers, internet connection, power lines and at least two separate cooling systems to avoid single points of failure in Availability Zones. 
But all this physical infrastructure can be the same over all Availability Zones in a deployment, when it is possible to survive the destruction of one fire zone. Additionally Availability Zones are available for Compute, Storage and Network services. They behave differently for each of these resources and also when working across resource-based Availability Zones, e.g. attaching a volume from one AZ to a virtual machine in another AZ. +[^2]: [Availability recommendations from the BSI](https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/RZ-Sicherheit/RZ-Verfuegbarkeitsmassnahmen.pdf?__blob=publicationFile&v=9) + ### Options considered #### Physical-based Availability Zones @@ -74,6 +76,12 @@ The question that remains is, what an Availability Zone should consist of? Having one Availability Zone per fire zone gives the best level of failure safety, that can be achieved by CSPs. When building up on the relation between fire zone and physical redundancy recommendations as from the BSI, this combination is a good starting point, but need to be checked for the validity for the different IaaS resources. +Another point is where Availability Zones can be instantiated and what the connection between AZs should look like. +To have a proper way to deal with outages of one AZ, where a second AZ can step in, a few requirements need to be met for the connection between those two AZs. +The amount of data that needs to be transferred very fast in a failure case may be enormous, so there is a requirement for a high bandwidth between connected AZs. +To avoid additional failure cases the latency between those two Availability Zones needs to be low. +With such requirements it is very clear that AZs should only reside within one (physical) region of an IaaS-Deployment. + #### AZs in Compute Compute Hosts are physical machines on which the compute service runs. 
@@ -84,6 +92,12 @@ Having Availability Zones gives customers the possibility to let another virtual Customers will expect that in case of the failure of one Availability Zone all other AZs are still available. The highest possible failure safety here is achieved, when Availability Zones for Compute are used for different fire Zones. +When the BSI recommendations are followed, there should already be redundancy in power lines, internet connection and cooling. +An outage of one of these physical resources will not affect the compute host and its resources for more than a minimal timeframe. +But when a single PDU is used for a rack, a failure of that PDU will result in an outage of all compute hosts in this rack. +In such a case it is not relevant, whether this rack represents a whole Availability Zone or is only part of a bigger AZ. +All virtual machines on the affected compute hosts will not be available and need to be restarted on other hosts, whether of the same Availability Zone or another. + #### AZs in Storage There are many different backends used for the storage service with Ceph being one of the most prominent backends. @@ -94,6 +108,12 @@ When a deployment has such a configured storage backend, it already can provide Using Availability Zone is also possible for the storage service, but configuring AZs, when having a configuration like above will not increase safety. Nevertheless using AZs when having different backends in different fire zones will give customers a hint to backup volumes into storages of other AZs. +Additionally when the BSI recommendations are followed, there should already be redundancy in power lines, internet connection and cooling. +An outage of one of these physical resources will not affect the storage host and its resources for more than a minimal timeframe. 
+When internal replication is used, either through the IaaS or through the storage backend itself, the outage of a single PDU and thus of a single rack will not affect the availability of the data itself. +All these physical factors do not require the usage of an Availability Zone for Storage. +An increase of the level of failure safety will not be reached through AZs in these cases. + Still it might be confusing when having deployments with compute AZs but without storage AZs. CSPs may need to communicate clearly up to which failure safety level their storage service can automatically have redundancy and from which level customers are responsible for the redundancy of their data. @@ -106,25 +126,30 @@ If a physical machine, on which certain network resources are set up, is not ava There might only be a loss of a few packages within the los network ressources. With having Compute and Storage in a good state (e.g. through having fire zones with a compute AZ each and storage being replicated over the fire zones) it would not have downsides to not have Availability Zones for the network service. -It might even be the opposite: Having resources running in certain Availability Zones might permit them from being scheduled in other AZs[^2]. +It might even be the opposite: Having resources running in certain Availability Zones might prevent them from being scheduled in other AZs[^3]. This standard will therefore make no recommendations about Network AZs. -[^2]: [Availability Zones in Neutron for OVN](https://docs.openstack.org/neutron/latest/admin/ovn/availability_zones.html) +[^3]: [Availability Zones in Neutron for OVN](https://docs.openstack.org/neutron/latest/admin/ovn/availability_zones.html) ### Open questions Without the networking AZs we only need to take a closer look into attaching volumes to virtual machines across AZs. -It is +It is possible to allow or forbid cross-attaching volumes from one AZ to virtual machines in another AZ. 
+If it is not allowed, then the creation of volume-based virtual machines will fail, in case of an outage of a complete Availability Zone. +This does not seem to be a good option with regard to the failure safety level, as transferring a virtual machine from one AZ to another in a failure case will get way more complex. +A replication of the volume has to be present in another storage Availability Zone that can be attached to the corresponding compute Availability Zone, which is not the AZ, that has an outage. +Then this replication - maybe a snapshot - can be used to create a new virtual machine. -## Standard +While it seems to be a good decision to allow cross-attach, CSPs currently do not seem to widely use it. +The reasons for and against this configuration may need to be discussed further to decide, whether this standard should make any recommendations regarding cross-attach. -### Compute +## Standard Compute Availability Zone MUST be in different fire zones. Availabilty Zones for Storage SHOULD be setup, if there is no storage backend used that can span over different fire zones and automatically replicate the data. - -- Cross- attaching: If Availability Zones for Storage are used, the attaching of volumes from one Storage Availability +[TO BE DISCUSSED:] If Availability Zones for Storage are used, the attaching of volumes from one Storage Availability Zone to another Compute Availability Zone (cross-attach) SHOULD be allowed. Within each Availability Zone: - there MUST be redundancy in power supply, as in line into the deployment @@ -132,28 +157,18 @@ Within each Availability Zone: - there MUST be redundancy in core routers - there SHOULD be at least two cooling systems, that are independent of each other - - AZs should only occur within the same deployment and have an interconnection that represents that (we should not require specific numbers in bandwidth and latency.) 
- We should separate between AZs for different resources (Compute, Storage, Network) - -Compute needs AZs (because VMs may be single point of failure) if failure case 3 may occur (part of the deployment is destroyed, if the deployment is small there will be no failure case three, as the whole deployment will be destroyed) - -### Storage - -If there are more than one fire zone in a deployment, the storage SHOULD either be configured to automatically replicate volumes over different fire zones OR also have one Availability Zones for each fire zone. - - Power supply may be confused with power line in. Maybe a PDU is what we should talk about - those need to exist for each AZ independently. - When we define fire zone == compute AZ, then every AZ of course has to fulfill the guidelines for a single fire zone. Maybe this should be stated implicitly rather than explicitly. - internet uplinks: after the destruction of one AZ, uplink to the internet must still be possible (that can be done without requiring a separate uplinks for each AZ.) - each AZ should be designed with minimal single point of failures (e.g. single core router) to avoid a situation where a failure of class 2 will disable a whole AZ and so lead to a failure of class 3. - +AZs SHOULD only occur within the same region and have a low-latency interconnection with a high bandwidth. ## Related Documents -The taxonomy of failsafe levels (TODO: link after DR is merged.) +The taxonomy of failsafe levels can be used to get an overview over the levels of failure safety in a deployment(TODO: link after DR is merged.) 
+ +The BSI can be consulted for further information about [failure risks](https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/Grundschutz/Kompendium/Elementare_Gefaehrdungen.pdf?__blob=publicationFile&v=4), [risk analysis for a datacenter](https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/Grundschutz/BSI_Standards/standard_200_3.pdf?__blob=publicationFile&v=2) or [measures for availability](https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/RZ-Sicherheit/RZ-Verfuegbarkeitsmassnahmen.pdf?__blob=publicationFile&v=9). ## Conformance Tests As this standard will not require Availability Zones to be present, we cannot automatically test the conformance. The other parts of the standard are physical or internal and could only be tested through an audit. Whether there are fire zones physically available is a criteria that will never change for a single deployment - this only needs to be audited once. +It might be possible to also use Gaia-X Credentials to provide such information, which then could be tested. + From 5f8c566c7808c950ae61a2ed4ed63f78328b9ae9 Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Mon, 24 Jun 2024 11:03:57 +0200 Subject: [PATCH 06/24] Update scs-XXXX-vN-Availability-Zones-Standard.md Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- Standards/scs-XXXX-vN-Availability-Zones-Standard.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/Standards/scs-XXXX-vN-Availability-Zones-Standard.md b/Standards/scs-XXXX-vN-Availability-Zones-Standard.md index d1bc9b5ae..e8aaeb792 100644 --- a/Standards/scs-XXXX-vN-Availability-Zones-Standard.md +++ b/Standards/scs-XXXX-vN-Availability-Zones-Standard.md @@ -40,7 +40,7 @@ Availability Zones should represent parts of the same deployment, that have an i The maximum of physical independency is achieved through putting physical machines into different fire zones. 
In that case a failure case up to level 3 as described in the taxonomy of failure safety levels document[^1] will not lead to a complete outage of the deployment. -Having Availability Zones represent fire zones will also result in AZs being to take workload from another AZ in a Failure Case of Level 3. +Having Availability Zones represent fire zones will also result in AZs being to take workload from another AZ in a Failure Case of Level 3. So that even the destruction of one Availability Zone will not automatically include the destruction of the other AZs. Smaller deplyoments like edge deployments may not have more than one fire zone in a single location. @@ -86,7 +86,7 @@ With such requirements it is very clear that AZs should only reside within one ( Compute Hosts are physical machines on which the compute service runs. A single virtual machine is always running on ONE compute host. -Redundancy of virtual machines is either up to the layer above IaaS or up to the customers themself. +Redundancy of virtual machines is either up to the layer above IaaS or up to the customers themself. Having Availability Zones gives customers the possibility to let another virtual machine as a backup run within another Availability Zone. Customers will expect that in case of the failure of one Availability Zone all other AZs are still available. @@ -152,6 +152,7 @@ Availabilty Zones for Storage SHOULD be setup, if there is no storage backend us [TO BE DISCUSSED:] If Availability Zones for Storage are used, the attaching of volumes from one Storage Availability Zone to another Compute Availability Zone (cross-attach) SHOULD be allowed. Within each Availability Zone: + - there MUST be redundancy in power supply, as in line into the deployment - there MUST be redundancy in external connection (e.g. 
internet connection or WAN-connection) - there MUST be redundancy in core routers @@ -171,4 +172,3 @@ As this standard will not require Availability Zones to be present, we cannot au The other parts of the standard are physical or internal and could only be tested through an audit. Whether there are fire zones physically available is a criteria that will never change for a single deployment - this only needs to be audited once. It might be possible to also use Gaia-X Credentials to provide such information, which then could be tested. - From 856e0998f6e4e573dc69f349d82f348ac55da598 Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Tue, 25 Jun 2024 10:12:36 +0200 Subject: [PATCH 07/24] Apply suggestions from code review Co-authored-by: Sven Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- Standards/scs-XXXX-vN-Availability-Zones-Standard.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/Standards/scs-XXXX-vN-Availability-Zones-Standard.md b/Standards/scs-XXXX-vN-Availability-Zones-Standard.md index e8aaeb792..6fb2a52b0 100644 --- a/Standards/scs-XXXX-vN-Availability-Zones-Standard.md +++ b/Standards/scs-XXXX-vN-Availability-Zones-Standard.md @@ -26,21 +26,21 @@ Therefore this standard will address the minimal requirements that need to be me ## Motivation Redundancy is a non-trivial but relevant issue for a cloud deployment. -The IaaS layer especially as the first virtualization from the hardware has an important role in this topic, because it is possible to provide failure safety through redundancy from failures on the physical layer. +The IaaS layer especially as the first abstraction layer from the hardware has an important role in this topic, because it is possible to increase failure safety through redundancy on the physical layer. 
The grouping of physical resources into Availability Zones on the IaaS level, gives customers the option to distribute their workload to different AZs which will result in a better failure safety. While CSPs already have some similarities in their grouping of physical resources to AZs, there are also differences. Availability Zones can be set up for Compute, Network and Storage while all refering to the same physical separation in a deployment. This standard elaborates the necessity of having Availability Zones for each of these classes. -It will also check the requirement customers may have, when thinking about Availability Zones in regarding of the taxonomy of failure safety levels [^1]. +It will also check the requirements customers may have, when thinking about Availability Zones in relation to the taxonomy of failure safety levels [^1]. The result should enable CSPs to know when to create AZs to be SCS-compliant. ## Design Considerations -Availability Zones should represent parts of the same deployment, that have an independency of each other. -The maximum of physical independency is achieved through putting physical machines into different fire zones. +Availability Zones should represent parts of the same deployment that are independent of each other. +The maximum of physical independence is achieved through putting physical machines into different fire zones. In that case a failure case up to level 3 as described in the taxonomy of failure safety levels document[^1] will not lead to a complete outage of the deployment. -Having Availability Zones represent fire zones will also result in AZs being to take workload from another AZ in a Failure Case of Level 3. +Having Availability Zones represent fire zones will also result in AZs being able to take workload from another AZ in a Failure Case of Level 3. So that even the destruction of one Availability Zone will not automatically include the destruction of the other AZs. 
Smaller deplyoments like edge deployments may not have more than one fire zone in a single location. @@ -146,7 +146,7 @@ The reasons for and against this configuration may need to be discussed further ## Standard -Compute Availability Zone MUST be in different fire zones. +If Compute Availability Zone are used, they MUST be in different fire zones. Availabilty Zones for Storage SHOULD be setup, if there is no storage backend used that can span over different fire zones and automatically replicate the data. [TO BE DISCUSSED:] If Availability Zones for Storage are used, the attaching of volumes from one Storage Availability Zone to another Compute Availability Zone (cross-attach) SHOULD be allowed. From 1a9140efbc6270d2335f9bbd8152fb3b177316bb Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Thu, 27 Jun 2024 10:32:16 +0200 Subject: [PATCH 08/24] Restructuring and adding discussed point from IaaS call. Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- ...scs-XXXX-vN-Availability-Zones-Standard.md | 62 +++++++++++++------ 1 file changed, 42 insertions(+), 20 deletions(-) diff --git a/Standards/scs-XXXX-vN-Availability-Zones-Standard.md b/Standards/scs-XXXX-vN-Availability-Zones-Standard.md index 6fb2a52b0..9a51b4f67 100644 --- a/Standards/scs-XXXX-vN-Availability-Zones-Standard.md +++ b/Standards/scs-XXXX-vN-Availability-Zones-Standard.md @@ -26,23 +26,35 @@ Therefore this standard will address the minimal requirements that need to be me ## Motivation Redundancy is a non-trivial but relevant issue for a cloud deployment. -The IaaS layer especially as the first abstraction layer from the hardware has an important role in this topic, because it is possible to increase failure safety through redundancy on the physical layer. 
-The grouping of physical resources into Availability Zones on the IaaS level, gives customers the option to distribute their workload to different AZs which will result in a better failure safety. +First and foremost it is necessary to increase failure safety through redundancy on the physical layer. +The IaaS layer as the first abstraction layer from the hardware has an important role in this topic, too. +The grouping of redundant physical resources into Availability Zones on the IaaS level, gives customers the option to distribute their workload to different AZs which will result in a better failure safety. While CSPs already have some similarities in their grouping of physical resources to AZs, there are also differences. -Availability Zones can be set up for Compute, Network and Storage while all refering to the same physical separation in a deployment. +This standard aims to reduce thos differences and will clarify, what customers can expect from Availability Zones in IaaS. + +Availability Zones in IaaS can be set up for Compute, Network and Storage while all refering to the same physical separation in a deployment. This standard elaborates the necessity of having Availability Zones for each of these classes. It will also check the requirements customers may have, when thinking about Availability Zones in relation to the taxonomy of failure safety levels [^1]. The result should enable CSPs to know when to create AZs to be SCS-compliant. ## Design Considerations -Availability Zones should represent parts of the same deployment that are independent of each other. +Availability Zones should represent parts of the same physical deployment that are independent of each other. The maximum of physical independence is achieved through putting physical machines into different fire zones. In that case a failure case up to level 3 as described in the taxonomy of failure safety levels document[^1] will not lead to a complete outage of the deployment. 
Having Availability Zones represent fire zones will also result in AZs being able to take workload from another AZ in a Failure Case of Level 3. So that even the destruction of one Availability Zone will not automatically include the destruction of the other AZs. +:::caution + +Even with fire zones being physically designed to protect parts of a data center from severe destruction in case of a fire, this will not always succeed. +Availability Zones in Clouds are most of the time within the same physical data center. +In case of a big catastrophe like a huge fire or a flood the whole data center could be destroyed. +Availability Zones will not protect customers against these failure cases of level 4 of the taxonomy of failure safety[^1]. + +::: + Smaller deplyoments like edge deployments may not have more than one fire zone in a single location. To include such deployments, it should not be required to use Availability Zones. @@ -52,15 +64,21 @@ That means there are deployments, which have Availability Zones per rack as each While this is also a possible measurement of independency it only provides failure safty for level 2. Therefore this standard should be very clear about which independency an AZ should represent and it should not be allowed to have different deployments with their Availability Zones representing different levels of failure safety. -There are recommendations from the BSI for physical redundancy within a cloud deployment [^2]. -This standard considers these recommendation ar followed by most CSPs and will thus be a basis for all data centers. -From this recommendations this standard assumes that the destruction of one fire zone will not lead to an outage of all power lines (not PDUs), internet connections, core routers or cooling systems. 
- -For the setup of Availability Zone this means, that within every AZ, there needs to be redundancy in core routers, internet connection, power lines and at least two separate cooling systems to avoid single points of failure in Availability Zones. -But all this physical infrastructure can be the same over all Availability Zones in a deployment, when it is possible to survive the destruction of one fire zone. - Additionally Availability Zones are available for Compute, Storage and Network services. They behave differently for each of these resources and also when working across resource-based Availability Zones, e.g. attaching a volume from one AZ to a virtual machine in another AZ. +For each of these IaaS resource classes, it should be defined, under which circumstances Availablitiy Zones should be used. + +### Scope of tha Availability Zone Standard + +When elaborating redundancy and failure safety in data centers, it is necessary to also define redundancy on the physical level. +There are already recommendations from the BSI for physical redundancy within a cloud deployment [^2]. +This standard considers these recommendations as a basis that is followed by most CSPs. +So this standard will not go into details, already provided by the CSP, but will rather concentrate on the IaaS layer and only have a coarse view on the physical layer. +The first assumption from the recommendations of the BSI is that the destruction of one fire zone will not lead to an outage of all power lines (not PDUs), internet connections, core routers or cooling systems. + +For the setup of Availability Zones this means, that within every AZ, there needs to be redundancy in core routers, internet connection, power lines and at least two separate cooling systems. +This should avoid having single points of failure within the Availability Zones.
+But all this physical infrastructure can be the same over all Availability Zones in a deployment, when it is possible to survive the destruction of one fire zone. [^2]: [Availability recommendations from the BSI](https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/RZ-Sicherheit/RZ-Verfuegbarkeitsmassnahmen.pdf?__blob=publicationFile&v=9) @@ -71,6 +89,7 @@ They behave differently for each of these resources and also when working across It is possible standardize the Usage of Availability Zones over all IaaS resources. The downside from this is, that the IaaS resources behave so differently, that they have different requirements for redundancy and thus Availability Zones. This is not the way to go. +Besides that, it is already possible to create two physically separated deployments close to each other, connect them with each other and use regions to differ between the IaaS on both deployments. The question that remains is, what an Availability Zone should consist of? Having one Availability Zone per fire zone gives the best level of failure safety, that can be achieved by CSPs. @@ -131,25 +150,28 @@ This standard will therefore make no recommendations about Network AZs. [^3]: [Availability Zones in Neutron for OVN](https://docs.openstack.org/neutron/latest/admin/ovn/availability_zones.html) -### Open questions +### Cross-Attaching volumes from one AZ to another compute AZ Without the networking AZs we only need to take a closer look into attaching volumes to virtual machines across AZs. -It is possible to allow or forbid cross-attaching volumes from one AZ to virtual machines in another AZ. -If it is not allowed, then the creation of volume-based virtual machines will fail, in case of an outage of a complete Availability Zone. -This does not seem to be a good option in regard for the failure safety level, as transfering a virtual machine from one AZ to another in a failure case will get way more complex. 
-A replication of the volume has to be present in another storage Availability Zone that can be attached to the corresponding compute Availability Zone, which is not the AZ, that has an outage. -Then this replication - maybe a snapshot - can be used to create a new virtual machine. +When there is more than one Storage Availability Zone, those AZs do normally align with the Compute Availability Zones. +This means that in fire zone 1 exist compute AZ 1 and storage AZ 1, in fire zone 2 are compute AZ 2 and storage AZ 2 and the same for fire zone 3. +It is possible to allow or forbid cross-attaching volumes from one storage Availability Zone to virtual machines in another AZ. +If it is not allowed, then the creation of volume-based virtual machines will fail, if there is no space left for VMs in the corresponding Availability Zone. +While this may be unfortunate, it gives customers a very clear picture of an Availability Zone. +It clarifies that having a virtual machine in another AZ also requires have a backup or replication of volumes in the other storage AZ. +Then this backup or replication can be used to create a new virtual machine in the other AZ. -While it seems to be a good decision to allow cross-attach, CSPs currently do not seem to widely use it. -The reasons for and against this configuration may need to be discussed further to decide, whether this standard should make any recommendations regarding cross-attach. +It seems to be a good decision to not encourage CSPs to allow cross-attach. +Currently CSPs also do not seem to widely use it. ## Standard If Compute Availability Zone are used, they MUST be in different fire zones. Availabilty Zones for Storage SHOULD be setup, if there is no storage backend used that can span over different fire zones and automatically replicate the data. +Otherwise a single Availabilty Zone for Storage SHOULD be configured. 
-[TO BE DISCUSSED:] If Availability Zones for Storage are used, the attaching of volumes from one Storage Availability Zone to another Compute Availability Zone (cross-attach) SHOULD be allowed. +If more than one Availability Zone for Storage is set up, the attaching of volumes from one Storage Availability Zone to another Compute Availability Zone (cross-attach) SHOULD NOT be possible. Within each Availability Zone: From 286ff93cc43ebf196df39b20fddae550526f0d89 Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Fri, 16 Aug 2024 08:34:45 +0200 Subject: [PATCH 09/24] Apply suggestions from code review Co-authored-by: Markus Hentsch <129268441+markus-hentsch@users.noreply.github.com> Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- ...scs-XXXX-vN-Availability-Zones-Standard.md | 39 ++++++++++--------- 1 file changed, 20 insertions(+), 19 deletions(-) diff --git a/Standards/scs-XXXX-vN-Availability-Zones-Standard.md b/Standards/scs-XXXX-vN-Availability-Zones-Standard.md index 9a51b4f67..3f06b0639 100644 --- a/Standards/scs-XXXX-vN-Availability-Zones-Standard.md +++ b/Standards/scs-XXXX-vN-Availability-Zones-Standard.md @@ -7,7 +7,7 @@ track: IaaS ## Introduction -On the IaaS-Level especially in OpenStack it is possible to group resources in Availability Zones. +On the IaaS level especially in OpenStack it is possible to group resources in Availability Zones. Such Zones often are mapped to the physical layer of a deployment, such as e.g. physical separation of hardware or redundancy of power circuits or fire zones. But how CSPs apply Availability Zones to the IaaS Layer in one deplyoment may differ widely. Therefore this standard will address the minimal requirements that need to be met, when creating Avaiability Zones. 
@@ -22,6 +22,7 @@ Therefore this standard will address the minimal requirements that need to be me | Compute | A generic name for the IaaS service, that manages virtual machines (e.g. Nova in OpenStack). | | Network | A generic name for the IaaS service, that manages network resources (e.g. Neutron in OpenStack). | | Storage | A generic name for the IaaS service, that manages the storage backends and virtual devices (e.g. Cinder in OpenStack). | +| CSP | Cloud Service Provider, provider managing the OpenStack infrastructure. | ## Motivation @@ -30,9 +31,9 @@ First and foremost it is necessary to increase failure safety through redundancy The IaaS layer as the first abstraction layer from the hardware has an important role in this topic, too. The grouping of redundant physical resources into Availability Zones on the IaaS level, gives customers the option to distribute their workload to different AZs which will result in a better failure safety. While CSPs already have some similarities in their grouping of physical resources to AZs, there are also differences. -This standard aims to reduce thos differences and will clarify, what customers can expect from Availability Zones in IaaS. +This standard aims to reduce those differences and will clarify, what customers can expect from Availability Zones in IaaS. -Availability Zones in IaaS can be set up for Compute, Network and Storage while all refering to the same physical separation in a deployment. +Availability Zones in IaaS can be set up for Compute, Network and Storage separately while all may be referring to the same physical separation in a deployment. This standard elaborates the necessity of having Availability Zones for each of these classes. It will also check the requirements customers may have, when thinking about Availability Zones in relation to the taxonomy of failure safety levels [^1]. The result should enable CSPs to know when to create AZs to be SCS-compliant. 
@@ -40,7 +41,7 @@ The result should enable CSPs to know when to create AZs to be SCS-compliant. ## Design Considerations Availability Zones should represent parts of the same physical deployment that are independent of each other. -The maximum of physical independence is achieved through putting physical machines into different fire zones. +The maximum level of physical independence is achieved through putting physical machines into different fire zones. In that case a failure case up to level 3 as described in the taxonomy of failure safety levels document[^1] will not lead to a complete outage of the deployment. Having Availability Zones represent fire zones will also result in AZs being able to take workload from another AZ in a Failure Case of Level 3. @@ -61,14 +62,14 @@ To include such deployments, it should not be required to use Availability Zones Other physical factors that should be considered are the power supplies, internet connection, cooling and core routing. Availability Zones have been also being configured to show redundancy in e.g. Power Supply as in the PDU. That means there are deployments, which have Availability Zones per rack as each rack has it's own PDU and this was considered to be the single point of failure an AZ should represent. -While this is also a possible measurement of independency it only provides failure safty for level 2. +While this is also a possible measurement of independency it only provides failure safety for level 2. Therefore this standard should be very clear about which independency an AZ should represent and it should not be allowed to have different deployments with their Availability Zones representing different levels of failure safety. Additionally Availability Zones are available for Compute, Storage and Network services. They behave differently for each of these resources and also when working across resource-based Availability Zones, e.g. attaching a volume from one AZ to a virtual machine in another AZ. 
-For each of these IaaS resource classes, it should be defined, under which circumstances Availablitiy Zones should be used. +For each of these IaaS resource classes, it should be defined, under which circumstances Availability Zones should be used. -### Scope of tha Availability Zone Standard +### Scope of the Availability Zone Standard When elaborating redundancy and failure safety in data centers, it is necessary to also define redundancy on the physical level. There are already recommendations from the BSI for physical redundancy within a cloud deployment [^2]. @@ -86,7 +87,7 @@ But all this physical infrastructure can be the same over all Availability Zones #### Physical-based Availability Zones -It is possible standardize the Usage of Availability Zones over all IaaS resources. +It is possible to standardize the usage of Availability Zones over all IaaS resources. The downside from this is, that the IaaS resources behave so differently, that they have different requirements for redundancy and thus Availability Zones. This is not the way to go. Besides that, it is already possible to create two physically separated deployments close to each other, connect them with each other and use regions to differ between the IaaS on both deployments. @@ -95,11 +96,11 @@ The question that remains is, what an Availability Zone should consist of? Having one Availability Zone per fire zone gives the best level of failure safety, that can be achieved by CSPs. When building up on the relation between fire zone and physical redundancy recommendations as from the BSI, this combination is a good starting point, but need to be checked for the validity for the different IaaS resources. -Another point is where Availability Zones can be instanciated and what the connection between AZs should look like. +Another point is where Availability Zones can be instantiated and what the connection between AZs should look like.
To have a proper way to deal with outages of one AZ, where a second AZ can step in, a few requirements need to be met for the connection between those two AZs. The amount data that needs to be transferred very fast in a failure case may be enormous, so there is a requirement for a high bandwidth between connected AZs. Tho avoid additional failure cases the latency between those two Availability Zones need to be low. -With such requirements it is very clear that AZs should only reside within one (physical) region of an IaaS-Deployment. +With such requirements it is very clear that AZs should only reside within one (physical) region of an IaaS deployment. #### AZs in Compute @@ -109,7 +110,7 @@ Redundancy of virtual machines is either up to the layer above IaaS or up to the Having Availability Zones gives customers the possibility to let another virtual machine as a backup run within another Availability Zone. Customers will expect that in case of the failure of one Availability Zone all other AZs are still available. -The highest possible failure safety here is achieved, when Availability Zones for Compute are used for different fire Zones. +The highest possible failure safety here is achieved, when Availability Zones for Compute are used for different fire zones. When the BSI recommendations are followed, there should already be redundancy in power lines, internet connection and cooling. An outage of one of these physical resources will not affect the compute host and its resources for more than a minimal timeframe. @@ -124,27 +125,27 @@ Configuring those backends can already include to span one storage cluster over In combination with internal replication a configuration is possible, that already distributes replicas from volumes over different fire zones. When a deployment has such a configured storage backend, it already can provide safety in case of a failure of level 3. 
-Using Availability Zone is also possible for the storage service, but configuring AZs, when having a configuration like above will not increase safety. +Using Availability Zones is also possible for the storage service, but configuring AZs, when having a configuration like above will not increase safety. Nevertheless using AZs when having different backends in different fire zones will give customers a hint to backup volumes into storages of other AZs. Additionally when the BSI recommendations are followed, there should already be redundancy in power lines, internet connection and cooling. An outage of one of these physical resources will not affect the storage host and its resources for more than a minimal timeframe. When internal replication is used, either through the IaaS or through the storage backend itself, the outage of a single PDU and such a single rack will not affect the availability of the data itself. All these physical factors are not requiring the usage of an Availability Zone for Storage. -An increase of the level auf failure safety will not be reached through AZs in these cases. +An increase of the level of failure safety will not be reached through AZs in these cases. Still it might be confusing when having deployments with compute AZs but without storage AZs. CSPs may need to communicate clearly up to which failure safety level their storage service can automatically have redundancy and from which level customers are responsible for the redundancy of their data. #### AZs in Network -Network resources can be typically fastly and easily set up from building instruction. +Virtualized network resources can typically be quickly and easily set up from building instructions. Those instructions are stored in the database of the networking service. 
If a physical machine, on which certain network resources are set up, is not available anymore, the resources can be rolled out on another physical machine, without being depended on the current situation of the lost resources. -There might only be a loss of a few packages within the los network ressources. +There might only be a loss of a few packets within the affected network resources. -With having Compute and Storage in a good state (e.g. through having fire zones with a compute AZ each and storage being replicated over the fire zones) it would not have downsides to not have Availability Zones for the network service. +With having Compute and Storage in a good state (e.g. through having fire zones with a compute AZ each and storage being replicated over the fire zones) there would be no downsides to omitting Availability Zones for the network service. It might even be the opposite: Having resources running in certain Availability Zones might prevent them from being scheduled in other AZs[^3]. This standard will therefore make no recommendations about Network AZs. @@ -155,11 +156,11 @@ This standard will therefore make no recommendations about Network AZs. Without the networking AZs we only need to take a closer look into attaching volumes to virtual machines across AZs. When there is more than one Storage Availability Zone, those AZs do normally align with the Compute Availability Zones. -This means that in fire zone 1 exist compute AZ 1 and storage AZ 1, in fire zone 2 are compute AZ 2 and storage AZ 2 and the same for fire zone 3. +This means that fire zone 1 contains compute AZ 1 and storage AZ 1, fire zone 2 contains compute AZ 2 and storage AZ 2, and the same applies to fire zone 3. It is possible to allow or forbid cross-attaching volumes from one storage Availability Zone to virtual machines in another AZ. If it is not allowed, then the creation of volume-based virtual machines will fail, if there is no space left for VMs in the corresponding Availability Zone.
While this may be unfortunate, it gives customers a very clear picture of an Availability Zone. -It clarifies that having a virtual machine in another AZ also requires have a backup or replication of volumes in the other storage AZ. +It clarifies that having a virtual machine in another AZ also requires having a backup or replication of volumes in the other storage AZ. Then this backup or replication can be used to create a new virtual machine in the other AZ. It seems to be a good decision to not encourage CSPs to allow cross-attach. @@ -167,7 +168,7 @@ Currently CSPs also do not seem to widely use it. ## Standard -If Compute Availability Zone are used, they MUST be in different fire zones. +If Compute Availability Zones are used, they MUST be in different fire zones. Availabilty Zones for Storage SHOULD be setup, if there is no storage backend used that can span over different fire zones and automatically replicate the data. Otherwise a single Availabilty Zone for Storage SHOULD be configured. From 89e770adb443e36a5f2f029960e9c67ecbb8d2dc Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Fri, 16 Aug 2024 10:10:08 +0200 Subject: [PATCH 10/24] Update scs-XXXX-vN-Availability-Zones-Standard.md Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- .../scs-XXXX-vN-Availability-Zones-Standard.md | 15 +++++++++++---- 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/Standards/scs-XXXX-vN-Availability-Zones-Standard.md b/Standards/scs-XXXX-vN-Availability-Zones-Standard.md index 3f06b0639..4ceef687d 100644 --- a/Standards/scs-XXXX-vN-Availability-Zones-Standard.md +++ b/Standards/scs-XXXX-vN-Availability-Zones-Standard.md @@ -22,7 +22,8 @@ Therefore this standard will address the minimal requirements that need to be me | Compute | A generic name for the IaaS service, that manages virtual machines (e.g. Nova in OpenStack). 
| | Network | A generic name for the IaaS service, that manages network resources (e.g. Neutron in OpenStack). | | Storage | A generic name for the IaaS service, that manages the storage backends and virtual devices (e.g. Cinder in OpenStack). | -| CSP | Cloud Service Provider, provider managing the OpenStack infrastructure. | +| BSI | German Federal Office for Information Security (Bundesamt für Sicherheit in der Informationstechnik) | +| CSP | Cloud Service Provider, provider managing the OpenStack infrastructure. | ## Motivation @@ -60,7 +61,7 @@ Smaller deplyoments like edge deployments may not have more than one fire zone i To include such deployments, it should not be required to use Availability Zones. Other physical factors that should be considered are the power supplies, internet connection, cooling and core routing. -Availability Zones have been also being configured to show redundancy in e.g. Power Supply as in the PDU. +Availability Zones were also used by CSPs as a representation of redundant PDUs. That means there are deployments, which have Availability Zones per rack as each rack has it's own PDU and this was considered to be the single point of failure an AZ should represent. While this is also a possible measurement of independency it only provides failure safety for level 2. Therefore this standard should be very clear about which independency an AZ should represent and it should not be allowed to have different deployments with their Availability Zones representing different levels of failure safety. @@ -69,6 +70,8 @@ Additionally Availability Zones are available for Compute, Storage and Network s They behave differently for each of these resources and also when working across resource-based Availability Zones, e.g. attaching a volume from one AZ to a virtual machine in another AZ. For each of these IaaS resource classes, it should be defined, under which circumstances Availability Zones should be used. 
+[^1]: [Taxonomy of Failsafe Levels in SCS (TODO: change link as soon as taxonomy is merged)](https://github.com/SovereignCloudStack/standards/pull/579) + ### Scope of the Availability Zone Standard When elaborating redundancy and failure safety in data centers, it is necessary to also define redundancy on the physical level. @@ -146,8 +149,12 @@ If a physical machine, on which certain network resources are set up, is not ava There might only be a loss of a few packets within the affected network resources. With having Compute and Storage in a good state (e.g. through having fire zones with a compute AZ each and storage being replicated over the fire zones) there would be no downsides to omitting Availability Zones for the network service. -It might even be the opposite: Having resources running in certain Availability Zones might permit them from being scheduled in other AZs[^3]. -This standard will therefore make no recommendations about Network AZs. +It might even be the opposite: Having resources running in certain Availability Zones might prevent them from being scheduled in other AZs[^3]. +As the network resources like routers are bound to an AZ, in a failure case of one AZ all resource definitions might still be there in the database, while the implementation of those resources is gone. +Trying to rebuild them in another AZ is not possible, because the scheduler will not allow them to be implemented in another AZ, then the one thats present in their definition. +In a failure case of one AZ this might lead to a lot of manual work to rebuild the SDN from scratch instead of just re-using the definitions. + +Because of this severe sideeffect, this standard will make no recommendations about Network AZs. 
[^3]: [Availability Zones in Neutron for OVN](https://docs.openstack.org/neutron/latest/admin/ovn/availability_zones.html) From 747574686e6065727b6d1d92fb5bbbbae8afb82d Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Wed, 21 Aug 2024 15:53:21 +0200 Subject: [PATCH 11/24] Update scs-XXXX-vN-Availability-Zones-Standard.md Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- Standards/scs-XXXX-vN-Availability-Zones-Standard.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Standards/scs-XXXX-vN-Availability-Zones-Standard.md b/Standards/scs-XXXX-vN-Availability-Zones-Standard.md index 4ceef687d..cfad1faae 100644 --- a/Standards/scs-XXXX-vN-Availability-Zones-Standard.md +++ b/Standards/scs-XXXX-vN-Availability-Zones-Standard.md @@ -186,7 +186,7 @@ Within each Availability Zone: - there MUST be redundancy in power supply, as in line into the deployment - there MUST be redundancy in external connection (e.g. internet connection or WAN-connection) - there MUST be redundancy in core routers -- there SHOULD be at least two cooling systems, that are independent of each other +- there SHOULD be redundancy in the cooling system AZs SHOULD only occur within the same region and have a low-latency interconnection with a high bandwidth. 
From e660cb467d731814eb9ecee3247ba77d3daddf23 Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Wed, 18 Sep 2024 15:19:58 +0200 Subject: [PATCH 12/24] Update scs-XXXX-vN-Availability-Zones-Standard.md Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- Standards/scs-XXXX-vN-Availability-Zones-Standard.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Standards/scs-XXXX-vN-Availability-Zones-Standard.md b/Standards/scs-XXXX-vN-Availability-Zones-Standard.md index cfad1faae..83c6d7201 100644 --- a/Standards/scs-XXXX-vN-Availability-Zones-Standard.md +++ b/Standards/scs-XXXX-vN-Availability-Zones-Standard.md @@ -35,7 +35,7 @@ While CSPs already have some similarities in their grouping of physical resource This standard aims to reduce those differences and will clarify, what customers can expect from Availability Zones in IaaS. Availability Zones in IaaS can be set up for Compute, Network and Storage separately while all may be referring to the same physical separation in a deployment. -This standard elaborates the necessity of having Availability Zones for each of these classes. +This standard elaborates the necessity of having Availability Zones for each of these classes of resources. It will also check the requirements customers may have, when thinking about Availability Zones in relation to the taxonomy of failure safety levels [^1]. The result should enable CSPs to know when to create AZs to be SCS-compliant. 
From 355e84a7cbcdb497af7706b911d8ff523fb8b5a2 Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Wed, 18 Sep 2024 15:45:40 +0200 Subject: [PATCH 13/24] Create scs-XXXX-w1-Availability-Zones-Standard.md Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- ...scs-XXXX-w1-Availability-Zones-Standard.md | 28 +++++++++++++++++++ 1 file changed, 28 insertions(+) create mode 100644 Standards/scs-XXXX-w1-Availability-Zones-Standard.md diff --git a/Standards/scs-XXXX-w1-Availability-Zones-Standard.md b/Standards/scs-XXXX-w1-Availability-Zones-Standard.md new file mode 100644 index 000000000..068361ef5 --- /dev/null +++ b/Standards/scs-XXXX-w1-Availability-Zones-Standard.md @@ -0,0 +1,28 @@ +--- +title: "SCS Availability Zone Standard: Implementation and Testing Notes" +type: Supplement +track: IaaS +status: Draft +supplements: + - scs-XXXX-vN-Availability-Zones-Standard.md +--- + +## Automated Tests + +The SCS will also allow small deployments and edge deployments, that both will not meet the requirement for bein divided into multiple Availability Zones. +Thus Availability Zones are not always present and there will be no automated tests to search for AZs. + +## Manual Tests / Audits + +The requirements for each Availability Zone are written in the Standard. +For each deployment, that uses Availability Zones there has to be done an Audit to check the following parameters: + +1. The presence of fire zones MUST be checked. +1.1. The correct configuration of one AZ per fire zone MUST be checked. +2. For each fire zone (== AZ) the following parts MUST be checked: +2.1. There MUST be redundancy in Power Supply +2.2. There MUST be redundancy in external connection +2.3. There MUST be redundancy in core routers + +All of these things will either not change at all like the fire zones or it is very unlikely for them to change like redundant internet connection. 
+Because of this a manual audit will be enough to check for compliance. From 65460b580948b2c397a6815429b8607d92bf4e9e Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Wed, 18 Sep 2024 15:47:57 +0200 Subject: [PATCH 14/24] Update scs-XXXX-w1-Availability-Zones-Standard.md Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- Standards/scs-XXXX-w1-Availability-Zones-Standard.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Standards/scs-XXXX-w1-Availability-Zones-Standard.md b/Standards/scs-XXXX-w1-Availability-Zones-Standard.md index 068361ef5..a175a773b 100644 --- a/Standards/scs-XXXX-w1-Availability-Zones-Standard.md +++ b/Standards/scs-XXXX-w1-Availability-Zones-Standard.md @@ -23,6 +23,6 @@ For each deployment, that uses Availability Zones there has to be done an Audit 2.1. There MUST be redundancy in Power Supply 2.2. There MUST be redundancy in external connection 2.3. There MUST be redundancy in core routers - + All of these things will either not change at all like the fire zones or it is very unlikely for them to change like redundant internet connection. Because of this a manual audit will be enough to check for compliance. 
From 8c98249b499543c061c038ff480260aecc8b2ffd Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Wed, 18 Sep 2024 15:52:16 +0200 Subject: [PATCH 15/24] Update scs-XXXX-vN-Availability-Zones-Standard.md Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- Standards/scs-XXXX-vN-Availability-Zones-Standard.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Standards/scs-XXXX-vN-Availability-Zones-Standard.md b/Standards/scs-XXXX-vN-Availability-Zones-Standard.md index 83c6d7201..dfb39e9ac 100644 --- a/Standards/scs-XXXX-vN-Availability-Zones-Standard.md +++ b/Standards/scs-XXXX-vN-Availability-Zones-Standard.md @@ -194,7 +194,7 @@ AZs SHOULD only occur within the same region and have a low-latency interconnect The taxonomy of failsafe levels can be used to get an overview over the levels of failure safety in a deployment(TODO: link after DR is merged.) -The BSI can be consulted for further information about [failure risks](https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/Grundschutz/Kompendium/Elementare_Gefaehrdungen.pdf?__blob=publicationFile&v=4), [risk analysis for a datacenter](https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/Grundschutz/BSI_Standards/standard_200_3.pdf?__blob=publicationFile&v=2) or [measures for availability](https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/RZ-Sicherheit/RZ-Verfuegbarkeitsmassnahmen.pdf?__blob=publicationFile&v=9). 
+The BSI can be consulted for further information about [failure risks](https://www.bsi.bund.de/DE/Themen/Unternehmen-und-Organisationen/Standards-und-Zertifizierung/IT-Grundschutz/IT-Grundschutz-Kompendium/Elementare-Gefaehrdungen/elementare-gefaehrdungen_node.html), [risk analysis for a datacenter](https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/Grundschutz/BSI_Standards/standard_200_3.pdf?__blob=publicationFile&v=2) or [measures for availability](https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/RZ-Sicherheit/RZ-Verfuegbarkeitsmassnahmen.pdf?__blob=publicationFile&v=9). ## Conformance Tests From 057d093ec6b2f084e26921b2788e65e7b1cdb8fc Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Thu, 19 Sep 2024 14:36:14 +0200 Subject: [PATCH 16/24] Update scs-XXXX-w1-Availability-Zones-Standard.md Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- ...scs-XXXX-w1-Availability-Zones-Standard.md | 24 ++++++++++++------- 1 file changed, 16 insertions(+), 8 deletions(-) diff --git a/Standards/scs-XXXX-w1-Availability-Zones-Standard.md b/Standards/scs-XXXX-w1-Availability-Zones-Standard.md index a175a773b..b60c29f8e 100644 --- a/Standards/scs-XXXX-w1-Availability-Zones-Standard.md +++ b/Standards/scs-XXXX-w1-Availability-Zones-Standard.md @@ -9,20 +9,28 @@ supplements: ## Automated Tests -The SCS will also allow small deployments and edge deployments, that both will not meet the requirement for bein divided into multiple Availability Zones. -Thus Availability Zones are not always present and there will be no automated tests to search for AZs. +The standard will not exclude small deployments and edge deployments, that both will not meet the requirement for being divided into multiple Availability Zones. +Thus multiple Availability Zones are not always present. +Somtimes there can just be a single Availability Zones. +Because of that, there will be no automated tests to search for AZs. 
-## Manual Tests / Audits +## Manual Tests / Audits / Required Documentation The requirements for each Availability Zone are written in the Standard. For each deployment, that uses Availability Zones there has to be done an Audit to check the following parameters: 1. The presence of fire zones MUST be checked. -1.1. The correct configuration of one AZ per fire zone MUST be checked. -2. For each fire zone (== AZ) the following parts MUST be checked: -2.1. There MUST be redundancy in Power Supply -2.2. There MUST be redundancy in external connection -2.3. There MUST be redundancy in core routers +2. The correct configuration of one AZ per fire zone MUST be checked. +3. For each fire zone (== AZ) the following parts MUST be checked: +4. There MUST be redundancy in Power Supply +5. There MUST be redundancy in external connection +6. There MUST be redundancy in core routers All of these things will either not change at all like the fire zones or it is very unlikely for them to change like redundant internet connection. Because of this a manual audit will be enough to check for compliance. + +## Physical Audits + +In cases where it is reasonable to mistrust the provided documentation, a physical audit by a natural person - called auditor - send by the OSBA (?) should be performed. +The CSP of the deployment, which needs such an audit, should grant access to the auditor to the physical infrastructure and should show them all necessary IaaS-Layer configurations, that are needed to verify compliance to this standard. 
+ From 1ad585223f76f636e00908b14f113fb3039ad6c8 Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Fri, 20 Sep 2024 10:33:30 +0200 Subject: [PATCH 17/24] Update scs-XXXX-w1-Availability-Zones-Standard.md Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- ...scs-XXXX-w1-Availability-Zones-Standard.md | 29 ++++++++++++------- 1 file changed, 18 insertions(+), 11 deletions(-) diff --git a/Standards/scs-XXXX-w1-Availability-Zones-Standard.md b/Standards/scs-XXXX-w1-Availability-Zones-Standard.md index b60c29f8e..16af0f5fc 100644 --- a/Standards/scs-XXXX-w1-Availability-Zones-Standard.md +++ b/Standards/scs-XXXX-w1-Availability-Zones-Standard.md @@ -9,25 +9,32 @@ supplements: ## Automated Tests -The standard will not exclude small deployments and edge deployments, that both will not meet the requirement for being divided into multiple Availability Zones. +The standard will not preclude small deployments and edge deployments, that both will not meet the requirement for being divided into multiple Availability Zones. Thus multiple Availability Zones are not always present. Somtimes there can just be a single Availability Zones. Because of that, there will be no automated tests to search for AZs. -## Manual Tests / Audits / Required Documentation +## Required Documentation The requirements for each Availability Zone are written in the Standard. -For each deployment, that uses Availability Zones there has to be done an Audit to check the following parameters: +For each deployment, that uses more than a single Availability Zone, the CSP has to provide documentation to prove the following points: -1. The presence of fire zones MUST be checked. -2. The correct configuration of one AZ per fire zone MUST be checked. -3. For each fire zone (== AZ) the following parts MUST be checked: -4. There MUST be redundancy in Power Supply -5. There MUST be redundancy in external connection -6.
There MUST be redundancy in core routers +1. The presence of fire zones MUST be documented (e.g. through construction plans of the deployment). +2. The correct configuration of one AZ per fire zone MUST be documented. +3. The redundancy in Power Supply within each AZ MUST be documented. +4. The redundancy in external connection within each AZ MUST be documented. +5. The redundancy in core routers within each AZ MUST be documented. -All of these things will either not change at all like the fire zones or it is very unlikely for them to change like redundant internet connection. -Because of this a manual audit will be enough to check for compliance. +All of these requirements will either not change at all like the fire zones or it is very unlikely for them to change like redundant internet connection. +Because of this documentation must only be provided in thw following cases: + +1. When a new deployment with multiple AZs should be tested for compliance. +2. When there are physical changes in a deplyoment, which already provided the documentation: the changes needs to be documented and provided as soon as possible. + +### Alternative Documentation + +If a deployment already did undergo certification like ISO 27001 or ISO 9001, those certificates can be provided as part of the documentation to cover the redundancy parts. +It is still required to document the existence of fire zones and the correct configuration of one AZ per fire zone. 
## Physical Audits From f3f76fc3d339447c4f548010b80fda5cef490195 Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Fri, 20 Sep 2024 10:35:52 +0200 Subject: [PATCH 18/24] Update scs-XXXX-w1-Availability-Zones-Standard.md Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- Standards/scs-XXXX-w1-Availability-Zones-Standard.md | 1 - 1 file changed, 1 deletion(-) diff --git a/Standards/scs-XXXX-w1-Availability-Zones-Standard.md b/Standards/scs-XXXX-w1-Availability-Zones-Standard.md index 16af0f5fc..f151976ef 100644 --- a/Standards/scs-XXXX-w1-Availability-Zones-Standard.md +++ b/Standards/scs-XXXX-w1-Availability-Zones-Standard.md @@ -40,4 +40,3 @@ It is still required to document the existence of fire zones and the correct con In cases where it is reasonable to mistrust the provided documentation, a physical audit by a natural person - called auditor - send by the OSBA (?) should be performed. The CSP of the deployment, which needs such an audit, should grant access to the auditor to the physical infrastructure and should show them all necessary IaaS-Layer configurations, that are needed to verify compliance to this standard. 
- From a72ef562b9108e9168f56cd09c22d42a2635ba04 Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Wed, 25 Sep 2024 16:22:07 +0200 Subject: [PATCH 19/24] Apply suggestions from code review Co-authored-by: Markus Hentsch <129268441+markus-hentsch@users.noreply.github.com> Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- Standards/scs-XXXX-vN-Availability-Zones-Standard.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/Standards/scs-XXXX-vN-Availability-Zones-Standard.md b/Standards/scs-XXXX-vN-Availability-Zones-Standard.md index dfb39e9ac..eb3072b4e 100644 --- a/Standards/scs-XXXX-vN-Availability-Zones-Standard.md +++ b/Standards/scs-XXXX-vN-Availability-Zones-Standard.md @@ -45,7 +45,7 @@ Availability Zones should represent parts of the same physical deployment that a The maximum level of physical independence is achieved through putting physical machines into different fire zones. In that case a failure case up to level 3 as described in the taxonomy of failure safety levels document[^1] will not lead to a complete outage of the deployment. -Having Availability Zones represent fire zones will also result in AZs being able to take workload from another AZ in a Failure Case of Level 3. +Having Availability Zones represent fire zones will also result in AZs being able to take workload from another AZ in a failure case of Level 3. So that even the destruction of one Availability Zone will not automatically include the destruction of the other AZs. :::caution @@ -145,16 +145,16 @@ CSPs may need to communicate clearly up to which failure safety level their stor Virtualized network resources can typically be quickly and easily set up from building instructions. Those instructions are stored in the database of the networking service. 
-If a physical machine, on which certain network resources are set up, is not available anymore, the resources can be rolled out on another physical machine, without being depended on the current situation of the lost resources. +If a physical machine, on which certain network resources are set up, is not available anymore, the resources can be rolled out on another physical machine, without being dependent on the current situation of the lost resources. There might only be a loss of a few packets within the affected network resources. With having Compute and Storage in a good state (e.g. through having fire zones with a compute AZ each and storage being replicated over the fire zones) there would be no downsides to omitting Availability Zones for the network service. It might even be the opposite: Having resources running in certain Availability Zones might prevent them from being scheduled in other AZs[^3]. As the network resources like routers are bound to an AZ, in a failure case of one AZ all resource definitions might still be there in the database, while the implementation of those resources is gone. -Trying to rebuild them in another AZ is not possible, because the scheduler will not allow them to be implemented in another AZ, then the one thats present in their definition. +Trying to rebuild them in another AZ is not possible, because the scheduler will not allow them to be implemented in another AZ than the one that's present in their definition. In a failure case of one AZ this might lead to a lot of manual work to rebuild the SDN from scratch instead of just re-using the definitions. -Because of this severe sideeffect, this standard will make no recommendations about Network AZs. +Because of this severe side effect, this standard will make no recommendations about Network AZs. 
[^3]: [Availability Zones in Neutron for OVN](https://docs.openstack.org/neutron/latest/admin/ovn/availability_zones.html) From 79e0428f631445cfa83bda5d3116beb3bc7e1fd8 Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Wed, 25 Sep 2024 16:26:38 +0200 Subject: [PATCH 20/24] Update scs-XXXX-vN-Availability-Zones-Standard.md Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- Standards/scs-XXXX-vN-Availability-Zones-Standard.md | 1 + 1 file changed, 1 insertion(+) diff --git a/Standards/scs-XXXX-vN-Availability-Zones-Standard.md b/Standards/scs-XXXX-vN-Availability-Zones-Standard.md index eb3072b4e..ea1f6e937 100644 --- a/Standards/scs-XXXX-vN-Availability-Zones-Standard.md +++ b/Standards/scs-XXXX-vN-Availability-Zones-Standard.md @@ -24,6 +24,7 @@ Therefore this standard will address the minimal requirements that need to be me | Storage | A generic name for the IaaS service, that manages the storage backends and virtual devices (e.g. Cinder in OpenStack). | | BSI | German Federal Office for Information Security (Bundesamt für Sicherheit in der Informationstechnik) | | CSP | Cloud Service Provider, provider managing the OpenStack infrastructure. | +| SDN | Software Defined Network, virtual networks managed by the networking service. 
| ## Motivation From 87e3d48738e384ad44a35bf860fafecf4b5e05f9 Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Mon, 30 Sep 2024 07:38:11 +0200 Subject: [PATCH 21/24] Rename scs-XXXX-vN-Availability-Zones-Standard.md to scs-0119-v1-Availability-Zones-Standard.md Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- ...nes-Standard.md => scs-0119-v1-Availability-Zones-Standard.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename Standards/{scs-XXXX-vN-Availability-Zones-Standard.md => scs-0119-v1-Availability-Zones-Standard.md} (100%) diff --git a/Standards/scs-XXXX-vN-Availability-Zones-Standard.md b/Standards/scs-0119-v1-Availability-Zones-Standard.md similarity index 100% rename from Standards/scs-XXXX-vN-Availability-Zones-Standard.md rename to Standards/scs-0119-v1-Availability-Zones-Standard.md From efd2ab590f4e1316b41688a38898cec57e92fa56 Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Mon, 30 Sep 2024 07:38:43 +0200 Subject: [PATCH 22/24] Update and rename scs-XXXX-w1-Availability-Zones-Standard.md to scs-0119-w1-Availability-Zones-Standard.md Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- ...s-Standard.md => scs-0119-w1-Availability-Zones-Standard.md} | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) rename Standards/{scs-XXXX-w1-Availability-Zones-Standard.md => scs-0119-w1-Availability-Zones-Standard.md} (98%) diff --git a/Standards/scs-XXXX-w1-Availability-Zones-Standard.md b/Standards/scs-0119-w1-Availability-Zones-Standard.md similarity index 98% rename from Standards/scs-XXXX-w1-Availability-Zones-Standard.md rename to Standards/scs-0119-w1-Availability-Zones-Standard.md index f151976ef..1794b6034 100644 --- a/Standards/scs-XXXX-w1-Availability-Zones-Standard.md +++ b/Standards/scs-0119-w1-Availability-Zones-Standard.md @@ -4,7 +4,7 @@ type: Supplement track: IaaS status: 
Draft supplements: - - scs-XXXX-vN-Availability-Zones-Standard.md + - scs-0119-v1-Availability-Zones-Standard.md --- ## Automated Tests From 58d636a27252debc527cc94ede38c348372223e9 Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Mon, 30 Sep 2024 07:41:05 +0200 Subject: [PATCH 23/24] Update scs-0119-w1-Availability-Zones-Standard.md Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- Standards/scs-0119-w1-Availability-Zones-Standard.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Standards/scs-0119-w1-Availability-Zones-Standard.md b/Standards/scs-0119-w1-Availability-Zones-Standard.md index 1794b6034..9799d6273 100644 --- a/Standards/scs-0119-w1-Availability-Zones-Standard.md +++ b/Standards/scs-0119-w1-Availability-Zones-Standard.md @@ -38,5 +38,5 @@ It is still required to document the existence of fire zones and the correct con ## Physical Audits -In cases where it is reasonable to mistrust the provided documentation, a physical audit by a natural person - called auditor - send by the OSBA (?) should be performed. +In cases where it is reasonable to mistrust the provided documentation, a physical audit by a natural person - called auditor - send by e.g. the [OSBA](https://osb-alliance.de/) should be performed. The CSP of the deployment, which needs such an audit, should grant access to the auditor to the physical infrastructure and should show them all necessary IaaS-Layer configurations, that are needed to verify compliance to this standard. 
From 720b94fd8a76110c2588bffcc889e077c314811f Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Mon, 30 Sep 2024 07:41:53 +0200 Subject: [PATCH 24/24] Update scs-0119-w1-Availability-Zones-Standard.md Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- Standards/scs-0119-w1-Availability-Zones-Standard.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Standards/scs-0119-w1-Availability-Zones-Standard.md b/Standards/scs-0119-w1-Availability-Zones-Standard.md index 9799d6273..3cedce052 100644 --- a/Standards/scs-0119-w1-Availability-Zones-Standard.md +++ b/Standards/scs-0119-w1-Availability-Zones-Standard.md @@ -26,7 +26,7 @@ For each deployment, that uses more than a single Availability Zone, the CSP has 5. The redundancy in core routers within each AZ MUST be documented. All of these requirements will either not change at all like the fire zones or it is very unlikely for them to change like redundant internet connection. -Because of this documentation must only be provided in thw following cases: +Because of this documentation must only be provided in the following cases: 1. When a new deployment with multiple AZs should be tested for compliance. 2. When there are physical changes in a deplyoment, which already provided the documentation: the changes needs to be documented and provided as soon as possible.