From 88ede0874e1354e2a63c54905fe2b53996da88b9 Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Fri, 26 Apr 2024 15:31:14 +0200 Subject: [PATCH 01/34] [DRAFT] Create scs-XXXX-vN-taxonomy-of-failsafe-levels.md Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- ...scs-XXXX-vN-taxonomy-of-failsafe-levels.md | 90 +++++++++++++++++++ 1 file changed, 90 insertions(+) create mode 100644 Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md diff --git a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md new file mode 100644 index 000000000..fac646f2e --- /dev/null +++ b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md @@ -0,0 +1,90 @@ +--- +title: Taxonomy of Failsafe Levels +type: Decision Record +status: Draft +track: IaaS +--- + + +## Abstract + +Talking about redundancy and backups in the context of clouds, the scope under which circumstances these concepts work for various ressources is not clear. +This decision records aims to define different levels of failure-safety. +These levels can then be used in standards to clearly set the scope that certain procedures in e.g. OpenStack offer. + +## Terminology + +Image + OpenStack resource, server images usually residing in a network storage backend. +Volume + OpenStack resource, virtual drive which usually resides in a network storage backend. +Virtual Machine (abbr. VM) + IaaS resource, also called server, executes workloads from users. +Secret + OpenStack ressource, could be a key or a passphrase or a certificate in Barbican. +Key Encryption Key (abbr. KEK) + OpenStack resource, used to encrypt other keys to be able to store them encrypted in a database. +floating IP (abbr. FIP) + OpenStack resource, an IP that is usually reachable from the internet. +Disk + A physical disc in a deployment. +Node + A physical machine in a deployment. +Cyber threat + Attacks on the cloud. + +## Context + +Some standards in will talk about or require procedures to backup resources or have redundancy for resources. +This decision record should discuss, which failure threats are CSPs facing and will group them into severel level. +In consequence these levels should be used in standards talking about redundancy or failure-safety. + +## Decision + +First there needs to be an overview about possible failure cases in deployments: + +| Failure Case | Probability | Consequences | +|----|-----|----| +| Disk Failure/Loss | High | Data loss on this disk. Impact depends on type of lost data (data base, user data) | +| Node Outage | Medium to High | Data loss on node / (temporary) loss of functionality and connectivity of node (impact depends on type of node) | +| Rack Outage | Medium | similar to Disk Failure and Node Outage | +| Power Outage (Data Center supply) | Medium | potential data loss, temporary loss of functionality and connectivity of node (impact depends on type of node) | +| Fire | Medium | permanently Disk and Node loss in the affected zone | +| Flood | Low | permanently Disk and Node loss in the affected zone | +| Earthquake | Very Low | permanently Disk and Node loss in the affected zone | +| Storm/Tornado | Low | permanently Disk and Node loss in the affected fire zone | +| Cyber threat | High | permanently loss of data on affected Disk and Node | + +These failure case can result in temporary (T) or permanent (P) loss of the resource or data within. 
+Additionally there are a lot of resources in IaaS alone that are more or less affected by these Failure Cases. +The following table shows the affection without considering any redundancy or failure saftey being in use: + +| Resource | Disk Loss | Node Loss | Rack Loss | Power Loss | natural catastrophy | Cyber threat | +|----|----|----|----|----|----|----| +| Image | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | +| Volume | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | +| User Data on RAM /CPU | | P | P | P | P | T/P | +| volume-based VM | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | +| ephemeral-based VM | P (if on disk) | P | P | T | P (T if lucky) | T/P | +| Secret | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | +| network configuration (DB objects) | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | +| network connectivity (materialization) | | T (if on node) | T/P | T | P (T if lucky) | T/P | +| floating IP | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | +.. | Resource | Disk Loss | Node Loss | Rack Loss | Power Loss | natural catastrophy | Cyber threat | + +For some cases there are only temporary unavailabilites and clouds do have certain workflows to avoid data loss, like redundancy in storagy backends and databases. +So some of these outages are more easy to solve than others. +A possible way to group the failure cases into levels considering the matrix of affection would be: + +| Level/Class | level of affection | Use Cases | +|---|---|-----| +| 1. Level | single volumes, VMs... | Disk Failure, Node outage, (maybe rack outage) | +| 2. Level | number of resources, most of the time recoverable | Rack outage, (Fire), (Power outage when different power supplies exist) | +| 3. Level | lots of resources / user data + potentially not recoverable | Fire, Earthquake, Storm/Tornado, Power Outage | +| 4. Level | complete deployment, not recoverable | Flood, Fire | + +Unfortunately something similar does not seem to exist right now. + +## Consequences + +Using the definition of Levels throughout all SCS standards would allow readers to know up to which Level certain procedures or aspects of resources (e.g. volume types or a backend requiring redundancy) would protect their data. From 5e0742dbd369824e30b4eb04ea30b72eac889c26 Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Fri, 26 Apr 2024 15:33:42 +0200 Subject: [PATCH 02/34] Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md | 1 - 1 file changed, 1 deletion(-) diff --git a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md index fac646f2e..ec5f7264a 100644 --- a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md +++ b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md @@ -70,7 +70,6 @@ The following table shows the affection without considering any redundancy or fa | network configuration (DB objects) | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | | network connectivity (materialization) | | T (if on node) | T/P | T | P (T if lucky) | T/P | | floating IP | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | -.. 
| Resource | Disk Loss | Node Loss | Rack Loss | Power Loss | natural catastrophy | Cyber threat | For some cases there are only temporary unavailabilites and clouds do have certain workflows to avoid data loss, like redundancy in storagy backends and databases. So some of these outages are more easy to solve than others. From 9a1c2cd3633103c8231ec912ee09fcbb1ba29adb Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Fri, 26 Apr 2024 15:36:01 +0200 Subject: [PATCH 03/34] Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md index ec5f7264a..7808af1cd 100644 --- a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md +++ b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md @@ -79,7 +79,7 @@ A possible way to group the failure cases into levels considering the matrix of |---|---|-----| | 1. Level | single volumes, VMs... | Disk Failure, Node outage, (maybe rack outage) | | 2. Level | number of resources, most of the time recoverable | Rack outage, (Fire), (Power outage when different power supplies exist) | -| 3. Level | lots of resources / user data + potentially not recoverable | Fire, Earthquake, Storm/Tornado, Power Outage | +| 3. Level | lots of resources / user data + potentially not recoverable | Fire, Earthquake, Storm/Tornado, Power Outage | | 4. Level | complete deployment, not recoverable | Flood, Fire | Unfortunately something similar does not seem to exist right now. From e0c87bfcad8dc77f0be354994d3442a2af498395 Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Mon, 29 Apr 2024 09:51:05 +0200 Subject: [PATCH 04/34] Apply suggestions from code review Co-authored-by: Markus Hentsch <129268441+markus-hentsch@users.noreply.github.com> Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- ...scs-XXXX-vN-taxonomy-of-failsafe-levels.md | 55 ++++++++++--------- 1 file changed, 28 insertions(+), 27 deletions(-) diff --git a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md index 7808af1cd..9f6c5fd7e 100644 --- a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md +++ b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md @@ -9,39 +9,39 @@ track: IaaS ## Abstract Talking about redundancy and backups in the context of clouds, the scope under which circumstances these concepts work for various ressources is not clear. -This decision records aims to define different levels of failure-safety. +This decision record aims to define different levels of failure-safety. These levels can then be used in standards to clearly set the scope that certain procedures in e.g. OpenStack offer. ## Terminology Image - OpenStack resource, server images usually residing in a network storage backend. + OpenStack resource, virtual machine images usually residing in a network storage backend. Volume OpenStack resource, virtual drive which usually resides in a network storage backend. Virtual Machine (abbr. VM) IaaS resource, also called server, executes workloads from users. Secret - OpenStack ressource, could be a key or a passphrase or a certificate in Barbican. 
+ OpenStack ressource, cryptographic asset stored in the Key Manager (e.g. Barbican). Key Encryption Key (abbr. KEK) OpenStack resource, used to encrypt other keys to be able to store them encrypted in a database. -floating IP (abbr. FIP) - OpenStack resource, an IP that is usually reachable from the internet. +Floating IP (abbr. FIP) + OpenStack resource, an IP that is usually routed and accessible from external networks. Disk - A physical disc in a deployment. + A physical disk drive (e.g. HDD, SSD) in the infrastructure. Node - A physical machine in a deployment. + A physical machine in the infrastructure. Cyber threat - Attacks on the cloud. + Attacks on the infrastructure through the means of electronic access. ## Context Some standards in will talk about or require procedures to backup resources or have redundancy for resources. -This decision record should discuss, which failure threats are CSPs facing and will group them into severel level. -In consequence these levels should be used in standards talking about redundancy or failure-safety. +This decision record should discuss, which failure threats are CSP-facing and will classify them into several levels. +In consequence these levels should be used in standards concerning redundancy or failure-safety. ## Decision -First there needs to be an overview about possible failure cases in deployments: +First there needs to be an overview about possible failure cases in infrastructures: | Failure Case | Probability | Consequences | |----|-----|----| @@ -49,15 +49,15 @@ First there needs to be an overview about possible failure cases in deployments: | Node Outage | Medium to High | Data loss on node / (temporary) loss of functionality and connectivity of node (impact depends on type of node) | | Rack Outage | Medium | similar to Disk Failure and Node Outage | | Power Outage (Data Center supply) | Medium | potential data loss, temporary loss of functionality and connectivity of node (impact depends on type of node) | -| Fire | Medium | permanently Disk and Node loss in the affected zone | -| Flood | Low | permanently Disk and Node loss in the affected zone | -| Earthquake | Very Low | permanently Disk and Node loss in the affected zone | -| Storm/Tornado | Low | permanently Disk and Node loss in the affected fire zone | -| Cyber threat | High | permanently loss of data on affected Disk and Node | +| Fire | Medium | permanent Disk and Node loss in the affected zone | +| Flood | Low | permanent Disk and Node loss in the affected zone | +| Earthquake | Very Low | permanent Disk and Node loss in the affected zone | +| Storm/Tornado | Low | permanent Disk and Node loss in the affected fire zone | +| Cyber threat | High | permanent loss or compromise of data on affected Disk and Node | -These failure case can result in temporary (T) or permanent (P) loss of the resource or data within. +These failure cases can result in temporary (T) or permanent (P) loss of the resource or data within. Additionally there are a lot of resources in IaaS alone that are more or less affected by these Failure Cases. 
-The following table shows the affection without considering any redundancy or failure saftey being in use: +The following table shows the impact when no redundancy or failure safety measure is in place: | Resource | Disk Loss | Node Loss | Rack Loss | Power Loss | natural catastrophy | Cyber threat | |----|----|----|----|----|----|----| @@ -71,19 +71,20 @@ The following table shows the affection without considering any redundancy or fa | network connectivity (materialization) | | T (if on node) | T/P | T | P (T if lucky) | T/P | | floating IP | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | -For some cases there are only temporary unavailabilites and clouds do have certain workflows to avoid data loss, like redundancy in storagy backends and databases. -So some of these outages are more easy to solve than others. -A possible way to group the failure cases into levels considering the matrix of affection would be: +For some cases, this only results in temporary unavailabilities and cloud infrastructures usually have certain mechanisms in place to avoid data loss, like redundancy in storage backends and databases. +So some of these outages are easier to mitigate than others. +A possible way to classify the failure cases into levels considering the matrix of impact would be: -| Level/Class | level of affection | Use Cases | +| Level/Class | level of impact | Use Cases | |---|---|-----| -| 1. Level | single volumes, VMs... | Disk Failure, Node outage, (maybe rack outage) | -| 2. Level | number of resources, most of the time recoverable | Rack outage, (Fire), (Power outage when different power supplies exist) | +| 1. Level | individual volumes, VMs... | Disk Failure, Node outage, (maybe rack outage) | +| 2. Level | limited number of resources, most of the time recoverable | Rack outage, (Fire), (Power outage when different power supplies exist) | | 3. Level | lots of resources / user data + potentially not recoverable | Fire, Earthquake, Storm/Tornado, Power Outage | -| 4. Level | complete deployment, not recoverable | Flood, Fire | +| 4. Level | entire infrastructure, not recoverable | Flood, Fire | -Unfortunately something similar does not seem to exist right now. +Based on our research, no similar standardized classification scheme seems to exist currently. +Thus, this decision record establishes its own. ## Consequences -Using the definition of Levels throughout all SCS standards would allow readers to know up to which Level certain procedures or aspects of resources (e.g. volume types or a backend requiring redundancy) would protect their data. +Using the definition of levels established in this decision record throughout all SCS standards would allow readers to understand up to which level certain procedures or aspects of resources (e.g. volume types or a backend requiring redundancy) would protect their data and/or resource availability. 
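To make the intended use of such levels more tangible, the following sketch shows how the mapping from failure cases to levels could be expressed in machine-readable form. It is purely illustrative: the Python representation, the level member names and the exact case-to-level assignments (e.g. whether a fire counts as level 3 or level 4) are assumptions for demonstration and not part of the decision itself.

```python
from enum import IntEnum


class FailsafeLevel(IntEnum):
    """Illustrative encoding of the four proposed levels (names are placeholders)."""

    LEVEL_1 = 1  # small hardware/software failures: disk failure, node outage
    LEVEL_2 = 2  # limited number of resources: rack outage, small fire
    LEVEL_3 = 3  # large parts of a site: fire, earthquake, regional power outage
    LEVEL_4 = 4  # entire deployment: flood, large-scale fire


# Example mapping of the failure cases discussed above to the minimum level a
# protection mechanism would have to cover in order to mitigate them.
FAILURE_CASE_TO_LEVEL = {
    "disk failure": FailsafeLevel.LEVEL_1,
    "node outage": FailsafeLevel.LEVEL_1,
    "rack outage": FailsafeLevel.LEVEL_2,
    "power outage (data center supply)": FailsafeLevel.LEVEL_3,
    "fire": FailsafeLevel.LEVEL_3,
    "earthquake": FailsafeLevel.LEVEL_3,
    "storm/tornado": FailsafeLevel.LEVEL_3,
    "flood": FailsafeLevel.LEVEL_4,
}


def covered_failure_cases(protection_level: FailsafeLevel) -> list[str]:
    """Failure cases mitigated by a mechanism protecting up to `protection_level`."""
    return [case for case, level in FAILURE_CASE_TO_LEVEL.items() if level <= protection_level]


# A mechanism documented as protecting up to level 2 (e.g. a storage backend
# replicated across racks) covers disk, node and rack failures, but none of
# the site-wide events.
print(covered_failure_cases(FailsafeLevel.LEVEL_2))
```

An `IntEnum` is used here so that levels can be compared directly, which matches the way standards would phrase guarantees ("protects up to level N").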
From 41a75a2c1b1882d1ae56ce930f070e3f936e6985 Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Mon, 29 Apr 2024 09:53:42 +0200 Subject: [PATCH 05/34] edit more wording Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md index 9f6c5fd7e..e75930f54 100644 --- a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md +++ b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md @@ -8,7 +8,7 @@ track: IaaS ## Abstract -Talking about redundancy and backups in the context of clouds, the scope under which circumstances these concepts work for various ressources is not clear. +When talking about redundancy and backups in the context of cloud infrastructures, the scope under which circumstances these concepts apply to various ressources is neither homogenous nor intuitive. This decision record aims to define different levels of failure-safety. These levels can then be used in standards to clearly set the scope that certain procedures in e.g. OpenStack offer. @@ -35,7 +35,7 @@ Cyber threat ## Context -Some standards in will talk about or require procedures to backup resources or have redundancy for resources. +Some standards provided by the SCS will talk about or require procedures to backup resources or have redundancy for resources. This decision record should discuss, which failure threats are CSP-facing and will classify them into several levels. In consequence these levels should be used in standards concerning redundancy or failure-safety. From 020bf8bc966ffc4dbb8ca5d343aa5da80fb4f667 Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Mon, 29 Apr 2024 14:21:41 +0200 Subject: [PATCH 06/34] change gloassary section to table Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- ...scs-XXXX-vN-taxonomy-of-failsafe-levels.md | 38 +++++++++---------- 1 file changed, 17 insertions(+), 21 deletions(-) diff --git a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md index e75930f54..e930b2734 100644 --- a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md +++ b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md @@ -12,30 +12,26 @@ When talking about redundancy and backups in the context of cloud infrastructure This decision record aims to define different levels of failure-safety. These levels can then be used in standards to clearly set the scope that certain procedures in e.g. OpenStack offer. -## Terminology - -Image - OpenStack resource, virtual machine images usually residing in a network storage backend. -Volume - OpenStack resource, virtual drive which usually resides in a network storage backend. -Virtual Machine (abbr. VM) - IaaS resource, also called server, executes workloads from users. -Secret - OpenStack ressource, cryptographic asset stored in the Key Manager (e.g. Barbican). -Key Encryption Key (abbr. KEK) - OpenStack resource, used to encrypt other keys to be able to store them encrypted in a database. -Floating IP (abbr. FIP) - OpenStack resource, an IP that is usually routed and accessible from external networks. -Disk - A physical disk drive (e.g. HDD, SSD) in the infrastructure. -Node - A physical machine in the infrastructure. 
-Cyber threat - Attacks on the infrastructure through the means of electronic access. +## Glossary + +| Term | Explanation | +| ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------- | +| Virtual Machine | Equals the `server` resource in Nova. | +| Ephemeral Storage | Disk storage directly supplied to a virtual machine by Nova. Different from volumes. | +| (Glance) Image | IaaS resource usually storing raw disk data. Managed by the Glance service. | +| (Cinder) Volume | IaaS resource representing block storage disk that can be attached as a virtual disk to virtual machines. Managed by the Cinder service. | +| (Volume) Snapshot | Thinly-provisioned copy-on-write snapshots of volumes. Stored in the same Cinder storage backend as volumes. | +| Volume Type | Attribute of volumes determining storage details of a volume such as backend location or whether the volume will be encrypted. | +| (Barbican) Secret | IaaS resource storing cryptographic assets such as encryption keys. Managed by the Barbican service. | +| Key Encryption Key | IaaS resource, used to encrypt other keys to be able to store them encrypted in a database. | +| Floating IP | IaaS resource, an IP that is usually routed and accessible from external networks. | +| Disk | A physical disk drive (e.g. HDD, SSD) in the infrastructure. | +| Node | A physical machine in the infrastructure. | +| Cyber threat | Attacks on the infrastructure through the means of electronic access. | ## Context -Some standards provided by the SCS will talk about or require procedures to backup resources or have redundancy for resources. +Some standards provided by the SCS project will talk about or require procedures to backup resources or have redundancy for resources. This decision record should discuss, which failure threats are CSP-facing and will classify them into several levels. In consequence these levels should be used in standards concerning redundancy or failure-safety. From f0f75cbed55b0d0943f81f8427b65b04901f4707 Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Thu, 2 May 2024 13:33:01 +0200 Subject: [PATCH 07/34] Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- .../scs-XXXX-vN-taxonomy-of-failsafe-levels.md | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md index e930b2734..a6f6b300a 100644 --- a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md +++ b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md @@ -17,6 +17,7 @@ These levels can then be used in standards to clearly set the scope that certain | Term | Explanation | | ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------- | | Virtual Machine | Equals the `server` resource in Nova. | +| Ironic Machine | A physical node managed by Ironic or as a `server` resource in Nova. | | Ephemeral Storage | Disk storage directly supplied to a virtual machine by Nova. Different from volumes. | | (Glance) Image | IaaS resource usually storing raw disk data. Managed by the Glance service. | | (Cinder) Volume | IaaS resource representing block storage disk that can be attached as a virtual disk to virtual machines. 
Managed by the Cinder service. | @@ -37,14 +38,15 @@ In consequence these levels should be used in standards concerning redundancy or ## Decision -First there needs to be an overview about possible failure cases in infrastructures: +First there needs to be an overview about possible failure cases in infrastructures as well as their probability of occurance and the damage they may cause: | Failure Case | Probability | Consequences | |----|-----|----| -| Disk Failure/Loss | High | Data loss on this disk. Impact depends on type of lost data (data base, user data) | -| Node Outage | Medium to High | Data loss on node / (temporary) loss of functionality and connectivity of node (impact depends on type of node) | -| Rack Outage | Medium | similar to Disk Failure and Node Outage | -| Power Outage (Data Center supply) | Medium | potential data loss, temporary loss of functionality and connectivity of node (impact depends on type of node) | +| Disk Failure/Loss | High | Permanent data loss in this disk. Impact depends on type of lost data (data base, user data) | +| Node Failure/Loss (without disks) | Medium to High | Permanent loss of functionality and connectivity of node (impact depends on type of node) | +| Node Outage | Medium to High | Data loss in RAM and temporary loss of functionality and connectivity of node (impact depends on type of node) | +| Rack Outage | Medium | Outage of all nodes in rack | +| Power Outage (Data Center supply) | Medium | temporary outage of all nodes in all racks | | Fire | Medium | permanent Disk and Node loss in the affected zone | | Flood | Low | permanent Disk and Node loss in the affected zone | | Earthquake | Very Low | permanent Disk and Node loss in the affected zone | @@ -62,6 +64,7 @@ The following table shows the impact when no redundancy or failure safety measur | User Data on RAM /CPU | | P | P | P | P | T/P | | volume-based VM | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | | ephemeral-based VM | P (if on disk) | P | P | T | P (T if lucky) | T/P | +| Ironic-based VM | P (all data on disk) | P | P | T | P (T if lucky) | T/P | | Secret | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | | network configuration (DB objects) | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | | network connectivity (materialization) | | T (if on node) | T/P | T | P (T if lucky) | T/P | From d475eb16adfe24a9e6817e52d0d3490f65849a14 Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Thu, 2 May 2024 13:35:31 +0200 Subject: [PATCH 08/34] Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md index a6f6b300a..72f9876d7 100644 --- a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md +++ b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md @@ -43,15 +43,15 @@ First there needs to be an overview about possible failure cases in infrastructu | Failure Case | Probability | Consequences | |----|-----|----| | Disk Failure/Loss | High | Permanent data loss in this disk. 
Impact depends on type of lost data (data base, user data) | -| Node Failure/Loss (without disks) | Medium to High | Permanent loss of functionality and connectivity of node (impact depends on type of node) | -| Node Outage | Medium to High | Data loss in RAM and temporary loss of functionality and connectivity of node (impact depends on type of node) | +| Node Failure/Loss (without disks) | Medium to High | Permanent loss of functionality and connectivity of node (impact depends on type of node) | +| Node Outage | Medium to High | Data loss in RAM and temporary loss of functionality and connectivity of node (impact depends on type of node) | | Rack Outage | Medium | Outage of all nodes in rack | -| Power Outage (Data Center supply) | Medium | temporary outage of all nodes in all racks | +| Power Outage (Data Center supply) | Medium | temporary outage of all nodes in all racks | | Fire | Medium | permanent Disk and Node loss in the affected zone | | Flood | Low | permanent Disk and Node loss in the affected zone | | Earthquake | Very Low | permanent Disk and Node loss in the affected zone | | Storm/Tornado | Low | permanent Disk and Node loss in the affected fire zone | -| Cyber threat | High | permanent loss or compromise of data on affected Disk and Node | +| Cyber threat | High | permanent loss or compromise of data on affected Disk and Node | These failure cases can result in temporary (T) or permanent (P) loss of the resource or data within. Additionally there are a lot of resources in IaaS alone that are more or less affected by these Failure Cases. From 04be929bee06f2295fdd710237c1ee72bb1caab6 Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Fri, 24 May 2024 10:46:22 +0200 Subject: [PATCH 09/34] editing table of classifictaion, as we discussed in the meeting Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- ...scs-XXXX-vN-taxonomy-of-failsafe-levels.md | 52 ++++++++++++------- 1 file changed, 32 insertions(+), 20 deletions(-) diff --git a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md index 72f9876d7..847cb90b7 100644 --- a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md +++ b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md @@ -52,37 +52,49 @@ First there needs to be an overview about possible failure cases in infrastructu | Earthquake | Very Low | permanent Disk and Node loss in the affected zone | | Storm/Tornado | Low | permanent Disk and Node loss in the affected fire zone | | Cyber threat | High | permanent loss or compromise of data on affected Disk and Node | +| Software Bug | High | permanent loss or compromise of data that trigger the bug up to data on the whole physical machine | These failure cases can result in temporary (T) or permanent (P) loss of the resource or data within. Additionally there are a lot of resources in IaaS alone that are more or less affected by these Failure Cases. 
The following table shows the impact when no redundancy or failure safety measure is in place: -| Resource | Disk Loss | Node Loss | Rack Loss | Power Loss | natural catastrophy | Cyber threat | -|----|----|----|----|----|----|----| -| Image | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | -| Volume | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | -| User Data on RAM /CPU | | P | P | P | P | T/P | -| volume-based VM | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | -| ephemeral-based VM | P (if on disk) | P | P | T | P (T if lucky) | T/P | -| Ironic-based VM | P (all data on disk) | P | P | T | P (T if lucky) | T/P | -| Secret | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | -| network configuration (DB objects) | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | -| network connectivity (materialization) | | T (if on node) | T/P | T | P (T if lucky) | T/P | -| floating IP | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | +| Resource | Disk Loss | Node Loss | Rack Loss | Power Loss | natural catastrophy | Cyber threat | Software Bug | +|----|----|----|----|----|----|----|----| +| Image | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | P | +| Volume | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | P | +| User Data on RAM /CPU | | P | P | P | P | T/P | P | +| volume-based VM | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | P | +| ephemeral-based VM | P (if on disk) | P | P | T | P (T if lucky) | T/P | P | +| Ironic-based VM | P (all data on disk) | P | P | T | P (T if lucky) | T/P | P | +| Secret | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | P | +| network configuration (DB objects) | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | P | +| network connectivity (materialization) | | T (if on node) | T/P | T | P (T if lucky) | T/P | T | +| floating IP | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | T | For some cases, this only results in temporary unavailabilities and cloud infrastructures usually have certain mechanisms in place to avoid data loss, like redundancy in storage backends and databases. So some of these outages are easier to mitigate than others. -A possible way to classify the failure cases into levels considering the matrix of impact would be: +A possible way to classify the failure cases into levels considering the matrix of impact would be, to classify the failure cases from small to big ones. +The following table shows such a classification, the occurance probability of a failure case of each class and what resources with user data might be affected. -| Level/Class | level of impact | Use Cases | -|---|---|-----| -| 1. Level | individual volumes, VMs... | Disk Failure, Node outage, (maybe rack outage) | -| 2. Level | limited number of resources, most of the time recoverable | Rack outage, (Fire), (Power outage when different power supplies exist) | -| 3. Level | lots of resources / user data + potentially not recoverable | Fire, Earthquake, Storm/Tornado, Power Outage | -| 4. Level | entire infrastructure, not recoverable | Flood, Fire | +:::caution + +This table only contains examples of failure cases and examples of affected resources. +This should not be used as a replacement for a risk analysis. +The column **user hints** only show examples of standards that may provide this class of failure safety for a certain resource. 
+Customers should always check, what they can do to protect their data and not rely solely on the CSP. + +::: + +| Level/Class | Probability | Failure Causes | loss in IaaS | User Hints | +|---|---|---|-----|-----| +| 1. Level | Very High | small Hardware or Software Failures (e.g. Disk/Node Failure, Software Bug,...) | individual volumes, VMs... | [volume replication](https://docs.scs.community/standards/scs-0114-v1-volume-type-standard) | +| 2. Level | High | important Hardware or Software Failures (e.g. Rack outage, small Fire, Power outage, ...) | limited number of resources, sometimes recoverable | [volume backups](https://github.com/SovereignCloudStack/standards/pull/567) | +| 3. Level | Medium | small catastrophes or major Failures (e.g. fire, regional Power Outage, orchestrated cyber attacks,...) | lots of resources / user data + potentially not recoverable | Availability Zones, user responsibility | +| 4. Level | Low | whole deployment loss (e.g. natural desaster,...) | entire infrastructure, not recoverable | user responsibility | Based on our research, no similar standardized classification scheme seems to exist currently. -Thus, this decision record establishes its own. +Something close but also very detailed can be found in [this (german)](https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/Grundschutz/BSI_Standards/standard_200_3.pdf?__blob=publicationFile&v=2) from the BSI. +As we want to focus on IaaS resources and also have an easily understandable structure that can be applied in standards covering replication, redundancy and backups, this document is too detailed. ## Consequences From 367d99284c02e44b1300415d21c1b8dc26674904 Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Tue, 28 May 2024 08:28:13 +0200 Subject: [PATCH 10/34] Apply suggestions from code review Co-authored-by: anjastrunk <119566837+anjastrunk@users.noreply.github.com> Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- .../scs-XXXX-vN-taxonomy-of-failsafe-levels.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md index 847cb90b7..29d22b3ce 100644 --- a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md +++ b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md @@ -27,7 +27,7 @@ These levels can then be used in standards to clearly set the scope that certain | Key Encryption Key | IaaS resource, used to encrypt other keys to be able to store them encrypted in a database. | | Floating IP | IaaS resource, an IP that is usually routed and accessible from external networks. | | Disk | A physical disk drive (e.g. HDD, SSD) in the infrastructure. | -| Node | A physical machine in the infrastructure. | +| Node | A physical machine in the infrastructure providing computational, storage and/or network connectivity capabilities. | Cyber threat | Attacks on the infrastructure through the means of electronic access. | ## Context @@ -87,14 +87,14 @@ Customers should always check, what they can do to protect their data and not re | Level/Class | Probability | Failure Causes | loss in IaaS | User Hints | |---|---|---|-----|-----| -| 1. Level | Very High | small Hardware or Software Failures (e.g. Disk/Node Failure, Software Bug,...) | individual volumes, VMs... | [volume replication](https://docs.scs.community/standards/scs-0114-v1-volume-type-standard) | -| 2. 
Level | High | important Hardware or Software Failures (e.g. Rack outage, small Fire, Power outage, ...) | limited number of resources, sometimes recoverable | [volume backups](https://github.com/SovereignCloudStack/standards/pull/567) | -| 3. Level | Medium | small catastrophes or major Failures (e.g. fire, regional Power Outage, orchestrated cyber attacks,...) | lots of resources / user data + potentially not recoverable | Availability Zones, user responsibility | -| 4. Level | Low | whole deployment loss (e.g. natural desaster,...) | entire infrastructure, not recoverable | user responsibility | +| 1. Level | Very High | small Hardware or Software Failures (e.g. Disk/Node Failure, Software Bug,...) | individual volumes, VMs... | CSP MUST operate replicas for important components (e.g. replicated volume back-end, uninterruptible power supply, ...). User SHOULD backup his data hiself and place it on an other host. | +| 2. Level | High | important Hardware or Software Failures (e.g. Rack outage, small Fire, Power outage, ...) | limited number of resources, sometimes recoverable | CSP MUST operate replicas for important components (e.g. replicated volume back-end, uninterruptible power supply, ...) OR users MUST backup their data themselves and place it on an other host. | +| 3. Level | Medium | small catastrophes or major Failures (e.g. fire, regional Power Outage, orchestrated cyber attacks,...) | lots of resources / user data + potentially not recoverable | CPS SHOULD operate hardware in dedicated Availability Zones. User SHOULD backup his data, hiself. | +| 4. Level | Low | whole deployment loss (e.g. natural disaster,...) | entire infrastructure, not recoverable | CSP is able to save user from such catastrophes. User is responsibility for saving his data from natural disasters. | Based on our research, no similar standardized classification scheme seems to exist currently. -Something close but also very detailed can be found in [this (german)](https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/Grundschutz/BSI_Standards/standard_200_3.pdf?__blob=publicationFile&v=2) from the BSI. -As we want to focus on IaaS resources and also have an easily understandable structure that can be applied in standards covering replication, redundancy and backups, this document is too detailed. +Something close but also very detailed can be found in [this (german)](https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/Grundschutz/BSI_Standards/standard_200_3.pdf?__blob=publicationFile&v=2) from the German Federal Office for Information Security. +As we want to focus on IaaS and K8s resources and also have an easily understandable structure that can be applied in standards covering replication, redundancy and backups, this document is too detailed. 
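As a concrete illustration of the user-side measures referenced in the table above (e.g. creating volume backups), the following sketch creates an off-host backup of a volume with the openstacksdk Python client. It is only an example: the cloud name and volume name are placeholders, and the exact proxy methods and parameters should be checked against the SDK version offered by the respective cloud.

```python
import openstack

# Connect using an entry from clouds.yaml; "my-cloud" is a placeholder name.
conn = openstack.connect(cloud="my-cloud")

# Look up the volume that should be protected; "important-data" is a placeholder.
volume = conn.block_storage.find_volume("important-data", ignore_missing=False)

# Create a backup of the volume. Which failsafe level this measure actually
# covers depends on where the CSP stores backups (same rack, same fire zone,
# different site, ...).
backup = conn.block_storage.create_backup(
    volume_id=volume.id,
    name="important-data-backup",
    description="periodic off-host backup",
)
conn.block_storage.wait_for_status(backup, status="available", wait=600)
print(f"backup {backup.id} is available")
```

Note that such a backup only addresses the lower levels unless the backup storage itself resides outside the affected zone; protection against level 3 or 4 events would additionally require an off-site copy.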
## Consequences From b1904408c7e6f05b56df29bd8a905ce856f06758 Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Tue, 28 May 2024 08:54:57 +0200 Subject: [PATCH 11/34] Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md index 29d22b3ce..70853cf53 100644 --- a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md +++ b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md @@ -27,7 +27,7 @@ These levels can then be used in standards to clearly set the scope that certain | Key Encryption Key | IaaS resource, used to encrypt other keys to be able to store them encrypted in a database. | | Floating IP | IaaS resource, an IP that is usually routed and accessible from external networks. | | Disk | A physical disk drive (e.g. HDD, SSD) in the infrastructure. | -| Node | A physical machine in the infrastructure providing computational, storage and/or network connectivity capabilities. +| Node | A physical machine in the infrastructure providing computational, storage and/or network connectivity capabilities. | | Cyber threat | Attacks on the infrastructure through the means of electronic access. | ## Context @@ -87,10 +87,10 @@ Customers should always check, what they can do to protect their data and not re | Level/Class | Probability | Failure Causes | loss in IaaS | User Hints | |---|---|---|-----|-----| -| 1. Level | Very High | small Hardware or Software Failures (e.g. Disk/Node Failure, Software Bug,...) | individual volumes, VMs... | CSP MUST operate replicas for important components (e.g. replicated volume back-end, uninterruptible power supply, ...). User SHOULD backup his data hiself and place it on an other host. | -| 2. Level | High | important Hardware or Software Failures (e.g. Rack outage, small Fire, Power outage, ...) | limited number of resources, sometimes recoverable | CSP MUST operate replicas for important components (e.g. replicated volume back-end, uninterruptible power supply, ...) OR users MUST backup their data themselves and place it on an other host. | -| 3. Level | Medium | small catastrophes or major Failures (e.g. fire, regional Power Outage, orchestrated cyber attacks,...) | lots of resources / user data + potentially not recoverable | CPS SHOULD operate hardware in dedicated Availability Zones. User SHOULD backup his data, hiself. | -| 4. Level | Low | whole deployment loss (e.g. natural disaster,...) | entire infrastructure, not recoverable | CSP is able to save user from such catastrophes. User is responsibility for saving his data from natural disasters. | +| 1. Level | Very High | small Hardware or Software Failures (e.g. Disk/Node Failure, Software Bug,...) | individual volumes, VMs... | CSPs MUST operate replicas for important components (e.g. replicated volume back-end, uninterruptible power supply, ...). Users SHOULD backup their data themself and place it on an other host. | +| 2. Level | High | important Hardware or Software Failures (e.g. Rack outage, small Fire, Power outage, ...) | limited number of resources, sometimes recoverable | CSPs MUST operate replicas for important components (e.g. replicated volume back-end, uninterruptible power supply, ...) 
OR users MUST backup their data themselves and place it on an other host. | +| 3. Level | Medium | small catastrophes or major Failures (e.g. fire, regional Power Outage, orchestrated cyber attacks,...) | lots of resources / user data + potentially not recoverable | CPSs SHOULD operate hardware in dedicated Availability Zones. Users SHOULD backup their data, themself. | +| 4. Level | Low | whole deployment loss (e.g. natural disaster,...) | entire infrastructure, not recoverable | CSPs may not be able to save user data from such catastrophes. Users are responsible for saving their data from natural disasters. | Based on our research, no similar standardized classification scheme seems to exist currently. Something close but also very detailed can be found in [this (german)](https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/Grundschutz/BSI_Standards/standard_200_3.pdf?__blob=publicationFile&v=2) from the German Federal Office for Information Security. From 525d9e801b5dab556b3a7ef602b4035e0b4cc2ac Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Wed, 12 Jun 2024 15:13:23 +0200 Subject: [PATCH 12/34] Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md index 70853cf53..2997420ce 100644 --- a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md +++ b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md @@ -9,7 +9,11 @@ track: IaaS ## Abstract When talking about redundancy and backups in the context of cloud infrastructures, the scope under which circumstances these concepts apply to various ressources is neither homogenous nor intuitive. -This decision record aims to define different levels of failure-safety. +There does exist very detailed list of risks and what consequences there are for each risk, but this Decision Record should give a high-level view on the topic. +So that in each standard that referenced redundancy, it can easily be seen how far this redundancy goes in that certain circumstance. +Readery of such standards should be able to know at one glance, whether the achieved failure safeness is on a basic level or a higher one and whether there would be additional actions needed to protect the data. + +This is why this decision record aims to define different levels of failure-safety. These levels can then be used in standards to clearly set the scope that certain procedures in e.g. OpenStack offer. ## Glossary From ba729d4305d86f34e4f908a70e55b0e68239e131 Mon Sep 17 00:00:00 2001 From: Hannes Baum Date: Mon, 24 Jun 2024 13:53:58 +0200 Subject: [PATCH 13/34] K8s failure cases Added some K8s failure cases and made some small wording fixes. 
Signed-off-by: Hannes Baum --- ...scs-XXXX-vN-taxonomy-of-failsafe-levels.md | 31 ++++++++++++++----- 1 file changed, 23 insertions(+), 8 deletions(-) diff --git a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md index 2997420ce..da36e715a 100644 --- a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md +++ b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md @@ -8,10 +8,10 @@ track: IaaS ## Abstract -When talking about redundancy and backups in the context of cloud infrastructures, the scope under which circumstances these concepts apply to various ressources is neither homogenous nor intuitive. -There does exist very detailed list of risks and what consequences there are for each risk, but this Decision Record should give a high-level view on the topic. +When talking about redundancy and backups in the context of cloud infrastructures, the scope under which circumstances these concepts apply to various resources is neither homogenous nor intuitive. +There does exist very detailed lists of risks and what consequences there are for each risk, but this Decision Record should give a high-level view on the topic. So that in each standard that referenced redundancy, it can easily be seen how far this redundancy goes in that certain circumstance. -Readery of such standards should be able to know at one glance, whether the achieved failure safeness is on a basic level or a higher one and whether there would be additional actions needed to protect the data. +Readers of such standards should be able to know at one glance, whether the achieved failure safeness is on a basic level or a higher one and whether there would be additional actions needed to protect the data. This is why this decision record aims to define different levels of failure-safety. These levels can then be used in standards to clearly set the scope that certain procedures in e.g. OpenStack offer. @@ -36,13 +36,13 @@ These levels can then be used in standards to clearly set the scope that certain ## Context -Some standards provided by the SCS project will talk about or require procedures to backup resources or have redundancy for resources. +Some standards provided by the SCS project will talk about or require procedures to back up resources or have redundancy for resources. This decision record should discuss, which failure threats are CSP-facing and will classify them into several levels. In consequence these levels should be used in standards concerning redundancy or failure-safety. ## Decision -First there needs to be an overview about possible failure cases in infrastructures as well as their probability of occurance and the damage they may cause: +First there needs to be an overview about possible failure cases in infrastructures as well as their probability of occurrence and the damage they may cause: | Failure Case | Probability | Consequences | |----|-----|----| @@ -58,8 +58,23 @@ First there needs to be an overview about possible failure cases in infrastructu | Cyber threat | High | permanent loss or compromise of data on affected Disk and Node | | Software Bug | High | permanent loss or compromise of data that trigger the bug up to data on the whole physical machine | +A similar overview can be provided for Kubernetes infrastructures. These also include the things mentioned for infrastructure failure cases, since a Kubernetes cluster +would most likely be deployed on top of this infrastructure or face similar problems on a bare-metal installation. 
+Part of this list comes directly from the official [Kubernetes docs](https://kubernetes.io/docs/tasks/debug/debug-cluster/). + +| Failure case | Probability | Consequences | +|----------------------------------------------|-------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| API server VM shutdown or apiserver crashing | Medium | Unable to stop, update, or start new pods, services, replication controller | +| API server backing storage lost | Medium | kube-apiserver component fails to start successfully and become healthy | +| Supporting services VM shutdown or crashing | Medium | Colocated with the apiserver, and their unavailability has similar consequences as apiserver | +| Individual node shuts down | Medium | Pods on that Node stop running | +| Network partition / Network problems | Medium | Partition A thinks the nodes in partition B are down; partition B thinks the apiserver is down | +| Kubelet software fault | Medium | Crashing kubelet cannot start new pods on the node / kubelet might delete the pods or not / node marked unhealthy / replication controllers start new pods elsewhere | +| Cluster operator error | Medium | Loss of pods, services, etc. / lost of apiserver backing store / users unable to read API | +| Failure of multiple nodes or underlying DB | Low | Possible loss of all data depending on the amount of nodes lost compared to the cluster size, otherwise costly rebuild | + These failure cases can result in temporary (T) or permanent (P) loss of the resource or data within. -Additionally there are a lot of resources in IaaS alone that are more or less affected by these Failure Cases. +Additionally, there are a lot of resources in IaaS alone that are more or less affected by these Failure Cases. The following table shows the impact when no redundancy or failure safety measure is in place: | Resource | Disk Loss | Node Loss | Rack Loss | Power Loss | natural catastrophy | Cyber threat | Software Bug | @@ -75,10 +90,10 @@ The following table shows the impact when no redundancy or failure safety measur | network connectivity (materialization) | | T (if on node) | T/P | T | P (T if lucky) | T/P | T | | floating IP | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | T | -For some cases, this only results in temporary unavailabilities and cloud infrastructures usually have certain mechanisms in place to avoid data loss, like redundancy in storage backends and databases. +For some cases, this only results in temporary unavailability and cloud infrastructures usually have certain mechanisms in place to avoid data loss, like redundancy in storage backends and databases. So some of these outages are easier to mitigate than others. A possible way to classify the failure cases into levels considering the matrix of impact would be, to classify the failure cases from small to big ones. -The following table shows such a classification, the occurance probability of a failure case of each class and what resources with user data might be affected. +The following table shows such a classification, the occurrence probability of a failure case of each class and what resources with user data might be affected. 
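The temporary (T) / permanent (P) notation used in the impact tables can also be captured in a small data structure, which makes it possible to reason programmatically about the worst case a resource faces without any mitigation. The excerpt below is only a sketch: the selection of entries is loosely taken from the tables in this document and is neither complete nor normative.

```python
from enum import Enum


class Impact(Enum):
    NONE = "-"            # resource not affected
    TEMPORARY = "T"       # unavailable until the failure is repaired
    TEMP_OR_PERM = "T/P"  # outcome depends on where the resource materialized
    PERMANENT = "P"       # lost unless a copy exists somewhere else


# Severity order used to determine the worst case.
_SEVERITY = [Impact.NONE, Impact.TEMPORARY, Impact.TEMP_OR_PERM, Impact.PERMANENT]

# Excerpt of the impact matrix (no redundancy or failure safety measures in place).
IMPACT = {
    ("volume", "disk loss"): Impact.PERMANENT,
    ("volume", "node loss"): Impact.TEMPORARY,
    ("volume", "rack loss"): Impact.TEMP_OR_PERM,
    ("ephemeral-based VM", "node loss"): Impact.PERMANENT,
    ("pod", "node shutdown"): Impact.TEMPORARY,
    ("persistent volume claim", "disk loss"): Impact.PERMANENT,
}


def worst_case(resource: str) -> Impact:
    """Most severe impact recorded for a resource in this excerpt."""
    impacts = [impact for (res, _), impact in IMPACT.items() if res == resource]
    return max(impacts, key=_SEVERITY.index) if impacts else Impact.NONE


print(worst_case("volume"))  # Impact.PERMANENT
```

Such a representation could, for example, be used to cross-check that every resource listed in the tables has at least one documented mitigation for its permanent-loss scenarios.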
:::caution From fbca52526f86036ff0a96cbe704c82a008b5a0df Mon Sep 17 00:00:00 2001 From: Martin Morgenstern Date: Mon, 24 Jun 2024 18:15:37 +0200 Subject: [PATCH 14/34] Extend glossary with K8s terms and split into sections Signed-off-by: Martin Morgenstern --- ...scs-XXXX-vN-taxonomy-of-failsafe-levels.md | 28 ++++++++++++++++--- 1 file changed, 24 insertions(+), 4 deletions(-) diff --git a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md index da36e715a..43336166d 100644 --- a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md +++ b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md @@ -18,10 +18,20 @@ These levels can then be used in standards to clearly set the scope that certain ## Glossary +### General Terms + | Term | Explanation | | ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------- | +| Disk | A physical disk drive (e.g. HDD, SSD) in the infrastructure. | +| Host | A physical machine in the infrastructure providing computational, storage and/or network connectivity capabilities. | +| Cyber threat | Attacks on the infrastructure through the means of electronic access. | + +### OpenStack Resources + +| Resource | Explanation | +| ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------- | | Virtual Machine | Equals the `server` resource in Nova. | -| Ironic Machine | A physical node managed by Ironic or as a `server` resource in Nova. | +| Ironic Machine | A physical host managed by Ironic or as a `server` resource in Nova. | | Ephemeral Storage | Disk storage directly supplied to a virtual machine by Nova. Different from volumes. | | (Glance) Image | IaaS resource usually storing raw disk data. Managed by the Glance service. | | (Cinder) Volume | IaaS resource representing block storage disk that can be attached as a virtual disk to virtual machines. Managed by the Cinder service. | @@ -30,9 +40,19 @@ These levels can then be used in standards to clearly set the scope that certain | (Barbican) Secret | IaaS resource storing cryptographic assets such as encryption keys. Managed by the Barbican service. | | Key Encryption Key | IaaS resource, used to encrypt other keys to be able to store them encrypted in a database. | | Floating IP | IaaS resource, an IP that is usually routed and accessible from external networks. | -| Disk | A physical disk drive (e.g. HDD, SSD) in the infrastructure. | -| Node | A physical machine in the infrastructure providing computational, storage and/or network connectivity capabilities. | -| Cyber threat | Attacks on the infrastructure through the means of electronic access. | + +### Kubernetes Resources + +| Resource | Explanation | +| ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------- | +| Node | A physical or virtual machine that runs workloads (Pods) managed by the Kubernetes control plane. | +| Kubelet | An agent that runs on each node in the cluster. It makes sure that containers are running in a Pod. | +| API Server | The Kubernetes control plane component which exposes the Kubernetes Application Programming Interface (API). | +| Pod | Kubernetes object that represents a workload to be executed, consisting of one or more containers. 
| +| Container | A lightweight and portable executable image that contains software and all of its dependencies. | +| Persistent Volume Claim (PVC) | Persistent storage that can be bound and mounted to a pod. | + +Source: https://kubernetes.io/docs/reference/glossary/ ## Context From 2777c6ade3eff368717056b8baccfc63d438b0c2 Mon Sep 17 00:00:00 2001 From: Martin Morgenstern Date: Mon, 24 Jun 2024 18:17:33 +0200 Subject: [PATCH 15/34] Categorize the failure scenarios & try to add structure Signed-off-by: Martin Morgenstern --- ...scs-XXXX-vN-taxonomy-of-failsafe-levels.md | 46 ++++++++++++++----- 1 file changed, 35 insertions(+), 11 deletions(-) diff --git a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md index 43336166d..a1af7bebe 100644 --- a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md +++ b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md @@ -62,27 +62,48 @@ In consequence these levels should be used in standards concerning redundancy or ## Decision -First there needs to be an overview about possible failure cases in infrastructures as well as their probability of occurrence and the damage they may cause: +### Failure Scenarios -| Failure Case | Probability | Consequences | +First there needs to be an overview about possible failure scenarios in infrastructures as well as their probability of occurrence and the damage they may cause: + +#### Hardware Related + +| Failure Scenario | Probability | Consequences | |----|-----|----| | Disk Failure/Loss | High | Permanent data loss in this disk. Impact depends on type of lost data (data base, user data) | -| Node Failure/Loss (without disks) | Medium to High | Permanent loss of functionality and connectivity of node (impact depends on type of node) | -| Node Outage | Medium to High | Data loss in RAM and temporary loss of functionality and connectivity of node (impact depends on type of node) | +| Host Failure/Loss (without disks) | Medium to High | Permanent loss of functionality and connectivity of host (impact depends on type of host) | +| Host Outage | Medium to High | Data loss in RAM and temporary loss of functionality and connectivity of host (impact depends on type of host) | | Rack Outage | Medium | Outage of all nodes in rack | +| Network router/switch outage | High/Medium/Low | ... | +| Loss of network uplink | High/Medium/Low | | | Power Outage (Data Center supply) | Medium | temporary outage of all nodes in all racks | -| Fire | Medium | permanent Disk and Node loss in the affected zone | -| Flood | Low | permanent Disk and Node loss in the affected zone | -| Earthquake | Very Low | permanent Disk and Node loss in the affected zone | -| Storm/Tornado | Low | permanent Disk and Node loss in the affected fire zone | -| Cyber threat | High | permanent loss or compromise of data on affected Disk and Node | + +#### Environmental + +Note that probability for these scenarios is dependent on the location. 
+ +| Failure Scenario | Probability | Consequences | +|----|-----|----| +| Fire | Medium | permanent Disk and Host loss in the affected zone | +| Flood | Low | permanent Disk and Host loss in the affected zone | +| Earthquake | Very Low | permanent Disk and Host loss in the affected zone | +| Storm/Tornado | Low | permanent Disk and Host loss in the affected fire zone | + +#### Others + +| Failure Scenario | Probability | Consequences | +|----|-----|----| +| Cyber threat | High | permanent loss or compromise of data on affected Disk and Host | +| Cluster operator error | High/Medium/Low | ... | | Software Bug | High | permanent loss or compromise of data that trigger the bug up to data on the whole physical machine | -A similar overview can be provided for Kubernetes infrastructures. These also include the things mentioned for infrastructure failure cases, since a Kubernetes cluster +#### Kubernetes Specific + +A similar overview can be provided for Kubernetes infrastructures. These also include the things mentioned for infrastructure failure scenario, since a Kubernetes cluster would most likely be deployed on top of this infrastructure or face similar problems on a bare-metal installation. Part of this list comes directly from the official [Kubernetes docs](https://kubernetes.io/docs/tasks/debug/debug-cluster/). -| Failure case | Probability | Consequences | +| Failure Scenario | Probability | Consequences | |----------------------------------------------|-------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------| | API server VM shutdown or apiserver crashing | Medium | Unable to stop, update, or start new pods, services, replication controller | | API server backing storage lost | Medium | kube-apiserver component fails to start successfully and become healthy | @@ -112,6 +133,9 @@ The following table shows the impact when no redundancy or failure safety measur For some cases, this only results in temporary unavailability and cloud infrastructures usually have certain mechanisms in place to avoid data loss, like redundancy in storage backends and databases. So some of these outages are easier to mitigate than others. + +### Classification by Severity + A possible way to classify the failure cases into levels considering the matrix of impact would be, to classify the failure cases from small to big ones. The following table shows such a classification, the occurrence probability of a failure case of each class and what resources with user data might be affected. From a9633b11228056c0245c76a314ebc29b4fce782d Mon Sep 17 00:00:00 2001 From: Martin Morgenstern Date: Mon, 24 Jun 2024 18:19:18 +0200 Subject: [PATCH 16/34] Distinguish between impacts on IaaS and KaaS layer Signed-off-by: Martin Morgenstern --- ...scs-XXXX-vN-taxonomy-of-failsafe-levels.md | 28 +++++++++++++++---- 1 file changed, 23 insertions(+), 5 deletions(-) diff --git a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md index a1af7bebe..5db35f37e 100644 --- a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md +++ b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md @@ -114,11 +114,13 @@ Part of this list comes directly from the official [Kubernetes docs](https://kub | Cluster operator error | Medium | Loss of pods, services, etc. 
/ lost of apiserver backing store / users unable to read API | | Failure of multiple nodes or underlying DB | Low | Possible loss of all data depending on the amount of nodes lost compared to the cluster size, otherwise costly rebuild | -These failure cases can result in temporary (T) or permanent (P) loss of the resource or data within. -Additionally, there are a lot of resources in IaaS alone that are more or less affected by these Failure Cases. -The following table shows the impact when no redundancy or failure safety measure is in place: +These failure scenarios can result in temporary (T) or permanent (P) loss of the resource or data within. +Additionally, there are a lot of resources in IaaS alone that are more or less affected by these failure scenarios. +The following tables shows the impact **when no redundancy or failure safety measure is in place**: -| Resource | Disk Loss | Node Loss | Rack Loss | Power Loss | natural catastrophy | Cyber threat | Software Bug | +### Impact on OpenStack Resources (IaaS layer) + +| Resource | Disk Loss | Node Loss | Rack Loss | Power Loss | Natural Catastrophy | Cyber Threat | Software Bug | |----|----|----|----|----|----|----|----| | Image | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | P | | Volume | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | P | @@ -131,6 +133,22 @@ The following table shows the impact when no redundancy or failure safety measur | network connectivity (materialization) | | T (if on node) | T/P | T | P (T if lucky) | T/P | T | | floating IP | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | T | +### Impact on Kubernetes Resources (KaaS layer) + +:::note + +In case the KaaS layer runs on top of IaaS layer, the impacts described in the above table apply for the KaaS layer as well. + +::: + +| Resource | Disk Loss | Node Loss | Rack Loss | Power Loss | Natural Catastrophy | Cyber Threat | Software Bug | +|----|----|----|----|----|----|----|----| +|Node|P| | | | | |T/P| +|Kubelet|T| | | | | |T/P| +|Pod|T| | | | | |T/P| +|PVC|P| | | | | |P| +|API Server|T| | | | | |T/P| + For some cases, this only results in temporary unavailability and cloud infrastructures usually have certain mechanisms in place to avoid data loss, like redundancy in storage backends and databases. So some of these outages are easier to mitigate than others. @@ -148,7 +166,7 @@ Customers should always check, what they can do to protect their data and not re ::: -| Level/Class | Probability | Failure Causes | loss in IaaS | User Hints | +| Level/Class | Probability | Failure Causes | Loss in IaaS | User Hints | |---|---|---|-----|-----| | 1. Level | Very High | small Hardware or Software Failures (e.g. Disk/Node Failure, Software Bug,...) | individual volumes, VMs... | CSPs MUST operate replicas for important components (e.g. replicated volume back-end, uninterruptible power supply, ...). Users SHOULD backup their data themself and place it on an other host. | | 2. Level | High | important Hardware or Software Failures (e.g. Rack outage, small Fire, Power outage, ...) | limited number of resources, sometimes recoverable | CSPs MUST operate replicas for important components (e.g. replicated volume back-end, uninterruptible power supply, ...) OR users MUST backup their data themselves and place it on an other host. 
| From 9dfb9c07b3b742a07e39a3b8e170a02a433a0b50 Mon Sep 17 00:00:00 2001 From: Martin Morgenstern Date: Mon, 24 Jun 2024 18:38:32 +0200 Subject: [PATCH 17/34] Fix markdownlint error Signed-off-by: Martin Morgenstern --- Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md index 5db35f37e..a5cbdd188 100644 --- a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md +++ b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md @@ -52,7 +52,7 @@ These levels can then be used in standards to clearly set the scope that certain | Container | A lightweight and portable executable image that contains software and all of its dependencies. | | Persistent Volume Claim (PVC) | Persistent storage that can be bound and mounted to a pod. | -Source: https://kubernetes.io/docs/reference/glossary/ +Source: [Kubernetes Glossary](https://kubernetes.io/docs/reference/glossary/) ## Context From 9d22126e3a7196b90bd695c07568feb669b9dd2a Mon Sep 17 00:00:00 2001 From: Martin Morgenstern Date: Mon, 8 Jul 2024 08:33:32 +0200 Subject: [PATCH 18/34] Apply restructuring suggestions by Josephine Signed-off-by: Martin Morgenstern --- .../scs-XXXX-vN-taxonomy-of-failsafe-levels.md | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md index a5cbdd188..7e134b78c 100644 --- a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md +++ b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md @@ -89,7 +89,7 @@ Note that probability for these scenarios is dependent on the location. | Earthquake | Very Low | permanent Disk and Host loss in the affected zone | | Storm/Tornado | Low | permanent Disk and Host loss in the affected fire zone | -#### Others +#### Software Related | Failure Scenario | Probability | Consequences | |----|-----|----| @@ -114,11 +114,13 @@ Part of this list comes directly from the official [Kubernetes docs](https://kub | Cluster operator error | Medium | Loss of pods, services, etc. / lost of apiserver backing store / users unable to read API | | Failure of multiple nodes or underlying DB | Low | Possible loss of all data depending on the amount of nodes lost compared to the cluster size, otherwise costly rebuild | +### Impact of the Failure Scenarios + These failure scenarios can result in temporary (T) or permanent (P) loss of the resource or data within. Additionally, there are a lot of resources in IaaS alone that are more or less affected by these failure scenarios. 
The following tables shows the impact **when no redundancy or failure safety measure is in place**: -### Impact on OpenStack Resources (IaaS layer) +#### Impact on OpenStack Resources (IaaS layer) | Resource | Disk Loss | Node Loss | Rack Loss | Power Loss | Natural Catastrophy | Cyber Threat | Software Bug | |----|----|----|----|----|----|----|----| @@ -133,7 +135,10 @@ The following tables shows the impact **when no redundancy or failure safety mea | network connectivity (materialization) | | T (if on node) | T/P | T | P (T if lucky) | T/P | T | | floating IP | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | T | -### Impact on Kubernetes Resources (KaaS layer) +For some cases, this only results in temporary unavailability and cloud infrastructures usually have certain mechanisms in place to avoid data loss, like redundancy in storage backends and databases. +So some of these outages are easier to mitigate than others. + +#### Impact on Kubernetes Resources (KaaS layer) :::note @@ -149,8 +154,6 @@ In case the KaaS layer runs on top of IaaS layer, the impacts described in the a |PVC|P| | | | | |P| |API Server|T| | | | | |T/P| -For some cases, this only results in temporary unavailability and cloud infrastructures usually have certain mechanisms in place to avoid data loss, like redundancy in storage backends and databases. -So some of these outages are easier to mitigate than others. ### Classification by Severity From 437217f8d836766f69e5c8b71f201be329487061 Mon Sep 17 00:00:00 2001 From: Martin Morgenstern Date: Mon, 8 Jul 2024 09:25:05 +0200 Subject: [PATCH 19/34] Further work on taxonomy draft (WIP) * restructure in a top-down approach * assign failure scenarios to levels * align K8s resources with OpenStack resources * still a lot of TODOs and question marks remaining * and many more Signed-off-by: Martin Morgenstern --- ...scs-XXXX-vN-taxonomy-of-failsafe-levels.md | 228 ++++++++++++------ 1 file changed, 153 insertions(+), 75 deletions(-) diff --git a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md index 7e134b78c..98e7ee575 100644 --- a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md +++ b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md @@ -13,11 +13,127 @@ There does exist very detailed lists of risks and what consequences there are fo So that in each standard that referenced redundancy, it can easily be seen how far this redundancy goes in that certain circumstance. Readers of such standards should be able to know at one glance, whether the achieved failure safeness is on a basic level or a higher one and whether there would be additional actions needed to protect the data. -This is why this decision record aims to define different levels of failure-safety. +This is why this decision record aims to define different levels of failure safety. These levels can then be used in standards to clearly set the scope that certain procedures in e.g. OpenStack offer. + + +## Context + +Some standards provided by the SCS project will talk about or require procedures to back up resources or have redundancy for resources. +This decision record should discuss, which failure threats are CSP-facing and will classify them into several levels. +In consequence these levels should be used in standards concerning redundancy or failure safety. + +Based on our research, no similar standardized classification scheme seems to exist currently. 
+Something close but also very detailed is the [BSI-Standard 200-3 (german)][bsi-200-3] published by the German Federal Office for Information Security. +As we want to focus on IaaS and K8s resources and also have an easily understandable structure that can be applied in standards covering replication, redundancy and backups, this document is too detailed. + +[bsi-200-3]: https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/Grundschutz/BSI_Standards/standard_200_3.pdf?__blob=publicationFile&v=2 + +## Decision + +### Failsafe Levels + +This Decision Record defines **four** failsafe levels, each of which describe what kind of failures have to +be tolerated by a provided service. + +In general, the lowest, **level 1**, describes isolated/local failures which can occur very frequently, whereas +the highest, **level 4**, describes relatively unlikely failures that impact a whole or even multiple datacenter(s): + +| Level | Probability | Impact | Examples | +| - | - | - | - | +| 1 | Very High | Local | Disk failure, RAM failure, software bug | +| 2 | High | Moderate | Rack outage, power outage, small fire | +| 3 | Medium | High | Regional power outage, huge fire, orchestrated cyber attack | +| 4 | Low | Very high | Natural disaster | + + + +For example, a provided service with failsafe level 2 tolerates a rack outage (because there is some kind of +redundancy in place.) + +From a cloud service provider (CSP) perspective, supporting these failure levels has the following *general* +consequences: + +* **Level 1**: CSPs MUST operate replicas for important components (e.g., RAID, replicated volume backend, uninterruptible power supply). +* **Level 2**: CSPs SHOULD operate hardware in dedicated availability zones (AZs). +* **Level 3**: CSPs SHOULD operate hardware in dedicated regions. +* **Level 4**: Depending on the regions, CSPs may not be able to save user data from such catastrophes. + +More specific guidance on what these levels mean on the IaaS and KaaS layers will be provided in the sections +further down. +But beforehand, we will describe the considered failure scenarios and the resources that may be affected. + +### Failure Scenarios + +The following failure scenarios have been considered for the proposed failsafe levels. +For each failure scenario, we estimate the probability of occurence and the (worst case) damage caused by the scenario. +Furthermore, the corresponding minimum failsafe level covering that failure scenario is given. + + + +#### Hardware Related + +| Failure Scenario | Probability | Consequences | Failsafe Level Coverage | +|----|-----|----|----| +| Disk Failure | High | Permanent data loss in this disk. 
Impact depends on type of lost data (data base, user data) | L1 | +| Host Failure (without disks) | Medium to High | Permanent loss of functionality and connectivity of host (impact depends on type of host) | L1 | +| Host Failure | Medium to High | Data loss in RAM and temporary loss of functionality and connectivity of host (impact depends on type of host) | L1 | +| Rack Outage | Medium | Outage of all nodes in rack | L2 | +| Network router/switch outage | Medium | Temporary loss of service, loss of connectivity, network partitioning | L2 | +| Loss of network uplink | Medium | Temporary loss of service, loss of connectivity | L3 | +| Power Outage (Data Center supply) | Medium | Temporary outage of all nodes in all racks | L3 | + +#### Software Related + +| Failure Scenario | Probability | Consequences | Failsafe Level Coverage | +|----|-----|----|----| +| Cyber threat | High | permanent loss or compromise of data on affected Disk and Host | L1 | +| Software bug (major) | Low | permanent loss or compromise of data that trigger the bug up to data on the whole physical machine | L1 | +| Software bug (minor) | High | temporary or partial loss or compromise of data | L1 | + + + +#### Environmental + +Note that probability for these scenarios is dependent on the location. + +| Failure Scenario | Probability | Consequences | Failsafe Level Coverage | +|----|-----|----|----| +| Fire | Medium | permanent Disk and Host loss in the affected zone | L3 | +| Flood | Low | permanent Disk and Host loss in the affected region | L4 | +| Earthquake | Very Low | permanent Disk and Host loss in the affected region | L4 | +| Storm/Tornado | Low | permanent Disk and Host loss in the affected region | L4 | + +#### Human Interference + +| Failure Scenario | Probability | Consequences | Failsafe Level Coverage | +|----|-----|----|----| +| Minor operating error | High | Temporary outage | L1 | +| Major operating error | Low | Permanent loss of data | L3 | + + +## Consequences + +Using the definition of levels established in this decision record throughout all SCS standards would allow readers to understand up to which level certain procedures or aspects of resources (e.g. volume types or a backend requiring redundancy) would protect their data and/or resource availability. + ## Glossary + + ### General Terms | Term | Explanation | @@ -26,7 +142,9 @@ These levels can then be used in standards to clearly set the scope that certain | Host | A physical machine in the infrastructure providing computational, storage and/or network connectivity capabilities. | | Cyber threat | Attacks on the infrastructure through the means of electronic access. | -### OpenStack Resources +### Affected Resources + +#### IaaS Layer (OpenStack Resources) | Resource | Explanation | | ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------- | @@ -41,87 +159,37 @@ These levels can then be used in standards to clearly set the scope that certain | Key Encryption Key | IaaS resource, used to encrypt other keys to be able to store them encrypted in a database. | | Floating IP | IaaS resource, an IP that is usually routed and accessible from external networks. 
| -### Kubernetes Resources +#### KaaS Layer (Kubernetes Resources) -| Resource | Explanation | +| Resource(s) | Explanation | | ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------- | -| Node | A physical or virtual machine that runs workloads (Pods) managed by the Kubernetes control plane. | -| Kubelet | An agent that runs on each node in the cluster. It makes sure that containers are running in a Pod. | -| API Server | The Kubernetes control plane component which exposes the Kubernetes Application Programming Interface (API). | | Pod | Kubernetes object that represents a workload to be executed, consisting of one or more containers. | | Container | A lightweight and portable executable image that contains software and all of its dependencies. | -| Persistent Volume Claim (PVC) | Persistent storage that can be bound and mounted to a pod. | - -Source: [Kubernetes Glossary](https://kubernetes.io/docs/reference/glossary/) - -## Context - -Some standards provided by the SCS project will talk about or require procedures to back up resources or have redundancy for resources. -This decision record should discuss, which failure threats are CSP-facing and will classify them into several levels. -In consequence these levels should be used in standards concerning redundancy or failure-safety. - -## Decision - -### Failure Scenarios - -First there needs to be an overview about possible failure scenarios in infrastructures as well as their probability of occurrence and the damage they may cause: - -#### Hardware Related - -| Failure Scenario | Probability | Consequences | -|----|-----|----| -| Disk Failure/Loss | High | Permanent data loss in this disk. Impact depends on type of lost data (data base, user data) | -| Host Failure/Loss (without disks) | Medium to High | Permanent loss of functionality and connectivity of host (impact depends on type of host) | -| Host Outage | Medium to High | Data loss in RAM and temporary loss of functionality and connectivity of host (impact depends on type of host) | -| Rack Outage | Medium | Outage of all nodes in rack | -| Network router/switch outage | High/Medium/Low | ... | -| Loss of network uplink | High/Medium/Low | | -| Power Outage (Data Center supply) | Medium | temporary outage of all nodes in all racks | - -#### Environmental - -Note that probability for these scenarios is dependent on the location. +| Deployment, StatefulSet | Kubernetes objects that manage a set of Pods. | +| Job | Application workload that runs once. | +| CronJob | Application workload that runs once, but repeatedly at specific intervals. | +| ConfigMap, Secret | Objects holding static application configuration data. | +| Service | Makes a Pod's network service accessible inside a cluster. | +| Ingress | Makes a Service externally accessible. | +| PersistentVolumeClaim (PVC) | Persistent storage that can be bound and mounted to a pod. | -| Failure Scenario | Probability | Consequences | -|----|-----|----| -| Fire | Medium | permanent Disk and Host loss in the affected zone | -| Flood | Low | permanent Disk and Host loss in the affected zone | -| Earthquake | Very Low | permanent Disk and Host loss in the affected zone | -| Storm/Tornado | Low | permanent Disk and Host loss in the affected fire zone | +Also see [Kubernetes Glossary](https://kubernetes.io/docs/reference/glossary/). 
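
The tables above describe which resources are affected by which failsafe level, but not how the corresponding user-side measures look in practice. As a purely illustrative sketch (not a normative part of this decision record), the following Python snippet uses the openstacksdk to create an off-host backup of a volume and to start an additional server in another availability zone; the cloud name, resource names, UUID placeholders and the zone name are hypothetical and have to be adapted to the actual deployment.

```python
import openstack

# Connect with credentials from clouds.yaml; "my-cloud" is a placeholder entry.
conn = openstack.connect(cloud="my-cloud")

# Back up a volume, so a disk or host loss (roughly failsafe level 1-2) does not
# destroy the only copy of the data.
volume = conn.block_storage.find_volume("important-data")   # hypothetical volume name
conn.block_storage.create_backup(volume_id=volume.id, name="important-data-backup")

# Start an additional server in a second availability zone (roughly level 3),
# assuming the cloud actually exposes an AZ named "az2".
server = conn.compute.create_server(
    name="app-replica-az2",                 # hypothetical name
    image_id="<image-uuid>",                # placeholder
    flavor_id="<flavor-uuid>",              # placeholder
    networks=[{"uuid": "<network-uuid>"}],  # placeholder
    availability_zone="az2",
)
conn.compute.wait_for_server(server)
```

On the KaaS layer, the resources listed above can often be protected against level 1 and 2 failures by running several replicas and spreading them over nodes or zones. The sketch below uses the official Kubernetes Python client to create a Deployment with three replicas and a topology spread constraint over the `topology.kubernetes.io/zone` label; the namespace, labels and container image are again only examples.

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a cluster

labels = {"app": "demo"}  # hypothetical application label

deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="demo", labels=labels),
    spec=client.V1DeploymentSpec(
        replicas=3,  # several Pod replicas, so a single node loss is tolerated
        selector=client.V1LabelSelector(match_labels=labels),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels=labels),
            spec=client.V1PodSpec(
                containers=[client.V1Container(name="demo", image="nginx:stable")],
                # Spread the replicas over zones, so a rack or zone outage does not
                # stop all Pods at once.
                topology_spread_constraints=[
                    client.V1TopologySpreadConstraint(
                        max_skew=1,
                        topology_key="topology.kubernetes.io/zone",
                        when_unsatisfiable="DoNotSchedule",
                        label_selector=client.V1LabelSelector(match_labels=labels),
                    )
                ],
            ),
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```

Neither snippet replaces a proper risk analysis; they only illustrate the kind of counter-measures the failsafe levels refer to.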
-#### Software Related - -| Failure Scenario | Probability | Consequences | -|----|-----|----| -| Cyber threat | High | permanent loss or compromise of data on affected Disk and Host | -| Cluster operator error | High/Medium/Low | ... | -| Software Bug | High | permanent loss or compromise of data that trigger the bug up to data on the whole physical machine | - -#### Kubernetes Specific - -A similar overview can be provided for Kubernetes infrastructures. These also include the things mentioned for infrastructure failure scenario, since a Kubernetes cluster -would most likely be deployed on top of this infrastructure or face similar problems on a bare-metal installation. -Part of this list comes directly from the official [Kubernetes docs](https://kubernetes.io/docs/tasks/debug/debug-cluster/). - -| Failure Scenario | Probability | Consequences | -|----------------------------------------------|-------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| API server VM shutdown or apiserver crashing | Medium | Unable to stop, update, or start new pods, services, replication controller | -| API server backing storage lost | Medium | kube-apiserver component fails to start successfully and become healthy | -| Supporting services VM shutdown or crashing | Medium | Colocated with the apiserver, and their unavailability has similar consequences as apiserver | -| Individual node shuts down | Medium | Pods on that Node stop running | -| Network partition / Network problems | Medium | Partition A thinks the nodes in partition B are down; partition B thinks the apiserver is down | -| Kubelet software fault | Medium | Crashing kubelet cannot start new pods on the node / kubelet might delete the pods or not / node marked unhealthy / replication controllers start new pods elsewhere | -| Cluster operator error | Medium | Loss of pods, services, etc. / lost of apiserver backing store / users unable to read API | -| Failure of multiple nodes or underlying DB | Low | Possible loss of all data depending on the amount of nodes lost compared to the cluster size, otherwise costly rebuild | +## Old sections ### Impact of the Failure Scenarios These failure scenarios can result in temporary (T) or permanent (P) loss of the resource or data within. Additionally, there are a lot of resources in IaaS alone that are more or less affected by these failure scenarios. -The following tables shows the impact **when no redundancy or failure safety measure is in place**: +The following tables shows the impact **when no redundancy or failure safety measure is in place**, i.e., when +**not even failsafe level 1 is fulfilled**. + +TODO: why should we do that? #### Impact on OpenStack Resources (IaaS layer) +TODO: this table is getting difficult to maintain + | Resource | Disk Loss | Node Loss | Rack Loss | Power Loss | Natural Catastrophy | Cyber Threat | Software Bug | |----|----|----|----|----|----|----|----| | Image | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | P | @@ -154,7 +222,6 @@ In case the KaaS layer runs on top of IaaS layer, the impacts described in the a |PVC|P| | | | | |P| |API Server|T| | | | | |T/P| - ### Classification by Severity A possible way to classify the failure cases into levels considering the matrix of impact would be, to classify the failure cases from small to big ones. 
@@ -176,10 +243,21 @@ Customers should always check, what they can do to protect their data and not re | 3. Level | Medium | small catastrophes or major Failures (e.g. fire, regional Power Outage, orchestrated cyber attacks,...) | lots of resources / user data + potentially not recoverable | CPSs SHOULD operate hardware in dedicated Availability Zones. Users SHOULD backup their data, themself. | | 4. Level | Low | whole deployment loss (e.g. natural disaster,...) | entire infrastructure, not recoverable | CSPs may not be able to save user data from such catastrophes. Users are responsible for saving their data from natural disasters. | -Based on our research, no similar standardized classification scheme seems to exist currently. -Something close but also very detailed can be found in [this (german)](https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/Grundschutz/BSI_Standards/standard_200_3.pdf?__blob=publicationFile&v=2) from the German Federal Office for Information Security. -As we want to focus on IaaS and K8s resources and also have an easily understandable structure that can be applied in standards covering replication, redundancy and backups, this document is too detailed. +### Kubernetes Specific -## Consequences +TODO: merge this with new sections -Using the definition of levels established in this decision record throughout all SCS standards would allow readers to understand up to which level certain procedures or aspects of resources (e.g. volume types or a backend requiring redundancy) would protect their data and/or resource availability. +A similar overview can be provided for Kubernetes infrastructures. These also include the things mentioned for infrastructure failure scenario, since a Kubernetes cluster +would most likely be deployed on top of this infrastructure or face similar problems on a bare-metal installation. +Part of this list comes directly from the official [Kubernetes docs](https://kubernetes.io/docs/tasks/debug/debug-cluster/). + +| Failure Scenario | Probability | Consequences | +|----------------------------------------------|-------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| API server VM shutdown or apiserver crashing | Medium | Unable to stop, update, or start new pods, services, replication controller | +| API server backing storage lost | Medium | kube-apiserver component fails to start successfully and become healthy | +| Supporting services VM shutdown or crashing | Medium | Colocated with the apiserver, and their unavailability has similar consequences as apiserver | +| Individual node shuts down | Medium | Pods on that Node stop running | +| Network partition / Network problems | Medium | Partition A thinks the nodes in partition B are down; partition B thinks the apiserver is down | +| Kubelet software fault | Medium | Crashing kubelet cannot start new pods on the node / kubelet might delete the pods or not / node marked unhealthy / replication controllers start new pods elsewhere | +| Cluster operator error | Medium | Loss of pods, services, etc. 
/ lost of apiserver backing store / users unable to read API | +| Failure of multiple nodes or underlying DB | Low | Possible loss of all data depending on the amount of nodes lost compared to the cluster size, otherwise costly rebuild | From b904df09711c0ff6693923356f620614adcb202e Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Fri, 16 Aug 2024 11:40:14 +0200 Subject: [PATCH 20/34] Adding glossary at the right point Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- ...scs-XXXX-vN-taxonomy-of-failsafe-levels.md | 33 ++++++++++--------- 1 file changed, 17 insertions(+), 16 deletions(-) diff --git a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md index 98e7ee575..165381fc1 100644 --- a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md +++ b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md @@ -23,6 +23,17 @@ TODO: What time frame do we look at? (so called Recovery Time Objecte aka RTO) TODO: how does this relate to Business Continuity Planning (BCP) --> +## Glossary + +| Term | Explanation | +| ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------- | +| Availability Zone | (also: AZ) internal representation of physical grouping of service hosts, which also lead to internal grouping of resources. | +| BSI | German Federal Office for Information Security (Bundesamt für Sicherheit in der Informationstechnik). | +| CSP | Cloud Service Provider, provider managing the OpenStack infrastructure. | +| Compute | A generic name for the IaaS service, that manages virtual machines (e.g. Nova in OpenStack). | +| Network | A generic name for the IaaS service, that manages network resources (e.g. Neutron in OpenStack). | +| Storage | A generic name for the IaaS service, that manages the storage backends and virtual devices (e.g. Cinder in OpenStack). | + ## Context Some standards provided by the SCS project will talk about or require procedures to back up resources or have redundancy for resources. @@ -45,16 +56,12 @@ be tolerated by a provided service. In general, the lowest, **level 1**, describes isolated/local failures which can occur very frequently, whereas the highest, **level 4**, describes relatively unlikely failures that impact a whole or even multiple datacenter(s): -| Level | Probability | Impact | Examples | -| - | - | - | - | -| 1 | Very High | Local | Disk failure, RAM failure, software bug | -| 2 | High | Moderate | Rack outage, power outage, small fire | -| 3 | Medium | High | Regional power outage, huge fire, orchestrated cyber attack | -| 4 | Low | Very high | Natural disaster | - - +| Level | Probability | Impact | Examples | +| - | - | - | - | +| 1 | Very High | small Hardware Issue | Disk failure, RAM failure, small software bug | +| 2 | High | Rack-wide | Rack outage, power outage, small fire | +| 3 | Medium | site-wide (temporary) | Regional power outage, huge fire, orchestrated cyber attack | +| 4 | Low | site destruction | Natural disaster | For example, a provided service with failsafe level 2 tolerates a rack outage (because there is some kind of redundancy in place.) @@ -128,12 +135,6 @@ Note that probability for these scenarios is dependent on the location. 
Using the definition of levels established in this decision record throughout all SCS standards would allow readers to understand up to which level certain procedures or aspects of resources (e.g. volume types or a backend requiring redundancy) would protect their data and/or resource availability. -## Glossary - - - ### General Terms | Term | Explanation | From 57b1d30bceecf1657827e8a2e9cc3cc12a06e1b6 Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Fri, 16 Aug 2024 15:05:10 +0200 Subject: [PATCH 21/34] Extend the context and glossary and make a better consequences table to also include users Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- ...scs-XXXX-vN-taxonomy-of-failsafe-levels.md | 72 +++++++++++++++---- 1 file changed, 57 insertions(+), 15 deletions(-) diff --git a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md index 165381fc1..59c14b37d 100644 --- a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md +++ b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md @@ -16,13 +16,6 @@ Readers of such standards should be able to know at one glance, whether the achi This is why this decision record aims to define different levels of failure safety. These levels can then be used in standards to clearly set the scope that certain procedures in e.g. OpenStack offer. - - ## Glossary | Term | Explanation | @@ -33,17 +26,56 @@ TODO: how does this relate to Business Continuity Planning (BCP) | Compute | A generic name for the IaaS service, that manages virtual machines (e.g. Nova in OpenStack). | | Network | A generic name for the IaaS service, that manages network resources (e.g. Neutron in OpenStack). | | Storage | A generic name for the IaaS service, that manages the storage backends and virtual devices (e.g. Cinder in OpenStack). | +| RTO | Recovery Time Objective. | ## Context Some standards provided by the SCS project will talk about or require procedures to back up resources or have redundancy for resources. -This decision record should discuss, which failure threats are CSP-facing and will classify them into several levels. +This decision record should discuss, which failure threats exist within an IaaS and KaaS deployment and will classify them into several levels according to their impact and possible handling mechanisms. In consequence these levels should be used in standards concerning redundancy or failure safety. Based on our research, no similar standardized classification scheme seems to exist currently. Something close but also very detailed is the [BSI-Standard 200-3 (german)][bsi-200-3] published by the German Federal Office for Information Security. As we want to focus on IaaS and K8s resources and also have an easily understandable structure that can be applied in standards covering replication, redundancy and backups, this document is too detailed. + + +### Goal of this Decision Record + +The SCS wants to classify levels of failure cases according to their impact and the respective measures CSPs can implement to prepare for each level. +Standards that deal with redundancy or backups or recovery SHOULD refer to the levels of this standard. +Thus every reader knows, up to which level of failsafeness the implementation of the standard works. +Reader then should be able to abstract what kind of other measures they have to apply, to reach the failsafe lavel they want to reach. 
+ +### Differentiation between failsafe levels and high availability, disaster recovery, redundancy and backups + +The levels auf failsafeness that are defined in this decision record are classifying the possibilities and impacts of failure cases (such as data loss) and possible measures. +High Availability, disaster recovery, redundancy and backups are all measures that can and should be applied to IaaS and KaaS deployments by both CSPs and Users to reduce the possibility and impact of data loss. +So with this document every reader can see to what level of failsafeness their measures protect user data. + +To differentiate also between the named measures the following table can be used: + +| Term | Explanation | +| ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| High Availability | Refers to the availability of resources over an extended period of time unaffected by smaller hardware issues. E.g. achievable through having several instances of resources. | +| Disaster Recovery | Measures taken after an incident to recover data, IaaS resource and maybe even physical resources. | +| Redundancy | Having more than one (or two) instances of each resource, to be able to switch to the second resource (could also be a data mirror) in case of a failure. | +| Backup | A specific copy of user data, that presents all data points at a givne time. Usually managed by users themself, read only and never stored in the same place as the original data. | + +### Failsafe Levels and RTO + +As this documents classifies failure case with very broad impacts and it is written in regards of mostly IaaS and KaaS, there cannot be one simple RTO set. +It should be taken into consideration that the RTO for IaaS and KaaS means to make user data available again through measures within the infrastructure. +But this will not be effective, when there is no backup of the user data or a redundancy of it already in place. +The different failsafe levels, measures and impacts will lead to very different RTOs. +For example a storage disk that has a failure will result in an RTO of 0 seconds, when the storage backend uses internal replication and still has two replicas of the user data. +While in the worst case of a natural disaster, most likely a severe fire, the whole deployment will be lost and if there were no off-site backups done by users there will be no RTO, because the data cannot be recovered anymore. + [bsi-200-3]: https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/Grundschutz/BSI_Standards/standard_200_3.pdf?__blob=publicationFile&v=2 ## Decision @@ -66,13 +98,14 @@ the highest, **level 4**, describes relatively unlikely failures that impact a w For example, a provided service with failsafe level 2 tolerates a rack outage (because there is some kind of redundancy in place.) -From a cloud service provider (CSP) perspective, supporting these failure levels has the following *general* -consequences: +There are some *general* consequences, that can be addressed by CSPs and users in the following ways: -* **Level 1**: CSPs MUST operate replicas for important components (e.g., RAID, replicated volume backend, uninterruptible power supply). -* **Level 2**: CSPs SHOULD operate hardware in dedicated availability zones (AZs). -* **Level 3**: CSPs SHOULD operate hardware in dedicated regions. 
-* **Level 4**: Depending on the regions, CSPs may not be able to save user data from such catastrophes. +| Level | consequences for CSPs | consequences for Users | +|---|-----|-----| +| 1. Level | CSPs MUST operate replicas for important components (e.g. replicated volume back-end, uninterruptible power supply, ...). | Users SHOULD backup their data themself and place it on an other host. | +| 2. Level | CSPs MUST operate replicas for important components (e.g. replicated volume back-end, uninterruptible power supply, ...) | Users MUST backup their data themselves and place it on an other host. | +| 3. Level | CPSs SHOULD operate hardware in dedicated Availability Zones. | Users SHOULD backup their data, in different AZs or even other deployments. | +| 4. Level | CSPs may not be able to save user data from such catastrophes. | Users are responsible for saving their data from natural disasters. | More specific guidance on what these levels mean on the IaaS and KaaS layers will be provided in the sections further down. @@ -104,7 +137,6 @@ TODO: define the meaning of our probabilities | Failure Scenario | Probability | Consequences | Failsafe Level Coverage | |----|-----|----|----| -| Cyber threat | High | permanent loss or compromise of data on affected Disk and Host | L1 | | Software bug (major) | Low | permanent loss or compromise of data that trigger the bug up to data on the whole physical machine | L1 | | Software bug (minor) | High | temporary or partial loss or compromise of data | L1 | @@ -123,13 +155,23 @@ Note that probability for these scenarios is dependent on the location. | Earthquake | Very Low | permanent Disk and Host loss in the affected region | L4 | | Storm/Tornado | Low | permanent Disk and Host loss in the affected region | L4 | +As we consider mainly deployments in central Europe, the probability of earthquakes is low and in the rare case of such an event the severity is also low compared to other regions in the world (e.g. the pacific ring of fire). +The event of a flood will most likely come from overflowing rivers instead of storm floods from a sea. +There can be measures taken, to reduce the probability and severity of a flooding event in central Europe due to simply choosing a different location for a deployment. + #### Human Interference | Failure Scenario | Probability | Consequences | Failsafe Level Coverage | |----|-----|----|----| | Minor operating error | High | Temporary outage | L1 | | Major operating error | Low | Permanent loss of data | L3 | +| Cyber threat | High | permanent loss or compromise of data on affected Disk and Host | L1 | +Mistakes in maintaining a data center will always happen. +To reduce the probability of such a mistake, measures are needed to reduce human error, which is more an issue of sociology and psychology instead of computer science. +On the other side an attack on an infrastructure cannot be avoided by this. +Instead every deployment needs to be prepared for an attack all the time, e.g. through security updates. +The severity of Cyber attacks can also vary broadly: from denial-of-service attacks, which should only be a temporary issue, up until coordinated attacks to steal or destroy data, which could also affect a whole deployment. 
## Consequences From 36c0d7fcb4ada37243500c286ce23282d64742ff Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Mon, 19 Aug 2024 14:59:38 +0200 Subject: [PATCH 22/34] Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- ...scs-XXXX-vN-taxonomy-of-failsafe-levels.md | 33 ++++++++++++------- 1 file changed, 22 insertions(+), 11 deletions(-) diff --git a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md index 59c14b37d..52aa44a0d 100644 --- a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md +++ b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md @@ -133,17 +133,6 @@ TODO: define the meaning of our probabilities | Loss of network uplink | Medium | Temporary loss of service, loss of connectivity | L3 | | Power Outage (Data Center supply) | Medium | Temporary outage of all nodes in all racks | L3 | -#### Software Related - -| Failure Scenario | Probability | Consequences | Failsafe Level Coverage | -|----|-----|----|----| -| Software bug (major) | Low | permanent loss or compromise of data that trigger the bug up to data on the whole physical machine | L1 | -| Software bug (minor) | High | temporary or partial loss or compromise of data | L1 | - - - #### Environmental Note that probability for these scenarios is dependent on the location. @@ -159,6 +148,28 @@ As we consider mainly deployments in central Europe, the probability of earthqua The event of a flood will most likely come from overflowing rivers instead of storm floods from a sea. There can be measures taken, to reduce the probability and severity of a flooding event in central Europe due to simply choosing a different location for a deployment. +#### Software Related + +| Failure Scenario | Probability | Consequences | Failsafe Level Coverage | +|----|-----|----|----| +| Software bug (major) | Low | permanent loss or compromise of data that trigger the bug up to data on the whole physical machine | L1 | +| Software bug (minor) | High | temporary or partial loss or compromise of data | L1 | + +Many software components have lots of lines of code and cannot be proven correct in their whole functionality. +They are tested instead with at best enough test cases to check every interaction. +Still bugs can and will occur in software. +Most of them are rather small issues, that might even seem like a feature to some. +An exmple for this would be: [whether a floating IP in OpenStack could be assigned to a VM even if it is already bound to another VM](https://bugs.launchpad.net/neutron/+bug/2060808). +Bugs like this do not affect a whole deployment, when they are triggered, but just specific data or resources. +Nevertheless those bugs can be a daily struggle. +This is the reason, the probability of such minor bugs may be pretty high, but the consequences would either be just temporary or would only result in small losses or compromisation. + +On the other hand major bugs, which might be used to compromise data, that is not in direct connection to the triggered bug, occur only a few times a year. +This can be seen e.g. in the [OpenStack Security Advisories](https://security.openstack.org/ossalist.html), where there were only 3 major bugs found in 2023. +While these bugs might appear only rarely their consequences are immense. +They might be the reason for a whole deployment to be compromised or shut down. 
+CSPs should be in contact with people triaging and patching such bugs, to be informed early and to be able to update their deployments, before the bug is openly announced. + #### Human Interference | Failure Scenario | Probability | Consequences | Failsafe Level Coverage | From 2d1663b2136eae63d758c8de59e81b4393df0def Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Thu, 22 Aug 2024 14:11:20 +0200 Subject: [PATCH 23/34] Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- ...scs-XXXX-vN-taxonomy-of-failsafe-levels.md | 148 ++++++++---------- 1 file changed, 63 insertions(+), 85 deletions(-) diff --git a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md index 52aa44a0d..25e2aa092 100644 --- a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md +++ b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md @@ -18,15 +18,18 @@ These levels can then be used in standards to clearly set the scope that certain ## Glossary -| Term | Explanation | -| ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------- | -| Availability Zone | (also: AZ) internal representation of physical grouping of service hosts, which also lead to internal grouping of resources. | -| BSI | German Federal Office for Information Security (Bundesamt für Sicherheit in der Informationstechnik). | -| CSP | Cloud Service Provider, provider managing the OpenStack infrastructure. | -| Compute | A generic name for the IaaS service, that manages virtual machines (e.g. Nova in OpenStack). | -| Network | A generic name for the IaaS service, that manages network resources (e.g. Neutron in OpenStack). | -| Storage | A generic name for the IaaS service, that manages the storage backends and virtual devices (e.g. Cinder in OpenStack). | -| RTO | Recovery Time Objective. | +| Term | Explanation | +| ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------- | +| Availability Zone | (also: AZ) internal representation of physical grouping of service hosts, which also lead to internal grouping of resources. | +| BSI | German Federal Office for Information Security (Bundesamt für Sicherheit in der Informationstechnik). | +| CSP | Cloud Service Provider, provider managing the OpenStack infrastructure. | +| Compute | A generic name for the IaaS service, that manages virtual machines (e.g. Nova in OpenStack). | +| Network | A generic name for the IaaS service, that manages network resources (e.g. Neutron in OpenStack). | +| Storage | A generic name for the IaaS service, that manages the storage backends and virtual devices (e.g. Cinder in OpenStack). | +| RTO | Recovery Time Objective. | +| Disk | A physical disk drive (e.g. HDD, SSD) in the infrastructure. | +| Host | A physical machine in the infrastructure providing computational, storage and/or network connectivity capabilities. | +| Cyber attack/threat | Attacks on the infrastructure through the means of electronic access. | ## Context @@ -85,6 +88,13 @@ While in the worst case of a natural disaster, most likely a severe fire, the wh This Decision Record defines **four** failsafe levels, each of which describe what kind of failures have to be tolerated by a provided service. 
+:::caution + +This table only contains examples of failure cases. +This should not be used as a replacement for a risk analysis. + +::: + In general, the lowest, **level 1**, describes isolated/local failures which can occur very frequently, whereas the highest, **level 4**, describes relatively unlikely failures that impact a whole or even multiple datacenter(s): @@ -102,10 +112,17 @@ There are some *general* consequences, that can be addressed by CSPs and users i | Level | consequences for CSPs | consequences for Users | |---|-----|-----| -| 1. Level | CSPs MUST operate replicas for important components (e.g. replicated volume back-end, uninterruptible power supply, ...). | Users SHOULD backup their data themself and place it on an other host. | -| 2. Level | CSPs MUST operate replicas for important components (e.g. replicated volume back-end, uninterruptible power supply, ...) | Users MUST backup their data themselves and place it on an other host. | -| 3. Level | CPSs SHOULD operate hardware in dedicated Availability Zones. | Users SHOULD backup their data, in different AZs or even other deployments. | -| 4. Level | CSPs may not be able to save user data from such catastrophes. | Users are responsible for saving their data from natural disasters. | +| 1. Level | CSPs MUST operate replicas for important components (e.g. replicated volume back-end, replicated database, ...). | Users SHOULD backup their data themself and place it on an other host. | +| 2. Level | CSPs MUST have redundancy for important components (e.g. HA for API services, redundant power supply, ...). | Users MUST backup their data themselves and place it on an other host. | +| 3. Level | CSPs SHOULD operate hardware in dedicated Availability Zones. | Users SHOULD backup their data, in different AZs or even other deployments. | +| 4. Level | CSPs may not be able to save user data from such catastrophes. | Users MUST have a backup of their data in a different geographic location. | + +:::caution + +The columns **consequences for CSPs / Users** only show examples of actions that may provide this class of failure safety for a certain resource. +Customers should always check, what they can do to protect their data and not rely solely on the CSP. + +::: More specific guidance on what these levels mean on the IaaS and KaaS layers will be provided in the sections further down. @@ -152,7 +169,7 @@ There can be measures taken, to reduce the probability and severity of a floodin | Failure Scenario | Probability | Consequences | Failsafe Level Coverage | |----|-----|----|----| -| Software bug (major) | Low | permanent loss or compromise of data that trigger the bug up to data on the whole physical machine | L1 | +| Software bug (major) | Low | permanent loss or compromise of data that trigger the bug up to data on the whole physical machine | L3 | | Software bug (minor) | High | temporary or partial loss or compromise of data | L1 | Many software components have lots of lines of code and cannot be proven correct in their whole functionality. 
@@ -176,56 +193,56 @@ CSPs should be in contact with people triaging and patching such bugs, to be inf |----|-----|----|----| | Minor operating error | High | Temporary outage | L1 | | Major operating error | Low | Permanent loss of data | L3 | -| Cyber threat | High | permanent loss or compromise of data on affected Disk and Host | L1 | +| Cyber attack (minor) | High | permanent loss or compromise of data on affected Disk and Host | L1 | +| Cyber attack (major) | Medium | permanent loss or compromise of data on affected Disk and Host | L3 | Mistakes in maintaining a data center will always happen. To reduce the probability of such a mistake, measures are needed to reduce human error, which is more an issue of sociology and psychology instead of computer science. On the other side an attack on an infrastructure cannot be avoided by this. Instead every deployment needs to be prepared for an attack all the time, e.g. through security updates. The severity of Cyber attacks can also vary broadly: from denial-of-service attacks, which should only be a temporary issue, up until coordinated attacks to steal or destroy data, which could also affect a whole deployment. +The more easy an attack is, the more often it will be used by various persons and organizations up to be just daily business. +Major attacks are often orchestrated and require speicif knowledge e.g. of Day-0 Bugs or the attacked infrastructure. +Due to that nature their occurance is less likely, but the damage done can be far more severe. ## Consequences Using the definition of levels established in this decision record throughout all SCS standards would allow readers to understand up to which level certain procedures or aspects of resources (e.g. volume types or a backend requiring redundancy) would protect their data and/or resource availability. -### General Terms - -| Term | Explanation | -| ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------- | -| Disk | A physical disk drive (e.g. HDD, SSD) in the infrastructure. | -| Host | A physical machine in the infrastructure providing computational, storage and/or network connectivity capabilities. | -| Cyber threat | Attacks on the infrastructure through the means of electronic access. | - ### Affected Resources #### IaaS Layer (OpenStack Resources) -| Resource | Explanation | -| ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------- | -| Virtual Machine | Equals the `server` resource in Nova. | -| Ironic Machine | A physical host managed by Ironic or as a `server` resource in Nova. | -| Ephemeral Storage | Disk storage directly supplied to a virtual machine by Nova. Different from volumes. | -| (Glance) Image | IaaS resource usually storing raw disk data. Managed by the Glance service. | -| (Cinder) Volume | IaaS resource representing block storage disk that can be attached as a virtual disk to virtual machines. Managed by the Cinder service. | -| (Volume) Snapshot | Thinly-provisioned copy-on-write snapshots of volumes. Stored in the same Cinder storage backend as volumes. | -| Volume Type | Attribute of volumes determining storage details of a volume such as backend location or whether the volume will be encrypted. | -| (Barbican) Secret | IaaS resource storing cryptographic assets such as encryption keys. Managed by the Barbican service. 
| -| Key Encryption Key | IaaS resource, used to encrypt other keys to be able to store them encrypted in a database. | -| Floating IP | IaaS resource, an IP that is usually routed and accessible from external networks. | +| Resource | Explanation | Affected by Level | +| ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------- | ----------------- | +| Ephemeral VM | Equals the `server` resource in Nova, booting from ephemeral storage. | L1, L2, L3, L4 | +| Volume-based VM | Equals the `server` resource in Nova, booting from a volume. | L2, L3, L4 | +| Ephemeral Storage | Disk storage directly supplied to a virtual machine by Nova. Different from volumes. | L1, L2, L3, L4 | +| Ironic Machine | A physical host managed by Ironic or as a `server` resource in Nova. | L1, L2, L3, L4 | +| (Glance) Image | IaaS resource usually storing raw disk data. Managed by the Glance service. | (L1), L2, L3, L4 | +| (Cinder) Volume | IaaS resource representing block storage disk that can be attached as a virtual disk to virtual machines. Managed by the Cinder service. | (L1, L2), L3, L4 | +| (Volume) Snapshot | Thinly-provisioned copy-on-write snapshots of volumes. Stored in the same Cinder storage backend as volumes. | (L1, L2), L3, L4 | +| Volume Type | Attribute of volumes determining storage details of a volume such as backend location or whether the volume will be encrypted. | L3, L4 | +| (Barbican) Secret | IaaS resource storing cryptographic assets such as encryption keys. Managed by the Barbican service. | L3, L4 | +| Key Encryption Key | IaaS resource, used to encrypt other keys to be able to store them encrypted in a database. | L3, L4 | +| Floating IP | IaaS resource, an IP that is usually routed and accessible from external networks. | L3, L4 | #### KaaS Layer (Kubernetes Resources) -| Resource(s) | Explanation | -| ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------- | -| Pod | Kubernetes object that represents a workload to be executed, consisting of one or more containers. | -| Container | A lightweight and portable executable image that contains software and all of its dependencies. | -| Deployment, StatefulSet | Kubernetes objects that manage a set of Pods. | -| Job | Application workload that runs once. | -| CronJob | Application workload that runs once, but repeatedly at specific intervals. | -| ConfigMap, Secret | Objects holding static application configuration data. | -| Service | Makes a Pod's network service accessible inside a cluster. | -| Ingress | Makes a Service externally accessible. | -| PersistentVolumeClaim (PVC) | Persistent storage that can be bound and mounted to a pod. | +A detailed list of consequnces for certain failures can be found in the [Kubernetes docs](https://kubernetes.io/docs/tasks/debug/debug-cluster/). +The following table gives an overview about certain resources on the KaaS Layer and in which failsafe classes they are affected: + +| Resource(s) | Explanation | Affected by Level | +| ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------- | ----------------- | +| Pod | Kubernetes object that represents a workload to be executed, consisting of one or more containers. | ??? 
| +| Container | A lightweight and portable executable image that contains software and all of its dependencies. | ??? | +| Deployment, StatefulSet | Kubernetes objects that manage a set of Pods. | ??? | +| Job | Application workload that runs once. | ??? | +| CronJob | Application workload that runs once, but repeatedly at specific intervals. | ??? | +| ConfigMap, Secret | Objects holding static application configuration data. | ??? | +| Service | Makes a Pod's network service accessible inside a cluster. | ??? | +| Ingress | Makes a Service externally accessible. | ??? | +| PersistentVolumeClaim (PVC) | Persistent storage that can be bound and mounted to a pod. | ??? | Also see [Kubernetes Glossary](https://kubernetes.io/docs/reference/glossary/). @@ -276,42 +293,3 @@ In case the KaaS layer runs on top of IaaS layer, the impacts described in the a |PVC|P| | | | | |P| |API Server|T| | | | | |T/P| -### Classification by Severity - -A possible way to classify the failure cases into levels considering the matrix of impact would be, to classify the failure cases from small to big ones. -The following table shows such a classification, the occurrence probability of a failure case of each class and what resources with user data might be affected. - -:::caution - -This table only contains examples of failure cases and examples of affected resources. -This should not be used as a replacement for a risk analysis. -The column **user hints** only show examples of standards that may provide this class of failure safety for a certain resource. -Customers should always check, what they can do to protect their data and not rely solely on the CSP. - -::: - -| Level/Class | Probability | Failure Causes | Loss in IaaS | User Hints | -|---|---|---|-----|-----| -| 1. Level | Very High | small Hardware or Software Failures (e.g. Disk/Node Failure, Software Bug,...) | individual volumes, VMs... | CSPs MUST operate replicas for important components (e.g. replicated volume back-end, uninterruptible power supply, ...). Users SHOULD backup their data themself and place it on an other host. | -| 2. Level | High | important Hardware or Software Failures (e.g. Rack outage, small Fire, Power outage, ...) | limited number of resources, sometimes recoverable | CSPs MUST operate replicas for important components (e.g. replicated volume back-end, uninterruptible power supply, ...) OR users MUST backup their data themselves and place it on an other host. | -| 3. Level | Medium | small catastrophes or major Failures (e.g. fire, regional Power Outage, orchestrated cyber attacks,...) | lots of resources / user data + potentially not recoverable | CPSs SHOULD operate hardware in dedicated Availability Zones. Users SHOULD backup their data, themself. | -| 4. Level | Low | whole deployment loss (e.g. natural disaster,...) | entire infrastructure, not recoverable | CSPs may not be able to save user data from such catastrophes. Users are responsible for saving their data from natural disasters. | - -### Kubernetes Specific - -TODO: merge this with new sections - -A similar overview can be provided for Kubernetes infrastructures. These also include the things mentioned for infrastructure failure scenario, since a Kubernetes cluster -would most likely be deployed on top of this infrastructure or face similar problems on a bare-metal installation. -Part of this list comes directly from the official [Kubernetes docs](https://kubernetes.io/docs/tasks/debug/debug-cluster/). 
- -| Failure Scenario | Probability | Consequences | -|----------------------------------------------|-------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| API server VM shutdown or apiserver crashing | Medium | Unable to stop, update, or start new pods, services, replication controller | -| API server backing storage lost | Medium | kube-apiserver component fails to start successfully and become healthy | -| Supporting services VM shutdown or crashing | Medium | Colocated with the apiserver, and their unavailability has similar consequences as apiserver | -| Individual node shuts down | Medium | Pods on that Node stop running | -| Network partition / Network problems | Medium | Partition A thinks the nodes in partition B are down; partition B thinks the apiserver is down | -| Kubelet software fault | Medium | Crashing kubelet cannot start new pods on the node / kubelet might delete the pods or not / node marked unhealthy / replication controllers start new pods elsewhere | -| Cluster operator error | Medium | Loss of pods, services, etc. / lost of apiserver backing store / users unable to read API | -| Failure of multiple nodes or underlying DB | Low | Possible loss of all data depending on the amount of nodes lost compared to the cluster size, otherwise costly rebuild | From 358b429274de6056524588a7b98b2c5e0a1e609b Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Fri, 23 Aug 2024 15:14:15 +0200 Subject: [PATCH 24/34] Create scs-XXXX-v1-example-impacts-of-failure-scenarios.md Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- ...v1-example-impacts-of-failure-scenarios.md | 68 +++++++++++++++++++ 1 file changed, 68 insertions(+) create mode 100644 Standards/scs-XXXX-v1-example-impacts-of-failure-scenarios.md diff --git a/Standards/scs-XXXX-v1-example-impacts-of-failure-scenarios.md b/Standards/scs-XXXX-v1-example-impacts-of-failure-scenarios.md new file mode 100644 index 000000000..baf592904 --- /dev/null +++ b/Standards/scs-XXXX-v1-example-impacts-of-failure-scenarios.md @@ -0,0 +1,68 @@ +# Examples of the impact from certain failure scenarios on Cloud Resources + +Failure cases in Cloud deployments can be hardware related, environmental, due to software errors or human interference. +The following table summerizes different failure scenarios, that can occur: + +| Failure Scenario | Probability | Consequences | Failsafe Level Coverage | +|----|-----|----|----| +| Disk Failure | High | Permanent data loss in this disk. 
Impact depends on type of lost data (database, user data) | L1 |
+| Host Failure (without disks) | Medium to High | Permanent loss of functionality and connectivity of host (impact depends on type of host) | L1 |
+| Host Failure | Medium to High | Data loss in RAM and temporary loss of functionality and connectivity of host (impact depends on type of host) | L1 |
+| Rack Outage | Medium | Outage of all nodes in rack | L2 |
+| Network router/switch outage | Medium | Temporary loss of service, loss of connectivity, network partitioning | L2 |
+| Loss of network uplink | Medium | Temporary loss of service, loss of connectivity | L3 |
+| Power Outage (Data Center supply) | Medium | Temporary outage of all nodes in all racks | L3 |
+| Fire | Medium | permanent Disk and Host loss in the affected zone | L3 |
+| Flood | Low | permanent Disk and Host loss in the affected region | L4 |
+| Earthquake | Very Low | permanent Disk and Host loss in the affected region | L4 |
+| Storm/Tornado | Low | permanent Disk and Host loss in the affected region | L4 |
+| Software bug (major) | Low | permanent loss or compromise of data that triggers the bug, up to data on the whole physical machine | L3 |
+| Software bug (minor) | High | temporary or partial loss or compromise of data | L1 |
+| Minor operating error | High | Temporary outage | L1 |
+| Major operating error | Low | Permanent loss of data | L3 |
+| Cyber attack (minor) | High | permanent loss or compromise of data on affected Disk and Host | L1 |
+| Cyber attack (major) | Medium | permanent loss or compromise of data on affected Disk and Host | L3 |
+
+Those failure scenarios can result in either temporary (T) or permanent (P) loss of IaaS / KaaS resources or data.
+Additionally, there are a lot of resources in IaaS alone that are more or less affected by these failure scenarios.
+The following tables show the impact **when no redundancy or failure safety measure is in place**, i.e., when
+**not even failsafe level 1 is fulfilled**.
+
+## Impact on IaaS Resources (IaaS Layer)
+
+| Resource | Disk Loss | Node Loss | Rack Loss | Power Loss | Natural Catastrophy | Cyber Threat | Software Bug |
+|----|----|----|----|----|----|----|----|
+| Image | P[^1] | T[^3] | T/P | T | P (T[^4]) | T/P | P |
+| Volume | P[^1] | T[^3] | T/P | T | P (T[^4]) | T/P | P |
+| User Data on RAM /CPU | | P | P | P | P | T/P | P |
+| volume-based VM | P[^1] | T[^3] | T/P | T | P (T[^4]) | T/P | P |
+| ephemeral-based VM | P[^1] | P | P | T | P (T[^4]) | T/P | P |
+| Ironic-based VM | P[^2] | P | P | T | P (T[^4]) | T/P | P |
+| Secret | P[^1] | T[^3] | T/P | T | P (T[^4]) | T/P | P |
+| network configuration (DB objects) | P[^1] | T[^3] | T/P | T | P (T[^4]) | T/P | P |
+| network connectivity (materialization) | | T[^3] | T/P | T | P (T[^4]) | T/P | T |
+| floating IP | P[^1] | T[^3] | T/P | T | P (T[^4]) | T/P | T |
+
+For some cases, this only results in temporary unavailability and cloud infrastructures usually have certain mechanisms in place to avoid data loss, like redundancy in storage backends and databases.
+So some of these outages are easier to mitigate than others.
+
+[^1]: If the resource is located on that specific disk.
+[^2]: Everything located on that specific disk. If more than one disk is used, some data could be recovered.
+[^3]: If the resource is located on that specific node.
+[^4]: In case disks, nodes or racks are not destroyed, some data could be saved, e.g. when a fire only destroys the power line.
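+
+To give an example of the kind of user-side measure that turns a permanent volume loss into a recoverable event, the following sketch uses the openstacksdk Python library to create a volume backup. It is only an illustration under assumptions: the `my-cloud` entry in `clouds.yaml` and the volume name `my-data-volume` are hypothetical, and whether a backup service is offered at all depends on the CSP.
+
+```python
+import openstack
+
+# Connect using a clouds.yaml entry (the name is an assumption for this example).
+conn = openstack.connect(cloud="my-cloud")
+
+# Look up the volume whose data should survive a disk or backend failure.
+volume = conn.block_storage.find_volume("my-data-volume", ignore_missing=False)
+
+# Create a backup; unlike a snapshot, a backup can be stored in a different
+# backend and therefore still exists when the original storage is lost.
+backup = conn.block_storage.create_backup(
+    volume_id=volume.id,
+    name="my-data-volume-backup",
+)
+print("backup status:", backup.status)
+```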
+ +## Impact on Kubernetes Resources (KaaS layer) + +:::note + +In case the KaaS layer runs on top of IaaS layer, the impacts described in the above table apply for the KaaS layer as well. + +::: + +| Resource | Disk Loss | Node Loss | Rack Loss | Power Loss | Natural Catastrophy | Cyber Threat | Software Bug | +|----|----|----|----|----|----|----|----| +|Node|P| | | | | |T/P| +|Kubelet|T| | | | | |T/P| +|Pod|T| | | | | |T/P| +|PVC|P| | | | | |P| +|API Server|T| | | | | |T/P| From dcd910b4aaf278673b7919c90c5d877d2e2e7387 Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Fri, 23 Aug 2024 15:50:02 +0200 Subject: [PATCH 25/34] Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- ...scs-XXXX-vN-taxonomy-of-failsafe-levels.md | 87 +++++-------------- 1 file changed, 23 insertions(+), 64 deletions(-) diff --git a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md index 25e2aa092..0f1312d7e 100644 --- a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md +++ b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md @@ -41,13 +41,6 @@ Based on our research, no similar standardized classification scheme seems to ex Something close but also very detailed is the [BSI-Standard 200-3 (german)][bsi-200-3] published by the German Federal Office for Information Security. As we want to focus on IaaS and K8s resources and also have an easily understandable structure that can be applied in standards covering replication, redundancy and backups, this document is too detailed. - - ### Goal of this Decision Record The SCS wants to classify levels of failure cases according to their impact and the respective measures CSPs can implement to prepare for each level. @@ -55,6 +48,15 @@ Standards that deal with redundancy or backups or recovery SHOULD refer to the l Thus every reader knows, up to which level of failsafeness the implementation of the standard works. Reader then should be able to abstract what kind of other measures they have to apply, to reach the failsafe lavel they want to reach. +:::caution + +This document will not be a replacement for a risk analysis. +Every CSP and every Customer (user of IaaS or KaaS resources) need to do a risk analysis of their own. +Also the differentiation of failure cases in classes, may not be an ideal basis for Business Continuity Planning. +It may be used to get general hints and directions though. + +::: + ### Differentiation between failsafe levels and high availability, disaster recovery, redundancy and backups The levels auf failsafeness that are defined in this decision record are classifying the possibilities and impacts of failure cases (such as data loss) and possible measures. @@ -133,10 +135,15 @@ But beforehand, we will describe the considered failure scenarios and the resour The following failure scenarios have been considered for the proposed failsafe levels. For each failure scenario, we estimate the probability of occurence and the (worst case) damage caused by the scenario. Furthermore, the corresponding minimum failsafe level covering that failure scenario is given. +The following table give a coarse view over the probabilities, that are used to describe the occurance of failure cases: - +| Probability | Meaning | +|-----------|----| +| Very Low | Occurs at most once a decade OR needs extremly unlikely circumstances. 
| +| Low | Occurs at most once a year OR needs very unlikely circumstances. | +| Medium | Occurs more than one time a year, up to one time a month. | +| High | Occurs more than once a month and up to a daily basis. | +| Very High | Occurs within minutes. | #### Hardware Related @@ -156,8 +163,8 @@ Note that probability for these scenarios is dependent on the location. | Failure Scenario | Probability | Consequences | Failsafe Level Coverage | |----|-----|----|----| -| Fire | Medium | permanent Disk and Host loss in the affected zone | L3 | -| Flood | Low | permanent Disk and Host loss in the affected region | L4 | +| Fire | Low | permanent Disk and Host loss in the affected zone | L3 | +| Flood | Very Low | permanent Disk and Host loss in the affected region | L4 | | Earthquake | Very Low | permanent Disk and Host loss in the affected region | L4 | | Storm/Tornado | Low | permanent Disk and Host loss in the affected region | L4 | @@ -169,8 +176,8 @@ There can be measures taken, to reduce the probability and severity of a floodin | Failure Scenario | Probability | Consequences | Failsafe Level Coverage | |----|-----|----|----| -| Software bug (major) | Low | permanent loss or compromise of data that trigger the bug up to data on the whole physical machine | L3 | -| Software bug (minor) | High | temporary or partial loss or compromise of data | L1 | +| Software bug (major) | Low to Medium | permanent loss or compromise of data that trigger the bug up to data on the whole physical machine | L3 | +| Software bug (minor) | Medium to High | temporary or partial loss or compromise of data | L1 | Many software components have lots of lines of code and cannot be proven correct in their whole functionality. They are tested instead with at best enough test cases to check every interaction. @@ -193,7 +200,7 @@ CSPs should be in contact with people triaging and patching such bugs, to be inf |----|-----|----|----| | Minor operating error | High | Temporary outage | L1 | | Major operating error | Low | Permanent loss of data | L3 | -| Cyber attack (minor) | High | permanent loss or compromise of data on affected Disk and Host | L1 | +| Cyber attack (minor) | Very High | permanent loss or compromise of data on affected Disk and Host | L1 | | Cyber attack (major) | Medium | permanent loss or compromise of data on affected Disk and Host | L3 | Mistakes in maintaining a data center will always happen. @@ -216,7 +223,7 @@ Using the definition of levels established in this decision record throughout al | Resource | Explanation | Affected by Level | | ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------- | ----------------- | | Ephemeral VM | Equals the `server` resource in Nova, booting from ephemeral storage. | L1, L2, L3, L4 | -| Volume-based VM | Equals the `server` resource in Nova, booting from a volume. | L2, L3, L4 | +| Volume-based VM | Equals the `server` resource in Nova, booting from a volume. | L2, L3, L4 | | Ephemeral Storage | Disk storage directly supplied to a virtual machine by Nova. Different from volumes. | L1, L2, L3, L4 | | Ironic Machine | A physical host managed by Ironic or as a `server` resource in Nova. | L1, L2, L3, L4 | | (Glance) Image | IaaS resource usually storing raw disk data. Managed by the Glance service. 
| (L1), L2, L3, L4 | @@ -245,51 +252,3 @@ The following table gives an overview about certain resources on the KaaS Layer | PersistentVolumeClaim (PVC) | Persistent storage that can be bound and mounted to a pod. | ??? | Also see [Kubernetes Glossary](https://kubernetes.io/docs/reference/glossary/). - -## Old sections - -### Impact of the Failure Scenarios - -These failure scenarios can result in temporary (T) or permanent (P) loss of the resource or data within. -Additionally, there are a lot of resources in IaaS alone that are more or less affected by these failure scenarios. -The following tables shows the impact **when no redundancy or failure safety measure is in place**, i.e., when -**not even failsafe level 1 is fulfilled**. - -TODO: why should we do that? - -#### Impact on OpenStack Resources (IaaS layer) - -TODO: this table is getting difficult to maintain - -| Resource | Disk Loss | Node Loss | Rack Loss | Power Loss | Natural Catastrophy | Cyber Threat | Software Bug | -|----|----|----|----|----|----|----|----| -| Image | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | P | -| Volume | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | P | -| User Data on RAM /CPU | | P | P | P | P | T/P | P | -| volume-based VM | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | P | -| ephemeral-based VM | P (if on disk) | P | P | T | P (T if lucky) | T/P | P | -| Ironic-based VM | P (all data on disk) | P | P | T | P (T if lucky) | T/P | P | -| Secret | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | P | -| network configuration (DB objects) | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | P | -| network connectivity (materialization) | | T (if on node) | T/P | T | P (T if lucky) | T/P | T | -| floating IP | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | T | - -For some cases, this only results in temporary unavailability and cloud infrastructures usually have certain mechanisms in place to avoid data loss, like redundancy in storage backends and databases. -So some of these outages are easier to mitigate than others. - -#### Impact on Kubernetes Resources (KaaS layer) - -:::note - -In case the KaaS layer runs on top of IaaS layer, the impacts described in the above table apply for the KaaS layer as well. - -::: - -| Resource | Disk Loss | Node Loss | Rack Loss | Power Loss | Natural Catastrophy | Cyber Threat | Software Bug | -|----|----|----|----|----|----|----|----| -|Node|P| | | | | |T/P| -|Kubelet|T| | | | | |T/P| -|Pod|T| | | | | |T/P| -|PVC|P| | | | | |P| -|API Server|T| | | | | |T/P| - From 1f3de87bcdcfbe2608496890f9e47203840cc3c2 Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Mon, 26 Aug 2024 10:14:24 +0200 Subject: [PATCH 26/34] Update and rename scs-XXXX-v1-example-impacts-of-failure-scenarios.md to scs-XXXX-w1-example-impacts-of-failure-scenarios.md Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- ... 
scs-XXXX-w1-example-impacts-of-failure-scenarios.md} | 9 +++++++++ 1 file changed, 9 insertions(+) rename Standards/{scs-XXXX-v1-example-impacts-of-failure-scenarios.md => scs-XXXX-w1-example-impacts-of-failure-scenarios.md} (95%) diff --git a/Standards/scs-XXXX-v1-example-impacts-of-failure-scenarios.md b/Standards/scs-XXXX-w1-example-impacts-of-failure-scenarios.md similarity index 95% rename from Standards/scs-XXXX-v1-example-impacts-of-failure-scenarios.md rename to Standards/scs-XXXX-w1-example-impacts-of-failure-scenarios.md index baf592904..72ea3096d 100644 --- a/Standards/scs-XXXX-v1-example-impacts-of-failure-scenarios.md +++ b/Standards/scs-XXXX-w1-example-impacts-of-failure-scenarios.md @@ -1,3 +1,12 @@ +--- +title: "SCS Taxonomy of Failsafe Levels: Examples of Failure Cases and their impact on IaaS and KaaS resources" +type: Supplement +track: IaaS +status: Proposal +supplements: + - scs-XXXX-vN-taxonomy-of-failsafe-levels.md +--- + # Examples of the impact from certain failure scenarios on Cloud Resources Failure cases in Cloud deployments can be hardware related, environmental, due to software errors or human interference. From 2a492f861e868f27deb686340c143f4a3d5238b2 Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Mon, 26 Aug 2024 10:38:02 +0200 Subject: [PATCH 27/34] Update scs-XXXX-w1-example-impacts-of-failure-scenarios.md Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- .../scs-XXXX-w1-example-impacts-of-failure-scenarios.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/Standards/scs-XXXX-w1-example-impacts-of-failure-scenarios.md b/Standards/scs-XXXX-w1-example-impacts-of-failure-scenarios.md index 72ea3096d..633f3ba6a 100644 --- a/Standards/scs-XXXX-w1-example-impacts-of-failure-scenarios.md +++ b/Standards/scs-XXXX-w1-example-impacts-of-failure-scenarios.md @@ -7,7 +7,7 @@ supplements: - scs-XXXX-vN-taxonomy-of-failsafe-levels.md --- -# Examples of the impact from certain failure scenarios on Cloud Resources +## Examples of the impact from certain failure scenarios on Cloud Resources Failure cases in Cloud deployments can be hardware related, environmental, due to software errors or human interference. The following table summerizes different failure scenarios, that can occur: @@ -37,7 +37,7 @@ Additionally, there are a lot of resources in IaaS alone that are more or less a The following tables shows the impact **when no redundancy or failure safety measure is in place**, i.e., when **not even failsafe level 1 is fulfilled**. -## Impact on IaaS Resources (IaaS Layer) +### Impact on IaaS Resources (IaaS Layer) | Resource | Disk Loss | Node Loss | Rack Loss | Power Loss | Natural Catastrophy | Cyber Threat | Software Bug | |----|----|----|----|----|----|----|----| @@ -60,7 +60,7 @@ So some of these outages are easier to mitigate than others. [^3]: If the resource is located on that specific node. [^4]: In case of disks, nodes or racks are not destroyed, some data could be safed. E.g. when a fire just destroyes the power line. 
-## Impact on Kubernetes Resources (KaaS layer) +### Impact on Kubernetes Resources (KaaS layer) :::note From 53c6521e6f240f81216dcb49360d7e8b1c7beabd Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Fri, 6 Sep 2024 14:53:15 +0200 Subject: [PATCH 28/34] Apply suggestions from code review Co-authored-by: Michal Gubricky Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md index 0f1312d7e..61e88250f 100644 --- a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md +++ b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md @@ -59,7 +59,7 @@ It may be used to get general hints and directions though. ### Differentiation between failsafe levels and high availability, disaster recovery, redundancy and backups -The levels auf failsafeness that are defined in this decision record are classifying the possibilities and impacts of failure cases (such as data loss) and possible measures. +The levels of failsafeness defined in this decision record classify the possibilities and impacts of failure cases (such as data loss) and the possible measures. High Availability, disaster recovery, redundancy and backups are all measures that can and should be applied to IaaS and KaaS deployments by both CSPs and Users to reduce the possibility and impact of data loss. So with this document every reader can see to what level of failsafeness their measures protect user data. @@ -208,7 +208,7 @@ To reduce the probability of such a mistake, measures are needed to reduce human On the other side an attack on an infrastructure cannot be avoided by this. Instead every deployment needs to be prepared for an attack all the time, e.g. through security updates. The severity of Cyber attacks can also vary broadly: from denial-of-service attacks, which should only be a temporary issue, up until coordinated attacks to steal or destroy data, which could also affect a whole deployment. -The more easy an attack is, the more often it will be used by various persons and organizations up to be just daily business. +The easier an attack is, the more frequently it will be used by various persons and organizations up to be just daily business. Major attacks are often orchestrated and require speicif knowledge e.g. of Day-0 Bugs or the attacked infrastructure. Due to that nature their occurance is less likely, but the damage done can be far more severe. From 0e392549377c83d0884738875fce7fbccbc8e7c7 Mon Sep 17 00:00:00 2001 From: Jan Schoone <6106846+jschoone@users.noreply.github.com> Date: Tue, 10 Sep 2024 11:55:34 +0200 Subject: [PATCH 29/34] fix(kaas): use PV instead of PVC as this is actually the Volume Signed-off-by: Jan Schoone <6106846+jschoone@users.noreply.github.com> --- Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md index 61e88250f..f11ea0d68 100644 --- a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md +++ b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md @@ -249,6 +249,6 @@ The following table gives an overview about certain resources on the KaaS Layer | ConfigMap, Secret | Objects holding static application configuration data. | ??? 
| | Service | Makes a Pod's network service accessible inside a cluster. | ??? | | Ingress | Makes a Service externally accessible. | ??? | -| PersistentVolumeClaim (PVC) | Persistent storage that can be bound and mounted to a pod. | ??? | +| PersistentVolume (PV) | Persistent storage that can be bound and mounted to a pod. | ??? | Also see [Kubernetes Glossary](https://kubernetes.io/docs/reference/glossary/). From 2a52226fa387e77005bf0b5719702f7dab798d5b Mon Sep 17 00:00:00 2001 From: Jan Schoone <6106846+jschoone@users.noreply.github.com> Date: Tue, 10 Sep 2024 12:00:12 +0200 Subject: [PATCH 30/34] feat(kaas): first proposal for levels on kaas layer Signed-off-by: Jan Schoone <6106846+jschoone@users.noreply.github.com> --- .../scs-XXXX-vN-taxonomy-of-failsafe-levels.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md index f11ea0d68..44736c037 100644 --- a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md +++ b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md @@ -241,14 +241,14 @@ The following table gives an overview about certain resources on the KaaS Layer | Resource(s) | Explanation | Affected by Level | | ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------- | ----------------- | -| Pod | Kubernetes object that represents a workload to be executed, consisting of one or more containers. | ??? | -| Container | A lightweight and portable executable image that contains software and all of its dependencies. | ??? | -| Deployment, StatefulSet | Kubernetes objects that manage a set of Pods. | ??? | -| Job | Application workload that runs once. | ??? | -| CronJob | Application workload that runs once, but repeatedly at specific intervals. | ??? | -| ConfigMap, Secret | Objects holding static application configuration data. | ??? | -| Service | Makes a Pod's network service accessible inside a cluster. | ??? | -| Ingress | Makes a Service externally accessible. | ??? | -| PersistentVolume (PV) | Persistent storage that can be bound and mounted to a pod. | ??? | +| Pod | Kubernetes object that represents a workload to be executed, consisting of one or more containers. | L3, L4 | +| Container | A lightweight and portable executable image that contains software and all of its dependencies. | L3, L4 | +| Deployment, StatefulSet | Kubernetes objects that manage a set of Pods. | L3, L4 | +| Job | Application workload that runs once. | L3, L4 | +| CronJob | Application workload that runs once, but repeatedly at specific intervals. | L3, L4 | +| ConfigMap, Secret | Objects holding static application configuration data. | L3, L4 | +| Service | Makes a Pod's network service accessible inside a cluster. | (L2), L3, L4 | +| Ingress | Makes a Service externally accessible. | L2, L3, L4 | +| PersistentVolume (PV) | Persistent storage that can be bound and mounted to a pod. | L1, L2, L3, L4 | Also see [Kubernetes Glossary](https://kubernetes.io/docs/reference/glossary/). 
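The level assignments for Pods, Deployments and StatefulSets above assume that the workload is actually replicated and spread across hosts, so that a single node failure (Level 1) or a rack outage (Level 2) only reduces capacity instead of taking the workload down. A minimal sketch of such a setup, written with the official `kubernetes` Python client, is shown below; the namespace, names, label and container image are hypothetical and only serve as an example.

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig (assumption: it points at the target cluster).
config.load_kube_config()
apps = client.AppsV1Api()

labels = {"app": "web"}  # hypothetical label used to select the Pods

deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="web"),
    spec=client.V1DeploymentSpec(
        replicas=3,  # more than one replica, so the loss of a single Pod or node is tolerated
        selector=client.V1LabelSelector(match_labels=labels),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels=labels),
            spec=client.V1PodSpec(
                containers=[client.V1Container(name="web", image="nginx:1.27")],
                # Spread the replicas over different nodes so they do not share one failure domain.
                topology_spread_constraints=[
                    client.V1TopologySpreadConstraint(
                        max_skew=1,
                        topology_key="kubernetes.io/hostname",
                        when_unsatisfiable="ScheduleAnyway",
                        label_selector=client.V1LabelSelector(match_labels=labels),
                    )
                ],
            ),
        ),
    ),
)

apps.create_namespaced_deployment(namespace="default", body=deployment)
```

Whether the spread should target hosts, racks or availability zones depends on the failsafe level the workload is supposed to cover; `topology.kubernetes.io/zone` would be the corresponding key for zone-level spreading.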
From 5ffe31a0c7beea7ec8d9523ca42f0acca412b10b Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Wed, 25 Sep 2024 14:06:49 +0200 Subject: [PATCH 31/34] Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md index 44736c037..2cbe2c83e 100644 --- a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md +++ b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md @@ -26,7 +26,7 @@ These levels can then be used in standards to clearly set the scope that certain | Compute | A generic name for the IaaS service, that manages virtual machines (e.g. Nova in OpenStack). | | Network | A generic name for the IaaS service, that manages network resources (e.g. Neutron in OpenStack). | | Storage | A generic name for the IaaS service, that manages the storage backends and virtual devices (e.g. Cinder in OpenStack). | -| RTO | Recovery Time Objective. | +| RTO | Recovery Time Objective, the maximum amount of time allowed to restore a ressource. | | Disk | A physical disk drive (e.g. HDD, SSD) in the infrastructure. | | Host | A physical machine in the infrastructure providing computational, storage and/or network connectivity capabilities. | | Cyber attack/threat | Attacks on the infrastructure through the means of electronic access. | @@ -70,7 +70,7 @@ To differentiate also between the named measures the following table can be used | High Availability | Refers to the availability of resources over an extended period of time unaffected by smaller hardware issues. E.g. achievable through having several instances of resources. | | Disaster Recovery | Measures taken after an incident to recover data, IaaS resource and maybe even physical resources. | | Redundancy | Having more than one (or two) instances of each resource, to be able to switch to the second resource (could also be a data mirror) in case of a failure. | -| Backup | A specific copy of user data, that presents all data points at a givne time. Usually managed by users themself, read only and never stored in the same place as the original data. | +| Backup | A specific copy of user data, that presents all data points at a given time. Usually managed by users themself, read only and never stored in the same place as the original data. | ### Failsafe Levels and RTO @@ -176,7 +176,7 @@ There can be measures taken, to reduce the probability and severity of a floodin | Failure Scenario | Probability | Consequences | Failsafe Level Coverage | |----|-----|----|----| -| Software bug (major) | Low to Medium | permanent loss or compromise of data that trigger the bug up to data on the whole physical machine | L3 | +| Software bug (major) | Low to Medium | permanent loss or compromise of data that trigger the bug up to data on the whole deployment | L3 | | Software bug (minor) | Medium to High | temporary or partial loss or compromise of data | L1 | Many software components have lots of lines of code and cannot be proven correct in their whole functionality. @@ -209,7 +209,7 @@ On the other side an attack on an infrastructure cannot be avoided by this. Instead every deployment needs to be prepared for an attack all the time, e.g. through security updates. 
The severity of Cyber attacks can also vary broadly: from denial-of-service attacks, which should only be a temporary issue, up until coordinated attacks to steal or destroy data, which could also affect a whole deployment. The easier an attack is, the more frequently it will be used by various persons and organizations up to be just daily business. -Major attacks are often orchestrated and require speicif knowledge e.g. of Day-0 Bugs or the attacked infrastructure. +Major attacks are often orchestrated and require specific knowledge e.g. of Day-0 Bugs or the attacked infrastructure. Due to that nature their occurance is less likely, but the damage done can be far more severe. ## Consequences From 79311276149a3f09cb8751e1c327bc26d19f4687 Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Wed, 25 Sep 2024 14:28:39 +0200 Subject: [PATCH 32/34] Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md index 2cbe2c83e..069fdfc52 100644 --- a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md +++ b/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md @@ -26,7 +26,7 @@ These levels can then be used in standards to clearly set the scope that certain | Compute | A generic name for the IaaS service, that manages virtual machines (e.g. Nova in OpenStack). | | Network | A generic name for the IaaS service, that manages network resources (e.g. Neutron in OpenStack). | | Storage | A generic name for the IaaS service, that manages the storage backends and virtual devices (e.g. Cinder in OpenStack). | -| RTO | Recovery Time Objective, the maximum amount of time allowed to restore a ressource. | +| RTO | Recovery Time Objective, the acceptable time needed to restore a ressource. | | Disk | A physical disk drive (e.g. HDD, SSD) in the infrastructure. | | Host | A physical machine in the infrastructure providing computational, storage and/or network connectivity capabilities. | | Cyber attack/threat | Attacks on the infrastructure through the means of electronic access. | @@ -75,11 +75,12 @@ To differentiate also between the named measures the following table can be used ### Failsafe Levels and RTO As this documents classifies failure case with very broad impacts and it is written in regards of mostly IaaS and KaaS, there cannot be one simple RTO set. -It should be taken into consideration that the RTO for IaaS and KaaS means to make user data available again through measures within the infrastructure. +The RTOs will differ for each resource and also between IaaS and KaaS level. +It should be taken into consideration that the measure to achieve the RTOs for IaaS and KaaS means to make user data available again through measures within the infrastructure. But this will not be effective, when there is no backup of the user data or a redundancy of it already in place. -The different failsafe levels, measures and impacts will lead to very different RTOs. -For example a storage disk that has a failure will result in an RTO of 0 seconds, when the storage backend uses internal replication and still has two replicas of the user data. 
-While in the worst case of a natural disaster, most likely a severe fire, the whole deployment will be lost and if there were no off-site backups done by users there will be no RTO, because the data cannot be recovered anymore. +So the different failsafe levels, measures and impacts will be needed to define realistic RTOs. +For example a storage disk that has a failure will not result in a volume gein unavailable and needing a defined RTO, when the storage backend uses internal replication and still has two replicas of the user data. +While in the worst case of a natural disaster, most likely a severe fire, the whole deployment will be lost and if there were no off-site backups done by users any defined RTO will never be met, because the data cannot be recovered anymore. [bsi-200-3]: https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/Grundschutz/BSI_Standards/standard_200_3.pdf?__blob=publicationFile&v=2 From ee531ad1f67eebff7dae4b486d18e843e0167fc0 Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Wed, 25 Sep 2024 16:06:30 +0200 Subject: [PATCH 33/34] Rename scs-XXXX-vN-taxonomy-of-failsafe-levels.md to scs-0118-v1-taxonomy-of-failsafe-levels.md Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- ...lsafe-levels.md => scs-0118-v1-taxonomy-of-failsafe-levels.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename Standards/{scs-XXXX-vN-taxonomy-of-failsafe-levels.md => scs-0118-v1-taxonomy-of-failsafe-levels.md} (100%) diff --git a/Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md b/Standards/scs-0118-v1-taxonomy-of-failsafe-levels.md similarity index 100% rename from Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md rename to Standards/scs-0118-v1-taxonomy-of-failsafe-levels.md From 37ec252bfa578f15c0e7a94705d41af4cbab1ef5 Mon Sep 17 00:00:00 2001 From: josephineSei <128813814+josephineSei@users.noreply.github.com> Date: Wed, 25 Sep 2024 16:07:03 +0200 Subject: [PATCH 34/34] Update and rename scs-XXXX-w1-example-impacts-of-failure-scenarios.md to scs-0118-w1-example-impacts-of-failure-scenarios.md Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com> --- ...s.md => scs-0118-w1-example-impacts-of-failure-scenarios.md} | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) rename Standards/{scs-XXXX-w1-example-impacts-of-failure-scenarios.md => scs-0118-w1-example-impacts-of-failure-scenarios.md} (98%) diff --git a/Standards/scs-XXXX-w1-example-impacts-of-failure-scenarios.md b/Standards/scs-0118-w1-example-impacts-of-failure-scenarios.md similarity index 98% rename from Standards/scs-XXXX-w1-example-impacts-of-failure-scenarios.md rename to Standards/scs-0118-w1-example-impacts-of-failure-scenarios.md index 633f3ba6a..5bd84e76d 100644 --- a/Standards/scs-XXXX-w1-example-impacts-of-failure-scenarios.md +++ b/Standards/scs-0118-w1-example-impacts-of-failure-scenarios.md @@ -4,7 +4,7 @@ type: Supplement track: IaaS status: Proposal supplements: - - scs-XXXX-vN-taxonomy-of-failsafe-levels.md + - scs-0118-v1-taxonomy-of-failsafe-levels.md --- ## Examples of the impact from certain failure scenarios on Cloud Resources
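
To make the user-side obligations mentioned throughout the taxonomy more concrete: one common measure against Level 1 host failures is to keep the instances of a service on different compute hosts. The following sketch does this with the openstacksdk Python library and an anti-affinity server group; the cloud name, image, flavor, network and server names are hypothetical, and whether the `anti-affinity` policy is usable depends on the provider's Nova configuration.

```python
import openstack

conn = openstack.connect(cloud="my-cloud")  # hypothetical clouds.yaml entry

# A server group with the anti-affinity policy asks the Nova scheduler to
# place all members on different compute hosts, so that a single host
# failure (a Level 1 scenario) cannot take down every instance at once.
group = conn.create_server_group("app-group", policies=["anti-affinity"])

for index in range(2):
    conn.create_server(
        name=f"app-{index}",
        image="my-image",      # hypothetical image name
        flavor="SCS-2V-4",     # any available flavor works here
        network="my-network",  # hypothetical network name
        group=group.id,
        wait=True,
    )
```

This only addresses the availability of the virtual machines themselves; data on volumes or in backups still has to be protected separately, as described in the sections on redundancy and backups.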