Create Availability Zone Standard #640

Open: wants to merge 29 commits into base: main

josephineSei
Contributor

closes #539

@josephineSei josephineSei changed the title Create scs-XXXX-vN-Availability-Zones-Standard.md Draft: scs-XXXX-vN-Availability-Zones-Standard.md Jun 17, 2024
@josephineSei josephineSei marked this pull request as draft June 19, 2024 13:42
@josephineSei josephineSei marked this pull request as ready for review June 24, 2024 10:15
@josephineSei josephineSei changed the title Draft: scs-XXXX-vN-Availability-Zones-Standard.md Create Availability Zone Standard Jun 24, 2024
Contributor

@artificial-intelligence artificial-intelligence left a comment

A first round of comments; I'll need to get back to this later.
Note that there are still some spelling mistakes, which I haven't had the time to address one by one yet.

Thanks for all the effort put into this!


Within each Availability Zone:

- there MUST be redundancy in power supply, as in line into the deployment
Contributor

IMHO we need to define more clearly what "redundancy" means here, that is, redundancy up to what level? Redundancy for every electrical component in the AZ would possibly entail:

  • redundant PDUs and PSUs for each server
  • redundant fuses and electric circuits from each PDU to the redundant online UPS system
  • possibly connecting each redundant UPS (e.g. battery backed) to two independent local power generators for redundant fallback power generation (e.g. diesel generator)
  • and finally external redundant energy providers over redundant external power connections.

also we may want to specify the level of redundancy (2, 3..).

Contributor Author

This is an important topic. But I wonder whether it would fit into a standard about Availability Zones in that level of detail.

I did not want to discuss all possible physical features that need to be redundant. I think it would be better to refer to a document that does all this (at least within this standard). Maybe something from the BSI like this?

Or do you think that it should be stated here in detail, like:

  • there MUST be at least two redundant power sources (power line or generator)
  • each PDU SHOULD have at least one redundant twin
  • each PDU MUST have at least two redundant electric circuits to the redundant power sources
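
For illustration only, the three candidate rules above could be expressed as a small check. This is a hypothetical sketch and not part of the PR: the `PDU` and `AvailabilityZone` classes, their fields, and `check_power_redundancy` are invented names, and treating "two redundant electric circuits" as "circuits to two distinct power sources" is one possible interpretation.

```python
from dataclasses import dataclass, field


@dataclass
class PDU:
    # Hypothetical model: the power sources this PDU reaches over separate
    # electric circuits, and whether a redundant twin PDU exists.
    name: str
    circuits_to_sources: set[str] = field(default_factory=set)
    has_twin: bool = False


@dataclass
class AvailabilityZone:
    name: str
    power_sources: set[str] = field(default_factory=set)  # power lines or generators
    pdus: list[PDU] = field(default_factory=list)


def check_power_redundancy(az: AvailabilityZone) -> tuple[list[str], list[str]]:
    """Return (MUST violations, SHOULD warnings) for the three candidate rules."""
    errors: list[str] = []
    warnings: list[str] = []
    if len(az.power_sources) < 2:
        errors.append(f"{az.name}: fewer than two redundant power sources")
    for pdu in az.pdus:
        # Only count circuits that end at a power source known to this AZ.
        connected = pdu.circuits_to_sources & az.power_sources
        if len(connected) < 2:
            errors.append(f"{pdu.name}: fewer than two circuits to redundant power sources")
        if not pdu.has_twin:
            warnings.append(f"{pdu.name}: no redundant twin PDU")
    return errors, warnings


# Example: pdu-2 violates the MUST rule on circuits and the SHOULD rule on twins.
az = AvailabilityZone(
    "az1",
    {"grid-a", "generator-1"},
    [
        PDU("pdu-1", {"grid-a", "generator-1"}, has_twin=True),
        PDU("pdu-2", {"grid-a"}),
    ],
)
errors, warnings = check_power_redundancy(az)
```

Separating MUST violations from SHOULD warnings mirrors the keyword levels in the proposed wording; whether rules at this level of detail belong in the standard at all is the open question of this thread.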

Contributor

I think it should be left to the judgment of the CSP (which also gives them some freedom of interpretation), while expecting that users are able to achieve a certain level of higher availability.

Co-authored-by: Sven <[email protected]>
Signed-off-by: josephineSei <[email protected]>
Contributor Author

@josephineSei josephineSei left a comment

I tried to put in all information I got from CSPs, while keeping the focus on Availability Zones.

This keeps much of the physical redundancy part out, but if this should also be part of it, we need to discuss to what extent or if it may be better to refer to some document outside of this standard.


@josephineSei
Contributor Author

In today's IaaS call, we discussed a few open questions:

Network AZ

In the standard I discussed that it is possible to have network AZs, but that this has downsides for users. Thus I did not make any recommendations. We discussed whether we even want to discourage CSPs from using them ("SHOULD NOT"):

  • it has been brought up that it is hard to configure and not nice to use for users
  • @garloff: discourage or even forbid usage of network AZs
  • @berendt: should not be forbidden, there are use cases
  • These are really not nice for users, we should discourage it (but not disallow)
    • ToDo: Ask for more use cases, maybe we can not even discourage

Cross-Attach AZ

The question was whether we want to encourage, allow, discourage, or disallow this.

  • so far, nearly no CSP uses this according to Hedgedoc input
  • @garloff: unlike for network it is not obvious that I can attach volumes from other AZs
  • when using Ceph, you'd normally have a global cross-AZ for storage (but not several storage AZs)
  • if not using Ceph, implementation would be hard, we should not request this from CSPs
    • Use-case wavecon: Local dedicated (per AZ) ceph clusters, no support for x-attaching
  • @artificial-intelligence: X-attach would negatively impact isolation between AZs (and performance)
  • Maybe transparency is the most important feature here?
  • important to distinguish between replicating storage between AZs vs. cross-attaching volumes across AZs
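
To make the last distinction concrete: cross-attaching means a server in one AZ attaches a volume that lives in another AZ, whereas replication keeps copies of the data in several AZs regardless of where a volume is attached. Below is a minimal sketch of the attach-side policy only. `Server`, `Volume`, and `may_attach` are hypothetical names, and the code does not call any OpenStack API; the boolean loosely corresponds to Nova's `[cinder] cross_az_attach` option.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    az: str


@dataclass(frozen=True)
class Volume:
    name: str
    az: str


def may_attach(server: Server, volume: Volume, cross_az_attach: bool) -> bool:
    """Allow an attachment if both resources share an AZ, or if the
    deployment explicitly permits attaching across AZs."""
    return cross_az_attach or server.az == volume.az


# Example: a server in az1 and a volume in az2.
server = Server("vm-1", "az1")
volume = Volume("vol-1", "az2")
strict = may_attach(server, volume, cross_az_attach=False)
relaxed = may_attach(server, volume, cross_az_attach=True)
```

With cross-attach disabled, AZs stay isolated at the cost of flexibility. Whichever setting a CSP chooses, documenting it is what the transparency point above asks for.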

Overall

  • We cannot define every detail of how DCs should be built for highest availability
  • Reference DC taxonomies / BSI taxonomy for this
  • SCS can be useful by providing some minimal bounds that allow users to have a meaningfully higher chance to survive by spreading over several AZs
  • Highest level of redundancy will always be achieved by replicating data over several regions
    • Can we define something with "AZ"s that's better than nothing (though never as good as regions)?

@josephineSei
Contributor Author

I sent a mail to the ML asking for feedback on the network AZ topic.

@horazont
Member

Single network AZ is not a problem for us. Neutron's HA capabilities are strong enough and our networks are small enough that we wouldn't gain anything from separate AZs.

@josephineSei
Contributor Author

I read through the standard after my vacation and looked through the minutes of the IaaS calls that took place in the meantime. I think we still need feedback from CSPs, so I wrote a mail to the SCS ML.

Contributor

@markus-hentsch markus-hentsch left a comment

Added some comments and suggestions mostly revolving around spelling and phrasing.

I did notice there is a mix of capitalization for some terms: often "Storage" and "Compute" are capitalized (not everywhere 100% though) whereas "network" in the network AZ section is not while it is in others. I think this could also be aligned a bit better over the whole document.

There might only be a loss of a few packets within the lost network resources.

With having Compute and Storage in a good state (e.g. through having fire zones with a compute AZ each and storage being replicated over the fire zones) it would not have downsides to not have Availability Zones for the network service.
It might even be the opposite: Having resources running in certain Availability Zones might permit them from being scheduled in other AZs[^3].
Contributor

permit [...] from

Did you mean "prevent [...] from" here?

To be honest, I can't really tell as I haven't fully understood why there are no downsides from omitting AZs in network from this whole paragraph.
Maybe one or two details could be added to explain the reasoning behind this general statement?

Contributor Author

I added a few more lines to this paragraph to explain this better. And yes, it was "prevent".

Contributor

@gtema gtema left a comment

Generally I am fine with this, but I agree with one comment on rephrasing or dropping one statement.


@frosty-geek
Member

FTR, plusserver's definition of AZs: https://docs.plusserver.com/en/general/plusserver-region-az/

Contributor

@matfechner matfechner left a comment

LGTM

Contributor

@chrisschwa chrisschwa left a comment

I do like the overall definition of the AZs, but I think that any regulation of, e.g., how many PDUs a CSP has depends heavily on their design.

@josephineSei
Contributor Author

@artificial-intelligence and @markus-hentsch, we've got feedback from CSPs and I added a note on manual testing. Could you check whether all your comments are addressed now?


## Physical Audits

In cases where it is reasonable to mistrust the provided documentation, a physical audit by a natural person - called auditor - sent by the OSBA (?) should be performed.
Contributor Author

@garloff If we want someone to audit deployments in special cases, we need to define who will appoint such a person. Will that be the OSBA?

josephineSei and others added 2 commits September 25, 2024 16:22
Co-authored-by: Markus Hentsch <[email protected]>
Signed-off-by: josephineSei <[email protected]>
Successfully merging this pull request may close these issues.

Availability Zones: standardized levels of independecies.
10 participants