Enhancement for alarms #215

Open

pixelsoccupied wants to merge 4 commits into openshift-kni:main from pixelsoccupied:enhance-alarms

+671 −0

Collaborator

pixelsoccupied commented Sep 24, 2024

This enhancement describes re-architecting the Alarm server as specified in the O-RAN InfrastructureMonitoring Service API spec.

Notable changes include:

- Data returned from API calls closely follows what is defined by O-RAN
- Dynamically checking and gathering cluster resources, such as PrometheusRule, during init
- Combining servers into the same code base
- Introducing persistent storage via Postgres

This enhancement includes all the code tested during the spike, the k8s resources needed to deploy through the operator, and other libraries/tools that can be used to develop this quickly.

co-authored with @browsell and @Jennifer-chen-rh

openshift-ci bot added the do-not-merge/work-in-progress label

openshift-ci bot commented Sep 24, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Jennifer-chen-rh reviewed

docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md Outdated

Jennifer-chen-rh requested changes

Collaborator

Jennifer-chen-rh left a comment

General question about the DB serial number: will it increase when a DB row is inserted, or when a DB row is updated?

docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md Outdated

openshift-ci bot assigned Jennifer-chen-rh

Collaborator Author

pixelsoccupied commented Sep 25, 2024

@Jennifer-chen-rh On entry (row insertion). More on SERIAL data types: https://www.postgresql.org/docs/current/datatype-numeric.html#DATATYPE-SERIAL. But we can have a custom type that is incremented on insertion or update.
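
The two behaviors discussed here (a value assigned once on insert, versus a value bumped on every insert or update) can be illustrated with a small in-memory Go sketch. It does not talk to Postgres; the `table` type, its fields, and the `bumpOnUpdate` switch are hypothetical names used only to show the difference.

```go
package main

import "fmt"

// row is a hypothetical stand-in for a table row; seq plays the role of
// the serial-number column.
type row struct {
	seq   int64
	value string
}

// table simulates a table backed by one monotonically increasing sequence.
type table struct {
	nextSeq      int64
	bumpOnUpdate bool // false: assigned on insert only; true: also bumped on update
	rows         map[string]*row
}

func newTable(bumpOnUpdate bool) *table {
	return &table{nextSeq: 1, bumpOnUpdate: bumpOnUpdate, rows: map[string]*row{}}
}

// nextval hands out the next sequence value.
func (t *table) nextval() int64 {
	v := t.nextSeq
	t.nextSeq++
	return v
}

// insert always draws a fresh sequence value for the new row.
func (t *table) insert(key, value string) int64 {
	r := &row{seq: t.nextval(), value: value}
	t.rows[key] = r
	return r.seq
}

// update changes the value; seq only changes when bumpOnUpdate is set.
func (t *table) update(key, value string) (int64, bool) {
	r, ok := t.rows[key]
	if !ok {
		return 0, false
	}
	r.value = value
	if t.bumpOnUpdate {
		r.seq = t.nextval()
	}
	return r.seq, true
}

func main() {
	onInsert := newTable(false)
	onInsert.insert("alarm-a", "raised")
	seq, _ := onInsert.update("alarm-a", "acknowledged")
	fmt.Println("insert-only seq after update:", seq)

	onChange := newTable(true)
	onChange.insert("alarm-a", "raised")
	seq, _ = onChange.update("alarm-a", "acknowledged")
	fmt.Println("insert-or-update seq after update:", seq)
}
```

With the insert-only behavior a subscriber cannot tell from the number alone that a row changed; the insert-or-update variant makes every change visible as a new, larger value.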

Jennifer-chen-rh reviewed

Collaborator

Jennifer-chen-rh left a comment

@pixelsoccupied Aha, I felt something was missing in the PR. Now I have figured out that the section we discussed about history table entry age-out is not here.

openshift-ci bot commented Sep 25, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from jennifer-chen-rh. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Jennifer-chen-rh reviewed

Collaborator

Jennifer-chen-rh left a comment

@pixelsoccupied

- The data structures are still public
- The alarm notification event object structure is missing

pixelsoccupied force-pushed the enhance-alarms branch from 9e69f7c to ce256ad

September 30, 2024 16:18

pixelsoccupied changed the title ~~[WIP] Enhancement for alarms~~ Enhancement for alarms

pixelsoccupied marked this pull request as ready for review

September 30, 2024 16:39

openshift-ci bot removed the do-not-merge/work-in-progress label

openshift-ci bot requested review from bartwensley and Missxiaoguo

September 30, 2024 16:39

pixelsoccupied requested a review from Jennifer-chen-rh

September 30, 2024 16:47


          Enhancement for alarms

e60ffbc

pixelsoccupied force-pushed the enhance-alarms branch from ce256ad to e60ffbc

September 30, 2024 17:07

bartwensley reviewed

docs/enhancements/infrastructure-monitoring-service-api/alarms.md

pixelsoccupied added 2 commits

October 1, 2024 14:31


          update requriment for when acknowledge is sent by user

921c8d9


          fix render

7a3aa1b


          clean up intro

031094b

bartwensley reviewed

Collaborator

bartwensley left a comment

Looks good - I took a look and added some comments and questions.

docs/enhancements/infrastructure-monitoring-service-api/alarms.md Outdated

docs/enhancements/infrastructure-monitoring-service-api/alarms.md

docs/enhancements/infrastructure-monitoring-service-api/alarms.md

    
              - ProbableCause

                  ProbableCause is a subset of data present in AlarmDefinition

Collaborator

bartwensley Oct 1, 2024

Not sure this is accurate (or relevant).

Collaborator Author

pixelsoccupied Oct 2, 2024

Yeah, ProbableCause is somewhat vague in the interface doc. But basically, when the user gets (or is pushed) a new alert event, the spec says to pass the ProbableCauseID with it. With that ID the user can see the alert name and alert description; the name and description are copies of those in its corresponding AlarmDefinition.

We are not exposing any API to get AlarmDefinition, but we do have /O2ims_infrastructureMonitoring/{apiVersion}/probableCause/{probableCauseId} to get ProbableCause data. Note that this API is not in the spec currently and we are the ones introducing it (will update the doc to reflect this).
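
A rough Go sketch of what a lookup behind that path could look like, using only the standard library. The `ProbableCause` field names and JSON tags, the in-memory `store`, and the helper `parseProbableCauseID` are assumptions for illustration; only the URL shape comes from the comment above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// ProbableCause carries an ID plus the name and description copied from
// the matching AlarmDefinition. Field names and tags are assumptions.
type ProbableCause struct {
	ProbableCauseID string `json:"probableCauseId"`
	Name            string `json:"name"`
	Description     string `json:"description"`
}

// store is a hypothetical in-memory lookup standing in for the database.
type store map[string]ProbableCause

// parseProbableCauseID extracts {probableCauseId} from
// /O2ims_infrastructureMonitoring/{apiVersion}/probableCause/{probableCauseId}.
func parseProbableCauseID(path string) (string, bool) {
	parts := strings.Split(strings.Trim(path, "/"), "/")
	if len(parts) != 4 || parts[0] != "O2ims_infrastructureMonitoring" || parts[2] != "probableCause" {
		return "", false
	}
	return parts[3], parts[1] != "" && parts[3] != ""
}

// ServeHTTP answers GET requests with the stored ProbableCause as JSON.
func (s store) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodGet {
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	id, ok := parseProbableCauseID(r.URL.Path)
	if !ok {
		http.NotFound(w, r)
		return
	}
	pc, found := s[id]
	if !found {
		http.NotFound(w, r)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(pc)
}

func main() {
	s := store{"pc-1": {ProbableCauseID: "pc-1", Name: "LowMemory", Description: "Low memory."}}
	req := httptest.NewRequest(http.MethodGet, "/O2ims_infrastructureMonitoring/v1/probableCause/pc-1", nil)
	rec := httptest.NewRecorder()
	s.ServeHTTP(rec, req)
	fmt.Println(rec.Code, strings.TrimSpace(rec.Body.String()))
}
```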

docs/enhancements/infrastructure-monitoring-service-api/alarms.md

    
                  }

                  ```

              ## Infrastructure Monitoring Service Alarms API

Collaborator

bartwensley Oct 1, 2024

Would be good to provide the name/version of the O-RAN spec this was taken from.

docs/enhancements/infrastructure-monitoring-service-api/alarms.md

    
              | **Internal Endpoint**                           | **HTTP Method** | **Description**                              | **Input Payload**                                                        | **Returned Data** |

              |-------------------------------------------------|-----------------|----------------------------------------------|--------------------------------------------------------------------------|-------------------|

              | `/internal/v1/caas-alerts/alertmanager`         | POST            | Alertmanager notifications come through here | https://prometheus.io/docs/alerting/latest/configuration/#webhook_config | None              |

Collaborator

bartwensley Oct 1, 2024

nit: would be nice to format the URL as a link so reader can click on it
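
For readers following the link in the table row above: Alertmanager posts a JSON body to a webhook receiver. A minimal Go sketch of decoding that body is below. It covers only a subset of the documented fields; the type and function names (`webhookMessage`, `webhookAlert`, `decodeWebhook`) are hypothetical, and the sample payload values are made up. It is not the handler from this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// webhookAlert is one alert inside an Alertmanager webhook notification.
type webhookAlert struct {
	Status      string            `json:"status"`
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations"`
	StartsAt    time.Time         `json:"startsAt"`
	EndsAt      time.Time         `json:"endsAt"`
	Fingerprint string            `json:"fingerprint"`
}

// webhookMessage is the top-level notification (subset of fields).
type webhookMessage struct {
	Version  string         `json:"version"`
	Status   string         `json:"status"`
	Receiver string         `json:"receiver"`
	Alerts   []webhookAlert `json:"alerts"`
}

// decodeWebhook parses a raw webhook body into a webhookMessage.
func decodeWebhook(body []byte) (webhookMessage, error) {
	var msg webhookMessage
	err := json.Unmarshal(body, &msg)
	return msg, err
}

func main() {
	// Made-up sample payload for illustration.
	sample := []byte(`{
	  "version": "4",
	  "status": "firing",
	  "receiver": "example-receiver",
	  "alerts": [{
	    "status": "firing",
	    "labels": {"alertname": "LowMemory", "severity": "critical", "managed_cluster": "cluster-a"},
	    "annotations": {"summary": "Low memory."},
	    "startsAt": "2024-10-01T14:31:00Z",
	    "endsAt": "0001-01-01T00:00:00Z",
	    "fingerprint": "abc123"
	  }]
	}`)

	msg, err := decodeWebhook(sample)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, a := range msg.Alerts {
		fmt.Println(a.Status, a.Labels["alertname"], a.Labels["managed_cluster"])
	}
}
```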

docs/enhancements/infrastructure-monitoring-service-api/alarms.md

    
                  ('LowMemory','4.16','Low memory.','Low memory, with a longer description to help user fix the issue','{"CustomKey2": "CustomValue"}'),

                  ('NodeClockNotSynchronising','4.17','Clock not synchronising.','Clock at {{ $labels.instance }} is not synchronising. Ensure NTP2 is configured on this host.', '{"CustomKey3": "CustomValue"}');

              -- probable_cause will be auto populated

Collaborator

bartwensley Oct 1, 2024

How will it be auto populated? Where will the data come from?

docs/enhancements/infrastructure-monitoring-service-api/alarms.md

Comment on lines +481 to +483

    
                  ('NodeClockNotSynchronising','4.16', 'Clock not synchronising.','Clock at {{ $labels.instance }} is not synchronising. Ensure NTP is configured on this host.','{"CustomKey": "CustomValue"}'),

                  ('LowMemory','4.16','Low memory.','Low memory, with a longer description to help user fix the issue','{"CustomKey2": "CustomValue"}'),

                  ('NodeClockNotSynchronising','4.17','Clock not synchronising.','Clock at {{ $labels.instance }} is not synchronising. Ensure NTP2 is configured on this host.', '{"CustomKey3": "CustomValue"}');

Collaborator

bartwensley Oct 1, 2024

Where does this data come from? How will we decide which alarms to define? Where does the proposed_repair_action come from?

docs/enhancements/infrastructure-monitoring-service-api/alarms.md

    
              - `alarm_dictionary` essentially links `ResourceTypeID` and `Version`. The conversion can be seen below.

                  | managed_cluster<br/>`from alerts`<br/> | resourceID  <br/>`same as managed_cluster`<br/> | resourceTypeID for caas<br/>`derived from (resourceID + ResourceKindLogical + ResourceClassCompute)`<br/> |

Collaborator

bartwensley Oct 1, 2024

So initially we are only supporting the ManagedCluster resource type? Does this account for additional resource types in the future? What are some examples of resource types we might want to support in the future?
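
One way the "derived from" column quoted above could work is a name-based (RFC 4122 version 5) hash, so the same inputs always give the same resourceTypeID. This is only a sketch: the PR does not state the derivation, and the namespace value, the `/` separator, and the string values used for ResourceKindLogical and ResourceClassCompute are all assumptions.

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

// namespace is a hypothetical fixed namespace (here the nil UUID).
var namespace = [16]byte{}

// Hypothetical string values; the real constants live in the inventory code.
const (
	resourceKindLogical  = "LOGICAL"
	resourceClassCompute = "COMPUTE"
)

// nameBasedID builds an RFC 4122 version 5 (SHA-1, name-based) UUID string.
func nameBasedID(ns [16]byte, name string) string {
	h := sha1.New()
	h.Write(ns[:])
	h.Write([]byte(name))
	sum := h.Sum(nil)

	var u [16]byte
	copy(u[:], sum[:16])
	u[6] = (u[6] & 0x0f) | 0x50 // set version 5
	u[8] = (u[8] & 0x3f) | 0x80 // set RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", u[0:4], u[4:6], u[6:8], u[8:10], u[10:16])
}

// resourceTypeID derives a stable ID from resourceID plus kind and class.
func resourceTypeID(resourceID string) string {
	return nameBasedID(namespace, resourceID+"/"+resourceKindLogical+"/"+resourceClassCompute)
}

func main() {
	fmt.Println(resourceTypeID("cluster-a"))
}
```

Because kind and class are part of the hashed name, a future resource type with a different kind or class would get a different ID from the same resourceID under this scheme.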

docs/enhancements/infrastructure-monitoring-service-api/alarms.md

    
                            labels: 

                              severity: critical #### (alarm_definitions.extensions)

                  ```

                  - Use ACM to get credentials of the unique major.minor clusters and retrieve all the PrometheusRules from them to parse.

Collaborator

bartwensley Oct 1, 2024

Do PrometheusRules ever change or new ones get added (e.g. on a z-stream upgrade)? Or if a cluster gets installed with a release that wasn't being managed yet? What will be the strategy to handle that?
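
The "unique major.minor clusters" selection in the quoted line can be sketched in Go as below. The input shape (a map of cluster name to version string) and the function names are assumptions; fetching credentials through ACM and reading the PrometheusRules are out of scope here.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// majorMinor reduces a version such as "4.16.3" to "4.16".
func majorMinor(version string) (string, bool) {
	parts := strings.Split(version, ".")
	if len(parts) < 2 || parts[0] == "" || parts[1] == "" {
		return "", false
	}
	return parts[0] + "." + parts[1], true
}

// pickRepresentatives returns, for every distinct major.minor, the name of
// one cluster to read rules from. clusters maps cluster name to version.
func pickRepresentatives(clusters map[string]string) map[string]string {
	names := make([]string, 0, len(clusters))
	for name := range clusters {
		names = append(names, name)
	}
	sort.Strings(names) // make the choice deterministic

	reps := map[string]string{}
	for _, name := range names {
		mm, ok := majorMinor(clusters[name])
		if !ok {
			continue // skip versions that cannot be parsed
		}
		if _, seen := reps[mm]; !seen {
			reps[mm] = name
		}
	}
	return reps
}

func main() {
	reps := pickRepresentatives(map[string]string{
		"cluster-a": "4.16.3",
		"cluster-b": "4.16.9",
		"cluster-c": "4.17.0",
	})
	fmt.Println(reps)
}
```

Note that under this grouping `4.16.3` and `4.16.9` share one rule set, which is exactly the z-stream case the question above raises.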

docs/enhancements/infrastructure-monitoring-service-api/alarms.md

    
              - Service: ClusterIP should be good as it's only used within the cluster. 

              - Secrets and Config: default creds needed to spin postgres 

              ## Tooling and general dev guidelines

Collaborator

bartwensley Oct 1, 2024

Can you add a brief note about how concurrency will be handled? I assume there can be multiple requests coming in parallel - GET/POST/PATCH/DELETE on the external API along with notifications from the AlertManager (and then potentially multiple callbacks to subscribers). Which of these events are going to be handled in parallel and will serialization be required for any of them? If so, what mechanism(s) will be used?

browsell reviewed

docs/enhancements/infrastructure-monitoring-service-api/alarms.md



		#### Postgres
		This deployment can be leveraged by many microservices by creating their own Database.

Collaborator

browsell Oct 2, 2024

This needs more detail.

- How are the credentials being exposed and consumed for the various servers in the o-cloud operator that need a DB?
- There is a requirement that you have replicated persistent storage.
- How do the clients discover the ClusterIP?
- How are you configuring the DB, and where does this configuration come from?

browsell reviewed

docs/enhancements/infrastructure-monitoring-service-api/alarms.md


		## Summary

		`o-ran` requires `InfrastructureMonitoring Service Alarms API` which is a collection of APIs that can be queried by client to

Collaborator

browsell Oct 2, 2024

Nit: capitalize o-ran.

browsell reviewed

docs/enhancements/infrastructure-monitoring-service-api/alarms.md

+              the service exposes APIs, configures Alertmanager deployment, read PrometheusRules from managedclusters and finally
+              store data in a persistent storage.
+              ### Goals

Collaborator

browsell Oct 2, 2024

Another goal is to allow future integration of alarms from additional sources (i.e. H/W)

browsell reviewed

docs/enhancements/infrastructure-monitoring-service-api/alarms.md

+                  This is primarily the link between Alarms and Inventory. A ResourceType (currently we are mostly dealing with type "cluster") can have exactly one AlarmDictionary.
+                  ```go
+                  // 3.2.6.2.8-1

Collaborator

browsell Oct 2, 2024

Expand

browsell reviewed

docs/enhancements/infrastructure-monitoring-service-api/alarms.md

+              Please note that this is not an exhaustive list but are here to help the reader get a feel for the Alarm specific data we are dealing with.
+              - AlarmDictionary
+                  This is primarily the link between Alarms and Inventory. A ResourceType (currently we are mostly dealing with type "cluster") can have exactly one AlarmDictionary.

Collaborator

browsell Oct 2, 2024

NodeCluster

browsell reviewed

docs/enhancements/infrastructure-monitoring-service-api/alarms.md

Collaborator

browsell Oct 2, 2024

Please add walkthroughs, in particular the initialization sequence.
Please describe how alerts are translated, and the mapping of alert fields (including the extensions) to alarm fields.
