Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhancement for alarms #215

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

pixelsoccupied
Copy link
Collaborator

@pixelsoccupied pixelsoccupied commented Sep 24, 2024

This enhancement talks about re-architecting the Alarm server as specified in InfrastructureMonitoring Service API o-ran spec.

Notable changes include:

  • Data returned from API calls follow closely to what's defined by o-ran
  • Dynamically checking and gathering cluster resources during init such as PrometheuseRule
  • Combining servers into the same code base
  • Introducing persistence storage via Postgres

This enhancement includes all the code tested during spike, the k8s resources needed to deploy through operator and other libraries/tool that can be used to quickly develop this.

co-authored with @browsell and @Jennifer-chen-rh

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 24, 2024
Copy link

openshift-ci bot commented Sep 24, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Copy link
Collaborator

@Jennifer-chen-rh Jennifer-chen-rh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General question about DB serial number. It will increase with the entry of DB rows or increase with DB row update?

@pixelsoccupied
Copy link
Collaborator Author

pixelsoccupied commented Sep 25, 2024

@Jennifer-chen-rh on entry. More on SERIAL datatypes https://www.postgresql.org/docs/current/datatype-numeric.html#DATATYPE-SERIAL. But we can have custom type which be incremented on insertion or update.

Copy link
Collaborator

@Jennifer-chen-rh Jennifer-chen-rh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pixelsoccupied aha, I felt something missing in PR. Now I figured out that the section we discussed about history table entry age out not here.

Copy link

openshift-ci bot commented Sep 25, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from jennifer-chen-rh. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Collaborator

@Jennifer-chen-rh Jennifer-chen-rh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pixelsoccupied

  1. the datastructures are still public
  2. Missing the the alarm notification event object structure

@pixelsoccupied pixelsoccupied changed the title [WIP] Enhancement for alarms Enhancement for alarms Sep 30, 2024
@pixelsoccupied pixelsoccupied marked this pull request as ready for review September 30, 2024 16:39
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 30, 2024
Copy link
Collaborator

@bartwensley bartwensley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good - I took a look and added some comments and questions.


- ProbableCause

ProbableCause is a subset of data present in AlarmDefinition
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure this is accurate (or relevant).

Copy link
Collaborator Author

@pixelsoccupied pixelsoccupied Oct 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah ProbableCause is somewhat vague in the interface doc. But basically when the user wants a get (or pushed) a new alert event...the spec says to pass on the ProbableCauseID with it. And with the ID user may see Alert name and Alert description, and the name are description are copy of its corresponding AlarmDefinition.

We are not exposing any API to get AlarmDefinition but we do have /O2ims_infrastructureMonitoring/{apiVersion}/probableCause/{probableCauseId} to get probablecause data. Note that this API is not in spec currently and we are the one introducing it (will update the doc reflect this)

}
```
## Infrastructure Monitoring Service Alarms API
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to provide the name/version of the O-RAN spec this was taken from.


| **Internal Endpoint** | **HTTP Method** | **Description** | **Input Payload** | **Returned Data** |
|-------------------------------------------------|-----------------|----------------------------------------------|--------------------------------------------------------------------------|-------------------|
| `/internal/v1/caas-alerts/alertmanager` | POST | Alertmanager notifications come through here | https://prometheus.io/docs/alerting/latest/configuration/#webhook_config | None |
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: would be nice to format the URL as a link so reader can click on it

('LowMemory','4.16','Low memory.','Low memory, with a longer description to help user fix the issue','{"CustomKey2": "CustomValue"}'),
('NodeClockNotSynchronising','4.17','Clock not synchronising.','Clock at {{ $labels.instance }} is not synchronising. Ensure NTP2 is configured on this host.', '{"CustomKey3": "CustomValue"}');
-- probable_cause will be auto populated
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How will it be auto populated? Where will the data come from?

Comment on lines +481 to +483
('NodeClockNotSynchronising','4.16', 'Clock not synchronising.','Clock at {{ $labels.instance }} is not synchronising. Ensure NTP is configured on this host.','{"CustomKey": "CustomValue"}'),
('LowMemory','4.16','Low memory.','Low memory, with a longer description to help user fix the issue','{"CustomKey2": "CustomValue"}'),
('NodeClockNotSynchronising','4.17','Clock not synchronising.','Clock at {{ $labels.instance }} is not synchronising. Ensure NTP2 is configured on this host.', '{"CustomKey3": "CustomValue"}');
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does this data come from? How will we decide which alarms to define? Where does the proposed_repair_action come from?


- `alarm_dictionary` essentially links `ResourceTypeID` and `Version`. The conversion can be seen below.

| managed_cluster<br/>`from alerts`<br/> | resourceID <br/>`same as managed_cluster`<br/> | resourceTypeID for caas<br/>`derived from (resourceID + ResourceKindLogical + ResourceClassCompute)`<br/> |
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So initially we are only supporting the ManagedCluster resource type? Does this account for additional resource types in the future? What are some examples of resource types we might want to support in the future?

labels:
severity: critical #### (alarm_definitions.extensions)
```
- Use ACM to get credentials of the unique major.minor clusters and retrieve all the PrometheusRules from them to parse.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do PrometheusRules ever change or new ones get added (e.g. on a z-stream upgrade)? Or if a cluster gets installed with a release that wasn't being managed yet? What will be the strategy to handle that?

- Service: ClusterIP should be good as it's only used within the cluster.
- Secrets and Config: default creds needed to spin postgres

## Tooling and general dev guidelines
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a brief note about how concurrency will be handled? I assume there can be multiple requests coming in parallel - GET/POST/PATCH/DELETE on the external API along with notifications from the AlertManager (and then potentially multiple callbacks to subscribers). Which of these events are going to be handled in parallel and will serialization be required for any of them? If so, what mechanism(s) will be used?



#### Postgres
This deployment can be leveraged by many microservices by creating their own Database.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs more detail.

  1. How are the credentials being exposed and consumed for the various servers in the o-cloud operator that need a DB
  2. There is a requirement that you have replicated persistent storage
  3. How do the clients discover the ClusterIP ?
  4. How are you configuring the DB, where does this configuration come from ?


## Summary

`o-ran` requires `InfrastructureMonitoring Service Alarms API` which is a collection of APIs that can be queried by client to
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit capitalize o-ran

the service exposes APIs, configures Alertmanager deployment, read PrometheusRules from managedclusters and finally
store data in a persistent storage.

### Goals
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another goal is to allow future integration of alarms from additional sources (i.e. H/W)

This is primarily the link between Alarms and Inventory. A ResourceType (currently we are mostly dealing with type "cluster") can have exactly one AlarmDictionary.

```go
// 3.2.6.2.8-1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expand

Please note that this is not an exhaustive list but are here to help the reader get a feel for the Alarm specific data we are dealing with.

- AlarmDictionary
This is primarily the link between Alarms and Inventory. A ResourceType (currently we are mostly dealing with type "cluster") can have exactly one AlarmDictionary.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NodeCluster

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add walkthroughs, in particular the initialization sequence.
Please describe how alerts are translated, and the mapping of alert fields (including the extensions) to alarm fields.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants