-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enhancement for alarms #215
base: main
Are you sure you want to change the base?
Conversation
Skipping CI for Draft Pull Request. |
docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md
Outdated
Show resolved
Hide resolved
docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md
Outdated
Show resolved
Hide resolved
docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md
Outdated
Show resolved
Hide resolved
docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md
Outdated
Show resolved
Hide resolved
docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md
Outdated
Show resolved
Hide resolved
docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md
Outdated
Show resolved
Hide resolved
docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md
Outdated
Show resolved
Hide resolved
docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md
Outdated
Show resolved
Hide resolved
docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md
Outdated
Show resolved
Hide resolved
docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General question about DB serial number. It will increase with the entry of DB rows or increase with DB row update?
docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md
Outdated
Show resolved
Hide resolved
docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md
Outdated
Show resolved
Hide resolved
docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md
Outdated
Show resolved
Hide resolved
docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md
Outdated
Show resolved
Hide resolved
docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md
Outdated
Show resolved
Hide resolved
docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md
Outdated
Show resolved
Hide resolved
@Jennifer-chen-rh on entry. More on SERIAL datatypes https://www.postgresql.org/docs/current/datatype-numeric.html#DATATYPE-SERIAL. But we can have custom type which be incremented on insertion or update. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pixelsoccupied aha, I felt something missing in PR. Now I figured out that the section we discussed about history table entry age out not here.
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- the datastructures are still public
- Missing the the alarm notification event object structure
9e69f7c
to
ce256ad
Compare
ce256ad
to
e60ffbc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good - I took a look and added some comments and questions.
docs/enhancements/infrastructure-monitoring-service-api/alarms.md
Outdated
Show resolved
Hide resolved
|
||
- ProbableCause | ||
|
||
ProbableCause is a subset of data present in AlarmDefinition |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure this is accurate (or relevant).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah ProbableCause
is somewhat vague in the interface doc. But basically when the user wants a get (or pushed) a new alert event...the spec says to pass on the ProbableCauseID
with it. And with the ID user may see Alert name and Alert description, and the name are description are copy of its corresponding AlarmDefinition.
We are not exposing any API to get AlarmDefinition but we do have /O2ims_infrastructureMonitoring/{apiVersion}/probableCause/{probableCauseId}
to get probablecause data. Note that this API is not in spec currently and we are the one introducing it (will update the doc reflect this)
} | ||
``` | ||
## Infrastructure Monitoring Service Alarms API |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be good to provide the name/version of the O-RAN spec this was taken from.
|
||
| **Internal Endpoint** | **HTTP Method** | **Description** | **Input Payload** | **Returned Data** | | ||
|-------------------------------------------------|-----------------|----------------------------------------------|--------------------------------------------------------------------------|-------------------| | ||
| `/internal/v1/caas-alerts/alertmanager` | POST | Alertmanager notifications come through here | https://prometheus.io/docs/alerting/latest/configuration/#webhook_config | None | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: would be nice to format the URL as a link so reader can click on it
('LowMemory','4.16','Low memory.','Low memory, with a longer description to help user fix the issue','{"CustomKey2": "CustomValue"}'), | ||
('NodeClockNotSynchronising','4.17','Clock not synchronising.','Clock at {{ $labels.instance }} is not synchronising. Ensure NTP2 is configured on this host.', '{"CustomKey3": "CustomValue"}'); | ||
-- probable_cause will be auto populated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How will it be auto populated? Where will the data come from?
('NodeClockNotSynchronising','4.16', 'Clock not synchronising.','Clock at {{ $labels.instance }} is not synchronising. Ensure NTP is configured on this host.','{"CustomKey": "CustomValue"}'), | ||
('LowMemory','4.16','Low memory.','Low memory, with a longer description to help user fix the issue','{"CustomKey2": "CustomValue"}'), | ||
('NodeClockNotSynchronising','4.17','Clock not synchronising.','Clock at {{ $labels.instance }} is not synchronising. Ensure NTP2 is configured on this host.', '{"CustomKey3": "CustomValue"}'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where does this data come from? How will we decide which alarms to define? Where does the proposed_repair_action come from?
|
||
- `alarm_dictionary` essentially links `ResourceTypeID` and `Version`. The conversion can be seen below. | ||
|
||
| managed_cluster<br/>`from alerts`<br/> | resourceID <br/>`same as managed_cluster`<br/> | resourceTypeID for caas<br/>`derived from (resourceID + ResourceKindLogical + ResourceClassCompute)`<br/> | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So initially we are only supporting the ManagedCluster resource type? Does this account for additional resource types in the future? What are some examples of resource types we might want to support in the future?
labels: | ||
severity: critical #### (alarm_definitions.extensions) | ||
``` | ||
- Use ACM to get credentials of the unique major.minor clusters and retrieve all the PrometheusRules from them to parse. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do PrometheusRules ever change or new ones get added (e.g. on a z-stream upgrade)? Or if a cluster gets installed with a release that wasn't being managed yet? What will be the strategy to handle that?
- Service: ClusterIP should be good as it's only used within the cluster. | ||
- Secrets and Config: default creds needed to spin postgres | ||
|
||
## Tooling and general dev guidelines |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a brief note about how concurrency will be handled? I assume there can be multiple requests coming in parallel - GET/POST/PATCH/DELETE on the external API along with notifications from the AlertManager (and then potentially multiple callbacks to subscribers). Which of these events are going to be handled in parallel and will serialization be required for any of them? If so, what mechanism(s) will be used?
|
||
|
||
#### Postgres | ||
This deployment can be leveraged by many microservices by creating their own Database. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs more detail.
- How are the credentials being exposed and consumed for the various servers in the o-cloud operator that need a DB
- There is a requirement that you have replicated persistent storage
- How do the clients discover the ClusterIP ?
- How are you configuring the DB, where does this configuration come from ?
|
||
## Summary | ||
|
||
`o-ran` requires `InfrastructureMonitoring Service Alarms API` which is a collection of APIs that can be queried by client to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit capitalize o-ran
the service exposes APIs, configures Alertmanager deployment, read PrometheusRules from managedclusters and finally | ||
store data in a persistent storage. | ||
|
||
### Goals |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another goal is to allow future integration of alarms from additional sources (i.e. H/W)
This is primarily the link between Alarms and Inventory. A ResourceType (currently we are mostly dealing with type "cluster") can have exactly one AlarmDictionary. | ||
|
||
```go | ||
// 3.2.6.2.8-1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Expand
Please note that this is not an exhaustive list but are here to help the reader get a feel for the Alarm specific data we are dealing with. | ||
|
||
- AlarmDictionary | ||
This is primarily the link between Alarms and Inventory. A ResourceType (currently we are mostly dealing with type "cluster") can have exactly one AlarmDictionary. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NodeCluster
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add walkthroughs, in particular the initialization sequence.
Please describe how alerts are translated, and the mapping of alert fields (including the extensions) to alarm fields.
This enhancement talks about re-architecting the Alarm server as specified in
InfrastructureMonitoring Service API
o-ran spec.Notable changes include:
This enhancement includes all the code tested during spike, the k8s resources needed to deploy through operator and other libraries/tool that can be used to quickly develop this.
co-authored with @browsell and @Jennifer-chen-rh