diff --git a/docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md b/docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md index a2778ead..9b582588 100644 --- a/docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md +++ b/docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md @@ -28,6 +28,7 @@ superseded-by: - [Key o-ran data structures](#key-o-ran-data-structures) - [InfrastructureMonitoring Service API](#Infrastructure-Monitoring-Service-Alarms-API) - [Database schema](#schema) +- [Tooling](#tooling-and-general-dev-guidelines) ## Summary @@ -35,87 +36,109 @@ superseded-by: `o-ran` requires `InfrastructureMonitoring Service API` which is a collection of APIs that can be queried by client to monitor the health of the `o-cloud`. This enhancement describes initialization steps and ready steps for `InfrastructureMonitoring Service API`. -More specifically we will describe everything needed for `alarms`. +More specifically we will describe everything needed for `Alarms`. + +At a high level, this service can be viewed as a thin wrapper around the ACM observability stack which eventually translates +OCP cluster resources to data structures defined by the `o-ran` spec. Among other things, +the service exposes APIs, configures the Alertmanager deployment, reads PrometheusRules from managed clusters and finally +stores this data in persistent storage for `o-ran` clients. ### Goals - Define steps to initialize -- Define steps for when ready +- Define steps for when ready to serve API calls - Define database schema and queries - Define K8s CRs +- Define developer tools ## Key o-ran data structures -`InfrastructureMonitoring Service API`, specifically Alarms, primarily deals with the following o-ran data structures during initialization. -Comments for each attribute is taken from o-ran spec doc. +The Alarms part of `InfrastructureMonitoring Service API` primarily deals with the following o-ran data structures during initialization. 
+Comments for each attribute are taken from the o-ran spec doc. + +Please note that this list is not exhaustive; it is here to help the reader get a feel for the Alarm-specific data we are dealing with. - AlarmDictionary -```go -// 3.2.6.2.8-1 -type AlarmDictionary struct { - AlarmDictionaryVersion string `json:"alarmDictionaryVersion"` // M, 1, Version of the Alarm Dictionary. Version is vendor defined such that the version of the dictionary can be associated with a specific version of the software delivery of this product. - AlarmDictionarySchemaVersion string `json:"alarmDictionarySchemaVersion"` // M, 1, Version of the Alarm Dictionary Schema to which this alarm dictionary conforms. Note: The specific value for this should be defined in the IM/DM specification for the Alarm Dictionary Model Schema when it is published at a future date - EntityType string `json:"entityType"` // M, 1, O-RAN entity type emitting the alarm: This shall be unique per vendor ResourceType.model and ResourceType.version - Vendor string `json:"vendor"` // M, 1, Vendor of the Entity Type to whom this Alarm Dictionary applies. This should be the same value as in the ResourceType.vendor attribute. - ManagementInterfaceID []ManagementInterfaceID `json:"managementInterfaceId"` // M, 1..N, List of management interface over which alarms are transmitted for this Entity Type. RESTRICTION: For the O-Cloud IMS Services this value is limited to O2IMS. - PKNotificationField []string `json:"pkNotificationField"` // M, 1..N, Identifies which field or list of fields in the alarm notification contains the primary key (PK) into the Alarm Dictionary for this interface; i.e. which field contains the Alarm Definition ID. - AlarmDefinition []AlarmDefinition `json:"alarmDefinition"` // M, 1..N, List of alarms that can be detected against this ResourceType -} -``` + + This is primarily the link between Alarms and Inventory. 
A ResourceType (currently we are mostly dealing with type "cluster") can have exactly one AlarmDictionary. + + ```go + // 3.2.6.2.8-1 + type AlarmDictionary struct { + AlarmDictionaryVersion string `json:"alarmDictionaryVersion"` // M, 1, Version of the Alarm Dictionary. Version is vendor defined such that the version of the dictionary can be associated with a specific version of the software delivery of this product. + AlarmDictionarySchemaVersion string `json:"alarmDictionarySchemaVersion"` // M, 1, Version of the Alarm Dictionary Schema to which this alarm dictionary conforms. Note: The specific value for this should be defined in the IM/DM specification for the Alarm Dictionary Model Schema when it is published at a future date + EntityType string `json:"entityType"` // M, 1, O-RAN entity type emitting the alarm: This shall be unique per vendor ResourceType.model and ResourceType.version + Vendor string `json:"vendor"` // M, 1, Vendor of the Entity Type to whom this Alarm Dictionary applies. This should be the same value as in the ResourceType.vendor attribute. + ManagementInterfaceID []ManagementInterfaceID `json:"managementInterfaceId"` // M, 1..N, List of management interface over which alarms are transmitted for this Entity Type. RESTRICTION: For the O-Cloud IMS Services this value is limited to O2IMS. + PKNotificationField []string `json:"pkNotificationField"` // M, 1..N, Identifies which field or list of fields in the alarm notification contains the primary key (PK) into the Alarm Dictionary for this interface; i.e. which field contains the Alarm Definition ID. + AlarmDefinition []AlarmDefinition `json:"alarmDefinition"` // M, 1..N, List of alarms that can be detected against this ResourceType + } + ``` - AlarmDefinition -```go -// 3.2.6.2.9-1 -type AlarmDefinition struct { - AlarmDefinitionID uuid.UUID `json:"alarmDefinitionID"` // M, 1, Provides a unique identifier of the alarm being raised. This is the Primary Key into the Alarm Dictionary. 
- AlarmName string `json:"alarmName"` // M, 1, Provides short name for the alarm. - AlarmLastChange string `json:"alarmLastChange"` // M, 1, Indicates the Alarm Dictionary Version in which this alarm last changed. - AlarmChangeType AlarmLastChangeType `json:"alarmChangeType"` // M, 1, Indicates the type of change that occurred during the alarm last change; added, deleted, modified. - AlarmDescription string `json:"alarmDescription"` // M, 1, Provides a longer descriptive meaning of the alarm condition and a description of the consequences of the alarm condition. This is intended to be read by an operator to give an idea of what happened and a sense of the effects, consequences, and other impacted areas of the system - ProposedRepairActions string `json:"proposedRepairActions"` // M, 1, Provides guidance for proposed repair actions. - ClearingType ClearingType `json:"clearingType"` // M, 1, Whether alarm is cleared automatically or manually - ManagementInterfaceID []ManagementInterfaceID `json:"managementInterfaceId,omitempty"` // M, 0..N, List of management interface over which alarms are transmitted for this Entity Type. RESTRICTION: For the O-Cloud IMS Services this value is limited to O2IMS. - PKNotificationField []string `json:"pkNotificationField,omitempty"` // M, 0..N, Identifies which field or list of fields in the alarm notification contains the primary key (PK) into the Alarm Dictionary for this interface; i.e. which field contains the Alarm Definition ID. - AlarmAdditionalFields []AttributeValuePair `json:"alarmAdditionalFields,omitempty"` // M, 0..N, List of metadata key-value pairs used to associate meaningful metadata to the related resource type. -} -``` + + AlarmDefinition stores a rule and its metadata; the rule is continuously evaluated to decide whether an alert fires for a Resource Type. For CaaS, this is effectively the content of `PrometheusRules`. 
+ + ```go + // 3.2.6.2.9-1 + type AlarmDefinition struct { + AlarmDefinitionID uuid.UUID `json:"alarmDefinitionID"` // M, 1, Provides a unique identifier of the alarm being raised. This is the Primary Key into the Alarm Dictionary. + AlarmName string `json:"alarmName"` // M, 1, Provides short name for the alarm. + AlarmLastChange string `json:"alarmLastChange"` // M, 1, Indicates the Alarm Dictionary Version in which this alarm last changed. + AlarmChangeType AlarmLastChangeType `json:"alarmChangeType"` // M, 1, Indicates the type of change that occurred during the alarm last change; added, deleted, modified. + AlarmDescription string `json:"alarmDescription"` // M, 1, Provides a longer descriptive meaning of the alarm condition and a description of the consequences of the alarm condition. This is intended to be read by an operator to give an idea of what happened and a sense of the effects, consequences, and other impacted areas of the system + ProposedRepairActions string `json:"proposedRepairActions"` // M, 1, Provides guidance for proposed repair actions. + ClearingType ClearingType `json:"clearingType"` // M, 1, Whether alarm is cleared automatically or manually + ManagementInterfaceID []ManagementInterfaceID `json:"managementInterfaceId,omitempty"` // M, 0..N, List of management interface over which alarms are transmitted for this Entity Type. RESTRICTION: For the O-Cloud IMS Services this value is limited to O2IMS. + PKNotificationField []string `json:"pkNotificationField,omitempty"` // M, 0..N, Identifies which field or list of fields in the alarm notification contains the primary key (PK) into the Alarm Dictionary for this interface; i.e. which field contains the Alarm Definition ID. + AlarmAdditionalFields []AttributeValuePair `json:"alarmAdditionalFields,omitempty"` // M, 0..N, List of metadata key-value pairs used to associate meaningful metadata to the related resource type. 
+ } + ``` - ProbableCause -```go -// 2.1.3.3 -type ProbableCause struct { - ProbableCauseID uuid.UUID `json:"probableCauseId"` // M, Identifier of the ProbableCause. - Name string `json:"name"` // M, Human readable text of the probable cause. - Description string `json:"description"` // M, Any additional information beyond the name to describe the probableCause -} -``` + + ProbableCause is a subset of data present in AlarmDefinition + + ```go + // 2.1.3.3 + type ProbableCause struct { + ProbableCauseID uuid.UUID `json:"probableCauseId"` // M, Identifier of the ProbableCause. + Name string `json:"name"` // M, Human readable text of the probable cause. + Description string `json:"description"` // M, Any additional information beyond the name to describe the probableCause + } + ``` + - AlarmEventRecord -```go -type AlarmEventRecord struct { - AlarmEventRecordID uuid.UUID `json:"alarmEventRecordId"` // M, Identifier of an entry in the AlarmEventRecord. Locally unique within the scope of an O-Cloud instance. - ResourceID uuid.UUID `json:"resourceId"` // M, A reference to the resource instance which caused the alarm. - AlarmDefinitionID uuid.UUID `json:"alarmDefinitionId"` // M, A reference to the Alarm Definition record in the Alarm Dictionary associated with the referenced Resource Type. - ProbableCauseID uuid.UUID `json:"probableCauseId"` // M, A reference to the ProbableCause of the Alarm. - AlarmRaisedTime time.Time `json:"alarmRaisedTime"` // M, This field is populated with a Date/Time stamp value when the AlarmEventRecord is created. - AlarmChangedTime *time.Time `json:"alarmChangedTime,omitempty"` // M, This field is populated with a Date/Time stamp value when any value of the AlarmEventRecord is modified. - AlarmClearedTime *time.Time `json:"alarmClearedTime,omitempty"` // M, This field is populated with a Date/Time stamp value when the alarm condition is cleared. 
- AlarmAcknowledgedTime *time.Time `json:"alarmAcknowledgedTime,omitempty"` // M, This field is populated with a Date/Time stamp value when the alarm condition is acknowledged. - AlarmAcknowledged bool `json:"alarmAcknowledged"` // M, This is a Boolean value defaulted to FALSE. When a system acknowledges an alarm, it is then set to TRUE. - PerceivedSeverity PerceivedSeverity `json:"perceivedSeverity"` // M, This is an enumerated set of values which identify the perceived severity of the alarm. - Extensions []KeyValue `json:"extensions"` // M, These are unspecified (not standardized) properties (keys) which are tailored by the vendor or operator to extend the information provided about the O-Cloud Alarm. -} -``` + AlarmEventRecord is how we represent an alert that is fired or resolved. An alert coming from Alertmanager is mapped 1:1 with an instance of AlarmEventRecord. + + ```go + type AlarmEventRecord struct { + AlarmEventRecordID uuid.UUID `json:"alarmEventRecordId"` // M, Identifier of an entry in the AlarmEventRecord. Locally unique within the scope of an O-Cloud instance. + ResourceID uuid.UUID `json:"resourceId"` // M, A reference to the resource instance which caused the alarm. + AlarmDefinitionID uuid.UUID `json:"alarmDefinitionId"` // M, A reference to the Alarm Definition record in the Alarm Dictionary associated with the referenced Resource Type. + ProbableCauseID uuid.UUID `json:"probableCauseId"` // M, A reference to the ProbableCause of the Alarm. + AlarmRaisedTime time.Time `json:"alarmRaisedTime"` // M, This field is populated with a Date/Time stamp value when the AlarmEventRecord is created. + AlarmChangedTime *time.Time `json:"alarmChangedTime,omitempty"` // M, This field is populated with a Date/Time stamp value when any value of the AlarmEventRecord is modified. + AlarmClearedTime *time.Time `json:"alarmClearedTime,omitempty"` // M, This field is populated with a Date/Time stamp value when the alarm condition is cleared. 
+ AlarmAcknowledgedTime *time.Time `json:"alarmAcknowledgedTime,omitempty"` // M, This field is populated with a Date/Time stamp value when the alarm condition is acknowledged. + AlarmAcknowledged bool `json:"alarmAcknowledged"` // M, This is a Boolean value defaulted to FALSE. When a system acknowledges an alarm, it is then set to TRUE. + PerceivedSeverity PerceivedSeverity `json:"perceivedSeverity"` // M, This is an enumerated set of values which identify the perceived severity of the alarm. + Extensions []KeyValue `json:"extensions"` // M, These are unspecified (not standardized) properties (keys) which are tailored by the vendor or operator to extend the information provided about the O-Cloud Alarm. + } + ``` -- Subscriber +- AlarmSubscriptionInfo + This stores information about a subscription that needs to be notified when an Alarm is raised. -```go -type AlarmSubscriptionInfo struct { - SubscriptionID uuid.UUID `json:"subscriptionID"` - ConsumerSubscriptionID uuid.UUID `json:"consumerSubscriptionId"` - Filter string `json:"filter"` - Callback string `json:"callback"` -} -``` + + ```go + // 3.3.6.2.3 + type AlarmSubscriptionInfo struct { + SubscriptionID uuid.UUID `json:"subscriptionID"` // M, Identifier of the subscription. Locally unique within the scope of an O-Cloud instance. + ConsumerSubscriptionID uuid.UUID `json:"consumerSubscriptionId"` // O, The consumer may provide its identifier for tracking, routing, or identifying the subscription used to report the event. + Filter string `json:"filter"` // O, Criteria for events which do not need to be reported or will be filtered by the subscription notification service. Therefore, if a filter is not provided then all events are reported. + Callback string `json:"callback"` // M, The fully qualified URI to a consumer procedure which can process a Post of the AlarmEventNotification. 
+ } + ``` ## Infrastructure Monitoring Service Alarms API | **Endpoint** | **HTTP Method** | **Description** | **Input Payload** | **Returned Data** | @@ -255,11 +278,11 @@ oc -n open-cluster-management-observability create secret generic alertmanager-c alarm_cleared_time = EXCLUDED.alarm_cleared_time, alarm_changed_time = EXCLUDED.alarm_changed_time; ``` -3. Grab the Subscribers and send notification. +3. Grab the Subscriptions and send notification. - Database interaction is further explained [here](#notification-tracking) 4. Move all the `status: resolved` rows from `alarm_event_record` to `alarm_event_record_archive` -Eventually data in `alarm_event_record_archive` will be cleared (hardcoded to 24hr) as +Eventually data in `alarm_event_record_archive` will be cleared (hardcoded to 24hr) as seen [here](#daily-archive-cleanup-) ## Schema We only take care of Alarms* data contained within a specific DB, this approach allows for @@ -269,6 +292,8 @@ We only take care of Alarms* data contained within a specific DB, this approach Each table is modeled after o-ran data structures. 
DB in our case maybe called `o-ran-infrastructure-monitoring-alarms` Init SQL may look like the following: + +specify how ```sql -- ENUM for ManagementInterfaceID DROP TYPE IF EXISTS ManagementInterfaceID CASCADE; @@ -384,7 +409,7 @@ CREATE TABLE alarm_event_record ( extensions JSONB, created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP, finger_print TEXT not null, - alarm_sequence_number BIGSERIAL, + alarm_sequence_number BIGINT DEFAULT nextval('alarm_sequence_seq'), resource_id UUID NOT NULL, resource_type_id UUID NOT NULL, CONSTRAINT fk_resource_type FOREIGN KEY (resource_type_id) REFERENCES alarm_dictionary (resource_type_id) ON DELETE CASCADE, @@ -395,17 +420,19 @@ CREATE TABLE alarm_event_record ( ALTER SEQUENCE alarm_sequence_seq OWNED BY alarm_event_record.alarm_sequence_number; --- Update the sequence if resolved +-- Update the sequence if resolved or alarm_changed_time changed CREATE OR REPLACE FUNCTION set_alarm_sequence_on_update() RETURNS TRIGGER AS $$ BEGIN - IF NEW.status = 'resolved' AND (OLD.status IS DISTINCT FROM 'resolved') THEN + IF (NEW.status = 'resolved' AND OLD.status IS DISTINCT FROM 'resolved') + OR (NEW.alarm_changed_time IS DISTINCT FROM OLD.alarm_changed_time) THEN NEW.alarm_sequence_number := nextval('alarm_sequence_seq'); END IF; RETURN NEW; END; $$ LANGUAGE plpgsql; + -- Attach the trigger to alarm_event_record CREATE TRIGGER trg_alarm_sequence_update BEFORE UPDATE ON alarm_event_record @@ -451,6 +478,9 @@ VALUES Notes on Init phase - `versions` table reflects unique `major-minor` version `ManagedClusters` currently deployed. To get available `major-minor` managed cluster, we can list `ManagedCluster` CR and look for label `openshiftVersion-major-minor`. 
+ ```shell + oc get managedclusters + ``` ```yaml apiVersion: cluster.open-cluster-management.io/v1 kind: ManagedCluster @@ -491,7 +521,9 @@ Notes on Init phase labels: severity: critical #### (alarm_definitions.extensions) ``` - + - Use ACM to get credentials of the unique major.minor clusters and retrieve all the PrometheusRules from them to parse. + E.g. if we are managing 3 clusters (4.16.2, 4.17.2 and 4.16.8), pick 4.16.8 and 4.17.2, which effectively represent all the rules in 4.16.z and 4.17.z clusters. +- Build out a mapping between cluster ID, resource type ID and resource ID in memory as needed for quick lookup during runtime. ### For a given ResourceTypeID and AlarmName (coming from AM alert), find the AlarmDefinitionID and ProbableCauseID - Find resourceType ID and alart name @@ -525,8 +557,8 @@ Notes on Init phase ### Notification tracking -- Collect all subscriber info including ID, callback and filter -- Collect all the AlarmEventRecord based on sequence number and optionally the filter. Here we are collecting everything that's "CRITICAL" +- Collect all subscription info including ID, callback and filter +- For each subscription, collect all the `AlarmEventRecord` rows based on sequence number and optionally the filter. Here we are collecting everything that's "CRITICAL" ```sql SELECT aer.* FROM alarm_event_record aer @@ -536,8 +568,8 @@ Notes on Init phase and aer.perceived_severity = 'CRITICAL' ORDER BY aer.alarm_sequence_number; ``` -- Process and notify by deriving `AlarmEventRecordModifications` o-ran DS. -- Update sequence for subscriber indicating the latest event sent so far +- Process and notify by deriving the `AlarmEventRecordModifications` o-ran DS and sending it to the callback. 
+- Update sequence for subscription indicating the latest event sent so far ```go var largestProcessedSequenceNumber int64 // for each alarms Update the largest sequence number we've processed @@ -550,12 +582,43 @@ Notes on Init phase SET largest_number_alarm_event_seen_so_far = $largestProcessedSequenceNumber WHERE alarm_subscription_id = 'a2eebc99-9c0b-4ef8-bb6d-6bb9bd380a13'; ``` +- NOTE: `alarm_sequence_number` is handled automatically inside the DB. The sequence advances when a row is inserted, + updated to `resolved`, or has its `alarm_changed_time` updated. These are the same conditions under which a subscriber is notified. + See [schema](#schema) for more. ### Daily archive cleanup -Run this using a k8s CronJob CR at the start of every how +Run this using a k8s CronJob CR at the start of every hour ```sql DELETE FROM alarm_event_record_archive WHERE alarm_cleared_time < NOW() - INTERVAL '24 hour' and status = 'resolved'; ``` +We can apply the CR before the server starts and remove it during shutdown as part of teardown, e.g. inside `server.RegisterOnShutdown`. + +### K8s resources +We will need a few K8s resources that will eventually be applied using the Operator. + +#### Alarm server +This is essentially a typical CRUD app and we need the following: + +- Deployment CR: It should have one `initContainers` entry that runs to completion to perform the DB migration and one main container which starts the main server. No HA, so set replicas to 1. +- Service CR: Exposes the server and balances traffic (though to start with we will only set replicas to 1). +- + +### Tooling and general dev guidelines +- The HTTP server should be built with the Go 1.22 `net/http` std lib. The latest update to the package brings in + many requested features, including URI pattern matching. This allows us to drop the third-party lib `gorilla/mux`. +- Prefer creating structs to hold HTTP data for idiomatic Go code. +- OpenAPI spec should be the source of truth. 
Besides standardization, free validation and documentation, + it lets us leverage a code generator such as [this](https://github.com/oapi-codegen/oapi-codegen), allowing us to avoid writing boilerplate code. +- For Postgres communication use library [pgx](https://github.com/jackc/pgx) v5. This Go Postgres driver has lots of + important features such as automatic type mapping, detailed error reporting (capture performance info) etc. + There are also many ORM and SQL query builder libraries, but pgx looks like the best of both worlds. +- DB migration is generally handled with a different tool. [golang-migrate](https://github.com/golang-migrate/migrate) is commonly used for this, and we can call it during service init. +- Cobra CLI should be used to have better control of the servers. Each microservice should own its verbs, allowing them to be developed independently. E.g. + ```shell + o-ran alarms -h + o-ran alarms start -h + o-ran alarms db-migration -h + ``` \ No newline at end of file