Event & Alert Management
Description/Summary
Event & Alert Management deals with any kind of Event & Alert in the IT infrastructure and IT services. A well-defined and controlled process leads to the effective handling of these events and alerts. Event & Alert Management is triggered by the occurrence of noticeable signals or messages that have significance for the services or the infrastructure. Typically these events and alerts are generated by monitoring tools, configuration items (CI) or IT services. Each Event & Alert is classified by determining its significance, then analysed and handled according to a handling rule. Handling is done by human or automated operations and might be followed up by a set of actions. Events and alerts that are critical for the delivery of a defined class of services are forwarded to Incident, Problem or Change Management.
Event & Alert Management is closely related to Incident and Problem Management:
- Event & Alert Management monitors and handles all events occurring throughout the IT services and systems
- Incident Management monitors and handles malfunctions of IT services and systems and concentrates on restoring the services
- Problem Management researches the root causes of incidents
Objectives
The purpose of Event & Alert Management is to establish standardized procedures for the handling of IT-related Events & Alerts throughout the recording, classification, action definition and implementation, closure and Post Implementation Review (PIR) activities.
Event & Alert Management contributes to an integrated Service Management approach by achieving the following goals:
- Every Event & Alert and all required data is recorded.
- Every Event & Alert runs through a set of standardized activities and procedures in order to ensure effective and efficient processing.
- Every Event & Alert that poses a risk to normal service operation is carefully assessed and forwarded to either Incident, Change or Problem Management.
Event & Alert Management is typically tool-supported. These tools provide active and passive monitoring of the related services and infrastructure.
Roles & Functions
Event & Alert Management specific roles
Static Process Roles
- Event & Alert Process Owner
- Initiator of the process, accountable for defining the process strategic goals and allocating all required process resources. See Continual Process Improvement Management for a detailed description of these activities.
- Event & Alert Process Manager (Event & Alert Manager)
- Manager of the entire process, responsible for its effectiveness and efficiency. Team leader of the function „Event & Alert Management Team“. See Continual Process Improvement Management for a detailed description of these activities.
- Event & Alert Management Team
- Team associated with the Event & Alert Management Process.
Dynamic Process Roles
These roles are dynamically created during the Event & Alert Management Process. See the process-specific or activity-specific rules for details.
- Event & Alert Owner
- The attribute in the records contains the value of the Role/Function currently accountable for the Event & Alert (but NOT for the Event & Alert Management Process).
- Event & Alert Agent
- The attribute in the records contains the value of the Role/Function currently responsible for either an activity or a task within the overall activity of the Event & Alert.
Service Specific Roles
Roles depending on the affected service are found in the Service Description. The Service Description, including the service specific roles, is delivered from the Service Portfolio Management.
Example: Service Description
- Service Expert/Service Specialist
- They can consult and/or act as
- Event & Alert Analyser & Classifier
- Event & Alert Action Reviewer/ Auditor
- Service Owner
Customer Specific Roles
Roles depending on the affected customer(s) are found in the Service Level Agreement (SLA). The Service Level Agreement for the customer specific roles is maintained by the Service Level Agreement Management.
- Customer(s)
- Customers of the affected Service with a valid SLA
Example: Service Level Agreement Management
Information artifacts
This section describes information/data required or recorded by the process. In general, a process record (here: the Event & Alert record) represents the current progress of a process and contains all information related to the execution of that process. Additional information items (artifacts), such as the Request for Event & Alert (RFC) or the Forward Schedule of Event & Alert (FSC) in the context of Event & Alert Management, can typically be derived from one or more (up to all) process records by filtering, merging, correlating or interpreting this information, sometimes with regard to other data and information sources.
Event & Alert Notifier
Message sent by a monitoring tool or a monitoring person to notify about an event or alert. Various types of messages are possible, depending on the monitoring tools used.
Event & Alert Record
No standard Event & Alert Record can be defined, but some information is typically recorded. On creation, the record is filled with the information provided by the Event & Alert Notifier.
- Unique identifier : Event & Alert ID
- Event & Alert device: Name/ Type/ ID Number of the device alerting and the device being affected by the alert
- Event & Alert component: Name of the component alerting and the component being affected by the alert
- Type of Event & Alert/ Type of failure: Failure occurred
- Event & Alert time/ date: Date and time of alert
- Status : Status of the Event & Alert. Is set when passing a Control Activity
- Services : Service(s) affected by this Event & Alert
- Event & Alert Description : The description of the Event & Alert including the Event & Alert Argument
- Additional Remarks : Text for additional information
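The attributes listed above could be represented roughly as follows. This is a minimal sketch, not a standard record layout: the class name `EventAlertRecord`, the function `record_from_notifier`, all field names and the notifier keys (`device`, `component`, `failure_type`, `time`, and so on) are assumptions made for illustration, since real monitoring tools send differently shaped messages.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class EventAlertRecord:
    """Illustrative Event & Alert Record; all field names are assumptions."""
    record_id: str                      # unique identifier (Event & Alert ID)
    alerting_device: str                # device raising the alert
    affected_device: str                # device affected by the alert
    alerting_component: str             # component raising the alert
    affected_component: str             # component affected by the alert
    failure_type: str                   # type of Event & Alert / type of failure
    occurred_at: datetime               # date and time of the alert
    description: str                    # description of the Event & Alert
    status: Optional[str] = None        # set when passing a Control Activity
    services: List[str] = field(default_factory=list)  # affected service(s)
    remarks: str = ""                   # additional remarks


def record_from_notifier(record_id: str, notifier: dict) -> EventAlertRecord:
    """Create a record filled with the information provided by a notifier.

    The notifier keys used here are hypothetical.
    """
    device = notifier["device"]
    component = notifier["component"]
    return EventAlertRecord(
        record_id=record_id,
        alerting_device=device,
        # Assume the alerting item is also the affected one unless stated.
        affected_device=notifier.get("affected_device", device),
        alerting_component=component,
        affected_component=notifier.get("affected_component", component),
        failure_type=notifier["failure_type"],
        occurred_at=datetime.fromisoformat(notifier["time"]),
        description=notifier["description"],
        services=list(notifier.get("services", [])),
        remarks=notifier.get("remarks", ""),
    )
```

The status is left empty on creation because, according to the attribute list, it is only set when the record passes a Control Activity.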
Event and Alert Log
Log containing all Event & Alert Notifiers and Records for a certain time period. The time period is to be defined in Security Management or another Service Process.
Key Concepts
Event & Alert Category
Events and alerts can be distinguished by what they signify:
- regular operations as defined by service SLAs: e.g. notifications of user log in
- exceptions to regular operation as defined by service SLAs: e.g. notifications of incorrect user log in attempts
- unusual but not irregular behaviour as defined by SLAs: e.g. notifications of decreasing server capacity
This means that events and alerts do not only stand for incidents or problems but also support the monitoring of regular work. What counts as „normal“ versus „not normal“ behaviour of a system or component needs to be defined for each service and each component. These sets of rules are the control rules for the monitoring of events and alerts.
Event & Alert Monitoring
Two basic monitoring patterns can be defined:
- Active monitoring of components and services: Pro-active actions to determine the behaviour of the system or service
- Passive monitoring of components and services: Passive acknowledgment of the behaviour of the system or service
Event & Alert Significance
Categorization of events & alerts is done individually by each organization. In general, three basic categories can be defined:
- Informational: Events not requiring actions and not representing exceptions
- Warning: Events generated when a threshold is reached by a system or service. In those cases an action is recommended.
- Exception: Events providing the information on malfunction of a system or component. Incident or Problem Management should be informed.
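The three categories above can be illustrated for a single measured value. This is a minimal sketch under stated assumptions: the names `Significance` and `classify_metric` are hypothetical, and treating a second, higher threshold as the point at which a malfunction is assumed is a simplification of this example, not a rule from the process.

```python
from enum import Enum


class Significance(Enum):
    INFORMATIONAL = "informational"  # no action required
    WARNING = "warning"              # threshold reached, action recommended
    EXCEPTION = "exception"          # malfunction, inform Incident/Problem Management


def classify_metric(value: float, warning_at: float, exception_at: float) -> Significance:
    """Map a measured value to a significance using two assumed thresholds."""
    if warning_at > exception_at:
        raise ValueError("warning threshold must not exceed exception threshold")
    if value >= exception_at:
        return Significance.EXCEPTION
    if value >= warning_at:
        return Significance.WARNING
    return Significance.INFORMATIONAL
```

For example, with a warning threshold of 80 and an exception threshold of 95 (say, percent disk usage), a value of 85 would be classified as a warning.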
Event & Alert Controls
To be defined
Process
High Level Process Flow Chart
This chart illustrates the Event & Alert Management process and its activities as well as the status model reflected by the Event & Alert Record evolution.
Critical Success Factors
Critical Success Factors (CSF):
- Define correct level of filtering of events and alerts
- Close integration of Event & Alert monitoring in the Service Management Processes
- Define thresholds together with Service Design and Service Operations in a trial-and-error manner
- Use appropriate tools for Event & Alert Monitoring supporting the automated monitoring of systems and services
Performance Indicators (KPI)
- Number of Events & Alerts
- Number of Events & Alerts that required human intervention
- Number of Events & Alerts that could be solved without human intervention
- Number of Events & Alerts that defined incidents and problems
- Number of Events & Alerts caused by Known Errors
- Number of Events & Alerts duplicated or repeated
- Number of Events & Alerts indicating issues in other Service Operation Processes: Availability, Continuity, Performance etc.
Each indicator can be evaluated per service, per user, per significance, per customer, per location, per category, … for a given time period.
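A few of these indicators can be derived from the recorded data by simple grouping. The sketch below is illustrative only: the function name `kpi_counts` and the record keys `service`, `significance` and `human_intervention` are assumptions, and it covers just the total, manual and automated counts.

```python
from collections import Counter
from typing import Dict, Iterable


def kpi_counts(records: Iterable[dict], group_by: str = "service") -> Dict[str, Dict[str, int]]:
    """Count Events & Alerts per group, split by human intervention."""
    total = Counter()
    manual = Counter()
    for record in records:
        key = record[group_by]
        total[key] += 1
        if record["human_intervention"]:
            manual[key] += 1
    return {
        key: {
            "total": total[key],
            "manual": manual[key],
            "automated": total[key] - manual[key],
        }
        for key in total
    }
```

Restricting the input to the records of a given time period before calling the function yields the indicators for that period.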
Process Trigger
Event Trigger
- Any Event & Alert
- from an Event & Alert monitoring sub-process
For more information see the relevant process interface description in the Process Interfaces Overview
Time Trigger
- None (Event & Alert Management is typically event-triggered)
Process Specific Rules
- every event in an IT service or in the IT infrastructure triggers the creation of a new Event & Alert Record.
- the Event & Alert Agent is responsible for documenting each activity in the Event & Alert Record
- the Event & Alert Owner has to control the Event & Alert Agent
- the Event & Alert Owner and Event & Alert Agent can only transfer their duties if the new person or group agrees
- then the subsequent Event & Alert Owner or Event & Alert Agents have to be recorded in the appropriate attribute in the Event & Alert Record.
- the Event & Alert Owner and Event & Alert Agent should preferably be a person rather than a group.
- refer to the Service Description and Service Level Agreement in order to take service specific and customer specific rules into account for the definition of thresholds and actions
Note: for the different types of rules see Rules.
Process Activities
Event & Alert Authorization and Recording
In regular IT operations, permanent system and service monitoring is in place, generating Events & Alerts. Events & Alerts raised by automated tools and manual operations are displayed, recorded and logged for classification. The first activity is the authorization check: is the system or person providing the Event & Alert authorized to provide that information? Then the Event & Alert Record is filled in and checked for completeness and formal correctness; i.e. it is verified that all mandatory information has been provided. If this is not the case, or if the system or person generating the Event & Alert is not authorized to send Events & Alerts in general, the Event & Alert Record can be rejected. The Event & Alert Record is then shifted into the status „recorded-accepted“ or „recorded-rejected“.
Activity Specific Rules
- Event & Alert provider is set to the person or system that triggered the Event & Alert
- Event & Alert sponsor is set to the person who triggered the Event & Alert if no other sponsor is known
- Event & Alert agent is set to „Event & Alert Management Team“ if no person is available as Event & Alert agent
- Event & Alert owner is set to „Event & Alert Management Team“ if no person is available as Event & Alert owner
- Event & Alert description: The description must contain a meaningful description of the Event & Alert.
- process instance which triggered the Event & Alert: The respective process instance should be referenced, e.g. by the Unique Identifier of the Event & Alert (see also Event & Alert Management)
- affected service: At least one Service should be defined, although a service cannot always be determined.
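The authorization check, the completeness check and the default assignments of this activity can be sketched as follows. This is an illustration under assumptions, not a prescribed implementation: the function name `record_event`, the dictionary keys and the set of mandatory fields in `MANDATORY_FIELDS` are hypothetical, as the process does not fix which fields are mandatory.

```python
from typing import Iterable

MANDATORY_FIELDS = ("provider", "description", "time")  # assumed set of fields
DEFAULT_ASSIGNEE = "Event & Alert Management Team"


def record_event(notifier: dict, authorized_providers: Iterable[str]) -> dict:
    """Authorize and record an incoming Event & Alert.

    Returns a new record dict whose status is „recorded-accepted“ or
    „recorded-rejected“ (written here as plain strings).
    """
    record = dict(notifier)
    provider = record.get("provider")

    # Sponsor defaults to the provider if no other sponsor is known.
    if not record.get("sponsor"):
        record["sponsor"] = provider
    # Agent and owner default to the team if no person is available.
    for role in ("agent", "owner"):
        if not record.get(role):
            record[role] = DEFAULT_ASSIGNEE

    reasons = []
    if provider not in set(authorized_providers):
        reasons.append("provider not authorized")
    missing = [name for name in MANDATORY_FIELDS if not record.get(name)]
    if missing:
        reasons.append("missing mandatory fields: " + ", ".join(missing))

    record["status"] = "recorded-rejected" if reasons else "recorded-accepted"
    record["rejection_reasons"] = reasons
    return record
```

Keeping the rejection reasons in the record is a design choice of this sketch; it makes it possible to see later why a record was rejected.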
Event & Alert Classification and Filtering
The classification of an Event & Alert aims at providing a control factor for the Event & Alert, which will be used for factual decision making with respect to Event & Alert handling. In order to classify an Event & Alert, its category, significance and content are considered and compared with the Service Description. In case of:
- Informational events and alerts – the event is logged, no further steps are required
- Exceptional events and alerts – the event is forwarded to Incident, Problem or Change Management, where an RFC, Incident Record or Problem Record is generated
- Warning events and alerts – the event proceeds to the next process step
The most challenging activity within Event & Alert handling is the classification and filtering of Events & Alerts. Thresholds set too high can lead to dangerous situations in which issues are noticed too late; thresholds set too low lead to an overload of information and event messages, exceeding the processing capacity of monitoring tools and staff.
Activity Specific Rules
- set thresholds for Event & Alert filtering according to the Event & Alert Expert and the SLAs of the services
- if the significance and category of the Event & Alert is
- Informational – the event is logged, no further steps are required
- Exceptional – the event is forwarded to Incident, Problem or Change Management, where an RFC, Incident Record or Problem Record is generated
- Warning – the event proceeds to the next process step
- the Event & Alert agent is set to the Event & Alert builder.
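The routing rule of this activity amounts to a lookup by significance. A minimal sketch, assuming the significance is available as a plain string; the names `Route` and `route_event` are hypothetical:

```python
from enum import Enum


class Route(Enum):
    LOG_ONLY = "log only"
    FORWARD = "forward to Incident, Problem or Change Management"
    NEXT_STEP = "continue with next process step"


_ROUTES = {
    "informational": Route.LOG_ONLY,
    "exceptional": Route.FORWARD,
    "warning": Route.NEXT_STEP,
}


def route_event(significance: str) -> Route:
    """Return the handling route for a classified Event & Alert."""
    key = significance.strip().lower()
    if key not in _ROUTES:
        # Unknown significances are rejected rather than silently logged.
        raise ValueError(f"unknown significance: {significance!r}")
    return _ROUTES[key]
```

Raising an error for an unknown significance is a choice of this sketch; an organization might prefer to treat such events as warnings so that they are not lost.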
Event & Alert Documentation
In this activity, the Event & Alert is described on the basis of the Event & Alert record. This description is logged in the Event & Alert log.
After a successful verification of actions taken in case of events forwarded to other Service Delivery Processes, the Event & Alert record is shifted into the status „documented“.
Activity Specific Rules
- Event & Alert agent has to update documentation
- Event & Alert owner has to verify the documentation
- If the verification is NOT OK, the Event & Alert agent is requested to improve the documentation
- Process Interface: Information about Services and Configuration Items affected by the Event & Alert is updated in the CMDB with the help of Configuration Management
Event & Alert Evaluation and Closure (PIR)
In this activity, the success of an implemented Event & Alert reaction is evaluated. For this purpose, a Post Implementation Review (PIR) is performed according to the regular level of SLA quality defined in the Service Description, and its results are recorded. If the PIR does not attest a successful Event & Alert counter action, a rollback must be triggered and Change Management needs to be informed. The Event & Alert record is shifted into the status „closed-verified“ or „closed-failed“, depending on the success of the Event & Alert counter action as reflected by the PIR and on the necessity of a rollback.
The following evaluation questionnaire should be considered for the PIR:
- Does the Event & Alert classified as an issue still occur?
- Did the Event & Alert counter action meet the desired targets?
- Was the Event & Alert counter action implemented in time?
- Did any incidents occur while the Event & Alert counter action passed through the process?
- Was the Event & Alert counter action performed without exceeding the assigned budget?
- Has the Event & Alert been documented and the CMDB (in some cases) updated accordingly?
- Did everyone involved in the Event & Alert counter action stick to the process?
- Was any information missing, which was needed to make a decision at any stage of the process?
Activity Specific Rules
- at the beginning the Event & Alert agent is set to Event & Alert & Fallback Tester
- if the task „testing“ is successful, the Event & Alert agent is set to Event & Alert Owner
- the test result has to be recorded
- if the task „testing“ is NOT successful,
- the Event & Alert agent is set to Event & Alert Implementer
- result is passed to Change Management Process
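The closure rules above can be sketched as a single state transition. This is an illustration under assumptions: the function name `close_after_pir` and the dictionary keys are hypothetical, and mapping a successful test directly to „closed-verified“ and a failed one to „closed-failed“ simplifies the PIR, which also takes the necessity of a rollback into account.

```python
def close_after_pir(record: dict, test_successful: bool, test_result: str) -> dict:
    """Return a closed copy of the record based on the outcome of testing."""
    updated = dict(record)
    # The test result has to be recorded in either case.
    updated["test_result"] = test_result
    if test_successful:
        # The agent is set to the Event & Alert Owner.
        updated["agent"] = updated["owner"]
        updated["status"] = "closed-verified"
    else:
        # The agent is set to the Event & Alert Implementer and the
        # result is passed to the Change Management Process.
        updated["agent"] = "Event & Alert Implementer"
        updated["status"] = "closed-failed"
        updated["forward_to"] = "Change Management"
    return updated
```

The function returns a copy instead of modifying its input, so the record state before closure remains available, for example for the Event & Alert Log.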