Problem Management

Description/Summary

The objective of Problem Management is to remedy incidents permanently. This objective is achieved by reactive and preventive actions:

Reactive Problem Management analyses the causes of incidents that have occurred and develops proposals for eliminating those causes.
Preventive Problem Management supports the prevention of incidents before they occur and before they can become a major incident. This is achieved by analysing IT services for their weak points and providing proposals to remove those weaknesses.

Moreover, Problem Management:

Performs continuous quality improvement of IT infrastructure by preventing Incidents
Locates, records, tracks and solves structural defects
Develops a „Knowledge Database“
Submits a Request for Change (RFC) to improve the IT infrastructure

Problem Management must establish a close relationship with Change Management.

Objectives

The purpose of Problem Management is to establish standardized procedures for analysing IT services for possible weaknesses in the delivery of their defined SLAs, and for analysing incidents that might develop into major issues for the defined IT services.

Problem Management contributes to an integrated Service Management approach by achieving the following goals:

locating the root causes of problems, and
consequently preventing incidents.

Roles & Functions

Problem Management Specific Roles

Static Process Roles

See Problem Management section, Service Management Roles x Person for details

Problem Process Owner: Initiator of the process, accountable for defining the process strategic goals and allocating all required process resources. See Continual Process Improvement Management for a detailed description of these activities.

Problem Process Manager (Problem Manager): Manager of the entire process, responsible for its effectiveness and efficiency. Team leader of the function „Problem Management Team“. See Continual Process Improvement Management for a detailed description of these activities.

Problem Management Team: Team associated with the Problem Management Process.

Senior Management: Senior Management of the IT provider

Dynamic Process Roles

The groups or persons undertaking these roles are assigned dynamically during the Problem Management Process. See the process-specific or activity-specific rules for details.

Problem Owner: The attributes in the records contain the value of the Role/Function currently accountable for the problem (but NOT for the Problem Management Process). The Problem Owner can be changed with the help of a Hierarchical Escalation.

Problem Agent: The attributes in the records contain the value of the Role/Function currently responsible for the activity or task within an activity of the problem. The Problem Agent can be changed with the help of a Functional Escalation, if permitted by the Rules.

Diagnosis Team: This is the team which searches for the root cause of a problem.

Solution Team: This is the team which tries to find a solution that will eliminate the root cause of the known error. This team is built using staff from the Diagnosis Team.

Service Specific Roles

Roles depending on the affected service are found in the Service Description. The Service Description, including the service-specific roles, is delivered by Service Portfolio Management.

For an example, refer to Example: Service Description

Service Expert/Service Specialist

They can consult and/or act as

Problem Agent
Diagnosis team
Solution team

Service Owner: Person responsible for the service as defined in the Service Description.

Customer Specific Roles

Roles depending on the affected customer(s) are found in the Service Level Agreement (SLA). The Service Level Agreement for customer specific roles is maintained by the Service Level Agreement Management.

Customer(s): Customers of the affected Service with a valid SLA

For an example, refer to Example: Service Level Agreement Management

Information Artifacts

This section describes information/data required or recorded by the process. In general, a process record (here: the problem record) contains all information relating to the execution of a process. It also represents the current progress of that process. Additional information items (artifacts) can typically be derived from one or more (up to all) process records by filtering, merging, correlating or interpreting their information (sometimes with reference to other data and information sources).

Trend Analysis Record

Trend Analysis Design ID
Trend Analysis Requester
Trend Analysis Description
Trend Analysis Design Agent
Trend Analysis Agent
Trend Analysis Verification Agent
Trend Analysis Service Expert
Trend Analysed Service: Service regarded
Trend Analysis Base Data
Trend Analysis Relevant incident IDs and problem IDs

Problem Record

The Problem Record (Problem Ticket) is the record holding any management-relevant information and the history of a specific problem.

Unique identifier : Problem ID
Problem caller: Name of the Person triggering the Problem
Problem owner: Refer to Roles section
Process instance : Unique identifier of the Process instance triggering the Problem Management Process. This can be a Problem Instance also.
Status : Status of the problem. Is set when passing a Control Activity
Services : Service(s) affected by this problem
Customer : Customer(s) affected by this problem
Problem Description : The description of the problem
Diagnosis Team: see roles description
Scheduled starting time/ date diagnosis: Defined time when a diagnosis activity should start
Solution Team: see roles description
Scheduled starting time/ date solution: Defined time when a solution building activity should start
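
The attributes listed above could be held in a simple data structure. The following is a minimal sketch in Python; the class name, field names, types and example values are illustrative assumptions and are not prescribed by the process.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class ProblemRecord:
    """Hypothetical in-memory form of the Problem Record (Problem Ticket)."""
    problem_id: str                          # Unique identifier
    caller: str                              # Person who raised the record
    owner: str                               # Role/function currently accountable
    process_instance: str                    # ID of the triggering process instance
    status: str = "new"                      # Set when passing a Control Activity
    services: List[str] = field(default_factory=list)
    customers: List[str] = field(default_factory=list)
    description: str = ""
    diagnosis_team: List[str] = field(default_factory=list)
    diagnosis_start: Optional[datetime] = None   # Scheduled start of diagnosis
    solution_team: List[str] = field(default_factory=list)
    solution_start: Optional[datetime] = None    # Scheduled start of solution building


# Example values are invented for illustration only.
record = ProblemRecord(
    problem_id="P-0001",
    caller="jane.doe",
    owner="Problem Management Team",
    process_instance="INC-0042",
    services=["Email"],
)
```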

Key Concepts

Problem & Known Error

A problem is the unknown cause of one or more (potential) incidents. In the case of a Known Error, the cause is known, but the solution need not be known.

Trend Analysis

Trend Analysis reviews reports on negative indicators from other processes, especially Incident Management, Availability Management and Capacity Management. When negative indicators are found, a detailed analysis of that topic can be performed. For example: Based on certain incident statistics, it can be seen that the number of incidents related to the Email service during a certain time period is higher than in the time periods before. An analysis is provided. Results of the analysis may be:

The number of users increased by the same ratio in the same time period. The increase in incidents correlates with the increased number of users. No additional steps are necessary, OR
One of the mail servers did not operate at the agreed availability level. The increase in incidents correlates with these mail server issues, which could be caused by, for example, old hardware. Together with Availability Management, the next steps are determined in order to fix the issue.
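
The first outcome in the example above can be checked by comparing the incident rate per user across periods rather than the raw incident counts. The following is a minimal sketch in Python; the function names, the figures and the 10 % tolerance are assumptions made for illustration.

```python
def incidents_per_user(incidents: int, users: int) -> float:
    """Incident rate normalised by the number of users in the period."""
    if users <= 0:
        raise ValueError("users must be positive")
    return incidents / users


def increase_explained_by_users(prev, curr, tolerance=0.10):
    """Return True if the per-user incident rate stayed within `tolerance`.

    `prev` and `curr` are (incidents, users) tuples for two periods.
    The tolerance is an assumed value, not part of the process definition.
    """
    prev_rate = incidents_per_user(*prev)
    curr_rate = incidents_per_user(*curr)
    return abs(curr_rate - prev_rate) <= tolerance * prev_rate


# Hypothetical figures: incidents rose from 120 to 150, users from 1000 to 1250.
same_ratio = increase_explained_by_users((120, 1000), (150, 1250))
# Hypothetical figures: incidents rose from 120 to 200 with no new users.
needs_analysis = not increase_explained_by_users((120, 1000), (200, 1000))
```

In the first case the rate per user is unchanged, so no further steps would be needed; in the second case the rate has risen and a detailed analysis would follow.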

Workaround (Temporary Solution)

Problem Management provides a temporary solution to Incident Management when the problem cannot be solved. Incident Management can also provide this temporary solution itself. In that case, Incident Management must inform Problem Management so that Problem Management can document the solution in the Workaround Record.

Hierarchical Escalation

This escalation is performed when the Problem Management Process cannot be performed in the defined way. The escalation target is the „Problem Manager“. An escalation to the Problem Manager is possible throughout the life cycle of the Problem Ticket.

Problem Controls

Status – Definition

new – a possible problem was defined
created – the problem was described (overview provided)
registered – the problem was described in detail and classified
classified – all control data values of the problem data set were set
diagnosed – cause of problem was identified and the Known Error was defined
error registered – all control data values of the known error data set were set
found solution – one solution from a number of solution alternatives was chosen
recovered – solution for the problem was implemented
failed – solution DID NOT pass the PIR
solved – solution DID pass the PIR
aborted – analysis and problem solving were stopped. The problem data set did not pass through the whole life cycle of the problem record
end – problem investigation and solution process received their final appraisal

Process

Critical Success Factors

Incident Management and Problem Management cooperate and have precise fields of responsibility
Access to well trained and experienced staff from the IT units
Access to internal and external databases to avoid Incidents
Reasonable and continuous Trend Analysis

Problem Trend Analysis

Trend Analysis uses the analysis of collected incident and problem information in order to define a pattern, or trend, in the information so that it can predict future incidents and problems in the service operation. Statistical methods are often used.

High Level Process Flow Chart

This chart illustrates the Problem Management Trend Analysis process and its activities, as well as the status model reflected by the Problem Record evolution.

Performance Indicators (KPI)

Accuracy of trend analysis
Transparency of used methods and tools

Process Triggers

Event Trigger

The process is not event triggered.

Time Trigger

Trend analysis is performed on collected data within defined time periods.

Process Specific Rules

All changes to the trend analysis activity need to be documented
All changes to the trend analysis activity need to be communicated

Note: for the different types of rules, refer to Rules.

Process Activities

Design Trend Analysis

This activity facilitates the design of trend analysis for Problem Management. The first step is to define and monitor relevant data. This analysis is done based on the reports from Report Management. Diverse trend analysis operations are possible, therefore the most suitable method needs to be chosen. This could be any of the following:

Least-squares method
Random data method
Trend plus noise analysis
R-squared analysis
Correlation analysis
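
As an illustration of two of the methods listed above, the following sketch fits a least-squares trend line to a series of incident counts and reports the R-squared value of the fit. It uses only the Python standard library; the function name and the monthly figures are assumptions made for illustration.

```python
def least_squares_trend(values):
    """Fit y = slope * x + intercept to values indexed 0..n-1.

    Returns (slope, intercept, r_squared).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two data points")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_tot = sum((y - mean_y) ** 2 for y in values)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, values))
    # A constant series has no variance to explain; treat the fit as perfect.
    r_squared = 1.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


# Hypothetical monthly incident counts for one service.
monthly_incidents = [40, 44, 47, 52, 55, 61]
slope, intercept, r_squared = least_squares_trend(monthly_incidents)
```

A positive slope with an R-squared value close to 1 would indicate a steady upward trend that is worth a closer look; which values count as significant is a matter for the thresholds defined in this activity.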

This activity includes the following steps:

Definition of the problems and incidents that are to be regarded
Collection and analysis of diverse reports in order to determine the values defined above
Using this analysis, providing a definition of values for trend analysis
Assignment of Report Management to collect relevant data in adequate reports
Definition of thresholds for reports
Definition of counter actions in the case of exceeded thresholds

Activity Specific Rules

Analysis Requester is set to the person who triggered the analysis
Analysis Sponsor is set to the person who triggered the analysis, if no other sponsor is known
Analysis Owner is responsible for trend analysis
Analysis Agent is the Analysis Owner
Analysis Description must contain a meaningful description of the problem.
Affected Service must define at least one Service

Conduct Trend Analysis

In this activity, the defined analysis is performed using the reports generated by Report Management. In the case of data exceeding the thresholds, counter actions will need to be taken.

Activity Specific Rules

Analysis agent is set to Service Expert
Monitor reports from Reporting Management
Analyse these reports
Define counteractions
If the analysis needs to be changed, return to Analysis Design

Verify Trend Analysis

The analysis is audited in this activity. In the case of data exceeding the thresholds, counter actions will need to be taken by Problem Management. Problem Management will be informed by a Ticket. This activity is performed jointly by the Service Owner and the Service Expert.

Activity Specific Rules

Monitor reports from Reporting Management
Analyse these reports
Define counteractions
In the case of exceeded thresholds, open a Problem Ticket
If the analysis needs to be changed, return to Analysis Design
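
The threshold rule above can be sketched as a comparison between report values and their defined thresholds. This is a minimal illustration in Python; the metric names, values and ticket texts are hypothetical, and opening an actual Problem Ticket is left to whatever ticketing tool is in use.

```python
def exceeded_thresholds(report, thresholds):
    """Return the metrics in `report` whose value exceeds its threshold.

    Both arguments map a metric name to a number; metrics without a
    defined threshold are ignored.
    """
    return sorted(
        name for name, value in report.items()
        if name in thresholds and value > thresholds[name]
    )


# Hypothetical report values and thresholds for two services.
report = {"email_incidents": 61, "vpn_incidents": 12}
thresholds = {"email_incidents": 50, "vpn_incidents": 20}

# One Problem Ticket would be opened per exceeded threshold.
tickets_to_open = [
    f"Trend threshold exceeded: {metric}"
    for metric in exceeded_thresholds(report, thresholds)
]
```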

Problem Handling

High Level Process Flow Chart

This chart illustrates the Problem Management process and its activities. In addition, it shows the status model reflected by the Problem Record evolution.

Performance Indicators (KPI)

Definition of Performance Indicators

Open Problem

An open problem has one of the following statuses:

new
created
registered
classified

Closed Problem

This is a problem which is NOT open.

Known Errors

A Known Error has one of the following statuses:

diagnosed
error registered
found solution
recovered
failed
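
The three definitions above amount to simple membership tests on the problem status. A minimal sketch in Python, with function and constant names chosen for illustration:

```python
OPEN_STATUSES = {"new", "created", "registered", "classified"}
KNOWN_ERROR_STATUSES = {
    "diagnosed", "error registered", "found solution", "recovered", "failed",
}


def is_open(status: str) -> bool:
    return status in OPEN_STATUSES


def is_closed(status: str) -> bool:
    # Per the definition above, a closed problem is any problem that is not open.
    return not is_open(status)


def is_known_error(status: str) -> bool:
    return status in KNOWN_ERROR_STATUSES
```

Read literally, these definitions mean that every Known Error also counts as a closed problem, which should be kept in mind when the indicators are compared.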

Duration of Problem Investigation

Time stamp diagnosed – Time stamp new

Duration of Solution Definition

Time stamp solved – Time stamp diagnosed
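
Both durations are plain differences between status time stamps. The following sketch assumes that the time stamp of each status change is stored per problem; the dictionary layout and the dates are invented for illustration.

```python
from datetime import datetime, timedelta


def investigation_duration(stamps: dict) -> timedelta:
    """Time stamp diagnosed minus time stamp new."""
    return stamps["diagnosed"] - stamps["new"]


def solution_duration(stamps: dict) -> timedelta:
    """Time stamp solved minus time stamp diagnosed."""
    return stamps["solved"] - stamps["diagnosed"]


# Hypothetical status time stamps of a single problem.
stamps = {
    "new": datetime(2024, 3, 1, 9, 0),
    "diagnosed": datetime(2024, 3, 4, 15, 30),
    "solved": datetime(2024, 3, 11, 10, 0),
}

investigation = investigation_duration(stamps)  # 3 days, 6:30:00
solution = solution_duration(stamps)            # 6 days, 18:30:00
```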

Definitions of KPIs

All indicators need to be analysed per service, per customer and per priority. The Problem Manager is responsible. Typical basic time period settings for this analysis are:

1 Month
1 Quarter
1 Year

Key Performance Indicators (KPI)

OP = („No. open problems in time period“/ „No. open problems in previous time period“) – 1
CP = („No. closed problems in time period“/ „No. closed problems in previous time period“) – 1
KE = („No. known errors in time period“/ „No. known errors in previous time period“) – 1
UE = („No. unknown errors in time period“/ „No. unknown errors in previous time period“) – 1
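
All four indicators share the same form: the count in the current period divided by the count in the previous period, minus one. A minimal sketch in Python follows; the function name and the counts are illustrative, and raising an error when the previous period has a count of zero is an assumption, since that case is not defined here.

```python
def period_change(current: int, previous: int) -> float:
    """Relative change between two periods: (current / previous) - 1."""
    if previous == 0:
        # Assumption: the indicator is treated as undefined in this case.
        raise ZeroDivisionError("previous period count is zero; KPI undefined")
    return current / previous - 1


# Hypothetical counts for two consecutive months.
op = period_change(current=18, previous=24)  # -0.25: 25 % fewer open problems
cp = period_change(current=30, previous=20)  # 0.5: 50 % more closed problems
```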

Process Trigger

Event Trigger

This is a request on problem analysis from:
- the Incident Management Process
- the Problem Management Process itself, when an analysis on Problem Management is needed

For more information, see the relevant Process Interface Description in the Process Interfaces Overview

Time Trigger

None (Problem Management is typically event-triggered)

Process Specific Rules

Every Problem triggers the creation of a new Problem Record.
The Problem Agent is responsible for documenting each activity in the Problem Record.
The Problem Owner must control the Problem Agent.
The Problem Owner and Problem Agent can only transfer their duties if the new person or group agrees.
The subsequent Problem Owner and Problem Agents must be recorded in the appropriate attribute in the Problem Record.
The Problem Owner and Problem Agent should preferably be a person rather than a group.
Refer to service specific and customer specific rules in order to take the Service Description and Service Level Agreement into account.

Note: for the different types of rules, refer to Rules.

Process Activities

Problem Recording

The Problem Record is checked for completeness and formal correctness, i.e. it is verified that all mandatory information on the form has been provided by the requester. If this is not the case, or if the requester is not authorized to send Problem Records, the problem can be rejected. Otherwise, the Problem is documented and the Problem Record is filled in.

Activity Specific Rules

Problem requester is set to the person who triggered the problem.
Problem sponsor is set to the person who triggered the problem, if no other sponsor is known.
Problem agent is set to „Problem Management Team“, if no other person is available as Problem agent.
Problem owner is set to „Problem Management Team“, if no other person is available as Problem owner.
Problem description must contain a meaningful description of the problem.
Problem priority is set to „3 (Middle)“ if no other priority is known.
Affected service must define at least one Service.
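
The recording rules above can be read as a set of mandatory checks plus default values. The sketch below shows one possible way to apply them in Python; the dictionary keys, the function name and the error messages are assumptions made for illustration.

```python
DEFAULT_TEAM = "Problem Management Team"


def apply_recording_defaults(record: dict) -> dict:
    """Return a copy of `record` with the recording rules applied.

    Raises ValueError if a mandatory rule is violated.
    """
    result = dict(record)
    requester = result.get("requester")
    if not requester:
        raise ValueError("the problem requester must be set")
    if not (result.get("description") or "").strip():
        raise ValueError("a meaningful problem description is required")
    if not result.get("services"):
        raise ValueError("at least one affected service is required")
    defaults = {
        "sponsor": requester,      # requester acts as sponsor if none is known
        "agent": DEFAULT_TEAM,
        "owner": DEFAULT_TEAM,
        "priority": 3,             # "3 (Middle)"
    }
    for key, value in defaults.items():
        if result.get(key) is None:
            result[key] = value
    return result


# Hypothetical minimal record as submitted by a requester.
recorded = apply_recording_defaults({
    "requester": "jane.doe",
    "description": "Mail delivery is delayed",
    "services": ["Email"],
})
```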

Problem Classification

The classification of a Problem aims to provide a control factor for the Problem. It is also used in Problem scheduling in order to make factual decisions. In order to classify a Problem, its priority and category are determined. The Problem priority often depends on the urgency and impact of related incidents and problems, and so reflects how important it is to resolve the Problem. The category additionally depends on the risk of the Problem, particularly with regard to IT service operations and potential disruptions or degradations in quality. A recorded and accepted Problem needs to be categorized and is then processed according to its resulting category. Moreover, the Diagnosis Team is defined and the starting time/date for the activity „Problem Diagnosis“ is set. The Diagnosis Team should represent a variety of skills.

A standard Problem („Category 0“) is preauthorized and can be processed simply by creating the record. For a minor Problem („Category 1“), a Risk Analysis and Proof of Concept are undertaken with the assistance of the Delivery Processes. The category of a Problem can later be updated if the result of the Risk Analysis or the Proof of Concept calls for it. Otherwise, the existing category value of the Problem is confirmed.

If the Problem’s category has been changed, then the sub process Problem Classification will be restarted and the Problem is now handled according to its new category.

The next activity depends on whether the category is confirmed or updated.

Activity Specific Rules

Problem Owner is set to „Service Owner“ or to „Problem Manager“ when a service is not defined.
Problem Agent is set to Problem Staff.
The outcome of the „Risk Analysis“ and the „Proof of Concept“ must be documented in the Problem record.
The „Problem Priority“ is defined according to the outcome.
The „Problem Category“ is defined according to the risk as diagnosed in the Risk Analysis
„Diagnosis Team“ is defined
Start time/ date of diagnosis is set as well
If the category is „1 – Minor -“ then the Problem Authority is set to Service Owner
If the category is „2 – Significant -“ then the Problem Authority is set to Problem Manager
If the category is „3 – Critical“ then the Problem Authority is set to Senior Management (who consults Problem Manager)
After selecting the Problem Authority, the Problem Agent is then set to Problem Authority
At the end of the activity the Problem Agent is set to Problem Owner
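
The assignment of the Problem Authority by category is a fixed lookup. A minimal sketch in Python, with names chosen for illustration; the rules above only cover categories 1 to 3, so the sketch treats every other value (including the preauthorized „Category 0“) as not covered, which is an assumption.

```python
PROBLEM_AUTHORITY = {
    1: "Service Owner",       # 1 - Minor
    2: "Problem Manager",     # 2 - Significant
    3: "Senior Management",   # 3 - Critical (consults the Problem Manager)
}


def problem_authority(category: int) -> str:
    """Return the Problem Authority for a category between 1 and 3."""
    if category not in PROBLEM_AUTHORITY:
        raise ValueError(f"no Problem Authority defined for category {category}")
    return PROBLEM_AUTHORITY[category]


# After the authority is selected, the Problem Agent is set to it.
agent = problem_authority(2)
```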

Problem Diagnosis

Problem Diagnosis is the most complex activity of the Problem Management Process and facilitates the quick resolution and restoration of disrupted service(s). In this activity, the Incident is investigated on the basis of the available information about incident symptoms, affected services, CIs and users and considers additional information about related incidents, errors, information from the CMDB and technical/specialist knowledge.

The first action in this sub process is to check whether the classification is correct, i.e. is the relevant service addressed?

If not, the incident is set to “recorded – accepted” and returned to Initial Investigation. This is because Incident Investigation is organized by service and incorrect service classification makes further work impossible.

If the classification is correct, then it is further checked to ensure that the initial support is correct, i.e. have all relevant check lists been checked?

If the initial support was wrong, then the incident is returned and set to “classified”.

If the initial support was correct, then the second level must perform an investigation of the incident and try to solve it.

If a diagnosis is found, then the incident is set to “diagnosed”. Describe the diagnosis and the diagnosis approach in the record. Describe unsuccessful approaches as well.

If a solution cannot be found, then the incident is escalated again, set to “diagnosed – failed” and forwarded to an external provider of the service (when possible). In this case, the “Incident Investigation” will need to be started again.

Activity Specific Rules

Problem Agent is set to Diagnosis Team
Check classification
- If classification was not correct – set status to „recorded – accepted“ and return case to Initial Investigation
Check whether initial support was correct
- If initial support was not correct – set status to „classified“ and return case
Perform diagnosis of problem
- If no diagnosis is provided – escalate problem
Record the description of the diagnosis and the diagnosis approach in the record
Problem Agent is set to Problem Owner

Problem Registration of Known Errors

The Problem Owner checks the description of Problem cause and (if it exists) the description of a workaround. They will then approve these for Incident Management. In addition, they will define a solution team, which may be defined using the diagnosis team as a basis. The time schedule for a solution definition is also defined.

Activity Specific Rules

Start: The Problem Record is in the status diagnosed.
End: The Problem Record is either in the status error registered OR failed.
Provide starting time for solution building
Define solution team

Problem Solution Building

In this activity, one or more options for the solution of the problem are defined. All alternatives need to be documented. Finally, one alternative must be selected as the preferred solution. The RFC contains the preferred solution and the other solution alternatives. The Problem Owner can change the solution team or stop the process.

Activity Specific Rules

Problem Agent is set to Solution Team
Provide description for preferred solution and the description of other solution alternatives

Problem Recovery

Problem Recovery is performed in the Change Management Process. That process is supported by this activity during the Building, Testing and Implementing of Solutions.

At the end of the activity, the Problem record is shifted into the status „recovered“.

If Configuration Items are needed which are not yet released, then the Release Management process is also triggered. Release Management is responsible for testing the Configuration Items, including their functionality and accuracy, as well as for testing that they are compatible and operable with other Configuration Items for that Release and Problem.

The main task of this activity is to approve the implementation provided by the Change Management Process.

Activity Specific Rules

A Problem Builder is defined from the Expert/Specialist Group.
A Problem & Fallback Tester is defined from the Expert/Specialist Group.
A Problem Implementer is defined from the Expert/Specialist Group.
The Problem Builder must be different from the Problem & Fallback Tester.
The Problem Implementer must be different from the Problem & Fallback Tester.
At the beginning of this task, the Problem agent is set to Problem Builder.
After the task „building“, the Problem agent is set to Problem & Fallback Tester.
If the task „testing“ is successful, the Problem agent is set to Problem Implementer.
An Implementation Plan is recorded and made available to Change Management and Release Management.
The Implementation Plan is tested, and the test results are stored in the Problem Record.
A Fallback Plan is recorded and made available.
The Fallback Plan is tested, and the test results are stored in the Problem Record.
Functionality test procedures are described.
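
The two separation rules above (the Tester must be neither the Builder nor the Implementer) can be checked before the roles are assigned. A minimal sketch in Python; the function name, the messages and the example names are assumptions made for illustration.

```python
def validate_recovery_roles(builder: str, tester: str, implementer: str) -> list:
    """Return a list of rule violations (empty if the assignment is valid)."""
    violations = []
    if builder == tester:
        violations.append(
            "Problem Builder must differ from Problem & Fallback Tester"
        )
    if implementer == tester:
        violations.append(
            "Problem Implementer must differ from Problem & Fallback Tester"
        )
    return violations


# The rules do not forbid the Builder and the Implementer being the same person.
ok = validate_recovery_roles(builder="alice", tester="bob", implementer="alice")
bad = validate_recovery_roles(builder="alice", tester="alice", implementer="carol")
```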

Problem Documentation

In this activity, the Problem solution is described using information from the Problem and Incident records.

After being successfully verified, the Problem record is shifted into the status „solved – documented“.

Activity Specific Rules

The Problem Agent (which is the Problem Implementer at the moment) triggers the Configuration Management Process in order to update Configuration Items and/or relations in the CMDB.
The Problem Implementer and Configuration Librarian CAN be the same person.
The Problem Owner must verify the documentation; consequently, the Problem Agent is set to the Problem Owner.
If the verification is NOT accepted, the Problem Agent is reset to Problem Implementer in order to improve the documentation.
Process Interface: Information about Services and Configuration Items affected by the Problem is updated in the CMDB with the help of Configuration Management.

Problem Evaluation and Closure (PIR)

In this activity, the success of an implemented Problem solution is evaluated by means of a Post Implementation Review (PIR), and its results are recorded. If the PIR does not attest to a successful Problem resolution, then a rollback must be triggered. Depending on the success of the Problem solution as reflected by the PIR (and the necessity of a rollback), the Problem record is shifted into the status of either „closed-verified“ or „closed-failed“.

The following evaluation questionnaire should be considered for the PIR:

Did the Problem solution meet the desired targets?
Was the Problem solution implemented in time?
Did any incidents occur while the Problem passed through the process?
Was the Problem solution performed without exceeding the assigned financial budget?
Has the Problem solution been documented and the CMDB updated?
Did everyone involved in the Problem solution follow the process?
Was any information missing which was needed to make a decision at any stage of the process?
Have the stakeholders been informed about Problem progress in accordance with the RASCI chart?

Activity Specific Rules

At the beginning of this activity, the Problem agent is set to Problem & Fallback Tester.
If the task „testing“ is successful, the Problem agent is set to Problem Owner.
The test result has to be recorded.
If the task „testing“ is NOT successful, the Problem agent is set to Problem Implementer