Process Mining in Application Maintenance and Support — Part 2

Sandeep Raizada
9 min read · Jul 18, 2020

In the previous part we highlighted how Process Mining can help answer some of the daily issues of Application Maintenance and Support.

In this article we review the actual logs from a Bank’s service desk application. Much of this data is common to “any” service desk application, though the nomenclature may differ at times. We will discuss filtering the logs for relevance and size to obtain a reasonable computation time, reverse engineer the process from these logs, and determine some bottlenecks.

Background of process and data (in-brief)

The case study focuses on Application Maintenance and Support of a large Bank’s IT environment. A first-line support team registers queries and issues from internal users as “Interactions”. If an interaction cannot be closed (that is, an acceptable solution provided to the user), it is created as an “Incident” and escalated to the next level(s) of specialists for resolution; the user finally closes the incident. This is a standard ITSM (IT Service Management) process implemented using a popular service management tool.

The files provided by the customer have records from 2012 to March 2014; the logs are most detailed between Sept 2013 and March 2014.

It is interesting to see how much about the business process can be discovered or inferred from application logs with minimal user interaction. This enables a detailed, question-driven discussion with process owners, rather than attempting to elicit the process from scratch and hoping they have an elephant’s memory to recall every scenario that exists. It also ensures that requirement gathering, development changes/re-platforming and testing become comprehensive. Let us get a quick sense of the shared data.

There are 4 large data files shared by the client. Presented below is the structure of each, where the first column indicates the field, the next column holds a sample row of data, and the last column records a comment/explanation. At the bottom of each structure is a summary highlighting important aspects of the data file, as it is cumbersome to present the entire dataset.

The first table contains details of the interactions, that is, users reaching the first line of support for issue resolution and the information that gets recorded. If an issue does not get resolved at the interaction level, it gets reported and recorded as an incident; the details captured for incidents are in the second structure. Details of changes moved to the production system are reported in the third table. The last table provides the activity details for an incident from the client’s service desk tool.

Lay of the Land — Available Information and Structure

Interaction Information

Summary

Total count of interactions: 147004

Interactions that do not have a reference to “Related Incidents”: 94250*

First Call Resolution (FCR): 93996

*A shortcoming: when an interaction is linked to more than one incident, the field is captured as “#MULTIVALUE” and lacks references to those incidents.

Incident information

Summary

Total count of incidents: 46809

Incidents that do not have a reference to “Related Interactions”: 317*

Incidents that do not have a reference to “Related Change”: 46249*

*Data gaps should not be construed as missing data; these fields may be blank for a reason. But they are important to note.

Change information

Summary

Total count of Changes: 30275

Changes with related Incidents: 28327

Changes with related Interactions: 30273

Detailed log of incidents

Summary

IncidentActivity_Types: 38

Total count of incidents: 25249 (out of 35146 in total — Requests for Information, complaints and others are ignored for this analysis)

Total count of events: 539576

Data Cleansing

This step invariably consumes a large percentage of the analysis time. Just as in data mining, it is essential here. Why?

  1. To make the data manageable: event logs are huge, and many applications tend to lag (or become unresponsive) if this is not done.
  2. To focus on the essentials and create a process description that is concise and easy to comprehend.

Though incidents and changes are recorded from 2012 to 2014, high activity is only observed between Sept ’13 and March ’14, so we can focus on just this period.

Incidents recorded in Production
Changes planned and moved to Production

“Detailed log of incidents” contains information about incidents, but it also needs filtering to keep incident records and remove others such as Request for Information.

“Detailed log of incidents” can be filtered by attributes — e.g. incidents with reassignments or with reopen activity. Though the tools have the capability to work with large amounts of data, computation time can sometimes become lengthy (and you will exhaust all the coffee at home), so filtering helps.
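As a minimal illustration of such filtering in pandas (a sketch only; the file name and column names such as “Incident ID”, “IncidentActivity_Type”, “RecordType” and “DateStamp” are assumptions and will differ per export):

```python
import pandas as pd

# File and column names are assumptions; adjust to the actual export.
log = pd.read_csv("detailed_incident_activity.csv", sep=";")
log["DateStamp"] = pd.to_datetime(log["DateStamp"], dayfirst=True)

# 1. Keep only the high-activity window (Sept 2013 to March 2014).
log = log[(log["DateStamp"] >= "2013-09-01") & (log["DateStamp"] < "2014-04-01")]

# 2. Keep only incident records (drop Request for Information, complaints, etc.),
#    assuming the export carries a record-type column.
log = log[log["RecordType"] == "incident"]

# 3. Keep only incidents that contain at least one "Reassignment" activity.
reassigned_ids = log.loc[
    log["IncidentActivity_Type"] == "Reassignment", "Incident ID"
].unique()
filtered = log[log["Incident ID"].isin(reassigned_ids)]

print(f"Events kept: {len(filtered)}, incidents kept: {filtered['Incident ID'].nunique()}")
```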

Data in some of these files runs to about 500,000 lines, so building an information summary is easier with a light database like SQLite. Some of us may be tempted to work on these files in Excel, Google Sheets, or LibreOffice Calc; in that case you may have enough time to learn a new hobby by the time the tool responds. A judicious call should be taken in picking the right tool and using it for its strength. There is no silver bullet or universal tool — choose wisely to reach your objective in the “quickest” time.
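For instance, a small sketch of the SQLite route (hypothetical file, table and column names): load the large file once via pandas, then run quick aggregate queries instead of opening it in a spreadsheet.

```python
import sqlite3
import pandas as pd

# Push the ~500,000-row file into SQLite once.
conn = sqlite3.connect("servicedesk.db")
pd.read_csv("detailed_incident_activity.csv", sep=";").to_sql(
    "incident_activity", conn, if_exists="replace", index=False
)

# A quick summary: events and distinct incidents per activity type.
summary = pd.read_sql_query(
    """
    SELECT IncidentActivity_Type,
           COUNT(*)                      AS events,
           COUNT(DISTINCT [Incident ID]) AS incidents
    FROM incident_activity
    GROUP BY IncidentActivity_Type
    ORDER BY events DESC
    """,
    conn,
)
print(summary.head(10))
conn.close()
```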

Learning is about knowing when to touch that button, experience is about knowing when not to touch that button!

The Process reverse engineered — (Now I know what you did last summer)

We started this discussion to get a quick sense of the system support process. Using process mining tools like ProM, Celonis, etc., we can reverse engineer the application logs to build a model like the one below. This is done using the detailed incident logs.
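The figures in this article come from such tools. As a rough, purely illustrative code equivalent (not the method used here), the open-source pm4py library can discover a directly-follows graph from the same log; the column names below are assumptions.

```python
import pandas as pd
import pm4py

# Assumed column names, mapped onto pm4py's case/activity/timestamp keys.
df = pd.read_csv("detailed_incident_activity.csv", sep=";")
df["DateStamp"] = pd.to_datetime(df["DateStamp"], dayfirst=True)
df = pm4py.format_dataframe(
    df,
    case_id="Incident ID",
    activity_key="IncidentActivity_Type",
    timestamp_key="DateStamp",
)

# Discover and render a directly-follows graph (frequency view).
dfg, start_activities, end_activities = pm4py.discover_dfg(df)
pm4py.view_dfg(dfg, start_activities, end_activities)
```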

Incident flow

Above is a state chart (similar to, but not the same as, a process flow — we will refer to it as a process flow for ease of understanding). Some quick observations and inferences from this flow:

  1. Please zoom into the visual above, else the lettering will look like billboards on the Moon.
  2. The start of the process is represented at the top by a “play”-like button in a circle, while the end of the process is represented by a square in a circle at the bottom.
  3. The black rectangles you will notice in the flow chart are state-chart representations.
  4. Color of the boxes — the more often an activity was part of an incident flow, the darker the shade of blue. “Assignment” is dark blue, indicating it occurred in 100% of flows, i.e. in every flow.
  5. Dark/darker flow lines indicate the paths taken most often. The number by a flow line indicates the frequency, and the same as a percentage is shown in brackets.

Green highlight box:

  • These activities appear to be events that “change” the status of an incident record.
  • Impact and urgency changes make up about 1% of transactions. The inference is that the initial assessment is accurate and rarely changes over the life-cycle of an incident. Excellent!
  • Reopen occurs in a mere 0.1% of flows — indicating that rework after closure is minimal. Admirable!

Blue highlight box:

  • Activities pertaining to managing external vendors (possibly product vendors) — for cases when fixes are sought from them.

Red highlight box:

  • Activities that are “more directly” related to fixing incidents.
  • Reassignment — this happens in over 30% of the incidents.
  • There are also “self-loops” here — pointing to multiple reassignments within an incident resolution — in over 60% of the incidents.
  • A filtered log reveals incidents with over 47 reassignments within a single resolution (see the sketch below).
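A minimal sketch of how such reassignment counts could be extracted from the log (column and file names assumed, as before):

```python
import pandas as pd

log = pd.read_csv("detailed_incident_activity.csv", sep=";")

# Count "Reassignment" events per incident and sort worst-first.
reassignments = (
    log[log["IncidentActivity_Type"] == "Reassignment"]
    .groupby("Incident ID")
    .size()
    .sort_values(ascending=False)
)

print("Incidents with at least one reassignment:", len(reassignments))
print(reassignments.head(10))  # the extreme cases (e.g. 40+ reassignments)

# Pull the full traces of those extreme incidents for a manual review.
extreme_traces = log[log["Incident ID"].isin(reassignments[reassignments > 40].index)]
```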

Ok, so what does that mean? (we have a ping pong match going on here)

Thank you for the reassignment information — but is it possible to determine the material impact on incident resolution time? Yes, so let us filter the log and redo this area of the flow. We also create “nodes” that combine (roll up) a set of activities, thereby reducing the count of paths. Grouped activities are represented as hexagons in the flow below. This affects a couple of factors, fitness and precision, that determine a sound process model, but it gets us there quickly; our objective is to find the broader issues and bottlenecks. There are two views below: one shows the flow with incident counts on the connectors, the other shows time (median) on the connectors. Why median? Only to ensure that we choose a middle point rather than an average, which outliers may tilt.
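As an illustration only, here is a sketch of rolling activities up into grouped nodes and computing median times between them. The grouping and column names are invented for the example and are not the actual grouping behind the figures.

```python
import pandas as pd

# Illustrative roll-up: map detailed activity types onto a few grouped nodes.
groups = {
    "Assignment": "Assignment & routing",
    "Reassignment": "Assignment & routing",
    "Operator Update": "Work in progress",
    "Status Change": "Work in progress",
    "Closed": "Closure",
}

log = pd.read_csv("detailed_incident_activity.csv", sep=";")
log["DateStamp"] = pd.to_datetime(log["DateStamp"], dayfirst=True)
log["Node"] = log["IncidentActivity_Type"].map(groups).fillna("Other")
log = log.sort_values(["Incident ID", "DateStamp"])

# For each directly-follows pair of grouped nodes, compute the median waiting time;
# the median is less sensitive to outliers than the mean.
log["NextNode"] = log.groupby("Incident ID")["Node"].shift(-1)
log["WaitHours"] = (
    log.groupby("Incident ID")["DateStamp"].shift(-1) - log["DateStamp"]
).dt.total_seconds() / 3600

median_wait = (
    log.dropna(subset=["NextNode"])
    .groupby(["Node", "NextNode"])["WaitHours"]
    .median()
    .sort_values(ascending=False)
)
print(median_wait)
```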

“Reassignment” — happens in 18,654 incidents (out of 46,809 in total), i.e. about 40% of the time. A reassignment can indicate a change of users/support levels — obviously an activity that adds time to getting an incident resolved. The lower the better (0 is the target), meaning the person first assigned also closes the incident in consultation with the user.

Below we present two views of the same flow. The first one uses time as the basis to “color” the connectors, while the second one uses the count of incidents that flow through a connection as the basis for connector thickness.

Incident Flow view — # Incidents on Nodes and time on Connectors
Incident Flow view — thickness of connector depends upon incident flow through it

To establish the material impact, we can draw the same flow separately for incidents that “had” reassignment activity and those that did not. The two views below should help us answer that question.

Incidents with Reassignments — Time view
Incident closure time with Reassignments

The mean time required to reach closure when incidents have reassignments in the flow is 76 hrs, which equates to 3 days and 4 hrs, assuming that teams work in shifts to provide 24-hr support and incidents are handed over between teams and their members. It is visible in the graph that most cases fall into this category. There are some outlier cases that took 392 days to reach closure; it is even possible to extract the logs of these flows to review the incidents.

Incidents without Reassignments — Time view
Incident closure time without Reassignments

The above flow is for incidents that have no reassignments; these take a mere 18 hours. There are outliers here too, taking over 9K hours. Our intention is to look at what happens most often and ignore the outliers in both cases.
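For completeness, a sketch of how such closure times could be compared outside the mining tool (the figures above were produced in the tool itself; file and column names are assumptions):

```python
import pandas as pd

log = pd.read_csv("detailed_incident_activity.csv", sep=";")
log["DateStamp"] = pd.to_datetime(log["DateStamp"], dayfirst=True)

# Closure time per incident: last event timestamp minus first event timestamp.
durations = log.groupby("Incident ID")["DateStamp"].agg(["min", "max"])
durations["hours"] = (durations["max"] - durations["min"]).dt.total_seconds() / 3600

# Split incidents by whether their trace contains a Reassignment activity.
durations["reassigned"] = (
    log.assign(is_rea=log["IncidentActivity_Type"].eq("Reassignment"))
    .groupby("Incident ID")["is_rea"]
    .any()
)

# Mean, median and spread of closure time for the two groups.
print(durations.groupby("reassigned")["hours"].describe())
```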

Some inferences from above:

  1. To do: reduce the reassignment of incidents across teams. Action: determine the reason(s) for reassignment that can be worked upon. One of these may be wrong queue assignment, or handover across teams covering 24x7 support.
  2. To do: reduce reassignments within teams. Action: determine the reason(s) — is it a skill gap?

There are solutions that assist the support personnel, but we will not go down that path in this article.

Recap

Did we answer any of the issues listed in Part 1?

Recap Table — part 2

What’s coming up in Part 3

Have the changes to the production system improved the quality of service delivery?

Determine service components and teams that are a part of the process bottleneck.

Some pointers that make for easier process elicitation.

Last but not least, a predictive model to estimate the expected incidents when a change is moved to production.
