Process Mining in Application Maintenance and Support — Part 3

Sandeep Raizada
10 min read · Jul 24, 2020

In the previous part, process mining enabled us to answer the questions raised in Part 1. In this article, the last of the series, we evaluate the following:

Have the changes to the production system improved the quality of service delivery?

Determine service components and teams that are a part of the process bottleneck.

Some pointers that make for easier process elicitation.

Last but not least, a predictive model to estimate expected incidents when a change is moved to production.

What else?

Oh, yes.

Changes to a production system are expected to achieve the following objectives:

  1. "New functionality" or enhancements
  2. Changes that correct or stabilize erroneous behavior.

The rationale and tracking of changes/enhancements is not the topic of this article. But can we establish whether service delivery quality improved over time?

As changes move to the production environment, they most often lead to a spike in interactions and incidents. Support organizations need to plan extra resources for a period of time to manage the expected spike. Can we determine that period?

To answer these questions, we determine when changes moved to production and which interactions and incidents were subsequently recorded. In principle this can be determined using the relationship

Interaction → Related Incident → Related Change

between the data sets provided. But when this linkage is "queried", only 868 interactions and incidents turn out to have links to changes. This represents about 1.8% of the total interaction and incident data provided, a sample size too small for any meaningful analysis.
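For anyone who wants to reproduce this check, a minimal pandas sketch of the linkage query is below. The file and column names (interactions.csv, RelatedIncident and so on) are placeholders chosen for illustration, not the actual field names in the shared data.

```python
import pandas as pd

# Placeholder file and column names; the shared data uses different labels.
interactions = pd.read_csv("interactions.csv")   # InteractionID, RelatedIncident, ...
incidents = pd.read_csv("incidents.csv")         # IncidentID, RelatedChange, ...
changes = pd.read_csv("changes.csv")             # ChangeID, ...

# Follow Interaction -> Related Incident -> Related Change.
linked = (
    interactions
    .merge(incidents, left_on="RelatedIncident", right_on="IncidentID", how="inner")
    .merge(changes, left_on="RelatedChange", right_on="ChangeID", how="inner")
)

total_records = len(interactions) + len(incidents)
print(f"Fully linked records: {len(linked)}")
print(f"Share of all interactions and incidents: {len(linked) / total_records:.1%}")
```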

Inference: the interlinking of interactions and/or incidents to changes has not been recorded every time. Possible reasons: this change to the service desk operating procedure was introduced later, recording the link is not mandatory and hence gaps exist, or there is a process compliance gap. Any of these could be the reason.

An approximation using "time as the basis" can be used instead. Knowing when a change moved to production, we find the interactions and incidents that follow it. The interlinking is done via the service component, the assumption being that this field is recorded and updated at closure; we see from the process flow that it is updated before closure. Side note: this interlinking gap is observed in many organizations, and "time as the basis" is often used for analysis.
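A rough sketch of this time-based matching is below, assuming a changes table with a production date and an incidents table with an open timestamp, both carrying the service component. All names are illustrative.

```python
import pandas as pd

# Assumed frames: changes(change_id, service_component, prod_date),
# incidents(incident_id, service_component, open_time). Names are illustrative.
changes = pd.read_csv("changes.csv", parse_dates=["prod_date"])
incidents = pd.read_csv("incidents.csv", parse_dates=["open_time"])

WINDOW = pd.Timedelta(days=5)  # the "wide enough" scan window

def incidents_following(change_row: pd.Series) -> pd.DataFrame:
    """Incidents on the same service component opened within WINDOW after the change."""
    mask = (
        (incidents["service_component"] == change_row["service_component"])
        & (incidents["open_time"] >= change_row["prod_date"])
        & (incidents["open_time"] < change_row["prod_date"] + WINDOW)
    )
    return incidents.loc[mask]

# Example: incidents attributed (by time) to the first change in the table.
print(incidents_following(changes.iloc[0])[["incident_id", "open_time"]])
```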

We intend to determine the size of the spike: the number of incidents (height) multiplied by the duration for which the spike exists (width). A decrease in either reduces the size of the spike and hence indicates better service quality. Yes, but it is only one dimension and not the complete picture. So let us move on.

Incident spike

The image explains the thinking. At T0, the change is moved to the production system. At T1, users start logging new incidents for that change and service component.

  1. Method A: look for the incident volume returning to the "steady state", that is, the count of "open" incidents before the change was applied.
  2. Method B: scan for incident records in a "wide enough" time window and stop when no new incidents are recorded for a few consecutive days.

The latter approach is used, as it removes the dependency on mean time to repair. That is just an opinion; nothing stops us from choosing Method A over Method B, or from using both approaches and comparing results. With this we determine the point T2. The duration (T2 − T1) is the duration of the spike; we will call this the "Day_spread".
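Here is one possible implementation of Method B as a sketch: walk forward from T1 one day at a time and stop once a couple of consecutive days pass without new incidents. The window and quiet-day thresholds are arbitrary knobs for illustration, not values prescribed by the analysis.

```python
import pandas as pd

def day_spread(incident_times: pd.Series, t1: pd.Timestamp,
               max_days: int = 30, quiet_days: int = 2) -> int:
    """Method B sketch: scan day by day from T1 and stop after `quiet_days`
    consecutive days with no new incidents. Returns Day_spread = T2 - T1 in days."""
    per_day = incident_times.dt.normalize().value_counts()
    quiet = 0
    t2 = t1.normalize()
    for offset in range(max_days):
        day = t1.normalize() + pd.Timedelta(days=offset)
        if per_day.get(day, 0) == 0:
            quiet += 1
            if quiet >= quiet_days:
                break
        else:
            quiet = 0
            t2 = day  # last day on which a new incident was logged
    return (t2 - t1.normalize()).days

# Example with made-up timestamps: the last new incident appears 3 days after T1.
times = pd.Series(pd.to_datetime(
    ["2014-01-02 09:00", "2014-01-02 11:30", "2014-01-03 10:00", "2014-01-05 16:00"]))
print(day_spread(times, pd.Timestamp("2014-01-02")))  # -> 3
```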

The chart below is built using a 5-day time window. The vertical axis shows the number of days for which new incidents were logged following the change. A crowded chart makes comprehension difficult.

Incident recorded corresponding to change

So here is the summary: 90% of changes have related incidents created within the first 4 days of releasing the change to production.

Day_spread — incidents

We repeat a similar exercise for interactions, as the resources involved are typically different. No surprise: the Day_spread for interactions is lower than for incidents. This aligns with what we know; interactions, if not resolved, lead to incidents being logged. 90% of changes have interactions recorded within the first 4 days of release to production.

Day_spread — Interactions

That answers an aspect of resource planning — the duration is about 4 days.

Let’s get back to “sizing” the spike and below is a plot of the same.

Size of Spike over Time

The bar graph above visualizes the area of the spike, and it shows a decline from Jan '14 onward. Another metric that corroborates this is the ratio of the total count of incidents to the total count of changes on a monthly basis. This also shows a downward trend, indicated by the red line in the graph.
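To make the calculation concrete, a small sketch is below. The numbers in the frame are made up purely for illustration; only the arithmetic (height times width per change, aggregated per month, plus the incidents-per-change ratio) mirrors what the charts show.

```python
import pandas as pd

# Illustrative data only: one row per change, with the month it went live,
# the incident count in its spike (height) and its Day_spread in days (width).
spikes = pd.DataFrame({
    "month": pd.PeriodIndex(["2013-11", "2013-12", "2014-01", "2014-02"], freq="M"),
    "incident_count": [120, 150, 110, 80],
    "day_spread": [4, 5, 3, 3],
})

# Size of the spike = height x width, aggregated per month.
spikes["spike_size"] = spikes["incident_count"] * spikes["day_spread"]
monthly_size = spikes.groupby("month")["spike_size"].sum()

# Corroborating metric: total incidents divided by total changes, per month.
monthly = spikes.groupby("month").agg(
    incidents=("incident_count", "sum"),
    changes=("incident_count", "size"),
)
monthly["incidents_per_change"] = monthly["incidents"] / monthly["changes"]

print(monthly_size)
print(monthly["incidents_per_change"])
```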

A watch-out: this is a "black-box" way of determining whether changes are improving service delivery. It requires more detailed analysis; this is only one dimension.

And there is more!

There are some more inferences that can be drawn from the available data. Let us look at the service components and analyze their first-call resolution (FCR). There are 63 Service Components that have not closed any call at the interaction level, a possible area to evaluate further. Below is a limited snapshot of the same:

FCR for Service Component

Total count of Service Components defined is 289.

There are 17 Service Components that show a high level of left shift. Left shift means moving more resolutions to the first line of support/self-service, enabling support teams to focus on ITSM activities and changes, and thereby driving down cost per incident. All interactions for these service components are closed by first-level support and no incidents are created.
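A sketch of how both groups can be pulled out of the interaction log is below. The flags closed_at_first_call and incident_created are assumed columns; the shared log encodes this differently, so treat this only as the shape of the query.

```python
import pandas as pd

# Assumed interaction log with illustrative columns:
# service_component, closed_at_first_call (bool), incident_created (bool).
interactions = pd.read_csv("interactions.csv")

fcr = interactions.groupby("service_component").agg(
    calls=("service_component", "size"),
    first_call_closures=("closed_at_first_call", "sum"),
    incidents_created=("incident_created", "sum"),
)
fcr["fcr_rate"] = fcr["first_call_closures"] / fcr["calls"]

# Components that never close a call at the interaction level (63 in this data).
no_fcr = fcr[fcr["first_call_closures"] == 0]

# Fully left-shifted components: every interaction closed at first level,
# no incidents created (17 in this data).
left_shifted = fcr[(fcr["fcr_rate"] == 1.0) & (fcr["incidents_created"] == 0)]

print(len(no_fcr), len(left_shifted))
```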

The top 10 service components by incident count are shown in the graph below. The highest number of incidents is logged for service component WBS000073 (12,901 incidents). An evaluation of this service is recommended, as its incident count is over 5 times that of the next-closest service component.

Incident by Service Component

Whodunnit? (sorry this is not that kinda thriller)

But to find out what dunnit, we go further into the data and look at the teams and their interactions. Why? Because reassigned incidents take about 4 times as long to resolve as regular incidents. We can dig deeper to understand which teams and which service components are at the center of these reassignments.

We extract the logs that have reassignments and plot service components against teams. The size and color of each bubble help focus attention on the top items: color tending toward red and larger size both indicate a high count, for easier comprehension. Hence, our focus should be on service components WBS000263, WBS000223, WBS000098, WBS000014 … and teams TEAM0031, TEAM0017, TEAM0003, TEAM0053 … These are the service components and teams with the highest count of reassignments.

Reassigned Incident plot
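The data behind such a bubble plot is simply a count of reassignment events per service component and team. A sketch follows; the column names and the activity label "Reassignment" are assumptions about the log layout.

```python
import pandas as pd

# Assumed incident activity log with illustrative columns:
# incident_id, service_component, assignment_group, activity.
log = pd.read_csv("incident_activity.csv")

# Keep only reassignment events and count them per service component x team.
reassigned = log[log["activity"] == "Reassignment"]
bubble = (
    reassigned.groupby(["service_component", "assignment_group"])
              .size()
              .reset_index(name="reassignments")
              .sort_values("reassignments", ascending=False)
)

# The top rows drive the plot: the largest and reddest bubbles come first.
print(bubble.head(10))
```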

We also need to look at teams that work closely together. There is a limitation with the shared data: we do not have organizational information. The full graph gets crowded, making it difficult to identify escalating teams. Below is a cluster of teams working together, focused on the service components referred to above.

Team interactions for reassigned Incidents

The plot below indicates teams in the following ways:

  1. Horizontal ellipses indicate more outgoing than incoming reassignments.
  2. Vertical ellipses indicate more incoming and fewer outgoing reassignments.
  3. Circular ones indicate a balance of both.

This should give a sense of the teams that are taking up the bulk of incidents/reassignments.

Team workload view

We evaluate the service components that create the maximum number of incidents. Possible inferences to draw:

  1. There is a high likelihood that these changes were insufficiently tested.
  2. Gaps possibly existed in the requirement enunciation and capture process.
  3. Changes to WBS000073 lead to over 5 times more incidents than the next highest service component (WBS000091); see figure: Incident by Service Component.

Many times a documented process differs from the actual implementation (documentation may have lagged), or users are using the process differently from the initial plan. Recalling all possibilities at the time changes are being discussed is hard. Ensuring that all variants of the As-Is process are enunciated for the developer or the implementation team in a re-platforming can be equally challenging.

Process Mining to rescue again (the Knight in shining armor?)

Process mining is of considerable assistance in using logs to construct the process, specifically when an enhancement or re-platforming is planned. We saw earlier how the service desk process was constructed from logs. I may not be wrong in saying that if someone had asked for the As-Is process, it would not have come back looking like this!

Incident Flow
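For anyone who wants to reproduce this kind of map, the open-source pm4py library can discover it straight from the activity log. The sketch below uses assumed column names, and pm4py is not necessarily the tool behind the figures in this series.

```python
import pandas as pd
import pm4py

# Assumed incident activity log; column names are illustrative.
df = pd.read_csv("incident_activity.csv", parse_dates=["timestamp"])
df = pm4py.format_dataframe(df, case_id="incident_id",
                            activity_key="activity",
                            timestamp_key="timestamp")
log = pm4py.convert_to_event_log(df)

# Discover the directly-follows graph plus start and end activities,
# which is essentially the incident flow map shown above.
dfg, start_activities, end_activities = pm4py.discover_dfg(log)
pm4py.view_dfg(dfg, start_activities, end_activities)
```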

Even a summary of the log gives us plenty of insights.

There are a total of 46,616 incidents (scenarios) from start to finish present in this log file.

They capture a total of 933,474 activities.

Summary log

The snapshot below (image truncated) lists the 39 activities, showing how many times each was captured (out of the 933,474 events) and the corresponding percentage.

Activities

The flow can start from 23 of these activities. This is particularly useful because we often forget this detail and miss possible start points in the As-Is process. A gentle reminder: a discussion with the process owners is required to finalize what starts the incident logging process. Many of these activities may start a change of state, but not the incident logging itself. To explain the point: the "Description Update" activity can happen when the incident is opened, or at any later time in the incident's life-cycle. "Description Update" therefore has a low likelihood of being the activity that starts the incident logging process; "Open" is more likely that event. Nonetheless, it reminds us just how many entry points this process has.

Start events

Logging of incidents may end with any of the 29 activities below. A gentle reminder: it is important to understand from the practitioners what constitutes the "close" of an incident. Many of these activities may only reflect the closure of an event, not of the incident. That is an important difference and should be discussed and understood with specialists and users.

End events
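All of the summary numbers above (case count, event count, activity frequencies, start and end activities) can also be recomputed from the raw activity log with a few lines of pandas. Column names are again placeholders.

```python
import pandas as pd

# Assumed incident activity log; column names are illustrative.
log = pd.read_csv("incident_activity.csv", parse_dates=["timestamp"])
log = log.sort_values(["incident_id", "timestamp"])

print("Cases:", log["incident_id"].nunique())   # 46,616 in the shared data
print("Events:", len(log))                      # 933,474 in the shared data

# Activity frequencies, absolute and as a share of all events.
freq = log["activity"].value_counts()
print(pd.DataFrame({"count": freq, "percent": (freq / len(log) * 100).round(2)}))

# First and last activity per case: candidate entry and exit points of the process.
print("Start activities:", log.groupby("incident_id")["activity"].first().value_counts())
print("End activities:", log.groupby("incident_id")["activity"].last().value_counts())
```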

In summary, process mining can be of tremendous assistance and is an essential component in uncovering all scenarios for developers and testers. This helps ensure all aspects are covered and tested, rather than being discovered during integration testing, user acceptance, or post go-live. We all know that late detection leads to more expensive fixes, with a production incident being the most expensive.

In a later series, I will focus on using process mining to determine process conformance. But that is a topic for later; this article only emphasizes the importance of process mining before the start of development and testing.

So can you tell the future? (machine learning is no crystal ball, but it is close)

How valuable would it be if we could predict the count of expected interactions and incidents when a change is released to production? Yes, that is also possible.

A basic estimation model I implemented requires the Service Component(s) that are in the upcoming change and the date when it is expected to go into production. I call it a basic model because getting to a more precise model requires a discussion to understand the Configuration Items (CIs) and their grouping under Service Components, applications, …. Well, I mis-state: it may be possible to infer this from correlations in the given data and then build a more complex model. But this article is not meant to focus on ML, so I will stop here.

So let's keep it simple and say the task is to predict the count of expected incidents when a change involving the WBS…XXX service component is going to be released on a dd-mm-yyyy date. I constructed an ensemble model to make this prediction.

To train the model I used incidents from Sept '13 to Feb '14, with incident data from Mar '14 used for model validation. In the process of tweaking, 142 models were evaluated before arriving at the final one.
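To give a flavor of what such a model can look like, here is a simplified sklearn sketch: a random forest (one kind of ensemble) over the service component and simple date features. This is not the exact model, feature set, or tuning I used, and the input file and column names are placeholders.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Assumed training frame: one row per (change, service component) with the
# release date and the incident count observed in its spike. Names are illustrative.
data = pd.read_csv("change_incident_counts.csv", parse_dates=["release_date"])
data["month"] = data["release_date"].dt.month
data["weekday"] = data["release_date"].dt.weekday

train = data[data["release_date"] < "2014-03-01"]   # Sept '13 to Feb '14
valid = data[data["release_date"] >= "2014-03-01"]  # Mar '14 for validation

features = ["service_component", "month", "weekday"]
model = Pipeline([
    ("encode", ColumnTransformer(
        [("sc", OneHotEncoder(handle_unknown="ignore"), ["service_component"])],
        remainder="passthrough")),
    ("forest", RandomForestRegressor(n_estimators=300, random_state=42)),
])
model.fit(train[features], train["incident_count"])

pred = model.predict(valid[features])
print("Estimated incidents:", pred.sum().round())
print("Actual incidents:", valid["incident_count"].sum())
```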

I will refrain from going into the metrics that define accuracy, but here are some of the outputs obtained. Errors exist, but the results were reasonably close.

Predicted vs Actual Incidents

We looked at a change consisting of the service components shown in the graph and obtained the expected ticket count per component as well as a total.

Total defects estimated: 175

Total defects actual (compared with data): 168

Error: 7

This shows how these results can be used for resource planning: the competency required for a specific service component and the number of human resources needed, while the Day_spread helps define the duration for which they are needed.

It is also possible to look at open incidents and estimate the date of completion. This is especially useful when re-prioritization is required.

Final Recap

We started with some questions in Part 1; let us revisit them.

Recap Table — part 3

I hope this series has been helpful. Your comments, including brickbats, are welcome. I can also be reached at sandeepraizada@gmail.com.
