Only 2…

That’s the number of KPIs you need to improve your service management for incident, change and problem management.

Only 2 you say? Yes, only 2. And they are:

  • Number of incidents (with a certain threshold, preferably 0)
  • Resolution time (with a certain threshold, preferably ASAP)
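As a rough sketch, both KPIs can be derived from nothing more than the open and resolve timestamps of each incident. The incident log, the thresholds and the `kpis` helper below are made up for illustration; your ticket system will have its own fields and export format.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (opened, resolved) pairs; None means still open.
incidents = [
    (datetime(2024, 3, 1, 9), datetime(2024, 3, 1, 17)),
    (datetime(2024, 3, 4, 10), datetime(2024, 3, 6, 10)),
    (datetime(2024, 3, 7, 8), None),
]

# Assumed thresholds, purely illustrative.
MAX_INCIDENTS = 0
MAX_RESOLUTION_TIME = timedelta(hours=24)


def kpis(incidents):
    """Return (number of incidents, average resolution time of the resolved ones)."""
    durations = [resolved - opened for opened, resolved in incidents if resolved]
    average = sum(durations, timedelta()) / len(durations) if durations else None
    return len(incidents), average


count, average = kpis(incidents)
print("Incidents:", count, "within threshold:", count <= MAX_INCIDENTS)
print("Average resolution time:", average,
      "within threshold:", average is not None and average <= MAX_RESOLUTION_TIME)
```

Note that the resolution time here is plain wall-clock time from opened to resolved, with nothing subtracted.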

Incident management
The fewer incidents you have, the better your services perform. There are 2 ways to decrease the number of incidents you have on your plate at any moment in time:

  • Prevent incidents from happening
  • Solve incidents faster

Actually, the number of incidents shouldn’t be a KPI for incident management on its own. Incident management starts when an incident happens, so it has “no” influence on the creation of incidents. It does have an influence on how fast the incidents are solved.

Problem management
The goal of problem management is to prevent incidents from happening again and/or elsewhere, so problem management directly influences the number of incidents you have. If you speed up your problem-solving process, you will have fewer incidents sooner.

Change management
We all know that a lot of incidents happen because a change wasn’t implemented correctly (the code itself, the way it was implemented, or whatever). If you perform changes really well, this directly influences the number of incidents. The faster and better you implement changes, the further the number of incidents will go down.

What are your KPIs? And can they be reduced to just these 2? Let me know and we’ll see if we can find a way 🙂

DoD for incidents

So, incident management and Scrum. Unplanned work vs. planned work. The sheer definition of things that pop up unexpectedly vs. sprints with fixed work that cannot be changed. ITIL vs. Agile. And how to connect both.

One of the ways to connect different ways of working with different goals is to try to speak the same language and align communication. With this goal in mind, I’m experimenting with what I call the Definition of Done for Incidents. This way, at the end of the incident management process, you know what needs to be done. And when asked the question: “Are you done done?”, you can answer: “Yes”.

I’m suggesting that an incident is done, if:

  • All the incident’s acceptance criteria (ACs) are met
  • (Temporary) fix is available on production
  • Live (temporary) fix is verified by reporter/enduser/customer
  • Service is back to normal service level (SLA)
  • Documentation is updated (if needed)
  • Problem ticket is created if temp fix or critical incident
  • The incident ticket is updated
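To make the conditional items explicit (“if needed”, “if temp fix or critical incident”), the list above could be sketched as a simple check. This is only an illustration: the `Incident` class, its field names and the `open_items` helper are hypothetical and not taken from any ticket system.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # Hypothetical fields, one per Definition of Done item or condition.
    acceptance_criteria_met: bool
    fix_on_production: bool
    fix_verified_by_reporter: bool
    service_level_restored: bool
    documentation_needed: bool
    documentation_updated: bool
    fix_is_temporary: bool
    is_critical: bool
    problem_ticket_created: bool
    ticket_updated: bool


def open_items(incident):
    """Return the Definition of Done items that are still open."""
    needs_problem_ticket = incident.fix_is_temporary or incident.is_critical
    checks = {
        "acceptance criteria met": incident.acceptance_criteria_met,
        "fix available on production": incident.fix_on_production,
        "fix verified by reporter": incident.fix_verified_by_reporter,
        "service back to normal level": incident.service_level_restored,
        # Only required when documentation actually needs an update.
        "documentation updated": incident.documentation_updated
        or not incident.documentation_needed,
        # Only required for a temporary fix or a critical incident.
        "problem ticket created": incident.problem_ticket_created
        or not needs_problem_ticket,
        "incident ticket updated": incident.ticket_updated,
    }
    return [name for name, ok in checks.items() if not ok]


def is_done(incident):
    """An incident is "done done" when no item is left open."""
    return not open_items(incident)


# Example: a temporary fix is live and verified, but no problem ticket exists yet.
example = Incident(True, True, True, True, False, False, True, False, False, True)
print(is_done(example), open_items(example))
```

Returning the open items instead of a bare yes/no keeps the conversation concrete: you can say exactly what is still missing.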

My questions to you are: What do you think? What am I missing? What is unnecessary? What other feedback do you have?

Hold on…

Customer: Regarding your SLA report.

You: Yes…

Customer: I know everything looks good in the report, but INC70034 seems off. It took you 3 weeks to solve it, but the report says it was still done within a week.

You: The report is correct. The total resolution time was within a week.

Customer: I can see what the report says, but we filed the incident 4 weeks ago and you only reported it done last week.

You: Let me see…

Customer: …

You: Ah I see, well, after 2 days we asked you for more information to solve the incident. It took you 5 days to respond, so we are not counting that in the resolution time. After getting the required information, we deduced the cause of the incident within a day and concluded what the fix should be. We developed the fix the next day, but we had to wait for a new release to implement it, and that was last week. So in total 5 business days of actual work on the fix and release, therefore a resolution time within the 7 days stated in the SLA.

Customer: Say what?

You: Yes, we place the ticket “On Hold” when we are waiting for something and that’s excluded from the SLA report calculations.

Customer: Again, say what?

You: The 5 days we were waiting for your response and the waiting time for the release are excluded from the SLA calculations.

Customer: But that’s not in the SLA…

You: Yes it is. Read amendment 3.2.3 to addendum 1.2.9 of the SLA.

Customer: I’m not happy…

Let it go

Not the conversation you want, but probably one you have had a couple of times. The customer is not happy and it is your own fault. You have what is called a watermelon SLA (green on the outside, red on the inside). A lot has been written about watermelon SLAs. Some, including me, even consider it to be fraud (you make it seem everything is OK, but it’s not). The usual solution is to redefine the KPIs: you redefine the report to get it to look green. That might fix the symptoms immediately, but you are not getting to the root cause of your problem.

I’m suggesting a different approach. Get rid of the “Pause” or “Waiting for” or whatever “On Hold” button or status you have in your incident flow.
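To see how much an “On Hold” status can hide, here is a minimal sketch that computes the resolution time both ways. The dates are a hypothetical timeline loosely based on the conversation above, counted in calendar days rather than business days to keep it simple, and `resolution_times` is an illustrative helper, not part of any reporting tool.

```python
from datetime import datetime, timedelta

SLA = timedelta(days=7)  # resolution time promised in the SLA


def resolution_times(opened, resolved, on_hold_periods):
    """Return (elapsed, reported): wall-clock time vs. time with on-hold periods excluded."""
    elapsed = resolved - opened
    paused = sum((end - start for start, end in on_hold_periods), timedelta())
    return elapsed, elapsed - paused


# Hypothetical timeline (calendar days, not business days).
opened = datetime(2024, 1, 1)
resolved = datetime(2024, 1, 22)  # 3 weeks after the incident was filed
on_hold = [
    (datetime(2024, 1, 3), datetime(2024, 1, 8)),    # waiting for the customer: 5 days
    (datetime(2024, 1, 10), datetime(2024, 1, 21)),  # waiting for the release: 11 days
]

elapsed, reported = resolution_times(opened, resolved, on_hold)
print("What the customer experienced:", elapsed.days, "days, within SLA:", elapsed <= SLA)
print("What the report shows:", reported.days, "days, within SLA:", reported <= SLA)
```

With these assumed dates the report shows 5 days (green) while the customer waited 21 days (red). Removing the on-hold status means only the first number is left.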

I hear you: the next time you produce the SLA report, it will probably turn red. But if something is wrong, for instance when it takes too long to solve an incident, you want the SLA report to turn red. The SLA report is a tool, an instrument, a document, a conversation with your customer that can help improve several things:

  • The efficiency of your (complex) problem solving skills and tools
  • A better alignment with your customer on when to respond
  • Clearer expectations on when something will be solved
  • Being alerted on where your constraints are in your incident flow

With every improvement on the red SLA, it turns greener and greener.

It’s gone

Customer: Hey, for this incident you did not meet the SLA, what’s going on?

You: Let me check. Ah I see, we were waiting for 5 days for your response to our question. And it took us some time to release the new fix.

Customer: About that question you asked, I never got it through the ticket system, I only received it after you escalated through email.

You: Huh? That’s weird, let’s see what’s going on there. And by the way, I can also tell you we shaved 3 days off our release time. So we might still not be as quick as we want the next time something happens, but we are getting there.