Sometimes it helps to go back to the basics, such as reminding ourselves of the point of incident response (IR). The answer is simple: to keep the business running. But that simplicity is deceptive. This is an incredibly heavy responsibility, as you know if anything has ever gone wrong in your ability to respond to a major incident. According to Gartner, every minute that your systems are down costs on average $5,600, adding up to more than $300,000 per hour. That’s a lot of money, and a lot of pressure.
At KT, we put our collective heads together and came up with seven best practices for ensuring the success of your IR program. They include some operational, some technical, some organizational suggestions, but all of them contribute to building a first-class IR team.
Why Incident Response?
ITIL describes an incident as any interruption or disturbance to normal IT services. We can make it more personal to your business, and say that an incident is any circumstance in which a system acts in a way that negatively impacts your customers. It doesn’t have to be an outright system crash. Take a slow-performing email system. Does that constitute an incident? Using our definition, you bet it does, as slow emails mean slower responses to customer service enquiries, delayed reactions to requests for proposals (RFPs), slowed product development, and a drag on just about every activity your business engages in for profit.
IR is your process for responding to these incidents (and incidents are different from problems, which we will discuss later). Successful IR—which means that it’s both fast and effective—results in improved worker and process efficiency, higher productivity, and, ultimately, higher revenues for the business. It really is a mission-critical operation.
7 best practices for superb incident response
Here are seven best practices that will fine-tune your IR team to make it top-performing.
1. Communicate, communicate, communicate
There has historically been a communication chasm between IT and the rest of the organization—particularly between IT and users. This raises problems when attempting to deliver great IR, because many, if not most, of your incidents will be reported by your users. They must have an easy way to do this reporting, so you hear about incidents as soon as possible. Then you have to keep them informed in real-time as you resolve the incident. All this is necessary to gain their trust so they will work more closely with you—a collaboration that is essential—in future incidents. For starters, open up multiple channels to let users raise tickets easily. For example, they should be able to alert the IR team via email, chat, a portal, or an enterprise social network like Yammer. You should also create self-service mechanisms so users can solve the easy incidents. Make self-service easily accessible and educate users about the benefits of self-help and using the knowledge base to resolve issues on their own.
Then, as the IR team works on fixing the incident, it’s essential to keep everyone apprised of progress in real-time. There are two pieces of information that should be prominently displayed at all times: the incident status (current resolution state, including estimated time of completion), and the priority of the incident (how important it is to resolve the incident relative to other incidents).
Automation can help, by sending automatic updates throughout the lifecycle of major incidents. Clear and visible notifications will also prevent users from raising duplicate tickets and overloading the help desk. Even if there’s nothing to report, tell your stakeholders that, on an hourly or half-hourly basis. And have a dedicated line to respond to major incidents immediately and offer support to anyone affected.
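To make the idea of automatic updates concrete, here is a minimal Python sketch. It is an illustration only, not a description of any particular help desk tool: the `IncidentStatus` fields, the 30-minute interval, and the message wording are all assumptions chosen for the example. It shows the two pieces of information discussed above (status with estimated completion, and priority) going out on a fixed schedule, even when there is nothing new to report.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Assumed cadence for the example: one update every half hour.
UPDATE_INTERVAL = timedelta(minutes=30)


@dataclass
class IncidentStatus:
    """Hypothetical record of what stakeholders need to see."""
    incident_id: str
    priority: str                 # e.g. "P1"; the naming scheme is an assumption
    state: str                    # current resolution state, e.g. "Investigating"
    eta: Optional[datetime]       # estimated time of completion, if known
    last_update_sent: datetime


def update_due(status: IncidentStatus, now: datetime) -> bool:
    """Return True once a full interval has passed since the last update."""
    return now - status.last_update_sent >= UPDATE_INTERVAL


def format_update(status: IncidentStatus, note: str = "") -> str:
    """Build the message; with no news, say so explicitly."""
    eta = status.eta.strftime("%H:%M") if status.eta else "not yet known"
    body = note or "No change since the last update."
    return (
        f"[{status.priority}] {status.incident_id}: {status.state}. "
        f"Estimated resolution: {eta}. {body}"
    )
```

A scheduler or chat bot would call `update_due` periodically and post the result of `format_update` to whichever channels your users already watch.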
2. Adopt DevOps Processes
Before DevOps became mainstream, the IR team was basically on its own. They, rather than the people who had actually built the systems, were responsible for all incidents. There was no feedback loop to the developers on how to fix repetitive interruptions to a particular application, for example. There was very little communication at all between the people who built the systems and the ones responsible for fixing them when things went wrong. Indeed, one reason that DevOps was created was to eliminate these organizational silos. This is essential because of the complexity of today’s systems: they are all interconnected, and what affects one is likely to affect others.
With a DevOps structure in place, developers do a better job in building their systems, because they now know they must also support them—no more throwing problems over the wall for another group to worry about. IR teams have support, and, typically—if DevOps is done right—clear documentation of how to keep complex systems up and running.
3. Sense when to “swarm”
Although most businesses have a “tiered” structure for dealing with incidents (Tier 1 is the help desk, Tier 2 involves application specialists, and Tier 3 is generally the system uber-experts and developers), you don’t want to universally enforce this structure when solving major incidents. You want to give your team the freedom to “swarm” when necessary.
This is usually necessary when an issue has a huge business impact. In such cases, you want to deviate from normal tiered IR processes. Swarming replaces that structure with a model of networked collaboration. It originated at Cisco, which wrote about it in its 2008 white paper, “Digital Swarming.” The concept was subsequently adopted by the Consortium for Service Innovation, and developed into a vision entitled “Intelligent Swarming.”
The general idea behind swarming is that instead of escalation, you bring everyone who might be able to help solve an incident into the IR team at the same time. There they brainstorm and bounce ideas off each other, and in general use the group dynamic to come up with fresh and innovative solutions to difficult IR issues.
Core principles of swarming include:
- The “tiers” of support are eliminated
- There is no escalation from one group to another—everyone who needs to be on the team is there from the beginning
- The case should be given directly to the person or persons most likely to be able to resolve it
- The person who takes the case is the one who sees it through to resolution.
4. Implement a Don’t-Let-It-Happen-Again policy
You should also take care not to be putting out the same fires over and over again. This means knowing the difference between IR and problem management. IR takes care of getting things back to normal, even if that means only a temporary fix. Problem management is when you find out the root cause of the incident, and fix it.
Note that you can never eliminate incidents entirely; that isn’t realistic. However, effective problem management lets you avoid having to fix the same problem repeatedly.
5. Get the problem statement and priority right
Probably the single most important thing you can do is understand and articulate what the incident involves. This is called incident classification, but you need to go beyond putting the incident into some basic category and specify the problem statement accurately and precisely. This should include such parameters as the system(s) impacted, the geographic location, how many internal users are impacted, and what the specific impact on business operations is.
Only when you have a clear problem statement can you set priorities. Proper classification helps in better troubleshooting and improving the resolution time. Then, prioritization ensures that the most business-critical issues are addressed first.
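As a rough illustration of how a precise problem statement feeds prioritization, the Python sketch below captures the parameters listed above in a small record and derives a priority from them. The scoring rule, the thresholds (10% and 50% of users), and the P1 to P4 labels are assumptions made up for this example; your own matrix should reflect what is actually business-critical for you.

```python
from dataclasses import dataclass
from typing import List

# Assumed three-level scale for business impact.
IMPACT_SCORE = {"low": 1, "medium": 2, "high": 3}


@dataclass
class ProblemStatement:
    """Hypothetical record holding the parameters named in the text."""
    systems: List[str]        # system(s) impacted
    location: str             # geographic location
    users_affected: int       # how many internal users are impacted
    business_impact: str      # "low", "medium" or "high"


def breadth_score(users_affected: int, total_users: int) -> int:
    """Score how widespread the incident is (thresholds are assumptions)."""
    if total_users <= 0:
        raise ValueError("total_users must be positive")
    share = users_affected / total_users
    if share >= 0.5:
        return 3
    if share >= 0.1:
        return 2
    return 1


def priority(statement: ProblemStatement, total_users: int) -> str:
    """Combine business impact and breadth into a label from P1 to P4."""
    score = IMPACT_SCORE[statement.business_impact] + breadth_score(
        statement.users_affected, total_users
    )
    if score >= 5:
        return "P1"
    if score == 4:
        return "P2"
    if score == 3:
        return "P3"
    return "P4"
```

Under these assumed rules, a high-impact email outage touching 600 of 1,000 users comes out as P1, while a low-impact wiki glitch touching 5 users comes out as P4, so sorting open incidents by this label puts the most business-critical ones first.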
6. Encourage a no-blame culture
This is essential. Rather than looking to point fingers if something goes wrong (either in the response itself, or in the underlying issue with a system), focus on the problem and on finding the true root cause, whatever that might be. Having a “blame-and-shame” culture does you no good, and can even slow down your response because people are so afraid of making mistakes.
7. Set the right KPIs and improve them
Key performance indicators (KPIs) are incredibly important because they measure how you’re doing, and give you a quantitative yardstick to use to see if you are improving. However, be careful about your KPIs. Some give false ideas of how well your IR team is performing and can cause you to prioritize the wrong things. For example, first call resolution (FCR), a common metric, measures how many incidents can be resolved with the first call. But sometimes that results in hasty decisions and actions when service quality is more important.
Therefore, set up realistic metrics and measure them for constant improvement. Here are some suggested KPIs to track:
- Incident volume (per issue category, priority, status, requester, etc.)
- Mean time to resolution
- Mean time to respond
- SLA compliance (%)
- Incidents resolved without escalation
- Average cost per incident
- Incident reopen rate
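Several of the KPIs listed above reduce to simple arithmetic over closed tickets. The Python sketch below shows one way to compute them; the `IncidentRecord` fields are hypothetical stand-ins for whatever your ticketing system exports, and the definitions (for example, measuring resolution time from when the ticket was opened) are assumptions you should align with your own SLAs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Union


@dataclass
class IncidentRecord:
    """Hypothetical export of one closed incident."""
    opened: datetime
    first_response: datetime
    resolved: datetime
    escalated: bool
    reopened: bool
    sla_met: bool


def mean_duration(durations: List[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty list of durations."""
    return sum(durations, timedelta()) / len(durations)


def percentage(count: int, total: int) -> float:
    return 100.0 * count / total


def compute_kpis(records: List[IncidentRecord]) -> Dict[str, Union[timedelta, float]]:
    """Compute a handful of the suggested KPIs from closed incidents."""
    if not records:
        raise ValueError("need at least one incident record")
    total = len(records)
    return {
        "mean_time_to_respond": mean_duration(
            [r.first_response - r.opened for r in records]
        ),
        "mean_time_to_resolution": mean_duration(
            [r.resolved - r.opened for r in records]
        ),
        "sla_pct": percentage(sum(r.sla_met for r in records), total),
        "resolved_without_escalation_pct": percentage(
            sum(not r.escalated for r in records), total
        ),
        "reopen_rate_pct": percentage(sum(r.reopened for r in records), total),
    }
```

Tracking these figures per period, rather than as one lifetime number, is what lets you see whether you are actually improving.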
Conclusion: Benefits of effective incident management
We all know the results of poor IR: the business suffers. By contrast, the benefits of doing IR right are manifold. You have smooth business operations. You achieve improved efficiency and productivity within the IT team as well as the wider organization. You have much higher user satisfaction as you maintain your SLAs. And, as you get better at IR, you can begin proactively identifying and preventing major incidents by spotting potential ones before they’re reported by users or customers. That’s a big win-win-win.
About Kepner-Tregoe
Kepner-Tregoe has been the industry leader in problem-solving and service-excellence processes for more than 60 years. The experts at KT have helped companies raise their level of incident- and problem-management performance through tools, training and consulting – leading to highly effective service-management teams ready to respond to your company’s most critical issues.