Companies and government entities around the world have discovered that cloud services can make their organizations faster, more scalable and more agile when responding to business change. With dynamic infrastructure environments such as Infrastructure as a Service (IaaS) and Platform as a Service (PaaS), and a wide variety of Software as a Service (SaaS) and third-party offerings distributed and operated in the cloud, there seems to be no limit to what companies can do and how quickly they can do it. While speed and technical agility benefit business functions that must operate quickly, they can cause serious issues for problem managers tasked with finding out “what happened” and “why it happened” when an issue disrupts business operations.
Tools and techniques of the past are inadequate for the speed of cloud
Problem managers have long relied on two core sets of operational data to help them diagnose the root cause of issues: dependency data and change data. With cloud services, problem managers still need dependency and change data, but now they need that operational data to be current, complete and accurate at all times. They also need operational data to stay in sync with every change made to the IT environment: if a cloud service spins up a new instance or reconfigures component dependencies, the problem manager needs to know about it as it happens.
In the past world of on-premises infrastructure and installed software packages, changes were made relatively infrequently (daily, weekly or monthly), so dependency and change data didn’t change very often. With cloud services, changes are made continuously, and dependencies may exist for only a few seconds before they are reconfigured as part of normal cloud optimization. If an incident causes an outage, the “snapshot” of the environment at the time of the incident may exist for only a fraction of a second. Legacy approaches to maintaining change and dependency records can’t realistically scale to capture an environment that changes that quickly.
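To make the “snapshot” idea concrete, here is a minimal sketch of how a dependency snapshot for a given instant could be rebuilt by replaying a stream of dependency changes. It assumes the cloud platform can export such a stream; the `DependencyEvent` record, its fields and the `snapshot_at` function are hypothetical names invented for this illustration and do not correspond to any particular vendor's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DependencyEvent:
    """Hypothetical record of one dependency change (not a real vendor schema)."""
    timestamp: float  # seconds since some reference point
    action: str       # "add" or "remove"
    source: str       # component that depends on something
    target: str       # component being depended on


def snapshot_at(events, when):
    """Replay events up to and including `when`; return the live dependency edges."""
    live = set()
    # sorted() is stable, so events sharing a timestamp keep their input order.
    for ev in sorted(events, key=lambda e: e.timestamp):
        if ev.timestamp > when:
            break
        edge = (ev.source, ev.target)
        if ev.action == "add":
            live.add(edge)
        elif ev.action == "remove":
            live.discard(edge)
    return live


# Illustrative data only: "web" is moved from db-1 to db-2 after 0.4 seconds.
events = [
    DependencyEvent(100.0, "add", "web", "db-1"),
    DependencyEvent(100.4, "remove", "web", "db-1"),
    DependencyEvent(100.4, "add", "web", "db-2"),
    DependencyEvent(101.2, "add", "web", "cache-1"),
]

during_incident = snapshot_at(events, 100.2)
shortly_after = snapshot_at(events, 100.5)
```

In this made-up example, the dependency on db-1 exists for less than half a second. A record refreshed daily or weekly would never show it, while a replay of the change stream can recover it for any chosen instant.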
Cloud-admin tools as a problem-management toolset
The only systems able to capture changes in cloud services as they occur are the administrative tools that are built into the cloud services themselves and that initiate those changes. Traditional Configuration Management Databases (CMDBs) and change-record repositories in IT Service Management (ITSM) systems can help with integration and big-picture issues, but the cloud-admin tools have the details needed for modern problem management. Problem managers must understand that cloud services operate at an entirely different pace than legacy IT environments, and that diagnosing the root cause of issues in these environments requires working with a new set of tools and data.
A helpful analogy is the difference between a still-frame photograph and a motion-picture sequence. Problem managers are used to staring at the still image, looking for hidden details, but they aren’t used to the complexities of images in motion. CMDB and change records point the problem manager to the general area in the sequence, but a different set of tools is required to isolate individual frames and objects and develop a true understanding of what is occurring. Cloud-admin tools play that role for cloud services. Once the problem manager has identified when the incident occurred, they can focus on what was happening during that time and which actions or activities triggered the events in question.
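Isolating the “frames” around an incident can be sketched as a simple time-window filter over a change log. This is an illustration under stated assumptions, not a description of any specific cloud-admin tool: the `ChangeEvent` record, the `changes_near` function and the default window sizes are all hypothetical, and a real tool would expose its own event format and query interface.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical change-log entry (fields are assumptions for illustration)."""
    timestamp: datetime
    actor: str        # who or what initiated the change
    description: str  # what was changed


def changes_near(events, incident_time,
                 before=timedelta(seconds=30), after=timedelta(seconds=5)):
    """Return the changes inside the window around the incident, oldest first."""
    start = incident_time - before
    end = incident_time + after
    hits = [e for e in events if start <= e.timestamp <= end]
    return sorted(hits, key=lambda e: e.timestamp)


# Illustrative data only.
incident = datetime(2024, 1, 15, 9, 30, 0)
log = [
    ChangeEvent(datetime(2024, 1, 15, 9, 0, 0), "deploy-bot", "release rolled out"),
    ChangeEvent(datetime(2024, 1, 15, 9, 29, 58), "autoscaler", "instance web-7 started"),
    ChangeEvent(datetime(2024, 1, 15, 9, 29, 45), "optimizer", "web rerouted to db-2"),
    ChangeEvent(datetime(2024, 1, 15, 9, 31, 0), "autoscaler", "instance web-3 stopped"),
]

suspects = changes_near(log, incident)
for change in suspects:
    print(change.timestamp.isoformat(), change.actor, change.description)
```

The window is deliberately asymmetric: changes shortly before the incident are the likeliest triggers, while a short period afterward can show how the environment reacted. The right window sizes depend on how quickly the environment changes.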
Modern capabilities make some problem management activities unnecessary
One promising development in the cloud-services area is the maturation of self-healing capabilities. When cloud-administration tools identify that an issue has occurred, they are increasingly able to collect data and reconfigure the services to maintain continuity and avoid a disruption to users. This increased service resiliency has caused many ITSM practitioners to question whether it is always necessary to understand the root cause of an issue that resolved itself.
With machine-learning and artificial-intelligence capabilities being added to companies’ suites of ITSM capabilities, many of the traditional analytics processes, such as problem management, are likely to change significantly during the next few years. These technologies are unlikely to eliminate problem management as a human-based process. Instead, Machine Learning (ML) and Artificial Intelligence (AI) will provide a more robust set of tools and new information that enable problem managers to be more effective and efficient, assuming they are able to interpret the data and convert it into usable information to feed their root-cause analyses (RCAs) and investigations. Cloud services may make problem management more difficult, but these other technologies should offset some of that impact and make it a bit easier.
About Kepner-Tregoe
Kepner-Tregoe has been the industry leader in problem-solving and service-excellence processes for more than 60 years. The experts at KT have helped companies raise their level of incident- and problem-management performance through tools, training and consulting – leading to highly effective service-management teams ready to respond to your company’s most critical issues.