January 17, 2025

Towards System Resilience (Part 1)

Nothing has the potential to ruin a product or even an organization more than software instability. Most of us get excited about developing new features or services and neglect their operation. Agile methodologies mostly focus on product development. In large organisations with hundreds of systems this is not enough. You need more than just a methodology to achieve resilience.

Mike Murphy is undoubtedly an expert when it comes to resilience. An expert, for me, is someone who has the scars of years of sleepless nights and ruined weekends because of a system being down and who also has the theoretical background on the subject matter. Mike is such a person.

In this first of three articles Mike focuses on the causes of system instability. In the others he will cover the Engineering and Operational practices that can improve system resilience. Over to Mike.


System Resilience

The objective of this post is to leave the reader with a clear understanding of what causes system instability.

Resilience can be described as the ability of a system to maintain certain functions, processes, or populations after experiencing a disturbance, an unexpected change in usage patterns or sudden increase in volumes, etc. In the context of IT this can be translated to mean the ability of the system to keep processing business transactions during and after a failure of any component (e.g., server, network component failure, etc.). An important concept related to resilience is stability.

Stability refers to the disturbances a system faces. If there are few disturbances or small disturbances, then the system is relatively stable. If there are many disturbances or large disturbances, then the system is relatively unstable. It’s natural to think that a resilient system would be one with more stability. However, this is not always the case. Sometimes, some instability can help increase resilience. This occurs when the disturbances increase the system’s ability to respond to further disturbances (a topic covered later in the paper).

In this age of digital disruption resilient systems are a “ticket to the game” and it is no coincidence that IT security and IT resilience remain top of the strategic agenda. The effective execution of both requires continued focus on the “long term” and success will only be achieved once both are tightly woven into the fabric of IT.

Factors affecting system stability

There are numerous factors in the systems development life cycle, ranging in scope from design through to production support, that can have a material impact on the resilience and stability of IT systems. It is beyond the brief of this post to provide a definitive list of these factors, so the focus will be on those proven to have had the most significant impact on the resilience and stability of systems in large organisations.

Complexity

French poet Antoine de Saint-Exupéry wrote “perfection is finally attained not when there is no longer more to add, but when there is no longer anything to take away.” This principle also applies to the design and construction of software. Software simplicity is a prerequisite to reliability. The more complex a system, the more difficult it is to build a mental model of the system, and the harder it becomes to operate and debug it.

At a base level every new line of code that is added to a system has to be debugged, read & understood, and supported. Software development should always be a last resort, because of the cost and complexity of building and maintaining software. Contrary to popular belief, this does not restrict innovation but rather keeps the environment uncluttered of distractions so that focus remains squarely on innovation.

Software should behave predictably and accomplish its goals without too many surprises (that is, outages in production). The number of surprises directly correlates with the amount of unnecessary complexity found in the system. Given the size and complexity of the software estate that develops over decades it is likely that surprises caused by complexity will remain the single biggest threat to systems stability for the foreseeable future. Simplifying the estate is not an insurmountable task but will take a concerted effort over the long term.

Manual, human-centric processes

As IT systems grow exponentially, manual, non-automated processes are increasingly becoming a major business liability. Today’s systems are simply becoming too big and complex to run completely manually, and working without automation is largely unsustainable. Many manual operations are prone to error, offer slow response times and devour costly man-hours. This hampers the overall efficiency and effectiveness of IT operations.

Human error is the single biggest contributor to system failure, so relying on humans to execute tasks that can and should be automated is not a sustainable strategy. Manually developing and deploying software to production is archaic at best: manual code analysis allows technical issues to accumulate; manual testing often misses regressions; manual infrastructure management introduces anomalies in environment configuration; and manual deployments introduce risks.

A dramatic example of downtime caused by lack of automation is the case of Knight Capital Group, which famously lost $460 million in 45 minutes of trading due to a failed update that had been manually made to 8-year-old software. Had Knight automated its deployments, fully re-deployed servers periodically using automated tools, or removed old unused code from its codebase more aggressively, a technician’s failure to deploy code to the eighth server in a cluster would not have had such disastrous results.
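The failure mode in this example, one server in a cluster left on an old release, is the kind of thing an automated post-deployment check can catch. The following is a minimal sketch, assuming every server can report the version it is running. The host names, version strings and the find_version_drift helper are hypothetical and are not taken from Knight's systems.

```python
def find_version_drift(deployed: dict, expected: str) -> list:
    """Return the hosts whose reported version differs from the expected release."""
    return sorted(host for host, version in deployed.items() if version != expected)


# Hypothetical cluster of eight servers; the rollout missed the eighth.
cluster = {f"server-{n}": "2.0.0" for n in range(1, 8)}
cluster["server-8"] = "1.4.2"

stale = find_version_drift(cluster, expected="2.0.0")
if stale:
    print("Deployment incomplete, stale hosts:", ", ".join(stale))
```

A check like this would typically run as the last step of an automated deployment and block the release from going live until the list of stale hosts is empty.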

Human error remains the single biggest contributing factor to systems instability at a large organisation.

Inadequate telemetry, monitoring and alerting

Monitoring is the practice of observing systems and determining if they’re healthy. Systems monitoring should address two questions: what’s broken and why? The idea is to implement monitoring in such a way as to maximize signal and minimize noise. Without effective monitoring in place the only way of knowing whether a system's performance has degraded or whether the system is no longer operational would be via occasional operator checks or alerts raised by customers / users. In both cases the latency between the failure and the alert would almost always lengthen the mean time to repair (MTTR).

Monitoring and alerting don’t lead to systems stability but, if implemented effectively, reduce the impact of a system failure by alerting operators as soon as abnormal events are detected.

“If a human operator needs to touch your system during normal operations, you have a bug. The definition of normal changes as your system grows.” Carla Geisser, Google

Monitoring a complex application is a significant engineering endeavor in and of itself. In a multilayered system, one person’s symptom is another person’s cause. Therefore, monitoring is sometimes symptom-oriented, and sometimes cause-oriented. System behavior changes over time and monitoring thresholds have to be continually fine-tuned to ensure an optimal signal-to-noise ratio. When a system generates too many “false positives” operators eventually start ignoring alerts and in so doing raise the risk of critical failures / performance degradation being missed.
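One common way to cut down on false positives is to alert only when a threshold is breached for several samples in a row, so that a single spike does not page anyone. The sketch below illustrates the idea; the ConsecutiveBreachAlert class, the 500 ms threshold and the sample values are all hypothetical, and a real monitoring tool will have its own way of expressing such a rule.

```python
from collections import deque


class ConsecutiveBreachAlert:
    """Fire only when a metric exceeds its threshold for N consecutive samples."""

    def __init__(self, threshold: float, required_breaches: int) -> None:
        self.threshold = threshold
        # Keeps only the most recent N breach flags.
        self.recent = deque(maxlen=required_breaches)

    def observe(self, value: float) -> bool:
        """Record a sample and return True if an alert should be raised."""
        self.recent.append(value > self.threshold)
        return len(self.recent) == self.recent.maxlen and all(self.recent)


# Hypothetical response-time samples in milliseconds.
alert = ConsecutiveBreachAlert(threshold=500.0, required_breaches=3)
samples = [120, 900, 130, 650, 700, 720, 140]
fired = [alert.observe(s) for s in samples]
print(fired)
```

Here the isolated spike of 900 ms is ignored, while the sustained run of 650, 700 and 720 ms raises an alert on its third sample. Raising or lowering required_breaches is one of the knobs that has to be tuned as system behavior changes.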

Telemetry is the process of gathering information generated by instrumentation and logging systems. This information is used to discover trends, gain insights into usage and performance, and to detect and isolate faults. Telemetry is not only used to monitor performance and to obtain early warning of problems, but also to isolate issues that arise, detect the nature of faults and perform root cause analysis. In order for telemetry to be effective, instrumentation needs to be written into the application i.e., software engineers need to ensure that they build routines into the application that generate information about how the system is performing. Regrettably, many older (legacy) applications (of which the Bank has many) are not written with telemetry in mind and as a result understanding why the application is performing sub-optimally requires some element of heuristic deduction.
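To make "writing instrumentation into the application" concrete, here is a minimal sketch of a decorator that records how often a routine is called, how often it fails and how long it takes. The METRICS dictionary stands in for a real telemetry backend, and the instrumented decorator and process_payment function are hypothetical names chosen for illustration.

```python
import functools
import time
from collections import defaultdict

# In-memory store standing in for a real telemetry backend (an assumption).
METRICS = defaultdict(lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0})


def instrumented(func):
    """Record call count, error count and cumulative duration for func."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        stats = METRICS[func.__name__]
        started = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception:
            stats["errors"] += 1
            raise  # Telemetry must not swallow the original failure.
        finally:
            stats["calls"] += 1
            stats["total_seconds"] += time.perf_counter() - started

    return wrapper


@instrumented
def process_payment(amount: float) -> float:
    """Hypothetical business transaction."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    return amount


process_payment(100.0)
try:
    process_payment(-5.0)
except ValueError:
    pass

print(dict(METRICS["process_payment"]))
```

With data like this collected from the start, an operator can see error rates and latency per routine directly rather than having to deduce them after the fact.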

Using predictive analytics it is possible to pre-empt system failure and react before failure occurs. Predictive analytics uses many techniques from data mining, statistics, modeling, machine learning, and artificial intelligence to analyze current data and make predictions about the future.
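As a very simple illustration of the idea, the sketch below fits a straight line to recent resource usage and extrapolates to the point where a limit would be reached. This is plain least-squares linear regression, far simpler than the techniques listed above, and the hours_until_limit helper, the hourly disk-usage samples and the 90% limit are all hypothetical.

```python
from typing import Optional, Sequence


def hours_until_limit(hours: Sequence[float], usage: Sequence[float],
                      limit: float) -> Optional[float]:
    """Fit a least-squares line to (hours, usage) and extrapolate to the limit.

    Returns the hours remaining after the last sample, or None if usage
    is not growing.
    """
    n = len(hours)
    mean_x = sum(hours) / n
    mean_y = sum(usage) / n
    sxx = sum((x - mean_x) ** 2 for x in hours)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, usage))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    return (limit - intercept) / slope - hours[-1]


# Hypothetical disk usage (percent) sampled once an hour.
hours = [0, 1, 2, 3, 4]
usage = [60.0, 62.0, 64.0, 66.0, 68.0]
remaining = hours_until_limit(hours, usage, limit=90.0)
print(f"Disk projected to reach 90% in about {remaining:.1f} hours")
```

Even a crude projection like this turns a future outage into a ticket that can be handled during working hours, provided the underlying trend really is roughly linear.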

Inadequate focus on non-functional requirements

In systems engineering a non-functional requirement is a requirement that specifies criteria that can be used to judge the operation of a system, rather than specific behaviors.

Broadly, functional requirements define what a system is supposed to do and non-functional requirements (also known as quality requirements) define how a system is supposed to be.

Non-functional requirements typically include things such as availability requirements (must the application be available 24x7, 24x5, etc.?); backup and recovery requirements; performance requirements (e.g., how many concurrent users should the system cater for); scalability requirements; security requirements, etc. In an effort to get new features out to customers / users, product owners may compromise the non-functional requirements of a system, leaving it fragile and vulnerable to failure. It is therefore important that non-functional requirements are adequately contemplated at design time and engineered into the system from the outset. Retrospectively adding non-functional requirements to a system can be costly.
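An availability requirement becomes easier to reason about once it is translated into a downtime budget. The sketch below does that arithmetic for a few targets over 24x7 and 24x5 service windows; the allowed_downtime_hours helper, the targets and the 52-week year are illustrative assumptions rather than figures from any particular system.

```python
def allowed_downtime_hours(availability_pct: float, hours_per_day: float = 24,
                           days_per_week: float = 7, weeks: float = 52) -> float:
    """Downtime budget in hours for a target availability over a service window."""
    service_hours = hours_per_day * days_per_week * weeks
    return service_hours * (1 - availability_pct / 100)


for target in (99.0, 99.9, 99.99):
    full_time = allowed_downtime_hours(target)                  # 24x7 window
    weekdays = allowed_downtime_hours(target, days_per_week=5)  # 24x5 window
    print(f"{target}%: {full_time:.2f} h/year (24x7), {weekdays:.2f} h/year (24x5)")
```

At 99.9% over a 24x7 window the budget is under nine hours a year, which makes clear why such a target has to be designed in rather than added later.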

Also, as a system evolves over time the non-functional requirements need to be constantly revised to ensure that the system continues to meet performance, availability and security expectations. Unfortunately this does not always occur and it is common to see systems degrade as more and more features are added and the usage profiles of the systems change.

Typical screenshot using a tool like AppDynamics

Bibliography

While this is not an exhaustive list, there are a number of sources that have been drawn on for inspiration in the compilation of this and the subsequent posts. Some of the content has been used verbatim, some quoted, and some used simply to frame an argument or position.

Books

Online Papers & Blogs

How Complex Systems Fail. Richard I. Cook, MD

Continuous software engineering: A roadmap and agenda. Brian Fitzgerald, Klaas-Jan Stol

Infrastructure As Code, The Missing Element In The I&O Agenda. Robert Stroud

#NoProjects. Shane Hastie

Put Developers On The Front Lines Of App Support. Kurt Bittner, Eveline Oehrlich, Christopher Mines and Dominique Whittaker

Agile Manifesto

Kitchen Soap (John Allspaw)

Martin Fowler

Systems Blindness: The Illusion of Understanding (Daniel Goleman)


This was a guest post by Mike Murphy. Mike is the Chief Technology Officer of the Standard Bank Group. A podcast featuring Mike is also available.
