An attention-deficit planet is unlikely to forget the memorable Monday moan in a hurry. For six hours, Facebook and its app family of Instagram, Messenger, and WhatsApp went down. Beyond the flood of FOMO memes, the impact on Facebook was tangible – shares fell 4.9 percent and its US ad revenues bled $545,000 each hour, not to mention the inconvenience caused to millions of businesses that rely on Facebook pay to access e-commerce sites. Using specific indicators from the World Bank and other agencies, the Cost of Shutdown Tool (COST) estimated the loss to the global economy at $160m.
An internal Facebook blog pointed to a cascade of mistakes, but the effect grows ominous as one considers how outages at brands like Target, Amazon Web Services and Microsoft influence market leaders across BFSI and fintech. After all, with massive investments in digital banking, banks want to replicate the scale and customer agility of Big Techs, and laudably so.
To be fair, not all of us are FOMO sapiens. But the fear of missing out does influence the billions of dollars’ worth of resiliency agendas pursued across global enterprises. That fear is, of course, rooted in rising global uncertainty, fluctuating geopolitical risks, an increased frequency of natural disasters, recurring large-scale outages and security breaches.
Resilience is an easy word to proffer, tricky to pin down, harder to promise. But the singular difference between Big Techs and banking is the essential nature of the services. Not being able to stream a favourite movie, upload a picture, or refresh a feed is lower on the “anxiety spectrum” than a banking system going down. The sweet spot? Combine the agility and innovation of Big Techs with the resilience and reliability of banking systems.
While regulatory oversight shields customers’ data and deposits to a certain extent, the number of outages at banks isn’t negligible; in fact, just the opposite. Consider the ten outages a month at Barclays, or the recent disruptions at Bank of America and Visa.
How serious is the resiliency imperative?
In an age where every company aspires to turn pure cloud, the resiliency imperative is alive not only for the FAANGs (Facebook, Amazon, Apple, Netflix, and Google). According to McKinsey research, companies report that disruptions lasting a month or more occur every 3.7 years, resulting in losses worth almost 45 percent of one year’s EBITDA over the course of a decade.
So, what does resiliency mean for our hyperconnected world? Simply put, whereas quarantines and social distancing work for humans, systems need a different fail-safe. It is called Graceful Degradation.
The business need for Graceful Degradation
The theory and practice of Graceful Degradation (GD) is captured by the question “If everything were to fail, what’s the most important thing that needs to work?” For network engineers, product managers, UI designers, and CX professionals, as we will shortly discover, the answer to that question is neither straightforward nor facile.
Let’s say you are in the middle of an online moment – booking an airline seat, withdrawing cash at the ATM, locking a stock market transaction, browsing Netflix recommendations – and the network breaks, or latency hits a threshold, or the power goes out, or the system inexplicably behaves in an unexpected way. Complex systems, after all, are often fragile systems, where macro-level issues of saturation, latency, and excessive workloads produce failures with more than one root cause.
What happens next? Does the seat get locked out? Is the card retained by the ATM? Does Netflix offer general selections instead of your personalised ones? How much of the failure status does the system communicate to the user, and at what stages? At the point of failure, how much information counts as good customer service and yet doesn’t create silent anxiety? Does the “broken” system offer alternatives to delight the customer or not? If yes, how? Does it shed workloads, time-shift them, reduce the service quality, or add more capacity? Does the system prioritise between functions – maybe onboarding new users smoothly at the cost of some latency for users already on the platform?
These are all facets of the primary question: how should the system gracefully degrade?
The ability to maintain limited functionality, even when a large portion of the system is rendered inoperative, and so prevent catastrophic failure, is how large-scale enterprise applications achieve resilience.
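One of the options raised above, shedding workloads while prioritising between functions, can be sketched in a few lines. This is a minimal illustration, not a production design: the `Request` type, the `admit` function, the request names, and the priority numbers are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    name: str
    priority: int  # 0 = most critical; larger numbers are shed first


def admit(requests, capacity):
    """Serve the most critical requests up to capacity; shed the rest."""
    ranked = sorted(requests, key=lambda r: r.priority)  # stable sort
    return ranked[:capacity], ranked[capacity:]


# Hypothetical traffic arriving while the system can only handle two requests.
incoming = [
    Request("recommendations", 2),
    Request("balance-enquiry", 1),
    Request("payment", 0),
    Request("marketing-banner", 3),
]
served, shed = admit(incoming, capacity=2)
print([r.name for r in served])  # the two most critical requests
print([r.name for r in shed])    # candidates for time-shifting or a polite error
```

The design choice that matters here is not the sorting but the ranking itself: someone has to decide, ahead of the outage, which functions are allowed to fail first.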
Designing a Graceful Degradation system
In the age of CX dominance, the first aspect of designing a gracefully degradable system is probably obvious. It is the twin design features of fail fast (setting aggressive timeouts so that failing components won’t make the entire system crawl to a halt) and fallbacks (designs that allow for falling back to lower quality). Once the failure has been “injected” comes the critical part – test it.
Graceful degradation may begin with causing the failure to see what happens, but it is equally about thinking through what to expect before it happens, and what was designed to happen when it does.
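To make fail fast and fallbacks concrete, here is a minimal sketch using Python's standard `concurrent.futures` module. The helper `call_with_fallback`, the two recommendation functions, and the timeout values are hypothetical names and numbers chosen for illustration. The slow call is simulated with `time.sleep`.

```python
import concurrent.futures
import time

# Shared worker pool; the size is an arbitrary choice for this sketch.
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_fallback(primary, fallback, timeout_s):
    """Run primary(); if it errors or exceeds timeout_s, use fallback()."""
    future = _POOL.submit(primary)
    try:
        return future.result(timeout=timeout_s), "full"
    except concurrent.futures.TimeoutError:
        future.cancel()  # best effort: an already running call is not interrupted
        return fallback(), "degraded"
    except Exception:
        return fallback(), "degraded"


def personalised_recommendations():
    time.sleep(0.3)  # simulate a slow downstream dependency
    return ["picked for you"]


def general_selections():
    return ["top 10 today"]  # lower quality, but always available


result, mode = call_with_fallback(
    personalised_recommendations, general_selections, timeout_s=0.05
)
print(result, mode)
```

Returning the mode alongside the result lets the caller decide how much of the failure status to show the user. Note that the timeout only stops the caller from waiting. The slow call itself keeps running in the background, so a real system would also need timeouts inside the dependency call.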
How to introduce GD in banking user journeys
- Rather than declaring the entire app down, build portions (especially information vending) that can be served from multiple sources of the same truth.
- For transactions, even as the promise and intent are to provide straight-through processing, the architecture should be message-driven. It should switch between request/response at full availability and publish/subscribe at lower service levels.
- Create secure embedded data stores within customer apps that aren’t network-dependent and don’t have to call enterprise app(s) for every function.
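The second point, switching from request/response to a queued, message-driven mode when the back end is unavailable, might look like the sketch below. Everything here is an assumption for illustration: `TransferService`, `core_banking`, and the message fields are hypothetical, and an in-memory `queue.Queue` stands in for the publish/subscribe channel. A real system would need a durable broker and idempotent processing.

```python
import queue


class TransferService:
    """Hypothetical front end: synchronous when healthy, queued when degraded."""

    def __init__(self, backend):
        self.backend = backend        # callable that processes one message
        self.pending = queue.Queue()  # stand-in for a durable message broker
        self.backend_available = True

    def submit(self, message):
        if self.backend_available:
            try:
                # Full availability: request/response, the caller gets the result.
                return {"status": "completed", "result": self.backend(message)}
            except ConnectionError:
                self.backend_available = False  # degrade to queued mode
        # Lower service level: accept the message now, process it later.
        self.pending.put(message)
        return {"status": "accepted", "result": None}

    def drain(self):
        """Replay queued messages once the back end is reachable again."""
        self.backend_available = True
        results = []
        while not self.pending.empty():
            results.append(self.backend(self.pending.get()))
        return results


core_up = {"value": True}  # toggled below to simulate an outage


def core_banking(message):
    if not core_up["value"]:
        raise ConnectionError("core banking unreachable")
    return f"posted {message['amount']} to {message['to']}"


service = TransferService(core_banking)
first = service.submit({"to": "ACC-1", "amount": 50})
core_up["value"] = False
second = service.submit({"to": "ACC-2", "amount": 75})
core_up["value"] = True
replayed = service.drain()
print(first["status"], second["status"], replayed)
```

The customer-facing difference is the status: "completed" promises the transfer is done, while "accepted" promises only that it will not be lost.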
Seizing the future – learning the Netflix way
No matter the cause, the fact that technology will disappoint can be countered by the question: How can we handle the failure gracefully? What stops us from planning ahead to keep our customers happy? Ahead. Not after the feature is built, the product tested, and the version released. The profits of heeding GD lessons come to us from another iconic brand – Netflix. After its 2012 Christmas Eve outage, when programming went off-stream across parts of the US, Canada, and Latin America, the company put in a slew of GD measures. Netflix today regularly uses external services to simulate service failure, automates zone failover and recovery, and de-risks critical dependencies.
Ultimately, if banks are to build GD into their operational systems as Netflix has, they must begin by rethinking their resiliency philosophy.
To learn more about Maveric Systems, visit their website.