By Daniel Mayo, Ovum
There was plenty of speculation about the cause of the recent RBS IT failure, which downed the retail bank operation for a week and prevented payments in and out from being processed, leaving wages, mortgages and businesses unpaid and seriously damaging the reputation of the bank. Many observers were initially quick to point the finger at the bank's legacy core systems, but that wasn’t the real story. The inexperience of many younger staff, who were unable to fix the problem quickly as a backlog built up, was just as crucial as the aging technology. The problem will only get worse as the rise of online and mobile transactions increases volumes and shrinks the downtime available for batch processing.
The infamously old legacy systems at many UK retail banks, with some even still relying on ancient IBM Z series servers, have wrongly been stereotyped as being a ‘creaking’ component in the technology systems of the banking sector. Yes, they are old and should be replaced, but they are not the only concern, as the recent case of RBS illustrates.
RBS’ main retail banking customer account system dates back to the 1980s and is based on older technology, but the system has proved itself largely reliable over the years. The problem in this case was actually a software upgrade that went wrong. The software in question is Computer Associates’ CA-7 batch scheduling software. This coordinates the end-of-day batch cycle at RBS, in which transactions executed during the day are processed in offline windows overnight. When the problem was first encountered, what should have been a routine, straightforward fix to the upgrade was unintentionally aggravated by poor staff knowledge and procedures, and the meltdown ensued.
What went wrong at RBS
An operative running the end-of-day overnight batch cycle managed to erase the entire scheduling queue at RBS. This error required the queue to be re-input, a complex process that demands a detailed understanding of the core system's processing quirks and advanced knowledge of legacy software, something that is often lacking in younger IT graduates. Reinstating the queue and completing the overnight batch processing cycle within the small offline window proved impossible, especially as payment instructions and pent-up demand built up over time, causing other RBS systems, such as access to its online banking, to suffer. The Financial Services Authority (FSA) has launched an investigation, so no doubt more details will emerge, but in essence this was the key problem at RBS.
The order of scheduling is important with batch-based systems. To recover from the deletion of the scheduling queue, RBS had to re-run the previous day's transactions. This had to be complete before it could take on new transactions, but it is a bit like running to stand still when demand spikes as people panic. The majority of affected consumers would have suffered because the delay in fixing the initial problem extended the backlog of transactions and the problem then fed upon itself. This protracted the resolution and resulted in RBS opening its UK retail bank branches for a seventh day and keeping them open from 8am until 7pm, an unprecedented event in UK banking.
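The "running to stand still" effect can be illustrated with a rough sketch. The function name and every figure below are hypothetical and are not taken from RBS; the sketch only assumes that each day brings new work and that the batch cycle can process a fixed amount per day.

```python
def days_to_clear(initial_backlog, daily_inflow, daily_capacity, max_days=365):
    """Return the number of days until the backlog is gone, or None if it never clears.

    All quantities are in the same arbitrary unit (for example, thousands of
    transactions). The figures used below are illustrative assumptions only.
    """
    backlog = initial_backlog
    for day in range(1, max_days + 1):
        # Each day adds new work and removes at most one day's processing capacity.
        backlog = max(0, backlog + daily_inflow - daily_capacity)
        if backlog == 0:
            return day
    return None


# One missed day of work, normal demand, 20% spare capacity.
print(days_to_clear(100, 100, 120))  # 5

# Same backlog, but demand spikes as customers panic.
print(days_to_clear(100, 115, 120))  # 20

# Demand equals capacity: the backlog never shrinks.
print(days_to_clear(100, 120, 120))  # None
```

Under these assumed figures, a modest rise in daily demand stretches the recovery from five days to twenty, which is the sense in which the problem feeds upon itself.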
There are two areas which will hinder retail banks as they struggle to ensure that IT failures such as RBS’ now notorious case cannot happen again – namely, a lack of skilled staff and increased demand thanks to the rise of online and mobile banking, which effectively increases operating hours and shrinks the time available for batch processing.
An increasing shortage of skilled staff
The availability of both skilled and experienced IT staff is deteriorating. New IT staff can command a high salary if they focus on older technologies, but many do not, because they naturally enough want to be at the bleeding edge of technology, and so they continue to lack a comprehensive understanding of current in-house systems and processes.
Retail bank systems in developed markets run on older technologies, however, and the number of experienced IT staff – both for core system software and the underlying supporting platforms – is in decline as more and more people retire. What’s more, as senior staff retire they often do not pass their knowledge of legacy systems on to junior employees. Therefore, entry-level IT professionals concentrate on newer technologies without adequate knowledge of what went before and there is no procedure for educating them.
As in all sectors of the UK and other developed market economies, banks are currently weathering an economic downturn. As such, they are under increasingly heavy cost pressures and IT budgets are tight. The responsibility and resources for managing complex legacy systems are therefore often stretched when errors occur, with expensive older or highly qualified staff not necessarily on call internally.
These issues are compounded by the low level of documentation for many older IT systems. Detailed understanding of system code, processes, and requirements is often pooled across a small number of staff. What banks must consider is making this information readily accessible in disaster recovery situations. This requires a comprehensive knowledge transfer between staff coming up to retirement and the junior IT employees, which is lacking at the moment.
The continued growth of mobile and online banking
What is more worrying for both banks and consumers is the popularity of online banking, now further compounded by the speedy uptake of mobile banking. Both are creating more demand for transactions and 24×7 service. In addition, consumers are pushing for longer branch opening hours, further increasing the pressure to reduce system offline times. Hence, if things go wrong there is nowhere to hide and less time to fix any problems, as RBS discovered.
Previously the batch window was long enough to absorb any human or technological error. However, as we move further away from the traditional world of restricted queuing times and branch-based banking, where banks could close at 15:30 and remain shut throughout the weekend, the IT support team is losing the time to roll back and re-run procedures if needed.
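The loss of slack can be put in rough numbers. The sketch below is an illustration only: the 15:30 closing time comes from the paragraph above, while the 09:00 reopening, the six-hour batch run and the seven-hour modern window are assumptions made for the example, and `spare_reruns` is a hypothetical helper.

```python
def spare_reruns(window_minutes, run_minutes):
    """Number of complete re-runs that fit in the window after the first run."""
    if run_minutes <= 0:
        raise ValueError("run_minutes must be positive")
    # Integer division gives how many full runs fit; the first one is the normal run.
    return max(0, window_minutes // run_minutes - 1)


RUN_MINUTES = 6 * 60  # assumed length of one full batch cycle

# Branch-era window: close at 15:30, reopen at 09:00 (assumed) = 17.5 hours.
branch_era = spare_reruns(1050, RUN_MINUTES)

# Assumed always-on window of seven hours overnight.
always_on = spare_reruns(7 * 60, RUN_MINUTES)

print(branch_era, always_on)  # 1 0
```

With these assumed figures the older window leaves room for one complete re-run after a failure, while the shorter window leaves none, so any error spills into the next business day.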
The very public failure of RBS’ retail banking IT infrastructure demonstrated the key problem of squeezed offline windows and a distinct shortage of skilled IT staff. It should act as a catalyst for all UK banks, and those in other developed markets, to reassess their core systems strategy, staffing and resiliency plans. It is not just RBS in the spotlight. Most of the big UK banks run batch and mainframe-based legacy core systems. What the RBS case has shown is that the skills shortage issues around older technologies mean operational risks are only going to intensify. Lessons need to be learnt.