Reprinted below is a statement, issued late today, from the Royal Bank.
Canadian readers of this blog will be aware that Canada's biggest bank has been working through an embarrassing and annoying technological problem. [My Globe colleagues have been writing the heck out of this, and I put an item up on it for CTV national news last Friday.] All week long, we've been asking the bank what happened. I don't know if this makes it any clearer to me, but it might to you. I'd like to know, for example, what operating system the application that the bank staff were trying to update was running on.

Also of some note: IBM Canada is Royal's biggest technology vendor. At first, we wondered whether an IBM product or piece of software was to blame. We were quickly told — on the record and in no uncertain terms — that IBM was not involved at all. In fact, as the brief below indicates, IBM has been hired as the independent investigator of the problem. Here's what the Royal said:
Transaction Processing Disruption – Technology Issue Summary
What caused the problem?
The root cause of the problem was an error made in a program change on Monday, May 31, that surfaced in the early hours of Tuesday morning, June 1.
Our operating procedures require that all program changes undergo thorough testing before entering our production environment, and we continue to investigate all aspects of this issue to determine which protocols, if any, were broken.
We can assure you that this problem was not a result of any information security breach or malicious act by internal or external parties.
Why didn't you go to backup facilities?
Backup facilities exist in case our primary facility is disabled. As a matter of policy, all program changes are implemented simultaneously at both the primary and backup facilities, so the backup facility would not have been useful in this case. One of the issues we will investigate as we complete our review of this event is whether this policy should be more robust.
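[An aside from me, not part of the bank's statement: to illustrate the policy being described, here is a minimal sketch in Python. The environment names and version labels are my own assumptions; it simply shows why pushing a program change to the primary and backup sites at the same time leaves no clean site to fail over to.]

# Illustrative sketch only (my assumptions, not RBC's systems): why a change
# deployed to primary and backup at the same time leaves nowhere to fail over.

def deploy_simultaneously(primary, backup, new_version):
    # The policy the statement describes: both sites take the change at once.
    primary["version"] = new_version
    backup["version"] = new_version
    # If new_version carries the error, the backup now carries it too.

def deploy_staggered(primary, backup, new_version):
    # A hypothetical alternative: the backup holds the last known-good build.
    backup["version"] = primary["version"]
    primary["version"] = new_version

primary = {"name": "primary", "version": "build-2004-05-28"}
backup = {"name": "backup", "version": "build-2004-05-28"}
deploy_simultaneously(primary, backup, "build-2004-05-31")
print(primary["version"], backup["version"])  # both sites now run the faulty change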
Why did it take so long to recover?
The error manifested itself during the Monday-Tuesday overnight system runs.
Once identified early on Tuesday morning, the error was corrected within two hours. However, our recovery was delayed because we had launched into end-of-day production based on incomplete information. Until we could conclude that the error did not pose a pervasive risk to other systems, the decision was made to stop production on Tuesday, June 1. Once production was restarted later that day, the verification process took longer than expected because two days of transactions needed to be processed under the same date. This added complexity and required further manual verification of dates, creating additional backlog. Rather than running the work through our automated tools, we had to manually override the automated scheduling systems, which significantly slowed processing.
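[Another aside from me, not the bank: a minimal sketch of why folding two business days into one processing date slows things down. The transaction layout and the manual-verification flag are my own assumptions, meant only to contrast a normal automated run with the kind of catch-up run the statement describes.]

# Illustrative sketch only (my assumptions, not RBC's batch system): posting two
# business days of transactions under one processing date forces per-item date
# checks instead of a straight automated run.

from datetime import date

def automated_run(transactions, processing_date):
    # Normal overnight run: every item already carries the expected business date.
    return [t for t in transactions if t["business_date"] == processing_date]

def manual_catchup_run(transactions, processing_date, carried_dates):
    # Catch-up run: items from earlier dates are posted under one processing
    # date, so each item's date has to be verified and tagged by hand.
    posted = []
    for t in transactions:
        if t["business_date"] in carried_dates:
            posted.append(dict(t, posted_as=processing_date, verified_manually=True))
    return posted

txns = [
    {"id": 1, "business_date": date(2004, 6, 1), "amount": 120.00},
    {"id": 2, "business_date": date(2004, 6, 2), "amount": 75.50},
]
backlog = manual_catchup_run(txns, date(2004, 6, 2), {date(2004, 6, 1), date(2004, 6, 2)})
print(len(backlog), "items needed manual date verification")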
Through Wednesday, June 2, we believed we could catch up on Tuesday's processing by late that evening. That would have allowed us to process Wednesday's data by early Thursday morning and thus be up to date for the opening of business.
The manual rescheduling interventions also included the decision to give priority to payroll transactions. Because of that decision, and the additional delays caused by the higher level of manual intervention, we were not able to meet our objective and concluded that we would need the weekend to eliminate the backlog.
Paramount in our efforts was the need to ensure the integrity and security of all the data being processed by the system.
How will you ensure that this will not happen again?
With the recovery process behind us, we are now conducting our own exhaustive internal review of the specific causes and effects of this problem. While we continually review our technology and processes and benchmark against other high-performing companies, the events of this past week have led us to initiate an aggressive assessment of potential procedural gaps. In addition, Gordon Nixon, president and chief executive officer, has retained IBM to conduct an independent review of the original cause of the problem, our current processes, and the recovery procedures that were employed.
A second phase of our investigation will be to ensure that our policies, procedures and technology comply with best practices — a process that will involve input from other institutions in our industry, as well as the sharing of our own findings.
Our Commitment to Clients:
We apologize for this disruption and realize that we still have work to do to make things right for those who have been impacted by this situation. We appreciate the cooperation of other financial institutions in accommodating clients who have been affected. We are committed to rebuilding the goodwill of our clients and to taking all necessary steps to accomplish this end.