Reprinted below is a statement, issued late today, from the Royal Bank.
Canadian readers of this blog will be aware that Canada's biggest bank has been working through an embarrassing and annoying technological problem. [My Globe colleagues have been writing the heck out of this, and I put an item up for CTV national news on it last Friday.] All week long, we've been asking the bank what happened. Don't know if this makes it any clearer to me, but it might to you. I'd like to know, for example, what operating system the application was running on that the bank staff were trying to update. Also of some note: IBM Canada is Royal's biggest technology vendor. At first, we wondered if an IBM product or piece of software was to blame. They were quick to point out, on the record and in no uncertain terms, that IBM was not involved at all. In fact, as the brief below indicates, IBM has been hired to be the independent investigator of the problem. Here's what the Royal said:
Transaction Processing Disruption – Technology Issue Summary
What caused the problem?
The root cause of the problem was an error made in a program change on Monday, May 31 that surfaced in the early hours of Tuesday morning, June 1.
Our operating procedures require that all program changes undergo thorough testing before entering our production environment, and we continue to investigate all aspects associated with this issue to determine which, if any, protocols were broken.
We can assure you that this problem was not a result of any information security breach or malicious act by internal or external parties.
Why didn't you go to backup facilities?
Backup facilities exist in case our primary facility is disabled. As a matter of policy, all program changes are implemented simultaneously to both the primary and backup facilities; therefore, our backup facility would not have been useful in this case. One of the issues we will investigate as we complete our review of this event is whether this policy should be more robust.
Why did it take so long to recover?
The error manifested itself during the Monday-Tuesday overnight system runs.
Once identified early on Tuesday morning, the error was corrected within two hours. However, our recovery was delayed because we had proceeded to launch into end-of-day production based on incomplete information. Until we could conclude that this error did not pose a pervasive risk to other systems, the decision was made to stop production on Tuesday, June 1. Once production was restarted later on Tuesday, the verification process took longer than expected because two days of transactions needed to be processed on the same date. This created additional complexity and required further manual verification of dates, adding to the backlog. In this situation, rather than running the information through our automated tools, we had to manually override the automated scheduling systems, which significantly slowed processing time.
Through Wednesday, June 2, we believed we could catch up on Tuesday's processing by late that evening. This would have allowed us to process Wednesday's data by early Thursday morning and thus be up to date for the opening of business.
The manual rescheduling interventions also included the decision to give priority to payroll transactions. As a result of this decision, and because of the additional delays caused by the higher level of manual intervention, we were not able to meet our objective and concluded that we would need the weekend to eliminate the backlog.
Paramount in our efforts was the need to ensure the integrity and security of all the data being processed by the system.
How will you ensure that this will not happen again?
With the recovery process behind us, we are now conducting our own exhaustive internal review of the specific causes and effects associated with this problem. While we continually review our technology and processes and benchmark against other high-performing companies, the events of this past week have caused us to initiate an aggressive assessment of potential procedural gaps. In addition, Gordon Nixon, president and chief executive officer, has retained the services of IBM to conduct an independent review of the original cause of the problem, our current processes and the recovery procedures that were employed.
A second phase of our investigation will be to ensure that our policies, procedures and technology comply with best practices, a process that will involve input from other institutions in our industry as well as the sharing of our own findings.
Our Commitment to Clients:
We apologize for this disruption and realize that we still have work to do to make things right for those who have been impacted by this situation. We appreciate the cooperation of other financial institutions in accommodating clients who have been affected. We are committed to rebuilding the goodwill of our clients and to taking all necessary steps to accomplish this end.
My Globe and Mail colleagues have the latest on this in today's paper.
I haven't been involved with the incident but I do work in the same building. All I can say is that what the article says is pretty much what we have heard too. Nothing wrong with the OS or software, just an error in the code. I feel bad for the guy who created the bug, but I don't necessarily think that he/she can be blamed. Every coder makes mistakes; otherwise there wouldn't be any bugs. What really failed is that the bug was not discovered during testing, and that it took a long time before everything was back to normal.
Knowing how much commotion this has caused, I am sure that we will see some big changes to prevent this from happening again because, even if the chances are very, very small, it's something the bank cannot afford. In a way I believe that, as a result of this, the chances of something similar happening at any of the banks will be smaller than ever before.
I meant to post this yesterday but didn't get around to it. I see Richard has already commented with some semi-inside knowledge. Here is my comment anyway as it was written before Richard posted:
I don't think it is too important what OS they had this problem on. It's extremely likely they are using Microsoft Windows (at least for some of their systems).
It sounds to me like they just missed something in testing. It would be interesting to know how elaborate their testing procedure really is. They should be doing something along the lines of simulating the production environment in real time in a test environment, applying the changes there, and then comparing the results against production or against expected outputs.
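A minimal sketch of that kind of parallel-run comparison, in Python. The pipe-delimited record format, file names and field names here are purely hypothetical; nothing is known about the bank's actual batch formats or tooling.

# Hypothetical parallel-run check: replay the same day's transactions through
# the current code and the changed code, then diff the outputs record by
# record before the change is promoted to production.

from decimal import Decimal

def load_records(path):
    """Parse a pipe-delimited batch output file into {txn_id: (account, amount, post_date)}."""
    records = {}
    with open(path) as f:
        for line in f:
            txn_id, account, amount, post_date = line.rstrip("\n").split("|")
            records[txn_id] = (account, Decimal(amount), post_date)
    return records

def compare_runs(expected_path, candidate_path):
    """Return human-readable discrepancies between two batch runs."""
    expected = load_records(expected_path)
    candidate = load_records(candidate_path)
    problems = []
    for txn_id, exp in expected.items():
        got = candidate.get(txn_id)
        if got is None:
            problems.append(f"{txn_id}: missing from candidate run")
        elif got != exp:
            problems.append(f"{txn_id}: expected {exp}, got {got}")
    for txn_id in candidate.keys() - expected.keys():
        problems.append(f"{txn_id}: unexpected transaction in candidate run")
    return problems

if __name__ == "__main__":
    for issue in compare_runs("production_run.txt", "test_run.txt"):
        print(issue)

The point is simply that the changed code gets exercised against a full day's worth of real transactions, and any missing, extra or altered records are flagged before the change ever reaches production.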
I disagree with Tim. I think it extremely unlikely that the code was running on a Windows-brand OS. I say that because: 1) Royal Bank is Canada's biggest bank, and that means Big Iron when it comes to computing infrastructure. Windows and SQL Server have been making great gains, but when you really want to crunch data, the apps you're using are running on UNIX or its variants. 2) IBM is Royal's biggest vendor partner. IBM says (as I noted) it wasn't involved, but IBM would likely have had some input into the apps and, therefore, the OS for the core computing functions in the bank. It's quite possible, then, that IBM has the bank running apps on some Linux boxes. After all, IBM is in the market with some big advertising campaigns pushing Linux and non-proprietary operating systems. So why is all this important? It's important precisely because of Royal's relationship with IBM and IBM's relationship with Linux. The Microsoft marketing and sales staff will have a field day if it comes out that the coder who wrote the offending bits for Royal was writing an app running on Linux. Even though the OS may have had nothing to do with it, Microsoft folks (and their allies) can point and say, “Well, remember that Royal snafu? They were running Linux …” I'm sure the reverse is true, as well. If this was an MS OS or Solaris or some other proprietary, locked-down OS, the open-source folks may say, “Well, there you go, another example of a screw-up involving a proprietary OS!”
Even the coding tools used will be an issue for a vendor. Was it a Borland tool the guy was using? Was it some Microsoft tool?
So I think Royal and its technology partners have a great interest in controlling the information involved in this event.