Treasury Board Secretariat has launched a full investigation with Canada Revenue Agency into the software bug that brought down the national e-tax filing system for 10 days earlier this month.
It was the longest stretch of time that a mission-critical system at CRA had been out of service. Canada Revenue says it had an army of 700 employees working to get the system back up and running, including IT staff who worked around the clock.
CRA has confirmed the system crash was caused by a vendor-issued software patch for its database management system, Advantage CA-IDMS (Integrated Database Management System), manufactured by CA Inc.
A spokesperson for CA did not give details on why or how the patch malfunctioned, but he did say the company helped to restore system availability. “CA worked closely with CRA to resolve the problem and restore service,” said Fabrice Zambito, regional vice-president, CA Canada.
Amidst the crisis came scathing criticism from Liberal Revenue critic Judy Sgro, who accused federal Revenue Minister Carol Skelton of ignoring the system crash while away in Saskatchewan. Sgro called for an extension to the tax filing deadline if the system remained down for much longer.
Treasury Board confirmed it was investigating the case with CRA officials. “The Chief Information Officer Branch has worked closely with the Canada Revenue Agency, as it does with all departments, to support the Agency in working through this event,” said a spokesperson. “One of Treasury Board Secretariat’s roles is to assist departments in monitoring and addressing arising issues.”
Gordon O’Grady, deputy assistant commissioner with the IT branch at CRA, talked this week with InterGovWorld senior writer Lisa Williams about the main culprit of the glitch, why it took so long to fix the faulty patch, and the processes behind its post mortem with Treasury Board.
Q) The Commissioner of CRA (Michel Dorais) had said that it was a malfunctioning software patch that was the cause of the service disruption. Can you give me the details on the patch itself?
A) The patch was created by the vendor (CA Inc.) to address a problem in their database management software that, if encountered, would cause database processing to stop, resulting in a service disruption.
Q) What was the type of software that was being used?
A) It was database management system software called Advantage CA-IDMS (Integrated Database Management System).
Q) How was the malfunction noticed?
A) On March 5 (the day after the patch was applied), sporadic anomalies were reported in the Efile system and other integrated database management systems; as identified earlier, these were predominantly individual taxpayer databases.
Q) What was the specific testing procedure that was in place?
A) At CRA, patches are thoroughly tested in multiple test environments: they are progressively released through each environment until they are finally released into production. We usually allow two weeks between releases in each test environment before we go into production.
Q) When you say two weeks is allowed, do you mean in terms of the testing itself?
A) There are various test environments, or test states. Testing in each environment spans a two-week period before the patch is released to the next environment. So there's a period of stabilization before it moves to the next stage in testing.
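The staged-release process O'Grady describes can be sketched as a simple promotion pipeline. Only the progressive release through environments and the two-week stabilization window come from the interview; the environment names, the anomaly check, and all function names below are illustrative assumptions, not CRA's actual tooling:

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

# Hypothetical environment names -- the interview does not name CRA's
# actual test states, only that patches move through them progressively.
ENVIRONMENTS = ["unit_test", "integration", "pre_production", "production"]
STABILIZATION = timedelta(weeks=2)  # per O'Grady: two weeks per test stage

@dataclass
class Patch:
    name: str
    stage: int = 0                      # index into ENVIRONMENTS
    released_on: date = field(default_factory=date.today)
    anomalies: list = field(default_factory=list)

def can_promote(patch: Patch, today: date) -> bool:
    """A patch advances only after a clean two-week stabilization period."""
    in_final_stage = patch.stage >= len(ENVIRONMENTS) - 1
    stabilized = today - patch.released_on >= STABILIZATION
    return not in_final_stage and stabilized and not patch.anomalies

def promote(patch: Patch, today: date) -> str:
    """Advance the patch one environment if eligible; return where it now sits."""
    if not can_promote(patch, today):
        return ENVIRONMENTS[patch.stage]  # stays put: not stabilized, or anomalies seen
    patch.stage += 1
    patch.released_on = today
    return ENVIRONMENTS[patch.stage]
```

Under this sketch, a patch that surfaces anomalies in any environment (as the CA-IDMS patch did in production) is held at its current stage rather than promoted.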
Q) This is viewed as a mission-critical service and is the time of year when people are normally filing their taxes. It’s also the longest time online services have been down on the CRA Web site. Why did it take so long to get things up and running?
A) Due to the unprecedented scale and scope of this incident, it had a significant impact on the level of effort and time required to get all IT services back to normal production status. As we indicated earlier, it mostly affected individual taxpayer services, which involved more than 76 computer applications comprising more than 7.5 million lines of code and interfacing with 91 different databases.
Q) How big of a team was working on this?
A) Our IT staff worked around the clock, as you would expect, to resolve the problem. Due to the complexity and size of databases and systems, this was the amount of time (10 days) required to restore services. We estimate that 700 CRA employees worked on this problem.
Q) Were there other departments that were involved or that you had to consult with?
A) The IT branch did, of course, work with the agency's business branches to make sure the recovery plan accounted for business impact. We also called upon the expertise and advice of key CIOs in other federal agencies and departments who are our business partners. Together, we worked in cooperation to manage the possible impact on other government organizations and to coordinate the sequence in which key services were brought back into production.
Q) Now that the systems are up and running, do you think that your branch will be looking at amending best practices or changing procedures?
A) We’re currently working with Treasury Board on a post-mortem document as a result of the outage. And I’d just like to say that CRA’s IT branch follows a set of well-established, rigorous frameworks that ensure risk is proactively managed and that process quality and data integrity are respected and safeguarded across all our operating environments. These management processes are derived from the ITIL (Information Technology Infrastructure Library) framework, a framework of industry best practices in IT.