Software failure cited in August blackout investigation

The task force responsible for investigating the cause of the Aug. 14 blackout that crippled most of the Northeast corridor of the U.S. and parts of Canada concluded that a software failure at FirstEnergy Corp. “may have contributed significantly” to the outage.

The Interim Report of the U.S.-Canada Power System Outage Task Force, released late last month, highlights the failure of various IT systems that thwarted utility workers’ ability to contain the blackout before it cascaded out of control. The report found no evidence that malicious insiders or external saboteurs were responsible for the power outage.

According to the task force, FirstEnergy’s Alarm and Event Processing Routine (AEPR), a key software program that gives operators visual and audible indications of events occurring on their portion of the grid, began to malfunction. As a result, “key personnel may not have been aware of the need to take preventive measures at critical times, because an alarm system was malfunctioning.”

In addition, “some companies appear to have had only a limited understanding of the status of the electric systems outside their immediate control,” the task force report concluded. “This may have been, in part, the result of a failure to use modern dynamic mapping and data sharing systems.”

Besides the alarm software failure, the task force found that Internet links to Supervisory Control and Data Acquisition (SCADA) software weren’t properly secured and that some operators lacked a system to view the status of electric systems outside their immediate control.

The task force also provided a “cyber timeline” listing significant electronic control events that contributed to the rolling blackout. The first major event occurred at 12:40 p.m. EDT, when an engineer from the Midwest Independent Transmission System Operator disabled an automatic periodic trigger on software that allows the utility to determine the real-time status of the power system for its region. That action was needed to conduct a manual check of the network, the report states. However, the engineer later went to lunch and forgot to re-engage the automatic trigger.

By 2:40 p.m. EDT, the AEPR software had begun to malfunction, although FirstEnergy engineers weren’t aware of the problem at the time. One minute later, FirstEnergy’s AEPR server failed and switched over automatically to the backup server. Engineers, however, remained unaware of any problems with the software. Then, at 2:54 p.m., the backup server failed.

At 3:05 p.m., when the first power-line failure occurred at FirstEnergy, system operators did not receive alarm notifications because of the malfunctioning AEPR software. That software continued to malfunction until 3:42 p.m., when the lights at FirstEnergy’s control facility flickered and alerted engineers to the larger problem. It was only then that an operator noticed the problem with the AEPR software.

The fragile nature of the power grid also raised questions about its overall cybersecurity and its susceptibility to deliberate disruption by terrorist organizations.

The U.S. Department of Homeland Security is currently working with the electric industry and private-sector IT companies to develop IT intrusion-detection systems that are capable of operating in the real-time environment of SCADA systems.
