One night last week, I was rushing to get home at a decent time when the head of our software support team slipped me a note suggesting that I look into an issue with data loss on one of our production database servers. I was tempted to just leave it until the next day, but I decided to investigate right away to obtain the highest-quality data from the fresh crime scene.
I quickly discovered that there was little hope of preserving evidence. The data had been deleted six hours before my arrival. During that time, the support team had run a partial investigation, but like any support group, the staffers had focused primarily on diagnosing the cause of the loss and restoring service. While they succeeded in restoring service, their work had made evidence preservation impossible.
Pieces of the Puzzle
The problem was that all of the tables within the database had been dropped and then re-created in a 10-second period. This erased all of the data in the database, including financial data that must, by law, be kept for seven years. Such a rapid deletion pointed to a script-based attack rather than a manual one.
The support team had found a prime suspect: the install scripts for the software that uses the database. Those scripts drop and re-create tables in the same order in which the tables had been deleted. But who had run the script?
Our preferred response to a possible system compromise is to pull the plug and image the disks. After a few weeks of careful analysis, we wipe the disks and reinstall. But when something goes wrong on a production system, we must leave it running for as long as possible, until we have compelling evidence that something awful is happening. By waiting while we trample all over the crime scene with our tools, we reduce the amount and quality of the evidence we collect. Unfortunately, that’s the trade-off we have to make.
If we can’t stop the machine, our first instinct is to gather the ephemeral digital footprints of the intruder. These usually involve system properties that don’t get logged because they change too often, but if we can capture them quickly, they can easily point our investigation in the right direction. Details like TCP connection status or recently opened documents can tell us what has been happening on a system and who’s connected.
Running Windows’ Netstat utility showed that only the investigating support team was currently connected to the machine. The security log was empty. The system wasn’t monitored by our host-based intrusion-detection system, as the administrators had felt the performance overhead of logging would outweigh the risk of any incident.
The system was running in a locked machine room, which nobody had entered, so the closed-circuit television wasn’t going to show us who had caused the data loss.
However, as in many environments, lack of physical access doesn’t imply that nobody can access the machine. A review of the enabled remote access methods showed that Birkerod, Denmark-based Danware Data A/S’s NetOp remote control software was enabled, as were Microsoft Corp.’s Systems Management Server (SMS) and Windows NT Server 4 Terminal Server Edition connections. (Our company has tried to define a standard for remote access to servers for support, but each support team has a preferred method.)
Just because the formal security logging isn’t enabled, however, doesn’t mean that a system doesn’t have evidence on its disks. This system had more than 300KB of text log files that had been updated since its last reboot that morning.
The lost data was still stored on a second machine. The backup had been designed to protect us from hardware errors, but it helped with this incident as well. After ensuring that all the logs had been secured, I called it a night.
Incompetence Produces Results
I’m sure there are high-tech ways to crunch such data, but given the number of logs and the strange formats involved, I used the low-tech approach: I printed it all out and read it, line by line.
The SMS log showed two log-ins, one using SMS and another using Terminal Server, but it had no record of an associated user name. By looking for all events that had happened at the same time as these log-ins, I chanced upon a useful error log generated by our Patrol performance monitoring software from Houston-based BMC Software Inc. This log contained the results of the software’s attempts to install desktop icons for each user who had connected.
The software had been improperly installed under a normal user account, so it lacked the privileges to access users’ desktops. Every time someone logged in, it recorded a log entry saying “Patrol install user DOMAIN\BOB doesn’t have privileges to write to user DOMAIN\FRED desktop.” In so doing, it gave me the exact times and user names for everyone who had logged in. Patrol logging also revealed that a third log-in had occurred, this one via NetOp.
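The broken Patrol install had, in effect, become a log-in audit trail. A minimal sketch of how such entries could be mined for times and user names, assuming a hypothetical timestamped line format (the real Patrol log layout may differ):

```python
import re

# Hypothetical line format based on the error entries described above;
# the actual Patrol log layout may differ.
LINE_RE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"Patrol install user (?P<install_user>\S+) doesn't have privileges "
    r"to write to user (?P<login_user>\S+) desktop"
)

def extract_logins(lines):
    """Return (timestamp, user) pairs for every log-in the Patrol log recorded."""
    logins = []
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            logins.append((m.group("time"), m.group("login_user")))
    return logins

sample = [
    r"2002-06-11 02:14:05 Patrol install user DOMAIN\BOB doesn't have "
    r"privileges to write to user DOMAIN\FRED desktop",
]
```

Correlating the extracted timestamps against the SMS and NetOp logs is what turns a misconfiguration complaint into a list of suspects.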
Our investigation tracked down the circumstances of the log-ins: An operator had logged in overnight to do normal quality assurance; a database administrator had checked for performance issues midmorning; and the support team had logged in eight minutes after the incident to respond to the errors that the missing data caused.
Without a log-in at the time of the erasure, it looked as though the script had been run remotely through SQL Server. For ease of central deployment, the script had been written to take a host name and deploy remotely onto that node. Without a timely dump of network connection information or SQL Server logs, we had reached a dead end. We chalked the event up to an installer mistyping a host name rather than to malicious behavior.
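The deployment pattern the team described might look something like the sketch below. The `osql` client and its flags are assumptions for illustration, as is the wrapper itself, but it shows why a single mistyped host name is enough to send the drop-and-re-create script to the wrong production server:

```python
import subprocess

def build_install_command(host, script_path):
    # osql was SQL Server's command-line client of that era; the flags here
    # are illustrative. -S names the target host, so one typo in `host`
    # re-creates the tables on the wrong server, erasing its data.
    return ["osql", "-S", host, "-E", "-i", script_path]

def deploy(host, script_path):
    """Run the install script against the named host (hypothetical wrapper)."""
    subprocess.run(build_install_command(host, script_path), check=True)
```

A confirmation prompt, or a whitelist of hosts the script may legally target, would have been a cheap safeguard.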
The system administrators still wouldn’t accept additional logging, but we did manage to train them on incident handling: how to avoid destroying evidence and when to call us. We also deployed a “first-aid” package of programs (PSLoggedOn, Fport and Netstat) and wrapped them in a script that staffers can run to notify security and send us data that would otherwise be lost.
We also raised security awareness in the support staff and found a new way to track log-ins, via the broken Patrol install. Every time broken systems help me do my job better, it brings a smile to my face.
After our mysterious database crash, we issued a “first-aid kit” to our software support team as a way to protect evidence during future incidents. We stitched the following components together into a single script that the staff can run:
http://www.foundstone.com/knowledge/free_tools.html: Fport, free from Mission Viejo, Calif.-based Foundstone Inc., identifies which applications are listening on which ports. Fport is a great way to uncover Trojan horses installed on your system.
http://www.sysinternals.com/ntw2k/freeware/psloggedon.shtml: PSLoggedOn, an applet available free from Austin, Texas-based SysInternals LLC, provides a comprehensive listing of who’s logged on to a Windows NT system at a given time.
http://www.microsoft.com: Microsoft includes the Netstat utility with its Windows and .Net operating systems. We use it to list all ports and systems engaged in communication with a Windows NT server.
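Stitched together, the first-aid script amounts to running each tool and saving its output somewhere safe before anyone touches the machine. A minimal sketch of that wrapper, with tool names and flags as illustrative assumptions rather than the script we actually shipped:

```python
import datetime
import subprocess
from pathlib import Path

# Tool names and flags are illustrative; adjust to wherever the binaries
# actually live on your servers.
SNAPSHOT_COMMANDS = {
    "netstat": ["netstat", "-an"],   # open ports and active connections
    "fport": ["fport"],              # which process owns each port
    "psloggedon": ["psloggedon"],    # who is logged on right now
}

def capture_volatile_state(out_dir="evidence"):
    """Run each tool and write its output to a timestamped folder."""
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    dest = Path(out_dir) / stamp
    dest.mkdir(parents=True, exist_ok=True)
    for name, cmd in SNAPSHOT_COMMANDS.items():
        try:
            result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
            (dest / f"{name}.txt").write_text(result.stdout)
        except (OSError, subprocess.TimeoutExpired) as exc:
            # A missing or hung tool shouldn't abort the whole capture.
            (dest / f"{name}.err").write_text(str(exc))
    return dest
```

The point of the wrapper is speed and completeness: staffers run one command, every tool's output lands in one timestamped folder, and the volatile state is preserved before the investigation tramples it.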