It doesn’t take a scientist to figure out that shooting an animal with a dart gun and clamping a radio collar around its neck is probably going to be a little invasive. So a European non-profit organization called WildTrack developed a technique called footprinting, which monitors animals through their footprints to see how climate change and poaching are affecting endangered populations.
So far, it’s been used with rhinos in Africa and tigers in India, and it’s now being used by Queen’s University to track polar bear populations in the Canadian Arctic. The Holy Grail is to create a global database of endangered species – but to reach that goal, the researchers will need powerful computing resources. And this is an issue that many researchers, developers and analysts are facing as they collaborate on projects around the world involving massive amounts of data.
“The goal really is to see whether we can use footprinting to identify individuals and to find out where those individuals are going and how they’re making use of the diminishing ice in the Arctic,” said Zoe Jewell, one of the founders and directors of WildTrack. The longer-term goal is to see whether observations can be made about how the changing climate is affecting polar bear populations and, to some extent, how polar bears are indicating climate change by their movements.
Building the footprint database
At the moment, polar bears are monitored through capture-and-release studies, in which the animals have to be radio-collared. This is done only every nine to 11 years because it’s expensive – and invasive. But with permanent sea ice projected to disappear in 40 to 60 years, animals need to be monitored on an annual or biannual basis to understand what’s happening out there.
“If we can in addition bring Inuit expertise into this, the whole system will benefit enormously from a tradition of people who’ve co-existed with polar bears for thousands of years, instead of having outsiders come in with helicopters and dart guns.”
The goal is to develop a WildTrack endangered species database, which will act as a conduit of information for anybody interested in getting non-invasive techniques up and running for other species. Right now, researchers have to build a database of footprints from known animals themselves; if WildTrack already had one, the algorithms would be in place for researchers to start using. “That would be a huge advance for us, and it’s something we’re ready for,” she said. “It’s a matter of getting the funding.”
WildTrack estimates it would take two years to build up a basic database, but it would in turn provide a “biological inventory,” which would not only be useful for monitoring endangered species, but perhaps for doing biological research or analyzing specific areas where animals are a problem, like man-eating tigers in India.
WildTrack identifies individual animals by taking a series of footprint measurements and using statistical analysis to build a “geometric profile” for each one. It all started with the black rhino in Zimbabwe. “We found that the immobilization of these animals was having an impact on their fertility,” said Sky Alibhai, the other founder and director of WildTrack.
“In a way that led us to develop a non-invasive strategy of monitoring these endangered species.”
WildTrack has since developed algorithms for the white rhino, the lowland tapir and the Bengal tiger, and algorithms for several other species are in development.
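The geometric-profile idea can be sketched in code. The measurements and individual names below are entirely hypothetical, and the nearest-centroid matching is a deliberately simplified stand-in for the statistical analysis WildTrack actually runs – it shows only the shape of the approach: summarize known prints into a profile, then assign an unknown print to the closest one.

```python
import math

# Hypothetical footprint measurements (in cm) for known individuals;
# each tuple is a vector of landmark distances taken from one print.
known_profiles = {
    "bear_A": [(18.2, 15.1, 9.4), (18.0, 15.3, 9.5)],
    "bear_B": [(20.1, 16.8, 10.2), (20.4, 16.5, 10.0)],
}

def centroid(vectors):
    """Mean measurement vector -- a crude 'geometric profile'."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))

def identify(print_vector, profiles):
    """Assign an unknown print to the individual with the nearest profile."""
    best, best_dist = None, float("inf")
    for name, vectors in profiles.items():
        d = math.dist(print_vector, centroid(vectors))
        if d < best_dist:
            best, best_dist = name, d
    return best

print(identify((18.1, 15.2, 9.4), known_profiles))  # prints "bear_A"
```

In practice many more measurements per print, and a proper discriminant analysis rather than raw distances, would be needed to separate individuals reliably.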
The kinds of questions researchers ask will determine the sort of database required. It’s built on SAS JMP technology, currently used in its standalone capacity.
“We’re going to investigate with them the possibility of using SAS on the backend to do more work as their data loads get bigger,” said Jeff Perkinson, product manager for JMP with SAS. JMP (pronounced “jump”) is an interactive data exploration tool intended for scientists and engineers that allows them to explore data and discover relationships and patterns. In the case of WildTrack, it helps researchers query data and retrieve the portions they may be interested in, from a subspecies within a given geographic region down to an individual animal.
“If you only have one way to look at your data, you’re only going to be able to see one kind of problem,” he said. Genomics researchers are already doing this with genomics databases, which can be hundreds of thousands of variables wide.
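The kind of drill-down querying described above – from a species in a region down to an individual – can be illustrated with a toy filter. The records and field names here are invented for illustration and are not JMP’s schema or API; this only shows the shape of the query pattern.

```python
# Hypothetical footprint records; field names are illustrative only.
records = [
    {"species": "polar bear", "region": "Hudson Bay", "individual": "F-07"},
    {"species": "polar bear", "region": "Beaufort Sea", "individual": "M-12"},
    {"species": "Bengal tiger", "region": "Sundarbans", "individual": "T-03"},
]

def query(rows, **criteria):
    """Return rows matching every given field, mimicking drill-down filtering."""
    return [r for r in rows if all(r.get(k) == v for k, v in criteria.items())]

hudson_bears = query(records, species="polar bear", region="Hudson Bay")
print([r["individual"] for r in hudson_bears])  # prints ['F-07']
```

Each additional keyword narrows the slice, which is the interactive exploration style the tool supports.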
Google runs the world’s largest database, and its distributed model would generally be the one you’d look to for distributing data globally, said Darin Stahl, lead analyst with Info-Tech Research Group. The challenge for university research teams, however, is that a lot of their funding is start-stop, and this may lead them to consider some of the other architectures out there, largely driven by cost.
Distributing the computing power
Cloud computing, for example, is cheap and cheerful. Amazon offers cloud computing through Amazon Web Services, and Google has announced a similar solution. IBM has launched its ready-to-use cloud computing, Blue Cloud, in China. “It would be interesting to see if this sort of R&D initiative gains some sort of backing from these types of corporations, because they’d need an entity like that to smooth the ongoing cost,” said Stahl. “It’s probably a trivial expense in terms of Google or IBM, but it would be big for universities.”
Also, to the scientists who trek out to faraway, dangerous places at great expense, security is a huge issue because they don’t want their data to become corrupted or discredited. “Those are risks, but it’s an interesting set of risk management,” he said. “It’s not the same way society looks at credit card risk, and consolidating that sort of data into a database.”
Security measures could include multifactor authorization and separation of duties. But the challenge comes when organizations develop their own applications and just wrap some security around them, rather than building it into the application itself. Aside from keeping the lights on, there need to be logging, reporting and auditing functions.
What’s central to high-performance computing is not actually the computing, but the data, said Chris Pratt, IBM Canada’s manager of strategic initiatives. If you’re running an experiment that’s taken years to set up, or only happens once in a lifetime, or surrounds a very specific event, then the data itself is incredibly valuable.
“Some researchers will be looking for as much cheap disk as possible, while other researchers will be looking for the right amount of incredibly reliable infrastructure,” he said. “It’s like the difference between storing your wedding photographs or a happy snap.”
The key to finding the right solution is in understanding your requirements – if you’re doing research into genomics or clinical trials, the last thing you want to do is mix up data. How fast a transaction runs may determine how many different iterations you can get, so if you have enough compute power, you can simulate every possibility and pick the best answer.
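The “simulate every possibility and pick the best answer” approach can be shown with a toy brute-force sweep. The `simulate` function below is a made-up stand-in for an expensive computation; the point is only that, given enough compute, an exhaustive sweep over the parameter grid is guaranteed to find the best-scoring combination.

```python
import itertools

def simulate(a, b):
    """Stand-in for an expensive simulation run; returns a score to maximize."""
    return -((a - 3) ** 2 + (b - 5) ** 2)

# Enumerate every parameter combination and keep the best-scoring one.
grid = itertools.product(range(10), range(10))
best = max(grid, key=lambda p: simulate(*p))
print(best)  # prints (3, 5), the combination with the highest score
```

The catch, of course, is that real parameter spaces are vastly larger, which is exactly why compute time becomes the limiting factor.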
“There are lots of industries looking to compress this compute time down or take into account more data,” he said. Researchers cannot have too much storage or too much compute power. “These researchers always need more than they can afford and that is the nature of the beast, because the problems they’re trying to solve are bigger than anything we have today,” said Pratt.
So they have to be careful to design a system that can be expanded over time, because there’s no point in putting millions of dollars into a high-performance computer facility that quickly becomes obsolete.
Here in Canada, TRIUMF – a multi-disciplinary physics lab based in Vancouver – is part of the worldwide Atlas project, a particle physics experiment that will explore the fundamental nature of matter and the basic forces that shape our universe. The Atlas detector, located underground on the 27-km Large Hadron Collider ring near Geneva, searches for new discoveries in proton collisions. The project is a collaboration of 2,000 physicists from 180 institutions in 35 countries (Canada makes up five per cent of that), and it will run for at least 10 years.
One event – or collision – is about 2MB. “And we’re talking millions of collisions, so we cannot do a recording of all these events at that rate,” said TRIUMF data centre manager Reda Tafirout. “So we have a trigger, which means we’ll do a partial analysis of the event and decide whether it’s interesting or not.” The target right now is 200 events per second, which adds up to about 3.5 petabytes a year.
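A quick back-of-envelope check shows how those figures fit together. At 2MB per event and 200 events per second, the trigger output is 400MB/s; reaching roughly 3.5 petabytes a year then implies around 8.75 million seconds (about 100 days) of actual data-taking, which is the assumption made below, since the accelerator does not run year-round.

```python
# Figures from the article; the beam-time estimate is an assumption.
event_size_mb = 2          # ~2 MB per recorded collision
events_per_sec = 200       # trigger output target
beam_seconds = 8.75e6      # assumed ~100 days of data-taking per year

rate_mb_per_sec = event_size_mb * events_per_sec      # 400 MB/s off the trigger
yearly_pb = rate_mb_per_sec * beam_seconds / 1e9      # MB -> PB (decimal units)
print(f"{rate_mb_per_sec} MB/s, ~{yearly_pb:.1f} PB/year")  # 400 MB/s, ~3.5 PB/year
```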
The Atlas project stores and analyzes this data through a tiered approach. The primary data is collected at a Tier Zero centre in Geneva, which is then farmed out to 10 Tier One centres around the world, including TRIUMF. Each Tier One centre has an associated set of Tier Two centres that do simulations with recalibrated data. “It took 20 years to build the experiment, so you don’t want to lose it,” said Tafirout.