While many observers of the night sky gaze upward to simply enjoy the stars, Sam Roweis, assistant professor of computer science at the University of Toronto, is using statistical data analysis to tell him more about what he sees in the cosmos.
Recently, astronomers managed to assemble several huge digital catalogues – lists of stars they know about and can see from Earth – that hold, among other data, the positions and brightness of billions of stars.
Roweis is working with Chris Harvey, a computer science master’s student at U of T, and David Hogg, a professor in New York University’s department of physics, on an astrometric calibration robot – a tool that Roweis hopes will be able to process an arbitrary picture of the night sky and then determine the angular location on the sky at which the picture was taken.
The challenge, explained Roweis, is that when a new picture is taken, many stars that aren’t already in the catalogue end up in the picture – and that makes it much more difficult to narrow down the positioning possibilities. Roweis said his team built an inverted index based on statistics about the positions and magnitudes of the stars already in their catalogues. It functions much like a Web search engine, or the index found in most library systems, where the user types in a keyword to get a list of items. In Roweis’s index, the attributes and positions of the stars act as the “keywords” that help narrow down the list of possible viewing directions.
“We are looking at occurrences of stars near each other,” Roweis said. “If we see three stars in a certain pattern, with a certain distance between them and with a certain brightness, we can search through our data and learn which kinds of triples of stars are very common…(or) relatively unusual. If we find (the unusual triples), we know that we are probably now in one of these limited numbers of locations” that the picture is depicting, he explained.
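The idea of an inverted index keyed on star triples can be illustrated with a minimal sketch. It is not the team’s actual code: the function names, the flat two-dimensional coordinates, the tiny made-up catalogue and the choice of 20 bins are all assumptions made for illustration, and brightness is left out to keep the example short. Each triple of catalogue stars is reduced to a “keyword” describing the shape of its triangle, and the index maps that keyword back to the triples that produce it.

```python
from collections import defaultdict
from itertools import combinations
from math import dist


def triple_key(p, q, r, bins=20):
    """Quantize the shape of the triangle p-q-r into a hashable key.

    The two shorter sides are divided by the longest one, so the key is
    the same when the picture is shifted, rotated or rescaled.
    Assumes the three points are not all identical.
    """
    a, b, c = sorted((dist(p, q), dist(q, r), dist(p, r)))
    return (int(a / c * bins), int(b / c * bins))


def build_index(catalogue, bins=20):
    """Inverted index: shape key -> list of catalogue star triples."""
    index = defaultdict(list)
    for (i, p), (j, q), (k, r) in combinations(enumerate(catalogue), 3):
        index[triple_key(p, q, r, bins)].append((i, j, k))
    return index


def lookup(index, p, q, r, bins=20):
    """Return the catalogue triples whose shape matches the observed one."""
    return index.get(triple_key(p, q, r, bins), [])


# Hypothetical toy catalogue of star positions on a flat plane.
catalogue = [(0, 0), (10, 0), (3, 6), (25, 4), (18, 17), (7, 22)]
index = build_index(catalogue)

# Three stars from a "new picture": catalogue stars 0, 1 and 2,
# doubled in scale and shifted by (5, -3).
candidates = lookup(index, (5, -3), (25, -3), (11, 9))
print(candidates)
```

A short candidate list corresponds to the “relatively unusual” triples in the quote above: the fewer catalogue triples share a key, the more a match narrows down where the picture was taken. A real sky catalogue would need spherical coordinates, tolerance for measurement noise at bin boundaries, and a way to avoid enumerating every possible triple.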
The astrometric calibration robot was just one of the data mining applications featured at the Quebec Interchange conference of the Mathematics of Information Technology and Complex Systems (MITACS) network, held recently at McGill University in Montreal. Dr. Arvind Gupta, scientific director of MITACS, said the conference focused on various aspects of data extraction involving machine learning.
Also at the conference, Bell Canada discussed its use of data mining to determine long-distance calling patterns based on people’s names and ethnic origins, in order to pin down their purchasing practices. “For example, the Chinese community would probably place a lot more calls to China (than other communities),” Gupta said. With that data in hand, Bell could “offer a specific long distance plan for that area of the city,” he explained.
Gupta added that data mining techniques could also be applied to direct marketing – for example, with grocery store loyalty programs.
“They would be collecting statistical information and keeping track of who you are, which can be used for marketing…or if you’re loyal to the company, they could reward you in some way.”
With machine learning, the patterns within data are modelled so algorithms can “recognize certain patterns and unrecognize other patterns,” Gupta said. Machine learning requires a lot of computation and statistics, as well as time. “These models refine themselves over time – they need lots of data so they can become accurate.”
Large enterprises are ideal environments for machine learning because they typically have huge data sets, he noted.
Although data mining can be useful for a business that wants to tailor its offerings to its clients, privacy is always an issue, Gupta admitted. “It’s up to companies to really design policies that ensure privacy is protected and to know what information the customer has agreed to give out.”