In an effort to highlight the role of data mining in humanities research, a University of Alberta professor is helping to create text analysis tools to deeply examine historical trial accounts from the U.K.’s famous Old Bailey criminal court. While the research project is important to academia, the Edmonton-based researcher said that improving the quality of text mining tools could have benefits for businesses as well.
The project is a joint effort between the University of Alberta, U.S.-based George Mason University, and U.K.-based University of Hertfordshire. It is part of the “Digging Into Data Challenge” competition, which aims to show how tools in the digital humanities can improve the effectiveness of text mining large databases.
Geoffrey Rockwell, a professor of philosophy and humanities computing at the University of Alberta, said the explosion of blogs, wikis, digitized books, and discussion forums has led to information overload for many researchers. The same could be true for marketers or other business units keeping tabs on what is being written about their companies.
“If I were doing this for market research, I could track how we appear in discussion lists, what words are near the brand, and whether they are talking the stock up or down,” he said.
In the Old Bailey project, Rockwell hopes the tools he and his international colleagues are developing will change how users and businesses sift through digital material. The key, he said, is to use a mining tool to draw correlations from a body of text.
“You can start looking at words most used in cases involving women pickpockets,” he said, referring to the Old Bailey trial record dataset. “Are people still worried about witchcraft? How are children being talked about? Is poison still popular?”
Rockwell and his international colleagues are developing tools like TAPoR, a textual analysis tool that can summarize a body of text, find collocates, identify important dates, and discover the co-occurrences of two target words. Some of the tools in TAPoR use forms of visualization to help researchers grasp the data more clearly.
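To make the terms concrete, here is a minimal Python sketch of what finding collocates and counting co-occurrences can look like. It is an illustration only, not TAPoR's implementation: the function names, the window sizes, and the sample sentences are all assumptions made up for this example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def collocates(tokens, target, window=3):
    """Count the words found within `window` tokens of each occurrence of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

def cooccurrences(tokens, a, b, window=5):
    """Count the pairs of positions where `b` lies within `window` tokens of `a`."""
    positions_b = [i for i, t in enumerate(tokens) if t == b]
    return sum(
        1
        for i, t in enumerate(tokens) if t == a
        for j in positions_b if abs(i - j) <= window
    )

# Invented sample text, not taken from the Old Bailey records.
sample = ("The prisoner was accused of stealing a watch. "
          "The watch was found on the prisoner. "
          "A witness saw the prisoner near the shop.")
tokens = tokenize(sample)

print(collocates(tokens, "prisoner", window=2).most_common(3))
print(cooccurrences(tokens, "prisoner", "watch", window=6))
```

A real tool would also filter out common function words such as "the" and would handle far larger texts, but the underlying idea is the same: look at what sits near a word of interest.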
Another tool, Zotero, works as a free Firefox extension that collects and manages research sources.
Both TAPoR and Zotero are part of the Old Bailey project.
The ultimate goal for Rockwell is to make these tools more accessible to students, consumers, and businesses and have them start appearing on blogs, wikis, discussion boards, and even embedded right into browsers.
A search engine like Google, he said, is comparable to a card catalogue that directs you to a piece of information. This is also the way many text mining tools currently work.
“Most of these tools assume you have the word you want to find,” Rockwell said. “But instead of looking for a needle in the haystack, an effective text mining tool will try and show you the shape of the haystack and tell you the words you might want to find.”
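One simple way to picture "the shape of the haystack" is a list of a text's most frequent content words, produced without the user supplying any search term. The Python sketch below shows that idea under stated assumptions: the stopword list is a tiny invented one, the sample text is made up, and `top_terms` is a hypothetical helper, not a function from any of the tools mentioned here.

```python
import re
from collections import Counter

# A deliberately tiny, illustrative stopword list; real tools use far longer ones.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "was", "on", "with", "for", "is"}

def top_terms(text, n=5):
    """Return the n most frequent non-stopword tokens in the text with their counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)

# Invented sample text, not taken from the Old Bailey records.
sample = ("The prisoner was charged with poisoning. "
          "A witness spoke of poison in the tea. "
          "The poison was found in the kitchen, and the prisoner was in the kitchen.")

for word, count in top_terms(sample, n=3):
    print(word, count)
```

The reader starts from the text rather than from a query, and the output suggests words worth searching for next.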
One trend that Rockwell expects in this space is the rise of entity recognition, which could involve tools that recognize proper names, dates, and places. This would make it easier to classify what a particular body of text is about.
For businesses, text mining and analysis of blogs or discussion boards can provide valuable insights into customer sentiment. More structured approaches such as multiple-choice questionnaires “limit the depth of insight and breadth of customer feedback,” Bruce Temkin, vice-president and principal analyst with Cambridge, Mass.-based Forrester Research Inc., said in an interview with ComputerWorld Canada last year.
Text mining, on the other hand, is by no means a novel technology, but vendors are increasingly making it accessible for applications like Voice of the Customer (VoC). Unstructured data channels such as inbound e-mail, call centre conversations, blogs, and SMS messages are prime targets for mining by virtue of being text-oriented, Temkin said, as is social media.
– With files from Kathleen Lau