Everyone loves social networks these days. Homeland security wants to track which terrorists know one another. Laundry companies want to know your friends so that they can get you to pass along the good word about the new starch. Content mongers, meanwhile, believe that they can link together similar movie, television, and music preferences among users so that people who love “Die Hard” can be automatically informed that they might want to check out “Die Hard 2: Die Harder.”
At the heart of these problems and dozens of other ideas springing from the foreheads of marketing directors everywhere are graph databases. (Computer scientists use the word “graph” to describe collections of objects and the links among them.) Using graph databases instead of traditional relational databases to store social-network-type data structures can yield faster answers to important questions — such as what kind of donut your friend’s friend’s friend prefers, or whether someone from your “Labor Day” DVD was in a movie with someone who was in a movie with Kevin Bacon.
Pure graph databases aren’t the only answer. Simple, schema-free NoSQL databases are emerging that rival the capabilities of graph databases in swiftly cranking out quick answers to the aforementioned questions. They achieve this feat by not wasting time fretting over transactions, instead focusing on pre-computing answers.
Moreover, certain types of simple queries that might be required of a social network are better suited for the indexing and table balancing built into a relational database. For example, if you have links between people stored in an indexed table with the ID numbers in two columns, it’s easy for a relational database to find everyone who is a friend of Bob or everyone who follows Chris. The graph structure doesn’t really help with such queries.
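That two-column, indexed table can be sketched in a few lines. Here is a minimal example using Python's built-in sqlite3 module; the table and names are hypothetical, but the shape matches the description above:

```python
import sqlite3

# In-memory sketch: a two-column "follows" table indexed on both columns.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE follows (follower TEXT, followee TEXT)")
db.execute("CREATE INDEX idx_follower ON follows (follower)")
db.execute("CREATE INDEX idx_followee ON follows (followee)")
db.executemany("INSERT INTO follows VALUES (?, ?)",
               [("Alice", "Chris"), ("Bob", "Chris"), ("Bob", "Dana")])

# "Everyone who follows Chris" is a single indexed lookup -- no graph needed.
followers = [row[0] for row in
             db.execute("SELECT follower FROM follows WHERE followee = ?",
                        ("Chris",))]
print(sorted(followers))  # ['Alice', 'Bob']
```

For one-hop questions like this, the index does all the work, which is why a plain relational table remains a perfectly good fit.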
I looked at three databases geared toward social networks: the graph database Neo4j and the nonrelational newcomers Cassandra and FluidDB. All three are relatively young but hold promise in helping organizations connect the dots among their user base.
At first glance, there’s not much to Neo4j, a graph database written in Java that can be linked with Ruby and Python. There are just nodes, attributes to nodes, and relationships between nodes. To find an answer, you create a traversal object that bounces around the nodes by following the relationships until it comes up with your answer.
Neo4j can easily answer basic questions such as, “How many friends of a friend does someone have?” by simply following the friendship links between people nodes. In fact, the Neo4j documentation includes a sample project drawing on data from IMDB.com to solve the classic Six Degrees of Kevin Bacon conundrum in just a few lines of code.
Neo4j’s real power lies in its ability to solve problems that demand repeated probing throughout the network. You can bundle up a query in a traversal object that will scan through multiple connected nodes to find the answer. It will repeatedly ask for one row of a traditional database, then use that information to search for a new row again and again and again. By contrast, a traditional database would require a separate query for each step through the search, driving traffic through the roof.
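To make the traversal idea concrete, here is a minimal, library-free sketch of what such a traversal object does: start at one node and walk FRIEND relationships breadth-first, instead of issuing one relational query per hop. The tiny graph is hypothetical:

```python
from collections import deque

# A toy adjacency list standing in for a graph of person nodes.
friends = {
    "Ann": ["Bob", "Carl"],
    "Bob": ["Ann", "Dee"],
    "Carl": ["Ann", "Eve"],
    "Dee": ["Bob"],
    "Eve": ["Carl"],
}

def friends_of_friends(start, max_depth=2):
    """Return everyone reachable within max_depth hops, excluding start."""
    seen = {start}
    found = set()
    queue = deque([(start, 0)])
    while queue:
        person, depth = queue.popleft()
        if depth == max_depth:
            continue  # don't expand beyond the requested depth
        for friend in friends.get(person, []):
            if friend not in seen:
                seen.add(friend)
                found.add(friend)
                queue.append((friend, depth + 1))
    return found

print(sorted(friends_of_friends("Ann")))  # ['Bob', 'Carl', 'Dee', 'Eve']
```

A graph database keeps this entire walk inside the engine, following pointers from node to node; the relational version would need a fresh indexed query for every hop.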
Searching algorithms aren’t news to anyone who’s taken basic computer science courses. There are even a number of libraries, such as JGraphT, that implement many of the classic graph algorithms in Java. The beauty of Neo4j is that it turns these data structures into a database by adding persistence, transactions, and caching. You just keep dumping those nodes in, and Neo4j will find a way to store them on disk so that they can be found after a power failure.
In building a few projects, I found Neo4j’s performance to be quite good in cases involving deep searching through the networks. Neo4j promises results up to a thousand times faster than a relational database, and that claim seems entirely plausible for intense problems such as searching everything in a big network.
It’s pretty easy to bump up against the limits of Neo4j today. Implementing a project requires some forethought, much like the design work that goes into planning a schema for a relational database. The challenge lies in the fact that searches are all on the nodes, not the relationships between them, and this confused me for a bit. I wanted to skip looking through all these nodes and zoom in on only the ones bound by a relatively rare relationship. The trick is to create, then grab, extra nodes that represent the different types of relationships out there.
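The extra-node trick can be sketched in a few lines. In this hypothetical example, a hub node stands in for a rare relationship type, so a search can start there and jump straight to the relevant edges instead of scanning every person node:

```python
# One hub node per relationship type; each edge node links two people.
# All names here are hypothetical.
graph = {
    "type:MARRIED_TO": ["alice-bob", "carol-dave"],   # the hub node
    "alice-bob": ["Alice", "Bob"],                    # one marriage
    "carol-dave": ["Carol", "Dave"],                  # another
}

def people_in_relationship(rel_type):
    """Start at the type hub and collect everyone bound by that relationship."""
    hub = "type:" + rel_type
    return sorted(p for edge in graph.get(hub, []) for p in graph[edge])

print(people_in_relationship("MARRIED_TO"))
# ['Alice', 'Bob', 'Carol', 'Dave']
```

The payoff is that the search touches only the handful of nodes hanging off the hub, however large the rest of the graph grows.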
Moreover, searching for a particular node with a particular attribute is better handled with Lucene, which now comes bolted onto the bigger distribution of the Neo4j project. If you want to ask for the wives of all of Bob’s male friends, you would first use Lucene to search for Bob, then turn to the Neo4j part of the API to search his social network. The Neo4j project is starting to expand, however, with the addition of new algorithms and data structures.
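The two-step pattern — an index lookup to find the starting node, then a graph walk from there — looks roughly like this sketch, where a plain dict stands in for the Lucene index and all the data is hypothetical:

```python
# Hypothetical person nodes with attributes, plus two relationship maps.
people = {
    "p1": {"name": "Bob", "gender": "male"},
    "p2": {"name": "Al", "gender": "male"},
    "p3": {"name": "Sue", "gender": "female"},
}
friend_of = {"p1": ["p2"], "p2": ["p1"]}
wife_of = {"p2": "p3"}

# Step 1: the "Lucene" step -- find Bob's node ID by attribute.
name_index = {attrs["name"]: pid for pid, attrs in people.items()}
bob = name_index["Bob"]

# Step 2: the graph step -- the wives of Bob's male friends.
wives = [people[wife_of[f]]["name"]
         for f in friend_of.get(bob, [])
         if people[f]["gender"] == "male" and f in wife_of]
print(wives)  # ['Sue']
```

The index answers the attribute question quickly; the traversal takes over once a concrete starting node is in hand.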
Neo4j is beginning to attract all of the necessary extras to build production tools out of it. Some nice subprojects, add-ons, and tools have emerged from fertile open source projects. Ruby and Scala bindings offer REST-ful interfaces. An Eclipse plug-in, Neoclipse, draws the graphs in Eclipse so that you can debug them. There are tools to import SQL databases and others to back up the database.
The documentation for the Neo4j project is a mix of excellent pages and thinner sections. There is a fair amount of discussion devoted to optimizing the performance of the system, which shows that the group is serious about using the tool in real applications where caching and transaction costs matter.
Neo4j comes with one of two licenses: AGPL (the tightest open source license) or a commercial license from Neo Technology.
All in all, Neo4j is an exciting tool that’s just starting to be really useful. The fun comes when you start imagining what all of the other graph algorithms can do. There are implementations of the shortest path algorithms that would help a genealogist, a forensic accountant, and many others playing with social networks. These are just the beginning, and I expect there will be a bit of a renaissance as Website developers start unlocking some of the more arcane graph algorithms developed by computer scientists over the years.
Not all of the queries on a social network require the sophistication of a tool such as Neo4j, because not all searches require deep trips through the graph. Many of the simplest involve intersections and unions of the information attached to various nodes.
Digg, for instance, wanted a symbol to appear beside a link each time it was “dugg” by one of the viewer’s friends. This simple intersection, however, is complicated by the huge mass of information flowing through Digg, making conventional approaches with JOINs of relational tables too slow, even with good indexing.
Digg’s solution has been to use a more write-friendly (nontransactional) environment to write out multiple versions of the data. Instead of computing the intersection at query time with a JOIN, it just precomputes the information for all but the most loved pages. The moment a person “diggs” a link, the denormalization process begins: The app precomputes the JOIN by inserting a mention into the lists of all of the followers of that user, effectively shifting the computational load. That means if someone with 10,000 followers likes a link, there will be 10,000 different entries written at that time.
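The write-time fan-out described above can be sketched in a few lines. Here a plain dict of lists stands in for the per-user feed tables, and all the names are hypothetical:

```python
from collections import defaultdict

followers = {"celeb": ["fan1", "fan2", "fan3"]}  # who follows whom
feeds = defaultdict(list)                        # precomputed per-user feeds

def digg(user, link):
    """One digg by a user with N followers means N writes right now..."""
    for follower in followers.get(user, []):
        feeds[follower].append((user, link))

digg("celeb", "http://example.com/story")

# ...but reading a feed later is a single cheap lookup, no JOIN at query time.
print(feeds["fan1"])  # [('celeb', 'http://example.com/story')]
```

The design trades write amplification for read speed, which is why it only works on a store that is cheap to write to — exactly the bargain the nontransactional databases offer.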
Digg uses Cassandra, a NoSQL database that promises to be “eventually consistent” — that is, an update doesn’t appear immediately in all instances, which is sufficient for something as ephemeral as a link to an article. Facebook, the original developer of Cassandra, often gives me wildly inaccurate versions of my newsfeed, featuring old articles from odd times. It’s not a big deal, though, because it’s just friendly chatter.
If something like this happened when I accessed my bank account online, for example, I would be angry. But the lack of sophistication in adding information to the database means that it doesn’t take long to add the tens of thousands of links.
I’ve enjoyed working with Cassandra and the other new NoSQL databases for some time. The limitations and inaccuracies are often acceptable when the data is as expendable as many of the text strings floating around social networks, especially if the higher speeds make it possible to do some of the pre-computation of JOINs.
The denormalization can also chew through disk space because the data is repeated ad nauseam, but this is less of a problem now that disk space is so cheap. While services such as Digg and Twitter can use these kinds of techniques to speed up answer delivery, they still face the theoretical problem of quadratic growth: If everyone starts following everyone else on Twitter, the load is a disaster.
Cassandra is an excellent tool, and there are a number of similar low-rent databases that can support this kind of approach. MongoDB and CouchDB are also popular.
FluidDB is not a graph database, but it can still handle queries on social networks, thanks to a simple structure and a radical amount of openness. The database is designed to let the world cooperate on tagging data elements; this collaborative work can produce answers to questions.
The FluidDB structure is beguiling. Anyone can attach tags to any data object, but only people with the right roles can see and search these tags. If tags are added with a consistent structure, then boolean operations on these tags can produce accurate solutions to many of the questions that might occur on a social network.
Imagine that my Twitter reader would inject “Peter Wayner follows” tags pointing to all of the people that I follow. At the same time, other people’s readers might inject similar tags pointing to their followers. Then FluidDB could answer questions such as, “Show me everyone followed by two specific people” or, for that matter, any boolean operations of these sets. The advantage of FluidDB is that everyone’s reader is setting tags independently, yet everyone can search all of the tags. The “graph” is built up piece by piece by individuals, but the answers that come from intersections are open to all.
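The boolean operations on tags map directly onto set arithmetic. This hypothetical sketch uses plain Python sets to stand in for the tags different readers would attach to shared objects:

```python
# Tags injected independently by two different users' readers
# (all names hypothetical).
tags = {
    "peter/follows": {"alice", "bob", "carol"},
    "maria/follows": {"bob", "carol", "dave"},
}

# "Show me everyone followed by both Peter and Maria" -- an intersection.
both = tags["peter/follows"] & tags["maria/follows"]
print(sorted(both))  # ['bob', 'carol']

# Any other boolean combination works the same way.
either = tags["peter/follows"] | tags["maria/follows"]      # union
only_peter = tags["peter/follows"] - tags["maria/follows"]  # difference
print(sorted(only_peter))  # ['alice']
```

The point is that neither user had to coordinate with the other: each tag namespace is written independently, yet any query can combine them.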
There are limits to this power. The queries can work on only one layer at a time, just like the Digg example using Cassandra. The query can’t search through several layers without repeatedly asking questions of the database, refining the answer, and then sending another query. If Digg wanted to put a special icon next to the posts of friends of friends, a full graph database such as Neo4j would be required.
The loose structure of the tags, though, means it’s possible to essentially pre-compute some of the most complicated queries. If someone stops following a person on Twitter, the reader might add a “stopped following” tag, thus saving the trouble of subtracting or intersecting lists.
The interest in public/private databases such as FluidDB is just beginning, and the version I tested is merely an alpha. However, I imagine that the structure of the tags will develop organically, much in the same way that Twitter users started coming up with hash tags. The more I use Twitter these days, the more I wish it had an open and flexible structure along the lines of FluidDB.
Room to grow
All three of these solutions are just the first cut at tools that can answer questions about social networks fast enough to satisfy the needs of the people who have to know what their friends’ friends are doing. They can’t quite perform more complex tasks, however, such as computing sums and averages over result sets, at least not with a built-in command. You can implement many of these on your own.
There’s plenty of room for improvement. Neo4j, for example, scans the nodes in the network, but it can’t handle more complex queries; it wouldn’t be able to find nodes with two and only two friends unless you start adding attributes and other features yourself. All it can do is scan the entire graph. That’s why the database needs a more flexible indexing mechanism for finding nodes quickly, not just the Lucene index on the text in nodes.
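To illustrate the gap, here is a hypothetical sketch of that "exactly two friends" question. Without an index on node degree, a full scan is the only option; a maintained degree index would turn it into a lookup:

```python
# A toy graph; without a degree index, the query must touch every node.
friends = {
    "Ann": ["Bob", "Carl"],
    "Bob": ["Ann"],
    "Carl": ["Ann", "Eve"],
    "Eve": ["Carl"],
}

# Option 1: full scan -- count the relationships on every node.
two_friends = sorted(n for n, fs in friends.items() if len(fs) == 2)
print(two_friends)  # ['Ann', 'Carl']

# Option 2: a degree index, kept up to date as edges change,
# would make the same question a single dictionary lookup.
by_degree = {}
for node, fs in friends.items():
    by_degree.setdefault(len(fs), []).append(node)
print(sorted(by_degree[2]))  # ['Ann', 'Carl']
```

Keeping such an index consistent as relationships come and go is exactly the kind of bookkeeping a database engine is better placed to do than application code.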
These improvements will probably be coming along soon, though I’m not sure whether the result will be simple and straightforward. Neo4j includes a number of plug-ins, and it would be simple to add some flexible indexing routines. The open source nature of the code means you can revise and extend it even if you buy a commercial license.
The NoSQL field is attracting a great deal of interest, and the techniques used with Cassandra can easily be applied to any of the other NoSQL databases emerging. All of the energy put into adding speed to these tools will benefit anyone who wants to save query time by denormalizing the data before storing. The technique is so powerful that I expect tools for pre-computing these JOINs will be added to these systems soon enough. This will save programmers the trouble of writing the loops themselves and will even open the door to more optimization later on.
All of this means the kind of people who need people — the kind that Barbra Streisand once sang about — are the luckiest people once again. They’ll have many faster solutions for finding the people who need people and, in turn, the people who need people who need people, and the people who need people who need people who need people and so on and so on. And it’ll all be done with one call to a database.