Wednesday, April 23, 2008

The power of links and the value of global knowledge

Long, long ago, before Google, search engines evaluated and ranked web pages by considering each page in isolation, examining the size of the fonts, the contents of the meta tags, etc. In some cases, it was even possible to "hijack" another site's listings by simply cloning its HTML. Perhaps a few search engines attempted to improve on this with simple tactics such as counting the number of links to a page, but that was generally useless since it was so easy to create "fake" links in order to boost your count.

With PageRank, Google took a very different approach. Instead of considering each page in isolation, they examined the link structure of the entire web and computed a global evaluation of that structure. In other words, they began looking at the entire forest instead of just the individual trees. Google did other things too -- PageRank is just one of many factors, but this general approach of evaluating information in a global context is fundamental to many of their algorithms. These algorithms made it easier for Google to spot which web sites were actually important, and which were just pretenders. Of course Google isn't perfect, and people can still manipulate rankings to some extent, but it was substantially better than the old way, and good enough to form the foundation of what is now a $174 billion company.
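
To make the forest-versus-trees idea concrete, here is a rough Python sketch of the textbook PageRank iteration next to a plain inbound-link count. This is only an illustration and certainly not Google's actual code: the tiny graph, the page names, the damping factor of 0.85, and the 50 iterations are all assumptions I picked for the example.

```python
def inbound_counts(links):
    """The "old way": count how many links point at each page."""
    counts = {page: 0 for page in links}
    for targets in links.values():
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    return counts


def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank over a dict of page -> outgoing links."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the small "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over everyone.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: a few real pages that link to each other, plus a
# "spam" page propped up by five fake pages that nothing else links to.
links = {
    "home": ["news", "blog", "real"],
    "news": ["home", "real"],
    "blog": ["home", "real"],
    "real": ["home"],
    "spam": [],
}
for i in range(1, 6):
    links["fake%d" % i] = ["spam"]

counts = inbound_counts(links)
ranks = pagerank(links)

for page in ("real", "spam"):
    print(page, "inbound links:", counts[page], "pagerank:", round(ranks[page], 3))
```

In this toy graph the "spam" page has more inbound links than "real", so a simple link count ranks it higher, while the global computation ranks "real" higher because the fake pages have almost no rank of their own to pass along. A big enough pile of fake pages can still shift the result, which is the "people can still manipulate rankings to some extent" part.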

Last week I wrote about Facebook gathering similar information about people. By collecting information about people and the links between them, they can start to get a global view of the human "forest". Unfortunately, based on many of the responses, that post wasn't very well written. A lot of people focused on how annoying Facebook applications are (true), how search results limited to your friends would be useless (also true), or other things completely unrelated to my point. A few people mentioned that Facebook hasn't done anything useful with this data, which is actually a good point, but I think that has more to do with Facebook and the newness of the data than it does with the value of the data. After all, the web was around for many years before Google came along and started profitably mining the link structure.

Will Facebook ever do anything useful with the human link data? I have no idea, and it's not particularly important to me. However, I'm confident that SOMEONE will begin mining this data, and that it could ultimately be more valuable than the link data from the web. Facebook is a convenient example because they happen to have a head start on collecting the data, but others might be the first to actually profit from it. Google, in particular, is much better at data mining and also has quite a bit of human link data (from Gmail and Orkut). Microsoft+Yahoo will also have a nice data set, though I doubt that they will know what to do with it. Of course none of this data is perfectly clean and noise-free, but real data never is -- the web certainly isn't.