Follow links where they may take you
Take a look at these lines: “A method assigns importance ranks to nodes in a linked database, such as any database of documents containing citations... The rank assigned to a document is calculated from the ranks of documents citing it.”
Do they make some sense? Would it help if I told you they apply to something you use all the time?
They are taken from a paragraph that a young man wrote in 1998, an abstract for a longer paper that he submitted as a patent application. Maybe you need to read more to fully understand what it all means. But maybe you recognize the name of the young man: Lawrence Page. Page is one of the two founders of an obscure company by the name of Google LLC. The paper describes the algorithm that truly made Google Search the powerful and valuable tool you use today. Because he devised it, Page sought to patent the algorithm.
To some degree, Google Search was powerful right from the start, no doubt. The idea of searching the exponentially growing World Wide Web (WWW), finding pages on it that are relevant to whatever you’re interested in, is something that goes back to the earliest days of the WWW. What’s the use of any collection of information, after all, if you cannot quickly find what’s important to you? That’s why libraries systematically catalogue their books and give you ways to search through that catalogue. In the same way, users of the Web needed a tool to search through everything that’s on it.
But it also quickly became clear that a straightforward search isn’t good enough, for it throws up too many irrelevant results. Is there a way to tell which pages are particularly valuable, particularly useful, to a given search? Well, that’s the question Page set out to answer.
One metric that might seem worth trying is the number of times a search term appears on a page. The higher that number, the easier it is to believe that the page is relevant. But this only takes us so far. For example, suppose you’re interested in information about harmonicas—like I often am, as it turns out—meaning, how they are crafted, how different notes are sounded, what kinds of music each model is suitable for, etc. You turn to Google and type in “harmonica”. Because it mentions the word “harmonica” several times, one result you get—maybe even the first—is an Amazon.com list of harmonicas you can buy. But that doesn’t help you, because you’re not interested in buying. How can you get more relevant results? One way is to refine your query—for example, maybe you’ll try “how are harmonicas made”.
But can Google itself give you more satisfying results?
Actually, yes. That’s what Page’s algorithm does. What it aims for is to rank a page based on how many other pages link to it. This happens all the time with academic papers. The more one is cited by other academic papers, the more likely it is that it is widely useful, important in some sense. In the same way, if a given page on the internet has many other pages linking to it, it’s a good bet that it too is widely useful, important in some sense. That is, one sign of this importance is the number of such links.
But the algorithm goes further still, as the second sentence in that excerpt above suggests. It considers not just how many other pages link to a particular page, but the rankings of those other pages too. What’s the effect here? A given page’s ranking improves not just with the number of other pages that link to it, but with the rankings of those pages in turn. Again, think of the analogy with academic papers. If the mention of Einstein’s Theory of Relativity in this sentence I am writing were the first and only reference to it anywhere, it would probably not be the celebrated, ground-breaking theory it is. Instead, it has been mentioned and discussed in millions of papers, articles, books, podcasts, films, and TV shows over the last century. That itself is a measure of its impact on science and our lives.
That’s what Page’s algorithm tries to do with webpages. As you can imagine, it doesn’t have to stop with the ranking of pages that link to the one I’m considering. What about pages that link to the ones that link to the one I’m considering? Can we go steadily farther afield, following a chain of links to determine the importance and relevance of a document? In theory, yes. In practice, the number of links to explore and enumerate quickly spirals out of control. So, there are questions about how far we can follow the chain of links.
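To make the idea concrete, here is a minimal sketch in Python on a made-up toy graph of four pages—my own example, not anything from Page’s paper. It first counts raw incoming links, then repeatedly weights every link by the current rank of the page it comes from, which is one simple way of letting the whole chain of links have its say. It leaves out the refinements the real algorithm uses.

```python
# A toy link graph (hypothetical example): each page maps to the pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)

# Naive measure: simply count how many pages link to each page.
in_links = {p: 0 for p in pages}
for targets in links.values():
    for t in targets:
        in_links[t] += 1
print("incoming links:", in_links)

# Better: give each page a rank, let it pass that rank along its outgoing links,
# and repeat. After a few rounds, ranks from far up the chain have fed in too.
rank = {p: 1.0 / len(pages) for p in pages}
for _ in range(20):                       # a handful of rounds suffices for a toy graph
    new_rank = {p: 0.0 for p in pages}
    for src, targets in links.items():
        share = rank[src] / len(targets)  # a page splits its rank among its links
        for t in targets:
            new_rank[t] += share
    rank = new_rank
print("rank after following chains of links:", {p: round(r, 3) for p, r in rank.items()})
```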
This was actually a problem I faced many years ago. One of my claims to minor fame is that I was part of a team of four young programming enthusiasts that built an early hypertext system we called PlaneText, some years before the Web came along. While it was something of a prototype, it was also on a scale that’s vanishingly small compared to today’s wildly hyperlinked world. We put PlaneText together really as a tool for our work colleagues. One immediate use they found for it was when a team worked on a project. Here was a way for each member to document her own effort while using links to show how it fit with everyone else’s. Naturally, our own PlaneText team used it like that too. In fact, that’s how we wrote a user manual for PlaneText, and later a technical report describing our work.
One part of the project was entirely mine—the PlaneText Browser. Our colleagues who used early versions of PlaneText told us that what they really wanted was a way to see their linked documents—a picture, if you like, of the web they had built. Again, this is something impossible to even imagine with the billions of links today’s Internet contains. But the PlaneText webs in our company at the time typically had a few hundred links. My task was to start at a specified document, count how many others it linked to, follow each link and repeat at each document I reached. Then I had to draw on the screen a picture of this network of documents.
Yes, just a few hundred links. But even so, I had to be able to detect documents the Browser had already visited, so as not to go in circles. I had to find creative ways to follow links, so that the task remained manageable. With all that, when my Browser drew its picture, it was immediately clear which documents had how many links, and thus their relative importance to the project.
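The heart of that traversal, reconstructed from memory as a Python sketch rather than the original PlaneText code, is nothing more than a queue of documents still to visit and a set of documents already seen; the set is what keeps the Browser from going in circles.

```python
from collections import deque

def map_web(start, get_links):
    """Follow links outward from `start`, counting each document's links.
    `get_links(doc)` is assumed to return the documents that `doc` links to."""
    visited = set()         # documents already handled, so cycles can't trap us
    link_counts = {}
    queue = deque([start])  # documents waiting to be visited
    while queue:
        doc = queue.popleft()
        if doc in visited:
            continue
        visited.add(doc)
        targets = get_links(doc)
        link_counts[doc] = len(targets)
        queue.extend(t for t in targets if t not in visited)
    return link_counts

# Made-up web of three documents, purely for illustration:
toy_web = {"manual": ["intro", "browser"], "intro": ["manual"], "browser": ["intro"]}
print(map_web("manual", lambda d: toy_web.get(d, [])))
# -> {'manual': 2, 'intro': 1, 'browser': 1}
```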
Page’s problem is, of course, infinitely larger and more complex than mine. He would have failed miserably had he relied on the link-following mechanism I used. But there’s a more efficient way. He started by describing the links on the Web using mathematical objects called matrices. Page then realized that a calculation broadly familiar to college students who study matrices would solve the problem: it would reveal which webpages are the most important.
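In outline, that calculation is the hunt for the principal eigenvector of the link matrix: write the links into a matrix, multiply a vector of ranks by it over and over, and the vector settles into the ranking. Here is a sketch with numpy on the same toy graph as before; it omits the damping factor and other refinements the actual patent describes.

```python
import numpy as np

# Same toy graph as before. Column j of M spreads page j's rank across the pages it links to.
pages = ["A", "B", "C", "D"]
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

n = len(pages)
M = np.zeros((n, n))
for j, src in enumerate(pages):
    for dst in links[src]:
        M[pages.index(dst), j] = 1.0 / len(links[src])

# Power iteration: repeated multiplication pulls the rank vector toward the
# principal eigenvector of M, which is the ranking we're after.
rank = np.full(n, 1.0 / n)
for _ in range(50):
    rank = M @ rank

# A and C come out on top; D, which nothing links to, sinks to zero.
print(dict(zip(pages, rank.round(3))))
```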
That’s why the patent. That’s why Page’s obscure company is worth several trillion dollars today. Me, at least I still have my PlaneText technical report with a few Browser pictures.
Once a computer scientist, Dilip D’Souza now lives in Mumbai and writes for his dinners. His Twitter handle is @DeathEndsFun.