Keywords are not a rich enough way for web users to describe what they are looking for. Often, I already have some examples of what I'm looking for, but I want more. My approach is to classify pages based on link structure.
My algorithm will take in a list of URLs (entered by the user), and potentially some keywords, and return other similar pages. This algorithm can naturally be divided into two logical tasks:
Here is one possible approach:
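As a rough sketch, assuming the approach works by crawling outgoing links from each example URL, scoring the pages reached with an exponentially decaying weight in hop distance, and intersecting the results into a common set. The helper names, the naive link extraction, the decay constant, and the hop limit below are all illustrative assumptions, not fixed parts of the algorithm.

```python
# Minimal sketch: score pages by link distance from each example URL,
# then keep the "common set" every example reaches with enough weight.
from collections import defaultdict
from urllib.parse import urljoin
from urllib.request import urlopen
import re

LINK_RE = re.compile(r'href="([^"#]+)"', re.IGNORECASE)

def outlinks(url):
    """Fetch a page and return the URLs it links to (very naive HTML parsing)."""
    try:
        html = urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")
    except OSError:
        return []
    return [urljoin(url, href) for href in LINK_RE.findall(html)]

def score_from(seed, max_hops=2, decay=0.5):
    """Breadth-first walk from one example URL; a page first reached at
    distance d gets weight decay**d (exponential decay in hop count)."""
    scores = {seed: 1.0}
    frontier = [seed]
    for hop in range(1, max_hops + 1):
        next_frontier = []
        for page in frontier:
            for target in outlinks(page):
                if target not in scores:
                    scores[target] = decay ** hop
                    next_frontier.append(target)
        frontier = next_frontier
    return scores

def common_set(example_urls, threshold=0.25, **walk_args):
    """Pages that every example URL reaches with a score above the threshold."""
    per_seed = [score_from(u, **walk_args) for u in example_urls]
    combined = defaultdict(float)
    for scores in per_seed:
        for page, s in scores.items():
            combined[page] += s
    return {page: total for page, total in combined.items()
            if all(scores.get(page, 0.0) >= threshold for scores in per_seed)}
```

Calling `common_set(example_urls)` returns the shared pages together with their summed scores; this is the common set whose limitations are discussed next.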
Clearly this algorithm has several variables that could be tweaked. For example, exponential decay does not seem like the right scoring approach to use. It also seems that, rather than just returning the common set, it would make more sense to find pages that point to it. Likewise, it would be better to get a richer description of the commonality than just the common set; for example, we could find that most of the example pages link to a given page within two hops, and that most of them are linked to from that page. It would obviously help to use these richer descriptions in the search itself, and our users would find it beneficial to see that information as well. A really sophisticated search engine could use the set to derive rules, and then provide a user interface for selectively turning rules on and off.
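To make the "richer description" idea concrete, here is a small, hedged extension of the sketch above. It reuses the illustrative outlinks() and score_from() helpers, and the feature names are assumptions: for each candidate page it reports what fraction of the example pages link to it within a couple of hops and what fraction it links back to.

```python
def describe_commonality(candidate, example_urls, max_hops=2):
    """Summarise how a candidate page relates to the example pages."""
    reaches_candidate = 0      # examples that reach the candidate within max_hops
    linked_from_candidate = 0  # examples the candidate links to directly
    candidate_targets = set(outlinks(candidate))
    for url in example_urls:
        if candidate in score_from(url, max_hops=max_hops):
            reaches_candidate += 1
        if url in candidate_targets:
            linked_from_candidate += 1
    n = len(example_urls)
    return {
        f"fraction_linking_to_candidate_within_{max_hops}_hops": reaches_candidate / n,
        "fraction_linked_from_candidate": linked_from_candidate / n,
    }
```

Features like these could be surfaced directly to users, or fed into whatever rule-derivation scheme a more sophisticated engine uses.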