Finding similar pages on the Web

Keywords are not rich enough for web users to describe what they are looking for. Often I already have some examples of what I'm looking for, but I want more. My approach is to classify pages based on link structure.

My algorithm will take in a list of URLs (entered by the user), and potentially some keywords, and return other similar pages. This algorithm divides naturally into two logical tasks:

  1. Find the intersection of all the pages (what they have in common).
  2. Find other pages that fall into this category.

Here is one possible approach (a rough code sketch follows the list):

  1. Get the list of URLs from the user: {u_1, ..., u_k}.
  2. Discover the neighborhoods F_i = N_forward(u_i) and B_i = N_backward(u_i).
  3. Take G = the union, over all i, of F_i and B_i.
  4. Assign each node in G a score based on how many hops it is from each u_i (for example, exponential decay in hop distance, summed over all the u_i).
  5. Take the set of all nodes whose score is above some threshold; call this the common set. Hopefully this set defines the group we're looking for.
  6. Return the common set.
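
To make this concrete, here is a minimal Python sketch of steps 2 through 6. The link-access functions forward_links() and backward_links() are assumed to be supplied by the caller (the former needs a fetch-and-parse step, the latter a link index such as a search engine provides), and the hop limit, decay factor, and threshold are arbitrary starting values rather than tuned choices.

    from collections import deque

    def neighborhood(url, get_links, max_hops):
        """Breadth-first expansion from url, recording the hop distance
        to every node reached within max_hops."""
        dist = {url: 0}
        frontier = deque([url])
        while frontier:
            node = frontier.popleft()
            if dist[node] >= max_hops:
                continue
            for nxt in get_links(node):
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    frontier.append(nxt)
        return dist

    def common_set(seed_urls, forward_links, backward_links,
                   max_hops=2, decay=0.5, threshold=1.0):
        """Steps 2-6: expand forward and backward neighborhoods of each
        seed, score every node reached by summing decay**hops over the
        seeds, and keep the nodes whose score clears the threshold."""
        scores = {}
        for u in seed_urls:
            fwd = neighborhood(u, forward_links, max_hops)
            bwd = neighborhood(u, backward_links, max_hops)
            for node in set(fwd) | set(bwd):
                # a node reached both forward and backward from the same
                # seed is counted once, at its smaller hop distance
                hops = min(fwd.get(node, max_hops), bwd.get(node, max_hops))
                scores[node] = scores.get(node, 0.0) + decay ** hops
        return {node for node, score in scores.items() if score >= threshold}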

Clearly this algorithm has several variables that could be tweaked. For example, exponential decay does not seem like the right scoring approach to use. It also seems that, rather than just returning the common set, it would make more sense to find pages that point to it. Further, it would be better to get a richer description of the commonality than just the common set itself; for example, we could find that most of the pages link to some particular page within two hops, or that most of the pages are linked to from some particular page. It would obviously help to use these richer descriptions, and our users would find it beneficial to see that information as well. A really sophisticated search engine could use the set to derive rules, and then provide a user interface for selectively turning rules on and off.
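
As one rough sketch of the "find pages that point to it" variation, the function below takes the common set plus the same kind of backward-link lookup assumed above and ranks outside pages by how many common-set members they link to; min_members is an arbitrary cutoff, not a recommended value.

    def pages_pointing_at(common, backward_links, min_members=2):
        """Rank pages outside the common set by how many of its members
        they link to; backward_links(url) is assumed to return the set
        of pages known to link to url."""
        counts = {}
        for member in common:
            for src in backward_links(member):
                if src not in common:
                    counts[src] = counts.get(src, 0) + 1
        ranked = [src for src, c in counts.items() if c >= min_members]
        ranked.sort(key=lambda src: counts[src], reverse=True)
        return ranked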

mz