

Most of the web exists in the dark
Web search has been revolutionized by popularity algorithms like PageRank, which rank results better than ever before.

A side effect of popularity sorting is that millions of less popular sites are left on the last pages of search results; among them are valuable pages and content that we miss in our regular search sessions.

Only a very small percentage of users browse beyond the first or second page of search results. The result of this behaviour is that we mostly navigate across a small set of highly popular sites: most of the web exists in the dark. Millions of interesting sites remain hidden on page 2, 10, 50 or 100 of the results of a web search because of their low popularity.
The idea we started with was to present the web in a new way, removing the 'popularity' criterion from search. We wanted to bring that sea of hidden websites into the light of day: similarity classification was a way to re-map the internet and let people discover and search a larger portion of the web.
The internet's long tail is full of interesting sites.


To make this a reality, we developed PageAffinity, a set of algorithms that identify and calculate the degree of similarity between webpages, bringing to light the 'long tail' of the web.

By applying PageAffinity at web scale, we created a new map of the whole web built around the concept of similarity. 200 million lists of similar sites have been computed and are available through this site and the SimilarPages add-on.

PageAffinity analyzes both the content of pages and the linking structure of the web to determine the level of similarity between webpages. In addition, the algorithms extract from the web the tagging content created by users and webmasters.
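The exact PageAffinity algorithms are not published, but the paragraph above describes the general recipe: combine a content signal with a link-structure signal. The Python sketch below is a hypothetical illustration of that idea, using cosine similarity over term counts for content and Jaccard overlap of outgoing links for structure; the function names, the 0.6/0.4 weights, and the blending scheme are assumptions for illustration, not the actual implementation.

```python
import math
from collections import Counter

def content_similarity(terms_a, terms_b):
    # Cosine similarity between two bags of words; a common stand-in
    # for "content analysis", not necessarily what PageAffinity does.
    va, vb = Counter(terms_a), Counter(terms_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def link_similarity(links_a, links_b):
    # Jaccard overlap of outgoing-link sets; a simple proxy for
    # similarity of "linking structure".
    a, b = set(links_a), set(links_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def page_affinity(page_a, page_b, w_content=0.6, w_links=0.4):
    # Blend the two signals; the weights are purely illustrative.
    return (w_content * content_similarity(page_a["terms"], page_b["terms"])
            + w_links * link_similarity(page_a["links"], page_b["links"]))

# Toy usage with two fabricated pages:
a = {"terms": "open source web crawler".split(),
     "links": {"apache.org", "nutch.apache.org"}}
b = {"terms": "a web crawler written in java".split(),
     "links": {"apache.org", "hadoop.apache.org"}}
print(round(page_affinity(a, b), 3))
```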


The Map of the Internet by Similarity

To create a new Internet map based on similarity, we crawled a large portion of the web.

To perform this massive task we used two great open-source tools from the Apache Foundation: a web crawler called Nutch, and Hadoop, a distributed computing framework that provides a distributed file system and a MapReduce engine, together with modules specifically written by our team.
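As a concrete illustration of how a crawl-processing step can be expressed as a MapReduce job, the sketch below uses Hadoop Streaming (which lets plain scripts act as mapper and reducer) to group pages by a shared outgoing link, producing candidate pairs for similarity scoring. Our actual modules were custom code; the file name, the co-citation heuristic, and the invocation shown are all assumptions for illustration.

```python
#!/usr/bin/env python3
# Hypothetical Hadoop Streaming job (not SimilarPages' actual code):
# pages that link to the same target become candidate similar pairs.
#
# Illustrative invocation:
#   hadoop jar hadoop-streaming.jar \
#     -input crawl/links -output candidates \
#     -mapper "cocite.py map" -reducer "cocite.py reduce" -file cocite.py
import sys
from itertools import combinations

def mapper():
    # Input: one "source_url<TAB>target_url" line per extracted link.
    for line in sys.stdin:
        source, target = line.rstrip("\n").split("\t")
        # Key by the target so the shuffle phase groups co-citing pages.
        print(f"{target}\t{source}")

def emit_pairs(sources):
    # Every pair of pages citing the same target is a candidate pair.
    # (A real job would cap very popular targets to avoid a quadratic
    # blow-up; omitted here for brevity.)
    for a, b in combinations(sorted(set(sources)), 2):
        print(f"{a}\t{b}\t1")

def reducer():
    # Hadoop delivers mapper output sorted by key, so we can stream
    # through it, collecting all sources for one target at a time.
    current, sources = None, []
    for line in sys.stdin:
        target, source = line.rstrip("\n").split("\t")
        if target != current:
            emit_pairs(sources)
            current, sources = target, []
        sources.append(source)
    emit_pairs(sources)

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```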

We first launched thousands of Nutch instances to crawl a large portion of the internet. Today, the SimilarPages map is growing constantly and connects more than 3.2 billion pages. It covers all domain extensions and types of webpages, regardless of their popularity. The degree of similarity between webpages is computed regularly, and the map is updated continuously with newly published sites.

If a site is missing, let us know. We will check it out and add it to the map as soon as possible.
Some characteristics that distinguish us
SimilarPages is a unique search engine and web discovery tool that runs on its own real index of the internet.
Our index is updated continuously: hundreds of servers crawl the web for SimilarPages every week.
We have already indexed 3 billion web pages to create 200 million lists of similar sites, in all languages and covering all existing topics.
We are a passionate group of tech addicts delivering a new way to organize the web. Without popularity-based algorithms, the web is wider.