Sunday, February 15, 2009

Solution for Duplicate page listings - 'canonical' attribute in head tag

One sort of duplicate content happens because of plagiarism, copying to forums, or copying of articles to blogs (see How Google Treats Duplicate Content). In those cases, several physical copies of a page really do exist, for various reasons.
 
But there is another sort of "duplicate content" that is often just an artifact of how the Web works, and in part a bug of search engines. Usually it is not duplicate content at all, but rather several URLs that lead to the same physical content.
 
Suppose you have a page at http://seo.yu-hu.com. Just one physical page. This one page can be reached in four different ways.
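
As a rough illustration of how several URLs can point at one page, here is a minimal Python sketch. The four URL variants in it are my assumption of the usual suspects (bare host, trailing slash, explicit index file, "www" prefix), not a definitive list, and normalize() is a hypothetical helper, not anything a search engine is known to run.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical URL variants for one physical page (assumed for illustration).
VARIANTS = [
    "http://seo.yu-hu.com",
    "http://seo.yu-hu.com/",
    "http://seo.yu-hu.com/index.html",
    "http://www.seo.yu-hu.com/index.html",
]

def normalize(url, index_names=("index.html", "index.htm", "index.php")):
    """Collapse common variants of a URL to one form (a sketch, not a standard)."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    # Treat "www.example.com" and "example.com" as the same host.
    if host.startswith("www."):
        host = host[4:]
    # An empty path and "/" refer to the same resource.
    path = parts.path or "/"
    # Treat an explicit index file as the directory itself.
    head, _, tail = path.rpartition("/")
    if tail in index_names:
        path = head + "/"
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))

# All four variants collapse to a single normalized URL.
unique = {normalize(u) for u in VARIANTS}
print(unique)
```

A search engine cannot safely assume all of these rules for every site, which is part of why it may end up listing the same page more than once.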
 
That is a simple case for a site that uses physical files, not pages generated from a database.
A site that is run by a content management system, however, may generate exactly the same content in dozens of ways, at different URLs under the "products" or "catalog" or "archives" sections. It is still the same physical content that comes from the database.
 
Google and other search engines decide that the additional pages are "duplicate content." They really are.
It is not clear whether this penalizes your site, or how.
 
Google and Yahoo! now let you tell them which version of the page to index. You do it by putting a link tag with the attribute rel="canonical" in the head section of the page, like this:

<link rel="canonical" href="http://www.example.com/product.php?item=swedish-fish" />
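
To check that a page actually carries the tag shown above, something like the following Python sketch could be used. It relies only on the standard library's html.parser; the CanonicalFinder class and find_canonical() function are hypothetical names made up for this example, and the sketch simply returns the first canonical link it sees.

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> tag it sees."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <link ... /> are routed here as well.
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel may hold several space-separated values; compare case-insensitively.
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels:
            self.canonical = attr.get("href")

def find_canonical(html):
    """Return the canonical URL declared in the HTML, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical

page = (
    '<html><head><title>Swedish Fish</title>'
    '<link rel="canonical" '
    'href="http://www.example.com/product.php?item=swedish-fish" />'
    '</head><body></body></html>'
)
print(find_canonical(page))
```

Running the same check against each URL variant of a page should give the same answer every time; if it does not, the variants are telling search engines different things.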
 
The result should be that all the PageRank and other goodies are credited to the version of the page you specify. See here for more details.