Saturday, February 14, 2009

How Google treats Duplicate Content

Duplicate content is a headache for search engine positioning. It is created in situations like these:
  • Syndicated news items that appear in many Web sites
  • Sites that legitimately archive news items
  • Blogs that legitimately quote part of an article, or provide an entire article for reference and link to the original under the "fair use" doctrine.
  • "Classic" articles that are copied at numerous sites.
  • Classic literature and poetry
  • Plagiarism - A takes the content of B without permission and without crediting B, the originator, and posts it on their own Web site or copies it to some Web forum or community site.
Very often, online journal articles disappear from the Web after a few years, and journals routinely do not reply to requests for syndication or quoting. It is therefore legitimate to quote large parts of such articles, or all of them, to keep the record intact, especially if your Web log commentary refers to the article and makes no sense without it.
Plagiarism, of course, is a different matter.

Different search engines deal with duplicate content in different ways. Google uses a patented algorithm for detecting duplicate content and can relegate all or most duplicates to "supplemental" listings that are not even shown unless requested. Other engines may not list those pages at all. The hard problem is determining which page "deserves" to be listed at the top of the SERP (Search Engine Results Page), since most people click on the top listing.

Obviously, the Web site with the oldest copy of a file is probably the originator and should be listed first. But that is frequently NOT what happens. Plagiarism is the sincerest form of flattery. I wrote a rather successful article about an issue. It was promptly copied to a large Web site, and the listing at that site pushed my own page way down in Google's results. In a different case of plagiarism, material published at our Web site was copied to a major journal and to a major news service, neither of which gave any credit to the original, and both of which claimed they had copyrighted our material!

If you are only "in business" to influence political opinion, then of course you are willing to sacrifice the popularity of your own article in order to spread the word to the largest number of people. But in the long run, you still want your Web site to get more traffic, since that will serve your cause best.

In another case, I searched for an item using a keyword and found a prominently listed page at a closed, password-protected Web site. As it turns out, the original article is in the public domain and freely available at another Web site, but that copy could only be found by searching the supplemental listings.

Google (if they are listening) should look into this problem, as it reduces the quality of their results and, in the long run, will reduce the quality of materials on the Web. There is no practical way to prevent copying of materials, but every copy should credit the original version and link to it. Authors should not be cheated out of credit for their work - that will not promote the creation of quality materials. If you own or control a Web site or archiving forum, insist that any duplicate material you copy must link to the original version on the Web. That will not necessarily ensure that the original is listed at the top of SERPs, but it will help. It will also reward the originator by providing the originating site with all-important link popularity.

There is one exception: if you are quoting an item as an example of hate propaganda, you may feel that you are not morally obligated to provide a live link to the original site and boost its popularity. You can provide the URL as plain text without a live link, or use the nofollow attribute.
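As a rough sketch, the two options above look like this in HTML (the URL here is a made-up placeholder, not a real site):

```html
<!-- Option 1: plain text - readers can see the address,
     but there is no clickable link and no credit passed -->
http://example.com/hate-page.html

<!-- Option 2: a live link with rel="nofollow" - still clickable,
     but it tells search engines not to count it toward the
     target site's link popularity -->
<a href="http://example.com/hate-page.html" rel="nofollow">example of hate propaganda</a>
```

Google announced support for rel="nofollow" in 2005 precisely so that links could be published without passing ranking credit.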

Ami Isseroff
