Friday, August 21, 2009

The end of free Internet news?

Rupert Murdoch and others have decided it is time to end free news content on the Web. One reason someone like Murdoch could make such a proposition is that he admittedly knows nothing about the Internet or how it works. I think there is no way to really end free content, and that paid content will never be able to compete with free. Consider first all the legitimate primary sources of Internet news that are always going to be free: government Web sites, government broadcasters and NGOs. Governments and NGOs want you to see their content. A large part of international news consists of repackaged government press agency announcements or NGO press releases: "The prominent NGO Birdwatchers International has released a new study showing..." and the rest of the item just quotes the information in the press release.
 
Now consider also the question of copyright and "fair use." A blogger subscribes to a paid service and copies the main content of an article to their Web log. As long as they comment on the article, and they are not a for-profit organization, it is "fair use for educational purposes." Attempts to stop them will be stymied because they will be labelled attempts to stifle freedom of the press.
 
The claim of publishers that they produce "quality content" that people will want to pay for is also highly dubious. During the Iraq war, the Second Lebanon war and other such events, the press often published government or terrorist propaganda indiscriminately. A CNN report described how dramatic footage of ambulances rushing to the rescue was obligingly staged by Hezbollah for the benefit of the press. A Reuters photo of smoke over Beirut was shown by a blogger to be a fake, and bloggers exposed many other instances in which the commercial press was fooled by biased stringers or interested parties into passing off fabrications as fact - the French footage of the alleged killing of Muhammad al-Dura was one such instance. Consider also stories like Sy Hersh's allegations of an imminent US attack on Iran that never materialized. These stories appeared over and over, though they had no basis in fact. The same was true of Judith Miller's NYT stories about WMD in Iraq. If you want to lie to me for free, that's fine, but I won't pay for it.
 
There are so many ways for good free content to get to the Web and be available to all that it is really doubtful many people will want to pay for it, especially considering the poor quality of a lot of commercial journalism.
 
Ami Isseroff

Friday, May 22, 2009

Decline of Dmoz: Schadenfreude and sadness

As a longtime frustrated user, submitter and ex-editor of the Open Directory (AKA Dmoz), I had feelings of Schadenfreude mixed with sadness when I learned of its decline. It has lost a lot of its viewing public, mostly because search engines do the search job better, but it has also stopped accumulating new listings. Triplicate listings of garbage pages, editors who tyrannize people with other political viewpoints and confine their categories to polemical articles or to their friends' Web sites, and arbitrary, capricious editing rules used to keep out sites that editors don't like all detract from the quality of Dmoz. The lack of quality ratings and quality criteria is also a problem.
 
There are a few articles on the decline of dmoz around the Web, and they have attracted quite a lot of comment. Dmoz editors keep writing to say how great they are, without any understanding of what the statistics are telling them. Frustrated users are venting, but dmoz editors are never going to take them seriously, and that's a big part of the problem - contempt for users, arrogance, the elitism of a closed group. But there are a lot of good editors at dmoz and it is worth saving from itself.
 
Some enterprising people made a dmozsucks.org Web site. God bless 'em, but directories like dmoz serve an important function if they are run right, because they can provide information about quality of Web pages to search engines. My detailed thoughts about this are at: The Decline of Dmoz.
 
Ami Isseroff

Sorting media garbage from media information - with special application to the Web and Internet

Writing the article The Decline of Dmoz got me thinking about how to give Web page and other media raters objective criteria for deciding whether an article or other media item is useful or good, or whether it is flotsam to be ignored. If there were a directory for "everything," how would you keep out the flood of garbage on the Internet, in printed matter, video and TV, and how could you spot and highlight the really superb new articles or books that deserve emphasis? It's not as easy as you think. Leonard and Virginia Woolf had a publishing business, and one of the surprising things they found was that in the long run, the books and poems and articles that were least popular on initial publication often became best sellers. Indeed, their tiny, romantic, hopeless venture, the Hogarth Press, which operated from a hand press, produced some of the greatest classics of the twentieth century. But these great artists sold pitifully small numbers of books when their works first appeared.
 
A scale of quality would be useful for consumers as well, since it would give them a better idea of how much reliance to place on a Web page, article or newscast. For that, we would have to eliminate some of the most obviously useless categories I will mention below, and provide more detail on how to judge the less bad material.
 
This scale is going to need a lot of work, but here is a first go, from the bottom (or near it) to the top.
 
Web sites that should not be indexed at all:
 
Web sites that have taken over domain names and use them for porn, gambling or other exploitation.
 
Parked Domains
 
Gimmick sites that are just search engines or advertising
 
Plagiarized material - material taken verbatim from another Web site without a link to the original, often without specifying author or credit, and posted to another Web site, Web log or forum. I have been, and am, the victim of this sort of thing, and I am not the only one. The people who do it invariably have an excuse.
 
Racist and hate sites, videos etc. - I think Google's policy is wrong. Web sites like Stormfront, Jew Watch and the IHR have no place on the Internet. Spreading disinformation and hate is not doing anyone a service. Incitement to genocide is a crime under the international Genocide Convention - really, it is.
 
Spam and confidence schemes should be banned from the mails and the Internet.
 
Search engines index a surprising quantity of such sites.
 
Lowest Quality Materials
 
Anonymous emails and Web logs or sites that post and re-post copies of anonymous material that never had an author who would acknowledge them. They are almost invariably hoaxes.
 
"news" reports based on anonymous sources and without confirmation - whether they are on the Web or in other media, are of the same approximate quality as anonymous email hoaxes about the latest email
 
Opinion pieces or news items that rely on attribution to non-authoritative sources to establish facts, such as "A guy I met told me that Google no longer uses PageRank for anything and it is not important." The author didn't say it, and it is probably not true; they hide behind the non-authoritative source to intentionally perpetrate a falsehood. This is done all the time by supposedly serious journals.
 
Videos and similar material that are so poorly produced that you cannot hear what people are saying. It is beyond me why people post such things to YouTube.
 
Materials that are just copies of articles published elsewhere, properly attributed. These have some utility, especially if the original is obscure or has been removed from the Web by the publisher. On the Web, it is generally considered legitimate to post whole articles provided you give due credit to the original and aren't just duplicating someone else's Web site to steal their income. Usually, though, it is best to go to the original source and quote only parts of it - if you can be sure the source will still be there in five years. On the Web, you cannot be too sure.
 
Conspiracy theories that are not verified from other sources. There are whole Web sites devoted to the most fantastic ideas, usually based on total disinformation and often involving race hate and paranoia. The FBI and the Mossad did not cause the 9-11 attacks or the attack in Mumbai. The Federal Reserve system is not a plot to steal your money and give it to rich bankers.
 
 
Materials to be treated with due caution
 
Claims made by commercial sources who are selling a product
 
Sources with an obvious political bias.
 
Materials that use adjectives or hype to describe products or political issues. If words like "right wing" or "left wing" or "progressive" appear too often in an article or report, you have to ask yourself if this person is telling you facts or trying to convince you of their opinion.
 
Differential treatment of subjects - for example, a publication that regularly uses adjectives like "right wing" or "extremist" to describe politicians on one side of a conflict, but refrains from using any adjectives to describe leaders of the other side.
 
Materials that are unsourced.
 
Assertions from publications or authors who have a poor track record for accuracy. Certain people, for example, regularly predict that Iran will explode a nuclear weapon in a few months, or that Israel or the US will attack Iran, but it never happens. If they were ignored, they could not make a living by spreading disinformation in that way.
 
An article or publication that omits important facts that you know to be true is probably trying to create bias.
 
An article or publication that intentionally distorts a quote or lies about a fact should not be trusted about other facts and assertions.
 
An article or book that has more than a few ellipses ("...") in quotes is probably distorting the meaning of the quotes. This is a favorite technique of certain politicians, and is useful for dishonest commercial purposes as well.
 
Information you can rely on
 
Source is generally known to be correct
 
Information is confirmed by other reports
 
The report is plausible based on scientific evidence and common sense.
 
Source has no reason to lie
 
There's a lot less of that around than you might think. Remember the Iraq WMD that weren't? The Israeli bio-weapon hoax that was reported in numerous respected journals?

Monday, March 16, 2009

Google Keyword Search Frequency mystery, or "Is Sex going out of style?"

According to Google's AdWords tool, sex may be going out of style. That is, Google's data supposedly show that over a 12-month period there were on average 124,000,000 (one hundred twenty-four million) searches for the keyword "sex" each month, whereas in the month of February only 90,500 (ninety thousand five hundred) people were searching for sex. No mistake about the number of zeros anywhere, either; I double-checked. All the rest of the people looking for sex must have found it. But sex was not the only keyword affected. Every keyword I checked except "Facebook" had a lower search frequency in February than the average over the last 12 months. The size of the drop was not consistent, however; some words dropped much more than others. Is search going out of style, is there just something wrong with Google's reporting, or what?
 

What happened to Jewwatch.com?

A political-social search engine optimization issue developed around the hate Web site jewwatch.com. For many years, searches in Google for the keyword "Jew" returned this odious site at the top of the listings or among the first ten. Jewwatch.com features standard anti-Semitic fare, including "ZOG," the Zionist Occupied Government, and the forged Protocols of the Elders of Zion. Attempts to get Google to ban the site failed in the past, but now Jewwatch is gone from the top listings for the keyword "Jew." The question is: what happened?

Sunday, February 15, 2009

Solution for Duplicate page listings - 'canonical' attribute in head tag

One sort of duplicate content happens because of plagiarism, copying to forums, and copying of articles to blogs (see How Google Treats Duplicate Content). In those cases, there really are several physical instances of a page, for various reasons.
 
But there is another sort of "duplicate content" that is often really just an artifact of how the Web works and, in part, a bug in search engines. Usually it is not duplicate content at all, but rather duplicate URLs for the same physical content.
 
Suppose you have a page at http://seo.yu-hu.com - just one physical page. This one page can be reached in four different ways: for example, http://seo.yu-hu.com, http://www.seo.yu-hu.com, http://seo.yu-hu.com/index.html and http://www.seo.yu-hu.com/index.html. Each is a different URL for the same page.
 
That is the simple case, for a site that uses physical files rather than pages generated from a database. A site run by a content management system, however, may generate exactly the same content in dozens of ways, from different URLs in the "products" or "catalog" or "archives" sections. It is still the same physical content, coming from the database.
 
Google and other search engines decide that the additional pages are "duplicate content" - and they really are. It is not clear whether this penalizes your site, or how much.
 
Google and Yahoo! now let you tell them how to index the page. You do it by putting a rel="canonical" attribute in a dummy link tag in the head section of the page, like this:

<link rel="canonical" href="http://www.example.com/product.php?item=swedish-fish" />
 
The result should be that all the PageRank and other goodies will be given to the version of the page specified. See here for more details.
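 
For illustration, here is a minimal sketch of where the tag goes, using the same made-up product URL as above (the title and content below are placeholders, not a real site):
 
<html>
<head>
<title>Swedish Fish</title>
<!-- However this copy of the page was reached - product.php, a "catalog"
     URL or an "archives" URL - this tag names the one version to index -->
<link rel="canonical" href="http://www.example.com/product.php?item=swedish-fish" />
</head>
<body>
... the product page content ...
</body>
</html>
 
The tag goes in every duplicate variant of the page, each one pointing at the same preferred URL.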
 
 

Saturday, February 14, 2009

How Google treats Duplicate Content

Duplicate content is a headache for Search Engine positioning. Duplicate content is created in these situations:
  • Syndicated news items that appear in many Web sites
  • Sites that legitimately archive news items
  • Blogs that legitimately quote part of an article, or provide an entire article for reference, and link to the original under the "fair use" doctrine.
  • "Classic" articles that are copied at numerous sites.
  • Classic literature and poetry
  • Plagiarism - A takes the content of B without permission and without giving credit to B, the originator, and puts it at their own Web site, or copies it to some Web forum or community Web site.
Very often, online journal articles disappear from the Web after a few years, and journals routinely do not reply to requests for syndication or quoting, so it is legitimate to quote large parts or all of such articles to make sure the record is intact, especially if your web log commentary refers to the article and makes no sense without it.
Plagiarism is a different matter of course.


Different search engines may deal with duplicate content in different ways. Google uses a patented algorithm for finding duplicate content. It can put all or most duplicate content in "supplementary listings" that will not even be shown unless requested. Others may not even list those pages. The big problem is to determine what page "deserves" to be listed at the top of the SERP (Search Engine Results Page) listing. Most people are going to click on the top listing.


Obviously, the Web site that has the oldest file is probably the originator and should be listed first. But that is frequently NOT what happens. Plagiarism is the sincerest form of flattery. I wrote a rather successful article about an issue. It was promptly copied to a large Web site, and the listing at that Web site pushed the listing of my own page way down in Google's results. In a different case of plagiarism, material published at our Web site was copied to a major journal, and to a major news service, neither of which gave any credit for the original and both of which claimed they had copyrighted our material!


If you are only "in business" to influence political opinion, then of course you are willing to sacrifice popularity of your own article in order to spread the word to the largest number of people. But in the long run, you still want your Web site to get more traffic, as that will help your cause the best.


In another case, I looked for an item using a keyword and found a prominently listed page at a closed, password-protected Web site. As it turns out, the original article is in the public domain and is freely available at another Web site, but that copy could only be found by searching the supplementary listings.

Google (if they are listening) should look into this problem, as it reduces the quality of their results, and in the long run, it will reduce the quality of materials on the Web. There is no practical way to prevent copying of materials, but these copies should all credit the original version and link to it. Authors should not be cheated out of credit for their work - that will not promote the creation of quality materials. If you own or control a Web site or archiving forum, insist that any duplicate material you copy must link to the original version on the Web. That will not necessarily ensure that the original is listed at the top of SERPs, but it will help. It will also reward the originator by providing the originating site with the all-important link popularity.
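 
For example, a credit line on a copied article might look something like this (a minimal sketch, with a placeholder URL):
 
<!-- Credit line at the top or bottom of the copied article -->
<p>This article was originally published at
<a href="http://www.example.com/original-article.html">www.example.com</a>.</p>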

There is one exception - if you are quoting an item as an example of hate propaganda, it seems to me that you are not morally obligated to provide a live link to the original site and help its Web site popularity. You can provide the text of the URL without a live link, or use a nofollow attribute.
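 
A minimal sketch of both options, again with a placeholder URL:
 
<!-- Option 1: the URL as plain text only, not a live link -->
The original item is at http://www.example.com/hate-item.html
 
<!-- Option 2: a live link with a nofollow attribute, which tells search
     engines not to pass link popularity to the target -->
<a href="http://www.example.com/hate-item.html" rel="nofollow">the original item</a>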

Ami Isseroff

Duplicate and triplicate Google AdSense advertisements

The recession is upon us. That seems to mean that for many topics in many locales, AdSense may display the same advertisement in more than one ad slot on a page. Of course, this reduces click-through rate (CTR - the percentage of visitors to a page who click on advertisements), because nobody will click on the same ad twice, and because people who are not interested in finding out how to get a flat stomach might be interested in finding out where to get gourmet foods. A variety of advertisements obviously should increase CTR. Yet Google often puts duplicate ads on a page even when there are different ads (also duplicated) on other, similar pages!

The way Google seems to decide which ads to put on a page, according to its content, is somewhat mysterious. You may have a whole page about astrophysics, but if for some reason there is a single link to a poetry Web site on that page, Google may put an advertisement for poetry there. Their algorithm may be a bit primitive.
You would have thunk that if Google AdSense knows about your page content, it also knows what advertisements it puts there. Since Google gets revenue from the advertisements, it should be interested in maximizing click-through rate, right? I have not seen anyone who obtained an interview with a Google guru ask about this problem.

Until Google acknowledges the plight of its suffering publishers and fixes the problem, you can mitigate this condition a bit. Differently shaped ad slots will draw different ads (though large horizontal and vertical graphic ads seem to draw the same content). You can also specify that one slot accepts graphics while another is text-only. What a pity that Google has that rigid rule of three ad units per page, whether they are big ad units or little ones, and whether it is a huge page or a little one.
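 
For illustration, a 2009-vintage page might define two differently shaped units something like this (the publisher ID and slot IDs are made-up placeholders):
 
<!-- A 728 x 90 horizontal "leaderboard" unit -->
<script type="text/javascript">
google_ad_client = "pub-0000000000000000"; /* placeholder publisher ID */
google_ad_slot = "1111111111"; /* placeholder slot ID */
google_ad_width = 728;
google_ad_height = 90;
</script>
<script type="text/javascript"
src="http://pagead2.googlesyndication.com/pagead/show_ads.js"></script>
 
<!-- A 160 x 600 vertical "skyscraper" - a different shape, which tends
     to draw a different advertisement than the unit above -->
<script type="text/javascript">
google_ad_client = "pub-0000000000000000";
google_ad_slot = "2222222222";
google_ad_width = 160;
google_ad_height = 600;
</script>
<script type="text/javascript"
src="http://pagead2.googlesyndication.com/pagead/show_ads.js"></script>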

Sunday, February 8, 2009

When did it happen?

Take a look at this news item:
 
Screen resolution 800 x 600 significantly decreased for exploring the internet according to OneStat.com
 
Amsterdam - July 25 - OneStat.com ( www.onestat.com ), the number one provider of real-time intelligence web analytics, today reported that more and more internet users choose for screen resolution 1024 x 768 which is the most popular screen resolution for exploring the internet.
 
The finding has important implications for web site designers because most web sites are designed for a screen resolution of 800 x 600 pixels.
 

The screen resolution 1024 x 768 has reached an all time high and has risen from 54.02 percent in June 2004 to 57.38 percent. Users with monitors set to the most common resolution 800 x 600 for web sites have an approximate 18.23 percent global usage share. A year ago this percentage was 24.66 percent.

Only one detail is missing from this page - the year. When did this happen? We can guess from the last paragraph that the article was published in 2005, but it does not say that.
 
Don't forget to put dates on time-locked materials.
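 
If the OneStat item had carried a date, its opening line might have looked like this (assuming the 2005 guess is right):
 
<!-- A dateline that will still make sense in five years -->
<p>Amsterdam - July 25, 2005 - OneStat.com today reported that...</p>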