Feb 15, 2006
Duplicate content is a problem with many websites, and most webmasters don't realise they are doing anything wrong.
Most search engines want to provide relevant results for their users; it's how Google got successful. If a search engine were to return 5 identical pages on the same page of the search results, that would not be useful to the searcher.
Many search engines have filters in place to remove the duplicate listings - this keeps their search results clean, and is overall a good feature. From a webmaster's point of view, however, you don't know which copy of the content the search engine is hiding, and it can put a real damper on your marketing efforts if the search engines won't show the copy you are trying to promote.
You may think you don't have any duplicate pages on your site... think again...
Duplicate content occurs when the search engine finds identical content at different URLs. Consider the following scenarios...
www vs non-www
In most cases the www and non-www hostnames return the same page; in other words, a duplicate of your entire site.
Root vs Index
Most people's homepages are available at either the root URL (/) or the index file (index.htm, index.php and so on) - duplicate content.
Session IDs - the root of all evil
This problem affects many dynamic sites, including PHP, ASP and ColdFusion sites. Many forums are poorly indexed because of this as well. Session IDs change every time a visitor comes to your site. In other words, every time the search engine indexes your site, it gets the same content at a different URL. Amazingly, most search engines aren't clever enough to detect this and fix it, so it's up to you as a webmaster.
One page, multiple URLs
and
http://www.domain.com/product.php?category=outdoor&product=chair
A product may be allocated to more than one category - in this case the "product detail" page is identical, but it's available via 2 URLs.
Removing Duplicate Content
Having duplicate content on your site can make marketing significantly more difficult, especially when you are marketing the non-www version and Google is only showing the www version. Because you can't tell the search engines which is the "original" copy, you must prevent any duplicate content from occurring on your site.
Here are some tips to make this process easier.
1. non-www vs www
I prefer to use the www version of my domain (no particular reason, it seems to look better on paper). If you are using Apache as your web server, you can include the following lines in your .htaccess file (change the values to your own of course).
RewriteCond %{HTTP_HOST} ^domain\.com [NC]
RewriteRule (.*) http://www.domain.com/$1 [R=301,L]
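If you would rather standardise on the non-www version, the same idea can be reversed. This is a minimal sketch, assuming Apache with mod_rewrite enabled and using domain.com as a placeholder for your own domain:

```apache
# Sketch only: redirect www.domain.com to domain.com (placeholder domain).
RewriteEngine On
# Only act when the request arrived on the www hostname (case-insensitive).
RewriteCond %{HTTP_HOST} ^www\.domain\.com [NC]
# Send the same path to the non-www host with a permanent (301) redirect.
RewriteRule (.*) http://domain.com/$1 [R=301,L]
```

Use one direction only. Having both the www and the non-www rule active at the same time would bounce visitors back and forth.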
If your webhost does not let you edit the .htaccess file, I would consider finding a new host. When it comes to removing duplicate content and producing search engine friendly URLs, Apache's .htaccess is too good to ignore. If your website is hosted on Microsoft IIS, I recommend ISAPI Rewrite instead.
2. Remove all references to "index.htm".
Your homepage should never be referred to as index.htm, index.php, index.asp etc. When you build incoming links, you will always get links to www.domain.com - your internal links should always be the same. One of my sites had a different pagerank on "/" (root) and "index.php" because the internal links were pointing to index.php, and creating duplicate content. Why go to the trouble of promoting two "different" pages at half strength when you can promote a single URL at full strength?
After you have removed all references to index.htm you should set up a 301 redirect (below) to redirect index.htm to / (root).
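A plain redirect from /index.htm to / can loop, because Apache serves / by internally mapping it back to the index file. One common way around this is to test the request line the browser actually sent. The sketch below assumes mod_rewrite is available, the rules live in the .htaccess file of your site root, and domain.com stands in for your own domain:

```apache
RewriteEngine On
# THE_REQUEST holds the raw request line, e.g. "GET /index.htm HTTP/1.1",
# so this condition is not triggered by Apache's internal mapping of / to the index file.
RewriteCond %{THE_REQUEST} ^[A-Z]+\s/index\.(htm|html|php)[?\s]
# Redirect the explicitly requested index file to the root URL.
RewriteRule ^index\.(htm|html|php)$ http://www.domain.com/ [R=301,L]
```

Only explicit requests for the index file are redirected; a request for / itself is left alone.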
3. Remove Session IDs.
I can give advice for PHP users; ASP and CF users should do their own research on exactly how to remove these. With PHP, if the visitor's browser does not accept cookies (and session.use_trans_sid is enabled), the session ID is automatically inserted into the URL as a way of maintaining state between pages. Most search engine spiders don't accept cookies, which means they get a different PHPSESSID in the URL every time they visit - this leads to very ugly indexing.
There is no ideal solution to this, so I have to compromise. When sessions are a requirement for the website, I would rather lose a small number of visitors who don't have cookies, than put up with PHPSESSID in my search engine listings (and potentially lose a lot more visitors).
To disable PHPSESSID in the URL, you should insert the following code into .htaccess:
php_value session.use_trans_sid 0
This will mean visitors with cookies turned off won't be able to use any features of your site that rely on sessions, e.g. logging in or remembering form data.
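As a possible complement, PHP also has a session.use_only_cookies setting that tells it to ignore session IDs passed in the URL, so old links that still carry a PHPSESSID stop being honoured. A sketch, assuming PHP runs as an Apache module (php_value and php_flag lines in .htaccess have no effect otherwise):

```apache
# Accept session IDs from cookies only, ignoring any PHPSESSID found in a URL.
php_flag session.use_only_cookies on
```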
4. Ensure all database-generated pages have unique URLs.
This is somewhat more complicated, depending on how your site is set up. When I design pages, I always keep the "one page, one URL" rule in mind, and I design my page structure accordingly. If a product belongs to 2 categories, I ensure that both categories link to the same URL, or modify the content significantly on both versions of the page so it's not "identical" in the eyes of the search engine.
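Where existing links already point at category-specific product URLs, one option is to redirect them to a single URL. This is a sketch only: it assumes mod_rewrite is available and that product.php can display a product from the product parameter alone, which may not be true of your site.

```apache
RewriteEngine On
# Only act when the query string contains a category parameter...
RewriteCond %{QUERY_STRING} (^|&)category=
# ...and capture the value of the product parameter (available as %2).
RewriteCond %{QUERY_STRING} (^|&)product=([^&]+)
# Redirect to the same script with only the product parameter.
# NE stops Apache re-escaping characters that are already URL-encoded.
RewriteRule ^product\.php$ /product.php?product=%2 [R=301,NE,L]
```

After the redirect the URL no longer carries a category parameter, so the first condition fails and the rule does not fire again.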
301 Redirections
A 301 redirect is the correct way of telling the search engines that a page has moved permanently. When you still want the non-www domain name to work, you should 301 redirect the visitor to the www domain. The visitor will see the address change, and search engines will know to ignore the non-www version and use the www version instead.
Use your .htaccess to 301 redirect visitors from index.htm to / and from the old URL of any page that gets renamed.
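For a page that has simply been renamed, the Redirect directive from Apache's mod_alias is usually enough. The file names below are hypothetical placeholders, and domain.com stands in for your own domain:

```apache
# Permanently redirect a renamed page to its new URL.
Redirect 301 /old-page.htm http://www.domain.com/new-page.htm
```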
Summary
While your site will work fine with duplicate content, it spreads your efforts thin and may cost you in ways that aren't obvious. To maximise PageRank and the effectiveness of link campaigns, you should ensure there is no duplicate content on your site. Feel free to post a comment if you would like your site checked for any duplicate content.
21 Comments
Nice Article. I came here through Google to find a solution to redirect my index.htm to index.php. Then I understood from the article that I should redirect to the main domain. I have implemented it and it is working fine. I can understand why this page is PR 5 because it is worthy page rank 5. Thanks buddy good work.
Do you have any advice on how to prevent sites like trailfire from showing Google duplicate pages of our content? Their URL has taken over our Google placement off and on for about two months and seems to be confusing Google. Basically 1/3 of the time it is our content with our URL in the Google results, 1/3 of the time our content with the trailfire URL, and then Google just drops the site for a few days. Then it starts again with our URL. Needless to say, we are not happy with trailfire.
Another great article, keep up the good work! The only thing I noticed that you didn't mention, and that I had to add to the .htaccess for my site, was:
RewriteEngine On
then
RewriteCond %{HTTP_HOST} ^example.com
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
:)
Hi Dusty,
Looking at the Google results for "2007 prom dress shop" I see you at #2 and Trailfire at #5.
If they are sometimes replacing you, this is because Google considers their site more authoritative than yours for that phrase.
The best strategy is to continue building targeted links to your site so you become more authoritative - remember, there are other scrapers out there.
Another trick is to change the text around your phrase, but this won't help if they are constantly scraping and updating their copy.
I have generally been quite lucky with scrapers in that most aren't all that authoritative. I actually quite like to get the free link.
Harvey.
AutoInsuranceRemedy.com - Jul 13, 2007
Harvey you’re the man!
I emailed Harvey a day ago or so asking him to help me do some 301 redirects for https to http, index.php to / and non www to www. He has created a short script that has done all I need. It is great.
Thanks again Harvey
The script:
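<?php
// Returns true when the current request was made over HTTPS.
function usingSSLConnection()
{
    return ((isset($_SERVER['HTTPS']) && ($_SERVER['HTTPS'] == 'on'))
        || getenv('SSL_PROTOCOL_VERSION'));
}

// Rebuild the full URL that was actually requested.
$protocol  = usingSSLConnection() ? 'https://' : 'http://';
$actualurl = $protocol . $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI'];

// The single URL this page should be served from (change to your own).
$correcturl = 'http://www.yourdomain.com/';

// Any other URL (https, non-www, /index.php) gets a 301 to the correct one.
// Every mismatch goes to $correcturl, so use this in the homepage script only.
if ($actualurl != $correcturl) {
    header("HTTP/1.1 301 Moved Permanently");
    header("Location: " . $correcturl);
    exit();
}
?>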
Jon Lee - Web Development - Jul 21, 2007
Unfortunately, no matter how much you do, if someone is scraping your feeds you're stuck with duplicate content anyway!
I'm not sure I totally agree about the RSS scrapers.
I actually provide the full blog post in my RSS feed and scrapers have used these on their websites without permission.
Thing is, I'll often include internal links in my blog posts, so the scrapers will often end up linking back to me.
I think Google can see these links. I think Google looks at 2 pages that look pretty much the same and wonders which is the original. But because site B links to site A and site A doesn't link to site B then it can assume that site A is the original source of the content.
With this logic, the scrapers are welcome to scrape my content, and I have never had any real problem with my own content pushing me out of the search results. The rules will probably change when a big fat PR8 authority site scrapes your RSS feed and nofollows your links - you will likely have real problems then.
I think Google is getting much better at detecting duplicate content, and figuring out who the original source is so they can ensure the best source shows up in the search results.
Anze - Aug 28, 2007
Re: "The rules will probably change when a big fat PR8 authority site scrapes your RSS feed and nofollows your links - you will likely have real problems then."
I don't think so. You see, Google is smart and it is getting smarter each day. If there is content on the page that has nofollow links, then it's external content, like comments and scraped content. It should not count as the author's content and should therefore be discarded from the index altogether.
The problem would be if the scrapers replaced the links with their own, or even put new links... But even then it would be possible to distinguish between the freeloaders and real authors.
Nice idea, adding links...
I'm thinking about posting articles to article directories. Will it be a problem for me to submit the same article to several directories with a link pointing back to my site?
I have been told that if I do not place the article on my site I will benefit from all the back links from the article directories and from anyone who posts my article to their site. Any comment will be appreciated.
Hi Bill,
Generally these articles are picked up by scraper sites looking for easy content, so the links you get aren't of the highest quality.
In the past I have submitted articles and modified the version that I posted on my own site - different title, different first paragraph. The goal being to reduce the risk of showing as dup content.
I don't see any issue with submitting your article to as many sites as you can, though I'd be pretty certain Google would notice the duplication, and reduce the value of those links. In other words, 100 links from 100 different articles would be better than 100 links from the same article.
Harvey.
Hello,
there's something wrong - but I can't figure out what. I followed your advice to redirect links to index.php, index.htm and index.html to the root, which is www.glonn.de - but using this htaccess gives loading errors. Any advice? Thanx!
RewriteEngine On
RewriteCond %{HTTP_HOST} ^glonn.de
RewriteRule (.*) http://www.glonn.de/$1 [R=301,L]
redirect 301 /index.htm http://www.glonn.de
redirect 301 /index.html http://www.glonn.de
redirect 301 /index.php3 http://www.glonn.de
redirect 301 /index.php http://www.glonn.de
RedirectPermanent /aktuelles/veranstaltungen.php3 http://www.glonn.de/aktuelles/veranstaltungen.php
RedirectPermanent /aktuelles/umfrage_marktplatz.htm http://www.glonn.de/wp/index.php?cat=10
RedirectPermanent /adressen/branchenbuch_result.php3 http://www.glonn.de/adressen/branchenbuch_result.php
RedirectPermanent /aktuelles/gentechnik-frei.html http://www.glonn.de/aktuelles/gentechnik-frei.php
RedirectPermanent /gewerbe/index.htm http://www.glonn.de/firmen/index.php
Howdy, I just stumbled upon this little gem of a website a few days ago and I've been poring over a bunch of the articles.
I just thought I'd share with you my own solution to the PHPSESSID problem. It is a bit elaborate, so it won't likely work for everyone, but on my website, I have a system in place for identifying, tracking, and logging search engine spiders (the ones that matter can easily be identified by their 'USER_AGENT' string). What I decided to do, since I already have the spider detection in place, is to disable URL rewriting only when the user has been identified as a spider. For my own particular implementation, spider sessions are tracked by a concatenation of their user_agent string and their IP address in place of the usual 32-character hexadecimal code. Spiders, then, can still access content and be tracked just as normal users can, all without those pesky URL variables.
Custom session handlers are cool - we use our own handler to get around timeout issues and to make scaling across multiple servers possible.
However this isn't something that occurred to me - an interesting concept.
Also you should check Google's webmaster tool as it will help inform you of duplicated text on your site.
Catalin - Aug 30, 2008
Hello,
I have a question concerning the duplicate content matter that you wrote about: http://www.ragepank.com/articles/3/preventing-duplicate-content/. To be more specific, my question is about "One page, multiple URLs".
Let's say you have a list of books (3 of them)
www.site.com/book-sortby-name.html
www.site.com/book-sortby-price.html
www.site.com/book-sortby-year.html
The content is the same. Does Google punish you for having this kind of duplicate content?
Carl - Jan 7, 2009
Thanks for this, that htaccess config has proven very useful.
Have been trying to follow advice and rewritten non www to www and this works well.
When I try to redirect index.htm to root / I get a loop error. Any help would be appreciated.
I used to publish my articles, but now I wonder whether I should stop doing this because of the risk of a duplicate content penalty. Should I stop publishing my articles on article directories?
Thanks for taking this opportunity to discuss this, I feel fervently about this and I like learning about this subject. If possible, as you gain information, please update this blog with more information. I have found it really useful.
Good Article, though from experience, most of the time same domain duplicate content is ignored from exclusion, the problem comes when 50 sites have the same content on them.