Nov 13, 2007
Here's a theory I have. I say that you might be able to avoid a duplicate content penalty by duplicating your content on another domain.
Humour me for a moment, and consider the following argument before reaching for the back button.
Duplicate content
Duplicate content is loosely defined as being several copies of the same content in different parts of the web. It causes a problem when search engines spider all the copies, but only want to include a single copy in the search engine index to maintain relevancy. Generally, all other copies of this 'duplicate' content are ignored in the search results, bunged into the supplemental index, or perhaps not even indexed at all.
Doing the right thing
Fundamentally, search engines want to do the right thing and let the original author rank in the search engines for the content they wrote. We have just said that only one copy of the content can rank in the search engines, and search engines probably want this to be the original.
When you search for something, and some Wikipedia content appears in the search results, you want to see the copy from wikipedia.org, not one of the thousands of copies of the content on various rubbish scraper sites.
I believe that search engines want to give the search result / traffic to the original author of the content.
Determining the original author of the content
Consider the following 4 copies of the same article on different domains. How is Google to know which is the original copy?
The above example shows 4 copies of the same content. The date indicates the date the content was first indexed, and the PageRank bar indicates, um, PageRank. Let's assume for this simplified example that PageRank is an accurate measure of link strength / domain authority / trust etc. The smaller pages pointing to each larger page represent incoming links from other websites.
- Document 1 was first indexed a couple of weeks after the other copies, so as a search engine you might decide that this is not the original because it wasn't published first.
- Documents 2 and 3 have the same good PR, were first indexed about the same time, and have the same number of incoming links.
- Document 4 was indexed slightly after documents 2 and 3, and it also has less PR and fewer links, so as a search engine you might conclude this is not the best copy to list either.
As a search engine, we are stuck deciding between document 2 and document 3 as to which is the original / best copy to list. At this point, Google is likely to take its best guess and leave it at that, which will see the original author "penalised" on many occasions.
Enter the cheesy scraper sites
Let's recycle that same example, but this time we are going to add an "author credit" link to the bottom of document 4. Document 4 could be considered a cheesy low PR, low value, scraper site, but one that was kind enough to provide a link back to the original document.
All of a sudden, there is a crystal-clear signal to the search engines that document 2 is the original.
When there is a collection of identical pages out there on the web and it's hard to decide who the author is, it's likely that search engines look at how those copies link to each other and use that data to determine the original.
All other things being equal, this seems like a logical assumption to make.
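To make the reasoning concrete, here is a minimal Python sketch of the tie-break described above. It is only an illustration of my guess, not how Google actually works: the Copy class, the guess_original function, the 7-day lag threshold, and every date, PageRank value and link count are made up for the example (document 1's PageRank in particular is not given above).

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Copy:
    """One indexed copy of an article. All values used below are illustrative."""
    name: str
    first_indexed: date
    pagerank: int
    inbound_links: int
    credits: list = field(default_factory=list)  # names of copies this page links to as its source


def guess_original(copies, max_lag_days=7):
    """Return the names of the copies that could plausibly be the original."""
    earliest = min(c.first_indexed for c in copies)
    # Step 1: drop copies first indexed well after the earliest one (assumed threshold).
    candidates = [c for c in copies if (c.first_indexed - earliest).days <= max_lag_days]
    # Step 2: keep only the strongest remaining copies (PageRank, then inbound links).
    best = max((c.pagerank, c.inbound_links) for c in candidates)
    candidates = [c for c in candidates if (c.pagerank, c.inbound_links) == best]
    # Step 3: if still tied, prefer the candidate that other copies credit with a link.
    if len(candidates) > 1:
        votes = {c.name: 0 for c in candidates}
        for c in copies:
            for target in c.credits:
                if target in votes and target != c.name:
                    votes[target] += 1
        top = max(votes.values())
        if top > 0:
            candidates = [c for c in candidates if votes[c.name] == top]
    return [c.name for c in candidates]


def example_copies(with_credit):
    """Build the four-document example; document 4 optionally credits document 2."""
    return [
        Copy("doc1", date(2007, 11, 15), 4, 8),   # indexed a couple of weeks late
        Copy("doc2", date(2007, 11, 1), 5, 10),
        Copy("doc3", date(2007, 11, 1), 5, 10),
        Copy("doc4", date(2007, 11, 3), 2, 3, ["doc2"] if with_credit else []),
    ]
```

Calling guess_original(example_copies(False)) leaves documents 2 and 3 tied, while guess_original(example_copies(True)) narrows it down to document 2 alone, which is the whole point of the credit link.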
Duplicating your own content
So, if you know your content is being duplicated on scraper sites - I'm saying you can prevent getting penalised by making sure some of the scrapers provide a link back to your original document.
If none of the scrapers are polite enough to do this, then I'm suggesting you should create your own scraper site, scrape your own content, and provide a link back to yourself.
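As a rough idea of what "scrape your own content and link back" could look like, here is a small Python sketch that turns each item of an RSS feed into a page with a credit link. The sample feed, the example.com URL and the republish_with_credit function are all hypothetical, and this is an untested illustration of the idea rather than a recommended tool.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical feed standing in for your own full-text RSS feed.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example blog</title>
  <item>
    <title>Duplicate content theory</title>
    <link>https://example.com/articles/duplicate-content</link>
    <description>Full article text goes here.</description>
  </item>
</channel></rss>"""


def republish_with_credit(feed_xml):
    """Turn each RSS item into an HTML fragment that links back to the original."""
    pages = []
    root = ET.fromstring(feed_xml)
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        body = item.findtext("description", default="")
        pages.append(
            f"<h1>{escape(title)}</h1>\n"
            f"<p>{escape(body)}</p>\n"
            # The credit link is a plain followed link, with no nofollow attribute.
            f'<p>Originally published at <a href="{escape(link, quote=True)}">{escape(link)}</a></p>'
        )
    return pages
```

Each string in the returned list would be saved as a page on the second domain. The credit link is left as a normal followed link on purpose, since a nofollow link would presumably not pass the signal this theory relies on.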
As RSS feeds become more popular, and content is recycled all over the web, this problem is only going to get worse.
Disclaimer: This is just a hypothesis / theory, and I'm yet to back this up with any real testing, so don't blame me if you duplicate your own website and find yourself having duplicate content problems. I wouldn't even consider this tactic unless you were having problems with high PR sites scraping your RSS feed.
Comments
Harvey - Nov 14, 2007
Yes, I'll be doing some formal testing on this when I get a chance.
I developed the concept based on my experience with RagePank.com - the RSS feed here gives away full articles (not shortened descriptions). Plenty of webmasters have used this to publish my content on their sites.
I'm yet to see myself being blocked out of the SERPs by another copy, so in this case Google is doing a good job of determining the original author.
I think a strong factor in this is that several of the RSS scrapers include a credit link back to this site, which I say is a strong signal to Google that the original copy is RagePank.
It could also be that none of the RSS scrapers have enough PR/Authority to dislodge me - could be a different story if W3C or Apple scraped some of my articles and didn't link back!
I totally agree about crosslinking between articles - I'm yet to develop the "related articles" plugin. I also want to do an email notification feature for comments, I'm pretty sure pageviews would double by doing those 2 things.
Michael Brandon - Nov 15, 2007
It is said, "there is nothing new under the sun". But with this one you have proved it wrong. A great new concept.
I really hate how sites scrape content, especially proxy sites, and then rankings decrease for the mother website. Picked up a proxy site yesterday that had one of my sites' index pages cached. I will be interested in testing this new gem of a theory out.
So how would it work with an index page - and one that was constantly changing... I suppose to scrape a copy of it whenever you change your own website, and make sure that the copy is linking back to your main website. Something you would want to test on a throwaway domain!
Would not be as hard to test for an inner article/blog page.
Aidan Rogers - Nov 15, 2007
Hey Harvey
Just out of curiosity I submitted this to Sphinn. Did you see what Jill Whalen said?
Harvey - Nov 15, 2007
Thanks for pointing out the Sphinn comment.
I think what Jill is saying is "Ante up and show some proof" which isn't unreasonable.
I don't think the concept is ridiculous at all - I think it's sometimes part of your job as a SEO to try and figure out why Google ranks pages the way it does. If you are having a problem with scrapers blocking you for your own content then it's worth experimenting to find ways to work around the problem.
Chris Giddings - Nov 20, 2007
Harvey,
This concept had been on my mind a bit recently as I have been trying to handle the complexities of dup content for my employer.
Your concept provides an interesting theory to play with. I wish I knew someone deep(er) at Google so I could ask bluntly whether this avenue was worth pursuing.
As always you provide some interesting fodder to consider. Thanks.
Michael - Nov 22, 2007
Hum... I understand that this happens, but how am I really to know if my site is being scraped and duplicated... I write articles and post them in article directories to get my author line in with the link.
Sometimes none of the article gets hot linked or the link is a nofollow. How does the search engine know then?
I would be happy to hear more assumptions on this situation, but I'm more interested in the proper way to stop scraping of content and how search engine algorithms actually work.
Harvey - Nov 22, 2007
@Michael - I wouldn't suggest using your good content for article directories, generally people will write articles specifically for submitting, knowing they will be duplicated all over the place.
If there is no author link in any of the article copies, then probably the most powerful copy will rank - often the ArticleCity / EzineArticles copy, which is more powerful than the scrapers.
I haven't submitted articles for a long time, but I used to avoid the dupe content issue by deliberately de-optimizing the copy I submitted, and using a rich title / meta / opening paragraph on the copy I kept. This used to work great, but things have probably changed now.
Here's an example SERP showing an article I copied a while back, you can see my site ranking well due to a unique (and admittedly crap) meta description and opening paragraph.
Does SEO marketing really work

Aidan Rogers - Nov 14, 2007
Hi Harvey
Interesting stuff, would love to see it backed up with a bit of proof. Do you intend on testing this theory yourself?
I also have another question regarding your blog: How come you don't link to your older posts? You have written some real gems but the older ones are pretty buried in the archives. I believe linking to these older posts would really add value to your visitors :)
Cheers
Aidan