robots.txt vs sitemap.xml - Who will Win?

On-page SEO is largely about getting search engines to visit your site so that they can index it nicely. But it's also about stopping them from indexing things you don't want indexed.

So today, when I came across some indexed content that shouldn't have been indexed, I was a bit shocked. You see, the content had been specifically blocked from spiders using two different methods: robots.txt and a noindex/nofollow meta tag.

However, upon further inspection, for one reason or another this content had lost its noindex/nofollow meta tag and had somehow found its way into the XML sitemap. Doh!
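A lost meta tag like this is easy to miss, so here is a minimal sketch of how a page could be checked for a noindex directive. It uses only Python's standard library. The names `RobotsMetaParser` and `has_noindex` and the sample HTML are made up for illustration; they are not taken from the site in question.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Attribute values can be None, so fall back to an empty string.
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def has_noindex(html: str) -> bool:
    """Return True if the page carries a robots meta tag containing noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical page markup, with and without the meta tag.
with_tag = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
without_tag = "<html><head><title>Oops</title></head></html>"

print(has_noindex(with_tag))     # True
print(has_noindex(without_tag))  # False
```

Run against the pages you expect to be hidden, something like this would flag a template change that silently drops the tag.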

Game on

So the situation here is that robots.txt is saying "block" and sitemap.xml is saying "allow". I verified that the content in question was in fact blocked by robots.txt using the tool in Google Webmaster Tools.
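The same kind of check can be approximated offline. The sketch below uses Python's standard `urllib.robotparser`. The rules, the example.com URLs and the `is_blocked` helper are all hypothetical, and Python's parser is not guaranteed to match every rule the way Google's does, so treat it as a rough sanity check only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules standing in for the real robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_blocked(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt rules disallow fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


print(is_blocked(ROBOTS_TXT, "http://www.example.com/private/page.html"))  # True
print(is_blocked(ROBOTS_TXT, "http://www.example.com/public/page.html"))   # False
```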

Sitemap.xml is the winner - the content gets indexed regardless.

Makes sense, I guess

I had thought that robots.txt was acting as a failsafe to make doubly sure that the content doesn't get indexed, but this isn't really the purpose of robots.txt. Robots.txt is designed to instruct crawlers that are blindly trawling through your content: it controls crawling rather than indexing. It will stop a crawler from visiting a page if it happens to find a link to that page on its travels.
However, when you write a sitemap.xml file, you are specifically inviting the bots to visit and index the page. It makes sense that they appear to ignore the instructions given in robots.txt, because they are serving a higher authority.

So, this does mean that extra care needs to be taken with the creation of your sitemap.xml file. If search engines are treating this as gospel, then make sure it's correct.
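
One way to take that extra care is to cross-check the sitemap against robots.txt before publishing it. Here is a minimal sketch using Python's standard library. The sample rules, the sample sitemap and the function names `sitemap_urls` and `conflicting_urls` are assumptions for illustration, and Python's `urllib.robotparser` may not interpret every rule exactly as a given search engine does.

```python
import xml.etree.ElementTree as ET
from typing import List
from urllib.robotparser import RobotFileParser

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical inputs standing in for the real files.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/</loc></url>
  <url><loc>http://www.example.com/private/page.html</loc></url>
</urlset>
"""


def sitemap_urls(sitemap_xml: str) -> List[str]:
    """Extract every <loc> value from a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def conflicting_urls(robots_txt: str, sitemap_xml: str,
                     user_agent: str = "Googlebot") -> List[str]:
    """Return sitemap URLs that the robots.txt rules disallow."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        url for url in sitemap_urls(sitemap_xml)
        if not parser.can_fetch(user_agent, url)
    ]


print(conflicting_urls(ROBOTS_TXT, SITEMAP_XML))
# ['http://www.example.com/private/page.html']
```

Any URL this prints is one where the two files disagree, which is exactly the situation described above.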

Tags: robots.txt sitemaps sitemap.xml indexing

8 Comments

- Aug 30, 2009

So you mean to say that even if we block the links using the robots.txt file and still include them in the sitemap, Google indexes them?
teres
Lemme check this interesting observation

- Sep 9, 2009

I don't understand. What happens if we block the links using the robots.txt file and still include them in the sitemap?

- Oct 7, 2009

I checked it too. The sitemap is the winner. But then does it make any sense to block them?

- Jan 15, 2010

I think it's even better because I provided both.

- Jan 20, 2010

It is true that search engines give more importance to the sitemap, but I am still confused about the robots.txt file.

- Feb 5, 2010

Wow, I hadn't thought about this problem. It could cause serious trouble, or could be seen as a security hole. But I still think that bots are smart enough not to display data that the website has denied in its robots.txt file.

- Feb 5, 2010

Hi,
I think both are important, but search engines give more importance to sitemaps. I enjoy reading this type of post.
Thanks

- Apr 7, 2010

Perfect, exactly the info I was looking for. I have also read that using the "Allow" directive in robots.txt doesn't actually work in all search engines. Would we be able to generate a sitemap.xml file that does "allow" access to files in folders that have the Disallow rule applied to them?