robots.txt vs sitemap.xml - Who will Win?

On-page SEO is largely about getting search engines to visit your site so that they can index it nicely. But it's also about stopping them from indexing things you don't want indexed.

So today, when I came across some indexed content that shouldn't have been indexed, I was a bit surprised - shocked, even. You see, the content had been specifically blocked from spiders using two different methods: robots.txt and a noindex/nofollow meta tag.
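For reference, the two methods look something like this (the /private/ path is just a made-up example). First, the robots.txt rule:

```
# robots.txt - asks compliant crawlers not to fetch anything under /private/
User-agent: *
Disallow: /private/
```

And second, the meta tag placed in the <head> of the page itself:

```html
<!-- tells a bot that does fetch the page not to index it or follow its links -->
<meta name="robots" content="noindex, nofollow">
```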

However, upon further inspection, for one reason or another this content had lost its noindex/nofollow meta tag and had somehow found its way into the XML sitemap. Doh!
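In sitemap.xml terms, that means an entry along these lines had appeared (URL invented for illustration):

```xml
<url>
  <loc>http://www.example.com/private/report.html</loc>
</url>
```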

Game on

So the situation here is that robots.txt is saying "block" and sitemap.xml is saying "allow". I verified, using Google Webmaster Tools, that the content in question was in fact blocked by robots.txt.

Sitemap.xml is the winner - the content gets indexed regardless.

Makes sense, I guess

I had thought that robots.txt was acting as a failsafe to make doubly sure that the content didn't get indexed, but that isn't really the purpose of robots.txt. Robots.txt is designed to instruct crawlers that are blindly trawling through your content. It will stop a crawler from visiting a page if it happens to find a link to that page on its travels.
However, when you write a sitemap.xml file, you are specifically inviting the bots to visit and index the page. It makes sense that they ignore the instructions given in robots.txt because they are serving a higher authority.

So, this does mean that extra care needs to be taken when creating your sitemap.xml file. If search engines treat it as gospel, then make sure it's correct.
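One way to be doubly sure is to cross-check every URL in the sitemap against your robots.txt rules before you publish it. Here's a minimal sketch in Python, assuming a local copy of the sitemap and a live robots.txt; the domain and file name are placeholders:

```python
from urllib.robotparser import RobotFileParser
import xml.etree.ElementTree as ET

SITE = "http://www.example.com"   # placeholder domain
SITEMAP_FILE = "sitemap.xml"      # local copy of the sitemap about to be published

# Load and parse the live robots.txt
robots = RobotFileParser()
robots.set_url(SITE + "/robots.txt")
robots.read()

# Pull every <loc> URL out of the sitemap
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
tree = ET.parse(SITEMAP_FILE)
urls = [loc.text.strip() for loc in tree.findall(".//sm:loc", ns)]

# Flag any URL the sitemap invites crawlers to that robots.txt tells them to avoid
for url in urls:
    if not robots.can_fetch("*", url):
        print("Conflict: listed in sitemap but disallowed by robots.txt ->", url)
```

Anything this prints out is a URL where you're sending the bots mixed messages, which, as above, the sitemap will likely win.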

Tags: robots.txt, sitemaps, sitemap.xml, indexing