Jun 18, 2009
On-page SEO is largely about getting search engines to visit your site so that they can index it properly. But it's also about stopping them from indexing things you don't want indexed.
So today, when I came across some indexed content that shouldn't have been indexed, I was more than a bit surprised. You see, the content had been specifically blocked from spiders using two different methods: robots.txt and a noindex/nofollow meta tag.
However, upon further inspection, it turned out that this content had, for one reason or another, lost its noindex/nofollow meta tag and had somehow found its way into the XML sitemap. Doh!
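A missing meta tag like this is easy to overlook, so here is a minimal sketch of how you might check a page for it, using only Python's standard library. The function name `robots_directives` and the sample markup are made up for illustration; in practice you would feed it the HTML of the live page.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        # The content attribute is a comma-separated list, e.g. "noindex, nofollow".
        for part in attr.get("content", "").split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_directives(html):
    """Return the set of robots meta directives present in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page markup, for illustration only.
page = '<html><head><meta name="robots" content="noindex, nofollow"></head><body></body></html>'
print(sorted(robots_directives(page)))  # ['nofollow', 'noindex']
```

If the returned set comes back empty for a page you meant to hide, the tag has gone missing, which is exactly what happened here.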
Game on
So the situation here is that robots.txt is saying "block" and sitemap.xml is saying "allow". I verified that the content in question was in fact blocked by robots.txt using the tool in Google Webmaster Tools.
Sitemap.xml is the winner: the content gets indexed regardless.
Makes sense, I guess
I had thought that robots.txt was acting as a failsafe to make doubly sure that the content doesn't get indexed, but this isn't really the purpose of robots.txt. Robots.txt is designed more to instruct crawlers that are blindly trawling through your content. It will stop a crawler from visiting a page if it happens to find a link to that page on its travels.
However, when you write a sitemap.xml file, you are specifically inviting the bots to visit and index the page. It makes sense that they ignore the instructions given in robots.txt, because they are serving a higher authority.
So, this does mean that extra care needs to be taken when creating your sitemap.xml file. If search engines are treating it as gospel, then make sure it's correct.
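One way to take that extra care is to cross-check the sitemap against robots.txt before publishing it. The following is a rough sketch rather than a finished tool: it uses Python's standard `urllib.robotparser` and `xml.etree.ElementTree` modules, and the function name `blocked_sitemap_urls`, the `www.example.com` URLs and the `/private/` rule are all invented for the example.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Namespace used by the sitemaps.org 0.9 protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def blocked_sitemap_urls(robots_txt, sitemap_xml, user_agent="*"):
    """Return the sitemap URLs that robots.txt disallows for user_agent."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())

    root = ET.fromstring(sitemap_xml)
    urls = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
    return [url for url in urls if not rules.can_fetch(user_agent, url)]


# Hypothetical robots.txt and sitemap.xml contents, for illustration only.
robots_txt = "User-agent: *\nDisallow: /private/\n"
sitemap_xml = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>http://www.example.com/</loc></url>"
    "<url><loc>http://www.example.com/private/report.html</loc></url>"
    "</urlset>"
)

for url in blocked_sitemap_urls(robots_txt, sitemap_xml):
    print("In sitemap but blocked by robots.txt:", url)
```

Any URL this prints is one where the two files contradict each other, so it should either come out of the sitemap or be unblocked in robots.txt. Note that Python's parser follows the basic robots.txt rules and may not match every search engine's handling of wildcards, so treat the result as a first pass.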
8 Comments
I don't understand. What if we block the links using the robots.txt file and still include them in the sitemap?
I checked it too. The sitemap is the winner. But does it make any sense to block them?
I think it's even better because I provided both.
It is true that search engines give more importance to the sitemap, but I am still confused about the robots.txt file.
Wow, I hadn't thought about this problem. It may cause serious trouble or could be seen as a security hole. But I still think that bots are smart enough not to display data from a website that has been denied by the robots.txt file.
Hi,
I think both are important, but search engines give more importance to sitemaps. I enjoy reading this type of post.
Thanks
Perfect, exactly the info I was looking for. I have also read that using the "Allow" directive in robots.txt doesn't actually work in all search engines. Would we be able to generate a sitemap.xml file that does "allow" access to files in folders that have the Disallow rule applied to them?
So you mean to say that even if we block the links using the robots.txt file and still include them in the sitemap, Google indexes them?
teres
Lemme check this interesting observation