A Guide to Robots.txt and Mistakes to Avoid

5th Sep 2012 | 8 minutes to read

Irish Wonder explains the origins of the Robots.txt file, why it’s one of the most important SEO documents you’ll ever write, and five mistakes that can damage your site.

What Is a Robots.txt File?

An important but sometimes overlooked element of onsite optimisation is the robots.txt file. This file alone, usually weighing no more than a few bytes, can make or break your site’s relationship with the search engines.

Robots.txt lives in your site’s root directory, which is the only place crawlers look for it, and exists to regulate the bots that crawl your site. This is where you can grant or deny permission to all or some specific search engine robots to access certain pages or your site as a whole. The standard for this file was developed in 1994 and is known as the Robots Exclusion Standard or Robots Exclusion Protocol. Detailed info about the robots.txt protocol can be found at robotstxt.org.

Standard robots.txt Rules

The “standards” of the Robots Exclusion Standard are pretty loose as there is no official ruling body of this protocol. However, the most widely used robots elements are:

– User-agent (referring to the specific bots the rules apply to)
– Disallow (referring to the site areas the bot specified by the user-agent is not supposed to crawl – sometimes “Allow” is used instead of it or in addition to it, with the opposite meaning)

Often the robots.txt file also mentions the location of the sitemap.

Most existing robots (including those belonging to the main search engines) “understand” the above elements; however, not all of them respect and abide by these rules. Sometimes certain caveats apply, such as this one mentioned by Google:

While Google won’t crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web. As a result, the URL of the page and, potentially, other publicly available information such as anchor text in links to the site, or the title from the Open Directory Project (www.dmoz.org), can appear in Google search results.

Interestingly, today Google is showing a new message:

“A description for this result is not available because of this site’s robots.txt – learn more.”

This further indicates that Google can still index URLs even if the pages behind them are blocked in a robots.txt file.

File Structure

The typical structure of a robots.txt file is something like:

User-agent: *
Disallow:
Sitemap: https://www.yoursite.com/sitemap.xml

The above example means that any bot can access anything on the site and the sitemap is located at https://www.yoursite.com/sitemap.xml. The wildcard (*) means that the rule applies to all robots.
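
To check how a parser reads a file like this, here is a minimal sketch using Python's standard urllib.robotparser module. The bot name "AnyBot" and the page URL are made up for illustration; the rules are the ones from the example above.

```python
from urllib.robotparser import RobotFileParser

# The example file from above, held in a string instead of fetched from a server.
ROBOTS_TXT = """\
User-agent: *
Disallow:
Sitemap: https://www.yoursite.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "AnyBot" is a hypothetical crawler name; the wildcard group applies to it.
allowed = parser.can_fetch("AnyBot", "https://www.yoursite.com/some/page.html")

# site_maps() is available from Python 3.8 onwards.
sitemaps = parser.site_maps()

print(allowed)
print(sitemaps)
```

An empty Disallow line blocks nothing, so the check comes back as allowed, and the sitemap URL is reported exactly as written in the file.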

Access Rules

To set access rules for a specific robot, e.g. Googlebot, the user-agent needs to be defined accordingly:

User-agent: Googlebot
Disallow: /images/

In the above example, Googlebot is denied access to the /images/ folder of a site. Additionally, a specific rule can be set to explicitly disallow access to all files within a folder:

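Disallow: /images/*

The wildcard in this case refers to all files within the folder. Crawlers that match rules by prefix, Google’s included, treat this as equivalent to Disallow: /images/, so the trailing wildcard is optional. But robots.txt can be even more flexible and define access rules for a specific page:

Disallow: /blog/readme.txt

– or a certain filetype:

Disallow: /content/*.pdf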

Similarly, if a site uses parameters in URLs and they result in pages with duplicate content, you can stop them from being crawled by using a corresponding rule, something like:

Disallow: /*?*

(meaning “do not crawl any URLs with ? in them”, which is often the way parameters are included in URLs).
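
The wildcard rules above can be sketched as a small matcher. This is a simplified illustration of prefix matching with the * wildcard, not any search engine's actual code; the function name rule_matches and the sample paths are hypothetical. Python's standard urllib.robotparser matches by plain prefix only, which is why the sketch builds its own pattern.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a Disallow pattern matches a URL path (plus query string).

    Only the '*' wildcard is handled. Rules are matched from the left,
    so a pattern only has to match the beginning of the path.
    """
    # Escape each literal chunk, then join the chunks with '.*' for every '*'.
    regex = ".*".join(re.escape(chunk) for chunk in pattern.split("*"))
    return re.match(regex, path) is not None


# (pattern, path) pairs; the paths are invented for illustration.
examples = [
    ("/images/", "/images/logo.png"),
    ("/images/*", "/images/logo.png"),
    ("/blog/readme.txt", "/blog/readme.txt"),
    ("/content/*.pdf", "/content/guides/seo.pdf"),
    ("/content/*.pdf", "/content/guides/seo.html"),
    ("/*?*", "/products?colour=red"),
    ("/*?*", "/products"),
]

for pattern, path in examples:
    print(f"{pattern!r:22} {path!r:28} blocked={rule_matches(pattern, path)}")
```

The first two patterns give the same answer for the same path, because a rule already matches everything that begins with it.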

That’s quite an extensive set of commands with a lot of different options. No wonder, then, that site owners and webmasters often fail to get it all right, and make all kinds of (sometimes dangerous or costly) mistakes.

Common robots.txt Mistakes

Here are some typical robots.txt mistakes:

1. No robots.txt file at all

Having no robots.txt file for your site means it is completely open for any spider to crawl. If you have a simple 5-page static site with nothing to hide this may not be an issue at all, but since it’s 2012, your site is most likely running on some sort of a CMS. Unless it’s an absolutely perfect CMS (and I’m yet to see one), chances are there are indexable instances of duplicate content because of the same articles being accessible via different URLs, as well as backend bits and pieces not intended for your site visitors to see.

2. Empty robots.txt

This is just as bad as not having the robots.txt file at all. Besides the unpleasant effects mentioned above, depending on the CMS used on the site, both cases also bear a risk of URLs like this one getting indexed:

https://www.somedomain.com/srgoogle/?utm_source=google&utm_content=some%20bad%20keyword&utm_term=&utm_campaign…

This can expose your site to potentially being indexed in the context of a bad neighborhood. (The actual domain name has of course been replaced, but the domain where I noticed this specific type of URL being indexable had an empty robots.txt file.)

3. Default robots.txt allowing access to everything

I am talking about a robots.txt file like this:

User-agent: *
Allow: /

Or like this:

User-agent: *
Disallow:

Just like in the previous two cases, you are leaving your site completely unprotected and there is little point in having a robots.txt file like this at all, unless, again, you are running a static 5-page site à la 1998 and there is nothing to hide on your server.
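
A quick way to convince yourself that these files all behave like an empty one is the sketch below, which again uses Python's standard urllib.robotparser. The helper name can_crawl, the bot name and the /admin/login.php URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, url: str, agent: str = "AnyBot") -> bool:
    """Parse robots.txt content from a string and check one URL against it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


EMPTY = ""
ALLOW_ALL = "User-agent: *\nAllow: /\n"
DISALLOW_NOTHING = "User-agent: *\nDisallow:\n"

# A made-up backend URL that a site owner would probably not want crawled.
url = "https://www.somedomain.com/admin/login.php"

results = [can_crawl(content, url) for content in (EMPTY, ALLOW_ALL, DISALLOW_NOTHING)]
print(results)
```

All three variants let the crawler through, which is the point: none of them protects anything.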

4. Your robots.txt contradicts your XML sitemap

Misleading the search engines is never a good idea. If your sitemap.xml file contains URLs explicitly blocked by your robots.txt, you are contradicting yourself. This can often happen if your robots.txt and/or sitemap.xml files are generated by different automated tools and not checked manually afterwards.
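
A quick automated check for this kind of contradiction might look like the sketch below. It assumes you already have the contents of both files as strings; the function name blocked_sitemap_urls, the Googlebot default and the sample data are hypothetical. Note that urllib.robotparser matches rules by plain prefix, so wildcard rules would need a matcher of their own.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Namespace used by standard XML sitemaps.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def blocked_sitemap_urls(robots_txt: str, sitemap_xml: str, agent: str = "Googlebot") -> list:
    """Return the sitemap URLs that the given robots.txt forbids the agent to crawl."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    root = ET.fromstring(sitemap_xml)
    urls = [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS) if loc.text]

    return [url for url in urls if not parser.can_fetch(agent, url)]


# Sample data, invented for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.yoursite.com/</loc></url>
  <url><loc>https://www.yoursite.com/private/report.html</loc></url>
</urlset>
"""

conflicts = blocked_sitemap_urls(ROBOTS_TXT, SITEMAP_XML)
print(conflicts)
```

Any URL this prints is listed in the sitemap and blocked by robots.txt at the same time, so one of the two files needs correcting.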

Luckily, this kind of error is relatively easy to spot using Google Webmaster Tools. If you have added your site to Google Webmaster Tools, verified it and submitted an XML sitemap for it, you can see a report on crawling the URLs submitted via the sitemap in the Optimization -> Sitemaps section of GWT. Depending on how many sitemaps your site has, it can look like this:

[Screenshot: robots.txt and Sitemap Errors in Webmaster Tools]

From there, you can dig deeper into the specific sitemap with issues detected and see the examples of blocked URLs:

[Screenshot: robots.txt Sitemap Errors in Webmaster Tools]

An additional word of praise for Google Webmaster Tools: there is a robots.txt testing tool inside it and this little handy tool can make webmasters’ lives so much easier. Its location is not so obvious but you can find it here:

The beauty of this tool is that before you apply any changes to your live robots.txt file on the server, you can test them here and see whether what you want blocked will end up getting blocked, and that any pages you want indexed aren’t blocked by accident:

To see how the Googlebot will treat specific URLs after applying the changes, you need to enter a sample URL you want to test and press the button:

This example shows a URL that will stay accessible to the Googlebot. If a URL is going to be blocked, you will see the following message:

To me this looks like a perfect tool for learning to build proper robots.txt files.

5. Using robots.txt to block access to sensitive areas of your site

If you have any areas on your site that should not be accessible, password protect them. Do not, I repeat, DO NOT ever use robots.txt for blocking access to them. There are a number of reasons why this should never be done:

– robots.txt is a recommendation, not a mandatory set of rules;
– Rogue bots not respecting the robots protocol and humans can still access the disallowed areas;
– robots.txt itself is a publicly accessible file and everybody can see if you are trying to hide something if you place it under a disallow rule in robots.txt;
– If something needs to stay completely private do not put it online, period.
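
To see how little effort it takes to read someone's "hidden" areas, here is a minimal sketch that lists every Disallow path in a robots.txt file. The function name disallowed_paths and the sample paths are made up for illustration.

```python
def disallowed_paths(robots_txt: str) -> list:
    """Collect the non-empty Disallow values from robots.txt content."""
    paths = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# Invented sample file; anyone can fetch the real thing from /robots.txt.
SAMPLE = """\
User-agent: *
Disallow: /admin/
Disallow: /secret-project/   # not so secret anymore
Disallow:
"""

exposed = disallowed_paths(SAMPLE)
print(exposed)
```

Every path in that list is a signpost for a curious human or a rogue bot, which is why real protection has to come from authentication rather than from robots.txt.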

One of the favourite entertainments of the SEO community is checking Google’s robots.txt to see what new secret projects they are working on – and many times in the past people have spotted, via robots.txt, projects that had not yet been released to the public.

Quirky Uses of Robots.txt and Fun Facts

Robots.txt is a serious element of any onsite optimization project, but at the same time lots of fun can be had with it too:

– Some people declare their love of robots via it (Vizify.com) or send you off to watch a video of a dancing robot (Moz);
– Google has used its robots.txt over Halloween to protect itself from zombies;
– Brett Tabke has used WebmasterWorld’s robots.txt file to… run a blog in it!
– Daily Mail used their robots.txt to hire an SEO a while ago (via Malcolm Coles);
– Whyte & Mackay once ran a hidden competition in their robots.txt and those who spotted it could win some whisky (via Malcolm Coles, who definitely likes discovering such things);
– Digg has its robots.txt file cloaked – when you access it as a human visitor all you see is this:

(This post by Sebastian X has instructions on how to cloak your robots.txt file if you’re feeling particularly paranoid – just remember that if somebody really wants to see it they can always spoof their user agent or look at Google’s cache of the file.)

– Among the search engines, Yahoo has the shortest robots.txt file (4 lines only), while Google’s robots.txt is massive – 291 lines;
– YouTube has a reference to the Terminator plot in their robots.txt:

The views expressed in this post are those of the author so may not represent those of the Koozai team

Responses

CR
2nd Feb 2017
Hi,
Good one about robots.txt
I have one issue with my robots indexers.
We see more connections open on the Apache server due to these robot indexers, so the server was going down and customers saw HTTP 503 errors while accessing the site.
Any ideas on this?
Thanks,
CR
Reply
Chris
15th Mar 2016
What if you use a robots.txt command to hide old URLs from a previous site?
The old site has had many redirects going to the new one. Will this have a negative effect on SEO due to link juice not being passed? Without hiding these links I get 404s when I do a crawl.
Reply
1. James Challis
  16th Mar 2016
  Hi Chris,
  I would suggest that you don’t block old URLs as this would mean that search engines are unable to crawl these URLs and any value from external links pointing to these URLs won’t be passed on to any page.
  Instead what you should look to do is redirect every old URL, ideally to the most relevant working page on the site or at the very least to the site’s homepage to help search engines and users find the right webpage from an external link.
  Thanks,
  James
  Reply
Dean Marsden
15th Dec 2015
Hi James,
There is no logical way or reason for doing this unfortunately, so it’s not something that can be implemented.
If your content is behind a paywall and is a trusted news source, then Google has a few options in place for publishers. https://support.google.com/news/publisher/answer/40543
However a method of driving search traffic to a specific keyword targeted page might be to have a snippet or teaser of your content on a public page/section of your site, then a link to the full content under the password protected area.
Reply
James Roach
13th Dec 2015
Can I ask you a question… I want to password protect a folder but I want Google to be able to crawl these folders and files… Any way to do this?
Reply
nev
20th Jun 2015
This is wrong.
Disallow:/images/*
… is not right. It should just be:
Disallow:/images/
Reply
Shaan @ PNR Status
18th Dec 2013
To prevent the indexing of robots.txt in search engine add this code in your .htaccess file.
Header set X-Robots-Tag "noindex"
Reply
Laust
29th Aug 2013
my robots.txt is being indexed and is in the SERP – what to do?
Reply
Kingsley Agu
2nd Mar 2013
wooow.. These tips will help me know how to include the robots.txt file better in my site. I’ll try it out now.
Thanks for the write up.
Reply
Ahmed
4th Feb 2013
if i want to add multiple sitemaps then what to do?
Reply
Anna Lewis
5th Sep 2012
Great post, thanks Irish Wonder!
I love all the techy solutions, but if you’re talking about robots.txt you have to see an awesome example of one full of ascii art by Rishi Lakhani: https://explicitly.me/robots.txt
It adds a whole other dimension to what you could use your robots.txt file for!
Reply
1. IrishWonder
  5th Sep 2012
  Wow Rishi’s robots is ace, thanks for sharing Anna!
  Reply
g1smd
5th Sep 2012
A rule such as…
Disallow: /this
should disallow…
example.com/this
example.com/this/
example.com/this.ext
example.com/thisorthat
example.com/this/that
and anything else that begins
example.com/this
I’ve not seen any counter examples. The few times I thought there was a problem, it turned out the disallow rule was not in the right ruleset to be “seen” by the bot in question.
Reply
1. IrishWonder
  5th Sep 2012
  Thanks Ian, that explains it perfectly then. Looking forward to seeing you in Brighton!
  Reply
g1smd
5th Sep 2012
Trailing * wildcard is always redundant.
Disallow: /this*
is exactly the same as
Disallow: /this
URL patterns are “matched from the left”.
Use the * wildcard only in the middle of a pattern:
Disallow: /*this
Be aware too, that when you have a section of the robots.txt file beginning:
User-agent: Googlebot
that Google will read ONLY this section of the file.
This means that any rules in the User-agent: * section of the robots.txt file that you want to also apply to Google must be duplicated into the section of the file that Google will read.
Reply
1. IrishWonder
  5th Sep 2012
  Great info Ian, thanks for adding it here! You are THE expert so it’s very beneficial to have your input on the topic.
  One concern that I have from my experience, however: I have seen cases where
  Disallow:/something/
  still lets domain.com/something/whatever.htm slip into the index, while
  Disallow:/something/*
  in addition to the non-wildcard line seems to fix the issue. What’s going on here?
  Reply
DaveN
5th Sep 2012
never stops amazing me how many times a developer will roll back the robots.txt file and block everything :)
Reply
1. IrishWonder
  5th Sep 2012
  shit happens :-)
  Reply
CSRWEB
5th Sep 2012
It’s amazing how simple the robots.txt file is but it can make such a large difference. I also really like the Google Webmaster tools robots testing tool, useful for beginners and experts alike.
Reply
Earl Grey
5th Sep 2012
funnily enough i have been working a lot with robots.txt recently.
a very powerful but ignored tool
Reply
Gareth
5th Sep 2012
Also read somewhere about building links to a robot.txt blocked page makes the spiders hungry for more and keeps them coming back. Is that still true you think?
Reply
1. IrishWonder
  5th Sep 2012
  Gareth – can neither confirm nor deny that, it sounds a bit along the lines of “SEO Voodoo” ;-) but in my experience, having a link to the robots.txt file itself gets it indexed and even makes it appear as one of the results for a site: search in G
  Reply
Mike Essex
5th Sep 2012
Thanks for a great write-up and it’s always nice to get a refresher on the basics. It’s crazy how many big sites forget about Robots.txt files as they chase other SEO aspects. It’s really essential for a day one site launch.
I hate how Google is still indexing pages even if they are blocked. Do you feel “nofollow / noindex” and canonical tags will be enough if you want a page to not be indexed, or are even they ignored?
Reply
1. IrishWonder
  5th Sep 2012
  Google puts that disclaimer up for a reason – there isn’t much they can really do about NOT indexing blocked pages, be it via robots.txt or a meta robots tag, until they access the actual page. So if they just spider links from elsewhere on the web, they either need to first visit the pages linked to, discover they are blocked and not include them in the index – or they just include them but since they are blocked they cannot retrieve anything from those pages. Previously, they used to display just the URL (like in the supplementary index way back) – now they add a note that I caught in the first screenshot.
  Reply
