The robots.txt file is a de facto standard for asking cooperative robots not to index specific pages. One example of such a robot is the Google spider that crawls the web, searching for things to entangle and add to its growing wisdom.

Robots found in the wild

For those of you who are not the type to manage a web server or look at its logs, here is an excerpt of the user agents found in the log of this web server: all user agents recorded in December 2016 accessing the robots.txt on eucli.de. The first column shows the number of times each user agent accessed the robots.txt file (the shell snippet at the end of this page shows how the list was extracted):

      1 COMODO SSL Checker
      1 Mozilla/4.0 (compatible; MSIE 5.01; Windows NT)
      1 Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36
      1 Mozilla/5.0 (Windows NT 6.1; WOW64; rv:41.0) Gecko/20100101 Firefox/41.0
      1 Mozilla/5.0 (compatible; Dataprovider.com;)
      1 Python-urllib/1.17
      2 Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12
      2 Mozilla/5.0 (compatible; Dataprovider; https://www.dataprovider.com/spider/)
      3 CCBot/2.0 (http://commoncrawl.org/faq/)
      3 Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36
      3 Mozilla/5.0 (compatible; archive.org_bot; Wayback Machine Live Record; +http://archive.org/details/archive.org_bot)
      4
      4 CSS Certificate Spider (http://www.css-security.com/certificatespider/)
      4 Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, help@moz.com)
      4 Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)
      6 Mozilla/5.0 (X11; U; Linux Core i7-4980HQ; de; rv:32.0; compatible; JobboerseBot; http://www.jobboerse.com/bot.htm) Gecko/20100101 Firefox/38.0
      7 SafeDNSBot (https://www.safedns.com/searchbot)
     11 Mozilla/5.0 (compatible; Linux x86_64; Mail.RU_Bot/2.0; +http://go.mail.ru/help/robots)
     13 -
     16 Mozilla/5.0 (compatible; electricmonk/3.2.0 +https://www.duedil.com/our-crawler/)
     17 Mozilla/5.0 (compatible; AhrefsBot/5.1; +http://ahrefs.com/robot/)
     19 Mozilla/5.0 (compatible; Exabot/3.0; +http://www.exabot.com/go/robot)
     20 Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
     25 Mozilla/5.0 (compatible; SeznamBot/3.2; +http://napoveda.seznam.cz/en/seznambot-intro/)
     43 Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2
     48 Mozilla/5.0 (compatible; MJ12bot/v1.4.7; http://mj12bot.com/)
     52 Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
     53 Mozilla/5.0 (compatible; SemrushBot/1.1~bl; +http://www.semrush.com/bot.html)
     92 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
    238 Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

Let's state some observations: bingbot checks the file most eagerly, about eight times a day, with Googlebot at roughly three times a day. Many of the visitors are not search engines at all, but SEO crawlers like AhrefsBot, MJ12bot, and SemrushBot. Some requests claim to come from ordinary desktop browsers, which is an odd disguise for fetching a robots.txt. And a few arrive with no user agent at all, or a bare '-'.

Let's give them something to eat.

The veil

Some of the pages here are not quite finished yet. I tag them by adding a file named .staging to their directory, and their entries in the table of contents are then marked with a CSS class staging. You do not see them, but they are there. Again, I do not see this as a measure of security, just as one of several veils between the place where an article is first conceived and the moment it comes close to being public. Before articles even appear here, I have two other levels of staging areas.
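
As a sketch of the mechanism (tocEntry is a hypothetical name here; only tagIfStaging, quoted at the end of this page, is from the real make-index.py), the index generator appends the class to each menu entry, and a stylesheet rule along the lines of .staging { display: none } keeps it out of sight:

# hypothetical sketch: emit one table-of-contents entry,
# adding the class 'staging' when the directory is tagged
def tocEntry(name, title):
    cls = 'entry' + tagIfStaging(name)  # 'entry' or 'entry staging'
    return '<li class="%s"><a href="/%s">%s</a></li>' % (cls, name, title)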

Back to the veiled menu entries: Bots do see them, of course. This may lead to the awkward situation where Google sends you here to read a blog article, and then you don't find it in the table of contents.

Let's tell Google not to do that.

The advice

At the time of this writing, this is the robots.txt:

User-agent: *
Disallow: /arts
Disallow: /music
Disallow: /blog/2017-01-01+Cognitive_Tools
Disallow: /blog/2017-01-15+Robots
Disallow: /games/sokoban

See, I may be writing something about arts and music at some point, but I'm not quite there yet.
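
Just to see the advice through a robot's eyes: Python's standard robotparser module (urllib.robotparser on Python 3) can check what a cooperative robot would do with this file. A minimal sketch:

# simulate a cooperative robot reading the robots.txt above
import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://eucli.de/robots.txt')
rp.read()
print rp.can_fetch('*', 'http://eucli.de/blog/2017-01-15+Robots')  # False: staged
print rp.can_fetch('*', 'http://eucli.de/blog')                    # True: public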

So we start the file by addressing every user agent in the world, and then add one Disallow line for each area where a tag file .staging tells the index generator to mark the menu entry as staging; the excerpts at the end of this page show the wiring.
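
For the robots.txt above, the intermediate .stages file collected by the index generator would contain something like this (a guess at the exact form; the Makefile's sed call strips an optional leading ./ and prepends Disallow: /):

./arts
./music
./blog/2017-01-01+Cognitive_Tools
./blog/2017-01-15+Robots
./games/sokoban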

jan 2017-01-15
excerpt from grep-for-robots.sh
# count the user agents that requested robots.txt; field 6 of an
# access log line, split at '"', is the quoted user agent
grep robots.txt access.log.1 \
    | cut -f6 -d'"' | sort | uniq -c | sort -n
excerpt from Makefile.make
# regenerate robots.txt from the collected list of staged areas
robots.txt: .stages
	@echo 'User-agent: *' > $@
	@sort -u .stages | sed 's,^\(\./\)\?,Disallow: /,' >> $@
excerpt from make-index.py
import os

def tagIfStaging(name):
    # a directory tagged with a .staging file is not public yet:
    # veil its menu entry and keep it out of the search indexes
    if os.path.isfile(os.path.join(name, '.staging')):
        addToRobotsTxt(name)
        return ' staging'
    return ''

def addToRobotsTxt(name):
    # collect the area in .stages; the Makefile rule above turns
    # that list into robots.txt (presumably .stages starts fresh
    # on every build, so that areas can also leave staging)
    with open('.stages', 'a') as stages:
        print >> stages, name