The robots.txt is a de facto standard for asking cooperative robots not to index specific pages. One example of such a robot is the Google spider that crawls the web searching for things to entangle and add to its growing wisdom.
Robots found in the wild
For those of you who are not the type to manage a web server or look
at the logs, I list here an excerpt of the user agents I find in the
log of this web server. These are all user agents recorded in
December 2016 accessing the robots.txt of
eucli.de. The first column shows the number of
times this "user agent" accessed the robots.txt:
  1 COMODO SSL Checker
  1 Mozilla/4.0 (compatible; MSIE 5.01; Windows NT)
  1 Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36
  1 Mozilla/5.0 (Windows NT 6.1; WOW64; rv:41.0) Gecko/20100101 Firefox/41.0
  1 Mozilla/5.0 (compatible; Dataprovider.com;)
  1 Python-urllib/1.17
  2 Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:22.214.171.124) Gecko/20101026 Firefox/3.6.12
  2 Mozilla/5.0 (compatible; Dataprovider; https://www.dataprovider.com/spider/)
  3 CCBot/2.0 (http://commoncrawl.org/faq/)
  3 Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36
  3 Mozilla/5.0 (compatible; archive.org_bot; Wayback Machine Live Record; +http://archive.org/details/archive.org_bot)
  4
  4 CSS Certificate Spider (http://www.css-security.com/certificatespider/)
  4 Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, email@example.com)
  4 Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)
  6 Mozilla/5.0 (X11; U; Linux Core i7-4980HQ; de; rv:32.0; compatible; JobboerseBot; http://www.jobboerse.com/bot.htm) Gecko/20100101 Firefox/38.0
  7 SafeDNSBot (https://www.safedns.com/searchbot)
 11 Mozilla/5.0 (compatible; Linux x86_64; Mail.RU_Bot/2.0; +http://go.mail.ru/help/robots)
 13 -
 16 Mozilla/5.0 (compatible; electricmonk/3.2.0 +https://www.duedil.com/our-crawler/)
 17 Mozilla/5.0 (compatible; AhrefsBot/5.1; +http://ahrefs.com/robot/)
 19 Mozilla/5.0 (compatible; Exabot/3.0; +http://www.exabot.com/go/robot)
 20 Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
 25 Mozilla/5.0 (compatible; SeznamBot/3.2; +http://napoveda.seznam.cz/en/seznambot-intro/)
 43 Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2
 48 Mozilla/5.0 (compatible; MJ12bot/v1.4.7; http://mj12bot.com/)
 52 Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
 53 Mozilla/5.0 (compatible; SemrushBot/1.1~bl; +http://www.semrush.com/bot.html)
 92 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
238 Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
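If you want to produce a list like this from your own logs, here is a minimal sketch in Python. It assumes the server writes the common "combined" log format (request line, status, size, referer and user agent, the last two in double quotes); I have not checked this against any particular server configuration. The names LOG_LINE, count_robots_agents and format_counts are made up for this illustration.

```python
import re
from collections import Counter

# Assumed "combined" log format, e.g.:
# 192.0.2.1 - - [01/Dec/2016:00:00:01 +0000] "GET /robots.txt HTTP/1.1" 404 162 "-" "SomeBot/1.0"
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '   # request line
    r'(?P<status>\d{3}) \S+ '                      # status code and response size
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'     # referer and user agent
)


def count_robots_agents(lines):
    """Count requests for /robots.txt per user agent string."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and match.group("path") == "/robots.txt":
            counts[match.group("agent")] += 1
    return counts


def format_counts(counts):
    """Return 'count agent' lines, sorted by ascending count, then by agent."""
    ordered = sorted(counts.items(), key=lambda item: (item[1], item[0]))
    return [f"{number:3d} {agent}" for agent, number in ordered]
```

Feed it the lines of an access log, for example count_robots_agents(open(path)) with path pointing at your own log file, and print the result of format_counts. Lines that do not match the assumed format are silently skipped, and an empty user agent is counted under the empty string.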
Let's state some observations:
There are quite a lot of bots that actually do read the robots.txt.
Until now they were all disappointed because I was too lazy to write a
robots.txt: I left it out of the list for clarity, but all of these requests were greeted with a shrugging 404.
It's not just Google: there's also Microsoft (bingbot) and Yahoo! (which I thought was long extinct), and some I've heard of, like the Russian Yandex. But then there are quite a few whose purpose I can't even guess. Also, Microsoft is about 2.5 times more interested in me than Google, whatever that means ...
Yes, you are reading this correctly: There are 17 accesses where the client doesn't identify itself (user agent string is empty or "-"). I can virtually hear them snickering and rubbing their hands and saying "Now let's see what they don't want us to look at ...". Remember that security by obscurity is no security.
Let's give them something to eat.
Some of the pages here are not yet quite finished. I tag them by
adding a file named
.staging to the directory, and then their
entries in the table of contents are marked with a CSS class
staging. You do not see them, but they are there. Again, I do not
see this as a measure of security, just as one of several veils
between the moment an article is first conceived and the point where
it gets close to being public. Before articles even appear here, I
have two other levels of staging areas.
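The tagging idea can be sketched in a few lines of Python. This is not the index generator used for this site, just an illustration under assumptions: every article area is a directory directly below some root, and the function names is_staged and menu_entries are invented here.

```python
from pathlib import Path


def is_staged(directory: Path) -> bool:
    """A directory counts as staged if it contains a file named .staging."""
    return (directory / ".staging").is_file()


def menu_entries(root: Path):
    """Return (name, css_class) pairs for the subdirectories of root.

    Staged directories get the CSS class "staging", all others get
    an empty class. Hypothetical helper, for illustration only.
    """
    entries = []
    for directory in sorted(p for p in root.iterdir() if p.is_dir()):
        css_class = "staging" if is_staged(directory) else ""
        entries.append((directory.name, css_class))
    return entries
```

A template could then emit the class on each menu entry and leave it to the stylesheet to hide everything marked staging.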
Back to the veiled menu entries: Bots do see them, of course. This may lead to the awkward situation where Google sends you here to read a blog article, and then you don't find it in the table of contents.
Let's tell Google not to do that.
At the time of this writing, this is the robots.txt:

    User-agent: *
    Disallow: /arts
    Disallow: /music
    Disallow: /blog/2017-01-01+Cognitive_Tools
    Disallow: /blog/2017-01-15+Robots
    Disallow: /games/sokoban
See, I may be writing something about arts and music at some point, but I'm not quite there yet.
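To see how a cooperative robot might read these rules, here is a small sketch using Python's standard urllib.robotparser module. The rules are a shortened copy of the file above, pasted in as a string instead of being fetched from the server; the helper name build_parser is made up for this example, and real crawlers may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# A shortened copy of the rules shown above, as a string.
ROBOTS_TXT = """\
User-agent: *
Disallow: /arts
Disallow: /music
Disallow: /games/sokoban
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt rules from a string rather than from a URL."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    # "Googlebot" has no section of its own, so the "*" rules apply to it.
    for path in ("/arts/painting", "/music", "/blog/", "/games/sokoban/level1"):
        verdict = "allowed" if parser.can_fetch("Googlebot", path) else "disallowed"
        print(path, verdict)
```

Each Disallow line is matched as a path prefix, so a rule for /arts also covers everything below it.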
So we start the file by addressing every user agent in the world, and
then we add a new line for each area where a tag file (.staging) tells
the index generator to mark the menu entry as