Good bot, bad bot. How search indexing works.
Posted: 11 November, 2020
In 1993, Charlie Stross was feeling bored and fractious in his job at SCO UNIX. He decided to learn Perl and work on a Web spider since his company was developing HTML. Back then the resources that existed to help people learn to build these things were very much in their early stages, but Charlie muddled through. After coding up something that looked like it would work, he began testing it, initially putting it to work crawling one of the websites that he'd used in order to learn how to build his spider. Unfortunately, the website in question was owned by a company that did not have the kind of infrastructure to handle the incoming traffic from this newly built tool. Charlie overloaded them with traffic, much in the way that a Denial of Service attack would today.
Seeing the drain that this spider was putting on his web server, the owner of the website, Martijn Koster, who is also credited with creating the Web's first search engine, ALIWEB, quickly created a protocol. This protocol was essentially a list of rules that an inbound spider would be asked to obey in order to index a website properly. This was the unlikely chain of events which led to the birth of a de facto Web standard that has persisted ever since: robots.txt.
Interestingly, just as Charlie decided to learn Perl by working on a Web spider, Mojeek founder Marc Smith started developing the search engine Mojeek as a way of learning C.
How does robots.txt work? An analogy might help
Imagine you are in a library of books that have not yet been properly indexed. There is a great deal of information stored here, but no-one has put in the work to look through it and make records, like the old paper card indexes that libraries used to hold. Such indexes make it easier for avid readers and researchers to find what they are looking for, whilst also preventing the library from being cluttered up with people searching desperately and without direction for the information they need.
To find out what is contained within these books, we could just send out a large number of librarians to check each and every one and take notes on them. Without any kind of standardisation, though, what ends up chronicled and stored at the front desk is left to the individual whims of each of these workers.
But there is another way. What if each of these books had an instruction page which marked those pages or chapters not worth visiting or reading? This kind of standard speeds up the process of indexing information, whilst allowing authors in the analogy, and web developers in the real world, to tell good actors which parts of their creations should be avoided. On top of this, the file prevents websites from being overloaded by large quantities of requests from good actors. We repeat "good actors" here because pages blocked by robots.txt are still indexable; it is a choice to follow this standard, and it's one that Mojeek has always made.
You can try this out now: pick whichever website comes to mind and add /robots.txt after the main URL, like https://www.linkedin.com/robots.txt. What you will likely find is a long list of identified user-agents, which are the names of the spiders that go around indexing the Web, each with a list of what it should and should not crawl on the site (allow, disallow). It's important to underline here that robots.txt is a trust system. It's a way that good actors explore and index the Web; the rules in these documents are not binding, but you're not playing fair if you decide to go against them.
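To make the allow/disallow idea concrete, here is a minimal sketch of how a well-behaved spider might consult these rules before fetching a page. It uses Python's standard urllib.robotparser module. The robots.txt content, the ExampleBot and OtherBot names, and the example.com URLs are all made up for illustration and do not come from any real site or from Mojeek's crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example only.
EXAMPLE_ROBOTS_TXT = """\
User-agent: ExampleBot
Allow: /private/press/
Disallow: /private/

User-agent: *
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this user-agent to crawl the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(EXAMPLE_ROBOTS_TXT)

if __name__ == "__main__":
    for agent, url in [
        ("ExampleBot", "https://example.com/articles/1"),
        ("ExampleBot", "https://example.com/private/notes"),
        ("ExampleBot", "https://example.com/private/press/launch"),
        ("OtherBot", "https://example.com/articles/1"),
    ]:
        verdict = "may crawl" if may_fetch(parser, agent, url) else "should skip"
        print(f"{agent} {verdict} {url}")
```

In this sketch the more specific Allow line is listed before the broader Disallow line, because parsers can differ in how they resolve overlapping rules. Nothing here stops a spider from fetching a disallowed page anyway; the check only has an effect if the spider chooses to run it and respect the answer.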
Standards of Practice
The robots.txt standard has persisted for ages, and in true early Web fashion, the original specifics were fleshed out by consensus on a mailing list before being published. Martijn Koster still hosts the original robotstxt.org website, which has on its homepage:
It is not an official standard backed by a standards body, or owned by any commercial organisation. It is not enforced by anybody, and there no guarantee that all current and future robots will use it. Consider it a common facility the majority of robot authors offer the WWW community to protect WWW server against unwanted accesses by their robots.
This is the kind of open source, decentralised, and utopian thinking that permeated the Web of the past. There was no standard for this process and so people came together and created it, by consensus and for free.
More recently the protocol has been properly formalised, with Google and the Internet Engineering Task Force coming together on July 1, 2019 to draft an updated standard, which is now in its second version and was worked on by, amongst other individuals, Martijn Koster.
Mojeek now uses this updated protocol. The improvements should give better search results to people using Mojeek for pages already within our index, as well as helping us to find new pages which were not previously factored into the process of scouting out information on the Web. For web developers, the inclusion of more tools, as well as a process of standardisation, will result in less time spent on defining crawl specifications in the headers of their websites, and more time spent on designing a beautiful experience for their visitors.