myTech.Today

Google’s Open Source Blog announces that their robots.txt Parser is Now Open Source

"For 25 years, the Robots Exclusion Protocol (REP) was only a de-facto standard. This had frustrating implications sometimes. On one hand, for webmasters, it meant uncertainty in corner cases, like when their text editor included BOM characters in their robots.txt files. On the other hand, for crawler and tool developers, it also brought uncertainty; for example, how should they deal with robots.txt files that are hundreds of megabytes large?

Today, we announced that we're spearheading the effort to make the REP an internet standard. While this is an important step, it means extra work for developers who parse robots.txt files. 

We're here to help: we open-sourced the C++ library that our production systems use for parsing and matching rules in robots.txt files. This library has been around for 20 years and it contains pieces of code that were written in the '90s. Since then, the library evolved; we learned a lot about how webmasters write robots.txt files and corner cases that we had to cover for, and added what we learned over the years also to the internet-draft when it made sense. 

We also included a testing tool in the open-source package to help you test a few rules. Once built, the usage is very straightforward..."

 
In open-sourcing its robots.txt parser, Google is attempting to establish a new Internet standard that deals with websites' exclusion lists in a more methodical and organized manner. Previously, the Robots Exclusion Protocol (REP) was simply a list of folders and files that web spiders should ignore. Whether a spider actually honored the exclusion request went unchecked unless the site owner took drastic steps.
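To make the format concrete, here is a minimal sketch of how a well-behaved crawler might consult an exclusion list before fetching a page. It uses Python's standard-library `urllib.robotparser` rather than Google's C++ parser, and the rules, the `ExampleBot` user agent, the `is_allowed` helper, and the example.com URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/report.html
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.com/index.html",
        "https://example.com/private/data.html",
    ):
        print(url, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

Note that the check happens entirely on the crawler's side: nothing in the file itself stops a spider that never calls something like `is_allowed`.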
 

An open-source robots.txt parser will provide transparency to website developers so that they can write files that convey the intended rules correctly. It will also allow for the creation of new web development tools designed to generate robots.txt files that match the input the parser expects.

In the future, it will allow other spider-bot creators to collaborate on an extensible, forward-looking protocol that will solve current issues, such as the aforementioned ignoring of rules, as well as more common ones such as multi-megabyte robots.txt files in place of the several-kilobyte files expected from older websites.
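Two of the corner cases mentioned above, stray BOM characters and oversized files, can be handled defensively before any rule is parsed. The sketch below is one possible approach, not Google's implementation: the `prepare_robots_bytes` helper is hypothetical, and the 500 KiB cap is an assumed value chosen for illustration.

```python
import codecs
from typing import List

# Assumed size cap for illustration; the post above does not state a limit.
MAX_ROBOTS_BYTES = 500 * 1024


def prepare_robots_bytes(raw: bytes, limit: int = MAX_ROBOTS_BYTES) -> List[str]:
    """Clip oversized input, drop a leading UTF-8 BOM, and split into lines."""
    # Ignore everything past the cap instead of rejecting the whole file.
    clipped = raw[:limit]
    # Some text editors prepend a byte order mark; remove it if present.
    if clipped.startswith(codecs.BOM_UTF8):
        clipped = clipped[len(codecs.BOM_UTF8):]
    # errors="replace" tolerates a multi-byte character cut by the clipping.
    text = clipped.decode("utf-8", errors="replace")
    return text.splitlines()
```

The returned lines could then be passed to whatever parser a crawler uses. Clipping rather than rejecting is a design choice: the rules near the top of a huge file still apply, while the crawler's memory use stays bounded.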