Robots.txt: who is looking for the files you want to keep hidden
When hackers first probe a site for vulnerabilities, one thing is almost always on their list: robots.txt.
This is a special file that tells search engine crawlers which pages or files on your site they should or shouldn’t crawl and index. Pay attention to the second part: it lists the files you want search engines to skip. In other words, it signals that this content is not indexed and therefore cannot be found through Google or other search services, which is useful information for a hacker. An attacker can use it to map the structure of the site and its entry points, and then plan and carry out further attacks.
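To see how little effort this takes from the attacker’s side, here is a minimal Python sketch that pulls every Disallow path out of a robots.txt body. The function name and the sample file contents are invented for illustration, and the parsing is deliberately simplified (it ignores which User-agent group a rule belongs to).

```python
from __future__ import annotations


def disallowed_paths(robots_txt: str) -> list[str]:
    """Return every non-empty Disallow path in a robots.txt body, in file order."""
    paths = []
    for raw_line in robots_txt.splitlines():
        # Everything after '#' is a comment in robots.txt.
        line = raw_line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow":
            value = value.strip()
            if value:  # an empty Disallow value blocks nothing
                paths.append(value)
    return paths


# Hypothetical robots.txt contents, made up for this example.
SAMPLE = """\
User-agent: *
Disallow: /admin/
Disallow: /backup/  # nightly dumps
Allow: /public/
Disallow:
"""

print(disallowed_paths(SAMPLE))  # ['/admin/', '/backup/']
```

A few lines of code turn the file into a ready-made list of places the site owner would rather nobody looked.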
And here’s the thing: robots.txt is almost never protected from any visitor. The dilemma is that it has to be publicly available so that search engines can read it. How, then, can you protect it? It depends on whether you can tell your visitors apart. The most appropriate policy is to allow access to this file only for legitimate search engine crawlers and deny it to everyone else. The problem is how to make this distinction in practice.
Tip: How to check whether your site was visited by a real or fake Googlebot.
- Grep your web server access log for the IP addresses of clients that identify themselves as ‘Googlebot’ in the User-Agent request header.
- Perform a reverse DNS lookup on each of them using the ‘host’ command (ex.: $ host 126.96.36.199). The returned domain should be googlebot.com or google.com.
- Run a forward DNS lookup on the full domain name that was returned (ex.: $ host crawl-66-249-64-96.googlebot.com). It should return the same IP that appears in your web server access log. If they don’t match, it was a fake Googlebot.
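The three steps of the tip can also be automated. The following Python sketch is one possible way to do it, not a definitive implementation. It assumes an access log in the common/combined format, where the client IPv4 address is the first field, and it uses the standard library calls socket.gethostbyaddr and socket.gethostbyname_ex in place of the ‘host’ command. The function names are hypothetical.

```python
import re
import socket

# Domains a genuine Googlebot hostname should end with (per the tip above).
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

# Assumption: the log line starts with the client IPv4 address.
LEADING_IPV4 = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3})\s")


def claimed_googlebot_ips(log_lines):
    """Step 1: collect IPs from log lines that mention 'Googlebot'."""
    ips = set()
    for line in log_lines:
        if "Googlebot" in line:
            match = LEADING_IPV4.match(line)
            if match:
                ips.add(match.group(1))
    return ips


def is_real_googlebot(ip, reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Steps 2 and 3: reverse lookup, domain check, then forward lookup.

    The resolver functions are parameters so they can be replaced in tests.
    """
    try:
        hostname = reverse(ip)[0]  # (hostname, aliases, addresses)
    except OSError:  # no PTR record or lookup failure
        return False
    if not hostname.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False
    # The forward lookup must lead back to the IP seen in the log.
    return ip in addresses
```

In practice you would pass the lines of your access log to claimed_googlebot_ips and call is_real_googlebot on each result. Any address for which it returns False only claimed to be Googlebot in its User-Agent header.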
Most websites are protected by some sort of Web Application Firewall (WAF). However, the purpose of a WAF is to protect you from security threats by analyzing the content of the request. It doesn’t analyze who made the request or what specific software was used. In principle, you can write custom WAF rules that protect the file from everyone except search engines. However, this is a rather difficult and time-consuming task, because a WAF is a tool designed for other purposes. Almost no one does this, so the file that reveals to hackers exactly what you would like to hide remains open to everyone.
The solution is surprisingly simple. You need software that analyzes not only the content of the request but also its source. You simply allow access to real search engine crawlers and block everyone else. BotGuard provides software that protects the contents of your robots.txt from unwelcome visitors by default.
By the way, we do not recommend relying on robots.txt alone to instruct search engines. Some crawlers ignore the instructions in this file, so it is better to protect sensitive content directly.
It would be interesting to know where else it is necessary to distinguish requests by their sources. You are welcome to comment on this post or send your thoughts to firstname.lastname@example.org.