(Updated 2 October 2011) There is no question that large sites with global reach have significantly greater security concerns than small local sites. Whether the metric is traffic, page requests, or visibility, large numbers bring a reminder that not all visits are welcome. Commerce has forced a focus on securing the transaction process, and while some members of the community may be momentarily behind the curve, everyone realizes that vulnerabilities WILL be exploited. If you're not paying attention to security, you should be.
One preventative strategy in place everywhere is locking down and securing the network - permitting only necessary data to be publicly available. This is routine. Except, it seems, when it comes to robots.txt.
Think about how much information you're giving away about your implementation via robots.txt. We all know who looks at those files: other than bots, it's going to be your competition and hackers, both trying to learn more about your operation. That alone should be reason to act, because robots.txt delivers exact pathways to cgi-bin, images, etc.
So what are the solutions?
One simple way would be to block the file from any user agent that is not a search engine, using .htaccess deny/allow directives. So basically a cloak job - we only want the search engines to see our file, and everyone else is denied access; a bare-bones version is sketched below. The problem with this overly simple approach is that it doesn't account for the consequences of denying access: serving a blank page, a 404, or a server error is not acceptable.
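That sketch, assuming Apache 2.2-era .htaccess syntax (the bot list here is illustrative):

SetEnvIfNoCase User-Agent "(googlebot|msnbot|slurp)" search_bot
<Files "robots.txt">
Order Deny,Allow
Deny from all
Allow from env=search_bot
</Files>

Everyone outside the list gets a 403 Forbidden - exactly the unacceptable result just described.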
Our recommended method is to rewrite the URL to a server-side script and handle the entire process there. A PHP version is running on this site and is shown below. In this case all requests, except those from verified search engines, get redirected to the homepage. Whatever technique you use, always confirm that Google can still see your robots.txt file from Webmaster Tools.
Try it. http://www.re1y.com/robots.txt
How re1y.com Hides Its robots.txt File
There are three pieces to this php implementation. The first is the conditional .htaccess rewrite. The second is the code that attempts to validate the visiting bot. The third is what calls the code from the robots.txt file.
Code below resides in the .htaccess file. The first line matches any request whose user agent is NOT Googlebot, msnbot, or Slurp; the second line rewrites those requests for robots.txt to the homepage, so only the three search engines see the actual file. The last two lines tell Apache to parse .txt files as PHP, which is what lets the include at the top of robots.txt execute. (.htaccess code below assumes RewriteEngine on)
# Matches any user agent that is NOT googlebot, msnbot, or Slurp ([NC] = case-insensitive)
RewriteCond %{HTTP_USER_AGENT} !(googlebot|msnbot|slurp) [NC]
# Sends those requests for robots.txt to the homepage
RewriteRule ^robots\.txt$ http://www.re1y.com/ [R,NE,L]
# Parses .txt files as PHP so the include at the top of robots.txt executes
AddType application/x-httpd-php .php
AddHandler application/x-httpd-php .txt
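One side effect to be aware of: as written, the AddHandler line makes Apache run every .txt file on the site through PHP, not just robots.txt. If that's unwanted, the handler can be scoped to the one file (a sketch in standard Apache syntax):

<Files "robots.txt">
AddHandler application/x-httpd-php .txt
</Files>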
The PHP code below resides in an include file called reverse-dns-inc.php and is referenced by the robots.txt file. If the visiting bot cannot be validated, the script redirects the request to the homepage. (Note the match is on 'slurp' rather than 'Yahoo Slurp': the live user agent string is "Yahoo! Slurp", exclamation point included, so the longer literal would never match.)
<?php
// Only requests claiming to be one of the three bots reach this point;
// the .htaccess rule above has already redirected everyone else.
$ua = $_SERVER['HTTP_USER_AGENT'];
if (stristr($ua, 'msnbot') || stristr($ua, 'googlebot') || stristr($ua, 'slurp')) {
    // Step 1: reverse DNS. The requesting IP must resolve to a hostname
    // inside one of the engines' crawler domains.
    $ip = $_SERVER['REMOTE_ADDR'];
    $hostname = gethostbyaddr($ip);
    if (!preg_match("/\.googlebot\.com$/", $hostname) &&
        !preg_match("/search\.live\.com$/", $hostname) &&
        !preg_match("/crawl\.yahoo\.net$/", $hostname)) {
        // Hostname doesn't belong to any of the engines: off to the homepage.
        header("Location: /");
        exit;
    }
    // Step 2: forward-confirm. The hostname must resolve back to the
    // original IP, or the reverse DNS record is spoofed.
    $real_ip = gethostbyname($hostname);
    if ($ip != $real_ip) {
        header("Location: /");
        exit;
    }
}
?>
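A quick way to exercise the whole chain from the command line, assuming curl is available (the user agent strings are just examples):

# Ordinary browser UA: the .htaccess rule redirects to the homepage
curl -I -A "Mozilla/5.0" http://www.re1y.com/robots.txt
# Spoofed Googlebot UA from a non-Google IP: passes .htaccess but fails
# the reverse DNS check, so the PHP redirects it as well
curl -I -A "Googlebot/2.1 (+http://www.google.com/bot.html)" http://www.re1y.com/robots.txt

Both requests should come back with a redirect to / in the Location header.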
The code below is the first line of our actual robots.txt file, calling the include above to verify the visitor's IP.
<?php include("includes/reverse-dns-inc.php"); ?>
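Everything below that line is ordinary robots.txt syntax. A minimal sketch of how the full file might read (these Disallow rules are illustrative, not the site's actual ones):

<?php include("includes/reverse-dns-inc.php"); ?>
User-agent: *
Disallow: /cgi-bin/
Disallow: /includes/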