Securing The Enterprise's robots.txt File

(Updated 2 October 2011) There is no question that large sites with global reach have significantly greater security concerns than small local sites. Whatever the metric - traffic, page requests, or visibility - big numbers are a reminder that not all visits are welcome. Commerce has forced attention onto securing the transaction process, and even those momentarily behind the curve understand that vulnerabilities WILL be exploited. If you're not paying attention to security, you should be.

One preventative strategy in place everywhere is locking down and securing the network - permitting only necessary data to be publicly available. This is routine. Except, it seems, when it comes to robots.txt.

Think about how much information you're giving away about your implementation via robots.txt. We all know who looks at those files: other than bots, it's your competitors and hackers, both trying to learn more about your operation. That alone should be reason enough to act, because robots.txt delivers exact pathways to cgi-bin, images, and the rest of your directory structure.
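
As a purely hypothetical illustration (the paths below are placeholders, not this site's), a conventional robots.txt hands any reader a map of exactly the directories worth probing:

User-agent: *
Disallow: /cgi-bin/
Disallow: /admin/
Disallow: /includes/
Disallow: /staging/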

So what are the solutions?

One simple way would be to block the file from any user agent that is not a search engine, using an .htaccess Deny,Allow rule - basically a cloak job, where only the search engines get to see the file and everyone else is denied access, as sketched below. The problem with this overly simple approach is that it ignores what the denied visitor actually sees: a blank page, a 404, or a server error on the screen is not acceptable.
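
A minimal sketch of that approach in .htaccess, assuming Apache 2.2 style access control (the user agent list mirrors the one used later in this article):

<Files "robots.txt">
# Flag requests whose User-Agent claims to be a known crawler
SetEnvIfNoCase User-Agent "(googlebot|msnbot|slurp)" allowed_bot
Order Deny,Allow
Deny from all
Allow from env=allowed_bot
</Files>

Everyone outside that list gets a bare 403 error page, which is exactly the unacceptable result described above.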

Our recommended method is to rewrite the URL to a server-side script and handle the entire process there. A PHP version runs on this site and is shown below; all requests, except those from verified search engines, get redirected to the homepage. Whatever technique you use, always confirm from Webmaster Tools that Google can still see your robots.txt file.

Try it. http://www.re1y.com/robots.txt
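
To test from the command line (assuming curl is available), both a plain request and one spoofing the Googlebot user agent should end up redirected to the homepage - the spoofed request passes the user agent check but fails the reverse DNS validation described below:

curl -I http://www.re1y.com/robots.txt
curl -I -A "Googlebot/2.1 (+http://www.google.com/bot.html)" http://www.re1y.com/robots.txt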

How re1y.com Hides Its robots.txt File
There are three pieces to this PHP implementation. The first is the conditional .htaccess rewrite, the second is the code that validates the visiting bot, and the third is the line that calls that code from the robots.txt file.
The code below resides in the .htaccess file. The first line matches any user agent that is not Googlebot, Msnbot, or Slurp; the second line redirects those requests for robots.txt to the homepage, so only the three search engines ever see the real file. The last two lines tell Apache to run .txt files through PHP so the include inside robots.txt can execute. (The code assumes RewriteEngine is already on.)
# Redirect robots.txt requests from anything that is not Googlebot, Msnbot, or Slurp
RewriteCond %{HTTP_USER_AGENT} !(googlebot|msnbot|slurp) [NC]
RewriteRule ^robots\.txt$ http://www.re1y.com/ [R,NE,L]
# Run .txt files through PHP so the include inside robots.txt executes
AddType application/x-httpd-php .php
AddHandler application/x-httpd-php .txt
The PHP code below resides in an include file called reverse-dns-inc.php and is referenced from the robots.txt file. If the visiting bot cannot be validated, the script redirects the request to the homepage.
<?php
// Only requests whose user agent claims to be msnbot, Googlebot, or Slurp
// reach this file; everything else was already redirected by .htaccess.
// 'slurp' (rather than 'Yahoo Slurp') is used so the check matches the
// actual "Yahoo! Slurp" user agent string.
$ua = $_SERVER['HTTP_USER_AGENT'];
if (stristr($ua, 'msnbot') || stristr($ua, 'googlebot') || stristr($ua, 'slurp')) {
    // Reverse DNS lookup on the visitor's IP
    $ip = $_SERVER['REMOTE_ADDR'];
    $hostname = gethostbyaddr($ip);
    if (!preg_match("/\.googlebot\.com$/", $hostname) &&
        !preg_match("/search\.live\.com$/", $hostname) &&
        !preg_match("/crawl\.yahoo\.net$/", $hostname)) {
        // Hostname is not a known crawler host - send the visitor to the homepage
        header("Location: /");
        exit;
    }
    // Forward-confirm: the hostname must resolve back to the same IP,
    // otherwise the reverse DNS record is spoofed
    $real_ip = gethostbyname($hostname);
    if ($ip != $real_ip) {
        header("Location: /");
        exit;
    }
}
?>
The line below is the first line in our actual robots.txt file, calling the include above to check the IP of the visitor.
<?php include("includes/reverse-dns-inc.php"); ?>
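Putting it together, the rest of robots.txt stays an ordinary robots file (the directives below are placeholders, not this site's actual rules). Because the include produces no output, a validated crawler sees only the plain directives:
<?php include("includes/reverse-dns-inc.php"); ?>
User-agent: *
Disallow: /cgi-bin/
Disallow: /includes/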