This should not be a startling fact: most enterprise sites are search compliant only if their robots.txt file is working correctly. When you manage a large document set with sophisticated navigation and directory structure, certain redundancies improve the visitor experience (or the collection of visitor data), and those redundancies create a need to address search compliance. It's virtually impossible to manage an authority site without a sophisticated robots.txt strategy.
Put simply, this disallow instruction set keeps the enterprise compliant by preventing the indexing of redundant (and otherwise non-compliant) areas, and a failure of form or syntax would very likely result in a Google penalty.
Some persistent robots.txt myths
- You can only disallow directories. If you believe this, you need to bone up, big time: you can also disallow individual files and (for Google) wildcard URL patterns.
- Can protect you from bad inbound links. Bad inbounds are evaluated at the source. If a link points at your page and that page does not return a 404, you get the credit, regardless of robots.txt.
- Can protect you from discovery of ownership issues. robots.txt only deals with indexing. But if you look at your visitor logs, you are very likely to see a Googlebot with a Mozilla user-agent string inside files you had disallowed. Those files are not getting indexed; you are getting looked at carefully for compliance. So you cannot hide connections to other sites exposed in scripts, IPs, email addresses, or anything else with robots.txt (or nofollow).
- Never need more than one. The fact is, you probably need at least 2. Google says, "Each port must have its own robots.txt file. In particular, if you serve content via both http and https, you'll need a separate robots.txt file for each of these protocols."
- Only used for disallows. There are many other very important uses for a more sophisticated robots.txt file on a large system. One is the 'Allow' directive, used to grant permission as an exception to a previous, more general disallow. Another is keeping the bots out of directories that play no role in ranking the site. That cuts the use of server resources, so on a large system, disallowing everything that doesn't support rank can be a very smart move from an efficiency point of view, aside from the compliance or privacy concerns. (A short sketch of both ideas follows this list.)
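To make the 'Allow' exception and the efficiency disallows concrete, here is a minimal robots.txt sketch. The directory and file names are hypothetical placeholders, not a recommendation for any particular site:

    User-agent: *
    # Keep the whole archive out of the crawl...
    Disallow: /archive/
    # ...except one page that should stay reachable (the more specific Allow wins)
    Allow: /archive/annual-report.html
    # You can disallow a single file, not just a directory
    Disallow: /print-version.html
    # Google also honors * and $ wildcards for patterns like session IDs
    Disallow: /*?sessionid=
    # Directories that play no role in ranking - keep the bots out to save server resources
    Disallow: /cgi-bin/
    Disallow: /stats/

Lines beginning with # are ignored by the crawlers; they are here only to annotate the sketch.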
Here's the quick reference on robots.txt blocking functionality:
- blocks the page from being indexed via internal links
- disallowed pages still accrue PageRank (PR)
- disallowed pages are not crawled, so they do not pass PR
- but pages may still appear in the search results if external links point to them (they will not show a cached link)
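Here is the same behavior spelled out in comments, as a quick annotated restatement of the list above; the path is a hypothetical placeholder:

    User-agent: *
    # Googlebot will not crawl this URL, so the page passes no PR onward
    # and shows no cached link, but the bare URL can still surface in the
    # results if enough external links point to it.
    Disallow: /private-reports/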
And then there's the fact that the protocol has been evolving. robots.txt can now point robots directly to your sitemap, using a 'Sitemap: url' line:
Sitemap: http://www.domain.com/sitemap.xml
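For context, here is a sketch of how the Sitemap line typically sits alongside the disallow groups; the domain and paths are placeholders, and the files are assumed to be XML sitemaps:

    User-agent: *
    Disallow: /cgi-bin/

    # The Sitemap directive is independent of the User-agent groups, can appear
    # anywhere in the file, and may be repeated for multiple sitemaps.
    Sitemap: http://www.domain.com/sitemap.xml
    Sitemap: http://www.domain.com/sitemap-news.xml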
There's a lot more on the enterprise use of robots.txt coming on this site. Tell us your war stories.