Untangled Web: Control Spider Access to Your Site

If you host a Web site, you might be surprised to learn about search engine spiders that traverse the Internet.

Search engine spiders are electronic robots that surf through sites at much higher rates than a human visitor can.

They scour the Web indexing individual pages, which is how your site gets listed. This is a good thing.

But robots can consume the costly bandwidth and processing power of your system.

Danger, Will Robinson, Danger
What's more, robotic visits can bring your server to its knees, throw your visitor stats off, and access and publish private documents.

Or the robot could belong to a competitor that regularly crawls your site.

Auction giant eBay was dismayed to find a robot crawling its site up to 100,000 times per day to check auction prices. Those figures were then matched against competitors' prices. As a result, eBay filed suit.

The eBay suit cites the "robots exclusion protocol," which states that robots must request and abide by the instructions in the robots.txt file located at the root level of a site domain.

No Trespassing
Because the standard requires robots to recognize and abide by these limitations, which carry legal precedents pertaining to trespassing, you can use this protocol to control robot access to your site by using a robots.txt file. Consider it to be an extremely important part of hosting your Web site and author it wisely.

The file instructions inform robots which areas are off-limits. These areas can be specific folders or file names, or simply the entire site. Instructions can be tailored to a particular robot or intended for all robots.

The robots.txt file is easy to create because the syntax is simple. You can place it in the site's root directory to control robot access to the entire site instead of placing directives into individual pages.

Goal Disallowed
For example, the contents of a robots.txt file might look something like the following, where User-agent identifies which spiders the instructions pertain to, and Disallow denotes which sections are off-limits. A file containing the two lines below excludes all robots from the entire site.

User-agent: *
Disallow: /
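
To see how a well-behaved robot might read these two lines, here is a minimal sketch using the RobotFileParser class from Python's standard library. The robot names ExampleBot and OtherBot, the /private/ folder, and the www.yoursite.com URLs are made up for illustration. The second file shows one way instructions could be tailored to a particular robot.

```python
from urllib.robotparser import RobotFileParser


def load_rules(robots_txt):
    """Parse robots.txt text and return a parser that answers can_fetch()."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# The two-line file from the article: every robot, entire site off-limits.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Hypothetical variant: shut out one named robot completely, and keep
# every other robot out of a single folder only.
TAILORED = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""

block_all = load_rules(BLOCK_ALL)
tailored = load_rules(TAILORED)

# Ask the parser whether a given robot may request a given URL.
print(block_all.can_fetch("OtherBot", "http://www.yoursite.com/index.html"))
print(tailored.can_fetch("ExampleBot", "http://www.yoursite.com/index.html"))
print(tailored.can_fetch("OtherBot", "http://www.yoursite.com/index.html"))
print(tailored.can_fetch("OtherBot", "http://www.yoursite.com/private/report.html"))
```

Only the third check should come back allowed: OtherBot is not named in the tailored file, so it falls under the "any robot" record, which blocks nothing but /private/.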

For those of you who tend to fly through instructions and manuals, I'll warn you not to use the splat (*) as a wildcard.

When the splat is used in the User-agent field, it indicates a special value that means "any robot." The splat has no meaning in the Disallow field.
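
The reason the splat does nothing in a Disallow line is that the value is compared against the start of the requested path, character by character. The function below is a simplified sketch of that prefix comparison, not any particular robot's implementation; is_disallowed and the sample paths are hypothetical names chosen for illustration.

```python
def is_disallowed(path, disallow_values):
    """Return True if the path begins with any non-empty Disallow value.

    This is a plain prefix test: no character in the Disallow value,
    including '*', is treated as a wildcard.
    """
    return any(value and path.startswith(value) for value in disallow_values)


# A folder prefix blocks everything underneath it.
print(is_disallowed("/cgi-bin/form.pl", ["/cgi-bin/"]))

# A shorter prefix also matches longer names that begin the same way.
print(is_disallowed("/help.html", ["/help"]))

# The splat is taken literally, so this rule does not block image files.
print(is_disallowed("/images/logo.gif", ["/*.gif"]))
```

Under this reading, a line such as "Disallow: /*.gif" blocks only paths that literally begin with "/*.gif", which is almost certainly not what the author intended.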

See the Robots Exclusion page for a complete set of syntax rules.

You can also use the robots meta tag within the head tag of an HTML document. The index and follow directives (and their negations, noindex and nofollow) specify whether a robot is allowed to index the page or follow links within it. In this example, robots can neither index nor follow links.

<meta name="robots" content="noindex,nofollow">

However, this secondary protocol is not well supported by general spiders. It's better to use the robots.txt file anyway; it gives you much better control and takes much less work.
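
If you do rely on the meta tag, it helps to confirm what a page actually declares. The sketch below uses Python's standard html.parser module to pull the directives out of a robots meta tag. The RobotsMetaParser class, the robots_directives function, and the sample page are hypothetical, written only for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Directives are comma-separated, e.g. "noindex,nofollow".
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_directives(html):
    """Return the set of robots meta directives declared in an HTML page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Sample page for illustration only.
PAGE = """<html><head>
<meta name="robots" content="noindex,nofollow">
</head><body>Private price list</body></html>"""

print(robots_directives(PAGE))
```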

No, Thanks, Mr. Robot
You have many reasons to limit robot access to your site.

For example, there is no need for spiders to crawl any CGI-BIN or staging area.

Spiders should also be excluded from any banner or GoTo.com landing pages you're using to track the effectiveness of those campaigns through page-request numbers. Hundreds or thousands of visits may quickly follow a visit by a particular search engine robot.

Because all landing pages you host -- you might have dozens of them -- might contain nearly identical text, limiting access to them will help you avoid a mirror page spam penalty at search engines where you enjoy great positioning.
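
As a rough illustration of the exclusions described above, the sketch below writes out a robots.txt file that keeps all robots away from a CGI directory, a staging area, and a folder of landing pages, then checks the result with Python's standard RobotFileParser. The folder names /cgi-bin/, /staging/, and /landing/ and the build_robots_txt helper are assumptions; substitute whatever paths your own site uses.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(disallowed_paths, user_agent="*"):
    """Return robots.txt text with one Disallow line per excluded path."""
    lines = [f"User-agent: {user_agent}"]
    lines += [f"Disallow: {path}" for path in disallowed_paths]
    return "\n".join(lines) + "\n"


# Hypothetical folders; adjust to match the layout of your own site.
EXCLUDED = ["/cgi-bin/", "/staging/", "/landing/"]

robots_txt = build_robots_txt(EXCLUDED)
print(robots_txt)

# Check the generated file the way a compliant robot would read it.
checker = RobotFileParser()
checker.parse(robots_txt.splitlines())
print(checker.can_fetch("AnyBot", "http://www.yoursite.com/landing/banner1.html"))
print(checker.can_fetch("AnyBot", "http://www.yoursite.com/index.html"))
```

Keeping the landing pages in one folder means a single Disallow line covers all of them, however many you add later.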

Look Into It
If you do not host your own Web site, be aware that many hosting companies simply disallow all robots from your entire site by default. They do this to make sure their systems run smoothly, but this practice may be detrimental to your business.

If you haven't viewed your robots.txt file, take a look at it. You can typically see your file by surfing to http://www.yoursite.com/robots.txt using your browser.

If it returns a 404 error, "File Not Found," the file might be missing.

In that case, have a talk with your hosting service about robot management.
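
If you would rather script the check than use a browser, something like the following sketch would do it with Python's standard library. The function names robots_url and fetch_robots_txt are hypothetical, and www.yoursite.com is the same placeholder address used above.

```python
from urllib.error import HTTPError
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def robots_url(site_url):
    """Return the root-level robots.txt address for any URL on a site."""
    parts = urlsplit(site_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def fetch_robots_txt(site_url, timeout=10):
    """Return the robots.txt text, or None if the server answers 404."""
    try:
        with urlopen(robots_url(site_url), timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except HTTPError as error:
        if error.code == 404:
            return None  # "File Not Found": the file is probably missing
        raise


# The file always lives at the root, whichever page you start from.
print(robots_url("http://www.yoursite.com/products/index.html"))
```

Calling fetch_robots_txt with your own address returns the file's contents, or None when the server reports that the file is not there.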

Good luck!

Copyright © 1995-2001 Pinnacle WebWorkz Inc. All rights reserved. Do not duplicate or redistribute in any form.

Last updated: Jul 24, 2001



