
Robots.txt

Spiders and Robots Exclusion

Web robots are programs that automatically traverse the Web's hypertext structure by retrieving a document and then recursively retrieving every document it references. This page explains how you can control what these robots do when visiting your site.

What is Robots.txt?

The Robots Exclusion Protocol is a method that allows web site administrators to indicate to visiting robots which parts of their site should not be visited. When a robot visits a web site, it first checks for the file robots.txt in the root directory; e.g. http://Stars.com/robots.txt. If it finds this file, it analyzes its contents to see which documents (files) it may retrieve. You can customize the robots.txt file to apply only to specific robots, and to disallow access to specific directories or files. Here is a sample robots.txt file that prevents all robots from visiting the entire site:

# Tells scanning robots where they are and are not welcome
# User-agent: can also name a specific robot; "*" applies to everyone
# Disallow:   if this matches the first part of the requested path, forget it
User-agent: *   # applies to all robots
Disallow: /     # disallow indexing of all pages

The record starts with one or more User-agent lines, specifying which robots the record applies to, followed by "Disallow" and "Allow" instructions for those robots. To evaluate whether access to a URL is allowed, a robot must attempt to match the paths in Allow and Disallow lines against the URL, in the order they occur in the record. The first match found is used. If no match is found, the default assumption is that the URL is allowed.

For example:

User-agent: webcrawler
User-agent: infoseek
Allow:    /tmp/ok.html
Disallow: /tmp

WebCrawler and InfoSeek are not allowed access to the /tmp/ directory, except to access ok.html. All other robots are allowed unrestricted access.
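
To make the matching rule concrete, here is a minimal Python sketch (not a production parser) of the evaluation logic just described: rules are tried in the order they appear in the record, the first rule whose value is a prefix of the requested path wins, and a path that matches no rule is allowed by default. The function name and the rule representation are illustrative, not part of any standard.

def is_allowed(rules, path):
    # rules: list of ("Allow" or "Disallow", prefix) tuples, in record order
    for directive, prefix in rules:
        # an empty Disallow value matches nothing, i.e. allows everything
        if prefix and path.startswith(prefix):
            return directive == "Allow"
    return True  # no match found: allowed by default

# The WebCrawler/InfoSeek record shown above:
rules = [("Allow", "/tmp/ok.html"), ("Disallow", "/tmp")]
print(is_allowed(rules, "/tmp/ok.html"))      # True
print(is_allowed(rules, "/tmp/other.html"))   # False
print(is_allowed(rules, "/index.html"))       # True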

Sometimes robots get stuck in CGI programs, trying to invoke all possible outputs. The following record keeps robots out of the cgi-bin directory, and disallows execution of a specific CGI program in the /Ads/ directory:

User-agent: *
Disallow: /cgi-bin/
Disallow: /Ads/banner.cgi

Robots.txt is a simple text file that's uploaded to the root directory of a website. Spiders request this file first, and process it before they crawl the site. The simplest robots.txt file possible is this:
 
User-agent: *
Disallow:

That's it. The first line identifies the user agent; an asterisk means that the following lines apply to all agents. The blank value after Disallow: means that no part of the site is off limits.
 
This robots.txt file doesn't do anything: all user agents are able to see everything on the site. It's worth putting a robots.txt file on every website, even if it doesn't restrict the content that spiders may access. Doing so will prevent the server from returning (and logging) a 404 Not Found error every time a spider requests robots.txt. Although a missing robots.txt file does no real harm from an SEO perspective, the 404 errors can be annoying to webmasters who are examining log files in an attempt to identify real problems.
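
As an illustration of how a well-behaved spider consumes this file, Python's standard library ships a robots.txt parser. The sketch below fetches a site's rules and checks two URLs against them; the example.com URLs are placeholders, not a real site to crawl.

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()  # fetch and parse the robots.txt file

# can_fetch(useragent, url) applies the Allow/Disallow rules for that agent
print(rp.can_fetch("*", "http://example.com/cgi-bin/search"))
print(rp.can_fetch("*", "http://example.com/index.html"))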
