When a search engine crawler first visits a site, the first thing it looks for is the robots.txt file. This is a text file located in the root directory of the site (the same place as the main index file; for the main domain/site, this is the public_html folder), and it contains special instructions for search engine crawlers.
These instructions can prohibit indexing of certain folders or pages, point the robot to the main mirror of the site, recommend that the crawler observe a certain interval between indexing requests, and much more.
If the robots.txt file is not present in the root directory of your site, you can create it.
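A quick way to confirm that the file sits in the right place is to request it directly at the root of the domain and check that its contents come back. Below is a minimal Python sketch; example.com is a placeholder for your own domain:

    from urllib.request import urlopen

    # robots.txt must be reachable directly under the domain root,
    # e.g. https://example.com/robots.txt, not inside a subfolder.
    with urlopen("https://example.com/robots.txt") as response:
        print(response.read().decode("utf-8"))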
To disallow indexing of the site through the robots.txt file, two directives are used: User-agent and Disallow.
- User-agent: specify_search_bot
- Disallow: / # prohibits indexing of the entire site
- Disallow: /page/ # prohibits indexing of the /page/ directory
Examples:
Disallow indexing of your site by MSNbot
User-agent: MSNBot
Disallow: /
Prevent your site from being indexed by the Yahoo bot
User-agent: Slurp
Disallow: /
Prevent your site from being indexed by the Yandex bot
User-agent: Yandex
Disallow: /
Prevent your site from being indexed by the Google bot
User-agent: Googlebot
Disallow: /
Disallow indexing of your site by all search engines
User-agent: *
Disallow: /
Prevent all search engines from indexing the cgi-bin and images folders
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
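Before deploying rules like these, you can check how they will be interpreted with Python's standard urllib.robotparser module. The sketch below is a minimal example that parses the rules above from a list of lines and asks whether a bot may fetch various paths; example.com is a placeholder, and individual crawlers may differ in minor details:

    from urllib.robotparser import RobotFileParser

    rules = [
        "User-agent: *",
        "Disallow: /cgi-bin/",
        "Disallow: /images/",
    ]

    parser = RobotFileParser()
    parser.parse(rules)

    # Regular pages are allowed for any bot...
    print(parser.can_fetch("Googlebot", "https://example.com/about.html"))       # True
    # ...but the blocked folders are not
    print(parser.can_fetch("Googlebot", "https://example.com/cgi-bin/test.cgi"))  # False
    print(parser.can_fetch("Googlebot", "https://example.com/images/logo.png"))   # False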
To allow all search engines to index all pages of the site (note: an empty robots.txt file is equivalent to this instruction):
User-agent: *
Disallow:
Example:
Allow only the Yandex, Google, and Rambler bots to index the site, with a delay of 4 seconds between page requests:
User-agent: *
Disallow: /
User-agent: Yandex
Crawl-delay: 4
Disallow:
User-agent: Googlebot
Crawl-delay: 4
Disallow:
User-agent: StackRambler
Crawl-delay: 4
Disallow:
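The same urllib.robotparser module can confirm that this combined file behaves as intended: the three named bots are allowed in and carry a Crawl-delay of 4, while every other bot falls through to the blanket ban. A minimal sketch, again with example.com as a placeholder:

    from urllib.robotparser import RobotFileParser

    rules = [
        "User-agent: *",
        "Disallow: /",
        "",
        "User-agent: Yandex",
        "Crawl-delay: 4",
        "Disallow:",
        "",
        "User-agent: Googlebot",
        "Crawl-delay: 4",
        "Disallow:",
        "",
        "User-agent: StackRambler",
        "Crawl-delay: 4",
        "Disallow:",
    ]

    parser = RobotFileParser()
    parser.parse(rules)

    # Named bots match their own group: allowed, with a 4-second delay
    print(parser.can_fetch("Googlebot", "https://example.com/"))  # True
    print(parser.crawl_delay("Googlebot"))                        # 4
    # Any other bot falls back to the "*" group and is blocked
    print(parser.can_fetch("MSNBot", "https://example.com/"))     # False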