The first thing a search engine crawler looks for when it visits a site is the robots.txt file. This is a text file located in the root directory of the site (the same place as the main index file; for the main domain/site this is the public_html folder), and it contains special instructions for search engine crawlers.
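For example, if the document root is public_html (example.com is a placeholder domain), a file uploaded there is served at the root of the site:

public_html/robots.txt → https://example.com/robots.txt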

These instructions can prohibit indexing of specific folders or pages on the site, point the robot to the main mirror of the site, recommend a crawl interval to the search robot, and more.
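As a minimal sketch of such a file (example.com is a placeholder; the Host directive was used by Yandex to indicate the main mirror, and Crawl-delay sets the recommended interval in seconds between requests):

User-agent: *            # applies to all bots
Disallow:                # nothing is disallowed
Crawl-delay: 10          # recommended 10-second pause between requests
Host: www.example.com    # main mirror (Yandex-specific directive)
Sitemap: https://www.example.com/sitemap.xml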

If there is no robots.txt file in the root directory of your site, you can create it yourself.
To disallow indexing of the site through the robots.txt file, two directives are used: User-agent and Disallow.

  • User-agent: specify_search_bot
  • Disallow: / # will prohibit indexing of the entire site
  • Disallow: /page/ # will prohibit indexing of the /page/ directory

Examples:

Prevent your site from being indexed by MSNBot

User-agent: MSNBot  
Disallow: /  

Prevent your site from being indexed by the Yahoo bot

User-agent: Slurp  
Disallow: /  

Prevent your site from being indexed by the Yandex bot

User-agent: Yandex  
Disallow: /  

Prevent your site from being indexed by the Google bot

User-agent: Googlebot  
Disallow: /  

Prevent your site from being indexed by all search engines

User-agent: *  
Disallow: /  

Prevent all search engines from indexing the cgi-bin and images folders

User-agent: *  
Disallow: /cgi-bin/  
Disallow: /images/  
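
In the same way, a single page can be blocked rather than a whole folder (the path /private-page.html is a hypothetical example):

User-agent: *
Disallow: /private-page.html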

Allow all pages of the site to be indexed by all search engines (note: an empty robots.txt file is equivalent to this instruction):

User-agent: *  
Disallow:  

Example:

Allow only the Yandex, Google, and Rambler bots to index the site, with a 4-second delay between page requests.

User-agent: *  
Disallow: /  

User-agent: Yandex  
Crawl-delay: 4  
Disallow:

User-agent: Googlebot  
Crawl-delay: 4  
Disallow:

User-agent: StackRambler  
Crawl-delay: 4  
Disallow:  