Using robots.txt

Guidelines for configuring site indexing with a robots.txt file

robots.txt is a text file that contains site indexing parameters for search engine robots.

How to create robots.txt

In a text editor, create a file named robots.txt and fill it out in accordance with the rules below.

Check the file in Yandex.Webmaster (menu item "Robots.txt analysis").

Upload the file to the root directory of your site.

User-agent directive

The Yandex robot supports the robots exclusion standard, with the extensions described below.

The robot works in sessions: for each session, it forms a pool of pages that it plans to download.

Each session starts with downloading the robots.txt file. If the file is missing, is not a text file, or the robot's request returns an HTTP status other than 200 OK, the robot assumes that access to the documents is not restricted.
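As a minimal sketch of the rule above, the following Python function decides whether the rules in a fetched robots.txt should be applied at all. The function name and the use of the Content-Type header to decide whether the file is "textual" are assumptions made for illustration; how the robot actually checks this is not described here.

```python
def robots_txt_is_usable(status: int, content_type: str) -> bool:
    """Return True if the rules in the fetched robots.txt should be applied.

    Hypothetical helper: treating a "text/*" Content-Type as "textual" is an
    assumption, not the robot's documented check.
    """
    if status != 200:
        # Missing file or any other non-200 status: access is not restricted.
        return False
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type.startswith("text/")
```

When the function returns False, a crawler following this rule would treat the whole site as open for download.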

In the robots.txt file, the robot looks for entries that start with User-agent: and contain either the substring Yandex (case-insensitive) or *. If a User-agent: Yandex entry is found, the directives for User-agent: * are ignored. If neither User-agent: Yandex nor User-agent: * is present, the robot assumes that its access is not restricted.

You can specify separate directives for the following Yandex robots:

'YandexBot' - the main indexing robot;
'YandexDirect' - downloads information about the content of advertising network partner sites to determine their topics for selecting relevant ads, interprets robots.txt in a special way;
'YandexDirectDyn' - the dynamic banner generation robot, interprets robots.txt in a special way;
'YandexMedia' - the robot that indexes multimedia data;
'YandexImages' - the Yandex.Images indexing robot;
'YaDirectFetcher' - the Yandex.Direct robot, interprets robots.txt in a special way;
'YandexBlogs' - the robot that indexes posts and comments;
'YandexNews' - the Yandex.News robot;
'YandexPagechecker' - the micro-markup validator;
'YandexMetrika' - the Yandex.Metrica robot;
'YandexMarket' - the Yandex.Market robot;
'YandexCalendar' - the Yandex.Calendar robot.

User-agent: YandexBot # will be used only by the main indexing robot
Disallow: /*id=

User-agent: Yandex # will be used by all Yandex robots
Disallow: /*sid= # except the main indexing robot

User-agent: * # will not be used by Yandex robots
Disallow: /cgi-bin
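The selection shown in the example above can be sketched in Python as follows. This is an illustration under stated assumptions, not the robot's actual implementation: the function name and the dictionary layout are hypothetical, and the precedence (a group for the exact robot name, then Yandex, then *) is inferred from the comments in the example. Robots that interpret robots.txt in a special way are not modeled.

```python
def select_group(groups: dict[str, list[str]], robot: str) -> list[str]:
    """Pick the directive lines that apply to the Yandex robot `robot`.

    `groups` maps a User-agent value to the directive lines of its group.
    Assumed precedence: exact robot name, then 'Yandex', then '*'.
    """
    # User-agent values are compared case-insensitively.
    by_name = {name.lower(): rules for name, rules in groups.items()}
    for candidate in (robot.lower(), "yandex", "*"):
        if candidate in by_name:
            return by_name[candidate]
    return []  # no matching group: access is not restricted


groups = {
    "YandexBot": ["Disallow: /*id="],
    "Yandex": ["Disallow: /*sid="],
    "*": ["Disallow: /cgi-bin"],
}
print(select_group(groups, "YandexBot"))     # ['Disallow: /*id=']
print(select_group(groups, "YandexImages"))  # ['Disallow: /*sid=']
```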

Disallow and Allow

To prevent robots from accessing the site or some of its sections, use the Disallow directive.

User-agent: Yandex
Disallow: / # block access to the entire site

User-agent: Yandex
Disallow: /cgi-bin # block access to pages
                   # that begin with '/cgi-bin'

The standard recommends inserting an empty line before each User-agent directive.

The # character marks the start of a comment. Everything from this character to the end of the line is ignored.

To allow a robot to access the site or some of its sections, use the Allow directive.

User-agent: Yandex
Allow: /cgi-bin
Disallow: /
# block downloading of everything except pages
# that begin with '/cgi-bin'

Using directives together

Allow and Disallow directives from the matching User-agent block are sorted by URL prefix length (from shortest to longest) and applied sequentially. If several directives match the same page, the robot uses the last one in the sorted list. Thus, the order of the directives in the robots.txt file does not affect how the robot uses them.

# Source robots.txt:
User-agent: Yandex
Allow: /catalog
Disallow: /
# Sorted robots.txt:
User-agent: Yandex
Disallow: /
Allow: /catalog
# allows downloading only pages
# that begin with '/catalog'

# Source robots.txt:
User-agent: Yandex
Allow: /
Allow: /catalog/auto
Disallow: /catalog
# Sorted robots.txt:
User-agent: Yandex
Allow: /
Disallow: /catalog
Allow: /catalog/auto
# blocks downloading of pages that begin with '/catalog',
# but allows downloading of pages that begin with '/catalog/auto'
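The sorting rule described in this section can be sketched in Python. This is a simplified illustration, not the robot's code: the function name and the (directive, prefix) tuple layout are hypothetical, only plain prefixes are handled (wildcards such as * are out of scope), and rules with an empty value are skipped. How rules of equal prefix length are ordered is not specified here, so the sketch simply keeps their order from the file.

```python
def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Sort rules by prefix length and let the last matching rule decide.

    `rules` holds (directive, prefix) pairs from one User-agent group.
    """
    # sorted() is stable, so rules of equal prefix length keep file order.
    ordered = sorted(rules, key=lambda rule: len(rule[1]))
    allowed = True  # no matching rule means access is not restricted
    for directive, prefix in ordered:
        # Empty prefixes are skipped in this sketch.
        if prefix and path.startswith(prefix):
            allowed = directive.lower() == "allow"
    return allowed


rules = [("Allow", "/"), ("Allow", "/catalog/auto"), ("Disallow", "/catalog")]
print(is_allowed(rules, "/catalog/auto/bmw"))  # True
print(is_allowed(rules, "/catalog/moto"))      # False
print(is_allowed(rules, "/about"))             # True
```

Because the rules are sorted before matching, passing them in any other order gives the same answers.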

Sitemap directive

If you use a Sitemap to describe the structure of your site, specify the path to the file as a parameter of the sitemap directive (if there are multiple files, specify all of them).

User-agent: Yandex
Allow: /
sitemap: https://example.com/site_structure/my_sitemaps1.xml
sitemap: https://example.com/site_structure/my_sitemaps2.xml

The directive is intersectional (it is not tied to any User-agent group), so the robot uses it regardless of where it appears in the robots.txt file.

The robot remembers the path to the file, processes the data, and uses the results in subsequent download sessions.
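Since the directive applies wherever it appears, collecting Sitemap paths amounts to scanning every line of the file. The Python sketch below illustrates this; the function name is hypothetical, and the parsing is deliberately simplified (for example, a # inside a URL would be treated as the start of a comment).

```python
def find_sitemaps(robots_txt: str) -> list[str]:
    """Collect Sitemap URLs from anywhere in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        # Split on the first colon only, so the URL's own colons survive.
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


sample = """User-agent: Yandex
Allow: /
sitemap: https://example.com/site_structure/my_sitemaps1.xml
sitemap: https://example.com/site_structure/my_sitemaps2.xml
"""
print(find_sitemaps(sample))
```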

Host directive

If your site has mirrors, a special mirror robot (Mozilla/5.0 (compatible; YandexBot/3.0; MirrorDetector; +http://yandex.com/bots)) will detect them and form a mirror group for your site. Only the main mirror will appear in search results. You can specify the main mirror in the robots.txt file of every mirror: its name must be the value of the Host directive.

The Host directive does not guarantee that the specified main mirror will be selected; however, the algorithm gives it high priority when making its decision.

# If https://www.glavnoye-zerkalo.ru is the main site mirror, then
# robots.txt for all sites in the mirror group looks like this:
User-Agent: *
Disallow: /forum
Disallow: /cgi-bin
Host: https://www.glavnoye-zerkalo.ru

Crawl-delay directive

If your server is heavily loaded and cannot keep up with download requests, use the Crawl-delay directive. It lets you set the minimum interval (in seconds) between the end of one page download and the start of the next.

For compatibility with robots that do not fully follow the standard when processing robots.txt, add the Crawl-delay directive to the group that starts with the User-agent entry, directly after the Disallow and Allow directives.

The Yandex search robot supports fractional Crawl-delay values, for example 0.1. This does not guarantee that the search robot will visit your site 10 times per second, but it helps speed up site crawling.

User-agent: Yandex
Crawl-delay: 2 # sets a timeout of 2 seconds

User-agent: *
Disallow: /search
Crawl-delay: 4.5 # sets a timeout of 4.5 seconds
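Because fractional values are allowed, a parser should read Crawl-delay as a floating-point number rather than an integer. The Python sketch below shows one way to do this for the lines of a single User-agent group; the function name is hypothetical, and treating a malformed value as "not set" is an assumption.

```python
from typing import Optional


def parse_crawl_delay(group_lines: list[str]) -> Optional[float]:
    """Return the Crawl-delay value in seconds from one User-agent group."""
    for line in group_lines:
        line = line.split("#", 1)[0].strip()  # drop comments
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "crawl-delay":
            try:
                return float(value)  # float() accepts "2" as well as "4.5"
            except ValueError:
                return None  # malformed value: treated as not set (assumption)
    return None


group = ["Disallow: /search", "Crawl-delay: 4.5 # sets a timeout of 4.5 seconds"]
print(parse_crawl_delay(group))  # 4.5
```

A crawler honoring the directive would then wait at least that many seconds after finishing one download before starting the next.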

Clean-param directive

If your site's page addresses contain dynamic parameters that do not affect the content (for example, session IDs, user IDs, or referrers), you can describe them using the Clean-param directive.

Using this information, the Yandex robot will not repeatedly download duplicate content. This makes crawling of your site more efficient and reduces the load on the server.
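The effect can be illustrated with a small Python sketch: once the insignificant parameters are removed, URLs that differ only in those parameters become identical, so the page needs to be downloaded only once. The function name and the sid parameter are hypothetical, and the sketch does not show the syntax of the Clean-param directive itself, which is not covered in this section.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_params(url: str, insignificant: set[str]) -> str:
    """Remove query parameters that do not affect page content."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in insignificant
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


a = strip_params("https://example.com/page?sid=111&id=7", {"sid"})
b = strip_params("https://example.com/page?sid=222&id=7", {"sid"})
print(a)       # https://example.com/page?id=7
print(a == b)  # True
```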

More information can be found on the official website.