There are no small things in SEO. Sometimes a single small file, robots.txt, can affect the promotion of an entire website. If you want search robots to crawl exactly the pages you need, you have to write recommendations for them.

"Is it possible?", - you ask.Maybe. To do this, your site must have robots file.txt.How to make a file robots, configure and add to the site - we understand in this article.

What is robots.txt and what is it for

Robots.txt is an ordinary text file that contains recommendations for search robots: which pages should be crawled and which should not.

Important: the file must be encoded in UTF-8, otherwise search robots may not accept it.

Will a site that does not have this file get into the index? It will, but the robots may "grab" pages that are undesirable in the search results: login pages, the admin panel, personal user pages, site mirrors and so on. All of this is considered "search garbage".

If personal information ends up in the search results, both you and the site may suffer. Another point: without this file, indexing of the site will take longer.

Three types of commands for search spiders can be specified in the Robots.txt file:

  • scanning is prohibited;
  • scanning is allowed;
  • scanning is partially allowed.

All this is written using directives.
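
For example, a minimal file might look like this (the section names are invented purely for illustration):

User-agent: *

Disallow: /admin/

Disallow: /catalog/

Allow: /catalog/sale/

Here /admin/ is prohibited entirely, /catalog/ is only partially allowed (robots may crawl just its /sale/ subsection), and everything else on the site remains open.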

How to create the correct Robots.txt file for a website

The Robots.txt file can be created simply in Notepad, which is available by default on any computer. Writing the file will take even a beginner half an hour at most (if you know the directives).

You can also use other text editors. There are also online services that can generate the file automatically, for example CYPR.com or Mediasova.

You just need to specify the address of your site, which search engines you need rules for, and the main mirror (with or without www). The service then does everything itself.

Personally, I prefer the old-fashioned way: writing the file manually in Notepad. There is also a "lazy" way: hand the task to your developer 🙂 But even then you should check whether everything is written correctly. So let's figure out how to compile this file and where it should live.

The finished Robots.txt file must be located in the root folder of the site - just the file itself, not inside any subfolder.

Want to check whether it is already on your site? Type site.ru/robots.txt into the address bar. If the file exists, you will see its contents.

The file consists of several blocks separated by blank lines. Each block contains recommendations for the search robots of a particular search engine (plus a block with general rules for everyone), and there is a separate block with links to the sitemap (Sitemap).

Within a block of rules for a single search robot, no blank lines are needed.

Each block begins with the User-agent directive.

Each directive is followed by a colon and a space, after which the value is given (for example, which page to close from indexing).

You need to specify relative page addresses, not absolute ones. Relative means without "www.site.ru". For example, suppose you need to disable indexing of the page www.site.ru/shop. Then after the colon we put a space, a slash and "shop":

Disallow: /shop

An asterisk (*) denotes any set of characters.

The dollar sign ($) is the end of the line.
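
For instance (both paths are invented for illustration):

Disallow: /*?print=

Disallow: /*.xls$

The first rule closes any URL that contains "?print="; the second closes only URLs that end in ".xls".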

You may wonder: why write the file from scratch if you can open it on any other site and simply copy it?

Each site needs its own unique rules. You have to take the specifics of the CMS into account. For example, on the WordPress engine the admin panel is located at /wp-admin; on another engine the address will be different. The same goes for the addresses of individual pages, the sitemap and so on.

Setting up the Robots.txt file: indexing, main mirror, directives

The User-agent directive always comes first. It indicates which search robot the rules below apply to.

User-agent: * - rules for all search robots, that is, any search engine (Google, Yandex, Bing, Rambler, etc.).

User-agent: Googlebot - Indicates the rules for the Google search spider.

User-agent: Yandex - rules for the Yandex search robot.

It makes no difference which search robot you write rules for first, but recommendations addressed to all robots are usually listed first.

Disallow: Prohibit indexing

To disable indexing of the site as a whole or individual pages, use the Disallow directive.

For example, you can completely close the site from indexing (if the resource is being finalized, and you don't want it to get into the search results in this state). To do this, write the following:

User-agent: *

Disallow: /

Thus, all search robots are prohibited from indexing content on the site.

And this is how you can open a site for indexing:

User-agent: *

Disallow:

So check whether there is a slash after the Disallow directive when you want to close the site, and if you later want to open it up again, do not forget to remove the rule (this is often forgotten).

To close individual pages from indexing, you need to specify their address. I already wrote how it's done:

User-agent: *

Disallow: /wp-admin

This hides the site's admin panel from prying eyes.

What you need to close from indexing without fail (a sample block follows this list):

  • administrative panel;
  • personal pages of users;
  • shopping carts;
  • site search results;
  • login, registration, authorization pages.
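
A rough sketch for a typical online store (all paths are hypothetical and depend on your CMS):

User-agent: *

Disallow: /admin/

Disallow: /user/

Disallow: /cart/

Disallow: /search/

Disallow: /login/

Disallow: /register/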

You can also close certain types of files from indexing. Say you have some .pdf files on the site that you do not want indexed - and search robots scan uploaded files very readily. You can close them from indexing as follows:

User-agent: *

Disallow: /*.pdf$

How to open a site for indexing

Even with a site completely closed from indexing, you can open the path to certain files or pages for robots. Let's say you're redesigning the site, but the services directory remains intact. You can direct search robots there so that they continue to index the section. For this, the Allow directive is used:

User-agent: *

Allow: /services

Disallow: /

Main website mirror

Until March 20, 2018, the robots.txt file had to specify the main site mirror for the Yandex search robot via the Host directive. This is no longer necessary - it is enough to set up page-by-page 301 redirects.

What is the main mirror? It is the address of your site that is considered primary - with or without www. If you do not set up a redirect, both versions will be indexed, which means duplicates of every page.

Sitemap: robots.txt sitemap

After all the directives for the robots are written, you must specify the path to the Sitemap. The sitemap tells robots that all the URLs that need to be indexed can be found at a certain address. For example:

Sitemap: https://site.ru/sitemap.xml

When the robot crawls the site, it will see what changes were made to this file. As a result, new pages will be indexed faster.

Clean-param Directive

In 2009, Yandex introduced a new directive, Clean-param. It can be used to describe dynamic parameters that do not affect the content of pages. This directive is most often used on forums, which generate a lot of such clutter: session IDs, sorting parameters and so on. If you add this directive, the Yandex search robot will not repeatedly download duplicated information.

You can write this directive anywhere in the robots.txt file.

Parameters that the robot does not need to take into account are listed in the first part of the value through the & sign:

Clean-param: sid&sort /forum/viewforum.php

This directive avoids duplicate pages with dynamic URLs (which contain a question mark).
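
For example, with the rule above the robot will treat addresses like these (the parameter values are invented) as one and the same page /forum/viewforum.php?f=2:

/forum/viewforum.php?f=2&sid=111

/forum/viewforum.php?f=2&sort=date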

Crawl-delay directive

This directive will come to the aid of those who have a weak server.

A visit from a search robot is an extra load on the server. If your site traffic is high, the resource may simply fail to cope and go down. As a result, the robot will receive a 5xx error. If this situation keeps repeating, the search engine may consider the site non-working.

Imagine that you are working, and in parallel you have to constantly answer calls. Your productivity then drops.

Likewise with the server.

Let's get back to the directive. Crawl-delay lets you set a delay between crawls of the site's pages in order to reduce the load on the server. In other words, you set the interval at which the site's pages will be downloaded. The parameter is specified in seconds:
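
A minimal sketch (the two-second value is arbitrary):

User-agent: *

Crawl-delay: 2

Here every robot is asked to wait at least two seconds between page downloads.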

Not all modern webmasters can work with HTML code. Many do not even know what the functions written in key CMS files should look like. Yet the internals of your resource, such as the robots.txt file, are something the owner should feel completely at home with. Fine-tuning the site lets you raise its search rankings, push it toward the top and successfully collect traffic.

The robots.txt file is one of the main elements of adapting a resource to the requirements of search engines. It contains technical information and restricts search robots' access to a number of pages. After all, by no means every page of a site should end up in the search results. Previously, FTP access was required to create a robots.txt file; the development of CMSs has made it possible to edit it directly through the control panel.

What is the robots.txt file for?

This file contains a number of recommendations addressed to search bots. It restricts their access to certain parts of the site. Due to the location of this file in the root directory, there is no way for bots to miss it. As a result, when they get to your resource, they first read the rules for processing it, and only after that they start checking.

Thus, the file indicates to search robots which site directories are allowed for indexing and which are not subject to this process.

Given that the presence of the file does not directly affect ranking, many sites simply do not have a robots.txt. But leaving full access open cannot be considered technically correct. Let's look at the benefits robots.txt gives a resource.

You can prohibit the indexing of the resource in whole or in part, limit the circle of search robots that will have the right to index. By ordering robots.txt to disable everything, you can completely isolate the resource while it is being repaired or rebuilt.

By the way, Google developers have repeatedly reminded webmasters that the robots.txt file should not exceed 500 KB; exceeding that size leads to indexing errors. If you create the file manually, reaching that size is, of course, unrealistic. But some CMSs that generate the contents of robots.txt automatically can bloat it considerably.

Easy file creation for any search engine

If you are afraid to do the fine-tuning yourself, it can be done automatically. There are constructors that assemble such files without your participation. They suit people who are just starting out as webmasters.

Setting up the constructor begins with entering the site address. Next, you choose the search engines you plan to work with. If a particular search engine does not matter to you, there is no need to create settings for it. Then move on to specifying the folders and files you want to restrict access to. In this example you can also specify the address of the sitemap and the mirror of your resource.

The generator builds the robots.txt text as you fill in the form. All that is left for you to do is copy the resulting text into a txt file. Don't forget to name it robots.

How to check the effectiveness of the robots.txt file

In order to analyze the effect of a file in Yandex, go to the corresponding page in the Yandex.Webmaster section. In the dialog box, enter the name of the site and click the "download" button.

The system will analyze the robots.txt file and check whether the search robot will bypass pages that are prohibited from indexing. If there are problems, directives can be edited and checked directly in the dialog box. True, after that you will have to copy the edited text and paste it into your robots.txt file in the root directory.

A similar service is provided by the "Webmaster Tools" service from the Google search engine.

Creating robots.txt for WordPress, Joomla and Ucoz

Various CMSs that are widely popular on the Runet offer users their own versions of robots.txt files; some do not ship such a file at all. Often these files are either too generic and do not take the specifics of the user's resource into account, or they have a number of significant drawbacks.

An experienced specialist can correct the situation manually (if you lack the knowledge, it is better not to). If you are afraid to dig into the site's insides, ask colleagues for help. With the right knowledge, such manipulations take only a couple of minutes. For example, robots.txt might look like this:
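
An approximate WordPress-style sketch (the specific rules, the domain and the sitemap address are placeholders):

User-agent: *

Disallow: /wp-admin/

Disallow: /wp-includes/

Allow: /wp-admin/admin-ajax.php

Host: yoursite.ru

Sitemap: https://yoursite.ru/sitemap.xml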

In the last two lines, as you might guess, you need to enter the data of your own resource.

Conclusion

There are a number of skills that any webmaster must master, and configuring and maintaining a website yourself is one of them. Beginner site builders can make such a mess while debugging a resource that it cannot be cleaned up later. If you do not want to lose your potential audience and your positions in the search results because of the site's structure, approach its setup thoroughly and responsibly.

First, I'll tell you what robots.txt is.

Robots.txt is a file located in the root folder of the site that holds special instructions for search robots. These instructions are needed so that, when visiting the site, the robot does not take a given page or section into account; in other words, we close the page from indexing.

Why robots.txt is needed

The robots.txt file is considered a key requirement for the SEO optimization of absolutely any site. Its absence can increase the load created by robots and slow down indexing; worse, the site may not be indexed completely. Accordingly, users will not be able to reach some pages through Yandex and Google.

How does robots.txt affect search engines?

Search engines (especially Google) will index the site even without a robots.txt file, but, as I said, not all pages. If the file is present, the robots follow the rules specified in it. Moreover, there are several types of search robots: some take a rule into account, others ignore it. In particular, the GoogleBot robot does not take the Host and Crawl-delay directives into account, the YandexNews robot has recently stopped honouring Crawl-delay, and the YandexDirect and YandexVideoParser robots ignore the generally accepted directives in robots.txt (but do honour those written specifically for them).

The biggest load on the site comes from robots that download content from it. By telling them which pages to index and which to ignore, as well as at what intervals to load content (this matters more for large sites with more than 100,000 pages in the search engine index), we make it much easier for the robots to index and download content from the site.


Files related to the CMS - for example, /wp-admin/ in WordPress - can be classed as unnecessary for search engines, as can the AJAX and JSON scripts responsible for pop-up forms, banners, captcha output and so on.

For most robots, I also recommend that you close all Javascript and CSS files from indexing. But for GoogleBot and Yandex, it is better to index such files, as they are used by search engines to analyze the convenience of the site and its ranking.
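
A hedged sketch of that approach (the /assets/ folder is hypothetical; a similar block can be repeated for Yandex):

User-agent: *

Disallow: /assets/

User-agent: Googlebot

Disallow: /assets/

Allow: /assets/*.css

Allow: /assets/*.js

Here most robots are kept out of the scripts-and-styles folder entirely, while Googlebot may still fetch the CSS and JS files it needs to evaluate the pages.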

What are robots.txt directives?



Directives are the rules for search robots. The first standard for writing robots.txt appeared in 1994, and an extended standard in 1996. However, as you already know, not all robots support every directive. Therefore, below I describe what the main robots rely on when indexing the pages of a site.

What does user-agent mean?

This is the most important directive that determines for which search robots further rules will apply.

For all robots:

User-agent: *

For a specific bot:

User-agent: Googlebot

Case is not important in robots.txt: you can write either Googlebot or googlebot.

Google crawlers

The main ones are Googlebot (web search), Googlebot-Image (images), Googlebot-News (news), Googlebot-Video (video), Mediapartners-Google (AdSense) and AdsBot-Google (landing page quality checks).

Yandex search robots

  • YandexBot - Yandex's main indexing robot;
  • YandexImages - used in the Yandex.Images service;
  • YandexVideo - used in the Yandex.Video service;
  • YandexMedia - multimedia data;
  • YandexBlogs - blog search robot;
  • YandexAddurl - a crawler accessing a page when it is added via the "Add URL" form;
  • YandexFavicons - robot that indexes site icons (favicons);
  • YandexDirect - Yandex.Direct;
  • YandexMetrika - Yandex.Metrica;
  • YandexCatalog - used in the Yandex.Catalog service;
  • YandexNews - used in the Yandex.News service;
  • YandexImageResizer - search robot of mobile services.

Search robots of Bing, Yahoo, Mail.ru and Rambler

These search engines use their own robots: Bingbot (Bing), Slurp (Yahoo), Mail.Ru (Mail.ru) and StackRambler (Rambler).

Disallow and Allow directives

Disallow closes sections and pages of your site from indexing. Accordingly, Allow, on the contrary, opens them.

There are some features.

First, the additional operators are *, $, and #. What are they used for?

"*" stands for any number of characters, including none. It is implied at the end of every rule by default, so there is no point in adding it there again.

"$" indicates the end of the URL: the character before it must come last.

“#” - comment, everything that comes after this character is ignored by the robot.

Examples of using Disallow:

Disallow: *?s=

Disallow: /category/

With these rules the search robot will close pages such as:

site.ru/?s=keyword

site.ru/category/phones/

But pages of the following form will remain open for indexing:

site.ru/blog/

site.ru/category (without the trailing slash)

Now you need to understand how nested rules are applied. What matters is which directories the directives specify: to close a particular page or document from indexing, it is enough to write the corresponding, more specific directive. Let's look at an example.

This is our robots.txt file:

User-agent: *

Disallow: /template/

Sitemap directive in robots.txt

The Sitemap directive can be placed anywhere in the file, and you can list several sitemap files, for example:

Sitemap: https://site.ru/sitemap.xml

Host directive in robots.txt

This directive is used to specify the main mirror of the site (usually with or without www). Note that the main mirror is specified without the http:// protocol, but with the https:// protocol if the site uses HTTPS. The directive is taken into account only by the Yandex and Mail.ru search robots; other robots, including GoogleBot, ignore it. Host should be specified only once in robots.txt.

Example with http://

Host: www.website.ru

Example with https://

Host: https://www.website.ru

Crawl-delay directive

Sets the time interval for the search robot between downloads of the site's pages. The value is specified in seconds; fractional values are allowed.

Example:
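
Crawl-delay: 3

Here the three-second value is purely illustrative: the robot is asked to wait three seconds between requests.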

It is mainly used on large online stores, information sites and portals with traffic from 5,000 visitors per day. It asks the search robot to make indexing requests only at a certain interval. If this directive is not specified, the robots can create a serious load on the server.

The optimal Crawl-delay value is different for each site. For the Mail, Bing and Yahoo search engines the value can be set to a minimum of 0.25-0.3, since their robots may crawl your site only once a month or once every couple of months (that is, very rarely). For Yandex it is better to set a larger value.


If the load on your site is minimal, there is no point in specifying this directive.

Clean-param Directive

The rule is interesting because it tells the crawler that pages with certain parameters do not need to be indexed. Two arguments are given: the parameters and the path prefix of the pages they apply to. This directive is supported by the Yandex search engine.

Example:

User-agent: *

Disallow: /admin/

Disallow: /plugins/

Disallow: /search/

Disallow: /cart/

Disallow: *sort=

Disallow: *view=

User-agent: GoogleBot

Disallow: /admin/

Disallow: /plugins/

Disallow: /search/

Disallow: /cart/

Disallow: *sort=

Disallow: *view=

Allow: /plugins/*.css

Allow: /plugins/*.js

Allow: /plugins/*.png

Allow: /plugins/*.jpg

Allow: /plugins/*.gif

User-agent: Yandex

Disallow: /admin/

Disallow: /plugins/

Disallow: /search/

Disallow: /cart/

Disallow: *sort=

Disallow: *view=

Allow: /plugins/*.css

Allow: /plugins/*.js

Allow: /plugins/*.png

Allow: /plugins/*.jpg

Allow: /plugins/*.gif

Clean-Param: utm_source&utm_medium&utm_campaign

In the example, we have written rules for 3 different bots.

Where to add robots.txt?

The file is added to the root folder of the site, so that it can be opened at a link like site.ru/robots.txt.

How to check robots.txt?

Yandex Webmaster

On the Tools tab, select Analyze robots.txt and then click Check

Google Search Console

On the Crawl tab, choose the robots.txt testing tool and then click Check.

Conclusion:

A robots.txt file must be present on every site you promote, and only its correct configuration will get you the indexing you need.

And finally, if you have any questions, ask them in the comments under the article. I am also curious: how do you write your robots.txt?

Robots.txt is a text file that contains special instructions for the search engine robots that explore your site on the Internet. Such instructions, called directives, may prohibit the indexing of certain pages of the site, indicate the correct "mirror" of the domain, and so on.

For sites running on the Nubex platform, a file with directives is created automatically and is located at domen.ru/robots.txt, where domen.ru is the domain name of the site.

You can change robots.txt and add extra directives for search engines in the site admin panel. To do this, select the "Settings" section in the control panel, and within it the SEO item.

Find the field "The text of the robots.txt file" and write the necessary directives in it. It is advisable to tick the checkbox "Add a link to the automatically generated sitemap.xml file in robots.txt": that way the search bot will be able to load the sitemap and find all the pages it needs to index.

Basic directives for the robots txt file

When loading robots.txt, the crawler first looks for an entry starting with User-agent. The value of this field must be the name of the robot whose access rights are set in that entry. In other words, the User-agent directive is a kind of address to the robot.

1. If the value of the User-agent field contains the symbol " * ”, then the access rights specified in this entry apply to any search robots that request the /robots.txt file.

2. If more than one robot name is specified in the entry, then the access rights are extended to all the specified names.

3. Uppercase or lowercase characters do not matter.

4. If the string User-agent: BotName is found, the directives under User-agent: * are not taken into account (this matters when you make multiple entries for different robots). That is, the robot first scans the text for a User-agent: MyName entry and, if it finds one, follows those instructions; if not, it acts according to the User-agent: * entry (for all bots) - see the sketch after this list.

By the way, it is recommended to insert a blank line before each new User-agent directive.

5. If the lines User-agent: BotName and User-agent: * are absent, it is considered that access to the robot is not limited.
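
A short sketch of this precedence (the paths are invented):

User-agent: *

Disallow: /private/

User-agent: Googlebot

Disallow: /drafts/

Here Googlebot follows only its own block (so /private/ stays open to it), while all other robots follow the first block.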

Prohibition and permission of site indexing: directives Disallow and Allow

To deny or allow search bots access to certain pages of the site, the Disallow and Allow directives are used, respectively.

The value of these directives specifies the full or partial path to the section:

  • Disallow: /admin/ - prohibits indexing of all pages inside the admin section;
  • Disallow: /help - prohibits indexing of both /help.html and /help/index.html;
  • Disallow: /help/ - closes /help/index.html and everything else under /help/, but leaves /help.html open;
  • Disallow: / - blocks access to the entire site.

If the Disallow value is not specified, then access is not restricted:

  • Disallow: - indexing of all pages of the site is allowed.

You can use the Allow directive to set up exceptions. For example, the following entry prevents robots from indexing all sections of the site except those whose path begins with /search:

User-agent: *

Allow: /search

Disallow: /

It does not matter in what order the denying and allowing directives are listed. When reading them, the robot sorts the rules by the length of the URL prefix (from shortest to longest) and applies them sequentially. So, to the bot, the example above effectively looks like this:

Disallow: /

Allow: /search

Only pages whose paths start with /search are allowed to be indexed. Thus, the order of the directives does not affect the result.

Host directive: how to specify the main site domain

If several domain names are linked to your site (technical addresses, mirrors, etc.), the search engine may decide that they are all different sites - with identical content. The solution? A ban. And only the search engine knows which of the domains will be "punished" - the main one or a technical one.

To avoid this trouble, you need to tell the search robot which of the addresses your site participates in search under. That address will be designated as the main one, and the rest will form the group of your site's mirrors.

You can do this with the Host directive. It must be added to the entry that starts with User-agent, immediately after the Disallow and Allow directives. In the value of the Host directive you specify the main domain (and, if necessary, the port number; 80 is the default). For example:

Host: test-o-la-la.ru

Such an entry means that the site will appear in search results under the test-o-la-la.ru domain rather than under www.test-o-la-la.ru or the technical s10364.. address.

In the Nubex constructor, the Host directive is added to the text of the robots.txt file automatically when you specify in the admin panel which domain is the main one.

The host directive can only be used once in robots.txt. If you write it several times, the robot will only accept the first entry in order.

Crawl-delay directive: how to set page loading interval

To indicate to the robot the minimum interval between finishing the download of one page and starting the next, use the Crawl-delay directive. It must be added to the entry starting with User-agent, immediately after the Disallow and Allow directives. In the value of the directive, specify the time in seconds.
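
A brief sketch (the path and the 4.5-second value are arbitrary):

User-agent: *

Disallow: /search/

Crawl-delay: 4.5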

Using this delay when processing pages will be convenient for overloaded servers.

There are other directives for crawlers as well, but the five described above - User-agent, Disallow, Allow, Host and Crawl-delay - are usually enough to compose the text of a robots.txt file.
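
Putting the five together, a minimal file might look roughly like this (all the values are placeholders):

User-agent: *

Disallow: /admin/

Allow: /admin/images/

Crawl-delay: 2

Host: mysite.ru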

Search robots (crawlers) begin their acquaintance with a site by reading the robots.txt file. It contains all the information that matters to them. Site owners should create robots.txt and review it periodically: the speed of page indexing and the site's place in the search results depend on how correctly it works.

Robots.txt is not a mandatory element of the site, but its presence is desirable, because site owners use it to manage search robots: set different levels of access to the site, ban indexing of the entire site or of individual pages, sections and files. For high-traffic resources you can limit crawl time and bar robots that do not belong to the main search engines; this reduces the load on the server.

Creation. Create the file in a text editor such as Notepad. Make sure the file size does not exceed 32 KB and choose ASCII or UTF-8 encoding. Note that a site can have only one such file. If the site is built on a CMS, it is usually generated automatically.

Place the created file in the site's root directory, next to the main index.html file; use FTP access to do this. If the site runs on a CMS, the file is managed through the administrative panel. When the file has been created and works correctly, it is available in the browser.

In the absence of robots.txt, search robots collect all information related to the site. Do not be surprised when you see blank pages or service information in the search results. Determine which sections of the site will be available to users, and close the rest from indexing.

Verification. Periodically check that everything works correctly. If the crawler does not receive a 200 OK response when requesting robots.txt, it assumes the file does not exist and the site is fully open for indexing. Other response codes behave as follows:

    3xx - redirect responses. The robot is sent to another page or to the main page. Chains of up to five redirects per page are allowed; if there are more, the robot marks the page as a 404 error. The same applies to redirects that form an endless loop;

    4xx - client error responses. If the crawler receives a 400-level error for the robots.txt file, it concludes that the file does not exist and that all content is available. This also applies to the 401 and 403 errors;

    5xx - server error responses. The crawler will keep "knocking" until it receives a response other than a 5xx error.

Creation rules

We start with a greeting. Each file must begin with the User-agent greeting; search engines use it to determine which robots the rules are addressed to.

  • User-agent: * - available to all robots;
  • User-agent: Yandex - available to the Yandex robot;
  • User-agent: Googlebot - available to Googlebot;
  • User-agent: Mail.ru - available to the Mail.ru robot.

Add separate directives for particular robots. If necessary, add directives for Yandex's specialized search bots (YandexImages, YandexNews and so on). Keep in mind that in this case those bots will ignore the general * and Yandex directives.

Google likewise has its own specialized bots (Googlebot-Image, Googlebot-News and others).

First we ban, then we allow. Operate with two directives: Allow (permit) and Disallow (forbid). Be sure to include the Disallow directive even if access is allowed to the entire site: this directive is mandatory, and if it is missing the crawler may not read the rest of the rules correctly. If the site has no restricted content, leave the directive empty.
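
For example, if nothing on the site needs to be closed:

User-agent: *

Disallow: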

Work with different levels. In the file you can specify settings at four levels: site, page, folder and content type. Say you want to hide images from indexing. This can be done at the level of:

  • folders - Disallow: /images/
  • content type - Disallow: /*.jpg
Group directives into blocks and separate them with an empty line. Do not write all the rules on one line; use a separate line for each page, crawler, folder and so on. Also, do not mix up the instructions: the bot name goes in User-agent, not in the Allow/Disallow directive.

Incorrect:

Disallow: Yandex
Disallow: /css/ /images/

Correct:

User-agent: Yandex
Disallow: /
Disallow: /css/
Disallow: /images/


Mind the case. Write the file name in lowercase letters. Yandex states in its documentation that case does not matter to its bots, but Google asks you to respect case. Names of files and folders may also be case-sensitive.

Specify a 301 redirect to the main site mirror. The Host directive used to be used for this, but as of March 2018 it is no longer needed. If it is already in the robots.txt file, remove it or leave it at your discretion; robots ignore this directive.

To specify the main mirror, put a 301 redirect on each page of the site. If there is no redirect, the search engine will independently determine which mirror is considered the main one. To fix the site mirror, simply enter a 301 page redirect and wait a few days.

Write the Sitemap directive. The sitemap.xml and robots.txt files complement each other. Check that:

  • the files do not contradict each other;
  • pages excluded from indexing are excluded in both files;
  • pages allowed for indexing are allowed in both files.
When analyzing the contents of robots.txt, check whether the sitemap is included in the directive of the same name. It is written like this: Sitemap: https://www.yoursite.ru/sitemap.xml

Specify comments with the # symbol. Anything written after it is ignored by the crawler.
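
For instance (the path is illustrative):

# service section, closed from indexing
Disallow: /tmp/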

File verification

Analyze robots.txt using developer tools: Yandex.Webmaster and Google Robots Testing Tool. Please note that Yandex and Google only check if the file meets their own requirements. If the file is correct for Yandex, this does not mean that it will be correct for Google robots, therefore check in both systems.

If you find errors and fix robots.txt, crawlers don't read the changes instantly. Typically, page re-crawling occurs once a day, but often takes much longer. Check the file after a week to make sure search engines are using the new version.

Checking in Yandex.Webmaster

First verify the rights to the site. After that, it will appear in the Webmaster panel. Enter the name of the site in the field and click check. The result of the check will be available below.

Additionally, check individual pages. To do this, enter the page addresses and click "check".

Testing in Google Robots Testing Tool

It allows you to check and edit the file in the administrative panel and reports logical and syntax errors. You can correct the text of the file directly in the Google editor, but note that changes are not saved automatically. After fixing robots.txt, copy the code from the web editor, create a new file in Notepad or another text editor, and upload it to the server's root directory.

Remember

    The robots.txt file helps search robots index the site. Close the site while it is under development; the rest of the time, the entire site or part of it should be open. A correctly working file should return a 200 response.

    The file is created in a regular text editor. In many CMS, the administrative panel provides for the creation of a file. Make sure that the size does not exceed 32 KB. Place it in the root directory of the site.

    Fill out the file according to the rules. Start with the code "User-agent:". Write the rules in blocks, separate them with an empty line. Follow the accepted syntax.

    Allow or deny indexing for all crawlers or selected ones. To do this, specify the name of the search robot or put the * icon, which means "for everyone".

    Work with different access levels: site, page, folder or file type.

    Point robots to the main mirror with page-by-page 301 redirects and to the sitemap with the Sitemap directive.

    Use developer tools to parse robots.txt. These are Yandex.Webmaster and Google Robots Testing Tools. First confirm the rights to the site, then check. In Google, immediately edit the file in a web editor and remove the errors. Edited files are not saved automatically. Upload them to the server instead of the original robots.txt. After a week, check if the search engines are using the new version.

The material was prepared by Svetlana Sirvida-Llorente.