Fighting duplicate pages

The owner may not even suspect that some pages on his site have copies, and yet this happens all the time. The pages open, their content is fine, but if you pay attention to the URL, you will notice that the addresses differ while the content is the same. What does this mean? For live users, absolutely nothing, since they care about the information on the pages, but soulless search engines perceive this phenomenon quite differently: for them, these are completely different pages with the same content.

Are duplicate pages harmful?

So, while an ordinary user may not even notice the presence of duplicates on your site, search engines will detect them immediately. What reaction should you expect? Since search robots see the copies as different pages, the content on them stops being unique, and this already has a negative effect on ranking.

The presence of duplicates also dilutes the link juice that the optimizer was trying to concentrate on the landing page. Because of duplicates, that weight may end up on a page other than the one it was meant for, so the effect of internal linking and external links can drop many times over.

In the vast majority of cases, the CMS is to blame for duplicates: because of incorrect settings and a lack of attention from the optimizer, exact copies are generated. Many CMS, Joomla among them, are guilty of this. It is hard to find a universal recipe for the problem, but you can try one of the plugins for removing copies.

Fuzzy duplicates, whose content is not completely identical, usually appear through the fault of the webmaster. Such pages are common on online store sites, where product card pages differ only by a few sentences of description, while all the other content, consisting of cross-cutting blocks and other elements, is the same.

Many experts argue that a small number of duplicates will not harm a site, but once they exceed 40-50% of its pages, the resource can run into serious difficulties during promotion. In any case, even if there are not many copies, it is worth eliminating them; that way you are guaranteed to avoid problems with duplicates.

Search for copy pages

There are several ways to find duplicate pages, but the first thing to do is to check how several search engines see your site by comparing the number of pages in each one's index. This is easy to do without any additional tools: in Yandex or Google, just enter site:yoursite.ru in the search bar (Yandex also understands the host:yoursite.ru operator) and look at the number of results.

If after such a simple check the numbers differ greatly, by 10-20 times, this may with some probability indicate that one of the indexes contains duplicates. The copies may not be to blame for such a difference, but it does warrant a more thorough search. If the site is small, you can manually count the number of real pages and compare it with the figures from the search engines.

You can also look for duplicates by examining the URLs in the search results. If the site is supposed to use human-readable URLs, then pages whose URLs consist of obscure characters, such as "index.php?s=0f6b2903d", will immediately stand out from the general list.

Another way to detect duplicates by means of search engines is searching by text fragments. The procedure is simple: enter a text fragment of 10-15 words from each page into the search bar, then analyze the result. If the search results contain two or more pages, there are copies; if there is only one result, the page has no duplicates and there is nothing to worry about.

It is clear that if a site consists of a large number of pages, such a check can turn into an impossible routine for the optimizer. To minimize the time spent, you can use special programs. One such tool, probably familiar to experienced specialists, is Xenu's Link Sleuth.

To check a site, open a new project by selecting "File" > "Check URL" from the menu, enter the address and click "OK". After that, the program will start processing all of the site's URLs. When the check is complete, export the received data to any convenient editor and start looking for duplicates.

In addition to the methods above, the Yandex.Webmaster and Google Webmaster Tools panels have tools for checking page indexing that can also be used to search for duplicates.

Problem Solving Methods

After all the duplicates are found, they need to be eliminated. This, too, can be done in several ways, and each specific case needs its own method; you may even have to use all of them.

Copy pages can be deleted manually, but this method is only really suitable for duplicates that were created manually through the webmaster's carelessness.

A 301 redirect is great for gluing together copy pages whose URLs differ only in the presence or absence of www.
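For example, a minimal sketch of such a redirect in an Apache .htaccess file, assuming mod_rewrite is available and the non-www version has been chosen as the main one (yoursite.ru is a placeholder):

# Glue the www version of the domain to the non-www version with a 301 redirect
RewriteEngine On
RewriteCond %{HTTP_HOST} ^www\.yoursite\.ru$ [NC]
RewriteRule ^(.*)$ http://yoursite.ru/$1 [R=301,L]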

The problem with duplicates can also be solved with the canonical tag, which works well for fuzzy copies: for example, product category pages in an online store that have duplicates differing only in sorting by various parameters. Canonical is also suitable for printable versions of pages and other similar cases. It is applied quite simply: the rel="canonical" link element is added to all the copies, but not to the main page, which is the most relevant one. The code, placed within the head tag, should look something like this: <link rel="canonical" href="http://yoursite.ru/stranica-kopiya"/>, where the href contains the address of the page chosen as the main one.

Setting up the robots.txt file can also help in the fight against duplicates. The Disallow directive lets you block search robots' access to the duplicates. You can read more about the syntax of this file in issue #64 of our mailing list.
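For instance, a few hedged robots.txt lines of the kind often used against duplicates (the paths and parameters here are hypothetical and depend on the CMS):

User-agent: *
# close printable versions and sorting variants of pages
Disallow: /*?print=
Disallow: /*?sort=
Disallow: /print/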

Conclusions

While users perceive duplicates as one page with different addresses, for spiders these are different pages with duplicate content. Copy pages are one of the most common pitfalls that beginners fail to get around. Their presence in large numbers on a site being promoted is unacceptable, as they create serious obstacles to reaching the top of the search results.

The reason for writing this article was yet another panicked call from an accountant just before submitting VAT returns. Last quarter I spent a lot of time cleaning up duplicate counterparties, and now the same ones, plus new ones, have appeared again. Where from?

I decided to take the time to deal with the cause rather than the effect. The situation is mainly relevant where automatic uploads are configured via exchange plans from the management program (in my case, Trade Management 10.3, UT) to Enterprise Accounting (in my case, version 2.0, BP).

A few years ago these configurations were installed and automatic exchange between them was set up. Then we ran into a peculiarity in how the sales department maintained the counterparty directory: for one reason or another they began creating duplicate counterparties (with the same TIN / KPP / name), scattering the same counterparty across different groups. The accounting department voiced its disapproval, and the decision was made: whatever they have over there, merge the cards into one when loading. I had to intervene in the object transfer in the exchange rules: we removed the search by internal identifier for counterparties and left the search by TIN + KPP + name. However, even here pitfalls surfaced, in the form of people who liked renaming counterparties (as a result, the rules themselves created duplicates in the BP). We all got together, discussed it, agreed that duplicates are unacceptable in the UT, removed them, and returned to the standard rules.

But after "combing" the duplicates in the UT and in the BP, the internal identifiers of many counterparties were different. And since the standard exchange rules search for objects exclusively by the internal identifier, a new counterpart of the counterparty arrived with the next portion of documents in the BP (if these identifiers differed). But the universal XML data exchange would not be universal if this problem could not be bypassed. Because It is impossible to change the identifier of an existing object by standard means, then you can get around this situation using a special information register "Correspondence of objects for exchange", which is available in all standard configurations from 1C.

To avoid new duplicates, the duplicate cleanup algorithm became as follows:

1. In the BP, using the "Search and replacement of duplicate elements" processing (it is standard and can be taken from the Trade Management configuration or the ITS disk, or you can choose the most suitable of the many variations on Infostart itself), I find a duplicate, determine the correct element, and click replace.

2. I get the internal identifier of the single (after replacement) object of our duplicate (I threw together a simple processing specifically for this, so that the internal identifier is automatically copied to the clipboard).

3. I open the "Correspondence of objects for exchange" register in the UT and set a filter by the reference to this counterparty.

Duplicate site pages, their impact on search engine optimization, and manual and automated ways of detecting and eliminating them.

The influence of duplicates on website promotion

The presence of duplicates negatively affects a site's ranking. As mentioned above, search engines see the original page and its duplicate as two separate pages, so content duplicated on another page stops being unique. In addition, the duplicated page loses link weight, since a link may pass weight not to the target page but to its duplicate. This applies to both internal linking and external links.

According to some webmasters, a small number of duplicate pages will not cause serious harm to a site, but if their number approaches 40-50% of the total site volume, serious difficulties in promotion are inevitable.

Reasons for duplicates

Most often, duplicates appear as a result of incorrect settings in individual CMS: the engine's internal scripts start working incorrectly and generate copies of the site's pages.

Fuzzy duplicates are also a known phenomenon: pages whose content is only partially identical. Such duplicates most often occur through the fault of the webmaster himself. This is typical for online stores, where product card pages are built from the same template and ultimately differ from each other by only a few lines of text.

Methods for Finding Duplicate Pages

There are several ways to detect duplicate pages. You can turn to the search engines: in Google or Yandex, enter a command like "site:sitename.ru" in the search bar, where sitename.ru is your site's domain. The search engine will return all the indexed pages of the site, and your task is to spot the duplicates.

There is another equally simple way: searching by text fragments. Add a small piece of text from your site, 10-15 words, to the search bar. If the results for that text contain two or more pages of your site, detecting the duplicates will not be difficult.

However, these methods are only suitable for sites with a small number of pages. If a site has several hundred or even thousands of pages, manually searching for duplicates and optimizing the site as a whole becomes an impossible task. There are special programs for such purposes; one of the most common is Xenu's Link Sleuth.

In addition, there are special tools for checking indexing status in the Google Webmaster Tools and Yandex.Webmaster panels. They can also be used to detect duplicates.

Methods for Eliminating Duplicate Pages

Unwanted pages can also be eliminated in several ways. A different method suits each specific case, but most often, when optimizing a site, they are used in combination:

  • deleting duplicates manually - suitable if all unnecessary ones were also detected manually;
  • gluing pages using a 301 redirect - suitable if duplicates differ only in the absence and presence of "www" in the URL;
  • the use of the “canonical” tag is suitable in case of fuzzy duplicates (for example, the above-mentioned situation with product cards in an online store) and is implemented by entering a code like “link rel="canonical" href="http://sitename.ru/ stranica-kopiya"/" within the head block of duplicate pages;
  • correct setting of the robots.txt file - using the "Disallow" directive, you can prohibit duplicate pages for indexing by search engines.

Conclusion

The appearance of duplicate pages can become a serious obstacle to optimizing a site and bringing it to the top positions, so this problem must be addressed at an early stage.

Duplicate pages on websites and blogs: where they come from and what problems they can create.
That is what we will talk about in this post; we will try to make sense of this phenomenon and find ways to minimize the potential trouble that duplicate pages can bring us.

So let's continue.

What are duplicate pages?

A duplicate page on any web resource means that the same information is accessible at different addresses. Such pages are also called internal duplicates of the site.

If the texts on the pages are completely identical, such duplicates are called complete or exact. If the content matches only partially, the duplicates are called incomplete or fuzzy.

Incomplete duplicates are category pages, product list pages, and similar pages containing announcements of the site's materials.

Complete page duplicates are print versions, versions of pages with different extensions, archive pages, site search pages, pages with comments, and so on.

Sources of duplicate pages.

At the moment, most duplicate pages are generated by modern CMS (content management systems), also called site engines.

This applies to WordPress, Joomla, DLE, and other popular CMS. The phenomenon seriously strains site optimizers and webmasters and gives them extra trouble.

In online stores, duplicates may appear when goods are displayed sorted by various attributes (manufacturer, purpose, production date, price, and so on).

We must also remember the notorious WWW prefix and decide whether to use it in the domain name when creating, developing, and promoting the site.

As you can see, the sources of duplicates vary; I have listed only the main ones, but they are all well known to specialists.

The negative impact of duplicate pages.

Although many people do not pay much attention to the appearance of duplicates, this phenomenon can create serious website promotion problems.

A search engine may treat duplicates as spam and, as a result, seriously lower the positions of both these pages and the site as a whole.

When promoting a site with links, the following situation may arise: at some point the search engine decides that the most relevant page is a duplicate rather than the one you are promoting with links, and all your efforts and expenses will be in vain.

That said, some people deliberately try to use duplicates to funnel weight to the pages they need: the main page, for example, or any other.

Methods for dealing with duplicate pages

How can you avoid duplicates, or how can you neutralize the negative effects when they appear?
And is it worth fighting them at all, or should everything be left to the search engines - let them figure it out themselves, since they are so smart?

Using robots.txt

Robots.txt is a file located in the root directory of the site that contains directives for search robots.

In these directives, we specify which pages of the site should be indexed and which should not. We can also specify the site's main domain and the file containing the sitemap.

The Disallow directive is used to disable page indexing. It is what webmasters use to close duplicate pages from indexing, and not only duplicates but any other information not directly related to the content of the pages. For example:

Disallow: /search/ - close the site search pages
Disallow: /*? - close the pages containing the question mark “?”
Disallow: /20* - close archive pages

Using the .htaccess file

.htaccess is a file (without an extension) that is also located in the root directory of the site. To combat duplicates, 301 redirects are configured in this file.
This method helps preserve the site's indicators when changing the site's CMS or its structure. The result is a correct redirect without loss of link mass: the weight of the page at the old address is transferred to the page at the new address.
301 redirects are also used when determining the main domain of a site - with WWW or without WWW.
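As a hedged illustration of the first case (a page that moved during a restructuring), a permanent redirect in .htaccess on Apache might look like this (both paths and the domain are hypothetical):

# 301 redirect of a page that moved to a new address
Redirect 301 /old-page.html http://sitename.ru/new-page.html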

Using the rel="canonical" tag

Using this tag, the webmaster points the search engine to the source, that is, the page that should be indexed and take part in search engine ranking. Such a page is called canonical. The HTML entry, placed in the head section of each duplicate page, looks roughly like this:
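(the address below is a placeholder for the URL of the page chosen as canonical)

<link rel="canonical" href="http://sitename.ru/kanonicheskaja-stranica"/> <!-- placed on the duplicate pages, pointing to the canonical one -->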

When using WordPress, this can be done in the settings of a useful plugin such as All in One SEO Pack.

Additional anti-duplicate measures for CMS WordPress

Having applied all of the above methods of dealing with duplicate pages on my blog, I still had the feeling that I had not done everything I could. So, after digging around the Internet and consulting with professionals, I decided to do something more. I will describe it now.

I decided to eliminate the duplicates that are created on the blog when anchors are used; I talked about them in the article "HTML Anchors". On WordPress blogs, anchors are generated when the "#more" tag is applied and when comments are used. The usefulness of these anchors is debatable, but they clearly produce duplicates.
Here is how I fixed this problem.

Let's tackle the #more tag first.

I found the file where it is formed - or rather, I was told which one.
This is ../wp-includes/post-template.php
Then I found the program snippet:

ID)\" class= \"more-link\">$more_link_text", $more_link_text);

The following part has been removed from it:

#more-{$post->ID}\" class=

And I ended up with a line like this.

$output .= apply_filters('the_content_more_link', ' <a href="' . get_permalink() . "\"more-link\">$more_link_text</a>", $more_link_text);

Remove comment anchors #comment

Now let's move on to the comments. This part I figured out myself.
The file in question is ../wp-includes/comment-template.php
I found the right piece of code:

return apply_filters('get_comment_link', $link . '#comment-' . $comment->comment_ID, $comment, $args);
}

Similarly, the following fragment was removed - very carefully, down to the last character.

. '#comment-' . $comment->comment_ID

We end up with the following line of code.

return apply_filters('get_comment_link', $link, $comment, $args);
}

Naturally, I did all this only after first copying the files in question to my computer, so that in case of failure it would be easy to restore them to their original state.

As a result of these changes, when I click on the "Read the rest of the entry..." link, I get a page with the canonical address, without the "#more-..." tail appended to it. Likewise, when I click on comments, I get a normal canonical address without the "#comment-..." suffix.

Thus, the number of duplicate pages on the site has decreased somewhat. But I cannot say what else WordPress will generate. We will keep an eye on the issue.

And in conclusion, I recommend watching a very good and informative video on this topic.

Good health and success to all. See you next time.
