Duplicate category or product pages in an online store can have devastating effects on the store’s organic search rankings as well as its sales. Issues ranging from the wrong pages ranking for your target search phrases, at the light end of the scale, to triggering an algorithmic Panda penalty at the severe end are reasons to weed out potential sources of duplicate content early, and to keep doing so as your store scales. A duplicate content clean-up is a vital SEO task that eCommerce managers should pencil into their calendars.
From experience, I tend to find that the more complex the information architecture, or the larger the store, the more likely duplicate content issues are to crop up. So my rule is to keep things simple! Easier said than done, especially when the web development team is out of sync with the technical SEO team or there is a long chain of approvals required to make necessary changes.
To kick things off, let’s begin with the tools you have in your arsenal to tackle duplicate content issues:
<meta name="robots" content="noindex,follow">
The meta robots directive is the most effective means of controlling which pages do or do not get indexed by search engines. Its drawback is scalability, as it has to be applied on a page-by-page basis. That said, if you use a platform such as Magento Commerce, make sure that any extension you install that scales or automatically adds pages to your store, such as category filters, has a meta robots feature. This ensures a batch roll-out of the meta robots tag across the specific set of pages the extension generates – which gives you control.
Let’s break down the code snippet:
<meta name="robots" content=" <value> ">
In name=”robots”, “robots” signifies all search engine bots, but you can be more specific and replace “robots” with a user agent of your choice (for example, googlebot or bingbot).
The content="<value>"
The “<value>” field allows multiple of the following values to be declared with comma separation:
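As a rough sketch of how a batch roll-out might assemble these tags (the helper function here is illustrative, not part of any platform’s API; the directive values shown are standard ones documented by the major search engines):

```python
def robots_meta(values, user_agent="robots"):
    """Build a robots meta tag from a list of directive values.

    `user_agent` defaults to "robots" (all bots) but can target a
    specific crawler such as "googlebot" or "bingbot".
    """
    return '<meta name="{}" content="{}">'.format(user_agent, ",".join(values))

# Tag for all bots: index nothing, but follow the links on the page.
print(robots_meta(["noindex", "follow"]))
# Tag aimed only at Googlebot.
print(robots_meta(["noindex", "nofollow"], user_agent="googlebot"))
```

A filter extension could call a helper like this once per generated page, which is the batch roll-out behaviour described above.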
The X-Robots-Tag performs the same function as the meta robots tag; the only difference is that it is served as part of the HTTP header response for a given URL.
So an HTTP header can look like this:
HTTP/1.1 200 OK
Date: Tue, 22 Jul 2014 04:45:00 GMT
(…)
X-Robots-Tag: googlebot: noindex, nofollow
X-Robots-Tag: bingbot: nofollow
(…)
OR
HTTP/1.1 200 OK
Date: Tue, 22 Jul 2014 04:45:00 GMT
(…)
X-Robots-Tag: noindex, follow
(…)
Directives specified without a user-agent are valid for all crawlers. Read more in Google’s documentation about the robots meta tag and this blog post.
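To make the per-bot behaviour concrete, here is a small illustrative parser (my own sketch, not a library function) that collects the X-Robots-Tag directives applying to a given crawler, following the rule that un-prefixed directives apply to everyone:

```python
def parse_x_robots(header_values, bot="googlebot"):
    """Collect X-Robots-Tag directives that apply to `bot`.

    Directives prefixed with a user agent apply only to that bot;
    directives without a prefix apply to all crawlers.
    (Simplified: ignores directives like unavailable_after that
    legitimately contain a colon.)
    """
    directives = []
    for value in header_values:
        parts = value.split(":", 1)
        if len(parts) == 2 and parts[0].strip().lower() != bot:
            continue  # aimed at a different crawler
        body = parts[1] if len(parts) == 2 else parts[0]
        directives += [d.strip() for d in body.split(",")]
    return directives

# Using the first header example above, from Googlebot's point of view:
print(parse_x_robots(["googlebot: noindex, nofollow", "bingbot: nofollow"]))
```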
The next weapon in our arsenal for controlling duplicate content is the robots.txt ‘Disallow’ directive. I would not recommend this option unless you have a folder for special offers or sales in a directory you do not want indexed, such as /sales, /specials or /offers, or your architecture follows a highly specific naming pattern.
It is, however, a more potent tool than the meta robots tag, and with power comes great responsibility. Here’s why: a block from the robots.txt ‘Disallow’ directive completely prevents bots from crawling, so indexation is not even an option. Whereas the meta robots directive assumes the page has been crawled and then provides instructions such as follow or noindex, the robots.txt Disallow directive prevents crawling in the first place.
Here is how next.co.uk use the DISALLOW tool to control crawls and indexation: http://www.next.co.uk/robots.txt
Disallow: /shop/*-colour-*-colour-*
Disallow: /shop/colour-*-colour-*
Disallow: /shop/*/*-colour-*-colour-*
Disallow: /shop/*/colour-*-colour-*
Disallow: /shop/*/*-category-*-category-*
Disallow: /shop/*/category-*-category-*
Disallow: /shop/*-brand-*-brand-*
Disallow: /shop/brand-*-brand-*
Disallow: /shop/*/*-brand-*-brand-*
Disallow: /shop/*/brand-*-brand-*
Disallow: /shop/*-department-*-department-*
Disallow: /shop/department-*-department-*
Disallow: /shop/*/*-department-*-department-*
Disallow: /shop/*/department-*-department-*
Disallow: /shop/*-use-*-use-*
Disallow: /shop/use-*-use-*
Disallow: /shop/*/*-use-*-use-*
Disallow: /shop/*/use-*-use-*
Disallow: /shop/*-designfeature-*-designfeature-*
Disallow: /shop/designfeature-*-designfeature-*
Disallow: /shop/*/*-designfeature-*-designfeature-*
Disallow: /shop/*/designfeature-*-designfeature-*
Disallow: /shop/*-size-*
Disallow: /shop/*/size-*
Using the first line as an example:
Disallow: /shop/*-colour-*-colour-*
Their store is hosted in the /shop/ directory.
/*-colour-*-colour-*
The line above means that if a store category page has two colour filters applied to it, do not crawl the page.
Here is an example of a category page that will be crawled and indexed:
This is their Men’s Boots category page:
http://www.next.co.uk/shop/gender-men-productaffiliation-footwear/category-boots-category-wellies
If a single colour filter for only brown boots was applied to the page’s results, the URL would end with the appendage “-colour-brown”.
Remember that the Robots.txt line above refers to:
Disallow: /shop/*-colour-*-colour-*
This means two instances of “-colour-” in the URL.
So a double filter showing both brown and black boots will NOT be crawled, because of the last appendage to the URL:
-colour-black-colour-brown
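The wildcard matching these rules rely on can be sketched in a few lines. This is an illustrative matcher following the documented robots.txt conventions (`*` as a wildcard, `$` as end-of-URL), not Google’s actual implementation:

```python
import re

def robots_pattern_matches(pattern, path):
    """Check a URL path against a robots.txt Disallow pattern,
    treating '*' as a wildcard and '$' as end-of-URL."""
    regex = ""
    for ch in pattern:
        if ch == "*":
            regex += ".*"
        elif ch == "$":
            regex += "$"
        else:
            regex += re.escape(ch)
    return re.match(regex, path) is not None

base = "/shop/gender-men-productaffiliation-footwear/category-boots-category-wellies"
rule = "/shop/*/*-colour-*-colour-*"

print(robots_pattern_matches(rule, base + "-colour-brown"))                # False: one filter, crawlable
print(robots_pattern_matches(rule, base + "-colour-black-colour-brown"))  # True: two filters, blocked
```

The pattern only matches when “-colour-” occurs twice, which is exactly the double-filter case Next blocks.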
The brilliant SEO team at Next.co.uk go further and apply noindex, follow to the category page above and to all category pages with two or more filters. This is CONTROL!
My advice for robots.txt domination is to get your information architecture right by using distinct folders and a documented naming convention across your site. This way, you can roll out section-wide Disallows, or even noindex directives via your meta robots tag.
Smaller stores with tens of items per category should not allow the indexation of filtered category pages, because these are likely to generate thin pages that are prime candidates for a Panda penalty. Only create exceptions where significant search volume justifies the search engine visibility of a filtered category page.
URL Parameters is an advanced feature in Google Webmaster Tools, which you are probably already familiar with, designed specifically for parameterised URLs generated by filtering. It can be accessed from the ‘Crawl’ tab in WMT under the heading URL Parameters.
A point to note about URL Parameters is that it serves as a helpful indexation hint to Google and is not as effective as explicit declarations in robots.txt or the meta robots tag. That said, URL Parameters helps Google crawl sites more efficiently, saves bandwidth and reduces redundant duplicate content.
In order to consider using URL Parameters to help indexation or prevent duplication, your website should generate *key/value pair* URLs similar to:
http://www.mywebstore.com/page.php?key=value1&key=value2
http://www.prettylittlething.com/clothing/tops.html?colour=55
http://www.glamorous.com/clothing/jackets-coats.html?colour_code=43&price=80-
A ‘?’ is always appended to the end of a typical URL, followed by:
key1=value1
In multi-option scenarios, two or more key/value pairs can be appended, separated by an ‘&’, as in:
key1=value1&key2=value2
Google looks at the key/value pairs only and not the URL construct. So these URLs are identical to Google:
http://www.mywebstore.com/page.php?key1=value1&key2=value2
and
http://www.mywebstore.com/page.php?key2=value2&key1=value1
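That equivalence is easy to demonstrate by normalising the query string. Here is a quick sketch (mywebstore.com is the placeholder domain from the examples above; the function is illustrative):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalise_query(url):
    """Sort query parameters so reordered variants of the same
    filtered page collapse to one canonical form."""
    parts = urlsplit(url)
    params = sorted(parse_qsl(parts.query))
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(params), parts.fragment))

a = normalise_query("http://www.mywebstore.com/page.php?key1=value1&key2=value2")
b = normalise_query("http://www.mywebstore.com/page.php?key2=value2&key1=value1")
print(a == b)  # True: parameter order does not change the page
```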
A point worth noting is that enforcing a flat URL structure for filtered category pages eliminates query parameters altogether, and with them the use of URL Parameters in Webmaster Tools. This applies to a URL like this:
http://www.evanscycles.com/categories/bikes/bmx-bikes/f/2015
Their /f/ folder represents their filter rather than a key/name value pair.
There are two broad URL parameter declaration options available:
As an example, if I were the eCommerce manager for this website and felt that all ‘size’ filter pages like this one should be kept out of the index via URL Parameters:
http://www.prettylittlething.com/clothing/tops.html?clothes_size=36
I would take the following steps:
Go into the Google Webmaster Tools URL Parameters tab and edit the parameter ‘clothes_size’ – if it is not on the list, add it as an entry.
Does this parameter change page content seen by the user?
Yes: Changes, reorders or narrows page content
How does this parameter affect page content?
Narrows
Which URLs with this parameter should Googlebot crawl?
Every URL
Think of a canonical tag as a 301 redirect that happens in the background. It helps normalise individual pages with multiple URL variations via an invisible redirect to a single dominant version.
Here is the syntax:
<link rel="canonical" href="http://example.com/category-page" />
If your category page URL creates multiple versions of a single page, then using the canonical tag might be an option worth exploring.
I do prefer to use the meta robots tag or the X-Robots-Tag over the canonical tag, but the canonical tag is an ideal candidate for the following scenarios:
Although rel=”next” and rel=”prev” are the recommended directives for paginated content, if a view-all page exists, then the rel=”canonical” tag should be used.
Tracking variables tend to be appended to URLs that either link to your site or link internally, typically as query-string parameters.
The issue is that URLs with appended variables, and even variations in capitalisation, are treated as distinct URLs by search engines. And even though Google is able to figure out such discrepancies automatically with URL parameter handling (as highlighted above), the process is still rife with issues. The canonical tag ensures a single reference version of a specific URL.
If a session-ID-generated product page URL was linked to from an article or review and did not have a canonicalised URL, the link would largely be wasted. Implementing the rel=”canonical” tag ensures that link value is passed from variants of a URL to a single URL. In scenarios such as this, the other indexation fixes I have mentioned above, such as meta robots, URL Parameters and robots.txt disallow, are not well suited.
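A sketch of the clean-up a canonical tag represents might look like this. The parameter names (utm_*, sessionid) are common tracking/session keys used purely for illustration; your platform’s keys may differ:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative set of tracking/session parameter names to strip.
TRACKING_KEYS = {"utm_source", "utm_medium", "utm_campaign", "sessionid", "sid"}

def canonical_tag(url):
    """Drop tracking/session parameters, lower-case the host and
    emit a rel=canonical link element for the cleaned URL."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k.lower() not in TRACKING_KEYS]
    clean = urlunsplit((parts.scheme, parts.netloc.lower(), parts.path,
                        urlencode(kept), ""))
    return '<link rel="canonical" href="{}" />'.format(clean)

print(canonical_tag("http://example.com/product?colour=blue&utm_source=review&sessionid=abc123"))
# → <link rel="canonical" href="http://example.com/product?colour=blue" />
```

However the variant URL was generated, every copy of the page points search engines back to the same clean version, so link value consolidates there.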
This guide serves as a reference to eCommerce managers and analysts looking to resolve duplication issues in their stores. Do you have any questions? Or would you like to add some more tips? If yes, please drop a comment below.