
Advanced SEO: 5 Ways to Fix E-Commerce Duplicate Content Issues

Posted on 29th August 2014, by Kunle Campbell in Technical SEO, Traffic

Duplicate category or product pages in an online store can have devastating effects on the store’s organic search rankings as well as its sales. Issues ranging from the wrong pages ranking for your target search phrases, at the lighter end of the scale, to triggering an algorithmic Panda penalty at the more severe end are reasons to weed out potential sources of duplicate content early on and throughout the course of scaling your store. A duplicate content clean-up is a vital SEO task that eCommerce managers should pencil into their calendars.

From experience, I tend to find that the more complex (in information architecture) or the larger a store is, the more likely duplicate content issues are to crop up. So my rule is to keep things simple! Easier said than done, especially when the web development team is out of sync with the technical SEO team or there is a long chain of approvals required to make necessary changes.

To kick things off, let’s begin with the tools you have in your arsenal to tackle duplicate content issues:

  1. Page-Level Mark-Up: Robots Meta Tag
  2. Page-Level Mark-Up: X-Robots-Tag
  3. Robots.txt DISALLOW
  4. URL Parameters in Google Webmaster Tools for Name/Value Paired URLs
  5. Canonical Tag

1. Page-Level Mark-Up: The Robots Meta Tag

<meta name="robots" content="noindex,follow">

The meta robots directive is the most effective means of controlling which pages do or do not get indexed by search engines. Its drawback is scalability, as it has to be applied on a page-by-page basis. That said, if you use a platform such as Magento Commerce, be sure that any extension you install that scales or automatically adds more pages, such as category filters, to your store has a meta robots feature. This ensures a batch roll-out of the meta robots tag across the specific set of pages that the extension generates – which gives you control.

Let’s break down the code snippet:

<meta name="robots" content=" <value> ">

In name=”robots”, “robots” signifies all search engine bots, but you can be more specific and replace “robots” with the user agent of your choice:
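For instance, to address Google’s crawler only, you could swap “robots” for the googlebot user agent (Bing’s equivalent is bingbot):

<meta name="googlebot" content="noindex,follow">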

The content="<value>"

The “<value>” field accepts one or more of the following values, separated by commas (see the combined example after this list):

  • NOINDEX – prevents the page from being included in the index.
  • NOFOLLOW – prevents bots from following any links on the page.
  • NOARCHIVE – prevents a cached copy of this page from being available in the search results.
  • NOSNIPPET – prevents a description from appearing below the page in the search results, as well as prevents caching of the page.
  • NOODP – blocks the Open Directory Project description of the page from being used in the description that appears below the page in the search results.
  • NONE – equivalent to “NOINDEX, NOFOLLOW” 
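As a combined example, the tag below would keep a filtered category page out of the index and out of Google’s cache, while still allowing its links to be followed:

<meta name="robots" content="noindex,follow,noarchive">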

2. The X-Robots-Tag HTTP header:

The X-Robots-Tag performs the same functions as the meta robots tag; the only difference is that it is served as part of the HTTP header response for a given URL, which also makes it usable for non-HTML resources such as PDFs and image files.

So an HTTP header can look like this:
HTTP/1.1 200 OK
Date: Tue, 22 July 2014 04:45:00 GMT
(…)
X-Robots-Tag: googlebot: noindex, nofollow
X-Robots-Tag: bingbot: nofollow
(…)

OR

HTTP/1.1 200 OK
Date: Tue, 22 July 2014 04:45:00 GMT
(…)
X-Robots-Tag: noindex, follow
(…)

Directives specified without a user-agent are valid for all crawlers. Read more on Google’s documentation about the robots meta tag and this blog post.
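As a rough sketch of how this might be rolled out, on an Apache server with mod_headers enabled you could add the header to downloadable files – say PDF look-books or CSV price lists (the extensions here are purely illustrative) – via .htaccess:

<FilesMatch "\.(pdf|csv)$">
  Header set X-Robots-Tag "noindex, follow"
</FilesMatch>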

3. Robots.txt DISALLOW

The next weapon in our arsenal for controlling duplicate content is the robots.txt ‘Disallow’ directive. I would not recommend this option unless you keep special offers or sales in a directory you do not want indexed, such as /sales, /specials or /offers, or your architecture follows a highly specific naming pattern.
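A minimal sketch of that approach, assuming promotional pages live under the /sales/, /specials/ and /offers/ directories mentioned above, would be:

User-agent: *
Disallow: /sales/
Disallow: /specials/
Disallow: /offers/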

It is, however, a more potent tool than the meta robots tag, and with that power comes great responsibility. Here’s why: a robots.txt ‘Disallow’ rule completely prevents bots from crawling the matched URLs, so their content is never even read (although a blocked URL can still appear in the index as a bare listing if it is linked to from elsewhere).

Whilst the meta robots directive assumes that the page has been crawled and then provides instructions such as noindex or nofollow, the robots.txt Disallow command prevents crawling in the first place.

Here is how next.co.uk use the Disallow directive to control crawling and indexation: http://www.next.co.uk/robots.txt
Disallow: /shop/*-colour-*-colour-*
Disallow: /shop/colour-*-colour-*
Disallow: /shop/*/*-colour-*-colour-*
Disallow: /shop/*/colour-*-colour-*
Disallow: /shop/*/*-category-*-category-*
Disallow: /shop/*/category-*-category-*
Disallow: /shop/*-brand-*-brand-*
Disallow: /shop/brand-*-brand-*
Disallow: /shop/*/*-brand-*-brand-*
Disallow: /shop/*/brand-*-brand-*
Disallow: /shop/*-department-*-department-*
Disallow: /shop/department-*-department-*
Disallow: /shop/*/*-department-*-department-*
Disallow: /shop/*/department-*-department-*
Disallow: /shop/*-use-*-use-*
Disallow: /shop/use-*-use-*
Disallow: /shop/*/*-use-*-use-*
Disallow: /shop/*/use-*-use-*
Disallow: /shop/*-designfeature-*-designfeature-*
Disallow: /shop/designfeature-*-designfeature-*
Disallow: /shop/*/*-designfeature-*-designfeature-*
Disallow: /shop/*/designfeature-*-designfeature-*
Disallow: /shop/*-size-*
Disallow: /shop/*/size-*

Using the first line as an example:
Disallow: /shop/*-colour-*-colour-*
Their store is hosted in the /shop/ directory, and the pattern *-colour-*-colour-* matches any category URL with two colour filters applied, so any such page will not be crawled.

Here is an example of a category page that will be crawled and indexed – their Men’s Boots category page:

http://www.next.co.uk/shop/gender-men-productaffiliation-footwear/category-boots-category-wellies

If a single colour filter for only brown boots was applied to the page’s results, the URL would look like this:

http://www.next.co.uk/shop/gender-men-productaffiliation-footwear/category-boots-category-wellies-colour-brown

[Screenshot: brown boots and wellies in Next’s men’s footwear category]

Notice the last appendage: “-colour-brown”

Remember that the Robots.txt line above refers to:

Disallow: /shop/*-colour-*-colour-*

Which means two instances of colour.

So a double filter to show both brown and black boots will NOT be crawled:

http://www.next.co.uk/shop/gender-men-productaffiliation-footwear/category-boots-category-wellies-colour-black-colour-brown

[Screenshot: black and brown boots and wellies in Next’s men’s footwear category]

Because of the last appendage to the URL:

-colour-black-colour-brown

The brilliant SEO team at Next.co.uk goes further and applies noindex, follow to the category page above and to all category pages with two or more filters. This is CONTROL!

My advice for robots.txt domination is to get your information architecture right by using distinct folders and a documented naming convention across your site. This way, you are able to roll out section-wide Disallow rules, or even noindex directives via your meta robots tags.

Note for smaller stores:

Smaller stores with only tens of items per category should not allow the indexation of filtered category pages, because these tend to be thin pages that are prime candidates for a Panda penalty. Only make exceptions where significant search volume justifies giving a filtered category page search engine visibility.

4. URL Parameters in Google Webmaster Tools for Name/Value Paired URLs:

URL Parameters is an advanced feature in Google Webmaster Tools that you are probably already familiar with, designed specifically for parameterised URLs such as those generated by filtering. It can be accessed from the ‘Crawl’ tab in Webmaster Tools under the heading URL Parameters.

[Screenshot: Google Webmaster Tools – URL Parameters]

A point to note about URL parameters is that they serve as a helpful hint to Google rather than a directive, so they are not as effective as explicit declarations in robots.txt or the meta robots tag. That said, URL parameter settings help Google crawl sites more efficiently, save bandwidth and cut down on redundant duplicate content.

In order to consider using URL parameters to help indexation or prevent duplication, your website should generate key/value (name/value) pair URLs similar to:

http://www.mywebstore.com/page.php?key=value1&key=value2

http://www.prettylittlething.com/clothing/tops.html?colour=55

http://www.glamorous.com/clothing/jackets-coats.html?colour_code=43&price=80-

A ‘?’ is always appended to the end of a typical URL, followed by:

key1=value1

In multi-option scenarios, two or more key/value pairs can be appended, separated by an ‘&’, as in:

key1=value1&key2=value2

Google looks at the key/value pairs themselves rather than the order in which they appear in the URL, so these URLs are identical to Google:

http://www.mywebstore.com/page.php?key1=value1&key2=value2

and

http://www.mywebstore.com/page.php?key2=value2&key1=value1

A point worth noting is that enforcing a flat URL structure for filtered category pages eliminates query parameters altogether, and with them the need for URL parameter handling in Webmaster Tools. This applies to a URL like this:

http://www.evanscycles.com/categories/bikes/bmx-bikes/f/2015

Their /f/ folder represents their filter rather than a key/name value pair.

Defining parameter constraints to prevent duplication of content:

There are two broad URL parameter declaration options available:

  1. Parameters that don’t affect page content – typically tracking parameters such as session IDs (SID), affiliate IDs (affiliateID) and tracking IDs (tracking-ID).
  2. Parameters that change, reorder or narrow page content – typically sort and filter parameters such as sort=price_ascending, rankBy=bestSelling, order=highest-rated and sort=newest.

As an example, if I were the eCommerce manager for this website and wanted all ‘size’ filter pages like this one kept out of the index using URL parameters:

http://www.prettylittlething.com/clothing/tops.html?clothes_size=36

I would take the following steps:

Go into the Google Webmaster Tools URL Parameters tab and edit the parameter ‘clothes_size’ – if it is not on the list, add it as an entry.

[Screenshot: Webmaster Tools – URL Parameters edit dialog]

Does this parameter change page content seen by the user?
Yes: Changes, reorders or narrows page content

How does this parameter affect page content?
Narrows

Which URLs with this parameter should Googlebot crawl?
No URLs (since the goal is to keep size-filtered pages out of Google’s crawl and index)

5. The Canonical Tag

Think of a canonical tag as a soft 301 redirect that happens in the background. It consolidates individual pages that have multiple URL variations by pointing them all, invisibly to the user, at a single dominant version.

Here is the syntax:

<link rel="canonical" href="http://example.com/category-page" />

If your category page URL creates multiple versions of a single page, then using the canonical tag might be an option worth exploring.

I do prefer to use the meta robots tag or the X-Robots-Tag over the canonical, but the canonical tag is an ideal candidate for the following scenarios:

1. Paginated Content with a View-All Page

Although rel="next" and rel="prev" are the recommended directives for paginated content, if a view-all page exists then the rel="canonical" tag should point each paginated page to it, as in the sketch below.

[Image: rel=canonical pointing paginated pages to a view-all page]
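As a quick sketch, assuming a category paginated at /category?page=2 with a view-all version at /category/view-all (both URLs are hypothetical), the head of page two might contain:

<link rel="prev" href="http://example.com/category?page=1" />
<link rel="next" href="http://example.com/category?page=3" />
<link rel="canonical" href="http://example.com/category/view-all" />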

2. Tracking Codes and Session ID Generated URLs

Tracking variables tend to be appended to URLs that either link to your site or link internally. The format is similar to:

  1. mywebstore.example.com/books/?trackingId=cat123
  2. mywebstore.example.com/cds/?trackingId=cat124
  3. mywebstore.example.com/cards/?trackingId=cat125

The issue is that URLs with appended variables (and even differences in capitalisation) are treated as separate pages by search engines. And even though Google can often work out such discrepancies automatically through URL parameter handling (as highlighted above), the process is still prone to issues. The canonical tag establishes a single reference version of a specific URL, so the canonical versions of the URLs above would look something like this (see the sketch after this list):

  1. mywebstore.example.com/books/
  2. mywebstore.example.com/cds/
  3. mywebstore.example.com/cards/
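In practice, each tracked variant would carry a canonical tag in its head pointing at the clean URL; for the first URL above, that would be something like:

<link rel="canonical" href="http://mywebstore.example.com/books/" />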

3. Link Consolidation

If a product page URL generated with a session ID is linked to from an article or review and does not have a canonicalised URL, that link is almost wasted. Implementing the rel="canonical" tag ensures that link value is passed from variants of a URL to a single URL. In scenarios such as this, the other indexation fixes mentioned above – meta robots, URL parameters and robots.txt Disallow – are not well suited, because they do not consolidate link value onto the preferred URL.

Your Turn…

This guide serves as a reference for eCommerce managers and analysts looking to resolve duplication issues in their stores. Do you have any questions? Or would you like to add some more tips? If yes, please drop a comment below.

 

About the author:

Kunle Campbell

An ecommerce advisor to ambitious, agile online retailers and funded ecommerce startups seeking exponential sales growth through scalable customer acquisition, retention, conversion optimisation, product/market fit optimisation and customer referrals.

