Block Bad URLs with Robots.txt

20.Nov.2021

Use a robots.txt file to stop search engines from crawling low-value ("bad") URLs on your site. Blocking URLs this way is sometimes loosely called "cloaking," but use that term cautiously: cloaking properly refers to serving different content to crawlers than to visitors, a practice associated with manipulating Google's search results.

The [robots.txt](https://www.robotstxt.org/robotstxt) file is a plain-text file that gives crawlers and web robots instructions about which of a website's pages they may crawl. It sits in the root of your domain and sets out rules for the paths and files you want crawlers to visit or to skip. You can also use a wildcard (*) in place of part of a page path, which blocks crawling of all URLs that share the same base URL.
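As a minimal sketch (all of the paths here are hypothetical), a robots.txt that lets every crawler in but keeps it out of a few areas could look like this:

```
# Rules for all crawlers
User-agent: *
Disallow: /tmp/           # block everything under /tmp/
Disallow: /private.html   # block a single file
Disallow: /print*         # block every URL whose path starts with /print
```

Anything not matched by a Disallow rule remains crawlable by default; note that the * wildcard is an extension honored by major crawlers such as Googlebot rather than part of the original robots.txt standard.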

To prevent multiple URLs from pointing to similar or duplicate content, use a robots.txt file to block crawling of any unnecessary domain names or dynamic URLs that merely duplicate the canonical version of your site's pages.

For example, if you have a search feature on your site, generating dynamic, duplicate URLs for each filtered search result is not recommended: it can create junk pages in Google's index, and users get a poor experience when they click those links in search results.
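As a rough sketch, assuming an internal search page at /search and a hypothetical filter query parameter, you could keep crawlers out of those filtered result pages like this:

```
User-agent: *
# Block internal search result pages, e.g. /search?q=shoes
Disallow: /search
# Block any URL containing a filter query parameter, e.g. /shoes?filter=red
Disallow: /*?filter=
```

The canonical product and category pages stay crawlable because no rule matches them.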

Here are some common scenarios you may encounter and how best to configure your robots.txt file for each; a combined example follows the list:

- If you have a website with one static homepage but several dynamic URLs that point to the same page, block these URLs by adding a disallow rule in your robots.txt file for each unnecessary URL. For example, a site might have three bad dynamic URLs that all redirect to the canonical homepage.

- If your site has several dynamic URLs pointing to each of its product pages, block only the unnecessary ones. For example, a site such as http://example.com/ might display one canonical version of each product, so you would block only the extraneous dynamic URLs that are causing duplicate content.

- If your site has multiple URLs that redirect to the same page, block any unnecessary URLs. For example, a site might have two product pages with similar content but distinct URLs.

- If you have one canonical version of your content on a page and a search feature that generates dynamic, more specific duplicates of its URL, block only the dynamic versions you do not want search engines to crawl. For example, a site might have a canonical page whose content is exactly what users are searching for in Google, plus several dynamic URLs produced by different filtered search queries.

- If your site loads multiple versions of the same JavaScript or CSS code, the extra script URLs give crawlers more unnecessary URLs to fetch and can contribute to duplicate content issues. Block the dynamic versions of these scripts by adding a disallow rule in your robots.txt file for each script path that generates extraneous dynamic URLs. For example, a site might load code for three different versions of its social media buttons, so you would block the extra versions and leave only the canonical one crawlable.

- If your site displays dynamic ads that cause duplicate content issues, block any unnecessary URLs. For example, a site might serve two versions of the same content with different display attributes: one static HTML and the other generated by JavaScript.

- If you have one canonical version of your website, but dynamic URLs are causing duplicate content issues with the canonical URL, block only the dynamic versions of those pages. For example, a site might have two product landing pages that are dynamically generated for each new session according to the user's geographical location; since this can create duplicate content issues, block only the dynamic versions of the URLs.

- If your site has a canonical version of every product page, but several other similar dynamic URLs point to those pages, block only the unnecessary ones by adding a disallow rule in your robots.txt file for each URL pattern that generates the extraneous duplicates. For example, a site might have a canonical product page whose content is exactly what users are searching for in Google, plus several dynamic URLs produced by different filtered search queries.
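
Putting these scenarios together, a robots.txt covering them might look like the sketch below. Every path and query parameter shown (sessionid, geo, sort, the script file names, and so on) is a hypothetical stand-in for whatever your own URLs actually use:

```
User-agent: *
# Dynamic URLs that duplicate the static homepage
Disallow: /home?ref=
Disallow: /index.php?*

# Filtered or sorted search results that duplicate canonical pages
Disallow: /search
Disallow: /*?filter=
Disallow: /*&sort=

# Session- and location-specific copies of landing pages
Disallow: /*?sessionid=
Disallow: /*?geo=

# Extra versions of scripts that only generate duplicate URLs
Disallow: /js/social-buttons-v2.js
Disallow: /js/social-buttons-v3.js
```

Everything not matched by a rule, including the canonical pages, remains crawlable by default. Keep in mind that robots.txt controls crawling rather than indexing, so a blocked URL that other sites link to can still appear in results; for pages that must never be indexed, a noindex directive or a canonical tag on the page itself is the more reliable tool.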