An In-Depth Guide to the Robots.txt File

Benjamin Beckwith ● August 10, 2020 | Blog, SEO

The robots.txt is a very powerful file that can be added to your website to help control which areas of your site search engines should crawl and which areas should be ignored. It is important to review your robots.txt on a regular basis to make sure it is up to date and, if possible, to use a monitoring tool to be alerted when changes occur.

At Semetrical, as part of our technical SEO service offering, we audit a client’s robots.txt file when undertaking a technical audit of their website to check that the paths being blocked should be blocked. Additionally, if the SEO team comes across issues such as duplication as part of the technical SEO audit process, new robots.txt rules may be written and added to the file.

As the robots.txt is an important file we have put together a guide that covers what it ultimately is, why someone may use it and common pitfalls that can occur when writing rules.

What is a robots txt file?

The robots.txt file is the first port of call for a crawler when visiting your website. It’s a text file that lists instructions for different user agents, essentially telling web crawlers which parts of a site should be crawled and which should be ignored. The main instructions used in a robots.txt file are specified by an “allow” or “disallow” rule.

Historically a “noindex” rule would also work, however in 2019 Google stopped supporting the noindex directive as it was an unpublished rule.

If the file is not used properly it can be detrimental to your website and could cause a huge drop in traffic and rankings. For example, mistakes can happen when a whole website is blocked from search engines or a section of a site is blocked by mistake. When this happens the rankings connected to that part of the site will gradually drop and traffic will in turn drop.

Do you actually need a robots.txt file?

No, it is not compulsory to have a robots.txt on your website, especially for small websites with minimal URLs, but it is highly recommended for medium to large websites. On large sites it makes it easier to control which parts of your site are accessible and which sections should be blocked from crawlers. If the file does not exist your website will generally be crawled and indexed as normal.

What is the robots txt file mainly used for?

The robots.txt has many use cases and at Semetrical we have used it for the below scenarios:

Blocking internal search results as these pages are not usually valuable to a crawler and can cause a lot of duplication across a website.
Blocking parts of a facet navigation if certain facets are not valuable from an SEO perspective but are still needed for UX when a user is on your website.
Blocking different levels of a facet navigation, where one facet level may be useful for search engines but when combining two different facet filters they may become irrelevant for a search engine to crawl and index.
Blocking parameters which cause duplication or waste crawl budget. This is slightly controversial as others may tell you not to block parameters in the robots.txt, but it has worked on a number of our client websites where parameters are needed but crawlers don’t need to crawl them. It is highly recommended to check that any parameter you are blocking has no valuable links and is not ranking for any valuable keywords bringing in traffic.
Blocking private sections of a website such as checkout pages and login sections.
Including your XML sitemap locations to make it easy for crawlers to access all of the URLs on your website.
To allow only specific bots to access and crawl your site.
Blocking user-generated content that cannot be moderated.

Where to put a robots txt & How to add it to your site?

A robots.txt file needs to be placed at the root of your website and must be named robots.txt. For example, on Semetrical’s site it sits at www.semetrical.com/robots.txt. Each host can only have one robots.txt, and it needs to be a UTF-8 encoded text file (ASCII is a subset of UTF-8).

If you have subdomains such as blog.example.com, each subdomain needs its own robots.txt at its root, such as blog.example.com/robots.txt.
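As a small illustration of the "root of the host" rule, the sketch below derives the robots.txt location from any page URL using Python's standard library. The function name `robots_txt_url` and the sample page URLs are assumptions made for this example only.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any subdomain);
    # drop the path, query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page URLs, used only to show the behaviour.
print(robots_txt_url("https://www.semetrical.com/some/page/?ref=1"))
print(robots_txt_url("https://blog.example.com/2020/08/post"))
```

Because the host is kept as-is, a subdomain resolves to its own robots.txt rather than the one on the main domain.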

What does a robots.txt file look like?

A typical robots.txt file would be made up of different components and elements which include:

User-agent
Disallow
Allow
Crawl delay
Sitemap
Comments (Occasionally you may see this)

Below is an example of Semetrical’s robots.txt that includes a user-agent, disallow rules and a sitemap.

User-agent:  *
Disallow:  /cgi-bin/
Disallow:  /wp-admin/
Disallow:  /comments/feed/
Disallow:  /trackback/
Disallow:  /index.php/
Disallow:  /xmlrpc.php
Disallow:  /blog-documentation/
Disallow:  /test/
Disallow:  /hpcontent/

Sitemap:  https://devsemetrical.wpengine.com/sitemap.xml
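To get a feel for how a crawler reads a file like this, here is a minimal sketch that loads a shortened copy of the rules with Python's standard-library `urllib.robotparser` and asks whether two URLs may be fetched. The www.example.com URLs and the "Googlebot" user agent string are assumptions for illustration. Note that this parser only does simple prefix matching and does not understand `*` or `$` inside paths, so it is not a substitute for testing against a real search engine.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the example file, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /test/

Sitemap: https://devsemetrical.wpengine.com/sitemap.xml
"""


def build_parser(robots_txt):
    """Parse robots.txt text that is already in memory (no network request)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Hypothetical URLs: one under a blocked path, one that no rule mentions.
print(parser.can_fetch("Googlebot", "https://www.example.com/wp-admin/settings"))
print(parser.can_fetch("Googlebot", "https://www.example.com/blog/"))
```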

User-agent

The user-agent defines the start of a group of directives. It often is represented with a wildcard (*) which signals that the instructions below are for all bots visiting the website. An example of this would be:

User-agent:  *
Disallow:  /cgi-bin/
Disallow:  /wp-admin/

There will be occasions when you may want to block certain bots, or only allow certain bots, from accessing certain pages. In order to do this you need to specify the bot’s name as the user agent. An example of this would be:

User-agent:  AdsBot-Google
Disallow:  /checkout/reserve
Disallow:  /resale/checkout/order
Disallow:  /checkout/reserve_search

Common user-agents to be aware of include:

There is also the ability to block specific software from crawling your website, or to slow down how quickly it can crawl your URLs, as each tool has its own user agent that crawls your site. For example, if you wanted to block SEMRush or Ahrefs from crawling your website the below would be added to your file:

User-agent:  SemrushBot
Disallow:  /

User-agent:  AhrefsBot
Disallow:  /
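The effect of bot-specific groups can be sketched with Python's standard-library `urllib.robotparser`. This example assumes the blocking rule is written with a forward slash (`Disallow: /`), and the www.example.com URL and the "Googlebot" comparison are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Two groups, each blocking one named tool from the whole site.
BLOCK_SEO_TOOLS = """\
User-agent: SemrushBot
Disallow: /

User-agent: AhrefsBot
Disallow: /
"""

tool_parser = RobotFileParser()
tool_parser.parse(BLOCK_SEO_TOOLS.splitlines())

PAGE = "https://www.example.com/blog/"  # hypothetical URL

# The named bots are blocked; a bot with no matching group is not.
for bot in ("SemrushBot", "AhrefsBot", "Googlebot"):
    print(bot, tool_parser.can_fetch(bot, PAGE))
```

Since there is no `User-agent: *` group in this file, any bot that is not named falls through to the default of being allowed.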

If you wanted to slow down the rate at which URLs are crawled, the below rules would be added to your file:

User-agent:  AhrefsBot
Crawl-Delay:  [value]
 
User-agent:  SemrushBot
Crawl-Delay:  [value]

Disallow directive

The disallow directive is a rule a user can put in the robots.txt file that will tell a search engine not to crawl a specific path or set of URLs depending on the rule created. There can be one or multiple lines of disallow rules in the file as you may want to block multiple sections of a website.

If a disallow directive is empty and does not specify anything then bots can crawl the whole website, so in order to block certain paths or your whole website you need to specify a URL prefix or a forward slash “/”. In the example below, we are blocking any URL that runs off the path of /cgi-bin/ or /wp-admin/.

User-agent:  *
Disallow:  /cgi-bin/
Disallow:  /wp-admin/

If you wanted to block your whole website from bots such as Google then you would need to add a disallow directive followed by a forward slash. Typically you may only need to do this on a staging environment when you do not want the staging website to be found or indexed. An example would look like:

User-agent:  *
Disallow:  /
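The difference between an empty disallow, a path prefix and a lone forward slash can be checked with a short sketch using Python's standard-library `urllib.robotparser`. The helper name `can_crawl` and the staging.example.com URLs are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt, url, user_agent="*"):
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


ALLOW_EVERYTHING = "User-agent: *\nDisallow:\n"      # empty disallow
BLOCK_EVERYTHING = "User-agent: *\nDisallow: /\n"    # whole site blocked
BLOCK_TWO_PATHS = "User-agent: *\nDisallow: /cgi-bin/\nDisallow: /wp-admin/\n"

URL = "https://staging.example.com/any/page"  # hypothetical URL

print(can_crawl(ALLOW_EVERYTHING, URL))
print(can_crawl(BLOCK_EVERYTHING, URL))
print(can_crawl(BLOCK_TWO_PATHS, "https://staging.example.com/wp-admin/users"))
print(can_crawl(BLOCK_TWO_PATHS, URL))
```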

Allow directive

Most search engines will abide by the allow directive, which essentially counteracts a disallow directive. For example, if you were to block /wp-admin/ it would usually block all the URLs that run off that path. However, if there is an allow rule for /wp-admin/admin-ajax.php then bots will crawl /wp-admin/admin-ajax.php but will not crawl any other path that runs off /wp-admin/. See the example below:

User-agent:  *
Disallow:  /wp-admin/
Allow:  /wp-admin/admin-ajax.php

Crawl Delay

The crawl delay directive helps slow down the rate a bot will crawl your website. Not all search engines will follow the crawl delay directive as it’s an unofficial rule.

– Google will not follow this directive

– Baidu will not follow this directive

– Bing and Yahoo support the crawl delay directive, where the rule instructs the bot to wait “n” seconds after a crawl action.

– Yandex also supports the crawl delay directive but interprets the rule slightly differently, where it will only access your site once in every “n” seconds.

An example of a crawl delay directive is below:

User-agent:  BingBot
Disallow:  /wp-admin/
Crawl-delay:  5
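For anyone writing their own crawler, a sketch of how the value could be read back is shown below, using `crawl_delay()` from Python's standard-library `urllib.robotparser`. How a given search engine actually applies the number is up to that engine, as described above; the "Googlebot" lookup is only there to show what happens when no group matches.

```python
from urllib.robotparser import RobotFileParser

CRAWL_DELAY_RULES = """\
User-agent: BingBot
Disallow: /wp-admin/
Crawl-delay: 5
"""

delay_parser = RobotFileParser()
delay_parser.parse(CRAWL_DELAY_RULES.splitlines())

# The delay is only defined for the group it was written in.
bing_delay = delay_parser.crawl_delay("BingBot")
other_delay = delay_parser.crawl_delay("Googlebot")

print(bing_delay)   # 5
print(other_delay)  # None, as no group applies to this user agent
```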

Sitemap Directive

The sitemap directive tells search engines where to find your XML sitemap, which makes it easy for different search engines to find the URLs on your website. The main search engines that will follow this directive include Google, Bing, Yandex and Yahoo.

It is advised to place the sitemap directive at the bottom of your robots.txt file. An example of this is below:

User-agent:  *
Disallow:  /cgi-bin/
Disallow:  /wp-admin/
Disallow:  /comments/feed/

Sitemap:  https://devsemetrical.wpengine.com/sitemap.xml
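As a rough sketch, the sitemap locations can be pulled out of a robots.txt file with a few lines of Python. The function name `extract_sitemaps` is an assumption for this example, and the parsing is deliberately simple: it splits each line on the first colon and ignores everything after a "#".

```python
def extract_sitemaps(robots_txt):
    """Return every URL declared with a Sitemap directive, in file order."""
    sitemaps = []
    for raw_line in robots_txt.splitlines():
        # Drop comments; this sketch assumes sitemap URLs contain no "#".
        line = raw_line.split("#", 1)[0].strip()
        field, separator, value = line.partition(":")
        # Directive names are matched case-insensitively.
        if separator and field.strip().lower() == "sitemap":
            sitemaps.append(value.strip())
    return sitemaps


SITEMAP_EXAMPLE = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /comments/feed/

Sitemap: https://devsemetrical.wpengine.com/sitemap.xml
"""

print(extract_sitemaps(SITEMAP_EXAMPLE))
```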

Comments

A robots.txt file can include comments, but comments are only for humans and not bots, as anything after a hash symbol (#) on a line will be ignored. Comments can be useful for multiple reasons, which include:

– Provides a reason why certain rules are present

– References who added the rules

– References which parts of a site the rules are for

– Explains what the rules are doing

Below are examples of comments in different robots.txt files:

#Student

Disallow:  /student/*-bed-flats-*
Disallow:  /student/*-bed-houses*
Disallow:  /comments/feed/

#Added by Semetrical

Disallow:  /jobs*/full-time/*
Disallow:  /jobs*/permanent/*

#International

Disallow:  */company/fr/*
Disallow:  */company/de/*
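To show what a bot is left with once comments are discarded, here is a minimal sketch that strips everything after a "#" and drops the lines that end up empty. The function name `strip_comments` and the trailing "# facet pages" comment are assumptions added for illustration.

```python
def strip_comments(robots_txt):
    """Return the non-empty lines of a robots.txt file with comments removed."""
    rules = []
    for raw_line in robots_txt.splitlines():
        # Everything from the first "#" to the end of the line is a comment.
        line = raw_line.split("#", 1)[0].strip()
        if line:
            rules.append(line)
    return rules


COMMENTED_RULES = """\
#Student
Disallow: /student/*-bed-flats-*
Disallow: /student/*-bed-houses*  # facet pages

#Added by Semetrical
Disallow: /jobs*/full-time/*
"""

for rule in strip_comments(COMMENTED_RULES):
    print(rule)
```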

Is the ordering of rules important?

The ordering of rules is not important. However, when several allow and disallow rules apply to a URL, the rule with the longest matching path is applied and takes precedence over the shorter, less specific rule. If both paths are the same length, the less restrictive rule is used. If you need a specific rule to win, you can pad it with “*” characters to make the path longer. For example, Disallow: ********/make-longer
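The precedence logic can be sketched in a few lines of Python. This is a simplified illustration, not a search engine's actual implementation: it assumes plain prefix matching (no "*" or "$"), and the function name `is_allowed` and the (directive, path) tuple format are assumptions for this example.

```python
def is_allowed(rules, url_path):
    """Apply longest-match precedence to a list of (directive, path) rules.

    directive is "allow" or "disallow"; paths are matched as plain prefixes.
    """
    best_length = -1
    allowed = True  # a URL that no rule matches is crawlable
    for directive, path in rules:
        if not path or not url_path.startswith(path):
            continue  # empty rules and non-matching rules have no effect
        is_allow = directive.lower() == "allow"
        longer = len(path) > best_length
        tie_won_by_allow = len(path) == best_length and is_allow
        if longer or tie_won_by_allow:
            best_length = len(path)
            allowed = is_allow
    return allowed


WP_RULES = [
    ("disallow", "/wp-admin/"),
    ("allow", "/wp-admin/admin-ajax.php"),
]

print(is_allowed(WP_RULES, "/wp-admin/admin-ajax.php"))  # True: longer allow wins
print(is_allowed(WP_RULES, "/wp-admin/options.php"))     # False: only disallow matches
print(is_allowed(list(reversed(WP_RULES)), "/wp-admin/admin-ajax.php"))  # True: order is irrelevant
```

The tie-break in `tie_won_by_allow` reflects the "less restrictive rule" behaviour when two matching paths have the same length.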

On Google’s own website they have listed a sample set of situations which shows the priority rule that takes precedence. The table below was taken from Google.

How to check your robots.txt file?

It is always important to check and validate your robots.txt file before pushing it live as having incorrect rules can have a great impact on your website.

The best way to test is to go to the robots.txt tester tool in Search Console and test different URLs that should be blocked with the rules that are in place. This is also a great way to test any new rules that you are wanting to add to the file.

Examples of using pattern matching in the robots.txt

When creating rules in your robots.txt file, you can use pattern matching to block a range of URLs in one disallow rule. The robots.txt does not support full regular expressions, but there are two special characters that both Google and Bing abide by:

Dollar sign ($) which matches the end of a URL
Asterisk (*) which is a wildcard rule that represents any sequence of characters.

Examples of pattern matching at Semetrical:

Disallow:  */searchjobs/*

This will block any URL that includes the path of /searchjobs/ such as: www.example.com/searchjobs/construction. This was needed for a client as the search section of their site needed to be blocked so search engines would not crawl and index that section of the site.

Disallow:  /jobs*/full-time/*

This will block URLs that include a path after /jobs/ followed by /full-time/ such as www.example.com/jobs/admin-secretarial-and-pa/full-time/. In this scenario we need full time as a filter for UX but for search engines there is no need for a page to be indexed to cater for “job title” + “full time”.

Disallow:  /jobs*/*-000-*-999/*

This will block URLs that include salary filters such as www.example.com/jobs/city-of-bristol/-50-000-59-999/. In this scenario we need salary filters but there was not a need for search engines to crawl salary pages and index them.

Disallow:  /jobs/*/*/flexible-hours/

This will block URLs that include flexible-hours and include two facet paths in between. In this scenario we found via keyword research that users may search for location + flexible hours or job + flexible hours but users would not search for “job title” + “location” + “flexible hours”. An example URL looks like www.example.com/jobs/admin-secretarial-and-pa/united-kingdom/flexible-hours/

Disallow:  */company/*/*/*/people$

This will block a URL that includes three paths between company and people as well as the URL ending with people. An example would be www.example.com/company/gb/04905417/company-check-ltd/people

Disallow:  *?CostLowerAsNumber=*

This rule would block a parameter filter that ordered pricing.

Disallow:  *?Radius=*

Disallow:  *?radius=*

These two rules blocked bots from crawling a parameter URL that changed the radius of a user’s search. Both an uppercase and a lowercase rule were added as the site included both versions.
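The rules above can be sanity-checked with a small matcher. The sketch below translates a rule path into a Python regular expression, treating "*" as any sequence of characters and a trailing "$" as the end of the URL. The function names and any test paths not quoted in the examples above are assumptions for illustration, and real crawlers may differ in edge cases.

```python
import re


def rule_to_regex(rule_path):
    """Translate a robots.txt rule path into a compiled regular expression."""
    anchored = rule_path.endswith("$")
    if anchored:
        rule_path = rule_path[:-1]
    # Escape literal text (including "?"), then turn each "*" into ".*".
    pattern = ".*".join(re.escape(part) for part in rule_path.split("*"))
    if anchored:
        pattern += r"\Z"  # "$" means the URL must end here
    return re.compile(pattern)


def rule_matches(rule_path, url_path):
    """Return True if the rule applies to url_path (path plus query string)."""
    # Rules match from the start of the path; matching is case sensitive.
    return rule_to_regex(rule_path).match(url_path) is not None


print(rule_matches("*/searchjobs/*", "/searchjobs/construction"))
print(rule_matches("/jobs*/full-time/*", "/jobs/admin-secretarial-and-pa/full-time/"))
print(rule_matches("/jobs/*/*/flexible-hours/", "/jobs/united-kingdom/flexible-hours/"))
print(rule_matches("*?Radius=*", "/jobs?radius=10"))
```

The last two calls show the rules not matching: the flexible-hours rule needs two facet paths, and the uppercase "Radius" rule does not cover the lowercase parameter.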

Things to be aware of with the robots.txt

The robots.txt is case sensitive so you need to use the correct casing in your rules. For example, /hello/ will be treated differently to /Hello/.
To get search engines such as Google to re-cache your robots.txt more quickly in order to find new rules, you can inspect the robots.txt URL in Search Console and request indexing.
If your website relies on a robots.txt with a number of rules and your robots.txt URL serves a 4xx status code for a prolonged period of time, the rules will be ignored and the pages that were blocked would become indexable. It is important to make sure it is always serving a 200 status code.
If your website is down then make sure the robots.txt returns a 5xx status code as search engines will understand the site is down for maintenance and they will come back to crawl the website again at a later date.
When URLs are already indexed and a disallow is then added to your website to remove those URLs from the index, it may take some time for those URLs to be dropped and removed. Additionally, URLs can still stay in the index for a while but the meta description will display a message such as “A description for this result is not available because of this site’s robots.txt – learn more”.
A robots.txt disallow rule does not always guarantee that a page will not appear in search results as Google may still decide, based on external factors such as incoming links, that it is relevant and should be indexed.
If you have a disallow rule in place and also place a “noindex” tag within the source code of a page, the “noindex” will be ignored as search engines cannot access the page to discover the “noindex” tag.
A disallow rule on indexed pages, especially those with incoming links, means you will lose the link equity of those backlinks that would otherwise be passed on to benefit other pages. This is why it is important to check if pages have backlinks before adding a disallow rule.
If the leading slash in the path is missing when writing an allow or disallow rule then the rule will be ignored. For example, “Disallow: searchjobs”.
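The status code points in the list above can be summarised as a small lookup. This is only a sketch of the behaviour described in this guide, not any search engine's actual logic; the function name `robots_status_effect` and the wording of the returned strings are assumptions for illustration.

```python
def robots_status_effect(status_code):
    """Summarise how the robots.txt response status affects crawling."""
    if status_code == 200:
        return "rules are read and applied"
    if 400 <= status_code < 500:
        # Applies when the 4xx persists for a prolonged period of time.
        return "rules are ignored and blocked pages become crawlable"
    if 500 <= status_code < 600:
        return "site is treated as down and crawled again later"
    return "not covered by this sketch"


for code in (200, 404, 503):
    print(code, robots_status_effect(code))
```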

If you would like to speak with one of our technical SEO specialists at Semetrical please visit our technical SEO services page for more information.