Generative AI tools, like Google Bard and Bing Chat, are built from many content sources, including the web. To the consternation of many, search engines have been quietly training their AI models on the content they find while crawling for traditional web search.
So, should you block the AIs, and how do you go about it?
Companies that make their own products may consider it a benefit to have their content included in AI models. Information such as technical specifications or product support can help drive sales and reduce customer support costs.
But for many other online businesses, the content is their product. There are valid concerns that the energy invested in creating content will be used to improve AI products owned by the big tech companies without delivering any value in the form of traffic.
Google and Bing are trying to find ways to credit sources and deliver some referral traffic, but it’s likely to be less than traditional web search delivers, and more likely to come from transactional than informational search queries.
It’s important to note that blocking content from these AIs won’t affect crawling behaviour. Google says ‘the robots.txt user-agent token is used in a control capacity.’ Your site will still be crawled as normal by the bots that build the search indexes.
And if the search engines are already blocked from crawling certain pages, you do not need to block them specifically for the AIs.
It’s currently possible to block Google, Bing, and ChatGPT using methods familiar to most SEOs: the robots.txt file and page-level robots directives.
Google and ChatGPT have opted for the robots.txt method which allows you to specify URL patterns, and Bing has opted for using robots directives applied to individual pages.
The robots.txt file has the advantage of being easy to configure for an entire website in a single place. It is also more transparent about which URLs are blocked than page-level robots directives, which have to be tested by fetching every single page.
Bing looks for the nocache or noarchive robots directives, which can be added to a page as a meta tag or in an X-Robots-Tag response header.
Nocache allows pages to be included in Bing Chat answers, but only URLs, titles, and snippets will be used in training Microsoft’s AI models.
Noarchive prevents pages from being included in Bing Chat, and no content will be used in training Microsoft’s AI models.
If a page has both Nocache and Noarchive, the less restrictive Nocache will take precedence.
The 'robots' token applies the directive to all crawlers. This includes Google, where it will prevent the page appearing with a cached link in search results.
<meta name="robots" content="noarchive">
You can use the more specific 'bingbot' or 'msnbot' tokens to avoid affecting other search engines.
<meta name="bingbot" content="nocache">
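The same directives can be sent sitewide as an X-Robots-Tag response header instead of editing page templates. A minimal sketch for nginx — the location block is an illustrative assumption; the 'bingbot:' prefix follows the same user-agent-token pattern as the meta tag:

```nginx
# Send the bingbot-specific nocache directive with every response
# served from this (hypothetical) location.
location / {
    add_header X-Robots-Tag "bingbot: nocache";
}
```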
Google has opted for the robots.txt method which allows you to specify URL patterns to match pages you do not want to be used in Bard and their Vertex API equivalent. It doesn’t currently apply to the Search Generative Experience (SGE).
They will match against a user-agent token of Google-Extended. The case of the token does not matter.
If there is no rule block specifically for the Google-Extended token, it will match against the wildcard token (*).
Be careful if you have a specific rule block for Googlebot and a separate wildcard block: Google-Extended will match the wildcard block, not the Googlebot block.
You can list multiple user-agents before the rule blocks to be more precise.
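Putting that together, a robots.txt sketch might look like the following — the paths are hypothetical examples, not recommendations:

```
# Googlebot keeps its own rules for search crawling.
User-agent: Googlebot
Disallow: /search-blocked/

# Without this block, Google-Extended would fall back to the
# wildcard rules below, not the Googlebot rules above.
User-agent: Google-Extended
Disallow: /articles/

User-agent: *
Disallow:
```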
ChatGPT also opted for the robots.txt method, with two user-agent tokens: GPTBot and ChatGPT-User.
The opt-out system currently treats both user agents the same, so a robots.txt disallow for one agent will cover both. This may change in the future, so we’d recommend blocking them separately.
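Blocking OpenAI’s two tokens, GPTBot and ChatGPT-User, separately could be sketched like this for an entire site:

```
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /
```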
Testing is simple if you are blocking your entire website.
To check whether Google and ChatGPT are blocked, see if your robots.txt has a ‘disallow everything’ rule for the bots you want to block.
If you only want to block some URLs, you may need a more complex set of robots.txt directives. Consider testing a number of URLs that you expect to be blocked and not blocked.
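You can also script these expectations with Python’s built-in urllib.robotparser. A minimal sketch — the rules, URLs, and expected statuses here are invented for illustration:

```python
import urllib.robotparser

# A hypothetical robots.txt blocking Google-Extended from /articles/ only.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /articles/

User-agent: *
Disallow:
"""

# URL -> expected "disallowed" status for the agent being tested.
EXPECTATIONS = {
    "https://example.com/articles/my-post": True,
    "https://example.com/products/widget": False,
}

def failing_urls(robots_txt, agent, expectations):
    """Return the URLs whose blocked status doesn't match expectations."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        url
        for url, expect_blocked in expectations.items()
        if (not parser.can_fetch(agent, url)) != expect_blocked
    ]

print(failing_urls(ROBOTS_TXT, "Google-Extended", EXPECTATIONS))  # → []
```

An empty list means every URL behaved as expected; any URL it returns is one where the live rules disagree with your test.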
Tomo is our free robots.txt tool that can help you test if specific URLs are blocked in robots.txt. You can define tests in the form of a list of URLs, and the expected disallowed status for each URL.
It can be configured with the Google-Extended, GPTBot, and ChatGPT-User user agent tokens to show you which URLs are blocked for each, and if that matches the expected test result.
Whenever your robots.txt file is updated, the tests will be re-run and you’ll be notified if the results do not match what was expected.
To test if Bing is blocked, you can inspect your key page templates in the browser and confirm they include the robots meta tag.
If you’re using an X-Robots-Tag response header, it can be seen in the browser’s network tab by selecting the page in the list of network requests and viewing the ‘Headers’ tab.
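If you want to check many pages rather than inspect them one at a time, a small script can classify the header values. A minimal sketch, assuming you have already collected each page’s X-Robots-Tag values (the example strings are illustrative):

```python
def bing_ai_directive(x_robots_tag_values):
    """Classify Bing AI opt-out status from X-Robots-Tag header values.

    Per Bing, if both directives are present, the less restrictive
    nocache takes precedence.
    """
    directives = set()
    for value in x_robots_tag_values:
        # Values look like "noarchive" or "bingbot: nocache"; keep only
        # directives aimed at all bots or at Bing's tokens.
        parts = [p.strip().lower() for p in value.split(":")]
        if len(parts) == 1 or parts[0] in ("robots", "bingbot", "msnbot"):
            directives.update(d.strip() for d in parts[-1].split(","))
    if "nocache" in directives:
        return "nocache"    # URLs, titles, and snippets only
    if "noarchive" in directives:
        return "noarchive"  # excluded from Bing Chat and training
    return "none"

print(bing_ai_directive(["noarchive", "bingbot: nocache"]))  # → nocache
```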
Testing will be more complicated if you are blocking a specific set of pages, but there are some tools which can help.
The Lumar crawler will now also automatically report all pages where Google’s and Bing’s AIs are blocked.