
NoML Proposal for Fair Use of Content in AI and Search

colin

25 October 2023

4 min

[Image: a copyright symbol with a partial strike through it]

Fair Usage

Commercial search engines crawl and index your web content, monetise a search service, and (used to) play fair by providing traffic back to your content through hyperlinks. Those ten blue links, on Google of old, were a hub fuelling a World Wide Web that we all benefitted from. Fast forward to today and we have AI companies using your web content, compressing it into machine learning models and playing unfair; providing little or no traffic in return.

It’s beyond time these companies played fair again, as search engines used to, and Mojeek still does. A voluntary code of practice, where you can flag your wishes to search engines using robots.txt, has largely worked, but what can be done in the age of generative AI? We have a proposal, which we ask you to consider and support below. But first we’ll explain how it would work.

Search and AI

Let’s consider some scenarios that you may fit into:

  1. You want to be searchable on Google Search and you are happy for your content to be used by anyone for AI training
  2. You want to be searchable on Bing, and on proxies like DuckDuckGo, but you don’t want your content used by Bing Chat
  3. You don’t want to be searchable on Google but you are happy for your content to be used by OpenAI for AI training
  4. You don’t want to be searchable on Mojeek and you don’t want your content used by OpenAI

As you may imagine, this list of examples could get very long. But we can make it simpler by considering generalisations of the four examples above: you want to allow search, or not; and you want to allow AI, or not.

  1. Do use content for search; do use content for machine learning
  2. Do use content for search; do not use content for machine learning
  3. Do not use content for search; do use content for machine learning
  4. Do not use content for search; do not use content for machine learning

Several people and bodies have suggested that all we need to do is adapt usage of robots.txt for AI. But this has a major flaw since most content creators will want to be searchable by search engines, but also have the autonomy to request that their content is not used to train AI products. If you are in that category, as we expect many are, please consider signing the open letter.

The Proposal: Use Meta and X-Robots

A simple and practical approach is to:

Use a meta and X-Robots tag: noml

Meta robots tags are read by search engine crawlers, so companies like Google, which crawl for search but also train AI models, already check for requests made in this way. For example, the noindex meta value tells search engines not to include a page in their search results:

<meta name="robots" content="noindex">

Similarly the noml meta value could be used to instruct any bot that data from that page should not be collected for or used in AI training datasets.

<meta name="robots" content="noml">

These can be used together, to tell search engines not to index the page nor use its content for AI, like this:

<meta name="robots" content="noindex, noml">
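To illustrate how a compliant crawler might honour these declarations, here is a minimal Python sketch using only the standard library; the helper names (RobotsMetaParser, allows_ml_training) are our own illustration, not an existing API:

```python
# Sketch: how a compliant crawler might check a page's meta robots
# directives for the proposed "noml" value.
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directive tokens from <meta name="robots"> tags."""
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            for token in attrs.get("content", "").split(","):
                self.directives.add(token.strip().lower())

def allows_ml_training(html: str) -> bool:
    """True unless the page carries a noml directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noml" not in parser.directives

page = '<html><head><meta name="robots" content="noindex, noml"></head></html>'
print(allows_ml_training(page))  # False: the page opts out of ML training
```

A real crawler would also need to combine these page-level tags with any user-agent-specific tags and X-Robots-Tag response headers.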

These meta declarations are made in the <head> section of any HTML page; for non-HTML resources, the equivalent X-Robots-Tag HTTP header is used, so to flag an AI opt-out you can similarly send:

X-Robots-Tag: noml
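On the consumer side, a dataset builder could check the response headers of a non-HTML resource (a PDF, say) before ingesting it. The sketch below (the function name is hypothetical) only considers un-scoped X-Robots-Tag values, ignoring user-agent-prefixed ones such as "googlebot: noindex":

```python
# Sketch: checking X-Robots-Tag response header values for "noml"
# before adding a non-HTML resource to a training dataset.

def noml_in_headers(header_values):
    """Return True if any un-scoped X-Robots-Tag value contains 'noml'."""
    for value in header_values:
        # Skip user-agent-scoped values like "googlebot: noindex"
        if ":" in value:
            continue
        tokens = {t.strip().lower() for t in value.split(",")}
        if "noml" in tokens:
            return True
    return False

print(noml_in_headers(["noindex, noml"]))  # True
print(noml_in_headers(["noindex"]))        # False
```

The X-Robots-Tag header may appear multiple times in one response, which is why the check takes a list of values.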

Just as for the meta tag, where name="robots" refers to all user-agent tokens, you can also target individual user-agents. For example, you can currently request that Google does not include a page in its search results:

<meta name="googlebot" content="noindex">

and, similarly, you could request that Google does include a page in the search index, but opt out of AI usage by Google with:

<meta name="googlebot" content="noml">

<meta name="google-extended" content="noml">

You could request that Bing does include a page in the search index, but opt out of AI usage by Microsoft with:

<meta name="bingbot" content="noml">

although in this case, since OpenAI is not operating a search engine, you could (also) block them from crawling in the robots.txt file using:

User-agent: GPTBot
Disallow: /
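You can check that such a rule blocks OpenAI's crawler with Python's standard urllib.robotparser, here feeding the robots.txt content in directly rather than fetching it from a site:

```python
# Sketch: verifying that a robots.txt rule blocks OpenAI's crawler.
# The official user-agent token is "GPTBot"; robots.txt user-agent
# matching is case-insensitive by convention.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("GPTBot", "https://example.com/article"))        # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/article"))  # True
```

Note that this only expresses the site's wishes; it is still up to the crawler to consult robots.txt and comply.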

Of course, just like robots.txt for search, this proposal will not work on its own against bad actors. In those cases one will need to fall back, as now, on tools such as Web Application Firewalls. Still, it is a practical way to separate wishes for search engine crawling from those for AI usage.

This proposal can also be used to flag when content should not be made available, through APIs, for machine learning training and AI augmentation; for example, if search providers like Bing and Mojeek make the noml request available in their API feeds.

Sign the Open Letter

If you want your content searchable on search engines, but not used for AI training, the existing proposals to use robots.txt have limitations. Indeed, loopholes are very likely already being exploited by several big and many small players.

This proposal will enable you, and others, to express your wishes simply and clearly. Google or indeed Microsoft may not like this idea, but if you do, you can help by spreading the message and signing the open letter.

