Will Chatbots Kill the Search Star?

09 February 2023

Everyone, it seems, is playing with chatbots, and search engines are looking to get in on the action. But what lies underneath the AIs of chatbots? And why do they suddenly seem so adept at churning out plausible answers?

Intelligent or Smart?

Three things make chatbots seem smart: data, training, and computing. Lots of all three are needed for chatbots and for other AIs based on machine learning.

Chatbots are based on what are called LLMs (Large Language Models), which generate readable text based on statistical models of data. An innovation called transformers provides a new way of predicting the next word in a response, together with a memory of the words that came before. The recent advances come from tuning these models using reinforcement learning from human feedback (RLHF). Training also comes into play when chatbots are interacting with you. “You train and the machine learns”, so with chatbots watch out for a new form of “you are the product”.
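To make "predicting the next word from statistical models of data" concrete, here is a deliberately tiny sketch. It is not a transformer and not how any production chatbot works: it is a simple bigram counter that only remembers one previous word, whereas an LLM conditions on a long run of preceding text. The function names and the toy corpus are made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows


def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = model.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Toy corpus (an assumption for illustration; real models train on billions of pages).
corpus = "the cat sat on the mat and the cat slept"
model = train_bigram_model(corpus)

print(predict_next(model, "the"))  # "cat" follows "the" twice, "mat" only once
```

Even at this scale the point of the article shows through: the prediction is only as good as the text the counts were built from, which is why the size and freshness of the underlying data matter so much.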

ChatGPT certainly seems smarter than previous open-dialogue chatbots. This is achieved not least by using better-designed guard rails to avoid problems caused by out-of-date training data and hallucination. Lessons were learned from Microsoft’s Tay (2016) and Meta’s BlenderBot (March 2022), both of which attracted predominantly negative publicity soon after their launches.

It’s the Data, Stupid

What’s rarely mentioned is the dependence of LLMs on search indexes, and that most are based on Common Crawl, a freely available web text search index. However, Common Crawl is limited in size, comprising 3.3 billion pages; it is only available as a data dump; and it is always out of date. With the trend in LLMs moving towards larger data sizes rather than more model parameters, LLMs based on Common Crawl may well fall behind. Companies with access to large web search indexes will be at an advantage, and obviously that includes Google (and their subsidiary DeepMind) and Microsoft (and their partner OpenAI). The only other large web text search index outside of Russia and Asia is Mojeek, which has a growing index of currently just over 6.6 billion pages.

Google/DeepMind are notably vague on the main web data source for their LLMs. If they are using data from their search index, why don’t they say so? And will GPT-4 be based on data from the Bing search index? Will Google and Microsoft continue to restrain AI competitors through restrictive terms in their search APIs? Others may follow Meta and use the flexibility of the Mojeek Search API for chatbots and machine learning applications.

The Web, Search and Chatbots

Everyone seems to agree search is going to be transformed, but no one can really predict how. Microsoft have announced they will incorporate ChatGPT into Bing, and Google have just announced Bard.

Chatbot answers may be convenient, but will they always be more useful? After all, search engines with hyperlinks encourage discovery and navigation on the web. If search engines turn into answer engines, will that further suffocate the web? Either way, with many AIs based on data from search indexes, the search engines Google, Bing, and Mojeek may hold a trump card: large-scale web data.

colin