Search Engine Myths?

11 June 2024

5 min

King Midas touching things and turning them into gold.

Myths abound about search engines, and notably about Google Search. We briefly explain our perspective on five important aspects of search engines that should be better understood.

There are many Search Engines

There are very few international search engines, and Mojeek is the only one based in Europe. Google dominates market share in most of the world, the notable exceptions being Russia (Yandex) and China (Baidu). English-language search engines can be seen on the search engine map. There are many (over 100) search or metasearch services (often called proxies) that claim to be, or are referred to as, search engines, but all of them take their results from actual search engines: usually Bing or Google, and increasingly Mojeek, as the search engine map shows. Sometimes web scrapers will pretend to be a known web crawler. One strategy is to pretend to be Googlebot in the hope of obtaining the widest access. So how do you identify these pretenders and scrapers?
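One widely documented way to answer that question is a reverse DNS lookup on the visitor's IP address followed by a forward lookup to confirm the match. The sketch below illustrates the idea in Python. The function names are ours, and the hostname suffixes are an assumption that should be checked against the crawler operator's current documentation. The DNS lookups are passed in as parameters so the logic can be exercised without network access.

```python
import socket
from typing import Callable, List, Tuple

# Assumed hostname suffixes for Google's crawlers; verify against the
# operator's published documentation before relying on them.
GOOGLE_SUFFIXES: Tuple[str, ...] = (".googlebot.com", ".google.com")


def reverse_lookup(ip: str) -> str:
    """Return the PTR hostname registered for an IP address."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> List[str]:
    """Return the IPv4 addresses that a hostname resolves to."""
    return socket.gethostbyname_ex(hostname)[2]


def is_verified_crawler(
    ip: str,
    suffixes: Tuple[str, ...] = GOOGLE_SUFFIXES,
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> bool:
    """Check that an IP claiming to be a known crawler really belongs to it.

    Step 1: the reverse DNS name must end with one of the expected suffixes.
    Step 2: that name must resolve back to the same IP, since a PTR record
            alone can be set by whoever controls the address block.
    """
    try:
        hostname = reverse(ip).rstrip(".").lower()
        if not hostname.endswith(suffixes):
            return False
        return ip in forward(hostname)
    except OSError:
        # Covers socket.herror and socket.gaierror: no usable DNS answer.
        return False
```

A User-Agent header alone proves nothing, because any client can send any string. The two-step DNS check is harder to fake, as the pretender would need to control DNS for the crawler operator's domain.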

Only Google is Allowed to Crawl at Web Scale

While Google does have extensive web crawling capabilities, it is not the only entity allowed to crawl the web at scale. Other search engines that crawl the web respectfully are generally allowed wide access via the robots.txt file on websites. For example, “verified” good bots are monitored by the likes of Cloudflare. The voluntary arrangement of robots.txt obviously depends on good actors, and there seem to be fewer and fewer of them around. The situation is made worse because there is currently no robust way to declare that you want your content searchable on search engines but not used for machine learning. So far, seven search companies and projects, plus individuals, have co-signed the NoML open letter, which proposes a solution to this.
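To make the robots.txt arrangement concrete, here is a minimal sketch using Python's standard urllib.robotparser module. The crawler names ExampleSearchBot and ExampleMLBot and the example.com URL are hypothetical. Note that the rules are keyed on crawler name rather than on purpose, which is exactly the limitation described above: a site can block a named bot, but it cannot say "search yes, machine learning no" for a crawler that does both, and nothing forces a bot to comply.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleMLBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/page"
    print("ExampleSearchBot:", allowed("ExampleSearchBot", page))
    print("ExampleMLBot:", allowed("ExampleMLBot", page))
```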

Tracking is Needed for Ads

While massive data collection is used by Big Tech to optimise targeted advertising, it is not necessary for selecting relevant search ads. Such contextual advertising does not rely on user tracking. Users implicitly express an intention with their queries, so search engines can serve highly relevant search ads based on the search query, plus supplementary data such as language and location, which the user may be able to control (fully, in the case of Mojeek). Contextual ads based on search queries can thus be highly relevant, timely and privacy-preserving.
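As a rough illustration of the principle, the sketch below picks ads using only the query text and a language setting. It is a toy example under our own assumptions, not a description of any real ad system: the Ad type, the keyword-overlap scoring and the sample ads are all hypothetical. The point is what the function does not take as input: no user identifier, no history, no profile.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Ad:
    """A hypothetical ad with the keywords it should be shown for."""
    title: str
    keywords: FrozenSet[str]
    language: str


def select_ads(query: str, language: str, ads: List[Ad], limit: int = 2) -> List[Ad]:
    """Rank ads by keyword overlap with the query; no user data involved."""
    terms = set(query.lower().split())
    scored = []
    for ad in ads:
        if ad.language != language:
            continue  # respect the language the user selected
        overlap = len(terms & ad.keywords)
        if overlap:
            scored.append((overlap, ad))
    # Sort on the score only; the sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ad for _, ad in scored[:limit]]


ADS = [
    Ad("Trail running shoes", frozenset({"running", "shoes", "trail"}), "en"),
    Ad("City bike hire", frozenset({"bike", "hire", "cycling"}), "en"),
    Ad("Laufschuhe", frozenset({"laufschuhe", "running"}), "de"),
]
```

Calling select_ads("best running shoes", "en", ADS) returns only the trail running shoes ad, because it is the only English ad sharing terms with the query.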

Building and Running a Search Engine is Costly

It is often stated by technology writers and the media that search engines are costly to build and maintain. But as Mojeek has shown, with determination, perseverance, and prudent spending, building and maintaining a search engine can be very cost-effective. Mojeek has been developed and built up over 15 years without spending huge amounts of money, and yet has one of the largest web search indexes in existence today. Its index of over 8 billion pages can return answers to search queries within 200ms on average. Furthermore, Mojeek is built entirely on its own algorithms and runs on servers maintained in-house, which dramatically reduces costs compared with cloud services.

Generative AI will Kill Search

Generative AI is poised to transform many aspects of technology, including how we search for information. However, AI is more likely to augment and integrate with traditional search technologies, enhancing capabilities such as answering queries directly or summarising the content found. It may change the way search engines operate, but it will not render them obsolete. In any case, foundation models in generative AI are built on datasets, and the biggest datasets used are search engine databases. Generative AI also increasingly uses search engine APIs as a source of data for RAG (Retrieval-Augmented Generation).
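The retrieval step in RAG can be sketched in a few lines. In the example below, the search argument stands in for a call to a search engine API and is purely hypothetical, as is the prompt wording. No real API or model is invoked. It shows only how search results become the context a generative model is asked to answer from.

```python
from typing import Callable, List


def build_rag_prompt(
    question: str,
    search: Callable[[str], List[str]],
    top_k: int = 3,
) -> str:
    """Retrieve snippets for the question and assemble a grounded prompt.

    `search` is a placeholder for a search engine API call that returns
    text snippets, best match first.
    """
    snippets = search(question)[:top_k]
    sources = "\n".join(
        f"[{i}] {snippet}" for i, snippet in enumerate(snippets, start=1)
    )
    return (
        "Answer using only the sources below.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}"
    )


def fake_search(query: str) -> List[str]:
    """Stub returning canned snippets so the sketch runs offline."""
    return ["Snippet A", "Snippet B", "Snippet C", "Snippet D"]
```

Without the search step there is nothing current to ground the answer in, which is one reason generative AI depends on search rather than replacing it.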

colin