Automating Custom Search Engines

05 December 2024


[Image: a code block showing the process outlined in this piece, using an LLM with Mojeek's API to automate custom search engines in TypeScript.]

The web is vast and increasingly full of low-quality information. Imagine if you could automate the process of searching across sites that are particularly relevant to the context of your search. Using the Mojeek API we have done just that, and explain here how you can do the same. With this you can build an automated custom search engine for any topic, category or theme.

In the early days of web search, human categorisation of websites was popular. As the web grew, this human curation became impractical, and Google, with its algorithmic ranking and curation, emerged as the dominant platform. Initially that was a huge leap forward, and the growing expanse of the early web became discoverable. In recent times, however, things have gotten worse, with SEO spam, AI answers and slop suppressing the rich, diverse and more valuable human content that is still there.

At Mojeek we work hard to bring you results from the long tail of the web, and we are always seeking better ways to surface the most relevant pages for your searches. The Mojeek API gives you more ways to do this yourself, one of which is creating automated custom search engines. We detail how to do that below, but first a reminder of what you can do with non-automated curation.

Human Curation

Mojeek Focus and Custom Search Engines (which use Mojeek Focus through the Mojeek API) are very useful for finding good, and often the most relevant, search results. Still, they do require some human curation of site lists to set up.

In a previous post we demonstrated how you can do this and search the parts of the web you value, using Mojeek Focus and Custom Search Engines. To do so, we used the website include (and exclude) functions to define a set of websites to search across. Building a list of these sites isn’t too difficult, particularly when you know the topic, theme or segment you are designing the search engine for. The important point is that such site lists are static, not dynamic. So we imagined how you might automate this process with dynamic site lists.

Automated Curation

To make an automated custom search engine we implemented the following three steps:

1. Send the Mojeek search query to an LLM (Large Language Model), along with a system prompt asking it to return a list of relevant websites to search across.
2. Parse the returned site list into a form suitable for the Mojeek API.
3. Call the Mojeek API with the query and this site list to search across.

For the details on how we did this, let’s take a look at a particular search query example, and illustrate how it might be done with some code snippets (here in TypeScript).

First we define the variables to be used:

    // Define automated custom search engine variables
    let focusPrompt: string;
    let focusInput: string;
    let numberSites: string = '100';
    let query: string = searchText;

The ‘searchText’ for the example shown here was ‘AI+nuclear+power’. In general this should of course be an input from your user, as it would be for a regular Mojeek search: https://www.mojeek.com/search?q=AI+nuclear+power

Also defined is how many sites to search across (numberSites), plus the variables for prompting.

We developed and defined a system prompt which refers to the number of sites and the search query, as follows:

    // Set up the LLM system prompt for an automated custom search engine, based on the user input query
    focusPrompt = 'Give me only a list of ' + numberSites + ', or less, web sites that have useful and interesting text content about the search <Query> provided. ';
    focusPrompt += 'Do not provide any extra text such as an introduction, summary, explanations, descriptions or any other text. ';
    focusPrompt += 'The web site should be domains or subdomains. Do not use paths to parts of domains or subdomains. ';
    focusPrompt += 'Do not number the list. Return only the domain names as a comma delimited list.';
    focusInput = '<Query> ' + query;


The system prompt above seems to work well enough across a range of LLMs. It can no doubt be improved in general and/or tuned for specific models.

Next you would choose and define the LLM. Obviously this definition will depend on your LLM of preference, and how you or your provider deploy it. At the end of this post we will comment on the choices of number of sites and models.

We then bundle the model choice, system prompt and query into a payload:

    // Construct payload for LLM API
    const payload = {
        model: model,
        messages: [
            { role: 'system', content: focusPrompt },
            { role: 'user', content: focusInput }
        ]
    };

and then send this payload to the appropriate API, defined here by apiURL:

    // Call the LLM API
    const response = await fetch(apiURL, {
        method: 'POST',
        headers: {
            'Content-Type': 'application/json',
            'Authorization': `Bearer ${apiKEY}`
        },
        body: JSON.stringify(payload)
    });
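The snippet above stops at the fetch call. As a rough sketch of the next step, the helper below pulls the model's reply text out of the parsed JSON body. It assumes an OpenAI-compatible chat completions response, where the text lives at choices[0].message.content; the post does not say which response format was used, so other providers may need a different accessor. The name extractReplyText is hypothetical.

```typescript
// Hypothetical helper, not from the original post. ASSUMPTION: the LLM API
// returns an OpenAI-compatible chat completions body, with the reply text
// at choices[0].message.content.
interface ChatCompletionLike {
  choices?: Array<{ message?: { content?: string | null } }>;
}

function extractReplyText(body: unknown): string {
  const content = (body as ChatCompletionLike | null)?.choices?.[0]?.message?.content;
  // Fail loudly rather than sending an empty site list on to the search API.
  if (typeof content !== 'string' || content.trim() === '') {
    throw new Error('LLM response did not contain any reply text');
  }
  return content.trim();
}

// Possible usage after the fetch call (assumed, not from the original post):
//   if (!response.ok) throw new Error(`LLM API returned ${response.status}`);
//   const siteListText = extractReplyText(await response.json());
```

Throwing on an empty reply keeps a failed LLM call from silently turning into an unrestricted search.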

We then parsed the list of sites in the LLM API response, with a few lines of code, into a form suitable for the Mojeek API. This is necessary because LLMs don’t always give you the exact format required, which is a comma-delimited list; for example, LLMs often add spaces after the commas, so these were removed. We also chose to add a period before each domain.tld, so that Mojeek searches across subdomains. You may choose not to do this, but note:

if you send domain.tld to the API, results will not be returned from any subdomains of domain.tld
if you send .domain.tld to the API, results will be returned for any subdomains of domain.tld (e.g. results from mojeek.com and blog.mojeek.com)
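The post describes this parsing step without showing it, so here is a minimal sketch of what it might look like. The function name parseSiteList and its cleaning rules (dropping a protocol, path or leading www., lowercasing, de-duplicating and capping the count) are assumptions for illustration rather than Mojeek's own code. The leading period follows the subdomain behaviour described in the two points above.

```typescript
// Hypothetical parser, not from the original post: turns raw LLM output into
// a comma-delimited site list with no spaces and a leading period per domain.
function parseSiteList(raw: string, maxSites: number = 100): string {
  const domains: string[] = [];
  // LLMs sometimes return one site per line instead of commas, so split on both.
  for (const item of raw.split(/[,\n]/)) {
    if (domains.length >= maxSites) break;
    const domain = item
      .trim()
      .toLowerCase()
      .replace(/^[a-z]+:\/\//, '') // drop a protocol such as https://
      .replace(/\/.*$/, '')        // drop any path the LLM added anyway
      .replace(/^\.+/, '')         // drop periods the LLM already added
      .replace(/^www\./, '');      // assumption: treat www.domain.tld as domain.tld
    // Keep only entries that look like a domain name, and skip duplicates.
    const looksLikeDomain = /^[a-z0-9-]+(\.[a-z0-9-]+)+$/.test(domain);
    if (looksLikeDomain && domains.indexOf(domain) === -1) {
      domains.push(domain);
    }
  }
  // The leading period asks for results from subdomains as well.
  return domains.map((d) => '.' + d).join(',');
}
```

Entries that do not look like a domain, such as a stray sentence of explanation from the LLM, are dropped rather than passed on to the search API.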

The parsed list of sites should be a string which will be sent to the Mojeek API. For the example search query, the parsed site list was as follows for a particular LLM:

[Image: a parsed list of sites from this process, including world-nuclear.org, iaea.org, nei.org, and forbes.com]

We then call the Mojeek API, using the &fi parameter for the sites to search across:

https://api.mojeek.com/search?api_key=apiKEY&fmt=json&q=query&fi=sitelist
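Because the query comes from the user, it is probably worth URL-encoding the values rather than concatenating them directly. The sketch below builds the same URL with the api_key, fmt, q and fi parameters shown above. The helper name buildMojeekUrl is hypothetical, and it assumes the query is passed as raw text with spaces, not the pre-encoded 'AI+nuclear+power' form.

```typescript
// Hypothetical helper, not from the original post: builds the Mojeek API
// search URL from the parameters used in this article.
function buildMojeekUrl(apiKey: string, query: string, siteList: string): string {
  const params: Array<[string, string]> = [
    ['api_key', apiKey],
    ['fmt', 'json'],
    ['q', query],     // raw user text, e.g. 'AI nuclear power'
    ['fi', siteList], // comma-delimited sites to search across
  ];
  // encodeURIComponent escapes spaces, commas and other reserved characters.
  const queryString = params
    .map(([name, value]) => name + '=' + encodeURIComponent(value))
    .join('&');
  return 'https://api.mojeek.com/search?' + queryString;
}

// Possible usage (the key is a placeholder):
//   const url = buildMojeekUrl('YOUR_API_KEY', 'AI nuclear power', '.iaea.org,.nei.org');
//   const results = await (await fetch(url)).json();
```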

You can then use the API response for the service provided to the user, or process it further as preferred. Part of the API JSON response for the example call is shown below:

[Image: the API JSON response for AI nuclear power, featuring results from Forbes, CNET, The Independent, Axios, and ABC.]

Choosing Models and Numbers of Sites

We have tried the above method using a variety of models and model sizes, including Claude-3-Haiku, Llama-3-8b, Llama-3.2-90b, Mistral-7b, Mixtral-8x7b, GPT-3.5 and GPT-4o. Not surprisingly, the bigger and later models generally work better, but they are more expensive to use and have higher latency. Which you prefer will depend on your context, preference and need for privacy.

With regard to numbers of sites, we have found that sometimes 25 sites can work well, but sometimes 100 sites is much better. We allow 25 sites per search on all Mojeek API plans, and up to 100 sites on Custom plans. You can find further details of the available plans and sign up on our Mojeek API Page.
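If the number of sites is configurable in your application, it may help to clamp it to these limits before building the prompt. This is a small sketch under that assumption. The names clampNumberSites and isCustomPlan are hypothetical, and the 25 and 100 limits are the ones stated above.

```typescript
// Limits as stated in this post: 25 sites on all plans, up to 100 on Custom plans.
const STANDARD_PLAN_MAX_SITES = 25;
const CUSTOM_PLAN_MAX_SITES = 100;

// Hypothetical helper, not from the original post: keeps the requested number
// of sites within the limit for the plan in use.
function clampNumberSites(requested: number, isCustomPlan: boolean): number {
  const limit = isCustomPlan ? CUSTOM_PLAN_MAX_SITES : STANDARD_PLAN_MAX_SITES;
  // Fall back to a single site for zero, negative or non-numeric input.
  if (!isFinite(requested) || requested < 1) return 1;
  return Math.min(Math.floor(requested), limit);
}

// Possible usage when setting up the prompt variables:
//   let numberSites: string = String(clampNumberSites(100, false));
```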

colin