Automating Custom Search Engines
The web is vast and increasingly full of low-quality information. Imagine if you could automate the process of searching across sites that are particularly relevant to the context of your search. Using the Mojeek API we have done just that, and explain here how you can do the same. With this you can build an automated custom search engine for any topic, category or theme.
In the early days of web search, human categorisation of websites was popular. As the web grew this human curation became impractical, and Google's algorithmic ranking and curation emerged as the dominant approach. Initially that was a huge leap forward, and the growing expanse of the early web became discoverable. In recent times, however, things have gotten worse, with SEO spam, AI answers and slop suppressing the rich, diverse and more valuable human content that is still there.
At Mojeek we work hard to bring you results from the long tail of the web, and we are always seeking better ways to surface the most relevant pages for your searches. With the Mojeek API you have more ways to choose how to do this yourself. One of these is to create automated custom search engines, and we detail below how you can do this using the API. But first, a reminder about what you can do with non-automated curation.
Human Curation
Mojeek Focus and Custom Search Engines (which use Mojeek Focus through the Mojeek API) are very helpful for finding useful, and often the most relevant, search results. Still, they do require some human curation of site lists to set up.
In a previous post we demonstrated how you can do this and search the parts of the web you value, using Mojeek Focus and Custom Search Engines. To do so, we used the website include (and exclude) functions to define a set of websites to search across. Building a list of these sites isn’t too difficult, particularly when you know the topic, theme or segment you are designing the search engine for. The limitation is that such site lists are static, not dynamic. So we imagined how you might automate this process with dynamic site lists.
Automated Curation
To make an automated custom search engine we implemented the following three steps:
- Send the Mojeek search query to an LLM (Large Language Model), along with a system prompt asking it to return a list of relevant websites to search across.
- Parse the returned site list into a form suitable for the Mojeek API.
- Call the Mojeek API with the query and this site list to search across.
For the details on how we did this, let’s take a look at a particular search query example, and illustrate how it might be done with some code snippets (here in TypeScript).
First we define the variables to be used:
The ‘searchText’ for the example shown here was ‘AI+nuclear+power’. In general this should, of course, be an input variable from your user, as it would be for a regular Mojeek search: https://www.mojeek.com/search?q=AI+nuclear+power
Also defined is how many sites to search across (numberSites), plus the variables for prompting.
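As a rough sketch of what such definitions might look like: only searchText and numberSites are names taken from this post, while queryForPrompt and the value 25 are illustrative assumptions rather than the original snippet.

```typescript
// The search query, in the "+"-separated form used in a Mojeek search URL.
// In a real service this would come from your user.
const searchText: string = "AI+nuclear+power";

// How many sites the LLM should be asked to suggest (25 is an assumed example value).
const numberSites: number = 25;

// Assumed helper variable: the same query as plain text, for use in a prompt.
// "+" is turned into a space before any percent-escapes are decoded.
const queryForPrompt: string = decodeURIComponent(searchText.replace(/\+/g, " "));
```

Keeping a plain-text copy of the query avoids sending URL-encoding characters to the LLM.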
We developed a system prompt which refers to both the number of sites and the search query, as defined above.
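The wording below is a hypothetical prompt, not the one we used; it is only meant to show how a prompt can be built from the number of sites and the search query. The function name buildSystemPrompt is likewise an assumption.

```typescript
// Build a system prompt from the number of sites and the plain-text query.
// The prompt wording is illustrative only.
function buildSystemPrompt(numberSites: number, query: string): string {
  return [
    `You suggest websites to search across for a user's search query.`,
    `The query is: "${query}".`,
    `Return the ${numberSites} most relevant websites for this query.`,
    `Reply with a comma-delimited list of domains only (for example: example.com,example.org),`,
    `with no explanations, numbering or full URLs.`,
  ].join(" ");
}

// Example usage with the query from this post.
const examplePrompt: string = buildSystemPrompt(25, "AI nuclear power");
```

Asking explicitly for a comma-delimited list of domains keeps the later parsing step small, although the output still needs cleaning.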
The system prompt above seems to work well enough across a range of LLMs. It can no doubt be improved in general and/or tuned for specific models.
Next you would choose and define the LLM. Obviously this definition will depend on your LLM of preference, and how you or your provider deploy it. At the end of this post we will comment on the choices of number of sites and LLM.
We then bundle the model choice, system prompt and query into a payload, and send this payload to the appropriate API, defined here by apiURL.
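The exact payload depends on your LLM provider. The sketch below assumes an OpenAI-compatible chat completions endpoint, which many providers offer; the request and response shapes, and the names buildPayload, requestSiteList and llmKey, are assumptions rather than our exact code.

```typescript
// Assumed request shape for an OpenAI-compatible chat completions API.
interface ChatMessage {
  role: "system" | "user";
  content: string;
}

interface ChatPayload {
  model: string;
  messages: ChatMessage[];
}

// Bundle the model choice, system prompt and query into a payload.
function buildPayload(model: string, systemPrompt: string, query: string): ChatPayload {
  return {
    model,
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: query },
    ],
  };
}

// Send the payload to the LLM API at apiURL and return the raw text reply.
async function requestSiteList(apiURL: string, llmKey: string, payload: ChatPayload): Promise<string> {
  const response = await fetch(apiURL, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${llmKey}`,
    },
    body: JSON.stringify(payload),
  });
  if (!response.ok) {
    throw new Error(`LLM request failed with status ${response.status}`);
  }
  // Assumed response shape: choices[0].message.content holds the reply text.
  const data = (await response.json()) as {
    choices?: { message?: { content?: unknown } }[];
  };
  const text = data.choices?.[0]?.message?.content;
  if (typeof text !== "string") {
    throw new Error("Unexpected LLM response shape");
  }
  return text;
}
```

If your provider uses a different request format, only buildPayload and the response handling should need to change.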
We then parsed the list of sites in the LLM API response, with a few lines of code, into a form suitable for the Mojeek API. This is necessary as LLMs don’t always give you the exact format required, which is a comma-delimited list; for example, LLMs often add spaces after the commas, so these were removed. We also chose to add a period before each domain.tld, so that Mojeek searches across subdomains. You may choose not to do this, but note:
- if you send domain.tld to the API, results will not be returned from any subdomains for each domain.tld
- if you send .domain.tld to the API, results will be returned for any subdomains of domain.tld (e.g. results from mojeek.com and blog.mojeek.com)
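A minimal sketch of this parsing step is shown here. Removing whitespace and adding the leading period follow the description above; stripping URL schemes and paths, dropping duplicates and limiting the list length are extra precautions we are assuming you might want, and the function name parseSiteList is illustrative.

```typescript
// Turn raw LLM output into a comma-delimited list of ".domain.tld" entries.
function parseSiteList(llmOutput: string, numberSites: number): string {
  const sites = llmOutput
    .split(/[,\n]/) // accept commas or newlines as separators
    .map((site) => site.trim().toLowerCase()) // remove spaces after commas
    .map((site) =>
      site
        .replace(/^https?:\/\//, "") // drop a URL scheme if the LLM added one
        .replace(/\/.*$/, "") // drop any path
        .replace(/^\.+/, "") // drop leading periods so they are not doubled
    )
    .filter((site) => /^[a-z0-9-]+(\.[a-z0-9-]+)+$/.test(site)); // keep plausible domains only

  // Remove duplicates, respect the site limit, then prefix each domain with a
  // period so that subdomains are searched too.
  const unique = Array.from(new Set(sites)).slice(0, numberSites);
  return unique.map((site) => `.${site}`).join(",");
}

// Example usage with placeholder domains.
const exampleSiteList: string = parseSiteList(
  "example.com, example.org,\nhttps://example.net/path",
  25
);
```

If you do not want results from subdomains, return unique.join(",") instead of adding the period.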
The parsed list of sites should be a string which will be sent to the Mojeek API. For the example search query, the parsed site list was as follows for a particular LLM:
We then call the Mojeek API, using the &fi parameter for the sites to search across:
https://api.mojeek.com/search?api_key=apiKEY&fmt=json&q=query&fi=sitelist
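One way to build and call this URL is sketched below. The parameters api_key, fmt, q and fi are those shown above; the function names are assumptions, and the response is returned as untyped JSON rather than assuming its structure.

```typescript
const MOJEEK_SEARCH_URL = "https://api.mojeek.com/search";

// Build the request URL. URLSearchParams handles the encoding, so pass the
// query as plain text ("AI nuclear power") rather than pre-encoded with "+".
function buildMojeekUrl(apiKey: string, query: string, siteList: string): string {
  const params = new URLSearchParams({
    api_key: apiKey,
    fmt: "json",
    q: query,
    fi: siteList, // comma-delimited sites to search across
  });
  return `${MOJEEK_SEARCH_URL}?${params.toString()}`;
}

// Call the Mojeek API and return the parsed JSON response.
async function searchMojeek(apiKey: string, query: string, siteList: string): Promise<unknown> {
  const response = await fetch(buildMojeekUrl(apiKey, query, siteList));
  if (!response.ok) {
    throw new Error(`Mojeek API request failed with status ${response.status}`);
  }
  return response.json();
}
```

Building the URL with URLSearchParams rather than string concatenation avoids problems when the query contains characters such as & or #.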
You can then use the API response for the service provided to the user, or process it further as preferred. Part of the API JSON response for the example call is shown below:
Choosing Models and Numbers of Sites
We have tried the above method using a variety of models and model sizes including Claude-3-Haiku, Llama-3-8b, Llama-3.2-90b, Mistral-7b, Mixtral-8x7b, GPT-3.5 and GPT-4o. Not surprisingly, the bigger and later models generally work better, but they are more expensive to use and have higher latency. Which you prefer will depend on your context, preference and need for privacy.
With regard to numbers of sites, we have found that sometimes 25 sites can work well, but sometimes 100 sites is much better. We allow 25 sites to be called on all plans for the Mojeek API and we allow up to 100 sites on Custom plans. You can find further details of the available plans and sign up on our Mojeek API Page.