About Mojeek; Open Source, TOR and Crawling
05 July 2022
We frequently get questions asking why Mojeek is built as it is, and why certain pieces of functionality are, or are not, included. We thought it would be a good idea to publish answers to some of these questions on our blog. (Part 1 of these FAQs is available here, part 2 here, and part 3 here.)
Will You Open Source Mojeek?
Search engine optimization (SEO) as a practice makes it very difficult for a search engine to be open source whilst also protecting its results from manipulation by malicious actors. If Mojeek’s methods and algorithms were completely open to the public, then SEO professionals could devise ways to get to the top of Mojeek’s results without providing any real value. There is a balance somewhere between transparency and necessary confidentiality that we will continually evaluate. We greatly appreciate, and benefit from, open-source initiatives such as cURL and Linux, and are always looking for ways that we can contribute to and share with the FOSS community.
Will You Make Mojeek Accessible via a .Onion Address?
We do not provide access to Mojeek through a .onion address and are not intending to do so in the short term. We agree with the aims of the TOR Project, which allows people to escape both surveillance and subsequent actions from freedom-limiting actors. We are thus increasingly inclined to provide this access, but at the moment we have made the decision to wait. This is something that we will be revisiting periodically.
As a search engine we deal every day with automated queries from bot traffic, which consume a large portion of our resources. We regularly have to develop and deploy new ways of blocking access to Mojeek for the many bad actors that attack us, whilst preserving that access for good actors. Dealing with this is a major challenge for an independent search engine, and we would prefer to devote the time and resources spent in this area to improving Mojeek for both users and customers. Unfortunately, this bot activity is a reality for all search engines.
This is our issue when it comes to providing a .onion address through which you can search on Mojeek: estimates in the past have suggested that up to 94% of traffic through the network is malicious. Sadly, this reality is not very different for Mojeek. So whilst we understand and sympathise with the legitimate uses of the TOR network, we have to be careful about how we deploy our resources. Currently we are not in a place where we could handle the increase in automated traffic from bad actors that we could get through TOR whilst continuing to provide a great service to legitimate users.
Will You Crawl My Website at [domain.tld]?
It is possible that the website in question has not been reached because it lacks good and prominent links pointing to it. You can check whether a site is in the Mojeek index by entering a query of the form site:domain.tld, with domain and tld switched out to match your website.
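For anyone who wants to script this check, here is a minimal sketch. It assumes the `site:` search operator and the public search URL `https://www.mojeek.com/search`; the function names `build_site_query` and `build_search_url` are hypothetical and used only for illustration. It builds the query and a link to open in a browser, and does not fetch or parse any results.

```python
from urllib.parse import urlencode, urlparse

# Assumed public search endpoint; adjust if it differs.
MOJEEK_SEARCH = "https://www.mojeek.com/search"


def build_site_query(website: str) -> str:
    """Turn a URL or bare domain into a site: query (assumed operator)."""
    # urlparse only fills in the host when a scheme or "//" prefix is present.
    candidate = website if "://" in website else "//" + website
    host = urlparse(candidate).hostname
    if not host:
        raise ValueError(f"could not find a domain in {website!r}")
    return f"site:{host}"


def build_search_url(website: str) -> str:
    """Return a search URL that can be opened in a browser."""
    return MOJEEK_SEARCH + "?" + urlencode({"q": build_site_query(website)})


if __name__ == "__main__":
    print(build_site_query("example.com"))
    print(build_search_url("https://Example.com/about"))
```

If the search returns no pages for your domain, the site has most likely not been crawled yet.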
The other possibility is that we are blocked from crawling the website in question. We’ve written before about robots.txt, the rules by which good bots on the web play, and in a few cases, such as Facebook and LinkedIn, following these rules means we are unable to crawl the site. Mapping out these two sites in their entirety would take a lot of time and resources, and so we’re not unhappy that they are not in the Mojeek index. That said, there will be cases where we have been blocked (in our eyes unfairly, as MojeekBot is a good actor that respects robots.txt) from crawling a website that should be accessible in Mojeek’s index.
If you find that the website you’re looking for has blocked Mojeek in its robots.txt file, you can help us out by contacting the person who maintains the website and asking them to allow MojeekBot in these rules.
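To see whether a given robots.txt blocks MojeekBot, the rules can be checked with Python's standard `urllib.robotparser` module. The sketch below is illustrative only: the two robots.txt samples and the helper name `mojeekbot_allowed` are made up for this example, and the parser's matching may differ in detail from how MojeekBot itself interprets the rules.

```python
from urllib.robotparser import RobotFileParser


def mojeekbot_allowed(robots_txt: str, url: str, agent: str = "MojeekBot") -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical file that allows one named crawler and blocks everyone else.
BLOCKING = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

# The same file with an added group that accounts for MojeekBot.
ACCOUNTING = """\
User-agent: Googlebot
Allow: /

User-agent: MojeekBot
Allow: /

User-agent: *
Disallow: /
"""

if __name__ == "__main__":
    page = "https://example.com/page"
    print("blocking file allows MojeekBot:", mojeekbot_allowed(BLOCKING, page))
    print("updated file allows MojeekBot:", mojeekbot_allowed(ACCOUNTING, page))
```

The second sample shows the kind of change a site maintainer could make: a dedicated `User-agent: MojeekBot` group placed alongside the groups for other crawlers.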