ActualScan Tech

Social indexing (Roads into search #1)

Posted on Nov 7, 2020

With broadening concerns about privacy and results quality, and with the antitrust actions against Google in both the EU and the US, the climate is increasingly favorable for alternative search engines (that is, ones other than the Big Tech engines like Google and Bing).

Among those, the most popular is probably DuckDuckGo, with similar propositions from Ecosia, Qwant, Runnaroo, Startpage and others. All of these sites aggregate results from the indexes of mainstream engines (most often Bing, although Startpage is a good private source of Google results).

The main tangible benefit for users here is insulation from Google's tracking and advertising empire, since the search results arrive through these middlemen. You can achieve something similar by using an instance of Searx, a free-software metasearch engine.
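For instance, many Searx instances expose their results over a plain HTTP API. Here is a minimal Python sketch of querying one – the instance URL is hypothetical, and it assumes the instance was configured to allow the JSON output format (not all public ones do):

    # Query a Searx instance's /search endpoint and print the top results.
    # SEARX_URL is a hypothetical address; "format": "json" only works on
    # instances that enable the JSON output format in their settings.
    import requests

    SEARX_URL = "https://searx.example.org/search"

    resp = requests.get(SEARX_URL, params={"q": "alternative search engines",
                                           "format": "json"})
    resp.raise_for_status()
    for result in resp.json().get("results", [])[:5]:
        print(result["title"], "->", result["url"])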

To varying extents, some of these engines also claim to have their own web crawlers and indexes. This usually happens on a very small scale, and in any case crawling 1.5 billion websites (or even the fraction of them that is active) by yourself is a daunting task. Furthermore, the war on bots gives market incumbents a convenient pretext for harassing alternative crawlers, as chronicled e.g. on the Gigablast blog.

Nevertheless, sites like Gigablast and Mojeek still crawl the Web all by themselves, and kudos to them for that. Gigablast is open source, so at least in theory you can spin up your own server. There's also the YaCy project, which lets you run your own search engine instance with its own index, federated with other server owners. The petabytes of data collected by the CommonCrawl project might be a starting point for an independent index, but it's not updated frequently.

What to do differently?

Every website that I mentioned before tries to replicate, in broad outlines, the service provided by Google (and the old-school engines that it replaced). You enter your search query, you get a page of results – with the hope that

  1. the search engine managed to obtain the pages you'd want to find (which may be hard for independent engines), and that
  2. it will surface the informative pages to the top (which often seems hard even for Google).

But an alternative search engine must do something differently: give the user something they cannot get elsewhere. Preferably, something that big companies aren't interested in giving. Of course, one such thing is privacy, and privacy tends to be the theme around which the marketing and communications of alternative search engines are centered.

In these blog entries (for now I’m planning just two, but who knows) I want to explain the two main features that make ActualScan different for the user. It implements social indexing – to get as close to the needs of users as possible – and analytic results, to yield the most relevant links, fight SEO, and remove the need for invasive tracking. Today, we will discuss the first concept.

Social indexing

Traditionally, crawling the Internet starts from some set of known sites and, by following links, tries to index as much of the Web as possible. This means collecting a lot of spam sites and assorted web junk. Very optimistically, 10% of what we get may be desirable for users – where desirable means not only containing the keywords (obviously web spam contains the keywords!), but also being worth your time.
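To make the contrast concrete, here is a minimal Python sketch of such a traditional breadth-first crawl (all names are illustrative, and a real crawler also needs politeness delays, robots.txt handling, URL canonicalization and so on). Notice that nothing in the loop distinguishes a worthwhile page from spam – everything reachable gets fetched:

    # A toy breadth-first crawler: start from seed URLs and follow links.
    from collections import deque
    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    def crawl(seeds, max_pages=1000):
        frontier = deque(seeds)   # URLs waiting to be fetched
        seen = set(seeds)         # URLs already scheduled, to avoid loops
        while frontier and len(seen) < max_pages:
            url = frontier.popleft()
            try:
                html = requests.get(url, timeout=10).text
            except requests.RequestException:
                continue
            yield url, html       # hand the page over to the indexer
            for link in BeautifulSoup(html, "html.parser").find_all("a", href=True):
                absolute = urljoin(url, link["href"])
                if urlparse(absolute).scheme in ("http", "https") and absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)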

Our trick here is to narrow down crawling aggressively. ActualScan is a search engine for niches. It doesn’t aim to answer cursory questions about what is on the general Internet, but deeper questions on more information-heavy parts of it.

ActualScan collects pages pertaining to selected topics, organized by tags. It also indexes forums, blogs and media outlets, selecting the articles that are interesting to its users.
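As a rough illustration of the data model (the Python below is hypothetical, not ActualScan's actual schema), a scan request carries a set of topic tags, and only the sites sharing at least one of those tags get crawled for it:

    # Hypothetical sketch of tag-scoped indexing.
    from dataclasses import dataclass, field

    @dataclass
    class Site:
        url: str
        kind: str                           # e.g. "forum", "blog", "media"
        tags: set[str] = field(default_factory=set)

    @dataclass
    class ScanRequest:
        query: str                          # what the user wants to find
        tags: set[str]                      # topics the scan is limited to

    def sites_for_scan(request: ScanRequest, catalog: list[Site]) -> list[Site]:
        # Restrict the scan to sites that share at least one tag with it.
        return [site for site in catalog if site.tags & request.tags]

For example, a scan tagged with "electronics" would touch an electronics forum in the catalog, but leave a cooking blog alone.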

As a registered member of ActualScan, you can add sites and trigger indexing by yourself. In principle you can use the same tools that are available to site admins and moderators. Of course, we have to note that only a minority of users will ever want to do this.

This means that the indexing system has to be optimized for a (relatively) small circle of dedicated curators for each topic. They act transparently and are heavily aided by automation.
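One way to keep the curators' actions transparent is to record every index-changing operation in a publicly viewable log. The sketch below is again hypothetical – just an illustration of the idea, not ActualScan's design:

    # Hypothetical public audit log for curation actions.
    from dataclasses import dataclass
    from datetime import datetime, timezone

    @dataclass
    class CurationEvent:
        curator: str
        action: str        # e.g. "add_site", "tag_site", "trigger_scan"
        target: str        # the URL or tag affected
        timestamp: datetime

    PUBLIC_LOG: list[CurationEvent] = []

    def record(curator: str, action: str, target: str) -> None:
        # Every curation action lands in the log anyone can inspect.
        PUBLIC_LOG.append(CurationEvent(curator, action, target,
                                        datetime.now(timezone.utc)))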

At this stage, I wouldn't bother much with typical social features like profiles and messaging, as these can be easily outsourced to existing social and messaging platforms (preferably open source ones). The priority is the system for collaborative indexing itself, allowing many members to curate the index in parallel.

[Image: an example list of scans and crawls in a work-in-progress interface – a preview of what a list of pending Web scans may look like.]