Decentralized search engine & automatized press reviews

Roadmap 2018

As seen in the previous blog post, the project currently revolves around a working proof of concept. I still have a lot to do to keep this website’s promises, and a lot of code to write before calling for contributions to grow the base of scraped newspapers.

First of all, I have some small bugs to fix in the date filtering feature, in the sub-search (the built-in fuzzy search of List.js is not what we need here), and some CSS glitches.

Then, the tool currently allows some rudimentary date filtering of the latest fetched results, but I would like to offer a priori filtering, to get the results of a given date range. As most newspapers don’t offer this possibility directly, it will require fetching many pages of results to find the right ones.
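To make the idea concrete, here is a minimal sketch of such a date-range filter, written in Python only for brevity. Everything in it is hypothetical: the `Result` type, the `fetch_page` callback and the `max_pages` limit are names made up for the illustration, and it assumes that a newspaper returns its results newest first, which will not be true everywhere.

```python
from datetime import date
from typing import Callable, List, NamedTuple


class Result(NamedTuple):
    # Hypothetical minimal shape of one search result.
    title: str
    published: date


def results_in_range(fetch_page: Callable[[int], List[Result]],
                     start: date, end: date,
                     max_pages: int = 20) -> List[Result]:
    """Walk the result pages and keep what falls in [start, end]."""
    kept: List[Result] = []
    for page in range(1, max_pages + 1):
        results = fetch_page(page)
        if not results:
            break  # the newspaper has no more pages
        kept.extend(r for r in results if start <= r.published <= end)
        # Assumption: results come newest first, so a page that is
        # entirely older than `start` means we can stop fetching.
        if all(r.published < start for r in results):
            break
    return kept
```

The `max_pages` guard matters: without it, a search for an old date range on a prolific newspaper could trigger a very long chain of requests.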

Only once this question is settled will it be time to grow the newspaper base, as there should then be enough features to head toward version 1. At that point, the API to implement in order to add a newspaper to the base will be considered stable, and it will be time to add more newspapers to the system. Adding plenty of them, in all languages (and not only in English, as I have done so far), will bring problems, with date parsing for instance, but I’ll get back to that later, as I have an ace up my sleeve here, from my previous prototype of 2013.

1. Newspaper checking test suite

Contributors helping me to scrape a lot of newspapers would be great at that point, but as the approach relies on parsing the newspapers’ web interfaces, we will be vulnerable to each of their updates: every structural change will require updating what we know about that newspaper.

To stay up to date, we will need to set up a test suite that checks every newspaper every night and quickly spots which ones need rework.

To do so, I have kept, for each newspaper, one search term that returns results and another that returns none.
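A nightly check built on those two terms could look like the sketch below (in Python for brevity). The names `Probe`, `check_newspaper` and `nightly_report` are invented for the example, and `search` stands for whatever function queries and parses a given newspaper.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple


class Probe(NamedTuple):
    hit_term: str   # term expected to return at least one result
    miss_term: str  # term expected to return no result


def check_newspaper(search: Callable[[str], List[dict]],
                    probe: Probe) -> List[str]:
    """Return the list of problems found for one newspaper."""
    problems = []
    try:
        if not search(probe.hit_term):
            problems.append("no result for %r" % probe.hit_term)
        if search(probe.miss_term):
            problems.append("unexpected results for %r" % probe.miss_term)
    except Exception as exc:  # a layout change often breaks the parser
        problems.append("parser crashed: %s" % exc)
    return problems


def nightly_report(
    newspapers: Dict[str, Tuple[Callable[[str], List[dict]], Probe]]
) -> Dict[str, List[str]]:
    """Map each failing newspaper name to its problems."""
    report = {}
    for name, (search, probe) in newspapers.items():
        problems = check_newspaper(search, probe)
        if problems:
            report[name] = problems
    return report
```

The term without results is as useful as the other one: it catches a parser that has started to mistake page furniture (menus, suggestions) for search results.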

2. Updates and choice of sources newspapers

The newspaper base may be updated often, so we will have to ensure that those updates reach the users quickly.

We may take advantage of Firefox’s built-in extension update feature, but it may be necessary to set up another routine, maybe one that runs before each request.

In addition, it would be good that we don’t become the censorship mechanism this tool tries to circumvent. So I want to let users set a newspaper source other than the default one, in the (upcoming) extension preferences. This may allow other uses than the rather generalist one I’m aiming at.

3. Tags and selection

Once fitted with hundreds (if not thousands) of newspapers, we won’t be able to query all of them at each request: it would be too slow. So selecting the relevant newspapers for a particular request will have to be made easy, based on tags, and if possible narrowed down to ~30 newspapers.

It will be possible to filter the newspapers by:
- the language (of the user)
- the country (of the user)
- the field of interest: politics, sport, ecology…
- the periodicity: daily, weekly, monthly…
- some technical criteria: fastest to answer, HTTPS…

Moreover, some newspapers make things harder for me. Instead of helping me circumvent Google’s monopoly position by indexing their own content, they use Google as their internal search engine… This is the case for the Guardian, for instance. So, as we’ll have to build a nice newspaper selection panel anyway, we had better add a criterion to exclude the lazy newspapers that don’t index their own content.
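The selection criteria above, including the one about delegated search, can be sketched as a simple tag filter (Python used for brevity). The `Newspaper` fields and the `select` function are hypothetical, chosen only to mirror the list; the real tag model is still to be designed.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional


@dataclass
class Newspaper:
    name: str
    language: str
    country: str
    topics: FrozenSet[str]
    periodicity: str
    https: bool
    response_ms: int          # typical answer time, in milliseconds
    own_search: bool = True   # False if search is delegated to a third party


def select(newspapers: Iterable[Newspaper],
           language: Optional[str] = None,
           country: Optional[str] = None,
           topic: Optional[str] = None,
           periodicity: Optional[str] = None,
           https_only: bool = False,
           own_search_only: bool = False,
           limit: int = 30) -> List[Newspaper]:
    """Keep the newspapers matching every given tag, fastest first."""
    matches = [
        n for n in newspapers
        if (language is None or n.language == language)
        and (country is None or n.country == country)
        and (topic is None or topic in n.topics)
        and (periodicity is None or n.periodicity == periodicity)
        and (not https_only or n.https)
        and (not own_search_only or n.own_search)
    ]
    matches.sort(key=lambda n: n.response_ms)
    return matches[:limit]
```

Sorting by answer time before truncating is one possible choice: when more than ~30 newspapers match, the slowest ones are the ones left out.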

4. Import / Export

After each request, hundreds of results are listed in the extension page. It took some time to gather them, so it would be useful to be able to save this work: to quickly come back to it later, to share it with someone or to work on it offline… This should not be hard, as we already know the structure of the results (title, date, excerpt…). We’ll just have to map this structure to an existing format such as RSS or Atom, for instance.
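As an illustration, mapping the results to RSS 2.0 takes only a few lines with a standard XML library. The sketch below uses Python’s `xml.etree.ElementTree`; the result keys (`title`, `url`, `excerpt`, `date`), the function name and the placeholder channel link are assumptions made for the example, not the extension’s actual data model.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def results_to_rss(query, results, link="https://example.org/"):
    """Serialize a list of result dicts as an RSS 2.0 document."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires a title, a link and a description per channel.
    ET.SubElement(channel, "title").text = "Results for: " + query
    ET.SubElement(channel, "link").text = link  # placeholder URL
    ET.SubElement(channel, "description").text = "Exported search results"
    for r in results:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = r["title"]
        ET.SubElement(item, "link").text = r["url"]
        ET.SubElement(item, "description").text = r["excerpt"]
        # RSS expects RFC 822 style dates.
        ET.SubElement(item, "pubDate").text = format_datetime(r["date"])
    return ET.tostring(rss, encoding="unicode")
```

Importing is the reverse operation: parse the file, read each `item` back into the same structure, and feed the list to the results page.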

These exports are what will make it possible to publish a press review in one click.

5. Publishing counterpart

Indeed, once the results are displayed in the user’s browser, nothing prevents us from adding a checkbox in front of each result to easily make a selection and export only that selection. If it is sent to a personal cloud app, that app should be able to publish it on the web as a bare paginated list, easy to embed via an iframe in the existing website of an association, for instance…

And then, all of a sudden, the press review task becomes "interesting part only" (cf. the previous blog post).

6. Better Firefox integration

To finish, if we are to make a search engine, we had better list it among Firefox’s built-in search engines. Apart from that, I still have a lot to learn about preferences handling in Firefox extensions.


That’s it for the moment, as I’m only talking about what seems required for version 1.
There is already a TODO list for version 2.