Decentralized search engine & automatized press reviews

Version 1.8.6 : Illustrated headlines from RSS feeds and child mode

For v1.8.6, development focused on two main objectives :

  • the child mode

  • headlines feature

1. Child mode

Regarding the child mode, a new checkbox appeared in the settings. It enables the child mode, in which only child-compatible sources are available (and access to the settings and to source creation is removed from the main interface).

This is intended for school use. With this feature, you can turn your back for a few minutes while kids are using the tool. As long as they stay on the main interface, they’re safe.

When you need to disable the child mode, reach the settings the standard Firefox way for a WebExtension (via the menu or the about:addons internal address).

The sources available when the child mode is activated are marked with the for kids tech. tag, so you can easily list and check them via the source list feature of the main interface. Two other tags have been set : for kids < 9 and for kids > 9 ; to address the specific needs of narrower audiences based on their age.

Figure 1. Child mode source listing

Currently 26 sources are marked for kids and I warmly encourage you to send me references to more sources for this category.

All the instances of Vidipedia and Wikikids have been added, as announced in this post :

2. RSS news feeds

Regarding the headlines, a radical change occurred. Headlines are now fetched from main RSS feeds of sources (if they provide one). This brings 3 main advantages :

  • every compatible source can now provide up to 10 general-purpose news items (according to your setting)

  • these news items now come with excerpts in addition to titles

  • half of the compatible sources now provide illustrations

The tool is not meant to become a general-purpose news feed reader, but it is now a news feed reader for its indexed sources.

Settings have been added to choose how many news items to display from each source.
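As an illustration of the headline fetching described above, here is a minimal Python sketch (not the extension's actual code). It assumes a plain RSS 2.0 feed whose items carry an optional `enclosure` image; the function name `read_headlines`, the sample feed and the `max_items` parameter are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, only here to make the sketch self-contained.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example source</title>
    <item>
      <title>First headline</title>
      <link>https://example.org/1</link>
      <description>Short excerpt of the first article.</description>
      <enclosure url="https://example.org/1.jpg" type="image/jpeg" length="0"/>
    </item>
    <item>
      <title>Second headline</title>
      <link>https://example.org/2</link>
      <description>Short excerpt of the second article.</description>
    </item>
    <item>
      <title>Third headline</title>
      <link>https://example.org/3</link>
    </item>
  </channel>
</rss>"""


def read_headlines(rss_text, max_items=10):
    """Return at most max_items headlines with title, link, excerpt and image."""
    root = ET.fromstring(rss_text)
    headlines = []
    for item in root.iter("item"):
        if len(headlines) >= max_items:
            break
        enclosure = item.find("enclosure")
        headlines.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "excerpt": item.findtext("description", default=""),
            # Not every feed provides an illustration.
            "image": enclosure.get("url") if enclosure is not None else None,
        })
    return headlines


if __name__ == "__main__":
    for headline in read_headlines(SAMPLE_RSS, max_items=2):
        print(headline["title"], "|", headline["image"])
```

Real feeds vary a lot (Atom, `media:content` illustrations, missing dates), so this only covers the simplest case.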

On my own instance, I saved a new "scheduled search" setup with my favourite independent sources (among the current 224 ones). I leave it on its default "Stop" run frequency and I only open it from the settings to load the news ! Also, the indep. tag is no longer a source type but a tech tag that can be combined with many source types (such as: encyclo. press agenda …).

This new feature was introduced thanks to a modification in the source definition API. I scripted the upgrade for the current 600+ sources and half of them were found to provide an RSS feed. Some feeds may be missing for sources you know; don’t hesitate to report them, they are easy to add back for the next release.

This was a simplification: the headline_url + h_title selectors in source definitions were traded for a single news_rss_url. Even if we’re currently losing half the sources, it removes the heaviest maintenance burden: 2/3 of the source upgrading tasks were about the headlines (as the front page of newspapers is constantly evolving). The only foreseeable evolution of RSS news feeds is their possible removal, but they were declared dead long ago already.
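The trade described above can be pictured with a small Python sketch of such an upgrade step. Only the entry names headline_url, h_title and news_rss_url come from this post; the dictionary layout, the other fields and the function `upgrade_source` are assumptions made for illustration, not the real upgrade script.

```python
def upgrade_source(definition, rss_url=None):
    """Drop the old headline selectors and, when known, add the RSS feed URL."""
    old_entries = ("headline_url", "h_title")
    upgraded = {key: value for key, value in definition.items()
                if key not in old_entries}
    if rss_url:
        upgraded["news_rss_url"] = rss_url
    return upgraded


if __name__ == "__main__":
    # Hypothetical source definition in the old style.
    old_definition = {
        "name": "Example Daily",
        "headline_url": "https://example.org/",
        "h_title": "h2.front-page a",
    }
    print(upgrade_source(old_definition, "https://example.org/rss.xml"))
    # A source without a known feed simply loses its headlines.
    print(upgrade_source(old_definition))
```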

Another modification that this work introduced in the source definition API is the removal of the xml_type entry, which is no longer needed. The work on the dynamic source creation form (introduced in v1.8.5) led to the creation of a "detect_RSS_variant" function, which was re-used and battle-tested for RSS headline fetching, and so extended to the search feature of RSS-based source definitions. Defining a new source is thus a bit easier now for RSS-based sources, and the new source creation form has been updated accordingly.
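To give an idea of what feed variant detection can look like, here is a hedged Python sketch. It is not the extension's "detect_RSS_variant" function: the name `detect_feed_variant` and the returned labels are hypothetical, and the approach (looking at the root element of the document) is just one plausible way to do it.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def detect_feed_variant(xml_text):
    """Guess the feed flavour from the root element, or return None."""
    root = ET.fromstring(xml_text)
    if root.tag == "rss":
        return "RSS"
    # ElementTree writes namespaced tags as "{namespace}localname".
    if root.tag == f"{{{ATOM_NS}}}feed":
        return "ATOM"
    if root.tag == f"{{{RDF_NS}}}RDF":
        return "RDF"
    return None


if __name__ == "__main__":
    print(detect_feed_variant('<rss version="2.0"><channel/></rss>'))
    print(detect_feed_variant(f'<feed xmlns="{ATOM_NS}"/>'))
```

With such a detection step, a source definition no longer needs to declare the XML type by hand.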

An expected visible effect is more illustrations on results (because we might have missed some with the previous "manually set up" system).

The online documentation has been updated too (and this is the kind of line that does not reflect the work it took).

3. Fancy or serious sources

Some sources should be easy to index but turn out to provide poor RSS results (two of them were discovered the same day) :

Big up for the #DailyTelegraph which manage to publish a half empty #RSS feed without any date on entries !!
Over 235 RSS feeds, its the deadliest broken one.
The good news is that you'll be able to fetch all the other feeds from the next version of #MetaPress.

There is also a source that is hard to work with: the serious IntelligenceOnline, dealing with infosec (sounds great). This source already led to some improvements in the source definition API, like the new search_ctype entry, which lets you specify the content type used to send the request to the server (all 608 currently working sources use application/x-www-form-urlencoded, but this one requires application/json). But this is not enough to work with IntelligenceOnline, as this source also encodes (maybe encrypts) the search terms before sending them to the server… OK, why not. As it’s all open client-side JavaScript, the exact mechanism could be studied and a cipher key might be extracted, and I would love to enrich the source definition API with term_cipher_method and term_cipher_key entries.

As it won’t be among my priorities anytime soon, a contributor might earn a lot of consideration, some stickers and badges, and an in-depth blog post here (10k visitors a month) by diving into this problem. An issue was created to track this subject :
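The difference between the two content types mentioned above can be sketched in Python as follows. This is only an illustration: the function `build_search_body` and the `q` field name are hypothetical, and only the search_ctype name and the two content-type values come from this post.

```python
import json
from urllib.parse import urlencode


def build_search_body(terms, search_ctype="application/x-www-form-urlencoded",
                      field="q"):
    """Return (headers, body) for a search request, depending on search_ctype."""
    params = {field: terms}
    if search_ctype == "application/json":
        body = json.dumps(params)
    elif search_ctype == "application/x-www-form-urlencoded":
        body = urlencode(params)
    else:
        raise ValueError(f"unsupported content type: {search_ctype}")
    return {"Content-Type": search_ctype}, body


if __name__ == "__main__":
    print(build_search_body("press freedom"))
    print(build_search_body("press freedom", "application/json"))
```

The same search terms end up as `q=press+freedom` in the form-encoded case and as a small JSON object in the other.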

To finish, some new trends… Some sources are still using JSONP to serve results (a workaround for cross-origin restrictions on JSON that was heavily used 20 years ago). This is the case for two sources (OK, one of them comes from the southern half of the world, with an insane climate and a terrible economic crisis 20 years ago…). So there is a jsonp_to_json_re entry for source definitions, which lets you specify a RegExp used to extract the JSON data from the JSONP script source file.
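As a rough idea of what such a RegExp does, here is a Python sketch that unwraps a JSONP payload. The pattern, the sample payload and the function `jsonp_to_json` are assumptions for illustration; the exact semantics of jsonp_to_json_re in the extension may differ.

```python
import json
import re

# Hypothetical pattern: a callback name, then the JSON data between parentheses.
JSONP_TO_JSON_RE = re.compile(r"^\s*[\w$.]+\s*\(\s*(.*)\s*\)\s*;?\s*$", re.DOTALL)


def jsonp_to_json(script_text, pattern=JSONP_TO_JSON_RE):
    """Extract and parse the JSON data wrapped in a JSONP callback call."""
    match = pattern.match(script_text)
    if match is None:
        raise ValueError("not a JSONP payload")
    return json.loads(match.group(1))


if __name__ == "__main__":
    sample = 'callback({"results": [{"title": "Hello"}]});'
    print(jsonp_to_json(sample))
```

Once the wrapper is removed, the results can be handled like those of any other JSON source.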

But something new is emerging and appears to be the opposite : JSON is used to encapsulate regular HTML computed server-side… Why not send bare HTML ? Still, the new json_to_html source definition entry allows the HTML to be extracted and parsed. I imagine that I’ll have to detail this mechanism at a per-field level soon.
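The principle can be sketched in Python as follows: pull the HTML string out of the JSON answer, then parse it as a regular result page. Everything here is hypothetical (the `html` key, the sample markup, the `LinkCollector` class and the function `json_to_html_results`); it only illustrates the idea, not how the json_to_html entry is actually implemented.

```python
import json
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href and text of every link found in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            self.links.append({"href": self._href, "title": title})
            self._href = None


def json_to_html_results(json_text, html_key="html"):
    """Extract the HTML stored under html_key and return the links it contains."""
    payload = json.loads(json_text)
    parser = LinkCollector()
    parser.feed(payload[html_key])
    parser.close()
    return parser.links


if __name__ == "__main__":
    sample = json.dumps({"html": '<ul><li><a href="https://example.org/a">'
                                 'First result</a></li></ul>'})
    print(json_to_html_results(sample))
```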