Meta-Press.es

Decentralized search engine & automatized press reviews

Version 1.8.14: source maintenance and access content tag

I needed to do some source maintenance to verify that Meta-Press.es had recovered from the previous date parsing problem, so I decided to perform a global source maintenance and try to fix every source reported as broken by the automated full testing procedure.

1. Sources' Spring cleaning

It turned out to be a lot of work: I had hundreds of sources to fix. With this momentum I also tried to fix the sources already marked as "broken" that I encountered in the json/sources.json file, and 15 of them are working again (including some which need tokens to be fetched).

In total, the change amounted to:

2743 insertions(+), 1824 deletions(-)

And it took more time than I expected.

Meta-Press.es now counts 953 working sources, including 510 French-speaking ones. That is still more than Google Actualités, which claims to limit itself to 500 sources.

The approximate sources (which are difficult to use) dropped to 99, because many of them were made more precise via the filter_results mechanism.

2. access content and direct content tags

But the biggest improvement comes with these two new tags: access content and direct content.

The first one allows searching only through sources whose content is accessible online: the true web, without paywalls. 692 sources were tagged this way, which represents 72% of all the known sources in Meta-Press.es. So what was thought to be a weakness of Meta-Press.es (not giving access to the content of the articles) turns out to be a strength instead, because Meta-Press.es in fact mainly leads you to readable content.

Even better, it appeared that 127 sources push their content on demand, so Meta-Press.es allows you to read a lot of content directly in its result pages. This content is embedded in exports, and you can share it or archive it. Those sources got the direct content tag.

These two tags can currently be found among the "tech" tags, but might get their own category one day.
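To illustrate what searching by tag means, here is a minimal Python sketch of such a filter. It is not the code of Meta-Press.es: the layout of the source definitions (a "tags" object holding a "tech" list), the function name sources_with_tag and the example URLs are all assumptions made for the illustration.

```python
def sources_with_tag(sources: dict, tag: str) -> dict:
    """Keep only the sources whose 'tech' tag list contains `tag`.

    Assumption: each source definition looks like
    {"tags": {"tech": [...]}}; the real json/sources.json may differ.
    """
    return {
        name: definition
        for name, definition in sources.items()
        if tag in definition.get("tags", {}).get("tech", [])
    }


# Hypothetical sample data, only here to exercise the filter.
sample_sources = {
    "https://example.org": {"tags": {"tech": ["access content", "direct content"]}},
    "https://example.com": {"tags": {"tech": ["access content"]}},
    "https://example.net": {"tags": {"tech": []}},
}

readable = sources_with_tag(sample_sources, "access content")
direct = sources_with_tag(sample_sources, "direct content")
```

With the real figures from this release, the same kind of count gives 692 sources out of 953 for access content, and 127 for direct content.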

3. Monitoring the age of the lines

As this source maintenance was a lot of work, I made a small script to compute the age of the lines in the json/sources.json file. Used smartly, it will make it possible to track the maintenance work needed between two source-addition sessions.

For now, it already shows the amount of work done in each of the last few years.

Year    # lines
2019    598
2020    924
2021    3177
2022    7024
2023    8895
2024    3111

The source definition file counts 23,718 lines, all crafted by human beings.
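The script itself is not shown here. As a rough sketch, and assuming the file is tracked in git, a per-year count like the one above could be derived from the output of git blame --line-porcelain, which reports an author-time (Unix timestamp) for every line. The function names count_lines_per_year and blame_years are hypothetical, and the actual script may work differently.

```python
import subprocess
from collections import Counter
from datetime import datetime, timezone


def count_lines_per_year(porcelain: str) -> Counter:
    """Count blamed lines per year from `git blame --line-porcelain` output."""
    years = Counter()
    for line in porcelain.splitlines():
        # Metadata lines start at column 0; file content lines start with a
        # tab, so a content line containing "author-time" is never counted.
        if line.startswith("author-time "):
            epoch = int(line.split(" ", 1)[1])
            years[datetime.fromtimestamp(epoch, tz=timezone.utc).year] += 1
    return years


def blame_years(path: str = "json/sources.json") -> Counter:
    """Run git blame on `path` (inside a git repository) and count per year."""
    result = subprocess.run(
        ["git", "blame", "--line-porcelain", "--", path],
        check=True,
        capture_output=True,
        text=True,
    )
    return count_lines_per_year(result.stdout)
```

Calling sorted(blame_years().items()) from the root of the repository would then yield (year, number of lines) pairs, each line being attributed to the year it was last modified.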