Meta-Press.es

Decentralized search engine & automatized press reviews

Version 1.8.15 : source maintenance, bugfix and big exports

This new version mainly contains under-the-hood improvements that should stay invisible but required a lot of work. How Meta-Press.es decides whether a URL is valid, its internal organisation, how it creates timezoned dates or imports files: important pieces of code were reworked for this release, fixing a few visible bugs along the way.

1. Source maintenance

After the big source re-tagging of the previous release, it appeared that some sources were less precise than hoped. 200 of them were tagged back as "approx", for approximate sources.

In fact, I wanted a single scale for source precision, with three levels:

  • approx

  • one word

  • many words

These stand for sources giving approximate results, exact results for only one word, or good results even for many words. But there are really two concepts here: exact / approx and one word / many words. A source that is "one word" only is in fact applying a logical OR between the given search terms, while a "many words" source applies a logical AND. Being exact or approx is orthogonal to that, and some sources will in the future be tagged as both many words and approx.
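The two axes can be illustrated with a small sketch. The matches() function and the words / precision field names below are hypothetical, chosen for this example only; they are not taken from the Meta-Press.es source definitions.

```javascript
// Hypothetical sketch: how a "one word" (logical OR) source and a
// "many words" (logical AND) source would treat the same query.
function matches(text, terms, mode) {
  const haystack = text.toLowerCase();
  const hits = terms.map(term => haystack.includes(term.toLowerCase()));
  return mode === 'AND' ? hits.every(Boolean) : hits.some(Boolean);
}

// Two independent tags per source (illustrative field names only).
const example_tags = { words: 'many', precision: 'approx' };

const title = 'European press review';
console.log(matches(title, ['press', 'climate'], 'OR'));  // true
console.log(matches(title, ['press', 'climate'], 'AND')); // false
console.log(example_tags.words, example_tags.precision);
```

Keeping the two tags separate means a source can be described as AND-capable without also claiming that its results are exact.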

Meanwhile, sources were added again to reach a total of 988 as of 1.8.15 across 75 countries and 75 languages.

1.1. 800 lines were edited in the source definition file

Year    # lines
2019    593
2020    881
2021    3066
2022    6970
2023    8759
2024    3984

For a total of 24,253 lines (+535).

2. Bugs fixed

A total of 12 issues were fixed. Here are the most important ones.

2.1. #80 [1.8.14] RSS re-import failure

After a big search with thousands of results, I like to export everything and then re-import it. But here it failed. After some hours of work, I found that illegal XML characters could be introduced into the RSS files generated by Meta-Press.es via the URLs of some illustrations, in particular URLs pointing to a PHP Thumbnailer, as on https://www.journal-ipns.org.

Some unencoded '&' were passing through, luring the XML parser into unfinished XML entities. My first move was to encode those '&' as & # 2 6 ; (without spaces), but this led to misinterpretation of the URL by the server-side PHP Thumbnailer.

Encoding them with & amp ; did the trick.
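As an illustration of that kind of escaping, here is a minimal sketch. The escape_XML_text() name, the entity table and the example URL are assumptions made for this example; the actual encode_XML() function of Meta-Press.es may work differently.

```javascript
// Hypothetical helper: replaces the five characters that XML reserves
// with their predefined named entities.
const XML_ENTITIES = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&apos;',
};

function escape_XML_text(str) {
  return String(str).replace(/[&<>"']/g, char => XML_ENTITIES[char]);
}

// Made-up thumbnailer URL with unencoded '&' in its query string.
const image_url = 'https://example.org/thumb.php?src=a.jpg&w=300&h=200';
console.log(escape_XML_text(image_url));
// https://example.org/thumb.php?src=a.jpg&amp;w=300&amp;h=200
```

An XML parser turns the named entity back into a plain '&' when reading the file, so the URL that reaches the server is unchanged.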

2.2. #73 [1.8.14] Search in the full list of sources interprets '-' signs

Here the problem was that searching for journal-ipns.org in the source list returned no results, although the source exists. This was because the ListJS library used by Meta-Press.es interprets the pattern -atext as: do not include results containing atext in the ListJS search results.

As ListJS is not maintained, Meta-Press.es now ships with the Lovasoa version, which contains fixes for such bugs.
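To make the effect of that interpretation concrete, here is a deliberately simplified model. Neither function is ListJS code or part of its API; both names and the source list are hypothetical and only contrast a literal search with a search where a '-' introduces an exclusion.

```javascript
// Hypothetical list of source names.
const source_names = ['journal-ipns.org', 'example-news.org'];

// Literal search: the whole query is treated as plain text.
function literal_search(items, query) {
  const needle = query.trim().toLowerCase();
  return items.filter(name => name.toLowerCase().includes(needle));
}

// Simplified model of the problematic behaviour: whatever follows the
// first '-' is treated as text that results must NOT contain.
function exclusion_search(items, query) {
  const q = query.trim().toLowerCase();
  const dash = q.indexOf('-');
  if (dash === -1) return literal_search(items, q);
  const wanted = q.slice(0, dash);
  const unwanted = q.slice(dash + 1);
  return items.filter(name => {
    const n = name.toLowerCase();
    return n.includes(wanted) && !(unwanted !== '' && n.includes(unwanted));
  });
}

console.log(exclusion_search(source_names, 'journal-ipns.org').length); // 0
console.log(literal_search(source_names, 'journal-ipns.org').length);   // 1
```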

2.3. #66 [1.8.12] Can’t schedule a new search : Failed to parse next run date Invalid Date

When a timezone was not explicitly chosen by the user (so the default "Browser timezone" setting was used), French users could not schedule new automated searches during summer time.

This was fixed by some solid improvements in how Meta-Press.es creates timezoned dates.
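The release notes do not detail how these dates are built. As one possible approach, the sketch below uses the standard Intl.DateTimeFormat API to find the UTC offset of a timezone at a given instant, then converts a wall-clock time in that timezone into a Date. The function names are hypothetical and this is not the timezoned_date() implementation of Meta-Press.es.

```javascript
// Offset (in minutes) between `time_zone` and UTC at the instant `date`.
function tz_offset_minutes(date, time_zone) {
  const parts = new Intl.DateTimeFormat('en-US', {
    timeZone: time_zone, hourCycle: 'h23',
    year: 'numeric', month: '2-digit', day: '2-digit',
    hour: '2-digit', minute: '2-digit', second: '2-digit',
  }).formatToParts(date);
  const p = {};
  for (const { type, value } of parts) p[type] = Number(value);
  // Read the wall-clock fields as if they were UTC, then compare.
  const as_utc = Date.UTC(p.year, p.month - 1, p.day, p.hour, p.minute, p.second);
  return Math.round((as_utc - date.getTime()) / 60000);
}

// Date for a wall-clock time in `time_zone` (month is 1-based here).
// Single-pass approximation: it can be one hour off for wall-clock
// times that fall right on a daylight-saving transition.
function wall_clock_to_date(year, month, day, hour, minute, time_zone) {
  const guess = new Date(Date.UTC(year, month - 1, day, hour, minute));
  const offset = tz_offset_minutes(guess, time_zone);
  return new Date(guess.getTime() - offset * 60000);
}

// 09:00 in Paris during summer time is 07:00 UTC.
console.log(wall_clock_to_date(2024, 7, 1, 9, 0, 'Europe/Paris').toISOString());
// 2024-07-01T07:00:00.000Z
```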

2.4. #64 [1.8.11] Date filter is reset when all results are in

This was a (very well) user-reported bug. It was the kind of small bug that should be simple to fix: just re-apply the filters each time new results are added to a query. But it sat right in the way of the MVC refactoring of the code (detailed below), and the fix required importing big files.
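The idea of re-applying filters on every addition can be sketched as follows. The create_result_store() function and its methods are hypothetical names for this example and do not describe the actual Meta-Press.es model.

```javascript
// Hypothetical result store: the active filter is kept next to the
// data, so every new batch of results goes through it.
function create_result_store() {
  const all_results = [];
  let active_filter = () => true;
  const store = {
    set_filter(predicate) {
      active_filter = predicate;
      return store.visible();
    },
    add_results(batch) {
      for (const result of batch) all_results.push(result);
      return store.visible(); // the filter is re-applied here
    },
    visible() {
      return all_results.filter(active_filter);
    },
  };
  return store;
}

const store = create_result_store();
store.set_filter(result => result.date >= '2024-05-01');
store.add_results([
  { title: 'old', date: '2024-04-10' },
  { title: 'new', date: '2024-05-12' },
]);
const shown = store.add_results([{ title: 'newer', date: '2024-05-20' }]);
console.log(shown.map(result => result.title).join(', ')); // new, newer
```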

2.5. #79 [1.8.14] Investigate JSON export failure for 10448 results

If you search for "europe" across all the known sources (which is approximately how sources are tested), you end up with more than 10 000 results. Meta-Press.es was able to export them in RSS or CSV, but not in JSON, and it failed without any error message.

It appears that the whole current export procedure might be limited to 20 MB files only (it should be 32 MB). In our case, RSS and CSV files are simply smaller than JSON ones, because JSON files include more information, especially the list of sources and their alleged number of results.

The fastest fix was to remove the JSON presentation / indentation characters to get a smaller file, and to pretend that Meta-Press.es currently can’t produce searches with more results. A better solution will have to be found soon.
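The size difference between indented and compact JSON is easy to observe with the standard JSON.stringify() function. The sample data below is made up for the illustration and much smaller than a real export.

```javascript
// Size of a string in bytes once encoded as UTF-8.
const byte_length = str => new TextEncoder().encode(str).length;

// Made-up results, only to compare the two serialisations.
const sample_results = Array.from({ length: 1000 }, (_, i) => ({
  title: `Result ${i}`,
  url: `https://example.org/article/${i}`,
}));

const pretty = JSON.stringify(sample_results, null, 2); // indented
const compact = JSON.stringify(sample_results);         // no whitespace

console.log(byte_length(pretty), byte_length(compact));
```

Both strings parse back to the same data; only the whitespace differs.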

It might look like an export feature that splits big exports into 20 MB file slices, together with a reworked import feature and dialog able to import many slices.

While I was at it, imports were improved overall in the responsiveness of user feedback, and the performance penalty of the MVC refactoring (impacting JSON imports) was mitigated by a factor of 15 by reducing how often ListJS lists are re-ordered. Ordering a list is time-consuming, and you can’t afford to lose that time when importing a file (as, unlike during a search, you’re not just waiting for the next source to answer).
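Since this feature does not exist yet, the following is only a sketch of how such slicing could work. The slice_results() name and its size estimate are assumptions made for this example.

```javascript
// Hypothetical slicing: groups results so that each slice, once
// serialised as a JSON array, stays under `max_bytes`. A single result
// bigger than the limit still gets a slice of its own.
function slice_results(results, max_bytes) {
  const encoder = new TextEncoder();
  const slices = [];
  let current = [];
  let current_bytes = 2; // the enclosing "[" and "]"
  for (const result of results) {
    // +1 for the separating comma (slightly over-estimates the size)
    const size = encoder.encode(JSON.stringify(result)).length + 1;
    if (current.length > 0 && current_bytes + size > max_bytes) {
      slices.push(current);
      current = [];
      current_bytes = 2;
    }
    current.push(result);
    current_bytes += size;
  }
  if (current.length > 0) slices.push(current);
  return slices;
}

const demo_results = Array.from({ length: 50 }, (_, i) => ({ id: i, title: 'Result ' + i }));
const demo_slices = slice_results(demo_results, 200);
console.log(demo_slices.length, 'slices');
```

An import dialog would then read the slices one after the other and concatenate their results.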
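The cost of re-ordering too often can be shown with plain arrays. This sketch does not use ListJS; the two function names are hypothetical and only contrast sorting after every insertion with sorting once at the end of an import.

```javascript
const by_date_desc = (a, b) => b.date.localeCompare(a.date);

// Re-orders the list after each inserted item.
function import_resorting_each_time(items) {
  const list = [];
  let sort_count = 0;
  for (const item of items) {
    list.push(item);
    list.sort(by_date_desc);
    sort_count += 1;
  }
  return { list, sort_count };
}

// Inserts everything first, then orders the list a single time.
function import_sorting_once(items) {
  const list = items.slice().sort(by_date_desc);
  return { list, sort_count: 1 };
}

const imported = [
  { title: 'b', date: '2024-05-02' },
  { title: 'c', date: '2024-05-03' },
  { title: 'a', date: '2024-05-01' },
];
console.log(import_resorting_each_time(imported).sort_count); // 3
console.log(import_sorting_once(imported).sort_count);        // 1
```

Both functions end with the same ordered list; only the number of sort passes changes.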

3. MVC refactoring and NodeJS client

MVC stands for model-view-controller. It’s a code architecture that helps maintain the codebase via clear separations between code and data and, within the code, between core and interface code.

Meta-Press.es started as a single web page embedded in a WebExtension and has grown a lot since, to reach 8000 lines of JavaScript today.

According to the cloc command, the code was split into 19 files and 7586 lines as of mid-May 2023.

It is now split into 32 files and 8073 lines as of mid-May 2024.

Core functions were isolated in a js/core folder and will be usable in a Meta-Press.es NodeJS client in addition to the current WebExtension one. This work helped to distinguish and isolate the required dependencies that the NodeJS client will have to provide (such as DOM_parser(), XPath_evaluator(), HTML_decode_entities()…).
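One common way to achieve this kind of isolation is to hand the platform-specific helpers to the core code instead of letting it reach for browser globals. The sketch below assumes this approach; create_core(), clean_title() and the signature given to HTML_decode_entities() are hypothetical, and the decoder is a simplified stand-in that only handles four entities.

```javascript
// Hypothetical core module: it only uses the helpers it is given.
function create_core(deps) {
  const { HTML_decode_entities } = deps;
  return {
    clean_title(raw_title) {
      return HTML_decode_entities(raw_title).trim();
    },
  };
}

// What a NodeJS client could provide (simplified stand-in).
// '&amp;' is decoded last so that '&amp;lt;' yields '&lt;', not '<'.
const node_deps = {
  HTML_decode_entities: str => str
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&amp;/g, '&'),
};

const core = create_core(node_deps);
console.log(core.clean_title('  Arts &amp; Culture  ')); // Arts & Culture
```

A WebExtension client would pass its own DOM-based helpers to the same create_core() function, leaving the core code unchanged.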

Also, useful generic JavaScript functions were stored in separate libraries in js/lib/js:

  • array.js

  • date.js (including timezoned_date() to parse or create a new date with the given timezone)

  • math.js

  • object.js

  • text.js

  • types.js

  • URL.js (including is_valid_HTTP_URL() built from 5 different sources)

  • uuid.js

  • XML.js (including encode_XML(), which made it possible to fix #80)

It’s only 211 lines of JavaScript, mainly made of obvious shortcuts, but if volunteers come forward to help turn it into a viable separate library, I would help.
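As an example of such a shortcut, a basic HTTP URL check can be written with the standard URL constructor. The looks_like_HTTP_URL() function below is a hypothetical, minimal version; the is_valid_HTTP_URL() function of Meta-Press.es is probably stricter.

```javascript
// Minimal check: the string must parse as a URL and use http or https.
function looks_like_HTTP_URL(str) {
  let url;
  try {
    url = new URL(str);
  } catch (error) {
    return false; // new URL() throws a TypeError on unparsable input
  }
  return url.protocol === 'http:' || url.protocol === 'https:';
}

console.log(looks_like_HTTP_URL('https://www.journal-ipns.org')); // true
console.log(looks_like_HTTP_URL('ftp://example.org'));            // false
console.log(looks_like_HTTP_URL('not a url'));                    // false
```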

4. v1.8.15.2 : Removing two sources

Two sources were slowing down searches and getting them stuck in never-ending requests.

A quick fix is to mark them as broken. A better approach will be to fix the timeout and AbortController implementation.
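For reference, a timeout built on the standard AbortController API usually looks like the sketch below. The with_timeout() wrapper is a hypothetical name and does not reflect the current Meta-Press.es code.

```javascript
// Runs `task(signal)` and aborts the signal after `ms` milliseconds.
// `task` is any function accepting an AbortSignal, for instance:
//   signal => fetch(url, { signal })
// The timeout only has an effect if the task honours the signal.
function with_timeout(task, ms) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  return Promise.resolve()
    .then(() => task(controller.signal))
    .finally(() => clearTimeout(timer)); // no stray timer left behind
}

// A task that never answers on its own, but reacts to the abort.
const never_ending = signal => new Promise((resolve, reject) => {
  signal.addEventListener('abort', () => reject(new Error('aborted')));
});

with_timeout(never_ending, 20).catch(error => console.log(error.message)); // aborted
```

With fetch(), the rejection comes from the browser or NodeJS itself once the signal is aborted, so a stuck source can no longer block the whole search.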