Decentralized search engine & automatized press reviews

Version 1.7.8: ergonomic enhancements

A month and a half after the last release, version 1.7.8 is now online. This new version brings ergonomic enhancements and a major round of fixes for the known sources.

The enhancements include some long-awaited requests:

  • a date range filter, with two inputs, to work on local results

  • a search input to easily find a particular source in the source box of a finished search, when more than 30 sources are listed there

  • "select all", "select none" and "toggle selection" buttons for selecting results to export. These buttons only affect the results visible on the current page (and it’s still possible to choose how many elements are listed per page)

  • the list of the sources we’re still waiting for, when a search is taking a noticeable time (it can be expanded from the search status line when there are fewer than 30 awaited sources)

  • a Cancel button that actually stops the running search where it is and lets you work on the results (the previous solution just refreshed the page, losing the results). This is done via the recent JavaScript promise aborting API, thanks to a mention from @lutindiscret

    • as a consequence, a new setting appeared: a request timeout, which automatically finishes a search after 90 s (it can be set to 0 to wait "forever")

  • a new source statistics line which displays the number of selected sources and the number of permissions needed to perform the next search, along with a button to grant those permissions
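The Cancel button, the request timeout and the list of awaited sources from the list above can be sketched together. This is only a minimal illustration, assuming the "promise aborting API" is the standard AbortController. The names runSearch and fakeRequest are hypothetical, and fakeRequest stands in for a real request such as fetch(url, { signal }).

```javascript
// Stand-in for a real request such as fetch(url, { signal }) (hypothetical helper):
// resolves with `name` after `delayMs`, or rejects with an AbortError if aborted first.
function fakeRequest(name, delayMs, signal) {
  return new Promise((resolve, reject) => {
    const abortError = () => Object.assign(new Error('aborted'), { name: 'AbortError' })
    if (signal.aborted) return reject(abortError())
    const t = setTimeout(() => resolve(name), delayMs)
    signal.addEventListener('abort', () => {
      clearTimeout(t)
      reject(abortError())
    }, { once: true })
  })
}

// Hypothetical search runner: `sources` maps a source name to its response delay (ms).
function runSearch(sources, timeoutSeconds = 90) {
  const controller = new AbortController()
  const results = []                             // filled as sources answer, kept on cancel
  const awaited = new Set(Object.keys(sources))  // sources we are still waiting for
  // A timeout of 0 means wait "forever": no timer is armed.
  const timer = timeoutSeconds > 0
    ? setTimeout(() => controller.abort(), timeoutSeconds * 1000)
    : undefined
  const requests = Object.entries(sources).map(([name, delayMs]) =>
    fakeRequest(name, delayMs, controller.signal)
      .then(result => { results.push(result); awaited.delete(name) })
      // An aborted request is expected, anything else is a real error.
      .catch(err => { if (err.name !== 'AbortError') throw err })
  )
  const done = Promise.all(requests)
    .finally(() => clearTimeout(timer))
    .then(() => results)
  return { done, awaited, cancel: () => controller.abort() }
}

// Usage: two fast sources and a slow one, cancelled after 100 ms.
const search = runSearch({ fast1: 10, fast2: 20, slow: 5000 })
setTimeout(search.cancel, 100)
search.done.then(results => console.log(results, [...search.awaited]))
```

In this sketch the cancel and the timeout share the same abort signal, so both leave the results received so far available, and the awaited set is what a status line could display.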

In addition, every regular expression of the 314 sources (which already represent 10k lines of formatted JSON) has been screened for ReDoS vulnerabilities using RegexStaticAnalysis.

Out of 180 regexes analysed, 25 were flagged with an exponential degree of ambiguity (EDA) or an infinite degree of ambiguity (IDA). Each time, it was related to unclear boundaries, multiple unbounded quantifiers (* or +), or an OR construct such as (a|a)* under an unbounded quantifier.
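To make the ambiguity concrete, here is a small sketch comparing the (a|a)* construct with an equivalent unambiguous pattern on subjects that cannot match. The helper timeMatch is hypothetical, and the timings depend entirely on the regex engine and the machine, so no figures are claimed here. On a backtracking engine the ambiguous pattern is expected to slow down sharply as the subject grows, while the tight one stays flat.

```javascript
// Hypothetical helper: run one match and report how long it took.
function timeMatch(regex, subject) {
  const start = performance.now()
  const matched = regex.test(subject)
  return { matched, ms: performance.now() - start }
}

// Each "a" can be matched by either branch of the alternation,
// so a failing subject forces the engine to try many equivalent paths.
const ambiguous = /^(a|a)*$/
// Same language (strings made only of "a"), but a single way to match each character.
const tight = /^a*$/

for (const n of [14, 16, 18, 20]) {
  const subject = 'a'.repeat(n) + 'b' // the trailing "b" makes the match fail
  const slow = timeMatch(ambiguous, subject)
  const fast = timeMatch(tight, subject)
  console.log(n, slow.ms.toFixed(2), fast.ms.toFixed(2))
}
```

The subject lengths are kept small on purpose: every extra character can double the work of the ambiguous pattern on a backtracking engine.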

Surprisingly, in each case it was possible to improve the RegExp so that it passes the test and runs faster (being more tightly bound to the subject to capture). For example, this simple and easy-to-read regular expression:

  • (\d+) (.*) (\d*) [1] ;

It captures a date (for instance: '23 july 2021') and was replaced by:

  • ^(\d{1,2}) ([^ ]{3,9}) (\d{4})$ [2] ;

This captures the same date, but with boundaries around the string (^ at the beginning and $ at the end) and sharper descriptions of each field to capture: an exact number of digits, and a month name that can contain French accented letters (like décembre) but no spaces. Real-life examples are usually a bit more complex, but the main idea is here.
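The difference between the two expressions can be checked with a short sketch. The loose pattern is written here with its quantifiers, as (\d+) (.*) (\d*), which is an assumption about how it originally read. The function parseDate and the sample strings are hypothetical and only serve the illustration.

```javascript
// Assumed form of the original, loosely bound expression.
const loose = /(\d+) (.*) (\d*)/
// The tightened replacement: anchored, with bounded fields.
const strict = /^(\d{1,2}) ([^ ]{3,9}) (\d{4})$/

// Hypothetical helper: returns the captured fields, or null when the text is not a bare date.
function parseDate(text) {
  const m = strict.exec(text)
  return m ? { day: Number(m[1]), month: m[2], year: Number(m[3]) } : null
}

console.log(parseDate('23 july 2021'))    // day 23, month 'july', year 2021
console.log(parseDate('1 décembre 2021')) // accented month names are accepted

// On a noisy string the strict pattern refuses to guess...
const noisy = 'Published 23 july 2021 by someone'
console.log(parseDate(noisy))             // null

// ...while the loose one matches, but its greedy middle group swallows the year.
const m = loose.exec(noisy)
console.log(m[2], JSON.stringify(m[3]))   // 'july 2021 by' and an empty year
```

The strict pattern gives the engine a single way to split the string, which is what removes the ambiguity.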

Again, as with the accessibility audit, this work generally improved the parsing of the sources concerned, and so brought a general improvement.

1. The first pair of parentheses captures a number, here the day of the month; the second group captures everything between the two spaces, here the month name; and the last group captures another number, the year.
2. There are still three groups in parentheses: the first can only be one or two digits long (31 is the biggest number we need to capture here), the month name can’t contain spaces (and can be 3 to 9 characters long), and the year is expected to be four digits long. I would be happy to fix this 'bug' myself when years become 5 digits long, provided that no other religion messes with the currently used Gregorian calendar.