Decentralized search engine & automatized press reviews


1. What is it?

This add-on is a tool to search the online press (international reference press, specialized reviews, general purpose press…). It can be used to build press reviews, for instance for big associations, or to follow scientific press coverage (via dedicated sources). Searches are keyword-driven and you can select precisely which sources to search (by type, language, theme, result type…). Users thus have full control over the sources queried, and they can even add their own sources.

Then it’s possible to schedule daily searches (or monthly ones).

This search engine takes the form of a web browser add-on (WebExtension standard format). The searching and sorting are handled by the extension itself, on the user’s computer, which gives the user fine-grained control over the tool. It is already able to search through hundreds of sources, and this number grows regularly.

With this add-on, there is no central server involved to watch over and profile users, to adapt search results to their behavior and habits, or to add targeted ads. So there is no bubble effect like with mainstream commercial search engines. It works great with the Tor Browser (which was made, among other goals, to read the press with more confidentiality than with an ordinary web browser). It is free / libre software with open sources. Those who know HTML and JavaScript can explore its source code and adapt the tool to their needs (or easily add new sources).

Finally, it is an energy-efficient tool: there is no need for polar circle datacenters to perform searches through the online press…

1.1. Limits

The current limits of the tool are that it only searches through recent news (access to older results will be available in the future), that it only sorts results in chronological order, and that it requires users to install a web browser add-on…

2. How does it work?

The tool is a meta-search engine that lets you query the internal search engine of each source (known to the system and selected for the search).

You will get the number of results reported by each source, and the newest available results from each.

Newspapers are scraped from the web (or from RSS feeds of results), one by one. So, at each query, you save the time that the developers spent parsing the newspapers :-)

Doing this won’t activate any ad or social trackers of the queried newspapers, which protects your privacy. It also gives you back the choice of your information sources and clearly states which sources were searched.

Once you get the results, you can select some of them and export them in a reusable format (RSS, JSON, CSV…). You’ll be able to re-import them later or use them somewhere else (send them by email to a friend, import them into a WordPress site…).

3. How to install it?

To install the add-on in your Firefox-based web browser, load the add-on page in your browser and click on the big blue "+ Add to Firefox" button.

4. How to use it?

Once installed, the add-on creates a new button in the toolbar decorated with the landing net icon.

Figure 1. Toolbar button with the add-on icon

A click on this button opens a new tab on the search engine interface.

If the button is missing from the toolbar, you might not have given the add-on the right to run in private mode during its installation (while you are using this mode). In this case, it’s still possible to grant the authorization via the list of add-ons of your browser (or via the about:addons address).

You can also check that automatic updates are enabled, to be sure to get the latest sources and features.

Figure 2. head of the search engine main interface

Under the logo and title, headlines from the selected newspapers are shown and rotated.

Then you can type your query and choose which newspapers to search, using a multi-criteria filter mechanism.

Results are then listed right below, as they arrive.

Each result is composed of a title, a link, its source name, its date and possibly an author and an extract:

Figure 3. details of a result

Tools to work on results (sort, search, select) appear in the right column.

Figure 4. number of results by source and filter by source (clicking on its name)
Figure 5. filter by date

You can, for example, click on "Toggle selection mode" to display a checkbox for each result. You can then export your selection of checked results in various formats (JSON, RSS, ATOM and soon CSV).

Figure 6. select mode

To re-import the search results, click on the "Import JSON" (or RSS, or ATOM) link in the cyan horizontal top bar, and select the file to import in the file picker that pops up.

4.1. Cherry-pick sources to search in

It is possible to select the sources you want to query, one by one.

To do so, you first need to expand the 2nd line of source filtering criteria by clicking on the [+] sign before the title of this part. At the end of the second line, there’s a "Cherry-pick source" select-multiple input.

Figure 7. Cherry-pick sources

All the available sources are listed here and you can pick only the ones you want.

4.2. Search permalink

When a search is finished, a statistics line appears on top of the listed results. This line ends with a 🔗 "chain link" icon. This icon lets you launch the same search again.

Figure 8. Search permalink icon

So it’s possible to create bookmarks for your favorite searches (sparing configuration time).

4.3. Scheduled searches

Once you have typed your search terms and selected the sources you want to search, it’s possible to save the search for later instead of launching it immediately. This is the role of the ⏰ Schedule search button under the source selection. This button opens a new tab on the "Settings", scrolled to the Scheduled searches part.

Figure 9. Scheduled searches, on dark background interface

This table shows one scheduled search per line. When created, a scheduled search is in the "Stop" state; you just have to select the date, time and periodicity you want for this search to activate it.

So you can schedule a daily search in a few clicks.

Actions are possible on scheduled searches:

  • the ✏️ "pen" button lets you edit the search: it opens the main search interface with the scheduled search settings (search terms, source selection). Once modified, your search settings can be saved by clicking on the "Schedule search" button of the main interface;

  • the 2nd button, with a copy/paste icon, lets you clone a scheduled search to get another one, which you can configure with the previously described button;

  • the 3rd button, with a cross on it, lets you delete a scheduled search;

  • the 4th and last button lets you manually start the search from the table.

5. How to add a new source to the search engine?

If you are a programmer, you just have to add an entry to the js/sources.json JSON object (or write your entry in the settings panel of the add-on).

Here are useful examples, listed at the top of the js/sources.json file :

  • is a good and simple example using "normal" CSS selectors

  • New Europe is an example of source providing results in RSS format

  • New Europe Greece extends the New Europe definition

  • The Washington Post provides results in JSON

  • The Japan News uses the HTTP POST method

  • Helsinki Times uses XPaths to parse some fields

5.1. Methodology

  • First, visit the website of the source you want to add and note its main URL (preferably in HTTPS);

  • Try its search functionality:

    • check if the results are accessible in RSS (or ATOM) format using the developer tools (F12 key, default Inspector tab, search for "rss"); it would spare 2/3 of the parsing work

      • in this case, you don’t need to provide the timezone of the source in the tags

    • check that the result URL returns results in chronological order, or that the results can be sorted that way; otherwise the source is incompatible, see the admonition block below

    • if the result URL does not contain your search terms, the source might be using the HTTP POST method; you can look at other sources using the POST method, such as The Japan News

    • check that the results really come from this request via the developer tools: F12 key, Network tab, Response preview. Results can be loaded via JSON and XHR requests; see The Washington Post for an example of how to deal with it

    • try to search for multiple terms and find a way to get a logical "AND" between the search terms (try adding quotes around the terms, for instance). If you can’t have a logical "AND", add the "OR" technical tag for this source. It’s also possible to add quotes around the search terms and obtain an "EXACT" search on the expression

    • note a search with terms giving results; the default is "quadrature du net", but "yellow vests" also works well

    • check if the source provides different types of results (text, image, video, audio); in this case, you will be able to create one source entry per result type (it’s easy when you just extend your first source definition)

  • Search for its main RSS feed

  • Search for its favicon, in its smallest version (ideally 32px wide)

If something goes wrong, like :

  • no search functionality

  • no date on results

  • no date sort

Please provide some feedback to the source about the problem and add it to the list of incompatible sources in the wiki with your feedback effort status.

You can also help by contacting sources of this list with no feedback yet.

Then, to write the source definition, there are 4 kinds of information to provide :

  • general info: name, timezone and tags at the end;

  • headlines: one entry to point at the main RSS feed of the source;

  • search: the source search URL (which provides the results);

  • result parsing: 5 more entries to retrieve specific elements of each result (the last two being optional):

    • title: r_h1

    • link: r_url

    • date: r_dt

    • extract: r_txt

    • author: r_by

Each of these entries can be followed by an _attr and a _re version of it. The _attr version targets a specific attribute of the designated HTML element (an HTML attribute or a JavaScript property of the HTML node [1]). The _re version applies a .replace() on the retrieved value: it needs a list of two strings, the first being a regular expression and the second a replacement pattern (see example below).

It’s also possible to give an _xpath version to use XPaths instead of CSS selectors.
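As a rough illustration of the _re mechanism, here is a minimal sketch of how a list of two strings can be applied with String.prototype.replace(). The applyRe function name and the sample text are hypothetical; the add-on’s actual code may proceed differently.

```javascript
// Hypothetical helper: apply a "_re" entry (a list of two strings) to a value.
// rePair[0] is the regular expression source, rePair[1] the replacement pattern.
function applyRe(value, rePair) {
  const [pattern, replacement] = rePair;
  return value.replace(new RegExp(pattern), replacement);
}

// Made-up author line, as it could be retrieved by an r_by selector.
const cleaned = applyRe("By John Doe, staff writer", ["^By (.+), .*$", "$1"]);
console.log(cleaned);
```

Only the first captured group is kept here, because the pattern matches the whole string and the replacement pattern is just "$1".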

5.2. External doc about CSS, RegEx and XPath

5.2.1. JSON

JSON syntax at Mozilla Developer Network: just keep in mind that only double quotes are allowed, and no trailing commas

5.2.2. CSS selectors

Mozilla Developer Network about CSS selectors

More documentation on CSS selectors from

5.2.3. Regular Expressions

5.2.4. XPath

XPath doc at MDN.

XPath at

5.3. Examples

5.3.1. RSS based source

	"": {
		"favicon_url": "",
		"news_rss_url": "",
		"search_url": "{}/feed/rss2/", (1)
		"search_url_web": "{}", (2)
		"type": "XML", (3)
		"tags": {  } (4)
1 In this URL, the {} will be replaced with your search terms.
2 This 2nd URL is used to redirect the user to the source’s online result page, for instance to go deeper into this source.
3 When there is no type entry in the source definition, results are expected in HTML. Here we specify that the source responds in XML, but it could also be JSON.
4 tags are explained later
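The token replacement described in these notes can be sketched as follows. This is only an illustration: the buildSearchUrl name, the example.org address and the choice of encodeURIComponent() to encode the search terms are assumptions, not the add-on’s actual code. The {#} token (maximum number of results) is the one used by some sources, such as the JSON based example.

```javascript
// Hypothetical helper: fill the {} and {#} tokens of a search_url template.
function buildSearchUrl(template, searchTerms, maxResults) {
  return template
    .replace("{}", encodeURIComponent(searchTerms)) // search terms, URL-encoded (assumption)
    .replace("{#}", String(maxResults));            // max number of results per request
}

// example.org stands in for a real source address.
const url = buildSearchUrl("https://example.org/?s={}&limit={#}", "yellow vests", 10);
console.log(url);
```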

5.3.2. Extend your own source definitions

	"": {
		"extends": "", (1)
		"favicon_url": "",
		"news_rss_url": "",
		"search_url": "{}/feed/rss2/",
		"search_url_web": "{}",
		"tags": {  }
1 Here "" is the key of the entry to extend.

In this case, a copy of the extended source is used and completed with the elements provided by the new source.

If you need to remove an element coming from the extended source, you can set it to null in the new source definition. This way, the element is no longer taken into account for the new source.
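The extension mechanism can be pictured with the sketch below. It assumes a simple shallow merge, which may not match what the add-on really does for nested entries such as the tags; the resolveSource function and the "Example News" sources are hypothetical.

```javascript
// Hypothetical helper: resolve a source definition that uses "extends".
function resolveSource(sources, key) {
  const def = sources[key];
  if (!def.extends) return { ...def };
  // Start from a copy of the extended source, then overlay the new entries.
  const merged = { ...resolveSource(sources, def.extends), ...def };
  delete merged.extends;
  // Entries set to null are treated as removed.
  for (const [name, value] of Object.entries(merged)) {
    if (value === null) delete merged[name];
  }
  return merged;
}

// Made-up definitions, for illustration only.
const sources = {
  "Example News": { search_url: "https://example.org/?s={}", type: "XML", r_by: "author" },
  "Example News Sports": { extends: "Example News", search_url: "https://example.org/sports/?s={}", r_by: null }
};
const resolved = resolveSource(sources, "Example News Sports");
console.log(resolved);
```

The new source keeps the type of the extended one, uses its own search_url, and no longer has an r_by entry.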

5.3.3. JSON based source definition

To diagnose an AJAX result loading case, it’s possible to use Firefox’s developer tools. The F12 key opens those tools, and you can then click on the Console tab. The XHR requests that occur after the initial page load are listed there. Each request can be inspected in the console, including the JSON response payload.

If the inspected request contains your search results, then you already have its address, and you can determine the JSON paths to reach each piece of information you want.

	"": {
		"favicon_url": "",
		"type": "JSON",
		"search_url": "{}&count={#}&sort=displaydatetime desc",  (1) (2)
		"search_url_web": "{}&btn-search=&sort=Date&datefilter=All%20Since%202005",
		"res_nb": "", (3)
		"results": "results.documents", (4)
		"r_h1": "headline",
		"r_url": "contenturl",
		"r_dt": "pubdatetime",
		"r_txt": "blurb",
		"r_by": "authors", (5)
		"r_by_attr": "name", (5)
		"tags": {  }
1 {#} will be replaced with the maximum number of results per request. You can change this parameter in the settings.
2 For this example the URL has been cut, but the sort=displaydatetime desc GET parameter was kept because we always try to get results sorted by date, newest first.
3 When parsing JSON objects, you can specify a path (JavaScript style) to point at deep values (not only at 1st level ones).
4 This JSON path points to the list of results that will be iterated over.
5 Those two lines are in fact from the source definition. The r_by property points at a JSON list, and r_by_attr designates the attribute to fetch from each element of the list. The names are then joined with commas to build the list of authors as a single field.
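To make the JSON path notes more concrete, here is a minimal sketch of how a dotted path such as "results.documents" can be followed, and how r_by / r_by_attr can produce a single author field. The getByPath and getAuthors helpers and the sample response are hypothetical, and the ", " separator is an assumption.

```javascript
// Hypothetical helper: follow a dotted path such as "results.documents".
function getByPath(obj, path) {
  return path.split(".").reduce((node, key) => (node == null ? undefined : node[key]), obj);
}

// Hypothetical helper: read r_by / r_by_attr and join the names with commas.
function getAuthors(result, rBy, rByAttr) {
  const authors = getByPath(result, rBy) || [];
  return authors.map((author) => author[rByAttr]).join(", ");
}

// Made-up response, shaped like the entries of the example above.
const response = {
  results: {
    documents: [
      { headline: "Sample headline", authors: [{ name: "A. Writer" }, { name: "B. Reporter" }] }
    ]
  }
};
const documents = getByPath(response, "results.documents");
console.log(documents[0].headline, getAuthors(documents[0], "authors", "name"));
```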

Results might also be sent in "jsonp" format. It means that your JSON data is embedded in a regular JavaScript file (and program).

In this case, a regular expression replacement scheme can be specified (with the name jsonp_to_json_re) to extract the valid JSON portion from the JavaScript file (and String).
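As an illustration, the sketch below strips a JavaScript function call wrapped around a JSON payload. The jsonpToJson helper, the "callback" wrapper and the pattern itself are made up for the example; each source needs a pattern adapted to what it actually sends.

```javascript
// Hypothetical helper: strip the JavaScript wrapper of a JSONP response.
function jsonpToJson(text, rePair) {
  const [pattern, replacement] = rePair;
  // The "s" flag lets "." match line breaks inside the payload.
  return JSON.parse(text.replace(new RegExp(pattern, "s"), replacement));
}

// "callback(...)" is a made-up wrapper; real sources use their own function names.
const jsonp = 'callback({"results": [{"headline": "Sample"}]});';
const data = jsonpToJson(jsonp, ["^[^(]*\\((.*)\\);?\\s*$", "$1"]);
console.log(data.results[0].headline);
```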


Results might also be sent as valid HTML embedded in a JSON object.

In this case you can specify a json_to_html JSON path in the source definition to point at the specific JSON location where a valid HTML string will be found and parsed.

For the moment only one JSON location can be parsed as HTML, but this might become finer-grained with per-field conversion (r_h1_html, r_url_html…).

5.3.4. CSS based source definition

	"": {
		"favicon_url": "",
		"news_rss_url": "",
		"search_url": "{}&sort=date&order=desc",
		"res_nb": ".sub-title",
		"res_nb_re": ["^(\\d+) ", "$1"], (1)
		"results":	" > li", (2)
		"r_h1": "h2",
		"r_url": "h2 a",
		"r_url_attr": "href", (3)
		"r_dt": "",
		"r_dt_fmt_1": [
			"\\s(\\d+)[ermè]* (.+) (\\d{4})",
			"$3-{$2}-$1"
		], (4)
		"r_txt": "p",
		"r_by": ".author a[rel=author]",
		"tags": {  }
1 res_nb can also use a complementary _re entry; here it extracts a number at the beginning of a line.
2 This CSS expression extracts the results from the web page. It points directly at the collection of results, which will be grabbed via querySelectorAll(). Note that we used a strict CSS selector (with >) to ensure we don’t grab unwanted elements from elsewhere on the page.
3 r_url_attr retrieves the value of the href attribute.
4 r_dt_fmt_1:
  • Here we capture the date elements to put them in the right order. The month name (pointed at by {$2}) will be converted into the corresponding number.

  • Note that to write a backslash in a JSON (or JavaScript) string, you need to escape it, hence the double backslash in "\\s" and "\\d".

  • Finally, as the name of this attribute suggests, you can define as many date formats as the source uses (for instance if the source uses relative date formats such as "1h ago" in addition to absolute ones such as "2022-03-21").

5.3.5. HTTP POST based source definition

	"": {
		"favicon_url": "",
		"method": "POST", (1)
		"body": "siteSearchInput={}&x=7&y=11&span=365", (2)
		"search_url": "",
		"r_dt": "time", (3)
		"r_dt_attr": "datetime", (3)
1 In addition to the usual search_url, we need to set the POST method.
2 And a body for the request, which is the POST equivalent of the GET query string. This format is called application/x-www-form-urlencoded. The body might also be JSON, in which case you’ll have to specify a search_ctype entry with the 'application/json' content.
3 Here we can note that when a <time datetime=""> HTML tag is available, it’s preferable to use it: this avoids the regular expression format step, and avoids having to define a timezone in the tags.
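To show how the method, body and search_ctype entries fit together, here is a minimal sketch that turns a source definition into arguments for the standard fetch() function. The buildRequest helper and the example.org address are hypothetical, and the encoding of the search terms is an assumption.

```javascript
// Hypothetical helper: turn a source definition into fetch() arguments.
function buildRequest(source, searchTerms) {
  const terms = encodeURIComponent(searchTerms);
  const options = { method: source.method || "GET" };
  if (options.method === "POST") {
    // The search terms go in the request body instead of the URL.
    options.body = source.body.replace("{}", terms);
    options.headers = {
      "Content-Type": source.search_ctype || "application/x-www-form-urlencoded"
    };
  }
  return { url: source.search_url.replace("{}", terms), options };
}

// example.org stands in for the real search address.
const request = buildRequest(
  { method: "POST", body: "siteSearchInput={}&x=7&y=11&span=365", search_url: "https://example.org/search" },
  "yellow vests"
);
console.log(request.url, request.options);
```

The result could then be passed to fetch(request.url, request.options).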

5.3.6. XPath based source definition

XPath is a very powerful language and it can be used in place of any CSS selector.

   "": {
		"favicon_url": "",
		"news_rss_url": "",
		"search_url": "{}&ordering=newest&searchphrase=all&limit={#}", (1)
		"res_nb": ".searchintro .bagde",
		"results": ".result-title",
		"r_h1": "a",
		"r_url": "a",
		"r_url_attr": "href",
		"r_dt_xpath": "./following-sibling::dd[@class='result-created'][1]/strong", (2)
		"r_txt_xpath": "./following-sibling::dd[@class='result-text'][1]",
		"r_by_xpath": "./following-sibling::dd[@class='result-category'][1]/span",
		"tags": {  }
1 As for the WaPo (JSON based) source definition, the Helsinki Times allows us to set the number of results we want in its answer, so the {#} token is used and will be replaced with the wanted number from the settings page.
2 Instead of a regular r_dt field, here we have an r_dt_xpath field. So what follows is not a CSS selector but an XPath expression. Here it reaches the next sibling element relative to the current one, which is not possible via CSS.

One can also note that:

  • Reaching parent elements is not possible in CSS either.

  • XPath is also needed when XML namespaces are involved (like in most encountered RSS feeds extended with Dublin Core DTD).

5.4. Regular expressions

Regular expressions are a complex subject. Here is some documentation again. If you have already worked with RegEx, here are some key points to keep in mind:

  • patterns need to be delimited with known elements before and after what you want to extract: in "\\s(\\d+) " there is a space (or a tab) before and a space after.

  • you mainly need: \\d+ \\w+ \\s+ (to match numbers, words, and any kind of spaces)

  • then you’ll mostly use: () ()? (?:) (to extract the pattern between parentheses, with a ? after if the pattern might be missing, and with ?: inside at the beginning to avoid extracting this group, which then gets no corresponding "$1" / "$2" in the replacement pattern).
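These key points can be tried out in the browser console (or in Node.js) with a short sketch like this one. The sample strings and the datePattern expression are made up for the illustration.

```javascript
// "\\s(\\d+) " as written in a JSON string is the pattern \s(\d+) followed by a space.
const dayPattern = new RegExp("\\s(\\d+) ");
const day = "Published on 21 March 2022".match(dayPattern)[1];

// (\\d+) and (\\w+) capture; (?:st|nd|rd|th)? is optional and not captured;
// (?: (\\d{4}))? is an optional part whose inner group is captured as $3.
const datePattern = new RegExp("(\\d+)(?:st|nd|rd|th)? (\\w+)(?: (\\d{4}))?");
const withYear = "3rd March 2022".replace(datePattern, "$3-$2-$1");
const withoutYear = "3rd March".match(datePattern);
console.log(day, withYear, withoutYear[3]);
```

When the optional year is missing, the corresponding group is simply undefined, which is why the pattern still matches "3rd March".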

5.5. Images

The integration of images in results has been simplified by the following fields: r_img, r_img_src, r_img_alt and r_img_title.

r_img retrieves all the fields of an image directly, with a CSS or XPath selector, when it points at an <img …> HTML tag whose src attribute properly holds the image source (the alternative text and the title, both optional, being respectively in the alt and title attributes). The image is then integrated without any additional processing.

If it’s not the case (as for Euronews, where the information is stored in other attributes like data-src, data-alt, data-title, or Die Presse, where the information is stored in different HTML tags), it is possible to complete the definition of images with the r_img_src, r_img_alt and r_img_title fields, and even r_img_src_attr, r_img_alt_attr and r_img_title_attr.

For JSON sources with images (such as La Croix or Les Echos), r_img is useless and r_img_src is mandatory; it’s advised to add r_img_alt and r_img_title if the information is available.

It is also possible to use regular expressions on these fields with _re (e.g. El Mercurio (fotos)) or templates with _tpl (e.g. Les Echos).

5.6. Date formats

The add-on supports every date format accepted by new Date('date_string') and English relative dates like 3 minutes ago, 8 hours ago or even today and yesterday.

5.6.1. Languages

For sources in other languages, the date has to be converted into one of the supported formats (generally the ISO format yyyy/mm/dd hh:mm:ss tz is used).

5.6.2. Relative dates

Then, as sources may use different date formats (based on the age of the results), you can specify multiple date formats named r_dt_fmt_1, r_dt_fmt_2, and so on.

Those formats are RegEx replacement patterns, and they are tried one after another until a valid date comes out.
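The "try each format until a valid date comes out" logic can be sketched as below. The parseWithFormats helper and the two sample formats are hypothetical; the add-on’s own implementation, including its handling of relative dates, may differ.

```javascript
// Hypothetical helper: try each [regex, replacement] format until a valid date comes out.
function parseWithFormats(rawDate, formats) {
  for (const [pattern, replacement] of formats) {
    const re = new RegExp(pattern);
    if (!re.test(rawDate)) continue; // this format does not apply, try the next one
    const date = new Date(rawDate.replace(re, replacement));
    if (!Number.isNaN(date.getTime())) return date;
  }
  return null; // no format produced a valid date
}

// Made-up formats, playing the roles of r_dt_fmt_1 and r_dt_fmt_2.
const formats = [
  ["^(\\d{2})/(\\d{2})/(\\d{4}) (\\d{2}):(\\d{2})$", "$3-$2-$1T$4:$5:00Z"],
  ["^(\\d{2})\\.(\\d{2})\\.(\\d{4})$", "$3-$2-$1T00:00:00Z"]
];
console.log(parseWithFormats("21/03/2022 10:30", formats));
console.log(parseWithFormats("21.03.2022", formats));
```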

5.6.3. TimeZones

Otherwise, when the timezone is not already included in the grabbed date format, all dates are normalized with respect to their timezone using the toLocaleTimeString() function (see the timezoned_date() function in js/BOM_utils.js) and the "tz" entry of the "tags", if provided. A native JavaScript API would be welcome in this area.

5.6.4. Month name conversion

As shown in the CSS based source definition example, you can get a month name converted into its number by putting it between curly braces: "$3-{$2}-$1".

But if your date is written in English in a Japanese newspaper, you’ll have to set a date_locale entry in the tags to get correct month name conversions.

A date_locale is used, for instance, for the Esperanto version of the Monde Diplomatique newspaper.
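One possible way to perform such a conversion is sketched below, using the standard Intl.DateTimeFormat API to list the month names of a locale. This is only an assumption about how it could be done: the monthNames and applyDateFormat helpers are hypothetical and the add-on may rely on its own month tables.

```javascript
// Hypothetical helper: list the month names of a locale, lower-cased, January first.
function monthNames(locale) {
  const fmt = new Intl.DateTimeFormat(locale, { month: "long", timeZone: "UTC" });
  return Array.from({ length: 12 }, (_, i) =>
    fmt.format(new Date(Date.UTC(2000, i, 1))).toLowerCase());
}

// Hypothetical helper: apply a date format where {$n} marks a month name to convert.
function applyDateFormat(rawDate, [pattern, replacement], locale) {
  const match = rawDate.match(new RegExp(pattern));
  if (!match) return null;
  const names = monthNames(locale);
  return replacement
    // {$n}: month name converted to a two-digit number ("00" if unknown)
    .replace(/\{\$(\d)\}/g, (_, n) =>
      String(names.indexOf(match[n].toLowerCase()) + 1).padStart(2, "0"))
    // $n: captured group copied as is
    .replace(/\$(\d)/g, (_, n) => match[n]);
}

const iso = applyDateFormat("on 21 March 2022", ["\\s(\\d+) (.+) (\\d{4})", "$3-{$2}-$1"], "en");
console.log(iso);
```

The locale argument plays the role of the date_locale tag: it selects the language in which the month names are looked up.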

5.7. Tags

It’s important to reproduce at least the tags of '' :

"tags": {
	"name": "",
	"lang": "pt", (1)
	"country": "br", (2)
	"themes": ["general", "politics"],
	"tech": ["OR", "fast"], (3)
	"src_type": ["Press", "Reference Press"],
	"res_type": ["text", "image"],
	"tz": "Europe/Paris", (4)
	"charset": "gb2312", (5)
	"date_locale": "en" (6)
1 The two-letter code of the language, following the ISO 639 standard.
2 The two-letter code of the country, following the ISO 3166 standard.
3 Technical tags mostly work in pairs:
  • one word or many words depends on the ability of the source to give results matching one word or all the words of a query. If, even for one word, the source can’t give matching results, the approx tag is used; those sources are usually misleading on queries for which they have no proper answer, but still useful on widely covered subjects. If a source is configured to return results matching the exact given expression (for instance because the search terms are integrated with quotes around them in its search URL), it is tagged exact.

  • fast or slow currently depends on whether results are fetched in less than 3 seconds or more. We will live-test this information for more accuracy in the future.

  • external search refers to the search mechanism of the source and should be set if the source relies on a third party (like Google Search for the Guardian; I hope to help them take back control of their search engine one day).

  • indep.: if the source is not part of a bigger group with non-journalistic activities, and is not owned by a state or by a company listed on a stock exchange, it can be defined as independent with this tag.

  • for kids sources are the only available sources when the "child mode" is activated in the settings. You are encouraged to also add for kids < 9 or for kids > 9 when relevant.

  • HTTPS / HTTP is a computed tag, you don’t have to set it. It allows searching only in sources accessible over secure HTTPS.

  • the broken tag prevents the source from being used (for instance if it has been reported as defective).

4 The tz timezone tag is only needed if the dates of the results carry no timezone.
5 The charset tag is only needed when the source is not serving its web pages in UTF-8.
6 The date_locale tag is only needed if you have to get a month name converted into its number while the date is not written in the same language as the rest of the newspaper.

5.8. Gather multiple elements in a template

In case of "image" (result type) search, it’s interesting to display both the photo result (image) and its description.

To do this, it’s possible to define a list of elements (as JSON paths or CSS selectors…) for the specified field (such as : r_txt), and to add an r_txt_tpl entry defining a string where you can put replacement tokens like $1, $2 … which will be replaced by the respective values of the elements of the list.

Furthermore, you can define an r_txt_attr with a list of attribute names to be retrieved.

To finish, if the last attribute name is missing in the list, the textContent of the last element will be retrieved instead.

You can check the "El Mercurio (fotos)" source for an example, or "Midi Libre photos", and "Süddeutsche Zeitung" for a missing last attribute element example.
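The template mechanism can be pictured with the following sketch, in which each $n token is replaced by the n-th gathered value. The fillTemplate helper and the sample values are hypothetical.

```javascript
// Hypothetical helper: fill a "_tpl" string with the values gathered for a field.
function fillTemplate(template, values) {
  // $1 is replaced by values[0], $2 by values[1], and so on.
  return template.replace(/\$(\d+)/g, (token, n) => {
    const value = values[Number(n) - 1];
    return value === undefined ? token : value; // unknown tokens are left untouched
  });
}

// Values as they might be extracted by a list of selectors (made-up sample data).
const values = ["Crowd in front of the city hall", "Photo: A. Writer"];
const text = fillTemplate("$1 ($2)", values);
console.log(text);
```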

5.9. Redirections

A source may need to perform an HTTP redirection to actually serve results. If it’s possible to directly target the 2nd URL, that remains the simplest way. But if it’s not possible, as with the Daily Telegraph, you will have to add a redir_url field in its source definition. The add-on will then ask for the Host Permission of this domain too (at search time).

5.10. Domain part

If a source uses relative URLs in its href attributes, those URLs will be completed with a prefix containing the source domain. Unfortunately, if the correct path contains additional subfolders, you will have to specify which "domain_part" to use to complete relative URLs, via a dedicated field in the source definition. It looks like this:

	"domain_part": "",

5.11. Help

If, having read this documentation, you still have questions about how to add sources, you can ask them (preferably):

1. This allows requesting innerHTML, for instance, and parsing HTML comments.