Meta-Press.es

Decentralized search engine & automated press reviews

Documentation

1. What is it?

Meta-Press.es is a tool to search the online press (international reference press, specialized reviews, general purpose press…). It can be used, for instance, to compile press reviews for large associations or to track scientific press coverage (via dedicated sources). Searches are keyword-driven and you can select precisely which sources to search (by type, language, theme, result type…). Users therefore have full control over the sources queried, and they can even add their own sources.

Then it’s possible to schedule daily searches (or monthly ones).

This search engine takes the form of a web browser addon (standard WebExtension format). Searching and sorting are handled by the extension itself, on the computer of the user, which gives the user fine-grained control over the tool.

Meta-Press.es is already able to search through hundreds of sources and this number regularly grows.

With Meta-Press.es, there is no central server involved that could watch over and profile users, adapt search results to their behavior and habits, or add targeted ads. So there is no bubble effect as with mainstream commercial search engines.

Meta-Press.es works great with the Tor Browser (which was made, among other goals, to read the press with more confidentiality than with an ordinary web browser).

Meta-Press.es is free / libre software with open source code. Those who know HTML and JavaScript can explore its source code and adapt the tool to their needs (or easily add new sources).

Finally, Meta-Press.es is an energy-efficient tool: there is no need for polar circle datacenters to perform searches through the online press…

1.1. Limits

Meta-Press.es currently has three limits: it only searches through recent news (access to older results will be available in the future), it only sorts results in chronological order, and it requires users to install a web browser addon…

2. How does it work?

Meta-Press.es is a meta-search engine that allows you to query the internal search engine of each source (known to the system and selected for the search).

You get the number of results each source claims to have, and the newest available results from each.

Newspapers are scraped from the web (or from RSS feeds of results), one by one. So, at each query, you save the time that the developers spent parsing the newspapers :-)

Doing this, Meta-Press.es won’t activate any of the ad or social trackers of the queried newspapers, which protects your privacy. It also gives you back the choice of your information sources and clearly states which sources it searched.

Once you get the results, you can select and export some of them in a reusable format (RSS, JSON, CSV…). You’ll be able to re-import them later or use them somewhere else (send them by email to a friend, import them into a WordPress site…).

3. How to install it?

To install Meta-Press.es on your Firefox-based web browser, load the addons.mozilla.org page of the addon in your browser and click on the big blue "+ Add to Firefox" button.

4. How to use it?

Once installed, the add-on creates a new button in the toolbar decorated with the Meta-Press.es landing net icon.

Figure 1. button with the Meta-Press.es icon

A click on this button opens a new tab on the search engine interface.

If the button is missing from the toolbar, you may be browsing in private mode without having given Meta-Press.es the right to run in this mode during its installation. In this case, you can still grant the authorization via the list of addons of your browser (or via the about:addons address).

Figure 2. head of the search engine main interface

Under the Meta-Press.es logo and title, headlines from the selected newspapers are shown and rotated.

Then you can type your query and choose which newspapers to search in, using a multiple-criteria filter mechanism.

Results are listed right below, as they arrive.

Each result is composed of a title, a link, its source name, its date and potentially an author and an extract:

Figure 3. details of a result

Tools to work on results (sort, search, select) appear in the right column.

Figure 4. number of results by source and filter by source (clicking on its name)
Figure 5. filter by date

You can, for example, click on "Toggle selection mode" to display a checkbox for each result. You can then export your selection of checked results in various formats (JSON, RSS, ATOM and soon CSV).

Figure 6. select mode

To re-import the search results, click on the "Import JSON" (or RSS, or ATOM) link in the cyan horizontal top bar, and select the file to import in the file picker that pops up.

4.1. Cherry-pick sources to search in

It is possible to select the sources you want to query, one by one.

To do so, you first need to expand the second line of source filtering criteria by clicking on the [+] sign before the title of this part. At the end of the second line, there’s a "Cherry-pick source" select-multiple input.

Figure 7. Cherry-pick sources

All the available sources are listed here and you can pick only the ones you want.

When a search is finished, a statistics line appears above the listed results. This line ends with a 🔗 "chain link" icon, which lets you launch the same search again.

Figure 8. Search permalink icon

So it’s possible to create bookmarks for your favorite searches (saving configuration time).

4.3. Scheduled searches

Once you have typed your search terms and selected the sources you want to search in, it’s possible to save the search for later instead of launching it immediately. This is the role of the ⏰ Schedule search button under the source selection. This button opens a new tab on the "Settings" page, scrolled to the Scheduled searches part.

Figure 9. Scheduled searches, on dark background interface

This table shows one scheduled search per line. When created, a scheduled search is in the "Stop" state; you just have to select the date, time and periodicity you want for this search to activate it.

So you can schedule a daily search in a few clicks.

Actions are possible on scheduled searches:

  • the ✏️ "pen" button lets you edit the search: it opens the main search interface with the scheduled search settings (search terms, source selection). Once modified, your search settings can be saved by clicking on the "Schedule search" button of the main interface;

  • the 2nd button, with a copy/paste icon, lets you clone a scheduled search to get another one, which you can configure with the previously described button;

  • the 3rd button, with a cross on it, lets you delete a scheduled search;

  • the 4th and last button lets you manually start the search from the table.

5. How to add a new source to the search engine?

If you are a programmer, you just have to add an entry in the js/sources.json JSON object (or write your entry in the settings panel of the addon).

Here are useful examples, listed at the top of the js/sources.json file:

  • Mediapart.fr/en is a good and simple example using "normal" CSS selectors

  • New Europe is an example of source providing results in RSS format

  • New Europe Greece extends the New Europe definition

  • The Washington Post provides results in JSON

  • The Japan News uses the HTTP POST method

  • Helsinki Times uses XPaths to parse some fields

5.1. Methodology

  • First, visit the website of the source you want to add and note the main URL to pick up headlines from (preferably in HTTPS);

  • Try its search functionality:

    • check if the results are accessible in RSS (or ATOM) format using the developer tools (F12 key, default Inspector tab, search for "rss"); this would spare two thirds of the parsing work

      • in this case, add a "favicon_url" entry in your JSON object for the source’s favicon

      • still in this case, you don’t need to provide the timezone of the source in the tags

    • check that this URL returns results in chronological order, or that the results can be sorted this way; otherwise the source is incompatible (see the note below)

    • if the result URL does not contain your search terms, the source might be using the HTTP POST method; you can look at other sources using the POST method, such as The Japan News

    • check that the results really come from this request via the developer tools (F12 key, Network tab, Response preview). Results can be loaded via JSON and XHR requests; see The Washington Post for an example of how to deal with it

    • try to search for multiple terms and find a way to get a logical "AND" between the search terms (try adding quotes around the terms, for instance). If you can’t get a logical "AND", add the "OR" technical tag for this source. It’s also possible to add quotes around the search terms and obtain an "EXACT" search on the expression

    • note a search whose terms give results: the default is "quadrature du net", but "yellow vests" also works well

    • check if the source provides different types of results (text, photo, video, audio); in this case, you will be able to create one source entry per result type (it’s easy when you just extend your first source definition)

If something goes wrong, such as:

  • no search functionality

  • no date on results

  • no date sort

Please provide some feedback to the source about the problem and add it to the list of incompatible sources in the wiki, along with the status of your feedback effort.

You can also help by contacting the sources of this list that have received no feedback yet.

Then, to write the source definition, there are 4 kinds of information to provide:

  • general info: name, timezone and tags at the end;

  • headline: two entries to retrieve the big news of the moment for this newspaper (the URL of the newspaper, and the CSS selector to get the link);

  • search: the source search URL (which provides the results);

  • result parsing: 5 more entries to retrieve specific elements of each result (the last two being optional):

    • title: r_h1

    • link: r_url

    • date: r_dt

    • extract: r_txt

    • author: r_by

Each of these entries can be followed by an _attr and an _re version, to target a specific attribute of the designated HTML element or to apply a .replace() on it. The _re version needs a list of two strings: the first is a regular expression and the second a replacement pattern (see the examples below).
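As a rough illustration of the _re mechanism, here is a minimal sketch that applies a list of two strings to an extracted value. The applyRe helper is hypothetical, and the way Meta-Press.es compiles the regular expression internally is an assumption; only the use of .replace() comes from this documentation.

```javascript
// Hypothetical helper: apply a "_re" entry (a list of two strings) to a value.
// Assumption: the first string is compiled with new RegExp() and handed to
// String.prototype.replace() together with the replacement pattern.
function applyRe(value, reEntry) {
  const [pattern, replacement] = reEntry;
  return value.replace(new RegExp(pattern), replacement);
}

// Keep only the number at the beginning of a made-up result counter.
// Backslashes are doubled because the pattern is written as a string.
const count = applyRe("42 results found", ["^(\\d+) .*$", "$1"]);
console.log(count);
```

Only the part of the value matched by the regular expression is replaced, so a pattern that should reduce the value to one captured group needs to match the whole string.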

It’s also possible to give an _xpath version to use XPath expressions instead of CSS selectors.

5.2. External doc about CSS, RegEx and XPath

5.2.1. JSON

JSON syntax at Mozilla Developer Network and json.org: just keep in mind that only double quotes are allowed, and no trailing commas

5.2.2. CSS selectors

Mozilla Developer Network about CSS selectors

More documentation on CSS selectors from medium.com

5.2.3. Regular Expressions

5.2.4. XPath

XPath doc at MDN.

XPath at devhints.io.

5.3. Examples

5.3.1. RSS based source

{
	"https://www.neweurope.eu": {
		"favicon_url": "https://www.neweurope.eu/wp-content/uploads/2019/07/NE-16.jpg",
		"headline_url": "https://www.neweurope.eu",
		"headline_selector": ".td-module-meta-info .entry-title a",
		"search_url": "https://www.neweurope.eu/search/{}/feed/rss2/", (1)
		"search_url_web": "https://www.neweurope.eu/?s={}", (2)
		"extends": "RSS", (3)
		"tags": {  } (4)
	}
}
1 In this URL, the {} will be replaced by Meta-Press.es with your search terms.
2 This 2nd URL is used to redirect the user to the online result page of the source, for instance to dig deeper into this source.
3 This definition relies on the provided generic RSS source definition and extends it.
4 Tags are explained later.
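The {} substitution described in the first callout can be pictured with the short sketch below. The buildSearchUrl helper is hypothetical, and URL-encoding the search terms with encodeURIComponent() is an assumption about how the token gets filled.

```javascript
// Hypothetical helper: build the request URL from a "search_url" template.
// Assumption: the search terms are URL-encoded before replacing the {} token.
function buildSearchUrl(searchUrl, terms) {
  return searchUrl.replace("{}", encodeURIComponent(terms));
}

const url = buildSearchUrl(
  "https://www.neweurope.eu/search/{}/feed/rss2/",
  "quadrature du net"
);
console.log(url);
```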

5.3.2. Extend your own source definitions

	"https://www.neweurope.gr": {
		"favicon_url": "https://www.neweurope.gr/wp-content/uploads/2019/07/favicongr-16.jpg",
		"headline_url": "https://www.neweurope.gr",
		"search_url": "https://www.neweurope.gr/search/{}/feed/rss2/",
		"extends": "https://www.neweurope.eu", (1)
		"tags": {  }
	}
1 Here "https://www.neweurope.eu" is the key of the entry to extend.
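One way to picture the "extends" entry is a recursive merge of definitions, as sketched below. The resolveSource helper is hypothetical, the shallow merge is an assumption (the real merge rules may differ), and the content of the "RSS" entry is a placeholder rather than the real generic RSS definition.

```javascript
// Hypothetical resolver for the "extends" entry.
// Assumption: a child definition inherits every entry of its parent and
// overrides the ones it redefines (shallow merge).
function resolveSource(sources, key) {
  const def = sources[key];
  if (!def) throw new Error(`Unknown source: ${key}`);
  const { extends: parentKey, ...own } = def;
  if (!parentKey) return own;
  return { ...resolveSource(sources, parentKey), ...own };
}

const sources = {
  // Placeholder content: not the real generic RSS definition.
  "RSS": { "r_h1": "title" },
  "https://www.neweurope.eu": {
    "headline_url": "https://www.neweurope.eu",
    "headline_selector": ".td-module-meta-info .entry-title a",
    "search_url": "https://www.neweurope.eu/search/{}/feed/rss2/",
    "extends": "RSS"
  },
  "https://www.neweurope.gr": {
    "headline_url": "https://www.neweurope.gr",
    "search_url": "https://www.neweurope.gr/search/{}/feed/rss2/",
    "extends": "https://www.neweurope.eu"
  }
};

// The Greek edition keeps its own URLs and inherits the headline selector.
const resolved = resolveSource(sources, "https://www.neweurope.gr");
console.log(resolved);
```

This is why the extended definition can omit "headline_selector": under the merge assumed here, it comes from the parent entry.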

5.3.3. JSON based source definition

To diagnose a case where results are loaded via AJAX, it’s possible to use the Firefox developer tools. The F12 key opens those tools, and then we can click on the Console tab. The XHR requests that occur after the initial page load are listed there. Each request can be inspected in the console, including its JSON response payload.

If the inspected request contains your search results, then you already have its address and you can determine the JSON paths to reach each piece of wanted information.

	"https://www.washingtonpost.com": {
		"type": "JSON",
		"search_url": "https://sitesearchapp.washingtonpost.com/sitesearch-api/v2/search.json?query={}&count={#}&sort=displaydatetime desc",  (1) (2)
		"search_url_web": "https://www.washingtonpost.com/newssearch/?query={}&btn-search=&sort=Date&datefilter=All%20Since%202005",
		"favicon_url": "https://www.washingtonpost.com/favicon.ico",
		"res_nb": "results.total", (3)
		"results": "results.documents", (4)
		"r_h1": "headline",
		"r_url": "contenturl",
		"r_dt": "pubdatetime",
		"r_txt": "blurb",
		"r_by": "authors", (5)
		"r_by_attr": "name", (5)
		"headline_unavailable": "because of GDPR consent first", (6)
		"tags": {  }
	}
1 {#} will be replaced by Meta-Press.es with the maximum number of results per request (it will soon be a parameter in the settings; currently it’s 20).
2 For this example the URL has been shortened, but I kept the sort=displaydatetime desc GET parameter because we always try to get results sorted by date, newest first.
3 When parsing JSON objects, you can specify a path (JavaScript style) to point at deep values (not only 1st-level ones).
4 This JSON path points to the list of results that Meta-Press.es will go through.
5 Those two lines are in fact from the https://www.arretsurimages.net source definition. The r_by property points at a JSON list, and r_by_attr designates the attribute to fetch from each element of the list. The names are then joined with commas to build the list of authors as a single field.
6 There is no headline_url property in this definition; instead I used an arbitrary variant of the name to store the explanation of the problem. With no headline_url, Meta-Press.es won’t be able to load headlines from this source, but that’s OK.
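To make the JSON paths more concrete, here is a minimal sketch of how dotted paths such as "results.documents" could be followed, and how r_by and r_by_attr could be combined. The getPath helper and the payload are made up for illustration; they are neither the real parser nor a real API response.

```javascript
// Hypothetical helper: follow a dotted path such as "results.documents".
function getPath(obj, path) {
  return path.split(".").reduce(
    (node, key) => (node == null ? undefined : node[key]),
    obj
  );
}

// Made-up payload shaped like the definition above (not a real response).
const payload = {
  results: {
    total: 2,
    documents: [
      { headline: "First", authors: [{ name: "A. Writer" }, { name: "B. Editor" }] },
      { headline: "Second", authors: [] }
    ]
  }
};

const def = {
  res_nb: "results.total",
  results: "results.documents",
  r_h1: "headline",
  r_by: "authors",
  r_by_attr: "name"
};

const total = getPath(payload, def.res_nb);
const parsed = getPath(payload, def.results).map((doc) => ({
  title: getPath(doc, def.r_h1),
  // r_by points at a list; r_by_attr is read from each element, then joined.
  author: getPath(doc, def.r_by).map((a) => a[def.r_by_attr]).join(", ")
}));
console.log(total, parsed);
```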

5.3.4. CSS based source definition

	"https://www.mediapart.fr": {
		"headline_url": "https://www.mediapart.fr",
		"headline_selector": ".une-block h3 a", (1)
		"search_url": "https://www.mediapart.fr/search?search_word={}&sort=date&order=desc",
		"res_nb": ".sub-title",
		"res_nb_re": ["^(\\d+) ", "$1"], (2)
		"results":	"ul.search > li", (3)
		"r_h1": "h2",
		"r_url": "h2 a",
		"r_url_attr": "href", (4)
		"r_dt": "span.author",
		"r_dt_fmt_1": [
			"\\s(\\d+)[ermè]* (.+) (\\d{4})",
			"$3-{$2}-$1"
		], (5)
		"r_txt": "p",
		"r_by": ".author a[rel=author]",
		"tags": {  }
	},
1 This should point to a headline link, the main title of the main page if possible.
2 res_nb can also use a _re complementary entry; here it extracts a number at the beginning of a line.
3 It’s this CSS expression that extracts the results from the web page. It points directly at the results collection, which will be grabbed via querySelectorAll(). Note that we used a strict CSS selector (with >) to ensure we don’t grab unwanted elements from elsewhere on the page.
4 r_url_attr is used to get the href attribute value.
5 r_dt_fmt_1:
  • Here we capture the date elements to put them in the right order. The month name (pointed at by {$2}) will be converted into the correct number by month_nb.

  • Note that to specify a backslash in a JavaScript string, you need to escape it, hence the double backslash in "\\s" and "\\d".

  • Finally, as the name of this attribute suggests, you can define as many date formats as the source uses (for instance if the source uses relative dates like "1h ago" in addition to absolute ones like "2022-03-21").

5.3.5. HTTP POST based source definition

	"https://the-japan-news.com": {
		"headline_url": "https://the-japan-news.com",
		"headline_selector": "#topNewsWrapper a",
		"favicon_url": "https://the-japan-news.com/favicon.ico",
		"method": "POST", (1)
		"body": "siteSearchInput={}&x=7&y=11&span=365", (2)
		"search_url": "https://the-japan-news.com/news/search",
		
		"r_dt": "time", (3)
		"r_dt_attr": "datetime", (3)
		
	}
1 In addition to the usual search_url, we need to set the POST method.
2 And a body for the request, which is the POST equivalent of the GET query string.
3 Note that when a <time datetime=""> HTML tag is available, it’s preferable to use it: this avoids the regular expression date format step, and avoids having to define a timezone in the tags.
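The "method" and "body" entries map naturally onto the arguments of fetch(). The sketch below shows one possible translation; the buildRequest helper is hypothetical, and both the URL-encoding of the terms and the Content-Type header are assumptions that this documentation does not state.

```javascript
// Hypothetical helper: turn a source definition into fetch() arguments.
// Assumptions: {} in "body" is replaced like in "search_url", and the body is
// sent as application/x-www-form-urlencoded.
function buildRequest(def, terms) {
  const encoded = encodeURIComponent(terms);
  const init = { method: def.method || "GET" };
  if (init.method === "POST") {
    init.headers = { "Content-Type": "application/x-www-form-urlencoded" };
    init.body = def.body.replace("{}", encoded);
  }
  return { url: def.search_url.replace("{}", encoded), init };
}

const request = buildRequest(
  {
    method: "POST",
    body: "siteSearchInput={}&x=7&y=11&span=365",
    search_url: "https://the-japan-news.com/news/search"
  },
  "yellow vests"
);
console.log(request);
// In a browser or extension context: fetch(request.url, request.init)
```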

5.3.6. XPath based source definition

XPath is a very powerful language and it can be used in place of any CSS selector.

   "https://www.helsinkitimes.fi": {
		"headline_url": "https://www.helsinkitimes.fi",
		"headline_selector": "h2[itemprop=headline] a",
		"search_url": "https://www.helsinkitimes.fi/search1332318146.html?searchword={}&ordering=newest&searchphrase=all&limit={#}", (1)
		"res_nb": ".searchintro .bagde",
		"results": ".result-title",
		"r_h1": "a",
		"r_url": "a",
		"r_url_attr": "href",
		"r_dt_xpath": "./following-sibling::dd[@class='result-created'][1]/strong", (2)
		"r_txt_xpath": "./following-sibling::dd[@class='result-text'][1]",
		"r_by_xpath": "./following-sibling::dd[@class='result-category'][1]/span",
		"tags": {  }
	}
1 As with the WaPo JSON based source definition, the Helsinki Times allows us to set the number of results we want in its answer, via the {#} token. Meta-Press.es will replace this token with the wanted number, which will soon be a preference in the settings.
2 Instead of a regular r_dt field, here we have an r_dt_xpath field. So what follows is not a CSS selector but an XPath expression. Here it reaches the next sibling element relative to the current one, which is not possible via CSS.

One can also note that:

  • Reaching parent elements is not possible in CSS either.

  • XPath is also needed when XML namespaces are involved (like in most encountered RSS feeds extended with Dublin Core DTD).

5.4. Regular expressions

Regular expressions are a complex subject. Here is some documentation again. If you have already worked with RegEx, here are some key points to keep in mind:

  • patterns need to be delimited by known elements before and after what you want to extract: in "\\s(\\d+) " there is a space (or a tab) before and a space after.

  • you mainly need: \\d+ \\w+ \\s+ (to match numbers, words, and any kind of whitespace)

  • then you’ll mostly use: () ()? (?:) (parentheses extract the pattern between them, a ? after them marks a pattern that might be missing, and ?: at the beginning inside them avoids extracting this group, so it has no corresponding "$1" / "$2" in the replacement pattern).
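The key points above can be tried out with a short snippet. The pattern and the input strings are made up for illustration; they do not come from an actual source definition.

```javascript
// The pattern is written as a JSON-style string, hence the doubled backslashes.
// (\d+) is group $1; (?:st|nd|rd|th)? is optional and not captured;
// (\w+) is group $2; (?: (\d{4}))? is optional, its inner group is $3.
const pattern = "\\s(\\d+)(?:st|nd|rd|th)? (\\w+)(?: (\\d{4}))?";
const re = new RegExp(pattern);

// A missing optional group is replaced by an empty string.
const full = "on 3rd March 2022".replace(re, " $1|$2|$3");
const noYear = "on 21 March".replace(re, " $1|$2|$3");
console.log(full);   // on 3|March|2022
console.log(noYear); // on 21|March|
```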

5.5. Images

The integration of images in Meta-Press.es results has been simplified by the following fields: r_img, r_img_src, r_img_alt and r_img_title.

r_img retrieves all the fields of an image at once, without any additional processing, when it points (via a CSS or XPath selector) at an <img …> HTML tag whose image source is properly given in the src attribute (with the alternative text and the title, both optional, in the alt and title attributes respectively).

If it’s not the case (as for Euronews, where the information is stored in other attributes like data-src, data-alt, data-title, or Die Presse, where the information is stored in different HTML tags), it is possible to complete the definition of images with the r_img_src, r_img_alt and r_img_title fields, and even r_img_src_attr, r_img_alt_attr and r_img_title_attr.

For JSON sources with images (such as La Croix or Les Echos), r_img is useless and r_img_src is mandatory; it’s advised to add r_img_alt and r_img_title if the information is available.

It is also possible to use regular expressions on these fields with _re (e.g. El Mercurio (fotos)) or templates with _tpl (e.g. Les Echos).

5.6. Date formats

Meta-Press.es supports every date format accepted by new Date('date_string') as well as English relative dates like 3 minutes ago, 8 hours ago or even today and yesterday.

For sources in other languages, the date has to be converted into one of the supported formats (generally an ISO-like yyyy/mm/dd hh:mm:ss tz format is used).

Then, as sources may use different date formats (depending on the age of the results), you can specify multiple date formats named r_dt_fmt_1, r_dt_fmt_2 and so on.

Those formats are RegEx replacement patterns, and they are tried one after another until a valid date comes out.

In those replacement patterns you can put curly braces around a month name to get it converted into its number: "$3-{$2}-$1". In this case it’s the month_nb function that does the conversion. It can convert nearly every month name, from every living written language, into its corresponding month number, just from the month name (in UTF-8).
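The "try each format until a valid date comes out" logic could look like the sketch below. The parseWithFormats helper is hypothetical, the MONTHS table is a tiny stand-in for month_nb, the first format is made up, and skipping a format whose regular expression does not match is an assumption.

```javascript
// Tiny stand-in for month_nb: the real function covers far more languages.
const MONTHS = { janvier: 1, "février": 2, mars: 3 };

function monthNumber(name) {
  const nb = MONTHS[name.toLowerCase()];
  return nb === undefined ? name : String(nb).padStart(2, "0");
}

// Try each [regex, replacement] format in turn until a valid Date comes out.
// Assumption: a format whose regex does not match the raw text is skipped.
function parseWithFormats(raw, formats) {
  for (const [pattern, replacement] of formats) {
    const re = new RegExp(pattern);
    if (!re.test(raw)) continue;
    const converted = raw
      .replace(re, replacement)
      .replace(/\{([^}]+)\}/g, (_, month) => monthNumber(month));
    const date = new Date(converted.trim());
    if (!Number.isNaN(date.getTime())) return date;
  }
  return null;
}

const formats = [
  ["(\\d{4})/(\\d{2})/(\\d{2})", "$1-$2-$3"],      // made-up first format
  ["\\s(\\d+)[ermè]* (.+) (\\d{4})", "$3-{$2}-$1"] // same pattern as Mediapart
];
const date = parseWithFormats(" 21 mars 2022", formats);
console.log(date.toISOString());
```

This sketch does not zero-pad the day, so a single-digit day would need an extra step before being handed to new Date().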

Otherwise, all the dates are normalized with regard to their time zones by Meta-Press.es (function timezoned_date() in js/utils.js), using the recent toLocaleTimeString() function and the "tz" entry of the "tags", if provided, when the information is not already included in the grabbed date format. A native JavaScript API would be welcome in this area.

5.7. Tags

It’s important to reproduce at least the tags of 'Mediapart.fr/en':

"tags": {
	"name": "Mediapart.fr",
	"lang": "pt", (1)
	"country": "br", (2)
	"themes": ["general", "politics"],
	"tech": ["OR", "own search", "fast"], (3)
	"src_type": ["Press", "Reference Press"],
	"res_type": ["text", "photo"],
	"tz": "Europe/Paris", (4)
	"charset": "gb2312" (5)
}
1 The two-letter code of the language, following the ISO 639 standard.
2 The two-letter code of the country, following the ISO 3166 standard.
3 Technical tags mostly work in pairs:
  • one word or many words depends on the ability of the source to give results that match one word or all the words of a query. If the source can’t give matching results even for one word, the approx tag is used; those sources usually return misleading results for queries they have no proper answer to, but are still useful on widely covered subjects. If a source is configured to return results matching the exact given expression (for instance because the search terms are integrated with quotes around them in its search URL), it is tagged exact.

  • fast or slow currently depends on whether results are fetched in less than 3 seconds or more. We will live-test this information for more accuracy in the future.

  • internal search or external search refers to the search mechanism of the source: is this service internal or provided by a third party? (Typically, Google Search is the default search method of the Guardian, and I hope to help them take back control of their search engine.)

  • HTTPS / HTTP is a computed tag, you don’t have to set it. It allows searching only in sources accessible over secure HTTPS.

  • the broken tag prevents the source from being used (for instance if it has been reported as defective)

4 The timezone tz tag is only needed if the dates of the results contain no timezone.
5 The charset tag is only needed when the source is not serving its web pages in UTF-8.

5.8. date_locale

When a source shows month names in a language that is not the one of the result contents (as with the Esperanto version of Le Monde Diplomatique), it’s possible to specify the language used for the dates via the date_locale entry of the tags object.

5.9. Gather multiple elements in a template

In case of "photo" (result type) search, it’s interesting to display both the photo result (image) and its description.

To do this, it’s possible to define a list of elements (as JSON paths or CSS selectors…) for the specified field (such as r_txt), and to add an r_txt_tpl entry defining a string. In this string, you can put replacement tokens like $1, $2… which will get replaced by the respective values of the elements of the list.

Furthermore, you can define an r_txt_attr with a list of attribute names to be retrieved.

Finally, if the last attribute name is missing in the list, the textContent of the last element will be retrieved instead.

You can check the "El Mercurio (fotos)" source for an example, or "Midi Libre photos", and "Süddeutsche Zeitung" for a missing last attribute element example.
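The token replacement of a _tpl entry can be pictured with the sketch below. The fillTemplate helper, the template and the values are made up for illustration; leaving unknown tokens untouched is an assumption.

```javascript
// Hypothetical helper: replace $1, $2, ... with the values of the list.
// Assumption: a token without a matching value is left untouched.
function fillTemplate(tpl, values) {
  return tpl.replace(/\$(\d+)/g, (token, n) => {
    const value = values[Number(n) - 1];
    return value === undefined ? token : value;
  });
}

// Made-up values, as they could be retrieved for a list of two elements.
const text = fillTemplate(
  '<img src="$1" alt="$2"> $2',
  ["https://example.org/photo.jpg", "Harbour at dawn"]
);
console.log(text);
```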

5.10. Fine choice of text for headline

In some sources the headline_selector field encompasses too many elements, like headers: "En Direct", "Live"…

To solve this problem, it’s possible to add the h_title field to select the text to extract, with a CSS or XPath selector less encompassing than headline_selector, which must point at an HTML link (see examples in Les Echos or L’Obs).

You can also use, as with source result fields:

  • regular expressions with _re (as for Le Progrès)

  • or select an attribute with _attr

5.11. Redirections

It’s possible that a source needs to perform an HTTP redirection to actually serve results. If it’s possible to directly target the second URL, that is still the simplest way. But if it’s not possible, as with the Daily Telegraph, one has to add a redir_url field in the source definition. Meta-Press.es will then also ask for the Host Permission of this domain (at search time).

5.12. Domain part

If a source uses relative URLs in its href attributes, those URLs will be completed with a prefix containing the source domain. Unfortunately, if the correct path contains additional subfolders, you will have to specify which "domain_part" to use to complete relative URLs, via a dedicated field in the source definition. It looks like this:

{
	
	"domain_part": "http://china.dailynk.com/chinese",
	
}
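The completion of relative URLs could be sketched as follows. The completeUrl helper and the relative paths are hypothetical; keeping absolute URLs unchanged and falling back to the origin of the source key are assumptions about the default behavior.

```javascript
// Hypothetical helper: complete a relative href found in a result.
// Assumptions: absolute URLs are kept as-is, "domain_part" takes precedence
// when present, and the fallback prefix is the origin of the source key.
function completeUrl(href, sourceKey, def) {
  if (/^https?:\/\//.test(href)) return href;
  const prefix = def.domain_part || new URL(sourceKey).origin;
  return prefix.replace(/\/$/, "") + "/" + href.replace(/^\//, "");
}

// Made-up relative paths, for illustration only.
const withPart = completeUrl(
  "/read.php?id=1",
  "http://china.dailynk.com",
  { domain_part: "http://china.dailynk.com/chinese" }
);
const withoutPart = completeUrl("a/b.html", "https://www.mediapart.fr", {});
console.log(withPart, withoutPart);
```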

5.13. Help

If, having read this documentation, you still have questions about how to add sources to Meta-Press.es, you can ask them (preferably):