Meta-Press.es

Decentralized search engine & automatized press reviews

Documentation

1. How to install it?

To install Meta-Press.es on your Firefox-based web browser, open the add-on's page on addons.mozilla.org in your browser and click on the big blue "+ Add to Firefox" button.

Discussion about permissions. A popup may also ask you to authorise Meta-Press.es to run in "private" mode; it seems better to accept. These settings are rather new and still evolving.

2. How to use it?

Once installed, the add-on creates a new button in the toolbar decorated with the Meta-Press.es landing net icon.

favicon metapress v2
Figure 1. button with the Meta-Press.es icon

A click on this button opens a new tab on the search engine interface.

20191029 meta press filters
Figure 2. head of the interface of the search engine

Under the Meta-Press.es logo and title, headlines from the selected newspapers are shown and rotated.

Then you can type your query and choose which newspapers to search, using a multi-criteria filter mechanism.

Results are then listed right under, when they arrive.

Each result is composed of a title, a link, its source name, its date and possibly an author and an extract:

20191029 meta press result
Figure 3. details of a result

Tools to work on results (sort, search, select) appear in the right column.

20171216 meta press country
Figure 4. number of results by source and filter by source (clicking on its name)
20171216 meta press country
Figure 5. filter by date

You can, for example, click on "Toggle selection mode" to display a checkbox for each result. You can then export your selection of checked results in various formats (JSON, RSS, ATOM and soon CSV).

20191029 meta press europe selection mode crop circles
Figure 6. select mode

To re-import the search results, click on the "Import JSON" (or RSS, or ATOM) link in the cyan horizontal top bar, and select the file to import in the file picker that pops up.

3. How to add a new source to the search engine?

If you are a programmer, you just have to add an entry in the js/sources.json JSON object (or write your entry in the settings panel of the add-on).

Here are useful examples, listed at the top of the js/sources.json file :

  • Mediapart.fr/en is a good and simple example using "normal" CSS selectors

  • New Europe is an example of source providing results in RSS format

  • New Europe Greece extends the New Europe definition

  • The Washington Post provides results in JSON

  • The Japan News uses HTTP POST method

  • Helsinki Times uses XPaths to parse some fields

3.1. Methodology

  • First, visit the website of the source you want to add and note the main URL to pickup headlines from.

  • Try its search functionality :

    • check whether the results are accessible in RSS (or ATOM) format using the developer’s tools (F12 key, default Inspector tab, search for "rss"): this would spare 2/3 of the parsing work

      • in this case, add a "favicon_url" entry in your JSON object for the source’s favicon

      • still in this case : you don’t need to provide the timezone of the source in the tags

    • else, note the URL of the results

    • check that this URL returns results in chronological order, or find a way to have them sorted by date; otherwise the source is incompatible, see the admonition block below

    • if the result URL does not contain your search terms, the source might be using the HTTP POST method; you can look at other sources using the POST method, such as The Japan News

    • check that the results really come from this request via the developer’s tools: F12 key, Network tab, Response preview. Results can also be loaded as JSON through XHR requests; see The Washington Post for an example of how to deal with this

    • try to search for multiple terms and find a way to get a logical "and" between the search terms (try adding quotes around the terms, for instance). If you can’t get a logical "and", add the "OR" technical tag for this source

    • note search terms that give results; the default is "quadrature du net", but "yellow vests" also works well

If something goes wrong, like :

  • no search functionality

  • no date on results

  • no date sort

Please report the problem to the source and add it to the list of incompatible sources in the wiki, along with the status of your feedback effort.

You can also help by contacting sources on this list that have received no feedback yet.

Then, to write the source definition, there are 4 kinds of information to provide :

  • general info: name, timezone and tags at the end;

  • headline: two entries used to retrieve the top news of the moment for this newspaper (the URL of the newspaper, and the CSS selector to get the link);

  • search: the source search URL (which provides the results);

  • result parsing: 5 more entries to retrieve specific elements of each result (the last two being optional):

    • title: r_h1

    • link: r_url

    • date: r_dt

    • extract: r_txt

    • author: r_by

Each of these entries can be followed by an _attr and an _re version of it, to target a specific attribute of the designated HTML element or to apply a .replace() on its value. The _re entry needs a list of two strings: the first is a regular expression and the second a replacement pattern (see the examples below).

It’s also possible to give an _xpath version to use XPaths instead of CSS selectors.

3.2. External doc about CSS, RegEx and XPath

3.2.1. JSON

JSON syntax at Mozilla Developer Network and json.org: just keep in mind that only double quotes are allowed, and no trailing commas
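Both rules can be checked quickly with JSON.parse, which only accepts strict JSON. The helper below is a minimal sketch; the name isStrictJson is hypothetical and is not part of Meta-Press.es.

```javascript
// Hypothetical helper (not from Meta-Press.es): returns true when `text`
// is strict JSON, false when JSON.parse rejects it.
function isStrictJson(text) {
  try {
    JSON.parse(text);
    return true;
  } catch (err) {
    // JSON.parse throws a SyntaxError on single quotes, trailing commas...
    return false;
  }
}

console.log(isStrictJson('{"name": "Mediapart.fr"}'));  // double quotes: accepted
console.log(isStrictJson("{'name': 'Mediapart.fr'}"));  // single quotes: rejected
console.log(isStrictJson('{"name": "Mediapart.fr",}')); // trailing comma: rejected
```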

3.2.2. CSS selectors

Mozilla Developer Network about CSS selectors

More documentation on CSS selectors from medium.com

3.2.3. Regular Expressions

3.2.4. XPath

XPath doc at MDN.

XPath at devhints.io.

3.3. Examples

3.3.1. RSS based source

{
	"https://www.neweurope.eu": {
		"favicon_url": "https://www.neweurope.eu/wp-content/uploads/2019/07/NE-16.jpg",
		"headline_url": "https://www.neweurope.eu",
		"headline_selector": ".td-module-meta-info .entry-title a",
		"search_url": "https://www.neweurope.eu/search/{}/feed/rss2/", (1)
		"extends": "RSS", (2)
		"tags": {  } (3)
	}
}
1 In this URL, the {} will be replaced by Meta-Press.es with your search terms.
2 This definition relies on the provided generic RSS source definition and extends it.
3 tags are explained later
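As an illustration of the token replacement, here is a minimal sketch of how the {} and {#} tokens of a search_url could be filled in. The function name buildSearchUrl is hypothetical, and encoding the terms with encodeURIComponent is an assumption of this example, not necessarily what Meta-Press.es does.

```javascript
// Sketch only: fill the {} (search terms) and {#} (max number of results)
// tokens of a "search_url" template. buildSearchUrl is a hypothetical name.
function buildSearchUrl(template, terms, maxResults = 20) {
  // Assumption: terms are URL-encoded before being inserted.
  const encoded = encodeURIComponent(terms);
  return template
    .replace('{#}', () => String(maxResults))
    .replace('{}', () => encoded);
}

console.log(buildSearchUrl(
  'https://www.neweurope.eu/search/{}/feed/rss2/',
  'quadrature du net'
));
```

Function replacements are used instead of plain strings so that characters such as $ in the inserted value are never interpreted as replacement patterns.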

3.3.2. Extend your own source definitions

	"https://www.neweurope.gr": {
		"favicon_url": "https://www.neweurope.gr/wp-content/uploads/2019/07/favicongr-16.jpg",
		"headline_url": "https://www.neweurope.gr",
		"search_url": "https://www.neweurope.gr/search/{}/feed/rss2/",
		"extends": "https://www.neweurope.eu", (1)
		"tags": {  }
	}
1 Here "https://www.neweurope.eu" is the key of the entry to extend.
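One way to picture the "extends" mechanism is as a merge where the child entry overrides the fields of its parent. The sketch below assumes a shallow merge and uses a hypothetical resolveSource function; the content of the generic "RSS" entry is a placeholder, not the real definition.

```javascript
// Sketch only: resolve an "extends" chain by shallow-merging definitions,
// child fields overriding parent fields. resolveSource is a hypothetical name.
function resolveSource(key, sources, seen = new Set()) {
  if (seen.has(key)) throw new Error('circular "extends" chain at ' + key);
  seen.add(key);
  const def = sources[key];
  if (!def) throw new Error('unknown source: ' + key);
  if (!def.extends) return { ...def };
  const parent = resolveSource(def.extends, sources, seen);
  const { extends: _parentKey, ...own } = def; // drop the "extends" field itself
  return { ...parent, ...own };
}

const sources = {
  // Placeholder content: the real generic RSS definition is not shown here.
  'RSS': { placeholder_field: 'from generic RSS' },
  'https://www.neweurope.eu': {
    headline_url: 'https://www.neweurope.eu',
    headline_selector: '.td-module-meta-info .entry-title a',
    search_url: 'https://www.neweurope.eu/search/{}/feed/rss2/',
    extends: 'RSS'
  },
  'https://www.neweurope.gr': {
    headline_url: 'https://www.neweurope.gr',
    search_url: 'https://www.neweurope.gr/search/{}/feed/rss2/',
    extends: 'https://www.neweurope.eu'
  }
};

console.log(resolveSource('https://www.neweurope.gr', sources));
```

With this reading, the Greek edition only needs to declare what differs from New Europe (URLs, favicon), and inherits the headline_selector.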

3.3.3. JSON based source definition

To diagnose a case where results are loaded via AJAX, you can use Firefox’s developer tools. The F12 key opens those tools, and you can then click on the Console tab. The XHR requests that occur after the initial page load are listed there. Each request can be inspected in the console, including its JSON response payload.

If the inspected request contains your search results, you already have its address, and you can then determine the JSON paths needed to reach each wanted piece of information.

	"https://www.washingtonpost.com": {
		"type": "JSON",
		"search_url": "https://sitesearchapp.washingtonpost.com/sitesearch-api/v2/search.json?query={}&count={#}&sort=displaydatetime desc",  (1) (2)
		"favicon_url": "https://www.washingtonpost.com/favicon.ico",
		"res_nb": "results.total", (3)
		"results": "results.documents", (4)
		"r_h1": "headline",
		"r_url": "contenturl",
		"r_dt": "pubdatetime",
		"r_txt": "blurb",
		"r_by": "authors", (5)
		"r_by_attr": "name", (5)
		"headline_unavailable": "because of GDPR consent first", (6)
		"tags": {  }
	}
1 {#} will be replaced by Meta-Press.es with the maximum number of results per request (it will soon be a parameter in the settings; currently it’s 20).
2 For this example the URL has been shortened, but I kept the sort=displaydatetime desc GET parameter because we always try to get results sorted by date, newest first.
3 When parsing JSON objects, you can specify a path (JavaScript style) to point to deep values (not only first-level ones).
4 This JSON path points to the list of results that Meta-Press.es will go through.
5 Those two lines are in fact from the https://www.arretsurimages.net source definition. The r_by property points at a JSON list, and r_by_attr designates the attribute to fetch from each element of the list. The names are then joined with commas to build the list of authors as a single field.
6 There is no headline_url property in this definition; instead I put an arbitrary variant of the name to store the explanation of the problem. With no headline_url, Meta-Press.es won’t be able to load headlines from this source, but that’s ok.
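To make callouts 3 to 5 more concrete, here is a minimal sketch of dotted-path lookup and author joining. The functions getPath and joinAttr are hypothetical names, and the response object is made-up sample data shaped like the definition above, not a real Washington Post answer.

```javascript
// Sketch only: follow a dotted path such as "results.documents" in a JSON object.
function getPath(obj, path) {
  return path.split('.').reduce(
    (node, key) => (node == null ? undefined : node[key]),
    obj
  );
}

// Sketch only: fetch one attribute from each element of a list and join the values.
function joinAttr(list, attr) {
  return list.map((item) => item[attr]).join(', ');
}

// Made-up sample data, for illustration only.
const response = {
  results: {
    total: 2,
    documents: [
      { headline: 'First title', authors: [{ name: 'A. Writer' }, { name: 'B. Reporter' }] },
      { headline: 'Second title', authors: [{ name: 'C. Editor' }] }
    ]
  }
};

const documents = getPath(response, 'results.documents'); // "results" entry
for (const doc of documents) {
  // "r_h1": "headline", "r_by": "authors", "r_by_attr": "name"
  console.log(getPath(doc, 'headline'), '/', joinAttr(getPath(doc, 'authors'), 'name'));
}
```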

3.3.4. CSS based source definition

	"https://www.mediapart.fr": {
		"headline_url": "https://www.mediapart.fr",
		"headline_selector": ".une-block h3 a", (1)
		"search_url": "https://www.mediapart.fr/search?search_word=\"{}\"&sort=date&order=desc",
		"res_nb": ".sub-title",
		"res_nb_re": ["^(\\d+) ", "$1"], (2)
		"results":	"ul.search > li", (3)
		"r_h1": "h2",
		"r_url": "h2 a",
		"r_url_attr": "href", (4)
		"r_dt": "span.author",
		"r_dt_re": ["\\s(\\d+)[ermè]* (.+) (\\d{4})", "$3-{$2}-$1"], (5)
		"r_txt": "p",
		"r_by": ".author a[rel=author]",
		"tags": {  }
	},
1 This should point to a headline link, the main title of the main page if possible.
2 res_nb can also use a _re complementary entry, here it extracts a number at the beginning of a line
3 Here the global application of the results CSS selector on the result web-page will create the list of results to go through. Note that we used a strict CSS selector (with >) to ensure we don’t grab unwanted elements from elsewhere on the page.
4 r_url_attr retrieves the value of the href attribute.
5 r_dt_re
  • Here we capture the date elements to put them in the right order. The month name (pointed to by {$2}) will be converted into the correct number by month_nb.

  • Note that to specify a \ in a JavaScript string, you need to escape it, hence "\\s" and "\\d".
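The effect of the r_dt_re pair can be reproduced with a plain String.replace(). The sketch below is only an illustration: applyRe and convertMonth are hypothetical names, and the small MONTHS table is a stand-in for the real month_nb function, covering French month names only.

```javascript
// Tiny stand-in for month_nb (assumption: the real function covers many languages).
const MONTHS = {
  janvier: 1, 'février': 2, mars: 3, avril: 4, mai: 5, juin: 6, juillet: 7,
  'août': 8, septembre: 9, octobre: 10, novembre: 11, 'décembre': 12
};

// Apply an _re pair: [regular expression source, replacement pattern].
function applyRe(text, [pattern, replacement]) {
  return text.replace(new RegExp(pattern), replacement);
}

// Replace a {month name} token by its two-digit month number.
function convertMonth(text) {
  return text.replace(/\{([^}]+)\}/, (whole, name) => {
    const nb = MONTHS[name.toLowerCase()];
    return nb === undefined ? whole : String(nb).padStart(2, '0');
  });
}

const r_dt_re = ["\\s(\\d+)[ermè]* (.+) (\\d{4})", "$3-{$2}-$1"];
const reordered = applyRe(' 29 octobre 2019', r_dt_re);
console.log(reordered);               // year, {month name}, day
console.log(convertMonth(reordered)); // month name replaced by its number
```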

3.3.5. HTTP POST based source definition

	"https://the-japan-news.com": {
		"headline_url": "https://the-japan-news.com",
		"headline_selector": "#topNewsWrapper a",
		"favicon_url": "https://the-japan-news.com/favicon.ico",
		"method": "POST", (1)
		"body": "siteSearchInput={}&x=7&y=11&span=365", (2)
		"search_url": "https://the-japan-news.com/news/search",
		
		"r_dt": "time", (3)
		"r_dt_attr": "datetime", (3)
		
	}
1 In addition to the usual search_url, we need to set the POST method
2 And a body for the request, which is the POST equivalent of the GET query string.
3 Note that when a <time datetime=""> HTML tag is available, it’s preferable to use it: this avoids the regular expression formatting step, and avoids having to define a timezone in the tags.
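The difference between a GET and a POST definition can be sketched as follows. The function buildRequest is a hypothetical name, and both the URL-encoding of the terms and the Content-Type header are assumptions of this example; the returned object is shaped so that it could be passed to fetch(url, options).

```javascript
// Sketch only: turn a source definition into a URL plus fetch() options.
function buildRequest(def, terms) {
  // Assumption: terms are URL-encoded before replacing the {} token.
  const fill = (s) => s.replace('{}', () => encodeURIComponent(terms));
  if (def.method === 'POST') {
    return {
      url: def.search_url, // the URL carries no search terms
      options: {
        method: 'POST',
        // Assumption: the body is sent as a classic HTML form submission.
        headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
        body: fill(def.body)
      }
    };
  }
  return { url: fill(def.search_url), options: { method: 'GET' } };
}

const japanNews = {
  method: 'POST',
  body: 'siteSearchInput={}&x=7&y=11&span=365',
  search_url: 'https://the-japan-news.com/news/search'
};
console.log(buildRequest(japanNews, 'yellow vests'));
```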

3.3.6. XPath based source definition

XPath is a very powerful language and it can be used in place of any CSS selector.

   "https://www.helsinkitimes.fi": {
		"headline_url": "https://www.helsinkitimes.fi",
		"headline_selector": "h2[itemprop=\"headline\"] a",
		"search_url": "https://www.helsinkitimes.fi/search1332318146.html?searchword={}&ordering=newest&searchphrase=all&limit={#}", (1)
		"res_nb": ".searchintro .bagde",
		"results": ".result-title",
		"r_h1": "a",
		"r_url": "a",
		"r_url_attr": "href",
		"r_dt_xpath": "./following-sibling::dd[@class=\"result-created\"][1]/strong", (2)
		"r_txt_xpath": "./following-sibling::dd[@class=\"result-text\"][1]",
		"r_by_xpath": "./following-sibling::dd[@class=\"result-category\"][1]/span",
		"tags": {  }
	}
1 As for the WaPo JSON based source definition, the Helsinki Times allows us to set the number of results we want in their answer, via the {#} token. Meta-Press.es will replace this token with the wanted number, which will soon be a preference in the settings.
2 Instead of a regular r_dt field, here we have an r_dt_xpath field. So what follows is not a CSS selector but an XPath expression. Here it allows us to reach the next sibling element relative to the current one, which is not possible via CSS.

One can also note that :

  • Reaching parent elements is not possible in CSS either.

  • XPath is also needed when XML namespaces are involved (like in most encountered RSS feeds extended with Dublin Core DTD).

3.4. Regular expressions

Regular expressions are a complex subject. If you have already worked with RegEx, here are some key points to keep in mind:

  • patterns need to be delimited by known elements before and after what you want to extract: in "\\s(\\d+) " there is a space (or a tab) before and a space after.

  • you mainly need: \d+ \w+ \s+ (to match digits, word characters, and any kind of whitespace)

  • then you’ll mostly use: () ()? (?:) (parentheses extract the pattern between them; a ? after the group marks it as possibly missing; a ?: at the beginning, inside the parentheses, avoids extracting this group, so it has no corresponding "$1" / "$2" in the replacement pattern).
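The three kinds of groups can be tried directly in the browser console. The sketch below uses a made-up date label and a made-up pattern, chosen only to show a non-capturing group, regular capturing groups and an optional capturing group together.

```javascript
// (?:...) is not extracted, (...) is extracted, (...)? is extracted when present.
const pattern = "(?:Published|Updated) on (\\d+) (\\w+) (\\d{4})( \\d+:\\d+)?";
const re = new RegExp(pattern);

const withTime = 'Published on 29 October 2019 14:05'.match(re);
const withoutTime = 'Updated on 3 March 2020'.match(re);

console.log(withTime.slice(1));    // day, month name, year, optional time
console.log(withoutTime.slice(1)); // the last group is undefined here

// In a replacement pattern, a missing optional group is replaced by nothing.
console.log('Updated on 3 March 2020'.replace(re, '$3-{$2}-$1$4'));
```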

3.5. Date formats

Meta-Press.es accepts every date format handled by new Date('date string'). In addition to this, dates like 3 minutes ago, 8 hours ago or even today, yesterday and new are managed by Meta-Press.es.
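To give an idea of what handling such relative dates involves, here is a minimal sketch. It is not the Meta-Press.es implementation: parseLooseDate is a hypothetical name, only English wording is covered, and treating "new" and "today" as the current instant is an assumption.

```javascript
const UNIT_MS = { second: 1000, minute: 60000, hour: 3600000, day: 86400000 };

// Sketch only: convert "3 minutes ago", "yesterday"... into a Date,
// falling back to new Date(text) for everything else.
function parseLooseDate(text, now = new Date()) {
  const s = text.trim().toLowerCase();
  // Assumption: "today" and "new" both mean "now".
  if (s === 'today' || s === 'new') return new Date(now.getTime());
  if (s === 'yesterday') return new Date(now.getTime() - UNIT_MS.day);
  const m = s.match(/^(\d+)\s+(second|minute|hour|day)s?\s+ago$/);
  if (m) return new Date(now.getTime() - Number(m[1]) * UNIT_MS[m[2]]);
  return new Date(text);
}

const reference = new Date('2019-10-29T12:00:00Z');
console.log(parseLooseDate('3 minutes ago', reference).toISOString());
console.log(parseLooseDate('yesterday', reference).toISOString());
```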

For sources in other languages, the date has to be converted into one of the JavaScript-parsable formats just listed, such as the English one or the recommended ISO 8601 format (yyyy-mm-dd, optionally followed by the time and timezone). This can be done via the r_dt_re RegEx replacement pattern entry ("$3-{$2}-$1"), adding curly braces ({}) around month names to convert them into month numbers (see callout 5 of the CSS based example). In this case the month_nb function does the conversion. It can convert nearly every month name, from every living written language, into its corresponding month number, just from the month name (in UTF-8).

Otherwise, Meta-Press.es normalizes all dates with respect to their timezone (function timezoned_date() in js/utils.js, which relies on the recent toLocaleTimeString() function). It uses the "tz" entry of "tags", if provided, when the grabbed date does not already carry timezone information. A native JavaScript API would be welcome in this area.

3.6. Tags

It’s important to reproduce at least the tags of 'Mediapart.fr/en':

"tags": {
	"name": "Mediapart.fr",
	"lang": "fr",
	"country": "fr",
	"themes": ["general", "politics"],
	"tech": ["OR", "own search", "fast"], (1)
	"src_type": ["press", "reference press"],
	"res_type": "text",
	"scope": "International",
	"freq": "daily",
	"tz": "Europe/Paris" (2)
}

"tags": {
	"name": "Mediapart.fr",
	"themes": ["generalist", "international"],
	"lang": "fr",
	"country": "fr",
	"freq": "daily",
	"tech": ["OR", "own search", "fast", "newspaper"], (1)
	"tz": "Europe/Paris" (2)
},
1 Technical tags mostly work in pairs:
  • AND or OR depends on the ability to make a query with a logical "and" connection between the words. If it’s not possible, use the OR tag.

  • fast or slow currently depends on whether results are fetched in less than 3 seconds or not. We will live-test this information for more accuracy in the future.

  • own search or external search refers to the search mechanism of the source: is this service internal or provided by a third party (typically, Google Search is the default search method of the Guardian, and I hope to help them take back control of their search engine).

  • HTTPS / HTTP is a computed tag, you don’t have to set it. It allows searching only in sources accessible over secure HTTPS.

  • the broken tag excludes the source from searches (for instance if it has been reported as defective).

2 The timezone tz tag is only needed if the dates of the results carry no timezone.
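As an illustration of how technical tags could drive source selection, here is a minimal sketch. The functions protocolTag and usableSources are hypothetical names, and the example.org style keys are placeholders; the actual filtering code of Meta-Press.es may differ.

```javascript
// Sketch only: the HTTPS / HTTP tag can be computed from the source key.
function protocolTag(sourceKey) {
  return new URL(sourceKey).protocol === 'https:' ? 'HTTPS' : 'HTTP';
}

// Sketch only: skip sources tagged "broken", optionally keep HTTPS ones only.
function usableSources(sources, { httpsOnly = false } = {}) {
  return Object.keys(sources).filter((key) => {
    const tech = (sources[key].tags && sources[key].tags.tech) || [];
    if (tech.includes('broken')) return false;
    if (httpsOnly && protocolTag(key) !== 'HTTPS') return false;
    return true;
  });
}

// Placeholder sources, for illustration only.
const sampleSources = {
  'https://a.example.org': { tags: { tech: ['OR', 'own search', 'fast'] } },
  'http://b.example.org': { tags: { tech: ['AND', 'external search', 'slow'] } },
  'https://c.example.org': { tags: { tech: ['broken'] } }
};
console.log(usableSources(sampleSources));
console.log(usableSources(sampleSources, { httpsOnly: true }));
```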

3.7. Gather multiple elements in a template

For a "photo" (result type) search, it’s useful to display both the photo result (image) and its description.

To do this, it’s possible to define a list of elements (as JSON paths or CSS selectors…) for the specified field (such as : r_txt), and to add an r_txt_tpl entry defining a string. In this string, you can put replacement tokens like {1}, {2} … which will get replaced by the respective values of the elements of the list.

You can check the "El Mercurio (fotos)" source for an example.
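The token replacement described above can be sketched in a few lines. The function fillTemplate is a hypothetical name, the template string and values are made up, and the sketch assumes that {1} refers to the first element of the list.

```javascript
// Sketch only: replace {1}, {2}... in a template by the values of a list.
// Assumption: tokens are 1-based, so {1} is the first element.
function fillTemplate(tpl, values) {
  return tpl.replace(/\{(\d+)\}/g, (token, n) => {
    const value = values[Number(n) - 1];
    return value === undefined ? token : String(value); // keep unknown tokens
  });
}

// Made-up template and values, standing in for an r_txt_tpl entry
// and the values fetched for the r_txt list.
const r_txt_tpl = '<img src="{1}" alt=""> {2}';
console.log(fillTemplate(r_txt_tpl, ['https://example.org/photo.jpg', 'A caption']));
```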