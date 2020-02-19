

"Spoken words fly away, written words remain." That is the motto of Verba, a Civio project led by David Cabo that analyzes the coverage of the news programs of Spanish public television since 2014.

The project, which is also open source, is seeing the light these days after months of work, and it shows that not every great idea has to come out of a Silicon Valley startup. We spoke with Cabo so he could tell us how Verba was conceived, what it lets us do, and what the future holds for this interesting initiative.





The origins of Verba

Civio is an independent, non-profit organization that has been combining journalism and technology for years because, in the words of its founder, David Cabo (@dcabo), "technology is at the mercy of journalism." In this case it does so, as the organization itself explains, to "monitor public authorities, inform all citizens and press for real and effective transparency in institutions."

Civio's pursuit of transparency and access to information has been demonstrated through efforts such as 'El BOE nuestro de cada día' ("Our daily BOE"), in which Eva Belmonte (@evabelmonte) explains to citizens the key points of each issue of the Official State Gazette, or 'Medicamentalia', a journalistic investigation by Ángela Bernardo (@maberalv) into the global gap in access to health, among other projects.

Verba, however, is a very different project. David Cabo told us that a few years ago he saw an initiative in the United States built around the sessions of Congress, and that is where the idea came from: on RTVE's "A la carta" website he found that each news broadcast had subtitles that were not only shown while the video played, but could also be downloaded.

That tied in with a recurring debate that was beginning to emerge about the topics covered in the news on state public television. What was each news program talking about, and for how long?

The subtitles made it possible to answer that question, and after obtaining a Google News Initiative grant, a program that encourages the use of technology in the media industry, they launched the project by drawing on a very particular scientific field: Natural Language Processing (NLP).

Sorry, what are you saying?

Natural Language Processing has progressed exceptionally thanks to the introduction of machine learning algorithms, and it allows machines to process large amounts of data expressed in the natural language we use every day.

An example of what Verba allows: which corruption scandal has been talked about most on RTVE in recent years? Not the Púnica case, with 357 mentions: both Gürtel (712 mentions) and the ERE case (804) come ahead of it. Source: Verba

The technique is well suited to analyzing and extracting information from the subtitles offered by RTVE's news programs. As Cabo explained, Verba "has some technical complexity, but not too much." It works primarily by downloading the subtitles, which are then added to a large database built with Elasticsearch, a powerful distributed search engine.

Of course, Cabo stressed, there is an intermediate step first: the subtitles are split into sentences by an NLP library, which makes it possible to "dissect" each news program into parts that make it easier to return search results effectively.
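
As a rough illustration of that pipeline, the sketch below splits a block of subtitle text into sentences and prepares one document per sentence in the newline-delimited JSON format that Elasticsearch's bulk API accepts. It is only a sketch under stated assumptions: the regular expression stands in for the unnamed sentence-splitting library, and the index name and the fields `programme_id`, `date`, `position` and `text` are hypothetical, not Verba's actual schema.

```python
import json
import re

# Naive splitter: break after ., ! or ? when followed by whitespace.
# This is a stand-in for a real NLP library, which would handle
# abbreviations, numbers and quotations far better.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Return the non-empty sentences found in a block of subtitle text."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def to_bulk_payload(index, programme_id, date, text):
    """Build a bulk-API body (NDJSON) with one document per sentence.

    All field names here are hypothetical, chosen for illustration only.
    """
    lines = []
    for position, sentence in enumerate(split_sentences(text)):
        # Action line: tells Elasticsearch where to index the next document.
        action = {"index": {"_index": index, "_id": f"{programme_id}-{position}"}}
        # Source line: the document itself.
        source = {
            "programme_id": programme_id,
            "date": date,
            "position": position,
            "text": sentence,
        }
        lines.append(json.dumps(action))
        lines.append(json.dumps(source, ensure_ascii=False))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = "Buenas tardes. Hoy hablamos de sanidad. ¿Qué ha pasado en el Congreso?"
    print(to_bulk_payload("subtitles", "td1-2020-02-18", "2020-02-18", sample))
```

Indexing sentence by sentence, rather than whole broadcasts, is what lets each search hit point to a short, quotable excerpt.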

From there the application work begins. It is written in JavaScript with Vue.js, which in turn calls a visualization library, D3.js, that renders the results the user sees once their search has been processed. As Cabo explained in the official announcement, this Civio project has moved forward thanks, among other things, to the collaboration of experts such as Víctor Peinado and Pablo Rey.

You can now find out when a certain topic was discussed on the news

Verba turns RTVE's news programs into a unique newspaper archive: one that can be navigated with simple search terms, which we can also combine using Elasticsearch operators. We can exclude terms (with the "-" symbol) or run "OR" searches using the "|" symbol (for example, searching for "Trump | Obama").
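
The "-" and "|" operators described above match the syntax of Elasticsearch's `simple_query_string` query. The sketch below builds a request body of that kind; whether Verba actually uses this query type is an assumption, as are the field name `text`, the result size and the choice of default operator.

```python
def build_search_body(user_query, field="text", size=20):
    """Build an Elasticsearch search body for a user-typed query.

    `simple_query_string` understands "|" as OR and a leading "-" as
    negation, so the user's text can be passed through unchanged.
    The field name and size are hypothetical defaults.
    """
    return {
        "size": size,
        "query": {
            "simple_query_string": {
                "query": user_query,
                "fields": [field],
                # Assumption: terms without an explicit operator must all match.
                "default_operator": "and",
            }
        },
        # Ask Elasticsearch to mark the matched terms in each excerpt.
        "highlight": {"fields": {field: {}}},
    }


if __name__ == "__main__":
    import json

    print(json.dumps(build_search_body("Trump | Obama"), indent=2))
    print(json.dumps(build_search_body("corrupción -Gürtel"), indent=2))
```

The resulting dictionary would be sent as the JSON body of a search request against the index that holds the subtitle sentences.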

Each search can be shared on social networks, since each one generates its own URL, as with our example. Running a search also displays a graph showing the number of occurrences of the search terms over the years in TVE's news programs. The search results can also be downloaded in .csv format.
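
To make the graph and the download more concrete, here is a minimal sketch of how a list of search hits could be turned into counts per month and into CSV text. The hit structure (`date`, `edition`, `text`) is hypothetical and is not the actual layout of Verba's export.

```python
import csv
import io
from collections import Counter


def mentions_per_month(hits):
    """Count hits per 'YYYY-MM' month; each hit carries an ISO 'date' string."""
    return Counter(hit["date"][:7] for hit in hits)


def hits_to_csv(hits):
    """Serialize hits to CSV text with a header row (hypothetical columns)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["date", "edition", "text"],
        extrasaction="ignore",  # silently drop any extra keys on a hit
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(hits)
    return buffer.getvalue()


if __name__ == "__main__":
    sample_hits = [
        {"date": "2019-05-03", "edition": "15h", "text": "primer ejemplo"},
        {"date": "2019-05-20", "edition": "21h", "text": "segundo ejemplo"},
        {"date": "2019-06-01", "edition": "15h", "text": "tercer ejemplo"},
    ]
    for month, count in sorted(mentions_per_month(sample_hits).items()):
        print(month, count)
    print(hits_to_csv(sample_hits))
```

The monthly counts are the kind of series a charting library such as D3.js would plot over time.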

The graph is a visual representation of those mentions, but each one is also listed with a short excerpt from the transcript in which the search terms were found, along with the date and the edition of the news program it belongs to. Hovering the mouse over any of these result "boxes" gives access to the context: a pop-up window appears with a somewhat longer stretch of the transcript and a direct link to the video of that broadcast, which opens in a new browser tab.

Example of the first result for our sample search. It shows the moment at which the search terms are mentioned in the program, highlights the term found, and gives access to the context or to a link to the video on RTVE's "A la carta" service.

In that video it is easy to locate the exact moment at which the search term or terms were mentioned, because that information is also part of the data published with each result.

In addition to the search engine, Civio offers some examples of the analyses that can be carried out with those results. The "Headlines" section shows five examples of coverage in the news programs in recent years, analyzing among other things the scientific rigor of reports about diets, or the difference in coverage given to corruption cases such as Gürtel and the ERE.

This is just the beginning

The service is functional and its response is surprisingly fast and accurate, but for David Cabo "we have only launched a first part." He and his team at Civio want to "apply more NLP technologies" that, among other things, "allow entity extraction." Thanks to that capability, Verba will be able to recognize proper names and tell them apart according to context.

Civio offers all the transcripts of RTVE's 3pm and 9pm news programs broadcast since 2014, organized by year, month and day of broadcast.

There is a very clear example of Verba's current limitations: for now, a search for "Podemos" confuses the political party with the verb form ("we can"), but entity extraction will help distinguish between the two.

As Cabo explained to us, that option "is close" to being implemented, but it was not as precise as they would have liked, so they preferred to delay its launch. To offer it they will once again rely on machine learning techniques that, with a lot of training and a little trick, the detection of capital letters, use the context to help tell some cases from others.
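
The capital-letter trick can be illustrated with a small heuristic. This is not the trained model Cabo describes, only a hand-written sketch of the intuition: in Spanish, "podemos" in lower case is the verb ("we can"), a capitalized "Podemos" in the middle of a sentence is probably the party, and a capitalized one at the start of a sentence remains ambiguous without more context. The function name and labels are invented for this example.

```python
import re

WORD = re.compile(r"\b[Pp]odemos\b")


def classify_podemos(sentence):
    """Label each occurrence of 'podemos' as 'party', 'verb' or 'ambiguous'."""
    labels = []
    for match in WORD.finditer(sentence):
        # Text before the word, without trailing spaces.
        before = sentence[:match.start()].rstrip()
        # The word opens a sentence if nothing precedes it or if the
        # previous character is sentence punctuation (including ¿ and ¡).
        at_sentence_start = before == "" or before[-1] in ".!?¿¡"
        if match.group()[0].islower():
            labels.append("verb")
        elif at_sentence_start:
            # Capitalization is forced here, so it tells us nothing.
            labels.append("ambiguous")
        else:
            labels.append("party")
    return labels


if __name__ == "__main__":
    for example in [
        "El líder de Podemos habló hoy.",
        "Creo que podemos hacerlo.",
        "Podemos ha presentado su programa.",
    ]:
        print(example, "->", classify_podemos(example))
```

The ambiguous cases are exactly where a trained model that looks at the surrounding words would be needed.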

To be able to measure times, we are working on splitting transcripts into individual news, and training a model that classifies news into topics. Something like that: pic.twitter.com/rgnO61PwT7 – David Cabo (@dcabo) February 18, 2020

Not only that: David Cabo also pointed to another especially interesting future option: dividing the transcripts into pieces classified by subject, so that for each newscast we know how much was said about sports or politics, for example. In fact, the idea is to achieve a classification precise enough to know how much time was devoted to each topic in each news program.
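
Once a broadcast has been split into pieces labeled by topic, measuring the time devoted to each topic reduces to simple bookkeeping. The sketch below assumes each piece is a dictionary with a `topic` label and `start` and `end` times in seconds; that structure is hypothetical, and the classification step itself (the trained model) is not shown.

```python
from collections import defaultdict


def seconds_per_topic(segments):
    """Sum the duration in seconds of the segments belonging to each topic."""
    totals = defaultdict(float)
    for segment in segments:
        totals[segment["topic"]] += segment["end"] - segment["start"]
    return dict(totals)


def share_per_topic(segments):
    """Return each topic's fraction of the total classified time."""
    totals = seconds_per_topic(segments)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {topic: seconds / grand_total for topic, seconds in totals.items()}


if __name__ == "__main__":
    # Invented example segments for a single broadcast.
    broadcast = [
        {"topic": "politics", "start": 0, "end": 120},
        {"topic": "sports", "start": 120, "end": 180},
        {"topic": "politics", "start": 180, "end": 240},
    ]
    print(seconds_per_topic(broadcast))
    print(share_per_topic(broadcast))
```

Aggregating these per-broadcast totals over months or years would show whether a topic receives little, no or a lot of coverage.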

That will make it possible to answer questions that currently have a fuzzier answer, such as whether different topics get little, no or a lot of coverage in the news of a public broadcaster like RTVE, and it positions Verba as a very useful tool for analyzing the true transparency of those news programs.

And here we can find, once again thanks to Civio, (hello, @evabelmonte) Pedro Sánchez's promise to repeal the gag laws. Nothing like good technology to relieve the amnesia of politicians. https://t.co/Qie1Y3GK3L – Almeida (@bufetalmeida) February 18, 2020

That, of course, on top of being a "damn (blessed) newspaper archive" for finding out what was said, who said it and when, something that some users and experts have already discovered.

Verba works with a database that starts in 2014 for a simple reason: that is when RTVE began to subtitle its news programs and publish those subtitles on the web.

Is it feasible for the search to reach further back into the past? Of course, but that requires transcripts of those news programs. Although they have run small experiments to transcribe them with automatic systems such as Amazon Transcribe, Cabo indicated that the conclusion is that this process is costly in time and money.

Transcribing a single newscast is not very expensive, but doing it for every news program over several years is another matter. Civio will in fact talk to RTVE to try to obtain subtitles from earlier broadcasts, so it is feasible that one way or another the time range covered by Verba will be extended.

With Verba it is easy, for example, to find the appearances of Pichai or Zuckerberg in a news program. Both appeared at the 2015 Mobile World Congress in Barcelona. Source: RTVE

In fact, the process can be applied in full to the news programs of other channels. Civio tried to obtain the subtitles of other news programs, such as those of the private channels, but they are either not published or not published in a format that is easy to process at the moment.

Cabo's idea is to offer this tool not only to any user, who can replicate the project without problems thanks to the GitHub repository where the code is hosted, but also to audiovisual councils, universities and journalism schools, and regulatory bodies, so that they can use it to draw their own conclusions.

GitHub itself, in the project's Issues section, reveals many of the keys to the project's evolution: there David Cabo compared different NLP platforms, for example, and also discussed the problems of recognizing proper names and the economic cost of this processing before carrying it out.

What is clear is that Cabo and the entire Civio team want to see this initiative grow. In fact, Civio's founder has issued a call to the Civio community, which we pass on here.

As he said both in the official presentation of the service on the Civio blog and on Twitter: "If you're curious about these things, stop by the Civio community and we'll talk." Not only that: if you want to help them find interesting stories in the news through Verba, you can already do so through its community.