AI-Enhanced TV and Radio Content Findability

Yle's content has always been made for people, but now artificial intelligence can enjoy it too. In the Yle Beta project, we tested how automatic content analysis could improve the findability and usability of audio and video content in Yle's on-demand service, Yle Areena.

The latest consumer services, such as self-driving cars and automatic translation between languages, are based on artificial intelligence. Put simply, artificial intelligence means a computer performing tasks that require the kind of cognitive reasoning and understanding that previously only people could provide. The breakthrough of artificial intelligence is attributed to the continuous increase in computational capacity, which makes it possible to analyse ever larger masses of data and to teach the computer to be “intelligent” based on that data.

For Yle, the interesting question is how artificial intelligence can help improve its content and services and streamline its operations. In this article, we describe experiments we conducted with artificial intelligence techniques on Yle TV and radio content at the end of 2016. The results are promising!

Text Makes Audio and Video Content Findable

Findability online is largely based on text. When you use a search engine, you type in a search word or two, and within milliseconds the engine digs up a list of the content that best matches your query. The results are produced by a complex algorithm that compares the search words with the text content of web pages.

Searching for video and audio content depends on the text and other metadata attached to it, such as the headline, description, genre, and cover picture. Media items, however, are full of content in audiovisual form: speech, music, sound, and moving image. Because this content is not text, it cannot be found with text-based search methods.

The answer is artificial intelligence. We harness the computer to watch and listen to audiovisual content and tell us what it contains: what is being said, what can be seen, which people appear, what the content is about, and what it means. Because the process is automated, content analysis can be done more cost-effectively than by hand, which makes it possible to go through very large collections of material.

For example, Yle Areena publishes approximately 15,000 hours of video content and 35,000 hours of audio per year. In total, approximately 150,000 items (episodes, clips or individual programmes) are available at any given moment. Producing metadata manually is slow and laborious, so the work necessarily focuses on the essentials rather than on maximising findability.

Once audiovisual content has been transformed into text and metadata (picture 1), it can be used with all the same functionalities as text content: search, index keywords, topic-based notifications, automatic links to further information on the same topic, abstracts, dividing the content into logical segments, and so on. As a result, the findability of the content improves, the service gets better, and customer satisfaction increases.

Picture 1: The automated content analysis process.

There are similar projects being conducted elsewhere, as well. The Swedish public service broadcaster SVT, for example, recently published a prototype that enables users to search speech content in the company’s TV programmes by doing a text search. Also, the BBC has stated that artificial intelligence and speech recognition are among the key technologies for the competitiveness of the company’s online services.


From Speech to Text

In the first test, we gave the artificial intelligence application a varied set of Areena content with the task of transcribing speech to text automatically. The objective was simple: to find out whether the computer could recognise speech well enough to help improve findability. Could artificial intelligence do for video and audio content what Internet search already does, as a matter of course, for text?

The test content consisted of approximately 90 radio and TV programmes: news broadcasts, magazine programmes, and talk shows. We selected different types of programmes, including content that might be challenging for the computer. The aim of picking such varied material was to get a feel for how well automatic recognition works in different situations.

The process of speech recognition was as follows: the media file containing an individual TV or radio programme in Areena was fed into the speech recognition application, which transformed the speech to text, i.e. transcribed it. The transcript was then added to the web page of the programme in Areena so that search engine crawlers would be able to find the text (picture 2). Once the search engines had indexed the content, it could be searched. We verified that the search worked on public search engines such as Google and Bing, and also on Yle's own search engine (picture 3). At the end of this article, you will find the list of programmes for which we produced an automatic transcript.

Picture 2: The speech in a radio programme has been transcribed into text, and the text is printed on the programme's web page on Areena. There are some errors, but the content is mostly delivered correctly. Speech recognition enables searching by the spoken content of the programme. On the right, you can see the keywords automatically selected from the speech content.
Picture 3: Google found the abbreviation “ADHD” that was mentioned in Ben Furman's programme.

We tested two speech recognition applications: the Google Speech API and the speech recognition system developed at Aalto University. Both worked roughly equally well, and well enough for the purposes of our tests. The transcript printed on the Areena web page was the one produced by the Google Speech API.
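As a rough illustration of these two steps, transcribing and publishing, the sketch below uses the current Google Cloud Speech-to-Text Python client. The client version, bucket URI, audio settings and page markup are assumptions for the example, not the exact setup used in the 2016 experiment.

```python
# A minimal sketch of the transcription and publishing steps, assuming the
# current google-cloud-speech Python client. The bucket URI, audio encoding,
# language code and page markup are placeholders, not the settings used in
# the actual experiment.
from google.cloud import speech

def transcribe_programme(gcs_uri: str, language: str = "fi-FI") -> str:
    client = speech.SpeechClient()
    audio = speech.RecognitionAudio(uri=gcs_uri)
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
        sample_rate_hertz=16000,
        language_code=language,
    )
    # Long-running recognition suits full-length radio and TV programmes.
    operation = client.long_running_recognize(config=config, audio=audio)
    response = operation.result(timeout=3600)
    # Join the best alternative of each result into one transcript.
    return " ".join(r.alternatives[0].transcript for r in response.results)

def transcript_html(transcript: str) -> str:
    # Hypothetical markup: the transcript is embedded in the programme page
    # as ordinary text so that search engine crawlers can index it.
    return f'<section class="transcript"><p>{transcript}</p></section>'

if __name__ == "__main__":
    text = transcribe_programme("gs://example-bucket/programme.flac")
    print(transcript_html(text)[:500])
```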

There were some errors in the transcripts, for example in compound words and in the interpretation of colloquial expressions, and some individual words were misrecognised. It was easy to see that the transcripts had been produced by a computer. For search purposes, however, the accuracy was adequate in both applications. A search does not require a perfect transcript to work; what matters is that the term being searched for is found. Speech recognition also highlighted the differences in the clarity of the material: what required only heightened alertness from a human listener could be nearly impossible for a computer. Heated discussion, people talking over each other, or a sudden change in the language spoken are a challenge for artificial intelligence.

Transcribing speech to text does not, however, guarantee that audiovisual content will automatically rank among the top search results, especially if the term is a very common one and plenty of other good content is competing for visibility in search engines. On the other hand, rarer words and special terms may gain much better visibility, which improves the discoverability of individual programmes through rarely searched queries, the so-called “long-tail keywords”. If individual items get more use and the catalogue is large, the total increase in usage can be very significant. For example, the abbreviation “ADHD” mentioned in a programme by Ben Furman became discoverable, and the programme topped the search results on Google.

As test results, we succeeded in producing text from speech for various types of audio and video content and in publishing the text automatically on the web page of each programme in Areena. Search engine crawlers indexed the transcripts, and it was possible to carry out searches based on what was said in the programmes.

It remained open just how much transcripts increase the use of media items. According to the search statistics, the number of searches for the approximately 90 programmes included in the test material increased slightly, but the number of searches was so small that it could also be explained by chance.

There is also room for further development. This time, we did not yet make use of the transcript's time codes. With them, it would be possible to click an individual word in the transcript and jump to the point in the media file where it is spoken, or, while the media is playing, to highlight the word being said at any given moment.
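As a sketch of how such a feature could be fed with data, the example below asks the same (assumed) Google Cloud Speech-to-Text client for word-level time offsets and turns them into a word-to-timestamp list that a player front end could use for click-to-seek or live highlighting. This is illustrative only and was not part of the experiment.

```python
# Sketch: collect word-level time codes so a front end could seek to a clicked
# word or highlight the word currently being spoken. Assumes the current
# google-cloud-speech client; the URI and language settings are placeholders.
import json
from google.cloud import speech

def word_timings(gcs_uri: str, language: str = "fi-FI") -> list[dict]:
    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
        sample_rate_hertz=16000,
        language_code=language,
        enable_word_time_offsets=True,  # ask for per-word timestamps
    )
    audio = speech.RecognitionAudio(uri=gcs_uri)
    response = client.long_running_recognize(config=config, audio=audio).result(timeout=3600)
    timings = []
    for result in response.results:
        for word in result.alternatives[0].words:
            timings.append({
                "word": word.word,
                "start": word.start_time.total_seconds(),
                "end": word.end_time.total_seconds(),
            })
    return timings

# The JSON below could be embedded in the programme page and used by the
# player to seek to the position of a clicked word.
print(json.dumps(word_timings("gs://example-bucket/programme.flac")[:10], indent=2))
```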

Recognising the Structure of the Programme

The goal of the second experiment was to test whether artificial intelligence can recognise the structure of a programme, that is, segment the programme into logically coherent parts by content. This would enable many ways of improving the usability of programmes. For example, viewers of magazine programmes and news broadcasts could skip subjects that do not interest them or go back to a certain subject, just like when reading a newspaper. The segments could also be published automatically as individual clips, which would mean less manual labour and might improve the visibility of content in social media, for example. If it were easy to skip the opening titles and the end credits, viewers might stick with a TV series longer, since it would be more pleasant to watch. The material used in the experiment consisted of TV news broadcasts and episodes of the current affairs show “A-studio”, both of which consist of several topics that have no connection to each other.

The structure of each programme was recognised using artificial intelligence technology from the company Valossa AI, which conducted a visual analysis of the on-screen image and utilised the programmes' subtitle tracks. To evaluate the results, we built a simple user interface for testing the segmentation (picture 4).

Picture 4: The computer has automatically recognised the structure of a news broadcast (opening titles, individual topics and end credits). The technical user interface used in the test and shown in the picture enabled navigation from one segment to another within the programme. Click the image to watch it as a video.

The result of the experiment was that by combining face detection, recognition of the topics being discussed, and detection of various repeated visual elements, we were able to automatically create segmentation that is useful to users.

One observation was that by comparing episodes of the same TV series to each other, it was fairly easy to recognise repeated elements such as opening titles, bumpers and end credits. Comparing episodes to each other seems to serve as a good general principle for the very varied audiovisual content that Yle publishes in series form.
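The principle of comparing episodes can be illustrated with a simple sketch: sample frames from two episodes, compute a perceptual hash for each, and flag the moments whose frames recur almost identically in both, which is typically where opening titles, bumpers and credits are. This is only an illustration of the idea, not the Valossa AI pipeline used in the experiment; it assumes the OpenCV, Pillow and imagehash libraries, and the file names are placeholders.

```python
# Illustration only: find near-identical frames shared by two episodes
# (typically opening titles, bumpers, end credits) using perceptual hashes.
# Not the Valossa AI pipeline used in the experiment. Assumes opencv-python,
# Pillow and imagehash; the file names are placeholders.
import cv2
import imagehash
from PIL import Image

def sampled_hashes(video_path: str, every_n_seconds: float = 1.0):
    """Return (timestamp, perceptual hash) pairs for sampled frames."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(1, int(fps * every_n_seconds))
    hashes, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            hashes.append((index / fps, imagehash.phash(Image.fromarray(rgb))))
        index += 1
    cap.release()
    return hashes

def repeated_moments(episode_a: str, episode_b: str, max_distance: int = 6):
    """Timestamps in episode A whose frame also appears in episode B."""
    hashes_b = [h for _, h in sampled_hashes(episode_b)]
    repeats = []
    for t, h in sampled_hashes(episode_a):
        if any(h - other <= max_distance for other in hashes_b):
            repeats.append(t)
    return repeats

print(repeated_moments("episode_01.mp4", "episode_02.mp4"))
```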

Another observation was that visual recognition requires tuning the artificial intelligence system individually for each series, so at this point fully automatic segmentation was not possible.

Recognising the Key Concepts of a Programme in Speech and Image

Keywords are a handy way of describing the main concepts of a piece of content in detail: for example, what the content is about and which people appear in it.

Keywording links content dealing with the same subject, which makes it possible to offer users concept-oriented search and navigation functionalities. This is already done with written articles at Yle: for example, users of Yle NewsWatch (Uutisvahti) can specify in detail which topics they are and are not interested in. For TV and radio content this is not yet done at Yle, because keywording has not been carried out on a larger scale.
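The linking itself can be as simple as an inverted index from keyword to programmes, as in the sketch below; the programme identifiers and keywords are made-up examples, not data from the experiment.

```python
# Sketch: link programmes that share keywords, so a user can navigate from one
# programme to related ones by concept. Programme ids and keywords are made up.
from collections import defaultdict

keywords_by_programme = {
    "prog-1": {"adhd", "parenting", "psychology"},
    "prog-2": {"psychology", "therapy"},
    "prog-3": {"ice hockey", "sports"},
}

# Build an inverted index: keyword -> programmes that carry it.
programmes_by_keyword = defaultdict(set)
for programme, keywords in keywords_by_programme.items():
    for keyword in keywords:
        programmes_by_keyword[keyword].add(programme)

def related(programme_id: str) -> set[str]:
    """Programmes sharing at least one keyword with the given programme."""
    hits = set()
    for keyword in keywords_by_programme[programme_id]:
        hits |= programmes_by_keyword[keyword]
    hits.discard(programme_id)
    return hits

print(related("prog-1"))  # -> {'prog-2'}
```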

In our third set of tests, we experimented with producing keywords automatically, either from the transcripts or by means of image recognition from the video material.

We fed the automatically produced transcript into the automatic keywording service that Yle uses to create keywords for its article content (picture 2). The system recognised the key concepts of the programme based on the transcript. The preliminary observation was that keywords describing the content can be created automatically and that they describe the programme well. The obvious limitation of this method is that the keywords can only cover subjects mentioned in speech.
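The keywording service itself is an internal Yle system. As a rough stand-in, the sketch below applies the simple statistical idea described in the conclusions: count word frequencies in the transcript, drop stopwords, and keep the terms that occur often enough to be reliable. The stopword list and thresholds are made-up examples.

```python
# Stand-in for the internal keywording service: frequency-based keyword
# candidates from a transcript. Stopword list and thresholds are made up.
import re
from collections import Counter

STOPWORDS = {"ja", "on", "että", "se", "ei", "kun", "niin", "mutta", "the", "and", "a"}

def keyword_candidates(transcript: str, min_count: int = 3, top_n: int = 10):
    words = re.findall(r"\w+", transcript.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 3)
    # Keep only words frequent enough that they very likely occur in the content.
    return [(word, count) for word, count in counts.most_common(top_n) if count >= min_count]

print(keyword_candidates("ADHD on yleinen aihe tässä ohjelmassa " * 5))
```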

TV programmes, in particular, often make use of the visual side of the medium. For example, the name of the person being interviewed is not said out loud but is shown on screen as text. Or a topic important to the programme is conveyed purely visually, such as emotional states, physical objects, processes, or locations.

To recognise visual items, we tested automatic analysis of video material, in which the computer recognises what is in the picture. Recognition was carried out by comparing the visual content with publicly available visual recognition databases (LSCOM, COCO and SUN). The results varied considerably in recognising what is in the picture: a man, a woman, a child, an aeroplane, a banana, a garden, or an art gallery (pictures 5 and 6).

Picture 5: Concepts recognised in the picture, such as an object and a hand.
Picture 6: As an experiment, artificial intelligence was used to create a natural-language description of what is happening in the picture. The recognition fails a bit every now and then...
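As a general illustration of per-frame concept recognition, the sketch below classifies a single frame with an off-the-shelf pretrained ImageNet classifier from torchvision. It is not the LSCOM/COCO/SUN-based setup used in the experiment, and the frame file name is a placeholder.

```python
# Illustration of per-frame concept recognition with a pretrained ImageNet
# classifier (torchvision). Not the LSCOM/COCO/SUN-based recognition used in
# the experiment; the frame file name is a placeholder.
import torch
from PIL import Image
from torchvision import models

weights = models.ResNet50_Weights.DEFAULT
model = models.resnet50(weights=weights).eval()
preprocess = weights.transforms()
labels = weights.meta["categories"]

def recognise_concepts(frame_path: str, top_k: int = 5):
    image = Image.open(frame_path).convert("RGB")
    batch = preprocess(image).unsqueeze(0)
    with torch.no_grad():
        probabilities = model(batch).softmax(dim=1)[0]
    scores, indices = probabilities.topk(top_k)
    return [(labels[int(i)], float(s)) for i, s in zip(indices, scores)]

print(recognise_concepts("frame_0042.jpg"))
```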

Conclusions

Computers are not perfect. In speech recognition, the computer sometimes misinterpreted the speech or ignored parts of it completely. For search purposes, however, automatic transcription already works well. We were able to show that adopting the technology in Yle Areena would be straightforward. The key question is not so much how well speech recognition works but to what extent the audience wants to search for audiovisual content by subject rather than by programme name.

In some cases, the automatically recognised segments were too short to be meaningful in terms of content. Overall, however, automatic segmentation seemed to work well, and the repeated parts in the episodes of a series could be recognised automatically. Topic-based segmentation within a programme still requires tuning the system individually for each series, so fully automated segmentation was not yet achieved.

Errors in automatic content recognition are not an obstacle as long as the resulting data is used with its strengths and weaknesses in mind. For example, simple statistical methods can be used to pick out the most frequent words among those recognised; for these words, you can be fairly certain that they actually occur often in the content. This was also the starting point for automatic keywording based on the transcript.

The role of people will not disappear, but it will change. Computers will certainly not produce perfect quality for every need in the coming years, so people will be needed to correct misinterpretations and to make the choices where intuition and decades of experience guarantee better quality for content and end users. Computers serve as tools that assist people in their work: they offer better efficiency and enable content and ways of working that have previously been too expensive. The Internet has been shaking the media industry globally for decades; the next thing to shake it is artificial intelligence. Our experiments are just a small example of its possibilities.

The experiments were conducted in partnership with Qvik, Valossa Labs, and Aalto University. Kim Viljanen works as a concept designer in the Yle Areena development team. Sami Mattila works as an SEO specialist in the Yle.fi team.

Related resources:

1. A list of Yle content used as the test material for the automated transcription experiments
2. A demo of the test user interface for viewing segmented programmes (video)