A post on the BBC Research & Development Blog outlines work on automatic tagging of speech audio. The work is concerned with the World Service archive, which apparently has “very sparse” associated programme data. The archive “covers many decades and consists of about two and a half years of high-quality continuous audio content”. The aim was to associate each programme with keywords describing its content. The post explains:
For example if a programme mentions ‘London’, ‘Olympics’ and ‘1948’ a lot, then there is a high chance it is talking about the 1948 Summer Olympics.
The post discusses the technical challenges of this endeavour – automatic transcription, and searching the transcripts for terms from a subject classification. This uses “an approach inspired by the Enhanced Topic-based Vector Space Model proposed by D. Kuropka”. The full article gives a detailed description of the process, from constructing a vector space to extracting a ranked list of topic identifiers for each programme.
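Roughly speaking, the matching could look something like the following toy sketch. This is not the BBC’s actual pipeline – the topics, the term weights and the `rank_topics` helper are all invented for illustration – but it shows the general shape of scoring candidate topics against a transcript’s term counts and ranking them:

```python
from collections import Counter

def rank_topics(transcript_terms, topic_term_weights, top_n=5):
    """Score each topic by the overlap between the transcript's term counts
    and the topic's weighted term list, then return the best matches."""
    counts = Counter(transcript_terms)
    scores = {
        topic: sum(counts[term] * weight for term, weight in term_weights.items())
        for topic, term_weights in topic_term_weights.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Toy data: topics and per-topic term weights invented for illustration
topics = {
    "1948 Summer Olympics": {"london": 0.6, "olympics": 0.8, "1948": 0.7},
    "2012 Summer Olympics": {"london": 0.6, "olympics": 0.8, "2012": 0.7},
    "Football World Cup":   {"football": 0.9, "world cup": 0.8},
}
transcript = ["london", "olympics", "1948", "london", "athletes"]
print(rank_topics(transcript, topics))  # "1948 Summer Olympics" ranks first
```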
The resulting classification was evaluated:
against 150 programmes that have been manually tagged in BBC Programmes and [we] found that the results, although by no means perfect, are good enough to efficiently bootstrap the tagging of a large collection of programmes.
The algorithm is apparently described in more detail in a paper accepted for Linked Data on the Web (LDOW2012), a workshop at the World Wide Web 2012 conference in Lyon, 16th–20th April 2012. The post also discusses next steps for the work.
Source: Automatically tagging the World Service archive.
This is very interesting! I did a bit more digging into the Topic-based Vector Space Model.
I think this is the most relevant article on Kuropka’s site about the method: Topic-Based Vector Space Model. The “enhanced method” is what the BBC people used to automatically assign vectors to topics, but the only available write-up of that is a data analysis running to hundreds of pages. The Wikipedia page is middlingly enlightening.
The BBC people have put their code on GitHub, and included a pretty simple explanation of the algorithm in the README file. They say:
So each topic has a vector pointing towards its parent topics, with closer (more specific) topics weighed more heavily than more distant (broader) topics.
The dot product (called the cosine similarity in the BBC article for some reason) of two topics’ vectors is then a sort of measure of their similarity – if two topics are in the same sort of area, their vectors will point in roughly the same direction, so the dot product will be high. Conversely, if two topics have nothing in common, their vectors will point in completely different (orthogonal, not opposite) directions, so their dot product will be zero.
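Here’s a rough sketch of that in code. The little topic hierarchy, the decay factor and the helper names are all made up, so this illustrates the idea rather than reproducing the BBC’s implementation: each topic’s vector puts shrinking weights on its ancestors, and the normalised dot product of two such vectors then measures how related the topics are.

```python
import math

def topic_vector(topic, parents, axes, decay=0.5):
    """Build a vector for a topic over a fixed set of axes (one per topic in
    the hierarchy), with ancestors weighted less the further away they are."""
    vec = [0.0] * len(axes)
    weight, current = 1.0, topic
    while current is not None:
        vec[axes[current]] = weight
        weight *= decay            # broader ancestors get smaller weights
        current = parents.get(current)
    return vec

def cosine_similarity(u, v):
    """Normalised dot product: 1 for parallel vectors, 0 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Tiny invented hierarchy: Sport -> Olympics -> 1948 Summer Olympics, Sport -> Football
parents = {"1948 Summer Olympics": "Olympics", "Olympics": "Sport",
           "Football": "Sport", "Sport": None}
axes = {name: i for i, name in enumerate(parents)}

olympics_1948 = topic_vector("1948 Summer Olympics", parents, axes)
football = topic_vector("Football", parents, axes)
print(cosine_similarity(olympics_1948, football))       # small but non-zero: shared ancestor "Sport"
print(cosine_similarity(olympics_1948, olympics_1948))  # 1.0: a topic is maximally similar to itself
```

Two topics that only share a distant ancestor get a small but non-zero similarity, while topics on completely unrelated branches give orthogonal vectors and a similarity of zero.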
It’s interesting, reading the original paper and the BBC source code, how computer scientists take useful bits of pure maths but don’t quite get the terminology right, or rephrase things in a way that makes more sense to them.
Interesting. I’m always nervous of bringing too much of the article over because I want people to read the original but I think what you’ve done here – particularly translating from computer science to mathematics – is sufficiently different.
Many thanks for the mention!
Another useful resource was Polyvyanyy’s thesis, “Evaluation of a Novel Information Retrieval Model: eTVSM”. The original description of that model is only available in German, it seems.
@Christian I am curious about your point about terminology – is there anything in particular you’re thinking about? ‘cosine similarity’ in this context is just a normalised dot product and is used a lot in IR?
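That is, for two vectors $u$ and $v$ with angle $\theta$ between them,
$$\cos\theta = \frac{u \cdot v}{\|u\|\,\|v\|},$$
so once the topic vectors are normalised to unit length the dot product and the cosine similarity are the same number.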
I think “cosine similarity” was the only one in your post, but I was thinking of other CS papers I’ve read. Can’t bring any to mind right now.