A post on the BBC Research & Development Blog outlines work on automatic tagging of speech audio. The work is concerned with the World Service archive, which apparently has “very sparse” associated programme data. The archive “covers many decades and consists of about two and a half years of high-quality continuous audio content”. The aim was to associate the content of each programme with keywords. The post explains:
For example if a programme mentions ‘London’, ‘Olympics’ and ‘1948’ a lot, then there is a high chance it is talking about the 1948 Summer Olympics.
The post discusses the technical challenges of this endeavour: automatic transcription and searching for terms from a subject classification. The work uses “an approach inspired by the Enhanced Topic-based Vector Space Model proposed by D. Kuropka”. The full article describes in detail how to move from constructing a vector space to extracting a ranked list of topic identifiers for each programme.
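To make the idea of going from a vector space to a ranked list of topics more concrete, here is a minimal sketch in Python. It is not the BBC's algorithm and not the Enhanced Topic-based Vector Space Model: it swaps in plain term-count vectors and cosine similarity. The topic labels, the term lists and the function names `cosine` and `rank_topics` are all invented for illustration.

```python
import math
from collections import Counter

# Hypothetical topics and their associated terms (assumed for illustration;
# a real system would draw these from a subject classification).
TOPIC_TERMS = {
    "1948 Summer Olympics": ["london", "olympics", "1948"],
    "2012 Summer Olympics": ["london", "olympics", "2012"],
    "Lyon": ["lyon", "france"],
}


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_topics(transcript, topic_terms=TOPIC_TERMS):
    """Return (topic, score) pairs with a non-zero score, best match first."""
    # Naive tokenisation: lowercase and split on whitespace.
    words = Counter(transcript.lower().split())
    scores = []
    for topic, terms in topic_terms.items():
        score = cosine(words, Counter(terms))
        if score > 0:
            scores.append((topic, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


transcript = "london olympics 1948 london olympics 1948 games in london"
ranked = rank_topics(transcript)
for topic, score in ranked:
    print(f"{topic}: {score:.3f}")
```

With this toy transcript, which mentions "london", "olympics" and "1948" repeatedly, the 1948 Summer Olympics topic ranks first and the 2012 one second, while the unrelated topic is dropped. A real pipeline would have to cope with noisy automatic transcripts and a far larger set of candidate topics.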
The resulting classification was evaluated:
against 150 programmes that have been manually tagged in BBC Programmes and [we] found that the results, although by no means perfect, are good enough to efficiently bootstrap the tagging of a large collection of programmes.
The algorithm is apparently described in more detail in a paper accepted for Linked Data on the Web (LDOW2012), a workshop of the World Wide Web 2012 conference held in Lyon, 16th-20th April 2012. The post also discusses next steps for the work.
Source: Automatically tagging the World Service archive.