Making MOOC Videos Accessible to Non-English Speakers

The news that California State Universities were tying up with Udacity to offer inexpensive online courses (MOOCs) for credit was not surprising. The only surprise is the speed at which the changes are arriving. I am inclined to agree with the analysis on TechCrunch (http://goo.gl/fRppX) that this online project is going to end colleges as we know them. The advantage is that a wealth of learning options will become available to everyone. The disadvantage is that most of the content will be in English.

Can the videos in English be made accessible to students who are not very comfortable with English? These students would benefit a lot if subtitles (http://en.wikipedia.org/Subtitling) in their own language were provided. How do you go about it?

The time and effort required to translate into a vast number of languages would be huge. Crowdsourcing can be an answer, e.g. by using http://www.amara.org/.

Subtitling/Captioning and the Web

Video players can merge the video frames with subtitles at run time. There are numerous subtitle formats, but the basic content is similar in all of them. Each subtitle is a line of text to be displayed, along with information about when to start the display and when to stop. The best way to provide this information is to specify the start time and either the end time or the duration of each subtitle. This makes the subtitle file independent of the frame rate at which the video file was created. One common format is SubRip (.srt extension). SubRip was the basis of another useful format, WebVTT, which may become widespread as it is being developed at the W3C. The W3C also has a competing timed-text standard, TTML, which is an XML format intended to ensure interoperability of streaming video and captions on the web.
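To make the relationship between the two formats concrete, here is a minimal sketch of converting SubRip text to WebVTT. It assumes simple cues without styling or positioning; the function name `srt_to_webvtt` and the sample captions are made up for illustration. For basic files the visible differences are small: WebVTT starts with a `WEBVTT` header line, the numeric counters are optional, and the milliseconds are separated by a dot rather than a comma.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:04,000".
TIMING = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3}) --> (\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_webvtt(srt_text):
    """Convert SubRip text to WebVTT text (basic cues only, no styling)."""
    blocks = re.split(r"\n\s*\n", srt_text.replace("\r\n", "\n").strip())
    out = ["WEBVTT", ""]
    for block in blocks:
        lines = block.split("\n")
        # Drop the numeric counter that starts each SRT block.
        if lines and lines[0].strip().isdigit():
            lines = lines[1:]
        if not lines:
            continue
        # WebVTT uses a dot, not a comma, before the milliseconds.
        lines[0] = TIMING.sub(r"\1.\2 --> \3.\4", lines[0])
        out.extend(lines)
        out.append("")
    return "\n".join(out)


# Illustrative sample, not taken from any real course.
SAMPLE_SRT = """1
00:00:01,000 --> 00:00:04,000
Welcome to the course.

2
00:00:04,500 --> 00:00:08,250
Today we look at subtitles.
"""

print(srt_to_webvtt(SAMPLE_SRT))
```

Because every cue carries its own start and end time, nothing in the conversion depends on the frame rate of the video.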

However, the HTML5 video element supports a track element, which can be used to specify a subtitle file, e.g. in WebVTT format, and so meet the needs of streaming video with captions in a language chosen by the user.
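As a rough illustration of the track element, the sketch below generates the markup for a video with one subtitle track per language. The helper `video_with_tracks` and the file names are hypothetical; the `kind`, `src`, `srclang`, `label` and `default` attributes are the ones the HTML5 track element defines.

```python
from html import escape


def video_with_tracks(video_src, tracks):
    """Return HTML for a <video> with one <track> per (srclang, label, vtt_url)."""
    lines = ['<video controls src="{}">'.format(escape(video_src))]
    for i, (srclang, label, url) in enumerate(tracks):
        # Mark the first track as the one the browser enables by default.
        default = " default" if i == 0 else ""
        lines.append(
            '  <track kind="subtitles" src="{}" srclang="{}" label="{}"{}>'.format(
                escape(url), escape(srclang), escape(label), default
            )
        )
    lines.append("</video>")
    return "\n".join(lines)


# Hypothetical file names, for illustration only.
print(
    video_with_tracks(
        "lecture.mp4",
        [("en", "English", "lecture.en.vtt"), ("hi", "Hindi", "lecture.hi.vtt")],
    )
)
```

The browser then offers the listed languages in its caption menu, so adding a language only means adding one more WebVTT file and one more track line.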

It is common these days to have same language subtitles (http://en.wikipedia.org/wiki/Same_language_subtitling) for television and video. The obvious advantage is that it makes content accessible to the hearing impaired. Another advantage is its educational value: it makes reading practice an incidental and subconscious part of entertainment. On the web, however, it has an even greater significance, which is probably why Google has been at the forefront of the WebVTT format. The reason is that it allows video content to be searched easily!
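The search benefit is easy to picture once captions are plain text with timestamps. The following is a minimal sketch, not how any search engine actually indexes video: it scans WebVTT text for a phrase and returns the start time of every matching cue, so a viewer could jump straight to that point. The function name and sample captions are invented for illustration.

```python
import re


def find_in_captions(vtt_text, query):
    """Return (start_time, text) for each cue whose text contains the query."""
    hits = []
    for block in re.split(r"\n\s*\n", vtt_text.strip()):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            if "-->" in line:
                start = line.split("-->")[0].strip()
                # Join multi-line cues so a phrase split across lines still matches.
                text = " ".join(part.strip() for part in lines[i + 1:])
                if query.lower() in text.lower():
                    hits.append((start, text))
                break
    return hits


# Illustrative sample, not taken from any real course.
SAMPLE_VTT = """WEBVTT

00:00:01.000 --> 00:00:04.000
Welcome to the course.

00:00:04.500 --> 00:00:08.250
Today we look at
subtitles and captions.
"""

print(find_in_captions(SAMPLE_VTT, "subtitles"))
```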

Machine Translation of Captions

Manual translation is time consuming and expensive, even with crowdsourcing. The volume of content is too large to be translated into all languages of interest within a useful time frame. Furthermore, the content of technical courses is likely to be unambiguous and is not expected to rely on subtle differences in the interpretation of words and phrases. Machine translation may provide the answer.

If you search the web for open source machine translation engines, you will find Moses (http://www.statmt.org/moses/), a statistical translator, and Apertium (http://www.apertium.org/), a rule-based translator.

The capabilities of Moses are, in principle, similar to those of the software used by Google and Microsoft. However, it does not come with the language models and datasets needed to carry out translations, so to be useful you must supply your own language models and training data.

Apertium, however, comes with translation capabilities for a number of language pairs. The current list and status can be seen at http://wiki.apertium.org/wiki/List_of_language_pairs.

Unfortunately, progress in pure open source tools is likely to be slow. The reason is fairly obvious: web-based translators from Google, Microsoft and others provide excellent working alternatives. These sites have a wealth of data, e.g. pages from multilingual sites, which may be used for training and fine-tuning the translations.

If same language subtitles are available, you may rely upon machine translation for generating subtitles in a language for which a machine translator is available. YouTube provides this feature for translated captions on its site by using Google Translate, e.g. http://www.youtube.com/watch?v=1St0tJVGCW8.

So, the easiest option is to use the Google or Bing translators on the web. Several open source tools were created to translate subtitles using the Google Translate API. However, these tools stopped working after the changes in the usage policy of the Google Translate API, though they may be modified to use Microsoft's translation API instead.
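Whichever translation service is used, the subtitle tool itself has a simple job: translate the text of each cue and leave the counters and timings untouched. Here is a minimal sketch of that idea. The `translate` argument is a stand-in for whatever service or engine is available; `placeholder_translate` below is a dummy that only tags the text, not a real translator, and all names are hypothetical.

```python
import re


def translate_srt(srt_text, translate):
    """Translate the text of each SubRip cue, keeping counters and timings."""
    blocks = re.split(r"\n\s*\n", srt_text.replace("\r\n", "\n").strip())
    out_blocks = []
    for block in blocks:
        lines = block.split("\n")
        # Locate the timing line; everything after it is cue text.
        idx = next((i for i, line in enumerate(lines) if "-->" in line), None)
        if idx is None:
            out_blocks.append(block)  # Not a cue; pass it through unchanged.
            continue
        header = lines[: idx + 1]
        # Join the cue's lines so the translator sees the whole sentence.
        text = " ".join(part.strip() for part in lines[idx + 1:])
        out_blocks.append("\n".join(header + [translate(text)]))
    return "\n\n".join(out_blocks) + "\n"


def placeholder_translate(text):
    """Dummy translator (assumption): swap in a call to a real engine here."""
    return "[xx] " + text


# Illustrative sample, not taken from any real course.
SAMPLE_SRT = """1
00:00:01,000 --> 00:00:04,000
Welcome to the course.

2
00:00:04,500 --> 00:00:08,250
Today we look at
subtitles and captions.
"""

print(translate_srt(SAMPLE_SRT, placeholder_translate))
```

Keeping the translator behind a plain function argument means a change in one provider's usage policy only requires replacing that one function, not the subtitle handling around it.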

We can hope that MOOC providers will make same-language captions available for their course videos so that machine translation can spread this knowledge to an even wider group of learners.

A side lesson: the sudden changes in the usage policy for the Google Translate API reinforce the need for pure open source solutions, both for the translation applications and for the language models and translation datasets. The generosity of commercial sites will be aligned with their commercial interests and cannot be taken for granted.
