Efficient transcription of interviews

Posted on 2018 Feb 24, Sat in Technology / programming, Education, Usability

Transcribing audio or video recordings is a very time-consuming task. Depending on your experience, a 15-minute interview could take 40 minutes to transcribe. Once upon a time I wanted to produce subtitles for one of my lectures - the first hour of the video took me around 3 hours to process. I gave up.

If you're lucky and the voices are clear, speech recognition could help - you can see this in action by trying out some videos on YouTube. However, if the transcription is part of an anonymous user study, you cannot use YouTube or any other online tool, because you'd be sharing the data with a third-party service, thus violating the privacy of your participants!

The best way to handle it is to avoid the problem altogether, by not signing up for the job. However, if you have no choice, you can tweak your workflow and make the process less painful:

I use VLC, a free media player
I slow down the playback in VLC to 0.75x or 0.5x, depending on the interviewee; the speed is adjusted by pressing [ or ] on the keyboard
I configure VLC's global hotkeys feature, which enables me to control playback without switching to VLC itself. The hotkeys you need are play/pause, which I set to Ctrl+Alt+Space, and jump back/forward, which I set to Ctrl+Alt+z/x

Once you do this, you can type the text as you listen to the interview, pausing and resuming it without having to switch to the media player. If you miss an audio fragment, you can rewind or fast-forward without leaving the text editor. I intentionally slow down the speech to make it more likely that I can type as I listen, in one pass (i.e. without having to rewind). If you leave the playback speed untouched, you'll catch yourself going back more often, thus spending more time overall.
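To get a feeling for what slowing down costs, here is a rough sketch of the arithmetic. It assumes an idealized single pass with no pauses and no rewinds, which real transcription will not fully achieve; the function name `single_pass_minutes` is made up for this illustration.

```python
def single_pass_minutes(recording_minutes: float, speed: float) -> float:
    """Time needed to play a recording once at the given playback speed.

    Idealized: assumes you never pause or rewind.
    """
    if speed <= 0:
        raise ValueError("playback speed must be positive")
    return recording_minutes / speed


# A 15-minute interview at the speeds mentioned above.
for speed in (1.0, 0.75, 0.5):
    minutes = single_pass_minutes(15, speed)
    print(f"{speed}x: {minutes:.0f} min")
```

Under these assumptions, even at 0.5x a single pass through a 15-minute interview takes 30 minutes, so the slow-down is worth it as long as it saves you from rewinding.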

That's it!

Why does this matter? If you observe a person producing a transcription (it can also be you, just be introspective), you'll notice that context-switching takes a great amount of time:

you press Alt+Tab to switch focus to the window of the media player
then you move your hand to the mouse to drag the progress slider back (or, at best, use keyboard shortcuts to rewind, without taking your hands off the keyboard)
as you do that, you probably "overshoot" by going further back than needed, so that you have time to switch back to the text editor and put your fingers on the keyboard before the playback reaches the moment you want to transcribe
the "overshoot" reserve can sometimes be too big, so you have to stay idle, waiting for the moment when you can resume typing

You can estimate the amount of wasted time by using keystroke-level modeling (KLM) - note how hovering (called "homing" in the KLM literature) and pointing are expensive operations, while keying (i.e. pressing keys) is one of the cheapest.

Action	Duration (s)
Mental preparation	1.35
Keying	0.2
Hovering (hand goes from keyboard to mouse and the other way around)	0.4
Pointing the mouse to a specific place	1.1
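As an illustration, here is a minimal sketch that adds up the durations from the table for a single rewind, done once with the mouse and once with a global hotkey. The two operator sequences are my own assumptions about how the steps break down (each key of a shortcut and the mouse button press are each counted as one keying operation); your own sequences may differ.

```python
# Operator durations in seconds, taken from the table above.
KLM_SECONDS = {
    "M": 1.35,  # mental preparation
    "K": 0.2,   # keying (one key or button press)
    "H": 0.4,   # hovering: hand moves between keyboard and mouse
    "P": 1.1,   # pointing the mouse to a specific place
}


def klm_time(sequence: str) -> float:
    """Estimated duration, in seconds, of a sequence of KLM operators."""
    return round(sum(KLM_SECONDS[op] for op in sequence), 2)


# Assumed sequence for rewinding with the mouse:
# think, Alt+Tab (K K), hand to mouse (H), point at the slider (P),
# press the mouse button (K), hand back to keyboard (H), Alt+Tab (K K).
MOUSE_REWIND = "MKKHPKHKK"

# Assumed sequence for rewinding with a global hotkey:
# think, Ctrl+Alt+z (K K K).
HOTKEY_REWIND = "MKKK"

print("mouse: ", klm_time(MOUSE_REWIND))   # 4.25
print("hotkey:", klm_time(HOTKEY_REWIND))  # 1.95
```

With these assumed sequences, each rewind is estimated at 4.25s with the mouse and 1.95s with the hotkey, and that does not include the idle time caused by overshooting.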

This puts things into perspective and it becomes clear that you can shave off a lot of time from the entire process if you minimize context switches and get rid of hovering and pointing. There is a cognitive switch too - you're moving from a text editor to a media player and the other way around, and you're juggling two apples, instead of one!

If keystroke-level modeling sounds interesting to you, check out "The Humane Interface" by Jef Raskin; that's where I learned about this technique. You can also check out an earlier story I've written, where I use KLM to compare two online-banking interfaces (in Romanian, but the screenshots and the calculations may speak for themselves). Finally, some people say that this video lecture about keystroke-level modeling is pretty good.
