These are notes that I hope will be helpful for those trying to install and use forced alignment systems. I have used two, PraatAlign and FAVE, which are quite similar in their fundamental structure and in the accuracy of their output. I include instructions for FAVE here because I believe it is more commonly used and better documented.
Before you start
- These notes are Mac-specific. There are also instructions on the FAVE wiki (see below) for other platforms.
- You will need to use the command line in the Terminal in order to install and use the aligner.
- Along with FAVE, you will also need to download and install HTK and SoX, and you may need to update the Xcode tools. Installing HTK is the most complicated part of the installation process. You have to register on the HTK site (this is free) before you can download it.
- FAVE used to have a web-based interface (http://fave.ling.upenn.edu) where, as I understand it, you could upload a wav file and a transcription and have them aligned. That site is unavailable and appears to be down for good. You may hear about people who used it in the past, but it no longer seems to be an option, so to use FAVE you have to install it (and the other necessary software) on your own computer.
- Some basic vocab/clarifications
- A phonetic forced alignment system takes an orthographic transcription and a sound file, and outputs the “best-fit” sequence of phones, with their boundaries, aligned to the sound signal. The sequence of phones is determined from the orthographic transcription via a dictionary that includes the phonetic transcription(s) for every word in the orthographic transcription. The alignment is the best fit of that phone sequence to the signal, as scored by pre-trained acoustic models for each phone.
- Most phonetic aligners make use of HTK, a toolkit for building and manipulating Hidden Markov Models that is primarily used for automatic speech recognition research.
- FAVE is not itself a forced aligner, but rather a set of scripts that allows for flexible use of the Penn Phonetics Lab Forced Aligner. FAVE lets you feed in time-aligned transcriptions, whereas the base aligner does not let you specify times, so you need one full transcription per sound file. FAVE is therefore much more useful for most applications, and allows for more accurate alignments.
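To make the dictionary step described above concrete, here is a minimal sketch of how an orthographic transcription is expanded into phones before any alignment happens. It assumes a CMU-style dictionary layout (a word followed by its phones on one line); the three entries, the file name `toy_dict.txt`, and the function name `words_to_phones` are made up for illustration and are not part of FAVE or HTK.

```shell
# Toy pronunciation dictionary: WORD, then its phones (illustrative entries only).
DICT_FILE="${TMPDIR:-/tmp}/toy_dict.txt"
cat > "$DICT_FILE" <<'EOF'
THE DH AH0
CAT K AE1 T
SAT S AE1 T
EOF

# Print each word of a transcription with its phone sequence, one word per line.
# Words that are not in the dictionary are flagged, because an aligner
# cannot place a word it has no pronunciation for.
words_to_phones() {
    printf '%s\n' "$1" | tr '[:lower:]' '[:upper:]' | tr -s ' ' '\n' |
    awk '
        # First file (the dictionary): remember the phones for each word.
        NR == FNR { word = $1; $1 = ""; sub(/^ +/, ""); phones[word] = $0; next }
        # Standard input (the transcription): look each word up.
        NF { print $1 "\t" (($1 in phones) ? phones[$1] : "<not in dictionary>") }
    ' "$DICT_FILE" -
}

words_to_phones "the cat sat"
```

The aligner's real work starts after this step: given the phone sequence, it finds the boundaries that best fit the audio. The sketch only shows why every word in your transcription needs a dictionary entry.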
Documentation/Links
- FAVE wiki: The most important resource. Read this BEFORE you start. This will guide you through the entire download and installation process of FAVE, HTK, and Sox.
- However, there are currently some conflicting instructions about how to avoid a bug in HTK installation:
- These instructions, under the tab “HTK on OS X,” are, I believe, the best way to go about this. You can see an alternative solution here, under the tab “HTK 3.4.1.” You should not have to do both.
- HTK: http://htk.eng.cam.ac.uk/
- Information on the Penn Phonetics Lab Forced Aligner for English (and there is also one for Mandarin!) is available here, under “Softwares.”
Troubleshooting
export CPPFLAGS="-UPHNALG -I/usr/include/malloc/"
./configure --without-x --disable-hslab
make all
sudo make install
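Once the build has finished, a quick way to see whether the pieces are in place is to check that the programs can be found from the Terminal. This is a minimal sketch: `check_tools` is a hypothetical helper, and the tool names passed to it (`HVite` and `HCopy` from HTK, `sox` from SoX) are an assumption about what the aligner calls, so adjust the list to whatever your setup actually needs.

```shell
# Report which of the named commands can be found on the PATH.
# Returns non-zero if any of them is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# HVite and HCopy are HTK programs; sox is the SoX command-line tool.
check_tools HVite HCopy sox || echo "Install or fix the missing tools before running the aligner."
```

If a tool is reported as missing right after `sudo make install`, opening a new Terminal window (so the PATH is re-read) is worth trying before rebuilding.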