Lip reading from thermal cameras
A constructive research methodology has been used to explore the use of thermal images for improving Automatic Speech Recognition (ASR) performance. Previous research has shown that adding a visual modality improves ASR performance in both clean and noisy environments. However, Audio-Visual Automatic Speech Recognition (AVASR) performance can be greatly affected by changing lighting conditions. Conversely, thermal cameras are highly invariant to changes in lighting conditions; as such, the use of thermal video may be beneficial to AVASR.
An AVASR system was created to test the effect of adding a third modality to AVASR. Mel-Frequency Cepstral Coefficient (MFCC) features were used for audio speech recognition. For visual recognition, both the standard video and the thermal video were processed using a method derived from Wai Chee Yau's (2008) Motion Templates feature extraction approach.
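The core of the Motion Templates approach is the motion history image (MHI), in which recent motion appears bright and older motion fades linearly. The following is a minimal numpy sketch of that accumulation step, not the system's actual implementation; the `threshold` and `tau` values are illustrative assumptions.

```python
import numpy as np

def motion_history(frames, threshold=30, tau=15):
    """Accumulate a motion history image over a sequence of grayscale frames.

    Pixels that changed by more than `threshold` between consecutive frames
    are set to `tau`; all other pixels decay by one step per frame, so newer
    mouth motion appears brighter than older motion.
    """
    mhi = np.zeros(frames[0].shape, dtype=np.float64)
    prev = frames[0].astype(np.float64)
    for frame in frames[1:]:
        cur = frame.astype(np.float64)
        moving = np.abs(cur - prev) > threshold  # pixels that just moved
        mhi = np.where(moving, float(tau), np.maximum(mhi - 1.0, 0.0))
        prev = cur
    return mhi
```

The resulting image compresses a whole utterance's lip motion into a single template, which can then be reduced to a fixed-length feature vector (for example via image moments) for classification.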
A custom audio-visual database was created for this project. Eleven participants were recorded in audio, standard video and thermal video, each repeating ten words (the digits zero through nine) fifteen times, yielding a database of 1650 words for testing. Testing used a speaker-dependent, isolated word recognition system. For each participant, 14 samples of each word were used to train Hidden Markov Models (HMMs), with the remaining sample used for testing after Gaussian white noise was added to the audio signal at 20, 10, 0, -10 and -20 decibel signal-to-noise ratios (SNRs). This test was repeated five times with a different held-out sample each time, and the results were averaged to reduce sample bias.
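Corrupting the audio at a target SNR follows directly from the definition SNR(dB) = 10·log10(P_signal / P_noise): the noise power is the measured signal power scaled by 10^(−SNR/10). A minimal sketch of that step, assuming numpy and not reflecting the project's exact tooling:

```python
import numpy as np

def add_noise_at_snr(signal, snr_db, rng=None):
    """Add Gaussian white noise to `signal` at the requested SNR in dB.

    The noise variance is derived from the measured signal power so that
    P_signal / P_noise equals 10**(snr_db / 10).
    """
    if rng is None:
        rng = np.random.default_rng()
    p_signal = np.mean(np.asarray(signal, dtype=np.float64) ** 2)
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(p_noise), size=np.shape(signal))
    return signal + noise
```

At 0 dB the noise power equals the signal power, and at -20 dB the noise is a hundred times stronger than the speech, which is why recognition at the lowest SNRs leans heavily on the visual modalities.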
It was successfully shown that combining audio, standard video and thermal video for ASR can improve performance, increasing recognition rates by a relative 11.8% over combined audio and standard video ASR, and by a relative 38.2% over audio-only ASR, when averaged over all noise levels.
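Note that these gains are relative, not absolute: a relative improvement is the change in recognition rate divided by the baseline rate. The absolute rates below are illustrative placeholders only (the abstract does not report them), chosen to show the arithmetic:

```python
def relative_improvement(new_rate, old_rate):
    """Relative improvement of one recognition rate over a baseline rate."""
    return (new_rate - old_rate) / old_rate

# Hypothetical example: if audio-only ASR averaged 50.0% and the trimodal
# system 69.1%, the relative gain is (69.1 - 50.0) / 50.0 = 38.2%.
print(round(relative_improvement(69.1, 50.0), 3))  # → 0.382
```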