Lip reading from thermal cameras

aut.embargo: No [en_NZ]
aut.thirdpc.contains: No [en_NZ]
aut.thirdpc.permission: No [en_NZ]
aut.thirdpc.removed: No [en_NZ]
dc.contributor.advisor: Fong, Alvis
dc.contributor.author: Anderson, Steven James
dc.date.accessioned: 2012-11-21T22:05:13Z
dc.date.available: 2012-11-21T22:05:13Z
dc.date.copyright: 2012
dc.date.created: 2012
dc.date.issued: 2012
dc.date.updated: 2012-11-21T00:11:36Z
dc.description.abstract: A constructive research methodology has been used to explore the use of thermal images for improving Automatic Speech Recognition (ASR) performance. Previous research has shown that the addition of a visual modality for speech recognition improves ASR performance in both clean and noisy environments. However, Audio-Visual Automatic Speech Recognition (AVASR) performance can be greatly affected by changing lighting conditions. Conversely, thermal cameras are highly invariant to changes in lighting conditions. As such, the use of thermal video may be beneficial to AVASR. An AVASR system was created for testing the effect of adding a third modality to AVASR. Mel-frequency Cepstral Coefficient (MFCC) based speech recognition was used for audio speech recognition. For visual recognition, the standard video and thermal video were processed using a method derived from Wai Chee Yau’s (2008) proposed Motion Templates method of feature extraction. A custom audio-visual database was created for this project. Eleven participants were recorded in audio, standard video and thermal video repeating ten words each (the numbers zero through nine) fifteen times to create a database of 1650 words for testing. Testing was completed using a speaker-dependent, isolated-word recognition system. For each participant, 14 samples of each word were used for training Hidden Markov Models (HMM), with the remaining sample used for testing with Gaussian white noise added to the audio signal at 20, 10, 0, -10 and -20 decibel signal-to-noise ratios (SNR). This test was repeated five times with different samples selected for each test, and the results were averaged to reduce sample bias. It was successfully shown that combining audio, standard video and thermal video for ASR can improve performance, increasing recognition rates by a relative 11.8% over combined audio and standard video ASR and by a relative 38.2% over audio-only ASR when averaged over all noise levels. [en_NZ]
dc.identifier.uri: https://hdl.handle.net/10292/4741
dc.language.iso: en [en_NZ]
dc.publisher: Auckland University of Technology
dc.rights.accessrights: OpenAccess
dc.subject: ASR [en_NZ]
dc.subject: AVASR [en_NZ]
dc.subject: Speech recognition [en_NZ]
dc.subject: Thermal video [en_NZ]
dc.subject: Signal processing [en_NZ]
dc.subject: Motion templates [en_NZ]
dc.title: Lip reading from thermal cameras [en_NZ]
dc.type: Thesis
thesis.degree.discipline:
thesis.degree.grantor: Auckland University of Technology
thesis.degree.level: Masters Theses
thesis.degree.name: Master of Computer and Information Sciences [en_NZ]
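
Illustrative note on the abstract's noise test: the abstract states that Gaussian white noise was added to the audio signal at 20, 10, 0, -10 and -20 dB SNR. The thesis code is not part of this record, so the following is only a minimal sketch of one common way to do this, assuming the noise power is scaled against the mean power of the clean signal. The function names `add_white_noise` and `measured_snr_db`, and the synthetic sine-wave input, are hypothetical and chosen for illustration.

```python
import math
import random


def add_white_noise(signal, snr_db, seed=None):
    """Return a copy of `signal` with Gaussian white noise at the target SNR (dB).

    Assumption: SNR is defined as mean signal power over mean noise power.
    """
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    # SNR_dB = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR_dB / 10)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Estimate the SNR (dB) actually achieved, given the clean and noisy signals."""
    signal_power = sum(c * c for c in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# Synthetic stand-in for a recorded word: a 440 Hz tone sampled at 16 kHz.
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(20000)]

# The five noise levels listed in the abstract.
for snr_db in (20, 10, 0, -10, -20):
    noisy = add_white_noise(clean, snr_db, seed=1)
    print(snr_db, round(measured_snr_db(clean, noisy), 2))
```

The measured value only approximates the target because the noise is a finite random sample; longer signals give a closer match.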
Files
Original bundle
Name: AndersonS.pdf
Size: 1.82 MB
Format: Adobe Portable Document Format
Description: Whole thesis