Lip Reading From Thermal Cameras

Anderson, Steven James

aut.embargo	No	en_NZ
aut.thirdpc.contains	No	en_NZ
aut.thirdpc.permission	No	en_NZ
aut.thirdpc.removed	No	en_NZ
dc.contributor.advisor	Fong, Alvis
dc.contributor.author	Anderson, Steven James
dc.date.accessioned	2012-11-21T22:05:13Z
dc.date.available	2012-11-21T22:05:13Z
dc.date.copyright	2012
dc.date.created	2012
dc.date.issued	2012
dc.date.updated	2012-11-21T00:11:36Z
dc.description.abstract	A constructive research methodology was used to explore whether thermal images can improve Automatic Speech Recognition (ASR) performance. Previous research has shown that adding a visual modality to speech recognition improves ASR performance in both clean and noisy environments. However, Audio-Visual Automatic Speech Recognition (AVASR) performance can be greatly affected by changing lighting conditions. Thermal cameras, by contrast, are highly invariant to changes in lighting conditions, so the use of thermal video may be beneficial to AVASR. An AVASR system was created to test the effect of adding thermal video as a third modality. Mel-frequency Cepstral Coefficient (MFCC) based speech recognition was used for audio speech recognition. For visual recognition, the standard video and thermal video were processed using a method derived from Wai Chee Yau’s (2008) proposed Motion Templates method of feature extraction. A custom audio-visual database was created for this project. Eleven participants were recorded in audio, standard video and thermal video repeating ten words (the numbers zero through nine) fifteen times each, creating a database of 1650 words for testing. Testing was completed using a speaker-dependent, isolated-word recognition system. For each participant, 14 samples of each word were used to train Hidden Markov Models (HMMs), and the remaining sample was used for testing, with Gaussian white noise added to the audio signal at signal-to-noise ratios (SNRs) of 20, 10, 0, -10 and -20 decibels. This test was repeated five times with different samples selected for each test, and the results were averaged to reduce sample bias. The results show that combining audio, standard video and thermal video for ASR can improve performance, increasing recognition rates by a relative 11.8% over combined audio and standard video ASR and by a relative 38.2% over audio-only ASR when averaged over all noise levels.	en_NZ
dc.identifier.uri	https://hdl.handle.net/10292/4741
dc.language.iso	en	en_NZ
dc.publisher	Auckland University of Technology
dc.rights.accessrights	OpenAccess
dc.subject	ASR	en_NZ
dc.subject	AVASR	en_NZ
dc.subject	Speech recognition	en_NZ
dc.subject	Thermal video	en_NZ
dc.subject	Signal processing	en_NZ
dc.subject	Motion templates	en_NZ
dc.title	Lip Reading From Thermal Cameras	en_NZ
dc.type	Thesis
thesis.degree.discipline
thesis.degree.grantor	Auckland University of Technology
thesis.degree.level	Masters Theses
thesis.degree.name	Master of Computer and Information Sciences	en_NZ
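
The noisy-audio test condition described in the abstract (Gaussian white noise added to the audio signal at fixed SNRs) can be illustrated with a minimal sketch. This is not code from the thesis: the function names add_white_noise and measured_snr_db, the 440 Hz test tone, and the 16 kHz sample rate are all assumptions chosen for illustration, and the thesis may have scaled or generated its noise differently.

```python
import math
import random


def add_white_noise(signal, snr_db, rng=None):
    """Return a copy of `signal` with Gaussian white noise at `snr_db` decibels.

    Hypothetical helper: the noise variance is chosen so that
    10 * log10(signal_power / noise_power) equals snr_db in expectation.
    """
    rng = rng or random.Random()
    signal_power = sum(x * x for x in signal) / len(signal)
    if signal_power == 0.0:
        raise ValueError("signal has zero power, so SNR is undefined")
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Estimate the SNR in decibels of `noisy` relative to `clean`."""
    signal_power = sum(c * c for c in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# Assumed stand-in for a recorded word: one second of a 440 Hz tone at 16 kHz.
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]

# The five SNR levels listed in the abstract.
for snr in (20, 10, 0, -10, -20):
    noisy = add_white_noise(clean, snr, rng=random.Random(0))
    print(snr, round(measured_snr_db(clean, noisy), 2))
```

The noise power follows from the definition of SNR in decibels: noise_power = signal_power / 10^(SNR/10). At 0 dB the noise carries as much power as the speech, and at -20 dB it carries 100 times as much.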

Files

Original bundle

Name: AndersonS.pdf
Size: 1.82 MB
Format: Adobe Portable Document Format
Description: Whole thesis

License bundle

Name: license.txt
Size: 897 B
Format: Item-specific license agreed upon to submission
Description:

Collections

Masters Theses