Audio surveillance in unstructured environments

Sharan, Roneel Vikash
Moir, Tom
Collins, John
Degree name
Doctor of Philosophy
Auckland University of Technology

This research examines audio surveillance, one of the many applications of sound event recognition (SER), and aims to improve the sound recognition rate in the presence of environmental noise using time-frequency image analysis of the sound signal and deep learning methods. The sound database contains ten sound classes, each with multiple subclasses, exhibiting interclass similarity and intraclass diversity. Noise from three different environments is added to the sound signals, and the proposed and baseline methods are tested under clean conditions and at four different signal-to-noise ratios (SNRs) in the range of 0–20 dB. The baseline features considered in this work are mel-frequency cepstral coefficients (MFCCs), gammatone cepstral coefficients (GTCCs), and the spectrogram image feature (SIF), in which the spectrogram image of the sound signal is divided into blocks and central moments are computed in each block and concatenated to form the final feature vector. Next, several methods are proposed to improve the classification performance in the presence of noise. Firstly, a variation of the SIF with reduced feature dimensions is proposed, referred to as the reduced spectrogram image feature (RSIF). The RSIF utilizes the mean and standard deviation of the central moment values along the rows and columns of the blocks, resulting in a feature dimension 2.25 times lower than that of the SIF. Despite the reduction in feature dimension, the RSIF was seen to outperform the SIF in classification performance due to its higher immunity to inconsistencies in sound signal segmentation. Secondly, a feature based on the image texture analysis technique of the gray-level co-occurrence matrix (GLCM), which captures the spatial relationship of pixels in an image, is proposed.
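As a rough illustration of what a GLCM records, the sketch below counts how often a pixel with gray level i has a right-hand neighbour with gray level j. It is a minimal, assumption-laden example and not the thesis implementation: the function names `glcm` and `normalize`, the single horizontal offset, the four gray levels, and the toy 4x4 image are all chosen here for illustration only.

```python
def glcm(image, levels, dr=0, dc=1):
    """Count co-occurrences of gray levels (i, j) at pixel offset (dr, dc).

    `image` is a list of rows of integer gray levels in [0, levels).
    The default offset (0, 1) pairs each pixel with its right-hand neighbour.
    """
    matrix = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            # Skip pairs whose second pixel falls outside the image.
            if 0 <= r2 < rows and 0 <= c2 < cols:
                matrix[image[r][c]][image[r2][c2]] += 1
    return matrix


def normalize(matrix):
    """Scale the counts so that all entries sum to one."""
    total = sum(map(sum, matrix))
    return [[value / total for value in row] for row in matrix]


# Toy 4x4 "image" already quantized to 4 gray levels (illustrative values only).
toy = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]

counts = glcm(toy, levels=4)
probabilities = normalize(counts)
```

Here `counts[2][2]` is 3 because the pair (2, 2) occurs three times along the rows of the toy image. In a texture feature, the entries of one or more such matrices would typically be flattened into a vector.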
The GLCM texture analysis technique is applied in subbands to the spectrogram image, and the matrix values from each subband are concatenated to form the final feature vector, which is referred to as the spectrogram image texture feature (SITF). The SITF was seen to be significantly more noise robust than all the baseline features and the RSIF, but with a higher feature dimension. Thirdly, the time-frequency image representation called the cochleagram is proposed in place of the conventional spectrogram image. The cochleagram image is a variation of the spectrogram image that utilizes a gammatone filter, as used for GTCCs. The gammatone filter offers more frequency components with narrower bandwidths in the lower frequency range and fewer frequency components with wider bandwidths in the higher frequency range, which better reveals the spectral information for the sound signals considered in this work. With cochleagram feature extraction, the spectrogram features SIF, RSIF, and SITF are referred to as CIF, RCIF, and CITF, respectively. The use of cochleagram feature extraction was seen to improve the classification performance under all noise conditions, with the greatest improvement at low SNRs. Fourthly, feature vector combination has been seen to improve the classification performance in a number of studies, and this work proposes a combination of linear GTCCs and cochleagram image features. This feature combination was seen to improve the classification performance of CIF, RCIF, and CITF and, once again, the greatest improvement was at low SNRs. Finally, while support vector machines (SVMs) seem to be the preferred classifier in most SER applications, deep neural networks (DNNs) are proposed in this work. SVMs are used as the baseline classifier, and in each case the results are compared with those of DNNs.
Since the SVM is a binary classifier, four common multiclass classification methods are considered: one-against-all (OAA), one-against-one (OAO), decision directed acyclic graph (DDAG), and adaptive directed acyclic graph (ADAG). The classification performance of all the classification methods is compared using individual and combined features, and the training and evaluation times are also compared. Among the multiclass SVM classification methods, the OAA method was generally seen to be the most noise robust and gave better overall classification performance. However, the DNN classifier was found to have the best noise robustness and the best overall classification performance with both individual and combined features. DNNs also offered the fastest evaluation time, but their training time was the slowest.
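To make the multiclass decompositions mentioned above more concrete, the following sketch shows how binary decisions might be combined for ten classes. It is only an illustration under stated assumptions: `decide` is a hypothetical stand-in for a trained pairwise SVM (it simply picks the nearer of two one-dimensional class prototypes), and the function names are invented here. OAA trains one binary classifier per class, OAO trains one per pair of classes and takes a majority vote, and a DDAG reuses the pairwise classifiers but eliminates one class per decision. ADAG, which arranges the same pairwise classifiers differently, is not shown.

```python
from itertools import combinations


def oao_predict(classes, decide, x):
    """One-against-one: max-wins voting over all k(k-1)/2 pairwise decisions."""
    votes = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        votes[decide(a, b, x)] += 1
    return max(classes, key=lambda c: votes[c])


def ddag_predict(classes, decide, x):
    """DDAG-style elimination: each pairwise decision removes the losing class."""
    remaining = list(classes)
    evaluations = 0
    while len(remaining) > 1:
        a, b = remaining[0], remaining[-1]
        winner = decide(a, b, x)
        remaining.remove(b if winner == a else a)
        evaluations += 1
    return remaining[0], evaluations


# Hypothetical stand-in for trained pairwise SVMs: every class is represented
# by a 1-D prototype and the "classifier" picks the nearer prototype.
prototypes = {c: float(c) for c in range(10)}


def decide(a, b, x):
    return a if abs(x - prototypes[a]) <= abs(x - prototypes[b]) else b


classes = list(prototypes)
k = len(classes)
n_oaa = k                  # one binary classifier per class
n_oao = k * (k - 1) // 2   # one binary classifier per pair of classes

label_oao = oao_predict(classes, decide, 6.2)
label_ddag, n_evaluated = ddag_predict(classes, decide, 6.2)
```

For ten classes this gives 10 OAA classifiers and 45 pairwise classifiers. The elimination scheme evaluates only k - 1 = 9 of the pairwise classifiers per test sample, which is one reason evaluation times can differ between the methods even when the same binary classifiers are trained.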

Sound event recognition, Time-frequency image, Support vector machines, Deep neural networks
Publisher's version