Brain-Inspired Audio-Visual Information Processing Using Spiking Neural Networks
Artificial neural networks are one of the most popular and promising approaches to modern machine learning applications. They are based on a mathematical abstraction of the intricate processing mechanisms in the human brain, while remaining sufficiently simple for efficient processing on conventional computers. Despite efforts to mimic the capabilities of the brain, however, they are limited in their contextual understanding of concepts and behaviours. With the aim of exploring ways to overcome these limitations, this thesis investigates alternatives that are closer to the original biological systems, with a focus on processing auditory and visual signals. Inspired by the functioning of human hearing and vision and by the brain’s capability to dynamically integrate newly perceived information with previous experiences and knowledge, this thesis presents the hypothesis that mimicking these processes more closely could lead to an enhanced analysis of such signals. The framework developed to investigate this hypothesis consisted of three separate but connected projects that examined biologically inspired computational processing of auditory, visual, and combined audio-visual signals, respectively. One aim in designing the framework was to largely preserve the spectral, spatial, and temporal characteristics of the original signals through tonotopic and retinotopic mapping. For the auditory processing system, an encoding and mapping method was developed that transforms sound signals into electrical impulses (“spikes”) by simulating the human cochlea; these spikes were then fed into a brain-shaped three-dimensional spiking neural network at the location of the auditory cortices. For the visual system, the method was developed analogously, simulating the human retina and feeding the resulting spikes into the location of the visual cortex.
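To give a flavour of signal-to-spike encoding of the kind described above, the following is a minimal sketch of a delta-modulation spike encoder, a common scheme in event-based sensing where spikes mark changes in the signal rather than absolute levels. It is an illustrative simplification only, not the cochlea or retina model developed in this thesis; the function name and threshold parameter are hypothetical.

```python
import numpy as np

def delta_spike_encode(signal, threshold=0.25):
    """Encode a 1-D signal as ON/OFF spike trains via delta modulation.

    An ON spike is emitted whenever the signal has risen, and an OFF
    spike whenever it has fallen, by more than `threshold` since the
    last emitted spike. This loosely mirrors how event-based sensory
    front ends signal changes instead of raw amplitudes.
    """
    on, off = [], []
    baseline = signal[0]
    for t, x in enumerate(signal[1:], start=1):
        if x - baseline >= threshold:
            on.append(t)
            baseline = x
        elif baseline - x >= threshold:
            off.append(t)
            baseline = x
    return np.array(on), np.array(off)

# A ramp that rises then falls: ON spikes appear during the rising
# phase and OFF spikes during the falling phase.
sig = np.concatenate([np.linspace(0, 1, 50), np.linspace(1, 0, 50)])
on_spikes, off_spikes = delta_spike_encode(sig)
```

In a tonotopic arrangement, such an encoder would be applied per frequency channel of a cochlear filterbank, so that each channel's spike train feeds a distinct location in the network.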
A key advantage of this approach was that it facilitated a straightforward brain-like combination of input signals for the analysis of audio-visual stimuli in the third project. The approach was tested on two existing benchmark datasets and on one newly created New Zealand Sign Language dataset to explore its capabilities. While the sound processing system achieved good classification results on the chosen speech recognition dataset (91%) compared to existing methods in the same domain, the video processing system, which was tested on a gesture recognition dataset, did not perform as well (51%). The classification results for the combined audio-visual processing model fell between those of the individual models (76.7%), and distinct spike patterns for the five classes could be observed. Even though the models created in this work did not surpass the classification performance of conventional machine learning methods, they demonstrated that systems inspired by biological and neural mechanisms are a promising pathway for investigating audio-visual data in computational systems. Increasing the biological plausibility of the models is expected to lead to better performance and could form a pathway to a more intuitive understanding of such data. To broaden the applicability of the model, it is suggested that future work incorporate other sensory modalities, or signals acquired through different brain recording and imaging methods, and perform further theoretical and statistical analysis of the relationship between model parameters and classification performance.