Article

Multimodal Object Classification Models Inspired by Multisensory Integration in the Brain

1 Department of Electrical and Computer Engineering, Southern Illinois University, Carbondale, IL 62901, USA
2 Vanderbilt Brain Institute, Vanderbilt University, Nashville, TN 37232, USA
* Author to whom correspondence should be addressed.
Brain Sci. 2019, 9(1), 3; https://doi.org/10.3390/brainsci9010003
Submission received: 14 November 2018 / Revised: 12 December 2018 / Accepted: 25 December 2018 / Published: 2 January 2019
(This article belongs to the Section Systems Neuroscience)

Abstract

Two multimodal classification models aimed at enhancing object classification through the integration of semantically congruent unimodal stimuli are introduced. The feature-integrating model, inspired by multisensory integration in the subcortical superior colliculus, combines unimodal features which are subsequently classified by a multimodal classifier. The decision-integrating model, inspired by integration in primary cortical areas, classifies unimodal stimuli independently using unimodal classifiers and classifies the combined decisions using a multimodal classifier. The multimodal classifier models are implemented using multilayer perceptrons and multivariate statistical classifiers. Experiments involving the classification of noisy and attenuated auditory and visual representations of ten digits are designed to demonstrate the properties of the multimodal classifiers and to compare the performances of multimodal and unimodal classifiers. The experimental results show that the multimodal classification systems exhibit an important aspect of the “inverse effectiveness principle” by yielding significantly higher classification accuracies when compared with those of the unimodal classifiers. Furthermore, the flexibility offered by the generalized models enables the simulations and evaluations of various combinations of multimodal stimuli and classifiers under varying uncertainty conditions.

1. Introduction

The main goals of this multidisciplinary machine learning and neuroscience collaboration are (a) to formulate multimodal classifier models inspired by multisensory integration in the brain, and (b) to demonstrate that the resulting classifiers improve object recognition through the integration of semantically congruent unimodal stimuli. A generalized unimodal classification model is introduced, and two purely feed-forward multisensory multimodal models, namely, the feature-integrating (FI) and the decision-integrating (DI) models, are derived from this unimodal model. The two models differ in the type of information that is integrated. The FI model, inspired by multisensory integration in the subcortical superior colliculus (SC), combines unimodal features which are subsequently classified by a multimodal classifier. In the DI model, inspired by integration in primary cortical areas, the unimodal stimuli are classified independently by unimodal classifiers and the unimodal decisions are combined and classified by a multimodal classifier.
The key contribution of this study is the systematic development of the two multimodal classification models with parameters that can be manipulated to simulate (a) the weighted influence of each stimulus in the integration process, (b) different representations of the stimuli in the modalities, (c) mechanisms for combining different forms of unimodal information, and (d) varying degrees of uncertainty in the environment during stimuli integration. Supporting contributions include the design of experiments to demonstrate the versatility of the models and the improvement in performance by combining unimodal stimuli with various attenuation levels in varying degrees of noise. Specifically, each model is developed using artificial neural network and statistical classifiers to show that the classification models are not restricted to any particular type of classifier. The experiments involve the classification of auditory and visual representations of ten digits; however, the models can combine stimuli from other modalities and the number of modalities can be greater than two.

1.1. Multisensory Integration

Humans have multiple senses, including the primary senses of sight (vision), hearing (audition), taste (gustation), smell (olfaction), and touch (somatosensation). The secondary senses include perception of temperature (thermoception), kinesthesia (proprioception), pain (nociception), balance (equilibrioception), and vibration (mechanoreception) [1]. Among the numerous fascinating and complex operations performed by the brain, one of the most important is “multisensory integration,” which is the brain’s ability to combine information from different sensory modalities to robustly and coherently perceive the external environment [2,3,4,5]. The three rules that govern multisensory integration are the spatial, temporal, and inverse effectiveness rules, which state that multisensory integration is enhanced (more likely or stronger) when stimuli occur at approximately the same location, at approximately the same time, and when the unimodal stimuli in the set evoke relatively weak responses, respectively [5]. Multisensory integration enhances the detection of external stimuli, facilitates object recognition, resolves ambiguities and conflicts, and decreases reaction times [5,6]. As illustrated in the simplified diagram in Figure 1, the simple act of deciding whether a cantaloupe is ripe or unripe can involve examining the color (vision), tapping to detect the sound for hollowness (audition), feeling the skin netting (somatosensation), and smelling to detect the scent (olfaction).
Many studies have focused on the integration of auditory and visual stimuli. Originally, the established view posited that vision dominates multisensory perception of the world; however, it is now understood that visual perception can actually be manipulated by other sensory modalities [7]. These effects are well demonstrated in conflicts arising from audio-visual integration [8]. As an example, visual processing dominates auditory processing during spatial processing (e.g., spatial localization of audio-visual stimuli); therefore, when there is conflict between the visual and auditory signals, the response will be influenced mostly by the visual signal. This phenomenon is clearly observed in the ventriloquist effect [9,10], which refers to the ancient art of making one’s voice appear to come from elsewhere. When spatially localizing audio-visual stimuli, vision dominates audition when visual localization is good. However, in situations where visual stimuli are severely blurred (poorly localized), audition dominates vision [10]. In contrast, during temporal processing, audition dominates vision; this can be observed in the sound-induced illusory flash effect [7]. In this effect, when a single visual flash is accompanied by multiple auditory beeps, the single flash is incorrectly perceived as multiple flashes [7]. Dominance in audio-visual speech perception is also famously observed in the McGurk effect [11], which demonstrates the effect of vision on speech perception. In McGurk and MacDonald’s original study, a spoken /ba/ syllable dubbed onto a visual presentation of /ga/ was reported as /da/ on a majority of trials [12].
Several models of multisensory integration at different cortical levels have been reported, and most are based on integration in the subcortical SC and integration in primary cortical areas. In the SC integration model, SC neurons are multisensory and have a receptive field (RF) for each sensory modality [13,14,15]. The RFs partially overlap in sensory space, receive converging unisensory information, and combine this information in an appropriate way. The earlier models of integration in primary cortical areas assume that unimodal stimuli are processed separately in the primary cortices and are then combined in higher-order association areas in the brain [16,17].
Over the past decades, various approaches have been used to explain and demonstrate, at the neural level, many features of multisensory integration [5,6,18,19,20]. Notable neural network-based contributions, among many others, include the detailed mathematical description of a three-neural network model to mimic various responses to multisensory stimuli [18]. A Bayesian framework, which provides a general theory for multisensory integration, is another effective approach that has been developed, and numerous studies have been conducted to demonstrate the validity of this approach with respect to multimodal enhancement [4,19,20].

1.2. Scope of Research

We emphasize that our goal is not aimed at contributing to the wealth of research related to multisensory integration in the brain, but to demonstrate that the two proposed multimodal classification systems are able to emulate, at the systems (input-output) level, some properties related to multisensory object recognition. Specifically, we focus on the following interpretation of the principle of inverse effectiveness: combinations of weakly effective unimodal stimuli produce greater “responses” when compared with the response of the most effective stimulus in the set [5,21]. In the context of object classification, “responses” can be replaced by “classification accuracies.” Consequently, our goal is to show that object classification can be enhanced by combining unimodal stimuli. For demonstration purposes, we focus on the improvement in object classification through the integration of bimodal information extracted from the auditory and visual modalities. However, the generalized formulations of the multimodal classification models enable the integration of information from other modalities. Furthermore, the number of modalities can exceed two.
As noted earlier, in order to demonstrate the invariance of each model to any particular type of classifier, the models are developed using artificial neural network and statistical classifiers. Here too, we emphasize that our goal is not aimed at developing new and improved machine learning classification algorithms, nor is it aimed at determining the best set of features for classifying multimodal stimuli. Instead, our clear goal is to select a classifier and a feature set for each sensory modality, combine these unimodal classifiers into multimodal classification systems, and compare the performance of the resulting multimodal classifiers against the performance of the unimodal classifiers. We select two well-known classifiers: the multilayer perceptron (MLP) neural network classifier and the maximum a posteriori (MAP) classifier; however, it should be clear that other classifiers such as deep-learning neural networks, support vector machines, logistic regression, and decision-tree classifiers can be used in the models.
A preliminary 4-page version of this study describing the MLP implementations of the models and performance in noise was presented at BHI2018 [22]. This 14-page expanded version is a more complete study aimed at demonstrating the versatility of the unimodal and multimodal models by including (a) MAP implementations, (b) additional experiments and analyses in the presence of stimulus noise and amplitude attenuations, (c) a detailed discussion of the results, and (d) suggestions on how the models can be used to demonstrate the spatial and temporal rules. Also included is a more detailed introduction with additional references.
The proposed FI and DI multimodal classification systems are related to ensemble and data fusion classifiers. Typically, ensemble classifiers combine classifiers whereas data fusion classifiers combine data from multiple sensors [23,24,25,26]. These classifiers do not attempt to emulate multisensory integration in the brain, but are aimed at improving the overall classification accuracy by combining diverse classifiers and/or by exploiting complementary information from different sensors. However, the FI and DI models formulated in this paper have several notable features pertaining to multisensory integration, including the ability to tailor the classification systems for each modality and across modalities. For example, (a) the unimodal classifiers can differ across modalities to account for differing classification mechanisms in the modalities, (b) the feature sets can differ across modalities to account for differing “internal mental representations” extracted in the modalities, and (c) the type of noise and attenuation levels can differ across modalities to account for different types of variability in the multimodal stimuli.
The FI and DI multimodal models differ quite significantly in their structures, how information is defined, and the manner in which multimodal information is integrated. The main difference between the structures is that the FI model has a single classifier, and the DI model has multiple classifiers: one for each stimulus modality. The information integrated in the FI model is the set of features extracted from the unimodal stimuli, whereas the information integrated in the DI model is the set of decisions of the unimodal classifiers. The following sections describe the formulations of the unimodal and the two multimodal models for the multiclass classification of unimodal and multimodal stimuli. For each model, the generalized formulation is described first and the MLP and MAP formulations follow. The stimulus classes are represented by $\omega_k$, $k = 1, 2, \ldots, K$, and the sensory stimuli are represented by $S_j$, $j = 1, 2, \ldots, J$, where $K$ and $J$ are the number of stimulus classes and the number of sensory modalities, respectively.

2. Unimodal Classification Model

The generalized classification model for a single sensory modality is illustrated in Figure 2, in which the unimodal classifier for modality j is represented by UM-j. This systems-level model assumes that unimodal stimuli are processed separately in their respective primary channels (primary cortical areas), and there is no cross-modal interaction. Although unrealistic, this model is described in detail and implemented for the purpose of comparing the performances of the multimodal classifiers against the unimodal classifiers. Furthermore, this model is an integral part of the proposed multimodal classification systems which contain the same uncertainty parameters (noise and attenuations).
In this model for modality $j$, $S_j$ is the $d_j$-dimensional input sensory signal, $\alpha_j$ is the signal attenuation (weighting) factor, $N_j$ is the random noise added to the signal, $\Phi_j$ is the feature-generating matrix, $[a_j, b_j]$ is the normalizing interval, $X_j$ is the classifier input, and $Y_j$ is the system output. The dominance (amplitude or strength) of the input sensory signal can be adjusted by $\alpha_j$, which takes values in the interval $[0, 1]$, where zero corresponds to infinite attenuation (zero amplitude or signal absent) and one corresponds to zero attenuation (full amplitude or strength). During the training phase (parameter estimation), $\alpha_j$ is set to 1 but can be varied during testing to evaluate the performance as a function of stimulus attenuation.
The external noise signal $N_j$ is assumed to be uncorrelated with the unimodal signal and to be multivariate Gaussian with zero mean and covariance $\Psi_j$, that is, $N_j \sim G(0, \Psi_j)$. Features are generated by multiplying $\tilde{S}_j$ by the $D_j \times d_j$ transformation matrix $\Phi_j$ so that the $D_j$ features are linear combinations of the elements of the vector $\tilde{S}_j$. Because of its simplicity, data independence (fixed basis vectors), and information-packing capability, the discrete cosine transform (DCT) is selected for $\Phi_j$, $j = 1, 2, \ldots, J$ [27]. The feature vector may be regarded as an “internal or mental representation” of the stimulus. The features are normalized to take values in the interval $[a_j, b_j]$ using a linear transformation of the form $m_j \hat{S}_j + c_j$, where $m_j$ and $c_j$ are the slope and intercept of the line connecting $(\hat{S}_{j,\min}, a_j)$ and $(\hat{S}_{j,\max}, b_j)$, and where $\hat{S}_{j,\min}$ and $\hat{S}_{j,\max}$ are the minimum and maximum values of $\hat{S}_j$, respectively. In the final step, the unimodal classifier determines the $K$-dimensional vector $Y_j$, which contains the information needed, such as posterior probabilities or discriminants, to assign the sensory signal to one of the $K$ stimulus classes.
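To make the processing chain concrete, the following Python sketch (an illustration under stated assumptions, not the authors' implementation) mirrors the front end of Figure 2: the stimulus is attenuated by $\alpha_j$, Gaussian noise is added, DCT features are extracted, and the features are normalized to $[a_j, b_j]$. Keeping the leading coefficients stands in for the ranking-based selection of [27], and the function name, interface, and defaults are assumed for illustration.

```python
import numpy as np
from scipy.fft import dct

def unimodal_front_end(s, alpha, noise_var, n_features, a=0.0, b=1.0, rng=None):
    """Sketch of the unimodal front end (Figure 2) for one stimulus vector.

    s          : 1-D stimulus vector S_j (length d_j)
    alpha      : attenuation factor alpha_j in [0, 1]
    noise_var  : variance of the additive Gaussian noise N_j (diagonal Psi_j assumed)
    n_features : number of DCT features D_j to keep
    """
    rng = np.random.default_rng() if rng is None else rng
    # S~_j = alpha_j * S_j + N_j, with N_j ~ G(0, Psi_j)
    s_tilde = alpha * s + rng.normal(0.0, np.sqrt(noise_var), size=s.shape)
    # S^_j = Phi_j S~_j : here Phi_j keeps the leading D_j DCT coefficients
    # (the paper ranks the coefficients [27]; this is a simplification)
    s_hat = dct(s_tilde, norm='ortho')[:n_features]
    # X_j = m_j * S^_j + c_j : map [min, max] of the features onto [a_j, b_j]
    m = (b - a) / (s_hat.max() - s_hat.min())
    c = a - m * s_hat.min()
    return m * s_hat + c
```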

2.1. Unimodal MLP Network Classifier

As noted in the introduction, the neural network selected for the generalized multiclass classification problem is the MLP. The network is fully interconnected between layers. Using the backpropagation training algorithm, the network determines a mapping $f: X_j^{D_j} \to Y^{K}$ between the input $X_j$ of dimension $D_j$ and the output $Y$ of dimension $K$. In general, the output of a fully interconnected MLP with $M$ layers (layer 1 is the input), expressed as a composite of functions, is given by

$$Y = f(X_j) = f_M\!\left(W_{M-1}^{T} \cdots f_3\!\left(W_2^{T} f_2\!\left(W_1^{T} X_j + \beta_2\right) + \beta_3\right) \cdots + \beta_M\right),$$

where $f_m$ is the activation function in layer $m$, $m \neq 1$; $\beta_m$ is the bias into layer $m$, $m \neq 1$; and $W_m$, $m \neq M$, is the $N_m \times N_{m+1}$ interconnection weight matrix between layers $m$ and $(m+1)$, which have dimensions $N_m$ and $N_{m+1}$, respectively.
For the multiclass classification problems considered in this study, we select the sigmoidal activation function for the intermediate hidden layers, the softmax activation function for the output layer, and the cross-entropy loss function. Therefore, the hidden-layer outputs are given by $f_m(q_i) = 1/(1 + e^{-q_i})$, $m = 2, \ldots, (M-1)$, and the network outputs are given by

$$f_M(q_k) = y(k) = \frac{e^{q_k}}{\sum_{i=1}^{K} e^{q_i}}, \quad k = 1, 2, \ldots, K,$$

where $q_n$ is the weighted sum of the inputs into a neuron $n$ in the respective layer and $y(k)$ is the $k$th output of the network. If $t_k$ is the target of $y(k)$, $k = 1, 2, \ldots, K$, the cross-entropy cost function is given by

$$E = -\sum_{k=1}^{K} t_k \log\!\big(y(k)\big), \quad \text{where} \quad t_k = \begin{cases} 1 & \text{if } X_j \in \omega_k \\ 0 & \text{otherwise.} \end{cases}$$
Using the maximum response rule, the neural network assigns the input stimulus to the class associated with the output that yields the largest value. Because the softmax activation function is selected, the outputs $y(k)$, $k = 1, 2, \ldots, K$, can be regarded as estimates of the posterior probabilities. Consequently, a test feature $X_j$ can be assigned to class $\omega_k$ if

$$y(k) = P(\omega_k/X_j) > y(i) = P(\omega_i/X_j), \quad \text{for all } i \neq k. \tag{1}$$
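As one possible realization of the unimodal MLP (a sketch, not the code used in the paper), scikit-learn's MLPClassifier can be configured with sigmoidal hidden units; for multiclass targets it uses a softmax output layer trained with the cross-entropy loss, and the maximum-response rule of Equation (1) corresponds to taking the arg max of the predicted posteriors. The hidden-layer size and stopping settings below are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_unimodal_mlp(X_train, y_train, n_hidden=32):
    # Sigmoid ('logistic') hidden units; softmax outputs and cross-entropy
    # loss are used automatically for multiclass targets.
    clf = MLPClassifier(hidden_layer_sizes=(n_hidden,), activation='logistic',
                        solver='adam', max_iter=2000, tol=1e-6)
    return clf.fit(X_train, y_train)

def mlp_classify(clf, X_test):
    # Maximum-response rule: choose the class with the largest estimated
    # posterior probability P(omega_k / X_j), as in Equation (1).
    posteriors = clf.predict_proba(X_test)
    return clf.classes_[np.argmax(posteriors, axis=1)]
```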

2.2. Unimodal MAP Classifier

The statistical classifier selected is the MAP classifier, which is the Bayes classifier for a 0–1 loss function. For a given test feature vector $X_j$, the MAP classifier determines all $K$ posterior probabilities and decides in favor of the class which yields the highest posterior probability. If the class-conditional probability density of $X_j$ under $\omega_i$ is $P(X_j/\omega_i)$ and the prior probability of class $\omega_i$ is $P(\omega_i)$, the posterior probability of class $\omega_i$ given $X_j$ is determined from

$$P(\omega_i/X_j) = \frac{P(X_j/\omega_i)\,P(\omega_i)}{\sum_{i'=1}^{K} P(X_j/\omega_{i'})\,P(\omega_{i'})}, \quad i = 1, 2, \ldots, K.$$

The MAP classifier assigns $X_j$ to the class with the largest posterior probability. That is, $X_j$ is assigned to class $\omega_k$ if

$$P(\omega_k/X_j) > P(\omega_i/X_j), \quad \text{for all } i \neq k.$$

Equivalently, the classifier can be expressed in terms of discriminant functions, and $X_j$ is assigned to class $\omega_k$ if

$$g_k(X_j) > g_i(X_j), \quad \text{for all } i \neq k,$$

where $g_i(X_j) = \ln P(X_j/\omega_i) + \ln P(\omega_i)$ is the discriminant function of $X_j$ under class $\omega_i$.
In order to develop the MAP classifier in the unimodal classification system shown in Figure 2, the conditional density function $P(X_j/\omega_k)$, $k = 1, 2, \ldots, K$, has to be derived systematically. As noted earlier, the noise vector $N_j$ is $G(0, \Psi_j)$ and $\alpha_j$ is set to 1 during the parameter estimation phase. Given that all processing steps are linear, it can be shown that the signals at the various stages and their corresponding probability density functions $P(\cdot/\omega_k)$, $k = 1, 2, \ldots, K$, are given by

$$\tilde{S}_{j/k} = \left(S_{j/k} + N_j\right) \sim G\!\left(S_{j/k}, \Psi_j\right),$$

$$\hat{S}_{j/k} = \Phi_j \tilde{S}_{j/k} \sim G\!\left(\Phi_j S_{j/k}, \Phi_j \Psi_j \Phi_j^{T}\right),$$

$$X_{j/k} = \left(m_{j/k}\hat{S}_{j/k} + c_{j/k}\right) \sim G\!\left(m_{j/k}\Phi_j S_{j/k} + c_{j/k},\; m_{j/k}^{2}\,\Phi_j \Psi_j \Phi_j^{T}\right).$$

Because the classifier outputs are the posterior probabilities $P(\omega_i/X_j)$, $i = 1, 2, \ldots, K$, the rule in Equation (1) can be used to assign a test vector $X_j$ to the class yielding the highest posterior probability.
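A minimal NumPy sketch of the Gaussian MAP classifier is shown below (illustrative only): the class means and covariances are estimated from training features, and a test vector is assigned using the discriminant $g_i(X_j) = \ln P(X_j/\omega_i) + \ln P(\omega_i)$. The small diagonal regularization term is an added assumption for numerical stability and is not part of the model above.

```python
import numpy as np

def fit_gaussian_map(X_train, y_train, n_classes, reg=1e-6):
    """Estimate class means, covariances, and (equal) priors for the MAP classifier."""
    means, covs = [], []
    for k in range(n_classes):
        Xk = X_train[y_train == k]
        means.append(Xk.mean(axis=0))
        covs.append(np.cov(Xk, rowvar=False) + reg * np.eye(Xk.shape[1]))
    priors = np.full(n_classes, 1.0 / n_classes)
    return np.array(means), np.array(covs), priors

def map_classify(X, means, covs, priors):
    """Assign each row of X to the class with the largest discriminant
    g_i(X) = ln P(X / omega_i) + ln P(omega_i)."""
    n_classes, d = means.shape
    g = np.empty((X.shape[0], n_classes))
    for i in range(n_classes):
        diff = X - means[i]
        _, logdet = np.linalg.slogdet(covs[i])
        maha = np.einsum('nd,dk,nk->n', diff, np.linalg.inv(covs[i]), diff)
        g[:, i] = -0.5 * (maha + logdet + d * np.log(2.0 * np.pi)) + np.log(priors[i])
    return np.argmax(g, axis=1)
```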
The next two sections describe the FI and DI multimodal classification models which are derived from the unimodal model. The initial processing steps for each modality in both models are identical to the processing steps of the unimodal model. For consistency, and to facilitate the descriptions, the implementations that follow assume that each model contains the same uncertainty parameters and feature generating matrices. Furthermore, it is assumed that the unimodal classifiers in the DI model belong to the same family of classifiers (neural network or MAP). However, it is important to note that uncertainty parameters and feature matrices can differ across the modalities and the DI classification system can contain a mixture of neural network, statistical, or other forms of unimodal classifiers.
During the training phase, it is assumed that the multimodal stimuli presented to the system are semantically congruent, have no attenuations, and are temporally aligned. The attenuation factor $\alpha_j$ can be varied during testing to evaluate the performance as a function of varying attenuation levels across modalities. For example, in cases where visual processing dominates auditory processing, the attenuation factor for the visual sensory signal can be set to 1 while the factor for the auditory sensory signal is set to a fraction that is proportional to the relative lack of effectiveness of the auditory signal. The attenuation levels can also be adjusted to simulate the effects of varying conditions, such as visual dominance during daylight and auditory dominance at night. The random noise can be statistically different across modalities, and the level of noise that is added to each stimulus modality can be adjusted to determine the effects of increasing noise in one or more modalities.

3. Feature-Integrating Multimodal Classification Model

The generalized feature-integrating model for the multimodal classification system is shown in Figure 3, in which the feature-integrating classifier is represented by FI. The system combines the normalized unimodal feature vectors through concatenation, and the resulting vector is the input to a multimodal classifier which determines the class of the multimodal stimulus presented to the system. This model can be regarded as a systems-level representation of a pure feedforward system, which is often used to simulate the activity of multisensory neurons in the superior colliculus [14,15]. For convenience, the symbol $\oplus$ will be used to represent the concatenation operation. Therefore, the input to the multimodal classifier is given by $X = \bigoplus_{j=1}^{J} X_j$, where $X$ is the column vector formed by concatenating the $J$ $D_j$-dimensional vectors $X_j$. The resulting vector $X$ will, therefore, have dimension $\sum_{j=1}^{J} D_j$. The output of the FI classifier, which is also the output of the multimodal classification system, is the $K$-dimensional vector $Y$.

3.1. Multimodal Feature-Integrating MLP Classifier

For this case, the network mapping can be expressed as $f: X^{D} \to Y^{K}$, where $D = \sum_{j=1}^{J} D_j$ and $Y$ is equal to $[P(\omega_1/X), P(\omega_2/X), \ldots, P(\omega_K/X)]$ because of the softmax activation functions in the output layer. A multimodal test vector $X$ can, therefore, be assigned to class $\omega_k$ if

$$P(\omega_k/X) > P(\omega_i/X), \quad \text{for all } i \neq k. \tag{2}$$
Note that the neurons in this classifier may be regarded as multisensory because the neurons in the first hidden layer receive and process features from every modality, and this information is propagated through the nodes in the subsequent layers.
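In code, the FI system amounts to concatenating the unimodal feature vectors before a single multimodal classifier. The brief sketch below uses synthetic stand-in features (the actual digit stimuli are not reproduced here), so the data, sizes, and network settings are all assumptions; only the concatenate-then-classify structure reflects the model.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n, D_a, D_v, K = 200, 64, 64, 10
y = rng.integers(0, K, n)
# Stand-ins for the normalized auditory and visual feature vectors X_1 and X_2
# (class-dependent random features, used only to make the sketch self-contained).
X_a = rng.normal(loc=y[:, None], scale=1.0, size=(n, D_a))
X_v = rng.normal(loc=y[:, None], scale=1.0, size=(n, D_v))

# FI model: X = X_1 (+) X_2, classified by one multimodal MLP.
X = np.concatenate([X_a, X_v], axis=1)
fi_mlp = MLPClassifier(hidden_layer_sizes=(32,), activation='logistic',
                       max_iter=2000).fit(X, y)
print("FI-MLP training accuracy:", fi_mlp.score(X, y))
```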

3.2. Multimodal Feature-Integrating MAP Classifier

Because the input $X$ to the multimodal classifier is a concatenation of Gaussian vectors, it will be assumed that $X$ is also Gaussian. Therefore, the probability density function of $X$ under class $\omega_k$ is given by

$$P(X/\omega_k) \sim G\!\left(\bigoplus_{j=1}^{J}\left(m_{j/k}\Phi_j S_{j/k} + c_{j/k}\right),\; \left[\Psi_{ij}\right]_{i,j=1}^{J}\right),$$

where $\Psi_{ij}$ is the cross-covariance matrix of the signals $X_i$ and $X_j$. Note that the resulting covariance matrix, with dimension $D \times D$, is a matrix of cross-covariance matrices. The rule in Equation (2) can be used to assign a test vector $X$ to the class yielding the maximum posterior probability.
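For the FI-MAP classifier, the same Gaussian machinery applies to the concatenated vector: estimating the covariance of the concatenated training vectors directly yields the $D \times D$ block matrix containing the cross-covariances $\Psi_{ij}$. The sketch below reuses fit_gaussian_map and map_classify from the unimodal MAP sketch above (so it is not standalone), and the function name and interface are assumptions.

```python
import numpy as np

def fi_map_classify(X_a_train, X_v_train, y_train, X_a_test, X_v_test, n_classes=10):
    # Concatenate modalities; the estimated covariance of the concatenated
    # vectors contains the cross-covariance blocks Psi_ij between modalities.
    X_train = np.concatenate([X_a_train, X_v_train], axis=1)
    X_test = np.concatenate([X_a_test, X_v_test], axis=1)
    means, covs, priors = fit_gaussian_map(X_train, y_train, n_classes)
    return map_classify(X_test, means, covs, priors)
```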

4. Decision-Integrating Multimodal Classification Model

The decision-integrating multimodal classification model is illustrated in Figure 4. In this 2-stage classification model, the unimodal stimuli are first classified independently, and the unimodal classification results are combined through concatenation. In the next stage, the concatenated result is classified by the multimodal classifier, represented by DI, in order to determine the class of the input multimodal signal. This model can be regarded as a systems-level representation of the case in which unimodal stimuli are processed separately in the primary cortical areas and combined subsequently in multisensory association areas [16,17].

4.1. Decision-Integrating MLP Classifier

In this case, the $J$ unimodal classifiers and the multimodal classifier are MLPs. The input to the multimodal classifier is the concatenation of the $K$ outputs (posterior probabilities) of the unimodal networks in the system. Therefore, the mapping for the multimodal MLP classifier can be expressed as $f: Z^{KJ} \to Y^{K}$, where $Z$ is the concatenated input vector with dimension $(K)(J)$ and $Y$ is the output with dimension $K$. That is,

$$Z = \left[P(\omega_1/Y_1), P(\omega_2/Y_1), \ldots, P(\omega_K/Y_1), P(\omega_1/Y_2), P(\omega_2/Y_2), \ldots, P(\omega_K/Y_2), \ldots, P(\omega_1/Y_J), P(\omega_2/Y_J), \ldots, P(\omega_K/Y_J)\right]$$

and $Y = \left[P(\omega_1/Z), P(\omega_2/Z), \ldots, P(\omega_K/Z)\right]$.

For this case, a test multimodal stimulus is assigned to class $\omega_k$, where $k$ is given by

$$P(\omega_k/Z) > P(\omega_i/Z), \quad \text{for all } i \neq k. \tag{3}$$
Note that the neurons in the unimodal classifiers are unisensory, whereas the neurons in the fusion classifier can be regarded as multisensory.
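A sketch of the two-stage DI-MLP system is given below (an illustration, not the authors' code): one MLP per modality produces a vector of posterior probabilities, the posterior vectors are concatenated into $Z$, and a second-stage MLP classifies $Z$. Training the second stage on posteriors computed from the same training set is a simplification assumed here, and the network settings are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def _new_mlp(n_hidden=32):
    return MLPClassifier(hidden_layer_sizes=(n_hidden,), activation='logistic',
                         max_iter=2000)

def train_di_mlp(X_a, X_v, y):
    """Stage 1: unimodal MLPs. Stage 2: MLP on the concatenated posteriors Z (dimension K*J)."""
    um_a, um_v = _new_mlp().fit(X_a, y), _new_mlp().fit(X_v, y)
    Z = np.concatenate([um_a.predict_proba(X_a), um_v.predict_proba(X_v)], axis=1)
    di = _new_mlp().fit(Z, y)
    return um_a, um_v, di

def di_mlp_classify(um_a, um_v, di, X_a, X_v):
    Z = np.concatenate([um_a.predict_proba(X_a), um_v.predict_proba(X_v)], axis=1)
    return di.predict(Z)   # maximum-posterior rule over P(omega_k / Z)
```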

4.2. Decision-Integrating MAP Classifier

In this classification system, the unimodal classifiers and the multimodal classifier are MAP classifiers. The outputs (independent decisions) $Y_j$, $j = 1, 2, \ldots, J$, of the $J$ unimodal MAP classifiers are the inputs to the second-stage discrete MAP classifier. The probability density function of $Y_j$ under $\omega_k$ can be written as

$$P(Y_j/\omega_k) = \left[P_j(\omega_1/\omega_k)\,\delta(Y_j - \omega_1) + P_j(\omega_2/\omega_k)\,\delta(Y_j - \omega_2) + \cdots + P_j(\omega_K/\omega_k)\,\delta(Y_j - \omega_K)\right],$$

where $P_j(\omega_i/\omega_k)$ is the probability that unimodal classifier $j$ decides $\omega_i$ when the true class is $\omega_k$ and

$$\delta(y - \omega) = \begin{cases} 1, & \text{if } y = \omega \\ 0, & \text{if } y \neq \omega. \end{cases}$$

If the input to the second classifier is $Z = (Y_1, Y_2, \ldots, Y_J)$, the probability density function of $Z$ can be written as

$$P(Z/\omega_k) = \prod_{j=1}^{J}\left[P_j(\omega_1/\omega_k)\,\delta(Y_j - \omega_1) + P_j(\omega_2/\omega_k)\,\delta(Y_j - \omega_2) + \cdots + P_j(\omega_K/\omega_k)\,\delta(Y_j - \omega_K)\right].$$

Note that $Z$ is a $J$-dimensional vector, and the posterior probability of the decision-integrating classifier is given by

$$P(\omega_i/Z) = \frac{P(Z/\omega_i)\,P(\omega_i)}{\sum_{i'=1}^{K} P(Z/\omega_{i'})\,P(\omega_{i'})}, \quad i = 1, 2, \ldots, K,$$

and the test multimodal stimulus is assigned to class $\omega_k$, where $k$ is given by the rule in Equation (3).
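The second-stage discrete MAP classifier can be sketched as follows (illustrative only): the confusion probabilities $P_j(\omega_i/\omega_k)$ are estimated from each unimodal classifier's decisions on the training set, and a test decision vector $Z = (Y_1, \ldots, Y_J)$ is assigned to the class that maximizes $P(Z/\omega_k)P(\omega_k)$. The small smoothing constant added to the estimated probabilities is an assumption introduced to avoid taking the logarithm of zero.

```python
import numpy as np

def fit_confusions(unimodal_decisions, y_train, n_classes, eps=1e-6):
    """conf[j, k, i] ~ P_j(omega_i / omega_k): probability that unimodal
    classifier j decides class i when the true class is k."""
    J = len(unimodal_decisions)
    conf = np.full((J, n_classes, n_classes), eps)
    for j, dec in enumerate(unimodal_decisions):
        for k in range(n_classes):
            idx = (y_train == k)
            counts = np.bincount(dec[idx], minlength=n_classes).astype(float)
            conf[j, k] += counts / max(idx.sum(), 1)
    return conf

def di_map_classify(conf, test_decisions, priors):
    """Second-stage MAP rule: P(Z / omega_k) = prod_j P_j(Y_j / omega_k)."""
    J, K, _ = conf.shape
    n = test_decisions[0].shape[0]
    log_post = np.tile(np.log(priors), (n, 1))
    for j in range(J):
        # conf[j][:, decisions] has shape (K, n); transpose to (n, K)
        log_post += np.log(conf[j][:, test_decisions[j]]).T
    return np.argmax(log_post, axis=1)
```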

5. Classification Experiments

Experiments were designed to demonstrate the application and evaluation of the unimodal and multimodal classifiers and to demonstrate the improvement in classification through multisensory integration. A ten-class problem involving the classification of visual and auditory signal representations of the digits 0, 1, …, 9 was considered; that is, $K = 10$ and $J = 2$. The visual representations were binary images generated in 64 × 64 arrays. The auditory signals were obtained from evolution.voxeo.com, with the adjustable sampling rate set to 8 kHz. After start-point and end-point segmentation, the average duration of the ten noise-free auditory signals was 4000 samples. The noise-free auditory and visual signals are shown in Figure 5. For each image, the rows were concatenated to yield a vector of dimension 4096. Using the ranking method described in [27], features were extracted from the input signal $S_{j/k}$ by selecting a subset of 64 DCT coefficients. The classes were assumed to have equal a priori probabilities, that is, $P(\omega_k) = 1/10$, $k = 1, 2, \ldots, 10$.
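For the visual stimuli, the feature extraction step can be sketched as follows (assumptions noted in the comments): each 64 × 64 image is flattened by concatenating its rows into a 4096-dimensional vector, the DCT is applied, and 64 coefficients are retained. The paper selects the 64 coefficients with the ranking strategy of [27]; ranking by variance across the stimulus set is used below only as a stand-in.

```python
import numpy as np
from scipy.fft import dct

def image_dct_features(images, n_features=64):
    """images: array of shape (n_images, 64, 64) of binary digit images."""
    flat = images.reshape(images.shape[0], -1)       # concatenate rows -> (n, 4096)
    coeffs = dct(flat, norm='ortho', axis=1)         # DCT of each flattened image
    order = np.argsort(coeffs.var(axis=0))[::-1]     # rank coefficients by variance (stand-in for [27])
    return coeffs[:, order[:n_features]]             # keep 64 ranked coefficients
```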
In order to demonstrate the principle of inverse effectiveness, “weakly effective” stimuli were generated by (a) increasing the additive noise by increasing the noise variance, and (b) increasing the attenuation by decreasing the attenuation factor. The multimodal stimuli in all training sets were semantically congruent and had no attenuations (i.e., $\alpha_j = 1$, $j = 1, 2$). The MLPs had three layers: an input layer, a hidden layer, and an output layer. The neural networks were trained systematically with increasingly noisy signal sets [28], and training was terminated when the cross-entropy fell below 0.001. For the MAP classifiers, the mean vectors and covariance matrices were estimated from the noisy training sets.
A total of eight classification systems were implemented: MLP and MAP implementations of the unimodal visual, unimodal auditory, and the two auditory-visual multimodal classifiers. The systems are represented by the following abbreviations: unimodal MLP for auditory stimuli (UM-MLP(A)), unimodal MLP for visual stimuli (UM-MLP(V)), unimodal MAP for auditory stimuli (UM-MAP(A)), unimodal MAP for visual stimuli (UM-MAP(V)), feature-integrating MLP (FI-MLP), feature-integrating MAP (FI-MAP), decision-integrating MLP (DI-MLP), and decision-integrating MAP (DI-MAP). The inputs to the multimodal classifiers during training and testing were semantically congruent auditory and visual stimuli.

5.1. Noise Generation

Noisy stimuli were generated by adding Gaussian noise, drawn from $G(0, \sigma_{j/k}^2)$, to the noise-free input stimuli. The effects of a given noise variance on the classification accuracy varied across the different classes within a modality; therefore, the average value of $\sigma_{j/k}^2$ across all the classes was used for the signals in modality $j$. Furthermore, the effects of additive noise with a given variance varied across the two modalities. Using validation sets, noise variances which yielded approximately the same unimodal classification accuracies across the two modalities were determined and paired (a sketch of this variance-pairing procedure is given after the experiment list below). The noise was, therefore, specified as a pair $(A, V)$, where $A$ and $V$ are the noise variances of the auditory and visual stimuli, respectively. Examples of four noisy auditory and visual signal pairs yielding similar MLP unimodal performances are shown in Figure 6. For brevity, only the noisy pairs for the digit 0 are shown; the effects of noise on the other digits are quite similar. This method of pairing noise variances was used to generate training and test sets for the unimodal and multimodal classifiers. A total of two hundred noisy auditory and visual stimuli were generated with varying levels of noise. The noisy stimuli were divided randomly into two equal-sized sets: one set was used to train the classifier systems, and the other was used to test them. For the MAP systems, training involved estimating the class mean vectors and class covariance matrices from the stimuli in the training set. For the MLP systems, training involved presenting the stimuli in the training set repeatedly until the convergence criterion was satisfied. The stimuli in the test set were classified, and the results were averaged to yield the classification accuracy at each noise-pair level. The following experiments were designed:
  • Set 1: Classification of noisy stimuli—These experiments were designed to evaluate the performance of the unimodal and multimodal classification systems with noisy stimuli. The noise levels in the test stimuli were varied while the attenuation factor was set to 1. The classification accuracies, as functions of paired noise variances, are shown in Figure 7 and Figure 8.
  • Set 2: Classification of noisy attenuated stimuli—These experiments were designed to evaluate the performance of the unimodal and multimodal classification systems with stimuli that were both noisy and attenuated. The experiments in Set 1 were repeated using α = 0.9 and α = 0.8 for both stimuli. The classification accuracies are shown in Figure 9, Figure 10, Figure 11 and Figure 12.
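The variance-pairing procedure referenced in Section 5.1 can be sketched as follows (an illustration; the callable interface, variance grids, and target levels are assumptions): for each target unimodal accuracy, the auditory and visual noise variances whose validation accuracies are closest to that target are paired to form $(A, V)$.

```python
import numpy as np

def pair_noise_variances(val_acc_audio, val_acc_visual, audio_vars, visual_vars, targets):
    """val_acc_audio / val_acc_visual: callables mapping a noise variance to the
    unimodal validation accuracy obtained at that variance (assumed available).
    Returns a list of (A, V) noise-variance pairs, one per target accuracy."""
    acc_a = np.array([val_acc_audio(v) for v in audio_vars])
    acc_v = np.array([val_acc_visual(v) for v in visual_vars])
    pairs = []
    for t in targets:
        A = audio_vars[int(np.argmin(np.abs(acc_a - t)))]
        V = visual_vars[int(np.argmin(np.abs(acc_v - t)))]
        pairs.append((A, V))
    return pairs
```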

5.2. Discussion of Results

Figure 7, Figure 8, Figure 9, Figure 10, Figure 11 and Figure 12 clearly show that the classification accuracies of the multimodal classification systems are higher than those of the unimodal systems. Furthermore, multiple-comparison ANOVA tests conducted on the results used to generate Figure 7 and Figure 8 confirmed that the differences between the multimodal and unimodal results were statistically significant (p < 0.05). Note that the noise variances used for testing the MAP classifiers are higher than those used for testing the MLP classifiers. It could, therefore, be concluded that, in these sets of experiments, the MAP classifiers perform better than the MLP classifiers. Furthermore, the FI-MAP classifiers perform better than the DI-MAP classifiers.
For this study, it is important to note that the differences between the unimodal and multimodal classification accuracies, as well as the trends, are more important than the actual classification accuracies. Recall that no attempt was made to select the best feature sets or the best classifiers. Therefore, if the goal is to improve the performance of a multimodal classifier, this can be achieved by selecting better feature sets and classifiers for each modality. Although the main focus was to demonstrate the inverse effectiveness principle through the inclusion of random noise and an amplitude attenuation factor during the testing phase, the spatial and temporal rules can also be demonstrated in a similar manner by including spatial-offset and temporal-offset attenuation factors, respectively, in the models during testing. For these two cases, the attenuation factors can be regarded as penalties which increase as the distances and time intervals between the stimuli increase. Another possibility for demonstrating the temporal rule is to impose a penalty by dropping a number of samples from a delayed stimulus vector, with the number of dropped samples proportional to the delay in the stimulus of modality $j$.
Finally, it could be argued that the improvement in performance of the multimodal classifiers over the unimodal classifiers should be expected because the additional congruent information available should facilitate classification. A similar argument can be made to explain the improvement in recognition in the brain due to the integration of congruent multimodal information. What is important to note, however, is that we have identified and solved key problems that need to be tackled in order to develop multimodal classifier models. These key problems include: (a) developing structures inspired by the two commonly accepted models for multisensory integration in the brain, (b) simulating the influence of stimuli and the effects of stimulus noise, (c) enabling the extraction of different stimulus representations, and (d) defining and combining information in different ways.

6. Conclusions

This paper focused on the development of feature-integrating and decision-integrating object classification models, both inspired by multisensory integration in the brain. The models are general in the sense that the classifiers are not pre-specified and the numbers of modalities and stimulus classes are variable. Furthermore, the models include the ability to add uncertainty to the stimuli through random noise and variable attenuation levels. Consequently, the models enable the simulation and evaluation of various combinations of multimodal stimuli and classifiers under varying uncertainty conditions. The results from several sets of experiments clearly demonstrated the improvement in classification through the integration of auditory and visual stimuli, and it was noted that the models offer the potential to improve performance by selecting more effective features and classifiers for each modality.

Author Contributions

Conceptualization, R.A., R.S.G. and L.G.; Methodology, R.A., R.S.G. and L.G.; Supervision, L.G.; Validation, R.A., R.S.G. and L.G.; Writing—Original draft, L.G., R.S.G. and R.A.; Writing—Review & editing, R.S.G., L.G. and R.A.

Funding

This research received no external funding.

Acknowledgments

The authors would like to thank the anonymous reviewers for their helpful comments and suggestions that greatly contributed to improving the paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Keeley, B.L. Making sense of the senses: Individuating modalities in humans and other animals. J. Philos. 2002, 99, 5–28. [Google Scholar] [CrossRef]
  2. Ernst, M.O.; Bülthoff, H.H. Merging the senses into a robust percept. Trends Cogn. Sci. 2004, 8, 162–169. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  3. Driver, J.; Noesselt, T. Multisensory interplay reveals crossmodal influences on ‘sensory-specific’ brain regions, neural responses, and judgments. Neuron 2008, 57, 11–23. [Google Scholar] [CrossRef] [PubMed]
  4. Stein, B.E. The New Handbook of Multisensory Processing; The MIT Press: Cambridge, MA, USA, 2012. [Google Scholar]
  5. Stein, B.E.; Stanford, T.R. Multisensory integration: Current issues from the perspective of the single neuron. Nature Rev. Neurosci. 2008, 9, 255. [Google Scholar] [CrossRef] [PubMed]
  6. Koelewijn, T.; Bronkhorst, A.; Theeuwes, J. Attention and the multiple stages of multisensory integration: A review of audiovisual studies. Acta Psychol. 2010, 134, 372–384. [Google Scholar] [CrossRef]
  7. Shams, L.; Kamitani, Y.; Shimojo, S. Illusions: What you see is what you hear. Nature 2000, 408, 788. [Google Scholar] [CrossRef] [PubMed]
  8. Vroomen, J.; Bertelson, P.; De Gelder, B. The ventriloquist effect does not depend on the direction of automatic visual attention. Percept. Psychophys. 2001, 63, 651–659. [Google Scholar] [CrossRef] [Green Version]
  9. Bertelson, P.; Vroomen, J.; De Gelder, B.; Driver, J. The ventriloquist effect does not depend on the direction of deliberate visual attention. Percept. Psychophys. 2000, 62, 321–332. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  10. Alais, D.; Burr, D. The ventriloquist effect results from near-optimal bimodal integration. Curr. Biol. 2004, 14, 257–262. [Google Scholar] [CrossRef]
  11. McGurk, H.; MacDonald, J. Hearing lips and seeing voices. Nature 1976, 264, 746. [Google Scholar] [CrossRef]
  12. Bertelson, P.; Vroomen, J.; De Gelder, B. Visual recalibration of auditory speech identification: A McGurk aftereffect. Psychol. Sci. 2003, 14, 592–597. [Google Scholar] [CrossRef] [PubMed]
  13. Meredith, M.A.; Stein, B.E. Interactions among converging sensory inputs in the superior colliculus. Science 1983, 221, 389–391. [Google Scholar] [CrossRef] [PubMed]
  14. Meredith, M.A.; Stein, B.E. Visual, auditory, and somatosensory convergence on cells in superior colliculus results in multisensory integration. J. Neurophysiol. 1986, 56, 640–662. [Google Scholar] [CrossRef] [PubMed]
  15. Wallace, M.T.; Meredith, M.A.; Stein, B.E. Multisensory integration in the superior colliculus of the alert cat. J. Neurophysiol. 1998, 80, 1006–1010. [Google Scholar] [CrossRef] [PubMed]
  16. Felleman, D.J.; Van Essen, D.C. Distributed hierarchical processing in the primate cerebral cortex. Cereb. Cortex 1991, 1, 1–47. [Google Scholar] [CrossRef]
  17. Ghazanfar, A.A.; Schroeder, C.E. Is neocortex essentially multisensory? Trends Cogn. Sci. 2006, 10, 278–285. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  18. Ursino, M.; Cuppini, C.; Magosso, E.; Serino, A.; Di Pellegrino, G. Multisensory integration in the superior colliculus: a neural network model. J. Comput. Neurosci. 2009, 26, 55–73. [Google Scholar] [CrossRef]
  19. Holmes, N.P. The principle of inverse effectiveness in multisensory integration: Some statistical considerations. Brain Topogr. 2009, 21, 168–176. [Google Scholar] [CrossRef]
  20. Anastasio, T.J.; Patton, P.E.; Belkacem-Boussaid, K. Using Bayes’ rule to model multisensory enhancement in the superior colliculus. Neural Comput. 2000, 12, 1165–1187. [Google Scholar] [CrossRef]
  21. Stein, B.E.; Meredith, M.A. The Merging of the Senses; The MIT Press: Cambridge, MA, USA, 1993. [Google Scholar]
  22. Amerineni, R.; Gupta, L.; Gupta, R.S. Classification models inspired by multisensory integration. In Proceedings of the 2018 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI), Las Vegas, NV, USA, 4–7 March 2018; pp. 255–258. [Google Scholar] [CrossRef]
  23. Gupta, L.; Chung, B.; Srinath, M.D.; Molfese, D.L.; Kook, H. Multichannel fusion models for the parametric classification of differential brain activity. IEEE Trans. Biomed. Eng. 2005, 52, 1869–1881. [Google Scholar] [CrossRef]
  24. Gupta, L.; Kota, S.; Molfese, D.L.; Vaidyanathan, R. Pairwise diversity ranking of polychotomous features for ensemble physiological signal classifiers. Proc. Inst. Mech. Eng. Part H J. Eng. Med. 2013, 227, 655–662. [Google Scholar] [CrossRef] [PubMed]
  25. Polikar, R. Ensemble learning. In Ensemble Machine Learning; Zhang, C., Ma, Y.Q., Eds.; Springer: New York, NY, USA, 2012; pp. 1–34. [Google Scholar] [CrossRef]
  26. Kuncheva, L.I. Combining Pattern Classifiers: Methods and Algorithms, 1st ed.; John Wiley & Sons: Hoboken, NJ, USA, 2004. [Google Scholar]
  27. Gupta, L.; Kota, S.; Murali, S.; Molfese, D.L.; Vaidyanathan, R. A feature ranking strategy to facilitate multivariate signal classification. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 2010, 40, 98–108. [Google Scholar] [CrossRef]
  28. Gupta, L.; Sayeh, M.R.; Tammana, R. A neural network approach to robust shape classification. Pattern Recogn. 1990, 23, 563–568. [Google Scholar] [CrossRef]
Figure 1. The senses involved in deciding whether a cantaloupe is ripe.
Figure 2. Block diagram of the unimodal classification system for modality j.
Figure 3. Block diagram of the feature-integrating multimodal classification system.
Figure 4. Block diagram of the decision-integrating multimodal classification system.
Figure 5. The noise-free auditory and visual stimuli used in the experiments.
Figure 6. Examples of pairs of noisy auditory and visual representations of the digit 0.
Figure 7. Classification accuracy, as a function of noise, for the MLP-based classification systems.
Figure 8. Classification accuracy, as a function of noise, for the MAP-based classification systems.
Figure 9. Classification accuracy, as a function of noise, and α = 0.9 in both stimuli, for the MLP-based classification systems.
Figure 10. Classification accuracy, as a function of noise, and α = 0.8 in both stimuli, for the MLP-based classification systems.
Figure 11. Classification accuracy, as a function of noise, and α = 0.9 in both stimuli, for the MAP-based classification systems.
Figure 12. Classification accuracy, as a function of noise, and α = 0.8 in both stimuli, for the MAP-based classification systems.
