1. Introduction
Human perception and cognition of the external world are essentially processes through which the brain integrates multimodal sensory information, such as visual and auditory inputs. Because vision and audition are the two primary sensory channels for acquiring information, the mechanisms underlying the cerebral processing of audiovisual information are not only a core research direction in cognitive neuroscience but also critically important for the development of brain–computer interfaces (BCIs), the diagnosis of neurological disorders (such as assessing audiovisual integration impairments in individuals with autism or schizophrenia), and the advancement of human–computer interaction technologies [1, 2, 3]. BCI systems based on the recognition of audiovisual brain activities can help individuals with motor impairments control external devices (e.g., wheelchairs, prosthetics) through “mental commands”; accurate classification of audiovisual brain activities is a prerequisite for efficient interaction in such systems [4, 5]. Therefore, research on the classification of audiovisual brain activities not only deepens the understanding of the brain’s multimodal information integration mechanisms but also promotes the practical application of related technologies in fields such as healthcare and rehabilitation, giving it both theoretical value and practical importance.
To achieve the precise recognition of audiovisual brain activities, previous studies have extensively explored classification methods based on EEG signals. Studies on various audiovisual stimulus types and cognitive paradigms have achieved high accuracy rates. For instance, in research classifying visual and auditory stress responses, an automated evoked potential differentiation system achieved a binary stress classification accuracy of 97.14% for visual stimuli and 94.51% for auditory stimuli. In a more complex four-level stress classification task, accuracies reached 89.59% for visual and 82.63% for auditory stimuli [
6]. Another study employed the Sparse Optimal Scoring (SOS) method to analyze EEG data, achieving efficient classification in visual language comprehension tasks with an average out-of-sample accuracy as high as 98.89% [
7], fully demonstrating the potential of EEG signals in classifying audiovisual brain activities. With advancements in machine learning and deep learning technologies, models for audiovisual EEG classification have been continuously optimized. Convolutional Neural Networks (CNNs), leveraging their ability to extract local temporal features, became a mainstream tool for early EEG classification. Recurrent neural networks (RNNs) and their variants, such as Long Short-Term Memory (LSTM) networks, further improved classification performance under dynamic audiovisual stimuli by capturing long-term dependencies in the signals [
8]. In recent years, the introduction of hybrid models (e.g., CNN-LSTM, Graph Convolutional Networks (GCNs)) and Transformer models has effectively enhanced classification robustness in complex audiovisual tasks by integrating spatio-temporal features and global attention mechanisms [
9,
10]. Furthermore, innovations in multimodal guiding strategies have provided new pathways for improving classification performance. One study proposed a dynamic visual–auditory cooperative guidance method. The experimental results showed that this method significantly enhanced cortical activity in fine motor imagery-related audiovisual tasks, leading to broader brain region activation and stronger Event-Related Synchronization (ERS) and Event-Related Desynchronization (ERD) effects in key areas such as the frontal and parietal lobes, providing clearer neural response markers for classification [
11]. However, current EEG-based research on audiovisual brain activity classification predominantly focuses on the temporal dimension or single statistical features of EEG signals, often neglecting the in-depth utilization of the spatial distribution characteristics and frequency-specific features inherent in audiovisual brain activities. From a neurophysiological perspective, audiovisual information processing involves the collaboration of multiple brain regions. Visual stimuli primarily activate the primary visual cortex in the occipital lobe, while auditory stimuli preferentially activate the primary auditory cortex in the temporal lobe. The integration of both requires the participation of association areas such as the prefrontal and parietal lobes [
12]. This spatial difference in brain region activation is a key distinguishing feature of audiovisual brain activities. Yet, many methods treat EEG electrode signals as independent temporal sequences, failing to fully exploit the spatial correlation information between electrodes. Simultaneously, EEG signals of different frequencies correspond to distinct brain functional states. For example, theta waves (4–7 Hz) are associated with cognitive control, alpha waves (8–13 Hz) with visual cortex inhibition, and beta waves (14–30 Hz) with motor preparation [
13]. Audiovisual tasks elicit power changes in specific frequency bands. However, many studies often use full-band signals or analyze a single frequency band, failing to accurately capture the frequency-specific responses under different audiovisual tasks. This insufficient exploitation of spatial and frequency features makes it difficult for existing models to fully characterize the differences in neural mechanisms underlying audiovisual brain activities, thereby limiting an in-depth analysis of the brain’s multimodal information integration mechanisms.
EEG microstate and brain network analysis methods offer effective pathways to address the aforementioned limitations. EEG microstates, as sub-second stable scalp field topography patterns, can intuitively reflect the global spatial distribution characteristics of EEG activity. Parameters such as the topology, duration, and transition probabilities of microstates can precisely characterize the spatial activation differences in the cerebral cortex under audiovisual tasks [
14]. On the other hand, brain network analysis, particularly that of functional brain networks, quantifies the spatial synergistic relationships between different brain regions (electrodes) by calculating functional connectivity strength (such as phase synchronization, correlation). It can also construct frequency-specific brain networks by combining different frequency bands, thereby simultaneously characterizing the spatial correlation features and frequency-specific characteristics of audiovisual brain activities [
15]. However, the temporal, spatial, and frequency features of brain information processing are not independent; they are closely interrelated. Neuronal synchronous oscillations trigger the activation of distributed brain functional networks (frequency-space). These synchronous oscillations continuously change with cognitive task processing (time–frequency), which in turn affects the activation of brain functional networks and causes dynamic changes in network structure (space-time). The temporal, spatial, and frequency characteristics of audiovisual integration influence and constrain each other. Their intrinsic interrelationships form the foundation and core of the brain’s processing of audiovisual information. Therefore, establishing a mapping relationship that reflects the spatio-temporal–frequency characteristics of audiovisual brain activities and exploring the intrinsic one-to-one correspondence among temporal, spatial, and frequency features are essential means to fundamentally reveal the brain mechanisms of audiovisual information processing and to classify audiovisual brain activities. EEG microstates and brain networks are not independent analytical dimensions. Essentially, they represent manifestations of the brain’s information processing at different observational scales and share a close intrinsic relationship: on one hand, changes in the topological patterns of EEG microstates are macroscopic reflections of the dynamic adjustments in the activation and connection strengths of nodes (brain regions) within the functional brain network; on the other hand, the dynamic evolution of brain networks also depends on the temporal characteristics of neuronal synchronous oscillations, and the frequency of these oscillations further influences the transition rate and duration of microstates [
16]. This relationship fundamentally stems from the inseparability of the spatio-temporal–frequency characteristics in brain information processing. The spatio-temporal–frequency features of audiovisual information processing form an organic whole that mutually influences and constrains each other. Their intrinsic interrelationship constitutes the core mechanism by which the brain accomplishes multimodal information processing.
In this study, we use the microstates of the brain during the processing of visual, auditory, and audiovisual information as time windows to establish dynamic brain networks symmetrical to the microstates, and we represent brain activity by constructing spatio-temporal–frequency feature vectors. Using these feature vectors, our proposed Adaptive Tensor Fusion Network (ATFN) accurately identifies the brain activities involved in visual, auditory, and audiovisual information processing. The main contributions of this study can be summarized as follows:
- (1) We proposed a method for representing the brain activity involved in auditory, visual, and audiovisual information processing based on spatio-temporal–frequency feature vectors. We computed EEG microstates across multiple frequency bands and constructed brain functional connectivity networks by dividing the microstate sequences into time windows. By calculating microstate parameters, the topological properties of the corresponding brain networks, and their inter-correlations, we characterized the brain activities during audiovisual information processing as symmetric spatio-temporal–frequency feature association vectors.
- (2) We constructed an Adaptive Tensor Fusion Network that uses symmetric spatio-temporal–frequency association vectors to recognize the brain activities involved in audiovisual information processing. The feature fusion and selection module, based on differential feature enhancement, performs standardization, differential enhancement, and dynamic weight assessment on multi-band microstate features to automatically screen key features. The attention-enhanced feature encoding module combines 1D convolution, a bidirectional GRU, and multi-head self-attention to enhance the discriminative representation of the symmetric spatio-temporal–frequency features. Finally, a classifier based on a multilayer perceptron (MLP) performs pooling and non-linear mapping on the fused features, and classification decisions are optimized using a focal loss function to achieve efficient recognition of audiovisual brain activities.
3. Results and Analyses
3.1. Spatio-Temporal–Frequency Feature Association Vector for Audiovisual Processing
3.1.1. Microstate Analysis Results
We applied EEG microstate analysis to cluster the visual, auditory, and audiovisual EEG data and obtain the brain microstates corresponding to the three types of information processing across multiple frequency bands. For auditory information processing, the microstates in the EEG, delta_EEG, theta_EEG, alpha_EEG, and beta_EEG frequency bands were clustered into 13, 7, 8, 7, and 5 classes, respectively. For visual information processing, the microstates in the same frequency bands were clustered into 11, 6, 8, 8, and 6 classes, respectively. For audiovisual information processing, the microstates were clustered into 6, 6, 8, 4, and 11 classes, respectively. For ease of comparison, Figure 4 displays the top six classes of microstates for A, V, and AV information processing in the 1–30 Hz frequency band. Figure 4 shows that the spatial patterns of MS1–MS3 differ among A, V, and AV. The spatial distributions of MS4 for AV and A are similar, but the AV distribution is more concentrated in the central area, with a clear intensity difference between the central and surrounding areas, whereas MS4 for A shows a smaller intensity difference between these areas. The spatial distribution of MS6 for AV shows a “left upper to right lower” pattern. Although MS5 for both A and V presents a “frontal” distribution, MS5 for A is close to the classic resting-state EEG “Microstate D” [25, 26, 27], while MS5 for V has a different distribution. The clustering results indicate that the topological structures of the microstates corresponding to A, V, and AV are different, suggesting that the neural activities induced by different types of sensory information exhibit notable distinctions. Effectively characterizing and capturing the spatial features of microstates across multiple frequency bands can therefore help identify different brain activity states.
3.1.2. Brain Network Construction Results
Based on the previously obtained EEG microstate time series for visual, auditory, and audiovisual conditions through the backfitting procedure, this study further constructed corresponding brain functional networks for each microstate-categorized segment of EEG signals. Due to variations in the duration of microstate segments, a dynamic time window segmentation method was employed. The Phase Lag Index (PLI) was calculated for microstate segments under visual, auditory, and audiovisual stimuli to quantify phase synchronization between different brain regions.
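The dynamic windowing and PLI computation described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`microstate_segments`, `analytic_signal`, `phase_lag_index`) are hypothetical, the instantaneous phase is assumed to come from an FFT-based analytic signal, and the input is assumed to be already band-pass filtered.

```python
import numpy as np

def microstate_segments(labels):
    """Split a backfitted microstate label sequence into (start, stop, label) runs."""
    labels = np.asarray(labels)
    # Indices where the label changes mark the segment boundaries.
    change = np.flatnonzero(labels[1:] != labels[:-1]) + 1
    starts = np.concatenate(([0], change))
    stops = np.concatenate((change, [labels.size]))
    return [(int(s), int(e), int(labels[s])) for s, e in zip(starts, stops)]

def analytic_signal(x):
    """FFT-based analytic signal along the last axis (Hilbert transform method)."""
    n = x.shape[-1]
    spectrum = np.fft.fft(x, axis=-1)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * h, axis=-1)

def phase_lag_index(segment):
    """PLI matrix for one EEG segment of shape (n_channels, n_samples)."""
    phase = np.angle(analytic_signal(segment))
    # Pairwise phase differences, shape (n_channels, n_channels, n_samples).
    diff = phase[:, None, :] - phase[None, :, :]
    pli = np.abs(np.mean(np.sign(np.sin(diff)), axis=-1))
    np.fill_diagonal(pli, 0.0)
    return pli
```

Under these assumptions, each `(start, stop, label)` run defines one variable-length window, and `phase_lag_index(eeg[:, start:stop])` yields the connectivity matrix paired with that microstate. Very short segments give noisy phase estimates, so a minimum segment length would likely be needed in practice.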
Finally, this study constructed brain network sequences corresponding one-to-one with microstates for visual, auditory, and audiovisual information processing in the 1–30 Hz frequency band, as well as in the delta, theta, alpha, and beta frequency bands. Each functional connectivity network reflects the brain activity pattern of its corresponding microstate.
Figure 5 displays the PLI matrices corresponding to the six microstates shown in Figure 4, within the 1–30 Hz frequency band.
To reveal the potential mechanisms of neural processing for different sensory information, this study further calculated the topological properties of brain functional networks. We focused on nine types of topological features closely related to global integration and local segregation functions: global efficiency, local efficiency, average clustering coefficient, average shortest path length, small-worldness, average degree centrality, average betweenness centrality, average closeness centrality, and eigenvector centrality. Through a quantitative analysis of these multi-dimensional metrics, we can comprehensively characterize the information transmission efficiency of the brain at the network level, the distribution of hub nodes, and the integration and differentiation levels of the overall network, providing insights into cognitive processing differences under different sensory modalities. The calculation results of brain network topological features are shown in Table 1, which displays the topological feature values of the first six brain networks for A, V, and AV information processing in the 1–30 Hz frequency band.
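As an illustration of how such topological features can be derived from a PLI matrix, the following sketch computes four of the nine metrics with NumPy. It rests on assumptions not stated in the text: the PLI matrix is binarized at a hypothetical threshold of 0.3, the graph is treated as undirected and unweighted, and the function names are illustrative only.

```python
import numpy as np

def shortest_paths(adj):
    """All-pairs shortest path lengths (Floyd-Warshall) on a binary adjacency matrix."""
    n = adj.shape[0]
    dist = np.where(adj > 0, 1.0, np.inf)
    np.fill_diagonal(dist, 0.0)
    for k in range(n):
        dist = np.minimum(dist, dist[:, [k]] + dist[[k], :])
    return dist

def network_features(pli, threshold=0.3):
    """Four topological features of the thresholded PLI graph (threshold is assumed)."""
    adj = (pli > threshold).astype(float)
    np.fill_diagonal(adj, 0.0)
    n = adj.shape[0]
    dist = shortest_paths(adj)
    off_diag = ~np.eye(n, dtype=bool)
    # Global efficiency: mean inverse shortest path length (unreachable pairs add 0).
    global_eff = (1.0 / dist[off_diag]).mean()
    degree = adj.sum(axis=1)
    # Clustering: triangles through each node divided by possible neighbour pairs.
    triangles = np.diag(adj @ adj @ adj) / 2.0
    pairs = degree * (degree - 1) / 2.0
    clustering = np.divide(triangles, pairs, out=np.zeros(n), where=pairs > 0)
    # Average shortest path length over reachable node pairs only.
    finite = np.isfinite(dist) & off_diag
    avg_path = float(dist[finite].mean()) if finite.any() else float("inf")
    return {
        "global_efficiency": float(global_eff),
        "avg_clustering": float(clustering.mean()),
        "avg_degree_centrality": float(degree.mean() / (n - 1)),
        "avg_shortest_path": avg_path,
    }
```

The remaining metrics (local efficiency, small-worldness, and betweenness, closeness, and eigenvector centrality) follow the same pattern but are more compactly obtained from a dedicated graph library.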
3.1.3. Spatio-Temporal–Frequency Feature Association Vector Results
Neural oscillations in different frequency bands are closely related to the cognitive states of the brain, and each band carries distinct “meanings” and “functions” [28, 29, 30]. We analyzed the correlations between spatio-temporal–frequency features by calculating Pearson coefficients among the microstate attributes (duration, coverage, occurrence) and brain network topological features (global efficiency, average clustering, average betweenness, average closeness, local efficiency, average degree centrality, average eigenvector centrality, average shortest path length, small-worldness) of visual, auditory, and audiovisual EEG across the 1–30 Hz, delta, theta, alpha, and beta frequency bands. Figure 6 displays the correlation analysis results between microstates and brain networks for the six microstates in the 1–30 Hz frequency band. Brain network topological properties exhibit strong intrinsic synergies among themselves, while microstate parameters demonstrate similar internal correlations. Under the same conditions, significant differences are observed in the correlations between microstate parameters and brain network topology for the three types of stimuli (A, V, and AV). For example, the correlation matrix of brain activities corresponding to MS1 shows that the correlation between the brain network parameters and the microstate parameters is relatively low for AV, while it is relatively high for the unimodal A and V brain activities. In particular, the correlation for the brain network parameters of A and V brain activities is close to 1. This result indicates that the brain network topologies of unimodal and bimodal brain activities differ, reflecting different neural mechanisms of information processing. Similarly, for MS2 and MS3, AV stimuli generally show lower correlation levels than unimodal A and V stimuli, reflecting distinct brain activities during the processing of different types of stimuli.
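The correlation analysis can be sketched as a single Pearson correlation matrix over the joint set of three microstate attributes and nine network features. The sketch below assumes that each row of the input is one observation (for example, one microstate segment); the text does not state the unit of observation, and the function name `feature_correlation` is hypothetical.

```python
import numpy as np

def feature_correlation(microstate_params, network_feats):
    """Pearson correlation matrix over the joint feature set.

    microstate_params: array of shape (n_observations, 3)
    network_feats:     array of shape (n_observations, 9)
    Returns a (12, 12) matrix; rows/columns 0-2 are microstate attributes.
    """
    joint = np.hstack([microstate_params, network_feats])
    # rowvar=False treats each column as one variable.
    return np.corrcoef(joint, rowvar=False)
```

In the returned matrix, the upper-left 3 × 3 block holds correlations among microstate attributes, the lower-right 9 × 9 block holds correlations among network features, and the off-diagonal blocks hold the cross-correlations between the two groups.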
We further compared the microstate parameters and brain network topological attributes under the three modal stimuli, with the results shown in Figure 7. The auditory modality exhibited higher centrality, clustering coefficient, and efficiency, while the visual modality showed higher values in the average shortest path length and duration. The majority of features in the audiovisual modality were distributed between those of A and V. As shown in Figure 7, the duration, coverage, and occurrence for A, V, and AV also present different distributions. The comparison results reflect differences in brain activity during the processing of the three modalities, and these differences can be captured by microstate parameters and brain network topological attributes.
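For reference, the three microstate parameters compared above can be computed from a backfitted label sequence as in the following sketch. It uses the conventional definitions (mean segment duration, fraction of time covered, and number of segments per second); the function name and the dictionary layout are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def microstate_parameters(labels, n_states, sfreq):
    """Duration (ms), coverage (fraction), and occurrence (per second) per class."""
    labels = np.asarray(labels)
    # Segment boundaries are the samples where the label changes.
    change = np.flatnonzero(labels[1:] != labels[:-1]) + 1
    starts = np.concatenate(([0], change))
    lengths = np.diff(np.concatenate((starts, [labels.size])))
    seg_labels = labels[starts]
    total_sec = labels.size / sfreq
    params = {}
    for k in range(n_states):
        seg = lengths[seg_labels == k]
        params[k] = {
            "duration_ms": 1000.0 * seg.mean() / sfreq if seg.size else 0.0,
            "coverage": seg.sum() / labels.size,
            "occurrence_per_s": seg.size / total_sec,
        }
    return params
```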
Based on the above results, to comprehensively characterize the spatio-temporal characteristics of EEG activity across multiple frequency bands, this study constructed an integrated vector that fuses multi-dimensional spatio-temporal–frequency features and their intrinsic associations. This vector integrates three types of microstate features and nine types of brain network topological features across five frequency bands, forming a base feature layer. Additionally, two cross-dimensional correlation metrics—frequency band specificity index and microstate class importance weight—were introduced to characterize the prominence of features in the frequency band distribution and the contribution weight of different microstate categories to the overall features, respectively.
Specifically, the microstate features and brain network features from each frequency band are first used to construct a spatio-temporal plane. The feature planes from the five frequency bands are then integrated into a three-dimensional tensor structure. As shown in Figure 8, the resulting irregular vector not only encapsulates the original features but also explicitly incorporates the correlation information among these features across different dimensions. This vector serves as a high-dimensional representation with clear neurophysiological interpretability. In subsequent recognition models, an adaptive weighting mechanism is introduced to optimize the feature fusion process, thereby effectively enhancing the classification performance for attention states or cognitive tasks.
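One possible way to assemble the per-band feature planes into a three-dimensional tensor is sketched below. Because the number of microstate classes differs between bands, the sketch zero-pads each plane to the largest class count and returns a validity mask; this padding scheme, the band keys, and the function name are assumptions made for illustration. The two cross-dimensional metrics (frequency band specificity index and microstate class importance weight) are omitted because their formulas are not given in this section.

```python
import numpy as np

BANDS = ["eeg", "delta", "theta", "alpha", "beta"]

def build_feature_tensor(planes):
    """Stack per-band planes of shape (n_classes_b, n_features) into a padded tensor.

    Returns a tensor of shape (n_bands, max_classes, n_features) and a boolean
    mask of shape (n_bands, max_classes) marking the rows that hold real data.
    """
    n_features = planes[BANDS[0]].shape[1]
    max_classes = max(planes[b].shape[0] for b in BANDS)
    tensor = np.zeros((len(BANDS), max_classes, n_features))
    mask = np.zeros((len(BANDS), max_classes), dtype=bool)
    for i, band in enumerate(BANDS):
        k = planes[band].shape[0]
        tensor[i, :k] = planes[band]
        mask[i, :k] = True
    return tensor, mask
```

With the audiovisual class counts reported earlier (6, 6, 8, 4, and 11) and 12 features per class, this produces a 5 × 11 × 12 tensor in which 35 rows are valid.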
3.2. Audiovisual Brain Activity Recognition Results
This experiment was conducted on a computer equipped with an NVIDIA GeForce RTX 5060 GPU, an Intel Core i9-14900HX processor, and 16 GB of RAM, running a 64-bit Windows 11 operating system. The software environment included Python 3.9.13 and MATLAB R2022a, which were used for data preprocessing, feature extraction, model training, and validation.
Model training employed five-fold stratified cross-validation, in which the proportion of samples from each class in every fold remained consistent with the overall distribution, supporting the robustness and reliability of the evaluation results. During training, the Adam optimizer was used with an initial learning rate of 0.0005 and a weight decay of 0.00001. Training ran for 100 epochs, and the model with the best validation performance across epochs was saved. To address class imbalance, focal loss was adopted as the loss function, with a focusing parameter γ = 2.0 and class weights dynamically adjusted based on the class distribution of the training data. The input data consisted of spatio-temporal–frequency feature association vectors extracted from each frequency band (EEG, delta, theta, alpha, and beta). The data dimensions were kept consistent across all frequency bands to ensure balanced input data.
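The class-weighted focal loss with γ = 2.0 can be written as FL = −α_c (1 − p_t)^γ log p_t, where p_t is the predicted probability of the true class and α_c is the weight of that class. The NumPy sketch below illustrates this form; the inverse-frequency weighting in `class_weights` is an assumed scheme, since the text only states that weights follow the training class distribution, and both function names are hypothetical.

```python
import numpy as np

def class_weights(labels, n_classes):
    """Balanced inverse-frequency weights (assumed scheme); 1.0 for balanced classes."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    return counts.sum() / (n_classes * np.maximum(counts, 1.0))

def focal_loss(logits, labels, alpha, gamma=2.0):
    """Mean focal loss for logits of shape (n_samples, n_classes)."""
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    log_pt = log_probs[np.arange(labels.size), labels]
    pt = np.exp(log_pt)
    # Well-classified samples (pt near 1) are down-weighted by (1 - pt) ** gamma.
    return float(np.mean(-alpha[labels] * (1.0 - pt) ** gamma * log_pt))
```

Setting `gamma=0` with unit weights reduces the expression to ordinary cross-entropy, which makes the effect of the focusing parameter easy to check in isolation.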
Model performance was evaluated using four metrics: accuracy (Acc), precision (Pre), recall (Rec), and F1-score (F1). All metrics are reported as macro-averages on the test set to ensure a balanced evaluation of the multi-class classification task.
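Macro-averaging computes each metric per class and then takes the unweighted mean over classes, so that each of the three classes contributes equally. A minimal sketch of this computation from a confusion matrix follows; the function name `macro_metrics` is illustrative.

```python
import numpy as np

def macro_metrics(y_true, y_pred, n_classes):
    """Accuracy plus macro-averaged precision, recall, and F1-score."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    # Rows index the true class, columns the predicted class.
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    tp = np.diag(cm).astype(float)
    pred_total = cm.sum(axis=0)
    true_total = cm.sum(axis=1)
    zeros = lambda: np.zeros(n_classes)
    precision = np.divide(tp, pred_total, out=zeros(), where=pred_total > 0)
    recall = np.divide(tp, true_total, out=zeros(), where=true_total > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=zeros(), where=denom > 0)
    return {
        "acc": float(tp.sum() / cm.sum()),
        "pre": float(precision.mean()),
        "rec": float(recall.mean()),
        "f1": float(f1.mean()),
    }
```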
3.2.1. Audiovisual Brain Activity Recognition Based on ATFN
During the model training process, this study evaluated five distinct scenarios utilizing data from different frequency bands and recorded the changes in the loss function and accuracy with respect to the number of training epochs.
Figure 9 shows the loss curves for training and validation across all folds.
Figure 10 shows the average loss curves for training and validation.
Figure 11 shows the average accuracy curves for training and validation.
The validation results of the model’s classification performance are displayed in the ROC curves and confusion matrices, as shown in Figure 12 and Figure 13. ROC curve analysis indicates that as data from more frequency bands are utilized, the model’s classification performance for visual, auditory, and audiovisual brain activities gradually improves.
To further verify the cross-subject effectiveness of the proposed method, we applied leave-one-subject-out (LOSO) cross-validation. The cross-subject average Acc, Pre, Rec, and F1 were 95.67%, 96.32%, 95.67%, and 95.97%, with standard errors of the mean (SEM) of 0.023, 0.051, 0.024, and 0.048, respectively. These results indicate that the method is relatively stable under cross-subject conditions.
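The LOSO protocol and the SEM summary can be sketched as follows. Each subject is held out once as the test set while the model is trained on all remaining subjects, and the SEM is taken as the sample standard deviation of the per-subject scores divided by the square root of the number of subjects. The function names are hypothetical, and the training step itself is left out.

```python
import numpy as np

def loso_splits(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject."""
    subject_ids = np.asarray(subject_ids)
    for subject in np.unique(subject_ids):
        test_idx = np.flatnonzero(subject_ids == subject)
        train_idx = np.flatnonzero(subject_ids != subject)
        yield subject, train_idx, test_idx

def sem(values):
    """Standard error of the mean, using the sample standard deviation (ddof=1)."""
    values = np.asarray(values, dtype=float)
    return float(values.std(ddof=1) / np.sqrt(values.size))
```

In use, one score per held-out subject would be collected inside the loop over `loso_splits(...)`, and `sem` would then be applied to that list of scores for each metric.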
3.2.2. Ablation Study
To investigate the contribution of spatio-temporal–frequency feature associations to model performance, this study designed a set of comparative ablation experiments evaluating the classification effectiveness when correlation features were excluded (i.e., only original multi-band features were used without incorporating feature correlation analysis and microstate class importance weighting). The experimental results are shown in Table 2 and Table 3 below. Under two conditions (with and without feature correlations), we sequentially added features from the theta, alpha, delta, and beta frequency bands to the feature set and recorded the model’s performance on the test set.
The experimental results indicate that when feature correlations are not incorporated, model performance improves gradually as the number of included frequency bands increases. When only the theta band is included, accuracy (Acc) is 0.5202, and the F1-score is 0.5181. With the sequential addition of the alpha, delta, and beta bands, model performance improves step by step, reaching its peak when all bands are included (Acc = 0.7234; F1 = 0.7201). After incorporating feature correlations, the overall model performance improves significantly, particularly when multiple bands are used together. With only the theta band, Acc is 0.6061, and F1 is 0.6016. When all frequency bands are included and their correlation information is fused, the model achieves its highest performance (Acc = 0.9697; F1 = 0.9696), significantly outperforming the results without feature correlations. These findings demonstrate that the fusion of multi-band features and the inclusion of their correlations play a crucial role in enhancing the discriminative ability of the brain network classification model. They also further confirm that microstate and brain network features across various frequency bands contribute substantially to the recognition of audiovisual brain activities.
To evaluate the contribution of each core module in the proposed model, this study conducted systematic module ablation experiments. The results are shown in Table 4.
The experimental results indicate that model performance improved as modules were progressively added. With only the baseline structure (i.e., without any enhancement modules), the model achieved an accuracy of 0.8765 and an F1-score of 0.8758, showing that the basic architecture already possessed a certain level of classification capability. After the feature dynamic weighting mechanism was introduced, the accuracy increased to 0.9124, suggesting that this module identifies and enhances discriminative features while suppressing noise or redundant information. With the further addition of the feature encoding module, the accuracy reached 0.9452, reflecting the module’s role in extracting and fusing multi-dimensional spatio-temporal features and in enhancing the model’s ability to distinguish temporal and spatial patterns in EEG signals. Finally, incorporating the self-attention mechanism brought the model to its peak performance (accuracy: 0.9697; F1-score: 0.9696), indicating its effectiveness in capturing long-range dependencies and further refining feature representation and interaction. Overall, each module contributed distinctly to the performance improvement, and the modules were complementary, providing empirical support for the structural design of the ATFN model.
4. Conclusions
This study integrates EEG microstates, brain networks, and their feature correlations to build a symmetrical spatio-temporal–frequency feature association vector that characterizes the brain activity involved in auditory, visual, and audiovisual information processing. Using this feature association vector, we developed an Adaptive Tensor Fusion Network (ATFN) model that achieves an accuracy of 96.97% in recognizing A, V, and AV brain activity, providing a valuable reference for brain–computer interfaces, neurological disease diagnosis, and related fields. However, this study still has certain limitations. First, the model’s cross-subject generalization remains unstable. Because EEG signals differ significantly between individuals, the current model exhibits considerable variability in classification performance across subjects, which limits its potential for practical cross-user applications. Second, there is room for improvement in the multi-band feature fusion strategy. Although multi-band microstates, brain network features, and correlation features were incorporated, redundancy and noise in the high-dimensional features have not been fully resolved, and some frequency bands or feature dimensions may contribute little to classification or even interfere with it. Future work will focus on (1) enhancing cross-subject generalization to promote practical applications in brain–computer interfaces and clinical medicine; (2) applying other classic deep learning models to the classification of visual and auditory brain activities and comparing them with our proposed method; and (3) optimizing the structure of the ATFN to improve its ability to integrate multi-dimensional features and to reduce model complexity. We will also further analyze and verify the contributions of brain networks and microstates to brain activity recognition and enhance the interpretability of our method, providing theoretical references for studying the neural mechanisms of brain information processing.