1. Introduction
With the rapid development of low-altitude air mobility in urban transportation, logistics, and emergency response, structural noise generated by aircraft operations has become a major barrier to further development [
1,
2]. The accurate identification of monopole and dipole sources is fundamental to effective structural noise control in low-altitude aviation. Designing effective noise mitigation strategies requires a clear understanding of the underlying acoustic generation mechanisms [
3]. Among various contributors, monopole and dipole sources have been identified as dominant components in the radiated field [
4].
Consistently, Zhang and Liu [
5] validated a fast multipole Ffowcs Williams and Hawkings (FW-H) solver, demonstrating that monopole and dipole sources dominate the radiated field and underscoring their fundamental importance. Relying solely on acoustic imaging [
6] cannot distinguish monopole from dipole sources, leading to misinterpretation of source characteristics and suboptimal noise control. Reliable identification of these elementary source types, particularly monopoles and dipoles, is therefore a prerequisite for meaningful aeroacoustic analysis and targeted noise reduction.
Beamforming methods are widely applied to source localization tasks; however, conventional methods face intrinsic constraints in identifying monopole and dipole sources, especially under turbulent or highly randomized acoustic conditions. These methods are generally formulated under monopole assumptions and thus struggle to characterize directionally radiating sources such as dipoles. These constraints severely hinder source-type identification in complex field conditions. To enhance identification capability, hybrid approaches have been proposed by integrating spherical harmonic decomposition [
7,
8] or transfer function correction techniques [
9,
10,
11] into beamforming frameworks. However, the former depends on coaxial symmetry in the source field, while the latter heavily relies on prior acoustic models and shows poor adaptability to varying environments. Advances in multipole beamforming, such as the direction decomposition and source-component separation of Liu et al. [
10] and Suzuki [
12], or the stability-enhancing deconvolution and directional weakening strategies by Demyanov et al. [
13] and Pan et al. [
14], still require prior knowledge of the source type. As a result, these methods remain insufficient for application in acoustically complex, source-diverse environments.
To address the remaining constraints of these advanced methods, recent research has proposed two major categories of deep learning-based frameworks: beamforming-driven methods and Bayesian inference-based methods.
Within the beamforming category, as one of the few methods explicitly developed for highly randomized and complex environments, Lobato et al. [
15], inspired by the neural network for sound source localization proposed by Ma and Liu [
16], proposed the DAMAS-MS method. This method enhances real-time performance via compression-driven grid refinement and targets the identification of monopole and dipole sources in randomized acoustic fields. However, its identification performance is hindered by a systematic bias toward dipole sources, resulting in frequent overestimation of dipole components and underrepresentation of monopole contributions. In parallel, an unrolled Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) applied to the cross-spectral matrix (CSM) was proposed to improve reconstruction stability in single-shot scenarios [
17]. Building on this insight, Goudarzi [
18] developed the Broadband CLEAN-SC method, which directly utilizes the raw CSM as input and integrates global optimization, local optimization, and a meshless neural network. This method eliminates the need for predefined source-type assumptions and demonstrates strong performance for broadband dipole identification in controlled configurations. However, it has not been validated in scenarios involving mixed source types, spatial randomness, or turbulent fields. Importantly, its direct use of the CSM as model input provides methodological inspiration for the present paper.
Bayesian inference-based methods have recently emerged as a class of data-driven approaches for acoustic source identification [
19]. Among them, Pan et al. [
20] proposed a representative sparse Bayesian learning framework based on a multipole transfer matrix model, enabling simultaneous identification of monopole and dipole sources without predefined source-type assumptions. However, these methods typically rely on dense microphone arrays to ensure observability and remain limited in low-frequency, low-SNR, or spatially randomized environments, restricting their practicality in real-world scenarios. Moreover, the approach of Pan et al. assumes that the total number of sources is known or can be reliably estimated beforehand, and that the underlying field is sufficiently structured. These assumptions are not valid for highly randomized source identification scenarios involving unknown numbers, types, and spatial locations of mixed monopole and dipole sources. As such, sparse Bayesian methods are not directly applicable to the unstructured, stochastic source identification task addressed in this work.
Recently, numerical simulation-based methods have emerged as an alternative for monopole and dipole source identification. Mao et al. [
21], Yang et al. [
22], and Plaksin et al. [
23] used high-fidelity computational fluid dynamics to reconstruct source distributions via numerical beamforming. These methods can capture complex radiation patterns in virtual environments but rely heavily on detailed simulations. Li and Wang [
24] proposed a parametric model based on maximum-likelihood estimation to jointly recover source number, location, and orientation. While these approaches improve accuracy under controlled conditions, they involve high computational costs and offer limited generalization to complex or unpredictable scenarios.
Although numerical simulation methods offer high spatial resolution under structured or controlled conditions, they are not suitable for highly stochastic acoustic fields and are limited by high deployment costs and poor generalizability. In contrast, DAMAS-MS is the only method specifically designed for the identification of randomly located, typed, and numbered mixed monopole–dipole sources, and thus serves as the methodological focus for comparative analysis in this work. To address the limitations of DAMAS-MS under highly randomized conditions, this paper leverages the global representation capability of deep learning models. In particular, the vision transformer (ViT) has demonstrated outstanding performance in global feature modeling [
25,
26], making it well-suited for complex source identification. Kerssies et al. [
27] recently demonstrated that ViT can perform implicit spatial reasoning and accurate image segmentation even in encoder-only configurations, highlighting their scalability and structural adaptability. Inspired by the work of Goudarzi [
18] and guided by the findings of Jekosch and Sarradj [
28] that the CSM inherently contains dipole orientation information, this paper adopts the CSM as the primary feature representation to retain and exploit its directional content. Localization results are used to extract CSM data at target frequencies, forming a three-channel input comprising the real part, the imaginary part, and a spatial positional encoding. This preserves the spatial and spectral structure for ViT-based global feature modeling, yielding higher monopole accuracy while maintaining competitive dipole performance, and extends CSM-based learning to practical airframe noise identification.
This paper is organized as follows. In
Section 2, key challenges in monopole and dipole identification are discussed.
Section 3 introduces the proposed methodology and its theoretical foundation.
Section 4 presents simulations under the same core parameters as DAMAS-MS and reports the corresponding results.
Section 5 presents experimental verification conducted in an anechoic chamber. Finally, conclusions are drawn in
Section 6.
2. Problem Statement
2.1. Constraints of Conventional Monopole and Dipole Identification Methods Without Prior Source Type Assumptions
Conventional monopole and dipole identification methods are ineffective when no prior assumptions regarding source types are made, primarily for two reasons. First, the Green’s function matrix for combined monopole and dipole sources remains undetermined, thereby preventing the formulation of governing equations for the sound field.
This limitation originates from the fundamental mathematical model commonly used in acoustic inverse problems and source localization [
29], given by
$$\mathbf{p} = \mathbf{G}\mathbf{q} + \mathbf{n} \quad (1)$$

Here, $\mathbf{q}$ denotes the unknown source strengths, while $\mathbf{n}$ represents noise in the pressure measurements. Once sound pressure data are collected using a microphone array, the time-domain signals are transformed into the frequency domain through the fast Fourier transform (FFT), yielding the frequency-domain sound pressure vector $\mathbf{p}$. Conventional approaches require a prior assumption about the source types. With such assumptions specified, the structure of the Green's function matrix can then be constructed, for example, as $\mathbf{G} = [\mathbf{G}_m \ \ \mathbf{G}_d]$.
Then, according to the expression

$$\mathbf{p}_{md} = [\mathbf{G}_m \ \ \mathbf{G}_d]\begin{bmatrix}\mathbf{q}_m \\ \mathbf{q}_d\end{bmatrix} = \mathbf{G}_m\mathbf{q}_m + \mathbf{G}_d\mathbf{q}_d, \quad (2)$$

the strengths of the individual monopole and dipole sources can be computed.
In this work, the superscript $(\cdot)^{\mathrm{T}}$ denotes the standard transpose operation, used to arrange complex-valued signals or steering vectors into column vectors.
In Equation (1), $\mathbf{p}$ denotes the frequency-domain sound pressure vector measured at the microphone positions after Fourier transformation. $\mathbf{G}$ is the Green's function matrix (or transfer matrix) that maps the acoustic contributions from candidate source positions to the microphone array, $\mathbf{q}$ represents the vector of unknown source strengths, and $\mathbf{n}$ denotes additive measurement noise. In Equation (2), the transfer matrix is decomposed as $[\mathbf{G}_m \ \ \mathbf{G}_d]$, where $\mathbf{G}_m$ and $\mathbf{G}_d$ correspond to monopole and dipole components, respectively. The associated source strength vectors are $\mathbf{q}_m$ and $\mathbf{q}_d$, forming the full source vector $\mathbf{q}$. The resulting pressure field at the microphones due to both source types is denoted by $\mathbf{p}_{md}$.
However, in the absence of prior knowledge regarding source types, the arrangement of Green's functions within $\mathbf{G}$ remains ambiguous. It is therefore not feasible to determine the distribution and strength of monopole and dipole sources in complex acoustic fields based on Equation (2) alone.
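To make the forward model in Equations (1) and (2) concrete, the following sketch assembles monopole and dipole columns of the transfer matrix for a hypothetical 56-microphone planar array. The free-field Green's function, the finite-difference dipole approximation, the geometry, the frequency, and the source strengths are illustrative assumptions, not the exact configuration used in this paper.

```python
# Minimal sketch of the forward model in Eqs. (1)-(2), assuming free-field
# Green's functions; geometry, frequency, and source strengths are illustrative.
import numpy as np

c = 343.0                      # speed of sound [m/s]
f = 3000.0                     # analysis frequency [Hz]
k = 2 * np.pi * f / c          # wavenumber

mics = np.random.uniform(-0.5, 0.5, (56, 3))   # hypothetical 56-mic planar array
mics[:, 2] = 0.0
srcs = np.array([[0.1, 0.2, 1.0], [-0.2, 0.0, 1.0]])  # two candidate source points

def monopole_G(src, mic):
    r = np.linalg.norm(mic - src)
    return np.exp(-1j * k * r) / (4 * np.pi * r)

def dipole_G(src, mic, axis=0):
    # Dipole column approximated by a finite difference of two monopoles
    # displaced along `axis`; a common surrogate for the analytic derivative.
    d = 1e-3
    e = np.zeros(3); e[axis] = d / 2
    return (monopole_G(src + e, mic) - monopole_G(src - e, mic)) / d

Gm = np.array([[monopole_G(s, m) for s in srcs] for m in mics])   # M x S
Gd = np.array([[dipole_G(s, m, 0) for s in srcs] for m in mics])  # M x S (x-dipoles)

q_m = np.array([1.0, 0.0])     # monopole strengths
q_d = np.array([0.0, 1.0])     # dipole strengths
noise = 1e-6 * (np.random.randn(56) + 1j * np.random.randn(56))

p = Gm @ q_m + Gd @ q_d + noise   # Eq. (2): measured pressures at the array
```

Without prior knowledge of which columns are monopole-type and which are dipole-type, the same measured vector admits structurally different decompositions, which is precisely the ambiguity discussed above.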
2.2. Ill-Posedness of the Inverse Problem
The second challenge stems from the ill-posedness of the inverse problem. Inferring source characteristics from acoustic measurements represents a classic inverse problem, which involves deducing the source distribution, field properties, or other unknown parameters from limited measured data. This process typically contains a large number of unknowns, such as the strength and position of equivalent sources.
When the number of measurement points is limited, the resulting system of equations tends to be underdetermined, meaning that multiple equivalent source or parameter distributions can produce the same external sound field. Under such conditions, there is no unique mapping between the sound field solution and the model parameters.
If the measurement points are located in the far field and the number of sensors is further reduced, the problem becomes more severe, resulting in solution instability and physical distortion. These issues severely limit the reliability and accuracy of source-type identification.
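The non-uniqueness described above can be illustrated with a small synthetic example: when the number of measurements is smaller than the number of unknown source strengths, any component lying in the null space of the transfer matrix can be added to a solution without changing the measured field. The matrix and source values below are arbitrary and serve only to demonstrate the effect.

```python
# Toy illustration of the underdetermined inverse problem: with fewer
# microphones than candidate sources, two different source vectors can
# reproduce the same measured field. Values are synthetic.
import numpy as np

rng = np.random.default_rng(0)
M, S = 8, 20                            # 8 measurements, 20 unknown strengths
G = rng.standard_normal((M, S)) + 1j * rng.standard_normal((M, S))

q_true = np.zeros(S, dtype=complex)
q_true[[3, 11]] = [1.0, 0.5]            # sparse "true" sources

# Any vector in the null space of G can be added without changing p.
null_basis = np.linalg.svd(G)[2][M:].conj().T   # S x (S - M) null-space basis
q_alt = q_true + null_basis @ rng.standard_normal(S - M)

p1, p2 = G @ q_true, G @ q_alt
print(np.allclose(p1, p2))              # True: identical field, different sources
```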
2.3. Proposed Framework of This Paper
To address the two key challenges mentioned above, this study develops a monopole and dipole identification framework that leverages the CSM as a discriminative representation and employs a ViT architecture to capture spatially coherent patterns from complex acoustic fields.
In the task of identifying mixed monopole and dipole sources within complex multi-source fields, constructing intermediate features that accurately represent source characteristics and are suitable for neural network input is essential. Among various representations, the CSM is widely used in beamforming-based localization because it characterizes the frequency-domain coherence between signals recorded by different microphones in an array; it encodes both phase and amplitude correlations across sensor pairs, forming the foundation for spatial filtering, directional response estimation, and source localization.
In this study, we repurpose the CSM from a localization-oriented tool into a feature representation for source-type identification. By reshaping the CSM into an image-like structure, we leverage capabilities of deep neural networks in pattern identification to extract spatially discriminative features from complex sound fields.
The process begins with spatial sampling of the sound pressure field using a planar microphone array. The array comprises $M$ microphones, each recording a time-domain acoustic signal $p_m(t)$, and the measured pressure signal is divided into $K$ snapshots. These signals are first transformed into the frequency domain via FFT, yielding the frequency-domain pressure vector

$$\mathbf{p}^{(k)}(\omega) = \big[\,p_1^{(k)}(\omega),\ p_2^{(k)}(\omega),\ \dots,\ p_M^{(k)}(\omega)\,\big]^{\mathrm{T}},$$

where $p_m(\omega)$ denotes the complex pressure spectrum at the $m$-th microphone position $\mathbf{r}_m$, and $p_m^{(k)}(\omega)$ is the frequency-domain spectrum at angular frequency $\omega$ measured by the $m$-th microphone at the $k$-th snapshot.
The CSM is the fundamental data representation in array acoustics, defined as

$$\mathbf{C}(\omega) = \frac{1}{K}\sum_{k=1}^{K} \mathbf{p}^{(k)}(\omega)\,\big(\mathbf{p}^{(k)}(\omega)\big)^{*\mathrm{T}}.$$

The diagonal elements of $\mathbf{C}$ are usually set to zero in order to remove the influence of background noise. The superscript $*$ indicates complex conjugation, as employed in cross-spectral computation. This formulation reflects the standard outer product used in cross-spectral estimation.
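A minimal sketch of the snapshot-averaged CSM estimate is given below, assuming the time-domain data are available as an array of shape (M, n_samples); the function name, the snapshot length, and the FFT-bin selection are illustrative, while the outer-product averaging and diagonal removal follow the formulation above.

```python
# Sketch of the CSM estimate used as the model input, assuming `signals`
# holds time-domain data of shape (M, n_samples).
import numpy as np

def cross_spectral_matrix(signals, fs, f_target, n_snap=64):
    M, n_samples = signals.shape
    L = n_samples // n_snap                       # samples per snapshot
    C = np.zeros((M, M), dtype=complex)
    for k in range(n_snap):
        block = signals[:, k * L:(k + 1) * L]
        spec = np.fft.rfft(block, axis=1)         # frequency-domain snapshot
        bin_idx = int(round(f_target * L / fs))   # FFT bin of the target frequency
        p = spec[:, bin_idx]                      # length-M pressure spectrum
        C += np.outer(p, p.conj())                # p p^H accumulation
    C /= n_snap
    np.fill_diagonal(C, 0.0)                      # suppress uncorrelated self-noise
    return C

# Example with synthetic data: 56 microphones, 1 s at 48 kHz
sig = np.random.randn(56, 48000)
C = cross_spectral_matrix(sig, fs=48000, f_target=3000.0)
```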
The CSM encodes dipole orientation [
28], motivating its use for source-type identification in this work. While Goudarzi [
18] demonstrated the use of CSM input combined with global and local optimization as well as meshfree neural networks for fixed-source scenarios, its applicability has not yet been extended to mixed or highly randomized source conditions. The ViT architecture, with its strong global modeling capability and contextual sequence learning [
26,
30,
31], offers powerful tools for learning complex spatial relationships. It can directly map spatial phase and sound pressure features derived from the CSM and positional encoding to source types, without explicitly constructing a physical propagation model. This approach avoids the traditional dependence on strong physical priors and enables identification under conditions of high randomness.
3. Detailed Architecture of the ViT Algorithm Model
As introduced in
Section 2.3, this study proposes a ViT-based framework that leverages the CSM as its core input representation to address the challenge of identifying monopole and dipole sources in complex acoustic fields. The method operates on frequency-domain data derived from beamforming and constructs a three-channel CSM image that integrates the real and imaginary components with a spatial positional encoding derived from source localization. A multi-label neural network architecture is then employed to handle scenarios involving multiple sources with high spatial randomness.
This section first introduces the construction process of the CSM image, including the extraction of matrices at characteristic frequencies, the real and imaginary component input format, and the positional encoding strategy that incorporates source localization information. These components are ultimately combined to form a three-channel image representation suitable for ViT input. The section then details the design of the ViT-based multi-label recognition network, including input formatting, network structure, task modeling, and training procedures.
3.1. Input for ViT: CSM with Positional Encoding
In this work, the cross-spectral matrix at a characteristic frequency is selected based on source localization results. To preserve complex acoustic features and enhance the model’s spatial awareness, the real part and imaginary part are extracted and normalized as the first two channels of the image.
Meanwhile, based on preliminary beamforming localization results, the estimated coordinates of all sources in the current sample are normalized and mapped onto a 56 × 56 scanning grid, yielding a positional encoding matrix $\mathbf{P}_{\mathrm{pos}}$ that reflects the prior spatial distribution of the sources.
In this work, the number of microphones is fixed at 56. This choice is made to ensure a direct and fair comparison with the study of Lobato et al. [
15], which also employed a 56-element array for multipole source identification. By adopting the same array configuration, the simulation setup in this study maintains consistency with Lobato et al. [
15], thereby allowing a more reliable evaluation of the performance differences between the proposed CSM–ViT framework and the DAMAS–MS baseline. Finally, these three types of information are combined into a three-channel image tensor:
$$\mathbf{X} = \big[\,\mathrm{Re}(\mathbf{C}),\ \mathrm{Im}(\mathbf{C}),\ \mathbf{P}_{\mathrm{pos}}\,\big]$$

The tensor $\mathbf{X}$, with dimensions $56 \times 56 \times 3$, serves as the input to the subsequent ViT model, enabling it to jointly learn phase and amplitude features while incorporating positional information related to the source layout. This enhances the model's identification performance under complex, multi-source, and non-ideal acoustic conditions.
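The assembly of this three-channel input can be sketched as follows; the normalization scheme, the scan-plane extents, and the way estimated source coordinates are marked on the 56 × 56 positional-encoding channel are plausible assumptions, since the paper does not fix these implementation details here.

```python
# Hypothetical assembly of the three-channel input: real part, imaginary part,
# and a positional-encoding channel built from estimated source coordinates.
import numpy as np

def build_input(C, est_coords, plane=((-1.0, 1.0), (-1.0, 1.0))):
    M = C.shape[0]                                  # 56 for the array used here
    def norm(a):
        span = a.max() - a.min()
        return (a - a.min()) / span if span > 0 else np.zeros_like(a)

    pos = np.zeros((M, M))
    (x0, x1), (y0, y1) = plane
    for x, y in est_coords:                         # beamforming localization output
        i = int((x - x0) / (x1 - x0) * (M - 1))
        j = int((y - y0) / (y1 - y0) * (M - 1))
        pos[j, i] = 1.0                             # mark estimated source cell

    return np.stack([norm(C.real), norm(C.imag), pos], axis=-1)   # 56 x 56 x 3

X = build_input(np.random.randn(56, 56) + 1j * np.random.randn(56, 56),
                est_coords=[(0.1, 0.2), (-0.3, 0.0)])
print(X.shape)   # (56, 56, 3)
```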
3.2. Design of the ViT-Based Multi-Label Identification Network
The ViT is a significant recent breakthrough in the field of computer vision. Its core idea is to introduce the Transformer architecture, originally developed for natural language processing, into image processing tasks. Since it was first proposed by Dosovitskiy et al. [
32] in 2021, ViT has demonstrated outstanding performance and strong global modeling capabilities in tasks such as image identification and object detection [
25,
26,
30]. Unlike traditional convolutional neural networks (CNN), which primarily rely on local receptive fields [
33,
34], ViT divides the input image into fixed-size patches and employs a multi-head self-attention mechanism to model global features. It is especially effective in handling structured and spatially correlated image data and has gradually become one of the mainstream techniques in visual recognition [
35].
To achieve efficient identification of multiple source types in complex sound fields, this study designs a multi-label identification model based on the ViT. The model takes the image constructed from the CSM as input and leverages ViT's strength in capturing global dependencies to jointly identify different physical source types (monopole and two directional dipoles). In this work, the CSM at the target frequency is converted into a three-channel image for input to the ViT. The first channel is the real part of the CSM, the second channel is the imaginary part of the CSM, and the third channel is a spatial positional encoding derived from the source localization results. To construct this positional encoding, the estimated source coordinates are normalized and mapped onto a 56 × 56 scanning grid (matching the CSM's dimensions), producing a matrix that reflects the prior spatial distribution of the sources. We then normalize each of these matrices and stack them to form a 56 × 56 × 3 tensor, effectively treating the CSM data as an image with three channels. This CSM image preserves both the amplitude and phase information of the acoustic field while incorporating spatial context, allowing the Vision Transformer to learn global features for improved monopole and dipole source identification.
The input size of the model is 56 × 56 × 3. ViT first divides the input into non-overlapping image patches, each of size 8 × 8, resulting in 49 patches. This configuration achieves an effective balance between spatial resolution and computational efficiency. For comparison, we also evaluated a 7 × 7 patch size (64 patches) and a third patch-size setting; the former delivered comparable precision, whereas the latter led to performance degradation due to excessive fragmentation, loss of fine spatial details, and overfitting caused by increased model complexity. The analysis of these configurations is presented in
Section 4.5, where we demonstrate that the 8 × 8 patch size represents the optimal trade-off between precision and efficiency.
Each patch is linearly projected into a 768-dimensional embedding vector through a Conv2D, forming a token sequence of length 49. A learnable identification token (CLS) is prepended to the sequence, and positional embeddings are added element-wise to explicitly provide positional information. The resulting sequence of length 50 is then input into the stacked ViT encoder for feature modeling.
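The patch-embedding stage described above can be sketched as follows, assuming a PyTorch implementation; the module name and initialization are illustrative, while the 8 × 8 patches, 768-dimensional embeddings, prepended CLS token, and learnable positional embeddings follow the text.

```python
# Minimal sketch of the patch-embedding stage for the 56 x 56 x 3 input.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=56, patch_size=8, in_ch=3, dim=768):
        super().__init__()
        self.n_patches = (img_size // patch_size) ** 2          # 49
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, self.n_patches + 1, dim))

    def forward(self, x):                                       # x: (B, 3, 56, 56)
        tok = self.proj(x).flatten(2).transpose(1, 2)           # (B, 49, 768)
        cls = self.cls.expand(x.shape[0], -1, -1)
        return torch.cat([cls, tok], dim=1) + self.pos          # (B, 50, 768)

tokens = PatchEmbed()(torch.randn(2, 3, 56, 56))
print(tokens.shape)   # torch.Size([2, 50, 768])
```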
The ViT encoder consists of 12 identical encoder blocks, each with 12 attention heads, consistent with the standard ViT-Base configuration and providing sufficient modeling capacity for complex acoustic scenarios. In
Section 4.5, we further discuss the performance differences among models with 6, 9, and 12 layers, showing that shallower networks achieve comparable precision, indicating that model depth is not the performance bottleneck but primarily contributes to model stability.
Each block contains a multi-head self-attention module with 12 heads and a multilayer perceptron (MLP) module. The embedding dimension is 768, and the hidden dimension of the MLP is 3072 (an MLP ratio of 4). Each block uses residual connections and layer normalization. During encoding, the multi-head attention mechanism effectively captures global dependencies between patches and models the spatial distribution characteristics of frequency-domain interference patterns.
After feature extraction, the first token (the CLS token) in the output sequence is used as the global representation of the entire image. This token is passed through MLP heads to predict the source types. There are $S$ parallel identification heads in total, each corresponding to a specific source location and outputting its type prediction, where $S \in \{2, 5, 10\}$ represents the number of sources in the sample. Each identification head consists of two fully connected layers with channels 768→384→3, using GELU activation and dropout for regularization. Finally, a softmax layer outputs the source type at each location (0: monopole; 1: x-direction dipole; 2: y-direction dipole).
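The parallel identification heads can be sketched as below, again assuming PyTorch; the dropout rate is an assumed value, while the head count S, the 768→384→3 layer sizes, and the GELU activation follow the description above.

```python
# Sketch of the S parallel identification heads (S = 2, 5, or 10), each mapping
# the CLS representation to one of three source-type classes.
import torch
import torch.nn as nn

class SourceTypeHeads(nn.Module):
    def __init__(self, n_sources, dim=768, hidden=384, n_classes=3, p_drop=0.1):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Dropout(p_drop),
                          nn.Linear(hidden, n_classes))
            for _ in range(n_sources)
        ])

    def forward(self, cls_token):               # cls_token: (B, 768)
        # One 3-class logit vector per source position (softmax applied in the loss)
        return torch.stack([h(cls_token) for h in self.heads], dim=1)   # (B, S, 3)

logits = SourceTypeHeads(n_sources=5)(torch.randn(4, 768))
print(logits.shape)   # torch.Size([4, 5, 3])
```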
The overall architecture of the ViT and its model configuration are illustrated in
Figure 1 and
Table 1, respectively.
3.3. Training Details
To ensure statistical reliability, each experiment was independently repeated 10 times with different random seeds. All performance metrics reported in this paper represent the mean values averaged over these 10 runs. The dataset for each source configuration (2, 5, and 10 sources) was consistently partitioned into training, validation, and test sets in a ratio of 6:2:2. The number of training samples was 30,000 for two sources, 120,000 for five sources, and approximately 350,000 for ten sources, with validation and test sets scaled accordingly.
During the model training phase, the AdamW optimizer is employed to update network weights, with an initial learning rate of 1 × 10⁻⁴. A Cosine Annealing Warm Restarts (CAWR) scheduler is used to dynamically adjust the learning rate, enhancing convergence stability. The training is performed with a batch size of 64 for up to 200 epochs, and an early stopping strategy is applied on the validation set to prevent overfitting. To mitigate the impact of label uncertainty on model training, the loss function adopts label-smoothing cross-entropy, which improves the model's generalization in multi-class source identification tasks. Throughout training, the loss and accuracy curves for both the training and validation phases are recorded in real time. The model parameters corresponding to the lowest validation loss are saved for subsequent performance evaluation.
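A condensed sketch of this training configuration is shown below, assuming PyTorch and that `model`, `train_loader`, and `val_loader` (with a batch size of 64) already exist; the warm-restart period, early-stopping patience, label-smoothing factor, and checkpoint filename are assumed values, while the AdamW optimizer, the 1 × 10⁻⁴ learning rate, the cosine-annealing warm-restart schedule, and validation-based early stopping follow the text.

```python
# Hedged sketch of the training loop: AdamW + CAWR schedule,
# label-smoothing cross-entropy, and validation-based early stopping.
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=200, patience=15):
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    sched = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(opt, T_0=10)
    loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)
    best_val, wait = float("inf"), 0

    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:                       # y: (B, S) integer labels
            opt.zero_grad()
            logits = model(x)                           # (B, S, 3)
            loss = loss_fn(logits.reshape(-1, 3), y.reshape(-1))
            loss.backward()
            opt.step()
        sched.step()

        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(x).reshape(-1, 3), y.reshape(-1)).item()
                      for x, y in val_loader) / len(val_loader)
        if val < best_val:
            best_val, wait = val, 0
            torch.save(model.state_dict(), "best_model.pt")   # lowest validation loss
        else:
            wait += 1
            if wait >= patience:                        # early stopping
                break
```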
4. Simulations
To comprehensively evaluate the applicability and robustness of the proposed CSM–ViT–based source-type identification method under acoustically complex conditions, multiple simulations were designed and implemented with fixed source counts of 2, 5, and 10. Verifying identification performance under these three conditions covers a representative subset of the source-count space; integrating sub-models trained at these discrete counts allows the construction of a generalized framework capable of recognizing mixed monopole and dipole sources across arbitrary counts ranging from 2 to 10, with randomized locations and types.
4.1. Simulation Setup and Evaluation Metrics
In each case, both source types and spatial positions were randomly generated following a consistent rule: the simulation plane was divided into a 64 × 64 uniform grid. Each sample contained three possible source types: monopole, x-direction dipole, and y-direction dipole.
The microphone array layout is shown in
Figure 2, while
Figure 3 illustrates a randomized source configuration. Monopoles occupy a single red grid cell, whereas dipoles span two adjacent cells—horizontally for
x-direction and vertically for
y-direction dipoles—visually distinguished by blue horizontal and vertical pairs, respectively, to emphasize their directional characteristics. Each sample was guaranteed to contain at least one monopole and one dipole, with dipole types selected randomly. Overlapping of grid cells between different sources was not permitted.
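The randomized layout rule can be sketched as follows; the helper name and the cell-indexing convention are illustrative, while the 64 × 64 grid, the at-least-one-monopole-and-one-dipole constraint, the two-cell dipole footprint, and the no-overlap rule follow the description above.

```python
# Sketch of the randomized source-layout rule: 64 x 64 grid, at least one
# monopole and one dipole, dipoles on two adjacent cells, no cell overlap.
import random

def random_configuration(n_sources, grid=64):
    occupied, sources = set(), []
    types = ["monopole", random.choice(["dipole_x", "dipole_y"])]
    types += [random.choice(["monopole", "dipole_x", "dipole_y"])
              for _ in range(n_sources - 2)]
    random.shuffle(types)
    for t in types:
        while True:
            i, j = random.randrange(grid), random.randrange(grid)
            if t == "monopole":
                cells = {(i, j)}
            elif t == "dipole_x":
                cells = {(i, j), (i, j + 1)}            # horizontally adjacent pair
            else:
                cells = {(i, j), (i + 1, j)}            # vertically adjacent pair
            inside = all(a < grid and b < grid for a, b in cells)
            if inside and not (cells & occupied):
                occupied |= cells
                sources.append((t, sorted(cells)))
                break
    return sources

print(random_configuration(5))
```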
Model performance was evaluated using accuracy, recall, F1-score, and confusion matrix metrics within a multi-label identification framework, enabling a thorough assessment of identification accuracy and generalization capability across varied complex mixtures.
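For reference, the per-class metrics and confusion matrix can be computed as in the sketch below, assuming scikit-learn and flattened per-position labels; the toy label vectors are synthetic and only illustrate the calculation.

```python
# Illustrative computation of per-class precision, recall, F1, and the
# confusion matrix (0: monopole, 1: x-dipole, 2: y-dipole).
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix

y_true = [0, 1, 2, 0, 1, 2, 0, 2]        # ground-truth types per source position
y_pred = [0, 1, 2, 0, 2, 2, 0, 2]        # model predictions

prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, labels=[0, 1, 2])
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(prec, rec, f1)
print(cm)
```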
The simulation cases were as follows:
Case 1: Two sources with random positions and types (monopole–dipole identification across three labels).
Case 2: Five sources with random positions and types (three-label identification).
Case 3: Ten sources with random positions and types (three-label identification).
To ensure comparability among simulations, all three cases shared consistent core parameters, as summarized in
Table 2. A custom array file, Acoular_modify_array_56.xml, is used; it retains the original Acoular layout but rescales the array dimensions to match the physical aperture described in the DAMAS-MS study.
4.2. Simulation Analysis of Case 1
In this section, we conduct simulations with a fixed number of two sound sources. Each sample contains two sources, with both their spatial positions and types (monopole, x-direction dipole, y-direction dipole) randomly generated to simulate real-world scenarios characterized by high uncertainty in source distribution and type. The corresponding ViT sub-model is trained to perform a multi-label identification task, identifying the source type at each of the two positions.
The overall confusion matrix results are shown in
Figure 4, which indicates that the model achieves excellent differentiation among the three source types. The numbers of correctly identified monopoles, x-direction dipoles, and y-direction dipoles reached 487 out of 490, 508 out of 510, and 492 out of 498, respectively. The overall misidentification rate is extremely low, with minor confusion observed only between x- and y-direction dipoles. Notably, no systematic misidentification of monopoles as dipoles is observed.
Furthermore, the identification metrics (precision, recall, and F1-score) are shown in
Figure 5 and
Figure 6, remaining above 0.990 across all categories, with the highest F1-score reaching 0.998 for one of the dipole categories, demonstrating the model's stable and fine-grained identification capability under complex input structures. The F1-score heatmap further shows consistent performance across the two sources, with no noticeable variation due to spatial differences. Similarly, the heatmaps of precision and recall confirm that the internal attention mechanism effectively captures the structural association between source-type features and spatial patterns in the CSM image.
In summary, under the condition of two-source training, the proposed ViT-based architecture exhibits strong identification accuracy for randomly located and mixed-type sources, highlighting its modeling advantages and practical value under spatial uncertainty.
4.3. Simulation Analysis of Case 2
The results under the setting of five fixed sound sources are shown below. The confusion matrix is illustrated in
Figure 7; the number of correctly identified sources for each class is as follows: 470 out of 485 for monopoles, 475 out of 504 for x-direction dipoles, and 478 out of 511 for y-direction dipoles. The overall misidentification rate remains low, though slightly higher than that observed in the two-source scenario, with a noticeable increase in mutual misidentifications between dipole types. The F1-score heatmap is shown in
Figure 8, which reveals the identification performance distribution of each source across the different categories, with scores generally ranging from 0.926 to 0.980. Monopoles remain the most consistently recognized category, while the F1-scores for x- and y-direction dipoles are slightly lower, primarily due to the similarity in their spatial radiation patterns.
The precision and recall heatmaps are shown in
Figure 9, which provides a more detailed analysis of identification performance. Some source positions exhibit a clear drop in precision for x- and y-direction dipoles (as low as 0.905). Although recall at these positions remains above 0.922, the imbalance between these two metrics suggests confidence disparities and ambiguous boundary decisions when distinguishing adjacent source types. Nevertheless, the overall performance significantly surpasses that of traditional methods and remains consistent under multi-source excitation and structural randomness, indicating that the ViT architecture possesses a degree of robust spatial awareness and can stably extract discriminative features from local cross-spectral patterns.
In summary, the proposed method maintains high accuracy and strong adaptability when extended to five-source complex scenarios, demonstrating certain generalization capability and potential for practical engineering applications.
4.4. Simulation Analysis of Case 3
For the case with a fixed source count of ten, the confusion matrix is presented in
Figure 10, showing that the model maintains a prominent diagonal dominance. The numbers of correctly identified monopoles,
x-direction dipoles, and y-direction dipoles are 433 out of 479, 449 out of 495, and 465 out of 526, respectively. However, the misidentification rate increases compared to the previous two cases, with a certain degree of cross-type misidentification occurring between monopoles and dipoles. This indicates that under high source density, the decision boundaries between source types begin to be affected by spatial aliasing and feature ambiguity, revealing a degree of boundary generalization error in the model structure.
The F1-score heatmap is shown in
Figure 11, which further highlights the performance variability across source points. Compared to low-density scenarios, the F1-scores for each class are distributed in the range of 0.850–0.930, demonstrating relatively strong identification stability.
The precision and recall heatmaps are shown in
Figure 12, which exhibit more pronounced metric fluctuations. For one of the dipole categories, the minimum precision is 0.827 and the minimum recall is 0.867, indicating that this category poses greater challenges for structural discrimination under complex configurations. In contrast, although monopoles also experience some misidentification, their precision and recall remain relatively stable overall, reaffirming the distinctiveness of their radiation characteristics.
It is noteworthy that despite a certain performance degradation, all overall indicators remain above 0.850. The model still successfully performs multi-label identification even under the extreme condition of ten highly interfering sources, demonstrating that the proposed identification framework, based on the three-channel CSM construction and the ViT architecture, exhibits strong anti-interference capability and generalization, making it suitable for practical engineering applications in densely distributed multi-source scenarios.
4.5. Accuracy Under Varying Frequencies, Model Parameters and Input Channels
As shown in
Table 3, the identification accuracy (mean ± standard deviation) was evaluated under varying frequencies (2000, 3000, and 4000 Hz) and source counts (2, 5, and 10). The results show that frequency has minimal impact on performance; accuracy remains consistent across frequencies within each source condition, with a slight improvement observed at higher frequencies, indicating strong robustness in the mid-to-high frequency range. In contrast, source count exerts a more significant influence. At 3000 Hz, for example, accuracy decreases from 0.995 with two sources to 0.933 with five and 0.868 with ten sources. The standard deviation also increases with source count, suggesting reduced prediction stability in high-density scenarios. Overall, the ViT-based model demonstrates strong frequency robustness but is more sensitive to increasing source complexity.
Table 4,
Table 5 and
Table 6 present the recognition accuracy under fixed source-count settings (2, 5, and 10), evaluated with different patch sizes and numbers of encoder blocks. A consistent trend is observed: larger patch sizes result in noticeably lower accuracy. In contrast, the accuracy values for 7 × 7 and 8 × 8 patch sizes remain nearly identical across all configurations, suggesting that small to moderate patch sizes provide sufficient spatial resolution, while overly large patches reduce token density and limit the model's ability to capture detailed spatial cues.
To further validate the impact of each input component, we conducted an ablation study under the two-source scenario.
Table 7 presents the classification precision across four input configurations: using only the real part, using only the imaginary part, using both real and imaginary parts, and using the full three-channel input.
Results indicate that combining the real and imaginary parts yields significantly higher precision than using either alone—this demonstrates that concurrent modeling of amplitude and phase features enhances identification accuracy. Furthermore, incorporating positional encoding achieves the highest overall precision, confirming that explicit spatial priors provide additional benefits to the model. This outcome aligns with the original intent of our input design, which aims to “preserve the spatial and spectral structure” in the data.
Notably, under the relatively simple two-source scenario, the model maintains high precision even without positional encoding. This can be attributed to two key factors. First, the network can infer the basic source geometry directly from the CSM itself; second, and more critically, the small number of classes to distinguish in two-source cases results in a relatively high probability of correct random guessing, making the performance gain from positional encoding less prominent.
However, we hypothesize that as the number of sources increases, scenario complexity will rise substantially. In such cases, the probability of correct random guessing will drop sharply, and the coherent structures of multiple sources will tend to interfere with each other and become indistinguishable in the CSM. Here, positional encoding—by providing precise spatial prior information—will become a critical basis for distinguishing different sources, and its role in improving model performance will become far more pronounced. In future work, we will validate this hypothesis through comparative experiments under multi-source scenarios.
4.6. Noise Impact Analysis in Two-Source Conditions
To investigate the robustness of the proposed method under noisy acoustic conditions, we conducted additional tests using the two-source simulation scenario. The model was trained on a fixed dataset of 30,000 samples. White Gaussian noise was injected into the CSM at two signal-to-noise ratio (SNR) levels: 20 dB and 10 dB. As summarized in
Table 8, the classification precision dropped from 0.995 in the clean condition to 0.865 at 20 dB and 0.803 at 10 dB. While this degradation is expected, the model still maintains high identification accuracy under moderate noise levels, confirming its resilience against real-world acoustic disturbances. These results further support the practical feasibility of the proposed framework under non-ideal conditions. The performance drop is likely attributable in part to the limited training set size; in future work, we will explore scaling up the training data to further enhance noise robustness. Moreover, extending the validation to higher source-count scenarios will be an important direction for future studies.
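A sketch of the noise-injection step is given below; scaling complex white Gaussian noise against the mean CSM power is a common convention and is assumed here, since the exact noise model is not specified in the text.

```python
# Sketch of injecting complex white Gaussian noise into the CSM at a target SNR.
import numpy as np

def add_noise_to_csm(C, snr_db, rng=np.random.default_rng()):
    sig_power = np.mean(np.abs(C) ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10))
    noise = (rng.standard_normal(C.shape) + 1j * rng.standard_normal(C.shape))
    noise *= np.sqrt(noise_power / 2)
    return C + noise

C = np.random.randn(56, 56) + 1j * np.random.randn(56, 56)
C_20dB = add_noise_to_csm(C, snr_db=20)
C_10dB = add_noise_to_csm(C, snr_db=10)
```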
4.7. Summary
The combined results from the three simulation cases with different numbers of sound sources (2, 5, and 10) demonstrate that the proposed method exhibits excellent identification performance in complex acoustic scenarios characterized by highly randomized source positions and types. Although each sub-model is trained under a fixed source count setting, all are capable of stably and accurately identifying the specific type of each source in multi-label identification tasks, with key performance metrics including precision, recall, and F1-score, consistently maintained at high levels. As the number of sources increases, slight misidentification between dipole categories is observed; however, the overall model architecture still demonstrates strong anti-interference capability and effective spatial feature decoupling.
Furthermore, compared with the results of DAMAS-MS in [
15], which yields a precision of about 0.90 and a recall of approximately 0.82, the proposed method achieves higher accuracy in monopole identification while maintaining the inherent advantages of DAMAS-MS in dipole identification. In the two-source case, the model exhibits near-perfect stability, with precision, recall, and F1-scores all exceeding 0.990. For the five-source case, performance remains excellent, with precision ranging from 0.905 to 0.980, recall above 0.922, and F1-scores between 0.926 and 0.980. Even under the challenging ten-source case, where slight category confusion occurs, the method maintains precision no lower than 0.827, recall above 0.867, and F1-scores between 0.850 and 0.930. This indicates that our method not only enhances performance where conventional methods struggle but also preserves their strengths in scenarios where they perform well.
These findings suggest that the proposed method is capable of accurately identifying arbitrary combinations of 2 to 10 mixed sources with random locations and types, thereby validating the effectiveness of representative sub-model training for generalization. The results also highlight the strong robustness and engineering applicability of the method in densely populated multi-source sound fields.
6. Conclusions
This paper proposes a novel sound source-type identification framework that integrates beamforming-derived CSM and a ViT identifier, aiming to address the challenge of identifying monopole and dipole sources under unknown source type, position, and number. To the best of our knowledge, this is the first study that systematically applies ViT-based deep learning to multipole acoustic source identification using CSM inputs.
Simulations with fixed source counts (2, 5, 10) and randomized locations and types demonstrate consistently high identification accuracy across a wide range of configurations. Experimental verification in an anechoic chamber with physical monopole and dipole realizations further confirms the robustness of the model.
By combining sub-models trained on different source counts, the approach can generalize to arbitrary 2–10 source configurations. Even with a limited number of microphones, ViT can capture spatial phase and pressure correlations within CSM. Its robustness to local disturbances and input order makes it effective for recognizing source types and positions in highly randomized acoustic fields. The model outperforms the DAMAS-MS baseline in monopole identification accuracy. Importantly, it bypasses the ill-posedness of traditional inverse problems by learning directly from CSM, offering a practical and scalable solution for complex acoustic environments.