1. Introduction
With the rapid development of low-altitude air mobility in urban transportation, logistics, and emergency response, structural noise generated by aircraft operations has become a major barrier to further development [
1,
2]. The accurate identification of monopole and dipole sources is fundamental to effective structural noise control in low-altitude aviation. Designing effective noise mitigation strategies requires a clear understanding of the underlying acoustic generation mechanisms [
3]. Among various contributors, monopole and dipole sources have been identified as dominant components in the radiated field [
4].
Consistently, Zhang and Liu [
5] validated a fast multipole Ffowcs Williams and Hawkings (FW-H) solver, demonstrating that monopole and dipole sources dominate the radiated field and underscoring their fundamental importance. Relying solely on acoustic imaging [
6] cannot distinguish monopole from dipole sources, leading to misinterpretation of source characteristics and suboptimal noise control. Reliable identification of these elementary source types, particularly monopoles and dipoles, is therefore a prerequisite for meaningful aeroacoustic analysis and targeted noise reduction.
Beamforming methods are widely applied to source localization tasks; however, conventional methods face intrinsic constraints in identifying monopole and dipole sources, especially under turbulent or highly randomized acoustic conditions. These methods are generally formulated under monopole assumptions and thus struggle to characterize directionally radiating sources such as dipoles. These constraints severely hinder source-type identification in complex field conditions. To enhance identification capability, hybrid approaches have been proposed by integrating spherical harmonic decomposition [
7,
8] or transfer function correction techniques [
9,
10,
11] into beamforming frameworks. However, the former depends on coaxial symmetry in the source field, while the latter heavily relies on prior acoustic models and shows poor adaptability to varying environments. Advances in multipole beamforming, such as the direction decomposition and source-component separation of Liu et al. [
10] and Suzuki [
12], or the stability-enhancing deconvolution and directional weakening strategies by Demyanov et al. [
13] and Pan et al. [
14], still require prior knowledge of the source type. As a result, these methods remain insufficient for application in acoustically complex, source-diverse environments.
To address the remaining constraints of these advanced methods, recent research has proposed two major categories of deep learning-based frameworks: beamforming-driven methods and Bayesian inference-based methods.
Within the beamforming category, as one of the few methods explicitly developed for highly randomized and complex environments, Lobato et al. [
15], inspired by the neural network for sound source localization proposed by Ma and Liu [
16], proposed the DAMAS-MS method. This method enhances real-time performance via compression-driven grid refinement and targets the identification of monopole and dipole sources in randomized acoustic fields. However, its identification performance is hindered by a systematic bias toward dipole sources, resulting in frequent overestimation of dipole components and underrepresentation of monopole contributions. In parallel, an unrolled Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) applied to the cross-spectral matrix (CSM) was proposed to improve reconstruction stability in single-shot scenarios [
17]. Building on this insight, Goudarzi [
18] developed the Broadband CLEAN-SC method, which directly utilizes the raw CSM as input and integrates global optimization, local optimization, and a meshless neural network. This method eliminates the need for predefined source-type assumptions and demonstrates strong performance for broadband dipole identification in controlled configurations. However, it has not been validated in scenarios involving mixed source types, spatial randomness, or turbulent fields. Importantly, its direct use of the CSM as model input provides methodological inspiration for the present paper.
Bayesian inference-based methods have recently emerged as a class of data-driven approaches for acoustic source identification [
19]. Among them, Pan et al. [
20] proposed a representative sparse Bayesian learning framework based on a multipole transfer matrix model, enabling simultaneous identification of monopole and dipole sources without predefined source-type assumptions. However, these methods typically rely on dense microphone arrays to ensure observability and remain limited in low-frequency, low-SNR, or spatially randomized environments, restricting their practicality in real-world scenarios. Moreover, the approach of Pan et al. assumes that the total number of sources is known or can be reliably estimated beforehand, and that the underlying field is sufficiently structured. These assumptions are not valid for highly randomized source identification scenarios involving unknown numbers, types, and spatial locations of mixed monopole and dipole sources. As such, sparse Bayesian methods are not directly applicable to the unstructured, stochastic source identification task addressed in this work.
Recently, numerical simulation-based methods have emerged as an alternative for monopole and dipole source identification. Mao et al. [
21], Yang et al. [
22], and Plaksin et al. [
23] used high-fidelity computational fluid dynamics to reconstruct source distributions via numerical beamforming. These methods can capture complex radiation patterns in virtual environments but rely heavily on detailed simulations. Li and Wang [
24] proposed a parametric model based on maximum-likelihood estimation to jointly recover source number, location, and orientation. While these approaches improve accuracy under controlled conditions, they involve high computational costs and offer limited generalization to complex or unpredictable scenarios.
Although numerical simulation methods offer high spatial resolution under structured or controlled conditions, they are not suitable for highly stochastic acoustic fields and are limited by high deployment costs and poor generalizability. In contrast, DAMAS-MS is the only method specifically designed for the identification of randomly located, typed, and numbered mixed monopole–dipole sources, and thus serves as the methodological focus for comparative analysis in this work. To address the limitations of DAMAS-MS under highly randomized conditions, this paper leverages the global representation capability of deep learning models. In particular, the vision transformer (ViT) has demonstrated outstanding performance in global feature modeling [
25,
26], making it well-suited for complex source identification. Kerssies et al. [
27] recently demonstrated that ViT can perform implicit spatial reasoning and accurate image segmentation even in encoder-only configurations, highlighting their scalability and structural adaptability. Inspired by the work of Goudarzi [
18] and guided by the findings of Jekosch and Sarradj [
28] that the CSM inherently contains dipole orientation information, this paper adopts the CSM as the primary feature representation to retain and exploit its directional content. Localization results are used to extract CSM data at target frequencies, forming a three-channel input comprising the real part, the imaginary part, and a spatial positional encoding. This preserves the spatial and spectral structure for ViT-based global feature modeling, yielding higher monopole accuracy while maintaining competitive dipole performance, and extends CSM-based learning to practical airframe noise identification.
This paper is organized as follows. In
Section 2, key challenges in monopole and dipole identification are discussed.
Section 3 introduces the proposed methodology and its theoretical foundation.
Section 4 presents simulations under the same core parameters as DAMAS-MS and reports the corresponding results.
Section 5 presents experimental verification conducted in an anechoic chamber. Finally, conclusions are drawn in
Section 6.
2. Problem Statement
2.1. Constraints of Conventional Monopole and Dipole Identification Methods Without Prior Source Type Assumptions
Conventional monopole and dipole identification methods are ineffective when no prior assumptions regarding source types are made, primarily for two reasons. First, the Green’s function matrix for combined monopole and dipole sources remains undetermined, thereby preventing the formulation of governing equations for the sound field.
This limitation originates from the fundamental mathematical model commonly used in acoustic inverse problems and source localization [
29], given by
$$\mathbf{p} = \mathbf{G}\mathbf{q} + \mathbf{n} \quad (1)$$

Here, $\mathbf{q}$ denotes the unknown source strengths, while $\mathbf{n}$ represents noise in the pressure measurements. Once sound pressure data are collected using a microphone array, the time-domain signals are transformed into the frequency domain through the fast Fourier transform (FFT), yielding the frequency-domain sound pressure vector $\mathbf{p}$. Conventional approaches require a prior assumption about the source types. With such assumptions specified, the structure of the Green's function matrix can then be constructed, for example, as $\mathbf{G} = [\mathbf{G}_m \ \ \mathbf{G}_d]$.
Then, according to the expression

$$\mathbf{p}_{md} = [\mathbf{G}_m \ \ \mathbf{G}_d]\begin{bmatrix}\mathbf{q}_m \\ \mathbf{q}_d\end{bmatrix} = \mathbf{G}_m\mathbf{q}_m + \mathbf{G}_d\mathbf{q}_d, \quad (2)$$

the strengths of the individual monopole and dipole sources can be computed.
In this work, the superscript $(\cdot)^{\mathrm{T}}$ denotes the standard transpose operation, used to arrange complex-valued signals or steering vectors into column vectors.
In Equation (1), $\mathbf{p}$ denotes the frequency-domain sound pressure vector measured at the microphone positions after Fourier transformation. $\mathbf{G}$ is the Green's function matrix (or transfer matrix) that maps the acoustic contributions from candidate source positions to the microphone array, $\mathbf{q}$ represents the vector of unknown source strengths, and $\mathbf{n}$ denotes additive measurement noise. In Equation (2), the transfer matrix is decomposed as $[\mathbf{G}_m \ \ \mathbf{G}_d]$, where $\mathbf{G}_m$ and $\mathbf{G}_d$ correspond to monopole and dipole components, respectively. The associated source strength vectors are $\mathbf{q}_m$ and $\mathbf{q}_d$, forming the full source vector $\mathbf{q}$. The resulting pressure field at the microphones due to both source types is denoted by $\mathbf{p}_{md}$.
However, in the absence of prior knowledge regarding source types, the arrangement of Green's functions within $\mathbf{G}$ remains ambiguous. It is therefore not feasible to determine the distribution and strength of monopole and dipole sources in complex acoustic fields based on Equation (2) alone.
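To make the forward model in Equations (1) and (2) concrete, the following sketch assembles monopole and dipole columns of the transfer matrix for a hypothetical 56-microphone planar array. The free-field Green's function, the finite-difference dipole approximation, the geometry, the frequency, and the source strengths are illustrative assumptions, not the exact configuration used in this paper.

```python
# Minimal sketch of the forward model in Eqs. (1)-(2), assuming free-field
# Green's functions; geometry, frequency, and source strengths are illustrative.
import numpy as np

c = 343.0                      # speed of sound [m/s]
f = 3000.0                     # analysis frequency [Hz]
k = 2 * np.pi * f / c          # wavenumber

mics = np.random.uniform(-0.5, 0.5, (56, 3))   # hypothetical 56-mic planar array
mics[:, 2] = 0.0
srcs = np.array([[0.1, 0.2, 1.0], [-0.2, 0.0, 1.0]])  # two candidate source points

def monopole_G(src, mic):
    r = np.linalg.norm(mic - src)
    return np.exp(-1j * k * r) / (4 * np.pi * r)

def dipole_G(src, mic, axis=0):
    # Dipole column approximated by a finite difference of two monopoles
    # displaced along `axis`; a common surrogate for the analytic derivative.
    d = 1e-3
    e = np.zeros(3); e[axis] = d / 2
    return (monopole_G(src + e, mic) - monopole_G(src - e, mic)) / d

Gm = np.array([[monopole_G(s, m) for s in srcs] for m in mics])   # M x S
Gd = np.array([[dipole_G(s, m, 0) for s in srcs] for m in mics])  # M x S (x-dipoles)

q_m = np.array([1.0, 0.0])     # monopole strengths
q_d = np.array([0.0, 1.0])     # dipole strengths
noise = 1e-6 * (np.random.randn(56) + 1j * np.random.randn(56))

p = Gm @ q_m + Gd @ q_d + noise   # Eq. (2): measured pressures at the array
```

Without prior knowledge of which columns are monopole-type and which are dipole-type, the same measured vector admits structurally different decompositions, which is precisely the ambiguity discussed above.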
2.2. Ill-Posedness of the Inverse Problem
The second challenge stems from the ill-posedness of the inverse problem. Inferring source characteristics from acoustic measurements represents a classic inverse problem, which involves deducing the source distribution, field properties, or other unknown parameters from limited measured data. This process typically contains a large number of unknowns, such as the strength and position of equivalent sources.
When the number of measurement points is limited, the resulting system of equations tends to be underdetermined, meaning that multiple equivalent source or parameter distributions can produce the same external sound field. Under such conditions, there is no unique mapping between the sound field solution and the model parameters.
If the measurement points are located in the far field and the number of sensors is further reduced, the problem becomes more severe, resulting in solution instability and physical distortion. These issues severely limit the reliability and accuracy of source-type identification.
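The non-uniqueness described above can be illustrated with a small synthetic example: when the number of measurements is smaller than the number of unknown source strengths, any component lying in the null space of the transfer matrix can be added to a solution without changing the measured field. The matrix and source values below are arbitrary and serve only to demonstrate the effect.

```python
# Toy illustration of the underdetermined inverse problem: with fewer
# microphones than candidate sources, two different source vectors can
# reproduce the same measured field. Values are synthetic.
import numpy as np

rng = np.random.default_rng(0)
M, S = 8, 20                            # 8 measurements, 20 unknown strengths
G = rng.standard_normal((M, S)) + 1j * rng.standard_normal((M, S))

q_true = np.zeros(S, dtype=complex)
q_true[[3, 11]] = [1.0, 0.5]            # sparse "true" sources

# Any vector in the null space of G can be added without changing p.
null_basis = np.linalg.svd(G)[2][M:].conj().T   # S x (S - M) null-space basis
q_alt = q_true + null_basis @ rng.standard_normal(S - M)

p1, p2 = G @ q_true, G @ q_alt
print(np.allclose(p1, p2))              # True: identical field, different sources
```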
2.3. Proposed Framework of This Paper
To address the two key challenges mentioned above, this study develops a monopole and dipole identification framework that leverages the CSM as a discriminative representation and employs a ViT architecture to capture spatially coherent patterns from complex acoustic fields.
In the task of identifying mixed monopole and dipole sources within complex multi-source fields, constructing intermediate features that accurately represent source characteristics and are suitable for neural network input is essential. Among various representations, the CSM is widely used in beamforming-based localization because it characterizes the frequency-domain coherence between signals recorded by different microphones in an array; it encodes both phase and amplitude correlations across sensor pairs, forming the foundation for spatial filtering, directional response estimation, and source localization.
In this study, we repurpose the CSM from a localization-oriented tool into a feature representation for source-type identification. By reshaping the CSM into an image-like structure, we leverage capabilities of deep neural networks in pattern identification to extract spatially discriminative features from complex sound fields.
The process begins with spatial sampling of the sound pressure field using a planar microphone array. The array comprises $M$ microphones, each recording a time-domain acoustic signal $p_m(t)$, and the measured pressure signal is divided into $K$ snapshots. These signals are first transformed into the frequency domain via FFT, yielding the frequency-domain pressure vector

$$\mathbf{p}^{(k)}(\omega) = \big[\,p_1^{(k)}(\omega),\ p_2^{(k)}(\omega),\ \dots,\ p_M^{(k)}(\omega)\,\big]^{\mathrm{T}},$$

where $p_m(\omega)$ denotes the complex pressure spectrum at the $m$-th microphone position $\mathbf{r}_m$, and $p_m^{(k)}(\omega)$ is the frequency-domain spectrum at angular frequency $\omega$ measured by the $m$-th microphone at the $k$-th snapshot.
The CSM is the fundamental data representation in array acoustics, defined as

$$\mathbf{C}(\omega) = \frac{1}{K}\sum_{k=1}^{K} \mathbf{p}^{(k)}(\omega)\,\big(\mathbf{p}^{(k)}(\omega)\big)^{*\mathrm{T}}.$$

The diagonal elements of $\mathbf{C}$ are usually set to zero in order to remove the influence of background noise. The superscript $*$ indicates complex conjugation, as employed in cross-spectral computation. This formulation reflects the standard outer product used in cross-spectral estimation.
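A minimal sketch of the snapshot-averaged CSM estimate is given below, assuming the time-domain data are available as an array of shape (M, n_samples); the function name, the snapshot length, and the FFT-bin selection are illustrative, while the outer-product averaging and diagonal removal follow the formulation above.

```python
# Sketch of the CSM estimate used as the model input, assuming `signals`
# holds time-domain data of shape (M, n_samples).
import numpy as np

def cross_spectral_matrix(signals, fs, f_target, n_snap=64):
    M, n_samples = signals.shape
    L = n_samples // n_snap                       # samples per snapshot
    C = np.zeros((M, M), dtype=complex)
    for k in range(n_snap):
        block = signals[:, k * L:(k + 1) * L]
        spec = np.fft.rfft(block, axis=1)         # frequency-domain snapshot
        bin_idx = int(round(f_target * L / fs))   # FFT bin of the target frequency
        p = spec[:, bin_idx]                      # length-M pressure spectrum
        C += np.outer(p, p.conj())                # p p^H accumulation
    C /= n_snap
    np.fill_diagonal(C, 0.0)                      # suppress uncorrelated self-noise
    return C

# Example with synthetic data: 56 microphones, 1 s at 48 kHz
sig = np.random.randn(56, 48000)
C = cross_spectral_matrix(sig, fs=48000, f_target=3000.0)
```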
The CSM encodes dipole orientation [
28], motivating its use for source-type identification in this work. While Goudarzi [
18] demonstrated the use of CSM input combined with global and local optimization as well as meshfree neural networks for fixed-source scenarios, its applicability has not yet been extended to mixed or highly randomized source conditions. The ViT architecture, with its strong global modeling capability and contextual sequence learning [
26,
30,
31], offers powerful tools for learning complex spatial relationships. It can directly map spatial phase and sound pressure features derived from the CSM and positional encoding to source types, without explicitly constructing a physical propagation model. This approach avoids the traditional dependence on strong physical priors and enables identification under conditions of high randomness.
3. Detailed Architecture of the ViT Algorithm Model
As introduced in
Section 2.3, this study proposes a ViT-based framework that leverages the CSM as its core input representation to address the challenge of identifying monopole and dipole sources in complex acoustic fields. The method operates on frequency-domain data derived from beamforming and constructs a three-channel CSM image that integrates the real and imaginary components with a spatial positional encoding derived from source localization. A multi-label neural network architecture is then employed to handle scenarios involving multiple sources with high spatial randomness.
This section first introduces the construction process of the CSM image, including the extraction of matrices at characteristic frequencies, the real and imaginary component input format, and the positional encoding strategy that incorporates source localization information. These components are ultimately combined to form a three-channel image representation suitable for ViT input. The section then details the design of the ViT-based multi-label recognition network, including input formatting, network structure, task modeling, and training procedures.
3.1. Input for ViT: CSM with Positional Encoding
In this work, the cross-spectral matrix at a characteristic frequency is selected based on source localization results. To preserve complex acoustic features and enhance the model’s spatial awareness, the real part and imaginary part are extracted and normalized as the first two channels of the image.
Meanwhile, based on preliminary beamforming localization results, the estimated coordinates of all sources in the current sample are normalized and mapped onto a 56 × 56 scanning grid, yielding a positional encoding matrix $\mathbf{P}_{\mathrm{pos}}$ that reflects the prior spatial distribution of the sources.
In this work, the number of microphones is fixed at 56. This choice is made to ensure a direct and fair comparison with the study of Lobato et al. [
15], which also employed a 56-element array for multipole source identification. By adopting the same array configuration, the simulation setup in this study maintains consistency with Lobato et al. [
15], thereby allowing a more reliable evaluation of the performance differences between the proposed CSM–ViT framework and the DAMAS–MS baseline. Finally, these three types of information are combined into a three-channel image tensor:
$$\mathbf{X} = \big[\,\mathrm{Re}(\mathbf{C}),\ \mathrm{Im}(\mathbf{C}),\ \mathbf{P}_{\mathrm{pos}}\,\big]$$

The tensor $\mathbf{X}$, with dimensions $56 \times 56 \times 3$, serves as the input to the subsequent ViT model, enabling it to jointly learn phase and amplitude features while incorporating positional information related to the source layout. This enhances the model's identification performance under complex, multi-source, and non-ideal acoustic conditions.
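The assembly of this three-channel input can be sketched as follows; the normalization scheme, the scan-plane extents, and the way estimated source coordinates are marked on the 56 × 56 positional-encoding channel are plausible assumptions, since the paper does not fix these implementation details here.

```python
# Hypothetical assembly of the three-channel input: real part, imaginary part,
# and a positional-encoding channel built from estimated source coordinates.
import numpy as np

def build_input(C, est_coords, plane=((-1.0, 1.0), (-1.0, 1.0))):
    M = C.shape[0]                                  # 56 for the array used here
    def norm(a):
        span = a.max() - a.min()
        return (a - a.min()) / span if span > 0 else np.zeros_like(a)

    pos = np.zeros((M, M))
    (x0, x1), (y0, y1) = plane
    for x, y in est_coords:                         # beamforming localization output
        i = int((x - x0) / (x1 - x0) * (M - 1))
        j = int((y - y0) / (y1 - y0) * (M - 1))
        pos[j, i] = 1.0                             # mark estimated source cell

    return np.stack([norm(C.real), norm(C.imag), pos], axis=-1)   # 56 x 56 x 3

X = build_input(np.random.randn(56, 56) + 1j * np.random.randn(56, 56),
                est_coords=[(0.1, 0.2), (-0.3, 0.0)])
print(X.shape)   # (56, 56, 3)
```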
3.2. Design of the ViT-Based Multi-Label Identification Network
The ViT is a significant recent breakthrough in the field of computer vision. Its core idea is to introduce the Transformer architecture, originally developed for natural language processing, into image processing tasks. Since it was first proposed by Dosovitskiy et al. [
32] in 2021, ViT has demonstrated outstanding performance and strong global modeling capabilities in tasks such as image identification and object detection [
25,
26,
30]. Unlike traditional convolutional neural networks (CNN), which primarily rely on local receptive fields [
33,
34], ViT divides the input image into fixed-size patches and employs a multi-head self-attention mechanism to model global features. It is especially effective in handling structured and spatially correlated image data and has gradually become one of the mainstream techniques in visual recognition [
35].
To achieve efficient identification of multiple source types in complex sound fields, this study designs a multi-label identification model based on the ViT. The model takes the image constructed from the CSM as input and leverages ViT's strength in capturing global dependencies to jointly identify different physical source types (monopole and two directional dipoles). In this work, the CSM at the target frequency is converted into a three-channel image for input to the ViT. The first channel is the real part of the CSM, the second channel is the imaginary part of the CSM, and the third channel is a spatial positional encoding derived from the source localization results. To construct this positional encoding, the estimated source coordinates are normalized and mapped onto a 56 × 56 scanning grid (matching the CSM's dimensions), producing a matrix that reflects the prior spatial distribution of the sources. We then normalize each of these matrices and stack them to form a 56 × 56 × 3 tensor, effectively treating the CSM data as an image with three channels. This CSM image preserves both the amplitude and phase information of the acoustic field while incorporating spatial context, allowing the Vision Transformer to learn global features for improved monopole and dipole source identification.
The input size of the model is 56 × 56 × 3. ViT first divides the input into non-overlapping image patches, each of size 8 × 8, resulting in 49 patches. This configuration achieves an effective balance between spatial resolution and computational efficiency. For comparison, we also evaluated a 7 × 7 patch size (64 patches) and a third patch-size setting; the former delivered comparable precision, whereas the latter led to performance degradation due to excessive fragmentation, loss of fine spatial details, and overfitting caused by increased model complexity. The analysis of these configurations is presented in
Section 4.5, where we demonstrate that the 8 × 8 patch size represents the optimal trade-off between precision and efficiency.
Each patch is linearly projected into a 768-dimensional embedding vector through a Conv2D, forming a token sequence of length 49. A learnable identification token (CLS) is prepended to the sequence, and positional embeddings are added element-wise to explicitly provide positional information. The resulting sequence of length 50 is then input into the stacked ViT encoder for feature modeling.
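The patch-embedding stage described above can be sketched as follows, assuming a PyTorch implementation; the module name and initialization are illustrative, while the 8 × 8 patches, 768-dimensional embeddings, prepended CLS token, and learnable positional embeddings follow the text.

```python
# Minimal sketch of the patch-embedding stage for the 56 x 56 x 3 input.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=56, patch_size=8, in_ch=3, dim=768):
        super().__init__()
        self.n_patches = (img_size // patch_size) ** 2          # 49
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, self.n_patches + 1, dim))

    def forward(self, x):                                       # x: (B, 3, 56, 56)
        tok = self.proj(x).flatten(2).transpose(1, 2)           # (B, 49, 768)
        cls = self.cls.expand(x.shape[0], -1, -1)
        return torch.cat([cls, tok], dim=1) + self.pos          # (B, 50, 768)

tokens = PatchEmbed()(torch.randn(2, 3, 56, 56))
print(tokens.shape)   # torch.Size([2, 50, 768])
```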
The ViT encoder consists of 12 identical encoder blocks, each with 12 attention heads, consistent with the standard ViT-Base configuration and providing sufficient modeling capacity for complex acoustic scenarios. In
Section 4.5, we further discuss the performance differences among models with 6, 9, and 12 layers, showing that shallower networks achieve comparable precision, indicating that model depth is not the performance bottleneck but primarily contributes to model stability.
Each block contains a multi-head self-attention module with 12 heads and a multilayer perceptron (MLP) module. The embedding dimension is 768, and the hidden dimension of the MLP is 3072 (an MLP ratio of 4). Each block uses residual connections and layer normalization. During encoding, the multi-head attention mechanism effectively captures global dependencies between patches and models the spatial distribution characteristics of frequency-domain interference patterns.
After feature extraction, the first token (the CLS token) in the output sequence is used as the global representation of the entire image. This token is passed through MLP heads to predict the source types. There are $S$ parallel identification heads in total, each corresponding to a specific source location and outputting its type prediction, where $S \in \{2, 5, 10\}$ represents the number of sources in the sample. Each identification head consists of two fully connected layers with channels 768→384→3, using GELU activation and dropout for regularization. Finally, a softmax layer outputs the source type at each location (0: monopole; 1: x-direction dipole; 2: y-direction dipole).
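The parallel identification heads can be sketched as below, again assuming PyTorch; the dropout rate is an assumed value, while the head count S, the 768→384→3 layer sizes, and the GELU activation follow the description above.

```python
# Sketch of the S parallel identification heads (S = 2, 5, or 10), each mapping
# the CLS representation to one of three source-type classes.
import torch
import torch.nn as nn

class SourceTypeHeads(nn.Module):
    def __init__(self, n_sources, dim=768, hidden=384, n_classes=3, p_drop=0.1):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Dropout(p_drop),
                          nn.Linear(hidden, n_classes))
            for _ in range(n_sources)
        ])

    def forward(self, cls_token):               # cls_token: (B, 768)
        # One 3-class logit vector per source position (softmax applied in the loss)
        return torch.stack([h(cls_token) for h in self.heads], dim=1)   # (B, S, 3)

logits = SourceTypeHeads(n_sources=5)(torch.randn(4, 768))
print(logits.shape)   # torch.Size([4, 5, 3])
```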
The overall architecture of the ViT and its model configuration are illustrated in
Figure 1 and
Table 1, respectively.
3.3. Training Details
To ensure statistical reliability, each experiment was independently repeated 10 times with different random seeds. All performance metrics reported in this paper represent the mean values averaged over these 10 runs. The dataset for each source configuration (2, 5, and 10 sources) was consistently partitioned into training, validation, and test sets in a ratio of 6:2:2. The number of training samples was 30,000 for two sources, 120,000 for five sources, and approximately 350,000 for ten sources, with validation and test sets scaled accordingly.
During the model training phase, the AdamW optimizer is employed to update network weights, with an initial learning rate of 1 × 10⁻⁴. A Cosine Annealing Warm Restarts (CAWR) scheduler is used to dynamically adjust the learning rate, enhancing convergence stability. The training is performed with a batch size of 64 for up to 200 epochs, and an early stopping strategy is applied on the validation set to prevent overfitting. To mitigate the impact of label uncertainty on model training, the loss function adopts label-smoothing cross-entropy, which improves the model's generalization in multi-class source identification tasks. Throughout training, the loss and accuracy curves for both the training and validation phases are recorded in real time. The model parameters corresponding to the lowest validation loss are saved for subsequent performance evaluation.
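A condensed sketch of this training configuration is shown below, assuming PyTorch and that `model`, `train_loader`, and `val_loader` (with a batch size of 64) already exist; the warm-restart period, early-stopping patience, label-smoothing factor, and checkpoint filename are assumed values, while the AdamW optimizer, the 1 × 10⁻⁴ learning rate, the cosine-annealing warm-restart schedule, and validation-based early stopping follow the text.

```python
# Hedged sketch of the training loop: AdamW + CAWR schedule,
# label-smoothing cross-entropy, and validation-based early stopping.
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=200, patience=15):
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    sched = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(opt, T_0=10)
    loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)
    best_val, wait = float("inf"), 0

    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:                       # y: (B, S) integer labels
            opt.zero_grad()
            logits = model(x)                           # (B, S, 3)
            loss = loss_fn(logits.reshape(-1, 3), y.reshape(-1))
            loss.backward()
            opt.step()
        sched.step()

        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(x).reshape(-1, 3), y.reshape(-1)).item()
                      for x, y in val_loader) / len(val_loader)
        if val < best_val:
            best_val, wait = val, 0
            torch.save(model.state_dict(), "best_model.pt")   # lowest validation loss
        else:
            wait += 1
            if wait >= patience:                        # early stopping
                break
```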
4. Simulations
To comprehensively evaluate the applicability and robustness of the proposed CSM–ViT–based source-type identification method under acoustically complex conditions, multiple simulations were designed and implemented with fixed source counts of 2, 5, and 10. Verifying identification performance under these three conditions covers a representative subset of the source-count space; integrating sub-models trained at these discrete counts allows the construction of a generalized framework capable of recognizing mixed monopole and dipole sources across arbitrary counts ranging from 2 to 10, with randomized locations and types.
4.1. Simulation Setup and Evaluation Metrics
In each case, both source types and spatial positions were randomly generated following a consistent rule: the simulation plane was divided into a 64 × 64 uniform grid. Each sample contained three possible source types: monopole, x-direction dipole, and y-direction dipole.
The microphone array layout is shown in
Figure 2, while
Figure 3 illustrates a randomized source configuration. Monopoles occupy a single red grid cell, whereas dipoles span two adjacent cells—horizontally for
x-direction and vertically for
y-direction dipoles—visually distinguished by blue horizontal and vertical pairs, respectively, to emphasize their directional characteristics. Each sample was guaranteed to contain at least one monopole and one dipole, with dipole types selected randomly. Overlapping of grid cells between different sources was not permitted.
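The randomized layout rule can be sketched as follows; the helper name and the cell-indexing convention are illustrative, while the 64 × 64 grid, the at-least-one-monopole-and-one-dipole constraint, the two-cell dipole footprint, and the no-overlap rule follow the description above.

```python
# Sketch of the randomized source-layout rule: 64 x 64 grid, at least one
# monopole and one dipole, dipoles on two adjacent cells, no cell overlap.
import random

def random_configuration(n_sources, grid=64):
    occupied, sources = set(), []
    types = ["monopole", random.choice(["dipole_x", "dipole_y"])]
    types += [random.choice(["monopole", "dipole_x", "dipole_y"])
              for _ in range(n_sources - 2)]
    random.shuffle(types)
    for t in types:
        while True:
            i, j = random.randrange(grid), random.randrange(grid)
            if t == "monopole":
                cells = {(i, j)}
            elif t == "dipole_x":
                cells = {(i, j), (i, j + 1)}            # horizontally adjacent pair
            else:
                cells = {(i, j), (i + 1, j)}            # vertically adjacent pair
            inside = all(a < grid and b < grid for a, b in cells)
            if inside and not (cells & occupied):
                occupied |= cells
                sources.append((t, sorted(cells)))
                break
    return sources

print(random_configuration(5))
```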
Model performance was evaluated using accuracy, recall, F1-score, and confusion matrix metrics within a multi-label identification framework, enabling a thorough assessment of identification accuracy and generalization capability across varied complex mixtures.
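For reference, the per-class metrics and confusion matrix can be computed as in the sketch below, assuming scikit-learn and flattened per-position labels; the toy label vectors are synthetic and only illustrate the calculation.

```python
# Illustrative computation of per-class precision, recall, F1, and the
# confusion matrix (0: monopole, 1: x-dipole, 2: y-dipole).
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix

y_true = [0, 1, 2, 0, 1, 2, 0, 2]        # ground-truth types per source position
y_pred = [0, 1, 2, 0, 2, 2, 0, 2]        # model predictions

prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, labels=[0, 1, 2])
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(prec, rec, f1)
print(cm)
```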
The simulation cases were as follows:
Case 1: Two sources with random positions and types (monopole–dipole identification across three labels).
Case 2: Five sources with random positions and types (three-label identification).
Case 3: Ten sources with random positions and types (three-label identification).
To ensure comparability among simulations, all three cases shared consistent core parameters, as summarized in
Table 2. A custom array file, Acoular_modify_array_56.xml, is used; it retains the original Acoular layout but rescales the array dimensions to match the physical aperture described in the DAMAS-MS study.
4.2. Simulation Analysis of Case 1
In this section, we conduct simulations with a fixed number of two sound sources. Each sample contains two sources, with both their spatial positions and types (monopole, x-direction dipole, y-direction dipole) randomly generated to simulate real-world scenarios characterized by high uncertainty in source distribution and type. The corresponding ViT sub-model is trained to perform a multi-label identification task, identifying the source type at each of the two positions.
The overall confusion matrix results are shown in
Figure 4, which indicates that the model achieves excellent differentiation among the three source types. The numbers of correctly identified monopoles, x-direction dipoles, and y-direction dipoles reached 487 out of 490, 508 out of 510, and 492 out of 498, respectively. The overall misidentification rate is extremely low, with minor confusion observed only between x- and y-direction dipoles. Notably, no systematic misidentification of monopoles as dipoles is observed.
Furthermore, the identification metrics (precision, recall, and F1-score) are shown in
Figure 5 and
Figure 6, remaining above 0.990 across all categories, with the highest F1-score reaching 0.998 for one of the dipole categories, demonstrating the model's stable and fine-grained identification capability under complex input structures. The F1-score heatmap further shows consistent performance across the two sources, with no noticeable variation due to spatial differences. Similarly, the heatmaps of precision and recall confirm that the internal attention mechanism effectively captures the structural association between source-type features and spatial patterns in the CSM image.
In summary, under the condition of two-source training, the proposed ViT-based architecture exhibits strong identification accuracy for randomly located and mixed-type sources, highlighting its modeling advantages and practical value under spatial uncertainty.
4.3. Simulation Analysis of Case 2
The results under the setting of five fixed sound sources are shown below. The confusion matrix is illustrated in
Figure 7; the number of correctly identified sources for each class is as follows: 470 out of 485 for monopoles, 475 out of 504 for x-direction dipoles, and 478 out of 511 for y-direction dipoles. The overall misidentification rate remains low, though slightly higher than that observed in the two-source scenario, with a noticeable increase in mutual misidentifications between dipole types. The F1-score heatmap is shown in
Figure 8, which reveals the identification performance distribution of each source across the different categories, with scores generally ranging from 0.926 to 0.980. Monopoles remain the most consistently recognized category, while the F1-scores for x- and y-direction dipoles are slightly lower, primarily due to the similarity in their spatial radiation patterns.
The precision and recall heatmaps are shown in
Figure 9, which provides a more detailed analysis of identification performance. Some source positions exhibit a clear drop in precision for x- and y-direction dipoles (as low as 0.905). Although recall at these positions remains above 0.922, the imbalance between these two metrics suggests confidence disparities and ambiguous boundary decisions when distinguishing adjacent source types. Nevertheless, the overall performance significantly surpasses that of traditional methods and remains consistent under multi-source excitation and structural randomness, indicating that the ViT architecture possesses a degree of robust spatial awareness and can stably extract discriminative features from local cross-spectral patterns.
In summary, the proposed method maintains high accuracy and strong adaptability when extended to five-source complex scenarios, demonstrating certain generalization capability and potential for practical engineering applications.
4.4. Simulation Analysis of Case 3
For the case with a fixed source count of ten, the confusion matrix is presented in
Figure 10, showing that the model maintains a prominent diagonal dominance. The numbers of correctly identified monopoles,
x-direction dipoles, and y-direction dipoles are 433 out of 479, 449 out of 495, and 465 out of 526, respectively. However, the misidentification rate increases compared to the previous two cases, with a certain degree of cross-type misidentification occurring between monopoles and dipoles. This indicates that under high source density, the decision boundaries between source types begin to be affected by spatial aliasing and feature ambiguity, revealing a degree of boundary generalization error in the model structure.
The F1-score heatmap is shown in
Figure 11, which further highlights the performance variability across source points. Compared to low-density scenarios, the F1-scores for each class are distributed in the range of 0.850–0.930, demonstrating relatively strong identification stability.
The precision and recall heatmaps are shown in
Figure 12, which exhibit more pronounced metric fluctuations. For one of the dipole categories, the minimum precision is 0.827 and the minimum recall is 0.867, indicating that this category poses greater challenges for structural discrimination under complex configurations. In contrast, although monopoles also experience some misidentification, their precision and recall remain relatively stable overall, reaffirming the distinctiveness of their radiation characteristics.
It is noteworthy that despite a certain performance degradation, all overall indicators remain above 0.850. The model still successfully performs multi-label identification even under the extreme condition of ten highly interfering sources, demonstrating that the proposed identification framework, based on the three-channel CSM construction and the ViT architecture, exhibits strong anti-interference capability and generalization, making it suitable for practical engineering applications in densely distributed multi-source scenarios.
4.5. Accuracy Under Varying Frequencies, Model Parameters and Input Channels
As shown in
Table 3, the identification accuracy (mean ± standard deviation) was evaluated under varying frequencies (2000, 3000, and 4000 Hz) and source counts (2, 5, and 10). The results show that frequency has minimal impact on performance; accuracy remains consistent across frequencies within each source condition, with a slight improvement observed at higher frequencies, indicating strong robustness in the mid-to-high frequency range. In contrast, source count exerts a more significant influence. At 3000 Hz, for example, accuracy decreases from 0.995 with two sources to 0.933 with five and 0.868 with ten sources. The standard deviation also increases with source count, suggesting reduced prediction stability in high-density scenarios. Overall, the ViT-based model demonstrates strong frequency robustness but is more sensitive to increasing source complexity.
Table 4,
Table 5 and
Table 6 present the recognition accuracy under fixed source-count settings (2, 5, and 10), evaluated with different patch sizes and numbers of encoder blocks. A consistent trend is observed: larger patch sizes result in noticeably lower accuracy. In contrast, the accuracy values for 7 × 7 and 8 × 8 patch sizes remain nearly identical across all configurations, suggesting that small to moderate patch sizes provide sufficient spatial resolution, while overly large patches reduce token density and limit the model's ability to capture detailed spatial cues.
To further validate the impact of each input component, we conducted an ablation study under the two-source scenario.
Table 7 presents the classification precision across four input configurations: using only the real part, using only the imaginary part, using both real and imaginary parts, and using the full three-channel input.
Results indicate that combining the real and imaginary parts yields significantly higher precision than using either alone—this demonstrates that concurrent modeling of amplitude and phase features enhances identification accuracy. Furthermore, incorporating positional encoding achieves the highest overall precision, confirming that explicit spatial priors provide additional benefits to the model. This outcome aligns with the original intent of our input design, which aims to “preserve the spatial and spectral structure” in the data.
Notably, under the relatively simple two-source scenario, the model maintains high precision even without positional encoding. This can be attributed to two key factors. First, the network can infer the basic source geometry directly from the CSM itself; second, and more critically, the small number of classes to distinguish in two-source cases results in a relatively high probability of correct random guessing, making the performance gain from positional encoding less prominent.
However, we hypothesize that as the number of sources increases, scenario complexity will rise substantially. In such cases, the probability of correct random guessing will drop sharply, and the coherent structures of multiple sources will tend to interfere with each other and become indistinguishable in the CSM. Here, positional encoding—by providing precise spatial prior information—will become a critical basis for distinguishing different sources, and its role in improving model performance will become far more pronounced. In future work, we will validate this hypothesis through comparative experiments under multi-source scenarios.
4.6. Noise Impact Analysis in Two-Source Conditions
To investigate the robustness of the proposed method under noisy acoustic conditions, we conducted additional tests using the two-source simulation scenario. The model was trained on a fixed dataset of 30,000 samples. White Gaussian noise was injected into the CSM at two signal-to-noise ratio (SNR) levels: 20 dB and 10 dB. As summarized in
Table 8, the classification precision dropped from 0.995 in the clean condition to 0.865 at 20 dB and 0.803 at 10 dB. While this degradation is expected, the model still maintains high identification accuracy under moderate noise levels, confirming its resilience against real-world acoustic disturbances. These results further support the practical feasibility of the proposed framework under non-ideal conditions. The performance drop is likely attributable in part to the limited training set size; in future work, we will explore scaling up the training data to further enhance noise robustness. Moreover, extending the validation to higher source-count scenarios will be an important direction for future studies.
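A sketch of the noise-injection step is given below; scaling complex white Gaussian noise against the mean CSM power is a common convention and is assumed here, since the exact noise model is not specified in the text.

```python
# Sketch of injecting complex white Gaussian noise into the CSM at a target SNR.
import numpy as np

def add_noise_to_csm(C, snr_db, rng=np.random.default_rng()):
    sig_power = np.mean(np.abs(C) ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10))
    noise = (rng.standard_normal(C.shape) + 1j * rng.standard_normal(C.shape))
    noise *= np.sqrt(noise_power / 2)
    return C + noise

C = np.random.randn(56, 56) + 1j * np.random.randn(56, 56)
C_20dB = add_noise_to_csm(C, snr_db=20)
C_10dB = add_noise_to_csm(C, snr_db=10)
```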
4.7. Summary
The combined results from the three simulation cases with different numbers of sound sources (2, 5, and 10) demonstrate that the proposed method exhibits excellent identification performance in complex acoustic scenarios characterized by highly randomized source positions and types. Although each sub-model is trained under a fixed source count setting, all are capable of stably and accurately identifying the specific type of each source in multi-label identification tasks, with key performance metrics including precision, recall, and F1-score, consistently maintained at high levels. As the number of sources increases, slight misidentification between dipole categories is observed; however, the overall model architecture still demonstrates strong anti-interference capability and effective spatial feature decoupling.
Furthermore, compared with the results of DAMAS-MS in [
15], which yields a precision of about 0.90 and a recall of approximately 0.82, the proposed method achieves higher accuracy in monopole identification while maintaining the inherent advantages of DAMAS-MS in dipole identification. In the two-source case, the model exhibits near-perfect stability, with precision, recall, and F1-scores all exceeding 0.990. For the five-source case, performance remains excellent, with precision ranging from 0.905 to 0.980, recall above 0.922, and F1-scores between 0.926 and 0.980. Even under the challenging ten-source case, where slight category confusion occurs, the method maintains precision no lower than 0.827, recall above 0.867, and F1-scores between 0.850 and 0.930. This indicates that our method not only enhances performance where conventional methods struggle but also preserves their strengths in scenarios where they perform well.
These findings suggest that the proposed method is capable of accurately identifying arbitrary combinations of 2 to 10 mixed sources with random locations and types, thereby validating the effectiveness of representative sub-model training for generalization. The results also highlight the strong robustness and engineering applicability of the method in densely populated multi-source sound fields.
6. Conclusions
This paper proposes a novel sound source-type identification framework that integrates beamforming-derived CSM and a ViT identifier, aiming to address the challenge of identifying monopole and dipole sources under unknown source type, position, and number. To the best of our knowledge, this is the first study that systematically applies ViT-based deep learning to multipole acoustic source identification using CSM inputs.
Simulations with fixed source counts (2, 5, 10) and randomized locations and types demonstrate consistently high identification accuracy across a wide range of configurations. Experimental verification in an anechoic chamber with physical monopole and dipole realizations further confirms the robustness of the model.
By combining sub-models trained on different source counts, the approach can generalize to arbitrary 2–10 source configurations. Even with a limited number of microphones, ViT can capture spatial phase and pressure correlations within CSM. Its robustness to local disturbances and input order makes it effective for recognizing source types and positions in highly randomized acoustic fields. The model outperforms the DAMAS-MS baseline in monopole identification accuracy. Importantly, it bypasses the ill-posedness of traditional inverse problems by learning directly from CSM, offering a practical and scalable solution for complex acoustic environments.