1. Introduction
Automatic speech recognition (ASR) systems are widely used to convert spoken language into text, but in multi-speaker settings, such as physician-patient consultations or interviewer–interviewee dialogues, accurate speaker diarization is required to associate transcribed segments with the correct speaker. Overlapping speech and rapid turn-taking make diarization particularly challenging in these scenarios. There are two main approaches to the diarization problem. The first method analyzes speech recorded with a single microphone and performs diarization using extracted speech features that are specific to each speaker. This approach is not robust to cases in which the voices of both speakers are similar or the tone of the speaker’s voice changes. The second method utilizes the fact that signals from each speaker arrive at a sensor from different directions. The drawback of the spatial approach is that a multi-microphone setup must be used to record speech. The algorithm presented in this paper is based on the latter approach.
Traditional approaches to speaker diarization have relied on statistical methods, such as Gaussian Mixture Models and Hidden Markov Models [1,2]. These methods were complex, and their accuracy was limited. Recent advancements in deep learning-based solutions shifted the focus to end-to-end neural speaker diarization (EEND) frameworks, leveraging neural networks to directly optimize diarization tasks without the need for separate modules for feature extraction and clustering [3,4,5]. Transformer-based models have also been introduced to capture long-term dependencies in audio data, significantly improving performance in scenarios with overlapping speech [6,7,8]. Neural diarization is particularly useful in challenging real-world scenarios, where acoustic noise, overlapping speech, and an unknown number of speakers are common [3,9]. Speaker diarization is an important part of modern ASR systems and their applications, such as audio indexing, transcription, and speaker identification [10,11,12].
From a methodological perspective, diarization techniques can be categorized into software-based, hardware-based, and hybrid approaches. Software-based approaches predominantly rely on algorithmic advancements and computational models. For instance, x-vector embeddings combined with Probabilistic Linear Discriminant Analysis (PLDA) have become a standard pipeline for speaker segmentation and clustering [13]. Hardware-based solutions involve the use of multi-microphone arrays or beamforming techniques to spatially separate speakers based on their physical location. Time Difference of Arrival (TDOA) analysis is often employed in such systems to enhance diarization accuracy by exploiting spatial cues [14,15,16]. Hybrid approaches integrate both software and hardware components; for example, combining microphone array processing with neural network-based segmentation has shown significant improvements in noisy environments [15,17]. Target-speaker voice activity detection was developed for multi-speaker diarization in a dinner-party scenario [18]. Neural diarization algorithms operating on multi-channel signals [19,20,21] and on virtual microphone arrays [22] have also been proposed. Some methods utilize spatial cues from multiple speakers for multi-channel diarization [23,24]. Other approaches combine speaker embeddings with TDOA values obtained from microphone arrays [25,26].
Despite these advancements, several challenges persist. Handling overlapping speech remains one of the most significant obstacles, particularly in multi-party conversations. The DIHARD Challenge series (2018–2021) established benchmark datasets specifically designed to evaluate diarization performance in challenging real-world scenarios with varying acoustic conditions, unknown numbers of speakers, and substantial overlapping speech [27,28]. Recent work has explored overlap-aware diarization models that explicitly predict overlapping segments using multi-label classification frameworks [7]. Data augmentation techniques, such as adding synthetic noise or reverberation during training, have been widely adopted [13,29]. Additionally, self-supervised learning has recently gained traction as a way to leverage large amounts of unlabeled audio data and improve diarization performance [30]. Systems that adapt to dynamically changing numbers of speakers are another active area of research. Online diarization methods that update speaker models in real time have shown promise but require further refinement [6,30].
While recent diarization systems achieve high accuracy in multi-speaker meetings, they typically ignore spatial cues available from compact sensors. Acoustic Vector Sensors (AVS) are small multi-microphone devices that provide information on the direction of arrival (DOA) of sound waves [31]. Therefore, they can be useful in the spatial separation of speakers [32,33] and in diarization algorithms based on spatial data [34]. Estimation of the speaker's DOA may be based on time-frequency bin selection [35], the inter-sensor data ratio model [36], or the inter-sensor data ratio model in the time-frequency domain [37]. There are also multi-speaker DOA estimation systems based on neural networks operating on AVS signals [38,39]. In contrast to neural or microphone-array diarization approaches, the advantages of employing an AVS include low-cost hardware, the small size of the sensor, interpretability, and better overlap detection thanks to the use of DOA information.
The current (as of late 2025) state-of-the-art diarization system is Pyannote.audio [40]. This neural diarization model is often the first choice for speech processing applications, such as the WhisperX ASR system [41]. The authors of this paper evaluated both systems for speech recognition in a physician-patient interview scenario and found two main issues limiting speaker diarization accuracy. First, Pyannote operates on speech features extracted from the signal; it does not utilize spatial information, which can be clearly established in this scenario. As a result, diarization errors related to speaker confusion were frequently observed. Second, Pyannote does not recognize more than one speaker in overlapping speech segments, which also decreases diarization accuracy.
This study proposes a low-complexity, interpretable diarization algorithm that leverages an AVS to obtain DOA cues from sound intensity. The proposed method exploits DOA information to distinguish speakers and handle overlapping speech without the need for neural training. By moving diarization to a pre-processing stage and using tunable parameters, the method aims to (1) reduce speaker confusion, (2) detect overlapping speech, and (3) remain robust across different voice timbres without requiring training data. The proposed approach was validated on a custom AVS-recorded interview dataset and compared to the Pyannote.audio baseline using multiple DER variants. The details of the proposed algorithm, evaluation of the proposed method, comparison of the obtained results with Pyannote, and discussion of the results are presented in the subsequent sections of the paper.
2. Materials and Methods
The algorithm for speaker diarization, presented in Figure 1, was designed for the following scenario. Two speakers are seated opposite each other in a reverberant room. An AVS is positioned between the speakers. The two speakers conduct an interview, during which they mostly speak in turns (question/answer), but there are also fragments of overlapping speech (both speakers active at the same time). Using the signal recorded with the AVS, the task of the algorithm is to detect signal fragments in which each speaker was active, and to provide a list of indices marking each detected segment together with the speaker label. The proposed diarization algorithm consists of three stages: (1) estimation of dominant DOAs from AVS-derived sound intensity histograms, (2) construction of directional beams centered at the detected azimuths, and (3) per-beam activity detection that classifies blocks as noise, single-speaker, or overlapping speech. The following subsections describe sound intensity computation, DOA estimation, and block-level activity detection.
2.1. Sound Intensity
The proposed algorithm is based on sound intensity analysis, using signals recorded with an AVS, which measures particle velocity along the three axes of a Cartesian coordinate system (X-Y-Z), and sound pressure at the center point of the sensor. Particle velocity is approximated with a pressure gradient, measured with pairs of identical, omnidirectional microphones placed on each axis, at the same distance from the sensor center point. Pressure px(t) and particle velocity ux(t) on the X axis can be calculated from the signals px1(t) and px2(t) measured with the two microphones placed on this axis (t denotes time):

Pressure and particle velocity for the Y and Z axes can be calculated using the same approach. Pressure signals obtained for all three axes are averaged to provide a single pressure signal p(t).
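Since the closed-form expressions are not reproduced in this text, the following minimal sketch (Python with NumPy) illustrates the standard p-p (finite-difference) approximation described above. The microphone spacing d, the air density rho, and the cumulative-sum integration are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def pressure_and_velocity(p_x1, p_x2, fs, d=0.025, rho=1.204):
    """Pressure and particle velocity on one axis from a microphone pair,
    using the standard p-p (finite-difference) approximation. The spacing d
    and air density rho are assumed example values, not taken from the paper."""
    p_x1 = np.asarray(p_x1, dtype=float)
    p_x2 = np.asarray(p_x2, dtype=float)
    # Pressure at the sensor centre: average of the two microphone signals.
    p_x = 0.5 * (p_x1 + p_x2)
    # Pressure gradient approximated by a finite difference over the spacing d.
    grad = (p_x2 - p_x1) / d
    # Euler's equation of motion: u_x(t) = -(1/rho) * time integral of the gradient.
    u_x = -np.cumsum(grad) / (rho * fs)
    return p_x, u_x
```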
The pressure and velocity signals are processed in the digital domain. Signal samples are partitioned into fixed-size blocks. Each block is transformed using a Discrete Fourier Transform to obtain frequency-domain spectra. The cross-spectrum between pressure and particle velocity yields the axis-specific sound intensity [42]. Sound intensity IX(ω) for the X axis can be computed as:

where ω is the angular frequency, P(ω) and UX(ω) are the spectra of the pressure and particle velocity signals, respectively, and the asterisk denotes complex conjugation. Sound intensity in the Y and Z axes is computed the same way. Total sound intensity I(ω), expressing the amount of acoustic energy measured with the AVS regardless of direction, is calculated as:
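As a rough illustration of this step, the sketch below computes per-block spectra and intensities. It assumes the common cross-spectrum convention IX(ω) = P(ω)·UX*(ω), keeps the real (active) part for direction estimation, and takes the total intensity as the Euclidean norm of the three axis components; the transform length and the absence of windowing are simplifications, not the exact Equations (3) and (4).

```python
import numpy as np

FFT_LEN = 2048   # assumed transform length (the paper reports 2048-point FFTs)

def block_intensity(p_blk, ux_blk, uy_blk, uz_blk):
    """Axis-wise and total sound intensity for one block of samples."""
    P = np.fft.rfft(p_blk, n=FFT_LEN)
    # Cross-spectra of pressure and particle velocity; the real (active)
    # part is kept for direction estimation.
    Ix = np.real(P * np.conj(np.fft.rfft(ux_blk, n=FFT_LEN)))
    Iy = np.real(P * np.conj(np.fft.rfft(uy_blk, n=FFT_LEN)))
    Iz = np.real(P * np.conj(np.fft.rfft(uz_blk, n=FFT_LEN)))
    # Total intensity, independent of direction.
    I_total = np.sqrt(Ix**2 + Iy**2 + Iz**2)
    return Ix, Iy, Iz, I_total
```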
2.2. Detection of Speaker Direction
Sound intensity (calculated with Equations (3) and (4)) is determined for each frequency component of the signal. The DOA of each component in the horizontal plane (X-Y) may be obtained using the equation:

The dominant directions of sound sources within the analyzed block may be found by calculating a histogram of the azimuth φ(ω) weighted by the total intensity I(ω). The histogram is calculated by dividing the whole azimuth range 0°–360° into bins of equal width, e.g., 5°. To improve the analysis accuracy, block histograms are averaged within a moving window of size 2L + 1 blocks. For each histogram bin representing azimuth values (φmin, φmax), the bin value bn is calculated as:
where n is the current block index, and kmin and kmax define the frequency range (the spectral bin indices) used for the histogram calculation. In the algorithm described here, the analyzed frequencies are limited to the 93.75–7992 Hz range to focus on speech-dominated content.
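The histogram construction can be sketched as follows. The atan2 azimuth convention, the 48 kHz sampling rate, the 2048-point FFT (which gives spectral bins 4–341 for the 93.75–7992 Hz range), and the window half-length L = 3 are assumptions made for illustration.

```python
import numpy as np

BIN_WIDTH_DEG = 5       # histogram bin width used as an example in the text
K_MIN, K_MAX = 4, 341   # assumed spectral-bin range for ~93.75-7992 Hz
                        # (48 kHz sampling and a 2048-point FFT are assumptions)

def block_histogram(Ix, Iy, I_total):
    """Intensity-weighted azimuth histogram of one signal block."""
    sel = slice(K_MIN, K_MAX + 1)
    # Azimuth of each spectral component in the horizontal (X-Y) plane.
    az = np.degrees(np.arctan2(Iy[sel], Ix[sel])) % 360.0
    edges = np.arange(0, 360 + BIN_WIDTH_DEG, BIN_WIDTH_DEG)
    hist, _ = np.histogram(az, bins=edges, weights=I_total[sel])
    return hist

def average_histograms(block_hists, L=3):
    """Average block histograms over a moving window of 2L+1 blocks
    (L = 3 is an assumed value)."""
    arr = np.asarray(block_hists, dtype=float)          # shape: (blocks, bins)
    kernel = np.ones(2 * L + 1) / (2 * L + 1)
    return np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, arr)
```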
The algorithm assumes that the speaker’s position remains approximately constant during the entire recording (small variations are handled by the algorithm). Therefore, histograms calculated for all signal blocks are averaged to form a single histogram for the whole recording. Then, the azimuth of each dominant sound source (each speaker) can be found by extracting local maxima from the histogram. In the algorithm proposed here, the azimuth related to each local maximum becomes the center of a beam: the azimuth range representing a given speaker. The width of the beam may be adjusted according to the distance between each speaker and the sensor, the angular distance between the speakers, etc. In the scenario presented in this paper, the speakers were positioned opposite each other, so that their azimuth distance was close to 180°. The optimal beam range for this case, found during the experiments, was ±45°. In other scenarios (e.g., multiple speakers), the beam ranges may be determined automatically.
The proposed speaker detection method requires that the histogram peaks related to sound sources (speakers) are clearly distinguishable from the noise. As long as the signal-to-noise ratio (SNR) is sufficient for speech to be audible, the algorithm is able to identify the speakers’ positions.
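A simple way to turn the recording-level histogram into speaker beams is sketched below. The peak-picking strategy (keeping the two strongest local maxima on the circular histogram) is an illustrative choice; the text only requires that the speaker peaks stand out from the noise.

```python
import numpy as np

BEAM_HALF_WIDTH_DEG = 45.0   # +/-45 degrees, as used in the interview scenario

def find_speaker_beams(global_hist, bin_width_deg=5, num_speakers=2):
    """Pick the dominant azimuths from the recording-level histogram and
    build fixed-width beams around them (illustrative peak picking)."""
    hist = np.asarray(global_hist, dtype=float)
    # Local maxima on the circular histogram.
    left, right = np.roll(hist, 1), np.roll(hist, -1)
    peak_bins = np.where((hist >= left) & (hist >= right))[0]
    # Keep the strongest peaks as speaker directions.
    strongest = peak_bins[np.argsort(hist[peak_bins])[::-1][:num_speakers]]
    beams = []
    for b in strongest:
        center = (b + 0.5) * bin_width_deg
        beams.append(((center - BEAM_HALF_WIDTH_DEG) % 360.0,
                      (center + BEAM_HALF_WIDTH_DEG) % 360.0))
    return beams
```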
2.3. Speaker Activity Detection
The final stage of the diarization algorithm analyzes the total intensity within the two non-overlapping beams (φlow, φhigh), defined for each speaker, to decide whether a block contains no speech, one active speaker, or overlapping speakers. The azimuth ranges (φlow, φhigh) for each beam are determined from the histogram calculated using the procedure described earlier. In the presented experiments, the local maxima of the histogram related to the speakers were used as the beam centers, and the ranges were set to ±45° around the center azimuth. Instead of using fixed-width beams, optimal azimuth ranges may also be derived from the calculated azimuth histogram.
The analyzed signal is segmented the same way as before. In each block, a decision is made whether there is no speaker activity, one speaker is active, or both speakers are active. The decision is made based on two criteria. The first one determines whether a block contains speech or noise, in each beam separately. The second criterion tests sound intensity distribution between the beams for overlapping speech.
For each signal block, a sum of total intensity for all analyzed spectral components having the azimuth within the beam (φlow, φhigh) is calculated:

where n is the block index, and the range (kmin, kmax) is defined as before. The calculation is performed for each beam separately. Values of en are then normalized by the mean value calculated for all blocks that contain sufficient sound intensity, discarding blocks containing only noise. A mask ξn is calculated:

where emin is a fixed threshold for the preliminary signal-noise detection. Next, a signal detection metric sn is calculated:

where N is the number of analyzed blocks. The sn values calculated for all blocks are then smoothed using exponential averaging.
The decision whether the analyzed block contains a signal is made using the condition: sn > smin. The threshold values emin and smin should be higher than the noise level observed in the analyzed signal. They may be estimated by applying Equations (7) and (9) to a recording containing only noise. If these thresholds are set too high, the risk of misclassifying blocks containing speech as noise increases.
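Because Equations (7)–(9) are not reproduced above, the sketch below shows one plausible implementation of the per-beam detection path: summing intensity inside the beam, the preliminary mask ξn, normalization by the mean of the speech-bearing blocks, exponential smoothing, and the final sn > smin test. The smoothing constant and the exact normalization are assumptions.

```python
import numpy as np

def beam_energy(I_total, az_deg, beam, k_min=4, k_max=341):
    """e_n: sum of total intensity over spectral components whose azimuth
    falls inside the beam (phi_low, phi_high); handles wrap-around at 0/360."""
    lo, hi = beam
    az = az_deg[k_min:k_max + 1]
    inside = (az >= lo) & (az <= hi) if lo <= hi else (az >= lo) | (az <= hi)
    return float(np.sum(I_total[k_min:k_max + 1][inside]))

def beam_activity(e, e_min, s_min, alpha=0.9):
    """Signal/noise decision per block for one beam (one plausible reading
    of the description; alpha is an assumed smoothing constant)."""
    e = np.asarray(e, dtype=float)
    xi = (e > e_min).astype(float)                 # preliminary signal mask
    mean_active = np.sum(xi * e) / max(np.sum(xi), 1.0)
    s = e / mean_active                            # detection metric s_n
    for n in range(1, len(s)):                     # exponential averaging
        s[n] = alpha * s[n - 1] + (1.0 - alpha) * s[n]
    return s > s_min                               # True where speech is detected
```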
If a signal is detected in only one of the beams, this beam is marked as active, and the other one as inactive. If a signal is not detected in either beam, the block is marked as inactive (noise) for both beams. However, if a signal is detected in both beams, an additional condition is checked to decide whether the block contains overlapping speech. For this purpose, the intensity distribution between the beams is calculated using the ratio rn of the beam intensity to the whole-range intensity:
The value rn represents normalized sound intensity contained within the beam (0 to 1). This value is compared with the threshold rmin, which should be larger than 0.5 (in the experiments: rmin = 0.6). If the condition rn > rmin is fulfilled for only one beam, this beam is marked as active (most of the sound intensity is concentrated within this beam), and the other one as inactive. If this condition is not met for any beam, which suggests almost equal distribution of sound intensity between two beams containing a signal, both beams are marked as active (overlapping speech).
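The complete per-block decision logic described in this and the previous paragraphs can be summarized as follows (a sketch; variable names are illustrative, and the activity flags come from the sn > smin test above).

```python
def classify_block(e_a, e_b, e_full, active_a, active_b, r_min=0.6):
    """Final per-block decision for the two beams: returns a pair of
    activity flags (speaker A, speaker B). e_full is the intensity summed
    over the whole azimuth range."""
    if not (active_a and active_b):
        return active_a, active_b            # noise, or a single active beam
    # Both beams contain signal: inspect the intensity distribution r_n.
    r_a, r_b = e_a / e_full, e_b / e_full
    if r_a > r_min:
        return True, False                   # intensity concentrated in beam A
    if r_b > r_min:
        return False, True                   # intensity concentrated in beam B
    return True, True                        # overlapping speech
```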
After the processing, each block is marked as active or inactive for each beam (activity means speech presence). The final step of speaker diarization performs post-processing of the block decisions, merging the sequences of active blocks separated by gaps smaller than the minimum allowed gap, and removing active block sequences that are too short. The obtained result consists of time indices of signal fragments with the detected speaker activity, for each speaker separately.
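The post-processing step can be implemented with simple run-length operations, for example as below; the minimum gap and minimum segment length (given here in blocks) are assumed example values.

```python
import numpy as np

def _runs(mask):
    """Start/end indices (end exclusive) of consecutive True runs."""
    idx = np.flatnonzero(np.diff(np.r_[0, mask.astype(int), 0]))
    return list(zip(idx[0::2], idx[1::2]))

def postprocess(active, min_gap=8, min_len=12):
    """Merge active segments separated by gaps shorter than min_gap blocks
    and remove segments shorter than min_len blocks (values are assumed
    examples; the paper leaves them as tunable post-processing settings)."""
    active = np.asarray(active, dtype=bool).copy()
    # 1. Close short gaps between neighbouring active segments.
    for start, end in _runs(~active):
        if 0 < start and end < len(active) and end - start < min_gap:
            active[start:end] = True
    # 2. Discard active segments that are too short.
    for start, end in _runs(active):
        if end - start < min_len:
            active[start:end] = False
    return _runs(active)   # list of (start_block, end_block) segments per beam
```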
2.4. Automatic Tuning of Threshold Values
The threshold values emin, smin, and rmin are the tunable parameters of the algorithm, affecting its accuracy. The first two thresholds should be adjusted to the noise level observed in the recordings. In the experiments presented here, the threshold values were found empirically, as described in Section 3.2. For emin and smin, initial values can be estimated with a simple statistical analysis of the signal energy, and the thresholds may then be tuned manually to improve the algorithm’s performance.
Total signal energy En in each signal block can be computed as:

A histogram of the En values obtained from all blocks is calculated using a logarithmic scale. Two distinct peaks should be visible in the histogram: one for speech and another for noise. Next, a cumulative distribution function (cdf) is computed from the histogram. The value at which the cdf reaches 0.95 is selected as the emin threshold. The choice of the 95th percentile is standard in noise analysis and corresponds to the 95% confidence level commonly used in statistics. The second threshold smin can then be calculated using a modified Equation (9):

where ξn is defined in Equation (8).
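A possible implementation of this tuning procedure is sketched below. The number of histogram bins and the final smin expression are assumptions, since the modified Equation (9) is not reproduced in this text.

```python
import numpy as np

def estimate_thresholds(blocks):
    """Initial estimates of e_min and s_min from the block-energy statistics.
    The histogram/cdf step follows the description above; the s_min formula
    is only a plausible stand-in for the modified Equation (9)."""
    E = np.array([np.sum(np.asarray(b, dtype=float) ** 2) for b in blocks])
    log_E = np.log10(E + np.finfo(float).tiny)      # logarithmic scale
    hist, edges = np.histogram(log_E, bins=100)
    cdf = np.cumsum(hist) / np.sum(hist)
    # e_min: the energy at which the cdf reaches 0.95 (95th percentile).
    e_min = 10.0 ** edges[1:][np.searchsorted(cdf, 0.95)]
    # s_min: normalise e_min by the mean energy of the blocks above it,
    # mirroring the detection metric of the previous subsection.
    xi = (E > e_min).astype(float)
    mean_active = np.sum(xi * E) / max(np.sum(xi), 1.0)
    s_min = e_min / mean_active
    return e_min, s_min
```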
The threshold value rmin is related to the energy distribution between the two beams. This value should typically be within the 0.5–0.6 range. The value 0.6 used in the experiments is a sensible default choice. It may be adjusted if significant errors are observed in overlapping speech segments.
4. Discussion
The dataset recorded by the authors was used to evaluate the performance of both the proposed diarization algorithm and the reference system, Pyannote. These two systems utilize different approaches to the speaker diarization problem. The proposed algorithm is based on the spatial distribution of sound intensity originating from different speakers, analyzed in signals recorded with an AVS. Pyannote is a solution based on deep neural networks that extracts speaker features from signals recorded with a single microphone. Performance of both methods was assessed using the DER metric, in four variants of the DER calculation method, differing in the way the overlapping speech fragments are considered. Across all DER variants, the proposed AVS-based method consistently outperformed the Pyannote baseline. Overall DER for the proposed algorithm ranged from 0.10 to 0.19 versus 0.19 to 0.23 for Pyannote; the largest gains were observed in reduced speaker confusion and higher correct-detection rates when spatial information is informative.
In Variant A, in which only the signal fragments not containing overlapping speech were included in the DER calculation, the proposed algorithm yielded lower DER scores than the reference system. It was observed that in the absence of overlapping speech, Pyannote tends to detect the wrong speaker much more frequently than the proposed method, which is probably caused by the similarity of the speech features of both speakers. The proposed algorithm is based on spatial information, so it is robust to such issues. The evaluated algorithm produced slightly more false detections, mostly caused by detecting speaker activity just outside the reference segment (i.e., sustaining the detection after the speaker activity ended). However, for speech recognition purposes, lower MD values are more important than lower FD values.
Variant B included signal fragments in which both speakers were active at the same time, requiring that at least one speaker be recognized correctly. The overall results are slightly better than for Variant A. The differences between the proposed algorithm and the reference system are consistent with those from Variant A. These results confirm that both methods handle the overlapping speech fragments correctly if only a single speaker (either one) needs to be detected.
Variant C required that both speakers be detected within the overlapping speech fragments. With this approach, the evaluated algorithm produced a higher DER than the previous variants, which was expected. Speaker diarization in the case of overlapping speech is a difficult problem, so there is a large increase in the number of confused speaker errors. However, the obtained DER value is satisfactory for the described scenario, and it is still lower than the DER obtained for the reference system in Variant A (without overlapping speech). The Pyannote system cannot operate in this Variant, because by design, it only detects a single speaker.
Variant D differed from Variant B in that, within the overlapping speech fragments, it required the detector to sustain its previous decision when an overlapping speech segment began. The differences in results between the proposed algorithm and the reference system (D vs. B) were similar for both methods. The experimental results suggest that Pyannote handles overlapping speech cases in this manner, although this could not be confirmed.
Comparison of the results obtained for the individual recordings (Figure 5) indicates that the distribution of results is similar in most of the recordings. As expected, there is no correlation between the persons participating in the recording (male/female voices, Table 1) and the obtained results. Recording #2 produced a significantly higher DER (0.31) than the other recordings. This was the only recording in which one of the speakers was located further from the sensor (more than 2 m away). This resulted in a lower signal-to-noise ratio for this speaker and, consequently, a higher percentage of false detections (19%). However, FD errors (noise detected as speech) are less critical for speech recognition than the other types of errors, although they increase the risk of hallucinations from the ASR model. The algorithm accuracy in cases like this may be improved by a better selection of the threshold values. Automatic calculation of the optimal algorithm parameters is left for future research.
The results of the experiments may be summarized as follows. The proposed algorithm for speaker diarization using AVS and sound intensity analysis provides a satisfactory accuracy of speaker diarization, both without and with overlapping speech fragments. It outperformed Pyannote in the interview scenario with two speakers seated opposite each other, as confirmed with lower DER scores across all analyzed variants. These improvements stem from leveraging DOA cues that disambiguate similar voice timbres and help sustain speaker decisions during rapid turn-taking or partial overlaps. In contrast, single-channel neural models rely mainly on spectral features and require source-separation modules or overlap-aware architectures to fully resolve concurrent speakers.
Both approaches, the algorithmic one based on spatial information and the machine learning one based on speaker features, have practical advantages and disadvantages. The current state-of-the-art systems, such as Pyannote, operate on signals obtained with a single microphone, and they provide a complete, end-to-end solution for speech diarization and recognition. However, these systems are unable to handle overlapping speech unless they are supplemented with a source separation model. The source separation problem is complex, and it is outside the scope of this paper. In comparison, the proposed approach performs only the diarization stage (speech recognition is still performed with an external ASR system), and it requires a specific sensor (which, however, is a low-cost and small-sized device). The main advantage of the proposed algorithm is that it is robust to similarities in the speech features of the speakers, as it operates on spatial information obtained from the sound intensity analysis. As long as there is a sufficient azimuth difference between the speakers (as is the case in the presented interview scenario), accurate speaker diarization is possible even in the case of overlapping speech. This is evident when comparing the CS (confused speaker) results: 11.9% for the proposed algorithm with overlapping speech versus 14.9% for the reference system without overlapping speech. The proposed method is independent of speaker features, such as age, gender, manner of speech, spoken language, etc. Another advantage of the proposed method is that the obtained spatial information can be used to perform speaker separation. While this aspect is outside the scope of this paper, a previous publication [33] has shown that this approach allows for the separation of overlapping speech into streams of individual speakers with sufficient accuracy. Additionally, the algorithm proposed in this paper contains parameters that can be tuned to improve its accuracy and to adapt it to the acoustic conditions in the room. The reference system is a ‘black box’ approach, without the possibility of altering its function. Lastly, the proposed method is easy to implement, and it does not require a large dataset to train the diarization system.
This paper focuses on an interview scenario with two speakers seated opposite each other and the sensor placed between them. The dataset presented in this manuscript was created to evaluate this specific scenario. However, other tests performed by the authors indicate that the proposed algorithm can also work correctly in different speaker configurations. In a previous publication [33], it was shown that in a simulated environment, the AVS-based algorithm can separate sound sources if the azimuth distance between two speakers is at least 15°. Tests performed in real reverberant rooms showed that the required azimuth distance between two speakers is at least 45°. Therefore, the proposed algorithm is expected to work correctly with two speakers separated by an azimuth distance of 45° to 180°. Additionally, the speaker detection algorithm can detect more than two peaks in the histogram, so it can also work correctly in a multi-speaker scenario, provided that the azimuth distance between each pair of speakers is at least 45°. The authors performed tests with two speakers seated next to each other on the same side of a table, with four speakers seated around a table, etc., and the detection procedure worked as expected. The speaker activity detection algorithm needs to be extended to handle more than two beams. However, these scenarios require different test recordings for their validation. Therefore, the authors decided to focus on a specific scenario in this paper, and other scenarios are left for future publications.
In terms of computational complexity, the proposed algorithm is relatively simple. In the algorithm used for testing, processing each block of 128 samples of the six-channel sensor signal required applying seven digital filters of length 512 each to correct the microphone characteristics [44], then computing the forward and inverse Fourier transforms (FFT and IFFT) of length 2048 and computing the histogram; the remaining operations are simple arithmetic. Unlike machine learning approaches such as Pyannote, which require powerful hardware, the proposed algorithm can run on low-power hardware, such as a Raspberry Pi.
Limitations of the current study include the assumption of approximately stationary speakers and the interview geometry (opposite seating). Recording #2, where one speaker was >2 m away, illustrates sensitivity to noise; better parameter selection or automatic tuning could mitigate this. The case of a speaker moving during the recording may be handled by performing the speaker detection procedure in shorter segments (e.g., 3 s long) and applying a source tracking procedure. Further evaluation of the algorithm is also needed, including test recordings with a larger number of speakers, varying distance between the speakers and the sensor, speech recorded in varied acoustic environments, with different SNR, various noise types, etc. Future work should address moving speakers, more than two concurrent talkers, improved automatic parameter tuning, integration with source separation to support end-to-end ASR, and testing the algorithm in different acoustic conditions, with different speaker configurations.
5. Conclusions
The results of the experiments performed on the custom dataset indicate that the proposed diarization algorithm based on an AVS works as expected, and the obtained DER scores are better than those of the reference system. The proposed approach reduced the number of substitution errors observed in the reference system, which occurred when two speakers were active concurrently or when there was a rapid transition between speakers. These situations may also lead to transcription errors. The proposed method employs an additional modality (DOA analysis) to improve diarization accuracy. The presented algorithm does not require training, it is not based on speaker profiles, and it does not depend on the characteristic features of a speaker or their manner of speaking. Additionally, it can be used to detect overlapping speech and to perform speaker separation before the pre-processed signals are passed to an ASR system. As a result, the efficiency of a system for automatic speech transcription of a dialogue may be improved. This is important for practical applications, such as a speech-to-text system for automatic documentation of medical procedures in a physician-patient (outpatient) scenario. In practical terms, AVS-based diarization is attractive for applications that call for compact, low-cost hardware, interpretable spatial cues, and minimal training. By operating as a pre-processing module, it can reduce diarization errors that would otherwise propagate into downstream ASR transcripts.
In this paper, a scenario in which two speakers at constant positions participated in a dialogue was presented. The proposed method may be generalized for a larger number of speakers, who may be active concurrently, and who may change their position. To realize this goal, the algorithm must be supplemented with source tracking and speaker separation algorithms. Additionally, a voice activity detector may be added to discard sounds not related to speech, which may reduce the error rate even further. Moreover, automatic tuning of the algorithm parameters may help reduce diarization errors when there is a large difference in loudness between the speakers. These issues will be the topic of future research.