1. Introduction
Voice activity detection (VAD) is a core front-end component in modern speech systems, where it separates speech from non-speech segments to reduce the computational load, latency, and false triggers of downstream components such as automatic speech recognition (ASR) and speaker verification [1]. Traditional VAD methods, including energy-based detectors, statistical model-based approaches, and more recent neural VAD models, are typically speaker-agnostic and only indicate whether any speech is present [1]. In realistic multi-speaker and noisy environments, however, this speaker-agnostic formulation is often insufficient for personal devices and voice assistants [1,2], which should react only to their enrolled user rather than to every nearby talker. As a result, there has been growing interest in methods that can detect the speech activity of a specific target speaker at the frame level while ignoring other speakers and background noise [2,3].
Personal voice activity detection (PVAD) directly addresses this requirement by conditioning the detector on a target speaker [4]. The original Personal VAD work proposed a neural VAD-like network that takes as additional input either a speaker embedding or a speaker verification score derived from an enrollment utterance of the target user [4]. For each frame, the model outputs three posterior probabilities—non-speech, target speaker speech, and non-target speech—so that a streaming on-device ASR system can gate its computation based on whether the enrolled user is speaking, thereby reducing battery consumption and unintended activations without relying solely on a keyword detector [4]. More recently, Personal VAD 2.0 extended this framework with improved conditioning strategies, an architecture optimized for on-device deployment, and training schemes that support both enrollment-based and enrollment-less operation, substantially improving detection quality while meeting tight latency and resource constraints [5].
Beyond these specific architectures, several recent studies have explored how to make PVAD more robust and practical under real-world constraints. One line of work focuses on the dependence on high-quality speaker embeddings extracted from a separate speaker verification model. Efficient PVAD with wake-word reference speech, for example, replaces static embeddings with frame-level features from short reference speech (such as a wake word), directly injecting these features into the detection network and avoiding a separate large verification model [6]. Other work demonstrates that PVAD remains feasible even when the enrollment utterance is extremely short: a personal VAD with ultra-short reference speech shows that, by continuously updating internal states as target speaker representations and carefully designing the training objective, the system can reliably detect target speaker activity from very limited enrollment audio [7]. PVAD has also been studied as part of broader personalized pipelines, such as speaker verification-oriented SVVAD and system-level comparative analyses that evaluate multiple personalized VAD architectures under realistic device, noise, and usage conditions [8].
In parallel with these advances, we adopt a flexible design termed FDE-PVAD (Flexible Detachable Encoder Personal Voice Activity Detection) [9]. FDE-PVAD is built on a dynamic encoder RNN front-end that can operate either as a conventional VAD or as a personalized PVAD by attaching or detaching a lightweight personalization module, enabling a smooth transition between generic and user-specific functionality without retraining the entire network [9]. This design yields a compact model that reuses a single encoder for both tasks, reduces redundant computation, and fully exploits the speech-related representations learned by the core VAD encoder for downstream personal activity classification [9]. Concretely, FDE-PVAD adopts a recurrent neural network implemented as a long short-term memory (LSTM) model as the primary temporal sequence learner in all major modules: the encoder RNN aggregates frame-level acoustic features into context-aware hidden states shared by both the base VAD decision head and the detachable PVAD module, while additional LSTM layers in the personalization path refine these hidden states with target speaker information [9,10].
Although LSTM-based RNNs are widely used in PVAD and provide strong temporal modeling capacity, they are not necessarily the best choice in terms of the accuracy–efficiency trade-off for on-device deployment. Recent advances in sequence modeling, such as Transformer and its variants, often yield superior performance on long-context speech tasks [11], but their quadratic or otherwise heavy complexity makes them less attractive for low-latency, resource-constrained PVAD. To bridge this gap, a family of so-called linear recurrent models has been proposed, including architectures like Mamba [12], HGRN [13], and HGRN2 [14], which aim to retain much of the modeling power of modern sequence models while keeping time and memory complexity closer to that of classical RNNs. Among them, HGRN2 employs gated linear recurrence with non-parametric state expansion to significantly increase effective memory capacity without adding trainable parameters, making it a promising backbone for long-context, on-device speech processing.
Motivated by these developments, in this work, we instantiate the FDE-PVAD framework with an HGRN2-based encoder, referred to as FDE-HGRN2, by replacing the LSTM modules in the original FDE-RNN front-end with HGRN2 [14]. We systematically investigate how this substitution affects the accuracy, robustness, and efficiency of personal voice activity detection, and show that FDE-HGRN2 can provide a better balance between model capacity and on-device cost while preserving the flexibility of the original FDE-PVAD design [9,10]. In particular, we study both the VAD and PVAD branches, integrate HGRN2 into the dynamic encoder and personalization paths, and combine the architectural changes with a cosine-annealing learning rate schedule to improve training stability under practical batch size constraints.
The main contributions of this work are summarized as follows:
We instantiate the Flexible Dynamic Encoder PVAD (FDE-PVAD) framework with an HGRN2-based backbone, yielding FDE-HGRN2, which replaces all LSTM modules in the original FDE-RNN while preserving its detachable PVAD architecture and dynamic encoder gating mechanism.
We provide a detailed integration of HGRN2 into both the VAD and PVAD branches, including a parallelizable prediction/encoder design and a personalized path with FiLM conditioning and GLU refinement, demonstrating that gated linear recurrence with state expansion can be effectively adapted to PVAD.
Through experiments on a LibriSpeech-derived benchmark, we show that FDE-HGRN2 consistently improves mean Average Precision and frame-wise accuracy over the LSTM-based FDE-RNN and several prior PVAD systems, while reducing backbone parameters by about 15% and achieving smaller models than many existing methods.
We analyze robustness under practical constraints, demonstrating that FDE-HGRN2 maintains or improves performance under reduced batch sizes and across a wide range of SNR conditions, whereas the LSTM-based FDE-RNN degrades more noticeably, indicating more stable optimization and better noise robustness for the HGRN2 backbone.
We conduct ablations and architectural comparisons (with and without CALR, and with LSTM vs. HGRN2 in the dynamic encoder) to isolate the contributions of the proposed components, confirming that the HGRN2-based DE-RNN and CALR jointly provide a superior accuracy–efficiency trade-off for deployment-oriented PVAD.
2. Overview of the Backbone Model: FDE-RNN
The Flexible Dynamic Encoder Recurrent Neural Network (FDE-RNN) constitutes a significant advancement in the development of unified voice activity detection (VAD) and personalized voice activity detection (PVAD) frameworks. This novel architecture facilitates seamless task adaptation while minimizing redundant computational overhead, thereby addressing a fundamental limitation of conventional PVAD systems that necessitate full model execution even for elementary VAD operations.
The FDE-RNN incorporates a modular design comprising two principal components: the Dynamic Encoder RNN (DE-RNN, a recurrent front-end that dynamically skips its state update on non-speech frames to avoid accumulating noise into the shared speech representation), functioning as the VAD front-end, and a detachable Personalization module (P-module) responsible for speaker-conditioned PVAD processing. This architectural separation enables the FDE-RNN to operate as an ultra-lightweight VAD with merely 40.4K parameters—approximately 30% of those in competing architectures—while preserving state-of-the-art performance across both detection tasks.
The complete architecture of FDE-RNN, illustrated in Figure 1, employs a two-stage processing pipeline. In the first stage, the DE-RNN extracts and encodes speech-relevant acoustic features, after which the P-module is conditionally activated for segments containing detected speech. This selective operational strategy ensures efficient resource utilization and high scalability, making the FDE-RNN particularly well-suited for real-world deployment scenarios.
2.1. Dynamic Encoder RNN (DE-RNN): Core Mechanism
DE-RNN extends dynamic neural network principles to VAD, comprising interdependent Prediction and Encoder RNN modules that coordinate through a gating mechanism. This design eliminates redundant computations on non-speech frames while accumulating robust speech representations over time.
The Prediction RNN generates frame-level VAD probabilities p_t by processing the concatenation of the current acoustic features x_t and the prior Encoder hidden state h_{t-1}:

$$p_t = \sigma\!\left(\mathrm{PredRNN}\left(\left[\mathbf{x}_t;\ \mathbf{h}_{t-1}\right]\right)\right) \quad (1)$$

The Encoder RNN selectively updates its state based on the prediction confidence against a threshold θ:

$$\mathbf{h}_t = \begin{cases} \mathrm{EncRNN}\!\left(\mathbf{x}_t,\ \mathbf{h}_{t-1}\right), & p_t \ge \theta \\ \mathbf{h}_{t-1}, & p_t < \theta \end{cases} \quad (2)$$
This conditional updating ensures that the Encoder accumulates speech-specific latent representations resilient to non-target speaker interference, providing an ideal feature basis for subsequent PVAD processing.
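The two-stage mechanism above can be sketched in a few lines (a minimal NumPy illustration, not the authors' implementation: single affine maps stand in for the Prediction and Encoder LSTM cells, and `theta` is the VAD confidence threshold):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def de_rnn_step(x_t, h_prev, W_pred, W_enc, theta=0.5):
    """One DE-RNN step: predict a VAD probability from [x_t; h_prev],
    then update the Encoder state only if the probability clears theta."""
    z = np.concatenate([x_t, h_prev])
    # Prediction path (stand-in: a single affine map + sigmoid)
    p_t = sigmoid(z @ W_pred).item()
    if p_t >= theta:
        # speech frame: the Encoder accumulates the new representation
        h_t = np.tanh(z @ W_enc)
    else:
        # non-speech frame: the state is frozen, so noise is not assimilated
        h_t = h_prev
    return p_t, h_t
```

With this structure, non-speech frames leave the encoder state untouched, so only speech frames contribute to the accumulated representation.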
2.2. Personalization Module: Speaker-Adaptive Processing
The P-module receives the Encoder outputs h_t and adaptively fuses them with the raw acoustics through a confidence-weighted residual connection:

$$\mathbf{z}_t = p_t \odot \mathbf{h}_t + \left(1 - p_t\right) \odot \mathbf{x}_t \quad (3)$$
High VAD confidence prioritizes refined latent features; low confidence incorporates more raw spectral details to counter potential false positives. This adaptive mechanism significantly enhances downstream PVAD robustness.
Speaker conditioning occurs via Feature-wise Linear Modulation (FiLM):

$$\tilde{\mathbf{z}}_t = \boldsymbol{\gamma}(\mathbf{e}) \odot \mathbf{z}_t + \boldsymbol{\beta}(\mathbf{e}) \quad (4)$$

where γ(e) and β(e) are affine transformation parameters generated from the target speaker embedding e via a FiLM generator. Temporal modeling follows through an LSTM, yielding the final PVAD classifications:

$$\hat{\mathbf{y}}_t = \mathrm{softmax}\!\left(\mathbf{W}\,\mathrm{LSTM}\!\left(\tilde{\mathbf{z}}_t\right)\right) \quad (5)$$
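The confidence-weighted fusion and FiLM conditioning can be sketched as follows (a minimal NumPy illustration with hypothetical weight names, not the paper's implementation; the FiLM generator is shown as a single affine map per parameter):

```python
import numpy as np

def fuse(h_t, x_t, p_t):
    """Confidence-weighted residual fusion of encoder output and raw acoustics."""
    return p_t * h_t + (1.0 - p_t) * x_t

def film_generator(e, W_gamma, b_gamma, W_beta, b_beta):
    """Hypothetical FiLM generator: maps a speaker embedding e to
    per-channel affine parameters (gamma, beta)."""
    return e @ W_gamma + b_gamma, e @ W_beta + b_beta

def film_modulate(z_t, gamma, beta):
    """Feature-wise Linear Modulation: scale and shift each channel."""
    return gamma * z_t + beta
```

High VAD confidence (p_t near 1) routes the refined encoder features into the P-module, while low confidence mixes in more of the raw spectral input, matching the behavior described above.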
2.3. Parallel Training Paradigm
FDE-RNN employs joint optimization of the VAD and PVAD objectives through parallel training. During training, the P-module executes at every time step regardless of the VAD gating:

$$\hat{\mathbf{y}}_t = \mathrm{P\text{-}module}\!\left(\mathbf{z}_t,\ \mathbf{e}\right), \quad \forall t \quad (6)$$

This strategy exposes the P-module to the full input distribution (both speech and non-speech frames), enhancing generalization. Inference reverts to conditional execution for efficiency.
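The training-versus-inference difference amounts to one condition in the processing loop, sketched here in plain Python (`p_module` is a stand-in callable, not the actual module):

```python
def pvad_path(frames, vad_probs, p_module, theta=0.5, training=True):
    """During training the P-module runs on every frame; at inference
    it is skipped on frames whose VAD probability is below theta."""
    outputs = []
    for x_t, p_t in zip(frames, vad_probs):
        if training or p_t >= theta:
            outputs.append(p_module(x_t))
        else:
            outputs.append(None)  # P-module not executed on this frame
    return outputs
```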
2.4. Architectural Advantages and Research Value
In summary, FDE-RNN’s principal advantage lies in its modular architecture: the DE-RNN can operate as a standalone VAD front-end by detaching the P-module, while the shared encoder representations learned under the VAD objective are directly reused by the personalization path via confidence-weighted fusion and FiLM conditioning. This design avoids redundant computation on non-speech frames, reduces parameter overhead, and provides a controlled backbone for studying individual component contributions—properties that make it a suitable basis for the HGRN2-based extensions proposed in the present work.
3. Proposed Framework: FDE-HGRN2
In this study, we propose replacing the individual time-series models in the original FDE-RNN architecture with HGRN2 to investigate whether it can enhance PVAD performance. Unlike traditional RNNs plagued by vanishing gradients or Transformers burdened by quadratic complexity, HGRN2 achieves Transformer-level expressivity through non-parametric state expansion with linear scaling. The following sections introduce HGRN2’s core formulation and its advantages for PVAD tasks.
3.1. Introduction to HGRN2
Hierarchically Gated Recurrent Neural Network 2 (HGRN2) is a recent linear recurrent architecture that attains Transformer-competitive performance while maintaining true linear scaling in sequence length. Its key innovation is a non-parametric state expansion mechanism that expands hidden states from d-dimensional vectors to d × d matrices via outer products, thereby providing greater effective capacity without introducing additional trainable parameters.
Intuitively, a standard RNN hidden state can be thought of as a fixed-size notepad of dimension d: the model must compress all past information into this fixed-length vector, which limits how much temporal context it can retain. HGRN2 replaces this notepad with a matrix of size d × d, giving the model d times more effective writing space. Crucially, this expansion is achieved without purchasing more parameters: the matrix is constructed at each time step as an outer product of two existing d-dimensional gate vectors, so the additional capacity comes at zero trainable-parameter cost. The sections below formalize this mechanism and describe how it overcomes a known theoretical limitation of standard linear RNNs.
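The zero-parameter nature of the expansion is easy to see in code (a NumPy sketch; `k_t` and `v_t` stand for the two gate-derived vectors the model already computes):

```python
import numpy as np

def expand_state(k_t, v_t):
    """Non-parametric state expansion: lift two d-dimensional vectors to a
    d x d matrix via an outer product. No trainable parameters involved."""
    return np.outer(k_t, v_t)
```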
3.1.1. Limitations of Traditional RNNs and Transformers
Vanilla RNNs inherently suffer from vanishing gradients during backpropagation through time (BPTT). At each time step t, the hidden state is updated as

$$\mathbf{h}_t = \tanh\!\left(\mathbf{W}_h \mathbf{h}_{t-1} + \mathbf{W}_x \mathbf{x}_t\right) \quad (7)$$

and the nonlinear tanh activation causes gradients to decay exponentially over long sequences, effectively limiting temporal memory. LSTMs and GRUs alleviate this issue by introducing gating mechanisms, but their strictly sequential recurrence prevents parallelization across time during training. In contrast, standard Transformers support fully parallel training over time, yet incur a quadratic O(N²) computational and memory complexity with respect to sequence length N, which is often prohibitive for long sequences and on-device deployment.
3.1.2. Linear RNNs and the Capacity Lower Bound
Linear RNNs address the sequential bottleneck by enforcing linearity in the recurrent path, which enables associative recurrence and facilitates stable gradient propagation over long horizons. A typical gated linear RNN updates the hidden state h_t and produces an output y_t as

$$\mathbf{f}_t = \sigma\!\left(\mathbf{W}_f \mathbf{x}_t\right), \quad \mathbf{i}_t = 1 - \mathbf{f}_t, \quad \mathbf{c}_t = \mathbf{W}_c \mathbf{x}_t$$
$$\mathbf{h}_t = \mathbf{f}_t \odot \mathbf{h}_{t-1} + \mathbf{i}_t \odot \mathbf{c}_t \quad (8)$$
$$\mathbf{y}_t = \mathbf{W}_o \mathbf{h}_t$$

Because the gates depend only on the input sequence x_{1:N}, all gate values can be computed in parallel across time. The remaining linear recurrence in Equation (8) then admits an implementation via a parallel prefix-sum (scan), leading to an O(N) training complexity in sequence length N and avoiding the O(N²) bottleneck of self-attention.
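The scan-based evaluation of this recurrence can be illustrated as follows (a minimal NumPy sketch of the associative operator; a production implementation would use a GPU prefix-scan kernel rather than this recursive form):

```python
import numpy as np

def combine(a, b):
    """Associative operator for h_t = f_t * h_{t-1} + u_t, with each step
    represented as a pair (f, u): applying a, then b, composes to
    (f_b * f_a, f_b * u_a + u_b)."""
    fa, ua = a
    fb, ub = b
    return fb * fa, fb * ua + ub

def scan(pairs):
    """Inclusive scan over (f, u) pairs via divide-and-conquer, mirroring
    the parallel prefix-sum structure used in practice."""
    if len(pairs) == 1:
        return pairs
    mid = len(pairs) // 2
    left, right = scan(pairs[:mid]), scan(pairs[mid:])
    carry = left[-1]  # composition of the entire left half
    return left + [combine(carry, p) for p in right]
```

Starting from h_0 = 0 with u_t = i_t ⊙ c_t, the u-component of the t-th inclusive composition equals h_t from the sequential recurrence, which is what makes the training-time parallelization exact rather than approximate.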
Several architectures build on this idea. The direct predecessor to our core module, the Hierarchically Gated Recurrent Neural Network (HGRN) [13], stacks multiple such gated linear layers, where lower layers focus on high-frequency local patterns and higher layers capture low-frequency global context. State-space models such as Mamba [12] also employ linear recurrent state-space updates, where each channel is associated with a structured SSM state of dimension n ≪ d.
Despite their computational efficiency, these models face a fundamental theoretical limitation. Recent analyses on associative recall and state-tracking tasks establish a strict theoretical lower bound on the required state dimension to effectively solve complex sequence modeling problems. For a linear RNN to robustly memorize and retrieve information over long horizons, its memory capacity (the maximum amount of information a recurrent state can store and retrieve over time, which scales with the dimensionality of the hidden state) must scale adequately. Because standard HGRN and state-space models maintain a hidden state restricted to a d-dimensional vector or a small fixed size n, their capacity fundamentally falls short of this lower bound, restricting their expressive power when handling highly complex temporal dependencies.
3.1.3. HGRN2: Overcoming the Lower Bound via State Expansion
HGRN2 [14] is designed to directly overcome the theoretical lower bound limitation of the original HGRN while preserving its favorable O(N) training complexity. Rather than relying solely on element-wise interactions in a constrained d-dimensional hidden space, HGRN2 lifts the recurrent state into a d × d matrix using outer products, thereby enabling non-parametric state expansion (i.e., enlarging the effective recurrent state from a d-dimensional vector to a d × d matrix via outer products, without introducing additional trainable parameters).
The core recurrent computation of an HGRN2 cell at layer ℓ is depicted in Figure 2. To preserve the ability to pre-compute gates in parallel, they are derived purely from the layer input x_t^ℓ. In practice, to maximize hardware utilization, the independent weight matrices are concatenated into a single fused projection matrix, computing all pre-activation vectors simultaneously before chunking. Mathematically, this is equivalent to:

$$\left[\mathbf{i}_t^{\ell};\ \mathbf{o}_t^{\ell};\ \mathbf{f}_t^{\ell}\right] = \mathbf{W}_{iof}^{\ell}\,\mathbf{x}_t^{\ell} \quad (15)$$

where the sub-blocks of W_{iof}^ℓ are distinct trainable weight matrices for the input, output, and forget gates, respectively.
To prevent the exponential decay of historical information over long sequences, HGRN2 natively incorporates a lower bound parameter λ^ℓ into the forget gate, forming the actual decay factor g_t^ℓ:

$$\mathbf{g}_t^{\ell} = \boldsymbol{\lambda}^{\ell} + \left(1 - \boldsymbol{\lambda}^{\ell}\right) \odot \sigma\!\left(\mathbf{f}_t^{\ell}\right) \quad (16)$$
The matrix-valued hidden state S_t^ℓ ∈ ℝ^{d×d} is then updated via the non-parametric state expansion recurrence:

$$\mathbf{S}_t^{\ell} = \mathrm{Diag}\!\left(\mathbf{g}_t^{\ell}\right)\mathbf{S}_{t-1}^{\ell} + \left(1 - \mathbf{g}_t^{\ell}\right) \otimes \mathbf{i}_t^{\ell} \quad (17)$$

where the outer product ⊗ expands the vectors into a d × d capacity matrix. This operation provides d² effective state dimensions, satisfying the theoretical capacity lower bound without introducing new trainable parameters. The complete single-step computation flow is illustrated in Figure 3, which shows how the three gate paths are computed in parallel before being combined through the outer-product state expansion.
Finally, the state is contracted using the output gate and passed through Layer Normalization (LN) and an output projection layer (W_out^ℓ) to yield the cell output y_t^ℓ:

$$\mathbf{y}_t^{\ell} = \mathbf{W}_{out}^{\ell}\,\mathrm{LN}\!\left(\sigma\!\left(\mathbf{o}_t^{\ell}\right)^{\top} \mathbf{S}_t^{\ell}\right)$$
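Putting the pieces together, a single HGRN2 cell step can be sketched as follows (an illustrative NumPy rendering of the formulation in this subsection; weight names are assumptions and LayerNorm is omitted for brevity, so this is a sketch rather than the reference implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hgrn2_step(x_t, S_prev, W_i, W_o, W_f, W_out, lam):
    """One HGRN2 cell step: input-only gates, a lower-bounded decay, an
    outer-product state expansion, and an output-gate contraction."""
    i_t = x_t @ W_i                    # input branch
    o_t = sigmoid(x_t @ W_o)           # output gate (contracts the state)
    f_t = sigmoid(x_t @ W_f)           # raw forget gate
    g_t = lam + (1.0 - lam) * f_t      # decay with lower bound lam
    # non-parametric state expansion: rank-1 outer-product update
    S_t = g_t[:, None] * S_prev + np.outer(1.0 - g_t, i_t)
    # contract with the output gate, then project (LayerNorm omitted)
    y_t = (o_t @ S_t) @ W_out
    return y_t, S_t
```

Note that setting lam to 1 makes the decay g_t identically 1, so the state matrix is carried over unchanged; this property is exploited later for dynamic encoding.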
3.1.4. Multi-Head Parallelization and the HGRN2 Block
A straightforward implementation of the state update would require O(d²) operations per time step, leading to an overall complexity of O(Nd²). To keep HGRN2 computationally efficient, a multi-head parallelization scheme is adopted. The d channels are partitioned into H heads, each operating on a reduced subspace defined by the expansion ratio r = d/H.
For head h ∈ {1, …, H} in layer ℓ, the matrix recurrence in Equation (17) is localized strictly to its (d/H) × (d/H) subspace. Because each head is computed independently and in parallel, the per-head operations scale with O((d/H)²). This multi-head structural design effectively reduces the total recurrent complexity to O(Nd²/H), drastically reducing computational overhead.
While the aforementioned multi-head state expansion dictates the temporal modeling, integrating HGRN2 into a deep architecture requires a robust block design. As formalized in Equation (18), the independent outputs of the H parallel heads are first concatenated and passed through LayerNorm to stabilize the hidden state distribution, after which the linear projection layer (W_out^ℓ) performs cross-head information mixing, ensuring that features processed independently within different subspaces are effectively aggregated before being passed to the next layer:

$$\mathbf{y}_t^{\ell} = \mathbf{W}_{out}^{\ell}\,\mathrm{LN}\!\left(\left[\sigma\!\left(\mathbf{o}_t^{\ell,1}\right)^{\top}\mathbf{S}_t^{\ell,1};\ \dots;\ \sigma\!\left(\mathbf{o}_t^{\ell,H}\right)^{\top}\mathbf{S}_t^{\ell,H}\right]\right) \quad (18)$$
In summary, the complete HGRN2 block combines (i) linear recurrent updates for parallel processing, (ii) non-parametric state expansion for high capacity, and (iii) dynamic lower bounds with multi-head mixing. In the original literature, a complete HGRN2 block comprises a core recurrent cell (often denoted HGRU2) followed by a Gated Linear Unit (GLU). To maximize computational efficiency while preserving representation power where it matters most, our proposed architecture strategically deploys heterogeneous blocks: we utilize the lightweight core cell (HGRU2 without the GLU) for the front-end modules, and the full block (HGRU2 + GLU) for the back-end. These properties make HGRN2 an ideal backbone for the FDE-HGRN2 PVAD architecture.
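The multi-head state update can be sketched as follows (an illustrative NumPy sketch; `g` and `i` are the per-channel decay and input vectors from the cell equations above, and the per-head states are kept in a list for clarity):

```python
import numpy as np

def multihead_state_update(g, i, S_prev, H):
    """Multi-head HGRN2 state update sketch: split the d channels into H
    heads, each with its own (d/H) x (d/H) state, reducing per-step work
    from O(d^2) to O(d^2 / H)."""
    d = g.shape[0]
    r = d // H                      # per-head subspace size (expansion ratio)
    S_new = []
    for h in range(H):
        sl = slice(h * r, (h + 1) * r)
        # Equation (17) restricted to head h's subspace
        S_h = g[sl][:, None] * S_prev[h] + np.outer(1.0 - g[sl], i[sl])
        S_new.append(S_h)
    return S_new
```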
3.2. HGRN2-Based FDE Backbone: FDE-HGRN2
While the original FDE-RNN relies on LSTM-based recurrent modules in both the Dynamic Encoder RNN (DE-RNN) and the Personalization module (P-module), this choice is not necessarily optimal in terms of the accuracy–efficiency trade-off, especially under strict on-device constraints. Recent linear recurrent architectures such as HGRN2 provide competitive long-range sequence modeling capabilities while enabling more efficient sequence-level parallelization and reduced computational complexity. Motivated by these developments, the FDE-HGRN2 architecture is obtained by replacing all LSTM blocks in FDE-RNN with HGRN2 layers, while preserving the overall dynamic gating, FiLM-based conditioning, and detachable PVAD structure. The flowchart of FDE-HGRN2 is depicted in Figure 4.
In the DE-RNN, both the Prediction RNN and the Encoder RNN are instantiated as lightweight HGRU2 modules (without the GLU) to minimize computational overhead. Crucially, to fully exploit the parallel training capability of linear RNNs, the explicit recurrent feedback connection from the previous Encoder state is removed from the VAD prediction. Equation (1) thus becomes:

$$p_t = \sigma\!\left(\mathrm{HGRU2}_{\mathrm{pred}}\!\left(\mathbf{x}_t\right)\right)$$

where HGRU2_pred operates solely on the current acoustic features x_t, utilizing its internal multi-head state expansion to capture temporal dependencies.
The most significant architectural shift occurs in the conditional encoder update. Rather than relying on explicit piecewise routing as in the original LSTM baseline, FDE-HGRN2 achieves dynamic encoding by repurposing the native lower bound parameter of the HGRU2 cell (as formulated in Equation (16)). Given the VAD confidence threshold θ, we first derive a binary gating mask m_t = 𝟙[p_t ≥ θ]. This mask is inverted to serve as the dynamic lower bound λ_t = 1 − m_t. The encoder update is then unified within the recurrence:

$$\mathbf{g}_t = \lambda_t + \left(1 - \lambda_t\right) \odot \sigma\!\left(\mathbf{f}_t\right), \qquad \mathbf{S}_t = \mathrm{Diag}\!\left(\mathbf{g}_t\right)\mathbf{S}_{t-1} + \left(1 - \mathbf{g}_t\right) \otimes \mathbf{i}_t$$

When p_t < θ (non-speech), λ_t = 1 forces g_t = 1, which freezes the internal state matrix S_t and prevents noise assimilation. When p_t ≥ θ (speech), λ_t = 0 and the encoder updates normally.
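The mask-to-lower-bound conversion and its effect on the decay are one-liners, shown here as a NumPy sketch (function names are illustrative):

```python
import numpy as np

def dynamic_lower_bound(vad_probs, theta=0.5):
    """Derive the per-frame lower bound from VAD confidence:
    m_t = 1 on speech frames, and lam_t = 1 - m_t, so that
    lam_t = 1 freezes the encoder state on non-speech frames."""
    m = (np.asarray(vad_probs) >= theta).astype(float)
    return 1.0 - m

def effective_decay(lam, f):
    """g_t = lam + (1 - lam) * f_t: lam = 1 forces g_t = 1 (state frozen),
    while lam = 0 leaves the learned forget gate f_t in control."""
    return lam + (1.0 - lam) * f
```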
The overall structure of the P-module remains unchanged with respect to the confidence-weighted fusion and the FiLM-based conditioning: the fused features z_t are modulated by the target speaker embedding. However, because discerning target speaker traits from complex acoustic mixtures requires superior feature selection capabilities, temporal modeling here is performed by a full HGRN2 block (incorporating the GLU). Consequently, the PVAD output is reformulated as:

$$\hat{\mathbf{y}}_t = \mathrm{softmax}\!\left(\mathbf{W}\,\mathrm{HGRN2}\!\left(\mathrm{FiLM}\!\left(\mathbf{z}_t;\ \mathbf{e}\right)\right)\right)$$

where HGRN2(·) encompasses both the core recurrence and the GLU. By strategically distributing lightweight core cells and full blocks, FDE-HGRN2 maintains the original design principles while leveraging non-parametric state expansion to substantially improve the balance between PVAD accuracy and efficiency.
The integration of HGRN2 into the FDE-PVAD framework involves three non-trivial technical contributions that go beyond a straightforward module substitution.
First, dynamic gating via the native lower bound parameter. The original FDE-RNN dynamic encoder relies on explicit piecewise routing: the Encoder RNN either executes its full recurrent update or freezes its hidden state based on the VAD confidence threshold. For HGRN2, whose recurrent state is a matrix S_t ∈ ℝ^{d×d}, we identified that the native lower bound parameter in the forget gate can be repurposed as a dynamic gating mechanism: setting λ_t = 1 on non-speech frames freezes the state matrix, while λ_t = 0 enables normal updates. This reuse of an existing HGRN2 component for a new functional purpose is a non-trivial insight that is not achievable by simply substituting HGRN2 for the LSTM.
Second, gradient-preserving state freezing. In the LSTM, state freezing copies the previous cell state unchanged, an abrupt binary operation that stops gradient propagation through skipped frames. In FDE-HGRN2, setting the lower bound to 1 drives g_t to 1, causing S_t to retain its previous value without any discontinuity in the computational graph. This preserves gradient flow through frozen frames during training, resulting in smoother optimization behavior consistent with the improved robustness to batch size reduction observed in the experimental results (Table 1).
Third, removal of the recurrent feedback connection. The original FDE-RNN VAD Prediction module receives explicit feedback from the previous Encoder hidden state h_{t-1}. In HGRN2, the internal multi-head state expansion provides sufficient effective state dimensions to capture the relevant temporal context directly from the input sequence, making this feedback connection unnecessary. Removing it enables full sequence-level parallelization of the Prediction module during training, a key efficiency advantage of linear RNNs. Together, these three integration decisions constitute the non-trivial technical contributions of the FDE-HGRN2 integration.
3.3. Cosine-Annealing Learning Rate (CALR) in FDE-HGRN2 Training
Cosine-annealing learning rate (CALR) [15] is a learning rate scheduling technique designed to improve the training dynamics of neural networks. CALR gradually decreases the learning rate using a cosine function, which enables smoother convergence and improves generalization by avoiding abrupt learning rate drops. In contrast to methods with periodic restarts, we employ a single monotonic decay from a maximum to a minimum learning rate throughout the training process.
The learning rate at the t-th epoch is scheduled as follows:

$$\eta_t = \eta_{\min} + \frac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\!\left(\frac{\pi t}{T}\right)\right)$$

where η_max and η_min are the initial and final learning rates, respectively, and T is the total number of training epochs. The learning rate starts at η_max when t = 0 and decays smoothly to η_min as t approaches T. This cosine-shaped decay avoids sudden changes and allows stable convergence in the later stages of training.
While the original FDE-RNN framework relies on a discrete two-phase step decay schedule, we replace it with the CALR strategy for optimizing our proposed FDE-HGRN2 architecture. The hyperparameters are chosen to match the range of the FDE-RNN two-phase schedule: η_max equals the FDE-RNN initial learning rate, η_min equals the FDE-RNN final learning rate, and T equals the total number of training epochs, so that CALR is a drop-in replacement for the step decay with an identical learning rate range but a smoother decay trajectory. The potential advantages of using CALR scheduling in frame-level PVAD include:
Smooth Convergence: Gradual decay prevents instability in optimization and leads to better convergence behavior.
Balanced Learning: Higher learning rates early in training help explore the parameter space, while lower rates later refine and stabilize the model’s target speaker representations.
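The schedule can be written directly from the formula above (a minimal sketch; `eta_max`, `eta_min`, and `T` correspond to the hyperparameters defined in this section):

```python
import math

def calr(t, eta_max, eta_min, T):
    """Cosine-annealing learning rate: a single monotonic decay from
    eta_max (at t = 0) to eta_min (at t = T), with no restarts."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T))
```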
5. Experimental Results and Discussion
5.1. Overall PVAD Performance and Model Size
Table 1 summarizes the PVAD performance and model size of the original FDE-RNN and the proposed FDE-HGRN2 under different batch size configurations. Due to limited GPU memory, we could not reproduce the baseline FDE-RNN with the original batch size of 256 and instead trained it with a batch size of 64 for all our experiments. This reduction leads to a mild but consistent degradation of the reproduced FDE-RNN: accuracy drops by about one percentage point, mAP decreases from 0.948 to 0.942, and the AP for target speech frames decreases from 0.924 to 0.910, with a similar but smaller reduction for the non-speech/non-target class (from 0.962 to 0.959). These trends suggest that the LSTM-based FDE-RNN benefits from larger mini-batches and is somewhat sensitive to batch size constraints.
Under the same batch size of 64, FDE-HGRN2 achieves the best performance among all variants. Its accuracy reaches 88.61% and its mAP increases to 0.952, not only compensating for the degradation observed in the reproduced FDE-RNN but also slightly surpassing the originally reported FDE-RNN result (mAP 0.948). The AP for target speech frames improves to 0.932, outperforming both the reproduced baseline (0.910) and the original reported model (0.924), while the non-speech/non-target AP is also the highest at 0.964. At the same time, the number of trainable parameters is reduced from 92.372K to 78.724K, resulting in a more compact recurrent backbone.
The key observations from Table 1 are as follows:
Effect of batch size: Comparing “FDE-RNN (reported)” and “FDE-RNN (reproduced)” shows that reducing the batch size from 256 to 64 yields a small but systematic performance loss in terms of accuracy, mAP, and AP for both target speech and non-speech/non-target classes, indicating that the LSTM-based FDE-RNN relies on relatively large mini-batches for its best optimization behavior.
Benefit of HGRN2 backbone: With the same batch size constraint of 64, FDE-HGRN2 attains higher accuracy, mAP, and AP values than both the reproduced FDE-RNN and the originally reported FDE-RNN, demonstrating that replacing the LSTM with HGRN2-based recurrence yields genuine performance gains that cannot be attributed to training conditions alone. The comparison against the originally reported FDE-RNN (batch size 256) is presented as additional context only, since that cross-condition comparison is not fully controlled; the most rigorous evidence is the intra-family comparison at batch size 64. Notably, even under the more constrained batch size of 64, FDE-HGRN2 matches or exceeds the originally reported FDE-RNN trained with batch size 256, suggesting that the HGRN2 backbone is more robust to batch size constraints than its LSTM-based counterpart.
Accuracy–efficiency trade-off: FDE-HGRN2 reduces the parameter count from 92.372K to 78.724K while still achieving the strongest overall PVAD performance in Table 1. Thus, the proposed architecture improves both detection accuracy and model compactness, offering a strictly better accuracy–efficiency trade-off than the original LSTM-based FDE-RNN.
5.2. Note on Statistical Significance
All results are reported from single training runs, consistent with established practice in the PVAD literature (PVAD 1.0, PVAD 2.0, COIN-AT-PVAD, SE-PVAD, and FDE-RNN all report single-run results). We acknowledge that multi-seed experiments would provide the strongest statistical guarantees; however, a single full training run requires approximately one week in our setup, making this infeasible at present. Nonetheless, the consistency of simultaneous improvements across all four evaluation metrics, namely accuracy, AP(tss), AP(ns&ntss), and mAP, provides meaningful evidence that the observed gains are not attributable to random variation. Multi-seed validation is listed as future work.
5.3. Inference Efficiency Analysis
Table 2 reports three complementary efficiency metrics evaluated under identical hardware conditions on a workstation equipped with an Intel(R) Core(TM) i7-14700 CPU and an NVIDIA GeForce RTX 5070 GPU with 12 GB of memory for both models. The results reveal a nuanced picture, consistent with a well-established characteristic of linear recurrent architectures on contemporary hardware.
In terms of computational cost and memory footprint, FDE-HGRN2 demonstrates clear advantages. It reduces amortized floating-point operations per frame from 136.425 to 116.436 kFLOPs (a reduction of approximately 15%), consistent with the theoretical complexity reduction achieved by replacing the four-gate LSTM recurrence with HGRN2’s multi-head linear recurrence. More substantially, peak memory allocation during inference is reduced from 16.601 MB to 8.445 MB, a reduction of approximately 49%. This memory saving is particularly significant for deployment-oriented scenarios, where RAM is often the primary bottleneck rather than raw compute throughput.
However, the real-time factor (RTF) of FDE-HGRN2 (0.0787) is higher than that of FDE-RNN (0.0281), indicating that the wall-clock inference time of FDE-HGRN2 is approximately 2.8 times longer on the measurement platform. This apparent discrepancy between theoretical FLOPs and practical latency is a well-documented phenomenon for linear recurrent architectures: whereas LSTM benefits from highly optimized cuDNN kernels that have been extensively tuned for GPU execution, HGRN2’s outer-product state expansion involves matrix-valued recurrent state updates that do not yet benefit from equivalent hardware-level optimization on general-purpose GPUs. As a result, the theoretical arithmetic savings do not fully translate into wall-clock speedup on the current measurement platform. We note that the RTF of 0.0787 still satisfies real-time processing requirements by a substantial margin (RTF ≪ 1), and that dedicated hardware implementations or operator-level kernel fusion for linear recurrent architectures—an active area of systems research—are expected to narrow this gap significantly. The memory and FLOP advantages of FDE-HGRN2 are therefore more representative of its deployment potential on resource-constrained devices, where memory capacity is the binding constraint rather than latency.
5.4. Impact of Signal-to-Noise Ratio (SNR)
Table 3 reports the PVAD performance of the reproduced FDE-RNN baseline and the proposed FDE-HGRN2 model at SNRs of 5, 10, and 15 dB. Note that the SNR conditions in Table 3 are applied exclusively to the test utterances. All models are trained on MTR-augmented data covering a broad distribution of reverberant and noisy conditions; the SNR values in this section therefore refer to the noise level of the test utterances only. For each SNR level, test utterances are derived from the original test set under controlled noise conditions, and a unified batch size of 64 is employed for all experiments to avoid confounding effects from the training configuration. Across all three SNR settings, FDE-HGRN2 consistently achieves higher AP for both the non-speech/non-target and target speech classes, as well as higher mAP, indicating that the HGRN2 backbone provides more robust sequence modeling under noisy conditions.
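The controlled noise conditions amount to scaling a noise signal so that the speech-to-noise power ratio hits the desired level before mixing. A minimal sketch (the function and array names are illustrative, not the exact corpus-construction code):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` after scaling it so that the resulting
    speech-to-noise power ratio equals `snr_db` (in dB)."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```

Applying such a mixer at 5, 10, and 15 dB to each test utterance yields the three evaluation conditions in Table 3.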
Beyond the AP-based metrics, FDE-HGRN2 also improves frame-wise accuracy at every SNR, with gains of roughly 1.5–2 percentage points over the reproduced FDE-RNN baseline. These improvements show that the benefits of HGRN2 are reflected not only in ranking quality along the precision–recall curve but also in fewer frame-level misclassifications in practical operation. The accuracy and mAP gains persist as the SNR increases from 5 to 15 dB, suggesting that the advantages of HGRN2 are stable across both moderately noisy and relatively clean acoustic environments, rather than being restricted to a particular noise regime.
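The "ranking quality" referred to here is the standard ranking-based average precision, computed per class and averaged into mAP. A minimal self-contained sketch of this metric (illustrative, not the exact evaluation code used in the experiments):

```python
def average_precision(labels, scores):
    """AP for one class: rank frames by score, then average the precision
    observed at each positive frame (step-interpolated P-R area)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

def mean_average_precision(labels_per_class, scores_per_class):
    """mAP: unweighted mean of the class-wise APs (e.g. ns&ntss and tss)."""
    aps = [average_precision(l, s)
           for l, s in zip(labels_per_class, scores_per_class)]
    return sum(aps) / len(aps)
```

A perfectly ranked class yields AP = 1.0 regardless of the absolute posterior values, which is why AP and frame-wise accuracy can move independently.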
5.5. Ablation Study
The ablation results in Table 4 reveal consistent and interpretable trends across the three configurations. The full FDE-HGRN2 model, which employs HGRN2 in both the Dynamic Encoder RNN (DE-RNN) and the interaction block together with CALR, attains the best overall performance, with the highest mAP (0.9522), the strongest class-wise AP (0.9643 for ns&ntss and 0.9317 for tss), and the highest accuracy of 88.61%. This indicates that combining the HGRN2 backbone with CALR yields the strongest frame-level discrimination for both target and non-target speech and leads to the most reliable overall classification.
Removing CALR from FDE-HGRN2 produces a moderate but consistent degradation: mAP decreases from 0.9522 to 0.9477 and accuracy drops from 88.61% to 87.81%. Both tss and ns&ntss AP are slightly reduced (from 0.9317 to 0.9233 and from 0.9643 to 0.9619, respectively), suggesting that CALR provides a steady gain across classes, with a noticeable impact on both target and background discrimination. The reduction in accuracy further confirms that the smooth learning rate decay helps stabilize frame-wise decisions, improving not only AP-based metrics but also the overall error rate.
When the DE-RNN is further reverted from HGRN2 back to LSTM while CALR remains disabled, performance drops again to an mAP of 0.9457 and an accuracy of 87.38%. In this configuration, ns&ntss AP decreases to 0.9615 and tss AP to 0.9169, indicating that the LSTM-based DE-RNN is less effective at modeling background and interfering speakers than its HGRN2 counterpart. This additional degradation highlights the importance of the HGRN2-based DE-RNN: replacing HGRN2 with LSTM weakens the model’s ability to separate target and non-target speech and slightly lowers overall detection quality, confirming that the linear recurrent backbone is a key contributor to the improved accuracy–efficiency trade-off within the FDE framework.
Effect of expansion ratio. To justify the choice of the state expansion ratio, we evaluated FDE-HGRN2 (w/o CALR) under three expansion-ratio settings. The results are summarized in Table 5. As shown, the adopted setting achieves the highest mAP (0.9477) and accuracy (87.81%). Crucially, all three configurations share an identical parameter count of 78.724K, directly and empirically confirming HGRN2's theoretical claim that state expansion is non-parametric: increasing the expansion ratio n enlarges the effective recurrent state capacity via outer-product expansion without introducing additional trainable parameters. Performance degrades at the largest expansion ratio, suggesting diminishing returns when the state space is excessively large relative to the model's scale and training set. We therefore adopt the best-performing expansion ratio as the proposed configuration.
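The non-parametric nature of the expansion can be illustrated with simple shape bookkeeping. This is a simplified sketch of HGRN2-style multi-head accounting, not the exact FDE-HGRN2 layer shapes: with model width d and expansion ratio n, the layer keeps h = d/n heads, each holding an (n x n) matrix state, so the state holds d*n entries while the projection weights stay fixed.

```python
import numpy as np

d = 16  # model width (illustrative)

# Trainable weights of one sketch layer: query, forget-gate, and value
# projections. None of these depend on the expansion ratio n below.
rng = np.random.default_rng(0)
Wq, Wf, Wv = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
n_params = Wq.size + Wf.size + Wv.size  # fixed: 3 * d * d

def state_entries(d, n):
    """Total recurrent state size for expansion ratio n: h = d // n heads,
    each with an (n x n) matrix-valued state, giving h * n * n = d * n."""
    h = d // n
    return h * n * n

sizes = {n: state_entries(d, n) for n in (2, 4, 8)}
# The state grows linearly with n while n_params stays constant.
```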
5.6. Comparison with State-of-the-Art PVAD Methods
Table 6 compares the proposed FDE-HGRN2 with several representative PVAD systems: PVAD 1.0 [4], PVAD 2.0 [5], COIN-AT-PVAD [17], SE-PVAD [18], and the original FDE-RNN [9]. The performance metrics and parameter counts for these baseline systems are taken from the FDE-RNN literature [9], and our experiments follow the same dataset construction and evaluation protocol to ensure a fair comparison. The table reports class-wise AP on tss, ntss, and ns, together with overall mAP, frame-wise accuracy, and model size.
Three main observations can be drawn:
Detection performance: FDE-HGRN2 achieves the highest overall detection performance among all listed models. Its mAP of 0.9522 and accuracy of 88.61% exceed those of SE-PVAD (mAP 0.925, Acc 84.86%) and the originally reported FDE-RNN (mAP 0.948, Acc 87.88%). It also delivers strong target speaker discrimination with an AP (tss) of 0.9317 while maintaining competitive or better AP for the non-target classes, indicating that the HGRN2-based recurrence effectively enhances temporal modeling and target/non-target separation.
Accuracy–efficiency trade-off: Despite its strong performance, FDE-HGRN2 remains parameter-efficient. Compared with earlier PVAD systems such as PVAD 1.0 and SE-PVAD, which each use more than 130K parameters and achieve accuracies below 85%, FDE-HGRN2 attains both higher accuracy (88.61%) and higher mAP with only 78.724K parameters. Relative to the FDE-RNN backbone (92.372K, Acc 87.88%), it reduces parameter count by roughly 15% while still improving mAP and accuracy. Although COIN-AT-PVAD employs slightly fewer parameters (71.869K) and reaches reasonable accuracy (86.74%), its mAP (0.940) remains clearly lower, indicating that FDE-HGRN2 provides a more favorable balance between compactness and predictive power.
Robustness under constrained training: The gap between the reported and reproduced FDE-RNN variants highlights the sensitivity of the LSTM-based architecture to training hyperparameters, especially batch size. When the batch size is reduced from 256 to 64, both mAP and accuracy decrease for FDE-RNN. In contrast, FDE-HGRN2, trained under the same constrained batch size of 64, avoids this degradation and even surpasses the original large-batch FDE-RNN in both mAP and accuracy. This behavior suggests that the HGRN2-based design is more robust to realistic hardware limitations and maintains strong performance without relying on large-batch training.
6. Conclusions and Future Work
In this work, we proposed FDE-HGRN2, an HGRN2-based instantiation of the FDE-PVAD framework that replaces all LSTM modules in the original FDE-RNN backbone while preserving its dynamic encoder gating and detachable personalization architecture. Experiments on a LibriSpeech-derived PVAD benchmark show that FDE-HGRN2 improves mean Average Precision and frame-wise accuracy over both the reported and reproduced FDE-RNN baselines, despite using approximately 15% fewer parameters in the recurrent backbone. The model also exhibits more stable behavior under constrained training conditions and across a range of SNR levels, indicating that the HGRN2 backbone and the CALR schedule jointly provide a more robust accuracy–efficiency trade-off for personal voice activity detection.
A key limitation of the current evaluation is its reliance on synthetically constructed multi-speaker sequences derived from LibriSpeech. Although MTR augmentation partially simulates reverberant and noisy far-field conditions using real room impulse responses and noise corpora, real device-captured recordings introduce additional acoustic artifacts not fully captured by simulation. Evaluating FDE-HGRN2 on real-world corpora such as AMI or CHiME, and integrating it into end-to-end streaming device pipelines with measured latency and power consumption, are important directions for future work. Furthermore, due to computational constraints, all experiments in this work were conducted with a batch size of 64; training and evaluating FDE-HGRN2 at the original FDE-RNN batch size of 256 to enable a fully controlled large-batch comparison remains an important direction for future work.
Future work will focus on extending FDE-HGRN2 toward more realistic deployment scenarios and broader personalized speech applications. One direction is to evaluate the proposed architecture under real-world far-field recordings and device-captured noise, and to integrate it into multi-task pipelines that jointly handle PVAD, speaker verification, and possibly diarization or keyword spotting. In particular, the linear-time HGRN2 backbone is well suited to streaming keyword spotting, and we plan to explore joint PVAD–KWS modeling while explicitly measuring latency, memory footprint, and energy consumption on mobile SoCs. A systematic study of FDE-HGRN2's robustness to imperfect speaker embeddings, including embeddings extracted from short or noisy enrollment utterances, would further characterize the system's practical deployment envelope. We also plan rigorous on-device profiling of FDE-HGRN2, measuring inference latency in milliseconds per frame, floating-point operations per forward pass, and peak memory usage during inference on representative mobile SoCs, to substantiate the efficiency claims made in this paper. Additional directions include combining FDE-HGRN2 with model compression and quantization techniques for tighter resource budgets, and investigating its applicability to other personalized front-end tasks in on-device speech interfaces.