1. Introduction
Voice activity detection (VAD) is a core front-end component in modern speech systems, where it separates speech from non-speech segments to reduce the computational load, latency, and false triggers of downstream components such as automatic speech recognition (ASR) and speaker verification [1]. Traditional VAD methods, including energy-based detectors, statistical model-based approaches, and more recent neural VAD models, are typically speaker-agnostic and only indicate whether any speech is present [1]. In realistic multi-speaker and noisy environments, however, this speaker-agnostic formulation is often insufficient for personal devices and voice assistants [1,2], which should react only to their enrolled user rather than to every nearby talker. As a result, there has been growing interest in methods that can detect the speech activity of a specific target speaker at the frame level while ignoring other speakers and background noise [2,3].
Personal voice activity detection (PVAD) directly addresses this requirement by conditioning the detector on a target speaker [4]. The original Personal VAD work proposed a neural VAD-like network that takes as additional input either a speaker embedding or a speaker verification score derived from an enrollment utterance of the target user [4]. For each frame, the model outputs three posterior probabilities—non-speech, target speaker speech, and non-target speech—so that a streaming on-device ASR system can gate its computation based on whether the enrolled user is speaking, thereby reducing battery consumption and unintended activations without relying solely on a keyword detector [4]. More recently, Personal VAD 2.0 extended this framework with improved conditioning strategies, an architecture optimized for on-device deployment, and training schemes that support both enrollment-based and enrollment-less operation, substantially improving detection quality while meeting tight latency and resource constraints [5].
Beyond these specific architectures, several recent studies have explored how to make PVAD more robust and practical under real-world constraints. One line of work focuses on the dependence on high-quality speaker embeddings extracted from a separate speaker verification model. Efficient PVAD with wake-word reference speech, for example, replaces static embeddings with frame-level features from short reference speech (such as a wake word), directly injecting these features into the detection network and avoiding a separate large verification model [6]. Other work demonstrates that PVAD remains feasible even when the enrollment utterance is extremely short: a personal VAD with ultra-short reference speech shows that, by continuously updating internal states as target speaker representations and carefully designing the training objective, the system can reliably detect target speaker activity from very limited enrollment audio [7]. PVAD has also been studied as part of broader personalized pipelines, such as speaker verification-oriented SVVAD and system-level comparative analyses that evaluate multiple personalized VAD architectures under realistic device, noise, and usage conditions [8].
In parallel with these advances, we adopt a flexible design termed FDE-PVAD (Flexible Detachable Encoder Personal Voice Activity Detection) [9]. FDE-PVAD is built on a dynamic encoder RNN front-end that can operate either as a conventional VAD or as a personalized PVAD by attaching or detaching a lightweight personalization module, enabling a smooth transition between generic and user-specific functionality without retraining the entire network [9]. This design yields a compact model that reuses a single encoder for both tasks, reduces redundant computation, and fully exploits the speech-related representations learned by the core VAD encoder for downstream personal activity classification [9]. Concretely, FDE-PVAD adopts a recurrent neural network implemented as a long short-term memory (LSTM) model as the primary temporal sequence learner in all major modules: the encoder RNN aggregates frame-level acoustic features into context-aware hidden states shared by both the base VAD decision head and the detachable PVAD module, while additional LSTM layers in the personalization path refine these hidden states with target speaker information [9,10].
Although LSTM-based RNNs are widely used in PVAD and provide strong temporal modeling capacity, they are not necessarily the best choice in terms of the accuracy–efficiency trade-off for on-device deployment. Recent advances in sequence modeling, such as Transformer and its variants, often yield superior performance on long-context speech tasks [11], but their quadratic or otherwise heavy complexity makes them less attractive for low-latency, resource-constrained PVAD. To bridge this gap, a family of so-called linear recurrent models has been proposed, including architectures like Mamba [12], HGRN [13], and HGRN2 [14], which aim to retain much of the modeling power of modern sequence models while keeping time and memory complexity closer to that of classical RNNs. Among them, HGRN2 employs gated linear recurrence with non-parametric state expansion to significantly increase effective memory capacity without adding trainable parameters, making it a promising backbone for long-context, on-device speech processing.
Motivated by these developments, in this work, we instantiate the FDE-PVAD framework with an HGRN2-based encoder, referred to as FDE-HGRN2, by replacing the LSTM modules in the original FDE-RNN front-end with HGRN2 [14]. We systematically investigate how this substitution affects the accuracy, robustness, and efficiency of personal voice activity detection, and show that FDE-HGRN2 can provide a better balance between model capacity and on-device cost while preserving the flexibility of the original FDE-PVAD design [9,10]. In particular, we study both the VAD and PVAD branches, integrate HGRN2 into the dynamic encoder and personalization paths, and combine the architectural changes with a cosine-annealing learning rate schedule to improve training stability under practical batch size constraints.
The main contributions of this work are summarized as follows:
We instantiate the Flexible Dynamic Encoder PVAD (FDE-PVAD) framework with an HGRN2-based backbone, yielding FDE-HGRN2, which replaces all LSTM modules in the original FDE-RNN while preserving its detachable PVAD architecture and dynamic encoder gating mechanism.
We provide a detailed integration of HGRN2 into both the VAD and PVAD branches, including a parallelizable prediction/encoder design and a personalized path with FiLM conditioning and GLU refinement, demonstrating that gated linear recurrence with state expansion can be effectively adapted to PVAD.
Through experiments on a LibriSpeech-derived benchmark, we show that FDE-HGRN2 consistently improves mean Average Precision and frame-wise accuracy over the LSTM-based FDE-RNN and several prior PVAD systems, while reducing backbone parameters by about 15% and achieving smaller models than many existing methods.
We analyze robustness under practical constraints, demonstrating that FDE-HGRN2 maintains or improves performance under reduced batch sizes and across a wide range of SNR conditions, whereas the LSTM-based FDE-RNN degrades more noticeably, indicating more stable optimization and better noise robustness for the HGRN2 backbone.
We conduct ablations and architectural comparisons (with and without CALR, and with LSTM vs. HGRN2 in the dynamic encoder) to isolate the contributions of the proposed components, confirming that the HGRN2-based DE-RNN and CALR jointly provide a superior accuracy–efficiency trade-off for deployment-oriented PVAD.
2. Overview of the Backbone Model: FDE-RNN
The Flexible Dynamic Encoder Recurrent Neural Network (FDE-RNN) constitutes a significant advancement in the development of unified voice activity detection (VAD) and personalized voice activity detection (PVAD) frameworks. This novel architecture facilitates seamless task adaptation while minimizing redundant computational overhead, thereby addressing a fundamental limitation of conventional PVAD systems that necessitate full model execution even for elementary VAD operations.
The FDE-RNN incorporates a modular design comprising two principal components: the Dynamic Encoder RNN (DE-RNN, a recurrent front-end that dynamically skips its state update on non-speech frames to avoid accumulating noise into the shared speech representation), functioning as the VAD front-end, and a detachable Personalization module (P-module) responsible for speaker-conditioned PVAD processing. This architectural separation enables the FDE-RNN to operate as an ultra-lightweight VAD with merely 40.4K parameters—approximately 30% of those in competing architectures—while preserving state-of-the-art performance across both detection tasks.
The complete architecture of FDE-RNN, illustrated in Figure 1, employs a two-stage processing pipeline. In the first stage, the DE-RNN extracts and encodes speech-relevant acoustic features, after which the P-module is conditionally activated for segments containing detected speech. This selective operational strategy ensures efficient resource utilization and high scalability, making the FDE-RNN particularly well-suited for real-world deployment scenarios.
2.1. Dynamic Encoder RNN (DE-RNN): Core Mechanism
DE-RNN extends dynamic neural network principles to VAD, comprising interdependent Prediction and Encoder RNN modules that coordinate through a gating mechanism. This design eliminates redundant computations on non-speech frames while accumulating robust speech representations over time.
The Prediction RNN generates frame-level VAD probabilities p_t by processing the concatenation of the current acoustic features x_t and the prior Encoder hidden state h_{t-1}:

$$p_t = \sigma\!\left(\mathrm{PredRNN}\left(\left[\mathbf{x}_t;\ \mathbf{h}_{t-1}\right]\right)\right) \quad (1)$$

The Encoder RNN selectively updates its state based on the prediction confidence against a threshold θ:

$$\mathbf{h}_t = \begin{cases} \mathrm{EncRNN}\!\left(\mathbf{x}_t,\ \mathbf{h}_{t-1}\right), & p_t \ge \theta \\ \mathbf{h}_{t-1}, & p_t < \theta \end{cases} \quad (2)$$
This conditional updating ensures that the Encoder accumulates speech-specific latent representations resilient to non-target speaker interference, providing an ideal feature basis for subsequent PVAD processing.
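The two-stage mechanism above can be sketched in a few lines (a minimal NumPy illustration, not the authors' implementation: single affine maps stand in for the Prediction and Encoder LSTM cells, and `theta` is the VAD confidence threshold):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def de_rnn_step(x_t, h_prev, W_pred, W_enc, theta=0.5):
    """One DE-RNN step: predict a VAD probability from [x_t; h_prev],
    then update the Encoder state only if the probability clears theta."""
    z = np.concatenate([x_t, h_prev])
    # Prediction path (stand-in: a single affine map + sigmoid)
    p_t = sigmoid(z @ W_pred).item()
    if p_t >= theta:
        # speech frame: the Encoder accumulates the new representation
        h_t = np.tanh(z @ W_enc)
    else:
        # non-speech frame: the state is frozen, so noise is not assimilated
        h_t = h_prev
    return p_t, h_t
```

With this structure, non-speech frames leave the encoder state untouched, so only speech frames contribute to the accumulated representation.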
2.2. Personalization Module: Speaker-Adaptive Processing
The P-module receives the Encoder outputs h_t and adaptively fuses them with the raw acoustics through a confidence-weighted residual connection:

$$\mathbf{z}_t = p_t \odot \mathbf{h}_t + \left(1 - p_t\right) \odot \mathbf{x}_t \quad (3)$$
High VAD confidence prioritizes refined latent features; low confidence incorporates more raw spectral details to counter potential false positives. This adaptive mechanism significantly enhances downstream PVAD robustness.
Speaker conditioning occurs via Feature-wise Linear Modulation (FiLM):

$$\tilde{\mathbf{z}}_t = \boldsymbol{\gamma}(\mathbf{e}) \odot \mathbf{z}_t + \boldsymbol{\beta}(\mathbf{e}) \quad (4)$$

where γ(e) and β(e) are affine transformation parameters generated from the target speaker embedding e via a FiLM generator. Temporal modeling follows through an LSTM, yielding the final PVAD classifications:

$$\hat{\mathbf{y}}_t = \mathrm{softmax}\!\left(\mathbf{W}\,\mathrm{LSTM}\!\left(\tilde{\mathbf{z}}_t\right)\right) \quad (5)$$
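The confidence-weighted fusion and FiLM conditioning can be sketched as follows (a minimal NumPy illustration with hypothetical weight names, not the paper's implementation; the FiLM generator is shown as a single affine map per parameter):

```python
import numpy as np

def fuse(h_t, x_t, p_t):
    """Confidence-weighted residual fusion of encoder output and raw acoustics."""
    return p_t * h_t + (1.0 - p_t) * x_t

def film_generator(e, W_gamma, b_gamma, W_beta, b_beta):
    """Hypothetical FiLM generator: maps a speaker embedding e to
    per-channel affine parameters (gamma, beta)."""
    return e @ W_gamma + b_gamma, e @ W_beta + b_beta

def film_modulate(z_t, gamma, beta):
    """Feature-wise Linear Modulation: scale and shift each channel."""
    return gamma * z_t + beta
```

High VAD confidence (p_t near 1) routes the refined encoder features into the P-module, while low confidence mixes in more of the raw spectral input, matching the behavior described above.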
2.3. Parallel Training Paradigm
FDE-RNN employs joint optimization of the VAD and PVAD objectives through parallel training. During training, the P-module executes at every time step regardless of the VAD gating:

$$\hat{\mathbf{y}}_t = \mathrm{P\text{-}module}\!\left(\mathbf{z}_t,\ \mathbf{e}\right), \quad \forall t \quad (6)$$

This strategy exposes the P-module to the full input distribution (both speech and non-speech frames), enhancing generalization. Inference reverts to conditional execution for efficiency.
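The training-versus-inference difference amounts to one condition in the processing loop, sketched here in plain Python (`p_module` is a stand-in callable, not the actual module):

```python
def pvad_path(frames, vad_probs, p_module, theta=0.5, training=True):
    """During training the P-module runs on every frame; at inference
    it is skipped on frames whose VAD probability is below theta."""
    outputs = []
    for x_t, p_t in zip(frames, vad_probs):
        if training or p_t >= theta:
            outputs.append(p_module(x_t))
        else:
            outputs.append(None)  # P-module not executed on this frame
    return outputs
```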
2.4. Architectural Advantages and Research Value
In summary, FDE-RNN’s principal advantage lies in its modular architecture: the DE-RNN can operate as a standalone VAD front-end by detaching the P-module, while the shared encoder representations learned under the VAD objective are directly reused by the personalization path via confidence-weighted fusion and FiLM conditioning. This design avoids redundant computation on non-speech frames, reduces parameter overhead, and provides a controlled backbone for studying individual component contributions—properties that make it a suitable basis for the HGRN2-based extensions proposed in the present work.
3. Proposed Framework: FDE-HGRN2
In this study, we propose replacing the individual time-series models in the original FDE-RNN architecture with HGRN2 to investigate whether it can enhance PVAD performance. Unlike traditional RNNs plagued by vanishing gradients or Transformers burdened by quadratic complexity, HGRN2 achieves Transformer-level expressivity through non-parametric state expansion with linear scaling. The following sections introduce HGRN2’s core formulation and its advantages for PVAD tasks.
3.1. Introduction to HGRN2
Hierarchically Gated Recurrent Neural Network 2 (HGRN2) is a recent linear recurrent architecture that attains Transformer-competitive performance while maintaining true linear scaling in sequence length. Its key innovation is a non-parametric state expansion mechanism that expands hidden states from d-dimensional vectors to d × d matrices via outer products, thereby providing greater effective capacity without introducing additional trainable parameters.
Intuitively, a standard RNN hidden state can be thought of as a fixed-size notepad of dimension d: the model must compress all past information into this fixed-length vector, which limits how much temporal context it can retain. HGRN2 replaces this notepad with a matrix of size d × d, giving the model d times more effective writing space. Crucially, this expansion is achieved without purchasing more parameters: the matrix is constructed at each time step as an outer product of two existing d-dimensional gate vectors, so the additional capacity comes at zero trainable-parameter cost. The sections below formalize this mechanism and describe how it overcomes a known theoretical limitation of standard linear RNNs.
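The zero-parameter nature of the expansion is easy to see in code (a NumPy sketch; `k_t` and `v_t` stand for the two gate-derived vectors the model already computes):

```python
import numpy as np

def expand_state(k_t, v_t):
    """Non-parametric state expansion: lift two d-dimensional vectors to a
    d x d matrix via an outer product. No trainable parameters involved."""
    return np.outer(k_t, v_t)
```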
3.1.1. Limitations of Traditional RNNs and Transformers
Vanilla RNNs inherently suffer from vanishing gradients during backpropagation through time (BPTT). At each time step t, the hidden state is updated as

$$\mathbf{h}_t = \tanh\!\left(\mathbf{W}_h \mathbf{h}_{t-1} + \mathbf{W}_x \mathbf{x}_t\right) \quad (7)$$

and the nonlinear tanh activation causes gradients to decay exponentially over long sequences, effectively limiting temporal memory. LSTMs and GRUs alleviate this issue by introducing gating mechanisms, but their strictly sequential recurrence prevents parallelization across time during training. In contrast, standard Transformers support fully parallel training over time, yet incur a quadratic O(N²) computational and memory complexity with respect to sequence length N, which is often prohibitive for long sequences and on-device deployment.
3.1.2. Linear RNNs and the Capacity Lower Bound
Linear RNNs address the sequential bottleneck by enforcing linearity in the recurrent path, which enables associative recurrence and facilitates stable gradient propagation over long horizons. A typical gated linear RNN updates the hidden state h_t and produces an output y_t as

$$\mathbf{f}_t = \sigma\!\left(\mathbf{W}_f \mathbf{x}_t\right), \quad \mathbf{i}_t = 1 - \mathbf{f}_t, \quad \mathbf{c}_t = \mathbf{W}_c \mathbf{x}_t$$
$$\mathbf{h}_t = \mathbf{f}_t \odot \mathbf{h}_{t-1} + \mathbf{i}_t \odot \mathbf{c}_t \quad (8)$$
$$\mathbf{y}_t = \mathbf{W}_o \mathbf{h}_t$$

Because the gates depend only on the input sequence x_{1:N}, all gate values can be computed in parallel across time. The remaining linear recurrence in Equation (8) then admits an implementation via a parallel prefix-sum (scan), leading to an O(N) training complexity in sequence length N and avoiding the O(N²) bottleneck of self-attention.
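The scan-based evaluation of this recurrence can be illustrated as follows (a minimal NumPy sketch of the associative operator; a production implementation would use a GPU prefix-scan kernel rather than this recursive form):

```python
import numpy as np

def combine(a, b):
    """Associative operator for h_t = f_t * h_{t-1} + u_t, with each step
    represented as a pair (f, u): applying a, then b, composes to
    (f_b * f_a, f_b * u_a + u_b)."""
    fa, ua = a
    fb, ub = b
    return fb * fa, fb * ua + ub

def scan(pairs):
    """Inclusive scan over (f, u) pairs via divide-and-conquer, mirroring
    the parallel prefix-sum structure used in practice."""
    if len(pairs) == 1:
        return pairs
    mid = len(pairs) // 2
    left, right = scan(pairs[:mid]), scan(pairs[mid:])
    carry = left[-1]  # composition of the entire left half
    return left + [combine(carry, p) for p in right]
```

Starting from h_0 = 0 with u_t = i_t ⊙ c_t, the u-component of the t-th inclusive composition equals h_t from the sequential recurrence, which is what makes the training-time parallelization exact rather than approximate.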
Several architectures build on this idea. The direct predecessor to our core module, the Hierarchically Gated Recurrent Neural Network (HGRN) [13], stacks multiple such gated linear layers, where lower layers focus on high-frequency local patterns and higher layers capture low-frequency global context. State-space models such as Mamba [12] also employ linear recurrent state-space updates, where each channel is associated with a structured SSM state of dimension n ≪ d.
Despite their computational efficiency, these models face a fundamental theoretical limitation. Recent analyses on associative recall and state-tracking tasks establish a strict theoretical lower bound on the required state dimension to effectively solve complex sequence modeling problems. For a linear RNN to robustly memorize and retrieve information over long horizons, its memory capacity (the maximum amount of information a recurrent state can store and retrieve over time, which scales with the dimensionality of the hidden state) must scale adequately. Because standard HGRN and state-space models maintain a hidden state restricted to a d-dimensional vector or a small fixed size n, their capacity fundamentally falls short of this lower bound, restricting their expressive power when handling highly complex temporal dependencies.
3.1.3. HGRN2: Overcoming the Lower Bound via State Expansion
HGRN2 [14] is designed to directly overcome the theoretical lower bound limitation of the original HGRN while preserving its favorable O(N) training complexity. Rather than relying solely on element-wise interactions in a constrained d-dimensional hidden space, HGRN2 lifts the recurrent state into a d × d matrix using outer products, thereby enabling non-parametric state expansion (i.e., enlarging the effective recurrent state from a d-dimensional vector to a d × d matrix via outer products, without introducing additional trainable parameters).
The core recurrent computation of an HGRN2 cell at layer ℓ is depicted in Figure 2. To preserve the ability to pre-compute gates in parallel, they are derived purely from the layer input x_t^ℓ. In practice, to maximize hardware utilization, the independent weight matrices are concatenated into a single fused projection matrix, computing all pre-activation vectors simultaneously before chunking. Mathematically, this is equivalent to:

$$\left[\mathbf{i}_t^{\ell};\ \mathbf{o}_t^{\ell};\ \mathbf{f}_t^{\ell}\right] = \mathbf{W}_{iof}^{\ell}\,\mathbf{x}_t^{\ell} \quad (15)$$

where the sub-blocks of W_{iof}^ℓ are distinct trainable weight matrices for the input, output, and forget gates, respectively.
To prevent the exponential decay of historical information over long sequences, HGRN2 natively incorporates a lower bound parameter λ^ℓ into the forget gate, forming the actual decay factor g_t^ℓ:

$$\mathbf{g}_t^{\ell} = \boldsymbol{\lambda}^{\ell} + \left(1 - \boldsymbol{\lambda}^{\ell}\right) \odot \sigma\!\left(\mathbf{f}_t^{\ell}\right) \quad (16)$$
The matrix-valued hidden state S_t^ℓ ∈ ℝ^{d×d} is then updated via the non-parametric state expansion recurrence:

$$\mathbf{S}_t^{\ell} = \mathrm{Diag}\!\left(\mathbf{g}_t^{\ell}\right)\mathbf{S}_{t-1}^{\ell} + \left(1 - \mathbf{g}_t^{\ell}\right) \otimes \mathbf{i}_t^{\ell} \quad (17)$$

where the outer product ⊗ expands the vectors into a d × d capacity matrix. This operation provides d² effective state dimensions, satisfying the theoretical capacity lower bound without introducing new trainable parameters. The complete single-step computation flow is illustrated in Figure 3, which shows how the three gate paths are computed in parallel before being combined through the outer-product state expansion.
Finally, the state is contracted using the output gate and passed through Layer Normalization (LN) and an output projection layer (W_out^ℓ) to yield the cell output y_t^ℓ:

$$\mathbf{y}_t^{\ell} = \mathbf{W}_{out}^{\ell}\,\mathrm{LN}\!\left(\sigma\!\left(\mathbf{o}_t^{\ell}\right)^{\top} \mathbf{S}_t^{\ell}\right)$$
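Putting the pieces together, a single HGRN2 cell step can be sketched as follows (an illustrative NumPy rendering of the formulation in this subsection; weight names are assumptions and LayerNorm is omitted for brevity, so this is a sketch rather than the reference implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hgrn2_step(x_t, S_prev, W_i, W_o, W_f, W_out, lam):
    """One HGRN2 cell step: input-only gates, a lower-bounded decay, an
    outer-product state expansion, and an output-gate contraction."""
    i_t = x_t @ W_i                    # input branch
    o_t = sigmoid(x_t @ W_o)           # output gate (contracts the state)
    f_t = sigmoid(x_t @ W_f)           # raw forget gate
    g_t = lam + (1.0 - lam) * f_t      # decay with lower bound lam
    # non-parametric state expansion: rank-1 outer-product update
    S_t = g_t[:, None] * S_prev + np.outer(1.0 - g_t, i_t)
    # contract with the output gate, then project (LayerNorm omitted)
    y_t = (o_t @ S_t) @ W_out
    return y_t, S_t
```

Note that setting lam to 1 makes the decay g_t identically 1, so the state matrix is carried over unchanged; this property is exploited later for dynamic encoding.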
3.1.4. Multi-Head Parallelization and the HGRN2 Block
A straightforward implementation of the state update would require O(d²) operations per time step, leading to an overall complexity of O(Nd²). To keep HGRN2 computationally efficient, a multi-head parallelization scheme is adopted. The d channels are partitioned into H heads, each operating on a reduced subspace defined by the expansion ratio r = d/H.
For head h ∈ {1, …, H} in layer ℓ, the matrix recurrence in Equation (17) is localized strictly to its (d/H) × (d/H) subspace. Because each head is computed independently and in parallel, the per-head operations scale with O((d/H)²). This multi-head structural design effectively reduces the total recurrent complexity to O(Nd²/H), drastically reducing computational overhead.
While the aforementioned multi-head state expansion dictates the temporal modeling, integrating HGRN2 into a deep architecture requires a robust block design. As formalized in Equation (18), the independent outputs of the H parallel heads are first concatenated and passed through LayerNorm to stabilize the hidden state distribution, after which the linear projection layer (W_out^ℓ) performs cross-head information mixing, ensuring that features processed independently within different subspaces are effectively aggregated before being passed to the next layer:

$$\mathbf{y}_t^{\ell} = \mathbf{W}_{out}^{\ell}\,\mathrm{LN}\!\left(\left[\sigma\!\left(\mathbf{o}_t^{\ell,1}\right)^{\top}\mathbf{S}_t^{\ell,1};\ \dots;\ \sigma\!\left(\mathbf{o}_t^{\ell,H}\right)^{\top}\mathbf{S}_t^{\ell,H}\right]\right) \quad (18)$$
In summary, the complete HGRN2 block combines (i) linear recurrent updates for parallel processing, (ii) non-parametric state expansion for high capacity, and (iii) dynamic lower bounds with multi-head mixing. In the original literature, a complete HGRN2 block comprises a core recurrent cell (often denoted HGRU2) followed by a Gated Linear Unit (GLU). To maximize computational efficiency while preserving representation power where it matters most, our proposed architecture strategically deploys heterogeneous blocks: we utilize the lightweight core cell (HGRU2 without the GLU) for the front-end modules, and the full block (HGRU2 + GLU) for the back-end. These properties make HGRN2 an ideal backbone for the FDE-HGRN2 PVAD architecture.
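The multi-head state update can be sketched as follows (an illustrative NumPy sketch; `g` and `i` are the per-channel decay and input vectors from the cell equations above, and the per-head states are kept in a list for clarity):

```python
import numpy as np

def multihead_state_update(g, i, S_prev, H):
    """Multi-head HGRN2 state update sketch: split the d channels into H
    heads, each with its own (d/H) x (d/H) state, reducing per-step work
    from O(d^2) to O(d^2 / H)."""
    d = g.shape[0]
    r = d // H                      # per-head subspace size (expansion ratio)
    S_new = []
    for h in range(H):
        sl = slice(h * r, (h + 1) * r)
        # Equation (17) restricted to head h's subspace
        S_h = g[sl][:, None] * S_prev[h] + np.outer(1.0 - g[sl], i[sl])
        S_new.append(S_h)
    return S_new
```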
3.2. HGRN2-Based FDE Backbone: FDE-HGRN2
While the original FDE-RNN relies on LSTM-based recurrent modules in both the Dynamic Encoder RNN (DE-RNN) and the Personalization module (P-module), this choice is not necessarily optimal in terms of the accuracy–efficiency trade-off, especially under strict on-device constraints. Recent linear recurrent architectures such as HGRN2 provide competitive long-range sequence modeling capabilities while enabling more efficient sequence-level parallelization and reduced computational complexity. Motivated by these developments, the FDE-HGRN2 architecture is obtained by replacing all LSTM blocks in FDE-RNN with HGRN2 layers, while preserving the overall dynamic gating, FiLM-based conditioning, and detachable PVAD structure. The flowchart of FDE-HGRN2 is depicted in Figure 4.
In the DE-RNN, both the Prediction RNN and the Encoder RNN are instantiated as lightweight HGRU2 modules (without the GLU) to minimize computational overhead. Crucially, to fully exploit the parallel training capability of linear RNNs, the explicit recurrent feedback connection from the previous Encoder state is removed from the VAD prediction. Equation (1) thus becomes:

$$p_t = \sigma\!\left(\mathrm{HGRU2}_{\mathrm{pred}}\!\left(\mathbf{x}_t\right)\right)$$

where HGRU2_pred operates solely on the current acoustic features x_t, utilizing its internal multi-head state expansion to capture temporal dependencies.
The most significant architectural shift occurs in the conditional encoder update. Rather than relying on explicit piecewise routing as in the original LSTM baseline, FDE-HGRN2 achieves dynamic encoding by repurposing the native lower bound parameter of the HGRU2 cell (as formulated in Equation (16)). Given the VAD confidence threshold θ, we first derive a binary gating mask m_t = 𝟙[p_t ≥ θ]. This mask is inverted to serve as the dynamic lower bound λ_t = 1 − m_t. The encoder update is then unified within the recurrence:

$$\mathbf{g}_t = \lambda_t + \left(1 - \lambda_t\right) \odot \sigma\!\left(\mathbf{f}_t\right), \qquad \mathbf{S}_t = \mathrm{Diag}\!\left(\mathbf{g}_t\right)\mathbf{S}_{t-1} + \left(1 - \mathbf{g}_t\right) \otimes \mathbf{i}_t$$

When p_t < θ (non-speech), λ_t = 1 forces g_t = 1, which freezes the internal state matrix S_t and prevents noise assimilation. When p_t ≥ θ (speech), λ_t = 0 and the encoder updates normally.
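The mask-to-lower-bound conversion and its effect on the decay are one-liners, shown here as a NumPy sketch (function names are illustrative):

```python
import numpy as np

def dynamic_lower_bound(vad_probs, theta=0.5):
    """Derive the per-frame lower bound from VAD confidence:
    m_t = 1 on speech frames, and lam_t = 1 - m_t, so that
    lam_t = 1 freezes the encoder state on non-speech frames."""
    m = (np.asarray(vad_probs) >= theta).astype(float)
    return 1.0 - m

def effective_decay(lam, f):
    """g_t = lam + (1 - lam) * f_t: lam = 1 forces g_t = 1 (state frozen),
    while lam = 0 leaves the learned forget gate f_t in control."""
    return lam + (1.0 - lam) * f
```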
The overall structure of the P-module remains unchanged with respect to the confidence-weighted fusion and the FiLM-based conditioning: the fused features z_t are modulated by the target speaker embedding. However, because discerning target speaker traits from complex acoustic mixtures requires superior feature selection capabilities, temporal modeling here is performed by a full HGRN2 block (incorporating the GLU). Consequently, the PVAD output is reformulated as:

$$\hat{\mathbf{y}}_t = \mathrm{softmax}\!\left(\mathbf{W}\,\mathrm{HGRN2}\!\left(\mathrm{FiLM}\!\left(\mathbf{z}_t;\ \mathbf{e}\right)\right)\right)$$

where HGRN2(·) encompasses both the core recurrence and the GLU. By strategically distributing lightweight core cells and full blocks, FDE-HGRN2 maintains the original design principles while leveraging non-parametric state expansion to substantially improve the balance between PVAD accuracy and efficiency.
The integration of HGRN2 into the FDE-PVAD framework involves three non-trivial technical contributions that go beyond a straightforward module substitution.
First, dynamic gating via the native lower bound parameter. The original FDE-RNN dynamic encoder relies on explicit piecewise routing: the Encoder RNN either executes its full recurrent update or freezes its hidden state based on the VAD confidence threshold. For HGRN2, whose recurrent state is a matrix S_t ∈ ℝ^{d×d}, we identified that the native lower bound parameter in the forget gate can be repurposed as a dynamic gating mechanism: setting λ_t = 1 on non-speech frames freezes the state matrix, while λ_t = 0 enables normal updates. This reuse of an existing HGRN2 component for a new functional purpose is a non-trivial insight that is not achievable by simply substituting HGRN2 for the LSTM.
Second, gradient-preserving state freezing. In the LSTM, state freezing copies the previous cell state unchanged, an abrupt binary operation that stops gradient propagation through skipped frames. In FDE-HGRN2, setting the lower bound to 1 drives g_t to 1, causing S_t to retain its previous value without any discontinuity in the computational graph. This preserves gradient flow through frozen frames during training, resulting in smoother optimization behavior consistent with the improved robustness to batch size reduction observed in the experimental results (Table 1).
Third, removal of the recurrent feedback connection. The original FDE-RNN VAD Prediction module receives explicit feedback from the previous Encoder hidden state h_{t-1}. In HGRN2, the internal multi-head state expansion provides sufficient effective state dimensions to capture the relevant temporal context directly from the input sequence, making this feedback connection unnecessary. Removing it enables full sequence-level parallelization of the Prediction module during training, a key efficiency advantage of linear RNNs. Together, these three integration decisions constitute the non-trivial technical contributions of the FDE-HGRN2 integration.
3.3. Cosine-Annealing Learning Rate (CALR) in FDE-HGRN2 Training
Cosine-annealing learning rate (CALR) [15] is a learning rate scheduling technique designed to improve the training dynamics of neural networks. CALR gradually decreases the learning rate using a cosine function, which enables smoother convergence and improves generalization by avoiding abrupt learning rate drops. In contrast to methods with periodic restarts, we employ a single monotonic decay from a maximum to a minimum learning rate throughout the training process.
The learning rate at the t-th epoch is scheduled as follows:

$$\eta_t = \eta_{\min} + \frac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\!\left(\frac{\pi t}{T}\right)\right)$$

where η_max and η_min are the initial and final learning rates, respectively, and T is the total number of training epochs. The learning rate starts at η_max when t = 0 and decays smoothly to η_min as t approaches T. This cosine-shaped decay avoids sudden changes and allows stable convergence in the later stages of training.
While the original FDE-RNN framework relies on a discrete two-phase step decay schedule, we replace it with the CALR strategy for optimizing our proposed FDE-HGRN2 architecture. The hyperparameters are chosen to match the range of the FDE-RNN two-phase schedule: η_max equals the FDE-RNN initial learning rate, η_min equals the FDE-RNN final learning rate, and T equals the total number of training epochs, so that CALR is a drop-in replacement for the step decay with an identical learning rate range but a smoother decay trajectory. The potential advantages of using CALR scheduling in frame-level PVAD include:
Smooth Convergence: Gradual decay prevents instability in optimization and leads to better convergence behavior.
Balanced Learning: Higher learning rates early in training help explore the parameter space, while lower rates later refine and stabilize the model’s target speaker representations.
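The schedule can be written directly from the formula above (a minimal sketch; `eta_max`, `eta_min`, and `T` correspond to the hyperparameters defined in this section):

```python
import math

def calr(t, eta_max, eta_min, T):
    """Cosine-annealing learning rate: a single monotonic decay from
    eta_max (at t = 0) to eta_min (at t = T), with no restarts."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T))
```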
5. Experimental Results and Discussion
5.1. Overall PVAD Performance and Model Size
Table 1 summarizes the PVAD performance and model size of the original FDE-RNN and the proposed FDE-HGRN2 under different batch size configurations. Due to limited GPU memory, we could not reproduce the baseline FDE-RNN with the original batch size of 256 and instead trained it with a batch size of 64 for all our experiments. This reduction leads to a mild but consistent degradation of the reproduced FDE-RNN: accuracy drops by about one percentage point, mAP decreases from 0.948 to 0.942, and the AP for target speech frames decreases from 0.924 to 0.910, with a similar but smaller reduction for the non-speech/non-target class (from 0.962 to 0.959). These trends suggest that the LSTM-based FDE-RNN benefits from larger mini-batches and is somewhat sensitive to batch size constraints.
Under the same batch size of 64, FDE-HGRN2 achieves the best performance among all variants. Its accuracy reaches 88.61% and its mAP increases to 0.952, not only compensating for the degradation observed in the reproduced FDE-RNN but also slightly surpassing the originally reported FDE-RNN result (mAP 0.948). The AP for target speech frames improves to 0.932, outperforming both the reproduced baseline (0.910) and the original reported model (0.924), while the non-speech/non-target AP is also the highest at 0.964. At the same time, the number of trainable parameters is reduced from 92.372K to 78.724K, resulting in a more compact recurrent backbone.
The key observations from Table 1 are as follows:
Effect of batch size: Comparing “FDE-RNN (reported)” and “FDE-RNN (reproduced)” shows that reducing the batch size from 256 to 64 yields a small but systematic performance loss in terms of accuracy, mAP, and AP for both target speech and non-speech/non-target classes, indicating that the LSTM-based FDE-RNN relies on relatively large mini-batches for its best optimization behavior.
Benefit of HGRN2 backbone: With the same batch size constraint of 64, FDE-HGRN2 attains higher accuracy, mAP, and AP values than both the reproduced FDE-RNN and the originally reported FDE-RNN, demonstrating that replacing the LSTM with HGRN2-based recurrence yields genuine performance gains that cannot be attributed to training conditions alone. The comparison against the originally reported FDE-RNN (batch size 256) is presented as additional context only, since that cross-condition comparison is not fully controlled; the most rigorous evidence is the intra-family comparison at batch size 64. Notably, even under the more constrained batch size of 64, FDE-HGRN2 matches or exceeds the originally reported FDE-RNN trained with batch size 256, suggesting that the HGRN2 backbone is more robust to batch size constraints than its LSTM-based counterpart.
Accuracy–efficiency trade-off: FDE-HGRN2 reduces the parameter count from 92.372K to 78.724K while still achieving the strongest overall PVAD performance in Table 1. Thus, the proposed architecture improves both detection accuracy and model compactness, offering a strictly better accuracy–efficiency trade-off than the original LSTM-based FDE-RNN.
5.2. Note on Statistical Significance
All results are reported from single training runs, consistent with established practice in the PVAD literature (PVAD 1.0, PVAD 2.0, COIN-AT-PVAD, SE-PVAD, and FDE-RNN all report single-run results). We acknowledge that multi-seed experiments would provide the strongest statistical guarantees; however, a single full training run requires approximately one week in our setup, making this infeasible at present. Nonetheless, the consistency of simultaneous improvements across all four evaluation metrics, namely accuracy, AP(tss), AP(ns&ntss), and mAP, provides meaningful evidence that the observed gains are not attributable to random variation. Multi-seed validation is listed as future work.
5.3. Inference Efficiency Analysis
Table 2 reports three complementary efficiency metrics evaluated under identical hardware conditions on a workstation equipped with an Intel(R) Core(TM) i7-14700 CPU and an NVIDIA GeForce RTX 5070 GPU with 12 GB of memory for both models. The results reveal a nuanced picture, consistent with a well-established characteristic of linear recurrent architectures on contemporary hardware.
In terms of computational cost and memory footprint, FDE-HGRN2 demonstrates clear advantages. It reduces amortized floating-point operations per frame from 136.425 to 116.436 kFLOPs (a reduction of approximately 15%), consistent with the theoretical complexity reduction achieved by replacing the four-gate LSTM recurrence with HGRN2’s multi-head linear recurrence. More substantially, peak memory allocation during inference is reduced from 16.601 MB to 8.445 MB, a reduction of approximately 49%. This memory saving is particularly significant for deployment-oriented scenarios, where RAM is often the primary bottleneck rather than raw compute throughput.
However, the real-time factor (RTF) of FDE-HGRN2 (0.0787) is higher than that of FDE-RNN (0.0281), indicating that the wall-clock inference time of FDE-HGRN2 is approximately 2.8 times longer on the measurement platform. This apparent discrepancy between theoretical FLOPs and practical latency is a well-documented phenomenon for linear recurrent architectures: whereas LSTM benefits from highly optimized cuDNN kernels that have been extensively tuned for GPU execution, HGRN2’s outer-product state expansion involves matrix-valued recurrent state updates that do not yet benefit from equivalent hardware-level optimization on general-purpose GPUs. As a result, the theoretical arithmetic savings do not fully translate into wall-clock speedup on the current measurement platform. We note that the RTF of 0.0787 still satisfies real-time processing requirements by a substantial margin (RTF ≪ 1), and that dedicated hardware implementations or operator-level kernel fusion for linear recurrent architectures—an active area of systems research—are expected to narrow this gap significantly. The memory and FLOP advantages of FDE-HGRN2 are therefore more representative of its deployment potential on resource-constrained devices, where memory capacity is the binding constraint rather than latency.
5.4. Impact of Signal-to-Noise Ratio (SNR)
Table 3 reports the PVAD performance of the reproduced FDE-RNN baseline and the proposed FDE-HGRN2 model at SNRs of 5, 10, and 15 dB. Note that the SNR conditions in Table 3 are applied exclusively to the test utterances. All models are trained on MTR-augmented data covering a broad distribution of reverberant and noisy conditions; the SNR values in this section therefore refer to the noise level of the test utterances only. For each SNR level, test utterances are derived from the original test set under controlled noise conditions, and a unified batch size of 64 is employed for all experiments to avoid confounding effects from the training configuration. Across all three SNR settings, FDE-HGRN2 consistently achieves higher AP for both the non-speech/non-target and target speech classes, as well as higher mAP, indicating that the HGRN2 backbone provides more robust sequence modeling under noisy conditions.
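The controlled noise conditions amount to scaling a noise signal so that the speech-to-noise power ratio hits the desired level before mixing. A minimal sketch (the function and array names are illustrative, not the exact corpus-construction code):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` after scaling it so that the resulting
    speech-to-noise power ratio equals `snr_db` (in dB)."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```

Applying such a mixer at 5, 10, and 15 dB to each test utterance yields the three evaluation conditions in Table 3.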
Beyond the AP-based metrics, FDE-HGRN2 also improves frame-wise accuracy at every SNR, with gains of roughly 1.5–2 percentage points over the reproduced FDE-RNN baseline. These improvements show that the benefits of HGRN2 are reflected not only in ranking quality along the precision–recall curve but also in fewer frame-level misclassifications in practical operation. The accuracy and mAP gains persist as the SNR increases from 5 to 15 dB, suggesting that the advantages of HGRN2 are stable across both moderately noisy and relatively clean acoustic environments, rather than being restricted to a particular noise regime.
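The "ranking quality" referred to here is the standard ranking-based average precision, computed per class and averaged into mAP. A minimal self-contained sketch of this metric (illustrative, not the exact evaluation code used in the experiments):

```python
def average_precision(labels, scores):
    """AP for one class: rank frames by score, then average the precision
    observed at each positive frame (step-interpolated P-R area)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

def mean_average_precision(labels_per_class, scores_per_class):
    """mAP: unweighted mean of the class-wise APs (e.g. ns&ntss and tss)."""
    aps = [average_precision(l, s)
           for l, s in zip(labels_per_class, scores_per_class)]
    return sum(aps) / len(aps)
```

A perfectly ranked class yields AP = 1.0 regardless of the absolute posterior values, which is why AP and frame-wise accuracy can move independently.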
5.5. Ablation Study
The ablation results in Table 4 reveal consistent and interpretable trends across the three configurations. The full FDE-HGRN2 model, which employs HGRN2 in both the Dynamic Encoder RNN (DE-RNN) and the interaction block together with CALR, attains the best overall performance, with the highest mAP (0.9522), the strongest class-wise AP (0.9643 for ns&ntss and 0.9317 for tss), and the highest accuracy of 88.61%. This indicates that combining the HGRN2 backbone with CALR yields the strongest frame-level discrimination for both target and non-target speech and leads to the most reliable overall classification.
Removing CALR from FDE-HGRN2 produces a moderate but consistent degradation: mAP decreases from 0.9522 to 0.9477 and accuracy drops from 88.61% to 87.81%. Both tss and ns&ntss AP are slightly reduced (from 0.9317 to 0.9233 and from 0.9643 to 0.9619, respectively), suggesting that CALR provides a steady gain across classes, with a noticeable impact on both target and background discrimination. The reduction in accuracy further confirms that the smooth learning rate decay helps stabilize frame-wise decisions, improving not only AP-based metrics but also the overall error rate.
When the DE-RNN is further reverted from HGRN2 back to LSTM while CALR remains disabled, performance drops again to an mAP of 0.9457 and an accuracy of 87.38%. In this configuration, ns&ntss AP decreases to 0.9615 and tss AP to 0.9169, indicating that the LSTM-based DE-RNN is less effective at modeling background and interfering speakers than its HGRN2 counterpart. This additional degradation highlights the importance of the HGRN2-based DE-RNN: replacing HGRN2 with LSTM weakens the model’s ability to separate target and non-target speech and slightly lowers overall detection quality, confirming that the linear recurrent backbone is a key contributor to the improved accuracy–efficiency trade-off within the FDE framework.
Effect of expansion ratio. To justify the choice of the state expansion ratio, we evaluated FDE-HGRN2 (w/o CALR) under three expansion-ratio settings. The results are summarized in Table 5. As shown, the adopted setting achieves the highest mAP (0.9477) and accuracy (87.81%). Crucially, all three configurations share an identical parameter count of 78.724K, directly and empirically confirming HGRN2's theoretical claim that state expansion is non-parametric: increasing the expansion ratio n enlarges the effective recurrent state capacity via outer-product expansion without introducing additional trainable parameters. Performance degrades at the largest expansion ratio, suggesting diminishing returns when the state space is excessively large relative to the model's scale and training set. We therefore adopt the best-performing expansion ratio as the proposed configuration.
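The non-parametric nature of the expansion can be illustrated with simple shape bookkeeping. This is a simplified sketch of HGRN2-style multi-head accounting, not the exact FDE-HGRN2 layer shapes: with model width d and expansion ratio n, the layer keeps h = d/n heads, each holding an (n x n) matrix state, so the state holds d*n entries while the projection weights stay fixed.

```python
import numpy as np

d = 16  # model width (illustrative)

# Trainable weights of one sketch layer: query, forget-gate, and value
# projections. None of these depend on the expansion ratio n below.
rng = np.random.default_rng(0)
Wq, Wf, Wv = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
n_params = Wq.size + Wf.size + Wv.size  # fixed: 3 * d * d

def state_entries(d, n):
    """Total recurrent state size for expansion ratio n: h = d // n heads,
    each with an (n x n) matrix-valued state, giving h * n * n = d * n."""
    h = d // n
    return h * n * n

sizes = {n: state_entries(d, n) for n in (2, 4, 8)}
# The state grows linearly with n while n_params stays constant.
```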
5.6. Comparison with State-of-the-Art PVAD Methods
Table 6 compares the proposed FDE-HGRN2 with several representative PVAD systems: PVAD 1.0 [4], PVAD 2.0 [5], COIN-AT-PVAD [17], SE-PVAD [18], and the original FDE-RNN [9]. The performance metrics and parameter counts for these baseline systems are taken from the FDE-RNN literature [9], and our experiments follow the same dataset construction and evaluation protocol to ensure a fair comparison. The table reports class-wise AP on tss, ntss, and ns, together with overall mAP, frame-wise accuracy, and model size.
Three main observations can be drawn:
Detection performance: FDE-HGRN2 achieves the highest overall detection performance among all listed models. Its mAP of 0.9522 and accuracy of 88.61% exceed those of SE-PVAD (mAP 0.925, Acc 84.86%) and the originally reported FDE-RNN (mAP 0.948, Acc 87.88%). It also delivers strong target speaker discrimination with an AP (tss) of 0.9317 while maintaining competitive or better AP for the non-target classes, indicating that the HGRN2-based recurrence effectively enhances temporal modeling and target/non-target separation.
Accuracy–efficiency trade-off: Despite its strong performance, FDE-HGRN2 remains parameter-efficient. Compared with earlier PVAD systems such as PVAD 1.0 and SE-PVAD, which each use more than 130K parameters and achieve accuracies below 85%, FDE-HGRN2 attains both higher accuracy (88.61%) and higher mAP with only 78.724K parameters. Relative to the FDE-RNN backbone (92.372K, Acc 87.88%), it reduces parameter count by roughly 15% while still improving mAP and accuracy. Although COIN-AT-PVAD employs slightly fewer parameters (71.869K) and reaches reasonable accuracy (86.74%), its mAP (0.940) remains clearly lower, indicating that FDE-HGRN2 provides a more favorable balance between compactness and predictive power.
Robustness under constrained training: The gap between the reported and reproduced FDE-RNN variants highlights the sensitivity of the LSTM-based architecture to training hyperparameters, especially batch size. When the batch size is reduced from 256 to 64, both mAP and accuracy decrease for FDE-RNN. In contrast, FDE-HGRN2, trained under the same constrained batch size of 64, avoids this degradation and even surpasses the original large-batch FDE-RNN in both mAP and accuracy. This behavior suggests that the HGRN2-based design is more robust to realistic hardware limitations and maintains strong performance without relying on large-batch training.
6. Conclusions and Future Work
In this work, we proposed FDE-HGRN2, an HGRN2-based instantiation of the FDE-PVAD framework that replaces all LSTM modules in the original FDE-RNN backbone while preserving its dynamic encoder gating and detachable personalization architecture. Experiments on a LibriSpeech-derived PVAD benchmark show that FDE-HGRN2 improves mean Average Precision and frame-wise accuracy over both the reported and reproduced FDE-RNN baselines, despite using approximately 15% fewer parameters in the recurrent backbone. The model also exhibits more stable behavior under constrained training conditions and across a range of SNR levels, indicating that the HGRN2 backbone and the CALR schedule jointly provide a more robust accuracy–efficiency trade-off for personal voice activity detection.
A key limitation of the current evaluation is its reliance on synthetically constructed multi-speaker sequences derived from LibriSpeech. Although MTR augmentation partially simulates reverberant and noisy far-field conditions using real room impulse responses and noise corpora, real device-captured recordings introduce additional acoustic artifacts not fully captured by simulation. Evaluating FDE-HGRN2 on real-world corpora such as AMI or CHiME, and integrating it into end-to-end streaming device pipelines with measured latency and power consumption, are important directions for future work. Furthermore, due to computational constraints, all experiments in this work were conducted with a batch size of 64; training and evaluating FDE-HGRN2 at the original FDE-RNN batch size of 256 to enable a fully controlled large-batch comparison remains an important direction for future work.
Future work will focus on extending FDE-HGRN2 toward more realistic deployment scenarios and broader personalized speech applications. One direction is to evaluate the proposed architecture under real-world far-field recordings and device-captured noise, and to integrate it into multi-task pipelines that jointly handle PVAD, speaker verification, and possibly diarization or keyword spotting. In particular, the linear-time HGRN2 backbone is well suited to streaming keyword spotting, and we plan to explore joint PVAD–KWS modeling while explicitly measuring latency, memory footprint, and energy consumption on mobile SoCs. A systematic study of FDE-HGRN2's robustness to imperfect speaker embeddings, including embeddings extracted from short or noisy enrollment utterances, would further characterize the system's practical deployment envelope. We also plan rigorous on-device profiling of FDE-HGRN2, measuring inference latency in milliseconds per frame, floating-point operations per forward pass, and peak memory usage during inference on representative mobile SoCs, to substantiate the efficiency claims made in this paper. Additional directions include combining FDE-HGRN2 with model compression and quantization techniques for tighter resource budgets, and investigating its applicability to other personalized front-end tasks in on-device speech interfaces.