Article

A Secure and Robust Multimodal Framework for In-Vehicle Voice Control: Integrating Bilingual Wake-Up, Speaker Verification, and Fuzzy Command Understanding

1 School of Mechanical and Electrical Engineering, Sanming University, Sanming 365004, China
2 Key Laboratory of Equipment Intelligent Control of Fujian Higher Education Institute, Sanming University, Sanming 365004, China
* Author to whom correspondence should be addressed.
Eng 2025, 6(11), 319; https://doi.org/10.3390/eng6110319
Submission received: 26 August 2025 / Revised: 5 November 2025 / Accepted: 6 November 2025 / Published: 10 November 2025
(This article belongs to the Section Electrical and Electronic Engineering)

Abstract

Intelligent in-vehicle voice systems face critical challenges in robustness, security, and semantic flexibility under complex acoustic conditions. To address these issues holistically, this paper proposes a novel multimodal and secure voice-control framework. The system integrates a hybrid dual-channel wake-up mechanism, combining a commercial English engine (Picovoice) with a custom lightweight ResNet-Lite model for Chinese, to achieve robust cross-lingual activation. For reliable identity authentication, an optimized ECAPA-TDNN model is introduced, enhanced with spectral augmentation, sliding window feature fusion, and an adaptive threshold mechanism. Furthermore, a two-tier fuzzy command matching algorithm operating at character and pinyin levels is designed to significantly improve tolerance to speech variations and ASR errors. Comprehensive experiments on a test set encompassing various Chinese dialects, English accents, and noise environments demonstrate that the proposed system achieves high performance across all components: the wake-up mechanism maintains commercial-grade reliability for English and provides a functional baseline for Chinese; the improved ECAPA-TDNN attains low equal error rates of 2.37% (quiet), 5.59% (background music), and 3.12% (high-speed noise), outperforming standard baselines and showing strong noise robustness relative to the state of the art; and the fuzzy matcher boosts command recognition accuracy to 95.67% in quiet environments and above 92.7% under noise, substantially outperforming hard matching by approximately 30%. End-to-end tests confirm an overall interaction success rate of 93.7%. This work offers a practical, integrated solution for developing secure, robust, and flexible voice interfaces in intelligent vehicles.

1. Introduction

The proliferation of intelligent voice assistants has redefined user interaction within the automotive cabin, promising enhanced convenience and safety [1,2]. However, the transition from a controlled laboratory setting to the challenging in-vehicle environment exposes significant vulnerabilities in conventional voice control pipelines. These systems must concurrently overcome the trifecta of pervasive acoustic noise, the security risks of unverified command execution, and the need for semantic flexibility in interpreting naturally spoken instructions [3,4]. While substantial progress has been made in individual component technologies, a critical examination of the literature reveals a fragmented landscape where optimizations for one objective often come at the expense of another, particularly under the stringent computational constraints of automotive-grade hardware [5].
A systematic analysis of the current state of the art underscores specific and interconnected gaps. In the domain of wake-up word detection, the focus has largely been on monolingual models. Commercial engines like Picovoice offer high accuracy for specific languages but lack inherent cross-lingual support [6]. While recent research has explored multilingual wake-up models [7,8], many of these approaches rely on cloud-based processing, rendering them unsuitable for low-latency, privacy-preserving edge deployment in vehicles. This creates a fundamental limitation for global OEMs requiring flexible language support without sacrificing responsiveness or data security.
For speaker verification, deep learning models, particularly the ECAPA-TDNN architecture [9,10], have set new benchmarks in accuracy, outperforming earlier i-vector [11] and x-vector systems [12,13]. Nevertheless, a critical performance gap persists with short, noisy utterances—the hallmark of in-vehicle commands [1,2]. Although architectural innovations like attention mechanisms [14] and other modifications [15] have pushed performance boundaries, they frequently increase model complexity. This exacerbates the inherent trade-off between accuracy and computational efficiency, making real-time execution on resource-constrained embedded processors a significant challenge [5]. Consequently, even state-of-the-art models exhibit notable performance degradation under the complex noise conditions typical of a driving scenario [16].
Finally, in the realm of command understanding, the advent of powerful end-to-end ASR models like Whisper [17,18] has dramatically improved transcription robustness across accents and noise. However, these systems are not infallible; their output can contain errors or phonetic ambiguities that readily confuse traditional hard keyword-matching strategies [3,19]. This brittleness highlights a critical oversight in many integrated systems: the underdevelopment of a dedicated, fault-tolerant semantic interpretation layer that operates downstream of the ASR [20]. While fuzzy matching techniques exist [19,20], they are seldom designed to handle the unique challenges of Mandarin Chinese, such as homophone disambiguation at scale, nor are they systematically evaluated as an integral part of a secure voice-control pipeline. Furthermore, studies confirm the limitations of repurposing ASR corpora for speaker-related tasks [21], and challenges in adapting a multi-speaker ASR corpus for tasks like diarization [22] further emphasize the domain-specific difficulties in automotive environments.
Therefore, the research gap this paper addresses is not merely the improvement of a single component, but the integrated co-design of a secure, robust, and efficient voice interaction framework tailored for the in-vehicle edge environment. Existing solutions tend to optimize for one or two of these dimensions in isolation, leaving a void for a system that holistically balances cross-lingual wake-up, noise-resilient and efficient speaker verification, and ASR-robust command understanding.
The main contributions of this work are succinctly summarized as follows:
A hybrid dual-channel wake-up mechanism that integrates a commercial English engine (Picovoice) with a custom, lightweight ResNet-Lite model for Chinese, employing intelligent arbitration to achieve robust cross-lingual activation under noise [23,24].
An optimized ECAPA-TDNN model enhanced with spectral augmentation, sliding window feature fusion, and an adaptive threshold mechanism, enabling accurate and efficient speaker verification on resource-constrained hardware [25].
A novel two-tier fuzzy command matching algorithm that operates on character and pinyin levels to significantly improve tolerance to speech variations and ASR errors, achieving high command recognition accuracy [17,19].
A comprehensive integration and validation of the proposed modules into a fully functional, hardware-agnostic framework, demonstrating end-to-end performance gains and practical viability through extensive experiments.
In the spirit of transparent scholarship, we also acknowledge the primary limitations of the present study. The current evaluation was conducted primarily on a high-performance PC platform to validate the conceptual framework and its synergistic advantages. Consequently, performance validation—particularly regarding real-time latency, memory footprint, and power consumption—on actual resource-constrained automotive embedded hardware remains a critical step for future work. Additionally, while the self-developed Chinese wake-up model fulfills a core system requirement, its performance establishes a functional baseline that can be further enhanced with more diverse training data and specialized front-end processing. The current fuzzy matching algorithm is optimized for recall and fault tolerance; a systematic evaluation of its potential false-acceptance risk in open-ended scenarios is warranted. Lastly, the system operates on a single-command basis and does not yet support multi-turn dialog context, which represents a key direction for enhancing interactivity.
The remainder of this paper is structured as follows: Section 2 details the proposed system architecture and methodologies. Section 3 describes the experimental setup and presents a comparative analysis of results. Section 4 discusses the integrated system performance and limitations, and Section 5 concludes the paper and outlines future research directions.

2. Methodology: Key Technologies of the Proposed System

2.1. System Architecture Overview

The proposed intelligent in-vehicle voice control system is designed as a modular pipeline architecture, as illustrated in Figure 1. The system processes a single-channel audio input through a sequence of specialized modules to achieve robust, secure, and flexible voice interaction. The architecture is logically divided into four cohesive layers: the Audio Input Layer, the Wake-up Decision Layer, the Core Processing Layer, and the Control Execution Layer. These layers are decoupled through standardized data interfaces, ensuring maintainability, portability, and efficient cooperation among modules.

2.1.1. Audio Preprocessing and Input

At the Audio Input Layer, raw audio is captured via a microphone at a sampling rate of 16 kHz. The audio stream is then framed using a 25 ms Hamming window with a 10 ms frame shift. Eighty-dimensional log Mel-filter bank (Fbank) energies are extracted as the primary acoustic features. To ensure robustness against variations in recording conditions, Cepstral Mean and Variance Normalization (CMVN) is applied on a per-utterance basis. These processed features serve as the uniform input for all subsequent processing stages.
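The sketch below illustrates this front end (16 kHz mono input, 80-dimensional log-Mel Fbank features with a 25 ms Hamming window and 10 ms shift, and per-utterance CMVN). The use of torchaudio is an assumption for illustration; the paper does not name its feature-extraction library.

```python
# Minimal front-end sketch, assuming torchaudio: 80-dim log-Mel Fbank + per-utterance CMVN.
import torch
import torchaudio

def extract_features(wav_path: str) -> torch.Tensor:
    waveform, sr = torchaudio.load(wav_path)                # (channels, samples)
    if sr != 16000:                                         # resample to 16 kHz
        waveform = torchaudio.functional.resample(waveform, sr, 16000)
    waveform = waveform.mean(dim=0, keepdim=True)           # down-mix to mono

    # 80-dimensional log-Mel filterbank, 25 ms Hamming window, 10 ms frame shift
    fbank = torchaudio.compliance.kaldi.fbank(
        waveform, num_mel_bins=80, frame_length=25.0, frame_shift=10.0,
        sample_frequency=16000, window_type="hamming",
    )                                                       # (num_frames, 80)

    # Per-utterance cepstral mean and variance normalization (CMVN)
    mean = fbank.mean(dim=0, keepdim=True)
    std = fbank.std(dim=0, keepdim=True)
    return (fbank - mean) / (std + 1e-8)
```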

2.1.2. System Workflow

The workflow of the proposed system, aligned with its four-layer architecture, proceeds as follows:
(1)
Audio Input Layer: Raw audio is continuously captured and preprocessed as described in Section 2.1.1. The stream of processed audio features is then delivered to the subsequent layer.
(2)
Wake-up Decision Layer: The feature stream is continuously analyzed by a dual-channel wake-up mechanism (detailed in Section 2.2). This mechanism operates two models in parallel: a commercial engine (Picovoice Porcupine) for the English wake word (“Hey Porcupine”) and a custom, lightweight ResNet-Lite model for the Chinese wake word (“Xiaotun”). An arbitration logic fuses their confidence scores to make a robust, language-agnostic wake-up decision. Upon a positive detection, the system activates, and a segment of audio containing the user’s subsequent command is forwarded to the next layer.
(3)
Core Processing Layer: This layer undertakes three critical tasks sequentially upon receiving the audio segment from the Wake-up Decision Layer:
  • Voice Activity Detection (VAD) and Endpointing: A Silero VAD module first processes the audio segment to precisely detect the start and end points of the user’s spoken command, removing leading and trailing silence.
  • Speaker Verification: The segmented speech is then fed into our improved ECAPA-TDNN model (Section 2.3) to generate a speaker embedding and verify the user’s identity against enrolled profiles.
  • Command Parsing: Only upon successful speaker verification, the verified audio segment is transcribed to text by the Whisper ASR engine. The transcribed text is then interpreted by our dual-tier fuzzy command matching algorithm (Section 2.4), which determines the user’s intent by calculating similarity against a set of predefined commands at both character and pinyin levels.
The outcome of this process—a recognized command from a verified speaker—is then delivered to the final layer.
(4)
Control Execution Layer: This layer acts as the execution endpoint. It receives the validated command from the Core Processing Layer and executes it either on a custom Tkinter-based software simulator (developed in-house using Python libraries for prototyping and validation) or on actual in-vehicle hardware via a hardware-agnostic control interface (for deployment), facilitating a seamless “simulation-to-deployment” transition.
This streamlined sequential workflow, strictly adhering to the four-layer design, provides a stable and efficient foundation for the key technical optimizations detailed in the subsequent sections.
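For illustration only, the following Python sketch shows how the four layers could be orchestrated end to end. The module objects and method names (wakeup, vad, verifier, asr, matcher, executor) are hypothetical placeholders standing in for the components described in Sections 2.2, 2.3 and 2.4, not the authors' implementation.

```python
# Hypothetical orchestration of the four-layer workflow (a sketch, not the deployed code).
def process_audio_stream(frames, wakeup, vad, verifier, asr, matcher, executor):
    for segment in wakeup.listen(frames):          # Wake-up Decision Layer (dual-channel)
        speech = vad.trim(segment)                 # remove leading/trailing silence (Silero VAD)
        speaker = verifier.verify(speech)          # improved ECAPA-TDNN speaker verification
        if speaker is None:
            continue                               # reject commands from unverified speakers
        text = asr.transcribe(speech)              # Whisper transcription
        command = matcher.match(text)              # dual-tier fuzzy command matching
        if command is not None:
            executor.execute(command, speaker)     # Control Execution Layer
```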

2.2. Dual-Model Wake-Up Mechanism

The wake-up function serves as the critical entry point of the entire voice interaction system. To achieve high reliability in noisy and multilingual in-vehicle environments, a hybrid dual-model wake-up mechanism is proposed, overcoming the inherent trade-off between recall and false-alarm rates in traditional single-model solutions.

2.2.1. Hybrid Architecture Design

The system’s dual-channel parallel processing architecture comprises the following:
English Channel: Utilizes the commercially available Picovoice Porcupine engine, optimized for the wake word “Hey Porcupine”.
Chinese Channel: Employs a custom-developed ResNet-Lite model, specifically designed for accurate detection of the Chinese wake word “Xiaotun”.
Both channels operate on the shared unified front-end processing chain described in Section 2.1.1, ensuring consistent feature extraction. The core innovation lies in an interlocked, confidence-based arbitration logic formulated as follows:
$$\mathrm{Decision} = \begin{cases} \text{``Hey Porcupine''}, & \text{if } f_{\mathrm{eng}}(X) \ge \theta_{\mathrm{eng}} \ \wedge\ f_{\mathrm{chn}}(X) \le \theta_{\mathrm{rej}}, \\ \text{``Xiaotun''}, & \text{if } f_{\mathrm{chn}}(X) \ge \theta_{\mathrm{chn}} \ \wedge\ f_{\mathrm{eng}}(X) \le \theta_{\mathrm{rej}}, \\ \mathrm{Reject}, & \text{otherwise}. \end{cases}$$
Here, $X = \mathrm{MelSpectrogram}(x)$ is the 64-dimensional input feature; $f_{\mathrm{eng}}(\cdot)$ and $f_{\mathrm{chn}}(\cdot)$ are the channel confidence scores in $[0,1]$; and $\theta_{\mathrm{eng}}$, $\theta_{\mathrm{chn}}$, $\theta_{\mathrm{rej}}$ are channel-specific thresholds. The thresholds are optimized via the Neyman–Pearson criterion with a maximum cross-language false wake-up rate of $\alpha = 0.05$.
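A minimal sketch of this arbitration rule is given below. The threshold values are illustrative defaults, not the Neyman–Pearson-optimized values used in the system.

```python
# Sketch of the confidence-based arbitration logic; thresholds here are illustrative.
def arbitrate(score_eng: float, score_chn: float,
              theta_eng: float = 0.8, theta_chn: float = 0.8,
              theta_rej: float = 0.5) -> str:
    if score_eng >= theta_eng and score_chn <= theta_rej:
        return "Hey Porcupine"        # English channel fires, Chinese channel stays quiet
    if score_chn >= theta_chn and score_eng <= theta_rej:
        return "Xiaotun"              # Chinese channel fires, English channel stays quiet
    return "Reject"                   # ambiguous or low-confidence input
```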

2.2.2. Lightweight Optimization of ResNet-Lite

Considering the resource constraints of automotive-grade processors, the classic ResNet-18 framework [26] was redesigned through deep pruning and structural reconstruction to create the lightweight ResNet-Lite model. The detailed architectural changes, in comparison to the original ResNet-18, are visually summarized in Figure 2. The key optimization strategies include the following:
(1)
Stage Reduction: The original four residual stages were reduced to three, halving the network depth from 18 to 9 weighted layers.
(2)
Kernel Size Reduction: The large 7 × 7 convolution in the input stem was replaced with a smaller 3 × 3 kernel.
(3)
Pooling Removal: The initial max-pooling layer was omitted to preserve fine-grained temporal and spectral features crucial for discerning short wake-word syllables and Mandarin tones.
(4)
Progressive Channel Scaling: The number of convolutional channels scales as 32 → 64 → 128 → 256 across stages. A down-sampling stride s of 2 is applied only at the first convolutional layer of each stage, while all other layers use a stride of 1 to better preserve temporal details:
$$C_{\mathrm{out}} = 2 \times C_{\mathrm{in}}, \qquad s = \begin{cases} 2, & \text{first layer of stage}; \\ 1, & \text{other layers}. \end{cases}$$
The ResNet-Lite model was implemented in the PyTorch framework (version 2.3.1+cu118) and trained with the built-in Adam optimizer using a learning rate of 1 × 10−3 and a batch size of 16 for 100 epochs; if the EER on the validation set showed no improvement over 3 consecutive epochs, the learning rate was halved. These optimizations collectively achieved a significant reduction in computational complexity while retaining strong discriminative performance, as detailed in Table 1.
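For illustration, the PyTorch sketch below shows one way a ResNet-Lite-style network matching the four optimizations above could be assembled (3 × 3 stem, no initial max-pooling, three stride-2 stages with 32 → 64 → 128 → 256 channels). The block counts and classification head are assumptions made for this sketch; Figure 2 remains the authoritative layout.

```python
# Sketch of a ResNet-Lite-style wake-up model under the stated assumptions.
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        # 1x1 projection on the shortcut when shape changes
        self.down = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
            nn.BatchNorm2d(out_ch),
        ) if (stride != 1 or in_ch != out_ch) else nn.Identity()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.down(x))

class ResNetLite(nn.Module):
    def __init__(self, num_classes=2):           # wake-word vs. non-wake-word
        super().__init__()
        self.stem = nn.Sequential(                # 3x3 stem, no max-pooling layer
            nn.Conv2d(1, 32, 3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(32), nn.ReLU(inplace=True),
        )
        self.stage1 = BasicBlock(32, 64, stride=2)    # stride 2 only at stage entry
        self.stage2 = BasicBlock(64, 128, stride=2)
        self.stage3 = BasicBlock(128, 256, stride=2)
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(256, num_classes))

    def forward(self, x):                         # x: (batch, 1, mel_bins, frames)
        return self.head(self.stage3(self.stage2(self.stage1(self.stem(x)))))
```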

2.3. Voiceprint Recognition Optimization: Real-Time Speaker Verification Based on an Improved ECAPA-TDNN

The ECAPA-TDNN (Emphasized Channel Attention, Propagation and Aggregation-TDNN) architecture is currently one of the most widely adopted end-to-end frameworks for speaker recognition [9]. It exhibits several key advantages: enhanced local and multi-scale feature extraction via SE-Res2Net modules, improved speaker-discriminative capability through channel attention mechanisms, global temporal modeling using Attentive Statistics Pooling, and robust speaker embedding under short utterance and low-resource conditions.
However, ECAPA-TDNN still faces two practical challenges in in-vehicle environments characterized by short utterances and high concurrency: (1) unstable feature distributions in short speech segments can impair sequential modeling; (2) static decision thresholds may not adapt well to varying speakers or noise levels, leading to false acceptances or rejections. To address these issues, we propose three optimization strategies for different stages of the voiceprint recognition pipeline:
At the model architecture level, we design a structurally enhanced ECAPA-TDNN to improve feature representation and deep modeling capabilities (Section 2.3.1);
During acoustic feature extraction, we introduce a sliding window feature fusion mechanism to enhance robustness against short utterances (Section 2.3.2);
At the decision stage, we develop an adaptive threshold mechanism to improve system reliability under multi-speaker and noisy conditions (Section 2.3.3).

2.3.1. Improved ECAPA-TDNN Architecture

To improve ECAPA-TDNN performance in real-time in-vehicle speaker verification tasks, we propose four key structural enhancements based on the original framework, forming an improved ECAPA-TDNN model. These enhancements span input augmentation, feature processing, forward modeling, and output optimization. The overall structure is illustrated in Figure 3, where green modules indicate newly added components. The enhancements are as follows:
(1)
Input Augmentation with FbankAug
Given the complex and variable Signal-to-Noise Ratio (SNR) in in-vehicle audio inputs, we introduce a SpecAugment-style data augmentation module prior to feature extraction. This module applies frequency masking and time warping during training, thereby improving model generalization and robustness (a minimal sketch is provided at the end of this subsection).
(2)
Feature Processing Pipeline Enhancement (log + CMVN)
After MFCC/Fbank feature extraction, we apply logarithmic compression and CMVN. This enhances the stability of feature distribution, accelerates convergence, and improves adaptability to device or environment variations.
(3)
Forward Path Improvement via Cross-Layer Residuals
To strengthen long-range dependency modeling and deep feature aggregation, we incorporate cross-layer residual connections across multiple TDNN layers on top of the SE-Res2Block modules. This mitigates gradient vanishing issues and enhances the retention of critical information.
(4)
Lightweight Output Design
To reduce parameter count and inference latency while preserving embedding quality, we prune and compress the final statistics pooling and projection layers. This meets the dual demand for efficiency and responsiveness on in-vehicle platforms.
The improved ECAPA-TDNN model was implemented with the PyTorch toolkit and trained using the Adam optimizer with a learning rate of 1 × 10−3, a weight decay of 2 × 10−5, and a batch size of 64 for 50 epochs; if the EER on the validation set showed no improvement over 3 consecutive epochs, the learning rate was halved.
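As referenced in item (1), the sketch below illustrates a SpecAugment-style FbankAug module [25]. Time warping is omitted for brevity, and the mask widths are illustrative assumptions rather than the values tuned in this work.

```python
# Sketch of SpecAugment-style masking applied to one utterance's Fbank features during training.
import torch

def fbank_augment(feats: torch.Tensor, max_freq_mask: int = 8,
                  max_time_mask: int = 10) -> torch.Tensor:
    """feats: (num_frames, num_mel_bins) log-Mel features of a single utterance."""
    feats = feats.clone()
    num_frames, num_bins = feats.shape

    # Frequency masking: zero out a random band of Mel bins
    f = int(torch.randint(0, max_freq_mask + 1, (1,)))
    f0 = int(torch.randint(0, max(1, num_bins - f), (1,)))
    feats[:, f0:f0 + f] = 0.0

    # Time masking: zero out a random span of frames
    t = int(torch.randint(0, max_time_mask + 1, (1,)))
    t0 = int(torch.randint(0, max(1, num_frames - t), (1,)))
    feats[t0:t0 + t, :] = 0.0
    return feats
```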

2.3.2. Sliding Window Feature Fusion

To improve robustness and recognition accuracy for short speech segments, we adopt a multi-scale sliding window feature fusion strategy. This method segments the input feature sequence into overlapping windows and averages intra-window features to enhance local speaker representation, as shown in Figure 4.
Let the original feature sequence be
$$X = \{x_1, x_2, \ldots, x_T\}, \qquad x_t \in \mathbb{R}^{d},$$
where $T$ is the total number of frames and $d$ is the feature dimension. Using a window length $L$ and step size $S$, we slide over $X$ to generate subsequences:
$$X_i = \{x_i, x_{i+1}, \ldots, x_{i+L-1}\}, \qquad i = 1,\ 1+S,\ 1+2S,\ \ldots$$
Each window yields a deep embedding $f_i$, and the final speaker representation is obtained via mean pooling:
$$f_{\mathrm{final}} = \frac{1}{N}\sum_{i=1}^{N} f_i,$$
where $N$ is the total number of windows. This fusion mechanism increases robustness to short-term variations and unstable feature patterns.
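A minimal sketch of this fusion strategy is shown below. Here embed_fn stands in for the improved ECAPA-TDNN forward pass, and the window length and step size are illustrative assumptions.

```python
# Sketch of sliding window feature fusion: window the frames, embed each window, mean-pool.
import torch

def sliding_window_embedding(feats: torch.Tensor, embed_fn,
                             win_len: int = 200, step: int = 100) -> torch.Tensor:
    """feats: (T, d) frame-level features; returns one fused speaker embedding."""
    T = feats.shape[0]
    if T <= win_len:                                    # short utterance: single window
        return embed_fn(feats)
    embeddings = []
    for start in range(0, T - win_len + 1, step):       # i = 1, 1+S, 1+2S, ...
        window = feats[start:start + win_len]           # X_i = (x_i, ..., x_{i+L-1})
        embeddings.append(embed_fn(window))             # f_i
    return torch.stack(embeddings, dim=0).mean(dim=0)   # f_final = (1/N) * sum_i f_i
```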

2.3.3. Adaptive Threshold Decision Mechanism

Traditional speaker verification systems typically use a fixed threshold $\theta$ to determine whether a test voiceprint matches an enrolled one:
$$\mathrm{Accept} \iff \mathrm{sim}(f_{\mathrm{test}}, f_{\mathrm{enroll}}) > \theta_{\mathrm{fixed}},$$
where $\mathrm{sim}(\cdot)$ denotes a similarity metric function, such as cosine similarity; $f_{\mathrm{test}}$ is the feature embedding vector of the test utterance; $f_{\mathrm{enroll}}$ is the enrolled speaker's feature embedding vector; and $\theta_{\mathrm{fixed}}$ is the pre-defined fixed decision threshold. However, this approach demonstrates poor robustness against noise interference, speaker variability, and multi-speaker scenarios. To address these limitations, this paper proposes a dual dynamic threshold decision mechanism that performs adaptive judgment from two dimensions, recognition confidence and environmental noise, as shown in Figure 5.
First, a relative threshold criterion based on recognition confidence is introduced. The system not only requires the highest similarity score to exceed a base threshold but also mandates a certain margin condition between the highest and the second-highest scores to enhance the reliability of the recognition result.
Second, a noise-adaptive linear threshold adjustment mechanism is designed. This mechanism dynamically adjusts the base recognition threshold based on the SNR of the current acoustic environment:
The SNR of the input audio is estimated, yielding a raw SNR value.
The SNR is constrained within the range [0, 20] dB:
$$\mathrm{snr} = \max\bigl(0, \min(\mathrm{SNR},\ 20)\bigr),$$
The dynamic threshold θ is computed via linear interpolation, allowing it to increase gradually from 0.34 (in high-noise conditions) to 0.44 (in quiet conditions) as the SNR improves:
$$\theta_{\mathrm{dynamic}} = 0.34 + (0.44 - 0.34) \times \frac{\mathrm{snr}}{20},$$
The baseline values of 0.34 (noisy) and 0.44 (quiet) were determined empirically through grid search on a held-out validation set to optimize the balance between false acceptance rate and false rejection rate under representative high and low SNR conditions.
The integrated decision rule, which incorporates both the confidence margin and the noise-adaptive threshold, is then defined as follows:
$$\mathrm{Accept} \iff \mathrm{sim}\bigl(f_{\mathrm{test}}, f_{\mathrm{enroll},1}\bigr) > \theta_{\mathrm{dynamic}} \ \wedge\ \mathrm{sim}\bigl(f_{\mathrm{test}}, f_{\mathrm{enroll},1}\bigr) - \mathrm{sim}\bigl(f_{\mathrm{test}}, f_{\mathrm{enroll},2}\bigr) \ge 0.15,$$
Here, $f_{\mathrm{enroll},1}$ and $f_{\mathrm{enroll},2}$ represent the enrolled voice features with the highest and second-highest similarity to the test voice feature, respectively. $\theta_{\mathrm{dynamic}}$ is the dynamic base threshold computed in Equation (8). The constant 0.15 is the empirically set minimum confidence margin.
The proposed mechanism incorporates both environmental awareness and confidence measurement. First, it dynamically relaxes the recognition requirements in high-noise environments. Then, it maintains strict criteria when conditions are quiet. This approach ultimately enhances the system’s robustness and recognition accuracy across diverse operational environments.
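The sketch below renders this combined decision logic in code. It assumes cosine similarity scores against all enrolled speakers and an externally estimated SNR; it is a simplified illustration of the mechanism, not the deployed implementation.

```python
# Sketch of the dual dynamic threshold decision: SNR-interpolated base threshold + top-1/top-2 margin.
def dynamic_threshold(snr_db: float) -> float:
    snr = max(0.0, min(snr_db, 20.0))                 # clamp estimated SNR to [0, 20] dB
    return 0.34 + (0.44 - 0.34) * snr / 20.0          # 0.34 (noisy) -> 0.44 (quiet)

def accept_speaker(similarities, snr_db: float, margin: float = 0.15) -> bool:
    """similarities: cosine scores of the test embedding against all enrolled speakers."""
    ranked = sorted(similarities, reverse=True)
    top1 = ranked[0]
    top2 = ranked[1] if len(ranked) > 1 else 0.0
    return top1 > dynamic_threshold(snr_db) and (top1 - top2) >= margin
```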

2.4. Fuzzy Command Matching Algorithm

In practical in-vehicle voice interaction scenarios, drivers’ spoken commands may suffer from issues such as unclear pronunciation, unstable speech rate, dialect interference, or background noise. These factors make it difficult for traditional keyword-based hard matching strategies to accurately interpret user intent. To address this, a dual-layer fuzzy matching mechanism is designed that integrates both character-level and pinyin-level (i.e., phonetic transcription for Chinese) information to enhance the system’s fault tolerance and semantic understanding capabilities. The overall logic scheme is illustrated in Figure 6.
We first define a set of built-in standard commands in the system:
$$C = \{c_1, c_2, \ldots, c_n\},$$
The user’s speech is transcribed into a text string by the Whisper model. The objective is to find the command from the set C that best matches the user’s intent:
$$\hat{c} = \arg\max_{c_i \in C} \mathrm{Sim}(s, c_i),$$
where $\mathrm{Sim}(s, c_i)$ denotes the overall similarity score between the transcribed speech $s$ and the candidate command $c_i$. To improve robustness in the matching decision, the system performs matching from both character and pinyin perspectives and adopts a maximum-value strategy for integrated decision-making.
At the character level, the system calculates the Levenshtein edit distance and normalizes it into a similarity score:
$$\mathrm{CharSim}(s, c_i) = 1 - \frac{d_{\mathrm{edit}}(s, c_i)}{\max(|s|, |c_i|)},$$
where $d_{\mathrm{edit}}(s, c_i)$ is the minimum number of edit operations (insertions, deletions, substitutions) required to transform $s$ into $c_i$, with smaller values indicating higher similarity. This method provides good tolerance against spelling errors or transcription deviations.
However, in the context of Mandarin Chinese, relying solely on character-level matching may lead to issues such as homophone misrecognition and pronunciation variation. Therefore, the system additionally introduces a pinyin-level matching mechanism. A word segmentation and pinyin conversion module is first used to convert both $s$ and $c_i$ into pinyin sequences, e.g.,
$$\mathrm{Pinyin}(s) = (p_1, p_2, \ldots, p_m),$$
Then, the edit distance between the pinyin sequences is calculated and normalized into a similarity score:
$$\mathrm{PinSim}(s, c_i) = 1 - \frac{d_{\mathrm{edit}}\bigl(\mathrm{Pinyin}(s), \mathrm{Pinyin}(c_i)\bigr)}{\max(m, n)},$$
where $m$ and $n$ denote the lengths of the pinyin sequences for $s$ and $c_i$, respectively. Pinyin-level matching is particularly useful for identifying expressions with similar pronunciations but different written forms.
Finally, the overall similarity decision is made using a maximum value strategy, as defined below:
$$\mathrm{Sim}(s, c_i) = \begin{cases} 1, & \text{if } \max\bigl(\mathrm{CharSim}(s, c_i),\ \mathrm{PinSim}(s, c_i)\bigr) \ge \theta, \\ 0, & \text{otherwise}, \end{cases}$$
where $\theta$ is the predefined similarity threshold. A command is considered a successful match if either the character-level or pinyin-level similarity exceeds $\theta$. Compared with weighted decision strategies, this maximum strategy offers greater fault tolerance.
This algorithm significantly enhances robustness against both transcription deviations and pronunciation ambiguities. The character-level matching component effectively handles ASR errors. Simultaneously, the pinyin-level matching is designed to comprehend non-standard pronunciation and homophones. The maximum value strategy enables intelligent adaptation to diverse command expressions. Consequently, this module exhibits strong scalability and serves as a critical component for enhancing overall user experience and practical system usability.
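A compact sketch of the dual-tier matcher is given below. The pypinyin package is assumed for pinyin conversion, and the similarity threshold is an illustrative value rather than the system's tuned setting.

```python
# Sketch of dual-tier fuzzy matching: normalized Levenshtein similarity at character
# and pinyin levels, combined with the maximum-value decision strategy.
from pypinyin import lazy_pinyin   # assumed pinyin-conversion dependency

def edit_distance(a, b):
    """Plain dynamic-programming Levenshtein distance over two sequences."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]

def similarity(a, b):
    return 1.0 - edit_distance(a, b) / max(len(a), len(b), 1)

def match_command(transcript, commands, theta=0.6):
    """Return the best-matching command, or None if no score reaches theta."""
    best_cmd, best_score = None, 0.0
    for cmd in commands:
        char_sim = similarity(transcript, cmd)                           # CharSim
        pin_sim = similarity(lazy_pinyin(transcript), lazy_pinyin(cmd))  # PinSim
        score = max(char_sim, pin_sim)                                   # max-value strategy
        if score > best_score:
            best_cmd, best_score = cmd, score
    return best_cmd if best_score >= theta else None
```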

3. Experiments and Results

3.1. Experimental Data and Setup

To ensure a comprehensive and realistic evaluation of the proposed in-vehicle voice control system, we employed a multi-source data strategy for both training and testing. This section details the composition of the datasets, the data preprocessing pipeline, and the experimental platform configuration to ensure the reproducibility of our results.

3.1.1. Data Composition and Statistics

The experiments utilized a combination of large-scale public corpora and a privately collected dataset to cover various aspects of wake-up word detection, speaker verification, and command recognition. The specific composition is summarized in Table 2.
Wake-up Word Data: The training and testing sets for the self-developed Chinese wake-up word “Xiaotun” were constructed from a private dataset recorded by 20 native speakers (aged 22–45, representing 10 major Chinese dialect regions, e.g., Qingdao and Dezhou of Shandong, Shaoguan and Guangzhou of Guangdong, Nanchang of Jiangxi, Xiamen, Quanzhou and Sanming of Fujian, Zhengzhou of Henan, Wuhan of Hubei, Chengdu of Sichuan, Chongqing), supplemented with synthetic speech generated by a Text-to-Speech (TTS) engine. The data includes a balanced mix of positive (wake-word) and negative (non-wake-word) samples. The English wake-up word “Hey Porcupine” was handled by the commercial Picovoice engine using its built-in model.
Speaker Verification Data: The model was trained on the large-scale English VoxCeleb2 dataset. For testing, the official test sets of VoxCeleb2 (English) and CN-Celeb (Chinese) were used to evaluate cross-lingual and accent generalization. The test sets were augmented with three typical in-vehicle acoustic conditions: quiet, background music, and simulated high-speed driving noise.
Command Recognition Data: A private dataset was collected for command recognition, comprising 6871 real recordings of 16 in-vehicle command categories (e.g., “open window,” “adjust temperature”) spoken by 20 native Chinese speakers (same as the Wake-up Word Data) across the three noise conditions.
To rigorously simulate real-world in-vehicle acoustic environments, the background music and noise conditions were created by adding interference from the MUSAN corpus [27] to the clean test utterances at specific SNR ranges. The SNR settings for each condition were as follows: “noise” (from MUSAN noise) at [0, 15] dB, “speech” (babble from MUSAN speech) at [13, 20] dB, and “music” (from MUSAN music) at [5, 15] dB. All private data collection procedures obtained written consent from participants, and original audio files were anonymized to preserve privacy.
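For reference, the sketch below shows one common way to mix a noise excerpt into a clean utterance at a target SNR. It reflects standard practice rather than the exact augmentation code used in this study; corpus handling and file I/O are omitted.

```python
# Sketch of SNR-controlled noise mixing for simulating in-vehicle acoustic conditions.
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `clean` (1-D float arrays, amplitude in [-1, 1]) at the given SNR in dB."""
    if len(noise) < len(clean):                       # loop the noise if it is too short
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]
    p_clean = np.mean(clean ** 2) + 1e-12
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))   # scale noise to target SNR
    mixed = clean + scale * noise
    return np.clip(mixed, -1.0, 1.0)                  # keep within the normalized range
```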

3.1.2. Data Preprocessing and Feature Extraction

A standardized front-end processing pipeline was applied to all audio data to ensure consistency:
Audio Preprocessing: Raw audio was resampled to a 16 kHz mono channel. The waveform amplitude was normalized to the range [−1, 1].
Feature Extraction: We extracted 80-dimensional log-Mel filterbank features. The features were computed using a 25 ms Hamming window with a 10 ms frame shift.
Feature Normalization: CMVN was applied to the Fbank features at the utterance level to reduce session variability and improve model convergence. This was integrated within the model’s forward pass.

3.1.3. Experimental Platform and Hyperparameters

The experiments were conducted on a workstation equipped with an AMD Ryzen 7 5800H CPU and an NVIDIA GeForce RTX 3060 GPU. The key hyperparameter settings for the ResNet-Lite model in the wake-up module and the improved ECAPA-TDNN model in speaker verification are summarized in Table 3. An early stopping strategy was employed where the learning rate was halved if the validation loss did not improve for three consecutive epochs.

3.2. Evaluation of the Dual-Channel Wake-Up Mechanism

3.2.1. Experimental Design and Rationale

The evaluation of the wake-up module was designed to validate the performance and integration of the proposed dual-channel architecture. The primary objective was to ascertain that the system provides reliable bilingual wake-up capability without compromising the performance of its individual components. To this end, the experimental design focused on a comparative analysis across three key configurations to isolate and measure the impact of integration:
(1)
The Commercial Baseline: The native performance of the Picovoice Porcupine engine for the English wake-word “Hey Porcupine” was measured to establish a reference benchmark for a mature, commercial-grade solution.
(2)
The Integrated Commercial Channel: The performance of the same Picovoice engine was re-evaluated within our integrated system framework. This direct comparison aims to quantify any performance overhead or degradation introduced by our unified front-end audio processing pipeline and system orchestration.
(3)
The Self-Developed ResNet-Lite Channel: The performance of our custom, lightweight ResNet-Lite model for the Chinese wake-word “Xiaotun” was evaluated. In the absence of a directly comparable open-source Chinese model, this evaluation serves to demonstrate the standalone viability and effectiveness of our custom-developed solution in fulfilling a critical system requirement.

3.2.2. Results and Discussion

The performance of the wake-up modules across the three noise conditions is detailed in Table 4. All data are presented as the mean of five repeated measurements. The metrics of accuracy and wake-up success rate provide a comprehensive view of the modules’ reliability and robustness.
The results demonstrate several key findings. First, the integration of the commercial Picovoice engine into our system was highly successful. The performance metrics for the English wake-word “Hey Porcupine” within our framework are nearly identical to its native operation in quiet (Condition A) and driving noise (Condition C) conditions, with accuracy and success rates differing by no more than 0.01% and 0.00%, respectively. This indicates that our system architecture introduces negligible overhead and correctly interfaces with the complex commercial engine.
A more notable observation is the performance under background music (Condition B). Here, the wake-up success rate for the Picovoice engine within our system (93.19%) is 2.19% lower than its native performance (95.38%). This slight degradation suggests that our front-end audio processing (e.g., resampling, normalization) or the parallel operation with the Chinese channel may introduce minor variability in challenging acoustic scenarios where music spectral content overlaps with speech. Nevertheless, the engine maintained a high accuracy of 99.66%, confirming that correct detections remain highly reliable.
The self-developed ResNet-Lite module for the Chinese wake-word “Xiaotun” achieved wake-up accuracies of 78.41%, 70.26%, and 65.17% under quiet, background music, and driving noise conditions, respectively, with corresponding success rates of 66.0%, 56.0%, and 75.0%. Acknowledging that these results are preliminary and fall short of commercial benchmarks is crucial for an objective evaluation. The performance gap can be rationally attributed to several fundamental constraints faced by this research prototype.
First, a primary factor is the immense disparity in training data scale and diversity. Our model was trained on approximately 1200 h of primarily synthetic (TTS) Chinese speech, a volume that is orders of magnitude smaller than the millions of hours of real, varied, and domain-specific speech data used to train commercial-grade engines like Picovoice. This limited scale directly restricts the model’s ability to learn noise-robust acoustic features and generalize to the vast variability of real-user speech in challenging environments.
Second, the choice of a lightweight ResNet-Lite architecture represents a conscious trade-off between performance and efficiency. While this design is essential for feasible deployment on resource-constrained automotive-grade hardware, it inherently caps the model’s capacity and complexity. Consequently, its ability to discern subtle phonetic cues of the wake-word “Xiaotun” amidst strong background interference, such as the spectral masking from music or the broad-frequency noise of high-speed driving, is limited compared to larger, computationally intensive commercial models.
Despite these current limitations, which provide a clear roadmap for future optimization, the successful development and integration of this module are critical for three key reasons. It fulfills the fundamental system requirement for a custom Chinese wake-word where no off-the-shelf solution existed. The results establish a valuable baseline and definitively pinpoint the need for enhanced data augmentation and specialized front-end processing. Finally, it proves the core functionality and viability of the dual-channel architecture itself.
In conclusion, the evaluation confirms the operational success of the integrated dual-channel system. The commercial English channel provides mature, high-performance wake-up, while the self-developed Chinese channel delivers a crucial, functional capability at a preliminary performance level. This work successfully establishes a scalable framework, with the current performance of the Chinese channel clearly highlighting the specific and addressable challenges—namely data scale and model optimization—for future work to build upon.

3.3. Performance Evaluation of Speaker Recognition

3.3.1. Comparative Models and Evaluation Plan

This subsection details the specific comparative models and evaluation plan for the speaker verification task. The evaluation is conducted on the test sets and under the noise conditions defined in Section 3.1.1. Performance is measured using the standard EER and minimum Detection Cost Function (minDCF).
Three models form the core of this comparison. The standard ECAPA-TDNN [9] serves as the primary baseline to quantify the improvement delivered by our proposed enhancements. The CAM++ model [7], a current state-of-the-art approach utilizing context-aware masking, is included as a benchmark for top-tier academic performance. It is compared against our final improved ECAPA-TDNN model, which integrates the sliding window feature fusion and adaptive thresholding mechanisms described in Section 2.3.

3.3.2. Comparative Results and Analysis

The comprehensive evaluation of speaker verification performance, measured by Equal Error Rate (EER) and minimum Detection Cost Function (minDCF) across diverse acoustic conditions, is summarized in Figure 7 and Figure 8. The results unequivocally demonstrate that our proposed improved ECAPA-TDNN achieves a favorable balance between high accuracy and exceptional noise robustness, positioning it as a highly practical solution for in-vehicle environments.
(1)
Superiority over the Standard Baseline
Our improved ECAPA-TDNN exhibits a decisive performance advantage over the standard ECAPA-TDNN baseline across all test scenarios. In quiet conditions, the improvement is solid, notably reducing the EER on the English test set from 4.15% to 2.37%. However, the most significant gains are observed in noisy environments. For instance, under simulated high-speed driving noise, our model dramatically lowers the EER from 44.72% to 3.12% on the English set and from 47.12% to 21.68% on the more challenging Chinese set. This remarkable improvement, reflected similarly in the minDCF metrics, underscores the critical effectiveness of the sliding window feature fusion and adaptive thresholding mechanisms in mitigating the severe impact of environmental noise on speaker verification reliability.
A notable observation is the performance gap between the English and Chinese test sets, which can be attributed to the inherent challenge of cross-lingual generalization. The model was primarily trained on the large-scale English VoxCeleb2 corpus, leading to a potential domain mismatch when evaluating on the Chinese CN-Celeb dataset. Furthermore, speaker-specific cues inherent to Mandarin Chinese, such as its tonal nature, may not be fully leveraged by a model optimized for English phonetic patterns. Despite this gap, our improvements consistently and significantly enhance performance on both languages, confirming the universal value of our proposed mechanisms in boosting noise robustness.
(2)
Highly Competitive Performance against the State of the Art
When benchmarked against the current state-of-the-art CAM++ model, our improved ECAPA-TDNN demonstrates highly competitive and, in key aspects, superior characteristics. While CAM++ achieves an exceptionally low EER in quiet conditions (e.g., 3.71% vs. our 2.37% on English), its performance is more susceptible to degradation in noise, as evidenced by the EER rising to 5.59% and 5.60% under music and driving noise, respectively. In contrast, our model not only maintains lower EER values in these noisy conditions (3.12%) but also establishes a commanding advantage in terms of minDCF—a metric crucial for real-world security applications. On the English set under noise, our model’s minDCF is nearly 50% lower than that of CAM++ (e.g., ~2.63 vs. ~4.93 in driving noise), indicating a substantially better balance between false acceptance and false rejection rates in practical, challenging scenarios. This advantage holds for the Chinese set as well, where our model also achieves a lower minDCF, demonstrating its practical reliability despite the cross-lingual performance gap.
(3)
Conclusion on Practical Value
In summary, the experimental results validate that our proposed enhancements successfully address the core challenge of noise robustness in speaker verification. The improved ECAPA-TDNN does not merely match the baseline but surpasses it by a large margin, and it redefines the trade-off with the state of the art by offering superior resilience to noise and lower operational cost at a minimal sacrifice in peak quiet-condition performance. While cross-lingual generalization remains an area for future improvement, particularly through multilingual training, the current model’s profile makes it exceptionally well-suited for the variable and demanding acoustic environment of a vehicle.

3.4. Robustness Evaluation of Command Recognition

3.4.1. Experimental Design for Fuzzy Matching Validation

This experiment was designed to validate the core hypothesis that the proposed fuzzy matching algorithm provides a universal improvement in command recognition robustness, independent of the performance level of the front-end ASR engine. The evaluation compares the fuzzy matching strategy against the traditional Hard Matching baseline. To rigorously test for universality, the experiment was conducted using a series of open-source Whisper models of varying scales: Whisper-tiny (prioritizing efficiency), Whisper-base (balanced), and Whisper-small (prioritizing accuracy). All models were tested on the Chinese command dataset described in Section 3.1.1 under the three characteristic noise conditions, with the command recognition accuracy serving as the primary metric.

3.4.2. Results and Analysis of Fuzzy Matching Strategy

The comparative performance of the fuzzy matching and hard matching strategies across the three Whisper models is summarized in Figure 9. All data are presented as the mean of five repeated measurements. The results demonstrate a consistent and substantial advantage of the fuzzy matching algorithm over the hard matching baseline across all tested conditions.
First is the profound impact on lower-performance models. The most critical contribution of the fuzzy matching algorithm is its ability to make voice interfaces functionally viable with lower-performance ASR models. For the Whisper-tiny model, hard matching accuracy was severely limited, ranging from only 36.67% to 48.00% across environments. The fuzzy matching algorithm dramatically elevated these rates to a usable 59.63–65.88% range. This represents a transformative improvement, enabling the deployment of effective voice control in scenarios where only highly efficient, low-resource ASR models are feasible.
Second is the substantial gains on balanced models. A significant performance leap is observed with the Whisper-base model. The algorithm consistently and substantially outperformed hard matching, most notably in the high-speed driving environment, where accuracy was boosted from 40.82% to 71.38%. This pattern of major improvement confirms that the algorithm provides a powerful corrective mechanism for the typical error profiles of balanced, general-purpose ASR models, making them far more reliable for in-vehicle use.
Finally, the optimization of high-accuracy models. Even when paired with the more capable Whisper-small model, where the hard matching baseline was respectable, the fuzzy matching strategy delivered a crucial enhancement. It raised the accuracy from good to excellent, for instance, achieving 95.67% compared to 78.26% in quiet conditions. This highlights its role in mastering natural language variation and resolving residual ambiguities, which is essential for a seamless and natural user experience.
In summary, the experimental evidence solidly positions the fuzzy matching algorithm not as an incremental adjustment, but as a foundational component for robust in-vehicle voice interfaces. Its capability to transform the performance of lightweight models and substantially boost the reliability of balanced models underscores its critical role in practical applications where computational resources and cost are key constraints.

4. Integrated System Performance and Discussion

This section transitions from the component-level evaluations to a holistic analysis of the fully integrated system. We first validate the end-to-end workflow performance through interactive tests, then synthesize the findings from all previous experiments to discuss the comparative advantages and intrinsic value of the proposed multimodal framework. The section concludes with a critical reflection on the current limitations and directions for future work.

4.1. Analysis of Integrated Advantages and Synergistic Effects

This section synthesizes the findings from the component-level evaluations presented in Section 3.2, Section 3.3 and Section 3.4 to discuss the comparative advantages and, more importantly, the synergistic value of the proposed multimodal framework. The integration of the individually optimized components creates a system whose overall capability exceeds the sum of its parts, specifically addressing the intertwined challenges of in-vehicle voice interaction.
(1)
Synergy between Robust Wake-up and Secure Verification: The dual-channel wake-up mechanism (Section 3.2) serves as a robust gatekeeper, ensuring that only intentional speech triggers the computationally intensive downstream processes. This is critically complemented by the improved ECAPA-TDNN model (Section 3.3), which acts as a security checkpoint. The effectiveness of this sequential filter is demonstrated by the speaker verification module’s high accuracy under noise, which prevents the execution of commands from unverified speakers, a vulnerability in conventional systems. This layered approach ensures that system responsiveness is reserved for authenticated users, enhancing both security and resource efficiency.
(2)
Decoupling of Authentication from Semantic Understanding: A key architectural advantage is the clear separation of speaker identity verification from command intent recognition. The fuzzy matching algorithm (Section 3.4) operates only on transcripts from verified speakers. This design means that the system’s high semantic flexibility—its ability to understand varied and imprecise commands—is a privilege granted exclusively after successful authentication. The results from Section 3.4 demonstrate that this flexibility is maintained across different ASR performance levels and noise conditions, proving that the security module does not compromise the user experience for authorized individuals.
(3)
End-to-End Robustness through Modular Optimization: The experimental results collectively demonstrate robustness against a spectrum of real-world challenges. The wake-up module handles cross-lingual activation in noise, the speaker verification module maintains low error rates despite short, noisy utterances, and the command matching algorithm tolerates ASR errors and colloquial expressions. When integrated, these capabilities enable the system to handle complex scenarios. For instance, a command issued with a strong accent in a noisy cabin can be successfully processed: it is reliably woken up, the speaker is correctly verified, and the non-standard command phrase is accurately understood. This end-to-end resilience is a direct result of the targeted optimizations in each module and their effective orchestration within the proposed architecture.
In conclusion, while the validation in this work focuses on rigorous component-level performance, the analysis presented here confirms that the integrated framework successfully creates a cohesive, secure, and robust voice interaction loop. The modules do not operate in isolation but reinforce each other, providing a solid foundation for practical in-vehicle deployment.

4.2. Limitations and Future Work

While the integrated framework demonstrates compelling advantages as analyzed in Section 4.1, this study has several limitations that point to clear directions for future work:
(1)
Lack of Embedded Platform Validation: The system has not yet been deployed on actual embedded processing platforms; all tests were conducted via a PC-based simulator. Future work will therefore prioritize the deployment and evaluation of the system’s operational efficiency on resource-constrained automotive-grade hardware.
(2)
Insufficient Diversity in Speech Samples: The current test sets, while covering multiple scenarios, do not fully capture the vast diversity of accents, speech rates, and real-world noise encountered in vehicles. A key future direction is to collect and incorporate richer, more challenging corpora to further validate and enhance the system’s universality and robustness.
(3)
Limited Context Understanding: The system currently operates on a single-command basis, lacking multi-turn dialog memory and contextual analysis. To enable more natural interaction, future enhancements will focus on integrating lightweight semantic parsing and context-aware intent recognition technologies.
Overall, this work establishes a solid foundation for secure and robust in-vehicle voice control. The future research trajectory is clearly charted: advancing from simulation to real-world deployment, from controlled validation to extensive field testing, and from command-level to context-aware interaction.

5. Conclusions

This paper introduced a multimodal framework to holistically address the critical challenges of robustness, security, and semantic flexibility in in-vehicle voice control. The core of our solution is the synergistic integration of three key innovations: a hybrid dual-channel wake-up mechanism for reliable bilingual activation; an improved ECAPA-TDNN model with feature fusion and adaptive thresholding for noise-robust speaker verification; and a hierarchical fuzzy matching algorithm that bridges the gap between imperfect ASR transcription and reliable command understanding. Comprehensive experiments validated the effectiveness of each component and its integration. The proposed speaker verification model significantly outperformed the standard ECAPA-TDNN and demonstrated superior noise robustness and practical reliability compared to the state-of-the-art CAM++ model. The fuzzy matching algorithm universally boosted command recognition accuracy across various ASR engines and noise conditions, proving to be a critical enabler for practical deployment. This work provides a solid foundation for developing secure and robust in-vehicle voice interfaces. Future efforts will focus on three key directions: deployment and performance evaluation on real embedded automotive hardware, expansion of test scenarios with greater acoustic and linguistic diversity, and the integration of context-aware dialog management for more natural multi-turn interactions.

Author Contributions

Conceptualization, X.W.; methodology, X.W., Y.L. and W.R.; software, Z.Z.; validation, Z.Z. and Y.L.; writing—original draft preparation, Z.Z.; writing—review and editing, X.W., Y.L. and W.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Fujian Province (Project No.: 2023J011032), Foreign Cooperation Projects of Fujian Province (2025I0031).

Institutional Review Board Statement

Ethical review and approval were waived for this study because it primarily involved the development and testing of signal processing and machine learning algorithms for an in-vehicle system. The human voice data used were collected from volunteer colleagues under informed consent, and the study presented minimal risk to participants. All audio data were fully anonymized during processing to remove any personal identifiers, and the research did not involve sensitive or private personal information.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The self-collected bilingual voice dataset is not publicly available due to privacy and ethical restrictions. Publicly available datasets were analyzed in this study, which can be found here: VoxCeleb2 (https://pan.baidu.com/s/1DBi9eFthXYVLGsQPvHESyA?pwd=gcc9), CN-Celeb (https://aistudio.baidu.com/datasetdetail/233361), and MUSAN (https://www.openslr.org/17/, https://www.openslr.org/28/), accessed on 25 August 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Rochat, J.L. Highway Traffic Noise. Acoust. Today 2016, 12, 38–47. [Google Scholar]
  2. Chakroun, R.; Frikha, M. Robust Features for Text-Independent Speaker Recognition with Short Utterances. Neural Comput. Appl. 2020, 32, 13863–13883. [Google Scholar] [CrossRef]
  3. Hanifa, R.M.; Isa, K.; Mohamad, S. A Review on Speaker Recognition: Technology and Challenges. Comput. Electr. Eng. 2021, 90, 107005. [Google Scholar] [CrossRef]
  4. Li, J.; Chen, C.; Azghadi, M.R.; Ghodosi, H.; Pan, L.; Zhang, J. Security and Privacy Problems in Voice Assistant Applications: A Survey. Comput. Secur. 2023, 134, 103448. [Google Scholar] [CrossRef]
  5. Chen, T.; Chen, C.; Lu, C.; Chan, B.; Cheng, Y.; Chuang, H. A Lightweight Speaker Verification Model For Edge Device. In Proceedings of the 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Taipei, Taiwan, 31 October–3 November 2023; pp. 1372–1377. [Google Scholar]
  6. Cihan, H.; Wu, Y.; Peña, P.; Edwards, J.; Cowan, B. Bilingual by Default: Voice Assistants and the Role of Code-Switching in Creating a Bilingual User Experience. In Proceedings of the 4th Conference on Conversational User Interfaces, Glasgow, UK, 26–28 July 2022. [Google Scholar]
  7. Hu, Q.; Zhang, Y.; Zhang, X.; Han, Z.; Yu, X. CAM: A Cross-Lingual Adaptation Framework for Low-Resource Language Speech Recognition. Inf. Fusion 2024, 111, 102506. [Google Scholar] [CrossRef]
  8. Kunešová, M.; Hanzlíček, Z.; Matoušek, J. An Exploration of ECAPA-TDNN and x-Vector Speaker Representations in Zero-Shot Multi-Speaker TTS. In Text, Speech, and Dialogue, Proceedings of the TSD 2025; Ekštein, K., Konopík, M., Pražák, O., Pártl, F., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2026; Volume 16029. [Google Scholar]
  9. Desplanques, B.; Thienpondt, J.; Demuynck, K. ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020. [Google Scholar]
  10. Liu, S.; Song, Z.; He, L. Improving ECAPA-TDNN Performance with Coordinate Attention. J. Shanghai Jiaotong Univ. Sci. 2024. [Google Scholar] [CrossRef]
  11. Wang, S. Overview of Speaker Modeling and Its Applications: From the Lens of Deep Speaker Representation Learning. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 4971–4998. [Google Scholar] [CrossRef]
  12. Snyder, D.; Garcia-Romero, D.; Sell, G.; Povey, D.; Khudanpur, S. X-Vectors: Robust DNN Embeddings for Speaker Recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5329–5333. [Google Scholar]
  13. Karo, M.; Yeredor, A.; Lapidot, I. Compact Time-Domain Representation for Logical Access Spoofed Audio. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 946–958. [Google Scholar] [CrossRef]
  14. Zhang, Z.; Zhao, Z.; Wu, S.; Zhang, X.; Zhang, P.; Yan, Y. Multi-Branch Coordinate Attention with Channel Dynamic Difference for Speaker Verification. IEEE Signal Process. Lett. 2025, 32, 3225–3229. [Google Scholar] [CrossRef]
  15. Li, W.; Yao, S.; Wan, B. TDNN Architecture with Efficient Channel Attention and Improved Residual Blocks for Accurate Speaker Recognition. Sci. Rep. 2025, 15, 23484. [Google Scholar] [CrossRef]
  16. Yang, W.; Wei, J.; Lu, W.; Li, L.; Lu, X. Robust Channel Learning for Large-Scale Radio Speaker Verification. IEEE J. Sel. Top. Signal Process. 2025, 19, 248–259. [Google Scholar] [CrossRef]
  17. Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; Volume 202, pp. 28492–28518. [Google Scholar]
  18. Graham, C.; Roll, N. Evaluating OpenAI’s Whisper ASR: Performance Analysis across Diverse Accents and Speaker Traits. JASA Express Lett. 2024, 4, 025206. [Google Scholar] [CrossRef]
  19. Rong, Y.; Hu, X. Fuzzy String Matching Using Sentence Embedding Algorithms. In Neural Information Processing: Proceedings of ICONIP 2016; Hirose, A., Ozawa, S., Doya, K., Ikeda, K., Lee, M., Liu, D., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2016; Volume 9949. [Google Scholar]
  20. Bini, S.; Carletti, V.; Saggese, A.; Vento, M. Robust Speech Command Recognition in Challenging Industrial Environments. Comput. Commun. 2024, 228, 107938. [Google Scholar] [CrossRef]
  21. Kheddar, H.; Hemis, M.; Himeur, Y. Automatic Speech Recognition Using Advanced Deep Learning Approaches: A Survey. Inf. Fusion 2024, 109, 102422. [Google Scholar] [CrossRef]
  22. Horiguchi, S.; Tawara, N.; Ashihara, T.; Ando, A.; Delcroix, M. Can We Really Repurpose Multi-Speaker ASR Corpus for Speaker Diarization? In Proceedings of the 2025 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Honolulu, HI, USA, 6–10 December 2025. [Google Scholar]
  23. Lim, H. Lightweight Feature Encoder for Wake-Up Word Detection Based on Self-Supervised Speech Representation. In Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
  24. Sun, X.; Fu, J.; Wei, B.; Li, Z.; Li, Y.; Wang, N. A Self-Attentional ResNet-LightGBM Model for IoT-Enabled Voice Liveness Detection. IEEE Internet Things J. 2023, 10, 8257–8270. [Google Scholar] [CrossRef]
  25. Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.-C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019. [Google Scholar]
  26. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  27. OpenSLR. Open Speech and Language Resources. Available online: http://www.openslr.org (accessed on 25 August 2025).
Figure 1. Modular Architecture of the Intelligent In-Vehicle Voice Control.
Figure 2. Architecture of ResNet-Lite.
Figure 3. Improved ECAPA-TDNN model structure optimization.
Figure 4. Block diagram of feature averaging and fusion for MFCC Spectrogram using sliding window.
Figure 5. Flowchart of the Adaptive Threshold Decision Mechanism.
Figure 6. Block Diagram of fuzzy command matching.
Figure 7. Comparison of speaker verification models across English channel: (a) EER (%) and (b) minDCF (×10−1).
Figure 8. Comparison of speaker verification models across Chinese channel: (a) EER (%) and (b) minDCF (×10−1).
Figure 9. Comparative accuracy of hard matching vs. fuzzy matching strategies across different ASR front ends and noise conditions: (a) Whisper-small (prioritizing accuracy); (b) Whisper-base (balanced); and (c) Whisper-tiny (prioritizing efficiency).
Table 1. Structural and Performance Comparison of ResNet-18 and ResNet-Lite.

Item | ResNet-18 | ResNet-Lite | Description
Number of Residual Stages | 4 | 3 | ResNet-Lite removes one residual stage
Convolution Kernel Size | 7 × 7 | 3 × 3 | Smaller kernels reduce computational load
Max Pooling | Yes | No | Preserves more speech detail
Total Weighted Layers | 18 | 9 | Network depth halved
Parameters (Approx.) | 11.2 M | 3.5 M | ~70% reduction in parameters
Inference Speed (Desktop CPU) | ~35 ms/sample | ~12 ms/sample | Test environment: Intel i5-12400 + PyTorch
Memory Usage (Desktop GPU) | ~180 MB | ~64 MB | Test environment: NVIDIA GTX 1660Ti
Suitable Deployment | General servers/GPU platforms | Resource-constrained edge devices (automotive) | Optimized for embedded voice control environments
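To make the structural differences in Table 1 concrete, the snippet below sketches a ResNet-Lite-style wake-word classifier in PyTorch with three residual stages, 3 × 3 convolutions, and no max pooling. The channel widths, stride schedule, input shape, and classifier head are assumptions for illustration only; the authoritative layout (and the ~3.5 M parameter budget) is defined by Figure 2 and Table 1.

```python
# Illustrative ResNet-Lite-style wake-word classifier (PyTorch).
# Follows the constraints in Table 1 (3 residual stages, 3x3 kernels, no max pooling);
# channel widths, strides, and input shape are assumptions, not the paper's exact Figure 2.
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        # 1x1 projection shortcut when spatial size or channel count changes.
        self.down = (
            nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                          nn.BatchNorm2d(out_ch))
            if (stride != 1 or in_ch != out_ch) else nn.Identity()
        )

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.down(x))

class ResNetLite(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        # 3x3 stem and no max pooling, so spectral/temporal detail is kept (Table 1).
        self.stem = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1, bias=False),
                                  nn.BatchNorm2d(16), nn.ReLU(inplace=True))
        # Three residual stages instead of ResNet-18's four.
        self.stage1 = BasicBlock(16, 16)
        self.stage2 = BasicBlock(16, 32, stride=2)
        self.stage3 = BasicBlock(32, 64, stride=2)
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(64, n_classes))

    def forward(self, x):  # x: (batch, 1, n_mels, frames) MFCC/Mel spectrogram
        x = self.stem(x)
        x = self.stage3(self.stage2(self.stage1(x)))
        return self.head(x)

if __name__ == "__main__":
    logits = ResNetLite()(torch.randn(2, 1, 40, 98))  # e.g. 40 MFCCs over ~1 s of frames
    print(logits.shape)  # torch.Size([2, 2])
```

Note that this sketch uses deliberately small channel widths; the deployed model widens them to reach the parameter and latency figures reported in Table 1.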
Table 2. Statistics of the datasets used in the experiments.

Data Type | Source | Language | # Speakers *1 | # Utterances *1 | Real/Synthetic | Noise Conditions *3 | Purpose
Wake-up Word | Private + TTS | Chinese ("Xiaotun") | 20 + TTS | 2055 (481:1574) *2 | ~1:5 (Real vs. Synthetic) | A, B, or C | Training
Wake-up Word | Private + TTS | Chinese ("Xiaotun") | 20 + TTS | 491 (100:391) *2 | ~1:5 (Real vs. Synthetic) | A, B, or C | Testing (Chinese Channel)
Wake-up Word | Picovoice | English ("Hey Porcupine") | N/A | (Built-in) | N/A | N/A | Testing (English Channel)
Speaker Verification | VoxCeleb2 | English | 5994 | ~710 k | 100% Real | A, B, C, or their random superposition | Training
Speaker Verification | VoxCeleb2 | English | 118 | 20 k (1:1) *2 | 100% Real | A, B, or C | Testing (English Channel)
Speaker Verification | CN-Celeb | Chinese | 195 | 10 k (1:1) *2 | 100% Real | Inherent noise + A, B, or C | Testing (Chinese Channel)
Command Recognition | Private Dataset | Chinese | 20 | 6871 | 100% Real | A, B, or C | Testing
Note: *1: In the header of Table 2, the symbol “#” denotes “Number of”. *2: In the “# Utterances” column, the ratio shown in parentheses represents positive vs. negative sample pairs. *3: In the “Noise Conditions” column, A, B, and C refer to quiet, background music, and ambient noise (e.g., high-speed driving noise), respectively.
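Table 2 indicates that the speaker verification training data were exposed to conditions A, B, and C or their random superposition, using MUSAN material. The snippet below is a minimal sketch of additive noise mixing at a random signal-to-noise ratio; the SNR range (0–15 dB), the number of superimposed sources, and the placeholder signals are assumptions and do not reproduce the paper's exact augmentation pipeline.

```python
# Minimal additive-noise augmentation sketch for the noise conditions in Table 2.
# Mixes a clean utterance with one or two randomly chosen noise clips (e.g., MUSAN
# music or driving noise) at a random SNR; the 0-15 dB range is assumed for illustration.
import random
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Additively mix `noise` into `clean` so the result has the requested SNR."""
    # Loop/trim the noise clip to the utterance length.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]
    clean_power = np.mean(clean ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale noise so that 10*log10(clean_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

def augment(clean: np.ndarray, noise_clips: list) -> np.ndarray:
    """Randomly superimpose one or two noise sources (music and/or driving noise)."""
    out = clean
    for noise in random.sample(noise_clips, k=random.choice([1, 2])):
        out = mix_at_snr(out, noise, snr_db=random.uniform(0, 15))
    return np.clip(out, -1.0, 1.0)

if __name__ == "__main__":
    utt = np.random.randn(16000).astype(np.float32) * 0.1      # placeholder 1 s utterance @16 kHz
    music = np.random.randn(8000).astype(np.float32) * 0.1     # placeholder music clip
    driving = np.random.randn(24000).astype(np.float32) * 0.1  # placeholder driving-noise clip
    print(augment(utt, [music, driving]).shape)
```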
Table 3. Model training hyperparameter configuration.

Model | Optimizer | Initial Learning Rate | Batch Size | Max Epochs | Weight Decay | Framework
Improved ECAPA-TDNN | Adam | 0.001 | 64 | 80 | 2 × 10−5 | PyTorch
ResNet-Lite | Adam | 0.001 | 16 | 100 | Not Applied | PyTorch
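The skeleton below shows one way to wire the Table 3 settings for the improved ECAPA-TDNN row (Adam, learning rate 0.001, batch size 64, 80 epochs, weight decay 2 × 10−5) into a PyTorch training loop. The tiny stand-in model, random dataset, and cross-entropy loss are placeholders so the snippet is self-contained; only the hyperparameter values are taken from the table.

```python
# Hyperparameters from Table 3 (improved ECAPA-TDNN row) in a PyTorch training skeleton.
# The small model and random dataset below are placeholders so the snippet runs on its own;
# they do not reproduce the paper's model, features, or training loss.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins for the improved ECAPA-TDNN and its training set (illustrative only).
model = nn.Sequential(nn.Flatten(), nn.Linear(80 * 200, 192), nn.ReLU(), nn.Linear(192, 100))
train_set = TensorDataset(torch.randn(256, 80, 200), torch.randint(0, 100, (256,)))
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)  # batch size 64 (Table 3)

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,            # initial learning rate 0.001 (Table 3)
    weight_decay=2e-5,  # weight decay 2e-5 (Table 3)
)
criterion = nn.CrossEntropyLoss()  # stand-in; the actual speaker-classification loss may differ

for epoch in range(80):            # max epochs 80 (Table 3)
    for features, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(features), labels)
        loss.backward()
        optimizer.step()
```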
Table 4. Performance of the dual-channel wake-up mechanism under different noise conditions.

System | Module | Wake Word | Language | Accuracy (%) in Noise Conditions A / B / C * | Wake-Up Success Rate (%) in Noise Conditions A / B / C *
Commercial Picovoice | Picovoice (Native) | “Hey Porcupine” | English | 99.95 / 99.76 / 99.96 | 99.51 / 95.38 / 99.76
Integrated Dual-Channel System | Picovoice (In Our System) | “Hey Porcupine” | English | 99.95 / 99.66 / 99.97 | 99.51 / 93.19 / 99.76
Integrated Dual-Channel System | ResNet-Lite | “Xiaotun” | Chinese | 78.41 / 70.26 / 65.17 | 66.0 / 56.0 / 75.0
Note: * The “Noise Conditions” A, B, and C refer to quiet, background music, and ambient noise (e.g., high-speed driving noise), respectively.
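Table 4 evaluates the dual-channel wake-up mechanism in which Picovoice handles the English wake word and ResNet-Lite handles the Chinese one. The sketch below illustrates one possible polling loop in which both channels score each incoming audio frame and the first channel to fire activates its language pipeline; the detector callables and the 0.5 threshold are hypothetical placeholders, not the engines' actual APIs.

```python
# Illustrative dual-channel wake-up polling loop for the setup evaluated in Table 4.
# `english_engine_detect` and `chinese_resnet_score` are hypothetical placeholders for the
# Picovoice keyword detector and the ResNet-Lite classifier; the 0.5 threshold is assumed.
from typing import Callable, Iterable, Optional
import numpy as np

def dual_channel_wakeup(
    frames: Iterable[np.ndarray],
    english_engine_detect: Callable[[np.ndarray], bool],   # True when "Hey Porcupine" fires
    chinese_resnet_score: Callable[[np.ndarray], float],   # P("Xiaotun") from ResNet-Lite
    zh_threshold: float = 0.5,
) -> Optional[str]:
    """Return 'en' or 'zh' for the first channel that detects its wake word, else None."""
    for frame in frames:
        if english_engine_detect(frame):
            return "en"   # English channel fired: hand over to the English pipeline
        if chinese_resnet_score(frame) >= zh_threshold:
            return "zh"   # Chinese channel fired: hand over to the Chinese pipeline
    return None

if __name__ == "__main__":
    # Dummy detectors for demonstration: English never fires, Chinese fires on the third frame.
    frames = (np.zeros(512, dtype=np.int16) for _ in range(5))
    scores = iter([0.1, 0.2, 0.9, 0.3, 0.1])
    print(dual_channel_wakeup(frames, lambda f: False, lambda f: next(scores)))  # -> 'zh'
```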
