Applied Sciences
  • Article
  • Open Access

11 December 2024

Enhanced Conformer-Based Speech Recognition via Model Fusion and Adaptive Decoding with Dynamic Rescoring

School of Automation and Intelligence, Beijing Jiaotong University, No. 3 Shangyuancun, Haidian District, Beijing 100044, China
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Deep Learning for Speech, Image and Language Processing

Abstract

Speech recognition is widely applied in fields such as security, education, and healthcare, and its development underpins global information infrastructure and artificial intelligence strategies. However, current models still face challenges such as overfitting, convergence to local optima, and an unfavorable trade-off between decoding accuracy and computational cost, which lead to unstable recognition and long response times. Addressing these technical bottlenecks is therefore critical for further progress in speech recognition. In this paper, we propose improvements to both model structure fusion and decoding algorithms. First, based on the Conformer network and its variants, we introduce a weighted fusion method that uses the training loss as an indicator, adjusting the weights, thresholds, and other related parameters of the fused models to balance the contributions of the different structures. This yields a more robust and generalizable model that alleviates overfitting and local optima. Second, for the decoding phase, we design a dynamic adaptive decoding method that combines traditional decoding algorithms, namely connectionist temporal classification (CTC) and attention-based decoding. This ensemble approach enables the system to adapt to different acoustic environments, improving its robustness and overall performance. Additionally, to further optimize the decoding process, we introduce a penalty function as a regularization mechanism that limits the weights of the decoding strategies, preventing over-reliance on any single decoder and thus enhancing generalization. Finally, we validate our model on the LibriSpeech dataset, a large-scale English speech corpus containing approximately 1000 h of audio. Experimental results show that the proposed method achieves word error rates (WERs) of 3.92% and 4.07% on the development and test sets, respectively, a significant improvement over single-model and traditional decoding baselines. Notably, the method reduces WER by approximately 0.4% on the more complex subsets compared with several advanced mainstream models, underscoring its robustness and adaptability in challenging acoustic environments. These results confirm the effectiveness of the proposed method in addressing overfitting and in improving both accuracy and efficiency during decoding.

1. Introduction

Speech recognition is a technology that converts human speech signals into text, enabling computers to understand and interpret human language. This technology plays a critical role in human–computer interaction and is widely applied in various fields such as security, education, healthcare, and the military. It significantly improves people’s quality of life and work efficiency, demonstrating tremendous application potential and market prospects across multiple industries. As this technology rapidly develops globally, the market size reached USD 11.4 billion in 2023 and is projected to grow to USD 44.7 billion by 2032 [1]. The advancement of speech recognition not only drives the development of intelligent devices but also has a substantial impact on global information technology infrastructure and artificial intelligence strategies.
Speech recognition technology has undergone three major stages of development. Initially, methods based on Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs) were used [2], relying on manual feature extraction and statistical modeling for speech recognition. While these methods performed reasonably well in environments with low noise, they struggled with complex speech inputs. The introduction of deep learning later brought about the application of convolutional neural networks (CNNs) and deep neural networks (DNNs), significantly improving the accuracy of speech recognition, particularly in noisy environments [3]. In recent years, end-to-end architectures have become mainstream, with models such as Transformer. The self-attention mechanism in Transformer models addresses the limitations of traditional models in parallel processing and long-range dependencies, further enhancing recognition accuracy and training efficiency [4]. With these advancements, speech recognition systems have seen substantial improvements in accuracy, robustness, and adaptability to dynamic environments, particularly in handling multiple languages, dialects, and complex background conditions.
However, despite the significant progress made in speech recognition technology, several challenges and pain points remain. One notable issue is the lack of robustness. Speech recognition systems experience a significant drop in accuracy when faced with noisy environments and varying background noise [5]. This limitation hinders the performance of speech recognition technology in complex and diverse real-world applications. Another challenge lies in the shortcomings of current decoding algorithms. These algorithms have limitations in handling the complexity and diversity of natural language, which affects the accuracy of speech recognition systems and constrains their response speed and efficiency in real-time applications. This is particularly evident when dealing with continuous speech and complex commands, where the efficiency and accuracy of existing algorithms need improvement [6].
To address these issues, this paper proposes the following innovative methods: First, during the model training phase, to alleviate overfitting and local optima, we introduce an innovative network architecture and propose a loss-weighted model fusion approach. This method combines multiple model structures by adjusting their weights during the convergence phase, enhancing the model’s generalization ability and robustness. Second, to address the balance between accuracy and efficiency in decoding algorithms, we design a dynamic adaptive decoding algorithm. This algorithm integrates CTC and attention mechanisms using nonlinear functions, enabling more precise capture of the complex relationships and interactions in speech data. Furthermore, to improve decoding efficiency, we introduce a dynamically adjusted penalty mechanism that limits the model’s reliance on a single decoder, thereby enhancing the robustness of the decoding process and accelerating the decoding speed.
The remainder of this paper is organized as follows: Section 2 presents the related work, reviewing the development history of speech recognition models, discussing challenges in model performance, decoding algorithms, and the scarcity of datasets, and outlining the solutions adopted in this paper. Section 3 introduces the methodology, detailing the design concepts behind model fusion and the innovations in the decoding algorithm. Section 4 describes the experimental setup, including the model parameter configuration, dataset sources, and performance evaluation metrics. Section 5 presents the experimental results, exploring the effects of model fusion and decoding algorithm optimization, conducting ablation experiments, and comparing the performance of the proposed method with mainstream network models. Section 6 concludes the paper based on the experimental results. Section 7 discusses future work, providing a critical examination of the limitations inherent in the proposed methods and suggesting strategies to address these challenges, while identifying avenues for further exploration.

3. Materials and Methods

This study employs the Conformer and its variant end-to-end model structures. On the one hand, a weighted fusion approach is used to organically combine these models, enhancing the model’s generalization ability and stability. On the other hand, a novel adaptive dynamic rescoring decoding algorithm is proposed, which ensures recognition accuracy while improving decoding efficiency. The overall process of the proposed methodology is illustrated in Figure 1.
Figure 1. Overall flowchart of the methodology.

3.1. Fundamental Theory

Feature extraction from audio data is a critical step in the preprocessing phase of speech recognition. It involves converting audio data into feature representations that can be utilized by deep learning models. This study employs the commonly used Mel-Frequency Cepstral Coefficients (MFCC) for feature extraction, as it effectively simulates human auditory characteristics, enhancing the efficiency of model training and improving generalization capabilities. The process is illustrated in Figure 2.
Figure 2. MFCC processing flow.
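As a concrete illustration of this preprocessing step, the minimal sketch below extracts MFCC features with the librosa library; the file name, sampling rate, and number of coefficients are placeholders rather than the exact front-end configuration used in this study.

```python
import librosa

# Load a 16 kHz mono recording and compute MFCC features (placeholder file name).
audio, sr = librosa.load("example.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)   # shape: (n_mfcc, n_frames)
print(mfcc.shape)
```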
The model structure employed in this study is the Conformer network, a novel hybrid neural network architecture designed specifically for capturing both local and global dependencies in complex sequential data. The Conformer network ingeniously combines the local feature extraction capabilities of convolutional neural networks (CNNs) with the long-range dependency handling abilities of Transformers. By integrating convolutional layers into its structure, the Conformer is capable of capturing short-term phonetic features crucial for speech recognition tasks, while its self-attention mechanism allows it to understand and process the overall semantic structure of longer passages. This combination makes the Conformer particularly effective in tasks such as natural language processing and speech recognition, which require both detailed local observation and broad global understanding. The specific structure of the Conformer is illustrated in Figure 3.
Figure 3. Conformer network architecture.
Building on this, two variant structures are also employed in this study: Unified_Conformer and U2++_Conformer. These three architectures are similar in design. Unified_Conformer introduces improvements in the encoding layer by implementing a dynamic block training method and utilizing a causal convolution module to support streaming recognition, thereby enhancing training effectiveness. U2++_Conformer builds on the causal convolution by adding a right-to-left attention decoder in the decoder section, enabling bidirectional decoding and improving decoding accuracy. The subsequent experiments will focus on these three structures.
During the model training process, the calculation of the loss value is a hybrid of two types: CTC (1) and attention mechanism decoding (2). The structure is illustrated in Figure 4.
$L_{CTC} = -\sum_{t=1}^{T} \log P_{ctc}(y_t^{true} \mid X)$  (1)

$L_{att} = -\sum_{t=1}^{T} \log P_{att}(y_t^{true} \mid y_{1:t-1}, X)$  (2)

where $y_t^{true}$ represents the true label at time step $t$, $X$ represents the input audio feature sequence, and $y_{1:t-1}$ represents all target labels before time step $t$.
Figure 4. Hybrid CTC/attention model.
By using λ as a weighting coefficient, the two types of loss are combined to ultimately obtain the total loss value:
$L_{total} = \lambda L_{CTC} + (1 - \lambda) L_{att}$
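The weighted combination can be expressed in a few lines of code. The sketch below assumes the per-utterance CTC and attention losses have already been computed (for example with torch.nn.CTCLoss and a label-smoothed cross-entropy); lambda_ctc corresponds to λ, and the default of 0.3 follows the CTC weight reported later in Table 1.

```python
import torch

def hybrid_loss(loss_ctc: torch.Tensor, loss_att: torch.Tensor,
                lambda_ctc: float = 0.3) -> torch.Tensor:
    """Total loss: lambda * L_CTC + (1 - lambda) * L_att."""
    return lambda_ctc * loss_ctc + (1.0 - lambda_ctc) * loss_att
```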

3.2. Model Fusion

Typically, after model training is completed, to ensure both accuracy and robustness, models from the better-performing training epochs are selected. The weight parameters of these selected models are then averaged using an arithmetic mean, resulting in a new model with improved stability and performance.
Figure 5 illustrates the training process curve of a model. By selecting the models $M = \{M_1, M_2, \ldots, M_S\}$ and the corresponding loss values $L = \{L_1, L_2, \ldots, L_S\}$ from the convergence phase, the arithmetic averaging formula is as follows:

$M_{ave} = \frac{1}{S}\sum_{i=1}^{S} M_i$

where $M_{ave}$ is the newly obtained model after arithmetic averaging, and $S$ is the number of models.
Figure 5. Model training example.
Although arithmetic averaging is simple and easy to implement in model fusion, it has significant drawbacks when dealing with the following issues: First, arithmetic averaging combines the outputs or parameters of multiple models with equal weights without considering the generalization ability of each model. This means that even if some models have overfitted to their training data, their influence in the ensemble is the same as that of well-performing models. As a result, the effects of overfitting may be retained in the final ensemble model, reducing the accuracy of predictions on new data. Second, when independently trained models are stuck in different local optima within their parameter space, simple arithmetic averaging may cause these local optima to cancel each other out or dilute their effects, especially when these local optima are far apart. This averaging approach can potentially disrupt valuable solutions found by well-performing models within certain local regions.
To overcome these shortcomings, we propose a weighted averaging method. We select models from the stage where they are about to converge, calculate weights using loss values as evaluation criteria, and perform a weighted sum to obtain the final weighted average model. The formula for the weighted calculation is as follows:
$M_{ave} = \frac{\sum_{i=1}^{S} w_i M_i}{\sum_{i=1}^{S} w_i}$

$w_i = \frac{1}{loss_i + \epsilon}$

where $M_{ave}$ is the new model obtained using weighted averaging, $w_i$ is the weight of each model, $\epsilon$ is a small constant (set to $1 \times 10^{-6}$) to avoid division-by-zero errors, and $loss_i$ is the loss value of model $M_i$.
The above approach applies specifically to models with the same structure. When dealing with multiple different models, direct calculation using the entire models is not feasible due to structural differences. First, for the three different models, the model with the best performance, as evaluated by the loss value, is selected as the base model. When calculating across different models, only the structurally identical overlapping parts are weighted and averaged, while the non-overlapping parts are left unchanged. The specific formulas and pseudocode for this approach are as follows:
First, the base model is selected by defining the model with the smallest loss value as $M_b$:

$M_b = M_{\arg\min_i Loss_i}, \quad i \in \{1, 2, \ldots, S\}$

Next, for the shared structure within the models, the weights $\omega$ and biases $b$ of the $l$-th layer in the fused model are calculated as follows:

$\omega_l^{M} = \frac{\sum_{i=1}^{S} w_i \, I(l \in M_b \cap M_i)\, \omega_l^{i}}{\sum_{i=1}^{S} w_i \, I(l \in M_b \cap M_i)}$

$b_l^{M} = \frac{\sum_{i=1}^{S} w_i \, I(l \in M_b \cap M_i)\, b_l^{i}}{\sum_{i=1}^{S} w_i \, I(l \in M_b \cap M_i)}$

where $I(l \in M_b \cap M_i)$ is an indicator function that equals 1 if $M_i$ shares common structural parameters in layer $l$ with the base model $M_b$, and 0 otherwise. $\omega_l^{i}$ and $b_l^{i}$ represent the weights and biases of layer $l$ in model $i$, while $\omega_l^{M}$ and $b_l^{M}$ denote the weights and biases of layer $l$ in the new model $M$.

Finally, these layer-wise parameters are integrated and aggregated to obtain the final model $M_{ave}$:

$M_{ave} = \mathrm{swish}(X_l \omega_l^{M} + b_l^{M})$

where $\mathrm{swish}$ is the activation function used within the network and $X_l$ is the input to layer $l$.
In the weighted averaging fusion method, each model’s contribution is weighted based on its performance on the validation dataset, ensuring that models with better performance and greater generalization ability have a larger influence on the final outcome. This approach not only effectively mitigates the impact of overfitting but also better addresses the issue of multiple models potentially converging to different local optima, thereby guiding the ensemble model towards a more effective global optimum. Through this weighted mechanism, the stability and generalization performance of the model can be significantly enhanced, leading to more reliable and consistent results across different datasets and tasks. The detailed model fusion method is provided in Algorithm 1.
Algorithm 1 Weighted Average Prediction
Require: ε: A small constant to prevent division by zero
1: Initialize Weighted_Sum as an empty dictionary
2: Initialize Total_Weight as an empty dictionary
3: for each model Mi in the set of models M do
4:   Compute weight Wi = 1/(Loss of Mi + ε)
5:   for each parameter k in the intersection of keys in base model Mb and Mi do
6:     if k is not in Weighted_Sum then
7:       Initialize Weighted_Sum[k] = 0
8:       Initialize Total_Weight[k] = 0
9:     end if
10:     Weighted_Sum[k] += Wi * value of parameter k in Mi
11:     Total_Weight[k] += Wi
12:   end for
13: end for
14: for each parameter k in Weighted_Sum do
15:   Y_weighted[k] = Weighted_Sum[k]/Total_Weight[k]
16: end for
17: Return Y_weighted
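For readers who prefer executable code, the following sketch mirrors Algorithm 1 under the assumption that each model is available as a PyTorch state_dict and that parameters count as “shared” when they have the same name and shape as in the base model; the function name and the handling of non-floating-point buffers are illustrative choices, not the authors’ implementation.

```python
import torch

def loss_weighted_average(base_state, states, losses, eps=1e-6):
    """Fuse model parameters with weights w_i = 1 / (loss_i + eps), as in Algorithm 1."""
    weights = [1.0 / (loss + eps) for loss in losses]
    fused = {k: v.clone() for k, v in base_state.items()}   # start from the base model
    for k, base_param in base_state.items():
        weighted_sum = torch.zeros_like(base_param, dtype=torch.float32)
        total_weight = 0.0
        for w, state in zip(weights, states):
            # only structurally identical (shared) parameters are averaged
            if k in state and state[k].shape == base_param.shape:
                weighted_sum += w * state[k].float()
                total_weight += w
        if total_weight > 0:
            fused[k] = (weighted_sum / total_weight).to(base_param.dtype)
    return fused
```

In practice the checkpoints passed in would be the near-convergence epochs selected as in Figure 5, and the fused parameters can be loaded back with model.load_state_dict(fused).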
In our experiments, we applied the above-mentioned weighted averaging fusion method to combine three networks: the Conformer and its two variants. The final result is a new hybrid model, Mixformer. The specific fusion process is illustrated in Figure 6. This fusion approach allows us to effectively integrate the strengths of the three models, further improving the overall performance and generalization ability of the model.
Figure 6. Model fusion process.

3.3. Decoding Algorithms

The commonly used decoding methods include attention decoding, CTC greedy search, and CTC prefix beam search.
The attention decoding algorithm decodes the input sequence using an autoregressive search method, calculating context vectors and weights at each step while considering all previously generated words, thereby ensuring high accuracy. However, this method has high computational complexity and consumes significant resources. In contrast, the CTC greedy search method selects the most likely character at each step, which is efficient but less accurate in complex scenarios. The CTC prefix beam search improves upon greedy search by retaining multiple likely prefix paths at each step, calculating the total probability for each prefix to select the best path. Although this increases accuracy, it also consumes more memory and computational resources.
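To make the contrast concrete, the sketch below shows the core of CTC greedy search: pick the most probable symbol at every frame, collapse consecutive repeats, and remove blanks. Prefix beam search differs in that it keeps the top-scoring prefixes at each frame rather than a single path. The input shape and blank index here are assumptions for illustration.

```python
from typing import List
import numpy as np

def ctc_greedy_decode(log_probs: np.ndarray, blank: int = 0) -> List[int]:
    """Greedy CTC decoding for a (T, vocab_size) matrix of per-frame log-probabilities."""
    best_path = log_probs.argmax(axis=-1)      # most likely symbol at each frame
    tokens, prev = [], None
    for idx in best_path:
        if idx != blank and idx != prev:        # collapse repeats, drop blanks
            tokens.append(int(idx))
        prev = idx
    return tokens
```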
These traditional methods each have their strengths and weaknesses, but they often struggle to balance accuracy and computational efficiency when dealing with complex speech signals. To overcome these limitations and enhance the model’s generalization ability and decoding performance, we propose an adaptive attention fusion rescoring method (AAF rescoring). This approach combines nonlinear function connections and a penalty mechanism to rescore the outputs of decoders. The specific process is illustrated in Figure 7.
Figure 7. Adaptive attention fusion rescoring (AAF rescoring) decoding algorithm.
First, the CTC prefix beam search is used to perform an initial decoding of the speech input, generating potential candidate decoding results. Although CTC prefix beam search may lack precision when handling complex backgrounds, it provides an efficient initial decoding outcome that serves as a foundation for subsequent processing. Next, attention beam search is applied to further process the candidate results generated by the CTC. Nonlinear functions are employed to connect these candidate paths, resulting in richer decoding outputs. This method enhances the accuracy and flexibility of the decoding process while preserving global contextual information. Additionally, it significantly improves recognition accuracy and reduces the required computational resources. Finally, a penalty mechanism is introduced to rescore the newly generated decoding results. The penalty mechanism limits the weights to prevent the model from becoming overly dependent on the output of a single decoder, thereby enhancing the model’s robustness. The specific formula is as follows:
$S_{combined}^{AAF} = \sigma(\omega_r)\log P_{att}(y_t \mid y_{<t}, x) + \sigma(\omega_c)\log P_{CTC}(y_t \mid x) - \lambda\left(\sigma(\omega_r)^2 + \sigma(\omega_c)^2\right)$

where $P_{att}(y_t \mid y_{<t}, x)$ represents the output of the attention-based decoder, i.e., the conditional probability of generating the current output given the previous outputs and the input $x$; $P_{CTC}(y_t \mid x)$ represents the output of the CTC decoder, i.e., the probability of generating the current output given the entire sequence $x$; $\sigma(\cdot)$ denotes the nonlinear activation function, here the sigmoid function; $\omega_r$ and $\omega_c$ are the fusion weights of the attention and CTC branches, respectively; and $\lambda$ is a regularization coefficient used to penalize excessively large weight values, limiting the model’s tendency to rely heavily on the output of a single decoder and thereby enhancing its robustness.
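A minimal sketch of this scoring rule is given below, assuming the attention and CTC log-probabilities of a candidate are already available; the weights omega_r and omega_c and the coefficient lam (λ) are illustrative values that would be tuned or learned in practice.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def aaf_rescore(logp_att: float, logp_ctc: float,
                omega_r: float = 0.5, omega_c: float = 0.5, lam: float = 0.1) -> float:
    """Combined score: sigmoid-weighted fusion of the two decoders minus a weight penalty."""
    w_att, w_ctc = sigmoid(omega_r), sigmoid(omega_c)
    return w_att * logp_att + w_ctc * logp_ctc - lam * (w_att ** 2 + w_ctc ** 2)
```

Candidates produced by the CTC prefix beam search would then be reranked with this score and the highest-scoring hypothesis returned.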
In summary, the adaptive dynamic rescoring decoding algorithm proposed in this paper introduces improvements in three key ways: First, we effectively integrate the CTC and attention decoding methods, leveraging the efficiency of CTC and the high accuracy of attention through an adaptive dynamic rescoring mechanism. This combination significantly improves recognition accuracy and computational efficiency, allowing for more precise identification and analysis of features in diverse and uncertain audio signals. Second, the use of a nonlinear activation function enables the model to capture complex relationships within the data, enhancing its adaptability to diverse inputs. Third, the penalty mechanism reduces the model’s reliance on a single decoder by limiting weights, thus decreasing the risk of overfitting. These improvements collectively enable the model to achieve higher accuracy, stability, and efficiency in complex speech recognition tasks, meeting the demands of practical applications. The proposed adaptive dynamic rescoring algorithm not only facilitates more effective learning from complex data, improving the model’s robustness, but also helps mitigate the risk of overfitting associated with complex model structures.

4. Experiments

4.1. Experimental Setup

To validate the feasibility of the proposed model fusion and improved decoding algorithms, we implemented three different network architectures: Conformer, Unified_Conformer, and U2++_Conformer. The detailed experimental configurations for these models are presented in Table 1.
Table 1. Model parameter configurations.
Table 1 summarizes the key hyperparameter configurations used in the experiments, including the structure of the encoder and decoder, optimizer settings, and dataset parameters. Taking Conformer as an example, its encoder consists of 12 stacked layers, each with an output dimension of 256, utilizing the Swish activation function. The decoder comprises 6 stacked layers, with a CTC weight of 0.3, label smoothing weight of 0.1, and a batch size of 12. These parameters were derived from existing studies [46] and fine-tuned to align with specific task requirements. Building on this foundation, the Unified_Conformer and U2++_Conformer models introduce modifications to the model architecture and data processing strategies. Unified_Conformer employs a reduced learning rate (0.001) to enhance training stability. In contrast, U2++_Conformer reduces the number of decoder layers from 6 to 3 and incorporates a reverse sequence modeling weight (Reverse Weight = 0.3) to improve the capture of global features.
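For reference, the key Conformer settings from Table 1 can be summarized as follows; the dictionary layout is an illustrative sketch, not the authors’ actual training configuration file.

```python
# Illustrative summary of the Conformer hyperparameters reported in Table 1.
conformer_config = {
    "encoder": {"num_blocks": 12, "output_size": 256, "activation": "swish"},
    "decoder": {"num_blocks": 6},
    "ctc_weight": 0.3,          # weight of the CTC loss in the hybrid objective
    "label_smoothing": 0.1,
    "batch_size": 12,
}
```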
Additionally, during model training, the Adam optimizer was employed to dynamically adjust critical hyperparameters, thereby enhancing training efficiency and convergence [47].
Table 2 illustrates the unified network architecture of the three models, comprising an input layer, convolutional layers, positional encoding, Conformer blocks, and a fully connected layer. The input, represented as an 80-dimensional feature sequence, undergoes initial processing through 2D convolution to generate a 256-dimensional hidden state. This hidden state is then passed to the encoder, which is composed of 12 Conformer blocks. Each Conformer block includes feed-forward layers, self-attention mechanisms, and convolutional modules. Finally, the fully connected layer maps the encoded features to a 5002-dimensional output space. The 5002-dimensional output corresponds to the size of the model’s vocabulary, representing the total number of possible word or character categories.
Table 2. Model architecture.
The hardware environment for the experiments includes an NVIDIA RTX 4070 GPU with 16.0 GB of RAM, running on a 64-bit Ubuntu 18.04 operating system. The software environment is Python 3.8.

4.2. Dataset Base

This study utilized the LibriSpeech dataset, a widely used large-scale corpus for speech recognition research. It comprises approximately 1000 h of English speech recordings from the public domain, encompassing diverse speech styles, accents, and topics. This dataset provides extensive resources for developing and evaluating speech recognition algorithms, ensuring the generalizability of experimental results.
In this study, the train-960 subset was used for model training, while dev-clean, dev-other, test-clean, and test-other were employed for validation and testing. The clean subsets include recordings characterized by clear speech, standardized pronunciation, and minimal background noise, primarily aimed at evaluating model performance under ideal conditions. In contrast, the other subsets represent recordings from more complex acoustic environments and are designed to assess the model’s robustness. These recordings are marked by distinctive features such as heavy accents, irregular pronunciations, rapid speaking rates, and low-quality audio caused by aging recording equipment or environmental factors. Additionally, they include various background noises, such as indoor reverberation, street sounds, and low-frequency environmental disturbances. Testing on these subsets, particularly dev-other and test-other, enables a comprehensive evaluation of the model’s adaptability and robustness across diverse speech scenarios, showcasing its ability to handle noisy and complex speech conditions effectively. Detailed information about the dataset is presented in Table 3.
Table 3. Dataset configuration.
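If torchaudio is available, the evaluation subsets can be obtained directly; the snippet below is one possible way to load dev-clean and is not part of the authors’ pipeline (the train-960 split corresponds to the union of the three LibriSpeech training subsets).

```python
import torchaudio

# Download and load the dev-clean subset; other splits use e.g. "dev-other" or "test-clean".
dev_clean = torchaudio.datasets.LIBRISPEECH("./data", url="dev-clean", download=True)
waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = dev_clean[0]
print(sample_rate, transcript[:60])   # LibriSpeech audio is sampled at 16 kHz
```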

4.3. Performance Metrics

In speech recognition systems, evaluating model performance is crucial. To ensure that the experiments achieve the expected results, we use the word error rate (WER) as the evaluation metric. WER measures the difference between the recognized words and the reference words by counting the minimum number of substitution, deletion, and insertion operations required, divided by the total number of reference words. The specific formula is as follows:

$WER = \frac{S_w + D_w + I_w}{N_w}$

where $S_w$ represents the number of word substitutions, $D_w$ the number of word deletions, $I_w$ the number of word insertions, and $N_w$ the total number of reference words. A lower WER value indicates higher recognition accuracy.
Accuracy is another commonly used evaluation metric; the higher the accuracy, the better the model performs in recognizing speech. It can be calculated using the following formula:

$Accuracy = \frac{N_w - N_E}{N_w}$

where $N_E$ represents the number of incorrectly recognized words, which is the sum of word substitutions, deletions, and insertions. Accuracy and WER are therefore related by:

$Accuracy = \frac{N_w - (S_w + D_w + I_w)}{N_w} = 1 - WER$
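As a reference implementation, WER can be computed with a standard word-level edit distance, as in the sketch below; ref and hyp are whitespace-tokenized reference and hypothesis strings.

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = minimum edits to turn the first i reference words into the first j hypothesis words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(r)][len(h)] / max(len(r), 1)
```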

5. Results

5.1. Basic Experiment

We first trained the three base models—Conformer, Unified Conformer, and U2++ Conformer—on the LibriSpeech train-960 dataset and conducted basic experiments using three traditional decoding methods during the decoding phase. After the training was completed, we applied arithmetic averaging for an initial model fusion. The models were then tested and validated on the dev_clean, dev_other, test_clean, and test_other datasets. The changes in loss values and learning rates during the training process are shown in Figure 8 and Figure 9, respectively. The quantitative results from the testing and validation are presented in Table 4.
Figure 8. Loss value variation curve.
Figure 9. Learning rate variation curve.
Figure 8 and Figure 9 illustrate the changes in loss values and learning rates over 100 training epochs for the three models. Compared to the Conformer model, the Unified Conformer and U2++ Conformer models exhibit a more significant reduction in loss values and a smoother decline in learning rates. This improvement is attributed to the dynamic block training method and the enhancements from the causal convolution module, which have contributed to increased speech recognition accuracy. Additionally, although the U2++ Conformer shows a smooth decline in learning rates and the fastest reduction in loss values during the initial stages, its performance in later stages is not as strong as that of the Unified Conformer. This may be due to the increased model complexity introduced by the bidirectional decoding structure, which might have hindered the model’s ability to fully learn effective features within the training epochs, ultimately affecting its final performance.
Table 4 compares the recognition performance of three different network structures—Conformer, Unified Conformer, and U2++ Conformer—using three traditional decoding algorithms across two types of test and validation sets (clean and noisy environments).
Table 4. Basic test results.
First, in terms of model methodology, Unified Conformer and U2++ Conformer show significant improvements in accuracy compared to the Conformer model across all dataset types, with WER decreasing by approximately 35% in relative terms. However, when the environment shifts from clean to noisy, the WER of all models increases by roughly 125% in relative terms. This indicates that all three models still have limitations in handling noise, suggesting the need for further optimization to improve performance in complex environments.
Second, regarding decoding algorithms, the attention mechanism performs well on the clean environment validation set, with WERs of 7.10%, 4.55%, and 4.41% for Conformer, Unified Conformer, and U2++ Conformer, respectively. However, in noisy environments, the WERs increase to 16.30%, 10.12%, and 10.59%, respectively. Similarly, the recognition accuracy of both CTC greedy search and CTC Prefix Beam Search declines as the dataset environment transitions from clean to noisy.
In summary, while the recognition accuracy of Unified Conformer and U2++ Conformer shows improvement over Conformer in the basic experiments, there is still room for improvement in noisy environments. This reflects the inherent nature of these models as neural networks, which are prone to overfitting during training, and their robustness needs enhancement. Additionally, the current decoding algorithms exhibit suboptimal recognition accuracy when dealing with speech signals in complex environments. Therefore, there is a clear need for further optimization of both the model structures and decoding methods.

5.2. Model Fusion Experiment

To further optimize the models, we employed a weighted fusion approach. For each individual model, we conducted a comparative experiment between arithmetic averaging and weighted averaging under the same configuration environment. The comparison of the single model fusion test results for the three structures is presented in Table 5, Table 6 and Table 7.
Table 5. Conformer model fusion test results.
Table 6. Unified Conformer model fusion test results.
Table 7. U2++ Conformer model fusion test results.
Table 5, Table 6 and Table 7 present a comparison of recognition performance between arithmetic averaging fusion and weighted averaging fusion within the same model. The comparison reveals that using weighted fusion provides certain improvements across different data environments and decoding algorithms compared to arithmetic averaging fusion. For example, in the Conformer model with the attention mechanism decoding algorithm, the WER in the clean environment decreases from 7.10% to 7.00% on the Dev-Clean set, and from 7.23% to 7.18% on the Test-Clean set. In the noisy environment, the WER decreases from 16.30% to 16.26% on the Dev-Other set, and from 16.75% to 16.72% on the Test-Other set. Similar optimizations are also observed in the Unified Conformer and U2++ Conformer models.
These results indicate that model fusion using loss values as weights can effectively improve recognition accuracy, particularly in complex noisy environments. Additionally, this fusion method can mitigate overfitting issues during training, further enhancing the robustness of the model. Therefore, the proposed model fusion method not only optimizes the performance of individual models but also offers an effective strategy for addressing challenges in different environments. Subsequent experiments will apply weighted fusion to the three model structures.
Next, we continued with this fusion approach to combine the Conformer, Unified Conformer, and U2++ Conformer networks into a mixed model called Mixformer and conducted experiments. The results are presented in Table 8.
Table 8. Multi-model fusion test results.
Table 8 presents the recognition performance results after applying weighted fusion to the Conformer, Unified Conformer, and U2++ Conformer models. Comparing Table 8 with Table 5, Table 6 and Table 7, it is evident that the fused model outperforms the individual models in both clean and noisy environments, regardless of the decoding method used. Particularly in noisy environments, the WER of the fused model is significantly lower than that of the individual models under the same conditions. This demonstrates that the fused model, by integrating the structures and advantages of different models, not only improves recognition accuracy but also enhances robustness and generalization capabilities, leading to superior performance across various environments and decoding algorithms. Additionally, this approach effectively addresses issues such as local optima and overfitting that may arise during model training.
In summary, the Mixformer model, obtained by applying weighted fusion to the three Conformer models, significantly enhances the performance of the speech recognition system, demonstrating greater robustness and generalization ability. This indicates that the proposed model fusion method effectively resolves potential issues of local optima and overfitting during training, while also proving its significant effectiveness in improving recognition accuracy and handling complex environments.

5.3. Decoding Algorithm Improvement Experiment

For optimizing the decoding algorithm, we applied the proposed dynamic adaptive decoding algorithm and conducted experiments on the three models (after weighted fusion). The test results are presented in Table 9.
Table 9. Decoding algorithm innovation test results.
Table 9 presents the recognition performance of three models using the proposed decoding algorithm, while Table 10 compares it against traditional decoding methods, including attention, CTC greedy search, and CTC prefix beam search, highlighting the significant advantages of the adaptive attention fusion rescoring algorithm. This proposed algorithm dynamically integrates the efficiency of CTC with the global modeling capabilities of attention, effectively overcoming the limitations of traditional methods. Although attention decoding excels in capturing global dependencies, it struggles with complex alignments or noisy environments. Conversely, CTC decoding is computationally efficient but neglects contextual relationships, and both greedy search and prefix beam search often suffer from suboptimal accuracy in challenging scenarios.
Table 10. Comparison of decoding algorithms.
In contrast, the proposed decoding algorithm leverages a dynamic weighting mechanism to adaptively blend CTC’s efficiency with attention’s contextual understanding. This synergy enhances robustness in noisy conditions, delivering notable performance improvements. For instance, on noisy datasets such as Dev-Other and Test-Other, the proposed algorithm achieves significant reductions in WER. Specifically, on the Test-Other dataset with the U2++_Conformer model, WER decreased from 12.47% with CTC prefix beam search to 10.08%, a reduction of 2.39 percentage points. This demonstrates the algorithm’s ability to accurately capture complex, nonlinear features in speech signals.
In summary, the novel dynamic adaptive decoding method proposed in this paper effectively combines the strengths of both CTC and attention mechanisms, merging the advantages of both decoding algorithms. This approach improves decoding efficiency while ensuring recognition accuracy. By replacing linear functions with nonlinear ones, the method more effectively captures the complex nonlinear features in audio data. Additionally, by introducing a penalty function mechanism, it limits the model’s reliance on a single type of decoding output. Ultimately, this method significantly enhances the accuracy and robustness of speech recognition, providing new insights for the advancement of speech recognition technology and demonstrating its substantial effectiveness in practical applications.

5.4. Ablation Experiment

We combined the two aforementioned optimizations, using the Mixformer hybrid structure model and the dynamic adaptive decoding algorithm for the experiment. The contents of the previous experiments were summarized and compared in an ablation study. The experimental results are presented in Table 11 and Figure 10.
Table 11. Summary of ablation experiment results.
Figure 10. Summary of recognition results under different methods.
The above experimental results clearly demonstrate that each step of improvement significantly enhances recognition performance. First, the model fusion method (Mixformer) shows lower WER compared to individual models across both attention and other decoding algorithms, indicating its effectiveness in enhancing model robustness and generalization capabilities. Second, the dynamic adaptive decoding algorithm significantly reduces WER across various test environments compared to traditional decoding algorithms (attention, CTC greedy search, CTC prefix beam search), further improving recognition accuracy. Additionally, the ablation experiment confirms the effectiveness of each improvement step, highlighting the independent contributions of model fusion and decoding algorithm innovation to enhancing speech recognition performance, without mutual interference.
Ultimately, the combination of the model fusion and dynamic adaptive decoding algorithm yielded the best recognition performance across all test environments, with WERs of 3.92%, 8.94%, 4.07%, and 9.31% in the Dev-Clean, Dev-Other, Test-Clean, and Test-Other environments, respectively, validating the effectiveness and practicality of the proposed methods. This innovative approach offers new insights for the development of speech recognition technology, demonstrating significant advantages in improving recognition accuracy and handling complex environments.
Additionally, this study conducts a comprehensive comparison of the proposed Mixformer model with current state-of-the-art speech recognition models, as shown in Table 12 and Figure 11. The experimental results indicate that while DeepSpeech and Kaldi perform reasonably well on clean datasets such as Test-Clean, their performance significantly degrades in noisy environments, such as the Test-Other dataset, highlighting limitations in robustness. In contrast, models like HuBERT and Wav2Vec 2.0, which leverage deeper network architectures and pre-training techniques, achieve lower word error rates (WERs) across datasets, particularly excelling in clean data scenarios.
Table 12. Comparison of speech recognition accuracy using state-of-the-art models.
Figure 11. Comparison of our proposed model with state-of-the-art models.
The proposed Mixformer model outperforms all benchmark models across all test datasets. Notably, it reduces the WER by approximately 0.4% compared to HuBERT on the Dev-Other and Test-Other datasets, demonstrating its enhanced robustness in challenging acoustic conditions. These findings suggest that Mixformer not only improves recognition accuracy but also provides better adaptability to complex environments, underscoring its effectiveness and potential in speech recognition applications.

6. Conclusions

As a popular research field, speech recognition has wide-ranging applications in various aspects of life. With the advent of deep learning, the latest end-to-end architectures have demonstrated high recognition efficiency and accuracy. However, overfitting and local optima are common challenges in training deep learning models. Additionally, existing decoding algorithms struggle to balance efficiency and accuracy during the decoding phase, and recognition accuracy often declines as the complexity of the dataset increases.
To address these issues, this paper proposes innovations in both model fusion and decoding algorithms. A fusion method based on training loss values combines models within and across structures near convergence, alleviating overfitting and local optima. Additionally, a dynamic adaptive rescoring algorithm integrates CTC and attention mechanisms, utilizing nonlinear functions and penalty mechanisms to reduce reliance on single decoding outputs, thereby improving robustness. Experimental results on the LibriSpeech dataset demonstrate that model fusion significantly enhances recognition accuracy, with cross-model fusion and the proposed decoding algorithm yielding further improvements. Furthermore, a comparative analysis with state-of-the-art models reveals that the proposed method demonstrates superior recognition accuracy, particularly on complex datasets, highlighting its robustness and adaptability in challenging acoustic environments.

7. Future Work

This study has several limitations concerning model fusion, dataset diversity, and hardware resources. First, the fusion strategies employed are confined to variants of the Conformer network. While these strategies effectively mitigate overfitting and improve recognition accuracy, they lack exploration of cross-architecture fusion, thereby narrowing the scope of investigation. Second, although the LibriSpeech dataset encompasses various noise types, the evaluation primarily focuses on its overall noisy subsets without conducting separate experiments for each specific noise category. This limits the ability to analyze the model’s robustness under distinct and diverse noise conditions, which are often encountered in real-world scenarios. Finally, the computational constraints of the laboratory server, with limited processing power, may have prolonged training times and restricted the optimization of model performance and precision.
Future work will address these limitations through the following improvements. First, a broader range of fusion strategies will be explored, integrating advanced architectures such as HuBERT and Wav2Vec to leverage their complementary strengths and develop more robust and generalizable models. Second, independent tests and analyses of each noise type will be conducted to thoroughly evaluate the model’s robustness in practical applications, thereby providing more comprehensive insights into its adaptability across various noise conditions. Lastly, upgrading the hardware platform to high-performance computing infrastructure, including multi-core and multi-GPU systems, could significantly enhance training efficiency and decoding speed, paving the way for further performance optimization.

Author Contributions

Conceptualization, J.G. and D.J.; methodology, Z.H. and N.W.; software, J.G.; validation, J.G. and Z.L.; formal analysis, Z.H.; writing—original draft preparation, J.G.; writing—review and editing, D.J.; supervision, D.J.; project administration, D.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Fundamental Research Funds for the Central Universities, China [grant numbers 2024YJS007].

Data Availability Statement

The dataset LibriSpeech is publicly available and can be downloaded at: https://www.openslr.org/12 (accessed on 16 September 2024).

Acknowledgments

The authors would like to express their sincere gratitude to Jia for her invaluable guidance and support throughout this research. We would also like to thank all the students who participated in this project for their hard work and dedication; their contributions were essential to the success of this study.

Conflicts of Interest

The authors report no conflicts of interest.

References

  1. Precedence Research. Available online: https://www.marketresearch.com/IMARC-v3797/Voice-Speech-Recognition-Technology-Deployment-36766503/ (accessed on 12 November 2024).
  2. Rabiner, L.R. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 1989, 77, 257–286. [Google Scholar] [CrossRef]
  3. Hinton, G.; Deng, L.; Yu, D.; Dahl, G.E.; Mohamed, A.-R.; Jaitly, N.; Senior, A.; Vanhoucke, V.; Nguyen, P.; Sainath, T.N.; et al. Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Process. Mag. 2012, 29, 82–97. [Google Scholar] [CrossRef]
  4. Huang, W.; Chiu, C.-C.; Pang, R. Transformer in action: A comparative study of transformer-based acoustic models for large-scale speech recognition applications. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6778–6782. [Google Scholar] [CrossRef]
  5. Liu, Q.F.; Gao, J.Q.; Wan, G.S. The Research Development and Challenge of Automatic Speech Recognition. Data Comput. Dev. Front. 2019, 2, 26–36. [Google Scholar] [CrossRef]
  6. Al-Radhi, M.S.; Csapó, T.G.; Zainkó, C.; Németh, G. Continuous wavelet vocoder-based decomposition of parametric speech waveform synthesis. Proc. Interspeech 2021, 2021, 2212–2216. [Google Scholar] [CrossRef]
  7. Jelinek, F. Statistical Methods for Speech Recognition; MIT Press: Cambridge, MA, USA, 1998. [Google Scholar]
  8. Seltzer, M.L.; Yu, D.; Wang, Y. An investigation of deep neural networks for noise robust speech recognition. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, BC, Canada, 26–31 May 2013; pp. 7398–7402. [Google Scholar]
  9. Ghaffarzadegan, S.; Bořil, H.; Hansen, J.H.L. Deep neural network training for whispered speech recognition using small databases and generative model sampling. Int. J. Speech Technol. 2017, 20, 1063–1075. [Google Scholar] [CrossRef]
  10. Chauhan, M.S.; Mishra, R.; Patel, M.I. Speech recognition and separation system using deep learning. In Proceedings of the 2021 International Conference on Innovative Computing, Intelligent Communication and Smart Electrical Systems (ICSES), Coimbatore, India, 25–26 August 2021; pp. 1–5. [Google Scholar]
  11. Wang, P. Research and design of smart home speech recognition system based on deep learning. In Proceedings of the 2020 International Conference on Computer Vision, Image and Deep Learning (CVIDL), Chongqing, China, 11–13 December 2020; pp. 218–221. [Google Scholar]
  12. Vetráb, M.; Gosztolya, G. Using hybrid HMM/DNN embedding extractor models in computational paralinguistic tasks. In Proceedings of the 2023 International Conference on Acoustic Sensors and Their Applications, Budapest, Hungary, 2–4 October 2023; p. 5208. [Google Scholar]
  13. Zhi, T.; Shi, Y.; Du, W.; Li, G.; Wang, D. M2ASR-MONGO: A free Mongolian speech database and accompanied baselines. In Proceedings of the 2021 24th Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA), Kuala Lumpur, Malaysia, 1–3 December 2021; pp. 140–145. [Google Scholar]
  14. Liu, K.; Wei, J.; Zou, J.; Wang, P.; Yang, Y.; Shen, H.T. Improving Pre-trained Model-based Speech Emotion Recognition from a Low-level Speech Feature Perspective. IEEE Trans. Multimed. 2024, 26, 10623–10636. [Google Scholar] [CrossRef]
  15. Karita, S.; Chen, N.; Hayashi, T. A comparative study on transformer vs. RNN in speech applications. In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Sentosa, Singapore, 14–18 December 2019; pp. 449–456. [Google Scholar]
  16. Miao, H.; Cheng, G.; Gao, C.; Zhang, P.; Yan, Y. Transformer-based online CTC/Attention end-to-end speech recognition architecture. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6084–6088. [Google Scholar] [CrossRef]
  17. Tanaka, K.; Kameoka, H.; Kaneko, T. ATTS2s-VC: Sequence-to-sequence voice conversion with attention and context preservation mechanisms. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6805–6809. [Google Scholar] [CrossRef]
  18. Wang, Y.; Shi, Y.; Zhang, F. Weak-attention suppression for transformer-based speech recognition. arXiv 2020, arXiv:2005.09137. [Google Scholar]
  19. Gulati, A.; Qin, J.; Chiu, C.C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. Conformer: Convolution-augmented transformer for speech recognition. arXiv 2020, arXiv:2005.08100. [Google Scholar]
  20. Liu, M.; Wei, Y. An improvement to conformer-based model for high-accuracy speech feature extraction and learning. Entropy 2022, 24, 866. [Google Scholar] [CrossRef]
  21. Zhong, G.; Song, H.; Wang, R.; Sun, L.; Liu, D.; Pan, J.; Fang, X.; Du, J.; Zhang, J.; Dai, L. External text-based data augmentation for low-resource speech recognition in the constrained condition of OpenASR21 challenge. In Proceedings of the Interspeech 2022, Incheon, Republic of Korea, 18–22 September 2022; pp. 4860–4864. [Google Scholar]
  22. Chen, H.; Du, J.; Dai, Y. Audio-visual speech recognition in MISP2021 challenge: Dataset release and deep analysis. In Proceedings of the Annual Conference of the International Speech Communication Association, Brno, Czech Republic, 30 August–3 September 2022; pp. 1766–1770. [Google Scholar]
  23. Li, J.; Fang, X.; Chu, F. Acoustic feature shuffling network for text-independent speaker verification. In Proceedings of the INTERSPEECH 2022, Brno, Czech Republic, 30 August–3 September 2022; pp. 4790–4794. [Google Scholar]
  24. Nakatani, T. Improving transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 1896–1900. [Google Scholar]
  25. Wu, X. Deep sparse conformer for speech recognition. arXiv 2022, arXiv:2209.00260. [Google Scholar]
  26. Kim, J.; Lee, J. Generalizing RNN-transducer to out-domain audio via sparse self-attention layers. arXiv 2021, arXiv:2108.10752. [Google Scholar]
  27. Burchi, M.; Timofte, R. Audio-visual efficient conformer for robust speech recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 7–10 March 2023; pp. 2258–2267. [Google Scholar]
  28. Shi, Y.; Chang, S. Exploring the potentials of conformer for end-to-end speech recognition. IEEE Trans. Acoust. Speech Signal Process. 2021, 34, 566–578. [Google Scholar]
  29. Wang, T.; Deng, J.; Geng, M.; Ye, Z.; Hu, S.; Wang, Y.; Cui, M.; Jin, Z.; Liu, X.; Meng, H. Conformer-based elderly speech recognition system for Alzheimer’s disease detection. arXiv 2022, arXiv:2206.13232. [Google Scholar]
  30. Liu, H.; Chen, Z.; Shi, W. Robust Audio-Visual Mandarin Speech Recognition Based on Adaptive Decision Fusion and Tone Features. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 25–28 October 2020; pp. 1381–1385. [Google Scholar] [CrossRef]
  31. Kakuba, S.; Poulose, A.; Han, D.S. Deep Learning Approaches for Bimodal Speech Emotion Recognition: Advancements, Challenges, and a Multi-Learning Model. IEEE Access 2023, 11, 113769–113789. [Google Scholar] [CrossRef]
  32. Zhang, X.; Ma, Z.; Liu, Z.; Zhu, F.; Wang, C. Research status and prospects of transformer in speech recognition tasks. J. Comput. Sci. Explor. 2021, 15, 1578–1594. [Google Scholar]
  33. Si, D.; Yun, C. Practical sharpness-aware minimization cannot converge all the way to optima. Adv. Neural Inf. Process. Syst. 2024, 36, 26190–26228. [Google Scholar]
  34. Shamir, O. Employing No Regret Learners for Pure Exploration in Linear Bandits. Presented at NeurIPS 2020. Available online: https://opt-ml.org/ (accessed on 15 April 2024).
  35. Zheng, Z.J. Deep Learning Optimization Techniques. Available online: https://0809zheng.github.io/ (accessed on 15 April 2024).
  36. Hanif, M.K.; Zimmermann, K.H. Accelerating Viterbi algorithm on graphics processing units. Computing 2017, 99, 1105–1123. [Google Scholar] [CrossRef]
  37. Wang, D.; Wang, X.; Lv, S. An overview of end-to-end automatic speech recognition. Symmetry 2019, 11, 1018. [Google Scholar] [CrossRef]
  38. Ren, Z.; Yolwas, N.; Slamu, W.; Cao, R.; Wang, H. Improving hybrid CTC/attention architecture for agglutinative language speech recognition. Sensors 2022, 22, 7319. [Google Scholar] [CrossRef]
  39. Zhang, Y. Research on end-to-end speech recognition based on convolutional neural networks. Ph.D. Thesis, Beijing Jiaotong University, Beijing, China, 2021. [Google Scholar] [CrossRef]
  40. Xie, X. Research and system construction of end-to-end speech recognition models. Ph.D. Thesis, Jiangnan University, Wuxi, China, 2022. [Google Scholar] [CrossRef]
  41. Mukhamadiyev, A.; Khujayarov, I.; Djuraev, O. Automatic speech recognition method based on deep learning approaches for Uzbek language. Sensors 2022, 22, 3683. [Google Scholar] [CrossRef]
  42. Li, J.; Duan, Z.; Li, S. ESAformer: Enhanced self-attention for automatic speech recognition. IEEE Signal Process. Lett. 2024, 31, 471–475. [Google Scholar] [CrossRef]
  43. Yao, Z.; Guo, L.; Yang, X. Zipformer: A faster and better encoder for automatic speech recognition. arXiv 2023, arXiv:2310.11230. [Google Scholar]
  44. Tian, Z.; Yi, J.; Tao, J. Hybrid autoregressive and non-autoregressive transformer models for speech recognition. IEEE Signal Process. Lett. 2022, 29, 762–766. [Google Scholar] [CrossRef]
  45. Yi, C.; Zhou, S.; Xu, B. Efficiently fusing pretrained acoustic and linguistic encoders for low-resource speech recognition. IEEE Signal Process. Lett. 2021, 28, 788–792. [Google Scholar] [CrossRef]
  46. Andrusenko, A.; Nasretdinov, R.; Romanenko, A. UConv-Conformer: High Reduction of Input Sequence Length for End-to-End Speech Recognition. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
  47. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
