Article

Improving Mandarin ASR Performance Through Multimodality

Faculty of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an 710049, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(22), 12224; https://doi.org/10.3390/app152212224
Submission received: 20 May 2025 / Revised: 1 August 2025 / Accepted: 2 August 2025 / Published: 18 November 2025

Abstract

In the context of Internet of Things (IoT) applications, accurate and efficient speech recognition is essential for enabling seamless voice-based interactions and control. Mandarin ASR, in particular, presents unique challenges due to the ideographic nature of the Chinese language, where recognition results are not directly correlated with pronunciation. Pinyin, as a representation of Chinese character pronunciation, has an intrinsic connection with Chinese characters, making it a valuable tool for enhancing ASR performance. This paper proposes a multimodal ASR neural network that combines pinyin data from the text modality and speech data from the audio modality as shared inputs to the ASR model. Specifically, the system processes the speech input through a pretrained WeNet to generate pinyin text, which is then enhanced using a label denoising algorithm to improve its accuracy. The proposed text-acoustic multimodal ASR model improves the overall speech recognition performance by approximately 4%, making it more suitable for IoT applications that require high accuracy in voice commands and interactions.

1. Introduction

The rapid advancement of the Internet of Things (IoT) has transformed various domains, enabling intelligent automation, real-time monitoring, and seamless human–computer interaction. With the proliferation of IoT devices in smart homes, healthcare, industrial automation, and autonomous systems, there is an increasing need for natural and efficient human–machine interfaces. Speech recognition, as a key component of human–computer interaction, provides an intuitive and hands-free control mechanism for IoT applications. By integrating speech recognition into IoT systems, users can interact with connected devices more naturally, enhancing accessibility, usability, and efficiency.
Traditional IoT interfaces primarily rely on touch, graphical user interfaces (GUIs), or predefined commands, which may not be optimal for all scenarios, especially in hands-free environments such as healthcare monitoring, smart homes, and industrial automation. Speech recognition overcomes these limitations by enabling voice-controlled operations, reducing the need for physical interaction with IoT devices. The implementation of voice-recognition systems still presents certain challenges [1,2]. In recent years, ASR systems have increased their presence in IoT-based Home Automation (HA) systems. Moreover, advancements in deep learning, particularly in ASR models, have significantly improved the accuracy and robustness of speech recognition systems, making them increasingly viable for real-world IoT applications.
ASR is a technology that converts human speech signals into text and is widely used in applications such as voice assistants, smart customer service, and voice input. With the advancement of deep learning algorithms and computational power, the accuracy of ASR systems has significantly improved [3,4,5], especially in standardized language environments like Mandarin. However, due to the complexity of the Chinese language, the abundance of homophones, and the variety of dialects, ASR still faces many challenges in Chinese speech recognition. To address these issues, researchers have developed various auxiliary techniques, such as pinyin-based recognition methods, to improve the accuracy of the system in recognizing Chinese characters.
Pinyin is the official romanization system for Standard Mandarin. Chinese characters are typically ideographic, meaning there is no direct correspondence between the characters and their pronunciation. However, pinyin, as a phonetic notation for Chinese characters, is inherently connected to them. Therefore, pinyin-assisted Mandarin ASR models have recently become an important method for Mandarin speech recognition. However, almost all of these studies only utilize pinyin information during model training to improve various end-to-end ASR architectures, such as DFSMN-CTC-sMBR [3] or attention-based encoder–decoder models [6,7].
Pinyin is merely an intermediate product in Chinese speech recognition, and the system ultimately needs to map the pinyin to the correct Chinese characters [8,9]. Existing solutions require pinyin supervision: one approach is to construct a pinyin dataset aligned with the text transcripts to address data scarcity. Another is to extract both pinyin and text results from the same speech segment and use the pinyin results to supervise the text recognition process.
The aforementioned approaches have achieved certain results, but the cost of creating and annotating datasets limits their application to experimental settings. In real-world scenarios, the varying quality of data increases annotation costs and degrades dataset quality. Regarding the second approach, since pinyin features are only intermediate products extracted by the model, their accuracy is not stable. While pinyin features can enhance text recognition performance in most cases, errors in the pinyin features can also cause correct text features to be updated with incorrect results. In this paper, we pretrain the pinyin feature extractor with a small dataset and then add a label denoising method to correct the errors of the pinyin feature extractor. With this method, we can train a multimodal speech recognition model with better performance while using only unimodal speech data.
This paper proposes an end-to-end pinyin-assisted speech recognition method aimed at improving the accuracy and robustness of Chinese speech recognition. The approach leverages a pretrained WeNet [4] model as a pinyin feature extractor, which first extracts pinyin features from the input speech signal. These pinyin features, along with conventional speech features, are then used as inputs to the model. This multimodal input method takes advantage of the complementary nature of pinyin and speech features to enhance the model’s performance. However, since the extracted pinyin features are not 100% accurate, directly using them could introduce additional noise and errors, negatively impacting the final speech recognition outcomes.
To address the accuracy issues of pinyin features, we incorporate a label denoising mechanism into the model. The label denoising technique cleanses the pinyin labels extracted from the WeNet model, removing erroneous information and making the pinyin features more reliable. The denoised pinyin features, together with the original speech features, are then fed into the neural network, further improving the speech recognition accuracy. Experimental results demonstrate that this pinyin-assisted speech recognition method significantly enhances the overall performance of the network. In particular, the synergy between the pinyin and speech features shows a marked improvement in recognition accuracy.
The main contributions of this paper are as follows:
  • A new multimodal Chinese speech recognition neural network structure is proposed.
  • Label denoising is introduced into pinyin-assisted Chinese speech recognition for the first time.
  • Recognition performance is improved by approximately 4%.
In Section 1, we discussed the shortcomings of existing multimodal speech recognition and motivated the method proposed in this paper. In Section 2, we review related work. In Section 3, we introduce the overall framework of the proposed model and describe each of its modules in detail. In Section 4, we describe the experimental setup, present and analyze the results, and provide a discussion. In Section 5, we summarize the proposed method.

2. Related Work

In this section, we review the methods used in this paper.
In IoT applications, ASR serves as a key technology for enabling voice-driven interactions, hands-free operation, and intelligent automation. Previous studies have investigated the integration of ASR in various domains, including smart homes [10], healthcare monitoring systems [11], and industrial IoT [12], showcasing its potential to improve user experience and operational efficiency. However, most existing ASR models are trained on general-purpose datasets and may not be well suited for the specific constraints of IoT environments. By incorporating pinyin-assisted Chinese ASR, we have further enhanced recognition accuracy, achieving a higher level of performance in IoT-related speech applications.
Pinyin-supported ASR has garnered increasing attention in recent years due to its relevance in processing tonal languages, particularly Mandarin Chinese [4,13,14]. Traditional ASR systems primarily rely on acoustic models that map audio signals directly to phonetic or character sequences. However, the complexity of Mandarin’s tonal variations and its large set of homophones pose challenges for direct speech-to-text conversion. To address this, several studies have integrated pinyin as an intermediate representation in the ASR pipeline.
Multimodal ASR has been proposed incorporating auxiliary modalities [15,16]—such as lip movement videos [17], pinyin [4,13], and contextual text—to assist acoustic modeling [18,19] and enhance overall system performance [20]. This line of work forms the core direction of multimodal ASR. The essence of multimodal ASR lies in leveraging multi-source information to collaboratively interpret speech content. As such, the choice of input modalities directly affects the system’s modeling capabilities and its suitability for different application scenarios. Different modalities provide complementary information across multiple dimensions, including acoustic signals, visual cues, semantic priors, and linguistic structure.
Label denoising has gained significant attention in the field of machine learning [21,22], particularly in tasks where noisy or imperfect labels negatively impact model performance. Mainstream label denoising algorithms include modified loss functions [23,24,25], “self-training” mechanisms [26,27,28], and dynamic selection of “trusted samples” to participate in gradient updates [29,30,31]. Label noise can come from various sources, such as human labeling errors, insufficient supervision, or automatic labeling systems.

3. Method

Pinyin-assisted Chinese ASR leverages the Mandarin pinyin system as an auxiliary modality to improve recognition accuracy. Unlike phoneme-based systems used in languages such as English, Chinese faces unique challenges due to its logographic nature and extensive character vocabulary. By integrating pinyin, ASR systems benefit from phonetic representation, which simplifies pronunciation modeling and reduces the inherent ambiguity of Chinese characters.
Our work introduces pinyin into Chinese ASR without requiring additional supervision from a dedicated pinyin dataset. To avoid introducing extra supervision, we first convert speech audio signals into the pinyin textual modality. Subsequently, the pinyin in the textual modality and the audio signal in the acoustic modality are fed jointly into a multimodal ASR model. The pinyin predictions are then used to further assist and supervise the improvement of Chinese ASR performance.
To generate the pinyin data in the textual modality, we pretrain a pinyin recognition model. Specifically, we utilized a WeNet-based audio-to-pinyin model, which serves as the audio-to-pinyin module in our framework. The pretrained pinyin recognition module introduces pinyin assistance into the Chinese ASR process. Since we do not directly use a dataset labeled with pinyin, but instead rely on the results from the trained pinyin recognition model, the textual modality data is not 100% accurate (in practice, the pretrained WeNet pinyin recognition model achieves 96.7% accuracy). To address this, we incorporated a label correction mechanism into the model to handle incorrect labels in the textual modality.
With the introduction of the textual modality, the input to our proposed model becomes a combination of acoustic and textual modalities. We designed a multimodal ASR model to process these multimodal inputs and enhance the overall performance of Chinese ASR. The multimodal ASR network architecture is shown in Figure 1. The first component is the audio-to-pinyin module, which extracts pinyin features from the audio signal. The extracted audio features are then converted into both the textual and acoustic modalities, which together serve as multimodal inputs. The second component is the multimodal processing module, where we employ two types of multimodal processing mechanisms to handle these multimodal features. The third component is the decoder block, inspired by the WeNet decoder module. The detailed functionalities of each module are elaborated in the following sections.
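For illustration, the following minimal PyTorch sketch shows how the three components could be wired together. The module interfaces (audio2pinyin, fusion, decoder) and the frozen-extractor setup follow the description above, but the names and signatures are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class MultimodalASR(nn.Module):
    """Minimal sketch of the proposed pipeline; submodules are injected as assumptions."""

    def __init__(self, audio2pinyin: nn.Module, fusion: nn.Module, decoder: nn.Module):
        super().__init__()
        self.audio2pinyin = audio2pinyin   # pretrained WeNet-based pinyin recognizer
        self.fusion = fusion               # multimodal processing module (e.g., cross-attention)
        self.decoder = decoder             # CTC + attention decoder block
        for p in self.audio2pinyin.parameters():
            p.requires_grad = False        # the pinyin extractor stays frozen after pretraining

    def forward(self, speech_feats: torch.Tensor, speech_lens: torch.Tensor):
        # 1) textual modality: pinyin representations predicted from the audio
        with torch.no_grad():
            pinyin_emb = self.audio2pinyin(speech_feats, speech_lens)
        # 2) fuse the acoustic and textual modalities
        fused = self.fusion(speech_feats, pinyin_emb)
        # 3) two-pass decoding (CTC hypotheses rescored by the attention decoder)
        return self.decoder(fused, speech_lens)
```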

3.1. Speech to Pinyin

The model proposed in this paper enables multimodal ASR without requiring additional pinyin datasets. Instead, the necessary pinyin modal data is generated through the speech-to-pinyin module introduced in this work. This module is implemented using two key components: (1) extracting acoustic features through a pretrained model and (2) generating pinyin labels based on these extracted features. However, the pretrained model used in our implementation has an accuracy of only 96.7%, which may introduce errors in the pinyin labels. To address this issue and enhance the reliability of the pinyin labels, we treat the extracted pinyin labels as noisy labels and incorporate a label denoising module to improve their generalization ability. The structure of the audio-to-pinyin (audio2pinyin) module is illustrated in Figure 2.
As shown in Figure 3, the overall architecture of WeNet can be divided into three main modules: the shared encoder, the CTC decoder, and the transformer decoder. The shared encoder consists of multiple stacked deep Transformer or Conformer layers, providing strong capabilities for temporal sequence modeling and contextual awareness. To meet the strict latency requirements of real-time speech recognition, the encoder limits the receptive field of future (right-side) context during modeling, thereby achieving low-latency speech modeling.
The CTC decoder, serving as the first-stage module, is composed of only a linear transformation layer that maps the encoder’s output features to a CTC activation distribution. Since CTC decoding does not rely on a global attention mechanism, this module can efficiently perform coarse-grained alignment and recognition of the speech sequence in a streaming fashion, offering prior guidance for the second-stage attention decoder.
The attention decoder, as the second-stage module, adopts a multi-layer Transformer decoder structure. Based on the encoder representations and the preliminary CTC results, it performs fine-grained modeling of the target text sequence in an autoregressive manner. This module has strong capabilities for context fusion and sequence generation, allowing it to correct potential errors from the first-pass CTC decoding and generate more accurate and fluent recognition results.
Through joint training of the CTC and attention decoding mechanisms, as well as re-ranking during inference, WeNet effectively combines the strengths of both approaches. It achieves high recognition accuracy while maintaining real-time performance, making it well suited for latency-sensitive speech recognition applications such as voice assistants, real-time captioning, and intelligent customer service. To tailor WeNet for pinyin recognition, we pretrained the model using a small, self-constructed pinyin dataset derived from THCHS-30 [32]. This dataset was designed to capture essential phonetic variations while maintaining a manageable size for efficient training. After pretraining, we froze the parameters of this module in our main model to prevent further updates during subsequent training phases. This design ensures stable pinyin feature extraction while allowing the ASR model to focus on integrating pinyin information to improve speech recognition accuracy. By leveraging WeNet’s robust architecture and incorporating a label denoising strategy, our proposed approach enhances ASR performance without the need for an extensive pinyin-specific dataset, making it well suited for practical IoT applications.
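Pinyin labels of this kind can be derived automatically from character transcripts. The snippet below is a minimal sketch using the open-source pypinyin package, which is named only as one plausible tool, since the exact tooling used to build our pinyin dataset is not part of this description.

```python
# Minimal sketch: derive tone-numbered pinyin labels from Chinese transcripts.
# Assumption: the pypinyin package is used; the exact tool is not specified in the paper.
from pypinyin import lazy_pinyin, Style

def transcript_to_pinyin(text: str) -> str:
    """Convert a character transcript into space-separated, tone-numbered pinyin."""
    return " ".join(lazy_pinyin(text, style=Style.TONE3))

if __name__ == "__main__":
    print(transcript_to_pinyin("语音识别"))  # -> "yu3 yin1 shi2 bie2"
```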

3.2. Label Denoising Module

We introduce pseudo-labels as the results of pinyin recognition to enhance the training effectiveness of our model. Pseudo-labeling is a widely used semi-supervised learning technique that leverages unlabeled data to improve model performance [33,34,35]. Its core idea is to use a trained model to generate artificial labels (pseudo-labels) for unlabeled data and incorporate these pseudo-labels as additional supervision signals during training, treating them as if they were real labels. This approach enables the model to learn from a larger dataset, even when only a small portion of the data is manually annotated.
In this study, since the results of pinyin recognition are not directly obtained through manual annotation but are instead generated by the model’s predictions, we use these predicted labels as pseudo-labels in subsequent training stages. This method not only introduces an additional modality to assist our speech recognition results but also allows the model to iteratively refine the quality of pseudo-labels through self-learning, thereby improving the accuracy and generalization capability of speech recognition. By incorporating pseudo-labels, we can effectively utilize unlabeled pinyin data to enhance the overall performance of speech recognition tasks. This approach provides a feasible solution for speech recognition and natural language processing under low-resource conditions.
Since we recognize pinyin from the audio rather than converting the ground-truth text directly into pinyin, the generated pinyin features are not 100% accurate. Inaccurate labels present a significant challenge to machine learning systems, affecting model accuracy, stability, interpretability, and overall performance. To evaluate the impact of erroneous labels on overall model performance, we conducted an experiment on the WeNet model using a version of the AIShell dataset in which labels were randomly corrupted so that only 96.7% of them remained correct. The experimental results are shown in Table 1.
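The label corruption used in this control experiment can be simulated in a few lines. The sketch below randomly replaces tokens so that roughly 96.7% of labels remain correct; the exact corruption protocol is our assumption.

```python
import random

def corrupt_labels(label_seqs, vocab, keep_ratio=0.967, seed=0):
    """Randomly replace tokens so that about `keep_ratio` of labels stay correct.
    `label_seqs` is a list of token lists; `vocab` is the label vocabulary (assumed setup)."""
    rng = random.Random(seed)
    out = []
    for seq in label_seqs:
        out.append([tok if rng.random() < keep_ratio else rng.choice(vocab) for tok in seq])
    return out
```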
Therefore, after the pinyin recognition, we add a label denoising model. Label denoising methods are designed to improve the performance of machine learning models by mitigating the impact of noisy or incorrect labels in training datasets. Noisy labels can occur due to human error, ambiguity, or automated labeling processes, which negatively affect model training by leading to overfitting or reduced accuracy. Label denoising aims to identify and correct or minimize the influence of these noisy labels during training, enhancing the overall robustness of the model.
We used three classic label denoising models for this module: decoupling, co-teaching [29], and co-teaching+ [36]. Their differences are illustrated in Figure 4, and the following paragraphs compare these leading methods for dealing with noisy labels.
Decoupled training is a typical active robust learning algorithm. Its core idea is based on the observation that training a neural network inherently follows a “curriculum” learning mechanism—from easy to hard. In the early stages, the model tends to memorize clean labels, while in the later stages, it is more prone to overfitting noisy labels. To mitigate this, the decoupled algorithm enhances robustness by controlling the source of samples used for parameter updates. Specifically, in each training batch, two neural network models independently perform forward propagation on all samples and generate their respective predictions. The algorithm then selects a subset of samples where the predictions from the two models disagree, computes the loss, and performs gradient updates only based on this subset. This strategy is based on the assumption that samples with inconsistent predictions are more likely to lie in the “boundary regions” or contain uncertain labels. By learning from these uncertain areas, the model becomes better at identifying label anomalies and avoiding overfitting to incorrect labels. This method ensures training stability while effectively suppressing the dominance of noisy labels.
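A minimal sketch of the decoupling update is given below. It is written for a generic per-sample classification batch for brevity (in our setting each sample is an utterance and the loss is a sequence loss); the function names and signatures are illustrative assumptions.

```python
import torch

def decoupling_step(model_a, model_b, opt_a, opt_b, loss_fn, x, y):
    """Update-by-disagreement sketch: only samples on which the two models disagree
    contribute to either model's gradient."""
    with torch.no_grad():
        disagree = model_a(x).argmax(dim=-1) != model_b(x).argmax(dim=-1)
    if disagree.any():
        for model, opt in ((model_a, opt_a), (model_b, opt_b)):
            opt.zero_grad()
            loss_fn(model(x[disagree]), y[disagree]).backward()
            opt.step()
```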
Co-teaching is a label denoising method based on a small-loss selection strategy. Compared with decoupled methods that filter samples based on prediction disagreement, co-teaching leverages the model’s natural tendency to learn clean samples first in the early training phase. It employs an interactive training mechanism to make the most of reliable data for gradient updates. In each training iteration, two parallel neural networks compute the loss for each sample and each selects a subset of samples with the smallest loss—assumed to have trustworthy labels. Unlike traditional methods, co-teaching does not use its own selected samples to update its own parameters. Instead, each model passes its “small-loss” samples to the other model for parameter updates, thereby reducing confirmation bias and enhancing the label denoising effect. This mechanism provides significant stability advantages and maintains strong performance even in scenarios with high label noise rates.
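The corresponding co-teaching update can be sketched as follows, again for a generic batch and assuming `loss_fn` behaves like `torch.nn.functional.cross_entropy` (supporting `reduction="none"`). The key point is the exchange of small-loss subsets between the two networks.

```python
import torch

def co_teaching_step(model_a, model_b, opt_a, opt_b, loss_fn, x, y, forget_rate=0.2):
    """Co-teaching sketch: each network selects its small-loss samples and passes
    them to the peer network for the parameter update."""
    keep = max(1, int((1.0 - forget_rate) * len(x)))
    with torch.no_grad():
        idx_a = torch.argsort(loss_fn(model_a(x), y, reduction="none"))[:keep]  # A's trusted samples
        idx_b = torch.argsort(loss_fn(model_b(x), y, reduction="none"))[:keep]  # B's trusted samples

    opt_a.zero_grad()
    loss_fn(model_a(x[idx_b]), y[idx_b]).backward()   # A learns from B's selection
    opt_a.step()

    opt_b.zero_grad()
    loss_fn(model_b(x[idx_a]), y[idx_a]).backward()   # B learns from A's selection
    opt_b.step()
```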
Co-teaching+ addresses a critical limitation of the original co-teaching algorithm: as training progresses, the two models, influenced by one another and trained on similar small-loss samples, may gradually converge in their parameters and predictions. This convergence increases redundancy in sample selection and weakens the collaborative advantage of the dual-model framework. To overcome this, the co-teaching+ algorithm combines the small-loss selection mechanism from co-teaching with the disagreement-based filtering strategy from decoupled training, aiming to improve both diversity and efficiency in sample selection. In this algorithm, training focuses not only on the loss values but also on the prediction disagreement between the models. Specifically, each model selects the set of samples on which their predictions disagree, and from within this set, chooses a subset with the smallest loss. This selected subset is then used by the other model to update parameters. This enhanced strategy retains the robustness of cross-model small-loss exchange in co-teaching while introducing the disagreement constraint from decoupling, effectively preventing the issue of “collaborative degradation” caused by model convergence. Experiments show that in tasks with a high proportion of noisy labels or where model outputs are unstable, the co-teaching+ algorithm can maintain strong denoising capability over long training cycles, significantly improving generalization under complex data distributions.
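Co-teaching+ combines the two ideas above. A minimal sketch, under the same simplifying assumptions as the previous snippets, restricts the small-loss exchange to the disagreement subset.

```python
import torch

def co_teaching_plus_step(model_a, model_b, opt_a, opt_b, loss_fn, x, y, forget_rate=0.2):
    """Co-teaching+ sketch: small-loss exchange restricted to disagreement samples."""
    with torch.no_grad():
        disagree = model_a(x).argmax(dim=-1) != model_b(x).argmax(dim=-1)
    if not disagree.any():
        return
    xd, yd = x[disagree], y[disagree]
    keep = max(1, int((1.0 - forget_rate) * len(xd)))
    with torch.no_grad():
        idx_a = torch.argsort(loss_fn(model_a(xd), yd, reduction="none"))[:keep]
        idx_b = torch.argsort(loss_fn(model_b(xd), yd, reduction="none"))[:keep]

    opt_a.zero_grad(); loss_fn(model_a(xd[idx_b]), yd[idx_b]).backward(); opt_a.step()
    opt_b.zero_grad(); loss_fn(model_b(xd[idx_a]), yd[idx_a]).backward(); opt_b.step()
```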
Specifically, we introduce a data sample partition–driven parallel modeling strategy during the data organization and model training process. By randomly distributing and shuffling training samples in the speech corpus, each training mini-batch is divided into two partially overlapping subsets, which are then fed into two independently parameterized WeNet recognition models. This allows for synchronized parameter updates and alternating guidance during training. The design intention behind this strategy is to simulate a multi-view modeling mechanism, enabling each model to process samples not only based on its own learning experience but also with feedback corrections from collaborative signals provided by the other model.
During training, the two WeNet networks perform forward propagation and compute their loss functions independently. Based on a predefined denoising algorithm, “clean samples” are dynamically selected—that is, samples are ranked according to the model’s output confidence or loss values, and only those with high confidence or low loss are used for backpropagation to update weights. This significantly reduces the impact of noisy labels on gradient estimation and parameter optimization. Such a process ensures that, in the early training stages, the model’s optimization is not dominated by incorrect labels, thereby enhancing training stability and convergence performance.
It is worth emphasizing that through this paradigm of dual-model collaboration, sample selection, and cross-feedback learning, both recognition models are able to interpret and optimize input data from complementary perspectives in each training round. As a result, the models consistently focus on high-quality sample regions while gradually diminishing the influence of noisy samples. This mechanism is particularly critical in handling pseudo-label error propagation in automatic pinyin recognition scenarios and provides a more robust source of supervisory signals for the integration of pinyin modality in future multimodal speech recognition systems.

3.3. Multimodal ASR Encoder

In this section, we introduce our multimodal ASR module, which integrates both speech and pinyin modalities to enhance recognition performance. Unlike conventional ASR encoders that process only acoustic features, our approach incorporates pinyin representations as an auxiliary modality, facilitating better linguistic alignment. To achieve this, we employ a cross-attention mechanism as the core component of our multimodal processing module.
Cross-attention refers to a fundamental mechanism within attention-based models, particularly in Transformer architectures, where the attention module computes interactions between two different input sequences. Unlike self-attention, where each token attends to all other tokens within the same sequence to capture dependencies within a single modality, cross-attention enables one sequence (e.g., a query sequence) to attend to another distinct sequence (e.g., a key and value sequence) from a different modality or a separate processing stage.
This mechanism plays a crucial role in multimodal learning, as it facilitates the transfer and alignment of information across different data sources. By enabling selective information retrieval between modalities, cross-attention enhances the model’s ability to integrate and interpret diverse data sources, improving both performance and robustness in complex multimodal applications.
In our case, the input consists of audio and text, where we treat the pinyin modality as the query (Q) and the speech information as the key (K) and value (V). This is because we want to select specific local regions in the audio information based on the pinyin sequence; that is, we align the audio sequence according to the pinyin, using it as a condition to optimize text generation. Specifically, we compute attention similarity scores between each pinyin token and all audio frames, normalize them, and then use these weights to compute the final output feature representation that is fed into the decoder. The process is as follows (a minimal code sketch is given after the list):
  • The pinyin input serves as Q, and the speech input as K and V. It is required that the two different modality sequences have matching input dimensions.
  • For each position in Q, calculate its degree of association with all positions in K, obtain the similarity matrix, and normalize it to obtain the weight matrix.
  • Use this normalized matrix to weight V, producing a new feature representation that integrates the new modality.
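The cross-attention fusion described above can be sketched with PyTorch’s built-in multi-head attention. The embedding size, head count, and the residual-plus-norm wrapper are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class PinyinSpeechCrossAttention(nn.Module):
    """Sketch of the fusion step: pinyin embeddings (Q) attend over speech frames (K, V)."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, pinyin_emb: torch.Tensor, speech_emb: torch.Tensor):
        # pinyin_emb: (B, L_pinyin, d_model), speech_emb: (B, L_audio, d_model)
        fused, weights = self.attn(query=pinyin_emb, key=speech_emb, value=speech_emb)
        return self.norm(fused + pinyin_emb), weights  # fused features and soft alignment weights
```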

3.4. Decoder Module

We propose an effective solution to improve ASR performance by utilizing a hybrid Connectionist Temporal Classification (CTC)/attention architecture. This architecture employs a Transformer or Conformer-based encoder and an attention-based decoder to rescore the hypotheses generated by the CTC model. Specifically, the system integrates a two-pass decoding process to refine and optimize the output, enhancing overall recognition accuracy.
In the first pass, the model uses the CTC decoder, which is designed to produce initial hypotheses by leveraging the CTC prefix beam search technique. This approach generates multiple candidate sequences, ranking them according to their likelihoods. By focusing on temporal alignment and sequence modeling, the CTC decoder effectively handles speech input and provides a diverse set of hypotheses, which are subsequently refined. To further improve the quality of the hypotheses, the second pass employs the Transformer rescoring strategy, which rescores the CTC-generated candidates by considering more complex dependencies within the acoustic signal. This rescoring process incorporates global context, phonetic structures, and higher-level features that the CTC model may not fully capture, ensuring that only the most promising candidates proceed to the next stage.
In the second phase of decoding, we introduce the attention decoder, which refines the results produced by the CTC decoder. The attention decoder, built with several layers of Transformer [37] or Conformer [38] architectures, enables the system to focus on relevant portions of the input sequence, effectively capturing long-range dependencies and improving the model’s overall recognition accuracy. This approach allows for a more precise alignment between the input features and the output transcription, as the attention mechanism selectively weighs the most important acoustic frames for each decoding step. The attention decoder also benefits from the flexibility of the Transformer and Conformer architectures, which excel at processing sequential data and integrating context across time.
This two-pass approach combines the strengths of both CTC and attention-based models, allowing for a more accurate and reliable speech recognition process. By leveraging the CTC model’s ability to handle sequential data and the attention decoder’s capacity to focus on context and finer details, our proposed architecture delivers a significant performance boost compared with conventional CTC decoding alone. The resulting system is well suited for real-world ASR applications, where both speed and accuracy are crucial for optimal performance.
As depicted in Figure 1, the architecture consists of three key components: a shared encoder, a CTC decoder, and an attention decoder. The shared encoder, composed of multiple layers of either Transformer or Conformer, extracts a rich set of features from the input acoustic data. This encoder is designed to balance between capturing sufficient context for accurate predictions and maintaining low latency for real-time processing. The CTC decoder then converts the encoder’s output into CTC activations, producing initial hypotheses. Finally, the attention decoder, consisting of several Transformer decoder layers, processes the CTC hypotheses in the second pass to generate the final output sequence.
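The second-pass rescoring can be summarized by the sketch below, which assumes the CTC prefix beam search has already produced an n-best list with per-hypothesis scores and that `attention_decoder` returns a teacher-forced log-likelihood as a float; this mirrors the rescoring idea but is not WeNet’s exact API.

```python
def rescore_ctc_nbest(attention_decoder, encoder_out, nbest, ctc_scores, ctc_weight=0.5):
    """Second-pass sketch: interpolate each CTC hypothesis' score with the attention
    decoder's log-likelihood and keep the best-scoring candidate."""
    best_hyp, best_score = None, float("-inf")
    for hyp, ctc_score in zip(nbest, ctc_scores):
        att_score = attention_decoder(encoder_out, hyp)   # assumed to return a float log prob
        score = ctc_weight * ctc_score + (1.0 - ctc_weight) * att_score
        if score > best_score:
            best_hyp, best_score = hyp, score
    return best_hyp, best_score
```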

4. Experiment

4.1. Experiment Setup

We use AIShell [39] as our dataset. AIShell is a large-scale Mandarin Chinese speech corpus collected by Beijing Shell Shell Technology Co., Ltd., Beijing, China. The dataset contains approximately 170 h of high-quality speech data from 400 speakers covering different dialects, ages, and genders, recorded under different acoustic conditions. The AIShell dataset has been widely used in research on speech recognition, speaker recognition, and speech synthesis, and it serves as our complete dataset for training. All experiments were trained on an NVIDIA RTX 4090 GPU.
The baseline model in this paper is derived from the proposed model: the speech-to-pinyin and multimodal encoder modules are removed, and a standard speech encoder is used instead, connected to our decoder module.
Many design choices in our model are borrowed from WeNet, and the pinyin recognition component directly uses a pretrained WeNet model, so we use WeNet as our comparison model. WeNet is based on Transformer and Conformer architectures and can effectively process long speech sequences.

4.2. Results

The experimental results are shown in Table 2.
The results show that adding our multimodal speech recognition module to a conventional speech Transformer significantly improves performance. Compared with the Chinese recognition toolkit WeNet, the proposed method also achieves a measurable improvement.
In addition, the experimental results show that if we simply introduce pinyin recognition without introducing a label error correction module, incorrect labels will cause a dramatic decrease in model performance.
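All results are reported in terms of Character Error Rate (CER); for reference, a minimal implementation of the metric is shown below (the standard Levenshtein-distance definition, not the authors’ evaluation script).

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance between reference and hypothesis characters,
    normalized by the reference length."""
    r, h = list(reference), list(hypothesis)
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / max(1, len(r))
```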

4.3. Analysis and Discussion

We evaluate the effectiveness of each module in the network structure mentioned in this paper by conducting ablation experiments. The purpose of ablation experiments is to gradually remove different modules in the network and observe their impact on the overall performance, so as to determine the contribution of each module to the network. The results of the ablation experiment are shown in Table 3.
(1) Label Denoising Module
In this section of the ablation study, we systematically analyze the independent contribution and core role of the label denoising module within the multimodal speech recognition framework. Specifically, we first remove the label denoising mechanism from the full model architecture—i.e., we disable the noise detection and dynamic filtering of the automatically generated pinyin labels during training—to assess the marginal effect of this module on overall model performance.
Experimental results show that even with the auxiliary pinyin recognition branch retained, the Top-1 pinyin recognition accuracy remains high at 96.3%. However, the overall end-to-end speech transcription system experiences a significant increase in Character Error Rate (CER) across multiple standard test sets. This indicates that a high-quality pinyin branch alone does not directly translate into stable improvements in overall recognition performance. The label denoising mechanism plays an indispensable role in the collaborative modeling process, acting as a key module for modality coordination and error suppression.
From a mechanistic perspective, the incorporation of the pinyin modality theoretically provides structural priors and phoneme-level semantic guidance for the speech recognition model. It serves a dual role in the multimodal fusion process: as an “alignment anchor” and as a “guide for attention focus.” However, in practical training scenarios, since pinyin labels are primarily generated by an automatic pinyin recognition system, noisy labels are inevitable—especially in regions where asynchronous modalities exhibit temporal misalignment or where pronunciation is ambiguous. Even with high overall accuracy, local errors can still significantly mislead the model.
Under such conditions, the attention distributions perceived by the multi-head cross-attention mechanism are prone to misdirection. The model may over-attend to incorrect pinyin positions, thereby disrupting the focus and representation of speech modality features. This leads to attention drift and modality misalignment. Such semantic misalignment is particularly pronounced in Mandarin speech recognition tasks due to the language’s complex tonal system, abundance of homophones with different meanings, and blurred word boundaries. The relationship between speech and text is inherently nonlinear and highly entangled. Once the guiding signal is biased, it can cause semantic drift throughout the entire recognition path.
To address these issues, this paper introduces a label denoising mechanism as an auxiliary training module. The core idea is to establish a self-correcting closed-loop system of dynamic label filtering and iterative refinement. Specifically, a dual-model collaboration mechanism is used to detect potential noisy labels based on prediction disagreement in the early training stage. Only samples deemed reliable by both models are retained for backpropagation, thereby preventing the model from being misled by erroneous gradients during early training.
Further experimental analysis confirms the quality-enhancing effect of this mechanism on the auxiliary pinyin information. Specifically, after introducing label denoising, attention distributions across modalities become more focused and better aligned with acoustic boundaries. The system shows significantly improved stability and accuracy in recognizing complex speech sequences, including long sentences, consecutive homophones, and accent variations. Overall, CER decreases markedly, and the model’s generalization performance is substantially enhanced.
(2) Label Denoising Method
To systematically evaluate the impact of the label denoising mechanism on the overall performance of a multimodal speech recognition system, this study designs and conducts a set of comparative experiments based on multi-label noise modeling strategies. The goal is to comprehensively validate the practicality and scalability of the label denoising module in real-world complex speech scenarios from three perspectives: model robustness, generalization ability, and modality fusion effectiveness. Specifically, we selected three representative and widely applicable mainstream algorithms from the current field of label denoising research: Decoupling, Co-teaching, and its iterative improvement version Co-teaching+. These methods were each integrated into our proposed cross-modal speech recognition framework, allowing for a systematic comparison of their adaptability and performance gains when the auxiliary modality contains noisy labels.
All three methods are based on the shared understanding that noisy labels significantly interfere with training and should be proactively filtered. They employ multi-model collaborative decision mechanisms, using different strategies to selectively ignore potential pseudo-labeled samples during training, thereby reducing the interference of incorrect labels on parameter updates.
Further comparative analysis reveals that as the granularity of the label denoising strategy increases, model performance consistently improves, indicating that high-quality label reliability modeling directly benefits multimodal fusion recognition tasks. Notably, Co-teaching+ demonstrates the strongest robustness when dealing with noisy pinyin labels, significantly reducing the negative impact of label noise on downstream recognition accuracy and further enhancing overall model performance. This finding confirms the potential of fine-grained label reliability modeling to improve the quality of multimodal learning in weakly supervised scenarios.
It is also noteworthy that even lightweight strategies like Decoupling, which feature relatively simple designs and fewer parameters, still outperform the baseline Transformer model without any denoising module when integrated into a multimodal speech recognition system. This demonstrates the high generalizability and transferability of the label denoising mechanism. It not only improves the quality of training signals from the auxiliary modality but also indirectly enhances the stability and accuracy of cross-modal representation learning.
From the perspective of the modality alignment mechanism, the pinyin modality—serving as a weakly supervised auxiliary input—has its label quality directly influencing the accuracy of alignment anchor selection and attention distribution within the cross-attention mechanism. The presence of noisy labels can cause the attention mechanism to form high-activation regions over incorrect temporal segments or semantically irrelevant fragments, thereby degrading the contextual modeling of the original speech signal. By introducing the label denoising mechanism, these pseudo-activation paths driven by erroneous labels are filtered out, enabling the model to focus more precisely on trustworthy regions. This strengthens the semantic alignment between speech and pinyin, thereby significantly improving the overall performance of the end-to-end speech recognition system.
(3) Multimodal Encoder
We systematically introduce and evaluate the cross-attention mechanism as the core structure of the multimodal fusion module, aiming to explicitly model the semantic alignment and dynamic interaction between the speech modality and the pinyin modality. Owing to its powerful cross-modal perception and selective modeling capabilities, the cross-attention mechanism has been widely adopted in cutting-edge research areas such as vision–language understanding and multimodal translation. However, its performance potential in the field of multimodal speech recognition remains underexplored.
To comprehensively assess the fusion capability and generalization performance of cross-attention, we design a feature-level concatenation fusion (concat) method as a baseline for comparison. This baseline employs parallel speech and pinyin encoders to extract features independently, then fuses the outputs via early concatenation, followed by processing with a unified Transformer encoder. Structurally, this represents a typical “early fusion” strategy, which relies on the downstream self-attention mechanism to learn inter-modal coupling from the concatenated representation automatically. However, this unconstrained fusion approach often leads to information interference and modeling bias when facing asynchronous modality alignment, inconsistent temporal granularity between pinyin and speech, or redundancy and conflict between modalities—resulting in unstable fusion and limited recognition performance.
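A minimal sketch of this concatenation baseline is shown below. The time-axis concatenation, layer sizes, and the use of PyTorch’s stock Transformer encoder are illustrative assumptions that capture the “early fusion” idea rather than the exact baseline configuration.

```python
import torch
import torch.nn as nn

class ConcatFusionBaseline(nn.Module):
    """Early-fusion baseline (sketch): concatenate pinyin and speech features along
    the time axis and let a shared Transformer encoder mix them."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, speech_emb: torch.Tensor, pinyin_emb: torch.Tensor):
        # speech_emb: (B, L_audio, d_model), pinyin_emb: (B, L_pinyin, d_model)
        joint = torch.cat([speech_emb, pinyin_emb], dim=1)   # unconstrained early fusion
        return self.encoder(joint)
```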
Experimental results confirm the effectiveness and advantages of the cross-attention mechanism from multiple perspectives. Specifically, after introducing cross-attention, the model achieved significant improvements in Character Error Rate (CER). Cross-attention not only enhanced the model’s ability to explicitly model the alignment structure between speech and pinyin but also strengthened the deep understanding of pronunciation-related semantic structures, fully leveraging pinyin’s role as an intermediate guiding representation.
Furthermore, although cross-attention and feature concatenation differ structurally in fusion strategy, both multimodal designs significantly outperform traditional unimodal Transformer architectures, further validating the importance of incorporating the pinyin auxiliary modality in Chinese speech recognition tasks. The structure of the Chinese language heavily relies on syllables and tone distinctions, and the phonetic-semantic nature of Chinese characters makes it difficult for purely acoustic signals to fully capture linguistic meaning. As an intermediate representation between speech and text, pinyin naturally possesses low dimensionality, clear semantics, and standardized structure, making it highly suitable as an auxiliary modality to guide the speech recognition model in establishing stable and semantically consistent alignment.
From a model design perspective, the cross-attention mechanism essentially provides a “semantically-driven modality alignment path.” In contrast to early fusion approaches based on “unsupervised modality disentanglement + reconstruction,” cross-attention explicitly defines the direction of attention flow, enabling the pinyin modality to serve as a high-confidence prior that dynamically regulates the information extraction scope of the speech modality. This design not only preserves the original temporal structure of speech but also significantly enhances fusion efficiency across modalities.
(4) Decoder Method
In this study, to enhance the stability and robustness of Chinese speech recognition systems in complex scenarios, we introduce a dual-path hybrid decoding structure in the decoder design—combining Connectionist Temporal Classification (CTC) with attention-based decoding. This hybrid decoding framework is based on the CTC rescoring mechanism proposed in WeNet, aiming to balance decoding efficiency and semantic modeling accuracy and overcome the structural limitations of single-path decoding in alignment, generalization, and long-sequence modeling.
Within this architecture, the CTC decoder path provides frame-level weak alignment constraints during training, accelerating model convergence, and takes on the role of candidate sequence generation during inference. It supplies the attention path with initial values for the decoding search space. This strategy effectively mitigates the instability of attention-based models when dealing with long sequences or variable-speed speech inputs, improving the overall system’s decoding determinism and robustness during convergence.
To systematically validate the role of the CTC module within the multimodal speech recognition system, this section presents an ablation study in which the CTC decoding path is completely removed. Only the autoregressive attention decoder based on the Transformer architecture is retained as the output path, and this single-path model is used as a baseline for comparison. Experimental results show that although the attention path still retains some semantic generation capability in certain scenarios, it suffers a significant performance drop in Character Error Rate (CER) metrics.
From a modeling perspective, the core advantage of CTC lies in its Conditional Independence Assumption and alignment-free learning. By maximizing the total likelihood between prediction paths and target outputs, CTC enables robust modeling that is tolerant to sequence length variations and sparse frame-level distributions. The CTC module allows the model to learn global feature representations without requiring precise boundary annotations, making it particularly suitable for processing speech samples with sparse frame information or poor signal quality, and offering favorable inductive bias and noise resistance.
Moreover, in the dual-path architecture, the CTC path supports parallel streaming inference, allowing the model to begin decoding without waiting for the complete audio input. This effectively reduces latency and improves computational efficiency for real-time speech recognition scenarios. Meanwhile, the attention path provides context-aware fine-grained rescoring of CTC’s coarse candidate outputs, compensating for CTC’s limited capacity in modeling long-term dependencies. The two paths work together to form a hybrid semantic modeling system that achieves both efficiency and accuracy.

5. Conclusions

This paper introduces a novel multimodal ASR model that leverages pinyin as a modality, marking a significant departure from traditional image-based assistance in ASR systems. Pinyin assistance aligns more closely with human auditory processing, making it a more natural fit for speech recognition tasks, particularly in tonal languages such as Chinese. One of the key contributions of this work is the resolution of the inherent label noise in the pinyin modality through the use of label denoising techniques.
The experimental results validate that multimodal ASR incorporating pinyin assistance achieves a substantial improvement in speech recognition performance compared with traditional single-modal or image-assisted methods. This advancement not only improves recognition accuracy but also demonstrates the effectiveness of incorporating linguistic structures like pinyin in speech models. Looking ahead, we plan to expand the model by developing additional components tailored to the multimodal ASR framework. Future work will focus on refining the pinyin label denoising process by incorporating acoustic information, as well as designing a specialized multimodal encoder that can better capture the interaction between pinyin and other speech features.

Author Contributions

Conceptualization, R.J.; Methodology, J.Z.; Software, Z.Y.; Data curation, X.F. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key R&D Program of China under grant 2022YFF0901800, NSFC Grants 62176205, 62472346, 62372365, 62302383.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kaur, P.; Singh, P.; Garg, V. Speech recognition system; challenges and techniques. Int. J. Comput. Sci. Inf. Technol. 2012, 3, 3989–3992. [Google Scholar]
  2. Zaidi, S.F.N.; Shukla, V.K.; Mishra, V.P.; Singh, B. Redefining home automation through voice recognition system. In Emerging Technologies in Data Mining and Information Security: Proceedings of IEMIS 2020, Volume 2; Springer: Berlin/Heidelberg, Germany, 2021; pp. 155–165. [Google Scholar]
  3. Zhang, S.; Lei, M.; Liu, Y.; Li, W. Investigation of modeling units for mandarin speech recognition using dfsmn-ctc-smbr. In Proceedings of the ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 7085–7089. [Google Scholar]
  4. Yao, Z.; Wu, D.; Wang, X.; Zhang, B.; Yu, F.; Yang, C.; Peng, Z.; Chen, X.; Xie, L.; Lei, X. WeNet: Production Oriented Streaming and Non-Streaming End-to-End Speech Recognition Toolkit. Interspeech 2021, 2021, 4054–4058. [Google Scholar]
  5. Dong, L.; Xu, S.; Xu, B. Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5884–5888. [Google Scholar]
  6. Li, L.; Long, Y.; Xu, D.; Li, Y. Boosting Character-based Mandarin ASR via Chinese Pinyin Representation. Int. J. Speech Technol. 2023, 26, 895–902. [Google Scholar] [CrossRef]
  7. Yang, Z.; Ng, D.; Fu, X.; Han, L.; Xi, W.; Wang, R.; Jiang, R.; Zhao, J. On the Effectiveness of Pinyin-Character Dual-Decoding for End-to-End Mandarin Chinese ASR. arXiv 2022, arXiv:2201.10792. [Google Scholar]
  8. Wang, Q.; Andrews, J.F. Chinese Pinyin. Am. Ann. Deaf 2021, 166, 446–461. [Google Scholar] [CrossRef] [PubMed]
  9. Effendi, J.; Tjandra, A.; Sakti, S.; Nakamura, S. Listening while speaking and visualizing: Improving ASR through multimodal chain. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Singapore, 14–18 December 2019; pp. 471–478. [Google Scholar]
  10. Ma, Y. Design and Implementation of Smart Home Air Monitoring System. In Proceedings of the 4th International Conference on Computer, Internet of Things and Control Engineering, Wuhan, China, 1–3 November 2024; pp. 151–155. [Google Scholar]
  11. Kumar, M.; Mukherjee, P.; Verma, S.; Kavita; Kaur, M.; Singh, S.; Kobielnik, M.; Woźniak, M.; Shafi, J.; Ijaz, M.F. BBNSF: Blockchain-based novel secure framework using RP2-RSA and ASR-ANN technique for IoT enabled healthcare systems. Sensors 2022, 22, 9448. [Google Scholar] [CrossRef] [PubMed]
  12. Zhu, H.; Zhang, Q.; Gao, P.; Qian, X. Speech-oriented sparse attention denoising for voice user interface toward industry 5.0. IEEE Trans. Ind. Inform. 2022, 19, 2151–2160. [Google Scholar] [CrossRef]
  13. Zhang, B.; Wu, D.; Peng, Z.; Song, X.; Yao, Z.; Lv, H.; Xie, L.; Yang, C.; Pan, F.; Niu, J. Wenet 2.0: More productive end-to-end speech recognition toolkit. arXiv 2022, arXiv:2203.15455. [Google Scholar]
  14. Bai, Y.; Chen, J.; Chen, J.; Chen, W.; Chen, Z.; Ding, C.; Dong, L.; Dong, Q.; Du, Y.; Gao, K.; et al. Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition. arXiv 2024, arXiv:2407.04675. [Google Scholar]
  15. He, J.; Shi, X.; Li, X.; Toda, T. MF-AED-AEC: Speech Emotion Recognition by Leveraging Multimodal Fusion, ASR Error Detection, and ASR Error Correction. In Proceedings of the ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 11066–11070. [Google Scholar]
  16. Oneață, D.; Cucu, H. Improving multimodal speech recognition by data augmentation and speech representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 4579–4588. [Google Scholar]
  17. Afouras, T.; Chung, J.S.; Zisserman, A. Asr is all you need: Cross-modal distillation for lip reading. In Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 2143–2147. [Google Scholar]
  18. Morris, R.W.; Clements, M.A. Reconstruction of speech from whispers. Med. Eng. Phys. 2002, 24, 515–520. [Google Scholar] [CrossRef] [PubMed]
  19. Ao, J.; Wang, R.; Zhou, L.; Wang, C.; Ren, S.; Wu, Y.; Liu, S.; Ko, T.; Li, Q.; Zhang, Y.; et al. SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; pp. 5723–5738. [Google Scholar]
  20. Broderick, M.P.; Anderson, A.J.; Lalor, E.C. Semantic context enhances the early auditory encoding of natural speech. J. Neurosci. 2019, 39, 7564–7575. [Google Scholar] [CrossRef] [PubMed]
  21. Ji, X.; Zhu, Z.; Xi, W.; Gadyatskaya, O.; Song, Z.; Cai, Y.; Liu, Y. FedFixer: Mitigating heterogeneous label noise in federated learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 12830–12838. [Google Scholar]
  22. Lukasik, M.; Bhojanapalli, S.; Menon, A.; Kumar, S. Does label smoothing mitigate label noise? In Proceedings of the International Conference on Machine Learning, PMLR, Online, 13–18 July 2020; pp. 6448–6458. [Google Scholar]
  23. Hao, Y.; Madani, S.; Guan, J.; Alloulah, M.; Gupta, S.; Hassanieh, H. Bootstrapping autonomous driving radars with self-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 15012–15023. [Google Scholar]
  24. Xu, P.; Xiang, Z.; Qiao, C.; Fu, J.; Pu, T. Adaptive multi-modal cross-entropy loss for stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 5135–5144. [Google Scholar]
  25. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 42, 2980–2988. [Google Scholar]
  26. Wang, J.; Huang, D.; Wu, X.; Tang, Y.; Lan, L. Continuous review and timely correction: Enhancing the resistance to noisy labels via self-not-true distillation. In Proceedings of the ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 5700–5704. [Google Scholar]
  27. Berthelot, D.; Carlini, N.; Goodfellow, I.; Papernot, N.; Oliver, A.; Raffel, C.A. Mixmatch: A holistic approach to semi-supervised learning. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar] [CrossRef]
  28. Radhakrishnan, A.; Davis, J.; Rabin, Z.; Lewis, B.; Scherreik, M.; Ilin, R. Design choices for enhancing noisy student self-training. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 1926–1935. [Google Scholar]
  29. Han, B.; Yao, Q.; Yu, X.; Niu, G.; Xu, M.; Hu, W.; Tsang, I.; Sugiyama, M. Co-teaching: Robust training of deep neural networks with extremely noisy labels. Adv. Neural Inf. Process. Syst. 2018, 31. [Google Scholar] [CrossRef]
  30. Li, J.; Socher, R.; Hoi, S.C. Dividemix: Learning with noisy labels as semi-supervised learning. arXiv 2020, arXiv:2002.07394. [Google Scholar]
  31. Jiang, L.; Zhou, Z.; Leung, T.; Li, L.J.; Li, F.F. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 2304–2313. [Google Scholar]
  32. Wang, D.; Zhang, X. THCHS-30: A Free Chinese Speech Corpus. arXiv 2015. Available online: http://index.cslt.org/mediawiki/images/f/fe/Thchs30.pdf (accessed on 1 August 2025). [Google Scholar]
  33. Ran, L.; Li, Y.; Liang, G.; Zhang, Y. Pseudo Labeling Methods for Semi-Supervised Semantic Segmentation: A Review and Future Perspectives. IEEE Trans. Circ. Syst. Video Technol. 2024, 35, 3054–3080. [Google Scholar] [CrossRef]
  34. Liu, S.; Cao, W.; Fu, R.; Yang, K.; Yu, Z. RPSC: Robust pseudo-labeling for semantic clustering. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 14008–14016. [Google Scholar]
  35. Pei, H.; Xiong, Y.; Wang, P.; Tao, J.; Liu, J.; Deng, H.; Ma, J.; Guan, X. Memory disagreement: A pseudo-labeling measure from training dynamics for semi-supervised graph learning. In Proceedings of the ACM Web Conference 2024, Singapore, 13–17 May 2024; pp. 434–445. [Google Scholar]
  36. Yu, X.; Han, B.; Yao, J.; Niu, G.; Tsang, I.; Sugiyama, M. How does disagreement help generalization against label corruption? In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 7164–7173. [Google Scholar]
  37. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
  38. Gulati, A.; Qin, J.; Chiu, C.C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. Conformer: Convolution-augmented transformer for speech recognition. arXiv 2020, arXiv:2005.08100. [Google Scholar]
  39. Bu, H.; Du, J.; Na, X.; Wu, B.; Zheng, H. AIShell-1: An Open-Source Mandarin Speech Corpus and A Speech Recognition Baseline. In Proceedings of the Oriental COCOSDA 2017, Seoul, Republic of Korea, 1–3 November 2017. Submitted. [Google Scholar]
Figure 1. The overall structure of the model.
Figure 2. Audio2Pinyin module.
Figure 3. The WeNet architecture.
Figure 4. Label denoising models.
Table 1. The Impact of Noisy Labels on Results.

WeNet           Accuracy   CER     Performance
baseline        94.76      5.34    —
noise label     86.47      13.53   −8.19
Table 2. Results.

Model                     Accuracy   CER     Performance
WeNet                     94.76      5.34    −1.23
baseline                  91.62      8.38    −4.25
Prop-Model w/o denoise    81.09      18.91   −14.78
Proposed Model            95.87      4.13    —
Table 3. Ablation Results. Each label denoising method is evaluated with the two multimodal fusion mechanisms (Concat, Cross-Att), each with (w/) and without (w/o) the CTC decoding path; cells give CER.

Label Denoising   Concat, w/ CTC   Concat, w/o CTC   Cross-Att, w/ CTC   Cross-Att, w/o CTC
w/o denoise       18.91            20.67             16.71               17.24
Decoupling        7.57             8.11              7.21                7.68
Co-teaching       6.18             6.37              5.17                6.09
Co-teaching+      4.96             5.23              4.13                4.83
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
