A Context-Aware Language Model to Improve the Speech Recognition in Air Traffic Control

Abstract: Recognizing isolated digits of the flight callsign is an important and challenging task for automatic speech recognition (ASR) in air traffic control (ATC). Fortunately, the flight callsign is a kind of prior ATC knowledge and is available from dynamic contextual information. In this work, we attempt to utilize this prior knowledge to improve the performance of callsign identification by integrating it into the language model (LM). The proposed approach is named the context-aware language model (CALM), and it can be applied in both the ASR decoding and rescoring phases. The proposed model is implemented with an encoder-decoder architecture, in which an extra context encoder is proposed to consider the contextual information. A shared embedding layer is designed to capture the correlations between the ASR text and the contextual information. Context attention is introduced to learn discriminative representations that support the decoder module. Finally, the proposed approach is validated with an end-to-end ASR model on a multilingual real-world corpus (ATCSpeech). Experimental results demonstrate that the proposed CALM outperforms other baselines on both the ASR and callsign identification tasks, and can be practically migrated to a real-time environment.


Introduction
In the past few decades, automatic speech recognition (ASR) has made great progress through data-driven methods. It has been widely used as an important interface for human-machine interaction in various fields, such as air traffic control (ATC) and mobile devices. Currently, in the ATC procedure, speech communication and the ATC system jointly support the ATC operation to ensure its efficiency and safety. On the one hand, the air traffic controller (ATCO) issues speech instructions via very high frequency (VHF) radio, and the pilot subsequently reads the instructions back. On the other hand, flight plans, aircraft positions provided by surveillance radar, and other contextual information are integrated into the terminal of the ATC system to assist the ATCO in managing the airspace.
However, due to technical limitations, ATC speech communication is independent of the ATC system, which therefore cannot understand the real-time traffic dynamics conveyed by speech. Thus, ASR is a promising technique to bridge speech communication and the ATC system. Recently, increasing attention has been paid to employing ASR techniques in ATC applications, such as the ATC assistance system [1], operational safety monitoring systems [2,3], and ATCO training systems [4,5].
In the above-mentioned applications, the flight callsign is the only link between the ATC speech and the real-time contextual information of the ATC system. In general, only the ASR results with a correct callsign can be applied to the downstream applications [6]. Therefore, improving the performance of callsign identification is key to advancing the ASR technique into industrial application.
Exploring ASR techniques in the field of ATC communications has attracted increasing interest in recent years. The techniques and challenges in ATC-related research were reviewed in [7,8]. A cascaded framework was studied to cope with the multilingual and out-of-vocabulary (OOV) issues in the ATC domain [9]. An exploratory benchmark of several advanced ASR models trained on ATC corpora was presented in [10]. Semi-supervised learning [11] and representation learning [12,13] approaches were also introduced to leverage abundant untranscribed speech data to improve ASR performance in the ATC domain. Furthermore, an ASR and callsign detection challenge for ATC was held by Airbus in 2018 [14].
Although significant progress in ASR performance has been made in the ATC domain [9-15], recognizing the isolated digits of the callsign remains a challenging task because digits are widely used in ATC speech and their meanings are ambiguous [8]. For example, the ATC instruction "Air China four four one climb maintain eight thousand one hundred meters" contains multiple digits: "four four one" is part of the callsign, while "eight thousand one hundred" refers to the flight level. The best callsign detection F1-score reported in [14] is about 83% on the AIRBUS-ATC [16] corpus, whereas the accuracy is about 74% for another multilingual ASR system [15]. Fortunately, the flight callsign is a kind of prior ATC knowledge and is available from the contextual information, such as surveillance radar and flight plans. In other words, if the callsign entities in the dynamic contextual information are encoded into a text set, the callsign spoken in the ATC speech is most likely one of its elements. Intuitively, integrating the contextual information into the ASR system is expected to be an effective way to improve the performance of callsign identification.
In this work, we attempt to utilize this prior knowledge (the flight callsign) to improve the performance of callsign identification in the ASR system. To this end, a context-aware language model (CALM) is proposed to integrate the contextual information into the language model (LM). Moreover, as shown in Figure 1, a contextual ASR system is designed to integrate the CALM and an end-to-end acoustic model (AM). Compared with a conventional LM, the core idea of the proposed approach is to bias the output of the AM using the CALM, which takes the embedding of the dynamic contextual information into account. Furthermore, the CALM is incorporated into the ASR system in two ways, i.e., decoding with beam search and rescoring based on the N-best list. In general, the proposed CALM is implemented with an encoder-decoder architecture, in which the encoder module consists of the text encoder and the context encoder. To consider the prior callsign set in the contextual information, an extra encoder module, i.e., the context encoder, is proposed to convert the callsigns into text-related representations. A shared embedding layer is designed to learn common correlations of the input tokens between the text encoder and the context encoder. To discriminate the contributions of the AM output and the predefined callsigns, a context attention mechanism is also designed to support the decoder module in generating the final ASR result. In addition, a callsign mapping strategy is proposed to handle the multilingual nature of ATC speech and the multiple callsign entities in the context. Finally, a simple yet effective context simulation method is developed to complete the model training on the existing corpus, which further supports real-time applications.
Combined with an end-to-end ASR model, the proposed approach is validated on a real-world multilingual speech corpus, i.e., ATCSpeech [15]. Experimental results demonstrate that the proposed CALM outperforms other baselines: it not only improves the ASR task (reaching about 4.36% character error rate), but also achieves about 20% accuracy improvement for callsign identification. Most importantly, the efficiency and effectiveness of the proposed approach are also confirmed on a 5-h real-environment dataset, in which both the ATC speech and the contextual information were collected from Chengdu area control.
In summary, the main contributions of this work are as follows:
• A novel neural network language model, called CALM, is proposed to improve the callsign identification in ATC-related ASR systems.
• Compared to a conventional LM, the proposed CALM has the ability to integrate the contextual information into the LM decoding by the designed context encoder and context-aware decoder, which improves the ASR performance from the perspective of scene awareness.
• To fuse the representations of the text and the contextual information, a context attention mechanism is proposed to generate a joint representation vector that further supports the context-aware decoder.
• We integrate the CALM into the decoding and rescoring procedures of the ASR system and validate it on a real-world speech corpus.
The remainder of the paper is organized as follows: the previous works of the contextual ASR are briefly reviewed in Section 2. Section 3 presents the architecture of the AM and CALM for constructing the ATC ASR system in this work. In Section 4, we evaluate the proposed CALM in terms of character error rate and callsign accuracy on both decoding and rescoring procedures. The conclusion and future work are described in Section 5.

Contextual ASR Systems
Integrating contextual information into the ASR system to improve performance has been studied in both conventional hybrid and end-to-end systems. In general, there are three ways of integrating contextual knowledge into ASR systems, i.e., weighted finite-state transducer (WFST) based decoding, an external LM, and end-to-end contextual ASR models.
In the HMM-based ASR system, the context information is usually injected into the main finite-state transducer (FST) graph to support the decoding by a WFST [17]. In [18], the lexicon and grammar served as straightforward extensions to generate the recognition search space through on-the-fly composition and a delayed construction mechanism. In [19], a biasing method that composes a baseline WFST with a compact WFST representation of the contextual n-grams was used for a voice search application.
An on-the-fly rescoring mechanism was proposed in [20] to adjust the LM weights of the n-grams that are relevant to the dynamic context during the decoding procedure. In [21], a class LM and a word mapping algorithm were proposed to recognize rare entity words with the LAS (Listen, Attend, and Spell) [22] architecture. A shallow-fusion end-to-end biasing method [23] showed competitive performance with the recurrent neural network transducer (RNN-T) [24] model.
End-to-end contextual ASR models incorporate contextual information into the recognition process within a single neural network. A contextual-LAS (CLAS) architecture was proposed to consider contextual information through an all-neural mechanism and outperformed online rescoring techniques [25]. To improve the recognition of entity names, an end-to-end contextual RNN-T model was presented in [26] for open-domain ASR.
Overall, these methods are able to improve the recognition of proper nouns and personalized user vocabulary from the contextual information to a certain extent. Contextual ASR systems tend to be developing from external components toward an end-to-end manner. However, the end-to-end model often requires a large number of samples in the training process. Developing an external LM for integrating contextual information is therefore still a popular technique in many applications.

ATC Related Works
Due to the wealth of contextual information in the ATC environment, various studies have attempted to utilize contextual information to improve the performance of ATC-related ASR systems in recent years [27,28]. A knowledge-based lattice rescoring method [29] was investigated to rescore the ASR hypotheses by a dynamic weighted constraint satisfaction function with dynamic contextual information. The knowledge of the dynamic contextual information was extracted by the ATC grammars specified by the International Civil Aviation Organization (ICAO). In [30], the contextual information was generated from a planning system, and a grammar WFST-based approach was further proposed to improve the ASR performance. The ASR hypothesis was also updated by a weighted Levenshtein distance over all possible words produced by an additional sequence labeling system [31].
As can be seen, the WFST is a standard component of the above methods, which rely heavily on an external module to generate the required contextual information. Inspired by the success of the Deep Fusion [32] and Cold Fusion [33] methods, we attempt to develop a context-aware LM using a deep fusion-based method and integrate it into the ASR system. Specifically, instead of processing the contextual information separately, the proposed approach handles it in a fused and straightforward manner within a neural architecture.

The Acoustic Model
Considering that end-to-end ASR systems have been among the most efficient approaches and have delivered competitive quality in recent years [12,22,24,34], a connectionist temporal classification (CTC) based model following Deep Speech 2 [35] is introduced to serve as the AM in this work. In general, the AM consists of convolutional neural network (CNN), recurrent neural network (RNN), and fully connected (FC) layers. The spectrogram of the speech, extracted by a series of linearly spaced log-filterbank filters, serves as the model input. Then, three Conv1D layers are stacked to aggregate the local frequency dependencies between adjacent speech frames and to learn high-level representations. Seven bi-directional RNN layers with gated recurrent units (GRU) are applied to capture the long-term temporal dependencies. In addition, the FC layer outputs the frame-wise probability of each token conditioned on the input speech. Finally, the training error is evaluated by the CTC criterion [36] to update the trainable parameters.
In this work, the spectrogram dimension of the input is set to 81 with 25 ms windows and 15 ms overlaps. The CNN channels, filter sizes, and strides are set to (512, 512, 512), (5, 5, 5), and (1, 1, 2), respectively. These parameters help reduce the size of the output while retaining sufficient receptive fields of the CNN. Furthermore, a BatchNorm1D layer and Hardtanh activation are employed to transform the output features of each Conv1D layer. All RNN layers adopt 512 neurons, which is consistent with the dimension of the features output by the Conv1D layers. In the training process, the Adam optimizer with an initial learning rate of $10^{-4}$ is applied to train the AM. An early stopping strategy is performed to terminate the training procedure by observing the validation loss.
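As a rough check of how the stride setting shrinks the output sequence, the sketch below counts the time steps that reach the RNN layers for a given utterance length. It assumes a 10 ms frame shift (25 ms window minus 15 ms overlap) and "same"-style padding of kernel // 2 on each Conv1D layer; the padding is not stated in the text, and the function names are hypothetical.

```python
def num_frames(duration_ms, window_ms=25, hop_ms=10):
    """Number of full analysis windows that fit in the signal."""
    if duration_ms < window_ms:
        return 0
    return 1 + (duration_ms - window_ms) // hop_ms


def conv1d_out_len(length, kernel, stride, padding):
    """Standard Conv1D output-length formula (dilation 1)."""
    return (length + 2 * padding - kernel) // stride + 1


def am_time_steps(duration_ms, kernels=(5, 5, 5), strides=(1, 1, 2)):
    """Time steps after the three stacked Conv1D layers."""
    length = num_frames(duration_ms)
    for kernel, stride in zip(kernels, strides):
        # padding = kernel // 2 is an assumption, not taken from the paper
        length = conv1d_out_len(length, kernel, stride, padding=kernel // 2)
    return length
```

Under these assumptions only the last layer (stride 2) changes the sequence length, so the RNN layers see roughly half as many steps as there are spectrogram frames.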

Context-Aware Language Model
With the speech signal being X and a word sequence being W, the target of the ASR task can be described as:
$$W^{*} = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(X \mid W)\,P(W) \quad (1)$$
for which $P(X \mid W)$ is predicted by the acoustic model, while the language model $P(W)$ models the distribution of the word sequences W. The LM is a powerful way to improve the ASR performance by building vocabulary correlations from numerous existing corpora. However, in practice, the probability of a word sequence is determined by both historical experience and real-time contextual information. The former generally consists of phrases, fixed terms, grammar, and other custom rules, while the latter mainly concerns the information that may be affected by real-time contexts, such as the personalized data on mobile devices and the flight callsign in the ATC environment.
Intuitively, for a certain application, incorporating contextual information into the ASR system is a promising way to improve its final performance. To this end, a novel perspective is introduced to utilize the contextual information empowered LM. Thereby, the target of the LM is refined as P(W | C), where C is the real-time context vector.
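One way to read this refinement is to start from the usual decomposition of the ASR objective into an acoustic term and a language-model term and to condition only the LM term on the context. The following is a sketch of that reading, not a formula taken from the paper; the symbol $W^{*}$ for the decoded sequence is notation introduced here.

```latex
\begin{align}
W^{*} &= \arg\max_{W} P(X \mid W)\, P(W)
  && \text{(conventional LM)} \\
W^{*} &= \arg\max_{W} P(X \mid W)\, P(W \mid C)
  && \text{(context-aware LM)}
\end{align}
```

The acoustic term is unchanged, which is why the same AM can be paired with either kind of LM.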
The proposed model is called CALM, whose architecture is illustrated in Figure 2. In general, the model consists of three modules: the text encoder, the context encoder, and the context-aware decoder. The three modules are described as follows:
• Text Encoder: The text encoder is composed of an input layer, an embedding layer, and several LSTM layers. Its purpose is to convert the input sequence into high-level feature representations. For a text sequence $W = \{w_1, w_2, \ldots, w_n\}$, the text encoder learns word representations through the embedding layer and outputs the intermediate hidden features $h^w = \{h^w_1, h^w_2, \ldots, h^w_n\}$ through the LSTM layers:
$$h^w = \mathrm{LSTM}(\mathrm{Embedding}(W)) \quad (2)$$
• Context Encoder: The context encoder shares the same network architecture as the text encoder, i.e., an input layer, an embedding layer, and several LSTM layers. Similarly, the context encoder learns high-level representations from the context sequence, which is generated by the contextual information mapping strategy described in Section 3.3. With a context sequence being $C = \{c_1, c_2, \ldots, c_m\}$, the context encoder learns the context representations $h^c$:
$$h^c = \mathrm{LSTM}(\mathrm{Embedding}(C)) \quad (3)$$
• Context-aware Decoder: The context-aware decoder is constructed from a context attention module and two FC layers. Specifically, the learned representations from the text encoder (AM output) and the context encoder (contextual information) are fused with different weights optimized by the context attention module. Then, the first FC layer is applied to transform the fused features. The last FC layer, with Softmax activation, normalizes the output probability over the vocabulary. The decoding process is summarized in Equations (4)-(7).
The inference rule of the feature fusion method (context attention module) is motivated by the attention mechanism.
Firstly, each hidden unit $h^w_i$ of the text sequence is assigned a score $s_i$ with respect to the context vector $h^c$ by Equation (4), where V, W, and U are trainable parameters. Secondly, the scores $s_1, \ldots, s_i$ are normalized by the Softmax operation in Equation (5) to obtain the fusion weights $\alpha_i$. Then, in Equation (6), a weighted sum over the context features is calculated to obtain the fused context feature representation $c_i$ for step i. Finally, as shown in Equation (7), the text representation vector $h^w_i$ and the fused context representation vector $c_i$ of step i are concatenated to form a context-aware vector, from which the FC layers generate the output $y_i$.
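The typeset forms of Equations (4)-(7) are not reproduced in this text. The block below gives one plausible reading of the description above, written as standard additive attention. It is an assumption rather than the authors' exact formulation: the index j over context positions, the two-index symbols $s_{i,j}$ and $\alpha_{i,j}$, the per-position context state $h^c_j$, and the name FC for the two fully connected layers are all introduced here for illustration.

```latex
\begin{align}
s_{i,j} &= V^{\top} \tanh\left( W h^{w}_{i} + U h^{c}_{j} \right) \\
\alpha_{i,j} &= \frac{\exp(s_{i,j})}{\sum_{k=1}^{m} \exp(s_{i,k})} \\
c_{i} &= \sum_{j=1}^{m} \alpha_{i,j}\, h^{c}_{j} \\
y_{i} &= \mathrm{FC}\left( [\, h^{w}_{i} \,;\, c_{i} \,] \right)
\end{align}
```

In this reading the weights for each text step i sum to one over the m context positions, which matches an attention map with the context input on one axis and the text input on the other.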

[Figure 2: Architecture of the proposed CALM, with panels for the text encoder, the context encoder, and the context-aware decoder.]
It is worth noting that the context sequence and text sequence share the same vocabulary. Meanwhile, the embedding layers of the context encoder and the text encoder also share the learned weights to build stronger correlations of the same vocabulary between the text sequence and context sequence. In this work, the architecture of the CALM is described as follows: the size of the embedding layer is set to 200 for both the text encoder and context encoder, followed by two LSTM layers with 200 neurons per layer. The context-aware decoder is configured with a context attention module and two fully connected layers with |V| units (vocabulary size).
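To make the context attention module more concrete, here is a minimal plain-Python sketch of the fusion step only. It assumes additive attention computed over the context positions for every text step; that indexing, the function names, and the use of hand-written list arithmetic instead of a deep learning framework are choices made for this illustration and are not taken from the paper.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_attention(text_states, context_states, W, U, v):
    """Fuse each text state with an attention-weighted context summary.

    text_states:    n hidden vectors from the text encoder
    context_states: m hidden vectors from the context encoder
    W, U, v:        trainable parameters in a real model (fixed here)
    Returns the n concatenated vectors and the n x m attention weights.
    """
    dim = len(context_states[0])
    projected_ctx = [matvec(U, h_c) for h_c in context_states]
    fused, weights = [], []
    for h_w in text_states:
        projected_txt = matvec(W, h_w)
        scores = [
            sum(vk * math.tanh(a + b) for vk, a, b in zip(v, projected_txt, p))
            for p in projected_ctx
        ]
        alpha = softmax(scores)
        summary = [
            sum(a * h_c[d] for a, h_c in zip(alpha, context_states))
            for d in range(dim)
        ]
        fused.append(list(h_w) + summary)  # concatenation fed to the FC layers
        weights.append(alpha)
    return fused, weights
```

Each fused vector has the text dimension plus the context dimension, so with 200-unit encoders the first FC layer would receive 400 inputs under this reading.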
Finally, the proposed CALM is incorporated into the ASR system in two ways, i.e., decoding and rescoring. The decoding strategy is performed with a beam search algorithm (refer to [35]). Beam search uses a breadth-first search strategy to build its search tree, which makes it easy to integrate the CALM scores into the search process. In the rescoring procedure, the N-best list produced by beam search over the AM output is used as the candidate set from which the final result is generated.
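The sketch below illustrates how an LM score can be added inside a beam search. It is deliberately simplified: it is not the CTC prefix beam search of [35] (blank and repeated-token handling are omitted), and the toy prefix-matching scorer merely stands in for CALM. The function names, the `lm_weight` parameter, and the penalty value are hypothetical.

```python
def beam_search(am_steps, lm_logprob, lm_weight=1.0, beam_width=20):
    """Simplified beam search with LM fusion (no CTC blank handling).

    am_steps:   list of dicts mapping token -> AM log-probability per step
    lm_logprob: function (prefix_tuple, token) -> LM log-probability
    Returns beams as (token_tuple, score), best first.
    """
    beams = [((), 0.0)]
    for step in am_steps:
        candidates = []
        for prefix, score in beams:
            for token, am_lp in step.items():
                total = score + am_lp + lm_weight * lm_logprob(prefix, token)
                candidates.append((prefix + (token,), total))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]  # keep only the best hypotheses
    return beams


def make_prefix_lm(context_callsigns, penalty=-2.0):
    """Toy stand-in for a context-aware LM.

    Extensions that remain a prefix of some callsign in the context cost
    nothing; every other extension pays a fixed penalty.
    """
    def lm_logprob(prefix, token):
        sequence = prefix + (token,)
        for callsign in context_callsigns:
            if tuple(callsign[:len(sequence)]) == sequence:
                return 0.0
        return penalty
    return lm_logprob
```

With `lm_weight=0.0` the search follows the AM alone; raising the weight lets a callsign that appears in the context overtake an acoustically stronger but out-of-context hypothesis.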

Contextual Information Organization
In this paper, we mainly focus on integrating flight callsign knowledge into the ASR system to improve its performance in a real-time ATC environment. Two problems need to be addressed for the context mapping:
• Multiple pronunciations for a single callsign: the airline company name DLH can be spoken as "delta lima hotel" or "Lufthansa". Similarly, the airline number "8883" can be spoken as "eight eight eight three" or "triple eight three" in English, or "ba ba ba san" in Chinese.
• Multiple callsign entities from the real-time context: in most cases, there are several flights in a control sector, all of which need to be fed into the context encoder to support the subsequent decoding procedure.
In this work, the callsigns of international flights are represented by English words, while the callsigns of domestic flights are represented by Chinese characters. The rest of the contextual information is organized based on its standard pronunciation. By organizing the contextual information in different formats, the CALM is expected to learn the inherent semantic representations of the same callsign entity.
Specifically, multiple flight callsigns are organized as a text sequence with a predefined separator, as: callsign_1 <eos> callsign_2 <eos> callsign_3 <eos> ... callsign_n <eos>, in which <eos> means the end of sentence and serves as a separator between callsign entities. Each callsign is regarded as a whole entity to provide discriminative features for different callsigns.
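The sequence layout described above can be sketched in a few lines. The example assumes that each callsign has already been mapped to its spoken-form tokens; the function name and the sample callsigns are illustrative only.

```python
EOS = "<eos>"


def build_context_sequence(callsigns):
    """Flatten tokenized callsigns into one sequence, closing each with <eos>."""
    sequence = []
    for tokens in callsigns:
        sequence.extend(tokens)
        sequence.append(EOS)  # separator between callsign entities
    return sequence


# Example: two callsigns currently in the sector (made-up input).
context = build_context_sequence([
    ["air", "china", "four", "four", "one"],
    ["cathay", "two", "eight", "niner"],
])
```

Because every entity ends with the same separator, the context encoder can tell where one callsign stops and the next begins regardless of how many flights are in the sector.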

ATC Corpus
In this work, both the AM and LM are trained on the ATCSpeech corpus [15], which was collected from a real ATC environment. ATCSpeech is a manually labeled multilingual ASR corpus, which includes about 39.83 h of Chinese speech and about 18.69 h of English speech. Moreover, this corpus covers all flight phases (ground, tower, approach, area control center) and is therefore a comparatively comprehensive ATC speech dataset. The detailed statistics (i.e., duration, number of utterances, speaker gender, and speaker role) of the ATCSpeech corpus are given in Table 1, and more details can be found in [15]. Since the contextual information of the training samples in this dataset can no longer be traced back, a simulation strategy is applied to generate the input of the context encoder. To simulate the callsign for each utterance, the callsigns of the whole corpus are pre-extracted to form a callsign pool. About 4.5% of the samples have no callsign in their transcription; these are labeled as None. In the training stage, the contextual information for each utterance is a combination set that includes its own callsign and k randomly selected items from the callsign pool. Here, k is picked uniformly from $[1, N_{callsign}]$, where $N_{callsign}$ is a hyperparameter of the training procedure.
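The context simulation can be sketched as follows. The sampling of k follows the description above; removing duplicates, excluding the utterance's own callsign from the distractors, capping k for small pools, and shuffling the result are assumptions added for this illustration, as are the function and parameter names.

```python
import random


def simulate_context(own_callsign, callsign_pool, n_callsign=20, rng=None):
    """Build a simulated context set for one training utterance."""
    if rng is None:
        rng = random.Random()
    # Distractors: every other distinct callsign in the pool (assumption).
    others = sorted({c for c in callsign_pool if c != own_callsign})
    k = rng.randint(1, n_callsign)   # uniform on [1, N_callsign], inclusive
    k = min(k, len(others))          # guard for small pools (assumption)
    context = rng.sample(others, k) + [own_callsign]
    rng.shuffle(context)             # position should not reveal the target
    return context
```

Passing a seeded `random.Random` makes the simulated context reproducible across runs, which is convenient when comparing LMs on the same simulated test set.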
To further validate the proposed approach, an extra test set (called test-real) is also organized to consider the influence of the simulated contextual situational information. The test-real was collected from the real ATC environment of Chengdu area control, including the ATC speech and real-time contextual situational information. The details of the test-real set are also shown in Table 1; there are 4896 utterances in this dataset with a total duration of about 5 h, about 70% Chinese speech, and 30% spoken in English.

Experimental Configurations
Due to the multilingual nature of the ATCSpeech corpus, three AM models, i.e., ASR-C, ASR-E, and ASR-A, are applied to conduct experiments, as shown below:
• ASR-C: the model is optimized on the Chinese speech of the ATCSpeech corpus.
• ASR-E: the model is optimized on the English speech of the ATCSpeech corpus.
• ASR-A: the model is optimized on the whole ATCSpeech corpus.
Based on the above ASR models, the proposed CALM is evaluated on both the decoding and rescoring phases. In the decoding experiments, the output vocabulary of the CALM is the same as that of the ASR model, i.e., Chinese character and English letter for Chinese and English speech, respectively. To explore the effect of the modeling unit, both the English letter and word are regarded as the basic token to train the related LM for the N-best rescoring evaluation.
In addition, two comparative baselines, including the N-gram and RNNLM, are also designed to confirm the efficiency and effectiveness of the proposed approach. The N-gram LM is implemented based on the KenLM toolkit [37]. The order is set to 9 and 18 for Chinese and English speech, respectively, and 15 for multilingual speech. The RNNLM architecture is implemented by removing the context encoder and attention layer of the CALM, while other layers remain unchanged.
Based on the statistics of the real ATC environment, the hyperparameter $N_{callsign}$ is set to 20. In the test-real dataset, the number of callsigns depends on the collected real-time contextual information. The top-10 hypotheses of the decoding results are used in the rescoring procedure. The beam width of the decoding procedure is set to 20.
In this work, the character error rate (CER %) based on the Chinese character and English letter is applied to evaluate the ASR output, while the callsign accuracy (CSA %) is for the callsign identification task. Only when all elements in the callsign are correctly recognized can it be considered as a valid result.
The calculations of CER and CSA are shown below:
$$\mathrm{CER} = \frac{S + D + I}{N} \times 100\% \quad (8)$$
$$\mathrm{CSA} = \frac{C_{callsign}}{T_{utterances}} \times 100\% \quad (9)$$
where N is the length of the ground truth, and S, D, and I are the numbers of substitution, deletion, and insertion operations required to convert the predicted label into the ground truth. $C_{callsign}$ and $T_{utterances}$ represent the number of utterances whose callsigns are correctly recognized and the total number of utterances in the test dataset, respectively.
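Both metrics can be computed with a standard Levenshtein distance, as in the sketch below. Aggregating the edit operations over the whole test set before dividing (rather than averaging per-utterance rates) is an assumption, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions, and insertions (S + D + I)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Character error rate in percent over a set of utterances."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    return 100.0 * errors / total


def csa(ref_callsigns, hyp_callsigns):
    """Callsign accuracy in percent: a callsign counts only if it matches exactly."""
    correct = sum(1 for r, h in zip(ref_callsigns, hyp_callsigns) if r == h)
    return 100.0 * correct / len(ref_callsigns)
```

The exact-match comparison in `csa` mirrors the rule that a callsign is valid only when all of its elements are recognized correctly.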
In the experiments, we construct and train all the models with the open-source deep learning framework PyTorch 1.4.0. The training server is configured as follows: Ubuntu 16.04 operating system with 2 × NVIDIA GeForce RTX 2080Ti GPUs, an Intel Xeon E5-2630 CPU, and 128 GB memory. Cross-entropy is used as the loss function for both the RNNLM and CALM. The learning rate of the LM training process starts at 20 and is annealed (reduced to 1/4 of its value) whenever the validation loss has not improved at the end of an epoch.
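The annealing rule can be written as a small framework-independent function. This is a sketch of the rule as described, not the authors' training script; treating "not improved" as "not strictly lower than the best validation loss so far" is an assumption, and the function name is hypothetical.

```python
def anneal_schedule(val_losses, initial_lr=20.0, factor=4.0):
    """Return the learning rate in effect after each epoch.

    The rate is divided by `factor` whenever the validation loss at the
    end of an epoch is not strictly better than the best loss seen so far.
    """
    lr = initial_lr
    best = float("inf")
    history = []
    for loss in val_losses:
        if loss < best:
            best = loss      # improvement: keep the current rate
        else:
            lr /= factor     # no improvement: reduce to 1/4
        history.append(lr)
    return history
```

For example, a run whose validation loss stalls twice ends with a rate of 20 / 16 = 1.25.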

Decoding Results
The results of applying the proposed CALM to the decoding procedure are reported in Table 2. As can be seen from the results, an extra LM is able to significantly improve the ASR performance. Specifically, the N-gram and RNNLM correct some spelling errors of the AM outputs and slightly improve the CSA. They can effectively correct the airline code, or a callsign that has occurred in the training set. However, they fail to correct unseen callsigns, especially the isolated digits or letters in the callsign. This can be attributed to the absence of semantic correlations between the digits or letters in a callsign. For both datasets, the proposed CALM achieves a considerable performance improvement on the callsign identification task, i.e., over 30% relative CSA improvement for the ASR-C and ASR-A models, and about 73% for the ASR-E model. It can also be seen from the experimental results that the models optimized on the whole corpus obtain better results than the ones optimized on monolingual speech. Two factors contribute to this. Firstly, the increase in training samples (the whole ATCSpeech corpus vs. only the Chinese or English speech) helps to improve the performance of the model. Secondly, the multilingual ASR system handles intra-sentential code-switching in the Chinese-English speech better.
Note that, since the decoding procedure is a sequential iterative search (no parallel computing), the computational complexity for the NN-based LM is much higher than that of the N-gram ones.

Rescoring Results
In this section, the proposed CALM is also applied to the ASR rescoring. Since the N-gram LM reaches a better trade-off between performance and computational complexity, it serves as the LM for the decoding procedure (a baseline) in this section. Only the ASR-A model is used for this experiment due to its superior performance over the independent systems. To further validate the LM modeling unit, both the English letter and the English word are applied to train the LM for the English speech, while the Chinese character is always used for the Chinese speech. The rescoring results for the different modeling units are listed in Tables 3 and 4, respectively. The following conclusions can be drawn from the experimental results:
1. By using the N-gram LM for the decoding procedure, the final ASR performance is slightly improved with both the CALM and RNNLM rescoring. This fact also validates the decoding procedure, which provides a more reliable N-best list and thus benefits the rescoring procedure.
2. For the test and test-real datasets, the CALM outperforms the common LM on both the ASR and callsign identification tasks. Thanks to the contextual information, the CALM achieves about 85% CSA, i.e., a 20% absolute improvement.
3. The LMs trained with English words obtain superior performance over those trained with English letters. This can be attributed to the fact that taking English letters as the modeling unit makes the input sequence too long to capture the vocabulary dependencies, which further affects the final performance of the NN-based LMs.
4. Since the rescoring is a separate procedure that does not consider the AM probability, the rescoring results are not always optimal (the lowest CER) compared to those obtained by applying the LM in the decoding procedure. However, the rescoring is a one-pass procedure and can be achieved with fewer computational resources in a real-time manner. It is therefore a preferable way to take advantage of the proposed CALM in the real environment.
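A minimal sketch of N-best rescoring is shown below. In line with the description above, the AM scores are ignored and the hypothesis with the highest LM score is returned. The scorer is a toy stand-in for CALM that simply counts context callsigns found in the hypothesis; the function names and the example hypotheses are illustrative, not taken from the experiments.

```python
def rescore_nbest(nbest, lm_score):
    """Return the hypothesis with the highest LM score.

    `nbest` is assumed to be ordered best-first by the decoder, so ties
    fall back to the decoder's ranking (max keeps the first maximum).
    """
    return max(nbest, key=lm_score)


def make_context_scorer(context_callsigns):
    """Toy LM score: number of context callsigns present in the hypothesis."""
    def lm_score(hypothesis):
        words = hypothesis.split()

        def contains(callsign):
            n = len(callsign)
            return any(words[i:i + n] == list(callsign)
                       for i in range(len(words) - n + 1))

        return sum(1.0 for callsign in context_callsigns if contains(callsign))
    return lm_score


# Made-up N-best list in which the top hypothesis has a wrong final digit.
nbest = ["cathay two eight niner climb", "cathay two eight five climb"]
scorer = make_context_scorer([("cathay", "two", "eight", "five")])
best = rescore_nbest(nbest, scorer)
```

Because each hypothesis is scored once and independently, this step adds a fixed cost per utterance, which is what makes rescoring attractive for real-time use.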

Visualization and Analysis
To better understand how the CALM works, the learned context attention weights are visualized in Figure 3 for both Chinese and English speech examples. The x-axis and y-axis correspond to the input of the context encoder (contextual information) and the text encoder (AM output), respectively. Purple colors denote attention values close to 0, while yellow colors represent values close to 1. The outputs of the baselines and the CALM for the given examples are also presented in Table 5. As shown in Figure 3 and Table 5, compared to the baselines, the probabilities of the callsign were successfully biased by the proposed CALM, which properly considers the contextual information. In practice, the callsign "Cathay two eight niner" is also a valid expression in context-independent situations. Therefore, the conventional LMs (i.e., the baselines) failed to predict the correct results. Thus, it is clear that the proposed CALM indeed captures the desired correlations between the contextual information and the AM output, which further supports the motivation of this work.
In practice, the requirements on the CER and CSA depend on the specific application scenario in ATC. For instance, a lower CER (<5%) and a higher CSA (>85%) are needed to ensure accurate alarms in real-time safety monitoring systems based on speech understanding, while a 10% CER and a 75% CSA are acceptable in speech data retrieval and analysis systems. In summary, the proposed CALM was validated on the real-world dataset and can support the majority of ASR applications in the ATC domain.

Conclusions and Future Works
In this work, we propose to apply contextual information to improve the ASR performance in the ATC domain. To this end, a context-aware LM (based on an encoder-decoder architecture) is proposed to integrate predefined flight callsigns into the ASR system. Combined with an end-to-end ASR model, the proposed approach is validated on a multilingual real-world corpus. Experimental results show that it outperforms other baselines on both the ASR and callsign identification tasks, achieving 4.36% CER and about 85.92% CSA. Most importantly, the proposed approach is also confirmed in a real-time environment. Given the computational complexity of decoding with the CALM, we believe that ASR rescoring is the preferable way to take advantage of the proposed approach in practice.
In the future, we plan to integrate more situational context information (such as speed and altitude) into the proposed CALM to improve the recognition of the key ATC elements in the ASR system.