Lessons Learned in Transcribing 5000 h of Air Traffic Control Communications for Robust Automatic Speech Understanding

Abstract: Voice communication between air traffic controllers (ATCos) and pilots is critical for ensuring safe and efficient air traffic control (ATC). Handling these voice communications requires a high level of awareness from ATCos and can be tedious and error-prone. Recent efforts aim to integrate artificial intelligence (AI) into ATC communications in order to lessen ATCos' workload. However, developing data-driven AI systems for understanding spoken ATC communications demands large-scale annotated datasets, which are currently lacking in the field. This paper explores the lessons learned from the ATCO2 project, which aimed to develop a unique platform to collect, preprocess, and transcribe large amounts of ATC audio data from airspace in real time. This paper reviews (i) robust automatic speech recognition (ASR), (ii) natural language processing, (iii) English language identification, and (iv) contextual ASR biasing with surveillance data. The pipeline developed during the ATCO2 project, along with the open-sourcing of its data, encourages research in the ATC field, while the full corpus can be purchased through ELDA. The ATCO2 corpora are suitable for developing ASR systems when little or no transcribed ATC audio data are available. For instance, the proposed ASR system trained with ATCO2 reaches a WER as low as 17.9% on public ATC datasets, which is 6.6% absolute WER better than training with “out-of-domain” but gold transcriptions. Finally, the release of 5000 h of ASR-transcribed speech, covering more than 10 airports worldwide, is a step towards more robust automatic speech understanding systems for ATC communications.


Introduction
There has been growing interest in the development of automatic speech recognition (ASR) and understanding systems for air traffic control (ATC) due to their potential to enhance the safety and efficiency of the aviation industry. The application of ASR and understanding technologies in ATC has resulted in advanced proof-of-concept engines that can assist air traffic controllers (ATCos) in their daily tasks. These systems are designed to analyze spoken ATC communications and convert them into machine-readable text, allowing for faster and more accurate processing. Previous projects such as MALORCA [1], HAAWAII [2], or SESAR2020's Solution 97.2 [3] have demonstrated methods mature enough to reduce ATCos' workload while increasing safety; see, e.g., [4,5]. The authors concluded that integrating novel ASR-based tools can reduce the total time that ATCos spend entering and confirming clearances in their workstations by 20 absolute percentage points. As a result, ASR and understanding technologies are becoming more advanced and capable of handling the complexities of ATC communications, leading to improved safety and efficiency in the aviation industry. The paragraphs below summarize three current challenges of working with ATC voice communications that this paper addresses:
(1) Previous work on ASR for air traffic communications has been built for a specific domain, e.g., one airport or en-route/approach scenarios. Adapting machine learning models to different airports or control areas requires new in-domain data, which remain challenging to collect and annotate. For instance, ATC audio data collected at one airport, e.g., airport X, generally do not transfer well to airport Y.
(2) ATC data collection and transcription [6] are expensive and time-consuming tasks. The data collection phase includes data capture and preprocessing; this task can be automated. The data transcription phase aims at producing the word-by-word transcript of a given ATC utterance; this task is carried out manually by human annotators. It becomes expensive because several man-hours are needed to transcribe one hour of ATC speech without silence. For some solutions, such as those targeting small airports, this cost may be prohibitive. This raises the question of what the most efficient way to collect and process large-scale ATC audio data is.
(3) In addition, audio data from ATC communications are considerably noisier than standard ASR corpora when they are captured via very-high-frequency (VHF) receivers. In some cases, signal-to-noise ratio (SNR) levels may range from 5 to 20 dB. Thus, it becomes challenging to develop an ASR system and later use its outputs for downstream tasks, e.g., natural language processing (NLP), due to the high word error rates (WER). In contrast, higher-SNR ATC data sourced from operation rooms, characterized by close-mic recordings and substantially reduced noise, can be obtained from air navigation service providers (ANSPs), although they are limited to private use in most cases.
In this paper, we address these challenges by extending our previous work on the ATCO2 project and its resulting corpora [7]; see detailed information in Appendix A. The ATCO2 project aimed to reduce the human effort required to collect, preprocess, and transcribe ATC voice communications by employing state-of-the-art ASR and NLP systems [8]. ATCO2 releases the largest corpus of ATC voice communications to date, consisting of more than 5000 h of automatically transcribed audio data and their corresponding surveillance data [9]. In addition, four hours of human-transcribed data (i.e., gold transcriptions) were also released, where we quantified that the transcription process can be significantly accelerated by providing the annotators with automatically transcribed data (i.e., output from an in-domain ASR system), rather than requiring them to produce transcriptions from scratch. According to [7], the real-time factor (RTF; time needed to generate the gold word-by-word transcription of the ATC audio with regard to its duration) for transcribing the data can be reduced from 50 to 20. An overview of the composition of the ATCO2 corpora is given in Figure 1.
This paper covers several aspects and lessons learned (see Section 6) related to the data collection and the transcription pipeline, including its primary actors. It also covers the main AI-based systems that can be developed with the ATCO2 corpora, and we set baselines on ASR and understanding.
The rest of the paper is organized as follows. Section 2 covers related work on automatic speech recognition and understanding for ATC. We describe the ATCO2 system, data collection pipeline, and the main contributions of this paper in Section 3. In Section 4, we cover technical details about the data collection platform (front end and back end) and how the community of volunteers interacts with them. In Section 5, we cover the main technologies that can be developed with the ATCO2 corpora. We conclude the paper and discuss the main lessons learned in Section 6.

Early Work on Automatic Speech Recognition and Understanding in ATC
ASR and understanding of ATC communications have been applied to ATCo trainee training by AENA (Aeropuertos Españoles y Navegación Aérea) [10] and the MITRE Corporation [11], and to workload estimation with ASR systems [12]. In recent years, more research-oriented work has focused on pure ASR. For example, ref. [13] established the first benchmark on ASR for different ATC communications-focused databases. Furthermore, there has been a significant effort to integrate novel semi-supervised learning algorithms that boost ASR performance with surveillance data, such as [14]. This reflects the growing research interest in ASR and understanding for ATC, with mature proof-of-concept engines that can assist ATCos in their daily tasks. Our previous work on the large-scale automatic collection of ATC audio data from different airports worldwide is described in [9]. Additionally, recent work has explored improving callsign recognition by integrating surveillance data into the pipeline [15] and automating the extraction of pilot reports with ASR tools [16].
Another line of work has been directed at open-sourcing ATC-related databases: for US-based communications [17], for Czechia [18], and for several English accents [19]. Recently, there was an Airbus-led challenge [20] for ATC communications, with French-accented recordings from France [21]. Private databases such as VOCALISE [22] and ENAC [23] have also targeted ATC communications. For a general overview of ATC-related databases, we redirect the reader to

ATCO2 Corpora
It is well known that AI-based tools need large amounts of reliably transcribed data during training. For instance, ASR or NLP tools for ATC could work better if large-scale data were available. The ATCO2 corpora were designed to target this data scarcity issue by addressing four major challenges:
(1) Current corpora related to air traffic control are primarily focused on automatic speech recognition. However, for an AI engine to be successfully deployed in the control room, it must not only accurately transcribe ATC communications but also understand them. This includes the ability to perform speaker role detection (SRD) as well as to extract and parse callsigns and commands. The ATCO2 corpora address this challenge by including detailed tags for SRD and for callsign and command extraction. This, in turn, should improve the accuracy and efficiency of AI-based systems in ATC operations.
(2) Out-of-domain ASR and NLP corpora transfer poorly to the ATC domain. ATC communication follows a unique grammatical structure and employs a specific vocabulary defined by ICAO [25], making it a niche application. This significantly limits the usefulness of out-of-domain corpora: previous studies [13] have shown that non-ATC corpora such as LibriSpeech [26], CommonVoice [27], or SWITCHBOARD [28] do not match the acoustics of ATC communications and therefore do not contribute substantially to ASR training. As such, the ATCO2 project collected and publicly released a large amount of ATC-specific data to aid in the development of ASR and understanding engines for ATC.
(3) The research community working on ATC is hindered by a severe lack of openly available annotated data. To address this issue, the ATCO2 project has released a corpus of over 5000 h of automatically transcribed data (i.e., the ATCO2-T set), as well as 4 h of manually annotated data (i.e., ATCO2-test-set-4h). It is worth noting from Table 1 that the transcriptions generated by the automatic tools have proven robust, with WERs as low as 9%. These error rates are achieved when training an ASR engine with the ATCO2 corpora only. See the prior results for the Malorca-Vienna test set from the MALORCA project in [7].
(4) There is no standardized metric to evaluate the quality of untranscribed data prior to transcription. Currently, when a new corpus for ASR is in its collection and labeling phase, few filtering stages are performed to ensure that high-quality audio data are selected. In contrast, Section 3.3, and specifically Equation (1), presents the quality estimation that helped ATCO2 select the best audio files for gold transcription by humans.

ATCO2 System and Generalities
The ATCO2 system is described in Figure 1. During the collection of the ATCO2 corpora, we followed several preprocessing steps in order to normalize the generated transcriptions. Here, we aim at minimizing errors produced by phonetically confusable wordings, e.g., "descent to two thousand" and "descend two two thousand". We performed several text normalization steps in order to unify the gold and automatic transcriptions following ICAO rules [25] and well-known ontologies for ATC communications [5]. A summary of the transcription protocol is depicted in Figure 2. Additionally, we direct the reader to a more detailed overview of text normalization and the lexicon for transcript generation in Section 3 of ref. [7]. The ATCO2 corpora are composed of the ATCO2-T set corpus and the ATCO2-test-set corpus, described below:
• First, the ATCO2-T set corpus is the first ever release of a large-scale dataset targeted to ATC communications. We recorded, preprocessed, and automatically transcribed ∼5281 h of ATC speech from ten different airports (see Table 2). To the best of the authors' knowledge, this is the largest and richest dataset in the area of ATC ever created that is accessible for research and commercial use. Further information and details are available in [7].
• Second, the ATCO2-test-set-4h corpus was built for the evaluation and development of automatic speech recognition and understanding systems for English ATC communications. This dataset was annotated by humans. There are two partitions of the dataset, as stated in Table 1. The ATCO2-test-set-1h corpus is a ∼1 h long open-sourced corpus, and it can be accessed for free at https://www.atco2.org/data (accessed on 10 October 2023). The ATCO2-test-set-4h corpus contains the ATCO2-test-set-1h corpus and adds to it ∼3 more hours of manually annotated data. The full corpus is available for purchase through ELDA at http://catalog.elra.info/en-us/repository/browse/ELRA-S0484 (accessed on 10 October 2023).

Data Collection Pipeline
The processing pipeline is implemented as a Python script driven by a configuration file (worker.py). The configuration file allows us to modify the logic and flow of the data in the pipeline on-the-fly. It allows parallelism, forking, and conditions. In principle, worker.py consists of global definitions (constants), blocks (local definitions), and links (an acyclic oriented graph) between blocks. The processing pipeline is given in Figure 3. Earlier implementations of each technology, e.g., segmentation and diarization, ASR, or named entity recognition (NER), are covered in the previous work [9]. All the technologies and tools are encapsulated in BASH scripts with a unified interface.
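The block-and-link structure described above can be illustrated with a short sketch. This is not the actual worker.py: the block names, the `run_pipeline` helper, and the SNR threshold are hypothetical, and the blocks only mimic the real processing steps.

```python
from graphlib import TopologicalSorter

# Global definitions (constants); the threshold value is an assumption.
MIN_SNR_DB = 5.0

# Blocks: each takes the shared state dict and returns an updated copy.
def demodulate(state):
    return {**state, "waveform": True}

def snr_filter(state):
    # Condition: mark the recording as rejected when it is too noisy.
    return {**state, "keep": state["snr_db"] >= MIN_SNR_DB}

def asr(state):
    return {**state, "transcript": "<asr output>"}

BLOCKS = {"demodulate": demodulate, "snr_filter": snr_filter, "asr": asr}

# Links: block -> set of predecessor blocks (an acyclic oriented graph).
LINKS = {"demodulate": set(), "snr_filter": {"demodulate"}, "asr": {"snr_filter"}}

def run_pipeline(state):
    """Run blocks in topological order; stop early if a condition fails."""
    executed = []
    for name in TopologicalSorter(LINKS).static_order():
        state = BLOCKS[name](state)
        executed.append(name)
        if state.get("keep") is False:
            break
    return state, executed
```

Keeping the links in a plain dictionary means the flow can be changed by editing data rather than code, which is one plausible reading of the "on-the-fly" reconfiguration mentioned above.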
The first row of blocks in Figure 3 refers to segmentation and demodulation. Initially, an antenna and a recording device jointly capture the radio signal, which is divided into segments containing the portions where the transmission was "active"; the silent parts are not recorded (push-to-talk is used in ATC voice communication). This functionality is part of the RTLSDR-Airband audio recording software, from which we dump the raw I/Q signal. Second, we convert this complex I/Q radio signal into a waveform signal with the software-defined radio tool CSDR. The first part is performed in the recording device, while the second is performed at the OpenSky Network (OSN). The OSN is a nonprofit community-based receiver network that has been continuously collecting air traffic surveillance data since 2013. Unlike other networks, OpenSky keeps the complete unfiltered raw data and makes them accessible to academic and institutional researchers.
Next, we perform "signal-to-noise ratio (SNR) filtering" (second row), whose purpose is to remove recordings that are too noisy. In bad recording conditions, the voice may not be intelligible. The following step is "diarization" (third row). In the automatically segmented data, some recordings contain more than one speaker. This is a problem because we intend to automatically transcribe speaker turns of single speakers, and because subsequent NLP/SLU tasks also require separated speaker turns. Diarization solves this by splitting the audio into single-speaker segments and assigning them speaker labels. In the ASR step, we convert speech to text. This is performed by our ASR system, built with tools from the Kaldi toolkit [29]. The outputs of this step are transcripts, which inevitably contain some errors. To improve the accuracy of the transcripts, we use callsign lists from surveillance as contextual information. The callsign lists come from the air traffic monitoring databases of the OpenSky Network. Further details can be found in Section 5.2.2 and [15].
Next, the transcripts are used as input for the English language detection (ELD) system. The purpose is to be able to discard non-English audio data. A typical state-of-the-art language identification system is based on acoustic modeling and uses audio as input. For ATC speech, we do not need to "identify" the non-English languages, so we developed a "lexical English detection system" that uses the transcripts and confidence scores produced by ASR as its inputs (see previous work at Interspeech 2021 [24]). For ATC speech, this worked better than the "traditional" acoustic language identification method. The last automatic operation is "post-processing by NLP". Currently, the pipeline performs a callsign-code extraction step, which returns the callsign of an aircraft in ICAO format, like "DLH77RM". Finally, some processed data go through "human correction", and some data are kept with automatic labels. The former case produced the ATCO2-test-set-4h corpus, while the latter produced the ATCO2-T set corpus. A more detailed description of the data collection flow and data transcription is given in Appendices C and D.
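To make the callsign-code extraction step concrete, the following sketch converts a spoken callsign into its ICAO form. It is an illustration only and not the NLP module used in the pipeline: the function name and the two-entry airline table are assumptions, and a real system would first have to locate the callsign inside the full transcript.

```python
# ICAO airline designators for two example airlines (tiny illustrative table).
AIRLINES = {"lufthansa": "DLH", "ryanair": "RYR"}
DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
          "five": "5", "six": "6", "seven": "7", "eight": "8",
          "nine": "9", "niner": "9"}
# Spelling-alphabet words map to their first letter.
LETTERS = {w: w[0].upper() for w in (
    "alfa alpha bravo charlie delta echo foxtrot golf hotel india juliett "
    "juliet kilo lima mike november oscar papa quebec romeo sierra tango "
    "uniform victor whiskey xray yankee zulu").split()}

def spoken_callsign_to_icao(text):
    """Map e.g. 'lufthansa seven seven romeo mike' to 'DLH77RM', else None."""
    words = text.lower().split()
    if not words or words[0] not in AIRLINES:
        return None
    code = AIRLINES[words[0]]
    for word in words[1:]:
        symbol = DIGITS.get(word) or LETTERS.get(word)
        if symbol is None:
            return None  # unknown word: refuse to guess
        code += symbol
    return code
```

For the example in the text, `spoken_callsign_to_icao("lufthansa seven seven romeo mike")` yields `"DLH77RM"`.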

ATCO2 on-line processing pipeline
Figure 3. ATCO2 workflow for processing data collected by a community of feeders. Initially, the data are sent and stored on OSN servers. The audio data go through several modules to filter out recordings with a high level of noise and too-long or too-short segments. Blue rectangles are processes. The cyan arrow blocks are internal callback events, where the pipeline informs the master node about progress and sends intermediate results. The orange rhombuses are conditions, where intermediate results are taken into account (e.g., an SNR level), i.e., whether to continue (clean audio) or stop processing. A final internal callback is run when the pipeline finishes. It triggers the API to call the OSN server with the particular callback, e.g., reporting that the processing finished with OK or ERROR.

Quality Estimation for Data Transcription
As mentioned at the end of Section 3.2, the captured, processed, and automatically transcribed data (see Figure 3) can be annotated by humans. This in turn generates "gold transcriptions" that we use to evaluate the proposed ASR and NLP systems. The ATCO2-test-set-4h corpus went through all these steps. As the data are continuously being recorded by OSN, we need to select the most intelligible and clean data. We developed a score that ranks the recordings by quality. This score integrates seven metrics that assess the quality of each recording in the ATCO2 corpora. We used Equation (1) to measure, rank, and select the ATC communications with the highest quality. These recordings were then shortlisted for human transcription (see Section 4.2). The data annotators generated the ground truth transcripts and tags that are part of the ATCO2-test-set-4h corpus. The ranking score of Equation (1) combines the following metrics:
• avg_SNR: average SNR of the speech, in the range <0, 40>. The SNR needs to be as high as possible;
• num_spk: number of speakers in the audio, in the range <1, 10>. The more speakers detected in the audio, the better;
• speech_len: amount of speech in seconds;
• audio_len: overall audio length. The more speech detected in the audio, the better;
• ELD_score: "probability" of the audio being English, in the range <0.0, 1.0>. The higher the ELD score, the better;
• avg_WordConf: average confidence of the speech recognizer, in the range <0.0, 1.0>. We want data where the recognizer is confident, so higher is better;
• wrd_cnt: number of words spoken, in the range <0, ∼150>. The more words, the better.
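To illustrate how such metrics can be turned into a ranking, the sketch below rescales each metric to [0, 1] using the ranges listed above and averages them with equal weights. This combination rule is an assumption made purely for illustration and is not Equation (1); the class and function names are hypothetical as well.

```python
from dataclasses import dataclass

@dataclass
class Recording:
    name: str
    avg_snr: float        # expected in <0, 40>
    num_spk: int          # expected in <1, 10>
    speech_len: float     # seconds of speech
    audio_len: float      # seconds of audio
    eld_score: float      # <0.0, 1.0>
    avg_word_conf: float  # <0.0, 1.0>
    wrd_cnt: int          # roughly <0, 150>

def _unit(value, low, high):
    """Clip value to [low, high] and rescale it to [0, 1]."""
    return (min(max(value, low), high) - low) / (high - low)

def quality_score(r):
    # ASSUMPTION: equal-weight mean of rescaled metrics, not the paper's formula.
    speech_ratio = min(r.speech_len / r.audio_len, 1.0) if r.audio_len > 0 else 0.0
    parts = [
        _unit(r.avg_snr, 0.0, 40.0),
        _unit(r.num_spk, 1, 10),
        speech_ratio,
        _unit(r.eld_score, 0.0, 1.0),
        _unit(r.avg_word_conf, 0.0, 1.0),
        _unit(r.wrd_cnt, 0, 150),
    ]
    return sum(parts) / len(parts)

def rank(recordings):
    """Best candidates for human transcription first."""
    return sorted(recordings, key=quality_score, reverse=True)
```

In practice, the top of the ranked list would be the shortlist sent to annotators; how many recordings to keep is a separate decision not modeled here.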
A breakdown of the outputs of these steps for a single day is given in Figure 4. For instance, ∼0.6 h of data are selected for gold transcriptions from an initial 26 h pool of audio data. We believe this is a robust quality scoring method because it gathers information from different systems, e.g., ASR, SNR, and ELD estimation. A day-to-day estimation of the output of each of these steps is available on the SpokenData website: https://www.spokendata.com/atc (accessed on 10 October 2023).

Runtime Characteristics
We also measured the running time of the individual components of our processing pipeline. In Table 3, we list the relative time spent by each module. ASR and speaker diarization together account for 65% of the overall processing time. This disparity compared to other modules is due to the fact that both are AI-powered modules, which, in principle, need more processing time. Other important parts are preprocessing, voice activity detection (VAD) segmentation, and ELD. Audio data preprocessing involves obtaining data, demodulation by software radio, segmental gain control, detecting the media format, and plotting the waveform. A key metric is the real-time factor of the whole pipeline. The real-time factor is the ratio of "processing time" over "length of the audio". Our processing pipeline has a real-time factor of 4.47. In other words, the processing is computationally demanding. For an average five-second-long recording, the processing time is 22 s. The actual running times of each component for the "average" five-second-long recording are shown in Table 3.
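As a quick check of these figures, the definition of the real-time factor can be applied directly to the average five-second recording:

```latex
\mathrm{RTF} = \frac{T_{\text{processing}}}{T_{\text{audio}}}
\quad\Longrightarrow\quad
T_{\text{processing}} = \mathrm{RTF} \times T_{\text{audio}}
= 4.47 \times 5\,\mathrm{s} = 22.35\,\mathrm{s} \approx 22\,\mathrm{s}.
```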

Collection Platform and Community of Volunteers
In this section, we summarize the data collection and distribution. In addition, a short description of the roles involved in data processing is provided. We also cover some high-level statistics about the collected data. First, data are captured and fed into the OpenSky Network by volunteers who operate their own receiver equipment (see Figure 5). These individuals are often aviation enthusiasts with previous operational experience, or people with an interest in aviation technology, e.g., conducting domain-related research. However, anyone, even with little to no background in aviation or technology, can become a feeder. To become a feeder, one must have an internet connection and access to a VHF receiver. An affordable low-complexity setup is covered in the ATCO2 corpora paper [7], and the guide for setting it up is provided at https://ui.atc.opensky-network.org/set-up (accessed on 10 October 2023). It is important to recall that in some countries, it is prohibited by law to record air traffic management (ATM) data. Readers interested in the legal aspects are directed to the section on legal and privacy aspects of the collection of ATC recordings in [7].

The Platform
The high-level architecture is given in Figure 6. As one can observe, the platform has been divided into three distinct groups: (i) feeder equipment, (ii) back-end, and (iii) front-end. The architecture was decided during the design phase of ATCO2, with the main objective of achieving scalability of the entire system. That means keeping the complexity relatively low within all the groups, which allows it to [...] A more detailed overview of the OSN platform is given in Appendix B. Below, we describe each of the platform's groups.
Feeder equipment: the main task of the feeder equipment is to capture the conversation between the pilot and the ATCo and feed the data, together with some relevant metadata, to the back-end. For the recording part, we recommend using RTLSDR-Airband together with an RTLSDR dongle. RTLSDR is a set of tools that enables USB dongles based on the Realtek RTL2832U chipset to be used as cheap software-defined radios, given that the chip allows transferring raw I/Q samples from the tuner straight to the host device (see further documentation at https://osmocom.org/projects/rtl-sdr/wiki/Rtl-sdr; accessed on 10 October 2023). This is an affordable and widely used combination within the aviation enthusiast community for this exact purpose: to capture and stream ATC voice.
The feeder software is responsible for transmitting the recordings from the receiver to the remote server. It is a rather simple piece of software that monitors the output directory of RTLSDR-Airband and transfers any new data it finds to the back-end over a gRPC (a remote procedure call framework) connection. Because the feeder software only looks for specific types of data in the output folder, the feeder is free to choose any other software for capturing and storing the voice data. Care must be taken to ensure that the output is suitable for the feeder software. A simple, step-by-step guide is provided to simplify the setup process. It can be found at https://ui.atc.opensky-network.org/set-up (accessed on 10 October 2023).
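The monitoring behavior described above can be sketched as follows. This is not the actual feeder software: the function names and the `.raw` file suffix are hypothetical, and the `upload` callback merely stands in for the gRPC transfer to the back-end.

```python
from pathlib import Path

def find_new_recordings(output_dir, seen, suffix=".raw"):
    """Return files in output_dir with the given suffix that were not sent yet."""
    return sorted(
        p for p in Path(output_dir).iterdir()
        if p.is_file() and p.suffix == suffix and p.name not in seen
    )

def feed_once(output_dir, seen, upload):
    """Upload every new recording once and remember it; returns the count."""
    sent = 0
    for path in find_new_recordings(output_dir, seen):
        upload(path)         # placeholder for the gRPC call to the back-end
        seen.add(path.name)  # never send the same file twice
        sent += 1
    return sent
```

A feeder process would call `feed_once` periodically (for example, in a loop with a short sleep). Filtering by suffix is what makes the capture software interchangeable: anything that writes compatible files into the directory can be used.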
Back-end: the main tasks of the back-end are (i) to store recordings, transcripts, and any other relevant metadata, and (ii) to provide interfaces for external users. The external users here are data feeders, transcription service providers, data users, or any other parties contributing to the dataset or making use of it. The back-end is deployed on Kubernetes, an open-source container orchestration system. As one can observe from Figure 6, there are several processing layers involved. These layers are as follows:
• Ingestion API: receives recording segments and metadata and queues them for processing in Kafka/S3-compatible object storage;
• Aggregation layer: converts raw data to FLAC audio, stores metadata, and triggers transcription using Kafka Streams, S3, and the Serving API;
• Serving API: provides external interfaces to consume metadata and to store and consume transcripts and statistics;
• Scheduled jobs: run processes that are not part of the streaming process, like statistics aggregation and data housekeeping.
Interfacing with the back-end is performed through an API, which is documented at https://api.atc.opensky-network.org/q/swagger-ui (accessed on 10 October 2023). In order to access the back-end and make use of the available APIs, one needs to register at https://auth.opensky-network.org/auth/ (accessed on 10 October 2023), contact the OpenSky Network (contact@opensky-network.org), and give a short description of what the access is needed for.
Front-end: the front-end is a web page (https://ui.atc.opensky-network.org/; accessed on 10 October 2023) that provides access to public statistics, links to documentation, e.g., the API documentation, and external web pages, e.g., the SpokenData transcription service. In addition, this is the place where a user can set up their receivers, see statistics about receiver performance, and so on.
Statistics: since the public opening of the service (5 March 2023), the ATCO2 project has recorded speech from 24 different airports in 14 different countries. Figures 7 and 8 show the names of the countries and airports, together with the corresponding recording lengths. Please note that only the airports/areas with a recording length ≥ 1 h are included; likewise, only countries with more than 1 h of recordings are shown. This also applies to the ATCO2 corpora released through ELDA. Note that some countries (e.g., United States) were not part of the official release of the ATCO2 corpora (see Table 2). Still, they are currently being collected in the OSN platform.

Data Annotators
Apart from the data feeders, there is another type of volunteer who has contributed to the project and will continue to contribute in the future: the "annotators". The data annotators are volunteers who write down the transcripts of the ATC voice communications, including assigning speakers and annotating named entities, i.e., callsigns and commands. For the ATCO2 project, we relied on both volunteers and paid transcribers. Our data processing pipeline (as seen in Figure 3) generates transcripts and NLP tags for each communication. By generating transcriptions with AI tools, we are able to speed up the overall transcription process. Readers interested in becoming annotators can create an account on the SpokenData transcription platform: http://www.spokendata.com/atco2 (accessed on 10 October 2023). The human-transcribed data amount to a four-hour test set, i.e., the ATCO2-test-set-4h corpus. The data annotators are the final actors involved in the transcription step, as shown in Figure 4.

Technologies
In this section, we cover the main tools developed with the ATCO2 corpora. We also list some potential topics that can be explored with them. Note that the ATCO2 corpora are not limited to the fields covered in this paper, e.g., ASR or NLP, but can also be used for text-to-speech (TTS), which is in some sense the inverse of ASR. We expect the community to build on top of ATCO2 to foster and advance speech and text-based technologies for ATC.

Automatic Speech Recognition
One of the principal components of the ATCO2 project is a strong ASR system, used to provide high-quality automatic transcriptions of the collected ATC data. An ASR system is trained to predict the best text transcription for the input acoustic signal. Formally, ASR aims to find the most probable output sequence of words, out of all possible word combinations (or sentences) in a language, given a noisy acoustic observation sequence. End-to-end ASR models learn a direct mapping from speech S to the output text W:

$$W^{*} = \underset{W \in V^{*}}{\operatorname{argmax}}\; p(W \mid S)$$

The hybrid (conventional) ASR systems combine three separately trained models: an acoustic model (AM), a pronunciation model, and a language model (LM). The model calculates the conditional probability p(W|S), where W is a sequence of words (W = w_1, ..., w_n), S is a sequence of input feature vectors representing the acoustic observations (S = s_1, ..., s_t), and V is the vocabulary of all possible words [30,31] or subwords [32], as shown in Equation (4):

$$W^{*} = \underset{W \in V^{*}}{\operatorname{argmax}}\; p(S \mid W)\, p(W) = \underset{W \in V^{*}}{\operatorname{argmax}} \sum_{P} p(S \mid P)\, p(P \mid W)\, p(W) \qquad (4)$$

where P ranges over pronunciation sequences (the acoustics are assumed to depend on W only through P), p(S|P) is the AM, p(P|W) is the pronunciation model, and p(W) is the LM; we use V* to represent the collection of all word sequences formed by words in V. One of the advantages of conventional pipeline models is a more transparent optimization of the objective function [33]. Moreover, the LM is trained with unpaired text data and can be easily adapted to a specific domain. This gives conventional models more flexibility and makes them convenient for use in industrial projects, such as ATC.
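The decomposition into an acoustic model and a language model can be illustrated with a toy decoder that picks, among a few candidate word sequences, the one maximizing the combined score in the log domain. This is a sketch only: the candidate list and probabilities are invented, the pronunciation model is folded into the acoustic score, and real decoders search a graph rather than enumerate sentences.

```python
import math

def decode(candidates, lm_weight=1.0):
    """Return the word sequence maximising log p(S|W) + lm_weight * log p(W)."""
    best = max(candidates, key=lambda c: c["am_logp"] + lm_weight * c["lm_logp"])
    return best["words"]

# Invented scores: the second hypothesis fits the audio slightly better,
# but the language model considers it far less likely.
candidates = [
    {"words": "descend two thousand",
     "am_logp": math.log(0.30), "lm_logp": math.log(0.20)},
    {"words": "descent to thousand",
     "am_logp": math.log(0.35), "lm_logp": math.log(0.01)},
]
```

With the language model active, `decode(candidates)` returns "descend two thousand"; with `lm_weight=0.0`, the acoustically closer "descent to thousand" wins. This is the mechanism by which an in-domain LM helps in ATC.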

Training Data Configuration
To measure the effectiveness of using automatically transcribed data (ATCO2-T set) versus using fully supervised gold transcriptions, we defined three training scenarios.

Conventional ASR
To obtain automatic transcriptions of the best possible quality for the ATCO2 corpora audio, we use a strong hybrid model trained on ATC data only. We train a hybrid model for each of the scenarios described above. For scenario (a), an AM was built on all available datasets (190 h), which amount to 573 h of data after speech augmentation. The model dictionary consists of 30,832 words coming from diverse sources. This includes (i) a list of airline designators for callsigns taken from Wikipedia: https://en.wikipedia.org/wiki/List_of_airline_codes (accessed on 10 October 2023); (ii) all five-letter waypoint names in Europe retrieved from the Traffic project, see https://pypi.org/project/traffic/ (accessed on 10 October 2023); (iii) additional words, such as countries, cities, airport names, airplane models and brands, and some ATC acronyms. For training the acoustic model, we use the Kaldi toolkit [29]. The system follows the standard Kaldi recipe, which uses MFCC and i-vector features [34] with time-delay neural networks (TDNN) [35,36]. The standard chain training is based on lattice-free maximum mutual information (LF-MMI) [37], which includes threefold speed perturbation and one-third frame subsampling. The acoustic model is a CNN-TDNNF [38], which comprises a convolutional network and a factorized TDNN. The LM is a 3-gram model trained on the same data as the acoustic model, with additional text data coming from public resources such as airline names, airports, the ICAO alphabet, and waypoints in Europe.
Results and analysis: the results are presented in Table 4. We compared three models trained with the same conventional CNN-TDNNF architecture but on different data: scenarios (a), (b), and (c) (see Section 5.1.1). Model (a) in Table 4 is trained on supervised data that are "out-of-domain" with respect to ATCO2. Models (b) and (c) are trained on the "in-domain" ATCO2 data and differ only in the size of the training set: 500 h vs. 2500 h. We can see that training on completely unsupervised data yields good performance in comparison to (a). Increasing the size of the unsupervised data from 500 h to 2500 h, however, brings only a small improvement: the WER goes from 18.1% to 17.9% and from 25.1% to 24.9% for ATCO2-1h and ATCO2-4h, respectively.
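For readers less familiar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) between hypothesis and reference, divided by the number of reference words. A minimal sketch is given below; it is not the scoring tool used for Table 4, and it applies no text normalization.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

An "absolute" improvement is simply the difference between two such values; for example, the change from 18.1% to 17.9% reported above is a 0.2% absolute reduction.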
Our main hypothesis is that ATCO2 test sets contain higher levels of noise compared to the audio data present in (a), i.e., mainly clean data from ATCos. Moreover, ATCO2 test sets also contain speech from pilots collected via VHF receivers, which in turn degrades the SNR levels, i.e., reduced audio quality. Hence, when the system is trained on "clean data", i.e., scenario (a), and later tested on ATCO2, it creates a large train-test mismatch. Yet, when we use ATCO2 training data, scenario (b) or scenario (c), this mismatch is reduced, and therefore we obtain substantially better results.

End-to-End ASR
Besides hybrid-based ASR, there exists another paradigm for performing ASR [39], named end-to-end (E2E) ASR [40]. Here, we aim at directly transcribing speech to text without requiring alignments between input features and output words or characters (i.e., the standard procedure in hybrid-based ASR); see Equation (4). Recent work on encoder-decoder ASR has shown that this step can be removed [41]. E2E ASR can be divided into connectionist temporal classification (CTC)-based [42], attention-based encoder-decoder [43], or hybrid [44] modeling. Previous work based on self-supervised learning [45] for ASR includes Wav2Vec 2.0 [46], vq-Wav2Vec [47], and, most recently, WavLM [48] and multilingual XLS-R [49] models. E2E ASR aims at reducing the expert knowledge needed. This makes the overall ASR development simpler; thus, it could have a significant impact on ATC [50]. This work focuses on data novelty (including their collection and preparation) rather than investigating (i) different E2E architectures for ASR, e.g., Conformer [51], HyperConformer [52], Conmer [53], or BranchFormer [54]; or (ii) toolkits for E2E ASR such as SpeechBrain [55], ESPnet [56], NeMo [57], or WeNet [58]. Therefore, we leave these lines of research for future work.

Callsign Boosting
To further improve the prediction made by an ASR system, along with speech input, one can use other information available from context. For the ATC domain, such context information may be the data received from radar. At every moment, radar registers the aircraft that are currently in the airspace, listing the unique identifiers of those aircraft, i.e., "callsigns". With the radar data, we know exactly which callsigns are especially likely to appear in the conversation. This knowledge allows us to bias the system outputs towards these registered callsigns and to increase the probability that they are recognized correctly. A callsign is typically a sequence of an ICAO airline identifier, letters, and digits, which in speech turns into a sequence of words. In ASR, the target sequences of words can be boosted during decoding with a WFST (weighted finite state transducer) by adjusting the weights in the prediction graphs, called "lattices". The rescoring technique with WFST was proposed earlier and applied for biasing towards users' playlists [59], contact names [60], and named entities [61]. Recently, a similar biasing approach has proved to be useful in improving callsign recognition [9]. The rescoring of lattices is performed with the finite state transducer (FST) operation of composition between lattices produced by an ASR system and an FST created with the target transcript and discount weights (Equation (5)):

$$\mathrm{Lattice}' = \mathrm{Lattice} \circ \mathrm{biasing\_FST} \qquad (5)$$

Biasing the lattice toward the context callsigns usually allows us to considerably improve their recognition in the final outputs (Table 5). The results of different experiments on the ATC data proved that applying the lattice rescoring method on top of ASR predictions leads to higher accuracy of automatic transcriptions, above all for callsigns [14]. Therefore, lattice rescoring was used for all transcriptions of the ATCO2 data.
Results and analysis: in Table 5, we report the results for the out-of-domain (ATC supervised) and in-domain (ATCO2-500 h/2500 h) ATC models. Both acoustic models are trained with the CNN-TDNNF architecture following the standard Kaldi recipe, as described in Section 5.2. The results are reported with three metrics: WER (word error rate), Call-WER (WER calculated on the sequence of n-grams that correspond to callsigns only), and ACC (accuracy).
To rescore a decoding lattice according to the current context, we perform the following steps: (1) we receive all the callsigns registered by the radar at the current timestamp in the ICAO format; (2) we expand the ICAO callsigns to word sequences to include all possible callsign variations, i.e., ways this callsign can be spoken; (3) we use the expanded callsigns to bias the decoding lattice towards the current context. See our previous work [15] for more details on callsign verbalization.
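Step (2) above can be illustrated with a small sketch that expands an ICAO-format callsign into candidate word sequences. This is a simplified illustration under stated assumptions, not the verbalization rules of [15]: the two-entry `AIRLINE_TELEPHONY` table, the function names, and the particular set of variants (full form, airline plus last two characters, fully spelled form) are all hypothetical choices.

```python
import re

# Hypothetical, tiny lookup table; a real system would load the full list
# of airline designators and their telephony names.
AIRLINE_TELEPHONY = {"CSN": "china southern", "DLH": "lufthansa"}

DIGITS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}
LETTERS = {
    "A": "alfa", "B": "bravo", "C": "charlie", "D": "delta", "E": "echo",
    "F": "foxtrot", "G": "golf", "H": "hotel", "I": "india", "J": "juliett",
    "K": "kilo", "L": "lima", "M": "mike", "N": "november", "O": "oscar",
    "P": "papa", "Q": "quebec", "R": "romeo", "S": "sierra", "T": "tango",
    "U": "uniform", "V": "victor", "W": "whiskey", "X": "xray",
    "Y": "yankee", "Z": "zulu",
}


def spell(characters: str) -> list[str]:
    """Spell digits and letters one by one with the ICAO alphabet."""
    return [DIGITS[c] if c.isdigit() else LETTERS[c] for c in characters]


def expand_callsign(icao_callsign: str) -> list[str]:
    """Return candidate spoken forms for a callsign such as 'CSN325'."""
    match = re.fullmatch(r"([A-Z]{3})([A-Z0-9]+)", icao_callsign)
    if match is None:
        return []
    airline, suffix = match.groups()
    spelled_suffix = spell(suffix)
    variants = []
    telephony = AIRLINE_TELEPHONY.get(airline)
    if telephony is not None:
        # Full form: telephony designator followed by the spelled suffix.
        variants.append(" ".join([telephony] + spelled_suffix))
        # Assumed shortened form: designator plus the last two characters.
        if len(spelled_suffix) > 2:
            variants.append(" ".join([telephony] + spelled_suffix[-2:]))
    # Fallback form: the airline code itself spelled letter by letter.
    variants.append(" ".join(spell(airline) + spelled_suffix))
    return variants


variants = expand_callsign("CSN325")
```

Each returned variant is one word sequence that could be added to the biasing FST in step (3).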
Biasing multiple callsigns registered by the radar, compared to biasing only a ground truth (GT) callsign, can be used in a real-life scenario and with real-time ASR. To allow it, a new contextual FST with expanded callsigns is generated on the fly every time new data arrive from radar. The results of biasing a GT callsign are given in Table 5 to illustrate the oracle performance of the biasing method. Overall, decoding with n-gram biasing always helps to achieve better performance, especially for callsigns, with a relative improvement of 15.0% and 12.8% for callsign recognition and of 3.4% and 2.4% for the entire utterance on the ATCO2-test-set-1h and ATCO2-test-set-4h test sets, respectively.
The size of the biasing FST depends on the number of callsigns and their variations that we want to boost. Too many callsigns may decrease the effectiveness of the biasing method, because the more incorrect callsigns are boosted, the less prominent the correct sequence becomes. Previous results show that the optimal size of the biasing FST highly depends on the data, but generally, the performance begins to degrade when the number of biased word sequences exceeds 1000 [62]. For our experiments, we have, on average, 214 biased callsign variations per utterance in the ATCO2-test-set-4h corpus and 140 biased callsign variations per utterance in the ATCO2-test-set-1h corpus.
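The effect of biasing can be illustrated on an n-best list rather than on a lattice. The sketch below is a deliberately simplified stand-in for the WFST composition described in this section: it subtracts a fixed discount from the cost of every hypothesis that contains a boosted word sequence and re-ranks. The function name `bias_nbest`, the cost values, and the discount are assumptions chosen for illustration only.

```python
def bias_nbest(nbest, boosted_sequences, discount=1.0):
    """Re-rank (text, cost) hypotheses; a lower cost is better.

    Every boosted word sequence found in a hypothesis lowers its cost by
    `discount`, mimicking the discount weights of a biasing FST.
    """
    rescored = []
    for text, cost in nbest:
        padded = f" {text} "  # pad so that matches respect word boundaries
        bonus = sum(
            discount for seq in boosted_sequences if f" {seq} " in padded
        )
        rescored.append((text, cost - bonus))
    return sorted(rescored, key=lambda item: item[1])


# Toy example: the acoustically preferred hypothesis contains a misrecognized
# callsign ("three to five"); the radar context lists "three two five".
nbest = [
    ("china southern three to five cleared to land", 10.0),
    ("china southern three two five cleared to land", 10.5),
]
context = ["china southern three two five", "lufthansa four one alfa"]
reranked = bias_nbest(nbest, context, discount=1.0)
```

The same toy setup also shows why very large context lists can hurt: every additional boosted sequence is another chance for an incorrect hypothesis to receive the discount.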

Natural Language Understanding of Air Traffic Control Communications
Natural language understanding (NLU) is a subfield of NLP that focuses on the ability to understand and interpret human language. NLU involves the development of algorithms and models that can extract meaning and intent from text and/or spoken communication. It comprises several subtasks, including (i) named entity recognition [63], which aims at identifying entities in text, such as people, places, and organizations [64]; (ii) part-of-speech (POS) tagging, identifying the grammatical role of each word in a sentence [65], similar to sequence classification (see Section 5.3.2); (iii) sentiment analysis, identifying the emotional tone of a piece of text [66]; (iv) relationship extraction, identifying the relationships between entities in text [67]; (v) question answering, understanding and answering natural language questions [68]. The following subsections cover each of the proposed NLU submodules that can be developed with ATCO2 corpora, like the ones presented in Figure 9.
Figure 9. Main automatic speech recognition and understanding tasks that can be achieved with the ATCO2 corpora. ELD: English language detection; NLU: natural language understanding, e.g., callsign highlighting; SPKRoleID: speaker role identification; RBED: read-back error detection.

Named Entity Recognition for Air Traffic Control Communications
In ATC communications, NLU can be used to automatically analyze and interpret the meaning of spoken messages between pilots and ATCos, which can aid ATCos in downstream tasks, such as assisting in identifying emergency situations and other critical events. NLU can help to extract important information, such as flight numbers, callsigns, or airport codes, which in turn can help ATCos to manage traffic more efficiently.
Overall, the use of NLU in ATC helps improve communication accuracy and efficiency, aids in the reduction of ATCos' workload by prefilling aircraft radar labels, and provides valuable data for analysis and decision making. In this work, one of the main tasks is to understand and extract high-level information within ATC communication. Therefore, we develop an NER system tasked to extract this information, as depicted in Figure 10a. For instance, consider the following transcribed communication (taken from Figure 1):
ASR transcript: runway three four left cleared to land china southern three two five,
which would be converted to a high-level entity format by the NER system:
Output: <value> runway three four left </value> <command> cleared to land </command> <callsign> china southern three two five </callsign>.
In this work, we developed two systems based on transformers [69] to extract and tag this information from ATC communications, i.e., a pretrained BERT [70] model and RoBERTa [71] model.
Experimental setup: we fine-tune pretrained BERT and RoBERTa models on the NER task, as shown in Figure 9. We employ the pretrained version of BERT-base-uncased [70] with 110M parameters, URL: https://huggingface.co/bert-base-uncased (accessed on 10 October 2023). We also employ the pretrained version of RoBERTa-base [71], which is composed of 123M parameters, URL: https://huggingface.co/roberta-base (accessed on 10 October 2023). We download the pretrained models from HuggingFace [72,73]. For training, we use the full ATCO2-test-set-4h, which contains ∼3k sentences. In this dataset, each word is annotated together with a predefined class, as follows: callsign, command, values, and UNK (everything else). In order to fine-tune the model, we append a layer on top of the BERT model by using a feedforward network with a dimension of 8 (we define two outputs per class, see the class structures in Section 3.3 of ref. [8] and in [15]). Due to the lack of gold transcriptions, we perform a fivefold cross-validation scheme to avoid overfitting. The reader interested in developing their own NER system for ATC is redirected to the open-source GitHub repository of the ATCO2 corpora (GitHub repository: https://github.com/idiap/atco2-corpus; accessed on 10 October 2023). We fine-tune each model on an NVIDIA GeForce RTX 3090 for ∼10 k steps. During experimentation, we use a linear learning rate scheduler with an initial learning rate of γ = 5 × 10^{-5}, dropout [74] of dp = 0.1, and the GELU (Gaussian error linear unit) activation function [75]. We also employ gradient norm clipping [76] for regularization and AdamW as optimizer [77]. Each model during the cross-validation scheme uses an effective batch size of 32.
Evaluation metric: we evaluate both the BERT and RoBERTa NER systems with a binary classification metric, the F-score. Particularly, the F1-score, defined in Equation (8), represents the harmonic mean of precision and recall. Recall, as defined in Equation (7), is the ratio of true positive (TP) results to all samples that should have been identified as positive (including false negatives (FN)). Precision, as described in Equation (6), is the ratio of TP results to all positive results (including false positives (FP)):

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (6)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (7)$$

$$F1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (8)$$

Results and analysis: the NER system's performance is evaluated on the ATCO2-test-set-4h corpus using a fivefold cross-validation scheme, with five fine-tuning runs using different training seeds. Table 6 presents the performance metrics for the callsign, command, and values classes of two transformer-based [69] models, namely, BERT-base and RoBERTa-base. Although both models achieve similar F1-scores, we provide analysis for the BERT-based NER system, which achieves an F1-score of over 97% for the callsign class, while the command and values classes lag behind with F1-scores of 81.9% and 87.1%, respectively. We hypothesize that the command class contains higher complexity when compared to the other two classes, values and callsigns. Values are mostly composed of defined keywords (e.g., flight level) followed by cardinal numbers (e.g., "one hundred"), while callsigns follow a well-defined structure of airline designators and a set of numbers or letters spoken in ICAO format [25]. These characteristics make it easier for the NER system to correctly detect them.
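As a worked illustration of the precision, recall, and F1 metrics used above, the sketch below scores one class at the word level. It is a minimal example under assumptions: the function name `precision_recall_f1`, the word-level (rather than entity-level) counting, and the toy prediction, in which "cleared" is mislabeled as a value, are illustrative and do not reproduce the evaluation script behind Table 6.

```python
def precision_recall_f1(reference, predicted, target):
    """Per-class precision, recall, and F1 from two aligned label lists."""
    if len(reference) != len(predicted):
        raise ValueError("label sequences must have the same length")
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == target and p == target)
    fp = sum(1 for r, p in pairs if r != target and p == target)
    fn = sum(1 for r, p in pairs if r == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# "runway three four left cleared to land china southern three two five"
reference = ["value"] * 4 + ["command"] * 3 + ["callsign"] * 5
predicted = list(reference)
predicted[4] = "value"  # toy error: "cleared" tagged as value, not command

command_scores = precision_recall_f1(reference, predicted, "command")
value_scores = precision_recall_f1(reference, predicted, "value")
```

In this toy case the single error lowers recall for the command class and precision for the value class, which is why per-class scores are reported separately.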
One potential method for increasing the performance of the NER system for the command and values classes is to incorporate plausible commands and values in real time, depending on the situation of the surveillance data. This can be achieved using the boosting technique, as described in Section 5.2.2. Although the results with boosting callsigns are reported in Table 5, further investigation is needed to assess the impact of boosting on the command and values classes.
Table 6. Different performance metrics for callsign, command, and values classes of the NER system. Results are averaged over a fivefold cross-validation scheme on the ATCO2-test-set-4h corpus in order to mitigate overfitting. We run five-times fine-tuning with different training seeds (2222/3333/4444/5555/6666). Results are reported on two transformer-based models. @P, @R, and @F1 refer to precision, recall, and F1-score, respectively.

Model | Callsign (@P, @R, @F1) | Command (@P, @R, @F1) | Values (@P, @R, @F1)
Bert-base | 97.

Sequence classification (SC) is a type of machine learning (ML) task that involves assigning a label or a category to a sequence of data points [78,79]. The data points in the sequence can be of various types, such as text, audio, or numerical data, and the label assigned to the sequence can also be of different types, such as binary (e.g., positive or negative sentiment [66]) or multiclass. Sequence classification can also be used to automatically classify ATC communication sequences into various categories. This technique can be applied to both audio and text data, making it a versatile tool to provide a high-level understanding of the communication at hand.
In scenarios where only a monaural communication channel exists, it can be challenging to recognize the identity of the speaker. Hence, it is especially important to distinguish between the ATCo and the pilot in the target communications. As a potential solution, we propose an alternative approach that utilizes a speaker role detection (SRD) system based on SC. The system receives text as input and returns the category of the communication, i.e., whether it was uttered by the ATCo or by the pilot. In recent years, there has been a growing interest in using deep learning techniques, such as transformer-based models [69], to improve the performance of SC for SRD in ATC communications. Here, we compare three such models: (i) BERT [70], (ii) RoBERTa [71], and (iii) DeBERTa [80]. These models have been shown to achieve state-of-the-art performance on a wide range of sequence classification tasks, including SRD for ATC. The proposed SRD is illustrated in Figure 10b.
Overall, the SRD and speaker diarization (see Section 5.3.3) tasks can leverage the fact that ATC dialogues follow a well-defined lexicon and dictionary with simple grammar. This standard phraseology has been defined by the ICAO [25] for ATCos. The main idea is to guarantee safety and reduce miscommunications between the ATCos and pilots. Therefore, previous work has shown the potential of performing SRD in an E2E manner on the text level, as presented here (see [8,81]).
Experimental setup: the SRD system is built on top of pretrained models (BERT [70], RoBERTa [71], and DeBERTa [80]), which are downloaded from HuggingFace [72,73]. Here, the experimental setup is exactly the same as the one described for the NER system, including the training hyperparameters. For further details, we redirect the reader to Section 5.3.1. Still, the SRD model is fine-tuned on the SC rather than on the NER task. Further, we define an output layer with two units (classes): one for ATCo and one for pilot.
Results and analysis: we evaluate the SRD system on the ATCO2-test-set-4h corpus. Differently from the NER system, here, we have access to two training corpora: (1) the Air Traffic Control Corpus (LDC-ATCC), see URL: https://catalog.ldc.upenn.edu/LDC94S14A (accessed on 10 October 2023). It consists of audio recordings collected for ASR research on air traffic control communications. We use the metadata along with the transcripts to perform research on NLU for ATC, i.e., speaker role detection. The data files are sampled at 8 kHz, 16 bit linear, with continuous monitoring and without squelch or silence elimination.
(2) The UWB-ATCC corpus by the University of West Bohemia, which can be downloaded for free at the following URL: https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0001-CCA1-0 (accessed on 10 October 2023). The UWB-ATCC corpus contains recordings of air traffic control communication. The speech is manually transcribed with the speaker information; thus, it can be used for speaker role detection. We evaluate the SRD under two considerations: (i) ablations of different pretrained models for SRD on ATC communications, and (ii) low-resource and incremental training scenarios.
(i) Analysis of the Impact of Pretrained Models and Training Data Type. In this scenario, we evaluate the impact of pretrained models and training data on the SRD task for ATC data. To this end, we compare the performance of three transformer-based [69] models, namely BERT, RoBERTa, and DeBERTa-V3, trained on two different corpora, LDC-ATCC and UWB-ATCC, and evaluate them on the ATCO2-test-set-4h corpus. The F1-scores for SRD are reported separately for ATCo and pilot speakers in Table 7. Our results show that all the models achieve comparable F1-scores, ranging from 87% to 88% for ATCo and from 84% to 85% for pilots. These findings suggest that the SRD task for ATC data is not significantly sensitive to the choice of pretrained model. However, we observe that models trained on UWB-ATCC outperform those trained on LDC-ATCC, with up to 4% absolute improvement in F1-scores. For instance, switching the training data of the BERT model from LDC-ATCC to UWB-ATCC raises the F1-score from 82.4% to 86.2% for ATCo and from 79.2% to 83.2% for pilot. Additionally, we find that combining both datasets leads to a 1% absolute improvement in F1-scores. Overall, our study highlights the importance of selecting appropriate training data for the SRD task in ATC data and suggests that using multiple datasets can lead to improved performance, whereas the choice of pretrained model has a relatively minor impact.
Table 7. ATCO/PILOT F1-scores for speaker role identification based on full ATC utterances for the ATCO2-test-set-4h test set. Each utterance represents one sample. Metrics reported with three different transformer-based models (BERT [70], RoBERTa [71], DeBERTa-V3 [80]). All models are the "base" version, e.g., bert-base. Numbers in bold refer to the top performance per split, i.e., ATCO or PILOT. Results are averaged over a fivefold cross-validation scheme on the ATCO2-test-set-4h corpus in order to mitigate overfitting.
(ii) Analysis of the Impact of Data Quantity on Speaker Role Detection. In this study, we aim to evaluate the impact of the number of text samples on the performance of SRD. The results of this analysis are illustrated in the left panel of Figure 11, where the F1-score on the ATCO2-test-set-4h is plotted against the number of samples in a logarithmic scale on the x-axis. Interestingly, we found that as few as 100 samples are sufficient to achieve a reasonably good F1-score of 60% on SRD. Notably, the UWB-ATCC appears to be more informative for the BERT model, which achieves an F1-score of 71% with only 100 training samples. Increasing the training data to 1000 samples further improves the performance, resulting in F1-scores near 80% (LDC-ATCC + UWB-ATCC). These findings are significant, considering that the gold transcription of ATC communications is generally expensive and time-consuming. In the right panel of Figure 11, we present a box plot that shows the variation of the BERT model's performance when fine-tuned on SRD with different training seeds. Each box represents the variation of the model between the ATCo and pilot subsets, over the fivefold cross-validation scheme. Overall, the results indicate that increasing the training data leads to better performance and more consistent results. These observations highlight the importance of selecting a suitable training set size for speaker role detection tasks.
Figure 11. Metrics are reported only on the ATCO2-test-set-1h corpus with a bert-base-uncased model trained with different datasets from Table 1. Left plot: ablation of the F1-score versus the number of samples used to train the system. Right plot: F1-score for models trained with different training seeds. The box plot depicts the performance variability when splitting the test set into ATCo and pilot subsets.

Text-Based Diarization
Beyond detecting the speaker role in a given ATC communication (i.e., SRD), there are cases where multiple segments end up in the same recording/communication. The task that solves this issue is known as speaker diarization (SD). SD answers the question "who spoke when?". Here, the system receives an audio signal or recording (or text, in our case) and detects the speaker changes or segmentation and the speaker role. The main parts of an SD system are (i) segmentation, (ii) embedding extraction, (iii) clustering, and (iv) labeling (similar to SRD). SD is normally performed on the acoustic level, and previous work based on mel filterbank slope and linear filterbank slope was covered in [82]. Speaker discriminative embeddings such as x-vectors are investigated in [83], and, more recently, a variational Bayesian hidden Markov model (VBx) was investigated in [84], which is the SD system used during the data collection stage of ATCO2 (see Section 3.2). State-of-the-art SD systems are based on the E2E paradigm, named E2E neural diarization (EEND) [85]. This approach was introduced in [86], where an SD model is trained jointly to perform extraction and clustering [87]. Here, differently from SRD, we only used the BERT [70] pretrained model.
Experimental setup: the SD system is built on top of a pretrained BERT model downloaded from HuggingFace [72,73]. The experimental setup is the same as for the NER and SRD systems, including the training hyperparameters. For further details, we redirect the reader to Section 5.3.1. The SD model is fine-tuned on the NER task, where each speaker role (ATCo or pilot) is a class. Therefore, we have two tags per class, accounting for four classes in total. Readers are directed to our paper on text-based SD presented at the 2022 IEEE Spoken Language Technology Workshop (SLT 2022), see [8].
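The "two tags per class" formulation can be made concrete with a small sketch that turns a recording containing several speaker segments into one word sequence with token-level tags. This is an illustration under assumptions: the B-/I- prefix naming, the `ATCO`/`PILOT` label strings, the function name `tag_segments`, and the example dialogue are hypothetical and may differ from the tag set used in [8].

```python
def tag_segments(segments):
    """Flatten (role, text) segments into parallel word and tag lists.

    The first word of each segment gets a "B-<role>" tag and the remaining
    words get "I-<role>", so two roles yield four distinct tags.
    """
    words, tags = [], []
    for role, text in segments:
        for position, word in enumerate(text.split()):
            words.append(word)
            prefix = "B-" if position == 0 else "I-"
            tags.append(prefix + role)
    return words, tags


# Hypothetical recording in which an instruction and its read-back were
# captured as one utterance.
recording = [
    ("ATCO", "lufthansa three two five descend flight level eight zero"),
    ("PILOT", "descend flight level eight zero lufthansa three two five"),
]
words, tags = tag_segments(recording)
```

A token classifier trained on such tags marks a speaker change wherever it predicts a B- tag, which is how segmentation and role labeling are obtained in a single pass over the text.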
Evaluation metric: to score the text-based SD system, we use the Jaccard error rate (JER) metric. JER is a recent metric for speaker diarization introduced in [88]. JER aims at avoiding the bias that the predominant speaker might cause, i.e., JER evaluates all speakers equally. The JER is defined in Equation (9):

$$\mathrm{JER} = 1 - \frac{1}{\#\mathrm{speakers}} \sum_{\mathrm{speaker}} \max_{\mathrm{cluster}} \frac{|\mathrm{speaker} \cap \mathrm{cluster}|}{|\mathrm{speaker} \cup \mathrm{cluster}|} \qquad (9)$$

where (i) speaker is the selected speaker from the reference and (ii) max cluster is the cluster from the system with maximum overlap duration with the currently selected speaker.
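A token-level reading of the JER definition above can be sketched as follows. The sketch assumes that "duration" is measured in tokens (since the SD system operates on text), that reference and system outputs are aligned per token, and that each reference speaker is paired with the system cluster of maximum overlap, as described in the text; the function name `jaccard_error_rate` and the toy labels are illustrative.

```python
def jaccard_error_rate(reference, hypothesis):
    """Token-level JER for aligned per-token speaker labels (0 is best)."""
    if len(reference) != len(hypothesis):
        raise ValueError("label sequences must have the same length")
    if not reference:
        raise ValueError("at least one token is required")

    def index_sets(labels):
        sets = {}
        for index, label in enumerate(labels):
            sets.setdefault(label, set()).add(index)
        return sets

    speakers = index_sets(reference)
    clusters = index_sets(hypothesis)
    total = 0.0
    for speaker_tokens in speakers.values():
        # Pick the system cluster with maximum overlap with this speaker.
        best = max(clusters.values(), key=lambda c: len(speaker_tokens & c))
        total += len(speaker_tokens & best) / len(speaker_tokens | best)
    return 1.0 - total / len(speakers)


# Toy example: the system places the speaker change one token too late.
reference = ["ATCO"] * 4 + ["PILOT"] * 4
hypothesis = ["A"] * 5 + ["B"] * 3
jer = jaccard_error_rate(reference, hypothesis)
```

Because the Jaccard index is averaged over speakers rather than over tokens, a short pilot turn weighs as much as a long ATCo turn, which is the property motivating the choice of JER.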
Results and analysis: we evaluate the SD system on the ATCO2-test-set-4h corpus. Differently from the NER system, but similar to SRD, here, we have access to two training corpora: the LDC-ATCC and UWB-ATCC datasets. We evaluate the SD under one consideration: (i) low-resource and incremental training scenario.
(i) Analysis of the Impact of Data Quantity on Text-based Speaker Diarization. In this study, we aim to evaluate the impact of the number of text samples on the performance of SD. The results of this analysis are illustrated in the left panel of Figure 12, where the JER (the lower the better) on the ATCO2-test-set-4h is plotted against the number of samples in a logarithmic scale on the x-axis. We found that as few as 100 samples are sufficient to achieve a JER score of 45.6% (LDC-UWB). Similar to SRD, the UWB-ATCC dataset seems to be more informative for the SD system. For instance, under the 1000 samples scenario, we noted a 5% absolute JER reduction if UWB-ATCC is used. Furthermore, increasing the training data to 10k samples improved the performance, resulting in JER scores close to 20% (LDC+UWB). A more appropriate comparison of text and acoustic-based SD for ATC communications can be found in our previous work [8]. Additionally, in the right panel of Figure 11, we present a box plot that shows the variation of the BERT-based SD model's performance when fine-tuned with different training seeds. Each box represents the variation of the model between the two proposed classes, ATCo and pilot, over the fivefold cross-validation scheme. The results are listed with F1-scores. Overall, we can conclude that the UWB-ATCC dataset is more informative for the SD model in comparison to the LDC-ATCC dataset.

Future Work Enabled by ATCO2
In this subsection, we discuss several research directions that can be explored with the ATCO2 corpora. We cover (i) end of communication detection (akin to VAD), (ii) read-back error detection, and (iii) English language detection.

End of Communication Detection
In the ATC domain, it is crucial to detect the end of communications. While push-to-talk (PTT) signals are commonly available in the ATC operations room or in the cockpit, there are cases where PTT is not available, and in such scenarios, the ATCO2 corpora can be leveraged to develop end-of-communication detection systems using either acoustic- or text-based approaches. Acoustic-based systems, known as VAD, perform their task prior to the ATC communication being sent to the ASR system [89], but may require the integration of a new independent module into the recognition pipeline. Text-based systems rely on strong artificial intelligence models like BERT, and previous studies in ATC [8] have shown their effectiveness in detecting callsigns [90], commands [7], and end-of-communication signals from transcripts generated by an ASR system.

Read-Back Error Detection
A pilot read-back happens when a pilot speaks back the relevant instructions initially uttered by the ATCo. In practice, the ATCo is listening and checking the conformity of each read-back. Therefore, it is important to have a procedure in place, e.g., a read-back error detection (RBED) system. Despite the infrequency of communication errors in ATC, they still have the potential to cause significant safety issues, with some transmissions containing multiple errors. The authors of [91] show that an error may occur in every hundredth ATC communication, while the authors of [92] show that an error may occur in every sixteenth communication. Detecting such errors still remains a challenge, as shown in recent work [93]. Although, in general, read-back errors are quite rare, preventing even one incident due to automatic RBED can make an important difference in ensuring ATM safety. To support ATCos in this task, previous projects employ ASRU engines to extract high-level information from ATC communications [5]. Previous work in [93] has proposed two approaches for performing RBED. One system is based on rules, while a second system is a data-driven sequence classifier based on a BERT-like pretrained encoder, named RoBERTa [71]. Here, the input sequence is a concatenation of ATCo and pilot utterance transcriptions with a special separator token [SEP] between them. They show that combining these approaches results in an 81% RBED rate in real-life voice recordings from Isavia's en-route airspace. They also cover a proof-of-concept trial with six ATCos producing challenging, artificial read-back error samples.
A main issue with well-known past projects, such as HAAWAII or MALORCA, is that their data cannot be publicly shared. In contrast, ATCO2 corpora are open to the public, e.g., the ATCO2-test-set-1h set can be accessed for free, and practitioners can follow previous research to implement an RBED module.

English Language Detection
Currently, we have developed and deployed a suitable English language detection (ELD) system to discard non-English utterances in newly collected data. We tested a state-of-the-art acoustic-based system with an x-vector extractor. We also propose an NLP approach that processes ASR output with word confidences for the ELD. Our experiments show that the ELD based on NLP is superior to the acoustic approach in both detection accuracy and computational resources. Moreover, the NLP approach can use outputs from several ASR systems jointly, which further improves the results. For the processing pipeline, we integrated the NLP-based English detector operating on Czech and English ASR. The integrated English detector consists of TF-IDF (term frequency-inverse document frequency) for reweighting the accumulated "soft" word counts and a logistic regression classifier to obtain the English/non-English decision [24].
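The first two stages of such a detector, accumulating "soft" word counts from ASR confidences and reweighting them with TF-IDF, can be sketched as below. This is a rough illustration, not the implementation of [24]: the function names, the smoothed IDF formula, the document-frequency numbers, and the example words are all assumptions, and the logistic regression classifier that would consume these features is omitted.

```python
import math
from collections import defaultdict


def soft_word_counts(asr_words):
    """Sum ASR confidences per word instead of counting hard occurrences."""
    counts = defaultdict(float)
    for word, confidence in asr_words:
        counts[word] += confidence
    return dict(counts)


def tfidf_reweight(counts, document_frequency, num_documents):
    """Scale soft counts by a smoothed inverse document frequency.

    Assumed formula: idf(w) = log((1 + N) / (1 + df(w))) + 1, so a word
    seen in every document keeps its original soft count.
    """
    weighted = {}
    for word, count in counts.items():
        df = document_frequency.get(word, 0)
        idf = math.log((1 + num_documents) / (1 + df)) + 1.0
        weighted[word] = count * idf
    return weighted


# Hypothetical ASR output as (word, confidence) pairs for one utterance.
utterance = [("descend", 0.9), ("descend", 0.6), ("dobry", 0.3)]
counts = soft_word_counts(utterance)
features = tfidf_reweight(
    counts, document_frequency={"descend": 100, "dobry": 4}, num_documents=100
)
```

Using confidences as counts means that a word the recognizer is unsure about contributes little evidence, which matters when English and Czech recognizers both decode the same code-mixed utterance.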
We created development and evaluation datasets consisting of data from various airports, data with various English accents, and code-mixing of English and local languages. The data are selected from our ATCO2 corpora introduced in Table 1. The development set is used to estimate the model parameters of our English language detector (the logistic regression classifier). The evaluation set is used for testing. The rules for manually transcribing the utterances are mentioned below. We found several interesting properties of the ATC data while listening to and tagging the ELD dataset:
• Various noise conditions. The majority of data are clean, but there are some very noisy segments;
• Strongly accented English. The speakers' English accent varies widely, from native speakers (pilots) to international accents (French, German, Russian, etc.) (pilots and ATCos) and strong Czech accents (pilots and ATCos);
• Mixed words and phrases. For example, the vocabulary of Czech ATCos is a mix of Czech and English words. They use standard greetings in Czech, which can be a significant portion of an "English" sentence if a command is short. On the other hand, they use many English words (alphabet, some commands) in "Czech" sentences. Moreover, they use a significant set of "Czenglish" words.
We use the language of spoken numerals as a rule of thumb to decide on the language of a particular ATCo-pilot communication utterance. The language has to be consistent within the audio recording. More detailed information, including experimental results, is covered in our previous work [24].

Conclusions
This paper expands upon our previous work [7] and discusses the main lessons learned from the ATCO2 project. The aim of the ATCO2 project was to develop a platform for collecting, preprocessing, and posterior ASR-based transcription generation of ATC communications audio data. With over 5000 h of ASR transcribed audio data, ATCO2 is the largest public ATC dataset to date, thus pushing the research boundaries on robust automatic speech recognition and natural language understanding of ATC communications. The main lessons learned from ATCO2 are sixfold, as follows:
• Lesson 1: ATCO2's automatic transcript engine (see Appendix B) and annotation platform (see Appendix C) have proven to be reliable (∼20% WER on ATCO2-test-set-4h) for collection of a large-scale audio dataset targeted to ATC communications;
• Lesson 2: Good transcription practices for ATC communications have been developed based on ontologies published by previous projects [5]. A cheat sheet (see Appendix E) has been created to provide guidance for future ATC projects and reduce confusion while generating transcripts;
• Lesson 3: The most demanding modules of the ATCO2 collection platform are the speaker diarization and automatic speech recognition engines, each accounting for ∼32% of the overall system processing time. The complete statistics regarding runtime are covered in Table 3. In ATCO2, we make these numbers public so they can be used as baselines in future work aimed at reducing the overall memory and runtime footprint of large-scale collection of ATC audio and radar data;
• Lesson 4: Training ASR systems purely on ATCO2 datasets (e.g., ATCO2-T 500h set corpus) can achieve competitive WERs on ATCO2 test sets (see Table 4). The ASR model can achieve WERs as low as 17.9%/24.9% on ATCO2-test-set-1h/ATCO2-test-set-4h, respectively. More importantly, these test sets contain noisy accented speech, which is highly challenging for standard ASR systems;
• Lesson 5: ATC surveillance data are an optimal source of real-time information to improve ASR outputs. The integration of air surveillance data can lead to up to 11.8% absolute callsign WER reduction, which represents an improvement of 20% absolute in callsign accuracy (62.6% without boosting → 82.9% with GT boosting) on ATCO2-test-set-4h, as shown in Table 5;
• Lesson 6: ATCO2 corpora can be used for natural language understanding of ATC communications. BERT-based NER and speaker role detection modules have been developed based on ATCO2-test-set-4h. These systems can detect callsigns, commands, and values from the textual inputs. Additionally, speaker roles can also be detected based on textual inputs. For instance, as few as 100 samples are sufficient to achieve a 60% F1-score on speaker role detection. Furthermore, the NLU task is of special interest to the ATC community because this high-level information can be used to assist ATCos in their daily tasks, thus reducing their overall workload.
In addition to these six lessons learned, this paper brings substantial improvements in automatic speech recognition and understanding for the ATC domain: Tables 5 and 6 show the current best-performing ASR and NLU engines developed on open-source data, which are therefore replicable by the community. Furthermore, to the authors' knowledge, no other research or commercial activity currently demonstrates a more accurate engine for the ATC domain built on publicly open data.
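Since most of the figures quoted above are word error rates, a minimal sketch of how WER is conventionally computed may help readers who want to reproduce them. It uses the standard word-level Levenshtein distance (substitutions, deletions, and insertions) divided by the number of reference words. The function name and the example utterance are illustrative assumptions, not part of the ATCO2 tooling, and text normalization (which the scoring in this paper may apply) is not modeled.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Invented example utterance (not taken from the ATCO2 corpora).
reference = "lufthansa three two one descend flight level eight zero"
hypothesis = "lufthansa three two nine descend level eight zero"
wer = word_error_rate(reference, hypothesis)
print(f"WER = {wer:.1%}")
```

A callsign-only WER, as reported in Table 5, would apply the same computation restricted to the words that make up the callsign.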
Most of these are actually disabled for security reasons (so as not to interrupt the processing pipeline), but may be easily enabled on the fly if needed. The overall data flow model is described in Figure A1. Any new job (a request for a full automatic transcription of a recording) accepted via the API on the SpokenData (industrial partner: https://www.spokendata.com/atco2; accessed on 10 October 2023) side is processed by a master processing node. The job is enqueued into a workload manager queue. Once there is a free processing slot, the job is submitted to a processing server, or worker. The master processing node then informs the OSN server about the state of the job by calling a callback.
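The job flow described above (accept, enqueue, dispatch to a free worker slot, report state through a callback) can be illustrated with a small sketch. This is a simplified, synchronous model under stated assumptions: the class and method names, the state labels, and the choice to fire the callback on every state change are all hypothetical and do not reflect the actual SpokenData or OSN interfaces.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List


@dataclass
class Job:
    job_id: str
    recording: str
    state: str = "accepted"  # hypothetical state label


class MasterNode:
    """Toy stand-in for the master processing node (names are assumptions)."""

    def __init__(self, slots: int, callback: Callable[[str, str], None]):
        self.free_slots = slots
        self.queue: Deque[Job] = deque()
        self.callback = callback  # stands in for the call back to the OSN server

    def accept(self, job_id: str, recording: str) -> Job:
        # A new job is enqueued into the workload manager queue.
        job = Job(job_id, recording, state="queued")
        self.queue.append(job)
        self.callback(job.job_id, job.state)
        return job

    def dispatch(self, worker: Callable[[str], str]) -> List[str]:
        # Submit queued jobs while a processing slot is free. Workers run
        # synchronously here, so each slot is released right after use.
        results = []
        while self.queue and self.free_slots > 0:
            job = self.queue.popleft()
            self.free_slots -= 1
            job.state = "processing"
            self.callback(job.job_id, job.state)
            try:
                results.append(worker(job.recording))
                job.state = "done"
            except Exception:
                job.state = "failed"
            finally:
                self.free_slots += 1
            self.callback(job.job_id, job.state)
        return results


# Usage with a placeholder worker that only upper-cases the file name.
events = []
master = MasterNode(slots=1, callback=lambda job_id, state: events.append((job_id, state)))
master.accept("job-1", "rec.wav")
outputs = master.dispatch(lambda recording: recording.upper())
```

In a real deployment the workers would run concurrently on separate processing servers; the synchronous loop is used here only to keep the example short.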

Appendix C. Transcription Platform: Data Flow
The life cycle of the data (the recording for human transcription) is split into four main states. A new recording is set to the queued and untouched state when it is pushed into the transcription platform from the transcription engine. The recording is placed into a queue of transcription jobs and is immediately visible to all annotators. The queue is shown in the open jobs screen. Annotators can interact with the queue: they can listen to recordings and select some for transcription. Recordings in this state may drop off the queue in two cases: they are old and no one is interested in annotating them, or three annotators have marked the recording with a thumbs down. The dropped-off recordings are deleted after 7 days.
Once a user selects the recording for transcription, revises the automatic transcription, and saves it, the recording is set to the queued and annotated state. This state prevents the recording from being dropped from the queue and deleted. It is also flagged as (to) re-check in the open jobs screen, to inform other annotators that it has been modified (annotated) and that they should recheck whether the transcription is correct rather than annotate it from scratch. If any annotator indicates the existence of personal information in the recording (with the "Anonymize" label), the recording is dropped off the queue and deleted.
The next state is annotated. If the recording is successfully rechecked, then the recording is considered annotated and the transcription is final. The recording is removed from the open jobs queue and placed on a stack of finished recording transcriptions. The stack is periodically exported to ELRA for further packaging and distribution to the community. This state also triggers a callback to the OSN platform, informing them that the human transcription is completed and that they can download the transcription. After the recording is exported to ELRA, we set the state to Finished. Here, the recording can be archived or deleted. The detailed data flow schema is depicted in Figure A2.
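The life cycle described in this appendix can be read as a small state machine. The sketch below encodes the four states and the transitions mentioned in the text; the event names, the class layout, and the collapsing of "dropped, then deleted after 7 days" into a single deleted state are simplifying assumptions made for illustration, not the platform's actual implementation.

```python
from enum import Enum


class State(Enum):
    QUEUED_UNTOUCHED = "queued and untouched"
    QUEUED_ANNOTATED = "queued and annotated"
    ANNOTATED = "annotated"
    FINISHED = "finished"
    DELETED = "deleted"  # simplification: dropped recordings end up here


# (current state, event) -> next state. Event names are hypothetical.
TRANSITIONS = {
    (State.QUEUED_UNTOUCHED, "save_transcript"): State.QUEUED_ANNOTATED,
    (State.QUEUED_UNTOUCHED, "drop"): State.DELETED,
    # The text mentions the "Anonymize" label in the queued-and-annotated
    # paragraph; whether it also applies earlier is not stated.
    (State.QUEUED_ANNOTATED, "anonymize"): State.DELETED,
    (State.QUEUED_ANNOTATED, "recheck_ok"): State.ANNOTATED,
    (State.ANNOTATED, "export"): State.FINISHED,
}

THUMBS_DOWN_LIMIT = 3  # three thumbs-down votes drop an untouched recording


class Recording:
    def __init__(self) -> None:
        self.state = State.QUEUED_UNTOUCHED
        self.thumbs_down = 0

    def apply(self, event: str) -> State:
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"event {event!r} not allowed in state {self.state.value!r}")
        self.state = TRANSITIONS[key]
        return self.state

    def vote_down(self) -> State:
        # Only untouched recordings can be voted off the queue.
        self.thumbs_down += 1
        if self.state is State.QUEUED_UNTOUCHED and self.thumbs_down >= THUMBS_DOWN_LIMIT:
            self.apply("drop")
        return self.state


# Usage: the path of a recording that is transcribed, rechecked, and exported.
rec = Recording()
for event in ("save_transcript", "recheck_ok", "export"):
    rec.apply(event)
```

Keeping the transitions in a single table makes it easy to check that a saved (queued and annotated) recording can no longer be dropped by thumbs-down votes, which matches the protection described above.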

Figure 1. ATCO2 corpora. Blue circles denote transcriptions only available for the ATCO2 test set corpus. Green circles denote transcriptions and metadata available for both the ATCO2 test set and ATCO2 transcribed corpus sets. Taken from previous work in [7].

Figure 2. Transcription protocol. ATCO2 corpora follow a rigorous transcription protocol based on previous ATC-related corpora and resources. Additionally, a cheat sheet for ATC transcript generation was developed during the project. The cheat sheet is available in Appendix E.

Figure 4. Breakdown of the data flow yield from raw data (recordings from data feeders) with respect to the human-annotated transcripts throughout the pipeline. This is a one-day snapshot from 9 February 2022.

Figure 5. Data feeders pipeline. The data users have set up a VHF receiver and feed data to OSN servers.

• Interface external services (e.g., voice annotation) in a simple and intuitive way;
• Keep maintenance and error handling as simple as possible.

Figure 6. The high-level architecture of the data collection platform.

Figure 7. Length of recordings per country from the beginning of the service until 5 March 2023. Only countries where the length of recordings is longer than 1 h are given. Note that some countries (e.g., United States) were not part of the official release of the ATCO2 corpora (see Table 2). Still, they are currently being collected in the OSN Platform.

Figure 9. Main automatic speech recognition and understanding tasks that can be achieved with the ATCO2 corpora. ELD: English language detection; NLU: natural language understanding, e.g., callsign highlighting; SPKRoleID: speaker role identification; RBED: read-back error detection.

Figure 10. (a) Named entity recognition (NER) and (b) speaker role detection modules, the latter based on sequence classification (SC). Both systems are based on fine-tuning a pretrained BERT [70] model on ATC data. The NER system recognizes callsigns, commands, and values, while the SC system assigns a speaker role to the input sequence. Taken from [7].

Figure 11. Metrics for the speaker role detection system (introduced in [7]). Metrics are reported only on the ATCO2-test-set-1h corpus with a bert-base-uncased model trained with different datasets from Table 1. Left plot: ablation of the F1-score versus the number of samples used to train the system. Right plot: F1-score for models trained with different training seeds. The box plot depicts the performance variability when splitting the test set into ATCo and pilot subsets.

Figure A2. Diagram of the data flow (lifetime) in the transcription platform. The transcription engine is shown in green, the queued and untouched state in yellow, the queued and annotated state in red, and the annotated state in blue. The rest (white) corresponds to the securely destroyed state.

Table 1 in ref. [7], and for the databases released by the ATCO2 project, to Table 1 in this paper.

Table 1. Air traffic control communications corpora released by the ATCO2 project. † Full database after silence removal. †† Speaker accents depend on the airport's location and on the airline origin (e.g., Air France in Australia may contain French-accented audio); accents of pilots are not known at any time of the communication due to privacy regulations.

Table 2. Total accumulated duration (in hours) of speech after voice activity detection per airport in the ATCO2-T set. † English language detection (ELD) score (0-1). This score shows how confident our ELD system is that only English is spoken in the ATC communication. Note that the first word of each name denotes the ICAO airport identifier.

Table 3. Processing time per component in the transcription pipeline. The values in the second column are for an average 5.016 s long recording. The average was computed over 10,334 recordings (14.4 h), recorded on 4 December 2021.

Table 4. WER results for the public ATCO2 test sets with CNN-TDNNF models trained on different data, covering scenarios (a) to (c).

Table 5. Results for the boosting experiment on ATCO2 corpora. Results are listed for the CNN-TDNNF model trained with either all supervised data or 500 h or 2500 h of ATCO2 corpora. The top results per block are highlighted in bold. The best result per column is underlined. † 1h public test set. ‡ 4h full test set. Results are obtained with offline CPU decoding. ¶ Word error rates computed only on the sequence of words that compose the callsign in the utterance. CallWER: callsign word error rate; ACC: accuracy.

Table 1 .
Left plot: ablation of the Jaccard error rate versus the number of samples used to train the system.Right plot: F1-score for models trained with different training seeds.The box plot depicts the performance variability when splitting the test set by ATCo and pilot subsets.