Article
Peer-Review Record

Lessons Learned in Transcribing 5000 h of Air Traffic Control Communications for Robust Automatic Speech Understanding

Aerospace 2023, 10(10), 898; https://doi.org/10.3390/aerospace10100898
by Juan Zuluaga-Gomez 1,2,*, Iuliia Nigmatulina 1,3, Amrutha Prasad 1,4, Petr Motlicek 1,4,*, Driss Khalil 1, Srikanth Madikeri 1, Allan Tart 5, Igor Szoke 4, Vincent Lenders 6, Mickael Rigault 7 and Khalid Choukri 7
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 5 September 2023 / Revised: 10 October 2023 / Accepted: 11 October 2023 / Published: 20 October 2023

Round 1

Reviewer 1 Report (Previous Reviewer 1)

Thank you for addressing my comments. I'm happy for this to go forward now, modulo the following remarks:

You say you want to keep the title as it is, but what are the 'lessons learnt'? Also, 'pseudo-labelling' is a new phrase to me. Perhaps it would be clearer if you reserved the term 'annotation' for human labelling and 'ASR transcription' for machine labelling.

Furthermore, I still think you should clarify precisely what you mean by 'annotation': the word sequence, word boundaries, phone sequence, phone boundaries?

It's unusual to find a paper on corpus collection that goes into such detail. I found it hard to follow the intricacies, and I think the other reviewer did too. There is a tendency to assume knowledge of the ATCO2 project which the reader may not have, and I would consider putting all that kind of material into a single appendix.

Author Response

We thank the reviewer for their thoughtful reviews. We have modified several parts of the manuscript. Most of the modifications are highlighted in “YELLOW” in the PDF version. 

Q: You say you want to keep the title as it is, but what are the 'lessons learnt'? Also, 'pseudo-labelling' is a new phrase to me. Perhaps it would be clearer if you reserved the term 'annotation' for human labelling and 'ASR transcription' for machine labelling.

 

R: Pseudo-labeling, although a technical term in the ASR domain, can create confusion for readers unfamiliar with it. We acknowledge this feedback and have revised our terminology throughout the paper. Specifically, we have replaced 'pseudo-labeling' with the broader term 'ASR transcription' to enhance clarity. This modification is reflected not only in the text but also in the title. We believe this change effectively distinguishes between human transcription and machine-generated ASR transcriptions. 

Regarding the lessons learned, we have summarized them in the conclusions section; in total, we propose six lessons learned. Similarly, in lines 108-130 we summarize the main contributions of the paper.

 

Q: Furthermore, I still think you should clarify precisely what you mean by 'annotation'; the word sequence, word boundaries, phone sequence, phone boundaries?

 

R: Addressing your concern, we have modified this term (see below), and it is now only called "transcript" (or "ASR transcript"), see lines 40-47. In this context, "transcript" refers to a word-by-word, human-generated transcript of a given utterance, while "ASR transcription" means that this step is generated by an in-domain ASR system. This explicit definition should provide readers with a clear understanding of our terminology, ensuring there is no ambiguity.

 

Q: It's unusual to find a paper on corpus collection which goes into such detail.  I found it hard to follow the intricacies, and I think the other reviewer did so too. There is a tendency to assume knowledge of the ATCO2 project which the reader may not have, and I would consider putting all that kind of material into a single appendix.

 

R: Taking your suggestion into account, we have created a dedicated appendix (Appendix A) specifically focused on the ATCO2 project. This new section serves as a comprehensive resource, encapsulating all pertinent details, including the project's motivation and relevant URLs. By consolidating this information into an appendix, we aim to streamline the main body of the paper, making it more accessible to readers who may not be familiar with the ATCO2 project. We expect that this adjustment will enhance the overall readability and clarity of our manuscript.

 

Author Response File: Author Response.pdf

Reviewer 2 Report (New Reviewer)

A brief summary 

The authors of the manuscript build on the previous results of researchers in projects that brought partial success, such as within the SESAR program, which paved the way for the development of a real pseudo-pilot proof-of-concept system. A weak point, as the authors themselves rightly mention, is the fragility of the database and its occasional concentration on one region, one airport, etc.

I see the main goal of the manuscript as an effort to expand knowledge and support the ability to recognize speech in the radio correspondence of air traffic controllers, which would allow its textual transcription for further use.

The main contribution and strength of the manuscript is the strengthened robustness of the database for machine learning based on artificial intelligence, which is built on the presented 5,000 hours of automatically transcribed audio data and the corresponding data. The modules powered by artificial intelligence form an innovative approach to improving training in the ATC domain.

General concept comments

The manuscript is clear, relevant to the field of application of artificial intelligence in speech recognition in radio correspondence for ATCo training and presented in a well-structured manner.

92 cited references are current and relevant.

The manuscript is scientifically based with an experimentally verified design that is suitable for testing for voice training of ATC correspondence.

The methodology for solving the problem and the results of the manuscript are reproducible by other users and researchers based on the details given in the manuscript.

The results are supported by 16 figures and 7 tables that present the data well. The data are interpreted throughout the manuscript in a reasonable and comprehensible manner. The authors also appropriately used knowledge from a rich reference base and the current state of the art in the field of artificial intelligence and machine learning for speech recognition in the context of air traffic control.

 

In my opinion, the partial conclusions of the solution to the problem are in accordance with the presented evidence and arguments. I found no ethical misconduct.

Specific comments 

Minor revisions:

To improve the manuscript for the reader, I recommend:

In line 65, I recommend formulating the main objective of the article and the research questions for which the researchers are looking for answers, or the hypotheses that will be verified.

I recommend finalizing the manuscript according to comments and instructions for authors within MINOR revisions.

In my opinion, the topic will be interesting for the wider readership of the journal. The manuscript has the potential to generate further research questions for further scientific work. 

I have no significant comments.

 

Author Response

We thank the reviewer for their thoughtful reviews. We have modified several parts of the manuscript. Most of the modifications are highlighted in “YELLOW” in the updated PDF version of the paper. 

Q: To improve the manuscript for the reader, in line 65 I recommend formulating the main objective of the article and the research questions for which the researchers are looking for answers, or the hypotheses that will be verified.

 

R: Thank you for your valuable feedback. We have revisited the introduction section to enhance clarity for the reader. In line 65, we have now explicitly stated the main objective of our research, articulating the specific research questions we sought to answer and the hypotheses we aimed to verify. See paragraphs starting with “(1)..., (2)..., (3)... ”. This addition provides a clear roadmap for readers, outlining the focus and purpose of our study. We believe this modification improves the understanding of our manuscript.  

 

In addition to this, we have added a new appendix (Appendix A) about the ATCO2 project. This new section includes the motivation of the project and some URLs of interest. Readers can easily check the websites that were of particular interest prior to, during, and after the development of the ATCO2 corpora.

Author Response File: Author Response.pdf

This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.


Round 1

Reviewer 1 Report

This manuscript documents a major undertaking, to produce a substantial corpus of spoken Air Traffic Control interactions, for the purpose of developing spoken language technology for this material.

I work in Automatic Speech Recognition and my comments are therefore restricted to the ASR and corpus aspects of this work.

Firstly, I am surprised at the complications and the effort required to produce this corpus: it is a major undertaking as I said, and ATCO2 should underpin work in this area for some time.

This brings me to a point that puzzles me: what is the status of ATCO2? Is it finished or is it still being collected? Is it available to the research community and if so how? Is it open access, as many speech corpora are? Forgive me if I've missed the answers, but they aren't consistent: e.g. l202 and l205 contradict each other.

Are speakers told what to say, or does ATCO2 capture live, unscripted material? The latter, I assume, but make it clear.

When you quote WERs, for instance in the abstract, make it clear whether they are absolute or relative.
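For readers unfamiliar with the distinction the reviewer raises, a minimal sketch follows (the WER values are purely illustrative, not figures from the paper): a system that reduces WER from 30% to 24% improves by 6 percentage points in absolute terms, but by 20% in relative terms.

```python
# Absolute vs. relative WER improvement. A 30% -> 24% WER reduction is
# a 6-point absolute improvement but a 20% relative one; the two are
# easy to confuse when a paper just says "X% improvement".

def absolute_improvement(baseline_wer: float, new_wer: float) -> float:
    """Improvement in percentage points."""
    return baseline_wer - new_wer

def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Improvement as a percentage of the baseline WER."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

print(absolute_improvement(30.0, 24.0))  # 6.0 (percentage points)
print(relative_improvement(30.0, 24.0))  # 20.0 (percent, relative)
```

Stating which of the two conventions is used, at least once in the abstract, resolves the ambiguity.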

Say something about the quality of the recordings, which presumably is not high. How variable is it?

p55-60 I don't understand how you can 'provide data annotators with automatically transcribed data'... doesn't that mean that the annotation has already been done?

Make the level of annotation clear: word sequences, individual word boundaries, phone level? An example of annotation would help.

l119 'ecosystem'?

In the ASR application, will you normally know the identity of the speakers? If so there is much that can be done to develop bespoke recognisers.

p242 'would allow to' ?

 

l310 define x and y

ASR figures: what is statistically significant?

ASR figures: the WERs don't seem low enough to allow for 'high quality transcription' without human editing. ASR moves very quickly, and performance figures older than 2 years are likely to have been surpassed.

Callsign boosting: the boosted list could be different for each flight.
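For readers unfamiliar with the technique, callsign boosting (a form of contextual biasing) can be sketched as rescoring n-best ASR hypotheses with a bonus for hypotheses containing callsigns expected in the current airspace. The function, scores, and bonus value below are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of callsign boosting via n-best rescoring.
# nbest: list of (hypothesis_text, score) pairs, higher score = better.
# expected_callsigns: per-flight list (e.g. from surveillance data),
# which is exactly why the boosted list can differ for each flight.

def boost_callsigns(nbest, expected_callsigns, bonus=2.0):
    """Return the best hypothesis after adding a bonus to any
    hypothesis that contains an expected callsign verbatim."""
    rescored = []
    for text, score in nbest:
        if any(cs in text for cs in expected_callsigns):
            score += bonus  # bias toward contextually plausible callsigns
        rescored.append((text, score))
    return max(rescored, key=lambda pair: pair[1])[0]

nbest = [("lufthansa three two one descend", 1.0),
         ("lufthansa three to one descend", 1.4)]
# Without boosting the acoustically higher-scored (but wrong) second
# hypothesis would win; the per-flight callsign list corrects it.
print(boost_callsigns(nbest, ["lufthansa three two one"]))
```

In practice this biasing is usually applied inside the decoder (e.g. on lattices or via a contextual language model) rather than on final n-best strings, but the rescoring view captures the idea.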

Table 5: what's EntWER?

l448 and following: do you need more ANN explanation for your readership?

l695 'pseudo-annotating' ?

You use the word 'engine' to refer to a speech tech system. This is unusual and feels awkward.

 

The quality of English is good, with a few odd constructions which I've pointed to in my comments.

Author Response

Dear reviewer,

Please, find attached the reply to each of the questions raised.

Best regards,
Juan Pablo Zuluaga



Author Response File: Author Response.pdf

Reviewer 2 Report

The article very nicely describes the work of the Clean Sky 2 Joint Undertaking (JU) and EU-H2020 project under Grant Agreement No. 864702—ATCO2 (Automatic collection and processing of voice data from air-traffic communications), of which it is essentially a report.

However, the authors of the article did not identify which part of the work is theirs and which is taken from the project outputs. From my point of view, the whole article reads like a project deliverable with no scientific contribution; in this case, it is a commercial use of technology. This is reflected in both the conclusion of the article and its structure, which does not correspond to a scientific article.

I recommend the authors to rework the article into a scientific article, not to publish appendices that are not related to the scientific nature of the article, and to focus on the essential parts without unnecessarily describing the functioning of the ATCO2 architecture.

N/A

Author Response

Dear reviewer,

Please, find attached the reply to each of the questions raised.

Best regards,
Juan Pablo Zuluaga

Author Response File: Author Response.pdf
