Towards Aircraft Maintenance Metaverse Using Speech Interactions with Virtual Objects in Mixed Reality

Metaverses embedded in our lives create virtual experiences inside of the physical world. Moving towards metaverses in aircraft maintenance, mixed reality (MR) creates enormous opportunities for the interaction with virtual airplanes (digital twin) that deliver a near-real experience, keeping physical distancing during pandemics. 3D twins of modern machines exported to MR can be easily manipulated, shared, and updated, which creates colossal benefits for aviation colleges who still exploit retired models for practicing. Therefore, we propose mixed reality education and training of aircraft maintenance for Boeing 737 in smart glasses, enhanced with a deep learning speech interaction module for trainee engineers to control virtual assets and workflow using speech commands, enabling them to operate with both hands. With the use of the convolutional neural network (CNN) architecture for audio features and learning and classification parts for commands and language identification, the speech module handles intermixed requests in English and Korean languages, giving corresponding feedback. Evaluation with test data showed high accuracy of prediction, having on average 95.7% and 99.6% on the F1-Score metric for command and language prediction, respectively. The proposed speech interaction module in the aircraft maintenance metaverse further improved education and training, giving intuitive and efficient control over the operation, enhancing interaction with virtual objects in mixed reality.


Introduction
Metaverse is coming [1]. The world started to experience an extraordinary mixed reality (MR) digital place inside of the physical world where people seamlessly get together and interact in millions of 3D virtual experiences. What is more, the coronavirus pandemic [2], "with its requirements of physical distancing, has brought people into online digital environments for a growing range of shared human experiences" [1]. Soon being a part of daily life, metaverses will also have a key role in whole industries, such as aviation maintenance. Transforming physical aircraft that are worth hundreds of millions of dollars (i.e., Boeing 737 Max 8 costs $121.6 million [3]) into virtual digital twins will allow airlines and maintenance repair operation (MRO) service providers to save a lot of money and resources and increase the productivity, efficiency, and quality of the maintenance process by enhancing technicians' workflow with mixed reality collaboration. In addition, with virtual models that can be easily downloaded, maintenance crew members can share the content around the globe, and train fellow engineers in another part of the world in realtime, using augmented virtual aircrafts available to everyone. Airlines or aviation schools that use old physical units for training purposes will soon replace equipment and aircrafts with modern virtual 3D models that are dynamic and interactive with built-in manuals, warnings, and tutorial instructions to follow. virtual 3D assets of aircraft have enormous potential for educational purposes when shared among trainees. Taking into account the fact that aviation colleges currently use retired models of aircrafts that do not have modern technological updates in them, our work makes it possible to reimagine aircraft maintenance education in Industry 4.0 by providing a scalable solution with constantly updated virtual 3D models that can be easily updated, manipulated, downloaded, and shared. What is more, the aircraft 3D visualizations with smart glasses are presented in actual size with all supported equipment, maximizing the effect of the presence, delivering industry-level experience.
Sensors 2021, 21, x FOR PEER REVIEW 3 of 21 space with MR content of his fellow engineer. As can be seen, the physical world combined with virtual 3D assets of aircraft have enormous potential for educational purposes when shared among trainees. Taking into account the fact that aviation colleges currently use retired models of aircrafts that do not have modern technological updates in them, our work makes it possible to reimagine aircraft maintenance education in Industry 4.0 by providing a scalable solution with constantly updated virtual 3D models that can be easily updated, manipulated, downloaded, and shared. What is more, the aircraft 3D visualizations with smart glasses are presented in actual size with all supported equipment, maximizing the effect of the presence, delivering industry-level experience. In addition, combining cutting-edge technologies such as mixed reality and artificial intelligence in the era of Industry 4.0, we built a speech interaction module powered by a deep learning model that enables trainees "to talk" with mixed reality content and control the workflow of the maintenance process. Speech interaction creates a tremendous possibility to imitate an expert in the field of aircraft maintenance that guides trainees during the education process and facilitates tutoring by providing required materials such as technical manuals, instructions, and reference videos that support maintenance education. We addressed the above-mentioned limitations of voice interaction in smart glasses and created a custom voice interaction module, which is flexible enough to recognize the wide variation of speech commands for operation and supports intermixed requests in English and Korean languages, recognizing the corresponding language of command (see Figure 2). First, we collected a dataset of user commands in English and Korean languages. Next, we created a deep learning-based model that predicts the requested command and the language in which the command was spoken. The network has customized architecture that uses CNN for audio features learning, and two output classification parts for commands and language identification, respectively. The module records speech commands by the user and extracts MFCC features for inference. Based on the speech interaction module result (see Figure 2), the MR application performs a particular action, and voice feedback is given back to a user in the predicted language (i.e., if a person makes a speech request in Korean, the response from the application will be in Korean, accordingly). Finally, we embedded the model into the HoloLens smart glasses and performed an evaluation process. In addition, combining cutting-edge technologies such as mixed reality and artificial intelligence in the era of Industry 4.0, we built a speech interaction module powered by a deep learning model that enables trainees "to talk" with mixed reality content and control the workflow of the maintenance process. Speech interaction creates a tremendous possibility to imitate an expert in the field of aircraft maintenance that guides trainees during the education process and facilitates tutoring by providing required materials such as technical manuals, instructions, and reference videos that support maintenance education. We addressed the above-mentioned limitations of voice interaction in smart glasses and created a custom voice interaction module, which is flexible enough to recognize the wide variation of speech commands for operation and supports intermixed requests in English and Korean languages, recognizing the corresponding language of command (see Figure 2). First, we collected a dataset of user commands in English and Korean languages. Next, we created a deep learning-based model that predicts the requested command and the language in which the command was spoken. The network has customized architecture that uses CNN for audio features learning, and two output classification parts for commands and language identification, respectively. The module records speech commands by the user and extracts MFCC features for inference. Based on the speech interaction module result (see Figure 2), the MR application performs a particular action, and voice feedback is given back to a user in the predicted language (i.e., if a person makes a speech request in Korean, the response from the application will be in Korean, accordingly). Finally, we embedded the model into the HoloLens smart glasses and performed an evaluation process. To evaluate the performance of the speech interaction module, we created several sets of test data, including intermixed commands, and separate English and Korean test sets. To carry out the quantitative analysis, classification metrics such as accuracy, precision, recall, and F1-Score, confusion matrix applied, and experimental results demonstrated that the model can accurately identify the commands' type along with language, showing an average F1-Score 95.7 and 99.6% for command and language recognition, accordingly. The model has only 181,000 weight parameters, taking up a small memory size of 716kB with real-time performance on smart glasses. Overall, the proposed speech interaction module is enhancing aircraft maintenance with smart glasses by giving traineeengineers a flexible, intuitive, and productive way to easily navigate inside the maintenance routine. Furthermore, the module works properly even when a user wears a face mask (see Figure 1), coping with various background noises and speech uncertainties due to a covered mouth, which is practically demanding in times of social distancing.
In the following sections, we described the importance and breakthrough of mixed reality in the field of maintenance and education. Next, we discussed deep learning techniques for developing a custom speech interaction module. In Section 3, we presented the proposed mixed reality environment for aircraft maintenance education and training. Section 4 and Section 5 presented step-by-step explanations on the dataset collection, audio features extraction, and the network architecture of the proposed module. Lastly, we demonstrated evaluation techniques and experimental results to make a conclusion.

Mixed Reality in Maintenance and Education
Mixed reality is becoming a new player in the maintenance industry, creating great potential for industrial applications [4][5][6][7][8][9]. Engineers that use MR content guidance in various head-mounted devices such as smart glasses can seamlessly perform work with both hands and enhance their workflow. Taking the case of maintenance in the aviation industry, MR is essential, due to a high degree of worker-centered processes. Here, various inspection and maintenance processes are characterized by a high percentage of manual work steps, a great variety of handled components, and a considerable effort for documentation. Therefore, the cognitive support systems based on MR can deliver great value to the workers [5]. For example, the use of MR for visualization will allow the analysis, interaction, and exploration of information easier, hence facilitating operators' decisionmaking during maintenance tasks [7].
Mixed reality provides mobility. In the specific context of an application scenario, maintenance personnel should be highly mobile; while performing operational and maintenance tasks at different maintenance facilities or directly on the field, MR creates the possibility to work independent of location, while still following the instructions to perform the repairment process [4]. Digitalized aircraft can be easily downloaded to a working device and can easily be moved in the space at the touch of a finger. Usually, at the maintenance site, in order to manipulate the physical part of a massive machine, special equipment is required, however, MR opens up enormous advantages for mobile exploration.
Mixed reality improves collaboration. Considering that maintenance is a synergetic process, where more than two participants are involved, according to [28], the presence To evaluate the performance of the speech interaction module, we created several sets of test data, including intermixed commands, and separate English and Korean test sets. To carry out the quantitative analysis, classification metrics such as accuracy, precision, recall, and F1-Score, confusion matrix applied, and experimental results demonstrated that the model can accurately identify the commands' type along with language, showing an average F1-Score 95.7 and 99.6% for command and language recognition, accordingly. The model has only 181,000 weight parameters, taking up a small memory size of 716kB with real-time performance on smart glasses. Overall, the proposed speech interaction module is enhancing aircraft maintenance with smart glasses by giving trainee-engineers a flexible, intuitive, and productive way to easily navigate inside the maintenance routine. Furthermore, the module works properly even when a user wears a face mask (see Figure 1), coping with various background noises and speech uncertainties due to a covered mouth, which is practically demanding in times of social distancing.
In the following sections, we described the importance and breakthrough of mixed reality in the field of maintenance and education. Next, we discussed deep learning techniques for developing a custom speech interaction module. In Section 3, we presented the proposed mixed reality environment for aircraft maintenance education and training. Sections 4 and 5 presented step-by-step explanations on the dataset collection, audio features extraction, and the network architecture of the proposed module. Lastly, we demonstrated evaluation techniques and experimental results to make a conclusion.

Mixed Reality in Maintenance and Education
Mixed reality is becoming a new player in the maintenance industry, creating great potential for industrial applications [4][5][6][7][8][9]. Engineers that use MR content guidance in various head-mounted devices such as smart glasses can seamlessly perform work with both hands and enhance their workflow. Taking the case of maintenance in the aviation industry, MR is essential, due to a high degree of worker-centered processes. Here, various inspection and maintenance processes are characterized by a high percentage of manual work steps, a great variety of handled components, and a considerable effort for documentation. Therefore, the cognitive support systems based on MR can deliver great value to the workers [5]. For example, the use of MR for visualization will allow the analysis, interaction, and exploration of information easier, hence facilitating operators' decision-making during maintenance tasks [7].
Mixed reality provides mobility. In the specific context of an application scenario, maintenance personnel should be highly mobile; while performing operational and maintenance tasks at different maintenance facilities or directly on the field, MR creates the possibility to work independent of location, while still following the instructions to perform the repairment process [4]. Digitalized aircraft can be easily downloaded to a working device and can easily be moved in the space at the touch of a finger. Usually, at the maintenance site, in order to manipulate the physical part of a massive machine, special equipment is required, however, MR opens up enormous advantages for mobile exploration. Mixed reality improves collaboration. Considering that maintenance is a synergetic process, where more than two participants are involved, according to [28], the presence of virtual objects in the physical world improves spatial cues in collaborative mixed reality environments (MREs). Engineers who work in MREs experience a positive influence on collaborators' communication behavior for object identification and object positioning tasks. The authors of [28] suggest that adding virtual objects to MREs "reduces user task load and improves groups' communication and user experience". As a result, while performing certain work, group members used more informative verbal references rather than deictic gestures, decreasing the subjective workload when virtual objects were present.
MR can facilitate training and access to large amounts of documentation. The authors of [8] present work in training and assisting in maintaining equipment in complex industrial environments. As the complexity of the assembly or maintenance task increases, training becomes a significant factor with respect to both time and money. Such a technology as MR can make an enormous addition to tutoring future technicians by giving them the ability to interact with virtual objects but receive nearly the same experience as if they worked with real ones. Taking the example of the most exploitable aircraft with over 10,000 active units, the Boeing 737 [29]: the flight cost of Boeing 737-800 is about $2180 per hour [30], which means whenever an aircraft stays on land for MRO, it is extremely costly for the owner/company. In this sense, MR is an effective way to educate employees and at the same time save money, since a physical plane will not be required.
To gain the best experience from mixed reality, smart glasses, which overlay content in front of the eyes at a wide-angle and give the ability to work with both hands and no longer hold a device, should be used. Moreover, engineers that wear smart glasses should receive high-level human-computer interaction that speeds up the maintenance process, not slows it down. Therefore, interactions in smart glasses such as gestures, eye-tracking, and voice should be developed in a user-friendly way. What is more, depending on the application, one interaction type might play a key role. Taking into consideration the nature of the maintenance work, where technicians use both hands to complete tasks, voice interaction using various speech commands is crucial to control the MR since you do not have to click buttons or use hand gestures, thus enabling the ability to have a seamless interaction with the system.

Smart Glasses Voice Interaction
Voice input in smart glasses brings colossal benefits to its users. The authors of [11] show that voice interaction with a system can significantly reduce a person's cognitive load since there is no need to remember the steps of particular action invocations; moreover, it minimizes the user's efforts by making tasks effortless. Taking an example of the voice input of the current most-advanced head-mounted device for mixed reality, Microsoft HoloLens 2 [10], voice communication is great for interacting with traversals to cut multiple actions, making users experience powerful emotional effects whilst saving a lot of time. What is more, it gives a sense of multi-tasking whenever users' arms are busy [11]. Considering these opportunities that smart glasses have, there are also technical limitations.
First, system-built voice interaction is based on static pre-defined commands. When certain voice commands are set to invoke a particular action, a person wearing the smart glasses must speak out the command in exactly the way it was programmed. For example, if "next instruction" is defined in the system, then "alright, show me the next instruction" is not going to work. Users must either remember all the commands exactly how they are set, which is a contradiction to the user-friendliness paradigm, or developers need to handle this with multiple variations of the same voice command in the system, which is inconvenient to maintain when there are many control functions.
Second, built-in speech recognition is not suitable for a domain-specific language. Taking the example of maintenance in the aviation industry, there exist special manuals such as the IPC (Illustrated Parts Catalogue), and when the dictation speech input option is invoked in the HoloLens 2 smart glasses, the speech command "show me IPC view" is recognized as "show me my PC view". Thus, in the cases where certain professional words are called, there is no guarantee that the built-in speech recognition feature performs well.
Third, voice interaction is not optimized for intermixed commands, where a sentence consists of words that belong to two or more languages. Coming to the industry again, companies and employees around the world always exchange their technologies; therefore, it is a usual case when professional vocabulary in a certain community consists of words from various languages. In the case of aviation maintenance, all manuals are in English; therefore, using English domain-specific words during the work process by, for example, Korean engineers is unavoidable.
Even though advancements in smart glasses, particularly voice interaction, bring value to users and cope with basic needs and functionality, it might not be enough to deliver to an industry a scalable, dynamic, and flexible approach that can recognize speech commands with wide variations that includes words from the domain-specific vocabulary. Therefore, in this work, we applied an alternative approach to handle these limitations and fulfill our requirements with deep learning.

Deep Learning for Speech Interaction
Deep learning is the pick. According to [12], deep learning-based approaches show better performance over traditional machine learning methods for many different applications. They demonstrate massive potential in various tasks related to audio input, such as speech recognition, sound, and speech classification [13]. In addition, interesting development have occurred in speech enhancement systems based on visual speech recognition when a system needs audio and visual information at the same time for inferencing speech [31], which is beneficial for people with disabilities in hearing. However, according to the requirements of our speech module, we want to classify a single audio speech request into one of the command types and recognize the language; hence, it is not a speech enhancement task nor speech recognition, where you must build the sentence from a speech and have a high level of word or character accuracy. Therefore, sound and speech classification works are further analyzed.
The recent works in speech and sound classification mostly dealt with simple singleword speech commands (A Dataset for Limited-Vocabulary Speech Recognition [21]) or urban sounds (A Dataset and Taxonomy for Urban Sound Research [32]). The fundament of their networks' architecture is the convolutional neural network (CNN); however, the dimensions of the convolutional block and type of audio feature are divided into two approaches. First, [13,14,16,20,23] used raw audio directly, without extracting features. Since audio is a one-dimensional signal, their CNN architecture is also based on one-dimensional filters. Second, [13,15,17,18,24,25,31] applied various audio features such as mel frequency cepstral coefficients (MFCCs) [26] or Spectrograms with 2D convolutions. Since these features have a two-dimensional representation, CNN filters have the corresponding shape. Even though both approaches perform nearly the same, the analyses in [13,26,33] show evidence that MFCCs are great at the approximation to the human voice, which helps to identify linguistic content and discard background noise or certain emotions present in the audio. Therefore, in our work, we decided to follow the second approach-using 2D CNN-based architecture with MFCC as audio features.
The referenced works in speech command recognition demonstrated the ability to easily handle simple single-word classification; however, the following limitations are present:

1.
They do not deal with complex commands classification, where multiple variations of request belong to one class; 2.
The datasets they experiment with and trained networks are monolingual. In our work, we work with English, Korean, and intermixed commands at the same time; 3.
Their networks' output layers only give the command class; however, we need to identify the command type and language of command as well.
Therefore, we created our own deep learning network architecture and trained it on a custom collected dataset of spoken commands following their effective practices.

Mixed Reality Maintenance Education for Boeing 737
Moving towards the development of the aircraft maintenance metaverse, we proposed the mixed reality-based application in the Microsoft HoloLens 2 smart glasses for maintenance education of the aircraft Boeing 737. Creating a virtual aviation space, MR guides a trainee wearing the smart glasses and overlays aircraft-specific mixed reality content and other supportive materials in front of the user's line of sight, combining assets of both physical and virtual worlds. A user is provided step-by-step instructions to complete a certain task, such as landing gear removal, and all the required documents and manuals, tutorial videos, MR-based illustrations. Since trainee mechanics perform MRO and training procedures, they use both hands, receiving instructions and illustration in the metaverse while work is performed on a physical site. Therefore, the main functions to support operations and allow seamless workflow need to be invoked using speech commands without clicking buttons or using hand gestures, creating a multitasking experience.

Application Overview
Aircraft maintenance MRO is performed in stages. There is one global task, such as landing gear removal, which includes subtasks that need to be completed. Each subtask has its own materials to support engineers, and instructions to follow. Figure 3 describes the action and content flow of the maintenance process implemented in our work. According to a subtask that a user works on, subtask content is delivered to the smart glasses. subtask content includes the following (see Figure 3):

•
Virtual objects. Mixed reality 3D models of aircraft and their specific components. Virtual models contain animations that guide a trainee by showing the proper way of action on a specific part. In addition, certain parts of the virtual model are highlighted since they are referred to in the manuals; • IPC Manual. IPC stands for the Illustrated Parts Catalogue, which is the reference document specific to aircraft type which describes in comprehensive detail every component. "When equipment breaks down or requires routine maintenance, maintenance personnel need to be able to figure out what spare parts are required to complete the work" [34]. Spare parts information is taken from the IPC document; • AMM Manual. The Aircraft Maintenance Manual (AMM) is a document which details the steps of maintenance tasks carried out on an aircraft [35]. Mechanics, regulatory officials, and aircraft and component manufacturers refer to the relevant AMM of the aircraft before they proceed further with the maintenance or repair of the aircraft or its systems; • Reference Video. A tutorial video that contains an illustration of certain procedures on how to perform an action. Videos are taken from real aircraft maintenance sites, demonstrating expert engineers working on a specific instruction; • Instructions. Instructions in text and audio format summarize a subtask and list all that needs to be done in a stage. Audio is played to give a hearing perception, which is convenient when a user is located far from the text panel. At the same time, the instructions panel contains the content of instruction with all referenced objects.  Figure 1 shows a capture of a trainee that wears a HoloLens 2 and works on the lower side strut removal subtask of the landing gear removal task. The capture in Figure 1 was made using Spectator View [27] which is an augmented reality product that enables viewing HoloLens experiences from other devices (in this case tablet). Thus, the training pro- As can be seen in Figure 3, all materials in the subtask content are part of the knowledge base, which is a structured knowledge representation that has all indexes to the required data. A user performing a subtask controls workflow and can go back and forth between stages, where mixed reality content supports and enhances the learning process. Figure 1 shows a capture of a trainee that wears a HoloLens 2 and works on the lower side strut removal subtask of the landing gear removal task. The capture in Figure 1 was made using Spectator View [27] which is an augmented reality product that enables viewing HoloLens experiences from other devices (in this case tablet). Thus, the training process and its demonstration are shared among viewers, giving them a chance to have collaborative learning. Multiple trainees using mobile devices such as smartphones, tablets, and smart glasses can simultaneously share the space and view the maintenance lessons demonstrated by a supervisor in real-time. Viewing virtual objects from different angles and perspectives, participants of aircraft metaverse can interact with 3D assets and have hands-on experience.
To look from a first-person perspective, Figure 4 is a snapshot of the developed MR application in HoloLens 2. As can be seen, the 3D model of the landing gear is illustrated in the center, instructions to perform subtasks are played at the beginning of the stage using smart glasses speakers and shown in the special instruction panel in the center bottom; the IPC or AMM manuals are displayed in the manual panel on the left; the reference video is played in the media player on the right. The virtual objects located in the space are scanned by HoloLens 2; therefore, to analyze, a user may physically come closer towards 3D objects or use gestures to virtually drag 3D assets. Certain parts of the main 3D object have annotations with an identification number and name which are mentioned in the manuals (i.e., for "Remove Nut [42]", the annotation "Nut [42]" is placed onto the 3D virtual object). This makes following instructions and cross-referencing manuals easier for trainees.  Figure 1 shows a capture of a trainee that wears a HoloLens 2 and works on the lower side strut removal subtask of the landing gear removal task. The capture in Figure 1 was made using Spectator View [27] which is an augmented reality product that enables viewing HoloLens experiences from other devices (in this case tablet). Thus, the training process and its demonstration are shared among viewers, giving them a chance to have collaborative learning. Multiple trainees using mobile devices such as smartphones, tablets, and smart glasses can simultaneously share the space and view the maintenance lessons demonstrated by a supervisor in real-time. Viewing virtual objects from different angles and perspectives, participants of aircraft metaverse can interact with 3D assets and have hands-on experience.
To look from a first-person perspective, Figure 4 is a snapshot of the developed MR application in HoloLens 2. As can be seen, the 3D model of the landing gear is illustrated in the center, instructions to perform subtasks are played at the beginning of the stage using smart glasses speakers and shown in the special instruction panel in the center bottom; the IPC or AMM manuals are displayed in the manual panel on the left; the reference video is played in the media player on the right. The virtual objects located in the space are scanned by HoloLens 2; therefore, to analyze, a user may physically come closer towards 3D objects or use gestures to virtually drag 3D assets. Certain parts of the main 3D object have annotations with an identification number and name which are mentioned in the manuals (i.e., for "Remove Nut [42]", the annotation "Nut [42]" is placed onto the 3D virtual object). This makes following instructions and cross-referencing manuals easier for trainees.  Our proposed application for aircraft maintenance education has similar points with other works in this field; however, it delivers considerable improvements and novelties. The authors of [4] have a similar view on aircraft maintenance, having their application developed for use in laptop and mobile phone, but we applied modern HoloLens 2 smart glasses, which is much more effective since there is no need to hold a device. To use manuals, [4] applied AR markers that overlay 3D illustrations on top of the paper; however, we totally eliminate having to work with physical manuals, migrating them into the metaverse. In the case of [8], which presented an AR system for training and assisting in maintaining equipment in the industrial context, the authors presented the work with a head-mounted display with similar approaches; however, special equipment was required, such as a special stand with a camera, belt, and PC for processing. In our case, Microsoft HoloLens 2 replacing all these, creating a powerful independent environment that works in real-time. Overall, comparing related works in the field of MR training, we present new methods such as mixed reality visualizations and legacy manuals into the metaverse at the same time. What is more, we have a unique feature to share the working space of the maintenance process among other users, allowing them to engage in interactive education. In addition, moving forward in effective interaction with virtual objects, we built a custom voice interaction module, which creates opportunities beyond the available technical functions of a device.
To provide users with a smooth way to control the flow, we should take into consideration the nature of maintenance work. Since trainees and field engineers use both hands and need quick access to certain information, speech interaction is an effective way to control the workflow (i.e., to move between subtasks-command "next subtask"). Thus, to develop a speech interaction module, first, the main reference functions to be invoked need to be defined.

Reference Functions and Requirements of Speech Interaction
Overall, the main support for the mixed reality MRO should be in process navigation and content management. We underlined supportive functions and divided them into eight categories. Figure 4 is a snapshot of the MR maintenance application from the user's point of view, which is deployed in the Microsoft HoloLens 2 smart glasses. As can be seen in Figure 4, using dashed areas, we selected panels that should be managed during operation, and the following functions need to be called with speech commands: 1.
IPC Manual. The IPC document needs to be displayed in the manuals panel (see Figure 4); 2.
AMM Manual. The AMM document needs to be displayed in the manuals panel; 3.
Next Instruction. The required content for the next stage is shown; 4.
Previous Instruction. The required content for the previous stage is shown; 5.
Current Instruction. The required content for the current stage is reshown; 6.
Play. The media player runs a reference video; 7.
Pause. The media player pauses a reference video; 8.
Stop. The media player closes a reference video.
Regarding specific characteristics of aircraft maintenance, the process involves multiple professionals, and each person may have their own style of voice interaction. Thus, the speech interaction module should be flexible to capture various speech requests that may refer to the same command. Moreover, users of the application are Korean and English-speaking trainee technicians who deal with aviation-specific manuals written in English, and depending on the person, one might use pure English, Korean, or intermixedlanguage speech command for interaction. Therefore, all mentioned types of requests should be handled.
Along with taking speech requests, the module should give corresponding responses as well. An engineer should clearly know that a speech request is understood and proceeded, or it needs to be repeated. Thus, the module should give the response message in text and audio formats. To have the best interaction experience, communication between a user and the system should be in the same language; thus, the appropriate feedback (audio and text) by the application should proceed in the language it was requested.
Accordingly, based on the above-mentioned requirements, we proposed a speech interaction module that uses a deep learning model that maps spoken commands to the defined command types and at the same time identifies the language of request to proceed with corresponding feedback. To build and train a deep learning model, which is a data-driven approach, we first collected a dataset.

Dataset for Speech Interaction Module
Dataset creation and collection for our task was performed in three stages (see Figure 5). First, we analyzed the semantic structure of user commands in English and Korean and created a list of potential user requests. Second, based on the list, we collected audio recordings of these speech commands using Text-to-speech (TTS) application programming interface (API) services and asked several people to speak them out. Third, in order to bridge the gap between API and human recordings, we applied various audio augmentation techniques to the TTS-generated data.
teraction module that uses a deep learning model that maps spoken commands to the defined command types and at the same time identifies the language of request to proceed with corresponding feedback. To build and train a deep learning model, which is a datadriven approach, we first collected a dataset.

Dataset for Speech Interaction Module
Dataset creation and collection for our task was performed in three stages (see Figure  5). First, we analyzed the semantic structure of user commands in English and Korean and created a list of potential user requests. Second, based on the list, we collected audio recordings of these speech commands using Text-to-speech (TTS) application programming interface (API) services and asked several people to speak them out. Third, in order to bridge the gap between API and human recordings, we applied various audio augmentation techniques to the TTS-generated data.

Speech Commands
To develop a robust and flexible voice interaction module, it is essential to consider the nature of user requests. Therefore, we analyzed variations in speech commands and semantics for each command type (see Section 3.2). Different users use various combinations to address the same functionality. For example, "next instruction", "display the next instruction", "can you show me the next task" map to the same command-"next instruction". Moreover, end-users of the application use English and Korean languages to have a voice interaction with the MR application. Taking into consideration all these factors and goals, we made a list of potential user speech requests based on command type and language. Figure 6 summarizes the distribution of created speech commands. Overall, there are 541 commands in English and 623 in the Korean language that mapped to one of eight command types. As can be seen from the charts, the distribution of commands is nearly balanced, having on average about 145 commands for each command type in English and Korean languages in total. In the next stage, we recorded audio files for all created commands.

Speech Commands
To develop a robust and flexible voice interaction module, it is essential to consider the nature of user requests. Therefore, we analyzed variations in speech commands and semantics for each command type (see Section 3.2). Different users use various combinations to address the same functionality. For example, "next instruction", "display the next instruction", "can you show me the next task" map to the same command-"next instruction". Moreover, end-users of the application use English and Korean languages to have a voice interaction with the MR application. Taking into consideration all these factors and goals, we made a list of potential user speech requests based on command type and language. Figure 6 summarizes the distribution of created speech commands.
Dataset creation and collection for our task was performed in three stages (see Figu 5). First, we analyzed the semantic structure of user commands in English and Korean a created a list of potential user requests. Second, based on the list, we collected audio cordings of these speech commands using Text-to-speech (TTS) application programmi interface (API) services and asked several people to speak them out. Third, in order bridge the gap between API and human recordings, we applied various audio augmen tion techniques to the TTS-generated data.

Speech Commands
To develop a robust and flexible voice interaction module, it is essential to consid the nature of user requests. Therefore, we analyzed variations in speech commands a semantics for each command type (see Section 3.2). Different users use various combin tions to address the same functionality. For example, "next instruction", "display the n instruction", "can you show me the next task" map to the same command-"next instru tion". Moreover, end-users of the application use English and Korean languages to hav voice interaction with the MR application. Taking into consideration all these factors a goals, we made a list of potential user speech requests based on command type and la guage. Figure 6 summarizes the distribution of created speech commands. Overall, there are 541 commands in English and 623 in the Korean language th mapped to one of eight command types. As can be seen from the charts, the distributi of commands is nearly balanced, having on average about 145 commands for each co mand type in English and Korean languages in total. In the next stage, we recorded aud files for all created commands. Overall, there are 541 commands in English and 623 in the Korean language that mapped to one of eight command types. As can be seen from the charts, the distribution of commands is nearly balanced, having on average about 145 commands for each command type in English and Korean languages in total. In the next stage, we recorded audio files for all created commands.

Audio Data
The audio dataset collection is based on created speech commands. We used two methods to record speech audio: using existing text-to-speech (TTS) API services to create audio recordings, and employing people to record spoken commands.
For most of the audio collection, we applied API services since this is the cheapest approach. The ones we used were Google Cloud text-to-Speech [36], Microsoft Azure text-to-speech [37], Naver speech synthesis [38], and Kakao speech API [39]. In total, we applied 83 voices-60 English and 23 Korean.
Next, we involved people to record spoken commands: 15 volunteers for English language commands and seven participants for Korean.
Overall, for English commands, we have 60 API, and 15 real people voices, which makes 40,575 (75 × 541) recordings, and for Korean ones, 23 API, and seven real voices, which is 18,690 (30 × 623). Therefore, overall, we collected 59,265 audio recordings as a dataset for all types of commands. Those recordings have a duration of about one to four seconds, which makes the dataset about 24.7 hours long. The voices of API and volunteers have different accents and originated from countries including the USA, Australia, Canada, India, Ireland, United Kingdom, Korea, Uzbekistan. Thus, the dataset has a wide variety of spoken examples for the same phrases, which makes the speech model adapted to variations. Figure 7 summarizes the amount and distribution of audio data.
Next, we involved people to record spoken commands: 15 volunteers for English language commands and seven participants for Korean.
Overall, for English commands, we have 60 API, and 15 real people voices, which makes 40,575 (75 × 541) recordings, and for Korean ones, 23 API, and seven real voices, which is 18,690 (30 × 623). Therefore, overall, we collected 59,265 audio recordings as a dataset for all types of commands. Those recordings have a duration of about one to four seconds, which makes the dataset about 24.7 hours long. The voices of API and volunteers have different accents and originated from countries including the USA, Australia, Canada, India, Ireland, United Kingdom, Korea, Uzbekistan. Thus, the dataset has a wide variety of spoken examples for the same phrases, which makes the speech model adapted to variations. Figure 7 summarizes the amount and distribution of audio data. As can be seen from the charts in Figure 7, API audio data make about 78.1% of the dataset, where the rest 21.9% are spoken commands of employed people. The amount of Korean data is two times less than English, but the proportion of API and real data remains similar. However, having speech API-synthesized audio recordings has its own issues.

Data Augmentation
Text-to-speech API-generated audio recordings differ from the recordings of participants, and we solved this issue using data augmentation. Generally, data augmentation itself is a method to improve the performance of deep neural networks since it helps to generalize a model and avoid overfitting [40], however, we used audio augmentation to bridge the gap between distributions of real and synthesized data. Overall, we found two main issues to address.
First, API-generated data has no background noise. Usually, when audio is recorded so-called white noise is always present in a recording. However, in the case of synthetic recordings, the audio is too clear, which makes API data distribution vary from real data. To solve this issue, we manually added various types of background noise (i.e., white noise, street noise, crowd noise, etc.) to the API-generated data to make them more realistic. As in [13], we added noise with random volume to have more variations.
Second, text-to-speech API data has a static playback starting point. In recordings made by humans, the actual point of time when a person begins to speak varies from 0.5 to 1.5 s. This makes people's audio recordings diverse. In contrast, in the case of API data, As can be seen from the charts in Figure 7, API audio data make about 78.1% of the dataset, where the rest 21.9% are spoken commands of employed people. The amount of Korean data is two times less than English, but the proportion of API and real data remains similar. However, having speech API-synthesized audio recordings has its own issues.

Data Augmentation
Text-to-speech API-generated audio recordings differ from the recordings of participants, and we solved this issue using data augmentation. Generally, data augmentation itself is a method to improve the performance of deep neural networks since it helps to generalize a model and avoid overfitting [40], however, we used audio augmentation to bridge the gap between distributions of real and synthesized data. Overall, we found two main issues to address.
First, API-generated data has no background noise. Usually, when audio is recorded so-called white noise is always present in a recording. However, in the case of synthetic recordings, the audio is too clear, which makes API data distribution vary from real data. To solve this issue, we manually added various types of background noise (i.e., white noise, street noise, crowd noise, etc.) to the API-generated data to make them more realistic. As in [13], we added noise with random volume to have more variations.
Second, text-to-speech API data has a static playback starting point. In recordings made by humans, the actual point of time when a person begins to speak varies from 0.5 to 1.5 s. This makes people's audio recordings diverse. In contrast, in the case of API data, the starting point of playback is static and usually starts at zero seconds, which is unrealistic in comparison with human data. To handle this problem, we randomly shifted each API recording to have a diverse beginning time of playback. This approach made API audio recordings sound even more like real ones.
The above-mentioned audio augmentation techniques made the real and API distributions closer, and recordings of both types sound similar now. Hence, by finishing the dataset collection stages described in Figure 5, as a result, we have a speech commands dataset in the form of audio files, each having a sampling rate of 16 kHz and a bit rate of 256 kbit/s. Next, we created and trained our speech model.

Speech Interaction Module
The speech interaction module in our mixed reality maintenance education plays the role of an assistant that can help to navigate during the maintenance process and deliver and illustrate the required materials just at a speech request, which provides the ability for multi-tasking, and a feeling of supervision. Figure 2 shows the speech interaction flow. A user makes a speech request and the speech interaction module identifies the command and language it was spoken in. Next, based on the predictions, MR provides requested materials and takes a trainee further in stages by giving corresponding feedback based on the requested command and language.
The proposed speech interaction module action flow consists of two steps: speech command retrieval and speech command processing (see Figure 8). In the speech command retrieval, the module first accepts a request for a command. Following approaches of voice assistants such as Apple Siri [41] ("Hey, Siri") or Google [42] ("Hey, Google"), we used our own trigger "Hey, AK" (AK is short for Augmented Knowledge Corp., the company that supported our work and provided Microsoft HoloLens 2 smart glasses and virtual 3D assets of the Boeing 737). Thus, whenever a user wants to give a command, "Hey, AK" should be spoken, and it is then detected by the keyword recognizer, which gives the notification sound that denotes recording, and next triggers the audio recording using a microphone with a duration of n seconds. In our case, four seconds was sufficient for an average request. The speech command retrieval produces an audio speech command that is processed in the next stage.
API recording to have a diverse beginning time of playback. This approach made API audio recordings sound even more like real ones.
The above-mentioned audio augmentation techniques made the real and API distributions closer, and recordings of both types sound similar now. Hence, by finishing the dataset collection stages described in Figure 5, as a result, we have a speech commands dataset in the form of audio files, each having a sampling rate of 16 kHz and a bit rate of 256 kbit/s. Next, we created and trained our speech model.

Speech Interaction Module
The speech interaction module in our mixed reality maintenance education plays the role of an assistant that can help to navigate during the maintenance process and deliver and illustrate the required materials just at a speech request, which provides the ability for multi-tasking, and a feeling of supervision. Figure 2 shows the speech interaction flow. A user makes a speech request and the speech interaction module identifies the command and language it was spoken in. Next, based on the predictions, MR provides requested materials and takes a trainee further in stages by giving corresponding feedback based on the requested command and language.
The proposed speech interaction module action flow consists of two steps: speech command retrieval and speech command processing (see Figure 8). In the speech command retrieval, the module first accepts a request for a command. Following approaches of voice assistants such as Apple Siri [41] ("Hey, Siri") or Google [42] ("Hey, Google"), we used our own trigger "Hey, AK" (AK is short for Augmented Knowledge Corp., the company that supported our work and provided Microsoft HoloLens 2 smart glasses and virtual 3D assets of the Boeing 737). Thus, whenever a user wants to give a command, "Hey, AK" should be spoken, and it is then detected by the keyword recognizer, which gives the notification sound that denotes recording, and next triggers the audio recording using a microphone with a duration of n seconds. In our case, four seconds was sufficient for an average request. The speech command retrieval produces an audio speech command that is processed in the next stage. In the speech command processing step, we extract audio features. We use mel frequency cepstral coefficients (MFCCs) as audio representation. Next, using our speech interaction model, we predict the requested command type and the language in which the command was spoken. Our deep learning network is divided into three parts: CNN base (convolutional neural network architecture for audio features learning), commands, and In the speech command processing step, we extract audio features. We use mel frequency cepstral coefficients (MFCCs) as audio representation. Next, using our speech interaction model, we predict the requested command type and the language in which the command was spoken. Our deep learning network is divided into three parts: CNN base (convolutional neural network architecture for audio features learning), commands, and language classification (classifier networks that make predictions according to audio features learned in CNN base). Based on the command classification result, the MR application performs a particular action, and voice feedback given in the language predicted by the language classifier (i.e., if a user makes a speech request in Korean, the response from the application will be in Korean, accordingly). In the case of when a request mixes a few Korean words and technical English terms or any Korean word is pronounced, the model will classify the language as Korean.
To train the model on our custom dataset, which consists of complex speech requests in two languages that sound completely different, we extracted audio features as data representation and presented training methods that played a significant role in network convergence. Let us start with audio data representation.

Audio Features Extraction
There are many ways to extract audio features; however, based on our research, MFFCs or mel frequency cepstral coefficients tend to suit the best for linguistic content analysis [26]. The job of MFCCs is to represent a sound's power spectrum produced by a human's vocal tract. MFCCs preserve information about spoken content and discards carried background noise. MFCC features are robust and reliable to a speaker's variations and recording conditions [43]. What is more, MFCCs correlate audio features well, which perfectly suits the deep learning approach [13]. Therefore, MFCCs are widely used in recent speech recognition methods. Thus, taking advantage of mel frequency cepstral coefficients, we extracted audio features from spoken commands. Figure 9 shows the illustration of audio and its feature forms. Taking the example of the speech command "Go to the next instruction", Figure 9a is a wave plot of the recorded audio, Figure 9b is its representation in spectrogram form, and finally, Figure 9c is the MFCC features of audio we applied in our work.
in two languages that sound completely different, we extracted audio features as data representation and presented training methods that played a significant role in network convergence. Let us start with audio data representation.

Audio Features Extraction
There are many ways to extract audio features; however, based on our research, MFFCs or mel frequency cepstral coefficients tend to suit the best for linguistic content analysis [26]. The job of MFCCs is to represent a sound's power spectrum produced by a human's vocal tract. MFCCs preserve information about spoken content and discards carried background noise. MFCC features are robust and reliable to a speaker's variations and recording conditions [43]. What is more, MFCCs correlate audio features well, which perfectly suits the deep learning approach [13]. Therefore, MFCCs are widely used in recent speech recognition methods. Thus, taking advantage of mel frequency cepstral coefficients, we extracted audio features from spoken commands. Figure 9 shows the illustration of audio and its feature forms. Taking the example of the speech command "Go to the next instruction", Figure 9a is a wave plot of the recorded audio, Figure 9b is its representation in spectrogram form, and finally, Figure 9c is the MFCC features of audio we applied in our work. To be precise, in our work, we applied the following parameters that have been used to extract MFCCs from audio: Each audio file that we processed was scaled down to 204x13 dimensions, where 13 is the number of MFCCs, and 204 is the resultant number of frames after sampling with the hop length. Basically, 12 to 20 cepstral coefficients are typically optimal for speech analysis and an increasing number of MFCCs give more details and, at the same time, results in more complexity in the models [44]. Empirically, the mentioned parameters worked best in our task. Moving on, according to the audio features dimensions, we built the deep learning model for the speech interaction module.

Model Architecture
The architecture of the proposed speech interaction module network consists of three parts (see Figure 10). First is the convolutional neural network (CNN) base, which takes To be precise, in our work, we applied the following parameters that have been used to extract MFCCs from audio: Each audio file that we processed was scaled down to 204x13 dimensions, where 13 is the number of MFCCs, and 204 is the resultant number of frames after sampling with the hop length. Basically, 12 to 20 cepstral coefficients are typically optimal for speech analysis and an increasing number of MFCCs give more details and, at the same time, results in more complexity in the models [44]. Empirically, the mentioned parameters worked best in our task. Moving on, according to the audio features dimensions, we built the deep learning model for the speech interaction module.

Model Architecture
The architecture of the proposed speech interaction module network consists of three parts (see Figure 10). First is the convolutional neural network (CNN) base, which takes the audio features vector of MFCCs passed as an input. Since we deal with two-dimensional MFCC features, we applied respective 2D dimensions for the convolutional layer filters. The CNN base consists of stacks of convolutional, batch normalization, max pooling, and dropout layers. As can be seen in Figure 10, Conv2D has three parameters, which are the number of filters (64), kernel size (3), and the activation function applied (ReLU) at the end of the layer. In the case of max pooling, parameters denote pool size and stride, which are 2 for both. In addition, the dropout shows the dropout rate equals 0.3. At last, the flatten layer reshapes the input to be one-dimensional to pass the output for classification. Overall, the CNN base gathers features present in the audio, and passes them for classification parts.
ing, and dropout layers. As can be seen in Figure 10, Conv2D has three parameters, which are the number of filters (64), kernel size (3), and the activation function applied (ReLU) at the end of the layer. In the case of max pooling, parameters denote pool size and stride, which are 2 for both. In addition, the dropout shows the dropout rate equals 0.3. At last, the flatten layer reshapes the input to be one-dimensional to pass the output for classification. Overall, the CNN base gathers features present in the audio, and passes them for classification parts. Third, the language classification layers, that take the output from the CNN base to predict the language in the input command was spoken. It is similar to the commands classification part; however, it has a lower number of dense layers. The output vector of this part is the predicted language with the shape of 1x2, where 2 is the score for each language (English and Korean).
All three parts combined make the architecture of the proposed model, which consists of only 181,066 trainable and 384 non-trainable parameters, respectively. However, building the architecture is only one part of the work; the next thing is to find proper training techniques to make it work with complex characteristics of the dataset.

Model Training
Training deep learning models requires a proper pick of methods based on the dataset and overall task. In our work, we use data in English and Korean languages with diverse spoken commands. In addition, we want the model to work in a real-world environment, which has various background noises and new human voices with their unique characteristics. Thus, the most essential goal we have is to avoid overfitting. To handle this task, we applied weight decay, which applies regularization for model weights [45]. This approach improves the generalization of the model and makes it perform well on data that is not present in the dataset.
Next, we found that using the cosine annealing scheduler [46] for setting a proper learning rate helped a lot for the model to converge [14,47]. The general working principle of the cosine annealing scheduler is setting the learning rate starting with a large value that is relatively rapidly decreased to a minimum value before being increased rapidly again [46]. With each restart, the scheduler tries to start with the weights that contribute to the training most; therefore, this technique is referred to as a "warm restart".
Talking about parameters such as loss and optimizer, for each command and language classification part, we applied categorical cross entropy loss, having loss weights The second part is the command classification. This part of the network is devoted to the classification of any command requested by a user. The stack of layers consists of the fully connected layers, which are denoted by dense and the number of units in it (64 or 8), the dropout (0.5), and the Softmax activation function, which gives the final output vector of shape 1 × 8, where 8 is a representation for each command type.
Third, the language classification layers, that take the output from the CNN base to predict the language in the input command was spoken. It is similar to the commands classification part; however, it has a lower number of dense layers. The output vector of this part is the predicted language with the shape of 1x2, where 2 is the score for each language (English and Korean).
All three parts combined make the architecture of the proposed model, which consists of only 181,066 trainable and 384 non-trainable parameters, respectively. However, building the architecture is only one part of the work; the next thing is to find proper training techniques to make it work with complex characteristics of the dataset.

Model Training
Training deep learning models requires a proper pick of methods based on the dataset and overall task. In our work, we use data in English and Korean languages with diverse spoken commands. In addition, we want the model to work in a real-world environment, which has various background noises and new human voices with their unique characteristics. Thus, the most essential goal we have is to avoid overfitting. To handle this task, we applied weight decay, which applies regularization for model weights [45]. This approach improves the generalization of the model and makes it perform well on data that is not present in the dataset.
Next, we found that using the cosine annealing scheduler [46] for setting a proper learning rate helped a lot for the model to converge [14,47]. The general working principle of the cosine annealing scheduler is setting the learning rate starting with a large value that is relatively rapidly decreased to a minimum value before being increased rapidly again [46]. With each restart, the scheduler tries to start with the weights that contribute to the training most; therefore, this technique is referred to as a "warm restart".
Talking about parameters such as loss and optimizer, for each command and language classification part, we applied categorical cross entropy loss, having loss weights to be equally distributed. Moreover, Adam optimizer was applied, having the following hyperparameters: Beta 1-0.9, Beta 2-0.999, Epsilon-1e-07.
Overall, the mentioned techniques helped the model to converge and learn diverse features of the dataset. In the next section, we discuss the hardware and software setup, and the time taken for training and inference of the network.

Hardware and Software Environment
To set up a deep learning development environment for our work, we used a PC with the following specifications and software: The proposed network was trained for 50 epochs, and with all the cross-validation and checkpoint model savings, training time took about six minutes, reaching a training accuracy of 99% by the end of the process.
The created model in the h5 extension was exported to ONNX format and imported to the Unity framework [49] with Barracuda 1.12.1 (cross-platform neural network inference library for Unity [50]), where the mixed reality application is built. After exporting to the corresponding format, the model has a 716KB disk memory allocation size. For the development of the mixed reality application that later deployed to HoloLens 2, we used a PC with the following specifications and software: Next, the application with the speech interaction module deployed to Microsoft HoloLens 2 (untethered holographic computer or new generation smart glasses [10]) to check the inference time of the model, which was about 40 to 60 milliseconds to process a speech request. Overall, the presented details show that the proposed model requires a small disk size and has a real-time inference. Next, we performed the evaluation process to assess the performance of the model on test data.

Evaluation of Speech Interaction Module
Once the model is trained, it needs to be assessed based on certain evaluation metrics regarding different characteristics of machine learning algorithms [51]. Since our network deals with classification task, accordingly, we used classification metrics such as accuracy, precision, recall, and F1-Score. We described in detail the intuition behind these assessment methods and how they work and summarized the results of experiments. The evaluation process started with the testing data preparation.

Test Data
To perform the evaluation process of the proposed speech interaction model, first, we created the test sets. We randomly picked 2000 audio samples from our dataset. The network was not trained on these audios; however, it might see the other samples of the same voice. We denoted this set as test data. Next, since we have two languages, we decided to assess performance for each one with real people's data, which the network has not seen. Therefore, we applied two sets of spoken commands by humans for English and Korean data. One set consists of all spoken commands by a person. Thus, for English data, we used 541 test audio samples, and 623 for Korean. We denoted these sets as test English speaker and test Korean speaker. Next, we discuss the evaluation metrics that were be used for testing.

Evaluation Metrics
In this work, we used accuracy, precision, recall, and F1-Score classification metrics to evaluate the speech recognition model. First, let us define notations: • True Positive (TP)-correct prediction: The actual value was true, and the model predicted true.
Based on these notations, we explained the evaluation metrics below: • Accuracy: the ratio of accurately classified predictions to the total number of predictions (see Formula (1)). Accuracy = (TP + TN)/(TP + TN + FP + FN) Accuracy is a technique to evaluate the classification model and works best in the case of a balanced dataset [52]. Since our test data followed the overall distribution of speech commands (see Figure 6), which is nearly balanced, it is appropriate to use accuracy.
When recall is low, then the number of false negatives is high. • F1-score: This is a combination of precision and recall scores, making a weighted average between them. F1-score is described in Formula (4): F1-score is useful when having an unbalanced test dataset. A high F1 score means that you have low false positives and low false negatives, thus, correctly identifying true classes. • Confusion Matrix: a table that shows a visualization of the performance of an algorithm. Each row of the matrix represents the samples of a predicted class, whereas each column shows the instances in an actual class (or vice versa) [53]. To easily see the good performance of the matrix, it should be diagonal, so that true and predicted classes must match, and the rest of positions are zero.
These classification metrics were used to evaluate the model. We described experimental results in the next section.

Results
Based on the performance of the proposed model experimented on test data, we evaluated our network using classification metrics such as accuracy, precision, recall, and F1-score. Since we have two classification outputs (command and language), we evaluated each separately. Table 1 summarizes the assessment for the commands classification part. As can be seen, performance on test data is near perfect with a 99% success rate in each metric. However, in the case of English and Korean test voices, results are about five percent lower but still great with 94.5% on average for each metric. Results with test data show better performance because the network might have seen some voice characteristics of test samples (other data of the same voice); however, still, it was not trained on this data. In the case of test English and Korean data, these voice features were totally new, and the network has not seen similar audio samples of these voices. To sum up this experiment, the results of the best model in the form of the confusion matrix are illustrated in Table 2 for test data, Table 3 for test English data, and Table 4 for test Korean data, which shows the exact amount of correct and incorrect predictions.       0  0  2  1  1  77  1  2  Pause  1  0  0  1  1  1  55  3  Stop  0  0  0  1  5  3  5  52 Next, we demonstrated the results of the language classification part in Table 5. This experiment shows that, for all test sets, the proposed speech commands classification network performed nearly perfectly, with about 99% for all metrics. Since we have only two languages, this task is a binary classification; therefore, it was easily handled by the network.

IPC
The experiment results demonstrated, that for our task, we collected a proper dataset that was enough to train the proposed network and satisfy our needs. In addition, the model successfully passed the evaluation and tended to identify the commands' type accurately. In all presented metrics (accuracy, precision, recall, and F1-Score), the model performed nearly similar, which means the network can successfully predict voice commands regarding any class.

Conclusions and Future Works
To conclude, in this work, we performed a forward step towards the aircraft maintenance metaverse, creating a mixed reality digital place for Boeing 737 maintenance education enhanced with a speech interaction module in the Microsoft HoloLens 2 smart glasses. Using a virtual 3D twin of the aircraft and supportive aviation-specific materials, the proposed MR application provides guidance and supervision during the maintenance process, delivering trainees step-by-step instructions, mixed reality visualizations, reference videos, and aircraft operation manuals to complete MRO tasks. Moreover, the developed speech interaction module, which is based on the deep learning model integrated into the HoloLens 2, enables users to intuitively deal with virtual objects in the aircraft metaverse and seamlessly navigate the maintenance process using speech commands without being distracted from the workflow by pressing buttons or using hand gestures, which is essential when hands are busy with equipment.
To develop the speech interaction model, firstly, based on the defined requirements of the application, we collected a bilingual speech commands dataset in English and Korean languages with domain-specific vocabulary, which included a variety of requests to invoke functions. Secondly, for a core of the speech interaction module, we built and trained a unique deep learning-based CNN network with two outputs that recognizes a referenced command and its language from extracted audio features of the user's speech request. Having a small number of training parameters and low memory size, the model is lightweight and inferences in real-time, which perfectly suits standalone devices such as smart glasses. It was successfully integrated into the Microsoft HoloLens 2, replacing built-in voice interaction. Lastly, to assess the performance of the speech interaction module, we conducted the evaluation using classification metrics and various test data sets. The experimental results based on English and Korean test data showed that the speech interaction model can accurately recognize commands along with the language, achieving on average 95.7 and 99.6% on the F1-Score metric for command and language prediction, respectively. As a result, the proposed speech interaction module in the aircraft maintenance metaverse further improved the aircraft maintenance training by giving intuitive and efficient control over the operation, empowering voice interaction in mixed reality smart glasses.
In the future, we want to further move forward in developing a metaverse for aircraft maintenance and education, by allowing engineers and trainees located in different places to simultaneously engage with MR experiences in created virtual spaces. What is more, we plan to enlarge the variety of referencing speech commands and application functions to deal with complex interactions based on contextual information present in a certain scene, where questions and answers are given according to the knowledge base.