Air Traffic Controller Fatigue Detection by Applying a Dual-Stream Convolutional Neural Network to the Fusion of Radiotelephony and Facial Data

: The role of air traffic controllers is to direct and manage highly dynamic flights. Their work requires both efficiency and accuracy. Previous studies have shown that fatigue in air traffic controllers can impair their work ability and even threaten flight safety, which makes it necessary to carry out research into how to optimally detect fatigue in controllers. Compared with single-modality fatigue detection methods, multi-modal detection methods can fully utilize the complementarity between diverse types of information. Considering the negative impacts of contact-based fatigue detection methods on the work performed by air traffic controllers, this paper proposes a novel AF dual-stream convolutional neural network (CNN) architecture that simultaneously extracts controller radio telephony fatigue features and facial fatigue features and performs two-class feature-fusion discrimination. This study designed two independent convolutional processes for facial images and radio telephony data and performed feature-level fusion of the extracted radio telephony and facial image features in the fully connected layer, with the fused features transmitted to the classifier for fatigue state discrimination. The experimental results show that the detection accuracy of radio telephony features under a single modality was 62.88%, the detection accuracy of facial images was 96.0%, and the detection accuracy of the proposed AF dual-stream CNN network architecture reached 98.03% and also converged faster. In summary, a dual-stream network architecture based on facial data and radio telephony data is proposed for fatigue detection that is faster and more accurate than the other methods assessed in this study.


Introduction
The subsiding of the COVID-19 epidemic and the associated ongoing increase in the number of flights is further increasing the workload of air traffic controllers, thereby increasing the problem of controller fatigue [1].At busy airports, controllers who continuously issue control instructions are likely to experience fatigue symptoms such as dry mouth and difficulty speaking [2].Studies have shown that fatigue can significantly reduce the reaction speed, judgment accuracy, and decision-making ability of controllers.Kelly D. [3] studied the human factors in some aviation accidents from 2007 to 2017 and found that fatigue is an important cause of accidents.Abd-Elfattah H. M.'s research shows that fatigue has a negative impact on human perception and decision making [4].Fatigue may cause errors, omissions, and forgetfulness in the work of controllers, thereby threatening the safe operation of flights [5].In September 2011, a controller in Japan fell asleep while on duty in the early morning, causing an incoming plane to lose contact with the ground for more than 10 min.Civil aviation safety incidents have occurred due to controller fatigue resulting in sleeping on duty, indicating that controller fatigue has always been a potential threat to the efficient and safe operation of civil aviation.
Interventions to effectively curb the negative impacts of fatigue require research into controller fatigue detection.Fatigue detection methods are usually divided into subjective and objective methods: subjective methods usually involve questionnaires, while objective methods are based on objective data such as physiological indicators of the subjects.Objective detection methods can also be divided into contact-based and non-contact-based methods according to whether the detection equipment makes physical contact with the subject.Based on detection indicators, fatigue detection methods can also be divided into single-modality detection methods based on a single data source such as audio or facial data and multi-modal detection methods based on multiple data sources.
Contact-based fatigue detection equipment can interfere with the work of controllers, which has prompted many researchers to investigate non-contact-based fatigue detection methods.Researchers have found that fatigue is associated with numerous facial features such as eye closure rate, eyelid distance, percentage of eye open [6], blinking frequency, mouth breathing [7,8], and other facial features [9,10].Liang [11] analyzed eye features when controllers were working and proposed a deep-fusion neural network for eye position and eye state detection.Deng et al. [12] studied the relationship between the percentage of the pupil covered by the eyelid over time and the fatigue state of controllers.Li K. [13] focused on analyzing the information of the eyes and mouth and fused the fatigue information from different facial regions through multi-source fusion, proposing an accurate recognition algorithm called Recognizing the Drowsy Expression (REDE).Zhang et al. [14] revealed the fatigue state of controllers based on changes in pupil size.These studies have shown that fatigue information is indeed available from facial features.However, fatigue is not only reflected in the eyes, so research focusing only on the eye area ignores other information available from the face.
Many studies on the facial fatigue features of car drivers have shown that actions such as mouth breathing contain significant fatigue information [15].Devi et al. [16] proposed a fatigue state discrimination method based on a fuzzy inference system to perform car driver fatigue state discrimination by fusing mouth breathing and the fatigue state of the eyes.Li et al. [17] improved the Tiny YOLOV3 convolutional neural network and evaluated the fatigue state of car drivers based on both eye and mouth features.
In contrast to car drivers, air traffic controllers need to speak as an integral part of their workflow, which provides information for fatigue state detection.Moreover, collecting radio telephony data has little impact on the work of controllers, making such data ideal for controller fatigue detection.Audio features such as hesitations, silent pauses, prolongation of final syllables, and the syllable articulation rate in a fatigue state differ significantly from those in a normal state [18], thus confirming the possibility of analyzing controller fatigue state through audio analysis.Wu [19] proposed an audio fatigue detection algorithm based on traditional Mel-Frequency Cepstral Coefficients (MFCC) and added an adaptive mechanism to the algorithm to successfully classify the fatigue audio of air traffic controllers.Shen and others [20] revealed significant differences in the fractal dimensions of radio telephony under different fatigue states.He [21] analyzed the radio telephony of controllers under different fatigue states and found that features such as the audio rate and pitch can be utilized by a k-means++ algorithm to classify the fatigue state.Shen [22] applied a densely connected convolutional autoencoder to neural networks to classify fatigue radio telephony.
In addition, there are fatigue detection studies based on other data sources.We have summarized the fatigue detection methods according to the data sources of fatigue detection in Table 1.In comparison, multi-modal fatigue state detection data contain richer and moredetailed fatigue information since they are affected by multiple aspects of the fatigue state.The processing of weights for different types of modal information is a major difficulty when fusing multi-modal information.In order to dynamically adjust the degree of influence of two types of features on the detection results, the fusion of audio features and facial features can be weighted separately as δ sum = ϑδ v + (1 − ϑ)δ a and weighted product , where the weight factor ϑ ∈ [0, 1] and δ v and δ a are facial features and audio features, respectively [23].Authors have adjusted the weight factors through experiments to achieve the optimal fusion of audio features and facial features and used δ sum and δ prod as the fusion features.However, manual adjustments are both time-consuming and inefficient.
In order to solve the problem of accurately detecting the fatigue state of controllers without affecting their work, this paper proposes a multi-modal fatigue feature detection network for controllers based on audio data and facial data.The proposed network applies two independent convolution processes to audio data and facial data, fuses audio features with facial features at the feature level, and finally realizes fatigue state discrimination using the Softmax function.Compared with the weighted feature-fusion method, the feedforward neural network can automatically correct the weights of each neuron during backpropagation, not only by dynamically adjusting the weights between different modalities of features but also between different features within the same modality and between different elements within the same feature.This approach can reveal key information in multi-modal features and improve the accuracy of fatigue state discrimination.
This paper was organized as follows: Section 2 introduces the proposed AF dualstream CNN architecture, explaining the processing of audio data and facial data as well as the fusion and discrimination steps of the two features.Section 3 introduces comparative experiments of the AF dual-stream CNN to verify the effectiveness of the network.Finally, conclusions are drawn and future work is discussed in Section 4.

AF Dual-Stream CNN: A Dual-Stream CNN for Audio and Facial Images
This section first introduces the extraction of audio features in the dual-stream network and then introduces the proposed AF dual-stream CNN architecture.

Audio Feature Extraction
Various vocal feature extraction methods have different focuses on the features they extract.Therefore, the simultaneous use of multiple vocal feature extraction methods can comprehensively reflect the differences before and after vocal fatigue.This study selected five commonly used audio features from five perspectives as audio fatigue features [24], as briefly introduced below.

(a) Zero-crossing rate
In the waveform diagram, the zero-crossing rate (ZCR) [25] represents the number of times the waveform crosses the X-axis.The short-time zero-crossing rate of an audio single A i is calculated as where sgn(n) is the following sign function: The w(n) function is This feature can be used to represent spectral and noise changes.

(b) Chromagram
Chromaticity features [26] are collectively referred to as the chroma vector and the chromagram.The chromaticity vector is a vector containing 12 elements, which represent the energy in 12 levels within a period of time.Energy of the same level in different octaves is accumulated.The chromaticity diagram is a sequence of chromaticity vectors.The chroma features of an audio sample A i are represented as where FFT is the fast Fourier transform.
C(A i ) can capture harmonic information in an audio signal and has high robustness.
(c) Mel-frequency cepstral coefficients MFCC [27] are a widely used cepstral parameter extracted in the mel-scale frequency domain: The MFCC feature of an audio signal A i can be obtained by applying preprocessing, the fast Fourier transform, triangular mel-filter processing, and the discrete Fourier transform.This feature conforms to the auditory characteristics of human ears and can convert raw audio into separable and recognizable feature vectors.

(d) Root mean square
The size of a frame of signals can be quantified as its root mean square (RMS) value [28], which is essentially a set of arithmetic mean values: where 2 represents the average energy at all sampling points in frame t.RMS(A i ) has the advantage of not being sensitive to outliers.

(e) Mel spectrogram
The mel-spectrogram [29] is calculated by mapping the power spectrum onto the mel frequency scale.This can capture the spectral information of audio signals and reflect changes therein over time.The mel-spectrogram of an audio signal A i is We concatenate the five types of audio features of an audio signal to obtain its feature matrix A i f : We then input the combined audio features A i f of each audio A i into the designed audio data stream network.
In Figure 1, blue, yellow, and green, respectively, represent the distribution of voice feature values when "Notfatigue", "Mildfatigue", and "Fatigue" are present, and we can see that the voice features of the three fatigue states are intertwined with each other, so the classification of fatigue speech is difficult.The mel-spectrogram [29] is calculated by mapping the power spectrum onto the mel frequency scale.This can capture the spectral information of audio signals and reflect changes therein over time.The mel-spectrogram of an audio signal   is We concatenate the five types of audio features of an audio signal to obtain its feature matrix   : We then input the combined audio features   of each audio   into the designed audio data stream network.
In Figure 1, blue, yellow, and green, respectively, represent the distribution of voice feature values when "Notfatigue", "Mildfatigue", and "Fatigue" are present, and we can see that the voice features of the three fatigue states are intertwined with each other, so the classification of fatigue speech is difficult.

AF Dual-Stream CNN
In order to accurately and rapidly detect a controller's fatigue state, the characteristics of the controller's work [30]

AF Dual-Stream CNN
In order to accurately and rapidly detect a controller's fatigue state, the characteristics of the controller's work [30] are utilized in this paper to propose a fatigue detection network AF dual-stream CNN based on audio data and facial data.The network architecture includes an audio convolution module, a facial convolution module, and a fully connected layer.
(a) Audio data stream: convolution module based on audio data In Section 2.1, we introduced the five types of features of audio.In our model, we pass audio feature A i f to a one-dimensional audio convolution module that includes four convolution layers and four pooling layers.The audio features are used as audio fatigue features after convolution processing.The essence of the audio convolution module is performing function mapping with A i f as an independent variable so that the convolution operation of the audio data stream can be recorded as function ADS A i f , and the convolution operation process of the audio data stream is where audio fatigue feature A f f is produced by the convolution processing of the audio features.
(b) Facial data stream: convolution module based on facial data We designed a two-dimensional convolution module for processing facial features.This convolution module first performs three convolution operations, followed by pooling operations, two further convolution operations, and, finally, more pooling operations.Suppose a facial picture with a resolution of n × n P i is denoted as P n×n : where x i,j ∈ (0, 255) and i, j ∈ (1, n); x i,j represents the value of a pixel in a facial image.This means that the independent variable in the convolution operation of the facial data stream is P n×n , and the convolution mapping of the facial data stream can be written as where F f f is an m×m matrix representing the facial fatigue features after convolution processing of the facial image.In our network, n = 48 and m = 12.
(c) Feature fusion and fatigue state discrimination The processing results for the audio and facial data streams are fused in the fully connected layer, and, finally, the fused features are input into the Softmax classifier for classifying three fatigue states.
For a facial image output x m,m    and then perform feature concatenation in the input layer of the fully connected layer.The feature concatenation process is as follows: We then input fatigue feature FF f containing audio information and facial information into the trained fully connected layer for feature fusion.The fully connected layer as a function FCL F f results in the process of feature fusion and detection being described as where result ∈ {NoFatigue, MildFatigue, Fatigue}.The detailed network architecture is shown in Figure 2. The detailed algorithm process is shown in Algorithm 1.

′
Step 1: Fully connected layer update For E = 1 to the number of iterations, train the fully connected layer using F f of each A j and P j in the training set; according to the difference between the input and output labels, update the parameters in the fully connected layer using a backpropagation algorithm;

End Step 2: Fatigue state discrimination
Initialize FF of A i and P i according to Equation ( 13).Output the fatigue label according to Equation ( 14): result = Solftmax(FF)

Fatigue Detection Experiments
This section introduces the experimental environment, including the experimental data and parameter settings of the proposed network used in the experiments.The comparative experiments performed with other mainstream methods are also described.

Experimental Setup
In collaboration with the Jiangsu Air Traffic Control Bureau, we collected radio telephony data and facial data from 14 certified air traffic controllers while they were working.We used the same aircraft model as the Jiangsu Air Traffic Control Bureau to ensure that the simulation environment was identical to the actual working environment.The data collection experiments were conducted on a control simulator produced by China Electronics Technology Group Corporation.The control simulator system was divided into two parts: the tower seat and the captain's seat.The experimental environment of the control tower seat is shown in Figure 3.The subjects were located in the tower seat, and the equipment used included apron display screens, electronic progress sheet display screens, control call recording equipment, and facial data collection cameras.The subjects comprised seven males and seven females, all of whom were licensed tower controllers in the Air Traffic Management Bureau and came from East China; hence, their radio telephony had similar pronunciation characteristics.They had between 3 and 8 years of work experience, and the workload of the air traffic controllers in the simulation scenario was similar to the workload in their actual working environment.All the subjects had rested sufficiently (>7 h) for the two nights before the experiments and were prohibited from consuming food or alcohol that might affect the experimental results.All the experimenters were fully familiar with the control simulator system.All the subjects were informed of the experimental content and had the right to withdraw from participating in the experiments at any time.The experiments were conducted at 14:00 every day from the 4 of April to the 27 of April 2023.During this period, one controller was assigned to conduct experiments in the control seat every day.During each experimental day, we asked the subjects to complete six sets of control simulation tasks, each set of which had a volume of 20 flights and lasted for 30 min, and with radio telephony and facial data only recorded while they were in the control seat during the experiments.After each set of tasks had been completed, we al- The experiments were conducted at 14:00 every day from 4 April to 27 April 2023.During this period, one controller was assigned to conduct experiments in the control seat every day.During each experimental day, we asked the subjects to complete six sets of control simulation tasks, each set of which had a volume of 20 flights and lasted for 30 min, and with radio telephony and facial data only recorded while they were in the control seat during the experiments.After each set of tasks had been completed, we allowed the subjects to rest for 5 min and complete the 9-point Karolinska Sleepiness Scale during this rest period to assess their fatigue status as a score from 1 to 9 [31].The result was used to categorize the controller's fatigue status as no fatigue, mild fatigue, or severe fatigue [32].At the beginning of the experiment, the scale assessment results showed that the subjects were mostly in a no-fatigue state.After the experiment had been conducted for a while, the assessment results of some scales showed that the subjects appeared to be fatigued.The typical fatigue features are shown in Figure 4.The experiments were conducted at 14:00 every day from the 4 of April to the 27 of April 2023.During this period, one controller was assigned to conduct experiments in the control seat every day.During each experimental day, we asked the subjects to complete six sets of control simulation tasks, each set of which had a volume of 20 flights and lasted for 30 min, and with radio telephony and facial data only recorded while they were in the control seat during the experiments.After each set of tasks had been completed, we allowed the subjects to rest for 5 min and complete the 9-point Karolinska Sleepiness Scale during this rest period to assess their fatigue status as a score from 1 to 9 [31].The result was used to categorize the controller's fatigue status as no fatigue, mild fatigue, or severe fatigue [32].At the beginning of the experiment, the scale assessment results showed that the subjects were mostly in a no-fatigue state.After the experiment had been conducted for a while, the assessment results of some scales showed that the subjects appeared to be fatigued.The typical fatigue features are shown in Figure 4.After the experiments, we preprocessed the facial and radio telephony data.After processing, each radio telephony sample lasted about 2 s and was associated with a facial image, with both of them being used as a set of data.Finally, 1602 facial images were obtained, corresponding to 1602 radio telephony samples for the same period.The data of all the subjects were processed, which finally yielded 496, 543, and 563 sets of data for the no-fatigue, mild-fatigue, and severe-fatigue states, respectively.The average age of the participants was 29.2, with a standard deviation of 2.9.The data distribution is presented in Table 2.After the experiments, we preprocessed the facial and radio telephony data.After processing, each radio telephony sample lasted about 2 s and was associated with a facial image, with both of them being used as a set of data.Finally, 1602 facial images were obtained, corresponding to 1602 radio telephony samples for the same period.The data of all the subjects were processed, which finally yielded 496, 543, and 563 sets of data for the no-fatigue, mild-fatigue, and severe-fatigue states, respectively.The average age of the participants was 29.2, with a standard deviation of 2.9.The data distribution is presented in Table 2. Our dual-stream network was trained using each set of data as an input sample, and each set of data participating in the network training had undergone data-enhancement processing.For the radio telephony data, we injected noise and performed slice processing as well as compression and expansion processing in the time domain.For the facial images, we used an iterator to introduce random disturbances into the data for each round of training, including scaling and slight rotation of the images.Using these preprocessing methods helped us to increase the generalization of the model.
The experiments were conducted using Python 3.6 in the Windows operating system.To ensure the reliability of the experimental results, the experiments were repeated multiple times, from which average values were determined.
The framework of the proposed dual-stream network used in the experiments is shown in Figure 1, and the detailed parameter settings are listed in Table 3.The parameter settings of the facial data stream are as shown in Table 4.
The features obtained from the facial data stream and the voice data stream are trained in the fully connected layer to achieve fatigue classification.The parameter settings of the fully connected layer are as shown in Table 5.
During the experiments, in order to facilitate the display of the detection results for a single modality, we used the network formed by connecting Tables 3 and 5 as the audio data stream model.The network formed by connecting Tables 4 and 5 as the facial data stream model.We subsequently conducted experiments on single-modality and multi-modal networks.

Experimental Results
For the radio telephony data, we selected commonly used and classic methods to classify the audio features.In the experiments, when we only used the audio data stream to train the fully connected layer of the neural network, the accuracy of the audio data stream model reached 62.88%, while the AF dual-stream CNN detection accuracy based on the radio telephony data and the facial data reached 98.03%.In the experiments, we aimed to set other model parameters to their optimal values in order to ensure that all comparisons were unbiased.The parameter settings for the other models were as follows: a.
For the SVC model, the penalty coefficient C was 10, the kernel function used the radial basis function, and the randomness was set to 69. b.
For the KNN model, the number of neighbors was set to five, the prediction weight function was inversely proportional to the distance, and the brute force algorithm was used.The leaf size passed to the nearest-neighbor search algorithm was 30.c.
For the random forest model, the number of trees was set to 500, the random state was 69, and the maximum number of features was the square root of the number of sample features.The node split criterion was the information gain entropy.d.
For the multilayer perceptron classifier unscaled MLP model, randomness was set to 69, and data scaling was not performed during testing.e.
For the multilayer perceptron classifier standard scaled MLP model, randomness was set to 69, and data scaling was performed during testing.
The experimental results are presented in Table 6.ResNet18 and VGGNet16 have been used previously for facial fatigue feature extraction [33].Inspired by this, we selected related models for testing with the air traffic controller facial fatigue dataset.The experimental results are presented in Table 7.The experimental results show that our dual-stream network exhibited high accuracy, 0.21% higher than that for ResNet50.In addition, our network converged markedly faster during training, requiring only 50 iterations to converge.
As indicated in Table 8, although the number of parameters of our model was almost the same as that for ResNet50, the number of iterations required for convergence was only 20% of those for ResNet50.This indicates that our network model converges rapidly, so it is particularly well suited to air traffic controller fatigue detection.Table 8 lists the values of four evaluation parameters for our AF dual-stream CNN model.We use four metrics, Precision, F1 Score, Recall, and Support, to evaluate the recognition results of our model, as shown in Table 9.The comparative experimental results for the audio data show that the fatigue characteristics of audio were difficult to categorize into the three fatigue states.Even when using audio features that performed well in previous studies, the highest accuracy of the various test models for fatigue audio detection did not exceed 77.57%.The comparative experimental results for facial data show that our facial data stream model performed well but was not the best.In order to overcome the shortcomings of audio features and improve the detection accuracy of facial features, we combined facial data with audio data to judge the fatigue state of the air traffic controllers.The experimental results show that this combination approach resulted in the detection accuracy of our AF dual-stream CNN model increasing by 2.03%, reaching 98.03%.

Conclusions
This paper proposes an AF dual-stream CNN based on radio telephony data and facial data for the first time in the field of fatigue detection.The dual-stream convolutional network architecture designed in this study effectively utilizes the complementarity between multi-modal data sources.The experimental results show that the fatigue detection accuracy of our dual-stream network was 35.15% higher than that for using the radio telephony data alone and 2.03% higher than that for using the facial data alone, reaching 98.03%, which is better than the other algorithms and models tested in this study.In addition, during the training process, the neural networks assessed in experiments such as VGG16 also achieved detection accuracies as high as 96.81%, but the required number of iterations exceeded 300 times, and the training of ResNet50 required more than 1000 iterations; in contrast, our network model needs fewer than 50 training iterations to achieve convergence.The experimental results also show that networks such as VGG19, which have performed well in previous studies, did not perform well for our dataset, suggesting that such neural network models are not suitable for facial data, and their generalizability is not sufficient to support their implementation for classifying facial fatigue states.Our AF dual-stream CNN designed for fatigue detection effectively realizes the classification of controller fatigue states based on radio telephony data and facial data.The method in this paper can intervene in time when a controller shows fatigue, thereby contributing to the safe operation of flights.
The dual-stream convolutional neural network requires fewer iterations to reach convergence.In addition, we believe that there is still a better form of this structure, which can be further improved in the future.In our future work, we plan to focus on solving two problems.Firstly, when an air traffic controller issues control instructions, their mouth movements will negatively impact the detection of their facial fatigue status, so how to further reduce the impact of such factors needs to be determined.Secondly, due to the diversity of audio features, it is necessary to identify those audio features and their combinations that are optimal for fatigue detection.
are utilized in this paper to propose a fatigue detection network AF dual-stream CNN based on audio data and facial data.The network architecture includes an audio convolution module, a facial convolution module, and a fully connected layer.(a) Audio data stream: convolution module based on audio data

Figure 2 .Algorithm 1 :
Figure 2. Network structure of the AF dual-stream CNN that includes three modules: audio data stream, facial data stream, and feature fusion.

Figure 2 .
Figure 2. Network structure of the AF dual-stream CNN that includes three modules: audio data stream, facial data stream, and feature fusion.

Algorithm 1 :
AF dual-stream CNNInput: A i , P i Output: result Initialize: initialize F fStep 1: initialize A ff and F ff '

Figure 4 .
Figure 4. Facial schematic diagram of the controller in a fatigued state (left image) and in a nonfatigued state (right image).

Figure 4 .
Figure 4. Facial schematic diagram of the controller in a fatigued state (left image) and in a nonfatigued state (right image).

Table 1 .
Summary of fatigue detection methods and characteristics.

Table 3 .
Parameter settings for the audio data stream.The same padding method was applied in all cases.

Table 4 .
Parameter settings for the facial data stream.The same padding method was applied in all cases.

Table 5 .
Parameter settings for the fully connected layer.

Table 6 .
Comparison of audio detection model accuracies for different models.

Table 7 .
Comparison of facial detection model accuracies.

Table 8 .
Comparison of the numbers of detection model parameters and iterations required for convergence.

Table 9 .
Detection performance of our AF dual-stream CNN.