Automatic COVID-19 Detection from Cough Sounds Using Multi-Headed Convolutional Neural Networks

Featured Application: Coronavirus disease 2019 (COVID-19) is rampant all over the world, threatening human life and health. Currently, the detection of COVID-19 relies mainly on the nucleic acid test as the standard. However, this method not only consumes a lot of medical resources but also takes a long time to produce results. In contrast to the existing method, we propose using the cough sound as a large-scale pre-screening method before the nucleic acid test.

Abstract: Coronavirus disease 2019 (COVID-19) is rampant all over the world, threatening human life and health. Currently, the presence of SARS-CoV-2 is detected mainly by the nucleic acid test, which serves as the standard. However, this method not only consumes a lot of medical resources but also takes a long time to produce results. According to medical analysis, the surface protein of the novel coronavirus can invade the respiratory epithelial cells of patients and cause severe inflammation of the respiratory system, making the cough of COVID-19 patients different from that of healthy people. In this study, the cough sound is used as a large-scale pre-screening method before the nucleic acid test. Firstly, the Mel spectrum features, Mel Frequency Cepstral Coefficients, and VGG embeddings of the cough sound are extracted, and oversampling is used to balance the dataset for the class with a small number of samples. In terms of the model, we designed multi-headed convolutional neural networks to classify the audio samples and adopted early stopping to avoid over-fitting. The performance of the model is measured by the binary cross-entropy loss function. Our model achieves an accuracy of 98.1% on the dataset of the AICovidVN 115M challenge and an accuracy of 91.36% on the dataset of the University of Cambridge.


Introduction
Life and health have always been important topics of global concern. To safeguard human life and health, different countries have established their own healthcare systems. Studies have shown an upward trend in government and individual health expenditures in some regions [1,2]. However, some countries still need to improve their healthcare systems [3,4]. In recent years, the COVID-19 epidemic has severely tested the sustainability of health systems in many countries [5], and the lives of people in some developing countries have also been greatly affected [6]. To better manage the national medical and financial risks brought by the COVID-19 epidemic, countries and the research community are actively taking measures to deal with it [7], especially in the detection of the COVID-19 disease [8,9].
At present, the main detection method for COVID-19 is to collect samples from the throat (i.e., the posterior pharyngeal wall) of the subjects using throat swabs and nasal swabs [10]. Although it is the standard method, there are many potential risks in its sampling and detection process. Before sampling, people need to wait in line for nucleic acid testing. During sampling, the medical staff come into close contact with the subjects. Such a sampling process violates social distancing and greatly increases the risk of cross-infection. After sampling, the medical staff send the collected samples back to the hospital and test them with nucleic acid testing reagents. The tested person needs to wait about four to five hours to obtain the result. This detection process is inefficient and increases the chances of virus transmission. According to medical analysis [11], the new coronavirus is an RNA virus whose surface protein first attacks human respiratory epithelial cells, causing severe inflammation of the respiratory system. This leads to differences in acoustic characteristics between patients with COVID-19 and healthy people [12][13][14][15]. Many research results show that, in addition to visual information, audio signals produced by the human body can also provide a reliable basis for medical diagnosis [16][17][18]. Meanwhile, with the development of artificial intelligence, it has become feasible for wearable devices to collect and analyze audio signals [19].
During the COVID-19 epidemic, many universities collected the cough sounds of COVID-19 patients and healthy people and used machine learning to classify the two kinds of cough sounds. In September 2020, a team at the University of Cambridge extracted several features of cough sounds through manual extraction and transfer learning, and studied three tasks. The first is to distinguish COVID-positive from non-COVID samples, the second is to distinguish COVID-positive with a cough from non-COVID with a cough, and the third is to distinguish COVID-positive coughs from non-COVID asthma coughs [20]. The results showed that the area under the curve (AUC) remained above 0.8 for all tasks. In the same year, the Massachusetts Institute of Technology also reported research on cough-based detection [21]. The team collected audio signals widely through its own website, extracted the acoustic characteristics of cough sounds, and then used a CNN to classify them. The team reported a specificity of 94.2% and an AUC of 0.97 on its dataset. In January 2021, Andreu Perez J et al. [22] proposed a cough analysis system that could classify audio feature tensors and detect the severity of infection in COVID-19 patients. In the AICovidVN 115M COVID-19 challenge, Nguyễn Thành Trung's team [23] extracted multiple features from the dataset provided by the competition, combined them, and applied different normalization methods to the input data. They used the LightGBM model for classification and tuned the model through k-fold cross-validation. The resulting area under the ROC curve was 0.96. Pahar et al. [24] extracted features such as Mel Frequency Cepstral Coefficients (MFCC) and zero-crossing rates from the audio data of the Coswara public dataset and used them for model training. In their experiment, a total of seven machine learning classifiers were trained and evaluated: logistic regression (LR), support vector machine, k-nearest neighbor, multilayer perceptron (MLP), long short-term memory (LSTM), CNN, and a 50-layer deep residual neural network (ResNet50). Among them, the LSTM, CNN, and ResNet50 classifiers performed better than the other architectures.
Many research teams in China have also contributed to COVID-19 detection. Researchers from the Institute of Artificial Intelligence of Beijing University of Posts and Telecommunications expanded the corpus of cough sounds and speech clips. They extracted VQ features from the audio and used VLAD encoding to make the features more representative and improve the performance of the algorithm [25]. The team formed by the University of Science and Technology of China and iFLYTEK won two championships in the second DiCOVA COVID-19 Sound Signal Detection Challenge held by ICASSP 2022 [26]. Their design uses both supervised and self-supervised pre-training schemes and finally fuses the prediction results, which achieves an AUC of 0.88.
Considering that a single type of acoustic feature is limited or may lack strong representation ability, we adopted a combination of three features that takes the characteristics of human hearing into account. In particular, the pre-trained VGGish model is trained on a large-scale audio dataset and has strong domain adaptability. Combining the three features gives our proposed method a strong generalization ability and makes it better suited to situations where the training data are limited.
Unlike a single neural network structure, our multi-headed convolutional neural networks (MHCNNs) structure can fuse the different abstract representations from each head into a more comprehensive data representation. From the perspective of information fusion, each head of an MHCNN has a different structure. By integrating information from multiple heads, the network can benefit from different perspectives, resulting in a richer representation and a better decision-making ability. This paper adopts the deep learning method of MHCNNs to introduce audio-recognition-based intelligent diagnosis into the medical diagnosis process, and our experimental results show good performance. This could serve as a low-risk and high-efficiency method for large-scale epidemic surveillance. It could also reduce medical expenses in many countries, thereby indirectly improving the quality of national healthcare services [27,28].

Feature Extraction
This paper uses two methods to extract audio features from cough audio data. The first is to manually extract the Mel spectrum and MFCC features from the samples. The other is to automatically extract features from the raw audio using the VGGish network.

Mel Spectrum
In human auditory perception, the resolution of frequency is not uniform. Mel frequency is more consistent with the characteristics of the human auditory system than linear frequency. Firstly, the speech is pre-emphasized with a filter to spectrally flatten the signal. The pre-emphasized speech is separated into short frames to guarantee stationarity inside each frame, and adjacent frames overlap to ensure a smooth transition between frames. To reduce the frame edge effect, a Hamming window is applied to each frame. Secondly, to bring out the salient characteristics of the voice information, the spectrum of each frame is calculated with the fast Fourier transform (FFT). Finally, the spectrogram is converted into a Mel spectrum to represent the characteristics of the audio signal more accurately.
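The steps above (pre-emphasis, framing with overlap, Hamming window, FFT, Mel conversion) can be sketched in NumPy as follows. This is only an illustrative sketch: the sampling rate, frame length, hop size, FFT size, and pre-emphasis coefficient are assumptions chosen for the example and are not taken from the paper; only the 128 Mel bands match the feature shape reported later.

```python
import numpy as np

def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters spaced evenly on the Mel scale, shape (n_mels, n_fft // 2 + 1)."""
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * edges_hz / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        left, center, right = bins[m], bins[m + 1], bins[m + 2]
        for k in range(left, center):        # rising edge of the triangle
            fb[m, k] = (k - left) / (center - left)
        for k in range(center, right):       # falling edge of the triangle
            fb[m, k] = (right - k) / (right - center)
    return fb

def mel_spectrogram(signal, sr=16000, frame_len=400, hop=160,
                    n_fft=2048, n_mels=128, pre_emphasis=0.97):
    """Return a (n_frames, n_mels) Mel power spectrum. All defaults are assumed values."""
    # 1) Pre-emphasis: first-order high-pass filter that flattens the spectrum.
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
    # 2) Overlapping frames (hop < frame_len), each multiplied by a Hamming window.
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)
    # 3) Power spectrum of each frame via the FFT (zero-padded to n_fft).
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft
    # 4) Project the linear-frequency power spectrum onto the Mel filters.
    return power @ mel_filterbank(n_mels, n_fft, sr).T

rng = np.random.default_rng(0)
demo_signal = rng.standard_normal(16000)     # one second of synthetic noise, not real cough audio
mel = mel_spectrogram(demo_signal)
```

With these assumed settings, a one-second signal yields a matrix with one row per frame and 128 columns, which corresponds to the (nums, 128) layout described for the Mel spectrum feature.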

MFCC
Among short-term cepstral features, MFCC is the most widely used. The block diagram of MFCC feature extraction is shown in Figure 1. After pre-emphasis, framing, windowing, and the FFT, the power of each frequency band is calculated. To simulate the nonlinear auditory characteristics of the human cochlea, a bank of triangular filters is designed. The power spectrum is multiplied by the response of each triangular filter, and the products are summed to give the energy of that filter. To remove the correlation between the outputs of the triangular filters, the logarithm of all filter bank energies is transformed by the discrete cosine transform (DCT) to obtain the MFCC. To enhance the performance of the speaker recognition system, the first-order and second-order deltas of the MFCC are computed as dynamic parameters and concatenated with the static parameters to form the features of each frame.
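The last two stages (log filter bank energies, DCT, and deltas) can be illustrated with the NumPy sketch below. It assumes the Mel filter bank energies are already available as a (frames, bands) matrix, uses an unnormalized DCT-II, and approximates the deltas with simple finite differences rather than a regression window; these choices are assumptions for illustration. How the deltas relate to the 13-dimensional MFCC shape reported later is not stated in the text, so the stacking at the end is illustrative only.

```python
import numpy as np

def dct_basis(n_out, n_in):
    """Unnormalized DCT-II basis, shape (n_out, n_in)."""
    n = np.arange(n_in)
    k = np.arange(n_out)[:, None]
    return np.cos(np.pi * k * (2 * n + 1) / (2 * n_in))

def mfcc_from_mel(mel_power, n_mfcc=13):
    """Log of the filter bank energies followed by a DCT along the band axis."""
    log_energy = np.log(mel_power + 1e-10)   # small offset avoids log(0)
    return log_energy @ dct_basis(n_mfcc, mel_power.shape[1]).T

def add_deltas(static):
    """Append first- and second-order differences along the time axis (simplified deltas)."""
    d1 = np.gradient(static, axis=0)
    d2 = np.gradient(d1, axis=0)
    return np.hstack([static, d1, d2])

rng = np.random.default_rng(0)
mel_power = rng.random((50, 128)) + 0.1      # synthetic filter bank energies, 50 frames
mfcc = mfcc_from_mel(mel_power)              # static coefficients, one row per frame
with_deltas = add_deltas(mfcc)               # static + delta + delta-delta
```

Because the DCT basis functions for k >= 1 sum to zero, a flat spectrum maps to coefficients that are zero everywhere except the first one, which is one way to see that the transform separates overall energy from spectral shape.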

VGGish
VGGish [29] is a convolutional neural network with a simple structure and good generalization ability. The VGGish model was pre-trained on a large-scale dataset of YouTube video audio tracks, and the learned model parameters are available on GitHub. Using the VGGish network, the raw audio is transformed into features. The pre-trained VGGish model first divides each data sample into non-overlapping 0.96 s sub-samples and returns a 128-dimensional feature vector for each of them.
For each cough audio file, the shapes of the extracted Mel spectrum feature matrix, MFCC feature matrix, and VGG network feature matrix are (nums, 128), (nums, 13), and (nums, 128), respectively, where nums depends on the duration of the audio: the longer the audio, the larger the value of nums. We averaged each feature matrix over the time dimension to form a one-dimensional vector; the vectors obtained in this way from the pre-trained model are called VGG embeddings. The sizes of the feature vectors corresponding to the Mel spectrum, MFCC, and VGG embeddings are (1, 128), (1, 13), and (1, 128), respectively. In this paper, we use early fusion to combine the three features, horizontally concatenating the three vectors of each audio file to obtain a (1, 269) feature vector. The N cough audio files are finally integrated into a feature sequence of shape (N, 269), where N is the number of cough audio files, and saved as an .npy file together with the corresponding label data (0 for healthy people and 1 for COVID-19 patients).
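The early-fusion step can be sketched as below, assuming the three per-clip feature matrices have already been computed. The function name `fuse_clip`, the synthetic matrices, and the file names in the comment are hypothetical; the row counts differ between clips (and between feature types) only to show that the fused vector length does not depend on duration.

```python
import numpy as np

def fuse_clip(mel, mfcc, vgg):
    """Average each (nums, d) matrix over time and concatenate: 128 + 13 + 128 = 269."""
    parts = [np.asarray(m).mean(axis=0) for m in (mel, mfcc, vgg)]
    return np.concatenate(parts)[None, :]    # shape (1, 269)

rng = np.random.default_rng(0)

def fake_clip(n_frames, n_vgg_frames):
    # Synthetic stand-ins for the Mel spectrum, MFCC, and VGGish matrices of one clip.
    return (rng.random((n_frames, 128)),
            rng.random((n_frames, 13)),
            rng.random((n_vgg_frames, 128)))

clips = [fake_clip(300, 3), fake_clip(900, 9), fake_clip(500, 5)]
labels = np.array([0, 1, 0])                 # 0 = healthy, 1 = COVID-19 (toy labels)

features = np.vstack([fuse_clip(*clip) for clip in clips])   # shape (N, 269)
# The arrays could then be stored with np.save("features.npy", features)
# and np.save("labels.npy", labels); the file names here are placeholders.
```

Averaging over time discards the temporal order of the frames, which is what allows clips of different lengths to share one fixed-size input.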

Multi-Headed Convolutional Neural Networks Architecture
Because a series of convolution and pooling operations in each CNN produces a more abstract expression of the input feature sequence, and thereby allows the network to learn the input from multiple perspectives, this paper designs and implements the MHCNN network structure. An MHCNN feeds the one-dimensional audio feature sequence into three CNNs with different operations, combines the outputs abstracted by the three CNNs, and passes them to several fully connected layers for classification.
For the first input of the network, the 1D cough audio feature sequence passes through two Conv1D layers in turn and then a MaxPooling1D layer, is further abstracted by another two Conv1D layers, and is then fed to a GlobalAveragePooling1D layer. Finally, a Dropout layer and a Flatten layer are added. After this series of layers, the length of the audio feature sequence is 128.
For the second input of the network, the 1D cough audio feature sequence first passes through two Conv1D layers and one MaxPooling1D layer, and then through only one more Conv1D layer before the GlobalAveragePooling1D layer. After the Dropout layer and the Flatten layer, the final sequence length is 256.
For the third input of the network, the 1D cough audio feature sequence has a length of 7936 after being processed by a four-layer network consisting of a Conv1D layer, a Dropout layer, a MaxPooling1D layer, and a Flatten layer.
Then, the abstracted sequence data of three different lengths are combined, and the final output is obtained after passing through four Dense layers. The network architecture is shown in Figure 2. All convolution layers use the ReLU activation function, and the activation function of the last Dense layer is set to Softmax.
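The output sizes quoted for the three heads can be checked with simple shape bookkeeping. The sketch below uses the standard output-length rules for an unpadded 1D convolution and for non-overlapping max pooling. The text does not give filter counts, kernel sizes, or pool sizes, so the values used for the third head (64 filters, kernel size 22, pool size 2) are purely hypothetical: they are one combination that happens to reproduce the reported length of 7936, not the authors' configuration.

```python
def conv1d_len(n, kernel, stride=1):
    """Output length of a 1D convolution without padding."""
    return (n - kernel) // stride + 1

def pool1d_len(n, pool):
    """Output length of non-overlapping 1D max pooling (stride equal to the pool size)."""
    return n // pool

INPUT_LEN = 269                      # fused feature vector length, one channel

# Heads 1 and 2 end in global average pooling, which averages over the length axis,
# so the output size equals the number of filters in the last convolution.
head1_out = 128
head2_out = 256

# Head 3: Conv1D -> Dropout -> MaxPooling1D -> Flatten (hypothetical hyperparameters).
filters, kernel, pool = 64, 22, 2
head3_len = pool1d_len(conv1d_len(INPUT_LEN, kernel), pool)   # (269 - 22 + 1) // 2
head3_out = head3_len * filters      # Flatten multiplies length by channels

merged = head1_out + head2_out + head3_out   # size of the vector fed to the Dense layers
```

Whatever the true hyperparameters are, concatenating outputs of sizes 128, 256, and 7936 gives a vector of 8320 values as the input to the first Dense layer.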

Experimental Dataset1
To evaluate the performance of the proposed method, our experiments are conducted on the AICovidVN 115M challenge dataset. This dataset contains 4068 cough audio files, including 669 cough audio files of COVID-19 patients and 3399 cough audio files of non-COVID people, which indicates that the category distribution of this dataset is imbalanced. The distribution of audio durations in this dataset is shown in Figure 3. The horizontal axis represents the 4068 cough audio files, and the vertical axis represents the duration of each audio file in seconds (s). The average duration of all cough audio files is about 9.1 s; 1225 cough audio files are longer than 9 s and 2843 are shorter than 9 s.
Because of this class imbalance, SMOTE is used to augment the data of COVID-19 patients. Starting from the samples of the minority category (COVID-19 patients), it finds neighboring samples and synthesizes new samples between them until the two categories contain the same number of samples. In this way, the dataset is balanced, and the final total number of samples is 6798. The balanced feature sequences are divided into a training set and a testing set with a ratio of 8:2, that is, 5438 feature sequences are used for training (453 feature sequences are used as the validation set), and 1360 feature sequences are used for testing. To be consistent with the input size of the CNN network structure, the input feature array must be expanded from two dimensions to three; taking the training data as an example, it is reshaped from (5438, 269) to (5438, 269, 1) as the input for CNN model training.
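The balancing, splitting, and reshaping arithmetic can be reproduced with the sketch below. It uses random stand-in features rather than the real dataset, and `smote_like` is a simplified re-implementation of the SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbors) written for illustration; it is not the library routine the authors used, and the neighbor count k = 5 and the random seed are assumptions.

```python
import numpy as np

def smote_like(minority, n_new, k=5, seed=0):
    """Synthesize n_new points by interpolating between minority samples and their neighbors."""
    rng = np.random.default_rng(seed)
    sq = (minority ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * minority @ minority.T   # squared distances
    np.fill_diagonal(d2, np.inf)                 # a sample is not its own neighbor
    neighbours = np.argsort(d2, axis=1)[:, :k]   # k nearest minority neighbors per sample
    base = rng.integers(0, len(minority), size=n_new)
    pick = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                 # interpolation factor in [0, 1)
    return minority[base] + gap * (minority[pick] - minority[base])

rng = np.random.default_rng(0)
negatives = rng.normal(0.0, 1.0, size=(3399, 269))   # stand-in for non-COVID features
positives = rng.normal(0.5, 1.0, size=(669, 269))    # stand-in for COVID-19 features

synthetic = smote_like(positives, n_new=len(negatives) - len(positives))
X = np.vstack([negatives, positives, synthetic])
y = np.concatenate([np.zeros(len(negatives), dtype=int),
                    np.ones(len(positives) + len(synthetic), dtype=int)])

order = rng.permutation(len(X))                  # shuffle before splitting
X, y = X[order], y[order]
n_train = int(0.8 * len(X))                      # 8:2 split
X_train, X_test = X[:n_train, :, None], X[n_train:, :, None]   # add a channel axis
y_train, y_test = y[:n_train], y[n_train:]
```

Note that in this sketch the synthetic samples are generated before the split, following the order described in the text, so interpolated points derived from a given recording can end up on both sides of the split.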

Experimental Parameter Setting
This experiment trains the network model on Google's Colab platform and uses its GPU to accelerate training. Google Drive is used to load the experimental data and save the network model parameters. Learning rate decay is applied during training. It reduces the step size of parameter updates as training progresses, which helps the algorithm converge. The Keras exponential decay function in TensorFlow is used to implement the decay, so that the learning rate is adjusted dynamically as training progresses. The initial learning rate should not be too large: when it was set to 0.1, training was fast, but the accuracy was only about 50%. The initial learning rate is therefore set to 0.00035, the decay rate is 0.9, and the decay step is 1000. At the same time, this paper uses the adaptive moment estimation (Adam) optimizer to optimize the training process. The exponential decay object created above is passed as a parameter to the Adam optimizer, which then computes the learning rate adaptively and eases the selection of hyperparameters. The resulting optimizer object is used when the model is compiled.
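The effect of the schedule can be seen with a small pure-Python version of the exponential decay formula, lr(step) = initial_lr * decay_rate ** (step / decay_steps), which is the formula documented for the Keras ExponentialDecay schedule. Mapping the reported values onto the parameters `decay_rate=0.9` and `decay_steps=1000` is an assumption about how the authors configured the schedule, and the step counts printed below are arbitrary.

```python
def exponential_decay(step, initial_lr=0.00035, decay_rate=0.9,
                      decay_steps=1000, staircase=False):
    """Learning rate after `step` optimizer updates under exponential decay."""
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial_lr * decay_rate ** exponent

# In Keras, the equivalent objects would be created roughly as follows
# (shown as comments so that this sketch runs without TensorFlow):
#   schedule = tf.keras.optimizers.schedules.ExponentialDecay(
#       initial_learning_rate=0.00035, decay_steps=1000, decay_rate=0.9)
#   optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)

for step in (0, 1000, 5000):
    print(step, exponential_decay(step))
```

With these values the learning rate drops by 10% every 1000 updates, so it changes slowly over a run of a few thousand steps.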
During training, this paper uses the binary cross-entropy loss function, which is commonly used for classification tasks, and sets the number of epochs at the same time. If the number of epochs is too large, the network may over-fit and generalize poorly. To avoid over-fitting, this paper uses the EarlyStopping function in Keras and sets its patience parameter to 10. This means that when the loss monitored on the validation set does not decrease for 10 consecutive epochs, training is terminated early. If the number of epochs is too small, the network may under-fit because it has not learned enough from the training data. To avoid this, we set the number of epochs to be greater than the number needed for the network to converge. During the experiments, we found that the number of epochs needed to train a model does not exceed 100. Considering the computational cost and the experimental observations, we set the number of epochs to 100 and the batch size to 64.
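The stopping rule described above can be sketched in plain Python as a patience counter over validation losses. The function name and the loss values are made up for illustration; the logic follows the usual patience rule (reset the counter on improvement, stop once it reaches the patience value), which is how the Keras EarlyStopping callback is documented to behave when monitoring validation loss.

```python
def epochs_until_stop(val_losses, patience=10, max_epochs=100):
    """Return the number of epochs actually run under early stopping."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:          # improvement: remember it and reset the counter
            best = loss
            wait = 0
        else:                    # no improvement this epoch
            wait += 1
            if wait >= patience:
                return epoch     # training stops after this epoch
    return min(len(val_losses), max_epochs)

# In Keras this corresponds roughly to passing
#   tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10)
# in the `callbacks` argument of model.fit(..., epochs=100, batch_size=64).

toy_losses = [1.0, 0.8, 0.7] + [0.7] * 20   # improves for 3 epochs, then plateaus
stopped_at = epochs_until_stop(toy_losses, patience=10)
```

In the toy sequence the last improvement occurs at epoch 3, so training stops 10 epochs later; the upper limit of 100 epochs only matters if the loss keeps improving.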

Experimental Results and Analysis
To compare the performance of the MHCNN and a single-headed CNN, the MHCNN is divided into three independent single-headed CNNs, which are named Head1 CNN, Head2 CNN, and Head3 CNN.
To compare the MHCNN with the three single-headed CNNs (Head1 CNN, Head2 CNN, and Head3 CNN), we used three different feature sets (Mfcc + mel, Mfcc + mel + VGG embeddings, and VGG embeddings). The accuracy is shown in Table 1, where the combination of the Mfcc + mel + VGG embeddings feature and the MHCNN model (accuracy of 98.09%) outperforms the other methods. Moreover, the performance of different features (Mfcc + mel, Mfcc + mel + VGG embeddings, and VGG embeddings) with the same model, and of different models (Head1 CNN, Head2 CNN, Head3 CNN, and MHCNN) with the same feature, is also evaluated by the AUC value. The results are shown in Table 2, where it can be seen that the combination of the Mfcc + mel + VGG embeddings feature and the MHCNN model has the highest AUC value (AUC = 0.9959). In this study, the model.fit() function is used to train the model for a given number of epochs and to return the training history. The learning curves for accuracy and loss are shown in Figures 4 and 5. Training ran for 38 epochs, and by the 30th epoch the network had gradually converged. The two figures show that the generalization performance of the network is satisfactory, as the curves on the training set and the validation set are relatively close.
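For readers less familiar with the AUC metric used in Table 2, the sketch below computes it from its rank-based definition: the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The labels and scores are toy values, not outputs of the model, and the quadratic loop is written for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(pos) * len(neg))

toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)   # 3 of the 4 pairs are ordered correctly
```

An AUC of 1.0 means every positive sample is scored above every negative sample, while 0.5 corresponds to random ordering.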
Other evaluation indicators, including precision, recall, and F1 score, are also calculated and shown in Table 3. Among the 1360 test samples, there are 680 negative and 680 positive samples. On the negative samples, the precision is 0.99, the recall is 0.97, and the F1 score is 0.98. On the positive samples, the precision is 0.97, the recall is 0.99, and the F1 score is 0.98. These three evaluation metrics show that the MHCNN model performs well on the test set.
At the same time, the confusion matrix is used to evaluate the performance of the learning algorithm and to show where the true value is or is not consistent with the predicted value. The confusion matrix of the MHCNN model is shown in Figure 6. Each row corresponds to an actual class: the first row represents the cough audio of healthy people in the test set, and the second row represents the cough audio of patients with COVID-19. Each column corresponds to a predicted class: the first column represents the number of cough signals predicted as negative, and the second column represents the number predicted as positive. The matrix can be used to count the correct and incorrect classifications on the test data separately. Of the 680 healthy audio samples, 662 are predicted correctly and 18 are predicted incorrectly. Of the 680 cough audio samples from patients with COVID-19, 672 are predicted correctly and 8 are predicted incorrectly. In addition, the ROC curve of the new method is shown in Figure 7.
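The reported precision, recall, F1, and accuracy values can be recomputed directly from the four confusion matrix counts given in the text (662 and 18 for healthy audio, 8 and 672 for COVID-19 audio). The variable and function names below are illustrative.

```python
# Confusion matrix counts from the text (rows: actual class, columns: predicted class).
tn, fp = 662, 18      # healthy audio: predicted negative, predicted positive
fn, tp = 8, 672       # COVID-19 audio: predicted negative, predicted positive

def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

precision_neg = tn / (tn + fn)    # of all predicted negatives, how many are healthy
recall_neg = tn / (tn + fp)       # of all healthy samples, how many are found
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)

report = {
    "negative": (round(precision_neg, 2), round(recall_neg, 2),
                 round(f1(precision_neg, recall_neg), 2)),
    "positive": (round(precision_pos, 2), round(recall_pos, 2),
                 round(f1(precision_pos, recall_pos), 2)),
    "accuracy_percent": round(100 * accuracy, 2),
}
print(report)
```

The rounded values agree with the figures quoted above and with the 98.09% accuracy reported for the MHCNN model on this dataset.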

Experimental Dataset2
To verify the generalization of our proposed method, experiments are also conducted on the University of Cambridge dataset (data collected until 22 May 2020). This dataset was gathered from a web-based app and an Android app. It contains 544 cough audio files, including 141 cough audio files of COVID-19 patients and 403 cough audio files of non-COVID people, which indicates that the category distribution of this dataset is imbalanced. The 141 cough audio files of COVID-19 patients have a total duration of 897 s, and the 403 cough audio files of non-COVID people have a total duration of 1870 s.
Because of this class imbalance, SMOTE is used to augment the data of COVID-19 patients. Starting from the samples of the minority category (COVID-19 patients), it finds neighboring samples and synthesizes new samples between them until the two categories contain the same number of samples. In this way, the dataset is balanced, and the final total number of samples is 806. The balanced feature sequences are divided into a training set and a testing set with a ratio of 8:2, that is, 644 feature sequences are used for training (54 feature sequences are used as the validation set), and 162 feature sequences are used for testing. To be consistent with the input size of the CNN network structure, the input feature array must be expanded from two dimensions to three; taking the training data as an example, it is reshaped from (644, 269) to (644, 269, 1) as the input for CNN model training.

Experimental Parameter Setting
This experiment trains the network model on Google's Colab platform and uses its GPU to accelerate training. A Google cloud drive is used to load the experimental data and save the network model parameters. In the training process, the learning rate is set to 0.00025 and the decay index is 0.85. This paper uses the binary cross-entropy loss function, which is commonly used for classification tasks. To avoid overfitting, this article uses the EarlyStopping function in Keras and sets its patience parameter to 10. To avoid underfitting caused by insufficient learning of the training data, we set the maximum number of epochs to be greater than the number of epochs at which the network converges. Considering the computational power and experimental phenomena, we set the number of epochs to 100 and the batch size to 64.
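To make these settings concrete, the following framework-independent Python sketch shows the binary cross-entropy loss, a decayed learning rate, and a patience-based stopping rule. It is a minimal illustration, not the training code of this paper: the names `binary_cross_entropy`, `decayed_learning_rate`, and `EarlyStopper` are hypothetical, the assumption that the decay of 0.85 is applied once per epoch is ours, and the validation losses are made-up values chosen only to show when training would stop.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def decayed_learning_rate(epoch, initial_lr=0.00025, decay=0.85):
    """Exponentially decayed learning rate.
    ASSUMPTION: the decay is applied once per epoch."""
    return initial_lr * decay ** epoch


class EarlyStopper:
    """Stop when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = math.inf
        self.wait = 0

    def should_stop(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
        return self.wait >= self.patience


# Made-up validation losses: five improving epochs, then a plateau.
val_losses = [1.0 / (e + 1) for e in range(5)] + [0.5] * 95

stopper = EarlyStopper(patience=10)
stopped_at = None
for epoch in range(100):  # upper bound of 100 epochs, as in the text
    if stopper.should_stop(val_losses[epoch]):
        stopped_at = epoch
        break

print("stopped at epoch index:", stopped_at)
print("learning rate at that epoch:", decayed_learning_rate(stopped_at))
```

Setting the epoch limit well above the convergence point, as the text describes, means the stopping rule rather than the limit normally ends training.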

Experimental Results and Analysis
To compare the performances of MHCNN and single-headed CNN, MHCNN is divided into three independent single-headed CNNs, which are named Head1 CNN, Head2 CNN, and Head3 CNN.
To compare MHCNN with the three single-headed CNNs (Head1 CNN, Head2 CNN, and Head3 CNN), we used three different feature sets (Mfcc + mel, Mfcc + mel + VGG embeddings, and VGG embeddings). The accuracy is shown in Table 4 and in Figure 8, where the combination of the Mfcc + mel + VGG embeddings feature and the MHCNN model (accuracy of 91.36%) outperforms the other combinations.
Moreover, the performance of different features (Mfcc + mel, Mfcc + mel + VGG embeddings, and VGG embeddings) with the same model, and of different models (Head1 CNN, Head2 CNN, Head3 CNN, and MHCNN) with the same feature, is also evaluated by the AUC value. The results are shown in Table 5 and in Figure 9, where it can be seen that the combination of the Mfcc + mel + VGG embeddings feature and the MHCNN model has the highest AUC value (AUC = 0.9646).
Other evaluation indicators, including precision, recall rate, and F1 score, are also calculated and shown in Table 6. Among the 162 test samples, there are 81 negative samples and 81 positive samples. The precision on the negative data is 0.95, the recall rate is 0.88, and the F1 score is 0.91. The precision on the positive data is 0.89, the recall rate is 0.95, and the F1 score is 0.92. The results of the three evaluation metrics show that the MHCNN model performs well on the test set.
At the same time, the confusion matrix is adopted to evaluate the performance of the learning algorithm and to show where the true values agree or disagree with the predicted values. The confusion matrix of the MHCNN model is shown in Figure 12. The first row represents the actual cough audio clips of healthy people in the test audio, and the second row represents the actual cough audio clips of patients with COVID-19. The first column represents the number of clips predicted as negative, and the second column represents the number of clips predicted as positive. The matrix can be used to count the prediction results on the test data separately, including the number of correct and incorrect classifications. Of the 81 healthy audio clips, 71 are predicted correctly and 10 are predicted incorrectly. Of the 81 cough audio clips of patients with COVID-19, 77 are predicted correctly and 4 are predicted incorrectly.
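As a consistency check, the reported accuracy, precision, recall rate, and F1 score can be recomputed from the confusion-matrix counts given above (71 correct and 10 wrong for healthy clips, 77 correct and 4 wrong for COVID-19 clips). The short Python sketch below does this; the helper name `class_metrics` is hypothetical and the code is only an illustration of the standard definitions.

```python
def class_metrics(tp, fp, fn):
    """Precision, recall, and F1 for one class from its counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts from the text, with COVID-19 as the positive class.
tn, fp, fn, tp = 71, 10, 4, 77

accuracy = (tp + tn) / (tp + tn + fp + fn)
positive = class_metrics(tp=tp, fp=fp, fn=fn)
# For the negative class the roles of the two error types are swapped.
negative = class_metrics(tp=tn, fp=fn, fn=fp)

print(f"accuracy: {accuracy:.4f}")                      # 0.9136
print("positive:", [round(v, 2) for v in positive])     # [0.89, 0.95, 0.92]
print("negative:", [round(v, 2) for v in negative])     # [0.95, 0.88, 0.91]
```

The recomputed values agree with the 91.36% accuracy and with the per-class precision, recall rate, and F1 score quoted in the text.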
The ROC curve of the MHCNN model on this dataset is shown in Figure 13.
At the same time, to verify that our proposed method can distinguish the coughs of other respiratory patients from those of COVID-19 patients, we selected the cough sounds of asthma patients and COVID-19 patients in the University of Cambridge dataset to conduct experiments. Among them, there are 21 cough files of asthmatic patients with a total length of 120 s, and 54 cough files of COVID-19 patients with a total length of 299 s.
Our MHCNN is used to classify asthma coughs and COVID-19 coughs on the University of Cambridge dataset, where the training set and testing set are divided in the same way as in the previous experiments. The optimal feature combination (Mfcc + mel + VGG embeddings) found in the previous two groups of experiments is adopted to train the MHCNN network.
The evaluation indicators of precision, recall rate, and F1 score are calculated and shown in Table 7; they show that the MHCNN model performs well on the test set. In this experiment, the confusion matrix of the MHCNN model is shown in Figure 14, and the ROC curve is shown in Figure 15.
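The AUC values reported in this section summarize an ROC curve as a single number. One standard way to compute it is as the fraction of positive/negative pairs in which the positive sample receives the higher score, with ties counted as one half. The Python sketch below illustrates this definition; the function name `roc_auc` is hypothetical, and the labels and scores are made-up placeholders, not data from the experiments.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive sample is scored
    higher than a random negative sample (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Placeholder labels (1 = COVID-19) and predicted probabilities.
labels = [0, 0, 0, 1, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.30, 0.90]

auc = roc_auc(labels, scores)
print(f"AUC = {auc:.4f}")
```

An AUC of 1.0 corresponds to scores that separate the two classes perfectly, and 0.5 corresponds to scores that carry no ranking information.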

Conclusions
In this paper, an MHCNN is designed for the detection of COVID-19 from the cough sound of the human body. We validated the method separately on the AICovidVN 115M challenge dataset and the University of Cambridge dataset, and achieved good results on both. This method offers a new way to assist the screening and diagnosis of COVID-19. Compared with the traditional detection method, it reduces manpower and material costs and decreases the frequent contact between medical staff and the people being tested. Most importantly, it provides a new way to actively find patients with COVID-19, which is conducive to the early isolation and treatment of infected people, reduces the risk of potential infection, and is of great significance for combating the epidemic.

Figure 1. The block diagram of the MFCC feature extraction.

Figure 2. The architecture of an MHCNN.

Figure 3. The time length analysis of cough audio files in the dataset.

Because of the class imbalance in the dataset, SMOTE is used to augment the data of COVID-19 patients. Starting from the samples of the minority category (COVID-19 patients), it finds adjacent samples and synthesizes new samples between them, so that the minority category is augmented until the two categories are balanced. In this way, the dataset is balanced, and the final total number of samples is 6798. The balanced feature sequences are divided into a training set and a testing set with a ratio of 8:2; that is, 5438 feature sequences are used for training (of which 453 are used as the validation set), and 1360 feature sequences are used for testing. To be consistent with the input size of the CNN, the input feature array must be expanded from two dimensions to three. Taking the training data as an example, the shape changes from (5438, 269) to (5438, 269, 1) before it is fed to the CNN for training.

Figure 4. Changes in accuracy value during training on the AICovidVN 115M challenge dataset.

Figure 5. Changes in loss value during training on the AICovidVN 115M challenge dataset.

Figure 6. Confusion matrix of the MHCNN model on the AICovidVN 115M challenge dataset.

Figure 7. ROC curve of the MHCNN model on the AICovidVN 115M challenge dataset.

Figure 8. Accuracy bar charts of the combination of four different models and three different features on the University of Cambridge dataset.

Figure 9. AUC bar charts of the combination of four different models and three different features on the University of Cambridge dataset.

In this study, the model.fit() function is used to train the model for a given number of epochs and to return the training history. The learning curves for accuracy and loss are shown in Figures 10 and 11. After about 10 epochs of training, the network gradually converges. The curves on the training set and the validation set differ noticeably, with performance on the training set better than on the validation set; we attribute this to the University of Cambridge dataset being smaller than the AICovidVN 115M dataset.

Figure 10. Changes in accuracy value during training on the University of Cambridge dataset.

Figure 11. Changes in loss value during training on the University of Cambridge dataset.

Figure 12. Confusion matrix of the MHCNN model on the University of Cambridge dataset.

Figure 13. ROC curve of the MHCNN model on the University of Cambridge dataset.

Table 1. Accuracy of the combination of four different models and three different features on the AICovidVN 115M challenge dataset.

Table 2. AUC of the combination of four different models and three different features on the AICovidVN 115M challenge dataset.

Table 3. Evaluation indicators of the MHCNN model on the AICovidVN 115M challenge dataset.

Table 4. Accuracy of the combination of four different models and three different features on the University of Cambridge dataset.

Table 5. AUC of the combination of four different models and three different features on the University of Cambridge dataset.

Table 6. Evaluation indicators of the MHCNN model on the University of Cambridge dataset.

Table 7. Performance of the MHCNN model on the evaluation indicators.