NanoChest-Net: A Simple Convolutional Network for Radiological Studies Classification

The new coronavirus disease (COVID-19), pneumonia, tuberculosis, and breast cancer have one thing in common: these diseases can be diagnosed using radiological studies such as X-ray images. With radiological studies and technology, computer-aided diagnosis (CAD) is a very useful technique for analyzing and detecting abnormalities in the images generated by X-ray machines. Some deep-learning techniques, such as convolutional neural networks (CNNs), can help physicians obtain an effective pre-diagnosis. However, popular CNNs are enormous models and need a huge amount of data to obtain good results. In this paper, we introduce NanoChest-net, a small but effective CNN model that can be used to classify among different diseases using images from radiological studies. NanoChest-net proves to be effective in classifying among different diseases such as tuberculosis, pneumonia, and COVID-19. In two of the five datasets used in the experiments, NanoChest-net obtained the best results, while on the remaining datasets our model proved to be as good as state-of-the-art baseline models such as ResNet50, Xception, and DenseNet121. In addition, NanoChest-net classifies radiological studies on the same level as state-of-the-art algorithms, with the advantage that it does not require a large number of operations.


Introduction
The new coronavirus disease (COVID-19) has reached historic proportions. As of 8 March 2021, the World Health Organization (WHO) had registered more than 116 million confirmed cases and over 2.5 million deaths [1]. COVID-19 is an infectious disease caused by the SARS-CoV-2 virus that severely affects the lungs of infected people, and the virus spreads easily through the air and by contact. COVID-19 can cause complications and lead to the development of pneumonia and other symptoms that can be confused with other diseases [1].
In addition, pneumonia is also an infectious disease that affects the lungs and can be caused by bacteria such as Streptococcus pneumoniae and Haemophilus influenzae, and by viruses other than the one that causes COVID-19. It has been a major disease and cause of death for children and senior people around the world. According to the WHO, pneumonia causes 15% of all deaths of children under 5 years old [2]. Moreover, pneumonia killed 808,694 children in 2017.
On the other hand, tuberculosis, caused by Mycobacterium tuberculosis, is also an infectious disease that causes antimicrobial resistance and death of tissue in different parts of the body, principally affecting the lungs. According to the WHO, tuberculosis is [...]
[...] feature extractor, a confidence module, and a prediction module, achieving a sensitivity of 71.70%. Rahman et al. [28] presented a comparison between several baseline models to classify images of children infected with pneumonia, achieving a sensitivity of up to 99%, and Luján-García et al. [29] used the same dataset but added a preprocessing technique and used a different pretrained baseline.
Recently, Rajpurkar et al. [30] presented a DL assistance tool to classify tuberculosis from patients with human immunodeficiency virus (HIV) using a CNN and a linear classifier to predict six clinical findings. On the other hand, Pasa et al. [31] presented a new small CNN to classify X-ray images from two small datasets, and they achieved good results despite the fact that no pretrained models were used. Moreover, using the same dataset as Pasa et al., Khatibi et al. [32] used an ensemble of CNNs to achieve classification accuracies up to 99.2%.
On the other hand, breast cancer diagnosis has also been improved using DL techniques. Shen et al. [33] used pretrained baselines on a large mammography dataset to classify among malignant and benign masses and calcifications, obtaining a sensitivity of 96%. Moreover, Agarwal [34] used a pretrained CNN to detect masses in mammography images, which achieved a better result than Shen et al., with a sensitivity of 98% on the same dataset. Finally, Wu et al. [35] presented a custom ResNet-based CNN to classify over 1 million images from multiple views of patients with benign and malignant masses, achieving an area under the curve score of 0.895.
Nonetheless, popular CNNs are enormous models and need a large amount of data to be trained properly and achieve good results. Therefore, we aim to present a small but effective CNN model that can be used to classify among different diseases using images from radiological studies.

Materials and Methods
In this section, datasets used for this research are described. In addition, we briefly introduce some of the CNN baseline models used for comparison purposes. Finally, metrics used for evaluating the algorithms are detailed.

Tuberculosis Dataset
The tuberculosis dataset is a collection of chest X-ray images from two different hospitals, presented by the National Institute of Health of the United States [36]. It is divided into two sets: the Montgomery County set and the Shenzhen set.
The Montgomery County set contains 138 frontal chest X-ray images, of which 80 are normal cases and 58 are from tuberculosis patients. Similarly, the Shenzhen set contains 662 frontal chest X-ray images, of which 326 are normal cases and 336 are from tuberculosis patients.

Pneumonia Children Dataset
The Pneumonia children dataset was published by Kermany et al. [37]. The dataset contains 5856 chest X-ray images of healthy and sick children up to five years old. All images are given as a training set, with 5232 images and an official test set of 624 images. From the 5232 images, 3883 are from patients infected with pneumonia. The remaining 1349 images are from healthy children. On the other hand, the test set is divided as follows: 390 images are from pneumonia-infected children, and 234 images are from healthy children.

COVID-19 Dataset
Presented by Cohen et al. [38], the COVID-19 Image Data Collection was one of the first openly available datasets containing chest X-rays from patients infected with COVID-19. We initially used the November 2020 version of the dataset, which contained 930 images from different diseases such as pneumonia, severe acute respiratory syndrome (SARS), and Middle East respiratory syndrome (MERS), among others. At that time, only 478 images were from patients infected with COVID-19. These images were used to generate two different sets of images for the experiments, as explained in the next section.

RSNA Pneumonia Challenge Dataset
The RSNA Pneumonia Challenge (RSNA-PC) is the only competition (from Kaggle.com) to classify and provide bounding boxes for damaged areas of the lung caused by pneumonia. The dataset contains 26,684 unique chest X-ray images of both normal (29%) and not normal/opacities (71%) for the training set, and 3000 images for the test set.

BCDR Dataset
The Breast Cancer Digital Repository (BCDR), by Moura and Guevara [39], offers multiple datasets for both digital and scanned mammography in which the principal classes are malignant and benign tumors. For this work, we have used only the two datasets of digital mammography.
The BCDR-D01 contains full-field digital mammography and is composed of 79 biopsy-proven lesions of 64 women, rendering 143 segmentations for 80 unique images of patients with benign tumors, and 57 patients with malignant tumors.
The BCDR-D02 contains full-field digital mammography and is composed of 230 biopsy-proven lesions of 162 women, rendering 455 segmentations for 359 unique images of patients with benign tumors, and 48 patients with malignant tumors.

CNN Models from the State of the Art
Since the formal introduction of Deep Learning in 2015 [40], the research community has dedicated a lot of attention and effort to developing DL algorithms for different purposes, such as image recognition, CV, CAD systems, and natural language processing, among others. Owing to their capacity to extract features within the algorithm itself from different kinds of signals (including images), CNNs have achieved outstanding results. Nowadays, a huge number of CNN models exist and are used for distinct purposes. As a result, we can find custom models [16,19,31] and models that use key baselines for the classification of different diseases [17,22,24,29,41-43].
For this research, we have compared the proposed method with the most popular CNNs used for computer vision tasks such as the ResNet50 [44], the Xception network [45], and the DenseNet121 [46].

Metrics
In a binary classification problem, we can measure performance according to the examples of each class that are correctly classified, the true positives (tp) and true negatives (tn), while also taking into account the errors made when classifying instances, the false positives (fp) and false negatives (fn). Normally, tp, tn, fp, and fn are shown in tabular form as a confusion matrix (Figure 1). From the confusion matrix, we can compute a variety of metrics. Accuracy is commonly used when we have a classification task among two or more different classes. Moreover, we can also compute other metrics such as precision, sensitivity, specificity, F1-Score, and the area under the ROC curve (AUC) [47]. The metrics used in this work are defined in Equations (1)-(5).
In general, accuracy does not always represent an unbiased performance measurement, because the instances of a dataset may be imbalanced across classes. Therefore, precision, sensitivity, specificity, F1-Score, and AUC are always helpful to measure the performance of a model. For this work, an AUC using thresholds was computed.
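For reference, the standard confusion-matrix definitions of the five threshold-based metrics named above can be written as follows. The numbering and ordering are assumptions that simply follow the order in which the metrics are listed in the text; AUC is not included because it is computed from the ROC curve rather than from a single confusion matrix.

```latex
\begin{align}
\text{Accuracy}    &= \frac{tp + tn}{tp + tn + fp + fn} \tag{1}\\
\text{Precision}   &= \frac{tp}{tp + fp} \tag{2}\\
\text{Sensitivity} &= \frac{tp}{tp + fn} \tag{3}\\
\text{Specificity} &= \frac{tn}{tn + fp} \tag{4}\\
\text{F1-Score}    &= \frac{2 \cdot \text{Precision} \cdot \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} \tag{5}
\end{align}
```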
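To illustrate why accuracy alone can mislead on imbalanced data, the following minimal sketch evaluates a hypothetical classifier that labels every BCDR-D02 image as benign, using the 359 benign and 48 malignant cases reported for that dataset. The function name `binary_metrics`, the choice of malignant as the positive class, and the convention of returning 0 for an undefined ratio are assumptions made for this example.

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute threshold-based metrics from confusion-matrix counts."""
    def safe_div(num, den):
        # Assumed convention: an undefined ratio (zero denominator) is reported as 0.
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    sensitivity = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": safe_div(tn, tn + fp),
        "f1": safe_div(2 * precision * sensitivity, precision + sensitivity),
    }


# Hypothetical classifier that predicts "benign" for all 407 images
# (malignant is treated as the positive class).
always_benign = binary_metrics(tp=0, tn=359, fp=0, fn=48)
for name, value in always_benign.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the accuracy is about 0.88 even though no malignant case is detected (sensitivity 0), which is the kind of bias the additional metrics are meant to expose.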

Proposal
In this section, a detailed description of the proposed custom CNN model is given. Moreover, the final datasets and their partitions are explained. In addition, preprocessing and data augmentation techniques are described. Finally, the hyperparameters used to train the models are mentioned.

DL Model
Inspired by the separable convolutions of the Xception network, we designed NanoChest-net to classify images from radiological studies, such as X-ray images. The complete block diagram of our CNN model is shown in Figure 2, and a complete specification of each layer is given in Table 1. We used the depth multiplier of the separable convolution layers to increase the number of output channels of each layer. In addition, we used a dilation rate of 2 to enlarge the receptive field of each layer. As Table 1 shows, our proposal has 28 layers in total. If we count a convolutional layer, its batch normalization layer, and its activation as a single layer (as is common in the literature), then our proposal is composed of only 14 layers. Moreover, if we focus only on weighted layers, then our proposal is as small as 10 layers in depth. In comparison, baseline models such as VGG-16, which comes next in layer size, contain 19 weighted layers and have an average of 136.4 million parameters [48]. Therefore, our proposal roughly halves the depth in terms of weighted layers, and with only 3.4 million parameters it has 40 times fewer parameters. For these reasons, namely the minimal number of layers in the CNN model and its application to radiological studies, primarily of the chest, we call this small model NanoChest-net.
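The effect of the three design choices mentioned above (separable convolutions, depth multiplier, and dilation rate) can be sketched with simple arithmetic. The snippet below counts parameters for a standard and a depthwise separable convolution and computes the effective kernel size under dilation. The channel sizes (32 in, 64 out) and the 3 × 3 kernel are hypothetical and are not taken from Table 1; the parameter layout assumed is the usual one of a depthwise kernel followed by a 1 × 1 pointwise kernel with one bias per output channel.

```python
def conv2d_params(k, c_in, c_out, bias=True):
    """Parameters of a standard k x k convolution."""
    return k * k * c_in * c_out + (c_out if bias else 0)


def separable_conv2d_params(k, c_in, c_out, depth_multiplier=1, bias=True):
    """Parameters of a depthwise separable convolution."""
    depthwise = k * k * c_in * depth_multiplier   # one k x k filter set per input channel
    pointwise = c_in * depth_multiplier * c_out   # 1 x 1 convolution mixing the channels
    return depthwise + pointwise + (c_out if bias else 0)


def effective_kernel(k, dilation):
    """Side length of the input window covered by a dilated k x k kernel."""
    return k + (k - 1) * (dilation - 1)


# Hypothetical layer sizes, chosen only for illustration.
standard = conv2d_params(3, 32, 64)
separable = separable_conv2d_params(3, 32, 64, depth_multiplier=1)
separable_dm2 = separable_conv2d_params(3, 32, 64, depth_multiplier=2)
window = effective_kernel(3, dilation=2)

print(standard, separable, separable_dm2, window)
```

For these hypothetical sizes the separable layer needs far fewer parameters than the standard one, doubling the depth multiplier roughly doubles its cost, and a dilation rate of 2 makes a 3 × 3 kernel cover a 5 × 5 window without adding parameters.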

Splitting and Final Datasets
We kept the original examples for the Shenzhen, Montgomery, Pneumonia children, BCDR-D01, and BCDR-D02 datasets. In addition, we generated two new subsets using the COVID-19 dataset and the RSNA-PC dataset. We combined the 478 COVID-19 images from the COVID-19 dataset with 478 images of healthy patients from the RSNA-PC to generate the COVID-NORMAL dataset. Likewise, we combined the same 478 COVID-19 images with 478 images of patients with pneumonia from the RSNA-PC to generate the COVID-PNEUMONIA dataset. Table 2 shows the final datasets used for this research.

Validation Method
Hold-out validation was performed to obtain the training, development (Dev), and test sets for each dataset. Hold-out validation consists of randomly dividing the original images into training, Dev, and test sets. Figure 3 shows the behavior of the hold-out validation method. A 70-10-20 hold-out split was used for each dataset, except for the Pneumonia children dataset, for which an official test set was established by the authors. The partitions for each dataset are shown in Table 3.
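A 70-10-20 hold-out split of the kind described above can be sketched as follows. This is a minimal illustration, not the partitioning code used for the experiments: the function name `hold_out_split`, the seed value, the rounding down of partition sizes, and the absence of per-class stratification are all assumptions.

```python
import random


def hold_out_split(items, train=0.70, dev=0.10, seed=42):
    """Randomly split items into training, Dev, and test partitions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # local generator, reproducible shuffle
    n_train = int(len(shuffled) * train)    # sizes are rounded down (assumption)
    n_dev = int(len(shuffled) * dev)
    train_set = shuffled[:n_train]
    dev_set = shuffled[n_train:n_train + n_dev]
    test_set = shuffled[n_train + n_dev:]   # the remainder, roughly 20%
    return train_set, dev_set, test_set


# Example with 662 image indices, the size of the Shenzhen set.
train_set, dev_set, test_set = hold_out_split(range(662))
print(len(train_set), len(dev_set), len(test_set))
```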

Preprocessing and Data Augmentation
All images were normalized before being fed to the CNN models. In addition, all datasets were resized to 500 × 500 pixels to avoid resizing each image from its original size to the input size of each CNN model at every training step. Moreover, we fed each model with images at its original input size. The input size of each model is shown in Table 4.

Tuberculosis Montgomery County Dataset
For the Montgomery County dataset, we first cropped the central region of the images to remove the black bars present in the original images (Figure 4). We followed the algorithm by Pasa et al. [31]. Then, we applied the preprocessing and data augmentation techniques.
On the other hand, data augmentation techniques were applied to each dataset to improve the generalization of the models. For the tuberculosis, pneumonia, and COVID-19 datasets, we applied horizontal flip, magnification in a range of 0.90 to 1.2, random width and height shift with a factor of 0.20, random rotation of 20 degrees, and brightness changes in a factor range of 0.80 to 1.05. In the case of the BCDR datasets, we changed the random rotation to 30 degrees and added vertical flip.
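The augmentation settings listed above could be collected in a configuration such as the one sketched below. The keyword names follow the arguments of the Keras `ImageDataGenerator` class; that this class was the one used, and that "magnification" maps onto its `zoom_range` argument, are assumptions. The dictionaries are plain Python so that the sketch runs without TensorFlow installed.

```python
# Settings for the tuberculosis, pneumonia, and COVID-19 datasets.
chest_augmentation = {
    "horizontal_flip": True,
    "zoom_range": (0.90, 1.20),         # "magnification" range (assumed mapping)
    "width_shift_range": 0.20,
    "height_shift_range": 0.20,
    "rotation_range": 20,               # degrees
    "brightness_range": (0.80, 1.05),
}

# BCDR datasets: same settings, but 30-degree rotation plus vertical flip.
bcdr_augmentation = {**chest_augmentation, "rotation_range": 30, "vertical_flip": True}

# Possible usage, assuming TensorFlow/Keras is available:
#   from tensorflow.keras.preprocessing.image import ImageDataGenerator
#   generator = ImageDataGenerator(**chest_augmentation)
print(bcdr_augmentation)
```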

Hyperparameter Tuning
We conducted the same experiments using the state-of-the-art CNNs and our proposed method. Equivalent hyperparameters were used for all models, except for the input size. We used the original input size for each model, as mentioned in Section 4.3. We trained all models using a logistic layer of two units with Sigmoid activation to obtain the probability of each of the two classes per dataset. Binary cross-entropy was used as the cost function (computed as in [29]) and Adam [49] as the optimization algorithm, with parameters β1 = 0.9 and β2 = 0.999 (the values recommended in the original paper). In addition, we performed several experiments with different optimizers to see their impact on the training of our proposal, as seen in Table 5 (best results are highlighted in bold).
From Table 5, we can see that stochastic gradient descent (SGD) [50] did not obtain good results on any dataset. On the other hand, Adam obtained the best scores 25 times, and RMSProp (introduced by Hinton in 2012) obtained the best scores 17 times. Adam was selected because it combines SGD (using momentum) and RMSProp (squared gradients). We then performed experiments to see the impact of changing the learning rate when training with the Adam optimizer, as seen in Table 6 (best results are highlighted in bold).
From these results (Table 6), we can see that a learning rate of 0.001 provided the best scores 16 times, whereas a learning rate of 0.0005 provided the best scores 23 times. A learning rate of 0.0005 was selected because it showed better performance than the larger one. Moreover, a small learning rate on the order of 10⁻⁴ benefits all models when training on small datasets.
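The way Adam combines a momentum term with RMSProp-style scaling by squared gradients can be sketched with a single scalar update. The code below is a plain-Python illustration of the update rule from the original Adam paper, not the framework implementation used in the experiments; the function name `adam_step`, the toy objective f(x) = x², and the default learning rate of 0.001 and epsilon of 1e-8 (the paper's defaults) are choices made for this example.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # momentum-like average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # RMSProp-like average of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Toy example: a few steps on f(x) = x**2, whose gradient is 2 * x.
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    x, m, v = adam_step(x, 2.0 * x, m, v, t)
print(x)
```

Because the update is normalized by the square root of the second-moment estimate, the size of the first step is approximately the learning rate itself, regardless of the gradient magnitude.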
As a result, a learning rate of 0.0005 was selected to perform all experiments with all datasets. In the same way, the Adam algorithm, which combines SGD and RMSProp, was selected as the optimization algorithm. Moreover, the number of epochs was selected according to the size of each dataset and our technological capabilities. In addition, the batch size was selected considering the size of each dataset and its partitions. Learning rate, epochs, and batch size configurations are shown in Table 7.
We applied class weights to the BCDR-D01 and BCDR-D02 datasets to combat class imbalance. We used 0.8636 and 1.1875 for benign and malignant on BCDR-D01, and 0.568 and 4.1765 for benign and malignant on BCDR-D02.
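Class weights of this kind are commonly obtained with the "balanced" heuristic, weight = N / (number of classes × class count). The sketch below applies that heuristic; whether it is the formula actually used here is an assumption. The per-class training counts (55/40 and 250/34) are hypothetical: they are not stated in the text and were chosen because they reproduce the reported weights.

```python
def balanced_class_weights(counts):
    """Weight each class by total / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Hypothetical training-partition counts (not given in the text).
weights_d01 = balanced_class_weights({"benign": 55, "malignant": 40})
weights_d02 = balanced_class_weights({"benign": 250, "malignant": 34})

print({k: round(w, 4) for k, w in weights_d01.items()})
print({k: round(w, 4) for k, w in weights_d02.items()})
```

With such weights, errors on the minority (malignant) class contribute more to the loss, so the optimizer is not rewarded for simply predicting the majority class.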

Results
In this section, the experimental framework is described. In addition, performance and comparison between models are presented. Furthermore, statistical analysis is presented, considering metrics and time measurements.

Experimental Framework
Experiments for this research were conducted on a PC with an AMD Ryzen 3700x processor, 16 GB of RAM, 512 GB SSD + 2 TB of storage, and an Nvidia RTX 2070 Super GPU with 8 GB GDDR5. Python 3.7.9 was used as the programming language, TensorFlow 2.1.0 with Keras as the high-level DL framework, scikit-learn 0.23.2 [51] as the machine learning (ML) library, and OpenCV 3.4.2 [52] as the main image processing library. Moreover, we set a fixed seed for TensorFlow, the Python random generator, and the NumPy library to make the experiments repeatable.
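Fixing the seeds of the three random generators mentioned above can be sketched as follows. The helper name `set_global_seed` and the seed value are hypothetical; the calls `random.seed`, `numpy.random.seed`, and `tf.random.set_seed` are the standard seeding functions of each library. NumPy and TensorFlow are imported lazily so the sketch still runs where they are not installed.

```python
import random


def set_global_seed(seed=42):
    """Seed the Python, NumPy, and TensorFlow random generators when available."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)
    except ImportError:
        pass  # TensorFlow not installed; nothing to seed


set_global_seed(42)
first_draw = random.random()
set_global_seed(42)
second_draw = random.random()
print(first_draw == second_draw)
```

Note that fixed seeds alone may not guarantee bit-identical results on a GPU, since some operations are non-deterministic.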

Test Sets Results
We trained all baseline models and the proposed NanoChest-net using the same hyperparameters (apart from the input size) on all the datasets specified in Section 2 and computed the performance metrics. The results over the respective test set for each dataset can be found in Table 8 (best results are highlighted in bold). As seen in Table 8, the proposed model behaved similarly to the baseline ones. A detailed discussion is given in the next section.

Training Time Results
Apart from evaluating the metrics, we also measured the time taken by each model during training. We measured the total time taken by each model, the average time per epoch, the time each model takes to process a single example, and the time taken to achieve the best result across all the training epochs. The results can be found in Table 9 (best results are highlighted in bold). Although our proposal did not appear to be the fastest among the CNN models, we perform further analyses in the next section.

Size of the Models
With the final structure of each CNN model, and after training them, we measured the total number of parameters of each one. In addition, as the best-performing model of each CNN was saved, we measured the storage size of each model. Results can be found in Table 10 (the smallest number of parameters and size are highlighted in bold).

Statistical Analysis
We performed the Friedman test [53] on the metrics computed for each model on each dataset (Table 11). The Friedman test indicates, at a 95% confidence level, whether a statistically significant difference exists between five or more instances ranked in order of significance. If p < 0.05, then significant differences exist. The results of the Friedman test are arranged in Table 11 (best results are highlighted in bold). From Table 11, we can observe that the p-values were greater than 0.05 (confidence level of 95%). Therefore, the null hypothesis of equal performance among the compared algorithms was not rejected.
As for time per example, the Friedman test obtained a p-value of 0.001213, with ResNet50 ranked best. As a result, the null hypothesis was rejected. We then applied the Holm test [54] and obtained the results shown in Table 12. From Table 12, we can observe that the Holm test rejected the hypotheses with unadjusted p-values smaller than 0.001213. Therefore, neither DenseNet121 nor NanoChest-net was rejected. On the contrary, only the Xception network was rejected, showing significant differences (inferior performance) compared with the other algorithms. We perform further analyses in the next section.
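The Friedman test ranks the models within each dataset and checks whether the rank sums differ more than chance would allow. The sketch below computes the Friedman chi-square statistic and the mean ranks in plain Python. The score table is hypothetical (four models on four datasets, not values from Table 8), ties receive average ranks without a tie correction, and the comparison uses the chi-square critical value for 3 degrees of freedom at the 0.05 level (about 7.815), which is only an approximation for so few datasets.

```python
def rank_row(row, higher_is_better=True):
    """Rank the scores of one dataset (1 = best); ties get the average rank."""
    order = sorted(range(len(row)), key=lambda j: row[j], reverse=higher_is_better)
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        average = (i + j) / 2 + 1            # mean of the 1-based positions i+1..j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = average
        i = j + 1
    return ranks


def friedman_statistic(table, higher_is_better=True):
    """Return the Friedman chi-square statistic and the mean rank of each model."""
    n, k = len(table), len(table[0])         # n datasets (rows), k models (columns)
    rank_sums = [0.0] * k
    for row in table:
        for j, r in enumerate(rank_row(row, higher_is_better)):
            rank_sums[j] += r
    chi2 = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    return chi2, [r / n for r in rank_sums]


# Hypothetical scores: one row per dataset, one column per model.
scores = [
    [0.90, 0.85, 0.80, 0.75],
    [0.88, 0.90, 0.70, 0.75],
    [0.95, 0.90, 0.85, 0.80],
    [0.91, 0.89, 0.85, 0.80],
]
chi2, mean_ranks = friedman_statistic(scores)
reject_null = chi2 > 7.815                   # critical value for df = 3, alpha = 0.05
print(round(chi2, 3), mean_ranks, reject_null)
```

In practice a library routine that also returns the p-value would normally be preferred; the manual version is shown only to make the ranking step explicit.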
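Holm's step-down procedure, used as the post hoc test above, sorts the unadjusted p-values in ascending order and compares the i-th smallest with α / (m − i + 1), stopping at the first hypothesis that cannot be rejected. The sketch below implements that rule. The p-values are hypothetical and are not the values of Table 12; they were chosen so that the outcome mirrors the situation described in the text, with ResNet50 assumed to be the control method.

```python
def holm(p_values, alpha=0.05):
    """Return {name: True if rejected} using Holm's step-down procedure."""
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    decisions = {}
    still_rejecting = True
    for i, (name, p) in enumerate(ordered):
        threshold = alpha / (m - i)          # alpha/m, alpha/(m-1), ..., alpha
        # Once one hypothesis is retained, all later ones are retained too.
        still_rejecting = still_rejecting and p <= threshold
        decisions[name] = still_rejecting
    return decisions


# Hypothetical unadjusted p-values for comparisons against the control method.
p_vs_control = {"Xception": 0.0004, "NanoChest-net": 0.12, "DenseNet121": 0.45}
decisions = holm(p_vs_control)
print(decisions)
```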

Discussion
In this section, advantages of the proposal are highlighted. An evaluation of the classification results on the different datasets is performed, as well as the time analysis of the training.
From the classification results (Table 8), our proposal obtained good results across all the datasets. On the Montgomery County dataset, our model obtained the best results on accuracy, sensitivity, F1-Score, and AUC, with scores of 0.931, 1.000, 0.929, and 0.928, respectively. For the Shenzhen dataset, our model obtained the second-best scores on accuracy (0.828), sensitivity (0.897), F1-Score (0.841), and AUC (0.928), behind only the Xception network. On the Pneumonia children dataset, our method achieved the best results among all CNN models, with an accuracy of 0.931, precision of 0.904, sensitivity of 0.995, specificity of 0.825, F1-Score of 0.947, and an AUC of 0.992. Likewise, for the COVID-NORMAL dataset, our proposal achieved the best scores of all models, with an accuracy of 0.933, precision of 0.912, sensitivity of 0.959, specificity of 0.907, F1-Score of 0.935, and an AUC of 0.970. For the COVID-PNEUMONIA dataset, NanoChest-net obtained the best results for precision (0.860), specificity (0.878), and AUC (0.919), and its remaining metrics were second only to the Xception network. For the BCDR-D01 dataset, however, we fell to third place behind the DenseNet121 and Xception networks, except for sensitivity (best result). We obtained the following scores: accuracy of 0.621, precision of 0.500, sensitivity of 0.818, specificity of 0.500, F1-Score of 0.621, and AUC of 0.702. Finally, for the BCDR-D02 dataset, we obtained the best results for sensitivity (0.556) and F1-Score (0.278). For the remaining metrics, we placed second, behind only DenseNet121, with an accuracy of 0.687, precision of 0.185, specificity of 0.703, and an AUC of 0.664. From the statistical analysis of the metrics (Table 11), the Friedman test did not show evidence to reject the null hypothesis. Therefore, there were no statistically significant differences among the models.
However, the scores from the test placed our proposal at the top of the ranking on each metric, showing superior behavior compared to the state-of-the-art baseline models. In addition, if we used a confidence level of 90%, differences would exist on F1-Score and AUC in favor of NanoChest-net, owing to its first position in the ranking.
On the other hand, the time measurement results (Table 9) showed that on the Montgomery County and BCDR-D01 datasets our proposal obtained the lowest values for total training time, average time per epoch, and time per example. For Shenzhen, Pneumonia children, COVID-NORMAL, COVID-PNEUMONIA, and BCDR-D02, our method was always in third place, behind the ResNet50 and DenseNet121 networks. Nonetheless, from the Friedman and Holm test results (Tables 11 and 12), we can observe that there were no significant differences among ResNet50, DenseNet121, and our method. The Xception network was the only model that performed significantly worse in terms of time per example.
At the same time, our proposal took an important step further on a crucial aspect. Apart from showing decent classification and time results, our method was significantly smaller in parameter count and storage size. From Table 10, we can observe that our method had less than half the parameters and size compared to DenseNet121, and 6 to 7 times fewer parameters and smaller size when compared to Xception and ResNet50 models. Consequently, our model could be used in computers, embedded devices, and mobile devices with limited storage, memory, and computation capabilities.

Conclusions
In this paper, we have introduced a new, fully custom, and small convolutional neural network model called NanoChest-net. Our proposed model is used to classify medical images from radiological studies, such as chest X-rays and mammograms. Our model proves to be effective in classifying among different diseases such as tuberculosis, pneumonia, and COVID-19. Moreover, NanoChest-net obtained the best results on both the Pneumonia children and COVID-NORMAL datasets. On the remaining datasets, our model proved to be as good as state-of-the-art baseline models such as ResNet50, Xception, and DenseNet121, with no statistically significant differences among models in either performance or training time. In contrast, there is a marked difference in the number of parameters and storage size of our model, which is two to seven times smaller than the baseline models. In short, the NanoChest-net model classifies radiological studies on the same level as state-of-the-art algorithms without computing large numbers of operations or occupying more than 40 MB of storage, making our proposal suitable for embedded and mobile devices.