Does Two-Class Training Extract Real Features? A COVID-19 Case Study

Diagnosis aid systems that use image analysis are currently very useful due to the large workload of health professionals involved in making diagnoses. In recent years, Convolutional Neural Networks (CNNs) have been used to help in these tasks. For this reason, multiple studies that analyze the detection precision for several diseases have been developed. However, many of these works distinguish between only two classes: healthy and with a specific disease. Based on this premise, in this work, we try to answer the questions: When training an image classification system with only two classes (healthy and sick), does this system extract the specific features of this disease, or does it only obtain the features that differentiate it from a healthy patient? Trying to answer these questions, we analyze the particular case of COVID-19 detection. Many works that classify this disease using X-ray images have been published; some of them use two classes (with and without COVID-19), while others include more classes (pneumonia, SARS, influenza, etc.). In this work, we carry out several classification studies with two classes, using test images that do not belong to those classes, in order to try to answer the previous questions. The first studies indicate problems in these two-class systems when using a third class as a test, being classified inconsistently. Deeper studies show that deep learning systems trained with two classes do not correctly extract the characteristics of pathologies, but rather differentiate the classes based on the physical characteristics of the images. After the discussion, we conclude that these two-class trained deep learning systems are not valid if there are other diseases that cause similar symptoms.


Introduction
Advances in imaging technology and computer science have greatly enhanced the interpretation of medical images and contributed to early diagnosis.
The typical architecture of a Computer-Aided Diagnosis (CAD) system includes image pre-processing, the definition of the region(s) of interest, feature extraction and selection, and classification. Currently, Deep Learning (DL) is probably the most widely used technology to implement this type of system. This technology requires large training datasets, and the data have to be correctly labeled in order to learn the relevant patterns [1,2]. This data-intensive nature of DL and the necessity of labels has produced models that usually do not generalize well when there is a distribution shift between training and evaluation data [3].
When focusing on the study of respiratory infectious diseases that affect lungs (mainly pneumonia), different types can be found based on the information provided by X-ray pulmonary images. In addition, these problems can be produced by bacteria, viruses, or fungi.
Thus, if the system trained by pulmonary X-ray images classifies between healthy patients and patients with a disease that leads to pneumonia, it is logical to think that the system would focus on detecting lung problems. Thus, it would not be correct to say that the developed system provides an accurate result for that particular disease.
In the particular case that concerns us, how is COVID-19 pneumonia different from regular pneumonia? The symptoms of COVID-19 pneumonia can be similar to other types of viral pneumonia. Because of this, it can be difficult to know what is causing this condition without getting tested for COVID-19 or other respiratory infections.
Research is underway to determine how COVID-19 pneumonia differs from other types of pneumonia. Information from these studies can potentially aid in the diagnosis and in expanding our understanding of how SARS-CoV-2 affects lungs [4].
That study used Computerized Tomography (CT) scans and laboratory tests to compare the clinical features of COVID-19 pneumonia with other types of pneumonia. The researchers found that people with COVID-19 pneumonia were more likely to have: • Pneumonia that affects both lungs instead of just one (bilateral pneumonia). • Lungs that had a characteristic "ground glass" appearance on CT scans. • Abnormalities in some laboratory tests, particularly those that were evaluated.
However, we are talking about probabilities, not exact rules. Thus, using lung X-ray images, a patient with COVID-19 can be erroneously classified as a patient with any other type of disease that causes similar damages to lungs.
Considering the above-mentioned diagnosis aid systems, the availability of multiple Chest X-Ray (CXR) datasets and the advances in the field of deep learning have facilitated the development of techniques for automatic image interpretation [5]. These techniques have been recently applied to COVID-19 detection in CXR images, e.g., [6][7][8][9][10]. In general, these papers used widely different datasets with different domains and distributions. As an example, Reference [6] used 184 COVID-19 images and 5000 non-COVID-19 images coming from 13 different categories. As a second example, Reference [8] used 358 CXR images from 266 COVID-19 patient cases, 8066 with no pneumonia (i.e., normal) and 5538 with non-COVID-19 pneumonia. As a further example, Reference [7] distinguished only between COVID-19 and tuberculosis. In general, we can find tools that distinguish between several classes (e.g., COVID-19, non-COVID-19 pneumonia, and healthy) or binary systems.
As the first step, we need to study the various datasets and works related to this topic. However, due to the extensiveness of these studies, they are included in the next section.
Thus, generalizing this problem, it is important to study whether the classification systems based on only two classes (healthy or sick) behave correctly.
The main problem with binary systems is that some of them do not try to cover the spectrum of all possible cases that can be found on real CXR and consider only the characteristics that distinguish the two analyzed classes.
A priori, and according to current studies, if the disease whose images are used to train the DL system has very characteristic features that distinguish it from any other disease, it is feasible to think about this possibility. However, if the classifier is trained with images that focus on symptoms generated by that disease and that may appear in other cases (not being exclusive to the trained disease), it is fair to think that the system could make errors if it is tested with images of patients with other diseases with similar symptoms not included in the training dataset.
Thus, in this work, we ask the question: Does the use of two classes in disease classification systems based on convolutional networks correctly extract the characteristics that distinguish that disease, or, on the contrary, does it only extract the characteristics that differentiate a healthy patient from another with the specific disease?
In the rest of the paper, we deal with the real-world problem of diagnosis aids originally trained with a specific domain and distribution and later used in a more realistic situation.
To try to answer these questions, in this work, various studies are carried out in which DL systems for lung diseases' classification are trained. In these studies, exclusively two-class training is carried out, and images from another untrained class is used to test the system. Of course, diseases that share symptoms of pneumonia are used in order to observe the behavior of the trained systems.
In addition, considering the current global situation and focusing this study on a topic that is now very relevant, in this work, we focus on the application case of COVID-19 detection. In this way, we consider three main classes: healthy, pneumonia, and COVID-19. For this study, we train DL two-class classification systems that distinguish between: • Healthy patients and COVID-19 patients; using pneumonia patients as the test case. • Healthy patients and patients with pneumonia; using COVID-19 patients as the test case. • Patients with pneumonia and patients with COVID; using healthy patients as the test case.
The rest of the paper is divided as follows: First, in the Materials and Methods Section, the different datasets and published works are detailed, as well as the analysis methodology used to test the different systems. Next, the results obtained after training and testing the different systems are detailed and explained in the Results and Discussion Sections. Finally, conclusions are presented.

Materials and Methods
In this section, the theories detailed above are focused on some of the most relevant works currently published about COVID-19 detection using medical images. To achieve this goal, first, we present the current publicly available datasets. Next, we summarize a significant subset of the published works, considering the datasets used by each one. This paper does not pretend to be a review of the existing papers on CXR-based COVID-19 diagnosis aids; thus, we include a subset of the available papers that is wide enough to demonstrate our theories. Finally, we detail the methodology used to demonstrate the theories presented in the Introduction about the goodness of COVID-19 detection systems based on convolutional neural networks with only two classification classes.

Works and Datasets
First, a study of the different publicly accessible datasets was carried out. It should be noted that, on some occasions, the datasets collected to prepare a work are not published, and therefore, we cannot evaluate them to corroborate their content. However, thanks to the published articles, we can indicate their most relevant characteristics. The results of this study are shown in Table 1. As can be observed in the datasets shown in Table 1, some of them only provide information about two classes: COVID-19 positive and COVID-19 negative. Due to this fact, on some occasions, the election of using a 2-class classification system in some works depends on the dataset used. Moreover, some datasets combine images from other COVID-19 datasets (like COVIDx-CRX, which includes some images from other datasets of the previous table), and finally, in some cases, other classic pneumonia datasets are used to complete the dataset itself.
However, as can be observed in Table 2, the most widely used dataset is COVID-chestXray, which does provide information on more than two classes. This dataset uses its own images and distinguishes between several illnesses that attack lungs and produce different types of pneumonia; thus, this is the most complete dataset. Furthermore, as can be observed in the last two columns of Table 1 (and based on the explanation included in the caption), this dataset is the most balanced among the studied datasets that distinguish between more than two classes. Other important points shown in Table 2  After presenting the most widely used datasets and some of the most relevant works, the methodology used in this work to test the quality of the COVID-19 classification systems that only use two classes is described.

Study Approach
In this subsection, the dataset used for this work and the system's architecture are detailed. First, the dataset is presented.

Dataset Used
To carry out our study, a combination of the COVID-chestXray [11] (only CT images) and COVIDx-CXR [8] datasets using three classes (COVID-19, pneumonia, healthy) is used. After this combination, the obtained dataset is almost perfectly balanced (≈33.3% for each class). Although it contains images registered by different acquisition technologies and different dispositions of the patient's torso during registration, there is a very high imbalance, and the most abundant are those obtained by X-ray generators and with the torso exposed from the front so that the two lungs can be seen in the same image. Therefore, to achieve a sufficiently homogeneous sample for our study, only images with these characteristics were used. By performing this filter, bias was limited, and a better interpretation of the results could be obtained; however, the total number of images for each class was significantly reduced.
In this study, we trained three different models for a binary classification problem, considering only two of the three classes in the dataset for each training experiment. To verify that the model is effective for the corresponding classification problem, the dataset was split for training and validation using the hold-out technique, that is a subset was randomly selected for the training of the models, and the remaining subset was used to assess the model performance. Thus, a subset with 80% of the dataset samples was used for training, while the remaining 20% subset was used for evaluation. Table 3 shows the distribution. Table 3. Dataset distribution for each subset.

Subset
Total Train  Test   COVID-19  132  105  27  Healthy  132  105  27  Pneumonia  132  106  26  Total  396  316  80 Once the dataset has been described, the next step (and the most important one) in our study is to check the behavior of each binary model when it receives samples from the class for which it has not been trained. For this reason, in this second phase of the study, all samples of the excluded class were used as the test subset for the respective binary model. This explanation is detailed further in the next subsection.

Processing Architecture
In this work, the VGG16 architecture [28] was used as the backbone. The weights used are those resulting from a pre-training with the ImageNet dataset [29]. An average-pooling layer was added next, whose results were flattened and transmitted to a dense layer of 64 nodes and, finally, a last inference layer. Only these last added layers were trained with the previously described dataset. The architecture was implemented using TensorFlow with Keras.
Additionally, dataset pre-processing was applied before the training process. The images stored in the initial dataset contained lung X-ray images of healthy patients, patients with pneumonia, and COVID-19 positives. However, some samples of the COVID-19 positive cases were not obtained with the same acquisition parameters. Moreover, in order to work with a subset of images with similar features, a histogram equalization was applied as a pre-processing step.
As explained above, using the TensorFlow framework with Keras, a VGG16 architecture was combined with a final inference layer to train the classification systems. In this study, three binary classification models were trained. These models were: Once the system architecture and the dataset used to obtain the classification mechanism have been specified, the procedure used in this study to analyze the trained models must be assessed.

Performance Assessment
The final purpose of our study is to know the vulnerabilities that a trained and evaluated model may have for a case of binary classification of lung pathology. For this purpose, the following two study phases are proposed: • Classic model test: A conventional training and evaluation are carried out. This is used to evaluate the samples of the classes for which the model has been trained. It is evaluated through the use of metrics and the analysis of the confusion matrix. The metrics considered consist of accuracy, precision, recall, and f1-score. If the model is considered as invalid, it means that the model is not capable of extracting real features from the classes, but it only distinguishes differences between both of them. That is why in this phase, the distribution of the samples classified by the models are analyzed. Additionally, the performance results that each model would achieve when faced with a homogeneous sample of the three considered classes are shown. As the models have been designed to solve binary classification problems, in order to correctly interpret the results, we must combine the samples of two classes and assume a new classification problem, which will depend on the combination performed. The combinations and the resulting classification problems are displayed in the Results Section.
For the case of a valid model, the expected results for each trained model are detailed next: • H-P model: The testing class will be COVID-19. In this case, as COVID-19 patients present more severe symptoms of pneumonia than patients diagnosed with another type of pneumonia, a valid model should be capable of classifying COVID-19 patients in the pneumonia class.
• H-C model: The testing class will be pneumonia. In this case, if the system detects lung problems, pneumonia patients should not be classified as healthy.
• P-C model: The testing class will be healthy. In this case, if the system distinguishes the severity of lung damage, healthy patients should be classified in the pneumonia class.
After presenting the methodology used to test the theories discussed in the Introduction, the results are detailed in the next section.

Results
This section is divided into two subsections. In the first one, the classical model verification is presented; and in the second one, the extended class test is performed.

Classical Model Verification
Each binary class is trained, and the results obtained from the previously defined metrics are detailed in Table 4. These results are obtained after testing the system with 20% of the samples of the same classes used for the training. These results are presented next with normalized confusion matrices (see Figure 1) and ROC curves (see Figure 2).
As can be observed in the previous tables and graphs, the classification systems that only use two classes obtain high performance results. Furthermore, the binary system trained only with healthy and COVID-19 images obtains a 96% accuracy (better that almost all the binary systems presented in Table 2). The only work with binary classification that obtained better results was the one by Ahoja et al. [19]; however, in that work, the DL model was much more complex that the one used for this training.

Extended Class Test
As detailed before, in this subsection, images from the non-trained class to test each binary system are used. Summarizing, the class used for testing is: • H-P model: the testing class will be COVID-19.
• H-C model: the testing class will be pneumonia. • P-C model: the testing class will be healthy.
Results obtained after testing each system with the third class are shown in Table 5. These results are not presented using confusion matrices and/or ROC curves as the results obtained would not be correct because there are no true positive classifications, and that value is essential to calculate the metrics involved in such representations.
As can be observed in Table 5, each model behaves differently. However, these results are analyzed in the Discussion Section with a medical approach in order to know if the trained model is medically valid or not depending on the results obtained after testing each model with the third class. Another point of view used to study the quality of DL systems with external classes that have not been used during the training process is to calculate the pseudo confusion matrices (a technique applied in other works like [30][31][32]). This technique consists of assuming that the images of the new class correspond to some of the original classes. In this way, the system is trained with images of all classes, but only classifies among the original classes. In the case of binary systems (such as those studied in this work), two different situations can be obtained for each model consisting of integrating the images of the new class into the two existing classes. If such a study is carried out, the classification results should be able to clarify whether the resulting system is valid or not based on the results of the pseudo confusion matrices.
Thus, with this approach, six different systems have to be considered: • H-P model-Case 1: COVID-19 images are included in the healthy class, in order to obtain a system that classifies between pneumonia positive and pneumonia negative. The pseudo confusion matrices are presented in Figure 3: the three classes are used with each binary model (rows), including the non-trained class inside another one to resemble real situations (columns).
As can be seen in Figure 3, each case behaves differently (even within the same model). Each case is analyzed in the Discussion Section.

Discussion
The first results presented in Table 4 and detailed in Figures 1 and 2 demonstrate that the DL model used for this work has acceptable results, and the previous works that obtained worse results will probably not behave as well as our system in the next test (using images from a third class not included in the training).
From a medical point of view, the main difference between the three classes involved in this study should focus on alveoli inflammation. In that case, patients classified as healthy should not have any inflammation; patients with COVID-19 should have severe bilateral pneumonia (due to severe alveoli inflammation in both lungs); and patients with other pneumonia should have unilateral inflammation or mild bilateral inflammation (as detailed in [9]). Analyzing each case independently, we can observe: ] H-C model: The system received pneumonia images for testing. This is the most interesting model, as many works presented in the bibliography (see Table 2) used this classification between healthy and COVID-19 patients. However, there have been many deceased patients whose damages provoked pneumonia symptoms. Therefore, a valid DL system should extract the characteristics regarding the alveoli inflammation of the images, and according to that, all images that present some alveoli inflammation should be classified as COVID-19. This may seem unrealistic, but for these DL works, the world itself is divided only into two classes: healthy and COVID-19. Therefore, for these systems, there are no other pneumonia cases beyond COVID-19 patients. That is why, if the meta-characteristics are extracted correctly during the training process, every image with alveoli inflammation should be classified as COVID-19. When focusing on the results obtained in Table 5, the system behavior seems random: half of the images are classified as healthy and the other half as COVID-19. These results seem to indicate that the system does not extract the correct characteristics from the classes involved, and each pneumonia image is classified randomly as healthy or COVID-19, which seems to be a big mistake due to the clear medical differences between those classes. Therefore, according to this, the model should not be considered as medically valid. • [Case 3] P-C model: In this last model, the system received healthy images for testing. The system itself may not seem useful because no healthy patients were taken onto account. However, in using a medical approach, the pneumonia symptoms can be easily detected by specialist; so, this first division between ill and healthy can be done by doctors. With ill patients, using a system that distinguish between mild and severe pneumonia (produced by COVID-19) can be useful for diagnosis purposes. Even so, no published works used this model because the results obtained (as observed in Table 4) were lower than H-C models because of the differences between both classes. Regarding this model, if the characteristics' extraction is performed correctly, a healthy patient should be classified in the pneumonia class as the lung damage is much lower that COVID-19 patients. Therefore, if the world is divided between pneumonia and COVID-19 classes, the healthy images are much more similar to pneumonia than COVID-19.
Focusing on the results obtained in Table 5 and according to the previous explanation, the system seems to behave correctly: 96.21% of the healthy images are classified as pneumonia. Therefore, in this case, the model seems to be medically valid.
From a medical utility point of view in the case of COVID-19 patient classification, the H-P model does not make much sense. However, the H-C and P-C models do have practical utility: the first model could help to distinguish between healthy and sick patients and the second one to distinguish classic pneumonia from another severe pneumonia caused by COVID-19. Regarding those models, according to the previous explanation, the second one has been tested to be medically useful even if there is no healthy class trained in the system. However, the most useful binary classifier is potentially the H-C model (healthy vs. COVID- 19), and this can be observed in the publications presented in Table 2.
Deeply analyzing the pseudo-confusion matrices presented in the Results Section, we can observe: • H-P Model-Case 1 (Figure 3, top-left): As can be observed, many of the nonpneumonia images are classified as pneumonia (more than 46%), so the case of mixing healthy and COVID-19 classes is not medically valid. • H-C Model-Case 2 (Figure 3, top-right): The system presents more than 13% of false positives when classifying healthy patients. Joining the COVID-19 and pneumonia classes produces acceptable results for distinguishing between sick and healthy patients, but the results show almost 8% false negative results for the diseased class. However, as detailed before, the H-P model is not useful to distinguish COVID-19 patients. • P-C Model-Case 1 (Figure 3, middle-left): the integration of the pneumonia class with the healthy class may seem the most logical and easy election due to its use in previous works. For this case, the system obtains more than 31% of false positives for the COVID-19 class, so the results are not acceptable. Moreover, after analyzing the results of the pseudo confusion matrices, we can conclude that none of the models presented in the Results section satisfies the requirements of the classifier with guarantees. There are only two cases that obtain acceptable results: Case 2 of the H-P model, but this system does not make sense for the detection of COVID-19; and Case 2 of the P-C model. However, for the latter case, the high rate of false negatives when detecting COVID-19 patients is very dangerous. Furthermore, for the model most widely used for binary classifiers in previous works (H-C model), the results are not acceptable at all, as it obtains results below 70% for some classes.
Thus, after presenting the analytical results, we want to finish this work going back to the Introduction and trying to answer the questions initially posed: Does the use of two classes in disease classification systems based on convolutional networks correctly extract the characteristics that distinguish that disease, or, on the contrary, does it only extract the characteristics that differentiate a healthy patient from another with the disease?
Furthermore, the following question should be added to those already presented above: If the DL system does not correctly extract the characteristics, what is it extracting?
We try to solve these questions in the next subsection.

Answering the Questions
Previously, various DL binary classifiers based on the detection of lung associated pathologies have been tested. For this purpose, they have been tested in extreme conditions where an external class not used in the training phase has been used as a test class.
However, this third class is related to the problem that each model has tried to solve; so, eventually in real life, images from this class may be used as inputs in the developed systems (even accidentally). Furthermore, if a new pathology that causes lung injuries appears in the future, it is important to know how the current systems would act.
The purpose of this study is to try to understand the feature extraction phase of a convolutional DL system. If the trained system correctly extracts the most relevant features from the images (as a perfect system should do), the main feature would be the lung damage present in each image. If this is the case, when testing a new class, the system should classify these images as the already trained class whose lung damage most closely resemble the new class.
According to the results obtained in the previous subsection, we can affirm that the binary systems tested with X-ray images for the detection of lung associated pathologies do not act according to a system that has correctly extracted the relevant characteristics of the images. This affirmation means that, in those cases, these systems would not work correctly with the inclusion of an untrained third class. Therefore, the feature extraction process was not done adequately for those models.
Thus, the questions initially presented can be answered thanks to these results: • Does the use of two classes in disease classification systems based on convolutional networks correctly extract the characteristics that distinguish that disease? No, it does not. According to the previous results obtained after testing several cases and combinations of binary classifiers for lung damage detection, we demonstrate that the behavior of the models trained with images of untrained classes is not correct and does not behave as if the characteristics of the trained classes had been correctly extracted. • Does it only extract the characteristics that differentiate a healthy patient from another with the disease? It seems it does, although we cannot affirm this categorically. The operation of the training process of DL systems and the good classification results initially obtained for each binary model for its training classes allow us to affirm with relative security that the system only focuses on detecting the differences between the classes used during the training process.
In this way, the two unknowns that initially arise in this work have been answered. However, according to the obtained results, a new question that we try to answer below appears. If the binary systems studied above do not extract the relevant features from the images used to train them, what features are they extracting during training? Can these be known?
Leaving aside temporarily the topic discussed above about lung damage, if we extend this question to a generic DL system that classifies grayscale images, it is logical to think that the distinction between the different classes that it is classifying could be related to the gray levels of the images involved in the training. Thus, we carry out a preliminary study about the gray levels of the images involved in the previously trained models: healthy, pneumonia, and COVID-19.
First, we randomly select an image from each class, apply the pre-processing (in the same way as carried out before training), and calculate its histogram. The obtained results are shown in Figure 4.
At first sight, it is not easy to detect the differences in the X-ray images in some cases. Greater inflammation of the alveoli can be seen in the image with COVID-10, but it is difficult to distinguish it from the inflammation that can be seen in the image with pneumonia. However, if we look at the histograms of the images (bottom of Figure 4), we can more clearly distinguish some differences between them. For the displayed images, the COVID-19 image has a higher concentration of dark areas, while the healthy image has a higher concentration of light areas. In the case of pneumonia, it can be seen that the curve of the histogram is flatter, so that it has a more regular concentration than the other two cases.
Even so, we cannot extract the characteristics from three images obtained randomly from the classes. It is true that there are clear differences in the histograms and that they could help in the classification of the DL system, but this does not answer why they are classified incorrectly if the differences are as clearly appreciable as those observed in the figure.

COVID-19
Pneumonia Healthy Figure 4. Image example from each class (after pre-processing) with its histograms.
In order to more clearly detect the characteristics of each class, a similar study must be carried out with all the images in the dataset. To achieve this goal, as we know that the relevant information could be concentrated in the histograms, the mean histograms of each class are calculated. These mean histograms show the mean of the occurrences of each value on the grayscale for each class and are presented in Figure 5.

COVID-19
Pneumonia Healthy If we analyze more deeply the mean histograms, we can observe some characteristics that clearly distinguish each class: • COVID-19: This first class presents four important characteristics that can be easily appreciated (see Figure 5-left). They are detailed next: • Pneumonia: This final class presents other characteristics that are similar to the ones detailed for the other two classes (see Figure 5-middle). They are detailed next: -Pronounced peak of blacks; similar to the COVID-19 class, but larger.

-
Maximum concentration, between 100 and 150, with a local peak in the middle of the range; exactly the same as the healthy class.

-
Higher concentration of whites than the COVID-19 class, but lower than the healthy class.

-
No end peak in whites.
It can be seen that there are shared characteristics between classes. The two classes that have the most differences are COVID-19 and healthy (as might be expected), while the pneumonia class presents similar characteristics to the other two classes. Depending on which characteristics are taken into account by the trained model, it will resemble one or another class.
Summarizing and generalizing the observed characteristics, we can distinguish: • Characteristic 1 (Ch1): Blacks' peak. Thus, it would be interesting to study the consequences of these characteristics in the previously described models. Initially, a qualitative study is carried out on the possible characteristics used for each model, and based on these, we justify the results obtained in the various trained systems. According to the results shown in Table 5, where the class not included in the training was used as a test set, we can make the following observations: • H-P model: The characteristics that easily distinguish these classes are Ch1, Ch2, Ch3, and Ch4. We can theorize that these characteristics rule the classification result. Therefore, introducing the COVID-19 class as a test, we can observe that it matches the characteristics Ch1 and Ch3 with the pneumonia class and characteristics Ch2 and Ch4 with the healthy class. Even so, as the mean Ch1peak is much higher than the mean Ch2 peak, it is reasonable to think that for this reason, there are a few more samples of this class that are classified as pneumonia than as healthy (almost 72%). Moreover, it is important to note that, as the characteristic that best distinguishes the COVID-19 class from the pneumonia class (Ch5) is not used in the H-P model as it is not relevant (similar values), this is another reason why it is more similar to the pneumonia class. • H-C model: The characteristics that easily distinguish these classes are Ch1, Ch3, Ch4, and Ch5. We can theorize that these characteristics rule the classification result. Using the Pneumonia class as a test, it has similarities in characteristics Ch1 and Ch3 with the COVID-19 class and has similarities in characteristics Ch4 and Ch5 with the healthy class. Although the peak that can be observed in blacks (Ch1) concentrates a large number of pixels and this more closely resembles the COVID-19 class, the maximum concentration range of values (Ch5) is practically identical to that of the healthy class. Therefore, this may be the reason why the occurrences of this class are divided equally between the other two during the test phase. • P-C model: The characteristics that easily distinguish these classes are Ch2, Ch4, and Ch5. We can theorize that these characteristics rule the classification result. Including the healthy class as a test, we can observe that it would resemble the COVID-19 class in the Ch2 characteristic, but it would resemble the pneumonia class in the Ch4 and Ch5 characteristics. Due to this, as the characteristics Ch4 and Ch5 have much greater weight than the characteristic Ch2 and, in addition, they concentrate most of the pixels, it is coherent that the results obtained more than 96% of classification as pneumonia.
With the study of the histograms and the characteristics that can be distinguished from them, the results obtained in the first tests of this work (where the third untrained class is used as the test) are coherent, and therefore, it seems that the convolutional DL binary classification systems use the histogram of the image to extract the characteristics.
However, before stating this conclusion more strongly, we will carry out a more indepth study applying these same characteristics to the different systems trained in the second part of the Results Section, where the third class is included as part of one of the two initial ones in each model. In this way, we obtain six systems for which we carry out a detailed analysis like the previous one.
• H-P model-Case 1: Almost all the characteristics that easily distinguish the pneumonia class have opposite values in the other two that compose the mixed class (COVID-19 and healthy). The only characteristic that differentiates both from pneumonia is Ch2 (white's peak); however, in the whites' concentration (Ch4), there are also significant differences between the images of COVID-19 and those of healthy patients. Due to these large differences between the two sets that compose the mixed class, the classification results are very low (reaching over 50% for the combined class detection). • H-P model-Case 2: For this case, there is a clearly remarkable characteristic of the healthy class with respect to the mixed class (COVID-19 and pneumonia), which is Ch1 (blacks' peak). Thanks to this, the results obtained improve significantly, although there is a worrying percentage of samples that are erroneously classified as belonging to the mixed class, presumably due to the range of the higher concentration of values (Ch5) that coincides between the healthy class and the pneumonia images.
• H-C model-Case 1: Mixing the images of pneumonia with those of healthy patients causes the COVID-19 class to be classified quite acceptably (around 92%), presumably due to the range of the maximum concentration of values (Ch5). However, for the mixed class, the results are very bad since several important characteristics have very different values between the images of pneumonia and those of healthy patients, such as Ch1 (which is similar between pneumonia and COVID-19) and Ch2 (which is similar between healthy and COVID-19).
• H-C model-Case 2: If we mix the pneumonia images with those of COVID-19, the classification of healthy patients reaches 100% due to the concentration of whites (Ch4) and the absence of the peak of blacks (Ch1), which clearly distinguishes the histograms. However, regarding the precision when classifying the mixed class, the percentage is significantly reduced due to the differences between the images with COVID-19 and pneumonia, presumably produced by the maximum concentration range of values (Ch5). • P-C model-Case 1: If the images of healthy patients are mixed with those of COVID-19, we obtain a high percentage of classification for patients without pneumonia. In this case, the only characteristic that could justify these results is Ch2, although for Ch1, the black peak is much more significant than that which can be observed for the COVID-19 class. On the other hand, the pneumonia classification obtained very bad results (less than 42%) because, as has been verified, the pneumonia class shares many values of the characteristics extracted with the other classes. • P-C model-Case 2: Finally, combining the images of healthy patients with those of pneumonia, a high classification result can be observed for the mixed class since the pneumonia images share the Ch4 and Ch5 characteristics with the images of healthy patients, with the Ch5 characteristics being the most relevant due to the values' concentration. On the other hand, the results for the classification of the COVID-19 class are acceptable (but not excellent), with more than 15% failures due to the similarities in the Ch1 and Ch3 characteristics with the pneumonia images.
After this qualitative analysis, we can affirm that the theory initially indicated about the use of the gray level concentration values to extract characteristics is feasible. Although we cannot certify with total certainty that the characteristics indicated are all used and that the explanations provided are complete, the intention of this study was not to make a classification based on a decision tree, but to make the reader doubt about the correct feature extraction performed by convolutional DL binary image classification systems.
Due to all the presented results and the previous explanations, we can certify that the feature extraction carried out by a binary DL convolutional system is not related to the characteristics that are usually identified by the medical specialist, but rather has its origin in the numerical characteristics of the images.
Although there are cases in which the classification of a disease is a binary task (for example, whether or not the patient has cancer), there are other cases (such as the one we present in this work) where the possible outcomes are not binary. Especially when looking at disease symptoms (such as damage to lungs), it is important to understand that there are multiple diseases that can cause them, and therefore, working with a binary classification system may not be realistic or correct. For the results to be useful for specialists, the trained systems have to be as close as possible to reality, distinguishing between all the possible classes that the classification problem may have. This would be the only way to make sure that the trained system is useful.

Conclusions
In this work, the quality of the convolutional DL binary image classification systems is tested. For this purpose, this study is focused on the application of the detection of COVID-19 based on lung X-ray images.
After an exhaustive analysis of the currently available datasets and works carried out on this topic, the training and testing of various DL binary systems based on the COVID-19, pneumonia, and healthy classes was performed. The third class was used as a testing set in order to study the response of the different models and understand if the characteristics extracted by these systems are relevant.
Next, the third class was integrated into one of the two used to train each model, obtaining six different cases that were independently and thoroughly analyzed.
The results show that convolutional DL binary systems do not correctly extract the characteristics of the classes involved, but only use those aspects that differentiate one class from the other. Extending these conclusions and in order to deeply test this theory, a study was performed that includes the histograms of the three studied classes and a manual extraction of characteristics, in order to be able to elaborate a system capable of distinguishing manually between the classes involved like a decision tree does. The extracted characteristics were tested with the results obtained previously being able, in most cases, to justify clearly all the values obtained. In this way, this work is able to raise doubts about the correct extraction of characteristics performed by a convolutional DL binary image classification system. Therefore, according to those final tests, it can be stated that a binary system probably extracts characteristics from the grayscale values of the images, not the characteristics that a medical specialist would take into account in order to differentiate those classes.
Generalizing those results and focusing our attention on the implications of this work: in the cases when the disease classification is not a binary task (for example, the patient has or does not have cancer), the trained systems based on detecting disease symptoms (like the one studied in this work) must include all the possible classes that the classification problem may have; that is, all possible diseases that cause similar symptoms should be included. Only if all the possible classes involved in a classification system are taken into account, we can conclude that the system resembles reality, and therefore, it is a valid tool for use by medical specialists.
It is important to note that the lung images used in the detailed works (and in this work as well) are combined from different datasets; so, as a future work, we need to study if using images collected by different instruments intervenes in the classification results.

Acknowledgments:
We want to thank the FEDER project MSF-PHIA and the Telefonica Chair "Intelligence in Network" for the funding, but also the Research Group "TEP-108: Robotics and Computer Technology" from Universidad de Sevilla (Spain) for the using its resources.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: