Convolutional Support Vector Models: Prediction of Coronavirus Disease Using Chest X-rays

Abstract: The disease caused by the new coronavirus (COVID-19) has been plaguing the world for months, and the number of cases is growing more rapidly as the days go by. Therefore, finding a way to identify who carries the causative virus is imperative in order to stop its proliferation. In this paper, a complete and applied study of convolutional support vector machines is presented to classify patients infected with COVID-19 using X-ray data, comparing them with traditional convolutional neural networks (CNN). Based on the fitted models, it was possible to observe that the convolutional support vector machine with the polynomial kernel (CSVM_Pol) has the best predictive performance. In addition to the results obtained from real images, the behavior of the studied models was observed through simulated images, where it was possible to observe the advantages of support vector machine (SVM) models.


Introduction
The coronavirus disease (COVID-19) is an ongoing pandemic that spread quickly worldwide. The first case was noted in Wuhan City, China, in November 2019. Later, a high number of cases were reported in diverse countries, leading the World Health Organization (WHO) to declare the outbreak a pandemic on 11 March 2020. The WHO also announced that the virus can cause a respiratory disease with a clinical presentation of cough, fever, and lung inflammation. Another aggravating factor is that the virus has a high person-to-person transmission rate [1], which makes self-isolation measures and the urgent tracking and diagnosis of possible cases priorities to stop virus propagation.
Several works were published to track the spread of the disease [2,3] and to establish a clear comprehension of the state of the pandemic around the world [4,5]. Some works showed that one of the diagnosis methods is handled by radiologists, who perform a manual quantification of the lung infection caused by the virus [6]. Reference [7] showed that as the number of infected patients increases, it becomes more difficult for radiologists to finish diagnoses in a timely manner. Therefore, statistical learning models can be extremely useful to embrace a greater number of cases and provide accurate and faster support for diagnosing COVID-19 and its complications.
Data-driven solutions proposing automatic diagnosis using statistical learning methods applied to medical images are present in the literature [8][9][10][11][12]. Recently, convolutional neural networks (CNNs) have been recognized as a powerful tool for predictive tasks using medical images [13][14][15]. The support vector machine (SVM) [16] is a model that can achieve great prediction and generalization capacity due to its theoretical properties. These characteristics are supported by convex optimization, based on the structural risk minimization principle, to obtain the global minimum of a loss function. This principle distinguishes the SVM from other learning algorithms because it guarantees a global minimum and avoids getting trapped in a local minimum, as can happen with neural networks.
SVM also presents a notable performance applied in image classification tasks. Several works used support vector models for image recognition with great observed performance [17][18][19][20][21][22]. More recently, with the advent of convolutional neural networks to classify images, SVM was applied over convolutional features and achieved good results, presenting a general error lower than the traditional CNN approach [23][24][25][26].
The use of support vector machines applied to medical images has been an active research field [27][28][29][30]. Data-driven diagnosis of COVID-19 through X-ray images using the convolutional support vector model (CSVM) has shown the algorithm's great predictive capacity for identifying patients stricken by the disease [31,32].
The contribution of this paper relies on how the convolutional support vector machine can be used on X-ray images to identify COVID-19 cases. Recently, several models for automated diagnosis of this disease have been presented by the scientific community [33]. Chen et al. [34] presented the mean average of 69 proposed models, and in comparison with them, we can see that the convolutional support vector approach produced more accurate results than most. In addition, we present a new and robust analysis through simulation studies and a consistent validation procedure based on hold-out repetitions, exploring different kernel functions and architectures. In comparison with other deep learning procedures [35,36] and transfer learning [37], this paper presents a model with higher accuracy, overcoming them by 2% on average. Sophisticated models that applied pre-processing algorithms jointly with a deep-learning framework [38] were not able to generate predictions that outperformed the CSVM proposed in this paper.
The remaining sections of this paper are organized as follows: Section 2 presents the methodology of the convolutional network using the traditional deep learning approach. Section 3 presents the support vector machine methodology and the convolutional support vector models. Section 4 discusses the simulation study. Section 5 displays the results and discussion. Finally, Section 6 closes the paper with the final considerations.

Convolutional Networks
Convolutional neural networks (CNN) are multi-layer architectures whose successive layers are designed to progressively learn high-level features, with the last layer responsible for producing a result [22]. They have been shown to be extremely accurate for time series analysis and image classification [39,40]. The convolutional layers present in a CNN are responsible for applying filters throughout the image, thus reducing its complexity. When an image is filtered, the output is another image that can be understood as a feature map, indicating whether certain features (for example, borders) were detected in the input image [41].
The process of a convolutional network happens as depicted in Figure 1: first, an input image is convolved with filters, thus reducing its size. Then, a new matrix shows the results of the convolution operations. These results are grouped through a pooling operation, which extracts the maximum values from small regions of interest. Finally, the grouped matrix is decimated by a factor of two, through the decimation operation, to produce the final result. The result is then passed to a classification method responsible for predicting the label that summarizes the main content detected in the image.
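As a minimal sketch of the convolution and max-pooling operations described above, the following pure-Python example applies a valid (no-padding) convolution and a 2 × 2 max pooling. The 4 × 4 input and the 2 × 2 filter values are illustrative, not the paper's configuration; as in most CNN implementations, the "convolution" is computed as a cross-correlation (the filter is not flipped).

```python
def convolve2d(image, kernel):
    """Valid (no-padding) 2-D convolution of a 2-D list by a 2-D kernel.

    As in common CNN practice, the kernel is not flipped (cross-correlation).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out

def max_pool(fmap, size=2):
    """Max pooling over non-overlapping size x size windows."""
    out = []
    for i in range(0, len(fmap) - size + 1, size):
        row = []
        for j in range(0, len(fmap[0]) - size + 1, size):
            row.append(max(fmap[i + a][j + b]
                           for a in range(size) for b in range(size)))
        out.append(row)
    return out

image = [[1, 2, 0, 1],
         [0, 1, 3, 1],
         [2, 1, 0, 0],
         [1, 0, 1, 2]]
edge_filter = [[1, -1],
               [-1, 1]]          # a toy edge-like filter

fmap = convolve2d(image, edge_filter)   # 3x3 feature map
pooled = max_pool(fmap, 2)              # only one full 2x2 window fits
```

Each entry of `fmap` indicates how strongly the filter pattern matches the corresponding image region, and pooling keeps only the strongest response per window.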

Multi-Layer Perceptron Neural Networks
Neurons are the main cells that make up the nervous system, responsible for conducting, receiving, and transmitting nerve impulses throughout the body, for example as a response to stimuli from the environment. The brain is a complex network that processes information through a system of several interconnected neurons. Understanding brain functions has always been challenging; however, due to advances in computing technologies, it is possible to program artificial neural networks capable of mimicking some known brain capabilities [43]. Perceptrons are the basic building blocks of artificial neural networks. A perceptron implements a linear equation that takes as input all the relevant variables (X_i) and their associated weights (w_i), performs the calculation, and generates a binary output (0 or 1) by comparing the result with a given threshold. Figure 2 shows a basic representation of a perceptron. A multi-layer perceptron network can be built as a sequence of layers, as depicted in Figure 3. The first layer takes decisions based on the input variables and their weights; perceptrons in the subsequent (hidden) layers can then make more complex decisions by weighing the results from the previous layers. The output layer is responsible for generating a result (usually a classification) based on some specific decision function.
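The single perceptron described above can be sketched in a few lines; the weights and threshold below are illustrative values (chosen so the unit computes a logical AND), not parameters from the paper.

```python
def perceptron(x, w, threshold):
    """Weighted sum of inputs X_i * w_i compared against a threshold,
    yielding a binary output (1 or 0)."""
    activation = sum(xi * wi for xi, wi in zip(x, w))
    return 1 if activation > threshold else 0

# Illustrative example: with weights (1, 1) and threshold 1.5, the
# perceptron fires only when both binary inputs are active (logical AND).
out = perceptron([1, 1], [1, 1], 1.5)   # -> 1
```

A multi-layer perceptron chains such units: the outputs of one layer become the inputs of the next, with a decision function applied at the output layer.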

Convolutional Neural Networks
The so-called convolutional neural networks (CNNs) can be seen as an extension of common neural networks, in which new pre-processing layers can be added to extract patterns during image analysis, for example. Usually, after the application of the convolutional layers, a neural network model is applied. However, nothing prevents the use of other models at this step. Some studies, such as [45,46], followed this approach, using an SVM model at the end of the convolutional layers instead of the commonly used neural network.
The layers used are a set of filters that are convolved in order to generate a set of feature maps. Among the convolutional layers used, there is a pooling layer, which operates over blocks of the input set and combines the activation features. This combining operation is defined by a grouping function, such as the average or the maximum. The pooling step statistically summarizes the nearby outputs according to the pooling function [40], creating patches of feature maps. Since pooling downsizes the output, another image pre-processing step is zero padding, which allows us to control the size of the output [40]. Padding is the process of adding pixels to the border of the image. After this feature extraction step, the output is sent to a fully connected layer. In general, a traditional CNN uses dense multi-layer perceptrons as the classification method, as pointed out in Figure 4.
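Zero padding itself is straightforward: a border of zero-valued pixels is added around the image so that a subsequent convolution can keep a chosen output size. A minimal sketch with an illustrative 2 × 2 input:

```python
def zero_pad(image, p):
    """Pad a 2-D list with a border of p zeros on every side."""
    w = len(image[0]) + 2 * p
    padded = [[0] * w for _ in range(p)]          # top border rows
    for row in image:
        padded.append([0] * p + list(row) + [0] * p)
    padded.extend([[0] * w for _ in range(p)])    # bottom border rows
    return padded

img = [[5, 6],
       [7, 8]]
padded = zero_pad(img, 1)   # 4x4: the original 2x2 block surrounded by zeros
```

With padding of one pixel, a 3 × 3 valid convolution of the padded image produces an output the same size as the original input.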

Support Vector Machines
Cortes and Vapnik [16] proposed a new class of models known as support vector machines (SVM), based on statistical learning theory [47]. Basically, this class of models is based on the construction of optimal hyperplanes to separate labels, usually into two distinct classes, as well as on the use of kernelization to increase the model's flexibility [48].
SVM is used to solve classification problems, finding the hyperplane with the greatest separation space (delimited by its margins) between the categories of the chosen variable. Figure 5 displays the general representation of an SVM. The hyperplane is the equation separating the two classes, while the support vectors are the data points closest to the margin boundaries.

Support Vector Classifier
The simplest form of SVM is called the support vector classifier (or linear SVM), a technique that allows for linear separation of the data whenever possible. This technique can be used with rigid (hard) margins, which do not allow misclassification but hinder generalization, or soft margins, which allow some wrongly classified data but improve the model's generalization. Figure 6a depicts an SVM using rigid margins, with no data points between the margins, while Figure 6b depicts a soft-margin SVM. In the SVM with soft margins, with n the training sample size and i = 1, . . . , n, the ε_i are slack variables (errors) added to relax the restrictions and make the model more receptive to new data. Therefore, the function to be minimized can be expressed as:

\min_{w, b, \varepsilon} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \varepsilon_i, \quad \text{subject to } y_i(\langle w, x_i \rangle + b) \geq 1 - \varepsilon_i, \; \varepsilon_i \geq 0, \qquad (1)

where C is a regularization constant that imposes a weight on the minimization of the errors; for a finite C there is no limit on the number of wrong classifications, and if C → ∞, a soft-margin SVM will act as a rigid-margin SVM. After applying the optimization process via Lagrange multipliers to Equation (1) with the margin constraints, the SVM classifier with soft margins is given by:

f(x) = \mathrm{sign}\!\left( \sum_{i=1}^{n} \alpha_i y_i \langle x_i, x \rangle + b \right), \qquad (2)

where b is a constant parameter also called bias. The α_i are the Lagrange multipliers that identify the support vectors: α_i > 0 implies that x_i is a support vector and an important observation in the model. The sign function verifies whether an observation lies above or below the estimated hyperplane.
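The resulting decision function can be sketched directly: the sign of the weighted sum of inner products with the support vectors plus the bias. The support vectors, labels, multipliers, and bias below are illustrative values, not fitted ones.

```python
def svm_decision(x, support_vectors, labels, alphas, b):
    """sign( sum_i alpha_i * y_i * <x_i, x> + b )"""
    s = b
    for x_i, y_i, a_i in zip(support_vectors, labels, alphas):
        s += a_i * y_i * sum(u * v for u, v in zip(x_i, x))
    return 1 if s >= 0 else -1

# Illustrative fitted quantities; only observations with alpha_i > 0
# (the support vectors) enter the sum.
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]
alphas = [0.5, 0.5]
b = 0.0

pred = svm_decision((2.0, 0.5), support_vectors, labels, alphas, b)   # -> 1
```

Note that only the support vectors are needed at prediction time, which is why the number of support vectors is a natural measure of the model's complexity.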

Kernel Transformation
In the simplest situations, where the data can be easily classified through a linear equation, a linear classifier solves the problem with good levels of accuracy. However, in many situations this linear classifier will have problems separating the data, requiring the use of a non-linear classifier. This non-linear classifier, via the kernelization procedure, maps the training set from its original space, referred to as the input space (R), into a new, larger space, called the feature space (F), as shown in Figure 7. This mapping is defined by a function φ. With an appropriate choice of the mapping function φ, the training set mapped into (F) can be separated by a linear SVM. Thus, the classifier is rewritten by replacing the inner product with a kernel function K(x, x′) = ⟨φ(x), φ(x′)⟩, expressed through the aggregation of the mapping functions. In general, the Gaussian kernelization has a high predictive capacity and is frequently used [48]. The most commonly used kernels [50,51] are shown in Table 1. Table 1. Most used kernels for kernel transformation.

Kernel Type      K(x, x′)                  Parameters
Linear           ⟨x, x′⟩                   —
Polynomial       (γ⟨x, x′⟩ + d)^q          γ, d, q
Gaussian (RBF)   exp(−σ‖x − x′‖²)          σ

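The kernels of Table 1 can be written as plain functions. The forms below follow the kernlab package's conventions (an assumption for illustration), with the default parameter values q = 2, d = 0, and γ = 1 mentioned in the experimental setup.

```python
import math

def linear_kernel(x, y):
    """Linear kernel: plain inner product <x, y>."""
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, gamma=1.0, d=0.0, q=2):
    """Polynomial kernel: (gamma * <x, y> + d) ** q."""
    return (gamma * linear_kernel(x, y) + d) ** q

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-sigma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sigma * sq_dist)

x, y = (1.0, 2.0), (2.0, 0.0)
# linear: 1*2 + 2*0 = 2; polynomial (defaults): (1*2 + 0)^2 = 4
# gaussian: exp(-1 * ((1-2)^2 + (2-0)^2)) = exp(-5)
```

Each kernel replaces the inner product ⟨x_i, x⟩ in the linear classifier, yielding a non-linear decision boundary in the original input space.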
Convolution Procedures and SVM
Analogously to a convolutional neural network (CNN), the convolutional support vector machine uses convolutional image processing to reduce the complexity of the image, through operations such as convolution and pooling, as shown in Section 2.2. Initially, a convolutional model is generated to select and optimize the filters that will be used. Afterwards, with the hyper-parameters of the convolutional layers fixed, the traditional support vector machine (SVM) is employed, and the model can be applied using any of the kernels discussed above, with their respective parameters estimated through a tuning process. Figure 8 shows the convolutional support vector machine (CSVM) architecture. Among the layers used in the convolutional pre-processing, two play especially important roles: the pooling layer and the flattening layer. The pooling layer reduces the size of the data while keeping the regions with the most significant information, reducing noise and, consequently, processing time.
In the final pre-processing stage, the flattening layer converts the data into a single vector to facilitate its insertion into the model sequence. In general, the convolution acts as a complex processing filter whose output can be fed to models other than the classic multi-layer perceptron (MLP). After the flattening step, the SVM replaces the MLP. Figure 9 shows the flattening layer after the convolution filter processing.
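The flow from pooled feature map to SVM decision can be sketched as follows. The pooled feature map and the linear SVM weights are illustrative values, with the SVM head assumed to be already fitted (in the paper, on features extracted with the fixed convolutional filters).

```python
def flatten(fmap):
    """Flattening layer: concatenate the rows of a 2-D feature map
    into a single vector."""
    return [v for row in fmap for v in row]

# Illustrative feature map, as it would arrive from the
# convolution + pooling layers.
pooled = [[0.2, 0.8],
          [0.5, 0.1]]
features = flatten(pooled)          # [0.2, 0.8, 0.5, 0.1]

# Linear SVM head replacing the MLP: sign(<w, features> + b),
# with w and b assumed already fitted (illustrative values).
w, b = [1.0, -1.0, 0.5, 0.5], -0.1
score = sum(wi * fi for wi, fi in zip(w, features)) + b
label = 1 if score >= 0 else -1
```

The same flattened features could instead be passed to a kernelized SVM, which is the CSVM_Pol / CSVM_Gau variants studied in the paper.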

Simulation Study
For this simulation, we created synthetic 64 × 64 pixel images with three color channels (RGB) composing two distinct classes, aiming to evaluate the performance of the different models in terms of computation time and predictive performance. Each image class was created from a normal distribution with a different mean µ in the Green (G) channel, which produced two different validation sets per class. This artificial image procedure is similar to the one in [53].
Furthermore, Table 2 presents the parameters used to generate the simulation sets. The first column indicates the mean difference between classes in standard deviations. Thus, 1SD means the difference in channel G is one standard deviation, while 3SD means the difference in channel G equals three standard deviations. In summary, there are four different kinds of images: Class1, with means equal to 119 and 101 in channel G; and Class2, with means equal to 137 and 155 in channel G. The other color channels are identical for all images. Figure 10 shows the artificial images used in our simulation. It is important to notice that the classification task becomes easier as the difference between the image means grows. The next subsection details the model configuration.
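A sketch of this image-generation step is given below. The class means (e.g., 119/101 for Class1 and 137/155 for Class2) come from Table 2; the standard deviation value used here is an assumption for illustration, since only the means are quoted above.

```python
import random

def simulate_image(mean_g, sd=6.0, size=64, seed=None):
    """Generate one synthetic RGB image (three 2-D channel lists).

    Only the Green channel is random: each pixel is drawn from a normal
    distribution with the class-specific mean, clipped to [0, 255].
    The Red and Blue channels are identical across classes.
    """
    rng = random.Random(seed)
    red = [[128] * size for _ in range(size)]
    green = [[min(255.0, max(0.0, rng.gauss(mean_g, sd)))
              for _ in range(size)] for _ in range(size)]
    blue = [[128] * size for _ in range(size)]
    return red, green, blue

_, g, _ = simulate_image(mean_g=119, sd=6.0, seed=42)
avg_g = sum(map(sum, g)) / (64 * 64)   # close to the class mean of 119
```

With 64 × 64 = 4096 Green-channel draws per image, the empirical channel mean concentrates tightly around the class mean, so the class separation is governed by the mean difference, as described above.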

Experimental Setup
We used a CNN (called CNN_1) with the following filter configuration: first, a 3 × 3 convolutional layer with 30 filters, then a 2 × 2 pooling layer. Another CNN (called CNN_2) was used with a different filter configuration: first, a 3 × 3 convolutional layer with 30 filters, then a second 3 × 3 convolutional layer with 60 filters, followed by a 2 × 2 pooling layer, plus a 3 × 3 convolutional layer with 60 filters, and finally a new 2 × 2 pooling layer.
In the convolutional neural network models, the two architectures were used for comparison purposes. For the first architecture (CNN_1), we used an initial dense layer with 256 nodes, a second layer with 128 nodes, a third layer with 64 nodes, and an output layer with 2 nodes. In the second architecture (CNN_2), we used an initial layer with 128 nodes, a second layer with 64 nodes, a third layer with 32 nodes, and finally a layer with 2 nodes. The activation function used by the CNN models was the relu function, together with the adam optimizer, whose default learning rate is 0.001. A total of 420 images (210 with COVID-19 and 210 without) was used, requiring a total of 25 epochs to fit the model. Other important parameters are the batch size, with a default value of 32, the objective function, set to sparse categorical crossentropy, and accuracy as the metric. Figure 4 presents a complete view of the CNN architecture used in our simulation.
In the convolutional SVM modeling, pre-trained weights from the CNN are used as fixed filters for the classification. Afterwards, the support vector component is adjusted with different kernel functions, CSVM_RBF, CSVM_Lin, and CSVM_Pol, each with its respective default parameters. The convolution filters applied were the same as those in CNN_1. Representations of the model setups can be seen in Figure 11. Simulations were split into image batches of 100, 300, 500, and 1000 sample sizes with an equal balance of each class: first, for the image group with a one-standard-deviation difference in means (named 1SD), and then for the image group with a three-standard-deviation difference in means (named 3SD). We applied a 90/10 training-test split for one hundred hold-out repetitions.
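The repeated 90/10 hold-out split can be sketched as below: on each repetition, the sample indices are reshuffled and partitioned into 90% training and 10% test (sizes here are illustrative).

```python
import random

def holdout_splits(n, test_ratio=0.1, repetitions=100, seed=123):
    """Yield (train_indices, test_indices) pairs for repeated hold-out:
    each repetition reshuffles the n indices and splits them."""
    rng = random.Random(seed)
    n_test = int(n * test_ratio)
    for _ in range(repetitions):
        idx = list(range(n))
        rng.shuffle(idx)
        yield idx[n_test:], idx[:n_test]

splits = list(holdout_splits(n=100, test_ratio=0.1, repetitions=100))
train_idx, test_idx = splits[0]   # 90 training and 10 test indices
```

Each repetition produces a disjoint train/test partition, and the performance metrics are then averaged over the repetitions.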
As the simulation is a binary classification problem, it produces the well-known confusion matrix, which relates the predictions to the true class values. The evaluation is made by the following metrics, calculated from the confusion matrix entries (Baldi et al. [55]):
• Accuracy (ACC): the rate between correct predictions and the total number of predictions. It is sensitive to class unbalancing.
• F1 Score: the harmonic mean of precision and recall.
• Matthews correlation coefficient (MCC): a correlation measure between predicted and observed classes that is less sensitive to class unbalancing.
All the procedures were performed on a personal laptop with the following configuration: Linux 64-bit operating system, Intel Core i5-3210M 2.50 GHz processor, and 8 GB of RAM, running R software version 3.6.3 with the packages keras [56] and kernlab [57]. This same configuration was also used in the next section.
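The evaluation metrics reported in this work (accuracy, plus the F1 Score and MCC shown in the results) can be computed directly from the confusion matrix entries TP, TN, FP, and FN; a sketch under their standard definitions, with an illustrative toy confusion matrix:

```python
import math

def accuracy(tp, tn, fp, fn):
    """Correct predictions over total predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, tn, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, robust to class unbalancing."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den

# toy confusion matrix: 40 TP, 45 TN, 5 FP, 10 FN
acc = accuracy(40, 45, 5, 10)   # -> 0.85
```

Unlike accuracy, the MCC stays near zero for a classifier that merely predicts the majority class, which is why it is a useful complement on unbalanced data.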
In the support vector models, the selection of the σ parameter of the Gaussian kernel function from Table 1 was performed using an interval whose lower and upper bounds are the 0.1 and 0.9 quantiles of ‖x − x′‖², respectively. Reference [58] showed that any value inside that interval will produce good values of σ. This procedure can exponentially reduce the computational time of the Gaussian SVM, since exhaustive hyperparameter tuning is not needed to find acceptable σ values. For the other kernel functions presented in Table 1, the default parameters used were q = 2, d = 0, and γ = 1. Tables 3 and 4 present the simulation results for the 1SD and 3SD settings, respectively. We can observe that all models have poor performance for small batch sizes (less than 500), but the CSVM models have smaller processing times than the CNN. For batch size equal to 500 (Table 3), the CSVM models and CNN_1 presented an improvement in prediction capacity, whilst the CNN_2 model kept the same performance as in the smaller batches. Finally, in the largest image batch, all models show an expressive improvement in performance when compared to the results for small batches. For all image batch sizes, the CSVM models present smaller processing times and competitive predictive performance. Regarding the results from Table 4, given that the difference between the means equals 3SD, when the sample size is large enough it is a simple task for machine learning algorithms to correctly classify the two classes, which is reflected in the unusual results equal to one. However, even in those cases, the computational time of the CSVM remains significantly smaller when compared with the CNN.
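The σ-selection heuristic described above can be sketched as follows, using a simple order-statistic quantile over the pairwise squared distances (the exact quantile estimator used in the original implementation is an assumption here):

```python
import random

def sigma_interval(points, low=0.1, high=0.9):
    """Return the (low, high) quantiles of the pairwise squared
    distances ||x - x'||^2, using a simple order-statistic quantile."""
    sq_dists = sorted(
        sum((a - b) ** 2 for a, b in zip(x, y))
        for i, x in enumerate(points) for y in points[i + 1:]
    )
    n = len(sq_dists)
    return sq_dists[int(low * (n - 1))], sq_dists[int(high * (n - 1))]

rng = random.Random(7)
points = [(rng.random(), rng.random()) for _ in range(50)]
lo, hi = sigma_interval(points)
# per [58], any sigma chosen inside [lo, hi] is expected to perform well
```

Because the interval comes from a single pass over the pairwise distances, no grid search over σ is required, which is the source of the computational savings mentioned above.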

Data and X-ray Image Acquisition
Images used in this analysis came from anteroposterior radiographs (X-rays). Radiography is a technique for generating and recording an X-ray pattern to provide the user with a static image(s) after the end of X-ray exposure, as displayed in Figure 12. The erect anteroposterior (AP) chest view is performed with an X-ray tube firing photons through the patient to form the image on a detector positioned behind the patient [59].
Image data were collected from two main sources: (i) COVID-19 and other lung disease cases came from the COVID-19 Image Data Collection [60]; (ii) healthy X-ray images were retrieved from the Open Access Biomedical Search Engine (https://openi.nlm.nih.gov/). The COVID-19 Image Data Collection is the first public COVID-19 CXR data collection and aims to aid in the treatment of the disease. Data aggregation was performed partially by hand, in the case of public research articles, or automatically, in the case of images stored on websites (such as Radiopaedia or Eurorad). From the set of images released on GitHub (https://www.github.com/ieee8023/covid-chestxray-dataset), 105 images of patients diagnosed with COVID-19 were collected on 4 April 2020. A new search on 27 July 2020 updated our dataset with 142 new images, including AP (anteroposterior) and AP Supine (patient lying down) images from different patients with confirmed SARSr-CoV-2. After selecting the images, a hashing technique was used to detect duplicates, and 30 duplicated images were removed, resulting in 217 COVID-19 X-rays. After removing duplicates, a grayscale/color conversion was applied to each image through the OpenCV library in Python. This was the only pre-processing technique applied; no noise reduction methods were used. Another set of 108 images with other lung diseases (SARS, Pneumocystis, or Legionella) was gathered from the same GitHub database of the COVID-19 Image Data Collection.
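The duplicate-removal step can be sketched as below. The text does not state which hashing technique was used, so a byte-level MD5 digest (which detects exact duplicates only, not perceptually similar images) is assumed here for illustration.

```python
import hashlib

def remove_duplicates(images):
    """images: iterable of (name, raw_bytes) pairs.
    Keep only the first file seen for each content hash."""
    seen, kept = set(), []
    for name, data in images:
        digest = hashlib.md5(data).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(name)
    return kept

# illustrative file contents: the third entry duplicates the first
files = [("a.png", b"\x89PNG...1"),
         ("b.png", b"\x89PNG...2"),
         ("a_copy.png", b"\x89PNG...1")]
unique = remove_duplicates(files)   # keeps "a.png" and "b.png"
```

A perceptual hash would additionally catch re-encoded or resized copies; a content digest like the one above only removes byte-identical files.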
The query used for healthy chest X-ray images can be reproduced by accessing the Open-i service (https://openi.nlm.nih.gov). This service enables the search and collection of abstracts and images from open-source literature. In our study, 105 images were retrieved on 4 April 2020, and seven new images from the COVID-19 Image Data Collection with a "no Finding" tag were added, resulting in a dataset with 112 images. No duplicated images were found in this case. After the image retrieval, the final dataset consisted of 437 images, as summarized in Table 5. Image sizes range from a minimum height and width of 235 and 256 pixels to a maximum of 4757 and 5623 pixels. Table 5. Overview of the image dataset used in this work. An exploratory analysis of the metadata provided by the COVID-19 Image Data Collection resulted in the data shown in Figure 13. Such metadata information was not used in our predictive models since it is only available for COVID-19 cases. In addition, summary statistics were calculated from the patients' age and sex information: mean x̄ = 59.45, median Med = 61, and standard deviation σ = 16.88 for age. Figure 12 displays an example of a healthy chest X-ray and an X-ray of a patient diagnosed with COVID-19. These images correspond to the dataset used to train our models. No data augmentation process was used.

Predictive Models
The evaluation of all methods was performed through a repeated hold-out validation technique, with a 90-10% training-test split of the data and a total of 100 repetitions. Although the most common validation technique used in image classification tasks is K-Fold, a high number of hold-out repetitions achieves lower bias and variance [61]. The classification performance of the CSVM models was compared with those from other methods, including traditional neural networks (MLP_1 and MLP_2), all of which converged after 20 epochs. Table 6 presents the performance measured from ten methods run 100 times each. From Table 6, we can observe that the convolutional support vector machine with the polynomial kernel (CSVM_Pol) presented the best performance, achieving the highest F1 Score and MCC. The best accuracy was achieved by CSVM_Gau, and SVM_RBF was the fastest among all models. Additionally, CSVM_Lin achieved the second-best overall result, followed by MLP_2. From these outcomes, it can be observed that, in this situation, the support vector models benefit more from the convolution framework than the traditional CNN.
To make a complete comparison between the models, one important aspect is parsimony. Occam's Razor [62], in the context of statistical modeling [63], states that "given two models with the same training set error, the simpler one should be preferred because it is likely to have lower generalization error". Under this statement, a model with a greater number of parameters is considered more complex. In general, the number of parameters of an SVM is given by its number of support vectors, while in neural networks these parameters correspond to the weights used in the many node aggregations of the model. References [64][65][66] showed that reducing the number of support vectors can produce parsimonious models yielding better results. The SVM models (SVM_Lin, SVM_Pol, SVM_Gau) produced, on average, 201, 278, and 264 support vectors, respectively, while the convolutional support vector models (CSVM_Lin, CSVM_Pol, CSVM_Gau) produced 31, 29, and 71, respectively, showing that the convolution process was able to produce a more parsimonious, and thus likely more general, model.
Another way to analyze and compare the performance of each algorithm is through a win-loss table, which counts the number of times that a method produced a value equal to or greater than the others for a given evaluation metric. Figures 14 and 15 correspond, respectively, to the accuracy (ACC) and F1 Score comparisons. These figures reflect the results of the previous table, where we can see that CSVM_Pol produced the highest values most of the time.
The computational effort of CNN and CSVM is an important aspect to be considered. Reference [67] showed that it is computationally expensive to train convolutional neural networks, making GPU processing necessary to train some models in a viable time.

Final Considerations
This paper presented a novel study and application of convolutional support vector machines to classify patients infected with COVID-19 using X-ray data. The results were compared with a CNN approach for this task [36,38,39], which is considered the state-of-the-art in many image classification tasks [68][69][70][71]. The results showed that the CSVM outperformed the CNN approach, with higher values of ACC, F1 Score, and MCC obtained through repeated hold-out as a robust validation procedure. The one-hundred-times-repeated hold-out technique was used to reinforce this performance, as [61] showed that this validation procedure has lower bias and variance when compared with the commonly used K-Fold. In addition, the CSVM proved computationally cheaper, being, on average, up to one hundred times faster than the CNN in this situation with a medium sample size. In comparison with other works presenting analyses on detecting SARS-CoV-2 from X-ray images, we presented a novel database with the most recent data, composed of more instances and with a greater diversity of diseases in the no-COVID group, improving the capacity of the model to generalize and predict new observations. Additionally, even with the boost in research on COVID-19, data on such patients are still small or medium-scale, a fact that makes the CSVM even more effective for these cases.
For future work, the number of kernel functions, as well as their hyperparameters, can be explored to achieve even stronger results with the CSVM. In addition, the architectures in both cases, CSVM and CNN, can be diversified.

Conflicts of Interest:
The authors declare no conflict of interest.