EVAE-Net: An Ensemble Variational Autoencoder Deep Learning Network for COVID-19 Classification Based on Chest X-ray Images

The COVID-19 pandemic has had a significant impact on many lives and the economies of many countries since late December 2019. Early detection with high accuracy is essential to help break the chain of transmission. Several radiological methodologies, such as CT scan and chest X-ray, have been employed in diagnosing and monitoring COVID-19 disease. Still, these methodologies are time-consuming and require trial and error. Machine learning techniques are currently being applied by several studies to deal with COVID-19. This study exploits the latent embeddings of variational autoencoders combined with ensemble techniques to propose three effective EVAE-Net models to detect COVID-19 disease. Two encoders are trained on chest X-ray images to generate two feature maps. The feature maps are concatenated and passed to either a combined or individual reparameterization phase to generate latent embeddings by sampling from a distribution. The latent embeddings are concatenated and passed to a classification head for classification. The COVID-19 Radiography Dataset from Kaggle is the source of chest X-ray images. The performances of the three models are evaluated. The proposed model shows satisfactory performance, with the best model achieving 99.19% and 98.66% accuracy on four classes and three classes, respectively.


Introduction
In late December 2019, the COVID-19 disease was discovered in Wuhan, a city in eastern China, and began spreading worldwide. By March 2020, the World Health Organization (WHO) classified it as a pandemic. COVID-19 is a pathogen strain caused by SARS-CoV2 that causes severe acute respiratory discomfort and exhibits symptoms of pneumonia in humans. The infection usually starts in the mucous membrane in the throat and rapidly spreads to the lungs. Once in the lungs, it impairs function and mutates rapidly before a patient can be diagnosed correctly. It is transmitted efficiently from human-to-human via aerosol and is highly infectious. This makes COVID-19 a public health emergency. Early detection and diagnosis are significant for breaking the transmission chain and controlling its spread.
To help control the spread of the disease, healthcare professionals and researchers have adopted several modalities to detect the virus. Some of the modalities include reverse transcriptase-polymerase chain reaction (RT-PCR) test [1], chest X-ray (CXR) image [2,3], and computed tomography (CT) scans [4]. RT-PCR has been the go-to diagnostic modality for detecting COVID-19 pathogens. It requires acquiring a respiratory specimen from the subject's body. Though efficient, it has some disadvantages, such as a longer detection time and a lower detection rate.
contrast, variational inference typically becomes increasingly computationally complex as the number of samples increases.
Motivated by this, we propose EVAE-Net, an ensemble variational autoencoder deep learning network that combines the high-quality latent representations generated by VAE and ensemble learning for COVID-19 classification based on chest X-ray images. Three variants of EVAE-Net are proposed, and their performances are compared. The proposed model consists of two encoders, one or two reparameterization phases, a classification head, and one or two decoders. In the case of the single reparameterization phase, the feature maps from the encoders are merged before sampling the latent embeddings. The latent embeddings are then passed to the classification head and the decoder for reconstruction. In the case where each encoder has its reparameterization phase, the feature maps from each encoder are passed to their respective reparameterization phase to generate the latent embeddings. The latent embeddings are merged to form a single latent embedding and passed to the classification head. The individual latent embeddings are passed to their respective decoders for reconstruction. Our proposed methodology achieves promising classification performance, with the best model achieving 98.66% accuracy, 98.47% recall, 98.60% F1 score, and 98.75% precision for three classes, and 99.19% accuracy, 98.82% recall and precision, and 98.94% F1 score for four classes.

Contributions
The main contributions of our work are summarized as follows: • Propose a deep learning model based on VAE and ensemble learning for COVID-19 classification. Particularly, three models are proposed. Each model consists of an encoder for feature extraction, a reparameterization phase for sampling latent vectors from the extracted features, a decoder for reconstructing the input image from the latent vector, and a classification head for classifying  Demonstrate the superiority of the proposed model by performing essential experiments. The experiment was conducted on both three classes and four classes using the standard COVID-19 radiography database. Several classification metrics were used, including accuracy, recall, precision, F1 score, and ROC-AUC. The experimental results demonstrate that the proposed EVAE-Net can automatically classify COVID-19 infections from CXR images and achieve better performance. • Compare the performance of the proposed model to that of several state-of-the-art models, showing our proposed model outperforms these existing models.
The remaining work is structured as follows: In Section 2, we present related studies and works. Section 3 describes the research methodology. The datasets used in this work, evaluation metrics, and experimental setup are described in Section 4, followed by results and analysis in Section 5. Finally, we present discussion and conclusions of the study in Sections 6 and 7, respectively.

Related Work
Since the outbreak of COVID-19, the scientific community has responded with high volumes of research dedicated to different levels. ML and DL models have been adopted in several areas to help deal with COVID-19 disease discovery and spread monitoring [29][30][31][32], prognosis [33], etc. Nevertheless, most of these ML and DL models focus on preprocessing X-ray or CT scan images to detect and classify COVID-19. Several tools, techniques, and datasets have been utilized to facilitate the detection and classification of COVID-19.

Deep Learning Models
Researchers have identified several approaches for identifying COVID-19 from chest X-ray images. One well-established method is the conventional method. However, this approach is not acceptable in the medical field due to its limitations, such as wasting time extracting features and inaccurate results leading to false positive results. In recent years, researchers have conducted several studies to detect COVID-19 infections. The researchers in [34] presented CovXNet, a deep learning model to classify COVID-19 and pneumonia infections by utilizing a public dataset containing 1493 non-COVID pneumonia and 305 COVID-19 pneumonia cases. Their model achieved an accuracy of 96.9%. Using four pretrained models, VGG16, ResNet50, DenseNet-121, and MobileNet, Umair et al. [35] proposed a technique for binary classification of COVID-19 and compared the performances of the four models. DenseNet-121 achieved the best performance with an accuracy of 96.49%.
The authors of Li et al. [36] proposed COVNET, a deep learning model for detecting COVID-19 infections. The proposed model successfully differentiated between COVID-19 pneumonia and community-acquired pneumonia (CAP) with sensitivity and specificity rates of 90% and 96%, respectively. A novel deep learning model, CovidDetNet, for detecting COVID-19 infections using chest radiograph images was proposed by Ullah et al. [37]. The proposed model comprises nine convolutional layers, one fully-connected layer, two activation functions (ReLU and Leaky ReLU), and two normalization operations (batch normalization and cross-channel normalization) and achieved an accuracy of 98.40%. A 17-layered deep learning model, DarkCovidNet, with different filter sizes for detecting COVID-19 infection, was presented by Ozturk et al. [38]. The authors used DarkNet [39] as the backbone for their model. The proposed model provided accurate diagnostics for binary classification (COVID vs. no findings) with 98.08% accuracy and multi-class classification (COVID vs. no findings vs. pneumonia) with 87.02% accuracy.
The authors of Khan et al. [40] proposed CoroNet, a CNN for the classification of COVID-19 by modifying an Xception pre-trained model. The model achieved an accuracy of 89.6%. In Agrawal and Choudhary [41], researchers presented FocusCovid, an automated deep learning model for detecting COVID-19 using chest radiograph images. The authors used FocusNet [42], a UNet-based encoder-decoder architecture proposed for medical image segmentation, as their backbone network. The model achieved 99.2% and 95.2% accuracy for binary classification and multi-classification, respectively. In [43], a convolution support estimation network (CSEN) that combines the advantages of a representationbased technique and deep learning was presented to detect COVID-19 infections from X-ray images. The proposed method uses training samples and a dictionary to map the sparse support coefficients to the query samples.
Other works have also exploited automated feature extraction and filtering techniques [44][45][46][47]. For example, the authors of [44] segregated COVID-19 positive cases and healthy cases by exploring several image filter techniques, such as a conservative smoothing filter and a Gaussian filter, and feature extraction techniques, such as linear discriminant analysis (LDA) and principal component analysis (PCA). The extracted features were then passed to various classification models, including CNN, SVM, and logistic regression (LR). The proposed model achieved impressive results by attaining an overall accuracy of 99.93%.

Transfer Learning Methods
In the COVID-19 detection and classification literature, most of the ML and DL models employed use pre-trained models as the baseline of their models. The pre-trained models were trained using the Imagenet dataset [48], and their weights are available for download. Although the baseline models used were the same, the architecture of the individual models is different [49]. VGGNet [50], ResNet [51], EfficientNet [52], DenseNet [53], and SqueezeNet [54] are few examples of state-of-the-art pre-trained models. In the literature, a large number of works have been proposed using transfer learning [49,[55][56][57][58][59][60][61][62][63]. The authors of [62] implemented a deep learning model by fine-tuning three pre-trained models-VGG16, VGG19, and ResNet201-using chest X-ray images for binary classification of COVID-19. Using Bi-LSTM and CNN, Aslan et al. [60] designed a hybrid model to detect COVID-19 from CT scans. The authors used AlexNet as their pre-trained models. Similarly, Apostolopoulos and Mpesiana [49] proposed a CNN architecture based on transfer learning using ReLU [64] as the activation function and dropout to deal with overfitting. A binary and multi-class classification model using NASNet-large, DenseNet169, Incep-tionV3, ResNet18, and Inception ResNetV2 was proposed by Punn and Agarwal [59]. To overcome class imbalance, the authors used a weighted class-loss function (giving higher weight to COVID labels) and a random sampling technique (upsampled minority class). Using transfer learning and fine-tuning, El Asnaoui et al. [65] conducted a comparative study on several pre-trained models. They applied intensity normalization [66] during the preprocessing stage to enhance image quality, and they applied contrast-limited adaptive histogram equalization (CLAHE) [67]. Similarly, using five pre-trained models, the authors of [63] presented a deep learning model to detect COVID-19 from CT scan images. The authors assessed the performance of histogram equalization (HE) and CLAHE enhancement on CT scan images. The best pre-trained model attained an accuracy of 95.75%.

Autoencoders and Variational Autoencoders
In recent years, autoencoders (AEs) and variational autoencoders (VAEs) have been exploited for different medical tasks [68][69][70][71][72][73][74]. Given input data, AEs and VAEs transform the input from a high dimension to a low dimension known as a latent vector. Due to this capability, they have also been adopted for dimensionality reduction tasks [75][76][77]. VAEs are probabilistic variations of AEs. Unlike AEs, which attempt only to learn the latent representation of the input data, VAEs attempt to learn the latent distribution of the input data [78]. AE and VAE approaches have been adopted in several works. The authors of Rashid et al. [79] proposed a two-stage deep CNN scheme dubbed "AutoCovNet" to detect COVID-19. They used an encoder-decoder at the first stage and an encoder-merging network for the second stage. The weights learned from the first stage are used to initialize the encoder in the second stage. The outputs from the different layers in the encoder model are connected to the encoder-merging network, which finally performs feature extraction. The extracted features are then used to perform classification. An integrated model consisting of a pre-trained model, sparse autoencoder, and a feedforward neural network was proposed by J.L. et al. [80]. Similarly, Abdulkareem et al. [81] proposed a model that used stacked autoencoders instead of sparse autoencoders. Other works have also explored stacked autoencoders [82][83][84]. To deal with anomaly localization and lack of pixel annotation in CT images, Zhou et al. [78] proposed a "Weak Variational Autoencoder for Localisation and Enhancement (WAVLE)" framework with two parts: the localization part, which generates attention maps by combining a context-encoding variational autoencoder with a gradient-based technique; and an enhancement part that localizes the infected regions in the CT images, combining the attention maps generated in the first part. An unsupervised deep learning variational autoencoder model was proposed by Mansour et al. [85]. InceptionV4 with Adagrad was used as the feature extractor, and an unsupervised VAE performed classification. The quality of the image was enhanced using adaptive Wiener filtering during preprocessing. So far, this is the only work similar to our work. In our work, we explore the latent representations generated by VAE and leverage the capabilities of ensemble learning to design three ensembled variational autoencoder models to classify COVID-19. Several other methods and techniques to detect and classify COVID-19 have been proposed in the literature [15,17,[86][87][88][89][90][91][92]. Table 1 summarizes the related works. • Model was trained on a small dataset.
• Only binary classification was performed. • Handles chest CT images and chest X-ray images.
• Intensity normalization CLAHE were used to eliminate noise and improve the quality of the X-ray images. • Data augmentation, weight-decay, and L2 regularizer are applied over-fitting.

• Only binary classification was done
The abovementioned studies use convolutional operations to extract relevant features from the given input. The feature map is then passed to the classification head to detect or classify COVID-19. These feature maps may contain the relevant features extracted from the input image. However, a convolution neural network sees the input image as a cluster of pixels arranged in distinct patterns. It does not understand them as components present in the image nor the probability distribution of these components. With VAE, given a feature map, it describes the probability distribution of the feature map in a latent space. Thus, each attribute in the feature map is represented as a probability distribution for which we can randomly sample latent vectors for classification instead of using the entire feature map. This statistical distribution enforces a continuous, smooth latent space representation of the latent vector with its values near one another in a latent space, thereby acting as further filtering of the feature map to a reduced dimension but still holding relevant features with a probability distribution for classification.

Materials and Methods
This section discusses the theoretical aspect of VAEs and our proposed model. We describe the theory of variational autoencoders and the loss functions used in this work. We then describe the architectural design of the three proposed approaches.

Variational Autoencoders
AEs are neural network architectures with two main parts: an encoder f e and a decoder f d . Given input datum x i , the encoder transforms x i into a latent representation z i , which is of lower dimension than x i . The decoder takes z i as input and attempts to reconstruct x i . Reconstruction loss, L, is computed to measure the model's performance. L is computed by comparing the difference between x i and x i . The weights of the model are updated through backpropagation. VAEs are probabilistic variations of AEs that combine Bayesian variational inference and deep learning. Similar to AEs, VAEs also have encoders and decoders and perform the same function. Figure 1 shows the basic architecture of VAEs. Instead of just learning the latent representation, VAEs learn the probability distribution of the training data through an amortized inference procedure that provides the means for computing the latent representation of the input data via the encoder. The encoder acts as the variational posterior q φ (z|x), while the decoder acts as a generative model representing the likelihood p θ (x|z). Given the posterior and the likelihood, we can define a joint inference distribution as Any probability distribution function (PDF) can be used as the posterior distribution, but it is mainly assumed to be a multivariate Gaussian with diagonal covariance matrix p z (z) = N (z; 0, I). The likelihood function p θ (x|z) can be a Gaussian distribution or a Bernoulli distribution. In the encoder, the weights and biases are parameterized by the variational parameter φ, while in the decoder, the weights and biases are parameterized by the model parameter θ. Unlike in autoencoders, where the encoder outputs z (a probability distribution representing the latent embeddings), encoders in VAEs output µ and log σ 2 and generate ε from N (0, I), out of which z is sampled using a reparameterization trick. The variable z is then fed as input to the decoder, which tries to reconstruct x; z is defined as: where is an elementwise multiplication operation, and z is a probability distribution representing the relevant features from the input x. Assuming z ∼ N µ, σ 2 , then z can be reparameterized by Since p θ (z) and q φ (z|x) are both Gaussian distributions, the difference between the two distributions can be computed directly. Given a data point x (i) , the resulting likelihood is calculated as: Equation (4) is known as evidence lower bound (ELBO). The objective of the model is to maximize ELBO L(θ, φ, x) with respect to θ, the model parameters, and φ, the variational parameters. The first part of ELBO represents the reconstruction loss, which ensures the decoder is able to reconstruct x i from the latent representation z. The second part of ELBO acts as a regularizer called a divergence. It measures the divergence between q φ (z|x) and p θ (z) as well as penalizes the entanglement between the components in the latent space.

Kullback-Leiber (KL) Divergence
KL divergence measures how similar two probability distributions are. Given two probability distributions p(x) and q(x) defined over x, the KL divergence denoted by D KL (p(x)||q(x)) from q(x) to p(x) is defined as:

Maximum Mean Divergence (MMD)
Similar to KL divergence, MMD measures how close two distributions are to each other. Given two probability distributions, p(x) and q(x), MMD, denoted as MMD(p(x) q(x)), is defined as: where k(x, x ) is any universal kernel, for which we use the Gaussian kernel in this work.

Loss Function Establishment
In this study, we adopted two objective functions. The first objective function ensures the VAE reconstructs x i as closer to the actual input datum x i such that the latent representation z that is created is in some particular distribution. First, we compute the reconstruction loss by measuring the difference between x i and x i . The reconstruction loss, denoted as L(θ, φ; x), is defined as: where q (φ) (z|x) represents the latent distribution created by the encoder from the input data, and p θ (x|z) represents the reconstructed distribution from the latent representation by the decoder. Afterwards, the divergence among the latent representation distribution is measured. In this study, we adopt the MMD. The final VAE loss is defined as: The cross-entropy (CE) loss was adopted as the second objective function to handle the classification task. The CE loss function is computed as: where N represents the number of classes, O ∈ R N×1 represents the output from the fully connected layer, y and y i represent the true label in the batch and predicted label of image i in the batch, respectively, and S is the softmax applied to the output of O to normalize it.

Ensemble Variational Autoencoder
This section details the architecture of our proposed model. Using VAE and ensemble learning, we propose three different architectures for the classification of COVID-19. Each architecture consists of two encoders (ResNet50 and VGG16), reparameterization, a classification head, and either one or two decoders. The classification layers of ResNet50 and VGG16 were removed.

ResNet50 Encoder
The ResNet50 encoder follows the ResNet50 architecture [51]. The first layer is a convolutional layer with a kernel size of 7 × 7 and 64 different kernels with a 2 × 2 stride. The layer is immediately followed by a 2 × 2 max-pooling layer. Four sequential blocks follow Layer 1. Each block performs a convolution operation with different kernel sizes. In Block 1, there is a 1 × 1, 64 kernel convolutional operation followed by a 3 × 3, 64 kernel convolutional and finally a 1 × 1, 256 kernel operation. Block 1 is repeated 3 times, giving a total of 9 layers. Similarly, in Blocks 2, 3, and 4, the kernel sizes used are (128, 128, 512), (256, 256, 1024), and (512, 512, 1024), respectively. Blocks 2, 3, and 4 are repeated 4, 6, and 3 times, respectively, giving a total of 50 layers. Finally, an adaptive average pooling and flattening layer is applied to the output of Block 4.

VGG16 Encoder
We adopted the VGG16 [50] architecture for the VGG16 encoder. There are two 3 × 3 convolutional filter layers and a 2 × 2 max-pooling layer. Each of these two layers are repeated 3 times. Next, there are three 3 × 3 convolutional filter layers and a 2 × 2 maxpooling layer, each of which is repeated two times. Finally, adaptive average pooling and flattening is applied.

Reparameterization
The outputs from the encoders are transformed using a linear layer to the dimension of the latent space, R B×128 , where B is the batch size. We then perform random sampling from a standard normal Gaussian distribution having a mean and variance. Two linear layers are used to implement the mean and variance, which has the same dimension as the latent space. Standard deviation is computed from the variance, and Equation (2) was applied to sample z.

Classification Head
The classification head takes z as input and performs a linear transformation on it. The output dimension of the classification head is R B×C , where B is the batch size, and C = {3, 4} is the number of classes.

Decoders
The decoder performs the reverse operation of the encoder. It takes z as input and performs a ConvTranspose2d on it to upsample it to produce an output X ∼ X.

Model One
In this model, as depicted in Figure 2, the feature maps of the last layers in both encoders are concatenated before performing the reparameterization trick to sample the latent embedding z. This model allows us to merge at a high level the individual features from both encoders, giving us a richer feature map before sampling z. The sampled latent vector is then passed as input to the decoder to reconstruct the input and the classification head to classify the various types of pneumonia. In this model, only one decoder is used. There are two objective functions: L VAE , which ensures the output of the decoder is closer to that of the input; and L cls (O, y), which provides the classification head the ability to accurately classify the various types of pneumonia. The advantage of this model is its simplicity. It takes advantage of concatenating the feature maps of each encoder before sampling the latent vector. Though this is the simplest of the three models, its performance depends on the sensitivity of how well each encoder can extract relevant features from the input since they share a single decoder and a single L VAE objective function.

Model Two
This model combines separate low-level VAEs to form a high-level VAE. Each low-level VAE learns the representation of the input data to create its latent vector. These independent latent vectors are merged to create an integrated low-level representation. Like Model One, Model Two also employs a single decoder and two objective functions. The computational cost of this model is higher than that of Model One. There is a three-stage learning process involved: first, learning the low-level VAEs of individual encoders; second, reconstruction of the low-level representation to high-representation by the decoder; third, classification of the various types of pneumonia by the classification head. Despite its computational cost, it presents some advantages. The decoder and classification head input is composed of low-level representations from the individual encoders, which already have distribution regularization terms. Thus, the merged representations already consist of approximated multivariate standard normal distributions representing the input of each encoder. Similar to Model One, performance also depends on the sensitivity of how well each encoder can extract relevant features from the input, since they share a single decoder and a single L VAE objective function. Figure 3 depicts the architecture of Model Two.  Figure 3. Model Two: Each encoder performs its own reparameterization to generate its own latent embedding z. The latent embeddings are then merged and passed as input to the decoder and classification head for reconstruction and classification of pneumonia, respectively.

Model Three
This model is similar to Model Two. The difference is that for Model Three, each encoder implements its decoders. After reparameterization, the latent vectors of the individual encoder are merged to form one latent vector and then passed to the classification head. The individual latent vectors are then passed to their respective decoders to reconstruct the input. Three objective functions are used: two L VAE function-one for each encoder-decoder-and L cls (O, y) for the classification task. This model inherits all the advantages presented by architecture two. In addition, it solves the performance problem of Models One and Two since each encoder now implements its own decoder and L VAE . Figure 4 depicts the architecture of Model Three.

Dataset
In this study, we used the COVID-19 Radiography Database curated by [93,94], which is publicly available in the Kaggle repository. The dataset comprises 21,165 X-ray images: 3616 COVID-19-positive, 10,192 normal (non-COVID), 6012 lung opacity (non-COVID lung infection), and 1345 viral pneumonia. We experimented with both three and four classes. For four classes, we split the dataset into training (12,696 samples, representing 60% of the entire dataset), validation (6351 samples, representing 30% of the entire dataset), and testing (2118 samples, representing 10% of the entire dataset) sets. For the three classes, we combined lung opacity and viral pneumonia to form one set called viral pneumonia. To increase the number of COVID samples, we performed data augmentation using Augmentor to generate 6000 samples. We then sampled 5000 COVID images, 5000 normal images, and 5000 pneumonia images. We then split the dataset into training (10,499 samples, representing 70% of the dataset), validation (3750 samples, representing 25% of the dataset), and testing (750 samples, representing 5% of the dataset). Table 2 shows the composition of the dataset for four classes and three classes. Figure 5 shows samples of images from the dataset.

Selection Of Backbone Networks
To select the backbone networks for our work, we experimented with four pre-trained models: ResNet50, DenseNet201, VGG16, and Xception, out of which we chose the top two models with the highest performance. We ran each model on the four-class dataset for 20 epochs using a learning rate of 0.00003 and the Adam optimizer. The ResNet50 pretrained model obtained the best results, followed by VGG16. DenseNet201 and Xception showed similar results. Hence, we selected ResNet50 and VGG16 as the backbone networks for our model. Table 3 shows the results of the backbone network selection experiment.

Evaluation Metrics
In this study, five quantitative measures were adopted to measure the performance and effectiveness of our proposed model: accuracy, precision, recall, area under the curve (AUC), and F1 score.
Confusion Metric: Provides an intuitive and descriptive way to summarize the performance and correctness of a model.

Experimental Setup
In all the experiments, the input image was resized to a fixed size of 224 × 224 before feeding it to the network, there was a mini-batch size of 4 for both training and validation sets. The dimensions of the input image were 4 × 3 × 224 × 224. All models were trained for 20 epochs. For the validation set, the weight with the best accuracy was saved. We adopted two loss functions: maximum mean divergence (MMD) for variational autoencoder (refer to Equation (8)) and cross-entropy function for the classification task. The Adam optimizer was chosen as the primary optimizer and had a learning rate of 0.00003. We also experimented with three additional learning rates: 0.00001, 0.00002, and 0.00004. Table 4 shows the hyperparameters used in this study. Choosing an optimal objective function for the experiment was also considered. Since the architectures involve multiple tasks-reconstruction and classification-the chosen objective functions should improve the disentanglement between the latent embeddings and the classification of the different types of pneumonia. For the encoder-decoder, the objective function should consider both the reconstruction loss and a regularization term. Mean square error (MSE) loss was chosen for the reconstruction task, with MMD as the regularization term and cross-entropy as the loss for classification.
The network weights were initialized randomly since we did not use pre-trained models as the backbone for our network. All experiments were conducted on a single GPU (GEFORCE RTX 3060) with an 8-core AMD Ryzen 7 3700X processor.

Results and Analysis
This section reports the various experimental results of the proposed models using the COVID-19 Radiography dataset. For each of the models, we conducted several experiments by changing some of the hyperparameters. For the initial experiment, the Adam optimizer with a learning rate of 0.00003 was used and was trained for 20 epochs with a batch size of 4. This was done for both three and four classes. In the subsequent experiments, the optimizer was kept constant while we varied the learning rate. Table 5 shows the computational complexity of our proposed model in terms of the number of parameters and multiply-and-accumulate operations (MACs). From the table, Models One and Two have almost the same complexity since both models share a single decoder unit. The slight difference between their complexities is a result of Model Two performing a reparameterization operation for both encoders compared to a single reparameterization operation in Model One. For Model Three, each encoder has its own decoder, resulting in a higher number of parameters. Additionally, there are two reconstruction loss functions for the reconstruction of the input image, and one cross-entropy loss for the classification task, resulting in higher MACs. Table 5. Computational complexity of the three models in terms of parameters (in millions) and multiply-and-accumulate operations (MACs).   Table 6. Best performance for the three models on the validation dataset. For each model, the performance metrics for both three classes and four classes are shown, along with the best learning rate.  Table 7 shows the performance metrics of the best model (Model Three). For the three classes, it achieved 98.24% accuracy, recall, and F1 score, and 98.25% precision. For the four classes, an accuracy of 98.72%, recall of 98.52%, precision of 98.55%, and F1 score of 98.77% was achieved. Figure 6 shows the accuracy, loss, and ROC AUC graph comparing the performance of the three models for both three classes and four classes. These results indicates the proposed model's capability in classifying COVID-19. For all the performances measured, the best results are obtained with 20 training epochs, which is an indication that the performance of the model increases gradually as the number of epochs increases. We performed cross-validation using the testing set, as depicted in Table 8, which the model was not exposed to during training and validation. Cross-validations was performed to verify that our model does not suffer from overfitting and to test the model's robustness.  Tables 9 and 10 show the confusion matrix for the crossvalidation on the testing dataset. For three classes, out of a total of 487 radiographs, the model misclassified 130 radiographs; of those, 47 were COVID-19 images, 43 were normal images, and 40 were viral pneumonia images. For four classes, out of 2118 radiographs, the proposed model misclassified 17; of those, 4 were COVID-19 images, 3 were normal images, 6 were lung opacity images, and 4 were viral pneumonia images. This indicates the proposed model has higher true negative and true positive values and lower false negative and false positive values, suggesting the proposed model can accurately classify COVID-19 infections. Table 7. Performance metrics of the best model among the three models for both three and four classes on the validation dataset.  Table 8. Cross-validation results on the testing dataset for both three classes and four classes.

Ablation Studies
An ablation study was conducted to observe the contribution of the latent vector's dimension to the model's performance. The default dimension of the latent vector throughout the experiment was 128. Using a learning rate of 0.00003 and the Adam optimizer, we experimented with two lower dimensions for the ablation study: 32 and 64. The ablation study was conducted for both three classes and four classes on the validation dataset. Table 11 shows the ablation study for various latent sizes. The table shows that the model's performance increases with an increase in the latent vector's dimension. A latent dimension of 128 achieves higher performance in all the models.

Comparison with State-of-the-Art Methods
We compared our proposed model to various methods, as shown in Table 12. The comparison shows the proposed model, EVAE-Net, outperformed the methods in [37,40,41,95,96] using the same dataset (COVID-19 Radiography Database) for COVID-19 classification and methods that used other modalities [17,38,49,97]. It was worth noting that most of these methods only focused on either three classes or four classes. In our proposed model, we tested both three classes and four classes, making our model generalize better for COVID-19 classification.

Discussion and Future Work
In this study, we adopted VAE combined with an ensemble technique for the classification of COVID-19 with high accuracy. Tables 4 and 7 show the best model's validation and testing performance metrics. Of the three models, Model Three achieved the best performance. This reveals that the best result is obtained when each encoder implements its own decoder. With a single decoder, as with Model One and Model Two, the model's performance highly depends on how each encoder can extract relevant features from the input, since they share a single L VAE . With a decoder for each encoder, each encoder implements its L VAE with the advantage of learning to improve its performance from the loss by extracting more relevant features from the input. This further improves the latent embeddings sampled during the reparameterization stage since they are sampled from feature maps with more relevant features. A merger of the two latent vectors produces a richer latent vector, producing a higher classification result. The proposed EVAE-Net is more accurate than PCR since PCR results rely on the sample collection time and how they are stored and processed. Again, PCR results have high false negative rates when the patient is tested too early or too late after exposure to the virus. Further, the proposed model has higher true negative and true positive values and lower false negative and false positive values, suggesting the proposed model can accurately classify COVID-19 infections.
Although the proposed EVAE-Net produced interesting results, it still has some limitations, which we will address in future work. The proposed model cannot indicate the exact region of pneumonia in the chest X-ray image, which is vital for a radiologist. Future work will consider an attention mechanism focusing on the precise pneumonia region in the chest X-ray image. Since we only concentrated on using chest X-ray images for this study, the proposed model is biased toward chest X-ray images. This study did not consider other data modalities, such as CT scans. In future research, we will extensively examine other imaging modalities, which will help the model to generalize better.
Furthermore, we will combine several imaging modalities into a single dataset to investigate the robustness of the proposed model in future studies. We will also examine how image enhancement techniques such as discrete wavelet transform (DWT), CLAHE, etc., will enhance the feature maps from the encoders to improve the model's performance. We believe images that the model incorrectly classified can be improved when enhanced.
Other limitations are associated with the nature of VAE architecture. VAE architecture requires rich domain-specific knowledge, and this barrier prevents end users without design expertise from utilizing VAEs. Even researchers with expert knowledge must go through an arduous trial-and-error process to tune the architecture manually. For example, finding the optimal depth of the VAE is unknown from the beginning. Thus, it is unknown how to choose the appropriate number of convolutional, pooling, and dense layers for the VAE architecture and what the optimal hyperparameters for each convolutional and deconvolutional layer are. In future studies, we will explore neural architecture search (NAS) to find solutions to this limitation.
In recent years, many Internet of Things (IoT) devices, particularly Medical Internet of Things (MIoT), have been adopted in the health sector to combat various medical issues. Several applications have been deployed in these smart devices for remote patient monitoring (RPM) and other related medical tasks. In fighting against COVID-19, one important aspect of containing the virus is an effective and fast diagnosis method. With the rate at which the virus spreads, diagnosing and screening it quickly is key to containing it. Therefore, there is a need to develop effective and efficient deep learning models leveraging the advantages of IoT and MIoT to diagnose and screen COVID-19 with speed.

Conclusions
This study proposed an ensemble variational autoencoder network for the classification of COVID-19 with high accuracy. We exploited variational autoencoders to produce feature-rich latent vectors for classification. Three different ensemble variational autoencoders were designed. Models One and Two have two encoders and a single decoder; Model Three has two decoders. In the case of Model One, the feature maps from the two encoders are concatenated before reparameterization to sample the latent vector. The latent vector is then passed to the classification head for classification and to the decoder for reconstruction of the input image. Models Two and Three each implement reparameterization. The latent vectors from reparameterization are merged and passed to the classification head and the decoder. Our model was trained on the COVID-19 Radiography Dataset. In the case of three classes, the best model achieved 98.66% accuracy, 98.47% recall, 98.60% F1 score, and 98.75% precision. For four-class classification, 99.19% accuracy, 98.82% recall and precision, and 98.94% F1 score were accomplished. We demonstrated that our model can automatically predict and classify COVID-19 by extracting relevant features from chest radiograph images. This will go a long way to help reduce radiologists' workload and to avoid misdiagnosing COVID-19 patients. In future studies, we will adopt different DL strategies, such as attention mechanisms and NAS, to improve the computational complexity and performance of EVAE-Net.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
The dataset that supports this study was obtained from the Kaggle repository https://www.kaggle.com/datasets/tawsifurrahman/covid19-radiography-database (accessed on 14 August 2022), which was curated by [93,94]. It is publicly available. Further details can be found here.

Conflicts of Interest:
The authors declare no conflict of interest.