Investigating Beta-Variational Convolutional Autoencoders for the Unsupervised Classification of Chest Pneumonia

The world’s population is increasing, and so is the strain on existing healthcare infrastructure to cope with the growing demand for medical diagnosis and evaluation. Although human experts are primarily tasked with the diagnosis of different medical conditions, artificial intelligence (AI)-assisted diagnosis has become considerably useful in recent times. Pneumonia is one of the critical lung infections that requires early diagnosis and subsequent treatment to reduce the mortality rate. There are different methods for obtaining a pneumonia diagnosis; however, the adoption of chest X-rays is popular since it is non-invasive. AI systems for pneumonia diagnosis using chest X-rays are often built on supervised machine-learning (ML) models, which require labeled datasets for development. However, collecting labeled datasets is sometimes infeasible due to constraints such as human resources, cost, and time. As such, the problem that we address in this paper is the unsupervised classification of pneumonia using unsupervised ML models, including the beta-variational convolutional autoencoder (β-VCAE) and other variants, such as the convolutional autoencoder (CAE), the denoising convolutional autoencoder (DCAE), and the sparse convolutional autoencoder (SCAE). Namely, the pneumonia classification problem is cast as an anomaly detection problem to develop the aforementioned ML models. The experimental results show that pneumonia can be diagnosed with high recall, precision, f1-score, and f2-score using the proposed unsupervised models. In addition, we observe that the proposed models are competitive with state-of-the-art models trained on labeled datasets.


Introduction
The lung is a crucial organ of the human body. It is situated in the thorax region and is primarily responsible for respiration [1]. All the cells in the body require oxygen to carry out various activities. Humans typically have two lungs, one on the left and the other on the right. They continuously provide the body with the oxygen that it requires for performing different important biochemical processes [2]. The results of these biochemical processes often include carbon dioxide (CO2) as a by-product [3]. The human body is not conditioned for accumulating CO2; such an accumulation in the body is harmful. Interestingly, the lungs are also tasked with extracting the produced CO2 from the body [3]. The alveoli and capillaries are the specific sites where the gas exchange occurs.
The right lung is slightly larger than the left lung since the heart is also situated in the left section of the chest. Precisely, the right lung occupies 56% of the total lung volume [4]. Furthermore, the lung is involved in other functions such as the absorption and removal of water, alcohol, nitrous oxide, and drugs [5]. As such, a healthy lung is indispensable for the proper functioning of the human body. Unfortunately, there are different diseases that can affect the lung, among which is pneumonia.

Table 1. Related works summary for the classification of pneumonia using chest X-ray.
In [16], pre-trained DNNs, including AlexNet, ResNet18, DenseNet201, and SqueezeNet, were used for knowledge transfer. However, the problem was formulated as a multiclass problem, where chest X-rays were classified as having 'no pneumonia', 'bacterial pneumonia', or 'viral pneumonia'. Another work [17] proposed the supervised classification of chest X-rays using a CNN and dynamic capsule routing units; competitive results were reported in the work. The work in [19] proposed training a CNN for feature extraction and, then, using different classifiers such as SVM, k-NN, and RF to improve classification accuracy. Several experimental results were reported in the work to validate the claim. However, all the works mentioned above use labeled datasets for learning. Unfortunately, the collection of labeled datasets is not always feasible for reasons such as cost, time, and human resources. Given the problems mentioned above, this paper proposes the unsupervised classification of pneumonia using chest X-rays. The classification task is reformulated as an anomaly detection task using different types of autoencoder models. The results of the models are compared against other unsupervised learning models and certain supervised models.
As such, the major contributions of this paper are the following: (i) We investigate unsupervised ML models such as the beta-variational convolutional autoencoder (β-VCAE) [21] and other variants, namely the plain convolutional autoencoder (PCAE) [22], the denoising convolutional autoencoder (DCAE) [22], and the sparse convolutional autoencoder (SCAE) [23], for pneumonia diagnosis. (ii) We study how the choice of the class that the DNN is trained on impacts the generalization performance of the aforementioned models. (iii) We provide a comparative analysis with state-of-the-art supervised models for pneumonia diagnosis. Our evaluation metrics include recall, precision, f1-score, and f2-score, which particularly take into account false negatives and false positives.
In the remaining parts of this paper, there are the following sections: Section 2 discusses the background and problem statement. In Section 3, the proposed unsupervised ML models for the diagnosis of pneumonia are presented. In Section 4, the results of experiments from the different models investigated are given and discussed, including comparison with some state-of-the-art supervised ML models. We conclude the paper in Section 5 by summarizing the main findings from the experimental results.

Background
Given a chest X-ray, the traditional approach for pneumonia diagnosis requires a trained medical specialist to examine and evaluate the X-ray to determine whether the patient has pneumonia. However, such a procedure can sometimes result in an erroneous diagnosis due to human fatigue or random attention loss. Moreover, the time for diagnosis can quickly increase when only a few medical specialists are available to evaluate many X-ray scans. As an alternative, an AI-based computer program can be developed to aid diagnosis. Such a program processes the chest X-ray and searches for patterns that may reflect the presence of pneumonia; it is data-driven for knowledge acquisition. The AI program is tasked with classifying X-rays as having pneumonia or not, that is, with acting as a binary classifier.
Since computer programs are not impacted by fatigue or distractions, they are generally reliable once properly trained for the given task. Furthermore, an AI program can evaluate several chest X-ray records over a short period; a human examiner usually requires more time to evaluate the same number of chest X-ray records due to fatigue. The ML model that is investigated in this work is the β-VCAE and its variants. Figure 1 shows some samples of the chest X-rays used in this work.

Figure 1. Chest X-ray samples used in this paper [24]. Top row: X-ray samples with no pneumonia. Bottom row: X-ray samples with pneumonia.

Problem Statement
Although there are existing DNN approaches for classifying chest X-rays for pneumonia, they are often supervised DNN models [25][26][27]. This means that these approaches require labeled datasets. However, collecting and labeling datasets can sometimes be very costly. At other times, it may only be possible to collect datasets with all samples (examples) belonging to a single class, e.g., with no pneumonia. This can occur when the disease of interest is rare, so that data samples for the positive class (i.e., with pneumonia) are difficult or outright impossible to collect. In this scenario, most supervised DNN models, such as VGG [28], ResNet [29], DenseNet [30], etc., cannot be employed. Consequently, the problem statement is the development of pneumonia classification models using unlabeled data, that is, unsupervised models.

Proposed Unsupervised Classification of Pneumonia
The objective of this work is to investigate the unsupervised classification of pneumonia using chest X-rays. What particularly distinguishes this work from the related works on the diagnosis of chest pneumonia is that the classification of pneumonia is formulated as an anomaly detection problem. That is, the developed models are trained on only samples that belong to one class and tested on samples that belong to both classes. The hypothesis is that a model that is successfully trained on only a particular class treats any sample that does not belong to that class as an anomalous data sample. The family of unsupervised DNNs that we explore is the CAE, which is either generative or non-generative depending on the training algorithm. Generative CAEs learn the training data generation process and an interpretable latent space (hidden representation), which can be sampled to generate novel data samples. In contrast, non-generative CAEs do not model the training data generation process; therefore, their latent spaces cannot be sampled from to generate novel samples, and they simply learn a compressed representation in the latent space. In this paper, for the generative CAE, we explore the beta-variational convolutional autoencoder (β-VCAE) [21], which is a generalization of the variational convolutional autoencoder (VCAE) [31], recovered when β = 1. In addition, we experiment with other CAEs such as the conventional CAE, referred to as the plain CAE (PCAE) [22], the denoising convolutional autoencoder (DCAE) [22], and the sparse convolutional autoencoder (SCAE) [23].
The developed unsupervised pneumonia classification system that is based on autoencoders can provide fast, considerably accurate, and cost-effective diagnoses without the need for a labeled dataset. This section details the proposed diagnosis system that employs different unsupervised ML models. The main stages in the classification framework are data processing and model training setup, which are discussed in the rest of the section.

Data and Data Processing
The chest X-ray images obtained from the database [24] have samples that reflect X-rays without pneumonia and those with pneumonia. The X-rays with pneumonia are referred to as positive (+ve) samples or class, while X-rays without pneumonia are referred to as negative (−ve) samples or class. Subsequently, the positive and negative training data are abbreviated as Train(pos) and Train(neg), respectively. Similarly, the positive and negative testing data are abbreviated as Test(pos) and Test(neg), respectively. Figure 2 shows the overall diagnosis framework. The X-ray images have different sizes. However, the ML models employed in this paper accept input images of a fixed size. As such, the X-ray images are resized to a fixed size of 64 × 64 pixels. This has the additional advantage of reducing the computing hardware requirements for training and testing the ML models.
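As an illustration, the resizing and normalization step described above can be sketched as follows; the function name and the use of Pillow are our own choices, not details from the original pipeline:

```python
import numpy as np
from PIL import Image

def preprocess(path):
    """Load a chest X-ray, convert to grayscale, resize to 64 x 64 pixels,
    and scale the pixel values to the range [0, 1]."""
    img = Image.open(path).convert("L").resize((64, 64))
    return np.asarray(img, dtype=np.float32) / 255.0
```

The normalization to [0, 1] also matches the sigmoid output layer used by the models, which produces reconstructions in the same range.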

Figure 2. Proposed unsupervised framework for the diagnosis of pneumonia using chest X-ray. The orange items and red items show the training and testing paths, respectively. The DNN is trained using only data from Train(pos) or Train(neg), while the model is tested using data from both Test(pos) and Test(neg).

Proposed Unsupervised Models
The problem of pneumonia classification is formulated as an anomaly detection problem based on the autoencoder model family, where the model is trained on only the positive class or only the negative class. A model that is trained on the positive class is expected to have successfully learned the reconstruction of X-ray samples from the positive class, namely, giving small reconstruction errors for positive samples but high errors for negative samples during testing. Similarly, a model successfully trained on only negative samples is expected to give small reconstruction errors when tested with negative samples but high reconstruction errors when tested with positive samples. It is on this concept that the proposed models are based for diagnosing pneumonia in an unsupervised fashion.
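The reconstruction-error test described above can be sketched as follows; `model`, the function names, and the image shape are illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

def reconstruction_errors(model, images):
    """Per-sample mean squared reconstruction error; images has shape (N, 64, 64)."""
    recon = model(images)
    return np.mean((images - recon) ** 2, axis=(1, 2))

def classify(errors, threshold):
    """Flag samples whose reconstruction error exceeds the threshold as anomalous,
    i.e., as belonging to the class the model was NOT trained on."""
    return errors > threshold
```

With a model trained only on the negative (no pneumonia) class, samples flagged by `classify` would be diagnosed as positive, and vice versa.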
In the paper, four different convolutional autoencoders are investigated. The motivation is to study how the formulation of the models impacts their classification performance. It is well-known that the latent space properties of autoencoders significantly contribute to their regularization characteristics and, therefore, their modeling power. An autoencoder is essentially tasked with reconstructing the supplied input data in the output layer [32]. This is expected to foster the learning of interesting latent representations in the hidden layer. The hidden units learn a mapping function for reproducing the input data in the output layer. It is noteworthy that the major challenge in learning autoencoders is capturing important features in the training data, that is, learning good latent representations for the training data [33]. Although the different autoencoders used in this work employ convolution operations, we describe their operations in this section using fully connected networks (multilayer perceptrons) for the sake of simplicity. However, similar formulations can be directly extended to autoencoders with convolution operations.

Plain Autoencoder
The plain autoencoder is shown in Figure 3, where, for a supplied input pattern, x, the autoencoder is tasked with learning a good hidden representation, L1(x), essential for the successful reconstruction of the input data in the output layer as y. Here, u is the dimensionality of the input data, and h is the number of hidden units. The notation L1(x) denotes the first hidden layer representation (or codes), since autoencoders can be stacked to obtain a deep network (with many hidden layers). Generally, autoencoders are learned in two stages, referred to as encoding and decoding. Let the training set be of the form x ∈ R^u and y ∈ R^u (as obtained in the input and output data for autoencoding), with L1(x) ∈ R^h.

- Encoding: the autoencoder learns the mapping of the input data, x, to the hidden layer codes, L1(x), as a function f_enc: x ↦ L1(x). This is described as

L1(x) = g(W_enc x + b^(L1)),    (1)

where L1(x) is the hidden layer output, W_enc ∈ R^(h×u) is the input-hidden weight matrix, b^(L1) ∈ R^h is the bias vector for the hidden layer, and g is the activation function. The encoding phase can be conceived of as maximizing the mutual information between the input data, x, and the hidden layer, L1(x) [33].
- Decoding: the decoder learns the mapping of the learned hidden layer representations (activations), L1(x), to the input data, x, in the output layer as y; this is learned as a mapping function f_dec: L1(x) ↦ y. Note that the actual autoencoder output is y, while the target output is x; see Figure 3. The decoding phase is described as

y = g(W_dec L1(x) + b^(y)),    (2)

where y is the actual network output, W_dec ∈ R^(u×h) is the hidden-output weight matrix, L1(x) is the hidden layer activation vector, b^(y) ∈ R^u is the bias vector for the output layer, and g is the activation function. The autoencoder weights and biases are learned by minimizing a cost function, C(x, y; θ), where θ denotes the model parameters, that is, θ = {W_enc, b^(L1), W_dec, b^(y)}. For real-valued inputs, the obvious choice is the mean squared error, defined as

C(x, y; θ) = (1/N) Σ_{n=1}^{N} Σ_{i=1}^{u} (x_i^(n) − y_i^(n))^2.    (3)

Note that for autoencoders, the output unit denoted y_i is tasked with reconstructing the input x_i. Furthermore, the number of output units is equal to the number of input units, u; N is the number of training patterns or samples. For binary inputs, the sum of Bernoulli cross-entropies can be used as the cost function:

C(x, y; θ) = −(1/N) Σ_{n=1}^{N} Σ_{i=1}^{u} [ x_i^(n) log y_i^(n) + (1 − x_i^(n)) log(1 − y_i^(n)) ].    (4)

It is important to note that the autoencoder described here with convolution operations, which is used in the paper, is referred to as the plain convolutional autoencoder (PCAE).
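For concreteness, the encoding, decoding, and mean-squared-error computations described above can be sketched with fully connected layers in NumPy; the dimensions and random initialization below are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
u, h = 4096, 64                       # input dimensionality (64*64 pixels) and hidden units
W_enc = rng.normal(0, 0.01, (h, u))   # input-hidden weight matrix
b_L1 = np.zeros(h)                    # hidden-layer bias vector
W_dec = rng.normal(0, 0.01, (u, h))   # hidden-output weight matrix
b_y = np.zeros(u)                     # output-layer bias vector

def g(a):
    """Sigmoid activation function."""
    return 1.0 / (1.0 + np.exp(-a))

def encode(x):
    """Encoding step: map the input x to the hidden codes L1(x)."""
    return g(W_enc @ x + b_L1)

def decode(L1):
    """Decoding step: map the hidden codes back to the reconstruction y."""
    return g(W_dec @ L1 + b_y)

def mse(x, y):
    """Mean squared reconstruction error for a single sample."""
    return np.mean((x - y) ** 2)
```

Training would minimize this error over all samples by gradient descent on the weights and biases; the sketch only shows the forward pass.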

Denoising Autoencoder
Given sufficient model capacity (i.e., number of parameters), the plain autoencoder can undesirably learn to copy the input to the output without learning useful latent representations. As such, autoencoders can be trained with additional constraints that encourage the network to learn better hidden representations useful for subsequent discriminative tasks, e.g., improved model optimization and regularization [34]. An interesting constraint is the denoising training criterion, where the input data are stochastically corrupted to some specified degree (usually 5-50%), and the autoencoder is tasked with reconstructing the clean (uncorrupted) data in the output layer; this model is referred to as a denoising autoencoder (DAE) [35]. The way the denoising autoencoder is learned admits an interpretation as a generative model. Let the stochastic process by which the noisy input data x̃ is generated from the clean input data x be represented as κ: x ↦ x̃. Then, the denoising training criterion is a way to approximate the stochastic operator κ′: x̃ ↦ x. In this work, we implement zero-masking noise, where input values are randomly set to zero with some pre-defined probability, σ. The DAE with convolution operations is referred to as a denoising convolutional autoencoder (DCAE).
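A minimal sketch of the zero-masking corruption κ, assuming the corruption is applied element-wise with probability σ:

```python
import numpy as np

def zero_mask(x, sigma, rng):
    """Zero-masking corruption: each input value is set to 0 with probability sigma."""
    mask = rng.random(x.shape) >= sigma   # a value is kept with probability 1 - sigma
    return x * mask
```

During training, the corrupted input `zero_mask(x, sigma, rng)` is fed to the encoder while the clean `x` serves as the reconstruction target.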

Sparse Autoencoder
An obvious concern with autoencoders is the possibility of the model simply duplicating the input data in the output layer (without learning useful hidden representations) in view of maximizing the mutual information between the input data, x, and the hidden representation, L1(x) [22]. Hence, sparsity can be employed as a way to constrain autoencoders to learn useful hidden representations for the input data [22]. The hidden units are constrained to have a small pre-defined activation value, z [36]. From Figure 3, the computed sparsity parameter, z̄, for a hypothetical hidden unit, j, can be obtained using

z̄ = (1/N) Σ_{n=1}^{N} O_j(x_n),    (5)

where O_j is the output (or activation) of hidden unit j, x_n is the training input pattern vector (sample) with index n, and N is the number of training samples. The main motivation behind sparsity is to constrain hidden unit j such that z̄ = z. An inspection of (5) reveals that such a constraint enforces the average activation of hidden unit j over the training samples {x_n}_{n=1:N} to take small values in order to realize a small z̄. The Kullback-Leibler (KL) divergence described in (6) can be used to measure the deviation of z̄ from z [34] and, hence, optimize the whole model:

KL(z ‖ z̄) = z log(z/z̄) + (1 − z) log((1 − z)/(1 − z̄)).    (6)

It is important to note that KL(z ‖ z̄) = 0 for z̄ = z. For the overall model cost minimization, the KL divergence is added to the mean squared error. Hence, the new overall cost function, C_sparse(x, y; θ), is expressed in (7), where γ emphasizes the effect of the sparsity constraint:

C_sparse(x, y; θ) = C(x, y; θ) + γ Σ_{j=1}^{h} KL(z ‖ z̄_j).    (7)
The SAE with convolution operations is referred to as a sparse convolutional autoencoder (SCAE).
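The sparsity penalty described above can be sketched as follows; the clipping of the mean activation is our own numerical safeguard, not part of the original formulation:

```python
import numpy as np

def kl_sparsity(z, z_bar):
    """KL divergence between the target sparsity z and the mean activation z_bar."""
    return z * np.log(z / z_bar) + (1 - z) * np.log((1 - z) / (1 - z_bar))

def sparse_penalty(activations, z, gamma):
    """Gamma-weighted sparsity penalty summed over hidden units.
    activations: array of shape (N, h) holding hidden-layer outputs for N samples."""
    z_bar = activations.mean(axis=0)                       # average activation per hidden unit
    z_bar = np.clip(z_bar, 1e-7, 1 - 1e-7)                 # avoid log(0) at the extremes
    return gamma * np.sum(kl_sparsity(z, z_bar))
```

The penalty is zero exactly when every hidden unit's average activation equals the target z, and grows as the activations drift away from it.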

Beta-Variational Autoencoder
The beta-variational autoencoder (β-VAE) [21] is a generative model, and it is well-known for its structured latent space, which enables the sampling of novel, plausible samples. This is an attribute that the other autoencoders lack, since their latent spaces are not structured or continuous (i.e., smooth), so sampling from them is not guaranteed to generate semantically meaningful samples. Interestingly, the continuous latent space has been argued to regularize the model even for tasks that do not require generating novel samples. Namely, the latent space of the β-VAE is constrained to form a continuous manifold so that the data reconstructed from it are more plausible. The second term in (8) is responsible for the smoothness constraint. This, in turn, is expected to improve its performance for anomaly detection and, therefore, the diagnosis of chest pneumonia.
In simple terms, this autoencoder has two terms in its cost function, as shown in the following expression:

C(x, y; θ, φ) = C_rec(x, y; θ) + β KL( q_φ(L1(x)|x_i) ‖ p(L1(x)) ),    (8)

where C_rec is the reconstruction error, q_φ(L1(x)|x_i) is the model encoder that is used to approximate the intractable p(L1(x)|x), and the prior p(L1(x)) is generally chosen to be the isotropic unit Gaussian distribution. The first term in the cost function is the typical reconstruction error, and the second term is the latent space regularization that encourages smoothness and structure. It can be noted that the β-VAE is a generalization of the conventional VAE proposed in [31], which is recovered by setting β = 1. The β-VAE with convolution operations is referred to as β-VCAE.
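Assuming the encoder outputs a diagonal Gaussian, the two-term cost can be sketched with the closed-form KL divergence against an isotropic unit Gaussian prior; the function names are illustrative:

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def beta_vae_loss(x, y, mu, log_var, beta):
    """Reconstruction error plus the beta-weighted latent regularizer."""
    recon = np.mean((x - y) ** 2)
    return recon + beta * gaussian_kl(mu, log_var)
```

Setting `beta=1` recovers the conventional VAE cost, while larger values (e.g., `beta=5`) weight the latent regularizer more heavily, encouraging a more structured latent space.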

Model Evaluation Metrics
The evaluation metrics for the models in this work are recall, precision, f1-score, and f2-score. The chosen metrics ensure that the model evaluation is thorough; for example, the impact of the class distribution, false positives, and false negatives are reasonably evaluated. Considering that the misdiagnosis of pneumonia can be life-threatening, the main metric for results comparison is the f2-score, which especially penalizes the false negatives predicted by the model; these are very undesirable in this application. The definitions of the evaluation metrics are as follows, where TP, FN, and FP are true positives, false negatives, and false positives, respectively:

recall = TP / (TP + FN),
precision = TP / (TP + FP),
f1-score = 2 × (precision × recall) / (precision + recall),
f2-score = 5 × (precision × recall) / (4 × precision + recall).

Experiments
The dataset details and experimental settings, along with the results and discussion, are presented here. The experiments were carried out using a workstation with 32 GB of random access memory (RAM), an Intel Core i7 CPU, and an Nvidia GTX 1080 Ti GPU, running the Windows 10 OS.

Dataset Preparation
There are 3875 positive data (chest X-ray) samples and 1341 negative data samples in the dataset used in this paper. The two data preparation scenarios in the work are the following.
(i) Train and validate the model on only positive data samples, that is, using only Train (pos) and Val (pos) . Test the trained models on both positive and negative data samples, that is, using Test (pos) and Test (neg) . (ii) Train and validate the models on only negative data samples, that is, using only Train (neg) and Val (neg) . Test the trained models on both negative and positive data samples, that is, using Test (neg) and Test (pos) .
For Train(pos), the data samples from the positive class are divided into training, validation, and testing sets in the ratio 60%:20%:20%, respectively. All the negative class data samples are used as testing data. For Train(neg), the data samples from the negative class are partitioned into training, validation, and testing sets in the ratio 60%:20%:20%, respectively. All the data samples from the positive class are taken as testing data. The details of the data preparation for model training are given in Table 2, where the positive and negative classes are abbreviated as +ve and −ve, respectively. The pixels in the images are normalized to the range 0 to 1.

Table 2. Chest X-ray dataset preparation.

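The 60%:20%:20% partitioning described above can be sketched as follows; the shuffling and rounding choices are ours, not taken from the paper:

```python
import numpy as np

def split_indices(n, rng, train=0.6, val=0.2):
    """Shuffle n sample indices and split them 60%/20%/20% into train/val/test."""
    idx = rng.permutation(n)
    n_tr = round(train * n)
    n_va = round(val * n)
    return idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]
```

Applied to the 3875 positive samples, this yields roughly 2325/775/775 samples for training, validation, and testing; the negative samples would then all be reserved for testing under the Train(pos) setting.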

Model Architecture and Training Settings
Herein, the details of the training settings for the different models are discussed. All models have a single filter in the input and output layers, which allows the reconstruction of the input images in the output layers of the different models. All the constructed models have either two or four hidden layers, which are convolution layers. As such, we can observe how the number of hidden layers impacts model performance. The models with two hidden layers have 16 filters in both hidden layers. The models with four hidden layers have 32, 16, 16, and 32 filters, consecutively. All filters are of spatial size 3 × 3. The hidden layers use the rectified linear activation function, while the output layer uses the sigmoid function. The model hyperparameters are chosen heuristically using the validation data. The initial learning rate is set at 0.1 and the momentum rate at 0.9. The learning rate is decayed every 3 epochs by a factor of 0.2 to facilitate model convergence to one of the good local optima. The batch size is 32, and the models are trained for 10 epochs. The threshold for class association is taken as the maximum reconstruction error on the validation data.
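The step-decay learning-rate schedule and the validation-based threshold described above can be sketched as follows (the helper names are illustrative):

```python
def learning_rate(epoch, lr0=0.1, decay=0.2, every=3):
    """Step-decay schedule: the initial rate lr0 is multiplied by `decay`
    once every `every` epochs."""
    return lr0 * decay ** (epoch // every)

def anomaly_threshold(validation_errors):
    """Class-association threshold: the maximum reconstruction error
    observed on the (single-class) validation data."""
    return max(validation_errors)
```

At test time, a sample whose reconstruction error exceeds `anomaly_threshold(...)` would be assigned to the class the model was not trained on.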

Results and Discussion
The results of the different models obtained using Train(pos) and Train(neg) are reported in this section. Furthermore, we compare the results of our models, which employ unsupervised learning, with state-of-the-art models that use supervised learning. Tables 3-6 show the testing results for the different data preparation settings discussed in Section 4.1 (i.e., Table 2) on evaluation metrics such as recall, precision, f1-score, and f2-score. Table 3 shows the results of the models with two hidden layers trained using the Train(pos) setting, that is, using only positive samples as input but tested on both positive and negative samples. Table 4 reports similar experimental results for models with four hidden layers.

Table 3. Testing results of DNN models with two hidden layers obtained using Train(pos).

It is seen in Table 3 that the β-VCAE models provide the best results based on both the f1-score and f2-score; β-VCAE with β = 5, which enforces a more structured latent space, slightly outperforms β-VCAE with β = 1. The DCAE with σ = 0.1 outperforms the DCAE with σ = 0.05; we conjecture that σ = 0.1 imposes more regularization on the model and therefore yields better performance. The SCAE model with z = 0.3 performs better than with z = 0.5. Two observations can be made when comparing Table 4 with Table 3. First, increasing the number of hidden layers results in some performance improvements. Second, the β-VCAE models again give the best results.

For the Train(neg) setting, Tables 5 and 6 show the results of the models with two and four hidden layers, respectively. Here, the models are trained on only the negative class samples but tested on both negative and positive class samples. Importantly, it is seen in the experiments that the performance of these models lags behind that of the corresponding models trained using Train(pos); compare Table 3 with Table 5, and Table 4 with Table 6. Similarly, Tables 5 and 6 show that the β-VCAE models give the best results in comparison with the other models. Again, this reflects the crucial importance of the structured latent space of β-VCAE, which the other autoencoders lack.

Overall, the β-VCAE model gives the best performance in comparison to the other autoencoders. We posit that the nature of the latent space of the β-VCAE model discussed in Section 3.2.4 explains its success. Namely, the structured latent space ensures that data points are constrained within a plausible manifold from which the output of the model is reconstructed. The other autoencoders do not possess this attribute, as the manifold of their latent representations is not constrained. Consequently, they can generate implausible data, which is reflected in lower generalization performance.
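The class-association rule described in the training settings (threshold = maximum reconstruction error on validation data) can be sketched as follows; names are hypothetical, and per-sample reconstruction errors are assumed to be precomputed:

```python
def fit_threshold(val_errors):
    """Class-association threshold: the maximum per-sample
    reconstruction error observed on validation data drawn from
    the class the autoencoder was trained on."""
    return max(val_errors)

def classify(test_errors, threshold):
    """A test sample whose reconstruction error exceeds the threshold
    is flagged as an anomaly, i.e., assigned to the class NOT seen
    during training; otherwise it is assigned to the training class."""
    return [e > threshold for e in test_errors]

# Illustrative errors (not values from the paper):
val_errors = [0.010, 0.014, 0.012]
thr = fit_threshold(val_errors)           # 0.014
preds = classify([0.011, 0.030], thr)     # [False, True]
```

Under the Train(pos) setting, a `True` prediction corresponds to a negative (non-pneumonia) sample; under Train(neg), it corresponds to a positive one.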
Finally, the comparison of our best results, obtained using β-VCAE, with supervised models is considered. Popular supervised DNNs, namely VGG-16, ResNet-56, ResNet-110, and DenseNet-40, are used for the study. The results, including the number of model parameters and the depth, are shown in Table 7, where the β-VCAE models are trained using the Train(pos) setting; thus, the results are taken from Table 4. The supervised models are trained and tested by combining the same positive samples used for β-VCAE with the negative samples in Table 2. It is seen in Table 7 that the supervised models slightly outperform the proposed β-VCAE models, which are unsupervised. However, when the different costs associated with collecting labeled data for training are taken into consideration, our unsupervised models are competitive and can be successfully employed for the diagnosis of pneumonia using chest X-rays. The training curve for β-VCAE with β = 5, which is reported in Table 7, is given in Figure 4. Furthermore, in comparison to the supervised models, the proposed models have significantly fewer layers and parameters, and thus, training the β-VCAE model is many times faster. Importantly, the inference times of the different models are especially relevant, since they determine the maximum number of diagnoses that can be made within a given period. As such, the inference times of the different models are shown in Figure 5. It is observed that the proposed β-VCAE model has significantly smaller latency during testing than the supervised models, which is an additional advantage of the proposed model.
The results of the unsupervised models are surprisingly competitive with those of the supervised learning models in Table 7. Our explanation is that casting the classification problem into an anomaly detection task, at which autoencoders are known to perform well, makes the classification problem easier: the autoencoder only has to learn latent representations good enough that the reconstruction errors separate the two data classes, positive and negative pneumonia cases. The limitation of formulating the proposed solution as an anomaly detection problem is that it restricts diagnosis to two classes, i.e., binary classification problems. For problems with multiple classes, the proposed solution cannot be directly applied. However, supposing that training data for all classes in a multi-class task are available, it would be interesting to study whether training one autoencoder per class and then ranking the errors of all the autoencoders for a given test sample gives good results.
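The multi-class extension suggested above, which is not implemented in the paper, could be sketched as follows: with one autoencoder trained per class, a test sample is assigned to the class whose autoencoder reconstructs it with the smallest error (all names and values here are hypothetical):

```python
def multiclass_predict(per_class_errors):
    """Hypothetical multi-class extension: given one reconstruction
    error per class-specific autoencoder for a single test sample,
    predict the class whose autoencoder reconstructs it best
    (smallest error)."""
    return min(range(len(per_class_errors)), key=lambda k: per_class_errors[k])

# Illustrative errors from three class-specific autoencoders
# for two hypothetical test samples:
samples = [[0.02, 0.15, 0.20],
           [0.30, 0.05, 0.25]]
preds = [multiclass_predict(e) for e in samples]  # [0, 1]
```

Note that this ranking rule replaces the single-threshold rule of the binary case; whether the per-class error scales are directly comparable without calibration is an open question the study leaves for future work.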
The major challenge in the investigation was the computational resources needed to train the models in a reasonable time. A workstation with a GPU that has more than 6 GB of memory was found sufficient for the experiments. The training time can be further improved by using high-end GPUs that allow a larger batch size.

Conclusions
This work investigates the diagnosis of pneumonia in the scenario where the available dataset is unlabeled, using a beta-variational convolutional autoencoder (β-VCAE). The work also considers the convolutional autoencoder (CAE), denoising convolutional autoencoder (DCAE), and sparse convolutional autoencoder (SCAE) for the diagnosis problem. The diagnosis problem is formulated as an anomaly detection problem, so that the models are trained on data from only the positive or only the negative class. The findings show that the autoencoders can be successfully applied to the problem, especially when they are trained on the positive class samples. All the models achieve an f2-score of over 97%. We note that the β-VCAE consistently gives the best results among the different autoencoder models investigated. A comparison with supervised models such as VGG-16, ResNet-56, ResNet-110, and DenseNet-100 shows that the proposed β-VCAE, which is unsupervised, gives very competitive results and requires a significantly shorter time for testing. As such, a considerably larger number of diagnoses can be performed with the β-VCAE in comparison to the supervised models.