An Ensemble SSL Algorithm for Efficient Chest X-Ray Image Classification

Abstract: A critical component in the computer-aided medical diagnosis of digital chest X-rays is the automatic detection of lung abnormalities, since effective identification at an initial stage constitutes a significant and crucial factor in a patient's treatment. The vigorous advances in computer and digital technologies have ultimately led to the development of large repositories of labeled and unlabeled images. Due to the effort and expense involved in labeling data, training datasets are of a limited size, while in contrast, electronic medical record systems contain a significant number of unlabeled images. Semi-supervised learning algorithms have become a hot topic of research as an alternative to traditional classification methods, exploiting the explicit classification information of labeled data together with the knowledge hidden in the unlabeled data for building powerful and effective classifiers. In the present work, we evaluate the performance of an ensemble semi-supervised learning algorithm for the classification of chest X-rays of tuberculosis. The efficacy of the presented algorithm is demonstrated by several experiments and confirmed by statistical nonparametric tests, illustrating that reliable and robust prediction models can be developed utilizing a few labeled and many unlabeled data.


Introduction
During the second half of the last century, the area of diagnostic medicine has massively changed: from a rather qualitative science based on observations of whole organisms to a more quantitative science, which is also based on knowledge extraction from databases. The widespread adoption of electronic medical records contributes to the exponential generation of biomedical data in size, dimension and complexity [1]. Furthermore, these biomedical datasets have non-linear relationships between inputs and outcomes, hindering their analysis and modeling. Leveraging these data offers a significant potential to transform biomedical research and the delivery of healthcare. Therefore, machine learning and data mining techniques can be considered a helpful tool, extracting useful and valuable information for the development of intelligent computational systems.
Despite the development of efficient treatments, as well as the advances in medicine, Tuberculosis (TB) is considered to be one of the most lethal diseases worldwide. More specifically, in 2013 alone, it is estimated that 1.5 million people died of TB and nine million new cases occurred. The rate of TB mortality is slowly declining each year through early diagnosis and effectively targeted treatment. Although several tests for TB diagnosis exist, either for active TB (e.g., sputum culture or Xpert MTB/RIF) or latent TB (e.g., the Mantoux test or interferon-gamma release assay), their application is cumbersome and expensive and/or the time required to process a sample is frequently long [2]. To this end, a typical method for TB detection consists of a posterior-anterior Chest X-Ray (CXR) in order to search the lung region for any abnormalities that could be present.
Due to its relatively low price and easy accessibility, CXR imaging is widely used for health monitoring and diagnosis of TB. In the clinic, medical image interpretation has been mostly performed by human experts such as radiologists and physicians and is considered a long and complicated process. Hence, the advances of digital technology and chest radiography, as well as the rapid development of digital image retrieval and analysis, have renewed the interest in developing Computer-Aided Diagnosis (CAD) systems for the automatic recognition of abnormalities from CXRs in order to assist radiologists in analyzing chest images. Along this line, a variety of methodologies has been proposed, aiming at:
• classifying and/or detecting the presence of an abnormality (image classification);
• segmenting images into normal and abnormal (medical image segmentation).
These have proven to be powerful tools in diagnosing a patient and assisting medical staff [3,4]. Hogeweg et al. [5] combined a texture-based abnormality detection system with a clavicle detection stage in order to suppress false positive responses. Based on their previous work, Hogeweg et al. [6] utilized a combination of pixel classifiers and active shape models for clavicle segmentation. Notice that the clavicle region constitutes a notoriously difficult region for the detection of TB, since the clavicles can obscure manifestations of TB in the apex of the lung. Another similar work was presented by Jaeger et al. [7], who proposed an approach for detecting TB in conventional posteroanterior chest radiographs. Initially, their proposed method extracted the lung region from the CXRs utilizing a graph cut segmentation method, and a set of texture and shape features in the lung region was then computed in order to classify the patient as normal or abnormal. Based on their numerical experiments on two real-world datasets, the authors concluded that the proposed CAD system for TB screening achieved high performance, approaching that of human experts. In [8], Candemir et al. presented a non-rigid registration-driven robust lung segmentation method using image retrieval-based patient-specific adaptive lung models to develop an anatomical atlas that detects lung boundaries. Their proposed method was evaluated utilizing 585 chest radiographs from patients with normal lungs and various pulmonary diseases, indicating the robustness and effectiveness of the proposed approach.
However, despite all these efforts, there is still no widely-utilized method, since the medical domain requires high accuracy; in particular, it is imperative for the rate of false negatives to be very low. This is due to the fact that progress in the field has been hampered by the lack of available labeled images for efficiently training a supervised classifier. Notice that the vigorous development of the Internet, the emergence of vast image collections and the widespread adoption of electronic medical records have led to the development of large repositories of labeled and, mostly, unlabeled images. Nevertheless, the process of correctly labeling new unlabeled CXRs frequently requires the efforts of specialized personnel, which incurs high time and monetary costs.
To address this problem, Semi-Supervised Learning (SSL) algorithms constitute the appropriate machine learning methodology for extracting useful knowledge from both labeled and unlabeled data in order to build efficient classifiers [9]. More analytically, these algorithms combine the explicit classification information of labeled data with the information hidden in the unlabeled data in a most efficient way. The main issue in semi-supervised learning is how to efficiently exploit the information hidden in the unlabeled data. In the literature, several approaches have been proposed, each with a different philosophy related to the link between the distribution of labeled and unlabeled data [9-13]. Self-labeled algorithms are probably the most popular class of SSL algorithms; they address the shortage of labeled data via a self-learning process based on supervised prediction models. The main advantages of these algorithms are their simplicity, as well as their wrapper-based philosophy; therefore, they have been successfully applied in a variety of real-world classification problems (see [11,14-19] and the references therein).
In this work, we examine and evaluate the performance of a new semi-supervised algorithm, called CST-Voting, for the classification of CXRs of tuberculosis, which is based on an ensemble philosophy. The proposed algorithm combines the predictions of three of the most productive and regularly-used self-labeled algorithms, using a voting methodology. Our preliminary numerical experiments present the efficacy of the proposed algorithm and its classification accuracy, therefore illustrating that reliable prediction models can be developed utilizing a few labeled and many unlabeled data.
The remainder of this paper is organized as follows: Section 2 defines the semi-supervised classification problem and presents an overview of the self-labeled methods and the proposed ensemble semi-supervised classification algorithm. Section 3 presents a series of experiments in order to examine and evaluate the accuracy of the proposed algorithm compared with the most popular SSL classification algorithms. Finally, Section 4 sketches our concluding remarks and future work directions.

A Review of Semi-Supervised Self-Labeled Learning
In this section, we present a formal definition of the semi-supervised classification problem and briefly describe the most relevant self-labeled approaches proposed in the literature.
Let (x, y) be an example, where x is an instance in a D-dimensional space whose i-th attribute is x_i, and y is its class label. Suppose that the training set L ∪ U consists of a labeled set L of N_L instances, for which y is known, and an unlabeled set U of N_U instances, for which y is unknown, with N_L ≪ N_U. Furthermore, there exists a test set T of N_T unseen instances, for which y is unknown and which has not been utilized in the training stage. The aim of semi-supervised classification is to obtain an accurate and robust learning hypothesis with the use of the training set.
Self-labeled techniques are considered a significant family of classification methods, which progressively classify unlabeled data based on the most confident predictions. More to the point, these techniques utilize the aforementioned predictions in order to modify the hypothesis learned from labeled samples. Therefore, the methods of this class accept that their own predictions tend to be correct, without making any specific assumptions about the input data.
In the literature, a variety of self-labeled methods has been proposed, each with a different philosophy and methodology for exploiting the information hidden in the unlabeled data. In this work, we focus our attention on self-training, co-training and tri-training, which constitute the most useful and commonly-used self-labeled methods [12,16,20,21]. Notice that the crucial difference between them is the mechanism used to label unlabeled data. Self-training and tri-training are single-view methods, while co-training is considered a multi-view method.

Self-training
Self-training is a wrapper-based semi-supervised approach, comprising an iterative procedure of self-labeling unlabeled data. It is generally considered a simple yet important SSL algorithm. According to Ng and Cardie [22], "Self-training is a single-view weakly supervised algorithm", which is based on its own predictions on unlabeled data with the aim of teaching itself.
It has been established as a very popular algorithm due to its simplicity, and it is often found to be more accurate than other semi-supervised algorithms [16,20,23]. In the self-training framework, an arbitrary classifier is initially trained with a small amount of labeled data, which comprises its training set, aiming to classify unlabeled points. Subsequently, it iteratively enlarges its labeled training set with its own most confident predictions and is retrained. More specifically, at each iteration, the classifier's training set is gradually augmented with classified unlabeled instances; these instances have achieved a probability value over a defined threshold c and are considered sufficiently reliable to be added to the training set. A high-level description of the self-training algorithm is presented in Algorithm 1. Clearly, this model does not make any specific assumptions about the input data, but it accepts that its own predictions tend to be correct. Therefore, since the success of the self-training algorithm is heavily dependent on the newly-labeled data based on its own predictions, its weakness is that erroneous initial predictions will probably lead the classifier to generate incorrectly labeled data [9].
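The self-training loop can be sketched as follows. This is not the paper's implementation (which used WEKA in Java); it is a minimal illustration assuming a scikit-learn-style base learner exposing `predict_proba`, and the names `conf_level` and `max_iter` are ours, corresponding to the confidence threshold c and a stopping criterion.

```python
# Minimal self-training sketch (illustrative; assumes scikit-learn).
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_training(X_lab, y_lab, X_unlab, conf_level=0.9, max_iter=10):
    """Iteratively move the classifier's most confident predictions
    from the unlabeled pool U into the labeled set L, then retrain."""
    X_lab, y_lab = np.asarray(X_lab, dtype=float), np.asarray(y_lab)
    X_unlab = np.asarray(X_unlab, dtype=float)
    clf = LogisticRegression()
    for _ in range(max_iter):
        if len(X_unlab) == 0:
            break                                  # U is empty
        clf.fit(X_lab, y_lab)
        proba = clf.predict_proba(X_unlab)
        conf = proba.max(axis=1)                   # confidence per prediction
        keep = conf > conf_level                   # most confident instances
        if not keep.any():
            break                                  # stopping criterion
        # Add the confidently labeled instances to L and remove them from U
        X_lab = np.vstack([X_lab, X_unlab[keep]])
        y_lab = np.concatenate(
            [y_lab, clf.classes_[proba[keep].argmax(axis=1)]])
        X_unlab = X_unlab[~keep]
    clf.fit(X_lab, y_lab)
    return clf
```

Note that, exactly as discussed above, erroneous early predictions are never revisited once added to L, which is the main weakness of the scheme.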

Co-training
Co-training [11] is a semi-supervised algorithm based on the strong hypothesis that the feature space can be split into two different, conditionally independent views, each of which is able to predict the classes perfectly [24,25]. Under these assumptions, the algorithm predicts labels for the unlabeled instances by dividing the features of the data into two separate subsets, on the premise that this split makes learning more productive.
In this framework, two learning algorithms are separately trained for each view utilizing the initial labeled dataset. In the following, the most confident predictions of each algorithm on the unlabeled data are used in order to augment the training set of the other algorithm through an iterative learning process. In essence, co-training is a "two-view weakly supervised algorithm", since it uses the self-training approach on each view [22].
Clearly, the classification efficacy and the effectiveness of co-training are closely related to the appropriate selection of the two learning algorithms, as well as to the existence of two conditionally independent views. Nevertheless, the requirement of two sufficient and redundant views is a luxury hardly met in most scenarios and real-world tasks; therefore, several extensions of this algorithm have been developed, such as tri-training. Algorithm 2 presents a high-level description of the co-training algorithm.

5: for each classifier C_i (i = 1, 2) do
6: C_i chooses p samples (P) that it most confidently labels as positive and n samples (N) that it most confidently labels as negative from U′.
7: Remove P and N from U′.
8: Add P and N to L.
9: end for
10: Refill U′ with examples from U to keep U′ at a constant size of u examples.
11: until some stopping criterion is met or U is empty.
Remark: V_1 and V_2 are two conditionally independent feature views of the instances.
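One round of the co-training loop can be sketched as follows. This is an illustrative simplification, not the paper's WEKA setup: it assumes binary labels encoded as 0 (negative) and 1 (positive), views given as column-index lists, and Gaussian naive Bayes as the base learner; the pool-refilling step of the full algorithm is omitted.

```python
# One co-training iteration (illustrative sketch; assumes scikit-learn).
import numpy as np
from sklearn.naive_bayes import GaussianNB

def cotrain_round(X_lab, y_lab, X_pool, views, p=1, n=1):
    """Each view's classifier labels its p most confident positive and
    n most confident negative pool samples and adds them to the shared
    labeled set, as in steps 5-9 of the co-training algorithm."""
    new_X, new_y = list(X_lab), list(y_lab)
    taken = set()
    for v in views:                                 # the two views V1, V2
        clf = GaussianNB().fit(X_lab[:, v], y_lab)
        proba = clf.predict_proba(X_pool[:, v])
        pos = np.argsort(proba[:, 1])[::-1][:p]     # most confidently positive
        neg = np.argsort(proba[:, 0])[::-1][:n]     # most confidently negative
        for idx, label in [(i, 1) for i in pos] + [(i, 0) for i in neg]:
            if idx not in taken:                    # avoid double-adding
                taken.add(idx)
                new_X.append(X_pool[idx])
                new_y.append(label)
    keep = [i for i in range(len(X_pool)) if i not in taken]
    return np.array(new_X), np.array(new_y), X_pool[keep]
```

Iterating this round until the pool is exhausted (and refilling the pool from U) reproduces the structure of Algorithm 2.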

Tri-training
The tri-training [18] algorithm extends the co-training methodology without any constraint on which supervised learning algorithm is chosen as the base learner; moreover, it does not assume that a feature split exists. This SSL algorithm utilizes three base learners that iteratively assign labels to unlabeled instances. At each iteration, if two classifiers agree on the labeling of an unlabeled instance while the third one disagrees, then these two classifiers will label this instance for the third classifier.
The tri-training algorithm is based on the strategy "the majority teaches the minority", which serves as an implicit confidence measurement and avoids the use of complicated and time-consuming approaches that explicitly measure predictive confidence; hence, the training process is more efficient. A high-level description of tri-training is presented in Algorithm 3.
Nevertheless, there are times when the performance of tri-training degrades; thus, three issues must be taken into consideration [14]:
(1) Excessively-confined restrictions introduce further classification noise.
(2) The estimation of the classification error is unsuitable.
(3) The differentiation between an initial labeled example and the label of a previously unlabeled example is deficient.

Algorithm 3: Tri-training
Input: L − Set of labeled instances. U − Set of unlabeled instances. C_i − Base learner (i = 1, 2, 3).
Output: Trained classifiers.
1: for i = 1, 2, 3 do
2: S_i = BootstrapSample(L).
3: Train C_i on S_i.
4: end for
5: repeat
6: for i = 1, 2, 3 do
7: S_i = L.
8: for each x ∈ U do
9: if C_j(x) = C_k(x), where j, k ≠ i, then
10: Add (x, C_j(x)) to S_i.
11: end if
12: end for
13: end for
14: for i = 1, 2, 3 do
15: Train C_i on S_i.
16: end for
17: until some stopping criterion is met or U is empty.
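One round of the "majority teaches the minority" rule can be sketched as follows, assuming scikit-learn decision trees as the three base learners. The bootstrap initialization mirrors the algorithm above, while the error-estimation safeguards of the original tri-training are omitted for brevity; `tritrain_step` is an illustrative name.

```python
# One tri-training round (illustrative sketch; assumes scikit-learn).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tritrain_step(X_lab, y_lab, X_unlab, seed=0):
    """Train three bootstrap classifiers; for each classifier, every
    unlabeled instance on which the OTHER two agree is added (with the
    agreed label) to that classifier's training set."""
    rng = np.random.RandomState(seed)
    clfs, preds = [], []
    for i in range(3):
        boot = rng.randint(0, len(X_lab), len(X_lab))   # BootstrapSample(L)
        clf = DecisionTreeClassifier(random_state=i)
        clf.fit(X_lab[boot], y_lab[boot])
        clfs.append(clf)
        preds.append(clf.predict(X_unlab))
    sets = []
    for i in range(3):
        j, k = [m for m in range(3) if m != i]
        agree = preds[j] == preds[k]                    # the other two agree
        Xi = np.vstack([X_lab, X_unlab[agree]])
        yi = np.concatenate([y_lab, preds[j][agree]])
        sets.append((Xi, yi))                           # enlarged set for C_i
    return clfs, sets
```

In the full algorithm, each C_i is then retrained on its enlarged set and the round is repeated until a stopping criterion is met.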

CST-Voting Algorithm
In this section, we present a detailed description of the proposed ensemble-based SSL algorithm, entitled CST-Voting [26], for the classification of chest X-rays for tuberculosis.
The algorithm is based on the idea of generating classifiers by applying different SSL algorithms (with heterogeneous model representations) to a single dataset. On this basis, the learning algorithms that constitute the proposed ensemble are co-training, self-training and tri-training. We recall that these are self-labeled methods, which operate in different ways in order to take full advantage of the information hidden in the unlabeled data.
The main and crucial difference between these three learning algorithms is the mechanism used to label the unlabeled data. More to the point, self-training and tri-training are single-view methods, while co-training is a multi-view method. Furthermore, it is worth mentioning that co-training and tri-training are themselves ensemble methods, since they both make use of multiple classifiers. An overview of CST-Voting is depicted in Figure 1.
Initially, the classical semi-supervised algorithms that constitute the ensemble, i.e., self-training, co-training and tri-training, are trained utilizing the same labeled dataset L and unlabeled dataset U. Subsequently, the final hypothesis on an unlabeled example of the test set combines the individual predictions of the three SSL algorithms utilizing a simple majority voting methodology; the ensemble output is the label predicted by more than half of them. A high-level description of the proposed CST-Voting is presented in Algorithm 4.
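The voting phase reduces to a three-way majority. A minimal sketch (the helper name `majority_vote` is ours; in Algorithm 4 the three predictions come from the trained self-training, co-training and tri-training models):

```python
# Majority voting over the three self-labeled models' predictions.
from collections import Counter

def majority_vote(*predictions):
    """Return the label predicted by the largest share of the ensemble;
    with three binary voters, this is the label chosen by at least two."""
    return Counter(predictions).most_common(1)[0][0]
```

For binary classification with three voters, a tie is impossible, so the majority label is always well defined.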

(Figure 1. Overview of the CST-Voting framework: self-training, co-training and tri-training are each trained on the labeled and unlabeled data, and their predictions are combined by majority voting.)

Experimental Results
We conducted a series of experiments in order to evaluate the performance of the CST-Voting algorithm against the most popular and frequently-used SSL algorithms, i.e., self-training, co-training and tri-training. All SSL algorithms were evaluated on the Shenzhen lung mask dataset described below.

Dataset Description
The dataset utilized in our work was constructed from manually-segmented lung masks for the Shenzhen Hospital X-ray set, as presented in [27]. These segmented lung masks were originally utilized for the description of the lung segmentation technique in combination with lossless and lossy data augmentation.
The segmentation masks for the Shenzhen Hospital X-ray set were manually prepared by students and teachers of the Computer Engineering Department, Faculty of Informatics and Computer Engineering, National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute" [27].The set contained 279 normal CXRs and 287 abnormal ones with tuberculosis.
The original Shenzhen Hospital X-ray set contained images from Shenzhen Hospital, one of the largest hospitals in China for infectious diseases, with a focus on both their prevention and treatment [7,8]. The X-rays were collected within a one-month period, mostly in September 2012, as part of the daily routine at Shenzhen Hospital, using a Philips DR Digital Diagnost system.
The implementation code was written in Java, using the WEKA Machine Learning Toolkit [35], and the classification accuracy was evaluated using stratified 10-fold cross-validation. In this validation, the data were separated into folds so that each fold had the same class distribution as the entire dataset. Similar to Blum and Mitchell [11], a limit to the number of iterations of all SSL algorithms was established. This implementation strategy has also been adopted by many researchers, as stated in [12,15-17,21,36]. In order to study the influence of the amount of labeled data, three different ratios (R) of the training data were used, i.e., 10%, 20% and 30%.
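The stratified splitting step can be sketched with scikit-learn (the paper's experiments used WEKA; this is only an illustration of how each of the 10 folds preserves the 279/287 class ratio of the dataset; `random_state=42` is an arbitrary choice):

```python
# Stratified 10-fold split preserving the normal/abnormal class ratio.
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 279 + [1] * 287)   # 279 normal, 287 abnormal CXRs
X = np.zeros((len(y), 1))             # placeholder features for illustration

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
fold_ratios = []
for _, test_idx in skf.split(X, y):
    # Fraction of abnormal cases in each held-out fold
    fold_ratios.append(y[test_idx].mean())
```

Every fold's abnormal fraction stays within about one sample of the overall ratio 287/566 ≈ 0.507, which is exactly the property stratification guarantees.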
The configuration parameters for all SSL algorithms utilized in our experiments are presented in Table 1. Furthermore, in order to minimize the effect of any expert bias, instead of attempting to tune any of the algorithms to the specific dataset, all base learners were used with their default parameter settings included in the Weka library [37].
Table 1. Parameter specification for all the SSL methods employed in our experiments.

Tri-training: no parameters specified.
To evaluate the performance of the SSL classification algorithms, the following three performance metrics are considered, namely Sensitivity (Sen), Specificity (Spe) and Accuracy (Acc):

Sen = TP / (TP + FN),
Spe = TN / (TN + FP),
Acc = (TP + TN) / (TP + TN + FP + FN),

where TP stands for the number of normal patients who are identified as normal, TN for the number of abnormal patients who are identified as abnormal, FP (type I error) for the number of normal patients who are identified as abnormal and FN (type II error) for the number of abnormal patients who are identified as normal.
The sensitivity of classification is the proportion of actual positives that are predicted as positive; specificity represents the proportion of actual negatives that are predicted as negative, while accuracy is the ratio of correct predictions of a classification model. Additionally, since it is crucial for a prediction model to accurately identify abnormal patients, the following performance metric was considered:

F_1.5 = (1 + 1.5²) · (Precision · Sen) / (1.5² · Precision + Sen), where Precision = TP / (TP + FP),

which constitutes a weighted harmonic mean of precision and sensitivity. In particular, this metric takes into account the accuracy for both normal and abnormal patients and poses additional weight on abnormal patients instead of normal ones [38]. Obviously, from a medical perspective, it is better to misidentify a "normal" patient than an "abnormal" one.
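The four metrics can be computed directly from the confusion-matrix counts. A minimal sketch (the helper name `metrics` is ours):

```python
# Sen, Spe, Acc and the weighted F-measure from confusion-matrix counts.
def metrics(tp, tn, fp, fn, beta=1.5):
    """beta > 1 weights sensitivity (recall) more heavily than precision."""
    sen = tp / (tp + fn)                        # sensitivity
    spe = tn / (tn + fp)                        # specificity
    acc = (tp + tn) / (tp + tn + fp + fn)       # accuracy
    prec = tp / (tp + fp)                       # precision
    f_beta = (1 + beta**2) * prec * sen / (beta**2 * prec + sen)
    return sen, spe, acc, f_beta
```

For example, `metrics(8, 7, 3, 2)` yields Sen = 0.8, Spe = 0.7 and Acc = 0.75.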
Tables 2-4 present the accuracy of each SSL algorithm based on the performance metrics Sen, Spe and F_1.5, respectively; the highest classification accuracy is underlined. Firstly, it is worth mentioning that CST-Voting performed better in five out of six cases for a 30% labeled ratio for each performance metric and improved its classification accuracy as the labeled ratio increased.
Moreover, relative to the performance metrics Sen and Spe, the proposed algorithm exhibited the best or the second-best accuracy, independent of the classifier utilized as the base learner and the value of the labeled ratio. Regarding the F_1.5 metric, CST-Voting exhibited the highest accuracy, reporting the top performance in 4, 2 and 5 cases for the 10%, 20% and 30% labeled ratios, respectively, while self-training achieved the worst performance. Finally, a more representative visualization of the accuracy of the compared SSL algorithms is presented in Figures 2-4; each box-plot presents the accuracy measure for each tested SSL algorithm according to the supervised base learner and labeled ratio.

Statistical and Post-Hoc Analysis
In machine learning, the statistical comparison of multiple algorithms over multiple datasets is fundamental, and it is usually carried out by means of a statistical test [16]. Since we are interested in evaluating the rejection of the hypothesis that all the algorithms perform equally well, at a given significance level, based on their classification accuracy, as well as in highlighting the existence of significant differences between our proposed algorithm and the classical SSL algorithms, we utilized the non-parametric Friedman Aligned Ranking (FAR) [39] test.
Let r_i^j be the rank of the j-th of k learning algorithms on the i-th of N problems. Under the null hypothesis H_0, which states that all the algorithms are equivalent, the Friedman aligned ranks test statistic is defined by:

FAR = [ (k − 1) ( Σ_{j=1}^{k} R_j² − (kN²/4)(kN + 1)² ) ] / [ kN(kN + 1)(2kN + 1)/6 − (1/k) Σ_{i=1}^{N} R_i² ],

where R_i is the rank total of the i-th dataset and R_j is the rank total of the j-th algorithm. The test statistic FAR is compared with the χ² distribution with (k − 1) degrees of freedom. Notice that, since the test is non-parametric, it does not require the commensurability of the measures across different datasets. In addition, this test does not assume the normality of the sample means, and thus, it is robust to outliers.
In statistical hypothesis testing, the p-value is the probability of obtaining a result at least as extreme as the one actually observed, assuming that the null hypothesis is true. In other words, the p-value indicates "how significant" the result is, without committing to a particular level of significance. When a p-value is considered in a multiple comparison, it reflects the probability error of a certain comparison; however, it does not take into account the remaining comparisons belonging to the family. One way to address this problem is to report adjusted p-values, which take into account that multiple tests are conducted and can be compared directly with any significance level [40].
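Since the aligned-ranks computation is easy to get wrong, the statistic can be sketched in Python. This is an illustrative re-implementation, not the paper's Java/WEKA code; it assumes SciPy for average-tie ranking and the χ² tail probability, and the function name is ours. Observations are aligned by subtracting each problem's mean, all k·N aligned values are ranked together, and the statistic compares the resulting rank totals.

```python
# Friedman Aligned Ranks test statistic (illustrative sketch; assumes scipy).
import numpy as np
from scipy.stats import rankdata, chi2

def friedman_aligned_ranks(scores):
    """scores: (N problems) x (k algorithms) matrix of accuracies."""
    n, k = scores.shape
    aligned = scores - scores.mean(axis=1, keepdims=True)  # align per problem
    ranks = rankdata(aligned.ravel()).reshape(n, k)        # rank all k*n values
    Rj = ranks.sum(axis=0)                                 # algorithm rank totals
    Ri = ranks.sum(axis=1)                                 # problem rank totals
    num = (k - 1) * (np.sum(Rj**2) - (k * n**2 / 4) * (k * n + 1) ** 2)
    den = k * n * (k * n + 1) * (2 * k * n + 1) / 6 - np.sum(Ri**2) / k
    stat = num / den
    p = chi2.sf(stat, k - 1)                               # chi^2 with k-1 d.o.f.
    return stat, p
```

When the algorithms are genuinely equivalent (identical scores on every problem), the statistic is 0 and the p-value is 1; a consistently dominant algorithm drives the statistic up and the p-value down.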
To this end, the Finner post-hoc test [41] with a significance level α = 0.05 was applied to detect the specific differences among the algorithms. More to the point, the Finner test is easy to comprehend, and it usually offers better results than other post-hoc tests, such as the Holm [42] or Hochberg [43] tests, especially when the number of compared algorithms is low [40].
The Finner procedure adjusts the value of α in a step-down manner. Let p_1, p_2, ..., p_{k−1} be the ordered p-values with p_1 ≤ p_2 ≤ ... ≤ p_{k−1} and H_1, H_2, ..., H_{k−1} the corresponding hypotheses. The Finner procedure rejects H_1 to H_{i−1} if i is the smallest integer such that p_i > 1 − (1 − α)^{(k−1)/i}, while the adjusted Finner p-value is defined by:

p_F = min{ 1, max{ 1 − (1 − p_j)^{(k−1)/j} : 1 ≤ j ≤ i } },

where p_j is the p-value obtained for the j-th hypothesis. It is worth mentioning that the test rejects the hypothesis of equality when p_F is less than α.
Tables 6-8 present the results of the statistical analysis performed by nonparametric multiple comparison procedures over 10%, 20% and 30% of labeled data, respectively. The best (i.e., lowest) ranking obtained in each FAR test determined the control algorithm for the post-hoc test. Moreover, the adjusted p-value with Finner's test (p_F) is presented based on the control algorithm, at the α = 0.05 level of significance. Clearly, CST-Voting achieved the best performance due to its better probability-based ranking and higher classification accuracy.
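The step-down adjustment can be sketched as follows; `finner_adjust` is an illustrative name, and the function assumes the k − 1 comparison p-values against the control algorithm are given in any order.

```python
# Finner step-down adjusted p-values for k-1 comparisons against a control.
def finner_adjust(pvals, k):
    """Return adjusted p-values, each directly comparable with alpha;
    the running maximum enforces the step-down monotonicity."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    adjusted = [0.0] * len(pvals)
    running_max = 0.0
    for rank, idx in enumerate(order, start=1):        # rank = j in the formula
        adj = 1 - (1 - pvals[idx]) ** ((k - 1) / rank)
        running_max = max(running_max, adj)
        adjusted[idx] = min(1.0, running_max)
    return adjusted
```

For example, with k = 4 algorithms and raw p-values 0.01, 0.04 and 0.2, the adjusted values are 1 − 0.99³, 1 − 0.96^1.5 and 0.2, and they are monotonically non-decreasing, as the procedure requires.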

Algorithm 1: Self-training
Input: L − Set of labeled instances. U − Set of unlabeled instances.
Parameters: ConLev − Confidence level. C − Base learner.
1: repeat
2: Train C on L.
3: Apply C on U.
4: Select the instances with a predicted probability higher than ConLev (x_MCP).
5: Remove x_MCP from U and add them to L.
6: until some stopping criterion is met or U is empty.

Algorithm 2: Co-training
Input: L − Set of labeled instances. U − Set of unlabeled instances. C_i − Base learner (i = 1, 2).
Output: Trained classifier.
1: Create a pool U′ of u examples by randomly choosing from U.
2: repeat
3: Train C_1 on view V_1 of L.
4: Train C_2 on view V_2 of L.

Algorithm 4: CST-Voting
Input: L − Set of labeled instances. U − Set of unlabeled instances. C − Base learner.
Output: The labels of the instances in the testing set.
/* Training phase */
1: Self-training(L, U)
2: Co-training(L, U)
3: Tri-training(L, U)
/* Voting phase */
4: for each x ∈ T do
5: Apply self-training, co-training and tri-training on x.
6: Use majority vote to predict the label y* of x.
7: end for

Table 2. Accuracy of the SSL algorithms based on the Sen performance metric for each labeled ratio.

Table 4. Accuracy of the SSL algorithms based on the F_1.5 performance metric for each labeled ratio.

Table 5. Performance evaluation of the SSL algorithms relative to the performance metric Acc for each labeled ratio.