Lung Cancer Risk Prediction with Machine Learning Models

The lungs are the center of breath control and ensure that every cell in the body receives oxygen. At the same time, they filter the air to prevent useless substances and germs from entering the body. The human body has specially designed defence mechanisms that protect the lungs. However, they are not enough to completely eliminate the risk of the various diseases that affect the lungs. Infections, inflammation or even more serious complications, such as the growth of a cancerous tumor, can affect the lungs. In this work, we used machine learning (ML) methods to build efficient models for identifying individuals at high risk of developing lung cancer, thus enabling earlier interventions to avoid long-term complications. The model proposed in this article is Rotation Forest, which achieves high performance and is evaluated with well-known metrics, namely precision, recall, F-Measure, accuracy and area under the curve (AUC). More specifically, the experimental evaluation showed that the proposed model prevailed with an AUC of 99.3% and an F-Measure, precision, recall and accuracy of 97.1%.


Introduction
The lungs are the main organs of respiration. The human body has two lungs, one on each side of the chest. The left lung is smaller than the right, leaving room for the heart. During breathing, the chest rises and falls, because the lungs expand during inhalation and contract during exhalation. The lungs are responsible for enriching the blood with oxygen. The heart sends blood that is low in oxygen and rich in carbon dioxide to the lungs. Inside the lungs, the blood is "cleansed": it absorbs oxygen and releases carbon dioxide. Carbon dioxide is eliminated during exhalation, while oxygen enters the lungs during inhalation [1,2].
Moreover, in order to reach the lungs, the air passes successively through the nasal cavity (or the oral cavity when we breathe through the mouth), the pharynx, the larynx, the trachea and the bronchi. Inside the lungs, the bronchi branch into smaller and smaller bronchi and end in the alveoli. The alveoli are surrounded by numerous capillaries, which release carbon dioxide into the alveoli and take in oxygen. Humans never stop breathing until they die because the lungs supply the blood with oxygen, which is vital for human life [2].
In developed countries, lung diseases are one of the main causes of death. Factors such as smoking, environmental toxins and chronic inflammation cause harmful effects that often lead to permanent damage. The lungs are able to clear themselves through a series of processes and mechanisms, such as phlegm. However, for someone who smokes, this is not enough. Environmental, genetic and hereditary factors, or a combination thereof, can affect the lungs and promote the development of various diseases. Diseases of the respiratory system fall into several categories [1,3].
Specifically, chronic obstructive pulmonary disease (COPD) includes chronic bronchitis and emphysema. Often, the two diseases coexist, creating the complex condition known as COPD. Smoking is the leading cause of obstructive pulmonary disease [4]. Chronic bronchitis is characterized by inflammation and damage to the lining of the bronchi, which connect the trachea to the lungs. Its main symptoms are chronic cough, increased mucus production and shortness of breath. The main symptoms of emphysema include coughing, shortness of breath, and limited tolerance for exercise and for the effort required by various activities [5].
Moreover, asthma is a chronic condition that affects the bronchi and bronchioles. The most common signs of asthma are shortness of breath and wheezing due to the narrowing of the airways [6]. Cystic fibrosis is an inherited condition that affects patients' mucus and sweat. As a result, mucus accumulates in the lungs and causes frequent lung infections. Gradually, permanent damage is caused to the lungs, and severe respiratory failure is established. Tuberculosis is an infection caused by a type of bacterium that mainly affects the lungs. This bacterium causes inflammation in the lung tissue and then destroys it [7]. Finally, pneumonia covers a wide range of infectious diseases caused by infection of the lungs by various germs, bacteria, viruses, parasites and fungi [8].
Lung cancer is the primary cause of death from malignancies in both genders. It is worth noting that deaths from lung cancer exceed deaths from cancers of the colon, cervix and breast combined. The most common symptom of lung cancer is coughing, which needs special attention, as most lung cancer patients already have a cough because they are smokers and suffer from chronic obstructive pulmonary disease, which in itself causes coughing. A change in the character of the cough is more informative (it becomes more persistent, more intense, and may be accompanied by expectoration or bloody sputum). In addition, the symptoms caused by lung cancer include expectoration, chest pain, shortness of breath, anorexia, weight loss, fever and hemoptysis [9][10][11].
Thoracic computed tomography (CT) and chest X-ray are typical methods for lung cancer diagnosis. Occasionally, magnetic resonance imaging (MRI) or positron emission tomography (PET) imaging is used when staging the extent of the spread of cancer, as it helps to determine the best therapeutic management. Bronchoscopy and biopsy (aspiration needle biopsy, surgical biopsy) are required to establish the actual diagnosis of lung cancer and to provide information on the histological type [12,13].
In many countries, the number of former smokers is high, and many lung cancer cases concern former smokers as well. In the United States alone, there are more than 50 million former smokers (i.e., people who have already stopped smoking) [14], so approaches such as lung cancer screening are evidence-based measures to detect and cure lung cancer before the development of lethal metastatic spread in current and former smokers. Supporting smoking cessation is important for current smokers, but lung cancer is a lifelong risk for every smoker. The patient's risk of dying of lung cancer is largely determined by the stage of the cancer at diagnosis. Early-stage detection of lung cancer is associated with a high frequency of cure, whereas lung cancer detected at an advanced stage is often associated with a median survival of less than two years [15][16][17].
Nowadays, artificial intelligence (AI) and machine learning (ML) techniques play a critical role in healthcare. Due to the wide applicability of AI/ML in risk prediction for numerous health conditions, a variety of regulations should be established, as in [18,19], to evaluate and support the practical development of AI/ML-based software tools for the early prediction and diagnosis of a disease. The most common diseases that these tools concern are diabetes (as a classification task [20] or a time-series task for the prediction of continuous glucose values [21]), hypertension [22], COVID-19 [23], hypercholesterolemia [24], COPD [25], stroke [26], cardiovascular diseases (CVDs) [27], acute liver failure (ALF) [28], sleep disorders [29], hepatitis C [30], metabolic syndrome [31], chronic kidney disease (CKD) [32], etc.
This study focuses on lung cancer, a disease for which many scientific studies have been carried out from the perspective of ML. Here, a methodology for designing effective ML classification models is presented to predict lung cancer occurrence, using the most common habits and symptoms/signs as input features to the models. Our contribution is a comparative assessment of numerous classifiers to develop the model with the highest sensitivity and discrimination ability in identifying those at high risk. For the evaluation of the models, we considered the performance metrics precision, recall, F-Measure, accuracy and AUC. Moreover, ROC curves are also plotted and presented. Finally, the performance analysis revealed that, from various aspects, Rotation Forest is the most efficient model, and it therefore constitutes the main proposition of this research article.
The remaining sections of the paper are organized as follows. In Section 2, related works on the subject under investigation are reviewed. A focused presentation of the dataset and an analysis of the methodology followed are given in Section 3. Furthermore, in Section 4, we discuss the acquired experimental results. Finally, conclusions and future directions are noted in Section 5.

Related Work
Here, we provide a brief overview of the most recent relevant works for the prediction of lung cancer occurrence using ML techniques and models.
Firstly, in [33], the authors demonstrated an efficient approach for the detection and classification of lung cancer by exploiting CT scan images. They employed seven classification models, namely a decision tree, random forest, support vector machine, naive Bayes, k-nearest neighbors, stochastic gradient descent and multi-layer perceptron. For the training and testing of these classifiers, a dataset of 15,750 clinical images, containing 6910 benign and 8840 malignant lung cancer-related images, was considered. In the acquired outcomes, the multi-layer perceptron classifier achieved the highest accuracy among the classifiers, with a value of 88.55%.
Similarly, in [34], the authors applied a neural network, radial basis function network, support vector machine, logistic regression, random forest, J48, naive Bayes and k-nearest neighbors in order to predict lung cancer. They showed that the radial basis function network achieved the highest accuracy, 81.25%, on lung cancer data. Additionally, the key objective of [35] is the early diagnosis of lung cancer by examining the performance of classification algorithms. The authors applied naive Bayes, support vector machine, decision tree and logistic regression. On the lung cancer dataset from the UCI repository, logistic regression achieved the highest accuracy of 96.9%, while on the lung cancer dataset from data.world, the support vector machine achieved the highest accuracy of 99.2%.
The goal of the research work [36] was to improve the prediction accuracy and Root Mean Square Error (RMSE) of lung cancer patient survival time in months (survival ≤ 6, 7-24, or > 24 months) by combining the random forest classification model with three regression ones (general linear regression and gradient-boosted machines). Random forest prevailed for survival times ≤ 6 months (RMSE 10.52) and > 24 months (RMSE 20.51), while the gradient boosting machine was the winning model for 7-24 months (RMSE 15.65).
Moreover, in [37], several well-known classifiers, such as support vector machine, C4.5 decision tree, multi-layer perceptron, neural network and naive Bayes, were applied to a reference dataset obtained from the UCI repository for the early-stage prediction of lung cancer. Additionally, ensemble models, such as random forest and majority voting, were used in the context of performance comparison. According to these outcomes, the gradient-boosted tree outperformed the others and achieved an accuracy of 90%.
The authors in [38] aimed to build a data mining classification model in order to predict whether or not a patient has lung cancer based on the dataset in [39]. Using the Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology and the RapidMiner software, different models and sampling methods were built. The artificial neural network algorithm prevailed over the other models and achieved an accuracy of 92%, a recall of 94.2% and a precision of 90.8%.
Finally, in [40], the authors designed a mechanism to identify the appropriate biomarkers for early diagnosis of lung cancer by combining established metabolomics mechanisms and machine learning methods. Their study was based on a dataset consisting of 110 lung cancer patients and 43 healthy participants. To discriminate first-stage lung cancer patients from healthy individuals, six specific biomarkers were selected after ROC analysis, with an AUC, sensitivity and specificity equal to 0.989, 0.981 and 1, respectively. The fast correlation-based filter (FCBF) was used to find the top five metabolic biomarkers by relative importance. Among the evaluated models, naive Bayes is the one suggested for the early prediction of lung tumors.

Materials and Methods
In this section, we describe the dataset on which we relied and the main steps of the adopted methodology for lung cancer risk prediction, namely, class balancing and feature ranking in the balanced data. We also report the frequency of occurrence of the nominal features in relation to the lung cancer classes. Moreover, the ML models and performance metrics are described.

Dataset Description
The present work relied on a public dataset [39]. The number of participants is 309, and all the attributes (15 as input to the ML models and 1 for the target class) are described as follows:
• Gender [41]: This feature shows if the person's sex is male or female.

Data Preprocessing
Note that no cleaning was required for the dataset we relied on, as it contains no missing values or outliers. To tackle the highly skewed class distribution of the participants between the Lung Cancer (87.4%) and Non-Lung Cancer classes, we employed SMOTE [56]. SMOTE is a widely used method that applies a 5-NN classifier to generate synthetic data [57] for the minority class, i.e., Non-Lung Cancer, which is oversampled such that the instances in the two classes are equally distributed (i.e., 50%-50%).
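The oversampling step can be illustrated with a minimal sketch, assuming purely numeric features and Euclidean distance. The function name `smote_oversample` and the toy data are hypothetical and are not part of the Weka filter used in this work; the class counts (270 vs. 39) are derived from the reported 309 participants and the 87.4% share of the Lung Cancer class. Each synthetic point is placed at a random position on the segment between a minority instance and one of its k = 5 nearest minority neighbours.

```python
import math
import random


def smote_oversample(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbours than exist
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Rank the other minority points by distance to the base point.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(p, base))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class of 39 points; 231 synthetic points bring it to 270,
# matching the size of the majority class (assumed counts, see lead-in).
minority = [(float(i % 7), float(i % 5)) for i in range(39)]
new_points = smote_oversample(minority, n_synthetic=270 - 39)
print(len(minority) + len(new_points))
```

Because every synthetic point lies between two existing minority instances, the minority region is filled in rather than extended, which is the main difference from simply duplicating instances.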

Features Analysis
In the context of feature analysis, we first measured the importance score of each feature with respect to the target class. For this purpose, two feature ranking methods were considered, i.e., gain ratio and random forest. We applied the gain ratio (GR) method [58], which assigns a score based on

$$\mathrm{GR}(f_i) = \frac{H(c) - H(c \mid f_i)}{H(f_i)},$$

where H(c) is the entropy of the variable that captures the class values, and H(c | f_i) and H(f_i) are the conditional entropy of the class given the feature and the entropy of the feature f_i (i = 1, 2, ..., 15), respectively. Random forest computes the Gini impurity to measure the ability of a feature to optimally discriminate the instances in the two classes [59].
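The gain ratio score described above (information gain about the class, normalised by the entropy of the feature) can be sketched for nominal features as follows. This is an illustrative implementation under the stated definitions, not the Weka evaluator used in the experiments; the function names and the toy vectors are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of nominal values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def conditional_entropy(labels, feature):
    """H(c | f): entropy of the class within each feature value, weighted."""
    n = len(labels)
    total = 0.0
    for value, count in Counter(feature).items():
        subset = [y for x, y in zip(feature, labels) if x == value]
        total += (count / n) * entropy(subset)
    return total


def gain_ratio(labels, feature):
    """GR(f) = (H(c) - H(c | f)) / H(f); defined as 0 for a constant feature."""
    h_f = entropy(feature)
    if h_f == 0.0:
        return 0.0
    return (entropy(labels) - conditional_entropy(labels, feature)) / h_f


labels = ["yes", "yes", "no", "no"]
print(gain_ratio(labels, [1, 1, 0, 0]))  # feature that separates the classes
print(gain_ratio(labels, [1, 0, 1, 0]))  # feature independent of the class
```

A feature that determines the class receives the maximum score, while a feature that is statistically independent of the class receives a score of zero, which matches the interpretation of near-zero scores as low importance.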
The ranking scores in descending order are presented in Table 1. We see that both methods ranked six out of fifteen features in the same order of importance according to the derived scores, while some of the rest appear in adjacent or reversed positions. Features of low or no importance receive scores that are close to zero and/or negative. However, all features are important signs of lung cancer occurrence and of its management by physicians; thus, the models will be trained and validated considering all of them.
Figure 1 shows the participants' distribution per age group. We observe that lung cancer mostly concerns people between 50 and 79 years old, with the 60-64 age group having the highest frequency. In Table 2, we show the features' manifestation in each class. As for gender, men and women are approximately equally likely to be diagnosed with lung cancer. Moreover, from this table, we conclude that each of the examined features is present in 26% to 35% of patients with lung cancer, while an important percentage of participants reported these signs without having been diagnosed with lung cancer. Even if the disease has not occurred, risk-factor and sign monitoring together with follow-up clinical examination may prevent or limit the unpleasant effects of the disease.

Evaluation Metrics
In order to assess the machine learning models' performance, the accuracy, precision, recall, F-Measure and AUC metrics were considered [75]. These metrics are computed from the confusion matrix, which consists of the elements true positive (TP), true negative (TN), false positive (FP) and false negative (FN):

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}, \quad \text{F-Measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

Accuracy summarizes the performance of the classification task and measures the proportion of correctly predicted instances among all data instances. Precision is the proportion of participants predicted as positive who actually had lung cancer. We also examined recall, which captures the true positive rate, or sensitivity, of a model, i.e., the proportion of participants who actually had lung cancer and were correctly considered positive, relative to all positive participants. Precision is a measure of quality, while recall is a measure of quantity. The F-Measure is the harmonic mean of precision and recall and allows a model to be evaluated using a single score. Finally, the AUC ranges between zero and one and is used to determine the ML model with the best performance in discriminating Lung Cancer from Non-Lung Cancer instances. AUC is a measure of separability: if the AUC reaches one, the model has a perfect ability to distinguish the two class distributions.
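As a worked illustration of these metrics, the sketch below computes accuracy, precision, recall and F-Measure from the four confusion-matrix counts, and estimates the AUC as the fraction of (positive, negative) pairs in which the positive instance receives the higher score. The function names and the numbers are hypothetical examples, not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F-Measure from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


def auc(pos_scores, neg_scores):
    """AUC as the probability that a positive outranks a negative (ties = 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical confusion matrix: 40 TP, 40 TN, 10 FP, 10 FN.
print(classification_metrics(tp=40, tn=40, fp=10, fn=10))
# Hypothetical classifier scores for three positive and three negative cases.
print(auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))
```

The pairwise definition of the AUC makes its interpretation as a measure of separability explicit: a value of one means that every positive instance is scored above every negative instance.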

Experiments Setup
The performance of the ML models was evaluated in the Weka [76] environment, which offers a variety of libraries for data preprocessing, classification, clustering, prediction and visualization. The experiments were performed on a computer system with the following specifications: 11th generation Intel(R) Core(TM) i7-1165G7 @ 2.80 GHz, 16 GB RAM, Windows 11 Home, 64-bit OS and x64 processor. We applied SMOTE and 10-fold cross-validation to measure the effectiveness of the models on the balanced dataset of 540 instances. Finally, in Table 3, we list the optimal parameter settings of the proposed ML models.
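The 10-fold protocol on the balanced data can be sketched as below. This is a minimal illustration that assumes stratified folds (equal class proportions in each fold) and uses a hypothetical helper `stratified_k_fold`; it is not the Weka implementation. With 540 balanced instances, each fold holds 54 test instances, 27 from each class, and every instance is used for testing exactly once.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs with class-balanced folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin into the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            j for m, fold in enumerate(folds) if m != i for j in fold
        )
        yield train_idx, test_idx


# Balanced data as described in the text: 540 instances, 50%-50%.
labels = ["Lung Cancer"] * 270 + ["Non-Lung Cancer"] * 270
splits = list(stratified_k_fold(labels, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

One design point worth keeping in mind is that, when oversampling precedes the split, synthetic instances interpolated from a test-fold instance can appear in the training folds; a stricter variant would apply the oversampling to the training indices of each split only.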

Evaluation
In the context of this research work, numerous machine learning models, namely NB, BayesNet, SGD, SVM, LR, ANN, KNN, J48, LMT, RF, RT, RepTree, RotF and AdaBoostM1, are evaluated in terms of accuracy, precision, recall, F-Measure and AUC in order to determine the model with the best predictive performance.
Specifically, in Table 4, we provide the models' performance evaluation after SMOTE with 10-fold cross-validation. All our proposed models achieve scores of at least 93.3% (the lowest, obtained by RT). The best performance is achieved by the RotF model, which has RF as its base classifier. It presents accuracy, precision, recall and F-Measure equal to 97.1% and an AUC of 99.3%. In addition, it should be noted that high AUC values are also achieved by RF, with 99.1%, and by AdaBoostM1 (which also has RF as its base classifier), with 98.5%. Finally, in Figure 2, we plot the ROC curves of the proposed machine learning models, which confirm the superior performance of RotF.
Moreover, in Table 5, the models are compared in terms of accuracy, recall and precision. The authors of the research work [38] used the dataset [39] with the same number of features as us, and the results of their models were obtained after 10-fold cross-validation. More specifically, in this comparison, the best performance of our proposed models in terms of accuracy, recall and precision is achieved by the SVM, with 95.4% for each metric, whereas in [38] the best performance in the same metrics is achieved by the ANN, with 92%, 94.2% and 90.8%, respectively. Our proposed models thus outperform those of [38] in all three metrics.

Discussion
The proposed methodology in the current study is based on a dataset consisting of features that capture human habits (such as smoking and alcohol consumption) and signs/symptoms as risk factors that lung cancer patients usually present. However, these signs are not necessarily related to lung cancer, as we observed from the feature analysis in Section 3.3 of Materials and Methods. Unlike other cancers, lung cancer cannot be seen with the naked eye, and its symptoms often coincide with the symptoms of other diseases. These symptoms include allergies, asthma, shortness of breath and coughing [33]. In this work, we chose to train several classifiers on various risk factors related to such symptoms in order to correctly identify the class label (Lung Cancer or Non-Lung Cancer) of an unknown instance, and thus the associated risk. Even if the disease has not manifested, risk-factor monitoring and follow-up clinical examination are appropriate practices for lung cancer management that may prevent or limit the unpleasant effects of the disease through early diagnosis. The clinical examination and identification of lung cancer are usually made when an X-ray, CT, PET-CT or MRI scan of the patient's chest is performed [77]. Hence, the considered dataset in combination with features derived from lung images would be quite beneficial for the early diagnosis of the disease and its stage. Let us recall that this study aims to identify whether or not lung cancer occurs. Therefore, a binary classification problem was studied. From an ML perspective, cancer stage identification could be solved following a multi-class classification strategy, with methods such as one vs. one (OVO) and one vs. all (OVA) [25]. However, the dataset under consideration does not allow us to tackle the problem in such a manner.
Undoubtedly, machine learning has become an important tool for medical carers and clinicians for the early screening, prediction and/or prognosis of several diseases. Significant efforts have been made by researchers to gain access to individuals' health records, collect data through questionnaires or generate their own datasets in the laboratory in order to support healthcare analytics by training and testing appropriate models, which give insights into the future development and prevention of disease. To exemplify, in our recent study [32], several classifiers were trained for the prediction of CKD, while in this study, lung cancer is the target health condition. These two cases show the flexibility and diversity of the applicability of machine learning in healthcare. Irrespective of the data and related disease, after class balancing with SMOTE, all models demonstrated high performance in all metrics. Moreover, promising outcomes were achieved by stacking and voting ensemble models, as shown in [32], which were not investigated here. Among tree models, the prevalence of the rotation forest classifier is verified both in the case of lung cancer and in that of CKD.
To conclude the results and discussion section, we have to point out a limitation of our article. This research paper was based on a publicly available dataset [39], which did not come from a hospital unit or institute that could have given us richer data with various characteristics. Additionally, gaining access to sensitive medical data is difficult due to privacy reasons. However, the dataset we relied on had beneficial features that led us to derive reliable and accurate research results.

Conclusions
The lungs are the main organs of respiration. Humans never stop breathing until they die because the lungs supply their blood with oxygen, which is vital for human life. Lung cancer is the leading cause of death from malignancies in both genders. The patient's lifespan is largely determined by the stage of the cancer at diagnosis: the earlier the diagnosis, the longer the life expectancy.
In this research work, we exploit supervised learning to develop models for identifying individuals with lung cancer manifestation based on several features and symptoms. Various machine learning models, including NB, BayesNet, SGD, SVM, LR, ANN, KNN, J48, LMT, RF, RT, RepTree, RotF and AdaBoostM1, were evaluated in terms of accuracy, precision, recall, F-Measure and AUC. According to the experimental results, after applying SMOTE with 10-fold cross-validation, RotF outperformed the other models with an accuracy, precision, recall and F-Measure equal to 97.1% and an AUC of 99.3%. Additionally, our proposed models achieved better results than the models of reference [38], as shown in Table 5.
In future work, we aim to extend the current study along two axes. First, the machine learning framework will be enriched by exploiting deep learning methods, especially long short-term memory (LSTM) and convolutional neural networks (CNN), and by comparing the results in terms of accuracy with research works in the same scope. Second, in addition to the existing 10-fold cross-validation, the classification models will be evaluated on the same dataset using a bootstrapping process [78], an alternative data-splitting method for model validation that applies resampling with replacement to the original data.
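The bootstrapping idea mentioned above can be sketched as follows, under the common assumption that instances never drawn into a bootstrap sample (the "out-of-bag" instances) serve as the test set. The helper `bootstrap_split` is hypothetical and only illustrates resampling with replacement; it is not a description of the procedure in [78].

```python
import random


def bootstrap_split(n, seed=0):
    """Draw n indices with replacement; unseen indices form the test set."""
    rng = random.Random(seed)
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in chosen]
    return in_bag, out_of_bag


# One bootstrap round over the 309 original participants.
train_idx, test_idx = bootstrap_split(309, seed=42)
print(len(train_idx), len(test_idx))
```

For large n, the probability that a given instance is never drawn approaches 1/e, so roughly 36.8% of the instances are expected to be out-of-bag in each round; repeating the split with different seeds and averaging the scores gives the bootstrap estimate.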

Figure 1. Participants' distribution across age groups in the balanced data.

Figure 2. Models' evaluation based on ROC curves and AUC.
• Age (years) [42]: This feature captures the person's age.
• Smoking [43]: This feature indicates if the participant is a smoker or not.
• Yellow fingers [44]: This feature refers to whether the participant has yellow fingers or not.
• Anxiety [45]: This feature shows if the participant is anxious or not.
• Peer pressure [46]: This feature captures if the participant feels peer pressure or not.
• Chronic disease [47]: This feature expresses if the participant suffers from a chronic disease or not.
• Fatigue [48]: This feature manifests if the participant suffers from fatigue or not.
• Allergy [49]: This feature refers to whether the participant has an allergy or not.
• Wheezing [50]: This feature declares if the participant suffers from wheezing or not.
• Alcohol [51]: This feature shows if the participant consumes alcohol or not.
• Coughing [52]: This feature refers to whether the participant suffers from coughing or not.
• Shortness of breath [53]: This feature refers to whether the participant has shortness of breath or not.
• Swallowing difficulty [54]: This feature indicates if the participant has difficulty swallowing or not.
• Chest pain [55]: This feature captures whether the participant has chest pain or not.
• Lung Cancer: This feature shows if the participant has been diagnosed with lung cancer or not.
All the features are nominal except for age, which is numerical.

Table 1. Features' ranking in the balanced data.

Table 2. The distribution of participants in terms of feature values and class label in the balanced data.

Table 4. Performance evaluation after SMOTE with 10-fold cross-validation.

Table 5. Models' comparison in terms of accuracy, recall and precision.