Determining Association between Lung Cancer Mortality Worldwide and Risk Factors Using Fuzzy Inference Modeling and Random Forest Modeling

Lung cancer remains the leading cause of cancer mortality worldwide. While it is well known that smoking is an avoidable high-risk factor for lung cancer, it is necessary to identify the extent to which other modifiable risk factors might further affect the cell's genetic predisposition for lung cancer susceptibility, and the spreading of carcinogens in various geographical zones. This study aims to examine the association between lung cancer mortality (LCM) and major risk factors. We used Fuzzy Inference Modeling (FIM) and Random Forest Modeling (RFM) approaches to analyze LCM and its possible links to 30 risk factors in 100 countries over the period from 2006 to 2016. Analysis results suggest that, in addition to smoking, low physical activity, child wasting, low birth weight due to short gestation, iron deficiency, diet low in nuts and seeds, vitamin A deficiency, low bone mineral density, air pollution, and a diet high in sodium are potential risk factors associated with LCM. This study demonstrates the usefulness of the two approaches for multi-factor analysis of risk factors associated with cancer mortality.


Introduction
Lung cancer is defined by the World Health Organization's (WHO) "International Classification of Diseases" as a malignant neoplasm (tumor) of the trachea, bronchus, and/or lungs. About 98 to 99 percent of lung cancers are carcinomas; thus, this disease is often referred to as lung carcinoma [1]. Based on WHO's GLOBOCAN 2020 estimates of cancer incidence and mortality, lung cancer remains the leading cause of cancer-related deaths in the world, with an estimated 1.8 million deaths (18 percent) worldwide. Lung cancer accounts for about 1 out of 10 newly diagnosed cancer cases and about 1 out of 5 cancer deaths worldwide [2,3].
The medical community continues to report that, worldwide, approximately 85 percent of lung cancer deaths are due to long-term smoking [4]. The etiology in this regard has been well established over decades of research; it is the remaining 15 percent of deaths, those of non-smokers, that has left the research community puzzled for the past six decades. Potential risk factors for lung cancer in this regard include passive smoking, radon gas, asbestos, aerosols from mining and metal processing, combustion (indoor emissions, exhaust, and petroleum processing), ionizing radiation, toxic gases, rubber production, and silica processing. Further, Turner and colleagues observe that, in more recent times, emissions from industry, power generation, transportation, and domestic burning considerably exceed the WHO's health-based air-quality guidelines and subject the world's population to unsafe levels of air pollution. They suggest that outdoor air pollution is an urgent worldwide public health challenge, particularly in relation to cancers [5].
Comparing the impacts of human diet, habits, and environmental risk factors on lung cancer is still a challenging task. To fill these knowledge gaps, we used fuzzy inference of weights and a Random Forest Modeling (RFM) approach to assess the weights of the aforementioned variables in association with lung cancer mortality (LCM). Comparing the two methods reveals the determinants they have in common. For decades, previous studies have mainly focused on the genetic and biological aspects of lung cancer, as well as the efficacy of medical surgeries and treatment protocols for lung carcinomas [6][7][8]. Moreover, the relationship between lung cancer and demographic and socioeconomic variables (gender, age structure, race, income, diet and food access, and level of a country's development) has been extensively researched [9][10][11][12][13][14]. In more recent times, researchers such as Turner and colleagues have focused on the association between environmental risk factors and lung cancer. Yet their focus is limited to one aspect of lung cancer incidence and mortality (i.e., indoor and outdoor pollution). These diverse perspectives, no matter the focus, tend to be localized in one geographical region, and their findings are hence limited to that region [15][16][17][18]. Therefore, there is a need to investigate whether other factors, such as human diet and habits and physical and mental health variables, are associated with LCM. In addition, this study utilizes machine learning analysis and fuzzy inference rather than multivariate statistics to explore the association between these factors and LCM.

Data
We collected lung cancer mortality data and mortality data associated with 30 risk factors in 100 countries for the period from 2006 to 2016 from open datasets published by the Global Burden of Disease (https://vizhub.healthdata.org/gbd-results/, accessed on 4 October 2022). The dataset contained 1097 observations. Mortality associated with a risk factor is the estimated number of deaths associated with that risk factor. The 30 risk factors are listed in Table 1.

Table 1. List of risk factors of a country used as independent variables in the analyses.
Columns: No., Variable Name, Acronym, Description.

Analysis Procedure
We treated LCM as the dependent variable and the 30 risk factors as independent variables in the analysis. Figure 2 illustrates the overall analysis framework. First, we computed the crude mortality rate for each variable in each of the 100 countries. The crude rate is the ratio of the number of deaths associated with each risk factor to the average population size of each country from 2006 to 2016. We then classified the countries into five risk levels for each variable using quintiles. From the lowest quintile to the highest quintile, all 100 countries were classified as very low risk, low risk, medium risk, high risk, or very high risk for LCM and for each of the other 30 variables. Figure 1 shows the distribution of LCM risk levels across these 100 countries. Second, we performed analyses using fuzzy inference modeling (FIM). In FIM, we applied the Analytic Hierarchy Process (AHP), the RIDIT analysis, and the Chi-square analysis, and searched for the optimal lattice degree of nearness. Third, we conducted the random forest modeling (RFM). Fourth, we selected the optimal weighting group based on the optimal lattice degree of nearness and compared it with the results from RFM. Fifth, we determined the similarities and dissimilarities between the two approaches. We provide a brief description of the fuzzy inference modeling below. Details of the RFM can be found in the article by Du and colleagues [19].
We used ArcGIS Pro 3.0 to create Figure 1.
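The first step of the procedure (crude rate, then quintile classification) can be illustrated with a minimal sketch. The function names, the scaling per 100,000 people, and the rank-based tie handling are assumptions made for illustration; the text does not specify them.

```python
# Hypothetical helper names; scaling per 100,000 is an assumption.
LEVELS = ["very low", "low", "medium", "high", "very high"]

def crude_rate(deaths, avg_population, per=100_000):
    """Deaths divided by the average population, scaled per `per` people."""
    return deaths * per / avg_population

def quintile_levels(rates):
    """Assign each country one of five risk levels by rank-based quintiles.

    `rates` maps a country name to its crude rate for one variable.
    """
    ordered = sorted(rates, key=rates.get)  # lowest rate first
    n = len(ordered)
    return {country: LEVELS[min(4, rank * 5 // n)]
            for rank, country in enumerate(ordered)}

# Toy input: ten made-up countries with rates 1..10.
toy_rates = {f"country_{i}": float(i) for i in range(1, 11)}
toy_levels = quintile_levels(toy_rates)
```

With 100 countries, this rank-based rule places 20 countries in each level; a library quantile routine could be substituted where ties need different treatment.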

Fuzzy Inference Methods
The FIM is based on fuzzy logic, which is used to make decisions on imprecise information. Because fuzzy logic originates from fuzzy set theory, where reasoning is approximate, fuzzy inference is used in the field of anomaly detection, with all variables viewed as fuzzy variables [20]. The strengths of these methods include the capacity to model non-linearity efficiently, to segregate normal and anomalous samples, and to better predict inconsistencies [21]. In this study, the methods serve as an auxiliary validation of machine learning-based medical anomaly detection, which is aimed at prediction and diagnosis, in addition to the medical data analyzed by machine learning. The procedure of fuzzy inference used in this study includes the 10 steps listed below [22].
(1). Filter out unrelated variables and adopt the variables in Table 1 to establish the LCM index system. In this research, we selected the 30 risk factors as the variables related to LCM.
(2). Establish all factors as a vector matrix (U) and use the five risk levels of LCM as the five classes in the analysis (V).
(3). Generate the fuzzy similar matrix (R) among the 100 countries using the product of U times V (Equation (1)).
(4). Implement the Chi-square test, generate the chi-square values and weights, and normalize the weights to obtain the first set of weights, A1.
(5). Perform RIDIT analysis to obtain the second set of weights, A2. RIDIT values are generally based on the observed distribution of a response variable for a specified set of individuals [23]. This approach is very closely related to distribution-free methods based on ranks, such as the Wilcoxon test [24]. RIDIT possesses two very important properties. First, it assigns a rank value to each class proportional to the relative frequency of observations in that class. Second, it standardizes the rank values to vary between 0 and 1. The latter property eliminates the problem of variation in the relative positions with respect to the number of ranks. The RIDIT technique appears to suppress the differences in distributional shape [25].
(6). Perform analysis using the AHP method to obtain the third set of weights, A3.
(7). Use B1 = A1 × R, B2 = A2 × R, and B3 = A3 × R to obtain the values of B1, B2, and B3, and then normalize these values.
(8). Compute the lattice degree of nearness σ using Equation (2), based on the original weights C.
(9). Obtain an optimal lattice degree of nearness using fuzzy pattern recognition.
(10). Select the optimal weight group based on the results from the three analytic approaches.
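Steps (7) through (10) can be sketched as follows. Because Equation (2) is not reproduced in this text, the sketch assumes the common textbook definition of the lattice degree of nearness, σ(A, B) = ½[A∘B + (1 − A⊙B)], where A∘B is the max-min inner product and A⊙B is the min-max outer product. It also assumes that A × R is an ordinary weighted sum. The function names and all numbers are hypothetical.

```python
def normalize(values):
    """Scale non-negative values so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]

def composite(weights, R):
    """B = A x R as an ordinary weighted sum (an assumption), then normalized.

    `weights` has one entry per factor; `R` has one row per factor and
    one column per risk class.
    """
    w = normalize(weights)
    n_classes = len(R[0])
    b = [sum(w[i] * R[i][j] for i in range(len(w))) for j in range(n_classes)]
    return normalize(b)

def lattice_nearness(a, b):
    """Assumed definition: 0.5 * (inner product + (1 - outer product))."""
    inner = max(min(x, y) for x, y in zip(a, b))  # max-min inner product
    outer = min(max(x, y) for x, y in zip(a, b))  # min-max outer product
    return 0.5 * (inner + (1.0 - outer))

# Toy example: 3 factors, 2 classes, two candidate weight sets (made up).
R = [[0.7, 0.3], [0.4, 0.6], [0.2, 0.8]]
C = [0.5, 0.5]  # stand-in for the reference vector
candidates = {"A1": [0.5, 0.3, 0.2], "A2": [0.2, 0.3, 0.5]}
scores = {name: lattice_nearness(composite(a, R), C)
          for name, a in candidates.items()}
best = max(scores, key=scores.get)  # weight set with the highest nearness
```

Selecting the weight set with the highest nearness mirrors step (10); whether the study used exactly this composition operator is not stated in the text.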

Chi-Square Analysis
Initially, all 30 variables were tested, and all passed the chi-square test. We use disease burden as an example (Table 2).
H0: the risk levels of disease burden (very low, low, medium, high, and very high) are independent of the LCM rate; Ha: they are dependent.
We calculated the observed and expected values for the five levels of disease burden mortality rates and performed the Chi-square test. The p-value from comparing the observed results with the expected results was less than 0.001, leading to the rejection of H0. We hence concluded that the risk levels of disease burden and the risk levels of LCM were not independent of each other (Ha). Once the disease burden passed the test, we computed its χ2 value and standardized the χ2 values into weights, which are shown in Table 3.
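As an illustration of the test of independence and the standardization of χ2 values into weights, a minimal sketch follows. The contingency table and the list of per-factor statistics are made up, the function names are hypothetical, and the p-value lookup is omitted; only the statistic and its degrees of freedom are computed.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a 2-D table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof

def chi_square_weights(statistics):
    """Normalize per-factor chi-square values so that the weights sum to 1."""
    total = sum(statistics)
    return [s / total for s in statistics]

# Made-up 2 x 2 table: rows are factor risk levels, columns are LCM levels.
stat, dof = chi_square_statistic([[10, 20], [20, 10]])

# Made-up chi-square values for three factors, turned into weights.
weights = chi_square_weights([120.0, 60.0, 20.0])
```

In the study's setting, each factor would yield a 5 × 5 table (five factor levels by five LCM levels), giving 16 degrees of freedom.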

RIDIT Analysis
First, we determined the frequencies of the five levels associated with the 30 risk factors. The frequencies in the five levels from very low risk through very high risk are 219, 221, 217, 221, and 219 among the 1097 values. Next, we calculated half of the frequency of each risk level. These five values are 109.5, 110.5, 108.5, 110.5, and 109.5, respectively. Third, we computed the RIDIT value of each of the five levels as the cumulative frequency of all lower levels plus half of the level's own frequency, divided by the total of 1097. These values are 0.0998, 0.3004, 0.5000, 0.6996, and 0.9002 from the lowest risk level to the highest risk level (Table 4). In the fourth and final step, we calculated the RIDIT values of the 30 risk factors and the standardized weights shown in Table 5.
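The RIDIT values reported above can be reproduced from the five frequencies with a short sketch. The function name is hypothetical; the calculation is the standard one, in which each level's RIDIT is the count of observations in lower levels plus half the level's own count, divided by the total.

```python
def ridit_values(frequencies):
    """RIDIT of each ordered level: (count below + half own count) / total."""
    total = sum(frequencies)
    values = []
    below = 0
    for freq in frequencies:
        values.append((below + freq / 2) / total)
        below += freq
    return values

# Frequencies of the five risk levels given in the text (sum = 1097).
freqs = [219, 221, 217, 221, 219]
ridits = [round(r, 4) for r in ridit_values(freqs)]
# ridits == [0.0998, 0.3004, 0.5, 0.6996, 0.9002]
```

Because the five levels have nearly equal frequencies, the RIDIT values are almost evenly spaced around 0.5.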

AHP Analysis
We used the Delphi method (Table 6) to investigate the roles of the 30 independent variables in LCM and produced the AHP weights for LCM (Table 7). In the results, living environment qualities such as outdoor air pollution and air pollution were classified as the first class. Smoking and secondhand smoke, which relate to human behavior, were viewed as the second class. Dietary health of patients was the third class, including diet high in sodium, low bone mineral density, diet low in whole grains, and diet low in vegetables and fruits. Patient physical and mental health indices (e.g., disease burden) were assigned to the fourth class. Childhood health, such as low birth weight due to short gestation, was assigned to the class with the least impact. The results of the AHP analysis show the importance of environmental variables for LCM (Table 7). Because the lattice degree of nearness from the Chi-square method has the highest value, 69.36% (Table 8), the Chi-square method is considered the best weighting method.
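How AHP turns pairwise judgments into weights can be sketched as follows. The sketch uses the row geometric-mean approximation rather than the principal eigenvector, and the 3 × 3 comparison matrix is invented for illustration; neither is taken from Tables 6 and 7.

```python
import math

def ahp_weights(pairwise):
    """Approximate AHP priority weights by the row geometric-mean method.

    `pairwise[i][j]` states how much more important item i is than item j,
    with pairwise[j][i] == 1 / pairwise[i][j].
    """
    size = len(pairwise)
    geo_means = [math.prod(row) ** (1.0 / size) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]

# Hypothetical judgments for three classes of factors, for example
# environment vs. behavior vs. diet (the values are made up).
toy_matrix = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]
toy_weights = ahp_weights(toy_matrix)
```

For a perfectly consistent matrix such as this toy one, the geometric-mean weights coincide with the eigenvector weights; for real expert judgments, a consistency check is normally added.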

Results of the RFM
We used Google Colab, a free application, to perform the RFM. Google Colab uses Python 3.6 and allows researchers to share code (Figure 3). The tree structure of the RFM is illustrated in Figure 4, and the results are given in Figure 5. In Figure 3, the maximum tree depth was 5, the minimum number of cases in a parent node was 10, and the minimum number of cases in a child node was 5. We obtained 20 nodes, 10 terminal nodes, and a tree depth of 5 levels. The prediction accuracy for LCM was 96.17%.
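A random forest with the settings described above might be configured as in the sketch below, which uses scikit-learn. Mapping "minimum cases in a parent node" to `min_samples_split` and "minimum cases in a child node" to `min_samples_leaf` is an assumption, as are the number of trees, the train/test split, and the data, which are synthetic and only mimic the shape of the dataset (1097 observations, 30 factors, five risk levels).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: risk levels 0..4 for 30 factors.
rng = np.random.default_rng(0)
n_obs, n_factors = 1097, 30
X = rng.integers(0, 5, size=(n_obs, n_factors))
# Synthetic target: the LCM level follows factor 0 with some noise.
y = np.clip(X[:, 0] + rng.integers(-1, 2, size=n_obs), 0, 4)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = RandomForestClassifier(
    n_estimators=100,      # assumed; not stated in the text
    max_depth=5,           # maximum tree depth
    min_samples_split=10,  # assumed mapping of "parent node" minimum
    min_samples_leaf=5,    # assumed mapping of "child node" minimum
    random_state=0,
)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)        # share of correct predictions
importances = model.feature_importances_      # one value per factor
top10 = np.argsort(importances)[::-1][:10]    # indices of the top 10 factors
```

Ranking factors by `feature_importances_` is one common way to obtain a top-10 list of the kind discussed later; the accuracy on this synthetic data has no bearing on the reported 96.17%.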

Discussion
This research used empirical analysis to compare the importance of factors affecting lung cancer. Table 9 summarizes the analysis results for the random forest modeling (RFM) and fuzzy inference modeling (FIM). The top 10 risk factors detected by each method are highlighted in yellow in Table 9. The top 10 risk factors identified by RFM are: smoking, low physical activity, child wasting, low birth weight due to short gestation, iron deficiency, diet low in nuts and seeds, vitamin A deficiency, low bone mineral density, air pollution, and diet high in sodium. Since smoking is a well-known risk factor associated with lung cancer [26][27][28][29][30], the results from RFM appear to be more meaningful than the results from FIM. The RFM findings indicate that smoking outweighs environmental factors by a wide margin; smoking is a leading cause of lung cancer, pregnancy complications, and heart disease [31]. The main reason relates to the impaired recruitment of white blood cells by the immune system; these cells release free radicals to kill pathogens. These free radicals could provoke an inflammatory overload when combined with those in cigarette smoke, stimulating the activated leukocytes that emit an array of cytokines, resulting in the generation of more inflammatory cells [32]. Meanwhile, the RFM results are based on the machine-learning bagging algorithm and use the ensemble learning technique [33]. The RFM builds many decision trees on subsets of the data and aggregates the output of all decision trees. This reduces the overfitting and variance of individual decision trees and thereby substantially improves the accuracy in the final comparison. Importantly, the RFM evidence places this LCM research within machine learning-based medical anomaly detection, which aims to predict and diagnose illnesses [34]. The RFM, therefore, is an advanced application of emergent disease detection.

The fuzzy inference provided an effective and quantitative weighting method for identifying the primary factors affecting LCM. On the one hand, its strength lies in modeling non-linearity efficiently, segregating normal and anomalous samples, and better predicting inconsistencies [35]. On the other hand, the FIM outcomes strengthened the RFM findings. Although the results from FIM are mixed, smoking was identified as the third most important risk factor by both the AHP analysis and the Chi-square test, in accordance with the RFM outcome. Smoking was not picked up by the RIDIT analysis as an important risk factor, implying that the validity of the RIDIT analysis for the data used in this study is problematic. However, this does not affect the FIM results, because the optimal weight group was the one computed by the Chi-square test. Air pollution and outdoor air pollution were detected as the top two most important risk factors by the AHP analysis, which is consistent with Turner and colleagues, who examined the association between outdoor air pollution and lung cancer to account for the global spatial variability of lung cancer [5]. These findings and other risk factors identified by the AHP analysis warrant additional research. It is also important to note that four risk factors (smoking, low physical activity, child wasting, and air pollution) are common to RFM and two of the FIM methods. Additional research is needed to examine the association between these four risk factors and lung cancer mortality.
Most importantly, the difference between FIM and RFM comes down to the difference between the Chi-square results and the RFM results. This could be due to discrepancies in how environment, nutrition, diet, and sex are weighted. Apart from the common factors, five factors differ between the top 10 factors of the two sets of results. In the RFM results, these are iron deficiency, diet low in nuts and seeds, low bone mineral density, air pollution, and diet high in sodium; except for air pollution, they all relate to poorly balanced nutrition. According to the World Cancer Research Fund (WCRF) Report 2007, malnutrition caused an estimated 35% of cancer incidence worldwide [36]. That air pollution leads to lung cancer was reported in the Lancet in October 2022 [37]; the reason is that air pollution stimulates inactive cells with cancer-causing mutations to generate tumors. The Chi-square findings, in turn, identified diet low in vegetables, child stunting, drug use, unsafe water source, and secondhand smoke as the five carcinogenic factors. Diet low in vegetables is a dietary risk factor. Child stunting, like child wasting, reflects chronic malnutrition. Diet and nutrition, as two modifiable lifestyle factors, were associated with reduced total and cancer-specific mortality in the update by the WCRF and the American Institute for Cancer Research (AICR) (2018) [38][39][40][41]. Drug use was positively correlated with sexual behaviors [42], which affect individual HIV infection and ultimately increase the risk of developing lung cancer in the general population [43]. Unsafe water sources can introduce mutagenic and carcinogenic compounds into various food products [36]. Secondhand smoke is strongly associated with small cell lung cancer [44], causing a 25% increased risk of lung cancer for non-smokers (American Cancer Society Report). Indeed, the root of the difference between the two sets of results lies in the biases of the two methods. Due to overfitting, RFM results might have unacceptably high variance and consequently poor predictions on unseen data. FIM results depend on the lattice degree of nearness, which might be influenced by the subjective judgment of experts.
In addition to smoking, this study suggests that future research should examine risk factors such as low physical activity, child wasting, low birth weight due to short gestation, iron deficiency, diet low in nuts and seeds, vitamin A deficiency, low bone mineral density, air pollution, and diet high in sodium. Although this study established two robust models to classify LCM and determine the most sensitive factors, some limitations should be noted. First, although applying the two models to LCM is a novel way to pinpoint its etiology and pathogenesis, overfitting might exist in the RFM model and alter the outcomes. Second, the scale of this research is too large to allow spatial-temporal regression modeling. The scale might be narrowed down in future work to make the models more robust. Finally, with the advent of the Big Data era and the development of data mining techniques, deep learning-based medical anomaly detection is drawing more attention [19]. Newer Artificial Intelligence (AI) algorithms such as convolutional neural networks might improve the model's accuracy in future research [45].

Conclusions
This study demonstrates the feasibility of using Fuzzy Inference Modeling (FIM) and Random Forest Modeling (RFM) approaches to identify potential risk factors associated with Lung Cancer Mortality (LCM). The approaches may be useful and effective in exploring the association between a disease and its potential risk factors when the analysis involves large datasets. Future research efforts should extend the analysis to other diseases and their possible risk factors. In addition, further research is needed to examine the effectiveness of, and the differences between, FIM and RFM in this type of analysis.