Air Pollution Associates with Cancer Incidences in Poland

In many countries around the world (including the United States, Canada, and Spain), research is being conducted into the impact of air pollution on the formation of various types of cancer. For a long time it was thought that the inhalation of pollutants could lead to lung diseases. Now the effects of air pollutants on tumors in the airways, kidneys, bladder, breast, and colon have been investigated and are better understood. It is now known that particulates in air pollution can cross the blood-brain barrier and also reach the placenta. The aim of this study was to find a possible relationship between the emission of pollutants into the atmosphere and the formation of specific types of tumors in the Polish population. Two databases available on the Internet were used in the analysis: the bank of measurement data on air quality in Poland (the repository of Environmental Protection Inspection) and cancer statistics. The pollution measurement data for the years 2000-2016 were taken from the Chief Inspectorate for Environmental Protection website, a database with results from 264 stations located in Poland for 13 types of gases and atmospheric pollutants. Statistical data on cancer C00-D09 (according to the International Statistical Classification of Diseases and Related Health Problems, 10th Revision (ICD-10)) in the Polish population in the years 1999-2015 were retrieved from onkologia.org.pl. A novel code was constructed, allowing the downloading of statistics from the databases, examination of their correlation, and selection of the best model of regression through machine learning. The results of the analyses indicate a high correlation of air pollution with the incidence of selected types of cancer. Particularly noteworthy is the observed effect of NOx on the incidence of small and large intestine cancers in the Masovia and West Pomerania provinces. The other gases and pollutants with the most significant impact on the incidence of gastrointestinal cancer have also been identified. Based on statistical analysis, we found a correlation between air pollution and tumor incidence in individual provinces, as well as an influence of the emission of nitrogen oxides on the cancer incidence rate.


Introduction
Cancer is a group of diseases involving abnormal cell growth with the potential to invade or spread to other parts of the body. Humans are affected by over 100 types of cancers. Exposure to ionizing radiation and air pollution are listed among other carcinogenic factors. Air pollutants include oxides of sulfur, nitrogen and carbon, as well as benzene and smog-forming particles, namely PM2.5.
Earlier studies in other countries used different approaches. For the population of the United States (data up to 1994), age-adjusted, gender-specific and race-specific cancer mortality rates were obtained for 20 groups of cancer [4]. For 8098 towns in Spain, population exposure to pollution emitted by various industries was estimated by reference to the distance from town centroids to industrial facilities, and the relative risks (RR) of colorectal cancer mortality were estimated for towns situated at a short distance from industrial installations [5].
In Poland, the earliest studies on the correlation between air pollution and cancer risk concerned the impact of environmental pollution on the incidence of malignant neoplasms of the upper respiratory tract, mainly in the region of Silesia [6,7]. Further studies extended the research area to all 16 provinces, distinguishing the type of gas or pollution occurring in a given area of the country. Concurrently, the problem of cancer in Poland has been described in [8,9], and the statistics of cases from 1999-2015 are presented in Figure 1.

The aim of our study was to indicate which types of cancer are characterized by the highest correlation with the emission of selected types of air pollutants. In the successive stages of the research, the most important gases and pollutants influencing the selected types of cancer (malignant neoplasm of the bronchus and lung, and of both the small and large intestine) were determined. This choice was dictated by the national disease statistics: these were the most common cancers.
Scientists from Taiwan conducted a very similar study to ours [11]. Annual mean concentrations of each air pollutant were determined at 75 air quality monitoring stations, and the concentrations were extrapolated for 349 local Taiwanese administrative areas. In total, 70 correlation coefficients between cancer incidence rates and various air pollutants were calculated. A significantly positive correlation was observed between the level of PM2.5 and the cancer incidence rate after multiple testing corrections.

Methods
Two databases were used in this work: a database with the results of pollutant measurements (the repository of Environmental Protection Inspection) [12] and statistics on the formation of tumors [10]. The former contains measurements of gases and pollutants contained in the air carried out in Poland in the years 2000-2016. The list of measured substances includes SO2, NO2, NOx, CO, O3, C6H6, PM10, PM2.5, Pb(PM10), As(PM10), Cd(PM10), Ni(PM10), and BaP(PM10).
The Environmental Protection Inspection tests the PM10 and PM2.5 emission and content in the air using two complementary methods: the gravimetric (reference) method (approx. 250 locations) and the automatic (approx. 180 locations) method. The data is read out every hour, and it is verified in a 4-stage system: ongoing, periodic, annual, and national verification. Two types of devices are used:

1. Dust collectors operating on the basis of reference methodologies. Collectors are produced by the companies Comde Derenda GmbH, MCZ GmbH, and Sven Leckel.
2. Meters operating in online measurement mode, according to the methodology equivalent to the reference method. These meters are manufactured by the companies Envea, Grimm Aerosol Technik, PALAS GmbH, and Thermo Fisher Scientific (US).

The second database covers epidemiological data on cancer incidence in Poland in the years 1999-2015, divided by province and county, by type of disease according to the International Statistical Classification of Diseases and Related Health Problems, 10th Revision (ICD-10), and by patient gender. It should be noted that doctors in Poland are obliged to submit a Malignant Cancer Notification Card, and the data came from this source. The data available in the two databases are compared in Table 2. We then calculated the correlation between the measured concentrations of dangerous gases (on an annual scale) and the number of cancer cases (the cancer incidence rate, determined by the number of cases or deaths per 100,000 people). To perform the statistical analysis, a Python code [13] was used to estimate the Pearson correlation coefficient and to obtain the results of the random forest regression algorithm. The Pearson product-moment correlation coefficient is computed using the NumPy package, in which the main parameters are two arrays containing multiple variables and observations (each row represents a variable, and each column a single observation of all those variables). The random forest regression algorithm is taken from Scikit-learn, an open source machine learning library that supports supervised and unsupervised learning. In this procedure each tree in the ensemble is built from a sample drawn with replacement from the training set. The source code used for the analysis is posted in the GitHub repository (https://github.com/ntusnio/CV/blob/master/Projekt%20rak.ipynb).
The Pearson correlation coefficient is a measure of the linear correlation between two variables. It has a value between +1 and −1, where 1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation.
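As a minimal sketch of this step, the coefficient can be obtained with NumPy's `corrcoef`, which follows the row-per-variable layout described above. The two annual series below are hypothetical values chosen for illustration only; they are not taken from either database, and the variable names are assumptions.

```python
import numpy as np

# Hypothetical annual values for a single province (illustrative only).
no2_concentration = np.array([18.0, 19.5, 21.0, 20.0, 23.5, 25.0])  # annual mean, ug/m3
incidence_rate = np.array([30.1, 31.0, 33.2, 32.5, 35.8, 37.0])     # cases per 100,000

# np.corrcoef treats each row as a variable and each column as one observation.
data = np.vstack([no2_concentration, incidence_rate])
corr_matrix = np.corrcoef(data)

# The off-diagonal element is the Pearson r between the two series.
r = corr_matrix[0, 1]
print(f"Pearson r = {r:.3f}")
```

The result is a symmetric 2 x 2 matrix with ones on the diagonal, so only the off-diagonal entry carries information for a single pollutant and cancer pair.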
The random forest algorithm [14] is a flexible, easy-to-use machine learning algorithm that, even without hyper-parameter tuning, produces an accurate result most of the time. It is also one of the most widely used algorithms because of its simplicity and because it can be employed for both classification and regression tasks. This second method was used to assess the qualitative impact of individual types of hazardous gases on the statistics of the formation of small and large intestine cancers. The random forest algorithm was introduced in 1995; for research purposes, its results were verified by calculating the average accuracy over 1000 calculations on a separate part of the data reserved for testing. It gave better results than the XGBoost algorithm and the Lasso method. The random forest algorithm computes qualitative effects based on a feature importance score that indicates how useful or valuable each feature was in the construction of the decision trees within the model (the more an attribute is used to make key decisions with decision trees, the higher its relative importance).
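The procedure described above can be sketched with scikit-learn's `RandomForestRegressor`. This is an illustration under stated assumptions, not the authors' notebook: the data are synthetic, the split size and number of trees are arbitrary, the number of repetitions is reduced from 1000 to 10, and the coefficient of determination returned by `score` is used as a stand-in for the "accuracy" mentioned in the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
pollutants = ["SO2", "NO2", "NOx", "PM10", "PM2.5"]

# Synthetic data (assumption): rows are province-years, columns are pollutant levels.
X = rng.uniform(5.0, 50.0, size=(200, len(pollutants)))
# By construction, the synthetic incidence depends mostly on NO2.
y = 2.0 * X[:, 1] + 0.3 * X[:, 4] + rng.normal(0.0, 2.0, size=200)

n_repeats = 10  # the text reports 1000 repetitions; reduced here for speed
scores = []
importances = np.zeros(len(pollutants))
for seed in range(n_repeats):
    # Hold out a separate part of the data for testing in every repetition.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    model = RandomForestRegressor(n_estimators=50, random_state=seed)
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))  # R^2 on the held-out part
    importances += model.feature_importances_

importances /= n_repeats
mean_score = float(np.mean(scores))

# Rank pollutants by their average feature importance.
ranking = sorted(zip(pollutants, importances), key=lambda item: item[1], reverse=True)
for name, value in ranking:
    print(f"{name}: {value:.3f}")
print(f"mean held-out score: {mean_score:.3f}")
```

Averaging importances over several splits reduces the dependence of the ranking on any single random partition of the data.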
The latency period for selected neoplasms was not included in the analyses. For example, lung cancer usually occurs 10-40 years after the onset of exposure. In addition, smoking (active or passive) and occupational exposure to inhalation of asbestos increase the risk of lung cancer development, and research has confirmed the synergistic effect of both of these factors. In order to prevent cancer, it is also necessary to eliminate additional factors contributing to the occurrence of cancer or mesothelioma, for example by avoiding exposure to aromatic hydrocarbons.
The latency period was not included in the analyses because of the wide range of neoplasms (over 100 types, according to the International Classification of Diseases), each of which is characterized by a different latency value that also depends on individual features. The consequences of adopting a zero latency period are similar to those of assuming an inappropriate value for the carcinogenesis period: in both cases, the level of pollutants identified in the air in a given year is correlated with the number of cancer cases in a year with an incorrectly selected delay. Due to the complexity and diversity of the changes taking place in the body's cells that lead to the formation of cancer, as well as the fact that carcinogenesis is a long-term process (the average period of development of a tumor with a diameter of 1 cm is about 5 years, although it depends on the type of tumor and tissue), it was decided to follow an approach that does not take into account the delay in cancer formation. Thus, following this assumption, the incidence of a given type of cancer was examined in a geographical area where a given type of air pollution occurs.
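To illustrate what the choice of delay means in practice, the sketch below compares a zero-lag correlation with a lagged one. It is not part of the analysis described here: the function name `lagged_pearson` is hypothetical, and the series are synthetic, built so that incidence follows pollution with a two-year delay.

```python
import numpy as np

def lagged_pearson(pollution, incidence, lag):
    """Pearson r between pollution in year t and incidence in year t + lag."""
    pollution = np.asarray(pollution, dtype=float)
    incidence = np.asarray(incidence, dtype=float)
    if pollution.shape != incidence.shape:
        raise ValueError("both series must cover the same years")
    if lag < 0 or lag >= len(pollution) - 1:
        raise ValueError("lag must leave at least two overlapping years")
    if lag == 0:
        x, y = pollution, incidence
    else:
        # Drop the last `lag` pollution years and the first `lag` incidence years.
        x, y = pollution[:-lag], incidence[lag:]
    return float(np.corrcoef(x, y)[0, 1])

# Synthetic series (assumption): incidence in year t equals 3 x pollution in year t - 2.
pollution = [10, 14, 9, 16, 12, 18, 11, 15]
incidence = [30, 35, 30, 42, 27, 48, 36, 54]

r_zero_lag = lagged_pearson(pollution, incidence, lag=0)
r_two_year_lag = lagged_pearson(pollution, incidence, lag=2)
print(f"lag 0: r = {r_zero_lag:.3f}, lag 2: r = {r_two_year_lag:.3f}")
```

With these synthetic values the correlation at the true delay is essentially perfect, while the zero-lag value is noticeably lower, which is the kind of distortion an incorrectly selected delay introduces.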
The analyses did not take into account any statistical methods other than correlations, and the research was limited to comparing the content of the two available databases. The limitations associated with such an approach resulted in not taking into account other factors leading to the formation of cancer, which include the presence of carcinogens, such as physical carcinogens (e.g., ultraviolet radiation), chemical carcinogens (alcohol and tobacco addiction, occupational exposure), or biological carcinogens (some viruses).
It should be added that many studies on the influence of air pollution on cancer risk have been conducted in Poland, and their results are presented, for example, in [15][16][17][18][19].

Results
First, the possible correlation between air pollution and cancer formation was examined for all provinces in Poland. For each pollutant, the best correlated cancer type was identified (Figure 2). Next, the most important pollutant was selected, based on the summation of the correlation values (r) for all cancer cases (C00-D09). Calculations were made for the whole country (they were not carried out for individual provinces). The results of this comparison are shown in Figure 3. The best correlation with the incidence of cancer was noted for the emission of nitrogen oxides.
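The two selection steps described in this paragraph can be sketched as follows. The correlation values in `r_table` are hypothetical placeholders, not the values behind Figures 2 and 3, and only a few pollutants and ICD-10 codes are included to keep the example short.

```python
import numpy as np

pollutants = ["SO2", "NO2", "NOx", "PM10"]
cancer_codes = ["C17", "C18", "C34"]

# Hypothetical Pearson r values: one row per pollutant, one column per cancer type.
r_table = np.array([
    [0.10, 0.25, 0.30],  # SO2
    [0.55, 0.60, 0.40],  # NO2
    [0.70, 0.65, 0.45],  # NOx
    [0.20, 0.15, 0.50],  # PM10
])

# Step 1: for each pollutant, find the cancer type with the highest r.
best_cancer = {
    name: cancer_codes[int(np.argmax(r_table[i]))]
    for i, name in enumerate(pollutants)
}

# Step 2: sum r over all cancer types and pick the pollutant with the largest total.
summed_r = r_table.sum(axis=1)
top_pollutant = pollutants[int(np.argmax(summed_r))]

print(best_cancer)
print(f"pollutant with the highest summed r: {top_pollutant}")
```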
Based on the results mentioned above, we focused our further analyses on nitrogen oxides and examined the correlation of their presence in the air at the monitoring sites with the formation of various types of tumors in each province. Table 3 shows the correlations in given provinces with the type of cancer for NO2 and NOx emissions.
The choice of the types of cancer examined in detail was guided by the ratio of deaths to malignant neoplasms in Poland in the period available in the database (Figure 4), as well as by the trends showing the highest increases in disease.
Figure 4 (values recovered from the figure), first panel. Malignant neoplasms: 31.7% bronchi and lungs; 7.4% prostate gland; 7.0% gastric; 6.7% large intestine; 5.5% without specifying its location; 4.6% bladder; 4.2% pancreas; 3.0% rectal; 3.0% kidneys with the exception of the pelvis; 2.8% larynx; 24.1% other.
Figure 4 (values recovered from the figure), second panel. Malignant neoplasms: 13.8% bronchi and lungs; 13.2% nipple; 7.8% large intestine; 6.8% without specifying its location; 6.0% ovary; 5.3% pancreas; 5.0% gastric; 4.4% cervix; 3.2% brain; 2.8% rectal; 31.7% other.
Of note, our findings showed that the highest rate of increase in the number of cases in Poland is related to colorectal cancer, yet the literature suggests that cancers of the bronchus and lung cause the most deaths among men and women in the country [20].
Detailed results for intestinal cancer in correlation with air pollution are presented in Figure 5, in which C17 refers to the small intestine and C18 to the large intestine.
The selection of provinces resulted from the reading presented in Table 2, in which bowel cancer (C17 and C18) was best correlated with NOx emissions.
In the last part of the analysis, we used the random forest regression model to examine which contaminants have the greatest influence on the incidence of malignant neoplasm of the bronchus and lung (C34), and also of the small (C17) and large intestine (C18). The random forest algorithm is an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. In the simplest terms, this method is based on the best fit of a function in which the arguments are the measured amounts of gases and pollutants, and the result is the number of cases of a given type of cancer.
The results of the analysis are shown in Figure 6a-c.
In the case of lung cancer, it turned out that the basic contaminants affecting its formation are particles with a diameter of 2.5 μm or less (PM2.5). These contaminants may not be filtered by human organs, thus enabling toxic dust to penetrate into the lungs, bronchi, blood, and thus into the brain [21]. The concept of feature importance refers to a class of techniques for assigning scores to the input features of a predictive model; the scores indicate the relative importance of each feature when making a prediction. It is a quantitative parameter. As shown in Figure 6c, the three most important air pollutants that may affect the formation of malignant colon cancer are NO2, As(PM10), and BaP(PM10).
Finally, the average accuracy of the random forest model was calculated, but the result was not high (only 20%). This means that air pollution is not the only factor in the formation of cancer in Poland. We may suggest that factors related to human nutrition, water quality, and smoking also need to be included.

Discussion
The first major observation in this study was a strong relationship between the level of PM2.5 in the air and the incidence of lung cancer. Furthermore, we showed the effect of nitrogen oxides on the formation of tumors, and in particular the correlation between the presence of NO2 in the air and the formation of colon cancer. Consequently, our data suggest that the level of NO2 in the air and compounds present in the dust (arsenic, benzo(a)pyrene) occurring in the inhaled air may have a strong influence on the incidence of colorectal cancer.
Our results are in line with a very interesting correlation study performed in Japan, which examined the factors that could have caused the geographic variation observed in the lung and large intestinal cancer morbidity in that country. Lung cancer was highly correlated with industrialization-related factors such as localization of manufacturing industries, automobile traffic, and air pollution, whereas colon cancer was correlated with the population density of workers in the tertiary industries such as services, trade, and government. A multiple regression analysis could not detect any single factor with an exceptionally strong influence on either cancer [22].
An important problem when examining the factors contributing to the formation of specific cancer types is the proximity of residences to incinerators or hazardous waste disposal plants. The analysis of this problem was carried out in Spain and Italy. In Spain, increased cancer-related mortality was detected in the total population residing in the vicinity of these installations as a whole, and principally in the vicinity of incinerators and scrap metal/end-of-life vehicle handling facilities. Special mention should be made of the results for tumors of the pleura, stomach, liver, kidney, ovary, lung, leukemia, colon/rectum, and bladder [23].
In the Italian analysis, no association between pollution exposure from the incinerators and all-cause and cause-specific mortality outcomes was observed in men, with the exception of colon cancer. However, exposure to the incinerators was associated with cancer mortality among women, in particular for stomach, colon, liver, and breast cancer. NO2 levels, used as a proxy for other pollution sources (traffic in particular), did not play an important confounding role [24].
The above may be of importance in relation to recent events in Poland. In the first half of 2018, nearly 70 landfill sites were burnt, and these fires may have effects similar to those described in the articles cited above. The burning of rubber, plastic waste, and many kinds of chemical waste creates poisonous and carcinogenic substances. Breathing polluted air increases the risk of cancer, which will pose a serious health issue in the near future.

Conclusions
Lung cancer is not the only cancer threat related to air pollution. The latest research suggests that there are other cancers linked to air pollution. Nitrogen oxides have been shown to be the type of gas most strongly correlated with cancer statistics, and there are scientific grounds to attribute to them an influence on the development of serious illnesses. In Poland, the number of deaths attributed to long-term exposure to NO2 is estimated at 1600 annually. It is worth mentioning that nitrogen oxides also harm us indirectly. They are precursors of carcinogenic compounds formed in soils that can