Environmental Impacts of Cement Production: A Statistical Analysis

Attention to the environmental impacts of cement production has grown rapidly in recent decades. The cement industry is a significant emitter of greenhouse gases, mainly due to the calcination of raw materials and the combustion of fuels. This paper investigates the environmental performance of cement production and identifies the factors driving emissions. For this purpose, a sample of 193 different recipes of gray cement, produced in Italy from 2014 to 2019 according to the European standard EN 197-1, is analyzed. The paper identifies the consumption impact categories (e.g., fossil fuels, renewable and non-renewable secondary fuels) that explain the Global Warming Potential, one of the most crucial impacts of cement production. A set of predictive models is implemented and evaluated for the overall dataset and for each cement type. A similar approach is adopted to produce accurate predictive models for further environmental impact categories that quantify emissions to air. The results provide information that can support cement producers in developing low-impact cement recipes.


Introduction
In recent years, interest in environmental protection has grown rapidly, becoming an important criterion for public policy in social and political contexts [1,2]. The most pursued objective is to reduce emissions of greenhouse gases (e.g., carbon dioxide, nitrous oxide and methane), which are responsible for the greenhouse effect [3]. In particular, the cement industry contributes about 5% of global anthropogenic CO2 emissions excluding land-use change [4,5], as the production of the binder is a highly energy-intensive and emitting process. The calcination of raw materials for cement production (e.g., limestone, clay, calcareous marl and other clay-like materials) and the burning of (fossil) fuels to maintain a high temperature in the kiln are the processes with the highest environmental impact. The former is a chemical emission, the latter a physical emission. Indeed, raw materials are heated inside large rotating furnaces at 1400 °C to form a solid substance called clinker [6]. During this process, chemical emissions mainly come from the calcination of calcium carbonate (CaCO3) and magnesium carbonate (MgCO3) according to Equations (1) and (2) [7,8]:

$$\mathrm{CaCO_3(s)} + \text{heat} \rightarrow \mathrm{CaO(s)} + \mathrm{CO_2(g)} \qquad (1)$$

$$\mathrm{MgCO_3(s)} + \text{heat} \rightarrow \mathrm{MgO(s)} + \mathrm{CO_2(g)} \qquad (2)$$

Clinker is then ground or milled with gypsum and other constituents (e.g., products, raw materials, additives, recycled waste) to produce cement [9]. According to [9], 5 main types of cement (CEM I to

Data and Methods
After a quick introduction of the available dataset, this section provides the reader with the key concepts to investigate the behavior of the Global Warming Potential (GWP) and its connections with the production parameters, as well as a short summary of the statistical methods here employed.
A preliminary analysis of the correlation matrix, performed to study the relations between variables, is given in Figure 1. Positive and negative correlations are displayed in blue and red, respectively, while the color intensity and the size of each circle are proportional to the absolute value of the corresponding correlation coefficient. Figure 1 shows that ODP, ADPf and TNRPE are pairwise perfectly correlated, that is, the corresponding correlation coefficient is equal to 1. This implies that these variables match deterministically, up to a multiplicative factor. For this reason, ODP and TNRPE are discarded in the further analysis. The remaining ICs have been used as independent variables and split into two groups, "Emissions" and "Consumption" (Table 2).
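The screening step described above can be sketched in a few lines. This is a minimal illustration, assuming NumPy and a purely synthetic stand-in dataset; the helper name `drop_perfectly_correlated`, the tolerance and the generated values are hypothetical and are not taken from the paper.

```python
import numpy as np

def drop_perfectly_correlated(X, names, tol=1e-8):
    """Keep the first variable of every group whose pairwise |r| equals 1 (within tol)."""
    corr = np.corrcoef(X, rowvar=False)
    keep = []
    for j in range(len(names)):
        # Discard column j if it is a rescaled copy of a column already kept.
        if all(abs(abs(corr[j, k]) - 1.0) > tol for k in keep):
            keep.append(j)
    return [names[j] for j in keep], corr

# Synthetic stand-in data (NOT the paper's dataset): ODP and TNRPE are exact
# positive multiples of ADPf, while Water is generated independently.
rng = np.random.default_rng(0)
adpf = rng.normal(size=50)
water = rng.normal(size=50)
X = np.column_stack([adpf, 2.5 * adpf, 0.1 * adpf, water])
names = ["ADPf", "ODP", "TNRPE", "Water"]

kept, corr = drop_perfectly_correlated(X, names)
print(kept)
```

Listing ADPf first makes it the representative that survives, which mirrors the choice in the text of discarding ODP and TNRPE.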

Emission: GWP, AP, EP, POCP, P
Consumption: ADPe, ADPf, TRPE, SRM, nonRSF, RSF, Water, El

This work will focus on the following questions:

1. Which ones are the most relevant variables among the Consumption ICs in order to explain the behavior of GWP?
2. Are different variables important to predict GWP for the four types of cement?
3. Are the relevant variables to predict GWP also useful to predict other Emission ICs?
The answer to the first question is given by fitting and evaluating a linear regression model linking GWP to the Consumption ICs (Equation (3)):

$$\mathrm{GWP} = \beta_{\mathrm{ADPe}}\,\mathrm{ADPe} + \beta_{\mathrm{ADPf}}\,\mathrm{ADPf} + \beta_{\mathrm{TRPE}}\,\mathrm{TRPE} + \beta_{\mathrm{SRM}}\,\mathrm{SRM} + \beta_{\mathrm{nonRSF}}\,\mathrm{nonRSF} + \beta_{\mathrm{RSF}}\,\mathrm{RSF} + \beta_{\mathrm{Water}}\,\mathrm{Water} + \beta_{\mathrm{El}}\,\mathrm{El} + \varepsilon \qquad (3)$$

where ε is the so-called noise, a vector of size n = 193 of independent and identically distributed random variables, while $(\beta_{\mathrm{ADPe}}, \beta_{\mathrm{ADPf}}, \beta_{\mathrm{TRPE}}, \beta_{\mathrm{SRM}}, \beta_{\mathrm{nonRSF}}, \beta_{\mathrm{RSF}}, \beta_{\mathrm{Water}}, \beta_{\mathrm{El}})$ is the vector of parameters to be estimated [31]. In statistics, linear regression is a widely used approach to establish the relationship between the so-called response (in our case GWP for Sections 3.2 and 3.3, the other Emission ICs in Section 3.4) and a set of explanatory variables (here the Consumption ICs) [32]. The relationship between the response and the explanatory variables is modeled by means of a linear predictor function whose unknown model parameters are estimated from the data. In this work, all the linear models are fit by means of the so-called least squares approach, which minimizes the sum of the squared residuals (the differences between the observed values and the ones predicted by the model). In our case, solving the regression problem produces a vector of estimators $(\hat\beta_{\mathrm{ADPe}}, \hat\beta_{\mathrm{ADPf}}, \hat\beta_{\mathrm{TRPE}}, \hat\beta_{\mathrm{SRM}}, \hat\beta_{\mathrm{nonRSF}}, \hat\beta_{\mathrm{RSF}}, \hat\beta_{\mathrm{Water}}, \hat\beta_{\mathrm{El}})$. The most relevant variables can then be selected by means of the so-called Akaike Information Criterion (AIC) [32]. This procedure results in a model that retains only the statistically significant variables, those accounting for the largest share of variance, while the others are iteratively discarded.

A model validation is then performed by means of a 10-fold cross validation procedure [31] to assess and compare the accuracy of the two models. Cross validation evaluates the accuracy of a predictive model by estimating its ability to predict new data. In k-fold cross validation, the original dataset is randomly partitioned into k subsamples of equal size. Then, one subsample (the validation data) is used to test the model obtained by using the remaining k − 1 subsamples (the so-called training data). This procedure is repeated k times and the results are averaged, so that all the observations are used for both training and validation. The accuracy of each method is measured by the root mean square error, the risk function given by the square root of the average squared difference between the observations and the estimated values [31].

As far as the second question is concerned, it is investigated whether different types of cement influence GWP in different ways and, consequently, whether the statistical model's accuracy can be improved by fitting a separate regression model for each class. These regression models can be expressed as Equation (4):

$$\mathrm{GWP}^{(i)} = \beta^{(i)}_{\mathrm{ADPe}}\,\mathrm{ADPe}^{(i)} + \beta^{(i)}_{\mathrm{ADPf}}\,\mathrm{ADPf}^{(i)} + \beta^{(i)}_{\mathrm{TRPE}}\,\mathrm{TRPE}^{(i)} + \beta^{(i)}_{\mathrm{SRM}}\,\mathrm{SRM}^{(i)} + \beta^{(i)}_{\mathrm{nonRSF}}\,\mathrm{nonRSF}^{(i)} + \beta^{(i)}_{\mathrm{RSF}}\,\mathrm{RSF}^{(i)} + \beta^{(i)}_{\mathrm{Water}}\,\mathrm{Water}^{(i)} + \beta^{(i)}_{\mathrm{El}}\,\mathrm{El}^{(i)} + \varepsilon^{(i)} \qquad (4)$$

where i = I, II, III, IV, and solving the regression model produces estimates for the set of parameters $(\beta^{(i)}_{\mathrm{ADPe}}, \ldots, \beta^{(i)}_{\mathrm{El}})$, i = I, ..., IV. It is also relevant to check whether some types of cement behave differently in terms of the regression models and relevant variables. Again, 10-fold cross validation procedures [31] are used to compare the results and to verify how accurate each predictive model is.
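The k-fold procedure and the RMSE described above can be sketched as follows. This is a minimal sketch assuming NumPy, a least-squares fit without intercept (the variables are treated as already normalized) and synthetic stand-in data; the function name `kfold_rmse`, the seeds and the coefficients are hypothetical, and only the sample size n = 193 is borrowed from the text.

```python
import numpy as np

def kfold_rmse(X, y, k=10, seed=0):
    """Average validation RMSE of a no-intercept least-squares fit over k folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)  # fold sizes differ by at most one
    scores = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)  # the remaining k-1 folds
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        resid = y[fold] - X[fold] @ beta  # errors on held-out data only
        scores.append(np.sqrt(np.mean(resid ** 2)))
    return float(np.mean(scores))

# Synthetic stand-in data (NOT the paper's dataset): 8 predictors, of which
# only three actually drive the response.
rng = np.random.default_rng(1)
n, p = 193, 8
X = rng.normal(size=(n, p))
y = 0.9 * X[:, 1] - 0.4 * X[:, 4] - 0.3 * X[:, 5] + rng.normal(scale=0.1, size=n)

rmse_full = kfold_rmse(X, y)
rmse_reduced = kfold_rmse(X[:, [1, 4, 5]], y)
print(f"full model RMSE: {rmse_full:.4f}, reduced model RMSE: {rmse_reduced:.4f}")
```

Because 193 is not divisible by 10, the folds here are only approximately equal in size. Averaging the per-fold RMSE is one common convention; pooling all held-out residuals before taking the square root is another, and the text does not say which was used.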
Regarding the third question, a multiple linear regression using the full set of Consumption ICs is fit and used as a baseline to evaluate two alternative models. The first alternative model uses as input variables only the ones selected for GWP. The second alternative model selects a different set of variables for each Emission IC by means of the AIC. If sufficiently accurate, the first model would allow the producer to focus on the same subset of variables to jointly control all the emissions. Otherwise, the second model establishes which Consumption ICs are relevant to predict emissions other than GWP. Also in this case, a 10-fold cross validation procedure has been applied to compare the accuracy of the models. The statistical analysis has been performed within the R environment [33] with the support of additional packages [34,35].

Results
In this section, details concerning the performed data analysis are presented and discussed to answer the questions introduced in Section 2. In particular, Section 3.1 includes some preliminary exploratory analysis. Section 3.2 concentrates on Question 1 by studying and comparing two models to predict GWP: the first model is obtained by fitting a linear regression, and the second by selecting the most relevant variables by means of the AIC. Section 3.3 is concerned with Question 2. For each type of cement, a linear regression is fit and then the most important Consumption ICs are selected by the AIC. Analogies and differences between the models developed here and the ones in Section 3.1 are then investigated. Section 3.4 is focused on the other Emission ICs and, thus, on Question 3. For each type of cement and for each Emission IC, three different models are studied and compared. The first model is a linear regression which uses all the available Consumption ICs to predict each emission. The second model is a linear regression where only the variables relevant to GWP established in Section 3.2 are used. The third model selects the relevant variables for each Emission IC by the AIC. The three models are then examined and compared to establish whether the same Consumption ICs can be used to accurately predict all the emissions.

Exploratory Analysis
First, it is investigated whether there is some relation linking GWP with the other Emission ICs. All the Emission variables are positively correlated with GWP (i.e., green lines), as shown in Table 3 and described in Figure 2. Table 4 presents the correlation coefficients of GWP with the Consumption variables, which are also displayed in Figure 4. Having regard to the correlation coefficients, Figure 2 shows that all Emission ICs are positively correlated. Figure 3 shows the correlation between GWP and the Consumption ICs. GWP is strongly positively correlated to ADPf and El and mildly negatively correlated to TRPE and SRM. The results about GWP, El and ADPf comply with the physical meaning of these ICs. Indeed, the more fossil fuels and electricity are consumed, the more greenhouse gases are emitted. On the other hand, the use of secondary fuels (both renewable and non-renewable) reduces the consumption of fossil fuels. The correlation between ADPe and Water is due to the upstream processes (i.e., quarry extraction) of cement production.
Therefore, the most relevant variables among the Consumption ICs that could explain the behavior of GWP are electricity and fossil fuel consumption. This is consistent with the Italian energy mix, whose energy consumption is mainly driven by petroleum and other liquids and natural gas [36]. On the other hand, the correlations between GWP and the other Emission ICs (Table 3) and between GWP and the Consumption ICs (Table 4) justify the international approach of protecting the environment by reducing greenhouse gas emissions. To this end, in 2003 the European Parliament and the Council established the Emissions Trading Scheme [37] to limit or reduce greenhouse gas emissions. Figure 4 provides the reader with explicit correlation coefficients (in the top-right cells with respect to the main diagonal), an estimate of the density function by a histogram and a kernel density estimation (KDE) (on the main diagonal) and, finally, scatterplots with fitted nonparametric regression lines to stress the relationship between pairs of different variables (in the bottom-left cells with respect to the main diagonal). In the first column, each plot displays values for GWP paired with all the Consumption ICs, while in the first row the correlation coefficients between GWP and the Consumption ICs are listed.
In Figure 4, both x- and y-axis labels refer to the corresponding ICs listed on the main diagonal; their units comply with those listed in Table 1. Accordingly, GWP values obtained in the LCA range between 0.6 and 1.0 kg CO2 eq. per kg of produced cement, while ADPe ranges between 1.0 × 10^−7 and 5.0 × 10^−7 kg Sb eq. per kg of produced cement.

Linear Regression and Variable Selection for GWP
In this section, two predictive models for GWP are fit. The first model is a linear regression where GWP is the scalar response and the Consumption ICs are the input variables. All variables have been preliminarily normalized to simplify the interpretation. The estimated coefficients are listed in Table 5, together with the related standard deviations (St. dev.) and the significance levels of the p-values associated with the significance test of the model: "***" (p-value ≤ 0.001), "**" (p-value ≤ 0.01) and " " otherwise.
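The normalization and least-squares fit described above can be illustrated with a short sketch. It assumes NumPy, reads "normalized" as z-scoring (an assumption, since the text does not specify the scheme), and uses synthetic stand-in data with three hypothetical predictors; the helper names `standardize` and `ols_with_se` are illustrative. It reports t-statistics rather than p-values to stay free of extra dependencies.

```python
import numpy as np

def standardize(a):
    """Center to zero mean and scale to unit sample standard deviation."""
    return (a - a.mean(axis=0)) / a.std(axis=0, ddof=1)

def ols_with_se(X, y):
    """Least-squares coefficients, their standard errors and t-statistics."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # n - p - 1: one extra degree of freedom is spent on centering the data.
    sigma2 = resid @ resid / (n - p - 1)
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta, se, beta / se

# Synthetic stand-in data (NOT the paper's dataset): three predictors on very
# different scales; the third one has no effect on the response.
rng = np.random.default_rng(2)
raw = rng.normal(loc=[5.0, 20.0, 1.0], scale=[1.0, 4.0, 0.2], size=(193, 3))
gwp = 0.5 + 0.8 * (raw[:, 0] - 5.0) - 0.3 * (raw[:, 1] - 20.0) / 4.0
gwp = gwp + rng.normal(scale=0.2, size=193)

X = standardize(raw)
y = standardize(gwp)
beta, se, t = ols_with_se(X, y)
for name, b, s, tv in zip(["x1", "x2", "x3"], beta, se, t):
    print(f"{name}: coef={b:+.3f}  se={s:.3f}  t={tv:+.1f}")
```

After standardization the coefficients are unit-free, so their sizes can be compared directly across predictors, which is the interpretive benefit the text refers to.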
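To develop the second model, an AIC backward selection procedure is then performed on the linear regression to find the best subset of Consumption ICs to accurately predict GWP, leading to the model described in Table 6. Then, a 10-fold cross-validation procedure is performed to compare the two models. The root mean square errors (RMSE) are computed by Equations (5) and (6):

$$\mathrm{RMSE}_{lin} = \sqrt{\frac{1}{n}\sum_{j=1}^{n}\left(\mathrm{GWP}_j - \widehat{\mathrm{GWP}}^{\,lin}_j\right)^2} \qquad (5)$$

$$\mathrm{RMSE}_{AIC} = \sqrt{\frac{1}{n}\sum_{j=1}^{n}\left(\mathrm{GWP}_j - \widehat{\mathrm{GWP}}^{\,AIC}_j\right)^2} \qquad (6)$$

where $\mathrm{GWP}_j$ is the j-th observation and $\widehat{\mathrm{GWP}}^{\,lin}_j$ and $\widehat{\mathrm{GWP}}^{\,AIC}_j$ are the corresponding values predicted by the full linear model and by the AIC-selected model, respectively. Recall that the RMSE is a risk function that measures the discrepancy between the observations and the corresponding estimated values. In Table 7, RMSE_lin is higher than RMSE_AIC; thus, the variable reduction produces a more accurate model. Finally, Figure 5 describes the size of each regression slope coefficient after the variable selection. The highest contribution to GWP is given by ADPf. This result is consistent with the release of carbon dioxide into the atmosphere by the burning of fossil fuels [38-40].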
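A backward selection driven by the AIC, as described above, can be sketched as follows. This is a minimal illustration assuming NumPy and the Gaussian form of the criterion, n·log(RSS/n) + 2p, which equals the AIC of a least-squares model up to an additive constant; the exact routine used by the authors is not stated in the text. The data are synthetic, the variable names only label columns, and the generating coefficients are arbitrary rather than the paper's estimates.

```python
import numpy as np

def gaussian_aic(X, y):
    """AIC of a least-squares fit, up to an additive constant: n*log(RSS/n) + 2p."""
    n, p = X.shape
    if p == 0:
        rss = float(y @ y)
    else:
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = float(np.sum((y - X @ beta) ** 2))
    return n * np.log(rss / n) + 2 * p

def backward_aic(X, y, names):
    """Drop one variable at a time as long as doing so lowers the AIC."""
    active = list(range(X.shape[1]))
    best = gaussian_aic(X, y)
    while active:
        # AIC of every model obtained by removing a single active variable.
        trials = [(gaussian_aic(X[:, [j for j in active if j != d]], y), d)
                  for d in active]
        cand, drop = min(trials)
        if cand >= best:  # no removal improves the criterion: stop
            break
        best, active = cand, [j for j in active if j != drop]
    return [names[j] for j in active], best

# Synthetic stand-in data (NOT the paper's dataset).
names = ["ADPe", "ADPf", "TRPE", "SRM", "nonRSF", "RSF", "Water", "El"]
rng = np.random.default_rng(3)
X = rng.normal(size=(193, 8))
y = 0.9 * X[:, 1] - 0.4 * X[:, 4] - 0.3 * X[:, 5] + rng.normal(scale=0.1, size=193)

selected, final_aic = backward_aic(X, y, names)
print(selected, round(final_aic, 2))
```

Note that the AIC trades goodness of fit against the number of parameters, so a variable with no real effect may still be retained if it happens to reduce the residual sum of squares enough in a given sample.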
Appl. Sci. 2020, 10, x; doi: FOR PEER REVIEW 10 of 25
In answer to Question 1, the most relevant variables among the Consumption ICs to predict GWP are ADPf, NRSF and RSF. The energy-intensive nature of cement manufacturing can motivate this variable selection: all these ICs quantify the energy, mainly fossil but also alternative, spent in the process. This result is consistent with the efforts to implement in the cement sector different management systems, process-integrated techniques and end-of-pipe measures identified as Best Available Techniques (BAT) to obtain environmental benefits (e.g., thermal energy optimization techniques in the kiln system; reduction of electrical energy use; recovery of excess heat from the process and cogeneration of steam and electrical power) [41].

Linear Regression and Variable Selection for Each Type of Cement
The different types of cement are now studied separately to evaluate their impact on GWP. Figure 6, which contains the scatterplots related to GWP and the Consumption ICs, shows that the points associated with the class CEM I (in blue) are isolated in the GWP scatterplots with respect to the data belonging to the other types. Moreover, the environmental impacts of CEM I are higher than those of the other investigated cement types: both the qualitative and the quantitative observed trends suggest investigating whether predictive models built separately for each class (type of cement) could achieve more accurate predictions for GWP. Furthermore, it is of interest to check whether the variables that turn out to be important for each separate class differ from the ones selected for the whole dataset. Figure 6 contains a matrix of scatterplots used to visualize the relationship between pairs of variables, all listed on the main diagonal. For each scatterplot, the variable on the x-axis (y-axis, respectively) can be found in the entry of the main diagonal in the same column (row, respectively). The units of each axis label are listed in Table 1.

The dimensions of each class are given in Table 8. Due to the small number of observations, CEM III is filtered out. For each class, a linear regression is fit, where GWP corresponds to the scalar response and the ICs to the explanatory variables. The estimated regression coefficients for CEM I, II, and IV are listed in Tables 9-11, respectively. Then, AIC backward selection procedures yield the models for the classes CEM I, II and IV described in Tables 12-14, respectively. In these tables, significance is coded as "***" (p-value ≤ 0.001), "**" (p-value ≤ 0.01), "." (p-value ≤ 0.1), and " " otherwise.
A 10-fold cross-validation procedure shows that the root mean square error (RMSE) of the AIC selection model is lower than the one computed for the full linear regression model (Table 15). To answer Question 2, also in this case a variable selection procedure leads to a more accurate model. Furthermore, different variables are selected as relevant depending on the type of cement. While for the class CEM I the ICs TRPE and NRSF are discarded, for the classes CEM II and CEM IV the ones rejected by the AIC are Water and El, and the same ICs are selected for both classes. These ICs are also consistent with the ones chosen for the whole dataset (except for SRM). The model for the class CEM I is characterized by a different set of variables and associated sizes, including Water and El (Figure 7 shows the size of the coefficients for the aggregate data and for each class, Figure 7a and Figure 7b-d, respectively). It also confirms the qualitative interpretation of Figure 6, where the data belonging to CEM I were mostly isolated. The results of CEM I (Figure 7b) differ from those of CEM II and CEM IV (Figure 7c,d) due to its composition (i.e., at least 95% by mass of clinker and not more than 5% by mass of gypsum according to [9]), while the other cement types contain less clinker and other main constituents.
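The idea of fitting one regression per cement class, rather than a single pooled model, can be sketched as follows. This is a minimal illustration assuming NumPy and synthetic stand-in data; the class sizes, the three unnamed predictors and the per-class coefficients are hypothetical and unrelated to Table 8 or to the paper's estimates.

```python
import numpy as np

def rmse(y, pred):
    """Root mean square error between observations and predictions."""
    return float(np.sqrt(np.mean((y - pred) ** 2)))

def fit(X, y):
    """No-intercept least-squares coefficients."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Synthetic stand-in data (NOT the paper's dataset): each class follows its
# own linear relationship, with CEM I deliberately the most different.
rng = np.random.default_rng(4)
labels = np.array(["CEM I"] * 40 + ["CEM II"] * 90 + ["CEM IV"] * 60)
X = rng.normal(size=(len(labels), 3))
true_beta = {"CEM I": [0.9, 0.2, 0.0],
             "CEM II": [0.5, -0.4, 0.0],
             "CEM IV": [0.5, -0.3, 0.1]}
y = np.array([X[r] @ np.array(true_beta[c]) for r, c in enumerate(labels)])
y = y + rng.normal(scale=0.1, size=len(labels))

pooled_pred = X @ fit(X, y)  # one model for all classes
split_pred = np.empty_like(y)
per_class = {}
for c in np.unique(labels):
    mask = labels == c
    per_class[str(c)] = fit(X[mask], y[mask])  # one model per class
    split_pred[mask] = X[mask] @ per_class[str(c)]

rmse_pooled = rmse(y, pooled_pred)
rmse_split = rmse(y, split_pred)
print(f"pooled RMSE: {rmse_pooled:.4f}, per-class RMSE: {rmse_split:.4f}")
```

The in-sample error of the per-class fits can never exceed that of the pooled fit, so this comparison alone proves little; that is why a held-out measure such as the 10-fold cross-validation used in the text is needed to judge whether the split genuinely improves prediction.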

Other Emission Variables
This section focuses on producing accurate models for the other Emission ICs. Since in Section 3.3 subsampling the data with respect to the type of cement has led to a more accurate model, each type of cement is examined separately in this section as well. The first model fit here is a multiple linear regression model with all the emissions as target variables. The results are collected in Tables 16-18 for CEM I, II and IV, respectively.

A third model is finally produced by applying the AIC separately to each Emission IC for each class. Each of these models then selects the most important variables for the corresponding IC. The results are summarized in Tables 22-24 for CEM I, II, and IV, respectively, where significance is coded as "***" (p-value ≤ 0.001), "**" (p-value ≤ 0.01), "*" (p-value ≤ 0.05), "." (p-value ≤ 0.1), and " " otherwise. In answer to Question 3, the results in Table 25 highlight that, while in general using the same subset of Consumption ICs relevant to predict GWP does not improve the accuracy of the linear model, performing a variable selection procedure separately for each Emission IC leads to a meaningful improvement in the predictive accuracy of the model.
In Figure 8, all the important variables related to each Emission IC listed in Tables 19-21 are represented proportionally to their size. Figure 8 confirms that ADPf plays a pivotal role for the majority of the Emission ICs, that is, AP, EP, POCP and P (CEM I and CEM IV). CEM II differs from CEM I and CEM IV due to its limestone-based composition; in particular, POCP for CEM II has its highest correlation with the Water consumption IC. This is consistent with the upstream processes necessary to quarry natural raw materials.

Discussion
Due to its dependence on fossil fuels and the calcination of raw materials, the cement industry generates about 5% of global greenhouse gas emissions. Within this framework, several efforts are ongoing to protect the environment and increase energy efficiency using renewable resources or alternative fuels. To analyze comparable environmental performances, cement companies are conducting life cycle assessments of their "from cradle to gate" processes to identify the best strategies to meet the need for sustainable development.
In this study, having regard to the European approach compliant with the standard EN 15804, the environmental impacts of 193 different recipes of gray cement produced in Italy from 2014 to 2019 have been assessed. Fifteen different impact categories have been considered and split into two classes, "Emissions" and "Consumption". One of the main results of this work concerns the identification of the significant Consumption ICs to predict the behavior of the Emission ICs. In particular, this paper aims to answer the following questions:

1. Which ones are the most relevant variables among the Consumption ICs in order to explain the behavior of GWP?
2. Are different variables important to predict GWP for the four types of cement?
3. Are the relevant variables to predict GWP also useful to predict other Emission ICs?
As far as Question 1 is concerned, it is shown that the most important variable to predict the behavior of GWP is ADPf (Figure 8), while NRSF and RSF are the two other most relevant consumption variables. To answer Question 2, a more accurate model is produced by fitting a linear regression and applying the AIC for the different types of cement (i.e., CEM I, CEM II, CEM IV) separately. Also in this case, ADPf proves to be the most important Consumption IC. However, scatterplots related to GWP and the Consumption ICs show that the environmental performances of CEM I differ from those of the other types, and their values are higher. Predictive models built separately for each type of cement yielded more accurate predictions for GWP. Finally, concerning Question 3, the authors investigated whether the variables relevant to predict GWP could also predict the other Emission ICs. In this case, it is shown that fitting regression models separately and selecting the most important variables leads to more accurate predictions for all the other Emission ICs (Table 25) in comparison to the standard linear model or the one which uses the same Consumption ICs selected for GWP. Also in this case, ADPf is confirmed to be a strong predictor in the models related to the emission variables AP, EP, POCP (for CEM I and CEM II), and P (for CEM I and CEM IV). Therefore, the obtained results underline the need for policies and strategies that could reduce the consumption of energy, from both fossil and secondary fuels, and justify the European policies on emission trading and Best Available Techniques to be implemented in the cement industry.