1. Introduction
The food and drink industry is the largest producing sector globally, and the increased consumer demand for processed food products has led to consequential impacts on health and the environment [1]. Air pollution, considered one of our era’s greatest scourges, affects both the biotic and abiotic components of the environment. Any substance, whether solid, liquid, or gas, that is produced in concentrations high enough to reduce the quality of our environment is defined as a pollutant [2]. According to the World Health Organization (WHO), 99% of humans breathe air that exceeds WHO guideline limits and contains high levels of pollutants, with low- and middle-income countries subject to the highest exposures. Air quality is closely linked to the Earth’s climate and ecosystems, and air pollution is known to be the single largest environmental health risk factor globally. Many of the drivers of air pollution are also sources of greenhouse gas (GHG) emissions [3].
Particulate matter (PM) can be formed in the atmosphere by physicochemical reactions between pollutants already present there, or can be emitted directly to the atmosphere from anthropogenic activities and natural sources. The United States Environmental Protection Agency defines PM as a term for particles whose penetration depends on their diminutive size, ranging from particles with diameters of 10 micrometers (μm) or smaller, called $P{M}_{10}$, to extremely fine particles with diameters that are generally 2.5 μm and smaller, called $P{M}_{2.5}$ [2]. PM is most likely concentrated in cities and industrialized areas, as shown geographically in Figure 1. PM concentration levels are represented by color grading, where the intense green represents the highest mean values of these concentrations (μg/m${}^{3}$). The increase in atmospheric $P{M}_{2.5}$ concentration, air movement patterns, and the exposure of populations result in health and economic effects. Food system emissions alone account for about 22.4% of global mortality due to degraded air quality and 1.4% of global crop production losses [4]. A recent study [5] in the United States estimated that maize production causes 4300 cases of premature mortality annually. In fact, higher mortality rates were observed within the top five maize-producing states (Iowa, Illinois, Nebraska, Minnesota, and Indiana). Moreover, increased concentrations of $P{M}_{2.5}$ are driven by emissions of ammonia ($N{H}_{3}$), which result from nitrogen (N) fertilizer use [5].
Industrial facilities, such as power stations, refineries, petrochemical, chemical, and fertilizer industries, and metallurgical and other industrial plants, are major sources of pollutant emissions. GHG emissions from the agricultural sector increased by 10.1% from 1990 to 2018 and accounted for 9.9% of total US greenhouse gas emissions [6]. In fact, agricultural ${N}_{2}O$ emissions are projected to continue to rise [7]. Agricultural crop production, including farms and the supply chains that produce the chemical and energy inputs, contributes substantially to the emissions of GHGs, which include carbon dioxide ($C{O}_{2}$), nitrous oxide (${N}_{2}O$), methane ($C{H}_{4}$), and black carbon [4]. Nitrous oxide, being one of the most impactful GHGs, was chosen for this study. It is estimated that ${N}_{2}O$ emissions in the US account for approximately 75% of total emissions. Even though N is a limiting component for agricultural production, the increased value brought about by nitrogen-based fertilizer applications is outweighed by the costs of environmental nitrogen pollution, such as the eutrophication of rivers, loss of biodiversity, global warming, and stratospheric ozone depletion [8].
With the growing rate of big data evolution and its complexity, various prediction methods based on machine learning technologies have been developed for air quality problems [
9,
10]. Multiple linear regression (MLR) is one of the most popular tools capable of incorporating complex nonlinear relationships between the concentration of air pollutants and meteorological variables [
11].
This work is based on the Greenhouse Gas Emissions from Global Production and the use of Synthetic Nitrogen Fertilizers in the Agriculture dataset, from the Figshare repository, from the year 2018. The dataset’s authors also used it in a recently published work [12], where they estimated GHG emissions due to synthetic N fertilizer manufacture, transportation, and field use in agricultural systems. Most studies that tackle GHG emission problems with ML tools focus on $C{O}_{2}$ or $C{H}_{4}$ emissions [13]; very few papers address ${N}_{2}O$ emissions. In fact, this gas is 300 times more harmful to the climate than $C{O}_{2}$ and is steadily increasing in the atmosphere, with agriculture being the largest contributor and nitrogen the most used synthetic fertilizer [14].
In this study, we propose two expectile-based regression approaches, namely, expectile regression (ER) and the kernel expectile regression estimator (KERE). These approaches are flexible in application and well suited to heavy-tailed distributions and outliers. In this context, because only a few countries are considered agricultural-producing countries, information is concentrated in the tail of the distribution. We used expectile-based regression to take advantage of its parameterized nature, which allows for modeling different aspects of the distribution rather than only the mean.
The rest of this manuscript is structured as follows: Section 2 briefly outlines the works and studies related to our research. Section 3 and Section 4 investigate and analyze the dataset. Section 5 presents the results and the discussion. Finally, Section 6 presents concluding remarks and outlines future work.
4. Data Analysis
Generally, ML techniques can be categorized into supervised, unsupervised, semi-supervised, and reinforcement learning. Supervised ML approaches deal with a class of problems where each data sample is paired with a label. In particular, regression-type approaches learn an underlying function that provides a real value for each data sample. In this work, the goal is to explore the relationship between ${N}_{2}O$ gas emissions (direct and indirect), the quantity of applied nitrogen, and synthetic fertilizer sources. To this end, we present the following workflow, based on expectile regression models as our learning approach.
4.1. Expectile-Based Regression
4.1.1. Linear Expectile Regression
Linear expectile regression (ER) was first introduced by Newey and Powell in risk measurement [36]. This approach can be defined as the generalization of conditional expectation to model the relationship between a dependent variable and the covariates [37]. Multiple studies were conducted to explore ER performance, particularly when dealing with heavy-tailed distributions [38]. Although ER provides a complete picture of the data [39], its statistical properties are underexplored in contrast with other methods, such as linear regression and quantile regression [40, 41].
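Given a random variable Y, the expectile of level $\tau \in (0,1)$, denoted by ${\mu}_{\tau}$, is defined as follows:
$$ {\mu}_{\tau} = \underset{\mu \in \mathbb{R}}{\arg\min}\; \mathbb{E}\left[ {\varphi}_{\tau}(Y - \mu) \right], \qquad {\varphi}_{\tau}(u) = \left| \tau - \mathbb{1}(u < 0) \right| u^{2}, \tag{1} $$
where ${\varphi}_{\tau}$ is the asymmetric least square (ALS) loss function that assigns weights $\tau $ and $1-\tau $ to positive and negative deviations, respectively.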
Figure 5 provides example curves of
${\varphi}_{{\tau}_{i}}$ and expectile values
${\mu}_{{\tau}_{i}}$ with respect to different expectile levels
${\tau}_{i}$, respectively.
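The behavior of the expectile across levels can be illustrated numerically. The following is a minimal sketch, assuming the usual ALS loss $|\tau - \mathbb{1}(u<0)|\,u^2$; the function names and the toy sample are hypothetical and chosen only for illustration. It computes a sample expectile by iterating on the first-order condition, which states that the expectile is a weighted mean with weight $\tau$ on observations above it and $1-\tau$ on observations below it.

```python
def als_loss(u, tau):
    """Asymmetric least squares loss: weight tau if u >= 0, else 1 - tau."""
    weight = tau if u >= 0 else 1.0 - tau
    return weight * u * u


def sample_expectile(y, tau, tol=1e-10, max_iter=1000):
    """Sample expectile via fixed-point iteration on the first-order condition."""
    mu = sum(y) / len(y)  # start from the mean (the 0.5-expectile)
    for _ in range(max_iter):
        # Observations at or above the current value get weight tau,
        # observations below it get weight 1 - tau.
        weights = [tau if yi >= mu else 1.0 - tau for yi in y]
        new_mu = sum(w * yi for w, yi in zip(weights, y)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu


# Toy heavy-tailed sample (hypothetical values).
data = [1.0, 2.0, 3.0, 4.0, 100.0]
expectiles = {tau: sample_expectile(data, tau) for tau in (0.1, 0.5, 0.9)}
for tau, value in expectiles.items():
    print(f"tau={tau}: expectile={value:.3f}")
```

At $\tau = 0.5$ the expectile coincides with the sample mean, and larger levels move the estimate toward the upper tail, which is the property exploited when the tail of the distribution is of interest.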
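Let us suppose that we have n samples $({\mathbf{y}}_{i},{\mathbf{x}}_{i})$, $i=1,\dots ,n$, where ${\mathbf{x}}_{i}={(1,{\mathbf{x}}_{i,1},\dots ,{\mathbf{x}}_{i,p})}^{T}$ are the covariates. The expectiles defined in Equation (1) are used to set up the expectile regression, which assumes a linear model of the following form:
$$ {\mu}_{\tau}\left( {\mathbf{y}}_{i} \mid {\mathbf{x}}_{i} \right) = {\mathbf{x}}_{i}^{T} {\beta}_{\tau}. \tag{2} $$
The estimated coefficients ${\widehat{\beta}}_{\tau}$ can be obtained by minimizing the empirical loss function:
$$ {\widehat{\beta}}_{\tau} = \underset{\beta}{\arg\min}\; \frac{1}{n} \sum_{i=1}^{n} {\varphi}_{\tau}\left( {\mathbf{y}}_{i} - {\mathbf{x}}_{i}^{T} \beta \right). \tag{3} $$
To solve the optimization problem in Equation (3), we suggest the following Algorithm 1, based on iteratively reweighted least squares (IRLS) [42]: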
Algorithm 1: ALS for estimating ER coefficients.
Input: Measured dataset ${\left\{({\mathbf{x}}_{i},{\mathbf{y}}_{i})\right\}}_{i=1}^{n}$.
1. Initialize ${\beta}_{\tau ,0}$.
2. Use the empirical loss function in Equation (3).
3. Update the coefficients using the IRLS algorithm [42].
Output: The coefficients’ estimates ${\widehat{\beta}}_{\tau}$.
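The steps of Algorithm 1 can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions, not the code used in the study (which relied on R packages): the function name, the ordinary least squares initialization, the stopping rule, and the synthetic data are all hypothetical choices. Each iteration recomputes the asymmetric weights from the current residuals and solves a weighted least squares problem.

```python
import numpy as np


def expectile_regression_irls(X, y, tau, tol=1e-8, max_iter=100):
    """Linear expectile regression via iteratively reweighted least squares."""
    X1 = np.column_stack([np.ones(len(y)), X])       # prepend intercept column
    beta = np.linalg.lstsq(X1, y, rcond=None)[0]     # assumed start: OLS fit
    for _ in range(max_iter):
        resid = y - X1 @ beta
        w = np.where(resid >= 0, tau, 1.0 - tau)     # ALS weights
        XtW = X1.T * w                               # scales column i by w_i
        new_beta = np.linalg.solve(XtW @ X1, XtW @ y)
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta


# Synthetic heteroscedastic data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 1.0 + 2.0 * x + rng.standard_normal(200) * (1 + 0.3 * x)

beta_low = expectile_regression_irls(x, y, 0.1)
beta_mid = expectile_regression_irls(x, y, 0.5)
beta_high = expectile_regression_irls(x, y, 0.9)
print(beta_low, beta_mid, beta_high)
```

With $\tau = 0.5$ all weights are equal and the fit reduces to ordinary least squares, which gives a quick sanity check of the implementation.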

Linear expectile regression has shown great performance in contrast with classical regression approaches [36]. However, we may encounter more complex datasets for which ER might be too restrictive in terms of errors. To this end, researchers have developed more flexible methods, namely expectile regression with boosting (ER-boosting) [43] and a nonparametric estimator of conditional expectiles based on local linear polynomials with a one-dimensional covariate [44].
4.1.2. Kernel Expectile Regression Estimator (KERE)
In this work, we adopt a flexible method introduced in a recent study [45], based on exploiting the properties of reproducing kernel Hilbert spaces (RKHS) [46]. Let ${\mathbb{H}}_{K}$ denote a Hilbert space generated by a predefined kernel K. Given n samples $({\mathbf{y}}_{i},{\mathbf{x}}_{i})$, the kernel expectile regression estimator is derived from the following optimization problem:
$$ \underset{f \in {\mathbb{H}}_{K},\; {\alpha}_{0} \in \mathbb{R}}{\min}\; \sum_{i=1}^{n} {\varphi}_{\tau}\left( {\mathbf{y}}_{i} - {\alpha}_{0} - f({\mathbf{x}}_{i}) \right) + \lambda {\parallel f\parallel}_{{\mathbb{H}}_{K}}^{2}, \tag{4} $$
where f spans the Hilbert space, ${\parallel f\parallel}_{{\mathbb{H}}_{K}}^{2}$ is the squared norm of f in ${\mathbb{H}}_{K}$, ${\alpha}_{0}$ is the intercept, and $\lambda $ is the regularization parameter.
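Although the problem in Equation (4) is posed over an infinite-dimensional space, it reduces to a finite-dimensional one through the representer theorem and the reproducing property [47]. Thus, the function f in Equation (4) and its RKHS norm are expressed as follows:
$$ f(\mathbf{x}) = \sum_{k=1}^{n} {\alpha}_{k} K({\mathbf{x}}_{k}, \mathbf{x}), \qquad {\parallel f\parallel}_{{\mathbb{H}}_{K}}^{2} = \sum_{i=1}^{n} \sum_{j=1}^{n} {\alpha}_{i} {\alpha}_{j} K({\mathbf{x}}_{i}, {\mathbf{x}}_{j}), \tag{5} $$
where K is the kernel function and ${\alpha}_{k}\in \mathbb{R}$.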
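To this end, Equation (4) can be reformulated as follows:
$$ \underset{{\alpha}_{0},\dots ,{\alpha}_{n}}{\min}\; \sum_{i=1}^{n} {\varphi}_{\tau}\Big( {\mathbf{y}}_{i} - {\alpha}_{0} - \sum_{j=1}^{n} {\alpha}_{j} K({\mathbf{x}}_{i}, {\mathbf{x}}_{j}) \Big) + \lambda \sum_{i=1}^{n} \sum_{j=1}^{n} {\alpha}_{i} {\alpha}_{j} K({\mathbf{x}}_{i}, {\mathbf{x}}_{j}). \tag{6} $$
A compact formulation of Equation (6) using matrix notations is introduced as follows:
$$ \underset{\mathit{\alpha}}{\min}\; \sum_{i=1}^{n} {\varphi}_{\tau}\left( {\mathbf{y}}_{i} - {\mathit{K}}_{i}^{T} \mathit{\alpha} \right) + \lambda\, {\mathit{\alpha}}^{T} {\mathit{K}}_{0}\, \mathit{\alpha}, \tag{7} $$
where
$$ {\mathit{K}}_{i} = {\left( 1, K({\mathbf{x}}_{i}, {\mathbf{x}}_{1}), \dots , K({\mathbf{x}}_{i}, {\mathbf{x}}_{n}) \right)}^{T}, \qquad {\mathit{K}}_{0} = \begin{pmatrix} 0 & {\mathbf{0}}^{T} \\ \mathbf{0} & \mathit{K} \end{pmatrix}, \qquad \mathit{K} = {\left( K({\mathbf{x}}_{i}, {\mathbf{x}}_{j}) \right)}_{i,j}. $$
The proposed algorithm to solve Equation (7) relies on using majorization–minimization (MM) approaches [48]. The key idea is to find a surrogate function through the Taylor expansion that majorizes the objective function. Optimizing this surrogate function will either improve the value of the objective function or leave it unchanged.
Using the MM approach to solve Equation (7) yields the following formulation for iteratively updating $\mathit{\alpha}={\left({\alpha}_{0},{\alpha}_{1},\cdots ,{\alpha}_{n}\right)}^{T}$:
$$ {\mathit{\alpha}}^{(t+1)} = {\mathit{\alpha}}^{(t)} + {\mathit{K}}_{u}^{-1} \Big( -\lambda {\mathit{K}}_{0} {\mathit{\alpha}}^{(t)} + \frac{1}{2} \sum_{i=1}^{n} {\varphi}_{\tau}^{\prime}\big( {r}_{i}^{(t)} \big) {\mathit{K}}_{i} \Big), \qquad {r}_{i}^{(t)} = {\mathbf{y}}_{i} - {\mathit{K}}_{i}^{T} {\mathit{\alpha}}^{(t)}, \tag{10} $$
where ${\mathit{K}}_{u}^{-1}$ is the inverse matrix of ${\mathit{K}}_{u}$ defined as follows:
$$ {\mathit{K}}_{u} = \max(\tau , 1-\tau ) \sum_{i=1}^{n} {\mathit{K}}_{i} {\mathit{K}}_{i}^{T} + \lambda {\mathit{K}}_{0}. $$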
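The surrogate behind the MM step can be sketched as follows. This is a reconstruction under the assumption that the ALS loss has the usual form $\varphi_\tau(u) = |\tau - \mathbb{1}(u<0)|\,u^2$; the constant $c$ is notation introduced here for illustration only. Because the derivative of the loss is Lipschitz continuous with constant $2c$, each loss term admits a quadratic upper bound that is tight at the current residual:

```latex
\varphi'_\tau(u) = 2\,\lvert \tau - \mathbb{1}(u<0) \rvert\, u,
\qquad c = \max(\tau,\, 1-\tau),
\\[4pt]
\varphi_\tau(a) \;\le\; \varphi_\tau(b) + \varphi'_\tau(b)\,(a-b) + c\,(a-b)^2
\quad \text{for all } a, b \in \mathbb{R}.
```

Summing this bound over the residuals and adding the penalty term gives a quadratic function of $\mathit{\alpha}$ that majorizes the objective in Equation (7). Its minimizer is available in closed form, and the matrix that has to be inverted does not depend on the current iterate, so it can be computed once and reused across iterations.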
Algorithm 2 summarizes the steps to reach the KERE estimates $\widehat{\mathit{y}}$ of the output $\mathit{y}$.
Algorithm 2: Kernel expectile regression estimator.
Input: Dataset ${\left\{({\mathit{x}}_{i},{y}_{i})\right\}}_{i=1}^{n}$, kernel function $K(.,.)$, tolerance $\epsilon$, maximum iterations ${i}_{max}$.
1: Calculate the kernel matrix $\mathit{K}={\left(K({\mathit{x}}_{i},{\mathit{x}}_{j})\right)}_{i,j}$.
2: Initialize ${\mathit{\alpha}}^{\left(0\right)}$, ${r}_{i}^{\left(0\right)}$ and $t\leftarrow 0$.
3: While ${r}_{i}^{\left(t\right)}\ge \epsilon$ and $t\le {i}_{max}$:
● Calculate updated residue ${r}_{i}^{\left(t\right)}={y}_{i}-{\mathit{K}}_{i}^{T}{\mathit{\alpha}}^{\left(t\right)}$
● Update ${\mathit{\alpha}}^{\left(t\right)}$ based on Equation (10).
● $t\leftarrow t+1$
4: Calculate the output estimator ${\widehat{y}}_{i}={\alpha}_{0}^{\left(t\right)}+{\displaystyle \sum _{j=1}^{n}}{\alpha}_{j}^{\left(t\right)}K\left({\mathit{x}}_{j},{\mathit{x}}_{i}\right)$
Output: The vector of estimates $\widehat{\mathit{y}}$.
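The steps of Algorithm 2 can be sketched with NumPy as follows. This is an illustrative reconstruction under several assumptions, not the implementation used in the study (which relied on the R package “KERE”): the Gaussian kernel is scaled as $\exp(-\|x-x'\|^2/(2\sigma^2))$, the update uses the majorizing matrix $\max(\tau,1-\tau)\sum_i K_i K_i^T + \lambda K_0$, iterations stop when the update step is smaller than the tolerance, and a small diagonal jitter is added purely for numerical stability. All function names and the synthetic data are hypothetical.

```python
import numpy as np


def rbf_kernel(A, B, sigma):
    """Gaussian RBF kernel matrix; the 2*sigma^2 scaling is an assumption."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def kere_fit(X, y, tau, lam, sigma, tol=1e-6, max_iter=4000, jitter=1e-6):
    """MM iterations for kernel expectile regression; returns (alpha_0, ..., alpha_n)."""
    n = len(y)
    K = rbf_kernel(X, X, sigma)
    K_rows = np.column_stack([np.ones(n), K])    # row i holds K_i^T
    K0 = np.zeros((n + 1, n + 1))
    K0[1:, 1:] = K                               # penalty leaves the intercept free
    c = max(tau, 1.0 - tau)
    # Majorizing matrix; the jitter is an added numerical safeguard (assumption).
    Ku = c * K_rows.T @ K_rows + lam * K0 + jitter * np.eye(n + 1)
    Ku_inv = np.linalg.inv(Ku)                   # computed once, reused below
    alpha = np.zeros(n + 1)
    for _ in range(max_iter):
        r = y - K_rows @ alpha                   # current residuals
        dphi = 2.0 * np.where(r >= 0, tau, 1.0 - tau) * r   # ALS loss derivative
        step = Ku_inv @ (-lam * (K0 @ alpha) + 0.5 * (K_rows.T @ dphi))
        alpha = alpha + step
        if np.max(np.abs(step)) < tol:
            break
    return alpha


def kere_predict(alpha, X_train, X_new, sigma):
    """Evaluate alpha_0 + sum_j alpha_j K(x_j, x) at new points."""
    return alpha[0] + rbf_kernel(X_new, X_train, sigma) @ alpha[1:]


def kere_objective(alpha, X, y, tau, lam, sigma):
    """Penalized ALS objective, used here only to check that MM decreases it."""
    K = rbf_kernel(X, X, sigma)
    r = y - alpha[0] - K @ alpha[1:]
    loss = np.sum(np.where(r >= 0, tau, 1.0 - tau) * r ** 2)
    return loss + lam * alpha[1:] @ K @ alpha[1:]


# Synthetic nonlinear data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(40, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.standard_normal(40)

alpha_lo = kere_fit(X, y, tau=0.1, lam=0.1, sigma=1.0)
alpha_hi = kere_fit(X, y, tau=0.9, lam=0.1, sigma=1.0)
fit_lo = kere_predict(alpha_lo, X, X, sigma=1.0)
fit_hi = kere_predict(alpha_hi, X, X, sigma=1.0)
print(fit_lo.mean(), y.mean(), fit_hi.mean())
```

Because the intercept is not penalized, the fitted curve for a high expectile level sits above most of the data on average, and the curve for a low level sits below it, which is what the printed means are meant to show.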
4.2. Experimental Setup
In this work, we evaluated some of the expectile-based approaches, namely, expectile regression (ER) [36] and the kernel expectile regression estimator (KERE) [45], on the GHG emission dataset. As detailed in Section 4.1, both ER and KERE depend on the chosen expectile level, denoted w in the remainder of this work ($\tau $ in Section 4.1). To depict the performance relative to the expectile level, we construct multiple models of ER and KERE using multiple expectile levels spanning the following values:
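In addition, KERE models require the kernel function to be selected. Although various kernels are available for use, we choose the well-known radial basis function (RBF) kernel defined in Equation (11):
$$ K({\mathbf{x}}_{i}, {\mathbf{x}}_{j}) = \exp\left( -\frac{{\parallel {\mathbf{x}}_{i} - {\mathbf{x}}_{j} \parallel}^{2}}{2{\sigma}^{2}} \right), \tag{11} $$
where $\sigma $ stands for the bandwidth.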
For KERE models, we perform a two-dimensional 8-fold cross-validation to select the optimal hyperparameters $(\lambda ,\sigma )$, where $\lambda $ and $\sigma $ stand for the regularization parameter and the RBF kernel bandwidth, respectively. Moreover, the maximum number of iterations ${i}_{max}$ and the tolerance value $\epsilon$ are fixed to 4000 and ${10}^{-6}$, respectively.
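The two-dimensional search over $(\lambda, \sigma)$ can be sketched as a plain grid search with 8 folds. To keep the example short and self-contained, the learner below is kernel ridge regression, used here as a stand-in and not KERE itself; any model parameterized by $(\lambda, \sigma)$ could be plugged in instead. The grids, the squared-error score, the kernel scaling, and all function names are hypothetical choices made for illustration.

```python
import numpy as np
from itertools import product


def rbf_kernel(A, B, sigma):
    """Gaussian RBF kernel matrix; the 2*sigma^2 scaling is an assumption."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def fit_predict_kernel_ridge(X_tr, y_tr, X_te, lam, sigma):
    """Stand-in learner (kernel ridge regression), NOT the KERE estimator."""
    K = rbf_kernel(X_tr, X_tr, sigma)
    coef = np.linalg.solve(K + lam * np.eye(len(y_tr)), y_tr)
    return rbf_kernel(X_te, X_tr, sigma) @ coef


def grid_search_cv(X, y, lambdas, sigmas, n_folds=8, seed=0):
    """Return the (lambda, sigma) pair with the lowest mean validation error."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    scores = {}
    for lam, sigma in product(lambdas, sigmas):
        fold_errors = []
        for k in range(n_folds):
            test_idx = folds[k]
            train_idx = np.concatenate(
                [folds[j] for j in range(n_folds) if j != k]
            )
            pred = fit_predict_kernel_ridge(
                X[train_idx], y[train_idx], X[test_idx], lam, sigma
            )
            fold_errors.append(np.mean((y[test_idx] - pred) ** 2))
        scores[(lam, sigma)] = float(np.mean(fold_errors))
    best = min(scores, key=scores.get)
    return best, scores


# Synthetic data and hypothetical grids, for illustration only.
rng = np.random.default_rng(1)
X = rng.uniform(0, 5, size=(80, 1))
y = np.sin(X[:, 0]) + 0.2 * rng.standard_normal(80)
best_params, cv_scores = grid_search_cv(
    X, y, lambdas=[1e-3, 1e-1, 1.0], sigmas=[0.1, 1.0, 10.0]
)
print("selected (lambda, sigma):", best_params)
```

Using the same fold assignment for every pair on the grid keeps the comparison between hyperparameter settings fair, since each candidate is scored on identical validation splits.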
Furthermore, we select the KERE model corresponding to the expectile level of interest, $w=0.7$, to conduct a benchmark comparison with state-of-the-art regression models. Table 2 summarizes the selected regression methods to be compared with KERE, as well as their respective characteristics. We also conduct an 8-fold cross-validation to tune each model’s hyperparameters.
The evaluation process is twofold. First, we compare ER and KERE models using a customized error metric, namely, the mean absolute deviation MAD(w) defined below. This type of error reflects the model fit with an emphasis on the tails by assigning weights $w$ and $1-w$ to positive and negative deviations, respectively. Second, a subset of the KERE models is selected to be compared with state-of-the-art regression approaches using mean absolute error (MAE) and root-mean-square error (RMSE).
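Mean absolute deviation MAD(w):
$$ \mathrm{MAD}(w) = \frac{1}{n} \sum_{i=1}^{n} \left| w - \mathbb{1}({y}_{i} < {\widehat{y}}_{i}) \right| \left| {y}_{i} - {\widehat{y}}_{i} \right| $$
Mean absolute error (MAE):
$$ \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| {y}_{i} - {\widehat{y}}_{i} \right| $$
Root-mean-square error (RMSE):
$$ \mathrm{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} {\left( {y}_{i} - {\widehat{y}}_{i} \right)}^{2} } $$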
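The three metrics can be computed with a few lines of plain Python. This is a minimal sketch: the weighting used for MAD(w) follows the description above (weight w on positive deviations and 1 − w on negative ones) and is an assumption about the exact form, and the function names and the toy values are hypothetical.

```python
import math


def mad_w(y_true, y_pred, w):
    """Asymmetrically weighted mean absolute deviation at expectile level w."""
    total = 0.0
    for y, y_hat in zip(y_true, y_pred):
        r = y - y_hat
        weight = w if r >= 0 else 1.0 - w   # w for positive, 1 - w for negative
        total += weight * abs(r)
    return total / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(y - y_hat) for y, y_hat in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root-mean-square error."""
    mse = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


# Toy values: the residuals are -1, 0 and 2.
y_true = [1.0, 2.0, 3.0]
y_pred = [2.0, 2.0, 1.0]
print(mad_w(y_true, y_pred, 0.7), mae(y_true, y_pred), rmse(y_true, y_pred))
```

With w = 0.5 the weighted deviation is exactly half of the MAE, so the two metrics only differ in how they treat under- and over-prediction when w moves away from 0.5.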
4.3. Benchmark Methods
In order to assess the performance of the kernel expectile regression estimator on the proposed dataset, we compare it to twelve other benchmark regression approaches. As detailed in Table 2, we use support vector regression, lasso, light gradient boosting machine, random forest, K-neighbors, extra trees, AdaBoost, gradient boosting, decision tree, Huber, multilayer perceptron, and ridge regressors. Table 2 summarizes the techniques considered as well as their hyperparameters to be tuned using K-fold cross-validation.
4.4. Computational Software
The computational software for this study was written using both RStudio and Python. RStudio was used for comparing the kernel expectile regression estimator and linear expectile regression, which were implemented using the “KERE” and “Expectreg” libraries [49], respectively. On the other hand, Python was used to compute the comparison with the regression benchmark approaches detailed in Table 2. The benchmark comparison was conducted using the PyCaret library (version 3.0.0rc4), specifically the pycaret.regression module.
All of the aforementioned regression techniques were computed using an 8-fold cross-validation to tune the corresponding hyperparameters. The advantage of the PyCaret library in Python is the agility of its functions, particularly compare_models, in setting up the proper framework for a fair comparison.
5. Results and Discussion
Firstly, we report the results of two expectile-based approaches, namely the kernel expectile regression estimator (KERE) and expectile regression (ER). The two methods were applied to predict both direct and indirect ${N}_{2}O$ emissions. First, KERE and ER are compared in both the training and testing phases. We evaluate the models primarily using the mean absolute deviation error (MAD), which varies with respect to the expectile level. In addition, we report the mean absolute error (MAE), root-mean-square error (RMSE), and R². Second, we chose the KERE model corresponding to the expectile level $w=0.7$ for comparison with state-of-the-art regression approaches using R², MAE, and RMSE.
We report the results for both the training and testing of KERE and ER regarding the direct emissions in Table 3 and Table 4, respectively. KERE outperforms ER in all reported metrics, namely MAD, RMSE, and MAE. This is because KERE is able to capture nonlinear behavior through the kernel trick. In addition, the ER approach reports an increasing MAD error as the expectile level increases, whereas KERE stays relatively stable. This is reflected by the mean absolute deviation of ER (0.8) being 0.308 compared to KERE (0.7) being 0.041. In addition, R² values drop significantly between training and testing for both KERE and ER, which highlights the failure to explain the variance of the direct emissions.
One KERE model corresponding to expectile level $w=0.7$ is selected from Table 4 to be compared with the benchmark approaches summarized in Table 2. Table 5 summarizes the performance of the benchmark regressors in contrast with KERE models, where ${R}^{2}$, MAE and RMSE are reported.
It is shown that the KERE R-squared values are slightly low but significantly better than the rest of the regression approaches (18%). We can argue that the usage of fewer fertilizers implies fewer factors implicated in the agricultural processes. For example, in small countries where agriculture is not the main activity, the source of N-based fertilizers does not necessarily explain ${N}_{2}O$ emissions. However, when N-based fertilizers are applied in bigger quantities, the source explains more about the direct emissions.
Secondly, we report the performance analysis of KERE and ER with respect to indirect emissions. Table 6 and Table 7 summarize the performance evaluation of both approaches in the training and testing phases. The MAD, RMSE, and MAE metrics are reported with respect to the various expectile levels. Similar to the models reported previously, KERE models significantly outperform ER models. The MAD evaluation metric of KERE and ER on the testing set is 0.046 and 2.762, respectively, for expectile level $w=0.7$, while KERE and ER report an MSE metric of 0.391 and 3.010 at the same expectile level.
Similar to the previously mentioned results, the KERE model corresponding to the expectile level
$w=0.7$ is selected to be compared with the benchmark regression methods by reporting MAE, RMSE, and R². As outlined in
Table 8, the selected KERE model performed significantly better, especially with regard to the RMSE metric evaluation.
Reporting the training and testing results for both direct and indirect emissions showed that the kernel expectile regression estimator outperforms its counterpart, expectile regression, especially with regard to the MAD evaluation error. Furthermore, the KERE model corresponding to expectile level $w=0.7$ was selected to focus on the data closer to the tail of the distribution. The latter corresponds to the range of medium to large countries, showing that such a model, in addition to being flexible, performs better than all other considered benchmark regression approaches.
The KERE technique is indeed an explainable approach, allowing us to explore the relationship between synthetic N fertilizer use and global ${N}_{2}O$ emissions. Reducing N rates is not the main factor for reducing GHG emissions, nitrous oxide in particular. The adoption of lower N rates underestimates ${N}_{2}O$ emissions [50], as there are many factors at stake, such as the management of fertilizer applications while enhancing N use efficiency (NUE).
N-based fertilizers vary depending on the N form they contain, either ammonium $N{H}_{4}^{+}$ or nitrate $N{O}_{3}^{-}$. In fact, $N{H}_{4}^{+}$ is the starting point from which soil microorganisms perform the nitrification process to form $N{O}_{3}^{-}$, which other soil microorganisms convert to ${N}_{2}$ gas through the process of denitrification. ${N}_{2}O$ gas is emitted throughout this process, and this is the direct pathway of ${N}_{2}O$ emissions. Referring to Figure 2, which highlights N pathways, and based on a recent study [51], there appears to be a positive correlation between soil moisture content and cumulative ${N}_{2}O$ emissions. When water content is high in the soil, it was suggested that the required conditions for denitrification are met, leading to higher $N{O}_{3}^{-}$ concentrations in the soils, providing N substrate for the production of ${N}_{2}O$. In fact, floods and rain may have an impact on GHG emissions, and increasing precipitation may enhance soil ${N}_{2}O$ emissions [52].
A first on-farm study [50] was conducted to report the ${N}_{2}O$ response to multiple fertilizer rates on production-scale fields. The authors observed linear and nonlinear increases in ${N}_{2}O$ depending on the study location and year. However, nonlinear exponential response models best represented the overall ${N}_{2}O$ response to N fertilizer across all site-years. A more recent study [53] also demonstrated the nonlinear relationship between the application of nitrogen-based fertilizers and ${N}_{2}O$ emissions, explaining how it changes depending on the meteorological circumstances, while the correlation between ${N}_{2}O$ emissions and the N-fertilizer rate used remains unclear. Another study showed that nitrogen application alone does not stimulate ${N}_{2}O$ emissions as much as the combination of nitrogen addition and rainfall reduction [54]. Nonetheless, nitrogen fertilization is only one external factor; other management practices, such as irrigation, tillage, or even the crop type used, generate ${N}_{2}O$ as a side effect of their application. Cropping systems have an impact on soil quality and soil GHG emissions. A recent study [55] investigated the importance of combining winter cover crop cultivation for single cropping systems with reduced N fertilizer application. This work supports our results: the cropping system and the N application rate impact GHG emissions and direct ${N}_{2}O$ emissions. In comparison with traditional cotton cultivation, where ${N}_{2}O$ and other GHG emissions increased, cover cropping with reduced N helped to mitigate soil GHG emissions. Furthermore, geographical location may have an impact as well. For example, countries such as China, the USA, Canada, India, and Brazil are known for their enormous agricultural production; hence, their diverse soil characteristics, diverse climates, and various crop types are greatly responsible for the increase in GHG emissions. In the case of China, rice represents the most economically important crop, and rice paddies are an excellent environment for the biological activities of nitrification and denitrification, which are accelerated especially in flooded soils, leading to enormous ${N}_{2}O$ production, as described in [56]. Other N-fertilization management techniques also have an impact on ${N}_{2}O$ emissions; with increasing N fertilizer use, adopting 4R Nutrient Stewardship has significant potential to lower ${N}_{2}O$ emissions [57].
The results also show that the source of synthetic N fertilizers does not contribute as much to indirect emissions of ${N}_{2}O$. When it comes to indirect emissions, research has shown that the leaching and runoff of nitrate from the application of synthetic N fertilizers is a substantial indirect source of ${N}_{2}O$ emissions from groundwater. This indicates that indirect emissions are not related to the source of N fertilizer itself, but depend mainly on the N content in agricultural soils that could be lost through leaching and runoff [58].
6. Conclusions and Future Works
The food supply chain is a salient contributor to global emissions of air pollutants. Major air pollutant compounds, such as ${N}_{2}O$, are emitted at different stages of the food system, from food production through processing, packaging, transport, retail, and consumption to disposal. This work used an explainable approach in order to explore the relationship between synthetic N fertilizers and global ${N}_{2}O$ emissions.
According to our results and the findings within the literature, we draw two major conclusions:
Using the kernel expectile regression estimator approach is highly suitable when dealing with air-quality-related data.
Integrating additional external factors is highly recommended, both for more accurate results and in order to interpret our outputs.
Based upon the results from the KERE method, the source of N-based fertilizers is not capable of explaining ${N}_{2}O$ emissions on its own. On the one hand, there is a disparity in the degree of ${N}_{2}O$ emissions between countries, where many national GHG inventories employ the IPCC’s first-tier method, which uses a set of global default emission factors (EF) to estimate ${N}_{2}O$ emissions across varying geographies. On the other hand, N-fertilizer sources and quantities are not the only major contributors to nitrous oxide emissions; in fact, a significant portion of greenhouse gas emissions is now caused by farming activities, either directly or indirectly, leading to a decay in air quality. Agricultural practices are of great importance; hence, we propose that reducing the usage of synthetic N fertilizers is not the most suitable solution to reduce ${N}_{2}O$ emissions. However, there is a need to monitor nitrogen-based fertilizer usage while keeping an eye on farming management practices as a whole. For example, conventional tillage is a major contributor to ${N}_{2}O$ emissions, whereas no-tillage can decrease ${N}_{2}O$ emissions in the presence of different fertilizer treatments. Furthermore, our research suggests the necessity for a more specific and detailed estimation of ${N}_{2}O$ emissions, based on each country and over a longer period of time. In order to establish a more effective study, we also encourage agribusinesses, organizations, and farmers to improve data availability. Bridging the gap between research and industry is a challenging step. Gaining insight from the data is very important; hence, increasing the usage of explainable machine learning tools is of great significance toward integrating interpretability into air quality models.