Rethinking Machine Learning Evaluation in Waste Management Research

Mullane, Paul; Fitzpatrick, Colin; Grua, Eoin Martino

doi:10.3390/su172310707

Open AccessArticle

Rethinking Machine Learning Evaluation in Waste Management Research

by

Paul Mullane

^1,2,*

,

Colin Fitzpatrick

¹

and

Eoin Martino Grua

^1,2

¹

Department of Electronic & Computer Engineering, University of Limerick, V94 T9PX Limerick, Ireland

²

Data-Driven Computer Engineering (D²iCE) Group, University of Limerick, V94 T9PX Limerick, Ireland

^*

Author to whom correspondence should be addressed.

Sustainability 2025, 17(23), 10707; https://doi.org/10.3390/su172310707

Submission received: 21 October 2025 / Revised: 24 November 2025 / Accepted: 28 November 2025 / Published: 29 November 2025

(This article belongs to the Special Issue Waste Management for Sustainability: Emerging Issues and Technologies)

Download

Browse Figures

Versions Notes

Abstract

Reliable model evaluation is critical in waste management research, where machine learning is increasingly used to inform policy, circular economy strategies and progress towards the United Nations Sustainable Development Goals. However, common evaluation practices often fail to account for key methodological challenges, risking misleading conclusions. This study presents a theoretical analysis supplemented with a practical example of municipal solid waste generation in Ireland to demonstrate how standard evaluation metrics can produce distorted results. In particular, the widespread use of the

R^{2}

in waste management/sustainability machine learning is examined, showing its susceptibility to inflation when data exhibit strong correlations, temporal dependence or non-linear model structures. The findings show that reliance on the

R^{2}

misrepresents model performance under conditions typical of waste datasets. In the Irish example, the

R^{2}

often suggested a degradation of predictive ability even when error-based metrics, such as root mean squared error (RMSE) and mean absolute error (MAE), indicated improvement or stability. These results demonstrate the need for evaluation frameworks that move beyond single, correlation-based metrics. Future work should focus on developing and standardising robust practices to ensure that machine learning can support transparent, reliable and effective decision-making in waste management and circular economy contexts.

Keywords:

circular economy; model performance metrics; multicollinearity; trend; R²

1. Introduction

Machine learning (ML) is a subset of artificial intelligence that enables computers to improve performance on specific tasks by learning from data, without being explicitly programmed [1]. Its ability to generalise from examples has made it a powerful tool for prediction, classification and decision-making across disciplines. With advances in computational power and data availability, ML has increasingly been applied to sustainability research, offering new approaches for modelling resource flows, predicting waste generation and informing circular economy (CE) strategies. Reliable forecasting of municipal solid waste is essential for achieving Sustainable Development Goal (SDG) 12 (responsible consumption and production) and for implementing the 3Rs principle (reduce, reuse, recycle) that underpins CE policy [2].

Municipal solid waste (MSW) presents a particular challenge for both policymakers and researchers. Globally, MSW generation is projected to reach between 2.89 and 4.54 billion tonnes by 2050, driven by urbanisation and economic growth [3]. In the European Union, MSW management is central to achieving CE objectives and compliance with recycling targets set under Directive 2008/98/EC as amended by 2018/851 [4]. Accurate forecasting of MSW is therefore essential for infrastructure planning, investment decisions and monitoring progress towards SDG 12.

As interest in ML grows, more researchers are turning to these methods to address environmental and resource management challenges [5]. However, the rapid uptake of ML has brought methodological risks that can undermine the credibility of findings. A central concern is model evaluation. The choice of evaluation metric directly shapes how results are interpreted, and in sustainability contexts, may influence policy and resource allocation. Misapplied metrics risk producing misleading conclusions and suboptimal decisions.

As highlighted by Pennington et al. [6], to further the use of data science in sustainability applications, focus must be applied to methodological knowledge and underlying assumptions, as well as possible pitfalls that researchers may encounter. While Zhu et al. [7] cautions against relying on a single metric, the limitations of

R^{2}

in this context remain largely unexamined. The objective of this study is to provide a critical assessment of the

R^{2}

in ML applications to waste management, given that recent studies that forecast MSW using ML techniques report the

R^{2}

as an evaluation metric [8,9,10,11,12,13,14].

To this end, three contributions are made. Firstly, a practical example of Irish MSW generation demonstrates how reliance on the

R^{2}

can distort conclusions about model performance under the data constraints typical in this field. Secondly, the theoretical limitations of the

R^{2}

are examined, with emphasis on why its assumptions are frequently violated in waste management and sustainability applications. Thirdly, the broader implications for researchers and policymakers are considered, with recommendations made on evaluation metrics and practices better suited to sustainability data. While the statistical limitations of the

R^{2}

are documented in the statistical contexts, the statistic is yet to be criticised in the ML literature. The instability of the statistic has not been systematically linked to the data characteristics typical of waste management ML. The contribution of this study is a theoretical analysis showing how these features undermine the stability of the

R^{2}

for ML waste management research, supported by an illustrative Irish MSW example and a practical workflow for more reliable model evaluation in sustainability applications.

2. Evaluation Metrics in the Literature

This section reviews the evaluation metrics most commonly used in waste management-related ML research, with particular attention to MSW studies. Across the 18 studies identified, the three dominant metrics were the coefficient of determination (

R^{2}

), the root mean squared error (RMSE) and the mean absolute error (MAE). Their prevalence is consistent with broader patterns in environmental ML applications [7] and with findings from Allen et al. [15], who observed similar metric choices in ML work aligned with the SDGs. Thus, while this section focuses on MSW forecasting, the observations made here are likely generalisable across sustainability research. Table 1 summarises the metrics used in each study. Only one of the 18 did not report at least one of

R^{2}

, RMSE, or MAE [16]. It is important to emphasise that Table 1 reflects the popularity of metrics rather than their validity. Frequency of use reflects convention rather than validation and therefore does not constitute evidence of their suitability for waste management applications.

Although the

R^{2}

, RMSE and MAE remain the dominant metrics in waste management ML, recent developments in probabilistic and interval-based model evaluation offer alternative perspectives on predictive performance. These include prediction interval metrics [17], as well as probabilistic scoring rules such as the continuous ranked probability score (CRPS), which has been refined and operationalised in recent statistical and ML literature [18]. Quantile-based scoring rules, including the pinball loss used in distributional and multi-horizon forecasting, have also gained increasing attention in ML contexts [19]. While such approaches are not yet common in waste management ML, they represent promising directions for future methodological development, particularly where predictive uncertainty is relevant to decision-making.

Table 1. References to the metrics used in the identified literature.

Metrics	References
$R^{2}$	[8,9,10,11,12,13,14,20,21,22,23,24,25,26,27]
RMSE	[8,10,12,13,14,20,23,25,26,28]
MAE	[10,13,14,20,22,23,25,27,28,29]

While the focus of this research is on waste management, the

R^{2}

has been used to evaluate machine learning models across sustainability and environmental research. Recent reviews in areas such as water quality/hydrology [30,31], air quality [32], renewable energy [33] and ecological modelling [34] have all documented the use of the

R^{2}

. Despite its prevalence, these reviews do not assess the suitability of the

R^{2}

itself. This further highlights the need for work that examines the use of the

R^{2}

in applied modelling contexts, such as waste management.

2.1. The $R^{2}$

The coefficient of determination (

R^{2}

) is a measure used to assess the goodness of fit of a model. For linear regression models fitted by ordinary least squares, it quantifies the proportion of variance in the dependent (what is being predicted) explained by the covariates (inputs to the models). Hence, it should take a value between 0 and 1 [35]. The formula for the

R^{2}

is shown in Equation (1).

R^{2} = 1 - \frac{\sum_{i} {(y_{i} - \hat{y_{i}})}^{2}}{\sum_{i} {(y_{i} - \bar{y})}^{2}} .

(1)

where

y_{i}

are the dependent values,

\hat{y_{i}}

are the values predicted by the model (fitted values) and

\bar{y}

is the mean of the dependent.

The

R^{2}

is used extensively in research applying ML to waste management. Of the 18 studies identified, 15 of them used the

R^{2}

for evaluating their ML models. This makes it the most popular metric for assessing ML models among the studies identified. However, as will be discussed in Section 5, the application of

R^{2}

in ML has significant limitations that can lead to misleading conclusions.

2.2. Error-Based Metrics

Error-based metrics provide a direct measure of prediction accuracy, which is crucial for assessing real-world applicability. Of the error metrics used in the literature, the two most common were the root mean squared error (RMSE) and mean absolute error (MAE), which both appeared 10 times in the 18 studies considered. The RMSE measures the average magnitude of the residuals (errors) [36]. It is defined in Equation (2). The MAE is defined in Equation (3) and measures the average magnitude of residuals, without consideration of their direction. RMSE is more sensitive to large errors due to the squared residuals, while MAE provides a more balanced assessment by averaging absolute errors. Both RMSE and MAE are on the same scale as the dependent, allowing for interpretation of the errors in model prediction. RMSE is defined in Equation (2) while MAE is defined in Equation (3).

RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}}

(2)

MAE = \frac{1}{n} \sum_{i = 1}^{n} | y_{i} - {\hat{y}}_{i} |

(3)

In Equations (2) and (3), n is the number of predictions made by the model,

y_{i}

is the ith dependent value and

{\hat{y}}_{i}

is the ith predicted value.

Other error-based metrics were also used in the literature; however, they were not as common as the RMSE and MAE. Examples include the mean absolute percentage error (MAPE) and the mean squared error (MSE, which is the same as the RMSE, without the square root). The MAPE and MSE were used six and four times, respectively.

2.3. Criticism of the $R^{2}$ in the Literature

Numerous publications have outlined the limitations and problems associated with using

R^{2}

s for non-linear regression models [37,38,39,40,41,42]. Many of these publications are from the 20th century, and it seems their lessons have been forgotten. More recently, Speiss and Neumeyer [43] used a Monte Carlo approach to demonstrate the inadequacy of

R^{2}

s for selecting non-linear models in a pharmacological setting. These authors found that the

R^{2}

struggled to select the optimal model and was biassed towards models with more parameters.

Despite these observed limitations in a more classical regression setting, there is a surprising lack of research addressing the misuse of the

R^{2}

in machine learning, particularly within the context of waste management.

3. Methods

Having identified and discussed three commonly used evaluation metrics in waste management ML research, their appropriateness is examined through a practical example on Irish MSW. The MSE is also included for completeness. The purpose of this example is not to provide an exhaustive cross-country analysis but rather to offer a representative case that mirrors the type of data, modelling practices and constraints frequently encountered in the literature. The structural issues that affect the behaviour of the

R^{2}

are widespread across sustainability and waste management datasets. The Irish case therefore serves as an illustrative example of broader methodological challenges.

In this section, four models are constructed to predict municipal waste generation in Ireland using socio-economic variables as covariates. These models include a mean baseline model that always predicts the training mean, a naive baseline model that predicts the last observed value, a support vector machine (SVM), and a neural network. The comparison of these models highlights the importance of selecting appropriate evaluation metrics and demonstrates that the

R^{2}

can behave inconsistently and misleadingly when applied to ML models. All data preparation, modelling and evaluation procedures were conducted in R version 4.4.1 [44].

3.1. Data Acquisition and Processing

Annual Irish municipal solid waste (MSW), recorded in kilograms per capita, was obtained from Eurostat [45]. The dataset comprises 19 observations for 2000–2020, with values for 2013 and 2015 missing. Eurostat is the source used in many MSW forecasting studies identified in Section 2 [11,12,21,22]. The two missing values were linearly interpolated prior to splitting the data into training and test sets, following common practice for addressing small, isolated gaps in time series data [24,46]. Because these are nationally compiled statistics used for EU target monitoring, alternative values cannot be obtained, and interpolation ensures continuity of the series. The full 2000–2020 MSW trajectory, which serves as the dependent variable in the modelling, is shown in Figure 1. The length of this time series is typical of nationally reported MSW datasets used in the literature [8,9,23].

The covariates selected were gross domestic product (GDP), population and unemployment [47,48,49]. More details on the covariates can be seen in Table 2. These covariates were chosen as they capture socio-economic and environmental influences on waste. The studies identified in Section 2 frequently use such covariates when building their models [8,9,11,12,21,23,27].

At this stage, trend and multicollinearity analyses are deliberately omitted, as is commonly executed in the 18 studies identified in Section 2. These analyses will be conducted and discussed in Section 5.

Prior to training the models, the data was split into training and test datasets according to a chronological 70/30 split. The models were trained on the years 2000–2016 and tested on the years 2017–2020. A 70/30 split was chosen, as this is a common and well-accepted train/test split [50,51]. When the dependent data is recorded over time (also known as a time series [52]), it is best practice to employ a chronological split as opposed to a random split. Random splits in time series can lead to future values influencing past predictions, which is unrealistic in real-world forecasting scenarios [53].

The training data was standardised by subtracting the mean and dividing by the standard deviation. The same scaling was applied to the test data. Each value in the test set was scaled by subtracting the training mean and dividing by the training standard deviation. Standardisation prevents variables with larger magnitudes (e.g., GDP) from disproportionately influencing the model compared to smaller values (e.g., unemployment) [54]. This process is described mathematically in Equation (4). Standardisation was used by studies identified in Section 2 [9,14].

x_{i, j}^{*} = \frac{x_{i, j} - {\bar{x}}_{j}^{train}}{σ_{j}^{train}}

(4)

where

x_{i, j}

is the ith observation of the jth variable,

{\bar{x}}_{j}^{train}

is the training mean of the jth variable and

σ_{j}^{train}

is the training standard deviation of the jth variable.

3.2. Mean Baseline Model

The mean baseline model is a simple reference model that predicts all future values as the average of the training data. It does not account for variation over time or relationships with covariates but instead assumes that the mean is the best predictor. Its purpose is to provide a lower bound for model performance. Any predictive model should outperform the baseline model if it is capturing meaningful patterns in the data. By including it, distinctions can be made between genuine predictive ability and performance that simply reflects the central tendency of the dataset.

3.3. Naive Baseline Model

The naive baseline model predicts each future value as equal to the most recent observed value. For example, this model will predict the 2019 value as being the 2018 value. In the context of time series data, this is equivalent to a random walk benchmark, making it a strong reference model when the dependent variable exhibits trend or persistence.

3.4. Support Vector Machine

An SVM was built using the ‘e1071’ package in R version 4.4.1 [55]. It took the variables in Table 2 as covariates and predicted MSW. The data was scaled prior to the modelling (as described in Section 3.1). A radial basis kernel was employed, based on choices made in the literature [9,11,20]. This kernel is defined in Equation (5).

K (x_{i}, x_{j}) = exp (- γ ∥ x_{i} - x_{j} ∥^{2}) .

(5)

Given the limited sample size, we adopted a conservative hyperparameter specification. The SVM hyperparameters were set as follows: The gamma value was 1/3, corresponding to the inverse of the number of predictors and determining the influence radius of individual data points. The cost value was set to 1, which balances model complexity against margin violations. The epsilon was set to 0.1, defining the width of the insensitive zone in regression. These choices represent values that prioritise generalisation and avoid the risk of overfitting. To ensure robustness, alternative hyperparameter values were also explored, and the results of the analysis remained consistent. The SVM was chosen as it was commonly used in the literature identified in Section 2 [8,9,11,20,25].

3.5. Neural Network

A feedforward neural network was built using the ‘neuralnet’ package in R [56]. It took the variables in Table 2 as covariates and predicted MSW. The data was scaled prior to the modelling (as described in Section 3.1). Given the limited sample size, a simple architecture was used to avoid overfitting. The network had 1 hidden layer with 2 neurones. The activation function was a logistic function. The hyperparameters were as follows, and were chosen to reduce the risk of the model overfitting: The error threshold was set to 0.01, and the network was trained using resilient backpropagation with the maximum number of steps for the training set to 5000. The number of training repetitions was set to 10. The network was trained with 10 independent random initialisations.

4. Results

The training and test metrics of the mean baseline model, naive baseline model, SVM and neural network can be seen in Table 3. To complement the numerical metrics in Table 3, Figure 2 presents a visual comparison of the test set predictions from each model.

The results in Table 3 highlight several important contrasts between the models. For the mean baseline, the

R^{2}

behaves as expected on the training set, taking a value of 0 since the model predicts the training mean. On the test set its

R^{2}

falls to −5.02, reflecting the fact that the test mean differs from the training mean. In absolute terms, the RMSE (59.67) and MAE (54.49) are interpretable as error magnitudes, but the negative

R^{2}

illustrates how the statistic can be uninterpretable for trivial models once applied out of sample. As shown in Figure 2, the mean baseline produces a flat prediction line that fails to match the dynamics of the series.

Although the

R^{2}

of the naive baseline model declines from 0.72 in training to 0.39 on the test set, its error metrics improve substantially. The RMSE decreases from 39.75 to 19.02, the MAE decreases from 29.19 to 17.75, and the MSE decreases from 1580.28 to 361.75. Thus, while

R^{2}

suggests a loss of explanatory power on the test set, the absolute prediction errors indicate that the naive model generalises well. Figure 2 shows that the naive baseline tracks year-to-year variation well, which explains the low errors despite a decline in

R^{2}

. This highlights that the

R^{2}

can understate the performance of simple, effective forecasting methods.

For the SVM, the

R^{2}

suggests a collapse of the SVM on the test set. The model achieves a high

R^{2}

of 0.87 on the training set, but this drops sharply to −0.31 on the test set. Judged by

R^{2}

alone, one might conclude that the model generalises poorly and is of little practical use. Yet the RMSE and MAE remain relatively stable between train and test (26.76 vs. 27.81 for RMSE, 20.44 vs. 25.24 for MAE and 716.11 vs. 773.34 for MSE). In other words, the decline in performance is modest if not acceptable in absolute terms, but the

R^{2}

greatly exaggerates it. Figure 2 illustrates this. The model captures the overall level and broad shape of the test series, contradicting the impression of collapse implied by the test set

R^{2}

.

The neural network exhibits a negative

R^{2}

on the test set, which would typically indicate performance worse than the mean baseline. However, the error metrics tell a different story: the RMSE and MAE remain moderate and well within the range seen for the other models. As shown in Figure 2, the neural network captures the general level of the series but does not follow year-to-year fluctuations closely, resulting in prediction errors but not catastrophic failure. The negative

R^{2}

therefore overstates the weakness of the model and fails to reflect its actual predictive accuracy.

Overall, these findings demonstrate that the

R^{2}

can be a misleading guide to the performance of machine learning models in sustainability applications, as well as the complications involved with its use for even extremely simple models. In this example, the SVM and neural network appear to collapse on the test set if judged by

R^{2}

, despite maintaining error magnitudes similar to those observed in training. The naive baseline appears weak according to its test

R^{2}

, even though it achieves the lowest predictive errors of all models. These results show that

R^{2}

can mislead in both directions, overstating the failure of complex models and understating the success of simple baselines. To understand why the

R^{2}

behaves in this way, it is necessary to examine its underlying assumptions. While the statistic is well defined in the context of linear regression, the conditions required for valid interpretation are frequently violated in ML and time series settings. The next section outlines these theoretical limitations and explains why the

R^{2}

can distort perceptions of both model weakness and model strength in practice.

5. Theoretical Reasons for the $R^{2}$ s Failings

The

R^{2}

has several limitations that reduce its reliability as an evaluation metric in ML applications, including those used for waste forecasting. This section explains its theoretical limitations. The findings from the Irish practical example generalise to a wide range of sustainability and waste management applications, as the behaviour of the

R^{2}

examined here stems from fundamental statistical properties (non-linearity of ML models, trend and multicollinearity).

While the assumptions underlying the

R^{2}

in linear regression are well known, this section is not a restatement of classical statistics. Instead, it extends those assumptions to demonstrate why they are systematically violated in ML contexts, such as waste management. Although similar issues have been documented in the statistical and econometric literature, they have not been systematically acknowledged in ML applications, where the

R^{2}

remains the dominant metric. Thus, this section highlights how well-known statistical concerns become overlooked in the specific context of ML waste management forecasting, contributing to the discussions started by [6,7].

5.1. The $R^{2}$ for Linear Regression

Linear regression is a classical statistical technique which uses a linear combination of covariates to predict the dependent [35]. As with most statistical tools, it comes with a set of assumptions [57]. These assumptions can be expressed as follows: (1) The data is observed with no error; (2) the relationships between the dependent and covariates are linear; (3) the residuals (errors) are random and not correlated; (4) the covariates are not correlated; and (5) the dependent is conditionally normally distributed. When these hold, the

R^{2}

is a meaningful performance measure of the linear regression. Otherwise, it may be miscalculated and misleading.

In ML models, some of these conditions can be relaxed. ML models are designed to capture complex, non-linear relationships, meaning they do not require assumption 2 [58]. However, other assumptions, like independent errors, still matter. If these are not met, the

R^{2}

may not provide a reliable measure of model accuracy.

5.2. The Effect of Hyperparameters on the $R^{2}$

Unlike linear regression, ML models have hyperparameters, which are design choices made before or during training [59]. These hyperparameters influence how the model ‘learns’ from the data and can affect the independence of the residuals. In a linear regression model, the total variation in the data can be split into two parts, the variation explained by the model and the unexplained variation. This is explained mathematically in Equation (6), which is commonly referred to as the total sum of squares identity. While it provides a formal basis for understanding the

R^{2}

, the key takeaway is that it assumes a clean partitioning of variance in the data. It is also worth noting that Equation (6) can be rearranged to obtain the definition of the

R^{2}

in Equation (1).

\sum_{i} {(y_{i} - \bar{y})}^{2} = \sum_{i} {(\hat{y_{i}} - \bar{y})}^{2} + \sum_{i} {(y_{i} - \hat{y_{i}})}^{2} .

(6)

where

y_{i}

is the ith observed dependent value,

\bar{y}

is the mean of the dependent and

\hat{y}

is the ith value predicted by the model.

Since ML models are non-linear (because of their complexity and use of hyperparameters) [58], they introduce a ‘cross term’ that disrupts variance partitioning [38]. This makes residuals non-random, violating assumption 3. This issue can lead to

R^{2}

values below 0 or above 1 [39]. This challenges the

R^{2}

’s reliability for ML models. The reader is referred to Book and Young [38] for the mathematical derivation of the cross term.

5.3. The Effect of Trend on the $R^{2}$

When data follow a steady upward or downward trajectory, the

R^{2}

can give a misleading impression of model quality [60]. This issue is particularly acute in waste management, where long-term trends in MSW generation or recycling rates are directly linked to progress towards SDG 12. In such cases, even a very simple model that merely reproduces the trend may obtain a high

R^{2}

, not because it captures the underlying drivers, but because it aligns with the overall direction of the series. As shown in Section 4, the naive baseline achieved a train

R^{2}

of 0.72 despite achieving nothing more than projecting the previous observation forward. This illustrates how the

R^{2}

rewards the replication of trend rather than genuine explanatory or predictive insight.

In sustainability research, the dependent is often a time series (data collected over time [52]), such as annual waste generation or pollution levels. Many studies in this field use the

R^{2}

to assess ML models applied to such data but do not check whether a trend is present. Of the 18 studies identified in Section 2, only 1 performed a check for trend [16].

Trend can cause a model’s residuals to be correlated over time, breaking assumption 3 [57]. This suggests the model has not fully captured the patterns in the data, making the

R^{2}

unreliable. Furthermore, autocorrelation can negatively affect the model’s forecasting ability.

A simple way to check for a trend is to plot the data and visually inspect whether it consistently moves upward or downward. More formally, an Augmented Dickey–Fuller (ADF) test can be used to test for trend [61]. However, for small datasets, the test may be underpowered, so it should always be complemented with a visual inspection [62]. Looking back to Figure 1, a clear pre-2007 rise in MSW can be identified, as well as a fall post-2007. This clearly increases the naive baseline’s

R^{2}

as it follows these trends without actually modelling the dynamics.

Another method is to examine the residuals of a model. Plotting the fitted values (predictions) against the residuals should show a random scatter of points. A visible pattern, such as a consistent upward or downward slope, a curved shape or wave-like movement, suggests the presence of a trend [63]. Trend can also cause heteroscedasticity, where the spread of residuals changes across the fitted values (that is, the residuals exhibit non-constant variance) [64]. Figure 3 plots the fitted values against the residuals on the training set for the SVM built in Section 3.4. These residuals are not ideal; there is an outlying value at around 675 on the x axis, and there seems to be a non-linear tail to the right of the plot. This would suggest a violation of the necessary assumptions for the

R^{2}

s correct calculation.

To correct for trend, a process called differencing can be used. This is where the change in the data is modelled, rather than the data itself (for example, the change in MSW from one year to the next, rather than the quantity of MSW). Sometimes, multiple rounds of differencing are needed to fully remove the trend.

5.4. The Effect of Multicollinearity on the $R^{2}$

Multicollinearity occurs when covariates are highly correlated with each other [65]. In sustainability contexts, this is common: for example, economic activity, consumption and waste generation often move together, making it difficult to disentangle drivers of waste and weakening the worth of

R^{2}

for CE-related modelling. In the presence of multicollinearity, correlated predictors can cause unstable estimates and may give an inflated impression of model fit on the training data [66]. This effect can be thought of as the model counting the same information twice. Multicollinearity can also lead to overfitting [67]. Multicollinearity causes the model to learn specific patterns in the training set rather than general relationships. This makes the model sensitive to small changes in the data, reducing its ability to make reliable predictions [68].

To identify multicollinearity, one can create a correlation matrix, which shows the correlation coefficients between covariates. If high levels of correlation are identified (typically greater than 0.7 in magnitude), then multicollinearity is a concern. Table 4 shows a correlation matrix for the covariates used to build the SVM in Section 3. As shown in boldface, there is one correlation which is greater than 0.7 in magnitude. This is evidence that multicollinearity could be influencing the

R^{2}

s in Section 4, giving a false impression of model ability.

Multicollinearity can be dealt with by omitting one covariate of a correlated pair. Alternatively, one could use a technique such as Principal Component Analysis (PCA). PCA transforms correlated covariates into a smaller set of uncorrelated components (linear combinations of the covariates) while preserving as much information as possible [69].

6. Discussions

The theoretical limitations of the

R^{2}

discussed in Section 5 are not merely academic concerns. Issues related to the misevaluation of ML models have tangible, real world impacts when ML is used as part of the decision-making process in sustainability and waste management planning.

The policy relevance of these findings is particularly strong in the context of EU waste legislation. Core instruments such as the Waste Framework Directive (2008/98/EC) [70] (which establishes the waste hierarchy and sets binding targets for recycling and recovery) and the Packaging and Packaging Waste Directive (94/62/EC) [71] and its subsequent amendments, rely on accurate national statistics. These policy frameworks inform investment decisions, extended producer responsibility (EPR) obligations and the monitoring of Member State compliance. If ML models used in such policy analyses are evaluated using inappropriate metrics, there is a genuine risk of misjudging whether countries are on track to meet EU targets. For example, a model that appears strong under

R^{2}

may systematically under-predict municipal waste arisings, affecting assessments of compliance with recycling requirements. Conversely, a negative test set

R^{2}

may lead policymakers to dismiss models that in fact produce acceptable errors. Such distortions can influence planning for recycling infrastructure, evaluations of EPR performance and the implementation timelines associated with the Packaging and Packaging Waste Regulation [72]. By demonstrating how easily model performance can be misinterpreted under realistic data conditions, this study highlights the potential policy risks that arise when evaluation metrics are not chosen with care.

While the practical example focuses on Irish MSW data, the implications of the findings extend more broadly. The limitations of the

R^{2}

highlighted in Section 5 arise from the non-linearity of ML models and the structural properties of sustainability datasets rather than from characteristics specific to Ireland. As a result, the mechanisms through which the

R^{2}

becomes unreliable are expected to generalise across diverse contexts, even though this study deliberately restricts itself to a single representative case. The practical example should therefore be viewed as illustrative rather than exhaustive, with the theoretical argument providing the basis for generalisability.

When evaluating models, error-based metrics like RMSE and MAE should be prioritised, as they provide concrete error assessments in the same units as the prediction and do not rely on assumptions that are often violated in ML. Moreover, because RMSE and MAE are in the same units as the predictions, they offer a direct and interpretable measure of error, helping policymakers understand the accuracy of forecasts. It is also good practice to visualise predictions made by ML models, to ensure that the predictions look feasible (i.e., the model is not predicting unrealistic values or making repeated predictions).

To make these considerations operational, we outline a simple and practical workflow for evaluating ML models in sustainability and waste management research. First, model performance should always be assessed using a suite of metrics, such as the MAE and RMSE, rather than relying on a single indicator such as the

R^{2}

. Second, evaluation should incorporate basic diagnostic tools, such as residual plots, checks for autocorrelation and visualisations of fitted and observed values, to identify issues that scalar metrics may obscure. Third, for time series data, validation procedures should respect temporal ordering, avoiding random train/test splits that can unrealistically mix past and future information. Finally, multicollinearity should also be checked for among the covariates. If present, it can be dealt with by omitting covariates or using techniques such as PCA. These steps provide a transparent and reproducible procedure that can be readily implemented using standard ML workflows and help ensure that model assessment remains methodologically sound.

The credibility of waste management research is pivotal in maintaining public trust and ensuring stakeholder engagement. Policymakers, practitioners, and the public rely on transparency in the modelling process to justify decisions and secure funding. By ensuring the selected evaluation metrics are appropriate, researchers can offer a more nuanced and accurate picture of model performance. This not only mitigates the risks of misinterpretation but also promotes better decision-making. Demonstrating awareness of the limitations of traditional metrics and addressing them can foster a culture of methodological rigour in ML-based waste management research, advancing the field and supporting robust CE strategies.

A limitation of this study is the size of the Irish MSW dataset, which contains 19 annual observations from 2000 to 2020. Short time series are common in national-level waste management research but can affect the stability of ML models and contribute to variability in performance metrics, including the

R^{2}

. Two missing values were linearly interpolated, introducing a small degree of uncertainty. However, as can be seen from Figure 1, the interpolated years (2013 and 2015) fall within a period where the series is relatively stable (2012–2017), reducing the likelihood of distortion. To mitigate risks of overfitting, we restricted the models to three covariates and used conservative hyperparameter settings.

Importantly, these limitations reinforce the central argument of the paper. Under realistic data constraints typical of sustainability applications, the

R^{2}

becomes particularly unreliable as an evaluation metric, whereas error-based measures offer a more stable and informative basis for model assessment.

In summary, evaluating ML models with appropriate error metrics and diagnostic checks is not just a methodological concern but a prerequisite for evidence-based waste management. A thorough evaluation framework supports the achievement of waste management objectives and helps ensure that waste policies contribute effectively to SDG 12.

7. Conclusions

This study examined the use of evaluation metrics in machine learning applications to waste management, with a focus on the limitations of the

R^{2}

statistic. Through a theoretical analysis and a practical example of Irish MSW forecasting, it is demonstrated how reliance on

R^{2}

can misrepresent model performance under the data characteristics typical of waste management research, including temporal dependence and correlated predictors.

These findings show that error-based measures such as RMSE and MAE offer more stable and interpretable assessments of predictive accuracy, while residual diagnostics and checks for multicollinearity are essential for ensuring robust models. The novelty of this work lies in linking recognised statistical limitations of

R^{2}

to the specific context of sustainability and waste management ML, where these concerns remain underappreciated.

Building on these insights, this study provides a set of operational recommendations for improving ML evaluation in waste management research. First, model performance should be assessed using a suite of complementary metrics, prioritising assumption-free error measures such as RMSE and MAE, and avoiding the use of the

R^{2}

for model evaluation. Second, researchers should implement a minimum diagnostic protocol that includes visual inspection of trends in both the raw data and residuals and checks for autocorrelation using tools such as residual ACF plots and Durbin–Watson statistics. Multicollinearity should be assessed through correlation matrices, with corrective steps such as variable omission or PCA implemented where necessary. Time series validation must respect temporal ordering by using chronological train/test splits rather than random partitions. Finally, numerical metrics should always be paired with visual diagnostics (such as plotting observed vs. predicted values and out-of-sample forecast trajectories) to detect implausible behaviour that scalar metrics may obscure. Taken together, these steps form a practical and reproducible evaluation workflow suitable for sustainability applications.

Future research should prioritise the development and standardisation of evaluation frameworks that move beyond single metrics, ensuring transparency, reliability and policy relevance. Strengthening methodological practice in this way is a prerequisite for advancing CE and waste management strategies, contributing to achieving SDG 12.

Author Contributions

Conceptualization, P.M., C.F. and E.M.G.; methodology, P.M., C.F. and E.M.G.; formal analysis, P.M.; writing—original draft preparation, P.M.; writing—review and editing, C.F. and E.M.G.; supervision, C.F. and E.M.G.; funding acquisition, C.F. and E.M.G. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the Irish Environmental Protection Agency Research Program [grant number: 2023-GCE-1206].

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available from the Irish Central Statistics Office, Eurostat, Our World in Data and the World Bank. More information can be found in Table 2. The code that accompanies this study can be found in the following repository: https://github.com/paulmullane/evaluation-critiques, accessed on 27 November 2025.

Conflicts of Interest

The authors declare that they have no competing financial interests or personal relationships that could influence the work reported in this paper.

Abbreviations

The following abbreviations are used in this manuscript:

ADF	Augmented Dickey–Fuller test
CE	Circular Economy
EPR	Extended Producer Responsibility
GDP	Gross Domestic Product
GEC	General Electricity Consumption
MAE	Mean Absolute Error
MAPE	Mean Absolute Percentage Error
ML	Machine Learning
MSE	Mean Squared Error
MSW	Municipal Solid Waste
PCA	Principal Component Analysis
SDG	Sustainable Development Goals
SVM	Support Vector Machine
$R^{2}$	Coefficient of Determination
RMSE	Root Mean Squared Error

References

Yan, Y. Machine learning fundamentals. In Machine Learning in Chemical Safety and Health: Fundamentals with Applications; Wiley: Hoboken, NJ, USA, 2022; pp. 19–46. [Google Scholar]
United Nations. Transforming Our World: The 2030 Agenda for Sustainable Development. A/RES/70/1, United Nations General Assembly. Available online: https://sdgs.un.org/2030agenda (accessed on 27 November 2025).
Maalouf, A.; Mavropoulos, A. Re-assessing global municipal solid waste generation. Waste Manag. Res. 2023, 41, 936–947. [Google Scholar] [CrossRef]
European Parliament and Council of the European Union. Directive (EU) 2018/851 of the European Parliament and of the Council of 30 May 2018 Amending Directive 2008/98/EC on Waste (Text with EEA Relevance); Official Journal of the European Union: Luxembourg, 2018; L 150, 14.6.2018; pp. 109–140. [Google Scholar]
Majeau-Bettez, G.; Frayret, J.M.; Ramaswami, A.; Li, Y.; Heeren, N. Data innovation in industrial ecology. J. Ind. Ecol. 2022, 26, 6–11. [Google Scholar] [CrossRef]
Pennington, D.; Ebert-Uphoff, I.; Freed, N.; Martin, J.; Pierce, S.A. Bridging sustainability science, earth science, and data science through interdisciplinary education. Sustain. Sci. 2020, 15, 647–661. [Google Scholar] [CrossRef]
Zhu, J.J.; Yang, M.; Ren, Z.J. Machine learning in environmental research: Common pitfalls and best practices. Environ. Sci. Technol. 2023, 57, 17671–17689. [Google Scholar] [CrossRef] [PubMed]
Ceylan, Z. Estimation of municipal waste generation of Turkey using socio-economic indicators by Bayesian optimization tuned Gaussian process regression. Waste Manag. Res. 2020, 38, 840–850. [Google Scholar] [CrossRef]
Jassim, M.S.; Coskuner, G.; Zontul, M. Comparative performance analysis of support vector regression and artificial neural network for prediction of municipal solid waste generation. Waste Manag. Res. 2022, 40, 195–204. [Google Scholar] [CrossRef]
Nguyen, X.C.; Nguyen, T.T.H.; La, D.D.; Kumar, G.; Rene, E.R.; Nguyen, D.D.; Chang, S.W.; Chung, W.J.; Nguyen, X.H.; Nguyen, V.K. Development of machine learning-based models to forecast solid waste generation in residential areas: A case study from Vietnam. Resour. Conserv. Recycl. 2021, 167, 105381. [Google Scholar] [CrossRef]
Oguz-Ekim, P. Machine learning approaches for municipal solid waste generation forecasting. Environ. Eng. Sci. 2021, 38, 489–499. [Google Scholar] [CrossRef]
Puntarić, E.; Pezo, L.; Zgorelec, Ž.; Gunjača, J.; Kučić Grgić, D.; Voća, N. Prediction of the production of separated municipal solid waste by artificial neural networks in Croatia and the European Union. Sustainability 2022, 14, 10133. [Google Scholar] [CrossRef]
Roseckỳ, M.; Šomplák, R.; Slavík, J.; Kalina, J.; Bulková, G.; Bednář, J. Predictive modelling as a tool for effective municipal waste management policy at different territorial levels. J. Environ. Manag. 2021, 291, 112584. [Google Scholar] [CrossRef]
Yang, L.; Zhao, Y.; Niu, X.; Song, Z.; Gao, Q.; Wu, J. Municipal solid waste forecasting in China based on machine learning models. Front. Energy Res. 2021, 9, 763977. [Google Scholar] [CrossRef]
Allen, C.; Smith, M.; Rabiee, M.; Dahmm, H. A review of scientific advancements in datasets derived from big data for monitoring the Sustainable Development Goals. Sustain. Sci. 2021, 16, 1701–1716. [Google Scholar] [CrossRef]
Intharathirat, R.; Salam, P.A.; Kumar, S.; Untong, A. Forecasting of municipal solid waste quantity in a developing country using multivariate grey models. Waste Manag. 2015, 39, 3–14. [Google Scholar] [CrossRef] [PubMed]
Duan, Y. A novel interval energy-forecasting method for sustainable building management based on deep learning. Sustainability 2022, 14, 8584. [Google Scholar] [CrossRef]
V’yugin, V.V.; Trunov, V.G. Online learning with continuous ranked probability score. In Proceedings of the Conformal and Probabilistic Prediction and Applications, Golden Sands, Bulgaria, 9–11 September 2019; pp. 163–177. [Google Scholar]
Wang, Y.; Gan, D.; Sun, M.; Zhang, N.; Lu, Z.; Kang, C. Probabilistic individual load forecasting using pinball loss guided LSTM. Appl. Energy 2019, 235, 10–20. [Google Scholar] [CrossRef]
Abbasi, M.; El Hanandeh, A. Forecasting municipal solid waste generation using artificial intelligence modelling approaches. Waste Manag. 2016, 56, 13–22. [Google Scholar] [CrossRef]
Antanasijević, D.; Pocajt, V.; Popović, I.; Redžić, N.; Ristić, M. The forecasting of municipal waste generation using artificial neural networks and sustainability indicators. Sustain. Sci. 2013, 8, 37–46. [Google Scholar] [CrossRef]
Batinic, B.; Vukmirovic, S.; Vujic, G.; Stanisavljevic, N.; Ubavin, D.; Vukmirovic, G. Using ANN model to determine future waste characteristics in order to achieve specific waste management targets-case study of Serbia. J. Sci. Ind. Res. 2011, 70, 513–518. [Google Scholar]
Chhay, L.; Reyad, M.A.H.; Suy, R.; Islam, M.R.; Mian, M.M. Municipal solid waste generation in China: Influencing factor analysis and multi-model forecasting. J. Mater. Cycles Waste Manag. 2018, 20, 1761–1770. [Google Scholar] [CrossRef]
Kannangara, M.; Dua, R.; Ahmadi, L.; Bensebaa, F. Modeling and prediction of regional municipal solid waste generation and diversion in Canada using machine learning approaches. Waste Manag. 2018, 74, 3–15. [Google Scholar] [CrossRef]
Noori, R.; Abdoli, M.; Ghasrodashti, A.A.; Jalili Ghazizade, M. Prediction of municipal solid waste generation with combination of support vector machine and principal component analysis: A case study of Mashhad. Environ. Prog. Sustain. Energy Off. Publ. Am. Inst. Chem. Eng. 2009, 28, 249–258. [Google Scholar] [CrossRef]
Yapıcı, E.; Akgün, H.; Özkan, K.; Günkaya, Z.; Özkan, A.; Banar, M. Prediction of gas product yield from packaging waste pyrolysis: Support vector and Gaussian process regression models. Int. J. Environ. Sci. Technol. 2023, 20, 461–476. [Google Scholar] [CrossRef]
Younes, M.K.; Nopiah, Z.; Basri, N.A.; Basri, H.; Abushammala, M.F.; Maulud, K. Prediction of municipal solid waste generation using nonlinear autoregressive network. Environ. Monit. Assess. 2015, 187, 753. [Google Scholar] [CrossRef]
Fasano, F.; Addante, A.S.; Valenzano, B.; Scannicchio, G. Variables influencing per capita production, separate collection, and costs of municipal solid waste in the Apulia region (Italy): An experience of deep learning. Int. J. Environ. Res. Public Health 2021, 18, 752. [Google Scholar] [CrossRef] [PubMed]
Birgen, C.; Magnanelli, E.; Carlsson, P.; Skreiberg, Ø.; Mosby, J.; Becidan, M. Machine learning based modelling for lower heating value prediction of municipal solid waste. Fuel 2021, 283, 118906. [Google Scholar] [CrossRef]
He, M.; Qian, Q.; Liu, X.; Zhang, J.; Curry, J. Recent progress on surface water quality models utilizing machine learning techniques. Water 2024, 16, 3616. [Google Scholar] [CrossRef]
Li, W.; Zhao, Y.; Zhu, Y.; Dong, Z.; Wang, F.; Huang, F. Research progress in water quality prediction based on deep learning technology: A review. Environ. Sci. Pollut. Res. 2024, 31, 26415–26431. [Google Scholar] [CrossRef]
Agbehadji, I.E.; Obagbuwa, I.C. Systematic review of machine learning and deep learning techniques for spatiotemporal air quality prediction. Atmosphere 2024, 15, 1352. [Google Scholar] [CrossRef]
Benti, N.E.; Chaka, M.D.; Semie, A.G. Forecasting renewable energy generation with machine learning and deep learning: Current advances and future prospects. Sustainability 2023, 15, 7087. [Google Scholar] [CrossRef]
Almeida, B.; David, J.; Campos, F.S.; Cabral, P. Satellite-based Machine Learning modelling of Ecosystem Services indicators: A review and meta-analysis. Appl. Geogr. 2024, 165, 103249. [Google Scholar] [CrossRef]
James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning: With Applications in R; Springer: Berlin/Heidelberg, Germany, 2021; Chapter 3, Linear Regression. [Google Scholar]
Miller, C.; Portlock, T.; Nyaga, D.M.; O’Sullivan, J.M. A review of model evaluation metrics for machine learning in genetics and genomics. Front. Bioinform. 2024, 4, 1457619. [Google Scholar] [CrossRef]
Anderson-Sprecher, R. Model comparisons and R². Am. Stat. 1994, 48, 113–117. [Google Scholar]
Book, S.A.; Young, P.H. The trouble with R². J. Parametr. 2006, 25, 87–114. [Google Scholar] [CrossRef]
Cameron, A.C.; Windmeijer, F.A. An R-squared measure of goodness of fit for some common nonlinear regression models. J. Econom. 1997, 77, 329–342. [Google Scholar] [CrossRef]
Haessel, W. Measuring goodness of fit in linear and nonlinear models. S. Econ. J. 1978, 44, 648–652. [Google Scholar] [CrossRef]
Huang, H.H.; Hsiao, C.K.; Huang, S.Y.; Peterson, P.; Baker, E.; McGaw, B. Nonlinear regression analysis. Int. Encycl. Educ. 2010, 2010, 339–346. [Google Scholar]
Kvalseth, T.O. Note on the R2 measure of goodness of fit for nonlinear models. Bull. Psychon. Soc. 1983, 21, 79–80. [Google Scholar] [CrossRef]
Spiess, A.N.; Neumeyer, N. An evaluation of R 2 as an inadequate measure for nonlinear models in pharmacological and biochemical research: A Monte Carlo approach. BMC Pharmacol. 2010, 10, 6. [Google Scholar] [CrossRef]
R Core Team. R: A Language and Environment for Statistical Computing, Version 4.4.1.; R Foundation for Statistical Computing: Vienna, Austria, 2024; Available online: https://www.r-project.org/ (accessed on 25 November 2025).
Eurostat. Generation of Municipal Waste per Capita. 2024. Available online: https://ec.europa.eu/eurostat/databrowser/view/cei_pc031/default/table?lang=en (accessed on 1 October 2025).
AlSalehy, A.S.; Bailey, M. Improving Time Series Data Quality: Identifying Outliers and Handling Missing Values in a Multilocation Gas and Weather Dataset. Smart Cities 2025, 8, 82. [Google Scholar] [CrossRef]
World Bank. GDP (Constant 2015 US$)—Ireland. 2024. Available online: https://data.worldbank.org/indicator/NY.GDP.MKTP.KD?locations=IE (accessed on 1 October 2025).
World Bank. Population, Total—Ireland. 2024. Available online: https://data.worldbank.org/indicator/SP.POP.TOTL?end=2023&locations=IE&start=1960&view=chart (accessed on 1 October 2025).
World Bank. Unemployment, Total (% of Total Labour Force) (Modeled ILO Estimate)—Ireland. 2024. Available online: https://data.worldbank.org/indicator/SL.UEM.TOTL.ZS?locations=IE (accessed on 27 November 2025).
Berdyyev, A.; Al-Masnay, Y.A.; Juliev, M.; Abuduwaili, J. Desertification Monitoring Using Machine Learning Techniques with Multiple Indicators Derived from Sentinel-2 in Turkmenistan. Remote Sens. 2024, 16, 4525. [Google Scholar] [CrossRef]
Nguyen, Q.H.; Ly, H.B.; Ho, L.S.; Al-Ansari, N.; Le, H.V.; Tran, V.Q.; Prakash, I.; Pham, B.T. Influence of data splitting on performance of machine learning models in prediction of shear strength of soil. Math. Probl. Eng. 2021, 2021, 4832864. [Google Scholar] [CrossRef]
Box, G.E.; Jenkins, G.M.; Reinsel, G.C.; Ljung, G.M. Time Series Analysis: Forecasting and Control; John Wiley & Sons: Hoboken, NJ, USA, 2015. [Google Scholar]
Petropoulos, F.; Apiletti, D.; Assimakopoulos, V.; Babai, M.Z.; Barrow, D.K.; Taieb, S.B.; Bergmeir, C.; Bessa, R.J.; Bijak, J.; Boylan, J.E.; et al. Forecasting: Theory and practice. Int. J. Forecast. 2022, 38, 705–871. [Google Scholar] [CrossRef]
James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning: With Applications in R; Springer: Berlin/Heidelberg, Germany, 2021; Chapter 6, Linear Model Selection and Regularization. [Google Scholar]
Meyer, D.; Dimitriadou, E.; Hornik, K.; Weingessel, A.; Leisch, F. e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien, R package version 1.7-16; R Foundation for Statistical Computing: Vienna, Austria, 2024. [Google Scholar]
Fritsch, S.; Guenther, F.; Wright, M.N. neuralnet: Training of Neural Networks, R package version 1.44.2; R Foundation for Statistical Computing: Vienna, Austria, 2019. [Google Scholar]
Wooldridge, J.M. Introductory Econometrics a Modern Approach; South-Western Cengage Learning: Singapore, 2016; Chapter 2. [Google Scholar]
James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning: With Applications in R; Springer: Berlin/Heidelberg, Germany, 2021; Chapter 10, Deep Learning. [Google Scholar]
Arnold, C.; Biedebach, L.; Küpfer, A.; Neunhoeffer, M. The role of hyperparameters in machine learning models and how to tune them. Political Sci. Res. Methods 2024, 12, 841–848. [Google Scholar] [CrossRef]
Asteriou, D.; Hall, S.G. Applied Econometrics; Bloomsbury Publishing: London, UK, 2021; Chapter 3. [Google Scholar]
Hyndman, R.J.; Athanasopoulos, G. Forecasting: Principles and Practice; OTexts: Melbourne, Australia, 2018; Chapter 6. [Google Scholar]
Maitra, S.; Politis, D.N. Prepivoted Augmented Dickey-Fuller Test with Bootstrap-Assisted Lag Length Selection. Stats 2024, 7, 1226. [Google Scholar] [CrossRef]
Pesaran, M.H. Time Series and Panel Data Econometrics; Oxford University Press: London, UK, 2015. [Google Scholar]
Górecki, T.; Horváth, L.; Kokoszka, P. Change point detection in heteroscedastic time series. Econom. Stat. 2018, 7, 63–88. [Google Scholar] [CrossRef]
Shrestha, N. Detecting multicollinearity in regression analysis. Am. J. Appl. Math. Stat. 2020, 8, 39–42. [Google Scholar] [CrossRef]
Chan, J.Y.L.; Leow, S.M.H.; Bea, K.T.; Cheng, W.K.; Phoong, S.W.; Hong, Z.W.; Chen, Y.L. Mitigating the multicollinearity problem and its machine learning approach: A review. Mathematics 2022, 10, 1283. [Google Scholar] [CrossRef]
Ying, X. An overview of overfitting and its solutions. J. Physics Conf. Ser. 2019, 1168, 022022. [Google Scholar] [CrossRef]
Kim, J.H. Multicollinearity and misleading statistical results. Korean J. Anesthesiol. 2019, 72, 558–569. [Google Scholar] [CrossRef]
Jolliffe, I.T.; Cadima, J. Principal component analysis: A review and recent developments. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 2016, 374, 20150202. [Google Scholar] [CrossRef]
European Parliament; Council of the European Union. Directive 2008/98/EC of the European Parliament and of the Council of 19 November 2008 on Waste and Repealing Certain Directives; Official Journal of the European Union: Luxembourg, 2008; OJ L 312, 22.11.2008; pp. 3–30.
European Parliament; Council of the European Union. Directive 94/62/EC of the European Parliament and of the Council of 20 December 1994 on Packaging and Packaging Waste; Official Journal of the European Union: Luxembourg, 1994; OJ L 365, 31.12.1994; pp. 10–23.
European Parliament; Council of the European Union. Regulation (EU) 2025/40 of the European Parliament and of the Council of 19 December 2024 on Packaging and Packaging Waste, Amending Regulation (EU) 2019/1020 and Directive (EU) 2019/904, and Repealing Directive 94/62/EC; Official Journal of the European Union: Luxembourg, 2025; OJ L 22.1.2025/40, (Text with EEA relevance) Eli:CELEX 32025R0040.

Figure 1. Plot of annual municipal solid waste generation of Ireland [45].

Figure 2. Plots of the observed (black line) and fitted (red line) test set values for the (a) mean baseline model, (b) naive baseline model, (c) support vector machine (SVM) model and (d) neural network.

Figure 3. Fitted vs. residuals plot of the training set for the SVM built in Section 3.4.

Table 2. Explanation of variables used for modelling.

Variable	Units	Source
Municipal waste	kilograms/capita	[45]
GDP	constant to 2015 US$	[47]
Population	total population	[48]
Unemployment	% of total labour force	[49]

Table 3. Evaluation metrics for models.

Model	Train $R^{2}$	Train RMSE	Train MAE	Train MSE	Test $R^{2}$	Test RMSE	Test MAE	Test MSE
Mean Baseline	0	75.07	68.84	5635.53	−5.02	59.67	54.49	3560.34
Naive Baseline	0.72	39.75	29.19	1580.28	0.39	19.02	17.75	361.75
SVM	0.87	26.76	20.44	716.11	−0.31	27.81	25.24	773.34
Neural Network	0.75	37.85	20.69	1432.51	−0.64	31.17	25.20	971.49

Notes: Lower RMSE, MAE and MSE values indicate better model performance.

R^{2}

values closer to 1 indicate better model performance.

Table 4. Correlation matrix for the covariates.

	GDP	Population	Unemployment
GDP	1	-	-
Population	0.84	1	-
Unemployment	−0.08	0.43	1

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Mullane, P.; Fitzpatrick, C.; Grua, E.M. Rethinking Machine Learning Evaluation in Waste Management Research. Sustainability 2025, 17, 10707. https://doi.org/10.3390/su172310707

AMA Style

Mullane P, Fitzpatrick C, Grua EM. Rethinking Machine Learning Evaluation in Waste Management Research. Sustainability. 2025; 17(23):10707. https://doi.org/10.3390/su172310707

Chicago/Turabian Style

Mullane, Paul, Colin Fitzpatrick, and Eoin Martino Grua. 2025. "Rethinking Machine Learning Evaluation in Waste Management Research" Sustainability 17, no. 23: 10707. https://doi.org/10.3390/su172310707

APA Style

Mullane, P., Fitzpatrick, C., & Grua, E. M. (2025). Rethinking Machine Learning Evaluation in Waste Management Research. Sustainability, 17(23), 10707. https://doi.org/10.3390/su172310707

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Rethinking Machine Learning Evaluation in Waste Management Research

Abstract

1. Introduction

2. Evaluation Metrics in the Literature

2.1. The $R^{2}$

2.2. Error-Based Metrics

2.3. Criticism of the $R^{2}$ in the Literature

3. Methods

3.1. Data Acquisition and Processing

3.2. Mean Baseline Model

3.3. Naive Baseline Model

3.4. Support Vector Machine

3.5. Neural Network

4. Results

5. Theoretical Reasons for the $R^{2}$ s Failings

5.1. The $R^{2}$ for Linear Regression

5.2. The Effect of Hyperparameters on the $R^{2}$

5.3. The Effect of Trend on the $R^{2}$

5.4. The Effect of Multicollinearity on the $R^{2}$

6. Discussions

7. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

Article Menu

Rethinking Machine Learning Evaluation in Waste Management Research

Abstract

1. Introduction

2. Evaluation Metrics in the Literature

2.1. The R 2

2.2. Error-Based Metrics

2.3. Criticism of the R 2 in the Literature

3. Methods

3.1. Data Acquisition and Processing

3.2. Mean Baseline Model

3.3. Naive Baseline Model

3.4. Support Vector Machine

3.5. Neural Network

4. Results

5. Theoretical Reasons for the R 2 s Failings

5.1. The R 2 for Linear Regression

5.2. The Effect of Hyperparameters on the R 2

5.3. The Effect of Trend on the R 2

5.4. The Effect of Multicollinearity on the R 2

6. Discussions

7. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

2.1. The $R^{2}$

2.3. Criticism of the $R^{2}$ in the Literature

5. Theoretical Reasons for the $R^{2}$ s Failings

5.1. The $R^{2}$ for Linear Regression

5.2. The Effect of Hyperparameters on the $R^{2}$

5.3. The Effect of Trend on the $R^{2}$

5.4. The Effect of Multicollinearity on the $R^{2}$