Logistics Sprawl and Urban Congestion Dynamics Toward Sustainability: A Logistic Regression and Random-Forest-Based Model

El Yadari, Manal; Jawab, Fouad; Moufad, Imane; Arif, Jabir

doi:10.3390/su17135929


Technologies and Industrial Services Laboratory, Higher School of Technology, Sidi Mohamed Ben Abdellah University, Fez 30000, Morocco

* Author to whom correspondence should be addressed.

Sustainability 2025, 17(13), 5929; https://doi.org/10.3390/su17135929

Submission received: 16 May 2025 / Revised: 22 June 2025 / Accepted: 24 June 2025 / Published: 27 June 2025

(This article belongs to the Special Issue Sustainable Operations and Green Supply Chain)


Abstract

Increasing road congestion is the main constraint that may influence the economic development of cities and urban freight transport efficiency because it generates additional costs related to delay, influences social life, increases environmental emissions, and decreases service quality. This may result from several factors, including an increase in logistics activities in the urban core. Therefore, this paper aims to define the relationship between the logistics sprawl phenomenon and congestion level. In this sense, we explored the literature to summarize the phenomenon of logistics sprawl in different cities and defined the dependent and independent variables. Congestion level was defined as the dependent variable, while the increasing distance resulting from logistics sprawl, along with city and operational flow characteristics, was treated as independent variables. We compared the performance of several models, including decision tree, support vector machine, gradient boosting, k-nearest neighbor, logistic regression and random forest. Among all the models tested, we found that the random forest algorithm delivered the best performance in terms of prediction. We combined both logistic regression—for its interpretability—and random forest—for its predictive strength—to define, explain, and interpret the relationship between the studied variables. Subsequently, we collected data from the literature and various databases, including transit city sources. The resulting dataset, composed of secondary and open-source data, was then enhanced through standard augmentation techniques—SMOTE, mixup, Gaussian noise, and linear interpolation—to improve class balance and data quality and ensure the robustness of the analysis. Then, we developed a Python code and executed it in Colab. As a result, we deduced an equation that describes the relationship between the congestion level and the defined independent variables.

Keywords: sustainability; urban freight transport; logistics sprawl; congestion; logistic regression; random forest

1. Introduction

The increasing demand for freight and the complexity of logistics operations are the main factors leading to a rethinking of the performance of logistics operations in urban centers. The economic development of cities primarily depends on the efficiency of logistics operations [1]. In this sense, optimizing the transport network is fundamental to ensuring logistics activity performance and satisfying the growing demand [2].

The current tendency is to relocate many logistics facilities to outlying areas to develop a logistics infrastructure that meets the demand of cities [3]. More logistics hubs are being developed to consolidate freight, optimize truck filling rates, and design efficient vehicle routing [4,5]. In parallel with this movement, we have observed that the distance travelled is increasing, which can increase road occupancy and trip operations and can lead to several road problems, such as carbon footprint and congestion [6]. However, other studies have confirmed that logistics sprawl contributes to reducing congestion in urban centers by destressing cities through minimizing the number of trips made in the urban center [7].

Further research has focused on the phenomenon of logistics sprawl [2,8], describing the drivers of the movement and confirming that it is pushed by land availability and attractive prices in peripheral areas [9,10]. This research also confirms that the movement is primarily driven by organizational factors such as the need for freight consolidation [11], infrastructure availability, delivery frequency [12], demand volume, and the characteristics of logistics schemes. However, few studies have discussed the relationship between logistics sprawl and urban freight transport efficiency, specifically congestion levels.

In this paper, we aim to understand how logistics sprawl contributes to increasing or decreasing congestion levels in the urban core. Therefore, we conducted a literature review to describe the phenomenon of logistics sprawl and congestion. In particular, we analyzed how logistics sprawl manifests in various cities worldwide and deduced the main variables that may be used to model the relationship between logistics sprawl and congestion level. Subsequently, we defined the dependent and independent variables.

Finally, we developed a model using logistic regression and random forest to define the relationship between logistics sprawl and congestion level. We developed the model in Python 3 on the Colab platform.

To train the model, we extracted data from the literature review and different databases, including city transit databases, and encoded the categorical variables. Furthermore, we balanced and augmented our database using SMOTE, linear interpolation, Gaussian noise, and the mixup algorithm. Finally, we executed the model and deduced the results. From this, we have drawn the relationship between logistics sprawl and congestion level.

This paper is organized into five sections. Section 1 introduces the research methodology. Section 2 briefly presents the literature review, through which we define both logistics sprawl and congestion level and retrieve data about the logistics sprawl phenomenon in different cities. Sections 3 and 4 present the developed model and the retrieved data. Finally, Section 5 discusses the obtained results.

2. Research Background

2.1. Introduction to Logistics Sprawl, Congestion, and Sustainability

The urban core refers to the central area of a metropolitan region, characterized by a dense concentration of economic, administrative, and commercial activity. This zone often corresponds to the old city core and serves as a major freight and goods center. Basically, it serves as an important hub for freight distribution due to its high population density and business activity. However, the spatial limitations, high land costs, and congestion challenges in these areas pose significant constraints for large-scale logistics operations.

In contrast, peripheral areas—situated on the outskirts or in the suburban zones of metropolitan regions—offer lower land costs and greater spatial availability. These characteristics make them attractive for building logistics hubs and storage facilities. This development has been accompanied by a significant shift in the center of gravity of logistics infrastructure as companies attempt to minimize costs and maximize operational efficiency.

This spatial transition is known as logistics sprawl, a process denoting the migration of logistics infrastructure and operations from the urban core to the periphery. The concept of sprawl was initially introduced in 1952 to describe the expansion of urban structures and populations in suburban areas. Since then, the term has evolved to encompass the dispersion of logistics operations, raising important questions about its implications for urban traffic, environmental sustainability, and freight efficiency (Figure 1).

Logistics sprawl was introduced by Ref. [2], who defined this movement as the spatial deconcentration of logistics facilities from urban cores to suburban and periurban areas. Similarly, other studies [13,14,15,16,17] have defined it as the shift of logistics infrastructure to the city’s periphery. This movement was driven by several factors, including socio-economic evolution and political, territorial, and organizational patterns [3]. The availability of land in the urban core is the main driver of logistics sprawl, followed by land prices and the availability and proximity of infrastructure [4,5].

As described in the conducted research work of [6,7], logistics sprawl depends on the city’s characteristics, including population density, type of logistics activities [18], demand volume, infrastructure capacity, land availability, economic costs, and territorial strategies [19].

This phenomenon contributes to increased travel distances, energy consumption, and environmental emissions [20]. Research has also highlighted its negative impact on freight transport efficiency and urban mobility, as it raises transport costs, exacerbates congestion, and intensifies environmental pollution [21,22].

Furthermore, it increases pressure on transport systems, especially in areas with poor infrastructure, where workers and freight vehicles navigate alternately between urban centers and suburban areas [23], thereby impacting traffic flow and transport efficiency.

From the reviewed papers, we observed that the behavior of logistics sprawl and its consequences differ from city to city, depending on the characteristics of the cities, type of logistics operations, and transport patterns [9].

Empirical research from diverse geographic contexts confirms that logistics sprawl may manifest differently depending on urban dynamics. For instance, in Wuhan, China, the logistics sprawl movement was strongly influenced by postindustrial spatial restructuring, leading to significant changes in freight mobility and increased urban congestion due to longer delivery routes [24]. In the Indian context, Ref. [15] demonstrated that logistics sprawl increased vehicle kilometers traveled (VKT) and traffic delays and intensified environmental impacts in cities with limited last-mile logistics infrastructure. In the Katowice region of Poland, Ref. [25] introduced both the outward movement of logistics facilities and the phenomenon of “logistics anti-sprawl”, which involves the recentralization of certain high-value logistics operations to take advantage of intermodal nodes and improve access to labor.

These international cases reveal that logistics sprawl is not a uniform process but is embedded in urban spatial structures and regional freight governance. In this respect, Ref. [26] introduced the concept of “logistics distension” in the French context, where urban expansion and freight decentralization create fragmented supply chains and variable congestion impacts. While logistics sprawl can reduce inner-city congestion by displacing freight traffic to the outskirts, it generates peripheral congestion, longer distances, and environmental externalities. Therefore, understanding logistics sprawl requires the integration of spatial planning, transport modelling, and sustainability frameworks to guide a balanced urban freight policy (Table 1).

The suburbanization of logistics facilities is closely associated with transport activities; as logistics platforms move farther from city centers, freight transport distances increase, and the number and length of the truck journeys increase. Furthermore, as employment shifts to peripheral areas, the demand for transportation increases, requiring highly advanced infrastructure to support freight transportation and labor mobility.

The growing demand for transport resulting from logistics sprawl affects congestion in different ways; it can intensify traffic in some areas and relieve it in others. However, as freight transport activity expands to support longer journeys and greater mobility needs, road infrastructure is under increased pressure, which directly affects congestion levels.

Road congestion describes infrastructure overload, which occurs when the number of vehicles on the roads exceeds the infrastructure capacity. It may be measured by the speed decrease on a road section, queue length, delay, or waiting time during routing [10].

Congestion is a major phenomenon that affects the efficiency of transport activities and decreases their performance as a whole [11]. It generates additional congestion costs [30] and increases energy usage and environmental pollution [7].

Congestion may affect traffic flow because frequent acceleration and stops increase emissions, including CO₂, nitrogen oxides (NOx), and particulate matter (PM). It also reduces energy efficiency, increases transport costs, and affects urban liveability.

Several factors may influence the increasing levels of congestion, including infrastructure capacity, growth of logistics activities [31], rise in vehicle ownership, poor management of logistics flows [32], and logistics sprawl.

As confirmed by Ref. [33], logistics sprawl contributes to shaping urban mobility patterns and congestion levels. Although this movement may reduce freight traffic in city centers, it simultaneously creates new congestion challenges on suburban and interurban road networks. Meeting these challenges requires effective congestion management and strategic planning of logistics expansion. This involves expanding logistics activities within an optimal range that aligns with the specific characteristics of the region, such as transport infrastructure capacity, land use patterns, and mobility demands, while promoting urban sustainability and minimizing environmental impacts.

2.2. Overview of Modeling Approaches

This paper aims to identify the relationship between logistics sprawl and congestion level. Our purpose was to understand how logistics sprawl can contribute to increasing or decreasing the level of congestion, and we reviewed the existing literature to identify the different models and methods that have been used to analyze the impact of logistics sprawl on traffic congestion, emissions, and service performance. The main objective of this review is to assess the validity and applicability of existing approaches and, ultimately, to identify the most appropriate model or methodology that can be adopted to address our specific research problem (Table 2).

From the reviewed papers, we observed that the combination of logistics sprawl and logistic regression was the most frequent, with 14 papers, indicating a strong interest in understanding the spatial dynamics of logistics infrastructure through statistical modelling methods. In addition, we note that congestion analysis is often conducted using decision trees and random forests, with six articles each, underlining the importance of interpretable machine learning models in traffic-related studies. The recurrent co-occurrence of multiple techniques, including decision trees, random forests, and logistic regression, in several studies (including three papers combining all three) provides a comparative approach for modelling congestion. Moreover, methods such as gradient boosting and k-nearest neighbors also attract attention, each appearing in four congestion-related papers, while support vector machines are used less frequently, with only one paper. However, other studies apply four or more approaches, reflecting the growing trend towards hybrid modelling approaches, which allow the multidimensional nature of urban transport challenges to be included. Overall, logistic regression remains the dominant technique, particularly in studies of logistics sprawl, whereas random forest and ensemble methods predominate in research focusing on traffic congestion. However, the uniqueness of our research lies in the study of the impact of logistics sprawl on the level of congestion using logistic regression and random forests. Various models have been proposed in the literature, each tailored to specific contexts, objectives, and data availabilities. These include statistical methods, machine learning, simulation-based techniques, and optimization models. The choice of model often depends on factors such as the nature of the problem, computational resources, and the desired level of precision or interpretability. 
Table 3 summarizes the various models used to study logistics sprawl and traffic congestion, separately or together.

Table 3 classifies the machine learning models employed in transportation and urban research and highlights their respective application areas. From the reviewed papers, we observed that tree-based models were used in 33% of the reviewed papers because of their robustness and adaptability to various urban datasets, demonstrating strong predictive capabilities in the areas of traffic congestion, urban sprawl and logistics modelling. Next, regression models, led by logistic regression, are widely used (26% of papers) for their ease of interpretation and implementation, particularly in research on urban sprawl and shared mobility. Distance-based models, such as k-nearest neighbors and SVMs, were moderately present (9%), mainly in travel time estimation and traffic congestion prediction. Neural network models, including ANN, LSTM, and CNN, were introduced in 8% of the research papers; they were used to process temporal traffic data, although they were used less frequently because of their complexity and high data requirements. Hybrid and ensemble models, which account for 21% of the research, combine the strengths of several algorithms, particularly for traffic flow prediction, indicating a shift toward more integrated modelling approaches. The remaining 4% of the models were based on other, less common techniques. Among all these techniques, random forest and logistic regression stand out as the most dominant models in the literature, reflecting a strong balance between performance, flexibility, and practicality. As a result, we deduced that logistic regression and random forest are two methods among the most widely used models in the field of urban logistics and congestion analysis. Logistic regression, in particular, is frequently employed in studies addressing urban sprawl, logistics sprawl, and traffic congestion. 
Its popularity can be attributed to its simplicity and effectiveness in handling binary classification tasks, such as identifying whether a specific area experiences congestion or not. On the other hand, random forest is another model that has received particular attention for solving problems related to urban sprawl and sustainable urban development. It is characterized by its robustness, ability to handle complex datasets with multiple variables, and ability to capture nonlinear relationships. The ability of random forests to handle missing data and their relatively high predictive performance also contribute to their widespread use. Furthermore, the combination of multiple models has become increasingly popular in the literature, such as decision trees combined with logistic regression and neural networks. These hybrid models tend to outperform individual models by capturing both linear and nonlinear patterns, improving classification accuracy and model robustness.

3. Model

3.1. Comparative Review of Modeling Methods

Based on the research background and the nature of the problem studied, we identified a set of machine learning and statistical models that are widely used to examine the relationship between variables and predict outcomes, such as traffic congestion and logistics sprawl. These models include logistic regression, support vector machine (SVM), random forest (RF), decision tree, gradient boosting, and k-nearest neighbors (KNN). Each of these methods has strengths in terms of interpretability, handling nonlinearity, and performance. To evaluate and compare their effectiveness, we used several metrics for classification tasks: accuracy, AUC score, precision, recall, F1-score, and the confusion matrix. The goal was to identify the most reliable and robust model based on these indicators. Below, each metric is defined to provide a rigorous basis for the comparison (Table 4).

Confusion Matrix

The confusion matrix measures the model’s classification performance by comparing the predicted and actual values. It is used to identify how well the classification model performs, particularly by showing where the model’s predictions are correct and where errors occur. The matrix is used for both binary and multiclass classification tasks. For binary classification, it has a 2 × 2 dimension revealing the counts for each of the four possible outcomes: true positives, false positives, true negatives, and false negatives.
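As an illustration of the four outcomes described above, the following minimal sketch counts them for a binary task. The function name `confusion_counts` and the label vectors are hypothetical and are not taken from the study's code or data; the label coding (1 = "Congestion", 0 = "No Congestion") follows the class definitions used in this section.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for binary labels, treating `positive` as the positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, fp, tn, fn


# Hypothetical labels for illustration only (not the study's data).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)

# 2 x 2 layout: rows are actual classes (0, 1), columns are predicted classes (0, 1).
matrix = [[tn, fp],
          [fn, tp]]
print(matrix)
```

The row and column order chosen here (actual class by row, predicted class by column) is one common convention; other tools may transpose it.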

Accuracy

Accuracy measures the percentage of correctly classified instances from the total instances.

AUC Score (Area Under the ROC Curve)

The AUC score measures the model’s ability to distinguish between the classes. A higher AUC indicates better performance in separating positive and negative classes.
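One way to read the AUC is as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that pairwise definition. The function name `auc_score` and the example scores are hypothetical, and the quadratic loop is only suitable for small datasets.

```python
def auc_score(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities of "Congestion" (class 1).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_score(y_true, scores)
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.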

Precision (Class 0 and Class 1)

Precision quantifies the accuracy of positive predictions. It measures the proportion of positive identifications that are correct.

Class 0 precision: accuracy of predicting “No Congestion”.

Class 1 precision: accuracy of predicting “Congestion”.

A high precision value indicates few false positives.

Recall (Class 0 and Class 1)

Also known as sensitivity or true positive rate, recall measures how well the model captures actual positive cases.

Class 0 recall: proportion of correctly identified “No Congestion” cases.

Class 1 recall: proportion of correctly identified “Congestion” cases.

A high recall value indicates few false negatives.

F1-Score (Class 0 and Class 1)

The F1-score is the harmonic mean of precision and recall. It balances both metrics into one value, especially useful when dealing with class imbalance.

Class 0 F1-score: overall performance for “No Congestion”.

Class 1 F1-score: overall performance for “Congestion”.

Where

TP is the number of true positives
TN is the number of true negatives
FP is the number of false positives
FN is the number of false negatives.
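For reference, the standard definitions of these metrics in terms of the four counts are given below. They are written for the class treated as positive; for the other class, the roles of the positive and negative counts are swapped.

```latex
\begin{align}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
F_{1} &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
       = \frac{2\,TP}{2\,TP + FP + FN}
\end{align}
```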

Each of these metrics offers a distinct perspective on model performance, capturing various aspects, such as classification accuracy, the model’s ability to distinguish between classes, and its effectiveness in minimizing false positives and false negatives. By applying these indicators to all models, we aimed to identify the most suitable approach for our research objective, which focuses on accurately modeling the relationship between logistics sprawl, along with the other variables, and traffic congestion. The comparative results allow us to determine the model that provides the most balanced and reliable performance across all evaluation dimensions (Table 5).
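To make the per-class reporting concrete, the sketch below derives accuracy, precision, recall, and F1-score from the four counts and then repeats the calculation with the class roles swapped. The function name `class_metrics` and the counts are hypothetical and do not reproduce the results of the study; returning 0.0 for an undefined ratio is a convention chosen here.

```python
def class_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 for the class treated as positive."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts with class 1 ("Congestion") as the positive class.
class1 = class_metrics(tp=30, fp=10, tn=50, fn=10)

# Class 0 ("No Congestion") metrics: swap TP with TN and FP with FN.
class0 = class_metrics(tp=50, fp=10, tn=30, fn=10)

print(class1)
print(class0)
```

Accuracy is identical for both classes, while precision, recall, and F1-score generally differ, which is why they are reported per class.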

To assess the performance and stability of the models used, we applied k-fold cross-validation with different values of k (5, 10, and 15). We aimed to obtain a more robust and generalizable assessment of the model performance by dividing the dataset into k folds of equal size and iteratively training the model on k-1 folds while testing on the remaining fold. The k-fold cross-validation method reduces the variance associated with data partitioning and helps ensure that the results are not biased by any particular division. It provides a more complete evaluation by averaging the performance measures over several folds, which helps to better understand the extent to which the model generalizes to unseen data. Table 6 summarizes the average accuracy scores obtained from each fold setting across all evaluated models.
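The procedure described above can be sketched as follows, assuming a shuffled split into k folds whose sizes differ by at most one. The helper names (`k_fold_indices`, `cross_validate`, `placeholder_score`) are hypothetical, and the scoring function is a stand-in for fitting a model on the training indices and evaluating it on the held-out fold.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and split them into k folds of near-equal size."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first (n_samples % k) folds receive one extra sample.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(indices[start:start + size])
        start += size
    return folds


def cross_validate(n_samples, k, fit_and_score, seed=0):
    """Average the score returned by fit_and_score(train_idx, test_idx) over k folds."""
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the i-th, which is held out for testing.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(fit_and_score(train_idx, test_idx))
    return sum(scores) / len(scores)


def placeholder_score(train_idx, test_idx):
    """Stand-in scorer: returns the held-out share instead of a real accuracy."""
    return len(test_idx) / (len(train_idx) + len(test_idx))


for k in (5, 10, 15):
    print(k, cross_validate(30, k, placeholder_score))
```

In practice, `placeholder_score` would be replaced by a function that trains the chosen classifier and returns its accuracy on the test fold.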

The results show that model performance and stability improve with increasing k, particularly for random forest, KNN, and gradient boosting. Random forest consistently achieves the highest accuracy, rising from 80.7% with 5-fold cross-validation to 83.1% with 15-fold cross-validation, highlighting its strong generalization and robustness to data variation. Furthermore, logistic regression maintained a consistent but weaker accuracy (~69%) across all cross-validation settings, which can be explained by its linear form. Despite its relatively poorer predictive performance, logistic regression allows for a transparent comparison with more complex models, such as random forests. The combination of these two models, one interpretable and the other high-performing, offers both explanatory power and predictive strength, providing a balanced and comprehensive modelling approach. As 15-fold cross-validation yielded the most robust and consistent results across all models, it was selected for final evaluation. Consequently, we ranked the models according to their performance metrics as follows (Table 7):

Based on these performance indicators, we deduced that the combination of random forest and logistic regression presents a balanced approach that meets the objectives of both interpretability and relationship identification. Logistic regression, despite its lower predictive performance (accuracy = 0.69), offers valuable insights into the influence and relationship between variables based on its coefficients; therefore, it is particularly well adapted to understanding relationships within the data. On the other hand, the random forest demonstrates solid general performance (accuracy = 0.83, low overfitting risk) while preserving a high degree of transparency in how input features contribute to predictions. This approach allows leveraging the interpretative strength of logistic regression with the structured decision logic and higher predictive reliability of random forest, making it a practical and explainable solution when both variable insight and reasonable predictive accuracy are required.

Despite its lower performance, logistic regression is among the models that can address our problem. It is designed for classification problems where the outcome variable is categorical, and it provides probability estimates for the predicted outcomes. Its strengths are computational efficiency, ease of interpretation, and adaptability to different datasets; it is used to define the relationship between variables, in our case between logistics sprawl and congestion level. Unlike more complex models, such as support vector machines, logistic regression allows us to analyze how each factor influences the congestion level.

We then used random forest, which is among the most powerful methods for capturing complex and nonlinear relationships between the features and the target variable.

Compared to other algorithms, such as support vector machines, decision trees are computationally efficient and offer high interpretability, as they require less fine-tuning and handle both numerical and categorical data effectively. However, while decision trees are prone to overfitting and may lack generalization ability on unseen data, random forests overcome these limitations by combining multiple decision trees through an ensemble approach. This leads to significantly improved accuracy, robustness against overfitting, and better handling of high-dimensional data and complex variable interactions. Additionally, random forest maintains interpretability through feature importance scores while delivering superior predictive performance, making it a more reliable and balanced choice for our research objective.

In this sense, we suggest using both logistic regression and random forest to define the relationship between features and to predict the model outcome. Logistic regression sets up a linear relationship among the variables, offering an interpretable model that uses coefficients so that the contribution of each feature to the outcome is quantifiable.

Using both models allows their performance to be compared and inferences to be drawn from both the linear and the nonlinear standpoint, making the final model robust in predicting congestion based on logistics sprawl and the other features.

3.2. Logistic Regression and Random-Forest-Based Model

1. Logistic regression

Logistic regression is a statistical model for binary classification that assumes a linear relationship between independent variables and the log-odds of the dependent variable.

It consists of modelling the relationship between a set of independent variables (features) and a binary dependent variable (target). In this case, we intended to predict whether there is high or moderate congestion in the city based on various features such as logistics sprawl, density, and vehicle kilometers in operation. It is defined by the following equation [104]:

\operatorname{logit}(p) = \ln\left(\frac{p}{1 - p}\right) = \beta_{0} + \beta_{1} x_{1} + \dots + \beta_{n} x_{n}

(1)

where

p is the probability of the target variable being 1
x1, x2, …, xn are the features in the dataset
β0 is the intercept
β1, β2, …, βn are the coefficients corresponding to x1, x2, …, xn

The output p is calculated as follows:

p = \frac{1}{1 + \exp(-\operatorname{logit}(p))}

(2)

The coefficients β1, β2, …, βn each represent the change in the log-odds of the target variable (congestion) for a one-unit increase in the corresponding feature while holding the other features constant; the intercept β0 is the log-odds when all features equal zero.
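The two equations and the interpretation of the coefficients can be checked numerically with a short sketch. The function names and all numeric values below (intercept, coefficients, feature values) are hypothetical and are not the fitted values of the study; the branch on the sign of the linear predictor is only there to avoid floating-point overflow.

```python
import math


def predict_probability(features, intercept, coefficients):
    """Equation (2): map the linear predictor (the logit) to a probability."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # numerically stable form for very negative z
    return ez / (1.0 + ez)


def log_odds(p):
    """Equation (1), left-hand side: logit(p) = ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))


# Hypothetical model with two features.
intercept = -1.0
coefficients = [0.8, -0.5]

p_before = predict_probability([1.0, 2.0], intercept, coefficients)
p_after = predict_probability([2.0, 2.0], intercept, coefficients)  # first feature +1

# The log-odds move by exactly the first coefficient (0.8),
# and the odds are multiplied by exp(0.8).
print(log_odds(p_after) - log_odds(p_before))
print(math.exp(coefficients[0]))
```

Reading a coefficient through exp(β) as an odds ratio is often more intuitive than reading it on the log-odds scale.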

Furthermore, interaction terms, such as logistics sprawl × density, are relevant because they allow the model to capture the combined effect of different features on the predicted level of congestion.
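A minimal sketch of how such product features could be built is shown below, together with the way an interaction changes the effect of one feature on the log-odds. The helper names, column names, feature values, and coefficients are hypothetical and chosen only for illustration.

```python
def add_interactions(row, pairs):
    """Return a copy of `row` extended with one product feature per (a, b) pair."""
    extended = dict(row)
    for a, b in pairs:
        extended[f"{a}_x_{b}"] = row[a] * row[b]
    return extended


def marginal_log_odds_effect(beta_main, beta_interaction, other_value):
    """Effect on the log-odds of a one-unit increase in a feature that also
    appears in an interaction: beta_main + beta_interaction * other_value."""
    return beta_main + beta_interaction * other_value


# Hypothetical (already scaled) feature values for one city.
city = {"sprawl": 2.0, "density": 3.0, "vehicle_km": 5.0}
features = add_interactions(city, [("sprawl", "density"), ("sprawl", "vehicle_km")])
print(features)

# With an interaction, the effect of sprawl depends on the level of density.
print(marginal_log_odds_effect(beta_main=0.4, beta_interaction=0.1, other_value=3.0))
```

This is why the coefficient of a feature that enters an interaction can no longer be read in isolation: its effect varies with the value of the other feature.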

Logistic regression relies on the following assumptions (Table 8):

Several key assumptions were examined to assess the robustness of our logistic regression model. The first is linearity in the logit, which was handled by incorporating meaningful interaction terms, such as sprawl × density and sprawl × vehicle_KM; these approximate nonlinear effects and partially mitigate the limitations of not using a fully nonlinear model. The second is the independence of observations, which is maintained because each data point represents a distinct city with no spatial or temporal overlap between cities. However, we recognize the potential influence of latent regional similarities, which could be explored in future geographically aware models. While the homoscedasticity and normality of the residuals are less critical in logistic regression, we did not observe any major deviations. In addition, the use of classification measures such as AUC and F1, which are not sensitive to the behavior of residuals, further reduces these concerns. Potential multicollinearity, particularly between features such as vehicle ownership and vehicle kilometers, was addressed through feature standardization and careful feature construction, which limits its effect on interpretability rather than on predictive power. Finally, although the original sample size was limited (n = 19), we expanded the dataset using SMOTE and domain-informed augmentation techniques (e.g., noise-based augmentation and mixup). These synthetic enhancements were rigorously validated using 15-fold cross-validation, confirming stable model performance and supporting the adequacy of the data for exploratory analyses.

2. Random forest

Random forest is a supervised ensemble learning algorithm that combines the predictions of multiple decision trees to improve predictive accuracy and control overfitting. Formally, given a training dataset $D = \{(x_{i}, y_{i})\}_{i = 1}^{N}$, where $x_{i} \in \mathbb{R}^{d}$ and $y_{i} \in Y$, a random forest constructs $T$ decision trees $\{h_{t}(x)\}_{t = 1}^{T}$, each trained on a bootstrap sample $D_{t} \subset D$. At each split in a tree, a random subset $F \subset \{1, \dots, d\}$ of features is considered for determining the optimal split, which further reduces variance.

For classification, the final prediction ỹ for a new instance x is given by majority voting

\tilde{y} = \operatorname{mode} \{h_{t}(x)\}_{t = 1}^{T}

For regression, the prediction is the average of outputs

\tilde{y} = \frac{1}{T} \sum_{t = 1}^{T} h_{t} (x)

The randomness introduced through both bootstrapping and feature subsetting enhances model robustness by reducing the correlation between individual trees and limiting overfitting, especially in high-dimensional spaces.
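The bootstrap-and-vote mechanism described above can be sketched in a few lines. This is a minimal illustration and not the code used in the study: the function names `fit_forest` and `predict_forest`, the number of trees, and the `max_features="sqrt"` setting are assumptions chosen for the example, and labels are assumed to be binary (0/1).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=25, seed=0):
    """Train n_trees decision trees, each on a bootstrap sample D_t of (X, y)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)  # draw n rows with replacement
        tree = DecisionTreeClassifier(
            max_features="sqrt",  # random feature subset F considered at each split
            random_state=int(rng.integers(0, 2**31 - 1)),
        )
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    """Majority vote over the trees, assuming binary labels 0/1."""
    votes = np.stack([t.predict(X) for t in trees])  # shape (T, n_samples)
    return (votes.mean(axis=0) >= 0.5).astype(int)
```

In practice, `sklearn.ensemble.RandomForestClassifier` implements the same idea with many refinements; the sketch only makes the two sources of randomness explicit.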

In addition, the random forest algorithm relies on several assumptions, which are described in Table 9.

As described in Table 9, we deduce that random forest accommodates the structure and complexity of our dataset well. It is characterized by its ability to capture complex, nonlinear relationships. This was especially important for our variables, such as congestion, distance traveled, and logistics sprawl, which tend to interact in unpredictable and nonlinear ways. Unlike models such as logistic regression, which require explicit feature engineering (for example, interaction terms) to represent such effects, random forests handle them naturally. Another advantage is that random forest does not make specific assumptions about the data distribution. Because our dataset includes variables with skewed distributions and varying scales, such as density and car ownership, random forest allowed us to work directly with the raw data without the need for transformations. The model also assumes that each observation is independent, which is the case in our study, as each data point represents a distinct urban area with no temporal or repeating structures. We mitigated the risk of overfitting by employing synthetic data generated using techniques such as Gaussian noise, mixup, and SMOTE. In addition, we tuned the model using GridSearchCV, applied 15-fold cross-validation, and monitored key metrics such as AUC and F1 score; none showed signs of overfitting.
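The tuning step mentioned above (GridSearchCV with 15-fold cross-validation, monitored through AUC and F1) could look roughly as follows. This is a sketch under stated assumptions: the data are a synthetic placeholder generated with `make_classification`, and the parameter grid is illustrative, since the article does not report the grid that was actually searched.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Placeholder data standing in for the augmented city dataset (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Illustrative grid; the values searched in the study are not reported.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
cv = StratifiedKFold(n_splits=15, shuffle=True, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring={"auc": "roc_auc", "f1": "f1"},  # track both metrics
    refit="auc",                             # pick the final model by AUC
    cv=cv,
    return_train_score=True,
)
search.fit(X, y)

best = search.best_index_
train_auc = search.cv_results_["mean_train_auc"][best]
test_auc = search.cv_results_["mean_test_auc"][best]
print(search.best_params_, round(train_auc, 3), round(test_auc, 3))
```

A large gap between the training and validation AUC would be one sign of overfitting. Note also that when synthetic samples are generated before the folds are drawn, near-copies of a training point can land in a validation fold, which may make cross-validated scores optimistic.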

However, the main goal of using random forest was prediction rather than interpretability; therefore, this trade-off was acceptable for our study. Finally, although the original sample was relatively small (n = 19), our data augmentation strategy provided the model with sufficient diversity to learn from and to deliver stable performance in defining the relationship between logistics sprawl and congestion.

The modelling workflow proceeded as follows. First, we processed the data and applied linear interpolation, Gaussian noise, SMOTE, and mixup to augment and balance our dataset. We then used logistic regression to estimate the probability of congestion and its relationship with the different features. Logistic regression measures feature importance through its coefficients and yields an explicit relationship equation. Random forest classification was used to capture nonlinear patterns and hierarchical decision rules, making it better suited to predicting congestion. Both models were evaluated for accuracy to explore the relationships between the variables and congestion levels.

We selected logistic regression (LR) and random forest (RF) because of their complementary strengths in classification tasks involving structured tabular data. Logistic regression is a linear and stable model that offers interpretable coefficients, which are well suited to capturing the direction and strength of the relationships between predictors, such as logistics sprawl and density, and congestion levels. Such interpretability is suited to policy and planning scenarios. Random forest, being an ensemble learner, can learn complex, nonlinear relationships and interactions between features, is less prone to overfitting owing to its bagging process, and provides intrinsic feature importance estimates. Employing both models allows for a comparative view: LR offers transparency and statistical interpretation, whereas RF offers predictive ability and detects interactions and nonlinearities that are intrinsic to real-world urban systems. Cross-validation and hyperparameter tuning further support model robustness.

3.3. Variables Identification

Based on the literature review, we selected a set of variables through which we modelled the relationship between logistics sprawl and congestion level. We retained the most frequent and relevant indicators, among which we identified ordinal and categorical variables. We summarized these variables in Table 10.

Among these variables, we have categorical and ordinal variables, which can be classified as follows:

The variables used in our model were extracted from the literature review, and we identified the most significant and recurrent variables through which we could define the relationship between the logistic sprawl and congestion. We then classified them according to their type and characteristics, as explained in Table 11.

4. Data

To run our model, we built a dataset from the literature review, identified the characteristics of these variables in different cities around the world, and searched different database sources, including “CityTransitData”, to complete the remaining data. Thereafter, we used linear interpolation, Gaussian noise, mixup, and the SMOTE algorithm to augment and balance our dataset.

The main aim of this study was to introduce the model rather than to focus on a specific dataset. Therefore, we used these data only to run our model and to show its accuracy, since the model makes it possible to predict congestion levels and to define their relationship with our variables. This study provides decision-makers with valuable insights and hypotheses to consider in their decisions.

4.1. Collecting Data from the Literature Review

In total, we reviewed 29 research papers that dealt specifically with the logistics sprawl phenomenon in different cities. Based on this review, we summarized the behavior of logistics sprawl in different regions in Table 12.

Through the literature review, we observed that the logistics sprawl phenomenon acts similarly in different cities, contributing to an increase in the distance travelled to the urban core, except in the case of Seattle, where logistics sprawl has led to a decrease in the distance travelled [8].

We observed that, in the majority of cities, logistics facilities sprawled by between 0.6 and 19.58 km. The exceptions are Melbourne, where facilities sprawled 40 to 50 km from the city center; Gauteng, where they sprawled by 231.49 km; and Seattle, where the sprawling movement decreased the distance by 1.3 km.

Based on this finding, we intend to complete the data on the city’s characteristics and congestion level. Therefore, we propose to retrieve data from different database sources, including the city transit database.

4.2. Data Augmentation Workflow

In this step, we searched for data on cities and operational flow characteristics. We searched for cities in different database sources, including the Transit Database. Then, we used a synthetic minority oversampling technique to generate synthetic data points in imbalanced datasets, which consists of creating new examples for the minority class. It identifies the nearest neighbors of each sample and then interpolates between them to create synthetic examples. We trained our model using the data collected from these cities.

Data collection involved three major steps. In the first step, we identified several cities where logistics sprawl occurs, as shown in Table 13. The level of logistics sprawl was measured by the increase in distance from urban centers. In the second step, we completed the remaining data regarding population density, infrastructure, delays, etc., from different sources, such as the Transit City Database.

In the third step, we used linear interpolation, Gaussian noise, and mixup to generate additional dataset samples. We then used the SMOTE algorithm to balance our dataset (Figure 2).

To select the most appropriate augmentation technique, we compared the performance of different existing augmentation and balancing techniques (Table 14).

Based on the conducted comparative survey, we proposed combining linear interpolation, Gaussian noise, and the mixup method for data augmentation, alongside SMOTE for balancing the dataset. After comparing different augmentation techniques, we found these methods to be the most effective in improving the model performance. Linear interpolation and Gaussian noise are widely used because they generate realistic synthetic data while preserving the inherent dataset structure. Mixup further enhances data diversity by creating new training samples that are linear combinations of existing samples, which is beneficial for improving model generalization. Conversely, SMOTE stands out as one of the best methods for balancing the dataset. It improves classification accuracy by addressing class imbalance, achieving an excellent balance between accuracy and recall, particularly with the random forest model. Therefore, this combination of techniques was chosen to optimize the quality, balance the dataset, and improve the model performance.

Mathematical Formulation of SMOTE

This method is used to augment the dataset and create synthetic samples by interpolating between existing samples of the minority class.

Given the dataset (X, y), where $X \in ℝ^{n \times d}$ represents the feature matrix of n samples and d features, and $y \in \{0, 1\}$ is the binary class label used for congestion level, the SMOTE method is implemented as follows:

➢ Identify Minority Class Samples

Let $S = \{x_{1}, x_{2}, \dots, x_{m}\}$ be the set of minority class samples, where $m < n$.

➢ Find K-Nearest Neighbors

For each minority class sample $x_{i}$, we searched for its k-nearest neighbors within the minority class using the Euclidean distance:

d (x_{i}, x_{j}) = \sqrt{\sum_{l = 1}^{d} {(x_{i, l} - x_{j, l})}^{2}}, \forall j \in N_{k} (x_{i})

where $N_{k}(x_{i})$ is the set of k-nearest neighbors of $x_{i}$.

➢ Generate Synthetic Samples

For each $x_{i}$, the SMOTE method randomly selects one of its neighbors $x_{nn} \in N_{k}(x_{i})$ and interpolates between the two points to generate a synthetic sample:

x_{new} = x_{i} + λ \cdot (x_{nn} - x_{i})

where $λ \sim U(0, 1)$ is a random number drawn from a uniform distribution.

➢ Repeat Until Class Balance is Achieved

Repeat the process until class balance is obtained.
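The four steps above can be sketched directly in NumPy. This is a minimal illustration of the formulation, not the implementation used in the study (which lists the Imbalanced-learn library among its tools); the function name `smote` and the defaults `k=5` and `seed=0` are assumptions made for the example.

```python
import numpy as np

def smote(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by neighbor interpolation.

    X_min: array of shape (m, d) holding the minority class samples S (m >= 2).
    """
    rng = np.random.default_rng(seed)
    m = X_min.shape[0]
    k = min(k, m - 1)  # a sample cannot have more than m - 1 neighbors

    # Pairwise Euclidean distances d(x_i, x_j) within the minority class.
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # exclude each point from its own neighbors
    neighbors = np.argsort(dist, axis=1)[:, :k]  # N_k(x_i) for every i

    base = rng.integers(0, m, size=n_new)                    # choose x_i
    nn = neighbors[base, rng.integers(0, k, size=n_new)]     # choose x_nn in N_k(x_i)
    lam = rng.random((n_new, 1))                             # lambda ~ U(0, 1)
    return X_min[base] + lam * (X_min[nn] - X_min[base])     # x_new
```

To reach class balance, `n_new` would be set to the difference between the majority and minority class counts. Because every synthetic point lies on a segment between two existing minority samples, it always stays inside the range of the observed minority data.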

The SMOTE method is among the most widely used methods for handling class imbalance, and we preferred it over other balancing techniques because it limits overfitting, provides a balanced class representation, and generates realistic synthetic data.

To enhance the overall generalizability and robustness of the classifier models, we employed three data augmentation strategies: Gaussian noise, linear interpolation, and mixup. These enlarge the training dataset by generating synthetic samples that capture potential variation in the original distribution. Gaussian noise introduces small, random perturbations around existing samples, which acts as a form of regularization. Linear interpolation creates new points between randomly selected instances from the dataset, resembling intermediate patterns. Mixup goes a step further by combining pairs of samples, with a mixing weight drawn from a beta distribution, to produce both synthetic features and synthetic labels.

Based on this dataset, we developed our model and ran it in Python 3 on Colab. In this research paper, we focus on presenting the model rather than the data.
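The three augmentation strategies can be illustrated with the short sketch below. The noise scale, the beta parameter `alpha`, and the function names are assumptions chosen for the example; the article does not report the values that were used.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_noise(X, scale=0.05):
    """Perturb each feature with noise proportional to its standard deviation."""
    return X + rng.normal(0.0, scale * X.std(axis=0), size=X.shape)

def linear_interpolation(X, n_new):
    """Create points on the segment between two randomly chosen rows."""
    i, j = rng.integers(0, len(X), size=(2, n_new))
    t = rng.random((n_new, 1))
    return X[i] + t * (X[j] - X[i])

def mixup(X, y, n_new, alpha=0.2):
    """Blend pairs of samples and their labels with weight lam ~ Beta(alpha, alpha)."""
    i, j = rng.integers(0, len(X), size=(2, n_new))
    lam = rng.beta(alpha, alpha, size=(n_new, 1))
    X_new = lam * X[i] + (1 - lam) * X[j]
    y_new = lam[:, 0] * y[i] + (1 - lam[:, 0]) * y[j]  # soft labels in [0, 1]
    return X_new, y_new
```

Mixup produces soft labels; for a standard binary classifier they would have to be converted to hard labels (for example, by thresholding at 0.5), which is a further assumption of this sketch.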

4.3. Coding Tool

To execute our model, we used the Python 3 Colab platform, employing libraries such as Pandas, NumPy, Seaborn, Matplotlib, Scikit-learn, and Imbalanced-learn. The dataset was first loaded and processed using Pandas, with categorical variables encoded and missing values addressed. To handle class imbalance and enrich the dataset, multiple data augmentation techniques were applied: SMOTE was used to synthetically generate samples for the minority class, while Gaussian noise addition and linear interpolation introduced variability and smoother transitions in the feature space; mixup was also employed to create hybrid samples by blending input features and labels from different classes. These combined techniques improved the representativeness of the training set. After feature scaling using StandardScaler, the dataset was split into training and test sets. Logistic regression was used to analyze the linear relationship between logistics indicators like logistics sprawl, vehicle kilometers, and road congestion, while random forest captured complex, nonlinear patterns for improved prediction. Both models were evaluated using accuracy, confusion matrix, and classification reports, with results compared to highlight their respective strengths in interpreting and predicting urban congestion. Logistic regression introduced the relationship between various features and target variables, such as logistics sprawl, vehicle kilometers, and road congestion. Then, we visualized the data through pair plots with regression lines, providing a deeper understanding of how these features correlate with congestion levels. The logistic regression model’s equation was also derived, displaying the impact of each variable on road congestion. Finally, the results were compared between the two models, demonstrating their effectiveness in predicting congestion based on urban logistics factors.
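A compact version of this workflow, using the Scikit-learn library named above, might look like the following. It is a sketch only: the data are a synthetic placeholder from `make_classification`, the hyperparameters are illustrative, and, unlike the order described in the text, the scaler is fitted on the training split only (through a pipeline) so that no test-set statistics enter the preprocessing.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder for the augmented dataset (assumption): 8 features, binary target.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    # Scaling matters for logistic regression coefficients, not for tree splits.
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    results[name] = accuracy_score(y_te, pred)
    print(name, round(results[name], 3))
    print(confusion_matrix(y_te, pred))
    print(classification_report(y_te, pred))
```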

5. Results and Discussion

5.1. Performance Indicator

We executed the developed logistic regression and random forest model in Colab Python. Then, we analyzed the model performance using the indicators described in Table 5.

Based on the metric value, we observed that the model demonstrates a moderate overall performance with an accuracy of 0.697, indicating that approximately 70% of instances are correctly classified. However, accuracy alone is insufficient for evaluating performance in the presence of potential class imbalance, necessitating further metric analysis. The AUC score of 0.75 suggests good discriminative ability, meaning the model effectively differentiates between classes better than random chance, without signs of overfitting or data leakage. Class-specific precision values of 0.69 for Class 0 and 0.71 for Class 1 reflect moderate correctness in positive predictions, with some false positives still present. Similarly, recall values of 0.72 for Class 0 and 0.68 for Class 1 indicate that the model detects a majority, but not all, of the actual instances, missing approximately 28% of negatives and 32% of positives, respectively. The F1-scores, which balance precision and recall, were 0.70 for Class 0 and 0.69 for Class 1, indicating moderate harmony between these measures but also room for improvement. The confusion matrix further quantifies this performance: 1364 true negatives and 1293 true positives were correctly identified, whereas the model produced 536 false positives and 619 false negatives, illustrating its moderate capability in distinguishing between the two classes.
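The reported metrics follow directly from the confusion matrix counts given above, as the short check below shows. Only the four counts come from the text; the variable names are illustrative.

```python
# Logistic regression confusion matrix counts reported in the text.
tn, fp, fn, tp = 1364, 536, 619, 1293

accuracy = (tp + tn) / (tp + tn + fp + fn)

precision_1 = tp / (tp + fp)  # share of predicted positives that are correct
recall_1 = tp / (tp + fn)     # share of actual positives that are detected
precision_0 = tn / (tn + fn)  # share of predicted negatives that are correct
recall_0 = tn / (tn + fp)     # share of actual negatives that are detected

f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

print(f"accuracy={accuracy:.3f}")
print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f} f1={f1_0:.2f}")
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
```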

As a result, we deduced that logistic regression is not the most suitable method for predicting congestion levels, as it achieves only a moderate accuracy of 69%. The model struggled slightly more with Class 1 (positive cases): its recall for this class (0.68) is lower than that for Class 0 (0.72), and it produced more false negatives (619) than false positives (536). However, it remains effective for identifying relationships between variables, as the model coefficients clearly show which features influence the outcome and the direction of that influence.

Although logistic regression may not be the most accurate for prediction, it provides valuable insights into how different variables interact with and impact the outcome, making it a useful tool for understanding relationships. In cases where the goal is to understand these relationships rather than achieve perfect prediction, logistic regression is a good choice.

After leveraging logistic regression to interpret the relationships between explanatory variables and congestion, we used random forest to address the nonlinear nature of the problem and enhance predictive accuracy, offering more reliable insights for decision-making.

We observed that random forest exhibited a strong overall performance with an accuracy of 0.831, correctly classifying roughly 83% of the instances. The AUC score of 0.84 further confirmed the model’s ability to distinguish between classes, indicating effective class separation. The precision for Class 0 was 0.86, meaning that 86% of the instances predicted as negative were actually negative, which demonstrates high reliability in identifying Class 0 instances. The precision for Class 1 was 0.83, which, although slightly lower, still indicated good accuracy with some false positives. The recall for Class 0 was 0.83, meaning that the model correctly detected 83% of the actual negatives and misclassified approximately 17% of them as positives. The recall for Class 1 was 0.87, indicating that the model identified most positive instances with few misses. Correspondingly, the F1-scores for both classes reflect a balanced performance: 0.84 for Class 0 and 0.85 for Class 1, indicating a strong harmony between precision and recall for both classes. The confusion matrix further quantifies this performance, with 1570 true negatives and 1666 true positives correctly identified, alongside 330 false positives and 246 false negatives, reinforcing the model’s effective classification capabilities.

Table 15 summarizes the calculated values:

Based on the accuracy metric, we observed that the logistic regression model was correct 69% of the time. In contrast, the random forest model performed considerably better, with an accuracy of 83%, reflecting a strong capability to classify the data. Both models made some classification errors, but the random forest model was consistently more accurate and reliable than the logistic regression model. The weaker performance of logistic regression may indicate that linear relationships are not sufficient to model the complexity of the data, which would explain the higher performance of random forest, as it can identify nonlinear patterns and interactions between features. Although random forest performs better, both models contribute to an understanding of the pattern of congestion. Based on the confusion matrix and accuracy values, the random forest model predicts congestion levels with good accuracy.

5.2. Variables Weights and Equation Model

The developed model allowed us to measure the impact weight of each variable on the congestion level. Table 16 summarizes the weight of each variable:

Based on these results, we observed that vehicle ownership has the strongest impact on the congestion level: its coefficient (0.7463) suggests that a higher number of vehicles per 1000 people leads to increased congestion, since more vehicles are on the road. It is followed by average daily ridership (0.5837): a higher level of vehicle routing in the city increases the congestion risk. Population density also contributes to increasing congestion, with a coefficient of 0.4040, since a larger population may lead to more passenger vehicles, higher demand, and more freight transport activity. In contrast, logistics sprawl has a negative coefficient (−0.2413), meaning that spreading logistics activities away from urban centers helps alleviate congestion: the movement of logistics facilities from the urban core to the peripheries decentralizes freight movement. The sprawl × density (0.0514) and sprawl × vehicle km (0.1572) interaction terms are positive but small, indicating that this relief is weaker in cities with higher density and more vehicle kilometers. Vehicle kilometers in operations (−0.1910) and road length (−0.3120) have negative coefficients of smaller magnitude, indicating that their effects are less pronounced than those of vehicle ownership, ridership, and density.

The observed negative association between logistics sprawl and urban congestion is supported by a growing body of urban logistics literature. Research by [1] has shown that when logistics facilities relocate from central to peripheral urban zones, it can relieve traffic pressure by shifting freight movements out of high-density centers. This spatial decentralization reduces the concentration of freight vehicle trips within congested cores, aligning with our regression findings (coefficient = −0.2413).

However, this effect is complex. Logistics sprawl can lead to longer supply chains and less efficient routing, which may result in increased freight vehicle kilometers and higher emissions. This is reflected in our interaction terms, where the positive coefficients for sprawl × density and sprawl × vehicle km suggest that while congestion in the city center may decrease, other forms of congestion or inefficiency may emerge in outer zones. Our results contribute to this nuanced debate by illustrating that the benefits of logistics sprawl on congestion are not linear or universal but are contingent on urban density and transport demand dynamics.
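The role of the interaction terms can be made explicit with a short derivation. As an assumption for illustration, let S, D, and V denote the standardized values of logistics sprawl, density, and vehicle kilometers, and let each interaction term be the product of the corresponding standardized variables (the article does not state how the interaction features were constructed). Differentiating the fitted logit with respect to S then gives:

```latex
\frac{\partial\, \operatorname{logit}(p)}{\partial S}
  = -0.2413 + 0.0514\, D + 0.1572\, V
```

Under this assumption, sprawl lowers the log-odds of congestion as long as $0.0514\, D + 0.1572\, V < 0.2413$. For a city at the sample average ($D = V = 0$), the effect is $-0.2413$; for $D = V = 1$, it shrinks to $-0.2413 + 0.2086 = -0.0327$; and for $D = V = 2$, it becomes positive ($+0.1759$).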

These findings reinforce the need for coordinated urban planning strategies that integrate land-use policies with freight infrastructure development to optimize congestion outcomes.

In this respect, we introduced an equation that explains the relationship between the congestion level and these variables, which can be expressed as follows:

Equation Representation
Logit(p) = 0.0087 + (−0.2413) * Logistics Sprawl + (0.4040) * Density (people/km²) + (−0.1910) * Vehicle Kilometers in Operations (million km) + (0.5837) * Average Daily Ridership (thousands) + (−0.3120) * Length of Roads (km) + (0.7463) * Vehicle Ownership (per 1000 people) + (0.0514) * Sprawl Density Interaction + (0.1572) * Sprawl Vehicle KM Interaction (2)
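Equation (2) can be turned into a congestion probability by applying the logistic (sigmoid) function to the logit. The sketch below assumes that the inputs are standardized feature values (z-scores), since the features were scaled before fitting, and that each interaction term is the product of the two standardized variables; both points are assumptions, and the dictionary keys are illustrative names.

```python
import math

INTERCEPT = 0.0087
COEF = {
    "logistics_sprawl": -0.2413,
    "density": 0.4040,
    "vehicle_km": -0.1910,
    "ridership": 0.5837,
    "road_length": -0.3120,
    "vehicle_ownership": 0.7463,
    "sprawl_x_density": 0.0514,
    "sprawl_x_vehicle_km": 0.1572,
}

def congestion_probability(z):
    """Return p from Equation (2); z maps feature names to standardized values."""
    feats = dict(z)
    # Interaction terms, assumed to be products of standardized features.
    feats["sprawl_x_density"] = z["logistics_sprawl"] * z["density"]
    feats["sprawl_x_vehicle_km"] = z["logistics_sprawl"] * z["vehicle_km"]
    logit = INTERCEPT + sum(COEF[name] * feats[name] for name in COEF)
    return 1.0 / (1.0 + math.exp(-logit))  # inverse of the logit function

# An "average" city (all z-scores at 0) versus the same city with more sprawl.
average_city = {name: 0.0 for name in COEF if "_x_" not in name}
sprawled_city = dict(average_city, logistics_sprawl=1.0)
print(congestion_probability(average_city), congestion_probability(sprawled_city))
```

With all other features held at their mean, increasing sprawl by one standard deviation lowers the predicted probability of congestion, which matches the sign of the sprawl coefficient.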

From this equation, we confirmed that the more logistics activities sprawl, the more congestion decreases in the urban core.

The level of congestion in urban centers decreases with logistics sprawl. When logistics sprawl occurs, it leads to an increasing distance from the urban center, shifting logistics activities to the peripheries. It contributes to destressing and decongesting logistics activities in the urban core. As a result, we deduced that the more activities are spread out to the peripheries, the more congestion is reduced in the city center.

Table 17 summarizes the impact of each variable on the congestion level.

These observations confirm some potential relationships but also highlight areas where the relationships are strong, weak, or nonexistent. The LR coefficient provides a comprehensive visual overview, helping identify which predictor variables may be important relative to others.

Overall, the results confirmed that logistics sprawl, density, vehicle kilometers, and ridership drive the variation in congestion, whereas other variables may be less significant. This suggests that urban planning interventions related to sprawl and transportation should target congestion abatement in highly dense areas.

From this analysis, we summarize the relationship between our independent and dependent variables.

Figure 3 introduces the relationship between the congestion level as a dependent variable and the independent variables: logistics sprawl, density, vehicle kilometers in operations, average daily ridership, length of roads, and vehicle ownership.

Using logistic regression curves, the three graphs highlight the relationships between congestion on roads and density, vehicle kilometers in operations, and average daily ridership. The first graph shows a positive correlation between population density and congestion, indicating that increasing density leads to increasing congestion levels, which is related to higher travel demand and road usage. The second graph demonstrates a negative relationship between the vehicle kilometers travelled and congestion. We observed that higher operational vehicle kilometers do not necessarily lead to increased congestion, considering that efficient transportation systems or better infrastructure may reduce congestion levels. The third graph illustrates a strong positive relationship between average daily ridership and congestion, suggesting that zones with more transportation activity tend to develop higher levels of congestion, possibly due to infrastructure or route overflow. These insights indicate the importance of balancing urban densification with proper transit operations and road infrastructure to mitigate congestion.

Therefore, we observed that congestion increased in parallel with increasing population density. Similarly, the increase in the average daily ridership, which refers to vehicle routing, led to an increase in the congestion level. However, we observed that an increase in the number of kilometers in operation does not directly contribute to an increase in congestion.

Similarly, the graphs in Figure 4 represent the relationship between road length, vehicle ownership, and congestion through logistic regression curves. We observed a slight increase in congestion in parallel with the increase in road length, which indicates that more roads invite more vehicle usage, thus leading to higher congestion levels. The second graph shows a very strong positive correlation between vehicle ownership and congestion: as more people own vehicles, the more cars are likely to be on the road, and thus, congestion increases enormously. These results imply that simply expanding road infrastructure without effective traffic management strategies and alternative transportation solutions may fail to reduce congestion.

In the same vein, we evaluated the impact of logistics sprawl on congestion level; Figure 5 introduces this impact. As a result, we observed a negative correlation between logistics sprawl and congestion on roads, indicating that as logistics sprawl increases, congestion levels tend to decrease.

This trend suggests that distributing logistics hubs over a wider geographical area helps reduce traffic congestion, likely by decentralizing freight movements and reducing the concentration of delivery vehicles in high-traffic zones. When logistics facilities are concentrated in a small area, they generate intense freight activity in limited road space, leading to severe congestion. However, as these facilities sprawl outward, truck routes become more dispersed, alleviating congestion in urban centers.

The steep decline in congestion at lower levels of logistics sprawl suggests that even modest decentralization can significantly reduce congestion.

Additionally, the confidence interval shows some variability, indicating that other factors, such as road capacity, traffic policies, and the efficiency of logistics operations, may influence congestion levels.

In summary, encouraging a balanced approach to logistics sprawl can be an effective strategy for mitigating congestion. However, excessive dispersion may lead to inefficiencies, such as longer travel distances and increased transportation costs. Therefore, urban planners should focus on optimizing logistics distribution to strike a balance between alleviating congestion and ensuring overall operational efficiency.

From these graphs, we observed that logistics sprawl, measured by the increase in distance, leads to a slight decrease in the level of congestion. This means that the more we extend logistics activities to peripheral areas, the more congestion in the urban center decreases.

However, several other factors influence congestion in often negative ways, offsetting logistics decentralization trends.

The density of urban areas is a major contributor to congestion: the higher the population and business density, the more vehicles compete for road space, and thus the more frequent the stop-and-go traffic. Furthermore, additional stress is imposed by the continuously increasing number of private vehicles: passenger cars consume a great deal of road capacity and delay the movement of both people and goods. Likewise, average daily ridership contributes to congestion, particularly when road-based transit systems account for the major share, causing bottlenecks and slowing down overall traffic flow.

Road network design also plays an important role: the longer the key roads, the more congestion arises, because these essential roads, being limited in number, become the corridors carrying the largest traffic volumes, leaving drivers with few alternative routes. In contrast, a well-designed road network spreads traffic volumes more evenly without relying too heavily on any particular route, thereby decreasing overall congestion.

Interestingly, vehicle kilometers in operations (VKO), that is, routing and transport movements, or the logistics activity itself, might not generate congestion and, in some instances, may actually decrease it. In this respect, optimized logistics are the result of efficient route planning, consolidation of deliveries, and the use of advanced planning techniques. This means that fewer vehicles are needed to transport the same quantity of goods, with less unnecessary use of roads and less congestion. Good logistical planning ensures that freight transportation minimizes the overall distance traveled, avoids peak hours of traffic congestion, and uses off-peak delivery slots to help reduce congestion rather than add to it.

While logistics sprawl contributes to congestion dynamics, only a wide holistic urban mobility strategy incorporating intelligent infrastructure development, planning of alternative roads, effective logistics routing, and demand management policies will offer a prospect of long-term congestion reduction.

Based on the results, we confirmed that the developed model achieves a good level of accuracy, as supported by the confusion matrix and accuracy values. Logistic regression demonstrated an accuracy of 69%, whereas random forest showed an accuracy of 83%. These values demonstrate the usefulness of our model. However, the accuracy of the prediction also depends on the size of the data; in our sample, we used augmented data.

The results suggest that organizational aspects, including vehicle routing, population density, increase in transport demand, and logistics sprawl, are among the criteria that have the greatest impact on the level of congestion in urban centers. In this respect, we have confirmed that optimizing logistics flows, consolidating demand, loading and unloading operations, etc. [119], can help optimize the level of congestion in urban centers. Moreover, the use of new technologies, including the Internet of Things [120], VANET [121], intelligent transport systems [122], the Internet of Vehicles, etc. [123], is relevant for predicting the level of congestion and proposing optimal vehicle routing scenarios.

Furthermore, the choice of means of transport is important for making logistics operations more flexible and accessible. Fundamentally, the use of unmanned aerial vehicles, hybrid trucks, automated vehicles, and drone systems for delivery operations helps reduce road congestion and destress city cores [124].

In particular, the use of Petri nets can help monitor the impact of logistics expansion on transport efficiency, as they provide a systematic and formal framework for modelling and analyzing the complex interactions and processes within intermodal freight terminals [125]. In addition, timed Petri nets are used to optimize resource allocation and transport planning [126]. Then, first-order hybrid Petri nets may be used to support decision-making in terms of maximizing container flows and managing and modeling congestion [127]. In general, Petri net models, including colored, continuous, and hybrid models, are used to optimize last-mile operations, as they help estimate optimal vehicle routing scenarios and transport planning, and are used to define the least congested routes for efficient logistics operations [128].

Logistics sprawl moves activities from the urban core to the periphery. In this study, we discuss the impact of this movement on congestion levels.

Consequently, we deduced that logistics sprawl can help ease heavy traffic flows in city centers. It reduces congestion, which is one of the main causes of delays and inefficient urban mobility. Congestion also increases fuel consumption and emissions because of bottlenecks. Thus, reducing congestion through logistics sprawl can help reduce overall emissions. In this sense, our results indicate that extending logistics facilities to suburban and periurban locations improves mobility and air quality in urban areas. This practice supports sustainability by reducing the emissions that result from congestion (Figure 6).

The paper provides a model that explains and quantifies the impact of logistics sprawl on urban traffic congestion and its ecological footprint. Rather than providing definitive results, it frames a future research agenda and a scientific approach to studying these dynamics. However, the scarcity of suitable data remains an issue, and future research with richer data may make our findings more precise and more widely applicable.

6. Conclusions

In this paper, we investigated the relationship between logistics sprawl and congestion levels in the urban core. First, we reviewed the literature, from which we summarized the phenomenon of logistics sprawl in different cities and deduced its impact on traffic congestion. In general, we found that logistics sprawl in the studied cases increases the distance to the city center, except in Seattle, where the sprawl movement decreases it. The increase in travelled distance ranged from a minimum of 0.6 km to a maximum of 231.49 km.

To define this relationship, we identified independent variables, including logistics sprawl and city characteristics. We considered the distance resulting from the sprawl movement, population density, vehicle ownership, average daily ridership, vehicle operations, and road length as independent variables and the level of congestion as a dependent variable. The level of congestion is a categorical variable that was constructed based on the waiting time resulting from road congestion.
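To make the construction of the categorical target concrete, the following minimal sketch bins a waiting time into an ordered congestion level. The cut points, the number of classes, and the label names are hypothetical, since the thresholds used in the study are not restated here.

```python
from bisect import bisect_right

# Hypothetical cut points in minutes of waiting time (assumption, not from the paper).
THRESHOLDS_MIN = [10.0, 25.0]
# Hypothetical ordered labels, one more than the number of cut points.
LABELS = ["low", "medium", "high"]


def congestion_level(waiting_time_min: float) -> str:
    """Map a waiting time in minutes to an ordered congestion category."""
    if waiting_time_min < 0:
        raise ValueError("waiting time must be non-negative")
    # bisect_right counts how many cut points are <= the value,
    # so a value equal to a cut point falls into the higher class.
    return LABELS[bisect_right(THRESHOLDS_MIN, waiting_time_min)]


levels = [congestion_level(t) for t in (4.0, 10.0, 18.5, 40.0)]
```

Whatever thresholds are chosen, fixing them before model training keeps the target definition independent of the fitted models.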

Based on our analysis, we identified multiple independent variables that may impact the dependent variable.

Our model uses logistic regression and random forest to predict the impact of logistics sprawl on congestion and explore variable relationships. For our model, we identified cities with logistics sprawl from the literature and constructed a dataset from multiple sources, such as the Transit Database. Due to the inherent scarcity and imbalance of urban data, we applied a comprehensive data augmentation pipeline to ensure robustness.

We applied a multifaceted data augmentation approach to address class imbalance and improve generalization. First, we applied the synthetic minority oversampling technique (SMOTE) to generate synthetic samples of the minority class and balance the dataset. Subsequently, Gaussian noise was added to simulate realistic variability and improve the robustness of the model. Additionally, linear interpolation was used to create new data points by combining the values of existing observations, thereby improving the continuity and density of the feature space. Finally, we used the mixup technique, which mixes pairs of training examples and their corresponding labels, to construct smoother decision boundaries and reduce overfitting. Combined, these augmentation methods increased the resilience and adaptability of the model, enabling it to learn from varied patterns and distributions.
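The four augmentation steps described above can be sketched in a few lines of standard-library Python. This is an illustrative simplification, not the code used in the study: the function names, the noise scale `sigma`, the neighbour count `k`, the mixup parameter `alpha`, and the toy data are all assumptions, and `smote_like` only mimics the core idea of SMOTE (interpolating a minority sample toward a nearby minority neighbour).

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible


def add_gaussian_noise(x, sigma=0.05):
    """Jitter each feature with zero-mean Gaussian noise (sigma is an assumed scale)."""
    return [v + rng.gauss(0.0, sigma) for v in x]


def interpolate(x_a, x_b, t):
    """Linear interpolation between two observations: (1 - t) * x_a + t * x_b."""
    return [(1.0 - t) * a + t * b for a, b in zip(x_a, x_b)]


def smote_like(minority, k=2):
    """SMOTE-style sample: move a minority point toward one of its k nearest minority neighbours."""
    base = rng.choice(minority)
    neighbours = sorted(
        (p for p in minority if p is not base),
        key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
    )[:k]
    return interpolate(base, rng.choice(neighbours), rng.random())


def mixup(x_a, y_a, x_b, y_b, alpha=0.4):
    """Mixup: blend two examples and their one-hot labels with lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = interpolate(x_b, x_a, lam)  # lam * x_a + (1 - lam) * x_b
    y = interpolate(y_b, y_a, lam)  # soft label with the same weights
    return x, y


# Toy minority-class observations with two scaled features (hypothetical values).
minority = [[0.2, 0.8], [0.3, 0.7], [0.25, 0.9]]
synthetic = smote_like(minority)
noisy = add_gaussian_noise(minority[0])
x_mix, y_mix = mixup([0.2, 0.8], [1.0, 0.0], [0.9, 0.1], [0.0, 1.0])
```

Note that mixup produces soft labels, so a classifier that expects hard class labels would need them converted (for example by taking the dominant class) before training.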

This augmentation enabled the model to simulate a wide variety of realistic urban scenarios and to learn robust patterns, which considerably improved its generalization without changing its architecture. To ensure the credibility of the model, we applied an 80/20 training/test split, where validation was performed on unseen and augmented data, ensuring that the evaluation was independent of the augmentation process.
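One common way to keep an evaluation independent of augmentation is to hold out the test rows first and augment only the training rows, so that no synthetic sample derived from a test row can leak into training. The sketch below illustrates that ordering under stated assumptions: the helper names, the jitter augmentation, and the toy rows are hypothetical, and the exact ordering used in the study may differ.

```python
import random


def split_then_augment(rows, augment, test_fraction=0.2, seed=0):
    """Hold out a test set first, then augment only the training rows."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    # One synthetic copy per training row; test rows are never passed to augment().
    train_augmented = train + [augment(r, rng) for r in train]
    return train_augmented, test


def jitter(row, rng, sigma=0.05):
    """Hypothetical augmentation: Gaussian noise on the features, label unchanged."""
    features, label = row
    return ([v + rng.gauss(0.0, sigma) for v in features], label)


# Toy dataset: (features, label) pairs with made-up values.
rows = [([float(i), float(i % 3)], i % 2) for i in range(10)]
train_aug, test = split_then_augment(rows, jitter)
```

With ten rows and an 80/20 split, two rows are held out and the remaining eight are doubled by augmentation.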

The results show that logistic regression learned fine-grained relationships between congestion characteristics and outcomes, whereas random forest excelled at capturing nonlinearities and interactions between variables. The performance discrepancies between the models can be explained by these learning differences.

Notably, the negative coefficient for logistics sprawl (−0.2413) suggests that, in the context of the modeled data, greater decentralization of logistics infrastructure may reduce congestion, possibly because of traffic diffusion. Meanwhile, population density, average daily ridership, and vehicle ownership were positively correlated with congestion, as expected. The model demonstrated strong learning performance on a diverse range of simulated urban profiles and is immediately applicable to studying real-world cases when the corresponding urban data are available. It can function as a decision support framework in urban logistics and planning contexts.
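The sign and size of the reported coefficient can be read through the usual odds-ratio interpretation of logistic regression. The sketch below assumes a binary logit (or a single class-versus-reference contrast) and a one-unit increase in the sprawl variable in whatever scale the model was fitted on; the baseline probability of 0.50 is a hypothetical value chosen only for illustration.

```python
import math

BETA_SPRAWL = -0.2413  # logistics sprawl coefficient reported in the text

# Multiplicative change in the odds of the modeled congestion outcome
# for a one-unit increase in the sprawl variable, other predictors fixed.
odds_ratio = math.exp(BETA_SPRAWL)
percent_change = (odds_ratio - 1.0) * 100.0


def shifted_probability(p_before: float, beta: float, delta: float = 1.0) -> float:
    """Probability after raising one predictor by `delta`, other predictors held fixed."""
    odds = p_before / (1.0 - p_before) * math.exp(beta * delta)
    return odds / (1.0 + odds)


# Hypothetical baseline probability of 0.50 (assumption, not from the paper).
p_after = shifted_probability(0.50, BETA_SPRAWL)
```

Under these assumptions the odds ratio is about 0.79, that is, roughly a 21% reduction in the odds per unit of the sprawl variable, and a baseline probability of 0.50 would fall to about 0.44.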

The use of augmentation approaches, including SMOTE, mixup, and Gaussian noise, demonstrates the model’s resilience and ability to generalize across diverse scenarios. These techniques have not only addressed data imbalance and scarcity but have also enhanced interpretability by allowing the model to learn from the edge cases and nonlinear interactions. Future research should attempt to incorporate real-time traffic data. By integrating dynamic traffic information with spatial analysis tools, models can accurately reflect the complex interactions between logistics activities and urban traffic patterns.

In addition, integrating the characteristics of the studied urban areas is essential for exploring the impact of logistics sprawl on congestion and its indirect environmental consequences, including carbon emissions, energy consumption, and air quality, which are key factors in assessing the sustainability of urban freight strategies. Finally, the developed model can support decision-making if it is integrated into urban mobility dashboards, policy simulation platforms, and logistics planning tools. Such integration would provide stakeholders with actionable information to promote sustainable and efficient transport networks.

From another perspective, urban congestion management requires the development of innovative public policies that complement data-driven analyses and extend beyond the predictive capabilities of existing models. The integration of dynamic congestion pricing helps adjust prices in real time based on factors such as vehicle type and traffic volume. The main aim is to reduce traffic and encourage off-peak deliveries for optimal traffic distribution over time and space.

Additionally, real-time congestion forecasts from models and trends in urban freight demand can be leveraged to align smart road-use policies, such as managing delivery time slots with digital permit systems. In this context, the use of IoT infrastructure, Logistics 4.0 technologies, and vehicle data can facilitate adaptive regulations. Furthermore, it may help urban planners and logistics stakeholders to model, evaluate, and improve their strategies and provide a more comprehensive, adaptable, and proactive approach to managing congestion.

Author Contributions

Methodology, F.J.; Validation, F.J.; Formal analysis, I.M.; Investigation, J.A.; Writing—original draft, M.E.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Requests for further information should be addressed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

Yousefi, J.; Ashtab, S.; Yasaei, A.; George, A.; Mukarram, A.; Sandhu, S.S. Multiple Linear Regression Analysis of Canada’s Freight Transportation Framework. Logistics 2023, 7, 29. [Google Scholar] [CrossRef]
Dablanc, L.; Rakotonarivo, D. The impacts of logistics sprawl: How does the location of parcel transport terminals affect the energy efficiency of goods’ movements in Paris and what can we do about it? Procedia-Soc. Behav. Sci. 2010, 2, 6087–6096. [Google Scholar] [CrossRef]
Moufad, I.; Jawab, F. A study framework for assessing the performance of the urban freight transport based on PLS approach. AoT 2019, 49, 69–85. [Google Scholar] [CrossRef]
Seevarethnam, M.; Rusli, N.; Ling, G.H.T. Prediction of Urban Sprawl by Integrating Socioeconomic Factors in the Batticaloa Municipal Council, Sri Lanka. Int. J. Geo-Inform. 2022, 11, 442. [Google Scholar] [CrossRef]
Chen, Y.; He, Y. Urban Land Expansion Dynamics and Drivers in Peri-Urban Areas of China: A Case of Xiaoshan District, Hangzhou Metropolis (1985–2020). Land 2022, 11, 1495. [Google Scholar] [CrossRef]
Guan, J.; Zhang, S.; D’Ambrosio, L.A.; Zhang, K.; Coughlin, J.F. Potential Impacts of Autonomous Vehicles on Urban Sprawl: A Comparison of Chinese and US Car-Oriented Adults. Sustainability 2021, 13, 7632. [Google Scholar] [CrossRef]
Strale, M. Logistics sprawl in the Brussels metropolitan area: Toward a socio-geographic typology. J. Transp. Geogr. 2020, 88, 102372. [Google Scholar] [CrossRef]
Dablanc, L.; Ogilvie, S.; Goodchild, A. Logistics Sprawl: Differential Warehousing Development Patterns in Los Angeles, California, and Seattle, Washington. Transp. Res. Rec. 2014, 2410, 105–112. [Google Scholar] [CrossRef]
Rao, A.M.; Rao, K.R. Measuring Urban Traffic Congestion–A Review. Int. J. Traffic Transp. Eng. 2012, 2, 286–305. [Google Scholar] [CrossRef]
Kang, S. Exploring the contextual factors behind various phases in logistics sprawl: The case of Seoul Metropolitan Area, South Korea. J. Transp. Geogr. 2022, 105, 103476. [Google Scholar] [CrossRef]
Imane, M.; Fouad, J. Dassia: A Micro-Simulation Approach to Diagnose Urban Freight Delivery Areas Impacts on Traffic Flow. Int. J. Sci. Technol. Res. 2020, 9, 7. [Google Scholar]
El Yadari, M.; Jawab, F.; Arif, J. Logistics4.0 for urban logistics: A literature review and research framework. In Proceedings of the 2022 14th International Colloquium of Logistics and Supply Chain Management (LOGISTIQUA), El Jadida, Morocco, 25–27 May 2022; pp. 1–7. [Google Scholar] [CrossRef]
Taghvaee, V.M.; Nodehi, M.; Saber, R.M.; Mohebi, M. Sustainable development goals and transportation modes: Analyzing sustainability pillars of environment, health, and economy. World Dev. Sustain. 2022, 1, 100018. [Google Scholar] [CrossRef]
Aljohani, K.; Thompson, R.G. A multi-criteria spatial evaluation framework to optimise the siting of freight consolidation facilities in inner-city areas. Transp. Res. Part A Policy Pract. 2020, 138, 51–69. [Google Scholar] [CrossRef]
Mohapatra, S.S.; Pani, A.; Sahu, P.K. Examining the Impacts of Logistics Sprawl on Freight Transportation in Indian Cities: Implications for Planning and Sustainable Development. J. Urban Plan. Dev. 2021, 147, 04021050. [Google Scholar] [CrossRef]
Kin, B.; Spoor, J.; Verlinde, S.; Macharis, C.; Van Woensel, T. Modelling alternative distribution set-ups for fragmented last mile transport: Towards more efficient and sustainable urban freight transport. Case Stud. Transp. Policy 2018, 6, 125–132. [Google Scholar] [CrossRef]
Is the Location of Warehouses Changing in the Belo Horizonte Metropolitan Area (Brazil)? A Logistics Sprawl Analysis in a Latin American Context. Available online: https://www.mdpi.com/2413-8851/2/2/43 (accessed on 12 June 2025).
Moufad, I.; Jawab, F. Etude d’impact des plateformes logistiques sur la logistique urbaine au Maroc. In Proceedings of the 10th International Conference on Integrated Design and Production, Tanger, Morocco, 2–4 December 2015. [Google Scholar]
Imane, M.; Fouad, J. Proposal Methodology of Planning and Location of Loading/Unloading Spaces for Urban Freight Vehicle: A Case Study. Adv. Sci. Technol. Eng. Syst. J. 2019, 4, 273–280. [Google Scholar] [CrossRef]
Tronnebati, I.; El Yadari, M.; Jawab, F. A Review of Green Supplier Evaluation and Selection Issues Using MCDM, MP and AI Models. Sustainability 2022, 14, 16714. [Google Scholar] [CrossRef]
Manal, E.Y.; Fouad, J.; Imane, M. Logistics Sprawl: A Systematic Literature Review. IEEE Eng. Manag. Rev. 2024, 53, 1–13. [Google Scholar] [CrossRef]
Mahmoudzadeh, H.; Abedini, A.; Aram, F. Urban Growth Modeling and Land-Use/Land-Cover Change Analysis in a Metropolitan Area (Case Study: Tabriz). Land 2022, 11, 2162. [Google Scholar] [CrossRef]
Sogbe, E.; Susilawati, S.; Pin, T.C. Scaling up public transport usage: A systematic literature review of service quality, satisfaction and attitude towards bus transport systems in developing countries. Public Transp. 2024, 17, 1–44. [Google Scholar] [CrossRef]
Yuan, Q.; Zhu, J. Logistics sprawl in Chinese metropolises: Evidence from Wuhan. J. Transp. Geogr. 2019, 74, 242–252. [Google Scholar] [CrossRef]
Krzysztofik, R.; Kantor-Pietraga, I.; Spórna, T.; Dragan, W.; Mihaylov, V. Beyond “logistics sprawl” and “logistics anti-sprawl”. Case of the Katowice region, Poland. Eur. Plan. Stud. 2019, 27, 1646–1660. [Google Scholar] [CrossRef]
Gardrat, M. Urban growth and freight transport: From sprawl to distension. J. Transp. Geogr. 2021, 91, 102979. [Google Scholar] [CrossRef]
Trent, N.M.; Joubert, J.W. Logistics sprawl and the change in freight transport activity: A comparison of three measurement methodologies. J. Transp. Geogr. 2022, 101, 103350. [Google Scholar] [CrossRef]
Gupta, S.; Garima, S. Logistics Sprawl in Timber Markets and its Impact on Freight Distribution Patterns in Metropolitan City of Delhi, India. Transp. Res. Procedia 2017, 25, 965–977. [Google Scholar] [CrossRef]
Dhonde, B.; Patel, C.R. The Impacts of City Sprawl on Urban Freight Transport in Developing Countries. Archit. Eng. 2021, 6, 52–62. [Google Scholar] [CrossRef]
Feng, R.; Cui, H.; Feng, Q.; Chen, S.; Gu, X.; Yao, B. Urban Traffic Congestion Level Prediction Using a Fusion-Based Graph Convolutional Network. IEEE Trans. Intell. Transp. Syst. 2023, 24, 14695–14705. [Google Scholar] [CrossRef]
Arif, J.; Jawab, F. Outsourcing logistics: Strategic tools for decision to outsource logistic activities. In Proceedings of the IEEE 2011 4th International Conference on Logistics, Hammamet, Tunisia, 31 May–3 June 2011; pp. 104–108. [Google Scholar] [CrossRef]
Moufad, I.; Jawab, F. Multi-criteria analysis of urban public transport problems: The city of Fes as a Case. Int. J. Sci. Eng. Res. 2017, 8, 675–681. [Google Scholar]
Rahman, M.; Najaf, P.; Fields, M.G.; Thill, J.-C. Traffic congestion and its urban scale factors: Empirical evidence from American urban areas. Int. J. Sustain. Transp. 2022, 16, 406–421. [Google Scholar] [CrossRef]
Falahatraftar, F.; Pierre, S.; Chamberland, S. An Intelligent Congestion Avoidance Mechanism Based on Generalized Regression Neural Network for Heterogeneous Vehicular Networks. IEEE Trans. Intell. Veh. 2023, 8, 3106–3118. [Google Scholar] [CrossRef]
Chen, P. Design of Intelligent Traffic Simulation System Based on Decision Tree Algorithm. In Proceedings of the 2021 2nd International Conference on Artificial Intelligence and Information Systems, ICAIIS 2021, Chongqing, China, 28–30 May 2021; Association for Computing Machinery: New York, NY, USA, 2021; pp. 1–3. [Google Scholar] [CrossRef]
Zolghadri, M.; Asghari, P.; Dashti, S.E.; Hedayati, A. Dynamic task offloading for IoT-Fog-Cloud systems: A network traffic-aware decision tree approach. Computing 2025, 107, 94. [Google Scholar] [CrossRef]
Bannur, C.; Bhat, C.; Goutham, G.; Mamatha, H.R. General Transit Feed Specification Assisted Effective Traffic Congestion Prediction Using Decision Trees and Recurrent Neural Networks. In Proceedings of the 2022 IEEE 1st International Conference on Data, Decision and Systems (ICDDS), Bangalore, India, 2–3 December 2022; pp. 1–6. [Google Scholar] [CrossRef]
Fuhrer, B.; Shpigelman, Y.; Tessler, C.; Mannor, S.; Chechik, G.; Zahavi, E.; Dalal, G. Implementing Reinforcement Learning Datacenter Congestion Control in NVIDIA NICs. In Proceedings of the 23rd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGrid 2023, Bangalore, India, 1–4 May 2023; pp. 331–343. [Google Scholar] [CrossRef]
Zhuang, L.; Xu, R.; Xu, Y.; Li, X.; Guo, Z. Road congestion Caused Identification Algorithm Based on the Image Morphology and Decision Tree. In Proceedings of the SPIE Eighth International Conference on Traffic Engineering and Transportation System (ICTETS 2024), Dalian, China, 20–22 December 2024; pp. 1177–1186. [Google Scholar] [CrossRef]
Wahbi, M.; Boulaassal, H.; Maatouk, M.; El Kharki, O.; Alaoui, O.Y. Monitoring the Urban Sprawl of the City of Tangier from Spot and Sentinel2 Images. In Advanced Intelligent Systems for Sustainable Development, Proceedings of the AI2SD’2019, Marrakech, Morocco, 8–11 July 2019; Ezziyyani, M., Ed.; Springer International Publishing: Cham, Switzerland, 2020; pp. 453–465. [Google Scholar] [CrossRef]
Al Kheder, S.; Al Omair, A. Urban traffic prediction using metrological data with fuzzy logic, long short-term memory (LSTM), and decision trees (DTs). Nat. Hazards 2022, 111, 1685–1719. [Google Scholar] [CrossRef]
Chen, Y.; Li, C.; Yue, W.; Zhang, H.; Mao, G. Root Cause Identification for Road Network Congestion Using the Gradient Boosting Decision Trees. In Proceedings of the GLOBECOM 2020-2020 IEEE Global Communications Conference, Taipei, Taiwan, 7–11 December 2020; pp. 1–6. [Google Scholar] [CrossRef]
Tamir, T.S.; Xiong, G.; Li, Z.; Tao, H.; Shen, Z.; Hu, B.; Menkir, H.M. Traffic Congestion Prediction using Decision Tree, Logistic Regression and Neural Networks. IFAC-Pap. Online 2020, 53, 512–517. [Google Scholar] [CrossRef]
Arieth, R.M.; Chowdhury, S.; Sundaravadivazhagan, B.; Srivastava, G. Traffic Prediction and Congestion Control Using Regression Models in Machine Learning for Cellular Technology. In Machine Learning for Mobile Communications; CRC Press: Boca Raton, FL, USA, 2024. [Google Scholar]
Wang, K.; Chen, Z.; Cheng, L.; Zhu, P.; Shi, J.; Bian, Z. Integrating spatial statistics and machine learning to identify relationships between e-commerce and distribution facilities in Texas, US. Transp. Res. Part A Policy Pract. 2023, 173, 103696. [Google Scholar] [CrossRef]
Chaoura, C.; Lazar, H.; Jarir, Z. Predictive System of Traffic Congestion based on Machine Learning. In Proceedings of the 2022 9th International Conference on Wireless Networks and Mobile Communications (WINCOM), Rabat, Morocco, 26–29 October 2022; pp. 1–6. [Google Scholar] [CrossRef]
Hassan, M.; Arabiat, A. An evaluation of multiple classifiers for traffic congestion prediction in Jordan. Indones. J. Electr. Eng. Comput. Sci. 2024, 36, 461–468. [Google Scholar] [CrossRef]
Kafy, A.; Faisal, S.I.; Rahman, M.L.; Moni, R.; Shanmuganathan, H.; Raza, D.M. Traffic Congestion Prediction using Machine Learning. In Proceedings of the 2024 11th International Conference on Computing for Sustainable Global Development (INDIACom), New Delhi, India, 28 February–1 March 2024; pp. 1290–1295. [Google Scholar] [CrossRef]
Arabiat, A.; Hassan, M.; Almomani, O. Traffic congestion prediction using machine learning: Amman City case study. In Proceedings of the International Conference on Medical Imaging, Electronic Imaging, Information Technologies, and Sensors, Online, 1–2 March 2024. [Google Scholar] [CrossRef]
Kezia, M.; Anusuya, K.V. A Comparative Study on Machine Learning Algorithms for Congestion Control in VANET. In Proceedings of the 2022 International Conference on Intelligent Innovations in Engineering and Technology (ICIIET), Coimbatore, India, 22–24 September 2022; pp. 38–44. [Google Scholar] [CrossRef]
Moumen, I.; Abouchabaka, J.; Najat, R. Enhanced Traffic Management Through Data Mining: Predictive Models for Congestion Reduction. In Proceedings of the 2023 10th International Conference on Wireless Networks and Mobile Communications (WINCOM), Istanbul, Turkiye, 26–28 October 2023. [Google Scholar] [CrossRef]
Gupta, S.; Kumar, A.; Kumar, A. Analysis of Traffic Flow Congestion by Integrating Supervised Machine Learning with K-mean Clustering. SN Comput. Sci. 2025, 6, 255. [Google Scholar] [CrossRef]
Hatami, F.; Rahman, M.M.; Nikparvar, B.; Thill, J.-C. Non-Linear Associations Between the Urban Built Environment and Commuting Modal Split: A Random Forest Approach and SHAP Evaluation. IEEE Access 2023, 11, 12649–12662. [Google Scholar] [CrossRef]
Ghazaryan, G.; Rienow, A.; Oldenburg, C.; Thonfeld, F.; Trampnau, B.; Sticksel, S.; Jürgens, C. Monitoring of Urban Sprawl and Densification Processes in Western Germany in the Light of SDG Indicator 11.3.1 Based on an Automated Retrospective Classification Approach. Remote Sens. 2021, 13, 1694. [Google Scholar] [CrossRef]
Shao, Z.; Sumari, N.S.; Portnov, A.; Ujoh, F.; Musakwa, W.; Mandela, P.J. Urban sprawl and its impact on sustainable urban development: A combination of remote sensing and social media data. Geo-Spat. Inf. Sci. 2021, 24, 241–255. [Google Scholar] [CrossRef]
Sani, A.Z.; Dubé, J. Identifying the Impact of Public Amenities on Urban Growth: A Case Study of the Quebec Metropolitan Region, Canada (1986–2022). Sustainability 2025, 17, 1631. [Google Scholar] [CrossRef]
Dai, J.; Tian, X.; Liu, L.; Zhang, H.; Fu, J.; Yu, M. The Intelligent Traffic Safety System Based on 6G Technology and Random Forest Algorithm. IEEE Trans. Intell. Transp. Syst. 2025, 1–11. [Google Scholar] [CrossRef]
Shenghua, H.; Zhihua, N.; Jiaxin, H. Road Traffic Congestion Prediction Based on Random Forest and DBSCAN Combined Model. In Proceedings of the 2020 5th International Conference on Smart Grid and Electrical Automation (ICSGEA), Zhangjiajie, China, 13–14 June 2020; pp. 323–326. [Google Scholar] [CrossRef]
Song, W.; Zhou, Y. Road Travel Time Prediction Method Based on Random Forest Model. In Smart Trends in Computing and Communications; Zhang, Y.-D., Mandal, J.K., So-In, C., Thakur, N.V., Eds.; Springer: Singapore, 2020; pp. 155–163. [Google Scholar] [CrossRef]
Maulida, N.R.; Mutijarsa, K. Traffic Density Classification Using Multilayer Perceptron and Random Forest Method. In Proceedings of the 2021 International Seminar on Intelligent Technology and Its Applications (ISITIA), Online, 21–22 July 2021; pp. 117–122. [Google Scholar] [CrossRef]
Ware, P.; Pednekar, V.; Kulkarni, P.; Huddar, C.; Bhamare, M. Smart Traffic Management: A Deep Learning Approach for Congestion Reduction using YOLOV8 & RandomForest. In Proceedings of the 2024 IEEE Conference on Engineering Informatics (ICEI), Melbourne, Australia, 20–21 November 2024; pp. 1–8. [Google Scholar] [CrossRef]
Jenifer, J.; Priyadarsini, J. Empirical Research on Machine Learning Models and Feature Selection for Traffic Congestion Prediction in Smart Cities. Int. J. Recent Innov. Trends Comput. Commun. 2023, 11, 269–275. [Google Scholar] [CrossRef]
Khurram, S.; Rose, S.; Sadiq, S. Radar Sensor-Based Smart Traffic Management System Revolutionized Using Random Forest. In Advances in Information and Communication; Arai, K., Ed.; Springer Nature: Cham, Switzerland, 2024; pp. 402–413. [Google Scholar] [CrossRef]
Manimaran, A. Improving the accuracy of predicting congested traffic flow road transport using random forest algorithm and compared with the naives bayes algorithm using machine learning. AIP Conf. Proc. 2024, 3193, 020222. [Google Scholar] [CrossRef]
Kumar, V.; Tiwari, S.; Sharma, R.K.; Sinha, A.; Tejani, G.G. A real time video sliced frame image based intelligent traffic congestion monitoring system using faster CNN. J. Opt. 2025. [Google Scholar] [CrossRef]
Geromichalou, O.; Mystakidis, A.; Tjortjis, C. Traffic Congestion Prediction: A Machine Learning Approach. In Extended Selected Papers, Proceedings of the 14th International Conference on Information, Intelligence, Systems, and Applications, Volos, Greece, 10–12 July 2023; Bourbakis, N., Tsihrintzis, G.A., Virvou, M., Jain, L.C., Eds.; Springer Nature: Cham, Switzerland, 2024; pp. 388–411. [Google Scholar] [CrossRef]
Ali, I.; Hong, S.; Cheung, T. Congestion or No Congestion: Packet Loss Identification and Prediction Using Machine Learning. In Proceedings of the 2024 International Conference on Platform Technology and Service (PlatCon), Jeju, Republic of Korea, 26–28 August 2024; pp. 72–76. [Google Scholar] [CrossRef]
Quang, D.T.; Bae, S.H. A Hybrid Deep Convolutional Neural Network Approach for Predicting the Traffic Congestion Index. Promet-Traffic Transport. 2021, 33, 3. [Google Scholar] [CrossRef]
Ahmad, A.; Gilani, H.; Shirazi, S.A.; Pourghasemi, H.R.; Shaukat, I. Chapter 9—Spatiotemporal urban sprawl and land resource assessment using Google Earth Engine platform in Lahore district, Pakistan. In Computers in Earth and Environmental Sciences; Pourghasemi, H.R., Ed.; Elsevier: Amsterdam, The Netherlands, 2022; pp. 137–150. [Google Scholar] [CrossRef]
Bokaba, T.; Doorsamy, W.; Paul, B.S. A Comparative Study of Ensemble Models for Predicting Road Traffic Congestion. Appl. Sci. 2022, 12, 1337. [Google Scholar] [CrossRef]
Deepika, D.; Pandove, G. Enhancing urban mobility: Predicting traffic congestion with optimized ML model. Eng. Res. Express 2024, 6, 045242. [Google Scholar] [CrossRef]
Singh, R.; Gaonkar, G.; Bandre, V.; Sarang, N.; Deshpande, S. Gradient Boosting Approach for Traffic Flow Prediction Using CatBoost. In Proceedings of the 2021 International Conference on Advances in Computing, Communication, and Control (ICAC3), Mumbai, India, 3–4 December 2021; pp. 1–5. [Google Scholar] [CrossRef]
Kumar, A.; Hemrajani, N. Optimized Extreme Gradient Boosting with Remora Algorithm for Congestion Prediction in Transport Layer. Int. J. Comput. Netw. Inf. Secur. 2024, 16, 144–158. [Google Scholar] [CrossRef]
Garcia, E.; Calvet, L.; Carracedo, P.; Serrat, C.; Miró, P.; Peyman, M. Predictive Analyses of Traffic Level in the City of Barcelona: From ARIMA to eXtreme Gradient Boosting. Appl. Sci. 2024, 14, 4432. [Google Scholar] [CrossRef]
Yuan, Y.; Zhou, L. Dynamic changes in small wetland landscapes and their driving factors under the background of urbanization. AES 2022, 42, 7028–7042. [Google Scholar] [CrossRef]
Ko, E.; Lee, S.; Jang, K.; Kim, S. Changes in inter-city car travel behavior over the course of a year during the COVID-19 pandemic: A decision tree approach. Cities 2024, 146, 104758. [Google Scholar] [CrossRef]
Sarkar, A.; Chouhan, P. Modeling spatial determinants of urban expansion of Siliguri a metropolitan city of India using logistic regression. Model. Earth Syst. Environ. 2020, 6, 2317–2331. [Google Scholar] [CrossRef]
Salem, M.; Tsurusaki, N.; Prasana, D. Land use/land cover change detection and urban sprawl in the peri-urban area of greater Cairo since the Egyptian revolution of 2011. J. Land Use Sci. 2020, 15, 592–606. [Google Scholar] [CrossRef]
Zineddine, O.; Boulkaibet, A. Quantifying Urban Expansion and Its Driving Forces in the City of MILA. Acta Geogr. Univ. 2023, 67, 217–239. [Google Scholar]
Dutta, R.; Banerjee, I. Analysing Spatio-temporal Dynamics of Urban Sprawl, Evolving Pattern of Urban Landscape and Driving Forces of Urban Growth: A Case of Varanasi Planning Region, India. J. Asian Afr. Stud. 2024, 00219096241287355. [Google Scholar] [CrossRef]
Grigorescu, I.; Kucsicsa, G.; Mitrica, B.; Mocanu, I.; Dumitrașcu, M. Driving factors of urban sprawl in the Romanian plain. Regional and temporal modelling using logistic regression. Geocarto Int. 2022, 37, 7220–7246. [Google Scholar] [CrossRef]
He, Z.; Ren, B.; He, C. Identification of influencing factors of urban traffic congestion based on ordered Logistic regression. In Proceedings of the 2022 7th International Conference on Intelligent Computing and Signal Processing (ICSP), Xi’an, China, 15–17 April 2022; pp. 914–918. [Google Scholar] [CrossRef]
Zhang, G.; Wang, J.; Hu, X. Correlation Analysis of Internet Car and Urban Traffic Congestion Based on Logistics Regression Congestion. In Proceedings of the 20th COTA International Conference of Transportation Professionals, Xi’an, China, 17–20 December 2020; p. 3695. [Google Scholar] [CrossRef]
Diop, A.K.; Gueye, A.D.; Tall, K.; Farssi, S.M. A SVM Approach for Assessing Traffic Congestion State by Similarity Measures. In Emerging Trends in Intelligent Systems & Network Security; Ahmed, M.B., Abdelhakim, B.A., Ane, B.K., Rosiyadi, D., Eds.; Springer International Publishing: Cham, Switzerland, 2023; pp. 63–72. [Google Scholar] [CrossRef]
Khattak, Z.H.; Khattak, A.J. Using behavioral data to understand shared mobility choices of electric and hybrid vehicles. Int. J. Sustain. Transp. 2023, 17, 163–180. [Google Scholar] [CrossRef]
Elangovan, K.; Krishnaraaju, G. Mapping and Prediction of Urban Growth using Remote Sensing, Geographic Information System, and Statistical Techniques for Tiruppur Region, Tamil Nadu, India. J. Indian Soc. Remote Sens. 2023, 51, 1657–1671. [Google Scholar] [CrossRef]
Saganeiti, L.; Mustafa, A.; Teller, J.; Murgante, B. Modeling urban sprinkling with cellular automata. Sustain. Cities Soc. 2021, 65, 102586. [Google Scholar] [CrossRef]
Faria de Deus, R.; Tenedório, J.A.; Rocha, J. Modelling Land-Use and Land-Cover Changes: A Hybrid Approach to a Coastal Area. In Methods and Applications of Geospatial Technology in Sustainable Urbanism; Tenedório, J.A., Estanqueiro, R., Henriques, C.D., Eds.; IGI Global: Pennsylvania, PA, USA, 2021; pp. 57–102. [Google Scholar] [CrossRef]
Myagmartseren, P.; Ganpurev, D.; Myagmarjav, I.; Byambakhuu, G.; Dabuxile, G. Remote Sensing and Multivariate Logistic Regression Model for the Estimation of Urban Expansion (Case of Darkhan City, Mongolia). Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2020, XLIII-B3-2020, 721–726. [Google Scholar] [CrossRef]
Kassan, S.; Hadj, I.; Jemaa, S.B.; Allio, S. A Hybrid machine learning based model for congestion prediction in mobile networks. In Proceedings of the 2022 IEEE 33rd Annual International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC), Kyoto, Japan, 12–15 September 2022; pp. 583–588. [Google Scholar] [CrossRef]
Islam, M.; Rickty, T.; Das, P.; Haque, M. Modeling and Forecasting Urban Sprawl in Sylhet Sadar Using Remote Sensing Data. Proc. Eng. Technol. Innov. 2023, 23, 23–35. [Google Scholar] [CrossRef]
Chakraborty, A.; Mustafa, A.; Teller, J. Modelling multi-density urban expansion using Cellular Automata for Brussels Metropolitan Development Area. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2024, X-4-W4-2024, 29–34. [Google Scholar] [CrossRef]
Dede, M.; Asdak, C.; Setiawan, I. Spatial dynamics model of land use and land cover changes: A comparison of CA, ANN, and ANN-CA. Regist. J. Ilm. Teknol. Sist. Inf. 2021, 8, 38–49. [Google Scholar] [CrossRef]
Wang, Q.; Wang, H. Spatiotemporal dynamics and evolution relationships between land-use/land cover change and landscape pattern in response to rapid urban sprawl process: A case study in Wuhan, China. Ecol. Eng. 2022, 182, 106716. [Google Scholar] [CrossRef]
Mostafa, E.; Li, X.; Sadek, M.; Dossou, J.F. Monitoring and Forecasting of Urban Expansion Using Machine Learning-Based Techniques and Remotely Sensed Data: A Case Study of Gharbia Governorate, Egypt. Remote Sens. 2021, 13, 4498. [Google Scholar] [CrossRef]
Sousa, L.T.M.D.; de Oliveira, L.K. Influence of Characteristics of Metropolitan Areas on the Logistics Sprawl: A Case Study for Metropolitan Areas of the State of Paraná (Brazil). Sustainability 2020, 12, 9779. [Google Scholar] [CrossRef]
Tian, Y.; Chen, J. Suburban sprawl measurement and landscape analysis of cropland and ecological land: A case study of Jiangsu Province, China. Growth Change 2022, 53, 1282–1305. [Google Scholar] [CrossRef]
Meng, M.; Toan, T.D.; Wong, Y.D.; Lam, S.H. Short-Term Travel-Time Prediction using Support Vector Machine and Nearest Neighbor Method. Transp. Res. Rec. 2022, 2676, 353–365. [Google Scholar] [CrossRef]
Harrou, F.; Zeroual, A.; Sun, Y. Traffic congestion monitoring using an improved kNN strategy. Meas. J. Int. Meas. Confed. 2020, 156, 107534. [Google Scholar] [CrossRef]
Lusiandro, M.A.; Nasution, S.M.; Setianingsih, C. Implementation of the Advanced Traffic Management System using k-Nearest Neighbor Algorithm. In Proceedings of the 2020 International Conference on Information Technology Systems and Innovation (ICITSI), Padang, Indonesia, 19–23 October 2020; pp. 149–154. [Google Scholar] [CrossRef]
Aditya, F.S. Traffic Flow Prediction using SUMO Application with K-Nearest Neighbor (KNN) Method. Int. J. Integr. Eng. 2020. Available online: https://www.academia.edu/79440615/Traffic_Flow_Prediction_using_SUMO_Application_with_K_Nearest_Neighbor_KNN_Method (accessed on 4 May 2025).
Alejandrino, J.; Concepcion, R.; Lauguico, S.; Palconit, M.G.; Bandala, A.; Dadios, E. Congestion Detection in Wireless Sensor Networks Based on Artificial Neural Network and Support Vector Machine. In Proceedings of the 2020 IEEE 12th International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment, and Management (HNICEM), Manila, Philippines, 3–7 December 2020; pp. 1–6. [Google Scholar] [CrossRef]
Lim, H.; Park, M. Modeling the Spatial Dimensions of Warehouse Rent Determinants: A Case Study of Seoul Metropolitan Area, South Korea. Sustainability 2019, 12, 259. [Google Scholar] [CrossRef]
Zou, B. Multiple Classification Using Logistic Regression Model. In Internet of Vehicles-Technologies and Services, Lecture Notes in Computer Science; Hsu, C.H., Wang, S., Zhou, A., Shawkat, A., Eds.; Springer International Publishing: Cham, Switzerland, 2016; Volume 10036, pp. 238–243. [Google Scholar] [CrossRef]
Kang, S. Relative logistics sprawl: Measuring changes in the relative distribution from warehouses to logistics businesses and the general population. J. Transp. Geogr. 2020, 83, 102636. [Google Scholar] [CrossRef]
Woudsma, C.; Jakubicek, P. Logistics land use patterns in metropolitan Canada. J. Transp. Geogr. 2020, 88, 102381. [Google Scholar] [CrossRef]
Dubie, M.; Kuo, K.C.; Giron-Valderrama, G.; Goodchild, A. An evaluation of logistics sprawl in Chicago and Phoenix. J. Transp. Geogr. 2020, 88, 102298. [Google Scholar] [CrossRef]
He, M.; Zeng, L.; Wu, X.; Luo, J. The Spatial and Temporal Evolution of Logistics Enterprises in the Yangtze River Delta. Sustainability 2019, 11, 5318. [Google Scholar] [CrossRef]
Heitz, A.; Dablanc, L.; Olsson, J.; Sanchez-Diaz, I.; Woxenius, J. Spatial patterns of logistics facilities in Gothenburg, Sweden. J. Transp. Geogr. 2020, 88, 102191. [Google Scholar] [CrossRef]
Jaller, M.; Pineda, L.; Phong, D. Spatial Analysis of Warehouses and Distribution Centers in Southern California. Transp. Res. Rec. 2017, 2610, 44–53. [Google Scholar] [CrossRef]
Woudsma, C.; Jakubicek, P.; Dablanc, L. Logistics Sprawl in North America: Methodological Issues and a Case Study in Toronto. Transp. Res. Procedia 2019, 12, 474–488. [Google Scholar] [CrossRef]
Aljohani, K.; Thompson, R.G. The impacts of relocating a logistics facility on last food miles–The case of Melbourne’s fruit & vegetable wholesale market. Case Stud. Transp. Policy 2018, 6, 279–288. [Google Scholar] [CrossRef]
Heitz, A.; Dablanc, L.; Tavasszy, L.A. Logistics sprawl in monocentric and polycentric metropolitan areas: The cases of Paris, France, and the Randstad, the Netherlands. Region 2017, 4, 93. [Google Scholar] [CrossRef]
Guerin, L.; Vieira, J.G.V.; de Oliveira, R.L.M.; de Oliveira, L.K.; de Miranda Vieira, H.E.; Dablanc, L. The geography of warehouses in the São Paulo Metropolitan Region and contributing factors to this spatial distribution. J. Transp. Geogr. 2021, 91, 102976. [Google Scholar] [CrossRef]
Dablanc, L.; Ross, C. Atlanta: A mega logistics center in the Piedmont Atlantic Megaregion (PAM). J. Transp. Geogr. 2012, 24, 432–442. [Google Scholar] [CrossRef]
Klauenberg, J.; Elsner, L.-A.; Knischewski, C. Dynamics of the spatial distribution of hubs in groupage networks–The case of Berlin. J. Transp. Geogr. 2020, 88, 102280. [Google Scholar] [CrossRef]
de Oliveira, R.L.M.; Dablanc, L.; Schorung, M. Changes in warehouse spatial patterns and rental prices: Are they related? Exploring the case of US metropolitan areas. J. Transp. Geogr. 2022, 104, 103450. [Google Scholar] [CrossRef]
Sakai, T.; Kawamura, K.; Hyodo, T. Locational dynamics of logistics facilities: Evidence from Tokyo. J. Transp. Geogr. 2015, 46, 10–19. [Google Scholar] [CrossRef]
Duan, S.; Lyu, F.; Zhu, X.; Ding, Y.; Wang, H.; Zhang, D.; Liu, X.; Zhang, Y.; Ren, J. VeLP: Vehicle Loading Plan Learning from Human Behavior in Nationwide Logistics System. Proc. VLDB Endow. 2023, 17, 241–249. [Google Scholar] [CrossRef]
Menelaou, C.; Kolios, P.; Timotheou, S.; Panayiotou, C.G.; Polycarpou, M.P. Controlling road congestion via a low-complexity route reservation approach. Transp. Res. Part C Emerg. Technol. 2017, 81, 118–136. [Google Scholar] [CrossRef]
El Yadari, M.; Moufad, I.; Jawab, F.; Arif, J. Logistic 4.0 Implementation for Efficient Urban Freight Transport: A Systematic Literature Review. In Proceedings of the 2024 IEEE 15th International Colloquium on Logistics and Supply Chain Management (LOGISTIQUA), Sousse, Tunisia, 2–4 May 2024; pp. 1–8. [Google Scholar] [CrossRef]
Intelligent Transportation Systems-Vehicle to Infrastructure (V2I) Deployment Guidance and Resources. Available online: https://www.its.dot.gov/v2i/ (accessed on 31 August 2024).
Arif, J.; Mouzouna, Y.; Jawab, F. The Use of Internet of Things (IoT) Applications in the Logistics Outsourcing: Smart RFID Tag as an Example. In Proceedings of the International Conference on Industrial Engineering and Operations Management, Pilsen, Czech Republic, 23–26 July 2019. [Google Scholar]
Cavone, G.; Epicoco, N.; Carli, R.; Del Zotti, A.; Pereira, J.P.R.; Dotoli, M. Parcel Delivery with Drones: Multi-criteria Analysis of Trendy System Architectures. In Proceedings of the IEEE 2021 29th Mediterranean Conference on Control and Automation (MED), Puglia, Italy, 22–25 June 2021; pp. 693–698. [Google Scholar] [CrossRef]
Dotoli, M.; Epicoco, N.; Falagario, M.; Cavone, G. A Timed Petri Nets Model for Intermodal Freight Transport Terminals. IFAC Proc. Vol. 2014, 47, 176–181. [Google Scholar] [CrossRef]
Cavone, G.; Dotoli, M.; Seatzu, C. Resource planning of intermodal terminals using timed Petri nets. In Proceedings of the 2016 13th International Workshop on Discrete Event Systems (WODES), Xi’an, China, 30 May 2016; pp. 44–50. [Google Scholar] [CrossRef]
Cavone, G.; Dotoli, M.; Seatzu, C. Management of Intermodal Freight Terminals by First-Order Hybrid Petri Nets. IEEE Robot. Autom. Lett. 2016, 1, 2–9. [Google Scholar] [CrossRef]
Cavone, G.; Dotoli, M.; Seatzu, C. A Survey on Petri Net Models for Freight Logistics and Transportation Systems. IEEE Trans. Intell. Transp. Syst. 2018, 19, 1795–1813. [Google Scholar] [CrossRef]

Figure 1. Tracing the evolution of urban sprawl.

Figure 2. Data augmentation workflow.

Figure 3. Relationship between dependent and independent variables (density, VKO, average daily ridership, and congestion level).

Figure 4. Relationship between dependent and independent variables (length of roads, vehicle ownership, and congestion).

Figure 5. Relationship representation between dependent and independent variables (logistics sprawl and congestion).

Figure 6. Dimensions of the impact of logistics sprawl.

Table 1. Impact of logistics sprawl on freight transport.

Impact Category	Description	Effect on Freight Transport	References
Road Congestion	Logistics sprawl increases freight vehicles, particularly trucks. It increases travel distances and road congestion.	▪ Increased travel times and delays ▪ Increased road sharing between freight transport	[27,28]
Transport Reliability	Logistics sprawl increases freight traffic between urban and suburban areas, which affects transport schedules, making it more difficult to maintain a reliable service.	▪ Increased travel trips ▪ Introduction of heavier trucks leading to slow-moving traffic ▪ Increased accident risk	[14,28]
Infrastructure Occupancy	Logistics sprawl leads to increasing road occupancy, since it increases travelled distances.	▪ Increased road occupancy ▪ Road degradation due to heavy trucks	[27,29]
Increased transport demand in peripheral areas	Logistics facilities increase demand for transport to connect workers to these facilities.	▪ Increased use of transport to serve peripheral areas ▪ Imbalance in transport flows between urban and suburban areas	[2,26]

Table 2. Keyword analysis by model and research domain.

Keywords	Number of Papers
Logistics Sprawl and Logistic Regression	14
Congestion and Decision Tree	6
Congestion and Random Forest	6
Congestion, Decision Tree, and Random Forest	6
Congestion and Gradient Boosting	4
Congestion and K-Nearest Neighbors	4
Congestion and Logistic Regression	4
Logistics Sprawl and Logistic Regression	4
Sprawl and Decision Tree	4
Congestion, Decision Tree, Logistic Regression, and Random Forest	3
Congestion, Decision Tree, and Gradient Boosting	2
Logistics Sprawl, Logistic Regression, and Random Forest	2
Congestion and Support Vector Machine	1
Congestion, Decision Tree, and Logistic Regression	1
Congestion, Decision Tree, and Logistic Regression	1
Congestion, Decision Tree, Logistic Regression, Random Forest, and Gradient Boosting	1
Congestion, Decision Tree, Random Forest, and Gradient Boosting	1
Congestion, Random Forest, and Gradient Boosting	1
Congestion, Random Forest, and K-Nearest Neighbors	1
Congestion, Random Forest, and K-Nearest Neighbors	1
Congestion, Support Vector Machine, Decision Tree, Random Forest, and Gradient Boosting	1
Logistics Sprawl, Logistic Regression, and K-Nearest Neighbors	1
Logistics Sprawl, Congestion, and Logistic Regression	1
Logistics Sprawl and Random Forest	1
Logistics Sprawl, Logistic Regression, and Decision Tree	1
Sprawl, Congestion, and Random Forest	1

Table 3. Overview of models and their applications.

Model Category	Specific Algorithms	Application Fields	References
Tree-Based Models	Decision Tree	Congestion, Urban Sprawl, Traffic Simulation, Urban Traffic	[34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52]
	Random Forest	Traffic Congestion, Urban Sprawl, Traffic Density Classification	[46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68]
	CART	Urban Sprawl	[69]
	AdaBoost with Decision Trees	Congestion	[70]
	Gradient Boosting/GBDT/XGBoost	Traffic Flow Congestion, Logistics Sprawl	[42,45,46,52,66,71,72,73,74]
	Boosted Regression Tree (BRT)	Urban Sprawl	[75]
	Conditional Inference Tree (CIT)	Traffic Behavior	[76]
Regression Models	Logistic Regression	Urban Sprawl, Logistics Sprawl, Congestion, Shared Mobility	[5,21,43,44,47,48,49,75,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95]
	Linear Regression	Traffic Flow Congestion, Logistics Sprawl	[46,51,52,66,96]
	Ordinary Least Squares	Urban Sprawl, Traffic Congestion	[68,91]
	Logistic-Geographically Weighted Regression	Urban Sprawl	[97]
Distance-Based Models	K-Nearest Neighbors (KNN)	Travel Time, Traffic Congestion	[50,65,67,68,98,99,100,101]
	Support Vector Machine (SVM)	Traffic Flow Congestion, Travel Time	[48,52,65,84,98,102]
Neural Network Models	ANN, MLP, RNN, LSTM		[20,41,43,44,46,60,68,93,102]
Neural Network Models	Convolutional Neural Networks (CNN)	Congestion	[65,68]
Hybrid & Ensemble Models	PCA + RF, YOLO + RF, RF + DBSCAN, etc.	Traffic Flow Congestion	[46,47,49,50,51,52,58,59,61,62,63,64,65,66,67,68]
	Cellular Automata + Logistic Regression + Markov Chain	Urban, Logistics Sprawl	[5,87,88,91,92,93,94,95]
	Heckman, Bayesian Logistic, FR, LBM, GWLR	Shared Mobility, Urban, Logistics Sprawl, Congestion	[85,86,90,103]
Other Methods	Centrographic, Spatial Autoregressive, Fuzzy Logic, Kalman Filter	Logistics Sprawl, Urban Sprawl, Congestion	[45,95,96,99,103]

Table 4. Evaluation metrics.

Metric	Symbol/Formula	Definition
Accuracy	$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$	Proportion of correctly classified instances among all instances.
AUC Score	Area under the ROC curve	Measures the ability of the model to distinguish between classes (ranges from 0 to 1).
Precision (Class 0)	$\text{Precision}_{0} = \frac{TN}{TN + FN}$	Proportion of predicted Class 0 that are actually Class 0 (true negatives).
Precision (Class 1)	$\text{Precision}_{1} = \frac{TP}{TP + FP}$	Proportion of predicted Class 1 that are actually Class 1 (true positives).
Recall (Class 0)	$\text{Recall}_{0} = \frac{TN}{TN + FP}$	Proportion of actual Class 0 that are correctly predicted.
Recall (Class 1)	$\text{Recall}_{1} = \frac{TP}{TP + FN}$	Proportion of actual Class 1 that are correctly predicted.
F1-Score (Class 0)	$F1_{0} = 2 \cdot \frac{\text{Precision}_{0} \cdot \text{Recall}_{0}}{\text{Precision}_{0} + \text{Recall}_{0}}$	Harmonic mean of precision and recall for Class 0.
F1-Score (Class 1)	$F1_{1} = 2 \cdot \frac{\text{Precision}_{1} \cdot \text{Recall}_{1}}{\text{Precision}_{1} + \text{Recall}_{1}}$	Harmonic mean of precision and recall for Class 1.
Confusion Matrix	$\begin{bmatrix} TN & FP \\ FN & TP \end{bmatrix}$	Matrix showing the counts of true negatives (TN), false positives (FP), false negatives (FN), and true positives (TP).
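
The formulas in Table 4 can be evaluated directly from the four cells of a binary confusion matrix. The following is a minimal sketch rather than the authors' code; the function name `classification_metrics` and the example counts are hypothetical and chosen only for illustration.

```python
def classification_metrics(tn, fp, fn, tp):
    """Compute the Table 4 metrics from a confusion matrix [[TN, FP], [FN, TP]]."""
    total = tn + fp + fn + tp
    precision_0 = tn / (tn + fn)  # predicted Class 0 that are truly Class 0
    precision_1 = tp / (tp + fp)  # predicted Class 1 that are truly Class 1
    recall_0 = tn / (tn + fp)     # actual Class 0 that are correctly predicted
    recall_1 = tp / (tp + fn)     # actual Class 1 that are correctly predicted
    return {
        "accuracy": (tp + tn) / total,
        "precision_0": precision_0,
        "precision_1": precision_1,
        "recall_0": recall_0,
        "recall_1": recall_1,
        "f1_0": 2 * precision_0 * recall_0 / (precision_0 + recall_0),
        "f1_1": 2 * precision_1 * recall_1 / (precision_1 + recall_1),
    }


# Hypothetical counts for illustration only (not taken from the study).
metrics = classification_metrics(tn=50, fp=10, fn=5, tp=35)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The AUC score is not included because it requires predicted probabilities, not only the confusion matrix counts.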

Table 5. Evaluation metrics for model performance.

Metric	LR	DT	RF	SVM	KNN	GB
Accuracy	0.69	0.80	0.83	0.68	0.81	0.79
AUC Score	0.75	0.82	0.84	0.69	0.82	0.91
Precision (Class 0)	0.69	0.82	0.86	0.68	0.81	0.84
Precision (Class 1)	0.71	0.82	0.83	0.70	0.83	0.79
Recall (Class 0)	0.72	0.82	0.83	0.70	0.84	0.78
Recall (Class 1)	0.68	0.82	0.87	0.68	0.80	0.85
F1-Score (Class 0)	0.70	0.82	0.84	0.69	0.82	0.81
F1-Score (Class 1)	0.69	0.82	0.85	0.69	0.82	0.82
Confusion Matrix	[[1364 536] [619 1293]]	[[1555 345] [338 1574]]	[[1570 330] [246 1666]]	[[1338 562] [617 1295]]	[[1596 304] [378 1534]]	[[1479 421] [287 1625]]

Table 6. Accuracy results using 5-Fold, 10-Fold, and 15-Fold cross-validation.

Models	5 Fold	10 Fold	15 Fold
LR	0.69	0.69	0.69
DT	0.78	0.79	0.80
RF	0.80	0.82	0.83
SVM	0.69	0.68	0.68
KNN	0.79	0.81	0.81
GB	0.78	0.79	0.79
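
The k-fold procedure behind Table 6 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the helper names (`k_fold_indices`, `cross_val_accuracy`), the toy data, and the majority-class stand-in model are all hypothetical, and the study's classifiers would take the place of `fit_majority` and `predict_majority`.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    return [indices[i::k] for i in range(k)]


def cross_val_accuracy(features, labels, k, fit, predict, seed=0):
    """Average test accuracy over k folds, each fold held out exactly once."""
    scores = []
    for test_idx in k_fold_indices(len(labels), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        model = fit([features[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        correct = sum(predict(model, features[i]) == labels[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / k


# Stand-in model (assumption): always predict the majority class of the training fold.
def fit_majority(train_features, train_labels):
    return max(set(train_labels), key=train_labels.count)


def predict_majority(model, sample):
    return model


# Hypothetical toy data: 30 samples, 20 of them labelled 1.
toy_features = [[float(i)] for i in range(30)]
toy_labels = [1 if i % 3 else 0 for i in range(30)]

for k in (5, 10, 15):
    score = cross_val_accuracy(toy_features, toy_labels, k,
                               fit_majority, predict_majority)
    print(f"{k}-fold accuracy: {score:.2f}")
```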

Table 7. Comparison of model performance.

Model	Final Verdict
Random Forest	Best performing model overall. Excels in all metrics including accuracy, precision, recall, and F1-score. Very reliable for predicting congestion with low error rates.
Gradient Boosting	Very strong performer. Delivers the highest AUC score, showing excellent ability to distinguish between classes. Slightly less balanced than random forest but still excellent.
Decision Tree	Moderate performer. Offers interpretability and decent accuracy but less stable and more prone to overfitting than ensemble methods.
K-Nearest Neighbors	Moderate performer. Performs similarly to decision tree. Easy to understand but less precise. Sensitive to data scaling and structure.
Logistic Regression	Poor performer for prediction. Highly interpretable and useful for understanding relationships but lacks predictive power in this case.
Support Vector Machine	Poor performer. Weak in most metrics, sensitive to tuning, and not well-suited for this specific task or dataset. Not recommended here.

Table 8. Logistic regression assumptions.

Assumptions	Definition	Mathematical Definition
Linearity	Assumes that the log-odds of the outcome are linearly related to the independent variables.	$\ln\left(\frac{p}{1 - p}\right) = \beta_{0} + \beta_{1} X_{1} + \dots + \beta_{n} X_{n}$, where $p$ is the probability of the outcome ($y = 1$) and $\frac{p}{1 - p}$ is the odds.
Independence	Assumes that the observations are independent of each other.	$\mathrm{Cov}(\varepsilon_{i}, \varepsilon_{j}) = 0$ for $i \neq j$, where Cov is the covariance and $\varepsilon$ is the error.
Homoscedasticity	Suggests that the variance of the errors is constant.	The variance of errors is expected to be constant.
Normality	Assumes that residuals are normally distributed.	$\varepsilon_{i} \sim N(0, \sigma^{2})$, where $\varepsilon_{i}$ represents the residual of the logistic regression model.
Multicollinearity	Predictors should not be perfectly correlated with each other.	$\mathrm{VIF}(x_{j}) = \frac{1}{1 - R_{j}^{2}}$, where $R_{j}^{2}$ is the R-squared obtained by regressing $x_{j}$ on the other predictors.
Large sample size	Logistic regression requires a large sample size to provide reliable results.	At least 10 events per predictor variable are recommended for reliable model estimation.
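
As an illustration of the multicollinearity check in Table 8, the sketch below computes the variance inflation factor for the special case of exactly two predictors, where $R_{j}^{2}$ equals the squared Pearson correlation between them. The function names and the sample values are hypothetical; with more than two predictors, $R_{j}^{2}$ would instead come from regressing each predictor on all the others.

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def vif_two_predictors(xs, ys):
    """VIF = 1 / (1 - R^2); with only two predictors, R^2 is r squared."""
    r = pearson_r(xs, ys)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical values for two predictors (not data from the study).
density = [1.0, 2.0, 3.0, 4.0, 5.0]
vehicle_ownership = [2.0, 1.0, 4.0, 3.0, 5.0]

print(f"r   = {pearson_r(density, vehicle_ownership):.2f}")
print(f"VIF = {vif_two_predictors(density, vehicle_ownership):.2f}")
```

A VIF close to 1 indicates little shared variance between the predictors, while larger values indicate stronger collinearity.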

Table 9. Random forest assumptions.

Assumptions	Definition	Mathematical Definition
Nonlinearity	Random forest captures complex nonlinear relationships by aggregating multiple decision trees.	No assumption of linearity; ensemble of nonlinear decision trees.
No Distribution Assumption	Random forest makes no assumptions about the underlying data distribution (e.g., normality, homoscedasticity).	Nonparametric method; no specific statistical distribution required.
Independence	Assumes that individual observations are independent.	$\mathrm{Cov}(\varepsilon_{i}, \varepsilon_{j}) = 0$ for $i \neq j$, where Cov is the covariance and $\varepsilon$ is the residual.
Overfitting Risks	Less prone to overfitting than individual decision trees due to averaging but can still overfit on noisy data.	Reduced overfitting via bagging and aggregation; no direct equation, but generalization improves with more trees.
No Multicollinearity	Random forest handles multicollinearity better than a single decision tree, as it selects random subsets of features for each split.	Correlated features may still affect variable importance measures; model remains robust.
Large Sample Size	Requires sufficiently large datasets for training to ensure accurate averaging and stable performance across trees.	Larger sample size improves ensemble stability and predictive accuracy.

Table 10. Variables introduction.

Indicators	Variables	Definition	References
City Characteristics	Population Density	Describes the level of urbanization, indicating a population per square kilometer, and refers to demand volume.	[3,20,105]
	Average Daily Ridership	Refers to the city’s transport movement.	[27,106]
	Infrastructure: Length of Roads	Introduces the availability of infrastructure and its capacity.	[105,107]
	Vehicle Ownership Rate	Introduces an overview of the number of vehicles on roads, and refers to road occupancy level.	[6,32]
	Vehicle Kilometer Operations	Represents the average travelled distance to reach the demand.	[108,109,110,111]
	Level of Sprawl (Distance)	Illustrates the level of logistics sprawl in the city regarding mobility pattern.	[2,26,112,113]
	Congestion on Roads	Indicates the level of congestion either in terms of capacity, speed, or delays.	[29,113,114]

Table 11. Variables classification.

Indicators	Variables	Type	Characteristics
Logistics Sprawl	Population Density	Independent/Predictor	Ordinal
	Average Daily Ridership	Independent/Predictor	Ordinal
	Infrastructure: Length of Roads	Independent/Predictor	Ordinal
	Vehicle Ownership Rate	Independent/Predictor	Ordinal
	Vehicle Kilometer Operations	Independent/Predictor	Ordinal
	Level of Sprawl (Increased Distance)	Independent/Predictor	Ordinal
	Congestion on Roads	Dependent Variable	Categorical

Table 12. Reviewed papers per journal.

Journal	Number of papers
Case Studies on Transport Policy	1
Procedia—Social and Behavioral Sciences	1
Journal of Transport Geography	18
Journal of the Transportation Research Board	2
Transportation Research Procedia	2
Sustainability	2
REGION	1
Applied Mobilities	1
Cities	1

Table 13. Synthesis of literature review findings.

Reference	Journal	Region	Logistics Activity	Distance	Period
[112]	Case Studies on Transport Policy	Australia, Melbourne	Markets	40–50	2015
[2]	Procedia—Social and Behavioral Sciences	France, Paris	Express transport terminals	10	1974–2008
[115]	Journal of Transport Geography	Georgia, Atlanta	Warehouses	4.5	1998–2008
[8]	Journal of the Transportation Research Board	United States, Los Angeles	Warehouses	9	1998–2009
[29]	-	Surat, India	Textile industry	4.44	2008–2018
[26]	Journal of Transport Geography	France, Lyon	Logistics facilities	2.76	1982–2012
[114]	Journal of Transport Geography	Brazil, São Paulo	Warehouse	0.6	2010–2017
[28]	Transportation Research Procedia	India, Delhi	Timber markets	2.4	1991–2014
[109]	Journal of Transport Geography	Sweden, Göteborg	Warehouses	4.2	2000–2014
[113]	REGION	France, Paris metropolitan area	Warehouses	4.1	2004–2012
[110]	Journal of the Transportation Research Board	California-Southern California	Warehouses	12	1998–2014
[116]	Journal of Transport Geography	Germany, Berlin	Logistics hubs	4	1994–2014
[117]	Journal of Transport Geography	Brazil, Belo Horizonte Metropolitan Area	Warehouses	1.2	1995–2015
[118]	Journal of Transport Geography	Japan, Tokyo	Logistics facilities	2.4	1980–2003
[27]	Journal of Transport Geography	South Africa, Gauteng	Logistics activities	231.49	2010–2014
[111]	Transportation Research Procedia	Canada, Toronto	Warehouses	9.5	2002–2012
[109]	Journal of Transport Geography	Sweden, Västra Götaland	Warehouses	2.7	2000–2014
[27]	Journal of Transport Geography	South Africa, Cape Town	Logistics activities	19.58	2010–2014

Table 14. Data augmentation and balancing algorithms.

Method	Description	Advantages	Performance
Linear Interpolation	Generates new data points by interpolating between existing samples.	Simple, fast, and effective in preserving continuity in numeric data.	Helps model generalization, especially with numeric data.
Gaussian Noise	Adds random noise from a Gaussian distribution to the features.	Prevents overfitting by regularizing the model.	Reduces overfitting but may hurt performance in some models.
Mixup	Combines two samples and their labels using linear interpolation.	Improves model robustness, forces smoother decision boundaries.	Increases generalization, reduces overfitting.
SMOTE (Synthetic Minority Over-sampling)	Generates synthetic samples by interpolating between minority class examples.	Balances the dataset without overfitting, improves recall.	Improves recall and classification accuracy in imbalanced datasets.
ADASYN (Adaptive Synthetic Sampling)	Focuses on generating synthetic samples from difficult-to-learn minority class samples.	Targets hard-to-learn examples, improves model robustness.	Can degrade performance by introducing noise, especially in logistic regression.
Random Oversampling	Duplicates samples from the minority class to balance the dataset.	Simple, effective at balancing the dataset.	Can cause overfitting.
Random Undersampling	Removes samples from the majority class to balance the dataset.	Fast and simple, helps prevent overfitting in large datasets.	May cause underfitting by reducing the dataset size.
Time Series Augmentation	Involves transformations like jittering, warping, and scaling in time-series data.	Preserves temporal relationships in data, enhances model robustness.	Enhances robustness in time-series forecasting.
CutMix	Combines two images (or samples) by cutting and mixing their regions.	Makes the model focus on different parts of the data, improving robustness.	Increases robustness and generalization, especially for images.
Random Erasing	Randomly selects a region in the data and erases it to augment the dataset.	Helps the model focus on relevant features and reduces overfitting.	Helps prevent overfitting but may lead to information loss.
Feature Engineering Augmentation	Involves creating new features by applying transformations to existing ones.	Expands the feature space, capturing more data complexity.	Expands model capacity but can lead to overfitting if not controlled.
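
The four techniques from Table 14 that the study applies (linear interpolation, Gaussian noise, mixup, and SMOTE) can be sketched in plain Python as below. This is a simplified illustration, not the authors' pipeline: all function names, the noise level `sigma`, and the sample values are assumptions, and `smote_like` uses only the single nearest minority neighbour, whereas standard SMOTE picks at random among several nearest neighbours.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible


def interpolate(sample_a, sample_b, t):
    """Linear interpolation: the point at fraction t on the segment from a to b."""
    return [a + t * (b - a) for a, b in zip(sample_a, sample_b)]


def add_gaussian_noise(sample, sigma=0.05):
    """Perturb every feature with zero-mean Gaussian noise (sigma is an assumption)."""
    return [x + rng.gauss(0.0, sigma) for x in sample]


def mixup(sample_a, label_a, sample_b, label_b, lam):
    """Mixup: blend two samples and their labels with the same weight lam."""
    mixed = [lam * a + (1 - lam) * b for a, b in zip(sample_a, sample_b)]
    return mixed, lam * label_a + (1 - lam) * label_b


def smote_like(minority_samples, n_new):
    """SMOTE-style oversampling using the nearest minority neighbour only."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority_samples)
        others = [s for s in minority_samples if s is not base]
        neighbour = min(
            others, key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s))
        )
        synthetic.append(interpolate(base, neighbour, rng.random()))
    return synthetic


# Hypothetical minority-class samples with two features each.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]]

print(interpolate([0.0, 0.0], [2.0, 4.0], 0.5))
print(add_gaussian_noise(minority[0]))
print(mixup([1.0, 0.0], 1, [0.0, 1.0], 0, lam=0.75))
print(smote_like(minority, n_new=2))
```

Mixup produces soft labels between 0 and 1, so a binary classifier would need them thresholded or used as sample weights; that choice is left open here.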

Table 15. Measures of model accuracy.

Model	Accuracy
Logistic Regression	0.69
Random Forest	0.83

Table 16. Measures of variables weights.

Variables	Coefficient
Logistics Sprawl	−0.2413
Density	0.4040
Vehicle_Kilometers_in_Operations	−0.1910
Average_Daily_Ridership	0.5837
Length_of_Roads	−0.3120
Vehicle_Ownership	0.7463
Sprawl Density Interaction	0.0514
Sprawl Vehicle KM Interaction	0.1572

Table 17. Coefficient interpretation.

Variable	Coefficient	Interpretation
Intercept	0.0087	Baseline log-odds of congestion when all variables are zero.
Logistics Sprawl	−0.2413	An increase in sprawl reduces congestion, suggesting decentralizing logistics activities can ease urban traffic.
Density (people/km²)	0.4040	Higher population density increases congestion due to increased travel demand and transport pressure.
Vehicle Kilometers in Operation	−0.1910	Surprisingly, more vehicle activity is associated with lower congestion, possibly due to efficient routing or larger service areas.
Average Daily Ridership (thousands)	0.5837	Higher public transit usage correlates with more congestion, potentially due to inadequate transit infrastructure or mixed traffic conditions.
Length of Roads (km)	−0.3120	Longer road networks reduce congestion, likely by distributing traffic more effectively.
Vehicle Ownership (per 1000 people)	0.7463	More vehicles per capita lead to higher congestion, reflecting greater private vehicle use and competition for road space.
Sprawl–Density Interaction	0.0514	Sprawl combined with high density slightly increases congestion, showing that dense areas with dispersed logistics face pressure.
Sprawl–Vehicle KM Interaction	0.1572	Logistics sprawl with high vehicle activity increases congestion, reinforcing the strain from extended delivery routes in spread-out zones.
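
The coefficients in Table 17 define a logistic equation that maps the predictors to a congestion probability. The sketch below evaluates it under two assumptions that are not stated in the tables: the inputs are on the same scale as the data used to fit the model (for example, standardized values), and each interaction term is the product of its two component variables. The dictionary keys and the function name are hypothetical.

```python
import math

# Coefficients and intercept as reported in Table 17.
INTERCEPT = 0.0087
COEFFICIENTS = {
    "logistics_sprawl": -0.2413,
    "density": 0.4040,
    "vehicle_km": -0.1910,
    "daily_ridership": 0.5837,
    "road_length": -0.3120,
    "vehicle_ownership": 0.7463,
    "sprawl_x_density": 0.0514,
    "sprawl_x_vehicle_km": 0.1572,
}


def congestion_probability(inputs):
    """Return P(congestion = 1) from the logistic equation.

    Assumption: interaction terms are products of the component variables.
    """
    features = dict(inputs)
    features["sprawl_x_density"] = inputs["logistics_sprawl"] * inputs["density"]
    features["sprawl_x_vehicle_km"] = (
        inputs["logistics_sprawl"] * inputs["vehicle_km"]
    )
    log_odds = INTERCEPT + sum(
        COEFFICIENTS[name] * features[name] for name in COEFFICIENTS
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


# Baseline: every predictor at zero, so only the intercept contributes.
baseline = {
    "logistics_sprawl": 0.0,
    "density": 0.0,
    "vehicle_km": 0.0,
    "daily_ridership": 0.0,
    "road_length": 0.0,
    "vehicle_ownership": 0.0,
}
# Hypothetical scenario: vehicle ownership one unit above the baseline.
more_vehicles = dict(baseline, vehicle_ownership=1.0)

print(f"baseline:      {congestion_probability(baseline):.4f}")
print(f"more vehicles: {congestion_probability(more_vehicles):.4f}")
```

Because the intercept is close to zero, the baseline probability sits just above 0.5, and a positive coefficient such as vehicle ownership pushes the probability upward.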

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

