Article

Robust Low-Carbon Multimodal Transport Route Optimization for Containers Under Dual Uncertainty: A Proximal Policy Optimization Approach

School of Economics and Management, Chongqing Jiaotong University, Chongqing 400074, China
* Author to whom correspondence should be addressed.
Electronics 2026, 15(1), 5; https://doi.org/10.3390/electronics15010005
Submission received: 7 November 2025 / Revised: 15 December 2025 / Accepted: 16 December 2025 / Published: 19 December 2025
(This article belongs to the Special Issue Advancements in Autonomous Agents and Multi-Agent Systems)

Abstract

Container multimodal transport faces many uncertainties in practice. To improve operational efficiency and reduce carbon emissions in freight transport, this study develops a multi-objective optimization model for container multimodal routes that incorporates demand and time uncertainties as well as carbon emissions. The proximal policy optimization (PPO) algorithm is used to identify robust transport paths under uncertainty and to assess the model’s sensitivity to price fluctuations and carbon tax rates. Empirical results for the Chongqing–Singapore container route demonstrate the strong applicability of the PPO algorithm. Compared with traditional routing methods, the algorithm yields a lower late-arrival rate and delivers clear advantages in risk avoidance and cost control, thereby effectively reducing carbon emissions in line with carbon-reduction policies and offering practical guidance for logistics firms. The model operates under the assumptions of indivisible cargo and single-visit constraints at nodes, which impose certain limitations. In addition, the current model requires substantial computational resources, which may limit its applicability for smaller companies. With continued optimization, however, the approach can advance the industry toward data-driven, intelligent decision-making.

1. Introduction

1.1. Research Background and Motivation

Global warming increasingly drives extreme climate events that threaten supply chains, prompting governments and businesses to accelerate efforts to meet carbon emission targets. The EU Green Deal and China’s dual-carbon strategy mark the start of emissions control in the transportation sector. Meanwhile, global economic integration expands logistics networks, and customers prioritize efficiency and lower cost in logistics services. Under these conditions, the supply chain industry faces unprecedented dual pressures: it must lower transportation costs and improve operational efficiency to remain competitive, while reducing carbon emissions through technological innovation and model optimization in line with the global green transition [1,2,3]. Traditional single-mode transport fails to meet diverse market demands, and uncertainties in transit time, freight pricing, and transport requirements diminish service reliability [4,5,6]. Multimodal transport integrates rail, waterway, road, and other modes into a single system and reduces energy consumption per unit of freight turnover [7,8]. Therefore, addressing the effects of uncertainty and developing adaptive multimodal transport solutions for varied scenarios becomes a critical pathway to enhancing organizational capability and competitive advantage in this field [9].
Multimodal container transport is a growing trend in the transport services market [10]. It is a complex dynamic system that faces many uncertain factors in operation. In cross-border transport, policies, regulations, tariffs, and related measures across countries raise transportation costs. Container demand fluctuates because of market changes, seasonal variation, and consumption cycles. Multimodal transport requires cooperation among multiple stakeholders to provide door-to-door, full-chain service from the shipper’s origin to the consignee’s destination. Furthermore, modern supply chains impose strict delivery time windows to minimize inventory costs; deviations from these windows incur direct financial losses, such as demurrage charges for early arrival or contractual penalties for late delivery. Without real-time information sharing, total transportation lead time for container shipments grows [11]. Unforeseen events such as natural disasters, extreme weather, and equipment failures impede transportation and make it impossible to execute planned operations as scheduled [12]. Existing research mainly addresses either transport-time uncertainty or demand uncertainty in isolation, and few studies examine both demand volume and timing simultaneously [13]. This gap highlights both the theoretical significance and the practical value of the research presented in this paper.

1.2. Literature Review

Route selection optimization for container multimodal transportation represents a well-established research area. Li et al. [14] examine uncertainty in demand and carbon trading prices, formulate models that minimize total transportation costs, and solve them using a hybrid Fireworks Algorithm with a Gravitational Search Operator (FAGSO). Xu et al. [15] address carbon emissions and process uncertainty, construct a robust optimization model with carbon emission limits, and solve it using the Non-dominated Sorting Genetic Algorithm II (NSGA-II). Ge [16] develops a model-solving approach based on the ant colony algorithm, starting from a multimodal transport network under a single task, and provides an illustrative example. Shoukat et al. [17] introduce a mixed-integer linear programming (MILP) model that formulates a dual-objective minimization of total delivery time and greenhouse gas (GHG) emissions and apply a multi-objective genetic algorithm (MOGA) to solve it. Cheng [18] develops a multimodal transport route optimization model that minimizes total transport costs, including freight charges, waiting costs, and carbon emission costs, and implements a genetic algorithm with preservation and migration strategies. Other researchers apply the Hummingbird Evolutionary Genetic Algorithm [19], the Improved Particle Swarm Optimization algorithm (IPSO) [20], the Multi-Strategy Enhanced Whale Optimization Algorithm (WOA) [13], and hybrid methods combining Trapezoidal Cloud and MULTIMOORA (Multi-Objective Optimization via Ratio Analysis and Full Multiplicative Form) [21] to optimize container multimodal transport routes under uncertainty. However, reinforcement learning (RL) algorithms remain unused for this problem.
In recent years, artificial intelligence [22] and neural network methods [23] have been applied in the logistics and supply chain domain to reduce forecasting errors and handle nonlinear demand data, which significantly improves transport efficiency and lowers costs. Among these approaches, RL emerges as the most widely used. As shown in the literature (Table 1), RL is employed across many logistics and supply chain problems, especially vehicle routing; however, its use for container route optimization remains virtually absent. Yan et al. [24] systematically analyze RL methods, application scenarios, and research status, and they reveal substantial potential for RL in logistics and supply chain management. They show that compared with traditional heuristic methods, RL provides real-time decision making and adaptive optimization through continuous learning, which is suitable to address dynamic uncertainties in supply chains. Thus, applying RL to optimize container multimodal transport routes holds clear relevance for mitigating the uncertainties encountered in real transport operations.

1.3. Research Objectives

This study addresses the low-carbon container multimodal transport route-optimization problem under dual constraints of demand uncertainty and time uncertainty. This study formulates a multi-objective optimization model that integrates transportation cost, transshipment cost, carbon tax cost, and time-penalty cost. Confronted with a highly stochastic environment, the study abandons traditional static planning, casts the problem as a Markov decision process (MDP), and solves it using the proximal policy optimization (PPO) algorithm. By leveraging the adaptive capabilities of reinforcement learning, the approach learns optimal path-selection policies in continuously changing stochastic settings to keep the late rate below 20% while achieving lower overall costs than conventional methods. Finally, this study applies the method to the actual Chongqing–Singapore route of the New International Land–Sea Corridor as an empirical case study. The results demonstrate the robustness of the PPO algorithm for this problem and offer practical operational solutions for logistics enterprises.

1.4. Structure of the Paper

The study proceeds as follows: Section 2 addresses problem formulation and model development; Section 3 details the proposed algorithm; Section 4 presents an empirical study of the Chongqing–Singapore route; and Section 5 concludes with the key findings and future research directions.

2. Problem Description and Model Establishment

2.1. Problem Description

This study examines scenarios in which containers in uncertain quantities move from Chongqing to Singapore. Integrating road, rail, and maritime modes enables a range of multimodal routes. The study compiles characteristics for each mode, including speed, unit cost, transit time, and unit carbon emissions, and assembles data on intercity distances, transshipment costs, and transshipment durations. Unlike traditional work, the model incorporates two practical factors: a stepwise carbon tax mechanism and stochastic transport processes. Carbon emission cost does not scale linearly; it increases in steps as total emissions grow. Transportation time, which is influenced by real-world factors such as weather, port congestion, and equipment failures, is treated as a stochastic variable.

2.2. Model Assumptions

To ensure model solution feasibility, the following assumptions are established:
  • Cargo remains an integral whole during transportation, with no splitting permitted.
  • Each transfer node supports only one transshipment operation.
  • The transportation route follows a one-way flow; each node city is traversed at most once.
  • All nodes are equipped with sufficient transshipment facilities and processing capacity.
  • Railway and waterway timetables are not considered.
  • The capacity on each voyage segment is sufficient to meet the current TEU demand.

2.3. Mathematical Formulation

This study defines its core objective by formulating the problem as a mixed-integer programming model. The optimization primarily aims to minimize the total cost, which comprises transportation costs, transshipment costs, carbon tax costs, and time penalty costs.
$$\min Z = C_{\mathrm{trans}} + C_{\mathrm{transfer}} + C_{\mathrm{tax}} + C_{\mathrm{time}} \tag{1}$$
$$C_{\mathrm{trans}} = \sum_{(i,j) \in E} \sum_{m \in M} Q \, d_{i,j} \, p_m \, x_{i,j}^{m} \tag{2}$$
$$C_{\mathrm{transfer}} = \sum_{i \in V} \sum_{k \in M} \sum_{\substack{l \in M \\ l \neq k}} Q \, c_{k,l}^{\mathrm{trans}} \, y_i^{k,l} \tag{3}$$
In Equation (2): $E$ represents the set of all edges; $(i,j)$ denotes a specific route segment from node $i$ to node $j$; $M$ represents the set of transportation modes and $m$ a specific mode; $Q$ refers to the total number of containers to be transported; $d_{i,j}$ is the distance from node $i$ to node $j$; and $p_m$ is the unit transportation rate for mode $m$. $x_{i,j}^{m}$ is a binary decision variable: it takes the value 1 if mode $m$ is selected for the route segment from node $i$ to node $j$, and 0 otherwise.
In Equation (3): $V$ represents the set of all nodes; $k$ denotes the transportation mode used to arrive at the node and $l$ the mode used to depart from it; $k \neq l$ indicates that transshipment costs arise only when the two modes differ; and $c_{k,l}^{\mathrm{trans}}$ is the unit transshipment cost for switching from mode $k$ to mode $l$. $y_i^{k,l}$ is a binary decision variable: it equals 1 if a transshipment from mode $k$ to mode $l$ takes place at node $i$, and 0 otherwise.
The carbon tax cost $C_{\mathrm{tax}}$ depends on the accumulated carbon emissions $E$ and is calculated as follows:
$$C_{\mathrm{tax}} = \begin{cases} \omega_k \, E\,Q - \delta_k, & Z_{k-1} < E\,Q \le Z_k \\ \omega_{k_{\max}} \, E\,Q - \delta_{k_{\max}}, & E\,Q > Z_{k_{\max}} \end{cases} \tag{4}$$
Here, $\omega_k$ is the carbon tax rate for the $k$-th tier of the progressive carbon tax system, $\delta_k$ is the quick deduction for the $k$-th tier, and $Z_k$ is the threshold of the $k$-th tier [36].
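For concreteness, a minimal sketch of this tiered computation is given below, using the rates and thresholds reported in Section 4.1.2 and taking the total emissions $E\,Q$ as input. The quick deductions are not listed in the paper, so the sketch derives them under a continuity assumption (the tax is continuous at each threshold); all names are illustrative.

```python
# Illustrative sketch of the progressive carbon tax in Equation (4).
# Tier thresholds Z_k (kg) and rates w_k (CNY/kg) follow Section 4.1.2;
# the quick deductions delta_k are derived here so that the tax is
# continuous at each threshold, as in a progressive income tax.

THRESHOLDS = [10_000.0, 50_000.0, 100_000.0]   # Z_1, Z_2, Z_3 (kg)
RATES = [0.05, 0.10, 0.15, 0.20]               # w_1 .. w_4 (CNY/kg)

def quick_deductions(thresholds, rates):
    """delta_k = delta_{k-1} + (w_k - w_{k-1}) * Z_{k-1}."""
    deltas = [0.0]
    for k in range(1, len(rates)):
        deltas.append(deltas[-1] + (rates[k] - rates[k - 1]) * thresholds[k - 1])
    return deltas

DELTAS = quick_deductions(THRESHOLDS, RATES)

def carbon_tax(total_emissions_kg: float) -> float:
    """Tiered tax: w_k * E - delta_k for the tier containing E."""
    k = sum(total_emissions_kg > z for z in THRESHOLDS)  # tier index 0..3
    return RATES[k] * total_emissions_kg - DELTAS[k]

# e.g. carbon_tax(96_931.8) falls in the third tier (50,000-100,000 kg)
```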
$$C_{\mathrm{time}} = \begin{cases} \alpha \, Q \, (T_{\min} - T), & T < T_{\min} \\ \beta \, Q \, (T - T_{\max}), & T > T_{\max} \\ 0, & T_{\min} \le T \le T_{\max} \end{cases} \tag{5}$$
In Equation (5), $\alpha$ and $\beta$ are the penalty rates for early and late arrivals, respectively; $T_{\min}$ and $T_{\max}$ define the bounds of the allowable time window; and $C_{\mathrm{time}}$ captures the time-penalty cost linked to the accumulated travel time $T$.
To capture the uncertainty in real-world scenarios (e.g., adverse weather, channel congestion, and equipment failures), this study models the actual transportation time $T_{i,j}^{m,\mathrm{actual}}$ of each route segment as a stochastic variable.
$$T_{i,j}^{m,\mathrm{actual}} = T_{i,j}^{m,\mathrm{base}} \cdot \xi, \qquad \xi \sim U(0.95,\ 1.30) \tag{6}$$
Herein, $T_{i,j}^{m,\mathrm{base}}$ denotes the baseline time calculated from the average speed, and $\xi$ represents a stochastic perturbation factor that follows a uniform distribution over the interval [0.95, 1.30].
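As a small illustration of Equation (6), the per-segment sampling can be written as follows (a sketch; the function name is ours):

```python
import random

def actual_travel_time(base_time_h: float, low: float = 0.95, high: float = 1.30) -> float:
    """Equation (6): perturb the baseline segment time by a uniform factor xi.

    base_time_h is the distance/speed baseline; xi ~ U(low, high) captures
    weather, channel congestion, and equipment-failure delays.
    """
    xi = random.uniform(low, high)
    return base_time_h * xi

# e.g. a 10-hour rail leg realizes anywhere between 9.5 h and 13.0 h
```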

2.4. Construction of Markov Decision Process (MDP)

This study models the container multimodal route optimization problem as an MDP (S, A, P, R, γ), as illustrated in Figure 1. The components of this MDP are defined in detail below [37].

2.4.1. State Space

To ensure the effectiveness of decision-making, the state $s \in S$ must encapsulate comprehensive information. In this study, the state is defined as a composition of several elements: a one-hot encoding of the city, normalized transport metrics, cumulative time, cumulative carbon emissions, cumulative cost, and a one-hot encoding of the previous action mode.
$$s = (i, q, t, e, c, m) \tag{7}$$
Herein, $i \in \{0,1\}^I$ is the one-hot encoding of the city, where $I$ denotes the number of cities.
$q \in [0,1]$ is the normalized transport volume, with $q = \min(Q/Q_{\mathrm{norm}}, 1)$, where $Q$ is the transport volume and $Q_{\mathrm{norm}}$ is the normalization factor.
$t \in [0,1]$ is the normalized accumulated time, with $t = \min(T/T_{\mathrm{norm}}, 1)$, where $T$ is the accumulated time and $T_{\mathrm{norm}}$ is the normalization factor.
$e \in [0,1]$ is the normalized accumulated carbon emissions, with $e = \min(E/E_{\mathrm{norm}}, 1)$, where $E$ is the accumulated carbon emissions and $E_{\mathrm{norm}}$ is the normalization factor.
$c \in [0,1]$ is the normalized accumulated cost, with $c = \min(C/C_{\mathrm{norm}}, 1)$, where $C$ is the accumulated cost and $C_{\mathrm{norm}}$ is the normalization factor.
$m \in \{0,1\}^M$ is the one-hot encoding of the previous transport mode, where $M$ denotes the number of transportation modes.
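A minimal sketch of this state encoding is shown below. The defaults of 22 cities and 4 modes follow the case study (Section 4.1.1 and Table 4), while the normalization factors are illustrative guesses (the paper does not report the exact values); the function name is ours.

```python
import numpy as np

def encode_state(city_idx, Q, T, E, C, prev_mode_idx,
                 n_cities=22, n_modes=4,
                 Q_norm=180.0, T_norm=110.0, E_norm=200_000.0, C_norm=2_000_000.0):
    """Assemble the state vector s = (i, q, t, e, c, m) of Section 2.4.1."""
    i = np.zeros(n_cities)
    i[city_idx] = 1.0                         # one-hot encoding of the current city
    m = np.zeros(n_modes)
    if prev_mode_idx is not None:             # no previous mode before the first leg
        m[prev_mode_idx] = 1.0                # one-hot encoding of the previous mode
    scalars = np.array([min(Q / Q_norm, 1.0),  # q: normalized transport volume
                        min(T / T_norm, 1.0),  # t: normalized accumulated time
                        min(E / E_norm, 1.0),  # e: normalized accumulated emissions
                        min(C / C_norm, 1.0)]) # c: normalized accumulated cost
    return np.concatenate([i, scalars, m])
```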

2.4.2. Action Space

At each node, the agent’s action centers on selecting both the next city and the corresponding transport mode. Formally, the action $a \in A$ specifies the transport mode chosen for travel from a source city to a destination city, such that $a = (\mathit{src}, \mathit{dst}, \mathit{mode})$.

2.4.3. State Transition Probability

In a deterministic setting, if the action $a$ taken in state $s$ leads uniquely to a subsequent state $s'$, then the transition probability is $P(s' \mid s, a) = 1$; otherwise, it is zero. In practice, however, uncertainty typically arises not from the state transition itself, but rather from fluctuations in the environment that influence the accumulation of state-dependent variables.
The state transition in this study combines deterministic routing logic with stochastic temporal dynamics. Let the current state be $s_k = (i, t_k, \dots)$ and the discrete action be $a_k$, which selects the next node $j$ and transportation mode $m$. The transition to the next state $s_{k+1} = (j, t_{k+1}, \dots)$ is defined as follows:
Spatial Transition (Deterministic): the current location updates from node $i$ to node $j$ deterministically based on action $a_k$.
Temporal Transition (Stochastic): the accumulated travel time $t_{k+1}$ is updated by integrating the stochastic factor $\xi$ defined in Equation (6):
$$t_{k+1} = t_k + T_{i,j}^{m,\mathrm{base}} \cdot \xi + t^{\mathrm{trans}} \tag{8}$$
where $t^{\mathrm{trans}}$ is the transshipment time.
This mechanism ensures that while the agent’s choices are discrete, the consequences reflect the continuous uncertainty of the environment.
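A compact sketch of one such transition is given below, with a hypothetical `network` lookup table providing the baseline segment times; the state layout and names are illustrative, not the authors’ implementation.

```python
import random

def step(state, action, network, transfer_time_h):
    """One MDP transition of Section 2.4.3.

    state  : dict with current node 'i' and accumulated time 't'
    action : (src, dst, mode) tuple as defined in Section 2.4.2
    network: maps (src, dst, mode) -> baseline travel time T_base in hours
    """
    src, dst, mode = action
    t_base = network[(src, dst, mode)]
    xi = random.uniform(0.95, 1.30)  # stochastic factor of Equation (6)
    # Spatial transition is deterministic; temporal transition is
    # stochastic, following Equation (8).
    return {"i": dst, "t": state["t"] + t_base * xi + transfer_time_h}
```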

2.4.4. Reward Function

The reward function $R(s, a, s')$ consists of the step reward $r_{\mathrm{step}}$, the visit penalty $r_{\mathrm{visit}}$, and the terminal penalty $r_{\mathrm{terminal}}$:
$$R(s, a, s') = r_{\mathrm{step}} + r_{\mathrm{visit}} + r_{\mathrm{terminal}} \tag{9}$$
$$r_{\mathrm{step}} = -\left[\omega_c \left(c_{\mathrm{step}} + c_{\mathrm{transfer}}\right) + \omega_t \, t_{\mathrm{step}} + \omega_e \left(e_{\mathrm{step}} + e_{\mathrm{transfer}}\right)\right] \tag{10}$$
$$r_{\mathrm{visit}} = \begin{cases} -0.1, & \text{if a city is revisited} \\ 0, & \text{otherwise} \end{cases} \tag{11}$$
$$r_{\mathrm{terminal}} = -\,\frac{C_{\mathrm{tax}} + C_{\mathrm{time}}}{N_{\mathrm{penalty}}} \tag{12}$$
In Equation (10), $\omega_c$, $\omega_t$, and $\omega_e$ denote the weights assigned to cost, time, and carbon emissions, respectively. The terms $c_{\mathrm{step}}$ and $c_{\mathrm{transfer}}$ represent the per-step cost and transfer cost, while $t_{\mathrm{step}}$ denotes the per-step travel time. Likewise, $e_{\mathrm{step}}$ and $e_{\mathrm{transfer}}$ correspond to the carbon emissions incurred during each step and at transfer points.
In Equation (12), $C_{\mathrm{tax}}$ is the carbon tax cost, computed as described above; $C_{\mathrm{time}}$ captures the time-penalty cost linked to the accumulated travel time $T$; and $N_{\mathrm{penalty}}$ serves as the normalization factor for the penalty term.
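A small sketch assembling these components is given below; the negative signs follow Equations (10)–(12) as reconstructed above (we read the step and terminal terms as penalties), the default weights follow Section 4.1.2, and all names are ours.

```python
def step_reward(c_step, c_transfer, t_step, e_step, e_transfer,
                w_c=1.0, w_t=1.0, w_e=1.0):
    """Equation (10): weighted per-step penalty on cost, time, and emissions.

    Inputs are assumed to be pre-normalized to comparable scales.
    """
    return -(w_c * (c_step + c_transfer) + w_t * t_step + w_e * (e_step + e_transfer))

def terminal_reward(c_tax, c_time, n_penalty):
    """Equation (12): normalized terminal penalty for carbon tax and time window."""
    return -(c_tax + c_time) / n_penalty

REVISIT_PENALTY = -0.1  # Equation (11): applied whenever a city is revisited
```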

2.4.5. Discount Factor

The discount factor $\gamma \in [0, 1]$ weights immediate against future rewards. When calculating the cumulative discounted reward, the reward $R_{t+n}$ received $n$ steps in the future carries weight $\gamma^n$.
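For reference, the cumulative discounted reward can be computed with a standard backward recursion (a sketch; $\gamma = 0.99$ follows Table 2):

```python
def discounted_return(rewards, gamma=0.99):
    """Cumulative discounted reward: sum over n of gamma**n * R_{t+n}.

    Iterating backwards over the episode avoids computing explicit powers.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```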

3. Proposed Methodology

3.1. Reinforcement Learning (RL) and Algorithm Selection

RL provides a principled framework for sequential decision-making in which an agent learns to improve its behavior through continuous interaction with the environment rather than relying on a predefined system model [38]. RL exhibits strong adaptability to uncertain conditions because it does not require a precise environmental model. It handles large-scale, high-dimensional state and action spaces and addresses delayed-reward problems. Therefore, RL suits intermodal routing optimization problems that feature high uncertainty and multiple objectives.
RL encompasses a variety of algorithms, including Deep Q-Network (DQN), Advantage Actor-Critic (A2C), Soft Actor-Critic (SAC), Trust Region Policy Optimization (TRPO), and Proximal Policy Optimization (PPO), among others. A3C, the asynchronous variant of A2C, typically trains faster but shows greater instability in reward values, while PPO exhibits longer execution time yet provides a more stable training process [39,40]. DQN handles discrete action spaces effectively but performs worse than PPO in stochastic policy quality and convergence stability [41]. SAC targets continuous action spaces. TRPO stabilizes policy updates by imposing constraints, but requires computationally expensive second-order optimization. PPO preserves TRPO’s stability advantages while ensuring reliable convergence under complex reward structures [42]. Therefore, this study selects PPO as the solver that best balances computational efficiency and robust performance.

3.2. Proximal Policy Optimization (PPO) Algorithm

This study employs the PPO algorithm to train the RL model. PPO is a policy gradient method based on the Actor-Critic framework. It incorporates a probability ratio clipping mechanism, effectively constraining the magnitude of policy updates. This achieves a balance between convergence stability and training efficiency.
1. Actor–Critic Architecture.
The Actor outputs the policy distribution, while the Critic estimates the state-value function; together they can fit highly nonlinear value and policy functions and thereby capture the complex cost structure and dynamics. Formally, this study defines the policy network $\pi_\theta(a \mid s)$ as:
$$\pi_\theta(a \mid s) = \frac{\exp\!\left(\mathrm{logits}_\theta(a \mid s)\right)}{\sum_{a' \in A} \exp\!\left(\mathrm{logits}_\theta(a' \mid s)\right)} \tag{13}$$
where $\theta$ denotes the parameters of the policy network, and $\mathrm{logits}_\theta(a \mid s)$ is the network output corresponding to action $a$ in state $s$.
2. PPO-Clip Objective Function.
$$L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right)\hat{A}_t\right)\right] \tag{14}$$
$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)} \tag{15}$$
In this formulation, $r_t(\theta)$ represents the probability ratio between the new and old policies. The term $\hat{A}_t$ denotes the advantage estimate, reflecting how much better taking action $a_t$ in state $s_t$ is than the average behavior under the policy. The parameter $\varepsilon$ serves as a clipping coefficient that constrains the magnitude of policy updates, ensuring training stability. The clipping function $\mathrm{clip}(x, a, b) = \max(a, \min(x, b))$ enforces this bound by restricting $x$ to the interval $[a, b]$.
3. Value Loss Function.
$$L_t^{VF}(\varphi) = \hat{\mathbb{E}}_t\!\left[\left(V_\varphi(s_t) - \hat{V}_t\right)^2\right] \tag{16}$$
where $\hat{V}_t$ is the target value estimate, obtained via Monte Carlo estimation or temporal-difference estimation.
4. Entropy Regularization Term.
$$S[\pi_\theta] = \hat{\mathbb{E}}_t\!\left[H\!\left(\pi_\theta(\cdot \mid s_t)\right)\right] \tag{17}$$
$$H\!\left(\pi_\theta(\cdot \mid s_t)\right) = -\sum_{a \in A} \pi_\theta(a \mid s_t)\,\log \pi_\theta(a \mid s_t) \tag{18}$$
where $H(\pi_\theta(\cdot \mid s_t))$ is the entropy of the policy distribution.
5. Final Objective Function.
$$L(\theta, \varphi) = L^{\mathrm{CLIP}}(\theta) - c_1 L^{VF}(\varphi) + c_2 S[\pi_\theta] \tag{19}$$
where $c_1$ and $c_2$ are weight coefficients that balance the different loss components.
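A PyTorch sketch of this combined objective is given below, assuming log-probabilities, advantages, value estimates, and return targets have already been collected; hyperparameters follow Table 2, and this is our illustration rather than the authors’ implementation.

```python
import torch

def ppo_loss(new_logp, old_logp, advantages, values, returns, entropy,
             eps=0.2, c1=0.5, c2=0.01):
    """Equations (14)-(19): clipped surrogate + value loss + entropy bonus.

    new_logp/old_logp: log pi_theta(a_t|s_t) under the current / behavior
    policy; advantages: GAE estimates A_hat_t; values/returns: V_phi(s_t)
    and targets V_hat_t; entropy: per-state policy entropy H.
    """
    ratio = torch.exp(new_logp - old_logp)                    # r_t(theta), Eq. (15)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    l_clip = torch.min(surr1, surr2).mean()                   # Eq. (14)
    l_vf = ((values - returns) ** 2).mean()                   # Eq. (16)
    # The total objective (19) is maximized, so it enters the minimized
    # loss with its sign flipped.
    return -(l_clip - c1 * l_vf + c2 * entropy.mean())        # Eq. (19)
```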
6. Algorithm Flow.
Step 1. Initialize the Actor-Critic network and the optimizer.
Step 2. Sample trajectories in the environment, and record states, actions, rewards, next states, and probabilities of the old policy.
Step 3. When sufficient experience is accumulated, calculate the advantage and discounted return.
Step 4. Perform multiple training iterations: sample a mini-batch of data from the buffer; re-evaluate action probabilities and state values using the current Actor network; calculate the loss and perform parameter updates.
Step 5. Empty the buffer and repeat the above process until the predefined number of training steps is reached or the policy converges.
Through the aforementioned design and training mechanisms, the PPO algorithm is capable of learning stable and robust strategies for transport path optimization in complex environments. This capability is particularly relevant to the high-dimensional, uncertain, and multi-objective scenarios encountered in this study (see Figure 2).

3.3. Implementation Details and Hyperparameters

We implement the proposed PPO algorithm in the PyTorch (2.6.0+cpu) deep learning framework. To ensure experiment reproducibility, we specify the network architecture and the key hyperparameters used in this study. The implementation adopts an Actor-Critic architecture with a shared feature-extraction network that consists of two fully connected layers with 256 and 128 neurons, respectively. We use the rectified linear unit (ReLU) as the activation function. Separate output layers serve the Actor and the Critic.
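A minimal PyTorch sketch consistent with this description is shown below; the state and action dimensions depend on the problem instance and are left as constructor arguments, and the class name is ours.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared-backbone Actor-Critic matching Section 3.3 (a sketch)."""

    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        # Shared feature extractor: two fully connected layers
        # with 256 and 128 neurons and ReLU activations.
        self.backbone = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        self.actor = nn.Linear(128, n_actions)   # policy logits head
        self.critic = nn.Linear(128, 1)          # state-value head

    def forward(self, state: torch.Tensor):
        features = self.backbone(state)
        dist = torch.distributions.Categorical(logits=self.actor(features))
        return dist, self.critic(features).squeeze(-1)
```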
We select the hyperparameters following the recommendations in the original PPO paper by Schulman et al. [42] and fine-tune them to match the scale of the multimodal routing problem. The specific settings are summarized in Table 2.

4. Case Study and Experimental Analysis

4.1. Experimental Setup and Data

4.1.1. Case Background: Chongqing–Singapore Route

In this study, twenty-two transport nodes are established based on the spatial configuration of the primary corridors of the Chongqing–Singapore multimodal container transport network. This network is based on the actual New International Land–Sea Corridor. The network’s detailed structure is shown in Figure 3. The numbers corresponding to the cities in Table 3 are the same as the numbers in Figure 3. The corresponding speeds, unit freight charges, transit times, and unit carbon emissions for each mode of transport (Table 4) are known. Additionally, data regarding the distances between node cities (Table 5), transshipment costs, and the time required for transshipment (Table 6) are available.
In terms of data, domestic railway container freight rates were obtained from the China Railway 95306 freight service website (https://ec.95306.cn/, accessed on 18 May 2025). Distances between nodal cities were derived from both AutoNavi Maps (https://ditu.amap.com/, accessed on 18 May 2025) and Baidu Maps (https://map.baidu.com/, accessed on 18 May 2025).

4.1.2. Parameter Settings

To represent uncertainty in cargo demand, and following the daily container-volume data and designs in the literature [14], the model uses three demand scenarios with associated probabilities: 180 TEU (0.36), 90 TEU (0.50), and 45 TEU (0.14). Transportation time is modeled as a stochastic variable with a random fluctuation factor in [0.95, 1.30], simulating real-world disruption factors in logistics (see Equation (6)). Customers require shipment from Chongqing to Singapore within four days, with a delivery-time window of 80 to 110 h. Early arrivals incur a penalty of 50 CNY per TEU per hour, and delayed arrivals incur a penalty of 100 CNY per TEU per hour. Transportation prices show a nonlinear relationship with shipment quantity.
To account for carbon regulatory policies, carbon tax rates are set according to international benchmarks, ranging from 0.05 to 0.20 CNY per kilogram. We consider four discrete tax-rate levels ($\omega = 0.05, 0.1, 0.15, 0.2$) with emission thresholds of $Z = 10{,}000$, $50{,}000$, and $100{,}000$ kilograms. In the reinforcement learning reward function, the weights for cost, time, and carbon emissions are each set to 1.0. Under this configuration, the PPO model attains the highest level of robustness.
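The following sketch collects these scenario settings and implements the time-window penalty of Equation (5); the values come from this section, while the variable and function names are ours.

```python
import random

# Scenario and penalty settings of Section 4.1.2.
DEMAND_SCENARIOS = [(180, 0.36), (90, 0.50), (45, 0.14)]   # (TEU, probability)
T_MIN, T_MAX = 80.0, 110.0                                 # delivery window, hours
ALPHA, BETA = 50.0, 100.0                                  # penalties, CNY per TEU per hour

def sample_demand() -> int:
    """Draw a TEU demand for one training episode."""
    volumes, probs = zip(*DEMAND_SCENARIOS)
    return random.choices(volumes, weights=probs, k=1)[0]

def time_penalty(total_time_h: float, q_teu: int) -> float:
    """Equation (5): early/late penalties outside the [T_MIN, T_MAX] window."""
    if total_time_h < T_MIN:
        return ALPHA * q_teu * (T_MIN - total_time_h)
    if total_time_h > T_MAX:
        return BETA * q_teu * (total_time_h - T_MAX)
    return 0.0
```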

4.2. Algorithm Validation and Training Stability Analysis

This section first compares the PPO algorithm with other baseline RL algorithms to justify its selection and then analyzes the internal training stability of PPO using multiple metrics over five random seeds.

4.2.1. Algorithm Validation

To demonstrate the superiority of PPO for the stochastic multimodal routing problem, we benchmark it against two representative RL algorithms: DQN, a classic value-based method, and A2C, a synchronous policy-based method. All algorithms are trained under identical stochastic environment settings for 150,000 steps, with the remaining settings consistent with Table 2. Figure 4 shows the training curves.
The PPO curve (blue) climbs steadily and rapidly, reaching the highest stable reward (approximately 97) at around 40,000 steps; its narrow standard-deviation band indicates strong robustness to environment stochasticity. The DQN curve (red) exhibits a rapid increase in reward during the mid-training phase (after approximately 20,000 steps) and stabilizes in the high positive reward range of 90–100 in the later stages, albeit with minor fluctuations. The A2C curve (green) remains within the low-to-medium range of 20–60 throughout training, the lowest reward ceiling among the three; it displays significant volatility, with a broad shaded area indicating poor stability during training.
These results demonstrate empirically that PPO’s clipped surrogate objective reduces the variance induced by uncertain travel times, making PPO the most suitable choice for this work.

4.2.2. Training Internal Stability Analysis

To further verify the internal stability of the proposed PPO model, we analyze key training metrics averaged over five random seeds.
Figure 5a shows that policy entropy is high early in training and drops sharply as steps increase, most dramatically in the first 40,000 steps, before approaching zero later. This marks a shift from exploration-focused behavior to exploitation-focused decision making, reflecting the training logic of reinforcement learning. The light green band (denoting standard deviation) shrinks steadily during training, and the entropy trajectories of the five seeds become increasingly parallel, confirming stable convergence under different initial conditions.
Figure 5b shows that the value loss fluctuates early in training, peaking between 40,000 and 60,000 steps, and then declines steadily. Although the value function is volatile during early adjustment, it settles into a low-loss regime later on, indicating increasingly accurate predictions. The light orange standard-deviation band contracts with training, and the value-loss trajectories of the random seeds become increasingly parallel.
Together, these two metrics show that the PPO algorithm trains stably and is insensitive to random-seed initialization, supporting robust learning performance even under complex stochastic conditions.

4.3. Robustness in Practical Scenarios: PPO vs. Traditional Methods

We choose two benchmark methods for comparison: Mixed-Integer Linear Programming (MILP), which represents the theoretical optimal solution in deterministic environments, and the Genetic Algorithm (GA), which represents the performance of conventional intelligent methods. These comparisons verify PPO’s stability in uncertain environments. Table 7 and Figure 6 reveal a key trade-off: theoretical optimality versus practical robustness.
(1) 45 TEU scenario
MILP and GA select routes with multiple waterway legs and transshipments; their deterministic cost is 455,381 CNY, lower than PPO’s deterministic cost (499,331 CNY). When travel times fluctuate, however, the situation changes: PPO’s stochastic average cost is 499,574 CNY, below MILP’s 515,766 CNY and GA’s 511,971 CNY. Notably, PPO has the smallest standard deviation (1228 CNY), the lowest stochastic late rate (4%), and the smallest robustness gap (+0.5%), with comparable carbon emissions (48,465.90 kg). These results show that at low demand volumes, PPO ensures on-time, budget-compliant delivery with excellent stability.
(2) 90 TEU scenario
MILP finds the route with the lowest total deterministic cost (892,287 CNY), while GA and PPO converge to the same alternative route, which has the lowest carbon emissions (96,931.80 kg). When travel times are volatile, MILP and GA perform poorly: both have late rates above 92% and standard deviations over 64,167 CNY, showing that traditional methods lack adaptability. In contrast, PPO selects a more robust route under time uncertainty, maintains a high on-time delivery rate, and shows strong nonlinear adaptability.
(3) 180 TEU scenario
In the 180 TEU scenario, the standard deviation of the PPO solution (73,172 CNY) is slightly higher than that of the MILP baseline (62,099 CNY), because the PPO agent follows a stochastic policy and retains a minimal exploration probability even after convergence. All three algorithms converge to the same route, Chongqing–(Rail)–Qinzhou Port–(Water)–Singapore. At this scale, the deterministic optimal solution matches the robust optimal solution, which validates PPO’s convergence and effectiveness in large-scale scenarios.
Table 7 and Figure 6 show that under stochastic travel times (weather and traffic uncertainties), PPO performs significantly better than MILP and GA in three respects: total cost control, stability, and on-time delivery rate. In all TEU scenarios, PPO selects the route Chongqing–(Rail)–Qinzhou Port–(Water)–Singapore, the main route of the New International Land–Sea Corridor, which is feasible in terms of carbon emission reduction and time-window restrictions while meeting real transportation requirements (Figure 7).

4.4. Sensitivity Analysis

4.4.1. Sensitivity to Transportation Cost Fluctuations

As part of the comprehensive sensitivity analysis, this subsection evaluates the model’s response to transportation cost fluctuations by introducing uniform random noise at different perturbation levels. The analysis assesses whether the model’s route choices and cost performance are sensitive to these operational uncertainties.
This study introduces uniform random noise (zero mean; perturbation levels of ±5%, ±10%, and ±20%) on the original transportation costs to simulate cost fluctuations in real-world environments. In this perturbed environment, the trained model re-identifies optimal routes, and behavioral performance and strategy robustness are analyzed under the different noise levels.
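A minimal sketch of this perturbation procedure is shown below; the dictionary structure for per-segment costs is illustrative, and the function name is ours.

```python
import random

def perturb_costs(base_costs: dict, level: float) -> dict:
    """Apply zero-mean uniform noise at a given perturbation level
    (0.05, 0.10, or 0.20) to per-segment transportation costs,
    mirroring the sensitivity setup of Section 4.4.1.
    """
    return {segment: cost * (1.0 + random.uniform(-level, level))
            for segment, cost in base_costs.items()}

# e.g. evaluate the trained policy over many draws of perturb_costs(costs, 0.10)
```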
Table 8 presents the route selection and cost changes of the PPO model under the various price-noise levels. The model consistently selects the route Chongqing–(Rail)–Qinzhou Port–(Water)–Singapore across all transportation-cost perturbation levels. Figure 8 presents the route robustness results, and Figure 9 presents the cost robustness of the RL strategy. The model maintains the original strategy consistently, unaffected by transportation cost fluctuations, and the average total cost remains close to the initial deterministic cost.
These results collectively indicate that the proposed model exhibits strong robustness to transportation cost fluctuations within the practical range, and they further support the global robustness conclusion in Section 4.3 by confirming the model’s insensitivity to a key operational variable.

4.4.2. Sensitivity to Carbon Tax Rate

To examine the proposed model’s sensitivity to carbon tax rates, this study adjusts the carbon tax rate $\omega_k$ while keeping the thresholds $Z_k$ unchanged and re-solves for the optimal route strategy. Table 9 presents the transportation routes, modes, and cost data under the 90 TEU scenario, where (0, 0, 0, 0) serves as a control group that ignores carbon emission costs.
Results demonstrate that as carbon tax rate levels rise, total carbon emissions exhibit a decreasing trend. At the Level II tax rate, the transportation route aligns with the control group: “Chongqing–(Rail)–Guiyang–(Rail)–Liuzhou–(Rail)–Litang–(Water)–Qinzhou Port–(Water)–Singapore”. This phenomenon indicates that carbon cost accounts for a limited share of total cost at this stage, with the model preferring routes with lower total costs.
Furthermore, starting from the Level III carbon tax rate, transportation routes and modes shift. Route selection converges to “Chongqing–(Rail)–Qinzhou Port–(Water)–Singapore”, confirming carbon tax rate changes exert a practical guiding effect on transportation decisions.
Overall, the proposed model is sensitive to carbon tax rate changes. Particularly at higher carbon tax levels, the model proactively adjusts route strategies to balance transportation costs and carbon emissions, demonstrating sensitivity to carbon tax rates and strategy flexibility.

5. Discussion

This paper presents a multi-objective model for container logistics under demand and time uncertainty while accounting for carbon emissions. In highly random environments, we abandon static planning and use PPO. Although the PPO algorithm takes a long time to train, once the model is trained the decision-making process requires only a few seconds. Case studies and multi-dimensional experiments on the Chongqing–Singapore route show that, unlike static planning, PPO can switch routes dynamically under changing conditions while maintaining strong performance. The reward function allows weight customization: cost-sensitive firms can increase the weight on costs, and low-carbon-oriented firms can increase the weight on carbon emissions. To our knowledge, no previous work applies reinforcement learning to the container route optimization problem under combined demand uncertainty and time variability. For the Chongqing–Singapore case, our recommendation agrees with other studies: rail transport from Chongqing to Qinzhou Port followed by sea transport to Singapore. We emphasize the strategic relevance of the New International Land–Sea Corridor and the practical and policy importance of the proposed method.

6. Conclusions

Our results support the following conclusions:
(1) Comparative experiments between PPO and the classic reinforcement learning algorithms DQN and A2C show that PPO offers clear advantages in multimodal transport route optimization, because its clipped surrogate objective reduces the variance induced by travel-time variability. This study provides empirical evidence that policy-gradient methods can outperform value-based methods in stochastic routing problems with discrete action spaces, offering a reference for algorithm selection in the field of intelligent logistics.
(2) Statistical analysis of training metrics (policy entropy and critic value loss) across five random seeds shows that PPO trains stably and has low sensitivity to initial conditions.
(3) Comparison with MILP and GA shows that while MILP and GA achieve marginally lower costs in low-cargo-volume cases, PPO performs better under realistic travel-time fluctuations. It maintains a low stochastic late rate, a narrow robustness gap, and low carbon emissions across all three cargo-volume cases, and converges to the optimal route.
(4) Sensitivity analysis shows that the model is robust to transportation cost changes and sensitive to carbon tax changes. Under cost disturbances, the model always keeps the best route, and the average cost stays close to the initial deterministic cost. As carbon-tax levels rise, routes are adjusted to balance transportation costs and carbon emissions. This provides decision support for low-carbon transport policy design.
(5) In conclusion, this paper establishes a novel MDP framework that couples a progressive carbon tax mechanism with dual uncertainties. We find that reinforcement learning is applicable to the multi-objective path optimization problem of container multimodal transport and can adaptively adjust the path strategy. Compared with traditional static methods, the PPO algorithm is more adaptable to the randomness of environmental changes, and its results exhibit a degree of robustness.
However, the study has limitations. It does not consider transportation capacity constraints, dynamic adjustments to carbon tax thresholds, or rail and waterway schedule constraints. The current model also requires substantial computational power, which may make it unsuitable for small enterprises. Future research will consider sensitivity analysis of combined parameter changes to further verify model robustness. In practice, enterprises would first collect data for model training and then deploy the trained model in their transportation management systems to assist schedulers in decision-making. Researchers should also expand application scenarios to multi-source collaborative transportation and integrate more advanced deep-learning architectures that can improve decision-making efficiency in large-scale transportation networks. These efforts will drive the industry toward data-driven, intelligent decision-making.

Author Contributions

Study conceptualization, formal analysis, data curation, methodology, and investigation were performed by R.Z., C.D. and Y.L. Funding acquisition, project administration, and supervision were performed by R.Z. Resources, software, visualization were performed by C.D. and Y.L. Validation, writing—original draft, and writing—review & editing were performed by R.Z., C.D. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Social Science Fund of China (Grant no. 22BJL069), Chongqing Technology Foresight and Institutional Innovation Project (Grant no. CSTB2025zdcx-gkjzX0019), and Chongqing Graduate Research and Innovation Project (Grant no. CYS240510).

Data Availability Statement

The original contributions presented in the study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Abbreviations

The following abbreviations are used in this manuscript:
PPO: Proximal Policy Optimization
RL: Reinforcement Learning
MDP: Markov Decision Process
DQN: Deep Q-Network
A2C: Advantage Actor-Critic
SAC: Soft Actor-Critic
TRPO: Trust Region Policy Optimization
MILP: Mixed-Integer Linear Programming
GA: Genetic Algorithm
TROA: Tyrannosaurus Optimization Algorithm
FAGSO: Fireworks Algorithm hybridized with a Gravitational Search Operator
NSGA-II: Non-dominated Sorting Genetic Algorithm II

References

  1. Ahmady, M.; Eftekhari Yeghaneh, Y. Optimizing the Cargo Flows in Multi-Modal Freight Transportation Network Under Disruptions. Iran. J. Sci. Technol. Trans. Civ. Eng. 2022, 46, 453–472. [Google Scholar] [CrossRef]
  2. Miechowicz, W.; Kiciński, M.; Miechowicz, I.; Merkisz-Guranowska, A. The Attractiveness of Regional Transport as a Direction for Improving Transport Energy Efficiency. Energies 2024, 17, 4844. [Google Scholar] [CrossRef]
  3. Pezzella, G.; Baaqel, H.; Messa, G.M.; Sarathy, S.M. Towards Decarbonized Heavy-Duty Road Transportation: Design and Carbon Footprint of Adsorption-Based Carbon Capture Technologies Using Life Cycle Thinking. Chem. Eng. J. 2025, 508, 161168. [Google Scholar] [CrossRef]
  4. Dini, N.; Yaghoubi, S.; Bahrami, H. Logistics Performance Index-Driven in Operational Planning for Logistics Companies: A Smart Transportation Approach. Transp. Policy 2025, 160, 42–62. [Google Scholar]
  5. Shramenko, N.; Muzylyov, D.; Shramenko, V. Methodology of Costs Assessment for Customer Transportation Service of Small Perishable Cargoes. Int. J. Bus. Perform. Manag. 2020, 21, 132–148. [Google Scholar] [PubMed]
  6. Solano-Charris, E.L.; Prins, C.; Santos, A.C. Solving the Bi-Objective Robust Vehicle Routing Problem with Uncertain Costs and Demands. RAIRO Oper. Res. Rech. Opér. 2016, 50, 689–714. [Google Scholar] [CrossRef]
  7. Feng, X.; Song, R.; Yin, W.; Yin, X.; Zhang, R. Multimodal Transportation Network with Cargo Containerization Technology: Advantages and Challenges. Transp. Policy 2023, 132, 128–143. [Google Scholar]
  8. Amine Masmoudi, M.; Baldacci, R.; Mancini, S.; Kuo, Y.H. Multi-Compartment Waste Collection Vehicle Routing Problem with Bin Washer. Transp. Res. Part E Logist. Transp. Rev. 2024, 189, 103681. [Google Scholar]
  9. Chang, T.S. Best Routes Selection in International Intermodal Networks. Comput. Oper. Res. 2008, 35, 2877–2891. [Google Scholar] [CrossRef]
  10. Tagiltseva, J.; Vasilenko, M.; Kuzina, E.; Drozdov, N.; Parkhomenko, R.; Prokopchuk, V.; Skichko, E.; Bagiryan, V. The Economic Efficiency Justification of Multimodal Container Transportation. Transp. Res. Procedia 2022, 63, 264–270. [Google Scholar] [CrossRef]
  11. Ding, L. Multimodal Transport Information Sharing Platform with Mixed Time Window Constraints Based on Big Data. J. Cloud Comput. 2020, 9, 11. [Google Scholar] [CrossRef]
  12. Raicu, S.; Popa, M.; Costescu, D. Uncertainties Influencing Transportation System Performances. Sustainability 2022, 14, 7660. [Google Scholar] [CrossRef]
  13. Feng, F.; Zheng, F.; Zhang, Z.; Wang, L. A Framework for Low-Carbon Container Multimodal Transport Route Optimization Under Hybrid Uncertainty: Model and Case Study. Appl. Sci. 2025, 15, 6894. [Google Scholar] [CrossRef]
  14. Li, M.; Sun, X. Path Optimization of Low-Carbon Container Multimodal Transport under Uncertain Conditions. Sustainability 2022, 14, 14098. [Google Scholar] [CrossRef]
  15. Xu, Q.; Huang, X.; Zhang, W.; Zhao, H.; Zhang, H.; Jin, Z. Eurasian Container Intermodal Transportation Network: A Robust Optimization with Uncertainty and Carbon Emission Constraints. Front. Mar. Sci. 2025, 12, 1576006. [Google Scholar] [CrossRef]
  16. Ge, D. Optimal Path Selection of Multimodal Transport Based on Ant Colony Algorithm. J. Phys. Conf. Ser. 2021, 2083, 032011. [Google Scholar] [CrossRef]
  17. Shoukat, R.; Xiaoqiang, Z. Is Multimodal Transportation Greener and Faster than Intermodal in Full Container Load? A Case from Pakistan–China. Energy Syst. 2025, 16, 1117–1142. [Google Scholar]
  18. Cheng, J. Research on Optimizing Multimodal Transport Path under the Schedule Limitation Based on Genetic Algorithm. J. Phys. Conf. Ser. 2022, 2258, 012014. [Google Scholar]
  19. Liu, S. Multimodal Transportation Route Optimization of Cold Chain Container in Time-Varying Network Considering Carbon Emissions. Sustainability 2023, 15, 4435. [Google Scholar] [CrossRef]
  20. Zheng, C.; Sun, K.; Gu, Y.; Shen, J.; Du, M. Multimodal Transport Path Selection of Cold Chain Logistics Based on Improved Particle Swarm Optimization Algorithm. J. Adv. Transp. 2022, 2022, 5458760. [Google Scholar] [CrossRef]
  21. Wan, S.P.; Gao, S.Z.; Dong, J.Y. Trapezoidal Cloud Based Heterogeneous Multi-Criterion Group Decision-Making for Container Multimodal Transport Path Selection. Appl. Soft Comput. 2024, 154, 111374. [Google Scholar] [CrossRef]
  22. Rezki, N.; Mansouri, M. Deep Learning Hybrid Models for Effective Supply Chain Risk Management: Mitigating Uncertainty While Enhancing Demand Prediction. Acta Logist. 2024, 11, 589–604. [Google Scholar] [CrossRef]
  23. Zhu, W.; Wang, H.; Zhang, X. Synergy Evaluation Model of Container Multimodal Transport Based on BP Neural Network. Neural Comput. Appl. 2021, 33, 4087–4095. [Google Scholar] [CrossRef]
  24. Yan, Y.; Chow, A.H.F.; Ho, C.P.; Kuo, Y.-H.; Wu, Q.; Ying, C. Reinforcement Learning for Logistics and Supply Chain Management: Methodologies, State of the Art, and Future Opportunities. Transp. Res. Part E Logist. Transp. Rev. 2022, 162, 102712. [Google Scholar] [CrossRef]
  25. Nazari, M.; Oroojlooy, A.; Snyder, L.; Takac, M. Reinforcement Learning for Solving the Vehicle Routing Problem. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2018; Volume 31. [Google Scholar]
  26. Chen, X.; Tian, Y. Learning to Perform Local Rewriting for Combinatorial Optimization. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32. [Google Scholar]
  27. Kool, W.; van Hoof, H.; Welling, M. Attention, Learn to Solve Routing Problems! arXiv 2019, arXiv:1803.08475. [Google Scholar] [CrossRef]
  28. Lu, H.; Zhang, X.; Yang, S. A Learning-Based Iterative Method for Solving Vehicle Routing Problems. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  29. Joe, W.; Lau, H.C. Deep Reinforcement Learning Approach to Solve Dynamic Vehicle Routing Problem with Stochastic Customers. Proc. Int. Conf. Autom. Plan. Sched. ICAPS 2020, 30, 394–402. [Google Scholar] [CrossRef]
  30. Zhao, J.; Mao, M.; Zhao, X.; Zou, J. A Hybrid of Deep Reinforcement Learning and Local Search for the Vehicle Routing Problems. IEEE Trans. Intell. Transp. Syst. 2021, 22, 7208–7218. [Google Scholar] [CrossRef]
  31. Zhang, Z.; Liu, H.; Zhou, M.; Wang, J. Solving Dynamic Traveling Salesman Problems With Deep Reinforcement Learning. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 2119–2132. [Google Scholar] [CrossRef] [PubMed]
  32. Irannezhad, E.; Prato, C.G.; Hickman, M. An Intelligent Decision Support System Prototype for Hinterland Port Logistics. Decis. Support Syst. 2020, 130, 113227. [Google Scholar] [CrossRef]
  33. Kim, B.; Jeong, Y.; Shin, J.G. Spatial Arrangement Using Deep Reinforcement Learning to Minimise Rearrangement in Ship Block Stockyards. Int. J. Prod. Res. 2020, 58, 5062–5076. [Google Scholar] [CrossRef]
  34. Guo, C.; Thompson, R.G.; Foliente, G.; Peng, X. Reinforcement Learning Enabled Dynamic Bidding Strategy for Instant Delivery Trading. Comput. Ind. Eng. 2021, 160, 107596. [Google Scholar] [CrossRef]
  35. Farahani, A.; Genga, L.; Schrotenboer, A.H.; Dijkman, R. Capacity Planning in Logistics Corridors: Deep Reinforcement Learning for the Dynamic Stochastic Temporal Bin Packing Problem. Transp. Res. Part E Logist. Transp. Rev. 2024, 191, 103742. [Google Scholar] [CrossRef]
  36. Sun, Y.; Mao, X.; Yin, X.; Liu, G.; Zhang, J.; Zhao, Y. Optimizing Carbon Tax Rates and Revenue Recycling Schemes: Model Development, and a Case Study for the Bohai Bay Area, China. J. Clean. Prod. 2021, 296, 126519. [Google Scholar] [CrossRef]
  37. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-Level Control Through Deep Reinforcement Learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
  38. Hu, K.; Li, M.; Song, Z.; Xu, K.; Xia, Q.; Sun, N.; Zhou, P.; Xia, M. A Review of Research on Reinforcement Learning Algorithms for Multi-Agents. Neurocomputing 2024, 599, 128068. [Google Scholar] [CrossRef]
  39. Wang, H.; Ye, Y.; Zhang, J.; Xu, B. A Comparative Study of 13 Deep Reinforcement Learning Based Energy Management Methods for a Hybrid Electric Vehicle. Energy 2023, 266, 126497. [Google Scholar] [CrossRef]
  40. del Rio, A.; Jimenez, D.; Serrano, J. Comparative Analysis of A3C and PPO Algorithms in Reinforcement Learning: A Survey on General Environments. IEEE Access 2024, 12, 146795–146806. [Google Scholar] [CrossRef]
  41. Sharma, R.; Garg, P. Reinforcement Learning Advances in Autonomous Driving: A Detailed Examination of DQN and PPO. In Proceedings of the 2024 Global Conference on Communications and Information Technologies (GCCIT), Bengaluru, India, 25–26 October 2024; pp. 1–5. [Google Scholar]
  42. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar] [CrossRef]
Figure 1. The agent interacts with the environment.
Figure 2. PPO training process for RL.
Figure 3. Chongqing–Singapore container multimodal transport network.
Figure 4. The comparative training curves.
Figure 5. Metric changes during PPO training. Panels (a,b) present policy entropy evolution and critic value loss.
Figure 6. Robustness comparison with traditional methods.
Figure 7. 90 TEU path results.
Figure 8. Path stability diagram.
Figure 9. Distribution chart of key metrics. Panels (a–c) depict the robustness of RL policies for Q = 45, 90, and 180 TEU under cost uncertainty.
Table 1. Applications of RL.

| Problem | Method | Study |
| --- | --- | --- |
| travelling salesman problem | Policy Gradient | [25] |
| capacitated vehicle routing problem | Actor-Critic | [26] |
| travelling salesman problem, vehicle routing problem | Policy Gradient | [27] |
| vehicle routing problem | Policy Gradient | [28] |
| dynamic vehicle routing problem | Deep Q-Network | [29] |
| vehicle routing problem | Policy Gradient | [30] |
| dynamic travelling salesman problem | Policy Gradient | [31] |
| routing | Adaptive Dynamic Programming | [32] |
| allocation | Asynchronous Advantage Actor-Critic | [33] |
| carriers | Deep Q-Network, Tabular Q-learning | [34] |
| carriers | Proximal Strategy Optimization | [35] |
Table 2. Hyperparameter settings for the PPO algorithm.

| Hyperparameter | Symbol | Value | Description |
| --- | --- | --- | --- |
| Learning Rate | $\alpha$ | 1 × 10⁻⁴ | Step size for network weight updates |
| Discount Factor | $\gamma$ | 0.99 | Weighting for future rewards |
| GAE Parameter | $\lambda$ | 0.95 | For Generalized Advantage Estimation |
| Clipping Ratio | $\varepsilon$ | 0.2 | Constrains policy updates within $[1-\varepsilon,\, 1+\varepsilon]$ |
| Value Loss Coeff. | $c_1$ | 0.5 | Weight of the critic loss term |
| Entropy Coeff. | $c_2$ | 0.01 | Encourages exploration to avoid local optima |
| Update Epochs | $K$ | 10 | Number of iterations per policy update |
| Update Timesteps | $T_{\mathrm{horizon}}$ | 2048 | Data-collection batch size before update |
| Max Training Steps | $T_{\max}$ | 150,000 | Total steps for the training process |
| Random Seeds | – | 5 seeds | 42, 101, 202, 303, 404 (for stability verification) |
Table 3. City coding table.

| Code | City | Code | City | Code | City | Code | City |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Chongqing | 7 | Vientiane | 13 | Bangkok | 19 | Qinzhou Port |
| 2 | Pingxiang | 8 | Huangtong | 14 | Shanghai Port | 20 | Hong Kong Tsing Yi District |
| 3 | Mohan | 9 | Liuzhou | 15 | Nanning | 21 | Nansha Port |
| 4 | Guiyang | 10 | Yangon | 16 | Litang | 22 | Singapore |
| 5 | Huaihua | 11 | Baise | 17 | Yantian Port | | |
| 6 | Hanoi | 12 | Sihanoukville | 18 | Kuala Lumpur | | |
Table 4. Average speeds of road, railway, and waterway transport modes.

| Transport Mode | Transport Price | Average Speed | Unit Carbon Emission |
| --- | --- | --- | --- |
| Railway | 2.8 CNY/km | 85 km/h | 1.27 kg/TEU·km |
| Highway | 6 CNY/km | 65 km/h | 3.73 kg/TEU·km |
| Waterway | 1 CNY/km (<500 km); 0.6 CNY/km (>500 km) | 35 km/h | 0.42 kg/TEU·km |
| Seaway | 3 CNY/km | 35 km/h | 0.42 kg/TEU·km |
Table 5. Transport distances by road, railway, and waterway for each segment.

| Transport Segment | Highway | Railway | Waterway |
| --- | --- | --- | --- |
| (1, 2) | 1060 | – | – |
| (1, 3) | 1514 | – | – |
| (1, 4) | 392 | 424 | – |
| (1, 5) | 563 | 602 | – |
| (1, 19) | 1065 | 1217 | 2478 |
| (1, 14) | 1667 | – | 2400 |
| (2, 6) | 172 | 167 | – |
| (3, 7) | 680 | 422 | – |
| (4, 8) | 127 | 315 | – |
| (4, 9) | 495 | 489 | 564 |
| (5, 9) | 448 | 453 | – |
| (5, 17) | 920 | 974 | – |
| (5, 21) | 870 | 976 | – |
| (6, 10) | 1100 | – | – |
| (7, 13) | 600 | – | – |
| (8, 11) | 400 | – | – |
| (9, 16) | 150 | 135 | 198 |
| (10, 12) | – | – | 1200 |
| (11, 15) | 230 | 223 | – |
| (12, 13) | 940 | – | – |
| (13, 18) | 1480 | 665 | 2047 |
| (13, 22) | – | – | 1574 |
| (14, 22) | – | – | 4700 |
| (15, 19) | 139 | 117 | – |
| (16, 19) | 204 | 215 | 300 |
| (17, 20) | 50 | 80 | 84 |
| (17, 22) | – | – | 2408 |
| (18, 22) | 325 | – | 426 |
| (19, 22) | 4270 | – | 2450 |
| (20, 22) | – | – | 2670 |
| (21, 22) | – | – | 4000 |
Table 6. Transshipment costs and consumption.

| Transshipment | Cost (CNY/TEU) | Time (h/TEU) | Carbon Emission (kg/TEU) |
| --- | --- | --- | --- |
| Highway–Railway | 82.6 | 0.05 | 2.24 |
| Highway–Waterway | 178.64 | 0.06 | 2.07 |
| Railway–Waterway | 242.06 | 0.08 | 1.90 |
| Waterway–Seaway | 200.00 | 0.07 | 1.50 |
Table 7. Comprehensive comparison of route selection and performance metrics under deterministic vs. stochastic environments.

| Scenario | Index | MILP | GA | PPO |
| --- | --- | --- | --- | --- |
| 45 TEU | Route | 1-4-9-16-19-22 | 1-4-9-16-19-22 | 1-19-22 |
| | Mode | R-W-W-W-W | R-W-W-W-W | R-W |
| | Deterministic Cost (CNY) | 455,381 | 455,381 | 499,331 |
| | Stochastic Average Cost (CNY) | 515,766 | 511,971 | 499,574 |
| | Standard Deviation (CNY) | 24,976 | 22,311 | 1228 |
| | Stochastic Late Rate | 58% | 59% | 4% |
| | Robustness Gap | +13.3% | +12.4% | +0.5% |
| | Carbon Emissions (kg) | 48,112.20 | 48,112.20 | 48,465.90 |
| 90 TEU | Route | 1-4-9-16-19-22 | 1-19-22 | 1-19-22 |
| | Mode | R-R-W-W-W | R-W | R-W |
| | Deterministic Cost (CNY) | 892,287 | 904,691 | 904,691 |
| | Stochastic Average Cost (CNY) | 992,408 | 987,034 | 908,864 |
| | Standard Deviation (CNY) | 64,167 | 66,527 | 10,626 |
| | Stochastic Late Rate | 95% | 92% | 17% |
| | Performance Gap | +11.2% | +9.1% | +0.4% |
| | Carbon Emissions (kg) | 98,578.00 | 96,931.80 | 96,931.80 |
| 180 TEU | Route | 1-19-22 | 1-19-22 | 1-19-22 |
| | Mode | R-W | R-W | R-W |
| | Deterministic Cost (CNY) | 1,720,256 | 1,720,256 | 1,720,256 |
| | Stochastic Average Cost (CNY) | 1,770,229 | 1,761,626 | 1,780,535 |
| | Standard Deviation (CNY) | 62,099 | 63,728 | 73,172 |
| | Stochastic Late Rate | 53% | 50% | 52% |
| | Performance Gap | +2.9% | +2.7% | +3.5% |
| | Carbon Emissions (kg) | 193,863.60 | 193,863.60 | 193,863.60 |
Table 8. Route selection and cost variations for 90 TEU under different price noise levels.

| Noise Level | Route Change vs. 0% Noise | Success Rate | Average Cost Change (%) | Total Cost Variation Range (%) |
| --- | --- | --- | --- | --- |
| ±5% | No | 100% | −0.02 | (−2.03, 2.03) |
| ±10% | No | 100% | −0.16 | (−4.32, 4.03) |
| ±20% | No | 100% | +0.08 | (−8.08, 8.18) |
Table 9. Results of carbon tax rate adjustments.

| Scenario | Carbon Tax Rate | Route | Mode | Time (Hours) | Carbon Emissions (kg) | Total Cost (CNY) |
| --- | --- | --- | --- | --- | --- | --- |
| 90 TEU | (0, 0, 0, 0) | 1-4-9-16-19-22 | R-R-W-W-W | 108.36 | 98,578.00 | 880,500.60 |
| | (0.025, 0.05, 0.075, 0.1) | 1-4-9-16-19-22 | R-R-W-W-W | 108.36 | 98,578.00 | 886,394.01 |
| | (0.05, 0.1, 0.15, 0.2) | 1-19-22 | R-W | 91.52 | 96,777.90 | 904,690.77 |
| | (0.075, 0.15, 0.225, 0.3) | 1-19-22 | R-W | 91.52 | 96,777.90 | 910,426.03 |