Article

A Reinforcement Learning Hyper-Heuristic with Cumulative Rewards for Dual-Peak Time-Varying Network Optimization in Heterogeneous Multi-Trip Vehicle Routing

College of Transportation Engineering, Dalian Maritime University, Dalian 116026, China
* Author to whom correspondence should be addressed.
Algorithms 2025, 18(9), 536; https://doi.org/10.3390/a18090536
Submission received: 17 July 2025 / Revised: 20 August 2025 / Accepted: 21 August 2025 / Published: 22 August 2025

Abstract

Urban logistics is complicated by traffic congestion, fleet heterogeneity, warehouse constraints, and driver workload balancing, challenges that converge in the Heterogeneous Multi-Trip Vehicle Routing Problem with Time Windows and Time-Varying Networks (HMTVRPTW-TVN). We develop a mixed-integer linear programming (MILP) model with dual-peak time discretization and exact linearization for heterogeneous fleet coordination. Given the problem's NP-hardness, we propose a Hyper-Heuristic based on Cumulative Reward Q-Learning (HHCRQL), which integrates reinforcement learning with heuristic operators in a Markov Decision Process (MDP). The algorithm dynamically selects operators using a four-dimensional state space and a cumulative reward function combining timestep and fitness. Experiments show that, for small instances with more than 15 customer nodes, HHCRQL achieves solutions within 3% of Gurobi's optimum, outperforming Large Neighborhood Search (LNS) and LNS with Simulated Annealing (LNSSA) with shorter, more stable runtimes. For large-scale instances, HHCRQL reduces gaps by up to 9.17% versus Iterated Local Search (ILS), 6.74% versus LNS, and 5.95% versus LNSSA, while maintaining relatively stable runtime. In a real-world validation on Shanghai logistics data, HHCRQL reduces waiting time by 35.36% and total transportation time by 24.68%, confirming its effectiveness, robustness, and scalability.
Keywords: routing; multi-trip; time-varying road networks; heterogeneous fleets; Q-learning; hyper-heuristic
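
The abstract describes operator selection via Q-learning over a four-dimensional state space, with a cumulative reward that combines the timestep and the fitness improvement. The following Python fragment is a minimal sketch of such a selection-and-update loop under stated assumptions: the operator names, state encoding, transition, and reward weighting are illustrative placeholders, not the authors' implementation.

import random

# Low-level heuristic operators the hyper-heuristic chooses among (names assumed).
OPERATORS = ["swap", "relocate", "two_opt", "cross_exchange"]

def select_operator(q_table, state, epsilon=0.1):
    """Epsilon-greedy choice of a low-level operator for the given state."""
    if random.random() < epsilon:
        return random.randrange(len(OPERATORS))
    q_row = q_table.setdefault(state, [0.0] * len(OPERATORS))
    return max(range(len(OPERATORS)), key=lambda a: q_row[a])

def cumulative_reward(step, old_fitness, new_fitness):
    """Reward combining timestep and fitness improvement, per the abstract.
    The exact weighting between the two terms is an assumption."""
    improvement = old_fitness - new_fitness  # minimization: positive is better
    return improvement + 1.0 / (1 + step)

def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Standard one-step Q-learning update."""
    q_row = q_table.setdefault(state, [0.0] * len(OPERATORS))
    next_row = q_table.setdefault(next_state, [0.0] * len(OPERATORS))
    q_row[action] += alpha * (reward + gamma * max(next_row) - q_row[action])

if __name__ == "__main__":
    q = {}
    state, fitness = (0, 0, 0, 0), 100.0  # 4-dimensional state (encoding assumed)
    for step in range(50):
        a = select_operator(q, state)
        new_fitness = fitness - random.random()  # stand-in for applying OPERATORS[a]
        r = cumulative_reward(step, fitness, new_fitness)
        next_state = (step % 4, a, int(new_fitness) % 4, 0)  # placeholder transition
        q_update(q, state, a, r, next_state)
        state, fitness = next_state, new_fitness

In the paper's setting, the placeholder transition and fitness stand-in would be replaced by actually applying the chosen operator to the current HMTVRPTW-TVN solution and re-evaluating its objective.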

Share and Cite

MDPI and ACS Style

Wang, X.; Li, N.; Jin, X. A Reinforcement Learning Hyper-Heuristic with Cumulative Rewards for Dual-Peak Time-Varying Network Optimization in Heterogeneous Multi-Trip Vehicle Routing. Algorithms 2025, 18, 536. https://doi.org/10.3390/a18090536

AMA Style

Wang X, Li N, Jin X. A Reinforcement Learning Hyper-Heuristic with Cumulative Rewards for Dual-Peak Time-Varying Network Optimization in Heterogeneous Multi-Trip Vehicle Routing. Algorithms. 2025; 18(9):536. https://doi.org/10.3390/a18090536

Chicago/Turabian Style

Wang, Xiaochuan, Na Li, and Xingchen Jin. 2025. "A Reinforcement Learning Hyper-Heuristic with Cumulative Rewards for Dual-Peak Time-Varying Network Optimization in Heterogeneous Multi-Trip Vehicle Routing" Algorithms 18, no. 9: 536. https://doi.org/10.3390/a18090536

APA Style

Wang, X., Li, N., & Jin, X. (2025). A Reinforcement Learning Hyper-Heuristic with Cumulative Rewards for Dual-Peak Time-Varying Network Optimization in Heterogeneous Multi-Trip Vehicle Routing. Algorithms, 18(9), 536. https://doi.org/10.3390/a18090536

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
