Multi-Agent Deep Reinforcement Learning Cooperative Control Model for Autonomous Vehicle Merging into Platoon in Highway
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This paper explores the significant technical challenges of integrating autonomous vehicles (AVs) into platoons. It presents a multi-agent deep reinforcement learning (MA-DRL) model for control coordination. The goal of the model is to achieve collaborative optimization of platoon movements, in both the longitudinal and lateral-longitudinal senses, by controlling platoon acceleration, vehicle acceleration, and lateral steering angles. The article is well structured and makes for an interesting read. The strength of the study lies in its clear description of a multi-agent deep reinforcement learning control model that integrates the MAMQPPO algorithm and the Actor-Critic network, aiming to develop more accurate and efficient strategies for autonomous vehicles (AVs) and platoons. It is also worth noting that the study compares the proposed approach with other methods such as TD3, DDPG, and SAC. As a result, MAMQPPO has been shown to excel in solving the problem of AV merging, achieving high performance and low energy consumption. However, there are some limitations to the article, primarily related to a lack of clarity in defining the task within the broader context of managing an ensemble of unmanned vehicles. From the problem statement, it is unclear how the contribution of merging efficiency relates to other important aspects of the studied transportation system, such as travel time, road safety, and traffic flow harmonization. Additionally, the study assumes a homogeneous agent environment, in which only autonomous vehicles (AVs) are present. However, in reality, urban traffic systems also include other agents, such as pedestrians and human-driven vehicles, which can interfere with AVs and hinder their integration into platoons. Furthermore, traffic congestion and adverse weather conditions can significantly slow down traffic and cause delays. Another limitation of the study is that it lacks a comprehensive scenario design, making it difficult to determine the stability of the results in the face of varying external conditions. The following are specific recommendations for enhancing the paper:
1. The literature review could be enhanced. It is suggested to include a greater number of publications on multi-agent modeling of unmanned vehicle behavior, for instance, the following recent works:
[1] Akopov, A.S., Beklaryan, L.A. Agent-Based Modelling of Dynamics of Interacting Unmanned Ground Vehicles Using FLAME GPU. Program Comput Soft 50 (Suppl 2), S91–S103 (2024). https://doi.org/10.1134/S0361768824700464
[2] Karolemeas, C., Tsigdinos, S., Moschou, E. et al. Shared autonomous vehicles and agent based models: a review of methods and impacts. Eur. Transp. Res. Rev. 16, 25 (2024). https://doi.org/10.1186/s12544-024-00644-2
2. The introduction could be improved. It would be valuable to explain how the contribution of merging efficiency relates to other important aspects of the transportation system under study, such as travel time, road safety, and traffic flow harmonization.
3. As mentioned earlier, the study considers a homogeneous agent environment in which only autonomous vehicles (AVs) are present. However, in real-world traffic systems, there are other agents such as pedestrians and manned vehicles that can interfere with autonomous vehicles and hinder their integration into traffic streams. Additionally, traffic congestion and adverse weather conditions can significantly slow traffic and cause delays. It would be beneficial for the authors to consider several scenarios related to changes in the external environment and how these changes affect the main metrics of the system, such as merging time and average energy consumption, in their study. Therefore, it is essential to demonstrate that the results achieved are resilient to environmental influences.
Author Response
Dear reviewers:
Thank you sincerely for your careful review and valuable suggestions on this article! Your feedback is crucial for improving the quality of the paper. We have made revisions based on your comments, and our specific responses are as follows:
1. On supplementing the literature review: We have added references [6] and [7] in Section 1.1. Reference [6] provides technical grounding for the underlying modelling used in the algorithm design of this paper, and reference [7] is a review of the social impact of shared autonomous vehicles (AVs) studied with multi-agent models, which further underlines the practical significance of this research for improving the efficiency of cooperative transport systems. See Section 1.1, third paragraph, in blue.
2. On the relevance of merging efficiency in the introduction: As noted in the first paragraph of the introduction, vehicle platooning improves road safety and saves energy, and the level of merging efficiency affects road-travel safety. We have therefore added the following to Section 1.3: “Moreover, efficient AV merging into a platoon not only reduces vehicle merging time but also enhances road safety and promotes coordinated traffic flow.” See Section 1.3, seventh paragraph, in blue.
3. On robustness verification under heterogeneous traffic environments and external conditions:
At this stage, the paper focuses on validating the core algorithms in a structured motorway environment under good conditions, and it temporarily sets aside the effects of adverse weather and of pedestrians in urban traffic. See Section 2.1, first paragraph, in blue.
At the same time, your comments are well taken, and we have added new content on this point. Future work will focus on: 1) integrating real platoon traffic data to enhance generalization capabilities in complex scenarios; 2) developing a multi-vehicle parallel merging mechanism based on game theory or swarm reinforcement learning to optimize global decision-making efficiency; 3) incorporating weather disturbance factors and robust control modules to improve model reliability in extreme environments. See Section 5, third paragraph, in blue.
Thank you once again for your professional advice!
The above modifications have been verified through simulation experiments and updated in the manuscript, significantly enhancing the paper's completeness and practical value. If you have any other suggestions, we are willing to make further adjustments.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
Multi-agent Deep Reinforcement Learning Cooperative Control Model for Autonomous Vehicle Merging into Platoon in Highway
General Comments
The manuscript presents a novel multi-agent deep reinforcement learning (MA-DRL) cooperative control model for autonomous vehicle (AV) merging into platoons on highways. The proposed model integrates maximum Q-value proximal policy optimization (MAMQPPO) and a partially decoupled reward function (PD-Reward) to optimize merging efficiency, energy consumption, and safety. While the study addresses an important challenge in intelligent transportation systems, several aspects require improvement before it can be considered for publication.
Abstract
The abstract is overloaded with technical jargon, making it difficult to follow. Key terms such as “multi-agent deep reinforcement learning” and “proximal policy optimization” are introduced without sufficient context, which may confuse readers unfamiliar with the field. These concepts should be briefly defined, and their relevance to AV platoon merging should be explained. Additionally, the abstract is repetitive, using phrases like “optimization” and “efficiency” multiple times without clearly distinguishing their meaning in different contexts. Instead of broadly stating that the proposed method improves merging efficiency, energy consumption, and safety, the authors should specify by how much and in comparison to which baseline methods.
Introduction.
The first sentence could also be restructured to be more direct. For example, instead of stating, “With the advancement of autonomous driving and vehicle-to-everything (V2X) technologies, platooning has emerged as a frontier in intelligent transportation system research,” a more precise version would be: “Autonomous driving and vehicle-to-everything (V2X) technologies enable vehicle platooning, which enhances traffic efficiency and safety. However, integrating autonomous vehicles into existing platoons remains a challenge due to coordination complexities.”
The introduction also lacks a clear structure. It jumps between concepts without a logical flow, making it difficult to follow. The transition from traditional methods to learning-based approaches should be smoother, explaining why classical control methods struggle with non-linearity before introducing deep reinforcement learning as a solution.
Literature Review.
I think it would be good to separate the introduction from the literature review.
The literature review presents many references but lacks a critical discussion of their limitations. Instead of listing previous studies, the authors should analyze their shortcomings and explicitly justify why the proposed approach is necessary. For example, several studies using PID, MPC, and reinforcement learning are cited, but their performance in similar merging scenarios is not compared. When referencing Gaagai et al.’s distributed controller and Tapli et al.’s cooperative adaptive cruise control (CCAC) algorithm, the weaknesses of these methods—such as computational complexity or instability under dynamic conditions—should be discussed. Additionally, some citations are outdated or do not seem directly relevant to the study. Instead of referencing general deep learning papers, the authors should link them specifically to the challenges of AV platoon merging. A table summarizing key differences between classical, optimization-based, and learning-based methods would improve readability and highlight the novelty of the proposed work.
Methodology (Problem formulation)
The methodology section lacks clarity in key areas. The problem formulation introduces a Markov Decision Process (MDP) but does not clearly define the state and action spaces. It is unclear how environmental uncertainty is handled—are stochastic disturbances considered? The description of the MAMQPPO algorithm is overly technical without sufficient explanation for a broader audience. When introducing the PD-Reward function, the authors state that it “reduces learning complexity and accelerates convergence,” but no theoretical justification or empirical evidence is provided. The reward function is described as “multi-dimensional,” but the explanation of how rewards are assigned and weighted remains vague. The weighting coefficients should be explicitly defined and justified based on experiments or previous literature.
The notation in Equations (1)–(4) is inconsistent, sometimes using subscripts and other times omitting them. For example, r = w_s × r_s + w_c × r_c could instead be written as r = w_safety × r_safety + w_comfort × r_comfort.
Additionally, there are multiple occurrences of “[Error! Reference source not found.],” indicating missing citations or figure references. These issues must be corrected before submission.
Experimental Analysis
The experimental setup lacks critical details. The study mentions using a highway simulation but does not specify the simulator or its validation process. How do the traffic conditions compare to real-world scenarios? A clearer explanation of lane configurations, vehicle positions, and initial conditions is needed. The choice of baseline algorithms (DDPG, TD3, SAC, PPO) is reasonable, but the authors do not justify why these were selected. Why were other multi-agent reinforcement learning approaches, such as MADDPG, not considered?
The evaluation metrics focus on merging time, success rate, and energy consumption, but computational efficiency is not analyzed. Given that MAMQPPO is a hierarchical approach, how does its training time compare to single-agent methods? The claim that “the AV safely merges into the platoon at a speed of 22.6m/s within 13.5s” needs further discussion—how do these values compare to human-driven vehicles or other control policies? Additionally, many figures are referenced incorrectly, making it difficult to follow the results. All figure captions should be descriptive, and trends in the data should be clearly explained.
Conclusion and discussion (missing).
A discussion section is mandatory, and this manuscript lacks it.
The conclusion overstates the impact of the proposed approach without sufficient supporting evidence. The authors claim that the method “effectively improves efficiency and reduces energy consumption while ensuring safety,” but improvements are not quantified compared to existing methods. The discussion should be more nuanced, acknowledging potential limitations, such as high computational demands or sensitivity to hyperparameter tuning.
Additionally, the conclusion does not mention possible future research directions. Would integrating real-world vehicle data improve model generalization? How could transfer learning be applied to reduce the training burden? Ending with an open-ended discussion on these aspects would strengthen the study’s contribution.
Specific comments
The references do not follow the template of the journal
The manuscript contains several grammatical errors and awkward phrasing. For example, the sentence “the model achieves collaborative optimization of longitudinal platoon movements and AV lateral-longitudinal motions through integrated control” is difficult to parse and should be rephrased for clarity. Some technical terms are used inconsistently, such as switching between “multi-agent deep reinforcement learning” and “multi-agent DRL.” Consistency should be maintained throughout the text.
Additionally, the figures are poorly formatted, with some missing labels or unclear axis descriptions. The authors should ensure that all figures are referenced in order and provide a detailed legend where necessary.
Comments on the Quality of English Language
The manuscript contains several grammatical errors and awkward phrasing. For example, the sentence “the model achieves collaborative optimization of longitudinal platoon movements and AV lateral-longitudinal motions through integrated control” is difficult to parse and should be rephrased for clarity. Some technical terms are used inconsistently, such as switching between “multi-agent deep reinforcement learning” and “multi-agent DRL.” Consistency should be maintained throughout the text.
Author Response
Dear reviewers:
I sincerely thank you for your careful review and valuable comments on this paper! The issues you have pointed out are extremely constructive and crucial to improving the quality of the paper. We have carefully revised the paper point by point, and we report the revisions as follows:
1. Abstract: The abstract is overloaded with technical jargon, making it difficult to follow. Key terms such as “multi-agent deep reinforcement learning” and “proximal policy optimization” are introduced without sufficient context, which may confuse readers unfamiliar with the field. These concepts should be briefly defined, and their relevance to AV platoon merging should be explained. Additionally, the abstract is repetitive, using phrases like “optimization” and “efficiency” multiple times without clearly distinguishing their meaning in different contexts. Instead of broadly stating that the proposed method improves merging efficiency, energy consumption, and safety, the authors should specify by how much and in comparison to which baseline methods.
1) We have added explanations of the terminology in the abstract: the developed MA-DRL architecture enables coordinated learning among multiple autonomous agents, addressing the multi-objective coordination challenge through synchronized control of platoon longitudinal acceleration, AV steering, and AV acceleration. The proposed MAMQPPO extends the multi-agent PPO algorithm (a policy gradient method ensuring stable policy updates) by incorporating maximum Q-value action selection for platoon gap control and discrete command generation. See Section 3.3.1, second paragraph, in blue. A minimal sketch of the gap-selection idea is given below.
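The following sketch illustrates, under our own simplifying assumptions, how maximum Q-value selection of a discrete platoon gap can sit alongside a PPO-style continuous policy; the function and tensor names are hypothetical placeholders, not the manuscript's actual code.

```python
# Hypothetical sketch: the Platoon-Actor scores each candidate merging gap and
# the gap with the maximum estimated value becomes the discrete command, while
# continuous controls (steering, acceleration) remain with the PPO-style policy.
import torch

def select_platoon_gap(gap_value_head: torch.nn.Module,
                       platoon_obs: torch.Tensor) -> int:
    """Return the index of the candidate gap with the highest estimated value."""
    with torch.no_grad():
        q_values = gap_value_head(platoon_obs)   # shape: [num_candidate_gaps]
    return int(torch.argmax(q_values).item())    # discrete gap command
```

In this reading, the discrete gap index is chosen greedily while the continuous actions are still sampled from the clipped-PPO policy.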
2) Quantitative evaluation metrics: To the best of our knowledge, this paper may be the first to propose a unified framework for the problem of a new autonomous vehicle merging into an existing platoon; existing research mainly addresses individual aspects (such as vehicle lateral control, trajectory planning, or longitudinal platoon control). We have therefore added a comparative experiment against the commonly used quintic polynomial + PID control + CACC approach in dense traffic scenarios. The results show that the proposed method reduces merging time by 37.69% (12.4 s vs. 19.9 s) and energy consumption by 58% (3.56 kWh vs. 8.47 kWh). See Section 4.2, fourth and fifth paragraphs, in blue.
- Introduction.
The first sentence could also be restructured to be more direct. For example, instead of stating, “With the advancement of autonomous driving and vehicle-to-everything (V2X) technologies, platooning has emerged as a frontier in intelligent transportation system research,” a more precise version would be: “Autonomous driving and vehicle-to-everything (V2X) technologies enable vehicle platooning, which enhances traffic efficiency and safety. However, integrating autonomous vehicles into existing platoons remains a challenge due to coordination complexities.”
The introduction also lacks a clear structure. It jumps between concepts without a logical flow, making it difficult to follow. The transition from traditional methods to learning-based approaches should be smoother, explaining why classical control methods struggle with non-linearity before introducing deep reinforcement learning as a solution.
Literature Review.
I think it would be good to separate the introduction from the literature review.
The literature review presents many references but lacks a critical discussion of their limitations. Instead of listing previous studies, the authors should analyze their shortcomings and explicitly justify why the proposed approach is necessary. For example, several studies using PID, MPC, and reinforcement learning are cited, but their performance in similar merging scenarios is not compared. When referencing Gaagai et al.’s distributed controller and Tapli et al.’s cooperative adaptive cruise control (CCAC) algorithm, the weaknesses of these methods—such as computational complexity or instability under dynamic conditions—should be discussed. Additionally, some citations are outdated or do not seem directly relevant to the study. Instead of referencing general deep learning papers, the authors should link them specifically to the challenges of AV platoon merging. A table summarizing key differences between classical, optimization-based, and learning-based methods would improve readability and highlight the novelty of the proposed work.
1) Following your comments, we have restructured the introduction and refined the literature review (1.1 Platoon Longitudinal Control, 1.2 Single-Autonomous-Vehicle Merging Control, 1.3 Single-Autonomous-Vehicle Trajectory Planning). To make the article more logical, we have reorganised the language and added transitional sentences, and we have added a table (Table 1) summarising the literature for easier reading.
2) Following your suggestions, we have summarized the limitations of the cited literature in the introduction and added a table to enhance readability.
3. Methodology (Problem formulation)
The methodology section lacks clarity in key areas. The problem formulation introduces a Markov Decision Process (MDP) but does not clearly define the state and action spaces. It is unclear how environmental uncertainty is handled—are stochastic disturbances considered? The description of the MAMQPPO algorithm is overly technical without sufficient explanation for a broader audience. When introducing the PD-Reward function, the authors state that it “reduces learning complexity and accelerates convergence,” but no theoretical justification or empirical evidence is provided. The reward function is described as “multi-dimensional,” but the explanation of how rewards are assigned and weighted remains vague. The weighting coefficients should be explicitly defined and justified based on experiments or previous literature.
The notation in Equations (1)–(4) is inconsistent, sometimes using subscripts and other times omitting them. For example r=ws×rs+wc×rc and could be r=wsafety×rsafety+wcomfort×rcomfort
Additionally, there are multiple occurrences of “[Error! Reference source not found.],” indicating missing citations or figure references. These issues must be corrected before submission.
1) We have added a section on modeling the Markov decision process (see Section 2.1, second paragraph, in blue). Additionally, this paper proposes the MAMQPPO algorithm, which is an improvement over the PPO algorithm. PPO maintains policy exploration intrinsically through entropy regularization; its core principle is the introduction of an entropy reward term in the optimization objective to dynamically adjust the flatness of the policy distribution (a standard form of this objective is shown after the reference list below). PPO therefore already has some capability to handle perturbations; detailed theoretical background can be found in the literature: [Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O., 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.]
[Lee, H.K. and Yoon, S.W., Flat Reward in Policy Parameter Space Implies Robust Reinforcement Learning. In The Thirteenth International Conference on Learning Representations.]
[Zhang, H., Chen, H., Xiao, C., Li, B., Liu, M., Boning, D. and Hsieh, C.J., 2020. Robust deep reinforcement learning against adversarial perturbations on state observations. Advances in Neural Information Processing Systems, 33, pp.21024-21037.]
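For reference, the entropy-regularized clipped surrogate objective from Schulman et al. (2017), which the explanation above relies on, can be written as follows (standard notation from that paper, not the manuscript's own symbols):

```latex
% Clipped PPO surrogate with value loss and entropy bonus (Schulman et al., 2017)
\begin{aligned}
L^{\mathrm{CLIP+VF+S}}(\theta) &= \hat{\mathbb{E}}_t\!\left[
  L^{\mathrm{CLIP}}_t(\theta) - c_1\, L^{\mathrm{VF}}_t(\theta)
  + c_2\, S[\pi_\theta](s_t) \right],\\
L^{\mathrm{CLIP}}_t(\theta) &= \min\!\Big( r_t(\theta)\hat{A}_t,\;
  \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\hat{A}_t \Big),
\qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
\end{aligned}
```

where S denotes the policy entropy, so the coefficient c_2 controls how strongly exploration (a flatter action distribution) is rewarded.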
2) We have added a new paragraph describing the technical route of the MAMQPPO algorithm (see Section 3.3, fourth paragraph, in blue), along with a detailed description of the construction of the PD-Reward function in Section 2.3, first paragraph, in blue.
3) We have carefully corrected the formula subscripts and all occurrences of “[Error! Reference source not found.]” (missing citations and figure references) to ensure proper submission formatting.
4) In the experimental section, we have added a detailed statement of the lane configuration, vehicle positions, initial conditions, and related settings. See Section 4.1, first paragraph, in blue.
5. Experimental Analysis
The experimental setup lacks critical details. The study mentions using a highway simulation but does not specify the simulator or its validation process. How do the traffic conditions compare to real-world scenarios? A clearer explanation of lane configurations, vehicle positions, and initial conditions is needed. The choice of baseline algorithms (DDPG, TD3, SAC, PPO) is reasonable, but the authors do not justify why these were selected. Why were other multi-agent reinforcement learning approaches, such as MADDPG, not considered?
The evaluation metrics focus on merging time, success rate, and energy consumption, but computational efficiency is not analyzed. Given that MAMQPPO is a hierarchical approach, how does its training time compare to single-agent methods? The claim that “the AV safely merges into the platoon at a speed of 22.6m/s within 13.5s” needs further discussion—how do these values compare to human-driven vehicles or other control policies? Additionally, many figures are referenced incorrectly, making it difficult to follow the results. All figure captions should be descriptive, and trends in the data should be clearly explained.
1) The baseline reinforcement learning algorithms selected in this paper are those most widely used in current intelligent transportation research. Regarding the MADDPG algorithm you mentioned: because the DDPG-based approach uses a global evaluation network, it cannot accommodate the partially decoupled PD-Reward function designed in this paper.
2) We have reorganized the experimental analysis figures and text; see Section 4.1, in blue.
6. Conclusion and discussion (missing).
A discussion section is mandatory, and this manuscript lacks it.
The conclusion overstates the impact of the proposed approach without sufficient supporting evidence. The authors claim that the method “effectively improves efficiency and reduces energy consumption while ensuring safety,” but improvements are not quantified compared to existing methods. The discussion should be more nuanced, acknowledging potential limitations, such as high computational demands or sensitivity to hyperparameter tuning.
Additionally, the conclusion does not mention possible future research directions. Would integrating real-world vehicle data improve model generalization? How could transfer learning be applied to reduce the training burden? Ending with an open-ended discussion on these aspects would strengthen the study’s contribution.
1) In response to your question, we have added an experiment using the existing quintic polynomial + PID + CACC method to compare with our proposed model and analyse the data in detail. (See Section 4.2, in blue, for details.)
2) We designed several sets of experiments for the learning rate α and found that the best convergence was achieved at α = 0.03, which we therefore adopt in this paper:
Learning rate α: When α = 0.03, the model reaches peak returns (>1.2 × 10⁵) within 1.5 × 10³ steps, giving the fastest convergence but with ±8% fluctuations late in training; α = 0.003 requires 2 × 10³ steps to stabilise at 1.0 × 10⁵, converging 37% more slowly but with better stability; and α = 0.0003 is the least efficient, reaching only 8 × 10⁴ after 3 × 10³ steps. The experiments show that a high learning rate (α = 0.03) reinforces early exploration through fast gradient updates and suits the real-time demands of platoon control (e.g., emergency lane changing), although stability risks must be balanced, while a low learning rate (α = 0.003) suits long-term strategy optimisation. In this paper, we choose α = 0.03 to prioritise the real-time control objective.
3) We have added new Future Directions content: The limitations of this study are threefold: first, model validation is based solely on simulated highway scenarios and does not cover the complex dynamic environment of urban traffic; second, multi-vehicle coordinated merging uses a phased serial strategy, lacking parallel decision optimization, which limits reorganization efficiency; third, it does not consider the interference of adverse weather conditions such as rain and snow on sensors. Future work will focus on: 1) integrating real platoon traffic data to enhance generalization capabilities in complex scenarios; 2) developing a multi-vehicle parallel merging mechanism based on game theory or swarm reinforcement learning to optimize global decision-making efficiency; 3) incorporating weather disturbance factors and robust control modules to improve model reliability in extreme environments.
7. Specific comments
The references do not follow the template of the journal
The manuscript contains several grammatical errors and awkward phrasing. For example, the sentence “the model achieves collaborative optimization of longitudinal platoon movements and AV lateral-longitudinal motions through integrated control” is difficult to parse and should be rephrased for clarity. Some technical terms are used inconsistently, such as switching between “multi-agent deep reinforcement learning” and “multi-agent DRL.” Consistency should be maintained throughout the text.
Additionally, the figures are poorly formatted, with some missing labels or unclear axis descriptions. The authors should ensure that all figures are referenced in order and provide a detailed legend where necessary.
1) We have carefully corrected the references, figure numbering, and related details to ensure proper submission formatting. At the same time, we have made the technical terminology consistent throughout the paper, using the full term at first mention and the abbreviation thereafter.
Thank you once again for your professional advice!
The above modifications have been verified through simulation experiments and updated in the manuscript, significantly enhancing the paper's completeness and practical value. If you have any other suggestions, we are willing to make further adjustments.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
The manuscript emphasizes the development of a new model for optimizing platoon maneuvers integrated with autonomous vehicles in traffic.
To address this topic, the manuscript attempts to provide a comprehensive introduction, but it still lacks consistency. The introduction does not account for the different types of models used or the improvements provided by each of the mentioned models; it merely lists all the models. This aspect is crucial for establishing the state of the art and the possible improvements offered by the proposed manuscript. Therefore, a better literature review is strongly required.
- Lines 105–112 are a perfect synthesis of what is required and what is currently present; however, they do not seem well linked to the cited articles or to the way those articles have been presented.
The goal of the research could have been presented by discussing each gap together with the corresponding contribution, rather than listing the three gaps and then the three research contributions separately.
Another important aspect to tackle is the connection between EVs and AVs. The authors begin discussing EVs seemingly out of nowhere, and then EVs disappear again. Is there a connection the authors want to emphasize?
The rationale behind the choice of the TTC for safety reward and for the other rewards might be better explained and detailed.
The entire Chapter 3 is the core part of the manuscript. It could be more precise and clearer. Some parts are confusing and the description is not linear. Please adopt a more reader-friendly approach in the descriptions: not all readers will be experts, so the concepts must be presented in an accessible way.
Figures must be renumbered in section 4.
The safety features introduced in the description of the model are not accounted for in the results. How did you define safety benefits based on TTC?
The overall structure of the manuscript could be better designed to provide a clearer understanding of the idea behind it. Section 4 is another key part, but it is not well introduced, so its importance does not come across. Not all the details of the simulation are presented.
The annexes are referred to as “Attachments” in the manuscript. The naming must be consistent with the citations.
Author Response
Dear reviewers:
I sincerely thank you for your careful review and valuable comments on this paper! The issues you have pointed out are extremely constructive and crucial to improving the quality of the paper. We have carefully revised the paper point by point, and the main changes are reported as follows:
1. To address this topic, the manuscript attempts to provide a comprehensive introduction, but it still lacks consistency. The introduction does not account for the different types of models used or the improvements provided by each of the mentioned models; it merely lists all the models. This aspect is crucial for establishing the state of the art and the possible improvements offered by the proposed manuscript. Therefore, a better literature review is strongly required.
Lines 105–112 are a perfect synthesis of what is required and what is currently present; however, they do not seem well linked to the cited articles or to the way those articles have been presented.
The goal of the research could have been presented by discussing each gap together with the corresponding contribution, rather than listing the three gaps and then the three research contributions separately.
1) To make the presentation of the article clearer, we have restructured the introduction for easier reading and added a table summarising the literature to improve readability.
2. Another important aspect to tackle is the connection between EVs and AVs. The authors begin discussing EVs seemingly out of nowhere, and then EVs disappear again. Is there a connection the authors want to emphasize?
1) We have added an elaboration of the link between electric vehicles and self-driving cars (Section 2.1, first paragraph, in blue): Thanks to the deep compatibility between EV drive systems and autonomous driving architectures, they can more efficiently integrate key hardware components such as high-precision environmental perception sensors, real-time decision-making units, and by-wire actuators, providing reliable technical support for intelligent control in complex traffic scenarios.
3. The rationale behind the choice of the TTC for the safety reward and for the other rewards might be better explained and detailed.
Thank you for your interest in the reward function design. Time to Collision (TTC) was chosen as the core metric for the safety reward because of its combined advantages of dynamic risk quantification, compatibility with multi-objective optimisation, and engineering interpretability. By fusing relative distance and relative velocity (TTC = Δd/Δv), TTC directly characterises the urgency of a potential collision in the time dimension and adapts well to the safety requirements of high-speed scenarios; a minimal illustrative sketch is given below. The advantages of TTC as a safety reward function are described in detail in [Zhao, R., Chen, Z., Fan, Y., Li, Y. and Gao, F., 2024. Towards robust decision-making for autonomous highway driving based on safe reinforcement learning. Sensors, 24(13), p.4140.]
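As a minimal sketch of the idea (our simplified illustration, not the exact PD-Reward used in the manuscript; the 4 s threshold is a hypothetical placeholder):

```python
# Illustrative TTC-based safety penalty: TTC = Δd / Δv using the closing speed;
# no penalty when the time to collision exceeds a threshold, and an increasing
# penalty (down to -1) as the TTC shrinks toward zero.
def time_to_collision(rel_distance_m: float, closing_speed_mps: float) -> float:
    """Return Δd / Δv, or infinity when the vehicles are not closing in."""
    if closing_speed_mps <= 0.0:
        return float("inf")
    return rel_distance_m / closing_speed_mps

def safety_reward(rel_distance_m: float, closing_speed_mps: float,
                  ttc_threshold_s: float = 4.0) -> float:
    ttc = time_to_collision(rel_distance_m, closing_speed_mps)
    if ttc >= ttc_threshold_s:
        return 0.0
    return -(ttc_threshold_s - ttc) / ttc_threshold_s  # bounded in [-1, 0)
```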
4. The entire Chapter 3 is the core part of the manuscript. It could be more precise and clearer. Some parts are confusing and the description is not linear. Please adopt a more reader-friendly approach in the descriptions: not all readers will be experts, so the concepts must be presented in an accessible way.
1) We have refined the chapter titles and described the concepts and algorithms in a more accessible way. In Section 3.1.1 we detail the model's inputs, outputs, and technical route, and the training process is explained in detail in the fourth paragraph of Section 3.3.1.
5. Figures must be renumbered in section 4.
1) We have restructured the experimental analysis figures and text (see Sections 4.1 and 4.2, in blue) and added a new comparative experiment against the existing quintic polynomial + PID + CACC method to compare with our proposed model, with detailed data analysis.
6. The safety features introduced in the description of the model are not accounted for in the results. How did you define safety benefits based on TTC?
6) Safety in the model is captured by the inter-vehicle spacing within the platoon, and its safety benefit is integrated into the AV's reward function; the details are given in the blue text of Part A in Section 2.3.
Thank you once again for your professional advice!
The above modifications have been verified through simulation experiments and updated in the manuscript, significantly enhancing the paper's completeness and practical value. If you have any other suggestions, we are willing to make further adjustments.
Author Response File: Author Response.pdf
Reviewer 4 Report
Comments and Suggestions for Authors
This study presents a multi-agent reinforcement learning based control model for autonomous vehicles joining convoys, and it provides higher efficiency and energy savings compared to existing approaches. Although the study is innovative and well structured, it will make a stronger contribution if the deficiencies listed below are addressed.
Reviewer Decision:
Major Revision
Suggested Steps for Revision:
- The differences of the MAMQPPO algorithm from previous studies should be clearly stated.
- Hyperparameter selections and the training process should be explained in more detail.
- Experimental results should be analyzed in more depth, especially in terms of success rate, energy consumption and the effect of the number of vehicles in the convoy.
- A discussion on how to validate the model with real-world data should be added.
- The obtained results should be discussed in detail. In addition, the limitations and deficiencies of the study should be emphasized and suggestions for future studies should be made.
- All figure and table references should be corrected and missing explanations should be completed.
It would be appropriate to re-evaluate after revision.
Author Response
Dear reviewers:
I sincerely thank you for your careful review and valuable comments on this paper! The issues you have pointed out are extremely constructive and crucial to improving the quality of the paper. We have carefully revised the paper point by point, and now we would like to report the revisions as follows:
1. The differences of the MAMQPPO algorithm from previous studies should be clearly stated.
1) Algorithmic differences: We describe the structure of the algorithm and its technical route in detail in Section 3.3.1, in blue. Compared with previous algorithms, our method adopts a two-layer network design: the proposed MAMQPPO algorithm consists of two Actor networks and two Critic networks, where the AV-Actor and the Platoon-Actor are the two Actor networks. The AV-Actor is a seven-layer network, including an input layer, three hidden layers (with 64, 128, and 64 neurons, respectively), and an output layer, utilizing ReLU and Tanh activation functions to enhance nonlinear processing. The Platoon-Actor incorporates the Max-Value function (MaxQ) from the DQN network at the output layer to select the maximum-value platoon gap and actions. A minimal sketch of the AV-Actor structure is given below.
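The following is a minimal sketch of the AV-Actor structure as described above, assuming PyTorch; the observation and action dimensions are hypothetical placeholders rather than the values used in the manuscript.

```python
# Sketch of the AV-Actor: input -> three hidden layers (64, 128, 64, ReLU)
# -> Tanh-bounded output for continuous steering/acceleration commands.
# obs_dim and act_dim are placeholder values for illustration only.
import torch
import torch.nn as nn

class AVActor(nn.Module):
    def __init__(self, obs_dim: int = 12, act_dim: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),   # hidden layer 1
            nn.Linear(64, 128), nn.ReLU(),       # hidden layer 2
            nn.Linear(128, 64), nn.ReLU(),       # hidden layer 3
            nn.Linear(64, act_dim), nn.Tanh(),   # bounded control outputs
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)
```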
2. Hyperparameter selections and the training process should be explained in more detail.
We designed multiple sets of experiments and set the learning rate hyperparameter to 0.03 as detailed below:
Learning rate α: When α = 0.03, the model reaches peak returns (>1.2 × 10⁵) within 1.5 × 10³ steps, giving the fastest convergence but with ±8% fluctuations late in training; α = 0.003 requires 2 × 10³ steps to stabilise at 1.0 × 10⁵, converging 37% more slowly but with better stability; and α = 0.0003 is the least efficient, reaching only 8 × 10⁴ after 3 × 10³ steps. The experiments show that a high learning rate (α = 0.03) reinforces early exploration through fast gradient updates and suits the real-time demands of platoon control (e.g., emergency lane changing), although stability risks must be balanced, while a low learning rate (α = 0.003) suits long-term strategy optimisation. In this paper, we choose α = 0.03 to prioritise the real-time control objective.
3. Experimental results should be analyzed in more depth, especially in terms of success rate, energy consumption and the effect of the number of vehicles in the convoy.
3) We have restructured the experimental analysis figures and text (see Sections 4.1 and 4.2, in blue) and added a new experiment with the existing quintic polynomial + PID method to compare with our proposed model, with detailed data analysis.
4. A discussion on how to validate the model with real-world data should be added.
4) Thank you for your suggestion. Our research focuses on models and algorithms for merging autonomous vehicles into a platoon; we mainly adopt the multi-agent deep reinforcement learning (MADRL) method and validate it through simulation. Since there is currently no publicly available dataset of real platoons, we constructed a simulation environment and evaluated the effectiveness of the algorithms on it. To better reflect real traffic conditions, and because SUMO has a significant advantage in simulating macroscopic traffic flow, we use SUMO for the environment construction; details are given in Section 4.1, in blue. A minimal sketch of such a SUMO-driven loop is shown below.
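As a minimal sketch of how a SUMO simulation can be stepped from Python via TraCI to feed a reinforcement learning loop (the configuration file name and vehicle id are hypothetical placeholders; this is not the authors' actual environment code):

```python
# Step a SUMO simulation via TraCI and read back vehicle observations.
import traci

traci.start(["sumo", "-c", "highway_merge.sumocfg"])  # launch SUMO with a config
try:
    for step in range(1000):
        traci.simulationStep()                         # advance one time step
        if "av_0" in traci.vehicle.getIDList():
            speed = traci.vehicle.getSpeed("av_0")     # observation for the agent
            # ...build the state, query the policy, then apply controls, e.g.:
            # traci.vehicle.setSpeed("av_0", desired_speed)
finally:
    traci.close()
```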
5. The obtained results should be discussed in detail. In addition, the limitations and deficiencies of the study should be emphasized and suggestions for future studies should be made.
5) We have analysed the experimental results in more depth; the limitations of the study are detailed in the last paragraph of Section 5, in blue.
6. All figure and table references should be corrected and missing explanations should be completed.
6) We made careful adjustments to the figures and tables, as well as the references, to ensure that the submission was formatted correctly.
Thank you once again for your professional advice!
The above modifications have been verified through simulation experiments and updated in the manuscript, significantly enhancing the paper's completeness and practical value. If you have any other suggestions, we are willing to make further adjustments.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The authors have made general improvements to the article based on feedback from reviewers. However, there are some areas where the design of formulas and figures could still be improved. The article is now suitable for publication after minor edits.
Author Response
Dear Reviewer,
Thank you for your valuable feedback. We have thoroughly revised the figures and tables according to the journal's template, converting them to PNG format to ensure higher resolution and clarity. In addition, we have addressed the labeling and formatting issues in figures and formulas as suggested. We hope that these revisions meet the journal's requirements and sincerely appreciate your guidance and suggestions.
Best regards
Reviewer 2 Report
Comments and Suggestions for Authors
The updated manuscript demonstrates significant improvements in structure, clarity, methodology, and experimental validation. The added discussion, comparative analysis, and definition of technical terms have substantially enhanced the quality and readability of the work. The issues related to missing references, reward function clarity, and figure formatting have also been adequately resolved.
Please format the manuscript according to the journal's template
Author Response
Dear Reviewer,
Thank you for your positive feedback on the revised manuscript. We are pleased to hear that the improvements in structure, methodology, and experimental validation have enhanced the work's quality. Regarding your request, we have carefully formatted the manuscript according to the journal's template, ensuring consistency in fonts, headings, citations, and reference styles. We appreciate your thorough review.
Best regards
Reviewer 3 Report
Comments and Suggestions for Authors
The manuscript has been improved significantly, addressing all the main comments. At the current stage, it can be considered for publication.
Author Response
Dear Reviewer,
We sincerely appreciate your positive feedback and your confirmation that the manuscript now meets publication standards. Thank you for your time, expertise, and constructive suggestions throughout the review process. We are honored by your recommendation and look forward to contributing to the academic community of the journal.
Best regards
Reviewer 4 Report
Comments and Suggestions for Authors
You did a good job. Accepted.
Author Response
Dear Reviewer,
Thank you very much for your kind feedback and acceptance of our manuscript. We sincerely appreciate your time and expertise throughout the review process. It is an honor to contribute to the journal and we look forward to future academic collaboration opportunities.
Best regards