Article

Multi-Agent Sensor Fusion Methodology Using Deep Reinforcement Learning: Vehicle Sensors to Localization

by
Túlio Oliveira Araújo
*,
Marcio Lobo Netto
and
João Francisco Justo
Sistemas Eletrônicos, Programa de Pós-Graduação em Engenharia Elétrica, Escola Politécnica da Universidade de São Paulo, São Paulo 05508-010, Brazil
*
Author to whom correspondence should be addressed.
Sensors 2026, 26(4), 1105; https://doi.org/10.3390/s26041105
Submission received: 9 November 2025 / Revised: 18 December 2025 / Accepted: 28 December 2025 / Published: 8 February 2026
(This article belongs to the Special Issue Cooperative Perception and Control for Autonomous Vehicles)

Abstract

Despite recent major advances in autonomous driving, several challenges remain. Even with modern advanced sensors and processing systems, vehicles are still unable to detect all possible obstacles present in complex urban settings and under diverse environmental conditions. Consequently, numerous studies have investigated artificial intelligence methods to improve vehicle perception capabilities. This paper presents a new methodology using a framework named CarAware, which fuses multiple types of sensor data to predict vehicle positions using Deep Reinforcement Learning (DRL). Unlike traditional DRL applications centered on control, this approach focuses on perception. As a case study, the PPO algorithm was used to train and evaluate the effectiveness of this methodology.

1. Introduction

Autonomous transportation has attracted significant attention in recent years from both academia and industry [1]. Technologies such as self-driving, 5G, edge/cloud analytics, artificial intelligence, and augmented/virtual reality are shaping future vehicles. Despite recent developments in sensing and processing hardware, fully reliable autonomous vehicles remain elusive, as evidenced by recent accidents involving semi-autonomous vehicles [2]. Particularly in complex traffic scenarios, the efficiency of sensor-based automated decisions is limited, requiring advanced artificial intelligence (AI) solutions, for instance for lane merging [3], highway platooning [4], and roundabout navigation [5].
Recent studies have shown that data exchange between vehicles, infrastructure, and the cloud (V2V, V2I, and V2C, respectively) can significantly enhance perception by enabling accurate detection and identification of the environment. Using probabilistic and neural networks, it is possible to identify and classify objects, building a comprehensive navigation map with information that can improve perception for all connected vehicles [6]. Deep reinforcement learning (DRL) is a machine learning technique that has been leveraged for several traffic and cooperative control tasks, such as DRL-based traffic signal control for emission reduction in cooperative vehicle-infrastructure systems [7], and a cooperative DRL-based control model for CAV and traffic signaling proposed in [8]. This paper introduces a novel approach to creating a shared online map/database, integrating sensor data from vehicles with a DRL agent, simulating a V2C scenario. The agent employs DRL to fuse multimodal sensor data and infer the positions of all vehicles. Unlike conventional DRL applications in autonomous vehicles, which focus on control, this work emphasizes perception and sensor fusion. Several case studies were conducted to evaluate the efficiency of this approach in accurately localizing all vehicles in multi-sensor scenarios.

2. Connected and Autonomous Vehicles Background

2.1. Vehicle Sensors

Sensors are essential for vehicle perception systems. Understanding their functions and the information they provide is crucial for designing autonomous vehicles and their data processing strategies. The most commonly used sensors are ultrasonic, radar, cameras, LiDAR, GNSS, IMU, steering angle sensors, and wheel odometry [9]. Multiple sensor types can be found within a single autonomous vehicle (AV), as illustrated in Figure 1.
To detect objects in the vehicle’s path, it is desirable to employ a combination of sensors, allowing each to compensate for possible limitations of the others. A comparison of the main features of the sensors most commonly used for AVs is provided in Table 1, adapted from [10,11,12]. Within the CarAware framework [13], all sensors mentioned in this section are implemented and ready for application in any training scenario. Some of these sensors include onboard pre-processing and are considered smart sensors.

2.2. Vehicular Connectivity

Even with state-of-the-art sensors and computational capabilities, autonomous vehicles face challenges in object detection, recognition, and situational awareness, especially due to environmental effects, operational regions, field-of-view occlusions, unpredictable motion, and other events. Each vehicle’s sensing area is inherently limited. Connecting an AV (thus forming a CAV) allows for the extension and enrichment of available information, based on the data of other vehicles and objects.
To enable this, certain connectivity technologies must be present in both vehicles and infrastructure. The main concepts include:
  • 5G: The fifth generation of mobile networks, offering data transfer speeds from 1 to 10 Gbps with latencies as low as 1 ms [14], thus supporting real-time applications.
  • Cloud computing: Connected devices can leverage online processing and storage resources.
  • V2X: Encompasses all vehicle communications with “anything” (vehicle-to-infrastructure, vehicle-to-vehicle, vehicle-to-device, vehicle-to-grid, and vehicle-to-cloud).
For interoperability, common standards are required. Thus, the SAE established standards defining message formats and protocols, such as the Basic Safety Message (BSM, SAE J2735) [15]. BSM is a broadcast message, sent up to ten times per second with each vehicle’s basic data (e.g., latitude, longitude, elevation, speed, and heading).
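To make the BSM contents concrete, the following Python sketch models the basic fields listed above; the class name, field names, and the `encode` helper are illustrative assumptions, not the actual SAE J2735 ASN.1 schema or encoding.

```python
from dataclasses import dataclass, asdict

# Hypothetical container mirroring the basic data carried by a BSM
# (latitude, longitude, elevation, speed, heading); field names are
# illustrative, not the official SAE J2735 schema.
@dataclass
class BasicSafetyMessage:
    vehicle_id: int
    latitude: float   # degrees
    longitude: float  # degrees
    elevation: float  # metres
    speed: float      # m/s
    heading: float    # degrees, clockwise from north

    def encode(self) -> dict:
        """Serialize the message for broadcast (sent up to 10 times per second)."""
        return asdict(self)

msg = BasicSafetyMessage(7, -23.5558, -46.7319, 760.0, 8.3, 92.0)
payload = msg.encode()
```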

2.3. Carla Simulator and CarAware Framework

CARLA is an open-source simulator for urban driving [16]. It is categorized as nanoscopic, but supports microscopic simulations as well. Its strengths include realistic vehicle and pedestrian dynamics, detailed urban environments and graphics, the ability to script scenarios, and support for a broad range of sensors.
Due to CARLA’s open and highly customizable nature—and the absence of manufacturer-provided user interfaces for rapid experiments—a new framework was developed: CarAware [13]. CarAware runs atop CARLA, harnessing its 3D simulation capabilities while adding a simplified top-down visualization and a HUD showing critical training metrics. Figure 2 shows the main interface. Code for both the simulation framework and PPO implementation is available at https://github.com/tulioaraujoMG/CarAware (accessed on 27 December 2025).

3. Deep Reinforcement Learning Background

3.1. Overview

Reinforcement Learning (RL) describes how animals (and, now, machines) learn by interacting with their environment through observations and actions, receiving rewards or penalties that reinforce desired behaviors. This principle is central to machine learning: agents interact with their environment via sensors and actuators, and their actions are evaluated by a reward function. Based on the reward, the agent learns which actions maximize cumulative reward, a process analogous to human learning. This technique is implemented in the framework to carry out the training for the proposed case study.
Advances in deep learning have transformed RL by leveraging neural networks to estimate the value of observation/action pairs (value-based methods) or to directly estimate the best decision policies (policy-based methods). As shown in [17], DRL algorithms are widely applied in areas such as robotics, autonomous control, communications, natural language processing (NLP), games, and computer vision. Tools exist to interpret the intrinsic “black box” behaviors of DRL models and validate their safety and efficiency [18].
With increased computational power, deep RL (DRL) methods have emerged, utilizing neural networks to approximate the large tables otherwise required for mapping state-action or state-value pairs.
According to OpenAI [19], DRL algorithms can be classified as depicted in Figure 3. The highest distinction is between model-free algorithms, which learn by directly interacting with the environment, and model-based algorithms, which learn or are given a model of the environment to facilitate planning [20].
Model-free algorithms fall into two main types: value-based (e.g., Q-Learning) and policy-based (e.g., Policy Optimization). Value-based algorithms focus on learning a value function that finds the most rewarding states and actions; policy-based algorithms directly learn the policy that yields the best rewards for each state. All DRL algorithms have a policy, but the focus of training differs: value-based on state/action values; policy-based on directly optimizing the policy.
In policy-based methods, rather than learning the value function, the policy is optimized directly to increase the likelihood of favorable outcomes. These methods can learn stochastic policies (probabilistic outputs), in contrast to the deterministic policies typically found in value-based approaches.
Some methods, such as actor–critic, combine both paradigms: the “actor” chooses actions according to the current policy (policy-based), while the “critic” evaluates these actions (value-based), providing feedback to improve performance.
PPO (Proximal Policy Optimization) [21] is a hybrid policy-based/value-based DRL algorithm, well-suited for continuous observation and action spaces. PPO combines actor and critic networks and improves training stability by constraining policy updates, avoiding overly large and possibly disruptive changes. PPO is efficient, easy to implement, and robust across diverse applications, which justifies its use in this work.
PPO applies incremental updates to the actor network, calculating the ratio between the current and previous policy probabilities using Equation (1),
$$ r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)} $$
where $\pi_\theta$ is the probability of taking action $a_t$ at state $s_t$ under the current policy, and $\pi_{\theta_{\mathrm{old}}}$ under the previous one. If $r_t(\theta)$ is greater than 1, the current policy is more likely to select action $a_t$ at state $s_t$; if between 0 and 1, the previous policy was more likely. The advantage function, typically estimated through GAE (Generalized Advantage Estimation), quantifies how much better a particular action performs relative to the average, as in Equation (2),
$$ \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad \hat{A}_t = \delta_t + (\gamma\lambda)\,\delta_{t+1} + \cdots + (\gamma\lambda)^{T-t+1}\,\delta_{T-1} $$
where $\lambda$ is an exponential mean discount factor, $\gamma$ is the trajectory discount, and $\delta_t$ is the TD advantage estimate.
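As a concrete reading of Equation (2), a minimal NumPy sketch of GAE is given below; the function name and array layout are assumptions, and the framework's own implementation may differ.

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation, Equation (2).

    rewards: r_0..r_{T-1}; values: V(s_0)..V(s_T), one extra bootstrap entry.
    gamma and lam match the GAE Discount Factor and GAE Lambda of Table 2.
    """
    T = len(rewards)
    deltas = rewards + gamma * values[1:] - values[:-1]  # TD residuals delta_t
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):  # discounted sum of future deltas
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

adv = gae(np.array([1.0, 1.0, 1.0]), np.array([0.0, 0.0, 0.0, 0.0]))
```

With zero value estimates and unit rewards, each advantage reduces to the $(\gamma\lambda)$-discounted sum of the remaining rewards.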

3.2. Curriculum Learning

Curriculum learning (CL) is a training strategy that enhances an agent’s generalization and convergence by presenting a sequence of tasks of gradually increasing difficulty [22]. Figure 4 shows an example. In DRL and other machine learning paradigms, CL helps guide agent learning, particularly in complex settings: agents first learn from simple environments and then progressively face harder ones, thus fostering better generalization.
A common approach is to start with simple training scenarios and gradually present more complex ones, improving the agent’s ability to link observations and outputs. This methodology has succeeded in not only DRL, but also unsupervised and supervised learning applications, including computer vision, NLP, and autonomous vehicle control [23].
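The progressive-difficulty idea can be sketched as a simple stage loop; the stage dictionaries and the `train_stage` placeholder below are illustrative assumptions, loosely mirroring the structure of the curriculum tables used later in this work rather than their exact values.

```python
# Minimal curriculum-learning driver: run training stages of increasing
# difficulty. Stage fields are illustrative placeholders.
stages = [
    {"vehicles": 1, "gnss_error": "high", "blackout": None},    # easy start
    {"vehicles": 1, "gnss_error": "high", "blackout": "gnss"},  # add outages
    {"vehicles": 1, "gnss_error": "low", "blackout": "gnss"},   # tighten noise
]

def train_stage(stage, episodes):
    """Placeholder for one curriculum step of DRL training."""
    return f"{episodes} episodes, {stage['gnss_error']} GNSS error, blackout={stage['blackout']}"

log = [train_stage(s, episodes=100) for s in stages]
```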

4. Collective Perception Methodology

4.1. Training Setup

The DRL training was performed in a single environment within the CARLA simulator, using the CarAware framework. To deploy the PPO algorithm, an actor–critic architecture was implemented, comprising an actor network $\pi(a_t \mid s_t; \theta)$ and a critic network $V(s_t; \theta_v)$, both instantiated as distinct MLPs (as the input $s_t$ is a vector). As shown in Figure 5, their configurations are as follows:
  • Actor Network: An MLP with three fully connected layers of dimensions $500 \times 300 \times 2$, where $500 \times 300$ defines the hidden layers (selected experimentally) and 2 is the action output dimension (coordinates x, y). The first two layers use ReLU activation; the last uses no activation. Input and output are normalized to $[-1, 1]$ to prevent early weight overfitting. During training, actions are drawn from the multivariate Gaussian $a_i \sim \mathcal{N}(\mu_i, \sigma_i)$; during evaluation, the mean $\mu_i$ is used directly.
  • Critic Network: An MLP with three fully connected layers of dimensions $500 \times 300 \times 1$, where 1 is the output dimension (the value function $V(s_t; \theta_v) \approx R(s_t)$). The first two layers use ReLU activation; the last uses none.
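The input normalization and Gaussian action sampling described for the actor can be illustrated with the small NumPy sketch below; the normalization bounds and all numeric values are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x, low, high):
    """Map raw values into [-1, 1] before feeding the MLPs."""
    return 2.0 * (x - low) / (high - low) - 1.0

def select_action(mu, sigma, training=True):
    """Actor head: sample a_i ~ N(mu_i, sigma_i) during training;
    use the mean mu_i directly during evaluation."""
    if training:
        return rng.normal(mu, sigma)
    return mu

obs = normalize(np.array([150.0, 80.0]), low=0.0, high=200.0)  # -> [0.5, -0.2]
action = select_action(np.array([0.2, -0.1]), np.array([0.7, 0.7]), training=False)
```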
The actor network outputs (mean and standard deviation) are used to calculate the policy change ratio (Equation (1)), necessary for the Clipped Surrogate Objective Loss (CSOL, Equation (3)):
$$ L_t^{\mathrm{CSOL}}(\theta) = \hat{\mathbb{E}}_t \left[ L_t^{\mathrm{CLIP}}(\theta) - c_1 L_t^{\mathrm{VF}}(\theta) + c_2 S[\pi_\theta](s_t) \right] $$
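A per-batch NumPy sketch of Equation (3) follows; here $L^{\mathrm{CLIP}}$ uses the standard PPO clipping, and `eps`, `c1`, and `c2` are assumed to correspond to the PPO Epsilon, Value Loss Scale Factor, and Entropy Scale entries of Table 2.

```python
import numpy as np

def csol(ratio, advantage, value_pred, value_target, entropy,
         eps=0.2, c1=1.0, c2=0.01):
    """Clipped Surrogate Objective (Equation (3)); this is the quantity the
    optimizer maximizes (its negative would be minimized as a loss)."""
    l_clip = np.minimum(ratio * advantage,
                        np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)
    l_vf = (value_pred - value_target) ** 2  # squared value-function error
    return np.mean(l_clip - c1 * l_vf + c2 * entropy)

obj = csol(np.array([1.0]), np.array([2.0]),
           np.array([0.0]), np.array([0.0]), np.array([0.0]))
```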
The DRL PPO training process consists of:
  • Start a simulation episode, spawning vehicles and assigning automatic agents.
  • Store tuples $(s_t, a_t, r_t, V(s_t; \theta_v))$ for each transition.
  • Calculate advantage estimates using GAE (Equation (2)).
  • Divide the complete horizon of data into stochastically sampled mini-batches and feed them into the actor–critic networks for the number of passes defined by the hyperparameter "Epoch Number". The training process optimizes the parameters of the actor ($\theta$) and critic ($\theta_v$) networks via the Adam optimizer, based on the Clipped Surrogate Objective Loss ($L^{\mathrm{CSOL}}$) function.
  • Run the previous steps repeatedly until all episodes have been completed (defined in "Episodes Number").
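The four steps above can be outlined as follows; `collect_episode` and `update_networks` are placeholders for the framework's own rollout and optimization routines, and the one-step TD residual stands in for the full GAE computation of Equation (2).

```python
import numpy as np

def collect_episode(horizon):
    """Placeholder rollout: states (9-dim), actions (2-dim), rewards,
    and value estimates (with one extra bootstrap entry)."""
    return (np.zeros((horizon, 9)), np.zeros((horizon, 2)),
            np.zeros(horizon), np.zeros(horizon + 1))

def update_networks(sample_idx, advantages):
    """Placeholder Adam step on the Clipped Surrogate Objective Loss."""
    return None

def train(episodes=2, horizon=32, batch_size=8, epochs=4, gamma=0.99):
    for _ in range(episodes):                      # "Episodes Number"
        s, a, r, v = collect_episode(horizon)      # spawn vehicles, roll out
        adv = r + gamma * v[1:] - v[:-1]           # stand-in for GAE (Eq. (2))
        idx = np.arange(horizon)
        for _ in range(epochs):                    # "Epoch Number"
            np.random.shuffle(idx)                 # stochastic mini-batches
            for start in range(0, horizon, batch_size):
                update_networks(idx[start:start + batch_size], adv)
    return "done"

status = train()
```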

4.2. Training Methodology

To showcase the framework’s capabilities, a series of DRL case studies was carried out. The objective was to use input data from GNSS, IMU, and SAS/WO sensors in multiple simulated vehicles to accurately infer the position of every vehicle from the top-view window. These sensors are standard in CAVs, supplying motion and position data (used in BSM protocol messages). The choice is further justified by their output bandwidth, which is far lower than that of complex sensors like cameras, radar, and LiDAR, minimizing computational demands for this proof-of-concept. The sensor update rates are 1 Hz for GNSS, 100 Hz for the IMU, and 10 Hz for SAS/WO; all sensor outputs are affected by a scenario-varying standard deviation (noise, low = 0.00001 and high = 0.0001), but without bias. Simulated blackout events occur every 5–10 s, with a duration of 5–10 s (randomly defined).
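The randomized blackout events can be sketched as below; the function name and return format are assumptions, and only the 5–10 s intervals come from the text.

```python
import random

def blackout_schedule(total_time, seed=42):
    """Generate (start, end) outage windows: a new blackout begins 5-10 s
    after the previous one ends and lasts 5-10 s, as described above."""
    rng = random.Random(seed)
    events, t = [], 0.0
    while True:
        t += rng.uniform(5.0, 10.0)        # gap before the next outage
        if t >= total_time:
            break
        duration = rng.uniform(5.0, 10.0)  # outage length
        events.append((t, min(t + duration, total_time)))
        t += duration
    return events

events = blackout_schedule(60.0)
```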
The selected algorithm for this application was Proximal Policy Optimization (PPO), which has proven effective for continuous input/output DRL problems in complex domains [21]. Each vehicle’s observation comprises a nine-element vector (GNSS x and y; IMU accelerometer x, y, z, gyroscope pitch, yaw, roll, and compass; SAS steering angle; and WO speed), and the action space is a two-element vector representing the agent’s prediction of the (x, y) coordinates. The DRL agent acts on each vehicle individually per training step, cycling through all vehicles, simulating a V2C setup. The reward function is the Euclidean distance between predicted and real positions (simulator ground truth). Hyperparameters are shown in Table 2.
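The reward computation can be written in a few lines; the negation is an assumption so that smaller localization errors yield larger rewards, since the text specifies only that the reward is the Euclidean distance to ground truth.

```python
import math

def localization_reward(predicted, actual):
    """Reward for one step: Euclidean distance between the agent's predicted
    (x, y) and the simulator ground truth, negated (assumption) so that a
    perfect prediction gives the maximum reward of 0."""
    dx = predicted[0] - actual[0]
    dy = predicted[1] - actual[1]
    return -math.hypot(dx, dy)

r = localization_reward((10.0, 5.0), (13.0, 9.0))  # error of 5 m
```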
It is important to emphasize that the primary focus of this work is on perception, with the expected output being an accurate estimation of vehicle positions. This methodology differs from the regular DRL scenario since the vehicles’ actions (localization estimations) do not cause direct impacts on the environment. While the reward is directly affected by each predicted action, this characteristic does not pose a problem for the training scenarios considered. Deep Reinforcement Learning (DRL) was selected over other training methods—such as supervised regression or Bayesian filtering—because of its strength in learning from heterogeneous input types, without the need for explicit understanding or conversion of sensor data into a unified format. Moreover, this approach allows for the straightforward integration of additional sensor types in future work.

5. Results and Discussions

5.1. Scenario 1: Town 02—Localization with All Sensors and No Blackout Events

The first task was to train an agent to infer the positions of all vehicles in the map using BSM-associated sensors. The scenario assumed perfect communication and continuous sensor data availability. The “Town 02” map was selected for its urban features and compact size.
In high-complexity settings like urban driving, lower learning rates improve convergence by allowing more input analysis per update. Adding input noise fosters better generalization. Initial attempts showed the agent could not consistently achieve multi-agent position prediction through one-step training, even after hyperparameter optimization, due to environmental complexity. Curriculum Learning (CL) was necessary. Six curriculum steps were adopted (see Table 3), with training evolution shown in Figure 6.
After 109 h of training over 1510 episodes, the agent demonstrated robust learning capacity and was able to accurately locate vehicles in the tested scenario (see Figure 2). However, the agent’s predictions were less accurate in less frequently explored map regions (e.g., roads in the middle). As suggested by DRL theory, asynchronous training with parallel environments would further improve generalization, but was beyond available computational resources and time.

5.2. Scenario 2: Town 02—Localization with Eventual GNSS Blackout Events

In this scenario, the goal was to enable the agent to cope with temporary GNSS sensor failures, localizing vehicles using only the remaining sensors. Since GNSS directly provides ( x , y ) coordinates, its unavailability presents a major challenge. Complete loss makes localization impossible, but short outages may be mitigated by other sensor data.
Transferring an agent trained only on reliable GNSS to this new scenario did not work: without new training, the agent simply kept the last observed GNSS output. Continuing training from the previous agent produced unsatisfactory results (the output had low entropy and poor adaptation to the new challenge). Therefore, curriculum learning was again needed, training the agent from scratch and introducing GNSS outages partway through training. Results are shown in Table 4 and Figure 7.
After 87 h of training (1230 episodes), results showed that the agent could maintain accurate predictions for up to 10 s of GNSS blackout—including during curves (Figure 8). Occasionally, the agent learned to offset predictions ahead of the vehicle position (Figure 9). This undesirable compensation illustrates the challenge of avoiding local optima in DRL.

5.3. Scenario 3: Town 02—Localization with Eventual IMU, SAS/WO Blackout Events

Here, the objective was to predict vehicle positions during IMU and SAS/WO sensor outages, simulated via random blackout events (which could occur simultaneously or separately). An agent initialized from previous phases (before the GNSS blackout bias) was used.
Curriculum learning was again applied (see Table 5 and Figure 10). With the GNSS sensor remaining available, predictions remained accurate, with the agent learning to increase the weight placed on GNSS input during blackouts.
After 79 h of training (1310 episodes), the agent was able to maintain accurate localization during temporary IMU/SAS/WO outages (see Figure 11). This outcome was anticipated, as the GNSS sensor serves as the primary source of localization information and therefore contributes most significantly to vehicle position prediction compared to the other sensors. When multiple sensor blackouts occurred, the agent learned to increase the reliance on GNSS data within its decision-making process.

5.4. Scenario 4: Town 01—Localization with All Sensors and No Blackout Events

Similar to scenario 1, this simulation was conducted to evaluate both the training methodology and the agent’s performance on a different map, aiming to verify whether the conclusions remained consistent regardless of the training environment. “Town 01” was chosen as it also offers a realistic urban setting and, although larger than “Town 02”, remains sufficiently compact to ensure that computational resources were not a limiting factor. The same assumptions and conditions applied as in the previous scenarios. The curriculum (Table 6) and result curves (Figure 12) follow the same structure.
After 210 h (3003 episodes), localization was achieved, but with increased difficulty and some areas showing reduced model efficiency—attributable to higher map complexity and computational cost. The previously noted challenges regarding the DRL methodology and computational resource limitations are also present in this scenario; however, they are even more pronounced here due to the larger map size, which introduces additional complexity and a greater number of details for the agent to process and learn in order to make accurate predictions.

6. Conclusions

Simulating urban autonomous vehicles is computationally intensive. A downside of DRL training is that it requires substantial time for the model to learn, compared to other machine learning techniques, since it always starts from a “blank slate” state.
Both positive and negative aspects were observed in the adopted training strategies. The trained model was able to partially fulfill the intended objective, although some undesired behaviors were present. Certain limitations, such as the available training time and computational resources, impacted model performance. Nevertheless, results were sufficient to confirm that the proposed methodology is feasible for real-world applications.
Applying asynchronous training with multiple environments and diverse maps—allowing mixed experience to be gathered for each training step—can considerably improve PPO learning, reducing biases and strengthening generalization [21,24]. A common challenge in reinforcement learning is a lack of generalization capability [25]. Recent studies suggest that blending learning approaches, for example, integrating imitation learning with DRL, can improve generalization [26]. Moreover, simulating more dynamic traffic events and complex interactions between vehicles, and implementing more complex reward functions, like multi-objective rewards that also penalize prediction uncertainty (e.g., negative log-likelihood of the Gaussian output) and favor trajectory smoothness (e.g., second-order difference in successive estimates) could improve training effectiveness.
With further improvements, these methodologies could be deployed in real backend systems, collecting vehicle data and providing information to increase the situational awareness and safety of connected and autonomous vehicles in real traffic.

Author Contributions

Conceptualization, T.O.A.; Methodology, T.O.A.; Software, T.O.A.; Validation, T.O.A.; Investigation, T.O.A.; Resources, T.O.A.; Writing—original draft, T.O.A.; Writing—review & editing, M.L.N. and J.F.J. All authors have read and agreed to the published version of the manuscript.

Funding

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior—Brasil (CAPES)—Finance Code 001.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data can be found at https://github.com/tulioaraujoMG/CarAware, accessed on 20 December 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Aria, M. A Survey of Self-driving Urban Vehicles Development. IOP Conf. Ser. Mater. Sci. Eng. 2019, 662, 042006. [Google Scholar] [CrossRef]
  2. Chougule, A.; Chamola, V.; Sam, A.; Yu, F.R.; Sikdar, B. A Comprehensive Review on Limitations of Autonomous Driving and Its Impact on Accidents and Collisions. IEEE Open J. Veh. Technol. 2024, 5, 142–161. [Google Scholar] [CrossRef]
  3. Hwang, S.; Lee, K.; Jeon, H.; Kum, D. Autonomous Vehicle Cut-In Algorithm for Lane-Merging Scenarios via Policy-Based Reinforcement Learning Nested Within Finite-State Machine. IEEE Trans. Intell. Transp. Syst. 2022, 23, 17594–17606. [Google Scholar] [CrossRef]
  4. Cheng, J.; Ju, M.; Zhou, M.; Liu, C.; Gao, S.; Abusorrah, A.; Jiang, C. A Dynamic Evolution Method for Autonomous Vehicle Groups in a Highway Scene. IEEE Internet Things J. 2022, 9, 1445–1457. [Google Scholar] [CrossRef]
  5. García Cuenca, L.; Puertas, E.; Fernandez Andrés, J.; Aliane, N. Autonomous Driving in Roundabout Maneuvers Using Reinforcement Learning with Q-Learning. Electronics 2019, 8, 1536. [Google Scholar] [CrossRef]
  6. Sidauruk, A.; Ikmah. Congestion Correlation and Classification from Twitter and Waze Map Using Artificial Neural Network. In Proceedings of the 2018 3rd International Conference on Information Technology, Information System and Electrical Engineering (ICITISEE), Yogyakarta, Indonesia, 13–14 November 2018; pp. 224–229. [Google Scholar] [CrossRef]
  7. Shang, W.; Song, X.; Xiang, Q.; Chen, H.; Elhajj, M.; Bi, H.; Wang, K.; Ochieng, W. The impact of deep reinforcement learning-based traffic signal control on Emission reduction in urban Road networks empowered by cooperative vehicle-infrastructure systems. Appl. Energy 2025, 390, 125884. [Google Scholar] [CrossRef]
  8. Fang, S.; Yang, L.; Shang, W.; Zhao, X.; Li, F.; Ochieng, W. Cooperative Control Model Using Reinforcement Learning for Connected and Automated Vehicles and Traffic Signal Light at Signalized Intersections. IEEE Internet Things J. 2025, 2, 44037–44050. [Google Scholar] [CrossRef]
  9. Mahtani, A.; Sanchez, L.; Fernandez, E.; Martinez, A.; Joseph, L. ROS Programming: Building Powerful Robots; Packt Publishing: Birmingham, UK, 2018. [Google Scholar]
  10. Rosique, F.; Navarro, P.J.; Fernández, C.; Padilla, A. A Systematic Review of Perception System and Simulators for Autonomous Vehicles Research. Sensors 2019, 19, 648. [Google Scholar] [CrossRef] [PubMed]
  11. Vargas, J.; Alsweiss, S.; Toker, O.; Razdan, R.; Santos, J. An Overview of Autonomous Vehicles Sensors and Their Vulnerability to Weather Conditions. Sensors 2021, 21, 5397. [Google Scholar] [CrossRef] [PubMed]
  12. Ignatious, H.A.; Hesham-El-Sayed; Khan, M. An overview of sensors in Autonomous Vehicles. Procedia Comput. Sci. 2022, 198, 736–741. [Google Scholar] [CrossRef]
  13. Araújo, T.O.; Netto, M.L.; Justo, J.F. CarAware: A Deep Reinforcement Learning Platform for Multiple Autonomous Vehicles Based on CARLA Simulation Framework. In Proceedings of the 2023 8th International Conference on Models and Technologies for Intelligent Transportation Systems (MT-ITS), Nice, France, 14–16 June 2023; pp. 1–6. [Google Scholar] [CrossRef]
  14. Jansen, M.; Beaton, P. 5G vs. 4G: How Does the Newest Network Improve on the Last? 2022. Available online: http://www.digitaltrends.com/mobile/5g-vs-4g/ (accessed on 15 December 2025).
  15. SAE. V2X Communications Message Set Dictionary; SAE: London, UK, 2020. [Google Scholar]
  16. Dosovitskiy, A.; Ros, G.; Codevilla, F.; López, A.M.; Koltun, V. CARLA: An Open Urban Driving Simulator. arXiv 2017. [Google Scholar] [CrossRef]
  17. Naeem, M.; Rizvi, S.T.H.; Coronato, A. A Gentle Introduction to Reinforcement Learning and its Application in Different Fields. IEEE Access 2020, 8, 209320–209344. [Google Scholar] [CrossRef]
  18. Alharin, A.; Doan, T.N.; Sartipi, M. Reinforcement Learning Interpretation Methods: A Survey. IEEE Access 2020, 8, 171058–171077. [Google Scholar] [CrossRef]
  19. OpenAI. Part 2: Kinds of RL Algorithms; OpenAI: San Francisco, CA, USA, 2022. [Google Scholar]
  20. Wang, T.; Bao, X.; Clavera, I.; Hoang, J.; Wen, Y.; Langlois, E.; Zhang, S.; Zhang, G.; Abbeel, P.; Ba, J. Benchmarking Model-Based Reinforcement Learning. arXiv 2019. [Google Scholar] [CrossRef]
  21. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017. [Google Scholar] [CrossRef]
  22. Wang, X.; Chen, Y.; Zhu, W. A Survey on Curriculum Learning. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 4555–4576. [Google Scholar] [CrossRef] [PubMed]
  23. Khaitan, S.; Dolan, J.M. State Dropout-Based Curriculum Reinforcement Learning for Self-Driving at Unsignalized Intersections. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022. [Google Scholar] [CrossRef]
  24. Berner, C.; Brockman, G.; Chan, B.; Cheung, C.; Dębiak, P.; Dennison, C.; Farhi, D.; Fischer, Q.; Hashme, S.; Hesse, C.; et al. Dota 2 with Large Scale Deep Reinforcement Learning. arXiv 2019. [Google Scholar] [CrossRef]
  25. Kirk, R.; Zhang, A.; Grefenstette, E.; Rocktäschel, T. A Survey of Generalisation in Deep Reinforcement Learning. arXiv 2021. [Google Scholar] [CrossRef]
  26. Li, Z. A Hierarchical Autonomous Driving Framework Combining Reinforcement Learning and Imitation Learning. In Proceedings of the 2021 International Conference on Computer Engineering and Application (ICCEA), Kunming, China, 25–27 June 2021; pp. 395–400. [Google Scholar] [CrossRef]
Figure 1. Autonomous Vehicle Sensors.
Figure 2. Case study predictions after curriculum training. Red dots represent predicted positions, and blue arrows represent the true positions.
Figure 3. Main DRL algorithms and their classification (Adapted from [19]).
Figure 4. Example of a curriculum-learning process for computer vision.
Figure 5. Complete implemented training architecture.
Figure 6. Reward and standard deviation curves for scenario 1.
Figure 7. Reward and standard deviation curves for scenario 2.
Figure 8. Correct prediction under GNSS blackout event.
Figure 9. Wrong behaviors learned by the agent to deal with GNSS blackout events.
Figure 10. Reward and standard deviation curves for scenario 3.
Figure 11. Correct prediction under IMU and SAS/WO blackout events.
Figure 12. Reward and standard deviation curves for scenario 4.
Table 1. Comparison of sensor features (Adapted from [10,11,12]).

| Feature                     | Ultrasonic | RADAR            | LiDAR          | Camera       |
|-----------------------------|------------|------------------|----------------|--------------|
| Primary Technology          | Sound wave | Radio wave       | Laser beam     | Light        |
| Range                       | ∼5 m       | ∼250 m           | ∼200 m         | ∼200 m       |
| Frequency                   | 40–70 kHz  | 24, 74 or 79 GHz | 193 or 331 THz | 272–1498 THz |
| Affected by Weather         | Yes        | No               | Yes            | Yes          |
| Affected by Lighting        | No         | No               | No             | Yes          |
| Size                        | Small      | Small            | Big            | Small        |
| Detects Speed               | Poor       | Very Good        | Good           | Poor         |
| Resolution                  | Poor       | Average          | Good           | Very Good    |
| Detects Distance            | Good       | Very Good        | Good           | Poor         |
| Interference Susceptibility | Good       | Poor             | Good           | Very Good    |
| Field of View               | Poor       | Average          | Very Good      | Good         |
| Accuracy                    | Poor       | Average          | Very Good      | Very Good    |
| Frame Rate                  | Average    | Average          | Average        | Good         |
| Colour Perception           | Poor       | Poor             | Poor           | Very Good    |
| Maintenance                 | Average    | Poor             | Poor           | Average      |
| Visibility                  | Poor       | Poor             | Average        | Good         |
| Price                       | Good       | Average          | Poor           | Average      |
Table 2. Hyperparameters used for the case study.

| Hyperparameter          | Value  |
|-------------------------|--------|
| Learning Rate           | 0.0001 |
| Learning Rate Decay     | 1      |
| GAE Discount Factor     | 0.99   |
| GAE Lambda              | 0.95   |
| Value Loss Scale Factor | 1      |
| Initial Deviation       | 0.7    |
| Entropy Scale           | 0.01   |
| PPO Epsilon             | 0.2    |
| Horizon Number          | 32,768 |
| Batch Size              | 2048   |
| Epoch Number            | 4      |
Table 3. Curriculum steps for scenario 1.

| Step | Vehicle | Restart | GNSS Error | Blackout | Town  |
|------|---------|---------|------------|----------|-------|
| 1    | Single  | No      | High       | None     | 01    |
| 2    | Single  | No      | High       | None     | 02    |
| 3    | Single  | Yes     | High       | None     | 01/02 |
| 4    | Single  | Yes     | High       | None     | 02    |
| 5    | Single  | Yes     | High       | None     | 02    |
| 6    | Single  | Yes     | Low        | None     | 02    |
Table 4. Curriculum steps for scenario 2.

| Step | Vehicle | Restart | GNSS Error | Blackout | Town  |
|------|---------|---------|------------|----------|-------|
| 1    | Single  | No      | High       | None     | 01    |
| 2    | Single  | No      | High       | None     | 02    |
| 3    | Single  | Yes     | High       | None     | 01/02 |
| 4    | Single  | Yes     | High       | None     | 02    |
| 5    | Single  | Yes     | High       | GNSS     | 02    |
| 6    | Single  | Yes     | Low        | GNSS     | 02    |
Table 5. Curriculum steps for scenario 3.

| Step | Vehicle | Restart | GNSS Error | Blackout   | Town  |
|------|---------|---------|------------|------------|-------|
| 1    | Single  | No      | High       | None       | 01    |
| 2    | Single  | No      | High       | None       | 02    |
| 3    | Single  | Yes     | High       | None       | 01/02 |
| 4    | Single  | Yes     | High       | None       | 02    |
| 5    | Single  | Yes     | High       | IMU/SAS/WO | 02    |
| 6    | Single  | Yes     | Low        | IMU/SAS/WO | 02    |
Table 6. Curriculum steps for scenario 4.

| Step | Vehicle | Restart | GNSS Error | Blackout | Town  |
|------|---------|---------|------------|----------|-------|
| 1    | Single  | No      | High       | None     | 01    |
| 2    | Single  | No      | High       | None     | 02    |
| 3    | Single  | Yes     | High       | None     | 01/02 |
| 4    | Single  | Yes     | High       | None     | 01    |
| 5    | Single  | Yes     | Low        | None     | 01    |
