Reinforcement Learning and Genetic Algorithm-Based Network Module for Camera-LiDAR Detection

Abstract: Cameras and LiDAR sensors are used together in sensor fusion for robust object detection in autonomous driving. Object detection networks for autonomous driving are often retrained when datasets are added or changed in order to keep performance robust. Repeated training is also necessary to develop an efficient network module. Existing efficient network modules are designed by hand, which requires considerable module design experience. Neural architecture search was introduced to address this, but it takes much time and its design process needs to be optimized. To solve this problem, we propose a two-stage optimization method for the offspring generation process in a neural architecture search based on reinforcement learning. In addition, to address fast convergence as an objective function of the genetic algorithm, we propose using two split datasets: source data (daytime, sunny) and target data (day/night, adverse weather). The proposed method generates efficient modules in less time than NSGA-NET. Using the Dense dataset, we confirmed improved performance and faster convergence. The experiments show that the proposed method generates an efficient module.


Introduction
Artificial neural networks have made significant advancements in autonomous driving, particularly in classification [1] and object detection [2,3]. In driving environments, variations in light and humidity, particularly during adverse weather conditions or at night, can affect LiDAR data [4]. Therefore, designing networks that can adapt to various conditions in autonomous driving is crucial, and sensor fusion has proven to be an effective method for this. Recently, research on camera-LiDAR detection in autonomous driving has explored middle-level fusion techniques [5,6], which integrate inputs from different sensor-based networks.
Figure 1a shows the object detection network frameworks. The network takes input from a single sensor-based detection network, which refines the features and performs detection by passing them to the middle layer or directly to the head. When fusing networks from different sensors, a multi-modal fusion method is preferred. This involves creating a backbone network for each sensor and performing fusion at a middle layer. Figure 1b,c illustrate fusion in the middle layer; Figure 1b is a simple fusion method carried out by stacking or concatenating features. However, multi-modal fusion requires a recalibration process between the features from different sensors [7]. Figure 1c shows a module configuration in which the features are corrected in the middle layer. The development of these modules must be efficient and effective. Efficient network metrics [8] in object detection include quality and footprint. Quality metrics, such as mAP (mean average precision) [9], precision, and recall, can be used to evaluate and develop network effectiveness. As the data grow, the difficulty of learning increases, and it takes much time to check quality through repeated experiments. Existing studies have developed effective deep learning methods, and we intend to go further to resolve learning difficulties. The attention mechanism module shows visible performance improvement in various fields [7,10] with a slight increase in computation resources. By using the attention mechanism well, we can improve performance and address efficiency issues.
Remote Sens. 2024, 16, 2287
Figure 2 shows the number of epochs needed for convergence using the same attention mechanism module as previous studies, AFAM and FSL [11,12]. We found that the number of training epochs varied. AFAM, which uses contrastive learning, converges faster because it uses a separate dataset for contrastive learning. Our experiments suggest that separating datasets can affect the convergence speed, which can be a helpful idea in the optimization process. We propose a method to create a network module that uses neural architecture search (NAS) and utilizes separated adverse weather data. NAS methods [13] include evolutionary [14][15][16] and learning methods [17]. Evolutionary methods apply a genetic algorithm (GA) to multi-objective problems with multiple objective functions to optimize. GA creates offspring and solves the overall objective function through the population, crossover, and mutation processes. The population process in GA produces chromosomes with diversity in mind. NSGA-NET [14] was proposed as a NAS method using GA. Existing NSGA [18] research has strong randomness in the population process, so creating a learning module takes a long time. In addition, the NSGA-based EEEA [16] proposed an optimization method that adds a heuristic and exits early based on parameter values below a threshold. However, this method depends critically on the criterion used to set the threshold, and that criterion can only be set through experimentation. So, in this paper, we optimized the population process to reduce randomness via a reinforcement learning network. The reinforcement learning network is divided into training and utilization stages, and the dataset composition varies for each stage. The contributions of this paper are as follows.

• The model training of the reinforcement learning network was optimized by dividing it into two stages.
• The search time to create a practical network module was shortened by reducing the search space using reinforcement learning in GA's population process.
• Performance and network convergence speed are improved based on GA assisted with reinforcement learning.
This paper applies GA to address performance and computation resource issues and to solve multi-objective problems for fast convergence. Our core contribution is a method for GA's search space problem that uses a reinforcement learning network rather than random offspring. This method can effectively obtain a solution close to the global optimum rather than a local optimum. As a comparative experiment, we compared the creation of a network module using GA-based NSGA with that using GA supported by reinforcement learning.

Related Works
Deep learning networks have made significant progress in object detection and classification [19] in computer vision. Recently, researchers have studied network modules that refine features using squeeze and excitation [20] or attention mechanisms. The CBAM [10] method proposes refined features by emphasizing essential features for each element in each feature map using channel and spatial attention. Wang et al. [21] improved performance by configuring efficient modules using channel and spatial attention for different domain alignments. The attention mechanism can be applied to other dimensions: the ReFID [22] study proposed learning in the frequency domain by utilizing the attention mechanism there, improved generalization ability, and demonstrated it through extensive experiments. The Transformer [23] is an attention mechanism that has been used in various fields and has achieved several state-of-the-art results [24][25][26]. However, its high computation requirements limit its use in embedded environments. These studies [10,20] can be applied to various fields such as speech recognition [27], audio-video speech [28], and image captioning [29]. Our research aims to improve performance with minimal computation. The SCConv [30] method generates a feature map highlighted from channel and spatial information, but it has the limitation of determining the ratio through experimentation. We will create a module using attention research for this multi-modal network fusion.
NAS research includes evolutionary methods [15] and reinforcement learning methods [17], and much of it has built on NASNet [31] proposed by Zoph. Among various approaches, the cell-based method by Bender [32] has been applied as a simple way to implement NAS, and in particular, DARTS [33] enables more efficient network design than before by optimizing the search space. Optimal results can be derived quickly by defining and learning various cases in advance, but this requires computing power. Evolutionary NAS [34,35] has been used efficiently in multiple-objective problems aimed at operating in various environments, reducing computation, and improving performance. NSGA-NET [14], an existing study of evolutionary methods, effectively solved multiple-objective problems and showed the potential for performance and an efficient approach. We modified the search strategy of NSGA-NET and found a structure with faster convergence through the optimized population process.
Over the past decades, neural network-based object detection research [36][37][38] has been conducted using cameras as single sensors. Object detection networks for autonomous driving rely on multiple sensors, making camera, LiDAR, and radar fusion research necessary [4,39]. Research on sensor fusion may involve considering the domain or fusing various views. One study [40] proposed learning complex correlations through graph collaboration in a multi-view problem; at the same time, it formulated learning convergence as an optimization problem and proved fast convergence using the Lagrange multiplier method. A neural network based on fusion techniques [6,39,41–43] can enhance the limited data from far-off LiDAR sensors by incorporating detailed information from cameras or depth estimates. This integration can improve the accuracy of vehicle volume and heading estimation by incorporating shape and spatial data. In multi-modal network fusion, the modalities have different roles and require alignment in the network [20]. In a study [7] that effectively fused multiple modalities such as voice, video, and motion, filtering was applied to each sensor feature using attention [10] on the channel and spatial information, and the features of the different modalities were then combined through a squeeze and excitation network (SENet) [20]. Since data fusion cannot simply stack different features, the features can be refined using attention and SENet, as in previous studies.
In previous extensive research, the measurement deviation of LiDAR [44][45][46] in adverse weather, that is, in high-humidity situations, was analyzed, and object classification through noise filtering using CNNs was proposed [47]. Camera-based adverse weather studies [48,49] analyzed the differences between clear and foggy conditions using noise and created training data by synthesizing the data that were lacking. In research on adverse weather, an important element is analyzing the characteristics of the noise of each sensor according to the situation. For adverse weather, domain adaptation studies [50] learn only from daytime and sunny weather data, evaluate on fog and rain as the target domain, and aim to learn a generalized classifier by proposing domain-invariant representations [51,52]. However, since such research requires a target domain, domain generalization research has recently been proposed [53]. Domain generalization does not require target data, which enables more generalized network learning, but since it uses only the source domain, there is a performance discrepancy in learning. Existing studies on reducing this domain gap can be considered very effective in adverse weather or when data are lacking, but the difficulty of network learning must be considered.

Method Overview
We are working on designing network modules that can handle middle-level fusion and deliver reliable performance even in adverse weather conditions. However, finding a suitable network module can be time-consuming and experimental. To address this, we focus on improving the efficiency of both gene generation and module convergence. First, we have improved generation efficiency by introducing chromosomes based on reinforcement learning into the gene generation of the GA. The reinforcement learning network aims to optimize the gene generation process, and we aim to achieve this within minimal training time. However, the reinforcement learning process can take longer if data deviations increase. So, we have excluded adverse weather data from the source data used for training in reinforcement learning.
In Figure 3a, the GA creation process involves reinforcement learning, and the adverse weather data are not included in the source data used for training. The reinforcement learning network is designed to optimize the gene generation process so as to minimize training time; if the variance in the data increases, the reinforcement learning process may require more training time. During Stage I, reinforcement learning learns the source dataset and predicts the loss in the next epoch for each module generated via GA. This prediction helps forecast the network's convergence speed. Reinforcement learning predicts the loss, modifies the module with actions, and receives a penalty if the loss is not improved as much as predicted.
In Stage II, a module is created using the NSGA-NET algorithm proposed in [14]. The method generates a gene population by combining the module set O_t proposed through reinforcement learning with the parent population P_t generated via NSGA. The proposed module is obtained through sampling from the rescued gene population. This approach enables simultaneous search and learning by training the network together with the head. Some deviations are expected because the data were collected during adverse weather conditions. However, the optimization process of GA can prevent the problem of falling into a local optimum. Furthermore, the module estimated from the source data of reinforcement learning can reduce search time and support faster convergence.

Genetic Algorithm for Neural Architecture Search
We are looking for modules that can refine and highlight the features of the camera and LiDAR backbones. The features extracted from the backbones are F^i_C and F^i_L, where i indexes the features extracted from the backbone layers and the subscripts C and L refer to the camera and LiDAR, respectively. Each feature has size W × H × D, where W (width) and H (height) are the spatial dimensions and D is the depth of the feature tensor. Passing a feature through the module produces a feature M_{C,L} of a specific size, which will differ from the input size W, H, D. So, we applied channel attention to M_{C,L} and converted it into a feature map of a fixed size. Since channel attention uses squeeze and excitation, the feature is expanded if the depth of M is smaller than D and squeezed in the opposite case. This is expressed in Equations (1) and (2).
The highlighted feature map is created by element-wise multiplication of the existing input feature map with the channel-attended result M_{C,L}, as given in Equations (3) and (4), where ⊗ denotes element-wise multiplication. Adding the recalibrated feature map F′_{C,L} to the backbone features F^i_{C,L} creates a feature map with highlighted features. Equations (3) and (4) give the highlighted feature maps; the results of the i-th modules for each sensor are concatenated and delivered to the layer for object detection. The output of the module created in this paper is delivered to the head or middle layer through the process of Equations (1)-(6).
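The recalibration described above can be illustrated with a minimal NumPy sketch. Equations (1)-(6) are not reproduced in this text, so the exact form here is an assumption: the gate is a single linear map followed by a sigmoid, and the names `channel_gate`, `recalibrate`, and all weight shapes are hypothetical.

```python
import numpy as np

def channel_gate(module_output, weight):
    """Squeeze-and-excitation style gate (assumed form, not the paper's exact equations).

    module_output: (H', W', M) feature produced by the searched module.
    weight: (M, D) matrix that expands (M < D) or squeezes (M > D) to depth D.
    """
    squeezed = module_output.mean(axis=(0, 1))   # squeeze: global average pool -> (M,)
    logits = squeezed @ weight                   # map module depth M to backbone depth D
    return 1.0 / (1.0 + np.exp(-logits))         # sigmoid gate in [0, 1], shape (D,)

def recalibrate(backbone_feature, module_output, weight):
    """Element-wise multiply the backbone feature by the gate, then add it back."""
    gate = channel_gate(module_output, weight)
    highlighted = backbone_feature * gate        # broadcast over the spatial dimensions
    return highlighted + backbone_feature

rng = np.random.default_rng(0)
height, width, depth = 4, 4, 8
f_cam = rng.standard_normal((height, width, depth))
f_lidar = rng.standard_normal((height, width, depth))
m_cam = rng.standard_normal((2, 2, 5))      # depth 5 < 8: expanded
m_lidar = rng.standard_normal((2, 2, 12))   # depth 12 > 8: squeezed
w_cam = rng.standard_normal((5, depth))
w_lidar = rng.standard_normal((12, depth))

# Per-sensor results are concatenated along the channel axis before detection.
fused = np.concatenate(
    [recalibrate(f_cam, m_cam, w_cam), recalibrate(f_lidar, m_lidar, w_lidar)],
    axis=-1,
)
```

A single linear map keeps the sketch short; a squeeze-and-excitation block normally uses a two-layer bottleneck, which would replace `weight` with two matrices.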
The genetic algorithm of the proposed method must define encoding, selection, crossover, and mutation, and reinforcement learning is used as an aid in the gene reproduction process that uses these. To find an efficient and effective module, we follow a process similar to NSGA-NET [14] but use a smaller population size and a modified selection process to reduce GPU-Days.
(1) Encoding: In the proposed method, each iteration gradually builds up a good group. The middle layer consists of blocks and nodes, and Figure 4 shows the encoding process for each block. In this study, a node of a block is expressed as x with a subscript and a superscript, where the subscript (C or L) indicates the sensor and the superscript indicates the location of the block. Each block can consist of up to m or n nodes. Each node x_{C,L} is a basic computation based on VGG [54], ResNet [1], and DenseNet [55] from the CNN literature. We chose the computation operation from among eight options (1-8), collected based on their prevalence in the CNN literature; one of them is 5 × 5 dilated convolution. x_{C,L} refers to the camera or LiDAR feature map and is the first conv block. A node consists of a total of 6 bits: 3 bits encode the operation described above, and the remaining 3 bits encode the output channel size. Because the GA determines only the channel and the operation, the kernel size may be larger than the input; in this case, the node is skipped without performing the operation.
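The 6-bit node layout can be illustrated with a short decoding sketch. Only the bit split (3 operation bits, then 3 channel bits) comes from the text; the operation names and channel sizes in the lookup tables below are hypothetical placeholders, and the bit order within each field (most significant bit first) is an assumption.

```python
# Hypothetical lookup tables: the text specifies eight operations and a 3-bit
# channel field, but not these concrete names or sizes.
OPERATIONS = [f"op_{k}" for k in range(1, 9)]
CHANNEL_SIZES = [16, 32, 48, 64, 96, 128, 192, 256]
NODE_BITS = 6

def bits_to_int(bits):
    """Interpret a list of 0/1 values as an unsigned integer, MSB first."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value

def decode_node(bits):
    """Decode one 6-bit node into (operation, output channel size)."""
    if len(bits) != NODE_BITS:
        raise ValueError(f"a node needs {NODE_BITS} bits, got {len(bits)}")
    return OPERATIONS[bits_to_int(bits[:3])], CHANNEL_SIZES[bits_to_int(bits[3:])]

def decode_block(chromosome):
    """Split a block's bit string into nodes and decode each one."""
    if len(chromosome) % NODE_BITS != 0:
        raise ValueError("chromosome length must be a multiple of 6")
    return [decode_node(chromosome[i:i + NODE_BITS])
            for i in range(0, len(chromosome), NODE_BITS)]

example_block = decode_block([0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1])
```

Keeping each field at exactly 3 bits means every bit pattern decodes to a valid node, so crossover and mutation never produce an undecodable chromosome.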
(2) Selection: Selection plays an important role in the process by which GA finds the global optimal solution, and the tournament method was mainly used. Selection results driven by fitness bias do not always provide a global optimal solution and can also have the disadvantage of widening the search space. In the proposed method, the best candidates were derived in terms of performance and calculation time, respectively, so that nondominated solutions could be passed on to the next generation. As discussed later, the outcomes proposed by the reinforcement learning network are also added with a certain probability.
(3) Crossover: Fitness is improved by stacking blocks through selection based on fitness bias and through crossover, the gene combination process. The building block represents various node connection combinations, and the various connections optimize the conv block. Our proposed method used a one-point method, randomly selecting a node in the encoding vector and changing the designated point. If the crossover is diverse, the search space expands and a near-global optimum can be found. However, to reduce the search time, a simple method is used and supplemented with reinforcement learning.
(4) Mutation: To improve diversity, mutation serves as a gene generation process that escapes from a local optimal solution during the population process; we changed the connection node by flipping the bits of the encoding vector. When the bits of the encoding vector are flipped, the operation or output channel at the connection node changes. This mutation only changes the operation or channel, so it is difficult to find a new structure with it alone. In this study, however, the reinforcement learning network was able to solve the problem of diversity by proposing a new structure.
(5) Reproduction: The softmax method of selecting the reproduction process, like the method proposed by DARTS [33], is suitable when using only genetic algorithms. To form an optimal gene population, we must appropriately combine genes generated through reinforcement learning with previous generations during reproduction. A reasonable moment to sample reinforcement learning genes is when the update size of the loss is small.
Algorithm 1 is the module creation process for GA-based NAS. Genes are organized into modules, and fitness functions for the initial genes are calculated. Dominant genes are sampled through binary tournament selection and reproduced up to a size of K.
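The GA operators named above (binary tournament selection, one-point crossover, and bit-flip mutation) can be sketched as follows. This is an illustrative sketch rather than Algorithm 1 itself: the scalar fitness used in the tournament, the node-aligned cut point, and the per-bit mutation rate are simplifying assumptions, and all function names are hypothetical.

```python
import random

NODE_BITS = 6  # one node = 3 operation bits + 3 channel bits

def binary_tournament(population, fitness, rng):
    """Pick two distinct individuals at random and keep the fitter one."""
    a, b = rng.sample(range(len(population)), 2)
    return population[a] if fitness[a] >= fitness[b] else population[b]

def one_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two parents at a randomly chosen node boundary.

    Assumes both parents have the same length and at least two nodes.
    """
    n_nodes = len(parent_a) // NODE_BITS
    cut = rng.randrange(1, n_nodes) * NODE_BITS
    return parent_a[:cut] + parent_b[cut:], parent_b[:cut] + parent_a[cut:]

def bit_flip_mutation(chromosome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in chromosome]

rng = random.Random(0)
parent_a, parent_b = [0] * 12, [1] * 12
child_a, child_b = one_point_crossover(parent_a, parent_b, rng)
mutated = bit_flip_mutation(child_a, rate=0.1, rng=rng)
```

Cutting only at node boundaries keeps each 6-bit node intact during crossover, so new operation and channel combinations inside a node arise only through mutation.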
Genes are updated by adding modules generated via RLNet or through the crossover and mutation processes within a generation. After breeding, nondominated genes are selected based on the evaluation of an objective function favoring fitness and passed on to the next generation, which updates the generation. Algorithm 1 finally returns the parent population.
In Equation (7), RL(prob) refers to the sampling probabilities obtained from the learned reinforcement learning model. The larger the difference between the current loss and the loss estimated by the reinforcement learning model, the more reasonable it is to change the module. The value of w_tabu for the i-th block increases if the block has been changed frequently, which stabilizes the creation process by discouraging further changes to that block. To shorten the search time and minimize frequent changes, the Epoch term is also proposed. A flowchart and pseudocode outlining the overall approach are shown in Figure 3 and Algorithm 1, respectively.

Reinforcement Learning Network for Supported Genetic Algorithm
The action of the reinforcement learning network (RLNet) is to choose changes that can significantly improve the module. Reinforcement learning changes the gene encoding to change the module's structure. The difference from GA is that GA does not consider improved convergence speed, while reinforcement learning considers the convergence speed for improved performance. By separating these roles, the proposed method can efficiently find fast and effective modules whose structures achieve fast convergence.
In Algorithm 2, RLNet is trained as the training algorithm for Stage I. To begin, determine the required M and the memory capacity N for reinforcement learning. Next, initialize the Q function and the network weights. The memory holds the updated information, the improved loss, and the extracted features as inputs. s_1 sets each state θ_1, which is a feature extracted from the input camera and LiDAR backbones. The DQN-based [56] action A_t is chosen from the CNN operations mentioned above. RLNet training is stopped once the loss is lower than a predetermined threshold. The population can be utilized when the change in loss decreases, and the criterion for this may differ depending on the backbone.
Q is a state-dependent action function, and Equation (8) was constructed to maximize Q. The reward of reinforcement learning is given in Equation (9), which is also constructed to maximize Q. With a small probability, the network acts opposite to r_s. The action that maximizes Q is determined at time t, and it specifies the CNN operation to which the node is changed, A_t.
The action A_t plays a crucial role in calculating rewards and penalties in reinforcement learning. This type of learning involves taking actions and updating rewards and penalties accordingly to achieve a specific goal. In this paper, the goal of RLNet is to determine whether the loss has decreased more or less than expected. To make this determination, the gene module P_{t−1} and the module O_t, both of which are affected by the action, must be saved and evaluated through learning. The reinforcement learning reward in Equation (10) is determined through loss prediction. The purpose of the reinforcement learning network is to improve loss.
$$Rewards = \begin{cases} 1 & \text{if } Decision > 0 \text{ and } Loss(P_t) < Loss(O_t) \\ 0.5 & \text{if } Decision > 0 \text{ and } Loss(P_t) > Loss(O_t) \end{cases} \quad (10)$$
Rewards are determined by comparing the learning loss of the previous module with that of the currently changed module. Reinforcement learning aims to find module configurations that converge faster than those of genetic algorithms; therefore, if the module O_t composed of predictions shows a more significant improvement than the gene module P_t, a reward of 1 is given. If reinforcement learning could predict the loss accurately, it would be desirable to impose a penalty whenever the gene module improves more. However, the goal in this paper is to search for modules efficiently. If a penalty were given just because the prediction module O_t improved less than the gene module P_t, the performance of the reinforcement learning-based module might decline. So, when there is an improvement but the gene module converges faster, only half the reward is given. A penalty is given when the loss is updated without improvement.
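The reward rule of Equation (10) can be written as a small function. Several parts are assumptions and should be read as such: `decision` is a hypothetical scalar that is positive when the loss was updated with an improvement, the penalty value of -1.0 is a placeholder because the source does not state it, and the two `Loss(·)` terms are compared exactly as printed (one reading consistent with the surrounding prose is that they denote loss improvement rather than raw loss).

```python
def reward(decision, loss_term_gene, loss_term_rl, penalty=-1.0):
    """Reward following Equation (10) as printed.

    decision:       hypothetical scalar, > 0 when the update improved the loss.
    loss_term_gene: Loss(P_t), the term for the GA gene module.
    loss_term_rl:   Loss(O_t), the term for the module proposed by RLNet.
    penalty:        assumed value; the source only says a penalty is given.
    """
    if decision > 0 and loss_term_gene < loss_term_rl:
        return 1.0   # full reward
    if decision > 0 and loss_term_gene > loss_term_rl:
        return 0.5   # improvement, but the gene module did better: half reward
    return penalty   # no improvement (ties also fall here in this sketch)

rewards = [reward(1, 0.2, 0.5), reward(1, 0.5, 0.2), reward(0, 0.2, 0.5)]
```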

Search Strategy for Fast Convergence Module
To find a globally optimal solution to an optimization problem, designing the objective function and constraints is important. The problem with NAS is that the search space for the global optimal solution has a wide range of objective functions and constraints, so the search takes a long time. In this study, GA-based NAS was selected to solve problems with fast convergence and performance, but the search space is still enormous. To assist with this, we proposed a NAS algorithm using reinforcement learning. The important point is that if the probability RL(prob) is large, the structure proposed using RLNet is added to the sample, and the optimal solution is then found using a nondominated sort. Since the module proposed via reinforcement learning may not always be nondominated, all proposed outputs must be evaluated. The fitness function evaluates performance, convergence, and frames per second (FPS) and is given in Equation (12).
Loss measures the improvement of the network and includes the classification and box regression terms learned for detection. As learning improves, the loss should become smaller, so we gave it a negative sign so that the fitness improves as the loss becomes smaller. We treat the calculation of the convergence fitness as a prediction, for which two methods were used in parallel. First, we compared the current loss and the previous loss using Equation (13), as used in reinforcement learning.
Our reinforcement learning predicts loss improvement for each work of the CNN literature (1)(2)(3)(4)(5)(6)(7)(8).Our learning strategy is to predict the loss for each configuration change and select the module with the highest change.Convergence(loss, Q) included a small value as a fitness function to control uncertainty in the calculated value.Through this equation, the fitness function can contain the possibility of improvement for the next generation.Lastly, FPS refers to calculation speed, and as the amount of calculation increases, FPS decreases.Our goal is to maximize the fitness function.

Experiment Setup and Evaluation of the Proposed Method
In this section, we first describe our implementation details. Then, we compare our proposed NAS with NSGA-Net and other modules. The comparison with NSGA-Net covers the efficiency and performance of the module creation process. GPU-Days are used to evaluate the efficiency of the creation process.
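As a small illustration of the efficiency metric, GPU-Days are commonly computed in the NAS literature as the number of GPUs multiplied by the wall-clock search time in days. The helper below assumes that convention; the function name and the sample numbers are hypothetical and are not results from this paper.

```python
def gpu_days(num_gpus: int, wall_clock_hours: float) -> float:
    """Search cost in GPU-Days: number of GPUs times elapsed days."""
    return num_gpus * wall_clock_hours / 24.0


# Hypothetical runs: one GPU for 48 hours, and four GPUs for 12 hours.
print(gpu_days(1, 48.0))
print(gpu_days(4, 12.0))
```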

Implementation Details
The search itself is repeated five times with different initial random seeds. We use the EfficientDet [57] loss function to learn the associated weights for each architecture. All experiments are performed on Nvidia 3080Ti GPU cards. The experiments use the Dense dataset [4], and the composition of the train, validation, and test sets is summarized in Table 1. Stage I uses only the T1 dataset, while Stage II uses all datasets. In other words, the T1 dataset is the source data described in the Method Overview Section, and the T1-T6 datasets are the target data. Performance is evaluated using mean average precision (mAP). We apply the PASCAL VOC 11-point interpolation method to compute the average precision (AP) for each class, and the mAP is then the mean of the AP values across all classes. In our case, there are two object class labels, pedestrian and vehicle, so we compute the AP for each class using Equation (14):
$$AP_{obj} = \frac{1}{11} \sum_{r \in \{0, 0.1, \ldots, 1\}} P(r), \quad obj \in label \qquad (14)$$
where label = {vehicle, pedestrian}, obj is the index of class values in the label, and $P$ corresponds to the precision at each interpolated recall $r$. The mAP is computed using Equation (15). The IoU threshold was set to 0.5.
$$mAP = \frac{1}{n} \sum_{obj \in label} AP_{obj} \qquad (15)$$
Here, $n$ is the number of class labels.
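The 11-point interpolated AP and the class-averaged mAP can be sketched in a few lines. The sketch assumes the precision-recall points of each class have already been computed at the IoU threshold of 0.5; the function names and the sample curve are hypothetical.

```python
from typing import Dict, Sequence


def ap_11_point(recalls: Sequence[float], precisions: Sequence[float]) -> float:
    """PASCAL VOC 11-point interpolated AP from paired recall/precision points."""
    total = 0.0
    for k in range(11):
        r = k / 10.0
        # Interpolated precision: the best precision at any recall >= r.
        reachable = [p for rec, p in zip(recalls, precisions) if rec >= r]
        total += max(reachable) if reachable else 0.0
    return total / 11.0


def mean_ap(ap_per_class: Dict[str, float]) -> float:
    """mAP: the mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Hypothetical precision-recall curve for one class.
recalls = [0.2, 0.4, 0.6, 0.8, 1.0]
precisions = [1.0, 1.0, 0.75, 0.6, 0.5]
ap_vehicle = ap_11_point(recalls, precisions)
print(round(ap_vehicle, 4))
print(mean_ap({"vehicle": ap_vehicle, "pedestrian": 0.6}))
```

Recall levels that the detector never reaches contribute zero precision, so a detector that stops at a low recall is penalized even if its precision there is high.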
Figure 5 shows the specific structure of the network used in the experiment, which is based on EfficientDet. EfficientDet proposed a BiFPN layer so that learning converges efficiently and quickly. In this paper, a comparative experiment was conducted by adding a module to the EfficientDet structure for efficient module testing. The overall structure of the object detection network follows EfficientDet, but to prove its effectiveness, experiments were conducted with ResNet [1], EfficientNet [58], and ResT [59] backbones.
To add a module to EfficientDet for module evaluation, the feature map had to be resized to a specific size. We created a feature map of a specific size using the channel attention used in CBAM. Figure 5a illustrates the process, and Figure 5b indicates the specific feature map sizes used in the experiment. In Figure 5b, $k$ is the kernel size of the 2D convolution.
First, starting from the bottom, we calculate $F^{(i)}_{C,L}$, a recalibration process, for a total of three feature maps. The detailed process of $F^{(i)}_{C,L}$ is described in Section 4.1; its outputs were concatenated and input into the BiFPN layer so that the EfficientDet structure could be used without change. EfficientDet extracts five feature maps from the backbone and inputs them into the BiFPN layer, but in this paper, modules were created for three feature maps. When applying GA to find a CNN-based module, the sizes of $W$ and $H$ of the fourth and fifth feature maps were so small that they did not help improve performance. The image input used RGB data, and the LiDAR input used the Range Image proposed by Laser-Net [60], projected to the size of the image. Images were resized to 896 pixels for learning and evaluation.
Table 2 shows the experimental hyperparameters. Search space is the parameter of the initial GA gene block, and the block composition consists of up to six nodes. The initial channel is the conv channel extracted from the backbone, as described in Figure 5b. Learning optimization covers the dropout and update methods used to prevent overfitting; when the dropout rate ($r_{dropout}$) is less than 0.5, overfitting problems often occur. Adam [61] optimization was applied to gradient descent for network learning, and the search strategy is a parameter for the regeneration process of GA. As the population size grew, overfitting or learning instability increased; in our experiments, a population size of 10 was the most stable, and both learning and the network architecture search functioned normally. Crossover and mutation are the probabilities of combining and changing two selected genes, and α, β, and γ weight each element of the fitness evaluation, with the loss weighted most heavily. Lastly, as reinforcement learning parameters, alpha and beta set the ratio of loss and convergence in the reinforcement learning predictions. A batch here is not a batch of images and LiDAR data; it refers to the size of the episode. Notably, the batch was used in reinforcement learning to increase learning stability, but it was not used in NAS module creation or in object detection network learning. To evaluate the creation time (GPU-Days) and the learning convergence time of an object detection network, the influence of the batch must be reduced.

Effectiveness and Efficiency of Proposed Method
Table 3 shows the results of an experiment in which the backbone was changed, comparing the proposed method in the modified EfficientDet-b3 with the existing hand-crafted modules. We evaluated performance, FPS, and convergence speed against the existing modules CBAM and AFAM, which were manually designed by experts. By changing the backbone to ResNet and ResT, we verified whether module search using reinforcement learning generates modules efficiently even when the backbone changes. CBAM [10] is an efficient attention mechanism that combines channel-wise and spatially emphasized features to improve performance in various tasks through refined calibration. Compared with the CBAM module, the proposed module improved performance and reduced network learning time. We also conducted a comparative experiment with the existing research AFAM, which requires contrastive learning and uncertainty calculation. In this paper, only the module's structure was used, without calculating the uncertainty of AFAM. The AFAM module is a fusion of CBAM and MMTM and determines the weight of the camera or LiDAR according to uncertainty to construct a recalibrated feature map. Here, without calculating uncertainty, the features $F_{C,L}$ extracted from each backbone were input to the AFAM module, and two recalibrated feature maps $F'_C$ and $F'_L$ with weights for the two sensors were created. None is a concatenation method without an intermediate module, corresponding to (b) in Figure 1.
In the case of ResNet, concatenation at the intermediate level is more efficient than configuring modules: the experimental results show a performance improvement when modules are added, but the overall learning time increases. EfficientNet-b3 is the best backbone for the EfficientDet structure, so its performance and convergence results were the most efficient. CBAM focuses on the features of each sensor, whereas AFAM recalibrates the backbone features extracted from both sensors. With the EfficientNet backbone, recalibrating the two features performed excellently; based on the experimental results, recalibrating the features of the two sensors improves performance by 0.3%. In the Transformer-based backbone ResT, the MLP-based CBAM and AFAM improved learning performance, but the learning time increased accordingly. Only the modules found through NAS achieved both better performance and faster learning convergence. Unlike with ResNet, we confirmed that the attention module achieves an efficient performance improvement in the Transformer-based backbone.
Table 4 shows the experimental results that verify the effectiveness of the module. In this paper, we designed a module that uses NAS to support object detection networks in adverse weather environments. The comparative experiment used intermediate fusion methods and the EA-based NSGA-NET and EEEA. EEEA proposed an early exit that sets a specific threshold in NSGA-NET when looking for a lightweight model, in order to reduce GPU-Days dramatically. However, when we applied it in our experiments, setting the threshold required performing learning at least once to find a value at which the early exit operates. Otherwise, its GPU-Days did not differ from those of NSGA-NET. Table 4 reports only the best results. In the case of ResNet, the modules found based on genes improved performance like the hand-crafted modules CBAM and AFAM, but the overall learning time increased. In the case of EfficientNet-b3, the attention module improved performance, and the convergence speed also increased. These experiments confirm that the attention module operates most effectively in the EfficientDet structure. Similarly, in ResT, the attention module shows significant results in terms of learning time and performance improvement.
Overall, the experiments confirmed that an attention module on the ResNet backbone improves performance but increases learning time. In the case of EfficientNet and ResT, the attention module improved learning time and performance simultaneously. CBAM shows deviations, with performance either decreasing or increasing, while AFAM has fewer deviations but a longer learning time. Among NAS methods based on a genetic algorithm, NSGA-NET can reliably find modules, but it consumes many GPU-Days, and EEEA was proposed to solve this problem. Nevertheless, with EEEA there were many cases in which no performance improvement was achieved or the improvement was minimal. We judge that, with early exit, a module with good performance and fast convergence cannot be found because of the limitation on parameter size. Therefore, assisting the search with the proposed reinforcement learning method can contribute to reliably finding effective and efficient modules.

Ablation Study
A limitation of NAS research is that network reproducibility is difficult. In this study, the training loss steadily decreased, but the validation loss diverged, causing frequent overfitting. The reason is that learning stability decreased because the proposed method repeated the module search simultaneously with learning. Dropout [62], a regularization technique, is a simple way to stabilize convergence and can increase learning stability. Setting a high dropout rate in the middle layer is adequate for convergence. With the existing AFAM and CBAM, learning was stable regardless of $r_{dropout}$, but learning was unstable in the evolutionary method.
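For readers unfamiliar with the mechanism, the following is a minimal sketch of standard inverted dropout on a list of activations, written in plain Python for illustration. It is not the authors' implementation; the function name, the `rng` argument, and the sample values are assumptions, and `rate` plays the role of $r_{dropout}$.

```python
import random
from typing import List


def dropout(values: List[float], rate: float, rng: random.Random,
            training: bool = True) -> List[float]:
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training:
        return list(values)  # dropout is disabled at evaluation time
    keep = 1.0 - rate
    # Dividing kept values by `keep` preserves the expected activation.
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)
features = [1.0, 2.0, 3.0, 4.0]
dropped = dropout(features, rate=0.5, rng=rng)
print(dropped)
```

A higher rate removes more activations per step, which regularizes more strongly but also slows convergence, consistent with the trade-off between stability and GPU-Days discussed in this section.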
Figure 6 shows the learning stability and performance of the proposed method for each $r_{dropout}$. When $r_{dropout} = 0.1$, performance deteriorates because learning stability is low in a NAS that learns while changing the module configuration. Overall, learning stability is an essential factor in module search. As $r_{dropout}$ increases, learning stability is secured, and a module whose performance meets the purpose can be found. However, at $r_{dropout} = 0.7$, both learning stability and the time to convergence increase, which increases GPU-Days. The learning stability problem, a limitation of NAS research, was thus compensated for with dropout, but it was sometimes impossible to find a module with fast convergence.
Figure 8 shows the mean and variance of mAP, the performance indicator for effectiveness; AFAM shows the smallest deviation. CBAM and AFAM are based on MLP and show only slight deviations, which indicates that an MLP-based module can effectively reduce learning deviation. NSGA and EEEA differ significantly in performance. EEEA limits the parameter size to find efficient modules, so its performance decreases while its learning time is advantageous. The proposed method, on the other hand, shows a mean and deviation similar to those of NSGA. This deviation is a consequence of the wide search space typical of GA-based research. It matters in object detection, where CNN-based modules may exhibit large learning deviations, and it highlights the high probability of overfitting and the need for appropriate regularization techniques such as dropout.
Our research also reflects the complex nature of NAS studies. While they hold promise for high performance, there are instances in which the performance falls short of the baseline network. This variability in learning performance presents a significant challenge. Even with dropout, further research is necessary to identify modules with consistently good performance and to address reproducibility problems.
Next, Table 5 reports an experiment that checks the effect of fusing the reinforcement learning prediction into the gene generation process. Using ResNet and EfficientNet backbones in the EfficientDet structure, we compared the case in which only loss and FPS are included in the fitness with the case in which the predicted convergence is added. The comparison confirmed that accuracy and performance improved slightly. Although the difference in values is not large, it verifies that the reinforcement learning results take part in the gene generation process by influencing the fitness function, and that this yields faster convergence and a slight performance improvement. As the population grows and becomes more diverse, we expect the impact of reinforcement learning to become more significant.

Conclusions
Object detection is a crucial part of autonomous navigation in dynamic environments. Extensive research in this area has led to the development of object detection modules based on sensor fusion. However, these modules require frequent retraining to handle adverse weather conditions and extreme lighting changes, which is challenging. To address these shortcomings, a new method has been proposed that uses a genetic algorithm to ensure diversity and reinforcement learning for optimization. This method improves object detection accuracy and enables module search with fast convergence speed. Experiments conducted on a benchmark dataset show that this proposed

Figure 1. Comparison of the current object detection network frameworks: (a) Object detection with one sensor dataset. (b) Configure each backbone to process two or more features in the middle layer. The features of each backbone are fused at the middle layer. (c) Configure a module using the multi-modal fusion method to process specific features and perform refined calibration.


Figure 2. Number of epochs to convergence. Blue: AFAM; gray: FSL. The network's loss convergence is faster when two different networks add adverse weather data.


Figure 3. Description of the overall structure of the proposed method. (a) Modifying the middle layer in the object detection network. Stage I is the training stage of the reinforcement learning network. Stage II supports the GA's regeneration process using RL in the network module design process. Yellow is the NSGA-NET method. (b) The reinforcement learning training process consists of evaluating action and loss and giving a reward. (c) Designing multi-objective functions with improved performance and convergence.


Figure 4. Encoding: illustration of a middle-layer network encoded by $x = x_s$, where $x_s$ is the connection in the block (gray and blue boxes, each with a possible maximum of 6 nodes). See Section 4.1 for a detailed description of the encoding schemes.

Algorithm 1 is the module creation process for GA-based NAS. Genes are organized into modules, and the fitness functions of the initial genes are calculated. Dominant genes are sampled through binary tournament selection and reproduced up to the given size. Genes are updated by adding modules generated via RLNet or via the crossover and mutation processes within a generation. After breeding, nondominated genes are selected based on the evaluation of the objective function and passed on to the next generation, updating the generation.

Figure 5. Implementation details. The backbone is displayed in the table with the EfficientDet structure. (a) Attention techniques are used as a process to fuse features of different sizes within the middle layer. (b) Size of the experimental feature map.


Figure 6. Experimental results by dropout rate. The proposed method achieved the best performance at $r_{dropout} = 0.5$.

Figure 7 shows the average learning time. If EfficientDet uses only cameras, it converges very quickly. However, as explained later, its performance is lower than the sensor fusion result. Our experiments were based on the simple fusion EfficientDet (C, L). The hand-crafted CBAM and AFAM took a longer overall learning time than NAS. Both modules are based on MLP, so learning takes a lot of time. The modules found through NAS are composed of operations from the CNN literature, enabling efficient learning from image-based data. For this reason, we found a module with fast learning convergence. CNN-based NAS studies such as NSGA have advantages in terms of training time.

Figure 7. Networks' average training time (unit: hours). The red line represents the average, and the box represents the deviation.

Remote Sens. 2024, 16, x FOR PEER REVIEW 18 of 22

Figure 8. Networks' average mAP. The red line represents the average, and the box represents the deviation.

Figure 9a-d show the qualitative results of camera-only EfficientDet, CBAM-EfficientDet, AFAM-EfficientDet, and the proposed method. Camera-based networks and the other modules may produce inaccurate bounding boxes in Day, Clear situations. Since CBAM and AFAM use the MLP-based squeeze and excitation [20] module, errors frequently occurred when the vehicle was truncated. In contrast, since the proposed method generates modules from the 2D CNN literature, it shows robust results. Object detection performs well in both Day and Night foggy situations, which require fusion under adverse weather. In particular, we could confirm that the vehicle's bounding box was accurately found even at night. However, an interesting result is that, in Day and Fog situations, the false detection rate of the 2D CNN literature-based modules increases, as it does for camera-based networks.

Figure 9. Qualitative results of object detection performance. (a) Only camera feature-based object detection result; (b-d) camera-LiDAR fusion-based result; (b) with CBAM module; (c) with AFAM module; (d) proposed module using reinforcement learning and genetic algorithm.


Algorithm 1: Population Strategy of Genetic Algorithm using Reinforcement Learning Network
Input: Max. number of generations G, size of crossover selection K, crossover probability P_c, mutation probability P_m, Reinforcement Learning Network (RLNet)
Output: Parent population PoP
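Since only the header of Algorithm 1 is reproduced here, the following is a simplified sketch of one generation, written under several assumptions: genes are bit strings, fitness is a single scalar rather than the multi-objective nondominated sort, and RLNet is replaced by a stand-in callable `rl_propose`. The names `next_generation`, `rl_prob`, `toy_fitness`, and `toy_rl_propose` are hypothetical.

```python
import random
from typing import Callable, List

Gene = List[int]  # assumed bit-string encoding of a module


def tournament(pop: List[Gene], fitness: Callable[[Gene], float],
               rng: random.Random) -> Gene:
    """Binary tournament: draw two genes and keep the fitter one."""
    a, b = rng.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b


def crossover(a: Gene, b: Gene, rng: random.Random) -> Gene:
    """One-point crossover of two equal-length genes."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def mutate(gene: Gene, p_m: float, rng: random.Random) -> Gene:
    """Flip each bit independently with probability p_m."""
    return [1 - bit if rng.random() < p_m else bit for bit in gene]


def next_generation(pop: List[Gene], fitness: Callable[[Gene], float],
                    rl_propose: Callable[[Gene], Gene], p_c: float, p_m: float,
                    rl_prob: float, rng: random.Random) -> List[Gene]:
    """Create offspring via RL proposal or crossover/mutation, then keep the best."""
    offspring: List[Gene] = []
    while len(offspring) < len(pop):
        if rng.random() < rl_prob:
            # Stand-in for RLNet: propose a modified copy of a selected parent.
            child = rl_propose(tournament(pop, fitness, rng))
        else:
            p1 = tournament(pop, fitness, rng)
            p2 = tournament(pop, fitness, rng)
            child = crossover(p1, p2, rng) if rng.random() < p_c else list(p1)
            child = mutate(child, p_m, rng)
        offspring.append(child)
    # Survivor selection (scalar stand-in for nondominated sorting).
    merged = sorted(pop + offspring, key=fitness, reverse=True)
    return merged[:len(pop)]


def toy_fitness(gene: Gene) -> float:
    return float(sum(gene))  # toy objective: number of active bits


def toy_rl_propose(gene: Gene) -> Gene:
    proposal = list(gene)
    if 0 in proposal:
        proposal[proposal.index(0)] = 1  # toy policy: switch on one inactive bit
    return proposal


rng = random.Random(0)
population = [[rng.randint(0, 1) for _ in range(8)] for _ in range(10)]
new_population = next_generation(population, toy_fitness, toy_rl_propose,
                                 p_c=0.9, p_m=0.1, rl_prob=0.3, rng=rng)
print(max(toy_fitness(g) for g in new_population))
```

Because survivors are chosen from parents and offspring together, the best fitness never decreases between generations in this sketch; the paper's method instead ranks candidates with nondominated sorting over several objectives.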

Table 1. Dataset size used for training, testing, and validation.

Table 3. Result of object detection with different existing modules and backbones on Dense dataset.

Table 4. Evaluating the efficiency and effectiveness of modules created using evolutionary methods.

Table 5. Result of including the prediction loss of reinforcement learning in the genetic algorithm's fitness function. w/: with.
