Adaptive Feature Attention Module for Robust Visual–LiDAR Fusion-Based Object Detection in Adverse Weather Conditions

Abstract: Object detection is one of the vital components used for autonomous navigation in dynamic environments. Camera and lidar sensors have been widely used for efficient object detection by mobile robots. However, they suffer from adverse weather conditions in operating environments such as sun, fog, snow, and extreme illumination changes from day to night. The sensor fusion of camera and lidar data helps to enhance the overall performance of an object detection network. However, the diverse distribution of training data makes the efficient learning of the network a challenging task. To address this challenge, we systematically study the existing visual and lidar feature-based object detection methods and propose an adaptive feature attention module (AFAM) for robust multisensory data fusion-based object detection in outdoor dynamic environments. Given the camera and lidar features extracted from the intermediate layers of EfficientNet backbones, the AFAM computes the uncertainty among the two modalities and adaptively refines visual and lidar features via attention along the channel and the spatial axis. The AFAM integrated with the EfficientDet performs the adaptive recalibration and fusion of visual-lidar features by filtering noise and extracting discriminative features for an object detection network under specific environmental conditions. We evaluate the AFAM on a benchmark dataset exhibiting weather and light variations. The experimental results demonstrate that the AFAM significantly enhances the overall detection accuracy of an object detection network.


Introduction
Autonomous navigation aims to enable safe driving without human intervention. It relies on various element technologies, including simultaneous localization and mapping (SLAM) [1], 3D pose estimation, object classification, detection [2], etc., to perceive and understand the surrounding environment. In particular, object detection plays a crucial role in autonomous navigation by detecting obstacles, pedestrians, and other vehicles on the road and making informed decisions to ensure safe driving [3].
To achieve reliable and consistent performance in object detection, even in a dynamic environment, researchers often propose sensor fusion, a technique that integrates data from multiple sensors. When noise increases in one sensor's data, sensor fusion compensates for performance degradation by combining information from other sensors. However, developing a deep learning network for sensor fusion requires updating the network parameters to extract key features. When the sensor data significantly changes, such as when a different camera is used or the environmental conditions change, the feature extraction process may not be consistent, leading to performance degradation [4]. To address the aforementioned problem, effective sensor fusion must consider the impact of data changes on performance degradation. This involves developing robust feature extraction methods that are resilient to changes in sensor data and environmental conditions. By incorporating these methods, the sensor fusion approach can continue to deliver consistent and reliable performance, even in a dynamic environment, and thus enable safe and effective autonomous navigation.
Numerous methods have been proposed in the past focusing on the development of a multi-sensor fusion-based object detection system [4][5][6][7][8][9][10]. Such systems mainly consist of two major components, i.e., robust feature extraction and data fusion. In a network-based multi-sensor system using deep fusion, each sensor dataset is processed separately by an appropriate network to extract features. For the extraction of robust features, network configurations are used that are optimized for each sensor, i.e., camera and lidar. This approach enables the extraction of robust features from sensor data. Those features are then fused in the middle of the network using various network structures. The convergence of extracted robust features leads to the improved performance of an object detection network.
Following the benefits of a deep fusion approach [9,11], we propose an adaptive feature attention module for an object detection network with which we extract the visual and lidar features from camera and lidar data using respective networks. The extraction of distinctive and robust features is performed using an attention mechanism in the middle layer of the fusion network. This is achieved by selecting the maximum or average value from the channel of the tensor, which reduces high-dimensional features to low-dimensional ones. Those features are then merged in the middle of the network, enabling the sensor data to be converged into a unified feature representation. During the feature space merging process, some features may cause noise during network training. Moreover, in the case of adverse weather conditions, such as fog, snow, sun, or illumination variations from day to night, sensory noise is inevitable. For example, lidar sensor noise increases in case of fog, while visual feature detection is difficult at night in comparison to daytime. In such scenarios, robust feature extraction is a challenging task [12]. To address this issue, the proposed AFAM performs the adaptive filtration of unnecessary features causing noise, as illustrated in Figure 1.
In [13], the authors proposed a feature recalibration method to filter the noisy features while training using data labels, also known as data annotations. However, annotations may not always be suitable for training the network to recognize various data distributions. This is because annotations may not reflect real-world conditions, such as weather or lighting, which can introduce noise to the data. Therefore, relying solely on annotations can result in another source of uncertainty that can impede learning. In order to handle this problem, the proposed method performs self-learning that utilizes robust features extracted from the network to overcome the limitations of annotation and improve the network's performance in recognizing objects under various conditions.
The proposed AFAM-EfficientDet utilizes four EfficientNet backbones for visual and lidar feature extraction, consisting of two pairs known as the source and target networks. The two network pairs differ in their training data. For each network, the lidar point cloud is converted into a dense range image before feature extraction. For efficient convergence, the extracted visual-lidar features are converted into query, key, and value to estimate the correlation between lidar features and their relevant camera features. Given the query, key, and value from the source and target networks, the AFAM first computes the uncertainty between the lidar features of the source and target networks and the camera features of the source and target networks. Based on the computed uncertainty, the AFAM adaptively computes the attention maps along the channel and spatial axis. The extracted camera and lidar features from the target network are recalibrated by element-wise multiplication with attention maps. Finally, the refined camera and lidar features are fused and given as input to the EfficientDet-B3 for object detection.
We evaluated the performance of the proposed method for object detection in adverse weather conditions using the Dense Dataset [12]. To conduct the evaluation, we employed EfficientDet [14] and trained it five times. The robustness of the network's performance was assessed by calculating the difference between the maximum and minimum mean average precision (mAP) and by computing the average and deviation values. The experimental results demonstrate that our proposed method effectively improves the sensor fusion performance of object detection networks in adverse weather conditions, enabling them to operate more robustly in real-world scenarios.
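The robustness protocol described above (repeated training runs, then the spread, average, and deviation of the mAP) can be sketched as follows. This is only a minimal illustration: the function name `summarize_runs` and the sample values are hypothetical placeholders rather than results from this work, and the use of the population standard deviation is an assumption.

```python
import statistics

def summarize_runs(map_scores):
    """Summarize mAP values obtained from repeated training runs."""
    return {
        "range": max(map_scores) - min(map_scores),  # max mAP minus min mAP
        "mean": statistics.mean(map_scores),
        "std": statistics.pstdev(map_scores),        # population deviation (assumed)
    }

# Placeholder mAP values for five runs (hypothetical, not reported results).
example_runs = [0.40, 0.42, 0.41, 0.43, 0.39]
summary = summarize_runs(example_runs)
```

A small range and deviation across runs would indicate that the training outcome is stable with respect to initialization.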
Our main contributions are as follows:
1. We propose an effective adaptive feature attention module (AFAM) that can be widely applied to boost the representation power of CNNs.
2. We validate the effectiveness of our AFAM via ablation studies.
3. We verify that the AFAM outperforms the benchmark network EfficientDet on the benchmark dataset, the Dense Dataset.
The rest of the paper is structured as follows: In Section 2, a literature review of existing camera- and lidar-based object detection methods is presented. Section 3 describes the proposed adaptive feature attention module for the visual-lidar sensor fusion network in detail. Section 4 outlines the experimental setup and the results obtained. Finally, Section 5 concludes this research.

Related Work
In this section, we discuss previous works related to object detection using camera and lidar sensors. Based on the input sensor data used for object detection, the existing literature can be grouped into three main categories: camera-based object detection methods, lidar-based object detection methods, and visual-lidar-based object detection methods. The literature related to each of these categories is explained in the subsequent sections.

Camera-Based Object Detection
Camera-based object detection techniques, such as Fast-RCNN [15,16] and YOLO [17], have been advanced with the aid of contextual information provided in images to detect objects of varying sizes. However, the incorporation of rich contexts poses a challenge to the network training process. To address this problem, feature compression techniques, like squeeze-and-excitation networks [18], and attention methods, such as CBAM [19], have been proposed. In low-visibility conditions, such as fog and darkness, object detection becomes a challenging task. Existing methods employ dehazing techniques for object detection in bad weather and illumination conditions where dim images are transformed into brighter ones, thus enabling better detection [20][21][22][23][24][25]. Though dehazing methods show improved performance, they require the same scene in clear weather conditions, necessitating the use of synthetic data. Moreover, computational efficiency is another challenge for camera-based object detection methods. Recently, transformer-based object detection methods [26][27][28][29] have emerged as a solution to address the computational power required by such methods [28]. Nonetheless, such methods have been crucial to improving object detection performance, particularly in challenging weather and lighting conditions.

Lidar-Based Object Detection
On the other hand, extensive research has been performed to analyze the impact of weather conditions such as fog and high humidity on LiDAR sensor data for object detection problems [30][31][32]. Heinzler et al. [30] artificially induced humidity in a chamber and examined how the data measurements of human and vehicle objects were affected under foggy conditions. It was found that humidity has a significant impact on the distribution of LiDAR data. Object detection using LiDAR data is typically achieved by PointNet [33,34] and voxel-based networks [35][36][37]. PointNet is designed to learn consistent features of LiDAR data, but it struggles to identify invariant features when there is noise in the data. Voxel-based object detection, in turn, suffers from unnecessary data in the grid, which makes network training difficult, especially in high humidity conditions. Moreover, lidar-based object detection methods suffer from the sparse distribution of the point cloud in 3D space, which affects the detection performance. Motivated by the benefits of camera-based object detection, LaserNet [38] generates cylindrical range images using lidar data, allowing for more effective noise removal and greater contextual information extraction via convolution. Moreover, utilizing dense images enables more information to be extracted from the CNN kernel even for the regions where the point cloud is sparse. This method was successful in achieving high detection performance on significantly large datasets; however, its performance degrades when training is insufficient.

Visual-Lidar-Based Object Detection
In the past few years, many multi-sensor fusion-based object detection methods have been proposed to overcome the limitations of a single modality, i.e., camera or lidar [4][5][6][7]9,[12][13][14]39]. The existing fusion architectures can be grouped into three main categories based on the stage at which they merge features from different modalities: early fusion, deep fusion, and late fusion [6]. In early fusion, data from different modalities are combined at the input stage [5,40]. Deep fusion utilizes distinct networks for different modalities and simultaneously integrates intermediate features [4,5,7,11]. Late-fusion-based methods handle each modality separately and merge their outputs at the decision-making level [6,41,42].
The use of multisensory fusion-based methods can lead to good performance, but their effectiveness may decrease when one of the sensors fails to function properly in adverse weather conditions. To address this issue, researchers have proposed the feature switch layer [13] and FIFO [43] to teach distinctive features that are specific to the current environmental conditions. This enables the robust fusion of multisensory data in challenging weather conditions such as fog and low light. However, relying solely on dataset annotations for training the network may not always be ideal, as real-world conditions such as weather and lighting can introduce noise to the data, making the annotations less suitable. This could lead to uncertainty and hinder learning.
Self-supervised learning-based object detection methods can be solutions for such problems [44]. Contrastive learning is utilized to assess uncertainty by evaluating learned and unlearned data. Uncertainty, in this context, pertains to the ability of the network to determine the similarity between the learned data and new data that need to be learned, allowing for an assessment of the data distribution [45].
In this research, we focus on visual-lidar fusion-based object detection and develop the adaptive feature attention module for the deep fusion of extracted features for adverse weather conditions.

Proposed Method
In Section 3.1, we first present the overall multisensory deep fusion network pipeline. Then, we explain the proposed adaptive feature attention module (AFAM) in Section 3.2. Finally, we explain the training of the object detection network with the AFAM in Section 3.3.

Network Pipeline
The proposed method uses EfficientNet [46] as a backbone for feature extraction from camera and LiDAR data. The lidar data exhibit inherent density when observed from the sensor's perspective, but they become sparser upon projection into a 3D space. This sparsity arises due to the constant angular density of the measurements, resulting in a larger number of measurements for objects in close proximity compared to those located further away. Moreover, the coordinate system of the raw lidar point cloud differs from the camera coordinates, which makes visual-lidar fusion-based object detection a complex problem. In order to deal with the aforementioned problem, this research generates a dense range image from the raw lidar data as performed in [38]. The obtained dense image is the range view representation of lidar data and is obtained by projecting the point cloud onto a camera coordinate system. It comprises three channels: depth, height, and intensity. The dense image offers a denser point cloud, allowing the use of a convolutional neural network (CNN) kernel size equivalent to that of a camera. This facilitates the efficient alignment of coordinate systems across different sensors, enabling effective convergence. The proposed method leverages the converted lidar and camera data to achieve its objectives.
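The projection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the conversion used in [38] or in this work: the points are assumed to be already transformed into the camera frame (x right, y down, z forward), a simple pinhole model with a hypothetical intrinsic matrix `K` is used, and any interpolation needed to densify the image is left out. The function name `lidar_to_range_image` is also hypothetical.

```python
import numpy as np

def lidar_to_range_image(points, intensity, K, height, width):
    """Project lidar points into a 3-channel image (depth, height, intensity).

    points: (N, 3) coordinates in the camera frame, in metres (assumed).
    intensity: (N,) return intensities.
    K: (3, 3) pinhole intrinsic matrix (hypothetical calibration).
    """
    points = np.asarray(points, dtype=np.float32)
    intensity = np.asarray(intensity, dtype=np.float32)
    image = np.zeros((height, width, 3), dtype=np.float32)

    # Keep only points in front of the camera.
    front = points[:, 2] > 0
    points, intensity = points[front], intensity[front]

    # Pinhole projection to integer pixel coordinates.
    uvw = points @ np.asarray(K, dtype=np.float32).T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, points, intensity = u[inside], v[inside], points[inside], intensity[inside]

    # When several points fall on one pixel, keep the nearest one.
    order = np.argsort(points[:, 2], kind="stable")
    flat = (v * width + u)[order]
    _, first = np.unique(flat, return_index=True)
    keep = order[first]

    image[v[keep], u[keep], 0] = points[keep, 2]    # depth
    image[v[keep], u[keep], 1] = -points[keep, 1]   # height (camera y axis points down)
    image[v[keep], u[keep], 2] = intensity[keep]    # intensity
    return image
```

Because the output shares the image plane with the camera, the same convolution kernel sizes can be applied to both inputs, which is the property the text relies on.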

Thus, four backbone networks [35], consisting of two pairs known as the source and target networks, are utilized. Each pair comprises a camera and a lidar sensor backbone network. The key distinction between the source and target networks lies in the configuration of their respective training datasets. The source network is trained solely on camera images and LiDAR point cloud data captured during daytime and clear weather conditions [12], while the target network is trained using data captured during clear weather, adverse weather, and illumination conditions. The features $F$ extracted from the source $S$ and target $T$ networks, given as $F_S$ and $F_T$, are expressed as sets of camera $C$ and lidar $L$ features extracted from the source and target networks, as illustrated in Equations (1) and (2):
$$F_S = \{F_S^C, F_S^L\} \tag{1}$$
$$F_T = \{F_T^C, F_T^L\} \tag{2}$$
where $F_S^C$ and $F_T^C$ represent the $n$ camera features extracted from the source and target backbone networks, given in Equation (3), while $F_S^L$ and $F_T^L$ denote the $m$ lidar features extracted from the source and target backbones, as depicted in Equation (4). Thus, features $F$ are of size $W \times H \times D$, where $W$ (width) and $H$ (height) are the spatial dimensions, and $D$ (depth) is the number of channels extracted from the backbone network.
For the efficient convergence of lidar features with their relevant camera features, we employ a cross-attention mechanism [9] that captures correlations between the two modalities in a dynamic manner. The input consists of a voxel cell and its corresponding $N$ camera features. By utilizing three fully connected layers, we individually transform the voxel into a query $Q^L$ and the camera features into key $K^C$ and value $V^C$ vectors. The inner product operation is then applied between the query and keys, resulting in an attention affinity matrix. This matrix encapsulates the $1 \times N$ correlations between the voxel and its associated camera features. To ensure proper weighting, the attention affinity matrix is normalized using a SoftMax operator. Subsequently, this normalized matrix is used to weigh and aggregate the camera feature values $V^C$, which contain relevant camera information. The resultant feature vectors for the source and target networks, $N_S$ and $N_T$, with the corresponding query $Q^L$, key $K^C$, and value $V^C$, depicted in Equations (5) and (6), are given as input to the adaptive feature attention module (AFAM) for the adaptive learning and recalibration of the features based on the computed uncertainty, as explained in Section 3.2.
The AFAM module outputs the fused camera-lidar features, which are given as input to BiFPN [14] for fast multi-scale feature fusion. The fused features are fed to the object class and the box detection head. The BiFPN and detection head are configured following the EfficientDet-B3 [14].
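The cross-attention step for a single voxel can be sketched as follows. This is a minimal NumPy illustration, not the implementation of [9] or of this work: the three fully connected layers are reduced to bias-free weight matrices `w_q`, `w_k`, and `w_v`, no scaling of the inner product is applied because none is mentioned in the text, and all names and dimensions are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1D array."""
    exp = np.exp(x - np.max(x))
    return exp / exp.sum()

def cross_attention(voxel_feature, camera_features, w_q, w_k, w_v):
    """Aggregate N camera features for one lidar voxel.

    voxel_feature: (d_l,) lidar voxel feature.
    camera_features: (N, d_c) camera features associated with the voxel.
    w_q: (d_l, d), w_k: (d_c, d), w_v: (d_c, d) projection weights.
    """
    query = voxel_feature @ w_q        # (d,)   query from the lidar voxel
    keys = camera_features @ w_k       # (N, d) keys from the camera features
    values = camera_features @ w_v     # (N, d) values from the camera features
    affinity = keys @ query            # (N,)   the 1 x N attention affinities
    weights = softmax(affinity)        # normalized affinities
    aggregated = weights @ values      # (d,)   weighted sum of camera values
    return aggregated, weights
```

When all $N$ camera features are identical, the weights are uniform and the output equals the projected value of any single one, which is a quick sanity check for the normalization.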

Adaptive Feature Attention Module (AFAM)
The AFAM takes $N_S$ and $N_T$ as input and compares them to learn the robust features for object detection in adverse weather and illumination conditions. The major components of AFAM include uncertainty computation, adaptive channel attention, and spatial attention, as shown in Figure 3.

Uncertainty Computation
The uncertainty is computed by comparing the query, key, and value of the source $S$ and target $T$ networks with each other. Let $Q_S^L$, in Equation (5), have $n$ queries denoted as $q_S^{1 \ldots n}$, with their corresponding keys and values represented as $k_S^{1 \ldots n}$ and $v_S^{1 \ldots n}$, respectively. Similarly, let $Q_T^L$, in Equation (6), have $m$ queries, with the queries and their corresponding keys and values given as $q_T^{1 \ldots m}$, $k_T^{1 \ldots m}$, and $v_T^{1 \ldots m}$. Each $i$th query from the source network, $q_S^i$, where $1 \le i \le n$, is compared with the $j$th query from the target network, $q_T^j$, where $1 \le j \le m$. This comparison determines the similarity $S$ between $q_S^i$ and $q_T^j$, denoted as $S_Q^i$ when $i = j$, using Equation (7). The variables $i$ and $j$ refer to the indexes of the queries in $Q_S^L$ and $Q_T^L$. Similarly, the similarities for the key, $S_K$, and the value, $S_V$, of the source and target networks are obtained using Equations (8) and (9), with $i$ and $j$ as indexes in the key and value arrays of the source and target networks. The computed similarities $S_Q^{i \ldots n}$, $S_K^{i \ldots n}$, and $S_V^{i \ldots n}$ for the query, key, and value are summed to obtain the overall similarities $S_Q$, $S_K$, and $S_V$ between the lidar data of the source and target networks using Equations (10) and (11).
The similarities $S_Q$, $S_K$, and $S_V$ between the source and target networks are used to compute the uncertainty of the camera and lidar sensor data. For this purpose, $S_K$ and $S_V$ are averaged to represent the combined similarity $S_C$ of the camera data, depicted in Equation (13). Equations (14) and (15) are used to compute the uncertainty for lidar, $U_L$, and camera, $U_C$.
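The flow of this computation can be sketched as follows. The sketch rests on assumptions that the text does not confirm: cosine similarity stands in for the similarity measure of Equations (7) to (9), only matching indexes $i = j$ are compared, and the mapping from similarity to uncertainty is taken to be one minus the summed similarity divided by the number of compared pairs. The function names are hypothetical.

```python
import numpy as np

def cosine(a, b, eps=1e-12):
    """Cosine similarity between two vectors (assumed similarity measure)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def summed_similarity(source, target):
    """Sum the similarities of source/target entries with matching index i == j."""
    n = min(len(source), len(target))
    total = sum(cosine(source[i], target[i]) for i in range(n))
    return total, n

def uncertainties(q_s, k_s, v_s, q_t, k_t, v_t):
    """Return (U_L, U_C) from source and target queries, keys, and values."""
    s_q, n = summed_similarity(q_s, q_t)   # lidar similarity (queries)
    s_k, _ = summed_similarity(k_s, k_t)   # camera similarity (keys)
    s_v, _ = summed_similarity(v_s, v_t)   # camera similarity (values)
    s_c = 0.5 * (s_k + s_v)                # combined camera similarity
    u_l = 1.0 - s_q / n                    # assumed similarity-to-uncertainty mapping
    u_c = 1.0 - s_c / n
    return u_l, u_c
```

Under these assumptions, identical source and target features give an uncertainty near zero for both sensors, while lidar queries that diverge between the two networks raise $U_L$ and leave $U_C$ unchanged.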

Adaptive Channel Attention
The adaptive channel attention module takes the uncertainty values $U_L$ and $U_C$ and the camera and lidar features of the target network, $F_T = \{F_T^C, F_T^L\}$, as input and applies channel attention to the target network features to obtain the refined features for object detection, as shown in Figure 3. Channel attention is applied to the camera and lidar features under the following conditions:

• Case 1: $U_L < U_C$, i.e., the lidar data uncertainty is lower than the camera uncertainty. In this condition, max pooling is applied to the lidar features $F_T^L$, while the camera features $F_T^C$ are average pooled.
• Case 2: $U_C < U_L$, i.e., the similarity between the camera features of the source and target networks is higher than that between the lidar features. In this case, max pooling is applied to the camera features $F_T^C$, while the lidar features $F_T^L$ are average pooled.
The rationale behind applying max pooling to features from sensors with low uncertainty is that regions in the feature vector with high values indicate a higher likelihood of object presence. Hence, when a sensor has low uncertainty, its extracted features are deemed more reliable, and max pooling is employed. Conversely, for sensors with high uncertainty, average pooling is applied to their features. This is because feature vectors extracted from uncertain sensors are expected to contain more noise, and averaging is used to filter out such noise. In situations where noise is prominent, it is common practice to employ averaging or outlier detection for data filtering.
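The two cases above can be sketched as a small selection routine. This is an illustrative NumPy sketch rather than the authors' code: the function name `adaptive_pool` is hypothetical, the first two array axes are treated as the spatial dimensions, and the tie $U_L = U_C$, which the text does not cover, is handled like Case 2 purely as an assumption.

```python
import numpy as np

def adaptive_pool(camera_feat, lidar_feat, u_c, u_l):
    """Pool each modality over its spatial axes according to its uncertainty.

    camera_feat, lidar_feat: arrays of shape (W, H, D).
    Returns (camera_descriptor, lidar_descriptor), each of shape (D,).
    """
    if u_l < u_c:
        # Case 1: lidar is more reliable -> max pool lidar, average pool camera.
        lidar_desc = lidar_feat.max(axis=(0, 1))
        camera_desc = camera_feat.mean(axis=(0, 1))
    else:
        # Case 2 (and ties, by assumption): max pool camera, average pool lidar.
        camera_desc = camera_feat.max(axis=(0, 1))
        lidar_desc = lidar_feat.mean(axis=(0, 1))
    return camera_desc, lidar_desc
```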
After aggregating the spatial information of feature maps using adaptive max and average pooling, the squeeze-and-excitation [18] method is applied to dynamically recalibrate the channel-wise feature responses. This process aims to extract distinctive features while suppressing less informative ones. Specifically, to enhance features from sensors with low uncertainty, a higher compression ratio, referred to as "hard squeeze," is employed. This higher compression ratio helps preserve robust features, allowing the network to focus on effectively learning them. Conversely, for sensor data with high uncertainty, average pooling is applied. Applying a high compression ratio to such data would lead to a significant reduction in feature values, resulting in decreased object detection performance. Hence, a "soft squeeze and excitation" approach with a lower compression ratio is utilized for sensor data with higher uncertainty.
The channel attention outputs the 1D channel attention maps, $M_C$ and $M_L$, for camera and lidar features. Each map is of size $1 \times 1 \times D$, where $D$ is the channel depth. The attention map is merged with the input features, $F_T^C$ and $F_T^L$, using element-wise multiplication, generating the refined features $F'_C$ and $F'_L$, as given in Equations (16) and (17):
$$F'_C = M_C \otimes F_T^C \tag{16}$$
$$F'_L = M_L \otimes F_T^L \tag{17}$$
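The squeeze-and-excitation step and the element-wise recalibration can be sketched as follows. This is a minimal NumPy illustration: the two-layer bottleneck with ReLU and sigmoid follows the general squeeze-and-excitation design [18], but the reduction ratios (16 for the "hard" squeeze, 4 for the "soft" squeeze), the random weights, and all function names are hypothetical and not taken from this work.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def make_se_weights(depth, ratio, rng):
    """Create bottleneck weights; a larger ratio gives a harder squeeze."""
    hidden = max(depth // ratio, 1)
    w1 = rng.standard_normal((depth, hidden)) * 0.1
    w2 = rng.standard_normal((hidden, depth)) * 0.1
    return w1, w2

def channel_attention(descriptor, w1, w2):
    """Map a pooled (D,) descriptor to a (D,) attention map with values in (0, 1)."""
    hidden = np.maximum(descriptor @ w1, 0.0)  # squeeze + ReLU
    return sigmoid(hidden @ w2)                # excitation

def refine(features, attention):
    """Element-wise multiplication; the (D,) map broadcasts over (W, H, D)."""
    return features * attention

# Hypothetical ratios: hard squeeze for the low-uncertainty sensor, soft otherwise.
HARD_RATIO, SOFT_RATIO = 16, 4
```

With a depth of 16 channels, the hard squeeze compresses the descriptor to a single hidden unit while the soft squeeze keeps four, which illustrates how much more information the soft variant retains.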

Adaptive Spatial Attention
In order to capture the interspatial relationships of features, a spatial attention map is generated. This spatial attention differs from channel attention as it focuses on determining the informative regions. To compute the spatial attention, we perform average-pooling and max-pooling operations along the channel axis on the refined features $F'_C$ and $F'_L$. The obtained spatial attention maps for camera and lidar features, $M^C_{Spatial}(F'_C)$ and $M^L_{Spatial}(F'_L)$, are merged with the input refined features by element-wise multiplication using Equations (18) and (19), resulting in efficient feature descriptors $F''_C$ and $F''_L$:
$$F''_C = M^C_{Spatial}(F'_C) \otimes F'_C \tag{18}$$
$$F''_L = M^L_{Spatial}(F'_L) \otimes F'_L \tag{19}$$
The obtained features $F''_C$ and $F''_L$ are concatenated, resulting in fused robust visual-lidar features $F_{C,L}$.
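The spatial attention and the final concatenation can be sketched as follows. This is a simplified NumPy illustration under stated assumptions: the channel-wise average and max maps are combined by a weighted sum followed by a sigmoid, whereas CBAM [19] uses a convolution over the two stacked maps; the merge with the features is taken to be element-wise multiplication; and the names and weights are hypothetical.

```python
import numpy as np

def spatial_attention(features, w_avg=0.5, w_max=0.5, bias=0.0):
    """Compute a (W, H) attention map from (W, H, D) features.

    The weighted sum is a stand-in for the learned convolution used in CBAM.
    """
    avg_map = features.mean(axis=-1)   # average pooling along the channel axis
    max_map = features.max(axis=-1)    # max pooling along the channel axis
    return 1.0 / (1.0 + np.exp(-(w_avg * avg_map + w_max * max_map + bias)))

def apply_spatial(features, attention):
    """Element-wise multiplication; the map broadcasts over the channel axis."""
    return features * attention[..., None]

def fuse(camera_features, lidar_features):
    """Concatenate refined camera and lidar features along the channel axis."""
    return np.concatenate([camera_features, lidar_features], axis=-1)
```

The fused tensor keeps the spatial size of its inputs and has as many channels as the camera and lidar features combined.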

Training with AFAM
This subsection presents the training process of the object detection network using AFAM-EfficientDet. Table 1 lists the dataset traverses exhibiting different weather and illumination conditions and the number of samples used for training, testing, and validation from each of the traverses. Algorithm 1 illustrates the overall training process using AFAM-EfficientDet. First, the source and target networks are trained using $T_1$, which consists of data captured during daytime and clear weather. Randomized initial weights are used for each of the backbones. This training continues as long as the obtained loss $Loss_{total}$ is above the loss threshold, whose value is the same as in [13]. Once $Loss_{total}$ falls below the threshold, the network starts training with the AFAM module, and the backbone networks are given camera and lidar data from $T_{2-6}$. The extracted features from the source and target networks are given as input to the AFAM. Based on the computed uncertainty, the target network's features $F_T$ are refined. The target network is trained using the recalibrated features. The feature recalibration with AFAM is given in Algorithm 2.
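The two-phase schedule can be sketched as follows. This is only an illustration of the switching logic, not Algorithm 1 itself: the default threshold of 0.5 is the total-loss value reported in the implementation setup, the assumption that the AFAM phase stays active once it has been triggered is ours, and the function names and loss values are hypothetical.

```python
def training_schedule(loss_history, threshold=0.5):
    """Return the phase used in each epoch, given the loss observed after it.

    Epochs run in the "pretrain" phase until a loss below the threshold has
    been observed; all later epochs run in the "afam" phase (assumed latching).
    """
    phases = []
    afam_enabled = False
    for loss in loss_history:
        phases.append("afam" if afam_enabled else "pretrain")
        if loss < threshold:
            afam_enabled = True
    return phases

def traverses_for(phase):
    """Clear daytime data (T1) for pretraining, T2 to T6 once the AFAM is active."""
    return ["T1"] if phase == "pretrain" else ["T2", "T3", "T4", "T5", "T6"]

# Hypothetical per-epoch total losses.
example_phases = training_schedule([0.9, 0.7, 0.45, 0.6])
```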

Implementation Setup, Dataset, and Evaluation Parameters
This section discusses the experiments performed to evaluate the performance of the proposed network.All experiments are carried out on Intel core i7-9700, NVIDIA GeForce RTX 3080 using PyTorch library.
Addressing the object detection problem in adverse weather conditions and light changes, the proposed method is evaluated on publicly available benchmark dataset, i.e., the Dense Dataset [12].This dataset is captured using a stereo camera, Velodyne 64ch LiDAR, and a radar exhibiting extreme light changes from day to night and weather conditions including clear weather, fog, and snow.
The open-source implementation of EfficientNet [47] is utilized as the backbone. The AFAM is activated once the total loss reaches a particular threshold; in our experiments, AFAM learning was started when the total loss value was 0.5. During training, the images were resized to 896 × 896 pixels. The training was carried out for a maximum of 30 epochs, and the final model was selected based on the best performance on the validation dataset. The object detection network is trained for two semantic classes: pedestrian and vehicle.
The performance is evaluated using the mean average precision (mAP). We apply the PASCAL VOC 11-point interpolation method to compute the average precision (AP) for each class; the mAP is then obtained by averaging the AP over all classes. In our case, there are two object class labels, i.e., pedestrian and vehicle, so we compute the AP for each class using Equation (20), where label = {vehicle, pedestrian}, i is the index of the class in label, and P corresponds to the precision at each interpolated recall r. The mAP is computed using Equation (21), where n is the number of class labels. The IoU threshold was set to 0.5.
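The evaluation metric can be illustrated with a short sketch of the standard PASCAL VOC 11-point interpolation, in which the AP is the mean of the interpolated precision at the recall levels 0, 0.1, ..., 1.0, and the interpolated precision at level r is the maximum precision observed at any recall greater than or equal to r. The function names and the toy precision-recall values below are hypothetical and are not results from the paper; matching detections to ground truth at an IoU of 0.5 is assumed to have happened before these functions are called.

```python
def average_precision_11pt(recalls, precisions):
    """PASCAL VOC 11-point interpolated AP for one class.

    recalls, precisions: equal-length lists describing the class's
    precision-recall curve (one point per detection rank).
    """
    total = 0.0
    for k in range(11):
        r = k / 10.0
        # Interpolated precision: best precision at any recall >= r (0 if none).
        candidates = [p for rec, p in zip(recalls, precisions) if rec >= r]
        total += max(candidates) if candidates else 0.0
    return total / 11.0

def mean_average_precision(ap_per_class):
    """Average the per-class AP values over the n class labels."""
    return sum(ap_per_class.values()) / len(ap_per_class)

# Toy curves (made-up values): three ranked detections for one class,
# and a perfect detector for the other.
ap = {
    "vehicle": average_precision_11pt([0.5, 0.5, 1.0], [1.0, 0.5, 2.0 / 3.0]),
    "pedestrian": average_precision_11pt([1.0], [1.0]),
}
print(round(ap["vehicle"], 4), ap["pedestrian"], round(mean_average_precision(ap), 4))
```

For the toy vehicle curve, the interpolated precision is 1.0 at the six recall levels up to 0.5 and 2/3 at the remaining five, giving an AP of 28/33.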

Ablation Study
Table 2 reports the experimental outcomes of varying the configuration of the AFAM, assessing the performance for different choices of the query (Q), key (K), and value (V). We posit that max pooling is effective in extracting appropriate features when there is a considerable shift in the data, while average pooling is preferable when there is minimal variation in the data distribution. The table presents the results of varying the configuration of Q when the LiDAR uncertainty value is high. The results indicate that max pooling leads to better performance than average pooling when the overall uncertainty is high. Additionally, the table lists the effect of using hard and soft squeeze with max and average pooling. It can be observed that combining max pooling with squeeze and excitation at a relatively small ratio (soft squeeze) yields less information loss. In summary, the experimental results support the chosen configuration of the AFAM.
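The pooling and squeeze choices compared in this ablation can be illustrated with a generic squeeze-and-excitation sketch. It is not the AFAM itself: all function names are hypothetical, the weights are random stand-ins for learned parameters, and the ratios 2 and 4 are toy values chosen for a 4-channel example rather than the ratios studied in the paper. A small ratio keeps a wider bottleneck (soft squeeze), while a large ratio compresses the channel descriptor more aggressively (hard squeeze).

```python
import math
import random

def squeeze(feat, mode="max"):
    """Global pooling per channel: feat is [C][H][W] -> list of C descriptors."""
    pooled = []
    for channel in feat:
        values = [v for row in channel for v in row]
        pooled.append(max(values) if mode == "max" else sum(values) / len(values))
    return pooled

def make_weights(channels, ratio, seed=0):
    """Random stand-in weights; a trained module would learn these."""
    rng = random.Random(seed)
    hidden = max(1, channels // ratio)  # bottleneck width set by the squeeze ratio
    w_reduce = [[rng.uniform(-1, 1) for _ in range(channels)] for _ in range(hidden)]
    w_expand = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(channels)]
    return w_reduce, w_expand

def excite(descriptor, w_reduce, w_expand):
    """Bottleneck MLP: C -> C // ratio (ReLU) -> C (sigmoid gates)."""
    hidden = [max(0.0, sum(w * d for w, d in zip(row, descriptor))) for row in w_reduce]
    return [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
            for row in w_expand]

def recalibrate(feat, ratio, mode="max", seed=0):
    """Scale every channel of feat by its excitation gate."""
    gates = excite(squeeze(feat, mode), *make_weights(len(feat), ratio, seed))
    return [[[v * g for v in row] for row in channel] for channel, g in zip(feat, gates)]

# Toy 4-channel, 2x2 feature map (illustrative values only).
features = [[[float(c + i + j) for j in range(2)] for i in range(2)] for c in range(4)]
soft = recalibrate(features, ratio=2)  # wider bottleneck: 4 -> 2 -> 4
hard = recalibrate(features, ratio=4)  # narrower bottleneck: 4 -> 1 -> 4
print(squeeze(features, "max"), squeeze(features, "avg"))
# [2.0, 3.0, 4.0, 5.0] [1.0, 2.0, 3.0, 4.0]
```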
To determine the optimal compression ratio, further experiments were conducted; the results are detailed in Table 3. The table presents the outcomes of varying the squeeze-and-excitation ratio, denoted R_hard for hard squeeze and excitation and R_soft for soft squeeze and excitation. The best results were achieved when compressing the output channels by 10× or 20×. This experiment highlights the challenge of finding appropriate hyperparameters to effectively utilize the AFAM.
Figure 4 demonstrates that cameras struggle to identify objects in foggy conditions. Although captured during daytime, the data distribution in snow differs from that of clear sunny days. In sunny weather, the values are closely clustered around 80, whereas in snowy conditions the distribution shows a significant variance around 80. In the case of fog, the data values are clustered around 120. Because of these changes in data distribution, the camera and LiDAR data are difficult to use for network training unless the input data of the network are refined first.
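The kind of distribution comparison discussed for Figure 4 can be sketched with a small helper that summarizes intensity values by their histogram, modal bin, mean, and variance. The function name `intensity_summary`, the bin width, and the two toy samples are assumptions made for illustration; the values are not taken from the dataset.

```python
from collections import Counter

def intensity_summary(pixels, bin_width=10):
    """Summarize 8-bit intensities: histogram, modal bin start, mean, variance."""
    histogram = Counter((p // bin_width) * bin_width for p in pixels)
    modal_bin = max(histogram, key=histogram.get)
    mean = sum(pixels) / len(pixels)
    variance = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    return histogram, modal_bin, mean, variance

# Toy samples (made-up): both are centred on 80, but one is far more spread out.
tight = [78, 80, 80, 82]
spread = [40, 80, 80, 120]
_, tight_mode, tight_mean, tight_var = intensity_summary(tight)
_, spread_mode, spread_mean, spread_var = intensity_summary(spread)
print(tight_mode, tight_mean, tight_var)     # 80 80.0 2.0
print(spread_mode, spread_mean, spread_var)  # 80 80.0 800.0
```

Two samples can share the same modal bin and mean while differing strongly in variance, which is the kind of distribution shift that motivates refining the features before fusion.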

Comparison with Other Methods
This section compares the proposed AFAM-EfficientDet with state-of-the-art object detection networks using single and multiple modalities: EfficientDet [14] using only the camera sensor, EfficientDet with camera-lidar fusion, EfficientDet with a feature switch layer (FSL) [13], and EfficientDet with ResT [48]. EfficientDet is built to optimize network efficiency in terms of computational cost and robust feature fusion; an open-source implementation of EfficientDet has been used. The FSL is designed for object detection in adverse weather conditions: based on the dynamic environmental conditions, it extracts and fuses visual-lidar features for robust object detection. Both aforementioned methods and our proposed method use the same backbone network, EfficientNet [47]. ResT is used as a backbone network and is integrated with EfficientDet for class prediction. For a fair comparison, we trained the backbone networks on the same dataset traverses T_{1-6}, as explained in Table 1, with their default configurations.
Table 4 lists the results obtained for the four comparisons performed to evaluate the effectiveness of the proposed AFAM. The performance is evaluated based on the highest mean average precision (mAP) and the variance recorded over five training experiments. Top5-mAP denotes the mAP obtained for all five experiments, Top1-mAP the best mAP, and Worst-mAP the minimum mAP obtained among the five experiments. Variance is the difference between Top1-mAP and Worst-mAP.

1. Comparison with baseline method (only camera features): First, we evaluated the AFAM with camera-only features. EfficientDet takes raw features from the backbone network as input and predicts the object classes. In contrast, the AFAM, when embedded in EfficientDet, refines the features and thus makes the object detection network more robust. It is observed that using refined features enhances object detection performance.

2. Comparison with baseline method (visual-lidar fusion): Second, the multimodal EfficientDet, which performs deep fusion of camera and lidar features for object detection, is compared with the AFAM-EfficientDet. The AFAM, which provides a more robust deep fusion of visual-lidar features, achieves a higher mAP than the multimodal EfficientDet.

3. Comparison with different network architecture: Here, we present the evaluation results when the AFAM is ported to a different network architecture. For this purpose, we first compute the results for ResT-EfficientDet, in which ResT is used as the backbone network for visual-lidar feature extraction and the extracted features are given as input to EfficientDet for class prediction. To assess the effectiveness of the AFAM, we replaced the EfficientNet backbone with ResT. The AFAM takes raw features from ResT as input, performs feature recalibration and fusion, and then gives the fused visual-lidar features as input to the BiFPN layer of EfficientDet. Because the backbone network is changed, the input features are different, resulting in a change in performance, which can be observed in Table 4. Nevertheless, the AFAM-ResT-EfficientDet outperforms the ResT-EfficientDet.

4. Comparison with feature refinement method: Finally, we compare the AFAM with the FSL. Both modules are integrated with EfficientDet. They take the same features as input, recalibrate them, and output refined visual-lidar features for class prediction. FSL-EfficientDet achieves the highest Top1-mAP, as it employs annotations for environment learning, while AFAM-EfficientDet learns adaptively from the environment's dissimilar appearance and computes the uncertainty. In dense fog, snow, and light changes from day to night, camera-lidar-based object detection performance degrades significantly. Owing to its adaptive learning approach, AFAM-EfficientDet achieves the lowest variance, i.e., the smallest difference between Top1-mAP and Worst-mAP, when the environment changes significantly due to illumination changes or adverse weather. In contrast, the annotation-dependent FSL-EfficientDet fails to deliver high performance under challenging weather conditions such as dense fog or extreme light changes, resulting in increased variance. Thus, the AFAM enables the object detection network, EfficientDet in this case, to achieve greater robustness in adverse weather conditions.
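The run-level summary used in Table 4 can be expressed in a few lines. This is a minimal sketch with a hypothetical function name and made-up mAP values; note that "variance" here follows the paper's definition, i.e., the difference between the best and worst run, rather than the statistical variance.

```python
def summarize_runs(map_scores):
    """Summarize repeated training runs by best mAP, worst mAP, and their gap."""
    top1 = max(map_scores)
    worst = min(map_scores)
    # "variance" as defined in the text: Top1-mAP minus Worst-mAP.
    return {"top1_map": top1, "worst_map": worst, "variance": top1 - worst}

# Five illustrative runs (made-up values, not results from the paper).
summary = summarize_runs([0.5, 0.25, 0.75, 0.5, 0.5])
print(summary)  # {'top1_map': 0.75, 'worst_map': 0.25, 'variance': 0.5}
```

A method with a slightly lower Top1-mAP but a much smaller gap is more consistent across training runs, which is the sense in which robustness is compared here.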
Figure 5a-e illustrates the qualitative results of the object detection performed by the proposed AFAM-EfficientDet in comparison to EfficientDet using only camera features, EfficientDet using multimodal fusion, and FSL-EfficientDet. The results show that object detection performance is good during daytime in clear weather (Day, Clear). Using only the camera leads to decreased detection rates and missed detections in foggy conditions (Day, Fog), as shown in Figure 5a. Fog at nighttime (Night, Fog) is even more challenging, as camera features cannot be detected due to illumination variation. Multimodal fusion improves detection, although poor convergence resulted in the performance deterioration illustrated in Figure 5c. In contrast, the visual-lidar fusion methods shown in Figure 5d,e achieve good detection performance even in fog and crowded situations, with the proposed method performing more robustly than FSL-EfficientDet. Moreover, the use of a multimodal fusion layer was found to enhance performance in all scenarios, surpassing the use of a single sensor. Interestingly, the network's performance improved in foggy daytime conditions, indicating that it compensated for the limitations of image-based detection in fog. Figure 5f,g illustrates the qualitative results of object detection using ResT as the backbone with EfficientDet. When the features are recalibrated using the AFAM, the performance of ResT-EfficientDet is enhanced. These results indicate the portability of the AFAM.


Conclusions
Object detection is crucial for autonomous navigation in dynamic environments. Extensive research has been performed in this field, presenting single- and multi-sensor fusion-based object detection. However, adverse weather conditions and extreme illumination changes pose challenges for both camera and lidar sensors. This research presents a systematic study of the existing methods using cameras, lidar, and a fusion of both sensors for object detection. To address the shortcomings of the previous literature, this research proposes an adaptive feature attention module (AFAM) that leverages the fusion of camera and lidar data and performs efficient object detection. The AFAM computes the uncertainty between modalities and adaptively refines visual and lidar features using attention mechanisms along the channel and spatial axes. Integrated with the EfficientDet framework, the AFAM enhances object detection accuracy by effectively filtering noise and extracting discriminative information for object detection in specific environmental conditions. To evaluate the AFAM's effectiveness, we conducted experiments on a benchmark dataset that exhibits variations in weather and lighting conditions. The evaluation results demonstrate a significant improvement in the overall detection accuracy of the object detection network when the AFAM is employed, thus outperforming state-of-the-art methods. This research focuses on enhancing the performance of neural networks for object detection via sensor fusion, offering practical implications for real-world scenarios.
The AFAM contributes to the adaptive learning of distinctive features for object detection in adverse weather conditions. In the future, we aim to extend this work to object detection under extreme seasonal changes along with varying weather and lighting conditions, and to evaluate it on benchmark datasets with diverse environments in order to generalize the model so that it is applicable in any real-world environment. Furthermore, extending this research from static object detection to dynamic object tracking is another future direction.

Figure 1. Noise features occur when the two modalities are simply merged in feature space, which makes object detection difficult. Noise features tend to occur when both sensors are weak.


Figure 2 illustrates the working pipeline of the proposed AFAM-EfficientDet network. The network takes lidar and image data as input and generates the prediction scores and semantic labels for the detected objects in the current view.


Figure 2. The network pipeline of the proposed AFAM-EfficientDet network.


Figure 3. Overview of the AFAM. The module has three sequential sub-modules: uncertainty computation, adaptive channel attention, and spatial attention. The module is embedded in EfficientDet-B3. The extracted camera and lidar features are refined and fused by the AFAM.


Figure 4. Histogram distribution based on weather for daytime data.

Figure 5. Qualitative results of object detection performance. (a,b) Object detection using only camera features; (c-e) object detection using visual-lidar deep fusion; (f,g) integration of the AFAM in a different network architecture.


Table 1. Dataset size used for training, testing, and validation.

Algorithm 1: Object Detection Network Training with AFAM-EfficientDet

Table 2. Different configurations used for the AFAM.

Table 3. Experimental results obtained for the compression ratio.

Table 4. Performance comparison of the proposed method with SOTA algorithms.