Human-like Attention-Driven Saliency Object Estimation in Dynamic Driving Scenes

Abstract: Identifying a notable object and predicting its importance in front of a vehicle are crucial for automated systems' risk assessment and decision making. However, current research has rarely exploited the driver's attentional characteristics. In this study, we propose an attention-driven saliency object estimation (SOE) method that uses the attention intensity of the driver as a criterion for determining the salience and importance of objects. First, we design a driver attention prediction (DAP) network with a 2D-3D mixed convolution encoder–decoder structure. Second, we fuse the DAP network with faster R-CNN and YOLOv4 at the feature level and name them SOE-F and SOE-Y, respectively, using a shared-bottom multi-task learning (MTL) architecture. By transferring the spatial features onto the time axis, we are able to eliminate the drawback of the bottom features being extracted repeatedly and achieve a uniform image-video input in SOE-F and SOE-Y. Finally, the parameters in SOE-F and SOE-Y are classified into two categories, domain invariant and domain adaptive, and then the domain-adaptive parameters are trained and optimized. The experimental results on the DADA-2000 dataset demonstrate that the proposed method outperforms the state-of-the-art methods in several evaluation metrics and can more accurately predict driver attention. In addition, driven by a human-like attention mechanism, SOE-F and SOE-Y can identify and detect the salience, category, and location of objects, providing risk assessment and a decision basis for autonomous driving systems.


Introduction
There are various objects surrounding vehicles, including pedestrians, other vehicles, signal signs, buildings, and billboards. Not all of these objects affect the vehicle equally, so different objects have varying effects on the driver. Environment perception relies on the detection, tracking, and trajectory prediction of potentially hazardous targets in complex traffic scenes. However, general algorithms treat pedestrians, vehicles, and other obstacles as individual targets for detection and recognition, so they are incapable of highlighting the targets that are currently potentially hazardous. Consequently, assessing the importance of objects to the ego-vehicle enables autonomous systems to perform risk assessment and decision control like a real driver [1]. Allowing the system to automatically learn and predict the driver's attention distribution and to detect important targets through human-like perception is regarded as a powerful approach.
The human visual system can rapidly locate regions of interest and objects in the visual field [2]. Accordingly, a skilled driver is able to quickly identify various objects and their motion states in a traffic scene, thereby identifying the most influential objects or potential risk information in a timely manner [3]. Therefore, research on how to learn and predict driver attention and detect important objects with human-like perception in an automated manner has become a hot topic in advanced driver assistance systems (ADAS) and autonomous driving, given that driver attention intensity is used as a criterion for object importance.
In driving tasks, driver attention prediction (DAP) is a concrete implementation of video saliency prediction (VSP) [4]. VSP aims to predict the viewer's point of gaze during free viewing and emphasizes the task-free state. In contrast, the driving task strongly influences the driver's attentional features. Many researchers [5][6][7][8] have collected and published datasets and models for DAP tasks. Although these works have effectively extracted and utilized spatio-temporal features using long short-term memory (LSTM) and 3D convolution and have established advanced DAP models, further advancements are necessary. These models mostly suffer from the drawbacks of repeated extraction of continuous frame features, large size, and high computational effort. In addition, these works place a greater emphasis on saliency prediction at the region level but fail to directly estimate and identify salient objects in dynamic scenes. Therefore, it is necessary to develop a novel DAP model that is lightweight, portable, and capable of forming joint networks with object detection tasks.
As shown in Figure 1c, drivers will pay attention to objects they consider important. Similarly, in Figure 1d, the human-like saliency object estimation (SOE) function for predicting the saliency, category, and location of each object can be achieved by fusing the driver attention map in Figure 1c with the results of the object detection in Figure 1b. Traditional deep learning methods involve single-task learning; for the SOE task, it is easier to decompose it into two independent subtasks, DAP and object detection, and then combine them at the object level. The disadvantage of the object-level fusion method, which only fuses the predicted driver's attention with the detected objects, is repeated feature extraction [9]. Since the DAP and object detection tasks in driving scenarios share the same data source domain and feature extraction method, the only difference between them is the objective of task optimization. For such interrelated subproblems, we can therefore link them via shared factors or a shared representation. Multi-task learning (MTL) [10] and domain adaptation theory [11] are learning paradigms that focus on utilizing shared representations, with MTL focusing on correlations between different task goals for the same data and domain adaptation theory focusing on correlations between the same task goals for different datasets. We utilize the shared-bottom framework in MTL to construct the SOE network. The input for the generic object detection network is a single-frame image, whereas the input for our proposed DAP network is a video clip. Therefore, even though the inputs are both from the driving scene domain, we use domain adaptation theory to partition the network parameters to make them more adaptive for the image-video data.
To address these issues, we use the bottom feature extraction networks of faster R-CNN [12] and YOLOv4 [13] as the shared bottom of the DAP task, named SOE-F and SOE-Y, respectively, in order to achieve SOE based on feature-level fusion. For the DAP task, we employ a 2D-3D mixed convolution-based U-shaped encoder-decoder DAP network. The encoder is composed of 2D convolutional layers, allowing the network to share the bottom layer with any cutting-edge object detection network. In addition, we transfer the historical frame features extracted by the 2D encoder in the time dimension, concatenate them with the current frame features, and then feed them to the 3D decoder in order to predict the driver attention map. Then, in accordance with domain adaptation theory, we perform training and optimization using the shared bottom as the domain-invariant (shared) parameters and the top layer of the network as the domain-adaptive (private) parameters. Finally, we combine the driver attention map and the results of the object detection to fuse attention, category, and location data in order to determine the salience and significance of the objects.
In summary, the main contributions of this article are as follows:
(1) Inspired by the human attention mechanism, we propose a human-like attention-driven SOE method based on a shared-bottom multi-task structure in dynamic driving scenes that can predict and detect the saliency, category, and location of objects in real time.
(2) We propose a U-shaped encoder-decoder DAP network that is capable of performing feature-level fusion with any object detection network, achieving good portability and avoiding the disadvantage of repeatedly extracting bottom-level features.
(3) We combine faster R-CNN and YOLOv4 with DAP to create SOE-F and SOE-Y, respectively. The experimental results on the DADA-2000 dataset demonstrate that our method can predict driver attention distribution and identify and locate salient objects in driving scenes with greater accuracy than competing methods.

Driver Attention Prediction
Both DAP and VSP aim to predict the location and intensity of human visual attention. With the development of deep learning, most current research focuses on identifying more efficient network structures and spatio-temporal information extraction techniques for video data. Lai et al. [14] proposed STRA-Net, which uses a convolutional gated recurrent unit (convGRU) to pass long-term temporal information between the preceding and subsequent frames and dense residual cross connections to combine motion and static features at multiple scales. TASED-Net [15] is a full 3D convolutional model capable of extracting spatio-temporal coupling features from video clips simultaneously. Palazzi et al. [16] published the driver attention dataset DR(eye)VE using eye-movement data collected from drivers while driving in real vehicles. To predict driver attention, they proposed a multi-branch network consisting of image branches, optical flow branches, and semantic branches. However, this network's multi-branch structure makes it complex and computationally intensive. Deng et al. [8] produced and published the traffic driving videos dataset and then proposed a 2D convolutional neural network to predict driver attention. However, their method does not utilize the temporal information present in dynamic scenes.
Xia et al. [7] collected driver eye movement data from critical driving situations in the laboratory and proposed an attention prediction model employing LSTM to convey temporal information. However, this method does not adequately capture the deeper spatio-temporal coupling features between successive frames. Fang et al. [6] collected eye-movement data from skilled drivers while they watched driving accident videos and published DADA-2000, a dataset of drivers' gaze points in actual driving accidents in multiple traffic environments. In a subsequent study, Fang et al. [17] designed SCAFNet, a dual-stream network of RGB and semantic images, to predict driver attention using 3D convolution as the feature extraction backbone to obtain deep spatio-temporal coupling features. However, in the SCAFNet model, the bypassed semantic branches and 3D backbone significantly increase the network's size and computational effort. Therefore, designing a simple and efficient network architecture and spatio-temporal feature extraction strategy is necessary for the DAP task.

Saliency Object Estimation
The saliency map produced by driver attention prediction consists primarily of region-level saliency estimates, where the intensity assigned to each pixel on the saliency map determines the saliency score of the corresponding image location. However, this approach cannot identify and detect the category and location of salient objects directly. Saliency object detection aims to segment the most visually appealing objects in an image [18]. However, it does not permit simultaneous saliency evaluations of multiple objects.
Several new methods for evaluating the significance of objects in dynamic driving scenes have been proposed recently. Gao et al. [19] proposed an object importance estimation model for road-driving videos that predicts the importance score of each object using a two-stream network containing a visual model and an object model. Zhang et al. [3] proposed a new framework for object importance estimation based on frequent interactions between objects in a scene using interaction graphs. However, these methods do not effectively utilize driver attention data. Li et al. [9] proposed a complex SOE network that fuses attention predictions with the object detection branch and then estimates the saliency of the detected objects. However, this network's attention prediction and object detection branches repeat feature extraction.

Multi-Task Learning and Domain Adaptation
MTL is a learning paradigm that uses shared knowledge to jointly optimize multiple task goals, and it is utilized in a variety of computer vision applications [20,21]. Domain adaptation is a domain-specific learning [22] approach that adapts the network to data from different domains by separating domain-invariant (shared) parameters from domain-specific (private) parameters. In particular, domain-specific parameters are also referred to as domain-adaptive parameters.
MTL emphasizes joint optimization by designing ways to share parameters between different tasks in order to avoid the inherent biases of each task and improve the performance of each task. In practice, however, the inherent drawbacks of task differences may compromise the accuracy of predictions for some tasks [23]. Khattar et al. [24] experimentally found a low correlation between the object detection task and the saliency task, and they proposed a cross-domain MTL for object detection and saliency estimation. This work inspired us to use the shared-bottom layers of the joint network as the domain-invariant (shared) parameters, as the data for both the DAP and object detection tasks originate from the same driving scene domain. Using domain adaptation, the two independent task branches are then learned and optimized.

Saliency Object Estimation Framework
We propose a novel SOE model employing the shared-bottom model structure in MTL in order to improve the understanding of driving scenes and recognition of salient objects. Figure 2 depicts an overview of the SOE architecture, which combines the two tasks of object detection and DAP. The bottom of the SOE model uses a 2D convolutional neural network (CNN) as the feature extraction module for the shared parameters. In our approach, the backbone and neck of the object detection network are utilized as shared parameters and as 2D encoders for the DAP task. The top layers of the SOE framework use two distinct branches for the object detection and DAP tasks, which represent the task-specific domain-adaptive parameters. The bottom module extracts domain-invariant (shared) features, whereas the top module learns domain-adaptive parameters and completes the corresponding tasks. In this study, we utilize the pre-trained parameters of the object detection network so we can concentrate more on the learning of the domain-adaptive parameters for the DAP task. In the following subsections, we discuss the proposed SOE network in greater detail.
In Figure 2, branches A and B represent the domain-adaptive portions of the object detection and DAP tasks, respectively. Output A refers to the categories and locations of the detected objects, whereas Output B is the predicted attention saliency map. Output is the result of the saliency object estimation incorporating attention, categories, and locations.

Driver Attention Prediction Module
Our proposed DAP module is a 2D-3D fully convolutional network that builds the encoder and decoder using 2D convolution and 3D convolution, respectively. As depicted in Figure 3, this module employs a traditional U-shaped architecture [25] and comprises a contracting path for context capture and an expanding path for precise localization. The encoder extracts the purely spatial features $\{F_{C2}, F_{C3}, F_{C4}, F_{C5}\} \in \mathbb{R}^{H \times W \times C}$ of the input image $I$ at four different scales. Since our proposed DAP network is applied to dynamic driving scenes, the input for the network is a video clip containing $T$ consecutive frames. Therefore, we pass the spatial features extracted by the encoder along the time axis. The spatial features $F^{t}_{C}$ ($C \in \{C2, C3, C4, C5\}$) of the current frame $I_t$ are concatenated with the spatial features of the historical frames in the time dimension to obtain the spatial features $F^{T}_{C} \in \mathbb{R}^{T \times H \times W \times C}$ ($C \in \{C2, C3, C4, C5\}$) of the $T$ consecutive frames. Because this method of spatio-temporal feature extraction only passes spatial features between consecutive frames, the features of different frames are independent of one another. To capture the deep spatio-temporal coupling features of dynamic scenes, we design the decoder of the DAP network using a 3D CNN. As shown in Equation (1), the input $F^{T}_{D5}$ at the bottom level of the decoder is the output $F^{T}_{C5}$ from the front of the network. The decoder aggregates the temporal and spatial information during decoding and then proceeds to the next decoding level by upsampling, which expands the spatial scale by a factor of 2.
As shown in Equation (2), we concatenate $F^{T}_{C}$ ($C \in \{C3, C4, C5\}$) with the features from the bottom decoder level along the channel dimension at levels 2, 3, and 4. Then, we use this as the input for $F^{T}_{D}$ ($D \in \{D2, D3, D4\}$). The aggregation of the scale information at various levels enables the network to capture stimulus information for objects of various sizes in the driving scene. With the accumulation of the decoding steps, the spatio-temporal information is gradually fused, allowing the network to learn and use the time, space, and scale information from the video. Finally, the DAP network generates the predicted saliency map, which depicts the attention distribution of the driver at the current moment $t$.
where Decoder [•] denotes the decoding operations at each level and Cat (•) represents the channel-wise concatenation operation.
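The idea of passing per-frame encoder features along the time axis, so that each frame is encoded only once, can be illustrated with a small buffer. This is a minimal sketch rather than the authors' implementation: `FrameFeatureCache` and `toy_encoder` are hypothetical names, and the "features" are plain Python lists standing in for the 2D encoder outputs.

```python
from collections import deque


class FrameFeatureCache:
    """Keep the encoder features of the last T frames so each frame is encoded once."""

    def __init__(self, clip_len):
        self.clip_len = clip_len
        # A bounded deque drops the oldest entry automatically when full.
        self.buffer = deque(maxlen=clip_len)
        self.encoder_calls = 0

    def push(self, frame, encoder):
        # Encode only the newly arrived frame; history is reused from the buffer.
        self.buffer.append(encoder(frame))
        self.encoder_calls += 1

    def clip_features(self):
        # Return T per-frame features, oldest first (a leading time axis),
        # or None while there is not yet enough history.
        if len(self.buffer) < self.clip_len:
            return None
        return list(self.buffer)


def toy_encoder(frame):
    # Hypothetical stand-in for the 2D encoder: a one-element "feature".
    return [frame * 10]


cache = FrameFeatureCache(clip_len=4)
for t in range(6):
    cache.push(t, toy_encoder)

features = cache.clip_features()
print(features)             # features of frames 2..5
print(cache.encoder_calls)  # one encoder call per frame
```

With a sliding window of T frames, a naive pipeline would call the encoder T times per prediction, whereas the buffer needs one call per new frame.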
To implement the spatio-temporal decoding operations required by the decoder, we design and implement four basic operation units: region-based non-local operation (RNL) [26], 3D downsampling, 3D convolution module, and 3D upsampling. As shown in Figure 4a, the RNL is a non-local spatio-temporal attention mechanism capable of aggregating feature information from global locations in a video clip via tensor reshaping and convolutional computation with varying kernel sizes. The RNL is represented in matrix form in Equation (3), where $z \in \mathbb{R}^{T \times H \times W \times C}$ is the output after recalibration with spatio-temporal attention weights, $x \in \mathbb{R}^{T \times H \times W \times C}$ is the input, $W_z$ and $W_g$ are learnable parameters expressed as convolutional operations of a specific kernel size, and $F_\theta$ is a channel-wise convolution.
In this notation, ⊗ denotes matrix multiplication, whereas ⊕ denotes element-wise addition, and softmax denotes the softmax function. Conv3d(•) denotes 3D convolutional operations, and k, s, and p denote the size of the convolution kernel, stride, and padding, respectively. BatchNorm and ReLU are the 3D batch normalization and ReLU activation functions, respectively.
In video input-based DAP, although the salient location on the saliency map is only a small area or just a single point, it is essentially an important object in a spatio-temporal region around that location that draws the driver's attention.Therefore, the region-based non-local spatio-temporal attention calculation process in Equation ( 3) is able to highlight relevant regions around the prediction point and significant frames in the temporal dimension, giving more weight to the locations of interest to the driver.
3D downsampling is a standard 3D convolutional layer with batch normalization and ReLU activation functions, as shown in Figure 4b. We designed the kernel parameters so that the 3D downsampling layer halves the input feature's temporal dimension without changing its spatial scale or channel size. Figure 4c shows that the 3D convolution module consists of a 3D depth-wise convolution layer, with a 3D batch normalization and ReLU activation function after each convolution layer. Three-dimensional separable convolution separates spatio-temporal convolution into spatial and temporal convolutions. It can effectively reduce computational effort and the difficulty of parameter optimization compared to standard 3D convolution. At the end of each layer's decoding operations, 3D upsampling scales the feature map by a factor of two using a trilinear function. Moreover, step-by-step upsampling mitigates the loss of larger-scale detail caused by direct upsampling. Using a combination of 3D downsampling, the 3D convolution module, and 3D upsampling, the decoder can gradually fuse spatio-temporal features between consecutive frames while simultaneously reducing time and channel depth and gradually restoring the feature size to its original resolution. In Section 3.3, specific application examples of our DAP network are described.
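How a 3D convolution can halve the temporal dimension while leaving the spatial size unchanged follows from the standard output-size formula, floor((n + 2p - k) / s) + 1 per axis. The sketch below checks this arithmetic; the kernel, stride, and padding values and the 40 × 40 spatial size are illustrative assumptions, since the text does not state the parameters actually used.

```python
def conv_out_size(size, kernel, stride, padding):
    """Output length of a convolution along one axis (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv3d_out_shape(shape, kernel, stride, padding):
    """Apply the per-axis formula to a (T, H, W) shape."""
    return tuple(
        conv_out_size(n, k, s, p)
        for n, k, s, p in zip(shape, kernel, stride, padding)
    )


# Assumed example: a 16-frame clip of 40 x 40 feature maps.
clip_shape = (16, 40, 40)

# Assumed parameters: stride 2 on the time axis only, padding 1 everywhere.
down = conv3d_out_shape(
    clip_shape, kernel=(3, 3, 3), stride=(2, 1, 1), padding=(1, 1, 1)
)
print(down)  # temporal length halved, spatial size unchanged
```

Any parameter choice with temporal stride 2 and size-preserving spatial settings would behave the same way; this is only one such choice.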

Saliency Object Estimation Network
SOE requires the detection of an object's category, location, and saliency in dynamic driving scenarios. In our method, the predicted intensity of the driver's attention on the object is used as the saliency score. Our proposed DAP module in Section 3.2 can be incorporated into any type of object detection network to construct a multi-task joint network by sharing the bottom-feature extraction module to achieve SOE capability. In order to demonstrate this idea, we constructed joint networks based on faster R-CNN [12] and YOLOv4 [13] and named them SOE-F and SOE-Y, respectively. Faster R-CNN is the most traditional two-stage object detection algorithm, employing sparse prediction for the object's location and category. In the first stage, faster R-CNN generates a large number of proposed regions using the region proposal network (RPN). After the second stage's region of interest (ROI) pooling, the classes of objects are classified and their locations are regressed using the detection head. SOE-F is shown in Figure 5. We chose the most popular faster R-CNN, which uses ResNet50 as the backbone and adds feature pyramid networks (FPN) [27] and ROI align [28]. We used the backbone and neck of faster R-CNN as the shared bottom in the SOE framework. For the DAP task, the encoder is the shared bottom and the decoder is branch B. In SOE-F, the four levels of abstract features $F_C$ ($C \in \{C2, C3, C4, C5\}$) before the second, third, fourth, and final average pooling layers of ResNet50 are used as the outputs of the backbone. Then, after the FPN operation at the neck, $F_P$ ($P \in \{P2, P3, P4, P5\}$) is obtained, where $F_{P6}$ is obtained from $F_{P5}$ by downsampling. In the sparse prediction of faster R-CNN, only the current frame $I_t$ needs to be detected, so only the features $F^{t}_{P} \in \mathbb{R}^{H \times W \times C}$ of $I_t$ need to be input to branch A. However, in branch B, it is necessary to obtain the features $F^{T}_{P} \in \mathbb{R}^{T \times H \times W \times C}$ from $V = \{I_{t-T+1}, \dots, I_t\}$, which includes the $T$ consecutive frames. Consequently, we aggregate the spatial features of the $T$ consecutive frames along the time axis in the same manner as we proposed in Section 3.2. Aligning the dimensions of the shared bottom's features, we construct the decoder with four skip-layer connections in SOE-F branch B. In the decoder, we cascade RNL, 3D downsampling, the 3D convolution module, and 3D upsampling to achieve the goal of fusing the spatio-temporal features and refining the salient objects.
YOLOv4 belongs to the classic one-stage You Only Look Once (YOLO) [29] family of methods. As shown in Figure 6, YOLOv4 uses CSPDarknet53 as the backbone, and then the three different scale features $F_C$ ($C \in \{C3, C4, C5\}$) extracted from the backbone are augmented with contextual scale information using an FPN and a path aggregation network (PAN) [30] to obtain $F_P$ ($P \in \{P3, P4, P5\}$). Finally, dense prediction is performed in the detection head to directly generate the category probability and location coordinates for each object. Because only three distinct scale features are output in the neck of YOLOv4, we eliminate one skip-layer connection branch in the DAP decoder. Otherwise, the computation processes for SOE-Y and SOE-F are identical, so this discussion does not need to be repeated.
After obtaining the output A (categories, locations) of the object detection in branch A and the output B (saliency map) of the DAP in branch B, the final SOE output consists of the saliency score, category, and location of each object. Since the saliency score of each point on the saliency map reflects the intensity of the driver's attention on that location, we calculate the saliency score as the sum of the saliency map within the bounding box of each object. For any one object $Object_n$, its saliency score $V_{Object_n}$ is calculated using Equation (4).
$$V_{Object_n} = \sum_{(i,j) \in location(Object_n)} v_{(i,j)} \quad (4)$$
where $(i, j)$ are the coordinates of a point on the saliency map, $v_{(i,j)}$ is the saliency score of that point, and $location(\cdot)$ denotes the range of the bounding box coordinates for the object. We can determine the most prominent object in a dynamic driving scene by ranking the objects according to their saliency scores. Objects with a saliency score of zero are categorized as non-significant, indicating that they do not require attention.
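The scoring rule described above (summing the saliency map inside each bounding box and ranking the objects by that sum) can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the function names, the `(x1, y1, x2, y2)` end-exclusive box convention, and the toy 4 × 4 saliency map are assumptions made for the example.

```python
def object_saliency(saliency_map, box):
    """Sum the saliency values inside a box (x1, y1, x2, y2), end-exclusive."""
    x1, y1, x2, y2 = box
    return sum(sum(row[x1:x2]) for row in saliency_map[y1:y2])


def rank_objects(saliency_map, detections):
    """Attach a saliency score to each (label, box) and sort, most salient first."""
    scored = [
        (label, box, object_saliency(saliency_map, box))
        for label, box in detections
    ]
    return sorted(scored, key=lambda item: item[2], reverse=True)


# Toy saliency map (rows are y, columns are x); values are assumed.
saliency_map = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.25, 0.0],
    [0.0, 0.25, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]

# Hypothetical detector output: (category, bounding box).
detections = [("car", (2, 2, 4, 4)), ("pedestrian", (1, 1, 3, 3))]

ranked = rank_objects(saliency_map, detections)
for label, box, score in ranked:
    status = "salient" if score > 0 else "non-significant"
    print(label, box, score, status)
```

Because the score is a plain sum, larger boxes can accumulate more saliency than smaller ones covering the same attended region; the sketch keeps the sum because that is what the text specifies.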

Loss Functions
MTL usually uses a combined loss function to train the joint network. In our method, the combined loss function for the object detection and DAP tasks is defined in Equation (5).
$$L = w_1 L_{detect} + w_2 L_{saliency} \quad (5)$$
where $L_{detect}$ and $L_{saliency}$ are the loss functions for the object detection task and the DAP task, respectively, and $w_1$ and $w_2$ are the weight coefficients of the corresponding tasks. However, because SOE-F and SOE-Y only need to optimize the domain-adaptive parameters of the DAP module (branch B), $L_{detect} = 0$ and the combined loss function degenerates to
$$L = w_2 L_{saliency} \quad (6)$$
where $w_2 = 1$. We choose the Kullback-Leibler divergence (KL) [31], which is widely used in VSP, as the loss function of the DAP task. KL is defined in Equation (7), where $S \in [0, 1]$ is the predicted saliency map, $G \in [0, 1]$ is the saliency map in the annotation, and $i \in N$ denotes the pixel on the saliency map.
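The loss terms described above can be sketched as follows. The KL form used here, sum over pixels of G_i · log(G_i / S_i) with both maps normalized to sum to one and a small epsilon for numerical stability, is the form commonly used in saliency benchmarks; it is an assumption that the paper's Equation (7) matches it exactly, and the function names and epsilon value are illustrative.

```python
import math

EPS = 1e-7  # assumed stabilizer; the paper does not state a value


def normalize(values):
    """Scale a flattened map so that its entries sum to one."""
    total = sum(values)
    return [v / total for v in values]


def kl_divergence(pred, target, eps=EPS):
    """KL divergence of the predicted map S from the annotated map G."""
    s = normalize(pred)
    g = normalize(target)
    return sum(gi * math.log(eps + gi / (si + eps)) for si, gi in zip(s, g))


def combined_loss(l_detect, l_saliency, w1=1.0, w2=1.0):
    """Weighted sum of the detection and saliency losses."""
    return w1 * l_detect + w2 * l_saliency


annotation = [0.7, 0.2, 0.1]   # flattened toy maps, values assumed
good_pred = [0.7, 0.2, 0.1]
bad_pred = [0.1, 0.2, 0.7]

kl_good = kl_divergence(good_pred, annotation)
kl_bad = kl_divergence(bad_pred, annotation)
print(kl_good, kl_bad)

# With the detection branch frozen, L_detect = 0 and only the saliency term remains.
print(combined_loss(0.0, kl_bad))
```

A prediction identical to the annotation yields a KL close to zero, and the value grows as the predicted distribution moves away from the annotated one.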

Experiments

4.1. Experiment Setup
Because our SOE method focuses on optimizing the domain-adaptive parameters in the DAP task and driving accident scenarios have more anomalous objects, a more complex traffic environment, and a greater variety of human-vehicle interaction behaviors than normal scenarios, the driver attention prediction in driving accident scenarios (DADA-2000) dataset [6] was chosen for training and testing.
DADA-2000: This is a large-scale driver attention dataset in driving accident scenarios containing 2000 videos, of which 1018 videos have been made publicly available. Of these, 598 videos (241 K frames) were used for training, 198 videos (64 K frames) for validation, and 222 videos (70 K frames) for testing. The annotations were derived from the eye-tracking data of 20 experienced drivers, and each video recorded the eye-tracking data of at least five drivers.
We used the parameters of faster R-CNN and YOLOv4 pre-trained on the COCO dataset to initialize the shared bottom and branch A in SOE-F and SOE-Y, respectively. Therefore, the domain-adaptive parameters of branch A were already adapted to the object detection task after initialization, and the detection accuracy was fully determined by the object detection network used in the SOE model. Thus, only the branch B parameters were trained, whereas the other parameters remained constant. Moreover, we believe that the adaptive learning of task-specific parameters enables our SOE model to easily incorporate other tasks such as object tracking or semantic segmentation.
Our model was implemented using the PyTorch framework and trained on an NVIDIA RTX5000. To achieve a balance between speed and precision, we trained with mixed precision. During training, the ADAM optimizer was utilized, the learning rate was set to 0.001, and the weight decay was set to 2 × 10^-7. The number of consecutive SOE input frames was set to 16, and the image was resized to 320 × 320. Due to the strong spatio-temporal continuity of the driving task, only Z-score normalization, random mirroring, and random clipping were selected to augment the data. We set the training batch size to 200, but due to memory constraints, we could only process five video clips at a time. Therefore, we accumulated the gradient and updated the model parameters every 40 steps.
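The gradient accumulation scheme described above (an effective batch of 200 built from 40 micro-batches of five clips) can be checked on a toy problem. The sketch below is a framework-free illustration under stated assumptions: the one-parameter model y = w · x, the synthetic samples, and the function name `mse_grad` are made up for the example, and plain gradient descent stands in for ADAM.

```python
def mse_grad(w, samples):
    """Gradient of the mean squared error for the one-parameter model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in samples) / len(samples)


EFFECTIVE_BATCH = 200   # batch size stated in the text
MICRO_BATCH = 5         # clips that fit in memory at once
ACCUM_STEPS = EFFECTIVE_BATCH // MICRO_BATCH

# Synthetic data, assumed for illustration only.
samples = [(float(i % 7), 3.0 * (i % 7) + 1.0) for i in range(EFFECTIVE_BATCH)]
w = 0.5
learning_rate = 0.001

accumulated = 0.0
for step in range(ACCUM_STEPS):
    micro = samples[step * MICRO_BATCH:(step + 1) * MICRO_BATCH]
    # Scale each micro-batch gradient so the sum equals the full-batch mean.
    accumulated += mse_grad(w, micro) / ACCUM_STEPS

full = mse_grad(w, samples)
print(ACCUM_STEPS, accumulated, full)

# A single parameter update after all micro-batches have been processed.
w_new = w - learning_rate * accumulated
```

Dividing each micro-batch gradient by the number of accumulation steps makes the accumulated value equal to the gradient of the full 200-sample batch, up to floating-point rounding.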

Performance Comparison
Because the designed SOE network was not trained on the object detection branch, only the DAP branch's performance on the DADA-2000 dataset was tested and evaluated. We selected six representative evaluation metrics [32] to quantitatively compare the performance of our proposed method to that of the state-of-the-art saliency models. These evaluation metrics can be divided into two classes. The location-based metrics include the area under the curve by Judd (AUC-J), shuffled-AUC (s-AUC), and normalized scanpath saliency (NSS). The distribution-based metrics consist of the Kullback-Leibler divergence (KL), similarity (SIM), and Pearson's correlation coefficient (CC). The AUC measures the tradeoff between true positives (TPs) and false positives (FPs) at various discrimination thresholds. The AUC-J focuses on the TP and FP rates. In s-AUC, the mean saliency map is introduced to compensate for the center fixation bias in the AUC calculation. The KL is defined in Equation (7). The NSS, CC, and SIM are defined in Equation (8). For the NSS, CC, and SIM, larger values indicate better performance.
where $(x_i, y_i)$ is the location of a fixation point on the fixation point map, $\mu_S$ and $\sigma_S$ are the mean and standard deviation of the predicted saliency map $S$, respectively, $cov(\cdot)$ is the covariance, and $\sigma_G$ is the standard deviation of the saliency map in the annotations.
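For reference, the NSS, CC, and SIM metrics can be sketched with their commonly used definitions: NSS averages the standardized prediction at the fixation points, CC is the covariance divided by the product of the standard deviations, and SIM sums the pixel-wise minimum of the two maps after each is normalized to sum to one. These are the standard benchmark forms and are assumed, not confirmed, to match the paper's Equation (8); the function names and the flattened toy maps are illustrative.

```python
import math


def mean(values):
    return sum(values) / len(values)


def std(values):
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def nss(pred, fixations):
    """Mean standardized saliency at fixated pixels (fixations is a 0/1 map)."""
    mu, sigma = mean(pred), std(pred)
    hits = [(p - mu) / sigma for p, f in zip(pred, fixations) if f]
    return mean(hits)


def cc(pred, target):
    """Pearson correlation between the predicted and annotated maps."""
    mp, mt = mean(pred), mean(target)
    cov = mean([(p - mp) * (t - mt) for p, t in zip(pred, target)])
    return cov / (std(pred) * std(target))


def sim(pred, target):
    """Histogram intersection of the two maps, each normalized to sum to one."""
    sp, st = sum(pred), sum(target)
    return sum(min(p / sp, t / st) for p, t in zip(pred, target))


# Flattened toy maps; values are assumed for illustration.
pred = [0.1, 0.2, 0.6, 0.1]
annotation = [0.1, 0.2, 0.6, 0.1]
fixation_map = [0, 0, 1, 0]

print(nss(pred, fixation_map))
print(cc(pred, annotation))
print(sim(pred, annotation))
```

In the toy case the prediction equals the annotation, so CC and SIM are at their maximum of one and the NSS is positive because the fixated pixel lies above the map mean.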
The location-based metrics use a binary fixation point map as the ground truth to measure the difference in location between the predicted saliency map and the annotation. Distribution-based metrics measure the difference between the predicted saliency map and the annotated distribution of the attention points using a continuous saliency map as the ground truth. By simulating the foveal vision [33] of humans at each gaze point, a continuous saliency map can be obtained by blurring each fixation point using a small Gaussian sigma.
Performance on DADA-2000. Table 1 compares the performance of our SOE method on the test set of the DADA-2000 dataset (222 video sequences) with that of eight saliency methods across six saliency metrics. On the NSS, AUC-J, s-AUC, CC, and KL metrics, the proposed DAP branch in SOE-F achieved better results than the current state-of-the-art methods. SCAFNet received the highest score on the SIM metric, but a distribution-based metric such as the SIM has the disadvantage that the Gaussian sigma chosen when constructing the saliency map influences the model evaluation. Even though our method predicted the correct attentional locations (NSS, AUC-J, s-AUC), its SIM score was slightly lower because the SIM penalizes false negatives more than false positives. The CC, which is computed symmetrically, indicates that our method was more balanced than SCAFNet with regard to the distribution similarity of the saliency map. In addition, SCAFNet uses two-stream 3D convolution as its feature extraction network, which results in larger computational and memory requirements. Our method improved the KL and s-AUC scores significantly, by 27%. ASIAF-Net was the method most similar to ours in its aims, but our DAP branch outperformed it on five metrics. These comparisons indicate that our SOE method makes better use of the driver's visual information to achieve human-like, attention-driven SOE.
The DAP branches in SOE-Y and SOE-F are based on the same design idea, but their final prediction results differed because of the different shared bottoms. The DAP branch in SOE-Y employs CSPDarknet53 and FPN as encoders to extract the spatial features of consecutive video frames, and applies the multi-scale enhancement strategy at only three feature levels. In SOE-F, ResNet50 and FPN are used as encoders and the multi-scale enhancement strategy is applied at four feature levels, which enables the method to learn and predict driver eye-movement behavior on a larger scale. In addition, because of the differences in the encoders, the decoder of SOE-Y has one fewer skip-connection path than that of SOE-F, which results in a weaker adaptive learning capability.
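The sensitivity of the SIM to the Gaussian sigma noted above can be reproduced with a small sketch. The definitions below follow the commonly used forms of SIM (histogram intersection), CC (Pearson correlation), and KL divergence. The helper names, the epsilon constant, and the synthetic maps are assumptions made for illustration and are not taken from the evaluation code of this study.

```python
import numpy as np

EPS = 1e-12  # assumed small constant for numerical stability

def normalize(m):
    """Scale a non-negative map so that it sums to 1."""
    m = np.asarray(m, dtype=np.float64)
    return m / (m.sum() + EPS)

def sim(pred, gt):
    """Similarity: sum of pixel-wise minima of the two normalised maps."""
    return float(np.minimum(normalize(pred), normalize(gt)).sum())

def cc(pred, gt):
    """Pearson correlation coefficient; swapping the arguments gives the same value."""
    p = pred - pred.mean()
    g = gt - gt.mean()
    return float((p * g).sum() / (np.sqrt((p ** 2).sum() * (g ** 2).sum()) + EPS))

def kl(pred, gt):
    """KL divergence of the ground truth from the prediction (lower is better)."""
    p, g = normalize(pred), normalize(gt)
    return float((g * np.log(EPS + g / (p + EPS))).sum())

def gaussian_blob(shape, center, sigma):
    """Synthetic saliency map: one isotropic Gaussian at `center`."""
    rows = np.arange(shape[0])[:, None]
    cols = np.arange(shape[1])[None, :]
    d2 = (rows - center[0]) ** 2 + (cols - center[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

# The same prediction scored against two ground truths that differ only in sigma.
pred = gaussian_blob((64, 64), (32, 32), sigma=4.0)
gt_narrow = gaussian_blob((64, 64), (32, 32), sigma=4.0)
gt_wide = gaussian_blob((64, 64), (32, 32), sigma=8.0)

sim_narrow = sim(pred, gt_narrow)
sim_wide = sim(pred, gt_wide)
```

In this toy setting the predicted peak is at the correct location in both cases, yet the SIM drops when the ground truth is built with a wider Gaussian, so the choice of sigma alone changes the score.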
We compared the model size and single-frame inference runtime of the proposed SOE-Y and SOE-F. As shown in Table 2, SOE-Y has more channels than SOE-F, so its model size was larger. However, SOE-Y uses a single-stage strategy to predict the object's category and location, which makes its inference faster than that of SOE-F and allows it to meet the real-time requirement. Taken together, the differences in model performance and network structure between SOE-F and SOE-Y indicate that the network structure, rather than its depth or width, determines model performance.
We also compared the model size and runtime of our methods with those of four saliency prediction methods. As shown in Table 3, SALION had the smallest model size and therefore the smallest memory requirements, but its single-frame inference was too slow to meet the real-time demand. ACLNet had the shortest runtime, but it performed poorly on the DAP task. Although SOE-Y and SOE-F are larger, our methods integrate two tasks, DAP and object detection, whereas the other methods are suitable only for the DAP task. Moreover, both of our methods run in milliseconds, and SOE-Y meets the real-time requirements of an autonomous driving system.
Qualitative analysis. Figure 7 shows visualized qualitative results for several driving scenes. The prevalent object detection model aims to detect all vehicles, pedestrians, signal signs, etc., within the scene but does not evaluate the significance of these objects. For instance, in the first row of Figure 7, the model detected the pedestrians and vehicles ahead but could not tell that the pedestrian about to fall in front was the most significant object in the scene. Similarly, in the fifth row of Figure 7, a person fell in front of the vehicle, but the object detection network was unable to identify the person because of pose deformation and a detection blind spot. However, our DAP branch accurately recognized this instance and predicted that it was the most prominent object in the current scene. The detection accuracy of our method depends entirely on the performance of the object detection network used. In the future, superior object detection networks can be used to eliminate the errors caused by missed detections and improve the overall performance of the SOE model. Figure 7 illustrates that the human-like attention prediction branch of our proposed SOE method accurately predicted the driver's attention location and then assigned a saliency score to each object, thereby identifying the most salient and significant objects. By using our proposed SOE-F or SOE-Y, autonomous driving systems and ADAS can improve their monitoring and evaluation of driving risk states.
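The step of assigning a saliency score to each detected object can be sketched as follows. This section does not specify the scoring rule, so the sketch assumes that an object's score is the peak (or mean) predicted attention inside its bounding box. The function names, the box format (x1, y1, x2, y2), and the example values are hypothetical.

```python
import numpy as np

def object_saliency_scores(attention, boxes, reduce="max"):
    """Score each box (x1, y1, x2, y2) by the attention values inside it.

    Assumption: the peak or mean attention in the box stands in for the
    object's saliency; the paper's actual rule may differ.
    """
    if reduce not in ("max", "mean"):
        raise ValueError("reduce must be 'max' or 'mean'")
    h, w = attention.shape
    scores = []
    for x1, y1, x2, y2 in boxes:
        # Clip the box to the image so that slicing stays valid.
        x1, x2 = max(0, int(x1)), min(w, int(x2))
        y1, y2 = max(0, int(y1)), min(h, int(y2))
        region = attention[y1:y2, x1:x2]
        if region.size == 0:
            scores.append(0.0)
        elif reduce == "max":
            scores.append(float(region.max()))
        else:
            scores.append(float(region.mean()))
    return np.array(scores)

def rank_objects(attention, boxes, labels, reduce="max"):
    """Return (label, score) pairs sorted from most to least salient."""
    scores = object_saliency_scores(attention, boxes, reduce)
    order = np.argsort(-scores)
    return [(labels[i], float(scores[i])) for i in order]

# Toy example: attention is concentrated on the first box only.
attention = np.zeros((100, 100))
attention[40:60, 40:60] = 1.0
boxes = [(35, 35, 65, 65), (0, 0, 20, 20)]
labels = ["pedestrian", "car"]
ranked = rank_objects(attention, boxes, labels)
```

If the attention peak lies outside every detected box, as in the missed-detection case discussed above, all scores stay low, which is one simple way to flag a salient region that the detector has not covered.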

Ablation Analysis
In this section, we perform an ablation analysis of the proposed SOE method on the DADA-2000 dataset. Because we did not modify the network architecture of the object detection task in the SOE method, we concentrated only on the DAP task. The encoder in the DAP network consists of the bottom layers of Faster R-CNN or YOLOv4, so no ablation analysis was required for it. In addition, Palazzi et al. [16] concluded in a previous study that attention prediction works best when the input video clip is 16 frames long. Therefore, we performed the ablation analysis only for the spatio-temporal attention mechanism added to the decoder.
For the base model, we replaced the spatio-temporal attention mechanism (RNL) with a 3D convolution module and then used the same training and testing strategies to obtain the performance of SOE-F and SOE-Y without the RNL on the DADA-2000 test set. As shown in Table 4, when the RNL was replaced, the network's performance on each evaluation metric decreased slightly. These results demonstrate that our base model is inherently capable of extracting and learning spatio-temporal information in dynamic driving scenarios and of accurately predicting driver attention. When the RNL is added to the decoder, the DAP network's ability to fuse spatio-temporal features is strengthened and its overall performance improves.

Conclusions
In this paper, we propose an attention-driven SOE method to estimate the salient and important objects in dynamic driving scenes in real time. First, a U-shaped 2D-3D encoder-decoder DAP network is constructed to predict driver attention. Then, using a shared-bottom MTL architecture, a combined DAP and object detection network is constructed. The experimental results on the DADA-2000 dataset demonstrate that our method achieves the best performance on five of the six saliency metrics. By extracting 2D features with shared layers and passing the features across consecutive frames, we avoid the computational waste caused by repeated feature extraction and meet the real-time inference requirements of autonomous driving systems. This 2D feature-coding structure permits us to combine DAP networks with any of the most advanced object detection networks, demonstrating the portability of our method.
Future research will concentrate on incorporating driver intent prediction and trajectory prediction methods into our MTL framework in order to improve the network's ability to distinguish salient and significant objects in complex human-vehicle-road interaction scenarios.

Figure 1. Drivers pay attention to the most important objects in the scene: (a) current driving scene, (b) results of object detection, (c) driver attention map after visualization, and (d) fusion map that projects the driver attention map onto the results of the object detection.

Figure 2. The overall architecture of the proposed SOE network. Input is the captured image from the onboard camera. Shared bottom refers to the 2D CNNs located at the base of the SOE network. Branches A and B represent the domain-adaptive portions of the object detection and DAP tasks, respectively. Output A refers to the categories and locations of the detected objects, whereas Output B is the predicted attention saliency map. Output is the result of the saliency object estimation, which incorporates attention, categories, and locations.

Figure 3. The overall driver attention prediction module, where t denotes the current frame, t− denotes the historical frame, t+ denotes the future frame, and T = {t − T + 1, ..., t} denotes the time length of consecutive frames. $F_C^{t}$ ($C \in \{C2, C3, C4, C5\}$) denotes the spatial features extracted from the current frame. $F_C^{T-1}$ ($C \in \{C2, C3, C4, C5\}$) denotes the spatial features extracted over T − 1 historical frames. $F_C^{T}$ ($C \in \{C2, C3, C4, C5\}$) denotes the spatial features extracted on T consecutive frames. ⊗ indicates that the features are concatenated in the time dimension.

Figure 6. Overview of SOE-Y, where t denotes the current frames, t− represents the historical frames, t+ denotes the future frames, and T represents the number of consecutive frames. C ∈ {C3, C4, C5} and P ∈ {P3, P4, P5} represent different feature levels in SOE-Y.

Figure 7. Visualization of the SOE-F and SOE-Y prediction results on the DADA-2000 dataset. The first four rows are from SOE-F and the last four rows are from SOE-Y.

Table 1. Comparison of the saliency metrics of SOE-Y and SOE-F with those of the other state-of-the-art methods on the DADA-2000 test set (↑ indicates that a larger value is better and ↓ indicates that a smaller value is better. Bold indicates the best results).

Table 2. Comparison of model size and runtime for SOE-Y and SOE-F.

Table 3. Comparison of model size and runtime of SOE-Y and SOE-F with those of other methods.

Table 4. Comparison of SOE-F and SOE-Y performance on the DADA-2000 dataset with or without RNL (↑ indicates that a larger value is better and ↓ indicates that a smaller value is better. Bold indicates the best results).