Human-like Attention-Driven Saliency Object Estimation in Dynamic Driving Scenes
Abstract
1. Introduction
- (1) Inspired by the human attention mechanism, we propose a human-like attention-driven SOE method for dynamic driving scenes, built on a shared-bottom multi-task structure, that predicts the saliency, category, and location of objects in real time (see the sketch after this list).
- (2) We propose a U-shaped encoder–decoder DAP network that can be fused at the feature level with any object detection network, offering good portability and avoiding redundant extraction of low-level features.
- (3) We combine Faster R-CNN and YOLOv4 with DAP to create SOE-F and SOE-Y, respectively. Experimental results on the DADA-2000 dataset demonstrate that our method predicts the driver attention distribution and identifies and locates salient objects in driving scenes more accurately than competing methods.
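To make the shared-bottom layout in contribution (1) concrete, here is a minimal PyTorch sketch of one way such a structure could be wired: a single backbone trunk feeds both a U-shaped attention-prediction decoder (Branch A) and a detection-style head (Branch B), so low-level features are extracted only once. The backbone choice, channel sizes, and the toy detection head are illustrative assumptions, not the authors' implementation; in SOE-F and SOE-Y the detection branch is Faster R-CNN or YOLOv4.

```python
# Minimal sketch (not the authors' implementation) of a shared-bottom
# multi-task layout: one backbone feeds a U-shaped attention-prediction
# decoder and a detection-style head. Names and channel sizes are assumptions.
import torch
import torch.nn as nn
import torchvision


class SharedBottomSOE(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Shared bottom: a ResNet-50 trunk reused by both branches.
        backbone = torchvision.models.resnet50(weights=None)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.layer1, self.layer2 = backbone.layer1, backbone.layer2
        self.layer3, self.layer4 = backbone.layer3, backbone.layer4

        # Branch A: U-shaped decoder for driver attention prediction (DAP).
        self.up1 = nn.Sequential(
            nn.Conv2d(2048, 1024, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False))
        self.up2 = nn.Sequential(
            nn.Conv2d(1024 + 1024, 512, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False))
        self.sal_head = nn.Conv2d(512 + 512, 1, 1)

        # Branch B: a toy dense head (class scores + 4 box offsets per cell);
        # in the paper this role is played by Faster R-CNN or YOLOv4.
        self.det_head = nn.Conv2d(2048, num_classes + 4, 1)

    def forward(self, x):
        x = self.stem(x)
        c2 = self.layer1(x)          # 1/4 resolution, 256 channels
        c3 = self.layer2(c2)         # 1/8 resolution, 512 channels
        c4 = self.layer3(c3)         # 1/16 resolution, 1024 channels
        c5 = self.layer4(c4)         # 1/32 resolution, 2048 channels

        # Attention branch with skip connections (U-shaped fusion).
        d4 = self.up1(c5)
        d3 = self.up2(torch.cat([d4, c4], dim=1))
        saliency = torch.sigmoid(self.sal_head(torch.cat([d3, c3], dim=1)))

        detections = self.det_head(c5)
        return saliency, detections


# Example: one 448x448 RGB frame; saliency comes out at 1/8 resolution.
# model = SharedBottomSOE()
# sal, det = model(torch.randn(1, 3, 448, 448))
```

In practice, the DAP decoder's intermediate features could be fused with the detector's feature-pyramid levels at matching resolutions; the sketch keeps only the skeleton of the shared-bottom idea.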
2. Related Works
2.1. Driver Attention Prediction
2.2. Saliency Object Estimation
2.3. Multi-Task Learning and Domain Adaption
3. Methods
3.1. Saliency Object Estimation Framework
3.2. Driver Attention Prediction Module
3.3. Saliency Object Estimation Network
3.4. Loss Functions
4. Experiments
4.1. Experiment Setup
4.2. Performance Comparison
4.3. Ablation Analysis
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Suman, V.; Bera, A. RAIST: Learning Risk Aware Traffic Interactions via Spatio-Temporal Graph Convolutional Networks. arXiv 2020, arXiv:2011.08722.
- Wolfe, J.M.; Horowitz, T.S. Five factors that guide attention in visual search. Nat. Hum. Behav. 2017, 1, 58.
- Zhang, Z.; Tawari, A.; Martin, S.; Crandall, D. Interaction graphs for object importance estimation in on-road driving videos. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 8920–8927.
- Wang, W.; Shen, J.; Guo, F.; Cheng, M.M.; Borji, A. Revisiting video saliency: A large-scale benchmark and a new model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4894–4903.
- Alletto, S.; Palazzi, A.; Solera, F.; Calderara, S.; Cucchiara, R. DR(eye)VE: A dataset for attention-based tasks with applications to autonomous and assisted driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 54–60.
- Fang, J.; Yan, D.; Qiao, J.; Xue, J.; Wang, H.; Li, S. DADA-2000: Can driving accident be predicted by driver attention? Analyzed by a benchmark. In Proceedings of the 2019 IEEE Intelligent Transportation Systems Conference (ITSC), Auckland, New Zealand, 27–30 October 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 4303–4309.
- Xia, Y.; Zhang, D.; Kim, J.; Nakayama, K.; Zipser, K.; Whitney, D. Predicting driver attention in critical situations. In Proceedings of the Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 658–674.
- Deng, T.; Yan, H.; Qin, L.; Ngo, T.; Manjunath, B. How do drivers allocate their potential attention? Driving fixation prediction via convolutional neural networks. IEEE Trans. Intell. Transp. Syst. 2019, 21, 2146–2154.
- Li, Q.; Liu, C.; Chang, F.; Li, S.; Liu, H.; Liu, Z. Adaptive short-temporal induced aware fusion network for predicting attention regions like a driver. IEEE Trans. Intell. Transp. Syst. 2022, 23, 18695–18706.
- Caruana, R. Multitask learning. Mach. Learn. 1997, 28, 41–75.
- Droste, R.; Jiao, J.; Noble, J.A. Unified image and video saliency modeling. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 419–435.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149.
- Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
- Lai, Q.; Wang, W.; Sun, H.; Shen, J. Video saliency prediction using spatiotemporal residual attentive networks. IEEE Trans. Image Process. 2019, 29, 1113–1126.
- Min, K.; Corso, J.J. TASED-Net: Temporally-aggregating spatial encoder-decoder network for video saliency detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 2394–2403.
- Palazzi, A.; Abati, D.; Solera, F.; Cucchiara, R. Predicting the driver's focus of attention: The DR(eye)VE project. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 1720–1733.
- Fang, J.; Yan, D.; Qiao, J.; Xue, J.; Yu, H. DADA: Driver attention prediction in driving accident scenarios. IEEE Trans. Intell. Transp. Syst. 2021, 23, 4959–4971.
- Qin, X.; Zhang, Z.; Huang, C.; Dehghan, M.; Zaiane, O.R.; Jagersand, M. U2-Net: Going deeper with nested U-structure for salient object detection. Pattern Recognit. 2020, 106, 107404.
- Gao, M.; Tawari, A.; Martin, S. Goal-oriented object importance estimation in on-road driving videos. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 5509–5515.
- Xu, D.; Ouyang, W.; Wang, X.; Sebe, N. PAD-Net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 675–684.
- Gao, Y.; Ma, J.; Zhao, M.; Liu, W.; Yuille, A.L. NDDR-CNN: Layerwise feature fusing in multi-task CNNs by neural discriminative dimensionality reduction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3205–3214.
- Chang, W.G.; You, T.; Seo, S.; Kwak, S.; Han, B. Domain-specific batch normalization for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7354–7362.
- Ma, J.; Zhao, Z.; Yi, X.; Chen, J.; Hong, L.; Chi, E.H. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 1930–1939.
- Khattar, A.; Hegde, S.; Hebbalaguppe, R. Cross-domain multi-task learning for object detection and saliency estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3639–3648.
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241.
- Huang, G.; Bors, A.G. Region-based non-local operation for video classification. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 10010–10017.
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
- Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768.
- Li, J.; Xia, C.; Song, Y.; Fang, S.; Chen, X. A data-driven metric for comprehensive evaluation of saliency models. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 190–198.
- Bylinskii, Z.; Judd, T.; Oliva, A.; Torralba, A.; Durand, F. What do different evaluation metrics tell us about saliency models? IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 740–757.
- Perry, J.S.; Geisler, W.S. Gaze-contingent real-time simulation of arbitrary visual fields. In Proceedings of the Human Vision and Electronic Imaging VII, San Jose, CA, USA, 19 January 2002; SPIE: Bellingham, WA, USA, 2002; Volume 4662, pp. 57–69.
- Jiang, M.; Huang, S.; Duan, J.; Zhao, Q. SALICON: Saliency in context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1072–1080.
- Zhang, K.; Chen, Z. Video saliency prediction based on spatial-temporal two-stream network. IEEE Trans. Circuits Syst. Video Technol. 2018, 29, 3544–3557.
- Cornia, M.; Baraldi, L.; Serra, G.; Cucchiara, R. A deep multi-level network for saliency prediction. In Proceedings of the 2016 23rd International Conference on Pattern Recognition (ICPR), Cancun, Mexico, 4–8 December 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 3488–3493.
Metrics computed against the fixation point map: NSS↑, AUC-J↑, s-AUC↑; against the saliency map: SIM↑, CC↑, KL↓ (↑: higher is better; ↓: lower is better).

| Methods | NSS↑ | AUC-J↑ | s-AUC↑ | SIM↑ | CC↑ | KL↓ |
|---|---|---|---|---|---|---|
| SALICON [34] | 2.71 | 0.91 | 0.65 | 0.30 | 0.43 | 2.17 |
| Two-Stream [35] | 1.48 | 0.84 | 0.64 | 0.14 | 0.23 | 2.85 |
| MLNet [36] | 0.30 | 0.59 | 0.54 | 0.07 | 0.04 | 11.78 |
| BDD-A [7] | 2.15 | 0.86 | 0.63 | 0.25 | 0.33 | 3.32 |
| DR(eye)VE [16] | 2.92 | 0.91 | 0.64 | 0.32 | 0.45 | 2.27 |
| ACLNet [4] | 3.15 | 0.91 | 0.64 | 0.35 | 0.48 | 2.51 |
| SCAFNet [17] | 3.34 | 0.92 | 0.66 | 0.37 | 0.50 | 2.19 |
| ASIAF-Net [9] | 3.39 | 0.93 | 0.78 | 0.36 | 0.49 | 1.66 |
| SOE-Y | 3.38 | 0.93 | 0.81 | 0.34 | 0.50 | 1.64 |
| SOE-F | 3.47 | 0.93 | 0.84 | 0.34 | 0.51 | 1.60 |
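For reference, the tabulated metrics follow their standard definitions; the NumPy sketch below computes NSS, CC, KL, and SIM for a predicted map against ground truth (the AUC-based metrics are omitted for brevity). Variable names are illustrative and this is not the paper's evaluation code.

```python
# Minimal NumPy sketch of four standard saliency metrics (not the paper's code).
import numpy as np

EPS = 1e-8

def nss(sal_map: np.ndarray, fixations: np.ndarray) -> float:
    """Normalized Scanpath Saliency: mean of the standardized prediction
    at binary fixation locations (higher is better)."""
    s = (sal_map - sal_map.mean()) / (sal_map.std() + EPS)
    return float(s[fixations > 0].mean())

def cc(sal_map: np.ndarray, gt_map: np.ndarray) -> float:
    """Linear correlation coefficient between prediction and ground truth."""
    s = (sal_map - sal_map.mean()) / (sal_map.std() + EPS)
    g = (gt_map - gt_map.mean()) / (gt_map.std() + EPS)
    return float((s * g).mean())

def kl_div(sal_map: np.ndarray, gt_map: np.ndarray) -> float:
    """KL divergence of the ground-truth distribution from the prediction
    (lower is better); both maps are normalized to sum to one."""
    p = sal_map / (sal_map.sum() + EPS)
    q = gt_map / (gt_map.sum() + EPS)
    return float((q * np.log(q / (p + EPS) + EPS)).sum())

def sim(sal_map: np.ndarray, gt_map: np.ndarray) -> float:
    """Similarity: sum of element-wise minima of the two normalized maps."""
    p = sal_map / (sal_map.sum() + EPS)
    q = gt_map / (gt_map.sum() + EPS)
    return float(np.minimum(p, q).sum())

# Example with random maps and a sparse fixation mask:
# pred = np.random.rand(224, 224); gt = np.random.rand(224, 224)
# fix = (np.random.rand(224, 224) > 0.999).astype(np.float32)
# print(nss(pred, fix), cc(pred, gt), kl_div(pred, gt), sim(pred, gt))
```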
| Methods | Shared Bottom Size (MB) | Branch A Size (MB) | Branch B Size (MB) | Total Model Size (MB) | Runtime (s) |
|---|---|---|---|---|---|
| SOE-Y | 170.2 | 86.8 | 31.4 | 288.4 | 0.03 |
| SOE-F | 110.1 | 57.4 | 15.2 | 182.7 | 0.08 |
| Methods | Model Size (MB) | Runtime (s) |
|---|---|---|
| SALICON [34] | 117 | 0.5 |
| Two-Stream [35] | 315 | 20 |
| ACLNet [4] | 250 | 0.02 |
| DR(eye)VE [16] | 155 | 0.03 |
| SOE-Y | 288.4 | 0.03 |
| SOE-F | 182.7 | 0.08 |
Metrics computed against the fixation point map: NSS↑, AUC-J↑, s-AUC↑; against the saliency map: SIM↑, CC↑, KL↓ (↑: higher is better; ↓: lower is better).

| Methods | NSS↑ | AUC-J↑ | s-AUC↑ | SIM↑ | CC↑ | KL↓ |
|---|---|---|---|---|---|---|
| SOE-F (without RNL) | 3.447 | 0.931 | 0.839 | 0.340 | 0.513 | 1.608 |
| SOE-F | 3.471 | 0.932 | 0.840 | 0.340 | 0.514 | 1.596 |
| SOE-Y (without RNL) | 3.363 | 0.929 | 0.813 | 0.341 | 0.502 | 1.641 |
| SOE-Y | 3.384 | 0.929 | 0.814 | 0.343 | 0.503 | 1.637 |
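For context on what the ablation toggles: RNL refers to a region-based non-local attention operation (cf. Huang and Bors [26]). The sketch below is a generic embedded-Gaussian non-local block over a 2D feature map; the region pooling specific to RNL is omitted, so treat it as an illustrative stand-in rather than the paper's module.

```python
# Generic non-local (self-attention) block over a 2D feature map; an
# illustrative stand-in for the RNL module, not its exact implementation.
import torch
import torch.nn as nn


class NonLocalBlock2D(nn.Module):
    def __init__(self, channels: int, reduction: int = 2):
        super().__init__()
        inner = channels // reduction
        self.theta = nn.Conv2d(channels, inner, 1)  # query projection
        self.phi = nn.Conv2d(channels, inner, 1)    # key projection
        self.g = nn.Conv2d(channels, inner, 1)      # value projection
        self.out = nn.Conv2d(inner, channels, 1)    # restore channel count

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (b, hw, c')
        k = self.phi(x).flatten(2)                     # (b, c', hw)
        v = self.g(x).flatten(2).transpose(1, 2)       # (b, hw, c')
        attn = torch.softmax(q @ k, dim=-1)            # pairwise affinities
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                         # residual connection


# block = NonLocalBlock2D(512)
# y = block(torch.randn(1, 512, 28, 28))   # output has the same shape as input
```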