Article

Vehicle Detection in Videos Leveraging Multi-Scale Feature and Memory Information

School of Computer Science, Xi’an Shiyou University, Xi’an 710065, China
*
Author to whom correspondence should be addressed.
Electronics 2025, 14(10), 2009; https://doi.org/10.3390/electronics14102009
Submission received: 13 April 2025 / Revised: 11 May 2025 / Accepted: 13 May 2025 / Published: 15 May 2025

Abstract

Vehicle detection in videos is a critical task in traffic monitoring. Existing vehicle detection methods commonly use static detectors. Since video frames are processed as discrete static images, static detectors neglect the temporal information of vehicles when detecting them in videos, which reduces detection accuracy. To address this shortcoming, this paper improves detection performance with a video vehicle detection method that combines multi-scale features with memory information. We design a Multi-scale Feature Generation Network (MFGN) to improve the detector's adaptability to vehicle scales. MFGN generates features at two scales and predefines multi-scale anchors for each feature scale. Based on MFGN, we propose a Memory-based Multi-scale Feature Aggregation Network (MMFAN), which aggregates historical features with current features through two parallel memory networks. The multi-scale feature and memory based method enhances the features of each frame from two perspectives, thus improving vehicle detection accuracy. On the widely adopted vehicle detection dataset UA-DETRAC, the mAP of our method is 7.4% higher than that of its static detector. The proposed approach is further validated on the well-known ImageNet VID benchmark, where it demonstrates performance comparable to memory-driven state-of-the-art frameworks.

1. Introduction

Vehicle detection is an indispensable component of Intelligent Transportation Systems (ITS) [1]. In this research field, vehicle detection methods using Convolutional Neural Networks (CNNs) [2,3] have achieved significant breakthroughs in detection accuracy. However, these CNN-based methods are mostly designed for vehicle detection in still images and are therefore called static detectors. When detecting vehicles in videos, static detectors suffer from low detection accuracy due to occlusion, motion blur, and other degradations. Thus, how to improve the accuracy of vehicle detection in videos is a research topic worth exploring. To address this issue, adopting video object detection methods with joint spatiotemporal modeling proves to be an effective solution.
Among video object detection methods, combining static detectors with tracking is the most natural and direct approach. Song et al. [4] propose a tracking-based method to optimize the computational efficiency of video vehicle detection in highway scenes. To enhance detection accuracy, D&T [5] establishes connections between inter-frame detection results: it leverages cross-correlation between consecutive frame features to track objects and generate motion trajectories. These tracking-based video object detection methods can be regarded as performing inter-frame optimization on the outputs of static detectors. They apply tracking algorithms at the output layer of detection networks to establish inter-frame associations of detection results, thereby compensating for and refining the detections. However, they fail to build feature-level associations between frames, which limits the achievable accuracy gains, and compensation becomes impossible in the case of long-term detection failure. As illustrated in Figure 1, the static detector persistently fails to detect the gray car (marked by a yellow arrow) for 20 frames, from frame 261 to frame 280. If the track age threshold is set to 10, tracking-based methods lose their compensation capability for this missed vehicle from frame 271 onward. This problem is a common defect of tracking-based video vehicle detection methods; the fundamental solution lies in enhancing feature representations through inter-frame temporal modeling.
In real traffic surveillance scenes, there are usually multiple vehicles at different scales, and the scale of the same vehicle also varies considerably. For the vehicle scale problem, an effective solution for static detectors is to use a feature pyramid. For example, Mask R-CNN [6] uses the five-scale features designed in FPN. However, as the number of features in the pyramid increases, temporal feature operations become more complex, so directly transplanting the multi-scale feature strategy of Mask R-CNN into video vehicle detection networks induces significant computational redundancy. In response to this problem, Wang et al. [7] propose a video object detection method, MMNet (Motion-aided Memory Network), which introduces features at three scales into the detection network. However, this method first fuses the multi-scale features into a single-scale feature using an attention mechanism; the fused features are then propagated between frames through a memory network. MMNet thus avoids inter-frame operations on multi-scale features, which is computationally efficient, but how to improve the scale adaptability of video object detection methods remains worth studying.
For the accuracy degradation in video vehicle detection, exploiting temporal information is the most useful and fundamental solution. Early accuracy-focused video object detection studies [8,9] often use optical flow to propagate and aggregate features between frames. Feature enhancement can be implemented at the pixel level and the instance level: the former operates on the global feature map, while the latter focuses on the features of candidate regions. The method described in [8] is pixel-level. MANet [9] proposes an optical flow-driven feature aggregation mechanism to refine the feature representations of region proposals. However, optical flow is low-level information and has limitations when used to approximate high-level convolutional features. In MEGA [10], self-attention is used to strengthen features at the instance level; since the features of each frame are augmented with global and local information, it is very time-consuming. Memory networks are inherently suited to temporal modeling. The Long Short-Term Memory (LSTM) network is a classical memory network designed for learning from sequential data. In computer vision, Convolutional LSTM (ConvLSTM) [11] is a widely adopted memory-based approach; it introduces a convolutional structure into LSTM to capture spatiotemporal correlations in image sequences. For instance, Chen et al. [12] integrate ConvLSTM into the SSD [13] detector to form high-level and low-level temporal feature memory modules, thereby enhancing the detection accuracy of SSD. However, this method is built on a one-stage detector, so its detection accuracy is limited.
In this work, we concentrate on improving the accuracy of video vehicle detection at the feature level, thereby addressing the long-term detection failure problem of tracking-based methods. We propose a Multi-scale Feature and Memory based Network (MFMN) for video vehicle detection. We implement MFMN on Faster R-CNN [14], a two-stage static detector renowned for its high detection accuracy. In response to the scale problem of vehicle detection in traffic monitoring scenes, a multi-scale feature strategy is introduced: in addition to the original feature layer used to generate region proposals in Faster R-CNN, we incorporate a low-level feature that is more sensitive to small vehicles. To simultaneously establish inter-frame associations for the multi-scale features, we introduce a parallel multi-scale feature aggregation network based on ConvLSTM. It enhances the current frame features with historical frame features, thereby improving the feature representation of the current frame. In this way, the features of each frame are strengthened in both the spatial and temporal dimensions, so our method effectively enhances the feature representation and thus achieves high accuracy in video vehicle detection.
The key contributions of this work are summarized as follows:
  • We analyze and design a multi-scale feature fusion method to improve the adaptability of the static detector to vehicle scales. The method considers the characteristics of each feature scale and the complexity of temporal operations.
  • We propose a parallel memory network for feature enhancement. It establishes a framework for simultaneously aggregating multi-scale features.
  • We verify the proposed video vehicle detection network on the UA-DETRAC and ImageNet VID datasets, obtaining satisfactory detection accuracy.
The remainder of this paper is organized as follows: Section 2 provides a review and analysis of related works. We elaborate on the technical architecture and implementation details of the proposed framework in Section 3. In Section 4, we conduct comprehensive experiments on two benchmarks to quantitatively and qualitatively evaluate our method. Finally, we present the conclusion and future work in Section 5.

2. Related Works

2.1. Vehicle Detection in Images

Object detection is a fundamental research focus in computer vision, and vehicles are among the most important objects in traffic scenes. Currently, CNN-based methods are the mainstream vehicle detection techniques. Depending on whether the generation of detection boxes requires anchors, CNN-based methods can be categorized into anchor-based methods [15,16,17] and anchor-free methods [18,19,20]. Anchor-based methods further fall into two categories: one-stage detectors [13,15] and two-stage detectors [14].
One-stage detectors, represented by SSD [13], predict detections in a single inference stage, thus achieving high detection speed but relatively lower accuracy than two-stage methods. SSD generates anchor boxes on six feature layers to handle objects at different scales. YOLOv4 [15] is a more recent member of the YOLO family of algorithms; it improves the backbone network and proposes an optimized training strategy, thereby enhancing its detection capability. In contrast to one-stage detectors, the primary advantage of two-stage detectors lies in their superior detection accuracy. The two-stage Faster R-CNN [14] first generates region proposals on single-scale feature maps and then classifies and refines these regions to produce detection results. Building upon Faster R-CNN, numerous two-stage detectors have been proposed, such as R-FCN [16] and Mask R-CNN [6]. Anchor-free methods eliminate the need for predefined anchors and instead directly predict bounding boxes or keypoints of objects. CornerNet [18], CenterNet [21], CentripetalNet [22], and DETR [23] are representative anchor-free methods. CornerNet generates bounding boxes by detecting two keypoints at the top-left and bottom-right corners of objects. CenterNet predicts the center point and dimensions of objects to form bounding boxes. DETR directly outputs detection boxes and their categories through a Transformer architecture.
Given the accuracy-driven requirements of our research and the advantages of the above detectors, we prioritize two-stage detectors for their superior precision. Consequently, Faster R-CNN with ResNet-101 [24] as its backbone is adopted as the static detector in this study.

2.2. Vehicle Detection in Videos

Detecting vehicles in videos is more challenging than in still images due to issues such as blurring, occlusion, and variations in lighting conditions. For the task of vehicle detection in videos, video object detection methods are a better choice than static detectors. To detect objects in videos efficiently or accurately, video object detection methods build upon static detectors and utilize inter-frame context information in the time domain. In general, video object detection methods are categorized into box-level methods [25,26,27,28] and feature-level methods [29,30,31,32].
Box-level methods employ spatiotemporal information in the post-processing stage of static detectors by establishing temporal associations between per-frame detection boxes. Seq-NMS [26] proposes a sequence NMS strategy: it links detection boxes across frames and reassigns confidence scores along trajectories, thereby boosting the scores of weak detections. T-CNN [33] incorporates bounding box propagation to suppress false negatives and integrates tracking frameworks to enable persistent inter-frame associations of detected instances. In contrast to the aforementioned accuracy-oriented approaches, the method in [34] improves detection speed by utilizing a keyframe strategy. The method in [35] is based on the faster anchor-free detector CenterNet and propagates historical detections through heatmaps. Liu et al. [28] address video object detection through two key mechanisms: estimating object locations based on key frames and using object motion as guidance to enhance detection accuracy. Since no spatiotemporal information is introduced into the detection network itself, these box-level methods cannot fundamentally improve the accuracy of video object detection.
Feature-level methods are built around inter-frame feature computations, and their goal is to construct inter-frame correlations of feature representations. Techniques for establishing such correlations include optical flow, attention mechanisms, memory networks, and so on. For instance, STSN [30] employs deformable convolution to aggregate inter-frame features. DFF [29] and the method in [36] leverage optical flow to capture the inter-frame motion of objects within video sequences, thereby enabling temporal feature propagation across frames. DFF extracts features only for key frames and propagates them to non-key frames, so it effectively improves detection speed with a slight reduction in accuracy. Hetang et al. [37] incorporate an impression network into DFF to exploit feature information from historical frames. FGFA [8] relies on optical flow-based feature propagation to reinforce frame-wise feature representations: the features of each frame are aggregated from the extracted features and the propagated features, which enhances the feature expression of each frame and substantially boosts detection accuracy.
With the development of Vision Transformers (ViTs), video object detection based on ViTs has emerged as a prominent research focus [38,39,40]. Researchers have incorporated attention mechanisms and transformer-based object detection architectures into video object detection, establishing new ways of learning spatiotemporal features. Shvets et al. [40] design a temporal correlation method to learn the similarity of region proposals; it augments current-frame proposal features by adaptively selecting appropriate proposals from adjacent frames. To achieve high detection accuracy, Wu et al. [41] propose an attention-based SELSA (Sequence Level Semantics Aggregation) module, which enables feature propagation and aggregation from the perspective of the entire sequence. PTSEFormer [42] and TransVOD [43] are end-to-end video object detection methods built upon transformer-based detectors. PTSEFormer adopts a progressive method to enhance both temporal and spatial information. TransVOD takes Deformable DETR [20] as its static detector and proposes an effective spatial-temporal transformer architecture.
Memory networks are an effective and practical means of feature propagation and aggregation. The method in [44] proposes a dynamic detection network, which detects objects on sparse key frames and tracks them on non-key frames; it uses LSTM to propagate and aggregate instance-level features between frames. Liu et al. [45] propose an improved Bottleneck-LSTM to realize inter-frame feature propagation. Building on [45], Liu et al. [46] propose a memory-based dynamic detection model to achieve fast and accurate video object detection. MMNet [7] achieves efficient object detection in compressed video. STMN [47] aggregates features through a novel Spatial-Temporal Memory Module (STMM) while aligning features via the MatchTrans module. For accurate feature propagation and aggregation, we use the memory network ConvLSTM to establish inter-frame dependencies of features.

3. Method

In this section, we detail the framework of the proposed MFMN. The structure of the detection network is displayed in Figure 2. It consists of three components: (1) a Multi-scale Feature Generation Network (MFGN), (2) a Memory-based Multi-scale Feature Aggregation Network (MMFAN), and (3) a Detection Network (DN). MFGN extracts the convolutional features of the current frame and selects two specific feature layers to generate multi-scale features for memorization. MMFAN propagates the multi-scale features and aggregates them with the corresponding-scale features of the current frame. DN contains the RPN (region proposal network), RoI Pooling, and the classification and regression heads; their design remains consistent with that of the original static detector.

3.1. Multi-Scale Feature Generation Network

In this study, we start from a standard Faster R-CNN and take it as our static detector. ResNet-101 is adopted to extract convolutional features for each frame. It contains five stages: conv1, conv2-x, conv3-x, conv4-x, and conv5-x, where conv2-x to conv5-x contain varying numbers of bottleneck blocks. Faster R-CNN generates region proposals on the conv4-23 feature of ResNet-101, so region proposals are generated at a single feature scale. In the following, we abbreviate conv2-x, conv3-x, conv4-x, and conv5-x as conv2, conv3, conv4, and conv5.
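For readers who want to reproduce this feature extraction step, the following minimal PyTorch sketch (not the authors' code) pulls the conv3 and conv4 stage outputs from a torchvision ResNet-101. The node names 'layer2' and 'layer3' correspond to conv3 (stride 8) and conv4 (stride 16) in the notation above; the input size is an arbitrary example.

```python
import torch
from torchvision.models import resnet101
from torchvision.models.feature_extraction import create_feature_extractor

# Build ResNet-101 and expose the conv3 (layer2) and conv4 (layer3) stage outputs.
backbone = resnet101(weights=None)
extractor = create_feature_extractor(
    backbone, return_nodes={"layer2": "conv3", "layer3": "conv4"})

frame = torch.randn(1, 3, 512, 512)          # one video frame I_t (example size)
feats = extractor(frame)
print(feats["conv3"].shape)                  # torch.Size([1, 512, 64, 64])  -> stride 8
print(feats["conv4"].shape)                  # torch.Size([1, 1024, 32, 32]) -> stride 16
```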
To detect objects at different scales, Faster R-CNN defines anchors at three scales. However, this design provides little benefit for detecting small vehicles: after 16× downsampling, the features of small vehicles are weak or even absent in the conv4 feature map. As shown in Figure 3, the car indicated by the red arrow in (a) has a small scale, and Faster R-CNN produces a false negative for it. The feature maps of conv3 and conv4 for this image are shown in (b), where the feature region of the missed vehicle is indicated by the red box. It can be seen that the conv3 features are more salient but less semantic, while the conv4 features are less salient but more semantic. Faster R-CNN only uses conv4 features to generate region proposals, which causes the failure to recall this vehicle. This indicates that shallow-layer features are more useful than deep-layer features for detecting small vehicles.
To address the low accuracy in detecting small vehicles, we propose MFGN for Faster R-CNN. Our purpose is to enrich the scales of the candidate regions generated by the RPN. MFGN takes the features extracted by ResNet-101 as input and outputs multi-scale features used for memorization and aggregation.
The five-stage structure of ResNet-101 produces features at five scales. The multi-scale features commonly used for detection include conv2 to conv5. Additionally, a deeper feature, conv6, can be produced by downsampling conv5, which provides an additional feature scale. Consequently, the available multi-scale features are {conv2, conv3, conv4, conv5, conv6}; they are the features commonly used by static detectors such as Mask R-CNN. The conv6 feature has a downsampling factor of 64 and thus the smallest spatial resolution, so it is not beneficial for detecting small vehicles. Therefore, the features usable for the vehicle detection task are conv2 to conv5.
The multi-scale features in our method need to be propagated and aggregated between frames through a parallel feature memory network, and the number of feature scales determines the number of parallel memory networks. Too many feature scales would significantly increase the computational complexity. To improve the accuracy for vehicles at different scales while maintaining detection speed, we adopt two feature scales and predefine multi-scale anchors on each of them: small anchors on the large-scale feature and large anchors on the small-scale feature. We therefore do not select conv5, because it is suited to large targets while conv4 is generally sufficient for large vehicles, and we drop conv2 because of its low semanticity.
The multi-scale feature strategy is shown in Table 1: the conv3 and conv4 features are selected to generate the multi-scale features. For each pixel of a feature map, we predefine 9 anchors with scales {(16², 32², 64²), (128², 256², 512²)} (the former group on conv3, the latter on conv4) and aspect ratios {0.5, 1, 2}. Compared with conv4, conv3 is a shallow-layer feature with a downsampling factor of 8 relative to the original image; its larger feature scale is beneficial for small vehicle detection. However, the conv3 feature is less semantic than the conv4 feature, which is unfavorable for vehicle detection. To resolve the conflict between the scale and semanticity of the conv3 feature, we enhance its semantics by fusing it with the up-sampled conv4 feature.
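As an illustration of the anchor layout described above, the short sketch below (hypothetical helper names, not the authors' implementation) places three anchor scales and three aspect ratios at every position of the two feature maps, with the small scales assigned to the conv3 map (stride 8) and the large scales to the conv4 map (stride 16).

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride, scales, ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) anchors centered on each feature-map cell."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:            # anchor area is s * s
                for r in ratios:        # r = height / width
                    w, h = s / np.sqrt(r), s * np.sqrt(r)
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.asarray(anchors)

# Small anchors on the large (conv3) map, large anchors on the small (conv4) map.
conv3_anchors = make_anchors(64, 64, stride=8,  scales=(16, 32, 64))
conv4_anchors = make_anchors(32, 32, stride=16, scales=(128, 256, 512))
print(conv3_anchors.shape, conv4_anchors.shape)   # (36864, 4) (9216, 4)
```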
Given a video sequence $\{I_0, I_1, \ldots, I_t, \ldots, I_n\}$, $I_t \in \mathbb{R}^{3 \times W \times H}$ represents the video frame at time $t$, where $W$ and $H$ are the width and height of the image, respectively. We first extract the multi-scale convolutional feature set $F_t$ of the current frame:
$F_t = CNN_{Res}(I_t) = \{F_t^1, F_t^2, F_t^3, F_t^4, F_t^5\},$
where $CNN_{Res}(\cdot)$ is the feature extraction operation of ResNet-101. $F_t^3$ and $F_t^4$ are the output features of the conv3 and conv4 layers, with $F_t^3 \in \mathbb{R}^{512 \times \frac{W}{8} \times \frac{H}{8}}$ and $F_t^4 \in \mathbb{R}^{1024 \times \frac{W}{16} \times \frac{H}{16}}$. Based on $F_t^3$ and $F_t^4$, the multi-scale features are obtained as follows:
$F_t^{4\_m} = v(F_t^4), \qquad F_t^{3\_m} = v(F_t^3) \oplus u(F_t^{4\_m}),$
where $F_t^{3\_m}$ and $F_t^{4\_m}$ denote the multi-scale features computed from $F_t^3$ and $F_t^4$. $v(\cdot)$ is a convolution with a $256 \times 1 \times 1$ kernel whose purpose is to normalize the feature channels. $u(\cdot)$ is the up-sampling operation, which up-samples its input features by a factor of 2. The operator '$\oplus$' denotes element-wise addition of two feature maps; its function is to strengthen the semantic information of $F_t^3$. After channel normalization and semantic enhancement, the multi-scale features $F_t^{3\_m}$ and $F_t^{4\_m}$ are fed into MMFAN for further computation.
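A minimal PyTorch sketch of the fusion defined above is given below. It assumes the conv3 map is exactly twice the spatial size of the conv4 map and uses nearest-neighbor upsampling for $u(\cdot)$, which the paper does not specify, so treat the details as assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFGNHead(nn.Module):
    """Channel-normalize conv3/conv4 with 1x1 convs and add the upsampled conv4
    feature to conv3 to strengthen its semantics (a sketch of MFGN's fusion)."""
    def __init__(self, c3_channels=512, c4_channels=1024, out_channels=256):
        super().__init__()
        self.v3 = nn.Conv2d(c3_channels, out_channels, kernel_size=1)  # v(.) for conv3
        self.v4 = nn.Conv2d(c4_channels, out_channels, kernel_size=1)  # v(.) for conv4

    def forward(self, f3, f4):
        f4_m = self.v4(f4)                                         # F_t^{4_m} = v(F_t^4)
        up = F.interpolate(f4_m, scale_factor=2, mode="nearest")   # u(F_t^{4_m})
        f3_m = self.v3(f3) + up                                    # F_t^{3_m} = v(F_t^3) (+) u(F_t^{4_m})
        return f3_m, f4_m

f3, f4 = torch.randn(1, 512, 64, 64), torch.randn(1, 1024, 32, 32)
f3_m, f4_m = MFGNHead()(f3, f4)   # -> [1, 256, 64, 64], [1, 256, 32, 32]
```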

3.2. Memory-Based Multi-Scale Feature Aggregation Network

The proposed MMFAN utilizes ConvLSTM to propagate and aggregate the multi-scale features between frames. It consists of two memory networks, denoted Memory1 and Memory2, which have identical designs and handle the multi-scale features $F_t^{3\_m}$ and $F_t^{4\_m}$, respectively. The structure of a memory network (Memory1 or Memory2) in MMFAN is shown in Figure 4. It takes $F_t^{i\_m}$ and $H_{t-1}^i$ as inputs, where $H_{t-1}^i$ denotes the output of the memory network at time $t-1$. These inputs enter the ConvLSTM unit through the input gate, and after the memory network's operations the output $H_t^i$ is produced by the output gate. $H_t^i$ fuses the feature information of the current frame and historical frames and is called an aggregated feature. Its computation is expressed as follows:
$H_t^i = N_{m\_lstm}(H_{t-1}^i, F_t^{i\_m}),$
where $F_t^{i\_m}$ is the output feature of MFGN, $i = 3, 4$, and $0 \le t \le n$. The aggregated feature $H_t^i$ has the same size as $F_t^{i\_m}$, i.e., $H_t^i \in \mathbb{R}^{c \times w \times h}$. $N_{m\_lstm}(\cdot)$ is the feature aggregation operation of the memory network. When $t = 0$, $H_{t-1}^i = \Phi$, where $\Phi$ is an empty matrix of the same size as $F_0^{i\_m}$.
Since the input is an image sequence, we employ ConvLSTM as the memory network. The multi-scale features $F_t^{i\_m}$ are memorized and aggregated in ConvLSTM, ultimately producing features enhanced with historical information. The specific operations of $N_{m\_lstm}(\cdot)$ are as follows:
$I_t = \sigma(W_{Fi} * F_t^{i\_m} + W_{Hi} * H_{t-1}^i + b_i)$
$\tilde{C}_t^i = \tanh(W_{Fc} * F_t^{i\_m} + W_{Hc} * H_{t-1}^i + b_c)$
$f_t = \sigma(W_{Ff} * F_t^{i\_m} + W_{Hf} * H_{t-1}^i + b_f)$
$C_t^i = f_t \odot C_{t-1}^i + I_t \odot \tilde{C}_t^i$
$o_t = \sigma(W_{Fo} * F_t^{i\_m} + W_{Ho} * H_{t-1}^i + b_o)$
$H_t^i = o_t \odot \tanh(C_t^i)$
where '$*$' represents the convolution operator and '$\odot$' is the Hadamard product. After the above operations, the multi-scale aggregated feature $H_t = \{H_t^i\} = \{H_t^3, H_t^4\}$ is produced, with $H_t^3 \in \mathbb{R}^{256 \times \frac{W}{8} \times \frac{H}{8}}$ and $H_t^4 \in \mathbb{R}^{256 \times \frac{W}{16} \times \frac{H}{16}}$. It is the output feature of MMFAN. $H_t$ is then fed into DN to generate the final detection results $(B_t, S_t)$:
$D_t = DN(H_t, F_t^5, l) = (B_t, S_t),$
where $DN(\cdot)$ denotes the operation of DN, whose design is the same as in Faster R-CNN. The two feature components of $H_t$ are fed into DN in parallel, and $l$ is the number of categories.
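The sketch below shows one way to realize a memory branch of MMFAN with a single ConvLSTM cell following the gate equations above; the 3 × 3 kernel size and the fused gate convolution are assumptions made for brevity, and MMFAN simply runs two such cells in parallel, one per feature scale.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """One MMFAN memory branch: a ConvLSTM cell that aggregates the current
    multi-scale feature F_t^{i_m} with the previous hidden state H_{t-1}^i."""
    def __init__(self, in_channels=256, hidden_channels=256, kernel_size=3):
        super().__init__()
        self.hidden_channels = hidden_channels
        # One convolution produces all four gates (input, forget, output, candidate).
        self.gates = nn.Conv2d(in_channels + hidden_channels, 4 * hidden_channels,
                               kernel_size, padding=kernel_size // 2)

    def forward(self, f_t, state=None):
        if state is None:                            # t = 0: empty memory (Phi)
            b, _, h, w = f_t.shape
            zeros = f_t.new_zeros(b, self.hidden_channels, h, w)
            state = (zeros, zeros)
        h_prev, c_prev = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([f_t, h_prev], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c_t = f * c_prev + i * torch.tanh(g)         # new cell memory C_t^i
        h_t = o * torch.tanh(c_t)                    # aggregated feature H_t^i
        return h_t, (h_t, c_t)

memory1, memory2 = ConvLSTMCell(), ConvLSTMCell()    # one branch per feature scale
state1 = state2 = None
for f3_m, f4_m in [(torch.randn(1, 256, 64, 64), torch.randn(1, 256, 32, 32))] * 3:
    h3, state1 = memory1(f3_m, state1)               # H_t^3, fed to DN
    h4, state2 = memory2(f4_m, state2)               # H_t^4, fed to DN
```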

4. Experiments

We evaluate our method, MFMN, on the ImageNet VID_vehicle and UA-DETRAC datasets. Since the scale range of vehicles in UA-DETRAC is relatively large, the experiments related to multi-scale features are carried out on it, while the performance evaluation of the sub-networks MFGN and MMFAN is carried out on ImageNet VID_vehicle. All experiments are run on an NVIDIA GeForce GTX 1080 Ti GPU.

4.1. Experimental Settings

4.1.1. Datasets

The ImageNet VID dataset [48] is a large and well-known video object detection dataset. It comprises 5354 videos, partitioned into 3862 training, 555 validation, and 937 testing videos. The dataset does not provide annotation files for the test set, so, consistent with previous studies [10,49], we use the validation set to evaluate methods. ImageNet VID contains 30 categories, which are a subset of the 200 categories in ImageNet DET. In this paper, we select the categories car, bus, bicycle, and motorcycle to construct a new vehicle dataset named ImageNet VID_vehicle. The new dataset comprises 653 video clips for training and 81 video clips for testing.
The UA-DETRAC dataset [50] is a public dataset for vehicle detection and tracking. The videos in this dataset all come from real roadway surveillance, with a total duration of about 10 h, comprising 60 training and 40 testing video clips. These clips cover common traffic scenarios such as intersections and highways, and the monitoring conditions include normal, night, and rainy. The test set includes 12 nighttime video clips and 7 rainy video clips. UA-DETRAC contains 8250 manually annotated vehicles across more than 140,000 video frames, with a total of 1.21 million labeled bounding boxes. Among these, 5936 annotated vehicle instances are allocated to the training set. The dataset includes four vehicle categories: car, bus, van, and others, whose sample sizes in the training set are 5177, 106, 610, and 43, respectively. The number of vehicle samples per category is thus highly unbalanced. Therefore, following [51], all experiments on UA-DETRAC do not differentiate vehicle categories; we perform a binary classification into vehicle objects (all vehicle categories) and non-vehicle objects (background or irrelevant objects).

4.1.2. Evaluation Metrics

To assess the efficacy of our methodology, we use Precision, Recall, Precision-Recall (P-R) curve, AP (average precision), and mAP (mean average precision) as performance metrics. They are the predominant evaluation criteria in video object detection methods. The specific formulas for calculations are as follows:
$Precision_i = \frac{TP_i}{TP_i + FP_i}$
$Recall_i = \frac{TP_i}{TP_i + FN_i}$
$AP_i = \int_0^1 Precision_i(Recall_i)\, d(Recall_i)$
$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i$
where $i$ is the index of a category and $N$ is the number of categories. For each category, $Precision_i$ indicates the proportion of correctly identified samples among the samples predicted to be in that category; it reflects the reliability of the model when predicting samples of category $i$. $Recall_i$ represents the proportion of correctly identified samples out of the total number of samples in that class; it measures the model's ability to correctly identify samples of each class. The P-R curve plots Precision (vertical axis) against Recall (horizontal axis). AP is defined as the area under the P-R curve, quantifying the model's performance across all classification thresholds. mAP is the arithmetic mean of $AP_i$ over all $N$ categories, serving as a comprehensive evaluation metric for multi-class detection. Additionally, computational efficiency is quantified by frames processed per second (fps). In the experiments, the IoU threshold between a detection box and the ground truth is 0.5.
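For completeness, the sketch below computes AP as the area under the P-R curve and averages it over categories. It uses the common monotone-envelope (all-point) interpolation, which the paper does not state explicitly, so treat that choice as an assumption.

```python
import numpy as np

def average_precision(precision, recall):
    """Area under the P-R curve for one category (all-point interpolation)."""
    p = np.concatenate(([0.0], precision, [0.0]))
    r = np.concatenate(([0.0], recall, [1.0]))
    for k in range(p.size - 2, -1, -1):       # make precision non-increasing
        p[k] = max(p[k], p[k + 1])
    step = np.where(r[1:] != r[:-1])[0]       # points where recall changes
    return float(np.sum((r[step + 1] - r[step]) * p[step + 1]))

def mean_average_precision(per_class_pr):
    """per_class_pr: list of (precision, recall) arrays, one entry per category."""
    return sum(average_precision(p, r) for p, r in per_class_pr) / len(per_class_pr)

# Toy example: one category whose precision decays as recall grows.
prec = np.array([1.0, 0.8, 0.6])
rec = np.array([0.2, 0.5, 0.8])
print(mean_average_precision([(prec, rec)]))   # 0.62
```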

4.1.3. Implementation Details

The video vehicle detection models on both ImageNet VID_vehicle and UA-DETRAC are trained with the Stochastic Gradient Descent (SGD) algorithm, using a batch size of 4 for all training on both datasets. The whole training procedure is divided into two steps. Taking the ImageNet VID_vehicle dataset as an example, the training process is as follows:
The first step trains the static detection network, i.e., Faster R-CNN integrated with MFGN. We train this network on a mixture of ImageNet VID_vehicle and DET_vehicle (using the same vehicle categories as ImageNet VID_vehicle). The training data are formed by sampling 15 frames per video from ImageNet VID_vehicle and 2K images per category from DET_vehicle. This step runs for 90k iterations, with a learning rate of 0.001 for the first 70k iterations and 0.0001 for the last 20k iterations.
The second step trains MMFAN, with the weight parameters before the RPN frozen. In this phase, the model obtained from the previous step serves as the pre-trained model, and training is performed on the ImageNet VID_vehicle dataset. The timestep of ConvLSTM is set to 10. We select 10 consecutive frames from each video sequence to form the training set, discarding video sequences with fewer than 10 frames. This training runs for 120k iterations, with a learning rate of 0.001 for the first 90k iterations and 0.0001 for the remaining iterations.
In the first step of model training on UA-DETRAC, training samples are collected by sampling 1 frame from every 5 frames, resulting in a total of 16,867 samples. In the second step, training is performed on consecutive frames with a length of 10. For every 80 frames, 10 consecutive frames are extracted to form the new training set, totaling 10,830 frames. Samples without vehicles in the dataset are excluded from both model training and testing. The iterations and learning rate are consistent with those in ImageNet VID_vehicle.
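The two-step schedule above can be expressed as iteration-based SGD with a single learning-rate drop. The helper below is a hedged sketch: the `model` and `loader` arguments are placeholders, and the momentum value is an assumption not stated in the paper.

```python
import torch

def train_stage(model, loader, total_iters, decay_iter, base_lr=1e-3, low_lr=1e-4):
    """Generic trainer for either stage: lr = 1e-3 until decay_iter, then 1e-4."""
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr, momentum=0.9)
    it = 0
    while it < total_iters:
        for batch in loader:                     # loader yields mini-batches of size 4
            if it >= total_iters:
                break
            for group in optimizer.param_groups:
                group["lr"] = base_lr if it < decay_iter else low_lr
            loss = model(batch)                  # model returns its summed detection loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            it += 1

# Stage 1: static detector + MFGN on the image mixture, 90k iterations (drop at 70k).
# train_stage(static_model, mixed_loader, total_iters=90_000, decay_iter=70_000)
# Stage 2: MMFAN on 10-frame clips with pre-RPN weights frozen, 120k iterations (drop at 90k).
# train_stage(mfmn_model, clip_loader, total_iters=120_000, decay_iter=90_000)
```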

4.2. Ablation Study

We ablate each component of MFMN. The experiments are implemented on both the UA-DETRAC and ImageNet VID_vehicle datasets.

4.2.1. Analysis of Multi-Scale Feature Strategies

We verify the performance of the multi-scale feature strategy on the UA-DETRAC dataset; Table 2 presents the experimental results. Strategy I is the proposed MFGN, whose multi-scale features are $\{F_t^{i\_m}\}$, $i = 3, 4$. In Strategy II, the multi-scale features $\{F_t^{i\_m}\}$, $i = 2, 3, 4, 5, 6$, are produced by the five feature layers {conv2, conv3, conv4, conv5, conv6}. To improve efficiency, Strategy II predefines a single anchor scale for each pixel, specifically {32², 64², 128², 256², 512²}.
As shown in the second row of Table 2, the single-scale strategy adopted in Faster R-CNN achieves 74.8% Precision and 80.6% Recall. With Strategy I, Precision and Recall increase to 77.6% and 87.4%, improvements of 2.8% and 6.8%, respectively, over the single-scale strategy. Strategy II generates the largest number of anchors and achieves the highest Precision and Recall, surpassing the single-scale strategy by 5.0% and 9.3%, respectively. Comparing the two multi-scale feature strategies, Strategy II improves Precision by 2.2% and Recall by 2.5% at the cost of three additional feature scales.
Figure 5 shows the speed of the above three feature strategies. Compared to the 4.8 fps of the single-scale strategy, our MFGN experiences a decrease of 0.7 fps, while Strategy II shows a larger drop of 1.6 fps. The comparison of Precision, Recall, and running speed demonstrates that the proposed multi-scale feature strategy achieves a better balance between accuracy and efficiency.

4.2.2. Module Analysis

We conducted experiments on ImageNet VID_vehicle to analyze the influence of MFGN and MMFAN on detection performance. The results are shown in Table 3: method (a) represents the base static detector, which achieves an mAP of 71.1%. With the MFGN module, method (b) improves the mAP of the base detector by 3.5%. With the MMFAN module, method (c) obtains an mAP of 75.4%, which is 4.3% higher than the base detector. Method (d) represents the proposed MFMN and produces an mAP of 77.8%, 6.7% higher than the base detector. In addition, we report the mAP of the above methods under different motion speeds. Following the protocol in [8], vehicles are divided into three groups: slow, medium, and fast. For these three groups, method (d) obtains 83.2% mAP (slow), 76.0% mAP (medium), and 59.4% mAP (fast), which are 4.0%, 7.1%, and 8.1% higher than those of the base detector, respectively. These accuracy gains prove the effectiveness of MFGN and MMFAN in improving video vehicle detection accuracy.
The last column of Table 3 shows the running time of the above methods. The comparison results demonstrate that improvements in accuracy inevitably lead to varying degrees of speed degradation. The detection results on each vehicle category are reported in Table 4. Method (d) achieves the highest detection accuracy in all vehicle categories. The AP of bicycle, bus, car, and motorcycle are 77.9%, 83.8%, 67.7%, and 81.7%, respectively.

4.3. Comparison with State-of-the-Art Methods

We compare the proposed MFMN with state-of-the-art detectors on the UA-DETRAC and ImageNet VID datasets.

4.3.1. Results on the UA-DETRAC Dataset

The P-R curves and mAP values of our MFMN and mainstream image-based detectors are compared in Figure 6. Faster R-CNN achieves an mAP of 71.4%, surpassing the SSD detector by 8.2%. The mAP of YOLOv10-M is 70.7%, slightly lower than that of Faster R-CNN. MFMN obtains the best result: it improves the mAP of its base detector, Faster R-CNN, to 78.8%, a 7.4% improvement, and its mAP is 8.1% higher than that of the recent detector YOLOv10-M. Additionally, the P-R curves reveal that MFMN consistently achieves the highest Precision across most Recall values. The mAP and P-R curve comparisons show that our video-based vehicle detection method is significantly superior to static detectors.
We further evaluate MFMN on the UA-DETRAC dataset under nighttime and rainy scenes; the experimental results are shown in Table 5. In the nighttime condition, MFMN achieves 73.7% Precision, 77.1% Recall, and 75.2% mAP; its Precision and mAP outperform those in the rainy scene, while the Recall in the rainy scene is 1.3% higher than that at nighttime. These comparisons validate the detection capability of MFMN under challenging surveillance conditions. Furthermore, Figure 7 illustrates some experimental results in the two scenes, indicating the effectiveness of our method in real-world surveillance.
Figure 8 visualizes the qualitative detection results of the compared methods in two scenes, showing 4 consecutive frames for each scene. As evidenced by the figure, all four methods accurately detect medium-sized and larger vehicles, and their detection results are temporally stable. For instance, the bus in Scene 1 and the multiple cars in the lower and central regions of Scene 2 are successfully detected in successive frames.
For small-sized vehicles, the SSD detector exhibits the highest rate of missed detections, while the proposed MFMN achieves the best detection performance. For instance, in the 4 frames displayed in Scene 1, SSD fails to detect the small cars in all frames. Faster R-CNN misses the car indicated by the red arrow in the first two frames. It also fails to detect the vehicle marked by the blue arrow in the last three frames, indicating poorer temporal stability in detecting small-sized vehicles. YOLOv10-M successfully detects the car marked with the red arrow, but produces a false negative for the vehicle labeled by the blue arrow. Our MFMN demonstrates superior detection performance for small-sized vehicles compared to the aforementioned methods. MFMN occasionally experiences missed detections, as shown by the car indicated by the yellow arrow in Scene 1. The detection results in Scene 2 align with those in Scene 1, further demonstrating that the MFMN achieves higher detection accuracy for small-sized vehicles.

4.3.2. Results on the ImageNet VID Dataset

We compare the proposed method with the state-of-the-art detectors on ImageNet VID. Table 6 presents a comparison with methods that employ similar theories to ours. Since STMN [47], LSTM-SSD [45], and MMNet [7] do not release the AP for each category, this experiment compares the mAP across all 30 categories of the dataset.
In Table 6, Faster R-CNN [14], Mask R-CNN [6], and Faster R-CNN+MFGN are image-based methods. Faster R-CNN+MFGN improves the mAP of Faster R-CNN by 3.1%. Mask R-CNN is based on Faster R-CNN and introduces the multi-scale feature strategy described as "Strategy II" in Table 1, achieving an mAP of 77.2%. This result surpasses Faster R-CNN by 3.8% and outperforms Faster R-CNN+MFGN (employing "Strategy I" in Table 1) by 0.7%. The advantage primarily stems from Mask R-CNN's use of more feature scales for generating candidate regions, its replacement of RoI Pooling with RoI Align, and the higher resolution of its preprocessed input images.
The compared video-based methods all employ LSTM for temporal feature analysis. Faster R-CNN+MMFAN proposed in this paper achieves 77.6% mAP, demonstrating superior performance to STMN and TPN, and outperforming LSTM-SSD by 23.2%. This indicates that, in addition to the network architecture for inter-frame feature processing, the base detector also serves as a primary factor contributing to the accuracy disparity. MMNet, designed for compressed videos, achieves an mAP of 76.4%. The proposed MFMN obtains the highest mAP of 79.3%, which is 2.9% higher than MMNet. Therefore, experimental results confirm that our method has stable and excellent detection performance on the ImageNet VID dataset.
Table 7 shows a comparison with recent state-of-the-art methods that employ theories different from ours. CHP is a heatmap-based method and obtains 76.7% mAP. YOLOX-M + LPN and FCOS + LPN apply instance-level temporal attention and achieve mAP scores of 75.1% and 79.8%, respectively; this discrepancy is caused by their static detectors. TransVOD is a transformer-based approach. It achieves the highest mAP of 82.0%, surpassing our MFMN by 2.7%. Notably, its base detector also has the highest mAP, 78.3%, exceeding our static detector by 3.9%. Relative to their respective static detectors, TransVOD improves mAP by 3.1%, our MFMN by 5.9%, and FCOS + LPN by 6.5%, the largest gain. These horizontal and vertical comparisons demonstrate the effectiveness of our method in detection performance.
Figure 9 visualizes the qualitative comparison between MFMN and its base detector, showing 4 consecutive frames in three different traffic scenes. Scene 1 suffers from motion blur caused by rapid movement, along with pillars whose color is similar to that of the vehicles. Both Faster R-CNN and the proposed MFMN succeed in detecting the cars; however, in frame 4, Faster R-CNN generates a false positive by mistaking the orange pillar for a car (as indicated by the green arrow). Scene 2 presents challenges of non-uniform illumination and partial occlusion. Both methods accurately detect the bus at the center of the image. For the partially occluded white car in the upper-right region, Faster R-CNN misses it in three consecutive frames (as marked by the red arrows), whereas the proposed MFMN consistently detects this vehicle across all four displayed frames. Scene 3 contains several blurred traffic cones. As indicated by the blue arrows, Faster R-CNN produces persistent false positives across consecutive frames, detecting these obstacles as cars, while the proposed MFMN is immune to such errors.

5. Conclusions

To address the low accuracy of static detectors in detecting vehicles in videos, we propose a multi-scale feature and memory based detection method built upon the two-stage detector Faster R-CNN. In this study, we use feature maps at two distinct scales in MFGN and design three anchor scales for each feature scale. This design increases the diversity of region proposal scales, thereby strengthening the network's robustness to vehicles of varying dimensions, and it also avoids the computational inefficiency incurred by redundant multi-scale features. Motivated by the temporal relevance between historical and current frame features, we incorporate two memory networks with identical structures in MMFAN. The two pathways respectively perform inter-frame propagation and aggregation of the two-scale features extracted by the MFGN module. The multi-scale features output by MMFAN integrate the spatial information of the current frame and the historical temporal context. Experiments on UA-DETRAC and ImageNet VID show that MFMN effectively improves the accuracy of video vehicle detection while maintaining the detection speed.

Author Contributions

Conceptualization, Y.Y.; methodology, Y.Y.; validation, Y.Y. and S.L.; formal analysis, Y.Y. and S.L.; data curation, Y.Y.; writing—original draft preparation, Y.Y.; writing—review and editing, Y.Y. and S.L.; visualization, Y.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available in the references [48,50].

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, J.; Wang, F.; Wang, K.; Lin, W.; Xu, X.; Chen, C. Data-Driven Intelligent Transportation Systems: A Survey. IEEE Trans. Intell. Transport. Syst. 2011, 12, 1624–1639. [Google Scholar] [CrossRef]
  2. Zhang, F.; Li, C.; Wang, K.; Yang, F. Vehicle Detection in Urban Traffic Surveillance Images Based on Convolutional Neural Networks with Feature Concatenation. Sensors 2019, 19, 594. [Google Scholar] [CrossRef]
  3. Zhang, W.; Gao, X.; Wang, K.; Yang, C.; Jiang, F.; Chen, Z. A object detection and tracking method for security in intelligence of unmanned surface vehicles. J. Ambient Intell. Humaniz. Comput. 2020, 13, 1279–1291. [Google Scholar] [CrossRef]
  4. Song, H.; Liang, H.; Li, H.; Dai, Z.; Yun, X. Vision-based vehicle detection and counting system using deep learning in highway scenes. Eur. Transp. Res. Review. 2019, 11, 51. [Google Scholar] [CrossRef]
  5. Feichtenhofer, C.; Pinz, A.; Zisserman, A. Detect to track and track to detect. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  6. He, K.M.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  7. Wang, S.Y.; Lu, H.C.; Deng, Z.D. Fast Object Detection in Compressed Video. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7104–7113. [Google Scholar]
  8. Zhu, X.Z.; Wang, Y.J.; Dai, J.F.; Yuan, L.; Wei, Y.C. Flow-guided feature aggregation for video object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 408–417. [Google Scholar]
  9. Wang, S.Y.; Zhou, Y.C.; Yan, J.J.; Deng, Z.D. Fully motion-aware network for video object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 542–557. [Google Scholar]
  10. Chen, Y.H.; Cao, Y.; Hu, H.; Wang, L.W. Memory Enhanced Global-Local Aggregation for Video Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10337–10346. [Google Scholar]
  11. Shi, X.J.; Chen, Z.R.; Wang, H.; Wang, H.Y.; Wong, W.K.; Woo, W.C. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, BC, Canada, 7–12 December 2015; pp. 802–810. [Google Scholar]
  12. Chen, X.Y.; Yu, J.Z.; Wu, Z.X. Temporally identity-aware SSD with attentional LSTM. IEEE Trans. Cybern. 2019, 50, 2674–2686. [Google Scholar] [CrossRef]
  13. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  14. Ren, S.Q.; He, K.M.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 39, 91–99. [Google Scholar] [CrossRef]
  15. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  16. Dai, J.F.; Li, Y.; He, K.M.; Sun, J. R-fcn: Object detection via region-based fully convolutional networks. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 379–387. [Google Scholar]
  17. Girshick, R. Fast r-cnn. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1440–1448. [Google Scholar]
  18. Law, H.; Deng, J. CornerNet: Detecting Objects as Paired Keypoints. Int. J. Comput. Vis. 2020, 128, 642–656. [Google Scholar] [CrossRef]
  19. Roh, B.; Shin, J.W.; Shin, W.Y.; Kim, S. Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity. arXiv 2021, arXiv:2111.14330v2. [Google Scholar]
  20. Zhu, X.Z.; Su, W.J.; Lu, L.W.; Li, B.; Wang, X.G.; Dai, J.F. Deformable DETR: Deformable Transformers for End-to-End Object Detection. arXiv 2021, arXiv:2010.04159v4. [Google Scholar]
  21. Duan, K.W.; Bai, S.; Xie, L.X.; Qi, H.G.; Huang, Q.M.; Tian, Q. CenterNet: Keypoint Triplets for Object Detection. arXiv 2019, arXiv:1904.08189v3. [Google Scholar]
  22. Dong, Z.W.; Li, G.X.; Liao, Y.; Wang, F.; Ren, P.J.; Qian, C. CentripetalNet: Pursuing High-Quality Keypoint Pairs for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 18–23 June 2020; pp. 10516–10525. [Google Scholar]
  23. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
  24. He, K.M.; Zhang, X.Y.; Ren, S.Q.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  25. Jiao, L.; Zhang, R.; Liu, F.; Yang, S.; Hou, B.; Li, L.; Tang, X. New Generation Deep Learning for Video Object Detection: A Survey. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 3195–3215. [Google Scholar] [CrossRef] [PubMed]
  26. Han, W.; Khorrami, P.; Paine, T.L.; Ramachandran, P.; Babaeizadeh, M.; Shi, H.H.; Li, J.N.; Yan, S.C.; Huang, T.S. Seq-nms for video object detection. arXiv 2016, arXiv:1602.08465. [Google Scholar]
  27. Tang, P.; Wang, C.; Wang, X.; Liu, W.; Zeng, W.; Wang, J. Object detection in videos by short and long range object linking. arXiv 2018, arXiv:1801.09823. [Google Scholar]
  28. Liu, X.; Nejadasl, F.K.; van Gemert, J.C.; Booij, O.; Pintea, S.L. Objects Do Not Disappear: Video Object Detection by Single-Frame Object Location Anticipation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 1–13. [Google Scholar]
  29. Zhu, X.Z.; Xiong, Y.W.; Dai, J.F.; Yuan, L.; Wei, Y.C. Deep feature flow for video recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2349–2358. [Google Scholar]
  30. Bertasius, G.; Torresani, L.; Shi, J. Object detection in video with spatiotemporal sampling networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 331–346. [Google Scholar]
  31. Han, L.; Yin, Z. Global Memory and Local Continuity for Video Object Detection. IEEE Trans. Multimed. 2023, 25, 3681–3693. [Google Scholar]
  32. Zhang, X.G.; Chou, C.H. Source-free Domain Adaptation for Video Object Detection Under Adverse Image Conditions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–18 June 2024; pp. 1–10. [Google Scholar]
  33. Kang, K.; Li, H.; Yan, J.; Zeng, X.; Yang, B.; Xiao, T.; Zhang, C.; Wang, Z.; Wang, R.; Wang, X. T-cnn: Tubelets with Convolutional Neural Networks for Object Detection from Videos. IEEE Trans. Circuits Syst. Video Technol. 2017, 28, 2896–2907. [Google Scholar] [CrossRef]
  34. Yao, C.H.; Fang, C.; Shen, S.; Wan, Y.Y.; Yang, M.S. Video Object Detection via Object-level Temporal Aggregation. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 160–177. [Google Scholar]
  35. Xu, Z.J.; Hrustic, E.; Vivet, D. CenterNet Heatmap Propagation for Real-time Video Object Detection. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 1–15. [Google Scholar]
  36. Zhu, X.Z.; Dai, J.F.; Yuan, L.; Wei, Y.C. Towards High Performance Video Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7210–7218. [Google Scholar]
  37. Hetang, C.; Qin, H.; Liu, S.; Yan, J. Impression Network for Video Object Detection. arXiv 2017, arXiv:1712.05896v1. [Google Scholar]
  38. Deng, J.J.; Pan, Y.W.; Yao, T.; Zhou, W.G.; Li, H.Q.; Mei, T. Relation Distillation Networks for Video Object Detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7023–7032. [Google Scholar]
  39. Jiang, Z.K.; Liu, Y.; Yang, C.Y.; Liu, J.H.; Gao, P.; Zhang, Q.; Xiang, S.M.; Pan, C.H. Learning Where to Focus for Efficient Video Object Detection. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 1–16. [Google Scholar]
  40. Shvets, M.; Liu, W.; Berg, A. Leveraging Long-Range Temporal Relationships Between Proposals for Video Object Detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  41. Wu, H.P.; Chen, Y.T.; Wang, N.Y.; Zhang, Z.X. Sequence Level Semantics Aggregation for Video Object Detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9217–9225. [Google Scholar]
  42. Wang, H.; Tang, J.; Liu, X.D.; Guan, S.Y.; Xie, R.; Song, L. PTSEFormer: Progressive Temporal-Spatial Enhanced TransFormer Towards Video Object Detection. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 485–501. [Google Scholar]
  43. Zhou, Q.; Li, X.; He, L.; Yang, Y.; Cheng, G.; Tong, Y.; Ma, L.; Tao, D. TransVOD: End-to-End Video Object Detection with Spatial-Temporal Transformers. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 1–17. [Google Scholar] [CrossRef]
  44. Yuan, Y.; Liang, X.D.; Wang, X.L.; Yeung, D.Y.; Gupta, A. Temporal Dynamic Graph LSTM for Action-driven Video Object Detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 1801–1810. [Google Scholar]
  45. Liu, M.; Zhu, M.L. Mobile Video Object Detection with Temporally-Aware Feature Maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 5686–5695. [Google Scholar]
  46. Liu, M.; Zhu, M.; White, M.; Li, Y.; Kalenichenko, D. Looking Fast and Slow: Memory-guided Mobile Video Object Detection. arXiv 2019, arXiv:1903.10172. [Google Scholar]
  47. Xiao, F.Y.; Lee, Y.J. Video Object Detection with an Aligned Spatial-temporal Memory. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 485–501. [Google Scholar]
  48. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. Imagenet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
  49. Han, M.F.; Wang, Y.L.; Chang, X.J.; Qiao, Y. Mining Inter-Video Proposal Relations for Video Object Detection. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 1–16. [Google Scholar]
  50. Wen, L.; Du, D.; Cai, Z.; Lei, Z.; Chang, M.C.; Qi, H.; Lim, J.; Yang, M.H.; Lyu, S. UA-DETRAC: A New Benchmark and Protocol for Multi-Object Detection and Tracking. arXiv 2015, arXiv:1511.04136. [Google Scholar]
  51. Sun, S.J.; Akhtar, N.; Song, X.Y.; Mian, A.; Shah, M. Simultaneous detection and tracking with motion modelling for multiple object tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 1–24. [Google Scholar]
  52. Kang, K.; Li, H.S.; Xiao, T.; Ouyang, W.; Yan, J.J.; Liu, X.; Wang, X.G. Object Detection in Videos with Tubelet Proposal Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 727–735. [Google Scholar]
  53. Szegedy, C.; Liu, W.; Jia, Y.Q.; Sermanet, P.; Rabinovich, A. Going Deeper with Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  54. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  55. Sun, G.; Hua, Y.; Hu, G.; Robertson, N. Efficient One-stage Video Object Detection by Exploiting Temporal Consistency. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022. [Google Scholar]
  56. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  57. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. arXiv 2019, arXiv:1904.01355. [Google Scholar]
Figure 1. The problem of prolonged detection failure in static detectors. Vehicles with unstable results are marked with yellow arrows. The characters at the bottom of each image indicate where and when the surveillance image was acquired.
Figure 1. The problem of prolonged detection failure in static detectors. Vehicles with unstable results are marked with yellow arrows. The characters at the bottom of each image indicate where and when the surveillance image was acquired.
Figure 2. Overview of the proposed video vehicle detection method. conv1 to conv5 are the five stages of the feature extraction network ResNet-101. Faster R-CNN combined with MFGN consists of ResNet-101, MFGN, and DN. MFGN takes the features of the last convolutional layers of conv3 and conv4 as inputs and produces multi-scale features for memorization; in MFGN, the semantic information of the low-level feature is enhanced by the high-level feature. MMFAN is a memory-based aggregation network for the multi-scale features. Memory1 and Memory2 work in parallel, and both are based on ConvLSTM.
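To make the two-scale generation step in Figure 2 concrete, the snippet below is a minimal PyTorch sketch of top-down feature enhancement: the conv4 feature is projected, upsampled, and added to the projected conv3 feature. The class name, the 1×1 lateral convolutions, nearest-neighbour upsampling, the 256-channel output, and the example feature-map sizes are all illustrative assumptions; the paper's exact layer configuration is not reproduced here.

```python
# Hypothetical sketch of the two-scale feature generation in Figure 2.
# Channel counts follow ResNet-101 (conv3: 512, conv4: 1024); everything else is assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoScaleFeatureGeneration(nn.Module):
    """Produces two feature scales; the low-level (conv3) feature is
    semantically enhanced by the upsampled high-level (conv4) feature."""
    def __init__(self, c3_channels=512, c4_channels=1024, out_channels=256):
        super().__init__()
        self.lateral3 = nn.Conv2d(c3_channels, out_channels, kernel_size=1)
        self.lateral4 = nn.Conv2d(c4_channels, out_channels, kernel_size=1)
        self.smooth3 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, c3, c4):
        p4 = self.lateral4(c4)                                    # high-level scale (stride 16)
        top_down = F.interpolate(p4, size=c3.shape[-2:], mode="nearest")
        p3 = self.smooth3(self.lateral3(c3) + top_down)           # enhanced low-level scale (stride 8)
        return p3, p4

# Example with feature-map sizes of an assumed 1024x576 input.
c3 = torch.randn(1, 512, 72, 128)    # ResNet-101 conv3 output
c4 = torch.randn(1, 1024, 36, 64)    # ResNet-101 conv4 output
p3, p4 = TwoScaleFeatureGeneration()(c3, c4)
print(p3.shape, p4.shape)            # (1, 256, 72, 128) and (1, 256, 36, 64)
```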
Figure 3. Detection results and feature maps of Faster R-CNN. In (a), the small vehicle indicated by the red arrow is not detected. In (b), the red bounding boxes highlight the features of the missed vehicle in conv3 and conv4. The conv3 features are more salient but carry less semantic information.
Figure 4. The structure of the memory network in MMFAN.
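The memory networks in Figures 2 and 4 are described as ConvLSTM-based, with Memory1 and Memory2 running in parallel over the two feature scales. The following is a minimal sketch, assuming a standard ConvLSTM cell and a simple per-scale recurrence in which the hidden state serves as the temporally aggregated feature; MMFAN's actual fusion of historical and current features is not reproduced, and all names and sizes are illustrative.

```python
# Minimal ConvLSTM-based memory sketch (assumptions only, not the authors' implementation).
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        # Input and hidden state are concatenated and mapped to the four gates.
        self.gates = nn.Conv2d(2 * channels, 4 * channels, kernel_size, padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c + i * g          # memory cell accumulates historical feature information
        h = o * torch.tanh(c)      # hidden state acts as the temporally aggregated feature
        return h, (h, c)

# Two parallel memories, one per feature scale (Memory1 for p3, Memory2 for p4).
memory1, memory2 = ConvLSTMCell(256), ConvLSTMCell(256)
state1 = (torch.zeros(1, 256, 72, 128), torch.zeros(1, 256, 72, 128))
state2 = (torch.zeros(1, 256, 36, 64), torch.zeros(1, 256, 36, 64))
for _ in range(4):                           # consecutive video frames
    p3, p4 = torch.randn(1, 256, 72, 128), torch.randn(1, 256, 36, 64)
    agg3, state1 = memory1(p3, state1)       # aggregated conv3-scale feature
    agg4, state2 = memory2(p4, state2)       # aggregated conv4-scale feature
```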
Figure 5. Running time comparison of three feature strategies.
Figure 6. P-R curves on UA-DETRAC.
Figure 7. Example detection results in nighttime and rainy scenes on UA-DETRAC. For each scene, the detection results of four consecutive frames are displayed.
Figure 8. Example detection results on UA-DETRAC. For each scene, we sample 4 consecutive frames. Undetected vehicles are marked with arrows.
Figure 9. Example detection results on the ImageNet VID dataset. For each scene, we display the detection results for four consecutive frames, with detection errors annotated by arrows in distinct colors.
Table 1. Details of the multi-scale strategy. Because of its weak semantics, the conv3 feature is semantically enhanced by the conv4 feature.
Feature Layer | Downsampling Multiple | Enhanced? | Anchor Scale | Aspect Ratio
conv3 | 8 | Yes | 16², 32², 64² | 0.5, 1, 2
conv4 | 16 | No | 128², 256², 512² | 0.5, 1, 2
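As an illustration of how the predefined anchors in Table 1 can be materialized, the sketch below generates the nine anchors (three scales × three aspect ratios) at a single feature-map location, treating each squared scale value as the anchor area; the authors' exact anchor parameterization may differ, and the function name is hypothetical.

```python
# Hypothetical anchor generation for one feature-map location, using Table 1's
# scales and aspect ratios. Assumes scale^2 is the anchor area and ratio = height / width.
import math

def anchors_at_location(cx, cy, scales, ratios):
    """Return (x1, y1, x2, y2) boxes centred at (cx, cy)."""
    boxes = []
    for s in scales:
        area = s * s
        for r in ratios:
            w = math.sqrt(area / r)
            h = w * r
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

conv3_anchors = anchors_at_location(0, 0, scales=(16, 32, 64), ratios=(0.5, 1, 2))
conv4_anchors = anchors_at_location(0, 0, scales=(128, 256, 512), ratios=(0.5, 1, 2))
print(len(conv3_anchors), len(conv4_anchors))  # 9 anchors per location on each level
```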
Table 2. Comparison of multi-scale feature strategies on the UA-DETRAC dataset.
Faster R-CNN | MMFAN | Strategy I | Strategy II | Features | Anchors | Precision (%) | Recall (%)
✓ | – | – | – | conv4 | 20,736 | 74.8 | 80.6
– | ✓ | ✓ | – | {conv3, conv4} | 103,680 | 77.6 | 87.4
– | ✓ | – | ✓ | {conv2–conv6} | 147,312 | 79.8 | 89.9
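The anchor totals in Table 2 are consistent with simple per-location counting. Assuming an input of roughly 1024 × 576 pixels (a resized UA-DETRAC frame; the exact input size is not stated here) and the strides in Table 1, the check below reproduces the three totals, with the {conv2–conv6} row matching an FPN-style setting of three anchors per location.

```python
# Arithmetic check of the anchor totals in Table 2 (input size and per-level
# anchor counts for the {conv2-conv6} row are assumptions).
H, W = 576, 1024

def locations(stride):
    return (H // stride) * (W // stride)

conv4_only = locations(16) * 9                                 # 3 scales x 3 ratios per location
two_scales = (locations(8) + locations(16)) * 9                # {conv3, conv4}
fpn_like = sum(locations(s) for s in (4, 8, 16, 32, 64)) * 3   # {conv2-conv6}, 3 anchors per location
print(conv4_only, two_scales, fpn_like)                        # 20736 103680 147312
```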
Table 3. Influence of the modules on detection performance. Improvements over (a) are shown in parentheses.
Methods | Faster R-CNN | MFGN | MMFAN | mAP (%) | mAP_S (%) | mAP_M (%) | mAP_F (%) | Runtime (fps)
(a) | ✓ | – | – | 71.1 | 79.2 | 68.9 | 51.3 | 5.6
(b) | ✓ | ✓ | – | 74.6 (+3.5) | 80.9 (+1.7) | 72.2 (+3.3) | 56.0 (+4.7) | 5.3
(c) | ✓ | – | ✓ | 75.4 (+4.3) | 82.1 (+2.9) | 73.9 (+5.0) | 57.1 (+5.8) | 4.9
(d) | ✓ | ✓ | ✓ | 77.8 (+6.7) | 83.2 (+4.0) | 76.0 (+7.1) | 59.4 (+8.1) | 4.2
Table 4. Detection results of all categories on ImageNet VID_vehicle (AP, %).
Methods | Bicycle | Bus | Car | Motorcycle
(a) | 69.9 | 78.6 | 56.0 | 79.9
(b) | 73.4 | 78.8 | 65.9 | 80.4
(c) | 74.6 | 81.3 | 64.4 | 81.2
(d) | 77.9 | 83.8 | 67.7 | 81.7
Table 5. Detection results in nighttime and rainy scenes.
Scenes | Precision (%) | Recall (%) | mAP (%)
Night | 73.7 | 77.1 | 75.2
Rainy | 70.9 | 78.4 | 74.1
Table 6. Accuracy comparison with state-of-the-art methods that employ similar techniques on ImageNet VID.
Methods | Feature Network | Static Detector | Multi-Scale Feature | LSTM | mAP (%)
Faster R-CNN [14] | ResNet-101 [24] | Faster R-CNN | – | – | 73.4
Mask R-CNN [6] | ResNet-101 | Mask R-CNN | – | – | 77.2
STMN [47] | ResNet-101 | Fast R-CNN [17] | – | – | 71.4
TPN [52] | GoogLeNet [53] | Fast R-CNN | – | – | 68.4
LSTM-SSD [45] | MobileNet [54] | SSD [13] | – | ✓ | 54.4
MMNet [7] | ResNet-101 | R-FCN | – | – | 76.4
Faster R-CNN + MFGN | ResNet-101 | Faster R-CNN | ✓ | – | 76.5
Faster R-CNN + MMFAN | ResNet-101 | Faster R-CNN | – | ✓ | 77.6
MFMN | ResNet-101 | Faster R-CNN | ✓ | ✓ | 79.3
Table 7. Accuracy comparison with recent state-of-the-art methods on ImageNet VID.
Methods | Feature Network | Static Detector | mAP (%) | mAP (%) of Static Detector
CHP [35] | ResNet-101 [24] | CenterNet [21] | 76.7 | 73.6
TransVOD [43] | ResNet-101 | Deformable DETR | 82.0 | 78.3
YOLOX-M + LPN [55] | DarkNet-53 | YOLOX-M [56] | 75.1 | 69.4
FCOS + LPN [55] | ResNet-101 | FCOS [57] | 79.8 | 73.3
MFMN | ResNet-101 | Faster R-CNN [14] | 79.3 | 73.4
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
