Article

A Computer Vision-Based Algorithm for Detecting Vehicle Yielding to Pedestrians

1
School of Journalism and Communication, Jilin University, Changchun 130012, China
2
School of Computer Science, Beijing University of Post and Telecommunications, Beijing 100876, China
3
School of Public Administration, Jilin University, Changchun 130012, China
4
School of Resources and Environment, Jilin Agricultural University, Changchun 130118, China
5
School of Computer Science and Technology, Jilin University, Changchun 130012, China
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Sustainability 2023, 15(22), 15714; https://doi.org/10.3390/su152215714
Submission received: 9 August 2023 / Revised: 24 October 2023 / Accepted: 26 October 2023 / Published: 7 November 2023

Abstract

Computer vision has made remarkable progress in traffic surveillance, but determining whether a motor vehicle yields to pedestrians still requires considerable human effort. This study proposes an automated method for detecting whether a vehicle yields to pedestrians in intelligent transportation systems. The method employs a target-tracking algorithm that uses feature maps and license plate IDs to track the motion of relevant elements in the camera’s field of view. By analyzing the positions of motor vehicles and pedestrians over time, we predict pedestrian warning points and the hazardous areas in front of vehicles to determine whether the vehicles yield to pedestrians. Extensive experiments on the MOT16 dataset, a real traffic street scene video dataset, and a Unity3D virtual simulation scene dataset combined with SUMO demonstrate the superiority of the tracking algorithm. Compared with current state-of-the-art methods, our method achieves significant improvements in processing speed without compromising accuracy, and its operational efficiency makes it well suited to real-time recognition requirements. Our experiments and evaluations also reveal a marked reduction in ID switches, improving the reliability with which violations are attributed to the correct vehicles. Such reliability is crucial in practical urban settings characterized by dynamic interactions and variable conditions. The approach can be applied across various weather, time, and road conditions, achieving high predictive accuracy and interpretability in detecting vehicle–pedestrian interactions. This algorithm illuminates viable pathways for integrating technological innovation and sustainability, paving the way for more resilient and intelligent urban ecosystems.

1. Introduction

The World Health Organization (WHO) reports that, globally, over 1.35 million people die annually as a result of road traffic accidents, with vehicles being the most common cause of pedestrian crash deaths [1]. Without improvements to road safety, this number is expected to increase. A study found that 67% of pedestrian fatalities in urban areas occurred on main roads, 50% at non-intersections, and 56% in the dark [2]. This may be due to inadequate sidewalks, poor visibility, or ignorance of pedestrian rights. To reduce casualties, it is important to improve traffic infrastructure, study vehicles with pedestrian avoidance technology, and provide better traffic safety awareness training and education for drivers and pedestrians. Reasonable legal means and effective regulatory measures can also encourage drivers to comply with traffic regulations.
In the context of burgeoning urban environments globally, addressing the dynamics between pedestrians and vehicles is paramount for shaping sustainable cities, anchoring in the principles of safety, efficiency, and environmental consideration. Sustainable urban traffic systems are a pivotal component of smart cities, where the interaction between vehicles and pedestrians becomes a crucial determinant in ensuring not only individual safety but also the overarching sustainability of urban transportation systems.
Smart cities aim to leverage technology and data-driven solutions to optimize urban life, addressing challenges related to transportation, energy consumption, and public services, thus contributing to sustainability. The integration of advanced technologies such as AI and deep learning in traffic management systems epitomizes the ethos of smart cities by enhancing safety and efficiency while mitigating environmental impacts through optimized traffic flows and reduced accidents.
In this vein, the transgressions in traffic norms, such as vehicles failing to yield to pedestrians, escalate the risks of accidents, impacting public safety, environmental well-being, social equity, and economic vitality. Therefore, robust and efficient pedestrian detection and vehicle behavior analysis technologies are instrumental in fortifying the sustainable development of urban traffic systems. These technologies enable precise identification and prediction of traffic incidents, foster adherence to traffic norms, and enhance the equitable and efficient utilization of roads.
Currently, surveillance cameras installed at various traffic intersections capture numerous images of suspected traffic violations when vehicles pass through intersections [3]. The traditional regulatory approach involves conducting multiple rounds of manual review on these images to confirm whether drivers violate traffic rules. This method aims to ensure accuracy and fairness while reducing driver complaint rates. However, this method is labor-intensive, inefficient, and involves repetitive tasks, which can adversely affect the mental state of the invigilators, thereby compromising the efficiency, fairness, and impartiality of violation audits. Therefore, it is crucial to find a technical solution that can quickly, effectively, fairly, and openly supervise and judge the behavior and degree of violation of drivers while reducing costs.
The reduction in costs and human resources, achieved through automation, enables the allocation of resources to other critical urban developmental areas, such as infrastructure improvement, public service enhancement, and environmental conservation, thereby fostering multifaceted urban sustainability.
Moreover, cost-effective and resource-efficient traffic management solutions contribute significantly to the realization of smart cities by allowing municipalities to manage traffic more effectively and respond promptly to violations, ensuring smoother traffic flows, reducing congestion, and decreasing vehicle emissions, all contributing to a healthier and more sustainable urban environment.
The incorporation of such sustainable and intelligent solutions is paramount for the evolution of urban areas, where the synergy between technological advancements and sustainability can lead to the establishment of cities that are more responsive, adaptive, and cognizant of both human and environmental needs, paving the way for the holistic and harmonious development of urban ecosystems.
Therefore, by amalgamating efficiency, cost-effectiveness, and technological innovation, this study accentuates the sustainable ethos of smart city initiatives, providing a pragmatic approach to bolster urban sustainability through enhanced traffic management. We introduce an innovative automatic auditing method, designed to assess whether vehicles yield to pedestrians at zebra crossings, based on deep learning technologies. This method is intended to supersede traditional manual auditing techniques, refining existing tracking algorithms by utilizing feature maps to meticulously track both pedestrians and vehicles. A warning-point prediction algorithm is further developed to accurately determine whether vehicles duly yield to pedestrians.
The main contributions of this paper are as follows:
  • We propose an automatic auditing method to check if vehicles yield to pedestrians at zebra crossings based on deep learning, which replaces the existing manual auditing method;
  • We enhance the conventional tracking algorithm through the utilization of feature maps, enabling more accurate and efficient tracking of pedestrians and vehicles in varied environmental conditions;
  • We formulate a novel warning-point prediction algorithm, specifically designed to accurately assess whether vehicles yield to pedestrians, reducing the probability of inaccurately attributing a vehicle’s violation to another and ensuring reliable and fair violation assessments;
  • We meticulously conducted extensive experiments, using datasets such as MOT16, our privately compiled street scene video dataset, and a virtual simulation scene dataset amalgamated with SUMO, to validate the efficacy and reliability of our proposed methodologies. Our experiments demonstrated that our method significantly outperforms existing solutions in operational efficiency, showcasing a speed of 21 Hz compared to Deep SORT’s 11 Hz, while maintaining a balanced accuracy, manifesting a MOTA of 55.3 and MOTP of 70.3;
  • The results of our comprehensive evaluations underline the substantial reduction in ID switches and false matches, elucidating our algorithm’s superiority in accuracy and reliability, particularly in dynamic and interacting urban settings, and reinforcing its potential in fostering sustainable traffic management solutions and enhancing urban sustainability.
The rest of the paper is organized as follows: Section 2 provides an overview of existing work on behavior detection and object detection and tracking. Section 3.1 gives an overview of the algorithm, followed by a detailed description of the proposed behavior detection algorithm in Section 3.2. Section 3.3 describes the detection of vehicles yielding to pedestrians based on warning points. Section 4 presents the experimental results that verify the effectiveness of the proposed method. Conclusions and future work are given in Section 5.

2. Related Works

In recent years, many pedestrian detection and avoidance algorithms have been developed in the field of automatic driving from the perspective of driverless cars [4,5]. Detecting whether vehicles yield to pedestrians involves the following steps: use existing roadside traffic surveillance cameras to collect video at the intersection; determine the position of the main crossing line under the camera; detect targets frame by frame, identifying traffic elements such as vehicles and pedestrians and determining their bounding boxes and categories; and track the motion trajectories of the targets. Violations of motor vehicles failing to give way to pedestrians are then identified by analyzing the target trajectories. The work related to these steps is briefly described below.

2.1. Behavior Detection

Papageorgiou and Poggio [6] developed a general trainable target detection system whose feature set is based on an overcomplete Haar wavelet transform and provides a rich pattern description. Abramson and Steux [7,8] used four algorithms for initial detection: a 5 × 5 feature classifier, a diagonal leg detector, a pedestrian motion estimator, and a vertical edge detector, fused under a particle filter framework to realize pedestrian detection and tracking. Havasi et al. [9] used the symmetrical characteristics of pedestrians’ legs to detect pedestrians. For fixed cameras installed in infrastructure, the tendency of pedestrians to walk along specific paths can also be exploited; tracking a large number of pedestrians in the scene helps to learn these paths. For example, Makris and Ellis [10] used a Bayesian HMM-based method to generate a probability distribution model of trajectories in the scene. Large et al. [11] used clustering-based techniques to learn motion patterns and obtain long-term estimates of object motion, which can be used to predict where currently detected pedestrians may go and to estimate the probability of collision with vehicles. However, these algorithms detect and avoid pedestrians from the perspective of unmanned vehicles, and there is little work on detecting whether vehicles give way to pedestrians from the perspective of traffic surveillance cameras.

2.2. Object Detection and Tracking

Ross Girshick [12] proposed R-CNN, which applies a selective search algorithm to a single image. To address the time and storage consumption of R-CNN, Girshick proposed Fast R-CNN [13] with a region-of-interest pooling layer. Faster R-CNN, proposed by Shaoqing Ren et al. [14], abandoned the time-consuming selective search algorithm and instead used a region proposal network (RPN) to generate the regions to be detected. Mask R-CNN [15] is a classic instance segmentation algorithm; it subtly improves the region-of-interest pooling layer and puts forward the concept of ROI alignment. SSD [16] is a one-stage target detection algorithm proposed by W. Liu et al. that uses multi-box prediction and a convolutional neural network for detection. Building on previous work, MobileNet-v2 [17] introduced inverted residuals and linear bottlenecks.
In the traffic area, Jaiswal et al. [18] proposed a method for counting vehicle movements at traffic intersections with temporal and visual similarity-based re-identification. Kumar et al. [19] proposed a vehicle re-identification and trajectory reconstruction method using multiple moving cameras in the CARLA driving simulator. Wang et al. [20] used a video channel decomposition saliency region network for vehicle re-identification, and Li et al. [21] proposed a discriminative-region attention and orthogonal-view generation model for vehicle re-identification. Erik Bochinski et al. [22] proposed an online multi-target fast tracking algorithm based on the maximum intersection over union (IoU) of targets, which can easily run at 100k FPS and has had a very significant impact in the field of fast multi-target tracking. However, the algorithm has two serious disadvantages: it relies too heavily on detection and does not exploit information between frames (tracking), and for targets that move too fast, interrupted frames are generated and the IoU strategy fails. Erik Bochinski et al. [23] later improved on the IoU tracker by adding a visual single-target tracker to the tracking process, which greatly reduced the number of ID switches and fragments while maintaining a high tracking speed. Qi Chu et al. [24] proposed an online MOT framework based on CNNs that exploits the advantages of single-object trackers to adapt the appearance model and search for the target in the next frame. Alex Bewley et al. [25] proposed SORT, an online multi-target tracking framework based on target detection.

3. Behavior Detection Algorithm

3.1. Algorithm Overview

In behavior detection, specifically in determining whether motor vehicles yield to pedestrians, the primary difficulty lies in accurately tracking targets (such as vehicles and pedestrians) across video frames, exploiting the high frame rate of the videos and the slow motion of the targets. Challenges also arise when multiple targets of similar appearance and category are present in a video frame. This study aims to provide a method for detecting whether motor vehicles yield to pedestrians. The training data are the BDD100K [26] dataset, and the data used for the experiments consist of over 1000 real 4K traffic videos collected by UAVs from the perspective of a traffic surveillance camera. This section describes the proposed detection method, which consists of the following steps:
  • The trained object detection network is used to recognize the target elements in the collected video frames to obtain the target element set;
  • The motion trajectories of the elements in the target element set are tracked using an IoU tracking algorithm based on feature maps;
  • Based on the identified primary crossing line and the tracks of the tracked objects, a timing-based violation detection algorithm is used to detect whether a vehicle fails to yield in the current video frame.
The design details of the object tracking and detection of vehicle yielding to pedestrian modules in our algorithm are given in Section 3.2 and Section 3.3.

3.2. Object Tracking Based on Feature Map

The target recognition algorithm can be used to identify the traffic elements. There are several algorithms available, including SSD, Faster R-CNN [12], FCN [27], and Yolo [28], but we choose SSD as the best choice for our algorithm due to its superior performance [29]. We use MobileNet-v2 feature extraction to improve recognition speed while maintaining accuracy for online, real-time recognition.
After training the neural network with BDD100K, we obtain the model parameters and feed each preprocessed frame of the traffic video to the trained network for forward propagation. The algorithm then outputs a group of target elements with their corresponding categories and confidence values. We select the category with the highest confidence from each non-background category group to obtain each target element’s category and bounding box, and finally output a group of target elements.
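For illustration, this per-frame post-processing step can be sketched as follows. This is a minimal example rather than the paper’s exact pipeline; the function name, the array layout (one row of class scores per detected box, with class 0 as background), and the confidence threshold are assumptions made for the sketch.

```python
import numpy as np

def select_detections(boxes, scores, conf_threshold=0.5):
    """Keep, for each detected box, its best non-background class.

    boxes:  (N, 4) array of pixel coordinates.
    scores: (N, C) array of per-class confidences, class 0 = background.
    The threshold value is illustrative only.
    """
    results = []
    for box, class_scores in zip(boxes, scores):
        cls = int(np.argmax(class_scores[1:])) + 1      # best non-background class
        conf = float(class_scores[cls])
        if conf >= conf_threshold:
            results.append({"box": box, "class": cls, "confidence": conf})
    return results
```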
Given the high video frame rate and slow target motion, we detect overlaps between the target element sets of consecutive frames and track each target element’s trajectory along the time series using maximum intersection-over-union filtering. The algorithm maintains a set of sequence box lists, each corresponding to the tracking trajectory of one target element. Each track contains the category of the target element, the frame number at which the element was last seen, and the list of detection boxes that constitute the element’s movement path.
To determine which track a new detection belongs to, we compute the intersection over union between the latest target element’s detection box and the last detection box of every track in the sequence box list set using Equation (1), with a threshold of 0.75 found to give the best effect. If the intersection over union exceeds the threshold, the track corresponding to that last detection box is added to the set of candidate tracks. Otherwise, the target element is treated as the first frame of a new track, and its detection box and category are added to the sequence box list set.
$$\mathrm{IoU} = \max_{i} \frac{\mathrm{Area}\left(bbox_{new} \cap bbox_{i}\right)}{\mathrm{Area}\left(bbox_{new} \cup bbox_{i}\right)}, \quad i \in [1, n] \tag{1}$$
where $bbox_{new}$ represents the detection box of the latest element, $n$ represents the number of sequence box lists in the collection (that is, the total number of tracks currently tracked), $bbox_{i}$ represents the last detection box of the $i$-th sequence box list, and $\mathrm{Area}(\cdot)$ represents the area of a rectangle.
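The association step of Equation (1) can be sketched as follows. The sketch assumes boxes are stored as (x1, y1, x2, y2) tuples and tracks as dictionaries holding their detection boxes in time order; both the data layout and the 0.75 threshold mirror the description above rather than any released code.

```python
def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2) in pixel coordinates.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def candidate_tracks(new_box, tracks, iou_threshold=0.75):
    # Mirrors Equation (1): compare the newest detection against the last box of
    # every existing track and keep those whose IoU exceeds the threshold.
    return [t for t in tracks if iou(new_box, t["boxes"][-1]) >= iou_threshold]
```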
Building on the IoU tracking algorithm, and to solve the problem that two or more nearby targets of the same category lead to multiple candidate tracks for the latest target, we propose the following improvements. Specifically, we distinguish targets by their detailed features, which may include color, shape, and texture. If the target is a vehicle, the unique identification information of the license plate also becomes a necessary means of distinguishing multiple tracks, as shown in Figure 1. If there are multiple candidate tracks, a feature comparison method based on appearance, color, texture, and license plate ID is used to screen the motion tracks in the candidate set, as shown in Figure 2, so as to judge which track the latest target box belongs to through appearance characteristics at the macro level and pixel characteristics at the micro level; the target box is then appended to that track as its last detection box.
The elements in the latest target box, the last detection box of each candidate track, and the second-last detection box of each candidate track are preprocessed by cropping, normalization, etc., and then fed into the feature extraction network to obtain the corresponding appearance, color, and texture features. We preferably use MobileNet-v2 as the feature extraction network. For convenience of explanation, the feature map obtained from the elements in the latest target box is called the latest feature map, the feature map obtained from the last detection box of each candidate track is called the last feature map, and the feature map obtained from the second-last detection box of each candidate track is called the sub-last feature map. All feature maps are normalized to the same dimension. We calculate the cosine distance between the latest feature map and each last feature map as the last cosine distance, and the cosine distance between the latest feature map and each sub-last feature map as the sub-last cosine distance. The cosine distance between feature maps is calculated as shown in Equation (2):
$$\cos(\theta)_{k}^{m} = \frac{A \cdot B_{k}^{m}}{\left\|A\right\|\left\|B_{k}^{m}\right\|}, \quad m \in \{1, 2\},\; k \in [1, t] \tag{2}$$
where $t$ represents the number of candidate tracks and $A$ represents the normalized latest feature map. $B_{k}^{1}$ represents the normalized last feature map corresponding to the last detection box of the $k$-th candidate track, and $B_{k}^{2}$ represents the normalized sub-last feature map corresponding to the second-last detection box of the $k$-th candidate track. $\cos(\theta)_{k}^{1}$ is the cosine distance between the latest feature map $A$ and the last feature map $B_{k}^{1}$, and $\cos(\theta)_{k}^{2}$ is the cosine distance between $A$ and the sub-last feature map $B_{k}^{2}$.
The logarithm of each last cosine distance is taken to obtain the corresponding last feature similarity factor, and the logarithm of each sub-last cosine distance is taken to obtain the corresponding sub-last feature similarity factor. The feature similarity factor is calculated as shown in Equation (3):
$$S_{cur,k}^{m} = \ln\cos(\theta)_{k}^{m} = \ln\frac{A \cdot B_{k}^{m}}{\left\|A\right\|\left\|B_{k}^{m}\right\|}, \quad k \in [1, t] \tag{3}$$
where $S_{cur,k}^{1}$ represents the last feature similarity factor, $S_{cur,k}^{2}$ represents the sub-last feature similarity factor, and $S_{cur,k}^{m}$ denotes the $m$-th similarity factor of the $k$-th candidate track.
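A minimal sketch of Equations (2) and (3) is given below, assuming the feature maps are NumPy arrays already produced by the feature extraction network; the small clamping constant is an implementation detail added here only to keep the logarithm finite.

```python
import numpy as np

def cosine_similarity(a, b):
    # Equation (2): cosine of the angle between two flattened feature maps.
    a, b = a.ravel(), b.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def feature_similarity_factor(latest_map, track_map):
    # Equation (3): the similarity factor is the logarithm of the cosine distance.
    cos = cosine_similarity(latest_map, track_map)
    return float(np.log(max(cos, 1e-12)))
```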
Due to the similarity of vehicles and various environmental factors such as lighting, viewpoint, and occlusion, appearance-based methods cannot always accurately distinguish confusable vehicles. More importantly, as the unique identifier of each vehicle, the license plate should be used for accurate vehicle identification. In this paper, the HyperLPR algorithm [30] is used to recognize the license plate. After obtaining the license plate string, its characters, including region codes, letters, and numbers, are encoded as shown in Figure 3, and Equation (3) is used to calculate the similarity factor of two license plate strings.
For each candidate track, we take the last feature similarity factor and the corresponding sub-last feature similarity factor between the latest feature map and that track; if the target is a motor vehicle, the license plate similarity factor is also included. These feature similarity factors are linearly weighted, and the candidate track with the minimum weighted feature similarity factor is selected. The latest detection box is then added to the end of that track’s detection box list as the latest frame of the track. The track with the minimum weighted feature similarity factor is determined by Equation (4):
$$final_{k} = \min_{k}\left( a\, S_{cur,k}^{1} + b\, S_{cur,k}^{2} + c\, \mathrm{sign}\!\left(S_{cur,k}^{plate} - \beta\right) S_{cur,k}^{plate} \right), \quad k \in [1, t] \tag{4}$$
where $a$ represents the weight of the last feature similarity factor and $b$ represents the weight of the sub-last feature similarity factor. If the target element is a motor vehicle, $c$ represents the weight of the license plate similarity factor. $final_{k}$ denotes the track with the minimum weighted feature similarity, $\mathrm{sign}$ is the signum function, and $S_{cur,k}^{m}$ is as defined in Equation (3).
Because the sub-last valid frame corresponding to the sub-last feature factor lies further in time from the moment when two vehicle detection boxes overlap, the detection box of that frame is more independent, so its color, texture, and other features are more reliable as a reference. Since the license plate uniquely identifies a vehicle, the license plate similarity factor takes priority over the other factors whenever the plate can be located and recognized. When the license plate similarity factor is less than the threshold $\beta$, the plates are similar to a certain extent, and a reward is applied to the weighted feature similarity factor; if it is greater than $\beta$, the plates differ, and the weighted feature similarity factor receives a double penalty. Therefore, the weight of the license plate similarity factor is far greater than the other weights, and the weight of the sub-last similarity factor is greater than that of the last similarity factor, that is, c > b > a > 0.
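A sketch of the weighted fusion of Equation (4) follows. The candidate data structure, the character encoding used for the plate similarity factor, and the concrete values of a, b, c, and β are assumptions made for illustration; the paper only requires c > b > a > 0.

```python
import numpy as np

def plate_similarity_factor(plate_a, plate_b):
    # Hypothetical encoding: map each character to its code point and reuse the
    # log-cosine form of Equation (3) on the resulting vectors.
    n = max(len(plate_a), len(plate_b))
    va = np.array([ord(c) for c in plate_a.ljust(n)], dtype=float)
    vb = np.array([ord(c) for c in plate_b.ljust(n)], dtype=float)
    cos = np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb) + 1e-12)
    return float(np.log(max(cos, 1e-12)))

def pick_track(candidates, a=0.1, b=0.3, c=0.6, beta=-0.05):
    # Equation (4): each candidate carries its last / sub-last feature similarity
    # factors and, for motor vehicles, a plate similarity factor. The weights and
    # beta are illustrative; the track with the smallest weighted score wins.
    best, best_score = None, float("inf")
    for cand in candidates:
        score = a * cand["s_last"] + b * cand["s_sub_last"]
        if cand.get("s_plate") is not None:
            score += c * np.sign(cand["s_plate"] - beta) * cand["s_plate"]
        if score < best_score:
            best, best_score = cand, score
    return best
```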
There are two possible reasons for the disappearance of subsequent frames of a track. First, the current information about a target may be missing because of missed detections in a frame (high probability), or a tracking error may assign the target to another track (low probability); if the object is then successfully detected and tracked again in later frames, the tracking track breaks at the current time. Second, the target leaves the field of interest, so no valid detection box can be found again and the track legitimately ends. To handle this, we compute the frame interval between the current video frame and the time at which the last detection box of each track appeared. If the frame interval is greater than or equal to a preset frame interval threshold, we consider that the target element has left the field of interest and discard the corresponding track.
In the implementation, the frame number difference between the current video frame and the last frame of each track is computed using the last-seen frame number maintained in the sequence box list set. If this difference is less than the frame interval threshold, the track is temporarily retained; otherwise, the target is considered to have left the field of interest and the track is discarded. According to the following experiments, when the frame interval threshold is set to 5, the number of tracking interruptions is minimized without sacrificing too many time and space resources.
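The track-discarding rule can be sketched as follows, assuming each track records the frame number at which it was last matched; the dictionary key and the threshold of 5 frames mirror the description above rather than any released code.

```python
def prune_tracks(tracks, current_frame, frame_gap_threshold=5):
    # Keep a track only while its most recent detection is within the allowed
    # frame gap; otherwise the target is assumed to have left the field of interest.
    return [t for t in tracks if current_frame - t["last_frame"] < frame_gap_threshold]
```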

3.3. Detection of Vehicle Yielding to Pedestrian Based on Warning Points

The behaviors of vehicles that do not yield to pedestrians can be classified into several categories, as shown in Figure 4. (a) For straight-through traffic, vehicles in the lane where the pedestrian is located and in the next lane the pedestrian will reach in the direction of travel must give way. A green belt is counted as one lane; that is, if a pedestrian is in the green belt and about to step onto the zebra crossing, vehicles in the lane adjacent to the pedestrian’s direction on the zebra crossing must give way, whereas motor vehicles two or more lanes away from the pedestrian’s lane or the green belt may pass. (b) When turning left at an intersection, left-turning vehicles must slow down and give priority to pedestrians entering the zebra crossing. (c) When turning right at an intersection, right-turning vehicles must likewise slow down and give priority to pedestrians entering the zebra crossing.
According to these rules, motor vehicles should slow down and give way when pedestrians have a forward trend and there are vehicles in the lane and the first adjacent lane in the forward direction. However, there are many factors that make it difficult to determine whether a behavior violates the rules.
  • Excessive deceleration wastes time and traffic resources, so it is crucial to find the maximum speed a motor vehicle may maintain while still ensuring pedestrian safety;
  • If a pedestrian shows no intention of crossing the intersection, how should we judge whether a motor vehicle must slow down and give way;
  • How can the conflict between the motor vehicle’s route and the pedestrian’s route, i.e., the failure of the motor vehicle to yield, be judged quantitatively.
To quantify these intricate factors, we present a method for translating the pixel coordinates of the target center point into the world coordinate system as discussed in Section 3.2. This translation is achieved using the transformation equation detailed in Equation (5).
$$P_{uv} = K\, T\, P_{w}, \qquad P_{uv} = T^{pixel}_{camera}\, T^{camera}_{world}\, P_{w}$$
$$depth \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f_{x} & 0 & c_{x} \\ 0 & f_{y} & c_{y} \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_{1} \\ r_{21} & r_{22} & r_{23} & t_{2} \\ r_{31} & r_{32} & r_{33} & t_{3} \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} \tag{5}$$
where $K$ represents the intrinsic matrix of the camera, i.e., the transformation from the camera coordinate system to the pixel coordinate system (related to both the camera and the lens), denoted $T^{pixel}_{camera}$; it can be obtained through camera calibration. $T$ represents the extrinsic parameters, i.e., the transformation from the world coordinate system to the camera coordinate system, denoted $T^{camera}_{world}$; it can be obtained through PnP estimation. $P_{w}$ represents the coordinates in the world coordinate system, and $depth$ is the value of the target point in the Z direction of the camera coordinate system. Since the camera may not be exactly parallel to the target plane, the depth may vary; to reduce the error, different depths can be calculated using the extrinsic parameters. The depth value is calculated using Equation (6) (Figure 5).
In the camera, there are four coordinate systems: the world coordinate system, the camera coordinate system, the image coordinate system, and the pixel coordinate system. (a) The world coordinate system allows for the arbitrary designation of the $x_w$ axis and $y_w$ axis; it is the coordinate system in which point P in the figure is located. (b) The camera coordinate system has its origin at the pinhole, with the z-axis coinciding with the optical axis and its axes parallel to the projection plane; it is represented as $(X_C, Y_C, Z_C)$ in the diagram. (c) The image coordinate system has its origin at the intersection of the optical axis and the projection plane, and its axes are parallel to the projection plane, represented as the (x, y) coordinate system in the diagram. (d) The pixel coordinate system, viewed from the pinhole towards the projection plane, has its origin at the upper left corner of the projection plane, with the (u, v) axes coinciding with the sides of the projection plane; it lies in the same plane as the image coordinate system but has a different origin.
$$depth = t_{3} + r_{31}x + r_{32}y + r_{33}z \tag{6}$$
where $t_{3}$, $r_{31}$, $r_{32}$, $r_{33}$ are part of the extrinsic parameters of the camera, and $x$, $y$, $z$ represent the world coordinates.
After combining the influence of changing depth, the pixel coordinate and world coordinate conversion method is shown in Equation (7).
$$depth \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \left( R \begin{bmatrix} x \\ y \\ z_{const} \end{bmatrix} + t \right) \tag{7}$$
where $K$ represents the intrinsic parameter matrix of the camera, $R$ represents the rotation matrix, and $t$ represents the translation vector. $z_{const}$ denotes the value of the target point in the Z direction of the world coordinate system.
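A minimal sketch of the back-projection implied by Equations (5)–(7) is shown below. It assumes a calibrated 3 × 3 intrinsic matrix K, world-to-camera extrinsics (R, t) such that P_cam = R · P_world + t, and a target plane of known height z_const; the function name is illustrative.

```python
import numpy as np

def pixel_to_world(u, v, K, R, t, z_const=0.0):
    """Back-project pixel (u, v) onto the known world plane z = z_const."""
    pixel = np.array([u, v, 1.0])
    ray_cam = np.linalg.inv(K) @ pixel      # viewing ray in camera coordinates
    ray_world = R.T @ ray_cam               # the same ray expressed in world axes
    cam_offset = R.T @ t
    # Choose the depth so that the recovered point lies on z = z_const; this is
    # the per-point depth of Equation (6).
    depth = (z_const + cam_offset[2]) / ray_world[2]
    world = R.T @ (depth * ray_cam - t)     # invert P_cam = R @ P_world + t
    return world[:2], depth                 # planar (x, y) and the depth used
```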
The three-dimensional coordinates are projected onto the horizontal plane of the top view, resulting in two-dimensional coordinates without depth information. The “violation warning point” is defined as the position that is no longer safe in front of the target’s movement. In other words, the range from the target’s current position to the warning point is temporarily safe. In this paper, we predict the violation warning points of each tracking target at the current time, based on the center coordinates of detection frames arranged in a time series. The warning point reflects the target’s movement intention, including direction and speed. For example, if a pedestrian remains stationary for a certain amount of time, their warning point will be infinitely close to their current position. On the other hand, if a pedestrian or vehicle accelerates, it becomes dangerous to be in a far area in front of them. Thus, the faster the target moves, the greater the danger area in front of it. We assume that a target’s movement within a period of time can affect the position of the warning point, and that the impact on the warning point’s position becomes more obvious as the target approaches the current frame.
The calculation method for the violation warning point is shown in Equation (8). Here, $x_{warn}$ and $y_{warn}$ represent the horizontal and vertical coordinates of the violation warning point, while $x_{p}^{m}$ and $y_{p}^{m}$ represent the horizontal and vertical coordinates of the center point of the $m$-th detection box in the target’s tracking track. $c$ represents the number of detection boxes in the track, $x_{c}$ and $y_{c}$ represent the coordinates of the center point of the last detection box in the track, and $width_{c}$ represents the width of the last detection box. Based on experimental results, we found the best value for the constant $\alpha$ to be 0.55. Equation (8) magnifies the influence of positions near the current frame by multiplying by a logarithmic factor, so that the position influence factor is much higher for positions closer to the current frame. Finally, we obtain the violation warning point $(x_{warn}, y_{warn})$.
In this implementation, a violation warning point is predicted for each pedestrian tracking track; for motor vehicles, the danger area is defined as the parallelogram generated from the vehicle side closest to the warning point, with the violation warning point at its center.
$$x_{warn} = x_{c} + \alpha\, width_{c}\, \frac{\sum_{m=1}^{c}\left(x_{p}^{m} - x_{p}^{m-1}\right)\ln\frac{c-m}{c}}{\sum_{m=1}^{c}\ln\frac{c-m}{c}}, \qquad y_{warn} = y_{c} + \alpha\, \frac{\sum_{m=1}^{c}\left(y_{p}^{m} - y_{p}^{m-1}\right)\ln\frac{c-m}{c}}{\sum_{m=1}^{c}\ln\frac{c-m}{c}} \tag{8}$$
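A sketch of the warning-point prediction of Equation (8) follows. It assumes the track’s detection-box centers are available as an ordered list; the final summation index is excluded here purely to keep the logarithmic weight finite, which is an implementation choice of the sketch rather than something stated in the paper.

```python
import numpy as np

def warning_point(centers, width_c, alpha=0.55):
    """Predict the violation warning point from a track's box centers (in time order)."""
    c = len(centers)
    if c < 2:
        return centers[-1]                     # no motion history yet
    xs = np.array([p[0] for p in centers], dtype=float)
    ys = np.array([p[1] for p in centers], dtype=float)
    m = np.arange(1, c)                        # m = 1 .. c-1
    weights = np.log((c - m) / c)              # heavier weight near the current frame
    denom = weights.sum()
    x_warn = xs[-1] + alpha * width_c * ((xs[1:] - xs[:-1]) * weights).sum() / denom
    y_warn = ys[-1] + alpha * ((ys[1:] - ys[:-1]) * weights).sum() / denom
    return x_warn, y_warn
```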
In the presence of a crosswalk line, the main crosswalk line below the camera is defined as the primary crosswalk, and its surrounding area is designated as the camera’s field of interest. Consequently, the focus of this paper is restricted to violations occurring in this area alone, and to reduce computational load, it is predetermined whether each pedestrian in the current video frame is located within the field of interest. Violation warning points can be predicted for pedestrians within the area of interest, with the location of the center point of the lower edge of the pedestrian detection box used to determine whether the pedestrian is within the field of interest of the camera. The equation used to determine this is in Equation (9).
$$F(x) = I\!\left(x_{p} + \tfrac{width_{p}}{2} < left_{pc} + width_{pc}\right) \times I\!\left(x_{p} + \tfrac{width_{p}}{2} > left_{pc}\right) \times I\!\left(y_{p} + \tfrac{height_{p}}{2} < top_{pc} + height_{pc}\right) \times I\!\left(y_{p} + \tfrac{height_{p}}{2} > top_{pc}\right) \tag{9}$$
where $(x_{p}, y_{p})$ represents the horizontal and vertical coordinates of the upper left corner of the bounding box, $width_{p}$ and $height_{p}$ represent the width and height of the bounding box, $(left_{pc}, top_{pc})$ represents the horizontal and vertical coordinates of the upper left corner of the rectangular field of interest, and $width_{pc}$ and $height_{pc}$ represent the width and height of the rectangular field of interest. $I(\cdot)$ is the indicator function: when the input value is true, $I(x) = 0$; when the input value is false, $I(x) = 1$.
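The field-of-interest test of Equation (9) reduces to a point-in-rectangle check on the box center. The sketch below uses the conventional indicator (true maps to 1) so that the function returns True when the pedestrian should be considered for warning-point prediction; the function name is illustrative.

```python
def in_field_of_interest(x_p, y_p, width_p, height_p,
                         left_pc, top_pc, width_pc, height_pc):
    # Center of the pedestrian bounding box.
    cx = x_p + width_p / 2.0
    cy = y_p + height_p / 2.0
    # True when the center lies inside the rectangular field of interest.
    return (left_pc < cx < left_pc + width_pc) and (top_pc < cy < top_pc + height_pc)
```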
To determine if a motor vehicle has failed to slow down and yield to pedestrians, we evaluate whether the line segment between the pedestrian warning point and the center point of the current detection frame lies within the parallelogram danger area in front of the vehicle. If it does, then the vehicle is deemed to be in violation. In the diagram shown in Figure 6, the black, yellow, and blue dots represent the warning points predicted by motor vehicles 1 and 2 and pedestrian 1, respectively, using our calculation method. The magenta and blue parallelograms indicate the dangerous areas in front of the motor vehicles, with motor vehicle 2 being the offending party.
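A minimal geometric sketch of this test is given below, assuming the danger area is supplied as the four vertices of the parallelogram in order; a violation is flagged when the pedestrian’s current-position-to-warning-point segment touches the area.

```python
def _cross(o, a, b):
    # 2-D cross product of vectors (a - o) and (b - o).
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def point_in_convex(pt, poly):
    # True when pt lies inside (or on) the convex polygon given by ordered vertices.
    signs = [_cross(poly[i], poly[(i + 1) % len(poly)], pt) for i in range(len(poly))]
    return all(s >= 0 for s in signs) or all(s <= 0 for s in signs)

def _segments_intersect(p1, p2, q1, q2):
    d1, d2 = _cross(q1, q2, p1), _cross(q1, q2, p2)
    d3, d4 = _cross(p1, p2, q1), _cross(p1, p2, q2)
    return ((d1 > 0) != (d2 > 0)) and ((d3 > 0) != (d4 > 0))

def violates(ped_center, ped_warning, danger_area):
    # Violation if either endpoint lies inside the danger area or the segment
    # between them crosses one of its edges.
    if point_in_convex(ped_center, danger_area) or point_in_convex(ped_warning, danger_area):
        return True
    edges = [(danger_area[i], danger_area[(i + 1) % len(danger_area)])
             for i in range(len(danger_area))]
    return any(_segments_intersect(ped_center, ped_warning, a, b) for a, b in edges)
```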
Once a violation is confirmed in the current video frame, we use a license plate recognition algorithm to identify the license plate of the offending vehicle. To prevent false accusations, the algorithm tracks the suspect vehicle for several consecutive frames to confirm the violation. The vehicle’s license plate number is identified for each frame and recorded in a license plate list. If the number of continuous violation frames exceeds a certain threshold, the algorithm selects the most frequently occurring license plate number in the recorded list as the final identifier of the violator. We record the type of violation, the corresponding license plate number, and the time of occurrence, and upload the data to the cloud. Our license plate recognition algorithm is based on the HyperLPR open-source framework, which we believe is the most effective method available.
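The per-frame plate voting can be sketched as below; the minimum number of violation frames is an assumed placeholder for the threshold mentioned above.

```python
from collections import Counter

def confirm_violation(plate_readings, min_violation_frames=10):
    """Majority-vote the offending plate over consecutive violation frames."""
    if len(plate_readings) < min_violation_frames:
        return None                                   # not enough evidence yet
    plate, _count = Counter(plate_readings).most_common(1)[0]
    return plate
```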

4. Performance Evaluation

4.1. Simulation Settings

(1) Dataset: We use the BDD100K dataset to train our neural network and evaluate the performance of the tracker on the MOT16 [31] benchmark and our test dataset. In view of the complexity of actual traffic conditions, we use both real-scene experiments and simulation-scene experiments to evaluate the vehicle-yielding-to-pedestrian violation detection algorithm.
Our test dataset consists of real street scenes taken by unmanned aerial vehicles from the perspective of a traffic surveillance camera. A rendering of the dataset is shown in Figure 7. The labeling information includes the shooting time, location, violation information at a given time, and the license plate numbers of vehicles in violation.
The total video duration is more than 1000 min, covering more than 30 locations, with more than 10,000 tracking records, 656 incidents of vehicles not giving way to pedestrians, and 3112 incidents of vehicles giving way to pedestrians. The videos include crossroads, T-junctions, and straight road segments with and without signal lights; their distribution is shown in Figure 8. The weather covers sunny, cloudy, overcast, rainy, snowy, and foggy days, with sunny and cloudy days accounting for a significant proportion; the distribution is shown in Figure 9. The videos cover all time periods of the 24 h day, with the distribution shown in Figure 10, and include the four types of failure-to-yield violations described in Figure 4, with the distribution shown in Figure 11.
(2) Simulation design: To train and evaluate our algorithm, we built a simulator of the dynamic urban environment. The virtual simulation scene is generated through joint simulation and communication between Unity3D [32] and the microscopic traffic simulator SUMO [33]. We use SUMO as the server to simulate more than 30 groups of different traffic flows, and Unity3D as the client to faithfully reproduce the street view of a city at equal scale, as shown in Figure 12, which includes 42 crossroads, 14 T-intersections, and 11 one-way intersections. Each intersection is equipped with corresponding signal light groups and signal switching strategies. The simulation experiment contains equal numbers of positive and negative examples, including equal numbers of the four types of failure-to-yield violations described in Section 3.3. The simulation scene is shown in Figure 12 and Figure 13.

4.2. Experiment of Target Tracking Algorithm Based on Feature Graph and Plate ID

For the multi-target tracking algorithm based on feature maps and plate IDs proposed in this paper, it is difficult to evaluate multi-target tracking performance with a single score, so we use the evaluation indices defined in [34] and the standard multi-target tracking metrics [35].
  • MOTP: Precision of multi-target tracking, i.e., the average error between the estimated positions and the true positions over all matched pairs in all frames. This index reflects the ability of the tracker to estimate precise target positions and is independent of its ability to recognize target configurations or maintain stable tracks. It is calculated as in Formula (10), where $c_t$ represents the number of matches between the true target positions $o_i$ and the hypotheses $h_j$ in frame $t$, and $d_t^i$ represents the distance between the true target position $o_i$ in frame $t$ and its paired hypothesis, that is, the matching error. The definition of $d_t^i$ determines whether a larger or smaller MOTP is better. (A minimal computation sketch of MOTP and MOTA is given after this list.)
    $$MOTP = \frac{\sum_{i,t} d_{t}^{i}}{\sum_{t} c_{t}} \tag{10}$$
  • MOTA(↑): Accuracy of multi-target tracking. MOTA explains all object configuration errors, false positives, misses, and mismatches caused by the tracker in all frames.
  • FAF(↓): Number of false alarms per frame.
  • MT(↑): The number of mostly tracked tracks; the target carries the same label for at least 80% of its life cycle.
  • ML(↓): The number of mostly lost tracks; the target is tracked for at most 20% of its life cycle.
  • FP(↓): the number of false detections.
  • FN(↓): the number of missed detections.
  • ID Switch(↓): the number of times the ID switches to a different object previously tracked.
  • FM(↓): the number of times the tracking track is interrupted.
    note: “↑” means the larger the value, the better the result; “↓” means the smaller the value, the better the result.
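As referenced above, a minimal computation sketch of MOTP (Formula (10)) and the standard CLEAR-MOT accuracy is given below; the input layout (per-frame lists of matching errors and match counts) is an assumption of the sketch.

```python
def motp(distances_per_frame, matches_per_frame):
    # Formula (10): total matching error divided by the total number of matches.
    total_error = sum(sum(frame) for frame in distances_per_frame)
    total_matches = sum(matches_per_frame)
    return total_error / total_matches if total_matches else 0.0

def mota(false_positives, misses, id_switches, num_ground_truth):
    # Standard CLEAR-MOT accuracy: 1 - (FP + FN + IDSW) / number of ground-truth objects.
    return 1.0 - (false_positives + misses + id_switches) / num_ground_truth
```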
The MOT16 benchmark evaluates tracking performance on seven challenging test sequences, including front-view scenes captured by moving cameras and scenes taken from the perspective of surveillance cameras. This paper uses the detection results provided by Yu, Wojke, and others [36,37] as the input to the tracker. However, due to differences in the experimental environment, this paper only re-evaluates the running speed of the other algorithms. To ensure a fair comparison, we control variables so that the datasets and experimental environments are equal, conduct experiments using the same indicators, and present the results in Table 1. A Faster R-CNN trained on a set of public and private datasets is used to provide accurate detection results.
To facilitate comparison with other algorithms, we use the intersection over union of the estimated target detection box and the ground-truth calibration box as $d_t^i$. Table 1 compares the method proposed in this paper with several other online tracking algorithms; we only list the most accurate state-of-the-art online trackers, such as SORT [25], Deep SORT [36], and POI [37]. Compared with these methods, our algorithm achieves the highest target discrimination and the lowest number of ID switches when object occlusion is resolved by pixel-level discrimination, without sacrificing speed. Specifically, compared with Deep SORT, our algorithm reduces ID switches from 781 to 699, a reduction of about 10.5%. However, due to limitations in detecting and processing occluded objects, the number of false negatives of our algorithm is much higher than that of Deep SORT, resulting in a relatively lower multi-target tracking accuracy (MOTA). Moreover, our algorithm spends most of its time on feature extraction, so there is still much room for improvement in computational efficiency and real-time performance if a more efficient GPU is available.
To evaluate the performance of our algorithm in detecting whether motor vehicles give way to pedestrians, we tested it on the street-view video dataset captured from the perspective of real traffic surveillance cameras described above and compared it with several other detection-based multi-target tracking algorithms. To obtain more accurate results, we used the Faster R-CNN feature extractor. The results in Table 2 indicate that our tracking algorithm achieves considerable results on our target task, maintaining high accuracy and low ID switching without sacrificing too much speed. However, because the feature extraction network was trained on video captured from the subjective perspective of driving vehicles, performance on our dataset is weaker across various indicators than on the MOT16 [31] dataset.
In the context of our research, it is essential to underscore the intentional design choices made to achieve a delicate equilibrium between speed and accuracy. While our algorithm might manifest lower scores in MOTA and MOTP compared to Deep SORT, it excels substantially in aspects of efficiency, making it highly suitable for scenarios necessitating real-time recognition.
The practical application of our approach, specifically in identifying whether motor vehicles yield to pedestrians, showcases its capacity to significantly minimize ID Switches, thereby reducing the likelihood of misattributing one vehicle’s violation to another. This meticulous optimization enables our system to perform reliably in real-world environments, addressing the inherent dynamism and variability present in urban settings.
Our commitment to maintaining this balance is not a compromise but a strategic alignment with the practical necessities and operational realities of implementing smart traffic management solutions in contemporary urban landscapes. It reflects a nuanced understanding of the intrinsic trade-offs between precision and responsiveness required to address the multifaceted challenges of sustainable urban mobility.
This harmonious interplay between accuracy and efficiency is not a detraction but reinforces the practical viability and relevance of our method in the evolving landscape of intelligent urban traffic management solutions.
To save experimental time and cost, we utilized the COCO dataset-based pre-trained model parameters in the open-source object detection API of Google’s TensorFlow framework. We then compared the performance and accuracy of various detectors and feature extractors, including the SSD algorithm [16], Faster R-CNN [14], the YOLO-V3 algorithm [42,43], the RFCN algorithm [27], MobileNet-v1 [44], MobileNet-v2 [17], ResNet50 [45], ResNet101 [45], FPN [46], and Inception-v2 [47]. The results presented in Figure 14 show that the Faster R-CNN and ResNet-series extractors have higher accuracy, while the combination of the SSD algorithm and MobileNet-series extractors has an advantage in speed. Based on these results and the fact that 24 fps is the lowest frequency at which the human eye perceives still pictures as continuous motion, we recommend using the YOLO-V3 algorithm or the SSD algorithm with MobileNet-v2 as the extractor, without sacrificing too much speed.
As shown in Figure 15 and Figure 16, the influence of the internal parameters of our tracking algorithm on the multi-target tracking accuracy is explored. In these figures, the x-axis represents the threshold value of the second interval frame, the y-axis represents the threshold value of the first IoU, the z-axis represents the MOTA value in Figure 15, and the Z-axis represents the speed in Figure 16. Our results suggest that the highest accuracy rate is achieved when the first IoU threshold value is 0.75 and the second interval frame threshold value is 5, provided that the operating frequency is greater than 25 Hz. When the IoU threshold is too low, it can lead to a large number of overlapping targets, causing errors in target differentiation and reducing the MOTA. Conversely, when the IoU threshold is too high, the IoU of the target with fast motion speed may be too low, resulting in low accuracy MOTA. Similarly, setting the threshold value of the interval frame too low or too high can also cause accuracy issues. A threshold that is too low may cause many paths to be discarded due to missed frames, while a threshold that is too high may lead to unnecessary overlapping area disputes when new targets pass through the last detection frame position, causing confusion and reducing accuracy. Finally, distinguishing a large number of overlapping targets at the pixel level can significantly slow down the running speed, as shown in the blue area of Figure 16.

4.3. Experiment on Detection Algorithm of Vehicles’ Offensive Pedestrians Based on Warning Points

This task can be described as a binary classification problem, where the behavior of vehicles that do not yield to pedestrians is regarded as the positive class and the behavior of vehicles that yield to pedestrians as the negative class. To evaluate the algorithm, we use the following indicators commonly used in classification problems (a minimal computation sketch follows the list).
  • Accuracy(↑): reflects the ability of the model to judge the overall sample set correctly;
    $$ACC = \frac{TP + TN}{P + N} = \frac{TP + TN}{TP + FN + FP + TN}$$
  • Precision(↑): reflects how precisely the model predicts positive samples;
    $$precision = \frac{TP}{TP + FP}$$
  • Recall(↑): reflects the completeness with which the classifier identifies positive samples;
    $$recall = \frac{TP}{TP + FN}$$
  • Miss rate(↓): reflects the proportion of positive samples that the model fails to detect.
    $$missrate = \frac{FN}{TP + FN}$$
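The sketch referenced before the list simply evaluates the four indicators from the confusion-matrix counts:

```python
def classification_metrics(tp, tn, fp, fn):
    # Accuracy, precision, recall, and miss rate for the two-class
    # yielding / not-yielding problem.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    miss_rate = fn / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "miss_rate": miss_rate}
```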
In real traffic situations, the number of positive (violation) samples is typically ten times greater than that of negative samples, whereas in the samples provided in this dataset the proportion of negative samples is five times greater than that of positive samples. If a negative case is judged as positive, it can still be remedied through manual review; however, if a positive case is mistakenly identified as negative, it can lead to serious consequences. Therefore, we pay more attention to the recall rate than to the other indicators. When the recall rate is high, the accuracy rate is also high and the false alarm rate is low.
To determine the optimal value of the parameter α in Equation (8), we conducted control variable experiments. The results are shown in Figure 17, where the recall rate reaches the highest value at α = 0.55. This means that most samples have accurate judgments, with a false positive rate of 8.39% and a high accuracy rate of 90.24%. A low value of α results in a small danger area in front of the vehicle and close predicted locations of pedestrian movement, leading to a large number of undetected violations by vehicles that do not yield to pedestrians. On the other hand, a high value of α results in a long danger area in front of the motor vehicle, leading to a large number of false positives. The optimal value of α balances the detection of violations by vehicles that do not yield to pedestrians and minimizes false positives, resulting in high accuracy and low manual audit costs.
The impact of weather, road conditions, time periods, and types of violations on the classification results is investigated with α = 0.55. The influence of weather on the various indicators is illustrated in Figure 18 and Figure 19. Our method performed well in high-visibility weather such as sunny and cloudy days, correctly predicting almost all positive examples, but the results were not ideal in foggy conditions with low visibility. The 24 h day is divided into three time periods based on lighting conditions: 0–3 and 20–23, 4–7 and 16–19, and 8–15, corresponding to night, dawn and dusk, and day, respectively. An ideal recall rate of 99% and accuracy of 96% were achieved during the day when lighting was sufficient. However, during 7–8 and 17–18, some violations may have been missed due to heavier traffic. At night, although the classification results were more accurate, the problem of license plate recognition errors due to insufficient light remains to be solved. We further investigated the influence of the simulation and real datasets, the road conditions of violations, and the types of violations on the classification results, as shown in Figure 20 and Figure 21. Our method performed significantly better at detecting failure-to-yield violations on straight roads than those occurring while turning.

5. Conclusions and Future Works

This study proposes a target tracking algorithm that utilizes feature maps and license plate IDs to track pedestrians and vehicles in surveillance camera scenes, addressing the limitations of existing tracking algorithms. The algorithm extracts identity information such as appearance, texture, color, and license plate characteristics to distinguish vehicles in dense traffic, achieving real-time tracking with few ID switches and a good balance between speed and accuracy.
To judge rule violations by a vehicle in real time, we introduce a warning point prediction method that uses the tracking results of the algorithm. This method magnifies the impact of recent movement trends to calculate warning points for pedestrians and danger areas in front of vehicles, which are updated as the targets move. Although our tracking algorithm has improved in speed and ID switching, it still requires optimization in detecting occluded targets and improving trajectory prediction accuracy.
As computer vision algorithms become more commonly used for personal and traffic data collection and processing, we suggest considering multiple perception methods such as V2X [48], 3D vision, high-precision mapping, and radar [49] to enhance traffic classification decisions, which can contribute to the future implementation of intelligent transportation systems and other related fields.

Author Contributions

Conceptualization, Y.W. and Y.X. (Yaqi Xu); methodology, Y.W. and Y.X. (Yaqi Xu); software, Y.W.; validation, Y.W. and Y.X. (Yaqi Xu); formal analysis, H.W. and Y.X. (Yaqi Xu); investigation, Y.X. (Yi Xu) and H.W.; resources, Y.X. (Yi Xu); data curation, H.W.; writing—original draft preparation, Y.W.; writing—review and editing, Y.W. and Y.X. (Yaqi Xu); visualization, Y.W.; supervision, J.W. and M.L.; project administration, J.W.; funding acquisition, J.W. and M.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

This study does not involve research on humans or animals.

Informed Consent Statement

This study does not involve research on patient(s).

Data Availability Statement

The data in this study cannot be shared due to confidentiality concerns.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MOTA	Multiple Object Tracking Accuracy
MOTP	Multiple Object Tracking Precision
ROI	Region of Interest
MOT	Multiple Object Tracking
CARLA	Car Learning to Act
FPS	Frames Per Second
UAV	Unmanned Aerial Vehicle
PNP	Perspective-n-Point
IoU	Intersection over Union
ACC	Accuracy
TP	True Positive
TN	True Negative
FP	False Positive
FN	False Negative
V2X	Vehicle to Everything

References

  1. World Health Organization (WHO). Road Traffic Injuries. 2021. Available online: https://www.who.int/news-room/fact-sheets/detail/road-traffic-injuries (accessed on 16 September 2023).
  2. Iftikhar, S.; Asim, M.; Zhang, Z.; Muthanna, A.; Chen, J.; El-Affendi, M.; Sedik, A.; Abd El-Latif, A.A. Target detection and recognition for traffic congestion in smart cities using deep learning-enabled uavs: A review and analysis. Appl. Sci. 2023, 13, 3995. [Google Scholar] [CrossRef]
  3. Akhtar, M.J.; Mahum, R.; Butt, F.S.; Amin, R.; El-Sherbeeny, A.M.; Lee, S.M.; Shaikh, S. A robust framework for object detection in a traffic surveillance system. Electronics 2022, 11, 3425. [Google Scholar] [CrossRef]
  4. Qureshi, S.A.; Hussain, L.; Chaudhary, Q.U.A.; Abbas, S.R.; Khan, R.J.; Ali, A.; Al-Fuqaha, A. Kalman filtering and bipartite matching based super-chained tracker model for online multi object tracking in video sequences. Appl. Sci. 2022, 12, 39538. [Google Scholar] [CrossRef]
  5. Sun, C.; Wang, Y.; Deng, Y.; Li, H.; Guo, J. Research on vehicle re-identification for vehicle road collaboration. J. Phys. Conf. Ser. 2023, 2456, 012025. [Google Scholar] [CrossRef]
  6. Noble, W.S. What is a support vector machine? Nat. Biotechnol. 2006, 24, 1565–1567. [Google Scholar] [CrossRef] [PubMed]
  7. Abramson, Y.; Steux, B. Hardware-friendly pedestrian detection and impact prediction. In Proceedings of the IEEE Intelligent Vehicles Symposium, Parma, Italy, 14–17 June 2004; pp. 590–595. [Google Scholar]
  8. Abramson, Y.; Steux, B.; Ghorayeb, H. Yet even faster (yef) real-time object detection. Int. J. Intell. Syst. Technol. Appl. 2007, 2, 102–112. [Google Scholar] [CrossRef]
  9. Havasi, L.; Szlávik, Z.; Szirányi, T. Pedestrian detection using derived third-order symmetry of legs: A novel method of motion-based information extraction from video image-sequences. In Computer Vision and Graphics; Springer: Dordrecht, The Netherlands, 2006; pp. 733–739. [Google Scholar]
  10. Makris, D.; Ellis, T. Spatial and probabilistic modelling of pedestrian behaviour. In Proceedings of the 13th British Machine Vision Conference, BMVC 2002, Cardiff, UK, 2–5 September 2002; pp. 1–10. [Google Scholar]
  11. Large, F.; Vasquez, D.; Fraichard, T.; Laugier, C. Avoiding cars and pedestrians using velocity obstacles and motion prediction. In Proceedings of the IEEE Intelligent Vehicles Symposium, Parma, Italy, 14–17 June 2004; pp. 375–379. [Google Scholar]
  12. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  13. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  14. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28 (NIPS 2015); NeurIPS: San Diego, CA, USA, 2015. [Google Scholar]
  15. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  16. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  17. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520. [Google Scholar]
  18. Jaiswal, S.; Chakraborty, P.; Huang, T.; Sharma, A. Traffic intersection vehicle movement counts with temporal and visual similarity based re-identification. In Proceedings of the 2023 8th International Conference on Models and Technologies for Intelligent Transportation Systems (MT-ITS), Nice, France, 14–16 June 2023; pp. 1–6. [Google Scholar]
  19. Kumar, A.; Kashiyama, T.; Maeda, H.; Zhang, F.; Omata, H.; Sekimoto, Y. Vehicle re-identification and trajectory reconstruction using multiple moving cameras in the carla driving simulator. In Proceedings of the 2022 IEEE International Conference on Big Data (Big Data), Osaka, Japan, 17–20 December 2022; pp. 1858–1865. [Google Scholar]
  20. Wang, Y.; Gong, B.; Wei, Y.; Ma, R.; Wang, L. Video-based vehicle re-identification via channel decomposition saliency region network. Appl. Intell. 2022, 52, 12609–12629. [Google Scholar] [CrossRef]
  21. Li, H.; Wang, Y.; Wei, Y.; Wang, L.; Li, G. Discriminative-region attention and orthogonal-view generation model for vehicle re-identification. Appl. Intell. 2023, 53, 186–203. [Google Scholar] [CrossRef]
  22. Bochinski, E.; Eiselein, V.; Sikora, T. High-speed tracking-by-detection without using image information. In Proceedings of the 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Lecce, Italy, 29 August–1 September 2017; pp. 1–6. [Google Scholar]
  23. Bochinski, E.; Senst, T.; Sikora, T. Extending IOU based multi-object tracking by visual information. In Proceedings of the 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Auckland, New Zealand, 27–30 November 2018; pp. 1–6. [Google Scholar]
  24. Chu, Q.; Ouyang, W.; Li, H.; Wang, X.; Liu, B.; Yu, N. Online multi-object tracking using cnn-based single object tracker with spatial-temporal attention mechanism. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4836–4845. [Google Scholar]
  25. Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple online and realtime tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 3464–3468. [Google Scholar]
  26. Seita, D. Bdd100k: A large-scale diverse driving video database. Berkeley Artif. Intell. Res. Blog. Vers. 2018, 511, 41. [Google Scholar]
  27. Dai, J.; Li, Y.; He, K.; Sun, J. R-fcn: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems 29 (NIPS 2016); NeurIPS: San Diego, CA, USA, 2016. [Google Scholar]
  28. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  29. Diwan, T.; Anirudh, G.; Tembhurne, J.V. Object detection using yolo: Challenges, architectural successors, datasets and applications. Multimed. Tools Appl. 2023, 82, 9243–9275. [Google Scholar] [CrossRef] [PubMed]
  30. Zeusees. 2020. Available online: https://github.com/zeusees/HyperLPR (accessed on 25 October 2023).
  31. Milan, A.; Leal-Taixé, L.; Reid, I.; Roth, S.; Schindler, K. Mot16: A benchmark for multi-object tracking. arXiv 2016, arXiv:1603.00831. [Google Scholar]
  32. Unity Technologies. Unity—Manual: Execution Order of Event Functions; Unity Technologies: San Francisco, CA, USA, 2017. [Google Scholar]
  33. Behrisch, M.; Bieker, L.; Erdmann, J.; Krajzewicz, D. Sumo–simulation of urban mobility: An overview. In Proceedings of the SIMUL 2011, Third International Conference on Advances in System Simulation, Barcelona, Spain, 23–29 October 2011. [Google Scholar]
  34. Li, Y.; Huang, C.; Nevatia, R. Learning to associate: Hybridboosted multi-target tracker for crowded scene. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 2953–2960. [Google Scholar]
  35. Bernardin, K.; Stiefelhagen, R. Evaluating multiple object tracking performance: The clear mot metrics. EURASIP J. Image Video Process. 2008, 2008, 246309. [Google Scholar] [CrossRef]
  36. Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649. [Google Scholar]
  37. Yu, F.; Li, W.; Li, Q.; Liu, Y.; Shi, X.; Yan, J. Poi: Multiple object tracking with high performance detection and appearance feature. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–10 and 15–16 October 2016; pp. 36–42. [Google Scholar]
  38. Keuper, M.; Tang, S.; Zhongjie, Y.; Andres, B.; Brox, T.; Schiele, B. A multi-cut formulation for joint segmentation and tracking of multiple objects. arXiv 2016, arXiv:1607.06317. [Google Scholar]
  39. Lee, B.; Erdenee, E.; Jin, S.; Nam, M.Y.; Jung, Y.G.; Rhee, P.K. Multi-class multi-object tracking using changing point detection. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–10 and 15–16 October 2016; pp. 68–83. [Google Scholar]
  40. Choi, W. Near-online multi-target tracking with aggregated local flow descriptor. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 3029–3037. [Google Scholar]
  41. Sanchez-Matilla, R.; Poiesi, F.; Cavallaro, A. Online multi-target tracking with strong and weak detections. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–10 and 15–16 October 2016; pp. 84–99. [Google Scholar]
  42. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  43. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  44. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Wey, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  45. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  46. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  47. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 448–456. [Google Scholar]
  48. Chen, S.; Hu, J.; Shi, Y.; Peng, Y.; Fang, J.; Zhao, R.; Zhao, L. Vehicle-to-everything (v2x) services supported by lte-based systems and 5g. IEEE Commun. Stand. Mag. 2017, 1, 70–76. [Google Scholar] [CrossRef]
  49. Zhou, Y.; Tuzel, O. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4490–4499. [Google Scholar]
Figure 1. (a) In different views, there are large differences within instances of the same vehicle (left), and small differences between instances of similar vehicles (right). (b) License plate for vehicle identification.
Figure 2. Filtering based on appearance features, color features, texture features, and license plate features. (Note: The Chinese characters in the picture are part of the license plate content, representing abbreviated Chinese province names).
Figure 3. License plate coding process. (Note: The Chinese characters in the picture are part of the license plate content, representing abbreviated Chinese province names).
Figure 4. Rules of comity to pedestrians.
Figure 5. Relationship of coordinate system.
Figure 6. Warning points, dangerous areas, and vehicles that do not yield to pedestrians. In the dangerous area ahead, vehicle 2 does not yield to pedestrians.
Figure 7. Test dataset from traffic surveillance camera view.
Figure 8. Distribution of road conditions.
Figure 9. Distribution of weather.
Figure 10. Distribution of time.
Figure 11. Distribution of types of violations by vehicles and pedestrians.
Figure 12. Aerial view of Unity3D equal scale restoration of urban street view.
Figure 13. Unity3D virtual simulation scene. (a) There is no behavior of not yielding to pedestrians. (b) Car1 needs to yield to Pedestrian1.
Figure 14. Speed and MOTA comparison of different target detection algorithms.
Figure 15. Relation surface between IoU threshold, interval frame threshold, and MOTA.
Figure 16. Surface diagram of IoU threshold, interval frame threshold, and speed (Hz).
Figure 17. Relationship between accuracy, precision, recall, false alarm rate, and threshold α.
Figure 18. Bar chart of relationship between accuracy, precision, recall, false alarm rate, and weather.
Figure 19. Bar chart of relationship between accuracy, precision, recall, false alarm rate, and time.
Figure 20. Bar chart of relationship between accuracy, precision, recall, false alarm rate, and violation classification on the real dataset.
Figure 21. Bar chart of relationship between accuracy, precision, recall, false alarm rate, and violation classification on the simulation dataset.
Table 1. Performance of this method on the MOT16 benchmark sequence.

Algorithm | Type | MOTA↑ | MOTP↑ | MT↑ | ML↓ | ID Sw↓ | FM↓ | FP↓ | FN↓ | Speed↑
LMP_p [38] | BATCH | 71.0 | 80.2 | 46.9% | 21.9% | 434 | 587 | 7880 | 44,564 | <1 Hz
MCMOT_HDM [39] | BATCH | 62.4 | 78.3 | 31.5% | 24.2% | 1394 | 1318 | 9855 | 57,257 | 10 Hz
NOMTwSDP16 [40] | BATCH | 62.2 | 79.6 | 32.5% | 31.1% | 406 | 642 | 5119 | 63,352 | <1 Hz
EAMIT [41] | ONLINE | 52.5 | 78.8 | 19.0% | 34.9% | 910 | 1321 | 4407 | 81,223 | 5 Hz
POI * [37] | ONLINE | 66.1 | 79.5 | 34.0% | 20.8% | 805 | 3093 | 5061 | 55,914 | 4 Hz
SORT [25] | ONLINE | 59.8 | 79.6 | 25.4% | 22.7% | 1423 | 1835 | 8698 | 63,245 | 25 Hz
Deep SORT [36] | ONLINE | 61.4 | 79.1 | 32.8% | 18.2% | 781 | 2008 | 12,852 | 56,668 | 13 Hz
Ours (Fast R-CNN) | ONLINE | 59.2 | 78.3 | 27.2% | 20.3% | 699 | 2148 | 11,243 | 66,110 | 15 Hz

Bold text represents the best value, and “*” represents that the algorithm only tests targets classified as pedestrians.
Table 2. Performance of the method on the test data sequence of the traffic surveillance camera view.

Algorithm | Type | MOTA↑ | MOTP↑ | MT↑ | ML↓ | ID Sw↓ | FM↓ | FP↓ | FN↓ | Speed↑
IoU [22] | ONLINE | 53.4 | 67.6 | 18.6% | 26.9% | 204 | 2173 | 6712 | 10,830 | 22 Hz
V-IoU [23] | ONLINE | 54.7 | 66.8 | 20.9% | 26.8% | 174 | 1752 | 6810 | 9720 | 20 Hz
SORT [25] | ONLINE | 59.0 | 75.6 | 26.4% | 19.1% | 132 | 1135 | 4671 | 9217 | 19 Hz
Deep SORT [36] | ONLINE | 57.1 | 76.1 | 29.1% | 17.2% | 111 | 1008 | 4236 | 8713 | 11 Hz
Ours | ONLINE | 55.3 | 70.3 | 26.2% | 25.3% | 103 | 948 | 5119 | 9199 | 21 Hz

Bold text represents the best value.
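The MOTA and MOTP columns follow the CLEAR MOT definitions [35]: MOTA penalizes false negatives, false positives, and identity switches relative to the total number of ground-truth objects, while MOTP averages the overlap of matched boxes. A brief sketch of these formulas in Python, assuming per-frame counts are already available, is given below; the data structure is illustrative.

```python
def clear_mot(frames):
    """Compute MOTA and MOTP from per-frame statistics (CLEAR MOT metrics [35]).

    Each frame is a dict with keys: fn, fp, id_sw, gt (number of ground-truth
    objects), and ious (IoU values for matched track/ground-truth pairs).
    """
    fn = sum(f["fn"] for f in frames)
    fp = sum(f["fp"] for f in frames)
    id_sw = sum(f["id_sw"] for f in frames)
    gt = sum(f["gt"] for f in frames)
    matches = [iou for f in frames for iou in f["ious"]]

    mota = 1.0 - (fn + fp + id_sw) / gt
    motp = sum(matches) / len(matches)
    return mota, motp

# Toy example with two frames.
frames = [
    {"fn": 1, "fp": 0, "id_sw": 0, "gt": 5, "ious": [0.9, 0.8, 0.85, 0.7]},
    {"fn": 0, "fp": 1, "id_sw": 1, "gt": 5, "ious": [0.95, 0.75, 0.8, 0.9, 0.65]},
]
print(clear_mot(frames))  # MOTA = 1 - 3/10 = 0.70; MOTP ≈ 0.81
```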
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
