Article

Multi-Object Detection and Tracking Using Reptile Search Optimization Algorithm with Deep Learning

by Ramachandran Alagarsamy 1,* and Dhamodaran Muneeswaran 2
1 Department of Electronics and Communication Engineering, SSM Institute of Engineering and Technology, Dindigul 624002, India
2 Department of Electronics and Communication Engineering, M.Kumarasamy College of Engineering, Karur 639113, India
* Author to whom correspondence should be addressed.
Symmetry 2023, 15(6), 1194; https://doi.org/10.3390/sym15061194
Submission received: 21 February 2023 / Revised: 26 March 2023 / Accepted: 7 April 2023 / Published: 2 June 2023

Abstract: Multiple-Object Tracking (MOT) has become popular because of its commercial and academic potential. Although various techniques have been devised to manage this issue, it remains challenging because of factors such as severe object occlusion and abrupt appearance changes. Tracking presents optimal outcomes whenever an object moves uniformly, without occlusion, and in the same direction. However, this is generally not a real scenario, particularly in complicated scenes such as dance or sporting events, where many players are tracked while moving quickly and varying their speed, direction, distance and position from the camera, and the activity they are executing. In dynamic scenes, MOT remains difficult due to the symmetrical shape, structure, and size of the objects. Therefore, this study develops a new reptile search optimization algorithm with deep learning-based multiple object detection and tracking (RSOADL–MODT) technique. The presented RSOADL–MODT model is intended to recognize and track objects with position estimation, tracking, and action recognition. It follows a series of processes, namely object detection, object classification, and object tracking. At the initial stage, the presented RSOADL–MODT technique applies a path-augmented RetinaNet-based (PA–RetinaNet) object detection module, which improves the feature extraction process. To improve the network capability of the PA–RetinaNet method, the RSOA is utilized as a hyperparameter optimizer. Finally, the quasi-recurrent neural network (QRNN) classifier is exploited for the classification procedure. A wide-ranging experimental validation is performed on the DanceTrack and MOT17 datasets to examine the object detection outcomes of the RSOADL–MODT algorithm. The simulation values confirmed the improvements of the RSOADL–MODT method over other DL approaches.

1. Introduction

Nowadays, object detection and tracking has grabbed wide interest because of its recent research breakthroughs and broad range of applications, carrying equal significance in academia and in real-time applications [1], including security monitoring, transportation surveillance, autonomous driving, and robotic vision. Various sensing modalities, such as computer vision (CV), radar, and Light Detection and Ranging (LiDAR), are available for object detection and tracking. Unlike tracking a single specific object, multiple-object tracking (MOT) can be very complex [2]. In addition to the problems of single-object tracking, MOT within a single category must initialize new tracked objects from detection outcomes, terminate objects whenever they leave the camera's field of view, and re-identify lost objects when they appear again [3]. On top of that, pose changes and the problems of occlusion and background clutter are more complicated than in single-object tracking. To manage such difficulties, certain DL-related techniques have been devised. For instance, traditional features can be replaced with features derived from deep neural networks for associating detection outcomes, even though these features were learned for recognition or classification tasks [4]. In addition, it has been shown that performance is enhanced when attributes of MOT, such as temporal order or spatial attention maps, are exploited. Likewise, certain end-to-end DL structures have been devised for feature extraction, not merely for appearance descriptors but also for motion data. Although DL techniques are potentially applicable to MOT problems [5], there is much room to enhance tracking efficiency using the power of DL, given its great achievements in the domain of image recognition and classification [6].
In DL, object detection refers to the task of labeling the various objects in an image frame with their correct classes and estimating their bounding boxes with a high degree of probability. Learning accuracy in DL rests on prior experience, that is, on the number of samples [7]. Higher accuracy can be achieved with a greater number of samples, and since data are abundantly available these days, DL is the right choice. Unlike conventional (shallow) learning, DL needs thousands of images to attain optimal outcomes [8]. The term "shallow" is used in contrast to "deep". Hence, DL can be computationally intensive and problematic for engineers: it requires high-performance GPUs to offer fast motion and object detection [9]. DL methods are used in both domain-specific and generic object detection and tracking. A deep CNN is employed as the backbone of the detection network to extract crucial features from input images or video frames. These features are then utilized for classifying and localizing objects in the same frames [10].
Wang et al. [11] proposed a fast and robust camera–LiDAR fusion-based MOT technique that accomplishes a better tradeoff between speed and accuracy. Based on the characteristics of LiDAR sensors and cameras, a deep association mechanism was devised and embedded in the presented MOT model. The proposed method performs MOT in the 2D domain when an object is distant and detected only by the camera, and updates the 2D trajectory with 3D data once the object appears in the LiDAR field of view, accomplishing a smooth fusion of 2D and 3D trajectories. Wang et al. [12] introduced an MOT technique based on graph neural networks (GNNs). The major concept is that the GNN relationships between objects of different sizes in the spatial and temporal domains are crucial for learning discriminative features for data association and detection.
The authors in [13] developed an object detection technique by revamping YOLOv3; real-time MOT is then performed using Deep SORT, which tracks the target through data association and motion representation models. It is a tracking-by-detection technique. Guo and Zhao [14] proposed a new architecture for online 3D MOT to remove the impact of unknown biases and the inherent uncertainty in point clouds. A constant turn rate and velocity (CTRV) motion model was used for estimating future motion states, smoothed using a cubature Kalman filter (CKF). An adaptive cubature Kalman filter (ACKF) was presented to update the tracked state robustly and remove the effect of unknown bias.
Rafique et al. [15] developed a Maximum Entropy Scaled Super-Pixels (MEsSP) segmentation technique that encapsulates super-pixel segmentation based on an entropy model and exploits local energy terms for labeling pixels. After acquisition and pre-processing, the image is first segmented using two different approaches: MEsSP and Fuzzy C-Means (FCM). Lastly, based on the categorized objects, a deep belief network (DBN) allocates the related label to the scene, scored with intersection-over-union and dice similarity coefficients. Lusardi et al. [16] developed a GNN-based architecture for MOT that integrates association and detection with novel re-detection features. The combination of multiple appearance features within the architecture is explored to improve tracking accuracy and obtain the best representation, and data augmentation with random noise and random erasing is used to improve robustness during tracking. Jiang et al. [17] proposed a Residual Neural Network (RNN) for target tracking and achieved higher accuracy than the Multi-Domain Network (MDNet) on three complex problems: deformation or rotation, similar-target interference, and complex scenes. Wang et al. [18] proposed a solution for motion blur using a Motion Enhance Attention (MEA) module and detected objects far from the camera using a Dual Correlation Attention (DCA) module, although the combined modules did not yield better tracking performance. CenterTrack [19] estimates object displacements from its inputs to associate objects. FairMOT [20] detects each object in a particular detection dataset and learns to discriminate between objects, so that tracking can be learned from stationary images. TraDes [21] and GTR [22] are newer methods proposed for multiple-object tracking. MOTR [23] and QDTrack [24] carry out multi-object tracking with a transformer and quasi-dense similarity learning, respectively, but both suffer from classification errors on large-scale datasets. In soft computing studies, multi-object detection and tracking approaches are categorized into neural networks, fuzzy logic, and meta-heuristic algorithms [25]. Meta-heuristic algorithms are proposed to solve optimization problems while seeking to escape local optima. Recently, meta-heuristic optimization algorithms have occupied many research studies on object detection and tracking because of their ability to estimate the location of objects accurately [25]. In contrast to traditional optimization methods [26,27,28] that perform an iterative search using a set of elements in video frames, this approach brings efficient evolutionary computing into object tracking: the elements are updated using a fitness function over several iterations around the object location.
In addition, an object tracker can be divided into three core components, namely feature representation, observation, and motion [29]. The tracker generates a feature representation of the object, which is then used for comparison in the observation step. Based on the association results, the motion component predicts the object's location. More effective searches of the search space can thus bring better tracking performance.
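To make this three-part decomposition concrete, the following Python sketch separates the observation step (here a plain IoU comparison standing in for a learned feature representation) from a trivial motion step in a greedy tracking-by-detection loop. It is an illustrative baseline under those assumptions, not the pipeline proposed in this paper.

def iou(a, b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

class MinimalTracker:
    def __init__(self, iou_thresh=0.3):
        self.tracks, self.next_id, self.iou_thresh = {}, 0, iou_thresh

    def update(self, detections):
        # detections: list of boxes for the current frame.
        assigned = set()
        for tid, box in list(self.tracks.items()):
            # Observation: compare the track's representation to unassigned detections.
            scores = [(iou(box, d), i) for i, d in enumerate(detections) if i not in assigned]
            if scores and max(scores)[0] >= self.iou_thresh:
                _, i = max(scores)
                self.tracks[tid] = detections[i]   # motion: carry the matched box forward
                assigned.add(i)
            else:
                del self.tracks[tid]               # terminate a lost track
        for i, d in enumerate(detections):         # spawn tracks for unmatched detections
            if i not in assigned:
                self.tracks[self.next_id] = d
                self.next_id += 1
        return dict(self.tracks)

Calling update once per frame yields a dictionary of track IDs and boxes; replacing the IoU comparison with a learned feature representation is exactly the improvement discussed next.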
Nonetheless, the need for improvements in the other components of object tracking has been largely overlooked in the literature. A possible solution to the weak-feature problem is the adoption of deep learning-based feature representation, which could potentially improve the accuracy and robustness of tracking.
This study develops a new reptile search optimization algorithm with deep learning-based multi-object detection and tracking (RSOADL–MODT) technique. The presented RSOADL–MODT model applies a path-augmented RetinaNet (PA–RetinaNet) based object detection module, which improves the feature extraction process. To improve the network capability of the PA–RetinaNet method, the RSOA is utilized as a hyperparameter optimizer. Finally, the quasi-recurrent neural network (QRNN) classifier is exploited for the classification procedure. A wide-ranging experimental validation is performed on the DanceTrack and MOT17 datasets to examine the object detection outcomes of the RSOADL–MODT approach.

2. Materials and Methods

In this study, an automated MOT method using the RSOADL–MODT technique has been developed for recognizing and tracking objects, with position estimation, tracking, and action recognition. It follows a series of processes, namely PA–RetinaNet-based object detection, RSOA-based hyperparameter tuning, QRNN object classification, and object tracking. Figure 1 represents the workflow of the RSOADL–MODT approach.

2.1. Object Detection Using PA–RetinaNet

In this work, the presented RSOADL–MODT technique exploits the PA–RetinaNet-based object detection module. PA–RetinaNet incorporates a backbone network, a path augmentation model, and two parallel sub-networks with specific tasks [30]. The two sub-networks are used for bounding box regression and object classification. The backbone network adopts the Feature Pyramid Network (FPN), a fully convolutional network, to evaluate the feature maps of input images, while the path augmentation model makes lower-layer data easier to propagate. FPN augments a typical convolutional network with lateral and top-down pathway connections, which allows the effective creation of a rich, multiscale feature pyramid from single-resolution input images. Every level of the pyramid detects objects at a different scale.
The PA–RetinaNet detector inherits and improves the architecture of RetinaNet. The addition of a bottom-up path makes the lowest-layer data easier to transmit, shortens the data transmission path, and produces better results. Lower-level feature maps mostly perceive local information, such as the edges and corners of an image, whereas higher-level feature maps reflect the semantic content of entire objects. A bottom-up path is therefore added to make lower-layer data easier to transmit. Thus, the addition of a bottom-up path is essential to increase the classification ability of each feature level and to propagate semantically strong features in the FPN.
Following the description in FPN, layers of the same network stage produce feature maps of the same spatial size, and every feature level corresponds to a single stage. ResNet is adopted as the fundamental framework, and the feature levels of the FPN are characterized as $\{P_2, P_3, P_4, P_5\}$. The spatial size is gradually downsampled to 1/4 of the prior layer size. $\{N_2, N_3, N_4, N_5\}$ are the newly produced feature maps corresponding to $\{P_2, P_3, P_4, P_5\}$; note that $N_2$ is simply $P_2$. Here, $\oplus$ signifies the lateral connection of the high-layer fine feature map $N_i$ and the low-layer coarse feature map $P_{i+1}$, which produces a new feature map $N_{i+1}$. First, every feature map $N_i$ is downsampled using a 3 × 3 convolutional layer with a stride of 2, providing the same resolution as $P_{i+1}$, and is added to it through the lateral connection. Next, the combined feature map passes through another 3 × 3 convolutional layer to produce $N_{i+1}$, which serves as input to the following two subnetworks. This procedure iterates until $N_5$ is produced. In these modules, 256-channel feature maps are used, and a ReLU operation is applied after every convolutional layer. The newest feature maps, $\{N_2, N_3, N_4, N_5\}$, are later pooled to obtain the feature grid. The two subnets are the box regression subnet and the classification subnet.
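Before turning to the two subnets, the bottom-up construction of $\{N_2, N_3, N_4, N_5\}$ just described can be sketched in PyTorch. This is a minimal illustration assuming 256-channel FPN maps and power-of-two spatial sizes; the module and parameter names are ours, not the authors' implementation.

import torch.nn as nn
import torch.nn.functional as F

class BottomUpAugmentation(nn.Module):
    # Sketch of the bottom-up path augmentation over FPN maps [P2, P3, P4, P5].
    def __init__(self, channels=256, num_levels=4):
        super().__init__()
        # 3x3, stride-2 convolutions downsample N_i to the resolution of P_{i+1}.
        self.down_convs = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1)
            for _ in range(num_levels - 1))
        # 3x3 convolutions that fuse the lateral sum into N_{i+1}.
        self.fuse_convs = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1)
            for _ in range(num_levels - 1))

    def forward(self, pyramid):            # pyramid = [P2, P3, P4, P5]
        outs = [pyramid[0]]                # N2 is defined to be P2
        for i, (down, fuse) in enumerate(zip(self.down_convs, self.fuse_convs)):
            n_down = F.relu(down(outs[-1]))      # downsample N_i
            lateral = n_down + pyramid[i + 1]    # lateral connection with P_{i+1}
            outs.append(F.relu(fuse(lateral)))   # produce N_{i+1}
        return outs                        # [N2, N3, N4, N5]

Calling the module on [P2, P3, P4, P5] returns [N2, N3, N4, N5], mirroring the iterative procedure above.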
The box regression subnet has the same design as the classification subnet, but it terminates in $4A$ linear outputs for every spatial position. The input feature maps of 256 channels are obtained from the preceding layer. Each subnet first applies four 3 × 3 convolutional layers, each with 256 filters and ReLU activation; in the classification subnet, a final 3 × 3 convolutional layer with $KA$ filters and sigmoid activation outputs $KA$ binary predictions for every spatial position. The classification subnet predicts the probability that each of the $K$ object classes is present at each of the $A$ anchors for every spatial location, and if one is present, the box regression subnet regresses the offset from each anchor box to the nearest ground-truth object.
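The two heads can likewise be sketched as follows; the default $K$ and $A$ values and the layer names are assumptions for illustration, following the standard RetinaNet head design rather than the authors' exact code.

import torch.nn as nn

class RetinaNetHeads(nn.Module):
    # Classification and box regression subnets applied to each pyramid level.
    def __init__(self, channels=256, num_classes=80, num_anchors=9):
        super().__init__()
        def tower():
            layers = []
            for _ in range(4):  # four 3x3 conv layers, 256 filters, ReLU each
                layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU()]
            return nn.Sequential(*layers)
        self.cls_tower, self.box_tower = tower(), tower()
        # K*A sigmoid-activated outputs per spatial position (classification subnet).
        self.cls_out = nn.Conv2d(channels, num_classes * num_anchors, 3, padding=1)
        # 4*A linear outputs per spatial position (box regression subnet).
        self.box_out = nn.Conv2d(channels, 4 * num_anchors, 3, padding=1)

    def forward(self, feature):
        cls_logits = self.cls_out(self.cls_tower(feature))  # sigmoid applied in the loss
        box_deltas = self.box_out(self.box_tower(feature))
        return cls_logits, box_deltas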

2.2. RSOA-Based Hyperparameter Tuning

To enhance the network capability of the PA–RetinaNet method, the RSOA is utilized as a hyperparameter optimizer. RSOA is a metaheuristic algorithm based on the natural hunting strategies of crocodiles [31]. The functioning of RSOA is based on two stages: the encircling phase and the hunting phase. RSOA switches between the encircling and hunting searches, and the shift between the different stages is implemented by splitting the number of iterations into four parts. RSOA begins by stochastically producing candidate solutions according to the following formula:
$\chi_{j,k} = \mathrm{rand} \times (U_b - L_b) + L_b, \quad k = 1, 2, \ldots, n$ (1)
In Equation (1), $\chi_{j,k}$ denotes the initialization matrix, $j = 1, 2, \ldots, P$, where $P$ signifies the population size (rows of the initialization matrix), $\mathrm{rand}$ is a randomly generated value, $n$ signifies the dimensions (columns of the initialization matrix) of the optimization problem, and $L_b$ and $U_b$ represent the lower and upper bounds.
The encircling stage is an exploration of higher-density areas. During the encircling stage, belly walking and high walking, stimulated by crocodile movement, play a crucial role. This movement does not assist in catching the prey; however, it assists in exploring the search space.
$\chi_{j,k}(\tau + 1) = \mathrm{Best}_k(\tau) \times \mu_{j,k}(\tau) \times \beta - R_{j,k}(\tau) \times \mathrm{rand}, \quad \tau \le \frac{T}{4}$ (2)
$\chi_{j,k}(\tau + 1) = \mathrm{Best}_k(\tau) \times \chi_{r_1,k} \times ES(\tau) \times \mathrm{rand}, \quad \tau \le \frac{2T}{4} \ \text{and} \ \tau > \frac{T}{4}$ (3)
From the expressions, $\mathrm{Best}_k(\tau)$ denotes the optimum solution attained at the $k$th location, $\mathrm{rand}$ characterizes a randomly generated value, $\tau$ is the current iteration count, and the maximal number of iterations is denoted by $T$. $\mu_{j,k}$ indicates the value of the hunting operator of the $j$th solution at the $k$th location, defined by Equation (4):
$\mu_{j,k} = \mathrm{Best}_k(\tau) \times P_{j,k}$ (4)
where $\beta$ denotes the sensitivity parameter that controls the exploration behavior. An additional function, $R_{j,k}$, whose objective is to reduce the search region, is evaluated by Equation (5):
$R_{j,k} = \frac{\mathrm{Best}_k(\tau) - \chi_{r_2,k}}{\mathrm{Best}_k(\tau) + \epsilon}$ (5)
where $r_1$ denotes a randomly generated integer within $[1, N]$, with $N$ signifying the overall number of candidate solutions; $\chi_{r_1,k}$ characterizes the value of a random solution at the $k$th location; $r_2$ denotes another random integer in $[1, N]$; and $\epsilon$ signifies a value of small magnitude. Modifications are made to incorporate RSOA into the object tracking framework. Prior to the initiation of tracking, a representation of the target object is extracted from the first frame using the feature representation and stored as a reference for comparison. Each candidate produced via Equations (1)–(5) represents a potential tracking solution and is initialized in the given search space.
$ES(\tau)$, called the Evolutionary Sense, is a probability-based ratio. Its mathematical expression is given by:
$ES(\tau) = 2 \times r_3 \times \left(1 - \frac{1}{T}\right)$ (6)
In Equation (6), $r_3$ denotes a randomly generated value. $P_{j,k}$ is evaluated by Equation (7):
$P_{j,k} = \alpha + \frac{\chi_{j,k} - M(\chi_j)}{\mathrm{Best}_k(\tau) \times (U_b(k) - L_b(k)) + \epsilon}$ (7)
where $\alpha$ denotes the sensitivity boundary that controls the exploration accuracy, and $M(\chi_j)$ indicates the average position of the $j$th solution, evaluated by Equation (8):
$M(\chi_j) = \frac{1}{n} \sum_{k=1}^{n} \chi_{j,k}$ (8)
The hunting stage, like the encircling stage, has two approaches: hunting coordination and hunting cooperation. These two approaches are used to traverse the search space locally and assist in targeting the prey (the search for an optimal solution). Based on the iteration count, the hunting stage is divided into two parts: hunting coordination is performed for iterations $\tau \le \frac{3T}{4}$ and $\tau > \frac{2T}{4}$, whereas hunting cooperation is performed for $\tau \le T$ and $\tau > \frac{3T}{4}$. A stochastic coefficient is used to traverse the local search space and produce optimal solutions. Equations (9) and (10) are utilized for the exploitation stage:
$\chi_{j,k}(\tau + 1) = \mathrm{Best}_k(\tau) \times P_{j,k}(\tau) \times \mathrm{rand}, \quad \tau \le \frac{3T}{4} \ \text{and} \ \tau > \frac{2T}{4}$ (9)
$\chi_{j,k}(\tau + 1) = \mathrm{Best}_k(\tau) - \mu_{j,k}(\tau) \times \epsilon - R_{j,k}(\tau) \times \mathrm{rand}, \quad \tau \le T \ \text{and} \ \tau > \frac{3T}{4}$ (10)
where $\mathrm{Best}_k(\tau)$ denotes the $k$th location of the optimum solution obtained in the current iteration, and $\mu_{j,k}$ signifies the hunting operator evaluated using Equation (4). Figure 2 depicts the flowchart of RSOA. During the update process, a feature representation is extracted and compared to the reference object using an exact observation model, namely the fitness function of Equation (11).
The RSOA derives a fitness function to achieve enhanced outcomes, with lower values signifying better candidate solutions. Here, the reduced classifier error rate is treated as the fitness function, as presented in Equation (11):
$\mathrm{fitness}(x_i) = \mathrm{ClassifierErrorRate}(x_i) = \frac{\text{number of misclassified samples}}{\text{total number of samples}} \times 100$ (11)
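To make the four-phase schedule concrete, the following Python sketch wires Equations (1)–(10) into a minimization loop driven by the fitness of Equation (11). The population size, $\alpha$, $\beta$, and $\epsilon$ are illustrative defaults, not values reported in the paper.

import numpy as np

def rsoa_minimize(fitness, lb, ub, pop=20, T=50, alpha=0.1, beta=0.005, eps=1e-10):
    lb, ub = np.asarray(lb, float), np.asarray(ub, float)
    n = lb.size
    X = np.random.rand(pop, n) * (ub - lb) + lb          # Equation (1)
    scores = np.array([fitness(x) for x in X])
    best, best_score = X[scores.argmin()].copy(), scores.min()

    for t in range(1, T + 1):
        ES = 2.0 * np.random.uniform(-1.0, 1.0) * (1.0 - 1.0 / T)   # Equation (6)
        for j in range(pop):
            for k in range(n):
                M = X[j].mean()                                               # Equation (8)
                P = alpha + (X[j, k] - M) / (best[k] * (ub[k] - lb[k]) + eps) # Equation (7)
                mu = best[k] * P                                              # Equation (4)
                R = (best[k] - X[np.random.randint(pop), k]) / (best[k] + eps)  # Equation (5)
                r1 = np.random.randint(pop)
                if t <= T / 4:                       # high walking, Equation (2)
                    X[j, k] = best[k] * mu * beta - R * np.random.rand()
                elif t <= T / 2:                     # belly walking, Equation (3)
                    X[j, k] = best[k] * X[r1, k] * ES * np.random.rand()
                elif t <= 3 * T / 4:                 # hunting coordination, Equation (9)
                    X[j, k] = best[k] * P * np.random.rand()
                else:                                # hunting cooperation, Equation (10)
                    X[j, k] = best[k] - mu * eps - R * np.random.rand()
            X[j] = np.clip(X[j], lb, ub)             # keep candidates inside the bounds
        scores = np.array([fitness(x) for x in X])
        if scores.min() < best_score:
            best, best_score = X[scores.argmin()].copy(), scores.min()
    return best, best_score

For hyperparameter tuning, fitness(x) would train or validate PA–RetinaNet under the candidate hyperparameters (e.g., learning rate) and return the classifier error rate of Equation (11).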

2.3. Classification Using QRNN Model

For object classification, the QRNN model is used in this work (Algorithm 1). The classification problem is hard to extend to large-scale datasets, and the QRNN was proposed to decrease the computational effort of the recurrent step in an LSTM [32]. The QRNN is a hybrid neural network that combines the benefits of CNNs and LSTMs. In contrast to the LSTM, the QRNN is as highly parallelizable as a CNN: its parameterless, element-wise functions run in parallel along the channel (feature) dimension, and its convolutions and matrix multiplications over contiguous blocks of timesteps are computed simultaneously. The LSTM, by comparison, decomposes into element-wise and linear blocks whose computation at every time step depends on the previous output.
Every layer of the QRNN integrates two components, analogous to the convolution and pooling layers of a CNN. Both permit fully parallel computation: the pooling layer parallelizes across feature dimensions and mini-batches, while the convolutional layer parallelizes across the spatial (i.e., sequence) dimension and mini-batches.
The equations of the QRNN unit are demonstrated below:
$\hat{x}_t = \tanh(W * X_t)$
$f_t = \sigma(W_f * X_t)$
$o_t = \sigma(W_o * X_t)$
$c_t = f_t \odot c_{t-1} + (1 - f_t) \odot \hat{x}_t$
$h_t = o_t \odot \tanh(c_t)$ (12)
From the expressions, $X_t = [x_{t-k+1}, \ldots, x_t]$, where $X_t \in \mathbb{R}^{k \times n}$ denotes the input series of $k$ $n$-dimensional vectors, $*$ signifies the masked convolution along the timestep dimension, $W$, $W_f$, and $W_o$ indicate convolutional filter banks in $\mathbb{R}^{d \times n \times k}$, and $k$ represents the width of the filters. The first three expressions form the convolutional part of the QRNN and generate the $d$-dimensional sequences $\hat{x}_t$, $f_t$, and $o_t$. The symbol $\odot$ signifies component-wise multiplication. Notably, the QRNN exploits a forget gate that performs "dynamic average pooling".
A single QRNN layer implements three matrix-vector operations, as indicated in Equation (12), which depend only on the input series $X$ and not on preceding outputs such as $h_{t-1}$. With known input, the operation $W X_t$ can be precomputed across timesteps. Consequently, weight matrices occupying an enormous quantity of memory no longer have to be loaded at every timestep. In the presented method, the DRAM cost is decreased, and the computation grows as follows:
$U^T = \begin{bmatrix} W \\ W_f \\ W_o \end{bmatrix} \begin{bmatrix} X_{k-1}, X_k, \ldots, X_{L+k-1} \end{bmatrix}$ (13)
where $U \in \mathbb{R}^{L \times 3d}$ denotes the combined result matrix, $d$ indicates the number of hidden-layer neurons, and $L = T - k + 1$ signifies the input sequence length. $U \in \mathbb{R}^{L \times B \times 3d}$ becomes a tensor when a minibatch of size $B$ is considered.
Algorithm 1: QRNN Model
  • Input: merged matrix $U[k, j, i]$, initial hidden state $c_0[i, j]$, bias vectors $b_f[i]$, $b_r[i]$
  • Output: hidden state tensor $c[\cdot]$ and output tensor $h[\cdot]$
  • Parameter settings: $d$ number of hidden layer neurons, $B$ mini-batch size, $L$ output sequence length
  • for $j = 1, \ldots, B$ and $i = 1, \ldots, d$ do
      $c = c_0[i, j]$
  •   for $k = 1, \ldots, L$ do
        $\hat{x} = \tanh(U[k, j, i])$
        $f = \sigma(U[k, j, i + d] + b_f[i])$
        $r = \sigma(U[k, j, i + d \times 2] + b_r[i])$
        $c = f \times c + (1 - f) \times \hat{x}$
        $h = r \times \tanh(c)$
        $c[k, j, i] = c$
        $h[k, j, i] = h$
  • return $h[\cdot]$ and $c[\cdot]$
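As a concrete illustration, the following NumPy sketch implements the gate computations of Equation (12) and the sequential pooling of Algorithm 1 for one sequence. The filter width is fixed to $k = 1$ (so the masked convolution reduces to a matrix multiply) and the output gate carries no bias, both simplifying assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def qrnn_layer(X, W, W_f, W_o, b_f, c0=None):
    # X: (L, n) input sequence; W, W_f, W_o: (n, d) filter banks; b_f: (d,) forget bias.
    L, _ = X.shape
    d = W.shape[1]
    # Convolutional part of Equation (12): all gates computed in parallel over timesteps.
    x_hat = np.tanh(X @ W)        # candidate activations
    f = sigmoid(X @ W_f + b_f)    # forget gate
    o = sigmoid(X @ W_o)          # output gate
    # Pooling part (Algorithm 1): the only sequential dependency.
    c = np.zeros(d) if c0 is None else c0
    h = np.empty((L, d))
    for t in range(L):
        c = f[t] * c + (1.0 - f[t]) * x_hat[t]   # dynamic average pooling
        h[t] = o[t] * np.tanh(c)
    return h, c

Only the short pooling loop is sequential; everything above it is a batched matrix product, which is the source of the QRNN's speed advantage over the LSTM.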

3. Performance Validation

This section presents the details of the experimental datasets and evaluation metrics. It demonstrates the effectiveness of the proposed method by comparing its performance to state-of-the-art methods on benchmark datasets. An ablation study is then conducted to investigate how the PA–RetinaNet detector, the reptile search optimization, and the classification method each contribute to the performance of the approach.
The DanceTrack [33] and MOT17 [34] datasets are used to evaluate the proposed method. DanceTrack [33] is a large-scale dataset for multi-object tracking in complex scenes with uniform appearance and diverse motion; it has 40 videos for training, 25 for validation, and 35 for testing. MOT17 [34] is a widely used dataset containing seven sequences for training and seven for testing, consisting of crowded street scenes with largely linear object motion.
To evaluate the proposed method, the following tracking metrics are utilized. MOTA: the Multiple Object Tracking Accuracy (MOTA) [35] metric combines tracking accuracy with detection accuracy; it weighs detection performance more heavily than association performance and is computed from FP (False Positives), FN (False Negatives), and ID switches (IDsw) relative to the total number of ground-truth objects. IDF1: the Identification F1 Score (IDF1) [36] matches ground truth and predictions at the trajectory level and computes the corresponding F1 score, focusing on association performance. HOTA: the Higher Order Tracking Accuracy (HOTA) [37] aims to combine the evaluation of detection and association in a balanced way. Figure 3 shows the sample images.
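For concreteness, MOTA reduces to a one-line computation: the error counts are summed and normalized by the number of ground-truth objects [35]. The helper below uses hypothetical counts, not the evaluation code of this paper.

def mota(fp, fn, id_switches, num_gt):
    # MOTA = 1 - (FP + FN + IDsw) / number of ground-truth objects [35].
    return 1.0 - (fp + fn + id_switches) / float(num_gt)

# Hypothetical counts: even thousands of ID switches barely move MOTA
# when FP/FN are low relative to a large ground-truth set.
print(round(mota(fp=2000, fn=9000, id_switches=4272, num_gt=150000), 4))  # 0.8982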
Experiments are conducted in PyTorch on a machine with eight NVIDIA 2080 Ti GPUs. A variant of ResNet is utilized as the backbone network for fast convergence and object detection. The model is trained with RSOA tuning of the hyperparameters, which boosts accuracy. The initial learning rate is 0.0001; the model is trained on video clips with a batch size of eight videos on each of the eight GPUs, resulting in an effective batch size of 64.
On DanceTrack, the model is trained for 70 epochs, and the learning rate decreases by a factor of 10 at the 10th epoch. From the initial video clip length, the clip length is gradually increased to 3, 4, and 5 at the 30th, 40th, and 50th epochs, respectively. This gradual increment of the video clip length improves training efficiency and stability. The training step takes about 8 h on two 2080 Ti GPUs.
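A sketch of this schedule in PyTorch is given below. The optimizer choice and the initial clip length are assumptions; the epoch count, milestones, and decay factor follow the text.

import torch

model = torch.nn.Linear(10, 2)  # stand-in for the detector/tracker network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # initial LR 0.0001; optimizer assumed
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10], gamma=0.1)

def clip_length(epoch):
    # Clip-length curriculum: increase to 3, 4, 5 at the 30th, 40th, 50th epochs.
    if epoch >= 50:
        return 5
    if epoch >= 40:
        return 4
    if epoch >= 30:
        return 3
    return 2  # initial length is an assumption; the text does not state it

for epoch in range(70):          # 70 epochs on DanceTrack
    clips = clip_length(epoch)   # sample training clips of this length
    # ... run one training epoch over clips of length `clips` ...
    scheduler.step()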
In Table 1 and Figure 4, the overall tracking performance of the RSOADL–MODT method is examined in terms of the evaluation metrics on the DanceTrack dataset. The results indicate that the RSOADL–MODT technique accomplishes improved performance in each iteration. It is noticed that the RSOADL–MODT technique attains an average MOTA of 87.67, MOTP of 0.325, IDF1 of 51.21, IDsw of 4272, recall of 98.52, precision of 97.73, and MT of 537.
The training accuracy (TACY) and validation accuracy (VACY) of the RSOADL–MODT approach on the DanceTrack dataset are shown in Figure 5. The figure implies that the RSOADL–MODT technique attains improved performance with increasing values of TACY. Visibly, the RSOADL–MODT method reaches maximal TACY outcomes.
The training loss (TLOS) and validation loss (VLOS) of the RSOADL–MODT method on the DanceTrack dataset are shown in Figure 6. The figure indicates that the RSOADL–MODT method performs better with minimal values of TLOS and VLOS. In particular, the RSOADL–MODT approach attains minimal VLOS outcomes.
A comparison study is made on the DanceTrack dataset in Table 2 to highlight the enhanced performance of the RSOADL–MODT technique [31].
Figure 7 shows the comparative assessment of the RSOADL–MODT technique in terms of MOTA, IDF1, recall, and precision. The results indicate that the RSOADL–MODT technique reaches improved results. In terms of MOTA, the RSOADL–MODT technique attains a higher MOTA of 87.67, while the CenterTrack [19], FairMOT [20], TraDes [21], GTR [22], and MOTR [23] models attain lower MOTA values.
In Table 3 and Figure 8, the overall tracking performance of the RSOADL–MODT technique is examined in terms of MOTA, MOTP, IDF1, IDsw, and MT on the MOT17 dataset. The results indicate that the RSOADL–MODT technique accomplishes improved performance in each iteration. It is noticed that the RSOADL–MODT technique attains an average MOTA of 74.67, MOTP of 0.321, IDF1 of 72.31, IDsw of 4331, recall of 98.40, precision of 97.63, and MT of 623.
The TACY and VACY of the RSOADL–MODT model on the MOT17 dataset are shown in Figure 9. The figure implies that the RSOADL–MODT methodology displays enhanced performance with increasing values of TACY and VACY; notably, the RSOADL–MODT technique reaches maximal TACY outcomes.
The TLOS and VLOS of the RSOADL–MODT algorithm on the MOT17 dataset are shown in Figure 10. The figure indicates that the RSOADL–MODT technique reveals better performance with the smallest values of TLOS and VLOS. Visibly, the RSOADL–MODT technique has reduced VLOS outcomes.
To highlight the enhanced performance of the RSOADL–MODT method, a comparative analysis is made on the MOT17 dataset in Table 4.
Figure 11 exhibits the comparative assessment of the RSOADL–MODT approach in terms of MOTA, IDF1, recall, and precision. The results indicate that the RSOADL–MODT technique reaches improved results. In terms of MOTA, the RSOADL–MODT algorithm attains a higher MOTA of 74.67, while the CenterTrack [19], FairMOT [20], TraDes [21], GTR [22], MOTR [23], and QDTrack [24] methods attain lower MOTA values. Meanwhile, in terms of IDF1, the RSOADL–MODT technique attains a higher IDF1 of 72.31, while the CenterTrack [19], FairMOT [20], TraDes [21], GTR [22], MOTR [23], and QDTrack [24] models attain lower IDF1 values.

Ablation Study

To demonstrate the effectiveness of the proposed method, an ablation study is conducted on the DanceTrack and MOT17 training sets using 3-fold cross-validation, following [38]. This study analyzes the main aspects of the method: (i) the advantages of using a PA–RetinaNet object detector and (ii) the hyperparameter optimization method across different levels.
At the initial stage, the presented RSOADL–MODT technique applies the path-augmented RetinaNet (PA–RetinaNet) object detection module, which improves the feature extraction process. To improve the network capability of the PA–RetinaNet method, the Reptile Search Optimization Algorithm (RSOA) is utilized as a hyperparameter optimizer, which greatly relieves the optimization difficulty for object detection. Combining the feature extractor with the PA–RetinaNet detector and reptile search optimization achieves the best performance, with 87.6 MOTA and 51.2 IDF1. MOTA and IDF1 are improved, and the ID switches are significantly reduced. MOTA and IDF1 are calculated for the overall evaluations; nevertheless, this approach could be used to progress MOTA and IDF1 further. A YOLOX-based proposal does not perform well enough on small datasets such as MOT17 due to its data-hungry nature, though it works well on large datasets such as DanceTrack [24]. Compared to QDTrack, the proposed method gives a sufficient improvement on MOT17.
The results indicate that base object detectors with classification feature embeddings are adequate for multiple-object tracking with effective optimization. This method produces reasonable results and achieves the best overall MOTA and IDF1 scores. Additionally, the optimization algorithm with deep learning boosts the performance on all metrics while tuning the hyperparameters, with only a slight increase in computation. Moreover, better computational outcomes are obtained by metaheuristics with a precise optimization algorithm. Compared to the state-of-the-art methods given in Table 4, the proposed method is slightly less computationally expensive.
The proposed method shows a slight improvement on the DanceTrack dataset when the above features are combined with hyperparameter tuning; handling long-term occlusion between objects should be improved using meta-heuristic optimization algorithms in future work.
Table 4 reports a significant improvement in MOTA, IDF1, and HOTA over state-of-the-art techniques, showing that competitive tracking performance can be achieved by applying optimization during tracking. Given the distinctive characteristics of the datasets, these results highlight the versatility of the approach in selecting the right methods for different scenarios.

4. Conclusions

In this study, an automated MOT method using the RSOADL–MODT technique has been developed for recognizing and tracking objects, with position estimation, tracking, and action recognition. It follows a series of processes, namely object detection, hyperparameter tuning, object classification, and object tracking. Primarily, the presented RSOADL–MODT technique exploits the PA–RetinaNet-based object detection module, which improves the feature extraction process. To enhance the network capability of the PA–RetinaNet method, the RSOA is utilized as a hyperparameter optimizer. At last, the QRNN classifier is exploited for the classification procedure. A wide-ranging experimental validation was performed on the DanceTrack and MOT17 datasets to examine the object detection outcomes of the RSOADL–MODT algorithm. The simulation values confirmed the improvements of the RSOADL–MODT method over other DL models. In the future, feature fusion-based DL techniques can be employed to improve the object detection results of the proposed model.

Author Contributions

Conceptualization, R.A. and D.M.; Formal Analysis, R.A. and D.M.; Investigation, R.A.; Writing—original draft, R.A.; Writing—Review and Editing, D.M.; Supervision, D.M.; All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zhang, X.; Ling, Y.; Yang, Y.; Chu, C.; Zhou, Z. Center-point-pair detection and context-aware re-identification for end-to-end multi-object tracking. Neurocomputing 2023, 524, 17–30.
  2. Guo, S.; Wang, S.; Yang, Z.; Wang, L.; Zhang, H.; Guo, P.; Gao, Y.; Guo, J. A Review of Deep Learning-Based Visual Multi-Object Tracking Algorithms for Autonomous Driving. Appl. Sci. 2022, 12, 10741.
  3. Pearce, A.; Zhang, J.A.; Xu, R.; Wu, K. Multi-Object tracking with mmWave Radar: A Review. Electronics 2023, 12, 308.
  4. Cao, J.; Weng, X.; Khirodkar, R.; Pang, J.; Kitani, K. Observation-centric sort: Rethinking sort for robust multi-object tracking. arXiv 2022, arXiv:2203.14360.
  5. Pal, S.K.; Pramanik, A.; Maiti, J.; Mitra, P. Deep learning in multi-object detection and tracking: State of the art. Appl. Intell. 2021, 51, 6400–6429.
  6. Suljagic, H.; Bayraktar, E.; Celebi, N. Similarity based person re-identification for multi-object tracking using deep Siamese network. Neural Comput. Appl. 2022, 34, 18171–18182.
  7. Valverde, F.R.; Hurtado, J.V.; Valada, A. There is more than meets the eye: Self-supervised multi-object detection and tracking with sound by distilling multimodal knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11612–11621.
  8. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. Bytetrack: Multi-object tracking by associating every detection box. In Proceedings of the Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Springer Nature: Cham, Switzerland, 2022; Part XXII, pp. 1–21.
  9. Ravindran, R.; Santora, M.J.; Jamali, M.M. Multi-object detection and tracking, based on DNN, for autonomous vehicles: A review. IEEE Sens. J. 2020, 21, 5668–5677.
  10. Liang, T.; Li, B.; Wang, M.; Tan, H.; Luo, Z. A Closer Look at the Joint Training of Object Detection and Re-Identification in Multi-Object Tracking. IEEE Trans. Image Process. 2022, 32, 267–280.
  11. Wang, X.; Fu, C.; Li, Z.; Lai, Y.; He, J. DeepFusionMOT: A 3D Multi-Object Tracking Framework Based on Camera-LiDAR Fusion with Deep Association. IEEE Robot. Autom. Lett. 2022, 7, 8260–8267.
  12. Wang, Y.; Kitani, K.; Weng, X. Joint object detection and multi-object tracking with graph neural networks. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi'an, China, 30 May–5 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 13708–13715.
  13. Praveenkumar, S.M.; Patil, P.; Hiremath, P.S. Real-time multi-object tracking of pedestrians in a video using convolution neural network and Deep SORT. In Proceedings of the ICT Systems and Sustainability: Proceedings of ICT4SD 2021, Goa, India, 22–23 July 2022; Springer: Singapore, 2022; Volume 1, pp. 725–736.
  14. Guo, G.; Zhao, S. 3D multi-object tracking with adaptive cubature Kalman filter for autonomous driving. IEEE Trans. Intell. Veh. 2022, 8, 512–519.
  15. Rafique, A.A.; Gochoo, M.; Jalal, A.; Kim, K. Maximum entropy scaled super pixels segmentation for multi-object detection and scene recognition via deep belief network. Multimed. Tools Appl. 2022, 82, 13401–13430.
  16. Lusardi, C.; Taufique, A.M.N.; Savakis, A. Robust multi-object tracking using re-identification features and graph convolutional networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3868–3877.
  17. Jiang, T.; Zhang, Q.; Yuan, J.; Wang, C.; Li, C. Multi-Type Object Tracking Based on Residual Neural Network Model. Symmetry 2022, 14, 1689.
  18. Wang, Y.; Zhang, Z.; Zhang, N.; Zeng, D. Attention Modulated Multiple Object Tracking with Motion Enhancement and Dual Correlation. Symmetry 2021, 13, 266.
  19. Zhou, X.; Koltun, V.; Krähenbühl, P. Tracking objects as points. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 474–490.
  20. Zhang, Y.; Wang, C.; Wang, X.; Zeng, W.; Liu, W. Fairmot: On the fairness of detection and re-identification in multiple object tracking. IJCV 2021, 129, 3069–3087.
  21. Wu, J.; Cao, J.; Song, L.; Wang, Y.; Yang, M.; Yuan, J. Track to detect and segment: An online multi-object tracker. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021.
  22. Zhou, X.; Yin, T.; Koltun, V.; Krahenbuhl, P. Global tracking transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 8771–8780.
  23. Zeng, F.; Dong, B.; Zhang, Y.; Wang, T.; Zhang, X.; Wei, Y. Motr: End-to-end multiple object tracking with transformer. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022.
  24. Pang, J.; Qiu, L.; Li, X.; Chen, H.; Li, Q.; Darrell, T.; Yu, F. Quasi-dense similarity learning for multiple object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 164–173.
  25. Kaushal, M.; Khehra, B.S.; Sharma, A. Soft Computing based object detection and tracking approaches: State-of-the-Art survey. Appl. Soft Comput. 2018, 70, 423–464.
  26. Castro, E.C.d.; Salles, E.O.T.; Ciarelli, P.M. A New Approach to Enhanced Swarm Intelligence Applied to Video Target Tracking. Sensors 2021, 21, 1903.
  27. Gao, M.L.; Li, L.L.; Sun, X.M.; Yin, L.J.; Li, H.T.; Luo, D.S. Firefly algorithm (FA) based particle filter method for visual tracking. Optik 2015, 126, 1705–1711.
  28. Walia, G.S.; Kapoor, R. Intelligent video target tracking using an evolutionary particle filter based upon improved cuckoo search. Expert Syst. Appl. 2014, 41, 6315–6326.
  29. Wang, N.; Shi, J.; Yeung, D.-Y.; Jia, J. Understanding and diagnosing visual tracking systems. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 3101–3109.
  30. Tan, G.; Guo, Z.; Xiao, Y. PA-RetinaNet: Path augmented RetinaNet for dense object detection. In Proceedings of the Artificial Neural Networks and Machine Learning–ICANN 2019: Deep Learning, Munich, Germany, 17–19 September 2019; pp. 138–149.
  31. Khan, M.K.; Zafar, M.H.; Rashid, S.; Mansoor, M.; Moosavi, S.K.R.; Sanfilippo, F. Improved Reptile Search Optimization Algorithm: Application on Regression and Classification Problems. Appl. Sci. 2023, 13, 945.
  32. Yang, C.; Wang, W.; Zhang, X.; Guo, Q.; Zhu, T.; Ai, Q. A parallel electrical optimized load forecasting method based on quasi-recurrent neural network. IOP Conf. Ser. Earth Environ. Sci. 2021, 696, 012040.
  33. Sun, P.; Cao, J.; Jiang, Y.; Yuan, Z.; Bai, S.; Kitani, K.; Luo, P. Dancetrack: Multi-object tracking in uniform appearance and diverse motion. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, 18–24 June 2022; IEEE Computer Society: Los Alamitos, CA, USA, 2022; pp. 20961–20970.
  34. Milan, A.; Leal-Taixé, L.; Reid, I.; Roth, S.; Schindler, K. MOT16: A benchmark for multi-object tracking. arXiv 2016, arXiv:1603.00831.
  35. Bernardin, K.; Stiefelhagen, R. Evaluating multiple object tracking performance: The clear mot metrics. EURASIP J. Image Video Process. 2008, 2008, 246309.
  36. Ristani, E.; Solera, F.; Zou, R.; Cucchiara, R.; Tomasi, C. Performance measures and a data set for multi-target, multi-camera tracking. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 17–35.
  37. Luiten, J.; Osep, A.; Dendorfer, P.; Torr, P.; Geiger, A.; Leal-Taixé, L.; Leibe, B. Hota: A higher order metric for evaluating multi-object tracking. Int. J. Comput. Vis. 2021, 129, 548–578.
  38. Braso, G.; Leal-Taixe, L. Learning a neural solver for multiple object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020.
Figure 1. Workflow of RSOADL–MODT approach—Detection, Tracking and Classification.
Figure 2. Flowchart of RSOA.
Figure 3. Sample Images.
Figure 4. Average result of RSOADL–MODT approach on the DanceTrack dataset.
Figure 5. TACY and VACY outcome of RSOADL–MODT approach on the DanceTrack dataset.
Figure 6. TLOS and VLOS outcome of RSOADL–MODT approach on the DanceTrack dataset.
Figure 7. Comparative outcome of RSOADL–MODT approach on the DanceTrack dataset.
Figure 8. Average result of RSOADL–MODT approach on the MOT17 dataset.
Figure 9. TACY and VACY outcome of RSOADL–MODT approach on the MOT17 dataset.
Figure 10. TLOS and VLOS outcome of RSOADL–MODT approach on the MOT17 dataset.
Figure 11. Comparative outcome of RSOADL–MODT approach on the MOT17 dataset.
Table 1. Overall tracking result of RSOADL–MODT approach on the DanceTrack dataset.
DanceTrack Dataset
No. of Iterations | MOTA | MOTP | IDF1 | IDsw | Recall | Precision | MT
Iteration-1 | 87.32 | 0.340 | 51.48 | 3953 | 98.70 | 97.45 | 537
Iteration-2 | 88.27 | 0.341 | 50.27 | 4461 | 98.86 | 97.62 | 539
Iteration-3 | 87.12 | 0.317 | 51.04 | 4590 | 98.70 | 97.87 | 537
Iteration-4 | 87.16 | 0.301 | 50.85 | 4050 | 98.20 | 97.73 | 538
Iteration-5 | 88.52 | 0.324 | 52.42 | 4304 | 98.14 | 97.96 | 535
Average | 87.67 | 0.325 | 51.21 | 4272 | 98.52 | 97.73 | 537
Table 2. Comparative outcome of RSOADL–MODT approach with other systems on the DanceTrack dataset.
Models | MOTA | HOTA | IDF1 | DetA | AssA
RSOADL–MODT | 87.6 | 54.3 | 51.21 | 80.6 | 37.1
CenterTrack [19] | 86.8 | 41.8 | 35.7 | 78.1 | 22.6
FairMOT [20] | 82.2 | 39.9 | 40.8 | 66.7 | 23.8
TraDes [21] | 86.2 | 43.3 | 41.2 | 74.5 | 25.4
GTR [22] | 84.7 | 48.0 | 50.3 | 72.5 | 31.9
MOTR [23] | 79.7 | 47.7 | 51.5 | 73.5 | 40.2
QDTrack [24] | 87.7 | 54.2 | 50.4 | 80.1 | 36.8
Table 3. Overall tracking result of RSOADL–MODT approach on the MOT17 dataset.
MOT17 Dataset
No. of Iterations | MOTA | MOTP | IDF1 | IDsw | Recall | Precision | MT
Iteration-1 | 73.22 | 0.328 | 73.22 | 4488 | 98.40 | 97.70 | 617
Iteration-2 | 74.11 | 0.304 | 71.56 | 4559 | 98.24 | 97.64 | 623
Iteration-3 | 78.23 | 0.343 | 74.26 | 3991 | 98.73 | 97.57 | 623
Iteration-4 | 73.46 | 0.307 | 72.11 | 4039 | 98.14 | 97.91 | 629
Iteration-5 | 74.36 | 0.324 | 70.44 | 4580 | 98.48 | 97.32 | 625
Average | 74.67 | 0.321 | 72.31 | 4331 | 98.40 | 97.63 | 623
Table 4. Comparative outcome of RSOADL–MODT approach with other systems on the MOT17 dataset.
Models | MOTA | HOTA | IDF1 | DetA | AssA
RSOADL–MODT | 74.6 | 55.6 | 72.3 | 60.7 | 56.3
CenterTrack [19] | 67.8 | 52.2 | 64.7 | 53.8 | 51
FairMOT [20] | 73.7 | 59.3 | 72.3 | 60.9 | 58
TraDes [21] | 69.1 | 52.7 | 63.9 | 55.2 | 50.8
MOTR [23] | 73.4 | 57.8 | 68.6 | 60.3 | 55.7
QDTrack [24] | 68.7 | 53.9 | 66.3 | 57.2 | 52.7
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
