A Survey of Computer Vision Detection, Visual SLAM Algorithms, and Their Applications in Energy-Efficient Autonomous Systems

Chen, Lu; Li, Gun; Xie, Weisi; Tan, Jie; Li, Yang; Pu, Junfeng; Chen, Lizhu; Gan, Decheng; Shi, Weimin

doi:10.3390/en17205177

Open AccessReview

A Survey of Computer Vision Detection, Visual SLAM Algorithms, and Their Applications in Energy-Efficient Autonomous Systems

by

Lu Chen

¹

,

Gun Li

^1,*

,

Weisi Xie

¹,

Jie Tan

¹,

Yang Li

¹

,

Junfeng Pu

¹,

Lizhu Chen

¹,

Decheng Gan

² and

Weimin Shi

^1,3,*

¹

School of Aeronautics and Astronautics, University of Electronic Science and Technology of China, Chengdu 611731, China

²

School of Electronic Information Engineering, Yangtze Normal University, Chongqing 408100, China

³

School of Microelectronics and Communication Engineering, Chongqing University, Chongqing 400044, China

^*

Authors to whom correspondence should be addressed.

Energies 2024, 17(20), 5177; https://doi.org/10.3390/en17205177

Submission received: 10 September 2024 / Revised: 9 October 2024 / Accepted: 14 October 2024 / Published: 17 October 2024

(This article belongs to the Section K: State-of-the-Art Energy Related Technologies)

Download

Browse Figures

Versions Notes

Abstract

Within the area of environmental perception, automatic navigation, object detection, and computer vision are crucial and demanding fields with many applications in modern industries, such as multi-target long-term visual tracking in automated production, defect detection, and driverless robotic vehicles. The performance of computer vision has greatly improved recently thanks to developments in deep learning algorithms and hardware computing capabilities, which have spawned the creation of a large number of related applications. At the same time, with the rapid increase in autonomous systems in the market, energy consumption has become an increasingly critical issue in computer vision and SLAM (Simultaneous Localization and Mapping) algorithms. This paper presents the results of a detailed review of over 100 papers published over the course of two decades (1999–2024), with a primary focus on the technical advancement in computer vision. To elucidate the foundational principles, an examination of typical visual algorithms based on traditional correlation filtering was initially conducted. Subsequently, a comprehensive overview of the state-of-the-art advancements in deep learning-based computer vision techniques was compiled. Furthermore, a comparative analysis of conventional and novel algorithms was undertaken to discuss the future trends and directions of computer vision. Lastly, the feasibility of employing visual SLAM algorithms in the context of autonomous vehicles was explored. Additionally, in the context of intelligent robots for low-carbon, unmanned factories, we discussed model optimization techniques such as pruning and quantization, highlighting their importance in enhancing energy efficiency. We conducted a comprehensive comparison of the performance and energy consumption of various computer vision algorithms, with a detailed exploration of how to balance these factors and a discussion of potential future development trends.

Keywords:

computer vision; object detection; visual tracking; energy efficiency; visual SLAM; deep learning; driverless robotic vehicles

1. Introduction

In today’s digital era, computer vision systems are progressively replacing traditional, costly sensors as the primary data source across a range of applications, leading to both cost reduction and substantial performance improvements. These systems can improve their intelligence by integrating AI algorithms with cognitive processes. Visual object detection, a key task in computer vision, involves identifying objects like humans, vehicles, or animals within digital images or video streams. This enables automated systems to analyze the motion, behavior, and actions of these objects.

The application of vision detection technology in robotic vehicles has become central to the development of autonomous systems, with localization, navigation, path planning, obstacle detection, and robotic grasping. These applications highlight the importance of computer vision in enhancing automation and energy efficiency, particularly in multi-driverless systems. The integration of vision detection and green energy technologies presents both challenges and opportunities, ranging from navigation to minimizing environmental impact. This paper explores how computer vision algorithms are transforming autonomous systems, with a focus on sustainability. The rise of deep convolutional neural networks, along with powerful GPU computing, has rapidly advanced computer vision, with deep learning frameworks now serving as the foundation for most state-of-the-art object detection systems. This review highlights the key role deep learning algorithms play in driving progress in computer vision, with applications in autonomous vehicles, video surveillance, SLAM navigation, and industrial automation. We also explore the balance between algorithm performance and energy consumption, reviewing the role of optimization techniques in reducing energy use within the increasingly widespread context of intelligent autonomous systems, as demonstrated in Figure 1. These technologies are shaping fields such as security, automation, transportation, and medicine, underscoring their pervasive influence in modern society.

As the applications of computer vision technology become increasingly widespread, research and development in this field are also flourishing. Figure 2 illustrates the ascending trend in the number of papers tagged with “computer vision” over the past decade, highlighting the field’s growing prominence and impact within the scientific and academic communities.

1.1. Innovative Contributions and Gaps Compared to Related Reviews

A comprehensive and dimensional review of the evolution of computer vision algorithms.

This article presents a comprehensive survey of research on computer vision and a roadmap of milestone vision detectors, as depicted in Figure 3. It also summarizes and organizes the key nodes proposed by computer vision detection and tracking networks, categorizing them into traditional methods and deep learning-based methods according to their chronological development. Post-2024, the classification is further refined into one-stage detectors, two-stage detectors, and visual SLAM, providing an intuitive reflection of the technological evolution;

2.: A comprehensive discussion has been conducted on the classification of datasets that are crucial to computer vision.

A comprehensive analysis has been conducted on various types of datasets, including 1D, 2D, 3D, 3D+ vision, and multi-modal sensing, from multiple perspectives such as dataset structure, scale, features, and applications. These datasets not only cover general object detection but also include commonly used targets in autonomous driving, such as traffic signals and road surface detection. Additionally, the analysis explores how these datasets can be effectively used to evaluate the performance of algorithms;

3.: The relationship between computer vision algorithms and energy consumption has been explored.

This paper introduces a comparative analysis that evaluates the energy efficiency of various algorithms by measuring their power consumption during operation and computational efficiency and examining the trade-offs between performance and energy consumption. This in-depth examination of energy efficiency in relation to algorithmic advancements is a key contribution of this paper;

4.: Examples of industrial applications in this research area.

We discussed the specific applications of this domain in complex, dynamic settings like unmanned factories and autonomous vehicles, highlighting practical considerations and energy-performance trade-offs. The system integrates visual detection and tracking with SLAM for localization and navigation, showcasing a comprehensive application of visual technology. Additionally, it highlights the latest industrial application of ACDet in the pharmaceutical sector, demonstrating its significant social and economic value.

Compared to related reviews, the key differences lie in how we not only highlight the scientific contributions but also clearly distinguish our work by emphasizing energy efficiency, industry-specific applications, and environmental sustainability in computer vision. Specifically, our review provides a unique perspective on balancing algorithm performance with energy consumption, explores applications tailored to various industrial needs, and addresses the importance of sustainable practices in implementing these technologies. Specifically, the key differences are as follows:

Integration of energy metrics in algorithm comparison.

While many reviews focus on algorithm accuracy and speed, this paper includes a comparative analysis of energy efficiency, a perspective that addresses both environmental sustainability and practical limitations in deploying these technologies;

2.: Cross-domain application analysis.

This paper extends beyond general computer vision applications and explores domain-specific use cases, including the medical industry, autonomous driving, and unmanned aerial vehicles. This cross-domain applicability is a unique angle that shows the versatility of vision algorithms across different industries;

3.: Specific emphasis on autonomous navigation and low-carbon impact.

By examining SLAM algorithms in the context of green technology and autonomous navigation, this paper aligns with current sustainability trends, which are not as extensively covered in traditional computer vision reviews.

These set the current research landscape apart from the numerous general object detection reviews published in recent years [16,17,18,19,20,21,22,23]. Table 1 provides a comprehensive summary of the comparative analysis of various works within this domain and this paper’s chapter structure.

We begin by examining the evolution of foundational algorithms and datasets, laying the groundwork for understanding advancements in computer vision technology and providing a basis for exploring specific algorithms. Building on this, we proceed with a concise analysis of traditional detection and tracking algorithms based on correlation filtering, focusing on their principles and performance. These algorithms enable long-term tracking under conditions like occlusion and scale changes, assuming manual initial target labeling. However, as technology has progressed, traditional methods have struggled to achieve high-precision detection under challenging conditions such as lighting variations, scale shifts, small or weak targets, and dense, similar targets.

To enhance robustness and performance in visual detection, our third step involves a comprehensive review of deep learning detection algorithms based on CNN. We conduct an in-depth analysis, covering essential datasets, the YOLO series, and practical industrial applications, with an emphasis on balancing performance and energy consumption. Yet, purely visual detection lacks organic integration with embedded intelligent agents, which is crucial for low-energy autonomous systems like composite robots that navigate unknown environments independently.

Next, we analyze SLAM algorithms, using the target feature detection achieved in steps two and three to facilitate localization and loop closure. This highlights the potential of deep learning algorithms in SLAM, supported by dataset analysis. Given the computational power and energy constraints of onboard chips, we also explore the relationship between performance and energy consumption in this context. Finally, we discuss strategies for optimizing models to enhance algorithmic energy efficiency. Techniques such as model pruning and quantization were explored as means to reduce computational load, thereby lowering power consumption by streamlining the algorithms and adapting them to run on low-power hardware.

1.2. Selection Criteria for Reviewed Articles

This work primarily focused on recent developments and foundational works in computer vision and SLAM algorithms spanning two decades (1999–2024). The selection criteria for reviewed articles were based on the following:

Relevance to energy efficiency.

Priority was given to studies that evaluated energy consumption, particularly those relevant to autonomous systems and industrial applications where energy efficiency was critical;

2.: Algorithmic advances.

This review included key milestones and breakthroughs in both traditional computer vision and deep learning-based methods, ensuring a comprehensive overview of technological progress;

3.: Application-specific case studies.

This paper incorporated research relevant to diverse industrial applications, especially those aligned with low-carbon and sustainable technology, to provide cross-domain insights into practical deployments;

4.: Citation frequency and impact.

Preference was given to influential and frequently cited studies to ensure that the review encompassed highly regarded and impactful research within the field.

1.3. An Evolutionary Review of Algorithms Related to Computer Vision

Traditional Computer Vision Algorithms.

In 2001, P. Viola and M. Jones introduced the VJ (Viola–Jones) detector [1]. In 2005, N. Dalal and B. Triggs proposed the HOG (Histogram of Oriented Gradients) Detector [2]. The DPM (Deformable Part-based Model), developed by P. Felzenszwalb [24] based on HOG, represented a significant milestone in traditional vision detection. It introduced key concepts such as mixture models, hard negative mining, and bounding box regression, which continue to exert a profound influence on contemporary computer vision algorithms. Building on the foundation of computer vision detection, Bolme et al. constructed a novel visual object tracking framework in 2010 using correlation filtering and proposed the MOSSE (Minimum Output Sum of Squared Error) algorithm [3]. MOSSE resolved an optimization problem by utilizing the target’s grayscale features and a Gaussian expected output function to train a discriminative correlation filter. In 2014, Henriques introduced the KCF (Kernelized Correlation Filters) tracking algorithm [4], providing a complete theoretical derivation of the cyclic properties and offering a method to integrate multi-channel features into the correlation filter framework. This method incorporated multidimensional HOG features into the algorithm, further enhancing its performance. In 2019, Xu et al. proposed the LADCF (Learning Adaptive Discriminative Correlation Filters) algorithm [5], addressing common issues such as boundary effects and temporal filter degradation in correlation filter-based algorithms. This tracking algorithm combined adaptive spatial feature selection and temporal consistency constraints, enabling effective representation in low-dimensional manifolds and improving tracking accuracy. In the same year, Huang et al. introduced the ARCF (Aberrance Repressed Correlation Filters) tracking algorithm [6], a long-term target tracking algorithm based on the BACF (Background-Aware Correlation Filters) algorithm. The ARCF incorporated aberration-repressed correlation filters in the detection module, restraining the variation rate of response maps generated during the detection phases to suppress aberrations, thereby achieving more stable and accurate target tracking. In 2020, Ju Z et al. proposed a new method for detecting shadow boundaries using an adaptive direction-tracking filter specifically designed for brain–machine interface (BMI) applications. This approach enhances the accuracy of shadow boundary detection, which is crucial for improving BMI systems’ performance in complex environments [25];

Deep Learning-based Computer Vision Algorithms;

The introduction of the CNN (Convolutional Neural Network) model by R. Girshick in 2014 [7] marked a significant milestone in the application of convolutional neural networks for object detection and tracking. Next, in 2015, S. Ren et al. proposed Faster R-CNN [8], which introduced the RPN (Region Proposal Network), further advancing the field. In the same year, R. Joseph et al. presented the YOLO model based on One-stage Detectors [9], significantly enhancing computational speed. R. Joseph subsequently made a series of improvements to the YOLO model, leading to the development of its v2 and v3 editions [26]. Currently, Alexey Bochkovskiy and others have updated the algorithm to YOLOv8, which has been applied in various scenarios [10]. In 2019, Gao F et al. proposed a novel algorithm for target recognition in SAR (Synthetic Aperture Radar) images using an improved deep convolutional neural network. This approach enhances the precision and reliability of SAR image analysis, addressing the challenges of target recognition in complex environments [27]. YOLOv10, proposed by Wang A et al. in 2024 [11], is the first object detection model to eliminate the need for NMS (Non-Maximum Suppression). Removing NMS and optimizing the model architecture significantly improves inference speed and efficiency. This model introduces a consistent dual assignment strategy, combining one-to-one and one-to-many assignment methods, optimizing both training and inference processes;

Visual SLAM Algorithms;

With the maturation of deep learning models, in 2017, Chen B. X. et al. proposed a pedestrian tracking system for unmanned vehicles based on binocular stereo cameras [28], addressing the issue of RGBD cameras being unsuitable for outdoor use. This type of camera calculates depth information through the disparity between two cameras. In 2021, Liu et al. [12] designed a deep learning-based robotic system closed-loop tracking framework, DeepSORT, capable of effectively implementing automatic detection and tracking of new vehicles in complex environments. In the same year, TEED Z. et al. proposed DROID-SLAM [13], which had a significant impact on visual SLAM. This method liberated unmanned vehicles from the constraints of expensive LiDAR by utilizing visual cameras, enabling not only object detection but also localization and navigation. The DROID-SLAM model, based on deep learning, can effectively extract features from the front end and compute optical flow fields to iteratively update the pose and depth of the current keyframe while performing BA (Bundle Adjustment) optimization in the backend, greatly enhancing the system’s robustness, generalization ability, and performance. A plethora of deep learning-based computer vision technologies applied to multi-vehicle unmanned driving and automated intelligent unmanned factories have effectively improved the efficiency and intelligence of various systems [29].

2. Review of Traditional Computer Vision Algorithms

This chapter offers an introduction to the VJ (Viola–Jones) detector model [1], which is one of the earlier algorithms to use a cascade of simple features for fast object detection. It is aimed at facilitating an understanding of the basic principles and modular composition of traditional computer vision algorithms. Subsequently, it introduces the composition of the ECO (Efficient Convolution Operators) detection and tracking algorithm and proposes a framework for an adaptive detection and tracking model that integrates ECO [30]. This framework is applied to the visual detection and tracking of targets such as roads, traffic lights, and pedestrians in the context of sensing vehicles, thereby preparing for the execution of various unmanned driving actions.

2.1. The Basic Theory of Computer Vision Detection

In the domain of computer vision-based target location and detection, the efficiency of employing effective image feature extraction methods is significantly higher than that of pixel-based calculation methods. Initially, the image undergoes an integration process as a preliminary step in feature extraction, as shown in Equation (1):

i i (x, y) = \sum_{x^{'} \leq x, y^{'} \leq y} i (x^{'}, y^{'}),

(1)

where

x, y

represent the position information;

i i

represents the integral image, and

i

represents the original image. The following recursive formula completes the integral processing of the initial image:

\begin{array}{l} s (x, y) = s (x, y - 1) + i (x, y) \\ i i (x, y) = i i (x - 1, y) + s (x, y) \end{array}

(2)

The utilization of rectangular features combined with integral images improves extraction efficiency and flexibility. A weak learning algorithm is employed to design single rectangular features that separate positive and negative examples, identifying the final threshold classification function to reduce the occurrence of misclassification. The weak classifier

h_{j} (x)

is designed as follows:

h_{j} (x) = \{\begin{cases} 1 & p_{j} f_{j} (x) < p_{j} θ \\ 0 & otherwise \end{cases}

(3)

Among them,

f_{i}

represents characteristics, and

θ

is a threshold.

Upon completion of the image integration process, a cascading classifier is assembled. The visual detection mechanism adopts the structure of a degenerate decision tree, commonly known as cascading. In this sequence, each subsequent classifier is activated only upon the successful passage of the previous stage; a “pass” from the first-level classifier prompts the activation of the second-level classifier, and this progression continues accordingly. Any negative classification at a given level halts further evaluation down the cascade. This operational principle is elucidated in Figure 4 [1].

The threshold for each stage needs to be defined according to the situations of different detection targets, the number of stages in the classifier, and the number of features corresponding to each stage. The aim is to balance the number of features and the number of stages in the classifier, thus minimizing the expected number of evaluated features.

2.2. The Basic Theory of Computer Vision Tracking

The ECO (Efficient Convolution Operators) algorithm, serving as a foundation for target detection and tracking, has exhibited outstanding performance on tracking datasets, including the VOT (Visual Object Tracking) and OTB (Object Tracking Benchmark). By executing convolution operations within the continuous domain, the algorithm facilitates the natural integration of multi-resolution feature mappings. This approach permits the flexible selection of features across various resolutions without the necessity for resampling, thereby enhancing the adaptability and efficiency of the tracking process. The result of the target detection by the algorithm is a continuous function, enabling sub-pixel level positioning and improving tracking accuracy. The ECO algorithm learns the continuous convolutional filter

f

. The mapping matrix

P

uses a set of sample feature maps

{x_{j}}_{1}^{M} = {x_{1}, …, x_{M}}

and corresponds with the desired response values

{y_{j}}_{1}^{M} = {y_{1}, …, y_{M}}

. The position of the target is predicted by the following convolution operation:

S_{f, P} \{x\} = \sum_{c = 1}^{C} f^{c} * (\sum_{d = 1}^{D} P_{d c} J_{d} {x^{d}})

(4)

In the proposed method, the interpolation operator

J_{d}

performs interpolation calculations on the

d

-th feature channel

x^{d}

of the extracted sample image

x

, transforming the feature map into a continuous domain. Next, a projection matrix

P

of size

D \times C

is utilized to map the feature map to a lower-dimensional space. The mapped samples are then convolved with a continuous convolutional filter

f

to obtain the response map.

The continuous convolutional filter

f

and the projection matrix

P

are jointly learned by minimizing an optimization loss function (5),

E (f) = \sum_{j = 1}^{M} α_{k} ‖S_{f, P} \{x_{j}\} - y_{j}‖ + {\sum_{c = 1}^{C} α_{k} ‖ω \times f^{c}‖}^{2} + λ {‖P‖}_{F}^{2}

(5)

where

α_{j}

denotes the weight for each training sample. The label function

y_{j}

is a two-dimensional Gaussian distribution function centered at the target position in the corresponding sample

x_{j}

. The second term is a spatial regularization term introduced to mitigate the boundary effect issues caused by circular convolution. The last term is the

F

-norm of the matrix

P

, serving as a regularization term with the regularization parameter

λ

.

By employing the Parseval’s theorem, the optimization problem (5) can be transformed into a minimization problem in the Fourier domain:

E (f) = \sum_{j = 1}^{M} α_{k} ‖{\hat{S}}_{f, P} \{x_{j}\} - {\hat{y}}_{j}‖ + {\sum_{c = 1}^{C} α_{k} ‖\hat{ω} \times {\hat{f}}^{c}‖}^{2} + λ {‖P‖}_{F}^{2}

(6)

where “

\hat{.}

” denotes the Fourier coefficients of the corresponding values. The resulting system of equations can be efficiently solved using this conjugate.

2.3. Adaptive Long-Term Tracking Framework Based on ECO-C

In 2023, Chen et al. [14] proposed a computer vision-based framework for long-term tracking that incorporated an innovative uncertainty estimation method. This method determined the necessity for model re-detection and updates in a timely fashion and assessed each frame to enable adaptive model updating. Such a method has been shown to enhance the algorithm’s performance and robustness, as evidenced by its superior results on public datasets. When applied to unmanned vehicles, this framework facilitated the effective identification and tracking of pedestrians, thereby empowering robotic or autonomous vehicular systems to execute critical functions, including obstacle avoidance and target tracking.

Within the framework of the ECO tracking algorithm, the conjugate gradient method is employed to solve for the filter parameters, which necessitates the prior determination of the sample energy as a known parameter. Assuming

X_{t}

is the discrete Fourier transform of the training sample in the

t

-th frame, the sample energy

E_{s}

is calculated by Equation (7):

E = ‖ X_{t} ‖^{2}

(7)

The sample energy for each frame is updated using the learning rate, as shown in Equation (8):

E_{t} = η ‖ X_{t} ‖^{2} + (1 - η) E_{t - 1}

(8)

This framework determines whether the model needs updating based on the proposed tracking confidence and peak-to-sidelobe ratio, as shown in Equation (9):

η = \{\begin{matrix} P & M_{P S R} \geq τ_{P S R}, t \geq τ_{t} \\ P / 2 & M_{P S R} < τ_{P S R}, t \geq τ_{t} o r M_{P S R} \geq τ_{P S R}, t < τ_{t} \\ 0 & M_{P S R} \leq τ_{P S R}, t \leq τ \end{matrix}

(9)

Here,

P

is the value of the sample learning rate

η

set during the initialization of the tracking algorithm, which is a constant.

τ_{P S R}

and

τ_{t}

are the thresholds for tracking uncertainty detection. As seen in the equation, when both the peak-to-sidelobe ratio and tracking confidence are above the thresholds, the target is considered clearly visible, and the model is updated normally. If either falls below the threshold, it indicates significant deformation or partial occlusion of the target, which is still visible. Setting the learning rate to zero in this case might lead to overfitting, so the sample learning rate is reduced. When both values are below the thresholds, it implies severe occlusion or disappearance of the target, and the sample learning rate

τ_{t}

is set to zero to prevent erroneous updates. The model structure, as shown in Figure 5 [14], includes three distinct modules: detection; redetection; and tracking.

This framework demonstrates its test accuracy and coverage rate on public datasets UAV20 compared to other visual tracking algorithms, namely, LADCF [5], ARCF [6], ECO-HC [30], ECO-C [14] SRDCF [31], DSST [32], KCF [4], SAMF [33], BACF [34], MCCT-H [35], STRCF [36], ADNet [37] CFNet [38], and MCPF [39], as shown in Figure 6 [14]. The localization accuracy reached 75%, an improvement of 8.4%, and the coverage rate was 62.6%, which was an increase of 8.3%.

Table 2 reflects the accuracy of nine high-performing algorithms on the benchmark UAV20 dataset across 12 challenging attributes. The comparative analysis of the algorithms in the table highlights a fundamental challenge in balancing energy efficiency and performance, particularly in applications like autonomous driving, where both aspects are critical. Autonomous vehicles, especially those relying on green energy, must optimize resource consumption to extend operational range without compromising real-time performance in tasks such as object detection and tracking.

For instance, algorithms like ECO-HC and CFNet, which exhibit higher average frame rates (41.2 FPS and 38.1 FPS, respectively), demonstrate greater energy efficiency. In the context of autonomous driving, this leads to reduced power consumption, allowing electric or hybrid vehicles to maximize their operational efficiency. These algorithms are particularly well-suited for scenarios where real-time performance is crucial, such as obstacle detection and pedestrian tracking in dynamic environments. However, the trade-off is that they may not offer the highest accuracy or robustness in complex scenarios, such as low-light or high-speed conditions, which are common in autonomous systems.

Conversely, algorithms such as ECO-C and LADCF, despite their lower FPS (36.4 FPS and 16.1 FPS, respectively), offer superior accuracy and robustness in tracking tasks. For autonomous vehicles, this precision is critical when navigating in highly dynamic or unpredictable environments, such as urban streets or rural areas with varying road conditions. However, the increased computational load of these algorithms may reduce energy efficiency.

To balance energy efficiency and performance, a hybrid approach could be explored in which high-efficiency algorithms like LADCF are used for tasks that require real-time response, while more robust algorithms like ECO-C are selectively employed for critical decision-making tasks where accuracy is paramount.

2.4. Discussion of the Application of Traditional Vision Detection and Tracking Algorithms in Autonomous Driving and Contribution to Promoting Green Energy Sustainability

Traditional detection and tracking algorithms, such as optical flow, HOG, and Kalman filters, have long served as foundational technologies in the development of autonomous driving systems. These algorithms rely on predefined patterns for feature extraction and object tracking, which makes them efficient for specific tasks in controlled environments. For example, the HOG algorithm has been used effectively in pedestrian detection, while Kalman filters were widely adopted for tracking object motion, allowing vehicles to predict the behavior of other road users. This combination has been successfully implemented in early autonomous systems to detect pedestrians or vehicles moving in traffic, contributing to real-time decision-making processes.

In real-world applications, these algorithms have performed reliably in structured environments like well-mapped urban streets, where challenges tended to be more predictable. However, their limitations become apparent in complex or rapidly changing scenarios, such as extreme weather conditions, low lighting, or occluded objects. Despite these constraints, traditional detection algorithms offer significant advantages in terms of computational efficiency. These algorithms require far fewer computational resources than modern deep learning-based methods, rendering them ideal for energy-constrained autonomous vehicles.

Although traditional algorithms may lack the adaptability and performance of modern approaches in dynamic environments, their energy-efficient nature provides a valuable contribution to sustainable autonomous driving systems. Integrating these algorithms with newer techniques may present a balanced approach, combining the low energy consumption of traditional methods with the adaptability and robustness of more advanced technologies.

3. Review of Deep Learning-Based Computer Vision

3.1. Computer Vision Datasets

The field of object detection has experienced significant advancements, attributable in part to the establishment of numerous benchmarks such as Caltech [40], KITTI [41], ImageNet [42], PASCAL VOC [43], VOC Challenges [44,45], visual recognition challenges [46,47], MS COCO [48], and Open Images V5 [49]. Furthermore, the organizers of the ECCV VisDrone 2018 competition introduced an innovative drone platform-based dataset [50], which includes a comprehensive collection of images and videos.

3.1.1. The Thorough Review of Multidimensional Typed Datasets

The selection and use of datasets are crucial in computer vision detection. Different types of datasets, such as 1D, 2D, 3D, 3D+ vision, and multi-modal sensing, provide rich information and diverse application scenarios. In this paper, we present a comprehensive review of these datasets, analyzing their characteristics and their applications in computer vision detection.

One-Dimensional Datasets.

One-dimensional datasets are primarily used to process time-series data, such as audio signals and sensor readings. While their application in computer vision is less common, 1D datasets still hold significant importance in specific scenarios, such as edge detection and signal processing;

2.: Two-Dimensional Dataset.

Two-dimensional datasets are the most common type of datasets used in computer vision research, primarily consisting of static images and video frames. Typical 2D datasets, such as COCO, Pascal VOC, and ImageNet, are widely used for tasks like object detection, image classification, and semantic segmentation;

3.: Three-Dimensional Datasets.

Three-dimensional datasets offer rich spatial information and are commonly used for tasks such as 3D reconstruction, pose estimation, and three-dimensional point cloud processing. Representative datasets like KITTI, ScanNet [51], and ModelNet [52] play a crucial role in fields such as autonomous driving and robotic navigation;

4.: Three-Dimensional+ Vision Datasets.

Three-dimensional+ vision datasets combine 3D spatial information with traditional 2D image data, offering a more comprehensive perception capability. For instance, the SUN RGB-D [53] and NYU Depth V2 datasets [54] merge RGB images with depth information, making them widely applicable in indoor scene understanding and object detection.

5.: Multimodal Sensing datasets.

Multi-modal sensing datasets combine data from various sensors, such as RGB images, depth information, LiDAR point clouds, and IMU data. These datasets provide richer environmental information, with representative examples including ApolloScape [55] and the KAIST Multi-Spectral Pedestrian Dataset [56], which are widely used in autonomous driving and intelligent surveillance applications.

By comparing different types of datasets, we can observe that each dataset offers unique advantages and challenges in specific application scenarios. One-dimensional datasets excel in processing time-series data, while 2D and 3D datasets play key roles in spatial information processing and object detection. Meanwhile, 3D+ vision and multi-modal sensing datasets provide more comprehensive perception capabilities, making them well-suited for complex environmental sensing and autonomous navigation tasks. Table 3 provides an overview of several representative datasets from each category.

3.1.2. Examples of Popular and Challenging Computer Vision Datasets and Their Value and Meaning

Table 4 provides a compilation of some of the popular datasets for these specific detection tasks.

Evaluating the performance of algorithms on these datasets requires a comprehensive approach that leverages the diverse characteristics and challenges each dataset presents. For example, the KITTI dataset, with its multi-modal data (e.g., images, LiDAR, GPS), is well-suited for assessing an algorithm’s ability to handle complex environments in tasks such as 3D object detection, localization, and tracking. V3DET, with its extensive object vocabulary, is ideal for evaluating object detection algorithms’ capacity to generalize across a wide range of categories. Meanwhile, the WiderFace and IJB datasets are particularly useful for testing facial recognition and detection models, especially in challenging scenarios involving occlusions, varied lighting, and complex backgrounds. Datasets like EventVOT and NWPU-VHR10 introduce specialized challenges, such as event-based data and remote sensing, respectively, which allow for testing algorithm robustness in dynamic environments and high-resolution satellite imagery tasks. By evaluating algorithms across these datasets, one can gain valuable insights into their performance in terms of accuracy, speed, scalability, and adaptability to real-world conditions, providing a holistic assessment of their effectiveness.

Concurrently, we have established our own datasets to meet the demands of in-depth research and practical applications. These datasets encompass various domains such as WMTIS (Weak Military Targets in Infrared Scenes), the appearance of medical industry MB (Medicine Boxes), EP (Express Packages), and FPP (Fully Automated Unmanned Factories and Product Detection and Tracking). Table 5 outlines the parameters and descriptions of these diverse datasets, while Figure 7 provides illustrative examples from each dataset. These datasets can significantly enhance the model’s learning ability and performance across multiple dimensions. For instance, in the MB dataset, by incorporating samples collected under varying lighting conditions, the model can more effectively learn the consistent features of pharmaceutical packaging, regardless of lighting variations. This approach reduces the impact of lighting differences on detection accuracy, leading to more robust performance.

3.2. Review of Deep Learning Computer Vision Based on Convolutional Neural Networks

Machines can now interpret and understand visual data in ways that were before impractical due to deep learning, which has completely changed the field of computer vision. Deep neural networks, especially CNNs, which are built to automatically and adaptively learn spatial hierarchies of features from vast volumes of image data, are at the core of this transformation. These networks are made up of several neuronal layers, each of which is trained to detect increasingly complex features, such as edges and textures in the initial layers and higher-level objects and patterns in the deeper layers. Training these networks involves using large datasets of labeled images and a backpropagation algorithm to adjust the weights of the neurons so that the network can accurately classify or recognize objects. Deep learning in computer vision encompasses principles that apply to tasks like object detection, semantic segmentation, and image generation. These tasks utilize deep neural networks to extract meaningful information from pixels and convert it into realistic images or actionable insights. Applications ranging from surveillance and augmented reality to medical imaging and autonomous cars have significantly advanced as a result.

A one-stage detector network is appropriate for relatively basic application situations requiring high real-time performance because of its rapid inference speed. YOLO, SSD [71], and other instances are among the most prevalent ones. Figure 8 shows the fundamental structure of this type of detector [22].

In this structure, the yellow part represents a series of convolutional layers in the backbone network with the same resolution, while the blue part is the RoI (Region of Interest) pooling layer, which generates feature maps for objects of the same size.

The second type is the two-stage detector network, exemplified by CNN and RCNN [7]. This type complements the one-stage detectors with its higher localization precision and recognition accuracy. However, it has longer inference times and demands greater computational power, making it suitable for complex applications requiring high accuracy. The two stages of two-stage detectors are separated by the RoI pooling layer. The first stage, known as the RPN (Region Proposal Network), proposes candidate object bounding boxes. In the second stage, the RoI Pooling operation extracts features from each candidate box for subsequent classification and bounding box regression tasks. Taking Fast-RCNN as an example, its basic structure is illustrated in Figure 9 [72].

The RCNN network is composed of four integral modules. The first module is dedicated to region proposal. The second module extracts fixed-length feature vectors from these proposed regions. The third module comprises a suite of linear support vector machines for specific categories. The final module is tasked with bounding box regression. The Fast R-CNN, as an advancement over its predecessor, commences by extracting features from the entire input image. Subsequently, it achieves scale-invariant feature representation through the RoI (Region of Interest) pooling layer. These processed features serve as the input for the subsequent fully connected layers responsible for classification and bounding box regression. This integrated approach to one-time classification and localization significantly enhances the inference speed, improving the overall efficiency of the model.

In recent years, a plethora of enhanced model algorithms has been applied to diverse public datasets. Through experimental evaluation, it has been noted that, despite not being tested under entirely identical sub-datasets, their performance generally exhibits an upward trend over time. Table 6 presents the results of various algorithms on the MS COCO dataset, with AP (Average Precision) serving as the primary evaluation metric. We have also employed key metrics such as power consumption (measured in watts) and performance (expressed as a percentage) to evaluate the algorithms. By comparing these algorithms using these metrics, we effectively assess the trade-offs between energy consumption and computational efficiency. This approach enables us to illustrate how each algorithm balances performance with energy usage. To determine energy efficiency more comprehensively, we adopt the Equation (10) to calculate the following:

Energy Efficiency = \frac{AP * FPS}{Energy Consumption * (\frac{Processing Time}{1000} * GPU Utilization) * Memory Usage}

(10)

Optimizing the energy efficiency of visual detection algorithms is crucial. Autonomous vehicles, especially electric ones, rely heavily on computational efficiency to extend operational range while maintaining real-time performance. Algorithms like YOLOv10, with its energy efficiency of 0.139, illustrate how cutting-edge models can achieve high detection accuracy without excessive power consumption. This balance is essential for reducing the energy load on electric vehicles, contributing to sustainable transportation systems.

As cities push for greener solutions, the ability to deploy energy-efficient algorithms in autonomous driving systems becomes a key factor in minimizing environmental impact. More resource-intensive models, such as Fast R-CNN and NAS-FPN, which consume more energy and have higher memory demands, present challenges in this regard. The focus on lightweight, efficient algorithms like YOLOv8 and YOLOv9 enables the integration of advanced technologies into green energy systems, supporting both real-time decision-making.

Figure 10 illustrates the performance of several mainstream algorithms, including DPM-v1/v5 [81], RCNN, SPPNet [82], Fast RCNN, Faster RCNN, SSD, RefineDet512+, YOLOv5, VIT-ASSD [83], YOLOv7 [84], Stack-YOLO [85], and YOLOv10 on the VOC dataset [21], showing a steady improvement in their performance.

3.3. Computer Vision Detection Applications Based on Convolutional Neural Networks

CNNs, one of the core models in deep learning, have been extensively applied to various computer vision tasks. Whether in object detection, image classification, or semantic segmentation, CNNs’ capability to handle complex visual data makes them an ideal solution. In this broad context, the YOLO series of models, known for their efficiency and real-time accuracy, have found widespread use in fields such as autonomous driving and security surveillance [86]. Similarly, the multi-scale attention mechanism proposed by Chen et al. [87] has significantly improved the classification accuracy of remote sensing images, highlighting CNNs’ strong adaptability to complex environments. The potential applications of CNNs extend beyond these fields to include biometric recognition and intelligent waste management. In the domain of biometric recognition, Chhabra M et al. proposed a method for automated latent fingerprint detection and segmentation based on deep convolutional neural networks [88]. This model addresses common challenges such as low-quality and partial fingerprints. By leveraging a CNN architecture, the model automatically extracts key features from complex fingerprint data, thereby improving the accuracy of fingerprint detection and segmentation. Experimental results indicate that this model offers significant advantages over traditional methods, especially in forensic applications, by reducing manual intervention and enhancing efficiency. In the domain of biometric recognition, Chhabra M et al. proposed a method for automated latent fingerprint detection and segmentation based on deep convolutional neural networks [89]. This model addresses common challenges such as low-quality and partial fingerprints. By leveraging a CNN architecture, the model automatically extracts key features from complex fingerprint data, thereby improving the accuracy of fingerprint detection and segmentation. Experimental results indicate that this model offers significant advantages over traditional methods, especially in forensic applications, by reducing manual intervention and enhancing efficiency. In 2020, Vora and Shrestha proposed a method for detecting diabetic retinopathy using an embedded computer vision system [90]. Their approach leverages image processing techniques on embedded devices to identify symptoms of diabetic retinopathy in retinal images, aiming to provide a low-cost, accessible solution for early diagnosis. This system is designed to be implemented on portable devices, enhancing its applicability in remote and underserved areas where access to traditional screening methods may be limited.

With the rapid development of the medical and pharmaceutical industries, CNN-based visual detection has become an integral part of industrial manufacturing. The medical and pharmaceutical industries demand exceptionally high accuracy, typically exceeding 99.95%, due to the critical nature of their applications. This stringent requirement means that algorithms must achieve an extremely high level of precision, as even minor inaccuracies can have significant consequences. As a result, algorithmic precision is paramount, necessitating meticulous optimization and rigorous validation to ensure that the detection and classification processes meet these high standards. Building on the general single-stage detection model YOLOv8, we have developed a new model—Self-Attention and Concatenation-Based Detector (ACDet)—specifically designed for grayscale images from industrial cameras. This model is resilient to challenges such as varying illumination and densely packed objects. ACDet primarily helps verify the type and quantity of pharmaceutical packaging to ensure absolute safety in the pharmaceutical process. Currently, this model is widely deployed in industrial applications, generating significant economic value. Moreover, as a general model with strong generalization capabilities, ACDet can be adapted for other industrial scenarios through parameter adjustments. Section 3.3.1 provides a comprehensive introduction to this model.

3.3.1. ACDet: A Vector Detection Model for Drug Packaging Based on Convolutional Neural Network

ACDet is an algorithmic solution proposed by the authors for computer vision detection in the medical industry, integrating the algorithms mentioned previously within the YOLOv8 framework. The medical sector, with its rapid automation advancements, exhibits a substantial demand for visual detection technologies. The medical industry encounters numerous challenges in drug detection, attributable to the complexity of pharmaceuticals, the diversity of packaging materials, and the variety of formats, as exemplified by the EP dataset introduced in Section 3.1. These challenges encompass issues such as uneven lighting and the necessity for high response speeds. To address these problems, we devised a universal lightweight vector detection model. By optimizing the multi-computation module C2F-A, this model amplifies attention across multiple dimensions of gradient flow outputs, thereby enabling efficient and rapid classification of various drugs by improving the sensing ability of the network. The architecture of this model is illustrated in Figure 11.

Upon testing, this model achieved a mAP of over 81% on the EP dataset. Under identical testing conditions, its performance surpassed that of YOLOv5 to 10 versions by Smooth mAP, improving by 4.66% to 13.85% and mAP increasing by 6.31% to 19.41%, as demonstrated in Figure 12 and Table 7. This vision system has been extensively deployed, generating substantial market value, as depicted in Figure 12. In practical applications, the system’s accuracy can exceed 99.9%. Figure 13 illustrates the specific application scenarios.

3.4. Exploration and Future Trends in Deep Learning-Based Computer Vision Algorithms

The development of deep learning algorithms is intrinsically linked to the computational capabilities of hardware. In recent years, the rapid advancement of high-performance GPUs (Graphics Processing Units) has accelerated this process, propelling the revolution in artificial intelligence. Notably, the recent introduction of the Blackwell architecture GPU, featuring 208 billion transistors and employing a custom, dual-reticle TSMC 4NP process, has been a significant milestone. The interconnect speed between two smaller chips reaches up to 10 TBps, which not only elevates the computational power to 20 petaflops (FP4 precision) but also reduces energy consumption to one-twenty-fifth of its previous level. We believe that this technological breakthrough will have a profound impact on several aspects of computer vision, shaping its future direction and trends.

Enhanced Performance and Real-time Processing

High-performance hardware, such as the Blackwell architecture, equipped with parallel processing capabilities and high memory bandwidth, has dramatically accelerated the performance of computer vision algorithms. Its ability to handle complex matrix operations and parallel computations efficiently has enabled real-time processing of high-resolution images and videos. This is crucial for applications like autonomous vehicles, where split-second decisions based on visual data can be lifesaving. Furthermore, the enhanced performance of these latest cards has facilitated the development of more sophisticated and accurate computer vision models, pushing the boundaries of what is achievable in object detection, image recognition, and 3D reconstruction;

2.: Energy Efficiency and Sustainability

The energy efficiency of high-performance hardware represents another critical factor influencing the future trajectory of computer vision. As AI models grow in complexity, the energy consumption associated with their training and inference processes has emerged as a pressing concern. Blackwell cards, designed with energy-efficient architectures, effectively reduce power consumption without compromising performance. This advancement not only facilitates the sustainable deployment of computer vision models but also extends their applicability to mobile and embedded systems where power availability is constrained;

3.: Democratization of AI Research

High-performance hardware has contributed to the democratization of AI research by making high-performance computing more accessible to a broader range of researchers and developers. The affordability and availability of these cards have lowered the entry barrier for individuals and small organizations to experiment with and develop computer vision models. This democratization fosters innovation and diversity in this field, as more people from different backgrounds can contribute to the advancement of computer vision technology.

Simultaneously, we posit that an infinite increase in data volume results in the exponential growth of computing nodes, and addressing this issue solely through hardware advancements is not a sustainable solution. Current deep learning recognition algorithms exhibit an excessive dependency on training datasets. Consequently, exploring environmentally friendly, low-carbon models equipped with forgetting screening mechanisms constitutes one of the critical development directions in this field.

3.5. Discussion on the Application of Deep Learning-Based Vision Detection Algorithms in Autonomous Driving and Contribution to Promoting Green Energy Sustainability

Deep learning-based algorithms are distinguished by their ability to automatically learn relevant features from large datasets, making them highly adaptable to complex and dynamic environments. Unlike traditional algorithms, which depend on predefined features, deep learning models excel in handling diverse real-world scenarios, including variations in lighting, weather, and object occlusions.

The YOLO series of algorithms, including YOLOv3 and YOLOv4, have been widely adopted in autonomous driving for real-time object detection. YOLO’s single-stage detection framework enables extremely fast inference speeds, which is crucial for real-time decision-making in driverless vehicles. This capability is particularly valuable for detecting pedestrians, vehicles, and obstacles in various driving environments. In contrast, Faster R-CNN, with its two-stage detection approach, offers high accuracy in complex object detection tasks, making it effective for identifying smaller or partially occluded objects, which are frequently encountered in unpredictable driving conditions.

Although deep learning algorithms offer superior detection accuracy and adaptability, they present a trade-off in terms of computational complexity and energy consumption. These models typically require substantial processing power, often necessitating high-performance GPUs, which can lead to increased energy usage—a challenge for energy-efficient autonomous systems. However, recent advancements in model optimization techniques, such as pruning, quantization, and model distillation, have been developed to mitigate the computational demands of these models without significantly compromising their performance.

The energy efficiency of deep learning-based algorithms can be further enhanced through the integration of specialized hardware, such as Edge AI devices. These devices are designed to execute deep learning models at lower power levels, making them suitable for deployment in autonomous vehicles. By utilizing Edge AI and optimized deep learning models, it is possible to achieve a balance between high-performance object detection and energy efficiency. While deep learning algorithms significantly improve the accuracy and reliability of object detection in autonomous driving, their high computational demands must be addressed to ensure energy-efficient and sustainable solutions.

4. Review of Visual Simultaneous Localization and Mapping (SLAM) Algorithms

Visual SLAM (Simultaneous Localization and Mapping) plays a pivotal role in the realm of sensing driverless robotic vehicles. By leveraging computer vision techniques, these autonomous vehicles can dynamically construct a map of their surroundings while simultaneously determining their position within that map. This capability is crucial for navigating complex environments, avoiding obstacles, and ensuring efficient route planning. In 2020, Morar A. et al. provided an extensive review of computer vision techniques used for indoor localization. It examines various methods, such as feature matching, deep learning, and hybrid approaches that combine visual data with other sensor inputs, like Wi-Fi or IMUs. The survey evaluates these techniques based on accuracy, computational efficiency, and applicability to different indoor environments. It highlights the advantages and limitations of each method, providing insights into how computer vision can be effectively used for precise indoor positioning in areas like smart buildings, robotics, and augmented reality applications [91].

4.1. Visual SLAM Datasets

In recent years, the combination of deep learning and computer vision has achieved remarkable results in many tasks. Vision-based SLAM systems, grounded in computer vision, provide a broad development space for the application of neural networks in this field. Similar to object detection algorithms based on deep learning, the first step for SLAM systems based on deep learning is also the establishment of datasets. Table 8 shows some of the common public datasets in recent years.

These datasets offer diverse environments and conditions for evaluating the performance of models across different computer vision tasks, particularly in SLAM, visual–inertial odometry, and robotic navigation. TUM RGBD and EUROC provide RGB-D and stereo–inertial data for indoor mapping, which help assess models’ ability to handle varying lighting and cluttered indoor environments. TUM VI and OpenLORIS offer visual-inertial datasets that challenge algorithms with real-world robotic scenarios, including changes in motion speed and environment complexity. Brno Urban focuses on outdoor urban environments, providing a real-world testbed for models handling traffic scenarios and large-scale mapping. TUM-VIE and TartanAir test algorithms in a mix of real-world and simulated environments, with TartanAir offering highly challenging simulated conditions such as fog or rain, which is ideal for robust navigation evaluations. SubT-MRS targets subterranean and rescue environments, pushing models to handle extreme conditions, including darkness and irregular terrain. Together, these datasets help measure robustness, accuracy, and adaptability across diverse real-world scenarios.

4.2. Review of the SLAM Algorithms

4.2.1. The Basic Principles of SLAM Algorithms

Visual SLAM digitizes real-world scenes by projecting 3D spatial points onto 2D pixel coordinates in the camera coordinate system, primarily utilizing the pinhole camera model. After the 3D space is projected onto the normalized image plane, distortion correction becomes necessary. Once processed, these data are fed into the visual front-end for VO (Visual Odometry) processing. The primary function of VO is to estimate the camera’s motion roughly based on a series of adjacent image information and then provide this coarse information to the back end. Traditional visual odometry methods are mainly categorized into feature-based and direct methods, with feature-based visual odometry being the most widely utilized and developed. This process involves extracting feature points from each image, finding descriptors for feature matching, and then estimating different camera poses to obtain the corresponding visual odometry [100], as illustrated in Figure 14:

The image information captured by different cameras varies, and so do the methods used for their pose estimation. For a monocular camera that obtains 2D pixel coordinates, the epipolar geometry method is employed. For stereo or depth cameras that can acquire 3D pixel coordinates, the ICP (Iterative Closest Point) method is used. If both 3D pixel coordinates and 2D camera projection coordinates are available, a combination of the two pieces of information can be utilized, employing the PnP (Perspective-n-Point) [101] method.

Then, the front end obtains the camera’s motion and observation equation, as shown in Equation (11):

\{\begin{array}{l} x_{k} = f (x_{k - 1}, u_{k}) + w_{k} \\ z_{k, j} = h (y_{j}, x_{k}) + v_{k, j} \end{array} k = 1, \dots, N, j = 1, \dots, M

(11)

Back-end optimization is conducted to eliminate the accumulated errors and uncertainties caused by noise. This is primarily achieved through graph optimization, where the objective function can be solved using methods such as Gauss–Newton or Levenberg–Marquardt to obtain an optimized solution, resulting in better camera poses. Finally, SE3 (the Special Euclidean group in three dimensions) [102] calculations are used for loop closure detection, determining the camera’s pose in 3D space. The calculation of SE3 involves determining both the rotational (R) and translational (t) components, which, together, form a 4 × 4 transformation matrix used for 3D transformations in tasks like visual odometry, SLAM, and motion estimation. By comparing the similarity between the previous frame and the current frame, it can be determined whether the detected trajectory forms a loop. Integrating visual SLAM and tracking algorithms with sensing multi-driverless vehicles can result in a system with autonomous intelligence, as illustrated in Figure 15.

Utilizing an RGB-D camera as the sensor, the surrounding environmental information is perceived, while encoders and gyroscopes sense the internal operating information of the mobile robot. In the front end, based on the RGB and depth images obtained by the RGB-D camera, optical flow values are calculated and predicted. A dynamic target mask is generated by fusing the difference in depth values, thereby eliminating feature points on dynamic targets. Subsequently, based on the remaining static feature points, collinear and coplanar relationships are identified to extend static lines and planes. In the back-end optimization phase, the pose is optimized by minimizing the residuals of static point features and static line features, thereby improving localization accuracy and enhancing navigation efficiency.

4.2.2. Performance Comparison and Energy Balance Discussion of Mainstream SLAM Algorithms

Benefiting from the establishment of the aforementioned datasets, a significant amount of research has been dedicated to optimizing the odometry of visual SLAM using deep learning methods, which are mainly categorized into supervised and unsupervised learning-based visual odometry approaches. Wang et al. [103] introduced the first end-to-end monocular visual odometry method based on deep recurrent convolutional neural networks, which directly learn from image sequences to achieve more accurate and stable visual odometry estimation. Xiao et al. [104] utilized a convolutional neural network to construct an SSD object detector combined with prior knowledge, enabling semantic-level detection of dynamic objects in a new detection thread. By employing a selective tracking algorithm to handle the feature points of dynamic objects, the pose estimation error caused by incorrect matching is significantly reduced. Duan et al. [105] proposed a keyframe retrieval method based on deep feature matching, which treats the local navigation map as an image and the keyframes as key points of the map image. Convolutional neural networks are used to extract descriptors of keyframes for loop closure detection. DROID-SLAM, proposed in 2021, is a novel SLAM system based on deep learning. Its front-end part performs feature extraction, calculates the optical flow field and three keyframes with the highest degree of co-visibility based on the optical flow field, and then iteratively updates the pose and depth of the current keyframe based on the co-visibility relationship. The back-end part optimizes all keyframes using BA (Bundle Adjustment). Each time an update iteration is performed, a frame graph is reconstructed for all keyframes. Compared to previous approaches, the robustness, generalization ability, and success rate have been greatly improved, achieving end-to-end visual SLAM using deep learning methods.

Table 9 presents a performance comparison of several mainstream SLAM algorithms across six different scenarios within the TUM-VI dataset. In this analysis, we also took into account various factors such as performance and energy consumption. Given that we use the Avg RMSE ATE (Root Mean Square Error of Absolute Trajectory Error) as the performance metric, where a smaller value indicates better performance, we have optimized the Energy Efficiency calculation as follows:

Energy Efficiency = \frac{e^{- Avg RMSE ATE} * FPS * 100}{Energy Consumption * (\frac{Processing Time}{1000} * GPU Utilization) * Memory Usage}

(12)

The comparison of various SLAM algorithms shows clear trade-offs between performance and energy consumption. DROID-SLAM, a deep learning-based algorithm, exhibits strong performance with an Avg RMSE ATE of 0.033, making it one of the best-performing algorithms in terms of trajectory accuracy. However, this performance comes at the cost of higher energy consumption, with 240 watts and high memory usage. Such algorithms are more suited to environments with substantial computational power, such as electric autonomous vehicles equipped with advanced onboard GPUs, where energy consumption can be offset by green energy sources and high computational capabilities.

In contrast, algorithms like ORB-SLAM and SVO offer a more balanced approach, achieving moderate performance with significantly lower energy consumption. For instance, ORB-SLAM achieves an energy efficiency of 0.424, making it a practical choice for ground-based robots or drones, where onboard computational power is limited, and power efficiency is critical. These algorithms, with lower energy demands, are well-suited for platforms where sustainable energy usage is prioritized, and continuous operation is needed with minimal energy wastage.

DROID-SLAM’s high computational requirements may make it ideal for autonomous electric vehicles, but for applications like drones or ground robots, where battery life and energy consumption are critical, lighter algorithms like ORB-SLAM or SVO can offer better overall system performance.

4.3. Exploration and Future Trends in SLAM Algorithms

SLAM algorithms faces several challenges, including robustness to varying lighting conditions, dynamic environments, feature extraction and matching, computational efficiency, and scalability. Future research trends include the integration of deep learning techniques for improved feature extraction and object recognition, the development of more efficient algorithms for real-time processing, the fusion of multiple sensors for enhanced accuracy and robustness, and the exploration of semantic SLAM for a deeper understanding of the environment. However, current deep SLAM methods still have the following shortcomings:

Data volume and labeling: Deep learning necessitates large-scale data and accurate labeling, yet acquiring large-scale SLAM datasets poses a significant challenge;
Low real-time performance: Visual SLAM often operates under real-time constraints, and even input from low-frame-rate, low-resolution cameras can generate a substantial amount of data, requiring efficient processing and inference algorithms;
Generalization ability: A critical consideration is whether the model can accurately locate and construct maps in new environments or unseen scenes. Future advancements in deep SLAM methods are expected to increasingly emulate human perception and cognitive patterns, making strides in high-level map construction, human-like perception and localization, active SLAM methods, integration with task requirements, and storage and retrieval of memory. These developments will aid robots in achieving diverse tasks and self-navigation capabilities. The end-to-end training mode and information processing approach, which align with the human cognitive process, hold significant potential.

4.4. Visual Framework for Unmanned Factory Applications with Multi-Driverless Robotic Vehicles and UAVs

Based on the methods summarized in this article, we have discussed and proposed a complete vision system for complex, intelligent factory environments. This system can be deployed on multiple robots, drones, or vehicles and can perform a series of tasks such as path planning, automatic navigation, intelligent obstacle avoidance, cargo grabbing, human following, emergency rescue, and more. It includes a computer vision detection module based on deep learning, comprising a composite detection and recognition module (including SVM classifiers and ACDet), a vision tracking module, a vision fusion laser SLAM positioning module, and a vehicle chassis drive control module, as shown in Figure 16. This can provide a reference for the application of computer vision technology in the industrial field.

4.5. Discussion of the Application of SLAM Algorithms in Autonomous Driving and Contribution to Promoting Green Energy Sustainability

SLAM algorithms, such as ORB-SLAM, DSO (Direct Sparse Odometry), and LSD-SLAM, are integral to autonomous driving and robotic systems, enabling real-time mapping and navigation in dynamic environments. These algorithms allow vehicles to build a map of their surroundings while simultaneously tracking their position within it, a crucial feature for navigation in unknown or changing environments. SLAM algorithms rely on visual data from cameras, as well as data from other sensors, such as IMU, to create accurate representations of the environment, enabling safe and efficient path planning for autonomous vehicles.

For example, ORB-SLAM3 integrates both visual and inertial data to provide robust performance in various conditions, including environments with low texture or varying lighting. Its ability to handle both monocular and stereo input makes it versatile for different sensor setups in autonomous vehicles. In real-world applications, ORB-SLAM is used to navigate urban environments, detect and avoid obstacles, and create detailed 3D maps for long-term navigation. Similarly, DSO focuses on minimizing photometric errors and is highly effective in low-light environments, making it a strong candidate for applications where environmental conditions fluctuate.

Although SLAM algorithms provide substantial advantages in mapping and localization accuracy, one of the primary challenges in their practical implementation is their computational complexity and energy consumption. SLAM requires real-time processing of large volumes of visual data, which can heavily tax onboard computational resources, resulting in increased energy usage. This issue is especially critical in the context of green energy driverless vehicles, where energy efficiency is essential for maximizing operational range on limited power sources.

To overcome these challenges, recent advancements in SLAM have concentrated on computational optimization and hardware acceleration. Techniques such as keyframe reduction, feature selection, and sparse mapping have been introduced to reduce the computational load, enabling SLAM algorithms to operate more efficiently in energy-constrained environments. Furthermore, specialized hardware, including low-power GPUs and FPGAs (Field Programmable Gate Arrays), has been employed to offload computational tasks from the main processors, enhancing energy efficiency without compromising performance.

4.6. Discussion on the Role of Algorithm Optimization in Improving Energy Efficiency

In the context of intelligent robots within unmanned factories, it is evident that the applications of computer vision and SLAM algorithms are extensive. However, choosing the right algorithm or improving hardware performance alone is not enough to achieve maximum energy efficiency. Instead, it is essential to fundamentally optimize the algorithms themselves to enhance energy utilization. This approach not only improves efficiency but also aligns with green, low-carbon, and environmentally sustainable goals. Algorithm optimization enables us to reduce energy consumption at its core, supporting a more eco-friendly and sustainable deployment of these technologies [15,111].

To enhance energy efficiency in the UAV’s embedded vision system, in 2019, Lygouras E. et al. utilized lightweight, unsupervised learning algorithms that minimized computational complexity, allowing for faster processing with less power consumption [112]. They also implemented efficient data processing techniques that prioritized essential image analysis tasks, reducing unnecessary computations and conserving energy. Furthermore, this system leverages low-power embedded hardware specifically optimized for real-time visual processing, which enables prolonged operation in energy-constrained environments. These strategies collectively support the UAV’s ability to detect humans autonomously during search and rescue missions while maintaining low energy usage, critical for extended flight times and operational effectiveness in remote areas. In 2022, A. Douklias et al. created an airborne computing platform for UAVs specifically designed to support computer vision and machine learning tasks [113]. To optimize energy efficiency, the UAV-based computing platform utilizes low-power processors that are specifically chosen to balance computational capability with minimal energy consumption. These processors are tailored for embedded systems, allowing for continuous operation without excessive power draw. Additionally, this platform employs streamlined algorithms that reduce computational load by focusing on essential tasks, eliminating unnecessary data processing, and enabling faster execution times. By minimizing energy-intensive operations and optimizing the sequence of data processing tasks, this system conserves power, which is crucial for extending flight duration and maintaining performance during complex computer vision and machine learning applications in airborne environments. In 2023, Ortega L. D. et al. presented a low-cost, computer-vision-based embedded system for UAVs, focusing on enhancing energy efficiency. They optimized the system through model pruning, which removed less critical neurons and layers, thus reducing the model’s size and computational load. This decrease in complexity lowers power consumption, which is essential for embedded UAV applications. Additionally, they implemented quantization by reducing the precision of weights and activations, which minimized memory usage and computational requirements. Together, these techniques allow the model to operate efficiently on low-power hardware, balancing energy savings with computational performance to support longer UAV missions on limited power resources [114]. Marroquín A. et al. proposed a navigation system that enabled mobile robots to navigate autonomously using computer vision on embedded systems in 2023 [115]. This approach emphasizes real-time image processing and object recognition, allowing robots to identify obstacles, track paths, and make navigation decisions without relying on external infrastructure. By utilizing embedded vision, this system aims to reduce reliance on computational resources, making it suitable for lightweight, cost-effective robots in dynamic environments. This method is particularly useful in applications where real-time responsiveness and energy efficiency are critical. In 2023, Nuño-Maganda M. A. et al. optimized the embedded vision model for monitoring and sorting citrus fruits by implementing model compression techniques such as pruning and quantization, which reduced computational demands [116]. Pruning eliminates unnecessary neural connections within the model, making it smaller and faster, thus reducing the amount of processing power needed. Quantization lowers the precision of model parameters, converting them from 32-bit to lower-bit representations, which saves memory and speeds up computations on embedded hardware. Additionally, this system uses real-time optimization algorithms that prioritize essential image-processing tasks, ensuring that the model runs efficiently within the constraints of embedded devices. This tailored approach enables accurate fruit sorting without requiring extensive computing resources, making it both cost-effective and scalable for large-scale agricultural operations.

Through the analysis and discussion above, it is evident that algorithm optimization plays a significant role in improving energy efficiency by reducing the computational requirements and power consumption of models. Techniques such as model pruning and quantization help streamline the model’s operations, leading to fewer calculations and lower energy demand. This is particularly crucial in autonomous systems and industrial applications where reducing energy consumption can lead to cost savings and sustainability benefits. Optimized algorithms not only enhance the energy utilization rate but also contribute to the longevity and efficiency of the hardware on which they operate, aligning well with the goals of low-carbon, energy-efficient technologies.

5. Conclusions

With the rapid advancement of computer hardware, the potential for significant development in the field of computer vision is evident, as demonstrated in this paper. Computer vision serves as a foundational element in the realm of intelligent systems and AI technologies. It can effectively replace traditional sensors as the primary source of data input, acting as a pair of intelligent “eyes” that can more adeptly process data sources to drive a range of unmanned operations, such as autonomous driving, obstacle avoidance, and detection tracking. The potential ceiling for this field is exceptionally high.

This paper also explores the future development trajectory of the field in depth. Current mainstream deep learning algorithms exhibit a strong dependence on datasets and hardware, while there is a growing demand for high-precision, real-time performance in modern systems. One of the key challenges moving forward is developing models that are both “sustainable” and “energy-efficient,” incorporating mechanisms such as selective forgetting to support low-carbon, environmentally friendly solutions.

Author Contributions

Writing—original draft preparation, L.C. (Lu Chen), J.T., Y.L., and J.P.; writing—review and editing, W.X., D.G., and L.C. (Lizhu Chen).; supervision, W.S. and G.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by a project on the Research and development of a freight robot based on multi-dimensional information behavior detection technology from the UESTC-Houngfuh Joint Laboratory of Smart Logistics, Project No.211282; the Research on the Construction and Application of an Intelligent Outpatient Medical Record Integration Platform, Project No. 24SYSX0210.

Acknowledgments

The authors would like to express their gratitude to the Houngfuh Group and the UESTC-Houngfuh Joint Laboratory of Smart Logistics for generously providing the experimental testing platform. Additionally, we extend our heartfelt thanks to Chenhan Liu, Yongchen Liu, Bo Di, and Jingrong Zhong for their invaluable suggestions and guidance throughout the preparation of this article.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

SLAM	Simultaneous Localization and Mapping
HOG	Histogram of Oriented Gradients
DPM	Deformable Part-based Model
KITTI	Karlsruhe Institute of Technology and Toyota Technological Institute
MOSSE	Minimum Output Sum of Squared Error
KCF	Kernelized Correlation Filters
LADCF	Learning Adaptive Discriminative Correlation Filters
ARCF	Aberrance Repressed Correlation Filters
BACF	Background-Aware Correlation Filters
CNN	Convolutional Neural Network
RPN	Region Proposal Network
NMS	Non-Maximum Suppression
BA	Bundle Adjustment
VJ	Viola–Jones
ECO	Efficient Convolution Operators
FPS	Frames Per Second
VOT	The Visual Object Tracking
VOC	Visual Object Classes
COCO	Common Objects in Context
OTB	Object Tracking Benchmark
WMTIS	Weak Military Targets in Infrared Scenes
MB	Medicine Boxes
XEEEP	Express Packages
FPP	Fully Automated Unmanned Factories and Product Detection and Tracking
CNNs	Convolutional Neural Networks
RoI	Region of Interest
AP	Average Precision
GPUs	Graphics Processing Units
ADAS	Advanced Driver-assistance Systems
VO	Visual Odometry
PnP	Perspective-n-Point
BA	Bundle Adjustment
LSD	Large-scale Direct
ORB	Oriented FAST and Rotated BRIEF
DSO	Direct Sparse Odometry
FPGA	Field Programmable Gate Arrays
FM	Fast Movement
SV	Scale Variation
FO	Full Occlusion
PO	Partial Occlusion
OV	Out-of-View
IV	Illuminance Variation
LR	Low Resolution
RMSE ATE	Root Mean Square Error of absolute trajectory Error
SE3	Special Euclidean Group in Three Dimensions
ACDet	Self-Attention and Concatenation-Based Detector
SAR	Synthetic Aperture Radar

References

Viola, P.; Jones, M. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, Kauai, HI, USA, 8–14 December 2001; IEEE: Piscataway, NJ, USA, 2001; Volume 1, p. I. [Google Scholar]
Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; IEEE: Piscataway, NJ, USA, 2005; Volume 1, pp. 886–893. [Google Scholar]
Bolme, D.S.; Beveridge, J.R.; Draper, B.A.; Lui, Y.M. Visual object tracking using adaptive correlation filters. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 2544–2550. [Google Scholar]
Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 583–596. [Google Scholar] [CrossRef] [PubMed]
Xu, T.; Feng, Z.H.; Wu, X.J.; Kittler, J. Learning Adaptive Discriminative Correlation Filters via Temporal Consistency Preserving Spatial Feature Selection for Robust Visual Object Tracking. IEEE Trans. Image Process. 2019, 28, 5596–5609. [Google Scholar] [CrossRef] [PubMed]
Huang, Z.; Fu, C.; Li, Y.; Lin, F.; Lu, P. Learning Aberrance Repressed Correlation Filters for Real-Time UAV Tracking. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 2891–2900. [Google Scholar]
Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 580–587. [Google Scholar]
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef]
Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 779–788. [Google Scholar]
Aboah, A.; Wang, B.; Bagci, U.; Adu-Gyamfi, Y. Real-time multi-class helmet violation detection using few-shot data sampling technique and yolov8. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 5350–5358. [Google Scholar]
Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. Yolov10: Real-time end-to-end object detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
Liu, X.; Zhang, Z. A Vision-Based Target Detection, Tracking, and Positioning Algorithm for Unmanned Aerial Vehicle. Wirel. Commun. Mob. Comput. 2021, 2021, 5565589. [Google Scholar] [CrossRef]
Teed, Z.; Deng, J. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras. Adv. Neural Inf. Process. Syst. 2021, 34, 16558–16569. [Google Scholar]
Chen, L.; Li, G.; Zhao, K.; Zhang, G.; Zhu, X. A Perceptually Adaptive Long-Term Tracking Method for the Complete Occlusion and Disappearance of a Target. Cogn. Comput. 2023, 15, 2120–2131. [Google Scholar] [CrossRef]
He, J.; Li, M.; Wang, Y.; Wang, H. OVD-SLAM: An online visual SLAM for dynamic environments. IEEE Sens. J. 2023, 23, 13210–13219. [Google Scholar] [CrossRef]
Liu, L.; Ouyang, W.; Wang, X.; Fieguth, P.; Chen, J.; Liu, X.; Pietikäinen, M. Deep learning for generic object detection: A survey. Int. J. Comput. Vis. 2020, 128, 261–318. [Google Scholar] [CrossRef]
Agarwal, S.; Terrail, J.O.D.; Jurie, F. Recent advances in object detection in the age of deep convolutional neural networks. arXiv 2018, arXiv:1809.03193. [Google Scholar]
Andreopoulos, A.; Tsotsos, J.K. 50 years of object recognition: Directions forward. Comput. Vis. Image Underst. 2013, 117, 827–891. [Google Scholar] [CrossRef]
Huang, J.; Rathod, V.; Sun, C.; Zhu, M.; Korattikara, A.; Fathi, A. Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 7310–7311. [Google Scholar]
Grauman, K.; Leibe, B. Visual Object Recognition (Synthesis Lectures on Artificial Intelligence and Machine Learning); Morgan & Claypool Publishers: San Rafael, CA, USA, 2011; p. 3. [Google Scholar]
Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
Jiao, L.; Zhang, F.; Liu, F.; Yang, S.; Li, L.; Feng, Z.; Qu, R. A survey of deep learning-based object detection. IEEE Access 2019, 7, 128837–128868. [Google Scholar] [CrossRef]
Liu, Y.; Dai, Q. A survey of computer vision applied in aerial robotic vehicles. In Proceedings of the 2010 International Conference on Optics, Photonics and Energy Engineering (OPEE), Wuhan, China, 10–11 May 2010; IEEE: Piscataway, NJ, USA, 2010; Volume 1, pp. 277–280. [Google Scholar]
Felzenszwalb, P.; McAllester, D.; Ramanan, D. A discriminatively trained, multiscale, deformable part model. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; IEEE: Piscataway, NJ, USA, 2008; pp. 1–8. [Google Scholar]
Ju, Z.; Gun, L.; Hussain, A.; Mahmud, M.; Ieracitano, C. A novel approach to shadow boundary detection based on an adaptive direction-tracking filter for brain-machine interface applications. Appl. Sci. 2020, 10, 6761. [Google Scholar] [CrossRef]
Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 7263–7271. [Google Scholar]
Gao, F.; Huang, T.; Sun, J.; Wang, J.; Hussain, A.; Yang, E. A new algorithm for SAR image target recognition based on an improved deep convolutional neural network. Cogn. Comput. 2019, 11, 809–824. [Google Scholar] [CrossRef]
Chen, B.X.; Sahdev, R.; Tsotsos, J.K. Person following robot using selected online ada-boosting with stereo camera. In Proceedings of the 2017 14th Conference on Computer and Robot Vision (CRV), Edmonton, AB, Canada, 16–19 May 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 48–55. [Google Scholar]
Evjemo, L.D.; Gjerstad, T.; Grøtli, E.I.; Sziebig, G. Trends in smart manufacturing: Role of humans and industrial robots in smart factories. Curr. Robot. Rep. 2020, 1, 35–41. [Google Scholar] [CrossRef]
Danelljan, M.; Bhat, G.; Khan, F.S.; Felsberg, M. ECO: Efficient Convolution Operators for Tracking. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 6931–6939. [Google Scholar]
Danelljan, M.; Hager, G.; Shahbaz Khan, F.; Felsberg, M. Learning spatially regularized correlation filters for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 4310–4318. [Google Scholar]
Bertinetto, L.; Valmadre, J.; Golodetz, S.; Miksik, O.; Torr, P.H. Staple: Complementary learners for real-time tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 1401–1409. [Google Scholar]
Li, Y.; Zhu, J. A scale adaptive kernel correlation filter tracker with feature integration. In Proceedings of the Computer Vision-ECCV 2014 Workshops, Zurich, Switzerland, 6–7 and 12 September 2014; Part II 13. Springer International Publishing: Cham, Switzerland, 2015; pp. 254–265. [Google Scholar]
Kiani Galoogahi, H.; Fagg, A.; Lucey, S. Learning background-aware correlation filters for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1135–1143. [Google Scholar]
Wang, N.; Zhou, W.; Tian, Q.; Hong, R.; Wang, M.; Li, H. Multi-cue correlation filters for robust visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Guangzhou, China, 2–4 December 2016; IEEE: Piscataway, NJ, USA, 2018; pp. 4844–4853. [Google Scholar]
Li, F.; Tian, C.; Zuo, W.; Zhang, L.; Yang, M.H. Learning spatial-temporal regularized correlation filters for visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 4904–4913. [Google Scholar]
Yun, S.; Choi, J.; Yoo, Y.; Yun, K.; Young Choi, J. Action-decision networks for visual tracking with deep reinforcement learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 2711–2720. [Google Scholar]
Valmadre, J.; Bertinetto, L.; Henriques, J.; Vedaldi, A.; Torr, P.H. End-to-end representation learning for correlation filter based tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 2805–2813. [Google Scholar]
Zhang, T.; Xu, C.; Yang, M.H. Multi-task correlation particle filter for robust object tracking. In Proceedings of the IEEE Conference on Computer Vision and pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4335–4343. [Google Scholar]
Dollar, P.; Wojek, C.; Schiele, B.; Perona, P. Pedestrian detection: An evaluation of the state of the art. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 34, 743–761. [Google Scholar] [CrossRef]
Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? the kitti vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 3354–3361. [Google Scholar]
Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Fei-Fei, L. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
Mottaghi, R.; Savarese, S. Beyond pascal: A benchmark for 3d object detection in the wild. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Steamboat Springs, CO, USA, 24–26 March 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 75–82. [Google Scholar]
Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
Everingham, M.; Eslami, S.M.A.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes challenge: A retrospective. Int. J. Comput. Vis. 2015, 111, 98–136. [Google Scholar] [CrossRef]
Mueller, M.; Smith, N.; Ghanem, B. A benchmark and simulator for UAV tracking. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Part I 14. Springer International Publishing: Cham, Switzerland, 2016; pp. 445–461. [Google Scholar]
Wu, Y.; Lim, J.; Yang, M.H. Online object tracking: A benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 2411–2418. [Google Scholar]
Lin, T.Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Part V 13. Springer International Publishing: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
Kuznetsova, A.; Rom, H.; Alldrin, N.; Uijlings, J.R.R.; Krasin, I.; Pont-Tuset, J.; Kamali, S.; Popov, S.; Malloci, M.; Kolesnikov, A.; et al. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. Int. J. Comput. Vis. 2020, 128, 1956–1981. [Google Scholar] [CrossRef]
Zhu, P.; Wen, L.; Bian, X.; Ling, H.; Hu, Q. Vision meets drones: A challenge. arXiv 2018, arXiv:1804.07437. [Google Scholar]
Dai, A.; Chang, A.X.; Savva, M.; Halber, M.; Funkhouser, T.; Nießner, M. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 5828–5839. [Google Scholar]
Wu, Z.; Song, S.; Khosla, A.; Yu, F.; Zhang, L.; Tang, X.; Xiao, J. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 1912–1920. [Google Scholar]
Song, S.; Lichtenberg, S.P.; Xiao, J. Sun rgb-d: A rgb-d scene understanding benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 567–576. [Google Scholar]
Silberman, N.; Hoiem, D.; Kohli, P.; Fergus, R. Indoor segmentation and support inference from rgbd images. In Proceedings of the Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; Part V 12. Springer: Berlin/Heidelberg, Germany, 2012; pp. 746–760. [Google Scholar]
Huang, X.; Cheng, X.; Geng, Q.; Cao, D.; Zhou, H.; Wang, B.; Lin, Y.; Yang, R. The apolloscape dataset for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 954–960. [Google Scholar]
Hwang, S.; Park, J.; Kim, N.; Choi, Y.; Kweon, I.S. Multispectral pedestrian detection: Benchmark dataset and baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 1037–1045. [Google Scholar]
Dau, H.A.; Bagnall, A.; Kamgar, K.; Yeh, C.-C.M.; Zhu, Y.; Gharghabi, S.; Ratanamahatana, C.A.; Keogh, E. The UCR time series archive. IEEE/CAA J. Autom. Sin. 2019, 6, 1293–1305. [Google Scholar] [CrossRef]
Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 3213–3223. [Google Scholar]
Chang, A.X.; Funkhouser, T.; Guibas, L.; Hanrahan, P.; Huang, Q.; Li, Z.; Savarese, S.; Savva, M.; Song, S.; Su, H.; et al. Shapenet: An information-rich 3d model repository. arXiv 2015, arXiv:1512.03012. [Google Scholar]
Chang, A.; Dai, A.; Funkhouser, T.; Halber, M.; Nießner, M.; Savva, M.; Song, S.; Zeng, A.; Zhang, Y. Matterport3d: Learning from rgb-d data in indoor environments. arXiv 2017, arXiv:1709.06158. [Google Scholar]
Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 11621–11631. [Google Scholar]
Sun, P.; Kretzschmar, H.; Dotiwalla, X.; Chouard, A.; Patnaik, V.; Tsui, P.; Guo, J.; Zhou, Y.; Chai, Y.; Caine, B.; et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 2446–2454. [Google Scholar]
De Charette, R.; Nashashibi, F. Real time visual traffic lights recognition based on spot light detection and adaptive traffic lights templates. In Proceedings of the Intelligent Vehicles Symposium, Xi’an, China, 3–5 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 358–363. [Google Scholar]
Timofte, R.; Zimmermann, K.; Van Gool, L. Multi-view traffic sign detection, recognition, and 3D localisation. Mach. Vis. Appl. 2014, 25, 633–647. [Google Scholar] [CrossRef]
Houben, S.; Stallkamp, J.; Salmen, J.; Schlipsing, M.; Igel, C. Detection of traffic signs in real-world images: The German Traffic Sign Detection Benchmark. In Proceedings of the 2013 International Joint Conference on Neural Networks (IJCNN), Dallas, TX, USA, 4–9 August 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 1–8. [Google Scholar]
Klare, B.F.; Klein, B.; Taborsky, E.; Blanton, A.; Cheney, J.; Allen, K.; Grother, P.; Mah, A.; Burge, M.J.; Jain, A.K. Pushing the frontiers of unconstrained face detection and recognition: Iarpa janus benchmark a. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 1931–1939. [Google Scholar]
Yang, S.; Luo, P.; Loy, C.C.; Tang, X. Wider face: A face detection benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 5525–5533. [Google Scholar]
Cheng, G.; Han, J. A survey on object detection in optical remote sensing images. ISPRS J. Photogramm. Remote Sens. 2016, 117, 11–28. [Google Scholar] [CrossRef]
Wang, J.; Zhang, P.; Chu, T.; Cao, Y.; Zhou, Y.; Wu, T.; Wang, B.; He, C.; Lin, D. V3det: Vast vocabulary visual detection dataset. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 19844–19854. [Google Scholar]
Wang, X.; Wang, S.; Tang, C.; Zhu, L.; Jiang, B.; Tian, Y.; Tang, J. Event stream-based visual object tracking: A high-resolution benchmark dataset and a novel baseline. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 19248–19257. [Google Scholar]
Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Part I 14. Springer International Publishing: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
Papageorgiou, C.P.; Oren, M.; Poggio, T. A general framework for object detection. In Proceedings of the Sixth international conference on computer vision (IEEE Cat. No. 98CH36271), Bombay, India, 4–7 January 1998; IEEE: Piscataway, NJ, USA, 1998; pp. 555–562. [Google Scholar]
Kang, K.; Li, H.; Yan, J.; Zeng, X.; Yang, B.; Xiao, T.; Zhang, C.; Wang, Z.; Wang, R.; Ouyang, W.; et al. T-cnn: Tubelets with convolutional neural networks for object detection from videos. IEEE Trans. Circuits Syst. Video Technol. 2017, 28, 2896–2907. [Google Scholar] [CrossRef]
Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
Freund, Y.; Schapire, R.; Abe, N. A short introduction to boosting. J.-Jpn. Soc. Artif. Intell. 1999, 14, 1612. [Google Scholar]
Malisiewicz, T.; Gupta, A.; Efros, A.A. Ensemble of exemplar-svms for object detection and beyond. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 89–96. [Google Scholar]
Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 1440–1448. [Google Scholar]
Kvietkauskas, T.; Pavlov, E.; Stefanovič, P.; Birutė, P. The Efficiency of YOLOv5 Models in the Detection of Similar Construction Details. Appl. Sci. 2024, 14, 3946. [Google Scholar] [CrossRef]
Kumar, D.; Muhammad, N. Object detection in adverse weather for autonomous driving through data merging and YOLOv8. Sensors 2023, 23, 8471. [Google Scholar] [CrossRef] [PubMed]
Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. Yolov9: Learning what you want to learn using programmable gradient information. arXiv 2024, arXiv:2402.13616. [Google Scholar]
Felzenszwalb, P.F.; Girshick, R.B.; McAllester, D.; Ramanan, D. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 32, 1627–1645. [Google Scholar] [CrossRef] [PubMed]
He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef]
Yu, H.; Luo, X. Cvt-assd: Convolutional vision-transformer based attentive single shot multibox detector. In Proceedings of the 2021 IEEE 33rd International Conference on Tools with Artificial Intelligence (ICTAI), Washington, DC, USA, 1–3 November 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 736–744. [Google Scholar]
Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 7464–7475. [Google Scholar]
Zheng, C. Stack-YOLO: A friendly-hardware real-time object detection algorithm. IEEE Access 2023, 11, 62522–62534. [Google Scholar] [CrossRef]
Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
Chen, X.; Ma, H.; Wan, J. Multi-Scale Attention Mechanism for Remote Sensing Image Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15. [Google Scholar]
Chhabra, M.; Ravulakollu, K.K.; Kumar, M.; Sharma, A.; Nayyar, A. Improving automated latent fingerprint detection and segmentation using deep convolutional neural network. Neural Comput. Appl. 2023, 35, 6471–6497. [Google Scholar] [CrossRef]
Chhabra, M.; Sharan, B.; Elbarachi, M.; Kumar, M. Intelligent waste classification approach based on improved multi-layered convolutional neural network. Multimed. Tools Appl. 2024, 81, 1–26. [Google Scholar] [CrossRef]
Vora, P.; Shrestha, S. Detecting diabetic retinopathy using embedded computer vision. Appl. Sci. 2020, 10, 7274. [Google Scholar] [CrossRef]
Morar, A.; Moldoveanu, A.; Mocanu, I.; Moldoveanu, F.; Radoi, I.E.; Asavei, V.; Gradinaru, A.; Butean, A. A comprehensive survey of indoor localization methods based on computer vision. Sensors 2020, 20, 2641. [Google Scholar] [CrossRef] [PubMed]
Sturm, J.; Burgard, W.; Cremers, D. Evaluating egomotion and structure-from-motion approaches using the TUM RGB-D benchmark. In Proceedings of the Workshop on Color-Depth Camera Fusion in Robotics at the IEEE/RJS International Conference on Intelligent Robot Systems (IROS), Vilamoura, Algarve, Portugal, 7–12 October 2012; IEEE: Piscataway, NJ, USA, 2012; Volume 13, p. 6. [Google Scholar]
Burri, M.; Nikolic, J.; Gohl, P.; Schneider, T.; Rehder, J.; Omari, S.; Achtelik, M.W.; Siegwart, R. The EuRoC micro aerial vehicle datasets. Int. J. Robot. Res. 2016, 35, 1157–1163. [Google Scholar] [CrossRef]
Schubert, D.; Goll, T.; Demmel, N.; Usenko, V.; Stueckler, J.; Cremers, D. The TUM VI benchmark for evaluating visual-inertial odometry. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1680–1687. [Google Scholar]
She, Q.; Feng, F.; Hao, X.; Yang, Q.; Lan, C.; Lomonaco, V.; Shi, X.; Wang, Z.; Guo, Y.; Zhang, Y.; et al. Openloris-object: A dataset and benchmark towards lifelong object recognition. arXiv 2019, arXiv:1911.06487. [Google Scholar]
Ligocki, A.; Jelinek, A.; Zalud, L. Brno urban dataset-the new data for self-driving agents and mapping tasks. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 3284–3290. [Google Scholar]
Klenk, S.; Chui, J.; Demmel, N.; Cremers, D. Tum-vie: The tum stereo visual-inertial event dataset. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 8601–8608. [Google Scholar]
Wang, W.; Zhu, D.; Wang, X.; Hu, Y.; Qiu, Y.; Wang, C.; Hu, Y.; Kapoor, A.; Scherer, S. Tartanair: A dataset to push the limits of visual slam. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; IEEE: Piscataway, NJ, USA, 2020; pp. 4909–4916. [Google Scholar]
Zhao, S.; Gao, Y.; Wu, T.; Singh, D.; Jiang, R.; Sun, H.; Sarawata, M.; Whittaker, W.C.; Higgins, I.; Su, S.; et al. SubT-MRS Dataset: Pushing SLAM Towards All-weather Environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 22647–22657. [Google Scholar]
Fei, B.; Yang, W.; Chen, W.M.; Li, Z.; Li, Y.; Ma, T.; Hu, X.; Ma, L. Comprehensive review of deep learning-based 3d point cloud completion processing and analysis. IEEE Trans. Intell. Transp. Syst. 2022, 23, 22862–22883. [Google Scholar] [CrossRef]
Chen, H.; Wang, P.; Wang, F.; Tian, W.; Xiong, L.; Li, H. Epro-pnp: Generalized end-to-end probabilistic perspective-n-points for monocular object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 2781–2790. [Google Scholar]
Byravan, A.; Fox, D. Se3-nets: Learning rigid body motion using deep neural networks. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 173–180. [Google Scholar]
Wang, S.; Clark, R.; Wen, H.; Trigoni, N. Deepvo: Towards end-to-end visual odometry with deep recurrent convolutional neural networks. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 2043–2050. [Google Scholar]
Xiao, L.; Wang, J.; Qiu, X.; Rong, Z.; Zou, X. Dynamic-SLAM: Semantic monocular visual localization and mapping based on deep learning in dynamic environment. Robot. Auton. Syst. 2019, 117, 1–16. [Google Scholar] [CrossRef]
Duan, R.; Feng, Y.; Wen, C.Y. Deep pose graph-matching-based loop closure detection for semantic visual SLAM. Sustainability 2022, 14, 11864. [Google Scholar] [CrossRef]
Mur-Artal, R.; Montiel JM, M.; Tardos, J.D. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Trans. Robot. 2015, 31, 1147–1163. [Google Scholar] [CrossRef]
Lin, J.; Zhang, F. R3LIVE++: A Robust, Real-time, Radiance reconstruction package with a tightly-coupled LiDAR-Inertial-Visual state Estimator. arXiv 2022, arXiv:2209.03666. [Google Scholar]
Wang, R.; Schworer, M.; Cremers, D. Stereo DSO: Large-scale direct sparse visual odometry with stereo cameras. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 3903–3911. [Google Scholar]
Engel, J.; Schöps, T.; Cremers, D. LSD-SLAM: Large-scale direct monocular SLAM. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer International Publishing: Cham, Switzerland, 2014; pp. 834–849. [Google Scholar]
Campos, C.; Elvira, R.; Rodríguez, J.J.G.; Montiel, J.M.M.; Tardós, J.D. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
Fabre, W.; Haroun, K.; Lorrain, V.; Lepecq, M.; Sicard, G. From Near-Sensor to In-Sensor: A State-of-the-Art Review of Embedded AI Vision Systems. Sensors 2024, 24, 5446. [Google Scholar] [CrossRef] [PubMed]
Lygouras, E.; Santavas, N.; Taitzoglou, A.; Tarchanidis, K.; Mitropoulos, A.; Gasteratos, A. Unsupervised human detection with an embedded vision system on a fully autonomous UAV for search and rescue operations. Sensors 2019, 19, 3542. [Google Scholar] [CrossRef] [PubMed]
Douklias, A.; Karagiannidis, L.; Misichroni, F.; Amditis, A. Design and implementation of a UAV-based airborne computing platform for computer vision and machine learning applications. Sensors 2022, 22, 2049. [Google Scholar] [CrossRef] [PubMed]
Ortega, L.D.; Loyaga, E.S.; Cruz, P.J.; Lema, H.P.; Abad, J.; Valencia, E.A. Low-Cost Computer-Vision-Based Embedded Systems for UAVs. Robotics 2023, 12, 145. [Google Scholar] [CrossRef]
Marroquín, A.; Garcia, G.; Fabregas, E.; Aranda-Escolástico, E.; Farias, G. Mobile robot navigation based on embedded computer vision. Mathematics 2023, 11, 2561. [Google Scholar] [CrossRef]
Nuño-Maganda, M.A.; Dávila-Rodríguez, I.A.; Hernández-Mier, Y.; Barrón-Zambrano, J.H.; Elizondo-Leal, J.C.; Díaz-Manriquez, A.; Polanco-Martagón, S. Real-Time Embedded Vision System for Online Monitoring and Sorting of Citrus Fruits. Electronics 2023, 12, 3891. [Google Scholar] [CrossRef]

Figure 1. The Applying Scenarios of Computer Vision.

Figure 2. The Increasing Number of Publications in Object Detection from 2010 to 2024. (Data from IEEE Xplore advanced search: all in title: “computer vision”).

Figure 3. A Road Map of Computer Vision with Milestone Detectors and Trackers [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15].

Figure 4. Cascade Classifier Structure. 1, 2, 3, and 4 represent different levels of classifiers, with T standing for true and F indicating failure.

Figure 5. Adaptive Long-Term Tracking Framework Based on Visual Detection.

Figure 6. Positioning Accuracy Map and Coverage of System Frame.

Figure 7. The Datasets We Have Built Ourselves in Recent Years.

Figure 8. Network Structure of One-Stage Detector.

Figure 9. Network Structure of Two-Stage Detector.

Figure 10. The Steady Improvement of Accuracy in Visual Detection Algorithms on VOC Dataset.

Figure 11. ACDet Model System Architecture.

Figure 12. Test Results of Acdet Model on EP Dataset.

Figure 13. A Computer Vision Detection System Based on Deep Learning Applied in the Medical Industry.

Figure 14. Principles of Visual Odometry.

Figure 15. Unmanned Driving System Based on Visual SLAM and Tracking Algorithms.

Figure 16. Visual Framework for Unmanned Factory Applications with Multi-Driverless Robotic Vehicles.

Table 1. A comprehensive summary of the comparative analysis of various works within this domain.

Chapter of this Paper	Category	Algorithm/Datasets	Domain	Key Features	Related Applications
Chapter 1	Development Overview	Caltech; KITTI; MOSSE; R-CNN; OVD-SLAM, etc.	Related Datasets; Visual Detection; Visual Tracking; Visual SLAM.	History and trend of development and application.	Action detection; Simultaneous localization and mapping; Multiple Object Tracking; Pilotless automobile, etc.
Chapter 2	Traditional Target Detection and Tracking Based on Correlation Filtering	Viola–Jones Detector	Face detection	Cascade classifier; integral image processing	Anti-occlusion long-term tracking framework
		ECO-Tracker	General object tracking	Tracking confidence; peak-to-sidelobe ratio
		Comparison of multiple tracking algorithms	General object tracking; UAV20 dataset	Comprehensive comparison of performance and energy consumption
Chapter 3	Deep Learning Object Detection	Review of classification 1D, 2D, 3D, 3D+ vision and multi-modal sensing datasets	Multi-class datasets	Multi-class datasets	Automated latent fingerprint detection and segmentation using deep convolutional neural network; Intelligent waste classification approach based on improved multi-layered convolutional neural network; ACDet: computer vision detection in the medical industry;
		PASCAL; VOC; UAV, etc.	Well-known public datasets for object detection	Verify and improve algorithm performance
		YOLO series	General object detection	Single-stage detection, real-time processing
		CNN series	General object detection	Region Proposal Network, two-stage detection
		Comparison of multiple deep learning-based detection algorithms	General object detecting, MS COCO, and EP datasets	Comprehensive comparison of performance and energy consumption
Chapter 4	Visual SLAM Algorithms	TUM series datasets; SubT-MRS; TartanAir et al.	Well-known public datasets for visual SLAM	Verify and improve algorithm performance	Visual Framework for Unmanned Factory Applications with Multi-Driverless Robotic Vehicles and UAVs
		LIO-SAM	LiDAR and IMU Integration	High accuracy in large-scale and dynamic environments;
		DROID-SLAM	Localization and Mapping	Deep learning-based, end-to-end feature extraction and optimization; Robust to varying environments
		Comparison of multiple SLAM algorithms	Locate and map TUM-VI dataset	Comprehensive comparison of performance and energy consumption
		Discussion of improving model energy efficiency	Model optimization	Model pruning; quantization

Table 2. An Overview of Various Related Datasets by Category.

Challenge/FPS	ECO-C	LADCF	ECO-HC	ARCF	ADNet	STRCF	STRCF	CFNet	MCCT-H
AC	0.602①	0.592②	0.588③	0.573	0.569	0.562	0.557	0.559	0.544
BC	0.639①	0.619③	0.617	0.621②	0.567	0.558	0.556	0.542	0.553
CM	0.610①	0.602②	0.584	0.599③	0.571	0.569	0.555	0.540	0.528
FM	0.593②	0.597①	0.581③	0.578	0.559	0.544	0.544	0.549	0.533
FO	0.643①	0.598③	0.607②	0.583	0.581	0.574	0.565	0.555	0.548
IV	0.583①	0.579②	0.525	0.552	0.566③	0.524	0.523	0.527	0.495
LR	0.643①	0.616②	0.580	0.584③	0.579	0.574	0.568	0.555	0.551
OV	0.609①	0.591②	0.565	0.579③	0.557	0.541	0.551	0.545	0.523
PO	0.634①	0.604②	0.574	0.594③	0.565	0.559	0.549	0.541	0.527
SO	0.618①	0.584③	0.595②	0.579	0.561	0.563	0.544	0.534	0.503
SV	0.613②	0.617①	0.569	0.589③	0.582	0.567	0.552	0.529	0.529
VC	0.576①	0.565③	0.567②	0.553	0.563	0.556	0.538	0.529	0.509
Ave FPS	36.4	16.1	41.2①	32.8	11.7	22.8	31.2	38.1	26.8

FPS (Frames Per Second). ①②③ represent the top 3 results for the algorithm.

Table 3. An overview of various related datasets by category.

Datasets Type	Representative Datasets	Main Applications	Advantages
1D	UCR Time Series Classification Archive [57]	Time series classification	Working with time series data
2D	COCO, Pascal VOC, ImageNet, Cityscapes [58]	Object detection, classification, segmentation	Rich image data
3D	KITTI, ScanNet, ModelNet, ShapeNet [59]	Three-dimensional reconstruction, object detection, scene understanding	Providing spatial information
3D+ Vision	SUN RGB-D, NYU Depth V2, Matterport3D [60]	Indoor scene understanding	Combining RGB images and depth information
Multimodal Sensing	ApolloScape, KAIST, nuScenes [61], Waymo Open Dataset [62]	Autonomous driving, pedestrian detection	Integrating multiple sensor data

Table 4. An Overview of Some Popular and Challenged Computer Vision Datasets.

Dataset/Year	Structure	Diversity	Description	Scale	URL
TLR [63] 2009	Focuses on traffic light detection in urban environments with labeled bounding boxes for traffic lights.	Urban traffic scenes from Paris, mainly focused on traffic lights.	Traffic scenes in Paris	20,200 Frames	https://github.com/DeepeshDongre/Traffic-Light-Detection-LaRA Accessed on 30 December 2017
KITTI [41] 2012	Comprises annotated images, 3D laser scans, and GPS data; includes multiple sensors such as stereo cameras, laser and IMU.	Urban and rural driving environments in Germany, diverse in weather, time, and lighting conditions.	The traffic scene analysis in Germany.	16,000 Images	https://www.cvlibs.net/datasets/kitti/raw_data.php Accessed on 1 June 2012
BelgianTSD [64] 2014	Contains images of 269 traffic sign categories with labeled bounding boxes.	Diverse traffic signs captured in different weather and lighting conditions.	The traffic sign annotations of 269 types. With the 3D location.	138,300 Images	https://btsd.ethz.ch/shareddata/ Accessed on 18 February 2014
GTSDB [65] 2013	Traffic sign detection dataset with annotations for different traffic sign types.	Diverse road environments, including various climates and weather conditions.	Traffic scenes in different climates.	2100 Images	http://benchmark.ini.rub.de/?section=gtsdb&subsecti-on=news Accessed on 15 July 2013
IJB [66] 2015	A dataset with various face images and videos for facial recognition and verification tasks.	High diversity in pose, lighting, and background variations.	IJB scenes for recognition and detection tasks.	50,000 Images and video clips	https://www.nist.gov/programs-projects/face-challenges Accessed on 10 June 2015
WiderFace [67] 2016	A large-scale face detection dataset containing images with a wide range of scales, occlusions, and poses.	High variability in face scale, pose, occlusion, and lighting.	Face detection scene.	32,000 Images	http://shuoyang1213.me/WIDERFACE/ Accessed on 17 April 2016
NWPU-VHR10 [68] 2016	A remote sensing dataset containing images with 10 different object classes from high-resolution satellite imagery.	High diversity in urban and rural environments, multiple object types like aircraft, ships, and vehicles.	Remote sensing detection scenario.	4600 Images	http://github.com/chaozhong2010/VHR-10_dataset_coco Accessed on 17 July 2019
V3Det [69] 2023	A dataset designed for large vocabulary visual detection tasks with a wide range of object categories.	Diverse object categories with detailed annotations, covering many daily and industrial objects.	Vast vocabulary visual detection dataset with precisely annotated.	245,500 Images	https://v3det.openxlab.org.cn/ Accessed on 10 August 2023
EventVOT [70] 2024	Focuses on high-resolution event-based object tracking, including various categories like pedestrians, vehicles, and drones.	Event-based video data under different weather conditions and environments.	Contains multiple categories of videos, such as pedestrians, vehicles, drones, table tennis, etc.	1141 Video clips	https://github.com/Event-AHU/EventVOT_Benchmark Accessed on 5 July 2023

Table 5. Our Computer Vision Datasets for Different Challenges.

Our Own Datasets (Year)	Scale		Description and Application
Our Own Datasets (Year)	Images	Objects	Description and Application
WMTIS (2019) (Weak Military Targets in Infrared Scenes)	1632	1808	The infrared simulation weak target dataset, constructed based on the infrared characteristics of military targets, includes a series of challenging samples featuring scale variations. These samples represent fighter jets, tanks, and warships across diverse environments, such as desert, coastal, inland, and urban settings.
MB (2023) (Medicine Boxes)	3345	9612	Various types of medicine boxes made of various materials, covering all mainstream types of pharmacies, including challenges such as reflection caused by waterproof plastic film.
EP (2022) (Express Packages)	25,127	60,393	A comprehensive sample covering all types of packages in the logistics and express delivery industry, with sizes ranging from 5 cm to 3 m and heights ranging from 0.5 mm to 1.2 m in various shapes.
FPP (2022~2024) (Fully Automated Unmanned Factories and Product Detection and Tracking)	9716	17,435	Multi-target samples in complex industrial scenes face many challenges, such as easy occlusion, uneven illumination, inconsistent imaging quality, and open scenes. This includes production personnel and samples of various types of products collected through various methods, such as ground robotic vehicles and UAVs.

Table 6. Detection Results and Energy Efficiency of Different Algorithms on The MS COCO.

Algorithmic Network	Backbone	AP	FPS	Processing Time (ms/Frame)	GPU Utilization (%)	Memory Usage (MB)	Energy Consumption (Watts)	Energy Efficiency
Fast R-CNN [72]	VGG-16	19.7	15	66.7	75%	1150	120	0.00044
Faster R-CNN [73]	VGG-16	21.9	25	40	65%	1110	100	0.00191
SSD321 [74]	ResNet-101	28.0	30	33.3	62%	1020	90	0.00467
YOLOv3 [75]	DarkNet-53	33.0	45	22.2	58%	850	85	0.01685
RefineDet512+ [76]	ResNet-101	41.8	40	25	66%	950	88	0.01333
NAS-FPN [77]	AmoebaNet	48.0	20	50	81%	1400	150	0.00114
YOLOv5 [78]	CSP-Darknet53	48.9	60	16.7	54%	750	75	0.06248
YOLOv8 [79]	CSPNet	51.8②	65②	15.4②	45%	6600	70②	0.11910
YOLOv9 [80]	CSPNet	51.4	70①	14.3①	42%②	650①	68①	0.13550②
YOLOv10 [11]	CSPNet	52.4①	70①	14.3①	40%①	660②	70②	0.13900①

①② represent the top 2 results for the algorithm.

Table 7. Detection Results on The EP Dataset with Different Models.

Model	Smooth mAP	mAP	FLOPS/M	Average Processing Time/ms
YOLOv5	66.49	66.95	2.58①	1.78②
YOLOv6	62.67	62.15	4.26	1.83
YOLOv8	74.86②	75.25②	3.08	1.79
YOLOv10	64.69	65.10	2.71②	1.77①
ACDet	79.52①	81.56①	3.69	1.79

①② represent the top 2 results for the algorithm.

Table 8. Common Visual SLAM Datasets.

Dataset	Release Date	Scale	Collection Situation	Application Areas
TUM RGBD [92]	2012	Over 100 indoor video sequences	RGB Camera/Depth camera	SLAM, 3D Reconstruction, Robotics
EUROC [93]	2016	11 sequences, approx. 50 min	Binocular Camera/RGB Camera/UAV	Visual-Inertial Odometry, SLAM
TUM VI [94]	2018	Approx. 40 h of indoor and outdoor sequences	RGB Camera/IMU	Visual-Inertial Odometry, SLAM
Openloris [95]	2019	Over 10 indoor environments	RGB Camera/IMU/Radar	Lifelong Learning, Object Recognition
Brno Urban [96]	2020	16 sequences, approx. 100 min	RGB Camera/IMU/Radar/Infrared	Autonomous Driving, Urban Navigation
TUM-VIE [97]	2021	Approx. 10 h of sequences	Binocular Camera/IMU	SLAM, Robotics, Augmented Reality
TartanAir [98]	2023	Over 300 km in virtual environments	RGB&RGBD Camera/Optical flow/Semantic segmentation	SLAM, Navigation, Robotics
SubT-MRS [99]	2024	Over 100 h of underground exploration videos	RGB Camera/IMU/Radar/Thermal imagery	SLAM, Navigation, Robotics

Table 9. Detection Results and Energy Efficiency of Different SLAM Algorithms on the TUM-VI.

Seq. and Index	ORB-SLAM [106]	DROID-SLAM	R3LIVE++ [107]	DSO [108]	LSD-SLAM [109]	ORB-SLAM3 [110]
Room1	0.057	0.040	0.028①	0.032②	0.037	0.033
Room2	0.051	0.027①	0.037	0.066	0.029②	0.033
Room3	0.027	0.017①	0.021②	0.023	0.038	0.026
Room4	0.052	0.058	0.043	0.033②	0.021①	0.033②
Room5	0.030	0.026①	0.051	0.027②	0.055	0.037
Room6	0.031①	0.035	0.032②	0.036	0.037	0.040
Avg RMSE ATE	0.041	0.033①	0.035	0.036	0.036	0.034②
FPS	35②	25	30	20	45①	15
Processing Time (ms/frame)	28②	40	33	50	22①	66
GPU Utilization (%)	85%	90%	88%	80%	70%①	75%②
Memory Usage (MB)	1850②	2200	2100	1900	1800①	2250
Energy Consumption (Watts)	180①	240	200	190	375	185②
Energy Efficiency	0.424①	0.127	0.2375	0.1336	0.417②	0.0704

①② represent the top 2 results for the algorithm.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Chen, L.; Li, G.; Xie, W.; Tan, J.; Li, Y.; Pu, J.; Chen, L.; Gan, D.; Shi, W. A Survey of Computer Vision Detection, Visual SLAM Algorithms, and Their Applications in Energy-Efficient Autonomous Systems. Energies 2024, 17, 5177. https://doi.org/10.3390/en17205177

AMA Style

Chen L, Li G, Xie W, Tan J, Li Y, Pu J, Chen L, Gan D, Shi W. A Survey of Computer Vision Detection, Visual SLAM Algorithms, and Their Applications in Energy-Efficient Autonomous Systems. Energies. 2024; 17(20):5177. https://doi.org/10.3390/en17205177

Chicago/Turabian Style

Chen, Lu, Gun Li, Weisi Xie, Jie Tan, Yang Li, Junfeng Pu, Lizhu Chen, Decheng Gan, and Weimin Shi. 2024. "A Survey of Computer Vision Detection, Visual SLAM Algorithms, and Their Applications in Energy-Efficient Autonomous Systems" Energies 17, no. 20: 5177. https://doi.org/10.3390/en17205177

APA Style

Chen, L., Li, G., Xie, W., Tan, J., Li, Y., Pu, J., Chen, L., Gan, D., & Shi, W. (2024). A Survey of Computer Vision Detection, Visual SLAM Algorithms, and Their Applications in Energy-Efficient Autonomous Systems. Energies, 17(20), 5177. https://doi.org/10.3390/en17205177

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

A Survey of Computer Vision Detection, Visual SLAM Algorithms, and Their Applications in Energy-Efficient Autonomous Systems

Abstract

1. Introduction

1.1. Innovative Contributions and Gaps Compared to Related Reviews

1.2. Selection Criteria for Reviewed Articles

1.3. An Evolutionary Review of Algorithms Related to Computer Vision

2. Review of Traditional Computer Vision Algorithms

2.1. The Basic Theory of Computer Vision Detection

2.2. The Basic Theory of Computer Vision Tracking

2.3. Adaptive Long-Term Tracking Framework Based on ECO-C

2.4. Discussion of the Application of Traditional Vision Detection and Tracking Algorithms in Autonomous Driving and Contribution to Promoting Green Energy Sustainability

3. Review of Deep Learning-Based Computer Vision

3.1. Computer Vision Datasets

3.1.1. The Thorough Review of Multidimensional Typed Datasets

3.1.2. Examples of Popular and Challenging Computer Vision Datasets and Their Value and Meaning

3.2. Review of Deep Learning Computer Vision Based on Convolutional Neural Networks

3.3. Computer Vision Detection Applications Based on Convolutional Neural Networks

3.3.1. ACDet: A Vector Detection Model for Drug Packaging Based on Convolutional Neural Network

3.4. Exploration and Future Trends in Deep Learning-Based Computer Vision Algorithms

3.5. Discussion on the Application of Deep Learning-Based Vision Detection Algorithms in Autonomous Driving and Contribution to Promoting Green Energy Sustainability

4. Review of Visual Simultaneous Localization and Mapping (SLAM) Algorithms

4.1. Visual SLAM Datasets

4.2. Review of the SLAM Algorithms

4.2.1. The Basic Principles of SLAM Algorithms

4.2.2. Performance Comparison and Energy Balance Discussion of Mainstream SLAM Algorithms

4.3. Exploration and Future Trends in SLAM Algorithms

4.4. Visual Framework for Unmanned Factory Applications with Multi-Driverless Robotic Vehicles and UAVs

4.5. Discussion of the Application of SLAM Algorithms in Autonomous Driving and Contribution to Promoting Green Energy Sustainability

4.6. Discussion on the Role of Algorithm Optimization in Improving Energy Efficiency

5. Conclusions

Author Contributions

Funding

Acknowledgments

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI