Electronics
  • Editorial
  • Open Access

11 April 2025

Advances in Computer Vision and Deep Learning and Its Applications

1 Heilongjiang Province Key Laboratory of Laser Spectroscopy Technology and Application, Harbin University of Science and Technology, Harbin 150080, China
2 Department of Computer Science, Chubu University, 1200 Matsumoto-cho, Kasugai 487-8501, Japan
* Authors to whom correspondence should be addressed.
This article belongs to the Special Issue Advances in Computer Vision and Deep Learning and Its Applications

2. Overview of This Special Issue

The articles included in this Special Issue cover advancements in ten research directions: computer vision; feature extraction and image selection; pattern recognition for image processing techniques; image processing in intelligent transportation; neural networks, machine learning, and deep learning; biomedical image processing and recognition; image processing for intelligent surveillance; deep learning for image processing; robotics and unmanned systems; and AI-based image processing, understanding, recognition, compression, and reconstruction. We have categorized the 33 articles in this Special Issue according to these directions, a classification that reflects both the vertical depth of each technology and the horizontal breadth of its cross-field applications. The classification system spans three dimensions: the technological layer, the type of task, and the target industry. Through this system, a clear technological lineage of computer vision and deep learning and their application fields can be constructed: basic algorithmic innovation provides theoretical support for each application field, customized optimization in vertical fields drives real-world deployment, and full-process integrated solutions further enhance the practicality and generalization ability of AI systems.
The development of the computer vision field presents a multi-dimensional technological evolution and has been widely adopted. In this Special Issue, new network architectures, attention mechanisms, and multimodal fusion technologies continue to make breakthroughs at the level of basic algorithmic innovation. The related articles are introduced as follows.
“An Improved YOLOv5 Underwater Detector Based on an Attention Mechanism and Multi-Branch Reparameterization Module” addresses the degradation of target detection accuracy caused by low image quality in underwater environments. A global attention mechanism (GAM) is introduced into the backbone network to enhance the interaction between channel and spatial information and improve feature extraction capability. A DAMO-YOLO-based fusion block is used in the neck to strengthen multi-scale feature aggregation, and the experimental results surpass advanced methods such as ViDT.
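To make the mechanism concrete, below is a minimal PyTorch sketch of a GAM-style block in the spirit of the published global attention mechanism: an MLP-based channel gate followed by a large-kernel convolutional spatial gate. The reduction ratio r and layer sizes are illustrative assumptions, not the paper’s settings.

```python
import torch
import torch.nn as nn

class GAMAttention(nn.Module):
    """GAM-style attention: channel MLP gate, then a 7x7 conv spatial gate.

    A simplified sketch of the global attention mechanism; the reduction
    ratio `r` and layer sizes here are illustrative assumptions.
    """
    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        hidden = channels // r
        # Channel submodule: an MLP applied along the channel dimension.
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
        )
        # Spatial submodule: large-kernel convs produce a per-pixel gate.
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, hidden, 7, padding=3), nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 7, padding=3), nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel gate: permute so the MLP mixes information across channels.
        attn = self.channel_mlp(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        x = x * torch.sigmoid(attn)
        # Spatial gate: large-kernel convs score each location.
        return x * torch.sigmoid(self.spatial(x))

x = torch.randn(1, 64, 32, 32)
print(GAMAttention(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```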
“ESD-YOLOv5: A Full-Surface Defect Detection Network for Bearing Collars” proposes an improved YOLOv5 model, ESD-YOLOv5, to address the challenge of detecting varied surface defects on bearing collars against complex backgrounds. A hybrid module combining efficient channel attention (ECA) and coordinate attention (CA) is constructed to enhance the network’s ability to localize defect features. A Slim-neck replaces the original neck structure to reduce the number of model parameters and the computational complexity, while the decoupled head of YOLOX is introduced to separate the classification and regression tasks.
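For reference, a minimal sketch of the ECA half of the hybrid module, assuming a fixed kernel size (the original ECA paper derives the kernel size adaptively from the channel count):

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient channel attention: a small 1D conv over pooled channels.

    Sketch of the ECA component; kernel size k=3 is an assumption.
    """
    def __init__(self, k: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Squeeze spatial dims, then let neighboring channels interact.
        y = x.mean(dim=(2, 3))                    # (B, C)
        y = self.conv(y.unsqueeze(1)).squeeze(1)  # (B, C)
        return x * torch.sigmoid(y)[:, :, None, None]
```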
“Consistent Weighted Correlation-Based Attention for Transformer Tracking” presents a consistent weighted correlation (CWC)-based attention mechanism for improving the performance of a Transformer architecture in visual tracking. The traditional attention computation of Transformer architectures handles each query-key pair independently, ignoring the consistency of the global context. By introducing the CWC module, the authors dynamically adjust the weights in the cross-attention block to enhance the consistency of relevant pairs and suppress the interference of irrelevant pairs.
“MM-NeRF: Large-Scale Scene Representation with Multi-Resolution Hash Grid and Multi-View Priors Features” proposes MM-NeRF, a large-scale neural radiance field (NeRF) method that integrates a multi-resolution hash grid with multi-view prior features. MM-NeRF adopts a two-branch structure: one branch uses a multi-resolution hash grid to efficiently encode the geometric details of the scene, while the other employs multi-view prior features to enhance texture information by fusing cross-view features. This two-branch structure enables MM-NeRF to mitigate the detail loss and high training costs typically associated with traditional NeRF methods on large-scale scenes.
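The hash-grid branch can be illustrated with a simplified Instant-NGP-style encoder; the level count, table size, and nearest-vertex lookup (in place of trilinear interpolation) below are simplifications rather than MM-NeRF’s actual configuration:

```python
import torch
import torch.nn as nn

class HashGridEncoder(nn.Module):
    """Simplified multi-resolution hash encoding (Instant-NGP style).

    Each level hashes quantized 3D coordinates into a small embedding
    table; all hyperparameters here are illustrative assumptions.
    """
    PRIMES = (1, 2654435761, 805459861)  # per-dimension hashing primes

    def __init__(self, levels: int = 8, table_size: int = 2 ** 14,
                 feat_dim: int = 2, base_res: int = 16, growth: float = 1.5):
        super().__init__()
        self.res = [int(base_res * growth ** l) for l in range(levels)]
        self.tables = nn.ModuleList(
            nn.Embedding(table_size, feat_dim) for _ in range(levels))
        self.table_size = table_size

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        # xyz: (N, 3) points in [0, 1]^3. Returns (N, levels * feat_dim).
        feats = []
        for res, table in zip(self.res, self.tables):
            idx = (xyz * res).long()      # nearest grid vertex at this level
            h = torch.zeros_like(idx[:, 0])
            for d, p in enumerate(self.PRIMES):
                h = h ^ (idx[:, d] * p)   # spatial hash of the vertex
            feats.append(table(h % self.table_size))
        return torch.cat(feats, dim=-1)

enc = HashGridEncoder()
print(enc(torch.rand(4, 3)).shape)  # torch.Size([4, 16])
```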
“Research on 3D Visualization of Drone Scenes Based on Neural Radiance Fields” presents a neural radiance field (NeRF) 3D visualization framework for UAV aerial photography scenes. The framework introduces a spatial boundary compression technique combined with a ground-optimized sampling strategy to reduce the number of sampling points in invalid regions. It adopts a multi-resolution hash grid and a clustering sampling method to optimize feature encoding and sampling efficiency, and it reduces outliers and blurring artifacts through L1-norm penalties and an entropy regularization loss. These features address the detail blurring, high computational costs, and cloud artifacts encountered in large-scale scene rendering.
“Incremental SFM 3D Reconstruction Based on Deep Learning” proposes an incremental structure-from-motion (SfM) 3D reconstruction method based on deep learning techniques. This study significantly improves the accuracy and efficiency of 3D reconstruction by refining key stages, including feature matching, bundle adjustment (BA), and depth estimation. Specifically, SuperPoint and SuperGlue are employed for feature extraction and matching, and a sliding-window strategy is used to process high-resolution UAV images. A BFGS-corrected Gauss–Newton solver is introduced to optimize the BA stage and reduce reprojection error. Finally, a fully convolutional network predicts the depth map from a sparse point cloud alongside the original image, with fused multi-view information. This approach addresses the inefficiencies and inaccuracies in feature matching that are typical of traditional SfM in complex scenes.
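For orientation, BA minimizes the standard total reprojection error, and a Gauss–Newton step solves the associated normal equations; in this summary notation (ours, not the paper’s), the BFGS correction amends the approximate Hessian:

```latex
\min_{\{C_i\},\,\{X_j\}} \sum_{(i,j)\in\mathcal{V}}
  \bigl\lVert \pi(C_i, X_j) - x_{ij} \bigr\rVert^{2},
\qquad
\Delta = -\bigl(J^{\top} J\bigr)^{-1} J^{\top} r
```

Here, C_i are camera parameters, X_j are 3D points, π is the projection function, x_ij is the observed keypoint for visible pair (i, j), r is the stacked residual vector, and J is its Jacobian.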
“YOLO-CBF: Optimized YOLOv7 Algorithm for Helmet Detection in Road Environments” introduces the YOLO-CBF algorithm, built upon the YOLOv7 framework, and proposes a three-fold optimization for helmet detection in road scenes. First, it incorporates coordinate convolution (CoordConv) to embed spatial coordinate channels into the input features, strengthening the network’s ability to perceive target locations and significantly improving detection accuracy for small and occluded targets. Second, the BiFormer dynamic sparse attention mechanism is integrated to filter the key regions used in attention computation through a two-level routing process, reducing complexity from O(N²) to O(N) while retaining global feature capture and computational efficiency. Third, the Focal-EIoU loss function is optimized for helmet detection in road scenarios by introducing weight coefficients that concentrate optimization on low-overlap samples. Additionally, the bounding box error is decomposed into overlap, center offset, and aspect ratio terms, enhancing regression accuracy. By combining spatial perception enhancement, dynamic feature focusing, and an accurate regression mechanism, the model balances lightweight operation with robust detection in complex environments.
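CoordConv itself is simple to sketch: the block below concatenates normalized coordinate channels before an ordinary convolution. The kernel size and the [-1, 1] normalization follow the original CoordConv paper and are assumptions here, not YOLO-CBF’s exact settings.

```python
import torch
import torch.nn as nn

class CoordConv2d(nn.Module):
    """CoordConv: concatenate normalized (x, y) channels before a conv."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + 2, out_ch, k, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        ys = torch.linspace(-1, 1, h, device=x.device).view(1, 1, h, 1).expand(b, 1, h, w)
        xs = torch.linspace(-1, 1, w, device=x.device).view(1, 1, 1, w).expand(b, 1, h, w)
        # The conv now sees where each feature sits in the image plane.
        return self.conv(torch.cat([x, xs, ys], dim=1))
```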
Feature extraction and image selection techniques focus on improving data representation. These techniques prove most beneficial in the preprocessing stage, providing quality data input for subsequent classification and detection tasks. The following articles in this Special Issue achieve optimization of feature representation in specific areas.
“Research on Railway Dispatcher Fatigue Detection Method Based on Deep Learning with Multi-Feature Fusion” focuses on a core issue in railway transportation safety, dispatcher fatigue detection, and proposes a multi-feature fusion detection method that combines facial key points and body postures. To address the susceptibility of traditional single-feature detection to occlusion and angle changes, the study builds a facial key point detection module on the RetinaFace model and the HRNet network; this module extracts physiological indicators, such as eye closure rate and blinking frequency, and analyzes fatigue behaviors, such as head drooping and lying on the table. The HOG-PSO-SVM algorithm is introduced to classify eye states and is combined with a Bi-LSTM-SVM adaptive boosting model to recognize complex postures. Finally, fatigue levels are determined by fusing five categories of features in an artificial neural network.
“Automatic Fabric Defect Detection Method Using AC-YOLOv5” proposes an improved YOLOv5 detection model, AC-YOLOv5, to address the detection of diverse defects with large scale differences against the complex textural backgrounds of textile fabrics. The model embeds an atrous spatial pyramid pooling (ASPP) module into the backbone network, extracting multi-scale features with convolutional kernels of different dilation rates. A convolutional squeeze-and-excitation (CSE) channel attention module is introduced to sharpen the network’s attention to defect features.
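A minimal ASPP sketch, with the common DeepLab dilation rates (1, 6, 12, 18) assumed rather than taken from the paper:

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Atrous spatial pyramid pooling: parallel dilated convs, fused 1x1."""
    def __init__(self, in_ch: int, out_ch: int, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates)
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each branch sees the same input at a different receptive field.
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
```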
“Detection of Fittings Based on the Dynamic Graph CNN and U-Net Embedded with Bi-Level Routing Attention” addresses the challenges of complex backgrounds, small targets, and occlusion in power fittings detection by proposing a combined U-Net and dynamic graph convolutional network (DGCNN) framework. Traditional 2D detection methods struggle to handle 3D spatial information, while acquiring 3D point cloud data is expensive. To overcome this, the authors generate pseudo-point cloud data using the Lite-Mono algorithm, converting 2D images into 3D point cloud representations. DGCNN is then used to extract geometric features of occluded accessories. Meanwhile, the feature extraction capability is enhanced by embedding a bidirectional routing attention (BRA) module within U-Net.
“Multi-Branch Spectral Channel Attention Network for Breast Cancer Histopathology Image Classification” introduces the multi-branch spectral channel attention network (MbsCANet) to improve the accuracy of breast cancer histopathology image classification. Whereas existing convolutional neural network methods rely on spatial features, the authors introduce a two-dimensional discrete cosine transform (DCT) into the channel attention mechanism. Fusing the lowest-frequency features with high-frequency information through a multi-branch structure helps preserve phase information and enhances the model’s context-awareness.
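A two-branch sketch of the idea, in which channel descriptors come from projecting features onto 2D DCT bases: branch (0, 0) reduces to global average pooling (the lowest frequency), while the choice of second frequency, the fixed feature map size, and the reduction ratio are all assumptions rather than MbsCANet’s configuration.

```python
import math
import torch
import torch.nn as nn

def dct_basis(u: int, v: int, h: int, w: int) -> torch.Tensor:
    """2D DCT-II basis function of frequency (u, v) on an h x w grid."""
    i = torch.arange(h).float()
    j = torch.arange(w).float()
    bi = torch.cos(math.pi * (i + 0.5) * u / h)
    bj = torch.cos(math.pi * (j + 0.5) * v / w)
    return bi[:, None] * bj[None, :]

class SpectralChannelAttention(nn.Module):
    """Two-branch spectral channel attention (illustrative sketch)."""
    def __init__(self, channels: int, h: int, w: int, r: int = 8):
        super().__init__()
        # Branch 1: lowest frequency (0,0) == global average pooling.
        # Branch 2: one assumed higher-frequency component (1,1).
        basis = torch.stack([dct_basis(0, 0, h, w), dct_basis(1, 1, h, w)])
        self.register_buffer("basis", basis)  # (2, h, w)
        self.mlp = nn.Sequential(
            nn.Linear(2 * channels, channels // r), nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Project each channel onto both frequency bases, then gate.
        desc = torch.einsum("bchw,khw->bkc", x, self.basis)  # (B, 2, C)
        gate = self.mlp(desc.flatten(1))                     # (B, C)
        return x * gate[:, :, None, None]

x = torch.randn(2, 64, 14, 14)
print(SpectralChannelAttention(64, 14, 14)(x).shape)  # (2, 64, 14, 14)
```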
“RS-Xception: A Lightweight Network for Facial Expression Recognition” presents RS-Xception, a lightweight facial expression recognition network designed to address the excessive parameter counts and low computational efficiency of existing models on embedded devices. RS-Xception integrates ResNet’s residual connectivity, SENet’s channel attention mechanism, and Xception’s depthwise separable convolution to achieve efficient feature extraction and classification through a modular design. The study introduces the SE-ResNet module, which enhances key features through squeeze-and-excitation operations and reduces computation using depthwise separable convolution.
“Robotic Grasping Detection Algorithm Based on 3D Vision Dual-Stream Encoding Strategy” presents SU-Grasp, a 3D vision-based dual-stream encoding strategy for robotic grasping detection that integrates the sliding-window self-attention mechanism of the Swin Transformer with the multi-scale feature fusion of U-Net. The model processes RGB images and depth images (with normal vector angle features) through two parallel encoders to enhance spatial semantic understanding, and it introduces normal vector angle images as a spatial prior, improving the perception of target geometries and surface orientations through cross-modal fusion. This research provides key technical support for the autonomous operation of robots in unstructured environments.
Pattern recognition techniques are directly applicable to target detection, classification, and segmentation tasks, with algorithmic optimization used to address practical challenges such as occlusion and small targets. The following articles in this Special Issue highlight task-specific algorithmic improvements.
“Content-Aware Image Resizing Technology Based on Composition Detection and Composition Rules” proposes an image scaling method that combines composition detection with composition rules to address the lack of aesthetic awareness in existing content-aware image scaling algorithms. A convolutional neural network-based composition classification module categorizes images into four compositions common in landscape photography, such as rule-of-thirds and symmetrical compositions, and the corresponding aesthetic rules are selected to guide the scaling operation according to the classification result. The graph-based visual saliency (GBVS) model and a collaborative segmentation algorithm generate an importance map, and rules such as the golden ratio are applied to optimize the placement of salient regions, ensuring that the scaled image retains important content while conforming to aesthetic principles.
“Lightweight Strawberry Instance Segmentation on Low-Power Devices for Picking Robots” presents a lightweight instance segmentation model tailored for strawberry-picking robots operating in complex orchard environments. These environments pose problems such as diverse fruit morphology and severe occlusion. The proposed model, StrawSeg, adopts MobileNetV2 as the backbone network to extract multi-scale features. It also designs a feature aggregation network (FAN) to merge different layers of features through a pixel blending operation, avoiding the computational overhead caused by interpolation or deconvolution.
“Application of Improved YOLOv5 Algorithm in Lightweight Transmission Line Small Target Defect Detection” focuses on insulator defect detection in UAV aerial images of transmission lines. The lightweight, improved DFCG_YOLOv5 algorithm is proposed to address noise interference, false detection of small targets, and slow detection speeds in complex backgrounds. A high-speed adaptive median filtering (HSMF) algorithm at the input stage effectively reduces image noise. The Ghost backbone network is optimized with the DFC attention mechanism to balance accuracy and speed in feature extraction. The original CIOU loss function is replaced with a Poly Loss function, whose parameters are adjusted for insulator defects and which addresses the imbalance between positive and negative samples, especially for small targets.
“A Method for Unseen Object Six Degrees of Freedom Pose Estimation Based on Segment Anything Model and Hybrid Distance Optimization” presents a method for six-degrees-of-freedom (6-DoF) pose estimation of unseen objects in complex scenes, leveraging the Segment Anything Model (SAM) and hybrid distance optimization. The authors improve SAM (CAE-SAM) to address boundary blurring, mask nulling, and over-segmentation using a local spatial feature enhancement module, global contextual labeling, and a bounding box generator, achieving high-quality zero-shot instance segmentation. Additionally, a point cloud alignment method based on hybrid distance metrics is introduced, combining the farthest point sampling (FPS) and fast global registration (FGR) algorithms to reduce dependence on hyperparameters.
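Farthest point sampling is a standard component of such pipelines and easy to sketch; the greedy loop below returns indices of k mutually distant points:

```python
import torch

def farthest_point_sampling(points: torch.Tensor, k: int) -> torch.Tensor:
    """Classic FPS over an (N, 3) point cloud: greedily pick k points
    that maximize the distance to the already-selected set."""
    n = points.shape[0]
    selected = torch.zeros(k, dtype=torch.long)  # index 0 seeds the set
    # Track each point's distance to the nearest selected point.
    dist = torch.full((n,), float("inf"))
    for i in range(1, k):
        d = torch.norm(points - points[selected[i - 1]], dim=1)
        dist = torch.minimum(dist, d)
        selected[i] = torch.argmax(dist)  # pick the most isolated point
    return selected

pts = torch.rand(1000, 3)
print(farthest_point_sampling(pts, 16))
```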
“Detection of Liquid Retention on Pipette Tips in High-Throughput Liquid Handling Workstations Based on Improved YOLOv8 Algorithm with Attention Mechanism” presents an improved YOLOv8-based detection method for addressing the challenge of liquid retention on pipette tips in high-throughput liquid handling workstations. The authors enhance the model’s ability to handle small targets and complex backgrounds through three key improvements: the global context (GC) attention module, which strengthens the model’s understanding of global features in the backbone network; the large kernel selection (LKS) module, which dynamically adjusts the receptive field to accommodate different backgrounds; and the simple attention (SimAM) mechanism, which generates attention weights to refine feature representation in the network’s neck.
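SimAM is parameter-free and compact enough to reproduce directly from its original paper: each activation is gated by an energy term measuring how much it deviates from its channel’s spatial statistics.

```python
import torch

def simam(x: torch.Tensor, lam: float = 1e-4) -> torch.Tensor:
    """Parameter-free SimAM attention, following the original paper;
    lam is the energy regularizer."""
    n = x.shape[2] * x.shape[3] - 1
    d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
    v = d.sum(dim=(2, 3), keepdim=True) / n  # per-channel spatial variance
    e_inv = d / (4 * (v + lam)) + 0.5        # inverse energy per neuron
    return x * torch.sigmoid(e_inv)
```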
“StrawSnake: A Real-Time Strawberry Instance Segmentation Network Based on the Contour Learning Approach” presents StrawSnake, a real-time strawberry instance segmentation network based on contour learning, which addresses the low accuracy and limited real-time performance of strawberry detection in complex environments. The authors design a dedicated octagonal contour, derived from the YOLOv8 detection box and extreme points, that tightly encloses the target. Dynamic snake convolution (DSConv) adaptively adjusts the receptive field through deformable convolution kernels, enhancing the perception of boundary curves, and a multi-scale feature enhancement block (MFRB) with a self-attention mechanism improves the model’s ability to aggregate multi-scale features.
The field of intelligent transportation relies on technologies such as monocular depth estimation and lightweight CNN models to promote autonomous driving and traffic management. The following articles in this Special Issue promote the development of technologies in this field.
Another article in this Special Issue proposes a self-supervised monocular depth estimation method that improves model performance by optimizing the image reconstruction loss and the left-right disparity image loss. Traditional methods rely on L1 or SSIM reconstruction losses, which fall short in low-texture or long-range regions. The authors introduce LPIPS (learned perceptual image patch similarity) as a perceptual loss, measuring reconstruction quality in a way that aligns more closely with human visual perception, and combine it with a left-right disparity image loss that aligns differences between the left and right views, reducing reconstruction distortions caused by factors such as lighting and camera calibration.
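A hedged sketch of such a combined objective, using the lpips package; the 0.15 perceptual weight is an assumption, not the paper’s tuned value:

```python
import torch
import lpips  # pip install lpips

# Pretrained AlexNet features score perceptual similarity in [-1, 1] space.
perceptual = lpips.LPIPS(net="alex")

def reconstruction_loss(pred: torch.Tensor, target: torch.Tensor,
                        alpha: float = 0.15) -> torch.Tensor:
    """pred/target: (B, 3, H, W) images scaled to [-1, 1] for LPIPS.
    Blends a pixel-wise L1 term with an LPIPS perceptual term."""
    l1 = (pred - target).abs().mean()
    lp = perceptual(pred, target).mean()
    return (1 - alpha) * l1 + alpha * lp
```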
“Convolutional Neural Networks Adapted for Regression Tasks: Predicting the Orientation of Straight Arrows on Marked Road Pavement Using Deep Learning and Rectified Orthophotography” presents a convolutional neural network (CNN)-based regression model for automatically recognizing the direction of road arrows. Traditional methods rely on manual feature extraction or single-stage detection and adapt poorly to variations in arrow direction within complex scenes. The authors design a customized lightweight CNN architecture (an ad hoc model), compare it with classical networks such as VGGNet and ResNet, and demonstrate the effectiveness of the lightweight network for this specific task. The study also explores the impact of data augmentation and transfer learning on model performance, providing a new solution for automated road marking recognition.
“Research on the Car Searching System in the Multi-Storey Garage with the RSSI Indoor Locating Based on Neural Network” designs a neural network-based RSSI indoor localization system for car retrieval in multi-story garages. The system integrates YOLOv5 and LPRNet for license plate detection and recognition and combines BP neural networks with the KNN algorithm to build an indoor localization module whose accuracy reaches 100% within 2.5 m. The A* algorithm is improved by introducing spatial accessibility into path planning, reducing ineffective search nodes by over 55% and improving operational efficiency by 28.5%. The experimental results show that the system automates the full process of license plate recognition, indoor localization, and optimal path planning.
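A compact illustration of accessibility-weighted A* on a 4-connected grid follows; the access cost model here is hypothetical, since the paper’s exact accessibility formulation is not reproduced in this summary:

```python
import heapq
import itertools

def a_star(free, start, goal, access):
    """A* over a 4-connected grid of passable cells `free` (a set of
    (row, col) tuples). `access.get(cell, 1.0)` >= 1 inflates the step
    cost into hard-to-reach cells, steering the search away from
    ineffective nodes (illustrative accessibility model)."""
    def h(p):  # Manhattan heuristic, admissible since step costs >= 1
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

    tie = itertools.count()  # tie-breaker so heap never compares tuples
    frontier = [(h(start), next(tie), 0.0, start)]
    parent, g_best = {start: None}, {start: 0.0}
    while frontier:
        _, _, g, cur = heapq.heappop(frontier)
        if cur == goal:  # walk parents back to recover the path
            path = [cur]
            while parent[path[-1]] is not None:
                path.append(parent[path[-1]])
            return path[::-1]
        if g > g_best[cur]:
            continue  # stale queue entry
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dx, cur[1] + dy)
            if nxt in free:
                ng = g + access.get(nxt, 1.0)  # accessibility-weighted step
                if ng < g_best.get(nxt, float("inf")):
                    g_best[nxt], parent[nxt] = ng, cur
                    heapq.heappush(frontier, (ng + h(nxt), next(tie), ng, nxt))
    return None
```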
“Improved 3D Object Detection Based on PointPillars” proposes an improved PointPillars method to address the insufficient accuracy of small target detection in 3D point cloud object detection. The study redesigns the attention mechanism (R-SENet) to strengthen key feature expression through dual channel and spatial attention. Dynamic convolution improves the network’s adaptability to different input features and optimizes the backbone, and a Transformer module refines candidate box regression by modeling global contextual relationships through self-attention.
In addition to the previously mentioned articles, “YOLO-CBF: Optimized YOLOv7 Algorithm for Helmet Detection in Road Environments” also makes important contributions to the field.
The optimization of neural network architecture design and training strategies constitutes another important research direction. The following articles in this Special Issue balance model efficiency and performance while optimizing networks for specific scenarios.
“YOLO-Drone: An Optimized YOLOv8 Network for Tiny UAV Object Detection” presents YOLO-Drone, an optimized YOLOv8 network designed to address the challenges of detecting tiny UAV targets against complex backgrounds. A high-resolution branch added to the detection head enhances small target detection, while redundant layers associated with large target detection are pruned to reduce model parameters. SPD-Conv replaces traditional strided convolution to extract multi-scale features and retain more detail, and the GAM attention mechanism is introduced in the neck to strengthen feature fusion.
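SPD-Conv is straightforward to sketch: a space-to-depth rearrangement followed by a non-strided convolution, so downsampling discards no pixels before the conv mixes them (the layer sizes below are illustrative):

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Space-to-depth followed by a non-strided conv (SPD-Conv sketch)."""
    def __init__(self, in_ch: int, out_ch: int, scale: int = 2):
        super().__init__()
        # (C, H, W) -> (C * scale^2, H / scale, W / scale), losslessly.
        self.s2d = nn.PixelUnshuffle(scale)
        self.conv = nn.Conv2d(in_ch * scale ** 2, out_ch, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(self.s2d(x))

x = torch.randn(1, 32, 64, 64)
print(SPDConv(32, 64)(x).shape)  # torch.Size([1, 64, 32, 32])
```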
“Toward Unified and Quantitative Cinematic Shot Attribute Analysis” presents a unified framework for cinematic shot attribute analysis, designed to process multiple attributes of a shot simultaneously through a motion-static dual-stream network. Traditional methods usually train an independent model for each attribute and cannot exploit global features. The authors introduce a learnable frame-difference generator to replace the optical flow network and extract spatiotemporal features by combining the Vision Transformer (ViT) and the Multi-scale Vision Transformer (MViT). By dynamically adjusting the weights of motion and static features through a quantitative fusion module, the model achieves optimal performance on both the MovieShots and AVE datasets, significantly outperforming existing methods. The study also quantifies, for the first time, the dependence of different attributes on motion and static features, providing a theoretical basis for the design of subsequent single-attribute analysis models.
“Depth-Quality Purification Feature Processing for Red Green Blue-Depth Salient Object Detection” introduces a depth-quality purification feature processing network (DQPFPNet) for RGB-D salient object detection, observing that most existing methods overlook the impact of depth feature quality on detection accuracy. The authors design a DQPFP module that performs depth denoising, quality weighting, and enhanced attention to filter and fuse multi-scale depth features, and they introduce a two-stage decoder to optimize context modeling. The experimental results demonstrate the importance of multi-scale feature processing and quality-aware fusion for salient object detection. The study also incorporates the RReLU activation function and a pixel position adaptive loss (PPAI) to further enhance the robustness and detail representation of the model.
“RVDR-YOLOv8: A Weed Target Detection Model Based on Improved YOLOv8” presents a lightweight weed detection model, RVDR-YOLOv8, based on an improved YOLOv8 framework and designed to address the limited computational resources of weeding robots. The study replaces the traditional backbone with a reversible column network (RevColNet), which reduces computation and improves feature generalization through reversible connections and a multi-input design. It introduces the C2fDWR module, which incorporates a dilated residual mechanism to enhance the recognition of occluded targets. Additionally, GSConv is used in the neck network in place of traditional convolution, further reducing computational complexity. The study also introduces the InnerMPDIoU loss function, which combines MPDIoU and Inner-IoU to improve bounding box regression accuracy.
“A Novel Deep Learning Framework Enhanced by Hybrid Optimization Using Dung Beetle and Fick’s Law for Superior Pneumonia Detection” presents a pneumonia detection framework based on hybrid optimization, integrating the dung beetle optimizer (DBO) and Fick’s law algorithm (FLA) to optimize feature selection and classification performance in convolutional neural networks (CNNs). The model is based on MobileNet V1, which reduces computational complexity through depthwise separable convolution, and it dynamically balances exploration and exploitation of the feature space by combining the global search capability of the DBO with the local optimization property of the FLA.
In addition to the previously mentioned articles, “YOLO-CBF: Optimized YOLOv7 Algorithm for Helmet Detection in Road Environments” also makes important contributions to the field.
The following articles in this Special Issue provide technical support for the field of biomedical image processing, with the aim of improving diagnostic accuracy through medical imaging-specific algorithms.
“LezioSeg: Multi-Scale Attention Affine-Based CNN for Segmenting Diabetic Retinopathy Lesions in Images” addresses the segmentation of diabetic retinopathy (DR) lesions with a data augmentation method that combines multi-scale attention and affine transformations. Traditional models rely on complex networks and generalize poorly. The authors design the LezioSeg network, which employs MobileNet as a lightweight encoder, integrates an ASPP module, and uses gated skip connections (GSC) to enhance feature extraction. Affine transformations increase data diversity, and the study demonstrates their effectiveness for small target segmentation, offering a lightweight solution for medical image analysis.
The previously mentioned articles “Multi-Branch Spectral Channel Attention Network for Breast Cancer Histopathology Image Classification”, “RS-Xception: A Lightweight Network for Facial Expression Recognition”, and “A Novel Deep Learning Framework Enhanced by Hybrid Optimization Using Dung Beetle and Fick’s Law for Superior Pneumonia Detection” also contribute to the field to varying degrees.
Image processing techniques for intelligent surveillance scenarios focus on human behavior analysis and security applications, and the following articles in this Special Issue reflect the importance of scenario-specific algorithm design.
“Digital Twin 3D System for Power Maintenance Vehicles Based on UWB and Deep Learning” proposes a digital twin system that combines ultra-wideband (UWB) localization and deep learning to enhance safety monitoring during power maintenance vehicle operations. The chaotic particle swarm optimization (CPSO) algorithm improves the TDOA/AOA localization scheme, effectively suppressing non-line-of-sight and multipath effects and significantly improving localization accuracy over traditional methods. Additionally, a YOLOv5-based robotic arm state recognition network is designed, incorporating the long-edge definition method, the SIoU loss function, and the CBAM attention mechanism, achieving an mAP of 85.04%. The system safeguards electric power operations through enhanced visualization and intelligent monitoring.
“A High-Precision Fall Detection Model Based on Dynamic Convolution in Complex Scenes” introduces ESD-YOLO, a high-precision fall detection model based on dynamic convolution, designed to address the insufficient accuracy of YOLOv8 in detecting human falls in complex environments. By replacing the C2f module in the backbone network with the C2Dv3 module, the model’s ability to capture target deformation and detail is enhanced. The DyHead dynamic detection head is integrated into the neck, and a multi-scale attention mechanism is introduced to improve the detection performance in occluded scenes. The EASlideloss loss function dynamically adjusts the weights of difficult samples, addressing the issue of sample imbalance. The experimental results show that ESD-YOLO significantly outperforms YOLOv8, showing stronger robustness, especially under low light, occlusion, and complex backgrounds.
“An Efficient Multi-Branch Attention Network for Person Re-Identification” presents EMANet, an efficient multi-branch attention network designed to address the challenges of person re-identification (Re-ID), such as cross-view angles, illumination changes, and occlusion. Its multi-branch structure comprises a global branch, a relational branch, and a global contrastive pooling branch, which collaboratively extract holistic, local, and background-suppressed features. The DAS attention module, built on depthwise separable and deformable convolution, dynamically focuses on salient regions, while an adaptive sparse pairwise loss optimizes sample pair selection and improves the model’s generalization ability.
“A Study of Occluded Person Re-Identification for Shared Feature Fusion with Pose-Guided and Unsupervised Semantic Segmentation” addresses occluded person re-identification by jointly extracting pose-guided human topological features and pixel-level semantic features from unsupervised semantic segmentation. The multi-branch structure employs a multi-scale correlation matching fusion (MCF) module to achieve feature complementarity. This study provides a robust solution for person re-identification in surveillance scenarios and is especially suited to identity matching in heavily occluded environments.
The previously mentioned article “Research on Railway Dispatcher Fatigue Detection Method Based on Deep Learning with Multi-Feature Fusion” has also made excellent contributions to this field.
End-to-end deep learning-based image processing techniques focus on image generation, reconstruction, and editing, and the aforementioned articles “Content-Aware Image Resizing Technology Based on Composition Detection and Composition Rules”, “MM-NeRF: Large-Scale Scene Representation with Multi-Resolution Hash Grid and Multi-View Priors Features”, “YOLO-CBF: Optimized YOLOv7 Algorithm for Helmet Detection in Road Environments”, “Research on 3D Visualization of Drone Scenes Based on Neural Radiance Fields”, and “Incremental SFM 3D Reconstruction Based on Deep Learning” all overcome the limitations of traditional image processing and achieve direct mapping from input to output.
The field of robotics and unmanned systems empowers autonomous operation through vision algorithms, and the previously mentioned articles “Lightweight Strawberry Instance Segmentation on Low-Power Devices for Picking Robots”, “A Method for Unseen Object Six Degrees of Freedom Pose Estimation Based on Segment Anything Model and Hybrid Distance Optimization”, “StrawSnake: A Real-Time Strawberry Instance Segmentation Network Based on the Contour Learning Approach”, and “Robotic Grasping Detection Algorithm Based on 3D Vision Dual-Stream Encoding Strategy” all contribute to the field to varying degrees.
AI-driven full-flow image processing technologies integrate compression, analysis, and generation. The following articles in this Special Issue provide technical support in this area.
“Improved YOLOV5 Angle Embossed Character Recognition by Multiscale Residual Attention with Selectable Clustering” addresses the recognition of small, mutilated characters against complex background interference in power pylon angle steel character recognition. The study proposes a multi-scale residual attention network based on an improved YOLOv5 (YOLOv5-R), introducing a multi-scale residual attention coding mechanism (MSRC) and a selectable cluster minimum iterative center module (OCMC). MSRC dynamically adjusts feature weights through global pooling and Softmax to focus attention on fine details, while OCMC uses IoU as the distance metric to optimize anchor box clustering and reduce reliance on prior knowledge. This approach effectively resolves character recognition challenges in industrial scenarios and provides reliable technical support for automated detection.
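IoU-driven anchor clustering can be sketched as k-means with 1 − IoU as the distance, in the spirit of OCMC’s metric; the paper’s iterative-center selection details are not reproduced here.

```python
import numpy as np

def iou_wh(boxes: np.ndarray, anchors: np.ndarray) -> np.ndarray:
    """IoU between (N, 2) box sizes and (K, 2) anchor sizes, centers aligned."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0]) *
             np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def anchor_kmeans(boxes: np.ndarray, k: int, iters: int = 100, seed: int = 0):
    """k-means over box shapes with 1 - IoU as the distance (YOLO-style)."""
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # Assign each box to the anchor with the highest IoU (lowest 1 - IoU).
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)
        new = np.array([boxes[assign == j].mean(axis=0) if np.any(assign == j)
                        else anchors[j] for j in range(k)])
        if np.allclose(new, anchors):
            break
        anchors = new
    return anchors

boxes = np.abs(np.random.default_rng(1).normal(50, 20, size=(500, 2)))
print(anchor_kmeans(boxes, k=9))
```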
In conjunction with the four previously mentioned articles “Toward Unified and Quantitative Cinematic Shot Attribute Analysis”, “Detection of Fittings Based on the Dynamic Graph CNN and U-Net Embedded with Bi-Level Routing Attention”, “Depth-Quality Purification Feature Processing for Red Green Blue-Depth Salient Object Detection”, and “Improved 3D Object Detection Based on PointPillars”, a comprehensive solution has been developed, spanning from data preprocessing to application.

Author Contributions

Conceptualization, A.W., H.W. and Y.I.; writing—original draft preparation, A.W. and H.W.; writing—review and editing, A.W. and H.W. All authors have read and agreed to the published version of the manuscript.

Funding

This article received no external funding.

Conflicts of Interest

The authors declare no conflicts of interest.

List of Contributions

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
