Electronics
  • Editorial
  • Open Access

11 April 2025

Advances in Computer Vision and Deep Learning and Its Applications

1 Heilongjiang Province Key Laboratory of Laser Spectroscopy Technology and Application, Harbin University of Science and Technology, Harbin 150080, China
2 Department of Computer Science, Chubu University, 1200 Matsumoto-cho, Kasugai 487-8501, Japan
* Authors to whom correspondence should be addressed.
This article belongs to the Special Issue Advances in Computer Vision and Deep Learning and Its Applications

2. Overview of This Special Issue

The articles included in this Special Issue cover advancements in ten research directions: computer vision; feature extraction and image selection; pattern recognition for image processing techniques; image processing in intelligent transportation; neural networks, machine learning, and deep learning; biomedical image processing and recognition; image processing for intelligent surveillance; deep learning for image processing; robotics and unmanned systems; and AI-based image processing, understanding, recognition, compression, and reconstruction. We have categorized the 33 articles in this Special Issue according to these directions, a classification that reflects both the vertical depth of each technology and the horizontal breadth of its cross-field applications. The classification system spans three dimensions: the technological layer, the type of task, and the target industry. Through this system, a clear technological lineage of computer vision and deep learning and their application fields can be constructed: basic algorithmic innovation provides theoretical support for each application field, customized optimization in vertical fields drives real-world deployment, and full-process integrated solutions further enhance the practicality and generalization ability of AI systems.
The development of the computer vision field presents a multi-dimensional technological evolution and has been widely adopted. In this Special Issue, new network architectures, attention mechanisms, and multimodal fusion technologies continue to make breakthroughs at the level of basic algorithmic innovation. The related articles are introduced as follows.
“An Improved YOLOv5 Underwater Detector Based on an Attention Mechanism and Multi-Branch Reparameterization Module” addresses the degradation of target detection accuracy caused by low image quality in underwater environments. A global attention mechanism (GAM) is introduced into the backbone network to enhance the interaction between channel and spatial information and improve feature extraction capability. A DAMO-YOLO-based fusion block is used in the neck to strengthen multi-scale feature aggregation, and the experimental results surpass advanced methods such as ViDT.
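To make the mechanism concrete, below is a minimal PyTorch sketch of a GAM-style block in the spirit of the published global attention mechanism: an MLP-based channel gate followed by a large-kernel convolutional spatial gate. The reduction ratio r and layer sizes are illustrative assumptions, not the paper’s settings.

```python
import torch
import torch.nn as nn

class GAMAttention(nn.Module):
    """GAM-style attention: channel MLP gate, then a 7x7 conv spatial gate.

    A simplified sketch of the global attention mechanism; the reduction
    ratio `r` and layer sizes here are illustrative assumptions.
    """
    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        hidden = channels // r
        # Channel submodule: an MLP applied along the channel dimension.
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
        )
        # Spatial submodule: large-kernel convs produce a per-pixel gate.
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, hidden, 7, padding=3), nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 7, padding=3), nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel gate: permute so the MLP mixes information across channels.
        attn = self.channel_mlp(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        x = x * torch.sigmoid(attn)
        # Spatial gate: large-kernel convs score each location.
        return x * torch.sigmoid(self.spatial(x))

x = torch.randn(1, 64, 32, 32)
print(GAMAttention(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```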
“ESD-YOLOv5: A Full-Surface Defect Detection Network for Bearing Collars” proposes an improved YOLOv5 model, ESD-YOLOv5, to address the challenge of detecting varied surface defects on bearing collars against complex backgrounds. A hybrid module combining efficient channel attention (ECA) and coordinate attention (CA) is constructed to enhance the network’s ability to localize defect features. A Slim-neck replaces the original neck structure to reduce the number of model parameters and the computational complexity, while the decoupled head of YOLOX is introduced to separate the classification and regression tasks.
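For reference, a minimal sketch of the ECA half of the hybrid module, assuming a fixed kernel size (the original ECA paper derives the kernel size adaptively from the channel count):

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient channel attention: a small 1D conv over pooled channels.

    Sketch of the ECA component; kernel size k=3 is an assumption.
    """
    def __init__(self, k: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Squeeze spatial dims, then let neighboring channels interact.
        y = x.mean(dim=(2, 3))                    # (B, C)
        y = self.conv(y.unsqueeze(1)).squeeze(1)  # (B, C)
        return x * torch.sigmoid(y)[:, :, None, None]
```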
“Consistent Weighted Correlation-Based Attention for Transformer Tracking” presents a consistent weighted correlation (CWC)-based attention mechanism for improving the performance of a Transformer architecture in visual tracking. The traditional attention computation of Transformer architectures handles each query-key pair independently, ignoring the consistency of the global context. By introducing the CWC module, the authors dynamically adjust the weights in the cross-attention block to enhance the consistency of relevant pairs and suppress the interference of irrelevant pairs.
“MM-NeRF: Large-Scale Scene Representation with Multi-Resolution Hash Grid and Multi-View Priors Features” proposes MM-NeRF, a large-scale neural radiance field (NeRF) method that integrates a multi-resolution hash grid with multi-view prior features. MM-NeRF adopts a two-branch structure: one branch uses a multi-resolution hash grid to efficiently encode the geometric details of the scene, while the other employs multi-view prior features to enhance texture information by fusing cross-view features. This two-branch structure enables MM-NeRF to mitigate the detail loss and high training costs typically associated with traditional NeRF methods on large-scale scenes.
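The hash-grid branch can be illustrated with a simplified Instant-NGP-style encoder; the level count, table size, and nearest-vertex lookup (in place of trilinear interpolation) below are simplifications rather than MM-NeRF’s actual configuration:

```python
import torch
import torch.nn as nn

class HashGridEncoder(nn.Module):
    """Simplified multi-resolution hash encoding (Instant-NGP style).

    Each level hashes quantized 3D coordinates into a small embedding
    table; all hyperparameters here are illustrative assumptions.
    """
    PRIMES = (1, 2654435761, 805459861)  # per-dimension hashing primes

    def __init__(self, levels: int = 8, table_size: int = 2 ** 14,
                 feat_dim: int = 2, base_res: int = 16, growth: float = 1.5):
        super().__init__()
        self.res = [int(base_res * growth ** l) for l in range(levels)]
        self.tables = nn.ModuleList(
            nn.Embedding(table_size, feat_dim) for _ in range(levels))
        self.table_size = table_size

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        # xyz: (N, 3) points in [0, 1]^3. Returns (N, levels * feat_dim).
        feats = []
        for res, table in zip(self.res, self.tables):
            idx = (xyz * res).long()      # nearest grid vertex at this level
            h = torch.zeros_like(idx[:, 0])
            for d, p in enumerate(self.PRIMES):
                h = h ^ (idx[:, d] * p)   # spatial hash of the vertex
            feats.append(table(h % self.table_size))
        return torch.cat(feats, dim=-1)

enc = HashGridEncoder()
print(enc(torch.rand(4, 3)).shape)  # torch.Size([4, 16])
```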
“Research on 3D Visualization of Drone Scenes Based on Neural Radiance Fields” presents a neural radiance field (NeRF) 3D visualization framework for UAV aerial photography scenes. The framework introduces a spatial boundary compression technique combined with a ground-optimized sampling strategy to reduce the number of sampling points in invalid regions. It adopts a multi-resolution hash grid and a clustering sampling method to optimize feature encoding and sampling efficiency, and it reduces outliers and blurring artifacts through L1-norm penalties and an entropy regularization loss. These features address the detail blurring, high computational costs, and cloud artifacts encountered in large-scale scene rendering.
“Incremental SFM 3D Reconstruction Based on Deep Learning” proposes an incremental structure-from-motion (SfM) 3D reconstruction method based on deep learning techniques. This study significantly improves the accuracy and efficiency of 3D reconstruction by refining key stages, including feature matching, bundle adjustment (BA), and depth estimation. Specifically, SuperPoint and SuperGlue are employed for feature extraction and matching, and a sliding-window strategy is used to process high-resolution UAV images. A BFGS-corrected Gauss–Newton solver is introduced to optimize the BA stage and reduce reprojection error. Finally, a fully convolutional network predicts the depth map from a sparse point cloud alongside the original image, with fused multi-view information. This approach addresses the inefficiencies and inaccuracies in feature matching that are typical of traditional SfM in complex scenes.
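For orientation, BA minimizes the standard total reprojection error, and a Gauss–Newton step solves the associated normal equations; in this summary notation (ours, not the paper’s), the BFGS correction amends the approximate Hessian:

```latex
\min_{\{C_i\},\,\{X_j\}} \sum_{(i,j)\in\mathcal{V}}
  \bigl\lVert \pi(C_i, X_j) - x_{ij} \bigr\rVert^{2},
\qquad
\Delta = -\bigl(J^{\top} J\bigr)^{-1} J^{\top} r
```

Here, C_i are camera parameters, X_j are 3D points, π is the projection function, x_ij is the observed keypoint for visible pair (i, j), r is the stacked residual vector, and J is its Jacobian.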
“YOLO-CBF: Optimized YOLOv7 Algorithm for Helmet Detection in Road Environments” introduces the YOLO-CBF algorithm, built upon the YOLOv7 framework, and proposes a three-fold optimization for helmet detection in road scenes. First, it incorporates coordinate convolution (CoordConv) to embed spatial coordinate channels into the input features, strengthening the network’s ability to perceive target locations and significantly improving detection accuracy for small and occluded targets. Second, the BiFormer dynamic sparse attention mechanism is integrated to filter the key regions used in attention computation through a two-level routing process, reducing complexity from O(N²) to O(N) while retaining global feature capture and computational efficiency. Third, the Focal-EIoU loss function is optimized for helmet detection in road scenarios by introducing weight coefficients that concentrate optimization on low-overlap samples. Additionally, the bounding box error is decomposed into overlap, center offset, and aspect ratio terms, enhancing regression accuracy. By combining spatial perception enhancement, dynamic feature focusing, and an accurate regression mechanism, the model balances lightweight operation with robust detection in complex environments.
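CoordConv itself is simple to sketch: the block below concatenates normalized coordinate channels before an ordinary convolution. The kernel size and the [-1, 1] normalization follow the original CoordConv paper and are assumptions here, not YOLO-CBF’s exact settings.

```python
import torch
import torch.nn as nn

class CoordConv2d(nn.Module):
    """CoordConv: concatenate normalized (x, y) channels before a conv."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + 2, out_ch, k, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        ys = torch.linspace(-1, 1, h, device=x.device).view(1, 1, h, 1).expand(b, 1, h, w)
        xs = torch.linspace(-1, 1, w, device=x.device).view(1, 1, 1, w).expand(b, 1, h, w)
        # The conv now sees where each feature sits in the image plane.
        return self.conv(torch.cat([x, xs, ys], dim=1))
```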
Feature extraction and image selection techniques focus on improving data representation. These techniques prove most beneficial in the preprocessing stage, providing quality data input for subsequent classification and detection tasks. The following articles in this Special Issue achieve optimization of feature representation in specific areas.
“Research on Railway Dispatcher Fatigue Detection Method Based on Deep Learning with Multi-Feature Fusion” focuses on a core issue in railway transportation safety, dispatcher fatigue detection, and proposes a multi-feature fusion detection method that combines facial key points and body postures. To address the susceptibility of traditional single-feature detection to occlusion and angle changes, the study builds a facial key point detection module on the RetinaFace model and the HRNet network; this module extracts physiological indicators, such as eye closure rate and blinking frequency, and analyzes fatigue behaviors, such as head drooping and lying on the table. The HOG-PSO-SVM algorithm is introduced to classify eye states and is combined with a Bi-LSTM-SVM adaptive boosting model to recognize complex postures. Finally, fatigue levels are determined by fusing five categories of features in an artificial neural network.
“Automatic Fabric Defect Detection Method Using AC-YOLOv5” proposes an improved YOLOv5 detection model, AC-YOLOv5, to address the detection of diverse defects with large scale differences against the complex textural backgrounds of textile fabrics. The model embeds an atrous spatial pyramid pooling (ASPP) module into the backbone network, extracting multi-scale features with convolutional kernels of different dilation rates. A convolutional squeeze-and-excitation (CSE) channel attention module is introduced to sharpen the network’s attention to defect features.
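A minimal ASPP sketch, with the common DeepLab dilation rates (1, 6, 12, 18) assumed rather than taken from the paper:

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Atrous spatial pyramid pooling: parallel dilated convs, fused 1x1."""
    def __init__(self, in_ch: int, out_ch: int, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates)
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each branch sees the same input at a different receptive field.
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
```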
“Detection of Fittings Based on the Dynamic Graph CNN and U-Net Embedded with Bi-Level Routing Attention” addresses the challenges of complex backgrounds, small targets, and occlusion in power fittings detection by proposing a combined U-Net and dynamic graph convolutional network (DGCNN) framework. Traditional 2D detection methods struggle to handle 3D spatial information, while acquiring 3D point cloud data is expensive. To overcome this, the authors generate pseudo-point cloud data using the Lite-Mono algorithm, converting 2D images into 3D point cloud representations. DGCNN is then used to extract geometric features of occluded accessories. Meanwhile, the feature extraction capability is enhanced by embedding a bidirectional routing attention (BRA) module within U-Net.
“Multi-Branch Spectral Channel Attention Network for Breast Cancer Histopathology Image Classification” introduces the multi-branch spectral channel attention network (MbsCANet) to improve the accuracy of breast cancer histopathology image classification. Whereas existing convolutional neural network methods rely on spatial features, the authors introduce a two-dimensional discrete cosine transform (DCT) into the channel attention mechanism. Fusing the lowest-frequency features with high-frequency information through a multi-branch structure helps preserve phase information and enhances the model’s context-awareness.
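A two-branch sketch of the idea, in which channel descriptors come from projecting features onto 2D DCT bases: branch (0, 0) reduces to global average pooling (the lowest frequency), while the choice of second frequency, the fixed feature map size, and the reduction ratio are all assumptions rather than MbsCANet’s configuration.

```python
import math
import torch
import torch.nn as nn

def dct_basis(u: int, v: int, h: int, w: int) -> torch.Tensor:
    """2D DCT-II basis function of frequency (u, v) on an h x w grid."""
    i = torch.arange(h).float()
    j = torch.arange(w).float()
    bi = torch.cos(math.pi * (i + 0.5) * u / h)
    bj = torch.cos(math.pi * (j + 0.5) * v / w)
    return bi[:, None] * bj[None, :]

class SpectralChannelAttention(nn.Module):
    """Two-branch spectral channel attention (illustrative sketch)."""
    def __init__(self, channels: int, h: int, w: int, r: int = 8):
        super().__init__()
        # Branch 1: lowest frequency (0,0) == global average pooling.
        # Branch 2: one assumed higher-frequency component (1,1).
        basis = torch.stack([dct_basis(0, 0, h, w), dct_basis(1, 1, h, w)])
        self.register_buffer("basis", basis)  # (2, h, w)
        self.mlp = nn.Sequential(
            nn.Linear(2 * channels, channels // r), nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Project each channel onto both frequency bases, then gate.
        desc = torch.einsum("bchw,khw->bkc", x, self.basis)  # (B, 2, C)
        gate = self.mlp(desc.flatten(1))                     # (B, C)
        return x * gate[:, :, None, None]

x = torch.randn(2, 64, 14, 14)
print(SpectralChannelAttention(64, 14, 14)(x).shape)  # (2, 64, 14, 14)
```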
“RS-Xception: A Lightweight Network for Facial Expression Recognition” presents RS-Xception, a lightweight facial expression recognition network designed to address the excessive parameter counts and low computational efficiency of existing models on embedded devices. RS-Xception integrates ResNet’s residual connectivity, SENet’s channel attention mechanism, and Xception’s depthwise separable convolution to achieve efficient feature extraction and classification through a modular design. The study introduces the SE-ResNet module, which enhances key features through squeeze-and-excitation operations and reduces computation using depthwise separable convolution.
“Robotic Grasping Detection Algorithm Based on 3D Vision Dual-Stream Encoding Strategy” presents SU-Grasp, a 3D vision-based dual-stream encoding strategy for robotic grasping detection that integrates the sliding-window self-attention mechanism of the Swin Transformer with the multi-scale feature fusion of U-Net. The model processes RGB images and depth images (with normal vector angle features) through two parallel encoders to enhance spatial semantic understanding, and it introduces normal vector angle images as a spatial prior, improving the perception of target geometries and surface orientations through cross-modal fusion. This research provides key technical support for the autonomous operation of robots in unstructured environments.
Pattern recognition techniques are directly applicable to target detection, classification, and segmentation tasks, with algorithmic optimization used to address practical challenges such as occlusion and small targets. The following articles in this Special Issue highlight task-specific algorithmic improvements.
“Content-Aware Image Resizing Technology Based on Composition Detection and Composition Rules” proposes an image scaling method that combines composition detection with composition rules to address the lack of aesthetic awareness in existing content-aware image scaling algorithms. A convolutional neural network-based composition classification module categorizes images into four compositions common in landscape photography, such as rule-of-thirds and symmetrical compositions, and the corresponding aesthetic rules are selected to guide the scaling operation according to the classification result. The graph-based visual saliency (GBVS) model and a collaborative segmentation algorithm generate an importance map, and rules such as the golden ratio are applied to optimize the placement of salient regions, ensuring that the scaled image retains important content while conforming to aesthetic principles.
“Lightweight Strawberry Instance Segmentation on Low-Power Devices for Picking Robots” presents a lightweight instance segmentation model tailored for strawberry-picking robots operating in complex orchard environments. These environments pose problems such as diverse fruit morphology and severe occlusion. The proposed model, StrawSeg, adopts MobileNetV2 as the backbone network to extract multi-scale features. It also designs a feature aggregation network (FAN) to merge different layers of features through a pixel blending operation, avoiding the computational overhead caused by interpolation or deconvolution.
“Application of Improved YOLOv5 Algorithm in Lightweight Transmission Line Small Target Defect Detection” focuses on insulator defect detection in UAV aerial images of transmission lines. The lightweight, improved DFCG_YOLOv5 algorithm is proposed to address noise interference, false detection of small targets, and slow detection speeds in complex backgrounds. A high-speed adaptive median filtering (HSMF) algorithm at the input stage effectively reduces image noise. The Ghost backbone network is optimized with the DFC attention mechanism to balance accuracy and speed in feature extraction. The original CIOU loss function is replaced with a Poly Loss function, whose parameters are adjusted for insulator defects and which addresses the imbalance between positive and negative samples, especially for small targets.
“A Method for Unseen Object Six Degrees of Freedom Pose Estimation Based on Segment Anything Model and Hybrid Distance Optimization” presents a method for six-degrees-of-freedom (6-DoF) pose estimation of unseen objects in complex scenes, leveraging the Segment Anything Model (SAM) and hybrid distance optimization. The authors improve SAM (CAE-SAM) to address boundary blurring, mask nulling, and over-segmentation using a local spatial feature enhancement module, global contextual labeling, and a bounding box generator, achieving high-quality zero-shot instance segmentation. Additionally, a point cloud alignment method based on hybrid distance metrics is introduced, combining the farthest point sampling (FPS) and fast global registration (FGR) algorithms to reduce dependence on hyperparameters.
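Farthest point sampling is a standard component of such pipelines and easy to sketch; the greedy loop below returns indices of k mutually distant points:

```python
import torch

def farthest_point_sampling(points: torch.Tensor, k: int) -> torch.Tensor:
    """Classic FPS over an (N, 3) point cloud: greedily pick k points
    that maximize the distance to the already-selected set."""
    n = points.shape[0]
    selected = torch.zeros(k, dtype=torch.long)  # index 0 seeds the set
    # Track each point's distance to the nearest selected point.
    dist = torch.full((n,), float("inf"))
    for i in range(1, k):
        d = torch.norm(points - points[selected[i - 1]], dim=1)
        dist = torch.minimum(dist, d)
        selected[i] = torch.argmax(dist)  # pick the most isolated point
    return selected

pts = torch.rand(1000, 3)
print(farthest_point_sampling(pts, 16))
```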
“Detection of Liquid Retention on Pipette Tips in High-Throughput Liquid Handling Workstations Based on Improved YOLOv8 Algorithm with Attention Mechanism” presents an improved YOLOv8-based detection method for addressing the challenge of liquid retention on pipette tips in high-throughput liquid handling workstations. The authors enhance the model’s ability to handle small targets and complex backgrounds through three key improvements: the global context (GC) attention module, which strengthens the model’s understanding of global features in the backbone network; the large kernel selection (LKS) module, which dynamically adjusts the receptive field to accommodate different backgrounds; and the simple attention (SimAM) mechanism, which generates attention weights to refine feature representation in the network’s neck.
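SimAM is parameter-free and compact enough to reproduce directly from its original paper: each activation is gated by an energy term measuring how much it deviates from its channel’s spatial statistics.

```python
import torch

def simam(x: torch.Tensor, lam: float = 1e-4) -> torch.Tensor:
    """Parameter-free SimAM attention, following the original paper;
    lam is the energy regularizer."""
    n = x.shape[2] * x.shape[3] - 1
    d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
    v = d.sum(dim=(2, 3), keepdim=True) / n  # per-channel spatial variance
    e_inv = d / (4 * (v + lam)) + 0.5        # inverse energy per neuron
    return x * torch.sigmoid(e_inv)
```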
“StrawSnake: A Real-Time Strawberry Instance Segmentation Network Based on the Contour Learning Approach” presents StrawSnake, a real-time strawberry instance segmentation network based on contour learning, which addresses the low accuracy and limited real-time performance of strawberry detection in complex environments. The authors design a dedicated octagonal contour, derived from the YOLOv8 detection box and extreme points, that tightly encloses the target. Dynamic snake convolution (DSConv) adaptively adjusts the receptive field through deformable convolution kernels, enhancing the perception of boundary curves, and a multi-scale feature enhancement block (MFRB) with a self-attention mechanism improves the model’s ability to aggregate multi-scale features.
The field of intelligent transportation relies on technologies such as monocular depth estimation and lightweight CNN models to promote autonomous driving and traffic management. The following articles in this Special Issue promote the development of technologies in this field.
Another article in this Special Issue proposes a self-supervised monocular depth estimation method that improves model performance by optimizing the image reconstruction loss and the left-right disparity image loss. Traditional methods rely on L1 or SSIM reconstruction losses, which fall short in low-texture or long-range regions. The authors introduce LPIPS (learned perceptual image patch similarity) as a perceptual loss, measuring reconstruction quality in a way that aligns more closely with human visual perception, and combine it with a left-right disparity image loss that aligns differences between the left and right views, reducing reconstruction distortions caused by factors such as lighting and camera calibration.
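A hedged sketch of such a combined objective, using the lpips package; the 0.15 perceptual weight is an assumption, not the paper’s tuned value:

```python
import torch
import lpips  # pip install lpips

# Pretrained AlexNet features score perceptual similarity in [-1, 1] space.
perceptual = lpips.LPIPS(net="alex")

def reconstruction_loss(pred: torch.Tensor, target: torch.Tensor,
                        alpha: float = 0.15) -> torch.Tensor:
    """pred/target: (B, 3, H, W) images scaled to [-1, 1] for LPIPS.
    Blends a pixel-wise L1 term with an LPIPS perceptual term."""
    l1 = (pred - target).abs().mean()
    lp = perceptual(pred, target).mean()
    return (1 - alpha) * l1 + alpha * lp
```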
“Convolutional Neural Networks Adapted for Regression Tasks: Predicting the Orientation of Straight Arrows on Marked Road Pavement Using Deep Learning and Rectified Orthophotography” presents a convolutional neural network (CNN)-based regression model for automatically recognizing the direction of road arrows. Traditional methods rely on manual feature extraction or single-stage detection and adapt poorly to variations in arrow direction within complex scenes. The authors design a customized lightweight CNN architecture (an ad hoc model), compare it with classical networks such as VGGNet and ResNet, and demonstrate the effectiveness of the lightweight network for this specific task. The study also explores the impact of data augmentation and transfer learning on model performance, providing a new solution for automated road marking recognition.
“Research on the Car Searching System in the Multi-Storey Garage with the RSSI Indoor Locating Based on Neural Network” designs a neural network-based RSSI indoor localization system for car retrieval in multi-story garages. The system integrates YOLOv5 and LPRNet for license plate detection and recognition and combines BP neural networks with the KNN algorithm to build an indoor localization module whose accuracy reaches 100% within 2.5 m. The A* algorithm is improved by introducing spatial accessibility into path planning, reducing ineffective search nodes by over 55% and improving operational efficiency by 28.5%. The experimental results show that the system automates the full process of license plate recognition, indoor localization, and optimal path planning.
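A compact illustration of accessibility-weighted A* on a 4-connected grid follows; the access cost model here is hypothetical, since the paper’s exact accessibility formulation is not reproduced in this summary:

```python
import heapq
import itertools

def a_star(free, start, goal, access):
    """A* over a 4-connected grid of passable cells `free` (a set of
    (row, col) tuples). `access.get(cell, 1.0)` >= 1 inflates the step
    cost into hard-to-reach cells, steering the search away from
    ineffective nodes (illustrative accessibility model)."""
    def h(p):  # Manhattan heuristic, admissible since step costs >= 1
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

    tie = itertools.count()  # tie-breaker so heap never compares tuples
    frontier = [(h(start), next(tie), 0.0, start)]
    parent, g_best = {start: None}, {start: 0.0}
    while frontier:
        _, _, g, cur = heapq.heappop(frontier)
        if cur == goal:  # walk parents back to recover the path
            path = [cur]
            while parent[path[-1]] is not None:
                path.append(parent[path[-1]])
            return path[::-1]
        if g > g_best[cur]:
            continue  # stale queue entry
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dx, cur[1] + dy)
            if nxt in free:
                ng = g + access.get(nxt, 1.0)  # accessibility-weighted step
                if ng < g_best.get(nxt, float("inf")):
                    g_best[nxt], parent[nxt] = ng, cur
                    heapq.heappush(frontier, (ng + h(nxt), next(tie), ng, nxt))
    return None
```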
“Improved 3D Object Detection Based on PointPillars” proposes an improved PointPillars method to address the insufficient accuracy of small target detection in 3D point cloud object detection. The study redesigns the attention mechanism (R-SENet) to strengthen key feature expression through dual channel and spatial attention. Dynamic convolution improves the network’s adaptability to different input features and optimizes the backbone, and a Transformer module refines candidate box regression by modeling global contextual relationships through self-attention.
In addition to the previously mentioned articles, “YOLO-CBF: Optimized YOLOv7 Algorithm for Helmet Detection in Road Environments” also makes important contributions to the field.
The optimization of neural network architecture design and training strategies constitutes another important research direction. The following articles in this Special Issue balance model efficiency and performance while optimizing networks for specific scenarios.
“YOLO-Drone: An Optimized YOLOv8 Network for Tiny UAV Object Detection” presents YOLO-Drone, an optimized YOLOv8 network designed to address the challenges of detecting tiny UAV targets against complex backgrounds. A high-resolution branch added to the detection head enhances small target detection, while redundant layers associated with large target detection are pruned to reduce model parameters. SPD-Conv replaces traditional strided convolution to extract multi-scale features and retain more detail, and the GAM attention mechanism is introduced in the neck to strengthen feature fusion.
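SPD-Conv is straightforward to sketch: a space-to-depth rearrangement followed by a non-strided convolution, so downsampling discards no pixels before the conv mixes them (the layer sizes below are illustrative):

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Space-to-depth followed by a non-strided conv (SPD-Conv sketch)."""
    def __init__(self, in_ch: int, out_ch: int, scale: int = 2):
        super().__init__()
        # (C, H, W) -> (C * scale^2, H / scale, W / scale), losslessly.
        self.s2d = nn.PixelUnshuffle(scale)
        self.conv = nn.Conv2d(in_ch * scale ** 2, out_ch, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(self.s2d(x))

x = torch.randn(1, 32, 64, 64)
print(SPDConv(32, 64)(x).shape)  # torch.Size([1, 64, 32, 32])
```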
“Toward Unified and Quantitative Cinematic Shot Attribute Analysis” presents a unified framework for cinematic shot attribute analysis, designed to process multiple attributes of a shot simultaneously through a motion-static dual-stream network. Traditional methods usually train an independent model for each attribute and cannot exploit global features. The authors introduce a learnable frame-difference generator to replace the optical flow network and extract spatiotemporal features by combining the Vision Transformer (ViT) and the Multi-scale Vision Transformer (MViT). By dynamically adjusting the weights of motion and static features through a quantitative fusion module, the model achieves optimal performance on both the MovieShots and AVE datasets, significantly outperforming existing methods. The study also quantifies, for the first time, the dependence of different attributes on motion and static features, providing a theoretical basis for the design of subsequent single-attribute analysis models.
“Depth-Quality Purification Feature Processing for Red Green Blue-Depth Salient Object Detection” introduces a depth-quality purification feature processing network (DQPFPNet) for RGB-D salient object detection, observing that most existing methods overlook the impact of depth feature quality on detection accuracy. The authors design a DQPFP module that performs depth denoising, quality weighting, and enhanced attention to filter and fuse multi-scale depth features, and they introduce a two-stage decoder to optimize context modeling. The experimental results demonstrate the importance of multi-scale feature processing and quality-aware fusion for salient object detection. The study also incorporates the RReLU activation function and a pixel position adaptive loss (PPAI) to further enhance the robustness and detail representation of the model.
“RVDR-YOLOv8: A Weed Target Detection Model Based on Improved YOLOv8” presents a lightweight weed detection model, RVDR-YOLOv8, based on an improved YOLOv8 framework and designed to address the limited computational resources of weeding robots. The study replaces the traditional backbone with a reversible column network (RevColNet), which reduces computation and improves feature generalization through reversible connections and a multi-input design. It introduces the C2fDWR module, which incorporates a dilated residual mechanism to enhance the recognition of occluded targets. Additionally, GSConv is used in the neck network in place of traditional convolution, further reducing computational complexity. The study also introduces the InnerMPDIoU loss function, which combines MPDIoU and Inner-IoU to improve bounding box regression accuracy.
“A Novel Deep Learning Framework Enhanced by Hybrid Optimization Using Dung Beetle and Fick’s Law for Superior Pneumonia Detection” presents a pneumonia detection framework based on hybrid optimization, integrating the dung beetle optimizer (DBO) and Fick’s law algorithm (FLA) to optimize feature selection and classification performance in convolutional neural networks (CNNs). The model is based on MobileNet V1, which reduces computational complexity through depthwise separable convolution, and it dynamically balances exploration and exploitation of the feature space by combining the global search capability of the DBO with the local optimization property of the FLA.
In addition to the previously mentioned articles, “YOLO-CBF: Optimized YOLOv7 Algorithm for Helmet Detection in Road Environments” also makes important contributions to the field.
The following articles in this Special Issue provide technical support for the field of biomedical image processing, with the aim of improving diagnostic accuracy through medical imaging-specific algorithms.
“LezioSeg: Multi-Scale Attention Affine-Based CNN for Segmenting Diabetic Retinopathy Lesions in Images” addresses the segmentation of diabetic retinopathy (DR) lesions with a data augmentation method that combines multi-scale attention and affine transformations. Traditional models rely on complex networks and generalize poorly. The authors design the LezioSeg network, which employs MobileNet as a lightweight encoder, integrates an ASPP module, and uses gated skip connections (GSC) to enhance feature extraction. Affine transformations increase data diversity, and the study demonstrates their effectiveness for small target segmentation, offering a lightweight solution for medical image analysis.
The previously mentioned articles “Multi-Branch Spectral Channel Attention Network for Breast Cancer Histopathology Image Classification”, “RS-Xception: A Lightweight Network for Facial Expression Recognition”, and “A Novel Deep Learning Framework Enhanced by Hybrid Optimization Using Dung Beetle and Fick’s Law for Superior Pneumonia Detection” also contribute to the field to varying degrees.
Image processing techniques for intelligent surveillance scenarios focus on human behavior analysis and security applications, and the following articles in this Special Issue reflect the importance of scenario-specific algorithm design.
“Digital Twin 3D System for Power Maintenance Vehicles Based on UWB and Deep Learning” proposes a digital twin system that combines ultra-wideband (UWB) localization and deep learning to enhance safety monitoring during power maintenance vehicle operations. The chaotic particle swarm optimization (CPSO) algorithm improves the TDOA/AOA localization scheme, effectively suppressing non-line-of-sight and multipath effects and significantly improving localization accuracy over traditional methods. Additionally, a YOLOv5-based robotic arm state recognition network is designed, incorporating the long-edge definition method, the SIoU loss function, and the CBAM attention mechanism, achieving an mAP of 85.04%. The system safeguards electric power operations through enhanced visualization and intelligent monitoring.
“A High-Precision Fall Detection Model Based on Dynamic Convolution in Complex Scenes” introduces ESD-YOLO, a high-precision fall detection model based on dynamic convolution, designed to address the insufficient accuracy of YOLOv8 in detecting human falls in complex environments. By replacing the C2f module in the backbone network with the C2Dv3 module, the model’s ability to capture target deformation and detail is enhanced. The DyHead dynamic detection head is integrated into the neck, and a multi-scale attention mechanism is introduced to improve the detection performance in occluded scenes. The EASlideloss loss function dynamically adjusts the weights of difficult samples, addressing the issue of sample imbalance. The experimental results show that ESD-YOLO significantly outperforms YOLOv8, showing stronger robustness, especially under low light, occlusion, and complex backgrounds.
“An Efficient Multi-Branch Attention Network for Person Re-Identification” presents EMANet, an efficient multi-branch attention network designed to address the challenges of person re-identification (Re-ID), such as cross-view angles, illumination changes, and occlusion. Its multi-branch structure comprises a global branch, a relational branch, and a global contrastive pooling branch, which collaboratively extract holistic, local, and background-suppressed features. The DAS attention module, built on depthwise separable and deformable convolution, dynamically focuses on salient regions, while an adaptive sparse pairwise loss optimizes sample pair selection and improves the model’s generalization ability.
“A Study of Occluded Person Re-Identification for Shared Feature Fusion with Pose-Guided and Unsupervised Semantic Segmentation” addresses occluded person re-identification by jointly extracting pose-guided human topological features and pixel-level semantic features from unsupervised semantic segmentation. The multi-branch structure employs a multi-scale correlation matching fusion (MCF) module to achieve feature complementarity. This study provides a robust solution for person re-identification in surveillance scenarios and is especially suited to identity matching in heavily occluded environments.
The previously mentioned article “Research on Railway Dispatcher Fatigue Detection Method Based on Deep Learning with Multi-Feature Fusion” has also made excellent contributions to this field.
End-to-end deep learning-based image processing techniques focus on image generation, reconstruction, and editing, and the aforementioned articles “Content-Aware Image Resizing Technology Based on Composition Detection and Composition Rules”, “MM-NeRF: Large-Scale Scene Representation with Multi-Resolution Hash Grid and Multi-View Priors Features”, “YOLO-CBF: Optimized YOLOv7 Algorithm for Helmet Detection in Road Environments”, “Research on 3D Visualization of Drone Scenes Based on Neural Radiance Fields”, and “Incremental SFM 3D Reconstruction Based on Deep Learning” all overcome the limitations of traditional image processing and achieve direct mapping from input to output.
The field of robotics and unmanned systems empowers autonomous operation through vision algorithms, and the previously mentioned articles “Lightweight Strawberry Instance Segmentation on Low-Power Devices for Picking Robots”, “A Method for Unseen Object Six Degrees of Freedom Pose Estimation Based on Segment Anything Model and Hybrid Distance Optimization”, “StrawSnake: A Real-Time Strawberry Instance Segmentation Network Based on the Contour Learning Approach”, and “Robotic Grasping Detection Algorithm Based on 3D Vision Dual-Stream Encoding Strategy” all contribute to the field to varying degrees.
AI-driven full-flow image processing technologies integrate compression, analysis, and generation. The following articles in this Special Issue provide technical support in this area.
“Improved YOLOV5 Angle Embossed Character Recognition by Multiscale Residual Attention with Selectable Clustering” addresses the recognition of small, mutilated characters against complex background interference in power pylon angle steel character recognition. The study proposes a multi-scale residual attention network based on an improved YOLOv5 (YOLOv5-R), introducing a multi-scale residual attention coding mechanism (MSRC) and a selectable cluster minimum iterative center module (OCMC). MSRC dynamically adjusts feature weights through global pooling and Softmax to focus attention on fine details, while OCMC uses IoU as the distance metric to optimize anchor box clustering and reduce reliance on prior knowledge. This approach effectively resolves character recognition challenges in industrial scenarios and provides reliable technical support for automated detection.
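IoU-driven anchor clustering can be sketched as k-means with 1 − IoU as the distance, in the spirit of OCMC’s metric; the paper’s iterative-center selection details are not reproduced here.

```python
import numpy as np

def iou_wh(boxes: np.ndarray, anchors: np.ndarray) -> np.ndarray:
    """IoU between (N, 2) box sizes and (K, 2) anchor sizes, centers aligned."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0]) *
             np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def anchor_kmeans(boxes: np.ndarray, k: int, iters: int = 100, seed: int = 0):
    """k-means over box shapes with 1 - IoU as the distance (YOLO-style)."""
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # Assign each box to the anchor with the highest IoU (lowest 1 - IoU).
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)
        new = np.array([boxes[assign == j].mean(axis=0) if np.any(assign == j)
                        else anchors[j] for j in range(k)])
        if np.allclose(new, anchors):
            break
        anchors = new
    return anchors

boxes = np.abs(np.random.default_rng(1).normal(50, 20, size=(500, 2)))
print(anchor_kmeans(boxes, k=9))
```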
In conjunction with the four previously mentioned articles “Toward Unified and Quantitative Cinematic Shot Attribute Analysis”, “Detection of Fittings Based on the Dynamic Graph CNN and U-Net Embedded with Bi-Level Routing Attention”, “Depth-Quality Purification Feature Processing for Red Green Blue-Depth Salient Object Detection”, and “Improved 3D Object Detection Based on PointPillars”, a comprehensive solution has been developed, spanning from data preprocessing to application.

Author Contributions

Conceptualization, A.W., H.W. and Y.I.; writing—original draft preparation, A.W. and H.W.; writing—review and editing, A.W. and H.W. All authors have read and agreed to the published version of the manuscript.

Funding

This article received no external funding.

Conflicts of Interest

The authors declare no conflicts of interest.

List of Contributions

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
