Search Results (56)

Search Parameters:
Keywords = fine-grained visual recognition

22 pages, 3139 KiB  
Article
A Counterfactual Fine-Grained Aircraft Classification Network for Remote Sensing Images Based on Normalized Coordinate Attention
by Zeya Zhao, Wenyin Tuo, Shuai Zhang and Xinbo Zhao
Appl. Sci. 2025, 15(16), 8903; https://doi.org/10.3390/app15168903 - 12 Aug 2025
Viewed by 193
Abstract
Fine-grained aircraft classification in remote sensing is a critical task within the field of remote sensing image processing, aiming to precisely distinguish between different types of aircraft in aerial images. Due to the high visual similarity among aircraft targets in remote sensing images, accurately capturing subtle and discriminative features becomes a key technical challenge for fine-grained aircraft classification. In this context, we propose a Normalized Coordinate Attention-Based Counterfactual Classification Network (NCC-Net), which emphasizes the spatial positional information of aircraft targets and effectively captures long-range dependencies, thereby enabling precise localization of various aircraft components. Furthermore, we analyze the proposed network from a causal perspective, encouraging the model to focus on key discriminative features of the aircraft while minimizing distraction from the surrounding environment and background. Experimental results on three benchmark datasets demonstrate the superiority of our method. Specifically, NCC-Net achieves Top-1 classification accuracies of 97.7% on FAIR1M, 95.2% on MTARSI2, and 98.4% on ARSI120, outperforming several state-of-the-art methods. These results highlight the effectiveness and generalizability of our proposed method for fine-grained remote sensing target recognition. Full article
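
As a concrete reference for the attention mechanism the abstract describes, the sketch below implements a generic coordinate-attention block in PyTorch: factorized pooling along the height and width axes injects positional information into channel reweighting. It only illustrates the coordinate-attention idea; NCC-Net's exact normalization and its counterfactual branch are not specified here, and the BatchNorm/reduction settings are assumptions.

```python
# Minimal coordinate-attention sketch (not NCC-Net's exact module).
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)          # assumed normalization choice
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        # Pool along width -> (b, c, h, 1); pool along height -> (b, c, 1, w) -> (b, c, w, 1)
        x_h = x.mean(dim=3, keepdim=True)                   # encodes vertical position
        x_w = x.mean(dim=2, keepdim=True).transpose(2, 3)   # encodes horizontal position
        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                     # (b, c, h, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.transpose(2, 3)))     # (b, c, 1, w)
        return x * a_h * a_w                                      # position-aware reweighting

x = torch.randn(2, 64, 32, 32)
print(CoordinateAttention(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```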

17 pages, 840 KiB  
Article
Improving Person Re-Identification via Feature Erasing-Driven Data Augmentation
by Shangdong Zhu and Huayan Zhang
Mathematics 2025, 13(16), 2580; https://doi.org/10.3390/math13162580 - 12 Aug 2025
Viewed by 200
Abstract
Person re-identification (Re-ID) has attracted considerable attention in the field of computer vision, primarily due to its critical role in video surveillance and public security applications. However, most existing Re-ID approaches rely on image-level erasing techniques, which may inadvertently remove fine-grained visual cues that are essential for accurate identification. To mitigate this limitation, we propose an effective feature erasing-based data augmentation framework that aims to explore discriminative information within individual samples and improve overall recognition performance. Specifically, we first introduce a diagonal swapping augmentation strategy to increase the diversity of the training samples. Secondly, we design a feature erasing-driven method applied to the extracted pedestrian feature to capture identity-relevant information at the feature level. Finally, extensive experiments demonstrate that our method achieves competitive performance compared to many representative approaches. Full article
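
To make the feature-level erasing idea concrete, here is a minimal PyTorch sketch that zeroes a random spatial patch of the extracted pedestrian feature map during training; the patch ratio and per-sample placement are illustrative assumptions, not the paper's exact augmentation.

```python
# Feature-level erasing sketch: erase in feature space rather than pixel space.
import torch

def feature_erase(feat: torch.Tensor, ratio: float = 0.3, training: bool = True) -> torch.Tensor:
    """feat: (B, C, H, W) feature map from a Re-ID backbone."""
    if not training:
        return feat
    b, c, h, w = feat.shape
    eh, ew = max(1, int(h * ratio)), max(1, int(w * ratio))
    out = feat.clone()
    for i in range(b):  # independent erase location per sample
        top = torch.randint(0, h - eh + 1, (1,)).item()
        left = torch.randint(0, w - ew + 1, (1,)).item()
        out[i, :, top:top + eh, left:left + ew] = 0.0
    return out

feats = torch.randn(4, 2048, 24, 8)   # typical ResNet-50 Re-ID feature-map shape
print(feature_erase(feats).shape)     # torch.Size([4, 2048, 24, 8])
```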

24 pages, 6260 KiB  
Article
Transforming Product Discovery and Interpretation Using Vision–Language Models
by Simona-Vasilica Oprea and Adela Bâra
J. Theor. Appl. Electron. Commer. Res. 2025, 20(3), 191; https://doi.org/10.3390/jtaer20030191 - 1 Aug 2025
Viewed by 530
Abstract
In this work, the utility of multimodal vision–language models (VLMs) for visual product understanding in e-commerce is investigated, focusing on two complementary models: ColQwen2 (vidore/colqwen2-v1.0) and ColPali (vidore/colpali-v1.2-hf). These models are integrated into two architectures and evaluated across various product interpretation tasks, including image-grounded question answering, brand recognition and visual retrieval based on natural language prompts. ColQwen2, built on the Qwen2-VL backbone with LoRA-based adapter hot-swapping, demonstrates strong performance, allowing end-to-end image querying and text response synthesis. It excels at identifying attributes such as brand, color or usage based solely on product images and responds fluently to user questions. In contrast, ColPali, which utilizes the PaliGemma backbone, is optimized for explainability. It delivers detailed visual-token alignment maps that reveal how specific regions of an image contribute to retrieval decisions, offering transparency ideal for diagnostics or educational applications. Through comparative experiments using footwear imagery, it is demonstrated that ColQwen2 is highly effective in generating accurate responses to product-related questions, while ColPali provides fine-grained visual explanations that reinforce trust and model accountability. Full article
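
Both ColPali and ColQwen2 are late-interaction retrievers, so a useful mental model is ColBERT-style MaxSim scoring between query-token and image-patch embeddings. The sketch below uses random tensors as stand-ins for encoder outputs and is not tied to either model's actual API.

```python
# MaxSim late-interaction scoring sketch (embeddings are random stand-ins).
import torch
import torch.nn.functional as F

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """query_emb: (Q, D) token embeddings; doc_emb: (P, D) image-patch embeddings."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    sim = q @ d.T                        # (Q, P) cosine similarities
    return sim.max(dim=1).values.sum()   # best patch per query token, summed

query = torch.randn(12, 128)                                 # e.g. "red running shoes with white sole"
product_pages = [torch.randn(1030, 128) for _ in range(3)]   # patch embeddings per product image
scores = torch.stack([maxsim_score(query, p) for p in product_pages])
print("best match:", scores.argmax().item())
```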

18 pages, 5309 KiB  
Article
LGM-YOLO: A Context-Aware Multi-Scale YOLO-Based Network for Automated Structural Defect Detection
by Chuanqi Liu, Yi Huang, Zaiyou Zhao, Wenjing Geng and Tianhong Luo
Processes 2025, 13(8), 2411; https://doi.org/10.3390/pr13082411 - 29 Jul 2025
Viewed by 267
Abstract
Ensuring the structural safety of steel trusses in escalators is critical for the reliable operation of vertical transportation systems. While manual inspection remains widely used, its dependence on human judgment leads to extended cycle times and variable defect-recognition rates, making it less reliable for identifying subtle surface imperfections. To address these limitations, a novel context-aware, multi-scale deep learning framework based on the YOLOv5 architecture is proposed, which is specifically designed for automated structural defect detection in escalator steel trusses. Firstly, a method called GIES is proposed to synthesize pseudo-multi-channel representations from single-channel grayscale images, which enhances the network’s channel-wise representation and mitigates issues arising from image noise and defocused blur. To further improve detection performance, a context enhancement pipeline is developed, consisting of a local feature module (LFM) for capturing fine-grained surface details and a global context module (GCM) for modeling large-scale structural deformations. In addition, a multi-scale feature fusion module (MSFM) is employed to effectively integrate spatial features across various resolutions, enabling the detection of defects with diverse sizes and complexities. Comprehensive testing on the NEU-DET and GC10-DET datasets reveals that the proposed method achieves 79.8% mAP on NEU-DET and 68.1% mAP on GC10-DET, outperforming the baseline YOLOv5s by 8.0% and 2.7%, respectively. Although challenges remain in identifying extremely fine defects such as crazing, the proposed approach offers improved accuracy while maintaining real-time inference speed. These results indicate the potential of the method for intelligent visual inspection in structural health monitoring and industrial safety applications. Full article
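
The abstract does not spell out how GIES synthesizes pseudo-multi-channel inputs, so the following is only a hedged illustration of the general idea: pairing the raw grayscale channel with a denoised and an edge-enhanced view so a standard three-channel detector backbone receives richer channel-wise cues.

```python
# Illustrative pseudo-multi-channel synthesis from a grayscale image (not the exact GIES method).
import cv2
import numpy as np

def pseudo_multichannel(gray: np.ndarray) -> np.ndarray:
    """gray: (H, W) uint8 grayscale image -> (H, W, 3) pseudo-RGB input."""
    denoised = cv2.GaussianBlur(gray, (5, 5), 0)   # suppress sensor noise / defocus artifacts
    edges = cv2.Laplacian(gray, cv2.CV_64F)        # emphasize defect boundaries
    edges = cv2.convertScaleAbs(edges)             # back to uint8 range
    return np.stack([gray, denoised, edges], axis=-1)

gray = (np.random.rand(256, 256) * 255).astype(np.uint8)   # stand-in truss image
print(pseudo_multichannel(gray).shape)                     # (256, 256, 3)
```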

24 pages, 10460 KiB  
Article
WGGLFA: Wavelet-Guided Global–Local Feature Aggregation Network for Facial Expression Recognition
by Kaile Dong, Xi Li, Cong Zhang, Zhenhua Xiao and Runpu Nie
Biomimetics 2025, 10(8), 495; https://doi.org/10.3390/biomimetics10080495 - 27 Jul 2025
Viewed by 400
Abstract
Facial expression plays an important role in human–computer interaction and affective computing. However, existing expression recognition methods cannot effectively capture multi-scale structural details contained in facial expressions, leading to a decline in recognition accuracy. Inspired by the multi-scale processing mechanism of the biological visual system, this paper proposes a wavelet-guided global–local feature aggregation network (WGGLFA) for facial expression recognition (FER). Our WGGLFA network consists of three main modules: the scale-aware expansion (SAE) module, which combines dilated convolution and wavelet transform to capture multi-scale contextual features; the structured local feature aggregation (SLFA) module, which extracts structured local features based on facial keypoints; and the expression-guided region refinement (ExGR) module, which enhances features from high-response expression areas to improve the collaborative modeling between local details and key expression regions. All three modules utilize the spatial frequency locality of the wavelet transform to achieve high-/low-frequency feature separation, thereby enhancing fine-grained expression representation under frequency-domain guidance. Experimental results show that our WGGLFA achieves accuracies of 90.32%, 91.24%, and 71.90% on the RAF-DB, FERPlus, and FED-RO datasets, respectively, demonstrating that our WGGLFA is effective and more robust and generalizable than state-of-the-art (SOTA) expression recognition methods. Full article
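
A single-level 2D Haar transform is the simplest way to see the high-/low-frequency separation these modules rely on; the sketch below splits a feature map into an approximation band and three detail bands. The paper's actual wavelet choice and how each module consumes the sub-bands are not given in the abstract.

```python
# Single-level 2D Haar decomposition sketch for high-/low-frequency separation.
import torch

def haar_dwt2d(x: torch.Tensor):
    """x: (B, C, H, W) with even H, W -> (LL, LH, HL, HH), each (B, C, H/2, W/2)."""
    a = x[..., 0::2, 0::2]
    b = x[..., 0::2, 1::2]
    c = x[..., 1::2, 0::2]
    d = x[..., 1::2, 1::2]
    ll = (a + b + c + d) / 2.0   # low-frequency structure
    lh = (a - b + c - d) / 2.0   # horizontal details
    hl = (a + b - c - d) / 2.0   # vertical details
    hh = (a - b - c + d) / 2.0   # diagonal details
    return ll, lh, hl, hh

feat = torch.randn(1, 64, 56, 56)
ll, lh, hl, hh = haar_dwt2d(feat)
print(ll.shape)  # torch.Size([1, 64, 28, 28])
```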

21 pages, 12122 KiB  
Article
RA3T: An Innovative Region-Aligned 3D Transformer for Self-Supervised Sim-to-Real Adaptation in Low-Altitude UAV Vision
by Xingrao Ma, Jie Xie, Di Shao, Aiting Yao and Chengzu Dong
Electronics 2025, 14(14), 2797; https://doi.org/10.3390/electronics14142797 - 11 Jul 2025
Viewed by 332
Abstract
Low-altitude unmanned aerial vehicle (UAV) vision is critically hindered by the Sim-to-Real Gap, where models trained exclusively on simulation data degrade under real-world variations in lighting, texture, and weather. To address this problem, we propose RA3T (Region-Aligned 3D Transformer), a novel self-supervised framework that enables robust Sim-to-Real adaptation. Specifically, we first develop a dual-branch strategy for self-supervised feature learning, integrating Masked Autoencoders and contrastive learning. This approach extracts domain-invariant representations from unlabeled simulated imagery to enhance robustness against occlusion while reducing annotation dependency. Leveraging these learned features, we then introduce a 3D Transformer fusion module that unifies multi-view RGB and LiDAR point clouds through cross-modal attention. By explicitly modeling spatial layouts and height differentials, this component significantly improves recognition of small and occluded targets in complex low-altitude environments. To address persistent fine-grained domain shifts, we finally design region-level adversarial calibration that deploys local discriminators on partitioned feature maps. This mechanism directly aligns texture, shadow, and illumination discrepancies which challenge conventional global alignment methods. Extensive experiments on UAV benchmarks VisDrone and DOTA demonstrate the effectiveness of RA3T. The framework achieves +5.1% mAP on VisDrone and +7.4% mAP on DOTA over the 2D adversarial baseline, particularly on small objects and sparse occlusions, while maintaining real-time performance of 17 FPS at 1024 × 1024 resolution on an RTX 4080 GPU. Visual analysis confirms that the synergistic integration of 3D geometric encoding and local adversarial alignment effectively mitigates domain gaps caused by uneven illumination and perspective variations, establishing an efficient pathway for simulation-to-reality UAV perception. Full article
(This article belongs to the Special Issue Innovative Technologies and Services for Unmanned Aerial Vehicles)
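
The region-level adversarial calibration can be pictured as a small fully convolutional discriminator scoring each cell of a partitioned feature map through a gradient-reversal layer, as in the hedged sketch below; the discriminator depth and region granularity are assumptions rather than RA3T's exact design.

```python
# Region-level adversarial alignment sketch: per-region sim-vs-real logits with gradient reversal.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lamb * grad, None   # reversed gradient flows into the backbone

class RegionDiscriminator(nn.Module):
    """Outputs a simulated-vs-real logit per spatial region of the feature map."""
    def __init__(self, channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 128, 1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 1, 1))

    def forward(self, feat, lamb: float = 1.0):
        return self.net(GradReverse.apply(feat, lamb))   # (B, 1, H, W) region logits

feat_sim = torch.randn(2, 256, 16, 16, requires_grad=True)
disc = RegionDiscriminator(256)
logits = disc(feat_sim)
loss = F.binary_cross_entropy_with_logits(logits, torch.zeros_like(logits))  # label 0 = simulated
loss.backward()
print(logits.shape)  # torch.Size([2, 1, 16, 16])
```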

17 pages, 548 KiB  
Article
Enhanced Localisation and Handwritten Digit Recognition Using ConvCARU
by Sio-Kei Im and Ka-Hou Chan
Appl. Sci. 2025, 15(12), 6772; https://doi.org/10.3390/app15126772 - 16 Jun 2025
Viewed by 357
Abstract
Predicting the motion of handwritten digits in video sequences is challenging due to complex spatiotemporal dependencies, variable writing styles, and the need to preserve fine-grained visual details—all of which are essential for real-time handwriting recognition and digital learning applications. In this context, our study aims to develop a robust predictive framework that can accurately forecast digit trajectories while preserving structural integrity. To address these challenges, we propose a novel video prediction architecture integrating ConvCARU with a modified DCGAN to effectively separate the background from the foreground. This ensures the enhanced extraction and preservation of spatial and temporal features through convolution-based gating and adaptive fusion mechanisms. Based on extensive experiments conducted on the MNIST dataset, which comprises 70 K pixel images, our approach achieves an SSIM of 0.901 and a PSNR of 29.31 dB. This reflects a statistically significant improvement in PSNR of +0.20 dB (p < 0.05) compared to current state-of-the-art models, thus demonstrating its superior capability in maintaining consistent structural fidelity in predicted video frames. Furthermore, our framework performs better in terms of computational efficiency, with lower memory consumption compared to most other approaches. This underscores its practicality for deployment in real-time, resource-constrained applications. These promising results consequently validate the effectiveness of our integrated ConvCARU–DCGAN approach in capturing fine-grained spatiotemporal dependencies, positioning it as a compelling solution for enhancing video-based handwriting recognition and sequence forecasting. This paves the way for its adoption in diverse applications requiring high-resolution, efficient motion prediction. Full article
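
ConvCARU's exact gate equations are not reproduced in the abstract, so the sketch below shows a generic convolutional gated recurrent cell (ConvGRU-style) purely to illustrate what convolution-based gating over video frames looks like.

```python
# Generic convolutional gated recurrent cell sketch (not the exact ConvCARU formulation).
import torch
import torch.nn as nn

class ConvGatedCell(nn.Module):
    def __init__(self, in_ch: int, hidden_ch: int, k: int = 3):
        super().__init__()
        p = k // 2
        self.gates = nn.Conv2d(in_ch + hidden_ch, 2 * hidden_ch, k, padding=p)
        self.cand = nn.Conv2d(in_ch + hidden_ch, hidden_ch, k, padding=p)

    def forward(self, x, h):
        z, r = torch.chunk(torch.sigmoid(self.gates(torch.cat([x, h], 1))), 2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], 1)))   # candidate state
        return (1 - z) * h + z * h_tilde                            # gated update

cell = ConvGatedCell(1, 32)
h = torch.zeros(4, 32, 64, 64)
for t in range(10):                       # roll over a 10-frame digit sequence
    frame = torch.randn(4, 1, 64, 64)
    h = cell(frame, h)
print(h.shape)  # torch.Size([4, 32, 64, 64])
```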

19 pages, 1486 KiB  
Article
A Dual-Enhanced Hierarchical Alignment Framework for Multimodal Named Entity Recognition
by Jian Wang, Yanan Zhou, Qi He and Wenbo Zhang
Appl. Sci. 2025, 15(11), 6034; https://doi.org/10.3390/app15116034 - 27 May 2025
Viewed by 506
Abstract
Multimodal named entity recognition (MNER) is a natural language processing technique that integrates text and visual modalities to detect and segment entity boundaries and their types from unstructured multimodal data. Although existing methods alleviate semantic deficiencies by optimizing image and text feature extraction and fusion, a fundamental challenge remains due to the lack of fine-grained alignment caused by cross-modal semantic deviations and image noise interference. To address these issues, this paper proposes a dual-enhanced hierarchical alignment (DEHA) framework that achieves dual semantic and spatial enhancement via global–local cooperative alignment optimization. The proposed framework incorporates a dual enhancement strategy comprising Semantic-Augmented Global Contrast (SAGC) and Multi-scale Spatial Local Contrast (MS-SLC), which reinforce the alignment of image and text modalities at the global sample level and local feature level, respectively, thereby reducing image noise. Additionally, a cross-modal feature fusion and vision-constrained CRF prediction layer is designed to achieve adaptive aggregation of global and local features. Experimental results on the Twitter-2015 and Twitter-2017 datasets yield F1 scores of 77.42% and 88.79%, outperforming baseline models. These results demonstrate that the global–local complementary mechanism effectively balances alignment precision and noise robustness, thereby enhancing entity recognition accuracy in social media and advancing multimodal semantic understanding. Full article
(This article belongs to the Special Issue Intelligence Image Processing and Patterns Recognition)
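
The global, sample-level part of the alignment (SAGC) can be illustrated with a symmetric InfoNCE image-text contrastive loss, sketched below with stand-in embeddings; the temperature and embedding size are illustrative, and the local (MS-SLC) branch is omitted.

```python
# Symmetric InfoNCE sketch for global sample-level image-text contrast.
import torch
import torch.nn.functional as F

def global_contrastive_loss(img_emb, txt_emb, temperature: float = 0.07):
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / temperature       # (B, B) similarity matrix
    targets = torch.arange(img.size(0))      # diagonal pairs are the positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

img_emb = torch.randn(16, 256)   # pooled image features (stand-ins)
txt_emb = torch.randn(16, 256)   # pooled token features (stand-ins)
print(global_contrastive_loss(img_emb, txt_emb).item())
```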

19 pages, 8750 KiB  
Article
FP-Deeplab: A Novel Face Parsing Network for Fine-Grained Boundary Detection and Semantic Understanding
by Borui Zeng, Can Shu, Ziqi Liao, Jingru Yu, Zhiyu Liu and Xiaoyan Chen
Appl. Sci. 2025, 15(11), 6016; https://doi.org/10.3390/app15116016 - 27 May 2025
Viewed by 455
Abstract
Facial semantic segmentation, as a critical technology in high-level visual understanding, plays an important role in applications such as facial editing, augmented reality, and identity recognition. However, due to the complexity of facial structures, ambiguous boundaries, and inconsistent scales of facial components, traditional methods still suffer from significant limitations in detail preservation and contextual modeling. To address these challenges, this paper proposes a facial parsing network based on the Deeplabv3+ framework, named FP-Deeplab, which aims to improve segmentation performance and generalization capability through structurally enhanced modules. Specifically, two key modules are designed: (1) the Context-Channel Refine Feature Enhancement (CCR-FE) module, which integrates multi-scale contextual strip convolutions and Cross-Axis Attention and introduces a channel attention mechanism to strengthen the modeling of long-range spatial dependencies and enhances the perception and representation of boundary regions; (2) the Self-Modulation Attention Feature Integration with Regularization (SimFA) module, which combines local detail modeling and a parameter-free channel attention modulation mechanism to achieve fine-grained reconstruction and enhancement of semantic features, effectively mitigating boundary blur and information loss during the upsampling stage. The experimental results on two public facial segmentation datasets, CelebAMask-HQ and HELEN, demonstrate that FP-Deeplab improves the baseline model by 3.8% in Mean IoU and 2.3% in the overall F1-score on the HELEN dataset, and it achieves a Mean F1-score of 84.8% on the CelebAMask-HQ dataset. Furthermore, the proposed method shows superior accuracy and robustness in multiple key component categories, especially in long-tailed regions, validating its effectiveness. Full article
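
A hedged sketch of the multi-scale contextual strip convolutions mentioned for CCR-FE: pairs of 1×k and k×1 depthwise convolutions capture long-range context along each axis far more cheaply than a full k×k kernel. The kernel sizes and residual fusion are illustrative choices, not the paper's exact module.

```python
# Multi-scale strip-convolution sketch for long-range contextual modeling.
import torch
import torch.nn as nn

class StripConvBlock(nn.Module):
    def __init__(self, channels: int, kernel_sizes=(7, 11)):
        super().__init__()
        self.branches = nn.ModuleList()
        for k in kernel_sizes:
            self.branches.append(nn.Sequential(
                nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2), groups=channels),
                nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0), groups=channels)))
        self.fuse = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        ctx = sum(branch(x) for branch in self.branches)   # aggregate axis-wise context
        return x + self.fuse(ctx)                          # residual refinement

x = torch.randn(1, 128, 64, 64)
print(StripConvBlock(128)(x).shape)   # torch.Size([1, 128, 64, 64])
```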

21 pages, 5452 KiB  
Article
HFC-YOLO11: A Lightweight Model for the Accurate Recognition of Tiny Remote Sensing Targets
by Jinyin Bai, Wei Zhu, Zongzhe Nie, Xin Yang, Qinglin Xu and Dong Li
Computers 2025, 14(5), 195; https://doi.org/10.3390/computers14050195 - 18 May 2025
Cited by 1 | Viewed by 1470
Abstract
To address critical challenges in tiny object detection within remote sensing imagery, including resolution–semantic imbalance, inefficient feature fusion, and insufficient localization accuracy, this study proposes Hierarchical Feature Compensation You Only Look Once 11 (HFC-YOLO11), a lightweight detection model based on hierarchical feature compensation. Firstly, by reconstructing the feature pyramid architecture, we preserve the high-resolution P2 feature layer in shallow networks to enhance the fine-grained feature representation for tiny targets, while eliminating redundant P5 layers to reduce the computational complexity. In addition, a depth-aware differentiated module design strategy is proposed: GhostBottleneck modules are adopted in shallow layers to improve its feature reuse efficiency, while standard Bottleneck modules are maintained in deep layers to strengthen the semantic feature extraction. Furthermore, an Extended Intersection over Union loss function (EIoU) is developed, incorporating boundary alignment penalty terms and scale-adaptive weight mechanisms to optimize the sub-pixel-level localization accuracy. Experimental results on the AI-TOD and VisDrone2019 datasets demonstrate that the improved model achieves mAP50 improvements of 3.4% and 2.7%, respectively, compared to the baseline YOLO11s, while reducing its parameters by 27.4%. Ablation studies validate the balanced performance of the hierarchical feature compensation strategy in the preservation of resolution and computational efficiency. Visualization results confirm an enhanced robustness against complex background interference. HFC-YOLO11 exhibits superior accuracy and generalization capability in tiny object detection tasks, effectively meeting practical application requirements for tiny object recognition. Full article
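
The abstract does not give the exact form of the proposed loss, so the sketch below follows the commonly used EIoU formulation (an IoU term plus center-distance and side-length penalties over the smallest enclosing box) to show how boundary-alignment penalties sharpen localization for tiny boxes.

```python
# EIoU-style loss sketch: IoU term + center-distance and width/height alignment penalties.
import torch

def eiou_loss(pred, target, eps: float = 1e-7):
    """pred, target: (N, 4) boxes as (x1, y1, x2, y2)."""
    x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Smallest enclosing box: diagonal and side lengths
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Center-distance and width/height alignment penalties
    dx = (pred[:, 0] + pred[:, 2] - target[:, 0] - target[:, 2]) / 2
    dy = (pred[:, 1] + pred[:, 3] - target[:, 1] - target[:, 3]) / 2
    dw = (pred[:, 2] - pred[:, 0]) - (target[:, 2] - target[:, 0])
    dh = (pred[:, 3] - pred[:, 1]) - (target[:, 3] - target[:, 1])
    loss = 1 - iou + (dx ** 2 + dy ** 2) / c2 + dw ** 2 / (cw ** 2 + eps) + dh ** 2 / (ch ** 2 + eps)
    return loss.mean()

pred = torch.tensor([[10., 10., 20., 22.]]); gt = torch.tensor([[11., 9., 21., 20.]])
print(eiou_loss(pred, gt).item())
```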

28 pages, 9332 KiB  
Article
Contrastive Learning-Based Cross-Modal Fusion for Product Form Imagery Recognition: A Case Study on New Energy Vehicle Front-End Design
by Yutong Zhang, Jiantao Wu, Li Sun and Guoan Yang
Sustainability 2025, 17(10), 4432; https://doi.org/10.3390/su17104432 - 13 May 2025
Viewed by 659
Abstract
Fine-grained feature extraction and affective semantic mapping remain significant challenges in product form analysis. To address these issues, this study proposes a contrastive learning-based cross-modal fusion approach for product form imagery recognition, using the front-end design of new energy vehicles (NEVs) as a case study. The proposed method first employs the Biterm Topic Model (BTM) and Analytic Hierarchy Process (AHP) to extract thematic patterns and compute weight distributions from consumer review texts, thereby identifying key imagery style labels. These labels are then leveraged for image annotation, facilitating the construction of a multimodal dataset. Next, ResNet-50 and Transformer architectures serve as the image and text encoders, respectively, to extract and represent multimodal features. To ensure effective alignment and deep fusion of textual and visual representations in a shared embedding space, a contrastive learning mechanism is introduced, optimizing cosine similarity between positive and negative sample pairs. Finally, a fully connected multilayer network is integrated at the output of the Transformer and ResNet with Contrastive Learning (TRCL) model to enhance classification accuracy and reliability. Comparative experiments against various deep convolutional neural networks (DCNNs) demonstrate that the TRCL model effectively integrates semantic and visual information, significantly improving the accuracy and robustness of complex product form imagery recognition. These findings suggest that the proposed method holds substantial potential for large-scale product appearance evaluation and affective cognition research. Moreover, this data-driven fusion underpins sustainable product form design by streamlining evaluation and optimizing resource use. Full article
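
The contrastive mechanism described here, maximizing cosine similarity for matched image/label-text pairs and pushing mismatched pairs below a margin, can be expressed directly with PyTorch's CosineEmbeddingLoss; the margin value and the random embeddings below are stand-ins for the ResNet-50 and Transformer encoder outputs.

```python
# Cosine-similarity contrastive objective sketch over positive/negative pairs.
import torch
import torch.nn as nn

criterion = nn.CosineEmbeddingLoss(margin=0.2)   # margin is an illustrative choice

img_emb = torch.randn(8, 512)   # ResNet-50 projections (stand-ins)
txt_emb = torch.randn(8, 512)   # Transformer projections of imagery-style labels (stand-ins)
labels = torch.tensor([1, 1, 1, 1, -1, -1, -1, -1])   # +1 = matched pair, -1 = mismatched

print(criterion(img_emb, txt_emb, labels).item())
```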

27 pages, 1868 KiB  
Article
MACA-Net: Mamba-Driven Adaptive Cross-Layer Attention Network for Multi-Behavior Recognition in Group-Housed Pigs
by Zhixiong Zeng, Zaoming Wu, Runtao Xie, Kai Lin, Shenwen Tan, Xinyuan He and Yizhi Luo
Agriculture 2025, 15(9), 968; https://doi.org/10.3390/agriculture15090968 - 29 Apr 2025
Viewed by 784
Abstract
The accurate recognition of pig behaviors in intensive farming is crucial for health monitoring and growth assessment. To address multi-scale recognition challenges caused by perspective distortion (non-frontal camera angles), this study proposes MACA-Net, a YOLOv8n-based model capable of detecting four key behaviors: eating, lying on the belly, lying on the side, and standing. The model incorporates a Mamba Global–Local Extractor (MGLE) Module, which leverages Mamba to capture global dependencies while preserving local details through convolutional operations and channel shuffle, overcoming Mamba’s limitation in retaining fine-grained visual information. Additionally, an Adaptive Multi-Path Attention (AMPA) mechanism integrates spatial-channel attention to enhance feature focus, ensuring robust performance in complex environments and low-light conditions. To further improve detection, a Cross-Layer Feature Pyramid Transformer (CFPT) neck employs non-upsampled feature fusion, mitigating semantic gap issues where small target features are overshadowed by large target features during feature transmission. Experimental results demonstrate that MACA-Net achieves a precision of 83.1% and mAP of 85.1%, surpassing YOLOv8n by 8.9% and 4.4%, respectively. Furthermore, MACA-Net significantly reduces parameters by 48.4% and FLOPs by 39.5%. When evaluated in comparison to leading detectors such as RT-DETR, Faster R-CNN, and YOLOv11n, MACA-Net demonstrates a consistent level of both computational efficiency and accuracy. These findings provide a robust validation of the efficacy of MACA-Net for intelligent livestock management and welfare-driven breeding, offering a practical and efficient solution for modern pig farming. Full article
(This article belongs to the Special Issue Modeling of Livestock Breeding Environment and Animal Behavior)
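
One building block named in the MGLE description, channel shuffle, is simple enough to show in full; the sketch below mixes information across channel groups after grouped convolutions, with the group count as an illustrative choice.

```python
# Channel-shuffle sketch: interleave channels across groups.
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    b, c, h, w = x.shape
    assert c % groups == 0
    # (B, G, C/G, H, W) -> swap group and channel axes -> flatten back to (B, C, H, W)
    return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

x = torch.arange(8, dtype=torch.float32).view(1, 8, 1, 1)
print(channel_shuffle(x, groups=2).flatten().tolist())   # [0, 4, 1, 5, 2, 6, 3, 7]
```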

23 pages, 2486 KiB  
Article
Learning High-Order Features for Fine-Grained Visual Categorization with Causal Inference
by Yuhang Zhang, Yuan Wan, Jiahui Hao, Zaili Yang and Huanhuan Li
Mathematics 2025, 13(8), 1340; https://doi.org/10.3390/math13081340 - 19 Apr 2025
Viewed by 568
Abstract
Recently, causal models have gained significant attention in natural language processing (NLP) and computer vision (CV) due to their capability of capturing features with causal relationships. This study addresses Fine-Grained Visual Categorization (FGVC) by incorporating high-order feature fusions to improve the representation of feature interactions while mitigating the influence of confounding factors through causal inference. A novel high-order feature learning framework with causal inference is developed to enhance FGVC. A causal graph tailored to FGVC is constructed, and the causal assumptions of baseline models are analyzed to identify confounding factors. A reconstructed causal structure establishes meaningful interactions between individual images and image pairs. Causal interventions are applied by severing specific causal links, effectively reducing confounding effects and enhancing model robustness. The framework combines high-order feature fusion with interventional fine-grained learning by performing causal interventions on both classifiers and categories. The experimental results demonstrate that the proposed method achieves accuracies of 90.7% on CUB-200, 92.0% on FGVC-Aircraft, and 94.8% on Stanford Cars, highlighting its effectiveness and robustness across these widely used fine-grained recognition datasets. Full article
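
A minimal sketch of the second-order (bilinear) feature interaction that high-order fusion builds on: the outer product of channel features over spatial positions captures pairwise feature co-occurrences. The signed square-root and L2 normalization steps are the usual post-processing and an assumption here; the causal-intervention machinery is not shown.

```python
# Second-order (bilinear) pooling sketch for high-order feature interaction.
import torch
import torch.nn.functional as F

def bilinear_pool(feat: torch.Tensor) -> torch.Tensor:
    """feat: (B, C, H, W) -> (B, C*C) second-order representation."""
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    gram = torch.bmm(f, f.transpose(1, 2)) / (h * w)          # (B, C, C) feature co-occurrences
    gram = gram.reshape(b, -1)
    gram = torch.sign(gram) * torch.sqrt(gram.abs() + 1e-12)  # signed square-root
    return F.normalize(gram, dim=-1)                          # L2 normalization

feat = torch.randn(2, 64, 14, 14)   # stand-in backbone output for a CUB-200 image
print(bilinear_pool(feat).shape)    # torch.Size([2, 4096])
```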

22 pages, 5294 KiB  
Article
Text-in-Image Enhanced Self-Supervised Alignment Model for Aspect-Based Multimodal Sentiment Analysis on Social Media
by Xuefeng Zhao, Yuxiang Wang and Zhaoman Zhong
Sensors 2025, 25(8), 2553; https://doi.org/10.3390/s25082553 - 17 Apr 2025
Viewed by 742
Abstract
The rapid development of social media has driven the need for opinion mining and sentiment analysis based on multimodal samples. As a fine-grained task within multimodal sentiment analysis, aspect-based multimodal sentiment analysis (ABMSA) enables the accurate and efficient determination of sentiment polarity for aspect-level targets. However, traditional ABMSA methods often perform suboptimally on social media samples, as the images in these samples typically contain embedded text that conventional models overlook. Such text influences sentiment judgment. To address this issue, we propose a text-in-image enhanced self-supervised alignment model (TESAM) that accounts for multimodal information more comprehensively. Specifically, we employed Optical Character Recognition technology to extract embedded text from images and, based on the principle that text-in-image is an integral part of the visual modality, fused it with visual features to obtain more comprehensive image representations. Additionally, we incorporate aspect words to guide the model in disregarding irrelevant semantic features, thereby reducing noise interference. Furthermore, to mitigate the semantic gap between modalities, we propose pre-training the feature extraction module with self-supervised alignment. During this pre-training stage, unimodal semantic embeddings from both modalities are aligned by calculating errors using Euclidean distance and cosine similarity. Experimental results demonstrate that TESAM achieved remarkable performances on three ABMSA benchmarks. These results validate the rationale and effectiveness of our proposed improvements. Full article
(This article belongs to the Special Issue Advanced Signal Processing for Affective Computing)
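
The self-supervised alignment objective described for the pre-training stage, combining Euclidean distance and cosine-similarity error between unimodal embeddings, can be sketched as below; the equal weighting of the two terms is an assumption.

```python
# Alignment-loss sketch: Euclidean distance plus cosine-similarity error.
import torch
import torch.nn.functional as F

def alignment_loss(text_emb: torch.Tensor, img_emb: torch.Tensor) -> torch.Tensor:
    euclidean = F.pairwise_distance(text_emb, img_emb).mean()
    cosine_err = (1.0 - F.cosine_similarity(text_emb, img_emb, dim=-1)).mean()
    return euclidean + cosine_err   # equal weighting is an illustrative assumption

text_emb = torch.randn(8, 768)   # sentence embeddings of the post text (stand-ins)
img_emb = torch.randn(8, 768)    # visual embeddings fused with OCR'd text-in-image (stand-ins)
print(alignment_loss(text_emb, img_emb).item())
```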

20 pages, 623 KiB  
Article
Fast Normalization for Bilinear Pooling via Eigenvalue Regularization
by Sixiang Xu, Huihui Dong, Chen Zhang and Chaoxue Wang
Appl. Sci. 2025, 15(8), 4155; https://doi.org/10.3390/app15084155 - 10 Apr 2025
Viewed by 491
Abstract
Bilinear pooling, as an aggregation approach that outputs second-order statistics of deep learning features, has demonstrated effectiveness in a wide range of visual recognition tasks. Among major improvements to bilinear pooling, matrix square root normalization—applied to the bilinear representation matrix—is regarded as a crucial step for further boosting performance. However, most existing works leverage Newton’s iteration to perform normalization, which becomes computationally inefficient when dealing with high-dimensional features. To address this limitation, through a comprehensive analysis, we reveal that both the distribution and magnitude of eigenvalues in the bilinear representation matrix play an important role in the network performance. Building upon this insight, we propose a novel approach, namely RegCov, which regularizes the eigenvalues when the normalization is absent. Specifically, RegCov incorporates two regularization terms that encourage the network to align the current eigenvalues with the target ones in terms of their distribution and magnitude. We implement RegCov across different network architectures and run extensive experiments on the ImageNet1K and fine-grained image classification benchmarks. The results demonstrate that RegCov maintains robust recognition across diverse datasets and network architectures while achieving superior inference speed compared to previous works. Full article
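
The abstract states only that two regularizers steer the eigenvalues of the bilinear (covariance) matrix toward a target distribution and magnitude, so the sketch below penalizes (i) the mismatch between the normalized eigenvalue spectrum and a target spectrum and (ii) the gap in total magnitude, as one plausible reading rather than RegCov's exact terms.

```python
# Eigenvalue-regularization sketch over the bilinear (covariance) representation.
import torch

def eigenvalue_regularizer(feat: torch.Tensor, target_eigs: torch.Tensor):
    """feat: (B, C, N) local features; target_eigs: (C,) desired spectrum (ascending)."""
    cov = torch.bmm(feat, feat.transpose(1, 2)) / feat.shape[-1]   # (B, C, C) bilinear matrix
    eigs = torch.linalg.eigvalsh(cov)                              # real eigenvalues, ascending
    dist = eigs / eigs.sum(dim=-1, keepdim=True)                   # normalized spectrum
    target_dist = target_eigs / target_eigs.sum()
    r_distribution = (dist - target_dist).pow(2).sum(dim=-1).mean()     # distribution term
    r_magnitude = (eigs.sum(dim=-1) - target_eigs.sum()).pow(2).mean()  # magnitude term
    return r_distribution, r_magnitude

feat = torch.randn(4, 64, 196)                  # 14x14 spatial positions flattened
target = torch.sort(torch.rand(64)).values      # stand-in target spectrum
print([t.item() for t in eigenvalue_regularizer(feat, target)])
```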
