Search Results (121)

Search Parameters:
Keywords = multimodal feature aggregation

23 pages, 5036 KB  
Article
Distilling Vision Foundation Models into LiDAR Networks via Manifold-Aware Topological Alignment
by Yuchuan Yang and Xiaosu Xu
Computers 2026, 15(4), 234; https://doi.org/10.3390/computers15040234 - 9 Apr 2026
Abstract
LiDAR point cloud semantic segmentation is essential for autonomous driving, yet LiDAR-only methods remain constrained by sparsity and limited texture cues. We propose Cross-Modal Collaborative Manifold Distillation (CMCMD), which transfers open-world semantic priors from the DINOv3 Vision Foundation Model to a LiDAR student network. The framework combines an Adaptive Relation Convolution (ARConv) backbone with geometry-conditioned aggregation, a Unified Bidirectional Mapping Module (UBMM) for explicit 2D–3D interaction, and Manifold-Aware Topological Distillation (MATD), which aligns inter-sample affinity structures in a shared latent manifold rather than enforcing pointwise feature matching. By preserving relational topology instead of absolute feature coordinates, CMCMD mitigates negative transfer across heterogeneous modalities. Experiments on SemanticKITTI and nuScenes yield mIoU values of 72.9% and 81.2%, respectively, surpassing the compared distillation baselines and approaching the performance of multimodal fusion methods at lower inference cost. Additional evaluation on real-world campus scenes further supports the cross-domain robustness of the proposed framework. Full article
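The relational-topology idea behind MATD can be sketched in a few lines: instead of matching teacher and student features pointwise, one compares their inter-sample affinity matrices. The following is a hypothetical numpy illustration of that principle only, not the paper's implementation; the function names and the Frobenius-style loss are assumptions.

```python
import numpy as np

def affinity(feats):
    """Row-normalize features, then return the pairwise cosine-affinity matrix (n x n)."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return f @ f.T

def relational_distill_loss(teacher_feats, student_feats):
    """Match inter-sample affinity structure rather than absolute coordinates.
    Teacher and student feature widths may differ; only n must agree."""
    diff = affinity(teacher_feats) - affinity(student_feats)
    return float(np.mean(diff ** 2))
```

Because the loss depends only on normalized pairwise relations, it is invariant to the feature scale and dimensionality of each network, which is the property the abstract credits with mitigating negative transfer.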

25 pages, 4248 KB  
Article
A Spatial Post-Multiscale Fusion Entropy and Multi-Feature Synergy Model for Disturbance Identification of Charging Stations
by Hui Zhou, Xiujuan Zeng, Tong Liu, Wei Wu, Bolun Du and Yinglong Diao
Energies 2026, 19(8), 1837; https://doi.org/10.3390/en19081837 - 8 Apr 2026
Abstract
The large-scale integration and grid connection of renewable energy sources and charging stations introduce a multitude of nonlinear and impact loads, resulting in more severe distortion and higher complexity of disturbance signals in power systems. As a consequence, power quality disturbances (PQDs) in active distribution networks, including overvoltage and harmonics, display greater randomness and diversity, which increases the challenge of PQD identification. To tackle this problem, this study presents a dual-channel early-fusion approach for PQD recognition based on Spatial Post-MultiScale Fusion Entropy (SMFE). SMFE is used as an entropy-based feature-construction pipeline in which a time–frequency representation is formed prior to spatial post-multiscale aggregation to produce a compact complexity map complementary to waveform morphology. Subsequently, a dual-channel model is constructed by integrating waveform-morphology input with SMFE-derived complexity features for joint learning. By leveraging the ConvNeXt architecture and a Squeeze-and-Excitation (SE) mechanism, a multimodal channel-recalibration model is implemented to emphasize informative feature responses during PQD recognition. Experimental verification with simulated signals shows that the proposed approach achieves an identification accuracy of 97.83% under an SNR of 30 dB, indicating robust performance under the tested noise settings. Full article
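The entropy channel can be illustrated with a generic multiscale-entropy pipeline: coarse-grain the signal at several scales and compute one complexity value per scale. The sketch below substitutes permutation entropy for simplicity; SMFE's actual construction is more elaborate, and every name here is hypothetical.

```python
import math
import numpy as np

def coarse_grain(x, scale):
    """Non-overlapping mean over windows of length `scale`."""
    n = len(x) // scale
    return x[:n * scale].reshape(n, scale).mean(axis=1)

def permutation_entropy(x, order=3):
    """Normalized Shannon entropy of ordinal patterns, in [0, 1]."""
    counts = {}
    for i in range(len(x) - order + 1):
        key = tuple(np.argsort(x[i:i + order]))   # ordinal pattern of the window
        counts[key] = counts.get(key, 0) + 1
    total = sum(counts.values())
    h = -sum(c / total * math.log(c / total) for c in counts.values())
    return h / math.log(math.factorial(order))

def multiscale_entropy_features(x, scales=(1, 2, 4)):
    """One complexity value per scale: a crude stand-in for an SMFE-style map."""
    return np.array([permutation_entropy(coarse_grain(x, s)) for s in scales])
```

A monotonic ramp yields zero entropy (a single ordinal pattern), while noise approaches 1, which is the kind of complexity contrast the dual-channel model fuses with waveform morphology.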

27 pages, 24041 KB  
Article
PMDet: Patch-Aware Enhancement and Fusion for Multispectral Object Detection
by Jie Li, Chenhong Sui, Jing Wang and Jun Zhou
Remote Sens. 2026, 18(7), 1068; https://doi.org/10.3390/rs18071068 - 2 Apr 2026
Viewed by 182
Abstract
Multispectral object detection addresses the limitations of single-modal approaches by fusing complementary information from visible and infrared images, thereby improving robustness in complex environments. However, the inter-modal representations are inherently misaligned due to sensing discrepancies, and the complementary cues they provide are often imbalanced, making it difficult to exploit modality-specific information effectively. Moreover, directly merging features from different modalities can introduce noise and artifacts that degrade detection performance. To this end, this paper proposes a patch-aware enhancement and fusion network for multispectral object detection (PMDet). The method employs a dual-stream backbone equipped with a patch-aware Feature Enhancer (FE) module for cross-modal feature alignment and enhancement. FE not only reinforces the feature representation of key regions but also helps to suppress local noise and sharpen the model's perception of fine textures and differences. Building on these enriched features, the patch-based Feature Aggregator (FA) module enables efficient inter-modal feature interaction and noise-resistant semantic fusion. Both the FE and FA modules leverage a shifted-patch design to preserve computational efficiency while enabling long-range modeling. In this way, PMDet couples multi-scale cross-modal semantic enhancement with deep semantic fusion to form a stable and discriminative multimodal representation pipeline. Experiments on FLIR, LLVIP, and VEDAI demonstrate that the method outperforms mainstream approaches in detection accuracy and robustness, and ablation studies further verify the effectiveness of each module. Full article

13 pages, 1956 KB  
Article
Multi-Modal Method for Candidate Interview Assessment Based on Computer Vision and Large Language Models
by Kenan Kassab, Alexey Kashevnik and Irina Shoshina
Big Data Cogn. Comput. 2026, 10(4), 106; https://doi.org/10.3390/bdcc10040106 - 1 Apr 2026
Viewed by 361
Abstract
Candidate interview assessment remains primarily reliant on subjective human judgment, while existing AI-based methods rely on end-to-end predictions with no psychometric basis. In this paper, we propose an interpretable multi-modal framework that combines nonverbal behavior, LLM-based verbal analysis, and Big Five personality traits into three theory-based constructs: professional-cognitive competence, observed leadership behavior, and leadership disposition. The proposed method uses computer vision and large language models to extract features from video interviews. Rather than targeting predictive accuracy, it prioritizes construct validity and transparent aggregation under severe label scarcity, combining the constructs into a Top Potential Score that reflects the executive abilities of the candidate. Experiments show that the method significantly differentiates top candidates from others (Cliff's delta = 0.91 for the composite Top Potential Score, permutation p = 0.0002). Leave-one-out analysis verifies robustness, while rank-based evaluation yields 100% recall of executive candidates in the top 20% of rated applications. The findings justify the use of the proposed multi-modal method as an interpretable decision-support tool for candidate interview assessment. Full article
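The reported effect size and composite score are simple to reproduce in form. Below is a minimal sketch of Cliff's delta together with a hypothetical equal-weight aggregation of the three constructs; the abstract does not specify the actual weighting, so the weights here are assumptions.

```python
def cliffs_delta(group_a, group_b):
    """Cliff's delta: P(a > b) - P(a < b) over all cross-group pairs.
    Ranges from -1 (complete reversal) to +1 (complete separation)."""
    gt = sum(1 for a in group_a for b in group_b if a > b)
    lt = sum(1 for a in group_a for b in group_b if a < b)
    return (gt - lt) / (len(group_a) * len(group_b))

def top_potential_score(competence, leadership_behavior, disposition,
                        weights=(1 / 3, 1 / 3, 1 / 3)):
    """Hypothetical transparent aggregation of the three constructs."""
    parts = (competence, leadership_behavior, disposition)
    return sum(w * p for w, p in zip(weights, parts))
```

A delta of 0.91, as reported, means nearly every top candidate outscores nearly every other candidate on the composite.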

33 pages, 3891 KB  
Article
Correlation and Semantic Prior-Guided Multi-Scale Cross-Modal Interaction Network for SAR-OPT Image Fusion
by Xiaoyang Hou, Lingxi Zhou, Chenguo Feng, Hao Cha, Yang Liu, Liguo Liu and Haibo Liu
Remote Sens. 2026, 18(7), 975; https://doi.org/10.3390/rs18070975 - 24 Mar 2026
Viewed by 305
Abstract
Synthetic aperture radar (SAR) and optical (OPT) image fusion aims to leverage their complementary information to obtain a more comprehensive representation of ground objects. However, significant discrepancies exist between the two modalities in terms of imaging mechanisms and feature distributions. Consequently, existing multi-modal image fusion methods struggle to achieve robust cross-modal feature alignment and deep semantic consistency between the fused results and the source modalities. To address the above challenges, this paper proposes a correlation and semantic prior-guided multi-scale cross-modal interaction network (CSP-MCIN) for effective SAR-OPT image fusion. Specifically, CSP-MCIN first employs two modality-specific encoders based on ResNet-18 to extract low-level details and high-level semantic features from SAR and OPT images, respectively. Subsequently, a multi-scale interactive decoder integrating cross-modal Transformers and gated fusion units is constructed to align and aggregate semantic and detail information from both encoders. Finally, to strengthen source-modality representations, a novel loss function combining a pixel-domain correlation loss and a CLIP-guided semantic consistency loss is designed and optimized under a PCGrad-based multi-objective optimization scheme. Experimental results on public SAR-OPT image datasets demonstrate that the proposed CSP-MCIN achieves superior fusion performance and computational efficiency compared with state-of-the-art approaches. Full article
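PCGrad itself has a compact two-objective form: when the gradients of the two losses conflict (negative dot product), each is projected onto the normal plane of the other. A minimal numpy sketch with hypothetical names, not the paper's training code:

```python
import numpy as np

def pcgrad_pair(g1, g2):
    """PCGrad for two objectives: project each gradient onto the normal
    plane of the other whenever their dot product is negative."""
    def deconflict(g, other):
        dot = float(g @ other)
        if dot < 0.0:  # conflicting directions: remove the opposing component
            g = g - (dot / float(other @ other)) * other
        return g
    return deconflict(g1.astype(float), g2), deconflict(g2.astype(float), g1)
```

After projection, neither surviving gradient has a component that directly opposes the other loss, which is what lets the correlation and semantic-consistency terms be optimized jointly.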

20 pages, 6854 KB  
Article
TARTS: Training-Free Adaptive Reference-Guided Traversability Segmentation with Automated Footprint Supervision and Experimental Verification
by Shuhong Shi and Lingchuan Zeng
Electronics 2026, 15(6), 1194; https://doi.org/10.3390/electronics15061194 - 13 Mar 2026
Viewed by 227
Abstract
Autonomous mobile robots require robust traversability perception to navigate safely in diverse outdoor environments. However, traditional deep learning approaches are data-hungry, requiring large-scale manual annotations, and struggle to adapt quickly to unseen environments. This paper introduces TARTS (Training-free Adaptive Reference-guided Traversability Segmentation), a novel framework combining one-shot prototype initialization with trajectory-guided online adaptation for terrain segmentation. Using a single reference image of desired traversable terrain, TARTS establishes an initial prototype from pre-trained DINO Vision Transformer (ViT) features. The system performs segmentation through superpixel-based feature aggregation and valley-emphasis Otsu thresholding while continuously refining the prototype via Exponential Moving Average (EMA) updates driven by automated footprint supervision from the robot’s traversed trajectory. Extensive experiments on our introduced Reference-guided Traversability Segmentation Dataset (RTSD) and the challenging Off-Road Freespace Detection (ORFD) benchmark demonstrate strong performance, achieving 94.5% IoU on RTSD and 94.1% IoU on ORFD, outperforming state-of-the-art supervised methods that require multi-modal inputs and dedicated training. The framework maintains efficient performance (17–24 FPS) on embedded platforms, enabling practical deployment with only a reference image as initialization. Full article
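The prototype-refinement loop described above reduces to two small operations: an EMA update driven by footprint-supervised features, and cosine-similarity scoring of superpixel features against the prototype. A rough numpy sketch; the names, shapes, and smoothing factor are assumptions rather than the authors' code.

```python
import numpy as np

def ema_update(prototype, footprint_feats, alpha=0.9):
    """Refine the terrain prototype with features from regions the robot traversed."""
    return alpha * prototype + (1 - alpha) * footprint_feats.mean(axis=0)

def traversability_scores(superpixel_feats, prototype):
    """Cosine similarity of each superpixel feature to the prototype; a threshold
    (e.g. valley-emphasis Otsu, as in the paper) would binarize these scores."""
    f = superpixel_feats / np.linalg.norm(superpixel_feats, axis=1, keepdims=True)
    return f @ (prototype / np.linalg.norm(prototype))
```

Because both steps are closed-form, the pipeline needs no gradient updates, which is consistent with the training-free framing and the reported embedded-platform frame rates.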

28 pages, 5635 KB  
Article
Interpretable Multimodal Framework for Human-Centered Street Assessment: Integrating Visual-Language Models for Perceptual Urban Diagnostics
by Kaiqing Yuan, Haotian Lan, Yao Gao and Kun Wang
Land 2026, 15(3), 449; https://doi.org/10.3390/land15030449 - 12 Mar 2026
Viewed by 375
Abstract
While objective street metrics derived from imagery or GIS have become standard in urban analytics, they remain insufficient to capture the subjective perceptions essential to inclusive urban design. This study introduces a novel Multimodal Street Evaluation Framework (MSEF) that fuses a vision transformer (VisualGLM-6B) with a large language model (GPT-4), enabling interpretable dual-output assessment of streetscapes. Leveraging over 15,000 annotated street-view images from Harbin, China, we fine-tune the framework using Low-Rank Adaptation (LoRA) and P-Tuning v2 for parameter-efficient adaptation. The model achieves an F1 score of 0.863 on objective features and 89.3% agreement with aggregated resident perceptions, validated across stratified socioeconomic geographies. Beyond classification accuracy, MSEF captures context-dependent contradictions: for instance, informal commerce boosts perceived vibrancy while simultaneously reducing pedestrian comfort. It also identifies nonlinear and semantically contingent patterns, such as the divergent perceptual effects of architectural transparency across residential and commercial zones, revealing the limits of universal spatial heuristics. By generating natural-language rationales grounded in attention mechanisms, the framework bridges sensory data with socio-affective inference, enabling transparent diagnostics aligned with Sustainable Development Goal 11 (SDG 11). This work offers both methodological innovation in urban perception modeling and practical utility for planning systems seeking to reconcile infrastructural precision with lived experience. Full article
(This article belongs to the Special Issue Big Data-Driven Urban Spatial Perception)

59 pages, 5629 KB  
Article
Adaptive Neural Network Method for Detecting Crimes in the Digital Environment to Ensure Human Rights and Support Forensic Investigations
by Serhii Vladov, Oksana Mulesa, Petro Horvat, Yevhen Kobko, Victoria Vysotska, Vasyl Kikinchuk, Serhii Khursenko, Kostiantyn Karaman and Oksana Kochan
Data 2026, 11(3), 49; https://doi.org/10.3390/data11030049 - 2 Mar 2026
Viewed by 529
Abstract
This article presents an adaptive neural network method for the automated detection, reconstruction, and prioritisation of multi-stage criminal operations in the digital environment, aiming to protect human rights and ensure the legal security of digital evidence. The developed method combines multimodal temporal encoders, a graph module based on GNN for entity correlation, and a correlation head with a link-prediction mechanism and differentiable path recovery. Sliding time windows, logarithmic transformation of volumetric features, and pseudonymization of identifiers with the ability to utilise privacy-preserving procedures (federated learning, differential privacy) are used for data aggregation and normalisation. Unique features of the developed method include an integrated risk function combining an anomaly component and graph significance, a module for automated forensic packet generation with chain of custody recording, and a mechanism for incremental model updates. Experimental results demonstrate high diagnostic metric values (AUC ≈ 0.97, F1 ≈ 0.99 on the test dataset after balancing), robust recovery of priority paths (“path_probability” > 0.7 for top operations), and pipeline performance in PII leak prioritisation and human trafficking reconstruction scenarios. The study’s contribution lies in a practice-oriented neural network method that integrates detection, correlation, and the collection of legally applicable evidence. Full article

81 pages, 3981 KB  
Review
Graph Learning in Bioinformatics: A Survey of Graph Neural Network Architectures, Biological Graph Construction and Bioinformatics Applications
by Lijia Deng, Ziyang Dong, Zhengling Yang, Bo Gong and Le Zhang
Biomolecules 2026, 16(2), 333; https://doi.org/10.3390/biom16020333 - 23 Feb 2026
Viewed by 777
Abstract
Graph Neural Networks (GNNs) have become a central methodology for modelling biological systems where entities and their interactions form inherently non-Euclidean structures. From protein interaction networks and gene regulatory circuits to molecular graphs and multi-omics integration, the relational nature of biological data makes GNNs particularly well-suited for capturing complex dependencies that traditional deep learning methods fail to represent. Despite their rapid adoption, the effectiveness of GNNs in bioinformatics depends not only on model design but also on how biological graphs are constructed, parameterised and trained. In this review, we provide a structured framework for understanding and applying GNNs in bioinformatics, organised around three key dimensions: (1) graph construction and representation, including strategies for deriving biological networks from heterogeneous sources and selecting biologically meaningful node and edge features; (2) GNN architectures, covering spectral and spatial formulations, representative models such as Graph Convolutional Networks (GCNs), Graph Attention Networks (GATs), Graph Sample and AggregatE (GraphSAGE) and Graph Isomorphism Network (GIN), and recent advances including transformer-based and self-supervised paradigms; and (3) applications in biomedical domains, spanning disease–gene association prediction, drug discovery, protein structure and function analysis, multi-omics integration and biomedical knowledge graphs. We further examine training considerations, including optimisation techniques, regularisation strategies and challenges posed by data sparsity and noise in biological settings. By synthesising methodological foundations with domain-specific applications, this review clarifies how graph quality, architectural choice and training dynamics jointly influence model performance. We also highlight emerging challenges that will shape the next generation of GNNs in computational biology, such as modelling temporal biological processes, improving interpretability, and enabling robust multimodal fusion. Full article
(This article belongs to the Special Issue Application of Bioinformatics in Medicine)
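As a concrete reference point for the surveyed architectures, a single GraphSAGE layer with mean aggregation can be written in a few lines of numpy. This is a didactic sketch only (dense features, adjacency as neighbor lists, one weight matrix for the concatenated representation), not a library implementation.

```python
import numpy as np

def sage_mean_layer(H, neighbors, W):
    """One GraphSAGE layer with mean aggregation:
    h_v' = ReLU(W^T [h_v ; mean of neighbor features]).
    H: (n, d) node features; neighbors: list of neighbor-index lists; W: (2d, d_out)."""
    d = H.shape[1]
    agg = np.stack([H[nbrs].mean(axis=0) if nbrs else np.zeros(d)
                    for nbrs in neighbors])
    return np.maximum(0.0, np.concatenate([H, agg], axis=1) @ W)
```

Stacking such layers lets each node aggregate information from progressively larger graph neighborhoods, which is the mechanism the review's architectural taxonomy builds on.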

28 pages, 1114 KB  
Review
Amyotrophic Lateral Sclerosis (ALS) Genetics and Microbiota: A Comprehensive Review
by Mostafa Ahmed Kurdi, Hidayah Alotaibi, Asayel Tawfiq Alkhuraymi, Layyan Nassar Aldahery, Ali Fouad Alhawaj and Hamzah Jehad Aldali
Int. J. Mol. Sci. 2026, 27(4), 1978; https://doi.org/10.3390/ijms27041978 - 19 Feb 2026
Viewed by 1249
Abstract
Amyotrophic Lateral Sclerosis (ALS) is a severe, progressive neurodegenerative disorder characterized by the loss of upper and lower motor neurons, affecting 0.5 to 2.6 per 100,000 people, with a median survival of 2 to 5 years. It is increasingly seen as a multisystem disorder, sharing essential clinicopathological features with Frontotemporal Dementia (FTD). This convergence arises from overlapping molecular processes, including severe oxidative stress, glutamate-mediated excitotoxicity, mitochondrial dysfunction, and widespread aggregated TDP-43 proteinopathy in both sporadic and familial cases. Several key genetic factors have been identified, particularly mutations in C9orf72, SOD1, TARDBP, and FUS, which serve as important targets for novel treatments, such as Tofersen, a recently approved SOD1-specific antisense oligonucleotide (ASO) gene therapy. Additionally, there is increasing evidence of the gut–brain connection. Dysbiosis, involving species such as Akkermansia muciniphila, and lower levels of neuroprotective metabolites, such as nicotinamide, may affect the course of the disease. As a result, treatment strategies are shifting toward a personalized approach. This includes using gene therapy, ranging from ASOs and RNA interference (RNAi) to new CRISPR-based genome editing. It also involves exploring microbiome-modulating treatments, such as specific probiotics and Fecal Microbiota Transplantation (FMT). While microbiome and gene therapies remain largely experimental, their potential is promising, as highlighted by the recent approval of Tofersen. These novel approaches could be further enhanced and guided by more robust diagnostic criteria and by investigating early multimodal treatment strategies to slow the progression of this complex disease. Full article

16 pages, 4072 KB  
Article
SCGViT: A Pseudo-Multimodal Low-Latency Framework for Real-Time Skin Lesion Diagnosis
by Zirui Luo, Chengyu Hou and Haishi Wang
Electronics 2026, 15(4), 845; https://doi.org/10.3390/electronics15040845 - 16 Feb 2026
Viewed by 325
Abstract
To address the problems of insufficient medical image feature extraction, limited classification accuracy, and high computational complexity in the automatic diagnosis of skin lesions in edge computing environments, this paper proposes a real-time pseudo-multimodal low-latency diagnosis framework, SCGViT, based on a vision transformer. The framework is constructed around three functional objectives: mitigating data imbalance through generative modeling, capturing diverse representations via multi-dimensional perception, and optimizing feature fusion through adaptive refinement. First, Class-Conditioned Generative Adversarial Networks (CGANs) simulate manifolds of minority-class samples in latent space, achieving a preliminary balance of the data distribution. Second, a branch feature-extraction path is constructed to simulate inversion (INV) and infrared (IR) modes from the original RGB color input, achieving multi-dimensional perception. Finally, a cross-attention mechanism performs cross-branch feature aggregation, and a channel-attention (squeeze-and-excitation) mechanism is embedded for secondary refinement of the mixed global-local features, enhancing the representation of key pathological regions by integrating complementary structural and contrast information. Experimental results on the HAM10000 dataset show that the F1 score reached 0.973, the inference speed reached 304.439 FPS, the parameter count was only 0.524 M, and the computational complexity was only 0.866 G FLOPs, achieving a balance between high accuracy and a lightweight design. Full article
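The squeeze-and-excitation recalibration step is a standard mechanism and easy to sketch: globally pool each channel, pass the pooled vector through a small bottleneck, and rescale the channels by the resulting gates. A minimal numpy version; the weight shapes and reduction ratio are illustrative, not SCGViT's actual configuration.

```python
import numpy as np

def se_recalibrate(x, w1, w2):
    """Squeeze-and-Excitation: global-average-pool each channel of x (C, H, W),
    pass through a two-layer bottleneck (w1: (C/r, C), w2: (C, C/r)),
    and rescale the channels by sigmoid gates."""
    z = x.mean(axis=(1, 2))                                      # squeeze: (C,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ np.maximum(0.0, w1 @ z))))    # excitation: (C,)
    return x * s[:, None, None]                                  # channel rescaling
```

The gates emphasize informative channels and suppress weak ones, which is the "secondary refinement" role the abstract assigns to this block.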

32 pages, 107233 KB  
Article
Fourier-Based Non-Rigid Slice-to-Volume Registration of Segmented Petrographic LM and CT Scans of Concrete Specimens
by Mohamed Said Helmy Alabassy, Martin Christian Hampe, Doreen Erfurt, Horst-Michael Ludwig and Andrea Osburg
Materials 2026, 19(4), 663; https://doi.org/10.3390/ma19040663 - 9 Feb 2026
Viewed by 567
Abstract
Cyclic freezing and thawing (FT) are a primary cause of cracking in concrete, yet current assessment procedures in Germany rely heavily on qualitative estimation using the International Union of Laboratories and Experts in Construction Materials, Systems and Structures (RILEM) capillary suction, internal damage and freeze-thaw (CIF) and Capillary de-icing freeze-thaw (CDF) tests. Although these standard tests provide a general overview of the condition of concrete damage in specimens through the estimation of water saturation through capillary suction, mass of surface delamination, qualitative open surface damage, and relative dynamic modulus of elasticity, they do not take quantitative analysis of voids, including cracks and air pores, directly into account. To address this, we propose a novel workflow utilizing deep learning-based semantic segmentation with Fourier-based slice-to-volume registration by combining 2D light microscopy (LM) of petrographic sections and 3D micro-computed tomography (μCT). We segment cracks, air pores, and aggregates in both modalities and employ feature matching alongside spatial harmonics analysis for 3D shape description. The best proposed 3D registration framework through feature matching demonstrated a success rate of 89.75%, achieving a dissimilarity of 5.21% in relative root mean square error (RRMSE) terms and thereby significantly surpassing the performance of compared 2D-only methods adapted from the body of research. Our approach enables precise, automated, and verifiable quantification of voids across CT and LM modalities and paves the way for advanced computational modeling-based methods to investigate moisture transfer mechanisms for more accurate assessments of frost damage in concrete, service life prediction models, deep learning applications for multimodal data fusion, and more comprehensive FT damage simulations. Full article
(This article belongs to the Section Advanced Materials Characterization)
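For reference, one common definition of the reported RRMSE dissimilarity is the RMSE of the residual normalized by the RMS of the reference; definitions vary across the literature, so treat this sketch as an assumption rather than the authors' exact formula.

```python
import numpy as np

def rrmse_percent(reference, registered):
    """Relative RMSE (one common form): RMSE of the residual divided by the
    RMS of the reference signal, expressed as a percentage."""
    num = np.sqrt(np.mean((reference - registered) ** 2))
    den = np.sqrt(np.mean(reference ** 2))
    return 100.0 * num / den
```

Under this form, the reported 5.21% dissimilarity means the registration residual is about one-twentieth of the reference signal's RMS energy.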

17 pages, 784 KB  
Article
A Wideband Oscillation Classification Method Based on Multimodal Feature Fusion
by Yingmin Zhang, Yixiong Liu, Zongsheng Zheng and Shilin Gao
Electronics 2026, 15(3), 682; https://doi.org/10.3390/electronics15030682 - 4 Feb 2026
Cited by 1 | Viewed by 405
Abstract
With the increasing penetration of renewable energy sources and power-electronic devices, modern power systems exhibit pronounced wideband oscillation characteristics with large frequency spans, strong modal coupling, and significant time-varying behaviors. Accurate identification and classification of wideband oscillation patterns have therefore become critical challenges for ensuring the secure and stable operation of "dual-high" power systems. Existing methods based on signal processing or single-modality deep-learning models often fail to fully exploit the complementary information embedded in heterogeneous data representations, resulting in limited performance when dealing with complex oscillation patterns. To address these challenges, this paper proposes a multimodal attention-based fusion network for wideband oscillation classification. A dual-branch deep-learning architecture is developed to process Gramian Angular Difference Field images and raw time-series signals in parallel, enabling collaborative extraction of global structural features and local temporal dynamics. An improved Inception module is employed in the image branch to enhance multi-scale spatial feature representation, while a gated recurrent unit network is utilized in the time-series branch to model dynamic evolution characteristics. Furthermore, an attention-based fusion mechanism is introduced to adaptively learn the relative importance of different modalities and perform dynamic feature aggregation. Extensive experiments are conducted using a dataset constructed from mathematical models and engineering-oriented simulations. Comparative studies and ablation studies demonstrate that the proposed method significantly outperforms conventional signal-processing-based approaches and single-modality deep-learning models in terms of classification accuracy, robustness, and generalization capability. The results confirm the effectiveness of multimodal feature fusion and attention mechanisms for accurate wideband oscillation classification, providing a promising solution for advanced power system monitoring and analysis. Full article
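The Gramian Angular Difference Field used in the image branch is a standard transform: rescale the series to [-1, 1], interpret each value as an angle via arccos, and form the matrix of sine differences. A minimal numpy sketch of the transform itself (not the paper's full preprocessing):

```python
import numpy as np

def gadf(series):
    """Gramian Angular Difference Field of a 1-D series:
    rescale to [-1, 1], take phi = arccos(x), return G[i, j] = sin(phi_i - phi_j)."""
    x = 2.0 * (series - series.min()) / (series.max() - series.min()) - 1.0
    comp = np.sqrt(np.clip(1.0 - x ** 2, 0.0, 1.0))   # sin(arccos(x))
    # sin(phi_i - phi_j) = sin(phi_i)cos(phi_j) - cos(phi_i)sin(phi_j)
    return np.outer(comp, x) - np.outer(x, comp)
```

The resulting antisymmetric image encodes pairwise temporal relations, giving the CNN branch a 2-D view of the oscillation's global structure.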

21 pages, 2231 KB  
Article
Token Injection Transformer for Enhanced Fine-Grained Recognition
by Bing Ma, Zhengbei Jin, Junyi Li, Jindong Li, Pengfei Zhang, Xiaohui Song and Beibei Jin
Processes 2026, 14(3), 492; https://doi.org/10.3390/pr14030492 - 30 Jan 2026
Viewed by 590
Abstract
Fine-Grained Visual Classification (FGVC) involves distinguishing highly similar subordinate categories within the same basic-level class, presenting significant challenges due to subtle inter-class variations and substantial intra-class diversity. While Vision Transformer (ViT)-based approaches have demonstrated potential in this domain, they remain limited by two key issues: (1) the progressive loss of gradient-based edge and texture signals during hierarchical token aggregation and (2) insufficient extraction of discriminative fine-grained features. To overcome these limitations, we propose a Gradient-Aware Token Injection Transformer, a novel framework that explicitly incorporates gradient magnitude and orientation into token embeddings. This multi-modal feature fusion mechanism enhances the model’s capacity to preserve and leverage critical fine-grained visual cues. Extensive experiments on four standard FGVC benchmarks demonstrate the superiority of our approach, achieving 92.9% top-1 accuracy on CUB-200-2011, 90.5% on iNaturalist 2018, 93.2% on NABirds, and 95.3% on Stanford Cars, thereby validating its effectiveness and robustness. Full article
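The gradient cues this framework injects can be approximated with plain finite differences; the sketch below computes per-pixel magnitude and orientation and is only an illustration of the signal involved, not the paper's token-injection mechanism.

```python
import numpy as np

def gradient_cues(img):
    """Per-pixel gradient magnitude and orientation via finite differences --
    the kind of edge/texture signal that hierarchical token aggregation
    tends to wash out."""
    gy, gx = np.gradient(img.astype(float))   # derivatives along rows, columns
    return np.hypot(gx, gy), np.arctan2(gy, gx)
```

In the paper's framing, maps like these would be embedded alongside the patch tokens so subtle edge and texture differences survive into deeper layers.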

22 pages, 9269 KB  
Article
Efficient Layer-Wise Cross-View Calibration and Aggregation for Multispectral Object Detection
by Xiao He, Tong Yang, Tingzhou Yan, Hongtao Li, Yang Ge, Zhijun Ren, Zhe Liu, Jiahe Jiang and Chang Tang
Electronics 2026, 15(3), 498; https://doi.org/10.3390/electronics15030498 - 23 Jan 2026
Viewed by 413
Abstract
Multispectral object detection is a fundamental task with an extensive range of practical implications. In particular, combining visible (RGB) and infrared (IR) images can offer complementary information that enhances detection performance in different weather scenarios. However, the existing methods generally involve aligning features across modalities and require proposals for the two-stage detectors, which are often slow and unsuitable for large-scale applications. To overcome this challenge, we introduce a novel one-stage oriented detector for RGB-infrared object detection called the Layer-wise Cross-Modality calibration and Aggregation (LCMA) detector. LCMA employs a layer-wise strategy to achieve cross-modality alignment by using the proposed inter-modality spatial-reduction attention. Moreover, we design Gated Coupled Filter in each layer to capture semantically meaningful features while ensuring that well-aligned and foreground object information is obtained before forwarding them to the detection head. This relieves the need for a region proposal step for the alignment, enabling direct category and bounding box predictions in a unified one-stage oriented detector. Extensive experiments on two challenging datasets demonstrate that the proposed LCMA outperforms state-of-the-art methods in terms of both accuracy and computational efficiency, which implies the efficacy of our approach in exploiting multi-modality information for robust and efficient multispectral object detection. Full article
(This article belongs to the Special Issue Multi-View Learning and Applications)
