- Article
MPCFN: A Multilevel Predictive Cross-Fusion Network for Multimodal Named Entity Recognition in Social Media
Qinjun Qiu, Bo Tan, Yukuan Zhou, and 3 other authors
The goal of the Multimodal Named Entity Recognition (MNER) task is to identify named entities and assign them to predefined categories by combining multiple data modalities, such as text and images. The growing prevalence of multimodal social media posts has increased interest in MNER, particularly because of its role in applications ranging from intention comprehension to personalized user recommendation. MNER currently faces two main difficulties: image and text information are often inconsistent, and it is hard to fully exploit image information to complement the text. To address these problems, this study proposes a Multilevel Predictive Cross-Fusion Network (MPCFN) for Multimodal Named Entity Recognition. First, textual features are extracted with BERT and visual features with ResNet, and irrelevant information in the image is filtered out by the Correlation Prediction Gate. Second, the Dynamic Gate controls which levels of the visual feature hierarchy each Transformer block receives, and the Cross-Fusion Module aligns the image and text features. Finally, the hidden-layer representation is fed into a CRF layer, optimized using Flooding, for decoding. In experiments on the TWITTER-2015, TWITTER-2017, and WuKong datasets, our method achieves F1 scores of 76.74%, 87.61%, and 82.35%, respectively, outperforming existing mainstream state-of-the-art models and demonstrating its effectiveness.
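Two of the ideas named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. `gate_visual` is a hypothetical, generic sigmoid gate that scales visual features by a relevance score before adding them to text features; the abstract does not specify how the Correlation Prediction Gate or Dynamic Gate are computed, so the gate form, the function name, and the `relevance_logit` parameter are assumptions. `flooded_loss` applies the standard flooding formula, |loss - b| + b, where the flood level `b` is a hyperparameter whose value is not given in the abstract.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a real-valued score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gate_visual(text_vec, visual_vec, relevance_logit):
    """Hypothetical gated fusion (an assumption, not the paper's gate).

    The gate g = sigmoid(relevance_logit) lies in (0, 1). A low score
    suppresses the visual features, a high score lets them through.
    """
    g = sigmoid(relevance_logit)
    return [t + g * v for t, v in zip(text_vec, visual_vec)]


def flooded_loss(loss: float, flood_level: float) -> float:
    """Standard flooding: |loss - b| + b.

    Above the flood level b the value equals the original loss. Below b
    the gradient sign flips, so training keeps the loss hovering near b
    instead of driving it toward zero.
    """
    return abs(loss - flood_level) + flood_level
```

With a relevance score near zero the gate passes about half of the visual signal, while a strongly negative score leaves the text features almost unchanged. In a real model the score would come from a learned image-text relevance predictor, and the loss passed to `flooded_loss` would be the CRF negative log-likelihood.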
7 November 2025





