4.3. Evaluation Metrics
In this study, we adopt the evaluation framework established in the literature [20] to evaluate the effectiveness of the CASF-MNER model proposed in this paper, constructing a three-dimensional evaluation system that consists of Precision, Recall, and the F1 value. Specifically, Precision measures the reliability of the model's prediction results and is calculated as the ratio of correctly recognized entities to all predicted entities:

\[
\text{Precision} = \frac{TP}{TP + FP}
\]

where TP denotes correctly recognized entities and FP denotes incorrectly predicted entities.
The Recall metric, in turn, assesses the model's ability to cover the true entities and is mathematically defined as:

\[
\text{Recall} = \frac{TP}{TP + FN}
\]

where FN denotes true entities that the model fails to recognize.
In order to comprehensively balance the trade-off between Precision and Recall, this paper takes the F1 value as the core evaluation index. This index combines the two through the harmonic mean, and its expression is:

\[
F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
\]

This index characterizes the comprehensive performance of the model on the interval [0, 1], reaching its maximum when Precision and Recall are balanced, which effectively avoids the bias that may arise from evaluating with a single metric.
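As a concrete illustration, the following minimal Python sketch computes these entity-level metrics from gold and predicted entity spans. The helper `entity_prf` and the toy spans are hypothetical examples for illustration, not the evaluation code used in this paper.

```python
def entity_prf(gold_entities, pred_entities):
    """Entity-level Precision, Recall, and F1.

    Each argument is a set of (start, end, type) tuples; a prediction counts as
    correct only when both its span and its type match a gold entity exactly.
    """
    tp = len(gold_entities & pred_entities)  # correctly recognized entities
    precision = tp / len(pred_entities) if pred_entities else 0.0
    recall = tp / len(gold_entities) if gold_entities else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: two gold entities, two predictions, one of which is fully correct.
gold = {(0, 2, "PER"), (5, 6, "ORG")}
pred = {(0, 2, "PER"), (5, 6, "LOC")}
print(entity_prf(gold, pred))  # (0.5, 0.5, 0.5)
```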
4.4. Model Training and Performance Evaluation
4.4.1. Model Training
During model training, we monitored performance on both the Twitter-2015 and Twitter-2017 datasets in detail, focusing on the trends of the overall Precision, Recall, and F1 scores. The relevant results are shown in Figure 6.
Overall, as the number of training epochs increases, the model's metrics on both datasets improve significantly. It is worth noting that after epoch 10, some metrics, such as Precision and F1, show a brief decline. This is mainly because the model begins to overfit some noisy or hard-to-discriminate samples as it gradually adapts to the details of the training set, which leads to temporary fluctuations in the metrics. Subsequently, the model performance stabilizes and gradually converges to a higher level in the later stages.
Comparing the two datasets, the final metrics on Twitter-2017 are better than on Twitter-2015, with F1 reaching about 0.87, indicating that the model can also achieve good generalization in more challenging data environments. In detail, Recall on Twitter-2017 has remained above 0.87 since epoch 10 with minimal fluctuation, showing the model's strong ability to recall relevant instances. Precision and F1 have likewise improved steadily since epoch 10, indicating that the abilities to discard invalid information and to categorize accurately are enhanced at the same time.
For the Twitter-2015 dataset, the three metrics oscillate slightly as training progresses, with F1 fluctuating especially around epoch 20, but they generally maintain a steady upward trend. In the end, Precision, Recall, and F1 all settle at around 0.73, a stable performance. This indicates that the model can still gain some performance improvement from long-term training even when the data are relatively old or slightly noisy.
In summary, the evolution of the model's performance across epochs fully reflects its effective learning ability and good convergence behavior. The high F1 scores and small metric fluctuations demonstrate the reliability of the proposed method for the multimodal named entity recognition task.
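For reference, the sketch below shows one common way to monitor per-epoch Precision/Recall/F1 and retain the best checkpoint, matching the kind of epoch-wise tracking described above. It assumes a PyTorch-style model; `train_one_epoch` and `evaluate` are hypothetical callbacks standing in for the actual training and evaluation routines, which are not specified here.

```python
import copy

def train_with_best_checkpoint(model, train_loader, dev_loader, optimizer,
                               train_one_epoch, evaluate, num_epochs=30):
    """Track per-epoch Precision/Recall/F1 and keep the best-F1 checkpoint."""
    best_f1, best_state, history = 0.0, None, []
    for epoch in range(1, num_epochs + 1):
        train_one_epoch(model, train_loader, optimizer)      # one pass over the training set
        precision, recall, f1 = evaluate(model, dev_loader)  # entity-level metrics on the dev set
        history.append((epoch, precision, recall, f1))
        if f1 > best_f1:                                     # new best epoch: remember its weights
            best_f1, best_state = f1, copy.deepcopy(model.state_dict())
    if best_state is not None:
        model.load_state_dict(best_state)                    # restore the best checkpoint
    return history, best_f1
```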
4.4.2. Model Performance Evaluation
In order to visualize the model performance, we select three representative samples for the case study, as shown in
Figure 7.
Case (a) demonstrates the value of incorporating visual semantics in entity type determination. In the example “Thanks Andrew for the great Tesla road trip presentation,” the BERT-CRF model—relying solely on textual information—mislabels two entities, and the MGCMT model, despite improvements, still does not correctly identify the entity “Andrew.” Our approach leverages visual grounding by modeling the association between the speaker and the Tesla car, aided by saliency-guided attention, enabling correct identification of all entities in this particular case.
Case (b) explores the model's ability to handle entities from technical or professional domains. In the example “Coiner: Practical Applications of [GIS MISC] in Crisis Mapping #TAMUCC #COMM4335 #ESRI #NYC,” BERT-CRF struggles to identify “GIS” as a MISC entity, while MGCMT fails to correctly classify “ESRI” as an organization. Our framework applies deep multimodal fusion to more accurately distinguish entity categories by exploiting connections between textual terms and visual cues (such as GIS interfaces). However, despite these advantages in certain technical contexts, our model's performance on detecting MISC categories on the Twitter-2015 dataset still leaves room for improvement.
Case (c) presents a complex example involving rich media content. For “Jennifer Lawrence on the cover of Harper’s Bazaar Magazine Bulgaria (June 2016),” both BERT-CRF and MGCMT fail to comprehensively and correctly identify all relevant entities, particularly those belonging to the PERSON category. In this scenario, our model is able to utilize both textual and visual features to recognize people, organizations, and geographic references, demonstrating its potential for nuanced, cross-modal entity extraction.
Notably, while our hybrid model consistently achieves higher accuracy than the baselines on major entity types (e.g., PERSON, ORG, and LOC), we also observe slightly lower performance than the unimodal BERT-CRF on the MISC category on the Twitter-2015 dataset. This result may be due to the inherent ambiguity and contextual dependence of MISC entities, which are more difficult to capture through multimodal associations and may require more advanced knowledge integration or additional training data. Furthermore, some misclassifications still occur when the visual context is weak or misleading, or when the entity is underrepresented in both modalities.
4.5. Experimental Results and Analysis
In order to fully validate the effectiveness of the model in this paper on the task of multimodal named entity recognition, we conducted comparative experiments between the CASF-MNER model and existing benchmark models on two public datasets, TWITTER-2015 and TWITTER-2017.
Analysis of text-based named entity recognition (NER) methods. As shown in
Table 4 and
Table 5, among text-based single-modal methods, methods based on pre-trained language models generally outperform traditional methods. On the TWITTER-2015 dataset, the F1 score of BERT-CRF was 71.81%, significantly higher than the 64.42% of BiLSTM-CRF, an improvement of 7.39 percentage points. HBiLSTM-CRF outperformed BiLSTM-CRF by 4.75 percentage points with an F1 score of 69.17%, which demonstrates the effectiveness of hierarchical architectures in sequence labeling tasks. On the TWITTER-2017 dataset, BERT-CRF achieved an F1 score of 83.44%, 7.13 percentage points higher than BiLSTM-CRF, indicating that pre-trained models can better capture the semantic information in social media text.
Analysis of multimodal named entity recognition (NER) methods. The data in
Table 4 and
Table 5 show that models incorporating visual information generally outperform text-only models. On the TWITTER-2015 dataset, UMGF achieved an F1 score of 74.85%, which is 3.04 percentage points higher than the best text-based model, BERT-CRF; GDN-CMCF and MGCMT achieved 73.05% and 74.18%, respectively. On the TWITTER-2017 dataset, the advantage of graph fusion is even more pronounced, with UMGF, GDN-CMCF, and MGCMT achieving F1 scores of 85.51%, 85.71%, and 85.89%, respectively, significantly higher than the single-modal methods. This indicates that visual information plays an important supplementary role in handling the ambiguity and incomplete expressions of social media text.
Comparative analysis with other MNER methods. A comprehensive analysis of
Table 4 and
Table 5 shows that the CASF-MNER model proposed in this paper achieves excellent results on both datasets. On TWITTER-2015, the F1 score reaches 74.16%, which is close to the current best UMGF (74.85%) while outperforming the other comparison methods in terms of Recall; on TWITTER-2017, the F1 score is 86.81%, which outperforms all the comparison methods and improves on the next best MGCMT (85.89%) by 0.92 percentage points. Notably, the model in this paper performs particularly well on the difficult-to-recognize ORG and MISC categories, reaching F1 scores of 85.22% and 70.38%, respectively, on TWITTER-2017, which is significantly higher than the other comparison methods.
Further analysis of the methods. As shown in
Figure 8, the F1 scores for the four entity categories demonstrate the differences between multimodal methods and text-based methods, particularly in the ORG and MISC categories on the TWITTER-2017 dataset, where the CASF-MNER model achieves significant improvements. However, it is worth noting that in the MISC category of the TWITTER-2015 dataset, the F1 scores of multimodal methods such as CASF-MNER are lower than that of the text-based BERT-CRF method, indicating a negative transfer effect caused by multimodal information. This phenomenon likely results from the limited accuracy of data collection and annotation in the earlier social media environment. Additionally, the model may sometimes misclassify certain visually similar domain entities (such as local businesses or specific cultural landmarks) as more general or higher-frequency categories, potentially due to data distribution biases during CLIP pre-training. Among the four categories, the MISC category has the most complex instance distribution, resulting in generally lower image quality and weaker text-image synergy. The model struggles to fully leverage multimodal synergy during joint representation and may even be misled by erroneous or irrelevant visual information, which negatively affects the MISC category and ultimately causes the F1 score to decline. While multimodal information can significantly improve the overall performance of named entity recognition in most scenarios, its performance in specific categories is constrained by the quality of the data itself and the attributes of the category.
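The paper does not state which toolkit is used to obtain the per-category scores; one common way to compute them from BIO-tagged predictions is the seqeval library, sketched below with toy tag sequences.

```python
from seqeval.metrics import classification_report

# Toy BIO-tagged sequences; in practice these would be the gold and predicted
# tags for every sentence in the test set.
y_true = [["B-PER", "I-PER", "O", "B-ORG", "O", "B-MISC"]]
y_pred = [["B-PER", "I-PER", "O", "B-ORG", "O", "O"]]

# Prints Precision / Recall / F1 broken down by entity type (PER, ORG, MISC, ...),
# the kind of category-level comparison shown in Figure 8.
print(classification_report(y_true, y_pred))
```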
It should be noted that although this study achieved significant performance improvements on the two mainstream multimodal social media datasets TWITTER-2015 and TWITTER-2017, the current experiments were limited to these datasets due to computational power, time, and the availability of publicly labeled datasets. The lack of more comprehensive validation on a wider range of datasets has, to some extent, limited the universality and comprehensiveness of the experimental results. Therefore, the experimental results primarily reflect the model’s effectiveness in these two typical scenarios, and its performance in more complex and diverse social media environments remains to be further evaluated.
4.6. Ablation Study
In order to further explore the role of different modules of the model in this paper for entity recognition, we conducted ablation experiments, the results of which are shown in the following table:
As shown in
Table 6, when the visual representation enhancement (VRE) is removed, the F1 scores of the model on the Twitter-2015 and Twitter-2017 datasets decrease by 0.84% and 1.01%, respectively. This indicates that high-quality visual representations play an important role in the entity recognition task. The enhanced features obtained through fine-grained target detection and visual grounding provide the model with more accurate visual semantic information, which effectively improves the understanding and localization of entities.
When the CAM is removed, the F1 scores of the model on Twitter-2015 and Twitter-2017 decrease by 0.43% and 0.17%, respectively. This result suggests that modal alignment plays a positive role in facilitating fine-grained multimodal semantic fusion, which helps to improve performance on the downstream task.
After removing the SFM and feeding the visual modality information directly into the subsequent module, the model's F1 scores decrease by 0.79% and 0.58% on the two datasets, respectively. The results show that the SFM module helps to improve feature semantic consistency and cross-modal co-expression, which has a positive impact on model performance.
Removal of the DSF results in decreases of 2.11% and 1.21% in the F1 scores on Twitter-2015 and Twitter-2017, respectively, the largest performance drop among the modules. This indicates that the DSF module effectively realizes the semantic mapping and integration of heterogeneous modal information through deep feature fusion, fully exploiting the complementary relationship between textual and visual features, and is crucial to the overall performance.
When the model is trained without the contrastive loss, its F1 scores on the two datasets decrease by 0.59% and 0.33%, respectively. This result validates the effectiveness of the contrastive learning mechanism in aligning heterogeneous modal representations in the latent space and mitigating inter-modal distributional differences, thereby enhancing the model's ability to capture cross-modal semantic associations.
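The exact form of the contrastive loss is not reproduced here; the sketch below shows a common symmetric InfoNCE-style formulation for aligning text and image embeddings in a shared latent space, illustrating the general mechanism rather than the specific loss used by CASF-MNER.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired text/image embeddings."""
    text_emb = F.normalize(text_emb, dim=-1)         # cosine similarity via L2 normalization
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.t() / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(text_emb.size(0), device=text_emb.device)  # matched pairs on the diagonal
    loss_t2i = F.cross_entropy(logits, targets)      # align text -> image
    loss_i2t = F.cross_entropy(logits.t(), targets)  # align image -> text
    return (loss_t2i + loss_i2t) / 2
```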
In order to more intuitively show the independent contribution of each functional module to the overall model performance, this paper further visualizes the results of the ablation experiments in
Figure 9. By comparing the performance curves of the complete model with those obtained after removing each sub-module, it can be clearly observed that removing different modules has significant and differentiated impacts on entity recognition Precision, Recall, and F1 score.
In
Figure 9a, DSF removal leads to the largest F1 loss with a positive P/R deviation, confirming its key role in integrating BERT representations and enhancing Recall. VRE contributes the next highest loss (0.93%), mainly optimizing the fusion of CLIP visual features with textual information. The significant negative P/R deviation of SFM on TWITTER-2017 reflects its particular contribution to Precision. According to the average F1 loss bar chart in Figure 9b, the DSF (Deep Semantic Fusion) module stands out with the highest F1 loss after ablation, firmly establishing its pivotal contribution to the overall model performance. Specifically, removing DSF results in a 1.66% loss in the F1 score, noticeably higher than the losses observed upon disabling VRE (0.93%), SFM (0.69%), CL (0.46%), or CAM (0.30%). This pattern indicates that, while the other modules each play significant roles in boosting the model's effectiveness, DSF is irreplaceable in achieving deep cross-modal fusion for entity recognition. Furthermore, Figure 9a reveals that, once DSF is ablated, the corresponding sample point exhibits both a larger F1 reduction and a positive Precision-Recall shift, suggesting that DSF is particularly beneficial for retrieving challenging entities and enables more comprehensive, nuanced cross-modal semantic integration.
In contrast, CAM (Cross-modal Alignment) and CL (Contrastive Learning) contribute relatively less to the F1 improvement, but their influence remains non-negligible. The radar plot in
Figure 9c highlights that CAM excels at “modality alignment,” effectively suppressing cross-modal noise and strengthening the interrelation of input representations. CL, while exhibiting lower scores in “contextual understanding,” demonstrates strong alignment capability; however, its contribution to global or intricate contextual reasoning is limited. Together, these two modules enhance the quality and consistency of underlying feature representations, providing a robust foundation for the semantic fusion operations carried out by DSF.
Delving deeper into the relationship between SFM and DSF,
Figure 9c shows that the two modules emphasize different functional dimensions. SFM plays a prominent role in “semantic consistency” and “co-occurrence expression” by selectively extracting and amplifying salient information at the feature input stage, whereas DSF is more adept in “semantic fusion” and “contextual reasoning” at a higher abstraction level. Their functions are complementary: SFM refines the initial signal, filtering out irrelevant noise and delivering high-quality features to DSF, which then performs deeper integration and information interaction across the global semantic space. Empirical results demonstrate that removing either module leads to performance drops, and neither can fully substitute for the other.
Taken as a whole,
Figure 9 illustrates the division of labor and mutual reinforcement among the various modules in the multimodal named entity recognition workflow. DSF is central to deep semantic integration, while VRE and SFM focus on optimizing modal features, and CL and CAM facilitate latent space alignment and noise mitigation. Rather than contributing in an additive fashion, these modules interact synergistically—each providing necessary support for others—to achieve both fine-grained and holistic improvements in model performance.
4.7. Analysis of Statistical Significance Between CASF and MGCMT
To further validate the performance advantage of the proposed CASF-MNER model over representative multimodal baselines, we performed rigorous statistical significance testing on the F1 scores. Specifically, we independently trained and evaluated both the CASF and MGCMT models on the same data split with 10 random seeds to account for training variability and the randomness inherent in deep learning model optimization. For each model, the F1 score was recorded over the 10 runs, and we computed the mean and standard deviation as indicators of central tendency and stability.
The experimental results, summarized in
Table 7, show that CASF-MNER achieved an average F1 score of 86.80 with a standard deviation of 0.518, while MGCMT obtained an average F1 score of 85.88 with a higher standard deviation of 0.694. To objectively assess whether the observed difference in average F1 scores is statistically meaningful rather than the result of random fluctuations, we conducted an independent samples t-test. The test yielded a t-value of 3.50 and a corresponding p-value of 0.00282. According to conventional statistical standards (typically p < 0.05), this p-value confirms that the difference is statistically significant, i.e., there is strong evidence that the CASF model consistently outperforms MGCMT rather than the difference being due to chance.
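For reproducibility, such a test can be run with SciPy as sketched below; the per-seed F1 values are hypothetical placeholders, since only the means and standard deviations are reported in Table 7.

```python
from scipy import stats

# Hypothetical per-seed F1 scores for the 10 runs of each model (placeholders;
# only the means and standard deviations are reported in the paper).
casf_f1  = [86.2, 86.4, 86.5, 86.7, 86.8, 86.9, 87.0, 87.1, 87.2, 87.4]
mgcmt_f1 = [84.9, 85.2, 85.4, 85.6, 85.8, 86.0, 86.2, 86.4, 86.6, 86.8]

t_stat, p_value = stats.ttest_ind(casf_f1, mgcmt_f1)  # independent samples t-test
print(f"t = {t_stat:.2f}, p = {p_value:.5f}")
```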
This statistical analysis not only underscores the reliability of the observed performance improvement, but also reflects the model’s stability across repeated trials. Collectively, these findings further corroborate the practical effectiveness and robustness of the CASF-MNER architecture for the multimodal named entity recognition task.
Figure 10 shows the box plots of the F1 scores for the two models, allowing a visual comparison of their distribution and variability. Overall, both the experiments and the statistical analysis confirm that the CASF model exhibits higher and more stable performance in terms of F1 score.