A Multimodal Bone Stick Matching Approach Based on Large-Scale Pre-Trained Models and Dynamic Cross-Modal Feature Fusion
Abstract
1. Introduction
- Limited modality integration: Existing approaches rely heavily on raw image data and fail to incorporate semantic information that could provide a more comprehensive understanding of the fragments. Without integration between high-level semantics and low-level visual features, the available information is underused and model performance suffers. Our study addresses this limitation by integrating multiple modalities (image data, inscriptions, and archeological metadata), which gives a more holistic view of bone stick fragments and improves matching accuracy.
- Inadequate image preprocessing: Many traditional methods fail to perform critical preprocessing tasks such as denoising, enhancement, or structural normalization, which are essential to prepare the input data for effective feature extraction. As a result, these methods often suffer from high computational burden and reduced model performance during feature extraction and task modeling. In contrast, our method introduces a more robust preprocessing pipeline, which precisely segments fracture regions and minimizes noise interference, thus allowing the model to focus on key features and significantly improving matching efficiency.
- Insufficient robustness in matching algorithms: The majority of existing algorithms struggle when dealing with highly variable fragment morphologies and severe surface degradation, resulting in weak feature extraction capabilities. As a result, the accuracy of recognition and reconstruction in such conditions is significantly diminished, greatly limiting the generalizability and practical applicability of these methods. Our approach, however, leverages advanced pre-trained models like Vision-RWKV, RWKV, and BERT, which are specifically designed to extract deep features across multiple modalities, ensuring that our model is far more robust to image degradation, missing fragments, and complex geometric variations. This multi-faceted feature extraction combined with a dynamic fusion mechanism enhances the generalization capability, making our method more applicable to a wide range of real-world scenarios.
- Design and implementation of the multimodal matching method: To ensure the comprehensive extraction of various information features while maintaining the overall robustness of the system and the operational efficiency of the model, this paper proposes an integrated matching system driven in parallel by three large models. The aim is to enhance the feature extraction capability in order to fully exploit the latent information within the data of bone sticks. By leveraging different large models corresponding to the distinct characteristics of the data, feature extraction is carried out separately for each type of data, thereby providing a more precise basis for bone stick image matching.
- Image data preprocessing and noise suppression: To enhance the model’s performance and generalization ability, this paper preprocesses the image dataset before training. This includes precise background removal and cropping of fracture regions using DIS-Net, which minimizes the impact of noise interference on model training.
- Dynamic cross-feature fusion mechanism: A dynamic cross-feature fusion mechanism is introduced, wherein features from different sources are cross-fused. This enables the model to deeply understand the input feature information from multiple angles and dimensions.
- Calculating the bone stick matching degree through feature vector differences: The matching relationship between bone sticks is computed from differences between deep feature vectors. By combining these differences with a binary cross-entropy loss function, the matching degree between bone sticks is quantitatively represented within the range [0, 1]. This approach significantly reduces the computational complexity of the network.
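The last contribution can be illustrated with a minimal sketch. It assumes that each bone stick is already represented by one fused 1024-dimensional feature vector. It further assumes that the difference is taken as an element-wise absolute difference and mapped to a score by a single linear layer followed by a sigmoid. The function names `match_score` and `bce_loss`, the random (untrained) parameters, and the choice of absolute difference are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def match_score(feat_a, feat_b, weight, bias):
    """Map the difference of two feature vectors to a score in (0, 1)."""
    diff = np.abs(feat_a - feat_b)        # assumption: element-wise absolute difference
    return sigmoid(diff @ weight + bias)  # assumption: one linear layer + sigmoid

def bce_loss(score, label, eps=1e-7):
    """Binary cross-entropy between a predicted score and a 0/1 label."""
    score = np.clip(score, eps, 1.0 - eps)  # avoid log(0)
    return float(-(label * np.log(score) + (1.0 - label) * np.log(1.0 - score)))

rng = np.random.default_rng(0)
dim = 1024
weight = rng.normal(scale=0.01, size=dim)  # hypothetical, untrained parameters
bias = 0.0
a = rng.normal(size=dim)                   # stand-ins for fused features
b = rng.normal(size=dim)

s = match_score(a, b, weight, bias)
print(f"score={s:.3f}  loss_if_match={bce_loss(s, 1):.3f}")
```

Because the head sees only one difference vector per pair, scoring a candidate pair costs a single dot product once the features are extracted, which is consistent with the low-complexity claim above.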
2. Basic Theory
2.1. RWKV
- R (Receptance Vector): This component acts as the receiver of past information. Passed through a sigmoid, it gates how much of the context accumulated from previous time steps is admitted at the current step. For example, if we are processing the sequence of words “The cat sat on the mat,” the receptance vector controls how much of the context from earlier words (e.g., “The cat”) informs the understanding of the current word (“sat”).
- W (Weight): This is a trainable decay parameter that controls how quickly the influence of past information fades as it moves further from the current time step. In the example, the weight determines how much influence the previous word “The” should have when processing the current word “cat.”
- K (Key): The key helps to retrieve the relevant information from the past, providing a form of selective memory. For instance, when processing the word “sat,” the key mechanism would identify the relevant context from the earlier word “cat.”
- V (Value): The value represents the actual content or features associated with the key. In the example, when processing “sat,” the value would contain features or embeddings related to the word “cat,” such as its semantic meaning.
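How the four components interact can be sketched as a recurrence. The code below follows the RWKV-4-style WKV formulation from the RWKV paper as we understand it. The extra vector `u` (a bonus applied to the current token) comes from that formulation and is not described in the list above. The implementation is deliberately naive: it has no numerical stabilization, and `wkv_recurrent` is an illustrative name.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def wkv_recurrent(r, k, v, w, u):
    """Naive RWKV-4-style time mixing.

    r, k, v: arrays of shape (T, C) for receptance, key, value.
    w: (C,) non-negative decay per channel; u: (C,) bonus for the current token.
    """
    T, C = k.shape
    num = np.zeros(C)  # running sum of exp(k_i) * v_i, decayed by exp(-w) each step
    den = np.zeros(C)  # running sum of exp(k_i), decayed the same way
    out = np.zeros((T, C))
    for t in range(T):
        cur = np.exp(u + k[t])                   # weight of the current token
        wkv = (num + cur * v[t]) / (den + cur)   # weighted average of values
        out[t] = sigmoid(r[t]) * wkv             # receptance gates the result
        num = np.exp(-w) * num + np.exp(k[t]) * v[t]
        den = np.exp(-w) * den + np.exp(k[t])
    return out

rng = np.random.default_rng(0)
T, C = 6, 4  # e.g., six tokens ("The cat sat on the mat"), four channels
demo = wkv_recurrent(rng.normal(size=(T, C)), rng.normal(size=(T, C)),
                     rng.normal(size=(T, C)), np.ones(C), np.zeros(C))
print(demo.shape)
```

The state carried between steps is just `num` and `den`, so the cost per token is constant in sequence length, which is the property that distinguishes this recurrence from full self-attention.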
2.2. Vision-RWKV (VRWKV)
2.3. BERT
2.4. Model Selection
3. The Proposed Method
3.1. Image Preprocessing Module
- (1) The main body of the bone stick is segmented using DIS-Net [19], a deep network designed for binary image segmentation. DIS-Net consists of an encoder, an image segmentation module based on U2-Net [20], and an intermediate supervision strategy. DIS-Net excels at precisely identifying and separating target objects in high-resolution images, which makes it well suited to the segmentation of bone stick images. In the experiments presented in this study, we follow DIS-Net’s training strategy and fine-tune the network from its pre-trained weights to achieve fine-grained segmentation of bone stick images, ultimately generating high-precision image labels. The segmentation results are shown in Figure 4b.
- (2) The original bone stick image (Figure 4a) is subjected to a pixel-wise AND operation with the segmented result image (Figure 4b), retaining the pixel values corresponding to the bone stick while setting the background pixel values to 255. This process generates a standardized bone stick image (Figure 4c). The computation is defined in (1), where $I_{\text{out}}$ represents the output image, $I$ and $M$ denote the original image and the mask image, respectively, and $(x, y)$ indicates the pixel location.

  $$I_{\text{out}}(x, y) = \begin{cases} I(x, y), & M(x, y) \neq 0 \\ 255, & M(x, y) = 0 \end{cases} \quad (1)$$

- (3) Because fractures are prevalent in the bone sticks within the images, not all bone sticks awaiting matching exhibit an intact form. To address this, the present study employs a data augmentation strategy that crops the fractured regions of the bone sticks. Additional cropping is applied to the main body of the bone sticks to exclude areas unrelated to the fractured portions, thereby preserving the feature information essential for bone stick matching. As illustrated in Figure 4d, the resulting cropped images are more tightly focused on the target regions, which helps the model recognize the key features of the bone sticks.
- (4) Following the cropping process, white padding is added around the bone stick images with reference to their longer edge, and the final image dimensions are standardized to 224 × 224 pixels. This normalization of image size aligns with the input requirements of the model, as shown in Figure 4e, ensuring data consistency and improving the model’s processing efficiency.
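Steps (2) to (4) can be sketched with NumPy alone, assuming the binary mask from step (1) is already available. Two simplifications are assumptions of this sketch rather than the paper's procedure: the fracture-region crop is replaced by a plain bounding-box crop of the mask, and resizing uses nearest-neighbor sampling. All function names are illustrative.

```python
import numpy as np

def apply_mask(image, mask, background=255):
    """Keep pixels where the mask is nonzero; set all others to the background."""
    keep = mask.astype(bool)[..., None]  # broadcast over color channels
    return np.where(keep, image, background).astype(image.dtype)

def crop_to_mask(image, mask):
    """Crop to the bounding box of the mask (assumes a non-empty mask)."""
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return image[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]

def pad_to_square(image, background=255):
    """Center the image on a white square whose side is the longer edge."""
    h, w = image.shape[:2]
    side = max(h, w)
    canvas = np.full((side, side, image.shape[2]), background, dtype=image.dtype)
    top, left = (side - h) // 2, (side - w) // 2
    canvas[top:top + h, left:left + w] = image
    return canvas

def resize_nearest(image, size=224):
    """Nearest-neighbor resize to size x size (a stand-in for a real resampler)."""
    h, w = image.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return image[rows][:, cols]

# Synthetic example: a gray 60 x 200 image with a rectangular "bone stick" mask.
image = np.full((60, 200, 3), 100, dtype=np.uint8)
mask = np.zeros((60, 200), dtype=np.uint8)
mask[10:50, 20:180] = 1

masked = apply_mask(image, mask)
out = resize_nearest(pad_to_square(crop_to_mask(masked, mask)))
print(out.shape)  # (224, 224, 3)
```

A 224 × 224 RGB output holds 224 × 224 × 3 = 150,528 values. Padding before resizing preserves the aspect ratio of the fragment, so fracture contours are not distorted by the size normalization.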
3.2. Feature Extraction Based on RWKV, V-RWKV and BERT Module
3.3. Dynamic Cross-Feature Fusion Model
3.4. Matching Calculation Module
4. Experimental Environment, Dataset, and Parameters
4.1. Experimental Environment
4.2. Dataset
1. Fragment Labeling: The bone stick fragments were carefully labeled by a team of professional archeologists. Each fragment was assigned a unique identifier, and the corresponding inscriptions were transcribed into standardized text. The inscription data include historical content (e.g., official titles, weapon types) and spatial metadata (e.g., excavation context, missing sections). Each pair of fragments identified as matching was manually labeled as a matching pair. These labeled pairs serve as the ground-truth labels for the dataset used in training and testing.
2. Sampling: To ensure a diverse and representative dataset, 3000 pairs of bone stick fragments were selected. These pairs were chosen to cover a wide range of fragmentation patterns, inscription types, and archeological metadata characteristics. The sample includes both well-preserved fragments with minimal damage and highly degraded fragments, so that the model can generalize to bone stick artifacts of different quality.
3. Preprocessing: The preprocessing of the images and metadata involved the following steps:
   - Image Preprocessing: Each image was processed to remove the background and crop out the fracture regions using the DIS-Net segmentation network, which is particularly effective for high-resolution binary image segmentation. The cropped images were resized to 224 × 224 pixels to meet the input requirements of the deep learning model.
   - Textual Data Preprocessing: The inscription data were cleaned and standardized, with any variations in transcription style adjusted to ensure consistency across the dataset. This allowed the RWKV model to process the inscriptions effectively.
   - Archeological Metadata: Additional metadata, including details such as color, excavation context, and missing sections, were encoded using the BERT model, which helped enrich the textual features.
4.3. Evaluation Metrics
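The result tables in this paper report Rank-1, Rank-5, Rank-10, and Rank-15 accuracy. The exact evaluation protocol is not reproduced in this text, so the sketch below uses the common retrieval definition as an assumption: a query counts as a hit at Rank-k if its true partner is among the k highest-scoring candidates. The function name `rank_k_accuracy` and the toy score matrix are illustrative.

```python
import numpy as np

def rank_k_accuracy(scores, true_index, ks=(1, 5, 10, 15)):
    """Percentage of queries whose true match ranks within the top k.

    scores: (Q, G) matching scores, higher means more likely to match.
    true_index: (Q,) index of the true partner among the G candidates.
    """
    scores = np.asarray(scores, dtype=float)
    true_index = np.asarray(true_index)
    true_scores = scores[np.arange(len(scores)), true_index]
    # Rank = 1 + number of candidates that score strictly higher than the true match.
    ranks = 1 + (scores > true_scores[:, None]).sum(axis=1)
    return {k: float((ranks <= k).mean() * 100.0) for k in ks}

# Toy example: three queries, four candidates each.
toy_scores = [[0.9, 0.1, 0.2, 0.3],   # true match (index 0) ranks 1st
              [0.2, 0.8, 0.9, 0.1],   # true match (index 1) ranks 2nd
              [0.1, 0.2, 0.3, 0.4]]   # true match (index 0) ranks 4th
toy_result = rank_k_accuracy(toy_scores, [0, 1, 0], ks=(1, 2, 4))
print(toy_result)
```

Under this definition Rank-k accuracy can only grow with k, which matches the pattern in every row of the reported tables.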
5. Experiments
5.1. Ablation Study (Effect of Pretrained Models)
5.2. Ablation Study (Effect of Model Variants)
5.3. Comparative Experiments: Proposed Method Versus Traditional Approaches
5.4. Comparative Experiments: Proposed Method Versus Other Large-Scale Pre-Trained Models
5.5. Comparative Experiments on Bone Stick Preprocessing
5.6. Sample Visualization
6. Discussion
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Yu, Z.Y. A Study on the Identification and Classification of Bone Sticks Excavated from the Weiyang Palace Site of Han Chang’an City. Arch. Cult. Relics 2007, 2, 48–62.
2. Liu, G.N. The Earliest Specialized Archive Repository in Our Country—The Han Dynasty Oracle Bone Inscription Archives Repository. Chin. Arch. 2007, 2, 50–52.
3. Liu, C.; Wang, H.; Mao, L.; Liu, R.; Wang, Z.; Wang, T. Image Stitching Method of Bone Stick Fragment Based on Similarity Freeman Code Matching. IEEE Access 2023, 11, 23073–23084.
4. Liang, Q.; Yang, L.; Luo, Z.; Jiang, W.; Hong, C. A Siamese Network-Based Method for Automatic Stitching of Artifact Fragments. IEEE Trans. Instrum. Meas. 2023, 72, 2520913.
5. Rui, X.; Zhang, X.; Wang, R. Application of the Multi-Feature Splicing Technology Based on Residual Network Identification. In Proceedings of the International Conference on Electronic Information Technology and Smart Agriculture (ICEITSA), Huaihua, China, 9–11 December 2022; pp. 381–384.
6. Samonte, M.J.C.; Kong, T. Image Processing of Cultural Relics Fragments Splicing through Hybrid Folded Mesh Simplification Algorithm. In Proceedings of the 2023 5th World Symposium on Software Engineering (PWSSE), Tokyo, Japan, 22–24 September 2023; pp. 305–314.
7. Wang, X.; Fu, S. Cultural Relics Fragment Assembly Based on Fracture Surface Contour Features. Appl. Opt. 2024, 63, 5278–5291.
8. Duan, Y.; Zhang, L.; Chen, X.; Wang, Q.; Zhao, Y.; Liu, Z.; Liu, Q. Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures. arXiv 2025, arXiv:2403.02308.
9. Peng, B.; Xue, F.; Wang, Q.; Liu, Y.; Gao, Z.; Zhang, L.; Zhang, H.; Wang, Y.; Tang, Y.; Liang, X.; et al. RWKV: Reinventing RNNs for the Transformer Era. arXiv 2023, arXiv:2305.13048.
10. Koroteev, M.V. BERT: A Review of Applications in Natural Language Processing and Understanding. arXiv 2021, arXiv:2103.11943.
11. Sherstinsky, A. Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) Network. arXiv 2021, arXiv:1808.03314v10.
12. Han, K.; Xiao, A.; Wu, E.; Guo, J.; Xu, C.; Wang, Y. Transformer in Transformer. In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS), Virtual Event, 6–14 December 2021; pp. 15908–15919.
13. Xie, S.; Li, Y.; Ma, Y.; Wu, Y. AutoGMM-RWKV: A Detecting Scheme Based on Attention Mechanisms against Selective Forwarding Attacks in Wireless Sensor Networks. IEEE Internet Things J. 2024, 12, 1–18.
14. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer Normalization. arXiv 2016, arXiv:1607.06450v1.
15. Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.; Farhan, L. Review of Deep Learning: Concepts, CNN Architectures, Challenges, Applications, and Future Directions. J. Big Data 2021, 8, 53.
16. Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in Vision: A Survey. ACM Comput. Surv. 2022, 54, 1–41.
17. Eldan, R.; Shamir, O. The Power of Depth for Feedforward Neural Networks. arXiv 2016, arXiv:1512.03965.
18. Adekotujo, A.S.; Enikuomehin, T.; Aribisala, B.; Mazzara, M.; Zubair, A.F. Computational Treatment of Natural Language Text for Intent Detection. Comput. Res. Model. 2024, 16, 1539–1554.
19. Qin, X.; Dai, H.; Hu, X.; Fan, D.-P.; Shao, L.; Van Gool, L. Highly Accurate Dichotomous Image Segmentation. arXiv 2022, arXiv:2203.03041.
20. Qin, X.; Zhang, Z.; Huang, C.; Dehghan, M.; Zaiane, O.R.; Jagersand, M. U2-Net: Going Deeper with Nested U-Structure for Salient Object Detection. Pattern Recognit. 2020, 106, 107404–107419.
21. Xue, Z.; Marculescu, R. Dynamic multimodal fusion. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 17–24 June 2023; pp. 2575–2584.

| Model Name | Modality | Pretraining Corpus | Output Feature Dimension |
|---|---|---|---|
| Vision-RWKV-L | Visual (Image) | ImageNet-22K | 150,528 (original) → 1024 (final) |
| RWKV-6 World | Text (Inscriptions) | The Pile (multilingual corpus) | 768 (original) → 1024 (final) |
| BERT-base-cased | Text (Metadata) | BooksCorpus + Wikipedia (EN) | 768 (original) → 1024 (final) |

| Training Set | Validation Set | Test Set 1 | Test Set 2 |
|---|---|---|---|
| 2100 pairs | 600 pairs | 300 pairs | 3000 pairs |

| Test Set | Method | Rank-1/% | Rank-5/% | Rank-10/% | Rank-15/% | Average/% | Params/M | TD/S |
|---|---|---|---|---|---|---|---|---|
| Test Set 1 | (1) VR-S | 31.20 | 48.54 | 65.02 | 71.16 | 53.98 | 215.23 | 32.41 |
| | (2) VR-B | 32.85 | 50.71 | 68.40 | 73.52 | 56.37 | 216.88 | 35.08 |
| | (3) VR-L | 34.93 | 55.13 | 75.12 | 82.25 | 61.86 | 217.94 | 50.37 |
| | (4) VR-L and R4 (normal contact) | 35.21 | 56.31 | 76.57 | 83.94 | 63.01 | 220.47 | 52.31 |
| | (5) VR-L and R5 (normal contact) | 36.78 | 57.42 | 77.23 | 84.65 | 64.02 | 223.92 | 54.72 |
| | (6) VR-L and R6 (normal contact) | 38.04 | 58.89 | 79.84 | 86.78 | 65.89 | 339.47 | 61.93 |
| | (7) VR-L and R6 and Bm (normal contact) | 38.62 | 59.54 | 81.12 | 87.40 | 66.67 | 442.13 | 64.87 |
| | (8) VR-L and R6 and Bb (normal contact) | 40.01 | 61.38 | 82.97 | 90.36 | 68.68 | 446.82 | 67.11 |
| | (9) R6 and Bb (normal contact) | 17.83 | 27.74 | 30.16 | 37.85 | 28.39 | 134.27 | 18.52 |
| | (10) R6 and Bb (Static Averaging) | 18.51 | 28.66 | 38.85 | 42.23 | 32.07 | 165.18 | 30.13 |
| | (11) VR-L and Bb (normal contact) | 37.12 | 56.95 | 76.98 | 83.83 | 63.72 | 414.55 | 62.26 |
| | (12) VR-L and Bb (Static Averaging) | 38.27 | 58.71 | 79.36 | 86.43 | 65.68 | 427.39 | 64.19 |
| | (13) VR-L and R6 and Bb (Static Averaging) | 41.72 | 64.32 | 83.63 | 90.42 | 69.52 | 439.12 | 65.28 |
| | (14) VR-L and R6 and Bb (Dynamic Cross-Feature Fusion) (ours) | 46.71 | 67.84 | 87.26 | 94.73 | 74.13 | 467.76 | 67.54 |
| Test Set 2 | (1) VR-S | 25.20 | 43.14 | 59.92 | 62.41 | 47.67 | 215.23 | 187.65 |
| | (2) VR-B | 27.05 | 45.49 | 63.47 | 69.17 | 51.30 | 216.88 | 204.91 |
| | (3) VR-L | 29.43 | 50.18 | 70.44 | 78.12 | 57.04 | 217.94 | 292.56 |
| | (4) VR-L and R4 (normal contact) | 30.01 | 51.63 | 72.15 | 80.04 | 58.46 | 220.47 | 303.26 |
| | (5) VR-L and R5 (normal contact) | 31.78 | 52.92 | 72.98 | 80.90 | 59.65 | 223.92 | 315.45 |
| | (6) VR-L and R6 (normal contact) | 33.34 | 54.66 | 75.84 | 83.25 | 61.77 | 339.47 | 357.19 |
| | (7) VR-L and R6 and Bm (normal contact) | 34.12 | 55.49 | 77.29 | 84.02 | 62.73 | 442.13 | 373.82 |
| | (8) VR-L and R6 and Bb (normal contact) | 35.71 | 57.51 | 79.31 | 85.13 | 64.42 | 446.82 | 390.02 |
| | (9) R6 and Bb (normal contact) | 9.52 | 14.81 | 16.10 | 20.21 | 15.16 | 134.27 | 106.58 |
| | (10) R6 and Bb (Static Averaging) | 10.74 | 16.63 | 22.54 | 24.51 | 18.61 | 165.18 | 173.14 |
| | (11) VR-L and Bb (normal contact) | 27.24 | 41.49 | 56.49 | 61.52 | 46.69 | 414.55 | 357.72 |
| | (12) VR-L and Bb (Static Averaging) | 29.61 | 45.42 | 61.40 | 66.87 | 50.83 | 427.39 | 369.26 |
| | (13) VR-L and R6 and Bb (Static Averaging) | 36.17 | 58.76 | 80.66 | 85.96 | 65.38 | 439.12 | 375.04 |
| | (14) VR-L and R6 and Bb (Dynamic Cross-Feature Fusion) (ours) | 42.83 | 62.16 | 83.21 | 87.33 | 68.88 | 467.76 | 391.40 |

| Test Set | Method | Rank-1/% | Rank-5/% | Rank-10/% | Rank-15/% | Average/% | Params/M | TD/S |
|---|---|---|---|---|---|---|---|---|
| Test Set 1 | (1) ViLT-Large | 32.03 | 49.62 | 66.71 | 72.34 | 55.17 | 347.17 | 89.12 |
| | (2) VLMo-base | 35.07 | 55.72 | 75.84 | 83.01 | 62.41 | 425.63 | 94.01 |
| | (3) VR-L and R6 | 38.04 | 58.89 | 79.84 | 86.78 | 65.89 | 339.47 | 61.93 |
| | (4) ViLT-Large and Bb (normal contact) | 36.62 | 57.60 | 78.20 | 85.36 | 64.45 | 454.52 | 94.29 |
| | (5) VLMo-base and Bb (normal contact) | 37.70 | 58.48 | 79.17 | 86.02 | 65.34 | 532.98 | 100.13 |
| | (6) VR-L and R6 and Bb (normal contact) | 40.01 | 61.38 | 82.97 | 90.36 | 68.68 | 446.82 | 67.11 |
| | (7) ViLT-Large and Bb (Dynamic Cross-Feature Fusion) | 39.02 | 60.13 | 81.40 | 88.57 | 67.28 | 475.46 | 97.46 |
| | (8) VLMo-base and Bb (Dynamic Cross-Feature Fusion) | 41.72 | 62.83 | 84.10 | 91.27 | 69.98 | 553.92 | 112.67 |
| | (9) VR-L and R6 and Bb (Dynamic Cross-Feature Fusion) (ours) | 46.71 | 67.84 | 87.26 | 94.73 | 74.13 | 467.76 | 67.54 |
| Test Set 2 | (1) ViLT-Large | 26.12 | 44.31 | 61.70 | 65.79 | 49.48 | 347.17 | 481.25 |
| | (2) VLMo-base | 28.24 | 47.83 | 66.95 | 73.64 | 54.16 | 425.63 | 535.86 |
| | (3) VR-L and R6 | 33.34 | 54.66 | 75.84 | 83.25 | 61.77 | 339.47 | 357.19 |
| | (4) ViLT-Large and Bb (normal contact) | 32.56 | 53.79 | 74.41 | 82.07 | 60.71 | 454.52 | 603.46 |
| | (5) VLMo-base and Bb (normal contact) | 33.73 | 55.07 | 76.56 | 83.63 | 62.25 | 532.98 | 624.81 |
| | (6) VR-L and R6 and Bb (normal contact) | 35.71 | 57.51 | 79.31 | 85.13 | 64.42 | 446.82 | 390.02 |
| | (7) ViLT-Large and Bb (Dynamic Cross-Feature Fusion) | 33.73 | 55.07 | 76.56 | 83.63 | 62.24 | 475.46 | 662.79 |
| | (8) VLMo-base and Bb (Dynamic Cross-Feature Fusion) | 34.91 | 56.50 | 78.30 | 84.58 | 63.57 | 553.92 | 766.16 |
| | (9) VR-L and R6 and Bb (Dynamic Cross-Feature Fusion) (ours) | 42.83 | 62.16 | 83.21 | 87.33 | 68.88 | 467.76 | 391.40 |

| Dataset | Test Set | Rank-1/% | Rank-5/% | Rank-10/% | Rank-15/% | Average/% |
|---|---|---|---|---|---|---|
| Unprocessed Bone Stick Dataset | Test Set 1 | 34.65 | 48.35 | 64.72 | 76.83 | 56.14 |
| | Test Set 2 | 28.83 | 39.62 | 52.27 | 56.46 | 44.30 |
| Preprocessed Bone Stick Dataset | Test Set 1 | 46.71 | 67.84 | 87.26 | 94.73 | 74.13 |
| | Test Set 2 | 42.83 | 62.16 | 83.21 | 87.33 | 68.88 |

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Fan, T.; Wang, H.; Wang, K.; Liu, R.; Wang, Z. A Multimodal Bone Stick Matching Approach Based on Large-Scale Pre-Trained Models and Dynamic Cross-Modal Feature Fusion. Appl. Sci. 2025, 15, 8681. https://doi.org/10.3390/app15158681
