A Multimodal Bone Stick Matching Approach Based on Large-Scale Pre-Trained Models and Dynamic Cross-Modal Feature Fusion

Fan, Tao; Wang, Huiqin; Wang, Ke; Liu, Rui; Wang, Zhan

doi:10.3390/app15158681


by Tao Fan ¹, Huiqin Wang ¹,*, Ke Wang ¹, Rui Liu ² and Zhan Wang ³

¹ School of Information and Control Engineering, Xi’an University of Architecture and Technology, Xi’an 710055, China

² Institute of Archaeology, Chinese Academy of Social Sciences, Beijing 100101, China

³ Shaanxi Institute for the Preservation of Cultural Heritage, Xi’an 710075, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(15), 8681; https://doi.org/10.3390/app15158681

Submission received: 7 July 2025 / Revised: 1 August 2025 / Accepted: 4 August 2025 / Published: 5 August 2025


Abstract

Among the approximately 60,000 bone stick fragments unearthed from the Weiyang Palace site of the Han Dynasty, about 57,000 bear inscriptions. Most of these fragments exhibit vertical fractures, leading to a separation between the upper and lower fragments, which poses significant challenges to digital preservation and artifact restoration. Manual matching is inefficient and may cause further damage to the bone sticks. This paper proposes a novel multimodal bone stick matching approach that integrates image, inscription, and archeological information to enhance the accuracy and efficiency of matching fragmented bone stick artifacts. Unlike traditional methods that rely solely on image data, our method leverages large-scale pre-trained models, namely Vision-RWKV for visual feature extraction, RWKV for inscription analysis, and BERT for archeological metadata encoding. A dynamic cross-modal feature fusion mechanism is introduced to effectively combine these features, enabling better interaction and weighting based on the contextual relevance of each modality. This approach significantly improves matching performance, particularly in challenging cases involving fractures, corrosion, and missing sections. The novelty of this method lies in its ability to simultaneously extract and fuse multiple sources of information, addressing the limitations of traditional image-based matching methods. This paper uses Rank-N and Cumulative Match Characteristic (CMC) curves as evaluation metrics. Experimental evaluation shows that the matching accuracy reaches 94.73% at Rank-15, and the method performs significantly better than the comparative methods on the CMC evaluation curve, demonstrating outstanding performance. Overall, this approach significantly enhances the efficiency and accuracy of bone stick artifact matching, providing robust technical support for the research and restoration of bone stick cultural heritage.

Keywords: bone stick; feature extraction; image matching

1. Introduction

A substantial number of fragmented bone stick artifacts have been unearthed at the site of the Weiyang Palace in Chang’an City during the Han Dynasty. These bone sticks, made from animal scapulae, are long, bar-shaped objects, with over 57,000 of them inscribed with textual information [1]. The inscriptions on these bone sticks encompass a diverse array of subjects, spanning official titles, categories of weapons, and units of measurement throughout the Western Han period. They vividly reconstruct the social structure and development of the Western Han Dynasty [2], offering profound academic and cultural value to archeological research. However, due to prolonged burial, the bone sticks have been subjected to the erosive forces of soil movement and their inherent structural degradation. As a result, most bone sticks are found broken into upper and lower fragments, often already fractured by the time of excavation. Archeologists primarily rely on observing the texture of the fracture surfaces and the inscription to match fragments. This process is highly time-consuming and labor-intensive, with the risk of causing secondary damage to the bone sticks during the matching process. The application of digital image processing techniques for the matching of bone stick fragments presents a promising strategy to substantially improve the efficiency and accuracy of the matching process.

In current research within the domain of image matching, considerable progress has been made. For instance, Liu et al. [3] proposed an image matching method based on Freeman coding. Freeman coding is highly effective in representing regular boundaries; however, it exhibits clear limitations when dealing with fragments that have irregular shapes or complex boundary curvature, resulting in a significant reduction in generalization capability during matching. Liang et al. [4] introduced a matching method based on Siamese networks, which relies on the complete geometric and textural information of fracture surfaces. However, when fracture surfaces are severely degraded due to corrosion, missing sections, or noise, the accuracy of feature extraction and matching based on individual point clouds is substantially diminished. Rui et al. [5] designed a multi-feature matching method based on residual networks and addressed the multi-fragment stitching problem through a dynamic matching algorithm. Nonetheless, when the geometric edges of fragments are severely damaged, extracting valid feature points from images becomes exceedingly difficult, thus reducing matching accuracy. Samonte et al. [6] proposed an automatic matching algorithm based on feature region partitioning, which primarily depends on the geometric shape features of the fragments. However, this method overlooks other potentially valuable information, resulting in lower accuracy for fragments with sparse surface features or unclear boundary characteristics. Wang et al. [7] presented a stitching method based on fracture surface profile features, integrating curvature and neighborhood information. This approach uses geometric matching and iterative optimization algorithms to achieve fragment registration and stitching. However, for fragments with nonlinear surface textures or multi-layered complex structures, the extraction and matching of feature points significantly deteriorate, thereby affecting the overall accuracy of the matching process.

Although the aforementioned methods demonstrate a certain degree of effectiveness under specific conditions, they continue to face several critical bottlenecks in practical applications, which this study effectively addresses:

Limited modality integration: Existing approaches rely heavily on raw image data and fail to incorporate semantic information that could provide a more comprehensive understanding of the fragments. The lack of integration between high-level semantics and low-level visual features results in suboptimal information utilization, significantly impairing model performance. Our study overcomes this limitation by integrating multiple modalities (image data, inscriptions, and archeological metadata), which enables a more holistic understanding of bone stick fragments and improves matching accuracy.
Inadequate image preprocessing: Many traditional methods fail to perform critical preprocessing tasks such as denoising, enhancement, or structural normalization, which are essential to prepare the input data for effective feature extraction. As a result, these methods often suffer from high computational burden and reduced model performance during feature extraction and task modeling. In contrast, our method introduces a more robust preprocessing pipeline, which precisely segments fracture regions and minimizes noise interference, thus allowing the model to focus on key features and significantly improving matching efficiency.
Insufficient robustness in matching algorithms: The majority of existing algorithms struggle when dealing with highly variable fragment morphologies and severe surface degradation, resulting in weak feature extraction capabilities. As a result, the accuracy of recognition and reconstruction in such conditions is significantly diminished, greatly limiting the generalizability and practical applicability of these methods. Our approach, however, leverages advanced pre-trained models like Vision-RWKV, RWKV, and BERT, which are specifically designed to extract deep features across multiple modalities, ensuring that our model is far more robust to image degradation, missing fragments, and complex geometric variations. This multi-faceted feature extraction combined with a dynamic fusion mechanism enhances the generalization capability, making our method more applicable to a wide range of real-world scenarios.

To address the aforementioned challenges, this paper introduces a pre-trained large model-driven multimodal information fusion matching method. By extracting multimodal feature information, the model’s representational capacity during the feature extraction process is enhanced. Furthermore, through a dynamic cross-feature fusion strategy, nonlinear interactions between features are modeled, which not only improves matching accuracy but also enhances the model’s generalization capability. This approach integrates three types of information—image data from the fracture surfaces of the bone sticks, the inscriptions on the bone sticks, and archeological information about bone sticks (including color, excavation context number, and missing section)—as input to the model. Initially, pre-trained large-scale models, namely Vision-RWKV [8], RWKV [9], and BERT [10], are employed to extract features from different modalities of data. Then, a dynamic cross-modal feature fusion module is used to effectively combine the various features. Finally, bone stick image matching is achieved by calculating feature similarities. A contrastive learning approach is utilized, and through transfer learning, the pre-trained weights of the large-scale models are used to initialize the bone stick matching task, thereby accelerating the model’s adaptation to this specific domain.

The primary contributions of this paper are as follows:

Design and implementation of the multimodal matching method: To ensure the comprehensive extraction of various information features while maintaining the overall robustness of the system and the operational efficiency of the model, this paper proposes an integrated matching system driven in parallel by three large models. The aim is to enhance the feature extraction capability in order to fully exploit the latent information within the data of bone sticks. By leveraging different large models corresponding to the distinct characteristics of the data, feature extraction is carried out separately for each type of data, thereby providing a more precise basis for bone stick image matching.
Image data preprocessing and noise suppression: To enhance the model’s performance and generalization ability, this paper conducts efficient preprocessing of the image dataset. This includes the precise removal of the backgrounds and the cropping of fracture regions using DIS-Net in order to minimize the impact of noise interference on model training.
Dynamic cross-feature fusion mechanism: A dynamic cross-feature fusion mechanism is introduced, wherein features from different sources are cross-fused. This enables the model to deeply understand the input feature information from multiple angles and dimensions.
Calculating bone stick matching degree through feature vector differences: The matching relationship between bone sticks is computed using deep feature vector differences. By combining feature vector differences with a binary cross-entropy loss function, the matching degree between bone sticks is quantitatively represented within the range [0, 1]. This approach significantly reduces the computational complexity of the network.
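
As a rough illustration of the last contribution, the sketch below maps the difference between two feature vectors to a score between 0 and 1 and evaluates a binary cross-entropy loss against a match label. The single linear scoring layer, the use of the absolute difference, the weights `w`, the bias `b`, and the function names are all assumptions made for illustration; the text does not specify the scoring head at this point.

```python
import math

def match_score(feat_a, feat_b, w, b=0.0):
    """Score in (0, 1) computed from the absolute difference of two feature vectors.

    Hypothetical scoring head: one linear layer followed by a sigmoid.
    """
    diff = [abs(x - y) for x, y in zip(feat_a, feat_b)]
    logit = sum(wi * di for wi, di in zip(w, diff)) + b
    return 1.0 / (1.0 + math.exp(-logit))

def bce_loss(score, label, eps=1e-12):
    """Binary cross-entropy for one pair; label is 1 (match) or 0 (non-match)."""
    score = min(max(score, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(score) + (1 - label) * math.log(1.0 - score))

# Toy usage with made-up 4-dimensional features (not real bone stick data).
w = [-2.0, -2.0, -2.0, -2.0]  # negative weights: a larger difference lowers the score
upper = [0.1, 0.5, 0.3, 0.9]
lower_match = [0.1, 0.5, 0.3, 0.9]
lower_other = [0.9, 0.1, 0.8, 0.2]
s_match = match_score(upper, lower_match, w, b=2.0)
s_other = match_score(upper, lower_other, w, b=2.0)
```

In this toy setup the identical pair receives the higher score, and labeling it as a match yields the lower loss.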

2. Basic Theory

2.1. RWKV

RWKV is a large-scale model with a novel neural network architecture that combines the characteristics of Recurrent Neural Networks (RNNs) [11] and Transformers [12]. By introducing a weighted key–value pair mechanism [13], it effectively captures contextual information from time-series data, empowering the model to discern and model interdependencies across distant time steps more effectively, while maintaining efficiency during feature extraction. Additionally, RWKV adjusts the memory length through a weighted mechanism, prioritizing the retention of important information relevant to the current time step, thereby enabling more targeted feature extraction. The flow of information within the RWKV architecture facilitates the seamless and efficient transmission of data across time steps, enabling the extraction of more nuanced and sophisticated textual features.

The RWKV model is constructed based on stacked residual blocks, as shown in Figure 1a. Each residual block consists of a time-mixing subblock and a channel-mixing subblock. This design enables the simultaneous modeling of dependencies in both the feature and time dimensions, demonstrating the strengths of a recurrent structure. It allows the model to fully leverage historical information to enhance feature extraction at the current time step. The model incorporates a unique attention mechanism, which introduces a dynamic update mechanism, particularly the time-dependent operation. This significantly enhances numerical stability and alleviates the problem of vanishing gradients. Through this mechanism, the model is able to control the interaction of information between different time steps with finer granularity, thereby more precisely capturing key features. Additionally, the model structure integrates layer normalization techniques [14], which are crucial for stabilizing gradient flow and effectively addressing common issues such as gradient vanishing and gradient explosion during deep network training. This stability not only optimizes the model’s training process but also further supports the stacking of deep networks, enabling the model to capture abstract features at different levels and extract more complex feature information.

The core architecture of the RWKV model consists of four fundamental components, as illustrated in Figure 1b. The inherent components of the time-mixing and channel-mixing modules are as follows:

R (Receptance Vector): This component acts as the receiver of information from previous time steps. For example, if we are processing the sequence of words “The cat sat on the mat,” the receptance vector controls how much of the context from earlier words (e.g., “The cat”) is admitted to inform the understanding of the current word (“sat”).
W (Weight): This is a trainable parameter that adjusts the influence of past information on the current time step. In the example, the weight determines how much influence the previous word “The” should have when processing the current word “cat.”
K (Key): The key helps to retrieve the relevant information from the past, providing a form of selective memory. For instance, when processing the word “sat,” the key mechanism would identify the relevant context from the earlier word “cat.”
V (Value): The value represents the actual content or features associated with the key. In the example, when processing “sat,” the value would contain features or embeddings related to the word “cat,” such as its semantic meaning.

These components work together to dynamically capture temporal dependencies and enhance the ability of the RWKV model to integrate both short- and long-term contextual information.

2.2. Vision-RWKV (VRWKV)

Vision-RWKV (VRWKV) is an adaptation of the RWKV architecture and its attention mechanism for visual tasks. In comparison to Convolutional Neural Networks (CNNs) [15] and Vision Transformers (ViTs) [16], it employs a simplified attention mechanism that reduces computational complexity while enhancing its ability to process high-resolution images. This approach effectively captures both local features (e.g., edges and textures) and global information (such as contextual relationships between objects), thus enabling the extraction of more precise and detailed features when handling complex images.

The architecture of VRWKV is composed of L identical encoder layers, an average pooling layer, and a linear prediction head, as depicted in Figure 2a. Each VRWKV encoder layer encompasses two fundamental components: the spatial mixing module and the channel mixing module. The spatial mixing module serves as the attention mechanism, executing global attention operations with linear complexity, while the channel mixing module functions as a feed-forward network (FFN) [17], enabling feature fusion along the channel dimension, as shown in Figure 2b. This design optimizes the model’s ability to efficiently combine representational information from various convolutional kernels or feature channels, thereby augmenting its capacity to extract a broader range of features from images and enhancing its performance in complex visual tasks. While preserving the intrinsic advantages of the RWKV model, VRWKV introduces a series of refinements to the conventional attention mechanism, namely bidirectional attention, relative bias, and flexible decay.

The introduction of bidirectional attention allows the model to comprehensively consider the interactions between different pixels or regions in an image during feature extraction. This mechanism significantly enhances the model’s ability to capture global semantic information within the image, enabling a more detailed understanding of both the global structure and the finer details of the image. The incorporation of relative bias helps the model to adaptively adjust its ability to capture spatial relationships between pixels or regions when processing images of varying sizes. This design is particularly beneficial for handling images of different scales, thereby improving the model’s capacity for cross-scale feature extraction. The flexible decay mechanism not only strengthens the model’s global attention computation ability but also enables it to capture long-range dependencies within the image. By employing an adaptive decay strategy, the model can focus on regions of the image that are farther from the current target during attention calculation, thus optimizing the modeling of remote features and further enhancing the depth of image feature representation.

2.3. BERT

BERT is a language model based on the Transformer architecture, whose most notable feature is its ability to model bidirectional context [18]. This approach provides BERT with a significant advantage in text comprehension. By integrating the BERT model into the process of extracting bone stick archeological metadata, it not only enhances the depth and accuracy of feature extraction but also effectively captures multidimensional feature information such as the color, excavation context number, and missing section of the bone stick. The key advantage of this approach lies in BERT’s bidirectional context modeling ability and its capacity for multi-layer feature extraction. These capabilities enable the model to extract richer semantic features from complex input data, thereby improving the overall performance of the system.

2.4. Model Selection

To ensure robust feature extraction and deep semantic comprehension across heterogeneous modalities, this study adopts three pre-trained large-scale models, Vision-RWKV, RWKV, and BERT, for the visual, inscriptional, and archeological features, respectively. Together they form a modular feature extraction architecture that leverages each model’s capacity for feature representation within its corresponding data type.

For the visual modality, bone stick fracture images typically exhibit unstructured attributes such as complex boundary curvature and vulnerability to local information loss. While conventional networks excel in capturing localized textures, they often struggle to model long-range dependencies inherent in the global structure of bone sticks. The Vision-RWKV model, integrating the spatial sensitivity of Convolutional Neural Networks (CNNs) with the global contextual modeling capabilities of Transformers, incorporates bidirectional attention, relative positional bias, and mechanisms for long-distance dependency modeling. These enhancements enable more efficient extraction of edge features and fine-grained texture variations in fractured regions—making it particularly effective for representing nonlinear damage patterns prevalent in bone stick imagery.

In the textual modality, RWKV, as a next-generation model that synergizes the recurrent dynamics of RNNs with the contextual depth of Transformers, demonstrates superior capability in capturing long-distance semantic dependencies in Chinese text. Given that bone stick inscriptions are typically devoid of punctuation, densely packed with characters, and structurally compressed, RWKV offers higher stability in modeling contextual relationships between characters. Compared to LSTM, GRU, or standard Transformer-based architectures, RWKV is better suited for handling classical Chinese and epigraphic texts in terms of semantic representation.

BERT is employed to extract the archeological feature information of the bone stick, particularly for processing inscriptions. BERT is preferred over other language models like ELECTRA and RoBERTa due to its bidirectional context modeling. Unlike ELECTRA, which excels in pretraining but lacks deep contextual comprehension, BERT captures relationships between characters in dense, unpunctuated inscriptions, making it ideal for processing classical Chinese text. Additionally, RoBERTa lacks BERT’s token-level pretraining on large-scale corpora, limiting its generalization ability. On the other hand, ELECTRA focuses on distinguishing replaced tokens rather than contextual understanding, which is crucial for the complex syntactical structures of bone stick inscriptions. Therefore, BERT’s bidirectional capability ensures high accuracy in capturing linguistic nuances and interdependencies in inscriptions, which is essential for accurate bone stick matching.

RWKV, Vision-RWKV, and BERT, as representative large-scale model families that have undergone continuous iterations and optimizations in recent years, exhibit significant differences in parameter scale and applicability across various scenarios. Given that the images of bone sticks are characterized by relatively small dimensions and high complexity, to balance the overall system robustness and efficiency requirements, this study adopts the Vision-RWKV-L model, pre-trained on the ImageNet-22K dataset, to extract critical visual feature information from the fractured areas of bone sticks. Simultaneously, the RWKV-6 World model, pre-trained on the Pile dataset, is utilized to extract the inscription information of bone sticks, while the BERT-base-cased model, pre-trained on the BooksCorpus dataset, is employed to extract the archeological feature information of bone sticks, with the model category and detailed specifications provided in Table 1.

The parameter counts of the selected models are as follows: Vision-RWKV: 198 M, RWKV-6: 144 M, and BERT-base-cased: 110 M, for a total of approximately 452 M parameters. Each model extracts features from its own modality, which preserves accuracy while improving training efficiency and mitigates the risk of overfitting that can arise from excessively large parameter counts. The selection reflects a balance between model complexity and performance requirements, providing a robust and efficient solution for the archeological analysis of bone stick data.

3. The Proposed Method

This paper proposes a bone stick matching method driven by pre-trained large models based on multimodal information fusion, with the framework structure illustrated in Figure 3. The core of the method includes the following modules: the Image Preprocessing module (IP), the Feature Extraction module based on RWKV, V-RWKV, and BERT (RVB), the Dynamic Cross-Feature Fusion module (DCFF), and the Matching Calculation module (MC). The overall workflow is as follows: Firstly, the IP module performs background removal and cropping preprocessing on the bone stick image dataset, eliminating noise and interference from the images. Then, two sets of bone stick data are input into the RVB module, which uses a feature extraction network to obtain feature vector representations of the multimodal input data. Subsequently, the DCFF module dynamically fuses the three types of features, generating a unified feature representation, which is then passed to the MC module. The MC module ranks the candidate images based on the matching score and selects the images with higher scores as potentially matched bone sticks, ultimately outputting the matching results.
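
The final ranking step of the MC module can be sketched as follows. This is only an illustration under stated assumptions: `rank_candidates`, `rank_n_hit`, and the stand-in scoring function `neg_sq_dist` are hypothetical names, and the negative squared distance merely substitutes for the learned matching score.

```python
def rank_candidates(query_feat, candidates, score_fn, top_n=15):
    """Return up to top_n candidate ids sorted by descending matching score."""
    scored = [(score_fn(query_feat, feat), cid) for cid, feat in candidates.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cid for _, cid in scored[:top_n]]

def rank_n_hit(ranked_ids, true_id, n):
    """True if the correct fragment appears among the first n ranked candidates."""
    return true_id in ranked_ids[:n]

def neg_sq_dist(a, b):
    """Stand-in score (assumption): higher when the two feature vectors are closer."""
    return -sum((x - y) ** 2 for x, y in zip(a, b))

# Toy usage with made-up 2-dimensional fused features.
query = [1.0, 0.0]
candidates = {"A": [0.9, 0.1], "B": [0.0, 1.0], "C": [1.0, 0.0]}
ranked = rank_candidates(query, candidates, neg_sq_dist)
```

Averaging `rank_n_hit` over all query fragments for a given `n` gives a Rank-N accuracy, and plotting it over increasing `n` gives a CMC curve.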

3.1. Image Preprocessing Module

The raw bone stick image data typically carry a significant amount of noise, as shown in Figure 4a. If unprocessed images are directly fed into the model, they not only considerably increase computational overhead but also introduce redundant features. The redundancy of these features could interfere with the model’s ability to effectively extract relevant features, thereby increasing the risk of overfitting and negatively affecting the model’s generalization performance. Therefore, to effectively extract the key matching features from the images, this paper performs preprocessing on the image dataset. The specific procedure involves segmenting the main informative regions of the image and then cropping the image at the fracture points. The detailed processing steps are as follows:

(1): The main body of the bone stick is segmented using DIS-Net [19], a deep network designed for binary image segmentation. DIS-Net consists of an encoder, an image segmentation module based on $U^{2}$-Net [20], and an intermediate supervision strategy. DIS-Net excels in precisely identifying and separating target objects in high-resolution images. Due to its superior binary classification capability and high-precision segmentation features, DIS-Net demonstrates high applicability in the segmentation of bone stick images. In the experiments presented in this study, we follow DIS-Net’s training strategy and fine-tune the network using its pre-trained weights to achieve fine-grained segmentation of bone stick images, ultimately generating high-precision image labels. The segmentation results are shown in Figure 4b.
(2): The original bone stick image (Figure 4a) is subjected to a pixel-wise AND operation with the segmented result image (Figure 4b), retaining the pixel values corresponding to the bone stick while setting the background pixel values to 255. This process generates a standardized bone stick image (Figure 4c). The computation is defined in (1), where $dst$ represents the output image, $src_{1}$ and $src_{2}$ denote the original image and the mask image, respectively, and $(i, j)$ indicates the pixel location.
(3): Due to the prevalence of fractures in the bone sticks within the images, not all bone sticks awaiting matching exhibit an intact form. To address this, the present study employs a data augmentation strategy by cropping the fractured regions of the bone sticks. Furthermore, additional cropping is applied to the main body of the bone sticks to exclude areas unrelated to the fractured portions, thereby preserving critical feature information essential for bone stick matching. As illustrated in Figure 4d, the resulting cropped images are more tightly focused on the target regions, which enhances the model’s ability to recognize the key features of the bone sticks more effectively.
(4): Following the cropping process, white padding is added around the bone stick images with reference to their longer edge, ensuring that the final image dimensions are standardized to 224 × 224 pixels. This normalization of image size aligns with the input requirements of the model, as shown in Figure 4e, thereby ensuring data consistency and enhancing the model’s processing efficiency.
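
Step (4) can be sketched in plain Python as below. The centered placement of the crop, the nearest-neighbour resampling to the target size, and the function names are assumptions for illustration; the text only states that white padding is added relative to the longer edge and that the final size is 224 × 224.

```python
def pad_to_square(img, fill=255):
    """Pad a grayscale image (list of rows) with `fill` so both sides equal the longer edge."""
    h, w = len(img), len(img[0])
    side = max(h, w)
    top = (side - h) // 2    # center the crop vertically (assumption)
    left = (side - w) // 2   # center the crop horizontally (assumption)
    out = [[fill] * side for _ in range(side)]
    for r in range(h):
        out[top + r][left:left + w] = img[r]
    return out

def resize_nearest(img, size=224):
    """Nearest-neighbour resize of a square image to size x size (assumed resampling)."""
    src = len(img)
    idx = [min(src - 1, (i * src) // size) for i in range(size)]
    return [[img[r][c] for c in idx] for r in idx]

# Toy usage: a dark 2 x 6 crop becomes a 6 x 6 square, then a 224 x 224 image.
crop = [[0] * 6 for _ in range(2)]
square = pad_to_square(crop)
standardized = resize_nearest(square, 224)
```

Padding before resizing keeps the aspect ratio of the fragment intact, so the fracture contour is not stretched.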

To ensure the acquisition of a high-quality dataset for the bone stick image matching system and to enhance both the efficiency and accuracy of the matching process, a semi-automated mechanism is employed in the image preprocessing pipeline. Data preprocessing steps include image segmentation, noise removal, and cropping to retain the relevant fracture regions. Background removal is performed using DIS-Net to accurately isolate the bone stick fragments. Additionally, a normalization step is applied where the dimensions of the input images are standardized to a uniform size of 224 × 224 pixels. This normalization ensures that all input data align across modalities, reducing variability and enhancing feature extraction across different input types. The preprocessing is critical for aligning the image, inscription, and archeological data, ensuring that the final model can process these multimodal inputs consistently. Specifically, steps (1) and (2) of the preprocessing procedure are executed using automated techniques to improve processing efficiency and consistency, while steps (3) and (4) are performed manually to ensure that the images input to the pre-trained large models meet the required standards of quality and structural integrity. This hybrid approach ensures that the subsequent stages of feature extraction and matching analysis comply with the stringent input requirements of the model.

$$dst(i, j) = src_{1}(i, j) \land src_{2}(i, j) + 255 \times (1 - src_{2}(i, j)) \tag{1}$$
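
A minimal sketch of Equation (1) follows, assuming an 8-bit grayscale image stored as nested lists and a binary mask with values 0 (background) or 1 (bone stick). Under that assumption the AND term is realized by combining each pixel with an all-ones or all-zeros byte; the function name `apply_mask` is illustrative.

```python
def apply_mask(src1, src2, background=255):
    """Apply Equation (1) per pixel: keep masked pixels, set the background to 255."""
    dst = []
    for row_img, row_mask in zip(src1, src2):
        dst_row = []
        for pixel, m in zip(row_img, row_mask):
            # AND with 0xFF keeps the pixel, AND with 0x00 clears it.
            kept = pixel & (0xFF if m else 0x00)
            # The second term adds 255 only where the mask is 0.
            dst_row.append(kept + background * (1 - m))
        dst.append(dst_row)
    return dst

# Toy usage on a 2 x 2 image.
image = [[10, 200], [35, 90]]
mask = [[1, 0], [0, 1]]
result = apply_mask(image, mask)
```

Pixels under the mask keep their original values, while every background pixel becomes white.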

3.2. Feature Extraction Based on RWKV, V-RWKV and BERT Module

The input data for the feature extraction network is divided into three categories: the inscription information on the bone stick, the preprocessed bone stick images, and the archeological metadata of the bone stick (including its color, excavation context number, and missing section). The overall structure is illustrated in Figure 5. Among these, RWKV is used to extract features from the Chinese inscription on the bone stick. Because the RWKV architecture combines the recurrent properties of RNNs with the contextual modeling capabilities of Transformers, it demonstrates superior comprehension when processing Chinese text. Vision-RWKV extracts feature representations from the preprocessed bone stick images, while BERT is used to extract the archeological feature information of the bone stick.

(1) RWKV extracts feature representations of the bone stick inscription: The RWKV feature extraction mechanism consists of the following steps. The input text is first divided into a sequence of words $\{x_{1}, x_{2}, \dots, x_{T}\}$, where $T$ is the sequence length. Each word is mapped to its corresponding index in the vocabulary and transformed into an embedding vector $e_{Rt} = E_{R}[x_{t}]$, $e_{Rt} \in \mathbb{R}^{d}$, where $E_{R} \in \mathbb{R}^{o \times d}$ is the embedding matrix, $o$ is the vocabulary size, and $d = 768$ is the embedding dimension. The core of RWKV lies in its Key-Value recurrence mechanism, designed to accumulate contextual information. At each time step, the input embedding $e_{Rt}$ is linearly transformed into the Key and Value vectors, as shown in (2), where $W_{Rk}, W_{Rv} \in \mathbb{R}^{d \times d}$ are trainable weight matrices. The mechanism then updates the accumulated state $s_{t-1}$ and incorporates the influence of the current time step through a dynamic weighting factor $\alpha_{t} \in [0, 1]$, as specified in (3). This factor balances the contributions of past states and the current input. The final output hidden state $h_{t}$ at the current time step is the combination of the Key and the accumulated state $s_{t}$, computed as shown in (4). The RWKV model outputs the hidden state matrix $H_{R} \in \mathbb{R}^{T \times d}$, where $d = 768$. The hidden vector corresponding to the last time step, $h_{text1} \in \mathbb{R}^{768}$, is used as the feature representation of the entire input sequence.

$$K_{t} = W_{Rk} e_{Rt}, \quad V_{t} = W_{Rv} e_{Rt} \tag{2}$$

$$s_{t} = \alpha_{t} \cdot s_{t-1} + (1 - \alpha_{t}) \cdot V_{t} \tag{3}$$

$$h_{t} = \mathrm{ReLU}(K_{t} + s_{t}), \quad h_{t} \in \mathbb{R}^{d} \tag{4}$$
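The simplified recurrence in (2) to (4) can be sketched in a few lines of NumPy. This is only an illustration of the equations as written here, not the full RWKV architecture. The function name `rwkv_recurrence`, the zero initial state, and the fixed gate values passed in as `alphas` are assumptions made for the example, since the text does not specify how $\alpha_{t}$ is produced.

```python
import numpy as np

def rwkv_recurrence(E, W_k, W_v, alphas):
    """Apply Eqs. (2)-(4) to a sequence of embeddings.

    E:       (T, d) array of token embeddings e_Rt
    W_k/W_v: (d, d) Key and Value projection matrices
    alphas:  length-T sequence of gates in [0, 1] (assumed to be given)
    Returns the (T, d) hidden-state matrix H_R.
    """
    T, d = E.shape
    s = np.zeros(d)                      # assumed initial state s_0 = 0
    H = np.zeros((T, d))
    for t in range(T):
        k = W_k @ E[t]                                   # Eq. (2), Key
        v = W_v @ E[t]                                   # Eq. (2), Value
        s = alphas[t] * s + (1.0 - alphas[t]) * v        # Eq. (3)
        H[t] = np.maximum(k + s, 0.0)                    # Eq. (4), ReLU
    return H

rng = np.random.default_rng(0)
T, d = 5, 8                              # toy sizes; the text uses d = 768
E = rng.normal(size=(T, d))
W_k = rng.normal(size=(d, d)) * 0.1      # random stand-ins for trained weights
W_v = rng.normal(size=(d, d)) * 0.1
H_R = rwkv_recurrence(E, W_k, W_v, alphas=np.full(T, 0.9))
h_text1 = H_R[-1]                        # last time step as the sequence feature
```

With a gate close to 1, the state changes slowly and retains earlier tokens; with a gate close to 0, the state follows the current Value almost exactly.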

(2) Vision-RWKV extracts bone stick image features: The feature extraction process involves patch division, Key-Value accumulation, and contextual feature generation. The input image is first divided into a series of patches with a size of $P \times P$ pixels, as described in (5). Here, $H \times W$ represents the dimensions of the image ($224 \times 224$), $N$ denotes the total number of patches, and $C$ is the number of channels. Each patch is flattened into a one-dimensional vector and transformed into a high-dimensional space through a linear mapping, resulting in patch embeddings, as defined in (6). The transformation utilizes a trainable weight matrix $W_{p} \in \mathbb{R}^{(P \times P \times C) \times d'}$, where $d'$ denotes the embedding vector dimension and $d' = 150528$ ($224 \times 224 \times 3$). Vision-RWKV outputs a high-dimensional global feature vector with a flattened dimension of 150528. The core mechanism of Vision-RWKV is the Key-Value accumulation, adapted for spatial dimensions. Each image patch $P_{i}$ is treated as a spatial token, and the embedding is applied to transform each patch into a spatial feature vector $E_{V}$. The Key and Value mappings are then calculated, as illustrated in (7), where $W_{Vk}$ and $W_{Vv}$ represent trainable weight matrices. The state update mechanism follows a recurrence formula analogous to the RWKV model but tailored for spatial data, as detailed in (8), and the hidden state of each patch is computed as shown in (9). The final output comprises contextual features for each patch, encoded in a hidden state matrix $H_{V} \in \mathbb{R}^{N \times d'}$. To produce a global feature representation, pooling operations (e.g., average pooling) are typically employed to aggregate the features of all patches into a unified feature vector $h_{image} \in \mathbb{R}^{150528}$.

$$N = \frac{H \times W}{P^{2}} \tag{5}$$

$$E_{V} = W_{p} P_{i}, \quad E_{V} \in \mathbb{R}^{d'} \tag{6}$$

$$K_{i} = W_{Vk} E_{V}, \quad V_{i} = W_{Vv} E_{V} \tag{7}$$

$$s_{i} = \alpha_{i} \cdot s_{i-1} + (1 - \alpha_{i}) \cdot V_{i} \tag{8}$$

$$h_{i} = \mathrm{ReLU}(K_{i} + s_{i}), \quad h_{i} \in \mathbb{R}^{d'} \tag{9}$$
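The patch division in (5) and the linear patch embedding in (6) can be illustrated as follows. The patch size is not stated in this section, so `patch_size=16` is an assumption chosen only for the example, and the embedding width of 32 is a toy value rather than the dimension given in the text. The sketch uses the row-vector convention (`patches @ W_p`).

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into N flattened patches, N = H*W / P^2 (Eq. 5)."""
    H, W, C = image.shape
    P = patch_size
    if H % P or W % P:
        raise ValueError("image height and width must be divisible by P")
    patches = image.reshape(H // P, P, W // P, P, C)
    patches = patches.transpose(0, 2, 1, 3, 4)        # (H/P, W/P, P, P, C)
    return patches.reshape(-1, P * P * C)             # (N, P*P*C)

image = np.zeros((224, 224, 3), dtype=np.float32)     # preprocessed input size
patches = patchify(image, patch_size=16)              # P = 16 is an assumption
num_patches, patch_dim = patches.shape                # 196 patches of length 768

rng = np.random.default_rng(0)
W_p = rng.normal(size=(patch_dim, 32)) * 0.02         # toy embedding width
E_V = patches @ W_p                                   # Eq. (6), one row per patch
```

Under the assumed patch size, a 224 × 224 image yields 14 × 14 = 196 spatial tokens.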

(3) BERT extracts bone stick archeological metadata features: The BERT model begins by tokenizing the input text and inserting the special tokens [CLS] and [SEP] to form the input sequence, as shown in (10). Each token is mapped to an embedding vector composed of three components, as defined in (11). The final embedding vector is denoted by $e_{Bt} \in \mathbb{R}^{d}$, where $d = 768$ is the embedding dimension, $E_{n}$ represents the token embedding corresponding to the current token, $S_{t}$ denotes the position of the current token, used to capture sequential order information, and $E_{m}$ represents the sentence to which the token belongs, which is used in the Next Sentence Prediction (NSP) task for sentence-level segmentation. The input structure of BERT is illustrated in Figure 6. BERT processes the embedded inputs through a multi-layer Transformer encoder architecture, where each layer comprises the following steps. First, self-attention is computed from the attention matrices $Q$, $K$, and $V$, as defined in (12). These matrices are obtained via linear transformations of the inputs, as shown in (13), where $W_{Bq}$, $W_{Bk}$, and $W_{Bv}$ are trainable weight matrices, and $E_{B}$ is the embedded input matrix produced by the embedding layer. The outputs of multiple attention heads are then concatenated and projected back to the original dimension through a linear transformation, as shown in (14). Here, $head_{i}$ refers to the output of the $i$-th attention head, each of which operates independently. The projection matrix $W_{o}$ maps the concatenated output back to the original embedding space. The output of the self-attention module is passed through a feed-forward neural network (FFN), as described in (15), where $W_{1}$ and $W_{2}$ are the weight matrices of the first and second layers, respectively, and $b_{1}$, $b_{2}$ are the corresponding bias terms. The output of BERT is a sequence of context-aware token representations, denoted by $H_{B} \in \mathbb{R}^{(T + 2) \times d}$, where the embedding of the [CLS] token, $h_{text2} \in \mathbb{R}^{768}$, is used as a global representation of the entire sentence.

$$\mathrm{InputText} = [\,[\mathrm{CLS}], x_{1}, x_{2}, x_{3}, \dots, [\mathrm{SEP}], \dots\,] \tag{10}$$

$$e_{Bt} = \mathrm{TokenEmbedding}[E_{n}] + \mathrm{PositionEmbedding}[S_{t}] + \mathrm{SegmentEmbedding}[E_{m}] \tag{11}$$

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V \tag{12}$$

$$Q = W_{Bq} E_{B}, \quad K = W_{Bk} E_{B}, \quad V = W_{Bv} E_{B} \tag{13}$$

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(head_{1}, \dots, head_{H}) W_{o} \tag{14}$$

$$\mathrm{FFN}(x) = \mathrm{ReLU}(x W_{1} + b_{1}) W_{2} + b_{2} \tag{15}$$
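The scaled dot-product attention of (12) and (13) can be illustrated with a single attention head in NumPy. This is a toy sketch, not the BERT implementation: the dimensions are small, the weights are random stand-ins, and the row-vector convention (`E_B @ W_q`) is used, which is an assumption about layout rather than something the text specifies.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(E_B, W_q, W_k, W_v):
    """Single-head self-attention following Eqs. (12)-(13)."""
    Q, K, V = E_B @ W_q, E_B @ W_k, E_B @ W_v
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # (T, T), each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d, d_k = 6, 16, 4                  # toy sizes; the text uses d = 768
E_B = rng.normal(size=(T, d))         # embedded input, one row per token
W_q, W_k, W_v = (rng.normal(size=(d, d_k)) for _ in range(3))
out, weights = attention(E_B, W_q, W_k, W_v)
```

Each output row is a weighted average of the Value rows, with the weights determined by how strongly that token's query matches every key.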

(4) Feature vector adjustment: The feature dimensions produced by the three models differ considerably, particularly between the textual and image features. If the features are concatenated directly, high-dimensional features may dominate low-dimensional ones in relative weight, leading to an overemphasis on one modality during training and compromising the balance and performance of multimodal learning. To address this issue, a structural transformation module is introduced in this study to map the feature vectors from the different modalities to a unified dimension. Specifically, fully connected layers apply linear transformations to the extracted features, which improves the balance and expressiveness of the model during multimodal feature fusion. Because the computational complexity of a fully connected layer is proportional to the product of its input and output dimensions, high-dimensional features increase not only the computational cost but also the memory and storage requirements. Therefore, in this study, image and text features are mapped to a unified 1024-dimensional feature space, which significantly reduces computational complexity and memory requirements while preserving the primary information contained in the high-dimensional features. This transformation is implemented using a fully connected layer, as expressed in (16), where $W \in \mathbb{R}^{1024 \times d}$ is the weight matrix responsible for the dimensional transformation, $h$ is the input feature vector of dimension $d$, and $b$ is the bias term.

$$h' = \mathrm{ReLU}(W \cdot h + b) \tag{16}$$
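As a small worked example of (16), the sketch below projects a 768-dimensional text feature into the 1024-dimensional space and counts the parameters of the layer. The helper names `project` and `fc_param_count` are hypothetical, and the weights are random stand-ins for trained values.

```python
import numpy as np

def project(h, W, b):
    """Eq. (16): h' = ReLU(W h + b)."""
    return np.maximum(W @ h + b, 0.0)

def fc_param_count(d_in, d_out):
    """Number of weights plus biases in one fully connected layer."""
    return d_in * d_out + d_out

rng = np.random.default_rng(0)
d_in, d_out = 768, 1024                        # text feature -> unified space
W = rng.normal(size=(d_out, d_in)) * 0.01      # random stand-in for trained weights
b = np.zeros(d_out)
h_text = rng.normal(size=d_in)
h_proj = project(h_text, W, b)                 # shape (1024,)

print(fc_param_count(768, 1024))               # 787456
```

The parameter count grows linearly with the input dimension, which is the proportionality the paragraph above refers to.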

(5) Modal Alignment: After extracting relevant features and performing feature vector adjustment using multimodal large-scale models, the features are subjected to standardization and normalization. Given that each modality may generate features with different scales and distributions, normalization becomes a critical step to prevent any single modality from dominating the feature fusion process. For each modality, standardization is applied to ensure that the mean is zero and the standard deviation is one, thereby ensuring the comparability of features across modalities, as expressed in (17), where $\mu_{modal}$ and $\sigma_{modal}$ represent the mean and standard deviation of each modality's features, respectively. This process is applied to all modalities, including images, text, and archeological metadata, in order to create a unified feature representation that facilitates balanced fusion across modalities.

$$h'_{m} = \frac{h' - \mu_{modal}}{\sigma_{modal}} \tag{17}$$
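The standardization in (17) can be sketched as below. The text does not say whether the statistics are computed per feature vector or over a batch; the example assumes per-vector statistics, and the small constant `eps` guarding against division by zero is also an assumption. The three input vectors are synthetic, with deliberately different scales.

```python
import numpy as np

def standardize(h, eps=1e-8):
    """Eq. (17): shift to zero mean and scale to unit standard deviation."""
    return (h - h.mean()) / (h.std() + eps)

rng = np.random.default_rng(0)
features = {
    "image": rng.normal(loc=5.0, scale=10.0, size=1024),
    "text1": rng.normal(loc=0.0, scale=0.1, size=1024),
    "text2": rng.normal(loc=-2.0, scale=3.0, size=1024),
}
aligned = {name: standardize(h) for name, h in features.items()}
```

After this step the three vectors share the same mean and spread, so no modality dominates the fusion simply because of its scale.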

Once the features from all modalities have been standardized, they are input into the Dynamic Cross-Feature Fusion (DCFF) module described below. The DCFF module adaptively fuses the three feature sets and generates modality-specific weights based on the content and relevance of each modality, so that the contribution of each modality to the final feature representation depends on the input. Together, standardization and dynamic weighting ensure that the features of each modality are comparable and that all available information, regardless of its origin, is integrated into a unified representation for the matching task.

3.3. Dynamic Cross-Feature Fusion Model

The dynamic cross-modal feature fusion (DCFF) module is proposed to effectively capture high-order interactions among heterogeneous modalities while adaptively weighting each modality based on its contextual relevance. Traditional fusion strategies—such as static weighting or direct concatenation—lack the capability to dynamically adjust to variations in the input data, often resulting in suboptimal representation learning. In contrast, the DCFF module introduces a dynamic mechanism that generates modality-specific weights in response to changing input conditions, thereby enhancing the robustness and expressiveness of the fused features. Building upon the foundational work in [21], this study extends the fusion framework to accommodate three distinct types of modalities: visual features, textual inscription, and archeological metadata. The overall architecture of the DCFF module is illustrated in Figure 7.

In multimodal fusion tasks, conventional methods often employ direct concatenation or static weighting strategies to combine information from heterogeneous modalities. While these approaches are straightforward to implement, they suffer from critical limitations: they cannot dynamically adjust modal contributions based on contextual information, and they fail to model deep semantic interactions across modalities. These shortcomings are especially pronounced in highly heterogeneous and sparsely annotated domains, such as bone-stick inscription analysis, where traditional concatenation-based fusion often fails to fully exploit the complementary strengths of different modalities. To address these limitations, this study employs a Dynamic Cross-Feature Fusion (DCFF) module, which enables adaptive modeling and explicit cross-modal interaction, thereby enhancing both the expressive capacity and discriminative power of the fused representation. The DCFF architecture is composed of four core components:

Dynamic Weight Generation (DWG): Unlike static concatenation-based designs, the DWG module applies a sigmoid activation followed by softmax normalization to dynamically generate a weight distribution across modalities. This mechanism allows the model to automatically evaluate and regulate the importance of each modality according to input content, thus enabling a more flexible and context-sensitive fusion strategy.

Nonlinear Transformation of In-modal Features (NTIF): In traditional fusion, raw features are typically passed directly into the fusion layer without transformation, limiting intra-modal expressiveness. The NTIF module introduces dedicated fully connected layers and ReLU activation functions to construct modality-specific nonlinear encoding paths, which is particularly beneficial for sparse and low-dimensional data types such as archeological metadata.

Weighted Feature Interaction (WFI): Moving beyond passive fusion, the WFI layer performs dimension-wise weighted interactions to actively model high-order correlations between modalities. This mechanism is particularly suited to tasks like bone-stick matching, where semantic mismatches between visual and textual modalities can be substantial due to their inherently different representations.

Generation of Fusion Features (GFF): This component integrates the activated intra-modal features and the cross-modal interactive representations into a unified multimodal embedding using a deep nonlinear transformation network. Compared to simple concatenation, this approach provides improved expressiveness and discriminability.

In summary, the proposed DCFF module offers considerable advantages over traditional fusion methods by enabling context-aware dynamic weighting and explicit semantic interaction between modalities. It improves the model's capacity to adapt to heterogeneous information sources and enhances matching accuracy, demonstrating strong robustness and generalization in complex multimodal tasks such as fragmented bone-stick image alignment.

The fusion process begins by feeding the input feature vectors into the Dynamic Weight Generation (DWG) layer, where each modality is assigned a learnable weight based on its importance. The weight for modality $i$ is computed using a sigmoid activation function applied to a linear transformation of its feature vector, as expressed in (18), where $W_{i}$ and $b_{i}$ are the trainable weight matrix and bias vector, respectively, and $\sigma(\cdot)$ denotes the sigmoid activation function. The resulting raw weights $w_{i}$ are subsequently normalized using a softmax operation to ensure that their sum equals 1, thereby producing a stable and dynamic distribution of modal weights. This normalization is given in (19), where $\alpha_{i}$ represents the normalized importance of modality $i$, and the summation runs across the three modalities (the visual features and the two textual representations, namely the inscription and the archeological metadata). Following weight generation, the Nonlinear Transformation of In-modal Features (NTIF) module enhances the expressiveness of each modality through dedicated nonlinear transformations. For each modality $i$, its feature vector $h_{i}$ is passed through a fully connected layer followed by a ReLU activation to give the transformed feature $h'_{x}$, as shown in (20), where the subscript $x \in \{text1, text2, image\}$ names the modality. This transformation allows each modality to build a modality-specific encoding path, which is particularly beneficial for sparse or low-dimensional modalities such as archeological metadata. To model the inter-modal interactions, the Weighted Feature Interaction (WFI) module computes pairwise cross-modal feature relationships. Given two modalities $i$ and $j$, their cross-interaction is computed as the element-wise product of their weighted features, formulated in (21), where $\alpha_{i} \odot h_{i}$ and $\alpha_{j} \odot h_{j}$ denote the element-wise scaled features, and $h_{cross}^{ij}$ captures the semantic synergy between the two modalities. Finally, the Generation of Fusion Features (GFF) module aggregates the transformed intra-modal features and all cross-modal interaction features. These are combined through a final deep nonlinear transformation to form the fused multimodal representation $h_{fusion}$, as defined in (22). This unified representation incorporates both deep intra-modal semantics and cross-modal relationships, resulting in a rich and discriminative multimodal embedding suitable for downstream matching tasks.

$$w_{i} = \sigma(W_{i} h' + b_{i}), \quad w_{i} \in \mathbb{R}^{1024} \tag{18}$$

$$\alpha_{i} = \frac{\exp(w_{i})}{\sum_{j=1}^{3} \exp(w_{j})}, \quad \alpha_{i} \in \mathbb{R}^{1024} \tag{19}$$

$$h'_{x} = \mathrm{ReLU}(W'_{i} h_{i} + b'_{i}) \tag{20}$$

$$h_{cross}^{ij} = (\alpha_{i} \odot h_{i}) \odot (\alpha_{j} \odot h_{j}), \quad h_{cross}^{ij} \in \mathbb{R}^{1024} \tag{21}$$

$$h_{fusion} = \mathrm{ReLU}\left[W_{f}\left(h'_{text1} + h'_{text2} + h'_{image} + \sum_{i \neq j} h_{cross}^{ij}\right) + b_{f}\right], \quad h_{fusion} \in \mathbb{R}^{1024} \tag{22}$$
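The four stages in (18) to (22) can be traced end to end in the NumPy sketch below. It is one reading of the equations as printed, not the authors' implementation: the function name `dcff`, the layout of the `params` dictionary, the toy width of 16, and the random weights are all assumptions. The softmax in (19) is taken across the three modalities independently for each feature dimension, and the sum over $i \neq j$ is taken over ordered pairs, as the notation in (22) suggests.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dcff(h, params):
    """Fuse per-modality feature vectors h[name] of equal width D."""
    names = list(h)
    # DWG, Eq. (18): raw per-dimension weights for each modality
    w = np.stack([sigmoid(params["W"][m] @ h[m] + params["b"][m]) for m in names])
    # Eq. (19): softmax across modalities, separately for every dimension
    e = np.exp(w)
    alpha = e / e.sum(axis=0, keepdims=True)            # (3, D), columns sum to 1
    # NTIF, Eq. (20): modality-specific fully connected layer + ReLU
    h_in = {m: np.maximum(params["W_in"][m] @ h[m] + params["b_in"][m], 0.0)
            for m in names}
    # WFI, Eq. (21): element-wise product of weighted features, ordered pairs i != j
    cross = np.zeros_like(h[names[0]])
    for i, mi in enumerate(names):
        for j, mj in enumerate(names):
            if i != j:
                cross += (alpha[i] * h[mi]) * (alpha[j] * h[mj])
    # GFF, Eq. (22): final fully connected layer + ReLU over the summed features
    total = sum(h_in.values()) + cross
    return np.maximum(params["W_f"] @ total + params["b_f"], 0.0), alpha

rng = np.random.default_rng(0)
D = 16                                   # toy width; the text uses 1024
names = ["text1", "text2", "image"]
h = {m: rng.normal(size=D) for m in names}

def rand_mat():
    return rng.normal(size=(D, D)) * 0.1  # random stand-in for trained weights

params = {
    "W": {m: rand_mat() for m in names}, "b": {m: np.zeros(D) for m in names},
    "W_in": {m: rand_mat() for m in names}, "b_in": {m: np.zeros(D) for m in names},
    "W_f": rand_mat(), "b_f": np.zeros(D),
}
h_fusion, alpha = dcff(h, params)
```

Because the sigmoid output lies in (0, 1) before the softmax is applied, the normalized weights in this formulation can never become extremely peaked; each modality always retains a non-negligible share in every dimension.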

3.4. Matching Calculation Module

The extracted bone stick features undergo a matching decision computation to determine similarity scores, as illustrated in Figure 8. The architecture comprises a measurement layer and a decision layer. Here, $h_{fusion1}$ and $h_{fusion2}$ denote the dynamically cross-fused features of two bone sticks. To ensure computational efficiency, this study calculates the difference between the two feature representations, generating a vectorized feature that encapsulates the intrinsic differences between bone sticks. This process is formalized in (23), where $D$ represents the distance metric between the feature representations, and $h_{fusion1}(x_{1})$ and $h_{fusion2}(x_{2})$ correspond to the deep-layer features of the two bone sticks $x_{1}$ and $x_{2}$. By fusing the distance-based measurements, a unified feature representation is synthesized. This resultant vector encodes discriminative differences between the paired bone stick images, which are subsequently projected into a real-valued similarity score ranging from 0 to 1 through a fully connected layer and a sigmoid activation function.

$$D(x_{1}, x_{2}) = \left\| h_{fusion1}(x_{1}) - h_{fusion2}(x_{2}) \right\| \tag{23}$$
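A minimal sketch of the measurement and decision layers is given below. The text describes a difference vector followed by a fully connected layer and a sigmoid; the use of the element-wise absolute difference, the function name `match_score`, and the hand-picked weights `w` and bias `b` are assumptions made only so that the example runs without training.

```python
import numpy as np

def match_score(h1, h2, w, b):
    """Similarity in (0, 1) computed from the difference of two fused features."""
    diff = np.abs(h1 - h2)               # assumed: element-wise absolute difference
    logit = w @ diff + b                 # fully connected layer with a single output
    return 1.0 / (1.0 + np.exp(-logit))  # sigmoid

rng = np.random.default_rng(0)
D = 1024
h_a = rng.normal(size=D)
h_b = h_a + 0.01 * rng.normal(size=D)    # near-duplicate of h_a
h_c = rng.normal(size=D)                 # unrelated vector
w = -np.ones(D) / D                      # hypothetical weights: larger difference, lower score
b = 0.5                                  # hypothetical bias
score_close = match_score(h_a, h_b, w, b)
score_far = match_score(h_a, h_c, w, b)
```

In a trained model the weights of the decision layer would be learned from labeled pairs, so that dimensions in which true matches differ little receive the largest penalties.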

The decision to utilize vector distance for final matching, rather than other similarity measures, is based on several key factors. First, vector distance, particularly in the form of Euclidean or cosine distance, provides a direct and mathematically well-understood metric for quantifying the dissimilarity between feature vectors. These vectors represent multi-dimensional characteristics of bone stick fragments, encapsulating visual, textual, and archeological data. As such, vector distance allows for an intuitive and computationally efficient method to compare the combined feature representations. Furthermore, vector distances are effective at capturing both the magnitude and direction of feature discrepancies, which is crucial when handling multimodal data with varying scales and distributions across features.

In contrast, other similarity measures such as Jaccard or Pearson correlation may not be as effective in this context because they are more suited to binary or linear comparisons, whereas vector distances inherently accommodate nonlinear relationships and the high-dimensionality of the features. Additionally, vector-based distance metrics, particularly when used in conjunction with a dynamic cross-modal feature fusion approach, allow for better scalability and generalization to unseen data, especially in complex, heterogeneous datasets like the bone stick fragments under study. These advantages make vector distance a more appropriate choice for the matching task in this work.

The final parameter count for the Dynamic Cross-Modal Feature Fusion (DCFF) module is 20.94 M. To optimize the framework, binary cross-entropy loss is adopted as the objective function, offering an intuitive quantification of similarity while reducing network complexity. Equation (24) defines the binary cross-entropy loss, where $N$ denotes the batch size, and $y_{i}$ and $\hat{y}_{i}$ represent the ground-truth labels and predicted similarity scores, respectively. This formulation ensures robust feature alignment and enhances the discriminative power of the model for bone stick verification tasks.

$$L(y, \hat{y}) = -\frac{1}{N} \sum_{i=1}^{N} \left( y_{i} \cdot \log(\hat{y}_{i}) + (1 - y_{i}) \cdot \log(1 - \hat{y}_{i}) \right) \tag{24}$$
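Equation (24) can be checked numerically with a short, dependency-free function. The clamping constant `eps` is an assumption added to avoid taking the logarithm of zero; the labels and scores are made up for illustration.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Eq. (24): mean binary cross-entropy over a batch."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)      # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

labels = [1, 0, 1]                            # 1 = matching pair, 0 = non-matching
loss_good = binary_cross_entropy(labels, [0.9, 0.1, 0.8])   # confident and correct
loss_bad = binary_cross_entropy(labels, [0.2, 0.7, 0.3])    # mostly wrong
```

Scores close to the labels give a small loss, while confident wrong scores are penalized heavily, which is what drives the similarity output toward 1 for true pairs and toward 0 otherwise.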

4. Experimental Environment, Dataset, and Parameters

4.1. Experimental Environment

The computational platform used in this study is configured with a 64-bit Windows 11 operating system, 32 GB of RAM, and an Intel 14th-generation Core i9-14900HX processor (Intel, Santa Clara, CA, USA) (base frequency 2.2 GHz). The system is also equipped with an NVIDIA GeForce RTX 4070 graphics card (Nvidia, Santa Clara, CA, USA). All programming tasks were implemented in a Python 3.9 environment using the PyTorch 1.12.0 + cu113 framework. To improve the reliability of the results, each ablation and comparison experiment conducted in this study was repeated three times, and the mean values are reported.

4.2. Dataset

The dataset utilized in this study was provided by the Institute of Archaeology, Chinese Academy of Social Sciences. It comprises high-resolution images, inscriptions, and corresponding archeological metadata of bone sticks unearthed from the ruins of the Western Han Dynasty’s Weiyang Palace in Chang’an. Before its formal use, the dataset underwent a comprehensive verification and standardization process conducted by a team of professional archeologists. Following authorized approval and with the support of funded research projects, archeological experts manually completed the annotation of matching relationships between bone stick samples, serving as the foundational data for the bone stick matching task in this study. All data annotations include ground-truth labels for bone stick fragment pairings. Before being fed into the large-scale pre-trained models, the raw image data undergoes preprocessing through the Image Preprocessing Module (IPM). Specifically, each bone stick image is first passed through the DIS-Net segmentation network to isolate and extract the core object regions. Subsequently, fragmented or damaged areas identified in the images are manually cropped and padded to enhance feature clarity. All processed images are then uniformly resized to 224 × 224 pixels to ensure compatibility with downstream network input requirements. The associated inscriptional data and archeological metadata remain one-to-one aligned with the corresponding images, collectively forming a multimodal input structure that supports the bone stick matching task. The detailed procedures are as follows:

1. Fragment Labeling: The bone stick fragments were carefully labeled by a team of professional archeologists. Each fragment was assigned a unique identifier and the corresponding inscriptions were transcribed into standardized text. The inscription data include historical content (e.g., official titles, weapon types) and spatial metadata (e.g., excavation context, missing sections). Each pair of fragments that was identified as matching was manually labeled as a matching pair. These labeled pairs serve as the ground-truth labels for the dataset used in training and testing.

2. Sampling: To ensure a diverse and representative dataset, 3000 pairs of bone stick fragments were selected. These pairs were chosen based on a wide range of fragmentation patterns, inscription types, and archeological metadata characteristics. This diverse sampling includes both well-preserved fragments with minimal damage and highly degraded fragments, ensuring that the model can generalize to different qualities of bone stick artifacts.

3. Preprocessing: The preprocessing of the images and metadata involved the following steps:

Image Preprocessing: Each image was processed to remove the background and crop out the fracture regions using the DIS-Net segmentation network, which is particularly effective for high-resolution binary image segmentation. The cropped images were resized to 224 × 224 pixels to meet the input requirements of the deep learning model.
Textual Data Preprocessing: The inscription data were cleaned and standardized, with any variations in transcription style adjusted to ensure consistency across the dataset. This allowed for the RWKV model to process the inscriptions effectively.
Archeological Metadata: Additional metadata, including details such as color, excavation context, and missing sections, were encoded using the BERT model, which helped enrich the textual features.

By performing these steps, the dataset was carefully constructed to provide a high-quality foundation for training and testing the multimodal bone stick matching model. The final dataset consists of 6000 high-resolution images, each aligned with corresponding inscriptions and metadata, providing a comprehensive multimodal input for the matching task.

The complete dataset comprises a total of 3000 verified bone stick pairs, corresponding to 6000 high-resolution images, 6000 transcribed inscriptions, and 6000 sets of corresponding archeological metadata. All sample pairings were rigorously validated by expert archeologists to ensure both scientific accuracy and annotation reliability. To enhance the effectiveness of model training and the objectivity of performance evaluation, the dataset was partitioned into training, validation, and test subsets using a standard 7:2:1 split ratio. During the training phase, to improve the model’s generalization capability and robustness, the original image data were augmented through a series of data enhancement strategies. These include random translation and scaling operations, designed to increase sample diversity and mitigate the risk of overfitting to specific visual patterns. Moreover, to comprehensively assess the model’s performance under different scenarios, two distinct test sets were constructed. Test Set 1 contains bone stick images in which each image has a corresponding matched counterpart, thereby evaluating the model’s performance under ideal matching conditions. In contrast, Test Set 2 builds upon Test Set 1 by incorporating unmatched bone stick images, allowing for the evaluation of the model’s ability to generalize in the presence of non-matching samples. The detailed image data distribution across the subsets is summarized in Table 2.
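The 7:2:1 partition can be illustrated with a short sketch. It assumes that the split is made at the level of fragment pairs, so that both images of a pair fall into the same subset, and that a fixed random seed is used; neither detail is stated in the text, and the function name `split_pairs` is hypothetical. The counts reported in Table 2 take precedence over the numbers this sketch produces.

```python
import random

def split_pairs(pair_ids, ratios=(7, 2, 1), seed=0):
    """Shuffle pair identifiers and split them into train/val/test by ratio."""
    ids = list(pair_ids)
    random.Random(seed).shuffle(ids)     # seeded shuffle for reproducibility
    total = sum(ratios)
    n_train = len(ids) * ratios[0] // total
    n_val = len(ids) * ratios[1] // total
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]         # remainder goes to the test subset
    return train, val, test

train, val, test = split_pairs(range(3000))
print(len(train), len(val), len(test))   # 2100 600 300
```

Splitting by pair rather than by image keeps the two halves of a bone stick from being separated across training and test data.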

4.3. Evaluation Metrics

This study adopts Rank-N and the Cumulative Match Characteristic (CMC) curve as the core metrics to evaluate algorithmic performance. The Rank-N metric quantifies the probability that a target bone stick image appears among the top N most similar retrieval results, with a higher Rank-N value indicating superior matching performance. The CMC curve, plotted with rank on the horizontal axis and cumulative matching rate on the vertical axis, provides an intuitive visualization of the model's matching performance across different ranks. The CMC curve is formally defined in (25), where $CMC@k$ denotes the probability of a query image $q$ being correctly matched within the top $k$ search results. Here, $Q$ represents the total number of query images in the dataset, and $gt(q_{i}, k)$ is an indicator function that equals 1 if the $i$-th query image $q_{i}$ is correctly matched at or before rank $k$ in the retrieval sequence, and 0 otherwise. As implied by the formula, a higher $CMC@k$ value at a lower $k$ indicates a more effective method.

$$CMC@k = \frac{1}{Q} \sum_{i=1}^{Q} gt(q_{i}, k) \tag{25}$$
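The definition in (25) translates directly into code. The sketch below assumes that each query has exactly one true counterpart and that a ranked list of gallery identifiers is already available for every query; the function name `cmc_curve` and the toy identifiers are made up for illustration.

```python
def cmc_curve(ranked_lists, ground_truth, max_rank):
    """Return [CMC@1, ..., CMC@max_rank] following Eq. (25).

    ranked_lists[q]: gallery ids sorted by decreasing similarity to query q
    ground_truth[q]: id of the true counterpart of query q
    """
    hits = [0] * max_rank
    for q, ranking in ranked_lists.items():
        target = ground_truth[q]
        if target in ranking[:max_rank]:
            first = ranking.index(target)        # 0-based position of the true match
            for k in range(first, max_rank):
                hits[k] += 1                     # counts for this rank and all later ones
    return [h / len(ranked_lists) for h in hits]

ranked = {
    "q1": ["a", "b", "c"],
    "q2": ["c", "a", "b"],
    "q3": ["b", "c", "a"],
    "q4": ["a", "c", "b"],
}
truth = {"q1": "a", "q2": "a", "q3": "a", "q4": "b"}
cmc = cmc_curve(ranked, truth, max_rank=3)
print(cmc)   # [0.25, 0.5, 1.0]
```

Rank-N is simply the N-th entry of this list, so the Rank-15 figure reported in the experiments corresponds to `cmc[14]` when `max_rank` is at least 15.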

5. Experiments

5.1. Ablation Study (Effect of Pretrained Models)

This study employs a contrastive learning strategy to pretrain the network, enabling the model to initially learn and capture the latent features of the bone stick data. The pretrained weights are then used to initialize the entire system, ensuring that the model possesses a strong representational capacity in the early stages of training. Specifically, the batch size was set to 32, and the number of training epochs was set to 500 to ensure adequate convergence during training. Apart from the hyperparameters tuned during the training phase, all other hyperparameters remained unchanged. Ultimately, this pretraining process produced network weights optimized for the bone stick dataset.

The training results in Figure 9 illustrate the performance differences between the two training strategies. Subfigure (a) shows the trends in training and validation losses during the first 200 epochs of the training process. The green and orange curves correspond to the training and validation losses obtained without using pretrained weights, while the red and blue curves represent the losses obtained when pretrained weights are employed. The results demonstrate that networks with pretrained weights exhibit loss stabilization after approximately 165 epochs, indicating that the model has completed its convergence process. In contrast, networks without pretrained weights display significant fluctuations in loss and a slower rate of loss reduction, highlighting their inferior training performance. Additionally, Subfigure (b) compares the Rank-1 to Rank-15 matching accuracies on Test Set 1 under the two training strategies. The results indicate that networks using pretrained weights consistently outperform those without pretrained weights across all ranking levels, with an average matching accuracy improvement of 46.69% across Rank-1 to Rank-15. At Rank-15, the network with pretrained weights achieves a matching accuracy of 94.73%, clearly demonstrating the substantial role of pre-trained weights in enhancing the model’s generalization ability and overall performance.

Overall, the experimental results validate the positive contribution of pretrained weights to both the convergence speed and the final performance of the network.

5.2. Ablation Study (Effect of Model Variants)

To comprehensively evaluate the effectiveness of the three categories of features and the fusion module in the bone stick matching task, as well as to explore the impact of model parameter configurations on matching performance, a series of comparative experiments were meticulously designed and conducted. For each experimental configuration, the model’s parameter size (Params/M) and its inference time on a standardized testing platform (TD/S) were recorded, enabling a holistic assessment of the trade-off between accuracy and computational efficiency across different model configurations.

The experimental setup is structured as follows. Experiments (1) to (3) are based on unimodal image input, where fracture features of bone stick images are extracted using three different scales of the Vision-RWKV model (Vision-RWKV-S, Vision-RWKV-B, and Vision-RWKV-L, the last being used in this study) for direct matching tasks. Experiments (4) to (6) extend the setup by incorporating textual features extracted by the RWKV language models RWKV-4, RWKV-5, and RWKV-6 (the last being used in this work). These textual features are directly concatenated with the visual features extracted by Vision-RWKV-L, forming a basic bimodal fusion architecture for matching. Experiments (7) and (8) further introduce archeological metadata features encoded using the BERT family of models, namely BERT-medium and the BERT-base-cased model employed in this study. These semantic features are directly concatenated with the image and inscription features extracted by Vision-RWKV-L and RWKV-6, respectively, to construct a tri-modal matching framework. Experiments (9) to (13) compare alternative fusion strategies, and Experiment (14) corresponds to the full method proposed in this paper, as detailed in the discussion that follows. The detailed results of all experiments are presented in Table 3, where the impacts of different feature combinations and model architectures on matching accuracy, inference time, and parameter size are systematically compared.

As shown in Table 3, when bone stick matching relies solely on image features extracted from fracture surfaces (Experiments (1) to (3)), the model’s feature extraction capability improves with increasing parameter count, and matching accuracy rises accordingly. However, because this approach is unimodal, the overall matching accuracy remains limited. Further improvements are observed when inscription features are introduced and fused with visual features (Experiments (4) to (6)), resulting in a notable average increase of approximately 5% to 8% in matching accuracy. This highlights the critical role of textual information in the bone stick matching task. Moreover, this multimodal fusion strategy maintains stable inference efficiency, with only a slight increase in inference time on the test set and minimal additional demands on computational resources. When the archeological metadata of bone sticks is further incorporated (Experiments (7) and (8)), and all three types of features (visual, textual, and archeological) are concatenated in the feature space, the model achieves an additional 2% to 4% improvement in matching accuracy over the previous stage.

To further clarify the specific contribution of each module in the bone stick matching task, a series of comparative experiments was designed to analyze the effects of different feature fusion strategies. Experiments (9) and (10) fuse the transcription semantic features extracted by the RWKV-6 model with the archeological metadata features derived using the BERT-base-cased model. Experiment (9) matches on the direct concatenation of the feature vectors, while Experiment (10) matches after static averaging. Experiments (11) and (12) adopt a similar configuration, except that the inscription features are replaced by visual features extracted from fracture images by the Vision-RWKV-L model. These are again fused with archeological metadata via direct concatenation and static averaging, respectively. Experiment (13) combines fracture image features from Vision-RWKV-L, transcription semantic features from RWKV-6, and archeological metadata features from BERT-base-cased using static averaging before performing the matching task. Experiment (14) is the full method proposed in this paper, which aims to achieve optimal performance through multimodal fusion. The quantitative results of all experiments are presented in Table 3 to illustrate the relative contribution and performance of each module and fusion strategy.
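The two static fusion baselines compared above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names and the 4-dimensional toy vectors are assumptions, standing in for the 1024-dimensional modality features listed in Table 1.

```python
from typing import List, Sequence

Vector = List[float]


def concat_fusion(features: Sequence[Vector]) -> Vector:
    """Direct concatenation: the output length is the sum of the input lengths."""
    fused: Vector = []
    for feat in features:
        fused.extend(feat)
    return fused


def static_average_fusion(features: Sequence[Vector]) -> Vector:
    """Static averaging: element-wise mean with fixed, equal weights."""
    dim = len(features[0])
    if any(len(feat) != dim for feat in features):
        raise ValueError("static averaging needs equal-length feature vectors")
    count = len(features)
    return [sum(feat[i] for feat in features) / count for i in range(dim)]


# Toy stand-ins for image, inscription, and metadata features (hypothetical values).
image_feat = [0.2, 0.4, 0.6, 0.8]
inscription_feat = [1.0, 0.0, 1.0, 0.0]
metadata_feat = [0.0, 0.5, 0.5, 1.0]

concat = concat_fusion([image_feat, inscription_feat, metadata_feat])
average = static_average_fusion([image_feat, inscription_feat, metadata_feat])
print(len(concat), len(average))
```

Concatenation grows the representation with every added modality, whereas averaging keeps the dimension fixed but gives every modality the same weight regardless of the input.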

These results further validate the benefit of leveraging multi-source information, with textual inscriptions contributing the most among the auxiliary modalities. Notably, upon introducing the Dynamic Cross-Feature Fusion (DCFF) module (Experiment (14)), the model exhibits a substantial enhancement in matching accuracy. Although this module adds approximately 21 M parameters (467.76 M versus 446.82 M for direct concatenation in Table 3), its inference time remains nearly identical to that of the simple concatenation strategy, demonstrating a well-balanced trade-off between computational cost and performance gains. More importantly, the proposed dynamic fusion method demonstrates superior generalization capability on Test Set 2. Even under substantial noise interference, it maintains robust matching performance, which reflects the contribution of this module to the robustness of the overall approach.

Furthermore, as observed from the results of Experiments (9) to (13), the most decisive feature module for bone stick fragment matching is the fracture image representation extracted by the Vision-RWKV-L model. In contrast, the other two modalities, namely the transcription semantic representations generated by the RWKV-6 model and the archeological metadata features extracted by the BERT-base-cased model, play relatively limited independent roles in matching performance. Instead, they function primarily as auxiliary modalities that enhance the visual feature representation through complementary information. Additionally, the experimental results show that the proposed DCFF module significantly outperforms conventional static fusion strategies (direct feature concatenation and static averaging) in the bone stick matching task. By dynamically modeling the complementarity and correlation between modalities, the DCFF module improves feature-level integration quality, resulting in an overall accuracy gain of approximately 6%. This result further confirms that the observed performance improvement originates from the dynamic fusion mechanism introduced by the DCFF module itself, rather than from a simple accumulation of the features extracted by RWKV-6 and BERT-base-cased.
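To make the contrast between static and dynamic fusion concrete, the following is a minimal sketch of an input-dependent gated fusion. It is a generic softmax gate, not the DCFF architecture shown in Figure 7; the function names and the gate vectors (hypothetical stand-ins for learned parameters) are assumptions made for illustration only.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(
    features: Sequence[Vector], gate_vectors: Sequence[Vector]
) -> Tuple[Vector, List[float]]:
    """Weight each modality by a score computed from its own features.

    gate_vectors are hypothetical stand-ins for learned parameters.
    """
    scores = [
        sum(x * g for x, g in zip(feat, gate))
        for feat, gate in zip(features, gate_vectors)
    ]
    weights = softmax(scores)
    dim = len(features[0])
    fused = [
        sum(w * feat[i] for w, feat in zip(weights, features)) for i in range(dim)
    ]
    return fused, weights


# Toy image, inscription, and metadata features (hypothetical values).
features = [[0.2, 0.4, 0.6, 0.8], [1.0, 0.0, 1.0, 0.0], [0.0, 0.5, 0.5, 1.0]]

# With all-zero gates every modality gets weight 1/3, i.e. static averaging.
zero_gates = [[0.0] * 4 for _ in features]
fused_static, weights_static = gated_fusion(features, zero_gates)

# Non-zero gates make the modality weights depend on the input features.
toy_gates = [[1.0] * 4, [0.0] * 4, [-1.0] * 4]
fused_dynamic, weights_dynamic = gated_fusion(features, toy_gates)
```

In this sketch static averaging is the special case in which the gate ignores the input, which is one way to read the difference between Experiments (13) and (14).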

5.3. Comparative Experiments: Proposed Method Versus Traditional Approaches

To further validate the effectiveness of the proposed algorithm, we compare it with five representative image matching methods (References [3,4,5,6,7]); the experimental results are presented in Figure 10. Reference [3] proposes a matching method based on Freeman coding for representing fracture image features, while Reference [4] employs a ConvNeXt-based convolutional matching method within a Siamese network framework. Reference [5] introduces a multi-feature matching method based on a residual network; Reference [6] adopts a matching approach based on feature region partitioning; and Reference [7] utilizes a matching method based on fracture surface contour features.

As illustrated in the results, the proposed network achieves a substantial performance improvement over existing methods on Test Set 1, with average matching accuracy gains compared to References [3,4,5,6,7] being 11.83%, 9.42%, 7.48%, 11.58%, and 5.19%, respectively. On Test Set 2, the performance gains are similarly significant, reaching 10.57%, 10.86%, 10.51%, 13.51%, and 6.92%, respectively. As shown in Figure 10a, the proposed method demonstrates particularly strong performance on Test Set 1, attaining a near-optimal matching accuracy by approximately rank 14, whereas the other methods generally do not stabilize until after rank 20, with their overall matching rates consistently lagging behind the proposed approach, thus highlighting a clear performance gap. Moreover, as observed from the expanded dataset evaluation in Figure 10b, although all methods exhibit some degree of performance degradation, the proposed method demonstrates a marked advantage in generalization capability. Specifically, the decline in matching accuracy is significantly less severe than that of traditional models, indicating greater robustness and adaptability to variations in data distribution and complexity.

These results collectively validate the effectiveness and robustness of the proposed approach in scenarios characterized by distributional shifts or increased data heterogeneity, thereby demonstrating its strong potential for generalization and real-world applicability.
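The Rank-N accuracies and CMC curves compared in this section can be computed along the following lines. This is a minimal sketch under stated assumptions: the function names, the tie-breaking rule, and the toy score matrix are illustrative and are not taken from the authors' evaluation code.

```python
from typing import List, Sequence


def true_match_rank(scores: Sequence[float], true_index: int) -> int:
    """1-based rank of the true candidate when scores are sorted in descending order."""
    true_score = scores[true_index]
    # Count candidates scoring strictly higher; ties resolve in favour of the true match.
    return 1 + sum(1 for s in scores if s > true_score)


def cmc_curve(ranks: Sequence[int], max_rank: int) -> List[float]:
    """CMC value at k = fraction of queries whose true match is ranked within the top k."""
    total = len(ranks)
    return [sum(1 for r in ranks if r <= k) / total for k in range(1, max_rank + 1)]


# Toy example: 5 queries, each scored against 4 candidates (hypothetical scores).
score_rows = [
    [0.9, 0.1, 0.3, 0.2],
    [0.2, 0.8, 0.9, 0.1],
    [0.4, 0.3, 0.2, 0.1],
    [0.1, 0.2, 0.3, 0.9],
    [0.6, 0.7, 0.5, 0.4],
]
true_indices = [0, 1, 0, 2, 2]

ranks = [true_match_rank(row, t) for row, t in zip(score_rows, true_indices)]
cmc = cmc_curve(ranks, max_rank=4)
print(ranks, cmc)
```

The Rank-N accuracy is simply the N-th point of the CMC curve, so a curve that saturates at a lower rank means fewer candidates need to be inspected by hand.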

5.4. Comparative Experiments: Proposed Method Versus Other Large-Scale Pre-Trained Models

To systematically evaluate the structural effectiveness and performance advantages of the proposed method in the multimodal bone stick matching task, a series of comparative experiments was designed. To enable a consistent and controlled comparison under unified experimental settings, we incorporated two representative state-of-the-art multimodal pretrained model architectures, ViLT-Large and VLMo-Base, as benchmarking baselines. Both models were trained on the same dataset used for the large-scale pretrained models in this study, ensuring a consistent training context for evaluation. These models adopt a fusion input architecture in which image and text tokens are concatenated prior to encoding, so visual and textual inputs must be paired and fed simultaneously. This study therefore constructs the comparison framework from a dual-encoder perspective to maintain alignment in model structure and ensure the validity of the experimental conclusions. The experimental design is as follows. In Experiments (1) to (3), the matching task is conducted on image features of bone stick fracture regions and textual features from bone stick transcriptions. Specifically, Experiment (3) adopts a direct feature concatenation strategy, in which the fracture image features extracted by the Vision-RWKV-L model are concatenated with the semantic features of the transcriptions extracted by the RWKV-6 model. Experiments (4) to (6) build upon this configuration by adding archeological metadata features of bone sticks, extracted using BERT-base-cased, to form an enhanced multimodal matching representation. Experiments (7) to (9) replace the conventional feature concatenation module with the dynamic cross-modal fusion module to investigate the improvement in matching accuracy brought by a more advanced interaction mechanism. The detailed results of these experiments are reported in Table 4.

As shown in Table 4, although ViLT-Large and VLMo-Base have larger parameter counts than the corresponding baseline model proposed in this study, their matching accuracy on both test sets is consistently lower, and they incur significantly longer inference times. This highlights the substantial impact of model architecture on inference efficiency. In the second group of experiments, the inclusion of archeological metadata resulted in a 2–3% increase in average matching accuracy across all models, validating the importance of semantic prior knowledge in the bone stick matching task. However, for Transformer-based architectures, this enhancement introduced additional computational overhead, further prolonging inference time and diminishing their real-time adaptability. Results from the third group of experiments indicate that incorporating the Dynamic Cross-Feature Fusion (DCFF) module into the matching process yields an additional 3–6% increase in overall matching accuracy. This underscores the effectiveness of the module in enhancing multimodal information interaction, particularly in optimizing feature fusion for large-scale models applied to domain-specific tasks. Moreover, on the more challenging Test Set 2, Transformer-based models exhibit a marked decline in matching accuracy, further exposing their limitations in robustness. Notably, their inference time is nearly twice that of the proposed method, reinforcing the practical advantages of our approach under complex and noise-prone conditions.

5.5. Comparative Experiments on Bone Stick Preprocessing

To systematically assess the role of image preprocessing in enhancing bone stick matching accuracy, a controlled comparative experiment was designed and conducted. The objective was to evaluate how the performance of the matching algorithm varies under distinct image processing conditions. The experiment was divided into two groups. The first group served as the baseline: raw bone stick images, without any preprocessing, were used directly as inputs, along with the corresponding inscription features and archeological attributes, and fed into the matching model proposed in this study. The associated test sets likewise consisted of original bone stick images accompanied by their respective textual and contextual attributes. The second group applied the image preprocessing pipeline developed in this study, performing standardization and enhancement operations on the bone stick images prior to input. These preprocessed images were then combined with the same inscription information and archeological features and fed into the same model for performance evaluation. The corresponding test sets were uniformly constructed from preprocessed bone stick images and their respective inscription and contextual features.
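The preprocessing steps listed in the caption of Figure 4 (background removal, cropping to the fractured region, resizing to 224 × 224 pixels) can be sketched as follows. This is a simplified, dependency-free illustration and not the authors' pipeline: the binary mask is assumed to come from a segmentation model, the crop box is assumed to be supplied manually, nearest-neighbour resizing is an assumption, and all function names are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale image stored row by row


def remove_background(image: Image, mask: Image) -> Image:
    """Keep pixels where the (assumed) segmentation mask is non-zero, zero out the rest."""
    return [
        [px if m else 0 for px, m in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]


def crop(image: Image, box: Tuple[int, int, int, int]) -> Image:
    """Crop to (top, left, bottom, right); bottom and right are exclusive."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def resize_nearest(image: Image, size: int = 224) -> Image:
    """Nearest-neighbour resize to size x size pixels."""
    height, width = len(image), len(image[0])
    return [
        [image[r * height // size][c * width // size] for c in range(size)]
        for r in range(size)
    ]


# Toy 6 x 4 image; the mask keeps the two middle columns (hypothetical data).
image = [[10 * (r + 1) + c for c in range(4)] for r in range(6)]
mask = [[1 if c in (1, 2) else 0 for c in range(4)] for _ in range(6)]

foreground = remove_background(image, mask)
fracture_region = crop(foreground, (0, 1, 3, 3))  # hypothetical fracture box
model_input = resize_nearest(fracture_region)
print(len(model_input), len(model_input[0]))
```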

As shown in Table 5, under the experimental setting involving no image preprocessing—i.e., using raw bone stick images—the proposed method achieved an average matching accuracy of 56.14% on Test Set 1 and 44.30% on Test Set 2. In comparison, when utilizing preprocessed bone stick images under identical model conditions, the average matching accuracy significantly improved to 74.13% and 68.88%, respectively—corresponding to increases of 17.99 and 24.58 percentage points. These findings clearly demonstrate the substantial effectiveness of image preprocessing in enhancing bone stick image matching accuracy and provide strong empirical evidence for the practical utility of the proposed preprocessing pipeline in optimizing feature representation and enhancing model performance.
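The gains quoted above are differences in percentage points. The short sketch below recomputes them from the Table 5 averages and, for comparison, also reports the relative gain; the helper name `accuracy_gain` is an assumption introduced for illustration.

```python
from typing import Tuple


def accuracy_gain(before: float, after: float) -> Tuple[float, float]:
    """Return (gain in percentage points, relative gain in percent)."""
    points = after - before
    relative = 100.0 * points / before
    return points, relative


# Average matching accuracy (%) from Table 5: (unprocessed, preprocessed).
table5 = {"Test Set 1": (56.14, 74.13), "Test Set 2": (44.30, 68.88)}

gains = {name: accuracy_gain(raw, pre) for name, (raw, pre) in table5.items()}
for name, (points, relative) in gains.items():
    print(f"{name}: +{points:.2f} points, +{relative:.1f}% relative")
```

Expressed relative to the unprocessed baseline, the same improvements are roughly 32% and 55%, which is why the unit (percentage points versus percent) should be stated when quoting them.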

5.6. Sample Visualization

Figure 11 illustrates the results of a matching experiment conducted on the upper portion of a bone stick image. Based on the computed similarity scores, the model ranks the 15 candidate images in descending order of their predicted matching probabilities. The results indicate that when there is a strong overall resemblance or partial feature overlap between the input image and a given candidate, the similarity score approaches 1, signifying a high likelihood of accurate matching. Conversely, as the structural or semantic correspondence between the samples diminishes, the similarity score progressively declines toward 0, reflecting their increasing dissimilarity in the feature space.
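The ranking step described here amounts to sorting candidates by similarity score and keeping the top 15. A minimal sketch follows; the candidate identifiers and scores are hypothetical and the function name is an assumption.

```python
from typing import Dict, List, Tuple


def rank_candidates(scores: Dict[str, float], top_k: int = 15) -> List[Tuple[str, float]]:
    """Return the top_k candidates sorted by similarity, highest first."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ordered[:top_k]


# Hypothetical lower-fragment candidates with similarity scores in [0, 1].
scores = {
    "lower_017": 0.12,
    "lower_042": 0.97,
    "lower_003": 0.55,
    "lower_108": 0.81,
}

ranking = rank_candidates(scores, top_k=3)
for candidate_id, score in ranking:
    print(candidate_id, score)
```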

6. Discussion

Despite demonstrating significant performance advantages across multiple test sets, the proposed large-scale multimodal bone stick matching framework still has several limitations that warrant further refinement and extension for practical deployment. First, the model depends strongly on input image quality during the preprocessing stage. When faced with noise factors such as blurriness, occlusion, or poor lighting conditions, the Vision-RWKV module may struggle to accurately extract fracture edges and morphological features, which may degrade overall matching performance. Second, the archeological auxiliary information embedded in the textual modality is primarily modeled via the BERT encoder. However, such metadata are often incomplete, inconsistent, or prone to manual annotation errors in real-world archeological documents and databases, which introduces noise into the multimodal fusion process and compromises both accuracy and robustness. Because the ablation experiments show that Vision-RWKV alone can achieve satisfactory matching performance from visual features, we recommend that, when textual metadata are missing, an empty textual field be supplied and the remaining modules be used to perform the matching task. Finally, although the proposed dynamic cross-modal fusion framework enables deep integration of visual and textual representations, it still relies on manually cropping the fractured regions of bone sticks during image preprocessing to guide the model’s attention toward potentially informative areas. This dependency limits the scalability of the framework in large-scale, fully automated application scenarios.

Future research will explore the incorporation of weakly supervised or unsupervised image segmentation techniques to automatically detect and extract salient fracture regions, thereby reducing reliance on manual intervention. Additionally, we intend to conduct systematic evaluations of the proposed approach on larger, more structurally complex datasets—particularly those containing noisy or ambiguous inscriptions—to further validate its robustness and adaptability under real-world archeological conditions.

Furthermore, we recognize the limitations of using a simple L2 distance followed by a sigmoid activation to assess similarity. In future work, we plan to incorporate a learnable similarity head or a lightweight contrastive-learning-based projector to better capture nuanced semantic relationships—especially for bone fragments with partial or degraded inscriptions. These enhancements are expected to significantly improve the discriminative capacity of the model’s decision layer while maintaining end-to-end differentiability and efficiency.
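For reference, a similarity score built from an L2 distance followed by a sigmoid can be sketched as below. The exact mapping used by the matching module is not restated in this section, so the form `2 * sigmoid(-d)` is an assumption chosen only because it yields 1 for identical features and decays toward 0 as the distance grows; the function names are likewise hypothetical.

```python
import math
from typing import Sequence


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed mapping: 1.0 at zero distance, approaching 0 as the distance grows."""
    return 2.0 * sigmoid(-l2_distance(a, b))


# Toy fused features (hypothetical values).
query = [0.2, 0.4, 0.6]
near = [0.25, 0.35, 0.6]
far = [2.0, -1.0, 3.0]

print(similarity(query, query), similarity(query, near), similarity(query, far))
```

Because this mapping is fixed and has no trainable parameters, every feature dimension contributes equally to the score, which is the limitation a learnable similarity head would address.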

7. Conclusions

In this study, we proposed a novel multimodal bone stick matching approach that significantly enhances the matching accuracy and efficiency of fragmented bone stick artifacts. By integrating image, inscription, and archeological data through the use of pre-trained large-scale models such as Vision-RWKV, RWKV, and BERT, we were able to achieve a remarkable matching accuracy of 94.73% at Rank-15. This approach addresses several critical challenges in the domain of artifact restoration, particularly the inefficiency and potential damage caused by traditional manual matching methods.

The significance of these results lies in the model’s ability to effectively handle fragmented and degraded artifacts, providing a robust solution for digital preservation and restoration tasks. Moreover, the dynamic cross-modal feature fusion mechanism enhances the model’s ability to adapt to variations in the data, making it highly applicable in archeological and cultural heritage studies. The proposed method not only outperforms existing traditional matching approaches but also offers a scalable framework that can be adapted to other similar domains requiring multimodal data integration.

In conclusion, this research offers substantial technical advancements for the restoration of cultural heritage artifacts, showcasing the potential of multimodal information fusion in improving both the accuracy and efficiency of artifact matching, thus supporting the broader goal of cultural preservation.

In the future, this multimodal framework can be extended to other types of cultural heritage restoration tasks—such as matching pottery shards, bamboo slips, or bronze fragments—and adapted for mobile deployment to support real-time in-field analysis by archeologists.

Author Contributions

T.F.: Conceived and conducted the experiments, analyzed data, and drafted the initial manuscript. H.W.: Managed the project, supervised research activities, and contributed to manuscript revision. K.W.: Developed experimental models, programmed the system, and resolved technical issues. R.L.: Supervised the project, provided archeological datasets, and offered domain-specific guidance. Z.W.: Validated and corrected the dataset, and coordinated regional cultural heritage contributions. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Social Science Fund Unpopular Special Project of China, grant number 20VJXT001.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

Special thanks are extended to Li Mao of the School of Information and Control Engineering, Xi’an University of Architecture and Technology for his substantial guidance throughout this work. With his extensive expertise in cultural heritage preservation, foundation models, and deep learning, Mao provided meticulous revisions to the manuscript’s language, significantly improving its clarity and precision. In addition, he offered in-depth modifications to the graphical representations, ensuring the work’s scientific rigor while enhancing its accessibility to a wider audience.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Yu, Z.Y. A Study on the Identification and Classification of Bone Sticks Excavated from the Weiyang Palace Site of Han Chang’an City. Arch. Cult. Relics 2007, 2, 48–62.
2. Liu, G.N. The Earliest Specialized Archive Repository in Our Country—The Han Dynasty Oracle Bone Inscription Archives Repository. Chin. Arch. 2007, 2, 50–52.
3. Liu, C.; Wang, H.; Mao, L.; Liu, R.; Wang, Z.; Wang, T. Image Stitching Method of Bone Stick Fragment Based on Similarity Freeman Code Matching. IEEE Access 2023, 11, 23073–23084.
4. Liang, Q.; Yang, L.; Luo, Z.; Jiang, W.; Hong, C. A Siamese Network-Based Method for Automatic Stitching of Artifact Fragments. IEEE Trans. Instrum. Meas. 2023, 72, 2520913.
5. Rui, X.; Zhang, X.; Wang, R. Application of the Multi-Feature Splicing Technology Based on Residual Network Identification. In Proceedings of the International Conference on Electronic Information Technology and Smart Agriculture (ICEITSA), Huaihua, China, 9–11 December 2022; pp. 381–384.
6. Samonte, M.J.C.; Kong, T. Image Processing of Cultural Relics Fragments Splicing through Hybrid Folded Mesh Simplification Algorithm. In Proceedings of the 2023 5th World Symposium on Software Engineering (PWSSE), Tokyo, Japan, 22–24 September 2023; pp. 305–314.
7. Wang, X.; Fu, S. Cultural Relics Fragment Assembly Based on Fracture Surface Contour Features. Appl. Opt. 2024, 63, 5278–5291.
8. Duan, Y.; Zhang, L.; Chen, X.; Wang, Q.; Zhao, Y.; Liu, Z.; Liu, Q. Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures. arXiv 2025, arXiv:2403.02308.
9. Peng, B.; Xue, F.; Wang, Q.; Liu, Y.; Gao, Z.; Zhang, L.; Zhang, H.; Wang, Y.; Tang, Y.; Liang, X.; et al. RWKV: Reinventing RNNs for the Transformer Era. arXiv 2023, arXiv:2305.13048.
10. Koroteev, M.V. BERT: A Review of Applications in Natural Language Processing and Understanding. arXiv 2021, arXiv:2103.11943.
11. Sherstinsky, A. Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) Network. arXiv 2021, arXiv:1808.03314v10.
12. Han, K.; Xiao, A.; Wu, E.; Guo, J.; Xu, C.; Wang, Y. Transformer in Transformer. In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS), Virtual Event, 6–14 December 2021; pp. 15908–15919.
13. Xie, S.; Li, Y.; Ma, Y.; Wu, Y. AutoGMM-RWKV: A Detecting Scheme Based on Attention Mechanisms against Selective Forwarding Attacks in Wireless Sensor Networks. IEEE Internet Things J. 2024, 12, 1–18.
14. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer Normalization. arXiv 2016, arXiv:1607.06450v1.
15. Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.; Farhan, L. Review of Deep Learning: Concepts, CNN Architectures, Challenges, Applications, and Future Directions. J. Big Data 2021, 8, 53.
16. Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in Vision: A Survey. ACM Comput. Surv. 2022, 54, 1–41.
17. Eldan, R.; Shamir, O. The Power of Depth for Feedforward Neural Networks. arXiv 2016, arXiv:1512.03965.
18. Adekotujo, A.S.; Enikuomehin, T.; Aribisala, B.; Mazzara, M.; Zubair, A.F. Computational Treatment of Natural Language Text for Intent Detection. Comput. Res. Model. 2024, 16, 1539–1554.
19. Qin, X.; Dai, H.; Hu, X.; Fan, D.-P.; Shao, L.; Van Gool, L. Highly Accurate Dichotomous Image Segmentation. arXiv 2022, arXiv:2203.03041.
20. Qin, X.; Zhang, Z.; Huang, C.; Dehghan, M.; Zaiane, O.R.; Jagersand, M. U2-Net: Going Deeper with Nested U-Structure for Salient Object Detection. Pattern Recognit. 2020, 106, 107404–107419.
21. Xue, Z.; Marculescu, R. Dynamic multimodal fusion. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 17–24 June 2023; pp. 2575–2584.

Figure 1. The architecture of RWKV: (a) Complete RWKV residual block; (b) interactions among elements within the RWKV residual block.

Figure 2. The architecture of VRWKV: (a) The VRWKV framework consists of L identical VRWKV encoder layers, an average pooling layer, and a linear prediction head. (b) The detailed configuration of the VRWKV encoder layer is illustrated. The “Q-Shift” represents a four-way shifting approach specifically designed for visual tasks. The “Bi-WKV” module operates as a bidirectional recurrent neural network unit or a global attention mechanism.

Figure 3. Framework of the pre-trained large-model-based bone stick matching method driven by dynamic cross-fusion of multimodal information.

Figure 4. Processing pipeline for bone stick images: (a) Original bone stick image. (b) Image segmented using DIS-Net. (c) Image with background removed. (d) Image cropped to retain fractured regions. (e) Image resized to 224 × 224 pixels.

Figure 5. Architecture of the feature extraction network (RVB module). The diagram illustrates the three input branches for multimodal feature extraction: the RWKV model processes Chinese-character inscriptions from bone sticks; Vision-RWKV takes preprocessed bone stick images as input; and BERT encodes the textual archeological metadata (e.g., color, excavation context, and missing section) in English.

Figure 6. The input structure of BERT is designed to process archeological metadata related to bone sticks, including attributes such as color, excavation context number, and missing section.

Figure 7. Architecture of the Dynamic Cross-Feature Fusion (DCFF) module.

Figure 8. Architecture of the Matching Calculation (MC) module.

Figure 9. Results of contrastive learning: (a) Training and validation loss curves. (b) Matching accuracy from Rank-1 to Rank-15 on Test Set 1.

Figure 10. Cumulative Match Characteristic (CMC) curves of different matching methods: (a) CMC curves on Test Set 1, illustrating performance under ideal matching conditions. (b) CMC curves on Test Set 2, highlighting robustness under data heterogeneity and distributional shifts.

Figure 11. Visualization of sample results: for each input image to be matched, a total of 15 potential matching result images are generated. The figure illustrates the top three and bottom three sets of potential matching results.

Table 1. Model selection rationale and modality-specific configuration details. “→” indicates the transformation of the feature vector.

Model Name	Modality	Pretraining Corpus	Output Feature Dimension
Vision-RWKV-L	Visual (Image)	ImageNet-22K	150,528 (original)→ 1024 (final)
RWKV-6 World	Text (Inscriptions)	The Pile (multilingual corpus)	768 (original)→ 1024 (final)
BERT-base-cased	Text (Metadata)	BooksCorpus + Wikipedia (EN)	768 (original)→ 1024 (final)
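The dimensions in Table 1 can be checked with a little arithmetic: 150,528 equals 224 × 224 × 3, the size of a flattened 224 × 224 image with three channels, and each modality is brought to a common 1024-dimensional space. The sketch below verifies the arithmetic and shows a 768 → 1024 linear projection with random weights; the three-channel assumption, the use of a single linear layer, and all names are illustrative assumptions, not the authors' configuration.

```python
import random
from typing import List

IMAGE_SIDE = 224  # side length after resizing (caption of Figure 4)
CHANNELS = 3      # assumption: three-channel (RGB) input
flattened_dim = IMAGE_SIDE * IMAGE_SIDE * CHANNELS


def random_layer(in_dim: int, out_dim: int, seed: int = 0) -> List[List[float]]:
    """Random weight matrix of shape (out_dim, in_dim); a stand-in for learned weights."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.01, 0.01) for _ in range(in_dim)] for _ in range(out_dim)]


def linear_project(x: List[float], weight: List[List[float]]) -> List[float]:
    """Apply y = W x (bias omitted for brevity)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


# Hypothetical 768-dimensional text feature projected to the shared 1024 dimensions.
text_feature = [0.5] * 768
projection = random_layer(in_dim=768, out_dim=1024)
projected = linear_project(text_feature, projection)
print(flattened_dim, len(projected))
```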

Table 2. Dataset details of bone stick images. The quantities presented in the table are measured in terms of image pairs, with the corresponding number of images in the four datasets being 4200, 1200, 600, and 6000, respectively.

Training Set	Validation Set	Test Set 1	Test Set 2
2100 pairs	600 pairs	300 pairs	3000 pairs

Table 3. Model architecture and parameter comparison. “Average” denotes the mean matching accuracy of the model across Rank-1 to Rank-15 (in percentage). “Params/M” refers to the total number of model parameters, measured in millions. “TD/S” indicates the inference time (in seconds) required by the model on the designated test set.

	Method	Rank-1/%	Rank-5/%	Rank-10/%	Rank-15/%	Average/%	Params/M	TD/S
Test Set 1	(1) VR-S	31.20	48.54	65.02	71.16	53.98	215.23	32.41
	(2) VR-B	32.85	50.71	68.40	73.52	56.37	216.88	35.08
	(3) VR-L	34.93	55.13	75.12	82.25	61.86	217.94	50.37
	(4) VR-L and R4 (direct concatenation)	35.21	56.31	76.57	83.94	63.01	220.47	52.31
	(5) VR-L and R5 (direct concatenation)	36.78	57.42	77.23	84.65	64.02	223.92	54.72
	(6) VR-L and R6 (direct concatenation)	38.04	58.89	79.84	86.78	65.89	339.47	61.93
	(7) VR-L and R6 and Bm (direct concatenation)	38.62	59.54	81.12	87.40	66.67	442.13	64.87
	(8) VR-L and R6 and Bb (direct concatenation)	40.01	61.38	82.97	90.36	68.68	446.82	67.11
	(9) R6 and Bb (direct concatenation)	17.83	27.74	30.16	37.85	28.39	134.27	18.52
	(10) R6 and Bb (Static Averaging)	18.51	28.66	38.85	42.23	32.07	165.18	30.13
	(11) VR-L and Bb (direct concatenation)	37.12	56.95	76.98	83.83	63.72	414.55	62.26
	(12) VR-L and Bb (Static Averaging)	38.27	58.71	79.36	86.43	65.68	427.39	64.19
	(13) VR-L and R6 and Bb (Static Averaging)	41.72	64.32	83.63	90.42	69.52	439.12	65.28
	(14) VR-L and R6 and Bb (Dynamic Cross-Feature Fusion) -ours	46.71	67.84	87.26	94.73	74.13	467.76	67.54
Test Set 2	(1) VR-S	25.20	43.14	59.92	62.41	47.67	215.23	187.65
	(2) VR-B	27.05	45.49	63.47	69.17	51.30	216.88	204.91
	(3) VR-L	29.43	50.18	70.44	78.12	57.04	217.94	292.56
	(4) VR-L and R4 (direct concatenation)	30.01	51.63	72.15	80.04	58.46	220.47	303.26
	(5) VR-L and R5 (direct concatenation)	31.78	52.92	72.98	80.90	59.65	223.92	315.45
	(6) VR-L and R6 (direct concatenation)	33.34	54.66	75.84	83.25	61.77	339.47	357.19
	(7) VR-L and R6 and Bm (direct concatenation)	34.12	55.49	77.29	84.02	62.73	442.13	373.82
	(8) VR-L and R6 and Bb (direct concatenation)	35.71	57.51	79.31	85.13	64.42	446.82	390.02
	(9) R6 and Bb (direct concatenation)	9.52	14.81	16.10	20.21	15.16	134.27	106.58
	(10) R6 and Bb (Static Averaging)	10.74	16.63	22.54	24.51	18.61	165.18	173.14
	(11) VR-L and Bb (direct concatenation)	27.24	41.49	56.49	61.52	46.69	414.55	357.72
	(12) VR-L and Bb (Static Averaging)	29.61	45.42	61.40	66.87	50.83	427.39	369.26
	(13) VR-L and R6 and Bb (Static Averaging)	36.17	58.76	80.66	85.96	65.38	439.12	375.04
	(14) VR-L and R6 and Bb (Dynamic Cross-Feature Fusion) -ours	42.83	62.16	83.21	87.33	68.88	467.76	391.40
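As a consistency check on Table 3, the short script below recomputes the “Average” column for three Test Set 1 rows. It assumes that the average is the mean of the four tabulated ranks (Rank-1, Rank-5, Rank-10, and Rank-15); this is an inference from the numbers rather than something the caption states, and the variable names are illustrative.

```python
from typing import Dict, List, Tuple

# Experiment number -> ([Rank-1, Rank-5, Rank-10, Rank-15], reported Average), Test Set 1.
rows: Dict[int, Tuple[List[float], float]] = {
    1: ([31.20, 48.54, 65.02, 71.16], 53.98),
    8: ([40.01, 61.38, 82.97, 90.36], 68.68),
    14: ([46.71, 67.84, 87.26, 94.73], 74.13),
}

deviations: Dict[int, float] = {}
for experiment, (rank_values, reported) in rows.items():
    mean_of_ranks = sum(rank_values) / len(rank_values)
    deviations[experiment] = abs(mean_of_ranks - reported)
    print(f"Experiment ({experiment}): mean={mean_of_ranks:.3f}, reported={reported:.2f}")

# Largest gap between the recomputed mean and the tabulated value.
max_deviation = max(deviations.values())
```

For these rows the recomputed means agree with the tabulated averages to within rounding.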

Table 4. Comparison between the proposed method and other pretrained large-scale models. “Average” denotes the mean matching accuracy of the model across Rank-1 to Rank-15 (expressed as a percentage). “Params/M” refers to the total number of model parameters, measured in millions. “TD/S” indicates the inference time (in seconds) required by the model on the designated test set.

	Method	Rank-1/%	Rank-5/%	Rank-10/%	Rank-15/%	Average/%	Params/M	TD/S
Test Set 1	(1) ViLT-Large	32.03	49.62	66.71	72.34	55.17	347.17	89.12
	(2) VLMo-base	35.07	55.72	75.84	83.01	62.41	425.63	94.01
	(3) VR-L and R6	38.04	58.89	79.84	86.78	65.89	339.47	61.93
	(4) ViLT-Large and Bb (direct concatenation)	36.62	57.60	78.20	85.36	64.45	454.52	94.29
	(5) VLMo-base and Bb (direct concatenation)	37.70	58.48	79.17	86.02	65.34	532.98	100.13
	(6) VR-L and R6 and Bb (direct concatenation)	40.01	61.38	82.97	90.36	68.68	446.82	67.11
	(7) ViLT-Large and Bb (Dynamic Cross-Feature Fusion)	39.02	60.13	81.40	88.57	67.28	475.46	97.46
	(8) VLMo-base and Bb (Dynamic Cross-Feature Fusion)	41.72	62.83	84.10	91.27	69.98	553.92	112.67
	(9) VR-L and R6 and Bb (Dynamic Cross-Feature Fusion) -ours	46.71	67.84	87.26	94.73	74.13	467.76	67.54
Test Set 2	(1) ViLT-Large	26.12	44.31	61.70	65.79	49.48	347.17	481.25
	(2) VLMo-base	28.24	47.83	66.95	73.64	54.16	425.63	535.86
	(3) VR-L and R6	33.34	54.66	75.84	83.25	61.77	339.47	357.19
	(4) ViLT-Large and Bb (direct concatenation)	32.56	53.79	74.41	82.07	60.71	454.52	603.46
	(5) VLMo-base and Bb (direct concatenation)	33.73	55.07	76.56	83.63	62.25	532.98	624.81
	(6) VR-L and R6 and Bb (direct concatenation)	35.71	57.51	79.31	85.13	64.42	446.82	390.02
	(7) ViLT-Large and Bb (Dynamic Cross-Feature Fusion)	33.73	55.07	76.56	83.63	62.24	475.46	662.79
	(8) VLMo-base and Bb (Dynamic Cross-Feature Fusion)	34.91	56.50	78.30	84.58	63.57	553.92	766.16
	(9) VR-L and R6 and Bb (Dynamic Cross-Feature Fusion) -ours	42.83	62.16	83.21	87.33	68.88	467.76	391.40

Table 5. Comparative evaluation of image preprocessing on bone sticks matching performance. “Average” refers to the mean matching accuracy across Rank-1 to Rank-15 (in percentage).

		Rank-1/%	Rank-5/%	Rank-10/%	Rank-15/%	Average/%
Unprocessed Bone Stick Dataset	Test Set 1	34.65	48.35	64.72	76.83	56.14
Unprocessed Bone Stick Dataset	Test Set 2	28.83	39.62	52.27	56.46	44.30
Preprocessed Bone Stick Dataset	Test Set 1	46.71	67.84	87.26	94.73	74.13
Preprocessed Bone Stick Dataset	Test Set 2	42.83	62.16	83.21	87.33	68.88

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

