4.2. Experimental Setup
We adopted X-CLIP as the baseline method. The framework was implemented in PyTorch, and all experiments were conducted on a single NVIDIA GeForce RTX 4090 24 GB GPU. Both the text and image encoders were initialized with publicly released CLIP checkpoints. The pre-trained CLIP model was fine-tuned with a learning rate of , while the remaining components were trained with a learning rate of . Optimization used the Adam optimizer with a cosine learning rate scheduler. The embedding dimensions for text, video, and image were uniformly set to 512, consistent with the dimensions of the spatial and temporal features in the text decomposition module. In the submodule configurations, the number of layers l in the temporal encoder and the top-k parameter in the patch selection module were set to four and six, respectively. For MSR-VTT and MSVD, we sampled 12 frames per video, set the maximum text length per independent caption to 32, and trained for five epochs with a batch size of 64. For DiDeMo, we used 64 frames per video, merged the five captions into a single paragraph (maximum length 64), and trained for 10 epochs with a batch size of 16.
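As a point of reference, the two-rate optimizer setup described above can be sketched in PyTorch as follows. This is a minimal sketch: the two learning-rate values are not given in this excerpt, so the defaults below are hypothetical placeholders, as is the `clip.` parameter-name prefix used to split the parameter groups.

```python
from torch import nn
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

def build_optimizer(model: nn.Module, num_steps: int,
                    lr_clip: float = 1e-7, lr_new: float = 1e-4):
    """Two learning-rate groups: a small rate for the pre-trained CLIP
    encoders and a larger rate for the newly added modules. The default
    rates are hypothetical; the paper's exact values are not given here."""
    clip_params, new_params = [], []
    for name, p in model.named_parameters():
        # Assumes CLIP encoder parameters live under a `clip.` prefix.
        (clip_params if name.startswith("clip.") else new_params).append(p)
    optimizer = Adam([
        {"params": clip_params, "lr": lr_clip},  # fine-tuned CLIP backbone
        {"params": new_params, "lr": lr_new},    # remaining components
    ])
    scheduler = CosineAnnealingLR(optimizer, T_max=num_steps)  # cosine decay
    return optimizer, scheduler
```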
4.3. Comparison to State-of-the-Art Models
We selected several state-of-the-art methods as comparative baselines, including CLIP4Clip [13], X-CLIP [6], EERCF [35], TS2-Net [36], and UCoFiA [8]. Among them, CLIP4Clip, X-CLIP, and EERCF directly adopt ViT-based architectures for visual feature extraction, while TS2-Net and UCoFiA enhance the ViT structure by incorporating token shifting during the visual feature extraction stage, which enables inter-frame token interaction. However, these methods are designed solely for text-to-video retrieval and do not simultaneously perform text-to-image retrieval.
To adapt them for image retrieval, we adopted the following strategies: for CLIP4Clip, X-CLIP, and EERCF, we extracted the [CLS] token from the ViT encoder as the image representation and computed its similarity with the text feature; for TS2-Net and UCoFiA, which require a sequence of frames (e.g., 12 frames) as input and cannot process a single image directly, we constructed a pseudo-video by repeating the single image 12 times. This frame sequence was fed into the video encoder to obtain the image feature, which was then compared with the text feature. Both strategies are sketched below.
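The following minimal sketch illustrates both adaptations, assuming a ViT encoder that returns a token sequence with [CLS] at index 0 and a video encoder that consumes a (batch, frames, channels, height, width) tensor; these interfaces are illustrative, not the baselines' actual APIs.

```python
import torch
import torch.nn.functional as F

def image_feature_vit(vit_encoder, image: torch.Tensor) -> torch.Tensor:
    """CLIP4Clip / X-CLIP / EERCF-style adaptation: take the ViT [CLS]
    token as the image representation."""
    tokens = vit_encoder(image)  # (B, N + 1, D), [CLS] assumed at index 0
    return tokens[:, 0]          # (B, D)

def image_feature_pseudo_video(video_encoder, image: torch.Tensor,
                               num_frames: int = 12) -> torch.Tensor:
    """TS2-Net / UCoFiA-style adaptation: repeat the single image
    `num_frames` times to form a pseudo-video, then encode it."""
    pseudo_video = image.unsqueeze(1).repeat(1, num_frames, 1, 1, 1)  # (B, T, C, H, W)
    return video_encoder(pseudo_video)

def text_visual_similarity(text_feat: torch.Tensor,
                           visual_feat: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between L2-normalized text and visual features."""
    return F.normalize(text_feat, dim=-1) @ F.normalize(visual_feat, dim=-1).t()
```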
To ensure a fair and reliable comparison, we re-implemented all baseline methods using their officially released code and followed the training strategies described in the respective papers. Under the experimental environment specified in Section 4.2, we retrained all compared methods on the MSR-VTT, MSVD, and DiDeMo datasets.
On the MSR-VTT dataset, Table 1 presents the results of UniTriM on both the text–video and text–image retrieval tasks. For text–video retrieval, UniTriM's R@1 is 0.6 percentage points higher than X-CLIP's; although it trails EERCF on this metric, it still attains the second-best performance. On the R@5 and R@10 metrics, UniTriM also demonstrates clear advantages, outperforming all compared baselines: R@5 improves by 2.2 percentage points over X-CLIP, and R@10 by 1.6 percentage points. For text–image retrieval, UniTriM surpasses all competing methods on R@1, R@5, and R@10. Notably, on R@1 it exceeds X-CLIP by 5.5 percentage points, which strongly validates the effectiveness of the proposed multi-granularity semantic fusion and feature disentanglement modules in aligning text with static images across modalities.
On the MSVD dataset, UniTriM also demonstrates superior performance. As shown in Table 2, for text–video retrieval, UniTriM's R@1 is 0.9 percentage points higher than X-CLIP's, and it attains the highest R@10 among all methods, exceeding X-CLIP by 0.8 percentage points. For text–image retrieval, UniTriM exhibits an even more pronounced advantage, improving R@1 by a substantial 6.0 percentage points over X-CLIP; it also leads on R@10. These results further validate the robustness and generalization ability of the proposed model across datasets of different scales.
On the DiDeMo dataset, which contains longer videos and paragraph-level text descriptions and thus presents a greater challenge, UniTriM still maintains strong performance. As shown in Table 3, for text–video retrieval, UniTriM improves R@1 by 1.3 percentage points over X-CLIP and also delivers the best results on R@5 and R@10. For text–image retrieval, UniTriM outperforms X-CLIP on R@1 by 4.0 percentage points and likewise surpasses all other methods on R@5 and R@10.
To comprehensively evaluate the model's practicality, we compare the number of parameters and inference speed of our method against the other approaches in Table 4. Our method has a higher parameter count than the comparison methods, an increase we attribute primarily to the encoders within the feature-disentanglement module; we consider this an acceptable trade-off given the resulting performance gains. Regarding inference speed, our method is slower than CLIP4Clip, X-CLIP, and EERCF owing to the additional computation introduced by fine-grained alignment, but faster than TS2-Net and UCoFiA: because those two methods cannot process images directly, they must expand each image into a pseudo-video, which significantly increases their computational load.
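For completeness, the sketch below shows how such parameter counts and per-batch inference times can be measured; this is an illustrative protocol under the single-GPU setup of Section 4.2, not necessarily the exact procedure used for Table 4.

```python
import time
import torch

def count_parameters(model: torch.nn.Module) -> int:
    """Total trainable parameters, as reported in model-size comparisons."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

@torch.no_grad()
def average_inference_time(model, batch, n_warmup: int = 5,
                           n_runs: int = 20) -> float:
    """Mean wall-clock time per forward pass in seconds. CUDA kernels are
    asynchronous, so we synchronize before reading the clock."""
    for _ in range(n_warmup):
        model(batch)  # warm-up runs (kernel compilation, cache effects)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        model(batch)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs
```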
4.4. Ablation Studies
To evaluate the effectiveness of each module, we conducted ablation studies on the MSR-VTT dataset. First, we ablated the alignments at different granularities in the MGSA mechanism by incrementally adding each proposed component; the results are summarized in Table 5. Relative to the variant that uses only coarse-grained alignment between text and video, incorporating all five granularity-level alignments improves R@1 by 3.5 percentage points, R@5 by 3.0 percentage points, and R@10 by 3.5 percentage points. These results demonstrate the importance of multi-granularity alignment in capturing fine-grained vision–language correspondences: the mechanism effectively integrates semantic information from different hierarchical levels, leading to significant performance gains.
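As an illustration of the idea, the sketch below fuses similarity matrices computed at different granularity levels into a single retrieval score; the five levels and the uniform weighting are assumptions for illustration, not the paper's exact fusion rule.

```python
import torch

def fuse_granularity_similarities(sim_list: list[torch.Tensor],
                                  weights: torch.Tensor | None = None) -> torch.Tensor:
    """Combine per-granularity text-visual similarity matrices, e.g. the
    five levels ablated in Table 5, into one score per text-visual pair.
    Each element of `sim_list` has shape (num_texts, num_visuals)."""
    sims = torch.stack(sim_list)  # (L, num_texts, num_visuals)
    if weights is None:           # uniform weighting as a default assumption
        weights = torch.full((sims.size(0),), 1.0 / sims.size(0))
    return torch.einsum("l,lij->ij", weights, sims)

# Usage with dummy similarity matrices for 4 texts and 8 visual candidates:
sims = [torch.randn(4, 8) for _ in range(5)]
fused = fuse_granularity_similarities(sims)  # (4, 8) fused scores
```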
As shown in Table 6, we next assessed the contribution of the disentanglement module. No-Disentangle denotes the variant that directly employs the textual feature for text-to-image retrieval, where the textual feature is obtained by summing the sentence-level, triplet-level, and word-level representations without passing through the spatial disentanglement encoder. On the text–image retrieval task, introducing the spatial disentangler improves R@1, R@5, and R@10, with R@1 increasing by 2.6 percentage points. This demonstrates that the spatial disentangler can effectively separate features from non-critical spatial interference in images, enabling the model to focus on core visual content during semantic matching between text and static images and thereby significantly enhancing retrieval accuracy. Both variants are sketched below.
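The sketch below contrasts the two variants; the single Transformer encoder layer standing in for the spatial disentanglement encoder is a placeholder, since the module's exact architecture is not restated in this section.

```python
import torch
from torch import nn

class SpatialDisentangler(nn.Module):
    """Placeholder for the spatial disentanglement encoder (a single
    Transformer encoder layer here; the actual architecture may differ)."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                                  batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.encoder(x.unsqueeze(1)).squeeze(1)  # (B, D) -> (B, D)

def text_feature_for_image_retrieval(sent, triplet, word, disentangler=None):
    """No-Disentangle: sum sentence-, triplet-, and word-level features.
    With the disentangler, the summed feature is further encoded to
    suppress spatially irrelevant content before image matching."""
    fused = sent + triplet + word  # (B, 512) summed text representation
    return fused if disentangler is None else disentangler(fused)
```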
In addition, we performed an ablation study on the parameter k in the patch selection module; the results are shown in Table 7. We evaluated three numbers of selected patches, k = 4, k = 6, and k = 8, and the optimal performance was achieved at k = 8. On the R@1 metric for text–video retrieval, performance improved from k = 4 to k = 6 and gained a further 0.1 percentage points at k = 8; on R@5, k = 8 improved over k = 6 by 0.4 percentage points. Overall, as k increased from four to eight, the model showed a gradual improvement in both R@1 and R@5.
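A sketch of top-k patch selection follows; scoring patches by cosine similarity to the text feature is an illustrative choice, as the module's exact scoring function is not restated here.

```python
import torch
import torch.nn.functional as F

def select_topk_patches(patch_feats: torch.Tensor, text_feat: torch.Tensor,
                        k: int = 8) -> torch.Tensor:
    """Keep the k patches most relevant to the text query.
    patch_feats: (B, N, D) patch tokens; text_feat: (B, D) query feature."""
    scores = torch.einsum("bnd,bd->bn",
                          F.normalize(patch_feats, dim=-1),
                          F.normalize(text_feat, dim=-1))  # (B, N) relevance
    idx = scores.topk(k, dim=1).indices                    # (B, k)
    idx = idx.unsqueeze(-1).expand(-1, -1, patch_feats.size(-1))
    return torch.gather(patch_feats, 1, idx)               # (B, k, D)
```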
Additionally, we conducted an ablation study on the two weighting parameters in the loss function during the feature disentanglement learning stage: one controls the strength of the reconstruction constraint, while the other regulates the orthogonality constraint. The results over the evaluated settings are shown in Table 8. To better illustrate them, we introduce the Rsum evaluation metric, defined as the sum of R@1, R@5, and R@10 on the text–image retrieval task; it provides a comprehensive measure of the model's overall retrieval performance. The highest Rsum score of 145.7 was achieved at the best-performing combination of the two weights. A sketch of the loss structure appears below.
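The sketch below shows one plausible form of this weighted loss together with the Rsum metric; the specific reconstruction term (MSE) and orthogonality term (squared cosine between the disentangled spatial and temporal features), as well as the weight names, are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def disentanglement_loss(recon: torch.Tensor, target: torch.Tensor,
                         spatial: torch.Tensor, temporal: torch.Tensor,
                         lambda_rec: float, lambda_orth: float) -> torch.Tensor:
    """lambda_rec weights the reconstruction constraint; lambda_orth weights
    the orthogonality constraint between the disentangled features."""
    loss_rec = F.mse_loss(recon, target)  # reconstruct the input feature
    cos = (F.normalize(spatial, dim=-1) *
           F.normalize(temporal, dim=-1)).sum(dim=-1)
    loss_orth = cos.pow(2).mean()         # drive spatial/temporal cosine to zero
    return lambda_rec * loss_rec + lambda_orth * loss_orth

def rsum(r1: float, r5: float, r10: float) -> float:
    """Rsum: sum of R@1, R@5, and R@10 on text-image retrieval."""
    return r1 + r5 + r10
```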
4.5. Visualization
To intuitively evaluate cross-modal semantic alignment, we show qualitative results on text–image–video retrieval in Figure 6. The bottom left shows the triplets extracted from the sentence, the middle displays the top-five retrieved images ranked by similarity, and the right shows keyframes from the top-five retrieved videos. Green checkmarks indicate correct results, and red crosses indicate incorrect ones. The test set was constructed with a one-to-one mapping, meaning each query text is associated with exactly one ground-truth video.
As shown in Figure 6a, for the query “a little girl does gymnastics,” our model first retrieves scenes related to the triplet “girl does gymnastics” or those containing girls. The modifiers “a” and “little” then enhance retrieval precision. This demonstrates the effectiveness of our fine-grained cross-modal retrieval approach. Figure 6b shows that for the query “a cartoon shows two dogs talking to a bird,” our model primarily retrieves cartoon-related scenes and then further aligns them through words such as “dog” and “bird.” In Figure 6c, we analyze the retrieval results for the query “fireworks are being lit and exploding in a night sky.” Our model tends to return scenes associated with “fireworks” and “night sky.” However, because the dataset contains visually similar content, retrieval errors occur, reflecting a certain bias in the model’s predictions.
4.6. Comprehensive Result Analysis and Discussion
To provide deeper insights into UniTriM's results, we conducted supplementary analyses of the performance improvements across tasks and datasets. First, as shown in Table 1, Table 2, and Table 3, UniTriM achieves significantly larger average improvements on image retrieval than on video retrieval. We attribute this to two main factors: (1) targeted module design: the feature disentanglement module is specifically designed to align spatial semantics in text with static images, which directly addresses the core requirement of image retrieval and thus yields more substantial gains; (2) differences in baseline saturation: video retrieval has been studied extensively, and its baselines already operate at a relatively high performance level, leaving limited room for improvement, whereas joint image–video retrieval is a newer task in which existing methods are mostly simple adaptations of video models, so UniTriM's targeted design brings more significant breakthroughs.
Second, we analyzed UniTriM's performance variations across the video datasets. On MSR-VTT, the improvements are relatively modest; we believe this is because the video annotations are relatively concise and the videos are short, leaving limited room for improvement. On MSVD, minimal scene changes and high background consistency across clips mean the model encounters less interference during feature extraction; in addition, individual frames carry richer semantic information, leading to better retrieval performance than on MSR-VTT. For the DiDeMo dataset, the paragraph-level descriptions provide more learnable semantic triplets and strengthen attention to core vocabulary, while the denser frame sampling strategy offers more detailed information for temporal modeling, collectively contributing to improved retrieval accuracy. These analyses validate the effectiveness of UniTriM's design choices and highlight how dataset characteristics influence multimodal retrieval performance.