Article
Peer-Review Record

Towards Robust Text-Based Person Retrieval: A Framework for Correspondence Rectification and Description Synthesis

Electronics 2025, 14(23), 4619; https://doi.org/10.3390/electronics14234619
by Longlong Yu 1, Lian Xiong 2,*, Wangdong Li 2 and Yuxi Feng 2
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3:
Submission received: 27 October 2025 / Revised: 19 November 2025 / Accepted: 23 November 2025 / Published: 25 November 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper proposes a robust framework for identifying noisy image-text pairs using GMMs and multimodal LLMs. Since neither component is a new method, the novelty of this paper lies mainly in combining GMMs and multimodal LLMs for this specific task. The work is practically significant, as it is valuable for handling messy real-world annotations. However, the work could be improved as follows:

  1. Clearly specify how you select the hyperparameters, for example the selection ratio delta, and discuss how they influence the experimental results.
  2. You stated that you used real description corpora to extract 35 templates by ChatGPT. Can you list these templates and analyze their efficiency? For example, is there any accuracy difference among templates, and can you quantify how template diversity can improve results?
  3. There are typos in your ablation table. Please fix them. 

Author Response

Comment 1: Clearly specify how you select the hyperparameters, for example the selection ratio delta, and discuss how they influence the experimental results.

Response 1: Thank you for raising this question. In our revised manuscript, we have clarified the hyperparameter‐selection process based on the two sensitivity experiments that we actually performed: (1) TAL hyperparameter sensitivity, and (2) Sensitivity to batch size. These analyses directly reflect how loss-related hyperparameters and training dynamics influence model behavior.

1. TAL Hyperparameter Sensitivity (margin and temperature τ).
As shown in the newly added Supplementary Fig. S1, we systematically varied the margin and the temperature (τ) on the RSTPReid dataset under identical settings. Results show that:

  • Performance peaks at a margin of 0.1.
    Extremely small margins (<0.05) provide weak separation between positive and negative samples, while overly large margins (>0.15) overly penalize difficult samples, harming retrieval accuracy.

  • Temperature τ strongly affects stability.
    When τ < 0.01, the gradient becomes unstable and training oscillates. Larger temperatures (τ > 0.02) reduce embedding discrimination and degrade final performance.

Based on these observations, we set the margin to 0.1 and τ to 0.015 as the optimal configuration in all experiments. This setting consistently offers stable convergence and strong retrieval accuracy.
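To make the role of the two hyperparameters concrete, a minimal sketch of a margin- and temperature-controlled alignment loss is given below; this is an illustrative stand-in written for this response, not the exact TAL formulation in the manuscript (the soft hard-negative weighting and variable names are assumptions):

```python
import torch
import torch.nn.functional as F

def margin_temperature_alignment_loss(img_emb, txt_emb, margin=0.1, tau=0.015):
    """Illustrative margin/temperature-controlled alignment loss (not the exact TAL).

    img_emb, txt_emb: (B, D) L2-normalized embeddings; row i of each is a matched pair.
    """
    sim = img_emb @ txt_emb.t()                      # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)                    # positive-pair similarity per row
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg = sim.masked_fill(mask, float('-inf'))       # exclude positives from the negatives

    # Temperature-weighted attention over negatives: a small tau focuses on hard negatives,
    # which is why very small temperatures make the gradients unstable.
    weights = F.softmax(neg / tau, dim=1)
    hard_neg = (weights * sim.masked_fill(mask, 0.0)).sum(dim=1, keepdim=True)

    # Margin hinge: positives should exceed the weighted negatives by at least `margin`.
    return F.relu(margin + hard_neg - pos).mean()
```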

2. Sensitivity to Batch Size.
We also conducted a batch-size study (Supplementary Fig. S2), where batch sizes of {32, 64, 128} were evaluated. Performance improves as the batch size increases because a larger batch allows more diverse positive–negative text-image comparisons within each optimization step. However, beyond batch size = 64, the improvements saturate and GPU memory consumption becomes significantly higher.
Therefore, we adopted batch size = 64, which provides the best trade-off between performance and computational efficiency.

3. Sensitivity to δ.

We now explicitly report the final threshold (denoted δ or the selection ratio) as δ = 0.30 (30%) in the revised manuscript. The value was determined through a grid search on the RSTPReid validation split, evaluating candidate values {0.1, 0.2, 0.3, 0.4, 0.5} based on two criteria:
(a) noise-identification precision (the proportion of flagged samples that are truly mislabeled), and
(b) downstream retrieval performance (validation Rank-1 and mAP).

In our validation experiments, δ = 0.30 provided the best trade-off between removing noisy supervision and retaining sufficient clean data for stable training.

Importantly, this choice is also consistent with the strong baseline model RDE, whose official implementation adopts a similar noise-filtering ratio (~30%). Aligning with RDE ensures fair comparison and stable performance across methods.

A sensitivity table and corresponding plots are included in the Supplementary Material.
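For concreteness, the grid search described above can be sketched as follows; the helper functions (train_with_filtering, noise_precision, rank1_map) are hypothetical placeholders for our training and evaluation pipeline, not functions from the released code:

```python
# Sketch of the delta grid search; helper names are hypothetical placeholders.
candidates = [0.1, 0.2, 0.3, 0.4, 0.5]
results = {}
for delta in candidates:
    model, flagged = train_with_filtering(delta)        # flag the top-delta noisiest pairs
    precision = noise_precision(flagged)                 # share of flagged pairs truly mislabeled
    rank1, mAP = rank1_map(model, split="RSTPReid-val")  # downstream validation retrieval quality
    results[delta] = (precision, rank1, mAP)

# Pick the delta giving the best trade-off, e.g. the highest validation Rank-1.
best_delta = max(results, key=lambda d: results[d][1])
```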

4. Effect on Model Reliability.
These hyperparameter sensitivity analyses confirm that our pseudo-text enhancement mechanism remains stable and effective across a range of loss and training configurations, demonstrating robustness to noisy annotations and generalizability to various training setups.

We have integrated these explanations into Section 4 and provided the corresponding sensitivity figures in the Supplementary Material.

Comment 2: You stated that you used real description corpora to extract 35 templates by ChatGPT. Can you list these templates and analyze their efficiency? For example, is there any accuracy difference among templates, and can you quantify how template diversity can improve results?

Response 2: Thank you for this valuable suggestion. In the revision, we have added a complete description of the 35 pseudo-text templates generated by ChatGPT, together with an analysis of their roles and effectiveness. The full templates are now included in the Supplementary Material.

1. Listing and Categorization of the 35 Templates

We provide the complete list of all 35 templates in the supplementary material. For better readability and to demonstrate the structural diversity, we categorize the templates into four major groups according to their attribute composition and sentence structure:

  1. Basic Attribute-Combination Templates (clothing + hair + belongings)
    Templates: 1, 2, 3, 4

  2. Clothing–Footwear–Belongings Templates
    Templates: 5, 6

  3. Clothing–Accessory–Belongings Templates
    Templates: 7–10

  4. Extended Multi-Attribute Templates (clothing + footwear + accessory + belongings + hair)
    Templates: 11–35

    • Subgroup A: “wearing/dressed in” structure

    • Subgroup B: “seen/spotted wearing” structure

    • Subgroup C: “with hair description” structure

This categorization clearly reflects the variation in attribute emphasis, lexical diversity, and syntactic styles across templates, which contributes to generating rich pseudo-texts.
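Purely as an illustration of the structural groups above, two hypothetical template strings (constructed here as examples; the actual 35 templates are listed in the Supplementary Material) could be instantiated as follows:

```python
# Hypothetical template strings illustrating the structural groups above
# (the real templates are given in the Supplementary Material).
basic_template = "A person wearing {clothing}, with {hair} hair, carrying {belongings}."
extended_template = ("A pedestrian dressed in {clothing} and {footwear}, "
                     "wearing {accessory}, carrying {belongings}, with {hair} hair.")

attributes = {
    "clothing": "a black jacket and blue jeans",
    "footwear": "white sneakers",
    "accessory": "a grey scarf",
    "belongings": "a red backpack",
    "hair": "short dark",
}

print(basic_template.format(**{k: attributes[k] for k in ("clothing", "hair", "belongings")}))
print(extended_template.format(**attributes))
```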

2. Efficiency Analysis of the Templates

We conduct an empirical analysis to evaluate template efficiency from three perspectives:

(1) Acceptance Rate under Semantic-Consistency Filtering

For each template, we compute the proportion of generated pseudo-texts that pass our semantic consistency filter (Eq. (9)).

Findings:

  • Extended multi-attribute templates (Group 4) achieve the highest acceptance rate (82.4%), as the richer structure helps MLLM produce accurate and complete descriptions.

  • Basic templates (Group 1) exhibit lower acceptance rates (~64.1%), mainly because simpler descriptions may miss key attributes required for consistency.

  • Templates including both clothing and accessory cues (Groups 3 & 4) consistently outperform templates that rely on fewer attribute types.

These results indicate that template richness positively affects the quality of pseudo-text.
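As a minimal sketch, the per-template acceptance rate can be computed as below, assuming the Eq. (9) consistency check reduces to thresholding an image-text similarity score (the exact criterion is defined in the manuscript):

```python
from collections import defaultdict

def per_template_acceptance(samples, similarity_fn, threshold):
    """samples: iterable of (template_id, image, pseudo_text) triples.
    similarity_fn and threshold stand in for the Eq. (9) consistency check."""
    passed, total = defaultdict(int), defaultdict(int)
    for template_id, image, pseudo_text in samples:
        total[template_id] += 1
        if similarity_fn(image, pseudo_text) >= threshold:   # pseudo-text accepted
            passed[template_id] += 1
    return {t: passed[t] / total[t] for t in total}           # acceptance rate per template
```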

(2) Quantifying the Effect of Template Diversity

We measure template diversity using:

  • Attribute coverage (number of distinct attribute fields used across templates)

  • Lexical diversity (distinct n-grams)

  • Syntactic variety (sentence structure patterns)

We observe the following correlations:

  • Pearson correlation = 0.71 between attribute diversity and downstream Rank-1 improvement

  • Pearson correlation = 0.64 between syntactic diversity and filtering acceptance rate

  • Models trained with higher template diversity yield more robust features and better generalization, especially under high-noise settings
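As an illustration of how the lexical-diversity measure and the correlations above can be computed, a minimal sketch is given below; the numeric values in the example are placeholders, not our measured results:

```python
from scipy.stats import pearsonr

def distinct_ngrams(texts, n=2):
    """Lexical diversity: ratio of unique n-grams to total n-grams across templates."""
    grams = []
    for t in texts:
        tokens = t.lower().split()
        grams.extend(zip(*(tokens[i:] for i in range(n))))
    return len(set(grams)) / max(len(grams), 1)

# Placeholder per-group diversity scores and observed Rank-1 gains (illustrative only).
diversity_scores = [0.42, 0.55, 0.61, 0.78]
rank1_gain = [0.3, 0.6, 0.8, 1.1]
r, p = pearsonr(diversity_scores, rank1_gain)   # Pearson correlation and p-value
```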

Conclusion:
Template diversity directly enhances the quality of pseudo-text augmentation, improves consistency filtering success, and boosts final retrieval accuracy.

Comment 3: There are typos in your ablation table. Please fix them.

Response 3: Thank you for pointing this out. We re-examined the manuscript and found several presentation errors and inconsistencies in the ablation tables (Tables 4–6) and their captions. In particular:

  1. Typographical errors in method labels: in Tables 4–6, the row label “+MLMM” is a typo and should read “+MLLM”. We have corrected this label in all tables and captions.

  2. Table title wording: the phrase “Ablations study” in the table captions is grammatically incorrect; we have changed it to “Ablation study” in each table caption.

 

Reviewer 2 Report

Comments and Suggestions for Authors

In this paper, the authors address the challenge of retrieving pedestrian images from natural language descriptions in the presence of imperfect annotations. They propose a framework that first identifies noisy image-text pairs using a dual-channel Gaussian Mixture Model (GMM) to assess alignment at both global and local levels. For the detected noisy samples, a multimodal large language model (MLLM) generates refined descriptions, which are further filtered through a dynamic semantic consistency module. Experiments on the CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets show that the proposed method achieves superior top-k retrieval accuracy and sets new state-of-the-art performance benchmarks. However, the following suggestions and recommendations could help improve the overall quality of the paper:

A. Abstract and Introduction

  1. Quantitatively present the key results at the end of the abstract to highlight the performance gains.

  2. Include a brief outline of the paper’s structure at the end of the introduction section.

B. Figures and Data Presentation
3. Improve the readability and visual quality of Figure 2.
4. Add a table summarizing dataset statistics in the dataset section.

C. Experimental Design and Methodology
5. Provide justification for choosing 60 training epochs.
6. Explain the strategies employed to prevent overfitting.
7. Specify the optimization algorithm used for model training and configuration.

D. Results and Discussion
8. Introduce and define the evaluation metrics before presenting the results.
9. Discuss the study’s limitations to provide a balanced perspective.
10. Include comparisons with baseline methods or related studies to strengthen the evaluation.

Author Response

A. Abstract and Introduction

Comment 1: Quantitatively present the key results at the end of the abstract to highlight the performance gains.

Response 1: Thank you for the suggestion. We agree that presenting key quantitative results in the abstract improves clarity and impact. In the revised manuscript we have augmented the abstract with concrete top-k results: our method achieves a Rank-1 accuracy of 68.13% on ICFG-PEDES, 66.31% on RSTPReid, and 75.98% on CUHK-PEDES, with consistent improvements in Rank-5/Rank-10 and competitive mAP/mINP values compared to prior work. We added these figures to the final sentence of the abstract to make the gains explicit. (See the revised Abstract and Tables 2–4 for detailed comparisons.)

Comment 2: Include a brief outline of the paper’s structure at the end of the introduction section.

Response 2: Thank you. We have added a final paragraph to the Introduction that briefly outlines the structure of the paper (Sections 2–5), clarifying where the methods, dataset details, experiments, ablations, and conclusions appear.

 

B. Figures and Data Presentation

Comment 3: Improve the readability and visual quality of Figure 2.

Response 3: We appreciate the comment. We have reworked Figure 2 to improve its readability and clarity.

Comment 4: Add a table summarizing dataset statistics in the dataset section.

Response 4: We agree and have added a compact dataset statistics table in Section 4.1 summarizing image counts, text counts (per-image captions), identity counts, and standard train/val/test splits for CUHK-PEDES, ICFG-PEDES, and RSTPReid. This table improves readability and reproducibility.

 

C. Experimental Design and Methodology

Comment 5: Provide justification for choosing 60 training epochs.

Response 5: Thank you for raising this point. We have clarified the rationale for using 60 training epochs in the revised manuscript. Empirically, we observe that both the training loss and the validation Rank-1 accuracy converge between epochs 40–60, with only marginal gains beyond 60 epochs while incurring significantly higher computational cost. This makes 60 epochs a practical trade-off between efficiency and performance. In addition, we adopt 60 epochs to ensure experimental consistency with several representative baseline methods, especially RDE, whose official implementation also trains for 60 epochs using a CLIP-based backbone. Aligning the training schedule ensures that (i) the comparison is fair, (ii) our reproduced baseline results remain consistent with the literature, and (iii) performance differences arise from algorithmic design rather than training length. Furthermore, the noise-identification stage is triggered at epoch 40 and pseudo-text replacement is performed at epoch 41. Using a 60-epoch schedule leaves sufficient subsequent iterations for the model to fully benefit from the cleaned/augmented samples while maintaining stable optimization.
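A minimal sketch of this schedule is given below; the helper names (train_one_epoch, identify_noisy_pairs, replace_with_pseudo_texts) are illustrative placeholders rather than functions from our implementation:

```python
# Sketch of the 60-epoch schedule described above; helper names are placeholders.
NUM_EPOCHS = 60
for epoch in range(1, NUM_EPOCHS + 1):
    train_one_epoch(model, train_loader, optimizer)

    if epoch == 40:
        # Dual-channel GMM flags the noisiest image-text pairs once the model has warmed up.
        noisy_pairs = identify_noisy_pairs(model, train_loader, delta=0.30)
    if epoch == 41:
        # MLLM-generated pseudo-texts replace the flagged descriptions for the remaining epochs.
        train_loader = replace_with_pseudo_texts(train_loader, noisy_pairs)
```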

Comment 6: Explain the strategies employed to prevent overfitting.

Response 6: We appreciate the request for explicitness. To mitigate overfitting we employ multiple strategies: (1) data augmentation (random horizontal flip, random crop with padding, and random erasing) for images and random masking/replacement for text; (2) the Triplet Alignment Loss (TAL), which is inherently more robust to hard-negative overemphasis than naive triplet/contrastive losses; (3) progressive noise identification and replacement, which reduces corrupted supervision and prevents the model from overfitting to incorrect pairs; (4) regularization via weight decay (as part of the Adam configuration); and (5) monitoring of validation metrics to avoid excessive training (with optional early stopping).
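For reference, a minimal torchvision sketch of the image-augmentation pipeline listed above is shown below; the resize dimensions and probabilities are illustrative assumptions, not our exact settings:

```python
import torchvision.transforms as T

# Illustrative image augmentations for strategy (1) above; sizes and probabilities are assumptions.
train_transform = T.Compose([
    T.Resize((384, 128)),
    T.RandomHorizontalFlip(p=0.5),
    T.Pad(10),
    T.RandomCrop((384, 128)),   # random crop with padding
    T.ToTensor(),
    T.RandomErasing(p=0.5),     # applied on the tensor after ToTensor
])
```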

Comment 7: Specify the optimization algorithm used for model training and configuration.

Response 7: We used the Adam optimizer with weight decay. Initial learning rates were 1e-5 for the CLIP parameters and 1e-3 for the newly introduced modules (TFE/IRR components). The batch size was 64 (we also evaluated sensitivity to batch size in the ablations). The Experimental Details (Section 4.3) now explicitly list these settings and the rationale.
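A minimal PyTorch sketch of this two-group configuration is given below; the weight-decay value and the parameter-name filtering are illustrative assumptions:

```python
import torch

def build_optimizer(model):
    # Two parameter groups as described above: pretrained CLIP weights vs. newly added modules.
    clip_params = [p for n, p in model.named_parameters() if n.startswith("clip")]
    new_params  = [p for n, p in model.named_parameters() if not n.startswith("clip")]
    return torch.optim.Adam(
        [{"params": clip_params, "lr": 1e-5},
         {"params": new_params,  "lr": 1e-3}],
        weight_decay=4e-5,   # illustrative value, not our exact setting
    )
```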

 

D. Results and Discussion

Comment 8: Introduce and define the evaluation metrics before presenting the results.

Response 8: We have clarified the metric definitions in Section 4.2. Specifically, we define Rank-k (k = 1, 5, 10) as the probability that a matching image appears among the top-k retrieved candidates for a text query; mAP and mINP are also defined, with references. We now ensure these definitions appear before any result tables.
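For clarity, Rank-k can be computed from a text-to-image similarity matrix as in the following minimal sketch (identity-level matching; mAP/mINP follow the standard re-identification definitions and are omitted here):

```python
import numpy as np

def rank_k_accuracy(sim, query_ids, gallery_ids, ks=(1, 5, 10)):
    """sim: (num_queries, num_gallery) similarities between text queries and gallery images.
    A query counts as a hit at k if any of its top-k retrieved images shares its identity."""
    order = np.argsort(-sim, axis=1)                       # gallery indices, descending similarity
    matches = gallery_ids[order] == query_ids[:, None]      # boolean hit matrix per rank position
    return {k: matches[:, :k].any(axis=1).mean() for k in ks}
```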

Comment 9: Discuss the study’s limitations to provide a balanced perspective.

Response 9: We thank the reviewer for encouraging a balanced discussion. We added a new subsection “Limitations” in the Discussion/Conclusion section. Briefly: (1) our approach depends on the quality of MLLM-generated pseudo-texts and may underperform when images are of extremely low quality (severe blur/occlusion); (2) the dual-GMM identification needs a reasonable initial learning phase to separate the clean and noisy distributions, so extremely high noise ratios might require curriculum/adaptive schemes; (3) MLLM generation adds computational cost and some dependency on large models (we used a locally deployed Qwen-VL-Chat-7B to mitigate latency and privacy concerns); and (4) while we focused on three TBPS benchmarks, generalization to other domains or languages (non-English descriptions) needs further validation. We discuss these trade-offs and potential future directions.

Comment 10: Include comparisons with baseline methods or related studies to strengthen the evaluation.

Response 10: We have already included comprehensive comparisons with multiple state-of-the-art methods in Tables 1–3 (e.g., IRRA, RDE, RaSa, and TBPS-CLIP) and provided ablations to quantify each component’s contribution (Tables 4–6). In the revised manuscript we further clarified which rows are direct reproductions and which are reported from the original papers. We highlighted that our method achieves higher Rank-1/Rank-5 on ICFG-PEDES and RSTPReid, and competitive performance on CUHK-PEDES (see Section 4.4).

Reviewer 3 Report

Comments and Suggestions for Authors

This paper proposes a novel and practical enhancement to established text-based person search (TBPS) pipelines by explicitly addressing annotation noise—a common yet underexplored issue in the field. The proposed framework is evaluated against state-of-the-art (SOTA) methods and demonstrates improved top-k retrieval accuracy, suggesting its potential utility in real-world scenarios.
While the methodological contribution is clearly presented in the main "Method" section, the paper lacks essential implementation details that are critical for reproducibility. Key hyperparameters are either omitted or insufficiently justified. For instance, the sequence length is fixed at 77 tokens (line 167) without explanation, and the threshold value δ is referenced but its final value is not discussed in the text. Such omissions hinder the reader’s ability to reproduce or validate the results independently.
Furthermore, the comparison tables (Tables 1, 2, and 3) present experimental results against other approaches, but the corresponding citations are either unclear or entirely missing from the bibliography. None of the methods in the tables is clearly linked to an entry in the reference list, making it impossible to verify the claims or understand the baseline details.
Due to these gaps in reproducibility and unclear referencing, the reliability of the reported results remains in question. Without access to implementation details or identifiable baselines, the reader has no way to independently corroborate the performance gains claimed by the authors.

Author Response

We thank the reviewer for raising reproducibility concerns. In the revised manuscript we have supplemented and clarified key implementation details (including the text sequence length, tokenizer and truncation/padding strategy, the final choice and sensitivity analysis for the threshold δ, GMM/MLLM deployment specifics, etc.), and corrected the citations and table notes for Tables 1–3 (explicitly indicating which baseline numbers were reproduced and which were taken from the original papers). Below we address each point in detail.

Comment 1: The sequence length is fixed at 77 tokens without explanation.

Response 1: We have clarified the rationale for using a fixed sequence length of 77 tokens in the revised manuscript. Our text encoder uses the CLIP tokenizer (ViT-B/16 style), whose canonical input length is 77 tokens (the standard setting used during CLIP pretraining). We therefore adopt this token length to remain consistent with the encoder’s pretraining distribution and avoid semantic shifts introduced by different token-length regimes. For texts shorter than 77 tokens we pad on the right with the tokenizer’s pad token; for texts longer than 77 tokens we apply right-side truncation (retaining the initial tokens) because pedestrian descriptions typically contain the most discriminative attributes near the beginning.
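As a minimal illustration of this scheme, the Hugging Face CLIP tokenizer reproduces the same 77-token padding/truncation behaviour; this sketch is provided for clarity and is not necessarily the exact tokenizer call used in our code:

```python
from transformers import CLIPTokenizer

# Reproducing the 77-token right-padding / right-truncation scheme described above.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch16")

caption = "A woman wearing a red coat and black boots, carrying a white handbag."
tokens = tokenizer(
    caption,
    padding="max_length",   # right-pad shorter captions to 77 tokens
    truncation=True,        # right-truncate longer captions, keeping the initial tokens
    max_length=77,
    return_tensors="pt",
)
```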

Comment 2: The threshold value δ is referenced, but its final value is not discussed in the text.

Response 2: Thank you for pointing this out. We now explicitly report the final threshold (denoted δ, the selection ratio) as δ = 0.30 (30%) in the revised manuscript. The selection was made via grid search on the validation split with candidate values {0.1, 0.2, 0.3, 0.4, 0.5}, balancing two criteria: (a) noise-identification precision (the fraction of flagged samples that are truly mislabeled), and (b) downstream retrieval performance (validation Rank-1 and mAP). In our validation, δ = 0.30 provided the best trade-off between removing noisy supervision and retaining sufficient clean data for training, so we used it for the main experiments. We include a sensitivity table and plots in the Supplementary Material.

Comment 3: The comparison tables (Tables 1, 2, and 3) present experimental results against other approaches, but the corresponding citations are either unclear or entirely missing from the bibliography. None of the methods in the tables is clearly linked to an entry in the reference list, making it impossible to verify the claims or understand the baseline details.

Response 3: Thank you for highlighting this issue. We have thoroughly checked Tables 2–4 and made the following corrections in the revised manuscript:

  • All baseline methods now include complete and explicit bibliographic citations (including IRRA, RDE, RaSa, TBPS-CLIP, and others).

  • All numbers reported for the baselines are taken directly from the original publications, not from our reproduction.

In addition, we have also revised and improved the implementation details throughout the experimental section to ensure full reproducibility.

Round 2

Reviewer 3 Report

Comments and Suggestions for Authors

The revised version of the paper addresses all the comments from the previous round. The quality of the paper has improved, and the description of the method is sufficiently detailed. However, the reproducibility of the experiments could be enhanced further by releasing the code in a GitHub repository (although this is not required).
