Peer-Review Record

A Hard Negatives Mining and Enhancing Method for Multi-Modal Contrastive Learning

Electronics 2025, 14(4), 767; https://doi.org/10.3390/electronics14040767

by Guangping Li *, Yanan Gao, Xianhui Huang and Bingo Wing-Kuen Ling

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Reviewer 3: Galina D. Momcheva

Submission received: 31 December 2024 / Revised: 5 February 2025 / Accepted: 14 February 2025 / Published: 16 February 2025

(This article belongs to the Special Issue Recent Advances in Computer Vision: Technologies and Applications, 2nd Edition)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The manuscript describes a hard negative mining and enhancement (HNME) module for multi-modal contrastive learning. It is suggested that HNME improves discrimination and mitigates overfitting in 3D open-world environments by leveraging visual-text relationships and weighting hard negatives, leading to enhanced model performance in zero-shot and few-shot tasks.

Some comments include:
- I believe the first sentence of the paper, defining zero-shot classification, is incorrect.
- A space should be added before the "[" of every reference.
- The concept of "anchor" should be explained at its first appearance, at line 47.
- I struggle to understand Fig. 3. The anchor is in the square of class 3, and the enhanced hard negative samples are the ones closest to that square. No issues there. There is also one positive sample in the same square. No problems there either. But two other samples in that same square are labeled as negative. I believe those should also be positive samples, since they belong to class 3, the same class as the anchor.
- "The results show that the models trained with HNME exhibit significant improvement in zero-shot classification." What statistical test was used to prove significance?
- Why are there no top-1 and top-3 results for CLIP2Point on Objaverse-LVIS (Table 1)?
- Why was CLIP2Point chosen for Few-shot Classification, given that other models presented better results in the Zero-shot Classification problem?
- The ablation study in Section 4.3.3 is interesting. However, why wasn't a similar study performed for MixCon3D, a more recent model?
- The word "significant" and its derivatives should only be used when differences have been statistically proven.
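As an illustration of the kind of evidence the comments on significance ask for, the following is a minimal sketch of an exact McNemar test, one common way to compare two classifiers evaluated on the same test samples. It is not taken from the manuscript or the authors' response; the function name `mcnemar_exact` and the per-sample correctness lists are hypothetical.

```python
from math import comb


def mcnemar_exact(baseline_correct, method_correct):
    """Two-sided exact McNemar test on paired per-sample correctness flags.

    Both arguments are equal-length sequences of booleans (hypothetical
    inputs): True where the model classified that test sample correctly.
    """
    # b: baseline right, new method wrong; c: baseline wrong, new method right.
    b = sum(1 for x, y in zip(baseline_correct, method_correct) if x and not y)
    c = sum(1 for x, y in zip(baseline_correct, method_correct) if not x and y)
    n = b + c
    if n == 0:
        # The two models never disagree, so there is no evidence of a difference.
        return 1.0
    k = min(b, c)
    # Under the null hypothesis, discordant pairs follow Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy data: 5 samples both models get right, 10 that only the new method gets right.
baseline = [True] * 5 + [False] * 10
with_module = [True] * 15
p_value = mcnemar_exact(baseline, with_module)
```

A small p-value here would support calling an accuracy difference "significant"; other paired tests, for example over repeated training seeds, would serve the same purpose.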

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The paper proposes a Hard Negatives Mining and Enhancing (HNME) module to improve contrastive learning frameworks. The method focuses on identifying and enhancing hard negative samples in multi-modal datasets, specifically targeting vision and text modalities. By reweighting the hard negatives, the paper claims to boost the discriminative power of the data and reduce overfitting, particularly in zero-shot and few-shot classification tasks. The effectiveness of the approach is demonstrated through experiments on 3D point cloud datasets like ModelNet40 and ScanObjectNN. In general, the paper is well-written, and the novelty is sufficient. However, several weaknesses remain, as follows:

1. The details of the HNME module, such as how hard negatives are selected and enhanced, are inadequately explained. For example, the exact computation of weights and the rationale behind the selected formulas (e.g., Equation 4) are not well-justified.

2. Key hyperparameters, such as $\delta$ and $\tau$, are introduced without explanation of how they were chosen or their impact on performance. The absence of a sensitivity analysis undermines the robustness of the method.

3. While some ablation experiments are provided, they are overly simplistic, comparing only HNME with a baseline and a basic hard negative (HN) strategy. More detailed analyses of the individual components of HNME are needed.

4. The related work section does not comprehensively review recent advances in hard negative mining. Some references about contrastive loss and advanced multi-modal frameworks are suggested to be cited (e.g., 10.1007/s11263-022-01731-4, 10.1109/TPAMI.2024.3511621).

These recommended references are directly related to the manuscript's topic:

10.1007/s11263-022-01731-4 addresses the same task as the manuscript.
10.1109/TPAMI.2024.3511621 focuses on multi-modal learning frameworks, similar to the manuscript's theme.
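To make the request for a sensitivity analysis concrete, the sketch below shows a generic contrastive loss with up-weighted hard negatives and a small grid sweep over two hyperparameters. It is not the manuscript's Equation 4: it assumes that $\tau$ is a softmax temperature and $\delta$ a similarity threshold above which a negative counts as hard, neither of which this report confirms, and `hard_weight`, `weighted_info_nce`, and `sensitivity_grid` are hypothetical names.

```python
import math


def weighted_info_nce(sim_pos, sim_negs, tau=0.07, delta=0.5, hard_weight=2.0):
    """InfoNCE-style loss for one anchor with up-weighted hard negatives.

    sim_pos     : cosine similarity between the anchor and its positive
    sim_negs    : cosine similarities between the anchor and its negatives
    tau         : temperature (assumed meaning of the reviewer's tau)
    delta       : negatives with similarity >= delta count as hard (assumed)
    hard_weight : hypothetical multiplier applied to hard negatives
    """
    pos = math.exp(sim_pos / tau)
    neg = sum(
        (hard_weight if s >= delta else 1.0) * math.exp(s / tau)
        for s in sim_negs
    )
    return -math.log(pos / (pos + neg))


def sensitivity_grid(sim_pos, sim_negs, taus, deltas):
    """Evaluate the loss for every (tau, delta) pair in the grid."""
    return {
        (t, d): weighted_info_nce(sim_pos, sim_negs, tau=t, delta=d)
        for t in taus
        for d in deltas
    }


# Toy similarities: one positive, one hard negative (0.8), one easy negative (0.1).
grid = sensitivity_grid(0.9, [0.8, 0.1], taus=[0.05, 0.07, 0.1], deltas=[0.3, 0.5, 0.95])
```

In an actual study the grid would report downstream accuracy rather than a single-anchor loss; the point is only that each hyperparameter is varied while the other is held fixed.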

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The ablation research is adequate for publication but might be greatly improved.

Some specific recommendations can be made for optimizing the title. Important keywords that reflect the article's contributions (e.g., Hard Negatives Mining and Enhancing (HNME) module, zero-shot, multimodal) could replace general words (e.g., simple, effective). Another specific comment concerns the style of the abstract, which is too descriptive. Finally, the second sentence of the Table 1 caption should be placed in a paragraph, not in the table title: "Table 1. Results of zero-shot classification. Our method improves the accuracy on ModelNet40, Objaverse-LVIS, and ScanObjectNN."

Comments for author File: Comments.pdf

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors have done a commendable job in improving the manuscript. However, I still have concerns regarding Figure 3. Specifically, in subplot (d), the two true negative samples within the square of class 3 raise some questions. Shouldn’t these be false negative samples, as observed in the other three subplots?

Author Response

Comments 1: The authors have done a commendable job in improving the manuscript. However, I still have concerns regarding Figure 3. Specifically, in subplot (d), the two true negative samples within the square of class 3 raise some questions. Shouldn’t these be false negative samples, as observed in the other three subplots?

Response 1: Thank you for reviewing this manuscript. As you pointed out, the two negative samples within class 3 in subplot (d) of Figure 3 are indeed false negative samples. When we revised Figure 3, we neglected to update the samples in (d). We have checked Figure 3 again to make sure that all the subplots and descriptions are correct. The corrected Figure 3 is now in the revised manuscript.
