Article
Peer-Review Record

HistoMoCo: Momentum Contrastive Learning Pre-Training on Unlabeled Histopathological Images for Oral Squamous Cell Carcinoma Detection

Electronics 2025, 14(7), 1252; https://doi.org/10.3390/electronics14071252
by Weibin Liao 1,2,†, Yifan He 2,†, Bowen Jiang 2,†, Junfeng Zhao 1,2, Min Gao 3 and Xiaoyun Zhang 3,*
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 5 February 2025 / Revised: 12 March 2025 / Accepted: 13 March 2025 / Published: 22 March 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This manuscript investigates the advantages of using the MoCo pretraining approach in the context of oral squamous cell carcinoma (OSCC) classification. Overall, the methodology makes sense, and the experimental design is relatively reasonable, supporting the proposed hypothesis. The main drawback of the manuscript lies in an insufficient review of the current state of the art and the lack of comparisons with appropriate baselines.

From what I have observed, a large number of foundational model studies for pathological images (e.g., UNI, Virchow, CONCH, MUSK, etc.) have recently been published. These studies widely employ methods such as DINO, DINOv2, and iBOT for pretraining on unlabeled data, which aligns with the basic framework of MoCo. However, this paper does not mention these important developments and does not conduct experimental comparisons with them. Considering the massive training data used by the aforementioned foundational models, directly comparing them may be somewhat unfair. Nevertheless, such comparisons can help us better assess the effectiveness of MoCo. I suggest adding at least one of the following experiments:

  1. Use models such as UNI as feature extractors and report results (either via linear probing or fine-tuning) on the OSCC dataset; a minimal linear-probing sketch is given after this list.
  2. Include DINO, DINOv2, and iBOT in the control group alongside MoCo and compare them under the same experimental settings as MoCo.
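
To make suggestion 1 concrete, the comparison could be run as linear probing: the feature extractor is frozen and only a linear classifier is trained on the OSCC data. The sketch below is a generic PyTorch illustration of that setup; a torchvision ResNet-50 stands in for the frozen backbone (it is not the UNI API), and the data loader is a placeholder for the OSCC dataset.

```python
import torch
import torch.nn as nn
from torchvision import models

# Generic linear-probing sketch: freeze the feature extractor, train only a
# linear head. A torchvision ResNet-50 stands in for whichever pathology
# foundation model is being evaluated; replace it with the actual encoder.
backbone = models.resnet50(weights="IMAGENET1K_V2")
backbone.fc = nn.Identity()              # expose the 2048-dim pooled features
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False              # freeze the extractor

linear_head = nn.Linear(2048, 2)         # e.g. binary OSCC vs. normal head
optimizer = torch.optim.SGD(linear_head.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    # `images` and `labels` would come from an OSCC data loader (placeholder).
    with torch.no_grad():
        feats = backbone(images)         # frozen features
    loss = criterion(linear_head(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Fine-tuning would instead unfreeze the backbone and optimize all parameters, typically with a smaller learning rate.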

Other issues include:

  1. The dataset setup in the paper appears somewhat confusing. Section 4.1 states that CRC and EBHI are used for contrastive pretraining, while Oral Histopathology and NDB-UFES are used for testing, which matches Table 1. However, Figure 3 does not clearly explain the role of EBHI. Additionally, CRC is also used as a test set in the experiments (see Table 2). The paper should clarify its experimental objectives and adopt a coherent experimental design (i.e., which data is used for pretraining, and whether it is drawn from the downstream task itself or from unrelated external pathological datasets) and describe it accurately.

  2. On page 2, line 52, the paper states: “As a result, there is an increasing need for pre-training strategies tailored to the unique characteristics of histopathological images.” However, MoCo has already been widely used on ImageNet and other natural image datasets. What, specifically, are the “unique characteristics” in question, and how did the paper “tailor” MoCo to account for them? At a minimum, the paper should present the empirical hyperparameters used for this task (including specific momentum and queue length settings, as well as augmentation configurations). A brief sketch of how the momentum and queue hyperparameters enter MoCo is given after this list.

  3. I noticed that the image size used for pretraining (224 px) does not match the image sizes used in downstream tasks (512×512 or 2048×1536). The paper should clarify the preprocessing or global pooling methods employed to ensure the reproducibility of the experiments. An illustrative preprocessing pipeline is sketched after this list.

  4. In Table 3, some experiments yield extreme performance values (Precision, Sensitivity, Specificity, F1) of 0 or 1. Could the authors please explain the specific circumstances leading to such results?

  5. On page 3, line 70, the paper states: “However, only a limited number of studies have focused on self-supervised learning for histopathological images [22,23], particularly.” As mentioned above, this statement is not accurate.

  6. On page 7, line 233, there is a typo: “we” should be capitalized to “We.”
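
As a reference for point 2 above, the two hyperparameters in question are the momentum coefficient of the key encoder and the length of the negative-sample queue. The sketch below shows how they enter MoCo's update rules, using the standard MoCo v2 defaults (m = 0.999, K = 65,536); these are illustrative values, not necessarily the settings adopted in the manuscript.

```python
import torch

# Minimal sketch of MoCo's momentum update and negative queue.
# Values shown are the MoCo v2 defaults, not the manuscript's settings.
m = 0.999   # momentum coefficient for the key-encoder update
K = 65536   # length of the negative-sample queue

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=m):
    # The key encoder tracks the query encoder as an exponential moving average:
    # theta_k <- m * theta_k + (1 - m) * theta_q
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)

@torch.no_grad()
def dequeue_and_enqueue(queue, queue_ptr, keys):
    # Replace the oldest keys in the fixed-size queue (shape: feature_dim x K)
    # with the newest batch of key features.
    batch_size = keys.shape[0]
    ptr = int(queue_ptr)
    queue[:, ptr:ptr + batch_size] = keys.T
    queue_ptr[0] = (ptr + batch_size) % K
```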
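
For point 3, one common way to reconcile the resolution mismatch is to resize or crop the downstream images to the pre-training resolution before they reach the backbone. The pipeline below is a plain torchvision illustration of that option, not the authors' actual preprocessing, which is exactly what the manuscript should spell out (resizing, tiling, or pooling over larger inputs).

```python
from torchvision import transforms

# Illustrative only: map 512x512 or 2048x1536 downstream images onto the
# 224 px resolution used for pre-training. The manuscript should state
# which strategy (resize, center-crop, tiling, or global pooling) it uses.
downstream_transform = transforms.Compose([
    transforms.Resize(256),       # shorter side to 256 px
    transforms.CenterCrop(224),   # match the pre-training input size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics, if reused
                         std=[0.229, 0.224, 0.225]),
])
```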

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript by Liao et al. discusses the use of momentum contrastive learning for histopathology, specifically for the detection of oral squamous cell carcinoma. Emphasis is placed on the pre-training stage, where the proposed method is intended to account for the differences between oral histopathology images and the images on which common pre-trained models are based, thereby potentially improving the scalability of such techniques for real-world applications.

The article was well-written overall with a clear and easy-to-follow structure, some minor comments & questions:

  • The challenges in current OSCC diagnosis, and hence the motivation for (semi-)automated detection using deep learning, were not specifically explained. For example: why are there no existing large-scale datasets, what is the envisioned AI-assisted workflow, and how much time or resources could be saved?
  • No clear definition is given of contrast, homogeneity, energy, and correlation, or of what information they convey about the images (their standard definitions, assuming they are GLCM texture statistics, are given after this list).
  • What is the effect of resizing, given that the Oral Histopathology images need to be downsampled almost ten-fold for HistoMoCo, while single-cell resolution is necessary for clinical diagnosis?
  • The general lack of interpretation in the discussion leaves more to be desired; e.g., why has “end-to-end tuning performance approached a bottleneck”?
  • While the metric comparisons are thorough and detailed, it is noted that a recent study (doi: 10.3389/fmed.2023.1349336) reports seemingly superior performance. One of the key issues with pathology images is label noise or inconsistent labels, where contrastive learning could have a more profound impact; this could be a unique novelty of the paper and should be easily demonstrable with the existing datasets.
  • Some minor typos; see, e.g., REF [3].
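
Regarding the second bullet above: assuming contrast, homogeneity, energy, and correlation refer to the standard gray-level co-occurrence matrix (GLCM) texture statistics, which is their usual meaning in histopathology image analysis, their textbook definitions for a normalized co-occurrence matrix p(i, j) are:

```latex
% Standard GLCM (Haralick) texture statistics, assuming p(i,j) is the
% normalized co-occurrence matrix; exact definitions vary slightly between
% libraries (e.g. energy is sometimes reported as the square root of this sum).
\begin{align*}
\text{Contrast}    &= \sum_{i,j} (i-j)^2 \, p(i,j) \\
\text{Homogeneity} &= \sum_{i,j} \frac{p(i,j)}{1 + (i-j)^2} \\
\text{Energy}      &= \sum_{i,j} p(i,j)^2 \\
\text{Correlation} &= \sum_{i,j} \frac{(i - \mu_i)(j - \mu_j)\, p(i,j)}{\sigma_i \, \sigma_j}
\end{align*}
```

Roughly, contrast captures local intensity variation, homogeneity the concentration of the matrix near its diagonal, energy overall textural uniformity, and correlation the linear dependence between neighboring gray levels.
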
Comments on the Quality of English Language

Good overall, some minor typos

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

This paper presents HistoMoCo, a Momentum Contrastive Learning (MoCo)-based self-supervised pre-training approach for oral squamous cell carcinoma (OSCC) detection from histopathological images. The authors argue that ImageNet-based pre-training introduces domain divergence, limiting its effectiveness in histopathological image analysis. HistoMoCo is trained on colorectal cancer datasets and later fine-tuned for OSCC detection. The experimental results indicate that HistoMoCo outperforms ImageNet-based pre-training in OSCC detection, reducing the dependence on large annotated datasets. However, there are also some critical weaknesses that must be addressed:

  1. The proposed method is a direct application of MoCo to histopathological image pre-training, but many recent works (e.g., BYOL, DINO, SwAV, SimMIM) have already addressed the self-supervised learning problem in medical image analysis. The authors fail to demonstrate how HistoMoCo is fundamentally different or superior to these approaches.
  2. The pre-training dataset (NCT-CRC-HE-100K) consists of colorectal cancer histopathological images, while the target task is OSCC detection. However, the anatomical and morphological characteristics of OSCC and colorectal cancer are significantly different. The authors should provide an ablation study to clarify why colorectal cancer histopathology is suitable for pre-training OSCC detection models.
  3. The paper does not analyze the impact of different MoCo hyperparameters (e.g., queue size, momentum update rate, projection head size). In addition, the authors do not describe the specific data augmentation techniques used during pre-training. In my opinion, histopathological images require special augmentation techniques (one plausible recipe is sketched after this list).
  4. Many explanations could be more concise and clearly structured to avoid redundancy. For example, Sections 2 (Related Work) and 3 (Preliminaries) contain redundant discussions of contrastive learning and MoCo. In addition, some references are incomplete or inconsistently formatted, and there are issues with citation formatting, such as “[3,3]”, which appears to be an error.
  5. Some works about contrastive learning should be cited in this paper to make this submission more comprehensive, such as 10.1109/TNNLS.2023.3240195, 10.1609/aaai.v35i10.17047.
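
As an illustration of the “special augmentation techniques” mentioned in point 3, one recipe often used for H&E-stained patches combines geometric flips (tissue has no canonical orientation) with deliberately mild color jitter, so that stain appearance is perturbed without being destroyed. The pipeline below is one plausible configuration, not the augmentation actually used in the manuscript.

```python
from torchvision import transforms

# One plausible augmentation recipe for H&E-stained patches (illustrative only;
# the manuscript should report its own settings). Flips exploit the lack of a
# canonical tissue orientation; the color jitter is kept mild to respect stain
# appearance; blur mimics slight focus variation in slide scanning.
histo_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.5, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.2, 0.2, 0.1, 0.05)], p=0.8),
    transforms.RandomGrayscale(p=0.1),
    transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),
    transforms.ToTensor(),
])
```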

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors have made extensive revisions to the manuscript based on the reviewers' comments, which have essentially addressed all of my concerns. I have no further questions regarding the current research status and experimental design.

However, I would like to request that the authors verify whether the results of HistoMoCo's Linear Tuning are correctly recorded (the second row of Table 6, which is also the first row of Table 7).

Author Response

Comments 1: However, I would like to request that the authors verify whether the results of HistoMoCo's Linear Tuning are correctly recorded (the second row of Table 6, which is also the first row of Table 7).

Response 1: Thank you very much for your valuable feedback.

We have verified that our recorded results are correct.

The confusion arises because moco-m = 0.999 is the default parameter setting for HistoMoCo, as described in line 234 of the manuscript. Therefore, the first row of Table 7 represents the standard performance of HistoMoCo (which corresponds to the second row of Table 6). A similar situation occurs in the first row of Table 8 and the second row of Table 9.

To prevent any misunderstanding, we have provided further clarification in line 380 of the revised manuscript.

Once again, we sincerely appreciate your insightful comments, which are crucial for improving the quality of our manuscript.

Reviewer 3 Report

Comments and Suggestions for Authors

No more comments.

Comments on the Quality of English Language

Suitable

Author Response

We sincerely appreciate your valuable feedback, which is crucial for improving the quality of our manuscript.
