A Cross-Modal Hash Retrieval Method with Fused Triples
Round 1
Reviewer 1 Report (New Reviewer)
1. The Introduction and references should be updated by including the following papers.
(a)DOI: 10.1109/TIP.2018.2821921
(b)DOI: http://dx.doi.org/10.1145/2939672.2939812.
(c) https://doi.org/10.3390/math10030430.
2. Section 3 should be updated by briefly describing the CCA-ITQ [11], SCM [7], CMFH [12], SePH [16], DCMH [9], DLFH [2], and TDH (DOI: 10.1109/TIP.2018.2821921) cross-modal retrieval methods/algorithms to improve the readability of the paper.
3. An algorithm based on the proposed method should be included in Section 4.
4. In Section 4.2, a comparison of the proposed method with the TDH cross-modal retrieval method (DOI: 10.1109/TIP.2018.2821921) should be included.
5. The authors should detail the advantages and disadvantages of the proposed method in the conclusion section.
The language of the paper should be improved. The authors should use a grammar-check tool to correct the grammatical errors and misprints.
Author Response
Thank you very much for your review of my paper. I have revised the paper as required, and the following are the answers to the questions.
Q: Introduction and references should be updated by including the following papers.
(a)DOI: 10.1109/TIP.2018.2821921
(b)DOI: http://dx.doi.org/10.1145/2939672.2939812.
(c) https://doi.org/10.3390/math10030430.
A: The introduction and references have been updated, and the three papers above have been added and cited as references [25], [26], and [27].
Q: Section 3 should be updated by briefly describing the CCA-ITQ [11], SCM [7], CMFH [12], SePH [16], DCMH [9], DLFH [2], and TDH (DOI: 10.1109/TIP.2018.2821921) cross-modal retrieval methods/algorithms to improve the readability of the paper.
A: Section 3 has been updated with a brief description of the cross-modal retrieval methods involved in the comparison: CMFH [19], CCA-ITQ [18], SCM [7], SePH [23], DVSH [26], DCMH [15], TDH [27], DLFH [2], and DMSFH [25].
Q: An algorithm based on the proposed method should be included in Section 4.
A: An algorithm description of the proposed method has been added to Section 4.
Q: In Section 4.2, a comparison of the proposed method with the TDH cross-modal retrieval method (DOI: 10.1109/TIP.2018.2821921) should be included.
A: A comparison between the proposed method and the TDH method has been added to the experimental section. In addition, since the previous comparison experiments only covered methods from 2019 and earlier, comparison experiments with the DMSFH method have also been added. Furthermore, a brief description of existing cross-modal retrieval methods has been added to Section 3, so the experimental section has moved from the original Section 4 to the current Section 5.
Q: The authors should detail the advantages and disadvantages of the proposed method in the conclusion section.
A: The conclusion section has been updated to present the advantages and disadvantages of the proposed method and to explain the reasons for these strengths and weaknesses.
Q: The language of the paper should be improved. The authors should use a grammar-check tool to correct the grammatical errors and misprints.
A: I apologize that my English writing caused difficulties in your review. I have now revised the grammar and expression throughout the paper.
Author Response File: Author Response.pdf
Reviewer 2 Report (New Reviewer)
The manuscript, titled ‘A Cross-Modal Hash Retrieval Method with Fused Triples’, proposes a novel cross-modal hashing method. The proposed framework is then validated on publicly available datasets against existing methods. Overall, the manuscript is well-structured. The methods, framework, and experimental datasets are explained comprehensively. However, some minor additions are required to make the manuscript ready for publication.
Firstly, authors should add more details regarding the limitations of the proposed method. The performance on large training datasets is a major issue in terms of practical applications. This requires a detailed discussion to be included in the manuscript. Moreover, comparison of supervised and unsupervised methods in terms of preprocessing and processing time may also be included.
Secondly, proof-reading of the whole manuscript is required; the abstract in particular needs to be improved significantly.
Author Response
Thank you very much for your review of my paper. I have revised it as required. The following are your questions and my answers.
The manuscript, titled ‘A Cross-Modal Hash Retrieval Method with Fused Triples’, proposes a novel cross-modal hashing method. The proposed framework is then validated on publicly available datasets against existing methods. Overall, the manuscript is well-structured. The methods, framework, and experimental datasets are explained comprehensively. However, some minor additions are required to make the manuscript ready for publication.
Q: Firstly, authors should add more details regarding the limitations of the proposed method. The performance on large training datasets is a major issue in terms of practical applications. This requires a detailed discussion to be included in the manuscript. Moreover, comparison of supervised and unsupervised methods in terms of preprocessing and processing time may also be included.
A: The limitations of the proposed method and its performance on large training datasets were indeed rarely mentioned in the previous manuscript; a detailed discussion has now been added to the experiments and conclusion sections. In addition, a comparison of training time between unsupervised and supervised cross-modal hashing methods has been added to the experimental section.
Q: Secondly, proof-reading of the whole manuscript is required; the abstract in particular needs to be improved significantly.
A: I apologize for the inconvenience caused to your review by my negligence. The entire paper has now been proofread and the abstract has been substantially revised; it now focuses on the central idea of the proposed method as well as its strengths and weaknesses.
Attached is my revised manuscript.
Thank you very much.
Author Response File: Author Response.pdf
Reviewer 3 Report (New Reviewer)
Congratulations on a well-written paper that presents a very modern topic and approach. Below I have added a few suggestions for improvement.
Line 137 "The hash learning includes hamming distance loss, intra-modal loss, cross-modal inter-modal loss, and quantization loss. Cross-modal triple loss can" - you say triple, but you add to the list 4 distance loss methods. Make sure to clarify this.
Line 155. The decision to choose "an 8-layer CNN model" has to be justified. Please state whether you tried other models, whether the performance differs, and how you decided that an 8-layer model is best.
Line 191. Can you provide more details about the role of the hyperparameters and the performance impact of the tuning scale factor?
I love the experiment section. Here, please try to make some statements regarding the machines (specs) used for testing and, if possible, describe or add details regarding the time required to run those configurations. Your solution looks like a nice improvement, but it would be good to see whether the improvement takes place within the same processing-time intervals, same energy, etc. If you have not made all those measurements yet, I recommend at least mentioning these aspects or your opinion on them.
In order to improve the reading, please revise the English language on the following lines: 8-9, 63-64, 160, 253.
Author Response
Thank you very much for your review of my paper. I have revised the paper as required, and the following are the answers to the questions.
Congratulations on a well-written paper that presents a very modern topic and approach. Below I have added a few suggestions for improvement.
Q: Line 137 "The hash learning includes hamming distance loss, intra-modal loss, cross-modal inter-modal loss, and quantization loss. Cross-modal triple loss can" - you say triple, but you add to the list 4 distance loss methods. Make sure to clarify this.
A: I am sorry that my English expression affected your understanding. The "triple" mentioned in the paper refers to the triplet loss function, which is the name of the main loss function used across modalities; the weights are determined by this loss function. The name comes from the triplet form of the data, not from there being three loss functions; the four terms listed are the components of the overall hash-learning objective.
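For reference, the standard margin-based triplet loss takes the form below; this is the generic textbook formulation and not necessarily the exact objective used in the paper (the distance d and margin α here are illustrative):

```latex
\mathcal{L}_{\mathrm{tri}} = \sum_{(a,\,p,\,n)} \max\bigl(0,\; d(a,p) - d(a,n) + \alpha\bigr)
```

Here a is an anchor sample, p is a semantically similar (positive) sample, n is a dissimilar (negative) sample, d(·,·) is a distance such as the Hamming distance between hash codes, and α is a margin.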
Q: Line 155. The decision to choose "an 8-layer CNN model" has to be justified. Please state whether you tried other models, whether the performance differs, and how you decided that an 8-layer model is best.
A: An 8-layer CNN was chosen because previous methods such as DCMH and DLFH all use this same network. In addition, the innovation of this paper lies in the triplet data selection method added in the data processing stage and in the loss function of the cross-modal hash retrieval algorithm. Therefore, in the feature extraction part, an existing network was adopted following the experience of previous researchers, and no changes were made to it.
Q: Line 191. Can you provide more details about the role of the hyperparameters and the performance impact of the tuning scale factor?
A: Hyperparameter optimization selects a set of hyperparameters that optimizes a measure of the learning algorithm's performance on the dataset. When training the model, the hyperparameters therefore have to be tuned to improve learning performance. In the experiments for this method, the candidate hyperparameter values were tested exhaustively by a traversal (grid search), and the values giving the best results were finally chosen.
Already added to section 4.3 of the paper.
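As an illustration of the traversal (grid search) procedure mentioned above, here is a minimal sketch; the hyperparameter names, value ranges, and the evaluate function are hypothetical placeholders, not the actual code or settings used for the paper:

```python
import itertools

# Hypothetical hyperparameter grid; the actual names and ranges differ
# (see Section 4.3 of the revised manuscript).
grid = {
    "margin": [0.5, 1.0, 1.5],
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "quantization_weight": [0.1, 0.5, 1.0],
}

def evaluate(params):
    """Placeholder: in practice this would train the model with `params`
    and return the validation mAP. Here it returns a dummy score so the
    sketch runs end to end."""
    return -abs(params["margin"] - 1.0) - abs(params["learning_rate"] - 1e-3)

best_params, best_score = None, float("-inf")
for values in itertools.product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    score = evaluate(params)  # e.g., mAP on a held-out validation split
    if score > best_score:
        best_params, best_score = params, score

print(best_params, best_score)
```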
Q: I love the experiment section. Here, please try to make some statements regarding the machines (specs) used for testing and, if possible, describe or add details regarding the time required to run those configurations. Your solution looks like a nice improvement, but it would be good to see whether the improvement takes place within the same processing-time intervals, same energy, etc. If you have not made all those measurements yet, I recommend at least mentioning these aspects or your opinion on them.
A: It is true that the specifications of the test machines were not described in the previous manuscript. Descriptions of the machine specifications have now been added to the experimental part in Section 5, and a comparison of unsupervised and supervised cross-modal hashing methods in terms of training time has been added to Section 5.3.
Q: In order to improve the reading, please revise the English language on the following lines: 8-9, 63-64, 160, 253.
A: I apologize that my English writing caused difficulties in your review. I have now revised the grammar and expression throughout the paper, focusing in particular on the lines you pointed out.
Attached is my revised manuscript.
Thank you very much.
Author Response File: Author Response.pdf
Reviewer 4 Report (New Reviewer)
This study proposes a cross-modal hash retrieval method based on fused triples. The fused triples method is effective compared to the existing cross-modal hashing methods. The problem statement and methodology explanation are clear and convincing. Some minor comments:
1. The manuscript needs careful proofreading to fix typos and grammatical errors.
2. The abstract needs to be revised to better summarize the findings and advantages of the proposed model compared to other methods. Currently, the abstract does not provide enough information.
Language needs to be refined.
Author Response
Thank you very much for your review of my paper. I have revised the paper as required, and the following are the answers to the questions.
This study proposes a cross-modal hash retrieval method based on fused triples. The fused triples method is effective compared to the existing cross-modal hashing methods. The problem statement and methodology explanation are clear and convincing. Some minor comments:
Q: The manuscript needs careful proofreading to fix typos and grammatical errors.
A: I apologize that my English writing caused difficulties in your review. I have now revised the grammar and expression throughout the paper.
Q: The abstract needs to be revised to better summarize the findings and advantages of the proposed model compared to other methods. Currently, the abstract does not provide enough information.
A: The abstract has been substantially revised and now focuses on the central idea of the proposed method as well as its advantages and disadvantages. It first introduces the benefits of cross-modal hash retrieval, describes the existing problems, presents the innovation, significance, advantages, and disadvantages of the proposed method, and finally summarizes the experimental results.
Attached is my revised manuscript.
Thank you very much.
Author Response File: Author Response.pdf
Reviewer 5 Report (New Reviewer)
(1) The motivation for adopting the multi-LSTM model in this work is unclear. This part needs more explanations and references.
(2) Certain notations are unclear. For example, in (3), is the multiplication over both i and j, and both from 1 to n? Is the notation correct?
Overall, the method is well explained. However, there are many grammatical errors that can be found in the manuscript.
Author Response
Thank you very much for your review of my paper. I have revised the paper as required, and the following are the answers to the questions.
Q: The motivation for adopting the multi-LSTM model in this work is unclear. This part needs more explanations and references.
A: Based on previous work, a conventional LSTM network is usually used for text feature extraction. I came across the multi-LSTM network while reading the literature, and during the feature extraction experiments on the text modality I compared the feature extraction performance of the LSTM and multi-LSTM networks, finally choosing the multi-LSTM network. The meaning of multi-LSTM and the rationale for this choice have therefore been added in Section 4.2 of the paper.
Q: Certain notations are unclear. For example, in (3), is the multiplication over both i and j, and both from 1 to n? Is the notation correct?
A: After checking, the notation in Equation (3) is correct, but there was a problem with its formatting, which has now been corrected. The other formulas in the paper have also been checked.
Q: Overall, the method is well explained. However, there are many grammatical errors that can be found in the manuscript.
A: I apologize that my English writing caused difficulties in your review. I have now revised the grammar and expression throughout the paper.
Attached is my revised manuscript.
Thank you very much.
Author Response File: Author Response.pdf
Reviewer 6 Report (New Reviewer)
This article proposes a novel cross-modal hashing method for the cross-modal retrieval problem. This is interesting, and the two parts of the proposed framework are reasonable. However, my biggest concern is that the contribution of this article is not sufficient for acceptance by the journal. The detailed comments follow:
1. In the Introduction, the description of the existing work, especially the shortcomings, is slightly inadequate.
2. In Section 2, the description of related work lacks rational organization, and it is suggested that the presentation be divided into different categories.
3. In Section 3, the feature extraction effort looks like existing networks stacked on top of each other.
4. In Section 3, the formatting of symbols and formulas seems confusing, it is recommended to reorganize the draft with a formula editor or latex.
5. In Section 4, the baseline methods are all prior to 2019; has there been any related work in the last three years?
Furthermore, its presentation and use of English should be improved. There are some errors in English writing and formula formats throughout the entire article.
Moderate editing of English language required
Author Response
Thank you very much for your review of my paper. I have revised the paper as required, and the following are the answers to the questions.
This article proposes a novel cross-modal hashing method for the cross-modal retrieval problem. This is interesting, and the two parts of the proposed framework are reasonable. However, my biggest concern is that the contribution of this article is not sufficient for acceptance by the journal. The detailed comments follow:
Q: In the Introduction, the description of the existing work, especially the shortcomings, is slightly inadequate.
A: The introduction has been supplemented with a description of existing cross-modal hashing work and its shortcomings. It now covers how deep cross-modal hashing methods learn hash codes, a comparison of deep cross-modal retrieval methods with earlier cross-modal retrieval methods, and shortcomings such as the long training time of deep cross-modal hash retrieval methods.
Q: In Section 2, the description of related work lacks rational organization, and it is suggested that the presentation be divided into different categories.
A: The related work in Section 2 was indeed poorly organized and has now been restructured. First, the meaning and methods of cross-modal retrieval are explained, and the cross-modal hashing methods are divided into unsupervised and supervised categories. Section 3 is then introduced, where the existing unsupervised and supervised cross-modal retrieval methods are briefly described.
Q: In Section 3, the feature extraction effort looks like existing networks stacked on top of each other.
A: The innovation of this paper lies in the triplet data selection method added in the data processing stage and in the loss function of the cross-modal hash retrieval algorithm. Therefore, in the feature extraction part, existing networks were adopted following the experience of previous researchers, and no changes were made to them.
Q: In Section 3, the formatting of symbols and formulas seems confusing, it is recommended to reorganize the draft with a formula editor or latex.
A: I am sorry for the formula formatting errors caused by my negligence; they have been corrected.
Q: In Section 4, the baseline methods are all prior to 2019; has there been any related work in the last three years?
A: The baseline methods were all from 2019 and earlier because the experiments in this paper mainly distinguish between supervised and unsupervised algorithms and did not compare against work from the past three years. To address this, I have recently added a set of experiments with the 2022 DMSFH method, and the results show that the proposed method is more effective.
Q: Furthermore, its presentation and use of English should be improved. There are some errors in English writing and formula formats throughout the entire article.
Moderate editing of English language required
A: I apologize that my English writing caused difficulties in your review. I have now revised the grammar and expression throughout the paper, and the formatting errors in the formulas have been corrected.
Attached is my revised manuscript.
Thank you very much.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report (New Reviewer)
The authors revised the paper according to my suggestions in report 1. I recommend the present version of the paper for publication in Applied Sciences.
Reviewer 6 Report (New Reviewer)
The authors have made the changes suggested in the last round of revisions, and the current version is of fair quality.
Minor editing of English language required
This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.
Round 1
Reviewer 1 Report
The subject of the manuscript is very interesting. The authors propose the Tri-CMH cross-modal hashing retrieval method. For this reason, the manuscript is relevant to the Applied Sciences audience. However, the manuscript is not well structured and is difficult to read and understand due to its poor linguistic quality. Furthermore, it has some flaws in the experimental part.
I would like to raise some points that should be revised to improve the quality of the manuscript:
- The abstract must be rewritten to indicate the manuscript's novelty and contributions.
- The manuscript requires extensive editing of the English language and style. It is difficult to read due to its poor linguistic quality.
- The introduction mentions "network", "loss", etc., without any explanation.
- It is unclear how the CNN of Figure 1 is trained, validated, and tested.
- There is no description of the Multi-LSTM in Figure 1.
- There is no explanation about the hyperparameters of Eq. (1) and how they are determined.
- The notation has many flaws and inconsistencies.
- The experimental protocol is not presented. It seems that authors train and test on the same data. If so, the results are biased.
Author Response
I would like to thank the reviewers for reviewing and commenting on the manuscript despite their hectic schedules, and my responses to the questions are provided below.
- The abstract must be rewritten to indicate the manuscript's novelty and contributions.
The revision has been made in the manuscript as required, as follows:
Cross-modal hashing is the primary method for cross-modal retrieval. Standard neural networks are usually used to extract the features of the image and text modalities in the dataset, and the extracted feature information is hashed so that the Hamming distance can be calculated. The idea is that the semantic similarity between an image and a text increases as the Hamming distance between the two modalities decreases. Existing cross-modal hashing methods make insufficient use of the supervised information in the dataset and have limited ability to express semantic similarity. To address this problem, this paper proposes the cross-modal hashing retrieval method Tri-CMH with fused triples, an end-to-end model framework consisting of two parts: feature extraction and hash learning. First, the multi-modal data are preprocessed into triples, and a data supervision matrix is constructed to aggregate samples whose labels and semantics agree and to separate those whose labels are opposite to their semantics, thus avoiding the under-utilization of supervised information in the dataset and making effective use of the global supervision information. To improve the model's ability to identify cross-modal semantic similarity, the Hamming distance loss, intra-modal loss, cross-modal loss, and quantization loss are considered, and semantically similar and semantically different hash codes are explicitly constrained by optimizing the loss function for hash learning. The method is trained and tested on the IAPR TC-12, MIRFLICKR-25K, and NUS-WIDE datasets, with mAP and PR curves as the evaluation criteria. The experimental results show the effectiveness and practicality of the method.
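Schematically, the overall hash-learning objective described in the abstract combines the four loss terms; the weighting factors below are illustrative placeholders rather than the paper's actual notation:

```latex
\mathcal{L} = \mathcal{L}_{\mathrm{Hamming}} + \lambda_{1}\,\mathcal{L}_{\mathrm{intra}} + \lambda_{2}\,\mathcal{L}_{\mathrm{inter}} + \lambda_{3}\,\mathcal{L}_{\mathrm{quant}}
```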
2. The manuscript requires extensive editing of the English language and style. It is difficult to read due to its poor linguistic quality.
All inappropriate expressions in the manuscript have been checked and revised.
I am sorry for the trouble I caused the reviewer due to my poor English expression.
3. The introduction mentions "network", "loss", etc., without any explanation.
The term network in the text refers to a neural network model that can learn and train on its own.
The loss function measures the degree of inconsistency between the predicted and actual values of the model. The smaller the loss function, the better the robustness of the trained network model.
Supervised information means that the dataset contains label information in addition to the data itself, while unsupervised means that the dataset is unlabeled.
Additions were made in lines 57, 59, and 70 of the manuscript.
4. It is unclear how the CNN of Figure 1 is trained, validated, and tested.
For the IAPR TC-12 and MIRFLICKR-25K datasets, 3,000 images were randomly selected as the test set for CNN image-feature extraction, and the remaining images were used as the retrieval set; for the NUS-WIDE dataset, 20,000 images were selected as the test set and the rest were used as the retrieval set. From the retrieval set of each dataset, a further 10,000 images were selected as the training set for IAPR TC-12 and MIRFLICKR-25K, and 30,000 images were selected as the training set for NUS-WIDE.
5. There is no description of the Multi-LSTM in Figure 1.
Multi-LSTM is a multi-layer, multi-variable long short-term memory network model consisting of three LSTM layers, one fully connected layer, and one one-dimensional convolutional layer. Each LSTM layer has an input gate, a forget gate, an output gate, and a memory cell. The forget gate discards useless information, the input gate controls which useful information the memory cell stores, and the output gate controls the output of useful information. Multi-LSTM combines the advantages of each layer, avoids the vanishing- and exploding-gradient problems to a certain extent, and captures the semantic feature information of the text modality. Using this network model for text feature extraction is superior to simpler network models such as a single LSTM. The Multi-LSTM model used in this paper has three LSTM layers, and the best training results and smallest errors are achieved when the numbers of hidden-layer neurons are 64, 256, and 64.
It has already been added in section 3.2 of the manuscript.
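For illustration only, here is a minimal PyTorch-style sketch of a stacked-LSTM text encoder matching the description above (three LSTM layers with 64, 256, and 64 hidden units, a 1-D convolution, and a fully connected output). The layer ordering, input dimensions, pooling, and hash-code length are assumptions, not the exact architecture in the manuscript:

```python
import torch
import torch.nn as nn

class MultiLSTMTextEncoder(nn.Module):
    """Illustrative stacked-LSTM text encoder (not the authors' exact model)."""

    def __init__(self, vocab_size=10000, embed_dim=128, code_len=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Three LSTM layers with 64, 256, and 64 hidden units, as described.
        self.lstm1 = nn.LSTM(embed_dim, 64, batch_first=True)
        self.lstm2 = nn.LSTM(64, 256, batch_first=True)
        self.lstm3 = nn.LSTM(256, 64, batch_first=True)
        # One 1-D convolution over the time dimension (placement assumed).
        self.conv1d = nn.Conv1d(64, 64, kernel_size=3, padding=1)
        # Fully connected layer mapping to the hash-code length.
        self.fc = nn.Linear(64, code_len)

    def forward(self, token_ids):              # token_ids: (batch, seq_len)
        x = self.embed(token_ids)               # (batch, seq_len, embed_dim)
        x, _ = self.lstm1(x)
        x, _ = self.lstm2(x)
        x, _ = self.lstm3(x)                     # (batch, seq_len, 64)
        x = self.conv1d(x.transpose(1, 2))       # (batch, 64, seq_len)
        x = x.mean(dim=2)                        # temporal average pooling
        return torch.tanh(self.fc(x))            # relaxed hash codes in [-1, 1]

# Example usage with random token ids.
encoder = MultiLSTMTextEncoder()
tokens = torch.randint(0, 10000, (8, 20))       # batch of 8 sequences, length 20
codes = encoder(tokens)                          # (8, 64)
```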
6. There is no explanation about the hyperparameters of Eq. (1) and how they are determined.
The experiments set the hyperparameters =100, =150, =50, ==1, fixed the input size of the image and text as 128, fixed the number of iterations at 500, fixed the learning rates of the image-modality and text-modality networks within [10^-6, 10^-1], and averaged the data from three experimental runs before applying them to the algorithm.
It has already been added in section 4.1 of the manuscript.
7. The notation has many flaws and inconsistencies.
All symbols in the manuscript have been checked and corrected. It includes the meaning of the symbols and the format in which they are written.
8. The experimental protocol is not presented. It seems that the authors train and test on the same data. If so, the results are biased.
For the IAPR TC-12 and MIRFLICKR-25K datasets, 3,000 image-text pairs were randomly selected as the test set, 10,000 image-text pairs were selected from the remaining data as the training set, and the rest were used as the cross-modal retrieval set. For the NUS-WIDE dataset, 10,000 image-text pairs were randomly chosen as the test set, 30,000 image-text pairs were chosen as the training set, and the remaining pairs were used as the cross-modal retrieval set.
It has already been added in section 4.1 of the manuscript.
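A minimal sketch of this kind of random split is shown below; the sizes correspond to the MIRFLICKR-25K-style protocol, and it follows the convention described earlier in which the retrieval set is everything except the test set and the training set is drawn from it. The function name and indexing are placeholders, not the authors' actual code; adjust if the intended retrieval set also excludes the training pairs:

```python
import numpy as np

def split_indices(num_pairs, num_test, num_train, seed=0):
    """Randomly split image-text pair indices into test, train, and retrieval sets."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_pairs)
    test = perm[:num_test]
    train = perm[num_test:num_test + num_train]  # training set drawn from the retrieval set
    retrieval = perm[num_test:]                   # retrieval set = everything except the test set
    return test, train, retrieval

# Example: MIRFLICKR-25K-style sizes (3,000 test, 10,000 train, rest for retrieval).
test_idx, train_idx, retrieval_idx = split_indices(25000, 3000, 10000)
print(len(test_idx), len(train_idx), len(retrieval_idx))
```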
Author Response File: Author Response.docx
Reviewer 2 Report
This paper is concerned with cross-modal hash retrieval using a triple fusion technique. The main theme of this approach is fine, and the intensive experiments conducted show better performance.
The primary drawback of this paper is its English and mathematical expressions. Some typical cases are listed below.
1. The present title does not fully make sense due to the unsuitable "for".
2. In line 188, "Where" should be "where" with no indentation. Similar cases in lines 213 and 224.
3. In line 192, the statement "xi is the row vector of image i" is not clearly addressed.
4. In lines 206 and 207, the symbol $\in$ should be $\subset$.
5. In line 208, "| ||" should be "|| ||". The term "F-paradigm" should be "F-norm".
It's the authors' responsibility to carefully check the whole manuscript for unsuitable expressions.
Besides, the recent references are too few, and some references lack information such as page numbers.
Author Response
I would like to thank the reviewers for reviewing and commenting on the manuscript despite their hectic schedules, and my responses to the questions are provided below.
- The present title does not fully make sense due to the unsuitable "for".
The revision has been made in the manuscript as required, as follows:
“A Cross-Modal Hash Retrieval Method with Fused Triples”
- In line 188, "Where" should be "where" with no indentation. Similar cases in lines 213 and 224.
The revision has been made in the original text as required.
Additions were made in lines 199, 224, and 225 of the manuscript.
- In line 192, the statement "xi is the row vector of image i" is not clearly addressed.
The revision has been made in the manuscript as required, as follows:
xi is the raw pixels of image i.
I apologize for the misrepresentation of the original manuscript due to personal oversight.
Additions were made in line 203 of the manuscript.
- In lines 206 and 207, the symbol $\in$ should be $\subset$.
The revision has been made in the manuscript as required.
Additions were made in lines 217 and 218 of the manuscript.
- In line 208, "| ||" should be "|| ||". The term "F-paradigm" should be "F-norm".
The revision has been made in the manuscript as required.
Additions were made in line 219 of the manuscript.
- It's the authors' responsibility to carefully check the whole manuscript for unsuitable expressions.
All inappropriate expressions in the manuscript have been checked and revised.
I am sorry for the trouble I caused the reviewer due to my poor English expression.
- Besides, the recent references are too few, and some references lack information such as page numbers.
References with missing pages have been revised. Three newer references were read and added to the manuscript in appropriate places: references 14, 17, and 20.
Sorry for not being able to enter the formula here and for not being able to show you the results of the changes visually.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
The submitted document is unreadable. It is a draft, not a manuscript.
Reviewer 2 Report
The revised version is much better in its expression, but there are still many drawbacks in the English, especially in the revised parts.
In order for the reviewer to read the manuscript clearly, a clean version without any revision tags should be provided in further revisions.