Peer-Review Record

Weakly Supervised Fine-Grained Image Classification via Salient Region Localization and Different Layer Feature Fusion

Appl. Sci. 2020, 10(13), 4652; https://doi.org/10.3390/app10134652
by Fangxiong Chen 1, Guoheng Huang 2,*, Jiaying Lan 2, Yanhui Wu 3, Chi-Man Pun 4,*, Wing-Kuen Ling 3,* and Lianglun Cheng 2
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 19 June 2020 / Revised: 25 June 2020 / Accepted: 28 June 2020 / Published: 6 July 2020
(This article belongs to the Section Computing and Artificial Intelligence)

Round 1

Reviewer 1 Report

I appreciate the efforts you made in clarifying your methodology and enriching the description of the experimental results. Now, your contribution is generally understandable, even if some parts of the procedure are still awkward to read (specifically, the final part of Section 3.1 and Section 3.3).

I suggested some modifications in the annotated pdf file to fix some minor issues, mainly concerning the manuscript readability.

Finally, the References section still needs a thorough check, because some entries are still incomplete, with the Journal names missing.

Comments for author File: Comments.pdf

Author Response

Original Manuscript ID: applsci-856246

Original Article Title: “Weakly Supervised Fine-grained Image Classification via Salient Region Localization and Different Layer Feature Fusion”

To: Multidisciplinary Digital Publishing Institute Editor

Re: Response to reviewers

Dear Editors,

Thank you for allowing a resubmission of our manuscript, with an opportunity to address the reviewers’ comments. We are uploading (a) our point-by-point response to the comments (response to reviewers) and (b) a revised manuscript (PDF main document).

We have reviewed the full text again and made the corresponding amendments based on the reviewers' suggestions. In particular, we have enriched the related experiments in Section 4, and we have carefully checked the References section and fixed all of the issues there.

We hope the revisions meet the requirements of the editors and reviewers, and we sincerely ask you to give our manuscript further consideration.

Sincerely yours,

Dr. Guoheng Kevin Huang et al.

Reviewer #1, General Concern: I appreciate the efforts you made in clarifying your methodology and enriching the description of the experimental results. Now, your contribution is generally understandable, even if some parts of the procedure are still awkward to read (specifically, the final part of Section 3.1 and Section 3.3).

I suggested some modifications in the annotated pdf file to fix some minor issues, mainly concerning the manuscript readability.

Finally, the References section still needs a thorough check, because some entries are still incomplete, with the Journal names missing.

Author response: Thank you for your valuable suggestion. For your first concern, we have double-checked the presentation of the final parts of Sections 3.1 and 3.3 and revised them accordingly, reformulating both sections in the updated manuscript. In addition, all issues mentioned in your annotated file have been corrected according to your advice. We also checked the references, found a number of errors and omissions, and revised them carefully. We sincerely hope you will find the revision satisfactory.

Reviewer #1, Concern #1: line 151 I really don't understand. Here you say that the figure comes from Zhou [26], but in the figure caption you cite Selvaraju [27] as the source!

Author response: Thank you for your valuable suggestion. We have modified the sentence and fixed the error according to your advice. The corrected sentences are on lines 150 and 151 in our updated manuscript.

Reviewer #1, Concern #2: line 288

Author response: Thank you for your valuable suggestion. We have modified the sentence according to your advice. The corrected sentence is on line 287 in our updated manuscript.

Reviewer #1, Concern #3: line 290

Author response: Thank you for your valuable suggestion. We have modified the sentence according to your advice. The corrected sentence is on line 288 in our updated manuscript.

Reviewer #1, Specific Concern #4: lines 379-380

Author response: Thank you for your advice. We have rewritten the sentence. The corrected sentences are on lines 381 and 383 in our updated manuscript.

Reviewer #1, Specific Concern #5: lines 386-388

Author response: Thank you for your valuable advice. We have fixed this issue accordingly. The corrected sentences are on lines 388 to 390 in our updated manuscript.

Reviewer #1, Specific Concern #6: lines 400-412

Author response: Thank you for your valuable advice. We have fixed the problems in this passage. The corrected sentences are on lines 402 to 416 in our updated manuscript.

Reviewer #1, Specific Concern #7: lines 451-470

Author response: Thank you for your valuable advice. We have fixed all of the issues. The corrected sentences are on lines 453 to 470 in our updated manuscript.

Reviewer #1, Specific Concern #8: line 479 Please, add the reference number.

Author response: Thank you for your valuable advice. The corrected sentences are on lines 486 to 487 in our updated manuscript, and we have added the reference number accordingly.

Author Response File: Author Response.docx

Reviewer 2 Report

This article presents a saliency-module-based weakly supervised fine-grained image classification model. The proposed model can localize essential regional parts with the use of saliency maps, while only image class annotations are provided. The proposed approach applies the bilinear attention architecture and makes use of a Different Layer Feature Fusion Module to improve the expressive ability of model features. The authors conducted several experiments to evaluate the performance of the proposed model.

The paper has been improved compared to previous versions; however, the following question has not been addressed in this version, and it should be:

  • The proposed method is very similar to the approach proposed in [1], which gives better results than the proposed one. However, this study is not mentioned in the comparative study. It is important to discuss the difference between the proposed approach and that of [1].

Another remark: line 234 reads "The figure 5 also shows parts of the response of …". It is not clear what the authors wanted to show in Figure 5; probably they wanted to write "Figure 4." If so, what are "the parts"?

Author Response

Original Manuscript ID: applsci-856246

Original Article Title: “Weakly Supervised Fine-grained Image Classification via Salient Region Localization and Different Layer Feature Fusion”

To: Multidisciplinary Digital Publishing Institute Editor

Re: Response to reviewers

Dear Editors,

Thank you for allowing a resubmission of our manuscript, with an opportunity to address the reviewers’ comments. We are uploading (a) our point-by-point response to the comments (response to reviewers) and (b) a revised manuscript (PDF main document).

We have reviewed the full text again and made the corresponding amendments based on the reviewers' suggestions. In particular, we have enriched the related experiments in Section 4, and we have carefully checked the References section and fixed all of the issues there.

We hope the revisions meet the requirements of the editors and reviewers, and we sincerely ask you to give our manuscript further consideration.

Sincerely yours,

Dr. Guoheng Kevin Huang et al.

Reviewer #2 General Concern: This article presents a saliency-module-based weakly supervised fine-grained image classification model. The proposed model can localize essential regional parts with the use of saliency maps, while only image class annotations are provided. The proposed approach applies the bilinear attention architecture and makes use of a Different Layer Feature Fusion Module to improve the expressive ability of model features. The authors conducted several experiments to evaluate the performance of the proposed model.

The paper has been improved compared to previous versions; however, the following question has not been addressed in this version.

Author response: Thank you for the valuable comments. We have carefully checked the full text and corrected the errors we found. We sincerely hope you will find the revision satisfactory.

Reviewer #2 Concern #1: The proposed method is very similar to the approach proposed in [1], which gives better results than the proposed one. However, this study is not mentioned in the comparative study. It is important to discuss the difference between the proposed approach and that of [1].

Author response: Thank you for the valuable comments. Although our proposed method is similar to the method proposed in [1], our method retains the advantages of end-to-end training and testing by using the bilinear neural network. Besides, the purpose of our proposed Salient Region Localization Module is to acquire regional critical areas and to exploit the characteristics of bilinear neural networks. In this way, our model can better integrate global and regional features to achieve higher accuracy in salient-area recognition. This is different from the method proposed in [1]. The method of [1] does not implement end-to-end training and testing, and its attention module is mainly designed for re-aligning the data from each category and enriching the training data from the same category. Furthermore, their model can be trained on multi-view and multi-scale features, so they can easily improve the recognition accuracy of their network. Nonetheless, the uncertainty of the regional-area clustering and alignment operations of FilterNet and PartNet during initialization heavily affects the ClassNet training, which makes their experimental results heavily dependent on hyperparameters. Our proposed model, in contrast, is more stable: the Salient Region Localization Module does not require training, which reduces the impact of the initialization parameters.

Moreover, our accuracy is higher than that of OPAM proposed by Peng et al. [1] on the FGVC-Aircraft dataset. On the CUB-200-2011 dataset, our accuracy is similar to that of OPAM. However, OPAM runs with roughly 35M parameters, about seven times the number we have, and achieves a classification speed of only 4 frames per second (our method runs at 48 fps). We have reduced the number of parameters while increasing the detection speed and maintaining classification accuracy.

We have added more detail on lines 465 to 470 accordingly.
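For background on the bilinear architecture invoked throughout this response, here is a minimal sketch of standard bilinear pooling in the style of Lin et al.'s B-CNN (referenced in the second-round report); the function name and shapes are illustrative assumptions, not the authors' exact module.

```python
import torch
import torch.nn.functional as F

def bilinear_pool(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """Bilinear pooling of two (N, C, H, W) feature streams: channel-wise
    outer product averaged over spatial locations, followed by the usual
    signed square root and L2 normalization."""
    n, ca, h, w = feat_a.shape
    cb = feat_b.shape[1]
    a = feat_a.reshape(n, ca, h * w)
    b = feat_b.reshape(n, cb, h * w)
    x = torch.bmm(a, b.transpose(1, 2)) / (h * w)          # (N, Ca, Cb)
    x = x.reshape(n, ca * cb)
    x = torch.sign(x) * torch.sqrt(torch.abs(x) + 1e-10)   # signed sqrt
    return F.normalize(x, dim=1)                           # L2 normalization
```

Because the pooled descriptor is a differentiable function of both streams, a classifier on top of it can be trained end to end, which is the advantage the response claims over the multi-stage pipeline of [1].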

Reviewer#2, Concern # 2: Another remark: line 234 reads "The figure 5 also shows parts of the response of …". It is not clear what the authors wanted to show in Figure 5; probably they wanted to write "Figure 4." If so, what are "the parts"?

Author response: Thank you for your valuable suggestion. What we were trying to state here is that feature maps from different channels have different salient areas. These sentences describe Figure 5; indeed, this was not a suitable place for the description. For this reason, we have integrated the statement into the last paragraph of Section 3.1 (modified manuscript, lines 261 to 266).

Author Response File: Author Response.docx

This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.


Round 1

Reviewer 1 Report

This article presents a saliency-module-based weakly supervised fine-grained image classification model. The proposed model can localize essential regional parts with the use of saliency maps, while only image class annotations are provided. The proposed approach applies the bilinear attention architecture and makes use of a Different Layer Feature Fusion Module to improve the expressive ability of model features. The authors conducted several experiments to evaluate the performance of the proposed model.

The paper is very well written and easy to read. However, some explanations are needed:

  • Information is missing in lines 237, 240, and 242 about the axes.
  • The proposed method is very similar to the approach proposed in [1], which gives better results than the proposed one. However, this study is not mentioned in the comparative study. It is important to discuss the difference between the proposed approach and that of [1].

Author Response

Original Manuscript ID: applsci-792341

Original Article Title: “Weakly Supervised Fine-grained Image Classification via Salient Region Localization and Different Layer Feature Fusion”

To: Multidisciplinary Digital Publishing Institute Editor

Re: Response to reviewers

Dear Editor,

Thank you for allowing a resubmission of our manuscript, with an opportunity to address the reviewers’ comments.

We are uploading (a) our point-by-point response to the comments (response to reviewers) and (b) a revised manuscript (PDF main document). Please also see the attachment.

Best regards,

Huang et al.

Reviewer#1, Concern # 1: This article presents a saliency-module-based weakly supervised fine-grained image classification model. The proposed model can localize essential regional parts with the use of saliency maps, while only image class annotations are provided. The proposed approach applies the bilinear attention architecture and makes use of a Different Layer Feature Fusion Module to improve the expressive ability of model features. The authors conducted several experiments to evaluate the performance of the proposed model.

The paper is very well written and easy to read. However, some explanations are needed:

Information is missing in lines 237, 240, and 242 about the axes.

The proposed method is very similar to the approach proposed in [1], which gives better results than the proposed one. However, this study is not mentioned in the comparative study. It is important to discuss the difference between the proposed approach and that of [1].

Author response: Thank you for the valuable comments. Although our proposed method is similar to the method proposed in [1], our method retains the advantages of end-to-end training and testing by using the bilinear neural network. Besides, the purpose of our proposed Salient Region Localization Module is to acquire regional critical areas and to exploit the characteristics of bilinear neural networks. In this way, our model can better integrate global and regional features to achieve higher accuracy in salient-area identification. This is different from the method proposed in [1]. The method of [1] does not implement end-to-end training and testing, and its attention module is mainly designed for re-aligning the data from each category and enriching the training data from the same category. Furthermore, their model can be trained on multi-view and multi-scale features, so they can easily improve the recognition accuracy of their network. Nonetheless, the uncertainty of the regional-area clustering and alignment operations of FilterNet and PartNet during initialization heavily affects the ClassNet training, which makes their experimental results heavily dependent on hyperparameters. For this reason, the source code and experimental results of [1] are difficult for us to reproduce, and we did not reproduce their model; comparison experiments between [1] and our proposed method are difficult to implement because it is almost impossible for us to compare the two methods on the same platform. Furthermore, our proposed model is more stable: the Salient Region Localization Module does not require training, which reduces the impact of the initialization parameters. We have added more detail on lines 107 to 108 and 150 to 153.

As for the missing axis information in lines 237, 240, and 242, we have rewritten that section to make it easier to understand.

Author Response File: Author Response.doc

Reviewer 2 Report

see the attached file

Comments for author File: Comments.pdf

Author Response

Original Manuscript ID: applsci-792341

Original Article Title: “Weakly Supervised Fine-grained Image Classification via Salient Region Localization and Different Layer Feature Fusion”

To: Multidisciplinary Digital Publishing Institute Editor

Re: Response to reviewers

Dear Editor,

Thank you for allowing a resubmission of our manuscript, with an opportunity to address the reviewers’ comments.

We are uploading (a) our point-by-point response to the comments (response to reviewers) and (b) a revised manuscript (PDF main document). Please also see the attachment.

Best regards,

Huang et al.

Reviewer#2, Concern # 1: lines 25-28 Please, break this too long sentence.

Author response: Thank you for your valuable suggestion. We checked the mentioned sentence and broke it into several short sentences. Moreover, we modified some sentences so that they express the meaning more clearly. The corrected sentences are on lines 27 to 32 in our updated manuscript.

Reviewer#2, Concern # 2: lines 57-59 Your claim is not correct. You probably should say “Wah et al. introduced the CUB200-2011 dataset [4] and proposed some benchmark methods; however, their classification method for uncropped images only achieves an accuracy of 10.3%”

Author response: Thank you for your valuable suggestion. We have modified the sentence according to your advice. The corrected sentences are on lines 62 to 63 in our updated manuscript.

Reviewer#2, Concern # 3: lines 67-68 Please, reformulate in a clearer way

Author response: Thank you for your valuable advice. We have checked the sentence and reformulated it in a clearer way. What we were trying to state here is that research based on the two methods we mentioned reaches 50% to 62% fine-grained image classification accuracy. The corrected sentences are on lines 72 to 73 in our updated manuscript.

Reviewer#2, Concern # 4: line 70 “have a significant impact on the final classification results,”. Please, replace that comma with a colon, or break otherwise this long sentence.

Author response: Thank you for the valuable suggestion. We have carefully rethought the whole sentence and chosen to break it into several short sentences to make our claim clearer. The corrected sentences are on lines 74 to 77 in our updated manuscript.

Reviewer#2, Concern # 5: lines 80-82 Please, check this sentence. Its meaning is not clear.

Author response: Thanks for your suggestion. We have reformulated the sentence to make its meaning clearer. What we were trying to claim here is that strongly supervised methods require finely, manually annotated datasets, and such datasets are expensive to produce. For this reason, strongly supervised methods are limited when applied to actual tasks. The corrected sentences are on lines 86 to 89 in our updated manuscript.

Reviewer#2, Concern # 6: lines 86-89 Please, break this too long sentence.

Author response: Thank you for your valuable advice. We have broken this overly long sentence into several shorter sentences to make this part easier to understand. The corrected sentences are on lines 92 to 99 in our updated manuscript.

Reviewer#2, Concern # 7: lines 101-102 Please, check this unclear and redundant sentence.

Author response: Thank you for your valuable suggestion. We have rewritten the whole sentence and made its meaning clearer. We were trying to point out the difficulties of fine-grained image classification and to introduce our solution, which uses a carefully designed loss function to improve the distinction between classes. The corrected sentences are on lines 109 to 111 in our updated manuscript.

Reviewer#2, Concern # 8: line 107 “which showing”. I don’t understand the meaning of this

Author response: Thank you for pointing out the grammatical problem. We have rewritten the whole sentence and rearranged the surrounding sentences to make this part clearer. We were trying to state that our model can reach higher accuracy than most strongly supervised methods while not using a dataset with manually annotated essential areas. The corrected sentences are on lines 115 to 117 in our updated manuscript.

Reviewer#2, Concern # 9: lines 109-114 Please, adopt a coherent capitalization (for Sections, but also for modules names)

Author response: Thank you for your valuable advice. We have adopted coherent capitalization accordingly. The corrected sentences are on lines 118 to 123 in our updated manuscript.

Reviewer#2, Concern # 10: lines 120-121 “apply… applications” Please, reformulate

Author response: Thank you for pointing out the grammatical problem. We have reformulated this sentence. The reformulated sentences are on lines 129 to 132 in our updated manuscript.

Reviewer#2, Concern # 11: lines 143-144 “As Figure 1. shows Zhou et al. in related research [7].” This sentence is meaningless

Author response: Thank you for your precious advice. We have reformulated the sentence; the reformulated sentence is on lines 153 to 156. We have also added more information introducing their research.

Reviewer#2, Concern # 12: lines 145-146 If you are describing your method features, you should mention it

Author response: Thank you for your advice. We were indeed describing a feature of our method, so we changed the original sentence to: “We used the weighted gradient-based algorithm for class activation mapping in our method for this reason.”

Reviewer#2, Concern # 13: line 150 Figure caption. You should add details on what is represented and how the maps have been obtained

Author response: Thank you for your advice. We have added more detail to the figure caption. Figure 1 shows the heat maps produced by the Grad-CAM method; these heat maps show that Grad-CAM can easily locate the salient regions of different images. We modified the original caption to: “Heat maps from different channels, generated by the Grad-CAM method. The figure shows that Grad-CAM can easily locate the salient regions of different images, although heat maps from different channels have various focal points. We use these heat maps to extract regional features.”
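For readers unfamiliar with how such heat maps are produced, here is a minimal Grad-CAM sketch using PyTorch hooks; the backbone, layer index, and input size are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Illustrative Grad-CAM sketch (assumed VGG-16 backbone and layer choice).
model = models.vgg16(weights="IMAGENET1K_V1").eval()
target_layer = model.features[28]  # last conv layer of VGG-16 (assumption)

store = {}
target_layer.register_forward_hook(lambda m, i, o: store.update(act=o))
target_layer.register_full_backward_hook(lambda m, gi, go: store.update(grad=go[0]))

def grad_cam(image: torch.Tensor, class_idx: int) -> torch.Tensor:
    """Return a normalized heat map with the same spatial size as `image`."""
    scores = model(image)                  # (1, num_classes) class scores
    model.zero_grad()
    scores[0, class_idx].backward()
    # Channel weights: global-average-pooled gradients of the class score.
    w = store["grad"].mean(dim=(2, 3), keepdim=True)           # (1, C, 1, 1)
    cam = F.relu((w * store["act"]).sum(dim=1, keepdim=True))  # weighted sum
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear",
                        align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

heat = grad_cam(torch.rand(1, 3, 224, 224), class_idx=0)  # (1, 1, 224, 224)
```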

Reviewer#2, Concern # 14: line 158 “tow-level” should be “low level”, probably

Author response:  Thank you for your advice.  We have corrected the typo.

Reviewer#2, Concern # 15: lines 171-172 “Our model is weakly supervised learning method based. Only using image class labels for training, and it does not require manually annotated essential regional areas.” Please, check these sentences.

Author response: Thank you for your suggestion. We have rewritten the whole sentence to make it easier to understand. What we were trying to show is that our model is based on weakly supervised learning; for this reason, we can train it on datasets without manually annotated essential regional areas, which makes our model more widely applicable. The reformulated sentence is on lines 187 to 189.

Reviewer#2, Concern # 16: lines 179-180 “each feature map would be converted into two vectors, vector contains maximum values and vector contains average values” should probably read “each feature map would be converted into two vectors, one containing maximum values and the other containing average values” or something similar

Author response: Thank you for your advice. We have rewritten the whole sentence to make it easier to understand. What we were trying to say is exactly as you suggested: “each feature map would be converted into two vectors, one containing maximum values and the other containing average values”.
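To make the reformulated sentence concrete, here is a short sketch of that conversion under the standard (N, C, H, W) tensor convention (the convention is our assumption):

```python
import torch

def to_max_and_avg_vectors(feature_map: torch.Tensor):
    """Convert a (N, C, H, W) feature map into two (N, C) vectors:
    one holding each channel's maximum, one holding each channel's mean."""
    n, c, h, w = feature_map.shape
    flat = feature_map.view(n, c, h * w)
    max_vec = flat.max(dim=2).values  # vector of per-channel maximum values
    avg_vec = flat.mean(dim=2)        # vector of per-channel average values
    return max_vec, avg_vec

max_vec, avg_vec = to_max_and_avg_vectors(torch.rand(2, 512, 28, 28))
```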

Reviewer#2, Concern # 17: Section 3.1 is too generic and gives very few (and confused) details on the Module implementation. It should be carefully checked and mainly rewritten.

Author response:  Thank you for your advice. We have added more details on the Module implementation in Section 3.1.

Reviewer#2, Concern # 18: Section 3.2 should be carefully checked and mainly rewritten. The bilinear neural network is not described. I only point out the main incongruences I found:

Author response: Thank you for your careful reading and for pointing out the grammatical errors in the manuscript. We checked the grammar of the entire manuscript and corrected the errors. Moreover, we modified some sentences so that they express their meaning clearly. Section 3.2 has been rewritten.

Reviewer#2, Concern # 19: line 227 obscure sentence (something missing?) 

Author response: Thank you for your advice. We checked the grammar of the entire manuscript and corrected the errors. Moreover, we modified some sentences so that they express their meaning clearly. The reformulated sentence is on lines 239 to 240.

Reviewer#2, Concern # 20: line 229 visualized more visually???

Author response: Thank you for your careful reading. We have modified this sentence so that it expresses its meaning clearly: “visually” has been changed to “directly”. The reformulated sentence is on line 240.

Reviewer#2, Concern # 21: line 230 size of the heat map YOU INTEND TO PRODUCE would be the same of the original image, isn’t it?

Author response:  Thank you for your careful reading. We have checked the sentence and reformulated it in a clearer way.  We have changed the original sentence to: “The heat map and the input image have the same shape.”

Reviewer#2, Concern # 22: lines 231-232 “Most parts of the response will be in the foreground, and only a minority will be on the background”. what do you mean? And what Figure do you refer to?

Author response: Thank you for your careful reading. We refer here to concepts from object detection: if a target is detected in an input image, the foreground represents the target to be detected, and the background represents the area other than the target, such as the sky or grass. Moreover, we modified the sentence so that it expresses its meaning clearly. The reformulated sentence is on lines 243 to 245.

Reviewer#2, Concern # 23: lines 234-243 Bilinear interpolation is a quite common procedure, there is no need to detail it. Moreover, (4) is wrong: in the second line of the formula you should have  and

Author response: Thank you for your careful reading. We have corrected the formula in (4). Moreover, we modified the sentence so that it is easier to understand.
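For completeness, the operation formula (4) describes is the standard bilinear resize of a coarse response map to the input image's spatial size; a one-call sketch, with shapes as illustrative assumptions:

```python
import torch
import torch.nn.functional as F

# Assumed shapes: a coarse 14x14 response map upsampled to a 448x448 input.
coarse = torch.rand(1, 1, 14, 14)
full = F.interpolate(coarse, size=(448, 448), mode="bilinear",
                     align_corners=False)
assert full.shape[2:] == (448, 448)  # heat map now matches the input image
```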

Reviewer#2, Concern # 24: line 244 “different feature map has…” should be “different feature maps have…” 

Author response:  Thank you very much for your valuable suggestion. We have modified the sentence according to your advice. The reformulated sentence is on line 246.

Reviewer#2, Concern # 25: line 245 “to apply addition to the d dimension” should be “to sum over the d dimension” 

Author response:  Thank you very much for your valuable suggestion. We have modified the sentence according to your advice. The reformulated sentence is on line 248.

Reviewer#2, Concern # 26: lines 249-250 “easier and more accurate”. Apart from the awkward English, why is it so? Please, justify your claim.  

Author response: Thank you for your advice. We are sorry for the inappropriate expression in that sentence, and we have justified the claim from a more scientific perspective: the fusion of multiple feature maps through Equation (5) helps to enhance the feature information of critical areas, which in turn makes it easier and more accurate to locate local critical areas. The reformulated sentences are on lines 252 to 254.

Reviewer#2, Concern # 27: lines 250-251 obscure sentence. What do you mean with “size positions matrix of image”? Please, provide details and/or references for OTSU algorithm you mention.   

Author response: Thank you for your advice. By the phrase “size positions matrix of image” we mean all the pixels in the area. We have also provided a reference for the OTSU algorithm we mention. The reformulated sentence is on line 256.
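Putting the last few responses together (summation over the d dimension, then OTSU thresholding to isolate the critical area), here is a compact OpenCV sketch; the function name and shapes are illustrative assumptions, not the authors' implementation:

```python
import cv2
import numpy as np

def locate_critical_area(feature_map: np.ndarray):
    """Sum a (C, H, W) feature map over the channel (d) dimension, apply
    Otsu's threshold, and return the bounding box of the salient pixels."""
    fused = feature_map.sum(axis=0).astype(np.float32)       # sum over d
    fused = cv2.normalize(fused, None, 0, 255, cv2.NORM_MINMAX)
    _, mask = cv2.threshold(fused.astype(np.uint8), 0, 255,
                            cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    ys, xs = np.nonzero(mask)                                # salient pixels
    return xs.min(), ys.min(), xs.max(), ys.max()            # x0, y0, x1, y1

box = locate_critical_area(np.random.rand(512, 28, 28))
```

Because Otsu's method picks the threshold automatically from the fused map's histogram, this localization step needs no training, which matches the stability argument made in the responses to Reviewer 2's first concern.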

Reviewer#2, Concern # 28: lines 255-273 Please, try to explain in a clearer way what seems to me a simple procedure described in a very confused paragraph. First, clarify the difference between what you called a point P=(x,y) and a pixel f(x,y)  

Author response: Thank you for your advice. We have deleted the redundant “point”, because the two are actually the same. We reformulated the sentences to make them easier to understand; the reformulated sentences can be seen on lines 259 to 271.

Reviewer#2, Concern # 29: line 254 Please, clarify who are Bi,j and Ai,j. I assume a mismatch in the notation with respect to line 248

Author response: Thank you for your advice. We have added more detail about the meanings of Bi,j and Ai,j, which can be seen on lines 253 and 259.

Reviewer#2, Concern # 30: line 282 “punish each feature” Do you mean “punish/penalize intra class variation for each feature”?  

Author response: Thank you very much for your valuable suggestion. “Punish/penalize intra-class variation for each feature” is exactly what we meant. The corrected sentence is on line 300.

Reviewer#2, Concern # 31: line 283 “Our loss function is based on adding center loss function to softmax function” Already said in line 280 “we add center loss to our loss function”, isn’t it?  

Author response:  Thank you very much for your valuable suggestion. We have removed the repetitive expression in the paragraph.

Reviewer#2, Concern # 32: line 291 Please, define all the quantities in Eq (9): WTyi, Ai, Ayi, Bi, Byi, ABi, AByi (see my comment to line 254)

Author response:  Thank you very much for your valuable suggestion. We apologize profusely for the lack of a formula description. The formula for the loss function has been modified and the variables in it have been explained.

Our loss function for the model is defined as follows:

L = L_AB − α·L_PA − β·L_PB + λ·L_C                (6)

In the equation above, L_AB is the softmax loss computed between the category probabilities produced by the main neural network and the one-hot encoded vector stating each image's label; L_PA and L_PB are the corresponding log-probability terms for the category probabilities produced by the lower-level part networks; and L_C is the center loss, computed from the central feature of the i-th category and the features of the input images. The hyperparameters α and β are chosen by cross-validation, while the parameter λ is set to 1.
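For concreteness, here is a minimal PyTorch sketch of a loss with this shape; the function, shapes, and weighting are our reading of the response above and of Concern #41 (subtracting negative log-probability terms is equivalent to adding cross-entropy on the part streams), not the authors' published code.

```python
import torch
import torch.nn.functional as F

def joint_loss(logits_ab, logits_a, logits_b, feats, centers, labels,
               alpha=0.5, beta=0.5, lam=1.0):
    """Sketch of the assumed joint loss: cross-entropy on the fused bilinear
    stream and on the two part streams (equivalent to subtracting the
    negative log-probability terms L_PA, L_PB), plus a center-loss term."""
    loss_ab = F.cross_entropy(logits_ab, labels)  # main bilinear stream
    loss_a = F.cross_entropy(logits_a, labels)    # part stream A
    loss_b = F.cross_entropy(logits_b, labels)    # part stream B
    # Center loss: pull each image's feature toward its class center.
    loss_c = 0.5 * (feats - centers[labels]).pow(2).sum(dim=1).mean()
    return loss_ab + alpha * loss_a + beta * loss_b + lam * loss_c
```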

Reviewer#2, Concern # 33: line 300 “The changes in values of loss functions demonstrate the validity of loss functions” What do you mean?    

Author response: Thank you very much for your valuable suggestion. What we were trying to state here is that the validity of the loss function is demonstrated by observing how the loss value changes during the experiments. The reformulated sentence is on lines 325 to 326.

Reviewer#2, Concern # 34: line 312 ”which is illustrated in Figure 6”. As you wrote, it seems that Figure 6 illustrates image label data. This is not the case, Figure 6 simply shows some examples of images from the dataset.     

Author response: Thank you very much for your valuable suggestion. We are very sorry that the sentence describing the figure was misplaced in the paragraph arrangement. We have corrected the placement of the figure descriptions in the paragraphs and supplemented the descriptions of the training and test sets. The sentence has been moved to line 332.

Reviewer#2, Concern # 35: line 333 I don't understand this sentence. Do you use CompCars, or do you extract and preprocess a part of this dataset? Please, reformulate the sentence. Something like “We considered the CompCars dataset, which is proposed by Yang et al. [35], and contains 300,000 images of 500 categories of vehicles. We extract from this dataset…”

Author response: Thank you very much for your valuable suggestion. We did use the CompCars dataset, and we have corrected the statement in the paragraph based on your suggestions. We used the CompCars dataset proposed by Yang et al. [36], which contains 300,000 images of 500 categories of vehicles, and extracted from it 15 categories of vehicle type, 55 categories of vehicle brand, and 250 types of vehicle model. The modified sentences are on lines 349 to 361.

Reviewer#2, Concern # 36: lines 346- 348 Please, check the sentence structure.    

Author response: Thank you very much for your valuable suggestion. We apologize for the unclear sentence structure; we have rewritten this paragraph. The reformulated sentences are on lines 372 to 378.

Reviewer#2, Concern # 37: 4.2.2 Again, the (straightforward) data normalization is described in a redundant and confused manner.    

Author response: Thank you very much for your valuable suggestion. We apologize for the redundant and confusing description; we have rewritten this paragraph. The reformulated sentences are on lines 380 to 384.

Reviewer#2, Concern #38: Section 4.3 The entire Section should be more detailed. Results are simply reported through accuracy tables; implementation details (such as parameter tuning and the sensitivity of the model to those choices) are not discussed. The choice and size of the training datasets for all the experiments are not described, apart from the indication of 70%-30% for the CompCars dataset. Finally, and more relevant: the last experiment, which is the most significant for evaluating the improvements of the proposed methodology (according to the authors' claim), is poorly discussed.

Author response: Thank you very much for your valuable suggestion. We have streamlined this section based on your suggestions, and the division of the training and test sets for each dataset is now explained. In addition, more explanation and discussion of the last experiment has been added to the manuscript.

Reviewer#2, Concern # 39: Section 4.3.1 The title is too vague and not representative of the section content. You are describing the accuracy index you choose, that is the index to be evaluated, am I right?      

Author response:  Thank you very much for your valuable suggestion. We have changed that title to “Evaluation Index”.

Reviewer#2, Concern # 40: line 382-384 This sentence is obscure. I assume that the “which” on line 383 should be “it”      

Author response: Thank you very much for your valuable suggestion. According to your advice, we have corrected that sentence. The reformulated sentences are on lines 405 to 408.

Reviewer#2, Concern # 41: Section 4.3.2 Nothing is said about the criteria for choosing the parameters alpha, beta (and lambda, of course). Moreover, and more importantly, how can you say that the main difference among the loss curves is not due to the simple subtraction of the terms LPA, LPB (since the parameters alpha, beta are positive, these terms are both negative)? Along with Table 1, you could have shown and discussed the results of your classification, highlighting on what images the joint loss function achieves better accuracy and trying to relate this to the better-identified features.

Author response: Thank you very much for your questions about the loss-curve plot and the hyperparameter selection. The selection criteria for the hyperparameters are addressed in our answer to Concern #32 above, where the detailed explanation is given. When examining the loss curves in this experiment, we are concerned with their downward trend and rate of decline rather than their absolute values, so the difference among the curves is not simply due to subtracting the negative terms LPA and LPB. In combination with Table 1, the added loss terms also demonstrate a clear improvement in recognition accuracy.

Reviewer#2, Concern # 42: line 421 This sentence is broken. Do you mean “We intend to prove that…” ?       

Author response:  Thank you very much for your valuable suggestion. According to your advice, we have made a correction in that sentence. The reformulated sentence is on line 441.

Reviewer#2, Concern # 43: Please, check the correct capitalization in all your references        

Author response: Thank you for your advice and your careful reading. According to your advice, we have made corrections throughout the References section.

Author Response File: Author Response.doc

Round 2

Reviewer 1 Report

No further comments.

Reviewer 2 Report

The authors tried to revise their manuscript to address its main flaws. However, many concerns remain.

First, the Methodology section still has many obscure points. Specifically, in Section 3.1 (lines 252-263) the authors simply moved part of the former Section 3.1. However, the text remains unclear and chaotic. They must rewrite it to allow the readers to understand what they have done (probably such a detailed description is unnecessary; it should be sufficient to describe in a few sentences how to assign connection marks). In lines 271-283 the authors (unsuccessfully) tried to describe Lin's approach. I suggest reformulating the text, giving an idea of their algorithm and leaving the details to the cited reference.

Second, as I already pointed out in the previous report, the experiments are poorly described, so that their significance is questionable; they must be enriched, also through discussion of the classification results of the different methods on some specific images. Moreover, the authors should check the normalization subsection: from the present text one can assume they don't know the meaning of normalization at all.

Finally, awkward use of English must be very carefully checked, along with the entire References section, where most of the entries are incomplete, misspelled, or incoherently expressed (see annotations in the attached file). This section is far below any decent journal's requirements, in my opinion.

Other comments and suggestions to be mandatorily considered are reported in the annotated pdf of the manuscript.

Comments for author File: Comments.pdf
