Article
Peer-Review Record

Weakly-Supervised Image Semantic Segmentation Based on Superpixel Region Merging

Big Data Cogn. Comput. 2019, 3(2), 31; https://doi.org/10.3390/bdcc3020031
by Quanchun Jiang 1, Olamide Timothy Tawose 2, Songwen Pei 1, Xiaodong Chen 3, Linhua Jiang 1,*, Jiayao Wang 1 and Dongfang Zhao 2,4,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 24 April 2019 / Revised: 1 June 2019 / Accepted: 3 June 2019 / Published: 10 June 2019

Round 1

Reviewer 1 Report

This paper presents a weakly supervised semantic segmentation method using superpixel aggregation as an annotation. The paper is very well written and organized. The topic addressed is timely and in the scope of the journal, the methodology followed is well described and sounds ok.

My comments:

1. Figure 2, 4, 5: it is visible that there are differences between the results, but it is not obvious that which is better and why? “it is visible that x is better than y” is not satisfactory. You should use some quantitative measurements to determine the accuracy of the given methods. For example, manually segment the image and compare the results to this.

2. Some minor formal comments:

-Description of Figure 3 is not complete. What are these regions and numbers?

-Line 213: recheck font styles

3. The manuscript needs to be refined for English grammatical structure and phraseology. The manuscript should be polished by a native English speaker or an English language service could be used.

Author Response

Dear Reviewer:

Thank you very much for taking the time to review our article. Your professional comments point out the shortcomings of the paper; they are valuable for revising and improving it and provide important guidance for our research. We have studied the comments carefully and made the corresponding corrections.

Point 1: Figure 2, 4, 5: it is visible that there are differences between the results, but it is not obvious that which is better and why? “it is visible that x is better than y” is not satisfactory. You should use some quantitative measurements to determine the accuracy of the given methods. For example, manually segment the image and compare the results to this.

Response 1: As for Figure 2 (Figure 3 in the revised manuscript), to show the difference between SLIC and SLICO, we zoomed in on the merged boundary at lines 181-190. Superpixel segmentation performance is usually judged by boundary measurements: the more closely the superpixel edges align with the object boundaries in the image, the more accurate the algorithm. Using Boundary Recall (BR) as the measure, SLIC achieves 61.26%, which is 4.21% higher than SLICO.

The original Figure 4 illustrates the consistency of the experiments. When m > 20, the segmentation curve flattens, so we adopt m = 20 as the similarity (compactness) parameter for superpixel segmentation.

Figure 5 also shows the segmented portion after zooming in; we can see that the third image has a higher boundary coincidence rate (lines 240-242). The performance in Figure 5 was evaluated using the Boundary Recall (BR) method. The BR values of the three criteria are 59.24%, 92.71%, and 63.15%, respectively, which justifies that the third criterion performs better.
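
For reference, the sketch below shows one common way to compute the Boundary Recall (BR) metric cited above, assuming a ground-truth label map and a superpixel label map of the same size; the function name, tolerance value, and library calls are illustrative and are not taken from the paper's implementation.

```python
import numpy as np
from scipy.ndimage import binary_dilation
from skimage.segmentation import find_boundaries

def boundary_recall(gt_labels, sp_labels, tolerance=2):
    """Fraction of ground-truth boundary pixels that lie within `tolerance`
    pixels of a superpixel boundary."""
    gt_boundary = find_boundaries(gt_labels, mode="thick")
    sp_boundary = find_boundaries(sp_labels, mode="thick")
    # Dilate the superpixel boundary so near-misses within the tolerance count as hits.
    sp_near = binary_dilation(sp_boundary, iterations=tolerance)
    hits = np.logical_and(gt_boundary, sp_near).sum()
    total = gt_boundary.sum()
    return hits / total if total > 0 else 1.0
```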

Some minor formal comments: 

Point 2: Description of Figure 3 is not complete. What are these regions and numbers?

Response 2: The regions outlined by the white lines are produced by superpixel segmentation, and the red numbers represent the semantic labels of each region (lines 207-208).

Point 3: Line 213: recheck font styles.

Response 3: Fixed at lines 221 and 225.

Point 4: The manuscript needs to be refined for English grammatical structure and phraseology. The manuscript should be polished by a native English speaker or an English language service could be used.

Response 4: A native English-speaker has helped proofread the revised manuscript.

Author Response File: Author Response.pdf


Reviewer 2 Report

This paper is excellent.  Very enjoyable to read.  Well presented, clear, thorough and would like to read more by these authors.

I heartily recommend the acceptance of the paper.

Author Response

Dear Reviewer:

We are very honored to receive your recognition, and we will continue our research on semantic segmentation. Thank you very much for your comments on the paper.

With our sincere regards.

Author Response File: Author Response.pdf


Reviewer 3 Report

Kindly refer to the attached detailed report for the authors.

Cheers!

Comments for author File: Comments.pdf

Author Response

Dear Reviewer:

Thank you very much for taking the time to review our article. Your professional comments point out the shortcomings of the paper; they are valuable for revising and improving it and provide important guidance for our research. We have studied the comments carefully and made the corresponding corrections.

A. Title

Point 1: The title seems acceptable, although it could do with a little tweaking for improved clarity; here, the first two words of the title could be hyphenated to read “Weakly-supervised…”

Response 1: We have updated the title as per the comment at line 2.

B. Abstract

Point 1: i. Overall, the authors should consider revising the abstract to improve its readability as well as present a succinct summary of the study. In doing so, some quantitative outcomes from the study should be highlighted.

Response 1: We have revised the abstract with quantitative results and key contributions at line 18-34.

Point 2: ii. Very importantly, future perspectives to improve the study should be mentioned.

Response 2: We have highlighted the potential high impact of this work for future research at the end of the abstract at line 31-34.

Point 3: iii. Grammar – Revise use of plural in the phrase “each pixels…”

Response 3: Fixed at line 22.

Point 4: iv. Numerous other grammatical, typographical, spelling, etc. mistakes throughout the section.

Response 4: A native English-speaker has helped us thoroughly proofread the manuscript.

C. Introduction

Point 1: i. Page 1, line 32: Clarity – Revise the statement “…with widespread application…”

Response 1: Fixed at line 38.

Point 2: ii. Page 1, line 33: Grammar – Revise the statement “…methods (4-9) relies on…”

Response 2: Fixed at line 39.

Point 3: iii. Page 1, lines 42-45: Maintain consistency in tense used throughout the literature review.

Response 3: We have revised the text to make the tense consistent at line 47, 49, 52.

Point 4: iv. Page 2, line 71: Standards – Cite source(s) of the theoretical hypothesis in the statement “…According to theoretical hypothesis…”

Response 4: The rationale for the theory comes from the references [25] and [34]. We have added both references in the revised manuscript at line 78.

Point 5: v. Page 2, line 80: Clarity – Specify and mention the criterion used in “…certain criterion… “.

Response 5: We have revised the manuscript to clarify the three criteria used in the paper. (Lines 87-88).

Point 6: vi. Page 2, lines 83-84: Clarity – Revise the confusion arising from the use of “…The last step…“ and then following up with “…Then we…” in the next sentence.

Response 6: We have updated the text on line 91; “the last step” was replaced with “Furthermore”.

Point 7: vii. Page 2, lines 92-95: Revise the inconsistent tense usage in the description of the manuscript layout. The norm is to report in the present tense.

Response 7: We have updated the text with the present tense. (Lines 100-103).

Point 8: viii. Numerous other grammatical, typographical, spelling, etc. mistakes throughout the section.

Response 8: A native English-speaker has carefully proofread this section.

D. Section 2

Point 1: i. Page 3, line 97: Standards – Improve quality of Fig. 1 and the text in it.

Response 1: We increased the font size in the image and changed the color of the dividing lines to make Fig. 1 clearer.

Point 2: ii. Page 3, line 102: Clarity – Revise the statement “…a large number of super pixels…” for clarity. Presently, it seems to nullify the definition of superpixel presented earlier in lines 99-100.

Response 2: Superpixels represent blocks of pixels with similar features, and each superpixel block consists of a large number of pixels. We revised this sentence at lines 110-111.

Point 3: iii. Page 3, line 108: Standards – Define the acronym FCN at its first mention in the section (i.e. line 108).

Response 3: FCN stands for Fully Convolutional Network; we have revised this at line 116.

Point 4: iv. Page 3, lines 110-111: Revise the phrase “…good in effect today…” Instead, consider “… widely used today …” or something similar.

Response 4: We have corrected it at Line 119.

Point 5: v. Page 3, lines 113-114: Technical – First, box-based training should be explained and then its shortcoming in comparison with the preferred superpixel merging should be highlighted. Presently, the argument is unsubstantiated.

Response 5: We have updated the text to first (briefly) explain how box-based training works and then discuss the pros and cons of box-based training and our proposed method in this paper. (Lines 121-124).

Point 6: vi. Page 3, line 124: Provide evidence to corroborate the observation in the statement “…We observed…”

Response 6: When we tested the superpixel region merging method separately, it consumed an additional 0.172 seconds, which undoubtedly increases the computational cost. We added this evidence at lines 131-132.

Point 7: vii. Page 4, line133: Technical – Provide an example of VGG16 image in the statement “…The input to VGG16 is an…”.

Response 7: We have fixed this by adding the network structure of VGG16 (Figure 2 in the revised manuscript).

Point 8: viii. Page 4, line 142: Standards – Provide a citation for the PASCAL VOC2012.

Response 8: We have added the reference for VOC2012 [30] at line 151.

Point 9: ix. Page 4, line 147: Clarity – Revise use of “migrate” in the phrase “…we migrated to learn…”

Response 9: We have revised the text (at Lines 156-157 in the revised manuscript).

Point 10: x. Page 4, line 152: Provide citations for “BoxSup” and “other improvements”.

Response 10: We have added these references (Line 161).

Point 11: xi. Page 4, line 155: Clarity – Revise the statement “…at least on corresponding to the transfer…” for clarity.

Response 11: To make it easier to understand, we changed this sentence to "(1) The image contains at least one superpixel label." (Line 163).

Point 12: xii. Page 4, line 160: Flush text to the left and include the sentence “where b …” as part of the equation sentence.

Response 12: We have corrected it at line 168.

Point 13: xiii. Page 4, line 163: Standards – Provide full meaning of the acronym “SLIC”.

Response 13: We added the full name of SLIC to the heading of Section 2.3 at line 171.

Point 14: xiv. Page 4, line 167: Clarity – Delete or revise the redundant phrase “…On the contrary…”

Response 14: We have corrected it at Line 175.

Point 15: xv. Page 4, line 170: Technical – Provide additional arguments to support the choice of SLIC over SLICO. What is the implication of this choice, especially considering its shortcomings mentioned in the lines 165-166?

Response 15: We have added a new paragraph to justify the choice of SLIC over SLICO at Lines 183-190.

Point 16: xvi. Page 5, line 175: Technical – Fig. 2 does not convey the benefits of using both SLIC and SLICO. What are other, more efficient techniques?

Response 16: We modified Figure 2 (now Figure 3 in the revised draft) to magnify the boundaries of the object and demonstrate the advantages of SLIC. We did not find any literature showing a more efficient method for superpixel segmentation.

Point 17: xvii. Page 5, line 180: Grammar – Revise the duplicated use of prepositions ‘as’ and ‘for’ in the phrase “…as for each…”

Response 17: We have corrected it at line 192.

Point 18: xviii. Page 5, line 185-186: Merge the two sentences in lines 185-186.

Response 18: Fixed at line 197.

Point 19: xix. Page 5, line 191: Standards – Flush text to the left to include the sentence “…where Ik…” as part of the equation sentence.

Response 19: Fixed at line 202.

Point 20: xx. Page 5, line 193: Clarity – Revise use of “pictures” (images?) and define the parameter m in the statement “…pictures with different m values…”

Response 20: The correct representation is “images”, and the definition of m is explained at line 205.

Point 21: xxi. Page 5, line 195: Flush text to the left to include the sentence “…where x, y…” as part of equation sentence.

Response 21: We have corrected these formats. (Line 203).

Point 22: xxii. Numerous other grammatical, typographical, spelling, etc. mistakes throughout the section.

Response 22: We have requested an English-speaking classmate to conduct a thorough review and revision.

E. Section 3

Point 1: i. Page 6, line 201: Clarity – Revise the statement “…our segmentation accuracy…” or explicitly specify the accuracy being referred to.

Response 1: We have updated the text on Lines 211-213.

Point 2: ii. Page 6, line 203: Clarity – The statement “…meets the requirements of our algorithm…”is vague. The requirements inferred should be specified.

Response 2: Since this sentence is ambiguous, we deleted this sentence.

Point 3: iii. Page 6, line 207: Clarity – Revise “…[35] (RAG)…” to “…(RAG) [35]…”

Response 3: Fixed (Line 216).

Point 4: iv. Page 6, line 213: Standards – Flush text to the left to include the sentence “…where Ci…” as part of the equation sentence.

Response 4: We have corrected these formats. (Line 221).

Point 5: v. Page 6, line 216: Standards – Flush text to the left to include the sentence “…where Gc…” as part of the equation sentence.

Response 5: We have corrected these formats. (Line 225).

Point 6: vi. Page 6, line 217: Clarity – Revise ‘u’ to ‘μ’?

Response 6: The correct one is ‘μ’ and we have fixed it at line 226.

Point 7: vii. Page 7, line 228: Clarity – Elaborate on the sentence “…Hence, the color intensity of the means is the key to decide whether to merge…” Consider rewriting this sentence.

Response 7: We rewrote this sentence as "Therefore, the mean intensity of the color is very important." (Line 237).

Point 8: viii. Page 7, line 232: Clarity – Revise the caption of Fig. 5

Response 8: We modified Figure 5 to zoom in on some of the merged areas to show the advantages of the guidelines more clearly.

Point 9: ix. Page 7, line 242, Algorithm 1: Technical – It is not enough to mention “…According to SLIC algorithm…” in step 1. Relevant aspect of the SLIC algorithm must be included in the Algorithm 1.

Response 9: We have added the SLIC algorithm to Algorithm 1; steps 1-5 are the SLIC superpixel segmentation steps, which produce the cluster set C = {c_1, c_2, ⋯, c_k}.
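
As a minimal illustration of how this first step could be realized, the sketch below runs SLIC with the compactness parameter m = 20 discussed earlier and collects the resulting clusters; the file name and the number of segments are placeholders, not values from the paper.

```python
import numpy as np
from skimage import io
from skimage.segmentation import slic

image = io.imread("example.jpg")                  # placeholder input image
# compactness corresponds to the similarity parameter m = 20 discussed earlier;
# n_segments is an illustrative value, not one taken from the paper.
sp_labels = slic(image, n_segments=500, compactness=20, start_label=0)
clusters = [np.flatnonzero(sp_labels == k) for k in np.unique(sp_labels)]
print(f"{len(clusters)} superpixels obtained")
```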

Point 10: x. Page 7, line 242, Algorithm 1: How is an “untouched superpixel” in step 3 determined? How does such selection affect the computational cost?

Response 10: Superpixels are accessed sequentially from the cluster collection, and each superpixel carries a flag bit that is set once it has been visited; an untouched superpixel is one whose flag is not yet set. Because the superpixels are selected sequentially, the selection is simple but not especially efficient, so the computational cost becomes significant.

Point 11: xi. Page 7, line 242, Algorithm 1: Elaborate on the “priority merge right” in step 5.

Response 11: The priority merge right in step 10 takes the region area into account in these criteria. When the regional heterogeneity is the same, a smaller region introduces a smaller approximation error for the whole image when merged, so the merge is also less costly. During region merging, priority is therefore given to merging smaller regions into larger ones, which also helps to stabilize the regional features.

Point 12: xii. Page 7, line 242, Algorithm 1: State what happens when the condition “if” in step 7 is not satisfied.

Response 12: When the “if” condition is not satisfied, the superpixel pair is not merged and the flag of the candidate superpixel is left unchanged. We have made this explicit in step 12 of the revised draft.
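
The control flow described in Responses 10-12 could be sketched as follows. This is only an illustration under assumed data structures (a region dictionary, an adjacency map, and a should_merge predicate standing in for the paper's G-statistic and colour-intensity criterion), not the authors' implementation.

```python
def merge_regions(regions, adjacency, should_merge):
    """regions: dict id -> {'area': int, ...}; adjacency: dict id -> set of neighbour ids."""
    visited = {rid: False for rid in regions}            # per-superpixel flag bit
    # Priority merge: walk regions from the smallest area upward so that small regions
    # are absorbed first and the overall approximation error stays low.
    for rid in sorted(regions, key=lambda r: regions[r]["area"]):
        if visited[rid]:
            continue                                      # skip regions already absorbed
        for nid in list(adjacency[rid]):
            if visited[nid] or nid == rid:
                continue
            if should_merge(regions[rid], regions[nid]):
                regions[rid]["area"] += regions[nid]["area"]
                adjacency[rid] |= adjacency[nid] - {rid, nid}
                visited[nid] = True                       # set the absorbed region's flag
            # else: the pair stays unmerged and the neighbour's flag is left untouched
    return regions, visited
```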

Point 13: xiii. Page 8, line 245: Standards – Include citations for “GoogleNet” and “ResNet”.

Response 13: We have added references for these networks [43][44]. (Line 254).

Point 14: xiv. Page 8, line 267: Clarity – Revise the statement “…is equal to 1, otherwise 0…” for clarity.

Response 14: We have modified this on line 274 and changed it to "the function value is equal to 1; otherwise, it is equal to 0".

Point 15: xv. Page 8, line 270: Clarify the statement “…algorithm to the training…”

Response 15: To improve readability, we changed this sentence to "we apply the region merging algorithm to train the model." (Line 278).

Point 16: xvi. Page 8, line 272: Grammar – Insert missing period mark or revise use of capitalisation in the statement “…truth In the…”

Response 16: We have made corrections at line 280.

Point 17: xvii. Page 9, Algorithm 2: Clarity – Revise use of “get” in steps 1 and 2 of Algorithm 2.

Response 17: We updated them to "Construct" and “Build” in steps 1 and 2.

Point 18: xviii. Page 9, lines 283-284: Clarity – Revise the phrase “…label update area label…” for clarity.

Response 18: We have made corrections at line 288.

Point 19: xix. Numerous other grammatical, typographical, spelling, etc. mistakes throughout the section.

Response 19: A native English-speaker has helped us proofread the paper.

F. Section 4

Point 1: i. Page 9, line 296: Clarity – Reconsider the use of the noun “ablation”

Response 1: We deleted "the" and changed the phrase to "in this work". (Line 302).

Point 2: ii. Page 10, line 303: Technical – Elaborate on the statement “…and the achievable segmentation accuracy…”

Response 2: We have updated the text at Lines 310-312.

Point 3: iii. Page 10, line 310 (Figure 9): Clarity – Revise “…performance…” to “performances”. iv. Page 10, line 318 (Figure 10): Clarity – Revise “…performance…” to “performances”.

Response 3: We have corrected both. (Lines 319, 327).

Point 4: v. Page 11, lines 321-322: Clarity - Revise the use of underscore (i.e. “_”) in “MG_shape” and “Num_of_Region”

Response 4: MG_shape represents the shape parameter, and Num_of_Region represents the number of merged regions. We removed the underscores and used the full names. (Lines 313, 330-331).

Point 5: vi. Page 11, line 327: Clarity – Revise the statement “…results, that is the Jaccard Index…” for clarity.

Response 5: We have made corrections (Line 336).

Point 6: vii. Page 11, line 332: Clarity – Revise the statement “…with size in …”.

Response 6: We updated the text at Lines 340-342.

Point 7: viii. Page 12, line 337: Clarity – State where as shown in the statement “…As shown…” was “shown”.

Response 7: We have modified it to "As shown in figure 11". (Line 347).

Point 8: ix. Page 12, line 341: Clarity – Reconsider use of the preposition ‘of’ in the statement “details of the target”.

Response 8: We changed "of" to "in". (Line 351).

Point 9: x. Page 12, line 348: Standards – Revise use of “picture” to “image”.

Response 9: We have corrected it. (Line 358).

Point 10: xi. Page 12, line 350: Clarity/Spelling - Rewrite the sentence “…tried to…”

Response 10: We changed this sentence to "We further performed experiments on..." (Line 360).

Point 11: xii. Page 12, line 351: Clarity – Clarify the “5K data set” referenced and include a citation.

Response 11: 5K is the resolution of the images, rather than the number of images. We have updated the text accordingly (Line 362).

Point 12: xiii. Page 12, line 353: Clarity – Revise the statement “…is used as post processing…” for clarity.

Response 12: We have updated it to "...was used as a post-processing step..." at Lines 363-365.

Point 13: xiv. Page 12, line 357: Technical – State which dataset was used for the quoted result “…this dataset was 48.1%...”

Response 13: "this dataset" represents the PASCAL-CONTEXT data set. We have made changes to this. (Line 368-369).

Point 14: xv. Page 12, line 363: Clarity – The claim that Jaccard index was “discussed” in section 4.2 is inaccurate. Considering its importance in the validation of the study, this index should be truly discussed.

Response 14: What we really wanted to emphasize was the IoU value represented by the Jaccard index. Therefore, we have updated the claim to IoU at Line 376.

Point 15: xvi. Page 13, line 369: Clarity/Technical – Elaborate on how “superpixel constraints” are performed.

Response 15: Superpixel constraint: the initial step of RMNN is superpixel segmentation. For each image I, we obtain an initial set of superpixel labels. Superpixel merging encourages groups of pixels to share constraints, so that merging yields the outline of the target. While this approach encourages certain pixels to take a specific label, it is usually insufficient to correctly label all pixels. (Lines 382-386).
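
As an illustration of how such a superpixel constraint can be turned into pixel-level supervision, the sketch below expands per-superpixel class labels into a pixel-wise pseudo-label mask. The function and variable names are hypothetical, and the 21 classes follow the PASCAL VOC convention (20 object classes plus background) mentioned in the next response.

```python
import numpy as np

def superpixels_to_pseudo_mask(sp_labels, sp_class, num_classes=21):
    """Expand per-superpixel class labels into a pixel-wise pseudo-label mask.

    sp_labels: 2-D array of superpixel ids (e.g. the SLIC output).
    sp_class:  dict mapping a superpixel id to a semantic class (0 = background).
    """
    mask = np.zeros_like(sp_labels)            # unlabeled pixels default to background
    for sp_id, cls in sp_class.items():
        mask[sp_labels == sp_id] = cls         # every pixel in the block gets the label
    assert mask.max() < num_classes
    return mask
```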

Point 16: xvii. Page 13, line 375: Technical – The statement “…segment to 21 targets…” is rather vague; the authors should expatiate on it.

Response 16: The 21 labels comprise the 20 object labels shown in Figure 8 plus a black background label. (Lines 389-392).

Point 17: xviii. Page 13, line 377: Standards – Include citations for all methods reported in Table 1.

Response 17: We have added references to these methods in Table 1.

Point 18: xix. Page 13, line 383: Technical – Provide arguments to support the choice of “size” as a constraint.

Response 18: Size is one of the constraints used by the CCNN method (it acts as a form of image-level annotation), and CCNN performs weakly supervised semantic segmentation with size as a constraint. In our superpixel merging method, enclosing the target within superpixel boundary lines implicitly provides the size of the target. Moreover, CCNN allows many additional constraints to be added, and our method is most similar to its semantic segmentation with a size constraint.

Point 19: xx. Page 13, line 387: Standards – Provide citation for pixel accuracy (PA).

Response 19: We have added a reference to the PA. (Line 404).

Point 20: xxi. Page 14, line 401: Grammar - Revise the statement “…is still lack…” for grammatical accuracy.

Response 20: We changed "...is still lack..." to "it lacks...". (Line 418).

Point 21: xxii. Numerous other grammatical, typographical, spelling, etc. mistakes throughout the section.

Response 21: We have requested a native English-speaker to proofread the paper.

G. Conclusions

Point 1: i. Page 14, lines 415-416: Clarity – Although an honest and significant contribution; this sentence should be rewritten for grammatical clarity.

Response 1: We have rewritten the entire conclusion. (Lines 422-432).

Point 2: ii. The conclusion section should be revised to include quantitative results emanating from the study as well as future perspectives to improve or revise the study. For example, the authors should provide insights regarding how the shortcomings mentioned in lines 375 and 415-416 can be ameliorated in future work.

Response 2: We have revised the conclusion to discuss the shortcomings and future work to address them. (Lines 428-432).

Point 3: iii. Overall, the conclusions drawn from the study appear to be shabbily written. This section is supposed to reflect on the entire content of the manuscript. Presently, it has not adequately done so, and it should be carefully revised.

Response 3: The conclusion has been completely rewritten. (Lines 422-432).

Point 4: iv. Numerous other grammatical, typographical, spelling, etc. mistakes throughout the section.

Response 4: fixed.

H. References

Point 1: i. Page 16, line 24: Standards – Include year and other details for References 24, and 27.

Response 1: We added the year and page number information for both documents. (Line 497,505).

Point 2: ii. With 15 out of 39 references published before 2015, it is safe to say most of the cited bibliography is old. The authors should consider enriching the article by adding some more recent literature (i.e., 2015-2019).

Response 2: We have added eight more references between 2015-2019 ([1], [3], [9], [24], [27], [42-44], [46]).

Author Response File: Author Response.pdf


Round 2

Reviewer 3 Report

Kindly refer to the attached detailed report for authors.

Cheers!

Comments for author File: Comments.pdf

Author Response

Dear Reviewer:

Thank you very much again for reviewing the paper. Your professional advice is very important for improving it. Thank you for your questions about the details of the paper; we have made corrections accordingly.

B. Abstract

Point 1: i. Page 1, lines 19-20: Clarity – Rewrite the sentence “…In weakly supervised semantic…” for clarity.

Response 1: We have rewritten lines 19-20.

Point 2: ii. Page 1, line 19: Standards – Consider hyphenating compound words such as “weakly-supervised”, “super-pixels”, etc. wherever used throughout the revised manuscript.

Response 2: We have made corrections and highlighted the changes. (Lines 26, 27, 36, 43, 57, 58, 61, 64, 68, 112, 158, 182, 187, 213, 313, 317, 347, 355, 375, 390, 397, 417, 423, 430, 431, 444).

Point 3: iii. Page 1, line 23: Grammar – Revise use of the verb “cause” in the statement “…Rouge predictions are…” for improved readability.

Response 3: We have changed "cause" to "so that" in line 23.

Point 4: iv. Page 1, line 31: Define the acronym “mIoU” for clarity.

Response 4: We gave the full name of mIoU in line 31.

Point 5: v. Other grammatical, typographical, spelling, etc. mistakes throughout the section.

Response 5: A native English-speaker has carefully proofread this section.

C. Introduction

Point 1: i. Page 1, line 38: Standards – Citations supporting the claim “…widespread applications of CNN…” are rather nominal. Perhaps the suggested references [1] and [2] can be added to enrich the manuscript and support the claim made.

Response 1: We have corrected the original reference [2], [7] after careful analysis. (Lines 464,477).

Point 2: ii. Page 2, line 55: Standards – Revise spacing in “includes:a random…”

Response 2: Fixed in line 56.

Point 3: iii. Page 2, lines 61-62: Clarity – Revise the statement “…The well-known advantages of…” for clarity.

Response 3: We have revised the text in line 62-64.

Point 4: iv. Page 2, line 66: Clarity – Revise use of “being” in the phrase “…is being used frequently…”

Response 4: We removed "being" in line 67.

Point 5: v. Page 3, line 102: Standards – Define the acronym “VGG16” as its first mention here.

Response 5: We have added the full name of VGG at line 102. "VGG" stands for the Visual Geometry Group at the University of Oxford, and "16" refers to the number of weight layers.

Point 6: vi. Other grammatical, typographical, spelling, etc. mistakes throughout the section.

Response 6: A native English-speaker has carefully proofread this section.

D. Section 2

Point 1: i. Page 3, lines 108-110: Clarity – Improve the definition of superpixel in lines 108-110.

Response 1: We have modified the definition of super-pixel in line 108-109.

Point 2: ii. Page 3, line 114: Clarity – Revise the sentence “…The main algorithm in …” for clarity.

Response 2: We have made corrections. (Lines 114-115).

Point 3: iii. Page 3, line 129: Grammar – Reconsider use of “that” in the statement “…that employs manual…”

Response 3: We have changed "that" to "who" in line 129.

Point 4: iv. Page 3, lines 132-133: Clarity – Rewrite the sentence “…We have tested that the…” for clarity.

Response 4: We have revised the text in line 132-133.

Point 5: v. Page 4, lines136-137: Clarity – Clarify the confusion in the statement “…We know that the fusion…”

Response 5: We have made corrections. (Lines 136-137).

Point 6: vi. Page 4, lines 143-144: Grammar – Revise choice of coordinating conjunction in the statement “…training, but the accuracy is not …” for improved readability.

Response 6: We have made corrections in line 144.

Point 7: vii. Page 4, lines 149-150: Clarity – Elaborate on the explanation of transfer learning and also provide citations.

Response 7: In transfer learning, a network is first trained on a large source dataset such as ImageNet (a million-scale dataset). Using that network directly on PASCAL VOC may not work well, because the data distributions and label sets of ImageNet and PASCAL differ. We therefore fine-tune the ImageNet-pretrained network on the smaller PASCAL VOC dataset so that it can be used effectively on the small dataset. This is introduced at lines 149-151.

We added a reference [48] to it.
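
For illustration only, here is a minimal classification-style sketch of this fine-tuning idea using the torchvision VGG16 weights pre-trained on ImageNet; it shows the general transfer-learning pattern described above, not the authors' actual training code, and the hyper-parameters are placeholders.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 21                                   # 20 PASCAL VOC classes + background
vgg16 = models.vgg16(pretrained=True)              # weights learned on ImageNet

# Freeze the convolutional features learned on the large source dataset ...
for p in vgg16.features.parameters():
    p.requires_grad = False

# ... and replace the classifier head so it matches the smaller target label set.
vgg16.classifier[6] = nn.Linear(4096, NUM_CLASSES)

optimizer = torch.optim.SGD(
    (p for p in vgg16.parameters() if p.requires_grad), lr=1e-3, momentum=0.9
)
# The network would then be fine-tuned on PASCAL VOC with this optimizer.
```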

Point 8: viii. Page 4, lines 161-162: Grammar – Revise grammar associated with using “many” in the sentence “…many existing CNN-based scavenging…”

Response 8: We have made corrections. (Lines 115, 162).

Point 9: ix. Page 4, lines 164-166: Grammar – Revise punctuation following the use of the colon. The items (1), (2) and (3) should be linked via a comma “,”.

Response 9: We have made corrections in line 166.

Point 10: x. Page 5, lines 189-190: Standards – Define the acronym BR.

Response 10: We have added the full name of BR in line 191. The meaning of BR is explained in line 311.

Point 11: xi. Other grammatical, typographical, spelling, etc. mistakes throughout the section.

Response 11: A native English-speaker has carefully proofread this section.

E. Section 3

Point 1: i. Page 6, line 210: Standards – Revise the caption/ heading of Section 3. Consider making it less generic.

Response 1: We changed the heading of Section 3 to "Regional Merge Algorithm". (Line 211).

Point 2: ii. Page 7, line 226: Standards – Flush text to the left to include the sentence “…Where Gc is the G-statistic…” as part of the equation sentence.

Response 2: Fixed (Line 227).

Point 3: iii. Page 7, line 233: Clarity – Revise the phrase “…when λ≠0 indicates…” for clarity.

Response 3: Fixed (Line 234).

Point 4: iv. Page 8, line 250: Grammar – Revise punctuation in the statement “…with similar size area; each pixel…’

Response 4: Fixed. (Line 251).

Point 5: v. Page 9, Algorithm 1, line 8: Clarity – Revise step 8 of Algorithm 1 for clarity.

Response 5: We have revised step 8.

Point 6: vi. Page 9, line 260: Clarity – Revise the statement “…can be considered that the…” for clarity.

Response 6: We have rewritten this sentence in line 260.

Point 7: vii. Page 9, line 266: Clarity – Revise the statement “…transferring model. In the transfer learning, because…” in the caption of Fig. 6.

Response 7: We have rewritten this sentence in line 266-267.

Point 8: viii. Page 10, line 270: Clarity and Standards – Revise the statement “…original image using SLIC, merge…” for clarity.

Response 8: We have corrected it. (Lines 270-271).

Point 9: ix. Page 10, Algorithm 2: Clarity and Standards – Maintain consistency in verb/adjective used in outlining the steps of Algorithm 2 (especially step 3).

Response 9: Fixed.

Point 10: x. Other grammatical, typographical, spelling, etc. mistakes throughout the section.

Response 10: A native English-speaker has helped us proofread the paper.

F. Section 4

Point 1: i. Page 11, line 296: Clarity – Revise heading of Section 4.

Response 1: We have changed the heading to "Experiment Analysis”. (Line 295)

Point 2: ii. Page 11, line 306: Standards – Write ‘20’ in words (i.e. caption of Fig. 8).

Response 2: Fixed.

Point 3: iii. Page 11, lines 315-316: Revise the phrase “…In the respect of BR…” for clarity.

Response 3: We removed “the”. (Line 314).

Point 4: iv. Page 12, line 338: Standards – Include appropriate citation to credit the “Jaccard Index”.

Response 4: We added a reference [47] to it in line 337.

Point 5: v. Page 13, line 358: Grammar – Revise choice of preposition and punctuation in the sentence “…especially to maintain the …” for clarity. Also provide motivation for the choices.

Response 5: We have rewritten this sentence in line 356.

Point 6: vi. Page 14, line 376: Technical – The statement “…we used some indicators to get a set of…” is vague. State the “indicators” used in the evaluation.

Response 6: We have corrected it at line 374.

Point 7: vii. Page 14, line 385: Clarity – Revise use of the verb “get” in the statement “…for each image…”

Response 7: We changed "get" to "obtain". (Line 383).

Point 8: viii. Page 14, line 388: Technical – Expatiate on the constraints used in the proposed methods.

Response 8: We constrain the target by merging superpixels into a super-pixel block and marking as much of the target foreground as possible. Foreground and background labels are distinguished, larger regions are synthesized iteratively, and the final mask is kept as close as possible to the outline of the target. Therefore, the constraints in this paper are limited to size constraints on the target; no other constraints are introduced. Our constraint method makes predictions per pixel block rather than per pixel, and we use this information during training by accessing the size of the object. (Lines 386-389).

Point 9: ix. Page 15, line 396: Standards – Provide better way to separate listing of mIoU from the categories in Table 1.

Response 9: We also created Table 2 to compare mIoU and PA. (Line 406).

Point 10: x. Page 15, line 410: Technical – Provide a mathematical definition for IoU used in Equation 8.

Response 10: We added the IoU formula and explained its mathematical meaning. (Lines 412-414).
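
For completeness, the standard per-class IoU and mIoU definitions are given below; we assume these correspond to the formula added as Equation 8. Here \hat{Y}_c and Y_c denote the predicted and ground-truth pixel sets for class c, and C is the number of classes.

```latex
\mathrm{IoU}_c \;=\; \frac{|\hat{Y}_c \cap Y_c|}{|\hat{Y}_c \cup Y_c|}
\;=\; \frac{TP_c}{TP_c + FP_c + FN_c},
\qquad
\mathrm{mIoU} \;=\; \frac{1}{C}\sum_{c=1}^{C} \mathrm{IoU}_c
```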

Point 11: xi. Page 16, line 414: Grammar and Clarity – Revise the of the preposition “in” in the statement “…methods in mIoU…” perhaps, “in terms of “, or “for” would be better.

Response 11: We changed "in" to "for". (Line 419).

Point 12: xii. Other grammatical, typographical, spelling, etc. mistakes throughout the section.

Response 12: We have requested a native English-speaker to proofread the paper.

G. Conclusions

Point 1: i. Page 16, line 423: Standards – As part of the concluding remarks, a brief overview of the proposed method should be provided.

Response 1: We have corrected this at lines 433-442 and 450-452.

Point 2: ii. Other grammatical, typographical, spelling, etc. mistakes throughout the section.

Response 2: Fixed.

H. References

Point 1: i. As suggested in comment C(i) (i.e. in Section C of this report), the citations supporting the claim of “widespread applications of CNN” are rather inadequate. The authors should expand this, and references [1] and [2] (below) are recommended for that purpose.

Response 1: To address the lack of citations on CNN applications, after careful consideration we removed the original references [2] and [7] and replaced them with the suggested references [1] and [2].

Author Response File: Author Response.pdf
