Article
Peer-Review Record

Efficient Layer-Wise Cross-View Calibration and Aggregation for Multispectral Object Detection

Electronics 2026, 15(3), 498; https://doi.org/10.3390/electronics15030498
by Xiao He 1, Tong Yang 2, Tingzhou Yan 3, Hongtao Li 3, Yang Ge 4, Zhijun Ren 4, Zhe Liu 5,*, Jiahe Jiang 6,* and Chang Tang 7,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 18 December 2025 / Revised: 9 January 2026 / Accepted: 10 January 2026 / Published: 23 January 2026
(This article belongs to the Special Issue Multi-View Learning and Applications)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper proposes LCMA, an efficient one-stage RGB-infrared oriented object detector that performs neighborhood-aware feature alignment and redundancy suppression through layer-wise inter-modal calibration (ISRA) and gated coupled filtering (GCF), and achieves a favorable balance between accuracy and speed on multiple drone and pedestrian datasets. However, the following issues should be addressed.
1. The details of the GCF module are insufficiently described; please include a diagram illustrating its architecture.
2. Section 3.6.3 lacks discussion of more extreme scenarios, such as extremely low-light conditions or strong infrared reflections.
3. Section 3.2.4 does not specify the names of the baseline methods compared against the proposed approach.
4. In Section 3.5.2, please provide a detailed explanation for the claimed superior small-object detection performance of DarkNet53.

Author Response

Comments 1: The details of the GCF module are insufficiently described; please include a diagram illustrating its architecture.

Response 1: Thank you for the suggestion. We have added a detailed diagram of the GCF module (see Fig. 6) along with a clear description of its architecture. The module features a dual-branch structure for cross-modal feature interaction and adaptive gating to enhance complementary features while suppressing redundancy.
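For illustration only, the following is a minimal PyTorch sketch of a dual-branch block with an adaptive gate of the kind described above; the class and layer names (GCFSketch, gate_conv) are our own assumptions and do not reproduce the authors' actual GCF implementation.

```python
# Minimal sketch of a dual-branch gated fusion block. Illustrative only:
# names and layer choices are assumptions, not the authors' GCF code.
import torch
import torch.nn as nn

class GCFSketch(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # One branch per modality refines its own features.
        self.rgb_branch = nn.Conv2d(channels, channels, 3, padding=1)
        self.ir_branch = nn.Conv2d(channels, channels, 3, padding=1)
        # A gate predicted from both modalities weights their contributions.
        self.gate_conv = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, f_rgb: torch.Tensor, f_ir: torch.Tensor) -> torch.Tensor:
        r = self.rgb_branch(f_rgb)
        t = self.ir_branch(f_ir)
        g = self.gate_conv(torch.cat([f_rgb, f_ir], dim=1))  # values in [0, 1]
        # Blend the two branches with the learned gate: complementary
        # features are emphasized, redundant or degraded ones suppressed.
        return g * r + (1.0 - g) * t

# Example: fuse two 256-channel feature maps of size 64x64.
fused = GCFSketch(256)(torch.randn(1, 256, 64, 64), torch.randn(1, 256, 64, 64))
print(fused.shape)  # torch.Size([1, 256, 64, 64])
```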

 

Comments 2: Section 3.6.3 lacks discussion of more extreme scenarios, such as extremely low-light conditions or strong infrared reflections.

Response 2:

 

We sincerely thank the reviewer for raising this valuable point. We agree that the discussion of extreme scenarios is essential for evaluating the robustness of a multispectral object detection framework. In the revised manuscript, we have expanded Section 3.6.3 to include a dedicated analysis of performance under extremely low-light conditions and strong infrared reflections.

 

Specifically, we have added the following points:

  1. Extremely Low-Light Conditions: In cases where RGB information is severely degraded (e.g., near-dark environments), the infrared modality becomes the dominant source of information. We discuss how LCMA’s gated coupling mechanism dynamically reweights feature contributions, allowing the model to rely more on thermal signatures while suppressing noisy RGB inputs; a small illustrative sketch of this reweighting idea follows this list. We also provide qualitative results on a low-light subset of the LLVIP dataset to illustrate this behavior.
  2. Strong Infrared Reflections: Scenarios with high thermal reflection (e.g., from glass, polished metal, or water surfaces) can cause false thermal signatures that mislead detection. We analyze how the proposed neighborhood-aware alignment helps to distinguish between true object emissions and reflective artifacts by integrating local spatial context from both modalities. An ablation study confirms that the GCF module reduces false positives under such conditions compared to baseline fusion strategies.
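As referenced in point 1 above, the snippet below is a toy, non-learned stand-in for the reweighting idea: a simple signal-energy weight (our own illustrative construction, not the learned gate in LCMA) shrinks the RGB contribution when the RGB stream carries little information.

```python
# Toy, non-learned illustration of modality reweighting: when the RGB
# feature map carries little signal (e.g., near-dark frames), its weight
# shrinks and the IR features dominate the fused output. This is only a
# hand-crafted stand-in for the learned gating mechanism described above.
import torch

def energy_gate(f_rgb: torch.Tensor, f_ir: torch.Tensor, eps: float = 1e-6):
    # Per-sample signal energy of each modality.
    e_rgb = f_rgb.abs().mean(dim=(1, 2, 3))
    e_ir = f_ir.abs().mean(dim=(1, 2, 3))
    w_rgb = e_rgb / (e_rgb + e_ir + eps)
    return w_rgb  # weight given to the RGB stream, in [0, 1]

normal_rgb = torch.randn(1, 256, 64, 64)
dark_rgb = 0.01 * torch.randn(1, 256, 64, 64)   # severely degraded RGB
ir = torch.randn(1, 256, 64, 64)

print(energy_gate(normal_rgb, ir))  # roughly 0.5: both streams contribute
print(energy_gate(dark_rgb, ir))    # close to 0: fusion leans on IR
```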

 

These additions strengthen the discussion on the model’s limitations and generalization ability, particularly in challenging real-world environments. We believe this further clarifies the operational boundaries and practical applicability of LCMA.

 

 

Comments 3: Section 3.2.4 does not specify the names of the baseline methods compared against the proposed approach.

Response 3: Thank you for your valuable comment. We have updated Section 3.2.4 to explicitly specify all baseline methods used in the comparative experiments on the FLIR dataset. The revised text now clearly lists the six benchmark methods: YOLOv5, Faster R-CNN, Halfway Fusion, GAFF, ProbEn, and CSSA, along with their corresponding performance metrics in Table 3. This addition ensures full transparency regarding the compared methods and strengthens the reproducibility of our experimental results.

 

We appreciate your careful review, which has helped improve the clarity and completeness of our manuscript.

 

 

Comments 4: In Section 3.5.2, please provide a detailed explanation for the claimed superior small-object detection performance of DarkNet53.

Response 4: Thank you for your constructive suggestion. We have expanded Section 3.5.2 to provide a more detailed technical explanation for the superior small-object detection performance of DarkNet53.

The revised content reads as follows: However, DarkNet53 maintains high resolution throughout its layers, improving its ability to detect small objects accurately. This advantage stems primarily from its architectural design: DarkNet53 retains high-resolution feature maps and performs multiscale feature fusion more effectively than transformer-based models, allowing for finer-grained spatial details to be preserved throughout the network—a critical property for detecting small objects in cluttered aerial scenes. Furthermore, its convolutional inductive bias favors local feature interactions, which proves particularly beneficial for capturing subtle appearance cues of small targets that are often lost in the global self-attention mechanisms of transformers.
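As a rough illustration of the resolution argument in the revised text, the simplified DarkNet-style sketch below (our own, with indicative channel counts and strides; it is neither the authors' code nor the full DarkNet53 with residual blocks) shows how the earlier, higher-resolution stages retain the spatial detail that small-object detection relies on.

```python
# Simplified DarkNet-style backbone sketch (illustrative only): strided
# convolutions halve resolution stage by stage, and the earlier,
# higher-resolution maps (e.g., stride 8) keep the fine spatial detail
# needed for small objects.
import torch
import torch.nn as nn

def conv_bn_act(c_in, c_out, stride):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1, inplace=True),
    )

class TinyDarkNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.stem = conv_bn_act(3, 32, stride=1)
        self.stage1 = conv_bn_act(32, 64, stride=2)     # stride 2
        self.stage2 = conv_bn_act(64, 128, stride=2)    # stride 4
        self.stage3 = conv_bn_act(128, 256, stride=2)   # stride 8 (small objects)
        self.stage4 = conv_bn_act(256, 512, stride=2)   # stride 16
        self.stage5 = conv_bn_act(512, 1024, stride=2)  # stride 32

    def forward(self, x):
        x = self.stage2(self.stage1(self.stem(x)))
        p3 = self.stage3(x)   # high-resolution map, typically used for small objects
        p4 = self.stage4(p3)
        p5 = self.stage5(p4)
        return p3, p4, p5

p3, p4, p5 = TinyDarkNet()(torch.randn(1, 3, 512, 512))
print(p3.shape, p4.shape, p5.shape)
# torch.Size([1, 256, 64, 64]) torch.Size([1, 512, 32, 32]) torch.Size([1, 1024, 16, 16])
```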

This addition elaborates on both the structural characteristics and advantages of DarkNet53, offering clearer technical justification for its selection in our framework. We believe it strengthens the methodological discussion as suggested.

Reviewer 2 Report

Comments and Suggestions for Authors

Contributions:

This paper proposes a one-stage RGB-IR object detection framework that aligns multispectral images with contextual information to enhance performance and efficiency. My comments are as follows:

Major comments:

  1. (Page 3) Figure 2 was not introduced explicitly.
  2. (Page 4) Figure 2 should be introduced sequentially. Additionally, the size is too small. The caption is too redundant. Some statements should be moved to the context.
  3. (Page 5) The parameters should be marked in Fig. 3 to improve the clarity.
  4. (Page 6) As shown in Eq. (4), the input feature vectors are not the same size as given in Fig. 4. How can you apply DWConv to them?
  5. (Page 7) The input and output features should be defined in Fig. 5.
  6. (Line 189 on page 7) The dimension of a(x,y) should be defined.
  7. (Line 190 on page 7) The input features for cross attention should be defined explicitly.
  8. (Page 7) How can you obtain Eq. (7)? Additionally, n^ was not defined; I think it is h × w × 2.
  9. (Page 8) The operation Split-Att in eq. (8) is unclear.
  10. (Line 247 on page 9) The symbol r has been used as the RGB feature.
  11. (Line 243 on page 8) The equation number was missing.
  12. (Page 8) Equation (11) should be separated into four equations.
  13. (Page 8) The meaning of F1 and F3 is unclear.
  14. (Lines 148 to 165 on page 5) Figure 4 was not introduced well.
  15. (Line 154 on page 5) The parameters h and w should be defined.
  16. There are too many symbols in this paper. Please create a symbol table in the appendix to improve the clarity.
  17. (Line 142 on page 5) Figure 8 should be in section 1.1. Additionally, the figure order needs to be reorganized.
  18. (Page 1) The section number should start from 1 rather than 0.
  19. (Page 2) The caption of Fig. 1 is too redundant. Some statements can be moved to the context. Figure 4 and Table 1 also have the same problem.  
  20. (Page 3) The caption of Fig. 2 is not presented well.

Minor comments:

  1. (Page 2) The sub-grid lines in Fig. 1 should be removed. In addition, the font size is too large.
  2. (Line 72 on page 3) The subject was missing.
  3. (Page 5) Figure 3 should be moved to sub-section 1.1.
  4. (Page 16) Sub-sections 3.4 and 3.4.1 can be merged.
  5. (Pages 13 and 14) The sub-sections 3.2.3 to 3.2.5 can be merged.
  6. (Pages 16 and 17) The sub-sections 3.5.1-3.5.4 can be merged.
Comments on the Quality of English Language

The quality of the English language should be improved.

Author Response

Comments 1: Figure 2 was not introduced explicitly. Figure 2 should be introduced sequentially. Additionally, the size is too small. The caption is too redundant. Some statements should be moved to the context.

Response 1: Thank you for your valuable suggestions regarding Figure 2. We have revised the manuscript accordingly. Figure 2 is now introduced explicitly and in sequential order in the corresponding section, and the figure size has been enlarged for better clarity. We have streamlined its caption to “(a) Conventional pixel-to-pixel fusion. (b) Proposed neighborhood-aware fusion,” while moving the detailed descriptions into the main text as contextual explanations.

 

We appreciate your feedback, which has enhanced the presentation of this figure and improved the overall readability of the paper.

 

 

Comments 2: The parameters should be marked in Fig. 3 to improve the clarity.

Response 2: Thank you for your attentive feedback regarding Fig. 3. We acknowledge that the initial version omitted explicit parameter labels, which may have affected readability. In the revised figure, we have now clearly annotated all key parameters—such as kernel sizes, stride values, and channel dimensions—directly within the diagram. This adjustment aligns with common practice in architecture visualization and enhances interpretability.

 

We appreciate your suggestion, which has helped us improve the clarity and completeness of the figure.

 

Comments 3: As shown in Eq. (4), the input feature vectors are not the same size as given in Fig. 4. How can you apply DWConv to them?

Response 3: Thank you for this insightful and important question. You are correct to point out the apparent discrepancy between the feature dimensions in Equation (4) and those depicted in the original Figure 4 (now Figure 6 in the revised manuscript); the resulting confusion about how depthwise convolution (DWConv) can be applied is understandable. We sincerely apologize for this oversight in our original submission.

 

 

Comments 4: The input and output features should be defined in Fig. 5.

Response 4: Thank you to the reviewer for pointing out the lack of clear labels for the input and output features in the figure. We have revised the figure to explicitly annotate these components, taking inspiration from the illustration styles used in classic papers such as SRA (Spatial Recurrent Attention).

 

Comments 5: The dimension of a(x,y) should be defined.

Response 5: Thank you for your continued review and for pointing out this specific detail. You are correct that our previous revision placed the definition of a(x, y) below the equation, in the subsequent paragraph, which may not have provided the immediate clarity you were seeking. We agree that the most helpful place for such a fundamental definition is where the variable is first introduced.

 

 

 

Comments 6: The input features for cross attention should be defined explicitly.

Response 6: Thank you for the valuable suggestion. We agree that explicitly defining the input features is crucial for clarity. In the revised manuscript, we now explicitly state the inputs to the cross-attention module (i.e., the encoded visual features from the backbone and the query embeddings from the previous decoder layer). This clarification has been added in the section where the cross-attention mechanism is introduced.
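For illustration, a hedged sketch of such a cross-attention step, with queries taken from the decoder side and keys/values from the encoded visual features; the dimensions and tensor names are assumptions rather than the paper's actual configuration.

```python
# Hedged sketch of cross-attention with the two inputs named above:
# queries come from the previous decoder layer, keys/values from the
# encoded (flattened) visual features. Shapes are illustrative only.
import torch
import torch.nn as nn

d_model, n_heads = 256, 8
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

queries = torch.randn(1, 100, d_model)           # query embeddings (decoder side)
visual_feats = torch.randn(1, 64 * 64, d_model)  # flattened backbone features

# Q = queries, K = V = visual features: each query gathers context
# from the spatial feature map.
out, attn_weights = cross_attn(queries, visual_feats, visual_feats)
print(out.shape)           # torch.Size([1, 100, 256])
print(attn_weights.shape)  # torch.Size([1, 100, 4096])
```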

 

 

 

Comments 7: How can you obtain Eq. (7)? Additionally, n^ was not defined; I think it is h × w × 2.

Response 7: Thank you to the reviewer for catching both the undefined symbol and for seeking clarification on Equation (7). We apologize for the lack of clarity in the original manuscript.

 

The reviewer is correct regarding the term n^. To address this, we have explicitly defined n^ as representing the flattened spatial dimensions H × W × 2 in the text immediately preceding Equation (7).

 

Furthermore, to improve the overall readability and flow regarding this section, we have refined the narrative around Equation (7) to better integrate it into the methodological description. We believe these edits make the presentation clearer and hope they fully resolve the reviewer's concerns.

 

Comments 8: The operation Split-Att in eq. (8) is unclear.

Response 8: Thank you. We have added Figure 6 to illustrate the Split-Att operation in Equation (8).

 

 

Comments 9: The symbol r has been used as the RGB feature.

Response 9: Thank you for your comment. The symbol r is indeed used in this paper to denote the color feature vector (or representation) extracted from the RGB image. In the revised manuscript, we will further emphasize this definition and ensure its consistent use throughout the text.

 

Comments 10: (Line 243 on page 8) The equation number was missing.

Response 10: Thank you for pointing out this oversight. We have reviewed the manuscript and confirmed that the equation on Line 243 (page 8) was indeed missing its number. In the revised version, we have added the correct equation number and carefully checked all other equations in the document to ensure consistency and completeness.

 

Comments 11:  Equation (11) should be separated into four equations.

Response 11: Thank you for the suggestion. In response to your comment, we have revised Equation (11) in the manuscript and separated its components into four distinct equations for improved clarity and readability.

 

Comments 12: The meaning of F1 and F3 is unclear.

Response 12: Thank you for your comment. To clarify, F1 and F3 refer to the intermediate feature maps extracted from the first and third stages (or blocks) of our backbone network, respectively, as illustrated in Figure 2. These notations represent hierarchical features at progressively higher semantic levels within the model.
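As a minimal sketch (the stage layout, channel counts, and names are assumed for illustration and are not the authors' exact backbone), intermediate maps such as F1 and F3 can be collected simply by returning the outputs of the corresponding stages during the forward pass.

```python
# Minimal sketch of collecting intermediate feature maps (here called F1
# and F3) from the first and third stages of a generic backbone.
# The stage layout is assumed for illustration only.
import torch
import torch.nn as nn

class StagedBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Conv2d(3, 64, 3, stride=2, padding=1),     # stage 1 -> F1
            nn.Conv2d(64, 128, 3, stride=2, padding=1),   # stage 2
            nn.Conv2d(128, 256, 3, stride=2, padding=1),  # stage 3 -> F3
        ])

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        # F1: early, high-resolution, low-level details.
        # F3: deeper, lower-resolution, more semantic.
        return feats[0], feats[2]

f1, f3 = StagedBackbone()(torch.randn(1, 3, 256, 256))
print(f1.shape, f3.shape)  # torch.Size([1, 64, 128, 128]) torch.Size([1, 256, 32, 32])
```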

 

Comments 13:  Figure 4 was not introduced well.

Response 13: Thank you for your feedback. In response to your comment, we have revised the caption for Figure 4 in the manuscript to provide a more comprehensive and detailed introduction. The updated caption now clearly describes the key components, data flows, and the main insight or takeaway that the figure aims to convey. We believe this enhancement will significantly improve the readability and interpretability of the figure for our readers.

 

 

Comments 14: The parameters h and w should be defined.

Response 14: Thank you for the comment. In our manuscript, the parameters h and w denote the height and width of the feature maps (or input image) respectively. For clarity, we will explicitly define them upon their first appearance, and ensure their consistent usage throughout the paper.

 

 

Comments 15: There are too many symbols in this paper. Please create a symbol table in the appendix to improve the clarity.

Response 15: Thank you for your thoughtful suggestion regarding the clarity of notation. We appreciate the recommendation to add a symbol table in the appendix. In the original submission, we had not included an appendix as we sought to keep the manuscript concise.

 

Comments 16: Figure 8 should be in section 1.1. Additionally, the figure order needs to be reorganized.

Response 16: Thank you for your thoughtful suggestion. We understand your perspective regarding the layout of the sections, and we will further review the overall structure of the manuscript to strive for a better balance between logical clarity and content completeness.

 

Comments 17: The section number should start from 1 rather than 0.

Response 17: Thank you for your reminder. We have corrected the section numbering in the manuscript so that it now begins from "Section 1" rather than "Section 0" to comply with standard academic formatting conventions.

 

Comments 18: The caption of Fig. 1 is too redundant. Some statements can be moved to the context. Figure 4 and Table 1 also have the same problem. 

Response 18: We sincerely thank the reviewer for this valuable feedback. Regarding the concern about redundancy in the captions of Fig. 1, Fig. 4, and Table 1, we have carefully reviewed and revised the manuscript.

Specifically, for Fig. 1, we have decided to remove it entirely from the manuscript. Upon reflection, we agree that the content originally illustrated in Fig. 1 can be described more concisely and effectively within the main text. Its removal helps streamline the presentation without losing essential information.

 

 

 

Comments 19: The caption of Fig. 2 is not presented well.

Response 19: Thank you for your feedback. We acknowledge that the caption of Figure 2 can be improved for clarity. In the revised manuscript, we have rewritten the caption to provide a more comprehensive and precise description of the figure's content and its relevance to our methodology.

 

 

 

Comments 20: The sub-grid lines in Fig. 1 should be removed. In addition, the font size is too large.

Response 20: We appreciate your careful observation. In response to your comment, we have modified Figure 1 in the revised manuscript by removing the sub-grid lines and adjusting the font size to a more appropriate and consistent scale, thereby improving its visual clarity and professionalism.

 

Comments 21:  (Line 72 on page 3) The subject was missing.

Response 21: Thank you for pointing out this grammatical oversight. We have reviewed Line 72 on page 3 and corrected the sentence structure in the revised manuscript by adding the appropriate subject to ensure grammatical completeness and clarity.

 

 

Comments 22: Figure 3 should be moved to sub-section 1.1.

Response 22: Thank you for this constructive suggestion regarding the manuscript's organization. We agree with your recommendation. In the revised version, we have moved Figure 3 to Sub-section 1.1, where it can be introduced and referenced in a more logical and contextually relevant manner.

 

 

Comments 23: Sub-sections 3.4 and 3.4.1 can be merged.

Response 23: Thank you for this suggestion. We agree that merging Sub-sections 3.4 and 3.4.1 will improve the structural clarity and flow of the manuscript. In the revised version, these two sub-sections will be consolidated into a single, logically coherent section (Section 3.4).

 

Comments 24: The sub-sections 3.2.3 to 3.2.5 can be merged.

Response 24: We appreciate your feedback on the manuscript's organization. Following your recommendation, we will merge Sub-sections 3.2.3, 3.2.4, and 3.2.5 in the revised manuscript to form a unified and more concise subsection, thereby enhancing the overall readability of Section 3.2.

 

Comments 25: The sub-sections 3.5.1-3.5.4 can be merged.

Response 25: Thank you for pointing this out. We concur that combining these sub-sections will streamline the presentation. Accordingly, Sub-sections 3.5.1 through 3.5.4 will be integrated into a single subsection in the revised version to provide a more focused and fluid explanation of the methodology.

Reviewer 3 Report

Comments and Suggestions for Authors

The authors address a very interesting topic related to multispectral object detection. Considering the topic, the body of the article, and the information included in the Introduction, where the authors present their proposed method, the title of the article is well-chosen and accurately reflects its content.
Technical note: I don't think it's appropriate to number the chapters from zero. I suggest starting them from 1.
In the Introduction, the authors clearly explain their reasons for pursuing this topic. The figures provide a good introduction to the subject, and the explicit statement of contributions prepares the reader for the subsequent analysis and its specific theses. Furthermore, the extensive literature review, comprising over 70 references, is commendable and demonstrates the authors' thorough knowledge of the topic and extensive research.
While there is no dedicated "Related Works" chapter, elements of it are found in other chapters, where the authors explore other well-known solutions. In my opinion, it would be better to separate such a chapter, but I won't insist, as that's perfectly acceptable.
Subsequently, the authors present the methodology and analysis, which they support with appropriate mathematical formulas and descriptions explaining the detailed foundations discussed.
This chapter is undoubtedly crucial for understanding, analyzing, and verifying the authors' ideas and proving their hypotheses.
The methodology then leads into the results presented in the next chapter, "Experiments and Results." The presented results look interesting and promising and, most importantly, appear to confirm the authors' thesis. The validation of the obtained results, as described, is commendable and is well illustrated by the included tables and figures, for example those relating to drones. While there is no extensive chapter discussing all the obtained results in relation to the presented hypotheses, the authors do include such a subsection; although this does not strictly follow the IMRAD article structure, it fits the overall organization. Furthermore, the focus on performance, the results obtained, and the discussion of those results speak well of the authors' research skills. Considering the overall assessment described above, I believe the article is worthy of publication.

Author Response

Comments 1: While there is no dedicated "Related Works" chapter, elements of it are found in other chapters; in my opinion it would be better to separate such a chapter, although this is not insisted upon.

Response 1: Thank you for the thoughtful review and for acknowledging that our current structure is acceptable. We appreciate your valuable suggestion regarding a dedicated "Related Works" chapter.

We agree that consolidating the discussion of prior research into a separate section can enhance clarity and provide a more comprehensive overview of the field. In response to your suggestion, we will add a dedicated "Related Work" section in the revised manuscript. This section will systematically review and contrast key existing methods, thereby more explicitly framing the position and contributions of our work within the broader literature.

We believe this addition will strengthen the paper's narrative flow and scholarly rigor.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Only a few issues remain:

1. Fig. 6 does not illustrate the GCF; it seems that Fig. 5 shows the structure of the GCF.

2. There is a "Fig. ??" in line 61.

Author Response

Comments 1: Fig. 6 does not illustrate the GCF; it seems that Fig. 5 shows the structure of the GCF.
Response 1: Thank you for the careful review and correction. We apologize for the error. The diagram illustrating the GCF module is indeed Fig. 5, not Fig. 6. We have corrected the reference in the text and verified the consistency of all figure citations.

Comments 2: There is a "Fig. ??" in line 61.
Response 2: Thank you for pointing out this oversight. We have located the placeholder "Fig. ??" in line 61 and replaced it with the correct figure reference. The manuscript has been checked to ensure all figure citations are complete and accurate.

Reviewer 2 Report

Comments and Suggestions for Authors

The authors have improved the quality of this paper. It can be accepted for publication after minor revision.

Minor comments:

  1. The caption of Fig. 1 can be revised as “Figure 1. Comparison of various fusion methods: (a) Conventional pixel-to-pixel fusion. (b) Proposed neighborhood-aware fusion.” The sub-caption “(a) Prior work: pixel-to-pixel information fusion” should be revised as “(a)”. The sub-caption “(b) Our work: neighborhood-aware information aggregation” should be revised as “(b)”.
  2. There are two question marks on line 61. Please revise it.
Comments on the Quality of English Language

The quality of the English language is acceptable.

Author Response

Comments 1: The caption of Fig. 1 can be revised as “Figure 1. Comparison of various fusion methods: (a) Conventional pixel-to-pixel fusion. (b) Proposed neighborhood-aware fusion.” The sub-caption “(a) Prior work: pixel-to-pixel information fusion” should be revised as “(a)”. The sub-caption “(b) Our work: neighborhood-aware information aggregation” should be revised as “(b)”.

Response 1: Thank you for the suggestion. We have revised the caption of Fig. 1 as recommended. The figure is now labeled “Figure 1. Comparison of various fusion methods: (a) Conventional pixel-to-pixel fusion. (b) Proposed neighborhood-aware fusion,” with the sub-captions simplified to “(a)” and “(b)” accordingly.

Comments 2: There are two question marks on line 61. Please revise it.

Response 2: Thank you for pointing this out. We have corrected line 61 by removing the erroneous question marks and ensuring the text is clear and accurate.
