Peer-Review Record

SF3Net: Frequency-Domain Enhanced Segmentation Network for High-Resolution Remote Sensing Imagery

Remote Sens. 2025, 17(22), 3734; https://doi.org/10.3390/rs17223734
by Yi He, Zhenyu Lu and Hai Huan *
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3:
Submission received: 22 September 2025 / Revised: 21 October 2025 / Accepted: 14 November 2025 / Published: 17 November 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper addresses the core challenge in semantic segmentation of high-resolution remote sensing imagery: how to balance accuracy and computational efficiency while fully leveraging frequency-domain information to improve boundary segmentation and the perception of complex textures. To tackle this, a novel lightweight spatial-frequency feature fusion network, SF3Net, is proposed. The framework consists of four key modules:

- the Frequency Feature Stereo Learning (FFSL) module, which applies the Fourier transform to extract rich frequency-domain features along three directions (Channel-Height, Height-Width, Channel-Width), enhancing perception in areas with drastic grayscale variations;
- the Spatial Feature Aggregation Module (SFAM), which employs parallel dilated convolutions and SoftPool to aggregate pixel-level spatial context in a weighted manner, effectively mitigating segmentation errors caused by object occlusion;
- the Feature Selection Module (FSM), embedded in the skip connections, which dynamically selects and fuses shallow encoder features via channel and spatial attention, compensating for detail loss during downsampling;
- the Spatial-Frequency Feature Fusion Module (SFFM), the core of the decoder, which efficiently integrates frequency-domain features from FFSL with spatially enhanced features from SFAM, achieving the complementary advantages of global semantics and local details.

Comprehensive experiments on three datasets demonstrate that SF3Net achieves excellent performance with a lightweight design and efficient modules.
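To make the three-direction frequency learning just described concrete, here is a minimal PyTorch sketch of what an FFSL-like branch could look like. It illustrates the general technique only and is not the authors' implementation: the class name `FreqBranch`, the use of a real-valued FFT, the filter shapes, and the additive fusion of the three directions are all assumptions.

```python
import torch
import torch.nn as nn


class FreqBranch(nn.Module):
    """Illustrative frequency branch: 2D FFT over a chosen pair of tensor
    dimensions, multiplication by a learnable complex filter, inverse FFT.
    Filter shapes and parameterization are assumptions, not the paper's."""

    def __init__(self, weight_shape, dims):
        super().__init__()
        self.dims = dims
        # Real and imaginary parts of the learnable frequency filter.
        self.weight = nn.Parameter(torch.randn(*weight_shape, 2) * 0.02)

    def forward(self, x):  # x: (B, C, H, W)
        sizes = (x.shape[self.dims[0]], x.shape[self.dims[1]])
        spec = torch.fft.rfft2(x, dim=self.dims, norm="ortho")
        spec = spec * torch.view_as_complex(self.weight)  # filter the spectrum
        return torch.fft.irfft2(spec, s=sizes, dim=self.dims, norm="ortho")


B, C, H, W = 2, 64, 32, 32
x = torch.randn(B, C, H, W)
# rFFT halves the last transformed axis, hence the "// 2 + 1" below.
f_hw = FreqBranch((1, 1, H, W // 2 + 1), dims=(-2, -1))  # Height-Width plane
f_ch = FreqBranch((1, C, H // 2 + 1, 1), dims=(-3, -2))  # Channel-Height plane
f_cw = FreqBranch((1, C, 1, W // 2 + 1), dims=(-3, -1))  # Channel-Width plane
y = f_hw(x) + f_ch(x) + f_cw(x)  # fusion rule assumed for illustration
```

Under this reading, the 2D FFT in the (C, H) and (C, W) branches is taken over a channel-spatial plane rather than over (H, W), and the three filters are independent parameters, which is precisely the interpretation questioned in comments 1 and 4 below.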

However, the following issues require clarification and improvement:

  1. Formula (7) presents the standard 2D FFT, with input in spatial coordinates (x, y) and output in frequency coordinates (u, v). In Formula (8), however, the summation variables for FFT(f_i2) become c and h. Does this imply performing a 2D-FFT-like operation over the Channel and Height dimensions?


  2. In the last sentence of Section 2.3, it states: "By performing element-wise multiplication of v_h and v_w as shown in equation (3)". However, Equation (3) in the text describes the SoftPool calculation, not an element-wise multiplication. Should this instead refer to the outer product (or broadcasted element-wise multiplication) of v_h and v_w, an operation that currently lacks a dedicated equation number? (A sketch of this operation appears after this list.)


  3. In Figure 4, a complete feature map is split into four groups with the same number of channels via 'split'. Are different splitting methods applied to these groups? After splitting, does each group's channels correspond to a different direction? Furthermore, is a residual-connection annotation missing in the bottom-right corner, or what does the '+' symbol signify?


  4. In Figure 4, are the three frequency-domain learning branches W_{(C,H)}, W_{(H,W)}, and W_{(C,W)} shared or independent? In Figure 3, does the operation denoted twice by '×' represent element-wise multiplication?


  5. How is the 'Channel Adapter' in Figure 2 implemented? Is it simply a 1×1 convolution? Both the spatial feature (f_s) and the frequency-domain feature (f_f) undergo GroupNorm before fusion. Is this intended to calibrate features from different sources to similar distributions before fusion?


  6. The experimental design in Table 2 is confusing. The mIoU for the first row "- - - -" is 78.409%, identical to the result for "Baseline + SFAM" in Table 1. Does this indicate that the experiments in Table 2 were conducted on "Baseline + SFAM" rather than the pure Baseline?
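To clarify what comment 2 refers to, the following sketch shows one plausible reading of the v_h and v_w operation: SoftPool applied along each spatial axis to form strip descriptors, followed by a broadcasted element-wise product that, per position, realizes the outer product of the two strips. The shapes and pooling axes are assumptions inferred from the comment, not the paper's exact equations.

```python
import torch


def softpool(x, dim):
    """SoftPool along one dimension (Stergiou et al.):
    out = sum(exp(x) * x) / sum(exp(x)), i.e. a softmax-weighted average."""
    w = torch.softmax(x, dim=dim)
    return (w * x).sum(dim=dim, keepdim=True)


x = torch.randn(2, 64, 32, 32)   # (B, C, H, W), shapes assumed
v_h = softpool(x, dim=-1)        # pool over W -> (B, C, H, 1) column strip
v_w = softpool(x, dim=-2)        # pool over H -> (B, C, 1, W) row strip
# Broadcasting expands both strips to (B, C, H, W); the element-wise
# multiplication of the expanded strips equals the outer product of
# v_h and v_w along the spatial axes.
attn = v_h * v_w
```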

Author Response


We thank the reviewer for the valuable comments, which have significantly improved the quality of our manuscript. A point-by-point response is provided in the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The paper proposes a spatial-frequency feature fusion network (SF3Net) for semantic segmentation of high-resolution remote sensing images, aiming to balance segmentation accuracy and computational efficiency by integrating frequency-domain information with spatial features. Addressing the core limitation of existing methods, namely insufficient utilization of frequency-domain information, the approach designs four key modules (FFSL, SFAM, SFFM, FSM) and builds a U-shaped encoder-decoder architecture. Systematic experiments were conducted on ISPRS Potsdam, ISPRS Vaihingen, and a self-built farmland dataset. The overall research approach is clear, the technical route is complete, and the experimental design is relatively comprehensive. The results demonstrate that SF3Net outperforms most existing spatial-frequency fusion methods in segmentation accuracy and lightweight design, exhibiting theoretical innovation and application value.
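As a reading aid for the architecture summarized above, the following is a minimal PyTorch skeleton of a U-shaped encoder-decoder with the four modules wired in at the positions described: FSM on the skip connection, FFSL and SFAM in parallel on deep features, and SFFM fusing the two. Every module body is a placeholder (identity or 1×1 convolution), and the two-stage depth and channel counts are invented for illustration; none of this reflects the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Down(nn.Module):
    """Stride-2 conv stage standing in for a real encoder block."""

    def __init__(self, cin, cout):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(cin, cout, 3, stride=2, padding=1),
            nn.BatchNorm2d(cout),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.conv(x)


class SF3NetSkeleton(nn.Module):
    """U-shaped wiring only: FSM filters the skip path, FFSL and SFAM run
    in parallel on deep features, SFFM fuses them. All four are stand-ins."""

    def __init__(self, c=32, n_classes=6):
        super().__init__()
        self.enc1, self.enc2 = Down(3, c), Down(c, 2 * c)
        self.fsm = nn.Identity()                # Feature Selection Module stand-in
        self.ffsl = nn.Identity()               # frequency branch stand-in
        self.sfam = nn.Identity()               # spatial aggregation stand-in
        self.sffm = nn.Conv2d(4 * c, 2 * c, 1)  # spatial-frequency fusion stand-in
        self.up = nn.ConvTranspose2d(2 * c, c, 2, stride=2)
        self.skip_fuse = nn.Conv2d(2 * c, c, 1)
        self.head = nn.Conv2d(c, n_classes, 1)

    def forward(self, x):
        s1 = self.enc1(x)                 # shallow features kept for the skip
        deep = self.enc2(s1)
        f_f, f_s = self.ffsl(deep), self.sfam(deep)
        deep = self.sffm(torch.cat([f_f, f_s], dim=1))  # fuse both domains
        d = self.up(deep)
        d = self.skip_fuse(torch.cat([self.fsm(s1), d], dim=1))  # FSM on skip
        return F.interpolate(self.head(d), scale_factor=2,
                             mode="bilinear", align_corners=False)
```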
However, the paper still has room for improvement in the elaboration of methodological details and requires further refinement:
1. Supplement the frequency-domain weight learning mechanism of the FFSL module: The paper mentions that FFSL performs frequency-domain feature learning through three learnable weights (W_{(C,H)}, W_{(H,W)}, W_{(C,W)}), but does not specify the initialization method or update strategy for these weights. It is recommended to supplement these details (one common convention is sketched after this list).

2. High-frequency noise in remote sensing images: Remote sensing images commonly contain high-frequency noise, which can degrade segmentation accuracy, yet the paper does not thoroughly discuss how the FFSL module suppresses such noise while enhancing informative frequency components. It is suggested to add an analysis of the high-frequency noise issue in remote sensing images to the introduction or methodology section, and to elaborate on how the adaptive weight learning mechanism of FFSL can theoretically address this problem.

3. Missing caption for Figure 2: Is Figure 2 missing its figure caption?

4. Inconsistent color schemes in Figures 4 and 5: The color schemes of Figures 4 and 5 differ. Would it be possible to unify them?
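On comment 1 above (see the note there), one common convention for learnable frequency filters of this kind, used by GFNet-style global filters for example, is to initialize them with small random values or as a near-identity pass-through, and to update them by ordinary backpropagation like any other parameter. The sketch below illustrates that convention hypothetically; it is not the authors' actual initialization.

```python
import torch
import torch.nn as nn

H, W = 32, 32  # spatial size of the filtered plane (assumed for illustration)

# Option A: small random initialization, as in GFNet-style global filters.
w_rand = nn.Parameter(torch.randn(1, 1, H, W // 2 + 1, 2) * 0.02)

# Option B: near-identity initialization (real part 1, imaginary part 0),
# so that the FFT -> filter -> iFFT path starts as a pass-through.
w_init = torch.zeros(1, 1, H, W // 2 + 1, 2)
w_init[..., 0] = 1.0                 # real part = 1, imaginary part = 0
w_id = nn.Parameter(w_init)

# The "update strategy" is then plain backpropagation: gradients flow
# through view_as_complex and the FFTs, and the optimizer steps the
# filter like any other network weight.
filt = torch.view_as_complex(w_id)   # complex filter applied to the spectrum
```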

Comments on the Quality of English Language

Can be improved.

Author Response

We thank the reviewer for the valuable comments, which have significantly improved the quality of our manuscript. A point-by-point response is provided in the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The manuscript provides a comprehensive review of the development of remote sensing (RS) image semantic segmentation, tracing the evolution from traditional spectral–spatial feature-based approaches to modern deep learning frameworks such as CNNs, Transformers, and Mamba models. The logical flow of the overview is generally clear. The authors have successfully identified a key research gap—namely, the overreliance on spatial-domain features and the insufficient exploitation of frequency-domain information—and have proposed a hybrid spatial–frequency fusion network, SF3Net, as a well-defined and innovative direction. The experimental section is clearly presented. I recommend minor structural adjustments and language polishing to enhance readability and strengthen the persuasive impact of the paper.


Specific comments are as follows:

  1. The first paragraph lacks smooth transitions, and the examples of RS applications are presented as a simple list without conceptual grouping. It is recommended to add one or two transitional sentences to better highlight the research motivation and to organize the application scenarios into logical categories for improved readability and flow.
  2. (Line 37). The statement “Traditional image segmentation methods primarily rely on spectral, spatial, and textural features for segmenting RS images [8]” is not preceded by a sentence summarizing the limitations of traditional methods. As a result, the transition to deep learning methods appears abrupt and weakens the narrative continuity.
  3. (Lines 62–92). The section from “[18] proposed combining a pyramid structure with attention mechanisms…” to the end of this discussion enumerates numerous studies related to Transformer and hybrid structures. Although informative, it lacks a unifying summary or critical synthesis that connects these works to the motivation of the current study.
  4. The list of contributions is somewhat verbose, and some points partially overlap in content and logic. The section would benefit from concise and parallel phrasing of each contribution to improve clarity and stylistic consistency.


Author Response

We thank the reviewer for the valuable comments, which have significantly improved the quality of our manuscript. A point-by-point response is provided in the attachment.

Author Response File: Author Response.pdf
