Author Contributions
Conceptualization, P.Z. and J.L.; methodology, P.Z.; software, P.Z. and C.W.; validation, P.Z. and J.L.; formal analysis, P.Z.; investigation, P.Z. and C.W.; resources, P.Z., Y.N. and J.L.; data curation, P.Z.; writing—original draft preparation, P.Z.; writing—review and editing, J.L. and C.W.; visualization, P.Z., C.W. and J.L.; supervision, Y.N. and J.L.; project administration, Y.N. and J.L.; funding acquisition, Y.N. and J.L. All authors have read and agreed to the published version of the manuscript.
Figure 1.
Architectural composition of SAM2MS: an encoder–decoder framework with adapters, dimensionality reduction blocks (DRBs), and a multi-scale subtraction module (MSSM) built by cascading multiple subtraction blocks (Subs). The MSSM is highlighted in green.
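For concreteness, below is a minimal PyTorch sketch of one subtraction block (Sub), assuming the MSNet-style formulation conv(|F_a − F_b|); the 64-channel width and the conv-BN-ReLU layout are illustrative assumptions rather than details taken from the figure.

```python
import torch
import torch.nn as nn

class SubtractionBlock(nn.Module):
    """Sketch of a subtraction unit (Sub): conv(|F_a - F_b|), in the style
    of MSNet-like multi-scale subtraction modules. Channel count and the
    conv-BN-ReLU layout are illustrative assumptions."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, fa: torch.Tensor, fb: torch.Tensor) -> torch.Tensor:
        # The element-wise absolute difference emphasizes complementary
        # detail between two feature maps before re-encoding it.
        return self.conv(torch.abs(fa - fb))
```

Cascading such blocks over the multi-scale encoder features yields the MSSM shown in green in Figure 1.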
Figure 2.
Schematic diagram of the adapter. The adapter is composed of two fully connected layers for upsampling and downsampling, together with ReLU activation layers. Its input is external remote sensing information, which this lightweight architecture embeds efficiently into the SAM2 image encoder, enabling effective processing of remote sensing data.
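A minimal sketch of such a bottleneck adapter is given below, assuming a linear down-projection, a ReLU, and a linear up-projection; the reduction ratio and the residual connection are assumptions for illustration, not confirmed details of the paper's adapter.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter sketch: two fully connected layers (down- and
    up-projection) with a ReLU in between. The 0.25 reduction ratio and
    the residual skip are illustrative assumptions."""
    def __init__(self, dim: int, ratio: float = 0.25):
        super().__init__()
        hidden = int(dim * ratio)
        self.down = nn.Linear(dim, hidden)  # down-projection
        self.act = nn.ReLU(inplace=True)
        self.up = nn.Linear(hidden, dim)    # up-projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Token features of shape (B, N, dim) pass through the bottleneck;
        # the skip connection leaves the frozen SAM2 features intact.
        return x + self.up(self.act(self.down(x)))
```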
Figure 3.
Schematic diagram of the dimensionality reduction block. The input consists of four feature layers from the SAM2 encoder; after passing through the DRB, the features are reduced to a low-dimensional (64-channel) representation. The block employs convolution kernels of sizes 3, 5, and 7 and fuses the outputs of the different branches, expanding the receptive field while keeping the model lightweight. Taking an input feature map of size 144 × 256 × 256 as an example, the channel number and feature map size annotated at each layer correspond to the dimensions of that layer's output.
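A minimal sketch of the DRB follows; fusing the branch outputs by summation followed by a 1 × 1 projection is an assumption, since the caption specifies only the multi-kernel branches and the 64-channel output.

```python
import torch
import torch.nn as nn

class DRB(nn.Module):
    """Dimensionality reduction block sketch: parallel 3x3, 5x5, and 7x7
    convolution branches fused into a 64-channel map. Fusion by summation
    plus a 1x1 projection is an illustrative assumption."""
    def __init__(self, in_channels: int = 144, out_channels: int = 64):
        super().__init__()
        def branch(k: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=k,
                          padding=k // 2, bias=False),
                nn.BatchNorm2d(out_channels),
                nn.ReLU(inplace=True),
            )
        self.b3, self.b5, self.b7 = branch(3), branch(5), branch(7)
        self.fuse = nn.Conv2d(out_channels, out_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Summing the multi-kernel branches widens the receptive field
        # without concatenation, keeping the block lightweight.
        return self.fuse(self.b3(x) + self.b5(x) + self.b7(x))

# Example matching the caption: a 144 x 256 x 256 encoder feature
# reduced to 64 channels.
# y = DRB(144, 64)(torch.randn(1, 144, 256, 256))  # -> (1, 64, 256, 256)
```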
Figure 4.
Parameter-free lossnet architecture: deep supervision via a fixed ResNet50 backbone.
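A minimal sketch of a parameter-free lossnet of this kind is shown below: a frozen ResNet50 extracts multi-level features from the prediction and the ground truth, and per-level L1 differences are summed. The choice of stages and the L1 distance are assumptions consistent with feature-level deep supervision, not verbatim details of the paper.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class LossNet(nn.Module):
    """Parameter-free lossnet sketch: a fixed-weight ResNet50 provides
    multi-level deep supervision. Stage selection and the L1 distance
    are illustrative assumptions."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.stages = nn.ModuleList([
            nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu),
            nn.Sequential(backbone.maxpool, backbone.layer1),
            backbone.layer2,
            backbone.layer3,
        ])
        for p in self.parameters():  # frozen: contributes no trainable weights
            p.requires_grad_(False)

    def forward(self, pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
        # Single-channel maps are tiled to 3 channels for the backbone.
        pred, gt = pred.repeat(1, 3, 1, 1), gt.repeat(1, 3, 1, 1)
        loss = torch.zeros((), device=pred.device)
        for stage in self.stages:
            pred, gt = stage(pred), stage(gt)
            loss = loss + torch.mean(torch.abs(pred - gt))
        return loss
```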
Figure 5.
Annotation exemplars for the Massachusetts, SpaceNet, and DeepGlobe datasets are shown sequentially from left to right, with the top row presenting original imagery and the bottom row depicting corresponding annotations. Red markings highlight critical annotation details.
Figure 6.
Dimensionality reduction analysis of the training and test sets using both t-SNE and PCA. Features extracted from each dataset were analyzed jointly. The resulting two-dimensional visualization shows the distribution of the data in the reduced space: blue points represent the training set, while red points denote the test set. For all three datasets, the training and test data are closely interwoven in this space.
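A joint embedding of this kind can be produced with a sketch of the following form; `train_feats` and `test_feats` are assumed to be per-image feature vectors (e.g., from a pretrained CNN), since the exact feature extractor is not specified here.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def plot_train_test_overlap(train_feats: np.ndarray, test_feats: np.ndarray):
    """Embed train and test features jointly with t-SNE and PCA and color
    the two splits, as in Figure 6. Feature choice is an assumption."""
    feats = np.concatenate([train_feats, test_feats], axis=0)
    labels = np.array([0] * len(train_feats) + [1] * len(test_feats))

    for name, reducer in [("t-SNE", TSNE(n_components=2, random_state=0)),
                          ("PCA", PCA(n_components=2))]:
        emb = reducer.fit_transform(feats)  # joint 2-D embedding
        plt.figure()
        plt.scatter(*emb[labels == 0].T, c="blue", s=5, label="train")
        plt.scatter(*emb[labels == 1].T, c="red", s=5, label="test")
        plt.title(name)
        plt.legend()
        plt.show()
```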
Figure 7.
Comparative results: Representative outcomes from the three datasets demonstrate performance variations among the models. Row (a) displays the actual remote sensing images, while row (b) presents the corresponding ground truth labels. Rows (c–l) sequentially show the test results of baseline models: UNet, UNet++, D-LinkNet, MSNet, M2SNet, Seg-Road, SwinUNet, SGCNNet, MSMDFFNet and SAM2UNet. Row (m) showcases our proposed SAM2MS method. Samples (1–3) originate from the DeepGlobe dataset, samples (4–6) from SpaceNet, and samples (7,8) from Massachusetts. Solid red borders highlight areas with significant occlusion and shadows, whereas dashed borders emphasize missing regions in either ground truth annotations or inference results.
Figure 8.
Progressive training visualization. To demonstrate model evolution during training, we present a randomly selected test image alongside its ground truth (a,e). Successive inference results from early-stage (b), intermediate (c), and final (d) training phases illustrate performance progression. Complementary multi-level supervision maps generated by lossnet (f–h) highlight critical refinement processes: background suppression (f), edge refinement (g), and region-of-interest enhancement (h).
Figure 9.
Cross-dataset comparative results: Models trained on the DeepGlobe dataset were directly evaluated on the test sets of SpaceNet and Massachusetts (denoted as D2S and D2M, respectively). Columns (1–4) present partial inference results for D2S, while columns (5–8) correspond to D2M. Row (a) displays the actual remote sensing images, and row (b) presents the corresponding ground truth labels. Rows (c–l) sequentially demonstrate the test results of baseline models: UNet, UNet++, D-LinkNet, MSNet, M2SNet, Seg-Road, SwinUNet, SGCNNet, MSMDFFNet and SAM2UNet. Finally, row (m) showcases the performance of our proposed SAM2MS method.
Figure 10.
Spatial distribution relationships among the DeepGlobe, SpaceNet, and Massachusetts datasets, analyzed using both t-SNE (a,b) and PCA (c,d). Samples from the DeepGlobe dataset are shown as blue scatter points, whereas samples from the Massachusetts and SpaceNet datasets are shown as red scatter points.
Table 1.
Cross-model quantitative evaluation on the DeepGlobe dataset benchmark [44].
| Method | Param. (M) | FLOPs (G) | Prec. | Recall | F1 | mIoU | mDice | MAE ↓ |
|---|---|---|---|---|---|---|---|---|
| UNet (2015) | 31.02 | 875.81 | 74.83 | 76.97 | 75.89 | 59.86 | 73.44 | 2.10 |
| UNet++ (2018) | 47.19 | 3202.89 | 80.22 | 52.06 | 63.14 | 44.58 | 58.25 | 2.64 |
| D-LinkNet (2018) | 217.64 | 481.25 | 72.95 | 72.48 | 72.71 | 62.88 | 75.75 | 1.90 |
| MSNet (2021) | 27.69 | 143.91 | 80.77 | 76.81 | 78.74 | 63.04 | 76.65 | 1.86 |
| M2SNet (2023) | 27.69 | 144.41 | 81.95 | 74.99 | 78.32 | 63.74 | 76.26 | 1.84 |
| Seg-Road (2023) | 28.68 | 314.41 | 71.78 | 80.78 | 76.02 | 60.42 | 73.98 | 2.29 |
| SwinUNet (2023) | 27.14 | 123.59 | 83.18 | 68.72 | 75.26 | 60.30 | 73.84 | 1.84 |
| SGCNNet (2022) | 42.73 | 1234.41 | 69.49 | 78.52 | 73.73 | 57.51 | 71.47 | 2.23 |
| MSMDFFNet (2024) | 603.21 | 39.26 | 74.49 | 80.26 | 77.27 | 62.22 | 75.30 | 1.96 |
| SAM2UNet (2024) | 863.72 | 216.41 | 74.26 | 83.16 | 78.46 | 63.64 | 76.47 | 2.08 |
| SAM2MS (Ours) | 867.28 | 217.11 | 79.50 | 81.32 | 80.31 | 64.24 | 77.93 | 1.80 |
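For reference, the segmentation metrics reported in Tables 1–3 can be computed per image as in the following sketch; the 0.5 binarization threshold and the epsilon smoothing are assumptions, and the paper's exact averaging protocol is not reproduced here.

```python
import numpy as np

def road_metrics(pred: np.ndarray, gt: np.ndarray, thr: float = 0.5) -> dict:
    """Per-image Prec./Recall/F1/IoU/Dice/MAE for binary road masks.
    `pred` holds probabilities in [0, 1]; `gt` is a {0, 1} mask."""
    mae = np.abs(pred - gt).mean()          # MAE on the soft prediction
    p = (pred >= thr).astype(np.float64)    # assumed 0.5 threshold
    tp = (p * gt).sum()
    fp = (p * (1 - gt)).sum()
    fn = ((1 - p) * gt).sum()
    eps = 1e-8                              # assumed smoothing term
    prec = tp / (tp + fp + eps)
    rec = tp / (tp + fn + eps)
    f1 = 2 * prec * rec / (prec + rec + eps)
    iou = tp / (tp + fp + fn + eps)
    dice = 2 * tp / (2 * tp + fp + fn + eps)
    return dict(Prec=prec, Recall=rec, F1=f1, IoU=iou, Dice=dice, MAE=mae)
```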
Table 2.
Cross-model quantitative evaluation on the SpaceNet dataset benchmark [45].
| Method | Param. (M) | FLOPs (G) | Prec. | Recall | F1 | mIoU | mDice | MAE ↓ |
|---|---|---|---|---|---|---|---|---|
| UNet (2015) | 31.02 | 875.81 | 60.61 | 56.61 | 58.54 | 42.01 | 55.56 | 5.28 |
| UNet++ (2018) | 47.19 | 3202.89 | 58.91 | 26.40 | 36.46 | 21.62 | 30.45 | 5.10 |
| D-LinkNet (2018) | 217.64 | 481.25 | 62.58 | 63.55 | 63.06 | 47.94 | 61.82 | 3.77 |
| MSNet (2021) | 27.69 | 143.91 | 65.98 | 58.76 | 62.16 | 46.24 | 60.19 | 3.83 |
| M2SNet (2023) | 27.69 | 144.41 | 66.34 | 57.99 | 61.89 | 46.03 | 59.96 | 3.73 |
| Seg-Road (2023) | 28.68 | 314.41 | 60.07 | 66.14 | 62.96 | 46.27 | 60.39 | 4.36 |
| SwinUNet (2023) | 27.14 | 123.59 | 69.84 | 53.18 | 60.38 | 44.74 | 58.94 | 3.36 |
| SGCNNet (2022) | 42.73 | 1234.41 | 57.63 | 50.63 | 53.90 | 38.64 | 51.26 | 4.52 |
| MSMDFFNet (2024) | 603.21 | 39.26 | 60.41 | 60.84 | 60.62 | 45.50 | 59.09 | 3.89 |
| SAM2UNet (2024) | 863.72 | 216.41 | 67.11 | 57.43 | 61.90 | 45.60 | 59.24 | 4.06 |
| SAM2MS (Ours) | 867.28 | 217.11 | 62.34 | 67.16 | 64.66 | 48.52 | 62.57 | 3.73 |
Table 3.
Cross-model quantitative evaluation on the Massachusetts roads dataset benchmark [46].
| Method | Param. (M) | FLOPs (G) | Prec. | Recall | F1 | mIoU | mDice | MAE ↓ |
|---|---|---|---|---|---|---|---|---|
| UNet (2015) | 31.02 | 875.81 | 69.46 | 65.84 | 67.60 | 50.64 | 63.95 | 4.65 |
| UNet++ (2018) | 47.19 | 3202.89 | 71.40 | 62.51 | 66.66 | 49.51 | 62.66 | 4.13 |
| D-LinkNet (2018) | 217.64 | 481.25 | 69.03 | 66.34 | 67.66 | 52.22 | 65.36 | 2.73 |
| MSNet (2021) | 27.69 | 143.91 | 60.11 | 52.22 | 55.89 | 23.15 | 31.85 | 4.88 |
| M2SNet (2023) | 27.69 | 144.41 | 59.95 | 52.76 | 56.12 | 22.75 | 31.37 | 6.09 |
| Seg-Road (2023) | 28.68 | 314.41 | 60.45 | 51.03 | 55.34 | 24.10 | 33.33 | 4.18 |
| SwinUNet (2023) | 27.14 | 123.59 | 78.80 | 56.15 | 65.57 | 49.70 | 63.10 | 2.64 |
| SGCNNet (2022) | 42.73 | 1234.41 | 71.85 | 52.64 | 60.77 | 44.17 | 57.87 | 3.09 |
| MSMDFFNet (2024) | 603.21 | 39.26 | 71.94 | 60.88 | 65.95 | 49.79 | 63.21 | 2.80 |
| SAM2UNet (2024) | 863.72 | 216.41 | 65.37 | 61.21 | 63.22 | 49.41 | 62.76 | 4.33 |
| SAM2MS (Ours) | 867.28 | 217.11 | 72.71 | 69.39 | 71.01 | 50.66 | 64.10 | 2.93 |
Table 4.
Ablation studies on backbones and adapters.
| Backbones | Adapter | F1 | mIoU | mDice | MAE ↓ |
|---|---|---|---|---|---|
| SAM2-Tiny | − | 74.36 | 57.77 | 71.76 | 2.29 |
| SAM2-Tiny | √ | 76.81 | 61.11 | 74.47 | 2.16 |
| SAM2-Small | − | 74.19 | 57.52 | 71.60 | 2.39 |
| SAM2-Small | √ | 77.26 | 61.83 | 75.01 | 2.09 |
| SAM2-Base+ | − | 74.39 | 57.91 | 71.89 | 2.29 |
| SAM2-Base+ | √ | 78.11 | 63.03 | 76.00 | 2.04 |
| SAM2-Large | − | 74.99 | 58.56 | 72.44 | 2.29 |
| SAM2-Large | √ | 80.31 | 64.24 | 77.93 | 1.80 |
Table 5.
Ablation study on lossnet.
| Backbones | Adapter | lossnet | F1 | mIoU | MAE ↓ |
|---|---|---|---|---|---|
| SAM2-Large | √ | − | 78.77 | 64.08 | 2.00 |
| SAM2-Large | √ | VGG16 | 78.89 | 64.21 | 1.84 |
| SAM2-Large | √ | ResNet50 | 80.31 | 64.24 | 1.80 |
Table 6.
Ablation study on MSSM and DRB.
| Backbones | MSSM | DRB | F1 | mIoU | MAE ↓ |
|---|---|---|---|---|---|
| SAM2-Large | − | √ | 77.44 | 61.64 | 3.18 |
| SAM2-Large | √ | − | 78.72 | 63.16 | 1.94 |
| SAM2-Large | √ | √ | 80.31 | 64.24 | 1.80 |
Table 7.
Quantitative analysis of cross-dataset experiments. D2S and D2M denote models trained on DeepGlobe and evaluated on SpaceNet and Massachusetts, respectively; S2S and M2M denote training and evaluation on the same dataset.
| Method | D2S (RLOD) | S2S (RLOD) | D2M (RLOD) | M2M (RLOD) |
|---|---|---|---|---|
| UNet (2015) | 10.21 | 57.46 | 50.51 | 66.60 |
| UNet++ (2018) | 5.32 | 26.00 | 16.53 | 61.75 |
| D-LinkNet (2018) | 18.96 | 63.41 | 51.50 | 66.26 |
| MSNet (2021) | 15.94 | 59.31 | 51.53 | 58.73 |
| M2SNet (2023) | 18.69 | 58.55 | 52.44 | 59.00 |
| Seg-Road (2023) | 37.29 | 67.78 | 58.50 | 58.19 |
| SwinUNet (2023) | 18.94 | 53.01 | 52.10 | 55.98 |
| SGCNNet (2022) | 8.14 | 50.18 | 26.41 | 52.73 |
| MSMDFFNet (2024) | 8.02 | 60.37 | 45.93 | 61.14 |
| SAM2UNet (2024) | 55.78 | 63.64 | 63.62 | 62.94 |
| SAM2MS (Ours) | 60.84 | 68.45 | 67.16 | 69.61 |