C3Net: Cross-Modal Feature Recalibrated, Cross-Scale Semantic Aggregated and Compact Network for Semantic Segmentation of Multi-Modal High-Resolution Aerial Images

Abstract: Semantic segmentation of multi-modal remote sensing images is an important branch of remote sensing image interpretation. Multi-modal data have been proven to provide rich complementary information for dealing with complex scenes. In recent years, semantic segmentation based on deep learning methods has made remarkable achievements. It is common to simply concatenate multi-modal data or to use parallel branches to extract multi-modal features separately. However, most existing works ignore the effects of noise and redundant features from different modalities, which may lead to unsatisfactory results. On the one hand, existing networks neither learn the complementary information of different modalities nor suppress the mutual interference between them, which may decrease segmentation accuracy. On the other hand, the introduction of multi-modal data greatly increases the running time of pixel-level dense prediction. In this work, we propose an efficient network, C3Net, that strikes a balance between speed and accuracy. More specifically, C3Net contains several backbones for extracting the features of different modalities. Then, a plug-and-play module is designed to effectively recalibrate and aggregate multi-modal features. In order to reduce the number of model parameters while retaining model performance, we redesign the semantic context extraction module based on lightweight convolution groups. Besides, a multi-level knowledge distillation strategy is proposed to improve the performance of the compact model. Experiments on the ISPRS Vaihingen dataset demonstrate the superior performance of C3Net, which requires 15× fewer FLOPs than the state-of-the-art baseline network while providing comparable overall accuracy.


Introduction
Remote sensing images are widely used in a variety of applications, such as military reconnaissance [1,2], urban planning [3][4][5][6][7][8][9], disaster monitoring [10,11], and meteorological monitoring [12,13]. As one of the significant methods for the automatic analysis and interpretation of remote sensing images, semantic segmentation, namely pixel-wise classification, aims to assign each pixel a semantic label. In the past few years, semantic segmentation has benefited greatly from deep learning methods developed in the computer vision field for natural RGB images. Indeed, methods based on deep learning, especially Convolutional Neural Networks (CNNs), have greatly improved the accuracy of semantic segmentation. In addition to common RGB data, other remote sensing data are also widely used for semantic segmentation, such as Synthetic Aperture Radar (SAR) images [14][15][16][17] and Digital Surface Models (DSM) [4,18,19,20]. These multi-modal data usually refer to a collection of multi-channel data collected by different sensors and can reflect different characteristics of the objects. In order to improve on the performance of semantic segmentation based on single-modal data, we utilize the rich complementary features of multi-modal remote sensing images, including IRRG (Near Infrared-Red-Green) and DSM data. Semantic segmentation based on multi-modal images aims to use the complementary characteristics of different modalities to improve classification accuracy while reducing the influence of the inherent noise in single-modal data and improving the generalization performance in complex application scenes.
There are two main challenges in the utilization of multi-modal images. One is how to effectively extract multi-modal features. Typical methods directly concatenate multiple channels as the input or integrate the prediction results of different models corresponding to the multi-modal images. However, indiscriminately fusing features at the data level or prediction level does not fully extract the complementary characteristics of the modalities, and may even introduce redundant features and aggravate the impact of image noise. In this work, we extract multi-modal information with a feature-level fusion method. Furthermore, we introduce a cross-modal feature recalibration module (CFR) to improve the quality of the multi-modal representation by first transforming the features of the corresponding modalities and then recalibrating and aggregating the informative features into the fusion feature.
The other challenge in the semantic segmentation of multi-modal images is the efficiency of the algorithm. Semantic segmentation is inherently a pixel-level dense classification task. In addition, a segmentation model designed for multi-modal data introduces more modalities and corresponding feature extraction networks. The huge model scale and extensive computational burden limit its application on edge devices and in scenarios with real-time requirements. Considering this, in quest of an efficient model that strikes a balance between speed and accuracy, we redesign the cross-scale semantic aggregation module proposed in our previous work into a lightweight form. Besides, we introduce a multi-level knowledge distillation (KD) strategy to obtain an accurate and compact network for dense prediction.
Our contributions are the following: • We propose a multi-modal semantic segmentation network, C3Net, which takes both efficiency and accuracy into account.

The remainder of this paper is organized as follows. Section 2 introduces the background of semantic segmentation methods based on RGB images and multi-modal images, lightweight network design, and knowledge distillation. Section 3 describes the proposed C3Net in detail. Section 4 illustrates the experimental settings and analyzes the experimental results, followed by the conclusions in Section 5.

Related Work
In this section, we first briefly review the development of deep learning methods in the field of semantic segmentation, including both the segmentation tasks of common RGB images and remote sensing multi-modal images. We then review several mainstream handcrafted networks and corresponding lightweight convolutional modules. Finally, we review the related work in the field of knowledge distillation, which is also an active direction in the field of model compression and acceleration.

Semantic Segmentation
Research on semantic segmentation has received a tremendous performance boost since the design of the fully convolutional network [21]. The two major factors restricting the improvement of semantic segmentation accuracy are the extraction of spatial information and of semantic information. In order to recover spatial detail information, low-level features are usually introduced in the form of skip-connections in networks based on the encoder-decoder architecture [22][23][24][25]. Skip-connection-based methods can directly introduce spatial information without many additional parameters. However, simply merging low-level features runs the risk of introducing redundant features. In particular, the spatial information of multi-modal features needs to be extracted selectively to prevent the introduction of redundant features and noise. Previous work [4] utilizes artificial prior knowledge to concatenate low-level features, which still ignores the complementary features of multi-modal data. In order to make full use of the complementary features of multi-modal data, we propose a cross-modal feature recalibration module to extract informative spatial information.
The key to learning semantic information is to extract appropriate receptive fields. Methods based on pooling operations [26,27] can learn the scene-level global context but result in the loss of spatial information. Methods based on attention mechanisms use "attention" descriptors to focus feature learning on global semantic information [28][29][30]; they can capture a global receptive field but increase the computational burden. Dilated convolution [31] is another method that can enlarge the receptive field without significantly increasing the computation [27,32]. However, cascading dilated convolutions may lead to the grid effect, which results in a reduction of effective features. Our previous work [33] densely connects dilated convolutions to eliminate the grid effect and obtain multi-scale context information; however, it introduces a large number of parameters, resulting in low computational efficiency. In order to solve the above-mentioned problems in semantic information extraction, we design a lightweight cross-scale semantic aggregation module. The proposed module performs better on multi-scale objects with fewer parameters.
With the development of remote sensing imaging technology and deep learning, recent work on the semantic segmentation of multi-modal remote sensing images has made considerable progress. RiFCN [34] simply combines the near-infrared, red, green (IRRG) spectrum and DSM as a fused input to the network. However, it does not fully exploit the relationship between heterogeneous features and may introduce redundant features during training. Some works [18,35] based on parallel branch architectures use multiple backbones to process the multi-modal data separately. Such parallel branch fusion architectures can effectively extract informative features; however, the huge model scale brings a large number of parameters and reduces model inference speed. In [19], backend processing is utilized to fuse the DSM features to extract complementary information and refine the segmentation details. However, a multi-stage training method makes it difficult to guarantee a globally optimal solution. In this work, we aim to design a lightweight, end-to-end architecture that can effectively extract cross-modal complementary information and achieve more accurate segmentation results.

Lightweight Networks
In recent years, designing handcrafted convolutional neural network architectures for the optimal trade-off between accuracy and efficiency has developed into an active research field. SqueezeNet [36] utilizes 1 × 1 convolutions in its squeeze and expand modules to reduce the number of parameters. Group convolution was proposed in [37] for parallelization to accelerate the training process, and many state-of-the-art networks [38][39][40] integrate group convolution into their architectures to reduce model scale and parameters. ShuffleNet introduces a channel shuffle operation to enhance the flow of information between channels and improve the performance of group convolution. Furthermore, depthwise separable convolution [41] has been introduced as an efficient replacement for traditional convolution layers and is widely used in [42][43][44]. An inverted residual bottleneck module was proposed in [45] as an extension of depthwise separable convolution and is used in state-of-the-art efficient architectures [46][47][48]. In this work, we choose a variety of lightweight networks as the backbone network to explore the optimal knowledge distillation strategy and the optimal architecture of the student network; more quantitative comparisons can be found in Section 4.3.3. Besides, we also design a plug-and-play module, namely the lightweight convolution group (LCG). As the basic architecture of the cross-scale semantic aggregation module, the LCG can drastically reduce model parameters while retaining multi-scale semantic information.
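To make the parameter savings of depthwise separable convolution concrete, the following sketch (our own illustration, not taken from the cited works) compares the parameter count of a standard convolution layer with that of its depthwise separable replacement:

```python
def conv_params(c_in, c_out, k):
    # standard convolution: one k x k kernel per (input, output) channel pair
    return c_in * c_out * k * k

def dw_separable_params(c_in, c_out, k):
    # depthwise step: one k x k kernel per input channel;
    # pointwise step: a 1 x 1 convolution that mixes channels
    return c_in * k * k + c_in * c_out

# example: a 256 -> 256 channel layer with 3 x 3 kernels
standard = conv_params(256, 256, 3)           # 589,824 parameters
separable = dw_separable_params(256, 256, 3)  # 67,840 parameters
print(round(standard / separable, 1))         # 8.7, i.e. roughly 8.7x fewer parameters
```

The same counting argument explains why such layers dominate efficient backbones like MobileNet.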

Knowledge Distillation
Knowledge distillation aims to compress the model and improve its accuracy by transferring knowledge from a teacher network (a cumbersome model) to a student network (a compact model). It has been widely used in image classification and has been proven to achieve good performance improvements [49][50][51]. The research of [49] is the pioneering work, which exploits intra-class similarity and inter-class diversity from the teacher network as useful knowledge to guide the learning process of the student network. Following [49], many other knowledge distillation methods have been proposed. FitNet [50] considers the intermediate features as informative knowledge for the first time and directly aligns feature maps to learn intermediate representations. Subsequently, the attention map aggregated from the response maps was also regarded as a kind of knowledge. Attention transfer [51], which mimics the attention maps between the teacher network and the student network, is related to our approach. However, the previous methods usually focus on image classification tasks and are only applied to single-modal data. In this work, we explore knowledge distillation approaches for multi-modal semantic segmentation networks. A multi-level knowledge distillation strategy based on class knowledge transfer and attention knowledge transfer is proposed to design an efficient semantic segmentation framework for multi-modal image processing.

Method
Following the previous work [4], the proposed network architecture adopts an encoder-decoder structure, as shown in Figure 2. In this work, ResNet101 and ResNet50 [52] are selected as the feature extraction networks for the multispectral data and the DSM data, respectively. In the encoder, the parallel multi-modal feature extraction networks can be replaced with any CNN. CSA is the abbreviation of the lightweight cross-scale semantic aggregation module, which can efficiently obtain multi-scale contextual information. CFR represents the cross-modal feature recalibration module, which is designed for multi-modal feature fusion. In order to simplify the model structure, the architecture of the decoder follows the design in [25]. In this section, we introduce the CFR module, the CSA module, and the multi-level knowledge distillation strategy in turn.

Cross-Modal Feature Recalibration Module (CFR)
The CFR module is designed to jointly learn the multi-modal features while reducing the influence of the inherent noise of different modalities. The simple concatenation of multi-modal features [34] cannot learn cross-modal complementary characteristics, and the inherent noise and redundant features even reduce the capability of feature expression [4]. In order to extract informative features from different modalities, the CFR module is designed with feature aggregation, feature recalibration, and feature reconstruction operations. The architecture of CFR is shown in Figure 3. We denote $X_1 \in \mathbb{R}^{C \times H \times W}$ and $X_2 \in \mathbb{R}^{C \times H \times W}$ as the input feature maps of the two branches, respectively. First, the fusion feature map $X_c$ is calculated as

$$X_c = [X_1, X_2],$$

where $[\,\cdot\,]$ denotes the concatenation operation along the channel dimension. To ensure the validity of the features, we utilize the attention mechanism [28,53] to transform the feature maps into an embedding space and regard the result as an attention vector. The attention vector represents the highly confident activations in the original feature map, which effectively focuses the feature learning process on the most informative features and suppresses the importance of noisy features. Taking one branch as an example, the attention vector is calculated as

$$V_1 = \sigma(F_M(F_G(X_1))),$$

where $F_M$ denotes a common MLP network consisting of two 1 × 1 convolutional layers and $\sigma$ denotes the sigmoid function. $F_G$ denotes the long-range context modeling operation, through which the global semantic information is obtained for feature aggregation. It can be formulated as

$$F_G(X) = \sum_{i=1}^{HW} \frac{\exp(W_C X_i^C)}{\sum_{j=1}^{HW} \exp(W_C X_j^C)} X_i^C,$$
where $X_i^C$ denotes the feature vector at position $i$ and $W_C$ denotes a linear transformation matrix. After calculating the attention vector, the aggregated feature can be formulated as

$$\tilde{X}_1 = V_1 \odot X_1,$$

where $\odot$ denotes element-wise multiplication. The aggregated features can be regarded as the initial calibration of the features from the same modality. Compared with the original feature maps, the aggregated feature maps contain more informative features and less noise. The feature recalibration is designed to exchange and recalibrate the features of different modalities.
The process of recalibration completes the cross-modal information interaction. More related discussions can be found in Section 4.3.1.
Previous works usually directly introduce low-level features from different modalities into the decoder for rich spatial information. However, the spatial information of different modalities is usually not well aligned, which seriously affects the feature reconstruction. Inspired by [54], a spatial-wise gate mechanism is designed to obtain the multi-modal fusion feature. A 1 × 1 convolutional layer is utilized for embedding space mapping and a softmax function is applied to obtain the gate weights:

$$G_1 = \frac{\exp(W_S X_{1i})}{\sum_{j=1}^{m} \exp(W_S X_{ji})},$$

where $X_{1i}$ denotes the feature vector in channel $i$ from modality 1 and $W_S$ denotes a linear transformation matrix. The final fusion feature is the summation of the weighted features:

$$X_o = \sum_{j=1}^{m} G_j \odot \tilde{X}_j,$$
where $m$ denotes the total number of modalities (in our case, $m = 2$). $X_o$ is regarded as the low-level feature in the decoder for up-sampling and spatial information recovery.
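The CFR data flow above can be sketched in a few lines of NumPy (a simplified illustration under our own assumptions: spatial average pooling stands in for the context modeling $F_G$, the MLP $F_M$ is omitted, and all learned weights are dropped; the real module uses trained parameters):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cfr_fuse(x1, x2):
    """Simplified CFR sketch for two (C, H, W) feature maps."""
    aggregated = []
    for x in (x1, x2):
        g = x.mean(axis=(1, 2))                  # stand-in for global context F_G
        v = sigmoid(g)                           # channel attention vector (MLP omitted)
        aggregated.append(v[:, None, None] * x)  # element-wise recalibration
    # spatial-wise gate: softmax over modalities at every spatial position
    scores = np.stack([a.sum(axis=0) for a in aggregated])  # (2, H, W)
    e = np.exp(scores - scores.max(axis=0, keepdims=True))
    gates = e / e.sum(axis=0, keepdims=True)
    # final fusion: gate-weighted sum of the recalibrated modalities
    return sum(g[None] * a for g, a in zip(gates, aggregated))

fused = cfr_fuse(np.random.rand(4, 8, 8), np.random.rand(4, 8, 8))
print(fused.shape)  # (4, 8, 8)
```

Note that the gate weights sum to one over the modalities at each spatial position, so the fusion is a convex combination of the recalibrated feature maps.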

Cross-Scale Semantic Aggregation Module (CSA)
Semantic contextual information is critical to pixel-level classification. In recent years, parallel spatial pyramid pooling structures [25,27] have been widely used to obtain multi-scale receptive fields. However, it is still difficult to obtain a large and dense receptive field in remote sensing scenes. To remedy this problem, the CSA module is designed to capture dense contextual information in a cross-scale form through multiple sets of dilated convolutions. Besides, the CSA module is composed of lightweight convolution groups (LCG), which reduce the computational burden of multi-modal high-resolution images. The architecture of CSA is shown in Figure 4. CSA contains six contextual information extraction branches: we utilize five LCGs with different dilation rates to capture multiple receptive fields and a multi-shape pooling (MSP) to obtain global semantic information. We denote $X_s$ and $Y_s$ as the input and output feature maps of the CSA module, respectively. The above operations can be formulated as

$$Y_s = [Y_1, Y_2, \ldots, Y_6],$$

where $[\,\cdot\,]$ denotes the concatenation operation along the channel dimension and $Y_i$ denotes the output of the $i$-th branch of the CSA module.
Specifically, $Y_6 = F_M(X_s)$ and $Y_i = F_L^{k,d}(X_s)$ for $i = 1, \ldots, 5$, where $F_M$ denotes the multi-shape pooling operation and $F_L^{k,d}$ denotes a lightweight convolution group with kernel size $k$ and dilation rate $d$. The architecture of the lightweight convolution group is shown in Figure 4. An LCG consists of two 1 × 1 convolution layers for channel reduction and expansion, respectively. Besides, a 3 × 3 convolution layer with a branch-specific dilation rate $r$ is utilized for semantic information encoding, and another 3 × 3 convolution layer is added to encode more spatial contextual information. The architecture of the multi-shape pooling is also shown in Figure 4. MSP utilizes 1 × 1, 3 × 1, and 1 × 3 convolutional layers to model global semantics. In particular, it encodes the globally horizontal and vertical information to capture complex objects with multiple shapes in remote sensing scenes. Assuming that the input of the module is $X_{in}$, the above operations can be formulated as

$$F_M(X_{in}) = \sigma\Big(W_{1\times1}\Big(\sum_{i=1}^{3} W_i(P_i(X_{in}))\Big)\Big),$$

where $\sigma$ denotes the sigmoid function, $W_{1\times1}$ denotes a convolution layer with a 1 × 1 kernel, and $W_i$, $P_i$ denote the convolution and pooling operations corresponding to the three branches.
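To illustrate the bottleneck design of the LCG, here is a parameter-count sketch of one LCG branch versus a plain 3 × 3 convolution branch (our own back-of-the-envelope estimate; the channel reduction ratio of 4 is an assumption, and biases and batch-norm parameters are ignored):

```python
def lcg_branch_params(c_in, c_out, reduction=4):
    # hypothetical LCG: 1x1 reduce -> 3x3 dilated -> 3x3 -> 1x1 expand
    # (dilation changes the receptive field but not the parameter count)
    c_mid = c_in // reduction
    return (c_in * c_mid          # 1x1 channel reduction
            + c_mid * c_mid * 9   # 3x3 dilated convolution
            + c_mid * c_mid * 9   # second 3x3 convolution
            + c_mid * c_out)      # 1x1 channel expansion

def plain_branch_params(c_in, c_out):
    # a single traditional 3x3 convolution branch at full width
    return c_in * c_out * 9

print(lcg_branch_params(512, 512))    # 425,984
print(plain_branch_params(512, 512))  # 2,359,296 -- about 5.5x larger
```

Because the two 3 × 3 convolutions operate at the reduced width, the savings grow quadratically with the reduction ratio.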

Multi-Level Knowledge Distillation Strategy
In order to obtain a compact model, knowledge distillation has been widely used in CNNs designed for natural RGB images; however, there is little related research on knowledge distillation strategies for multi-modal images. In this section, we explore knowledge distillation strategies for networks that process multi-modal images and introduce a multi-level knowledge distillation strategy. As shown in Figure 5, the proposed knowledge distillation strategy includes class knowledge transfer and attention knowledge transfer. Note that only a single branch of the multi-modal network is shown as an example.

Figure 5. Illustration of our knowledge distillation strategy, showing the teacher network, the student network, and the segmentation loss.

Class Knowledge Transfer
Semantic segmentation can be regarded as a pixel-wise classification task. Therefore, the class probabilities produced by the teacher network can be utilized to transfer informative knowledge to the student network. We follow [49] and regard the class probabilities as soft targets, which provide much more information than hard targets. We denote $S$ and $T$ as the compact student network and the pre-trained teacher network, respectively. The loss function of the class knowledge transfer process is as follows:

$$\mathcal{L}_c = \frac{1}{H \times W} \sum_{i=1}^{H \times W} \mathrm{KL}(p_{S,i}, p_{T,i}),$$

where $H \times W$ denotes the spatial size of the softmax layer output and $\mathrm{KL}(\cdot\,,\cdot)$ is the Kullback-Leibler divergence. $p_{S,i}$ and $p_{T,i}$ denote the soft targets of the $i$-th pixel produced by the student and teacher networks, which can be formulated as

$$p_i^j = \frac{\exp(z_i^j / T)}{\sum_{k=1}^{C} \exp(z_i^k / T)},$$

where $j$ denotes the $j$-th class of the $C$ classes and $z$ denotes the input to the softmax layer (the logits).
T denotes the temperature factor that is utilized to control the probability distribution over classes.
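A minimal sketch of the class knowledge transfer loss (our own NumPy illustration; we assume the common KD direction KL(p_T ‖ p_S) over temperature-softened logits, and the temperature value 4.0 is an arbitrary example):

```python
import numpy as np

def soften(logits, T):
    # temperature-softened softmax over the class dimension
    z = logits / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def class_kd_loss(student_logits, teacher_logits, T=4.0):
    """Average per-pixel KL(p_T || p_S); logits have shape (num_pixels, C)."""
    p_s = soften(student_logits, T)
    p_t = soften(teacher_logits, T)
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1)
    return kl.mean()

logits = np.random.randn(16, 6)  # 16 pixels, 6 classes (e.g. the ISPRS label set)
print(class_kd_loss(logits, logits))  # 0.0 -- identical networks transfer nothing
```

A higher temperature flattens both distributions, emphasizing the relative similarities between non-target classes that the hard labels discard.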

Attention Knowledge Transfer
Class knowledge transfer performs knowledge distillation for each pixel separately, which may lead to sub-optimal results for the dense prediction task. The knowledge extracted from deep features only focuses on high-level information and neglects the detailed spatial knowledge in low-level features. Besides, the activations of different modal features vary greatly, so it is necessary to transfer knowledge for multi-modal features. For these reasons, we explore a multi-level knowledge distillation strategy based on attention knowledge, as shown in Figure 5.
The attention transfer loss is defined as

$$\mathcal{L}_A = \sum_{m=1}^{M} \sum_{l} \sum_{k=1}^{C} \left\| \frac{A_{S,k}^{l}}{\|A_{S,k}^{l}\|_2} - \frac{A_{T,k}^{l}}{\|A_{T,k}^{l}\|_2} \right\|_2,$$

where $m$ denotes the $m$-th modality of the $M$ modalities, $l$ denotes the $l$-th pair of teacher attention maps $A_{T,k}^{l}$ and student attention maps $A_{S,k}^{l}$, and $k$ denotes the $k$-th class of the $C$ classes. $\|\cdot\|_2$ denotes the $l_2$-normalization that is utilized during knowledge transfer.
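The attention transfer step can be sketched as follows (our own illustration; we assume the attention map is the channel-wise sum of squared activations, as in the original attention transfer work [51], and iterate over one modality's level pairs):

```python
import numpy as np

def attention_map(feat):
    # collapse a (C, H, W) feature map to an (H, W) spatial attention map
    return (feat ** 2).sum(axis=0)

def attention_transfer_loss(student_feats, teacher_feats, eps=1e-12):
    """Sum of distances between l2-normalized attention map pairs,
    one pair per level (and per modality, if iterated externally)."""
    loss = 0.0
    for fs, ft in zip(student_feats, teacher_feats):
        qs = attention_map(fs).ravel()
        qt = attention_map(ft).ravel()
        qs = qs / (np.linalg.norm(qs) + eps)  # normalize so scale differences
        qt = qt / (np.linalg.norm(qt) + eps)  # between networks do not dominate
        loss += np.linalg.norm(qs - qt)
    return loss

feats = [np.random.rand(8, 4, 4) for _ in range(3)]  # three feature levels
print(attention_transfer_loss(feats, feats))  # 0.0 -- identical features
```

Normalizing each map before comparison is what lets a narrow student mimic the spatial focus of a much wider teacher despite different activation magnitudes.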
The training process of our method is presented in Algorithm 1. Given the pre-trained teacher network, the parameters of the teacher network are kept frozen during the training process. The student network is supervised by three losses: the common cross-entropy loss $\mathcal{L}_s$ with the ground truth, the class knowledge transfer loss $\mathcal{L}_c$ in Equation (11), and the attention transfer loss $\mathcal{L}_A$ in Equation (13). Scaling factors $\alpha$ (10) and $\beta$ (1000) are utilized to make these loss values comparable in range.

ISPRS Vaihingen

The Vaihingen Dataset provides 16 labeled tiles for training. Following other works, the training set consists of 11 tiles (region numbers 1, 3, 5, 7, 13, 17, 21, 23, 26, 32, 37) and the validation set consists of five tiles (region numbers 11, 15, 28, 30, 34). Limited by computing resources, we crop the large tiles into patches of 513 × 513 pixels using an overlapping (50%) sliding window. Finally, we obtain 696 patches for training and 297 patches for validation. Note that in the testing phase, we use all patches for training.

ISPRS Potsdam
The Potsdam Dataset contains 38 very high resolution true orthophoto (TOP) tiles. Each tile contains 6000 × 6000 pixels with a resolution of 5 cm/pixel. The tiles consist of Infrared-Red-Green-Blue (IRRGB) data, DSM data, and nDSM data. The Potsdam Dataset provides 24 labeled tiles for training. In the following experiments, 6 tiles (region numbers 2_12, 3_12, 4_12, 5_12, 6_12, 7_12) are removed from the training set to serve as the validation set. During training, the large tiles are cropped into 9522 patches for training and 3174 patches for validation according to the method mentioned above. Note that in the testing phase, we use all patches for training.

Dataset Augmentation
Standard data augmentation is applied during the training process, including random flipping, random scaling (from 0.5 to 2), and random rotating (between −10 and 10 degrees).

Evaluation
According to the benchmark rules, overall accuracy (OA) and F1 score are used to quantitatively evaluate the model performance. OA is the normalized trace of the pixel-based confusion matrix. The F1 score is calculated as follows:

$$F1 = \frac{2 \times P \times R}{P + R}, \qquad P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN},$$

where $P$ and $R$ denote precision and recall, respectively,
where TP denotes the number of true positives, FP denotes the number of false positives, and FN denotes the number of false negatives. All these metrics can be calculated from a pixel-based accumulated confusion matrix. The number of floating-point operations (FLOPs) is also reported to investigate the computational complexity.
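For reference, both metrics can be read directly off a pixel-based confusion matrix, as in this sketch:

```python
import numpy as np

def oa_and_f1(cm):
    """cm[i, j] counts pixels of true class i predicted as class j."""
    cm = cm.astype(float)
    oa = np.trace(cm) / cm.sum()  # overall accuracy: correct pixels / all pixels
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp      # predicted as the class, but wrong
    fn = cm.sum(axis=1) - tp      # pixels of the class that were missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return oa, f1

# toy two-class example
cm = np.array([[40, 10],
               [ 5, 45]])
oa, f1 = oa_and_f1(cm)
print(round(oa, 2))  # 0.85
```

Accumulating one confusion matrix over all test patches, as done here, weights every pixel equally regardless of which tile it came from.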

Implementation Details
During the training stage, the proposed network utilizes stochastic gradient descent (SGD) with momentum (0.9) and weight decay (0.0005) as the optimization strategy and is trained for 100 epochs. The base learning rate is set to 0.01 with the "poly" learning rate policy, and the power is set to 0.9. Our experiments are implemented on a Tesla P100 GPU with the batch size set to 4. For a fair comparison, we conduct the ablation study with 513 × 513 patches.
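The "poly" policy mentioned above decays the learning rate as lr = base_lr × (1 − step/max_step)^power; a small sketch (our own, assuming the schedule is stepped once per epoch):

```python
def poly_lr(base_lr, step, max_step, power=0.9):
    # "poly" learning rate policy: smooth decay from base_lr toward zero
    return base_lr * (1.0 - step / max_step) ** power

print(poly_lr(0.01, 0, 100))             # 0.01 at the start of training
print(round(poly_lr(0.01, 50, 100), 5))  # 0.00536 halfway through
```

With power < 1 the decay is slower than linear early on and steeper near the end of training.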

Results and Discussion
In this section, we report the effects of the cross-modal feature recalibration module, the cross-scale semantic aggregation module, and the multi-level knowledge distillation strategy, and analyze the proposed methods from both qualitative and quantitative perspectives.

Effects of CFR
In order to further reveal the role of CFR, we visualize the responses of the DSM features in Figure 6. We average over the channel dimension and convert the result to one dimension for visualization. As shown in the second column of Figure 6, due to the imaging quality and the inherent characteristics of the data, DSM images lack spatial detail information. Comparing the fifth and sixth columns of Figure 6, the DSM features contain more significant detailed information after CFR feature calibration. In particular, there is a higher activation response at the edges between different classes, and the outline of each object is more precise. In addition, the third row of Figure 6 shows that the CFR module makes the feature map show a good response to the car.

Figure 6. Visualization of the IRRG image, ground truth, DSM, and the DSM features before and after feature calibration.

Table 1 shows the results on the ISPRS Vaihingen testing dataset with different multi-modal fusion methods based on the baseline architecture. The results of the IRRG-based method are also listed for reference in the first row. As shown in the second row of Table 1, due to the noisy and redundant features of the DSM modality, directly concatenating the multi-modal images as the input leads to worse performance on the car class. This simple multi-modal fusion mechanism cannot exploit the complementarity of different modalities to boost performance. As shown in the third row of Table 1, the design of the baseline is inspired by previous work [4], which has achieved good segmentation performance. Based on the above observation, the baseline only introduces the IRRG features into the decoder. In order to utilize the strength of multi-modal information, CFR is designed to filter out useless information and fuse the cross-modal features in an effective way. Thus, CFR gains a further 0.47% OA improvement compared with the baseline. In particular, CFR obtains a higher F1 value for the car class. CFR uses the rich semantic information of the DSM modality in the decoder to reconstruct the low-level features and achieves better segmentation results on small targets, which are likely to lose spatial detail information during encoding. In summary, the cross-modal feature recalibration module can aggregate information from multiple modalities and form discriminative representations for segmentation. Furthermore, the CFR module boosts segmentation performance while adding only a small parameter and computation overhead. It can also be regarded as a plug-and-play module to effectively fuse multi-modal images.

Effects of CSA
Contextual semantic information plays an important role in classification, especially for dense prediction tasks. Existing methods usually contain only a limited number of receptive field scales, which makes it difficult to deal with remote sensing application scenarios that include multiple complex objects over a large scope. The CSA module uses multiple dilated convolutions to obtain multi-scale receptive fields in the form of dense connections. In the experiments, we empirically choose 1, 3, 6, 12, and 18 as the dilation rates. As shown in Figure 7, SPP [27] contains four scales of receptive fields and ASPP [25] contains five scales of receptive fields, while the proposed CSA and MCA contain 25 types of receptive fields. The CSA module can not only extract multi-scale contextual information but also obtain global semantics with multi-shape pooling. We examine the role of the cross-scale semantic aggregation module on the ISPRS Vaihingen testing dataset. In order to verify the effectiveness of CSA, we re-implement several mainstream semantic context extraction modules based on the baseline network. Table 2 shows the results of the baseline containing different semantic context extraction modules. The performance of the network without any semantic aggregation module is also shown in the first row of Table 2 for comparison. Compared with ASPP, SPP, and MCA, the proposed CSA achieves the best OA of 91.15% based on the same baseline network. CSA and MCA have the same receptive field scales, but CSA utilizes lightweight convolution groups instead of traditional convolutions, which greatly reduces the number of parameters of the semantic extraction module. As shown in Table 2, compared with MCA, CSA reduces the number of parameters by 55.72% while slightly improving performance by 0.12%.

Effects of Multi-Level Knowledge Distillation Strategy
The multi-level knowledge distillation strategy contains two types of distillation methods: class knowledge transfer and attention knowledge transfer. We conduct ablation experiments on these two distillation methods separately. In the following experiments, we choose ResNet101 as the IRRG feature extraction network and ResNet50 as the DSM feature extraction network for the teacher network. As for the student network, in order to compress the model parameters as much as possible and obtain the relevant knowledge of the corresponding feature pairs, we choose ResNet18 as the feature extraction network for both modalities.
In order to verify the effect of feature maps at different levels in the attention knowledge transfer strategy, we distill the attention knowledge from feature map pairs at multiple levels. As shown in Table 3, the feature map pair from level 5 is distilled as the high-level attention knowledge. Then the attention knowledge distilled from the feature pairs at level 4, level 3, level 2, and level 1 is introduced in sequence. Here, feature map pairs from different levels represent the output feature map pairs from the corresponding stages in ResNet. As expected, the effect of distillation gradually increases with the introduction of low-level attention maps. This indirectly proves that the detailed knowledge contained in the low-level attention maps helps to improve the performance of dense prediction tasks. During the knowledge distillation training process, the model scale and dataset scale have a great influence on information transfer. In order to demonstrate the effectiveness of the proposed multi-level knowledge distillation strategy, we ablate a variety of backbone models [45,47,52] on the Vaihingen and Potsdam datasets. As shown in Figure 8, we compare the accuracy and efficiency of different models and quantify them with OA and GFLOPs in the bubble chart. Note that the area of each bubble represents the number of parameters. Experimental results on various backbones and datasets show that the proposed distillation strategy can effectively improve the segmentation accuracy of compact models. The results also show that the proposed distillation strategy obtains greater performance gains on large-scale datasets, which implies that the proposed method has better generalization performance in big-data application scenarios.

Comparing with State-of-the-Arts
We report the results on the Vaihingen dataset in Figure 9 and Table 4. In order to demonstrate the effectiveness of the cross-scale semantic aggregation module (CSA), we also train C1Net (baseline with CSA) on IRRG data (CH3) to compare with other state-of-the-art methods. As shown in Table 4, C1Net shows superior segmentation results compared with the other methods. In the multi-modal data experiments (CH4), the proposed C2Net (baseline with CSA and CFR) ranks first in both mean F1 score and overall accuracy compared with all other published works. Note that neither test-time augmentation nor CRF post-processing is used in the proposed methods, based on the consideration of fast inference and compact model design. As shown in Figures 8 and 9, it is worth noting that C3Net (with a MobileNet V3S backbone) obtained by knowledge distillation can still achieve competitive segmentation accuracy while the number of parameters is reduced by 93.6% and the inference speed is increased by 1.6 times. It turns out that C3Net achieves a good balance between accuracy and efficiency.

Conclusions
In this paper, a novel semantic segmentation framework is presented for multi-modal remote sensing images. The major contribution of this work is to address two key challenges in the application of multi-modal images, i.e., the effective joint representation of different modalities and a compact model for efficient inference. A cross-modal feature recalibration module (CFR) is designed to recalibrate the noisy modalities and extract cross-modal complementary features. Besides, a lightweight cross-scale semantic aggregation module (CSA) and a multi-level knowledge distillation strategy are utilized to obtain a compact model while retaining segmentation accuracy. The experimental results indicate that the proposed C2Net achieves state-of-the-art segmentation performance and that the compact C3Net requires only 8.66 GFLOPs while keeping the performance levels almost intact. Note that the proposed networks are only applied to IRRG and DSM data. In future work, we will introduce more informative data, such as SAR and Light Detection and Ranging (LiDAR) data, into the semantic segmentation task to improve the generalization performance in complex remote sensing scenarios. In addition, the proposed lightweight model C3Net still takes up considerable memory on a GPU platform during training. Future work includes lightweight feature extractor design and knowledge distillation based on the structured information of multi-scale objects, aiming to further compress the model for less running time while retaining segmentation accuracy.
Author Contributions: Z.C. conceived and designed the experiments; Z.C. and W.D. performed the experiments and analyzed the data; Z.C. wrote the paper; X.S., X.L., and K.F. contributed materials; W.D., K.F., and M.Y. supervised the study and reviewed this paper. All authors have read and agreed to the published version of the manuscript.

Acknowledgments:
The Vaihingen and Potsdam dataset were provided by the International Society for Photogrammetry and Remote Sensing (ISPRS): https://www2.isprs.org/commissions/comm2 /wg4/benchmark/semantic-labeling. The authors would like to thank all the colleagues for the fruitful discussions on semantic segmentation and knowledge distillation.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

ISPRS    International Society for Photogrammetry and Remote Sensing