1. Introduction
High-resolution remote sensing images are valuable resources that contain detailed ground information, making them highly relevant to disaster monitoring, industrial production, agricultural production, military surveillance, and other fields. Within remote sensing image processing, multi-class geospatial object detection plays a vital role by automatically extracting information about various ground objects. However, the traditional remote sensing object detection workflow involves several time-consuming steps: image acquisition, image download, ground-based image processing, and object detection. This tedious pipeline can undermine the timeliness of tasks that require fast responses, such as disaster monitoring and military warning. Deploying object detection methods directly on satellites for on-board detection would significantly improve the effective utilization of high-resolution remote sensing data for such tasks: by downloading only the specific information of interest rather than all captured images, the burden of data transmission can be effectively alleviated.
Earlier satellites such as Earth Observation-1, QuickBird, and NEMO were limited by the hardware available at the time. They relied on FPGA (Field Programmable Gate Array) and DSP (Digital Signal Processor) technology, which supported basic on-board processing tasks but could not handle computationally complex algorithms such as object detection. However, hardware advancements have since produced small edge computing devices with high computational performance, such as the NVIDIA Jetson AGX Orin, HUAWEI Ascend 310, and Raspberry Pi, which are capable of running artificial intelligence algorithms effectively. These devices provide a solid hardware foundation for on-board remote sensing object detection algorithms.
The advancement of deep learning object detection technology has laid the foundation for on-board remote sensing object detection. Deep learning object detection originated in the computer vision community and is mainly applied to natural images. Depending on whether region proposals [1] are used, deep learning object detection techniques are divided into two categories: two-stage and one-stage. A two-stage network first selects a specific number of region proposal boxes, which are then used in the feature extraction stage to improve the accuracy of classification and localization. Examples include R-CNN (Region Convolutional Neural Network), Fast R-CNN, Faster R-CNN, R-FCN (Region-based Fully Convolutional Network), and Mask R-CNN [1,2,3,4,5]. A one-stage network, in contrast, operates on the concept of regression, directly generating detection boxes through the network without a separate region proposal step. Examples include the YOLO series, SSD (Single Shot multibox Detector), and RetinaNet [6,7,8,9,10,11,12,13].
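To make this distinction concrete, the following minimal sketch contrasts the two families using torchvision's reference implementations. It is purely illustrative, with Faster R-CNN and RetinaNet as stand-ins for the two families rather than the detectors studied in this paper.

```python
# Minimal sketch contrasting two-stage and one-stage detectors using
# torchvision's reference implementations; illustrative only, not the
# detectors evaluated in this paper.
import torch
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn,   # two-stage: RPN proposals + RoI heads
    retinanet_resnet50_fpn,    # one-stage: dense regression, no proposals
)

image = torch.rand(3, 800, 800)  # dummy RGB tile in [0, 1]

# Two-stage: a Region Proposal Network first generates candidate boxes,
# which RoI heads then classify and refine.
two_stage = fasterrcnn_resnet50_fpn(weights=None).eval()

# One-stage: class scores and box offsets are predicted directly on the
# dense feature maps in a single pass.
one_stage = retinanet_resnet50_fpn(weights=None).eval()

with torch.no_grad():
    for model in (two_stage, one_stage):
        out = model([image])[0]  # dict with 'boxes', 'labels', 'scores'
        print(out["boxes"].shape, out["scores"].shape)
```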
Researchers in the Earth observation community have successfully applied deep learning object detection technology to improve remote sensing object detection accuracy. For accuracy enhancement, several two-stage remote sensing object detection methods have been proposed, such as the Small, Cluttered, and Rotated Object Detector (SCRDet++), Adaptive Feature Fusion towards highly accurate oriented object Detection (AFF-Det), and the Phase-Shifting Coder (PSC) [14,15,16,17]. SCRDet++ improves detection accuracy by adding denoising modules to Faster R-CNN and addressing rotation variation. AFF-Det introduces a multi-scale feature fusion (MSFF) module built on the top layer to mitigate the semantic information loss in small-scale features, and proposes a weighted RoI feature aggregation (WRFA) module that uses an attention mechanism to enhance the feature representations at different stages of the network. PSC provides a unified framework for handling the periodic fuzzy problems that arise from rotational symmetry in oriented object detection. For practical applications that demand speed, researchers have proposed one-stage methods such as You Only Look Twice (YOLT), the Hyper-Light deep learning network (HyperLi-Net), and the Aligned Single-Shot Detector (ASSD) [18,19,20]. YOLT showed that, because remote sensing images are very large, scaling them down to a network's input size causes a significant loss of image detail. Built on YOLO v2, it was the first to propose a complete detection workflow suited to industrial production, including image slicing, tile detection, and result mapping. HyperLi-Net combines various network techniques into a lightweight network for high-speed and accurate ship detection from SAR (Synthetic Aperture Radar) imagery. ASSD addresses feature misalignment in one-stage detectors in remote sensing scenes, achieving a balance between speed and accuracy on the DOTA dataset [21].
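The slice-detect-map workflow pioneered by YOLT can be summarized in a short sketch. The `detect` function, tile size, and overlap below are illustrative placeholders rather than YOLT's actual settings.

```python
# Minimal sketch of a YOLT-style slice / detect / map workflow.
# `detect` is a placeholder for any tile-level detector returning
# (x1, y1, x2, y2, score, cls) boxes in tile coordinates; the tile
# size and overlap are illustrative choices, not YOLT's settings.
import numpy as np

def slice_image(image, tile=1024, overlap=200):
    """Yield (x0, y0, tile_array) over a large H x W x C image."""
    h, w = image.shape[:2]
    step = tile - overlap
    for y0 in range(0, max(h - overlap, 1), step):
        for x0 in range(0, max(w - overlap, 1), step):
            yield x0, y0, image[y0:y0 + tile, x0:x0 + tile]

def detect_large_image(image, detect):
    """Run `detect` on each tile and map boxes back to image coordinates."""
    all_boxes = []
    for x0, y0, tile_img in slice_image(image):
        for x1, y1, x2, y2, score, cls in detect(tile_img):
            # Shift tile-local coordinates by the tile's origin.
            all_boxes.append((x1 + x0, y1 + y0, x2 + x0, y2 + y0, score, cls))
    return np.array(all_boxes)
```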
In response to the pressing demand for on-board remote sensing object detection, numerous researchers have proposed methods based on existing research. Current on-board remote sensing object detection methods are primarily divided into two scenarios: SAR image-oriented and optical image-oriented. Pan et al. [22] proposed on-board ship detection in HISEA-1 SAR images, which first uses the Constant False Alarm Rate (CFAR) method to coarsely identify ships and then applies YOLO v4 [9] to obtain more accurate final results. Xu et al. [23] introduced Lite-YOLOv5, a lightweight on-board ship detection model for SAR imagery that incorporates several innovative modules to reduce computational complexity and enhance detection performance. A key advantage of SAR imagery is its ability to acquire data regardless of weather conditions: unlike optical imagery, which can be hindered by clouds, fog, or darkness, SAR penetrates such obstacles and captures data consistently. However, SAR imagery has inherent limitations. First, its resolution is generally lower than that of optical imagery, so the captured detail may be coarser, making it difficult to discern fine-scale features or objects. Second, SAR imagery tends to exhibit a lower signal-to-noise ratio; noise such as speckle, caused by interference patterns in the radar signal, can obscure the desired information and degrade image quality. On-board object detection for optical images is therefore also very necessary. Del Rosso et al. [24] proposed on-board volcanic eruption detection with a CNN (convolutional neural network) in satellite multispectral imagery, the first prototype to apply deep learning models to on-board optical remote sensing detection. Pang et al. introduced a fast and lightweight intelligent Satellite On-orbit Computing Network (SOCNet) [25] that accelerates model inference and reduces the number of parameters through several techniques, including flat multibranch and coupled fine-coarse-grained feature extraction, exchanging a larger receptive field for network depth, depthwise separable convolution, and global average pooling. Li et al. [26] introduced a new intelligent optical remote sensing satellite, Luojia-3 01, which is equipped with various on-board intelligent processing technologies, such as on-board object detection and change detection.
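As a rough illustration of the coarse-to-fine idea in [22], the following sketch implements a textbook two-dimensional cell-averaging CFAR gate that flags candidate pixels for a CNN to refine. The window sizes and false-alarm probability are assumed values, not the settings of Pan et al.

```python
# Illustrative 2-D cell-averaging CFAR (CA-CFAR) gate of the kind used to
# coarsely flag ship candidates before a CNN refines them. Window sizes and
# the false-alarm probability are illustrative, not the settings of [22].
import numpy as np
from scipy.ndimage import uniform_filter

def ca_cfar(intensity, train=12, guard=4, pfa=1e-4):
    """Return a boolean candidate mask over a SAR intensity image."""
    intensity = np.asarray(intensity, dtype=np.float64)
    big = 2 * (train + guard) + 1          # outer window edge length
    small = 2 * guard + 1                  # guard window edge length
    n_train = big**2 - small**2            # number of training cells
    # Local clutter mean from training cells (outer window minus guard cells).
    sum_big = uniform_filter(intensity, big) * big**2
    sum_small = uniform_filter(intensity, small) * small**2
    clutter = (sum_big - sum_small) / n_train
    # Threshold multiplier for exponentially distributed clutter.
    alpha = n_train * (pfa ** (-1.0 / n_train) - 1.0)
    return intensity > alpha * clutter
```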
The mentioned on-board optical object detection methods have made significant contributions in terms of algorithm advancements and practical applications. However, they often overlook a crucial factor: the influence of cloud coverage on optical images. According to data from the International Satellite Cloud Climate Project-Flux Data (ISCCP-FD) [27], the global average cloud coverage is approximately two-thirds. If object detection is performed directly on these images without considering the presence of clouds, it can lead to ineffective detection, wasted computing resources, and reduced detection efficiency. In response to this issue, this paper proposes a comprehensive on-board multi-class geospatial object detection scheme. First, the satellite-captured remote sensing images are sliced into smaller tiles for efficient processing. Second, cloud detection is performed on all the tiles; the results are evaluated, and any tiles identified as cloud images are filtered out. Third, object detection is conducted on the remaining tiles that passed the cloud detection stage. Fourth, the detection results from the tiles are mapped back to the original remote sensing image, aligning them with their corresponding locations in the original image, and noise boxes are removed to improve accuracy. Finally, the object detection results are transmitted to the ground workstation for further analysis and utilization. Depending on specific requirements, the tiles retained by the cloud detection step can also be selectively downloaded, which further alleviates the pressure of data transmission. Given the limitations of on-board hardware, the proposed scheme must balance both the speed and the accuracy of cloud detection. Numerous previous studies have been dedicated to enhancing the performance of cloud detection methods. Jeppesen et al. [28] introduced a deep learning cloud detection model called the Remote Sensing Network (RS-Net), based on the U-net architecture; using only the RGB bands, it shows significant improvement in cloud detection accuracy. Li et al. [29] proposed the global context-dense block U-Net (GCDB-UNet), a robust cloud detection network that integrates the global context-dense block (GCDB) into the U-Net framework, enabling effective detection of thin clouds. Pu et al. [30] introduced a high-precision cloud detection network that combines a self-attention module and spatial pyramid pooling, excelling at detecting cloud edges. While these algorithms offer accuracy advantages, they may lack sufficient speed for on-board computing scenarios. This paper therefore adopts a cloud detection model based on the state-of-the-art (SOTA) real-time semantic segmentation network PID-Net [31], applied to cloud detection through transfer learning; the reliability of the on-board cloud detection task is validated through experiments. To improve on-board object detection performance, this paper embeds the MIOU loss [32] into YOLO v7-Tiny [11], which improves detection accuracy without sacrificing inference speed. Because slicing remote sensing images produces a large number of truncated objects at tile edges, many truncated boxes appear after object detection, and these are not easily removed when mapping tile-level results back to the original image. This paper therefore uses the Truncated NMS algorithm, which Shen et al. [33] have shown to be effective at removing truncated boxes and duplicate detection boxes.
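Putting the pieces together, the following is a minimal sketch of the proposed scheme's control flow. Here `segment_clouds` and `detect` are stand-ins for PID-Net and the MIOU loss-based YOLO v7-Tiny, the cloud threshold and edge margin are assumed values, and the simple edge filter only approximates the Truncated NMS of Shen et al. [33].

```python
# High-level sketch of the proposed on-board scheme: slice, cloud-filter,
# detect, map back, and drop edge-truncated boxes. `segment_clouds` and
# `detect` stand in for PID-Net and the MIOU-loss YOLO v7-Tiny; the cloud
# threshold and edge margin are illustrative assumptions, and this simple
# edge filter only approximates the Truncated NMS of Shen et al. [33].
import numpy as np

CLOUD_THRESHOLD = 0.8   # discard tiles that are mostly cloud (assumed value)
EDGE_MARGIN = 4         # pixels; boxes touching a tile border are suspect

def process_scene(image, segment_clouds, detect, tile=1024, overlap=200):
    results = []
    step = tile - overlap
    h, w = image.shape[:2]
    for y0 in range(0, max(h - overlap, 1), step):
        for x0 in range(0, max(w - overlap, 1), step):
            t = image[y0:y0 + tile, x0:x0 + tile]
            # Step 2: cloud detection; skip tiles dominated by cloud.
            if segment_clouds(t).mean() > CLOUD_THRESHOLD:
                continue
            th, tw = t.shape[:2]
            # Step 3: object detection on the remaining tiles.
            for x1, y1, x2, y2, score, cls in detect(t):
                # Approximate truncated-box removal: drop boxes pressed
                # against a tile edge that is interior to the full image.
                if (x1 < EDGE_MARGIN and x0 > 0) or \
                   (y1 < EDGE_MARGIN and y0 > 0) or \
                   (x2 > tw - EDGE_MARGIN and x0 + tile < w) or \
                   (y2 > th - EDGE_MARGIN and y0 + tile < h):
                    continue
                # Step 4: map tile coordinates back to scene coordinates.
                results.append((x1 + x0, y1 + y0, x2 + x0, y2 + y0, score, cls))
    return np.array(results)
```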
The main contributions of this paper are as follows:
- (1) This paper proposes a comprehensive on-board multi-class geospatial object detection scheme, including image slicing, cloud detection, tile filtering, object detection, coordinate mapping, and noise box removal. This scheme effectively avoids the waste of computing resources caused by performing object detection on a large number of cloud images, significantly improving detection efficiency.
- (2) This paper implements fast on-board cloud detection based on the real-time semantic segmentation network PID-Net, which combines efficiency and accuracy in cloud detection, guaranteeing the effective removal of cloud images.
- (3) To achieve fast on-board remote sensing object detection, this paper proposes an MIOU loss-based YOLO v7-Tiny, which enhances the accuracy of the network while maintaining fast inference speed. In post-processing, the Truncated NMS algorithm is used to eliminate duplicate detection boxes and the truncated boxes generated by objects cut off near tile edges.
- (4) This paper creates a new dataset called DOTA-CD to verify whether the on-board cloud detection process is effective in improving detection efficiency. To validate the performance of the PID-Net model in on-board cloud detection tasks, this paper compares its results with those of SOTA deep learning cloud detection algorithms on the AIR-CD dataset [34]. Furthermore, to evaluate the effectiveness of the MIOU loss-based YOLO v7-Tiny in remote sensing object detection, this paper compares its performance with that of SOTA deep learning remote sensing object detection algorithms on the DOTA dataset. The scheme was run on the on-board device NVIDIA Jetson AGX Orin, and all experimental results verify its feasibility.