1. Introduction
With the rapid advancement of satellite platforms and payload hardware, low-cost, lightweight, and constellation-capable satellites have become a mainstream approach for Earth observation [
1,
2,
3]. At the same time, the spatial resolution of satellite systems continues to improve. Current operational Earth observation satellites typically achieve 0.5 m resolution, while Synthetic Aperture Radar (SAR) satellites demonstrate capabilities of up to 0.3 m resolution. Notably, China’s Gaofen-11 satellite reaches 0.1 m resolution during close-range observations. The expansion of satellite constellations and advances in imaging resolution have together led to an exponential increase in remote sensing data volume [
4]. State-of-the-art (SOTA) satellites offer clear advantages in launch flexibility, cost efficiency, and on-board intelligent processing hardware, making it possible to deploy satellite networks for rapid collaborative observation [
5].
The increasing frequency of natural disasters and escalating geopolitical tensions have placed time-sensitive Earth observation at the forefront of industry priorities, making rapid response a critical determinant in addressing emergent events. This urgency has rendered the inherent conflict between conventional data processing workflows and the demand for high temporal resolution increasingly acute. Systemic inefficiencies persist throughout the end-to-end workflow, from data acquisition to operational deployment. In a traditional workflow, satellites must first downlink raw data to ground stations when within coverage, after which computational analysis is performed on Earth. This multi-stage pipeline—encompassing data transmission, ground processing, and product dissemination—typically requires hours to days, resulting in insufficient responsiveness for emergency scenarios and necessitating substantial human intervention. Furthermore, this approach suffers from several major drawbacks, which are exacerbated by explosively growing data volumes: The estimation and adjustment of satellite imaging parameters are entirely dependent on ground systems, leading to complex implementation paths, multiple intermediate links, poor timeliness, and low execution efficiency. Downlinked data often contain substantial invalid information, such as scenes with excessive cloud cover, areas lacking the mission target, or unusable content due to poor imaging conditions. These invalid data cause a significant waste of satellite resources, consumes ground system capacity, and places undue pressure on downlink bandwidth [
6,
7]. Consequently, the specific target information users need becomes buried within massive volumes of raw downlinked data, as ground systems must fully process all data before specific images can be extracted and utilized.
High-precision geometric rectification is an essential prerequisite for conducting on-board tasks such as change detection, terrain classification, and mission-specific military operations. This necessity motivates the investigation of a novel spaceborne feature matching methodology in this study. Given the significant constraints of on-board hardware compared to ground-based systems, specially designed processing approaches are required. Satellite storage units, for instance, predominantly employ radiation-hardened EMMC (Embedded MultiMediaCard) architectures to endure the unique space environment. While radiation-hardened, these storage resources remain strictly limited, necessitating the development of a lightweight reference feature library. For example, a global 1 m resolution basemap would require approximately 150 TB of storage, far exceeding typical on-board capacity. Furthermore, on-board systems are subject to multiple constraints, including power consumption, thermal management, and strict size/weight limitations [
6,
8,
9]. Additionally, less accurate real-time orbit parameters often lead to larger initial geometric offsets. These limitations preclude the unrestricted hardware expansion feasible in ground stations, compelling the adoption of highly efficient and optimized methodologies. Critically, traditional approaches relying on full-resolution basemaps or ground control point (GCP) libraries demand excessive storage. Therefore, a compact, lightweight feature database that can be deployed within stringent on-board storage constraints is imperative.
It is therefore clear that both the compression of the feature library and the simplification of algorithms affect registration accuracy, necessitating a balanced design in our methodology. Given the substantial compression of the reference feature library and the constrained on-board computational capacity, the positioning accuracy achievable on-board is not expected to equal that of ground-based processing. Nevertheless, our method can still deliver sufficiently accurate positioning to support subsequent on-board tasks such as change detection and terrain classification. Moreover, on-board positioning enables the selective downlink of image slices from regions of interest (ROIs) with improved geo-reference accuracy to ground stations for further high-precision processing [
10,
11]. This reduces the burden on satellite-ground transmission systems and enhances overall operational efficiency. Consequently, for on-board positioning, the most critical metrics are storage footprint and processing timeliness. A certain degree of accuracy compromise under these constraints is considered acceptable.
There has been extensive research conducted on image matching methods and algorithms for on-board hardware implementation. Wan et al. developed Weighted Structure Saliency Feature (WSSF) and improved-gradient location and orientation histogram (GLOH) methods to enhance matching performance [
12]. Yang et al. proposed a registration method based on adjacent self-similarity-based 3D convolution (ASTC) [
13]. Efe et al. introduced a Deep Feature Matching (DFM) method, which eliminates the need for keypoint detection [
14]. Xiang et al. applied phase congruency approaches to SAR images (SARPC) [
15]. However, nearly all these methods require access to original images or abundant GCPs for reference, making these methods impractical for on-board satellite storage.
For on-board processing algorithms, common strategies include reducing algorithmic complexity or deploying solutions on specialized hardware. Most algorithms can be simplified through approximations—albeit with some performance trade-offs—such as dimensionality reduction [
16,
17], quantization [
18,
19,
20], calculation approximation [
21,
22], or the integration of neural networks [
7,
11,
23]. Applications in military operations, disaster response, and humanitarian missions require integrated remote sensing services that are available around the clock and capable of time-critical responses. This drives the development of intelligent spaceborne processing architectures that utilize AI-accelerated hardware to deliver high computational performance, ultra-low power consumption, and multifunctional capabilities. Such systems must support cross-domain applicability through the systematic integration of advanced AI algorithms and domain-specific models, thereby enabling real-time solutions beyond the limits of conventional approaches. Furthermore, AI model parameters should be dynamically reconfigurable via all-weather satellite constellations, allowing adaptive optimization for diverse industrial applications through space–ground collaborative intelligence.
Digital Signal Processors (DSPs) and Field-Programmable Gate Arrays (FPGAs) have traditionally been the primary computing hardware for satellite systems [
10,
24,
25,
26,
27,
28,
29]. While providing strong performance in power consumption control and computational efficiency, these devices face limitations in addressing emerging operational imperatives for on-board processing systems. In recent years, Graphics Processing Units (GPUs) have demonstrated significant advantages in development ecosystems, dynamic reconfiguration capabilities, and versatility, positioning them as potential dominant platforms for future satellite edge computing.
Certain on-board processing tasks have demonstrated the capability of GPUs to efficiently process data in parallel. Wang et al. implemented an on-board streaming computation scheme based on an embedded NVIDIA GPU platform, meeting the real-time correction requirements in terms of performance, power consumption, efficiency, and accuracy [
4]. Xu et al. combined CFAR detection with a lightweight deep learning network to realize an on-board ship detection scheme on a spaceborne GPU computing platform, reducing processing time by nearly half compared to traditional methods [
7]. Xie et al. utilized the lightweight MobileNetV2 network to extract features and improved the triplet loss, embedding control point features into a hypersphere to save storage space. They tested processing times under different power budgets on an NVIDIA Jetson Xavier, achieving positioning accuracy within 30 m [
23]. In another work, Xie et al. parallelized differential geo-registration on a GPU to eliminate geometric distortion between panchromatic and multispectral images, achieving near-real-time processing efficiency [
30]. Inzerillo et al. designed two knowledge distillation components to enhance the teacher model’s guidance for the student model, thereby developing a lightweight model with performance comparable to the original one, which improved the real-time capability of on-board change detection [
31]. As general-purpose computing hardware, GPUs enable dynamic allocation of computational tasks across satellite networks within constellations, freeing systems from application-specific constraints. Moreover, their mature ecosystem offers enhanced accessibility and are developer-friendly.
In this paper, we present an on-board rectification database under a two-step framework, including a high-dimensional feature vector database for fast global localization and a binary edge feature database for accurate refinement. We apply sparse sampling to the global database and main-structure ROI extraction to the local registration database to minimize overall data storage. Both global and local registration are implemented on GPUs to guarantee efficiency. Quantitative experiments confirm that the proposed method achieves viability in efficiency and accuracy, with storage usage below of the original images. The proposed method is different from existing studies in two aspects. First, it does not apply the traditional feature extraction-hash learning framework. Instead, the feature compression method introduced applies sparse sampling and an RTV mask. Although these techniques are widely applied, they have never been applied to image registration for the purpose of storage reduction. Secondly, methods exist for on-board registration task of satellite images, but they could not be applied to SAR images in the past since their feature library was too simple; the proposed two-step database can be applied to SAR images. The major contributions of this work are as follows:
1. A robust multi-source reference database establishment method applicable to all land-cover types with high accuracy.
2. A terrain-agnostic database compression method combining sparse sampling for the global database and automatic ROI mask generation for the local registration database.
3. A two-step matching framework design and GPU-accelerated implementation for enhanced computational efficiency.
2. Related Works
Current global spaceborne processing systems reveal critical gaps in systematic architectural design. Predominantly limited to technology demonstrations, experimental trials, or basic scenario implementations, these systems lack the standardized platform ecosystems and system-level engineering required for sustained, around-the-clock operational readiness. This highlights a substantial shortfall in achieving full-scale deployment. On-board processing hardware remains largely reliant on DSP–FPGA hybrid architectures, with commercial AI accelerators (e.g., from NVIDIA, Cambricon, or Orbita) yet to see widespread adoption. Moreover, prevailing spaceborne platforms predominantly follow closed-system paradigms that lack open-architecture standards, significantly hindering the debugging, testing, and validation lifecycle of AI algorithms. These constraints fundamentally limit the cross-platform scalability, technology dissemination, and iterative evolution of satellite-based intelligent processing systems.
Due to the early stage of development in spaceborne intelligent computing platforms and the stringent storage constraints of on-board processing, there is currently limited international research on making the feature library lightweight. In contrast, some scholars have conducted a certain degree of research in this area. For example, Huo et al. utilized hash mapping learning to achieve lightweight descriptors and adaptive matching for on-board control points, improving the robustness, stability, and efficiency of on-board matching [
32]. Yue et al. employed hash learning and a Hamming space to map original floating-point feature descriptors, resulting in higher computational efficiency and better method generalizability [
33]. Zhang et al. also used sparse coding to address storage space and matching efficiency issues [
34]. Ji et al. generated control points through block adjustment, described features using SIFT, and converted feature descriptors into hash codes, achieving a compression ratio of over 300 times compared to before hash learning [
35]. Chen et al. proposed a method that uses the relative positions of celestial bodies in space as a reference to construct a feature library for on-orbit geometric correction. However, their algorithm stores positional information of star maps and relies on relativity theory along with internal and external parameter models for calibration, making it unsuitable for the remote sensing image registration task studied in this paper [
36].
From the above review of current research, it can be seen that existing methods primarily remain within the framework of feature extraction–sparse coding. The so-called hash learning, or hash mapping learning, refers to specific algorithms that map high-dimensional feature vectors into low-dimensional binary data, while ensuring that similar original data remain close after mapping and dissimilar ones become distant. Although binarization mapping based on hash learning significantly reduces the storage occupied by feature descriptors, the irreversible information loss caused by sparse coding—mapping high-dimensional features into low-dimensional binary representations—severely compromises their descriptive and matching capabilities. The limited expressive power of binary coding substantially damages the information contained within the features. Furthermore, since complex designs are often required in matching to improve success rates and accuracy across multi-source images, hash learning struggles to ensure that the encoded results retain sufficiently strong matching ability. Moreover, most of these methods rely solely on SIFT for feature point detection, without considering the differences between multi-source images or whether feature points consistently appear at the same locations in both reference and target images. Relying on a single handcrafted feature extraction approach cannot guarantee strong representational ability under complex conditions. Additionally, the simple mapping process appended after feature extraction increases the risk of unstable registration results. Such a mapping process, like hash learning, lacks remote sensing physical meaning, making it difficult to ensure reliable performance. In more recent research, Zhao et al. used multiple images of the same area under different observation conditions and employed various feature extraction algorithms to investigate the repeatability of GCPs [
37], aiming to identify stable and repeatable keypoints across different times and methods. However, for a global feature library, the workload required to collect dozens of suitable images for every region on Earth is prohibitively large. Additionally, this method does not account for differences in imaging characteristics across various sensors and performs well only on optical imagery.
The current mainstream approach to make the feature library lightweight typically focuses on keeping the number of GCPs unchanged while making the feature vectors sparse. However, if the descriptive power of the feature vectors is insufficient, even a large number of GCPs cannot achieve effective matching. This paper adopts a different perspective for feature library compression: reducing the number of GCPs while preserving the completeness of the feature vectors. Even if the number of available reference control points is limited, as long as the feature vectors possess sufficient matching performance, a reliable matching result can still be provided. Regarding the potential impact of reduced control points on accuracy, this paper does not rely solely on the GCP method to maintain precision. Instead, after coarse matching is completed using the GCP method, a lightweight edge template feature library is introduced to perform fine matching. This feature library avoids the large storage capacity required by complex features while ensuring final positioning accuracy. Compared to existing research, the proposed method better meets the various requirements of on-board processing.
Building upon our previous work which employed a global-to-local framework, we note that micro-satellites often exhibit large initial positioning offsets. Therefore, performing coarse image localization prior to precise registration is advisable to avoid excessively broad search ranges. For instance, Xiang et al. proposed a global-to-local registration method based on multidirectional anisotropic Gaussian derivatives (MAGDs) [
38]. In our own prior research, we developed a database compression method using a similar two-step registration framework [
39]. However, that method specifically targeted areas with distinctive linear features, such as airport runways and coastlines. These regions could be pre-labeled using OpenStreetMap and converted into masks for database reduction, which limited its generalizability to arbitrary geographic areas. In the present study, we replace the open-source geolocation ROI with an automated main-structure mask to establish a lightweight database applicable regardless of terrain type. This proposed mask generation strategy imposes no constraints on terrain feature types, enabling the lightweight database to cover entire islands or continents. Furthermore, we replace the previously used downsampled binary edge features for global registration with a more robust and reliable feature vector database, thereby enhancing the overall stability and reliability of the method.
3. Methodology
The proposed method follows a two-step registration framework, as illustrated in
Figure 1. First, a global registration database is constructed using adjacent self-similarity (ASS) feature extraction [
40] combined with a novel sparse sampling strategy. Subsequently, a local registration database is built based on phase congruency features, and relative total variation (RTV) [
41] is applied to retain only the most essential structural information. Through this global-to-local framework, the method achieves high efficiency, improved accuracy, and significantly reduced storage requirements. The reference database is prepared in ground-based data centers and uploaded to the satellite before launch, enabling fully autonomous on-board operation without dependency on ground systems during mission execution.
3.1. Global Registration Database Construction
To establish a lightweight feature vector database for global registration, our goal is to obtain relatively accurate geolocation estimates, thereby narrowing the subsequent search radius for local registration. The proposed method is adapted from existing approaches to satisfy the constraints of on-board processing.
Although keypoint detection is effective for image registration, its repeatability—especially across multi-sensor platforms—remains a critical limitation. Moreover, feature descriptors must be both robust and compact to optimize storage efficiency. Our solution builds upon the RIFT (Rotation-Invariant Feature Transform) framework [
42]. We replace phase congruency with adjacent self-similarity (ASS) to improve cross-sensor feature consistency [
40]. By eliminating conventional keypoint detection, we implement a sparse uniform sampling strategy and characterize reference points using directional histograms with enlarged sampling windows.
The ASS feature represents an improved feature representation with reduced computational and memory costs. It compares adjacent subimages by applying a local statistics-weighted difference operation to suppress coherent speckle noise, making it particularly suitable for multi-source image registration [
40]. We define a window
W whose center is the same as the original image. The width and height of window
W is 2 pixels less than the original image; therefore,
W can be interpreted as the original image without boundary pixels. The central subimage slice
is the image cropped within range
W, while other subimage slices are acquired by moving
W in different directions by 1 pixel. The mathematical definition of these subimages is expressed as:
The self-similarity map response at an arbitrary orientation is defined as:
where
k is the weight function based on image average and variance, and
is a window with radius 2, which was recommended by the original author [
40]. The discrete index map of the maximum SSM is then written as:
where
is a threshold value to suppress noise. In our experiment,
was set to 0.001 for optical images and 0.05 for SAR images. The values of
were adjusted according to the extraction effect of ASS index map. Subsequently, the index map
O can be converted into histogram vectors to serve as feature descriptors. The Histogram of Oriented Gradients (HOG) represents a classical approach for describing keypoint features. In SIFT-based methods, an image patch is divided into blocks, with each block contributing a histogram of gradient orientations. However, precise registration remains challenging because such statistical representations are not optimal for pixel-level matching. In contrast, compressing image patches into histogram-based feature vectors significantly reduces storage requirements. A compact descriptor of this kind is well suited to global registration, delivering sufficient accuracy to reduce initial offsets. As shown in
Figure 2, the ASS index maps exhibit moderate structural consistency between optical and SAR images, which supports successful matching.
Conventional keypoint detectors are often sensitive to variations in imaging conditions and sensor characteristics, frequently requiring a large number of ground control points (GCPs) to ensure reliability. To overcome this limitation, we replace keypoint localization with a fixed-interval grid-based sparse sampling strategy. For reference images, we adopt a sampling step of 200 pixels; coordinates are thus recorded at positions such as (200,200), (200,400), (400,200), and (400,400). While the reference image uses a coarse step (e.g., 200 pixels), the sensed image is sampled at a much finer interval (e.g., 10–20 pixels). This design theoretically ensures that any pixel in the sensed image lies within 5 pixels (half of the finer step) of a reference coordinate, providing sub-pixel alignment accuracy adequate for subsequent refinement. Compared to keypoint-dependent methods, this systematic sampling strategy substantially reduces the number of feature vectors required.
Although bypassing keypoint detection may reduce the repeatability of individual reference point locations, our approach emphasizes improving the repeatability of feature vectors through structural enhancement. Under default RIFT parameters, the histogram window size is set to 96 pixels. While smaller windows enhance discriminative power for precise matching, global registration prioritizes maximizing feature overlap over pixel-level precision. Therefore, we expand the window size to 384 pixels. This enlarged configuration allows histogram vectors to capture broader contextual information, increasing their similarity across corresponding regions compared to the original 96-pixel implementation.
Figure 3 demonstrates the proposed feature vector description strategy. With a 96-pixel histogram window, reference window 1 and sensed window 2 exhibit no spatial overlap. When expanded to a 384-pixel range, reference window 3 and sensed window 4 share a
overlapping area, significantly enhancing feature vector similarity. During the establishment of the reference database, histogram normalization is omitted to reduce storage. But before on-board registration, the system would temporarily perform histogram normalization to the reference feature vectors in the selected geographic range, which only takes a short time. Therefore, the storage optimization does not affect accuracy. Since most histogram values remain below 256, each entry requires only 1 byte of storage instead of 4 bytes for floating-point numbers. Collectively, the proposed global database enhances feature vector repeatability while reducing stored point density, thus achieving sufficient accuracy for subsequent refinement processes. The sampling interval parameters are adjustable based on the actual registration performance. Our goal is to keep the global registration error under 10 pixels. A large sampling interval would increase the chance of mismatch in global registration, while a small one would result in more control points, leading to more storage. We first set the sampling intervals to a larger value to reduce storage and establish the reference database. If the global registration turns out to be unstable (failure or error greater than 10 pixels), we reduce the sampling interval to increase the robustness of global registration, and vice versa. After several rounds of adjustment, the intervals are finally set to 200 pixels for optical and 100 pixels for SAR as the result, since multi-source image similarity is weaker. It should also be noted that more control points also requires more GPU memory usage for parallelization.
3.2. Local Registration Database Construction
As previously noted, the results of global registration remain preliminary and require further refinement using a local registration database. In earlier work, we demonstrated the feasibility of a masked-edge feature database for specific target areas [
39], where feature compression was guided by four principles: intrinsicality, invariance, robustness, and uniqueness. Specifically, features extracted from the same scene must remain consistent across different imaging conditions and modalities, be resilient to minor terrain variations, and retain sufficient distinctiveness to support reliable matching. However, constructing a global database makes it impractical to manually define fixed patterns for all terrain types. This necessitates an automated region-of-interest (ROI) selection strategy rather than reliance on manual annotation. While adjacent self-similarity (ASS) improves the cross-sensor similarity of index maps, multi-scale phase congruency offers superior pixel-wise registration accuracy for precise local alignment.
Total variation was applied in our other previous work as an image smoothing method [
43]. RTV is an extension of total variation, which happens to meet our requirements as an ROI selection method [
41]. The general total variation measure is written as:
where
R(
p) is a window around pixel
p,
q belongs to window
R(
p), and
x and
y denote the directions of derivatives.
is a weight function:
The parameter
is the window scope. The general total variation preserves both prominent structures and texture elements. However, in a lightweight database, we wish to preserve as little information as possible. Compared to random textures, the main structures obviously fit our feature selection principles better. A novel windowed inherent variation helps distinguish between main structures and normal texture:
It does not sum the absolute values of gradients. Therefore, if the gradients in a window coincide,
would be significant, which indicates the main structure. On the other hand, a small
determines that the region belongs to texture regions. The overall loss function is written as:
During the filtering process,
still eliminates noise, which is similar to the original TV model. The difference is that
is small in texture regions, while remaining large in main structures. Since the optimization process minimizes
, texture regions gradually disappear as iterations increase. At the end of filtering, we obtain a main structure image without noise, fine details, or textures.
After generating the main structure binary mask, we apply Run-Length Encoding (RLE) to compress the masked edge features. As demonstrated in our earlier work, RLE proves to be an effective technique for reducing storage requirements in this design. The binary mask contains extensive continuous regions of zero values, which can be recorded by storing the count of zeros rather than each individual digit. By retaining only the principal structures, the mask preserves sufficient information for successful registration, while the high-resolution pixel-wise edge features ensure accurate matching results. This fully automatic strategy does not rely on specific terrain types and can therefore be easily generalized to global scenarios. Examples of the resulting masked features are presented in
Figure 4, where the most stable and reliable edges are preserved by the main structure mask. In each example, the left panel shows the original basemap, the middle panel displays the RTV-extracted mask, and the right panel presents the final masked edges. The RTV model automatically extracts the dominant structures, which are most discriminative and useful for local registration.
3.3. Global-to-Local Overall Framework
The global and local databases described above are constructed in ground-based data centers and uploaded to the satellite for on-board storage. During on-board processing, each acquired image undergoes a two-step global-to-local registration procedure. The advantages of this hierarchical matching strategy were outlined in previous sections. When an image is first captured, its initial geometric localization accuracy is typically poor, making a direct wide-area search highly inefficient. Within the two-step framework, global registration quickly provides a coarse but sufficiently accurate location estimate. This significantly narrows the search space for subsequent refinement, thereby greatly reducing computational cost.
We next explain how the proposed database design aligns with the requirements of this two-step framework. The framework itself builds upon our prior work, which used downsampled template features for global registration [
39]. In this study, we revisit the feature design strategy for the global reference database. The earlier simplified approach proved less effective in regions with indistinct features, prompting our shift to a feature vector-based global registration database. Unlike template features, feature vectors support geolocation-aware matching across the entire database, effectively handling large initial positioning offsets.
The refinement stage continues to utilize masked edge features as registration elements, a method whose efficacy was established in our previous work [
39]. Since deriving a universal masking rule through prior knowledge is impractical, we propose to automatically extract dominant structural outlines from reference images. This allows the region-of-interest (ROI) masks to generalize across diverse geographic regions. To this end, we introduce relative total variation (RTV) to preserve structures with high contrast, large scale, and well-defined shapes—such as streets, buildings, bridges, highways, rivers, and agricultural fields—while filtering out fine details and noise. This approach achieves performance comparable to earlier methods without relying on prior knowledge or imposing localization constraints.
Both the global and local registration databases are built from optical basemaps. The adjacent self-similarity (ASS) and phase congruency features are applicable to cross-modal image registration, as they were sufficiently validated in prior research. Therefore, the proposed processing framework applies to both optical and SAR sensed images, with only minor adjustments in parameters and the phase congruency algorithm for SAR data [
15].
The overall framework is illustrated in
Figure 1. First, global registration is performed by searching the feature vector database using the approximate geolocation metadata of the input image. Subsequently, refinement is carried out by querying the binary edge features and ROI masks within the local database. It is important to note that high precision is required only in the local registration phase, while global registration prioritizes computational efficiency. Furthermore, in on-board processing tasks, storage constraints take precedence over other metrics. The specific details of the registration algorithms are elaborated in the following section.
3.4. Matching Algorithm Implementation
In this section, we introduce the details of the matching methods. For global registration of high-dimensional feature vectors, we use the Dense Nearest Neighbor Searching (DNNS) algorithm to find matching pairs. For local registration of template features, the masked NCC (Normalized Cross Correlation) algorithm in our previous work is applied. Both algorithms were implemented on an Nvidia GPU for acceleration.
The proposed method is inspired by DFM multi-scale pyramid DNNS matching. Given the feature vector sets
and
extracted from reference and sensed images, for an element from
and another from
, DNNS computes the distance between these two feature vectors. Only when the ratio between the best and second-best match distances is below a threshold do we preserve the coordinates of the feature vectors. These preserved coordinates are later processed by Random Sample Consensus (RANSAC) to eliminate outliers. Apparently, the computation between all reference and testing pixels is extremely time-consuming. The multi-scale pyramid DNNS uses downsampling to improve efficiency. For example, an image with 1 m resolution is downsampled to 1/16 of the original size [
14]; therefore, we can achieve a rapid matching result with an error of less than 8 m (half of the pixel interval), which is the search radius in the next level of the pyramid. At the next level, the downsampling factor is adjusted to 8, meaning we only have to search within one pixel (8 m) to find a more accurate position. This way, the result is gradually refined while the downsampling factor decreases from 16 to 1.
Considering our requirement for global registration is only to eliminate the initial offset, we modified the multi-scale pyramid DNNS to operate efficiently with minimal storage usage. Taking optical image registration as an example, the reference feature sampling step is set to 200 pixels, and the sensed feature sampling step is set to 10 pixels. Since the number of elements in
and
is severely unbalanced, we do not require matches to be mutual to retain as many points as possible. This fast registration typically results in a positioning error of approximately 10 pixels. Once global registration is complete, such a small search radius is acceptable for refinement. The local registration based on binary edge features uses NCC to match the database and sensed images [
44]. The definition of NCC is given in Equation (
9).
The subscripts
R and
S have the same meaning as in global registration. I denotes intensity, while
and
represent the mean and variance of the image cropped inside the window
W.
denote feature points locations detected in the sensed image, to avoid computing NCC at meaningless places. We shift the sensed image by changing
dx and
dy until the NCC coefficient reaches its maximum; then
are considered the accurate locations for the pair
, as denoted in Equation (
10).
We strove to implement as many computationally intensive processes as possible on the GPU to save processing time. In the DNNS process, we partitioned GPU threads in a 2-D shape, and each thread had coordinates in the x and y directions. We used a two-dimensional GPU block to allocate one thread for each pair for parallel computing. Note that this step requires a certain amount of GPU memory to store float-type distances for all reference-sensed feature pairs. For a (10,000 × 10,000)-pixel image pair, the number of reference points is (10,000/200)2, while the number of sensed points is (10,000/10)2. Thus, it would take 4 × 2500 × 1,000,000 bytes (approximately 9 GB of GPU memory) to store all distance values. Afterwards, a one-dimensional GPU block sorts the distances for each vector in . For real satellite-grade embedded GPUs, our demand for 9 GB GPU memory can be easily meet. The Nvidia series is specially designed for edge computing, which has already been applied in satellites. Jetson Xavier NX/Orin NX edge device can provide 16 GB of GPU memory. There are also GPUs produced in China that provides 24 GB of GPU memory.
As for the NCC matching algorithm, it was already implemented on the GPU in our previous work. For a certain combination of translation , pixels in the template W, and each feature point , a separate thread is applied to compute NCC coefficients in parallel. This step has lower requirements on GPU memory since the features are all binary cropped images around feature points.
In summary, the on-board lightweight database provides moderate accuracy with low storage usage and high processing efficiency. We implemented the following measures:
1. We sampled features at fixed intervals on the histogram of the ASS index map to perform fast global registration with low storage usage.
2. We proposed a novel automatic main structure masking strategy for a lightweight edge database to enable accurate pixel-wise registration.
3. We applied a global-to-local two-step registration structure and GPU implementation to improve efficiency.
5. Discussion
In this paper, we proposed a lightweight reference database for on-board registration of remote sensing images. Although it met the storage, efficiency, and accuracy requirements of on-board registration, certain aspects could still be improved in future research. In this section, we discuss the limitations of the proposed method.
First, the terrain adaptability of the proposed method is examined. This capability is contrasted with existing studies and our earlier work, in which reference targets were largely confined to well-defined artificial structures such as airports, harbors, or buildings. Unlike these approaches, the proposed method does not depend on semantic definitions of reference targets, nor does it require prior open-source geographic vector data (e.g., OpenStreetMap) to delineate masks for such structures. Instead, a main-structure mask is automatically generated via the RTV model. As a result, the proposed method can be regarded as generally applicable, in contrast to earlier techniques. Nevertheless, certain limitations persist. The RTV mask does not fully differentiate heterogeneous landscapes, which can introduce redundancy into the reference database. Moreover, similar to most registration methods, the proposed approach performs poorly in weakly textured regions such as mountains, sea surfaces, forests, or deserts. As shown in
Figure 10, reliable match points are obtained only around built-up areas, whereas mountainous or densely vegetated regions fail to produce sufficiently discriminative feature vectors.
Deep learning-based methods are acknowledged as a major competing direction in current research; however, they present considerable drawbacks for on-board registration tasks. At present, the proposed method demonstrates advantages over neural descriptors in several key respects. First, it requires no labeled training data. Assembling a large volume of high-quality labeled data for global on-board registration would be costly and logistically difficult. Second, deep learning-based methods involve substantially higher computational complexity. Our method can be efficiently deployed on an on-board GPU with 16 GB of VRAM, while deep learning approaches generally demand significantly more memory and computational throughput to achieve efficient inference. Additionally, on-board systems would need to allocate extra storage for complex network models. Given these constraints, designing a viable on-board registration solution solely around deep learning remains challenging. The performance of the DFM method exemplifies these limitations. In future work, we intend to explore the use of deep learning-based techniques (such as variational autoencoders) to encode features into lower-dimensional representations, thereby enabling further reductions in storage footprint.
6. Conclusions
In this paper, we presented a novel lightweight on-board database that substantially reduced storage requirements. The registration process operated on this compressed feature database instead of raw images, using a two-step framework to accommodate large initial geolocation offsets. For global registration, we introduced a sparsely sampled high-dimensional feature vector library that circumvented the keypoint repeatability issue while reducing the number of control points. For local registration, we replaced the earlier masking strategy with relative total variation (RTV) to filter out irrelevant features from the binary edge map. The optimized workflow employed parallelized GPU accelerated implementations for both registration stages, fully utilizing the available on-board computational resources.
Experimental results confirmed that the proposed method achieves superior database storage efficiency while maintaining SOTA accuracy and processing speed. Since existing comparative methods either demand excessive time or storage, none offer a practical solution for on-board registration. The principal contributions of this work are summarized as follows:
1. To the best of the authors’ knowledge, this is the first feasible lightweight on-board database designed for multi-sensor satellite image registration.
2. An advanced feature-compression strategy that is independent of terrain types. This includes a sparse sampling method for control points in global registration and an automatic main-structure ROI extraction procedure for local registration, both incurring only minor accuracy loss.
3. GPU-accelerated implementation for both registration steps, enabling efficient use of limited on-board hardware resources.