1. Introduction
Satellite imagery can provide crucial benefits in tracking global economic and ecological changes. However, integrating diverse datasets from the myriad of observation satellites available poses significant challenges. Issues arise from differences in sensing technologies, the absence of standardized calibration methods prior to satellite launch, and changes in instruments over time. Although improved calibration techniques could enhance the alignment of future data acquisitions [1], a large amount of historical data remains mutually incompatible. For instance, the U.S. Geological Survey’s Global Fiducials Library combines Landsat images from 1972 with Sentinel-2 images from 2015 [2], highlighting the need for methods that effectively bridge different image types.
Converting satellite imagery from the domain of one sensor, such as WorldView-3 (WV), to another, such as SuperDove (SD), is non-trivial. As a first step, spectral resampling is necessary to translate the wavelength response of the WV sensor filters into the desired wavelengths of the SD image bands, because the WV and SD sensors acquire imagery at different wavelengths. However, the resampled image remains quantitatively and qualitatively divergent from an image truly acquired in SD space, so additional radiometric conversions are required. Building on previous windowed image regression tasks [3], we have demonstrated initial success with a parametrized convolutional neural network (PCNN) approach that leverages a Bayesian hyperparameter optimization loop to identify a performant model for this complex and often opaque conversion problem [4].
Though the PCNN approach is qualitatively and quantitatively performant when converting stationary scenes, we identify spurious conversion artefacts when objects in the field of view are in motion. In these instances, the time delay between band acquisitions means that the moving object should appear localized in slightly different spatial locations across the bands, and a rainbow-like visual effect should be observed when viewing multiple bands at once (such as in a traditional RGB view) [5]. The issue arises because the previous PCNN model treats all image bands together in the prediction of a converted spectrum, meaning that the effect of the moving object in each band spuriously bleeds over to every other band. This results in an expanded blurred blob for moving objects in the converted image, instead of spatiotemporally localized object signatures separated by band. Preserving these image features is of practical importance for object tracking applications, which must accurately represent where and at what time a moving object of interest appears.
While experimental data for the evaluation of specific moving objects can be prohibitive to acquire, radiometrically accurate simulations of 3D objects and environments have been studied for decades. In particular, the Quick Image Display (QUID) model, continually developed by Spectral Sciences, Inc. since the early 1990s [6], incorporates six basic signature components: thermal surface emissions, scattered solar radiation, scattered earthshine, scattered skyshine, plume emission from hot molecular or particulate species, and ambient atmospheric transmission/emission. QUID computes radiance images at animation rates by factoring thermal emission and reflectance calculations into angular- and wavelength-dependent terms. This factorization allows for the precomputation of wavelength-dependent terms that do not change with target-observer angles, enabling fast object rendering and scene simulation capabilities. As a result, QUID has found broad application as a general tool for moving object simulation, from ground-based automobiles [7] to aerial missiles [8]. Thus, simulation can be a key tool for evaluating the performance of image conversion methods, particularly in specialized scenarios where experimental data are inaccessible.
To address this issue of correctly treating moving objects, here we introduce an expanded Physics-Informed Gaussian-Enforced Separated-Band Convolutional Conversion Network (PIGESBCCN) that handles known spatial, spectral, and temporal correlations between bands, and we evaluate it against QUID-simulated moving object scenes. Methods for data preparation, for model optimization and evaluation, and for simulating imagery of moving target signatures labeled with ground truth data of object size and motion are discussed in Section 2, followed by results in Section 3 and conclusions in Section 4.
3. Results
WV sensors do not acquire their 8 spectral bands in wavelength order. Rather, the spectral indices are acquired temporally in the order 1, 8, 7, 5, 4, 3, 6, then 2. Thus, if one were to conduct convolutions over spectrally adjacent bands (e.g., bands 1 and 2), this would lead to a large target blurring artefact, as the temporal gap between band 2 (acquired last) and band 1 (acquired first) would result in greatly different target locations. Instead, we leverage the known temporal order of acquisition to first rearrange the bands, such that target motion is smooth between bands and convolutions act on temporally adjacent bands. Then, we architect the model with separate convolutional branches for each output band as an additional precaution against large cross-band target position blurring.
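As a minimal illustration of this temporal reordering (the acquisition sequence follows the order quoted above; the array shapes, function names, and round-trip check are our own assumptions for demonstration), the channel axis of a WV cube could be permuted as follows:

```python
import numpy as np

# Spectral band IDs listed in their temporal acquisition order (1-indexed),
# per the sequence described in the text.
ACQ_ORDER = [1, 8, 7, 5, 4, 3, 6, 2]
TEMPORAL_IDX = [b - 1 for b in ACQ_ORDER]  # 0-indexed channel positions

def spectral_to_temporal(cube: np.ndarray) -> np.ndarray:
    """Rearrange an (H, W, 8) cube from spectral order to acquisition order."""
    return cube[..., TEMPORAL_IDX]

def temporal_to_spectral(cube: np.ndarray) -> np.ndarray:
    """Invert the rearrangement, restoring the original spectral order."""
    return cube[..., np.argsort(TEMPORAL_IDX)]

# Round-trip sanity check on a dummy scene
scene = np.random.rand(64, 64, 8)
assert np.allclose(temporal_to_spectral(spectral_to_temporal(scene)), scene)
```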
The PIGESBCCN model architecture, as shown in Figure 2, first separates out each band of the input image with band-specific Gaussian blurs derived from analysis of identical contrast targets in both the WV (input) and SD (output) domains. Specifically, we apply Gaussian blurs scaled by factors of 4.2576, 4.2679, 4.2670, 4.2498, 4.2841, 4.4390, 4.2026, and 4.3625 for the 8 image channels, respectively. We incorporate this physical knowledge to inform the model, rather than hoping the ML model parameters learn the correct relative blurring. We then rearrange the order of the input scene bands by temporal acquisition. This rearrangement informs the model of the correct physical time progression, rather than hoping the ML model parameters learn the correct band reordering. Once reordered, we chunk the image into 8 temporally adjacent 3-band sections. The first and last chunks at the temporal periphery repeat their earliest and latest bands, respectively, to yield 3-band chunks of the same shape as temporally internal chunks. Then, the physical knowledge that correct output has band-localized object positions further informs our design of the model architecture as separate convolutional branches per output image channel, rather than hoping the ML model parameters learn correct band separation. After sectioning, these temporal chunks are rearranged back into the original spectral ordering. Each of the 8 3-band chunks is then fed through a separate parameterized convolutional network path to obtain a prediction for each band and, ultimately, a converted spectrum for each pixel of the input image. In these ways, our knowledge of the physical system informs our methodology, solidifies known relationships in the data, and ameliorates the spurious blur, band reordering, and band mixing errors that may otherwise manifest with a non-physics-informed basic convolutional approach.
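To make the chunking and branch separation concrete, the sketch below uses toy layer sizes and our own function and class names (the actual branch hyperparameters are selected by the Bayesian optimization loop described later); it shows one way the temporally ordered cube could be split into 3-band chunks with edge repetition and routed through per-band convolutional branches:

```python
import numpy as np
import torch
import torch.nn as nn

def make_band_chunks(cube_t: np.ndarray) -> list[np.ndarray]:
    """Split an (H, W, 8) temporally ordered cube into 8 overlapping 3-band
    chunks, repeating the earliest/latest band at the temporal periphery."""
    padded = np.concatenate([cube_t[..., :1], cube_t, cube_t[..., -1:]], axis=-1)
    return [padded[..., i:i + 3] for i in range(8)]  # chunk i is centered on band i

class SeparatedBandBranches(nn.Module):
    """One small convolutional branch per output band (toy hyperparameters)."""
    def __init__(self, hidden: int = 16, kernel: int = 3):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(3, hidden, kernel, padding=kernel // 2),
                nn.ReLU(),
                nn.Conv2d(hidden, 1, kernel_size=1),  # one converted band per branch
            )
            for _ in range(8)
        ])

    def forward(self, chunks: list[torch.Tensor]) -> torch.Tensor:
        # chunks: 8 tensors of shape (N, 3, H, W); output: (N, 8, H, W)
        return torch.cat([branch(c) for branch, c in zip(self.branches, chunks)], dim=1)

# Usage sketch: chunk a cube, then convert each (H, W, 3) chunk to a (1, 3, H, W) tensor
chunks = [torch.from_numpy(c).float().permute(2, 0, 1).unsqueeze(0)
          for c in make_band_chunks(np.random.rand(64, 64, 8))]
converted = SeparatedBandBranches()(chunks)  # shape (1, 8, 64, 64)
```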
Two images capturing the same spatial location—a scene in Turkey—were obtained using the WV and SD sensors, respectively. The spectra in the SD image are used to label their respective WV regions, constituting a paired dataset to facilitate learning the relationship between the WV and SD spectra. Due to the differing intensity scales of the WV and SD images (0–255 for WV and 0–24,545 for SD), we first normalize the pixel values of each image to a range of 0 to 1 by dividing by the maximum value of the image. After normalization, we split the pixels from these images into training, validation, and test datasets using a 70-20-10 split. Further details are provided in Section 2.
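A minimal pre-processing sketch consistent with this description is given below; the array names `wv_image` and `sd_image`, the per-pixel flattening, and the use of scikit-learn's splitter are our own assumptions, and the actual pipeline may operate on windowed regions rather than isolated pixels as detailed in Section 2.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumed inputs: wv_image with shape (H, W, 8) and values 0-255,
# sd_image with shape (H, W, 8) and values 0-24,545.
wv = wv_image / wv_image.max()  # normalize each image to [0, 1]
sd = sd_image / sd_image.max()

X = wv.reshape(-1, wv.shape[-1])  # one WV spectrum per pixel
y = sd.reshape(-1, sd.shape[-1])  # matching SD spectrum per pixel

# 70-20-10 split: hold out 30%, then divide it 2:1 into validation and test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.30, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=1/3, random_state=0)
```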
The training set is employed for direct parameter training. Then, as in previous work [14], we employ an automated hyperparameter tuning process using Bayesian optimization. The validation set guides the hyperparameter tuning loop, ultimately resulting in a final model that minimizes both training and validation loss. The paired WV and SD scenes used for training and the loss curves for the optimized model are shown in Figure 3. Details on hyperparameter tuning and optimized model performance are provided in Section 2.
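As an illustration of how such a Bayesian tuning loop can be set up (the hyperparameter names, ranges, and the `train_and_validate` helper are placeholders rather than the paper's actual search space), a Gaussian-process optimizer such as scikit-optimize's `gp_minimize` could drive the search:

```python
from skopt import gp_minimize
from skopt.space import Integer, Real

# Placeholder search space; the real search space is described in Section 2.
space = [
    Integer(8, 64, name="hidden_channels"),
    Integer(3, 7, name="kernel_size"),
    Real(1e-4, 1e-2, prior="log-uniform", name="learning_rate"),
]

def objective(params):
    hidden, kernel, lr = params
    # train_and_validate (hypothetical) builds the model, trains on the
    # training pixels, and returns the validation loss to be minimized.
    return train_and_validate(hidden_channels=hidden, kernel_size=kernel, learning_rate=lr)

result = gp_minimize(objective, space, n_calls=30, random_state=0)
print("Best validation loss:", result.fun, "at", result.x)
```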
To quantify the effectiveness of ML spectral conversion, we present the spectral angle deviations from actual SD spectra in Figure 4. Details of the calculation are in Section 2. Panel a shows the spectral angle deviation between the SD image and the WV image after simple band resampling, while panel b shows the deviation between the SD image and the WV image converted with the trained ML model. In this visualization, brighter pixels represent greater deviations. The predominantly dark image in panel b indicates that the ML converted image is significantly closer to the actual SD data than simple band resampling. A horizontal trace across the field of view shows an average spectral angle deviation of 5.12° for the WV image after band resampling, which decreases significantly to an average of just 1.42° following conversion with the trained ML model.
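The spectral angle between a converted spectrum and the reference SD spectrum follows the standard spectral angle mapper definition; a short sketch (the function name and the small numerical guard are our own) is:

```python
import numpy as np

def spectral_angle_deg(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Spectral angle in degrees between two (..., bands) spectra arrays."""
    cos = np.sum(a * b, axis=-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-12)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
```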
Now, we utilize QUID to simulate a moving object in the input WV scene. Here, we specifically simulate a black SUV as a representative generic target that might be found on an urban highway. Note that due to the time delay between band acquisitions, the moving object appears as a “rainbow” in a visible color image as it traverses the scene. Because the SUV in question is black, these colors appear inverted against the bright grey road background, masking the respective color band from the background. The QUID-simulated object scene is then passed through our PIGESBCCN model and converted to SD space to evaluate how well the ML conversion preserves band-separated object locations, as shown in Figure 5. Details of the target simulation are provided in Section 2.
To evaluate the ML model performance on converting a moving object to SD space, we use a simple Gaussian blur of the WV QUID images, both with and without a moving target, as a comparative baseline. This blur serves to simulate the difference in spatial resolution between the WV and SD sensors while maintaining correctly band-separated object locations. Specifically, we use band-dependent full widths at half maximum of 4.258, 4.268, 4.267, 4.250, 4.284, 4.439, 4.203, and 4.363 for the spectral bands centered at 0.443 µm, 0.490 µm, 0.531 µm, 0.565 µm, 0.610 µm, 0.665 µm, 0.705 µm, and 0.865 µm, respectively. We then pass the same WV QUID images through the PIGESBCCN ML conversion model. Finally, we subtract the blurred background from the blurred background-plus-target image to isolate the effect on the target, as shown in Figure 6.
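A sketch of this baseline blur (assuming, for illustration only, that the quoted full widths at half maximum are expressed in WV pixels) could apply per-band Gaussian filtering with the standard FWHM-to-sigma conversion:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Band-dependent FWHM values quoted above; assumed here to be in WV pixels.
FWHM = np.array([4.258, 4.268, 4.267, 4.250, 4.284, 4.439, 4.203, 4.363])
SIGMA = FWHM / (2.0 * np.sqrt(2.0 * np.log(2.0)))  # FWHM -> Gaussian sigma

def baseline_blur(cube: np.ndarray) -> np.ndarray:
    """Apply the per-band Gaussian blur baseline to an (H, W, 8) WV cube."""
    return np.stack(
        [gaussian_filter(cube[..., b], sigma=SIGMA[b]) for b in range(8)],
        axis=-1)
```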
Zooming in on the target in a particular band allows us to qualitatively compare the ML conversion against the baseline Gaussian blurring. Here, an example focusing on band 4 is shown in the third column of Figure 6. The ML converted target qualitatively appears with a similar location and size to the Gaussian blurred target. Note that the color of the ML converted target is not the same as that of the baseline blurred target. This is to be expected, as the baseline blurred target remains in WV color space while the ML converted target is in simulated SD color space.
We then conduct background subtraction for each of the 8 bands and use a value threshold of 5% to mask the target from the background. Then, comparing the baseline Gaussian blurred target with the ML converted target allows us to visualize the deviation between the ML method and the baseline procedure for each band, as shown in Figure 7. First, this band-by-band visualization confirms proper target localization with slightly different spatial locations per band. Second, we see that these target locations for each band qualitatively align between the ML method and baseline blurring.
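As a sketch of this masking step (we assume, for illustration, that the 5% threshold is applied relative to each band's maximum background-subtracted value; the function name is ours), the per-band target mask could be computed as:

```python
import numpy as np

def target_mask(with_target: np.ndarray, background: np.ndarray,
                threshold: float = 0.05) -> np.ndarray:
    """Boolean (H, W, 8) target mask via background subtraction and a 5% value threshold."""
    diff = np.abs(with_target - background)  # both blurred cubes, shape (H, W, 8)
    return diff > threshold * diff.max(axis=(0, 1), keepdims=True)
```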
To quantitatively investigate the possibility of systematic errors associated with ML conversion, we compare target locations between the ML converted and baseline Gaussian approaches in Figure 8. Here, we see that the ML conversion yields target locations in good agreement with the baseline approach. More traditional methods of image conversion, without the band separation scheme outlined in this work, yield converted images in which the presence of a moving target in each band blurs over all output bands. In those cases, target reconstruction results in a single average effective position instead of a position that varies per band. While this single effective position may be close to the positions of temporally internal bands (such as bands 4 or 5), other bands (such as 1 or 2) contribute large errors. Specifically, this single effective position yields a Root Mean Squared Error (RMSE) of 1.8 and 2.7 pixels in the X and Y coordinates, respectively. In contrast, the PIGESBCCN converted target with band-localized positions yields an RMSE of only 0.49 and 0.41 pixels for the X and Y coordinates, respectively. This subpixel location accuracy indicates that the ML conversion successfully preserves target location without introducing systematic spatial translation errors. Furthermore, the averaged target size for the PIGESBCCN-treated imagery is close to the Gaussian blur baseline: only 1.6 pixels larger for the ML converted imagery (77 pixels) than the baseline (75.4 pixels).
4. Discussion and Conclusions
ML models can successfully learn complex non-linear relationships to convert one data domain to another. Application to satellite imagery can allow greater usage of data from disparate sources for combined studies.
Correct treatment of moving targets within satellite imagery was identified as a key area for improvement over previous PCNN models, requiring a more sophisticated approach. Leveraging known physical relationships of the problem domain to place specific structure and guardrails on the model architecture corrects this behavior. Specifically, band-dependent blurring, temporal reordering and chunking, and separated band-specific prediction branches allow for improved treatment of spatial, temporal, and spectral characteristics.
As a result, the ML conversion approach detailed in this work performs quantitatively well not only on stationary backgrounds but also on moving targets. The PIGESBCCN model enables accurate spectral conversion between WV and SD sensors while maintaining moving target sizes and target locations across bands. Preservation of band-localized moving targets is critical for moving target detection algorithms, a major application of satellite imagery data.
Future work may involve expanding into other image sensors and spectral bands, increasing computational efficiency for real-time conversion operations, and integrating our model architecture within live, periodic model training workflows for continually improved performance as more satellite imagery is acquired over time. Treatment of moving object blurring might not be important for general uses in, for example, agricultural monitoring or segmentation applications over large time windows. However, correct treatment of moving objects is important for select applications in, for example, small target detection, and the treatment of such experimental data may be addressed in future work.