4.2. Remote Sensing Image Reconstruction
We conduct experiments for image reconstruction on several remote sensing image datasets, including LandCover.ai [40], LoveDA [41], INRIA [47], UAVid [38], and ISPRS Potsdam [48], enabling comprehensive analysis and documentation of the Earth’s surface. Moreover, we randomly select 20 remote sensing images from each dataset as task targets and calculate the average reconstruction accuracy over these images during evaluation. To evaluate the effectiveness of our approach for remote sensing image reconstruction, we compare it with various state-of-the-art methods, including position encoding and ReLU activation function-based INRs (PEMLP) [3], sinusoidal representation networks using sine periodic activation functions (SIREN) [1], learning spatially collaged Fourier bases for implicit neural representation (SCONE) [49], disorder-invariant implicit neural representation (DINER) [27], and flexible spectral bias tuning in implicit neural representations (FINER) [14]. More specifically, in learning an INR for remote sensing image reconstruction, the model takes pixel coordinates as input and predicts the corresponding pixel values. We follow the SIREN training hyperparameters and resize the input images to 256 × 256 pixels to train the INR models. Based on our preliminary findings, we set the activation function parameters k and b to 3 and 2, respectively, and train the model for 500 epochs for the remote sensing image reconstruction task.
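To make the coordinate-to-pixel setup concrete, the sketch below builds the normalized coordinate grid that a SIREN-style INR consumes for a 256 × 256 image; the helper name and the [-1, 1] normalization convention are illustrative assumptions, not code from our implementation.

```python
def make_coordinate_grid(height, width):
    """Build (x, y) inputs normalized to [-1, 1], the usual input
    domain for coordinate-based INRs (one entry per pixel). The
    network is trained to map each coordinate to its pixel value."""
    coords = []
    for row in range(height):
        for col in range(width):
            y = 2.0 * row / (height - 1) - 1.0
            x = 2.0 * col / (width - 1) - 1.0
            coords.append((x, y))
    return coords

# Images are resized to 256 x 256 before fitting, so each INR
# is trained on 65,536 coordinate-target pairs per image.
grid = make_coordinate_grid(256, 256)
```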
LoveDA [41] is a dataset specifically designed for urban and rural land cover classification, consisting of 5987 high-quality remote sensing images, each with a resolution of 1024 × 1024 pixels and a spatial resolution of 0.3 m. Each image is meticulously annotated at the pixel level, making it highly suitable for land cover classification tasks. These images encompass seven types of land cover: buildings, roads, water, wasteland, forest, farmland, and ground. The dataset is divided into two subsets based on geographic characteristics: LoveDA-Urban and LoveDA-Rural. LoveDA-Urban focuses primarily on urban scenes, characterized by complex and diverse backgrounds, dense buildings [50], and fewer natural features. In contrast, LoveDA-Rural emphasizes rural areas featuring expansive natural landscapes, sparse man-made structures, and abundant natural features. The significant differences in land cover types, label granularity, and sampling resolution between urban and rural scenes enable researchers to test the performance of deep learning models in different geographical environments, thereby optimizing model robustness and generalizability. With these characteristics, the LoveDA dataset has been widely used in various remote sensing-related fields [51,52], such as urban planning, disaster monitoring, and land use analysis. For example, in urban planning, LoveDA data can be used to identify and track urban expansion and changes in land cover, while in disaster monitoring, precise classification labels help to quickly assess damage in affected areas. The LoveDA dataset also provides a standardized benchmark environment, allowing researchers to easily evaluate and compare the performance of different deep learning models. Evaluation metrics for LoveDA typically include pixel accuracy (PA) [53] and Intersection over Union (IoU) [54], which comprehensively reflect model performance in classification tasks.
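As a brief illustration of the IoU metric mentioned above, the following sketch computes per-class IoU from two flattened label maps; the toy labels are invented for illustration and are not drawn from LoveDA.

```python
def iou_per_class(pred, target, num_classes):
    """Intersection over Union per class, given two flat lists of
    integer class labels of equal length. IoU = |A ∩ B| / |A ∪ B|."""
    scores = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        scores.append(inter / union if union > 0 else float("nan"))
    return scores

# Toy two-class example: each class has intersection 2 and union 4.
pred   = [0, 0, 1, 1, 1, 0]
target = [0, 1, 1, 1, 0, 0]
scores = iou_per_class(pred, target, num_classes=2)
```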
We present model performance comparisons with advanced methods on the LoveDA test set, as shown in Table 2. We observe that our DeLU-SIREN achieves state-of-the-art performance with a PSNR of 48.28 and an SSIM of 0.9953, an improvement of 15.01 PSNR and 0.0655 SSIM over SIREN. Compared to the latest method, FINER, our DeLU-SIREN also achieves a notable increase of 10.98 PSNR and 0.0418 SSIM in image reconstruction accuracy. Moreover, since the LoveDA dataset includes both urban and rural areas, which have different distributions, we also use various INR models to perform reconstruction and compare the results. Our method achieves competitive reconstruction results in both urban and rural scenarios compared with other advanced methods. As shown in Figure 7, our method avoids the noisy artifacts produced by the Gauss model while preserving fine details in the image reconstruction results.
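Since all comparisons in this section are reported in PSNR, a minimal sketch of the metric may help; the 8-bit peak value of 255 is the usual convention and is an assumption here, not a detail stated in the paper.

```python
import math

def psnr(img_a, img_b, peak=255.0):
    """Peak signal-to-noise ratio between two equal-length flat
    lists of pixel values (higher means a closer reconstruction)."""
    mse = sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)

# A reconstruction that is off by exactly 1 gray level everywhere
# scores about 48.13 dB, i.e. near the accuracy range in Table 2.
value = psnr([10, 20, 30, 40], [11, 21, 31, 41])
```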
LandCover.ai [40] is a dataset designed for automated land cover classification, primarily applied in semantic segmentation tasks for supervised learning. The dataset consists of aerial imagery from Poland, with all images annotated at the pixel level, covering four main land cover types: buildings, forests, water bodies, and roads. It includes 41 orthophotos covering approximately 216.27 km², with a resolution of 25 to 50 cm per pixel, ensuring high-precision details. The images are divided into 512 × 512 pixel tiles based on predefined fixed sizes, preserving geographic information to facilitate effective training and validation of deep learning models. Due to the diversity of scenes, which span different geographic regions, climates, and seasons, this dataset is particularly suitable for studying urban-rural land cover changes [55], environmental monitoring [56], and urban planning [57].
In Table 3, we report performance comparisons with existing methods on the test set of the LandCover.ai dataset. Our method achieves the best image reconstruction accuracy with 48.75 PSNR and 0.996 SSIM, an improvement of 13.45 PSNR and 0.078 SSIM over SIREN. Compared to the latest INR model, FINER, our method also achieves an improvement of 10.08 PSNR and 0.040 SSIM. Moreover, we visualize the remote sensing image reconstruction results in detail, as shown in Figure 8. We observe that DeLU-SIREN captures more precise image details than other methods, including FINER, particularly in exposed surface areas of remote sensing images.
ISPRS Potsdam is an important dataset designed specifically for urban remote sensing 2D semantic segmentation and object detection tasks. It consists of high-resolution aerial remote sensing images of the Potsdam urban area in Germany. This dataset aims to advance research on detailed classification and segmentation of urban land cover, especially in the field of high-resolution remote sensing image processing. The Potsdam dataset contains 38 high-resolution color remote sensing images, each with dimensions of 6000 × 6000 pixels and a spatial resolution of 5 cm per pixel, covering the RGB and near-infrared (NIR) spectral bands. The dataset provides detailed annotations for six land cover classes: impervious surfaces, buildings, low vegetation, trees, cars, and clutter/background. All images are labeled at the pixel level, providing detailed information on the land cover category of each pixel, which can be used to train and evaluate algorithms for land cover classification, object detection, and semantic segmentation tasks. The main characteristics of the Potsdam dataset are its high resolution and multispectral features, which make it highly suitable for accurate land cover classification and object detection. Moreover, the dataset provides challenging scenarios, such as complex land cover compositions and varying illumination conditions, offering a more realistic testing environment for practical algorithm applications.
For the ISPRS Potsdam dataset, we evaluate the performance of our method in reconstructing remote sensing images compared to other INR methods. As shown in Table 4, our method achieves a reconstruction accuracy of 49.16 PSNR and 0.9959 SSIM, a significant improvement of 13.01 PSNR and 0.074 SSIM over SIREN. Moreover, we visualize the reconstruction results of our method and the other methods in detail, as shown in Figure 9. We observe that on the Potsdam dataset, our method captures more precise image details than other methods, including the lane markings on roads.
4.3. High-Resolution Remote Sensing Image Reconstruction
Remote sensing images often exhibit high resolution, containing a wealth of information within each image, which indicates that the representation functions of these images are more complex. This complexity presents significant challenges for image reconstruction tasks. For this challenging high-resolution image reconstruction task, we evaluate the effectiveness of our method compared to the latest INR methods. In our activation function, we set k and b to 1 and 5, respectively. For high-resolution image reconstruction, we select images from the UAVid [38] and INRIA [47] datasets, resizing them to 1500 × 900 pixels due to GPU memory limitations. Both datasets target urban scene analysis, UAVid from a drone perspective and INRIA from aerial photography.
The UAVid [38] dataset enhances the understanding of urban environments in computer vision applications. It consists of high-resolution video frames captured by drones in various urban settings, offering an overhead perspective with resolutions up to 3840 × 2160 pixels. The dataset includes eight semantic classes, such as buildings, roads, trees, and vehicles. UAVid presents challenges due to significant viewpoint variations, scale changes in objects, and complex backgrounds, making it suitable for applications like urban surveillance and autonomous driving assistance systems, which require precise comprehension of urban environments.
As shown in Table 5, our method performs exceptionally well in high-resolution image reconstruction, achieving 26.88 PSNR and 0.936 SSIM. Compared with SIREN, DeLU-SIREN improves PSNR by 7.58 and SSIM by 0.238, and compared with FINER, it further improves PSNR by 2.13 and SSIM by 0.041, as also evidenced by the qualitative results in Figure 10.
INRIA [47], developed for building segmentation tasks, contains 360 high-resolution RGB images with dimensions of 5000 × 5000 pixels and a spatial resolution of 0.3 m per pixel. Comprising 180 training and 180 testing images, it covers ten cities in the United States and Austria, facilitating the evaluation of model generalization across diverse urban landscapes. Generated through aerial photography, the images undergo geometric and radiometric corrections to ensure precise alignment with real-world geographic coordinates. The primary objective is to develop robust models that adapt to varying urban environments, particularly assessing performance under different lighting conditions and architectural styles. The dataset provides binary semantic labels for buildings and non-buildings, establishing a crucial benchmark for building segmentation tasks.
As shown in Table 6, DeLU-SIREN demonstrates outstanding performance in the image reconstruction task on the INRIA dataset, achieving a PSNR of 44.17 and an SSIM of 0.9961, an improvement of 10.32 PSNR and 0.0309 SSIM over SIREN. Our approach also delivers superior image reconstruction quality compared to more recent methods such as FINER. In the visualized results (Figure 11), our method exhibits enhanced clarity and precision, particularly in reconstructing fine details such as cyclists in urban scenes. The reconstructed images display sharper contours and better-defined structures, ensuring a more accurate representation of moving objects like cyclists, which are often challenging to reconstruct due to their small size and dynamic nature. This improvement highlights our method’s high fidelity in capturing static and dynamic elements across complex urban environments in the INRIA dataset.
4.4. Ablation Study of k and b
In our DeLU-SIREN, k and b are important parameters that significantly influence the form of our activation function. We evaluate image reconstruction accuracy across different k and b values in DeLU-SIREN on an independent validation set to identify the optimal activation function settings for learning implicit neural representations (INRs) for images, as illustrated in Figure 12. To prevent test set leakage, these validation images are strictly excluded from the final test set. Our findings show that both k and b significantly impact the model’s performance, with the best result of 48.1 PSNR obtained when k = 3 and b = 2. Additionally, we observe that the model’s performance is not significantly affected by the sign of k and b; for example, k = -3 and b = -2 also make our activation function effective, obtaining 47.5 PSNR.
However, when b is as small as 1 or -1, or as large as 4, the model struggles to accurately learn the INR for images, achieving a maximum accuracy of 40.9 PSNR. Furthermore, higher values of both k and b degrade the model’s performance. Consequently, we establish k = 3 and b = 2 as the baseline for reconstructing remote sensing images.
The choice of k and b reflects a trade-off between signal complexity and optimization stability. In DeLU-SIREN, the modulation amplitude is defined in Eq. (7), indicating that k controls the gradient scaling and sensitivity to local variations, while b determines the base activation amplitude. For standard-resolution image reconstruction (e.g., 256 × 256 or 512 × 512), we recommend k = 3 and b = 2 as the default setting, since this configuration provides sufficiently strong gradients for fitting high-frequency details while maintaining stable training. For high-resolution reconstruction, we find that a smaller k and a larger b are more appropriate. This is because high-resolution remote sensing images contain denser and more complex spatial variations, for which a large slope may lead to stronger gradient fluctuations and unstable optimization. Reducing k helps regularize the gradient flow, while increasing b enlarges the activation amplitude and provides a wider dynamic range for fitting large-scale continuous spatial variations. Accordingly, for high-resolution settings, we adopt k = 1 and b = 5.
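A minimal sketch of the activation under the two recommended settings, assuming the DeLU form k·|x| + b described in Section 4.7 (pure Python, for illustration only):

```python
def delu(x, k, b):
    """DeLU activation: linear rescaling of |x| with slope k and
    bias b; its gradient magnitude is |k| everywhere except x = 0."""
    return k * abs(x) + b

# Standard-resolution setting: steeper slope for high-frequency detail.
standard = [delu(x, k=3, b=2) for x in (-1.0, 0.0, 1.0)]
# High-resolution setting: gentler slope, larger base amplitude.
high_res = [delu(x, k=1, b=5) for x in (-1.0, 0.0, 1.0)]
```

The two settings trace the trade-off described above: a larger k sharpens local sensitivity, while a larger b widens the output’s dynamic range.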
4.6. Uniqueness of DeLU and Compatibility with SIREN
To clarify the theoretical novelty of DeLU and its essential difference from conventional activations such as Tanh, we further analyze its gradient properties. In addition, to exclude the possibility that the gain of DeLU-SIREN merely comes from a generic combination of SIREN with arbitrary nonlinearities, we conduct a broader cross-combination study with representative activations.
From a theoretical perspective, the key distinction between DeLU and Tanh lies in gradient preservation for large-magnitude inputs. The derivative of Tanh is given by
$$\frac{d}{dx}\tanh(x) = 1 - \tanh^2(x),$$
which rapidly approaches zero as $|x|$ increases. In INR-based remote sensing image reconstruction, accurately fitting high-frequency spatial structures often requires large pre-activation responses. Under such conditions, Tanh easily enters the saturation regime, causing severe gradient attenuation and suppressing the learning of fine details. By contrast, DeLU is defined as
$$\mathrm{DeLU}(x) = k\,|x| + b,$$
whose derivative is
$$\mathrm{DeLU}'(x) = k\,\mathrm{sign}(x), \quad x \neq 0.$$
Therefore, DeLU maintains a constant, non-vanishing gradient magnitude $|k|$ over the entire input domain, avoiding the damping effect of smooth saturating activations. When combined with SIREN, although the periodic nature of the sine component inherently causes the derivative to oscillate, the linear scaling of DeLU preserves a more stable overall gradient envelope, preventing severe gradient attenuation. This property makes it particularly suitable for INR tasks that rely on stable gradient flow to recover high-frequency structures.
To support the above analysis, we further compare DeLU with several standard activations on the basic INR reconstruction task. The results are summarized in Table 8. Although Tanh does not suffer from a hard dead region, its gradient saturation leads to the weakest reconstruction performance, indicating that avoiding hard truncation alone is insufficient. In contrast, DeLU achieves the best performance, demonstrating that its advantage is not simply due to symmetry or smoothness, but to its ability to preserve effective gradient flow for high-frequency fitting.
Furthermore, we investigate whether other activations can also enhance SIREN in a similar manner. Specifically, besides the original SIREN baseline, we construct ReLU-SIREN, Leaky ReLU-SIREN, and Tanh-SIREN under the same network architecture and training protocol. The quantitative results are reported in Table 9. We observe that simply combining SIREN with conventional activations does not consistently improve performance. In fact, ReLU-SIREN and Leaky ReLU-SIREN both degrade the original SIREN, while Tanh-SIREN provides only a marginal improvement. In sharp contrast, DeLU-SIREN yields a decisive gain.
This result suggests that the effectiveness of DeLU-SIREN is not due to arbitrary activation stacking, but to a specific structural compatibility. ReLU-style truncation destroys the continuity of sinusoidal oscillations, while Leaky ReLU still introduces asymmetric piecewise behavior that is not well aligned with periodic representations. Although Tanh is smooth, its bounded output and saturating gradient restrict the dynamic range and optimization efficiency of SIREN. In contrast, DeLU preserves symmetry through its absolute value form, while the linear parameters k and b provide a continuously adjustable amplitude modulation. As a result, DeLU-SIREN simultaneously preserves the periodic inductive bias of SIREN and enhances its dynamic range and gradient stability. These results indicate that DeLU is fundamentally different from conventional activations such as Tanh: its effectiveness in INR arises from stable gradient preservation and principled functional compatibility rather than arbitrary activation stacking.
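The saturation argument can be checked numerically from the closed-form derivatives of Tanh and DeLU; the snippet below is a pure-Python illustration, with k = 3 taken from our standard-resolution setting.

```python
import math

def tanh_grad(x):
    """d/dx tanh(x) = 1 - tanh(x)^2: decays to zero for large |x|."""
    return 1.0 - math.tanh(x) ** 2

def delu_grad(x, k=3.0):
    """d/dx (k*|x| + b) = k*sign(x): constant magnitude |k|."""
    return k if x > 0 else -k  # undefined at x = 0, ignored here

# Tanh saturates as the input grows, while DeLU keeps the same
# gradient magnitude everywhere on the input axis.
for x in (1.0, 3.0, 6.0):
    assert tanh_grad(x) < tanh_grad(x / 2)   # monotone decay
    assert abs(delu_grad(x)) == 3.0          # constant magnitude
```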
4.7. Uniqueness of DeLU
We compare DeLU with the classical ReLU and its variant Leaky ReLU to highlight its unique characteristics. Unlike ReLU and its variants, which rely on a dead region to produce nonlinear features, DeLU incorporates an absolute value operation to provide nonlinearity, along with the hyperparameters k (slope) and b (bias), as shown below:
$$\mathrm{DeLU}(x) = k\,|x| + b.$$
The absolute value ensures that the function avoids dead zones, while k and b give DeLU a broader representational range and controllable first-order gradients, allowing it to effectively represent rich and complex remote sensing images. Moreover, we provide a comparison with other activation functions in terms of computational complexity, ease of implementation, and impact on the training process. Since our method is a linear transformation of ReLU, its per-element complexity remains $O(1)$, the same as that of ReLU. In terms of ease of implementation, we compare the computation speed of DeLU with that of other activation functions, as shown in Table 10. As reported, DeLU records an inference time of 119 μs, incurring a marginal computational overhead of roughly 12% compared to standard ReLU (106 μs). Although the absolute value and linear scaling operations introduce this slight delay, the substantial gains in reconstruction accuracy make this modest overhead a highly worthwhile trade-off throughout the process of learning INRs for images. Finally, we demonstrate mathematically that our method is more suitable than ReLU for representing remote sensing images. The ReLU activation function is defined as
$$\mathrm{ReLU}(x) = \max(0, x),$$
with its derivative given by
$$\mathrm{ReLU}'(x) = \begin{cases} 1, & x > 0, \\ 0, & x \le 0. \end{cases}$$
When $x \le 0$, the gradient becomes zero, leading to the well-known “dead neuron problem,” where certain neurons stop contributing to the learning process. This issue significantly hinders the training of implicit neural representations, particularly for learning high-frequency components of signals. In contrast, DeLU is defined as
$$\mathrm{DeLU}(x) = k\,|x| + b,$$
with its derivative expressed as
$$\mathrm{DeLU}'(x) = k\,\mathrm{sign}(x), \quad x \neq 0.$$
This ensures that the gradient remains non-zero across the entire domain, except at $x = 0$, where the function is not differentiable. The key advantage of this property lies in its ability to avoid vanishing gradients, as the gradient’s magnitude remains constant ($|k|$) regardless of the input value. This stability in gradient flow facilitates efficient learning, particularly for capturing high-frequency components in complex signals. Implicit neural representations require the network to model both low-frequency and high-frequency features of the input data. While ReLU’s sparsity in gradients limits its ability to propagate high-frequency information effectively, DeLU maintains robust gradient flow, preserving high-frequency spectral features. Moreover, the constant and symmetric nature of the gradient magnitude ensures stable parameter updates, making DeLU particularly advantageous for tasks involving intricate signal representations such as remote sensing images.
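The dead-neuron contrast follows directly from the two derivatives; the snippet below is a small pure-Python illustration (k = 3 as in our standard setting, with the non-differentiable point x = 0 assigned to the negative branch by convention).

```python
def relu_grad(x):
    """Derivative of ReLU(x) = max(0, x): zero on the dead region."""
    return 1.0 if x > 0 else 0.0

def delu_grad(x, k=3.0):
    """Derivative of DeLU(x) = k*|x| + b, i.e. k*sign(x): never
    zero (undefined only at x = 0)."""
    return k if x > 0 else -k

inputs = [-2.0, -0.5, 0.5, 2.0]
# ReLU passes no gradient for half of the input domain...
relu_grads = [relu_grad(x) for x in inputs]
# ...while DeLU keeps a constant gradient magnitude everywhere.
delu_grads = [delu_grad(x) for x in inputs]
```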