1. Introduction
In the context of rapid global urbanization, impervious surface area (ISA) has emerged as a key indicator for characterizing urbanization intensity and urban environmental change. The expansion of ISA not only contributes to the formation and intensification of the urban heat island (UHI) effect but also alters urban hydrological processes, reduces ecological connectivity, and degrades ecosystem services [1,2,3]. Consequently, the precise quantification of ISA holds profound scientific significance. Owing to its extensive temporal archive and global coverage, the Landsat satellite series has served as a primary high-quality data source for long-term ISA mapping [4,5]. However, urban landscapes are inherently characterized by high heterogeneity and spatial fragmentation. In such complex environments, the “mixed pixel” phenomenon is prevalent within medium-resolution imagery (e.g., 30 m), which makes traditional pixel-level hard classification methods inadequate for meeting high-precision monitoring requirements [6,7]. To address this challenge, estimating sub-pixel impervious surface abundance has become a prominent research hotspot. Currently, the prevailing methodologies primarily encompass Spectral Mixture Analysis (SMA) and regression models based on statistics or machine learning [4,8,9].
Spectral Mixture Analysis (SMA), fundamentally premised on the linear weighted combination of endmember spectra, has long functioned as a cornerstone for sub-pixel urban analysis. The classic Vegetation-Impervious Surface-Soil (V-I-S) conceptual framework, along with associated physical spectral characterization studies [7,8], laid the theoretical foundation for this domain. To overcome the limitations of fixed endmembers, subsequent studies have introduced numerous optimizations, such as Normalized Spectral Mixture Analysis [10] and Multiple Endmember SMA (MESMA) adapted for urban environments [6,11], which have significantly enhanced model adaptability. These optimized models can effectively address endmember variability, thereby improving estimation accuracy and enabling finer feature characterization. Despite these advancements, SMA methods remain constrained by intra-class spectral variability (often referred to as the “same object, different spectrum” phenomenon), leading to inconsistencies in decomposition accuracy [12].
In contrast to SMA methods that rely on physical spectral decomposition and continuous optimization, regression models circumvent the physical modeling of light interactions by establishing a direct mapping relationship between spectral features and impervious surface abundance. Early investigations successfully employed statistical techniques, such as Classification and Regression Trees (CART), to quantify this mapping at regional scales [13]. With the advent of advanced computing, machine learning algorithms, including Support Vector Machines (SVM) and Random Forests (RF), have been widely adopted in this field, demonstrating superior performance in capturing complex, non-linear spectral heterogeneity [14,15]. Recently, the research frontier has shifted towards deep learning, with Convolutional Neural Networks (CNNs) used to extract deep semantic features and continually pushing the boundaries of sub-pixel inversion accuracy [16,17,18]. However, although machine learning regression models excel at modeling complex non-linear relationships, their performance remains highly dependent on the quantity and quality of training samples [19,20].
Traditional approaches typically derive ground truth abundance through the visual interpretation of high-spatial-resolution imagery (e.g., Google Earth) [21]. However, this manual approach is not only labor-intensive and time-consuming but may also introduce systematic errors. Specifically, inevitable geometric registration errors, temporal mismatches, and inconsistencies in atmospheric correction between multi-source datasets (i.e., the high-resolution reference and the medium-resolution target) introduce significant noise into the sample set [22,23,24]. These discrepancies severely compromise the training efficacy of the model and its generalization capability in heterogeneous urban environments [25].
While Spectral Mixture Analysis (SMA) and machine learning regression are prevalent, a critical bottleneck remains the acquisition of high-quality training samples. Previous studies have attempted to mitigate this by generating synthetic training data, typically through the random linear combination of pure endmembers from spectral libraries [9,26] or by employing mathematical models to approximate non-linear mixing effects [27,28]. However, these approaches often rely on mathematical abstractions that treat pixels as independent entities, thereby oversimplifying the complex intra-class spectral variability and the inherent spatial autocorrelation of real-world urban landscapes.
To address these limitations, this study proposes a novel sub-pixel Impervious Surface Fraction (ISF) retrieval framework that constructs a training library by aggregating high-resolution airborne hyperspectral imagery (AVIRIS). Unlike existing methods that rely on the random mixing of idealized spectral signatures, our approach utilizes fine-scale classification of AVIRIS imagery (spatial resolution < 5 m) to derive impervious surface fractions within each 30 m grid cell. The hyperspectral data are then spectrally resampled using the Landsat spectral response functions (SRFs) and spatially aggregated to produce reflectance data aligned with the fraction values. This physically constrained process effectively preserves authentic sensor artifacts, complex atmospheric conditions, and the natural spatial-spectral correlations inherent in urban structures. Finally, a 1D-CNN is employed to effectively map the complex non-linear relationship between these simulated spectra and the corresponding ISF, facilitating high-precision sub-pixel impervious surface mapping.
The main contribution of this study is the development of a novel, physically consistent framework for generating high-quality labeled datasets for sub-pixel impervious surface mapping. By leveraging high-spatial-resolution airborne hyperspectral imagery as a unified physical source, we establish a rigorous ‘spectrum-abundance’ coupling pipeline. This framework uniquely integrates synchronous spectral convolution and spatial aggregation to derive both the spectral reflectance features and the ground-truth abundance labels within a common spatial domain. Consequently, this approach effectively mitigates systematic errors—such as geometric registration artifacts and temporal mismatches—inherent in traditional image-derived samples, providing a robust, noise-resistant training foundation.
3. Methods
This study proposes a hyperspectral simulation-driven framework for sub-pixel impervious surface fraction (ISF) retrieval. The proposed method first constructs a comprehensive target–background spectral library based on hyperspectral observations and generates physically consistent spectrum–abundance training samples through multispectral data simulation. A one-dimensional convolutional neural network (1D-CNN) is then employed to learn the nonlinear relationship between simulated multispectral spectra and ISF values. Finally, a series of comparative experiments are designed to evaluate the effectiveness of the proposed approach using Landsat imagery, and the retrieval performance is assessed through quantitative accuracy metrics and spatial distribution analysis.
To provide a comprehensive overview of the proposed methodology, we illustrate the end-to-end framework in Figure 3. This framework systematically integrates sample library construction (Section 3.1), 1D-CNN model training (Section 3.2), and the final inference on Landsat imagery.
3.1. Construction of a Comprehensive Target-Background Spectral Library
3.1.1. Definition of Target and Background Endmember Categories
To characterize the inherent complexity of urban surfaces, this study treats Impervious Surface as a single target category rather than dividing it into rigid sub-classes. Based on high-resolution hyperspectral data, we extracted a comprehensive set of spectral features covering a wide dynamic range of anthropogenic materials, from dark asphalt roads to highly reflective metal roofing. Our strategy can be conceptualized as an enhanced, implicit Multiple Endmember Spectral Mixture Analysis (MESMA). While traditional MESMA minimizes reconstruction errors by selecting specific endmembers from a library on a per-pixel basis [11], our framework implicitly encodes the spectral variability of ISA into a massive training dataset derived from hyperspectral simulations. Consequently, the subsequent Convolutional Neural Network (CNN) model learns robust and generalized “ISF features”, which have demonstrated superior performance in capturing non-linear spectral mixing compared to traditional physical models [35,36]. Simultaneously, natural surfaces (including vegetation, soil and water) are defined as pervious surface (PS). By decoupling the target (ISF) from PS, our model ensures accurate estimation of ISF regardless of the surrounding environment. A detailed list of the constituents covering this wide spectral range is provided in Table 1.
3.1.2. Supervised Classification for Ground Truth Generation
To generate high-precision land-cover maps from high-dimensional AVIRIS imagery, we employed the Support Vector Machine (SVM) classifier. SVM is widely recognized for its superior generalization capability and robustness in handling high-dimensional data with limited training samples, making it particularly well-suited for hyperspectral image classification [15,37]. Comparative studies have consistently demonstrated that SVM outperforms traditional classifiers such as Maximum Likelihood (MLC) and shallow neural networks in complex remote sensing tasks [38].
To optimize classification performance, we implemented a specialized processing workflow. First, the original 372 spectral bands were subjected to Minimum Noise Fraction (MNF) transformation to suppress noise and mitigate spectral redundancy. The first 20 MNF components, which cumulatively account for over 99% of the spectral variance, were selected as input features. The SVM model utilized a Radial Basis Function (RBF) kernel, with the penalty parameter (C) and kernel width (γ) optimized through a grid search strategy combined with 5-fold cross-validation.
Given that the resulting classification maps serve as the foundational “ground truth” for our subsequent deep learning framework, ensuring accurate class consistency is critical. Accordingly, a strict scene-by-scene independent validation approach was employed. To address the complex illumination effects present in high-resolution AVIRIS imagery (including shadows cast by buildings and tree canopies), a “shade” endmember was added to the system originally defined in Section 3.1.1.
Reference samples (Regions of Interest, ROIs) for the five target categories were manually delineated through visual interpretation of high-resolution RGB composites. To ensure the representativeness and spectral purity of the training samples, the ROIs were uniformly distributed across the entire hyperspectral scene to capture the natural intra-class variability under different illumination and background conditions. Furthermore, to minimize edge effects and the risk of mixed pixels, the ROIs were strictly delineated at the central locations of large, continuous, and homogeneous land cover patches, carefully avoiding transition zones and structural boundaries. These highly pure ROIs were utilized exclusively for training the scene-specific SVM classifiers. Finally, each classification result underwent exhaustive manual inspection and iterative refinement to ensure an Impervious Surface (IS) extraction accuracy exceeding 95%. Qualitative assessment of the representative classification results is illustrated in Figure 4.
3.1.3. Multispectral Data Simulation
Theoretically, the radiance or reflectance recorded by a multispectral sensor is the weighted integral of the incoming continuous spectrum and the band-specific sensitivity of the sensor. Given that AVIRIS data provides fine-grained spectral information (5 nm sampling) covering the spectral range of most optical satellite sensors, it can be utilized to simulate broad-band multispectral data with high fidelity [39]. To generate the simulated multispectral imagery from the AVIRIS-NG hyperspectral data, we employed the spectral convolution method based on the Spectral Response Functions (SRFs) of the target sensor. Previous studies have validated the physical fidelity of this synthesis approach, demonstrating high radiometric consistency between SRF-synthesized bands and actual multispectral observations [40]. The simulation process, often referred to as spectral resampling or convolution, models the physical response mechanism of the sensor. The formula for data simulation was as follows:
$$L_{\mathrm{MSI}}(i) = \frac{\sum_{j=1}^{n} L_{\mathrm{HSI}}(j)\,\mathrm{SRF}_i(\lambda_j)\,\Delta\lambda_j}{\sum_{j=1}^{n} \mathrm{SRF}_i(\lambda_j)\,\Delta\lambda_j} \tag{1}$$
where $L_{\mathrm{MSI}}(i)$ is the spectral radiance of the synthesized multispectral image band $i$, $L_{\mathrm{HSI}}(j)$ is the spectral radiance of hyperspectral band $j$, $i$ is the band number of the multispectral instrument (MSI), $j$ is the band number of the hyperspectral instrument (HSI), with the sums running over all $n$ hyperspectral bands, $\Delta\lambda_j$ is the channel width of the hyperspectral instrument, and $\mathrm{SRF}_i(\lambda_j)$ is the spectral response function of multispectral band $i$ evaluated at the central wavelength $\lambda_j$ of hyperspectral band $j$.
In this study, for Landsat 8 OLI, we simulated six spectral bands ranging from the visible to the shortwave infrared (SWIR), encompassing Bands 2 through 7 (Blue, Green, Red, NIR, SWIR-1, and SWIR-2) according to the spectral response functions of Landsat 8 OLI [41]. The high spectral resolution of the source AVIRIS-NG data ensures that a sufficient number of hyperspectral channels fall within the bandwidth of each simulated multispectral band, thereby facilitating accurate weighted integration.
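The SRF-weighted synthesis described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the processing code used in the study: the function names `synthesize_band` and `gaussian_srf` are hypothetical, the Gaussian curve is only a stand-in for a tabulated SRF, and the centre wavelength, width, and test spectrum are made-up illustrative values rather than published OLI or AVIRIS-NG data.

```python
import numpy as np

def synthesize_band(hsi_spectrum, srf_weights, channel_widths):
    """SRF-weighted average of hyperspectral channels for one broad band."""
    hsi_spectrum = np.asarray(hsi_spectrum, dtype=float)
    # Weight of each narrow channel: sensor response times channel width.
    weights = np.asarray(srf_weights, dtype=float) * np.asarray(channel_widths, dtype=float)
    return float(np.sum(weights * hsi_spectrum) / np.sum(weights))

def gaussian_srf(wavelengths_nm, center_nm, fwhm_nm):
    """Hypothetical Gaussian stand-in for a tabulated sensor SRF."""
    sigma = fwhm_nm / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    return np.exp(-0.5 * ((wavelengths_nm - center_nm) / sigma) ** 2)

# Illustrative 5 nm grid and a made-up linear spectrum (not real AVIRIS data).
wavelengths = np.arange(400.0, 1000.0, 5.0)
spectrum = 0.05 + 0.0004 * (wavelengths - 400.0)
widths = np.full_like(wavelengths, 5.0)

# Assumed red-band parameters, chosen for illustration only.
red_srf = gaussian_srf(wavelengths, center_nm=655.0, fwhm_nm=37.0)
red_band = synthesize_band(spectrum, red_srf, widths)
flat_band = synthesize_band(np.full(wavelengths.size, 0.3), red_srf, widths)
```

Because the weights are normalised by their own sum, a spectrally flat input is returned unchanged, which is a convenient sanity check before substituting real SRF tables.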
(2) Validation of Simulated Multispectral Imagery
To evaluate the radiometric consistency of the simulation, a comprehensive comparison was performed between the simulated data and actual satellite observations. In principle, the validation imagery should be acquired simultaneously. However, due to cloud cover constraints, we selected the closest available cloud-free Landsat 8 OLI image acquired on 14 July 2018 to validate the AVIRIS-derived simulated image from 2 July 2018. This results in a short 12-day temporal interval.
Visually, Figure 5 demonstrates high consistency between the simulated and observed datasets. The spectral transformation from the hyperspectral domain (Figure 5a) to the broad multispectral bands (Figure 5b) effectively preserved key spectral features. Following spatial aggregation, the simulated 30 m imagery (Figure 5c) exhibited a spatial pattern and radiometric characteristics highly comparable to the actual Landsat 8 image (Figure 5d). Crucially, the boundaries of impervious surfaces (e.g., roads, rooftops) and water bodies remained stable during the 12-day interval, confirming that the simulated data accurately reflects the geometric and radiometric characteristics of the target scene.
A quantitative pixel-wise difference assessment (Figure 5e) further confirms the spectral consistency between the simulated imagery and the actual Landsat data. The reflectance residuals (simulated, 2 July, minus observed, 14 July) across all six bands center closely around zero, indicating negligible systematic bias. Specifically, the visible bands (Blue, Green, Red) show extremely tight interquartile ranges. It is worth noting that the NIR and SWIR bands exhibit slightly broader variances compared to the visible bands. To quantitatively assess the uncertainty introduced by this 12-day temporal gap, we evaluated the reflectance bias (simulated minus observed) in the highly phenology-sensitive NIR band. As illustrated by the scatterplots in Figure 6, temporally invariant artificial surfaces show a negligible mean bias of 0.32%, confirming the high fidelity of our spectral simulation. Conversely, vegetation samples exhibit a minor negative bias of −2.84%. This slight deviation aligns with the natural increase in vegetation vigor during early July, demonstrating that the observed variance in the NIR band is fundamentally driven by natural phenological shifts rather than systematic simulation errors.
3.1.4. Construction of the Physically Guided “Spectrum-Abundance” Sample Library
To effectively establish the mapping between spectral signatures and sub-pixel compositions, we constructed a paired endmember sample library $D = \{(X, Y)\}$. In this dataset, the input features $X$ represent the multispectral reflectance, while the labels $Y$ correspond to the fractional abundances of the endmembers. As illustrated in Figure 3, the abundance vectors $Y$ and the multispectral reflectance $X$ are generated through the following steps. The construction process strictly follows the physical mechanisms of spectral mixing, ensuring that the synthesized data is both statistically diverse and physically realistic.
The abundance labels $Y$ were derived from high-spatial-resolution hyperspectral imagery (e.g., AVIRIS) to preserve real-world land cover patterns. First, the high-resolution imagery was classified to generate a fine-scale ground truth map. The dimensions of the spatial aggregation window were strictly determined by the resolution relationship between the observed multispectral sensor and the high-resolution source data. Specifically, let $S$ denote the scaling factor defined by the ratio of the spatial resolution of the multispectral sensor to that of the hyperspectral data. The classification map was then traversed and aggregated using an $S \times S$ window. The abundance vector for each simulated pixel was calculated as the areal fraction of each endmember class within this window:
$$f_k = \frac{N_k}{N_{\mathrm{total}}} \tag{2}$$
where $N_k$ denotes the number of high-resolution pixels belonging to the k-th endmember class, and $N_{\mathrm{total}}$ is the total number of pixels in the aggregation window (i.e., $N_{\mathrm{total}} = S \times S$). This method ensures the labels satisfy the abundance non-negativity and sum-to-one constraints naturally while accurately reflecting the scale difference between the sensors.
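The window-based label derivation can be illustrated with a small NumPy sketch. It assumes non-overlapping S × S blocks and integer class codes starting at 0; the function name `class_fractions` and the toy 4 × 4 map are hypothetical and serve only to show the arithmetic.

```python
import numpy as np

def class_fractions(class_map, scale, n_classes):
    """Areal fraction of each class inside non-overlapping scale x scale blocks."""
    class_map = np.asarray(class_map)
    # Crop so that the map divides evenly into blocks.
    rows = (class_map.shape[0] // scale) * scale
    cols = (class_map.shape[1] // scale) * scale
    blocks = class_map[:rows, :cols].reshape(rows // scale, scale, cols // scale, scale)
    # The mean of a boolean mask over a block equals N_k / N_total for that block.
    return np.stack(
        [(blocks == k).mean(axis=(1, 3)) for k in range(n_classes)], axis=-1
    )

# Toy classification map: 1 = impervious, 0 = pervious (illustrative only).
toy_map = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
])
fractions = class_fractions(toy_map, scale=2, n_classes=2)
```

In this toy case the upper-left coarse cell contains three impervious pixels out of four, so its impervious fraction is 0.75, and every cell's fractions sum to one by construction.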
(2) Feature Generation: Physically Based Spatial Aggregation
Corresponding to the derived abundance labels, the multispectral reflectance features were obtained via direct spatial aggregation of the high-resolution imagery. After spectrally resampling the AVIRIS data to Landsat 8 bands using Spectral Response Functions (SRFs), we performed spatial downsampling to match the 30 m resolution. For each coarse-resolution pixel, the reflectance was derived by averaging the reflectance of all spatially aligned high-resolution sub-pixels, ensuring consistency between scales.
The feature vector $X$ is calculated as:
$$X = \frac{1}{K}\sum_{j=1}^{K} \rho_j \tag{3}$$
where K is the total count of high-resolution sub-pixels located within the target multispectral grid cell, and $\rho_j$ is the reflectance of the $j$-th sub-pixel.
This direct spatial aggregation strategy is widely adopted in remote sensing to generate reliable abundance ground truth and simulation datasets for validating unmixing algorithms [42,43]. Compared to purely mathematical synthesis, this approach preserves the real-world spectral complexity (e.g., intra-class variability) inherent in the high-resolution observations [12], ensuring that the input features $X$ naturally align with the abundance labels derived in the previous step.
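As a rough illustration of the spatial averaging step, the sketch below block-averages a multi-band image and checks that, for a toy scene in which every fine pixel of a class has the same spectrum, the coarse reflectance equals the fraction-weighted mix of the class spectra. The function name `block_mean`, the two-band spectra, and the 4 × 4 scene are assumptions made for the example, and non-overlapping blocks are assumed.

```python
import numpy as np

def block_mean(image, scale):
    """Average a (rows, cols, bands) image over non-overlapping scale x scale blocks."""
    image = np.asarray(image, dtype=float)
    rows = (image.shape[0] // scale) * scale
    cols = (image.shape[1] // scale) * scale
    bands = image.shape[2]
    blocks = image[:rows, :cols].reshape(rows // scale, scale, cols // scale, scale, bands)
    return blocks.mean(axis=(1, 3))

# Toy scene: 1 = impervious, 0 = pervious, with made-up two-band class spectra.
toy_map = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
])
impervious_spectrum = np.array([0.3, 0.2])
pervious_spectrum = np.array([0.1, 0.4])
fine_image = np.where(toy_map[..., None] == 1, impervious_spectrum, pervious_spectrum)

coarse_image = block_mean(fine_image, scale=2)
# Upper-left cell: 0.75 * impervious + 0.25 * pervious.
expected_upper_left = 0.75 * impervious_spectrum + 0.25 * pervious_spectrum
```

With real imagery the fine pixels of one class differ from each other, which is exactly the intra-class variability the averaging carries into the coarse features.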
Based on the proposed physically constrained spectrum-abundance coupling framework, this study constructed a specialized sample library for urban impervious and pervious surface components. First, land cover types derived from the interpretation of high-spatial-resolution hyperspectral imagery were reclassified into two primary endmembers: impervious surfaces and pervious surfaces. Their sub-pixel abundance fractions at the multispectral pixel scale were then calculated via spatial aggregation. Correspondingly, the multispectral reflectance features were derived through spectral consistency resampling and spatial aggregation of the high-resolution imagery. This construction process not only preserves physically realistic spectral mixing mechanisms and intra-class spectral variability but also achieves physical consistency between spectral features and abundance labels at the spatial scale. Consequently, it provides a robust training sample foundation for the sub-pixel abundance inversion of impervious surfaces and related urban environmental analyses.
3.2. Training of CNN Model
To effectively map the complex non-linear relationship between the simulated medium-resolution spectra and the corresponding Impervious Surface Fraction, a One-Dimensional Convolutional Neural Network (1D-CNN) was employed. Operating directly on individual pixel-wise spectral vectors, the 1D-CNN is highly adept at extracting latent spectral features and capturing local correlations across adjacent spectral bands, making it exceptionally well-suited for pixel-based quantitative inversion tasks. To rigorously assess the model’s generalization capability, we implemented a spatially disjoint partitioning strategy for dataset splitting. Specifically, the sample pool was divided into several independent subsets, each containing approximately 200,000 samples, with 80% randomly assigned for training and 20% for validation. To ensure the reliability of the training library, a strict quality control protocol was applied before splitting to exclude 23,240 anomalous pixels that violated key physical constraints. These constraints required spectral reflectance to remain within the [0, 1] range and the total abundance sum to fall strictly between 0.9998 and 1.0002. Following this filtering, a final dataset of 9,933,359 high-quality samples was obtained.
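The two physical constraints used for quality control translate directly into a boolean mask. The following is a minimal sketch assuming samples are stored as a reflectance array of shape (samples, bands) and an abundance array of shape (samples, classes); the function name `quality_mask` and the three example samples are hypothetical.

```python
import numpy as np

def quality_mask(reflectance, abundance, low=0.9998, high=1.0002):
    """True for samples whose reflectance lies in [0, 1] and whose abundances sum to ~1."""
    reflectance = np.asarray(reflectance, dtype=float)
    abundance = np.asarray(abundance, dtype=float)
    in_range = np.all((reflectance >= 0.0) & (reflectance <= 1.0), axis=1)
    sums = abundance.sum(axis=1)
    sum_ok = (sums > low) & (sums < high)
    return in_range & sum_ok

# Three made-up samples: valid, reflectance above 1, abundances not summing to 1.
reflectance = np.array([
    [0.10, 0.20, 0.30],
    [0.10, 1.20, 0.30],
    [0.10, 0.20, 0.30],
])
abundance = np.array([
    [0.50, 0.50],
    [0.50, 0.50],
    [0.50, 0.40],
])
mask = quality_mask(reflectance, abundance)
clean_reflectance = reflectance[mask]
clean_abundance = abundance[mask]
```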
The specific architecture of the 1D-CNN is illustrated in Figure 6. It consists of two convolutional layers with 64 and 128 filters, respectively, utilizing a kernel size of 2. To preserve fine-scale spectral signatures, pooling layers were omitted, and ReLU activation functions were applied throughout to capture non-linear features. The extracted feature vectors are flattened and passed through a 128-neuron fully connected layer, which incorporates a Dropout layer to mitigate overfitting across the large sample pool. Finally, a Softmax activation is utilized in the output layer to enforce the sum-to-one constraint, ensuring that the predicted abundances are physically realistic.
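To make the layer dimensions concrete, the sketch below traces one six-band spectrum through a NumPy forward pass with the same layer sizes. It is a shape walk-through with random weights, not the trained TensorFlow model: 'valid' padding, stride 1, and a two-element output (impervious and pervious fractions) are assumptions not stated in the text, Dropout is omitted because it is inactive at inference, and all function names are hypothetical.

```python
import numpy as np

def conv1d_valid(x, kernels, bias):
    """1D convolution, stride 1, no padding.

    x: (length, in_channels); kernels: (kernel_size, in_channels, out_channels).
    """
    k = kernels.shape[0]
    out_len = x.shape[0] - k + 1
    windows = np.stack([x[i:i + k] for i in range(out_len)])  # (out_len, k, in_channels)
    return np.einsum("lki,kio->lo", windows, kernels) + bias

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def init_params(rng, n_bands=6, n_outputs=2):
    # Two kernel-size-2 'valid' convolutions shorten the spectrum from 6 to 4 positions.
    flat = (n_bands - 2) * 128
    shapes = {"w1": (2, 1, 64), "w2": (2, 64, 128), "w3": (flat, 128), "w4": (128, n_outputs)}
    params = {name: rng.normal(0.0, 0.1, size=shape) for name, shape in shapes.items()}
    params.update({"b1": np.zeros(64), "b2": np.zeros(128),
                   "b3": np.zeros(128), "b4": np.zeros(n_outputs)})
    return params

def forward(spectrum, params):
    x = np.asarray(spectrum, dtype=float)[:, None]          # (6, 1)
    x = relu(conv1d_valid(x, params["w1"], params["b1"]))   # (5, 64)
    x = relu(conv1d_valid(x, params["w2"], params["b2"]))   # (4, 128)
    x = x.reshape(-1)                                       # (512,)
    x = relu(x @ params["w3"] + params["b3"])               # (128,)
    return softmax(x @ params["w4"] + params["b4"])         # (n_outputs,)

rng = np.random.default_rng(0)
params = init_params(rng)
spectrum = np.array([0.08, 0.10, 0.12, 0.25, 0.22, 0.15])  # made-up reflectances
predicted_fractions = forward(spectrum, params)
```

Whatever the weights, the softmax output is non-negative and sums to one, which is how the output layer enforces the abundance constraints.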
Training was conducted with a batch size of 128 and an initial learning rate of 0.001 using the Adam optimizer. While the maximum number of epochs was set to 100, an early stopping mechanism was implemented to monitor validation Mean Absolute Error (MAE), terminating the process if no improvement occurred for 10 consecutive epochs.
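The stopping rule can be expressed independently of any deep learning library. The sketch below is a plain-Python illustration, assuming that "improvement" means a strictly lower validation MAE than the best value seen so far; the function name `early_stopping_epoch` and the example history are hypothetical.

```python
def early_stopping_epoch(val_mae_history, patience=10):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, mae in enumerate(val_mae_history, start=1):
        if mae < best:
            best = mae
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Made-up history: the best value is reached at epoch 3, then validation MAE stalls.
history = [0.20, 0.15, 0.12] + [0.13] * 12
stop_epoch = early_stopping_epoch(history, patience=10)

# A history that keeps improving runs to the epoch cap without stopping early.
improving = [1.0 / (n + 1) for n in range(100)]
no_stop = early_stopping_epoch(improving, patience=10)
```

Here the ten non-improving epochs are epochs 4 to 13, so training would halt after epoch 13.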
3.3. Experimental Design
To systematically evaluate the performance of the proposed sample generation strategy and the 1D-CNN unmixing model, three comparative experimental schemes were designed (Table 2). These experiments function as an ablation study to isolate the contributions of the high-quality training data and the deep learning architecture, respectively. Experiment 1 serves as the traditional baseline, where training samples were manually selected directly from the Landsat 8 imagery based on visual interpretation, and the Random Forest (RF) regressor was employed for unmixing. To validate the effectiveness of the data generation strategy, Experiment 2 utilized the hyperspectral-simulated sample library (constructed in Section 3.1) as the training data, while keeping the unmixing model (RF) unchanged. A comparison between Experiment 1 and Experiment 2 allows for quantifying the impact of sample quality, specifically spectral purity and intra-class variability, on unmixing accuracy. Finally, Experiment 3 represents the complete methodology proposed in this study, which integrates the hyperspectral-simulated library with the 1D-CNN model (described in Section 3.2). By comparing Experiment 2 and Experiment 3, where the training data remains identical, the superiority of the 1D-CNN architecture over traditional machine learning in extracting non-linear spectral features can be explicitly demonstrated. All data preprocessing, model training, and result analysis in this study were performed using open-source software and general-purpose computing hardware. Specifically, the core algorithms were implemented in Python 3.11.5, with key open-source libraries including NumPy, Pandas, Scikit-learn, TensorFlow 2.15.0, Rasterio, and GDAL.
3.4. Accuracy Assessment
3.4.1. Accuracy Evaluation Metrics
To quantitatively evaluate the performance of the proposed method in estimating subpixel endmember fractions, we employed three standard statistical metrics: Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and the Coefficient of Determination ($R^2$). RMSE serves as the primary indicator of the overall deviation, giving greater weight to larger errors to reflect model robustness, while MAE provides a direct measurement of the average absolute difference between the estimated and ground-truth values, representing the physical accuracy of the unmixing results. Additionally, $R^2$ is utilized to assess the goodness of fit and the proportion of variance explained by the model. These metrics are mathematically defined as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2} \tag{4}$$
$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right| \tag{5}$$
$$R^2 = 1 - \frac{\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2} \tag{6}$$
where N denotes the total number of samples, $y_i$ and $\hat{y}_i$ represent the actual and estimated abundance fractions for the i-th sample, respectively, and $\bar{y}$ corresponds to the mean of the observed abundances [27,44]. Consequently, lower RMSE and MAE values, combined with an $R^2$ value approaching 1, demonstrate superior unmixing accuracy and generalization capability.
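For reference, the three metrics can be computed with NumPy as sketched below, using their standard definitions (R² taken as one minus the ratio of residual to total sum of squares). The function name `regression_metrics` and the three-sample example are illustrative assumptions, not values from the study.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAE, R^2) for actual and estimated abundance fractions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    rmse = float(np.sqrt(np.mean(residual ** 2)))
    mae = float(np.mean(np.abs(residual)))
    ss_res = float(np.sum(residual ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2

# Made-up example: actual vs. estimated impervious fractions.
rmse, mae, r2 = regression_metrics([0.0, 0.5, 1.0], [0.1, 0.5, 0.9])
```

For this toy input the residual sum of squares is 0.02 and the total sum of squares is 0.5, giving R² = 0.96.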
3.4.2. Validation Data
To evaluate the model’s performance using real observations, an independent validation set containing 4259 samples was manually labeled from high-resolution GaoFen-2 (GF-2) imagery with a spatial resolution of 1 m. This external dataset is completely independent of the simulated training data. As shown in Figure 7, high-resolution impervious surface features were first vectorized from GF-2 images and then aggregated into 30 m grid cells to be consistent with the spatial resolution of Landsat 8. Each validation sample represents the actual Impervious Surface Fraction (ISF) within a 30 m pixel, providing reliable ground truth for evaluating inversion accuracy in real urban and rural scenarios.
5. Discussion
5.1. Effectiveness of the Simulation-Based Training Dataset
The proposed hyperspectral simulation-driven framework demonstrates superior efficacy in Impervious Surface Fraction (ISF) retrieval. By integrating high-resolution airborne hyperspectral data, spectral convolution, and spatial aggregation, this approach constructs a training dataset characterized by rigorous physical consistency. Unlike conventional methods that rely on manually interpreted samples, which are often subject to geometric misregistration, temporal inconsistencies, and subjective labeling errors, our framework ensures a strict correspondence between spectral features and abundance labels. This simulation-based generation mitigates data uncertainty by deriving both reflectance and ground truth from a unified, high-fidelity hyperspectral source. Consequently, the model benefits from more reliable supervision, leading to enhanced retrieval accuracy, improved preservation of spatial boundaries, and a significant reduction in false alarms within non-urban regions. These findings confirm that our physically constrained training library effectively captures the spectral characteristics of impervious surfaces, thereby substantially improving the stability and robustness of ISF mapping.
5.2. Analysis of Underestimation and Overestimation in ISF Retrieval
Despite overall performance improvements, the quantitative results in dense urban cores (e.g., Site d, slope = 0.6286) indicate a pronounced “regression-to-the-mean” bias, characterized by the systematic underestimation of high ISF values and overestimation of low ISF values. This phenomenon is primarily driven by two complex factors.
First, there is an inherent training sample imbalance heavily skewed toward intermediate ISF values. During the spatial aggregation process from high-resolution imagery to the 30 m Landsat scale, pure pixels (i.e., near 100% or 0% imperviousness) become exceedingly rare due to the high fragmentation of urban landscapes. Consequently, the statistical distribution of the simulated training library assumes a bell shape. Without explicit balancing constraints, the 1D-CNN optimizes its overall loss by adopting conservative predictions biased toward this central tendency, inevitably suppressing extreme high values.
Second, the underperformance is exacerbated by nonlinear spectral mixing effects prevalent in dense built-up areas. The current simulation pipeline generates multi-spectral features based on a linear spatial aggregation assumption (Equation (3)). However, dense urban cores feature complex 3D geometric structures, including tall buildings and narrow street canyons, which induce severe multiple scattering and deep shadowing effects. The linear synthesis fails to fully replicate these complex 3D radiative transfer mechanisms, leading to a physical discrepancy between the simulated training spectra and the actual multispectral observations recorded by the satellite.
5.3. Theoretical Potential for Cross-Sensor Extension
A distinctive feature of the proposed simulation-driven framework is its independence from specific sensor configurations, as it relies on high-resolution airborne hyperspectral imagery as a unified spectral information source. In principle, the framework generates sensor-specific training datasets through spectral convolution with the Spectral Response Functions (SRFs) of the target platform. This mechanism suggests a high potential for extending the methodology to other medium-resolution satellites, such as Sentinel-2 or MODIS, simply by substituting the SRF profile of the target sensor.
While our current validation is focused on Landsat 8 imagery, the modular design of this simulation-based approach provides a flexible paradigm for future cross-sensor model migration, potentially bypassing the labor-intensive requirements of manual re-labeling for different platforms. We acknowledge that the empirical robustness of this adaptability remains to be fully verified. Future research will incorporate multi-source satellite datasets to validate the universality of this framework across diverse satellite constellations, thereby providing a more systematic solution for long-term, multi-sensor urban environmental monitoring.
5.4. Limitations and Future Work
Despite its promising performance, several limitations of the proposed framework should be acknowledged. First, the construction of the spectral library assumes that hyperspectral pixels represent relatively pure surface materials. This assumption is generally reasonable when the spatial resolution of hyperspectral imagery is sufficiently high (e.g., finer than approximately 5 m), but mixed pixels may still exist in complex urban environments, which could introduce uncertainty into the spectral library. Second, the proposed framework relies on the availability of high-quality hyperspectral data for sample generation. Although airborne hyperspectral datasets such as AVIRIS provide rich spectral information, their spatial coverage is often limited, which may restrict the large-scale application of the framework. Finally, the current implementation primarily assumes linear spectral mixing during the spatial aggregation process. Incorporating nonlinear spectral mixing models may further improve the realism of simulated spectra and enhance ISF retrieval accuracy in complex urban environments.
To mitigate these limitations in future research, concrete strategies must be implemented at both the data and model levels. To counteract sample imbalance, advanced loss functions—such as focal loss or inverse frequency weighting—should be employed to heavily penalize prediction errors on minority pure pixels during network training. Furthermore, to address nonlinear mixing, future simulation frameworks should incorporate 3D urban canopy models or nonlinear radiative transfer approximations to synthesize more physically realistic training spectra for high-density urban zones.
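One of the suggested remedies, inverse frequency weighting, can be sketched as follows. This is a hedged illustration of the general idea rather than a tested component of the framework: the ten equal-width ISF bins, the normalisation to a mean weight of one, and the function name `inverse_frequency_weights` are all assumptions.

```python
import numpy as np

def inverse_frequency_weights(isf, n_bins=10):
    """Per-sample weights inversely proportional to the population of each ISF bin."""
    isf = np.asarray(isf, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Map each sample to a bin index; ISF == 1.0 is clipped into the last bin.
    bin_idx = np.clip(np.digitize(isf, edges) - 1, 0, n_bins - 1)
    counts = np.bincount(bin_idx, minlength=n_bins)
    weights = 1.0 / counts[bin_idx]
    # Rescale so that the average weight is 1 and the loss magnitude is preserved.
    return weights * isf.size / weights.sum()

# Made-up labels: eight mixed pixels and two pure pixels (0% and 100% impervious).
labels = np.array([0.55] * 8 + [0.0, 1.0])
weights = inverse_frequency_weights(labels)

# Example use: a weighted MAE in which the rare pure pixels count more.
predictions = np.array([0.55] * 8 + [0.2, 0.8])
weighted_mae = float(np.average(np.abs(labels - predictions), weights=weights))
```

In this toy case each pure pixel receives eight times the weight of a mixed pixel, so errors at the extremes of the ISF range dominate the weighted loss.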
Future research will focus on expanding the hyperspectral spectral library using datasets from multiple geographic regions, exploring nonlinear spectral mixing mechanisms, and validating the framework across multiple satellite sensors. These efforts will further improve the robustness and scalability of the proposed approach for large-scale impervious surface monitoring.
6. Conclusions
This study addresses the challenge of generating reliable training samples for impervious surface fraction (ISF) mapping from medium-resolution imagery by developing a physically consistent framework based on high-spatial-resolution hyperspectral data. Conventional training samples derived from visual interpretation often suffer from geometric misregistration, temporal inconsistencies, and radiometric discrepancies, which limit the accuracy and reliability of ISF inversion models.
Experimental results demonstrate that the proposed approach significantly improves ISF estimation accuracy compared with conventional sample construction strategies. The method effectively suppresses spectral confusion in non-urban areas, restores the natural bimodal distribution of ISF values, and preserves fine spatial boundaries, achieving the highest overall accuracy (R² = 0.8613).
Furthermore, the framework is designed for cross-sensor adaptability. By convolving hyperspectral data with the spectral response functions of different sensors, the method can in principle generate sensor-consistent training datasets for platforms such as Landsat and MODIS, although it has so far been validated only on Landsat 8 imagery. This capability could substantially reduce the reliance on manual re-labeling when migrating models across sensors, thereby offering a scalable and systematic route toward multi-sensor urban monitoring and long-term impervious surface mapping.