1. Introduction
According to Australia’s latest State of the Environment report [
1], the country’s environment is generally in decline, with continued loss of natural capital across themes including land (e.g., native vegetation, soil, wetlands), biodiversity, inland waters, and marine environments. Land clearing has been an ongoing problem in Australia, despite Commonwealth legislation dedicated to environmental protection—the Environment Protection and Biodiversity Conservation Act 1999 and its predecessors—as well as legislation at state and territory levels. As much as 7.7 million hectares (an area larger than the island of Tasmania) of potential habitat and communities may have been cleared in the period 2000–2017 [
2].
Monitoring land cover change is essential for existing and future policies to regulate native vegetation protection and land clearing, forest production, land and water conservation, and landscape carbon storage [
3]. Satellite data provide a sound basis for understanding and mapping land cover change that is potentially internally consistent, repeatable over time, more economical and perhaps even more reliable than ground-based mapping [
4]. Pixel-based machine learning methods based on spectral characteristics, such as random forests [
5] and support vector machines [
6], have been extensively used to turn remote-sensing images into land cover maps. In Australia, land cover mapping is available at 250 m and 30 m resolution using pixel-based methods [
7,
8,
9]. The generation of more extensive time-series of land cover maps, with greater accuracy and reduced uncertainty has the potential to increase our understanding of Australia’s environmental change and land system processes [
9].
In addition, deep learning methods have achieved promising results for many image-analysis tasks, including land use and land cover classification (e.g., [
10,
11,
12]). A review published in 2019 [
13] identified convolutional neural networks (CNNs) as the most common deep learning method for satellite image analysis in use at that time. CNNs find features in an image through a training process that generates neural network weights through an iterative gradient descent algorithm. CNNs generally consist of three building blocks: convolution and pooling layers (one to many), followed by a fully connected layer to determine the probable class. Since 2012, when they set new benchmarks for image recognition, CNNs have been successfully used for segmenting remote-sensing images to generate land cover and vegetation maps [
14,
15,
16]. The median accuracy of land cover classifications from deep learning is generally higher than that from established classifiers such as random forests or support vector machines [
13]. In the context of annual land cover change monitoring and reporting, it is even more problematic that pixel-based classifiers can show a lack of spatial and temporal stability, e.g., due to variability between individual pixels and changes in vegetation condition, respectively (e.g., [
17,
18,
19,
20]). Machine learning methods that consider spatial patterns in addition to spectral information, such as CNNs, may address some of these issues.
Studies of the evolution of land cover over time using remote-sensing data [
21] typically exploit spectral, spatial, or temporal data to build models using multiple acquisitions within a single year (e.g., [
22,
23]) or multiple years (e.g., [
24]) using a satellite image time series. Methods generally employ supervised classification to assign a class to a pixel based on a model trained to compare input data to label data [
25,
26]. Unfortunately, in the Australian context, a general lack of high-resolution label data for the years before 2018 prevents the temporal component from being easily incorporated in model building.
The following hypotheses motivated the work described here:
CNN models can produce land cover maps that are spatially and temporally more consistent than those generated using pixel-based methods; and
CNN models are sufficiently robust to map, quantify and interpret changes in major land cover classes over decadal time scales.
In this study, we wished to specifically analyse the spatial and temporal stability of CNN-based land cover mapping over 34 years of Landsat imagery (1987–2020). We applied a pre-trained U-Net CNN model and developed innovative diagnostic methods to assess spatial and temporal stability. We used the diagnostics to quantitatively and qualitatively analyse how the CNN method differs from pixel-based mapping, which is the main contribution of this study. The results were compared to the recently released annual Digital Earth Australia (DEA) Land Cover product for 1988–2020 [
27] generated using pixel-based classifiers, as well as the annual global ESRI Land Cover product for 2017–2021 [
28] that was created using a U-Net CNN model trained from scratch, but which received limited or no specific training data for Australian conditions.
2. Materials and Methods
The study area (
Figure 1) was an area of approximately 500 km × 500 km (~400 million pixels) in south-eastern Australia for which annual Landsat geomedian data at 25 m resolution [
29] was available in GDA94/Australian Albers projection (EPSG:3577) from Geoscience Australia. Annual geomedian data are pixel composites generated using a high-dimensional statistic called the ‘geometric median’, which enables the production of a representative cloud-free annual composite image [
30]. Pixel-based compositing methods have been used to exploit large data volumes on annual or seasonal time scales to create representative images [
31] as inputs to machine learning model development.
Figure 2 shows the modelling and analysis workflow of the study, documenting the analysis steps and how they relate to sections in the results.
A U-Net CNN model [
32] was built using input Landsat annual geomedian v2 data [
29] for 2018 (a dry year) and 2020 (a wet year). Label data for model development was a simple eight-class land cover classification (Bare, Built-up, Crop, Forest, Grassland, Horticulture, Plantation, Water plus an unknown No data class) derived from reference land-use mapping [
33,
34], that was enhanced using forest-extent mapping [
35], using the approach described in [
17].
The U-Net CNN model, with ResNet50 encoder and transfer learning of ImageNet weights, was trained using Landsat annual geomedian six-band data (Blue, Green, Red, NIR, SWIR1, SWIR2) for 2018 and 2020 and corresponding label data. The study area (19,968 × 19,968 pixels) was broken into 24,336 non-overlapping 128 × 128 pixel patches for 2018 and 2020. Patches over the ocean (694) were discarded and the land-based tiles for each year (23,642) were split into training, validation, and test sets (
Table 1).
The models were trained using the Adam optimiser [
36] with a batch size of eight and an initial learning rate of 0.0001 determined through multiple iterations of these hyperparameters to identify the most performant configuration. Each model was trained for up to 80 epochs, ten epochs at a time, to generate a ‘best model’ with the highest validation accuracy and weighted-mean F1 score, using the methods and software as described in [
17]. The learning rate was automatically reduced after ten epochs if validation accuracy failed to improve as model training approached the optimal solution. Imbalanced class distributions were managed using the dice and focal loss functions following [
37]:
To reduce risk of overfitting and increase the dataset size, data augmentation was applied using the fast augmentation library Albumentations [
38]. Transformations included a random selection of image blurring and sharpening, gaussian noise addition, cropping, flipping, mirroring, and affine and perspective changes. Models were built for ten epochs and the results were examined. Those with the highest validation accuracy were saved and examined for the test area (tile +14, −40 in the Geoscience Australia Landsat tiling system, see
Figure 1) to determine through visual inspection the degree of match between predicted land cover and label data.
The best CNN model was then applied to 34 years (1987–2020) of Landsat annual geomedian data for a tile of 100 km × 100 km (16 million pixels; one million hectares) surrounding the city of Canberra (tile +15, −40,
Figure 1) to generate land cover maps for each year. The input geomedian reflectances were derived from Landsat 5 for 1987–1999 and 2003–2011, Landsat 7 for 2000–2002 and 2012, and Landsat 8 for 2013–2020.
Here, our focus was on the spatial and temporal stability of annual U-Net CNN model results compared to pixel-based classifiers and reference land cover products. Specifically, the following analyses were carried out on U-Net CNN model results for 1987–2020 using a combination of Python programs, the open-source Geographic Information System QGIS, and spreadsheets in Microsoft Excel:
- 3.
Investigation of the spatial and temporal stability of land cover classes between years using methods defining model transition probabilities (cf., [
39]);
- 4.
Comparison to the DEA Land Cover product for 1988–2020 [
8,
40] created using pixel-based methods (including Random Forests) using the same annual geomedian data; and
- 5.
Comparison to the ESRI Land Cover product, a global Sentinel-2 10-m Land Use/Land Cover product for 2017–2021 [
28] created using a U-Net model trained from scratch but not specifically developed or evaluated for Australia.
The authors developed a model transition probability classification for one-year, four-year and eight-year transitions (
Table 2), to assist in understanding transitions that are likely to be real versus those that may be artefacts of the model itself. The transition probability classification was developed by the authors through an extensive discussion, consensus, and agreement process. The classification categorised changes as either as Possible (P), Unlikely (U) or Implausible (I). The assessment has an element of subjectivity but the authors quickly reached consensus on transition probabilities. Overall change statistics for one, four and eight-year land cover transitions were generated considering their likelihood. One-year transitions were examined in detail to infer the most likely reasons for the results.
The DEA Land Cover product was chosen for comparison for several reasons: it was generated from the same annual Landsat geomedian data we used as input; it covered a similar range of years (1988–2020); and, importantly, it uses more conventional pixel-based machine learning methods, including Random Forests. The DEA Land Cover product was first converted to a simple land cover classification similar to ours (
Table 3). The DEA Land Cover product uses the FAO Land Cover Classification System [
41]. Simplifying its Level-4 classification enabled comparison to our land cover classes for each year from 1988–2020. It was not possible to separate the DEA class Natural Terrestrial Vegetated: Woody into our Forest and Plantation classes. Similarly, the DEA class Cultivated Terrestrial Vegetated covers both our Horticulture and Crop classes. Therefore, we amalgamated those classes for the comparison, yielding a simplified classification of six classes (plus No data for the remainder, see
Table 3).
Using simplified label data for 2018 as a reference, confusion matrices comparing the 2018 simplified CNN and DEA results were generated to assess their spatial classification accuracy. Metrics used were overall accuracy (OA) [
42], Kappa coefficient [
43], and class weighted-mean F1 scores as described in [
17].
Confusion matrices comparing the simplified CNN and DEA maps were also generated for every year from 1988–2020 to assess the degree of correspondence and summary statistics were calculated. The simplified CNN maps for 2018 and 2020 were used to calculate fractions for class weighted-mean value calculations.
Our CNN results were also compared to the ESRI Land Cover product, which was generated using Sentinel-2 data. Although this dataset only overlaps with our data for four years (2017–2020), the comparison was of interest because the ESRI product uses a similar U-Net methodology to map land cover, albeit at a higher resolution and using a globally trained model. The ESRI product produces a set of nine simple land cover classes (
Table 4) applied globally using a U-Net CNN model. The ESRI data were reprojected to the same Australian Albers projection and then resampled to 25-m resolution to match our CNN maps. Classes were also mapped to a common simplified land cover classification (
Table 4). Our CNN mapping results were directly compared to the ESRI Land Cover data and statistics were generated for the four years of overlap (2017–2020) between the two products.
3. Results
For the study area, the best CNN model, generated after 40 epochs, had an overall accuracy of 80% and a weighted-mean F1 score for all classes of 77%. The 2018 model output for the study area (
Figure 3c) compares well visually with label data (
Figure 3b) and the Landsat geomedian true colour RGB image (
Figure 3a). At this regional scale, the model results reflect several features (e.g., forest, water, grassland) that are also recognisable in the Landsat geomedian input.
The CNN model was used to generate yearly land cover maps for 1987–2020 (34 years) for the Canberra tile (~100 km × 100 km,
Figure 1). Land cover is dominated by forest in the southwest and southeast, with smaller built-up areas associated with the central urban area of Canberra, plantation forest to the west and east of the city, water over Lake Burrinjuck in the north-west and Lake George in the central east, with the rest of the tile being largely made up of grassland (
Figure 4). Crops are cultivated in this region, but the CNN was not effective in detecting this class, nor the horticulture (mainly vineyards) that occurs to the north of Canberra. Some broad-scale changes over the full period are detectable, including the gradual increase of the Canberra urban area (Built-up) and loss of plantations due to bushfires in 2003 to the west of Canberra.
Figure 4c and
Table 5 show the increase in Forest of ~55,000 hectares (16%) and decrease in Grassland of ~79,000 hectares (14%) over 34 years, as well as a significant increase in Built-up, reflecting the growth of Canberra, of ~20,000 hectares (82%). Large increases in crop and horticulture are from a low base and are not considered to be reliable.
3.1. Temporal Stability of U-Net CNN Model Predictions
One-year, four-year (e.g., 1987–1991, 1991–1995, etc.) and eight-year (e.g., 1987–1995, 1995–2003, etc.) transitions were analysed, and summary statistics were calculated. One-year transitions resulted in fewer changes in land cover, with an average 95.4% of pixels showing no change between consecutive years (
Table 6). This compares with an average 93.4% and 91.8% of pixels showing no change for the four-year and eight-year transitions, respectively. Regarding transitions deemed possible, the results show an average increase from 2.3% via 3.2% to 3.8% of pixels for one-, four-, and eight-year transitions, respectively. For transitions deemed unlikely, one- and four-year changes are 0.1% before increasing to 4% for eight-year transitions. In contrast, implausible transitions showed an increase from 2.1% for one-year to 3.2% for four-year transitions before a decrease to 0.3% for eight-year transitions. This reversal probably reflects the change in the transition likelihood classification (
Table 2). Some transitions in land cover considered implausible over four years were considered merely unlikely over eight years, mainly relating to the conversion of Grassland to Forest, Forest to Plantation, and Plantation to Forest. The standard deviations of mean transitions are below 1.1%, suggesting that average changes between years are relatively stable over time (
Table 6).
Of the one-year transitions, the pair of years with the least change was 2015–2016 (
Figure 5). Some minor differences between 2015 and 2016 occur over Lake George (red box) and in Plantation, Forest and Built-up classes.
Table 7 shows a transition matrix for 2015–2016, highlighting transition categories: no change, possible, unlikely and implausible changes. Of the implausible changes, the most common ones are from Built-up to Grassland (0.25%), Forest to Plantation (0.21%), Plantation to Forest (0.17%) and Grassland to Forest (0.16%) which largely appear to be edge effects associated with class boundaries (
Figure 5 and
Supplementary Materials Table S1).
3.2. Comparison of U-Net CNN Model Predictions with Reference Land Cover Products
Simplifying the land cover classification to six classes provided a means to compare model results with two reference land cover products: DEA Land Cover (1988–2020) and ESRI Land Cover (2017–2021).
Simplified U-Net CNN model results for 2018 were compared to Landsat input data, simplified reference label data and DEA Land Cover in
Figure 6. 2018 had the closest match (
Figure S1) between CNN model results (
Figure 6c) and DEA Land Cover (
Figure 6d), also for a year where high-resolution label data was available (
Figure 6b).
3.2.1. Spatial Accuracy Assessment
A spatial accuracy assessment based on confusion matrices was undertaken by comparing U-Net CNN and DEA Land Cover results, respectively, to 2018 label data as the reference. Compared to the DEA Land Cover product, the CNN results had higher overall accuracy (89 vs. 82%), Kappa (82 vs. 69%), and class weighted-mean F1 score (89 vs. 81%) (
Table 8 and
Table 9). For the Forest class, results are similar with a producer’s accuracy (how often real features in the landscape are shown in the classified map) of 91% vs. 92%, while for other classes the U-Net CNN shows better producer’s accuracy then DEA Land Cover (Grassland 94% vs. 87%; Bare 63% vs. 25%; Water 61% vs. 15%; Built-up 87% vs. 13%) with the exception being the
Crop class which was not detected well by either method (0.4% vs. 14%).
3.2.2. Annual Comparisons: 1988–2020
To better understand model results over time, confusion matrices comparing the simplified CNN to the DEA mapping were generated for 33 years (1988–2020), and overall accuracy (OA), Kappa and class weighted-mean F1 scores were calculated for each year using DEA as the reference (
Supplementary Figure S1 and
Table 10). The overall accuracy of these yearly comparisons ranged from about 50–86% and weighted-mean F1 from 56–85% (
Table 10). Kappa values ranged from 33–75%, translating to fair agreement (>20% and ≤40%) for three years, moderate agreement (>40% and ≤60%) for 19 years and substantial agreement (>60% and ≤80%) for 11 years following the classification proposed by [
44].
The best correlation between the simplified U-Net CNN model and simplified DEA Land Cover was obtained for 2018 (OA 86%, Kappa 75% and class weighted-mean F1 score 84%), while the worst correlation was for 2020 (OA 50%, Kappa 33% and class weighted-mean F1 score 56%). The confusion matrices for these comparisons are provided in
Tables S2 and S3, respectively.
Substantial spatial differences between our CNN and the DEA mapping (
Figure 6c,d) occur in the urban area of Canberra, Lake George and small areas of crop and forest, mostly north and east of the city. The U-Net CNN model in 2018 identified a smaller extent of the Crop class in this tile compared to DEA Land Cover and neither result appears to effectively represent this class as referenced in the label data (
Figure 6b). Though DEA Land Cover is a closer match for this class, its Crop extent is largely not spatially coincident with the label data. Differences in the Forest class between the U-Net CNN model and DEA Land Cover are minor and limited to class boundary or edge effects, as well as a difference associated with an area of partially cleared plantation Forest (red box in
Figure 6a) immediately east of the Canberra urban area which is not completely detected in either result. A mixture of the classes Forest, Grassland, Bare, Built-up and Crop can be seen across the urban area in the simplified DEA Land Cover, with an abundance of isolated pixels producing a grainy visual effect (
Figure 6d). For Lake George, the U-Net CNN model classifies this area as Water, although in 2018 this area would have been largely dry, whereas DEA Land Cover, arguably more realistically, classifies the area as mainly Bare and Grassland, with some small amounts of Water, but also Forest and Crop, which are considered unlikely to occur in this area.
Visually there is good spatial correspondence between simplified DEA Land Cover classes and U-Net CNN model results, particularly for the Forest, Water and Grassland classes (
Figure 6). DEA Land Cover classes represented by each U-Net CNN model class are shown for Forest/Plantation (
Figure 7), Grassland (
Figure 8a) and Built-up (
Figure 8b). For the U-Net CNN model Forest and Plantation classes, the vast majority of pixels from the simplified DEA Land Cover map are the class Natural Terrestrial Vegetated: Woody (
Figure 7), with a gradual increase in the classes with trees from 1988–2020, and a correlation between trends in the respective classes of
R2 = 0.47 (
Figure 9a). In contrast, the agreement is not as close for the U-Net CNN model class Grassland. As expected, the majority of CNN grassland corresponds to the DEA class Natural Terrestrial Vegetated: Herbaceous (
Figure 8a). However, there was also a considerable proportion (5–70%) of Cultivated Terrestrial Vegetated: Herbaceous and a smaller proportion (2–18%) of Natural Terrestrial Vegetated: Woody.
For the U-Net CNN model Built-up class, the expected Artificial Surface class occupied only 6–12% of pixels (
Figure 8b). Instead, the most common corresponding DEA Land Cover classes were Natural Terrestrial Vegetated: Herbaceous and Natural Terrestrial Vegetated: Woody. Despite the lower number of pixels, the trend in the DEA Land Cover Artificial Surface class correlated reasonably well with the U-Net CNN model Built-up class (
R2 = 0.77;
Figure 9b). Urban growth is more monotonic for U-Net CNN Built-up with increases recorded for 28 years (88%) from 1988–2020 compared to 21 years (66%) for DEA Land Cover’s Artificial Surface. Comparing temporal trends, the U-Net CNN model appeared to produce anomalous results for 1992 (
Figure 8b).
The U-Net CNN model did not identify the Crop class accurately and instances of crop across land cover maps are relatively few, e.g., for 1987, 2020 (
Figure 4). In contrast, the DEA Land Cover product mapped large extents of Cultivated Terrestrial Vegetated: Herbaceous (Crop) (e.g.,
Supplementary Figure S2). However, examination of transitions of this class suggests the extent is overestimated and has considerable temporal instability between years, as shown in
Figure 10 for 1993–1995 (similar patterns were seen for other periods, e.g., 1996–1998). Field observations by the authors suggest that these rapid changes are not realistic.
The spatial extent of the Built-up class (
Figure 11) in the simplified U-Net CNN model closely reflects the urban areas visible in the Landsat RGB image with a few areas of over-classification and unclear boundaries (e.g., the industrial area indicated by a yellow rectangle in
Figure 11c). Roads are generally correctly classified by the U-Net CNN model as Built-up, whereas the pixel-based methods used to generate DEA Land Cover often classify roads as Forest or Grassland (possibly reflecting roadside vegetation), e.g.,
Figure 11b, or as Crop, e.g.,
Figure 11c.
Newly developed urban areas in the DEA Land Cover (right image of three,
Figure 11) are predominantly classified as Built-up or Bare, as might be expected. Older urban areas visible to the southwest in the 1988 image and subsequent years show many pixels classified as Forest, Grassland and even Crop, perhaps reflecting the low-density, vegetation-rich character of Canberra’s established suburbs.
Forest areas are generally defined well spatially in both our CNN results and DEA mapping. In the Grassland and Crop areas (
Figure 11), DEA Land Cover shows the previously mentioned frequent change between these classes, which is unlikely to be accurate. There is cropping in these areas, but such large variations over a one-year (
Figure 10) to eight-year (
Figure 11) time scale do not occur in the authors’ observation.
Figure 12 shows a comparison of land cover classes defined by the U-Net CNN model and DEA Land Cover for 1988–2020 for the area from
Figure 11. The U-Net CNN model shows more stability between years (except for the anomalous year 1992) and gradual changes to land cover classes appear closely matched to features (e.g., built-up and forest areas) visible in Landsat images. In contrast, DEA Land Cover shows large changes in the Grassland and Crop classes between years and more variability in the Forest class.
3.2.3. Temporal Stability of U-Net CNN Model Compared to DEA Land Cover
Characterising how often individual pixels change between consecutive years (1988–2020) and mapping results provides a measure of temporal stability (
Figure 13). If a pixel changes every year, there would be a total of 32 changes across the 33 years (inclusive) from 1988 to 2020. Other than Lake George, forested and water areas in both the U-Net CNN model and DEA Land Cover maps are well conserved through the 33 years with no or few changes recorded (white areas in the southwest, southeast, and northwest,
Figure 13). Lake George shows many changes in both the CNN and DEA mapping; this is not unexpected, given that it is a shallow ephemeral lake with highly variable extent that is regularly completely dry. For the U-Net CNN model, the urban areas of Canberra (central area) are conserved (white areas) but not in the DEA mapping.
In our CNN results, changes are often associated with class boundaries (e.g., red box in
Figure 13a), as well as linear effects associated predominantly with roads. In the DEA mapping, changes are mostly associated with the coverage of the Crop and Grassland classes (e.g.,
Figure 10), and the urban area where there is a grainy, ‘salt and pepper’ mix of classes that changes through time (
Figure 13b).
Judging by the distribution of the number of class transitions for 1988–2020, the U-Net CNN model is more temporarily stable than DEA Land Cover (
Figure 14). Almost 76% of pixels in the CNN mapping do not change class over this period, whereas only 30% of the pixels for DEA Land Cover are unchanged. The number of pixels experiencing 1–4 changes over the period is similar for both maps, but from five changes on, the U-Net CNN model has significantly fewer pixel changes than DEA Land Cover. For example, the U-Net CNN model has about 11% of pixels showing more than four changes over 33 years, whereas for the DEA Land Cover, 55% of pixels have more than four changes in this period.
3.2.4. Comparison of U-Net CNN Model Predictions with ESRI Land Cover
A comparison of the simplified U-Net CNN model result for 2017 and simplified ESRI Land Cover is provided in
Supplementary Figure S3. The ESRI map (
Figure S3b) showed more Built-up in rural areas, as well as more Crop and Grassland than the simplified U-Net CNN model (
Figure S3a). The two maps do show temporal correlation (
Figure S4). It is reiterated that both results were obtained using U-Net CNN models, although the algorithms were separately trained and used input imagery from different sources and resolution (Landsat vs. Sentinel-2). Temporal correlation (
Figure S4) between the changing extent of the built-up (R
2 = 1.0) and forest (R
2 = 0.91) classes in the mapping by our and the ESRI algorithm was considerable, though only four years overlap between datasets were available (2017–2020). Correlation for other classes is lower, e.g., R
2 = 0.38 for Water and R
2 = 0.20 for Crop.
5. Conclusions
A U-Net CNN model was applied to annual Landsat geomedian input data for 1987–2020 to generate a series of land cover maps that are more stable and consistent over several decades compared to pixel-based approaches. Although the CNN had some issues with linear features, rounded class boundaries and a reduction in object detail that is not always desirable, the U-Net CNN model results showed better spatial coherence than results generated using pixel-based methods such as random forests and produced a more spatially and temporally consistent interpretation of the landscape at the regional scale. The U-Net CNN model operated at a coarser, more aggregated level than the individual pixel, allowing it to detect contextual information to generate a more stable spatial distribution of land cover classes through time, whereas pixel-based methods are better able to detect fine-scale variations within inhomogeneous classes, e.g., vegetation within urban areas. The U-Net CNN used for land cover mapping also appeared to be more temporally stable than pixel-based mapping, with fewer class changes (24% vs. 70% of pixels) over 33 years (1988–2020).
In summary, the U-Net CNN model appears to be capable of broad-scale land cover mapping over time, with several advantages over pixel-based mapping methods. The spatial transferability of the methods described here to other regions within Australia needs to be tested and is a subject for future research.