Convolutional Neural Network Shows Greater Spatial and Temporal Stability in Multi-Annual Land Cover Mapping Than Pixel-Based Methods

: Satellite imagery is the only feasible approach to annual monitoring and reporting on land cover change. Unfortunately, conventional pixel-based classiﬁcation methods based on spectral response only (e.g., using random forests algorithms) have shown a lack of spatial and temporal stability due, for instance, to variability between individual pixels and changes in vegetation condition, respectively. Machine learning methods that consider spatial patterns in addition to reﬂectance can address some of these issues. In this study, a convolutional neural network (CNN) model, U-Net, was trained for a 500 km × 500 km region in southeast Australia using annual Landsat geomedian data for the relatively dry and wet years of 2018 and 2020, respectively. The label data for model training was an eight-class classiﬁcation inferred from a static land-use map, enhanced using forest-extent mapping. Here, we wished to analyse the beneﬁts of CNN-based land cover mapping and reporting over 34 years (1987–2020). We used the trained model to generate annual land cover maps for a 100 km × 100 km tile near the Australian Capital Territory. We developed innovative diagnostic methods to assess spatial and temporal stability, analysed how the CNN method differs from pixel-based mapping and compared it with two reference land cover products available for some years. Our U-Net CNN results showed better spatial and temporal stability with, respectively, overall accuracy of 89% verses 82% for reference pixel-based mapping, and 76% of pixels unchanged over 33 years. This gave a clearer insight into where and when land cover change occurred compared to reference mapping, where only 30% of pixels were conserved. Remaining issues include edge effects associated with the CNN method and a limited ability to distinguish some land cover types (e.g., broadacre crops vs. pasture). We conclude that the CNN model was better for understanding broad-scale land cover change, use in environmental accounting and natural resource management, whereas pixel-based approaches sometimes more accurately represented small-scale changes in land cover.


Introduction
According to Australia's latest State of the Environment report [1], the country's environment is generally in decline, with continued loss of natural capital across themes including land (e.g., native vegetation, soil, wetlands), biodiversity, inland waters, and marine environments. Land clearing has been an ongoing problem in Australia, despite Commonwealth legislation dedicated to environmental protection-the Environment Protection and Biodiversity Conservation Act 1999 and its predecessors-as well as legislation at state and territory levels. As much as 7.7 million hectares (an area larger than the island of Tasmania) of potential habitat and communities may have been cleared in the period 2000-2017 [2].
Monitoring land cover change is essential for existing and future policies to regulate native vegetation protection and land clearing, forest production, land and water 1.
CNN models can produce land cover maps that are spatially and temporally more consistent than those generated using pixel-based methods; and 2.
CNN models are sufficiently robust to map, quantify and interpret changes in major land cover classes over decadal time scales.
In this study, we wished to specifically analyse the spatial and temporal stability of CNN-based land cover mapping over 34 years of Landsat imagery (1987-2020). We applied a pre-trained U-Net CNN model and developed innovative diagnostic methods to assess spatial and temporal stability. We used the diagnostics to quantitatively and qualitatively analyse how the CNN method differs from pixel-based mapping, which is the main contribution of this study. The results were compared to the recently released annual Digital Earth Australia (DEA) Land Cover product for 1988-2020 [27] generated using pixel-based classifiers, as well as the annual global ESRI Land Cover product for 2017-2021 [28] that was created using a U-Net CNN model trained from scratch, but which received limited or no specific training data for Australian conditions.

Materials and Methods
The study area (Figure 1) was an area of approximately 500 km × 500 km (~400 million pixels) in south-eastern Australia for which annual Landsat geomedian data at 25 m resolution [29] was available in GDA94/Australian Albers projection (EPSG:3577) from Remote Sens. 2023, 15, 2132 3 of 21 Geoscience Australia. Annual geomedian data are pixel composites generated using a high-dimensional statistic called the 'geometric median', which enables the production of a representative cloud-free annual composite image [30]. Pixel-based compositing methods have been used to exploit large data volumes on annual or seasonal time scales to create representative images [31] as inputs to machine learning model development.

Materials and Methods
The study area (Figure 1) was an area of approximately 500 km × 500 km (~400 million pixels) in south-eastern Australia for which annual Landsat geomedian data at 25 m resolution [29] was available in GDA94/Australian Albers projection (EPSG:3577) from Geoscience Australia. Annual geomedian data are pixel composites generated using a highdimensional statistic called the 'geometric median', which enables the production of a representative cloud-free annual composite image [30]. Pixel-based compositing methods have been used to exploit large data volumes on annual or seasonal time scales to create representative images [31] as inputs to machine learning model development.    Figure 2 shows the modelling and analysis workflow of the study, documenting the analysis steps and how they relate to sections in the results.
A U-Net CNN model [32] was built using input Landsat annual geomedian v2 data [29] for 2018 (a dry year) and 2020 (a wet year). Label data for model development was a simple eight-class land cover classification (Bare, Built-up, Crop, Forest, Grassland, Horticulture, Plantation, Water plus an unknown No data class) derived from reference land-use mapping [33,34], that was enhanced using forest-extent mapping [35], using the approach described in [17].
The U-Net CNN model, with ResNet50 encoder and transfer learning of ImageNet weights, was trained using Landsat annual geomedian six-band data (Blue, Green, Red, NIR, SWIR1, SWIR2) for 2018 and 2020 and corresponding label data. The study area (19,968 × 19,968 pixels) was broken into 24,336 non-overlapping 128 × 128 pixel patches for 2018 and 2020. Patches over the ocean (694) were discarded and the land-based tiles for each year (23,642) were split into training, validation, and test sets (Table 1) A U-Net CNN model [32] was built using input Landsat annual geomedian v2 data [29] for 2018 (a dry year) and 2020 (a wet year). Label data for model development was a simple eight-class land cover classification (Bare, Built-up, Crop, Forest, Grassland, Horticulture, Plantation, Water plus an unknown No data class) derived from reference landuse mapping [33,34], that was enhanced using forest-extent mapping [35], using the approach described in [17].
The U-Net CNN model, with ResNet50 encoder and transfer learning of ImageNet weights, was trained using Landsat annual geomedian six-band data (Blue, Green, Red, NIR, SWIR1, SWIR2) for 2018 and 2020 and corresponding label data. The study area (19,968 × 19,968 pixels) was broken into 24,336 non-overlapping 128 × 128 pixel patches for 2018 and 2020. Patches over the ocean (694) were discarded and the land-based tiles for each year (23,642) were split into training, validation, and test sets (Table 1). The models were trained using the Adam optimiser [36] with a batch size of eight and an initial learning rate of 0.0001 determined through multiple iterations of these hyperparameters to identify the most performant configuration. Each model was trained for up to 80 epochs, ten epochs at a time, to generate a 'best model' with the highest validation accuracy and weighted-mean F1 score, using the methods and software as described in  The models were trained using the Adam optimiser [36] with a batch size of eight and an initial learning rate of 0.0001 determined through multiple iterations of these hyperparameters to identify the most performant configuration. Each model was trained for up to 80 epochs, ten epochs at a time, to generate a 'best model' with the highest validation accuracy and weighted-mean F1 score, using the methods and software as described in [17]. The learning rate was automatically reduced after ten epochs if validation accuracy failed to improve as model training approached the optimal solution. Imbalanced class distributions were managed using the dice and focal loss functions following [37]: To reduce risk of overfitting and increase the dataset size, data augmentation was applied using the fast augmentation library Albumentations [38]. Transformations included a random selection of image blurring and sharpening, gaussian noise addition, cropping, flipping, mirroring, and affine and perspective changes. Models were built for ten epochs and the results were examined. Those with the highest validation accuracy were saved and examined for the test area (tile +14, −40 in the Geoscience Australia Landsat tiling system, see Figure 1) to determine through visual inspection the degree of match between predicted land cover and label data. The best CNN model was then applied to 34 years (1987-2020) of Landsat annual geomedian data for a tile of 100 km × 100 km (16 million pixels; one million hectares) surrounding the city of Canberra (tile +15, −40, Figure 1) to generate land cover maps for each year. The input geomedian reflectances were derived from Landsat 5 for 1987and 2003, Landsat 7 for 2000, and Landsat 8 for 2013-2020.
Here, our focus was on the spatial and temporal stability of annual U-Net CNN model results compared to pixel-based classifiers and reference land cover products. Specifically, the following analyses were carried out on U-Net CNN model results for 1987-2020 using a combination of Python programs, the open-source Geographic Information System QGIS, and spreadsheets in Microsoft Excel:

3.
Investigation of the spatial and temporal stability of land cover classes between years using methods defining model transition probabilities (cf., [39]); 4.
Comparison to the DEA Land Cover product for 1988-2020 [8,40] created using pixelbased methods (including Random Forests) using the same annual geomedian data; and 5.
Comparison to the ESRI Land Cover product, a global Sentinel-2 10-m Land Use/Land Cover product for 2017-2021 [28] created using a U-Net model trained from scratch but not specifically developed or evaluated for Australia.
The authors developed a model transition probability classification for one-year, fouryear and eight-year transitions (Table 2), to assist in understanding transitions that are likely to be real versus those that may be artefacts of the model itself. The transition probability classification was developed by the authors through an extensive discussion, consensus, and agreement process. The classification categorised changes as either as Possible (P), Unlikely (U) or Implausible (I). The assessment has an element of subjectivity but the authors quickly reached consensus on transition probabilities. Overall change statistics for one, four and eight-year land cover transitions were generated considering their likelihood.
One-year transitions were examined in detail to infer the most likely reasons for the results.  PPP  IIU  IIU  IIU  IIU  IIU  IIU  III  Grassland  PPP  PPP  UUU  PPP  PPP  PPP  PPP  III  Horticulture  PPP  PPP  PPP  PPP  IPP  PPP  UUU  III  Crop  PPP  PPP  PPP  PPP  IPP  PPP  UUU  III  Plantation  PPP  IIU  IIU  IIU  IIU  IIU  IIU  III  Bare  PPP  PPP  PPP  PPP  PPP  PPP  PPP  PPP  Water  PPP  UUU  PPP  UUU  UUU  UUU  PPP  III  Built-up  PPP  PPP  PPP  UUU  PPP  PPP  PPP  IIU The DEA Land Cover product was chosen for comparison for several reasons: it was generated from the same annual Landsat geomedian data we used as input; it covered a similar range of years (1988-2020); and, importantly, it uses more conventional pixel-based machine learning methods, including Random Forests. The DEA Land Cover product was first converted to a simple land cover classification similar to ours ( Table 3). The DEA Land Cover product uses the FAO Land Cover Classification System [41]. Simplifying its Level-4 classification enabled comparison to our land cover classes for each year from 1988-2020. It was not possible to separate the DEA class Natural Terrestrial Vegetated: Woody into our Forest and Plantation classes. Similarly, the DEA class Cultivated Terrestrial Vegetated covers both our Horticulture and Crop classes. Therefore, we amalgamated those classes for the comparison, yielding a simplified classification of six classes (plus No data for the remainder, see Table 3). Using simplified label data for 2018 as a reference, confusion matrices comparing the 2018 simplified CNN and DEA results were generated to assess their spatial classification accuracy. Metrics used were overall accuracy (OA) [42], Kappa coefficient [43], and class weighted-mean F1 scores as described in [17].
Confusion matrices comparing the simplified CNN and DEA maps were also generated for every year from 1988-2020 to assess the degree of correspondence and summary statistics were calculated. The simplified CNN maps for 2018 and 2020 were used to calculate fractions for class weighted-mean value calculations.
Our CNN results were also compared to the ESRI Land Cover product, which was generated using Sentinel-2 data. Although this dataset only overlaps with our data for four years (2017-2020), the comparison was of interest because the ESRI product uses a similar U-Net methodology to map land cover, albeit at a higher resolution and using a globally trained model. The ESRI product produces a set of nine simple land cover classes (Table 4) applied globally using a U-Net CNN model. The ESRI data were reprojected to the same Australian Albers projection and then resampled to 25-m resolution to match our CNN maps. Classes were also mapped to a common simplified land cover classification (Table 4). Our CNN mapping results were directly compared to the ESRI Land Cover data and statistics were generated for the four years of overlap (2017-2020) between the two products.

Results
For the study area, the best CNN model, generated after 40 epochs, had an overall accuracy of 80% and a weighted-mean F1 score for all classes of 77%. The 2018 model output for the study area ( Figure 3c) compares well visually with label data (Figure 3b) and the Landsat geomedian true colour RGB image (Figure 3a). At this regional scale, the model results reflect several features (e.g., forest, water, grassland) that are also recognisable in the Landsat geomedian input.

Results
For the study area, the best CNN model, generated after 40 epochs, had an overall accuracy of 80% and a weighted-mean F1 score for all classes of 77%. The 2018 model output for the study area ( Figure 3c) compares well visually with label data (Figure 3b) and the Landsat geomedian true colour RGB image (Figure 3a). At this regional scale, the model results reflect several features (e.g., forest, water, grassland) that are also recognisable in the Landsat geomedian input. The CNN model was used to generate yearly land cover maps for 1987-2020 (34 years) for the Canberra tile (~100 km × 100 km, Figure 1). Land cover is dominated by forest in the southwest and southeast, with smaller built-up areas associated with the central urban area of Canberra, plantation forest to the west and east of the city, water over Lake Burrinjuck in the north-west and Lake George in the central east, with the rest of the tile being largely made up of grassland ( Figure 4). Crops are cultivated in this region, but the CNN was not effective in detecting this class, nor the horticulture (mainly vineyards) that occurs to the north of Canberra. Some broad-scale changes over the full period are detectable, including the gradual increase of the Canberra urban area (Built-up) and loss of plantations due to bushfires in 2003 to the west of Canberra. Figure 4c and Table 5 show the increase in Forest of ~55,000 hectares (16%) and decrease in Grassland of ~79,000 hectares (14%) over 34 years, as well as a significant increase in Built-up, reflecting the growth of Canberra, of ~20,000 hectares (82%). Large increases in crop and horticulture are from a low base and are not considered to be reliable. The CNN model was used to generate yearly land cover maps for 1987-2020 (34 years) for the Canberra tile (~100 km × 100 km, Figure 1). Land cover is dominated by forest in the southwest and southeast, with smaller built-up areas associated with the central urban area of Canberra, plantation forest to the west and east of the city, water over Lake Burrinjuck in the north-west and Lake George in the central east, with the rest of the tile being largely made up of grassland ( Figure 4). Crops are cultivated in this region, but the CNN was not effective in detecting this class, nor the horticulture (mainly vineyards) that occurs to the north of Canberra. Some broad-scale changes over the full period are detectable, including the gradual increase of the Canberra urban area (Built-up) and loss of plantations due to bushfires in 2003 to the west of Canberra. Figure 4c and Table 5 show the increase in Forest of~55,000 hectares (16%) and decrease in Grassland of~79,000 hectares (14%) over 34 years, as well as a significant increase in Built-up, reflecting the growth of Canberra, of~20,000 hectares (82%). Large increases in crop and horticulture are from a low base and are not considered to be reliable.

Temporal Stability of U-Net CNN Model Predictions
One-year, four-year (e.g., 1987-1991, 1991-1995, etc.) and eight-year (e.g., 1987-1995, 1995-2003, etc.) transitions were analysed, and summary statistics were calculated. Oneyear transitions resulted in fewer changes in land cover, with an average 95.4% of pixels showing no change between consecutive years (Table 6). This compares with an average 93.4% and 91.8% of pixels showing no change for the four-year and eight-year transitions, respectively. Regarding transitions deemed possible, the results show an average increase from 2.3% via 3.2% to 3.8% of pixels for one-, four-, and eight-year transitions, respectively. For transitions deemed unlikely, one-and four-year changes are 0.1% before increasing to 4% for eight-year transitions. In contrast, implausible transitions showed an increase from 2.1% for one-year to 3.2% for four-year transitions before a decrease to 0.3% for eight-year transitions. This reversal probably reflects the change in the transition likelihood classification ( Table 2). Some transitions in land cover considered implausible over four years were considered merely unlikely over eight years, mainly relating to the conversion of Grassland to Forest, Forest to Plantation, and Plantation to Forest. The standard deviations of mean transitions are below 1.1%, suggesting that average changes between years are relatively stable over time (Table 6). Of the one-year transitions, the pair of years with the least change was 2015-2016 ( Figure 5). Some minor differences between 2015 and 2016 occur over Lake George (red box) and in Plantation, Forest and Built-up classes. Table 7 shows a transition matrix for 2015-2016, highlighting transition categories: no change, possible, unlikely and implausible changes. Of the implausible changes, the most common ones are from Built-up to Grassland (0.25%), Forest to Plantation (0.21%), Plantation to Forest (0.17%) and Grassland to Forest (0.16%) which largely appear to be edge effects associated with class boundaries ( Figure 5 and Supplementary Materials Table S1).
( Figure 5). Some minor differences between 2015 and 2016 occur over Lake George (red box) and in Plantation, Forest and Built-up classes. Table 7 shows a transition matrix for 2015-2016, highlighting transition categories: no change, possible, unlikely and implausible changes. Of the implausible changes, the most common ones are from Built-up to Grassland (0.25%), Forest to Plantation (0.21%), Plantation to Forest (0.17%) and Grassland to Forest (0.16%) which largely appear to be edge effects associated with class boundaries ( Figure 5 and Supplementary Materials Table S1).

Comparison of U-Net CNN Model Predictions with Reference Land Cover Products
Simplifying the land cover classification to six classes provided a means to compare model results with two reference land cover products: DEA Land Cover (1988-2020) and ESRI Land Cover (2017-2021).
Simplified U-Net CNN model results for 2018 were compared to Landsat input data, simplified reference label data and DEA Land Cover in Figure 6. 2018 had the closest match ( Figure S1) between CNN model results ( Figure 6c) and DEA Land Cover ( Figure  6d), also for a year where high-resolution label data was available (Figure 6b).

Comparison of U-Net CNN Model Predictions with Reference Land Cover Products
Simplifying the land cover classification to six classes provided a means to compare model results with two reference land cover products: DEA Land Cover (1988-2020) and ESRI Land Cover (2017-2021).
Simplified U-Net CNN model results for 2018 were compared to Landsat input data, simplified reference label data and DEA Land Cover in Figure 6. 2018 had the closest match ( Figure S1) between CNN model results ( Figure 6c) and DEA Land Cover (Figure 6d), also for a year where high-resolution label data was available (Figure 6b).

Spatial Accuracy Assessment
A spatial accuracy assessment based on confusion matrices was undertaken by comparing U-Net CNN and DEA Land Cover results, respectively, to 2018 label data as the reference. Compared to the DEA Land Cover product, the CNN results had higher overall accuracy (89 vs. 82%), Kappa (82 vs. 69%), and class weighted-mean F1 score (89 vs. 81%) ( Tables 8 and 9). For the Forest class, results are similar with a producer's accuracy (how often real features in the landscape are shown in the classified map) of 91% vs. 92%, while for other classes the U-Net CNN shows better producer's accuracy then DEA Land Cover (Grassland 94% vs. 87%; Bare 63% vs. 25%; Water 61% vs. 15%; Built-up 87% vs. 13%) with the exception being the Crop class which was not detected well by either method (0.4% vs. 14%).

Annual Comparisons: 1988-2020
To better understand model results over time, confusion matrices comparing the simplified CNN to the DEA mapping were generated for 33 years (1988-2020), and overall accuracy (OA), Kappa and class weighted-mean F1 scores were calculated for each year using DEA as the reference (Supplementary Figure S1 and Table 10). The overall accuracy of these yearly comparisons ranged from about 50-86% and weighted-mean F1 from 56-85% (Table 10). Kappa values ranged from 33-75%, translating to fair agreement (>20% and ≤40%) for three years, moderate agreement (>40% and ≤60%) for 19 years and substantial agreement (>60% and ≤80%) for 11 years following the classification proposed by [44]. The best correlation between the simplified U-Net CNN model and simplified DEA Land Cover was obtained for 2018 (OA 86%, Kappa 75% and class weighted-mean F1 score 84%), while the worst correlation was for 2020 (OA 50%, Kappa 33% and class weighted-mean F1 score 56%). The confusion matrices for these comparisons are provided in Tables S2 and S3, respectively. Substantial spatial differences between our CNN and the DEA mapping (Figure 6c,d) occur in the urban area of Canberra, Lake George and small areas of crop and forest, mostly north and east of the city. The U-Net CNN model in 2018 identified a smaller extent of the Crop class in this tile compared to DEA Land Cover and neither result appears to effectively represent this class as referenced in the label data ( Figure 6b). Though DEA Land Cover is a closer match for this class, its Crop extent is largely not spatially coincident with the label data. Differences in the Forest class between the U-Net CNN model and DEA Land Cover are minor and limited to class boundary or edge effects, as well as a difference associated with an area of partially cleared plantation Forest (red box in Figure 6a) immediately east of the Canberra urban area which is not completely detected in either result. A mixture of the classes Forest, Grassland, Bare, Built-up and Crop can be seen across the urban area in the simplified DEA Land Cover, with an abundance of isolated pixels producing a grainy visual effect (Figure 6d). For Lake George, the U-Net CNN model classifies this area as Water, although in 2018 this area would have been largely dry, whereas DEA Land Cover, arguably more realistically, classifies the area as mainly Bare and Grassland, with some small amounts of Water, but also Forest and Crop, which are considered unlikely to occur in this area.
Visually there is good spatial correspondence between simplified DEA Land Cover classes and U-Net CNN model results, particularly for the Forest, Water and Grassland classes ( Figure 6). DEA Land Cover classes represented by each U-Net CNN model class are shown for Forest/Plantation (Figure 7), Grassland ( Figure 8a) and Built-up (Figure 8b). For the U-Net CNN model Forest and Plantation classes, the vast majority of pixels from the simplified DEA Land Cover map are the class Natural Terrestrial Vegetated: Woody (Figure 7), with a gradual increase in the classes with trees from 1988-2020, and a correlation between trends in the respective classes of R 2 = 0.47 (Figure 9a). In contrast, the agreement is not as close for the U-Net CNN model class Grassland. As expected, the majority of CNN grassland corresponds to the DEA class Natural Terrestrial Vegetated: Herbaceous (Figure 8a). However, there was also a considerable proportion (5-70%) of Cultivated Terrestrial Vegetated: Herbaceous and a smaller proportion (2-18%) of Natural Terrestrial Vegetated: Woody. tion between trends in the respective classes of R 2 = 0.47 (Figure 9a). In contrast, the agreement is not as close for the U-Net CNN model class Grassland. As expected, the majority of CNN grassland corresponds to the DEA class Natural Terrestrial Vegetated: Herbaceous (Figure 8a). However, there was also a considerable proportion (5-70%) of Cultivated Terrestrial Vegetated: Herbaceous and a smaller proportion (2-18%) of Natural Terrestrial Vegetated: Woody.  ceous (Figure 8a). However, there was also a considerable proportion (5-70%) of Cultivated Terrestrial Vegetated: Herbaceous and a smaller proportion (2-18%) of Natural Terrestrial Vegetated: Woody.  For the U-Net CNN model Built-up class, the expected Artificial Surface class occupied only 6-12% of pixels (Figure 8b). Instead, the most common corresponding DEA Land Cover classes were Natural Terrestrial Vegetated: Herbaceous and Natural Terrestrial Vegetated: Woody. Despite the lower number of pixels, the trend in the DEA Land Cover Artificial Surface class correlated reasonably well with the U-Net CNN model Built- For the U-Net CNN model Built-up class, the expected Artificial Surface class occupied only 6-12% of pixels (Figure 8b). Instead, the most common corresponding DEA Land Cover classes were Natural Terrestrial Vegetated: Herbaceous and Natural Terrestrial Vegetated: Woody. Despite the lower number of pixels, the trend in the DEA Land Cover Artificial Surface class correlated reasonably well with the U-Net CNN model Built-up class (R 2 = 0.77; Figure 9b). Urban growth is more monotonic for U-Net CNN Built-up with increases recorded for 28 years (88%) from 1988-2020 compared to 21 years (66%) for DEA Land Cover's Artificial Surface. Comparing temporal trends, the U-Net CNN model appeared to produce anomalous results for 1992 (Figure 8b). up class (R 2 = 0.77; Figure 9b). Urban growth is more monotonic for U-Net CNN Built-up with increases recorded for 28 years (88%) from 1988-2020 compared to 21 years (66%) for DEA Land Cover's Artificial Surface. Comparing temporal trends, the U-Net CNN model appeared to produce anomalous results for 1992 (Figure 8b). The U-Net CNN model did not identify the Crop class accurately and instances of crop across land cover maps are relatively few, e.g., for 1987, 2020 ( Figure 4). In contrast, the DEA Land Cover product mapped large extents of Cultivated Terrestrial Vegetated: Herbaceous (Crop) (e.g., Supplementary Figure S2). However, examination of transitions of this class suggests the extent is overestimated and has considerable temporal instability between years, as shown in Figure 10 for 1993-1995 (similar patterns were seen for other periods, e.g., 1996-1998). Field observations by the authors suggest that these rapid changes are not realistic. The U-Net CNN model did not identify the Crop class accurately and instances of crop across land cover maps are relatively few, e.g., for 1987, 2020 ( Figure 4). In contrast, the DEA Land Cover product mapped large extents of Cultivated Terrestrial Vegetated: Herbaceous (Crop) (e.g., Supplementary Figure S2). However, examination of transitions of this class suggests the extent is overestimated and has considerable temporal instability between years, as shown in Figure 10 for 1993-1995 (similar patterns were seen for other periods, e.g., 1996-1998). Field observations by the authors suggest that these rapid changes are not realistic. up class (R 2 = 0.77; Figure 9b). Urban growth is more monotonic for U-Net CNN Built-up with increases recorded for 28 years (88%) from 1988-2020 compared to 21 years (66%) for DEA Land Cover's Artificial Surface. Comparing temporal trends, the U-Net CNN model appeared to produce anomalous results for 1992 (Figure 8b). The U-Net CNN model did not identify the Crop class accurately and instances of crop across land cover maps are relatively few, e.g., for 1987, 2020 ( Figure 4). In contrast, the DEA Land Cover product mapped large extents of Cultivated Terrestrial Vegetated: Herbaceous (Crop) (e.g., Supplementary Figure S2). However, examination of transitions of this class suggests the extent is overestimated and has considerable temporal instability between years, as shown in Figure 10 for 1993-1995 (similar patterns were seen for other periods, e.g., 1996-1998). Field observations by the authors suggest that these rapid changes are not realistic. The spatial extent of the Built-up class (Figure 11) in the simplified U-Net CNN model closely reflects the urban areas visible in the Landsat RGB image with a few areas of over-classification and unclear boundaries (e.g., the industrial area indicated by a yellow rectangle in Figure 11c). Roads are generally correctly classified by the U-Net CNN model as Built-up, whereas the pixel-based methods used to generate DEA Land Cover often classify roads as Forest or Grassland (possibly reflecting roadside vegetation), e.g., Figure 11b, or as Crop, e.g., Figure 11c. classify roads as Forest or Grassland (possibly reflecting roadside vegetation), e.g., Figure  11b, or as Crop, e.g., Figure 11c. Newly developed urban areas in the DEA Land Cover (right image of three, Figure  11) are predominantly classified as Built-up or Bare, as might be expected. Older urban areas visible to the southwest in the 1988 image and subsequent years show many pixels classified as Forest, Grassland and even Crop, perhaps reflecting the low-density, vegetation-rich character of Canberra's established suburbs.
Forest areas are generally defined well spatially in both our CNN results and DEA mapping. In the Grassland and Crop areas (Figure 11), DEA Land Cover shows the previously mentioned frequent change between these classes, which is unlikely to be accurate. There is cropping in these areas, but such large variations over a one-year ( Figure 10) to eight-year ( Figure 11) time scale do not occur in the authors' observation. Figure 12 shows a comparison of land cover classes defined by the U-Net CNN model and DEA Land Cover for 1988-2020 for the area from Figure 11. The U-Net CNN model shows more stability between years (except for the anomalous year 1992) and gradual changes to land cover classes appear closely matched to features (e.g., built-up and forest areas) visible in Landsat images. In contrast, DEA Land Cover shows large changes in the Grassland and Crop classes between years and more variability in the Forest class. Newly developed urban areas in the DEA Land Cover (right image of three, Figure 11) are predominantly classified as Built-up or Bare, as might be expected. Older urban areas visible to the southwest in the 1988 image and subsequent years show many pixels classified as Forest, Grassland and even Crop, perhaps reflecting the low-density, vegetation-rich character of Canberra's established suburbs.
Forest areas are generally defined well spatially in both our CNN results and DEA mapping. In the Grassland and Crop areas (Figure 11), DEA Land Cover shows the previously mentioned frequent change between these classes, which is unlikely to be accurate. There is cropping in these areas, but such large variations over a one-year ( Figure 10) to eight-year ( Figure 11) time scale do not occur in the authors' observation. Figure 12 shows a comparison of land cover classes defined by the U-Net CNN model and DEA Land Cover for 1988-2020 for the area from Figure 11. The U-Net CNN model shows more stability between years (except for the anomalous year 1992) and gradual changes to land cover classes appear closely matched to features (e.g., built-up and forest areas) visible in Landsat images. In contrast, DEA Land Cover shows large changes in the Grassland and Crop classes between years and more variability in the Forest class.

Temporal Stability of U-Net CNN Model Compared to DEA Land Cover
Characterising how often individual pixels change between consecutive years (1988-2020) and mapping results provides a measure of temporal stability (Figure 13). If a pixel changes every year, there would be a total of 32 changes across the 33 years (inclusive) from 1988 to 2020. Other than Lake George, forested and water areas in both the U-Net CNN model and DEA Land Cover maps are well conserved through the 33 years with no or few changes recorded (white areas in the southwest, southeast, and northwest, Figure 13). Lake George shows many changes in both the CNN and DEA mapping; this is not unexpected, given that it is a shallow ephemeral lake with highly variable extent that is regularly completely dry. For the U-Net CNN model, the urban areas of Canberra (central area) are conserved (white areas) but not in the DEA mapping.

Temporal Stability of U-Net CNN Model Compared to DEA Land Cover
Characterising how often individual pixels change between consecutive years (1988-2020) and mapping results provides a measure of temporal stability (Figure 13). If a pixel changes every year, there would be a total of 32 changes across the 33 years (inclusive) from 1988 to 2020. Other than Lake George, forested and water areas in both the U-Net CNN model and DEA Land Cover maps are well conserved through the 33 years with no or few changes recorded (white areas in the southwest, southeast, and northwest, Figure  13). Lake George shows many changes in both the CNN and DEA mapping; this is not unexpected, given that it is a shallow ephemeral lake with highly variable extent that is regularly completely dry. For the U-Net CNN model, the urban areas of Canberra (central area) are conserved (white areas) but not in the DEA mapping. In our CNN results, changes are often associated with class boundaries (e.g., red box in Figure 13a), as well as linear effects associated predominantly with roads. In the DEA mapping, changes are mostly associated with the coverage of the Crop and Grassland classes (e.g., Figure 10), and the urban area where there is a grainy, 'salt and pepper' mix of classes that changes through time (Figure 13b).  Characterising how often individual pixels change between consecutive years (1988-2020) and mapping results provides a measure of temporal stability (Figure 13). If a pixel changes every year, there would be a total of 32 changes across the 33 years (inclusive) from 1988 to 2020. Other than Lake George, forested and water areas in both the U-Net CNN model and DEA Land Cover maps are well conserved through the 33 years with no or few changes recorded (white areas in the southwest, southeast, and northwest, Figure  13). Lake George shows many changes in both the CNN and DEA mapping; this is not unexpected, given that it is a shallow ephemeral lake with highly variable extent that is regularly completely dry. For the U-Net CNN model, the urban areas of Canberra (central area) are conserved (white areas) but not in the DEA mapping. In our CNN results, changes are often associated with class boundaries (e.g., red box in Figure 13a), as well as linear effects associated predominantly with roads. In the DEA mapping, changes are mostly associated with the coverage of the Crop and Grassland classes (e.g., Figure 10), and the urban area where there is a grainy, 'salt and pepper' mix of classes that changes through time (Figure 13b). In our CNN results, changes are often associated with class boundaries (e.g., red box in Figure 13a), as well as linear effects associated predominantly with roads. In the DEA mapping, changes are mostly associated with the coverage of the Crop and Grassland classes (e.g., Figure 10), and the urban area where there is a grainy, 'salt and pepper' mix of classes that changes through time (Figure 13b).
Judging by the distribution of the number of class transitions for 1988-2020, the U-Net CNN model is more temporarily stable than DEA Land Cover ( Figure 14). Almost 76% of pixels in the CNN mapping do not change class over this period, whereas only 30% of the pixels for DEA Land Cover are unchanged. The number of pixels experiencing 1-4 changes over the period is similar for both maps, but from five changes on, the U-Net CNN model has significantly fewer pixel changes than DEA Land Cover. For example, the U-Net CNN model has about 11% of pixels showing more than four changes over 33 years, whereas for the DEA Land Cover, 55% of pixels have more than four changes in this period.
30% of the pixels for DEA Land Cover are unchanged. The number of pixels experiencing 1-4 changes over the period is similar for both maps, but from five changes on, the U-Net CNN model has significantly fewer pixel changes than DEA Land Cover. For example, the U-Net CNN model has about 11% of pixels showing more than four changes over 33 years, whereas for the DEA Land Cover, 55% of pixels have more than four changes in this period.

Comparison of U-Net CNN Model Predictions with ESRI Land Cover
A comparison of the simplified U-Net CNN model result for 2017 and simplified ESRI Land Cover is provided in Supplementary Figure S3. The ESRI map ( Figure S3b) showed more Built-up in rural areas, as well as more Crop and Grassland than the simplified U-Net CNN model ( Figure S3a). The two maps do show temporal correlation (Figure S4). It is reiterated that both results were obtained using U-Net CNN models, although the algorithms were separately trained and used input imagery from different sources and resolution (Landsat vs. Sentinel-2). Temporal correlation ( Figure S4) between the changing extent of the built-up (R 2 = 1.0) and forest (R 2 = 0.91) classes in the mapping by our and the ESRI algorithm was considerable, though only four years overlap between datasets were available (2017-2020). Correlation for other classes is lower, e.g., R 2 = 0.38 for Water and R 2 = 0.20 for Crop.

Discussion
The accuracy of label data affects the quality of the trained classifier. In a previous study [17], 800 randomly selected pixels (100 per land cover class) were examined to validate the label data used here using high-resolution imagery. Their validation suggested that the trained model is, in fact, slightly more accurate than the label data, with the 'real' CNN model OA and mean F1 values at least 4-5% higher than those calculated using the label data as a reference. They concluded that the CNN model generally distinguished major land cover classes correctly despite inaccuracies in the label data. Here, our focus

Comparison of U-Net CNN Model Predictions with ESRI Land Cover
A comparison of the simplified U-Net CNN model result for 2017 and simplified ESRI Land Cover is provided in Supplementary Figure S3. The ESRI map ( Figure S3b) showed more Built-up in rural areas, as well as more Crop and Grassland than the simplified U-Net CNN model ( Figure S3a). The two maps do show temporal correlation ( Figure S4). It is reiterated that both results were obtained using U-Net CNN models, although the algorithms were separately trained and used input imagery from different sources and resolution (Landsat vs. Sentinel-2). Temporal correlation ( Figure S4) between the changing extent of the built-up (R 2 = 1.0) and forest (R 2 = 0.91) classes in the mapping by our and the ESRI algorithm was considerable, though only four years overlap between datasets were available (2017-2020). Correlation for other classes is lower, e.g., R 2 = 0.38 for Water and R 2 = 0.20 for Crop.

Discussion
The accuracy of label data affects the quality of the trained classifier. In a previous study [17], 800 randomly selected pixels (100 per land cover class) were examined to validate the label data used here using high-resolution imagery. Their validation suggested that the trained model is, in fact, slightly more accurate than the label data, with the 'real' CNN model OA and mean F1 values at least 4-5% higher than those calculated using the label data as a reference. They concluded that the CNN model generally distinguished major land cover classes correctly despite inaccuracies in the label data. Here, our focus was on the spatial and temporal stability of our U-Net CNN model and its ability to map changes in major land cover classes over time.

Spatial and Temporal Stability
Confusion matrices (Tables 8 and 9) illustrate the differences in spatially determined accuracy and other metrics of the CNN model result and DEA Land Cover for 2018 compared to reference label data. The CNN exhibits greater spatial stability with overall accuracy, Kappa and weighted-mean F1 being 7-13% higher than the pixel-based algorithm and per class producer's accuracy (recall) substantially higher for the Built-up, Water, Bare and Grassland classes. However, some of these differences, particularly for the Built-up and Water classes, are likely related to a capability mismatch between the broader scale of target label data with its homogeneous classes, and pixel-based methods which operate at a finer scale, as further discussed later in this section.
The U-Net CNN model results showed good temporal stability, with 95%, 93% and 92% of pixels unchanged, on average, over one, four and eight-year periods (Table 6); with this stability mapped spatially as total changes by pixel for years 1988-2020 in Figure 13a. Some issues in land cover maps generated by the model were detected, evident from the occurrence of implausible land cover transitions. Lake George is an ephemeral shallow lake that during drier periods is partially bare or covered with sparse vegetation or grassland. Hence, the changing model interpretation of this area as Water, Bare or Grassland class is explicable. However, other classes, such as Forest (e.g., 2015 in Figure 5a), are unlikely to be correct. It was further found that the 1992 map was anomalous, with insufficient Plantation and Built-up in the model result. We interpret that this was due to the anomalous geomedian reflectance images for this year. This implies that the CNN model relies on spectral information in addition to spatial patterns.
Examination of the Canberra urban area helps with understanding the spatial stability of results (Figure 8b), in terms of both the simplified U-Net CNN model (e.g., Figure S2b) and simplified DEA Land Cover (e.g., Figure S2c). Canberra, locally known as the 'bush capital', includes considerable forest and grassland within the urban green space and remnant vegetation, with a total canopy cover of about 23% within its urban footprint according to a 2020 LiDAR study [45]. This helps to explain the DEA Land Cover results, which focus on a pixel-level determination of land cover. This pixel-based classification may be correct but fails to detect the broader scale spatial definition of urban area, which was better identified by the U-Net CNN model. To a degree, this difference appears to relate to a contrast in the effective scale of mapping between pixel-based and CNN mapping. Making use of spatial context, the CNN creates mapping units that are greater than one pixel (i.e., 625 m 2 ). In contrast, the DEA method does not consider spatial context and strictly classifies individual pixels. These methodological differences cause implicit differences in scale and definition that need not always be inherently inconsistent. Following the example given here, there can be smaller areas of vegetation within the urban area that the CNN correctly identifies as part of the Built-up area while also correctly identified by the DEA product as Terrestrial Vegetation. Combining the pixel-based and CNN-based classification results could help to reconcile much of this ambiguity and create a new level of mapping sophistication (e.g., Terrestrial Vegetation within Built-up areas).
Studies of the effects of spatial resolution and internal land cover class variability on classification accuracy (e.g., [46,47]) show that using lower spatial resolution imagery or smoothing or texture filters can lead to increased classification accuracy. Similarly, object-based classification approaches [48,49] can result in more generalised and visually acceptable depictions of land cover classes than pixel-based methods. Broadly, these results appear to be analogous to what was observed in this study concerning the contrast in the effective scale of mapping between pixel-based methods (operating at higher resolution) and the CNN (aggregating pixels to create larger mapping units). This is particularly true for land cover classes comprising different surface covers, such as urban areas.
Linear features such as roads are an interesting case as the road surface itself is often less than 25 m wide and, therefore, may not be mapped by pixel-based methods or else mapped as Forest or Grassland (reflecting roadside vegetation). In contrast to pixel-based approaches, the U-Net CNN model is better at correctly mapping these features as Built-up using label data and the signature of roadside verges and parallel tree rows in Landsat input images. The CNN model's ability to detect rivers and roads correctly seems in part to rely on the label data, which includes a cadastral representation of these linear features and these typically contain publicly owned river corridors and road easements, respectively. By contrast, the DEA mapping relies entirely on the interpretation of the Landsat geomedian image, which can lead to misclassification, perhaps due to the presence of riparian or roadside vegetation, respectively (e.g., Figure 11). The use of pre-or post-processing might improve these results for pixel-based methods.
Nonetheless, the U-Net CNN showed some issues of its own for linear features and unit boundaries. As noted by [13], CNN are prone to producing rounded edges and blunted corners for mapped land cover classes and sometimes show an undesirable loss of object detail. Some studies have reduced these issues through modifications to the CNN model methodology. An interesting area for future work would be the addition of the temporal dimension to model building through the application of a temporal CNN [50] or by using transformers [51] in combination with the U-Net CNN (e.g., [52]). The latter can potentially improve, among others, the extraction of linear features such as roads [53].

Mapping Changes in Major Land Cover Classes over Time
Eight land cover classes were mapped annually for the Canberra tile for the 34 years of 1987-2020. Pixel statistics on the distribution of land cover classes over time were generated (Figures 4c and 12a). These results generally show systematic and explicable changes in major land cover classes. They include a gradual increase in Built-up area, mainly reflecting the urbanisation of Canberra, an increase in Forest and a decline in Grassland over the same period. In terms of the Water class, results are more variable, e.g., with lower rainfall during the Millennium Drought (1997-2009) visible in the extent of that class. Plantation extent gradually increased until 2003, when bushfires west of the city removed approximately 38% of this class before a gradual recovery over the remaining years to 2020. Results for the Bare class are variable with no apparent pattern and may reflect real changes in clearing and urban development over time. Results for the Horticulture and Crop classes are not considered reliable, however. For these classes, the results are highly variable during 1987-2020, which may indicate that the CNN was not able to detect these land cover classes accurately from the Landsat input data. Because the imagery represents a single annual geomedian, it cannot capture changes in reflectance associated with the seasonal phenology of cultivated vegetation. Sub-annual reflectance information might help improve the distinction between Grassland, Horticulture, and Crop.
Summarising, our U-Net CNN model proved useful for broad-scale land cover mapping over time, with several advantages over pixel-based mapping methods. U-Net CNN models operate at a broader spatial scale producing more stable and consistent results for the segmentation of images into land cover units, generally without 'salt and pepper' issues, as also noted by [10,20]. By contrast, the use of DEA Land Cover for accurate estimation of total urban area or grassland/crop and their change over time did not appear to be feasible. Both methods each had their own strengths and weaknesses, but overall, a CNN rather than pixel-based mapping approach might be needed to produce sufficient spatial and temporal stability for applications such as environmental accounting [54,55] and other land cover monitoring activities to support natural resource management (e.g., [56]).

Conclusions
A U-Net CNN model was applied to annual Landsat geomedian input data for 1987-2020 to generate a series of land cover maps that are more stable and consistent over several decades compared to pixel-based approaches. Although the CNN had some issues with linear features, rounded class boundaries and a reduction in object detail that is not always desirable, the U-Net CNN model results showed better spatial coherence than results generated using pixel-based methods such as random forests and produced a more spatially and temporally consistent interpretation of the landscape at the regional scale. The U-Net CNN model operated at a coarser, more aggregated level than the individual pixel, allowing it to detect contextual information to generate a more stable spatial distribution of land cover classes through time, whereas pixel-based methods are better able to detect fine-scale variations within inhomogeneous classes, e.g., vegetation within urban areas. The U-Net CNN used for land cover mapping also appeared to be more temporally stable than pixel-based mapping, with fewer class changes (24% vs. 70% of pixels) over 33 years .
In summary, the U-Net CNN model appears to be capable of broad-scale land cover mapping over time, with several advantages over pixel-based mapping methods. The spatial transferability of the methods described here to other regions within Australia needs to be tested and is a subject for future research.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.

Conflicts of Interest:
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.