Semantic Segmentation of Sentinel-2 Imagery for Mapping Irrigation Center Pivots

Abstract: Estimating the number and size of irrigation center pivot systems (CPS) from remotely sensed data, using artificial intelligence (AI), is a potential information source for assessing agricultural water use. In this study, we identified two technical challenges in the neural-network-based classification: Firstly, an effective reduction of the feature space of the remote sensing data is required to shorten training times and increase classification accuracy. Secondly, the geographical transferability of the AI algorithms is a pressing issue if AI is to replace human mapping efforts one day. Therefore, we trained the semantic image segmentation algorithm U-NET on four spectral channels (U-NET SPECS) and the first three principal components (U-NET principal component analysis (PCA)) of ESA/Copernicus Sentinel-2 images on a study area in Texas, USA, and assessed the geographic transferability of the trained models to two other sites: the Duero basin, in Spain, and South Africa. U-NET SPECS outperformed U-NET PCA at all three study areas, with the highest f1-score at Texas (0.87, U-NET PCA: 0.83), and a value of 0.68 (U-NET PCA: 0.43) in South Africa. At the Duero, both models showed poor classification accuracy (f1-score U-NET PCA: 0.08; U-NET SPECS: 0.16) and segmentation quality, which was particularly evident in the incomplete representation of the center pivot geometries. In South Africa and at the Duero site, a high rate of false positives and false negatives was observed, which made the model less useful, especially at the Duero test site. Thus, geographical invariance is not an inherent model property and seems to be mainly driven by the complexity of land-use patterns. We do not consider PCA a suitable spectral dimensionality reduction measure in this context. However, shorter training times and a more stable training process indicate promising prospects for reducing computational burdens. We therefore conclude that effective dimensionality reduction and geographic transferability are important prospects for further research towards the operational usage of deep learning algorithms, not only regarding the mapping of CPS.


Introduction
Agriculture is the largest consumer of fresh water at the global scale. In 2008, Wisser et al. [1] reported that 70% of the surface and ground water resources were used for agricultural purposes. While this figure was confirmed in more recent studies [2,3], the number is set to increase further due to global population growth and an increasing demand for food and biomass [4,5]. At the same time, an intensified competition for the allocation of water resources due to anthropogenic climate change, increased water withdrawal from industry and urban consumers, and the uninterrupted growth of urban areas is projected [6]. Agricultural water demand is mainly driven by irrigation [5]. Here, we define irrigation as the temporary or continuous supply of water to crops. Such water supply can either compensate for fluctuations in precipitation and thus reduce inter-annual yield variations [7,8] of using the widespread principal component analysis (PCA) approach for spectral dimensionality reduction [29], on U-NET CPS classification and segmentation accuracy.
Consequently, the objectives of this work were twofold: Firstly, we compared the classification and segmentation accuracy of U-NET trained on four spectral Sentinel-2 bands to U-NET trained on the first three principal components of Sentinel-2, using a CPS dataset from Northern Texas, USA. Secondly, we assessed how well these two models performed when transferred to geographically contrasting areas: South Africa and the Duero basin in Spain. Based on these objectives we aligned the structure of the paper. Following a description of the study areas and the CPS datasets, we explain the processing of the Sentinel-2 imagery and outline the training data generation and model training process in Section 2. In Section 3, we present the results of the semantic segmentation approach and compare the performance of the two U-NET implementations and assess their geographic transferability, which is discussed in Section 4.

Study Areas
We chose three different study areas where CPSs can be found. In all study areas, CPSs represent a substantial part of the irrigated area. The areas differ in terms of climatic conditions, land-use systems, and natural vegetation, but, in each of the areas, water resources are either non-renewable or scarce, or will probably become so in the future. The three areas and the mapped CPSs are shown in Figure 1, as well as their position on the world map. The Texas study area (Figure 1A) encompasses 1714.54 km² and is part of the High Plains region, which is characterized by a climatic water deficit, i.e., potential evaporation exceeds actual evaporation [30]. Since the amount of rainfall (mainly during the summer months) would only allow for extensive dryland agriculture, the region experienced a rapid expansion of irrigated agricultural land that replaced the natural prairie grassland vegetation [31]. The water is mostly (>80%) withdrawn from the High Plains Aquifer, for which a median relative decrease in recharge of −10% for 2050 relative to 1990 conditions is reported under climate change projections [32]. At the same time, the aquifer is subject to intensive depletion, resulting in a decline in the saturated layer of the ground water table [33] and reduced water quality due to pollutants from intensive agricultural activities [34]. Mapping the extent of irrigated agriculture is particularly important to implement and monitor groundwater policies [35].
In the Spanish study area (Figure 1B), groundwater resources play less of a role, since the water is mostly taken from the Duero River: The Duero and its tributaries form the largest catchment on the Iberian Peninsula. Precipitation (1961-1990 reference period) amounts to around 625 mm/yr on multi-annual average [36] and is mostly recorded during the winter months due to the Mediterranean climate. Between 70 and 75% of the available water resources in the Duero catchment are used for growing crops (mostly wheat and maize), with between 6476 and 7646 m³ of water per hectare and year needed for irrigation [37,38]. Around 45% of the entire study area (4129.57 km²) is permanently irrigated according to the latest (2018) CORINE land-use/cover data. This is in line with findings by Lopez-Gunn et al. [39], who reported that up to 60% of the agricultural production in Spain is due to irrigation. CPSs account for parts of these numbers but are intertwined with other land-use patterns, such as roads, water courses, or smaller built-up areas, which makes this area the most complex one. Since Ceballos et al. [40] observed an increase in the inter-annual variability of rainfall patterns and prolonged dry spells, the Duero region is at an increased risk of drought. Furthermore, drought risks are likely to increase under current climate change projections [41]. To respond to the risk of droughts and to ensure a minimum discharge in the river as required by the European Union Water Framework Directive (2000/60/EC), mapping the extent of irrigated agriculture is essential.
While droughts and water scarcity are considered future scenarios for the Texas and Duero study areas, the debate about water scarcity and "day-zero" scenarios for urban areas is an ongoing issue in South Africa [42], where the third study area is located (Figure 1, lower right). The area (270.55 km²) is characterized by a semi-arid climate causing evaporation to exceed precipitation rates [43]. Although water resources are scarce, the expansion of irrigated agriculture was favored by water policies to support the development of rural areas [44]. The aforementioned expansion of agricultural production, the growth of urban areas, and increased rainfall variability due to climate change will most likely intensify water shortages until 2050 and increase conflicts about the allocation of water resources [45]. Moreover, it is estimated that agriculture employs up to 70% of South Africa's labor force [46], making it essential to safeguard agricultural production from a socio-economic perspective. The development of sustainable water use scenarios is therefore of great importance and requires accurate data.

Center Pivot Datasets
Using the available Sentinel-2 imagery (see Section 2.2.1), human experts mapped the CPSs in the three study areas (see also Figure 1). In total, 2219 CPSs were mapped. Descriptive statistics denoting the main characteristics of the CPSs were calculated and are summarized in Table 1. Besides the area of the CPS, we calculated the Compactness Index (CI) [47] of the individual CPS geometries. The CI is defined as the ratio of the area of a geometry, $A_G$, to the area, $A_C$, of a circle with the same perimeter:

$$CI = \frac{A_G}{A_C}$$

The index takes values between 0 and 1, with values close to one indicating an almost perfect circularity of the geometry under consideration.
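As a worked illustration of the index: a circle with perimeter P has the area P²/(4π), so the CI can be computed from the area and perimeter of a mapped geometry alone. The following minimal Python sketch assumes that both quantities are already available in consistent units; the function name `compactness_index` is chosen here for illustration and is not part of the original workflow.

```python
import math


def compactness_index(area: float, perimeter: float) -> float:
    """Ratio of a geometry's area to the area of a circle with equal perimeter."""
    if area < 0 or perimeter <= 0:
        raise ValueError("area must be non-negative and perimeter positive")
    # A circle with perimeter P has radius P / (2*pi) and area P**2 / (4*pi).
    circle_area = perimeter ** 2 / (4.0 * math.pi)
    return area / circle_area


# A perfectly circular pivot (radius 415 m, an illustrative value) yields CI = 1.
r = 415.0
ci_circle = compactness_index(math.pi * r ** 2, 2.0 * math.pi * r)

# A square field with 800 m side length yields CI = pi / 4.
ci_square = compactness_index(800.0 ** 2, 4 * 800.0)
```

Any deviation from a circular outline, such as a pivot clipped by a road or a field boundary, lowers the value below one.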
As Table 1 shows, most CPSs are in the Texas study area (1208), which is why it was selected for training the U-NET algorithm. The CPSs there are by far the largest (average size 541,675.9 m²), and most of them are perfectly circular (average CI is 0.99). The second largest number of CPSs is found in the Duero site (615), but they only occupy a small part of the rather large study area (124.4 of 4129.6 km²) and are spatially sparsely scattered (see Figure 1). The CPSs there are smallest on average (202,323.7 m²) and more often show non-perfect circularity, resulting in a mean CI of 0.91 and a standard deviation of CI of 0.12. This is three times higher than in the other two regions. In the South African test site, only 396 CPSs were mapped, since it is also the smallest site (270.6 km²), but their shape is comparable to the CPSs in Texas. Although they are, on average, smaller than the CPSs in Texas (314,949.2 m²), they have a similarly high CI (on average 0.98) and the same low standard deviation (0.04).

Sentinel-2 Data Preparation
We acquired three cloud-free (cloud cover < 5%) Sentinel-2 scenes in L1C processing level from Copernicus Open Access Hub (https://scihub.copernicus.eu/dhus/#/home) covering each of the three study areas (see Section 2.1.1). The Sentinel-2 mission comprises two identical satellites (Sentinel-2A and 2B), which are equipped with an optical multispectral sensor comprising 13 channels from visible (440 nm) to near infrared (2200 nm). The spatial resolution is 10, 20, or 60 m depending on the channel [48]. Due to the 180° offset orbits, the Sentinel-2 mission provides an average of one image of a point on Earth every five days, making it very suitable for agricultural applications [49]. The L1C processing level implies that the data were already geometrically rectified and projected into a planar coordinate system, but do not yet contain bottom-of-atmosphere reflectance values, which are required for assessing crop growth conditions. Table 2 shows the main properties of the scenes including the platform, Sentinel-2 granule, and the acquisition date. In the case of the Spanish Duero site, it was necessary to mosaic two Sentinel-2 granules (30TUM and 30TUL). The Sentinel-2 scenes listed in Table 2 were converted to surface reflectance images (L2A processing level) using the atmospheric radiative transfer model MODTRAN [50], applying an interrogation technique developed by Verhoef and Bach [51]. Only the Sentinel-2 bands 2, 3, 4, 5, 6, 7, 8A, 11, and 12 were retained, since these bands provide relevant information about land surface properties [52]. Band 8 was not used because of its coarser spectral resolution (compared to band 8A). All of these spectral bands were resampled to a spatial resolution of 10 m. For each scene listed in Table 2, we created two datasets:
• One dataset containing the Sentinel-2 spectral bands 2 (blue), 3 (green), 4 (red), and 5 (NIR1), in accordance with a study undertaken by Saraiva et al. [14], who also used this spectral combination;
• One dataset containing the first three principal components calculated from the nine Sentinel-2 bands available.
The principal component analysis (PCA) [53] is a widely-used tool for dimensionality reduction [54][55][56]. Mathematically, PCA corresponds to a linear base transformation, whereby the original coordinate system spanned by the Sentinel-2 spectral bands is rotated towards the direction of the largest variance [57]. In detail, the original spectral bands which are partly linearly correlated are transformed into a set of orthogonal base vectors with the first base vector (first principal component) denoting the largest variance.
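The rotation described above can be sketched with NumPy as an eigendecomposition of the band covariance matrix. This is a minimal illustration under stated assumptions, not the processing chain used in the study: the function name `first_principal_components`, the array layout (rows, columns, bands), and the random input are hypothetical.

```python
import numpy as np


def first_principal_components(image: np.ndarray, n_components: int = 3) -> np.ndarray:
    """Project a (rows, cols, bands) image onto its leading principal components."""
    rows, cols, bands = image.shape
    pixels = image.reshape(-1, bands).astype(np.float64)
    # Center each band so that the covariance matrix describes the band variance.
    centered = pixels - pixels.mean(axis=0)
    covariance = np.cov(centered, rowvar=False)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    order = np.argsort(eigenvalues)[::-1][:n_components]
    # The eigenvectors form the rotated, orthogonal basis; project the pixels onto it.
    components = centered @ eigenvectors[:, order]
    return components.reshape(rows, cols, n_components)


# Synthetic stand-in for a nine-band Sentinel-2 subset (random values, not real data).
rng = np.random.default_rng(0)
scene = rng.random((64, 64, 9))
pcs = first_principal_components(scene, n_components=3)
```

The resulting components are mutually uncorrelated and ordered by decreasing variance, which is the property exploited when only the first three are kept.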

Training Data Generation
We used the Texas study area (see Figure 1 and Table 1), due to its large number of CPSs, for the generation of two training datasets for U-NET, as outlined in the previous paragraph. The workflow for generating the training data closely follows the approach proposed by Saraiva et al. [16] and is shown in Figure 2: First, we clipped the input imagery into patches of 128 by 128 pixels (1.28 by 1.28 km), using a moving window that was shifted 64 pixels in x and y direction to obtain overlapping image chips. The same procedure was applied to the rasterized representations of the CPS data (using 10 m spatial resolution) that served as labels. This allowed for enlarging the number of training samples and produced different representations of CPSs, to enhance the generalization capacity of the U-NET models. We retained only those image patches that had at least a single pixel corresponding to the manually mapped CPSs (Figure 2A). Second, we split the data into training, testing, and validation datasets. Two-thirds of the study area were assigned as training area and all image patches located within this spatial subset were used for training U-NET. The testing area, covering one sixth of the entire study area, was selected to test the generalization performance of the U-NET network after each training epoch (see Section 2.2.3), whereas the samples located in the validation area were used to assess the classification accuracy and segmentation quality of the network after the training was finished.

Figure 2. In (A), the Sentinel-2 data preprocessing is shown, whereas (B) accounts for the generation of training data and the model training process. The best performing model in terms of smallest testing loss is used to assess the geographic transferability in terms of classification accuracy and segmentation quality (C) for each of the two U-NET instances.
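The moving-window clipping described in this section can be sketched as follows. The sketch assumes that the imagery is held as a (rows, columns, bands) NumPy array and the rasterized CPS labels as a binary (rows, columns) array; the function name `extract_patches` and the synthetic inputs are hypothetical and do not stem from the original code.

```python
import numpy as np


def extract_patches(image, labels, size=128, stride=64):
    """Cut overlapping patches and keep those containing at least one label pixel."""
    patches, masks = [], []
    rows, cols = labels.shape
    for top in range(0, rows - size + 1, stride):
        for left in range(0, cols - size + 1, stride):
            mask = labels[top:top + size, left:left + size]
            if mask.any():  # discard patches without any center pivot pixel
                patches.append(image[top:top + size, left:left + size, :])
                masks.append(mask)
    return np.array(patches), np.array(masks)


# Synthetic 256 x 256 scene with four bands and one square "pivot" label.
image = np.zeros((256, 256, 4), dtype=np.float32)
labels = np.zeros((256, 256), dtype=np.uint8)
labels[70:100, 70:100] = 1
patches, masks = extract_patches(image, labels)
```

Because the window is shifted by half the patch size, the labeled square in this example falls into four overlapping patches, each showing it at a different position.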
In the case of the dataset consisting of the four spectral channels, the patches were normalized using z-standardization. The z-standardization uses the mean reflectance, µ, in band i, and its standard deviation, σ, to obtain normalized reflectance values, $Z_i$, with µ = 0 and σ = 1 from the original pixel values, $x_i$:

$$Z_i = \frac{x_i - \mu}{\sigma}$$

We employed data augmentation to further increase the number of training samples available: All training samples were flipped vertically and horizontally, as well as rotated by 90, 180, and 270 degrees. Not only was the number of training samples increased by a factor of six, but the rotation and flipping invariance was also improved, as networks based on convolution kernels are by default not invariant to flipping and rotation operations [58,59].
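Both steps can be sketched with NumPy. Whether the statistics were computed per patch or over the whole scene is not stated here, so the per-patch, per-band variant below is an assumption; the function names `z_standardize` and `augment` are likewise illustrative only.

```python
import numpy as np


def z_standardize(patch: np.ndarray) -> np.ndarray:
    """Scale each band of a (rows, cols, bands) patch to zero mean and unit variance."""
    mean = patch.mean(axis=(0, 1), keepdims=True)
    std = patch.std(axis=(0, 1), keepdims=True)
    # Guard against constant bands, which would otherwise cause a division by zero.
    return (patch - mean) / np.where(std > 0, std, 1.0)


def augment(patch: np.ndarray) -> list:
    """Return the patch plus its two flips and three rotations (six samples)."""
    return [
        patch,
        np.flipud(patch),      # vertical flip
        np.fliplr(patch),      # horizontal flip
        np.rot90(patch, k=1),  # rotation by 90 degrees
        np.rot90(patch, k=2),  # rotation by 180 degrees
        np.rot90(patch, k=3),  # rotation by 270 degrees
    ]


# Random stand-in for a four-band 128 x 128 patch (not real reflectance data).
rng = np.random.default_rng(1)
patch = rng.random((128, 128, 4))
z = z_standardize(patch)
samples = augment(z)
```

The same flips and rotations have to be applied to the corresponding label patch so that image and label stay aligned.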
Overall, 10896 samples (Sentinel-2 patches plus labels) were available for training, 309 for testing and 472 for validation purposes from the Texas study area for each of the two datasets (Sentinel-2 bands 2 to 5 and first three principal components). We validated the two trained models on the Texas site, and then assessed the geographic transferability of the U-NET instances based on classification accuracy and segmentation quality (see Section 2.2.4 and Figure 2C). In South Africa, 51 samples, and in the Duero study area, 303 samples were available for performing these tasks.

U-NET Architecture and Training
A U-NET network was trained for both datasets, which for simplicity are referred to as U-NET SPECS for U-NET trained on the Sentinel-2 channels 2 to 5 and as U-NET PCA for U-NET trained on the first three principal components. The only difference in architecture between the two networks is the number of input channels, which is 4 (U-NET SPECS) and 3 (U-NET PCA). We implemented the network using a TensorFlow-based version of the U-NET algorithm [60] coded in Python (Version 3.7).
The architecture used is summarized in Table 3 and builds upon the network architecture proposed by Ronneberger et al. [23] and the modifications made by Saraiva et al. [16]. U-NET has two branches: In the downsampling branch, as with almost all CNNs, convolution and max pooling operations are performed [61]. This branch, also called the contracting path, records the spectral and spatial context in the form of feature maps. The upsampling path uses up-convolutions, which are combined with the feature maps from the contracting path. Pooling operators are replaced by upsampling functions. This allows the localization information to be carried along from input to output and retains the contextual information. However, only those pixels that have full spatial context are retained from input to output. This means that the segmentation map is always smaller than the input image. For both the downsampling and the upsampling branch of U-NET, four layers were used. In each layer of the contracting (downsampling) branch, two subsequent convolutions with a 3-by-3 pixel kernel size were performed first. The kernel size of 3 was chosen as a compromise between sufficient filtering of data on the one hand and sufficient network depth on the other. The output of each convolution was passed through the Rectified Linear Unit (ReLU) activation function, which is widely recognized for image processing tasks [62]. In a second step, the output of the two convolutions was pooled using the max pooling operator with a kernel size of 2 by 2 pixels and a stride of 2 pixels. In the upsampling branch, the pooling operations were replaced by up-convolution operators. In the last layer, a 1-by-1 convolution was used to map the network output to class assignment probabilities. The number of feature channels, which allow contextual information to be preserved, was set to 32 (Table 3).
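Why the segmentation map is smaller than the input can be illustrated with simple arithmetic: every unpadded 3-by-3 convolution removes one pixel at each border, pooling halves the edge length, and up-convolution doubles it. The sketch below assumes unpadded convolutions and even feature map sizes before each pooling step, as in the original design by Ronneberger et al.; the function name `unet_output_size` is hypothetical, and the exact sizes of the network in Table 3 are not reproduced here.

```python
def unet_output_size(input_size: int, n_pool: int, kernel: int = 3, convs: int = 2) -> int:
    """Edge length of the segmentation map of a U-Net built from unpadded convolutions."""
    shrink = convs * (kernel - 1)  # pixels lost per level by the convolutions
    size = input_size
    for _ in range(n_pool):        # contracting path
        size -= shrink
        if size <= 0 or size % 2:
            raise ValueError("feature map must stay positive and even before pooling")
        size //= 2                 # 2 x 2 max pooling with a stride of 2
    size -= shrink                 # convolutions at the lowest resolution
    for _ in range(n_pool):        # expanding path
        size = size * 2 - shrink   # 2 x 2 up-convolution followed by the convolutions
    if size <= 0:
        raise ValueError("input too small for this depth")
    return size


# Configuration of the original U-Net: 572-pixel input and four pooling steps.
original_unet = unet_output_size(572, n_pool=4)
```

For the original configuration this gives an output edge length of 388 pixels, i.e., a border of 92 pixels on each side carries no prediction.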
The network weights and biases were iteratively adjusted during the training using error backpropagation. The parameters used for training are shown in Table 4. In particular, we set the training and verification batch size to 32 and trained the network for a total of 150 epochs. In each epoch, 200 training iterations (steps) were conducted. A GPU-optimized version of TensorFlow on an NVIDIA GEFORCE® GT630 graphics card under Linux Ubuntu 18.04 LTS was used for the training. To calculate the network error for updating the network weights after each batch, the cross-entropy loss (CE) cost function was used. CE considers not only the class assignment but also the probability scores. To minimize the value of the cost function, the Adam (adaptive moment estimation) [63] solver was employed with an initial learning rate of 0.001. The learning rate is adapted for each network weight individually, using an exponential moving average of the gradient and its squared representation. To prevent overfitting, we used a dropout probability of 25%, which is the chance of randomly dropping (i.e., ignoring) network connections during training stages [64].
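That CE takes the probability scores into account, and not only the resulting class, can be shown with a small example. The sketch below uses the binary form of the cross-entropy for a two-class problem (CPS versus background); this is an assumption made for brevity, and the function name `binary_cross_entropy` and the probability values are purely illustrative.

```python
import math


def binary_cross_entropy(labels, probabilities, eps=1e-12):
    """Mean cross-entropy between 0/1 labels and predicted CPS probabilities."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


labels = [1, 1, 0, 0]
# Both predictions yield the same classes at a threshold of 0.5 ...
confident = binary_cross_entropy(labels, [0.95, 0.90, 0.05, 0.10])
hesitant = binary_cross_entropy(labels, [0.55, 0.60, 0.45, 0.40])
# ... but the hesitant prediction is penalized with a higher loss.
```

Although both predictions are entirely correct in terms of class assignment, the loss of the hesitant prediction is larger, which keeps driving the weights towards more decisive probabilities.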
To determine the epoch after which the network generalized best, we calculated the value of the CE function on the testing dataset after each training epoch. We then used the network weights obtained at the epoch with the overall lowest testing data loss.
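This selection rule amounts to taking the minimum of the per-epoch testing losses. A minimal sketch, assuming the losses were recorded in a list with one entry per epoch (the function name `best_epoch` and the loss values are hypothetical):

```python
def best_epoch(testing_losses):
    """Return the 1-based epoch with the lowest testing loss and that loss."""
    if not testing_losses:
        raise ValueError("at least one epoch is required")
    index = min(range(len(testing_losses)), key=testing_losses.__getitem__)
    return index + 1, testing_losses[index]


# Hypothetical loss values for five epochs (not taken from the study).
losses = [0.61, 0.42, 0.35, 0.38, 0.40]
epoch, loss = best_epoch(losses)
```

In practice, the weights are stored after every epoch so that the state belonging to the selected epoch can be restored afterwards.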

Validation Strategies
Since U-NET performs both segmentation and classification, accuracy metrics are required to evaluate the capacity of the trained model to reproduce reference data not shown to the network during training. While segmentation quality is qualitatively examined by visual inspection of U-NET results against CPS reference geometries, pixel-by-pixel classification accuracy is checked by using widely accepted metrics of binary classification evaluation. These are listed in Table 5 together with the formula behind them and the meaning of the metric.

Table 5. Pixel-based measures of binary classification accuracy, compiled from Sokolova and Lapalme [65] and Kohl [66], used for assessing U-NET classification performance. TP denotes the number of true positive; FP, false positive; TN, true negative; and FN, false negative class assignments. (Columns: Metric, Formula, Meaning; the first row is the Accuracy Score.)

In addition, the receiver operator characteristic (ROC) curve was used, which graphically represents the capability of a binary classifier using different discrimination probabilities for class assignment [67]. For drawing the curve, the true positive rate on the ordinate axis is plotted against the false positive rate on the abscissa. The steeper the ROC curve rises (i.e., high true positive rate and low false positive rate for low discrimination thresholds), the better a classifier is. For quantifying the characteristics of the U-NET models, the area under the ROC curve, often referred to as "Area Under the Curve" (AUC) [68], which is equal to one in the case of a perfect classifier, was provided as an additional measure. We selected these metrics because they are often used in the evaluation of binary classifiers [69,70] and give a quick overview of the predictive capacity of the classifier.
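For orientation, the accuracy score, precision, recall, and f1-score reported later can be computed from the four counts TP, FP, TN, and FN. The sketch below uses the standard textbook definitions, which are assumed to match those of Table 5; the function name `classification_metrics` and the pixel counts are hypothetical.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary classification scores from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical pixel counts, chosen only to illustrate the arithmetic.
scores = classification_metrics(tp=80, fp=20, tn=890, fn=10)
```

Since the accuracy score also counts true negatives, it is dominated by the background class when CPS pixels are rare, whereas precision, recall, and f1-score refer to the CPS class only.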

U-NET Training
The results of training the two models over a total of 150 epochs on the Texas study area are shown in Figure 3. Based on the value of the training loss, the network weights for the final U-NET were determined. In the case of U-NET PCA, a global minimum training loss was reached after 83 epochs, so the network weights were used as they were after this epoch. In the case of U-NET SPECS, this minimum was reached after 45 epochs.

Pixel-Based Error Metrics
The pixel-based metrics (see Table 5) are listed in Table 6 for both U-NET models and all three study areas. The better performing model per study area is marked in green; red indicates worse performance. Orange cells highlight that both models achieve the same score. Both U-NET models showed the highest values for precision, recall, f1-score, and AUC in Texas, where the model training was conducted (see Section 3.1). The results of the other two study areas used for assessing the geographical transferability of the approach indicate a lower classification accuracy of the two models, with the Duero study area clearly revealing the lowest values in relation to the f1-score (0.08 for U-NET PCA and 0.16 for U-NET SPECS). The South African study area occupies a medium position, in relative terms. The transfer to other geographical areas thus shows a decrease in classification quality, but with differences between the models.

Table 6. Pixel-based metrics of classification accuracy for the CPS class, including the accuracy score, precision, recall, f1-score, and the area under the curve (AUC). For each study area, the results of the two U-NET implementations are shown.

In detail, the predictions made by the U-NET models in Texas are of high precision for U-NET PCA (0.91) and U-NET SPECS (0.85). The recall for U-NET SPECS is also high, at 0.89, but lower for U-NET PCA (0.76). Accordingly, U-NET SPECS has the higher f1-score (0.87 to 0.83). The AUC value is also slightly higher for U-NET SPECS (0.88 to 0.84). The same applies to the accuracy score (0.88 to 0.83).
In the Duero Study area, U-NET SPECS also has a higher accuracy score than U-NET PCA (0.94 to 0.64, respectively). The precision score is very low for both models (U-NET PCA: 0.04, U-NET SPECS: 0.16). While the recall is also very low for U-NET SPECS (0.17), it is much higher for U-NET PCA (0.50). As a result, the f1-score is low for both models, and it is lowest with U-NET PCA at 0.08 (U-NET SPECS: 0.16). As the AUC value of 0.57 for both models indicates, the U-NET models performed only slightly better than a random classifier.
In the South African study area, where the overall model performance was higher than in the Duero area, but lower than in Texas, all metrics indicate a higher performance of U-NET SPECS. U-NET SPECS has the higher accuracy score (0.73 to 0.57), precision (0.77 to 0.58), recall (0.61 to 0.35), and consequently f1-score (0.68 to 0.43). The same applies to the AUC (0.72 to 0.56). The AUC value of 0.56 in the case of U-NET PCA also represents the lowest value among all three study areas.
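The f1-scores quoted in this section can be cross-checked against the quoted precision and recall values, since the f1-score is their harmonic mean. The sketch below uses only numbers stated in the text; because these are rounded to two decimals, agreement can only be expected to within about 0.01. The function name `f1_score` is chosen for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported f1-score) as quoted in the text per site and model.
reported = {
    ("Texas", "U-NET PCA"): (0.91, 0.76, 0.83),
    ("Texas", "U-NET SPECS"): (0.85, 0.89, 0.87),
    ("Duero", "U-NET PCA"): (0.04, 0.50, 0.08),
    ("Duero", "U-NET SPECS"): (0.16, 0.17, 0.16),
    ("South Africa", "U-NET PCA"): (0.58, 0.35, 0.43),
    ("South Africa", "U-NET SPECS"): (0.77, 0.61, 0.68),
}

# Absolute difference between the recomputed and the reported f1-score.
deviations = {
    key: abs(f1_score(p, r) - f1) for key, (p, r, f1) in reported.items()
}
```

All six recomputed values stay within the rounding tolerance of the reported scores.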

Segmentation Results
In addition to the pixel-based metrics, the segmentation quality was determined by visual comparison with the CPS reference geometries. Segmentation quality refers to the capacity of the model to reproduce the manually mapped center pivot geometries. As with the quantitative evaluation of the classification quality (see Section 3.2), the qualitative, visual examination of the results shows a decrease in the segmentation quality from Texas, over South Africa, to the Duero region, where the results revealed an extremely low performance of both models.

Texas Study Area
The results from U-NET PCA are shown on the left side of Figure 4 and the results from U-NET SPECS on the right side. U-NET PCA reproduced the smaller CPSs, in particular, with high quality, and it had only occasional omissions. However, the larger CPSs, which are located centrally in the map, were only fragmentarily mapped by U-NET. In the western part of the area, which has no CPSs, some false positives can be found.
In the case of U-NET SPECS, the misclassifications in the western area appear more spatially distributed, i.e., speckle-like and not organized into larger spatial clusters (Figure 4b). In addition, not all CPSs were reproduced in their circular form, and a few CPSs were not detected by the algorithm. The larger CPSs also reveal classification and segmentation problems (e.g., larger parts of the center pivots were not assigned to the center pivot class by the model), but at least individual segments of the circle were correctly recognized.

Duero Study Area
For a part of the Duero area, the results of the two models are shown in Figure 5, analogous to Figure 4. The poor segmentation quality of U-NET PCA is clearly visible in Figure 5a, which shows pixels assigned as CPSs in large, contiguous areas. This is especially the case in the northern part of the area, but not limited to it. Some of these areas correspond to CPSs in the reference, but the geometries resulting from the segmentation have little in common with the actual CPS geometries.
A completely different picture emerges from the results of U-NET SPECS (Figure 5b): The number of pixels classified as CPSs is significantly smaller than in U-NET PCA. The segmented areas are spatially more separated from each other, but often do not correspond to the actual CPS geometries. Only a very small part of the CPSs (in the north-western part) was successfully segmented.

South Africa Study Area
Finally, the results for the South African study area can be found in Figure 6, where again, as in the Duero study area (Figure 5), U-NET PCA tends to misclassify large, connected areas. Only a small part of the reference CPSs is reproduced with high segmentation quality; many CPSs remain undetected or are covered by only a few pixels, which take up a very small part of the actual area of the reference geometries (small intersection over union). U-NET SPECS (Figure 6b) did not provide completely accurate segmentation and classification results either, since not all CPSs were found and not all objects were completely segmented. Nevertheless, many segments correspond to the reference geometries and the number of large-area misclassifications is lower and less dominant in visual inspection than with U-NET PCA (Figure 6a).
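The intersection over union (IoU) mentioned above relates the overlap of predicted and reference pixels to their combined extent. The segmentation quality in this study was assessed visually, so the following NumPy sketch only illustrates the concept; the function name `intersection_over_union` and the synthetic masks are hypothetical.

```python
import numpy as np


def intersection_over_union(prediction: np.ndarray, reference: np.ndarray) -> float:
    """IoU of two binary masks; defined here as 0 when both masks are empty."""
    prediction = prediction.astype(bool)
    reference = reference.astype(bool)
    union = np.logical_or(prediction, reference).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(prediction, reference).sum() / union)


# Reference pivot drawn as a disc; the prediction covers only a small inner block.
yy, xx = np.mgrid[:64, :64]
reference = (yy - 32) ** 2 + (xx - 32) ** 2 <= 20 ** 2
prediction = np.zeros_like(reference)
prediction[30:34, 30:34] = True
iou = intersection_over_union(prediction, reference)
```

A pivot that is hit by only a handful of pixels, as in this example, yields an IoU close to zero even though every predicted pixel is correct.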

Center Pivot Classification and Segmentation
In terms of classification accuracy (Table 6) and segmentation quality (Figures 4-6), U-NET SPECS mostly outperformed U-NET PCA. The segmentation and classification accuracy obtained in Texas is comparable to the results achieved by Saraiva et al. [16] in Brazil when considering both models. Although the quality in South Africa is higher than in the Duero area, in both areas the two models do not reach the accuracy and quality achieved in Texas.
Both models have the same receptive field and neural architecture, so the differences can likely be explained by the spectral information used. U-NET SPECS uses only a relatively small portion of the spectral information actually available from Sentinel-2 (VIS and NIR), whereas the first three principal components in U-NET PCA have significantly reduced the Sentinel-2 spectral feature space. We suppose that the contextual information resulting from the principal components is less suitable than the original Sentinel-2 channels used in U-NET SPECS to separate CPSs from other structures. The principal components likely tend to show spectral differences in local neighborhoods, which are less indicative of the presence of CPSs than, e.g., differences in cultivation practices (especially irrigation in all forms). Moreover, PCA might emphasize site-specific characteristics such as differences in soil type and plant physiology. The spectral feature space reduction performed may therefore have highlighted non-discriminatory differences that are of secondary importance for the segmentation of CPSs. We speculate that while feature space reduction by PCA has reduced spectral redundancies, it has not necessarily contributed to complexity reduction. From this we conclude that, in addition to the integration of spectral attributes, spatial information should be included in the feature space reduction to achieve an effective complexity reduction that boosts classification accuracy and segmentation quality. Mathematical concepts like the "core tensor" proposed by López et al. [71] could therefore be a promising approach for effective feature space and complexity reduction. Nevertheless, we argue that such reductions are necessary due to reduced computing times and the more stable training behavior of U-NET PCA compared to U-NET SPECS (see Figure 3). However, it should be noted that the experiments could not be repeated more often due to time constraints, so the results should be interpreted with caution.
It is noticeable that smaller CPSs are generally delimited with higher accuracy. Since local texture and neighborhood information within the receptive field are crucial for pixel-by-pixel classification, the information processing by U-NET differs from the approach of human experts, which might explain why smaller CPSs are easier to detect for U-NET. Human experts tend to orient themselves more towards the "Gestalt" principles [72] in the evaluation (i.e., perception) of visual information and use these concepts intuitively in the delimitation of objects [73]. Here, larger spatial relationships in the form of edges and color differences are more important than texture in relatively small foci. Human experts tend to perform object extraction rather than wall-to-wall classification, as is the case with U-NET. Since the receptive field of U-NET in this study is relatively small (128 by 128 pixels; i.e., 1.28 by 1.28 km) this may not be large enough to segment larger CPSs with sufficient quality. For smaller CPS, which are also less likely to show internal heterogeneities (e.g. due to sectoral differences in crop types or irrigation management), the limited neighborhood information seems to be sufficient. The importance of local texture and neighborhood can also explain why larger, contiguous areas are misclassified, as is the case with U-NET PCA, since edges and spatial arrangement of objects are less decisive.
This point coincides with an observation by Li et al. [74]: Using the example of land-sea classification tasks, the authors showed that U-NET had problems capturing complex connectivity patterns and producing coherent, accurate segmentation results. In particular, the authors found that the number of convolution layers was too low to capture the inherent complexity of land-sea arrangements. This could also be the case in the study presented here. However, to make the network deeper, the input patches have to become larger, since the current size of 128 by 128 pixels does not allow for more layers than those currently in use. A recent study by Yang et al. [75] seems to confirm this finding, as a deeper network architecture and a larger receptive field outperformed conventional U-NET segmentation. A deeper network architecture, however, increases the training times significantly, since more free parameters have to be determined. Whether the additional effort results in a significant increase in segmentation quality and classification accuracy for CPSs is a prospect for further research.

Geographic Transferability
The best results in terms of classification accuracy and segmentation quality were achieved in Texas, which is to be expected since the models were trained in this region. At the Spanish Duero and the South African test site, the models showed very different results, with both models providing hardly any useful results in Spain. U-NET SPECS performed better than U-NET PCA in almost all cases (see Section 4.1). It follows that geographical transferability of the models is not an inherent characteristic. Partly, the lack of geographical invariance results from the properties of convolutional networks, which are shift and conditionally scale invariant, but cannot deal a priori with geometric distortions and illumination differences [58,76].
Since the CPSs in the three study areas are not homogeneous in terms of size and compactness (see Table 1), and differences in land-use and vegetation patterns exist, part of this lack of invariance can be explained by the characteristics of CNNs. In addition, only a single Sentinel-2 scene was used per study area (Table 2). Thus, multi-temporal parameters (such as median reflectance covering, e.g., two vegetation periods) as proposed by Saraiva et al. [16] were not used. The use of single scenes also means that illumination geometries and bi-directional reflection properties of the surfaces [77] differ between the study areas. Further research could tie in at this point and use multi-temporal variables instead of mono-temporal Sentinel-2 images.
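A multi-temporal variable of the kind mentioned above can be sketched as a per-pixel median over several acquisitions. The example is a minimal illustration only: it assumes co-registered, equally sized reflectance grids of a single band, uses made-up values, and is not the procedure of Saraiva et al. [16] or of this study.

```python
from statistics import median


def median_composite(scenes):
    """Per-pixel median over a list of equally sized 2D reflectance grids.

    scenes: list of scenes, each a list of rows, each row a list of floats.
    All scenes are assumed to be co-registered and to share one shape.
    """
    rows, cols = len(scenes[0]), len(scenes[0][0])
    return [
        [median(scene[r][c] for scene in scenes) for c in range(cols)]
        for r in range(rows)
    ]


# Three hypothetical acquisitions of a 2 x 2 pixel window (values invented);
# 0.90 and 0.80 stand in for outliers such as haze or bright soil.
scenes = [
    [[0.10, 0.30], [0.20, 0.40]],
    [[0.12, 0.90], [0.22, 0.38]],
    [[0.11, 0.31], [0.80, 0.41]],
]
composite = median_composite(scenes)
print(composite)
```

The median suppresses values that occur in only one acquisition, which is one reason why such composites are less sensitive to scene-specific illumination than a single image.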
Furthermore, the CPS density differs in each area: In Texas, almost 38% of the study area is covered by CPSs, and in South Africa, as much as 45%, whereas, in the Duero, it is only 3% of the area (see Table 1). Such a low CPS share increases the chance of false positives, especially in Spain. Spain also has the most complex patterns in terms of land-use types, as agricultural land is often intersected by roads and settlements. Furthermore, only a small part of the agricultural land on the Duero is irrigated by CPSs. In Texas and South Africa, in contrast, agriculture concentrates exclusively on the CPSs, so that the other areas are largely non-arable land, which reduces the classification problem of CPSs to the recognition of arable land.
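The effect of CPS density on false positives can be made explicit with a small worked example. The area shares are the values quoted above; the recall and the false positive rate are purely hypothetical and identical for all sites, so the sketch isolates the influence of density alone and does not reproduce the results reported in this study.

```python
def precision(prevalence: float, recall: float, false_positive_rate: float) -> float:
    """Pixel-level precision for a given share of CPS pixels (prevalence)."""
    true_pos = recall * prevalence
    false_pos = false_positive_rate * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def f1_score(prevalence: float, recall: float, false_positive_rate: float) -> float:
    """Harmonic mean of precision and recall."""
    p = precision(prevalence, recall, false_positive_rate)
    return 2.0 * p * recall / (p + recall)


# CPS share of the study area as quoted in the text (see Table 1).
CPS_SHARE = {"Texas": 0.38, "South Africa": 0.45, "Duero": 0.03}

# Hypothetical classifier behaviour, assumed equal at every site.
RECALL = 0.90
FALSE_POSITIVE_RATE = 0.05

for site, share in CPS_SHARE.items():
    print(site,
          round(precision(share, RECALL, FALSE_POSITIVE_RATE), 2),
          round(f1_score(share, RECALL, FALSE_POSITIVE_RATE), 2))
```

Even with identical per-pixel error rates, the precision at a 3% CPS share is far lower than at 38% or 45%, because the few true CPS pixels are outnumbered by false detections from the large non-CPS area.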
Although the performance of the two models was lower in South Africa than in Texas, the results of U-NET SPECS in particular showed comparatively high classification accuracy (Table 6). This can be explained by the relatively similar structure of the CPSs, their spatial arrangement and comparable land-use patterns. We therefore postulate a conditional geographical transferability of at least the U-NET SPECS model. The transferability is restricted to areas with similar natural features and land-use patterns. Operationally, we propose to express such similarity through global image metrics such as the Shannon Entropy [78]. The Shannon Entropy, which originates from information theory, is a measure of the disorder of an image. The greater the degree of complexity, the higher the degree of disorder. In detail, both the Texas and the South Africa study areas have a low entropy (~1.92), whereas the Duero region revealed a much higher value (15.99). Further research efforts are necessary to test whether image entropy alone is an indicator of the geographical transferability of the models, and whether thresholds can be defined above which transferability is not given.
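For orientation, a global image entropy can be computed from the relative frequencies of the pixel values. The following is a minimal sketch under stated assumptions: it uses the base-2 logarithm and treats every distinct pixel value as one histogram bin. The quantization, band selection, and logarithm base behind the values reported above are not specified here, so the sketch is not a reproduction of them.

```python
import math
from collections import Counter


def shannon_entropy(pixels) -> float:
    """Shannon entropy in bits of an iterable of discrete pixel values."""
    counts = Counter(pixels)
    total = sum(counts.values())
    # H = sum over all occurring values of -p * log2(p)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def image_entropy(image) -> float:
    """Entropy of a 2D image given as a list of rows."""
    return shannon_entropy(value for row in image for value in row)


# Toy images with invented values: one uniform surface, one mixed pattern.
homogeneous = [[7, 7], [7, 7]]
heterogeneous = [[0, 1], [2, 3]]
print(image_entropy(homogeneous), image_entropy(heterogeneous))
```

Under these assumptions the entropy is bounded by the base-2 logarithm of the number of possible gray levels, i.e., 16 bits for data quantized to 2^16 levels, so a homogeneous scene yields values near zero and a scene in which almost all levels occur equally often approaches the upper bound.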
Of course, it would be possible to simply compensate for the missing invariances by extending the training dataset to all three study areas. However, this only shifts the problem of lacking geographical transferability, since, more or less, all areas with CPSs at the global level would then presumably have to be included, which in turn would mean a great deal of manual mapping effort. Since this is clearly contrary to the goal of keeping manual intervention to a minimum, we propose to train a minimum number of networks, each applicable to specific, similar regions in terms of land-use pattern and CPS characteristics. These could be expressed by land-use clusters. For example, in this study, two clusters could be defined: (i) CPSs located in dry climates (Texas and South Africa) and (ii) CPSs located in semi-arid regions embedded into complex agricultural patterns (Duero). Thus, a separate network would have to be trained for Spain; the network from Texas, on the other hand, could also be used operationally in South Africa. By using the Shannon Entropy and comparing other geographical factors (e.g., vegetation patterns, crop types and rotation, average field size, number of land uses), similarity between geographic regions could be expressed. The number of different networks required for global coverage would only have to be as large as the number of identified clusters. However, this clearly requires further research beyond the scope of this work.

Conclusions
We trained two U-NET models for semantic segmentation of CPSs in Texas and applied the two resulting networks to other geographic areas with CPSs: the Spanish Duero Basin and South Africa. We were able to show that the reduction of the spectral feature space by means of principal component analysis shortens computation time and stabilizes the training process, but does not increase the quality of classification and segmentation. We assume that effective dimensionality reduction should include spatial (i.e., contextual) properties, in addition to spectral attributes. Since algorithms such as U-NET are supposed to increasingly automate manual mapping, we investigated the generalizability of both U-NET models with respect to their geographical transferability. The results clearly showed that geographical invariance is not an inherent property of U-NET and the complexity of land-use patterns should not be neglected. At this point, we cannot make a proposal for a globally applicable model for segmenting CPS, but we have used the Shannon Entropy as an indication of the transferability of a model to other geographical regions. However, this clearly requires further research.
We assume that the difficulties and approaches for further research presented here are not only relevant for the mapping of CPSs from Sentinel-2 data, but also for many other applications of deep learning algorithms in remote sensing.

Funding:
This research was funded by the European Union's Horizon 2020 research and innovation programme within the project "ExtremeEarth-From Copernicus Big Data to Extreme Earth Analytics", grant number 825258.

Conflicts of Interest:
The authors declare no conflict of interest.