Object Identification in Land Parcels Using a Machine Learning Approach

Abstract: This paper introduces an AI-based approach to detect human-made objects, and changes in these, on land parcels. To this end, we used binary image classification performed by a convolutional neural network. Binary classification requires the selection of a decision boundary.


Introduction
The analysis of the "physical cover of the Earth's surface" [1], also called "land cover", is important for authorities in order to manage urbanization as well as natural resources. In this context, remote sensing combined with classification systems offers an effective way to obtain specific information about the territories under an authority's responsibility, which can be used for further decision making.
A number of articles (e.g., [2][3][4][5][6][7]) have analyzed land cover based on aerial imagery and other data sources, work that relates to both remote sensing and computer vision. There are case studies showing the efficient use of machine learning (ML) techniques in remote sensing tasks, such as object classification and change detection. However, all these studies used different approaches and datasets. In [3], for example, digital orthophotos (DOPs) in combination with Sentinel-2 images and digital elevation models (DEMs) were used, whereas [8] used Landsat images. Regarding ML techniques, convolutional neural networks (CNNs) were used in many studies (e.g., [3,9,10,11]). In [3,9], a CNN was combined with recurrent neural networks (RNNs) [10], whereas [11] combined a CNN with support vector machines (SVMs) to perform classification.
However, the ability of a system relying on aerial images to detect objects is constrained by the quality of the images, which comprises the spatial and spectral resolution and the geometrical correctness (orthorectification) [12]. A higher-quality image comes with higher costs. Hence, it makes sense to develop systems for specific tasks, in order to adjust the quality as needed to fit the relevant objects.
One such task is the maintenance of a system for the management of agricultural parcels eligible for subsidies. Based on the common agricultural policy (CAP) of the European Union (EU), every member state is required to maintain such a system for the administration of all land parcels located in its territory [13]. This system is called the Land Parcel Identification System (LPIS); it utilizes "ortho-imagery" (based on aerial or satellite images) [14] and stores the geometries and coordinates of the land parcels in a database. The LPIS database helps the member states to manage agricultural production, as well as to reach the environmental protection targets set by the EU. One common task is the detection of human-made changes, e.g., buildings, streets, wind turbines, power lines, etc., which has to be performed on a regular basis (i.e., every year). Consequently, this comes with a huge workload for the authorities, since there is no technical support involved.
In this study, we refer to an area covered by human-made obstacles as a non-eligible area (NEA), which indicates their impact on the subsidy calculations (in general, agricultural subsidies are based on the amount of eligible area within a parcel). Unfortunately, these NEAs are very small and difficult to detect with the human eye, depending on the image quality. The main issue related to the LPIS is the delineation of agricultural parcels. One might think that the detection of NEAs is done by systems focusing on parcel delineation. A number of articles (e.g., [15][16][17][18][19][20][21][22]) have focused on this problem, i.e., the detection of a parcel boundary utilizing different ML approaches, especially CNNs [15][16][17][18][19][20][21]. However, these studies focused on the outer boundaries of the parcels and paid minor attention to the objects located on the parcels (within the outer boundaries). This is because a major part of a parcel's boundary is associated with objects and areas located in the neighborhood of the parcel.
According to [22], there is only little research (e.g., [23,24]) that has focused on objects located on a parcel and their contribution to the overall delineation (inner and outer) of the parcels, which was the intended subject of the articles mentioned above. However, these studies focused on specific objects, i.e., field roads [23] or ditches and furrows [24], and did not cover the whole spectrum of NEAs (Figure 1). Additionally, none of these studies utilized neural networks.
In this study, the objective was to develop and evaluate a system that can detect new NEAs using a CNN. Therefore, none of the solutions described in the articles mentioned above were evaluated as options for the detection of new NEAs. This evaluation has been left to future work. Additionally, we neglected the detection of removed obstacles, assuming that the farmers have a high interest in reporting these types of changes themselves, because these parcels would result in higher subsidies. Therefore, it is more important to provide the authorities with a system for the detection of new NEAs. Additionally, the evaluation of the system is more straightforward this way, since one has to consider fewer uncorrelated metrics when focusing on just new NEAs instead of on new as well as removed NEAs. We claim that our approach could detect removed NEAs with some adjustments; however, we have left this evaluation for future work.
For the detection of new NEAs, we applied supervised ML. Since we cannot assume perfect prediction (accuracy = 100%), we must handle and balance two types of errors: Type I errors, i.e., false negatives, assessed by the FN-rate (false negatives over false negatives plus true positives), and Type II errors, i.e., false positives, assessed by the FP-rate (false positives over false positives plus true negatives). Based on that, we propose an algorithm to select the best decision boundary, i.e., a threshold for predicting positives vs. negatives, according to target values of the FP- and FN-rates (or their complements, the TN- and TP-rates) set by a user.
According to the specification of the LPIS [25], authorities should use orthoimages to detect NEAs on a parcel. Orthoimages are geometrically corrected ("orthorectified") [26], meaning that the images are represented as if they were captured at a nadir angle instead of an oblique one [12]. However, it could be possible to detect NEAs using aerial images that are not orthorectified. Since the orthorectification process comes with higher costs, the question arises whether it is necessary to use orthorectified aerial images for this task. Therefore, we evaluated the performance of the developed system using both orthorectified and non-orthorectified aerial images. This was done in the context of managing eligible agricultural parcels in a region in the northern part of Germany.
It is hard to compare this study with others, since few research studies in the literature have focused on NEAs on a parcel. However, the NEAs examined in this study are quite similar to the relevant objects in the related works focusing on land cover. Therefore, the results are compared to these related studies, although this comparison remains vague (Table A7, Appendix B).

Specific Case Study
In Germany, the federal states are responsible for the administration of agricultural subsidies. Therefore, they are also responsible for the maintenance of the LPIS covering the area under their authority.
To keep the LPIS database updated, the federal state officers run a parcel maintenance process (PMP) annually. During this PMP, they try to detect new human-made objects on the parcels and, if necessary, register them in the LPIS database as NEAs. The PMP is done based on DOPs, which are acquired in the year of the PMP. Since some of the objects (Figure 1) are very small, the officers use DOPs with a ground sampling distance (GSD) of 50 cm. Moreover, the officers use the spectral information in the red, green, and blue bands (RGB), as well as the near-infrared band (NIR).
The PMP is a two-step process (Figure 2). In the first step (Assessment), the officer iterates over the parcels (Parcel 1 . . . Parcel n) in the review step, in order to inspect each parcel for new NEAs. Therefore, each parcel is located in the DOP based on the geometry and coordinates registered in the LPIS. The officer gets an image that combines all relevant information for reviewing the parcel, i.e., the parcel's geometry, the geometries of all NEAs intersecting the parcel's geometry, and the DOP. Based on that, the officer assesses whether there is a NEA depicted in the DOP that matches both of the following two conditions:
• The NEA is localized within the parcel geometry.
• The NEA is not yet registered in the LPIS.
If there is such a NEA, the parcel needs an update. After this step, the updated information (U) is formally stored as U parcel 1 . . . U parcel n.
In the second step (Update), another officer iterates over the information of each of the parcels to check the need for an update (U parcel 1 . . . U parcel n) and to manually verify the updates as required. If the updated information cannot be verified, meaning that no parcel update is necessary, it is dropped. Otherwise, the officer updates the geometry and coordinates of the parcel. In both cases, the parcel information captured in the LPIS database (geometry and coordinates of the NEAs) then matches the actual situation, i.e., the information in the given DOPs (Parcel act 1 . . . Parcel act n).
Currently, there is no technical support for comparing the parcel information with the information in the corresponding DOP; the comparison is done entirely manually. Since there are, e.g., around 200,000 parcels registered in the LPIS database of Schleswig-Holstein, the first step of the update process takes a lot of human effort. According to the authorities in Schleswig-Holstein, it takes approximately two to three months of full-time work for at least three employees to complete the first step (i.e., up to nine person-months annually).
In this study, we propose an automated approach that supports the first part of the update process described above. Implemented in a system, it could automatically decide whether the given parcels need to be updated according to the DOP. To build trust in the system, we keep humans in the loop, i.e., officers can review all parcels that need to be updated according to the system's suggestion. Eventually, we aim for a reduction of the workload associated with manually reviewed parcels.
Together with the authorities, we decided that it is more important for the system to suggest necessary updates than to reject irrelevant ones. This means that reducing Type I errors is considered more important than reducing Type II errors. Hence, the target values for the classification are a true positive rate (TP-rate; suggested update was necessary, a hit) of a minimum of 90% and a true negative rate (TN-rate; unnecessary update was rejected, a correct rejection) of a minimum of 70%. This leads to a Type I error FN-rate of a maximum of 10%, and a Type II error FP-rate of a maximum of 30%.
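The relationship between these target rates and the two error types can be made concrete with a short sketch (only the rate definitions come from the text above; the parcel counts are hypothetical):

```python
def rates(tp, fn, fp, tn):
    """Derive the four rates used in the text from raw confusion-matrix counts."""
    tp_rate = tp / (tp + fn)   # hit rate; 1 - TP-rate is the Type I error (FN-rate)
    tn_rate = tn / (tn + fp)   # correct rejections; 1 - TN-rate is the Type II error (FP-rate)
    return {"TP-rate": tp_rate, "FN-rate": 1 - tp_rate,
            "TN-rate": tn_rate, "FP-rate": 1 - tn_rate}

# Hypothetical outcome for 1000 parcels, 200 of which truly need an update:
r = rates(tp=184, fn=16, fp=224, tn=576)
meets_targets = r["TP-rate"] >= 0.90 and r["TN-rate"] >= 0.70  # True for these counts
```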

Investigated Area and Data

Investigated Area
The investigated area is the federal state of Schleswig-Holstein in northern Germany (Figure 3).



Digital Orthophotos
We used six datasets of DOPs derived from aerial photos acquired at different dates in the years 2019-2022 (Table 1). Two institutions created the datasets: the Schleswig-Holstein State Office for Surveying and Geoinformation (LVGSH, Landesamt für Vermessung und Geoinformation Schleswig-Holstein), 24106 Kiel, Germany, and EFTAS Remote Sensing Technology Transfer GmbH (EFTAS, EFTAS Fernerkundung Technologietransfer GmbH), 48145 Münster, Germany. The DOPs were obtained with different quality and coverage. Here, the term quality refers to whether they are orthorectified. Coverage refers to the covered share of the land area of the federal state of Schleswig-Holstein (Figure 3b). All parcels and NEAs are stored with associated parcel information in the LPIS database. In the database, there are different versions of the parcel information, which are harmonized with the DOPs in a preprocessing step, as seen in the first step in Figure 2 (DOP with parcel and NEA geometries). Therefore, we used the information from the review processes the authorities had performed in the past to associate each version of the parcel information with the DOP dataset used in the review process. As a result, we collected a specific number of parcels for each DOP dataset, as shown in Table A3, Appendix B.

Approach and Workflow
According to the PMP described in Section 2.1, the goal was to reduce the human workload by developing a system that can detect NEAs which are not yet registered in the LPIS database. Since the current review process iterates over all parcels (Parcel 1 . . . Parcel n), the solution was integrated into this loop. This is why the whole workflow iterates over each parcel (Figure 4).


Figure 4 (caption fragment): Finally (Parcel aggregation/verification), the system predicts whether the parcel needs an update (U pred), based on the detection of new NEAs. If so, a human must verify this prediction (U verified).

Parcel Preparation
In the first part of the system's workflow, an individual parcel was prepared for NEA detection. The parcel preparation started with the localization of the parcel in the DOP, according to the geometry and coordinates stored in the LPIS database. Based on the localization, the DOP was cut to get an image that focused on the parcel only (parcel image). Additionally, we created a label mask for the registered NEAs (NEA mask) with the same dimensions as the parcel image.
Then, both the parcel image and the NEA mask were divided into tiles (parcel tiles and NEA tiles) of equal dimension. Note that the dimension of the tiles was the same for all images and in all iteration steps. According to the NEA information in the LPIS, each NEA tile was labeled according to the existence of a NEA within it. This resulted in binary NEA info (N Info) for each NEA tile.
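The tiling and labeling step can be sketched as follows (a NumPy sketch under assumptions: border handling is not specified in the text, so partial tiles at the edges are simply dropped here, and the 224-pixel tile edge is the size used later for training):

```python
import numpy as np

TILE = 224  # tile edge length in pixels (224 x 224)

def make_tiles(parcel_image, nea_mask, tile=TILE):
    """Split a parcel image and its NEA label mask into equally sized tiles.

    parcel_image: (H, W, C) array cut from the DOP around one parcel.
    nea_mask:     (H, W) binary mask of the NEAs registered in the LPIS.
    Returns a list of (parcel_tile, n_info) pairs, where n_info is True
    if any registered NEA pixel falls inside the tile.
    """
    h, w = nea_mask.shape
    pairs = []
    for y in range(0, h - tile + 1, tile):
        for x in range(0, w - tile + 1, tile):
            ptile = parcel_image[y:y + tile, x:x + tile]
            n_info = bool(nea_mask[y:y + tile, x:x + tile].any())
            pairs.append((ptile, n_info))
    return pairs
```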

Detection of New Non-Eligible Areas
The parcel preparation was followed by the detection of new NEAs. To detect new NEAs, each pair of a parcel tile (Tile) and its binary NEA info (N Info) was iterated over. During one iteration step, the parcel tile was forwarded to a neural network (classifier) consisting of convolutional neural network (CNN) layers and several fully connected (FC) layers, which proposed a probability for the existence of a NEA (P NEA) in the tile. To decide whether there was a NEA depicted in the tile, the output (P NEA) was transformed into binary information, which resulted in a binary classification (N Img). Here, a specific decision boundary, selected in advance based on the given target values of the TP- and TN-rates, was used. Together with the binary NEA info (N Info), the decision was made as to whether a detected NEA in the tile was a new NEA (N new).
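A minimal sketch of this tile-level decision (the threshold comparison and the "new NEA" test follow directly from the description above; treating a probability exactly at the boundary as positive is an assumption):

```python
def classify_tile(p_nea, n_info, boundary):
    """Turn the classifier's probability into the tile-level decisions.

    p_nea:    probability of a NEA in the tile, as output by the network.
    n_info:   True if a NEA is already registered in the LPIS for this tile.
    boundary: decision boundary selected in advance from the target rates.
    Returns (n_img, n_new): whether a NEA is depicted, and whether it is new.
    """
    n_img = p_nea >= boundary      # binary classification of the tile (N Img)
    n_new = n_img and not n_info   # detected but not yet registered -> new NEA
    return n_img, n_new
```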

Parcel Aggregation and Verification
After detecting (rough localization by tiling, and classification by the classifier) new NEAs on each parcel tile, this information (N new 1 . . . N new k) was aggregated to find out whether the parcel contained a new NEA and, therefore, needed an update (U Pred). After that, the two possible outcomes of U Pred, i.e., true (new NEA) or false (no new NEA), were handled according to the defined balance of Type I and Type II errors. For this specific case study, the system was optimized to avoid Type I errors by trading them off against Type II errors. Hence, further verification was avoided if U Pred was false (indicating no new NEA), thereby eliminating the need for a parcel update. However, a human verification of U Pred was enforced if it indicated the opposite, which had the potential to turn out to be a Type II error. Note that if the system had been optimized to avoid Type II errors instead of Type I errors, this step would have been defined the other way around.
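The aggregation and routing logic can be sketched as follows (the existence test via any() is one direct reading of "contained a new NEA"; the routing mirrors the Type I/Type II trade-off described above):

```python
def aggregate_parcel(n_new_tiles):
    """Aggregate the tile-level results (N new 1 ... N new k) to the parcel level.

    The parcel needs an update (U_pred = True) if any tile contains a new NEA.
    Only positive predictions are routed to human verification, since the
    system trades Type II errors (false alarms, caught by the officer) for
    fewer Type I errors (misses, which would go unnoticed).
    """
    u_pred = any(n_new_tiles)
    needs_human_verification = u_pred  # negative predictions skip verification
    return u_pred, needs_human_verification
```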



Training and Evaluation
There are a number of parameters that could affect the classifier; these are described in Sections 2.6 and 2.7 in detail. Consequently, it was necessary to evaluate the different parameter configurations as well as the whole approach. The process described above was, therefore, performed with some minor changes (Figure 5).

Figure 5 (caption fragment): Second (Ground truth generation), the binary NEA infos are aggregated to the parcel's ground truth (U target). Third (Detection), the system predicts the probability of containing a NEA for each parcel tile (P NEA). Next, these probabilities are aggregated for each parcel (U pred). Finally (Evaluation), the metrics are calculated based on U pred and U target, followed by the selection of a proper decision boundary taking the target values (TN target and TP target) into account.
First, human verification was not part of the evaluation process, because the process should focus on the evaluation of the different variants of the classifier. Second, some ground truth was needed in order to compare it to the results the classifier produced. Therefore, the classifier was evaluated based on its detection of all NEAs that were already registered. Thus, the information about registered NEAs was retrieved from the binary NEA infos (N Info 1 . . . N Info k) and aggregated to the parcel level, i.e., true or false (U target), in the ground truth generation step. To create values that could be compared against the ground truth, the extracted tiles (Tile 1 . . . Tile k) were forwarded to the ML model (detection), resulting in a probability (P NEA) for each tile. The set of probabilities was aggregated to one value (parcel aggregation), indicating the probability of the necessity of an update for the parcel, i.e., true or false (U pred), which was then compared against the ground truth, according to the metrics described below. Finally, a proper decision boundary based on the metrics and the given target TP- and TN-rates was selected.

Metrics
The receiver operating characteristic (ROC) curve was used to measure the goodness of the trained ML model, since the TP- and TN-rates (as well as the corresponding FP- and FN-rates) were to be balanced. The ROC curve describes the TP-rate and the FP-rate (1 − TN-rate) for every possible decision boundary of a classifier. Therefore, the area under the ROC curve (AUROC) was used to aggregate the target metrics into one value. The goal was to maximize this value by varying the parameters of the ML models.
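A minimal, dependency-free sketch of the discrete ROC curve and the AUROC via the trapezoidal rule (using one candidate boundary per distinct score, and the ≥ convention at the boundary, are assumptions):

```python
def roc_points(scores, labels):
    """Compute the discrete ROC curve as (FP-rate, TP-rate) points.

    One point is produced per candidate decision boundary, i.e., per
    distinct score in the test data, plus the trivial all-negative end.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for b in sorted(set(scores)) + [float("inf")]:
        tp = sum(1 for s, y in zip(scores, labels) if s >= b and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= b and not y)
        points.append((fp / neg, tp / pos))
    return sorted(points)

def auroc(points):
    """Area under the discrete ROC curve via the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area
```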
Additionally, the overall accuracy was calculated for each ML model to compare the approach with other studies.

Decision Boundary Selection
Recall that for the case study mentioned in Section 2.1, target values of 90% for the TP-rate and 70% for the TN-rate were defined. In this case, a higher TP-rate was more important than a higher TN-rate. The ratios of the differences between the target TP- and TN-rates and their maximum possible value (100%) were interpreted as weights of importance when comparing the two contradicting goals of a high TP-rate and a high TN-rate. In the given case, a 10% TP improvement corresponded to a 30% TN improvement (100% − 90% = 10% and 100% − 70% = 30%). In other words, TP improvements were three (= 30/10) times more appreciated in this study than TN improvements. Thus, the decision boundary that best fit the needs in an application context was selected in a deterministic way.
The ROC curve is a discrete function consisting of discrete points, each representing the TP- and FP-rate on the test data for a specific decision boundary. Since the test data is a finite set, the ROC curve steps are discrete decision boundaries at which at least one test data point changes from FN to TP or from TN to FP. In general, there is no decision boundary that exactly matches the target values of TP and FP (Figure 6).
Hence, the decision boundary with the best possible results given the target values as constraints (lower bound), which defines a specific classifier and its actual TP- and TN-rates, had to be found. This led to a constrained optimization problem, defined and solved as follows: Let B be the set of all discrete decision boundaries. Define the following functions on decision boundaries b ∈ B:

TP(b) := TP-rate of a classifier using decision boundary b (1)

FP(b) := FP-rate of a classifier using decision boundary b (2)
The goal of the optimization was to find the best decision boundary b best ∈ B with the highest TP-rate and the lowest FP-rate (corresponding to the highest TN-rate). Additionally, the (possibly imbalanced) importance of the two target values, e.g., a weight of three for TP-rate improvements and one for TN-rate improvements, was considered. With algebraic transformation, the optimization problem was described using the following formula:

b best = argmax b ∈ B (3 · TP(b) − FP(b)) (3)

In the case of two decision boundaries b 1 and b 2 matching the above criteria, we selected b best = max(TP(b 1), TP(b 2)) if the TP-rate was more (or equally as) important than the TN-rate, and b best = min(FP(b 1), FP(b 2)) if the TN-rate was more important than the TP-rate.
In our data, there were no significant performance problems related to calculating the best decision boundary by checking the optimization goal in a brute-force manner, i.e., for all discrete decision boundaries of a model and a test dataset. Figure 6a shows an example of the ROC curve produced by a specific score-based classification approach (a specific ML model) together with the target values for the TP-rate (90%) and the FP-rate (30%). Each point of the curve corresponds to a specific decision boundary for a specific classifier. Ideally, a solution is found in the top left corner, i.e., above the TP-rate line and to the left of the FP-rate line. Unfortunately, no decision boundary exists for the model that meets both constraints. To show the selection of b best, focus is placed on the area between the points where the ROC curve crosses the target TP-rate as well as the target FP-rate (Figure 6b).
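The brute-force selection can be sketched as follows (a sketch, not the paper's exact procedure: the weighted objective 3 · TP(b) − FP(b) reflects the 3:1 weighting described above, the hard target-rate constraints are omitted for brevity, and ties are broken by the higher TP-rate, following the stated rule for a more important TP-rate):

```python
def select_boundary(candidates, tp_rate, fp_rate, w_tp=3.0, w_fp=1.0):
    """Pick the best decision boundary b best by brute force.

    candidates: iterable of discrete decision boundaries b in B.
    tp_rate(b), fp_rate(b): the functions TP(b) and FP(b) on the test data.
    w_tp, w_fp: importance weights; w_tp / w_fp = 3 reflects the case
    study's targets, (100% - 70%) / (100% - 90%) = 30 / 10.
    """
    def key(b):
        # Weighted objective first; the TP-rate as tie-breaker.
        return (w_tp * tp_rate(b) - w_fp * fp_rate(b), tp_rate(b))
    return max(candidates, key=key)
```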

Cross-Validation
To narrow the confidence interval of the statistical accuracy (TP, TN) estimation, we performed cross-validation for the best model. We used six datasets of DOPs, acquired in different years and by two different institutions. The characteristics of the DOPs differ due to vegetation, weather conditions, and the camera equipment used for the photography. Therefore, a six-fold cross-validation was performed, selecting one whole dataset for testing and the other five for training the classifier. Afterwards, averaged metrics (i.e., TP-rate, TN-rate, overall accuracy) for the best decision boundary, as well as their standard deviations, were calculated to evaluate the selected classifier.
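The leave-one-dataset-out scheme can be sketched as follows (a minimal Python sketch; `train_fn` and `eval_fn` are hypothetical placeholders for the actual training and evaluation routines):

```python
from statistics import mean, stdev

def leave_one_dataset_out(datasets, train_fn, eval_fn):
    """For each fold, hold out one whole dataset for testing and train
    on the remaining ones; return mean and sample std of the metric."""
    scores = []
    for i, test_set in enumerate(datasets):
        train_sets = datasets[:i] + datasets[i + 1:]
        model = train_fn(train_sets)
        scores.append(eval_fn(model, test_set))
    return mean(scores), stdev(scores)
```

With six datasets this yields exactly six folds, each testing on one acquisition year/institution the model has not seen during training.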

Machine Learning Model
As mentioned in Section 2.3.2, the ML model consisted of a CNN followed by a set of FC-layers. A logistic sigmoid was used as the final activation function in the model. The FC-layers were part of the parameters and varied through the training iterations, as described in Section 2.6.
The CNN used was ResNet152V2 [27]. To apply the different parameters, the input layer and the FC-layer were changed. We used four input channels, whereas the original CNN consisted of three input channels; therefore, the network needed to be adapted. A fourth channel was added to the input layer of the CNN, and the kernel weights of this additional channel were randomly initialized. Apart from this, no other changes were made to the original architecture of the ResNet152V2 model.
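The extension of the pretrained first-layer kernels from three to four input channels can be sketched as follows (a numpy sketch of the weight handling only; the actual model was a full ResNet152V2, and the initialization scale is an assumption):

```python
import numpy as np

def extend_input_kernels(w3, rng=np.random.default_rng(0)):
    """Extend pretrained first-layer kernels of shape (out, 3, kh, kw)
    to (out, 4, kh, kw): copy the RGB weights and randomly initialize
    the weights of the added fourth input channel."""
    out_ch, _, kh, kw = w3.shape
    w4 = np.empty((out_ch, 4, kh, kw), dtype=w3.dtype)
    w4[:, :3] = w3                       # keep pretrained RGB kernels
    w4[:, 3] = rng.normal(0.0, w3.std(), size=(out_ch, kh, kw))
    return w4
```

Copying the pretrained RGB kernels preserves the ImageNet features, while the randomly initialized fourth channel is learned from scratch during fine-tuning.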
A training dataset was created based on the datasets described in Sections 2.2.2 and 2.2.3. As described in Section 2.3.1, each tile created from the parcel images was collected, together with the binary NEA infos created from the NEA mask. Inspired by the ImageNet challenge [28], we used a tile size of 224 × 224 pixels. Table 2 lists the resulting number of tiles that contain a NEA (with NEA) and those that do not (without NEA) per dataset. The training was performed in alternating training and validation steps. Therefore, the training dataset was separated into two subsets, one for training (90%) and one for validation (10%). The training subset was used in the training step, while the model was fitted. The validation subset was used in a separate validation step, in which the AUROC was calculated to measure any improvement in the model compared to the previous step. The training was considered finished if there was no improvement after four consecutive epochs, provided that at least two epochs had been completed. The training was performed with the Adam optimizer [29], a learning rate of 0.001, and a batch size of 100 in all iterations. As a loss function, we used binary cross-entropy loss.
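The stopping rule (no improvement after four consecutive epochs, provided at least two epochs were completed) can be sketched as follows (a pure-Python sketch; the function name is illustrative):

```python
def should_stop(auroc_history, patience=4, min_epochs=2):
    """Stop when the validation AUROC has not improved for `patience`
    consecutive epochs, provided at least `min_epochs` were completed."""
    if len(auroc_history) < min_epochs + patience:
        return False
    best_before = max(auroc_history[:-patience])
    return max(auroc_history[-patience:]) <= best_before
```

The rule compares the best AUROC of the last `patience` epochs against the best AUROC seen before them; training continues as long as the recent window still contains an improvement.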
In the ML models, there were two types of parameters: hyper-parameters and manually selected parameters. Hyper-parameters were tested in every possible combination in each iteration, whereas the manually selected parameters were chosen deliberately for each iteration. Both types of parameters are described below.

Transfer Learning
To benefit from pretrained computer vision models, we used transfer learning (TL) in our model, following two different approaches. First, we used the feature extractor of a model pretrained on the ImageNet challenge. Second, we used a feature extractor from a model trained in advance using self-supervised learning (SSL) [30], which has been used in many studies (e.g., [31][32][33][34][35][36]) and has shown the potential to improve the ML models used. In SSL, one differentiates between a pretext task (the task performed during pre-training) and a subsequent main task (in our case, the detection of NEAs in tiles). Through this, we intended to adapt the chosen model to the DOPs used for the training of the main task. Eventually, we came up with five different models for transfer learning (Table 3). The parameters considered for the pre-training were the number of trained epochs, the minimum parcel coverage (MPC), the FC-layer version, and the dataset used for training (Table 3). We used the names of the models in Table 3 for the description of the transfer learning parameters in the iterations for the training of the main task.
The details of the training process and the parameters, as well as the results of the pre-training, are described in Appendix A (Self-Supervised Learning Training).

Data Balancing
To balance the training data according to the classes we wanted to discriminate between, two different balancing techniques were investigated. This is particularly relevant as we focused on binary image classification with a major class (without NEA) and a minor class (with NEA). The first technique (referred to as reduce) reduced the major class by a random selection of n samples, where n equals the number of samples in the minor class. The second technique (referred to as augment(n)) duplicated random samples of the minor class n times. In order to vary the duplicated samples of the minor class, we used geometric transformation [37], i.e., a random 90-degree rotation, as an augmentation technique.
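The two balancing techniques can be sketched as follows (a numpy sketch; interpreting augment(n) as n augmented copies per minor-class sample, and the rotation as a random multiple of 90 degrees — both are plausible readings, not confirmed details):

```python
import numpy as np

rng = np.random.default_rng(0)

def reduce_major(major, minor):
    """'reduce': randomly subsample the major class down to the size
    of the minor class."""
    idx = rng.choice(len(major), size=len(minor), replace=False)
    return [major[i] for i in idx], minor

def augment_minor(minor, n):
    """'augment(n)': add n rotated duplicates per minor-class tile,
    each varied by a random 90-degree rotation."""
    augmented = list(minor)
    for _ in range(n * len(minor)):
        tile = minor[rng.integers(len(minor))]
        augmented.append(np.rot90(tile, k=rng.integers(1, 4)))
    return augmented
```

The reduce variant discards major-class information, while augment(n) keeps all tiles at the cost of near-duplicates in the minor class.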

Training Data Selection
To assess the impact of using different amounts of training data, we investigated different combinations of the given datasets, as well as different fractions of these datasets, as training data. Assume a balanced dataset named D and a percentage s. Then D(s) denotes a subset containing s percent of randomly selected data from D. For example, LVGSH2020(50) indicates that 50% of the dataset LVGSH2020 was used as training data.
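The D(s) notation can be sketched as follows (a minimal Python sketch; the seed and variable names are illustrative):

```python
import random

def subset(dataset, s, seed=0):
    """D(s): a random subset containing s percent of the data in D."""
    k = round(len(dataset) * s / 100)
    return random.Random(seed).sample(dataset, k)

# Stands in for LVGSH2020(50): half of a (dummy) dataset of 1000 tiles.
lvgsh2020_50 = subset(list(range(1000)), 50)
```

Fixing the seed makes the selected fraction reproducible across training iterations.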

Input Channels
We used all four channels provided by the datasets (R, G, B, NIR). Several research studies (e.g., [4,6,8]) have shown that the normalized difference vegetation index (NDVI) provides an advantage for classification tasks related to vegetative areas. Therefore, we also combined the RGB channels with the NDVI (R, G, B, NDVI) instead of the NIR channel. The NDVI was calculated based on the near infrared channel NIR and the red channel R:

NDVI = (NIR − R) / (NIR + R)

Hyper-Parameters

Trainable Layers
One hyper-parameter concerned the layers of the model that were trained. We varied the trainable layers of the CNN. Whether the input layer was trained depended on the transfer learning used: if transfer learning started from a CNN that was initially trained on the ImageNet challenge [28], the first layer needed additional training, because the pretrained CNN came with a three-channel input instead of the four-channel input used in our approach. Additionally, we varied training for the last layer, as well as for the last two layers, of the CNN. Finally, we also trained and tested all CNN layers.
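The trainable-layer hyper-parameter can be sketched as follows (a pure-Python sketch; the layer names and the helper function are hypothetical, not part of the actual implementation):

```python
def select_trainable(layer_names, variant, imagenet_pretrained):
    """Return the set of CNN layers to train for a given variant
    ('last', 'last2', or 'all'). The input layer is additionally
    trained when starting from an ImageNet-pretrained model, since
    its fourth input channel is randomly initialized."""
    trainable = set()
    if imagenet_pretrained:
        trainable.add(layer_names[0])     # input layer: new 4th channel
    if variant == "all":
        trainable.update(layer_names)
    elif variant == "last2":
        trainable.update(layer_names[-2:])
    elif variant == "last":
        trainable.add(layer_names[-1])
    return trainable
```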

Fully Connected Layers
We used different sets of FC-layers to test whether they had a significant impact on the model's performance. In a network with fully connected layers, each layer consists of neurons that are connected to all neurons of the subsequent layer. We considered the following FC-layer variations:
• 3 layers with neurons: 4096, 4096, 1
• 4 layers with neurons: 4096, 4096, 1000, 1
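The FC head can be sketched as follows (a numpy sketch of the 3-layer variant with random placeholder weights; the 2048-dimensional feature input and the ReLU hidden activations are assumptions, only the logistic sigmoid output is stated in the text):

```python
import numpy as np

def fc_head(features, sizes=(4096, 4096, 1)):
    """Forward pass through a stack of fully connected layers ending
    in a logistic sigmoid, as used for the binary NEA decision."""
    rng = np.random.default_rng(0)
    x = features
    for i, out_dim in enumerate(sizes):
        w = rng.normal(0.0, 0.01, size=(x.shape[-1], out_dim))
        x = x @ w
        if i < len(sizes) - 1:
            x = np.maximum(x, 0.0)       # ReLU (assumed hidden activation)
    return 1.0 / (1.0 + np.exp(-x))      # logistic sigmoid output
```

The 4-layer variant simply inserts an additional 1000-neuron layer before the single output neuron (`sizes=(4096, 4096, 1000, 1)`).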

Training Iterations
An iterative approach to the training of the classifier was used, where each iteration was divided into one main iteration and multiple sub-iterations. The main iteration was determined by the manually selected parameters described in Section 2.6. Manual selection of parameters was performed to avoid a brute-force approach, which would try out too many variants and overwhelm the computing resources. The sub-iterations tested all combinations of the hyper-parameters described in Section 2.7. Since only a few values were considered for each of the hyper-parameters, optimization techniques were not necessary; hence, the hyper-parameters were tested using a brute-force approach. Appendix B, Table A4 shows the set of main iterations, and Table A5 shows the set of sub-iterations, where an iteration is named main-iteration.sub-iteration, e.g., 3.6 for main iteration 3 and sub-iteration 6.
We ran the training and evaluation process described in Section 2.4 for each main and sub-iteration. After determining the best decision boundary as described in Section 2.4.2, we calculated the resulting TP- and TN-rates and compared the performance of the trained models to the target values (TP- and TN-rate).

Results
Since we wanted to focus on the impact of the different manually selected parameters, we first selected the best sub-iteration of each main iteration. This selection was based on a comparison of the AUROC values of the sub-iterations. Figure 7 shows the AUROC, TP- and TN-rates, and the overall accuracy of the best sub-iterations of each main iteration, along with the target TP-rate (90%) and TN-rate (70%) values (red dashed lines). The values for the metrics are outlined in Table A6, Appendix B.
Every time the target value for the TP-rate was met, the target value for the TN-rate was missed. Although the TP-rate was quite stable, the TN-rate showed high variability. The overall accuracy was very similar to the TN-rate. This similarity was caused by the large imbalance between the tiles with and without NEAs, as shown in Table 2 (i.e., a very small proportion of cases were NEA-positive).
Based on the AUROC shown in Figure 7, the best model was found in iteration 15.7. Table 4 shows the results of the six-fold cross-validation of this iteration for the other performance measures. Our proposed method achieved an average TP-rate of 91.3% with a sample standard deviation of 1.0%, an average TN-rate of 63.0% with a sample standard deviation of 3.0%, and an average overall accuracy of 69.4% with a sample standard deviation of 5.5%.

Discussion
Regarding the hyper-parameter selection, training the last two layers and the first layer of the feature extractor produced the best results in every iteration. Regarding the structure of the FC-layers, the combination of three layers performed best most of the time. Only in iterations 13 and 14 did the more complex structure of FC-layers perform better.

Transfer learning of feature extractors pretrained using SSL did not perform as well as transfer learning of the feature extractor trained on the ImageNet challenge. This is shown by a comparison of the main iterations where we used the SSL approach (9-12 and 14) with the other iterations (Figure 7).
The augmentation of the data had a positive impact on the performance, and a higher augmentation factor n resulted in a further improvement. This is shown by comparing iterations 2 and 3, where the augmentation factors were n = 4 and n = 22, respectively.
The training data selection had the most significant impact on performance. This is shown when comparing iterations 4, 5, 7, 13, and 15 in Figure 7, which differ from each other in the selection of the training data.
The best selection of input channels changed across the main iterations. In iterations 1-6, the NDVI performed better than the NIR channel when combined with the RGB channels. With more training data, the NIR channel combined with the RGB channels performed better than the combination with the NDVI.
Cross-validation showed, on average, TP- and TN-rates of 91.3% and 63.0%, respectively. With sample standard deviations of 1% (TP-rate) and 3% (TN-rate), the most important metric (the TP-rate) is near our target value of 90%, whereas the TN-rate is lower than the target value of 70%. Assuming a normal distribution over all datasets of aerial images used in the update process, we conclude that 95% of all datasets would exhibit a TP-rate of at least 89.3% and a TN-rate of at least 57.0%. Taking a more conservative view that holds for any distribution, we consider Chebyshev's inequality and conclude that at least 75% of all datasets of aerial images used in the update process would exhibit the TP- and TN-rates mentioned above (i.e., a TP-rate of at least 89.3% and a TN-rate of at least 57.0%).
Moreover, the decision boundary seems to vary considerably; the averaged decision boundary of 0.184801 exhibits a sample standard deviation of 4.5%. A closer look at the decision boundaries calculated for each test dataset shows that the non-orthorectified datasets (EFTAS2019 and EFTAS2020) in particular seem to be outliers compared to the other values. Nevertheless, these experiments still show quite good metrics (in terms of TP- and TN-rates).
Finally, we needed to suggest a decision boundary for the integration of the model into the PMP mentioned in Section 2.1, where it is most important to detect the majority of the relevant objects (TP). In general, a lower decision boundary comes with a smaller TN-rate and a higher TP-rate. Since the TP-rate is more important than the TN-rate according to the case study, we would rather select a lower than a higher decision boundary. Following this, we suggest the lowest decision boundary based on the average and the standard deviation. In this way, our approach ended up with a decision boundary of 0.094145 (=0.184801 − 2 × 0.045328).
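As a quick check of the arithmetic behind this conservative choice:

```python
avg_boundary = 0.184801   # averaged over the six cross-validation folds
std_boundary = 0.045328   # sample standard deviation of the boundaries
# lowest boundary based on the average and the standard deviation:
suggested = avg_boundary - 2 * std_boundary   # = 0.094145
```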

Conclusions
In this study, a system for the detection of new NEAs was proposed. Additionally, it was shown how to integrate such a system into an existing workflow performed by the authorities. The authorities can benefit from our proposed system in the specific application described in Section 2, even though the results did not hit the target values. The idea was to spare the authorities from reviewing parcels that do not require updates, and our system achieves this efficiently. According to the authorities, approximately 15% of all parcels need an update caused by the presence of a new NEA. This implies that even with a TN-rate of 57%, the workload can be reduced by approximately 50%.
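A back-of-the-envelope check of this estimate (a sketch under the simplifying assumption that every parcel not flagged by the system, including missed updates, counts as saved review work):

```python
p_update = 0.15   # share of parcels with a new NEA, per the authorities
tp_rate = 0.913   # averaged cross-validation TP-rate
tn_rate = 0.57    # conservative TN-rate (mean - 2 standard deviations)

# Share of parcels still flagged for manual review:
# true positives plus false positives.
reviewed = p_update * tp_rate + (1 - p_update) * (1 - tn_rate)
reduction = 1 - reviewed   # roughly 0.50, i.e., about half the workload
```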
Compared to the related work that focused on comparable case studies (Table A7, Appendix B), the overall accuracy of approximately 69.4% in this study was low. However, the studies are not directly comparable with respect to the objects that were to be classified or identified. For a proper comparative analysis, it would be necessary to evaluate traditional remote sensing techniques, as well as the approaches described in the related studies, on the specific case study and research area used in this study.
This study focused on particularly small objects that are difficult to identify in low-resolution satellite images. To overcome this issue, aerial images with a GSD of 50 cm were utilized for enhanced precision in object identification.
To improve the results achieved here, one could consider using other remote sensing data, such as LiDAR or derivatives thereof, e.g., DEMs, as an input channel in addition to the RGB, NIR, and NDVI channels. In theory, the approach presented in this study is not limited in the number of input channels, since it utilizes a CNN. The impact of increased spatial resolution could be analyzed as well, e.g., the impact on accuracy when detecting NEAs in aerial imagery with a GSD of 20 cm. Since imagery with higher resolution comes with higher costs, one could also investigate and evaluate this approach with remote sensing data of a lower spatial resolution, such as Sentinel-2 images.
The SSL approach did not give an advantage compared to the transfer learning based on the model from the ImageNet challenge. Since the SSL approach includes two training loops and is, thus, associated with higher costs, it is not reasonable to invest in this kind of pretraining when working on aerial images. Instead, it is more efficient and effective to use a pretrained feature extractor, like the one from the ImageNet challenge.
The results also show that it could be possible to use aerial images that are non-orthorectified. Cross-validations with the non-orthorectified datasets (EFTAS2019 and EFTAS2020) showed competitive results compared to the other cross-validations based on orthorectified images (LVGSH2019-2022). However, the suggested decision boundaries for the EFTAS datasets varied more than the ones for the LVGSH datasets. Since there are many properties of the aerial images to consider, e.g., brightness, contrast, color, sharpness, and temporal changes, it is not clear whether the lack of orthorectification caused the variation of the decision boundary. We also need to consider that the orthorectified and non-orthorectified images were provided by different companies (Table 1).
In this study, the reviewed parcels lead to verified data, i.e., data verified by a human expert, which could be used in further training iterations. Hence, we might investigate adapting our approach described in Section 2.3 to collect this verified data in an easy and integrated way. As indicated by the results, this could yield better performance over time. Furthermore, a stronger focus on augmentation techniques could also improve the performance, as augmentation increases the size and the diversity of the datasets. It would also be worthwhile to review the dataset described in this paper to find failures in the labelling. Especially when looking at falsely classified tiles (false positives and false negatives), we found some inconsistent data. Based on that, one could investigate the impact of those failures.
Regarding the pre- and post-processing of our approach, we should mention some disadvantages that arose especially from the creation of tiles. Since we created tiles that were located next to each other, it is possible that NEAs were located right at the borderline between two or more tiles. We do not know whether the classifier was able to detect those partially depicted NEAs. Moreover, it was not possible to detect a new NEA on a tile where a NEA was already registered. To reduce the impact of this problem, one could investigate other techniques for the creation of tiles, e.g., overlapping the borders of the tiles, or using other machine learning tasks such as semantic segmentation.
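The suggested overlapping-tile mitigation can be sketched as follows (a minimal Python sketch for one image axis; the 112-pixel stride is an illustrative choice, not from the study):

```python
def tile_origins(size, tile=224, stride=112):
    """Top-left corners of overlapping tiles along an image axis of
    `size` pixels; a stride smaller than the tile size yields overlap,
    so an NEA on a former tile border appears fully inside at least
    one tile."""
    origins = list(range(0, max(size - tile, 0) + 1, stride))
    if origins[-1] != size - tile and size > tile:
        origins.append(size - tile)  # also cover the right/bottom edge
    return origins
```

With a stride of half the tile size, every point of the parcel image (except near the outer border) is covered by at least two tiles per axis.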
The selection of an appropriate decision boundary is crucial but risky, since we do not know how it will work with new data. To avoid this critical aspect of the whole case study, a list of probabilities of necessary updates for each parcel could be provided instead of a definite decision. In this way, the officers can decide how many parcels they want to review, beginning with the ones associated with the highest probability of requiring an update. This could reduce the Type I and Type II errors of the overall process with a human in the loop.
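The suggested ranking can be sketched as follows (a minimal Python sketch; parcel IDs and probabilities are illustrative):

```python
def review_order(parcel_probs):
    """Rank parcels by predicted probability of needing an update,
    highest first, so officers can review as many as capacity allows."""
    return sorted(parcel_probs, key=lambda item: item[1], reverse=True)

queue = review_order([("p1", 0.2), ("p2", 0.9), ("p3", 0.6)])
```

No decision boundary is needed in this scheme; the officers simply stop reviewing when their capacity is exhausted.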
Note that neither the problem nor the solution is specific to a special case study or research area. Instead, the authorities of other federal states or EU countries could directly apply the proposed solution. In general, the accurate detection of human-made changes to nature has many applications, including but not limited to the detection of new buildings without permission, illegal changes to nature reserves, and the need for updates in all types of maps and plans. However, these applications would most likely require a retraining of the ML models based on remote sensing images specific to the application. Table A7 lists the differences between the studies (i.e., [2-4,6,8,9,11]) with respect to the following aspects of each article:
1. The main task of the case study according to image processing, which is either change detection or classification of the named objects.
2. The data sources used (if DOPs are used, the GSD as well as the angle of the sensor, i.e., nadir or oblique, is mentioned).
3. The research area the study focused on.
4. The method used (keywords related to ML methods, or "No ML" if no ML approach was considered).
5. The overall accuracy achieved in the study.


Figure 2 .
Figure 2. Parcel maintenance process of the LPIS database. First (Assessment), an officer decides for each parcel (Parcel 1 … Parcel n) whether it needs an update or not, taking new non-eligible areas (NEAs) into account. Second (Update), the officer changes the parcels marked for an update (U parcel = true) to match the conditions in the digital orthophoto (DOP).

Figure 3 .
Figure 3. Investigated area: (a) The location of the investigated area (red) covering the territory of the federal state Schleswig-Holstein in northern Germany; (b) Coverage of the provided aerial images in relation to the investigated area.Each dataset of aerial images is associated with a company responsible for the image acquisition (LVGSH or EFTAS) and a year (2019-2022).



Figure 4 .
Figure 4. Overview of the workflow of the system. First (Parcel preparation), the parcel image and the location of the non-eligible areas (NEAs) are extracted and then divided into smaller parts (parcel tiles and binary NEA infos). Second (Detection), the system detects new NEAs on each parcel tile. Finally (Parcel aggregation/verification), the system predicts whether the parcel needs an update or not (U pred), based on the detection of new NEAs. If so, a human must verify this prediction (U verified).

Figure 5 .
Figure 5. Overview of the evaluation process of the system. First (Parcel preparation), the parcel image and the location of the non-eligible areas (NEAs) are extracted and then divided into smaller parts (parcel tiles and binary NEA infos). Second (Ground truth generation), the binary NEA infos are aggregated to the parcels' ground truth (U target). Third (Detection), the system predicts the probability of containing a NEA for each parcel tile (P NEA). Next, these probabilities are aggregated for each parcel (U pred). Finally (Evaluation), the metrics are calculated based on U pred and U target, followed by the selection of a proper decision boundary taking the target values (TN target and TP target) into account.


Figure 6 .
Figure 6. Receiver operating characteristic (ROC) curves for a specific classifier: (a) Whole ROC curve with target values; (b) Relevant part of the ROC curve with the target true positive (TP)- and false positive (FP)-rates, together with the TP- and FP-rates attained by the best decision boundary.

Figure 7 .
Figure 7. Metrics in terms of the area under the receiver operating characteristic (AUROC) curve, true positive rate (TP-rate), true negative rate (TN-rate), and the overall accuracy per iteration. The red dashed lines indicate the target values of the TP- and TN-rates.


Table 1 .
Datasets of digital orthophotos used in this study.


Table 2 .
Distribution of tiles in datasets (tiles with NEA vs. tiles without NEA).

Table 3 .
Self-supervised learning (SSL) pretrained models with the relevant parameters used for training these models.
• Trained epochs: the number of epochs used for training.
• MPC: the minimum parcel coverage in the training dataset.
• FC-layer: the version of the fully connected layer used.
• Dataset: the dataset used for training.


Table 4 .
Metrics in terms of decision boundary, true negative rate (TN-rate), true positive rate (TP-rate), and overall accuracy of the six-fold cross-validation of iteration 15.7.

Table A6 .
Metrics in terms of area under the receiver operating characteristic (AUROC) curve, true negative rate (TN-rate), true positive rate (TP-rate), and overall accuracy of the best models trained in the main iterations.

Table A7 .
Related studies using remote sensing and computer vision to detect changes or to classify objects.