Evaluation of the Symmetry of Statistical Methods Applied for the Identification of Agricultural Areas

Abstract: The main priorities of the common agricultural policies of the European Union (EU) are improvement of the quality of life in rural areas for their inhabitants as well as the optimum utilisation of rural resources. The most efficient tools to improve the management conditions and utilise the potential of land are land consolidation works aimed at creating more favourable management conditions in agriculture and forestry through improving the territorial structure of farms, forests and forestland; the reasonable configuration of land; and aligning the limits of real properties with the system of irrigation and drainage facilities, roads and terrain. The development of agriculture in Poland and its production capacity are considerably differentiated in terms of space. At present, Poland has agricultural areas which, in many respects, have a chance of competing with agriculture in the other member states of the European Union. However, in some areas, agricultural production run by private farms owned by individuals is on the verge of, or already below, the limit of profitability. Currently, Poland lacks tools (strategies) allowing identification of land for intensive agricultural production as well as information about agricultural land that should be developed for non-agricultural purposes. Therefore, it is necessary to develop a methodology for identifying similar areas using available tools that can facilitate reliable identification of the areas relating to the indicated factors. Taxonomic methods can be used for clustering purposes. The study materials are data derived from real property register databases referring to one of the districts (poviats) situated in east-central Poland. As a final result, a method of clustering villages according to similar land-use categories was developed. It was created using two independent statistical methods: Ward's method and the complete-linkage method. The highest consistency was observed in two groups of identified types of areas sharing very similar characteristics. A high index of similarity of both methods (the so-called Rand index) testified to the reliability of the results of calculations. The results of clustering corresponded to a large extent to actual features defining the use of land in the analysed villages as well as the terrain relief.


Introduction
The development of agriculture and its production capabilities, when considered globally, are characterised by high variability in many countries of the world. This situation is due to the processes of long-term transformations in politics and agricultural management in areas with different social, demographic and economic situations. Issues related to agricultural crops have been an area of research and interest among many scientists all over the world due to agricultural production capabilities and the need for protecting such land. As shown by studies carried out by many authors, the area of arable land throughout the world is decreasing [1][2][3][4][5][6][7][8][9]. Research reveals that, despite the continuous growth of the world's population, the supply of land for the needs of the population is limited on all continents. The rapid growth of the world's population reflects the dynamic development of civilisation [10,11], creating an unprecedented requirement for lands other than crops or forestland [12]. According to the studies by [13][14][15], crop and forest resources have been almost completely exhausted all over the world. Thus far, the use of land and the changes in the use of land have been widely investigated, mainly in the context of their effects on the landscape [16][17][18][19][20][21][22][23][24], soil protection [25][26][27][28], climate change [29][30][31][32][33][34], policies used at a local level aimed at protecting crops and forestland [35][36][37] as well as the land's comprehensive reconstruction through land consolidation works [38][39][40][41][42][43].
Studies indicate [44][45][46][47] that there are areas where private farms run agricultural production on the verge of or below the level of profitability. Such areas are the most severely affected by all changes and system transformations and are often referred to as "problem areas". A main factor contributing to the formation of problem areas is the unreasonable utilisation of natural resources, which intensifies erosive degradation and soil acidity as well as the depletion of soil organic matter. Other hazards to the environment and agriculture are the concentration of industrial production, the locations of landfill sites and dust emissions contributing to local pollution of agricultural soils [48]. In addition, these are grounds upon which agricultural production is inefficient and onerous: "(...) agricultural production in respective regions is determined by natural conditions that almost 'automatically' lead to underdevelopment of the whole agricultural infrastructure and culture, which results in the region's backwardness" [49]. Therefore, attention should be paid to areas with limited production potential, lower income per capita and delayed economic development. These areas are increasingly exposed to marginalisation and exclusion from the list of areas with potential for development. Agriculture in such areas is doomed to fail. To address this, it is necessary to develop a comprehensive, reliable methodology using statistical tools that will allow identifying areas that are similar in terms of land-use categories, because such tools make it possible to determine the variety of land covered by the study.
In studies regarding spatial and economic phenomena occurring in larger rural areas, villages are clustered into larger typological units. These areas can be confined within the administrative boundaries of a village or can be larger typological areas referring to, for instance, soil classes. This is associated with both the differentiation and similarity of rural areas. Identification of the most similar areas allows analysing and capturing their spatial differentiation [50]. In Poland [51][52][53][54][55][56][57] and in many other countries of the world [58][59][60][61][62], available methods in many fields of science are used for delimiting areas. Areas with similar land-use categories have been delimited using recognised taxonomic tools and, in particular, agglomerative hierarchical clustering procedures. Hierarchical methods are among the most frequently used taxonomic procedures. They lead to the identification of a full hierarchy of clusters with a monotonically increasing similarity index. The clusters of higher orders include separable clusters of lower orders. There are agglomerative (bottom-up) and divisive (top-down) clustering procedures. The agglomerative methods presume that each unit is initially a separate cluster and then, in a sequential manner, the number of existing clusters is reduced by agglomerating them into higher order groups. The procedure ends when one cluster is obtained that includes all units of the set. Divisive methods use a reverse algorithm. The analysed set of units is initially treated as one group, and the classification gradually increases the number of clusters by dividing the currently existing groups until one-element groups are obtained. An advantage of hierarchical methods is the possibility of presenting the results of classification in a compact graphic form by means of a dendrogram illustrating subsequent links between groups of higher orders.
The most popular agglomerative hierarchical clustering method is Ward's procedure, which uses the squared Euclidean distance as the measure of distance between objects. The idea behind Ward's method is to minimise intragroup variability (and thereby maximise intergroup variability) at each clustering stage [63]. This was the reference method used in this study.
Land 2021, 10, 664
In this paper, another clustering method was used as a control: the complete-linkage approach, in which the underlying measure was the city block distance. For data expressed as a percentage structure, this distance generally complements the index of similarity between two categories to 100%. The results of research were presented as a dendrogram illustrating accurately the whole clustering process, as the mean output variables in each cluster and, in addition, as indicators of mean values for clusters calculated as the ratio between the mean value for a specific cluster and the total mean value. Highly consistent results between two statistical methods can be a guarantee of well-made decisions allowing the reliable identification of selected objects. One method does not provide a sufficient guarantee in this case. In the cluster analysis of objects constituting a study sample, a dendrogram can be deemed an estimator of the hierarchical structure of the whole population only when the analysed objects are described by essential features [64].
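The link between the city block distance and a similarity index for percentage structures can be written out explicitly. The derivation below is only a sketch under an assumption: the paper does not name the similarity index, so the common "sum of minima" index is assumed here, and the symbols p, q, s and d are introduced for illustration only. Both structures are taken to sum to 100%.

```latex
% Assumed similarity index and city block distance for two percentage
% structures p and q with \sum_i p_i = \sum_i q_i = 100:
s(p,q) = \sum_i \min(p_i, q_i), \qquad d(p,q) = \sum_i \lvert p_i - q_i \rvert .
% Because \lvert p_i - q_i \rvert = p_i + q_i - 2\min(p_i, q_i):
d(p,q) = 200 - 2\,s(p,q)
\quad\Longrightarrow\quad
\tfrac{1}{2}\,d(p,q) + s(p,q) = 100 .
```

Under this assumption, half of the city block distance is exactly the complement of the similarity index to 100%, so ranking pairs of villages by distance is equivalent to ranking them by dissimilarity of their land-use structures.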
The clustering covered 44 villages of the Brzozów district, according to 24 indicators describing land-use categories. The study area was not a random choice because the authors had already carried out studies in this area, covering the dynamics of changes in the use of land in the years 1872-2008 and a preliminary land use analysis in 2010. The calculations were based on unprocessed values of the land use indicators: arable land, orchards, permanent meadows, permanent pastures, built-up agricultural land, pond bottoms, ditch bottoms, agricultural land with tree stands and shrubs, wasteland, forests, other land with tree stands and shrubs, housing grounds, industrial grounds, other built-up grounds, land under building development, leisure grounds, surface mining grounds, roads, other transport grounds, grounds for the construction of roads, water courses, still waters, ecological areas and other various grounds. Unprocessed values were used because this way, higher weights could be naturally assigned to types of land extending over relatively larger areas. The final result of the work was the development of clusters of villages that were similar in terms of land-use categories in the area of 44 villages of the Brzozów district.

Materials
The study materials were actual field data derived from real property register databases. The clusters were created using two independent statistical methods: Ward's method and the complete-linkage method. Combining these methods produced conforming outcomes, which constituted sufficient study material.
The study material was a complex of 44 villages forming part of 6 communes in the Brzozów district situated in southeastern Poland in the Subcarpathian region (Figure 1).
The study area covered 44 precincts with an area totalling 53,941.00 ha, divided into 144,610 administrative plots owned by individual farmers. The study made use of 24 indicators describing land-use categories. Table 1 presents the distribution of respective indicators in the whole analysed group.
The rural areas of the Brzozów district were characterised by high variability in their categories of land use. For instance, the share of arable land ranged from 2.66% to 71.15%, and that of forestland from 6.49% to 96.85%. Most indicators showed very strong right-sided asymmetry, which meant that for most areas, the indicator had low or average values, but for a number of areas, the values were high or very high. For example, the share of permanent meadows was highly asymmetrical (A = 2.54) because, although the mean level was low (not more than 2.80% in every second commune), the maximum value was 33.94%. The spatial distribution of the four factors (uses of land) with the highest percentage shares within the study area is illustrated in Figure 2.
Figure 2. Spatial distribution of the four uses of land with the highest percentage shares: (a) % forests, (b) % arable land, (c) % fields, (d) % pastures.
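The right-sided asymmetry described above can be illustrated with a short calculation. This is a minimal sketch: the paper does not state which estimator lies behind the coefficient A, so the standard moment coefficient of skewness is assumed, and the `shares` values are hypothetical, not data from Table 1.

```python
def skewness(values):
    """Moment coefficient of skewness, m3 / m2**1.5 (population form).

    Assumption: the paper's asymmetry coefficient A is not defined in the
    text; this is one common definition, used here for illustration only.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical shares (%) of one land-use category in ten villages:
# mostly low values with a single very high one.
shares = [1.0, 1.5, 2.0, 2.0, 2.5, 3.0, 3.0, 4.0, 6.0, 30.0]
print(round(skewness(shares), 2))  # positive value: right-sided asymmetry
```

A positive result indicates a long right tail, which is the pattern reported for most indicators; a symmetric distribution gives a value of zero.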

Methods
The research comprised calculations and comparison of the results of two statistical approaches: Ward's method and complete-linkage using the STATISTICA PLUS program developed by StatSoft Polska. Using Ward's method (based on data in Table 1), it was possible to identify clusters of rural areas in the Brzozów district, according to land-use categories. The computational algorithm was based on differences (a so-called distance matrix) between the analysed rural areas determined as squared Euclidean distances. The optimum number of clusters was obtained by cutting the arms of the dendrogram at points at which they became longer and the distances between clusters were significantly larger, which made it possible to create five clusters of villages differing in the proposed features.
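The clustering step described above can be sketched in a few lines. The study itself used the STATISTICA PLUS program; the code below is not that implementation but an illustrative pure-Python version of Ward's agglomeration on squared Euclidean distances, using the Lance-Williams update. The function names and the toy `rows` data are hypothetical, and stopping at `k` clusters stands in for cutting the dendrogram.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two indicator vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ward_clusters(rows, k):
    """Merge rows with Ward's criterion until k clusters remain (1 <= k <= len(rows)).

    Returns a sorted list of clusters, each a sorted list of row indices.
    """
    clusters = {i: [i] for i in range(len(rows))}
    # dist[(i, j)] with i < j holds the current distance between clusters i and j.
    dist = {(i, j): squared_euclidean(rows[i], rows[j])
            for i in clusters for j in clusters if i < j}
    next_id = len(rows)
    while len(clusters) > k:
        (i, j), d_ij = min(dist.items(), key=lambda kv: kv[1])
        ni, nj = len(clusters[i]), len(clusters[j])
        merged = clusters.pop(i) + clusters.pop(j)
        for m in clusters:
            nm = len(clusters[m])
            d_mi = dist.pop((min(m, i), max(m, i)))
            d_mj = dist.pop((min(m, j), max(m, j)))
            # Lance-Williams update for Ward's method: the result stays
            # proportional to the increase in within-cluster sum of squares.
            dist[(m, next_id)] = ((ni + nm) * d_mi + (nj + nm) * d_mj
                                  - nm * d_ij) / (ni + nj + nm)
        del dist[(i, j)]
        clusters[next_id] = merged
        next_id += 1
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical villages described by shares (%) of arable land, forest, meadows.
rows = [[70, 10, 20], [68, 12, 20], [5, 90, 5], [8, 88, 4], [40, 40, 20]]
print(ward_clusters(rows, 3))
```

Because the squared differences are summed over unprocessed shares, land-use categories with large shares dominate the distances, which matches the weighting effect noted in the text.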
In complete-linkage clustering, five groups were selected to facilitate comparing them with the results obtained using Ward's method. However, a division into four groups (when the cutting line is moved slightly to the right, groups A and B are combined) or six groups (when the cutting line is moved slightly to the left, group C is divided) is more clearly marked on the chart. The accurate courses of the clustering process are presented as dendrograms (Figure 3), and Figure 4 illustrates their spatial distribution.
Both methods selected for comparison were slightly different. Ward's method analysis based on the squared Euclidean distance matrix contributed to increasing the significance of specific types of land (forestland, arable land, meadows etc.) in the results of classification. For complete-linkage, the variation in less-significant types of land could have had a relatively bigger share in the clustering results. Of course, the types of land for which the share was a fraction of a percent were insignificant for both methods used.
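For comparison, the control method can be sketched in the same spirit. This is an illustrative pure-Python version of complete-linkage agglomeration with the city block distance named in the Introduction, not the STATISTICA PLUS routine used in the study; the function names and the toy `rows` data are hypothetical.

```python
def city_block(a, b):
    """City block (Manhattan) distance between two indicator vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def complete_linkage_clusters(rows, k):
    """Merge rows by complete linkage until k clusters remain (1 <= k <= len(rows)).

    Returns a sorted list of clusters, each a sorted list of row indices.
    """
    clusters = [[i] for i in range(len(rows))]

    def gap(c1, c2):
        # Complete linkage: distance between the two farthest members.
        return max(city_block(rows[i], rows[j]) for i in c1 for j in c2)

    while len(clusters) > k:
        candidates = ((p, q) for p in range(len(clusters))
                      for q in range(p + 1, len(clusters)))
        a, b = min(candidates, key=lambda pq: gap(clusters[pq[0]], clusters[pq[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return sorted(sorted(c) for c in clusters)


# Hypothetical villages described by shares (%) of arable land, forest, meadows.
rows = [[70, 10, 20], [68, 12, 20], [5, 90, 5], [8, 88, 4], [40, 40, 20]]
print(complete_linkage_clusters(rows, 3))
```

The city block distance adds absolute rather than squared differences, so small land-use categories contribute relatively more than under the squared Euclidean distance, consistent with the remark on less-significant types of land.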

Table 2 shows the mean values for each land use indicator within each of the five groups. For each type of land, the colour green means a relatively low share in the group, and red corresponds to a higher share. In addition, the value of test probability p evaluating the statistical significance of variations between the identified groups of rural areas in terms of the share of a specific type of land is given.
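The group means and the indicators of mean values for clusters (the ratio between the mean for a specific cluster and the total mean, as defined in the Introduction) are simple to compute once each village has a cluster label. The sketch below uses hypothetical function names and hypothetical values, not the study's data.

```python
def group_means(values, labels):
    """Mean of `values` within each group label."""
    groups = {}
    for v, g in zip(values, labels):
        groups.setdefault(g, []).append(v)
    return {g: sum(vs) / len(vs) for g, vs in groups.items()}


def group_indicators(values, labels):
    """Each group's mean divided by the mean over all objects."""
    overall = sum(values) / len(values)
    return {g: m / overall for g, m in group_means(values, labels).items()}


# Hypothetical share of arable land (%) in six villages and their cluster labels.
arable = [40.0, 44.0, 10.0, 12.0, 70.0, 64.0]
labels = ["A", "A", "E", "E", "D", "D"]
print(group_means(arable, labels))
print(group_indicators(arable, labels))
```

An indicator close to 1 means the group is typical of the whole population for that type of land, while values well below or above 1 flag the types of land that distinguish the group.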

Results
The description of the groups is complemented by Table 3, showing mean group indicators that illustrate the group-to-total-population mean ratio. For instance, the mean share of arable land in the areas from group A was almost identical to the general mean (0.97); for areas from group E it was, on average, four times lower (0.25); and in group D it was almost two times higher (1.74) than the mean for the whole population.
Complete-linkage clustering is presented in Table 4 and, in terms of spatial distribution, in Figures 3 and 4. Having analysed the mean values for respective types of land in the identified groups, it could be stated that group E featured the highest percentage of forestland and pond bottoms. These were mainly leisure grounds with multiple ponds and waterholes, and most of the land had maintained its natural landscape. Group C showed the decidedly highest share of arable land in comparison to other groups, with a relatively small share of forestland. Villages from group C were typically agricultural areas with good quality soils and wide-scale agricultural management. Group D had a relatively high share of arable land, the highest share of meadows and the lowest of forestland. These were areas with varied terrain relief. Group A featured a high share of pastures and quite a high share of forestland, while most types of land had an average share in group B, except that it clearly had the highest share of the relative area of agricultural land with tree stands and shrubs. The descriptive analysis of the identified groups can be supplemented by a compilation of group indicators, providing a relatively easy way of characterising the distinctive elements of each group (Table 5).
In order to verify the methods used, the index of similarity of both identified divisions (the so-called Rand's index) was determined.
Rand's index is based on the following concept: all pairs of objects are analysed, and those that fall into the same group in both divisions being compared, or into different groups in both divisions, are counted. The number of such object pairs divided by the number of all possible pairs is the value of Rand's index. This leads to the formula:
$$R = \frac{Z}{\binom{N}{2}} = \frac{2Z}{N(N-1)}$$
where N is the number of objects, and Z is the number of object pairs treated consistently in both classifications (placed together in both or apart in both). The calculations showed that its value was 0.67, which meant that the similarity of both identified divisions was slightly above average, as this index can assume values from 0 to 1.
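The pair-counting idea behind Rand's index translates directly into code. The sketch below is an illustration with a hypothetical function name and hypothetical labels, not the calculation performed in the study; it counts the pairs of objects that both divisions treat in the same way and divides by the number of all pairs.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Share of object pairs on which two divisions agree.

    A pair counts as agreement when both divisions put the two objects in
    the same group, or both put them in different groups.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)


# Hypothetical group labels for five villages from two clustering methods.
ward = ["A", "A", "B", "B", "C"]
complete = ["A", "A", "A", "B", "C"]
print(rand_index(ward, complete))  # 7 of 10 pairs agree -> 0.7
```

The index depends only on which objects share a group, not on the group names, so the letters A-E assigned by each method do not need to correspond.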

Discussion and Conclusions
The studies made it possible to present the results of clustering the rural areas in the Brzozów district with respect to land-use categories, using Ward's method and the complete-linkage method. In the first method, differences (a so-called distance matrix) between the analysed rural areas were determined as a squared Euclidean distance. The accurate course of the clustering process was presented as a dendrogram (Figure 3). The division into five groups identified as A-E was quite clear (in addition, the division into three groups, namely A and B, C and D, and E, was very clear), whereas the division using the complete-linkage method led to the identification of five groups that were more similar in terms of count than in Ward's method. Verification of the methods using an index of similarity of both identified divisions (the so-called Rand's index) showed that the value of similarity was 0.67 (range 0-1).
Studies carried out based on two independent groupings showed that the areas most similar in terms of land-use categories were E/E and C/D, where the index of similarity (Rand's index) was 1, with 100% village coverage (E/E) and 0.71% (C/D). The area of the E/E group comprised villages in the highlands situated in the eastern part of the district (Hroszówka, Ulucz, Wołodź). These villages were characterised by the highest share of forestland and pond bottoms. Other uses of land constituted very small percentages. These villages have a historical background because in 1947, during "Operation Vistula", the villagers were expelled, and now there are few residents in that area. The territory maintained its natural landscape with few signs of human interference. Villages in groups C (Stara Wieś, Buków, Haczów, Jasionów, Trześniów, Wzdów, Jasienica Rosielna) and D (Stara Wieś, Buków, Haczów, Jasionów, Trześniów, Wzdów) were situated in plain and hilly terrain with a high share of arable land and meadows and the highest share of pastures. The identified similar areas featured the lowest share of forestland. The identified area comprised plains featuring very good conditions for agricultural production.
The results of clustering to a large extent correspond to actual features regarding the use of land in the analysed villages as well as terrain relief, as corroborated by previous studies by Leń [65,66] in the study area. The statistical methods used for clustering of similar areas provided satisfactory results in terms of assessment of the actual degree of similarity of the spatial distribution of land-use categories in the study area. High consistency between the results of each combination of pairs of the selected statistical methods could guarantee the right decisions in the development strategy for the specific area. A strategy based on the calculation results derived from a single method is not a sufficient guarantee in this case. The credibility of results is confirmed by the conformity of calculations with the actual needs of the region. The highest consistency was observed between the results of clustering for groups E/E and C/D, which was corroborated by the high Rand's index of similarity. Considering the purpose of the study, it should be emphasised that two independent statistical methods were used, and the calculated index of similarity in village clustering provides a reliable source of data on clustered areas with very similar features.
It should be noted that funds allocated to the consolidation of land are certainly insufficient in countries like Poland, which, due to its historical background, is now facing huge problems in agriculture in connection with the defective spatial structure of rural areas, especially in the eastern, south-eastern, southern and central parts of the country. The proposed methods allow identifying areas featuring the best quality of agricultural land where research should be continued to identify more specifically the areas where land consolidation is advisable. Such a methodology will be used by the authors in further detailed studies concerning the options of clustering agricultural problem areas to ensure the possibility of their management during land consolidation works.

Conclusions
In order to identify rural areas featuring low soil productivity for planning land surveying works in the process of land consolidation, a methodology was developed based on data fully reflected in actual features that was related to the quality of land in the surveyed villages. The results of this study could be helpful in identifying areas for the process of land consolidation aimed at, among other objectives, aligning the chances for development and preservation of the agricultural nature of areas with unfavourable natural and landscape conditions. It should be noted that the designed algorithm could be one of the elements taken into account in creating strategies at the local, district and regional level because it also allows identifying areas with the most favourable soil class, which does not fall under the scope of this study. Familiarity with the spatial variability of soil quality classes within the area of the commune, district and voivodeship has an impact on planning the development of agricultural areas and the effective allocation of