A Minimum Cross-Entropy Approach to Disaggregate Agricultural Data at the Field Level

Abstract: Agricultural policies have impacts on land use, the economy, and the environment, and their analysis requires disaggregated data at the local level with geographical references. Thus, this study proposes a model for disaggregating agricultural data, which develops a supervised classification of satellite images by using a survey and empirical knowledge. To ensure consistency with multiple sources of information, a minimum cross-entropy process was used. The proposed model was applied using two supervised classification algorithms and a more informative set of biophysical information. The results were validated and analyzed by considering various sources of information, showing that an entropy approach combined with supervised classifications may provide a reliable data disaggregation.


Introduction
Agriculture and forests are essential to preserve biodiversity and develop the economy in rural areas. They supply essential goods for human survival and well-being and hence need to be well managed [1]. Thus, information on the spatial distribution of land use at a detailed level is crucial for models and applications on agro-forestry production that require a spatial representation [2,3]. For instance, in the European Union, agricultural statistics try to report information at the regional and sub-regional level. However, the Agricultural Census, which is the main territorial statistical operation in the European Union, is carried out every 10 years, and between censuses there is no available information at the municipality or parish levels. This lack of information is a worldwide problem, since up-to-date knowledge of land use contributes to judicious spatial planning that considers characteristics of interest [4-9].
Every 3 years the LUCAS survey (Land-Use/Cover Area frame Statistical survey) is carried out. This survey collects land cover/land-use, agro-environmental, and soil data by field observation of referenced points, which are also photographed [10]. Information is also available monthly via satellite imagery from LANDSAT and, more recently, SENTINEL 2. LANDSAT and SENTINEL 2 are multispectral satellites with high spatial resolution developed by the National Aeronautics and Space Administration [...] municipality, in southern Portugal. Two different supervised classification algorithms are used in order to show the reliability of the approach. The aim is to include additional information, through various restrictions, in the entropy model used in the disaggregation process.
The remainder of the paper is organized as follows: in section two, the methodological approach is presented; in section three, the data and model application scenarios are explained; section four is dedicated to the results and analyses. Finally, section five presents the concluding remarks.

Methodological Approach
Recent research includes a variety of studies that use supervised classification techniques to produce thematic maps of land use [23,25,28,32]. Supervised classification builds a classifier from a training dataset that contains the predictor variables measured in each sampling unit and the classes assigned a priori to those units and, therefore, presents several advantages over unsupervised methods [26]. A comparison between different classification methods and their performance can be found in References [27-30].
The proposed methodological approach combines several techniques, such as the supervised classification of satellite images, cluster analysis, mapping, and cross-entropy minimization, and considers several sources of information. Several studies use entropy to estimate data when ordinary methods are not applicable, since it overcomes some problems that hamper traditional econometric methods [6,33-36]. A generalized maximum entropy model to estimate multi-output production functions was adopted by References [37,38]. Maximum entropy can also be used to estimate farm-level multi-input/multi-output production functions [39]. A dynamic approach for disaggregating agricultural data was presented by Reference [17]. The results of farm management models were disaggregated to the level required by natural science models [40]. Cross-entropy was also used to present the spatiotemporal dynamics of a maize cropping system in Northeast China [3]. In Portugal, several entropy models were also developed to disaggregate data [18,19,40-43].
The proposed methodological approach comprises two main steps, as shown in Figure 1. In the first one, prior information is estimated from a supervised classification of satellite imagery, the LUCAS survey, and experts' knowledge from the Ministry of Agriculture. In the second one, a cross-entropy model is applied to disaggregate the data from an aggregate level (for instance, national or regional) to a detailed level (local or pixel level), guided by the estimated prior information. This procedure guarantees consistency among the different sources of information and with the aggregate.
The supervised classification of satellite images is carried out to identify the distribution of land use. This process comprises the following steps: (i) collecting all the available information and carefully selecting the satellite imagery to be processed; (ii) defining the "training fields" using the LUCAS survey samples as references, together with empirical knowledge; (iii) defining the spectral signatures; (iv) implementing the supervised classification algorithms. As in previous studies [4,8], full advantage was taken of the LUCAS survey. It provides a set of samples that can work as "training fields" or sample areas in a "supervised classification" of the satellite images. To calculate the prior estimate, the Minimum Distance and Maximum Likelihood algorithms were used and their results compared.
The Minimum Distance algorithm calculates the Euclidean distance $d(x, y)$ between the spectral signatures of the image pixels and the training spectral signatures, according to the following equation:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (1)$$
where $x$ is the spectral signature vector of an image pixel; $y$ is the spectral signature vector of a training area; and $n$ is the number of image bands. The distance is calculated for every pixel in the image, and each pixel is assigned the class of the closest spectral signature, according to the following discriminant function:
$$x \in C_k \iff d(x, y_k) < d(x, y_j) \quad \forall j \neq k \qquad (2)$$
where $C_k$ is the land cover class $k$; $y_k$ is the spectral signature of class $k$; and $y_j$ is the spectral signature of class $j$.
The Maximum Likelihood algorithm calculates the probability distributions for the classes, based on Bayes' theorem, and estimates whether a pixel belongs to a land cover class. The discriminant function is calculated for every pixel as follows:
$$g_k(x) = \ln p(C_k) - \frac{1}{2} \ln |\Sigma_k| - \frac{1}{2} (x - y_k)^T \Sigma_k^{-1} (x - y_k) \qquad (3)$$
where $p(C_k)$ is the probability that the correct class is $C_k$; $|\Sigma_k|$ is the determinant of the covariance matrix of the data in class $C_k$; and $\Sigma_k^{-1}$ is the inverse of the covariance matrix.
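As an illustration of the two decision rules, the following is a minimal sketch and not the implementation used in the study. It assumes pixels with only two bands so that the covariance matrix can be inverted by hand; the function names, class labels, and signature values are hypothetical.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two spectral signature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def minimum_distance_class(pixel, signatures):
    """Assign the class whose training signature is closest to the pixel."""
    return min(signatures, key=lambda k: euclidean(pixel, signatures[k]))

def ml_discriminant_2band(pixel, mean, cov, prior):
    """Gaussian discriminant ln p(C_k) - 0.5 ln|S_k| - 0.5 (x-y_k)' S_k^-1 (x-y_k).

    Assumption: exactly two bands, so the 2x2 covariance is inverted explicitly.
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx = (pixel[0] - mean[0], pixel[1] - mean[1])
    mahalanobis = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
                   + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.log(prior) - 0.5 * math.log(det) - 0.5 * mahalanobis

def maximum_likelihood_class(pixel, stats):
    """Assign the class with the largest discriminant value.

    stats maps each class to a (mean, covariance, prior probability) tuple.
    """
    return max(stats, key=lambda k: ml_discriminant_2band(pixel, *stats[k]))

# Hypothetical two-band training signatures (reflectance values are made up).
signatures = {"citrus": (0.10, 0.50), "forest": (0.05, 0.30)}
label_md = minimum_distance_class((0.09, 0.45), signatures)
```

The minimum distance rule uses only the class means, whereas the maximum likelihood rule also uses the class covariance and prior probability, which is one reason the two algorithms can allocate the same pixel differently.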
Once the prior information has been estimated, it can be used in the disaggregation process to guide a cross-entropy model. This procedure is very useful because it allows additional information, such as biophysical restrictions, historical restrictions, and so forth, to be incorporated in the disaggregation process. In addition, a unique optimal solution to the disaggregation problem is obtained and consistency among the different sources of information is guaranteed.
Thus, inspired by the studies of References [8,41], the following generalized cross-entropy model was developed:
$$HM_{ik} \leq x_{ik} \leq HMX_{ik} \quad \forall i, k \qquad (8)$$
where $x_{ik}$ is the probability of land-use $k$ to be estimated in area $i$; $B_{ik}$ is the matrix of probabilities of each land-use $k$ in area $i$ resulting from the prior estimates; $ST_i$ is the area weight of each disaggregated unit $i$; $STAT_k$ are the regional statistics for land-use $k$; $LAND_{ik}$ is the land available for land-use $k$ in disaggregated unit $i$; $HM_{ik}$ are the minimum historical limits and $HMX_{ik}$ are the maximum historical limits for each land-use by disaggregated unit $i$; and $e_{kn}$ refers to a parameterized error term [36].
Equation (4) is the objective function, which minimizes the joint cross-entropy of the estimated probability distribution ($x_{ik}$), relative to the prior estimate ($B_{ik}$), and of the error distribution ($e_{kn}$). Equation (5) guarantees that $x_{ik}$ and $e_{kn}$ have the characteristics of a probability distribution. Equation (6) ensures that the disaggregated shares $x_{ik}$ are compatible with the aggregate at the regional level. Equation (7) ensures that biophysical limits (restrictions of soils, climate, and slope) are respected. Equation (8) relates to the historical limits that must be respected for land use. These limits represent the maximum and minimum areas that a given land-use has achieved in the past. Using this information, we can bound the model variables to more likely values for crop areas.
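To make the interplay between the prior and the aggregate constraint more concrete, the following is a minimal sketch rather than the model used in the study. It assumes a simplified problem without the error term, the biophysical limits, or the historical bounds, and it swaps in iterative proportional fitting, which minimizes an area-weighted cross-entropy to the prior subject to the adding-up and regional-total constraints. The function name `ipf_disaggregate` and all input values are hypothetical.

```python
def ipf_disaggregate(prior, unit_area, regional_totals, iterations=200):
    """Adjust prior shares so they add up to 1 per unit and match regional totals.

    prior[i][k]        prior share of land-use k in unit i (the role of B_ik)
    unit_area[i]       area of unit i (the role of ST_i)
    regional_totals[k] regional area of land-use k (the role of STAT_k)
    Assumption: sum(unit_area) equals sum(regional_totals).
    """
    n_units, n_uses = len(prior), len(regional_totals)
    # Start from the areas implied by the prior shares.
    areas = [[unit_area[i] * prior[i][k] for k in range(n_uses)]
             for i in range(n_units)]
    for _ in range(iterations):
        # Scale each land-use so that its total matches the regional statistic.
        for k in range(n_uses):
            column_total = sum(areas[i][k] for i in range(n_units))
            if column_total > 0:
                factor = regional_totals[k] / column_total
                for i in range(n_units):
                    areas[i][k] *= factor
        # Scale each unit so that its land-uses add up to the unit area.
        for i in range(n_units):
            row_total = sum(areas[i])
            if row_total > 0:
                factor = unit_area[i] / row_total
                areas[i] = [a * factor for a in areas[i]]
    # Convert areas back to shares x_ik.
    return [[areas[i][k] / unit_area[i] for k in range(n_uses)]
            for i in range(n_units)]

# Hypothetical example: two units of 100 ha and two land-uses.
example_shares = ipf_disaggregate(
    prior=[[0.7, 0.3], [0.4, 0.6]],
    unit_area=[100.0, 100.0],
    regional_totals=[120.0, 80.0],
)
```

A full implementation along the lines described in the text would add the error term and the inequality constraints and would normally be solved with a nonlinear programming solver.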
After the shares have been calculated, it only remains to redistribute the regional data by using the following equation:
$$\hat{S}_{ik} = x_{ik} \, SA_i \qquad (9)$$
where $\hat{S}_{ik}$ is the estimated area for land use $k$ in unit $i$ and $SA_i$ is the area of unit $i$.
An important phase of our approach is the validation of the entropy model in order to test the coherence of the disaggregation process. To carry out this validation, deviation measures and general statistical measures were used. As deviation measures, the Prescription Absolute Deviation (PAD) and the Weighted Prescription Absolute Deviation (WPAD) were considered. The PAD indicator measures the deviation between estimations and statistical data. The $WPAD_i$ indicator assesses the real deviation at the level of the statistical unit $c$, and WPAD is also calculated at the aggregate level.
Regarding the general statistical measures, the Pearson correlation coefficient $R$, the determination coefficient $R^2$, and the modeling efficiency (EF) were used to compare $S_{ck}$ and $\hat{S}_{ck}$. The $R$ coefficient is a measure of association between two variables, while $R^2$ refers to how much of the variance of the dependent variable is explained by the independent variables and is used to measure the fit of a regression line. Thus, when $R^2$ is equal to 1, the variance of the estimated data is completely explained by the real data. EF is a normalized measure to evaluate the model performance [6]. An EF indicator equal to 1 shows a total efficiency of the model, since there are complete information gains, while an indicator equal to 0 means the opposite. In cases where deviations between real and estimated data are high, this indicator may present negative values. EF was calculated as follows:
$$EF = 1 - \frac{\sum_{c} (S_{ck} - \hat{S}_{ck})^2}{\sum_{c} (S_{ck} - \bar{S}_{k})^2}$$
where $S_{ck}$ is the observed value; $\hat{S}_{ck}$ is the model result; and $\bar{S}_{k}$ is the average of the $S_{ck}$ values.
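The validation indicators can be sketched in a few lines. This is only an illustration under stated assumptions: PAD is taken here as the absolute percentage deviation relative to the observed value and WPAD as its mean weighted by observed area, which may differ from the exact definitions used in the study; EF and Pearson's R follow their usual definitions. All function names and numbers are hypothetical.

```python
import math

def pad(observed, estimated):
    """Assumed PAD: absolute deviation as a percentage of the observed value."""
    return 100.0 * abs(observed - estimated) / observed

def wpad(observed, estimated):
    """Assumed WPAD: PAD values weighted by each item's share of observed area."""
    total = sum(observed)
    return sum(pad(s, e) * s / total for s, e in zip(observed, estimated))

def modeling_efficiency(observed, estimated):
    """EF = 1 - sum of squared errors / sum of squared deviations from the mean."""
    mean = sum(observed) / len(observed)
    sse = sum((s - e) ** 2 for s, e in zip(observed, estimated))
    sst = sum((s - mean) ** 2 for s in observed)
    return 1.0 - sse / sst

def pearson_r(observed, estimated):
    """Pearson correlation coefficient between observed and estimated values."""
    mean_s = sum(observed) / len(observed)
    mean_e = sum(estimated) / len(estimated)
    cov = sum((s - mean_s) * (e - mean_e) for s, e in zip(observed, estimated))
    var_s = sum((s - mean_s) ** 2 for s in observed)
    var_e = sum((e - mean_e) ** 2 for e in estimated)
    return cov / math.sqrt(var_s * var_e)

# Hypothetical crop areas (ha) for three statistical units.
observed_areas = [100.0, 50.0, 10.0]
estimated_areas = [90.0, 60.0, 10.0]
summary = {
    "WPAD": wpad(observed_areas, estimated_areas),
    "EF": modeling_efficiency(observed_areas, estimated_areas),
    "R2": pearson_r(observed_areas, estimated_areas) ** 2,
}
```

Because the weights favor units with large observed areas, a high unweighted average PAD can coexist with a moderate WPAD when the largest deviations occur in small units.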

Data and Application Scenarios
The Algarve region in the south of Portugal was selected to implement the proposed approach and disaggregate the data to a kilometric grid (Figure 2). This region was selected due to recent dynamics regarding permanent crops, namely irrigated ones (such as citrus), and the need for policy evaluation. The Algarve has an area of 4996.8 km² and, in 2010, was composed of 16 municipalities and 84 parishes, a number that was reduced to 67 a few years later. The Mediterranean climate predominates and there are several biophysical contrasts between the coastal and inland areas, the latter with less fertile soils and steeper slopes. There are municipalities where permanent crops are predominant and citrus areas are relevant. This is the case of Silves, which was chosen as the pilot municipality to disaggregate the data to a more detailed 25-hectare grid. This municipality covers an area of about 680.1 km² and, in 2009, was divided into 8 parishes, which were later reduced to 6. It extends from the inland Algarve to the coast and, according to the 2009 agricultural census, it accounted for more than 17% of the permanent crop area and about 41% of the citrus area in the region.
In the Algarve region and Silves municipality, the predominant land-use is permanent crops, which were disaggregated as follows: fresh fruits, citrus, nuts, olive trees, vineyards, and other permanent crops.
The "training fields" (that is, the sample areas for defining the spectral signatures) for the supervised classification were defined using the 2012 LUCAS survey and the knowledge of experts from the Ministry of Agriculture, with a total of 191 sample areas considered in the whole Algarve region. The LUCAS survey can serve as a source of information for defining the training fields but, if its observations are limited in number, the empirical knowledge of the area held by technicians can be added. In Portugal, this empirical knowledge was easily obtained in the different regions to define the training fields and to carry out the supervised classifications. This approach has, therefore, the potential to be implemented in other areas if satellite imagery is available. Regarding the satellite imagery, LANDSAT 5 and 8 images with 30-m resolution were used in the supervised classification process. The LANDSAT 5 image used to disaggregate the 2009 data is from 4 July 2009, while the LANDSAT 8 image used for Silves is from 29 June 2013. Both images were obtained from the NASA system by using the Earth Explorer: http://earthexplorer.usgs.gov.
For the Algarve region, two simulations of the entropy model using the LANDSAT 5 2009 image were developed. One (SCMD2009) uses the minimum distance algorithm (MD) for the supervised classification (SC) and the other (SCML2009) uses the maximum likelihood algorithm (ML). In the case of the Silves municipality, both simulations were also considered, but they were tested in the entropy model with historical restrictions of land use (SCMD2009 and SCML2009) and without historical restrictions (simulations SCMD2009WR and SCML2009WR). Besides these four simulations, an SC using the minimum distance algorithm and the LANDSAT 8 2013 image was tested considering the entropy model with and without historical restrictions (simulations SCMD2013 and SCMD2013WR). The historical restrictions include general, incomplete limits indicated by experts (which may not be available for the whole region) and limits regarding the evolution of crops.
Most approaches developed in recent years were applied at a scale comparable to a 1 × 1 km grid [4]. Thus, the entropy model was used to disaggregate variables to the pixel level using a kilometric grid, which allowed obtaining a total of 6832 disaggregated units for the Algarve region. In the case of the Silves municipality, the data were disaggregated considering a 25-hectare grid with a total of 3148 disaggregated units.
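As a rough illustration of how a classified pixel raster could be summarized on a coarser grid to obtain class shares per disaggregated unit, consider the sketch below. It is not the procedure used in the study: the function name `cell_shares`, the block size, and the toy raster are hypothetical, and real grids would be built from georeferenced cells rather than fixed blocks of pixels.

```python
from collections import Counter

def cell_shares(labels, block, classes):
    """Share of each class inside every block x block group of pixels.

    labels  2D list of class labels, one per classified pixel
    block   number of pixels per grid-cell side (assumed; edge cells may be smaller)
    classes class names to report, including those absent from a cell
    """
    rows, cols = len(labels), len(labels[0])
    shares = {}
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            counts = Counter(
                labels[r][c]
                for r in range(r0, min(r0 + block, rows))
                for c in range(c0, min(c0 + block, cols))
            )
            n_pixels = sum(counts.values())
            shares[(r0 // block, c0 // block)] = {
                k: counts.get(k, 0) / n_pixels for k in classes
            }
    return shares

# Toy 4 x 4 classified raster aggregated into 2 x 2 grid cells.
toy_raster = [
    ["citrus", "citrus", "forest", "forest"],
    ["citrus", "olive", "forest", "forest"],
    ["olive", "olive", "citrus", "forest"],
    ["olive", "olive", "citrus", "citrus"],
]
toy_shares = cell_shares(toy_raster, block=2, classes=["citrus", "olive", "forest"])
```

Shares computed this way sum to 1 within each cell, so they can play the role of a prior probability matrix for a disaggregation model.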
Technical implementation of the approach used QGIS and the Semi-Automatic Classification Plugin (SCP) [23], and the entropy models were implemented using the General Algebraic Modelling System (GAMS).

Results and Analysis
In a first step, a prior estimate of land-use was calculated using the supervised classification. Figure 3 presents examples of the results for 2009 using the LANDSAT 5 image in the Algarve region with the minimum distance algorithm and the maximum likelihood algorithm. Despite the differences between the two classification algorithms, some contrasts in the region can be observed, namely, between the more forested areas in inland Algarve and the coastal areas with different uses. We also concluded that the minimum distance algorithm tends to identify agricultural areas in inland Algarve that are identified as forestry areas by the maximum likelihood algorithm. However, according to the knowledge of experts from the Ministry of Agriculture in those areas, the maximum likelihood algorithm tends to provide results more coherent with the observed reality. These results are produced at a 30 m × 30 m pixel level and are aggregated in a grid at a kilometric level. Examples of some "pixel" level estimates using the kilometric grid are presented in Table 1.
In the second step of our approach, the data disaggregation process was carried out by applying the cross-entropy model in the Algarve region and Silves municipality. Figure 4 presents examples of results per disaggregated unit according to the algorithms tested for the Algarve region. They allow us to identify some of the major contrasts in the regional distribution of permanent crops. Several differences in allocation according to the classification algorithm used are also seen. The use of different algorithms provides the decision maker with different approaches, which result in different spatial patterns that will be validated and analyzed. For the Silves municipality, Figure 5 presents examples of the results for several distinct simulations using a finer grid (25 hectares). These different simulations are relevant to test the methodological approach at a more detailed pixel level.
Spatially, the results let us identify some of the major contrasts in land use distribution, providing a more detailed "picture", with differences according to the simulations considered. For instance, the areas with the highest concentration of citrus are located in the parishes of S. B. Messines, Algoz, and Silves, while in the inner areas citrus does not exist due to the unsuitable biophysical conditions. This is consistent with the knowledge held about this area.
The results of the cross-entropy model were validated using the deviation indicators mentioned before. The average and median PAD indicators are presented in Table 2 per crop type for the Algarve region and the Silves municipality. In general, the average and median PAD values are high, which may mean a weak consistency between the model results and observed statistical data. In the Algarve region, the lowest median values are obtained for olive trees and for other permanent crops. At the aggregate level, the WPAD indicator is 42.8% in simulation SCMD2009 and 41.1% in simulation SCML2009 (Table 3). For the Silves municipality, the PAD values are better than in the Algarve region and can be compared with those of previous studies [18,19,41,42]. In terms of results per crop, other permanent crops tend to present the lowest median values, but several crops have high median values, often reaching more than 50%. In aggregate terms, the lowest WPAD (20.86%) is recorded in simulation SCMD2013 (see Table 3). This result may be explained by the more precise set of bands in the LANDSAT 8 image than in the LANDSAT 5 image. For 2009, SCMD2009 is the simulation that presents the best results (25.6%). Only the simulations without historical restrictions (SCMD2009WR, SCML2009WR, and SCMD2013WR) present WPAD values higher than 30%. Although the PAD results tend to present some high values, we must highlight that they are summary results and that there are territorial units with heterogeneous areas.
Therefore, if we analyze the individual results of the parishes of Algoz, Silves, Alcantarilha, and S. B. Messines, which have the most relevant area of permanent crops (about 92%), we find that the PAD values are low in several relevant crops and hence a $WPAD_i$ lower than 30% can be observed in all simulations. The errors occur in crops of little importance and in parishes with little relevance for the total area. Thus, the above values hide very satisfactory WPAD results.
The correlation coefficient ($R$) and the determination coefficient ($R^2$) are presented in Table 4. In the Algarve region, all the $R^2$ indicators are above 0.5, except for fresh fruits, which present $R^2$ values of 0.43 in simulation SCMD2009 and of 0.414 in simulation SCML2009. For some crops, such as olive trees and citrus, the $R^2$ values are always higher than 0.7.
In Silves, the results are considerably better than those presented in previous studies, since they are higher than 0.5 for most crops in all simulations and, in several cases, are higher than 0.8 or even 0.9. In the case of citrus, the most relevant permanent crop in the Silves municipality, the results are always above 0.9.
A similar validation process was implemented in Brazil in Reference [5], considering all the country's municipalities (more than 4000), and correlation coefficients between 0.4 and 0.65 were obtained. Other authors (Reference [6]) validated their model using 4 crops and obtained an $R^2$ of 0.8 for one crop, while the others presented $R^2$ values between 0.40 and 0.45.
Finally, the results were tested using the Efficiency Indicator (EF), as shown in Table 5 for the Algarve region and Silves municipality.
In the case of the Algarve region, the EF is always higher than 0.45, except for fresh fruits. Citrus and olive trees reveal results always above 0.7. Both algorithms (minimum distance in SCMD2009 and maximum likelihood in SCML2009) reveal similar results regarding the EF. For the Silves municipality, the vineyard land-use presents negative EF values in some cases and, therefore, a null disaggregation efficiency in those cases. Despite the use of a good number of training fields, the best EF for vineyards is 0.541. Previous studies were also less successful in estimating vineyard areas due to the low number of training fields [22]. However, this study used a considerable number of training fields and still revealed low EF values in several cases. This may be due to the classes used in the supervised classification process, which suggests that the macro-classes considered need to be revised. Fresh fruits also tend to present EF values lower than 0.3 in most simulations. The reason for this is that fresh fruits include a diversity of crops, which also calls for a revision of the macro-classes considered.
On the other hand, citrus always presents an EF above 0.9. All the other crops present EF values above 0.5 in most simulations. These results are not much different from those of previous studies [6], which obtained EF values between 0.23 and 0.71, and only one crop presented a value higher than 0.71.
The proposed approach allows disaggregating agricultural data at a detailed level, which is relevant for agricultural economics analysis. As with other crop mapping studies, the quality of maps depends on the quality of data sources [7]. From an economic point of view, knowing land uses will allow for the identification of total yields and the economic output of agricultural areas. This model is also well suited to cases where the data variation is great, such as Mediterranean regions, where crop acreage, farm economics, and production technologies have a high variability. The model's results can also be aggregated (up-scaled) into different spatial units, such as agro-ecological zones, providing another framework for analysis [6]. The model results reveal contrasts related to different farm strategies and different biophysical conditions and provide information on the location of crops in the territory, but they do not differentiate crops according to the productive system.
Therefore, the proposed model offers an effective way to disaggregate data at a detailed level, but further research is necessary to improve the prior estimates and integrate the different layers of information. Also, changes in crop patterns over time are as important as changes over space [7], and there must be efforts to estimate land-use continuously over time [17-19]. One line of research will be to test other classification algorithms. This research focused on two algorithms to provide a detailed test of the feasibility of these approaches, because they are well known and widely used. Nevertheless, the approach may be tested further using other classification algorithms, such as the random forest algorithm.

Concluding Remarks
This study presented a methodological approach for disaggregating agricultural data at the pixel level, which may be useful for land-use planning, policy monitoring, and rural development strategies. This approach is based on supervised classifications to identify the distribution of land use, which is then improved through an entropy model. The study showed that full advantage can be taken of up-to-date satellite imagery. These satellite images, in combination with the LUCAS survey and empirical knowledge, allow for the development of more precise supervised classifications. The use of an entropy model guarantees consistency among the different sources of information and allows for the correction of existing errors in a supervised classification. Therefore, this paper proposes a good alternative to the traditional econometric approaches to disaggregate data at a detailed level and recover incomplete information. Further research is under way to improve the model results, including the development of an approach to disaggregate yearly agricultural data at the pixel level and its implementation in more complex areas.