Evaluation of Crop Type Classiﬁcation with Different High Resolution Satellite Data Sources

: Crop type classiﬁcation with satellite imageries is widely applied in support of crop production management and food security strategy. The abundant supply of these satellite data is accelerating and blooming the application of crop classiﬁcation as satellite data at 10 m to 30 m spatial resolution have been made accessible easily, widely and free of charge, including optical sensors, the wide ﬁeld of viewer (WFV) onboard the GaoFen (GF, high resolution in English) series from China, the MultiSpectral Instrument (MSI) onboard Sentinel 2 (S2) from Europe and the Operational Land Imager (OLI) onboard Landsat 8 (L8) from USA, thanks to the implementation of the open data policy. There are more options in using the satellite data as these three data sources are available. This paper explored the different capability of these three data sources for the crop type mapping in the same area and within the same growing season. The study was executed in a ﬂat and irrigated area in Northwest China. Nine types of crop were classiﬁed using these three kinds of time series of data sources in 2017 and 2018, respectively. The same suites of the training samples and validation samples were applied for each of the data sources. Random Forest (RF) was used as the classiﬁer for the crop type classiﬁcation. The confusion error matrix with the OA, Kappa and F1-score was used to evaluate the accuracy of the classiﬁcations. The result shows that GF-1 relatively has the lowest accuracy as a consequence of the limited spectral bands, but the accuracy is at 93–94%, which is still excellent and acceptable for crop type classiﬁcation. S2 achieved the highest accuracy of 96–98%, with 10 available bands for the crop type classiﬁcation at either 10 m or 20 m. The accuracy of 97–98% for L8 is in the middle but the difference is small in comparison with S2. Any of these satellite data may be used for the crop type classiﬁcation within the growing season, with a very good accuracy if the training datasets were well tuned.


Introduction
Global food security has been attracting great concern around the world as the world population is projected to continually grow and the water and thermal condition of farmland is dramatically influenced by this global change. Effectively and efficiently monitoring agricultural production is critical to tackle food insecurity issues and support farming management. Rapid development of Earth observation satellites provides a great opportunity to fulfill this requirement globally. In particular, free and uncharged access to satellite images is accelerating methodology development for agricultural monitoring.
GF, the acronym of GaoFen in Chinese, which means high resolution in English, is one of the key Earth observation programs in China. The China Space Administration has announced at the GEO Ministerial Summit in December 2019 in Australia that the global data of GF-1 data [1] at 16 m spatial resolution will be made freely available to the world as a contribution from China in building the GEOSS, Global Earth Observing System of Systems. Sentinel [2] is a key component of the Copernicus program for Earth observation in Europe. Sentinel data is already freely accessed via the Copernicus Open Access Hub or other Hubs. Landsat [3] is a USA satellite mission and has a long history, dating back to the 1970s, providing 30-m resolution satellite multispectral data for the world. Landsat 8 is providing high quality satellite data free and openly since its successfully launched in 2013. S2 and L8 [4,5] are freely available via a well-developed data sharing server; however, GF is still new and an online data sharing platform is still in development. GF data are accessible online to some extent via the current web portal, while GF series satellite data are quite easily available in several operational satellite application centers in China in which there are dedicated land lines to connect to the data distributor, CRESDA (China Center for Resource Satellite Data and Application). The user may have more options for the input satellite data in agricultural monitoring as the abundant abovementioned data sources are freely available and at no charge.
As one of key elements of agricultural monitoring [6][7][8], crop area estimation with the satellite images has been implemented for many years. Crop type classification of using remote sensing time series is an important tool to deliver such information. A bunch of applications and the tool of cloud based and open source in crop type classification have been developed recently, of which the Sen2-Agri system is free and open source, allowing any user generating near-real time products tailored to their needs at their own premises or on cloud computing infrastructure [9]. Additionally, the Google Earth Engine provides a free cloud computing capacity with the various hosted datasets [10,11]. A large variety of crop classification methods at different scales and various levels of accuracy can be found in the literature [12][13][14]. From the first use of satellites for agricultural monitoring in the 1970s [15] to the latest study [16][17][18][19][20][21], crop classification strategies have evolved tremendously.
A common method of crop classification is a pixel-based approach in which an individual pixel is assigned to a specific class according to its spectral similarity to the class, as a satellite image is a pixel-based n-dimension data array. However, with the increase of spatial resolution, the pixel-based classification may bring more negative effects because the pixel size is too small to match the target class. Therefore, the object-based method [22][23][24] was adopted to improve the classification. A key step in the objected-based method is that it takes the spectral-, textural-, neighborhood-and object-specific shape parameters into account as segmentation; hence, it operates with a group of similar pixels instead of a single pixel. In addition to the advantages, limitations are also obvious in this method [25]. The overall effect of the method strongly relies on the segmentation scales. Inappropriate scale for the segmentation may create a negative impact on input features of classification. Spatial resolution of satellite image is another unavoidable problem. The sensors with a high revisit frequency and a large swath width usually offer a coarser spatial resolution. Using such sensors to map heterogeneous agricultural areas is problematic when the field areas are inferior to the sensor's pixel size, causing mixed pixels. In order to overcome the mixed pixel problem, the sub-pixel classification approach was developed for the coarse resolution satellite images [26][27][28][29][30]. The classification approach takes into account the possibility of a pixel of belonging to different classes with a method of pixel unmixing. Working with time series images instead of single one allows extracting temporal features that are a great asset for classifying different crop types [11,31,32].
Developing an algorithm for classification is also extremely important for crop type classification. Traditionally, unsupervised and supervised classifications are the two algorithms of classification. In unsupervised classification, clusters are created according to their similar spectral characteristics of the input images without the training samples. The class type is assigned later based on expert knowledge. The K-means and the ISO-DATA (Iterative Self-Organizing Data Analysis Technique Algorithm) are the two most frequently and widely used unsupervised algorithms. However, accuracy of unsuper-vised classification is hard to be improved to a very high level in comparison with the supervised classification.
In supervised classification, training samples have to be provided in association with the input images. The final class for each pixel will be decided by such commonly used classifiers as Maximum Likelihood Classification (MLC), Support Vector Machine (SVM), Random Forest (RF), Artificial Neural Network (ANN), Decision Trees (DT) and so on [32]. Many papers [9,33,34] have concluded that RF and other similar classifiers have the best performance in practice. Therefore, RF has been a very popular classification method in many applications, such as land cover and crop type identification. Fusion of optical data with SAR time series images for the crop type classification has also made steady progress in recent years [35][36][37], especially in reducing the cloudy effects of pixels.
It has been demonstrated the combination of S2 and L8 for classification can obviously improve the capability of change detection over the ground [14,38]. A systematic review on recent applications of Landsat 8/OLI and Sentinel-2/MSI for land use and land cover mapping has been presented in Chaves [39], which highlighted the possibility of using a medium-resolution (Landsat-like, 10-30 m) time series and multispectral optical data to provide a harmonization between these sensors and data cube architectures for analysis-ready data that are permeated by publicizations, open data policies and open science principles.
As a newcomer, GF-1 provides alternative high spatial resolution satellite data for classification. A combination of GF, S2 and L8 for classification has not been reported [40][41][42][43]. The objective of this study is to evaluate the influence of independent satellite data sources on the crop type classification. We conducted crop type classification with these three data sources independently. Then, we reveal the best option for crop type classification by evaluating the accuracy of the classifications. The question we intend to answer is how well the GF may achieve for crop type classification in comparison with S2 and L8. Results from the study will support the expectation of combining these three data sources for crop type classification in the near future.

Study Area
The study area locates at the northern part of the Yellow River Irrigation Area of Ningxia Hui Autonomous Region in Northwest China. With a total territory of 66.400 km 2 , Ningxia is naturally composed of three geographical zones strongly governing economic development in the region [44]: the mountainous and loess hilly district (MLHD) in the south, the dry and desert district (DDD) in the center and the Yellow River irrigated district (YERID) in the north. In the MLHD, 70% of the cultivated land consists of slope farmlands. The DDD is a very arid area, mostly covered by grassland, which accounts for 77% of Ningxia's total. The YERID is the most important agricultural production region in Ningxia. It has only one third of the total farmland in the region, but accounts for two thirds of Ningxia's total grain production and agricultural production value. Annual rainfall is extremely low in Ningxia, but the Yellow River brings abundant water from Qinhai-Tibet Plateau for an efficient irrigation to the agricultural fields in the region. In YERID, wheat, rice and corn are by far the major crops. This study focused on the YERID because it is the most important agricultural district in the region. The study area also represents a very typical irrigated agricultural region in China. Availability of clear sky images is high, as the rainfall is extremely low. Figure 1 shows the study area with six Sentinel-2 tiles. The crop growing season in the region is from May to September; therefore, only the satellite data observed in the growing season were collected.

Satellite Data and Processing
GF-1 Satellite, as the first satellite of the Chinese High Resolution Earth Observation System (GF), which includes series of high resolution optical and SAR satellites, was successfully launched on April 26, 2013 [1]. Onboard, there are four sets of multiple spectral cameras (wide field of viewer, WFV) with a mosaiced swatch of 800 km at 16-meter spatial resolution and a four-day revisit frequency [45]. WFV has only four bands as listed in Table 1. The L1B data of GF-1 WFV data covering the study area and over the growing season of 2017 and 2018 were downloaded from the CRESDA. The L1B data of GF-1 WFV have included the calibration coefficients and the RPC information which were used to implement the image georeferencing with the RPC Orthorectification approach. Then, the FLAASH approach was used to perform the atmospheric correction [46]. In order to compare with S2 and L8, the spatial resolution of GF-1 WFV in this study was set as 15 m.
The Sentinel satellites are part of European Space Agency's Copernicus program. Sentinel-2A was launched in June 2015 and complemented by Sentinel-2B in March 2017. These two Sentinel missions were equipped with high resolution multispectral (MSI) sensors. Sentinel-2 provides an unprecedented 10-m spatial resolution with a five-day global revisit frequency, a 13 bands imager and a 290 km swath width. Table 1 also lists the spectral and spatial specifications of Sentinel 2A/B. Through the Sen2-Agri system [47], Sentinel-2 L1C (top of atmosphere) images were automatically downloaded for the six tiles covering the study area and over the growing season of 2017 and 2018 from the Copernicus Open Access Hub. Using Sen2-Agri's L2A processor, a Multi-sensor Atmospheric Correction and Cloud Screening (MACCS algorithm) [48,49] was performed on top of the atmosphere images. The 60-m spectral bands were not used in this study due to the coarse resolution.
Landsat 8, launched on February 11, 2013, consists of two science instruments-the Operational Land Imager (OLI) and the Thermal Infrared Sensor (TIRS). These two sensors provide seasonal coverage of the global landmass at a spatial resolution of 30 meters (visible, NIR, SWIR), 100 meters (thermal) and 15 meters (panchromatic). The Landsat 8 image files consist of 16-bit GeoTIFF images, packaged as Georeferenced Tagged Image File Format. Table 1 lists the band information for Landsat 8 [5]. Each scene of Landsat images may be referred to by the path and row. In this study, the scenes of 129/33, 129/34, 130/33 and 130/34 cover the study area. Thus, all the Landsat 8 data during the crop growth season from May to September in 2017 and 2018 were downloaded from the USGS EarthExplorer. The FLAASH approach was used to perform the atmospheric correction

Satellite Data and Processing
GF-1 Satellite, as the first satellite of the Chinese High Resolution Earth Observation System (GF), which includes series of high resolution optical and SAR satellites, was successfully launched on 26 April 2013 [1]. Onboard, there are four sets of multiple spectral cameras (wide field of viewer, WFV) with a mosaiced swatch of 800 km at 16-meter spatial resolution and a four-day revisit frequency [45]. WFV has only four bands as listed in Table 1. The L1B data of GF-1 WFV data covering the study area and over the growing season of 2017 and 2018 were downloaded from the CRESDA. The L1B data of GF-1 WFV have included the calibration coefficients and the RPC information which were used to implement the image georeferencing with the RPC Orthorectification approach. Then, the FLAASH approach was used to perform the atmospheric correction [46]. In order to compare with S2 and L8, the spatial resolution of GF-1 WFV in this study was set as 15 m. The Sentinel satellites are part of European Space Agency's Copernicus program. Sentinel-2A was launched in June 2015 and complemented by Sentinel-2B in March 2017. These two Sentinel missions were equipped with high resolution multispectral (MSI) sensors. Sentinel-2 provides an unprecedented 10-m spatial resolution with a five-day global revisit frequency, a 13 bands imager and a 290 km swath width. Table 1 also lists the spectral and spatial specifications of Sentinel 2A/B. Through the Sen2-Agri system [47], Sentinel-2 L1C (top of atmosphere) images were automatically downloaded for the six tiles covering the study area and over the growing season of 2017 and 2018 from the Copernicus Open Access Hub. Using Sen2-Agri's L2A processor, a Multi-sensor Atmospheric Correction and Cloud Screening (MACCS algorithm) [48,49] was performed on top of the atmosphere images. The 60-m spectral bands were not used in this study due to the coarse resolution.
Landsat 8, launched on February 11, 2013, consists of two science instruments-the Operational Land Imager (OLI) and the Thermal Infrared Sensor (TIRS). These two sensors provide seasonal coverage of the global landmass at a spatial resolution of 30 meters (visible, NIR, SWIR), 100 meters (thermal) and 15 meters (panchromatic). The Landsat 8 image files consist of 16-bit GeoTIFF images, packaged as Georeferenced Tagged Image File Format. Table 1 lists the band information for Landsat 8 [5]. Each scene of Landsat images may be referred to by the path and row. In this study, the scenes of 129/33, 129/34, 130/33 and 130/34 cover the study area. Thus, all the Landsat 8 data during the crop growth season from May to September in 2017 and 2018 were downloaded from the USGS EarthExplorer. The FLAASH approach was used to perform the atmospheric correction for all Landsat 8 images [50]. The thermal bands, the panchromatic band and the Cirrus band were not used in this study.
After a visual check of all images, the images with higher than 25% cloudy coverage were completely removed. The images taken on the same day were mosaiced and tailored to the study area only. The finally used images are listed in Table 2.

Field Data and Training Samples
The reference data needed for training and validation of the crop classification were collected in the field. During the field campaign, the georeferenced pictures were taken with a GPS camera along the roads in the study area following predefined itineraries. At home, the land cover and crop type classes with the longitude and latitude coordinates were retrieved by visually screening the pictures with the tool developed for the photo data interpretation. The final output of this process is a formatted file gathering all GPS points with corresponding classes, class codes, author, roadside (left or right), collecting dates and times and the corresponding photo file names. A three-day field campaign with a random sampling approach was carried out in middle June 2017, and collected about 1500 ground truth photos with spatial reference and associated crop or other land cover classes. Three field campaigns with a random block sampling approach were carried out Remote Sens. 2021, 13, 911 6 of 17 again in June, July and September in 2018. All those sample points are distributed over the irrigated area, as shown on Figure 2.
The field samples include both the crop classes and non-crop classes. These samples are point-based ground truth and not ideally and evenly distributed in the study area. As a result of the land ownership policy in China, the field of each household is quite small and often a parcel of a few meter width and a few hundred meters length. The field boundary data in the study area are not available, nor are the boundaries in most arable land areas in China. Therefore, it is not feasible to use the field-based training sample, as other papers did [9,47]. However, it looks like a large field with a single crop as the fields often grow the same crop in practice in China. More well-distributed training samples were obtained by visually interpreting the satellite images. Thus, based on those ground sample points, using Google Earth and the Landsat 8 images in the growing season, referring to a 10 km grid frame, the systematic samples approach was implemented and the spatial size of 3 × 3 pixels or a little bit larger of polygons for crop types and non-crop types were drawn on the images. Samples within each 10 km grid were further identified and dramatically increased and the spatial distribution of samples was improved.
All samples were randomly separated into two parts with a ratio of 70% to 30%. The 70% samples were used for classification and 30% for validation. The distance between two samples, in fact, the centers of every two polygons, was taken into account in the separation process. If they are too close, all pixels in both polygons are classified as either training or validation datasets. The threshold for the distance in this study was set as 900 m. This avoids a strong spatial correlation. For instance, all samples taken in one field represent only one field at the level of image pixel. If some pixels of the samples are used for the classification and other for the validation, the accuracy will be very high in normal case, but this accuracy does not represent the true value. Table 3 lists the number of samples and the proportion for each crop type at the 30-m level.
Remote Sens. 2021, 13, 911 6 of 17 1500 ground truth photos with spatial reference and associated crop or other land cover classes. Three field campaigns with a random block sampling approach were carried out again in June, July and September in 2018. All those sample points are distributed over the irrigated area, as shown on Figure 2. The field samples include both the crop classes and non-crop classes. These samples are point-based ground truth and not ideally and evenly distributed in the study area. As a result of the land ownership policy in China, the field of each household is quite small and often a parcel of a few meter width and a few hundred meters length. The field boundary data in the study area are not available, nor are the boundaries in most arable land areas in China. Therefore, it is not feasible to use the field-based training sample, as other papers did [9,47]. However, it looks like a large field with a single crop as the fields often grow the same crop in practice in China. More well-distributed training samples were obtained by visually interpreting the satellite images. Thus, based on those ground sample points, using Google Earth and the Landsat 8 images in the growing season, referring to a 10 km grid frame, the systematic samples approach was implemented and the spatial size of 3 × 3 pixels or a little bit larger of polygons for crop types and non-crop types were drawn on the images. Samples within each 10 km grid were further identified and dramatically increased and the spatial distribution of samples was improved.
All samples were randomly separated into two parts with a ratio of 70% to 30%. The 70% samples were used for classification and 30% for validation. The distance between two samples, in fact, the centers of every two polygons, was taken into account in the separation process. If they are too close, all pixels in both polygons are classified as either training or validation datasets. The threshold for the distance in this study was set as 900 m. This avoids a strong spatial correlation. For instance, all samples taken in one field represent only one field at the level of image pixel. If some pixels of the samples are used for the classification and other for the validation, the accuracy will be very high in normal case, but this accuracy does not represent the true value. Table 3 lists the number of samples and the proportion for each crop type at the 30-m level.

Classification Algorithm
Two methods are available for crop type and land cover classification: unsupervised and supervised [51]. The K-means and ISO-data are two typical unsupervised classification algorithms which are easy to apply but the accuracy is hard to reach the potential ceiling. The supervised classification is widely used at present. Maximum Likelihood (ML), Support Vector Machine (SVM) and Neural Net (NN) are three typically and widely used supervised algorithms for crop type classification with satellite images. However, Random Forest (RF) has become the more popular classifier in recent years because it is robust and easy to apply and only a few parameters are required to be set and adjusted accordingly. It has commonly proved that the accuracy of RF is often better than that of others such as ML and SVM. Therefore, RF was selected as the classifier in this study.
RF is a supervised machine learning algorithm and a kind of ensemble of the decision tree. The decision will be made based on the bagging method. At each time, a random subset of input training samples will be used and the error of out of bag will be used to evaluate the accuracy. The final result will be decided by merging all trees together and get a more accurate and stable prediction. The detailed RF algorithm may refer to the literature [52][53][54][55]. There are only two key parameters to be considered: one is the number of features and the other is the number of trees. In general, the number of features may be set as the square root or the logarithm of the number of input bands of the image. The number of trees should be tested in advance. In practice, 100 or 200 for the number of the tree is fine. The larger both parameters are, the more time it will take to compute, but the accuracy will certainly not be the highest. After many tests, the square root of all input features and 100 was set for the number of features and the number of trees in this study, respectively.

Increasing Features
In order to achieve the highest accuracy, the classification should take all spatial, temporal and spectral information of the input images into account. For the optical image, clouds always contaminate the images. Some images have been removed due to high cloud coverage. In reality, only a few scenes of time series images may be used for the classification. Therefore, all available satellite image should be used during the crop growing season to capture the crop phenology changes as much as possible. The different crop phenology information is the key to discriminate the different crop types in the same growing season. The spatial information is decided by the spatial resolution of the image. The data source has an influence on the spatial information. More features also may increase the accuracy of the classified image [5,25]. In this study, every two spectral bands may be used to calculate an NDVI-like index (Equation (1)), and thus, all possible NDVI-like indices were calculated and added to the spectral bands as the input features for the classification: where F is a new feature and b i and b i+1 are spectral bands.

Validation Methods
An error confusion matrix is usually used as the quantitative evaluation method to calculate the image classification accuracy. The approach of the error confusion matrix was also applied for the validation in this study. Table 4 lists the typic error confusion matrix. The overall accuracy, OA, Kappa and F1-scores may be further calculated based on the error confusion matrix. In Table 4, i represents the ground truth and j the classified result. Nij is the number of pixels that were in class i according to the ground truth but were classified to class j in the resulting image. J is the digital code of each crop type class. The overall accuracy (OA) can be computed as the proportion of fully correct classes in the validation samples: The Kappa coefficient is another index to evaluate the accuracy of the classified image: where M is total number of validation samples. The Kappa coefficient is good to evaluate the unbalanced validation samples, but it is difficult to understand the meaning or what Kappa represents. Kappa gives a sense to evaluate how much mistake was avoided in the classification process. Kappa is always less than or equal to 1. A value of 1 implies perfect agreement and values less than 1 imply less than perfect agreement. Normally, 0.8 and higher shows very good agreements. In addition, the F1-Score is also used to evaluate the classified image. The F1-score (Equation (6)) is derived from the precision and the recall for each class, which are computed through Equations (4) and (5) respectively. It gives an indication of the classification performance per class: In this study, the validation of the classified images was carried out independently with the same suite of the validation samples. The statistic was computed by counting the number of pixels for each class.

Accuracy Analysis for the Different Experiments of Input Data
GF-1 WFV has only four bands, namely blue, green, red and near infrared. S2 has four bands at 10 m and six bands at 20 m. L8 has seven bands at 30 m. These bands are the basic features that may be used for the classification. It is already known that the classification accuracy may increase as more features are used as input data [5,25]. Thus, in this study, three experiments were designed to perform the crop type classification. The first one uses four GF-1 WFV available bands and four similar bands for S2 and L8. The second one uses four similar bands, with all calculated feature indices accordingly. The third one uses all available bands and all calculated feature indices accordingly for these three satellite data sources. Table 5 lists the spectral bands used and the number of increased features accordingly for GF-1, S2 and L8 in three experiments. The training datasets and the validation datasets are the same for all these experiments. The RF classifier algorithm is also the same. Because 10 S2 bands have two spatial resolutions, the datasets for S2 were prepared into two datasets, one with 10 m spatial resolution and another with 20 m spatial resolution. For the 10-m S2 datasets, the pixels in the window of 2 × 2 took the value of corresponding one pixel at 20 m. For the 20-m S2 datasets, the pixel at 20 m took the averaged value of the corresponding 2 × 2 pixels at 10 m.  Figure 3 shows the accuracies of experiment 1. It is clear that the classification accuracy from the GF-1 WFV is lower than those from S2 and L8 in both years with regards to OA, Kappa and F1-score. The accuracy from the S2 data has the relatively highest score in 2017, while it is slightly lower than L8 in 2018 because of a lack of data after July in 2018. There was no difference found between the 10-m S2 and the 20-m S2 for both years.
In experiment 2, six feature indices were calculated in reference to Equation (1) and added as the input features. Figure 4 shows the accuracies of these three satellite data sources in experiment 2. It is clear that the classification accuracy from the GF-1 WFV data has the relatively lowest score for both years with regards of the OA, Kappa and F1-score. The accuracies of 10-m S2 and 20-m S20 are close in 2017, while it is slight larger in 2018. The accuracies between L8 and S2 are also close, though the one from S2 is a little bit better than L8 in 2017. Due to a lack of S2 data after July in 2018, it did not show the higher value of S2 in 2018 than L8, but both are very close. From the experiments 1 and 2, it is obvious that the accuracies arose for each input datasets as the input features increased.  Figure 3 shows the accuracies of experiment 1. It is clear that the classification accuracy from the GF-1 WFV is lower than those from S2 and L8 in both years with regards to OA, Kappa and F1-score. The accuracy from the S2 data has the relatively highest score in 2017, while it is slightly lower than L8 in 2018 because of a lack of data after July in 2018. There was no difference found between the 10-m S2 and the 20-m S2 for both years.  In experiment 2, six feature indices were calculated in reference to Equation (1) and added as the input features. Figure 4 shows the accuracies of these three satellite data sources in experiment 2. It is clear that the classification accuracy from the GF-1 WFV data has the relatively lowest score for both years with regards of the OA, Kappa and F1-score. The accuracies of 10-m S2 and 20-m S20 are close in 2017, while it is slight larger in 2018. The accuracies between L8 and S2 are also close, though the one from S2 is a little bit better than L8 in 2017. Due to a lack of S2 data after July in 2018, it did not show the higher value of S2 in 2018 than L8, but both are very close. From the experiments 1 and 2, it is obvious that the accuracies arose for each input datasets as the input features increased. In experiment 2, six feature indices were calculated in reference to Equation (1) and added as the input features. Figure 4 shows the accuracies of these three satellite data sources in experiment 2. It is clear that the classification accuracy from the GF-1 WFV data has the relatively lowest score for both years with regards of the OA, Kappa and F1-score. The accuracies of 10-m S2 and 20-m S20 are close in 2017, while it is slight larger in 2018. The accuracies between L8 and S2 are also close, though the one from S2 is a little bit better than L8 in 2017. Due to a lack of S2 data after July in 2018, it did not show the higher value of S2 in 2018 than L8, but both are very close. From the experiments 1 and 2, it is obvious that the accuracies arose for each input datasets as the input features increased.  S2 and L8 have more bands available for the classification, and thus, all available bands were further used in the crop type classification. Figure 5 shows the accuracies of these three satellite data sources with the potential capability. Similar results were found in this experiment. The classification accuracy from the GF-1 WFV data always relatively has the lowest score. It is obvious that the accuracy from S2 is highest due to having the most bands available for the classification. The accuracy from 10-m S2 is a little bit better than the 20-m S2 but there is no big difference. The accuracies for S2 and L8 were increased a bit from four bands to all bands. Table 6 lists the commission and omission errors for the experiment 3. S2 and L8 have more bands available for the classification, and thus, all available bands were further used in the crop type classification. Figure 5 shows the accuracies of these three satellite data sources with the potential capability. Similar results were found in this experiment. The classification accuracy from the GF-1 WFV data always relatively has the lowest score. It is obvious that the accuracy from S2 is highest due to having the most bands available for the classification. The accuracy from 10-m S2 is a little bit better than the 20-m S2 but there is no big difference. The accuracies for S2 and L8 were increased a bit from four bands to all bands. Table 6 lists the commission and omission errors for the experiment 3. Note: Table 6 lists the commission and omission errors for the experiment 3. Sward and Medlar are two problematic crop types due to being less represented in the region.

The Full Capability of Data Sources for the Crop Type Classification
As mentioned above, the highest accuracy will be achieved when all spectral bands and associated indices are involved in the classification as the input features. Figure 6 shows the best crop type maps from 15-m GF-1, 10-m S2 and 30-m L8, respectively. Table  7 shows the accuracy of OA, Kappa and F1-Score for the whole images and F1-Score for each crop type. The classified crop type maps with these three data sources are all visually acceptable, as the quantitative accuracies for the whole images are also high.
Nine types of crop were successfully classified in 2017 and 2018. The major crop types, such as rice, was classified with very high accuracy (F1-score of 98.3-100%). Forage Grass, corn, grape, greenhouse, and vegetable were always classified quite accurately (85% plus F1-score). With wheat only, Sward and Medlar obtained low accuracy in 2017 with GF data. Sward obtained low accuracy in 2018 with S2 data. Wheat was a major cereal in this region, but is now decreasing due to the cultivation cost. Sward and Medlar are cultivated in the farmland but not the major crop type. The training samples for these three types was difficult to collect and the final number of samples was low. The low accuracy is most likely related to the appearance of cloudy images and low availability of the training samples. these three satellite data sources with the potential capability. Similar results were found in this experiment. The classification accuracy from the GF-1 WFV data always relatively has the lowest score. It is obvious that the accuracy from S2 is highest due to having the most bands available for the classification. The accuracy from 10-m S2 is a little bit better than the 20-m S2 but there is no big difference. The accuracies for S2 and L8 were increased a bit from four bands to all bands. Table 6 lists the commission and omission errors for the experiment 3.  Note: Table 6 lists the commission and omission errors for the experiment 3. Sward and Medlar are two problematic crop types due to being less represented in the region.

The Full Capability of Data Sources for the Crop Type Classification
As mentioned above, the highest accuracy will be achieved when all spectral bands and associated indices are involved in the classification as the input features. Figure 6 shows the best crop type maps from 15-m GF-1, 10-m S2 and 30-m L8, respectively. Table  7 shows the accuracy of OA, Kappa and F1-Score for the whole images and F1-Score for each crop type. The classified crop type maps with these three data sources are all visually acceptable, as the quantitative accuracies for the whole images are also high.   Nine types of crop were successfully classified in 2017 and 2018. The major crop types, such as rice, was classified with very high accuracy (F1-score of 98.3-100%). Forage Grass, corn, grape, greenhouse, and vegetable were always classified quite accurately (85% plus F1-score). With wheat only, Sward and Medlar obtained low accuracy in 2017 with GF data. Sward obtained low accuracy in 2018 with S2 data. Wheat was a major cereal in this region, but is now decreasing due to the cultivation cost. Sward and Medlar are cultivated in the farmland but not the major crop type. The training samples for these three types was difficult to collect and the final number of samples was low. The low accuracy is most likely related to the appearance of cloudy images and low availability of the training samples.

Discussions
Training datasets, input data and classification algorithm are three key aspects for the satellite image classification. In comparison with the other two, the training dataset is the core and decides the quality of output result fundamentally, as the classification model is set up with the training datasets completely. The spatial distribution and the statistical distribution of the training datasets for each crop type should be taken into account. Collecting the field samples is the first step to prepare for the training datasets. The field sampling is of importance and may help the expert to understand the features on the image first and then teach the computer to identify and learn features correctly but it is not enough if only field samples are being used as the training data for the classification. In reality, it is not able to perform the field data collection completely following the route planned at home due to the unexpected road work and other disconnection problems. Therefore, adding and optimizing of the training datasets spatially and evenly with the support of images should be carried out. Adding training samples may improve it dramatically in case that a lack of training samples in some area obviously creates the incorrect classification result.
However, a set of good training datasets is also related to the field size and crop cultivation practice. In this study area, the size of each field owned by a farmer is small and normally a few meters by a few tens of meters; however, in practice, the same crop is planted in consecutive fields, which makes it easy to identify crop types on the Landsat 8 image. This advantage helps the application of 10-30-m resolution satellite data in this area. In other parts of China, the small size field hampers the wide application of these satellite data for the crop type classification. The same situation may be found in Africa due to small agricultural fields and a heterogeneous landscape. The 5-m optical and 1-m SAR data were fused for the crop classification, with an accuracy of up to 75% in west Africa [56].
One study [57] on crop classification in Madagascar and Burkina Faso also showed low performances with an accuracy of around 50% due to the presence of fields smaller than the pixel size and a mix of trees and crops in the fields. Hence, the classification system on crop group rather than crop type may improve the classification accuracy with these free satellite images. Otherwise, it had better use a higher resolution commercial satellite image.
Data volume should also be taken in account. In this study, the data volume for GF-1 and Landsat 8 was still easily handled, but the data volume for S2 was extremely large. Although the training datasets is the same for these three data sources from the point view of polygons, the training datasets used for S2 in pixel units is nine times that for Landsat 8. Thus, it has to set a threshold to select some pixels in each polygon and keep the total volume of training pixels at an acceptable level. Otherwise, it takes a very long time to build up the model. The data volume of the input data with the additional features also increases. In this study, GF-1 was increased six additional features and 10 bands were finally used as the input features for each single date image. Landsat 8 was increased 21 additional features and finally used 28 bands as the input features. S2 was increased 45 additional features and finally used 55 bands as the input features. Regarding the data volume for a single date image, the volume for 10-m S2 data is about 18 times that of the L8 data, while the volume for GF-1 data is about 1.4 times of L8 data. Therefore, the final input data volume for S2 is too large to process properly with the limited capacity of some computer as the data volume for S2 used in this study was approximately 1 Terabyte. Therefore, the reduction of input features may be carried out, applying the methods of the principal component decomposition and selecting the main component of input datasets, if the computer is not able to handle these datasets.
Undoubtedly, the GF-1 WFV still has a geometric problem after it is orthorectified with image chip matching. This is the reason why the classification accuracy from GF-1 is still 3-4% lower than that from S2 or L8, which are in very good geometric condition under the same spectral circumstances. Clouds are another big issue for crop type mapping. Sometimes it is difficult to get sufficient images even with less cloudy coverage. It will do nothing if there are no satellite data available. Optimistically, it increases the data availability if these three kinds of satellite data are combined and fused in the near future in a way like Harmonized Landsat Sentinel-2 [38].

Conclusions
A case study of the evaluation of individual capability for crop type mapping from difference satellite data sources, such as GF-1, S2 and L8, was carried out in the yellow river irritation area in Ningxia Hui autonomous region of Northwest China. It revealed that the crop type mapping with any of these three kinds of satellite data may achieve the acceptable accuracy since the lowest OA may reach 94%. The relatively lowest accuracy from the GF-1 WFV data is the consequence of the limited spectral bands. It proved again that the accuracy will increase as the input features increase. There is no obvious difference between the 10-m S2 data and 20-m S2 data for the crop type mapping, but, indeed, S2 may achieve the highest accuracy with 10 available bands for the crop type mapping at either 10 m or 20 m. The accuracy for L8 is in the middle, but the difference is small in comparison with S2. In order to achieve the highest accuracy, all special bands and associate indices should be jointly used as the input features for the classification. The training datasets may be tuned and improved by expert knowledge with the field samples. The evenly spatial and statistical distribution of the training sample is of importance for training datasets. Classifications based on individual image may produce the results with various accuracies due to the limited signals for differentiating the crop types on each image. The time series of satellite images used altogether as the input will bring the highest accuracy as all available information during the whole growing season are used in the classification.