Effects of Training Parameter Concept and Sample Size in Possibilistic c-Means Classifier for Pigeon Pea Specific Crop Mapping

This research work studies the effect of the training parameter concept and the sample size on classification with a fuzzy Possibilistic c-Means (PCM) approach for Pigeon Pea specific crop mapping. For specific class extraction, the “mean” of the training data is conventionally used as the training parameter of the classification algorithm. In this study, we propose an “Individual Sample as Mean” (ISM) approach in which each individual training sample is treated as a mean parameter for the fuzzy PCM classifier. To avoid spectral overlap of the target Pigeon Pea crop with other crops in the study area, a temporal indices database was generated from Sentinel 2A/2B satellite images acquired during the 2019–2020 Pigeon Pea crop cycle. The spectral dimensionality of the temporal data was reduced by selecting the bands that give maximum enhancement of the target crop class. Further, the training sample size was increased to study the heterogeneity within the class in the classified output. The proposed ISM approach delivered a higher mean membership difference (MMD) between the Pigeon Pea crop and the co-cultivated Cotton crop than the conventional mean method, indicating better separation between the target crop and the spectrally similar crop cultivated in the same study area. When the sample size was gradually increased from 5 to 60, the MMD values within the Pigeon Pea test fields remained in the range 0.013–0.02, implying that the proposed algorithm works well even with a small number of training samples. Heterogeneity was better handled by the proposed ISM approach: the variance obtained within the Pigeon Pea field was only 0.008, compared with 0.02 for the conventional mean approach.


Introduction
India is the primary centre of origin and diversification for the Pigeon Pea crop. India is also the largest producer (over 85%) and consumer of the Pigeon Pea crop, with an annual production of 4.2 million metric tonnes in the year 2017 [1]. This crop is grown in regions with temperatures ranging from 26 °C to 30 °C in the monsoon season (June to October) and 17 °C to 22 °C in the post-monsoon season (November to March). It is mainly grown in well-drained black cotton soils with a pH ranging from 7.0 to 8.5 [2]. Pigeon Pea is highly recommended for developing economies as part of a balanced diet to fill the protein-based nutritional gap, and its seeds and leaves are also used in medicinal applications. Thus, the Pigeon Pea crop is immensely important for developing economies like India and helps in maintaining the sustainable productivity of smallholder cropping systems.
Geomatics 2022, 2 108
In recent years, technological advancements in data acquisition systems have made much geospatial technology affordable and accessible to the agricultural community [3][4][5]. One such application of remote sensing in agriculture is the identification and mapping of specific crops, which enables the determination of the acreage and spatial distribution of individual crops [6,7]. Multispectral sensor-based satellites acquire multi-band spectral data with regular temporal coverage. Thus, these multispectral satellites are an important resource for developing approaches that require spectral and temporal remote sensing information simultaneously. Since many different crops are grown in the vicinity of each other in large-scale farming, the spectral properties of the target crop may overlap with those of nearby crops, thereby resulting in faulty crop maps [8,9]. The literature on crop mapping with multispectral remote sensing data establishes that using single-date imagery is challenging and yields a less accurate classification of crops. Temporal multispectral images have been used effectively in different remote sensing applications and have delivered more accurate results than single-date images in multiple studies [10][11][12][13][14]. Similarly, multiple studies have developed methods based on temporal multispectral data for agricultural applications, including specific crop identification [15,16]. The motivation for using temporal images is that spectrally similar crops become separable at some point during the growing season. This requires multi-date data to study the spectral signatures and to help algorithms perform the task more efficiently. Thus, the problem of distinguishing spectrally similar crops, and the resulting error in crop mapping, can be resolved by adopting a temporal approach.
In this regard, satellite based temporal multispectral data hold great potential to resolve the existing challenges for crop separation and identification.
Apart from temporal data requirements, crop mapping and identification demand robust algorithms that are sensitive to the spectral and temporal information of the multispectral data [17][18][19][20]. Different computer vision techniques have been applied in agriculture depending upon the application. Wavelet and Fourier based techniques have been used to study crop phenology; k-means, the Support Vector Machine (SVM) and the Random Forest (RF) algorithm have performed accurately for crop/fruit grading; and techniques such as fuzzy algorithms and discriminant analysis have been used for the identification of specific crops [21][22][23][24]. The SVM and RF in particular have been the most popular machine learning algorithms for various remote sensing applications. However, the SVM does not perform well if the target classes have a high spectral overlap (e.g., spectral overlap between two similar crops) [25]. The RF algorithm is stable, less impacted by noise and reduces variance to improve accuracy, but it has greater parameter complexity, requires high computational power and carries a higher risk of overfitting [26]. Recently, deep learning techniques have become popular and have been used for different tasks in agriculture, such as crop type classification (customised CNN), plant disease detection (AlexNet, LeNet) and prediction of soil moisture and other biophysical parameters (deep belief networks) [27][28][29][30]. However, such advanced techniques require domain expertise and a large dataset for training the networks [31].
Fuzzy algorithms work on the principle of "one pixel-several classes", where each pixel holds a membership value for each class [32]. The unique feature of a fuzzy approach is the membership function, which determines the membership grade of a pixel, with values in the range [0, 1]. The closer the membership grade is to 1, the more strongly the pixel belongs to that particular class. The conventional Fuzzy c-Means (FCM) algorithm uses a hyperline constraint that forces the membership values of a pixel to sum to 1 [33,34]. As an improvement to the existing FCM algorithm, a Possibilistic c-Means approach was developed by Krishnapuram & Keller [35], where the hyperline constraint was relaxed and a noise clustering algorithm was introduced in which the outliers or noisy points are dumped into a separate noise cluster (NC) class. Hence, this algorithm contains an additional class comprising all the noisy points and, in the absence of training data, the NC algorithm does not forcefully allocate classes to data points but treats them as noise. A few studies [36][37][38] have demonstrated the efficiency of PCM in handling mixed pixels by extracting a single class of interest from mixed pixels using satellite multispectral data. The unique approach of these studies was the extraction of a single class independent of the presence of other classes in the image. Thus, a fuzzy PCM approach helps in extracting a specific class from an image with a high spectral overlap, while also dealing with noise and working from a small training dataset with a low probability of overfitting.
The novelty of this research work is the "Individual Sample as Mean" (ISM) approach, in which each individual training sample is itself treated as a mean parameter for the fuzzy PCM classifier, in contrast to the conventional approach of using the statistical "mean" of the training data. The statistical parameters obtained from the training data are not representative of the variations existing within a field. Various factors, such as non-uniformly applied water, fertilisers and pesticides, can result in local variations within a crop field. This heterogeneity is detrimental to the classification accuracy, thereby hampering specific crop mapping. The proposed ISM approach handles this better, as the heterogeneity is suppressed when the individual samples in the training data are used as the mean input parameter. Hence, this research work addresses the research gap of the unexploited potential of fuzzy approaches in handling heterogeneity. Further, the effect of sample size on the handling of heterogeneity was also studied. In addition, a novel Class Based Sensor Independent-Modified Soil-Adjusted Vegetation Index 2 (CBSI-MSAVI2) was applied in this research work. This Class Based Sensor Independent approach ensures the maximum enhancement of the target crop class in the temporal domain, which cannot be assured by the conventional combination of NIR and Red bands used in the usual MSAVI2. Overall, this research work aims to study the effect of the training parameter concept and sample size on classification by means of a fuzzy PCM (Possibilistic c-Means) approach for specific Pigeon Pea crop mapping.

Study Area
The J Bhupalpally District in Telangana state of India was selected as the study area for this research work. This district lies between latitudes 18 •


Datasets
In this research work, temporal data were acquired from Sentinel 2A/2B remote sensing satellites. Among the wide range of products, the dataset with Bottom of Atmosphere (BOA) reflectance, i.e., the atmospherically corrected Level-2 product, was used for this study [39]. A total of 12 images were acquired for the period from June 2019 to February 2020, as shown in Table 1. The band details of the MSI sensor onboard Sentinel 2 that were used for the study are shown in Table 2.
The Possibilistic c-Means (PCM) algorithm was introduced by Krishnapuram and Keller [35] to deal with the shortcomings of the fuzzy c-means algorithm. This approach is well suited for specific crop mapping, as the untrained classes do not affect the classified output, thereby facilitating single-class extraction. The objective function of the PCM classifier is given in Equation (1).

The function is subject to the constraints given in Equations (2)-(4), where η_i is any suitable positive number and 'm' is the fuzzifier or weighted exponent, with 1 < m < ∞.
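For reference, the PCM objective function and its constraints are commonly written in the following form, following Krishnapuram and Keller [35]. This is a sketch in assumed notation (c classes, N pixels, μ_ij the membership of pixel x_j in class i, v_i the class prototype and d_ij the distance between them); the exact symbols and the ordering of the constraints used in this paper may differ.

```latex
\begin{align*}
J_{PCM}(U, V) &= \sum_{i=1}^{c} \sum_{j=1}^{N} \mu_{ij}^{m}\, d_{ij}^{2}
               + \sum_{i=1}^{c} \eta_i \sum_{j=1}^{N} \left(1 - \mu_{ij}\right)^{m},
               \qquad d_{ij}^{2} = \lVert x_j - v_i \rVert^{2} \\
\max_{i}\, \mu_{ij} &> 0 \quad \text{for all } j \\
0 < \sum_{j=1}^{N} \mu_{ij} &\le N \quad \text{for all } i \\
0 \le \mu_{ij} &\le 1 \quad \text{for all } i, j
\end{align*}
```

The first sum penalises large distances between pixels and the prototype, and the second sum penalises low memberships, weighted by η_i.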
According to the first term of the objective function in Equation (1), the separation between the feature vector and the prototype vector should be as low as possible, while the second term forces the membership value to be as high as possible. In PCM, as 'm' increases, the possibility of pure pixels occurring in a class decreases [40]. From Equation (1), the membership values for PCM can be estimated as shown in Equation (5), and η_i can be estimated as in Equation (6), where K is a constant (K = 1 here) and η_i, called the bandwidth parameter, is the distance at which the membership to a class equals 0.5.
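The membership and bandwidth estimates can be sketched in a few lines of Python. This is a minimal illustration that assumes the standard closed form from Krishnapuram and Keller [35], μ = 1 / (1 + (d²/η)^(1/(m−1))), and a membership-weighted mean of squared distances for η. The function names and the numeric values are hypothetical and are not taken from the SMIC tool.

```python
import numpy as np

def pcm_membership(sq_dist, eta, m=2.1):
    """Possibilistic membership for squared distances to one class prototype.

    Assumed standard PCM form: 1 / (1 + (d^2 / eta)^(1 / (m - 1))).
    """
    sq_dist = np.asarray(sq_dist, dtype=float)
    return 1.0 / (1.0 + (sq_dist / eta) ** (1.0 / (m - 1.0)))

def estimate_eta(sq_dist, memberships, m=2.1, K=1.0):
    """Bandwidth estimate: K times the membership-weighted mean squared distance."""
    weights = np.asarray(memberships, dtype=float) ** m
    sq_dist = np.asarray(sq_dist, dtype=float)
    return K * np.sum(weights * sq_dist) / np.sum(weights)

# Hypothetical values: a pixel on the prototype, one at d^2 = eta, one far away.
eta = 0.04
u = pcm_membership([0.0, 0.04, 0.4], eta)
eta_demo = estimate_eta([1.0, 3.0], [1.0, 1.0])
```

With this form, the membership is exactly 0.5 when the squared distance equals η, which matches the description of η as the point at which membership to a class drops to 0.5.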

CBSI-MSAVI2 Indices
This research work proposes a novel approach of using the Class-Based Sensor Independent Modified Soil Adjusted Vegetation Index 2 (CBSI-MSAVI2) for dimensionality reduction. The CBSI-MSAVI2 index ensures maximum enhancement of the target crop without the need for deeper knowledge of the sensor specifications. In addition, it reduces the spectral dimensionality while preserving the temporal dimensionality. The formula used to calculate CBSI-MSAVI2 is given in Equation (7), where ρ_max and ρ_min denote the maximum and minimum reflectance values. The NIR and Red bands in the calculation of MSAVI2 [41] have been replaced with ρ_max and ρ_min.
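The substitution described above can be sketched as follows. The index function uses the standard MSAVI2 expression, (2·NIR + 1 − sqrt((2·NIR + 1)² − 8·(NIR − Red))) / 2, with NIR and Red replaced by ρ_max and ρ_min. The band-selection helper is an assumption made for illustration: it picks the bands with the highest and lowest mean reflectance over the target-class training samples for one date. Both function names and the sample reflectances are hypothetical.

```python
import numpy as np

def select_bands(train_reflectance):
    """Return (index of max band, index of min band) for one date.

    ASSUMPTION: bands are ranked by mean reflectance over the target-class
    training samples; rows are samples, columns are bands.
    """
    mean_per_band = np.asarray(train_reflectance, dtype=float).mean(axis=0)
    return int(np.argmax(mean_per_band)), int(np.argmin(mean_per_band))

def cbsi_msavi2(rho_max, rho_min):
    """MSAVI2 with NIR and Red replaced by the max and min reflectance bands."""
    rho_max = np.asarray(rho_max, dtype=float)
    rho_min = np.asarray(rho_min, dtype=float)
    a = 2.0 * rho_max + 1.0
    return (a - np.sqrt(a ** 2 - 8.0 * (rho_max - rho_min))) / 2.0

# Hypothetical training reflectances for one date: 2 samples x 3 bands.
train = [[0.05, 0.10, 0.40],
         [0.07, 0.12, 0.50]]
i_max, i_min = select_bands(train)
index_value = cbsi_msavi2(0.5, 0.1)
flat_value = cbsi_msavi2(0.3, 0.3)
```

When ρ_max equals ρ_min the index is zero, and it grows as the contrast between the two bands increases.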

Methodology Adopted
The temporal data acquired over the period were preprocessed: only the required ten bands (listed in Table 2) were retained and resampled to 10 m spatial resolution. These resampled data were taken as the input for generating the temporal indices database. Initially, various vegetation indices were applied to the dataset to reduce the spectral dimensionality while retaining the temporal dimensionality. The band combinations considered for generating those indices were the maximum and minimum reflectance bands for CBSI-MSAVI2, NIR and Red for MSAVI2 [41], and NIR and Red Edge1 for MSAVI2RE1 [42].
Separability analysis was carried out on these temporal indices databases to determine the optimal number of temporal images. Since the CBSI-MSAVI2 index produced the maximum Euclidean separation distance of Pigeon Pea from the other crops in the region, such as Cotton, this index was chosen for further classification. The optimal dates were chosen based on the maximum unique information extracted about Pigeon Pea phenology. The selected temporal dates from the temporal indices database were layer stacked and converted to the Generic Binary format, which is compatible with the in-house Java-based Subpixel Multispectral Image Classifier (SMIC) tool [43].
Using this layer-stacked input dataset, signature samples ranging from 5 to 60 sample points were created to study the effect of sample size on the classification. The sampling was carried out six times to generate signature files with 5, 10, 15, 20, 25 and 60 samples. The Java-based in-house SMIC tool was used for the PCM classification. In addition to the conventional PCM classification, where the mean is taken as the input parameter [44], Individual Samples as Mean (ISM) based PCM classification was also carried out with this tool. In this approach, each individual training sample is taken as the mean input parameter instead of the statistical mean of the samples.
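The difference between the two training concepts can be illustrated with a small sketch. It is not the SMIC implementation: the text does not state how the memberships obtained from the individual samples are combined, so taking the maximum over samples is an assumption made purely for illustration, as are the fixed η, the function names and the sample values.

```python
import numpy as np

def membership(pixels, centre, eta, m=2.1):
    """PCM membership of each pixel (rows) with respect to one centre."""
    sq_dist = np.sum((pixels - centre) ** 2, axis=-1)
    return 1.0 / (1.0 + (sq_dist / eta) ** (1.0 / (m - 1.0)))

def mean_pcm(pixels, samples, eta, m=2.1):
    """Conventional concept: the statistical mean of the samples is the centre."""
    return membership(pixels, samples.mean(axis=0), eta, m)

def ism_pcm(pixels, samples, eta, m=2.1):
    """ISM concept: every training sample acts as its own centre.

    ASSUMPTION: per-sample memberships are combined with a maximum.
    """
    per_sample = np.stack([membership(pixels, s, eta, m) for s in samples])
    return per_sample.max(axis=0)

# Hypothetical heterogeneous field: two training samples far apart in feature space.
samples = np.array([[0.2, 0.2], [0.6, 0.6]])
pixels = np.array([[0.6, 0.6],   # looks like the second training sample
                   [0.4, 0.4]])  # sits on the statistical mean
eta = 0.02
u_mean = mean_pcm(pixels, samples, eta)
u_ism = ism_pcm(pixels, samples, eta)
```

In this toy case the first pixel matches a training sample but lies far from the statistical mean, so it receives a low membership under the mean concept and a membership of 1 under the ISM concept.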
The effect of the training concept was examined using the same signature data of training samples without any alteration. The accuracy assessment was conducted using the Mean Membership Difference (MMD) method: the difference between the mean membership value of the test fields (Cotton and Pigeon Pea) and that of the training fields (Pigeon Pea crop) was calculated. The methodology followed in this research work is shown in Figure 2.
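The MMD measure used for the accuracy assessment can be sketched as below. The absolute value is an assumption (the text only speaks of a difference of means), and the membership values are hypothetical numbers chosen for illustration, not results from this study.

```python
import numpy as np

def mmd(train_memberships, test_memberships):
    """Mean Membership Difference between training and test fields.

    ASSUMPTION: reported as an absolute difference of the two means.
    """
    return abs(float(np.mean(train_memberships)) - float(np.mean(test_memberships)))

def field_variance(memberships):
    """Variance of membership values within one field (heterogeneity measure)."""
    return float(np.var(memberships))

# Hypothetical membership values read from a classified output.
pigeon_pea_train = [0.95, 0.97, 0.96]
pigeon_pea_test = [0.94, 0.95, 0.96]
cotton_test = [0.62, 0.66, 0.64]

proximity = mmd(pigeon_pea_train, pigeon_pea_test)   # expected to be near 0
separation = mmd(pigeon_pea_train, cotton_test)      # larger means better separated
spread = field_variance(pigeon_pea_test)
```

A small proximity value together with a large separation value indicates that the classifier treats the test Pigeon Pea fields like the training fields while keeping Cotton apart.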

Database of Generated Temporal Indices-CBSI-MSAVI2
The maximum separation distance of the target Pigeon Pea crop was obtained only from the temporal CBSI-MSAVI2 database, making it the most suitable index for this study. The study in [45] showed that the Class-Based Sensor Independent approach is simpler and better for the extraction of a single class. This research work proposes a novel Class-Based Sensor Independent-Modified Soil Adjusted Vegetation Index (CBSI-MSAVI2) approach for spectral dimensionality reduction while preserving the temporal dimensionality. The CBSI-MSAVI2 index ensures maximum enhancement of the target Pigeon Pea crop class without the need for deeper knowledge of the sensor bands. The in-house SMIC tool was used to generate this index. The bands used as maximum and minimum reflectance in CBSI-MSAVI2 for each of the temporal images are shown in Table 3.

The band ratios computed at each of the temporal dates are shown in Figure 3. The band ratio peaks (0.791) in November, indicating the peak of crop growth. The ratio then decreases gradually over January-February, coinciding with the harvest period. Hence, the growth cycle of the Pigeon Pea crop is well modelled using CBSI-MSAVI2.

Separability Analysis and Selection of Optimal Dates
The separability analysis was carried out for the target crop, Pigeon Pea, against other vegetation patches, plantation and the Cotton crop grown in the study area, to find the optimum number of dates. Of all the indices used, the Class-Based Sensor Independent index CBSI-MSAVI2 produced the highest Euclidean separation from Cotton [46]. The Euclidean separation distance between the Pigeon Pea crop and the other crops was maximised, starting with single-date indices data, to reduce the spectral overlap. The optimum dates were selected at the point where the separation distance became nearly constant. In this way, the selected optimum dates ensure the maximum extraction of Pigeon Pea phenology for the best results. The result of the separability analysis for CBSI-MSAVI2 is shown in Table 4.
Table 4. Separability analysis using CBSI-MSAVI2 temporal indices images for the selection of best temporal dates to map Pigeon Pea crop with spectrally least similar cotton crop.

A closer look at Table 4 reveals that the separability distance becomes almost constant at 59-60. This value was attained with six-date temporal image combinations. Hence, the dates corresponding to this number of temporal images (6) were selected (Table 4). In other words, these dates capture the unique phenological information of the target Pigeon Pea crop to its maximum extent, which is helpful for the level-2 classification.
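The date-selection idea (add dates until the Euclidean separation stops growing) can be sketched as follows. The greedy ordering, the stopping tolerance `tol`, the function name and the index values are all assumptions for illustration; the section does not specify how date combinations were enumerated.

```python
import numpy as np

def select_dates(target_profile, other_profile, tol=0.5):
    """Greedily add dates until the Euclidean separation gain drops below tol.

    Profiles are mean index values per date for the target crop and a
    competing crop. Returns the chosen date positions and the separation
    distance after each addition.
    """
    target = np.asarray(target_profile, dtype=float)
    other = np.asarray(other_profile, dtype=float)
    sq_gap = (target - other) ** 2
    chosen, distances = [], []
    remaining = list(range(len(sq_gap)))
    while remaining:
        # The date with the largest squared gap adds the most separation.
        best = max(remaining, key=lambda k: sq_gap[k])
        new_dist = float(np.sqrt(sum(sq_gap[k] for k in chosen) + sq_gap[best]))
        if distances and new_dist - distances[-1] < tol:
            break  # separation has become nearly constant
        chosen.append(best)
        distances.append(new_dist)
        remaining.remove(best)
    return sorted(chosen), distances

# Hypothetical mean index values on four dates (scaled to 0-255).
pigeon_pea = [10, 40, 80, 60]
cotton = [12, 10, 30, 58]
dates, dist_curve = select_dates(pigeon_pea, cotton)
```

In this toy example the two dates with a large gap between the crops are kept, and the remaining dates are dropped because they add almost nothing to the separation.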

Optimising Weighted Exponent (m) Parameter
The value of "m" was varied from 1.1 to 3 to determine the optimised value to be used for the classification. For each of the outputs obtained by varying m, MMD analysis was carried out to find the best result. The results were analysed by comparing the membership values of the Pigeon Pea training fields with those of the Pigeon Pea and Cotton test fields. This is a two-pronged approach that examines both the difference in membership values within the Pigeon Pea fields and the extent of separation achieved from the spectrally similar Cotton crop. The performance of the classification was judged not only by the proximity of the Pigeon Pea training fields to the Pigeon Pea test fields, but also by the departure attained from the Cotton test fields, as Cotton was spectrally the closest crop to Pigeon Pea. As 'm' increases, the fuzziness in the output increases; conversely, a value of m closer to 1 yields a hard classification output. The research work [47], which used a Possibilistic c-means classifier with a hypertangent (tanh) kernel for wheat (Triticum aestivum) identification, found that with weighted exponent (m) values of 2.7, 2.5 and 2.5, images of 4-date combinations from the Formosat-2 and Landsat-8 (Operational Land Imager) sensors separated the wheat crop well from other vegetation.
It was noted that the difference in the membership value remained almost constant in the case of the spectrally similar Cotton for an 'm' value of 2.1, while the MMD value for Pigeon Pea remained close to 0. Thus, the optimised m value used for PCM classification was 2.1. The MMD values against the chosen "m" for Pigeon Pea and Cotton are shown in Tables 5 and 6 and Tables 7 and 8, respectively. In this way, the membership values attained at both the Pigeon Pea and Cotton test fields were compared with the membership values at the Pigeon Pea training fields to gain insight into the proximity and separation achieved by the classification.
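The influence of m on the hardness of the output can be illustrated by sweeping m over the same range and evaluating the membership of one pixel close to the class centre and one far from it. The sketch assumes the standard PCM membership form, 1 / (1 + (d²/η)^(1/(m−1))); the two distance ratios are hypothetical.

```python
import numpy as np

def membership(ratio, m):
    """PCM membership for a given ratio d^2 / eta (assumed standard form)."""
    return 1.0 / (1.0 + ratio ** (1.0 / (m - 1.0)))

# Sweep m from 1.1 to 3.0 in steps of 0.1, as in the optimisation described above.
m_values = np.round(np.arange(1.1, 3.01, 0.1), 1)

# Hypothetical pixels: one well inside the bandwidth, one well outside it.
near = [membership(0.25, m) for m in m_values]
far = [membership(4.0, m) for m in m_values]
```

Near m = 1.1 the two memberships are close to 1 and 0 respectively (a nearly hard output), and they move towards each other as m grows, which is the increase in fuzziness described in the text.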

Classification Results
The objective of this research was to study the effect of the training concept and the sample size on classification for specific Pigeon Pea crop mapping. Under the ISM training concept, the "Individual Samples" are taken directly as the mean input parameter instead of the conventional "mean". To this end, six training sample sets were generated, consisting of 5, 10, 15, 20, 25 and 60 training sample points. In general classification, according to [48], a sample size of "10*n" has to be considered for training, 'n' being the number of bands in the input image. Here, since 6 optimum dates were chosen, the 10*n rule gives 10*6, i.e., 60 samples for training. The subset image is shown in Figure 4, and the classification results for the smallest signature file (5 samples) with the two different training parameters are shown in Figure 5.

Accuracy Assessment Using MMD
The Mean Membership Difference (MMD) method was used to quantitatively analyse the output images generated. The difference between the mean membership value of the test fields (Cotton and Pigeon Pea) and that of the Pigeon Pea training fields was calculated. The MMD is expected to be close to 0 for the Pigeon Pea test fields (proximity) and to tend towards 1 for the Cotton fields (separation). This two-pronged approach of studying the MMD values from the perspectives of proximity and separation helps in better comprehension and evaluation of the classifier performance. The MMD values for the Pigeon Pea test fields obtained with the Mean PCM for training data of various sample sizes are shown in Table 9, for the optimum m value of 2.1.
Table 9. Effect of increase in sample size by comparing membership values between Pigeon Pea Training fields and Pigeon Pea Test fields for proximity-Mean training parameter based PCM classifier.

It was observed that the MMD value ranges between 0.013 and 0.019 as the number of samples varies. The lowest MMD value was achieved with 60 samples. The MMD values for the Cotton test fields from the Mean PCM outputs are shown in Table 10, for the optimum m value of 2.1.
The MMD value for the Cotton fields ranges from 0.31 to 0.33 across the different sample signature files. The results of the MMD for the Pigeon Pea test fields using the ISM-PCM classification are given in Table 11, for the optimum m value of 2.1.
The variance values against sample size for both approaches are compared in Figure 6. With the ISM approach, the variance of the Pigeon Pea test field varies from 0.008 to 0.02, which is lower than with the conventional Mean PCM. This shows that the heterogeneity within the fields was handled better using this approach. The MMD values for the Cotton field using the ISM-PCM classification are shown in Table 12, for the optimum m value of 2.1.
The MMD value for Cotton ranges from 0.30 to 0.35, which is slightly better than the values obtained using the conventional Mean PCM, thereby exhibiting a better separation from Pigeon Pea. It was observed that as the sample size increased, the MMD value for the Pigeon Pea test fields decreased. In the case of Cotton, a decrease in MMD values is seen for the ISM approach, whereas there is a steady increase in the conventional approach.
Ref. [49] followed a temporal approach in the specific mapping of Sugarcane Ratoon. Time series data from LISS-III and AWiFS sensors were separately subjected to the Possibilistic c-Means classifier to extract single-class sub-pixel information. A Fuzzy Modified Possibilistic c-Means (MPCM) classifier was used by [50] to identify and map the vegetation cover in Newai town of Rajasthan. In that study, data from the Sentinel-1 and Sentinel-2 missions were used to obtain the required temporal dates and incorporate phenological and seasonal variation. Thus, the incorporation of dual-sensor data helped in better mapping of the agro-geography. Transplanted paddy crops were mapped using MPCM and Noise Clustering algorithms by [46]. Due to the presence of cloud cover, a single optical sensor was not enough to provide the required temporal scenes, so two optical datasets (Sentinel-2 and Landsat-8) and one SAR dataset (Sentinel-1) were used. The Noise Clustering algorithm outperformed MPCM in the mapping of transplanted paddy crops. In all of these works, pixel-based classifiers used means or variance-covariance statistical parameters generated from training samples. In reality, these statistical parameters do not fully represent the variations existing within a class.
The research work of [50] aimed at the identification of different crops grown in a region by means of the fuzzy PCM algorithm. The MMD attained between two different crops (viz., mustard and wheat) in that study was close to 0.077, indicating very little separation. Although the crops used in [50] are different from those of the current research work, better separation might have been achieved by employing the ISM-based PCM classification. In our research work, an MMD of up to 0.35 was attained using this approach.

Conclusions
The training parameter concept plays an important role in the classification output. When individual samples were used as the input training parameter, a better representation of the field was achieved through the PCM classifier. The variance of the MMD values of the Pigeon Pea test fields indicates that the ISM-PCM classification handles heterogeneity better. The variance achieved with the conventional method was 0.019, whereas that for the ISM approach was 0.008.
Although general classification requires a minimum of 60 samples (considering the 10*n rule), the PCM algorithm works well even with 5 training samples. On increasing the training samples from 5 to 60, the MMD value of the Pigeon Pea test fields decreased from 0.019 to 0.013. The MMD values for the Cotton fields also showed a decreasing trend with an increase in the number of samples. Since the variation in MMD values was quite nominal, it can be concluded that the effect of sample size on the MMD values was not significant in the fuzzy PCM classification. However, the variance decreased with increasing sample size, especially in the ISM training approach. This algorithm exhibits a robust performance even with a minimal number of training samples, from about 15 training samples onwards.