1. Introduction
Traditional Chinese medicine (TCM) [1] originated in ancient China and has evolved over thousands of years as a system of health care and disease treatment [2]. Long before the birth of modern Western medicine, traditional medicinal recipes were handed down orally from generation to generation in many parts of the world [3]. TCM is a practical medicine built on experience and has been mainly practiced and researched in China [4,5], and it has long been regarded as one of the most advanced and experienced medicines in the world [6]. Moreover, studies have shown that TCM can coexist with Western medicine [3,7]. A geoherb is a Chinese herb with a geographical indication tied to a specific geographical location or origin; the indication certifies that the product possesses certain qualities, and its production is protected by intellectual property law [8,9]. Compared with herbal resources produced in other areas, geoherbs are considered to be of much better quality and efficacy [10]. As a highly regarded TCM resource and a rare kind of geoherb in China [11,12], Panax notoginseng (see Figure 1) has been cultivated for more than 400 years in the south-west region of China [13], especially in Wenshan Prefecture, Yunnan Province. Conventional TCM resource surveys focus mainly on the qualitative description of species rather than on natural storage or the dynamic changes of planting fields. As a result, TCM resources are difficult to monitor over time, which is not conducive to the sustainable development of TCM [14]. In the past few decades, a resource census of TCM has been carried out three times (i.e., 1960–1962, 1969–1973, and 1983–1987). It was not until 2009 that the government of China proposed “To carry out a nationwide census of TCM resources, strengthen the monitoring of TCM resources and the construction of an information network” [15]. Furthermore, the State Council of China again highlighted “Strengthening the landscape-scale dynamic monitoring and protection of TCM” [16]. From 2011, the government of China planned to conduct the fourth national census of TCM resources, and remote-sensing techniques were regarded as the core technologies for surveying and monitoring TCM resources over large areas. The inheritance, innovation, modernization, and internationalization of TCM would be the four basic tasks for a considerable period of time [17,18].
Remote-sensing techniques have been applied to monitor land cover at a range of spatial and temporal scales in order to satisfy a variety of scientific and practical requirements [19,20,21]. In particular, remote-sensing mapping is an efficient technique for acquiring spatial and temporal cropland information repeatedly and consistently [22,23]. In many cases, remotely sensed data are used to derive landscape information on a specific land-cover class of interest [24]. The ability to map and monitor land-cover types and their dynamics for diverse applications has been enhanced by the availability, and the constantly increasing coverage, of satellite images [25,26]. Many problems are encountered when land cover is mapped from remotely sensed data by classification analysis or landscape class description [27], which seeks to quantify the relationships among all the pixels in an image, such as similarities or differences in spectral signature or spatial texture [22], and to extract land-cover classes from remote-sensing images [28,29,30]. The application of remote-sensing techniques plays a significant role in the quantitative resource survey of TCM, particularly in exploring monitoring abilities for the sustainable utilization and biodiversity protection of Chinese materia medica resources at the macro scale. Given that remote-sensing techniques provide up-to-date landscape surveying at a fine scale [31], one of the motivations for this work is that only a single class of interest is involved in the mapping task.
Recently, machine learning techniques have been used in a number of remote-sensing applications [32], e.g., geological products [33], vegetation indices [34], aerosol products [35], ocean data products [36], dust source identification [37], and crop identification [38]. In this study, as an original contribution, we apply most of the available single-class data descriptors (SCDDs) through P-learning to map Panax notoginseng fields. Measuring their performance then gives us insights into defining the selection criteria and scoring model for choosing a suitable SCDD. As such, the introduction of SCDDs for remote-sensing mapping of Panax notoginseng fields will help to move the resource inventory and dynamic monitoring of Panax notoginseng in a quantitative direction. Owing to the standardized cultivation technique of good agricultural practice (GAP) [39], i.e., controlled-environment agriculture using shade houses, Panax notoginseng fields exhibit a distinct image texture that allows them to be interpreted visually [40]. Additionally, because Panax notoginseng fields are the only target class, the task of mapping them becomes a specific type of land-cover classification, which can be regarded as a problem of single-class data description or a special type of one-class classification [41]. SCDDs are appealing alternatives to conventional supervised classifiers because they can be trained with only the target training samples [42]. Algorithms of this kind, which require training samples from the target class only, are referred to as P-learning [43]. Note that “single class” means no more than one landscape class, and P-learning-based class description implies that no negative samples are used for training. Such a classification approach aims to identify only one landscape class of interest regardless of the other classes present in the study area [44]. In single-class data description, we always face an imbalanced binary classification [45,46] comprising (1) the positive class (i.e., Panax notoginseng fields); and (2) the negative class (i.e., the other classes). In this case, the positive class is assumed to be sampled well, while the other classes may be sampled very sparsely or be totally absent. When no samples of the other classes are available, most classification errors (e.g., false positives) cannot be estimated [47]. In addition, training an accurate SCDD is challenging, particularly in the face of a large number of unlabeled samples, that is, when only a small class or relatively few training samples are available [42]. Moreover, it can be very expensive to collect negative samples, which are so abundant that a good sampling seems elusive. Although this is an extreme case, we carefully designed the training and test sets, which were composed of qualified positive and negative samples. Note that if we want to improve the overall performance of numerous classifiers that may differ in complexity, a combination of these classifiers is a viable solution [48].
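The idea of P-learning described above, in which a descriptor is fitted to target samples only and everything that falls outside the description is rejected, can be sketched in a few lines. The example below is a minimal illustration and not one of the SCDDs evaluated in this study: it assumes a simple per-feature Gaussian description, and the function names, the `reject_fraction` parameter, and the two-band sample values are all hypothetical.

```python
def fit_gaussian_descriptor(positives, reject_fraction=0.1):
    """Fit a per-feature Gaussian description on target (positive) samples only.

    The threshold is chosen so that roughly `reject_fraction` of the training
    targets fall outside the description (an assumed, illustrative choice).
    """
    n = len(positives)
    dims = len(positives[0])
    means = [sum(s[d] for s in positives) / n for d in range(dims)]
    # The small constant avoids division by zero for constant features.
    variances = [
        sum((s[d] - means[d]) ** 2 for s in positives) / n + 1e-9
        for d in range(dims)
    ]

    def distance(sample):
        # Squared, variance-normalised distance to the class centre.
        return sum((sample[d] - means[d]) ** 2 / variances[d] for d in range(dims))

    train_distances = sorted(distance(s) for s in positives)
    keep = max(1, round((1.0 - reject_fraction) * n))
    threshold = train_distances[keep - 1]
    return distance, threshold


def is_target(sample, distance, threshold):
    """Accept a sample as the target class if it lies inside the description."""
    return distance(sample) <= threshold


# Hypothetical two-band values of target pixels (illustrative only).
targets = [(0.30, 0.52), (0.32, 0.50), (0.29, 0.55), (0.31, 0.51), (0.33, 0.53),
           (0.30, 0.49), (0.28, 0.54), (0.31, 0.50), (0.32, 0.52), (0.29, 0.51)]
distance, threshold = fit_gaussian_descriptor(targets, reject_fraction=0.1)

inside = is_target((0.30, 0.52), distance, threshold)   # close to the class centre
outside = is_target((0.70, 0.10), distance, threshold)  # far from every target
accepted_training = sum(is_target(s, distance, threshold) for s in targets)
```

No negative sample is used anywhere in the fitting step; the only tunable quantity is the fraction of targets that the description is allowed to reject, which is why the false-positive rate cannot be estimated without separate negative samples.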
Another motivation of this work concerns TCM itself: the quality control of TCM remains a significant issue that affects medicinal herbs, formulations, and even TCM practice. The diminishing popularity of TCM is due more to the lax enforcement of standards [49] than to a failure of the remedies. In particular, patchy regulation has led to inconsistent herb quality, unqualified practitioners, unsubstantiated claims for secret formulas, and both deliberate and inadvertent mislabeling and adulteration, sometimes with fatal consequences. Panax notoginseng is a vulnerable crop with a serious succession cropping obstacle [50]; consequently, continuous planting of the crop in the same field leads to a decrease in yield and quality. In order to promote the quality of production, it is crucial to monitor the spatial planting patterns of Panax notoginseng fields, such as crop rotation [51,52]. A planting pattern implies standardized planting with a specific crop structure and spatio-temporal configuration in the same field, for a specific region under particular natural resource and socio-economic conditions [53,54], so as to realize the sustainable utilization of agricultural resources and a sustainable crop yield. The concrete planting pattern changes of Panax notoginseng are still poorly known. To the best of our knowledge, no work has yet been done that, on the one hand, enriches the approaches to monitoring the spatial planting pattern changes of this perennial ginseng from space and, on the other hand, employs SCDDs for mapping Panax notoginseng.
Our studies on mapping Panax notoginseng aim to provide useful information for studies on the quality assurance of TCM production, precision farming, the construction of agro-ecosystems, sustainable development, and the protection of the biodiversity of Panax notoginseng. Furthermore, in addition to mapping the spatial distribution, determining the planting area of Panax notoginseng is an important part of obtaining more accurate information about annual yield and natural storage. The current study, which involves mapping the planting parcels of Panax notoginseng at a 30 m spatial resolution, has three aims: (1) mapping Panax notoginseng fields through a stack of SCDDs as a technical milestone for future planting pattern analysis; (2) evaluating the abilities of SCDDs to identify small Panax notoginseng fields in complex agricultural landscapes; and (3) exploring the possibilities for monitoring the planting pattern changes of Panax notoginseng fields, thereby giving new insights into the planting pattern transitions of this perennial ginseng at the macro scale. The case study area is located in Wenshan City, China, which is characterized by a distinctive crop rotation agricultural system. The highlights of this study include: (1) addressing the landscape-scale remote-sensing interpretation of TCM resources for the first time; (2) employing a stack of SCDDs from a comparative perspective to map Panax notoginseng fields; (3) defining the selection criteria and scoring model for choosing the most appropriate SCDD; and (4) evaluating the abilities of SCDDs to identify the fragmented parcels of Panax notoginseng in complex agricultural landscapes.
The rest of this paper is structured as follows. The description of materials and methods is introduced in Section 2, and the experiments and analysis are presented and discussed in Section 3 and Section 4, respectively. Finally, the conclusions of this work are summarized in Section 5.
4. Discussion
4.1. Selection Criteria
Although extensive statistical analyses have been conducted, it is still not obvious how to recognize which SCDD performs well. In fact, it is not easy to determine which SCDD is good, or even the best, when so many SCDDs are compared on multiple performance measures. Therefore, we set up a handful of simple selection criteria to achieve this goal by means of a rank board (see Table 4). For this work, empirical selection criteria are adopted. Intrinsically, most of the statistical metrics are derived from the four basic outcomes (i.e., the true positive, the true negative, the false positive, and the false negative). The derived measures (i.e., the precision, recall, F1, AUC, OA, KC, PA, and UA) can be analyzed quantitatively by rating and scoring. Note that the AUC measure appears mediocre herein. The OA may not be a reliable metric of the real performance of an SCDD in this study, because it yields misleading results when the data are imbalanced, i.e., when the numbers of observations in different classes vary greatly. The ROC curve and cost graph are used to support the numerical indicators; they are especially suitable for classification problems with only two classes (i.e., the positive and negative classes). The limitation [70] of both ROC analysis and the cost curve is the lack of any effective method to show the performance results obtained from several different data sets in a single plot. This difficulty follows from the fact that two dimensions are already used to present the performance on a single data set.
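The dependence of the derived measures on the four basic outcomes, and the way the OA can mislead on imbalanced data, can be illustrated with a short sketch. It assumes the standard definitions of precision, recall, F1, and overall accuracy, and it reads KC as Cohen's kappa coefficient (an assumption); the counts are hypothetical and are not taken from this study.

```python
def derived_measures(tp, tn, fp, fn):
    """Derive common accuracy measures from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    oa = (tp + tn) / total
    # Cohen's kappa: agreement beyond what is expected by chance.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (oa - expected) / (1 - expected) if expected != 1 else 0.0
    return {"precision": precision, "recall": recall, "F1": f1,
            "OA": oa, "KC": kappa}


# Hypothetical imbalanced test set: 50 target pixels among 1000.
# A descriptor that finds only 10 of the 50 targets still reaches a high OA.
weak = derived_measures(tp=10, tn=945, fp=5, fn=40)
```

With these counts the OA is 0.955 although the recall is only 0.2, whereas the kappa coefficient and F1 stay low; this is the kind of discrepancy that motivates ranking SCDDs on several measures rather than on the OA alone.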
It is also important to establish selection criteria for hybrid classifiers, for example by comparing the performance of an ensemble classifier with that of a member classifier, as is done in this study. The fixed combination strategies, or so-called rules (i.e., the mean rule, median rule, and voting rule), are likely to obtain better classification results, although inferior member classifiers will reduce the overall performance. In particular, the computation time can become a serious problem. It is crucial to address the question of under what criteria one classifier outperforms another. Additionally, a decision needs to be made as to which classifier should be selected over the others, that is, whether a set of selection criteria can be derived for the current operating conditions. It is often easy to create a whole set of SCDDs by varying a parameter setting, such as a threshold or the variables of the mathematical model, or by varying the class distribution in the training set. One commonly used selection criterion is to select the SCDD whose parameter settings and training conditions most closely agree with the current operating conditions, which is called the performance-independent criterion [75]. This is the reason why we try to fix all irrelevant conditions prior to developing the performance-dependent selection criteria. A plain criterion is to choose the qualified SCDDs regardless of their training conditions or parameter settings.
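The fixed combination rules mentioned above (mean, median, and voting) can be sketched as follows. This is a minimal illustration under stated assumptions: each base SCDD is assumed to output a target-class score in [0, 1], the 0.5 decision threshold is arbitrary, and the function name `combine` and the scores are hypothetical.

```python
from statistics import mean, median


def combine(scores, rule="mean", threshold=0.5):
    """Combine the target-class scores of several base classifiers for one sample.

    `scores` holds one value in [0, 1] per base classifier; the decision
    threshold of 0.5 is an assumption made for this illustration.
    """
    if rule == "mean":
        return mean(scores) >= threshold
    if rule == "median":
        return median(scores) >= threshold
    if rule == "voting":
        votes = sum(1 for s in scores if s >= threshold)
        return votes > len(scores) / 2
    raise ValueError(f"unknown rule: {rule}")


# Hypothetical scores from five base SCDDs for a single pixel.
pixel_scores = [0.9, 0.8, 0.45, 0.4, 0.3]
decisions = {rule: combine(pixel_scores, rule)
             for rule in ("mean", "median", "voting")}
```

For this pixel the mean rule accepts the target class while the median and voting rules reject it, which shows that the rules can disagree and that a few confident (or inferior) members can dominate the mean.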
4.2. Scoring Model
Because the SCDDs are assessed with multiple performance measures, it is natural to score them so that all SCDDs can be evaluated quantitatively. A score-oriented method is therefore presented here. In this study, we put forward a scoreboard, built on the rank board, that gives each SCDD an explicit score so that we can determine which SCDD is optimal.
According to Table 5, the rows (M.) denote the measures and the columns (C.) represent the different SCDDs. Signs identify the group to which each metric belongs: “–” denotes an error metric (i.e., the smaller the better), while “+” denotes an accuracy metric (i.e., the greater the better). The score variable $Sc_j$ is calculated by Equations (2)–(4), where $x_i$ represents the measures in the $i$th row, $x_j$ represents the SCDDs in the $j$th column, and $x_{ij}$ denotes the measured value of the $j$th SCDD with the $i$th metric. Likewise, $s_i$ represents the scores in the $i$th row, $s_j$ represents the scores in the $j$th column, and $s_{ij}$ denotes the score value of the $j$th SCDD with the $i$th metric. $Sc_j$ represents the total score of the $j$th SCDD. For simplicity, we assume that there are five SCDDs and five measures to facilitate the illustration of the scoreboard and the derivation of Equations (2)–(4). The parameter $n$ is a key scale used to slice a given metric across all SCDDs so that each SCDD can be assigned a normalized float value (ranging from 0 to $n$) associated with this metric. In the end, the gross score of every SCDD is obtained by summation by column and plotted. Figure 11 shows that two inferior SCDDs, i.e., c5 (underestimated) and c9 (overestimated), are prominently identified. Meanwhile, two slightly inferior SCDDs, i.e., c2 and c6, can be observed again.
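One plausible reading of the scoreboard can be sketched as follows. The sketch assumes that each metric is min-max scaled to the range 0 to n across the SCDDs, with the direction reversed for error metrics, and that the total score is the column sum; the exact form of Equations (2)–(4) may differ, and the function name, the default n, and the sample values are hypothetical.

```python
def scoreboard(values, signs, n=10):
    """Score SCDDs from a measures-by-SCDDs table.

    values[i][j] is the value of metric i for SCDD j; signs[i] is "+" for an
    accuracy metric (greater is better) or "-" for an error metric (smaller
    is better). Min-max scaling to [0, n] is an assumption of this sketch.
    """
    totals = [0.0] * len(values[0])
    for row, sign in zip(values, signs):
        lo, hi = min(row), max(row)
        span = hi - lo
        for j, x in enumerate(row):
            if span == 0:
                s = float(n)  # all SCDDs tie on this metric
            elif sign == "+":
                s = n * (x - lo) / span
            else:
                s = n * (hi - x) / span
            totals[j] += s  # summation by column
    return totals


# Hypothetical values for three SCDDs (columns) and two metrics (rows).
values = [
    [0.90, 0.80, 0.70],  # an accuracy metric, sign "+"
    [0.05, 0.10, 0.25],  # an error metric, sign "-"
]
totals = scoreboard(values, signs=["+", "-"], n=10)
best = max(range(len(totals)), key=totals.__getitem__)
```

Here the first SCDD is best on both metrics and receives the full score of 2n, while the third is worst on both and receives 0; intermediate SCDDs are placed proportionally between them.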
4.3. McNemar’s Test
The four-cell confusion matrix shows intuitively the similarities and differences between the proportions (i.e., the correctly and incorrectly allocated parts) for two sets of specific labels. On the basis of this error table, we use McNemar’s test [76] to assess the statistical significance of the differences between the proportions, and thereby to compare two allocation results, whether obtained by two SCDDs or simply given, under a position-specific comparison. McNemar’s test is particularly useful for comparing paired proportions derived from two sets of samples. Two proportions are used in the formulas: the proportion of testing samples for which the first SCDD is correct and the second is incorrect, and the proportion of testing samples for which the first SCDD is incorrect while the second is correct. Thus, McNemar’s test focuses on the proportions of testing samples for which one SCDD is correct while the other is incorrect (and vice versa) [41]. The standard error $SE_p$ is derived from the difference between these proportions, and $N$ is the total number of pairs of objects. McNemar’s test evaluates the confidence interval for the difference between two accuracy values based on the difference between the proportions. Assuming a normal distribution, the general expression of the confidence interval [76] is given by Equation (7). For a straightforward exploration, we split Equation (7) into a real part and an image part, of which the image part may be more crucial for a proper assessment with regard to a confusion matrix. In this way, the differences are assessed statistically to determine whether they are significant or not [76]. Since three kinds of error matrices are presented in this study, we name them CDt (i.e., classifier-dependent, using the test set), CIa (i.e., classifier-independent, using the reference map of the “sa”), and CIb (i.e., classifier-independent, using the reference map of the “sb”).
The difference between the accuracies yielded by the paired SCDDs (or two sets of labels) is reported together with its lower and upper bounds at the 100 × (1 − 0.05)% (i.e., 95%) confidence level. To make all confidence intervals comparable, the one-sided absolute range around this difference is given in Table 6. The two inferior SCDDs (i.e., c5 and c9) appear again. Of the two slightly inferior SCDDs (i.e., c2 and c6), only c2 can be observed. The results of McNemar’s test thus corroborate the previous analysis.
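The confidence interval described above can be sketched numerically. The example below assumes the standard large-sample interval for the difference between two paired proportions, with standard error sqrt((p12 + p21 − (p12 − p21)²) / N), where p12 and p21 are the two discordant proportions; these symbol names, the function name, and the counts are assumptions made for illustration and are not taken from this study.

```python
import math


def mcnemar_interval(n12, n21, n_total, z=1.96):
    """Confidence interval for the difference between two paired proportions.

    n12: samples the first SCDD labels correctly and the second incorrectly.
    n21: samples the first labels incorrectly and the second correctly.
    n_total: total number of paired samples; z=1.96 corresponds to 95%.
    """
    p12 = n12 / n_total
    p21 = n21 / n_total
    diff = p12 - p21  # equals the difference between the two accuracies
    se = math.sqrt((p12 + p21 - diff ** 2) / n_total)
    half_width = z * se  # one-sided absolute range around the difference
    return diff, half_width, (diff - half_width, diff + half_width)


# Hypothetical counts for 1000 paired test samples.
diff, half_width, (lower, upper) = mcnemar_interval(n12=60, n21=30, n_total=1000)
significant = not (lower <= 0.0 <= upper)
```

If the interval excludes zero, as it does for these counts, the two accuracies differ significantly at the chosen confidence level; reporting only the half-width makes intervals from different pairs of SCDDs directly comparable.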
4.4. Special Concerns and Limitations
Panax notoginseng is a rare kind of ginseng and an ancient, endangered medicinal plant (i.e., a traditional Chinese geoherb). This paper explores the potential of SCDDs for the landscape-scale mapping of Panax notoginseng and provides some insights into their application. We hope that this work can serve as a technical basis for further studies and provide useful information for research on the quality assurance of TCM production, precision farming, the construction of agro-ecosystems, sustainable development, and the protection of the biodiversity of Panax notoginseng. The special concerns and limitations of this study can be summarized as follows:
This work uses a manually collected set of samples of the target class and negative samples collected uniformly under a grid constraint. Some uncertainty remains in that a few land-cover classes may have been left out, even though the classification results look rather good.
Thirteen SCDDs are employed and compared; however, many other algorithms and variants exist. Only the available ones have been used in this study.
The comparison among the different SCDDs is not intended to judge whether they are good or not. Rather, we wish to draw on the abilities of SCDDs in a straightforward way to find the optimal approach to monitoring the planting pattern changes of Panax notoginseng.
Class imbalance is a non-negligible problem in a real, specific land-cover classification using SCDDs. We tried to observe its influence and found two measures that appear mediocre, i.e., the OA and AUC.
The selection criteria and scoring model are presented to determine the optimal SCDD, i.e., the one that stands out and deserves attention.
Dividing the site-specific error matrices according to whether the SCDD is dependent on the training set or not provides a more comprehensive approach to assessing the final results.
Combining SCDDs as base classifiers can reduce the error or improve the accuracy. However, lower computational efficiency is a drawback. Additionally, pruned ensembles can give better performance.
Some classification accuracies may not be reliable indicators, particularly if the training data are imbalanced [41,77]. As single-class data description is a special type of one-class classification, difficulties may arise when trying to fit a single-class learner using the positive samples only. If SCDDs are trained with samples of the single target class, then only the sensitivity can be estimated. Using only the sensitivity to fine-tune an algorithm may result in a class descriptor with high sensitivity but low specificity, which overestimates the true extent of the class of interest [41]. Within the scope of this study, the introduction of single-class data description based on P-learning to the remote-sensing mapping of Panax notoginseng fields provides new insights for promoting the development of the resource inventory and dynamic monitoring of Panax notoginseng.