Knowledge-Based System for Crop Pests and Diseases Recognition

Abstract: With the rapid increase in the world’s population, there is an ever-growing need for a sustainable food supply. Agriculture is one of the pillars for worldwide food provisioning, with fruits and vegetables being essential for a healthy diet. However, in the last few years the worldwide dispersion of virulent plant pests and diseases has caused significant decreases in the yield and quality of crops, in particular fruit, cereal and vegetables. Climate change and the intensification of global trade flows further accentuate the issue. Integrated Pest Management (IPM) is an approach to pest control that aims at maintaining pest insects at tolerable levels, keeping pest populations below an economic injury level. Under these circumstances, the early identification of pests and diseases becomes crucial. In this work, we present the first step towards a fully fledged, semantically enhanced decision support system for IPM. The ultimate goal is to build a complete agricultural knowledge base by gathering data from multiple, heterogeneous sources and to develop a system to assist farmers in decision making concerning the control of pests and diseases. The pest classifier framework has been evaluated in a simulated environment, obtaining an aggregated accuracy of 98.8%.


Introduction
The World Health Organization (WHO) and the Food and Agriculture Organization (FAO) of the United Nations agreed on the following definition: organic agriculture is a holistic production management system which promotes and enhances agroecosystem health, including biodiversity, biological cycles, and soil biological activity. It emphasizes the use of management practices in preference to the use of off-farm inputs, taking into account that regional conditions require locally adapted systems. This is accomplished by using, where possible, cultural, biological and mechanical methods, as opposed to using synthetic materials, to fulfil any specific function within the system [1]. Thus, beyond ensuring the provision of food for the increasing world population, organic agriculture is concerned with sustainability [2]. If pests and diseases are one of the main threats to crop yields when employing conventional farming practices, in organic agriculture, in which the application of synthetic chemical fertilizers and pesticides is prohibited, the impact could be devastating [3]. For that reason, the general approach in organic agriculture is to apply management practices aiming at preventing pests and diseases from affecting a crop, rather than treating the symptoms. In addition, globalization and climate change are contributing to the emergence of new diseases and to their spread [4,5]. Under these circumstances, early detection of the outbreak of a pest or disease becomes paramount to reduce yield losses and their corresponding economic damage. Both small and large farm owners should be provided with access to relevant information about best practices in organic agriculture and the allowed methods to fight crop pests and diseases. However, in most cases such information is dispersed throughout
• A knowledge model representing the crop pest domain in the form of an ontology has been defined as a revision of our previous work [22].
• Natural language processing tools have been used to automatically populate the ontology from unstructured data sources in Spanish. During this process, data about the plant pests and diseases, including their symptoms and recommended treatments, are gathered.
• A novel approach for representing the symptoms associated with plant diseases is proposed, based on the combination of plant parts and observed damage.
• A knowledge-based crop pest recognizer has been built which is capable of identifying multiple overlapping pests. The proposed framework is easily extensible to support new evidentiary inputs such as images.
• Sustainable agricultural practices are fostered by suggesting organic agriculture-compliant treatments. Nonetheless, the proposed framework provides support for both conventional and organic management strategies.
• A dataset has been compiled with symptom-pest associations for three crops: almond tree, olive tree and grape vine. In total, the dataset contains 212 symptoms declared by means of sentences in Spanish, connected to 75 pests and diseases. This dataset is publicly available at http://agrisemantics.inf.um.es/datasets/ (accessed on 9 April 2021).
The rest of this paper is organized as follows. Section 2 provides the background information on IT-enabled tools for plant pests and diseases recognition and current approaches to knowledge acquisition from natural language texts. The framework proposed in this work to automatically identify the pest or disease that has affected a crop is described in Section 3. Section 4 shows a preliminary validation analysis of the framework in a simulated environment, and finally, our conclusions and future work are put forward in Section 5.

Related Work
In this paper, a plant pest diagnosis system is described that leverages an automatically populated knowledge base to boost the overall accuracy results. In the last few years, researchers in the field of agronomy have proposed a significant number of ways of recognizing plant pests and diseases, while semantic technologies have simultaneously been leveraged to improve the performance of natural language processing tools in different application domains. In this section, various approaches to plant diseases' identification and management will be discussed and the most representative works in ontology-driven natural language processing will be listed.

Pests and Diseases Recognition
Among the uses of ICTs in agriculture, that of automatic pests and diseases recognition is extensive [24,25]. The most common approach is that of image processing [11,26,27] using sophisticated artificial intelligence techniques such as deep learning [28][29][30][31][32][33][34]. In some cases, image processing is complemented with information retrieved by sensors [35][36][37] or other inputs [38]. Scarcer are the approaches relying on other evidence, such as odor [39,40], weather [41,42], or rule-based systems triggered by symptoms introduced manually in natural language [43][44][45][46][47]. While some solutions focus on a specific crop or a single condition (throughout this manuscript, the term "condition" is used as a synonym for "pest or disease") reaching very high accuracy values [35,41,48–53], others struggle to achieve good precision results dealing with a large number of conditions in different crops [38,54–56]. A common issue hampering image-based tools for plant pest identification is that of the scarcity of images available to train the deep learning method in question [57]. To overcome this limitation, in Reference [58] the authors present a method to generate complete plant lesion leaf images with the aim to assist in improving the recognition accuracy of the classification tool.
In previous works, we have explored different syntactic-based approaches to pest identification [21,59,60]. In Reference [59] we describe a tool that relies on Google Cloud's Vision API (https://cloud.google.com/vision/, accessed on 9 April 2021) to recognize the pest or disease affecting a plant from an image taken by the farmer. The image is sent to the Vision API, which automatically assigns labels to the image. The labels are then compared against the data available in a local database containing information about common diseases, their symptoms and suggested treatments. If one of the labels matches the name of one of the recorded pests or diseases, then all the information about that pest or disease is shown to the user. Similarly, in Reference [60] an image-based approach is proposed. Here, instead of relying on an external service such as Google Cloud's Vision API, we built our own image processing tool by using Convolutional Neural Networks. The network was trained using a large set of images of common conditions affecting stone and citrus fruit trees along with some other crops. In such "ideal" conditions (only two diseases are considered), precision reached 90%. Again, if the diagnosis is successful, both the identified condition and the suggested treatment are shown to the user. Finally, an NLP-based approach was presented in [21], where the GATE framework ("General Architecture for Text Engineering", https://gate.ac.uk/, accessed on 9 April 2021) is leveraged to process the textual description of the visible symptoms and impacts of the pest or disease. The keywords retrieved by GATE are then compared against a pest control database and the most likely causes of the problem are shown along with the recommended treatments.
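The label-matching step described for Reference [59] can be illustrated with a minimal sketch. It is not the code of that tool: the database contents, the function name `lookup_condition` and the case-insensitive exact-match rule are assumptions made for illustration, and the labels are passed in as a plain list rather than obtained from any image service.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical local database: condition name -> recorded information.
# The entries are placeholders, not data from the original tool.
LOCAL_DB: Dict[str, Dict[str, str]] = {
    "powdery mildew": {"symptoms": "placeholder symptoms", "treatment": "placeholder treatment"},
    "leaf rust": {"symptoms": "placeholder symptoms", "treatment": "placeholder treatment"},
}


def lookup_condition(
    labels: List[str], database: Dict[str, Dict[str, str]]
) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (name, record) for the first label naming a recorded condition, else None."""
    for label in labels:
        key = label.strip().lower()  # assumed normalization: trim and lowercase
        if key in database:
            return key, database[key]
    return None


# Labels as an image-labelling service might return them (made-up values).
match = lookup_condition(["Leaf", "Plant", "Powdery Mildew"], LOCAL_DB)
no_match = lookup_condition(["Leaf", "Plant"], LOCAL_DB)
```

A diagnosis of this kind succeeds only when a label coincides with a recorded name, which is one reason the later sections move to symptom-level matching.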
Image-based pest recognition techniques benefit from the current widespread use of phones with high-resolution cameras. However, the range of pests and diseases covered by these approaches is limited to those with a visible impact on plants and their structural components (stems, leaves, flowers or fruit). Other conditions associated with injury that cannot be captured by a photographic shot will not be identified by these tools (e.g., premature fruit drop). Language-guided approaches are more flexible in terms of coverage, but their accuracy is hampered by the inherent ambiguity and imprecision of natural language. Ontologies and other related semantic technologies have proven to be useful to limit the effects of language ambiguity in different scenarios [18,61,62]. The use of knowledge technologies in agriculture is very broad [20]. Currently, there are several ontologies and structured vocabularies available in the agronomy domain [63][64][65][66][67]. AgroPortal [68] has become the reference repository in which most of the vocabularies and ontologies produced to represent and annotate agronomic data are hosted. More specifically, in Reference [69] the authors provide a detailed review on the use of knowledge graphs in the crop pests and diseases domain.
By building upon such formal collections of terms, several applications have been developed to assist farmers in their day-to-day practices, including pest control [12,43,70–72]. In Reference [43] the authors describe a knowledge-based system to support the diagnosis of plant diseases. The system rests on a rule-based engine built with the assistance of domain experts. If the symptoms described by the farmer trigger a rule, then a diagnosis is provided, and relevant treatments and recommendations are suggested to the farmer. The way in which symptoms are entered into the system is not clear, but it relies on the perception of the farmer. Ontologies are also leveraged in [70] to model the interrelation between crops, pests and treatments. Once the model has been automatically populated from a number of heterogeneous sources (official guides) including 462 crops, 549 pests and 42,397 treatments, a recommendation system suggests the required treatment given the crop and the symptoms. While the approach is similar to ours (i.e., use of natural language processing to build a knowledge base that feeds a recommender system), the focus is set on different stages of the process. In our work, the main goal is to assist farmers in identifying the pests and diseases in their crops, whereas in Reference [70] the authors focus on obtaining a fully fledged knowledge base and pay less attention to the symptom-pest matching. The symptoms in our work are modelled in the form "plant part-damage", thus simplifying the matching between the symptoms in the knowledge base associated with each pest and the symptoms entered by users (which are also processed using NLP). In Reference [70] a text field is used to store the textual description of the produced symptoms; then a regular expression with the symptoms indicated by users is embedded in a SPARQL query for finding matches in the knowledge base.
Additionally, one of the main drawbacks of this work is that it has not been validated using traditional information retrieval or recommender systems evaluation metrics. The authors in [71] present an ontology-based agro advisory system with the aim to bridge the gap between farmers and agriculture domain experts by integrating various data resources. A built-from-scratch cotton crop ontology constitutes the knowledge base for the proposed expert system, which provides advice to farmers given keyword-based queries. The collection of information regarding cotton farming practices to nourish the knowledge base was done manually from different sources and with the assistance of experts in the field. Another related approach is suggested in [72], where plant diseases are modelled in the form of ontology elements and the diseases likely affecting a crop are retrieved given farmers' observations. To issue queries to the knowledge base, those observations should be transformed into Web Ontology Language (OWL) concepts. Besides, the validation of the proposed system is limited to the conditions associated with a single crop, namely, rice, for which the authors developed a rice disease ontology. Finally, in Reference [12] the authors describe AgriEnt, a knowledge-based Web platform for assisting farmers in the crop insect pest diagnosis and management. The AgriEnt-Ontology constitutes the cornerstone of the platform, an ontology representing knowledge about crops, diseases, symptoms, insects, insect pests, and treatment recommendations. To populate the knowledge base, crop insect pests' records generated by agricultural entomology experts as well as academic publications were collected. Then, a rule-based inference engine built using the Semantic Web Rule Language (https://www.w3.org/Submission/SWRL/, accessed on 9 April 2021) (SWRL) is used to explore the symptoms and provide a diagnosis. Again, experts in the field were required to define the rules. 
Furthermore, a diagnosis is only reached when all the symptoms defined in the rule have been pointed out by the user. The average accuracy obtained by the system for the six crops considered, namely, sugar, cocoa, corn, rice, banana and soya, is above 82%.
In this work, a novel approach to the recognition of crop pests and diseases based on the combination of language technologies and semantic conceptual representations is proposed. To build this expert system, no human expert was required since all the required knowledge was gathered from available resources. Our framework makes use of a formula to calculate the likelihood that each pest connected to a given crop is the one associated with the symptoms pointed out by the farmer. The obtained scores allow the system to provide a ranked list of the possible conditions affecting a crop. As a consequence, if more than one pest or disease is actually present, the farmer can become aware of such a circumstance.
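To make the idea of a ranked list concrete, the following is a minimal sketch of score-based ranking. The scoring formula of the framework is not reproduced in this section, so the sketch substitutes an assumed stand-in: the fraction of a condition's recorded symptoms that the farmer has pointed out. The pest names, the symptom pairs and the function name `rank_conditions` are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]  # (plant part, damage)


def rank_conditions(
    observed: Set[Pair], knowledge_base: Dict[str, Set[Pair]]
) -> List[Tuple[str, float]]:
    """Rank conditions by an ASSUMED score: matched symptoms / recorded symptoms."""
    scored = []
    for pest, recorded in knowledge_base.items():
        if not recorded:
            continue  # nothing to match against
        score = len(observed & recorded) / len(recorded)
        if score > 0:
            scored.append((pest, score))
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical knowledge base for one crop.
kb = {
    "pest_a": {("leaf", "spot"), ("fruit", "rot")},
    "pest_b": {("leaf", "spot"), ("stem", "canker")},
    "pest_c": {("root", "gall")},
}
ranking = rank_conditions({("leaf", "spot"), ("fruit", "rot")}, kb)
```

Because every condition with a non-zero score is kept, two conditions that are simultaneously present both appear in the output rather than only the top-scoring one.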

Language Technologies for Knowledge Acquisition
The manual construction of ontologies is a demanding task which needs a great deal of time and resources. To avoid it, several studies have been conducted lately on their automatic construction and update [73,74]. It is possible to distinguish three main categories: ontology learning, ontology population (a.k.a., ontology instantiation), and ontology evolution (a.k.a., ontology enrichment). Ontology learning involves the extraction of new concepts, relations, attributes, and axioms [75,76]. Due to this processing, the terminological component of ontologies (TBox) is modified. On the other hand, the automatic instantiation of ontologies [77] extracts and classifies the instances of the concepts and features which have been defined by ontologies (ABox). The starting point of the ontology's instantiation is usually a partially instantiated ontology or a combination of possible individuals or named entities and relations between those entities. The stages of ontology learning and instantiation from text in natural language are mainly term extraction, synonym detection, concept creation, named entity detection, the creation of concept hierarchies, the extraction of other nontaxonomic relations, and axiom acquisition [78]. Although the stages for the creation of concept hierarchies have obtained very good results for different languages, further research is currently being conducted on the automatic extraction of taxonomic relations, nontaxonomic relations and axioms [79], since current approaches have not yielded satisfying results [80]. The main problem of these automatic extraction strategies is that most of them aim to detect a predetermined combination of relations such as partonomy, time, causality, etc. [79]. As regards axiom extraction, there are few studies trying to extract simple axioms like those dealing with nontaxonomic relations [80,81].
Another drawback is that there is limited research to fulfill this task in Spanish, where the authors have made a major contribution [82]. The evolution of ontologies is based on the two technologies explained above and it not only deals with the creation of new information, but also with the updating (creation, modification and deletion) of elements as the domain changes over time. Currently, there are some satisfactory solutions [83], but they pose the same problems as those mentioned above for language technologies and automatic instantiation of ontologies.
On the other hand, text annotation can be considered as a process that enables the mapping of concepts, relations, comments or descriptions to a document or a text extract. Overall, annotations can be assimilated to metadata associated with particular text extracts from a document or any other pieces of information. Semantic annotation helps deal with natural language ambiguity and its representation through ontologies [84,85]. This process involves relating text extracts with tags representing ontological elements (concepts, relations, attributes, and instances), which enables document processing by software systems. The major limitation of these methodologies is their reliance on static knowledge; thus, the ontologies do not evolve over time. Recent studies conducted by the authors of this work provide tools for semantic annotation based on ontology evolution technology [86,87]. Finally, it is worth noting that new deep learning technologies are being applied to traditional ontology learning tasks in different languages [88].
In this work, we built upon existing natural language processing resources to develop an automatic ontology population tool which is used to gather relevant data from unstructured documents and create the corresponding instances in the ontology. For future work, we plan to exploit our previous experience in ontology evolution to apply refinement actions and enable the adaptation of the knowledge base to this changing domain.

Crop Pests and Diseases Identification from Natural Language Text
In this work, an expert system to classify symptoms expressed in natural language into crop pests and diseases is proposed. This section provides a detailed description of the proposed framework. Next, the functional architecture of the framework is presented, and its main components are explained.

Proposed Framework
The functional architecture of the proposed system is shown in Figure 1 and comprises three main components: (i) the pests and diseases management ontology (CropPestO); (ii) the knowledge base population tool (KB Instantiator); and (iii) the crop symptoms analyzer. The input to the system is a list of symptoms expressed in natural language that represent the harmful effects of a likely pest or disease affecting a given plant (users select the crop from a list of the crops found in the knowledge base), while the output is an ordered list of crop pests and diseases matching the provided symptoms.
The system works in two steps. The first step takes place before the system is made available to users and consists of the population of the knowledge base. During this stage, a number of unstructured documents elaborated by experts in the field of IPM are processed by the KB Instantiator. This component takes into account the reference data model, a previously defined pests and diseases management ontology named CropPestO, to transform the natural language input into a number of instances to be added to the knowledge base. Once the knowledge base has been fully populated, the system becomes functional and users can interact with it. Users indicate both the crop that is likely affected by some condition (that is, they select the crop in question from the list of crops included in the knowledge base) and the observed symptoms, which are described in natural language. During this second stage, the Crop Symptoms Analyzer processes the entered symptoms and matches them with the ones previously introduced in the knowledge base. From the matches found, a ranked list with the most likely conditions, along with the suggested control methods, is shown to the users.

CropPestO: Pests and Diseases Management Ontology
The pests and diseases knowledge base, which constitutes the cornerstone of the proposed approach, is based on a domain ontological scheme that has been designed by following the steps suggested in the "Ontology Development 101" guide [89]. While there are a number of ontologies in the agronomy domain and, more specifically, in the crop pests and diseases field, none fit the requirements of an organic agriculture-based pest control recommender system. The scope of the ontology has been limited by the following competency questions (i.e., questions the ontology should help to answer): (i) Which measures should be applied to prevent the outbreak of a disease or pest?; (ii) What evidence suggests an outbreak of a disease or pest in a crop?; (iii) Which disease or pest is present in a crop?; and (iv) Which measures should be applied at any given moment to treat a disease or pest? The focus is set on organic agriculture, so organic-compliant control methods are highlighted.
In the development of the ontology, some of the terms included in the AGROVOC thesaurus [90] were reused. AGROVOC is a controlled vocabulary built by the United Nations' FAO, with more than 37,000 concepts and 750,000 terms in up to 37 languages covering elements related to food, nutrition, environment, plant cultivation techniques, etc. AGROVOC was recently published as a linked open data (LOD) set and is aligned with 18 other datasets related to agriculture. Besides, AGROVOC satisfies our requirements in terms of both completeness and formality. Regarding completeness, it includes a large number of domain-relevant concepts and is actively maintained and updated (http://aims.fao.org/agrovoc/releases, accessed on 9 April 2021); in particular, it includes concepts tagged in both English and Spanish for all the pathogens, plants and plant products covered in the processed IPM-related documents. Regarding formality, it offers enough semantic expressivity for our purposes. The upper-level concepts that form the backbone of the ontology are as follows: "Plant Product" (i.e., a product produced by a plant including cereals, fruits, legumes, among others), "Pest" (i.e., this concept encompasses both diseases and pests that can inflict damage on plants or plant products, such as fruit flies or Tuta absoluta), "Control Method" (i.e., a technique that can be applied to avoid or reduce the harmful effects of pests and diseases, such as trap cropping or mating disruption), "Plantae" (i.e., plant; the focus is set on those that produce basic human foods such as grain, fruit and vegetables, including Solanum lycopersicum or Prunus armeniaca, among others), and "Symptom" (i.e., a physical feature which is regarded as indicating a condition of disease, such as fruit rot or leaf spot).
Additionally, to assist in the resolution of the abovementioned competency questions, the following relationships between the upper-level elements were considered (along with their inverse relationships): "Plantae produces Plant Product", "Plant Product hasPest Pest", "Symptom isInfluencedBy Pest", and "Control Method controls Pest".
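The backbone just described can be written down as a handful of class-level triples. The sketch below uses plain Python tuples rather than an OWL library, purely to show the shape of the model; the four relationship names come from the text, whereas the names given to the inverse relationships (`isProducedBy`, `isPestOf`, `influences`, `isControlledBy`) are hypothetical, since the text only states that inverses exist.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (domain class, relationship, range class)

# Upper-level relationships as listed in the text.
RELATIONS: List[Triple] = [
    ("Plantae", "produces", "Plant Product"),
    ("Plant Product", "hasPest", "Pest"),
    ("Symptom", "isInfluencedBy", "Pest"),
    ("Control Method", "controls", "Pest"),
]

# Hypothetical names for the inverse relationships (not given in the text).
INVERSE_NAMES: Dict[str, str] = {
    "produces": "isProducedBy",
    "hasPest": "isPestOf",
    "isInfluencedBy": "influences",
    "controls": "isControlledBy",
}


def with_inverses(relations: List[Triple], inverse_names: Dict[str, str]) -> List[Triple]:
    """Return the relations plus one inverse triple for each of them."""
    inverses = [(rng, inverse_names[rel], dom) for dom, rel, rng in relations]
    return relations + inverses


def relations_from(cls: str, relations: List[Triple]) -> List[Triple]:
    """All relationships whose domain is the given class."""
    return [t for t in relations if t[0] == cls]


all_relations = with_inverses(RELATIONS, INVERSE_NAMES)
pest_links = relations_from("Pest", all_relations)
```

Seen this way, "Pest" is the hub of the model: crops, symptoms and control methods are each reachable from it in one step, which is what the competency questions require.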
The ontology has been developed in OWL 2 [91] and is available at http://agrisemantics.inf.um.es/ontologies/CropPestOv2.owl (accessed on 9 April 2021). More details on the ontology construction process can be found in [22]. An excerpt of the ontology, including the high-level classes, is depicted in Figure 2. In the figure, the classes and relationships directly extracted from AGROVOC are represented in green. Since the framework has been originally conceived to be used by Spanish-speaking farmers (e.g., the documents used for populating the ontology are Spanish reference guides for IPM in different crops, see Section 3.3), the ontology has been labelled in Spanish (besides English). In total, the populated ontology contains 286 classes, 8 object properties, 11,754 individuals, and 96,550 axioms. The correctness of the resulting ontology has been checked using the following tools: (i) the RDFS Validator (https://www.w3.org/RDF/Validator/, accessed on 9 April 2021) to repair the definitions of concepts, relations, and instances; (ii) OOPS! (the OntOlogy Pitfall Scanner!, http://oops.linkeddata.es/catalogue.jsp, accessed on 9 April 2021) to identify deficiencies in metadata information such as license and version information, among others; and (iii) OQuaRE (https://semantics.inf.um.es/ontology-metrics/, accessed on 9 April 2021) to test the model's features.

KB Instantiator: Knowledge Base Population
In agriculture, pest control is a vast field where each crop can be infected with different types of infectious agents. In this work, the focus is set on the crops grown in Spain, but the proposed framework can be easily adapted to other environments. To prepare the list of supported crops, the Web portal of the Spanish Ministry for Agriculture, Fisheries and Food (https://www.mapa.gob.es/es/agricultura/temas/default.aspx, accessed on 9 April 2021) has been queried. It contains detailed information and statistics about the production and exploitation of the crops in this country. Besides, it offers a set of official documents written by agronomy experts to guide farmers in applying the most convenient control methods according to an IPM strategy to combat the pests and diseases that affect a number of different crops (https://www.mapa.gob.es/es/agricultura/temas/sanidadvegetal/productos-fitosanitarios/guias-gestion-plagas/, accessed on 9 April 2021). Each document provides information about the pathogenic agents known to attack a given crop. It also describes their associated symptomatology and provides details about the most appropriate strategies to monitor, prevent and combat these infectious agents. These IPM-related documents have been used to create the relationships between crops (i.e., "Plant Product") and their associated pests and diseases (i.e., "Pest"), and between each pest/disease (i.e., "Pest") and the associated symptoms (i.e., "Symptom") and the suggested treatment (i.e., "Control Method"), as described below. In this section, the process carried out to populate the ontology is described in detail. This process encompasses the steps enumerated next.
A variety of dictionaries have been created to assist in both the definition of the ontology scheme and in its instantiation. The process starts with the definition of the list of supported crops. To create this list, the report available in [92], which studies the performance of crops and crop groups with great relevancy in the Spanish economy, has been analyzed. The document includes an annex with a list of the crops produced in towns and provinces. The Snowtide library (https://www.snowtide.com, accessed on 9 April 2021) has been used to process the PDF file, extract the list of crops and create the glossary of crops. Besides, the resulting list has been manually enriched with the plants that produce the crop. The resulting resource is a file with a list of pairs relating each crop (i.e., "Plant Product") with the plant producing it (i.e., "Plantae").
Similarly, the IPM-related documents mentioned above have been processed to extract: (i) pest names (i.e., "Pest"); (ii) the damage produced by such pests (i.e., "Symptom"); and (iii) their recommended treatments (i.e., "Control Method"). As a result, two new files are defined for each document: one includes pest names, the crops known to be attacked by such pests and diseases, and the associated symptoms and damage produced; the other associates each pest with the recommended control methods, which are described in tabular format. Further information was gathered from the resource at [93]. This official document provides a detailed description of a wide variety of pathogenic agents of plants. It offers a complete classification of the plant pathogens observed in Spain, including viruses, viroids, bacteria and fungi, among others. It also provides specific sections with further details about synonyms, the taxonomy they belong to, associated symptoms, and hosts affected. A script was implemented to process each pathogen from the document and extract the associated details. In all the processed documents, the connections between pests, the plant products that they harm, their symptomatology, and the known control methods to limit their impact and spread are made explicit and can be easily reproduced in the knowledge base.
However, to facilitate the search in the knowledge base for symptoms matching those expressed by the users of the system (a key step in the pest recognition process), we conceived a novel approach to represent the symptoms. In particular, a two-step method has been defined to process the natural language sentences describing the pests' associated symptomatology in the documents. First, all relevant information is gathered, and the text is split into sentences. Then, these sentences are analyzed and only those providing specific details about the effects of the pest or disease on the plant and on the plant product are kept. To do so, a recursive method was designed which analyzes the syntactic dependency graph of each sentence. During this search, it uses a morphology glossary and a phytopathology dictionary (http://www.ub.edu/vocabularia/archives/4518, accessed on 9 April 2021) to identify terms related to parts of plants and to damages, respectively. The morphology glossary was created in a semisupervised manner. First, a frequency analysis tool was used to identify the most frequently used words in the sections dealing with pests' symptomatology. A stopwords list was employed next to remove non-domain-specific terms. Then, a botanical dictionary (https://www.arbolesornamentales.es/glosario.htm, accessed on 9 April 2021) was exploited to identify those words specifically related to the plant domain. As a result, a list of terms sorted by number of occurrences is obtained. Finally, the list was analyzed to remove those words which are related to the plant domain but not explicitly to plant morphology. Once the relevant sentences have been identified, they are processed so that "plant part-damage" pairs are obtained, generating a new resource in which each pest is associated with the gathered pairs. Algorithm 1 describes the pseudocode of the two-step method.
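As a rough illustration of the recursive traversal behind Algorithm 1, the following Python sketch works on a hand-built dependency tree rather than on real parser output. The `Token` class, the helper names and the tiny dictionaries are assumptions made for this example; only the names `mg` (morphology glossary), `dp` (phytopathology dictionary) and `sw` (stopwords list) follow the inputs of the algorithm. With spaCy, tokens expose comparable `head` and `children` attributes.

```python
class Token:
    """Toy stand-in for a parsed token (hypothetical, not spaCy's class)."""

    def __init__(self, text):
        self.text = text
        self.head = None      # parent in the dependency tree (None for the root)
        self.children = []    # direct dependents

    def add(self, child):
        child.head = self
        self.children.append(child)
        return child


def root_of(token):
    """Climb the dependency tree until the (verb) root is reached."""
    while token.head is not None:
        token = token.head
    return token


def collect_damages(token, dp, sw, stack):
    """Recursively visit the subtree, stacking terms found in the damage dictionary."""
    word = token.text.lower()
    if word not in sw and word in dp:
        stack.append(word)
    for child in token.children:
        collect_damages(child, dp, sw, stack)
    return stack


def extract_pairs(tokens, mg, dp, sw):
    """Return (plant part, damage) pairs for one tokenized sentence."""
    pairs = []
    for tok in tokens:
        part = tok.text.lower()
        if part in mg:  # the token names a part of the plant
            for damage in collect_damages(root_of(tok), dp, sw, []):
                pairs.append((part, damage))
    return pairs


# Hand-built tree for "The almond tree produces black fruit".
produces = Token("produces")
tree = produces.add(Token("tree"))
the = tree.add(Token("The"))
almond = tree.add(Token("almond"))
fruit = produces.add(Token("fruit"))
black = fruit.add(Token("black"))
sentence = [the, almond, tree, produces, black, fruit]

# Illustrative dictionary contents (assumed, not taken from the real resources).
pairs = extract_pairs(sentence, mg={"almond", "fruit"}, dp={"black"}, sw={"the"})
print(pairs)
```

In this toy setting every damage term reachable from the root is paired with every plant part in the sentence; a production version would probably restrict the pairing to terms that are syntactically close to the plant part.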
As inputs, the method receives four parameters: s, mg, dp and sw. The s parameter stands for the sentences to be analyzed; mg represents the morphology glossary used to identify parts of the plant; dp denotes the phytopathology dictionary used to identify the plants' injuries; and sw is the stopwords list used to filter out terms that carry no useful information, such as prepositions and conjunctions. The method uses the spaCy library (https://spacy.io, accessed on 9 April 2021) to extract the syntactic dependency graph of a sentence. spaCy is an open-source library for Natural Language Processing capable of analyzing large volumes of text. Among the functions it provides are Named Entity Recognition, part-of-speech tagging, syntax-driven sentence segmentation, and integrated visualizers for syntax. When a sentence is given, the method splits it into tokens. First, it checks whether a token is a term related to plant morphology by using the morphology glossary. If so, the next step is to traverse the dependency graph to obtain the verb root of the sentence. Then, it traverses the graph recursively to analyze the terms related to this token. For each extracted term, the method checks whether it belongs to the phytopathology domain by using the phytopathology dictionary. If so, the term is stored in a stack. The procedure finishes when all of the token's dependencies have been analyzed.

Once all the resources described above are available, the ontology scheme can be enriched and the knowledge base populated. Initially, the algorithm loads the aforementioned dictionaries. Then, it creates the base taxonomy and defines the object properties. The initial structure is composed of the high-level classes depicted in Figure 2, namely, "Plantae", "Plant Product", "Pest" (and its hierarchy), "Symptom", and "Control Method" (and its hierarchy). Then, the method starts evolving the ontology's hierarchy by inserting crops.
To integrate the crops in the CropPestO ontology, an algorithm has been implemented that recursively traverses the AGROVOC hierarchy, collecting the upper categories of a given concept. When a term referring to either a plant product or a plant is retrieved from one of the dictionaries, the method locates the concept in AGROVOC and recursively iterates through its upper categories, systematically adding each concept found as a subclass in the hierarchy. The process finishes when the top concept "Plantae" or "Plant Product" is found. As a result, the whole path from the given concept to those high-level classes is inserted in the CropPestO ontology. Next, the method inserts the symptoms, relating them to the crops and the pests. As described above, the symptoms dictionary not only contains lists of "plant part-damage" pairs; it also indicates the pest producing those symptoms and the affected crop. When a pair is inserted in the ontology as an instance of "Symptom", its relationship with the condition causing such effects (i.e., "Pest") is defined by means of the "influences" object property, and the relationship between the latter and the plant product (i.e., "Plant Product") afflicted by it is defined through the "hasPest" object property. Finally, the method integrates the treatments (i.e., "Control Method"). As pointed out above, in the guides, each treatment is related to a particular pest. Thus, to relate this information, it is enough to look for each pest in the model and associate its respective treatment.
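The upward traversal used to insert crops can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the `broader` dictionary is an invented, single-parent stand-in for the AGROVOC hierarchy (real concepts may have several broader concepts and would be retrieved from the thesaurus itself), and the function names are hypothetical.

```python
def path_to_top(concept, broader, top_concepts):
    """Return the chain from `concept` up to the first top-level class."""
    if concept in top_concepts:
        return [concept]
    parent = broader.get(concept)
    if parent is None:
        raise KeyError(f"no path from {concept!r} to a top-level class")
    return [concept] + path_to_top(parent, broader, top_concepts)


def insert_path(subclass_of, path):
    """Record each concept in the path as a subclass of the next one."""
    for child, parent in zip(path, path[1:]):
        subclass_of[child] = parent
    return subclass_of


# Invented toy hierarchy; labels do not reproduce actual AGROVOC entries.
broader = {"Almond": "Nuts", "Nuts": "Plant Product"}
top_concepts = {"Plantae", "Plant Product"}

path = path_to_top("Almond", broader, top_concepts)
ontology = insert_path({}, path)
print(path)
print(ontology)
```

Stopping at the first top-level class mirrors the termination condition described above, so only the branch that connects a crop to "Plantae" or "Plant Product" is copied into the ontology.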
As a general overview of the content of the populated ontology, Table 1 enumerates the plant products connected to the highest number of pests, Table 2 lists the pests linked to the highest number of symptoms, and Table 3 lists the symptoms associated with the highest number of pests.

Table 3. Symptoms and number of associated pests (partially in Spanish).

Crop Symptoms Analyzer
The analyzer is the entry point of the system, and it provides the farmer with a natural language interface to interact with it. Through this interface, a farmer can describe the symptoms observed in a given plant. The input is composed of (i) the plant, chosen from the list of plants supported by the system (that is, plants available in the knowledge base), and (ii) a list of observed symptoms. The module answers with a list of pests and diseases ranked according to the matches found with the symptoms of the diseases described in the knowledge base. Thus, when the module receives a symptom description, it employs the algorithm described above (see Algorithm 1) to decompose the sentence and keep only those tokens related to the plant domain. As a result, a list of pairs is retrieved in which each pair expresses a plant part and its damage. In the next stage, the populated CropPestO ontology is used to find the symptoms linked to each pair. Specifically, for each pair derived from the user input, the analyzer tries to find instances of the class "Symptom" matching that pair. If a match is found, then the instances of "Pest" related to that instance of "Symptom" are automatically selected as candidates for the pest recommendation list. For example, let us suppose that the farmer writes the following sentence: "The almond tree produces black fruit". First, the module would employ the algorithm to keep those terms related to the plant domain. The algorithm would analyze the sentence and obtain "black fruit" and "black almond" as the list of symptoms. Next, for each pair, the populated ontology would be queried to find an exact match. If such a match is found, then the method selects the associated pests and adds them to the recommendation list of pests to be sent back to the farmer.
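The exact-match lookup described above can be illustrated with plain Python dictionaries standing in for the populated ontology. In this sketch, `crop_pests` loosely mirrors the "hasPest" links and `pest_symptoms` the links between pests and their "Symptom" instances; both structures, the function name and the second pest are assumptions for illustration, and a real implementation would query the ontology instead.

```python
def candidate_pests(crop, pairs, crop_pests, pest_symptoms):
    """Map each pest of `crop` to the entered pairs that match its symptoms exactly."""
    matches = {}
    for pest in crop_pests.get(crop, ()):
        known = pest_symptoms.get(pest, set())
        hits = [pair for pair in pairs if pair in known]
        if hits:
            matches[pest] = hits
    return matches


# Toy knowledge base: the second pest and its symptom are invented placeholders.
crop_pests = {"almond": ["Eurytoma amygdali", "hypothetical pest B"]}
pest_symptoms = {
    "Eurytoma amygdali": {("fruit", "black")},
    "hypothetical pest B": {("leaf", "yellow")},
}

# Pairs obtained from "The almond tree produces black fruit".
entered = [("fruit", "black"), ("almond", "black")]
candidates = candidate_pests("almond", entered, crop_pests, pest_symptoms)
print(candidates)
```

Because the comparison is an exact match on the pair, a symptom phrased with a synonym would not be found, which is consistent with the limitations discussed later in the evaluation.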
The formula used to rank all candidate pests given the symptoms entered by the farmer takes into account a measure of sensitivity (importance of the symptom in the total pool of symptoms associated with a given pest) and a measure of specificity (number of pests to which a given symptom is associated). In this formula (Equation (1)), $p_j$ is one of the candidate pests considered, $s$ is the list of all the symptoms entered by the farmer, and $n$ is the total number of symptoms provided by the farmer. The sensitivity term relies on $\mathit{symptoms}(p_j)$, which returns the set of all symptoms associated with pest $p_j$, and the specificity term relies on $\mathit{pests}(s_i)$, which returns the set of all pests associated with symptom $s_i$. The score in Equation (1) is calculated for all candidate pests, that is, for all the pests associated with the crop at hand for which at least one of the entered symptoms matches one of the symptoms associated with the pest in the knowledge base. The rationale behind the formula is that (i) if only a few symptoms are associated with a given pest and one of these symptoms has been entered by the user, then that pest is a very likely candidate, and (ii) if a symptom is associated with very few pests and this symptom is entered by the user, then those pests are also very likely candidates. A candidate pest is included in the recommendation list shown to the user if its score is above a given threshold (to be defined by the administrator). Therefore, if the plantation is afflicted by more than one pest or disease, the farmer can become aware of such a circumstance.
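The ranking step can be illustrated with a small Python sketch. The functional forms below are assumptions chosen to agree with the rationale above, not necessarily the exact equations of this work: sensitivity is taken as 1/|symptoms(p_j)| when the entered symptom belongs to the pest (0 otherwise), specificity as 1/|pests(s_i)|, and the score as the average of their product over the n entered symptoms. Pest and symptom names are placeholders.

```python
def sensitivity(symptom, pest, pest_symptoms):
    """Assumed form: weight of one symptom within the pest's pool of symptoms."""
    symptoms = pest_symptoms[pest]
    return 1 / len(symptoms) if symptom in symptoms else 0.0


def specificity(symptom, pest_symptoms):
    """Assumed form: inverse of the number of pests sharing the symptom."""
    n_pests = sum(1 for symptoms in pest_symptoms.values() if symptom in symptoms)
    return 1 / n_pests if n_pests else 0.0


def score(pest, entered, pest_symptoms):
    """Assumed form: mean of sensitivity * specificity over the entered symptoms."""
    n = len(entered)
    if n == 0:
        return 0.0
    total = sum(
        sensitivity(s, pest, pest_symptoms) * specificity(s, pest_symptoms)
        for s in entered
    )
    return total / n


def recommend(entered, pest_symptoms, threshold):
    """Return (pest, score) pairs above the threshold, best first."""
    scored = ((pest, score(pest, entered, pest_symptoms)) for pest in pest_symptoms)
    kept = [(pest, value) for pest, value in scored if value > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


toy_kb = {"A": {"s1"}, "B": {"s1", "s2"}, "C": {"s3", "s4"}}
ranked = recommend(["s1", "s2"], toy_kb, threshold=0.1)
print(ranked)
```

Under these assumed forms, pest B outranks pest A because both entered symptoms belong to B and one of them is exclusive to it, while pest C is dropped because none of its symptoms were entered.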

Evaluation
This section focuses on the evaluation of the pests and diseases recognition method proposed in this work. First, an exemplary scenario is described representing how the proposed tool can be accessed by its intended users. Then, some details about the dataset used for this validation experiment are put forward and the evaluation metrics are enunciated. Finally, the main results of the experiment are shown and discussed.

Exemplary Usage Scenario
In a typical usage scenario, farmers in the field would observe some worrying signs in their plantation, the likely effects of an unknown pathogen. Under these circumstances, farmers would open the "CropPestIdentifier" app and describe the observed symptoms by means of statements in natural language. Then, the system would process the data and return a list of the pests or diseases that are most probably causing such harm. The flowchart of the app is depicted in Figure 3. The following three steps are required: (i) farmers select the crop and input the observed damage in natural language sentences; (ii) the system analyzes these inputs and leverages the knowledge base to obtain a set of pests that might be producing those damages; and (iii) farmers can visualize detailed information about each retrieved pest, including the recommended treatment.

While our "Crop Symptoms Analyzer" returns a list of the most likely pests affecting the farmer's crops, for evaluation purposes, we only consider the pest that obtains the highest score for each test case. Consequently, it can be treated as a classification problem in which, given the crop under question and all the observed symptoms, the system has to determine the pest associated with such a crop which is more likely to be causing such symptoms. In sum, the classes into which one item of the dataset (i.e., set of symptoms) can be classified are any of the conditions (i.e., pests and diseases) associated with the crop at hand.

Dataset
For the purposes of this study, an evaluation dataset has been defined in which a number of symptoms are associated with the corresponding condition affecting a given crop. Therefore, for each pool of symptoms, the pest in question is known. The symptoms are declared by means of sentences in natural language. This dataset has been collected from a number of different webpages containing information about the pests and diseases associated with the selected crops (e.g., https://agroes.es, accessed on 9 April 2021, https://www.fertibox.net, accessed on 9 April 2021, among others) by using a web scraping tool. An exemplary test case is shown in Table 4.

Table 4. Excerpt of the dataset (partially in Spanish).
In particular, the dataset built for this preliminary validation experiment contains a total of 212 symptoms, connected to 75 pests and diseases in three different crops, namely, almond tree (Prunus dulcis), olive tree (Olea europaea), and grape vine (Vitis vinifera). These are some of the main crops that are cultivated in Mediterranean regions. In Table 5, some additional details about this dataset are put forward. The whole test dataset is available at http://agrisemantics.inf.um.es/datasets/ (accessed on 9 April 2021).

Evaluation Metrics
The metrics typically used to assess the performance of classification models such as the one described here are accuracy, precision, recall and f-measure. These metrics have traditionally been employed in the evaluation of information retrieval systems [94], but are well suited to the quality assessment of classifiers: we wish to verify whether the system properly identifies the pest or disease affecting the crops given some observable signs and symptoms. Four outcomes for a predicted value are consequently possible. These values are calculated for each pest in each dataset, and the results are aggregated by dataset (i.e., crop). For a given pest, (i) a True Positive (tp) occurs when the entered symptoms, which are associated with the pest in question, are correctly classified as being caused by this pest; (ii) a False Negative (fn) occurs when the pool of symptoms associated with the pest is wrongly classified as being caused by another pest; (iii) a False Positive (fp) occurs when a pool of symptoms associated with another pest is wrongly classified as being caused by the pest in question; and (iv) a True Negative (tn) occurs when a pool of symptoms associated with another pest is not classified as being caused by the pest in question.
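The four outcomes can be counted per pest from the list of expected and predicted labels, as in the following Python sketch. The function name and the toy labels are assumptions for illustration; `None` is used here to represent a test case for which the system returned no pest at all.

```python
def confusion_counts(true_labels, predicted_labels, pest):
    """Count tp, fp, fn and tn for one pest over all test cases."""
    tp = fp = fn = tn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == pest and predicted == pest:
            tp += 1   # symptoms of this pest, correctly attributed to it
        elif truth == pest:
            fn += 1   # symptoms of this pest, attributed elsewhere (or to nothing)
        elif predicted == pest:
            fp += 1   # symptoms of another pest, wrongly attributed to this one
        else:
            tn += 1   # symptoms of another pest, not attributed to this one
    return tp, fp, fn, tn


# One test case per pest, as in the dataset described above (labels are placeholders).
expected = ["A", "B", "C", "D"]
predicted = ["A", "A", "C", None]

counts_a = confusion_counts(expected, predicted, "A")
counts_d = confusion_counts(expected, predicted, "D")
print(counts_a, counts_d)
```

With a single test case per pest, most outcomes for any given pest are true negatives, which helps explain why accuracy can be very high even when precision and recall are modest.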
In this context, accuracy can be interpreted as the probability of being correct and is calculated as follows:

$$\mathrm{Accuracy} = \frac{tp + tn}{tp + tn + fp + fn}$$

Precision, also known as positive predictive value, represents the proportion of diagnosed diseases that have been correctly classified and is obtained as follows:

$$\mathrm{Precision} = \frac{\text{Correctly predicted as positive}}{\text{Total number of predicted as positive}} = \frac{tp}{tp + fp}$$

Recall, also known as sensitivity or true positive rate, measures the system's ability to correctly classify diseases and is calculated as the proportion of actual diseases that have been correctly classified by the system:

$$\mathrm{Recall} = \frac{\text{Correctly predicted as positive}}{\text{Total number of positives}} = \frac{tp}{tp + fn}$$

Finally, the F-measure, also known as F1 score, is the harmonic mean of precision and recall, computed as follows:

$$F\text{-measure} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
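For reference, the four metrics in their standard form can be computed from the tp, fp, fn and tn counts as in the sketch below. Returning 0.0 when a denominator is zero is a convention assumed here for illustration, and the example counts are arbitrary.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all decisions (positive and negative) that are correct."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


def precision(tp, fp):
    """Share of positive predictions that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of actual positives that are recovered."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Arbitrary example counts for a single pest.
tp, fp, fn, tn = 1, 1, 0, 2
acc = accuracy(tp, fp, fn, tn)
p = precision(tp, fp)
r = recall(tp, fn)
f1 = f_measure(p, r)
print(acc, p, r, f1)
```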

Results
Table 6 shows the results of the experiments for the four metrics considered (more details about the results of the experiment are available at http://agrisemantics.inf.um.es/datasets/Evaluation_results.xlsx, accessed on 9 April 2021). The overall accuracy of the proposed approach is 98.8%, achieving a 99% accuracy for both almond tree and grape vine, and a 97% accuracy for olive tree. In a multiclass classification problem such as the one faced in this work, the precision, recall and f-measure metrics provide the evaluation on a per-class basis. Given the characteristics of our dataset, in which only one pool of symptoms has been considered for each class (i.e., one test for each disease), the results are shown aggregated by crop.

Discussion
Generally, the classifier has achieved promising results. In the experiments carried out for both almond trees and grape vines, only a few diseases were not correctly classified. Conversely, in the olive tree experiments, only half of the diseases were classified correctly, with no results for 7 of the 25 test cases. This explains the worse precision and recall values obtained, i.e., 0.460 and 0.480, with respect to 0.846 and 0.885 in the almond tree experiments and 0.813 and 0.958 in the grape vine experiments. The test cases in which no disease has been identified are those for which our symptoms decomposition procedure could not retrieve any "plant part-damage" pair from the entered text. A more versatile approach might be required for those situations in which symptoms are not expressed as expected. On the other hand, "false positives" are usually associated with cases in which the same "plant part-damage" pair representing a symptom is linked to more than one disease. In consequence, while using common, more generally used words might result in a more human-understandable knowledge base, the overall performance of the system can be significantly degraded.
By examining the results of the evaluation process, we observed some other issues to consider. First, some of the "false negatives" (i.e., pool of symptoms not correctly associated with their corresponding disease) are due to deficiencies in the automatically generated knowledge base. Not all the symptoms pointed out in the official guides processed to populate the knowledge base have been adequately represented in the instantiated ontology. Consequently, the NLP method conceived to automatically instantiate the knowledge base should still be fine-tuned to fully tease out all relevant information. In line with this, an exhaustive analysis of the ontology model and the automatically generated instances is required to ensure the adequate representation of the original data. The method for evaluating agricultural ontologies proposed in [95] could constitute a first step towards this end.
Second, the results for diseases with few evidentiary facts are unstable. Such is the case, for example, of the Eurytoma amygdali Enderlein wasp in almond trees. Only one observable symptom has been identified ("black fruit") and no matching has been found with the test input. It would be desirable to extend the pool of symptoms associated with such diseases so as to avoid this instability. Additionally, the presence of synonyms among the symptoms stored in the knowledge base and the existence of symptoms associated with more than one condition (pest or disease) can give rise to false positive results. A manual revision of the contents of this part of the ontology might become necessary. Alternatively, it is possible to simplify the way symptoms are entered in the system (and likewise represented in the ontology) by following the example of AgriEnt [12]: to present users with a list of symptoms (accompanied by representative images) from which to choose.
In the literature, the approaches closest to that presented here are AgriEnt and the information retrieval system built upon the PCT-O ontology [70]. As mentioned above, AgriEnt has been evaluated in terms of accuracy (i.e., correct diagnosis of all test cases) in six different crops, reaching an average accuracy of 0.8221. It is not appropriate to compare our respective accuracy values since they have been obtained from different experimental settings. The dataset used in the evaluation of AgriEnt has not been made public and the input is slightly different, since in AgriEnt farmers select the symptoms from a list of available symptoms associated with a given crop. On the other hand, while the population process of the PCT-O ontology and the actual ontology model are thoroughly revised in their work, the authors of this approach do not provide any performance data concerning the information retrieval or recommender system.

Conclusions and Future Work
Agriculture is one of the pillars for worldwide food provisioning, with fruits and vegetables being essential for a healthy diet. A large proportion of the world's population live in countries where agriculture is the main source of livelihood [96]. Organic agriculture presents several benefits over conventional agriculture, including improved environmental health and reduction of costly external inputs [97]. However, its feasibility is often questioned due to the constraints on the use of synthetic products such as chemical fertilizers and pesticides. For that reason, the general approach in organic agriculture is to deal with the causes of a problem rather than treating the symptoms. Therefore, the early detection of a pest or disease outbreak becomes crucial so as to allow the adoption of preventive measures. Yet, in most cases farmers do not have the knowledge and resources necessary to detect the trigger factors and act accordingly. Moreover, organic agriculture-compliant treatments are still unknown to most people. It is thus necessary to provide farmers with the means to, first, recognize the presence of pests and diseases in their crops and, second, develop preventive actions and use IPM practices allowed for organic production, to limit their harmful effects.
Many ICT-enabled tools have been developed to facilitate the detection of pests and diseases in crops. Most solutions rely on image processing and often require the use of sophisticated high-resolution image capture devices or other types of sensors that are not usually available to individuals responsible for agricultural holdings. Besides, the syntactic-based core of existing approaches limits their ability to leverage the already vast amount of information about plant pests, diseases, their causes, and their control measures. In this work, we describe a semantic approach for the identification of crop pests and diseases. The framework proposed in this paper makes use of ontologies to semantically model the domain of interest. The final knowledge base contains a total of 338 plants (i.e., individuals under the top-level "Plantae" concept) and 513 crops (i.e., individuals under the top-level "Plant Product" concept). The use of this formal model greatly facilitates the automatic integration of data from multiple, heterogeneous sources, resulting in a complete knowledge base. Reasoning and inferencing mechanisms are then put in place to determine the condition producing damages to crops and the required control measures complying with organic agriculture regulations. In fact, since IPM guides have been used to populate the knowledge base, the application can be easily extended to support conventional growers.
For future work, the CropPestO ontology will be improved to make it more human readable and to incorporate more axioms that boost its inferencing capabilities. Those formal underpinnings of ontologies can then be leveraged to carry out reasoning processes that enhance pest recognition and related tasks. Besides, we plan to extend the framework to support other evidentiary items as input, including images and environmental parameters. Indeed, weather, soil conditions, affected area, affected crops, yield losses, etc., are some of the factors that can help characterize the problem's source, and the exhaustive analysis of historic data can lead to insights into how and why certain crop pests and diseases break out. The integration of different identification systems can result in efficiency and effectiveness gains. On the other hand, the knowledge base has so far been populated solely with data from official Spanish guides, and thus is only useful for Spanish-speaking users. While the underlying ontology model has been labelled in both English and Spanish, the NLP method used for ontology population should be adapted to support other languages. Moreover, the pest control domain is an evolving, ever-changing field, and so we aim to develop a semisupervised ontology evolution tool. The ontology could then be continuously enriched and updated by considering the state-of-the-art knowledge. This tool would also assist in maintaining the ontology and keeping it up to date with the changes in the reference vocabularies used. Finally, a more robust validation, in a real environment (i.e., with tests provided by real users) and with large volumes of data, is required to verify the scalability of the proposed approach. Under these circumstances, the use of spell-checker tools will be essential to deal with the foreseeable typos. Synonyms should also be considered along with other matching measures such as the Levenshtein distance.
As part of this envisioned validation scenario, the use of other metrics such as Mean Average Precision at k (MAP@k) and AUC-ROC (area under the ROC curve) will be considered.