An Efficient Case Retrieval Algorithm for Agricultural Case-Based Reasoning Systems, with Consideration of Case Base Maintenance

Case-based reasoning has considerable potential for modelling decision support systems for smart agriculture that assist farmers in managing farming operations. However, with the explosive growth of sensing data, these systems may perform poorly in knowledge management tasks such as case retrieval and case base maintenance. Typical case retrieval approaches have to traverse all past cases to find similar ones, which leads to low efficiency. Thus, a new case retrieval algorithm for agricultural case-based reasoning systems is proposed in this paper. At the initial stage, an association table is constructed, containing the relationships between all past cases. Afterwards, the attributes of a new case are compared with those of an entry case. According to the similarity measurement, the associated similar or dissimilar cases are then compared preferentially, instead of traversing the whole case base. The association of the new case is generated during case retrieval and added to the association table at the case retention step. The association table is also updated when a closer relationship is detected. The experimental results demonstrate that our proposal enables rapid case retrieval with promising accuracy while comparing fewer past cases. Thus, the retrieval efficiency of our proposal exceeds that of typical approaches.


Introduction
Managing farming operations is a challenging task due to its complexity and unpredictability [1]. It includes activities such as irrigation scheduling [2,3], pest management [4,5], nutrient management [6,7], investment in agricultural machinery [8,9], harvesting [10,11], logistics [12,13] and so forth. Farmers and stakeholders not only need to deal with short-term (daily and weekly) scheduling problems but also have to consider long-term (yearly) management. Typically, farmers make these decisions according to their own observations and experience [14,15]. However, an inappropriate decision may cause serious issues such as decreased productivity, damaged soil, increased costs and so forth. Owing to the latest advances in the Internet of Things (IoT) and sensor techniques, data collected by climate sensors, ground sensors, radiation sensors and weather stations (made of sensors) enable researchers to build IoT-based platforms and thereby execute tasks like monitoring, knowledge mining, reasoning and control [16,17]. However, with the growing amount of data collected by various sensors, farmers sometimes have great difficulty in making proper judgments, since they are not data scientists. As a consequence, farmers are gradually adopting decision support systems (DSSs) [18,19] to obtain advice, because DSSs are able to transform unstructured raw data into useful knowledge, thereby assisting farmers in managing agricultural activities efficiently and profitably.
As one of the most popular techniques in artificial intelligence, case-based reasoning (CBR) has been gradually employed for modelling DSSs in the domain of smart agriculture [20,21]. In general, an agricultural decision support system (ADSS) is a platform that gathers and analyses data collected from a variety of sources (meteorological, plant/crop-related and economic data). An ADSS aims to assist farmers in smoothing the decision-making processes for agricultural management by providing a list of feasible solutions [18]. With its strong reasoning capability, CBR can be used to generate these solutions. Once farmers encounter a new agricultural problem, the description of this problem is treated as a new case by a CBR-enabled ADSS. Afterwards, the ADSS uses similarity measures to retrieve the most similar past cases from the case base, along with the corresponding solutions. It is acknowledged that if the new case and the retrieved past cases have great commonalities, then the solutions of the retrieved cases can be used to solve the new case as well [22]. Therefore, farmers can obtain decision support from the ADSS for managing agricultural tasks.
Though applying case-based reasoning has promising advantages like ease of use and precise response, some critical issues in case retrieval have been pointed out by researchers [23]. For example, each past case is typically considered an independent individual in the case base and assigned a sequential number as its unique identifier. However, a case in the case base could share similar (or dissimilar) feature values with others, so that these cases can be interconnected by an association. Under such circumstances, the retrieval task may skip unnecessary comparisons between new and past cases and thereby accelerate the retrieval process. Unfortunately, few studies pay attention to using the internal associations between cases. Neglecting these relations may lead to poor performance at the stage of case retrieval, because the case retrieval algorithms sequentially traverse all the past cases to find the most similar ones, even when a large volume of cases is stored in the case base.
To improve retrieval efficiency, some methods like the rough set theory [24,25] and filtering techniques [26,27] were adopted. On the one hand, the rough set theory could reduce the number of compared cases by defining lower and upper approximations. However, all past cases have to get involved when generating a set of qualified cases that meets the approximations. On the other hand, some researchers defined a rule set manually for filtering past cases. The rules were specified based on the observation of cases and researchers' own interests for case retrieval tasks. Unfortunately, both rough set theory and filtering techniques failed to address any associations between cases and they were task specific. Once a new task is put forward, the filtering process has to be executed over and over.
Case retrieval plays an essential role in CBR systems because the remaining steps (reuse, revise and retain) cannot proceed without successfully retrieving the most similar past cases in the first place. Current studies on case retrieval algorithms mainly concern the following two aspects: (i) proposing new similarity measures and (ii) proposing new indexing methods.
In case-based reasoning, similarity measures are used to quantify the similarity between two objects [28]. Usually, a smaller distance value means that the compared two objects have more commonalities. For retrieving the most similar past cases, researchers have contributed a lot towards proposing new similarity measures.
Wang et al. [29] proposed a novel hybrid similarity measure for case retrieval in case-based reasoning systems, considering five formats of attribute values: crisp symbols, crisp numbers, fuzzy numbers, fuzzy linguistic variables and fuzzy intervals. The calculation formula of the global similarity was established by combining the hybrid similarity measure and the synthesis weight measure for retrieving the proper historical case. Yoon et al. [30] presented C-Rank, a link-based similarity measure for identifying similar scientific literature in databases. This similarity measure used both in-link and out-link references, disregarding the direction of the references. The experimental result demonstrated that C-Rank achieved higher accuracy than existing approaches. Yazid et al. [31] designed a new similarity measure based on Bayesian networks for brain tumor case retrieval. Their proposal was based on graph correspondences and signature node comparison from the Bayesian classifiers. The promising experimental results indicated that the proposed similarity measure outperformed classical methods. Zhai et al. [32] proposed a novel triangular similarity measure, overcoming the shortcomings of cosine similarity and Euclidean distance similarity. The experimental result showed that their proposal had strong robustness and great accuracy. Jiang et al. [33] introduced a novel semantic similarity measure for formal concept analysis by taking advantage of linked data and WordNet. The proposed method was used not only for data analysis and knowledge representation but also for concept formation and learning.
Though newly-proposed similarity measures indeed enable case-based reasoning systems to retrieve more accurate past cases, these measures do not improve the efficiency of case retrieval. During the step of case retrieval, the algorithms have to traverse all past cases in the case base, leading to low efficiency when a large volume of cases is stored. Therefore, proposing new similarity measures is not enough for improving the performance of case-based reasoning systems.
As a computational data structure, an index enables a case to be stored and searched in memory. Case indexing assigns indexes to cases to facilitate their retrieval [34], and it plays a key role in case base maintenance. Many studies have addressed indexing issues.
Honigl and Kung [35] proposed a data quality index method for maintaining the case base and avoiding redundant cases. Three indices (average solutions per case, count of similar retained queries and missing values) were used to build an index for the quality of the case base. Wiltgen et al. [36] presented two indexing methods, named functional indexing and structural indexing. Both indexing methods generated separate discrimination networks and had mechanisms for preventing the network from having duplicate nodes. Similar past cases could be retrieved by adopting the indexing methods and similarity measurements. Ahmad et al. [37] adopted the locality sensitive hashing (LSH) technique for obtaining short binary codes to represent medical radiographs. These hashing codes enabled indexing and efficient retrieval in large scale image collections. Durmaz and Bilge [38] proposed an approach named randomized distributed hashing (RDH), which used LSH in a distributed scheme. RDH randomly distributed data to different nodes on a cluster and used hash function for indexing. Then the query sample was locally searched in different nodes during the query stage. The experimental result showed that the proposed distributed scheme had great potential to search images in large datasets with multiple nodes. Ahmed and Sarma [39] detected that the accuracy of a system degraded with the increase in the size of the database, therefore an indexing approach was designed to deal with the feature deviation under noise. Considering the retrieval task, the proposed indexing approach gave higher hit rate than existing approaches, even at low penetration rate.
From above review on current literatures, it is concluded that indexing methods have great influence on case retrieval and case base maintenance. LSH is especially popular in case indexing. LSH refers to use a family of functions to map hash data points into buckets [40]. As a result, data points that near to each other are located in the same buckets with high probability, while data points that are far from each other are likely to be placed in different buckets. This makes it easier and more efficient to identify past cases that are similar to the new one. However, LSH does not guarantee accuracy of classified cases. For example, two similar data points may be separated into different buckets due to the design of hashing functions. Thus, improvements on new indexing methods for case retrieval and case base maintenance are expected. It is worth noticing that none of above literatures mentioned mining and using the internal associations between past cases. In other words, each case is still individually stored and searched. Therefore, in this paper, a new case retrieval algorithm for agricultural case-based reasoning systems is proposed. Before executing the algorithm, an association table is constructed, containing the associations between past cases. At first, the new case is compared with an entry-point case. Based on the similarity measurement, the similar or dissimilar association is then selected for comparison in Agriculture 2020, 10, 387 4 of 21 the next iteration until the most similar past cases are detected. Under such circumstance, potential similar past cases can be evaluated preferentially and the number of compared cases is therefore reduced, because the proposed algorithm is able to skip unnecessary comparisons. Meanwhile, our proposal takes case base maintenance into account. The association table is updated during runtime. After resolving the problem, the new case is retained in the case base, as well as its similar and dissimilar associations.
The rest of this paper is organized as follows. Section 2 presents the materials and methods of the proposed case retrieval algorithm. The results and discussions are presented in Section 3. Finally, conclusions are drawn in Section 4.

Materials and Methods
The proposed algorithm relies on a pre-constructed association table. Within this table, each past case is interconnected with several similar and dissimilar past cases. Once a new case is reported, it is first compared with an entry point (the starting case for comparison in the first iteration). If the similarity measurement between the new case and the entry point indicates that these two cases have great commonalities, the similar association of the entry point is selected for comparison in the next iteration; otherwise, the new case is compared with the dissimilar past cases that are associated with the entry point. The retrieval process continues until the termination condition of the algorithm is reached. In contrast to typical case retrieval algorithms, which traverse all past cases, our proposal measures the similarity of associated cases preferentially. Under this circumstance, the number of compared cases can be greatly reduced and the efficiency of case retrieval can therefore be improved. For case retention, the features of the new case, its similar and dissimilar associations, as well as its solutions, are stored in the case base. Meanwhile, the association table is updated if the new case shows closer relations than the existing associations.

Case Representational Formalism
The proposed algorithm focuses on retrieving agricultural cases that are formalized by the feature vector representation [41]. As the simplest formalism of case representation, it represents cases by a set of features that describe the problems and the corresponding solutions. In this manuscript, the agricultural case-based reasoning system manages pest problems; the agricultural cases are therefore defined as shown in Table 1, which covers pest, crop and environment data [42]. Each case has the same type and number of features. For implementation, past cases are stored in the CSV format. Since our agricultural case-based reasoning system is coded in the programming language Python, libraries like "Numpy" and "Pandas" can be used to manipulate the stored past cases easily. Furthermore, data in CSV format are understandable and readable for farmers, even if they do not have any expertise in knowledge management and computer science.
The contents of some features in Table 1 are given as text, like "pest name," "pest stage," "crop name" and "growth stage." To deal with these textual features, we encode them as integers. Transforming a linguistic feature into a number is a common approach in case-based reasoning systems [43,44]. For example, the life cycle of a pest includes "egg," "pupae," "larvae" and "adult." An integer is assigned to each stage: "1" represents "egg," "2" represents "pupae," "3" represents "larvae" and "4" represents "adult." The same transformation applies to the rest of the textual features as well. To normalize both numeric and textual features, we adopted the Min-Max feature scaling method, which maps the original features into the range from 0 to 1. Some features carry an internal ordering; for instance, the life cycle of a pest follows a time sequence. Data normalization does not eliminate this ordering, since the original values (1,2,3,4) are normalized as (0,0.3333,0.6667,1), which still reflects the sequence.
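The encoding and scaling steps described above can be illustrated with a minimal sketch in plain Python. The names `PEST_STAGE_CODES` and `min_max_scale` are illustrative and do not come from the original system; only the integer codes and the Min-Max formula follow the text.

```python
# Integer codes for the textual "pest stage" feature, as given in the text.
# The dictionary name is hypothetical.
PEST_STAGE_CODES = {"egg": 1, "pupae": 2, "larvae": 3, "adult": 4}


def min_max_scale(values):
    """Map a list of numbers linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature is mapped to 0 to avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


stages = ["egg", "pupae", "larvae", "adult"]
encoded = [PEST_STAGE_CODES[s] for s in stages]
scaled = [round(x, 4) for x in min_max_scale(encoded)]
print(scaled)  # [0.0, 0.3333, 0.6667, 1.0]
```

The scaled values keep the order of the life-cycle stages, which is the property the paragraph above relies on.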
In the feature vector representation, it is worth mentioning that there are no relationships built between cases. In other words, each case is individually stored in memory. This is the reason why an association table is constructed in the next section for case retrieval.

Construction of an Association Table
The association table contains the interconnections between past cases. Within the case base, a case could be similar or dissimilar to several other cases. An example of a case base is given in Table 2, including six past cases. Each case has four features. After data normalization [45], all the cases are visualized in Figure 1. According to Figure 1, case 1 is similar to cases 3 and 5 because their data deviation is small, whereas case 1 is dissimilar to cases 2 and 4 because their data distributions differ considerably. Similarly, case 2 is similar to cases 4 and 6, while it is dissimilar to cases 3 and 5. Thus, the association table shown in Table 3 can be constructed. Each past case is associated with two similar and two dissimilar cases. The similarity measurements between two associated cases are stored in the association table as well. The similarity and the dissimilarity measurements are both calculated according to Reference [32]. In Table 3, the two cases with the highest measurements are chosen as the similar association, while the two cases with the lowest measurements are selected as the dissimilar association. Each case in Table 3 has two types of associations:
• Similar association: This type of association indicates that the features of the two cases concerned have great commonalities. Consequently, the IDs of these similar cases are stored in the similar association, building interconnections to the source case. For example, case 1 is associated with cases 3 and 5. Once a new case is reported and case 1 is treated as the entry point, cases 3 and 5 are selected for comparison if the new case is considered similar to case 1. Because other potentially similar cases might exist within the similar association, it offers the chance of evaluating the past cases within a smaller range, instead of searching the whole case base. As a result, the number of compared cases can be reduced and retrieval efficiency can be improved.
• Dissimilar association: This type of association specifies that there are significant differences between the features of the two cases concerned. The IDs of these dissimilar cases are stored in the association table as well. For example, case 2 is associated with cases 5 and 3. The dissimilar association aims at assisting the new case in identifying a relatively similar case at the very beginning of case retrieval. Meanwhile, this association is also helpful when the retrieval process is trapped in a local optimum. In other words, the dissimilar association can adjust the search trajectory in order to find the global optimum.
To construct such an association table, it is necessary to measure the similarity between every pair of past cases. For instance, in Table 2, case 1 has to be compared with cases 2, 3, 4, 5 and 6, case 2 has to be compared with cases 1, 3, 4, 5 and 6, and so forth. After obtaining all the similarity measurements, each case can be associated with several similar and dissimilar ones. For instance, if the number of similar associations is two, then the two cases with the highest similarity measurements are selected and stored in the association table. As shown in Table 3, cases 3 and 5 achieve the top two similarity measurements when compared with case 1; therefore, cases 3 and 5 are selected as the similar associations of case 1. The number of associated cases depends on the size of the case base: the more cases are stored in the case base, the more associations should be built.
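The construction step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function `similarity` is a simple stand-in (one minus the mean absolute difference of normalized features) and is not the triangular measure of Reference [32]; the six feature vectors are invented toy values, not the contents of Table 2; and `build_association_table` and the dictionary layout are hypothetical names.

```python
def similarity(a, b):
    """Stand-in similarity in [0, 1]: one minus the mean absolute difference.

    Assumes both feature vectors are already Min-Max normalized.
    This is NOT the triangular measure of Reference [32].
    """
    return 1.0 - sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def build_association_table(cases, k=2):
    """For every case ID, store its k most and k least similar cases."""
    table = {}
    for cid, features in cases.items():
        # Compare this case with every other case, highest similarity first.
        scored = sorted(
            ((similarity(features, other), oid)
             for oid, other in cases.items() if oid != cid),
            reverse=True,
        )
        table[cid] = {
            "similar": [(oid, s) for s, oid in scored[:k]],
            # Lowest similarity first for the dissimilar association.
            "dissimilar": [(oid, s) for s, oid in scored[-k:][::-1]],
        }
    return table


# Invented toy case base: six cases with four normalized features each.
cases = {
    1: [0.10, 0.20, 0.10, 0.20],
    2: [0.90, 0.80, 0.90, 0.80],
    3: [0.15, 0.25, 0.10, 0.20],
    4: [0.85, 0.80, 0.95, 0.80],
    5: [0.10, 0.20, 0.20, 0.30],
    6: [0.80, 0.70, 0.90, 0.80],
}
table = build_association_table(cases, k=2)
print([oid for oid, _ in table[1]["similar"]])  # [3, 5]
```

Every case is compared with every other case, so the number of similarity computations grows quadratically with the size of the case base, which is why the construction is treated as a one-time task.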
Though constructing the association table is a time-consuming process when a large volume of cases is stored, it is still essential to explore the relations between past cases, because these relations are useful for case retrieval. Besides, this construction is a one-time task. The association table is constructed at the initial stage, just before the system receives new inquiries. After completing the construction, the association table is ready for use. Regarding the maintenance of the association table, on the one hand, if a new retrieval task is completed and the CBR system decides to retain the new case, the association of this new case is added to the association table. Meanwhile, any closer relations that are detected replace the old ones in the association table. On the other hand, if the CBR system determines not to retain the new case, the association table remains unchanged.
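A possible reading of this maintenance step is sketched below. The text does not spell out how a "closer relation" is detected, so the update rule here is an assumption: the new case replaces the weakest similar link of a compared past case when it is more similar, or the weakest dissimilar link when it is even less similar. The function name `retain_case`, the table layout and all numeric values are hypothetical.

```python
def retain_case(table, new_id, sims_to_new, k=2):
    """Add the new case's associations and refresh those of compared cases.

    table: {case_id: {"similar": [(id, sim), ...], "dissimilar": [...]}}
    sims_to_new: {past_id: similarity to the new case}, covering only the
    past cases that were actually compared during retrieval.
    """
    ranked = sorted(sims_to_new.items(), key=lambda pair: pair[1], reverse=True)
    table[new_id] = {
        "similar": ranked[:k],
        "dissimilar": ranked[-k:][::-1],  # lowest similarity first
    }
    for pid, sim in sims_to_new.items():
        entry = table[pid]
        weakest_similar = min(entry["similar"], key=lambda pair: pair[1])
        weakest_dissimilar = max(entry["dissimilar"], key=lambda pair: pair[1])
        if sim > weakest_similar[1]:
            # Assumed rule: a closer case replaces the weakest similar link.
            entry["similar"].remove(weakest_similar)
            entry["similar"].append((new_id, sim))
            entry["similar"].sort(key=lambda pair: pair[1], reverse=True)
        elif sim < weakest_dissimilar[1]:
            # Assumed rule: a more distant case replaces the weakest dissimilar link.
            entry["dissimilar"].remove(weakest_dissimilar)
            entry["dissimilar"].append((new_id, sim))
            entry["dissimilar"].sort(key=lambda pair: pair[1])
    return table


# Invented association table for four past cases (similarities in [0, 1]).
table = {
    1: {"similar": [(3, 0.975), (5, 0.95)], "dissimilar": [(2, 0.30), (4, 0.30)]},
    2: {"similar": [(4, 0.975), (6, 0.95)], "dissimilar": [(1, 0.30), (3, 0.325)]},
    3: {"similar": [(1, 0.975), (5, 0.925)], "dissimilar": [(2, 0.325), (4, 0.325)]},
    4: {"similar": [(2, 0.975), (6, 0.95)], "dissimilar": [(1, 0.30), (3, 0.325)]},
}
# The new case 7 was compared with cases 1 to 4 during retrieval.
retain_case(table, 7, {1: 0.96, 3: 0.90, 2: 0.28, 4: 0.33})
print(table[1]["similar"])     # [(3, 0.975), (7, 0.96)]
print(table[2]["dissimilar"])  # [(7, 0.28), (1, 0.3)]
```

If the system decides not to retain the new case, `retain_case` is simply not called and the table stays as it was.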

Case Retrieval Algorithm
The workflow of the proposed case retrieval algorithm is presented in Figure 2.
Figure 2. Workflow of the proposed case retrieval algorithm.
In Figure 2, the case retrieval algorithm starts with the new case input by users. An entry point is randomly selected from the case base for comparison in the first iteration. Based on the similarity measurement, the algorithm decides whether the new case is similar (or dissimilar) to the entry-point case. Afterwards, the corresponding association is determined and the associated similar (or dissimilar) cases are retrieved from the association table. Then, the similarity between the new case and the associated similar (or dissimilar) cases is measured in the next iteration, and the process repeats until the termination condition is reached. The termination condition of the algorithm is defined as: (i) the maximum iteration number is reached or (ii) a satisfactorily similar past case is found.
To determine whether two compared cases are similar or dissimilar, Table 4 presents the correspondence between similarity levels and measurements.

Table 4. Correspondence between similarity level and measurements.
Level | Condition
Identical | The two compared cases are exactly the same, achieving a similarity measurement of 100.00%
Highly similar | The two compared cases achieve a similarity measurement ranging from 75.00% to 99.99%
Similar | The two compared cases achieve a similarity measurement ranging from 50.00% to 74.99%
Dissimilar | The two compared cases achieve a similarity measurement ranging from 25.00% to 49.99%
Highly dissimilar | The two compared cases achieve a similarity measurement ranging from 0.00% to 24.99%

With regard to the association determination (for selecting the associated similar or dissimilar cases), a set of policies is defined in the case retrieval algorithm as follows.
• Policy 1 (detection of identical cases): If a past case is found to be identical to the new case, the case retrieval algorithm terminates immediately. The output is the retrieved past case.
• Policy 2 (token assignments): Once a past case is considered highly similar to the new case, three positive tokens are assigned to this past case. Once a past case is considered similar to the new case, one positive token is assigned to it. Once a past case is considered highly dissimilar to the new case, three negative tokens are assigned to it. Lastly, once a past case is considered dissimilar to the new case, one negative token is assigned to it.
• Policy 3 (association selection): In general, the association with more tokens is selected for comparison. When the number of positive tokens is greater than the number of negative ones, the past case with the highest similarity measurement is selected, and the associated similar cases of this chosen case are evaluated in the next iteration. When the comparative result of the current iteration yields more negative tokens, the past case with the lowest similarity measurement is selected. Consequently, the associated dissimilar cases of this selected case are retrieved from the association table for comparison in the next iteration.
• Policy 4 (selection of previous cases): It can happen that all associated cases in a single iteration have been compared previously, because a past case can be associated through a 1-to-N relation. For instance, in Table 3, case 5 has a similar association with cases 1 and 3. Repeatedly evaluating cases that have already been compared makes no sense and would result in an endless loop. Under this circumstance, the cases to be evaluated in the next iteration are selected from previous iterations. Based on the number of tokens, the corresponding association is determined and the past case with the second highest (or lowest) similarity measurement from the previous iteration is chosen for comparison. If the past cases in the previous iteration have all been selected, the algorithm repeats Policy 4 one more time.
In Table 4, apart from the identical, similar and dissimilar levels, we also defined highly similar and highly dissimilar levels. Assume that the retrieval algorithm meets the following situation: the similarity measurements between the new case 1 and past cases 1, 2 and 3 are 90.00%, 30.00% and 40.00% respectively. Without the highly similar and highly dissimilar levels, past case 1 would receive one positive token while past cases 2 and 3 would receive one negative token each, so according to Policy 3 the dissimilar association of past case 2 would be selected for comparison in the next iteration. However, since past case 1 is so similar to the new case 1, the cases in the similar association of past case 1 have a great chance of being similar to the new case 1 as well, and it would be a better choice to search in that association. With the additional levels, past case 1 receives three positive tokens, which outweigh the two negative tokens, so the similar association of past case 1 is selected. To avoid the former situation, we therefore decided to define the highly similar and highly dissimilar levels, which steer the proposed retrieval algorithm towards the potentially optimal search path. In summary, we divide the measurement equally into four intervals, denoting the highly similar, similar, dissimilar and highly dissimilar levels respectively.
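The levels of Table 4 and Policies 2 and 3 can be checked against this example with a short sketch. The names `level`, `TOKENS` and `select_association` are illustrative; the `fine_grained=False` switch exists only to reproduce the hypothetical situation without the two "highly" levels, and the handling of a token tie is an assumption, since the policies do not cover it.

```python
# Tokens per level, following Policy 2.
TOKENS = {"highly similar": 3, "similar": 1,
          "dissimilar": -1, "highly dissimilar": -3}


def level(sim, fine_grained=True):
    """Map a similarity measurement in percent to a level of Table 4."""
    if sim >= 100.0:
        return "identical"
    if not fine_grained:
        # Coarse variant without the "highly" levels, for comparison only.
        return "similar" if sim >= 50.0 else "dissimilar"
    if sim >= 75.0:
        return "highly similar"
    if sim >= 50.0:
        return "similar"
    if sim >= 25.0:
        return "dissimilar"
    return "highly dissimilar"


def select_association(sims, fine_grained=True):
    """Apply Policies 2 and 3 to {case_id: similarity in percent}."""
    positive = negative = 0
    for sim in sims.values():
        # Identical cases get no token here; Policy 1 handles them separately.
        token = TOKENS.get(level(sim, fine_grained), 0)
        if token > 0:
            positive += token
        else:
            negative -= token
    if positive > negative:
        return max(sims, key=sims.get), "similar"
    # Assumption: a tie is treated like a negative majority.
    return min(sims, key=sims.get), "dissimilar"


sims = {1: 90.0, 2: 30.0, 3: 40.0}
print(select_association(sims, fine_grained=False))  # (2, 'dissimilar')
print(select_association(sims, fine_grained=True))   # (1, 'similar')
```

With the coarse levels the tally is one positive against two negative tokens, whereas the four-level scheme yields three positive against two negative tokens, matching the reasoning in the paragraph above.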
The pseudo code of the proposed case retrieval algorithm is displayed in Table 5.
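For readers who prefer executable code, the retrieval loop can also be approximated in Python. The sketch below is one interpretation of Policies 1 to 4 and is not a transcription of Table 5. Its assumptions: the names `retrieve`, `tokens` and `similarity_pct` are hypothetical; the similarity function is a simple stand-in rather than the measure of Reference [32]; the association table stores only case IDs; a token tie selects the dissimilar association; and the toy case base and its table are invented.

```python
import random


def tokens(sim):
    """Policy 2 tokens for a similarity in percent (levels of Table 4)."""
    if sim >= 75.0:
        return 3
    if sim >= 50.0:
        return 1
    if sim >= 25.0:
        return -1
    return -3


def retrieve(new_case, cases, table, similarity, entry=None, max_iter=10):
    """Return (best_id, best_similarity, number_of_compared_cases)."""
    if entry is None:
        entry = random.choice(list(cases))   # random entry point
    compared = {}      # case ID -> similarity to the new case
    history = []       # one {case ID: similarity} dict per iteration
    expanded = set()   # cases whose association was already followed
    candidates = [entry]
    for _ in range(max_iter):
        current = {c: similarity(new_case, cases[c])
                   for c in candidates if c not in compared}
        compared.update(current)
        if any(s >= 100.0 for s in current.values()):
            break                            # Policy 1: identical case
        if current:
            history.append(current)
        # Policy 4: if nothing new was compared, this falls back to the
        # most recent iteration that still holds an unused case.
        source = next((it for it in reversed(history)
                       if any(c not in expanded for c in it)), None)
        if source is None:
            break                            # nothing left to follow
        unused = {c: s for c, s in source.items() if c not in expanded}
        if sum(tokens(s) for s in source.values()) > 0:   # Policies 2 and 3
            pivot, kind = max(unused, key=unused.get), "similar"
        else:
            pivot, kind = min(unused, key=unused.get), "dissimilar"
        expanded.add(pivot)
        candidates = table[pivot][kind]
    best = max(compared, key=compared.get)
    return best, compared[best], len(compared)


def similarity_pct(a, b):
    """Stand-in measure in percent; not the measure of Reference [32]."""
    return 100.0 * (1.0 - sum(abs(x - y) for x, y in zip(a, b)) / len(a))


# Invented toy case base and a matching association table (IDs only).
cases = {1: [0.10, 0.20, 0.10, 0.20], 2: [0.90, 0.80, 0.90, 0.80],
         3: [0.15, 0.25, 0.10, 0.20], 4: [0.85, 0.80, 0.95, 0.80],
         5: [0.10, 0.20, 0.20, 0.30], 6: [0.80, 0.70, 0.90, 0.80]}
table = {1: {"similar": [3, 5], "dissimilar": [2, 4]},
         2: {"similar": [4, 6], "dissimilar": [1, 3]},
         3: {"similar": [1, 5], "dissimilar": [2, 4]},
         4: {"similar": [2, 6], "dissimilar": [1, 3]},
         5: {"similar": [1, 3], "dissimilar": [2, 4]},
         6: {"similar": [2, 4], "dissimilar": [1, 3]}}

best_id, best_sim, n_compared = retrieve(
    [0.12, 0.22, 0.12, 0.22], cases, table, similarity_pct, entry=2)
print(best_id, round(best_sim, 2), n_compared)  # 1 98.0 4
```

Starting from the dissimilar entry point 2, the sketch jumps to cases 1 and 3 through the dissimilar association, then follows similar associations and stops once no unused case remains, having compared four of the six past cases.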
To illustrate the proposed case retrieval algorithm, an example is presented in Figure 3, where Pi represents the ith past case in the case base and N1 is the first new case. Initially, P1 is selected as the entry point and compared with N1 in the first iteration. The comparison suggests that the dissimilar association of P1 should be chosen for further comparison (Policies 2 and 3). Thus, P336, P157 and P479 are compared with N1. In the second iteration, the number of positive tokens is greater than the number of negative ones. Consequently, the associated similar cases of P157, which has the greatest similarity measurement, are selected for comparison in the next iteration (Policies 2 and 3). The algorithm keeps running until the 6th iteration, in which all candidate past cases have already been used in previous iterations. According to Policy 4, P339, which has the second highest similarity measurement in the 5th iteration, is chosen as a substitute. The output of the algorithm is the past case that has the greatest commonalities with the new case. The algorithm terminates when (i) the maximum iteration number is reached or (ii) an identical case is detected. The travelling sequence of the proposed case retrieval algorithm in the above scenario is presented in Figure 4.
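The walk-through above can be approximated by a short sketch. It is a simplified illustration only: it does not model the positive and negative tokens or the exact policies defined earlier in the paper. The similarity function, the `similar_threshold` parameter and the fallback rule (a loose stand-in for Policy 4) are assumptions introduced for this example.

```python
from typing import Dict, List, Sequence, Tuple

Case = Sequence[float]
# association[i] = (ids of similar cases, ids of dissimilar cases) of past case i
Association = Dict[int, Tuple[List[int], List[int]]]


def similarity(a: Case, b: Case) -> float:
    """Assumed measure: 1 minus the mean absolute difference of normalized attributes."""
    return 1.0 - sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def retrieve(new_case: Case, cases: Dict[int, Case], association: Association,
             entry_id: int, max_iterations: int = 10,
             similar_threshold: float = 0.8) -> Tuple[int, int]:
    """Return (id of best past case found, number of past cases compared)."""
    scores = {entry_id: similarity(new_case, cases[entry_id])}
    expanded = set()
    current = entry_id
    for _ in range(max_iterations):       # termination (i): iteration limit
        if scores[current] >= 1.0:        # termination (ii): identical case
            break
        expanded.add(current)
        similar_ids, dissimilar_ids = association[current]
        # Follow similar links when the current case is close to the new case,
        # otherwise follow its dissimilar links.
        linked = similar_ids if scores[current] >= similar_threshold else dissimilar_ids
        fresh = [c for c in linked if c not in scores]
        for c in fresh:
            scores[c] = similarity(new_case, cases[c])
        # Fallback when every linked case was already compared: continue from
        # the best compared case that has not been expanded yet.
        pool = fresh or [c for c in scores if c not in expanded]
        if not pool:
            break
        current = max(pool, key=scores.get)
    best = max(scores, key=scores.get)
    return best, len(scores)


# Tiny made-up case base with two normalized attributes per case.
cases = {1: [0.0, 0.0], 2: [1.0, 1.0], 3: [0.9, 0.9], 4: [0.1, 0.1]}
association = {1: ([4], [2]), 2: ([3], [1]), 3: ([2], [4]), 4: ([1], [3])}
best_id, compared = retrieve([0.9, 0.9], cases, association, entry_id=1)
```

In this toy run the entry case is dissimilar to the new case, so the search jumps to its dissimilar association, then follows similar links and stops at an identical case after comparing three of the four past cases.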

Case Base Maintenance
After the most similar past case is retrieved, its solution can be reused and revised to resolve the new problem. However, solution reuse and revision are not the main concern of this manuscript and are not discussed further. Our main objective is case retrieval and case base maintenance.
In terms of case retention and case base maintenance, the typical approach is to add the newly solved case directly into the case base, along with its solution [46]. When the new case is extremely similar to a past case already stored in the case base, a case forgetting strategy can be applied after evaluating the quality of both cases [47]. In our case, attention must also be paid to the association table, because the performance of our algorithm depends on it.
The proposed case retrieval algorithm handles case base maintenance in two aspects: (i) storing the learned case and (ii) updating the existing associations of past cases.
Firstly, the case base should retain the learned case, which is composed of the problem description of the new case, the corresponding solution and its association. In general, the learned case is assigned a sequential number as its unique identifier and is then stored in the case base. The similar and dissimilar associations of this learned case are added at the end of the association table. Adding new cases increases the chance of retrieving cases that are similar to the target problems. However, continuous addition also enlarges the case base, which makes case retrieval more complex and less efficient [48]. Consequently, if the new case is extremely similar to a certain past case in the case base, it should not be retained, in order to avoid redundancy. This solution is known as a forgetting strategy [49]. By calculating the goodness of the learned case, the case-based reasoning system decides whether the learned case should be remembered or forgotten. To simplify the process of case retention, a threshold of 98.00% is defined: if the similarity measurement between the learned case and the retrieved most similar past case exceeds 98.00%, the learned case is forgotten and is not stored in the case base. Otherwise, case retention follows the general procedure described at the beginning of this paragraph.
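The retention rule can be sketched in a few lines. This is a minimal illustration under assumptions: the dictionary layout of the case base and the function name `retain_case` are hypothetical, and a similarity of exactly 98.00% is treated here as "retain", which the text does not specify.

```python
from typing import Dict, List, Optional

FORGET_THRESHOLD = 0.98  # the 98.00% threshold stated in the text


def retain_case(case_base: Dict[int, dict], problem: List[float], solution: str,
                best_similarity: float) -> Optional[int]:
    """Store the learned case unless it is nearly identical to a stored one.

    Returns the new sequential identifier, or None when the case is forgotten.
    """
    if best_similarity > FORGET_THRESHOLD:
        return None  # forgetting strategy: too similar to an existing case
    new_id = max(case_base, default=0) + 1  # next sequential identifier
    case_base[new_id] = {"problem": problem, "solution": solution}
    return new_id


# Hypothetical case base holding 3000 placeholder cases.
case_base = {i: {"problem": [], "solution": ""} for i in range(1, 3001)}
kept = retain_case(case_base, [0.4, 0.7], "hypothetical solution", 0.9528)
dropped = retain_case(case_base, [0.4, 0.7], "hypothetical solution", 0.985)
```

The similar and dissimilar associations of a retained case would be appended to the association table in the same step; that part is omitted here to keep the sketch short.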
Secondly, once the CBR system decides to retain the learned case, the existing associations of past cases should be updated as well. The following two scenarios, illustrated in Figure 5, are considered in the manuscript.
In Figure 5a, when N1 is compared with P133, P148 and P301, the similarity measurement between N1 and P148 is found to be the highest. As a result, P148 takes the first position in the similar association of N1. Meanwhile, the existing associations of N1 with P407 and P157 are adjusted by moving them backward in the association table. The association with P301 is updated in the same way.
In Figure 5b, the similar and dissimilar associations of P14 are presented. During the iteration, P14 is compared with N1. The comparison indicates that N1 is more closely associated with P14 than P256 is. Consequently, N1 takes the third position in the similar association of P14, and P256 is therefore removed from this entry of the association table.
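The update in the second scenario amounts to keeping a fixed-size, ordered list per past case. The sketch below is an assumption about how such an update could be written, not the authors' implementation; the function name, the case identifiers other than P14, P256 and N1, and all similarity values are made up for illustration.

```python
from typing import List, Tuple

Entry = Tuple[str, float]  # (case identifier, similarity measurement)


def update_association(entries: List[Entry], candidate: str, sim: float,
                       k: int = 3, most_similar: bool = True) -> List[Entry]:
    """Return the k best entries after offering a new candidate.

    For a similar association the highest similarities are kept; for a
    dissimilar association (most_similar=False) the lowest ones are kept.
    """
    merged = [e for e in entries if e[0] != candidate] + [(candidate, sim)]
    merged.sort(key=lambda e: e[1], reverse=most_similar)
    return merged[:k]


# Similar association of P14 before the update (values are hypothetical).
p14_similar = [("P87", 0.95), ("P301", 0.93), ("P256", 0.90)]
p14_updated = update_association(p14_similar, "N1", 0.91)
```

With these made-up values, N1 enters at the third position and P256 drops out, which mirrors the behaviour described for Figure 5b.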

Scalability of the Retrieval Algorithm and the Case Base
Scalability needs to be considered, since the number of cases in the case base may keep growing. On the one hand, for the retrieval algorithm, the number of positive and negative tokens can be increased as the case base grows, while the retrieval process remains the same as presented in Section 3.3. On the other hand, the size of the case base does not increase indefinitely. We do not record the value of each variable for every day. The CBR system only stores useful past experiences: it remembers the variables of executed tasks that have a complete pair of problem and solution features, and it does not record daily measurements. Therefore, the case base in a CBR system is quite different from databases for social networks and weather stations. For instance, a weather station stores all the measured data. For an agricultural task such as spraying pesticide on rice, the task is performed around 5 to 6 times on average during a complete growing cycle. Even when the farmland is divided into a grid of 50 blocks, at most 300 useful past experiences can be stored in the case base for a single farmland. With 50 farmlands in total, the maximum number of cases stored in the case base is 15,000. In conclusion, scalability of the retrieval algorithm and the case base is not expected to be an obstacle for the CBR system.
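The upper bound quoted above follows from a simple product, spelled out here as a short calculation. The variable names are illustrative; the figures are the ones given in the paragraph, taking the upper value of 6 sprayings per growing cycle.

```python
# Upper-bound estimate of the case base size, using the figures in the text.
sprayings_per_cycle = 6    # upper value of the "5 to 6" range
blocks_per_farmland = 50   # grid blocks per farmland
farmlands = 50

cases_per_farmland = sprayings_per_cycle * blocks_per_farmland
total_cases = cases_per_farmland * farmlands
print(cases_per_farmland, total_cases)
```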

Results and Discussions
In this section, extensive experiments are performed to verify the effectiveness and efficiency of the proposed case retrieval algorithm from the following perspectives: (i) generation of the association table, (ii) result of case retrieval and (iii) update of the association table.

Experimental Settings
The agricultural case-based reasoning system that adopts the proposed algorithm tries to retrieve the most similar past cases from a case base. As introduced in Section 3.1, our proposal is employed to manage pest problems. The pest considered in the experiment is Chilo suppressalis (CS), while the target crop is rice. In total, 3000 past cases are stored in the case base and 500 new cases are prepared for testing. The entry point is selected sequentially from the past cases for each retrieval task. Although this case retrieval algorithm is developed within a European research project named Aggregate Farming in the Cloud (AFarCloud) (link to the project: http://www.afarcloud.eu/), the deployment of sensors and vehicles has not been fully completed yet. Therefore, simulated data are currently used, and they are generated within given ranges. For instance, based on the current literature [50], the planting density of rice is generated from 180 to 525 seeds/m². The life cycle of rice can be categorized into "embryogenesis," "vegetative," "ripening" and "reproductive" stages [51], encoded by the integers "1," "2," "3" and "4." We expect to receive data from real farming fields in the near future once the devices are fully deployed. The simulation data we used can be found at the following link: https://github.com/ZhaoyuZHAI/Case-base. It is assumed that all the information is complete and that no crop, pest or environmental features are missing in any new or past case.
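To make the data setup concrete, the following sketch generates simulated cases within the ranges quoted above. It is not the authors' generator: the uniform distribution, the field names and the seeds are assumptions, and only two of the case attributes are shown.

```python
import random
from typing import Dict, List

# Integer encoding of the rice life-cycle stages as listed in the text.
LIFE_STAGE_CODES = {"embryogenesis": 1, "vegetative": 2, "ripening": 3, "reproductive": 4}


def simulate_case(rng: random.Random) -> Dict[str, float]:
    """Draw one simulated case; uniform sampling is an assumption."""
    return {
        "planting_density": rng.uniform(180.0, 525.0),  # seeds per square metre
        "life_stage": rng.choice(list(LIFE_STAGE_CODES.values())),
    }


def simulate_cases(count: int, seed: int) -> List[Dict[str, float]]:
    rng = random.Random(seed)  # seeded for reproducibility
    return [simulate_case(rng) for _ in range(count)]


past_cases = simulate_cases(3000, seed=1)  # size of the case base
new_cases = simulate_cases(500, seed=2)    # cases used for testing
```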

Result of Generated Association Table
The association table is generated by comparing all past cases with each other sequentially. In this experiment, each past case is associated with three similar cases and three dissimilar cases, along with their similarity measurements. A part of this association table is given in Table 6, where '…' hides the associations of past cases 6 to 2995. The full association table for all 3000 past cases in the case base can be found at the following link: https://github.com/ZhaoyuZHAI/Case-base/blob/master/associationTableWith3. Owing to the adequate coverage, each past case is associated at least once. It is worth noting that one past case can be associated with others several times, depending on the similarity measurements. For instance, in Table 6, past case 328 is associated with both past cases 2997 and 2998, with similarity measurements of 12.18% and 7.10% respectively. However, these associations do not guarantee that past cases 2997 and 2998 are similar. Figure 6 displays the data visualization of past cases 328, 2997 and 2998 after normalization. It demonstrates that although both past cases 2997 and 2998 are dissimilar to past case 328, they still differ considerably from each other in data such as pest stage, infected area and planting density. In fact, the similarity measurement between past cases 2997 and 2998 is only 67.08%.
Another interesting observation is that one past case may appear in the similar association of more than one other case. For example, in the association table, past case 984 appears in the similar associations of both past cases 748 and 749. The data visualization of these three past cases is displayed in Figure 7. To evaluate the commonalities of these three cases, the data covariation [52] is analysed. The result is presented in Table 7.
In Figure 7 and Table 7, the result shows that past cases 984, 748 and 749 have great commonalities and that their data inflections match each other. According to the data covariation specification, a positive covariation value means that the data of the compared cases follow the same trend. Meanwhile, a smaller covariation value indicates a closer correlation between cases. In Table 7, the covariation values are positive and differ only slightly.
As a consequence, two conclusions can be drawn from the above examples.
• If the past case Px is stored in the dissimilar association of past cases Py and Pz at the same time, it is not guaranteed that past cases Py and Pz are similar to each other.
• If the past case Px is stored in the similar association of past cases Py and Pz at the same time, then past cases Py and Pz are potentially similar to each other.
This is the reason why the proposed case retrieval algorithm compares the associated past cases preferentially, instead of traversing all past cases in the case base. Under general circumstances, the potential target case can usually be found among the associated cases.
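The generation step described at the start of this subsection (each past case linked to its three most similar and three most dissimilar cases) can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the similarity measure and the table layout are assumptions.

```python
from typing import Dict, List, Sequence, Tuple

Case = Sequence[float]


def similarity(a: Case, b: Case) -> float:
    """Assumed measure: 1 minus the mean absolute difference of normalized attributes."""
    return 1.0 - sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def build_association_table(cases: Dict[int, Case],
                            k: int = 3) -> Dict[int, Dict[str, List[Tuple[int, float]]]]:
    """Link every case to its k most similar and k most dissimilar cases."""
    table = {}
    for i, case_i in cases.items():
        # Compare case i with every other case, highest similarity first.
        scored = sorted(((similarity(case_i, case_j), j)
                         for j, case_j in cases.items() if j != i), reverse=True)
        table[i] = {
            "similar": [(j, s) for s, j in scored[:k]],
            # Lowest similarity first in the dissimilar association.
            "dissimilar": [(j, s) for s, j in reversed(scored[-k:])],
        }
    return table


# Four made-up one-attribute cases, one association per kind for brevity.
toy_cases = {1: [0.0], 2: [0.1], 3: [0.9], 4: [1.0]}
toy_table = build_association_table(toy_cases, k=1)
```

Building the table compares every pair of past cases, so it is a one-off cost paid before retrieval; for 3000 cases this is on the order of nine million similarity evaluations in this naive form.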

Result of Case Retrieval
The proposed case retrieval algorithm is compared with the typical algorithm, which traverses all the past cases in the case base. The result of case retrieval is evaluated in two aspects: retrieval accuracy and retrieval efficiency. Retrieval accuracy requires that the retrieved past case be as similar as possible to the target, while retrieval efficiency requires that the number of compared past cases be as small as possible. Please note that new cases are not retained in the case base after case retrieval and that the association table is not updated during the experiments in this section. For each new case, every past case is used in turn as the entry-point case. Thus, the total number of tests is 1.5 million (3000 × 500). This design allows us to verify whether the selection of the entry-point case has any effect on the performance of the proposed retrieval algorithm.
Firstly, retrieval accuracy concerns the average precision of the retrieved top three similar cases. The formula of the average precision is given in Equation (1).
\[ \text{Precision} = \frac{TP}{TP + FP} \qquad (1) \]
where TP means true positive and FP stands for false positive. The result of the average precision is displayed in Figure 8. In Figure 8, the average precision of the retrieved top three similar cases is 90.52% (1,357,804/1,500,000), 82.11% (1,231,654/1,500,000) and 75.03% (1,125,449/1,500,000). The average precision of the retrieved top three dissimilar cases is 80.39% (1,205,858/1,500,000), 79.14% (1,187,119/1,500,000) and 75.91% (1,138,655/1,500,000). These results demonstrate that the proposed case retrieval algorithm achieves promising retrieval accuracy. Meanwhile, the selection of the entry-point case has only a minor influence on the performance of the retrieval algorithm.
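As a quick check of the reported percentages, the precision formula can be applied to the counts quoted above. The sketch assumes that every test which does not retrieve the correct case at a given rank counts as a false positive, so that TP + FP equals the 1.5 million tests.

```python
def precision(tp: int, fp: int) -> float:
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp)


total_tests = 3000 * 500  # every past case as entry point, for each new case
similar_hits = [1_357_804, 1_231_654, 1_125_449]     # top-1, top-2, top-3
dissimilar_hits = [1_205_858, 1_187_119, 1_138_655]

similar_pct = [round(100 * precision(tp, total_tests - tp), 2) for tp in similar_hits]
dissimilar_pct = [round(100 * precision(tp, total_tests - tp), 2) for tp in dissimilar_hits]
print(similar_pct, dissimilar_pct)
```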
Under certain circumstances, the proposed case retrieval algorithm fails to retrieve the correct top three similar cases. For instance, the algorithm may be unable to identify the most similar past case, because it is missed during runtime due to the limit on retrieval time (iteration number). The second most similar past case then usually takes the first position and is treated as the output. This is also the reason why the second and the third similar past cases are not always retrieved correctly. However, this is acceptable because case-based reasoning does not necessarily require the successful retrieval of the most similar case. In any event, the retrieved top three similar past cases are not exactly the same as the target one [53]. It is worth noting that, apart from case retrieval, case-based reasoning also includes the processes of solution reuse and revision. These processes are responsible for updating the solutions of retrieved past cases to adapt them to the current situation. Therefore, successfully retrieving the second or the third most similar past case is sufficient for CBR-based systems. To support this point of view, an example is presented in Figures 9-11, displaying the data visualization of new case 7 and the retrieved top three similar past cases 52, 267 and 646. The statistical analysis of these cases is given in Table 8. The data covariance function, the root mean square error (RMSE) and the mean absolute error (MAE) are adopted here.
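For readers who want to reproduce this kind of comparison, the three statistics named above can be computed for two normalized attribute vectors as sketched below. The sample form of the covariance (dividing by n - 1) is an assumption, since the text does not state which form is used, and the input vectors are made up.

```python
import math
from typing import Sequence


def covariance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sample covariance of two equally long attribute vectors."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def rmse(a: Sequence[float], b: Sequence[float]) -> float:
    """Root mean square error between two cases."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def mae(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute error between two cases."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Made-up normalized attribute vectors of a new case and a retrieved past case.
new_case = [0.2, 0.4, 0.6]
past_case = [0.1, 0.5, 0.6]
stats = (covariance(new_case, past_case), rmse(new_case, past_case), mae(new_case, past_case))
```

A positive covariance together with small RMSE and MAE values is the pattern the paper reads as a close match between a new case and a retrieved one.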
From the results in Figures 9-11 and Table 8, it is concluded that new case 7 has great commonalities with retrieved past cases 52, 267 and 646. The similarity measurements are 95.28%, 95.14% and 95.09% respectively. Meanwhile, the values of data covariation are positive, showing a close correlation between the new case and the past cases. In other words, there are only minor differences between the retrieved top three similar past cases, and the solutions of all three can be reused and revised. Therefore, the retrieval accuracy of the proposed algorithm is verified.
Secondly, we evaluate the retrieval efficiency of the proposed algorithm by comparing it with typical case retrieval algorithms. Typical algorithms identify the most similar past cases by traversing the whole case base. As a consequence, the number of compared cases in this experiment reaches 3000 for a single search. In contrast, the proposed case retrieval algorithm takes advantage of the association table and measures the similarity between the new case and the associated cases preferentially. From the evaluation perspective, a smaller number of compared cases indicates greater retrieval efficiency. The number of travelled cases for the 1.5 million tests is summarized in Table 9. In Table 9, the number of compared past cases lies between 800 and 1197, with a minimum of 834 and a maximum of 1197. The average number of compared cases is around 1047 (1046.95). Compared with typical case retrieval algorithms (3000 compared cases), the proposed algorithm is therefore able to retrieve similar past cases with fewer comparisons. In other words, it has greater retrieval efficiency.
Overall, the proposed algorithm enables case retrieval with high accuracy and efficiency. After successful case retrieval, the solutions of retrieved past cases can be reused and revised to resolve the new problems.
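The efficiency gain can be expressed as the share of the case base that is actually compared. The short calculation below uses only the figures reported for Table 9; the variable names are illustrative.

```python
baseline_comparisons = 3000   # full traversal of the case base
average_comparisons = 1046.95  # average reported for the proposed algorithm
worst_case_comparisons = 1197  # largest number reported

share_compared = average_comparisons / baseline_comparisons
share_saved = 1 - share_compared
worst_case_share = worst_case_comparisons / baseline_comparisons
print(f"{share_compared:.2%} compared, {share_saved:.2%} saved on average")
```

On average roughly a third of the past cases are compared, and even the largest reported count stays below 40% of a full traversal.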

Result of Updated Association Table
During the experiments in this section, new cases are retained in the case base after successful retrieval and the association table is updated accordingly. Since there are 500 new cases for testing, the experiment is performed 500 times. In each test, past case 1 is selected as the entry point (entry case) for comparison at the initial iteration. The average precision of these 500 retrieval tasks is shown in Figure 12. In Figure 12, the average precision of the retrieved top three similar cases is 90.80% (454/500), 81.60% (408/500) and 75.60% (378/500). The average precision of the retrieved top three dissimilar cases is 80.40% (402/500), 79.60% (398/500) and 74.40% (372/500). These results indicate that the proposed retrieval algorithm maintains its retrieval accuracy after new cases are retained in the case base.
After the retrieval tasks are performed for all 500 new cases, the result shows that 30 new cases are not retained in the case base because the similarity measurement between these cases and existing ones exceeds 98.00%. These 30 new cases are listed in Table 10. It is worth mentioning that the forgetting strategy is applied to new cases 138 and 399 because they have great commonalities with the newly retained cases 3057 and 3185 respectively. The remaining unretained cases are all similar to past cases that were already stored in the case base.
To verify the updated association table, the numbers of updates in both the similar and dissimilar associations are counted, as shown in Figure 13. In total, 2594 updates are counted in the association table, of which 43.71% occur in the similar associations and 56.29% in the dissimilar associations.
Lastly, the association table also contains the similar and dissimilar associations of newly retained cases, which are presented in Table 11, where '…' hides the associations of past cases 3006 to 3465. In total, 470 newly retained cases have their own similar and dissimilar associations, which means the updates in the association table are successful.

Conclusions
Typical approaches to case retrieval try to match the most similar past cases by traversing the whole case base, leading to low efficiency when a large volume of cases is stored. Therefore, this paper proposes a case retrieval algorithm for agricultural case-based reasoning systems. Before the retrieval tasks are performed, an association table is constructed, consisting of both the similar and dissimilar relationships between all past cases. The novelty of our proposal lies in selecting associated cases from the table and evaluating their similarity to the new case preferentially. In this way, the proposed case retrieval algorithm is able to retrieve similar past cases by comparing fewer cases in the case base. Our proposal also concerns the retention part of the case-based reasoning loop. The association table is updated during runtime. After successful retrieval, the new case is retained in the case base, along with its similar and dissimilar associations, when the similarity measurement between the new case and the retrieved most similar past case is smaller than 98.00%. Meanwhile, the associations of past cases are updated as well. The experimental result demonstrates that our proposal is able to retrieve similar past cases with high efficiency and accuracy. The case base is successfully maintained with newly retained cases and their associations.
It is acknowledged that case retrieval is one of the most significant parts of case-based reasoning, because the remaining processes, such as reuse and revision, cannot proceed without successful case retrieval. Thus, the proposed case retrieval algorithm is not only useful in CBR-enabled agricultural systems but also has great potential in CBR systems for other domains. With efficient retrieval capability, a CBR-based ADSS is able to provide farmers with quick decision support for agricultural management.
Since this work was developed within the AFarCloud project, we expect to receive real data from the farms to verify the proposed case retrieval algorithm. To further improve the performance of the proposed retrieval algorithm, it is also worth looking into the selection of a preferable entry-point case. Classifying similar cases into a single cluster and selecting the most representative case as the entry point to be compared with the new case might improve the performance of the algorithm. Furthermore, setting the range of similarity levels (presented in Table 4) more precisely might also improve the performance. Lastly, it would be interesting to investigate the performance of the proposed algorithm when the size of the case base increases by a larger magnitude. Under such circumstances, the number of similar and dissimilar associations is expected to increase as well.