1. Introduction
Similarity-based decision support systems have gained prominence in recent years for their ability to provide personalised suggestions across a wide range of domains, from entertainment and e-commerce to education and healthcare [1]. Their integration into agriculture presents a novel opportunity to enhance decision-making by offering tailored advice to farmers based on environmental, operational, and regional contexts.
In agriculture, decision-making often entails processing diverse data under time constraints. Similarity-based decision support systems offer a pathway to reduce decision fatigue, speed up recommendations, and support more informed agricultural practices. By lowering the prior knowledge required, these systems have the potential to empower new entrants into farming, addressing global food security concerns amid increasing population pressures and environmental challenges [2,3].
One of the key motivations for this study is to support small and resource-constrained farmers by demonstrating that classic and accessible algorithms, such as K-Nearest Neighbour and Approximate Nearest Neighbour with Locality Sensitive Hashing, can still yield valuable, personalised recommendations. Unlike complex or opaque models, these methods are interpretable, lightweight, and can be applied in settings with limited computational infrastructure.
Similarity-based methods identify patterns in user behaviour to generate personalised recommendations. In agriculture, this can translate to insights such as optimal crop selection, resource management strategies, and pest control interventions. A notable benefit of user–user similarity-based methods lies in their ability to compare the profiles of farmers operating under similar agro-environmental conditions and infer actionable practices [4,5]. This approach facilitates the dissemination of sustainable techniques and encourages innovation among smaller-scale farmers who might lack access to institutional research and advisory services.
Machine learning-based decision support systems have already been studied for their potential to optimise farming decisions using historical and contextual data [6]. Studies have demonstrated the role of factors such as soil properties in yield prediction and decision-making [7]. For instance, conservation practices have led to increases in corn yield of up to 41% due to improved soil health [8], while global crop losses from pests and diseases remain substantial, amounting to 20–40% annually, with an economic impact of approximately USD 220 billion [9]. Such challenges underscore the value of data-driven support systems capable of mitigating risks and enhancing productivity.
Further, similarity-based algorithms can be used to manage emerging threats by aggregating shared information from regional farmer networks. If multiple users report pest outbreaks or adverse climate patterns, the system can suggest countermeasures based on previously observed outcomes. This real-time adaptability provides a robust advantage over static advisory models.
Among the core algorithms used in similarity-based systems are K-Nearest Neighbour (KNN) and Approximate Nearest Neighbour using Locality Sensitive Hashing (ANN-LSH). KNN is known for its interpretability and high accuracy in smaller datasets, while ANN-LSH offers scalability and faster processing for real-time applications. This research applies both algorithms to agricultural data, evaluating their performance on structured profiles generated with domain-relevant attributes [10]. A case study of SyrAgri, a web-based system deployed for Malian farmers, demonstrates the practicality of similarity-based decision support systems in agriculture, where contextual personalisation has been successfully implemented [11].
The novelty of this work lies in applying user–user similarity-based methods within an agricultural context, using both KNN and ANN-LSH to match farming profiles based on 13 environmental and operational attributes. Unlike traditional item-based models, this approach supports localised decision-making and lays the groundwork for future real-time capabilities. By integrating synthetically generated, literature-informed datasets, this study contributes a methodological foundation for building adaptable, intelligent agricultural similarity-based decision support systems.
The remainder of this paper is structured as follows. Section 2 presents the materials and methods employed in this study, including data generation and algorithmic implementation. Section 3 discusses the obtained results, highlighting algorithm performance and practical insights. Section 4 provides a detailed discussion of the findings in relation to agricultural decision-making and system scalability. Finally, Section 5 presents concluding remarks and outlines future research directions.
2. Materials and Methods
Similarity-based methods can be categorised into three main types relevant to agricultural recommendation systems: user-based, item-based, and hybrid approaches [12,13,14,15,16].
User-based filtering generates recommendations by analysing the practices of farmers operating under similar conditions, such as climate, crop type, and farming techniques. For example, if one farmer achieves success with a particular irrigation method, that method can be recommended to others in comparable environments. This approach relies on accurate data sharing and community engagement [17].
The item-based approach suggests treatments, seeds, and fertilisers based on the observed success of similar products. If an agricultural item is highly rated or frequently used, it can be inferred to be of high quality and, as a result, a safe recommendation.
Hybrid methods combine both user-based and item-based approaches to produce more complete recommendations. These systems adapt to varied farmer needs and improve decision-making across agricultural tasks. While traditional systems often emphasise user-based filtering, item-based filtering has been shown to improve scalability in large datasets by focusing on item–item similarities [18].
Beyond algorithmic function, similarity-based systems support community building by fostering farmer networks where users share experiences, solutions, and concerns. This collective knowledge helps individuals facing unfamiliar challenges learn from others with similar profiles, improving resilience and adaptability.
This study focuses on applying similarity-based methods to improve crop lifecycles and farm efficiency, with an emphasis on the user–user approach. Each farmer is represented as a user, and the algorithm identifies the most similar users based on attributes such as soil type, climate, and pest management.
The implementation was carried out in Java 11.0.17 (Oracle OpenJDK, Austin, TX, USA), chosen for its efficiency in handling both small and large datasets and for its support of the KNN and ANN-LSH algorithms. Farmer profiles were stored in CSV format, and Java collections were used to ensure efficient storage, retrieval, and neighbour computation.
2.1. Why Is User–User Filtering More Suitable?
Unlike item-based filtering, which focuses on individual tools or treatments, user–user filtering considers broader environmental and contextual factors influencing agricultural success, such as climate, soil type, and pest pressures.
Farmers operating under similar conditions are more likely to benefit from shared strategies. For instance, if a farmer with loamy soil in a temperate climate grows tomatoes successfully, the system will recommend similar practices to other farmers with matching conditions.
Moreover, user–user filtering allows dynamic adaptation to changing environments, whereas item-based filtering assumes static product effectiveness. For personalised agricultural recommendations that consider environmental variability, user–user filtering is the most suitable method [13,19,20].
2.2. Data Gathering
For this study, a synthetic dataset was programmatically generated using a custom Java application (AgricultureDatasetGenerator.java). This dataset simulates agricultural practices by combining realistic values sampled from predefined, literature-supported ranges. The full implementation, including the data generation code and CSV samples used in the experiments, is publicly available in the repository linked in the Data Availability section. The dataset includes 13 attributes that reflect key factors influencing agricultural decisions:
Soil Type;
Fertilizer Type;
Climate Data;
Humidity Level;
Pest Disease Management;
Plant Time;
Crop Harvested;
Water Temperature;
Harvest Colour;
Seed Supplier;
Season;
Distance to Retailer;
Harvest Yield (%).
Each feature captures elements influencing crop growth, resilience, and agricultural efficiency, establishing a robust basis for recognising commonalities among farming profiles.
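As a rough illustration of how profiles might be sampled from predefined ranges, the following Java sketch writes a few CSV rows. It is not the AgricultureDatasetGenerator.java from the repository: the class name, the reduced attribute set, the category values, and the numeric ranges are all placeholder assumptions chosen for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Random;

// Illustrative only: NOT the study's AgricultureDatasetGenerator.java.
// Category values and numeric ranges below are placeholder assumptions.
class SyntheticProfileSketch {
    static final String[] SOIL = {"clay", "loamy", "sandy"};
    static final String[] SEASON = {"Spring", "Summer", "Autumn", "Winter"};
    static final String HEADER = "SoilType,Season,HumidityLevel,WaterTemperature,HarvestYield";

    // Pick one category uniformly at random.
    static String pick(String[] values, Random rng) {
        return values[rng.nextInt(values.length)];
    }

    // Sample a number uniformly from [min, max).
    static double uniform(double min, double max, Random rng) {
        return min + (max - min) * rng.nextDouble();
    }

    // Generate n CSV rows (header excluded); a fixed seed keeps runs reproducible.
    static List<String> generate(int n, long seed) {
        Random rng = new Random(seed);
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            rows.add(String.format(Locale.ROOT, "%s,%s,%.1f,%.1f,%.1f",
                    pick(SOIL, rng), pick(SEASON, rng),
                    uniform(30.0, 90.0, rng),    // humidity in percent, assumed range
                    uniform(10.0, 35.0, rng),    // water temperature in degrees Celsius, assumed range
                    uniform(40.0, 100.0, rng))); // harvest yield in percent, assumed range
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(HEADER);
        generate(5, 42L).forEach(System.out::println);
    }
}
```

Seeding the random generator is a design choice that makes a generated dataset repeatable across runs, which is convenient when comparing algorithms on identical inputs.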
To minimise the burden on the farmer, the system is designed with future automation in mind, aiming to reduce user input through data pre-populated from reliable online sources. For this proof of concept, however, most input values were entered manually or drawn from fixed ranges derived from agricultural literature and validated datasets, so that the system operates with realistic parameters that reflect actual farming scenarios. Each attribute was evaluated to determine the most appropriate and user-friendly input method, acknowledging that farmers are primarily focused on crop management and productivity rather than on time-consuming data entry. While automated, real-time data retrieval is planned for future iterations, the current method offers a reliable baseline and facilitates testing under practical and representative conditions.
Soil Type: Indicates the main soil category (e.g., clay, loamy, sandy), affecting moisture retention, nutrient levels, and crop suitability. This attribute informs soil-based crop and fertiliser recommendations, grounded in realistic values from agricultural sources [21,22].
Fertilizer Type: Identifies the fertiliser class most effective for a given soil and crop type (e.g., nitrogen-rich for leafy crops) [23].
Climate Data: By using historical and manually curated climate data, the system identifies crops best suited for specific weather conditions, helping reduce the risks associated with weather variability by aligning crop choices with climate patterns [24].
Humidity Level: Indicates atmospheric moisture, which affects disease risk and irrigation scheduling. High humidity increases fungal risk, while low humidity raises water demand [25].
Pest and Disease Management: Covers approaches like chemical (synthetic or organic) and biological controls. Combining multiple strategies improves crop resilience and supports targeted pest control [26].
Plant Time: Indicates the crop's sowing period (month/week), which strongly influences yield. Timely planting can increase productivity, while delays often reduce it [27].
Crop Harvested: Identifies the harvested crop type (e.g., tomatoes, wheat), enabling tracking of yield trends and refinement of recommendations based on interactions between crop type, soil, and fertilisers.
Water Temperature: Affects crop development and disease vulnerability. Slightly elevated temperatures can stimulate plant growth, but excessive heat may reduce chlorophyll levels and increase susceptibility to disease [28].
Harvest Colour: Indicates crop ripeness at harvest and helps determine market readiness. Especially important for crops like bananas and tomatoes, this assists in optimising harvest timing.
Seed Supplier: Identifies the source of seeds used. Well-adapted, high-quality seeds improve germination, growth uniformity, and yield potential across different climates [29].
Season: Denotes the planting or harvesting period (Spring, Summer, Autumn, Winter). Seasonal alignment improves crop health, yield forecasting, and risk management through informed scheduling [30].
Distance to Retailer: Measures proximity to markets. Longer distances discourage perishable crops due to increased transport risks and costs, guiding farmers in selecting suitable crop types [31].
Harvest Yield (%): Represents the success rate of harvested crops, serving as a key performance indicator for farm practices under specific environmental conditions.
2.3. Algorithm Applied
The K-Nearest Neighbour (KNN) algorithm is applied to the structured dataset composed of farmer profiles, each defined by a fixed set of attribute values [4,14,32,33,34,35]. These attributes are divided into categorical attributes (e.g., Soil Type, Climate, Pest Management) and numerical attributes (e.g., Water Temperature, Humidity Level), allowing the system to assess both qualitative and quantitative similarities between farmers.
To fine-tune the performance of the KNN model, we evaluated the impact of varying the number of neighbours, $k$. Several values were tested, including 10 and 100, across datasets of different sizes. As shown in Table 1, while larger $k$ values slightly increased stability, they also diluted the relevance of the neighbours by incorporating more dissimilar farmer profiles. Since our recommendation task values specificity and meaningful similarity, we selected the value of $k$ that balances accuracy and contextual relevance, ensuring that recommendations are generated from the most similar farmers while maintaining computational efficiency.
To better understand the recommendation process, we outline below the steps involved in computing farmer similarity:
Step 1: Categorical Distance. The first step is to calculate the distance between the selected farmer (the query point) and another farmer from the list. For each categorical attribute, assign a distance of 0 if the two values are the same and a distance of 1 if they differ. Suppose the compared farmer has the same soil type and pest management, but a different climate. Summing the per-attribute values yields the Categorical Distance in Equation (1):

$$D_{\text{cat}} = 0 + 0 + 1 = 1 \tag{1}$$
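Step 2: Numerical Distance. For the numerical attributes, the Euclidean distance between two points $p = (p_1, \dots, p_n)$ and $q = (q_1, \dots, q_n)$ is used, as given by Equation (2):

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \tag{2}$$

As an illustration, consider two users with water temperatures of 32 °C and 34 °C. The difference is $32 - 34 = -2$ and the squared difference is $(-2)^2 = 4$. Applying the same logic to the humidity levels $h_1$ and $h_2$ of the two users gives the Euclidean distance in Equation (3):

$$D_{\text{num}} = \sqrt{(32 - 34)^2 + (h_1 - h_2)^2} = \sqrt{4 + (h_1 - h_2)^2} \tag{3}$$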
Step 3: Total Distance. The Categorical and Numerical distances are then combined to obtain the Total Distance (Equation (4)). This computation is repeated between the selected user and every other user.
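Steps 1 to 3 can be sketched in Java as follows. This is a minimal illustration rather than the study's implementation: the class and method names are hypothetical, the humidity values are invented for the example, and combining the two components by an unweighted sum is an assumption made here because the text only says that they are combined.

```java
// Illustrative sketch of Steps 1-3. The unweighted sum in totalDistance is an
// assumption; the study's exact combination rule may differ.
class MixedDistanceSketch {
    // Step 1: count categorical attributes whose values differ (0 = match, 1 = mismatch).
    static int categoricalDistance(String[] a, String[] b) {
        int mismatches = 0;
        for (int i = 0; i < a.length; i++) {
            if (!a[i].equals(b[i])) mismatches++;
        }
        return mismatches;
    }

    // Step 2: Euclidean distance over the numerical attributes.
    static double numericalDistance(double[] p, double[] q) {
        double sumSq = 0.0;
        for (int i = 0; i < p.length; i++) {
            double diff = p[i] - q[i];
            sumSq += diff * diff;
        }
        return Math.sqrt(sumSq);
    }

    // Step 3 (assumed): add the two components.
    static double totalDistance(String[] catA, double[] numA, String[] catB, double[] numB) {
        return categoricalDistance(catA, catB) + numericalDistance(numA, numB);
    }

    public static void main(String[] args) {
        // Soil type and pest management match; climate differs.
        String[] catA = {"loamy", "temperate", "biological"};
        String[] catB = {"loamy", "arid", "biological"};
        // Water temperature 32 vs 34 degrees Celsius; humidity values are hypothetical.
        double[] numA = {32.0, 60.0};
        double[] numB = {34.0, 60.0};
        System.out.println(categoricalDistance(catA, catB));        // 1
        System.out.println(numericalDistance(numA, numB));          // 2.0
        System.out.println(totalDistance(catA, numA, catB, numB));  // 3.0
    }
}
```

Because the numerical term is computed on raw values, attributes with large ranges can dominate the sum. Rescaling them first is a common remedy, although whether the study does so is not stated.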
This is a practical way to implement the similarity-based algorithm in the agricultural domain. By identifying these nearest neighbours, farmers can quickly find others facing similar challenges, allowing for knowledge exchange and more data-driven decisions [36].
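The neighbour search itself can be sketched as an exhaustive scan that ranks every other profile by its distance to the query. The `Farmer` class, its fields, the sample values, and the summed distance below are assumptions made for illustration and are not taken from the study's code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative brute-force k-nearest-neighbour search over farmer profiles.
// The Farmer class, its fields, and the summed distance are assumptions of this sketch.
class KnnSketch {
    static final class Farmer {
        final String id;
        final String[] categorical;
        final double[] numerical;

        Farmer(String id, String[] categorical, double[] numerical) {
            this.id = id;
            this.categorical = categorical;
            this.numerical = numerical;
        }
    }

    // Categorical mismatches plus Euclidean distance over the numerical attributes.
    static double distance(Farmer a, Farmer b) {
        int mismatches = 0;
        for (int i = 0; i < a.categorical.length; i++) {
            if (!a.categorical[i].equals(b.categorical[i])) mismatches++;
        }
        double sumSq = 0.0;
        for (int i = 0; i < a.numerical.length; i++) {
            double diff = a.numerical[i] - b.numerical[i];
            sumSq += diff * diff;
        }
        return mismatches + Math.sqrt(sumSq);
    }

    // Compare the query with every other profile and return the k closest ids, nearest first.
    static List<String> nearest(Farmer query, List<Farmer> all, int k) {
        List<Farmer> others = new ArrayList<>();
        for (Farmer f : all) {
            if (f != query) others.add(f);
        }
        others.sort(Comparator.comparingDouble(other -> distance(query, other)));
        List<String> ids = new ArrayList<>();
        for (Farmer f : others.subList(0, Math.min(k, others.size()))) {
            ids.add(f.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        Farmer q = new Farmer("Q", new String[] {"loamy", "temperate"}, new double[] {32.0, 60.0});
        List<Farmer> all = List.of(
                q,
                new Farmer("F1", new String[] {"loamy", "temperate"}, new double[] {33.0, 60.0}),
                new Farmer("F2", new String[] {"clay", "temperate"}, new double[] {32.0, 61.0}),
                new Farmer("F3", new String[] {"sandy", "arid"}, new double[] {20.0, 30.0}));
        System.out.println(nearest(q, all, 2)); // [F1, F2]
    }
}
```

Each query touches every stored profile once, which is why the cost of this exact search grows with the size of the dataset.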
2.4. Example: Distance Calculation Between Two Farmers
To clarify how distance influences similarity analysis in the similarity-based method, we illustrate the step-by-step computation using two farmer profiles from the dataset. Farmer 1 and Farmer 2 have the attribute values shown in Table 2.
Step 1: Categorical Distance. The categorical attributes in this context are Soil Type, Fertiliser Type, Climate, Pest Management, Plant Time, Crop Harvested, Harvest Colour, Seed Supplier, and Season. Out of these nine attributes, Farmer 1 and Farmer 2 match only on Climate and differ on the remaining eight. We apply the simple matching distance, where a value of 0 indicates a match and 1 a mismatch:
Step 2: Numerical Distance. The numerical attributes considered are Humidity Level, Water Temperature, Distance to Retailer, and Harvest Yield. These are compared using the Euclidean distance formula:
Step 3: Total Distance. The overall distance between the two farmers, used for neighbour ranking, combines both categorical and numerical components:
Interpretation and Influence on Analysis. This total distance reflects a moderate similarity score. While numerical environmental parameters such as humidity, water temperature, and harvest yield are relatively close, the significant number of categorical mismatches highlights diverging agricultural practices. In the context of the similarity-based method, such a distance affects the recommendation strength: Farmer 2 may provide useful input on environmental conditions and operational metrics, but their advice on crop-specific strategies, planting periods, or pest control may be less reliable for Farmer 1. By quantifying both exact and approximate similarity, the system ensures more context-aware and informed decision-making for farmers.
2.5. Potential Solutions to Improve Scalability in Agricultural Similarity-Based Methods
In the similarity-based method, the most similar users are found by calculating the distance between the selected farmer and every other user, as discussed above. This exhaustive comparison becomes difficult to complete promptly on large or real-time data. ANN algorithms address this by trading some accuracy for speed, which makes them well suited to real-time applications where quick decisions are essential [37]. They identify neighbours quickly by focusing on approximate similarities rather than on exact distances. One of the most popular ANN methods is Locality Sensitive Hashing (LSH), which is widely used with high-dimensional data to identify similar items more efficiently [38,39,40,41,42].
LSH creates hash functions that map similar data points into the same “bucket” or hash space. The algorithm assigns each data point a hash value based on its features, so that points close in the feature space (similar users) are likely to fall into the same bucket.
Table 3 provides an overview of how the number of hash functions and bands impact the balance between speed and accuracy in our dataset, helping to adjust the performance of the LSH algorithm.
Instead of comparing every single farmer, LSH allows the algorithm to search only within the bucket where the target farmer is located. This drastically reduces the search space, speeding up the recommendation process.
The key idea behind LSH is that it preserves locality: similar data points end up in the same or nearby buckets with high probability.
Applied to our similarity-based algorithm, this approach groups farmers who share, for example, the same environmental conditions without comparing them one by one. This grouping allows the system to deliver recommendations faster, even as the dataset grows.
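The bucketing idea can be illustrated with a small Java sketch. Since the text refers to hash functions and bands, the sketch assumes a MinHash signature split into bands, which is one common LSH construction and not necessarily the one used in the study. The class name, the token encoding such as "soil=loamy", and the parameter values are hypothetical, and numerical attributes would first have to be binned into tokens to be included.

```java
import java.util.*;

// Illustrative MinHash-with-banding LSH over categorical tokens such as "soil=loamy".
// The hashing scheme, parameter values, and profile encoding are assumptions.
class LshSketch {
    static final long PRIME = 2147483647L; // 2^31 - 1
    final long[] a, b;                     // coefficients of the hash functions
    final int bands, rows;
    final Map<String, Set<String>> buckets = new HashMap<>();

    LshSketch(int numHashes, int bands, long seed) {
        if (numHashes % bands != 0) throw new IllegalArgumentException("bands must divide numHashes");
        this.bands = bands;
        this.rows = numHashes / bands;
        Random rng = new Random(seed);
        a = new long[numHashes];
        b = new long[numHashes];
        for (int i = 0; i < numHashes; i++) {
            a[i] = 1 + rng.nextInt(Integer.MAX_VALUE - 1);
            b[i] = rng.nextInt(Integer.MAX_VALUE);
        }
    }

    // MinHash signature: for each hash function, the minimum hash over all tokens.
    long[] signature(Set<String> tokens) {
        long[] sig = new long[a.length];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (String t : tokens) {
            long x = t.hashCode() & 0x7fffffffL;
            for (int i = 0; i < a.length; i++) {
                sig[i] = Math.min(sig[i], (a[i] * x + b[i]) % PRIME);
            }
        }
        return sig;
    }

    // One bucket key per band: the band index plus that band's slice of the signature.
    List<String> bandKeys(long[] sig) {
        List<String> keys = new ArrayList<>();
        for (int band = 0; band < bands; band++) {
            long[] slice = Arrays.copyOfRange(sig, band * rows, (band + 1) * rows);
            keys.add(band + ":" + Arrays.toString(slice));
        }
        return keys;
    }

    // Store a profile id in every bucket its signature maps to.
    void add(String id, Set<String> tokens) {
        for (String key : bandKeys(signature(tokens))) {
            buckets.computeIfAbsent(key, k -> new TreeSet<>()).add(id);
        }
    }

    // Candidates are profiles sharing at least one band bucket with the query.
    Set<String> candidates(Set<String> tokens) {
        Set<String> result = new TreeSet<>();
        for (String key : bandKeys(signature(tokens))) {
            result.addAll(buckets.getOrDefault(key, Collections.emptySet()));
        }
        return result;
    }

    public static void main(String[] args) {
        LshSketch index = new LshSketch(12, 4, 7L);
        index.add("F1", Set.of("soil=loamy", "climate=temperate", "season=Spring"));
        index.add("F2", Set.of("soil=sandy", "climate=arid", "season=Winter"));
        // An identical profile always shares every bucket with F1.
        System.out.println(index.candidates(Set.of("soil=loamy", "climate=temperate", "season=Spring")));
    }
}
```

Only the candidates returned by the index would then need an exact distance computation, which is where the reduction in search space comes from.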
3. Results
To evaluate the effectiveness of the similarity-based decision support system in agricultural contexts, both the K-Nearest Neighbour (KNN) and Approximate Nearest Neighbour using Locality Sensitive Hashing (ANN-LSH) algorithms were applied to a synthetically generated dataset, with sizes ranging from 500 to 100,000 farmer profiles. This setup enables a comparative assessment of how each method identifies and ranks peer profiles with similar agricultural conditions, thereby informing the design of scalable and responsive decision-support tools for farming. The corresponding attribute ranges are detailed in Appendix A.
By examining the ordering and selection of similar farmers, the relative strengths and limitations of exact (KNN) versus approximate (ANN-LSH) neighbour search approaches are highlighted. These findings are particularly pertinent for practical scenarios where computational efficiency and response time are of paramount importance.
Figure 1 and Figure 2 present the two algorithms, illustrating the internal mechanisms used to calculate similarity and rank neighbours [43].
While both methods return overlapping groups of similar farmers, the order of the recommendations differs. For instance, the top-ranked recommendation produced by KNN (Figure 3) may occupy a lower position in the ANN-LSH output (Figure 4). Such discrepancies are expected due to the algorithms' differing strategies. For practitioners, the implications of these variations are significant, as recommendation order may influence which practices or peer cases are prioritised when making decisions.
To assess scalability, execution time was benchmarked at increasing dataset sizes. For 500 records, KNN required 0.0481 s, whereas ANN-LSH achieved a similar response of 0.0674 s. Initial runs included console-based debug output for neighbour retrieval, later disabled to ensure accurate timing results.
A notable performance difference appeared at 8000 farmers, where KNN exhibited a runtime of 6.19 s, while ANN-LSH required only 0.235 s. This divergence continued as the dataset scaled further.
When the dataset was expanded to 100,000 farmers, KNN's execution time reached approximately 2043.278 s (over 34 min), while ANN-LSH completed in just 1.004 s. These trends are depicted in Figure 5 and Figure 6, underscoring ANN-LSH's significant computational advantage for large-scale applications.
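Timings of this kind can be collected with a small harness built on `System.nanoTime()`. The sketch below is illustrative only: the all-pairs workload over random two-dimensional points is a stand-in rather than the study's benchmark, and the class and method names are hypothetical.

```java
import java.util.Random;

// Illustrative timing harness; the workload is a stand-in, not the study's benchmark.
class TimingSketch {
    // Time one invocation of a task in seconds using the monotonic clock.
    static double timeSeconds(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1e9;
    }

    // Stand-in workload: all-pairs Euclidean distances over n random 2-D points.
    static double allPairs(int n, long seed) {
        Random rng = new Random(seed);
        double[][] pts = new double[n][2];
        for (double[] p : pts) {
            p[0] = rng.nextDouble();
            p[1] = rng.nextDouble();
        }
        double checksum = 0.0;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                checksum += Math.hypot(pts[i][0] - pts[j][0], pts[i][1] - pts[j][1]);
            }
        }
        return checksum;
    }

    public static void main(String[] args) {
        for (int n : new int[] {500, 1000, 2000}) {
            double[] sink = new double[1]; // keeps the result live so the work is not optimised away
            double seconds = timeSeconds(() -> sink[0] = allPairs(n, 1L));
            System.out.printf("n=%d  %.4f s  (checksum %.1f)%n", n, seconds, sink[0]);
        }
    }
}
```

Keeping console output outside the timed region, as in this sketch, avoids the distortion from debug printing mentioned above. Single-run measurements on the JVM are also affected by just-in-time compilation, so repeated runs are generally advisable.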
ANN-LSH performance is influenced by the configuration of its parameters, specifically the number of hash functions and bands. As shown in Table 3, lower values yield faster execution but lower accuracy, while higher values improve precision at the cost of increased computational overhead [44,45].
To further illustrate this trade-off, Table 4 presents execution times for various parameter configurations applied to a dataset of 100,000 farmers.
These results confirm that increasing the number of hash functions and bands enhances the approximation quality of ANN-LSH but also leads to slower performance. If configured inappropriately, the algorithm may fail to retrieve sufficiently similar neighbours, reducing the effectiveness of the recommendations delivered to farmers.
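The effect of these parameters can be made concrete under a standard assumption that the text does not state explicitly: a MinHash-style scheme in which the $n$ hash functions are split into $b$ bands of $r$ rows each, so that $n = b\,r$. For two profiles whose attribute sets have Jaccard similarity $s$, the probability that they share at least one bucket, and are therefore compared, is

$$P(\text{candidate}) = 1 - \left(1 - s^{r}\right)^{b}$$

As a worked example with the hypothetical setting $b = 4$ and $r = 3$, a pair with $s = 0.8$ becomes a candidate with probability $1 - (1 - 0.512)^4 \approx 0.94$, whereas a pair with $s = 0.3$ does so with probability $1 - (1 - 0.027)^4 \approx 0.10$. Under this model, adding bands raises the chance of retrieving truly similar farmers but also enlarges the candidate set that must be checked, which is one way to read the slower runtimes at larger parameter values.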
In summary, both algorithms offer viable approaches for generating peer-based agricultural recommendations. However, ANN-LSH demonstrates a significant advantage in terms of scalability and runtime efficiency, making it a promising candidate for real-time integration in future data-intensive agricultural advisory systems. Its flexibility in parameter tuning further enables trade-offs between speed and precision, thereby supporting practical deployment in diverse farming environments.
To further clarify what influences the differences between the two algorithms, Table 5 summarises the key parameters affecting their behaviour, including their impact on accuracy, similarity grouping, and execution time.
3.1. Sensitivity Analysis of Input Variables
To determine which input variables have the greatest influence on the classification results of both the KNN and ANN-LSH models, we performed a sensitivity analysis inspired by the methodology used in [46]. For KNN, a one-at-a-time (OAT) perturbation approach was adopted, in which each attribute was varied individually while the others were kept constant, and the impact on classification accuracy was measured. For ANN-LSH, the connection weight method was used to infer the relative importance of each variable by analysing the magnitude of the feature weights and their impact on the output layer.
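The one-at-a-time idea can be sketched as follows. The study measures the effect on classification accuracy; this sketch instead uses the fraction of baseline neighbours lost after a perturbation as the response, which is an assumption made to keep the example self-contained. The data values, the perturbation size, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative one-at-a-time (OAT) sensitivity check on numerical attributes.
// Using neighbour-set overlap as the response is an assumption of this sketch.
class OatSketch {
    // Euclidean distance between two numerical profiles.
    static double distance(double[] p, double[] q) {
        double sumSq = 0.0;
        for (int i = 0; i < p.length; i++) {
            double diff = p[i] - q[i];
            sumSq += diff * diff;
        }
        return Math.sqrt(sumSq);
    }

    // Indices of the k rows closest to the query.
    static Set<Integer> neighbours(double[] query, double[][] data, int k) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < data.length; i++) {
            idx.add(i);
        }
        idx.sort(Comparator.comparingDouble(row -> distance(query, data[row])));
        return new HashSet<>(idx.subList(0, Math.min(k, idx.size())));
    }

    // Fraction of the baseline neighbours lost when one attribute is shifted by delta.
    static double sensitivity(double[] query, double[][] data, int k, int attr, double delta) {
        Set<Integer> base = neighbours(query, data, k);
        double[] perturbed = query.clone();
        perturbed[attr] += delta; // vary one attribute, keep the others fixed
        Set<Integer> kept = neighbours(perturbed, data, k);
        kept.retainAll(base);
        return 1.0 - (double) kept.size() / base.size();
    }

    public static void main(String[] args) {
        // Columns: water temperature, humidity (hypothetical values).
        double[][] data = {{30, 60}, {31, 60}, {30, 80}, {40, 60}};
        double[] query = {30, 61};
        for (int attr = 0; attr < query.length; attr++) {
            System.out.println("attribute " + attr + ": " + sensitivity(query, data, 2, attr, 8.0));
        }
        // attribute 0: 0.5
        // attribute 1: 0.0
    }
}
```

In this toy dataset the same shift changes half of the neighbour set when applied to the first attribute and none of it when applied to the second, so the first attribute would be ranked as more influential.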
The analysis shows that Water Temperature had the strongest influence on the KNN model. This is expected, as it is a key environmental variable that differentiates farmer profiles in the dataset. For ANN-LSH, Humidity Level was the most significant, suggesting that the hash functions were more sensitive to this feature in grouping similar farmers. Other high-impact variables included Soil Type and Pest Management, which relate directly to farming strategies and environmental adaptation, as can be seen in Table 6.
This sensitivity analysis helps highlight the robustness of each model in capturing the most discriminative features and supports a better understanding of how model-based similarities are determined, which in turn informs decision-making for agricultural recommendations.
3.1.1. Benefits
The predictive models developed in this study directly support farmers by offering tailored insights derived from peers facing similar environmental and operational conditions. For example, a farmer experiencing reduced yield during spring planting can use the system to identify others with comparable soil, climate, and humidity profiles who achieved better results. By analysing what fertilisers or pest management strategies those peers used, the farmer can make more informed decisions for future crop cycles. Similarly, recommendations for optimal planting periods or seed suppliers can be extracted by identifying consistent patterns in successful neighbouring profiles. These applications demonstrate the model’s role in enabling data-driven, context-aware decisions on the ground.
Both KNN and ANN-LSH models enable the provision of personalised recommendations for farmers sharing similar climatic, soil, and crop conditions. The system can suggest agricultural practices associated with successful harvests or inputs (such as fertilisers or pest control methods) that have contributed to healthier crops. These data-driven recommendations can lead to improved productivity, higher-quality yields, and increased income.
KNN offers a straightforward approach that is easy to implement and does not require intensive training phases. It is particularly effective for basic similarity-based searches where transparency and interpretability are key.
On the other hand, ANN-LSH provides scalability and efficiency for real-time applications. It excels in scenarios where large datasets must be queried quickly, enabling the rapid identification of similar farmers without sacrificing too much accuracy. This makes it ideal for operational environments with continuous data updates.
In addition to supporting decision-making, the approach fosters knowledge sharing among farming communities. By building a network of similar profiles and outcomes, the system promotes the diffusion of best practices across farmers facing comparable challenges, ultimately leading to improved agricultural resilience and innovation.
3.1.2. Disadvantages
The KNN method struggles with large datasets, as it requires computing the distance from the selected point to all other data points. As the dataset grows, the time needed to compute recommendations increases accordingly.
While ANN-LSH improves speed and scalability, it may sometimes yield approximate rather than exact neighbours. This approximation can overlook subtle distinctions between farmers that would otherwise be captured by KNN.
Both algorithms rely on the availability and accuracy of the input data. If inconsistencies exist in the dataset, they can negatively affect the quality of the resulting recommendations [47].
Although KNN and ANN-LSH can effectively identify similar profiles, they may struggle to capture complex interactions, such as the combined effects of soil type and climate conditions on crop yield.
KNN provides a practical, easy-to-implement tool for recommending practices based on similarities between farmers. However, it performs best with clean, structured data and when applied to datasets of manageable size. When combined with other analytical methods or integrated into a broader system, KNN can significantly enhance decision-making, making it a valuable asset.
While the current implementation relies on synthetic and static input data, ANN-LSH offers the computational efficiency required for future real-time or large-scale deployment. These comparative advantages are summarised in Table 7.
4. Discussion
This study examined the application of K-Nearest Neighbour (KNN) and Approximate Nearest Neighbour using Locality Sensitive Hashing (ANN-LSH) in agricultural recommendation systems. The effectiveness of both algorithms was analysed in identifying similar farming profiles and suggesting contextually relevant practices. While KNN yielded highly accurate recommendations by exhaustively comparing data points, ANN-LSH demonstrated a scalable and faster alternative by leveraging hash functions to approximate neighbours efficiently.
The results confirm that both methods successfully identify meaningful patterns in farmer profiles, aiding data-driven decision-making. In particular, KNN’s precision and ANN-LSH’s performance balance offer complementary benefits depending on use-case demands. Despite algorithmic differences, both have proven useful in enabling farmers to optimise crop selection, identify reliable seed suppliers, and improve pest management strategies, ultimately enhancing productivity and resource efficiency.
While this study provides a promising proof of concept, several areas warrant further investigation to strengthen the system’s accuracy, flexibility, and real-world deployment potential. Future implementations should integrate real-time environmental data, such as live weather updates, soil conditions, and climate anomalies, to refine recommendations dynamically and enhance responsiveness. Moreover, combining KNN and ANN-LSH with deep learning techniques could lead to hybrid models that preserve accuracy while improving scalability.
Personalisation and adaptive learning also represent promising directions for improvement. Models that evolve based on individual farmer feedback and observed field conditions could improve the long-term relevance of recommendations. Additionally, the methodology could be extended to other agricultural sectors, including livestock management, greenhouse cultivation, and hydroponic systems, where similar decision-support challenges exist.
Finally, user-friendly deployment is essential for widespread adoption. A mobile-accessible and intuitive interface, potentially incorporating voice command functionality, would significantly improve usability for smallholder farmers, many of whom may have limited technological proficiency. Addressing these development opportunities could help transform the proposed system into a robust and intelligent decision-support tool, thereby contributing further to the digitalisation and sustainability of modern agriculture.
5. Conclusions
This paper presented a comparative study of exact (KNN) and approximate (ANN-LSH) neighbour search algorithms applied to agricultural recommendation systems. The innovation lies in adapting these machine learning techniques—traditionally used in domains such as e-commerce or document retrieval—to simulate farmer-to-farmer recommendations within a structured, domain-specific dataset.
A synthetically generated dataset of farming profiles, supported by literature-grounded attribute ranges, served as the foundation for testing and validating algorithmic performance. This approach ensured experimental control while maintaining real-world relevance.
This work makes two primary contributions. First, it demonstrates that both KNN and ANN-LSH are capable of generating meaningful and explainable recommendations for farmers. Second, it quantifies the performance trade-offs between accuracy and scalability. While KNN achieves high precision, its execution time increases considerably with larger datasets. By contrast, ANN-LSH provides a fast and scalable alternative, delivering approximate results with substantially lower computational cost.
This study demonstrates how similarity-based decision support systems can facilitate peer-informed agricultural decision-making, leading to improved yield optimisation, reduced waste, and enhanced sustainability. The adaptation of ANN-LSH to agricultural contexts represents a novel contribution to the field and sets a foundation for future research in hybrid or real-time recommender architectures.
As the agricultural sector continues its digital transition, such systems can serve as critical enablers of precision agriculture—linking farmers not only to data but also to each other in intelligent, actionable ways.