Machine Learning Models for Predicting Romanian Farmers’ Purchase of Crop Insurance

Abstract: Considering the large size of the agricultural sector in Romania, increasing the crop insurance adoption rate and identifying the factors that drive adoption are of real interest for the Romanian market. The main objective of this research was to assess the performance of machine learning (ML) models in predicting Romanian farmers’ purchase of crop insurance based on crop-level and farmer-level characteristics. The data set used contains 721 responses to a survey administered to Romanian farmers in September 2021, and includes both characteristics related to the crop as well as farmer-level socio-demographic attributes, perception about risk, perception about insurers and knowledge about agricultural insurance. Various ML algorithms have been implemented, and among the approaches developed, the Multi-Layer Perceptron Classifier (MLP) and the Linear Support Vector Classifier (SVC) outperform the other algorithms in terms of overall accuracy. Tree-based ensembles were used to identify the most prominent features, which included the farmer’s general perception of risk, their likelihood of engaging in risky behaviour, as well as their level of knowledge about crop insurance. The models implemented in this study could be a useful tool for insurers and policymakers for predicting potential crop insurance ownership.


Introduction
Crop insurance offers farmers and agricultural producers financial protection against crop damage caused by natural events or disasters. Within the EU, in 2019 Romania had the largest number of workers employed in the agricultural sector [1]. Considering the sector size, increasing the crop insurance adoption rate and identifying the factors that drive adoption are of real interest for the Romanian market. The continuous collection of insureds' data can enable insurance companies to implement machine learning (ML) models for targeting current or potential policyholders or for predicting insured lifetime value or attrition.
Technological developments have had a crucial impact on the insurance market, as on any other financial industry. In this line, machine learning (ML) algorithms receive special attention from researchers for addressing the following issues: insurance fraud detection [2][3][4], insurance premium prediction [5], underwriting process [6], claim analysis [7], risk prediction [8], sales forecasting [9], customer churn [10], and insurance tariff plans [11] among others.
A large number of studies are devoted to demonstrating the effectiveness of ML algorithms in forecasting. Accurate forecasting has attracted attention in various fields, such as price forecasting (e.g., the authors in [12] proposed a novel machine-learning-based electricity price forecasting approach, while the authors in [13] integrated variational mode decomposition and random sparse Bayesian learning to forecast oil prices), import and export forecasting (e.g., the authors in [14] used an econometrics-based co-integration model to estimate natural gas demand, and those in [15] proposed NARX and Transformer models with regularization based on neural network models for mid-term forecasting of crop production and export, and showed that the values forecast by the proposed method were more accurate), financial products (e.g., the authors in [16] used machine learning algorithms to study the volatility of Bitcoin, and Hanafy and Ming [17] used ML for auto insurance) and others.
ML is also increasingly applied to capture information about socio-environmental patterns in agriculture, but articles have generally focused on crop yield and on a few aspects of farmer behaviour. In this regard, the authors in [25] provide a review of ML in crop yield prediction, and emphasize that the most used ML algorithm seems to be the neural network, and the most widely used deep learning algorithm is the convolutional neural network. Recently, Wu and colleagues [26] explored a nonparametric ML tool based on Gaussian process regressions to predict crop yields over time and its applications to decision-making in crop insurance. They proposed models of non-stationary crop yields in a single stage and showed the utility of their method for insurance companies. Nguyen et al. [5] investigated the efficacy of ML in predicting farmer behaviour. They used data from 534 Vietnamese farmers, and showed that the insurance premium depended on factors such as quantity harvested, cost, province, and the farmer's desire to be insured.
This article focuses on the desire of Romanian farmers to take out insurance. Our research is all the more relevant, as our literature search did not identify any papers on the behaviour of Romanian farmers that applied ML methodology. The main goal of the present paper is to fill this gap by showing the efficiency of ML approaches in predicting behavioural issues on the Romanian crop insurance market. Romania is a former communist developing country, and the resulting patterns are worth investigating because they can decisively influence farmer behaviour. The articles that have focused on this topic have shown that a wide range of factors affect behaviour. The most frequently used factors remain the socio-demographic variables, as in [27][28][29] and others. The typology of risks and the complex way in which they act underlie the existence of agricultural insurance [30][31][32][33]. Other factors that have a significant influence on the insurance decision, and which directly influence the insurance premium that the farmer pays, are related to the characteristics of the land [34,35].
Narrowing the focus to EU countries, the authors in [36], using a sample of 224 Bulgarian farmers interviewed in 2011, found that regional effects were one of the most influential factors in increasing the demand for crop insurance. Additionally, small and medium-sized farms were less likely to get insured compared to large farms. A similar result was obtained in a study in France [37]. The authors used a Logistic Regression and showed that large farms and risk exposure were predictors of the decision to take out crop insurance. In Spain, Garrido and Zilberman [38] found that premium subsidies explained an important proportion of the differences in farmers' insurance decisions. Hungarian crop farmers were positively influenced by education, size, and indebtedness of the crop, as seen in [39]. The same research also showed that crop-producing farms with an agricultural insurance contract were more efficient than farms without insurance. Using a structural model, Trestini et al. [40] obtained an interesting result in terms of the insurance intention of Italians and Poles. Risk aversion seemed to negatively influence the intention to purchase insurance, and previous insurance adoption at farm level as well as the level of trust in the insurer were the main factors of future intention. In this regard, Iyer et al. [41] emphasized that in order to make predictions about farmers' behaviour, their risk preference and the heterogeneity in the level of their risk aversion need to be taken into account. Menapace et al. [42] showed, based on a regression analysis of risk and crop insurance purchases, that farmers in the Province of Trento, Northern Italy, were more likely to buy crop insurance if they were more risk averse.
In Romania, Dragos and Mare [43] studied the factors affecting crop insurance using a sample of 308 farmers from 18 villages from the six North-West Region counties (Cluj, Bihor, Bistrita-Nasaud, Maramures, Salaj and Satu Mare). The findings, based on a logit model, showed that education, age, distance from the farm to the nearest important city, size of the village and type of crop significantly influenced the decision to purchase crop insurance. Additionally, Romanian farmers that grew vegetables were more likely to purchase crop insurance. Unfortunately, we did not identify studies that integrated information on Romanians' perception of risk, or their level of knowledge in the field of agricultural insurance. There is already evidence in the insurance literature about the difference between general education level and education in the field. The authors in [44] constructed the Index of Annuity Literacy and tested it for the German annuity market, and those in [45] constructed the Index of Insurance Knowledge for the Romanian private pension and life insurance market, and pointed out the important differences made by having or not having knowledge in the insurance field.
The main objective of this research was to assess the performance of ML techniques in predicting Romanian farmers' purchase of crop insurance. The models use crop-level and farmer-level characteristics from crop insurance purchase data collected over the span of one month, September 2021. Additionally, we aimed to identify the main contributing features in some of the models implemented, in order to increase the awareness of the actors on this market with respect to what should be addressed and how to enhance the market's development in a country like Romania, with massive agricultural potential. We achieved this by using the feature importance scores for the top 10 variables identified by each analysis method.
The following Section 2 is devoted to describing the materials and methods used in this study. We explain the data and the methodology applied. The results of our research and their interpretations are given in Section 3. Section 4 concludes the paper.

Data Set
The data used in this research were collected from farmer responses to a survey administered by the Romanian Agency for Financing Rural Investments (AFIR) on Romanian farmers' crop insurance purchases. Data were collected in September 2021, both through Computer-Assisted Telephonic Interview (CATI) and an online platform. The final data set contains 721 entries from Romanian farmers. The respondents were aged between 21 and 68 years and lived in both rural and urban areas.

Predictors
The survey collected information regarding crop-level characteristics and farmer-level characteristics, all of which were used in the model. Table 1 shows the primary characteristics of the data set. Crop-level characteristics considered in this study were principal crop type (mostly field (57.4%), tree and vine (24.5%), vegetable (14.7%), and others), region (the 8 official NUTS2-level regions of Romania), and two ordinal variables indicating area under cultivation and past experience with damage caused by natural calamities. A total of 72.2% of respondents had an area under cultivation of less than five hectares (compared to the 91.8% national statistic in 2016 [46]), and 34.5% of the respondents indicated that past calamities had caused large or very large damage to their crops.
Farmer-level characteristics included socio-demographic attributes, farmer-insurance attributes and farmer-risk-aversion attributes.
In terms of the socio-demographic variables, the mean and median age of the farmers was 46 years, with 77.7% residing in rural areas and 35.6% having completed higher education studies. Other attributes considered were marital status, level of income, as well as percent of income attributed to agricultural activities (with two-thirds of respondents having less than 50% of total income coming from agricultural activities).
Farmer-specific insurance attributes included past experience and trust in insurance companies, which were quantified on a 5-point Likert scale, from very unpleasant to very pleasant experience, and from minimum to maximum trust, respectively. A total of 54.4% of respondents expressed positive trust in insurance companies, while a smaller percentage (41.3%) had a positive experience with insurance companies. Additionally, respondents' agricultural insurance knowledge was evaluated using 7 questions.
The survey also collected data regarding farmers' risk-aversion characteristics. The perceived risk of losing a crop due to economic, legal, calamitous or other reasons was evaluated using a 5-point Likert scale, with 47.5% expressing very little or little fear. Respondents' risk perception about multiple activities (e.g., gambling or investing in stock or crypto markets) and their likelihood of engaging in those activities were also evaluated using a 7-point Likert scale. One example of such an activity is gambling the monthly income on a sports event, with 50.3% considering it extremely risky, and 32.2% viewing it as extremely unlikely that they would perform such an action. Additionally, farmers were asked about the ratio between the value of the total premium and the amount of risk to which the farm is exposed, and whether that ratio was under- or over-valued, with 47.2% considering it equitable.
The RURAL variable (dummy variable indicating rural or urban residence) was highly correlated with MALE (dummy variable indicating male or female), and therefore the latter was excluded from the analysis. The descriptive features of the data can be visualized in Appendix B.

Target Variable
A total of 423 (58.67%) respondents did not have crop insurance, 193 (26.77%) had standard insurance, while 105 (14.56%) had extended insurance. The target variable considered in all the models was the binary feature indicating whether the respondent owned or did not own crop insurance. Because the data set is not heavily imbalanced between the two classes (as can be seen in Table 2), no sampling techniques for dealing with imbalanced class distributions were considered in this approach.

Data Processing
Responses about perceived risk of multiple activities were aggregated using the mean into a composite variable (RISK_PERCEPTION_GENERAL). The same method was applied for responses about likelihood of engaging in the same activities (RISK_BEHAVIOUR). The answers to the 7 questions regarding agricultural insurance knowledge were summed into the composite variable INS_EDUC. Only the composite variables were used in the model-building step. Input features were normalized using MinMaxScaler, which transforms all inputs into the [0, 1] range. In total, 27 input variables were used in all the models described below. The definition of MinMaxScaler is:
\[ x_{\text{scaled}} = \frac{x - \min(x)}{\max(x) - \min(x)} \cdot (\mathrm{max} - \mathrm{min}) + \mathrm{min} \]
where min = lower value of desired transformation range; max = upper value of desired transformation range; min(x) = the minimum value of feature x; max(x) = the maximum value of feature x.
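The aggregation and scaling steps can be illustrated with a minimal plain-Python sketch. It is not the authors' code: the example responses, the 0/1 scoring of the knowledge questions and the function name min_max_scale are assumptions made for illustration, and the scaling follows the usual min-max formula with a target range of [0, 1].

```python
from statistics import mean

def min_max_scale(values, lo=0.0, hi=1.0):
    """Rescale a list of numbers linearly into [lo, hi]."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # A constant feature carries no information; map it to the lower bound.
        return [lo for _ in values]
    return [(v - v_min) / (v_max - v_min) * (hi - lo) + lo for v in values]

# Hypothetical 7-point Likert answers of three farmers to four risk items.
risk_items = [[7, 6, 7, 5], [2, 3, 1, 2], [4, 4, 5, 3]]
risk_perception_general = [mean(row) for row in risk_items]

# Hypothetical 0/1 scores on the 7 insurance-knowledge questions (assumed scoring).
knowledge_items = [
    [1, 1, 0, 1, 1, 0, 1],
    [0, 0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1],
]
ins_educ = [sum(row) for row in knowledge_items]

scaled_risk = min_max_scale(risk_perception_general)
scaled_educ = min_max_scale(ins_educ)
```

After scaling, the respondent with the lowest composite score maps to 0 and the one with the highest maps to 1, so features measured on different scales become comparable.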

Baseline Model
The Baseline was built as a frame of reference against which to compare the implemented models, and it predicts the majority class (class 0, does not have insurance) in all situations. The Baseline disregards any patterns in the training data, and we expected all other implemented models to surpass the performance of this naive classifier.
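A majority-class baseline of this kind can be sketched as follows. The class name MajorityClassBaseline is hypothetical, and the accuracy is computed on the full set of 721 responses using the class counts reported in the Target Variable section, not on the held-out test set used in the paper.

```python
from collections import Counter

class MajorityClassBaseline:
    """Always predicts the most frequent class seen during fitting."""

    def fit(self, labels):
        self.majority_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, n_samples):
        return [self.majority_] * n_samples

# 423 respondents without insurance (class 0);
# 193 + 105 = 298 respondents with standard or extended insurance (class 1).
labels = [0] * 423 + [1] * 298

baseline = MajorityClassBaseline().fit(labels)
predictions = baseline.predict(len(labels))
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(f"majority class: {baseline.majority_}, accuracy: {accuracy:.4f}")
```

Any model that learns something from the features should exceed this accuracy, which simply equals the share of the majority class.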

Logistic Regression (LR)
The Logistic Regression model is a widely used classification method, and can be written as:
\[ P(y = 1 \mid x) = \sigma(\beta_0 + \beta_1 x) \]
where β are the parameters, and x the explanatory variable.
It returns the probability of class membership, and is based on the standard logistic function, which is defined as
\[ \sigma(t) = \frac{1}{1 + e^{-t}} \]
which returns values between 0 and 1.
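As a small illustration of how a fitted Logistic Regression turns a feature vector into a class probability, consider the sketch below. The intercept, the coefficients and the 0.5 decision threshold are hypothetical values chosen for the example and were not estimated from the survey data.

```python
import math

def logistic(t):
    """Standard logistic function; output lies between 0 and 1."""
    # Two branches keep the exponential from overflowing for large |t|.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    z = math.exp(t)
    return z / (1.0 + z)

def predict_proba(features, intercept, coefficients):
    """Probability of class 1 for one observation under a fitted LR model."""
    linear_score = intercept + sum(b * x for b, x in zip(coefficients, features))
    return logistic(linear_score)

# Hypothetical parameters for two scaled features (not estimated from the data).
intercept, coefficients = -1.0, [2.0, 1.5]
p_insured = predict_proba([0.8, 0.4], intercept, coefficients)
predicted_class = int(p_insured >= 0.5)
```

With several explanatory variables, the linear score is the intercept plus the sum of each coefficient times its feature, and the logistic function maps that score to a probability.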

Decision Tree Classifier (DT)
A DT is a commonly used supervised classification method that has a flowchart structure. The tree achieves classification by splitting the data multiple times based on certain cutoff values in the features. At each split, different subsets of the initial data are created, with each instance belonging to one subset. The leaf nodes represent the final subsets, while the intermediary ones are called split nodes. Decision trees are highly explainable, and they can capture nonlinear relationships.
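The search for a cutoff value at a single split node can be sketched as below. The text does not state which impurity criterion was used, so Gini impurity is an assumption here, and the feature values and labels are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def best_split(values, labels):
    """Return (cutoff, weighted_impurity) of the best threshold on one feature."""
    best = (None, gini(labels))
    candidates = sorted(set(values))
    # Candidate cutoffs are midpoints between consecutive distinct values.
    for low, high in zip(candidates, candidates[1:]):
        cutoff = (low + high) / 2
        left = [y for v, y in zip(values, labels) if cutoff >= v]
        right = [y for v, y in zip(values, labels) if v > cutoff]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best[1] > weighted:
            best = (cutoff, weighted)
    return best

# Hypothetical scaled risk-perception scores and insurance ownership labels.
risk_scores = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]
owns_insurance = [0, 0, 0, 1, 1, 1]
cutoff, impurity = best_split(risk_scores, owns_insurance)
```

A full tree repeats this search over all features and then recursively on each resulting subset until a stopping rule, such as a maximum depth, is reached.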

Tree-Based Models
Two tree-based ensemble methods were also considered, namely Random Forest Classifier (RFC) and eXtreme Gradient Boosting Classifier (XGB). RFC is a "bagging" (bootstrap aggregating) ensemble method which trains multiple decision trees in parallel on bootstrapped samples, and the results are combined into a final model through a "majority vote" mechanism. XGB is a boosting ensemble model that trains decision trees sequentially, turning weak learners into strong learners.
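The bagging idea behind RFC (bootstrapped samples plus a majority vote) can be sketched with a deliberately simple weak learner. This is an illustration only: the one-feature stump below stands in for a full decision tree, the data are hypothetical, and the random feature subsampling of a real random forest is omitted.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def fit_stump(rows):
    """Weak learner: threshold halfway between the two class means of one feature."""
    by_class = {0: [], 1: []}
    for value, label in rows:
        by_class[label].append(value)
    if not by_class[0] or not by_class[1]:
        # Degenerate sample with a single class: always predict that class.
        only = 0 if by_class[0] else 1
        return lambda value: only
    mean0 = sum(by_class[0]) / len(by_class[0])
    mean1 = sum(by_class[1]) / len(by_class[1])
    cutoff = (mean0 + mean1) / 2
    high_class = 1 if mean1 > mean0 else 0
    return lambda value: high_class if value > cutoff else 1 - high_class

def bagged_predict(stumps, value):
    """Combine the individual predictions through a majority vote."""
    votes = Counter(stump(value) for stump in stumps)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
# Hypothetical (feature value, label) pairs.
data = [(0.1, 0), (0.2, 0), (0.35, 0), (0.4, 1), (0.7, 1), (0.9, 1)]
stumps = [fit_stump(bootstrap_sample(data, rng)) for _ in range(25)]
low_vote = bagged_predict(stumps, 0.05)
high_vote = bagged_predict(stumps, 0.95)
```

Boosting differs in that the learners are not trained independently: each new tree is fitted to correct the errors of the trees trained before it.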

Multi-Layer Perceptron Classifier (MLP)
An MLP is a feed-forward artificial neural network that learns a non-linear mapping from an input vector to an output vector. It is composed of a minimum of three layers: an input layer, one or more hidden layers and an output layer. Each layer is fully connected to the next layer, and non-linear activation functions are applied to neurons in each layer (except for the input layer). An advantage of the MLP is that it can distinguish data that is not linearly separable.
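The ability to separate data that is not linearly separable can be shown with the classic XOR pattern. The sketch below is a forward pass only, with hand-picked rather than trained weights; the ReLU activation and the linear output neuron are simplifying assumptions (a classifier would typically apply a logistic or softmax function at the output).

```python
def relu(values):
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """Fully connected layer: one weight row and one bias per output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def mlp_forward(inputs):
    # Hand-picked (not trained) weights that realise XOR with one hidden layer.
    hidden = relu(dense(inputs, weights=[[1.0, 1.0], [1.0, 1.0]], biases=[0.0, -1.0]))
    output = dense(hidden, weights=[[1.0, -2.0]], biases=[0.0])
    return output[0]

xor_outputs = {pair: mlp_forward(list(pair))
               for pair in [(0, 0), (0, 1), (1, 0), (1, 1)]}
```

No single straight line separates the inputs (0, 1) and (1, 0) from (0, 0) and (1, 1), yet the two hidden neurons make the classes separable for the output neuron.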

Model Implementation
Hyper-parameter optimization was performed using grid search and 5-fold cross-validation on the training set. The final models were evaluated on the test set, which was not used for tuning, in order to achieve an unbiased evaluation.
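Grid search with k-fold cross-validation can be sketched in plain Python as follows. The helper names, the toy threshold model and the single-parameter grid are hypothetical; the grids actually searched are those summarised in Appendix A.

```python
import random
from itertools import product

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle indices once and cut them into k near-equal validation folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def grid_search(X, y, param_grid, fit, score, k=5):
    """Return (best_params, best_mean_score) over all grid combinations."""
    folds = k_fold_indices(len(X), k)
    best_params, best_score = None, float("-inf")
    names = sorted(param_grid)
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        fold_scores = []
        for fold in folds:
            held_out = set(fold)
            train = [i for i in range(len(X)) if i not in held_out]
            model = fit([X[i] for i in train], [y[i] for i in train], **params)
            fold_scores.append(score(model, [X[i] for i in fold], [y[i] for i in fold]))
        mean_score = sum(fold_scores) / k
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

# Toy model with one hyper-parameter: a fixed decision threshold on one feature.
def fit_threshold(X_train, y_train, threshold):
    return lambda x: int(x > threshold)

def accuracy(model, X_val, y_val):
    return sum(model(x) == t for x, t in zip(X_val, y_val)) / len(y_val)

X = [i / 20 for i in range(20)]
y = [int(x > 0.5) for x in X]
best_params, best_score = grid_search(
    X, y, {"threshold": [0.2, 0.5, 0.8]}, fit_threshold, accuracy
)
```

Each parameter combination is scored only on folds that were held out from fitting, and the separate test set plays no role in this selection.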
For the tree-based ensembles (RFC and XGB), the top 10 features contributing to the model were selected (representing 37% of total input features). Feature importance was employed to rank the variables based on their impact upon the decision to buy crop insurance. For robustness reasons, we present the feature importance results of both the RFC and the XGB.

Model Evaluation
The data set was split into an 80% training set and a 20% test set. Models were trained on the training set, using the optimal hyper-parameters identified on the same set with grid search. Performance was evaluated on the test set using overall accuracy and F1 score for each class as performance metrics. We also report precision and recall for Class 1 (owns insurance), as well as the area under the ROC curve (AUROC) score.
Accuracy reflects the proportion of correct predictions out of the total predictions. Precision is the ratio of true positives to all cases predicted as positive. Recall is the proportion of true positives out of all actual positive cases. F1 score is the harmonic mean of precision and recall. A receiver operating characteristic (ROC) curve is a plot of the true positive rate (sensitivity, recall) against the false positive rate (1 - specificity) at various thresholds. The AUROC score can be interpreted as the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example.
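The metrics defined above can be computed directly from their definitions, as in the following sketch. The labels and predicted probabilities are hypothetical, the 0.5 decision threshold is an assumption, and AUROC is computed through its pairwise-ranking interpretation, with ties counted as one half.

```python
def confusion_counts(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = len(y_true) - tp - fp - fn
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred, positive=1):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(y_true), "precision": precision,
            "recall": recall, "f1": f1}

def auroc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Hypothetical test-set labels and predicted probabilities of owning insurance.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1, 0.8, 0.3]
y_pred = [int(s >= 0.5) for s in scores]
metrics = classification_metrics(y_true, y_pred)
auc = auroc(y_true, scores)
constant_auc = auroc(y_true, [0.0] * len(y_true))
```

A model that assigns the same score to every case, such as a majority-class baseline, ties on every pair and therefore obtains an AUROC of 0.5.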
In terms of overall accuracy, the MLP had the best performance (0.76), followed by SVC (0.74) and LR (0.72). The MLP also surpassed all other implemented models in terms of F1 scores for both classes. The tree-based methods performed well overall, with the exception of the Decision Tree. All implemented models (except for the Baseline) had AUROCs above 0.5, indicating that they ranked cases better than chance, which is the level at which the constant-prediction Baseline sits.
The performance comparison of all the models can be seen in Table 3, in which bold indicates the maximum value. The ranking is not surprising, considering that other articles on the use of ML techniques in the financial sector also report that MLP outperforms SVC and Logistic Regression (e.g., [47,48]).

Feature Importance from Ensemble Methods
Tree-based models have the advantage of ease of interpretation, as opposed to deep learning algorithms, which behave as a black box and limit the understanding of how features are combined to make predictions. Decision Trees are highly interpretable, as long as the depth of the tree is not very large. Bagging (e.g., RFC) and boosting (e.g., XGB) methods improve performance by combining multiple Decision Trees, but are more difficult to interpret. For these cases, feature importance (i.e., the importance score of each variable in the construction process) is determined using the computed information gain. For example, in RFC, feature importance is evaluated using the mean decrease in impurity, namely the average across trees of the total decrease in node impurity. In boosting methods (e.g., XGB), the relative importance of a variable is higher the more it is used to make decisions in the trees. We plotted the feature importance to highlight the main contributing factors to the prediction of whether a farmer had or had not purchased crop insurance. Figures 1 and 2 contain the top 10 contributing features in the ensemble models implemented, RFC and XGB, respectively. The definition of each variable can be found in Table 1. Highlighted in green are the features within the top 10 that are not consistent across the two models.
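The mean decrease in impurity can be sketched as below. The split records and their values are hypothetical (only the variable names come from the Data Processing section), and the computation is simplified: it averages the total impurity decrease per feature across trees and then normalises the scores to sum to 1, which may differ in detail from what a given library does.

```python
from collections import defaultdict

def mean_decrease_impurity(forest_splits, feature_names):
    """Average per-tree total impurity decrease per feature, normalised to sum to 1."""
    totals = defaultdict(float)
    for tree_splits in forest_splits:
        for feature, weighted_decrease in tree_splits:
            totals[feature] += weighted_decrease
    n_trees = len(forest_splits)
    averaged = {name: totals[name] / n_trees for name in feature_names}
    norm = sum(averaged.values())
    return {name: value / norm for name, value in averaged.items()}

# Hypothetical split records: one list per tree, each entry is
# (feature used at a split node, sample-weighted impurity decrease of that split).
forest_splits = [
    [("RISK_PERCEPTION_GENERAL", 0.20), ("INS_EDUC", 0.10)],
    [("RISK_BEHAVIOUR", 0.15), ("RISK_PERCEPTION_GENERAL", 0.05)],
]
features = ["RISK_PERCEPTION_GENERAL", "RISK_BEHAVIOUR", "INS_EDUC"]
importance = mean_decrease_impurity(forest_splits, features)
top_features = sorted(importance, key=importance.get, reverse=True)
```

Scores of this kind rank features by how much they contribute to purer splits, but they say nothing about the direction of a feature's effect on the prediction.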
The top 10 features point to several significant factors impacting Romanian farmers' decision to buy crop insurance. Among the crop-level characteristics [34,35], past experience with damage caused by calamities was an important factor in the final purchasing decision of the farmer. However, the feature importance scores indicate that farmer-level variables were the most impactful. Among these, both the RFC and the XGB indicated the relative importance of risk aversion (consistent with [30][31][32][33][41,42]) and of education in the field (consistent with [45]). Therefore, our results are in line with the literature. In terms of consistency across models, eight out of the top ten contributing features were common between the two ensemble algorithms. Both identified farmer-level characteristics related to risk as the main two contributing features: the general perception of risk and the likelihood of engaging in risky behaviour. The perceived risk of losing the crop and the level of damage caused by calamities in the past were also prominent factors. As expected, and in line with economic risk theory, perceived risk is a key predictor of insurance purchase. The insured transfer risk to the insurer, and the higher the level of risk farmers perceive, the more likely they are to take out insurance [49]. In this regard, the authors in [50] show that, as farmers' perception of flood risk increases or as farmers become more risk-averse, they are more likely to buy crop insurance, whereas risk-seeking farmers are less likely to purchase crop insurance. Additionally, the general knowledge about crop insurance was another of the main features highlighted (similar to results obtained on other types of insurance [44,45]).
Socio-demographic attributes such as the level of education completed by the respondent and the type of residence (rural versus urban) were identified among the main factors, along with the expressed level of trust in insurance companies (see [29,50,51]). RFC additionally identified the age of the respondent and the perceived quality of past interactions with insurance companies as two other contributing factors. Our results for age are in line with previous research [49,50,52], which showed that older farmers were more likely to purchase crop insurance than younger farmers. On the other hand, XGB results highlighted, on top of the eight main common features, crop-level attributes (e.g., whether it was a vegetable crop, or whether the crop was located in the North-Western region (see [36,43])).

Conclusions
In this research, we implemented and evaluated several machine learning models for predicting Romanian farmers' purchase of crop insurance. These models used variables related to crop characteristics, as well as farmer-level variables including socio-demographic attributes, characteristics related to risk perception, as well as insurance knowledge and perception about insurance companies. Among all the models developed, the best-performing ones were the Multi-Layer Perceptron Classifier (MLP) and the Linear Support Vector Classifier (SVC), with an overall accuracy of 0.76 and 0.74, respectively. To identify the main contributing features, tree-based ensemble methods were used, namely Random Forest Classifier (RFC) and eXtreme Gradient Boosting Classifier (XGB). In both approaches, the most important features included the farmer's general perception of risk, their likelihood of engaging in risky behaviour, as well as their level of knowledge about crop insurance, which is supported by the previous research of [49,50].
Crop insurance market actors could use these models to better predict whether or not farmers would own crop insurance, based on crop-level and farmer-level attributes. Farmers who are predicted to own crop insurance can be targeted further and potentially added to an insurer's customer base. They can be approached with offers based on more complex crop insurance policies that are better adapted to their real needs. At the same time, actors on the market may use our results to address the issues that lead the other group of farmers to decide not to get insured. Additionally, taking into account the most prominent features identified, insurers should first consider the farmers' perception of and attitude toward risk, and should potentially invest in increasing the level of knowledge about agricultural insurance. As previously stated in the insurance literature, we emphasize the difference in importance between the general educational level and education in the field. The level of knowledge in the insurance sector, in general, and in crop insurance in particular, is of major importance for the farmer's decision and for the future development of the market. A farmer without education in the field will not be able to make an informed decision about purchasing crop insurance. Consequently, we point out the need for market actors to invest in educating their target clients.
The model results indicate that machine learning can be a useful tool for predicting farmers' purchase of crop insurance, and predictions can be further used by insurers for better targeting of potential policyholders. Additionally, we emphasize the different efficiencies of several ML methods in modelling the data. We obtained a similar ranking to other important studies in the field's literature.
Behavioural aspects are non-linear and complex, not only in the crop insurance field, but in any other field. The main advantage of machine learning approaches is that they are better able to handle the non-linear relationships that exist in the behavioural field, in contrast with the classical methods, which are more conservative, rely on a linear approach and may consequently lose important information provided by the data. Our results show that machine learning techniques can predict purchasing decisions with respect to crop insurance with an overall accuracy of up to 0.76. One important advantage is that our methodology can be extended and used in any other sector in which behavioural assessment is required.
The most important limitation of our study, compared with the part of the field's research that employs classical econometric approaches, is that we only show the importance of the different variables (features) and rank them accordingly, without showing the direction of their impact. This is an aspect that we intend to address in future work.
This research can also be further improved by distinguishing between the types of insurance the farmer holds and building a 3-class model (no insurance, standard insurance, extended insurance) instead of the binary one considered in this research. In this way we will contribute to the field's literature by adding new information that can help insurers and policymakers to better target the different groups of farmer clients.
Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
The tuned hyperparameters can be seen in the table below for each of the implemented models.