Extracting Knowledge from Big Data for Sustainability: A Comparison of Machine Learning Techniques

At present, because natural resources are limited, society should take maximum advantage of data, information, and knowledge to achieve sustainability goals. Human existence is not possible without the growth of plants. In photosynthesis, plants convert solar energy into chemical energy. This process sustains all life on earth, and the main controlling factor for proper plant growth is soil, since it holds water, air, and all the nutrients essential for plant nourishment. However, due to overexposure, soil becomes degraded, so fertilizer is an essential component for maintaining soil quality. In that regard, soil analysis is a suitable method to determine soil quality. Soil analysis examines the soil in laboratories and generates reports of unorganized and insignificant data. In this study, different big data analysis machine learning methods are used to extract knowledge from data and to find fertilizer recommendation classes on the basis of the present soil nutrition composition. For this experiment, soil analysis reports were collected from the Tata soil and water testing center. The Mahout library is used to analyze the performance of stochastic gradient descent (SGD) and artificial neural network (ANN) in a Hadoop environment. For better performance evaluation, we also ran single-machine experiments for random forest (RF), K-nearest neighbors (K-NN), regression tree (RT), support vector machine (SVM) using a polynomial function, and SVM using a radial basis function (RBF). Detailed experimental analysis was carried out on the soil reports dataset using overall accuracy, the receiver operating characteristic (ROC) curve, the area under the ROC curve (AUC), mean absolute prediction error (MAE), root mean square error (RMSE), and coefficient of determination (R²) as validation measurements. The results provide a comparison of solution classes and show that SGD outperforms the other approaches. Finally, the results support selecting the solution, or recommendation class, that suggests suitable fertilizer for crops for maximum production.


Introduction
In the production of crops, the role of plant nutrients is well established. The essential nutrients are carbon (C), hydrogen (H), oxygen (O), nitrogen (N), potassium (K), calcium (Ca), and phosphorus (P). The required quantity of these nutrients should be present in the crops to meet yield targets. These nutrients come from various sources such as the atmosphere, soil, irrigation water, and mineral fertilizers. Any deficient [...]
The proposed model and performance evaluation are elaborated in Section 4. In Section 5, conclusions and future research directions are discussed.

Background
A deficiency of nutrients leads to a decrease in crop productivity, as soil is a measured source of nutrients for crops. Many researchers have contributed different solutions for improving crop productivity. The success of the "Green Revolution" in India and other developing countries drew researchers towards maximizing fertilizer and pesticide usage in agriculture. To increase productivity, the use of chemical fertilizers such as nitrogen (N), phosphorus (P), and potassium (K) has increased in an unrestrained manner [10]. A study in China reveals that from 1960 to 2000 the productivity of wheat increased three-fold, and to achieve that, farmers increased their usage of chemical fertilizers almost five-fold [11]. Since the 1990s there has been a paradigm shift in fertilizer usage in agriculture, and optimal usage of nutrients has become an important issue for sustainable agriculture. Compared to other nutrients, nitrogen fertilization issues appear in large numbers in the research literature. This is mainly due to its pivotal role in high productivity and its more general applicability across different crops. Big data analysis technologies are now used in agriculture to boost crop yield and to make smarter and more accurate decisions. Automated systems help farmers monitor processes through alerts. Weather conditions and soil moisture are reported via Internet of Things (IoT) sensors [12].
In 2018, Bodake et al. [13] introduced a system using techniques such as cloud computing and data mining. This model provides the details of the required fertilizers from a soil sample. It helps to improve crop production with minimum fertilizer cost. Data regarding crop details and soil conditions are collected in a database, which provides the total fertilizer requirements. Shastry and Sanjay [14] designed a cloud-based agricultural framework that provides soil classification and crop yield prediction services. For soil classification, the authors used a hybrid support vector machine (M-SVM) and a customized artificial neural network (M-ANN). In [15], the support vector machine, artificial neural network, and decision tree are used to classify soil texture classes. In the experiments, the SVM with a Gaussian radial basis function performed better than the other techniques, with an overall accuracy of 0.943. Taghizadeh-Mehrjardi et al. [16] used six different classifiers, namely the support vector machine, logistic regression, artificial neural network, random forest, k-nearest neighbor, and decision tree, to predict soil organic groups in Iran. In the experiments, the decision tree and artificial neural network produced a high accuracy rate. Brungard et al. [17] used 11 diverse machine learning approaches to map soil taxonomic classes in an arid area; random forest produced a high accuracy rate. Kovačević et al. [18] used the support vector machine (SVM) for the classification of soil. The soil chemical structure is the input of these methods. The key benefit of the approach is that it requires a minimal number of samples to train the method.
Guevara-Hernandez et al. [19] developed a classification system for wheat and barley grain. A total of 99 features were used, including 72 texture, 21 morphological, and six color features. The experiment was done on two classes, and the accuracy is more than 99%. Artificial neural network and K-nearest neighbor classifiers [20] were used to categorize Indian wheat seed varieties by extracting 31 features using local binary patterns, gray level textural features, etc. The accuracy of this system is 66.8%. The common vector approach [21] was used to classify six wheat varieties. Using the minimum distance property, a given test image is assigned a label. The accuracy of the method is 36.7%. In [22], a classification system is established through a multi-layer observation network for classes of rain red wheat growth. A precision of 87.22% is obtained by this method. Romero et al. [23] built a classification system for wheat using the machine learning software WEKA (University of Waikato, Hamilton, New Zealand). The a priori modified local accuracy (MLA) is 90% accurate.

Problem Statement
A dataset of soil reports that contains soil nutrition values and recommended fertilizer quantities for paddy and wheat crops was collected. In each report, the soil nutrition composition is represented as an n-dimensional vector. The values of the vector indicate the quantities of soil nutrition elements, such as organic carbon, phosphorus, potassium, sulfur, etc. The recommended fertilizer quantity is treated as a single value for the classification of reports. The recommended fertilizers with their nutrient compositions are shown in Table 1. In this research work, we develop a system that identifies the fertilizer recommendation classes using different machine learning algorithms.

Dataset
For the experimental setup, the soil report datasets were collected from two Tata soil testing laboratories. The laboratories are located in different areas (Sangrur and Kurukshetra districts), but their geographical conditions are the same. The datasets for the years 2011-2017 were collected for evaluation. The collected data are in raw form and contain the soil nutrition composition values and recommended fertilizer quantities. The collected data contain millions of records, but in our study only 30,000 random records of paddy and 30,000 random records of wheat from each laboratory are taken, using the rand() function.

Fertilizer Recommendation Class
The collected data contain many fertilizer recommendations. In this paper, only ten fertilizer recommendation solutions are selected for classifying the reports, with the support of agricultural experts. Table 2 shows the selected fertilizer solutions with their recommended quantities; the first five relate to wheat and the remaining five to paddy. Here, Sol. indicates a fertilizer recommendation solution.

Hadoop
Hadoop is an open-source Java project from the Apache Software Foundation and is officially supported on the Linux operating system. The MapReduce [17] programming model and the Hadoop distributed file system (HDFS) are the main components of Hadoop [24][25][26]. The key advantage of Hadoop is its MapReduce programming model, which enables the execution of a job in a distributed environment. A cluster can contain thousands of nodes with commodity hardware and an overwhelming amount of data (Figure 1). The Hadoop framework provides a Java-based library that can be imported into any Java IDE, such as NetBeans or Eclipse. Hadoop is widely used for MapReduce computations. The framework follows a master-slave architecture. In the Hadoop environment, thousands of nodes, known as datanodes, are grouped into clusters (Figure 1). Each datanode contains a task tracker that performs the computational task. All datanodes are connected to a masternode. The masternode has a job tracker that distributes the tasks to the datanodes. The masternode is connected to a secondary node, an independent node that becomes active when the masternode fails. The namenode stores the metadata of job distribution and data storage information.
The Hadoop framework follows a fault tolerance mechanism. The Hadoop distributed file system stores data at three different locations. In case of failure, the namenode accesses the data from the alternate locations. Presently, the Hadoop ecosystem provides many data storage solutions, such as HBase, Hive, and Pig. HBase is a distributed column-oriented database in the category of "not only SQL" (NoSQL) storage systems [27]. Hive was developed by Facebook for SQL-based developers; it saves data in table format and retrieves data using SQL queries [28]. Pig is a tool developed at Yahoo for custom queries [11]. The aforementioned tools run on top of HDFS.

MapReduce
The MapReduce model enables users to process huge amounts of data, in both structured and unstructured forms, in parallel batches over big clusters in a reliable and fault-tolerant manner [17,27,28]. This programming model is based on the well-known divide-and-conquer strategy. In this paradigm, the problem is divided into small sub-problems. Each sub-problem is then solved, and the solutions are combined into a final solution, as shown in Figure 2. MapReduce uses a hash table data structure to map between a key and its value.
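The map, group-by-key, and reduce steps can be illustrated with a minimal single-process sketch. It is not Hadoop code: the record layout and the function names `map_phase`, `shuffle`, `reduce_phase`, and `map_reduce` are hypothetical, and a Python dict stands in for the hash table that maps each key to its values.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit one (key, value) pair per record; here (class label, 1)."""
    yield record["label"], 1

def shuffle(pairs):
    """Group the emitted values by key using a hash table (a dict)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)

def map_reduce(records):
    """Run map on every record, group by key, then reduce each group."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical records: count how many reports fall in each class.
reports = [{"label": "Sol. 1"}, {"label": "Sol. 2"}, {"label": "Sol. 1"}]
counts = map_reduce(reports)
```

On a real cluster the map calls run on different nodes and the grouping is done by the framework; only the division into sub-problems and the final combination are shown here.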

Stochastic Gradient Descent
Stochastic gradient descent (SGD) is a function optimization technique [34] that is generally used to support machine learning algorithms. SGD computes a gradient from the sample points and adjusts the weights in the objective function to follow this gradient. The gradient gives the direction in which to move each weight (adding to or subtracting from the current weights), and the weights are then adjusted by a fixed value. In this work, SGD is used as a stand-alone logistic regression method for classification [29]. The model is revised after every input is processed, and no information about previous inputs needs to be retained. Logistic regression is defined over samples (a_i, b_i), where a_i is the i-th row of the matrix A (n × d) and the label b_i takes the value −1 or 1. The probabilities that b equals 1 and −1 are modeled with a weight vector θ ∈ R^d. To minimize the error function, the optimal weight vector θ is identified by minimizing the negative log probability [34], with a regularization term γ that suppresses large weight parameters.
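As an illustration of the update rule, the following minimal sketch fits a regularized logistic regression with SGD. It assumes the standard objective log(1 + exp(−b_i a_i·θ)) + (γ/2)‖θ‖², which may differ in detail from the formulation in [34]; the learning rate, epoch count, toy data, and function names are hypothetical, and this is not the Mahout implementation.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sgd_logistic(A, b, gamma=0.01, lr=0.1, epochs=50, seed=0):
    """Minimize log(1 + exp(-b_i * a_i.theta)) + (gamma/2)*|theta|^2 by SGD."""
    rng = random.Random(seed)
    theta = [0.0] * len(A[0])
    order = list(range(len(A)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a new random order
        for i in order:
            margin = b[i] * sum(t * x for t, x in zip(theta, A[i]))
            scale = -b[i] * sigmoid(-margin)  # derivative of the loss w.r.t. a_i.theta
            # One update per sample; earlier samples need not be kept.
            theta = [t - lr * (scale * x + gamma * t) for t, x in zip(theta, A[i])]
    return theta

def predict(theta, a):
    """Return +1 or -1 according to the sign of a.theta."""
    return 1 if sum(t * x for t, x in zip(theta, a)) >= 0 else -1

# Toy, linearly separable data; the last feature is a constant bias term.
A = [[2.0, 1.0], [1.5, 1.0], [-1.0, 1.0], [-2.5, 1.0]]
b = [1, 1, -1, -1]
theta = sgd_logistic(A, b)
```

A ten-class problem such as the one studied here would need a multi-class extension, for example one such binary model per class.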

Artificial Neural Network
The artificial neural network [32] is a branch of computer science and mathematics inspired by biological neural networks. An artificial neural network (ANN) is a suitable tool for the recognition and classification of any kind of data. The Mahout library provides an ANN classification algorithm with linear regression. We select the widely used back-propagation (BP) neural network with one hidden layer in the experiments. We set a linear function for the neurons in the input layer (each of which accepts all properties) and a sigmoid function ϕ(x) for the neurons in the hidden and output layers.
In the network, f_p is a feature of the input; b_m, b_n, and b are associated with the neurons of the different layers; and w_pq, w_qr, and w_r indicate the weights associated with the inputs of the different layers. In this research work, the Mahout library is used to implement an ANN with an input layer, a hidden layer, and an output layer that recognizes the fertilizer solution recommendation class on the basis of the soil nutrition structure. The number of hidden layer neurons varies from 2 to 20 for performance analysis. The input layer and output layer have 11 and 10 nodes, respectively. For implementation, the data are divided into three sets: training, validation, and testing, which contain 70%, 15%, and 15% of the data, respectively.
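The forward pass of such a network can be sketched as follows, assuming the usual sigmoid ϕ(x) = 1/(1 + e^(−x)), randomly initialized weights, and zero biases. The layer sizes 11 and 10 come from the text; the choice of 16 hidden neurons, the helper names, and the input vector are illustrative assumptions, and training by back-propagation is omitted.

```python
import math
import random

def sigmoid(x):
    """Assumed activation for the hidden and output layers."""
    return 1.0 / (1.0 + math.exp(-x))

def init_layer(n_in, n_out, rng):
    """Small random weights plus one (zero) bias per output neuron."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def layer_forward(inputs, weights, biases):
    """Weighted sum of the inputs followed by the sigmoid activation."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + bias)
            for row, bias in zip(weights, biases)]

def forward(features, hidden, output):
    """The input layer is linear (identity); hidden and output layers use sigmoid."""
    return layer_forward(layer_forward(features, *hidden), *output)

rng = random.Random(0)
hidden = init_layer(11, 16, rng)   # 11 input nodes, 16 hidden neurons (illustrative)
output = init_layer(16, 10, rng)   # 10 output nodes, one per recommendation class
scores = forward([0.5] * 11, hidden, output)
predicted_class = scores.index(max(scores))
```

The predicted class is taken as the output node with the largest activation, which is one common convention rather than something the text specifies.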

Random Forest
Random forests were introduced by Breiman [35], based on the combination of aggregation and bootstrap ideas with decision trees. This is a non-standard statistical method that handles regression problems as well as two-class and multi-class classification problems in single and multiple frameworks. The RF method consists of a set of random regression trees: it builds multiple regression trees and then makes a prediction [36]. RF fuses the predictions of the various trees, each of which is trained separately. It draws a random sample of the data and considers a large set of inputs to grow each alternative tree. After the key attributes are created, it collects a number of trees and uses their error rates to select which trees will be used.

K-Nearest Neighbor Algorithm
This is a sample-based learning technique built on the concepts of similarity and statistics [37]. A reference database holding a large variety of soil nutrition composition values is searched for soil reports that are similar to the target soil nutrition composition values, on the basis of which a solution is recommended. The similarity distances for the target soil nutrition compositions are measured as Euclidean distances after normalizing and re-scaling the soil nutrition data from the reference database, so that the diverse input attributes gain similar weight:

$$d_i = \sqrt{\sum_{j} \Delta a_{ij}^{2}}$$

where d_i is the "distance" of the ith nutrition composition value from the target nutrition composition values and ∆a_ij is the difference of the ith nutrition composition values from the target soil in the jth nutrition composition attribute. The soil nutrition composition values of the database are arranged in ascending order of their distance from the target soil nutrition composition values. In this experiment, we first optimized the parameters (p and k) of the K-NN method. Here, k is the number of nearest soil nutrition compositions considered. The parameter p is used to weight each of the k considered soil nutrition composition values when estimating the output attribute. Values of k between 1 and 10 and of p between 0.5 and 3 are examined.
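A minimal sketch of this procedure is given below. It assumes that the features are already normalized and that p acts as an inverse-distance exponent, so each neighbor votes with weight d^(−p); the text does not spell out the weighting scheme, so this is one plausible reading. The reference data and function names are hypothetical.

```python
import math
from collections import defaultdict

def euclidean(a, b):
    """Euclidean distance between two (already normalized) feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(reference, target, k=3, p=1.0):
    """reference: list of (feature_vector, label) pairs.

    The k nearest entries vote for their label with weight d**(-p).
    """
    nearest = sorted(reference, key=lambda item: euclidean(item[0], target))[:k]
    votes = defaultdict(float)
    for features, label in nearest:
        d = euclidean(features, target)
        votes[label] += 1.0 / (d ** p + 1e-12)  # small constant avoids division by zero
    return max(votes, key=votes.get)

# Hypothetical normalized soil vectors with their recommendation classes.
reference = [
    ([0.1, 0.2], "Sol. 1"),
    ([0.2, 0.1], "Sol. 1"),
    ([0.9, 0.8], "Sol. 2"),
    ([0.8, 0.9], "Sol. 2"),
]
label = knn_predict(reference, [0.15, 0.15], k=3, p=2.0)
```

Larger values of p make the closest neighbors dominate the vote, which is why p and k are tuned together.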

Regression Tree Models
Regression tree (RT) [38] methods are used in the field of digital mapping of soil salinity. Single trees are hard to develop because of incorrect parameter settings and tree instability. These concerns led to the creation of bagging, boosting, and random approaches, which help to improve predictive performance [39]. Bagging models use random independent bootstrap replicates, which are then combined by averaging the regression outputs [40]. The boosted regression tree (BRT) model is a tree-based model created to optimize predictive performance by combining various simple trees into a strong model, instead of relying on a single tree model based on conventional regression trees [35,39,41,42]. It is used here to characterize the relationship between soil nutrition composition and fertilizer recommendations.

Support Vector Machine (SVM)
SVM [15,43] is a binary classifier model used for classification tasks. This classifier assumes two classes that are separated by a decision surface.
The idea of SVM is as follows. Consider a database of k samples, {x_i, y_i}, i = 1, . . . , k, where x_i ∈ R^n is an n-dimensional vector and y_i ∈ {−1, +1} indicates the corresponding class label. The support vector machine computes a hyperplane using the following equation, i.e., y_i w^T φ( [...]
For SVM with a polynomial function, the best results were obtained with C = 90 and p = 3; with the radial basis function (RBF), C = 100 and r = 1 produced the best results. When we compared both functions for SVM, the polynomial function performed better than the RBF function. For this experiment, SGD, K-NN, ANN, RF, RT, and SVM with polynomial and Gaussian radial basis functions are compared. Overall accuracy, receiver operating characteristics (ROC), and area under the ROC curve (AUC) are used as performance measures of the models' classification accuracy.
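The two kernel functions compared above can be written out as a short sketch. The forms used, (x·z + c)^p for the polynomial kernel and exp(−r‖x − z‖²) for the RBF kernel, are the common textbook definitions and are assumed here; the offset `coef0` and the sample vectors are hypothetical, and the interpretation of r as the RBF width parameter is an assumption.

```python
import math

def polynomial_kernel(x, z, degree=3, coef0=1.0):
    """K(x, z) = (x.z + coef0) ** degree; degree plays the role of p."""
    return (sum(a * b for a, b in zip(x, z)) + coef0) ** degree

def rbf_kernel(x, z, r=1.0):
    """K(x, z) = exp(-r * ||x - z||^2); r is the assumed width parameter."""
    return math.exp(-r * sum((a - b) ** 2 for a, b in zip(x, z)))

x, z = [1.0, 2.0], [0.5, 1.0]
poly_value = polynomial_kernel(x, z)   # (1*0.5 + 2*1 + 1) ** 3
rbf_same = rbf_kernel(x, x)            # identical points give the maximum value
rbf_value = rbf_kernel(x, z)           # decays with squared distance
```

The penalty parameter C belongs to the SVM training problem itself and does not appear in either kernel.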

Performance Evaluation
In this research work, k-fold cross-validation is used for performance analysis. The collected reports are randomly divided into ten subsets (10-fold cross-validation). In each round, one subset is used for testing and the remaining nine for training. The process is repeated ten times, so that each subset appears exactly once as the test set. The training subsets are used to train the classifier, and the test subset is used to test the performance of the developed model.
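The splitting scheme can be sketched as below. The helper name `k_fold_indices`, the fixed seed, and the sample count are illustrative assumptions; only the procedure (shuffle once, use each of the ten subsets once as the test set) follows the description above.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the indices once and yield (train, test) index lists per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal subsets
    for i, test in enumerate(folds):
        # Every fold except the i-th one forms the training set.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(100, k=10))
```

A classifier would be trained on `train` and scored on `test` for each of the ten pairs, and the ten scores averaged.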
The performance evaluation of the experiments uses TP (true positive), FP (false positive), TN (true negative), and FN (false negative), defined as the number of required instances correctly predicted, the number of required instances falsely predicted, the number of non-required instances correctly predicted, and the number of non-required instances falsely predicted, respectively. The accuracy is then measured as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$

In addition to the above-mentioned evaluation method, we use the ROC curve and AUC to examine the benefits and drawbacks of the classifier. We also used the confusion matrix to evaluate the performance. ROC is a probability curve, and AUC indicates the degree of differentiation: it measures the capability of the model to distinguish between classes. The ROC curve shows the trade-off between the true positive rate (TPR) and the false positive rate (FPR), where the TPR and FPR are defined as follows:

$$\text{TPR} = \frac{TP}{TP + FN}, \qquad \text{FPR} = \frac{FP}{FP + TN}$$

The model is better when the ROC curve is closer to the upper left corner of the graph and when the AUC is closer to 1.
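These measures can be computed from the four counts as in the following sketch, which uses the standard definitions of accuracy, TPR, and FPR and approximates the AUC with the trapezoidal rule over a few ROC points. The counts, the ROC points, and the function names are made up for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (accuracy, TPR, FPR) from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    tpr = tp / (tp + fn)   # share of positives that were found
    fpr = fp / (fp + tn)   # share of negatives wrongly flagged
    return accuracy, tpr, fpr

def auc_trapezoid(points):
    """Area under a ROC curve given as (FPR, TPR) points, by the trapezoidal rule."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

accuracy, tpr, fpr = classification_metrics(tp=40, fp=10, tn=45, fn=5)
auc = auc_trapezoid([(0.0, 0.0), (0.2, 0.8), (1.0, 1.0)])
```

For a ten-class problem, one option is to compute these per class in a one-versus-rest fashion and average them; the text does not state which averaging was used.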
The performance of the models is also calculated using three other validation measurements: mean absolute prediction error (MAE), root mean square error (RMSE), and coefficient of determination (R²). In these indices, n is the number of reports (samples) in each class, and P_i and O_i are the predicted and observed nutrition composition values of report i, respectively. MAE evaluates the average prediction bias, and RMSE reveals the overall quality of the prediction. Predictions become increasingly optimal as MAE and RMSE approach zero.
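The three indices can be sketched with their common textbook definitions. Since the exact formulas are not reproduced here, the choice of R² = 1 − SS_res/SS_tot in particular is an assumption (a squared correlation coefficient is another frequent definition), and the sample values are hypothetical.

```python
import math

def regression_metrics(predicted, observed):
    """Return (MAE, RMSE, R^2) for paired predicted (P_i) and observed (O_i) values."""
    n = len(observed)
    errors = [p - o for p, o in zip(predicted, observed)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_obs = sum(observed) / n
    ss_res = sum(e * e for e in errors)                    # residual sum of squares
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2

mae, rmse, r2 = regression_metrics([2.5, 0.0, 2.0, 8.0], [3.0, -0.5, 2.0, 7.0])
```

MAE and RMSE share the units of the predicted quantity, while R² is unitless, so the three are best read together rather than compared with each other.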

Proposed Model and Results
The basic goal of our study is to propose a model that identifies fertilizer recommendation solution classes. The first step of the proposed model is to extract values from the collected reports, which are in a raw textual format. The pattern-based KMP (Knuth-Morris-Pratt) text mining technique is used to extract the soil nutrition values and fertilizer recommendations, which are saved in a distributed system. Soil nutrition composition and crop names are saved in an n-dimensional vector. The fertilizer solution of the report is treated as a single string that is used as the target value for classification. In the proposed model, the most commonly used fertilizer recommendations were selected for classification. The target value string contains six different fertilizer recommendations, shown in Table 2. After extraction, the data were randomly divided into ten subsets as discussed in Section 3.13. One subset was used for testing and the remaining subsets for training, and this process was repeated 10 times. To train the model, the machine learning algorithms SGD, ANN, RF, K-NN, RT, and SVM were applied. SGD and ANN are implemented using the Mahout library on a Hadoop cluster. For better performance analysis, the other algorithms (RF, K-NN, RT, and SVM) are implemented on a traditional single machine. To check the performance of the developed model, overall accuracy and area under the curve (AUC) are used as performance indicators.
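The extraction step can be illustrated with a small Knuth-Morris-Pratt search that locates a nutrient name in a report line and reads the number that follows it. The report layout, the field name, and the helper names are hypothetical, since the format of the raw reports is not given in the text.

```python
def build_failure_table(pattern):
    """failure[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure

def kmp_search(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    failure = build_failure_table(pattern)
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]  # fall back without re-reading the text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1

# Hypothetical report line; the real report layout is an assumption.
line = "Organic Carbon: 0.45 Phosphorus: 12.3"
key = "Phosphorus:"
start = kmp_search(line, key)
value = float(line[start + len(key):].split()[0])
```

Because the failure table lets the search resume without backing up in the text, each report line is scanned in time linear in its length.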
In the recommendation procedure, the masternode receives the soil nutrition values and crop name through an application program interface, from the soil testing machines, or from a web interface. The MapReduce programming model uses the Hadoop distributed file system and parallel job procedures for data storage. The job tracker assigns the jobs to data-local nodes; the mappers execute the machine learning techniques and predict the recommendation class. Each mapper uses a HashMap to store the outcomes and submits them to the reducer. The reducer then rearranges the data and predicts the final fertilizer recommendation class. The operation of the fertilizer recommendation system is depicted in Figure 3.

Experimental Setup
Herein the experimental analysis of our proposed approach is illustrated. Table 3

Model Performance and Results
To measure the models' performance, the experiments are first done with SGD and ANN. These models were chosen because they are very popular in big data analytics for classification in the health, banking, and e-commerce sectors, so we observe the performance of both of them first. SGD and ANN are run on a Hadoop cluster. We have also run experiments with RF, K-NN, RT, and SVM for classification on a single machine. Lastly, we compared all the classification methods for a better performance evaluation of the Hadoop cluster and the single machine.
The accuracy produced by SGD, ANN, RF, K-NN, RT, SVM (Poly), and SVM (RBF) for the fertilizer recommendation classes is shown in Figure 4. In the case of SGD, the first and second classes have the highest accuracy among all classes, whereas class five has the minimum accuracy. The overall accuracy of SGD, calculated as the average over all ten classes, is 0.88. The highest accuracy of a class using the mentioned classifiers helps to select the suitable fertilizer quantity for crops for maximum production, whereas the use of unsuitable fertilizers hampers crop production. Figure 5 shows the overall accuracy over all 10 classes for different numbers of hidden neurons, from 1 to 20; the highest accuracy, 0.811, is achieved with 16 neurons in the hidden layer. The SGD and ANN confusion matrices are presented in Table 4. The overall accuracy of SGD is 0.865, which is greater than the overall accuracy of ANN, i.e., 0.814. Table 5 shows the AUC values of the two machine learning classification methods. The overall AUC values of the SGD and ANN classifiers are 0.882 and 0.816, respectively. With respect to both performance indicators of the two classification algorithms, we conclude that SGD is the best machine learning algorithm for identifying the fertilizer recommendation class for the current data.

Comparison with Existing Methods
For better performance analysis, the proposed methods are compared with five approaches: random forest (RF) [44], K-NN [45], regression tree (RT) [38], SVM using polynomial functions [15], and SVM using Gaussian radial basis functions [15], using the average AUC and overall accuracy on our dataset. These approaches have already been used by many authors in soil classification. Figure 6 depicts the ROC curves of the seven approaches, and Table 6 reports the comparative analysis of the seven approaches using AUC and overall accuracy. Table 6 shows that SGD achieves the highest AUC, 0.882, and an overall accuracy of 0.881, followed by ANN, with 0.816 and 0.814 for average AUC and overall accuracy, respectively. The third best approach on our dataset, RF [44], achieves 0.79 and 0.786 for average AUC and overall accuracy, respectively. K-NN [45] achieves the next highest AUC and overall accuracy values, i.e., 0.783 and 0.761, respectively.

Table 6. Comparative analysis of various approaches using average AUC and overall accuracy.

| Methods | Average AUC | Overall Accuracy |
|---|---|---|
| SGD [11] | 0.882 | 0.881 |
| ANN [32] | 0.816 | 0.814 |
| RF (Random Forest) [44] | 0.790 | 0.786 |
| K-NN [45] | 0.783 | 0.761 |
| RT (Regression Tree) [38] | 0.756 | 0.749 |
| SVM using polynomial functions [15] | 0.743 | 0.737 |
| SVM using Gaussian radial basis functions [15] | 0.728 | 0.720 |

The next AUC and overall accuracy values, obtained by the regression tree (RT) [38], are 0.756 and 0.749, respectively. SVM using the polynomial function yields AUC and overall accuracy values of 0.743 and 0.737, which are better than those of SVM using Gaussian radial basis functions, 0.728 and 0.719, respectively. Table 7 shows the predictive performance of the stochastic gradient descent (SGD) [11], artificial neural network (ANN) [32], random forest (RF) [44], K-NN [45], regression tree (RT) [38], SVM using polynomial functions [15], and SVM using Gaussian radial basis function models based on ten-fold cross-validation, in terms of the MAE, RMSE, and R² values. In particular, the SVM using Gaussian radial basis functions, with an MAE of 0.75, had the greatest tendency for overestimation, whereas the SGD method, with an MAE of 0.41, had the lowest tendency for overestimation, followed by ANN (0. [...] values, as well as the highest R² (0.74) value. Therefore, it is the superior method for determining nutrition composition values. However, the ANN prediction followed closely, with MAE, RMSE, and R² values of 0.49, 0.58, and 0.71, respectively. This shows a modest improvement in prediction accuracy by ANN. Therefore, both SGD and ANN models should be standardized, and the best result applied for prediction of the target fertilizer recommendation. The RF method performed worse than SGD and ANN but outperformed K-NN, RT, and SVM. Similarly, K-NN, RT, and SVM using Gaussian radial basis and polynomial functions performed worse than the SGD method. Hence, this implies that the SGD method predicts the data very precisely.

Table 7. Predictive performance of various methods using mean absolute prediction error (MAE), root mean square error (RMSE), and coefficient of determination (R²) prediction error indices.

| Methods | MAE | RMSE | R² |
|---|---|---|---|
| ANN [32] | 0.49 | 0.58 | 0.71 |
| RF (Random Forest) [44] | 0.53 | 0.65 | 0.69 |
| K-NN [45] | 0.57 | 0.60 | 0.64 |
| RT (Regression Tree) [38] | 0.66 | 0.73 | 0.62 |
| SVM using polynomial functions [15] | 0.73 | 0.82 | 0.60 |
| SVM using Gaussian radial basis functions [15] | 0.75 | 0.86 | 0.59 |

Conclusions and Future Research
Like animals, plants must obtain nutrients in order to survive. Plants produce energy that animals use, so their own nutrient needs must be met. The main things that plants need to grow are water, nutrients, and light. Water carries moisture and nutrients back and forth between the leaves and the roots. Water and nutrients are usually taken up from the soil through the roots. Fertilizers provide nutrients to plants during watering. The essential nutrients for the growth of plants are nitrogen (N), phosphorus (P), and potassium (K). Nitrogen, phosphorus, and potassium help plants to make green leaves, to grow big flowers and strong roots, and to fight diseases, respectively. In the present work, a classification system for fertilizer recommendations has been established to improve soil quality. The classification system supports recommending the right amount of soil fertilizer on the basis of soil reports generated by agricultural specialists. For performance analysis, we use the AUC and the overall accuracy of the fertilizer recommendation system. Two machine learning algorithms, SGD and ANN, are used to identify the recommendation class. For better performance analysis, we also compared SGD and ANN with other existing approaches: random forest, K-NN, regression tree, SVM using polynomial functions, and SVM using Gaussian radial basis functions. We found that SGD performs better than these existing techniques. In future research, local attributes and more crops, in addition to wheat and paddy, should be included for a better understanding of our findings. The comparative study should also use the Spark library.
Author Contributions: R.G. and H.A. have given their contribution to the algorithmic ideas; H.A., P.C. and R.C. supervised the whole work and analyzed the results.
Funding: This research received no funding.