Input Use Efficiency Management for Paddy Production Systems in India: A Machine Learning Approach

Abstract: This research estimates the technical efficiency of paddy cultivation across India using a stochastic frontier approach. The results suggest that the mean technical efficiency varies from 0.64 in Gujarat to 0.95 in Odisha. Inputs like human labor, mechanical labor, fertilizer, irrigation and insecticide were found to determine the yield in paddy cultivation across India (except for Chhattisgarh). Inefficiency in paddy production in Punjab, Bihar, West Bengal, Andhra Pradesh, Tamil Nadu, Kerala, Assam, Gujarat and Odisha in 2016–2017 was caused by technical inefficiency due to poor input management, as suggested by the significant $\sigma_u^2$ and $\sigma_v^2$ values of the stochastic frontier model. In addition, most of the farm groups in the study operated in the high-efficiency group (80–90% technical efficiency). No specific pattern of input use could be discerned through descriptive measures that would support a specific policy implication. Thus, machine learning algorithms based on the input parameters were tested on the data in order to predict the farmers' efficiency class for individual states. The highest mean accuracy across the models of all of the states, 0.80, was achieved by the random forest models. Among the various states of India, the best random forest prediction model based on accuracy was fitted to the input data of Bihar (0.91), followed by Uttar Pradesh (0.89), Andhra Pradesh (0.88), Assam (0.88) and West Bengal (0.86). Thus, the study provides a technique for the classification and prediction of a farmer's efficiency group from the levels of input use in paddy cultivation for each state in the study. The study uses the DES input dataset to classify and predict the efficiency group of the farmer, whereas other machine learning models in agriculture have mostly used satellite, spectral imaging and soil property data to detect disease, weeds and crops.


Introduction
Of the many facets of agrarian distress in India, input management carries the greatest weight. Input management is the process of employing inputs, such as chemicals, in optimal quantities to increase yield, control pests, etc. The agricultural statistics published by the Government of India show an apparent disparity among the major paddy-producing states with regard to their input application rates and productivity over the years (Department of Economics and Statistics (DES), Ministry of Agriculture, GoI, reports on the Cost of Cultivation Surveys). With the advent of the 21st century, agriculture has witnessed technological growth like all other sectors of the economy [1]. India got its share of technological augmentation in the agricultural sector in the "Green Revolution," which spanned from the 1960s to the 1990s, with long-term effects on the productivity growth of major crops like wheat and paddy. Rice, which is the final product of the paddy crop, is the staple food of the majority of the population in Asia and of half of the world's population. Asia accounts for 90 percent of global rice consumption (https://ricepedia.org/rice-as-food/the-global-staple-rice-consumers (accessed on 20 January 2021)), and the demand continues to rise. Paddy cultivation covers nearly 43 million hectares, which is almost 27 percent of the total 159 million hectares of arable land in India; rice is the staple food grain for nearly 50 percent of the Indian population [2], and it is grown in all of the states and agroclimatic zones. Thus, it is one of the most important crops for food security and for the income of the roughly 59 percent of the Indian population engaged in agriculture [2]. The productivity of this crop increased steadily decade by decade from 1961 to 2001; however, after 2001 productivity stagnated, and a productivity plateau can be observed after 2005 [2]. The cost of cultivation per hectare of paddy grew steeply from 2001-2002 to 2016-2017, the latest year available, as estimated by the DES, Government of India. In 2016-2017, the total production of paddy was 109 million tonnes, of which 67 percent came from seven major producing states out of the 31 states on which DES collected data.
Furthermore, 37 percent of paddy production in India came from just West Bengal, Uttar Pradesh and Punjab, which together cover 34 percent of the country's total paddy area [3]. Thus, production is concentrated in certain regions of the country due to their technological and policy advantages. Regional productivity figures demonstrate that certain regions are far better suited than others to paddy farming. The northern plains recorded an average productivity of 2831 kg per hectare, followed by 2665 kg per hectare in the southern states and 2286 kg per hectare in the eastern states; the lowest productivity, 2133 kg per hectare, was observed in the northern hilly regions. Thus, regional disparity plays a crucial role in determining future strategies for sustainable paddy farming across India. Capital-intensive agriculture should have penetrated all states after the green revolution in the states of Punjab and Haryana. However, certain states still rely on labor-intensive practices and subsistence agriculture, and thus have low productivity figures. An empirical approach should therefore be taken to ascertain the causes of yield stagnation, which could cause food security issues in the future.
For economists and policymakers across India, the policy challenge is to delineate a strategy to raise yield levels from the current stagnation and to widen the shrinking profit margin. Even in states like Punjab, which reaped the benefits of the green revolution, intensive resource exploitation, the partial adoption of production technology, and ineffective policy formulation have led to stagnation in paddy cultivation [4]. The new economic policies that proposed the removal of subsidies on crucial farm inputs, like fertilizer, have put upward pressure on the cost of cultivation and can wipe out the profits from paddy production. The high input requirement, the rising cost of inputs and the slow increase in assured prices cumulatively lower the farmer's profit margin. In such a technological setup, the only thing that remains under the control of farmers is the efficient use of inputs to obtain the maximum potential yield. Against the backdrop of these studies, an attempt has been made to examine the status of paddy cultivation across all of the major growing states of India. Stochastic frontier analysis provides ample scope to analyze in detail the states' efficiency dynamics in paddy cultivation, and the room for raising efficiency and yield to the highest levels possible at the present level of technology.
Some major efficiency studies on paddy farming in India show that it is unprofitable across various states. In Rajasthan, the shares of operational and fixed costs in the total cost of cultivation increased in the same proportion [5]. The pivotal factors behind the increase in operational cost were high wage rates, increased mechanization, and steep increases in seed and fertilizer prices. A pan-India study found that, of the seven time periods under study, farmers made some profit over the total cost of cultivation (namely C2, which considers both the fixed and variable cost components per crop) in only two periods [6]. The reports of the Commission for Agricultural Costs and Prices (CACP) accentuated the fact that in some of the major paddy-producing states, like Kerala, Tamil Nadu and Odisha, profitability hovered around ten percent in 1999-2000 and 2010-2011; varying degrees of loss were reported in other periods [7]. In Andhra Pradesh, a trend of higher input use with an increase in farm size was reported [8]. In recent studies, the higher incidence of farmer suicides has been attributed to higher production costs and low profits due to low prices [9][10][11][12].
Efficiency studies in crop production help us to understand the potential yield levels of the current production system and thus to improve the actual yield, e.g., how to achieve better productivity without increasing input application [13][14][15], or how to make better use of current technology and institutional reforms to accommodate innovations and investment in rural infrastructure to increase production growth [16]. A national study designed to compare production efficiency among various pre-classified farm categories can give policymakers an outline for allocating resources properly to achieve the maximum productivity potential. These studies are essential to exploit the potential of current technologies and bring about productivity reforms [13]. According to Kalirajan et al. [17], developing countries face a two-fold problem: scarce resources and a lag in technological growth. In such a setup, efficiency studies provide an excellent base for achieving productivity growth by improving current technologies and avoiding costly technological reforms in the short run. As Shanumugan et al. [18] proposed, it is possible to raise crop productivity without raising input application. Against the backdrop of this research, an efficiency study of the rice crop will have pivotal implications for improving productivity with current technology and for acquiring knowledge about the status of technologies in different Indian states.
Smart decision making in agriculture is based on four key areas, namely (a) optimal natural resource management, (b) the conservation of the ecosystem, (c) the development of adequate services, and (d) the utilization of modern technologies [19]. Various studies use different kinds of datasets, from satellite and multispectral images to generic field observations, to extract information for smart agriculture applications. There are important studies on soil fertility prediction [20], soil moisture [21][22][23], clay prediction by portable multispectral cameras [24], the prediction of the condition of indoor plants through partial least squares [25], disease detection [26][27][28], and weed detection [29,30]. These models use an array of machine learning algorithms, including artificial neural networks (ANN), SVM, RF, KNN, multiple linear regression (MLR), etc., to predict the yield of various crops. Specifically for paddy/rice crop yield prediction, RGB and UAV data [31][32][33]; satellite spectral data [34]; weather data [35,36]; weather and soil data [37,38]; and weather, irrigation, planting, and fertilizer data [39,40] have been used in various studies. Random forest models [33,[38][39][40], SVM models [34,39,40] and KNN models [39,40] have been suggested in notable studies for yield prediction in paddy crops. However, no notable studies have used these models to predict the efficiency level of farmers based on inputs like human labor, machine labor, irrigation, fertilizer, crop area and size group. Classifying farmers by efficiency level helps us to understand the levels of input utilization and the level of technology that the farmers are employing. Further analysis by farmers' size group provides scope to improve the achievable efficiency by suggesting changes in input management. The research problem of classifying farmers into different technical efficiency levels is addressed herein through a Stochastic Frontier Approach [20,21]. Furthermore, three machine learning models, i.e., k-nearest neighbors (KNN), support vector machine (SVM), and Random Forest (RF), have been used to predict the efficiency group of the farmer based on the input variables and size group. The most accurate prediction model will be suggested for each state in order to advise on appropriate state-level policy measures for input management.
Thus, the study proposes to examine the regional disparity in paddy cultivation across India, and to establish that the input management capacity of the farmers across various states plays a pivotal role in determining differences in productivity and efficiency. In addition, the study aims to build an efficiency group classification and prediction model for each state individually, in order to help policymakers decide on an effective input management strategy that keeps the farmers at the highest level of efficiency.

Data Acquisition
The study used data published by the Department of Economics and Statistics (DES) under the Ministry of Agriculture and Farmers Welfare of the Government of India. The data were collected at state nodal centers under the scheme 'Cost of Cultivation of Principal Crops of India' [41]. The data used in this study came from the 2016-2017 period, the latest available one. The data are a plot-level summary of selected farmers in each state, encompassing input use in paddy cultivation. The workflow for the study is illustrated in the workflow diagram below.
A three-stage stratified random sampling design coupled with probability proportional sampling was used to collect the data. A detailed description of the complex sampling techniques can be found in the Manual of Cost of Cultivation Surveys (2008), published by the DES, Government of India [42]. The unique feature of these data is that the farmers record and collect them carefully during the production process, so that the data accuracy remains high. The data cover varying land sizes across states, with 10 farmers sampled from each tehsil (township) of the considered states; thus, the bigger the state or the higher its number of tehsils, the larger the sample size. The farmers in the data are classified according to their farm size into five size categories: Marginal (<1 ha), Small (1-2 ha), Semi-Medium (2-4 ha), Medium (4-6 ha) and Large (≥6 ha).
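The five size categories can be expressed as a simple lookup rule. The following is a minimal sketch, not part of the DES methodology: the function name `farm_size_group` is hypothetical, and it assumes that each shared boundary (1, 2, 4 and 6 ha) belongs to the larger category, which is consistent with "Marginal (<1 ha)" and "Large (≥6 ha)" but is not stated for the intermediate groups.

```python
def farm_size_group(area_ha):
    """Map an operated area in hectares to one of the five size groups.

    Assumption: a farm exactly on a boundary (1, 2, 4 or 6 ha) is placed
    in the larger group.
    """
    if area_ha <= 0:
        raise ValueError("farm area must be positive")
    if area_ha < 1:
        return "Marginal"
    if area_ha < 2:
        return "Small"
    if area_ha < 4:
        return "Semi-Medium"
    if area_ha < 6:
        return "Medium"
    return "Large"


# Hypothetical farm areas in hectares (not taken from the DES data).
groups = [farm_size_group(a) for a in (0.4, 1.0, 3.5, 4.0, 7.2)]
```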

Data Pre-Processing
The data were cleaned before their use in this study. The plot-level summary data were first aggregated to the farm level, which is the level used in our study. This study also used the cost of cultivation. We then describe the methodological framework of this study, which aims to identify the loopholes in paddy cultivation technology across different Indian states and to suggest appropriate mitigation measures.
The model for each state represents that state's technology level. The models are not readily comparable across states because each represents the technology frontier of its respective state. For each input used by farmers in paddy production, the corresponding variable in our model is zero if that input is not used, and the variable is removed from the model altogether if the input is not in common use (unused in more than 90% of the cases). Because the stochastic frontier model uses a log-linear form and the logarithm of 0 is undefined, input values of 0 were replaced with 0.01. The variables were then filtered again using the Ramsey RESET test to obtain a well-fitted model for each Indian state.
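The zero-replacement step before the log transform can be illustrated as follows. This is a minimal sketch in Python rather than the R workflow used in the study; the helper `log_input` and the farm record are hypothetical, and only the substitution value 0.01 comes from the text.

```python
import math


def log_input(value, floor=0.01):
    """Return ln(value), substituting `floor` for a zero input.

    The substitution keeps the log-linear model defined for farms that
    did not use a given input.
    """
    if value < 0:
        raise ValueError("input quantities cannot be negative")
    return math.log(value if value > 0 else floor)


# Hypothetical farm record (values are illustrative, not DES data).
farm = {"human_labor": 620.0, "animal_labor": 0.0, "fertilizer": 140.0}
logged = {name: log_input(qty) for name, qty in farm.items()}
```

Note that ln(0.01) is about -4.6, so a substituted zero sits far below typical logged input values; this is one reason rarely used inputs are dropped from a state's model rather than retained.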

Stochastic Frontier Algorithm
Firstly, the individual farm-level technical efficiency was estimated through the stochastic frontier approach. The model was theorized by Meeusen and Van Den Broeck, and by Aigner, Lovell and Schmidt, in two different seminal papers published in 1977 [43,44]. The stochastic frontier model was then developed further and applied to many sectors, including the agricultural sector, and a model based on these studies was applied in this study. Stochastic frontier models are not affected by outliers or extreme observations, as they require normalized logarithmic values for the estimation procedure [44].
Various researchers have made considerable improvements to the model since then. According to Battese and Coelli, applying stochastic frontier models to cross-sectional and panel data to estimate individual farm-level efficiency is very important. The model is specified such that the error term ($\varepsilon_i$) is divided into a stochastic term ($v_i$) and an inefficiency term ($u_i$) [45,46]. This inefficiency term is of prime importance for this study. The R package frontier, version 1.1-8 [47], was used to estimate the stochastic frontier model and to predict individual farm efficiencies.
Generally, a Cobb-Douglas production function is represented in log-linear form by the following equation:

$$\ln Y = \beta_0 + \sum_{i} \beta_i \ln X_i + u \qquad (1)$$

where
$Y$ = the yield or any variable representing the productivity per unit area;
$X_i$ = the vector of inputs used in production;
$\beta_0$ = the intercept;
$\beta_i$ = the estimated coefficient of the ith input;
$u$ = the error term.
The Cobb-Douglas production function is expanded to carry the inefficiency term in the following form, which is known as the stochastic frontier production function:

$$Y_i = f(X_i, \beta)\, e^{v_i - u_i} \qquad (2)$$

where
$Y_i$ = the yield or any variable representing the productivity per unit area of the ith farm;
$X_i$ = the vector of inputs (the same as in Equation (1));
$\beta$ = the vector of estimated input coefficients $\beta_i$;
$f$ = the Cobb-Douglas type production function;
$v_i$ = a symmetric random term or stochastic noise, assumed to be normally distributed, $N(0, \sigma_v^2)$;
$u_i$ = the individual farm-level technical inefficiency, assumed to be half-normally distributed.
For the current study, the variable specification is as follows:
$Y_i$ = Output/Yield (quintals per hectare)
$X_1$ = Total human labor (man-hours)
$X_2$ = Total animal labor (hours)
$X_3$ = Total machine labor (hours)
$X_4$ = Total fertilizer (kg)
$X_5$ = Total insecticide (Rupees)
Each farm has its own production frontier $f(X_i, \beta)\, e^{v_i}$, composed of a deterministic part $f(X_i, \beta)$ common to all producers and a farm-specific part $e^{v_i}$. The farm-level technical efficiency is given by the following equation:

$$TE_i = \frac{Y_i}{f(X_i, \beta)\, e^{v_i}} = e^{-u_i} \qquad (3)$$

where
$f$ = the Cobb-Douglas type production function;
$TE_i$ = the technical efficiency of an individual farm ($0 < TE_i \le 1$).
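As a worked illustration of the technical efficiency measure, the ratio of observed output to the farm's own stochastic frontier output reduces to exp(-u). The sketch below uses hypothetical numbers (a frontier yield of 50 quintals per hectare and u = 0.2); in practice u is not observed and must be predicted from the fitted model, which this example does not attempt.

```python
import math


def technical_efficiency(u):
    """Technical efficiency implied by a non-negative inefficiency term u."""
    if u < 0:
        raise ValueError("inefficiency term u must be non-negative")
    return math.exp(-u)


frontier_yield = 50.0  # hypothetical f(X, beta) * exp(v), quintals per hectare
u = 0.2                # hypothetical inefficiency term
observed_yield = frontier_yield * math.exp(-u)

te = technical_efficiency(u)
ratio = observed_yield / frontier_yield  # same quantity, computed as a ratio
```

A farm with u = 0 lies on its frontier and has an efficiency of exactly 1; larger values of u push the efficiency toward 0.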
The efficiency levels obtained from the stochastic frontier analysis are used to classify the farmers into four different groups, as illustrated in Figure 1 and listed in Table 1.

Efficiency Class    Efficiency Score Range
Very High           1.0 to 0.90
High                0.90 to 0.80
Medium              0.80 to 0.70
Low                 <0.70
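The class boundaries above translate directly into a labeling rule for the classifiers' target variable. The sketch below is illustrative: the function name `efficiency_class` is hypothetical, and it assumes that a score exactly on a shared boundary (0.90, 0.80 or 0.70) falls into the higher class, which the table does not specify.

```python
def efficiency_class(te):
    """Assign a technical efficiency score in (0, 1] to an efficiency class.

    Assumption: boundary scores (0.90, 0.80, 0.70) go to the higher class.
    """
    if not 0 < te <= 1:
        raise ValueError("technical efficiency must lie in (0, 1]")
    if te >= 0.90:
        return "Very High"
    if te >= 0.80:
        return "High"
    if te >= 0.70:
        return "Medium"
    return "Low"


# The reported mean efficiencies of Gujarat (0.64) and Odisha (0.95),
# used here only to illustrate the rule.
labels = {score: efficiency_class(score) for score in (0.64, 0.95)}
```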

Machine Learning Algorithms for the Prediction of Efficiency Classes
All of the inputs used in the stochastic frontier model, together with the size group, are used to predict the efficiency classes. For this task, the "nnet" [48] and "caret" [49] packages provided in the R computing environment were used. The KNN, SVM and RF algorithms are run after partitioning the data into training and test sets. As illustrated in Figure 1, a train-test ratio of 80:20 has been used for the dataset of each state, and the state-wise classification algorithms were run with 10-fold cross-validation. A comparative table of the classification and prediction algorithms used in agricultural studies is given below.
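The partitioning scheme described above (an 80:20 train-test split followed by 10-fold cross-validation on the training portion) can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the caret implementation used in the study: the helpers `train_test_split` and `k_folds` are hypothetical, the split is a simple random shuffle without stratification by class, and the rows are placeholder integers rather than farm records.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and hold out `test_fraction` of them for testing."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_folds(rows, k=10):
    """Yield (training, validation) pairs for k-fold cross-validation."""
    folds = [rows[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield training, validation


rows = list(range(100))               # placeholder for one state's farms
train, test = train_test_split(rows)  # 80 training rows, 20 test rows
splits = list(k_folds(train, k=10))   # ten (training, validation) pairs
```

Each model would be tuned on the ten training and validation pairs and then evaluated once on the held-out 20 percent.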
Table 2 gives a comparative view of the various datasets and models used in the prediction of paddy yield, and also of the use of KNN, SVM and RF algorithms in agriculture. As discussed in an earlier section, our dataset is unique for this setup, as we have used production input data to classify the efficiency groups of farmers. The KNN algorithm is a non-parametric classification model, which is simple and effective [52]. The support vector machine has applications ranging from time series prediction [53] to biological data processing for medical diagnosis [54], and can be applied to our study for efficiency group classification. The random forest algorithm, proposed by Leo Breiman, is one of the most efficient decision-tree-based algorithms, and it has been used to predict discrete classes [55]. The models found to be most accurate for individual states in this experiment, judged by their accuracy percentage and kappa values, can be used to classify and predict efficiency levels given the input parameters; this means that new input management strategies can be evaluated with our approach before being applied. The models used in the study are simple and are run through preexisting modules in the R computing environment. For simplicity, we have not included a detailed mathematical explanation of the algorithms; however, the performance evaluation of the models is based on precision, recall, accuracy, sensitivity and specificity measures. These are computed from the true positive (TP), true negative (TN), false positive (FP) and false negative (FN) values obtained from the model.
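The evaluation measures named above follow from the four confusion-matrix counts by standard definitions. The sketch below computes them, along with Cohen's kappa, for a single class treated as "positive" against all others; the function name `binary_metrics` and the counts are hypothetical, and for the four-class problem in this study such counts would typically be formed one class versus the rest (an assumption about the workflow, not a statement from the source).

```python
def binary_metrics(tp, tn, fp, fn):
    """Standard evaluation measures from confusion-matrix counts.

    Raises ZeroDivisionError on degenerate inputs, for example when no
    positive predictions are made.
    """
    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # recall is the same as sensitivity
    specificity = tn / (tn + fp)
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "kappa": kappa,
    }


# Hypothetical counts for one efficiency class versus the rest.
metrics = binary_metrics(tp=40, tn=45, fp=5, fn=10)
```

Kappa discounts the agreement that would be expected by chance, which makes it more informative than raw accuracy when the efficiency classes are unevenly populated.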

Results and Discussion
This section presents the results following the approach proposed in this work. In particular, we first describe the status of paddy farming in India, and later we analyze the regional disparity in productivity. We explore the farm-level technical analysis to find the reason behind input mismanagement in the selected states. We conclude this section by analyzing three standard classification algorithms that take input data and size group labels, and predict the efficiency group for specific states.

The Status of Paddy Farming in India
Paddy is cultivated all over India, with variations in area, production and productivity, as shown in Table 3, which provides an overview of the paddy cultivation area, production and productivity statistics across all of India's major growing states in 2016-2017. The data suggest that the production percentage has surpassed the area percentage in the states with higher average productivity (see the first six rows of Table 2), indicating that more food is produced per unit of land. Thus, the disparity in productivity must be studied at a micro level in order to ascertain the causes and prescribe remedial measures. Analyzing the cross-sectional plot-level data for farmers across various states (Table 4), it is evident that, on average, the share of operational cost in paddy farming remained higher than that of fixed cost. However, in states like Punjab and Haryana, the proportion of fixed costs remained higher. From the development era of the green revolution to Punjab and Haryana's highly commercialized farm economy, it is apparent that the fixed-cost investment capacity remained high in these states. Some southern states, like Andhra Pradesh and Karnataka, are also catching up with the trend of higher fixed-cost investment. Linking these factors with the study shown in Table 1, it may be suggested that higher fixed-cost investment may lead to higher productivity gains, which may serve as a useful policy implication.
From the input management perspective, Table 5 provides evidence that human labor remains the single largest input in the total operational cost, contributing from a minimum of 41 percent (Madhya Pradesh) to a maximum of 74 percent (Himachal Pradesh) of the total operational cost of paddy cultivation in all major paddy-growing states of India. Thus, human labor wages in these states are a crucial factor in determining the total cost of cultivation. The data illustrate that the states with higher human labor use fewer machines, as expected. For this study, we focused only on input factors like human labor, machine labor, fertilizer, irrigation and insecticide, which make up nearly 90% of the total input cost in paddy cultivation across all of the states under study. The effective management of these inputs to obtain higher productivity will be crucial for paddy cultivation and for these states' agrarian economies.

Regional Disparity in Productivity and Input Use
Table 6 gives a clear picture of the average input use pattern of India's various states in 2016-2017. The highest yield was observed in Punjab (67.13 quintals/ha), and the lowest was recorded in Himachal Pradesh (22.72 quintals/ha). Furthermore, all of the eastern states in the study area except West Bengal were below the study-area average yield of 41.31 quintals per hectare, which was itself below the average yield of the southern region (49.68 quintals/ha) and the northern region (46.84 quintals/ha). This may be attributed to various geographic, biotic and abiotic factors coupled with input management practices. This disparity calls for a targeted approach in these areas in terms of varietal development and input management. For chemical inputs like fertilizer, the average application was 143.30 kg per hectare over the study area. However, only West Bengal in the eastern region applied more than this average (171.94 kg/ha). The rest of the eastern states were well below it, with an average application rate of 92 kg per hectare. Both the northern region (except Himachal Pradesh) and the southern region had a considerably higher application rate (1.5 to 2.5 times) than the eastern region. Insecticide use in the northern region was Rs. 2069.30 per hectare, second to the southern region (Rs. 2127.51 per hectare). The lowest insecticide use was reported in the eastern region (Rs. 980.60 per hectare). The average insecticide use in the study area was Rs. 1630.23 per hectare. The most crucial component of the cost of paddy cultivation, i.e., human labor, had an average application of 627.19 person-hours per hectare in the study area. The eastern states used nearly 742 person-hours per hectare and the northern states engaged 576.29 person-hours per hectare, while the southern states used only 503 person-hours per hectare. This indicates that the eastern states are more labor-intensive. Agricultural wages were highest in the southern region (Rs. 393 per person-day), followed by the northern region (Rs. 274 per person-day) and the eastern region (Rs. 208 per person-day). As such, eastern states can easily employ more human labor to increase production with the same capital.
The analysis illustrates that India's eastern region has a lot of potential for yield, production and productivity through higher input use. For further technical analysis, we applied the stochastic frontier approach to the assessment of the individual farmers' technical efficiency in different states in the study to obtain a comparative view of the potential yield improvement and efficiency distribution.

The Stochastic Frontier Approach of Technical Efficiency Estimation
The analysis in the previous section shows a clear disparity among Indian states in paddy cultivation methods. A large part of the paddy-producing area is incurring a loss. In order to address this problem, the study first explored the farm-level technical analysis to find the reason behind input mismanagement in those states for 2016-2017. Stochastic frontier models were specified for each state at its present level of technology to determine the technical efficiency of paddy production. The models were specified based on the availability of variables and the Ramsey RESET test described in the methodology section.
Perusing the stochastic frontier analysis results presented in Tables 6 and 7, it was observed that in Uttar Pradesh an increase in area has led to an improvement in yield levels, while in Bihar, Odisha, Tamil Nadu, Assam and Gujarat an increase in area has significantly reduced productivity. Thus, both positive and negative instances of the relationship between land size and productivity exist in paddy production across various states of India. Human labor use has shown the highest positive and significant elasticity in Tamil Nadu (0.193), Bihar (0.145) and Odisha (0.127), followed by Gujarat and Uttar Pradesh, which indicates the excess use of human labor in these states, which could have been optimized to improve paddy production. However, in Punjab (−0.271) and Kerala (−0.169), human labor was found to have negative elasticities. Mechanical labor showed significant negative elasticities in West Bengal, Odisha and Assam, mainly due to a higher reliance on animal labor, while the coefficient was positive and significant only for Tamil Nadu (0.014), where it made a slightly higher contribution to productivity.

Note: "***", "**" and "*" represent significance at the 1%, 5% and 10% levels, respectively. "ns" represents non-significant estimates. Figures in parentheses represent the standard errors of the estimates.
In Odisha, Tamil Nadu, Assam, Gujarat and Chhattisgarh, human labor contributed positively and significantly to the paddy yield in 2016-2017. However, states like Punjab, Andhra Pradesh and Kerala showed a negative and significant value, indicating the need to reduce human labor use in production. Furthermore, Tamil Nadu (0.19) showed the highest elasticity, followed by Bihar (0.132), Gujarat (0.13) and Odisha (0.10), indicating the scope for improvement in these states. In contrast, there was negative elasticity for human labor in Punjab (−0.27), Andhra Pradesh (−0.05) and Kerala (−0.18). Similarly, animal labor was found to have contributed significantly to the paddy yield in states like West Bengal, Odisha, Andhra Pradesh, Tamil Nadu, Assam, Gujarat and Chhattisgarh in 2016-2017.
Fertilizer application was found to have contributed significantly in states like Punjab, Bihar, Uttar Pradesh, West Bengal, Odisha, Andhra Pradesh, Kerala and Gujarat, indicating the scope for an enhanced level of fertilizer application for improved paddy yield in 2016-2017. However, Assam showed a negative value, indicating the need to reduce fertilizer application in paddy production. Furthermore, the magnitude of the elasticities shows that the highest value was observed in Odisha (0.420), followed by Uttar Pradesh (0.08), Gujarat (0.087), Andhra Pradesh (0.065), Kerala (0.080), Bihar (0.049) and Punjab (0.061), indicating the scope for improving fertilizer use in these states. The negative elasticity in Assam (−0.007) indicates an excess of fertilizer application, which could have been optimized to improve the paddy yield. The studies of Shanumugan and Venkatramani [18], Bhende and Kalirajan [56], and Dung et al. [57] agree with the results of our study, e.g., that fertilizer and human labor have positive production elasticities in paddy production. Except in West Bengal, Odisha and Assam, mechanical labor did not contribute to the variation in paddy yield in 2016-2017 in the paddy-producing states of India. Even in these three states, however, the elasticity for mechanical labor was negative.
As mechanical labor consists of both animal labor and machine labor, it was not directly interpretable. An increase in irrigation hours would have significantly augmented the yield in states like Punjab (0.152), West Bengal (0.004), Andhra Pradesh (0.005), Assam (0.039) and Gujarat (0.022), while increased irrigation hours in Bihar and Odisha would have reduced the yield. In small farms of central Gujarat, a study by Narala and Zala [58] found positive elasticity for irrigation in paddy production, which agrees with our study. Here, it should be noted that in Punjab more than 98 percent of the paddy crop is irrigated, while other states lag behind in irrigation infrastructure development. Thus, the elasticity of irrigation remained high for Punjab compared to other states. All of the states under study except Assam and Chhattisgarh showed significantly positive estimates for insecticide use, indicating that the prevalence of insect pests in the paddy crop throughout the country significantly determines yield.
The estimated variance parameters σ²_u and σ²_v in Table 6 are significantly different from zero, which suggests that the variation in paddy yield in Punjab, Bihar, West Bengal, Andhra Pradesh, Tamil Nadu, Kerala, and Assam in 2016-2017 was caused not by stochastic error alone but also by technical inefficiency, i.e., inefficiencies in input management. Further, the significant value of γ for Punjab, Bihar, Uttar Pradesh, West Bengal, Odisha, Andhra Pradesh, Tamil Nadu, Kerala, Assam, and Gujarat shows that the inefficiency effect dominates the random error term in these states. Chhattisgarh showed the highest difference between the observed and frontier outputs (98 percent); the values for the other states were Punjab (97%), Gujarat (97%), Tamil Nadu (95%), Assam (90%), Kerala (89%), Andhra Pradesh (93%), West Bengal (89%), Bihar (61%), Uttar Pradesh (54%) and Odisha (28%), which was mainly due to the inefficient use of resources by the farmers in these states. The value of γ also indicates the percentage of inefficiency due to factors under the farmers' control. It can be inferred from the estimates that states with a high level of γ have very little opportunity left to adjust production factors. Their yield can only be improved through a complete change in technology in the form of a new variety or some hi-tech production measures. In contrast, states with lower technical efficiency need to improve input management in order to reach the potential yield levels in paddy. Lambda (λ), which measures the degree of asymmetry in the distribution of the composite error term (E_i = V_i − U_i), was found to be significantly greater than one for all of the states except Chhattisgarh in our study. A value of λ above one indicates technical inefficiency, with the one-sided error component U_i dominating E_i.
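For readers less familiar with these diagnostics, the sketch below computes γ and λ from the two variance components, assuming the standard definitions in the stochastic frontier literature: γ = σ²_u / (σ²_u + σ²_v) and λ = σ_u / σ_v. The variance values are hypothetical and chosen only for illustration; they are not estimates from this study.

```python
import math


def frontier_diagnostics(sigma2_u: float, sigma2_v: float) -> dict:
    """Return total variance, gamma and lambda for a composed error V - U.

    sigma2_u: variance parameter of the one-sided inefficiency term U
    sigma2_v: variance of the symmetric noise term V
    """
    if sigma2_u < 0 or sigma2_v <= 0:
        raise ValueError("variances must be non-negative, with sigma2_v > 0")
    sigma2 = sigma2_u + sigma2_v
    return {
        "sigma2": sigma2,
        "gamma": sigma2_u / sigma2,                 # share attributed to inefficiency
        "lambda": math.sqrt(sigma2_u / sigma2_v),   # sigma_u / sigma_v
    }


# Hypothetical variance components (NOT taken from the paper's tables).
result = frontier_diagnostics(sigma2_u=0.09, sigma2_v=0.01)
print(f"gamma  = {result['gamma']:.2f}")
print(f"lambda = {result['lambda']:.2f}")
```

Under these definitions the two measures carry the same information, since γ = λ² / (1 + λ²); a λ above one therefore corresponds to a γ above 0.5.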
The stochastic frontier analysis suggests that in all of the major paddy-growing states except Chhattisgarh, input management practices contributed to inefficiency in paddy production to varying degrees in 2016-2017. Each input showed a different degree of responsiveness in paddy production, and management must aim to optimize input application so that a profitable level of paddy production can be achieved in the future. The farm-level technical efficiency estimated from this analysis reveals that the mean technical efficiency of the states varies from 0.64 in Gujarat to 0.96 in Odisha. The pan-India study by Shanumugan and Venkatramani [18] found that the technical efficiency ranged from 0.77 in Madhya Pradesh to 0.84 in Odisha in 1990-1991. The higher efficiency of Odisha farmers in utilizing farm resources has been attributed to high cropping intensity [18]. Table 7 divides technical efficiency into four efficiency groups, as delineated in the methodology section, across all of the states and size groups of farmers.
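The grouping step can be sketched in a few lines of Python. The cut-offs actually used in the study are those given in the methodology section; the values below are hypothetical placeholders (only the 80-90 percent band for the high-efficiency group is mentioned in the abstract), and the two scores are the rounded state means quoted above. The quantity 1 − TE is shown as a rough reading of how far a farm lies below its frontier.

```python
from bisect import bisect_right

# Hypothetical cut-offs between the four groups (assumption for illustration).
CUTOFFS = [0.70, 0.80, 0.90]
LABELS = ["low", "medium", "high", "very high"]


def efficiency_group(te: float) -> str:
    """Map a technical efficiency score in [0, 1] to one of four groups."""
    if not 0.0 <= te <= 1.0:
        raise ValueError("technical efficiency must lie in [0, 1]")
    # bisect_right makes each lower bound inclusive (0.80 falls in "high").
    return LABELS[bisect_right(CUTOFFS, te)]


def frontier_shortfall_pct(te: float) -> float:
    """Percent of frontier output not achieved at efficiency te."""
    return (1.0 - te) * 100.0


# Rounded mean technical efficiencies quoted in the text.
for state, te in [("Gujarat", 0.64), ("Odisha", 0.96)]:
    print(f"{state}: TE={te:.2f}, group={efficiency_group(te)}, "
          f"shortfall={frontier_shortfall_pct(te):.1f}%")
```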
The heatmap in Figure 2 illustrates that the marginal farmers of Uttar Pradesh and West Bengal were operating at the highest efficiency level, while those of Andhra Pradesh and Kerala were at the lowest efficiency level. Small farmers of Andhra Pradesh and Uttar Pradesh showed the highest efficiency in paddy production, while those of Punjab and Tamil Nadu had the lowest efficiency. Among semi-medium farm groups, only Kerala showed the lowest efficiency, while Chhattisgarh, Bihar and West Bengal were operating at the highest efficiency level. Medium farms of Punjab and West Bengal were the least efficient in paddy production, while those of Kerala and Uttar Pradesh operated at the highest efficiency level. Among large farms, those of Kerala, Punjab and West Bengal were running at the lowest efficiency level, while those of Tamil Nadu were operating at the highest efficiency level. Overall, ten farm groups performed at the lowest efficiency level, ten at the very-high efficiency level, 16 at the high-efficiency level, and 13 at the medium efficiency level. The state-specific analysis showed that Kerala has the highest incidence of low-efficiency farms, while Uttar Pradesh has the highest number of very-high efficiency farmers. Thus, the distribution of technical efficiency suggests that there is a need to improve the efficiency of a significant proportion of farmers, and that these farmers are found in every farmer class. The study concludes that efficiency is not concentrated in any specific farm group; instead, it is a discrete phenomenon.
The distribution graph (Figure 3) suggests that in all of the states under study, the proportions of farmers operating at high and very high technical efficiency levels were high, except for Gujarat and Kerala, where a significant share of farmers were operating at the lowest technical efficiency level. A skewed distribution can be seen in states like Uttar Pradesh, West Bengal, Bihar, Gujarat and Kerala, showing a high level of instability in input management practices.
In the following graphs (Figures 4-8), the study gives an overview of the input use of farmers operating in the four efficiency groups. The charts show that the highest efficiency group also has the highest yields in paddy production across all of the states of India. They also show that policymakers cannot adopt a single input management policy, because there is no specific pattern of input use among the different efficiency groups that can be standardized for all of the states. Thus, a specific classification-cum-prediction model to identify efficiency groups should be developed to support appropriate input management policy before the cropping season. This may act as a basis for advising on the optimum input levels for specific states.
Figure 4 shows that the yield levels vary directly with the technical efficiency group. The study confirms the disparity in paddy yield among the states of India. Figure 5 shows that there is no pattern and no striking variation in human labor in relation to the efficiency group; Gujarat and West Bengal use the most human labor hours among all of the states.
Figure 6 shows that there is very high variation in insecticide use among the states under study, with Andhra Pradesh and Punjab being the highest users of insecticide. There is no distinctive pattern of insecticide use among the various size groups of farmers across the states. In the case of fertilizer, there seems to be no specific pattern of difference by which to classify the technical efficiency group (Figure 7), and the same can be observed in the case of irrigation hours (Figure 8).
Thus, the visualization of the efficiency group and related input parameters is insufficient to provide a classification based on technical efficiency, and more sophisticated methods are needed to map the pattern of input use with respect to the efficiency group. The next section employs machine learning algorithms on various parameters discussed in the methodology section to find an accurate solution to the classification problem.

Machine Learning Models for Efficiency Group Prediction
The previous analysis suggests that there is disparity among the states in paddy production technology, which leads to different levels of yield. Intra-state variation in yield among the various farm size groups was also found. Thus, input management, a major policy issue when targeting farmers, needs to be tailored to state and size groups. The stochastic frontier analysis identified four efficiency levels whose input management and yield levels differ. Scientists have employed linear programming models to determine input levels in the past. Still, as new methodologies are introduced, their applicability to input management in agriculture has to be checked. This will open new avenues for intelligent decision-making in agriculture. Thus, a machine learning model that predicts the efficiency level of a farm from its input levels would be advantageous for managing farm inputs to achieve the yield-augmenting objectives of the states.
Considering all of these advantages, the current study employed three standard classification algorithms that take input data and size group labels and predict the efficiency group for specific states. Tenfold cross-validation was used to compare the accuracy of the KNN, SVM and RF algorithms, and the mean accuracy and kappa statistics are presented in Table 7. The mean accuracy of the KNN method ranges from 0.306 in Punjab to 0.685 in Uttar Pradesh, while that of SVM ranges from 0.518 in Tamil Nadu to 0.848 in Uttar Pradesh. The accuracy of the random forest model varied from 0.729 in Punjab to 0.943 in Uttar Pradesh. Overall, the random forest model was the best model for our dataset across all of the states. The dataset for Uttar Pradesh had the best response to all three models, while that of Punjab had the worst. Table 8 shows that the random forest model's mean accuracy and kappa values (with 10-fold cross-validation) remained higher than those of the KNN and SVM algorithms across all of the states. Thus, the random forest model was chosen for the classification and prediction of the efficiency groups across the selected states of India in our study. Detailed accuracy statistics are presented in Table 9. Detailed performance evaluation measures for the KNN, SVM and RF models can be found in Table A1.
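The evaluation procedure, tenfold cross-validation with mean accuracy and mean kappa, can be illustrated with a minimal, dependency-free Python sketch. It is not the study's implementation: a small k-nearest-neighbours classifier stands in for the three algorithms, the data are synthetic two-feature points rather than farm input data, and all function names are hypothetical.

```python
import random
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between labels, corrected for chance."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1.0 - expected) if expected < 1.0 else 0.0


def knn_predict(train_X, train_y, x, k=3):
    """Majority label among the k training points closest to x."""
    nearest = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def cross_validate(X, y, folds=10, k=3, seed=0):
    """Return (mean accuracy, mean kappa) over `folds` held-out folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    accs, kappas = [], []
    for f in range(folds):
        test = idx[f::folds]              # every folds-th shuffled index
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        train_X, train_y = [X[i] for i in train], [y[i] for i in train]
        preds = [knn_predict(train_X, train_y, X[i], k) for i in test]
        truth = [y[i] for i in test]
        accs.append(sum(p == t for p, t in zip(preds, truth)) / len(test))
        kappas.append(cohen_kappa(truth, preds))
    return sum(accs) / folds, sum(kappas) / folds


# Synthetic, well-separated two-class data (illustration only).
rng = random.Random(42)
X, y = [], []
for label, centre in [("low", 0.0), ("high", 5.0)]:
    for _ in range(50):
        X.append([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)])
        y.append(label)

mean_acc, mean_kappa = cross_validate(X, y)
print(f"mean accuracy = {mean_acc:.3f}, mean kappa = {mean_kappa:.3f}")
```

Kappa is reported alongside accuracy because accuracy alone can look high when one efficiency group dominates a state's sample; kappa discounts the agreement expected by chance.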
Table 9 confirms that the random forest algorithm with 10-fold cross-validation predicted the efficiency group from the input data in nine of the ten states in the classification study, with a mean accuracy of 80 percent. The highest accuracy of the RF model was observed for the data of Bihar (0.91), and the lowest was observed in the case of Gujarat (0.667). The widest variation in accuracy was observed in the case of Gujarat (0.45-0.84), followed by Chhattisgarh (0.50-0.86) and Punjab (0.59-0.84). The model can be further improved by adding features from soil fertility, soil property, soil moisture, weather and satellite data for all of the states across time in order to improve the accuracy and reduce the NIR. This kind of model will predict the efficiency level and help to augment the yield of upcoming crops by making suggestions to farmers on their level of input use, and is hence a very effective tool in the hands of stakeholders to mitigate risk in agriculture.
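The no information rate (NIR) is the accuracy obtained by always predicting the most frequent class, so a model is informative only when its accuracy exceeds the NIR. The sketch below computes the NIR and a one-sided exact binomial test of accuracy against it, which is one common way of judging significance; the paper does not state which test it used, so the test, the function names and the class counts are all assumptions made for illustration.

```python
from collections import Counter
from math import comb


def no_information_rate(y_true):
    """Share of the most frequent class among the true labels."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)


def p_value_acc_gt_nir(n_correct, n_total, nir):
    """One-sided exact binomial p-value for H1: accuracy > NIR.

    Probability of at least n_correct successes in n_total trials
    when each trial succeeds with probability nir.
    """
    return sum(
        comb(n_total, k) * nir ** k * (1.0 - nir) ** (n_total - k)
        for k in range(n_correct, n_total + 1)
    )


# Hypothetical test set: 100 farms across the four efficiency groups.
labels = ["high"] * 40 + ["medium"] * 30 + ["very high"] * 20 + ["low"] * 10
nir = no_information_rate(labels)

# Hypothetical model that classifies 80 of the 100 farms correctly.
p_value = p_value_acc_gt_nir(n_correct=80, n_total=len(labels), nir=nir)
print(f"NIR = {nir:.2f}, accuracy = 0.80, p-value = {p_value:.3g}")
```

A model whose accuracy merely equals the NIR would not pass this test, which is why accuracy should always be read against the class balance of each state's sample.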

Conclusions
Among the major paddy-growing states in the study, those with higher productivity account for a larger share of production than of area. The analysis suggests that higher investments in fixed-cost components like mechanization have strongly contributed to higher productivity, showing that rural infrastructure is crucial for productivity; in other words, investment policies are bearing fruit in the areas that have benefitted from them. Input management and other efficiency-related measures can also be used to raise productivity in the states that are falling behind. In our analysis, all of the eastern states, except West Bengal, were below the average national yield of 41.31 quintals per hectare; the southern region's average yield is 49.68 quintals per hectare, and the northern region's is 46.84 quintals per hectare. This disparity calls for a targeted approach in terms of varietal development and input management. The capital-intensive northern (except Himachal Pradesh) and southern regions have considerably higher fertilizer application rates than the eastern region (more than 1.5-2.5 times). Furthermore, insecticide use in the northern region is Rs. 2069.30 per hectare, second to the southern region (Rs. 2127.51 per hectare). The lowest insecticide use was reported in the eastern region (Rs. 980.60 per hectare). However, labor hours in paddy cultivation were considerably higher in the eastern states than in the northern and southern regions because of lower mechanization and the higher use of human labor. Thus, the policy for input management should be tailored to the context, i.e., the capital-intensive northern and southern regions and the labor-intensive eastern region. Our analysis suggests that the yield can be increased by means of more efficient input management. The yield can be improved by between 4.2 percent in Odisha and 36.1 percent in Gujarat with the optimum use of inputs under the current level of technology of the specific states.
Input management inefficiency is responsible for lower yields. The technical efficiency figures of individual farmers suggest that the very-high efficiency level was achieved in our sample by ten farm groups, the high-efficiency level by 16 groups, the medium efficiency level by 13 farm groups, and the lowest efficiency level by ten farm groups. Low efficiency is more common in medium and large farm groups due to input management issues, rather than input availability. Overall, the study concludes that efficiency is not concentrated in any specific farm group; rather, it is a discrete phenomenon. The maximum likelihood estimates for all of the states (except Chhattisgarh) reveal significant inefficiency due to input management, which causes yield variation. In addition, many states are operating at a very high technical efficiency level, and the current level of technical efficiency has already reached saturation, as confirmed by the high gamma estimates, which means that there is little room for improvement. In states like Odisha, Bihar and Uttar Pradesh, the gamma values are low enough to accommodate higher inputs at the current technology level. The states with high gamma values need to improve their technology to increase their yield. Thus, the study suggests a targeted approach for states and regions regarding input management in the short term, and a technological shift in the coming years to keep farms at a profitable level.
The study suggests that the random forest algorithm is best suited for this dataset across all of the states under study. The random forest algorithm we used suggests that between 66 percent of farmers in Gujarat and 91 percent in Bihar can be correctly associated with their achieved efficiency levels using only the farmers' input and size group features. The random forest algorithm predicted the efficiency levels with high significance in nine of the ten states. In the future, the development of a targeted random forest algorithm can be considered for each state to achieve higher accuracies, especially by considering additional features. As the scope of the dataset of this study is limited, we recommend using more features from published datasets on soil fertility, soil moisture, satellite data and weather data to improve the accuracy and reduce the NIR. Future studies can use this dataset across time to develop other algorithms used in this field of work. Such a study can help policymakers predict farmers' efficiency levels from input application data before the cropping season, in turn supporting policies for targeted input management for each specific state operating under a different level of technology.

Figure 1. Systemic workflow of the research paper.

Figure 2. Heatmap showing the distribution of the technical efficiency across the states and farmer classes in paddy production in India in the AY 2016-2017.

Agriculture 2021, 11, 837

Figure 3. Distribution of technical efficiency in paddy production across various states of India in AY 2016-2017.

Figure 4. Yield of paddy (quintal/hectare) across various states and efficiency groups.

Figure 5. Human labor use in paddy (man-hours/hectare) across various states and efficiency groups.

Figure 6. Insecticide use in paddy (rupees/hectare) across various states and efficiency groups.

Figure 7. Fertilizer use in paddy (kg/hectare) across various states and efficiency groups.

Figure 8. Irrigation in paddy (hours/hectare) across various states and efficiency groups.

Table 1. Efficiency group cut-offs in Stochastic Frontier Analysis.

Table 2. Comparison of paddy efficiency group classification and prediction with relevant works.

Table 3. State-wise area, production and productivity for paddy crops in India for 2016-2017.
Source: Handbook of Statistics on the Indian States, RBI Publication (2018-2019).

Table 4. Operational cost and fixed cost as a percentage of the total cost of cultivation in different states of India in 2016-2017.

Table 5. Proportion of different input costs in the total operational/variable cost in paddy cultivation in India for 2016-2017.

Table 6. Descriptive statistics of the parameters used in the stochastic frontier model estimated from cross-sectional farm level data for AY 2016-2017.

Table 7. Maximum likelihood estimates for the Cobb-Douglas type stochastic frontier production function for major paddy cultivating states in India for the AY 2016-2017.

Table 8. Comparison of the KNN, SVM and random forest algorithms for their accuracy in classifying efficiency groups in paddy production across major paddy-producing states of India in AY 2016-2017.
Note: Due to model misspecification for Odisha in the standard production function, only the highest efficiency class was present and hence excluded from the classification study.

Table 9. Accuracy statistics of random forest models in the classification of the efficiency groups in paddy production across the major producing states of India in AY 2016-2017.
Note: NIR means no information rate, and is significant when accuracy > no information rate. "***" means significance at the 5% level. "NS" represents non-significant estimates. Figures in parentheses represent the standard error of the estimates.

Table A2. Abbreviations for states.

Table A3. Abbreviations for institutions, machine learning models and other technical terms.