Applications of Machine Learning Technologies for Feedstock Yield Estimation of Ethanol Production

Lim, Hyeongjun; Kim, Sojung

doi:10.3390/en17205191

Open Access Review

by Hyeongjun Lim and Sojung Kim *

Department of Industrial and Systems Engineering, Dongguk University-Seoul, Seoul 04620, Republic of Korea

* Author to whom correspondence should be addressed.

Energies 2024, 17(20), 5191; https://doi.org/10.3390/en17205191

Submission received: 9 August 2024 / Revised: 11 September 2024 / Accepted: 17 October 2024 / Published: 18 October 2024

(This article belongs to the Special Issue Simulation Modelling and Analysis of a Renewable Energy System, Volume II)

Abstract

Biofuel has received worldwide attention as one of the most promising renewable energy sources. Particularly, in many countries such as the U.S. and Brazil, first-generation ethanol from corn and sugar cane has been used as automobile fuel after blending with gasoline. Nevertheless, in order to continuously increase the use of biofuels, efforts are needed to reduce the cost of biofuel production and increase its profitability. This can be achieved by increasing the efficiency of a sequential biofuel production process consisting of multiple operations such as feedstock supply, pretreatment, fermentation, distillation, and biofuel transportation. This study aims at investigating methodologies for predicting feedstock yields, which is the earliest step for stable and sustainable biofuel production. Particularly, this study reviews feedstock yield estimation approaches using machine learning technologies that focus on gradually improving estimation accuracy by using big data and computer algorithms from traditional statistical approaches. Given that it is becoming increasingly difficult to stably produce biofuel feedstocks as climate change worsens, research on developing predictive modeling for raw material supply using the latest ML techniques is very important. As a result, this study will help researchers and engineers predict feedstock yields using various machine learning techniques, and contribute to efficient and stable biofuel production and supply chain design based on accurate predictions of feedstocks.

Keywords: biofuel; ethanol; renewable energy; machine learning; sustainability

1. Introduction

Ethanol is a biofuel produced by fermenting sugars and starches found in crops such as corn and sugarcane. It is considered a renewable and sustainable alternative to fossil fuels [1]. Blending ethanol with gasoline reduced CO₂ emissions in the transport sector by an estimated 43.5 million metric tons in 2016, equivalent to removing approximately 9.3 million cars from the road for an entire year [2]. In 2023, 112 billion liters of ethanol were produced worldwide, of which 59.13 billion liters (53%) were produced in the United States and 31.27 billion liters (28%) in Brazil. The EU, India, and China produced 5.45 billion liters (5%), 5.41 billion liters (5%), and 3.60 billion liters (3%), respectively [3]. Considering that 111 billion liters of ethanol were produced in 2019, before the pandemic [4], ethanol production in 2023 appears to have recovered from the impact of the pandemic. For reference, ethanol production in 2020 (i.e., during the pandemic) was 100 billion liters [5], so production grew by 12% (an average annual rate of 4%) by 2023. Given this production trend, securing the feedstock necessary for ethanol production is expected to become increasingly important.

Biofuel production consists of several operations (e.g., fermentation, distillation, and distribution), and research on optimizing each of them is important for maximizing efficiency and profit [6,7]. Examples include discovering enzymes that can increase ethanol conversion efficiency for various feedstocks in the fermentation process [8], finding eco-friendly and low-cost distillation methodologies [9], and designing efficient transportation for feedstock supply and biofuel sales [10]. Because each of these operations is important, research on various bioenergy production operations is being actively conducted worldwide [11]. Nevertheless, since products cannot be made without raw materials, producing and securing feedstock, the raw material for ethanol production, is also essential to running the bioenergy production process smoothly [12]. In particular, the ethanol production process varies depending on the type of feedstock, such as sugarcane or corn (see Section 2 for more detail) [13]. Moreover, because profit is revenue minus production costs, and the purchase and supply of feedstock account for most of the production costs, feedstock becomes even more important [14]. In other words, securing feedstock economically is important for successful refinery operation, and costs can be reduced by using a reliable prediction model for feedstock. To this end, Kim and Kim (2022) predicted the yield of feedstock through polynomial regression (PR) and designed, via simulation, a supply chain network that supplies feedstock to refineries [15]. Note that PR is one of the most widely used machine learning technologies, and it has the advantage that the influence of each independent variable on the dependent variable can be explained through its coefficients [16].

Machine learning (ML) is a set of algorithms that are able to learn and solve problems without requiring additional explicit instructions from a human [17]. In general, it is divided into two categories: supervised learning and unsupervised learning [18]. In supervised learning, a model is trained with labeled data so that it can infer the corresponding known answers [19]. The learning process is designed to minimize the discrepancy between predicted values and observed values. Various regression-based approaches (e.g., linear regression, multivariate regression, PR, the least absolute shrinkage and selection operator (LASSO) regression, and random forest (RF) regression) are being utilized, and neural network-based approaches (e.g., recurrent neural networks (RNNs) and convolutional neural networks (CNNs)) are also widely used. Unsupervised learning uses a data set without any accompanying labels and tries to identify the underlying structures of the data set. Representative unsupervised learning techniques include k-means clustering, principal component analysis (PCA), and latent Dirichlet allocation (LDA) [20,21,22]. For example, k-means clustering employs a Euclidean distance metric to minimize the sum of squared distances between each data point and its nearest cluster center so that it can partition a given dataset into relevant clusters [20]. Since the appropriate technique depends on the problem to be solved, it is necessary to examine how each technique (or approach) has been applied in order to use machine learning to improve efficiency in biofuel production.

This study investigates feedstock yield estimation approaches using machine learning technologies, which have gradually improved estimation accuracy over traditional statistical approaches by using big data and computer algorithms. Prediction techniques for various feedstocks are considered, covering first-generation ethanol produced from food crops (e.g., sugarcane and corn) and second-generation ethanol produced from either non-food crops or biomass residues (e.g., corn stover, wheat straw, and switchgrass). To this end, this study reviews the prediction techniques used in 17 major papers related to biofuel feedstock estimation, selected from approximately 210,000 papers found by searching Google Scholar with the keywords “crop yield estimation” and “machine learning model”. In particular, the machine learning techniques used in these papers are classified into regression model-based, neural network (NN)-based, and image-based approaches, and their characteristics are described. As a result, the study will help engineers select appropriate machine learning technologies for the economical production of ethanol. In addition, it will help researchers working on efficient and stable feedstock supply understand how various machine learning technologies can be used to accurately estimate feedstock yields, and it will contribute to more effective research on economical and efficient biofuel production and supply chain design through the application of appropriate techniques.

The remaining sections are organized as follows. Section 2 describes the process and characteristics of ethanol production. Section 3 summarizes recent studies of different machine learning-based approaches for ethanol feedstock yield estimation. Advantages and disadvantages of feedstock estimation approaches are discussed. Section 4 concludes this study and summarizes the results of the study.

2. First- and Second-Generation Ethanol Production

The United States and Brazil are currently the first- and second-largest ethanol producers, respectively. In the United States, corn is the most commonly utilized feedstock, with ethanol production reaching 59.05 billion liters in 2023 [23]. In Brazil, sugarcane is the prevalent source of biomass, with an annual production of 30 billion liters of bioethanol [24]. Ethanol is derived from sugarcane and corn through a series of chemical transformations. The general process of producing ethanol from sucrose-based feedstock (sugarcane) and starch-based feedstock (corn) consists of feedstock collection and storage, juice extraction, juice treatment, fermentation, distillation, and bioethanol production [25]. If the feedstock harvest is as predicted and is supplied stably to a biofuel refinery [26], the juice extraction stage, which involves crushing sugarcane stalks and roots with special rollers, begins in order to obtain the sugarcane juice. Subsequently, in the treatment stage, the collected juice is purified and clarified by the addition of lime, which precipitates fibers and sludge. The filtered sugar solution is then concentrated to contain 14–18% sugar and washed with sulfuric acid. In the fermenter, the concentrated solution is fermented to a 10% ethanol concentration, and the ethanol’s purity is increased through distillation [13,14,15,16,23,24,27].

In contrast to sucrose crops, which only require the extraction of juice to obtain fermentable sugars, starch crops require a process to convert starch into glucose. The general process of corn dry milling consists of feedstock storage, mashing and cooking, liquefaction, saccharification, fermentation, distillation, and ethanol production [28]. This ethanol production process, like ethanol production using sugarcane, is carried out under the premise that corn is produced stably as expected and does not affect refinery operations [29]. The dry milling process of corn, which is the typical starch crop, encompasses two stages: liquefaction and saccharification. Corn flour that has passed through a hammer mill is mixed with water, heated to 88 °C, and liquefied by adding α-amylase enzymes. The dextrinized mash is then cooled and glucoamylase enzymes are added. Glucoamylase converts the liquefied starch into glucose [30]. Subsequently, the glucose undergoes fermentation and distillation to produce ethanol. Bioethanol produced in these ways is defined as first-generation ethanol. First-generation ethanol is process-efficient and competitive. However, using food crops such as corn and sugarcane as feedstocks could potentially conflict with food production [31,32].

Several alternatives have been proposed to address the limitations associated with first-generation ethanol feedstocks. One such option is the utilization of lignocellulosic feedstocks, such as switchgrass, which are more abundant and cost-effective than first-generation feedstocks [33]. Unlike other crops, switchgrass is a hardy grass that maintains high productivity despite various environmental stresses [34]. Another example of lignocellulosic biomass is sugarcane residue. In contrast to traditional methods that utilize sugarcane stalks, this approach employs the tops and leaves of the sugarcane plant, which are often left in the field and discarded and which account for 15% of the total weight of mature sugarcane stalks [35]. The typical process of converting lignocellulose into ethanol consists of lignocellulosic biomass, pretreatment for hydrolysis, enzymatic hydrolysis, fermentation, distillation, and ethanol production [36]. However, the conversion process encounters obstacles due to the heterogeneity of plant cell walls and the inaccessibility of individual components to degrading agents [37]. To overcome these obstacles, many studies have been conducted on various conversion methods, including chemical (acid, alkaline, and oxidative delignification), physical (milling, pyrolysis, and microwave), physico-chemical (steam explosion, ammonia fiber expansion, and CO₂ explosion), and biological processes [35,38]. This second-generation ethanol production process is generally less competitive than the first-generation one due to its complexity. Despite this difficulty, research on second-generation ethanol production continues, in part because it can ensure a stable supply of feedstock while minimizing the impact on the existing food supply [15,39].

In summary, first- and second-generation biofuels each have their own production characteristics, and various studies are being conducted to improve their production efficiency [40]. Nevertheless, it is important to note that both types of ethanol use crops or crop residues as raw materials, and ethanol cannot be produced without a stable supply of these raw materials [29]. Because the harvested yield of feedstocks contributes significantly to the total production of bioethanol, accurately predicting feedstock yield is a crucial task for the sustainable and economical production of biofuel [41,42].

3. Machine Learning Technologies for Feedstock Estimation of Ethanol

Feedstock estimation can be broadly categorized into three approaches: regression model-based, neural network (NN)-based, and image-based approaches. Regression model-based approaches utilize statistical methodologies to forecast feedstock yield. These include linear regression, PR, and RF regression algorithms [43,44]. In contrast, NN-based approaches employ artificial neural networks (ANNs), which are composed of interconnected neurons that enable them to learn and model intricate patterns in the data. The capacity of neural networks to adapt and generalize from data makes them particularly effective for feedstock estimation, where relationships between variables are not straightforward [45]. Finally, image-based approaches employ deep learning techniques, particularly convolutional neural networks (CNNs), to estimate feedstock yield from visual data. CNNs are designed to process grid-like data (i.e., images). By employing multiple layers of convolutional filters, these approaches can automatically detect and learn relevant features from images [46]. The following sections elaborate on these estimation approaches and give examples of their implementation in actual research.

3.1. Regression Model-Based Estimation

Regression models are statistical techniques used to predict the value of a dependent variable based on independent variables. It is therefore crucial to utilize appropriate variables that impact yield in order to achieve accurate prediction. Meteorological data (e.g., temperature and precipitation) and soil information have primarily been employed as independent variables [47,48,49]. Common techniques including linear regression, PR, and RF regression have been used [50]. Figure 1 illustrates the general structure of crop yield estimation via a regression model [44].

Linear regression is one of the most basic regression techniques, assuming that the output variable can be expressed as a linear combination of input variables. The objective of linear regression is to identify a straight line that best represents the data set. This is typically achieved through the use of the least squares method, which involves minimizing the sum of squares of the errors. The basic formula is as follows [51]:

y = β_{0} + β_{1} x_{1} + β_{2} x_{2} + \dots + β_{n} x_{n} + ϵ

(1)

If Equation (1) has only one independent variable x_{1}, it is known as a simple linear regression model. Equation (1) with multiple independent variables is known as multiple regression (or multivariate regression). Because it assumes linear relationships among the variables, it can only explain linear relationships [52]. If multiple independent predictors have a combined (or coupling) effect on the response variable y, an interaction term between variables can be included. Equation (2) shows an example of multivariate regression with two variables and one interaction term. However, including interaction terms can be problematic, because considering all combinations of independent variables causes the complexity of the model to increase exponentially [16].

y = β_{0} + β_{1} x_{1} + β_{2} x_{2} + β_{12} x_{1} x_{2} + ϵ

(2)
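As a minimal illustration of the least squares fit behind Equation (1), the following sketch estimates β_{0} and β_{1} for the simple (one-variable) case using the closed-form solution. The function name and the rainfall/yield numbers are hypothetical and are not taken from any of the reviewed studies.

```python
def fit_simple_linear_regression(x, y):
    """Return (beta0, beta1) that minimize the sum of squared errors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Closed-form least squares: slope = S_xy / S_xx
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    beta1 = s_xy / s_xx
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1


# Hypothetical data: seasonal rainfall (mm) and crop yield (t/ha)
rainfall = [300.0, 400.0, 500.0, 600.0]
yield_t_ha = [7.0, 9.0, 11.0, 13.0]

beta0, beta1 = fit_simple_linear_regression(rainfall, yield_t_ha)
print(f"yield = {beta0:.2f} + {beta1:.3f} * rainfall")
```

An interaction term as in Equation (2) could be handled in the same spirit by treating the product x_{1} x_{2} as an additional input column, although that requires a multi-variable solver rather than this one-variable formula.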

PR is an extension of this multivariate regression. PR assumes that the relationship between variables is not linear but a nonlinear polynomial function of degree L, and the core of PR modeling is to find a polynomial of the optimal degree that can explain the individual relationships [53]. Equation (3) is a general PR model without interaction terms between independent variables. As with multivariate regression, interaction terms can also be considered, and because PR considers nonlinear relationships across n variables, methods for searching for the optimal model are an important research topic [54,55].

y = f_{1} (x_{1}) + f_{2} (x_{2}) + \dots + f_{n} (x_{n}) + ϵ

(3)

where f_{j} (x_{j}) = β_{j 0} + β_{j 1} (x_{j}) + β_{j 2} {(x_{j})}^{2} + \dots + β_{j L} {(x_{j})}^{L}
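To make the structure of Equation (3) concrete, the sketch below evaluates an additive polynomial model for given coefficients (the error term ϵ is omitted). It only shows how a prediction is assembled from the per-variable polynomials f_{j}; the coefficients and function names are hypothetical, and estimating the coefficients from data is not shown.

```python
def poly_term(coeffs, x):
    """Evaluate f_j(x) = b_0 + b_1*x + ... + b_L*x**L using Horner's rule.

    coeffs is [b_0, b_1, ..., b_L] for one independent variable.
    """
    result = 0.0
    for b in reversed(coeffs):
        result = result * x + b
    return result


def predict_additive_polynomial(coeffs_per_variable, xs):
    """Sum the per-variable polynomials f_1(x_1) + ... + f_n(x_n)."""
    return sum(poly_term(c, x) for c, x in zip(coeffs_per_variable, xs))


# Hypothetical coefficients: a quadratic in x_1 and a linear term in x_2
coeffs = [[1.0, 2.0, 3.0], [0.0, 1.0]]
y_hat = predict_additive_polynomial(coeffs, [2.0, 5.0])
print(y_hat)
```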

Standard linear regression approaches often overfit as the number of variables increases: the model becomes biased toward the training dataset, and its predictive performance on new data degrades [56]. To address this challenge, the least absolute shrinkage and selection operator (LASSO) has been proposed [57]. LASSO is an extension of linear regression that selects pertinent features by shrinking certain coefficients to exactly zero, thereby eliminating superfluous variables and simplifying estimation. The loss function of LASSO regression is as follows, where λ is the regularization parameter that adjusts the complexity of the model:

L = \sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2} + λ \sum_{j = 1}^{P} |β_{j}|

(4)
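The loss in Equation (4) can be computed directly, as in the following sketch. It only evaluates the loss for given predictions and coefficients and does not implement a LASSO solver. The function name and numbers are hypothetical, and the intercept is assumed to be excluded from the penalty, consistent with the sum starting at j = 1.

```python
def lasso_loss(y_true, y_pred, betas, lam):
    """Sum of squared errors plus an L1 penalty on the coefficients.

    betas holds beta_1..beta_P (assumption: the intercept is not penalized).
    lam is the regularization parameter (lambda in Equation (4)).
    """
    sse = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    l1_penalty = lam * sum(abs(b) for b in betas)
    return sse + l1_penalty


# Hypothetical values: the same fit is penalized more as lambda grows
y_true = [1.0, 2.0]
y_pred = [1.5, 2.5]
betas = [1.0, -2.0]
for lam in (0.0, 0.5, 1.0):
    print(lam, lasso_loss(y_true, y_pred, betas, lam))
```

Because the penalty grows with the absolute size of each coefficient, a larger λ favors solutions in which more coefficients are exactly zero.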

Notwithstanding the efficacy of regression model-based approaches, challenges persist in handling unbalanced, high-dimensional, and noisy data [58]. To address this issue, ensemble techniques that combine multiple models have been devised. RF regression is an ensemble learning approach that uses multiple decision trees for prediction. Each tree is trained on a random subset of the data with random variable selection [59], which enhances the model’s generalization performance by minimizing the correlation between the trees. In RF regression, multiple decision trees are trained independently through the bagging technique, and the final prediction is generated by averaging their results. This reduces the variance of the model and thereby mitigates overfitting. Figure 2 shows the overall procedure of the RF algorithm [50]. Note that the blue and white circles in Figure 2 represent data points for two variables, x_{1} and x_{2}.

Shahhosseini et al. [43] developed various regression models (e.g., linear regression, LASSO, and RF) for corn yield prediction. The independent variables of the models were selected from the Agricultural Production Systems Simulator (APSIM), a process-based simulation model used to predict crop yields in the same study region. Meteorological, soil, and yield data were collected from Daymet, the Web Soil Survey, and the United States Department of Agriculture, respectively, with the objective of relating corn yield to the environmental conditions under which the crops were cultivated. Table 1 illustrates the data structure utilized for the model development.

The collected data were separated into training sets and evaluation sets for validation. To prevent specific variables from being over or underestimated due to different units of measurement, all input variables were normalized to values between 0 and 1. Ensemble technology was used to improve model performance. In addition, Equation (5) was applied to the model to reflect the increasing trend in corn yield as technology advances:

\hat{Y}_{i} = b_{0_{i}} + b_{1_{i}} \mathrm{YEAR}_{i}

(5)
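The normalization of input variables to the range 0 to 1 mentioned above is commonly done with min-max scaling. The sketch below assumes that form of scaling; the reviewed study may have used a different implementation, and the function name and sample values are hypothetical.

```python
def min_max_normalize(values):
    """Rescale a list of numbers linearly so that min -> 0 and max -> 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical inputs measured in different units
precipitation_mm = [10.0, 20.0, 30.0]
temperature_c = [18.0, 24.0, 21.0]
print(min_max_normalize(precipitation_mm))
print(min_max_normalize(temperature_c))
```

After scaling, both variables lie on the same 0 to 1 range, so neither dominates the model simply because of its unit of measurement.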

Note that the modeling performance was evaluated via multiple performance metrics, such as root mean squared error (RMSE), relative root mean squared error (or normalized root mean squared error), mean bias error (MBE), and mean directional accuracy (MDA) [43].
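Three of these metrics can be sketched as follows. The definitions used here are common conventions and are assumptions with respect to the reviewed study: relative RMSE is taken as RMSE divided by the mean observed value, and MBE as the mean of predicted minus observed values. MDA is omitted, and all names and numbers are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / n)


def relative_rmse(observed, predicted):
    """RMSE divided by the mean observed value (assumed normalization)."""
    return rmse(observed, predicted) / (sum(observed) / len(observed))


def mean_bias_error(observed, predicted):
    """Mean of (predicted - observed); positive means overestimation."""
    n = len(observed)
    return sum(p - o for o, p in zip(observed, predicted)) / n


# Hypothetical observed and predicted yields (t/ha)
obs = [10.0, 12.0]
pred = [11.0, 11.0]
print(rmse(obs, pred), relative_rmse(obs, pred), mean_bias_error(obs, pred))
```

In this example the errors cancel in the MBE but not in the RMSE, which is why the metrics are usually reported together.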

Kim and Kim [44] estimated the yields of five crops (sesame, mung bean, red bean, corn, and soybean) using PR analysis. To account for the effect of shade on crop production, a formula was developed that multiplies the regression coefficient by the shade ratio. In contrast to the traditional linear approaches, PR can accurately capture non-linear relationships using polynomial functions of varying degrees, such as quadratic, cubic, and quartic functions. The results of the analysis of variance (ANOVA) from the study demonstrated that the p-value was less than α = 0.05, indicating that crop growth was significantly affected not only by crop type but also by shade ratio. The accuracy of the model was verified with field study data collected from the Jeollanam-do Agricultural Research and Extension Services in Naju-si (35.0161° N, 126.7108° E), Jeollanam-do, South Korea, and the coefficient of determination (R²) values for the five crops were 97.19%, 90.54%, 98.31%, 83.57%, and 99.72%, indicating that the predictions fit the data well. PR was also utilized to estimate solar power generation and rice yield in an agrophotovoltaic (APV) system [60]. Meteorological data such as solar energy and relative humidity were recalculated under APV system conditions through PR with gradient descent, and rice yield was estimated. The results of the study demonstrated a high degree of accuracy, with R² scores ranging from 0.8 to 0.9. However, it was noted that in new regions, the PR coefficients should be re-estimated with data from that region.

Sakamoto [61] predicted corn and soybean production utilizing an RF regression algorithm. Environmental indicators including temperature, precipitation, shortwave radiation, soil moisture, and the ratio of agricultural land in the county were used. These were utilized in conjunction with the wide dynamic range vegetation index (WDRVI), a moderate resolution imaging spectroradiometer (MODIS)-based vegetation index. WDRVI was calculated (see Equation (6)) using red reflectance and near-infrared (NIR) reflectance.

\mathrm{WDRVI} = \frac{α ρ_{NIR} - ρ_{red}}{α ρ_{NIR} + ρ_{red}}

(6)
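Equation (6) translates into a one-line computation, sketched below. Here α is a weighting coefficient applied to the NIR reflectance; the default of 0.2 is an assumed illustrative value, not necessarily the one used in the reviewed study, and the reflectance values are hypothetical.

```python
def wdrvi(nir, red, alpha=0.2):
    """Wide dynamic range vegetation index from NIR and red reflectance.

    alpha is the weighting coefficient on NIR (0.2 is an assumed default).
    """
    return (alpha * nir - red) / (alpha * nir + red)


# Hypothetical surface reflectance values for a dense and a sparse canopy
print(wdrvi(nir=0.50, red=0.03))
print(wdrvi(nir=0.30, red=0.10))
```

With α = 1 the expression reduces to the familiar NDVI form, so α can be read as a way of down-weighting the NIR signal.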

Root mean square error (RMSE) was adopted as a performance metric representing the average discrepancy between predicted and actual values. The result of the model verification demonstrated that the estimation accuracy of the RF regression (RMSE: 0.539 t/ha for corn, 0.206 t/ha for soybean) was higher than that of the PR (RMSE: 0.897 t/ha for corn, 0.283 t/ha for soybean) at the state level.

Bolton and Friedl [62] compared remotely sensed vegetation indices (VIs) commonly used for crop production forecasting. For three satellite-observed indicators, namely the normalized difference vegetation index (NDVI), the two-band enhanced vegetation index (EVI2), and the normalized difference water index (NDWI), the study forecast production by linear regression and compared performance using R². EVI2 was a more effective predictor (R² = 0.55) of corn yield than NDVI (R² = 0.49), while NDWI was the most effective predictor (R² = 0.69) of corn yield in semi-arid areas. To evaluate the accuracy of the predicted yields, the leave-one-year-out method was used. This refers to training the model on all data except those from one year, and then validating the forecasts using the excluded year’s data. The RMSE between the predicted and actual yields averaged 9.4%. Note that corn yield estimated in this way can be used to predict the feedstock supply in the ethanol production process described in Section 2.
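The leave-one-year-out procedure described above can be sketched as a simple data split. The record layout (dictionaries with a "year" key) and the sample values are hypothetical, and the model training and validation steps are left out.

```python
def leave_one_year_out(records):
    """Yield (held_out_year, train, test) once for each distinct year."""
    years = sorted({r["year"] for r in records})
    for year in years:
        train = [r for r in records if r["year"] != year]
        test = [r for r in records if r["year"] == year]
        yield year, train, test


# Hypothetical county-level records: vegetation index and yield (t/ha)
records = [
    {"year": 2004, "evi2": 0.61, "yield": 9.8},
    {"year": 2005, "evi2": 0.55, "yield": 9.1},
    {"year": 2005, "evi2": 0.58, "yield": 9.4},
    {"year": 2006, "evi2": 0.63, "yield": 10.2},
]

for year, train, test in leave_one_year_out(records):
    print(year, len(train), len(test))
```

Each year is held out exactly once, so every observation is used for validation without ever appearing in the training set of its own fold.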

3.2. Neural Network-Based Estimation

NNs are artificial intelligence approaches inspired by biological neural networks, particularly the structure and function of the brain [63]. An NN is composed of multiple neurons, each of which receives input values and performs a weighted summation. This weighted sum is analogous to the process by which synapses, the connections between neurons, adjust the strength of the input signals [64]. To produce a result, the neuron applies a non-linear function to this sum. The non-linear function, commonly referred to as an activation function, ensures that the neuron activates only when the input value exceeds a certain threshold [65].

A representative example of NN architecture is a multi-layer perceptron (MLP). In the initial layer, designated the input layer, neurons receive external data and disseminate it to the neurons in the intermediate layers, frequently referred to as “hidden layers”. The hidden layers are responsible for processing the data and extracting features. The weighted sum of one or more hidden layers is ultimately propagated to the output layer, which provides the final output of the neural network to the user. In each layer, the neurons receive input from the neurons in the preceding layer, calculate a weighted sum, apply an activation function, and then pass the value to the subsequent layer. Figure 3 depicts the structure of an MLP in NN.
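The layer-by-layer computation described above (weighted sum, activation, pass to the next layer) can be sketched as a forward pass. The weights, biases, layer sizes, and the choice of a sigmoid activation are hypothetical and serve only to illustrate the data flow.

```python
import math


def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def identity(z):
    """Linear output, typical for a regression output layer."""
    return z


def dense(inputs, weights, biases, activation):
    """One layer: each neuron computes activation(w . inputs + b)."""
    return [
        activation(sum(w * x for w, x in zip(neuron_w, inputs)) + b)
        for neuron_w, b in zip(weights, biases)
    ]


def mlp_forward(inputs, layers):
    """Propagate inputs through a list of (weights, biases, activation)."""
    values = inputs
    for weights, biases, activation in layers:
        values = dense(values, weights, biases, activation)
    return values


# Hypothetical network: 2 inputs -> 1 hidden neuron -> 1 output
layers = [
    ([[0.5, -0.25]], [0.0], sigmoid),  # hidden layer
    ([[2.0]], [1.0], identity),        # output layer
]
print(mlp_forward([1.0, 2.0], layers))
```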

The objective of training a neural network is to minimize the discrepancy between the desired output and the actual output for a given input. This is achieved by defining a loss function that quantifies the discrepancy between the network’s output and the actual values and employing a process to minimize this loss, for which gradient descent is a commonly used method. Gradient descent updates the weights by calculating the gradient (partial derivative) of the loss function with respect to each weight [66]:

ω_{i j}^{t + 1} = ω_{i j}^{t} - α \frac{\partial L}{\partial ω_{i j}}

(7)

Variable α represents the learning rate. This process is repeated until the loss function reaches a minimum. To efficiently calculate the gradients, the technique of backpropagation is employed. Backpropagation employs the chain rule from calculus to determine the impact of each weight on the loss function [67]. This allows the weights to be updated in a way that reduces the loss, thereby optimizing the network. In NN, learning occurs by adjusting the weights as data pass from the input layer, through the hidden layers, and finally to the output layer. This iterative and incremental process enables the neural network to learn complex, non-linear relationships by processing data through multiple layers of hidden units, ultimately finding an appropriate solution to the given problem.
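The update rule in Equation (7) can be illustrated on the smallest possible model, a single weight w in ŷ = w x with a mean squared error loss. This is a sketch with a hand-derived gradient rather than backpropagation through a full network, and the data, learning rate, and step count are hypothetical.

```python
def gradient_descent(x, y, learning_rate=0.05, steps=200):
    """Fit y ~ w * x by repeatedly applying w <- w - alpha * dL/dw."""
    w = 0.0
    n = len(x)
    for _ in range(steps):
        # L = mean((w*x - y)^2), so dL/dw = mean(2 * (w*x - y) * x)
        grad = sum(2.0 * (w * xi - yi) * xi for xi, yi in zip(x, y)) / n
        w = w - learning_rate * grad
    return w


# Hypothetical data generated from y = 2x
w = gradient_descent([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(w)
```

If the learning rate is too large for the scale of the inputs, the same loop diverges instead of converging, which is one reason input normalization and learning-rate tuning matter in practice.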

Jiang [68] developed a long short-term memory (LSTM) model for county-level corn yield estimation. LSTM is a recurrent neural network (RNN)-based approach that can perform yield estimation by learning patterns from high-dimensional (i.e., spectral, spatial, and temporal) inputs. The Google Earth Engine platform was employed for the acquisition of geospatial data, preprocessing, and the detection of corn phenomena. WDRVI, growing degree days, killing degree days, and precipitation data were collected. The LSTM model, developed with the TensorFlow library (GPU version 1.2) in a Python environment, comprises one input layer, two LSTM layers, and one output layer. In the input layer, the collected data are entered and accumulated over five time steps that follow the corn growth. Each LSTM layer consists of LSTM cells, in which information is selectively transported and stored. The model’s output is the expected corn yield. Hyperparameters of the model include a batch size of 1000, 200 hidden units in the LSTM layer, an Adam optimizer based on gradient descent, and a learning rate of 0.001. The LSTM model in this study demonstrated better performance (R² = 0.66) under extreme weather events than the LASSO and RF models (R² = 0.63 and 0.58, respectively), as more data accumulate in this model. Along with R², RMSE was also measured, and the model with the best estimation accuracy achieved an RMSE of 0.87 Mg/ha.

Kuwata and Shibasaki [69] used VI and deep learning techniques to estimate corn yields in Illinois. The model was developed using a deep learning framework called Convolutional Architecture for Fast Feature Embedding (Caffe). The standardized enhanced vegetation index (EVI), the 5-year moving average of corn yields, and meteorological data were used as inputs. The study demonstrated that the deep learning model utilizing two inner product layers exhibited the highest accuracy, with an R² score of 0.810.

Sun et al. [70] devised an approach combining a convolutional neural network (CNN) and an RNN to enhance the accuracy of corn production prediction. CNNs are advantageous for extracting spatial characteristics from images, while RNNs are advantageous for exploiting temporal patterns. The CNN was used to extract features from MODIS image data and soil characteristic data, while the RNN was used for time series data (i.e., daily weather). All extracted patterns were integrated as inputs to produce a yield prediction. The study is significant in that it can reflect detailed conditions. For the modeling, the training data set contained 5863 data points for 2013, 6778 for 2014, 7713 for 2015, and 8588 for 2016, and the test data set contained 915 data points for 2013, 935 for 2014, 875 for 2015, and 913 for 2016. RMSE, mean absolute percentage error (MAPE), and R² were used to measure the estimation accuracy of the developed machine learning models. The devised model achieved an RMSE of 1148.07, a MAPE of 9.02, and an R² of 0.68.

3.3. Image Data-Based Estimation

The factors affecting crop yields are highly complex, making it challenging to obtain accurate results using traditional approaches that rely on soil and climate data. To achieve more precise yield predictions, studies have been conducted that utilize image data [71]. Image-based approaches employ CNNs to analyze images for feedstock estimation. CNNs constitute a class of deep learning approaches that have been specifically designed to process and analyze visual data. They have proven highly effective in a variety of tasks, including image recognition, object detection, and image segmentation. The distinctive architecture of CNNs enables them to automatically and adaptively learn spatial hierarchies of features from input images, making them particularly appropriate for image-related tasks. Morphological features of crops (e.g., corn kernel density) can be extracted from images. Figure 4 illustrates the structure of a CNN [72].

Convolutional layers in CNN use filters, known as kernels, to scan over the input image. These filters are small, learnable matrices that slide across the image and perform a dot product with the sub-regions of the input, producing a feature map [73]. Each filter captures different aspects of the image, such as edges, textures, or patterns. The stride determines the step size of the filter as it moves across the image, while padding involves adding extra pixels around the border of the image to control the spatial dimensions of the output feature map.
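The roles of the kernel, stride, and padding can be illustrated with a small sketch in plain Python. It assumes a square kernel, zero padding, and a hand-written edge filter (in a trained CNN the filter weights are learned), and it computes the sliding dot product without flipping the kernel, as deep learning libraries commonly do.

```python
def conv2d(image, kernel, stride=1, padding=0):
    """Slide `kernel` over `image` (lists of lists) and return the feature map."""
    k = len(kernel)                      # square kernel assumed
    h, w = len(image), len(image[0])
    # Zero padding: add `padding` rows/columns of zeros on every side
    padded = [[0.0] * (w + 2 * padding) for _ in range(h + 2 * padding)]
    for r in range(h):
        for c in range(w):
            padded[r + padding][c + padding] = image[r][c]
    # Output size follows (input + 2*padding - kernel) // stride + 1
    out_h = (h + 2 * padding - k) // stride + 1
    out_w = (w + 2 * padding - k) // stride + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot product between the kernel and the current sub-region
            total = 0.0
            for a in range(k):
                for b in range(k):
                    total += kernel[a][b] * padded[i * stride + a][j * stride + b]
            row.append(total)
        feature_map.append(row)
    return feature_map

image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
vertical_edge = [   # hand-written filter for illustration only
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
valid = conv2d(image, vertical_edge)                          # no padding: 2 x 2 map
same = conv2d(image, vertical_edge, stride=1, padding=1)      # padding keeps 4 x 4
strided = conv2d(image, vertical_edge, stride=2, padding=1)   # stride 2: 2 x 2 map
```

Without padding, the 4 × 4 input shrinks to a 2 × 2 feature map; one pixel of padding preserves the input size, and a stride of 2 halves it again.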

After each convolution operation, an activation function, typically a rectified linear unit [74], is applied to introduce non-linearity into the model. This non-linearity allows the CNN to learn more complex patterns and features. Pooling layers then reduce the spatial dimensions of the feature maps, retaining the most important information while reducing computational load and controlling overfitting [75]. Max pooling selects the maximum value in each patch of the feature map, whereas average pooling computes the average value (see numbers in Figure 4).
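A minimal sketch of the activation and pooling steps is given below in plain Python. The 4 × 4 feature map is a made-up example, and the pooling window is assumed to be non-overlapping (stride equal to the window size), which is a common but not universal choice.

```python
def relu(feature_map):
    """Rectified linear unit applied element-wise: negative values become zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]

def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling (the stride equals the window size)."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            patch = [feature_map[i + a][j + b] for a in range(size) for b in range(size)]
            row.append(max(patch) if mode == "max" else sum(patch) / len(patch))
        pooled.append(row)
    return pooled

fm = [   # hypothetical feature map
    [1.0, -2.0, 3.0, 0.0],
    [4.0, 5.0, -6.0, 1.0],
    [-1.0, 0.0, 2.0, 2.0],
    [0.0, -3.0, 2.0, 2.0],
]
activated = relu(fm)
max_pooled = pool2d(activated, 2, "max")   # [[5.0, 3.0], [0.0, 2.0]]
avg_pooled = pool2d(activated, 2, "avg")   # [[2.5, 1.0], [0.0, 2.0]]
```

Both pooling variants reduce the 4 × 4 map to 2 × 2; max pooling keeps the strongest response in each patch, while average pooling smooths it.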

Following several convolutional and pooling layers, high-level reasoning in the neural network is performed via fully connected layers. These layers connect every neuron in one layer to every neuron in the next layer, aggregating the learned features to make final predictions (e.g., A, B, C, D and E in Figure 4). Dropout is a regularization technique used to prevent overfitting by randomly setting a fraction of input units to zero during training, which helps the model generalize better to unseen data.
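The fully connected layer and dropout can be sketched as follows in plain Python. The weights, biases, and input values are hypothetical, and the sketch assumes the widely used "inverted" form of dropout, in which surviving units are rescaled during training so that no adjustment is needed at prediction time.

```python
import random

def dense(inputs, weights, biases):
    """Fully connected layer: every input feeds every output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def dropout(inputs, rate, training, rng):
    """Inverted dropout: zero each unit with probability `rate` while training
    and rescale the survivors so the expected activation is unchanged."""
    if not training or rate == 0.0:
        return list(inputs)
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in inputs]

rng = random.Random(0)
features = [1.0, 2.0, 3.0, 4.0]          # flattened feature-map values (hypothetical)
weights = [[0.5, 0.0, -0.5, 1.0],        # 2 output neurons x 4 inputs
           [1.0, 1.0, 1.0, 1.0]]
biases = [0.0, -1.0]

train_out = dense(dropout(features, 0.5, True, rng), weights, biases)   # noisy
eval_out = dense(dropout(features, 0.5, False, rng), weights, biases)   # [3.0, 9.0]
```

During training the output varies from pass to pass because different units are dropped; at evaluation time dropout is switched off and the layer is deterministic.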

Kim et al. [76] developed a decision-making software program that identifies the maximum revenue obtainable from corn production and power generation under an APV system. Farmland was identified by applying the K-means clustering technique to red, green, blue (RGB) pixels extracted from satellite images. The performance of K-means clustering was measured by comparing the actual area of the selected farmland with the area detected through clustering; the proposed methodology detected 97.6% of the actual farmland area. The farmland area measured in this way was then used for yield estimation, with corn production estimated by shadow ratio-based PR. R² was used to measure the estimation accuracy of the proposed PR, and its value was 86.03%.
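The idea of clustering RGB pixels to separate farmland from other land cover can be sketched with a plain implementation of Lloyd's K-means algorithm. This is not the software of Kim et al. [76]: the pixel values, the choice of two clusters, and the initial centroids are all assumptions made for illustration.

```python
def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm; `centroids` holds the initial guesses."""
    centroids = [tuple(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance
        for i, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels[i] = dists.index(min(dists))
        # Update step: move each centroid to the mean of its members
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = tuple(sum(ch) / len(members) for ch in zip(*members))
    return labels, centroids

# Hypothetical RGB pixels: greenish (vegetation) and greyish (built-up) samples
pixels = [(40, 160, 50), (35, 150, 45), (50, 170, 60),
          (120, 120, 125), (130, 128, 130), (110, 115, 118)]
labels, centroids = kmeans(pixels, centroids=[pixels[0], pixels[3]])

# Share of pixels falling in the same cluster as the first (greenish) pixel
farmland_share = labels.count(labels[0]) / len(labels)
```

Multiplying such a pixel share by the ground area covered by the image gives a detected area that can be compared with the actual farmland area, which is the kind of matching measure described above.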

Yang et al. [77] estimated corn yield through hyperspectral imagery and CNN. Two kinds of data were extracted for modeling and validation: the spectral information, which captured the internal traits of corn growth, and the spatial data from the color image (comprising red, green, and blue bands extracted from the hyperspectral imagery), which represented the external traits. A drone, specifically the DJI Matrice 600 Pro equipped with an A3 Pro flight controller, was employed to capture images of agricultural fields. Hyperspectral images were initially collected as raw digital numbers, which lack physical meaning. To preprocess these images, the solar illumination conditions were applied so that the values reflected the actual reflectance properties of the corn objects. Subsequently, an area measuring 64 × 64 pixels was cropped from each image as the region of interest. The CNN algorithm, a prominent feedforward neural network in image recognition research, was utilized. The collected images served as the CNN input, and the output was set to the corn yield data categorized into five levels for supervised learning. This study differs from previous ones in that it directly utilized the morphological features of corn from images. The validation results showed that the proposed CNN model had a classification accuracy of 75.50%.
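Two of the data preparation steps mentioned above, cropping a 64 × 64 region of interest and converting continuous yields into five classes, can be sketched as follows, together with the classification accuracy measure. The centered crop and the equal-width class boundaries are assumptions made here for illustration; the source does not state how the region or the five levels were defined.

```python
def center_crop(image, size):
    """Return the central size x size window of a 2-D list."""
    top = (len(image) - size) // 2
    left = (len(image[0]) - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]

def yield_level(value, low, high, levels=5):
    """Map a yield to one of `levels` equal-width classes (0 .. levels-1)."""
    if value >= high:
        return levels - 1
    if value <= low:
        return 0
    return int((value - low) / (high - low) * levels)

def accuracy(true_levels, predicted_levels):
    """Fraction of samples whose predicted class matches the true class."""
    hits = sum(t == p for t, p in zip(true_levels, predicted_levels))
    return hits / len(true_levels)

image = [[r * 100 + c for c in range(100)] for r in range(100)]  # dummy 100 x 100 band
roi = center_crop(image, 64)

yields = [1.0, 2.5, 4.5, 6.1, 9.9]                 # hypothetical yields, arbitrary units
levels = [yield_level(y, 0.0, 10.0) for y in yields]
score = accuracy(levels, [0, 1, 2, 4, 4])          # one of five predictions is wrong
```

Framing yield as a five-class problem lets a standard image classifier be used, at the cost of losing resolution within each class.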

Khaki et al. [78] focused on selecting superior crop species for corn plant breeding. Semi-supervised learning was used, and the crowd counting method, a technique for estimating the number of people in an image, was benchmarked. The density of corn kernels was estimated from images of corn ears, and the number of corn kernels was predicted based on this estimation. The proposed DeepCorn showed better performance compared to CNN-based state-of-the-art image processing approaches, including DensityCNN, SaCNN, and ACSPNet.

3.4. Machine Learning Techniques Applied to Other Operations of Biofuel Production

As mentioned in Section 2, this study emphasized that accurate prediction of feedstock yield is important as it is performed at the earliest stage of biofuel production, and various machine learning techniques for accurate feedstock yield prediction were reviewed in Section 3.1, Section 3.2 and Section 3.3. By ensuring a stable supply of raw materials (feedstock), an optimal production plan can be established, and the refinery can be operated efficiently. Nonetheless, several studies are being conducted using machine learning techniques to increase production efficiency in various operations after feedstock supply, and representative studies can be summarized as follows.

Horikawa et al. [79] employed NIR and partial least squares (PLS) regression analysis to predict the bioethanol yield of chemically pretreated Erianthus. PLS was selected because the spectral data variables (i.e., NIR) are highly correlated, which makes it an effective method for dimensionality reduction and noise elimination. NIR spectroscopic data were collected within the range of 4000–10,000 cm⁻¹. The analysis yielded R² > 0.89 and an RMSE of prediction < 6.34 in all datasets except for those within the 7300–10,000 cm⁻¹ band. The 5500–7300 cm⁻¹ band, where OH and CH vibrations occur, exhibited the highest accuracy, indicating a significant impact on the model. Magaña et al. [80] also presented an approach that employs NIR and modified PLS regression to forecast the direct bioethanol yield in sugar beet. The model yielded a standard error of cross-validation (SECV), a standard error of prediction (SEP), and an R² of 0.51, 0.49, and 0.91, respectively. This non-destructive approach has the potential to significantly reduce the time required for bioethanol yield measurement, from 64 h to 3 min.
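Why PLS copes well with highly correlated spectral variables can be seen in a single-component sketch of PLS regression for one response (PLS1). It is a simplified illustration in plain Python, not the modeling procedure of the cited studies: only one latent variable is extracted, and the three-variable "spectra" are synthetic and perfectly collinear by construction.

```python
def pls1_one_component(X, y):
    """Fit a single-component PLS1 model; returns a predict(row) function."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Weight vector: direction in X-space with maximal covariance with y
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]
    # Scores: projection of each centred sample onto w (one latent variable)
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    # Regress the centred response on the scores
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)

    def predict(row):
        score = sum((row[j] - x_mean[j]) * w[j] for j in range(p))
        return y_mean + q * score
    return predict

# Hypothetical, perfectly collinear "absorbances" at three wavenumbers
latent = [0.0, 1.0, 2.0, 3.0]
X = [[1.0 * s, 2.0 * s, 3.0 * s] for s in latent]
y = [5.0 * s + 1.0 for s in latent]
predict = pls1_one_component(X, y)
```

Ordinary least squares has no unique solution for these collinear columns, whereas PLS first compresses them into one latent score and then regresses the yield on that score. In practice several components are extracted and their number is chosen by cross-validation.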

Watanabe et al. [81] investigated the impact of fermentation inhibitors on the bioethanol yield derived from the fermentation of lignocellulosic biomass. The components of the hydrolysate were quantified by gas chromatography–mass spectrometry (GC-MS) and high-performance liquid chromatography (HPLC). PLS regression was used to examine the relationship between the measured components and ethanol concentration. The impact of the inhibitors was evaluated based on their variable importance in projection (VIP) scores and correlation coefficients. The study results identified well-known inhibitors such as acetate, furfural, and 5-HMF, as well as low-concentration but important inhibitors like apocynin and m-methoxyacetophenone (VIP ≥ 1).

Zhang et al. [82] employed a regression model-based approach to optimize fermentation factors in the bioethanol production process using water hyacinth. Fermentation temperature, fermentation time, and inoculum dosage were selected as the main factors, and the response surface method (RSM) was applied to evaluate their combined effects. Subsequently, a quadratic polynomial regression equation was utilized to model the relationship between bioethanol concentration and the aforementioned factors. The optimal conditions were identified as 38.87 °C, 81.87 h, and 6.11 mL yeast. The model demonstrated a high level of significance (p < 0.05), with a predicted ethanol yield of 1.291 g/L, which closely aligned with the experimental result of 1.289 g/L.
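The logic of fitting a quadratic response surface and reading off the optimum can be sketched for a single factor. The example below is a simplification in plain Python: the reviewed study used three factors, whereas this sketch uses temperature only, with synthetic concentrations and coded factor levels chosen here as assumptions.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_quadratic(xs, ys):
    """Least-squares fit of y = b0 + b1*x + b2*x^2 via the normal equations."""
    rows = [[1.0, x, x * x] for x in xs]
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    Aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    return solve(AtA, Aty)

# Synthetic ethanol concentrations (g/L) at five fermentation temperatures (deg C)
temps = [30.0, 34.0, 38.0, 42.0, 46.0]
center, step = 38.0, 4.0
coded = [(t - center) / step for t in temps]             # coded levels -2 .. 2
conc = [1.3 - 0.005 * (t - 39.0) ** 2 for t in temps]    # made-up surface, peak at 39

b0, b1, b2 = fit_quadratic(coded, conc)
coded_opt = -b1 / (2.0 * b2)              # stationary point of the fitted parabola
optimum_temp = center + step * coded_opt  # back to natural units
predicted_max = b0 + b1 * coded_opt + b2 * coded_opt ** 2
```

With several factors the same idea applies: the fitted equation gains cross-product terms, and the stationary point is found by setting all partial derivatives to zero.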

Shenbagamuthuraman and Kasianantham [83] compared the RSM and ANN approaches as methodologies for optimizing the bioethanol yield from Chlorella vulgaris biomass. Both models were constructed with fermentation temperature, fermentation time, and yeast concentration as input factors. The ANN model employed a multilayer feedforward neural network and was trained to achieve a mean squared error (MSE) of 0.017 and an R² of 0.99. These results were better than those of the RSM model, which had an MSE of 0.022 and an R² of 0.97. The ANN model identified the optimal fermentation conditions (i.e., 36 h of fermentation time, 30 °C, and 1.5 g/L inoculum concentration) to achieve the maximum ethanol yield of 3.3 g/L.

Dave et al. [84] modeled bioethanol yield prediction using Ulva prolifera biomass by employing a methodology that combined ANN with a genetic algorithm (GA). The ANN model took substrate concentration (g/L), fermentation time (h), yeast inoculum (% v/v), temperature (°C), agitation speed (rpm), and potential of hydrogen (pH) as inputs, and the resulting regression equation for predicting bioethanol yield served as the fitness function. Subsequently, GA was employed to optimize the fermentation conditions based on the regression equation. The developed model predicted the maximum bioethanol yield at 30 g/L substrate, 48 h fermentation time, 10% (v/v) inoculum, 30 °C, 50 rpm agitation speed, and pH 6. The experimentally obtained maximum yield was 0.242 ± 0.002 g/g RS, which closely matched the predicted yield of 0.239 g/g RS. A comparable study was conducted by Mondal et al. [85], who employed an ANN-GA approach to achieve the maximum yield of total reducing sugar (TRS) in the saccharification process using waste broken rice.
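How a GA searches fermentation conditions against a fitted yield model can be sketched as follows. This is an illustration only: the `fitness` function is a hypothetical stand-in for a trained ANN, only two of the six factors are searched, and the bounds, population size, and mutation scale are assumptions rather than settings from the cited studies.

```python
import random

def fitness(temp_c, time_h):
    """Hypothetical stand-in for a trained ANN: a smooth surface peaking at (30, 48)."""
    return 0.24 - 0.0005 * (temp_c - 30.0) ** 2 - 0.0001 * (time_h - 48.0) ** 2

BOUNDS = [(20.0, 40.0), (12.0, 72.0)]   # assumed search ranges, not from the paper

def clip(value, low, high):
    return max(low, min(high, value))

def genetic_search(generations=60, pop_size=30, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in BOUNDS] for _ in range(pop_size)]
    history = []                                # best fitness seen in each generation
    for _ in range(generations):
        pop.sort(key=lambda ind: fitness(*ind), reverse=True)
        history.append(fitness(*pop[0]))
        parents = pop[: pop_size // 2]          # truncation selection
        children = [pop[0][:]]                  # elitism: keep the best unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Uniform crossover followed by Gaussian mutation, clipped to bounds
            child = [rng.choice(pair) for pair in zip(a, b)]
            child = [clip(g + rng.gauss(0.0, 0.02 * (hi - lo)), lo, hi)
                     for g, (lo, hi) in zip(child, BOUNDS)]
            children.append(child)
        pop = children
    best = max(pop, key=lambda ind: fitness(*ind))
    return best, history

best, history = genetic_search()
print("best conditions:", best, "predicted yield:", fitness(*best))
```

Because the GA only needs fitness values, it can be paired with any predictive model, which is why it combines naturally with an ANN whose response surface has no convenient closed-form optimum.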

Konishi [86] developed a deep learning model for the prediction of bioethanol yield based on the analysis of volatile components present in lignocellulosic biomass hydrolysates. The deep neural network (DNN) comprised six layers and utilized the volatile components of hydrolysates (n = 208) as input. An asymmetric autoencoder–decoder (AAE) was employed to decode the volatile composition and highlight significant inhibitory variables through nonlinear dimensionality reduction. The developed model demonstrated high accuracy, with training and validation losses of 0.033 and 0.507, respectively. Furthermore, the AAE identified 2, 4-tert-butylphenol, which had not been identified in previous PLS models, as a highly toxic substance.

Concu [87] developed a perturbation theory machine learning (PTML) model to predict enzyme subclasses. In this study, a sequence alignment-free model was developed using sequence recurrence networks (SRNs), in which amino acids are treated as nodes and connections are formed between nodes that are adjacent or recurrent. The PTML-ANN model was implemented using four different ANN models: linear neural network, radial basis function, probabilistic neural network, and MLP. Among the models considered, the MLP model exhibited the best performance, achieving over 90% accuracy, specificity, and sensitivity in both the training and validation datasets.

Fernández et al. [88] proposed an approach for applying nonlinear tracking control simulation to a bioethanol production system. By combining a controller design based on linear algebra with a Bayesian state estimator based on Gaussian processes, the method predicts the concentrations of cells, ethanol, and glycerol using only substrate measurements. The proposed controller is designed to minimize the error between the actual state and the reference state by adjusting the feed flow rate. Qualitative evaluation through the figures demonstrates that the Bayesian estimator performs better than the neural network estimator by reducing the total error with less mathematical complexity and lower computational cost.

Ostos-Garrido et al. [89] employed high-throughput phenotyping using unmanned aerial vehicle-based multispectral imaging technology to assess the potential for bioethanol production in various crops, including barley, wheat, and triticale. The VIs were calculated from the spectral information collected from the images of 66 cereal accessions. Subsequently, the theoretical bioethanol yield was calculated using the measured values of total biomass dry weight (kg/m²) and sugar release (µL/mg). A linear regression model was employed to analyze the correlation between the estimated theoretical yield and the VIs, with NDVI exhibiting the highest correlation (R² = 0.66). This enabled the ranking of the cereal accessions in terms of their potential for bioethanol production.
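The chain from band reflectance to a vegetation index, a linear regression, and a ranking can be sketched as follows. The NDVI formula is the standard one; the reflectance values and ethanol yields are made up for illustration and do not come from the cited study.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index from NIR and red reflectance."""
    return (nir - red) / (nir + red)

def linear_fit(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, r_squared)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return a, b, 1.0 - ss_res / ss_tot

# Hypothetical plot-level reflectances and theoretical ethanol yields
nir = [0.50, 0.55, 0.60, 0.45, 0.65]
red = [0.10, 0.08, 0.06, 0.15, 0.05]
ethanol = [2.1, 2.6, 2.7, 1.6, 3.2]

index = [ndvi(n, r) for n, r in zip(nir, red)]
intercept, slope, r2 = linear_fit(index, ethanol)

# Rank plots from highest to lowest NDVI as a proxy for ethanol potential
ranking = sorted(range(len(index)), key=lambda i: index[i], reverse=True)
```

Once the regression is fitted, new accessions can be ranked from imagery alone, without destructive biomass or sugar-release measurements.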

As described above, the latest machine learning techniques are being applied to increase production efficiency in the preprocessing, fermentation, distillation, and bioethanol production processes, and research on the application of new technologies is expected to continue for further improvement.

3.5. Discussion

As mentioned in Section 1 and Section 2, the first- and second-generation ethanol production consists of multiple operations such as fermentation, distillation, and distribution. To enhance production efficiency, it is very important to develop enzymes that can efficiently convert feedstock into biofuel and to discover the most effective chemical reaction conditions. Additionally, the distillation operation is also important to secure high-quality, pure and refined biofuel. Likewise, to reduce transportation costs, optimal distribution strategies for feedstock or produced biofuel must be considered. In fact, since the biofuel operations are connected sequentially, issues that occur in the previous operation are passed on to the next operation. From this perspective, a stable supply of feedstock, which is the raw material for biofuel, is also an important issue, and unstable feedstock supply can change the optimally designed production schedule or stop operations of the entire refinery facility. Considering the fact that predicting crop production worldwide is becoming increasingly difficult, especially due to serious climate change, various techniques are being studied to accurately predict yields in order to ensure a stable supply and demand of feedstock. In the field of agriculture, where much research has been performed on feedstock yield prediction, attempts are being actively made to apply machine learning techniques, including crop yield prediction using existing process-based models such as agricultural land management alternative with numerical assessment criteria (ALMANAC) [90] and the agricultural policy and environmental extender (APEX) [91]. 
Unlike process-based models, which require considerable time for simulation modeling of how crop parameters and external environmental variables (e.g., climate and soil nutrients) influence feedstock growth [15], machine learning methodologies have the major advantage of quickly providing accurate feedstock yield estimates by learning various relationships (linear, nonlinear, or multi-layer neural network structures) among variables from collected data [50].

In this study, the diverse methodologies employed for the estimation of crop yield each possess distinctive advantages and challenges, as mentioned in this review. Each approach, based on regression, neural networks, or image data, provides a comprehensive understanding of agricultural yield prediction. Artificial neural network-based approaches are advantageous in that they can effectively capture nonlinear relationships and achieve high accuracy through weighted learning of artificial neural networks. In particular, time series data can be employed to leverage regionally accumulated temporal characteristics, which are effective in predicting crop yields.

However, a limitation of deep learning-based approaches, often referred to as “black boxes”, is that it is frequently challenging to interpret and explain their internal operating principles [92]. More specifically, a deep learning model derives the relationship between independent and dependent variables from the given database and algorithm, and it becomes difficult to explain how any individual independent variable relates to the dependent variable. The interactions between neurons adjust thousands to millions of parameters, which makes it very difficult to assess the reliability of a prediction or to determine the cause when the model makes a false prediction [93]. In contrast, regression models express results as predictors multiplied by regression coefficients, which facilitates the identification of variables (e.g., solar radiation, precipitation) with a significant impact on yield. Moreover, training regression models typically requires less data and computing resources than deep learning models, which is an additional advantage. For these reasons, traditional statistical techniques such as regression models or correlation analysis, which can explain the relationship between specific independent and dependent variables through coefficients, are still widely used when the black box problem must be avoided. Encouragingly, explainable artificial intelligence (XAI) models have recently been the subject of active development and research [94].

Consequently, when precise production forecasting is necessary for inventory and demand management, deep learning approaches are effective. Regression model-based approaches, meanwhile, are helpful for studying how predicted results change when specific conditions are altered, or for explaining the causes of high or low yields. Among the deep learning approaches most frequently used in agriculture are those based on images, which can capture morphological, meteorological, and spatial information that is difficult to express numerically. These approaches are particularly beneficial in scenarios where feedstock crops can be visually inspected or where satellite imagery can provide valuable data.

4. Conclusions and Future Directions

This study reviewed 17 papers containing widely used techniques for yield estimation, selected from among 210,000 papers found on Google Scholar using the keywords “crop yield estimation” and “machine learning model”. The selected papers were classified into three machine learning categories: regression model-based, neural network (NN)-based, and image-based approaches. In regression model-based estimation approaches, linear regression, PR, and RF regression were found to be the most popular approaches due to their convenience of use. In particular, PR and RFs had the advantage of predicting yields more accurately by expressing the nonlinear relationship between independent and dependent variables that is not considered in linear regression. In NN-based estimation approaches, various ANN techniques were utilized; in particular, yield prediction models using LSTM, which is based on RNN, were used to increase prediction accuracy when time series data were given. LSTM had the advantage of showing higher prediction accuracy than LASSO or RF because it performed crop yield prediction from high-dimensional input data (i.e., spectral, spatial, and temporal), but it required the collection of a large amount of data to explain the complex relationships between variables. In image-based estimation approaches, deep learning techniques such as CNN were mainly utilized, and CNN had the advantage of being able to identify the condition of a farm and predict yields from image data such as satellite or aerial photographs. Additionally, various machine learning techniques (i.e., multi-variate regression, ANN, PTML, DNN) and new data such as NIR sensing images were utilized to increase the efficiency of converting a given feedstock into ethanol during preprocessing, fermentation, distillation, and bioethanol production.
As a result, this study confirmed that various machine learning-based approaches could accurately perform yield estimation in given data sets and experimental environments. However, considering that widely used machine learning techniques still have the limitation of being black-box methodologies, it seems necessary to pay more attention to the development and use of explainable AI by combining existing process-based models or mathematical models designed through experiments that enable interpretation of cause and effect for future scientific advancement.

Author Contributions

Conceptualization, S.K.; methodology, H.L. and S.K.; validation, H.L. and S.K.; resources, S.K.; writing—original draft preparation, H.L. and S.K.; writing—review and editing, H.L. and S.K.; visualization, H.L.; project administration, S.K.; funding acquisition, S.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by a Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (No. RS-2023-00239448).

Data Availability Statement

Not applicable.

Acknowledgments

The authors gratefully acknowledge the support of the NRF of Korea, funded by the Ministry of Education.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Tse, T.J.; Wiens, D.J.; Reaney, M.J.T. Production of Bioethanol-A Review of Factors Affecting Ethanol Yield. Fermentation 2021, 7, 268. [Google Scholar] [CrossRef]
Robak, K.; Balcerek, M. Review of Second Generation Bioethanol Production from Residual Biomass. Food Technol. Biotechnol. 2018, 56, 174–187. [Google Scholar] [CrossRef] [PubMed]
Renewable Fuels Association. Annual Ethanol Production. Available online: https://ethanolrfa.org/markets-and-statistics/annual-ethanol-production (accessed on 6 July 2024).
Renewable Fuels Association. Ethanol Industry Outlook. 2019. Available online: https://d35t1syewk4d42.cloudfront.net/file/18/RFA_outlook_2019_newlogo.pdf (accessed on 6 July 2024).
Renewable Fuels Association. Ethanol Industry Outlook. 2020. Available online: https://d35t1syewk4d42.cloudfront.net/file/21/2020-Outlook-Final-for-Website.pdf (accessed on 6 July 2024).
Zheng, J.L.; Zhu, Y.H.; Su, H.Y.; Sun, G.T.; Kang, F.R.; Zhu, M.Q. Life cycle assessment and techno-economic analysis of fuel ethanol production via bio-oil fermentation based on a centralized-distribution model. Renew. Sustain. Energy Rev. 2022, 167, 112714. [Google Scholar] [CrossRef]
Li, H.; Li, S. Optimization of continuous solid-state distillation process for cost-effective bioethanol production. Energies 2020, 13, 854. [Google Scholar] [CrossRef]
Dickson, R.; Liu, J.J. A strategy for advanced biofuel production and emission utilization from macroalgal biorefinery using superstructure optimization. Energy 2021, 221, 119883. [Google Scholar] [CrossRef]
Khan, S.; Naushad, M.; Iqbal, J.; Bathula, C.; Ala’a, H. Challenges and perspectives on innovative technologies for biofuel production and sustainable environmental management. Fuel 2022, 325, 124845. [Google Scholar] [CrossRef]
Yazdanparast, R.; Jolai, F.; Pishvaee, M.S.; Keramati, A. A resilient drop-in biofuel supply chain integrated with existing petroleum infrastructure: Toward more sustainable transport fuel solutions. Renew. Energy 2022, 184, 799–819. [Google Scholar] [CrossRef]
Ambaye, T.G.; Vaccari, M.; Bonilla-Petriciolet, A.; Prasad, S.; van Hullebusch, E.D.; Rtimi, S. Emerging technologies for biofuel production: A critical review on recent progress, challenges and perspectives. J. Environ. Manag. 2021, 290, 112627. [Google Scholar] [CrossRef] [PubMed]
Singh, A.K.; Garg, N.; Tyagi, A.K. Viable feedstock options and technological challenges for ethanol production in India. Curr. Sci. 2016, 111, 815–822. [Google Scholar] [CrossRef]
Vohra, M.; Manwar, J.; Manmode, R.; Padgilwar, S.; Patil, S. Bioethanol production: Feedstock and current technologies. J. Environ. Chem. Eng. 2014, 2, 573–584. [Google Scholar] [CrossRef]
Slade, R.; Bauen, A.; Shah, N. The commercial performance of cellulosic ethanol supply-chains in Europe. Biotechnol. Biofuels 2009, 2, 573–584. [Google Scholar] [CrossRef] [PubMed]
Kim, S.; Kim, S. Hybrid simulation framework for the production management of an ethanol biorefinery. Renew. Sustain. Energy Rev. 2022, 155, 111911. [Google Scholar] [CrossRef]
Kim, S.; Kim, S.; Green, C.H.M.; Jeong, J. Multivariate polynomial regression modeling of total dissolved-solids in rangeland stormwater runoff in the Colorado River Basin. Environ. Model. Softw. 2022, 157, 105523. [Google Scholar] [CrossRef]
Shalev-Shwartz, S.; Ben-David, S. Understanding Machine Learning: From Theory to Algorithms; Cambridge University Press: Cambridge, UK, 2014. [Google Scholar]
Kotsiantis, S.B.; Zaharakis, I.; Pintelas, P. Supervised machine learning: A review of classification techniques. Emerg. Artif. Intell. Appl. Comput. Eng. 2007, 160, 3–24. [Google Scholar]
Cunningham, P.; Cord, M.; Delany, S.J. Supervised learning. In Machine Learning Techniques for Multimedia: Case Studies on Organization and Retrieval; Springer: Berlin/Heidelberg, Germany, 2008; pp. 21–49. [Google Scholar]
Hartigan, J.A.; Wong, M.A. Algorithm AS 136: A k-means clustering algorithm. J. R. Stat. Soc. Ser. C (Appl. Stat.) 1979, 28, 100–108. [Google Scholar] [CrossRef]
Abdi, H.; Williams, L.J. Principal component analysis. Wiley Interdiscip. Rev. Comput. Stat. 2010, 2, 433–459. [Google Scholar] [CrossRef]
Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022. [Google Scholar]
Renewable Fuels Association. Ethanol Industry Outlook. 2024. Available online: https://d35t1syewk4d42.cloudfront.net/file/2666/RFA_Outlook_2024_full_final_low.pdf (accessed on 6 July 2024).
Malik, K.; Sharma, P.; Yang, Y.L.; Zhang, P.; Zhang, L.H.; Xing, X.H.; Yue, J.W.; Song, Z.Z.; Nan, L.; Su, Y.J.; et al. Lignocellulosic biomass for bioethanol: Insight into the advanced pretreatment and fermentation approaches. Ind. Crop. Prod. 2022, 188, 115569. [Google Scholar] [CrossRef]
Dias, M.O.S.; Ensinas, A.V.; Nebra, S.A.; Maciel, R.; Rossell, C.E.V.; Maciel, M.R.W. Production of bioethanol and other bio-based materials from sugarcane bagasse: Integration to conventional bioethanol production process. Chem. Eng. Res. Des. 2009, 87, 1206–1216. [Google Scholar] [CrossRef]
An, H.J.; Wilhelm, W.E.; Searcy, S.W. Biofuel and petroleum-based fuel supply chain research: A literature review. Biomass Bioenergy 2011, 35, 3763–3774. [Google Scholar] [CrossRef]
Zabed, H.; Sahu, J.N.; Suely, A.; Boyce, A.N.; Faruq, G. Bioethanol production from renewable sources: Current perspectives and technological progress. Renew. Sust. Energy Rev. 2017, 71, 475–501. [Google Scholar] [CrossRef]
McAloon, A.; Taylor, F.; Yee, W.; Ibsen, K.; Wooley, R. Determining the Cost of Producing Ethanol from Corn Starch and Lignocellulosic Feedstocks; National Renewable Energy Laboratory (NREL): Golden, CO, USA, 2000.
Slewinski, T.L. Non-structural carbohydrate partitioning in grass stems: A target to increase yield stability, stress tolerance, and biofuel production. J. Exp. Bot. 2012, 63, 4647–4670. [Google Scholar] [CrossRef] [PubMed]
Bothast, R.J.; Schlicher, M.A. Biotechnological processes for conversion of corn into ethanol. Appl. Microbiol. Biotechnol. 2005, 67, 19–25. [Google Scholar] [CrossRef] [PubMed]
Aditiya, H.B.; Mahlia, T.M.I.; Chong, W.T.; Nur, H.; Sebayang, A.H. Second generation bioethanol production: A critical review. Renew. Sustain. Energy Rev. 2016, 66, 631–653. [Google Scholar] [CrossRef]
Bai, Y.; Luo, L.; van der Voet, E. Life cycle assessment of switchgrass-derived ethanol as transport fuel. Int. J. Life Cycle Assess. 2010, 15, 468–477. [Google Scholar] [CrossRef]
Balat, M.; Balat, H.; Öz, C. Progress in bioethanol processing. Prog. Energy Combust. Sci. 2008, 34, 551–573. [Google Scholar] [CrossRef]
Larnaudie, V.; Ferrari, M.D.; Lareo, C. Switchgrass as an alternative biomass for ethanol production in a biorefinery: Perspectives on technology, economics and environmental sustainability. Renew. Sustain. Energy Rev. 2022, 158, 112115. [Google Scholar] [CrossRef]
Dos Santos, L.V.; de Barros Grassi, M.C.; Gallardo, J.C.M.; Pirolla, R.A.S.; Calderón, L.L.; de Carvalho-Netto, O.V.; Parreiras, L.S.; Camargo, E.L.O.; Drezza, A.L.; Missawa, S.K. Second-generation ethanol: The need is becoming a reality. Ind. Biotechnol. 2016, 12, 40–57. [Google Scholar] [CrossRef]
Limayem, A.; Ricke, S.C. Lignocellulosic biomass for bioethanol production: Current perspectives, potential issues and future prospects. Prog. Energy Combust. Sci. 2012, 38, 449–467. [Google Scholar] [CrossRef]
Horn, S.J.; Vaaje-Kolstad, G.; Westereng, B.; Eijsink, V.G.H. Novel enzymes for the degradation of cellulose. Biotechnol. Biofuels 2012, 5, 45. [Google Scholar] [CrossRef]
Keshwani, D.R.; Cheng, J.J. Switchgrass for bioethanol and other value-added applications: A review. Bioresour. Technol. 2009, 100, 1515–1523. [Google Scholar] [CrossRef] [PubMed]
Dias, M.O.S.; Cunha, M.P.; Maciel, R.; Bonomi, A.; Jesus, C.D.F.; Rossell, C.E.V. Simulation of integrated first and second generation bioethanol production from sugarcane: Comparison between different biomass pretreatment methods. J. Ind. Microbiol. Biotechnol. 2011, 38, 955–966. [Google Scholar] [CrossRef] [PubMed]
Palacios-Bereche, R.; Mosqueira-Salazar, K.J.; Modesto, M.; Ensinas, A.V.; Nebra, S.A.; Serra, L.M.; Lozano, M.A. Exergetic analysis of the integrated first- and second-generation ethanol production from sugarcane. Energy 2013, 62, 46–61. [Google Scholar] [CrossRef]
Ahamed, T.; Tian, L.; Zhang, Y.; Ting, K.C. A review of remote sensing methods for biomass feedstock production. Biomass Bioenergy 2011, 35, 2455–2469. [Google Scholar] [CrossRef]
Dimov, D.; Uhl, J.H.; Löw, F.; Seboka, G.N. Sugarcane yield estimation through remote sensing time series and phenology metrics. Smart Agric. Technol. 2022, 2, 100046. [Google Scholar] [CrossRef]
Shahhosseini, M.; Hu, G.P.; Archontoulis, S.V. Forecasting Corn Yield with Machine Learning Ensembles. Front. Plant Sci. 2020, 11, 1120.
Kim, S.; Kim, S. Performance Estimation Modeling via Machine Learning of an Agrophotovoltaic System in South Korea. Energies 2021, 14, 6724.
Khaki, S.; Wang, L.Z. Crop Yield Prediction Using Deep Neural Networks. Front. Plant Sci. 2019, 10, 621.
Shin, H.C.; Roth, H.R.; Gao, M.C.; Lu, L.; Xu, Z.Y.; Nogues, I.; Yao, J.H.; Mollura, D.; Summers, R.M. Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning. IEEE Trans. Med. Imaging 2016, 35, 1285–1298.
Shastry, A.; Sanjay, H.; Bhanusree, E. Prediction of crop yield using regression techniques. Int. J. Soft Comput. 2017, 12, 96–102.
Ansarifar, J.; Wang, L.Z.; Archontoulis, S.V. An interaction regression model for crop yield prediction. Sci. Rep. 2021, 11, 17754.
Johann, A.L.; de Araújo, A.G.; Delalibera, H.C.; Hirakawa, A.R. Soil moisture modeling based on stochastic behavior of forces on a no-till chisel opener. Comput. Electron. Agric. 2016, 121, 420–428.
Kim, S.; Seo, J.; Kim, S. Machine Learning Technologies in the Supply Chain Management Research of Biodiesel: A Review. Energies 2024, 17, 1316.
James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning; Springer: Berlin/Heidelberg, Germany, 2013; Volume 112.
Maulud, D.; Abdulazeez, A.M. A review on linear regression comprehensive in machine learning. J. Appl. Sci. Technol. Trends 2020, 1, 140–147.
Wang, L.; Yang, G.; Li, Z.; Xu, F. An efficient nonlinear interval uncertain optimization method using Legendre polynomial chaos expansion. Appl. Soft Comput. 2021, 108, 107454.
Bertsimas, D.; Van Parys, B. Sparse hierarchical regression with polynomials. Mach. Learn. 2020, 109, 973–997.
Dette, H.; Melas, V.B.; Pepelyshev, A. Optimal designs for estimating individual coefficients in polynomial regression—A functional approach. J. Stat. Plan. Inference 2004, 118, 201–219.
Ying, X. An Overview of Overfitting and its Solutions. J. Phys. Conf. Ser. 2019, 1168, 022022.
Ranstam, J.; Cook, J.A. LASSO regression. Br. J. Surg. 2018, 105, 1348.
Dong, X.B.; Yu, Z.W.; Cao, W.M.; Shi, Y.F.; Ma, Q.L. A survey on ensemble learning. Front. Comput. Sci. 2020, 14, 241–258.
Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140.
Kim, S.; Kim, S.; An, K.Y.A. An integrated multi-modeling framework to estimate potential rice and energy production under an agrivoltaic system. Comput. Electron. Agric. 2023, 213, 108157.
Sakamoto, T. Incorporating environmental variables into a MODIS-based crop yield estimation method for United States corn and soybeans through the use of a random forest regression algorithm. ISPRS J. Photogramm. Remote Sens. 2020, 160, 208–228.
Bolton, D.K.; Friedl, M.A. Forecasting crop yield using remotely sensed vegetation indices and crop phenology metrics. Agric. For. Meteorol. 2013, 173, 74–84.
Sharma, S.; Sharma, S.; Athaiya, A. Activation functions in neural networks. Towards Data Sci. 2017, 6, 310–316.
Abiodun, O.I.; Jantan, A.; Omolara, A.E.; Dada, K.V.; Mohamed, N.A.; Arshad, H. State-of-the-art in artificial neural network applications: A survey. Heliyon 2018, 4, e00938.
Apicella, A.; Donnarumma, F.; Isgrò, F.; Prevete, R. A survey on modern trainable activation functions. Neural Netw. 2021, 138, 14–32.
Sze, V.; Chen, Y.H.; Yang, T.J.; Emer, J.S. Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proc. IEEE 2017, 105, 2295–2329.
Werbos, P.J. Backpropagation through time: What it does and how to do it. Proc. IEEE 1990, 78, 1550–1560.
Jiang, H.; Hu, H.; Zhong, R.H.; Xu, J.F.; Xu, J.L.; Huang, J.F.; Wang, S.W.; Ying, Y.B.; Lin, T. A deep learning approach to conflating heterogeneous geospatial data for corn yield estimation: A case study of the US Corn Belt at the county level. Glob. Chang. Biol. 2020, 26, 1754–1766.
Kuwata, K.; Shibasaki, R. Estimating Crop Yields with Deep Learning and Remotely Sensed Data. In Proceedings of the 2015 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Milan, Italy, 26–31 July 2015; pp. 858–861.
Sun, J.; Lai, Z.L.; Di, L.P.; Sun, Z.H.; Tao, J.B.; Shen, Y.L. Multilevel Deep Learning Network for County-Level Corn Yield Estimation in the US Corn Belt. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 5048–5060.
García-Martínez, H.; Flores-Magdaleno, H.; Ascencio-Hernández, R.; Khalil-Gardezi, A.; Tijerina-Chávez, L.; Mancilla-Villa, O.R.; Vázquez-Peña, M.A. Corn Grain Yield Estimation from Vegetation Indices, Canopy Cover, Plant Density, and a Neural Network Using Multispectral and RGB Images Acquired with Unmanned Aerial Vehicles. Agriculture 2020, 10, 277.
Albelwi, S.; Mahmood, A. A Framework for Designing the Architectures of Deep Convolutional Neural Networks. Entropy 2017, 19, 242.
Krichen, M. Convolutional Neural Networks: A Survey. Computers 2023, 12, 151.
Li, Z.W.; Liu, F.; Yang, W.J.; Peng, S.H.; Zhou, J. A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 6999–7019.
Yamashita, R.; Nishio, M.; Do, R.K.G.; Togashi, K. Convolutional neural networks: An overview and application in radiology. Insights Imaging 2018, 9, 611–629.
Kim, Y.; On, Y.; So, J.; Kim, S.; Kim, S. A Decision Support Software Application for the Design of Agrophotovoltaic Systems in Republic of Korea. Sustainability 2023, 15, 8830.
Yang, W.; Nigon, T.; Hao, Z.Y.; Paiao, G.D.; Fernández, F.G.; Mulla, D.; Yang, C. Estimation of corn yield based on hyperspectral imagery and convolutional neural network. Comput. Electron. Agric. 2021, 184, 106092.
Khaki, S.; Pham, H.; Han, Y.; Kuhl, A.; Kent, W.; Wang, L.Z. DeepCorn: A semi-supervised deep learning method for high-throughput image-based corn kernel counting and yield estimation. Knowl. Based Syst. 2021, 218, 106874.
Horikawa, Y.; Imai, T.; Takada, R.; Watanabe, T.; Takabe, K.; Kobayashi, Y.; Sugiyama, J. Chemometric Analysis with Near-Infrared Spectroscopy for Chemically Pretreated Erianthus toward Efficient Bioethanol Production. Appl. Biochem. Biotechnol. 2012, 166, 711–721.
Magaña, C.; Núñez-Sánchez, N.; Fernández-Cabanás, V.M.; García, P.; Serrano, A.; Pérez-Marín, D.; Pemán, J.M.; Alcalde, E. Direct prediction of bioethanol yield in sugar beet pulp using Near Infrared Spectroscopy. Bioresour. Technol. 2011, 102, 9542–9549.
Watanabe, K.; Tachibana, S.; Konishi, M. Modeling growth and fermentation inhibition during bioethanol production using component profiles obtained by performing comprehensive targeted and non-targeted analyses. Bioresour. Technol. 2019, 281, 260–268.
Zhang, Q.Z.; Weng, C.; Huang, H.Q.; Achal, V.; Wang, D.C. Optimization of Bioethanol Production Using Whole Plant of Water Hyacinth as Substrate in Simultaneous Saccharification and Fermentation Process. Front. Microbiol. 2016, 6, 1411.
Shenbagamuthuraman, V.; Kasianantham, N. Microwave irradiation pretreated fermentation of bioethanol production from Chlorella vulgaris Biomasses: Comparative analysis of response surface methodology and artificial neural network techniques. Bioresour. Technol. 2023, 390, 129867.
Dave, N.; Varadavenkatesan, T.; Selvaraj, R.; Vinayagam, R. Modelling of fermentative bioethanol production from indigenous Ulva prolifera biomass by Saccharomyces cerevisiae NFCCI1248 using an integrated ANN-GA approach. Sci. Total Environ. 2021, 791, 148429.
Mondal, P.; Sadhukhan, A.K.; Ganguly, A.; Gupta, P. Optimization of process parameters for bio-enzymatic and enzymatic saccharification of waste broken rice for ethanol production using response surface methodology and artificial neural network-genetic algorithm. 3 Biotech 2021, 11, 28.
Concu, R.; Cordeiro, M.N.D.S.; Munteanu, C.R.; González-Díaz, H. PTML Model of Enzyme Subclasses for Mining the Proteome of Biofuel Producing Microorganisms. J. Proteome Res. 2019, 18, 2735–2746.
Konishi, M. Bioethanol production estimated from volatile compositions in hydrolysates of lignocellulosic biomass by deep learning. J. Biosci. Bioeng. 2020, 129, 723–729.
Fernández, M.C.; Pantano, M.N.; Rodriguez, L.; Scaglia, G. State estimation and nonlinear tracking control simulation approach. Application to a bioethanol production system. Bioprocess Biosyst. Eng. 2021, 44, 1755–1768.
Ostos-Garrido, F.J.; de Castro, A.I.; Torres-Sánchez, J.; Pistón, F.; Peña, J.M. High-Throughput Phenotyping of Bioethanol Potential in Cereals Using UAV-Based Multi-Spectral Imagery. Front. Plant Sci. 2019, 10, 948.
Kim, S.; Kim, S.; Cho, J.; Park, S.; Jarrín Perez, F.X.; Kiniry, J.R. Simulated biomass, climate change impacts, and nitrogen management to achieve switchgrass biofuel production at diverse sites in US. Agronomy 2020, 10, 503.
Chen, Y.; Ale, S.; Rajan, N. Spatial variability of biofuel production potential and hydrologic fluxes of land use change from cotton (Gossypium hirsutum L.) to Alamo switchgrass (Panicum virgatum L.) in the Texas High Plains. BioEnergy Res. 2016, 9, 1126–1141.
Hussain, J. Deep Learning Black Box Problem; University of Michigan Dearborn: Dearborn, MI, USA, 2019.
Eitel, F.; Schulz, M.A.; Seiler, M.; Walter, H.; Ritter, K. Promises and pitfalls of deep neural networks in neuroimaging-based psychiatric research. Exp. Neurol. 2021, 339, 113608.
Coşgun, A.; Günay, M.E.; Yıldırım, R. Machine learning for algal biofuels: A critical review and perspective for the future. Green Chem. 2023, 25, 3354–3373.

Figure 1. General process of crop yield estimation via regression-based model [44].

Figure 2. Overall procedure of random forest [50].

Figure 3. Multilayer perceptron for neural network.

Figure 4. Structure of the CNN [72].

Table 1. Data structure for corn yield regression modeling.

Variables	Description	Unit
Plant population	The number of plants per acre.	plants/acre
Planting progress	The weekly cumulative percentage of corn planted within each state.	%
Minimum air temperature	Daily minimum air temperature.	°C
Maximum air temperature	Daily maximum air temperature.	°C
Precipitation	Daily total precipitation.	mm
Shortwave radiation	The amount of incoming solar radiation.	W/m²
Water vapor pressure	The pressure exerted by water vapor in the atmosphere.	Pa
Snow water equivalent	The amount of water contained in the snowpack.	kg/m²
Day length	The duration of daylight each day.	s
Soil	180 soil feature variables considering soil organic matter, sand content, soil pH, soil bulk density, field capacity, and hydraulic conductivity.	N/A
Yield	Annual corn yield data.	bu/acre
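
The layout in Table 1 can be sketched in code as one record per location and year, which is then flattened into a feature vector for a regression model. The following is a minimal illustration only: the class name, field names, and the choice of summary statistics (seasonal means, total precipitation, final planting progress) are assumptions made here, not details taken from the reviewed studies, and the example values are synthetic.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class CornYieldRecord:
    """One location-year observation; all field names are hypothetical."""
    plant_population: float         # plants/acre
    planting_progress: List[float]  # weekly cumulative % planted
    tmin_c: List[float]             # daily minimum air temperature, °C
    tmax_c: List[float]             # daily maximum air temperature, °C
    precipitation_mm: List[float]   # daily total precipitation, mm
    shortwave_w_m2: List[float]     # incoming solar radiation, W/m^2
    vapor_pressure_pa: List[float]  # water vapor pressure, Pa
    snow_water_kg_m2: List[float]   # snow water equivalent, kg/m^2
    day_length_s: List[float]       # duration of daylight, s
    soil: List[float]               # 180 soil feature values
    yield_bu_acre: float            # regression target, bu/acre


def to_feature_vector(rec: CornYieldRecord) -> List[float]:
    """Collapse each series to one summary value (an assumed aggregation)."""
    return [
        rec.plant_population,
        rec.planting_progress[-1],   # final cumulative planting progress
        mean(rec.tmin_c),
        mean(rec.tmax_c),
        sum(rec.precipitation_mm),   # total precipitation over the period
        mean(rec.shortwave_w_m2),
        mean(rec.vapor_pressure_pa),
        mean(rec.snow_water_kg_m2),
        mean(rec.day_length_s),
        *rec.soil,
    ]


# Synthetic example values, for illustration only.
example = CornYieldRecord(
    plant_population=32000.0,
    planting_progress=[10.0, 45.0, 80.0, 100.0],
    tmin_c=[10.0, 12.0, 14.0],
    tmax_c=[22.0, 24.0, 26.0],
    precipitation_mm=[0.0, 5.0, 2.5],
    shortwave_w_m2=[300.0, 320.0, 310.0],
    vapor_pressure_pa=[1200.0, 1300.0, 1250.0],
    snow_water_kg_m2=[0.0, 0.0, 0.0],
    day_length_s=[50000.0, 50100.0, 50200.0],
    soil=[0.0] * 180,
    yield_bu_acre=175.0,
)
features = to_feature_vector(example)
```

With this aggregation, each record yields 9 summary features plus the 180 soil features, and `yield_bu_acre` serves as the target. A real pipeline would likely keep finer temporal resolution (for example weekly aggregates) rather than a single seasonal summary.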

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
