Review

A Review of Intelligent Modeling for Microalgae Systems: Integrating Data Mining, Machine Learning, and Hybrid Approaches

by Geovani R. Freitas 1,2,3,4,5, Sara Badenes 3, Rui Oliveira 4,5 and Fernando G. Martins 1,2,*

1 LEPABE, Laboratory for Process Engineering, Environment, Biotechnology and Energy, Chemical Engineering Department, Faculty of Engineering, University of Porto, Rua Dr. Roberto Frias, 4200-465 Porto, Portugal
2 ALiCE, Associate Laboratory in Chemical Engineering, Faculty of Engineering, University of Porto, Rua Dr. Roberto Frias, 4200-465 Porto, Portugal
3 A4F—Algae for Future, Campus do Lumiar, Estrada do Paço do Lumiar, Edif. E, R/C, 1649-038 Lisbon, Portugal
4 UCIBIO, Research Unit on Applied Molecular Biosciences, Department of Chemistry, NOVA School of Science and Technology, NOVA University Lisbon, 2829-516 Caparica, Portugal
5 Associate Laboratory i4HB, Institute for Health and Bioeconomy, NOVA School of Science and Technology, NOVA University Lisbon, 2829-516 Caparica, Portugal
* Author to whom correspondence should be addressed.
Processes 2025, 13(9), 2956; https://doi.org/10.3390/pr13092956
Submission received: 1 August 2025 / Revised: 10 September 2025 / Accepted: 12 September 2025 / Published: 17 September 2025

Abstract

Despite the extensive research work on microalgae systems over the last decades, there is still a poor understanding of critical cultivation factors that could boost microalgae production economics. Extensive and systematic analysis of microalgae pilot and industrial production data could bring new insights into mechanisms and operational strategies for enhancing microalgae production systems. Recently, various machine learning methods have been employed within data mining workflows to accurately model microalgae growth under various process conditions. This review article provides a comprehensive analysis of data mining and machine learning methods in microalgae systems, with a focus on the effective application of artificial neural networks and deep learning models. It also highlights the importance of data acquisition techniques and real-time data availability that could foster the development of robust machine learning models. In addition, this paper delves into the field of hybrid modeling, a distinct approach that integrates the prior knowledge of mechanistic models with the descriptive power and adaptability of data-driven models. This synergy offers a robust framework to enhance production strategies, addressing critical challenges in scalability and efficiency, eventually paving the way for more sustainable and economical microalgae production systems.

1. Introduction

Microalgae are microscopic aquatic organisms that play a crucial role in various ecosystems, often serving as the foundation of aquatic food chains. Their potential stems from their autotrophic nature, superior growth rate, and lower requirements for nutrients when compared to other plants. They can also grow in nutrient-rich waste effluents, capturing contaminants from these effluents [1,2,3]. Recent studies have highlighted their potential in different applications, including bioremediation [4,5], dietary regulation [6,7], carbon sequestration [8,9], biofuel production [10,11], and wastewater treatment [12,13], making them an environmentally sustainable alternative.
Common cultivation methods of microalgae involve open and closed systems. Open systems, such as thin-layer cascades and raceway ponds, are advantageous for their simplicity and cost-effectiveness but limited by contamination risks and environmental variability. As an alternative, closed systems, or photobioreactors, provide sterility, controlled conditions, and reduced evaporation; however, their scalability is hindered by higher costs and technical challenges such as light penetration and energy demand. Addressing these competing factors is critical to achieving efficient large-scale production [14,15,16].
With the increasing availability of data on microalgae systems, the use of data mining (DM) and machine learning (ML) has gained importance as key data-driven technologies. DM uncovers patterns and insights from large datasets, while ML develops algorithms that learn from data to make predictions [17]. Algorithms such as Random Forest (RF), artificial neural network (ANN), and support vector machine (SVM) have enhanced prediction efficiency and process optimization in this field. Despite these benefits, their effectiveness is constrained by limited data quality and quantity, as well as challenges of overfitting and domain shift across strains and cultivation systems [18,19].
High-quality datasets are crucial for the success of DM and ML, as they determine model accuracy and generalization. While traditional acquisition methods rely on manual sampling and lab analysis, recent advances, such as biosensors, microfluidic systems, and remote sensing technologies, enable real-time, comprehensive data collection. To further address limitations of small or noisy datasets, preprocessing techniques, such as data cleaning and data augmentation, in combination with more advanced techniques like ensemble learning reduce variance and model errors and improve predictive reliability across diverse applications [20,21,22].
Current knowledge gaps include the integration of heterogeneous data types, limited use of advanced learning strategies, and the absence of standardized benchmark datasets. To address such issues, hybrid modeling (HM) offers a promising approach that combines existing domain knowledge with ML techniques, thus providing a better modeling strategy [23]. Typically, HM integrates white-box models, which encode prior process knowledge, with black-box models such as ANN or SVM, within a single framework, where each component performs a specific role [24,25].
In summary, this review provides a comprehensive overview of the state of the art and future directions of DM workflows and ML methods in microalgae systems, highlighting their potential to optimize productivity, sustainability, and process efficiency. By critically examining the application of HM models, it also demonstrates how these models can address key challenges in large-scale cultivation. Overall, this work offers valuable guidance for researchers in this field by identifying current limitations, knowledge gaps, and promising strategies, thereby supporting the development of data-informed strategies in microalgae systems.

2. Methodology

In the present study, the authors performed a complete search in the Elsevier Scopus electronic database using the following keywords: (“Microalgae” OR “Microalgae system”) AND (“Data mining” OR “Machine learning” OR “Hybrid modeling”). In addition, the following terms were excluded: “Algae biorefinery”, “Mechanistic modeling”, and “Stochastic hybrid system”. The literature search was performed by applying keyword screening to the “title, abstract, and keyword” fields. This search was restricted to journal sources, with document types limited to articles and reviews, and the publication period set between 2015 and 2025. This process yielded 202 documents, with the most recent study conducted in January 2025.
Afterwards, the authors refined the set of documents by manually reviewing the abstracts and contents of eligible publications. As part of this process, a quality appraisal was conducted based on criteria such as clarity of research objectives, transparency of methodology, and relevance to different topics, especially in DM and ML methods. In addition, the authors included references that are crucial for explaining fundamental concepts related to the topics. This combination of targeted selection and inclusion of the fundamental literature provided a robust basis for this review, resulting in a compilation of 160 documents. Figure 1 illustrates the distribution of studies over the years, indicating a growing interest in these research topics since 2015.
The keyword analysis from the scientific papers found in the literature search of this study was carried out using the VOSviewer software tool version 1.6.20 [26]. This software allows the creation of network maps, where each keyword is displayed as a node and the connections between nodes represent co-occurrences of the corresponding keywords [27].
The network map generation was based on co-occurrence, with the counting method set to full counting and the unit of analysis defined as all keywords. A minimum occurrence threshold of 12 was applied out of a total of 2975 keywords identified. After data refinement, 47 keywords met the threshold and were included in the final network map, shown in Figure 2. This network map shows the distributions of all the keywords appearing in the selected publications, and the connections established around them. As can be observed, three clusters appear, each identified by a color. The red, green, and blue clusters revolve around the keywords “biomass”, “machine learning”, and “microalgae”, respectively. The green cluster is influenced by the large number of publications focused on the development of data-driven models, while the red and blue clusters focus on the analysis of processes involving microalgae for distinct applications: wastewater treatment and biofuel production, respectively. The content of some of these articles is discussed in the following sections.
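The co-occurrence counting behind such a network map can be illustrated with a minimal sketch. It assumes that a keyword's occurrence is the number of documents containing it and that each document adds a weight of 1 to the link between every pair of retained keywords it contains. The function name `cooccurrence_map` and the keyword lists are hypothetical; this is not VOSviewer's implementation, and clustering is not covered.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_map(keyword_lists, min_occurrences):
    """Return (retained keyword counts, link weights) for a co-occurrence map."""
    # Occurrence of a keyword = number of documents that contain it.
    occurrences = Counter()
    for keywords in keyword_lists:
        occurrences.update(set(keywords))
    kept = {k: n for k, n in occurrences.items() if n >= min_occurrences}

    # Each document adds 1 to the link between every pair of retained
    # keywords that appear together in it.
    links = Counter()
    for keywords in keyword_lists:
        present = sorted(k for k in set(keywords) if k in kept)
        for a, b in combinations(present, 2):
            links[(a, b)] += 1
    return kept, links

# Hypothetical keyword lists for four documents (not from the reviewed corpus).
docs = [
    ["microalgae", "machine learning", "biomass"],
    ["microalgae", "biomass", "wastewater treatment"],
    ["machine learning", "microalgae"],
    ["biofuel", "microalgae"],
]
kept, links = cooccurrence_map(docs, min_occurrences=2)
```

With a threshold of 2, only "microalgae", "machine learning", and "biomass" are retained in this toy corpus, mirroring how the threshold of 12 reduced 2975 keywords to 47.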

3. Overview of Data Mining and Machine Learning Methods

In the digital age, the exponential growth of data has reached unprecedented magnitudes, in the order of petabytes (10^15 bytes) or exabytes (10^18 bytes), transforming the way information is perceived and manipulated. DM has emerged as an essential tool for uncovering hidden patterns, valuable insights, and knowledge from big and complex datasets to address the issues derived from big data. Over the past decade, DM, in association with ML methods, has greatly evolved, adapting to the challenges posed by big data and the ever-expanding sources of information [28,29].
DM is the systematic process of exploring, analyzing, and extracting meaningful patterns or knowledge from data using computational and statistical methods. It plays a crucial role in various sectors, including business, healthcare, finance, marketing, and scientific research, by promoting data-driven decision-making and predictive modeling [20,30]. As a highly application-driven domain, DM has assimilated many techniques from other domains, including statistics, ML, pattern recognition, database systems, visualization, and various other application domains, as presented in Figure 3. Although the concepts of DM and ML are distinct, their fields are complementary, which contributes significantly to the success of DM and its applications [20].
ML primarily involves the development of algorithms that enable computers to learn from data, making predictions or decisions based on patterns and relationships found in that data. On the other hand, DM is focused on the discovery of hidden patterns, trends, and knowledge within datasets, often with a strong emphasis on descriptive and exploratory analysis. While both fields utilize statistical and computational techniques, ML leans toward predictive modeling and learning patterns from the data, whereas DM is focused on analyzing large databases, knowledge discovery, and understanding the intrinsic structure of data. Ultimately, the choice of which ML method will be embedded in an adequate DM workflow depends on the specific objectives of a data analysis project [19,31].
The DM process starts with the selection of target data from the raw material and proceeds with preprocessing and transforming it into an appropriate format. The workflow of a common DM application is presented in Figure 4 and typically comprises the following stages: (i) data preprocessing, (ii) use of DM tools, and (iii) data postprocessing. Various types of data are employed in data analysis, including database records, matrix data, documents, graphs, links, transaction data, DNA sequence data, whole-genome information, and spatiotemporal data [19].
In DM, tasks can be categorized into different types, each corresponding to specific analysis objectives, such as exploratory data analysis, descriptive tasks, and predictive tasks. Among these types, descriptive and predictive tasks are the main ones in the DM field. Descriptive tasks aim at finding patterns and associations that are interpretable by humans after examining the whole dataset and developing a model. On the other hand, predictive tasks seek to anticipate a specific response of interest. While there may be some overlap in goals, e.g., predictive tasks can reveal interesting patterns, the key distinction lies in the requirement for prediction tasks to include a designated response variable. This variable can be either categorical or numerical, which classifies predictive DM into classification and regression, respectively [30,31].
Combining DM and ML approaches, mathematical models are created, and model parameters are adjusted using data until the model closely aligns with the real data. Progress in algorithms, such as ANN and deep learning (DL), has significantly improved the performance of various ML methods. However, these data-driven methods rely on the availability of large volumes of labeled training data. For instance, extensive data from several years might be necessary for reliable crop detection. As more labeled data is gathered, such as through data-sharing practices, modeling, and simulations, the ML methods employed to predict crop yields continue to improve, leading to enhanced microalgae systems [19].
In general, ML methods can be divided into four major categories, supervised, unsupervised, semi-supervised, and reinforcement learning, as shown in Figure 5. The effectiveness and the efficiency of an ML technique are dependent on the inherent attributes of the data and the capabilities of the learning algorithms. Selecting an appropriate learning algorithm suitable to a specific domain is a challenging task, since various learning algorithms have different purposes, and even within the same category, the outcomes can differ based on the data features. Therefore, it is important to understand the principles of different ML algorithms and their suitability for several real-world applications [17].
Various ML algorithms have been employed in the literature for the prediction of microalgae growth rate and optimization of cultivation systems. In the next sections, we provide an overview of the most used ML methods from the above categories, including deep learning, a technique that has gained prominence due to its ability to analyze and learn from large-scale data through ANN with many layers. Furthermore, we briefly discuss the fundamentals behind each ML algorithm, with the scope of their applicability to microalgae-based datasets.

3.1. Application of Supervised Learning Methods Using Microalgae-Based Datasets

In supervised learning, a set of examples or a training dataset is provided, each accompanied by their correct outputs. The algorithm refines its responses through this training data by comparing its outputs to the provided correct outputs. Supervised learning is often referred to as learning via examples or learning from exemplars, i.e., a task-driven approach. Its practical applications extend to predicting outcomes based on historical data. There are two main types of supervised learning: classification, which involves discrete output labels, and regression, where the output is continuous in nature [17,32]. In classification, the main objective is to categorize data points into predefined classes or categories based on the features or attributes of the data. It comprises the creation of a predictive model that learns from labeled training data and then applies this model to classify new data [17].
Alternatively, in regression, the focus is on predicting numerical values or continuous outcomes. It aims to establish a mathematical relationship between independent variables (features) and a dependent variable (target) in the dataset. In this category, the goal is to find a line or curve that best fits the data points, minimizing the errors, i.e., the distances between the data points and the curve or line [32]. Several types of regression models exist, including multiple linear regression (MLR) and principal component regression (PCR). Regarding classification algorithms, some of the most used are k-nearest neighbor (kNN) and decision trees (DTs). However, some of these models can be adapted for both classification and regression by modifying their configurations, such as RF, SVM, and extreme gradient boosting (XGBoost). Understanding the strengths and limitations of each model is useful when selecting the right model for predicting the target variable considered in microalgae systems [17,33].
MLR is an extension of linear regression that involves two or more variables. It describes the relationship between the predictor variables and a single continuous target variable by fitting the data to a multidimensional surface [20]. Ota et al. [34] employed an MLR model to describe the relationship between the growth rate of the green microalga Chlorococcum littorale and environmental factors, such as light intensity and temperature. The coefficients obtained from the MLR model indicated that the growth rate was affected independently by these variables and not strongly affected by the interaction between them. Another study involving MLR models and microalgae was carried out by Laurens et al. [35]. They employed MLR models to establish relationships between biochemical components of microalgae biomass, more specifically lipid content, and the near-infrared (NIR) spectral data. They concluded that the MLR models were able to capture the linear relationships between the NIR spectral data and the known concentrations of lipids, offering valuable insights for biochemical characterization.
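As an illustration of how an MLR model is fitted, the following minimal sketch estimates an intercept and two coefficients by ordinary least squares, solving the normal equations with Gaussian elimination. The light intensity and temperature grid, the "true" coefficients, and the function names are hypothetical and chosen only so that the fit can be checked; they are not taken from the cited studies. Only main effects are included, without an interaction term.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x

def fit_mlr(features, targets):
    """Ordinary least squares via the normal equations (X^T X) beta = X^T y."""
    x = [[1.0] + list(row) for row in features]  # leading 1 gives the intercept
    p = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(x, targets)) for i in range(p)]
    return solve_linear_system(xtx, xty)

# Hypothetical design: light intensity and temperature on a small grid.
features = [(light, temp)
            for light in (50.0, 100.0, 150.0, 200.0)
            for temp in (20.0, 25.0, 30.0)]
# Hypothetical noise-free growth rates from an assumed linear relationship.
targets = [0.10 + 0.002 * light + 0.03 * temp for light, temp in features]

intercept, coef_light, coef_temp = fit_mlr(features, targets)
```

Because the synthetic targets are exactly linear, the fitted coefficients reproduce the assumed values up to rounding error; with experimental data, the residuals would indicate how well the linear surface describes the measurements.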
PCR represents a technique that combines the benefits of principal component analysis with linear regression. It serves as a robust approach for analyzing high-dimensional data, particularly when the number of observations is less than the number of predictors. PCR operates by generating a concise set of principal components, which are then employed in a regression model [36]. Karakach et al. [37] carried out online fluorescence measurements in a Scenedesmus sp. AMDD cultivation, and the acquired spectra for protein concentration were analyzed using PCR. Their results showed that the model was able to fit the spectral data to the protein concentration sufficiently, showing it to be a good estimator of this variable in future spectral data. Horton et al. [38] also employed PCR to predict the concentration of solid analytes, such as proteins and carbohydrates, from data obtained by Fourier Transform Infrared (FTIR) spectroscopy. They concluded that the PCR model improved the quantification of solid analytes in microalgae, being able to handle the noise present in the FTIR spectral data.
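The two stages of PCR (projection onto principal components, then linear regression on the resulting scores) can be sketched as follows. This is a simplified illustration that keeps a single principal component, obtained by power iteration on the covariance matrix. The three-channel "spectra" and the targets are hypothetical, and the function names are assumptions rather than part of any cited work.

```python
import math

def fit_pcr_one_component(rows, targets, iterations=100):
    """Regress the targets on the scores of the first principal component."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    # Power iteration converges to the leading eigenvector of the covariance
    # (assumes the start vector is not orthogonal to that eigenvector).
    v = [1.0] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    # Scores are the projections of the centered data onto the component.
    scores = [sum(r[j] * v[j] for j in range(p)) for r in centered]
    y_mean = sum(targets) / n
    slope = (sum(s * (y - y_mean) for s, y in zip(scores, targets))
             / sum(s * s for s in scores))
    return means, v, slope, y_mean

def predict_pcr(model, row):
    means, v, slope, y_mean = model
    score = sum((x - m) * c for x, m, c in zip(row, means, v))
    return y_mean + slope * score

# Hypothetical three-channel spectra driven by one latent factor t,
# with a target that is linear in t.
latent = (-2.0, -1.0, 0.0, 1.0, 2.0)
rows = [[10.0 + t, 20.0 + 2.0 * t, 30.0 + 2.0 * t] for t in latent]
targets = [1.0 + 3.0 * t for t in latent]

model = fit_pcr_one_component(rows, targets)
prediction = predict_pcr(model, [11.5, 23.0, 33.0])  # corresponds to t = 1.5
```

In practice, several components are usually retained, and their number is selected by cross-validation.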
kNN represents an “instance-based learning” or non-generalizing approach. Unlike traditional models, kNN does not construct an internal general model. Instead, it preserves all training data instances in an n-dimensional space. kNN uses these data points to classify new data based on similarity measures, typically employing the Euclidean distance function. This model has robustness against noisy data, with its accuracy closely tied to data quality [17,39]. The use of kNN for classification problems can be found in the study carried out by Reimann et al. [40]. They employed a variety of ML models, including kNN, DT, and RF, to classify dead and living microalgae based on bioimage data. They found that the kNN model performed effectively in distinguishing between dead and living microalgae. Other studies also reported using kNN models to evaluate classification accuracy for different purposes, such as the identification of microalgae species [41], the development of tools for noncoding RNA in microalgae [42], and the estimation of value classes for conversion and heat flow targets in the oxidative torrefaction of microalgae biomass [43].
As mentioned previously, some of the models can be used for either regression or classification. Although kNN models are more commonly applied to classification problems, some studies reported their use for regression purposes. Meenatchisundaram et al. [44] employed several ML models, including kNN, to optimize the biomass yield in microalgae-based wastewater treatment. Although kNN was not the best model in terms of performance metrics, its simplicity and effectiveness make it a reliable option for predicting biomass yield in a wastewater treatment application. Yew et al. [45] also applied kNN models to predict the growth properties of microalgae when cultivated in waste molasses. They concluded that kNN models could effectively estimate the microalgae biomass, nitrate, and pH value by setting the hyperparameter k to 4. At this value, the average normalized Root Mean Square Error (RMSE) was the lowest, and the predicted values were closest to the actual experimental data.
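The instance-based nature of kNN can be seen in a short sketch: no model is fitted, and a prediction is obtained by finding the k stored samples closest to the query in Euclidean distance, then taking a majority vote (classification) or an average (regression). The feature values, labels, and function names below are hypothetical.

```python
import math
from collections import Counter

def k_nearest(train_x, query, k):
    """Indices of the k training points closest to the query (Euclidean)."""
    dists = sorted((math.dist(x, query), i) for i, x in enumerate(train_x))
    return [i for _, i in dists[:k]]

def knn_classify(train_x, labels, query, k):
    """Majority vote among the k nearest neighbours."""
    votes = Counter(labels[i] for i in k_nearest(train_x, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train_x, values, query, k):
    """Average target value of the k nearest neighbours."""
    idx = k_nearest(train_x, query, k)
    return sum(values[i] for i in idx) / k

# Hypothetical two-feature image descriptors for living and dead cells.
cells = [(0.9, 0.8), (0.8, 0.9), (0.85, 0.7),
         (0.2, 0.1), (0.1, 0.3), (0.15, 0.2)]
states = ["living", "living", "living", "dead", "dead", "dead"]
predicted_state = knn_classify(cells, states, (0.75, 0.8), k=3)

# Hypothetical one-feature regression data.
days = [(0.0,), (1.0,), (2.0,), (3.0,)]
biomass = [1.0, 2.0, 3.0, 4.0]
predicted_biomass = knn_regress(days, biomass, (1.4,), k=2)
```

Since distances depend on feature scales, features are commonly normalized before applying kNN; that step is omitted here for brevity.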
DT is an analytical approach used for the categorization of a group of interests into multiple subgroups or for making predictions through the creation of decision rules represented in a tree-like structure. Typically, a DT analysis involves the following steps: first, an initial decision tree is generated by specifying a suitable split criterion and a stopping rule in alignment with the analysis’s objectives and data structure. In the second step, branches with a high potential for large classification errors or inappropriate rules are removed. In the next step, the decision tree’s validity is assessed using techniques such as cross-validation and test data evaluation. Finally, the decision tree is analyzed, leading to the establishment of a classification model [46,47]. This type of model can be used for either regression or classification. For regression purposes, Meenatchisundaram et al. [44] developed different ML models to optimize the microalgae biomass yield in wastewater treatment. Although the DT model presented high values of the coefficient of determination (R²), its Mean Absolute Error (MAE) values revealed unusual deviations from the prediction line.
Concerning the application of DT models for classification purposes, it is worth mentioning the studies carried out by Singh et al. [48,49]. In the former study [48], they employed various DT models to determine the effects and best combination of predictor variables resulting in high nitrogen and phosphorus removal efficiency and high biomass productivity. These combinations were tested afterwards on new microalgae-based datasets, achieving nearly 80% accuracy. In the latter study [49], they used DT models to predict different descriptor variables that increase growth rate and microalgae biomass production. The best DT model presented 18 combinations of descriptor variables and was able to achieve a general accuracy of around 81%. Coşgun et al. [50] also employed a DT model to resolve the best combination of variables resulting in high biomass and lipid content. They concluded that the most significant operational variables for high biomass productivity were photoperiod, CO2 content, light intensity, and feed gas flow, while for lipid content, pH was the most important variable, followed by light intensity and NaCl amount in the growth medium.
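The first step of a DT analysis, choosing a split according to a split criterion, can be sketched for a single numerical feature. The example below assumes the Gini impurity as the criterion, which is one common choice; the cited studies may have used other criteria. The photoperiod values and productivity classes are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0 for a pure node)."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, weighted Gini) of the best binary split on one feature."""
    best = (None, gini(labels))  # no split: impurity of the parent node
    ordered = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(ordered, ordered[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical photoperiods (h) and biomass productivity classes.
photoperiod = [8, 10, 12, 14, 16, 18]
productivity = ["low", "low", "low", "high", "high", "high"]
threshold, impurity = best_split(photoperiod, productivity)
```

A full tree is obtained by applying the same search recursively to each resulting subgroup until a stopping rule is met, followed by pruning and validation as described above.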
RF is an ensemble learning method that is widely used for both classification and regression tasks. This technique combines multiple decision trees to create a “forest”, where each tree is trained on a random subset of the data and features. The final prediction is made by aggregating the results from all individual trees, often through majority voting for classification or averaging for regression. This approach helps improve the model’s accuracy and robustness by reducing overfitting and variance compared to a single decision tree [17,20]. Studies were reported using the RF model for classification purposes. Xu et al. [51] used a spectral imager device to acquire spectral images of microalgae, which were analyzed afterwards with ML algorithms. The RF model obtained was able to predict the growth stage of microalgae with an accuracy of 98.1%. Reimann et al. [40] also employed an RF model to evaluate image data from microalgae suspensions’ cultures. They concluded that this model proved to be the most effective one among the ML models considered, being able to distinguish between dead and living microalgae with the highest accuracy and performance scores.
The application of RF models to regression tasks is also reported in the literature. Cheng et al. [52] employed various ML models, including RF, to predict the mass yields of multiple products and characteristics from hydrothermal treatment of multiple feedstocks. Results showed that the RF models had better performance metrics and were able to outperform other models, such as regression trees and multiple linear regression. Lopez-Exposito et al. [53] also employed an RF model to analyze the floc length and geometric shape during microalgae growth. A dataset comprising lengths generated by computer software that created virtual flocs through focused reflection operations was used for training the model. The optimized model demonstrated great performance metrics in actual tests, and it could be quickly adapted to the floc structure based on actual requirements. Similarly, Lopez-Exposito et al. [54] employed an RF model to estimate the biomass concentration of the microalga Chlorella sorokiniana. After a systematic optimization of its main hyperparameters, the authors obtained an RF model able to predict the concentration of the microalgae cultures with good performance metrics.
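The ensemble idea behind RF regression (bootstrap sampling, random feature selection, and averaging) can be illustrated with a deliberately simplified sketch. Two simplifications are assumed: each "tree" is a single-split stump rather than a full tree, and one random feature is drawn per tree rather than per split. The data and function names are hypothetical.

```python
import random

def fit_stump(rows, targets, feature):
    """Best single split on one feature by sum of squared errors (SSE)."""
    mean_all = sum(targets) / len(targets)
    best = (float("inf"), None, mean_all, mean_all)
    values = sorted({r[feature] for r in rows})
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for r, y in zip(rows, targets) if r[feature] <= t]
        right = [y for r, y in zip(rows, targets) if r[feature] > t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if sse < best[0]:
            best = (sse, t, ml, mr)
    _, threshold, left_mean, right_mean = best
    return feature, threshold, left_mean, right_mean

def fit_forest(rows, targets, n_trees, rng):
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        feature = rng.randrange(len(rows[0]))           # random feature
        forest.append(fit_stump([rows[i] for i in idx],
                                [targets[i] for i in idx], feature))
    return forest

def predict_forest(forest, row):
    """Average the predictions of all stumps (regression aggregation)."""
    preds = []
    for feature, threshold, left_mean, right_mean in forest:
        # threshold is None when the bootstrap sample allowed no split.
        if threshold is None or row[feature] <= threshold:
            preds.append(left_mean)
        else:
            preds.append(right_mean)
    return sum(preds) / len(preds)

# Hypothetical data: two correlated features and a step-like target.
rows = [(float(x), 2.0 * x) for x in range(10)]
targets = [0.0 if x < 5 else 10.0 for x in range(10)]

forest = fit_forest(rows, targets, n_trees=25, rng=random.Random(0))
low = predict_forest(forest, (1.0, 2.0))
high = predict_forest(forest, (9.0, 18.0))
```

For classification, the averaging step would be replaced by majority voting over the individual trees.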
SVM is another supervised learning method that can be used for either regression or classification tasks. SVM constructs a hyperplane that best separates the data with the maximum margin, making it highly effective in high-dimensional space. The versatility of this method is due to the use of different kernel functions, which allows it to handle non-linear data [20,55]. The use of SVM for classification purposes can be found in the study performed by Harmon et al. [56]. They used fluorescence imaging technology to capture images of microalgae; then they employed SVM models to quantify their morphological characteristics. The accuracy of the models reached 99.8%, which exceeded the accuracy of other ML models, such as the ANN. Chong et al. [41] also employed SVM models to analyze the images from microalgae datasets. They evaluated the classification performance based on the types of image preprocessing techniques implemented on both morphological and texture feature extraction, obtaining accuracies of 97.63% when leveraging the SVM models.
Studies were also performed describing the application of SVM models for regression purposes, which in this case can be referred to as support vector regression (SVR). Chong et al. [57] explored the use of an SVR model to predict the blue pigment content in the microalga Arthrospira platensis. The model was found to provide good predictions of this pigment when including extra parameters and was able to outperform models based on ANNs. Another study using the SVR model for regression tasks was performed by Yeh et al. [58]. They improved the performance metrics of microalgae growth predictions in outdoor cultivation in photobioreactors (PBRs). The study found that SVR models, when combined with historical light data, significantly enhanced the performance metrics of growth rate predictions compared to traditional modeling approaches. SVM was particularly effective in capturing the complex relationships between fluctuating light conditions and microalgae growth, demonstrating its robustness in handling non-linear and multidimensional data.
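To make the maximum-margin idea concrete, the following sketch trains a linear soft-margin SVM classifier by batch subgradient descent on the regularized hinge loss. This is only one of several possible training procedures and is not the solver used in the cited studies; kernel functions and SVR are not covered. The regularization strength `lam`, learning rate `lr`, and the toy data are assumptions.

```python
def train_linear_svm(points, labels, lam=0.01, lr=0.1, epochs=200):
    """Minimise lam/2 * ||w||^2 + mean hinge loss by batch subgradient descent."""
    dim = len(points[0])
    n = len(points)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regularisation term
        grad_b = 0.0
        for x, y in zip(points, labels):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1.0:  # point is inside the margin or misclassified
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def svm_predict(w, b, x):
    """Class label (+1 or -1) from the side of the hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Hypothetical, linearly separable two-feature data with labels +1 / -1.
points = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0),
          (-2.0, -2.0), (-3.0, -3.0), (-2.0, -3.0)]
labels = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(points, labels)
predictions = [svm_predict(w, b, x) for x in points]
```

Only points with a margin below 1 contribute to the update, which reflects the fact that the separating hyperplane is determined by the samples closest to it.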
XGBoost is another ensemble learning model designed for regression and classification tasks. Its iterative process evaluates model performance using an objective function, balancing underfitting and overfitting through dataset validation. Similar to RF, it relies on an ensemble of decision trees, but it improves gradient boosting by incorporating second-order gradients to minimize the loss function [17,59]. Studies reported using XGBoost models for classification purposes. Magalhães et al. [60] presented a novel fluorometric device integrated with ML to classify microalgae based on their pigment composition. Among the ML models, XGBoost demonstrated the best performance, achieving 97% accuracy for the test dataset with a weighted average of 97% and 98% for recall and precision, respectively, at the phylum level. Colkesen et al. [61] also employed XGBoost models for classification tasks. In the study, the authors aimed to assess the effectiveness of various methods in detecting dense floating algal blooms in Lake Burdur using Sentinel-2 satellite imagery. They found that XGBoost demonstrated accuracies between 94% and 98%, comparable to RF models, which achieved the best accuracy among the models evaluated.
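The role of second-order gradients can be illustrated with the closed-form expressions from the XGBoost formulation: for a leaf with summed gradients G and summed Hessians H, the optimal weight is -G / (H + lambda), and a candidate split is scored by its gain. The sketch below assumes the squared-error loss 0.5 * (y - y_hat)^2 and uses hypothetical targets; `lam` and `gamma` denote the L2 and split penalties.

```python
def squared_error_derivatives(targets, predictions):
    """Gradient and Hessian of 0.5 * (y - y_hat)^2 with respect to y_hat."""
    gradients = [p - y for y, p in zip(targets, predictions)]
    hessians = [1.0] * len(targets)
    return gradients, hessians

def leaf_weight(gradients, hessians, lam):
    """Optimal leaf weight: -G / (H + lambda)."""
    return -sum(gradients) / (sum(hessians) + lam)

def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    """Loss reduction obtained by splitting a node into left and right leaves."""
    def score(g, h):
        return sum(g) ** 2 / (sum(h) + lam)
    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma

# Hypothetical residual fit: current predictions are all zero.
g, h = squared_error_derivatives([1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
weight_unregularised = leaf_weight(g, h, lam=0.0)  # mean of the targets
weight_regularised = leaf_weight(g, h, lam=1.0)    # shrunk towards zero

g2, h2 = squared_error_derivatives([0.0, 0.0, 10.0, 10.0], [0.0] * 4)
gain = split_gain(g2[:2], h2[:2], g2[2:], h2[2:], lam=0.0, gamma=0.0)
```

The regularized weight (1.5) is smaller than the unregularized one (2.0), showing how the penalty term limits the contribution of each tree.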
The application of XGBoost for regression purposes is also reported in the literature. Fu et al. [62] employed ML methods to predict the harvesting efficiency (HE) of microalgae with magnetic nanoparticles (MNPs). Their objective was to develop a robust predictive model that could estimate HE based on microalgae properties, MNP characteristics, and flocculation conditions, reducing the need for labor-intensive experiments. Among the eight ML models tested, XGBoost presented the best predictive performance, achieving an R² of 0.932, RMSE of 6.96%, and MAE of 4.17%. The model was further validated through batch experiments, confirming its accuracy in predicting HE. Another study using the XGBoost model for regression purposes was performed by Sundaram et al. [63]. They aimed to develop a time series model to predict the growth curve of microalgal biomass under environmental conditions like those found in wastewater. The results showed lower test performance (R² of 0.3) compared to training performance. The decline in prediction was attributed to an increased deviation of data points from the perfect prediction line as the forecasting range expanded. They concluded that while XGBoost can be applied for biomass forecasting, it may require further optimization to improve its generalization in long-term predictions.
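Since R², RMSE, and MAE recur throughout the studies discussed here, a minimal sketch of their standard definitions may be helpful. The observed and predicted values are hypothetical and serve only to check the arithmetic.

```python
import math

def mae(observed, predicted):
    """Mean Absolute Error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

def rmse(observed, predicted):
    """Root Mean Square Error."""
    return math.sqrt(
        sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    )

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical observations and model predictions.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]

scores = {
    "MAE": mae(observed, predicted),
    "RMSE": rmse(observed, predicted),
    "R2": r_squared(observed, predicted),
}
```

Here a single error of 1.0 gives an MAE of 0.25, an RMSE of 0.5, and an R² of 0.8. RMSE weights large errors more heavily than MAE, so a model can combine a high R² with individual predictions that deviate noticeably from the perfect prediction line.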
A summary of all the supervised learning methods for classification and regression tasks using microalgae-based datasets is found in Table 1. The performance of each ML model is influenced by several factors, such as operating conditions and specific use cases. Consequently, comparing performance is challenging because these factors vary across different models [64]. In addition, given the importance of the ANN, from which the concept of DL originated, the authors decided to highlight its applications within the supervised learning scheme in a separate section.

3.2. Application of Supervised Learning with Artificial Neural Network Models

ANNs consist of interconnected nodes that process and transmit information in layers, similar to the structure and function of the human brain. These networks are particularly robust in modeling complex, non-linear relationships between inputs and outputs, making them suitable for a wide range of predictive tasks in fields like bioinformatics and image processing. An ANN is classified as DL when it consists of multiple layers, typically three or more, between the input and output layers. These additional layers, known as hidden layers, allow the network to learn and model patterns and representations in large and complex datasets. DL essentially enables the computer to build complex concepts out of simpler concepts, for instance, the representation of an image of a person by combining basic elements, such as corners, edges, and contours. Among the most used DL structures are the multilayer perceptron (MLP), convolutional neural network (CNN), and Long Short-Term Memory (LSTM) [20,65].
One of the most common DL algorithms is the MLP, which is a special case of a feedforward ANN. A standard MLP includes multiple layers of nodes: an input layer, one or more hidden layers, and an output layer. Each node in a layer is connected to every node in the next layer, and through a process known as “backpropagation”, the network adjusts the weights of the connections between nodes to minimize prediction errors. MLP algorithms are widely used for both regression and classification tasks, where they have demonstrated significant success due to their flexibility and capacity to model patterns in data [65,66,67]. Susanna et al. [68] proposed an MLP model to predict the growth of Arthrospira platensis (Spirulina) in a PBR. Their results showed that the model could predict growth up to three days in advance, achieving values of R2 greater than 0.94. Concerning the application of the MLP model for classification purposes, it is worth mentioning the study carried out by Bricaud et al. [69]. They used MLP models to estimate phytoplankton pigment concentrations and size structure from absorption spectra. They obtained average relative errors between 27% and 51% for accessory pigments, and between 19% and 33% for three size classes.
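The weight adjustment by backpropagation can be illustrated with a minimal sketch: a one-hidden-layer MLP written in plain Python and trained by per-sample gradient descent. The toy inputs (two normalized process variables) and targets are hypothetical, and the network size, learning rate, and number of epochs are arbitrary assumptions rather than settings from the cited studies.

```python
import math
import random

def train_mlp(samples, hidden=3, lr=0.1, epochs=500, seed=0):
    """Train a one-hidden-layer MLP (tanh hidden units, linear output).

    Returns (predict_function, list of mean squared errors per epoch)."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0

    def forward(x):
        h = [math.tanh(sum(w * xi for w, xi in zip(w1[j], x)) + b1[j])
             for j in range(hidden)]
        y = sum(w2[j] * h[j] for j in range(hidden)) + b2
        return h, y

    def mse():
        return sum((forward(x)[1] - t) ** 2 for x, t in samples) / len(samples)

    history = [mse()]
    for _ in range(epochs):
        for x, t in samples:
            h, y = forward(x)
            err = y - t                       # dLoss/dy for loss = 0.5 * err**2
            for j in range(hidden):
                # Error propagated back through the tanh unit j.
                grad_h = err * w2[j] * (1.0 - h[j] ** 2)
                w2[j] -= lr * err * h[j]
                for i in range(n_in):
                    w1[j][i] -= lr * grad_h * x[i]
                b1[j] -= lr * grad_h
            b2 -= lr * err
        history.append(mse())
    return (lambda x: forward(x)[1]), history

# Hypothetical normalized inputs (e.g., light, temperature) and growth targets.
samples = [([0.0, 0.0], 0.0), ([0.0, 1.0], 0.3),
           ([1.0, 0.0], 0.6), ([1.0, 1.0], 1.0)]
predict, history = train_mlp(samples)
```

The training error stored in `history` should fall from its initial value as the weights are updated; practical applications would rely on an established DL library rather than hand-written loops.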
Another type of DL is the CNN. This model is specifically designed to process grid-like data structures, such as images. Unlike traditional ANN models, CNN models leverage convolutional layers to efficiently extract spatial features from input data, such as edges, textures, and shapes. These layers apply a series of filters, or kernels, that move across the input data, creating feature maps that capture important patterns while maintaining the spatial relationships between pixels. They also employ pooling layers to reduce the dimensionality of feature maps, which helps in controlling overfitting and reducing the number of parameters and computational costs. Typical CNN structures stack some convolutional layers, followed by a pooling layer, then repeat the sequence with additional convolutional layers and pooling layers. At the top of the structure, a standard feedforward ANN with a few fully connected layers is added, culminating in the final layer that generates the prediction [22,65,70].
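The two core CNN operations described above, filtering and pooling, can be sketched in a few lines of plain Python. The image and the edge-detecting kernel are hypothetical toy values; as in most DL libraries, the "convolution" is implemented as a sliding dot product without flipping the kernel.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [[max(feature_map[r * size + i][c * size + j]
                 for i in range(size) for j in range(size))
             for c in range(cols)]
            for r in range(rows)]

# Toy 5x5 "image" with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical 2x2 filter that responds where intensity rises from left to right.
kernel = [[-1, 1],
          [-1, 1]]
features = conv2d_valid(image, kernel)   # 4x4 feature map
pooled = max_pool2d(features)            # 2x2 map after pooling
```

The feature map is non-zero only where the edge lies, and pooling halves each spatial dimension while keeping that response.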
The application of CNN models for classification and regression is reported in the literature, although it is more common to find studies employing CNN models for classification purposes, especially for image-based classification tasks. Chong et al. [71] provided an overview of the state of the art of microalgae identification techniques and ML methods used in image analysis. They performed several preprocessing steps like resizing, gray-scaling, denoising, and feature extraction, and applied various ML models, such as an ANN, CNN, and SVM. Their results revealed that the CNN is predominantly employed to classify microalgae species based on digital images. They also compared different works [72,73,74,75,76,77,78,79,80] for the identification of microalgae species using ML methods, especially CNNs. In contrast to all these previous works, where the images were obtained in water solution, D’Orazio et al. [81] trained CNN models with images of microalgae on building facades. Their CNN models were characterized by accuracies of 87%, showing their ability to recognize microalgae and cyanobacteria on the brick’s surfaces.
Other studies have also used CNN models to classify images of microalgae species. Sonmez et al. [82] treated light microscopy and scanning electron microscopy images of microalgae to classify them using a CNN model. They obtained a classification accuracy of 99% and concluded that microalgae identification using optical microscopes outperformed electron microscopy techniques. Carloto et al. [83] also employed a CNN model to detect morphological changes in the microalga Planktothrix agardhii before and after chemical stress caused by the addition of hydrogen peroxide. After testing different image segmentation methods, optimizers, and network architectures, they obtained a median accuracy of 93%. For regression purposes, Nguyen et al. [84] employed a CNN model to estimate the density of the microalga Chlorella vulgaris. They proposed a CNN regression architecture that accepted the color image as input and returned the density as output, achieving an R2 of 0.99.
CNN models were also used in modeling complex spatial relationships to enhance the efficiency of PBR design and optimization. In the study performed by del Rio-Chanona et al. [85], the authors proposed a novel framework aimed at optimizing both the configuration and operating conditions of a pilot-scale PBR used to produce a sustainable biofuel by the microalga Chlamydomonas reinhardtii. To simulate the complex biosystem’s behavior, they built an integrated kinetic–computational fluid dynamics model, generating a sufficient dataset for training purposes. A CNN model was then developed using a state-of-the-art structure selection method to capture the system’s complexity. Their results revealed that hydrodynamic and biochemical mechanisms notably influence the optimal configurations, particularly when shifting the objective from biomass cultivation to biofuel production. Further information about the contemporary application of ML methods to enhance the efficiency and sustainability of biofuel production from microalgae can be found in the review performed by Yang et al. [86] and Omole et al. [87].
LSTM, classified as a form of recurrent ANN, is particularly adept at handling time series or sequential data owing to its inherent capacity to preserve information across extended temporal spans. The basic structure of the LSTM encompasses distinct gates: the input gate, the forget gate, and the output gate. The input gate governs the incorporation of novel information into the memory, the forget gate determines the retention or removal of historical data, and the output gate controls the selection of relevant information for generating the output. In summary, LSTM can learn to identify an important input, store it in the long-term memory, maintain it for as long as necessary, and retrieve it whenever it is needed. LSTM networks are widely used for applications in natural language processing, such as machine translation, sentiment analysis, and automatic summarization [65,70].
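The role of the three gates can be made concrete with a single-unit LSTM step written in plain Python. This is a sketch of the standard LSTM equations with scalar weights; the parameter values are hypothetical and chosen only to show how a nearly open forget gate and a nearly closed input gate preserve the stored memory.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a scalar LSTM cell.

    `p` maps each gate name ("i", "f", "o", "g") to a tuple
    (input weight, recurrent weight, bias)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("i"))        # input gate: how much new information to write
    f = sigmoid(pre("f"))        # forget gate: how much old memory to keep
    o = sigmoid(pre("o"))        # output gate: how much memory to expose
    g = math.tanh(pre("g"))      # candidate memory content
    c = f * c_prev + i * g       # updated cell (long-term) state
    h = o * math.tanh(c)         # updated hidden state / output
    return h, c

# Hypothetical parameters: large positive forget bias, large negative input bias.
params = {"i": (0.0, 0.0, -20.0), "f": (0.0, 0.0, 20.0),
          "o": (0.0, 0.0, 0.0), "g": (1.0, 0.0, 0.0)}
h, c = 0.0, 0.8
for x in [0.5, -0.3, 0.9]:       # arbitrary input sequence
    h, c = lstm_step(x, h, c, params)
```

With these settings the cell state stays essentially at its initial value of 0.8 regardless of the inputs, which is the mechanism that lets a trained LSTM carry information across many time steps.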
The use of LSTM for classification purposes can be found in the study performed by Colkesen et al. [61]. They employed an LSTM model to conduct classification-based microalgae detection. Although this model was outperformed by ensemble learning models, such as RF and XGBoost, the differences in accuracy between these models were small, reaching a maximum value of 12%. Although some articles employ LSTM for classification tasks, its use is more widespread for regression purposes. Rodríguez-Rangel et al. [88] developed an LSTM model to forecast microalgae biomass production in wastewater treatment systems. They also worked with other ML models. Of these, only the DL models, i.e., LSTM and CNN, showed improved predictions in estimating the accumulation of carbohydrates present in the microalgae cultures. Syed et al. [89] also employed an LSTM model to predict the growth of the microalga Nannochloropsis sp. in a vertical PBR and compared the results with those obtained by an SVR model. They found that the LSTM model presented a better performance due to its ability to capture temporal dependencies through its memory cells, making it more suitable for handling time series data.
Table 2 outlines various studies that employed DL models for supervised learning on microalgae-based datasets. One can notice that some models were selected more for regression purposes, for instance, the MLP and LSTM, while others were used more when classification tasks were involved, such as the CNN. In the end, the choice will depend on the nature of the problem, i.e., whether the target variable is numerical or categorical. The ability of these models to handle both regression and classification tasks simply by adjusting their configurations makes them highly suitable for different scenarios.

3.3. Application of Other Learning Methods Using Microalgae-Based Datasets

In addition to supervised learning, ML includes other learning categories, such as unsupervised, semi-supervised, and reinforcement learning. Unlike supervised learning, which has been extensively developed and researched, these categories are typically used for more specialized tasks and are less commonly applied in typical data analysis. Consequently, they are grouped together into a single section to provide a broader overview of these alternative approaches without requiring the same level of detailed exploration as supervised learning.
The unsupervised learning (UL) approach revolves around the identification of latent patterns within data, with the aim of deriving rules based on these patterns, i.e., a data-driven process. This technique is particularly useful when the categories of the data are unknown, and in this context, the training data lacks labels. UL is characterized as a statistical method for learning, focusing on the task of finding hidden patterns or structures within the unlabeled data. This approach is particularly useful for dimensionality reduction, association rule learning, and clustering [17,32].
In the context of microalgae-based datasets, dimensionality reduction techniques can simplify complex environmental datasets by reducing the number of variables or features in the data while retaining as much information as possible. Principal component analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) are the main algorithms for dimensionality reduction. In PCA, the data is transformed into a set of orthogonal components, capturing the maximum variance. It is particularly useful to identify the most influential factors in microalgae growth [90], to differentiate the properties of various species [91], and to assess the metabolome similarity among different microalgae species [92]. Concerning the t-SNE technique, it is a non-linear method that excels at visualizing high-dimensional data by projecting it into lower dimensions, typically two dimensions or three dimensions; it has applications in different fields, such as genome feature extraction [93] and characterization of images [94].
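As an illustration of how PCA finds the direction of maximum variance, the sketch below estimates the first principal component by power iteration on the sample covariance matrix. It uses plain Python and hypothetical two-variable data; a numerical library would normally be used for a full decomposition, and the method assumes the starting vector is not orthogonal to the leading component.

```python
def first_principal_component(rows, iterations=200):
    """Return (unit direction, explained variance) of the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d                                  # arbitrary starting direction
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    variance = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                   for a in range(d))
    return v, variance

# Hypothetical measurements in which the second variable is twice the first.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
v, variance = first_principal_component(data)
```

Because the two variables are perfectly correlated here, a single component along the direction (1, 2) captures all of the variance.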
Association rule learning focuses on discovering interesting relationships between variables in large datasets [17]. In the literature, association rule learning algorithms were applied to microalgae databases to find, for instance, the specific conditions leading to high biomass and lipid levels [50] and the specific parameters for enhancing microalgae growth in wastewater [95]. Clustering, in turn, is a technique for categorizing objects into distinct groups, partitioning the dataset into subsets or clusters based on shared characteristics, often determined by a defined distance measure. Once trained, the algorithm can assign new, unseen data points to one of the resulting clusters [33,96]. Among the available algorithms, Hierarchical Agglomerative Clustering (HAC) and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) are usually applied in microalgae research. Sánchez et al. [97] employed HAC to distinguish different types of diatoms from different aquatic environments, while Pozzobon et al. [98] applied the DBSCAN algorithm to analyze the data obtained by flow cytometry, using it to distinguish between active and non-viable cells.
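The density-based grouping performed by DBSCAN can be sketched as follows. This is a minimal plain-Python version of the algorithm, not the implementation used in the cited work; the points and the `eps` and `min_pts` settings are hypothetical.

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN. Returns one label per point: 0, 1, ... for clusters, -1 for noise."""
    def neighbors(i):
        # Indices of all points within distance eps of point i (including i itself).
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    labels = [None] * len(points)        # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1               # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster      # noise reached from a core point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:     # j is a core point, so keep expanding
                queue.extend(nbrs)
    return labels

# Hypothetical 2D measurements: two dense groups and one isolated point.
points = [(1.0, 1.0), (1.1, 1.0), (1.0, 1.1), (1.1, 1.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (9.0, 1.0)]
labels = dbscan(points, eps=0.3, min_pts=3)
```

Unlike HAC, the number of clusters is not specified in advance, and isolated measurements are flagged as noise instead of being forced into a group.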
Table 2. Comparison of different ANN models within the category of supervised learning methods and their performances when using microalgae-based datasets.
ML Model | Task | Microalgae Classes/Species | Dataset/Modalities | Target(s) | Performance Metrics/Outputs | Reference
ANN | Classification | Chlorella vulgaris | Imaging data and morphological properties | Living and dead microalgae cell category | AUC of 84.8%, accuracy of 75.9%, precision of 76.8%, and recall of 82.6% | [40]
ANN-GA | Regression | Scenedesmus sp., Chlorella sp. | Tabular data: pH, retention times, phosphate, nitrate, and nitrite concentrations | Biomass yield | R2 of 0.98, RMSE of 0.056, and MAE of 0.04 | [44]
MLP | Classification | Phytoplankton | Spectral data: absorption spectra | Pigment composition and size structure | Average relative errors between 27% and 51% for pigments and between 19% and 33% for three size classes | [69]
MLP | Regression | Scenedesmus sp., Chlorella sp. | Tabular data: pH, retention times, phosphate, nitrate, and nitrite concentrations | Biomass yield | R2 of 0.96 | [44]
MLP | Regression | Arthrospira platensis | Tabular data: pH, culture temperature, and light intensity | Biomass concentration of the next day | R2 > 0.94 | [68]
CNN | Classification | Cyanobacteria, Chlorophyta | Imaging data: light microscopy and scanning electron microscopy images | Recognition of microalgal species | Highest accuracy was 99% | [82]
CNN | Classification | Planktothrix agardhii | Imaging data | Detection of morphological changes | Highest median accuracy was 93.33% | [83]
CNN | Classification | C. vulgaris, C. reinhardtii, A. platensis | Morphological and texture descriptors | Microalgae species designation | Accuracy of 97.86%, precision of 97.87%, recall of 94.44%, and F1-score of 96.07% | [41]
CNN | Regression | Chlorella vulgaris | Imaging data | Microalgae density | R2 of 0.9997 | [84]
LSTM | Classification | Not Available | Time series data and imaging data | Mapping dense floating blooms | OA was between 95% and 99% | [61]
LSTM | Regression | Cyanobacteria | Tabular data: properties of microalgae, influent, and effluent | Biomass production | R2 of 0.8646 and RMSE of 0.06 | [88]
LSTM | Regression | Nannochloropsis sp. | Time series data: biomass concentration, pH, and temperature | Microalgae growth | R2 of 0.91 and RMSE of 0.061 | [89]
LSTM | Regression | Phaeodactylum tricornutum | Time series data: biomass concentration, incident light intensity, and light history | Specific growth rate | R2 of 0.91 and RMSE of 0.0276 | [58]
Semi-supervised learning (SSL) is an ML technique that combines both labeled and unlabeled data during the training process. This method is particularly useful when acquiring labeled data is costly or time-consuming, and/or the labeled dataset is small, but large amounts of unlabeled data are readily available. By using the limited labeled data to guide the learning process, SSL can improve model performance, especially when compared to UL methods [17,55]. In the context of microalgae-based datasets, SSL has been applied to process and label hyperspectral images, where obtaining fully labeled data is challenging. Manian et al. [99] presented a novel methodology for classifying hyperspectral images, specifically focused on identifying harmful algal blooms (HAB). They used an ensemble SSL approach that involved image preprocessing, feature extraction, and clustering, followed by supervised classification with different algorithms, such as SVM and gradient boosting. They achieved high performance in the identification of HAB and surface scum in Lake Erie images.
Another study that explored an innovative approach to classifying microalgae using SSL techniques was carried out by Drews-Jr et al. [100]. Their approach was based on SSL and active learning (AL), using the Gaussian mixture model and expectation-maximization algorithms to model the distribution and the class of each data point, respectively. In the study, AL was used to iteratively select the most informative data points for labeling, optimizing the training process. Concerning the performance of the proposed method, they used two metrics, maxF1 and accuracy, and the method showed favorable outcomes for both, achieving approximately 92% accuracy. For AL, three evaluation metrics were presented, with entropy-based sampling showing a slight advantage, indicating that the approach with AL enhanced the metrics even with a limited number of samples.
Another method is reinforcement learning (RL), which is an ML approach where an agent learns to make decisions by interacting with an environment and receiving feedback in the form of rewards or penalties. Over time, the agent learns the optimal policy to maximize cumulative rewards through trial and error. Unlike supervised learning, where labeled data is used, RL focuses on learning from dynamic interactions and adapting strategies accordingly [17,101]. The use of RL methods in optimizing bioprocessing operations has gained attention due to its ability to provide real-time control and adjustment of complex systems. In batch bioprocesses, RL algorithms, such as policy gradients, have been applied to achieve real-time optimization by learning the best operational strategies dynamically [102]. Similarly, other RL methods, like partially supervised RL and Q-learning, have been successfully used to control bioreactors with high precision and low error [103]. Moreover, deep reinforcement learning, incorporating models such as Neural Fitted Q-learning, has shown promise in optimizing the output of co-culture bioreactors, enabling better process efficiency through direct learning from the system’s feedback [104].
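The trial-and-error learning described above can be illustrated with the standard tabular Q-learning update. The sketch below uses a hypothetical two-state lighting problem; the states, actions, reward, learning rate (`alpha`), and discount factor (`gamma`) are assumptions made for illustration and do not come from the cited bioreactor studies.

```python
def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new value."""
    best_next = max(q[next_state].values())       # value of the best next action
    td_target = reward + gamma * best_next        # reward plus discounted future value
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]

# Hypothetical toy problem: the light level is "low" or "high",
# and the agent can "dim" or "brighten" the lamp.
q_table = {"low": {"dim": 0.0, "brighten": 0.0},
           "high": {"dim": 0.0, "brighten": 1.0}}
new_value = q_update(q_table, "low", "brighten", reward=2.0, next_state="high")
```

Repeating this update over many interactions moves each table entry toward the expected cumulative reward of taking that action, from which the policy simply picks the highest-valued action in each state.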
In microalgae-based research, RL methods can be applied to optimize processes like microalgae cultivation, biomass production, or nutrient management. For instance, RL can help in finding the best environmental conditions, such as light, temperature, and nutrient levels, that maximize microalgae growth. By continuously adjusting and receiving feedback from growth outcomes, the RL agent can develop optimal strategies for improving yield and resource efficiency in microalgae production processes, as shown in the study carried out by Doan et al. [105]. In the study, they explored a novel approach to enhancing the cultivation of Arthrospira sp. using RL with state prediction in combination with LSTM neural networks. Their objective was to optimize the dry-weight yield of microalgae by controlling the critical factor of light irradiation in a closed photobioreactor. The RL algorithm continuously adapted its decisions based on feedback, with parameters like temperature being controlled to maximize the reward (the biomass dry-weight yield). To further enhance the RL process, LSTM networks were used to predict light intensity, resulting in a better performance of light prediction.
Table 3 summarizes several different studies that used various learning methods, such as UL, SSL, and RL, on microalgae-based datasets. One can notice that UL methods are more commonly applied than the others, probably due to the nature of available datasets. Microalgae studies often involve large amounts of unlabeled data, such as images and spectral data, making UL methods particularly suitable for detecting patterns without prior labeling. However, this focus on microalgae identification limits the generalizability of these methods and increases the risk of domain shift when applying image-based models to other data types. Therefore, future studies should apply these methods to other datasets (e.g., tabular and time series data) to broaden the scope of applications, where their potential to uncover hidden patterns can improve practical applicability.

4. Strategies for Enhancing Data-Driven Modeling of Microalgae Systems

Data-driven models are built by analyzing and learning from large amounts of empirical data. Unlike mechanistic models, whose development is laborious and requires detailed knowledge about the process, data-driven models identify patterns and correlations from the data without requiring a deep understanding of the underlying physical process. Since they depend on both the quality and quantity of datasets, their application in industrial biosystems becomes limited, especially when the data is not reliable, suffering from significant measurement errors and systematic noise [107,108].
Most data-driven models require state variable measurements at pre-specified time intervals. However, in industrial plants, data is often collected at irregular intervals due to variations in the efficiency and availability of analytical equipment. This inconsistency introduces additional challenges, such as missing data, which hampers the development of accurate data-driven models. In addition, since many bioproducts are produced periodically, the amount of accumulated data for specific operations is often much lower than in steady-state operation plants. This scarcity of high-quality datasets significantly hinders the application of data-driven models in industrial bioprocesses. Another challenge with these models is that they are not based on underlying physical mechanisms and are primarily used for interpolating steady-state processes. Consequently, their ability to forecast unknown dynamic processes for industrial plants has not been widely explored in the literature [108,109].
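One simple way to bring irregularly sampled measurements onto the fixed time grid that many models expect is linear interpolation, sketched below in plain Python. The sampling times and biomass values are hypothetical, and the function name is an assumption; interpolating across long gaps can hide real dynamics, so the result should be treated with caution.

```python
def resample_linear(times, values, step):
    """Linearly interpolate irregular samples onto a regular grid.

    `times` must be strictly increasing and contain at least two entries."""
    n_steps = int((times[-1] - times[0]) / step + 1e-9)
    grid = [times[0] + s * step for s in range(n_steps + 1)]
    out = []
    k = 0                                   # index of the interval [times[k], times[k+1]]
    for t in grid:
        while k < len(times) - 2 and times[k + 1] < t:
            k += 1
        t0, t1 = times[k], times[k + 1]
        frac = (t - t0) / (t1 - t0)
        out.append(values[k] + frac * (values[k + 1] - values[k]))
    return grid, out

# Hypothetical biomass readings (g/L) taken at irregular times (h).
times = [0.0, 1.0, 4.0, 6.0]
biomass = [1.0, 2.0, 5.0, 5.0]
grid, regular = resample_linear(times, biomass, step=2.0)
```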
The dynamic nature of process parameters, such as temperature and nutrient concentration, plays a crucial role in ensuring the effectiveness of ML models for prediction, monitoring, and optimization in industrial processes. Since these parameters can change rapidly during operations, it is essential for these models to adjust continuously in real time to fluctuations, ensuring that predictions remain accurate and functional. This dynamic adaptation enhances the model’s ability to provide, over time, insights for decision-making, which is particularly important in processes like microalgae cultivation or biochemical manufacturing, where conditions can vary unpredictably. Furthermore, the integration of real-time data enables ML models to minimize risks associated with unforeseen changes, leading to more efficient resource use, higher yields, and better optimization of production systems [21,64].

4.1. Data Acquisition Techniques and Their Influence on Microalgae-Based Datasets

Traditional methods of data collection, often performed manually, have limitations when trying to capture the dynamic nature of microalgae systems. However, recent advances in data acquisition techniques are transforming the ability to generate real-time datasets for microalgae systems, enabling more accurate monitoring of growth processes [71,110]. Among these techniques, biosensors have drawn attention due to their ability to collect continuous, high-resolution data. These analytical devices can translate intracellular concentrations of metabolites, such as oxygen, carbon dioxide, or nutrients, into machine-readable signals, like fluorescence or electrical outputs. For instance, fluorescence-based biosensors can monitor photosynthetic efficiency, a key indicator of microalgae health, by detecting the production of reactive oxygen species or chlorophyll fluorescence. In addition, biosensors can be integrated into PBRs for continuous tracking of pH, dissolved gases, and nutrient levels, offering precise insights into the microalgae system [111,112,113].
Conventional biosensors typically convert the biological interaction into readable signals, often through electrochemical, optical, or piezoelectric transducers. While effective, these biosensors often suffer from limitations in sensitivity, response time, and stability when applied to complex systems. As an alternative, novel biosensors have been designed with advanced materials, including nanomaterials, which enhance sensitivity and response times, making them ideal for detecting slight changes in environmental conditions [114,115]. The integration with automated data logging systems and Internet of Things platforms enables real-time monitoring and adjustments during the process, optimizing factors such as light intensity and temperature. These real-time datasets are essential to improve the metrics of ML models, enabling dynamic process control and enhanced biomass yield [116,117].
In addition to biosensors, other approaches are also gaining popularity for real-time monitoring of microalgae. Imaging techniques, such as automated microscopy [40] and flow cytometry [98], offer continuous visual analysis of cell morphology, size, and population dynamics. Spectroscopic methods, like NIR [35] and Raman spectroscopy [118], enable non-invasive chemical profiling of microalgae cultures, providing real-time insights into their composition, photosynthetic activity, and metabolic state. The development of portable and cost-effective multi- and hyperspectrometers has enabled a shift from traditional point-based measurements to spatially resolved optical monitoring. In addition, airborne hyperspectrometers or multispectrometers offer an exciting opportunity for remote monitoring of large-scale cultivation facilities, allowing for comprehensive and efficient assessment of microalgae growth over large areas [119].
Other data acquisition techniques are microfluidic systems and remote sensing technologies. Microfluidic systems [120] allow for the controlled manipulation of microalgae cultures at the nano- to picoliter scale, enabling high-precision processing and analysis of cells at high resolution. Such systems perform these tasks in a parallel format, achieving high-throughput assays at low cost. In addition, remote sensing technologies using drones or satellites are being explored for large-scale monitoring of HAB, enabling real-time monitoring of water quality parameters, such as chlorophyll-a concentrations and surface temperature [121,122]. Further information about the importance of using advanced ML methods and remote sensing technologies for real-time prediction and monitoring of HAB can be found in the review article carried out by Zahir et al. [123].
These new data acquisition techniques offer high-resolution measurements of microalgae biomass concentration and other process variables, like temperature and pH. By continuously providing real-time feedback, these techniques allow researchers and other professionals to better manage the complexities of microalgae cultivation and ensure more efficient, productive systems. Although the following summary is necessarily simplified, the techniques cited above can be compared as follows. Imaging techniques, especially flow cytometry, are the fastest, but also the least accurate ones; spectroscopic methods enable non-invasive, real-time analysis, although they may lack specificity for certain microalgae species; microfluidic systems offer precise control and high throughput, but they are limited by their complexity and need for specialized equipment; and remote sensing enables large-scale, real-time monitoring, but it is less effective for fine-scale monitoring and may be influenced by environmental factors [124,125,126].

4.2. Approaches for Improving Model Performance

Model accuracy/performance reflects how close the model’s predictions are to the true values, while model precision refers to how consistent the predictions are with each other. Improving model precision and accuracy/performance involves several key strategies, starting with data preprocessing techniques, such as outlier removal, cleaning, and normalization. While cleaning and outlier removal techniques are employed to handle missing data and eliminate outliers, normalization is used to standardize features, ensuring that no single feature dominates the model merely because of differences in scale. Another preprocessing technique is feature engineering, a process of selecting and transforming the most relevant features, enhancing the model’s ability to capture patterns in the dataset [20,31]. In parallel, feature importance analysis is performed to understand the contribution of each input feature in an ML model. One of the most employed approaches is SHapley Additive exPlanations (SHAP) analysis, which is a method based on cooperative game theory that provides a consistent and unbiased way to explain individual prediction results [127].
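The effect of normalization on features measured on different scales can be shown with two common rescaling schemes, min-max scaling and z-score standardization. The sketch below uses only the Python standard library; the pH and light-intensity values are hypothetical.

```python
import statistics

def min_max(values):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def zscore(values):
    """Standardize to zero mean and unit sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical features on very different scales.
ph = [6.5, 7.0, 7.5]
light = [100, 200, 300]          # e.g., light intensity in arbitrary units

ph_scaled = min_max(ph)
light_scaled = min_max(light)
light_standardized = zscore(light)
```

After scaling, both features span the same range, so a distance-based or gradient-based model no longer weights light intensity more heavily simply because its raw numbers are larger.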
As a promising solution to address data scarcity, data augmentation emerges as a preprocessing technique used to increase the size and diversity of a dataset by creating modified versions of existing data. Common methods include rotating, flipping, cropping, or scaling images [128,129,130], while more advanced techniques, such as the Generative Adversarial Network (GAN) [131] and Variational Autoencoder (VAE) [132], are the most popular ones to generate entirely new synthetic data. While a GAN consists of two ANNs, i.e., a generator and a discriminator, which compete against each other, a VAE works by encoding input data into a latent space and then decoding it back to generate new data. These techniques are documented in the literature, where they are claimed to enhance the generalization capabilities of ML models. For instance, Correa et al. [130] reported that DL models supported by data augmentation achieve higher accuracy in microalgae classification compared to models without augmentation. However, it is important to notice that improper or irrational data augmentation can lead to inaccurate predictions and reduce model reliability [133].
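The basic geometric augmentations mentioned above (flipping and rotating) can be sketched on an image stored as a nested list of pixel values. The tiny 2x2 image is a hypothetical placeholder; generative approaches such as GANs and VAEs are beyond the scope of this illustration.

```python
def flip_horizontal(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]

def flip_vertical(img):
    """Mirror the image top to bottom."""
    return [row[:] for row in img[::-1]]

def rotate_90_clockwise(img):
    """Rotate the image by 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]

def augment(img):
    """Return the original image plus three geometrically transformed copies."""
    return [img, flip_horizontal(img), flip_vertical(img), rotate_90_clockwise(img)]

tiny = [[1, 2],
        [3, 4]]
variants = augment(tiny)
```

Each variant keeps the class label of the original image, which is reasonable for microscopy images where cell orientation carries no meaning; transformations that alter class-relevant features would be an example of the improper augmentation warned against above.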
Advanced ML techniques, such as ensemble methods [134,135], improve accuracy/performance by combining the predictions of multiple models, while regularization techniques like Lasso [136] and Ridge regression [137] penalize model complexity to prevent overfitting, making the model more robust. To optimize model performance, hyperparameter tuning is performed through methods like random search or grid search. These methods systematically explore different combinations of hyperparameters to find the best settings that maximize model performance. Likewise, cross-validation strategies, such as k-fold and stratified k-fold, rigorously test the model and help prevent overfitting. They assess the model’s performance by splitting the data into multiple subsets, ensuring a more reliable evaluation [19,20,28]. Another approach to enhance model performance is using different optimization techniques. These are used to minimize or maximize a specific objective function, such as error or accuracy. In ML models, common optimization algorithms include gradient descent and its variants like stochastic gradient descent and Adam, which iteratively update model weights based on the gradient of the loss function. Optimization is essential for improving model performance, ensuring faster convergence, and avoiding issues like overfitting [22,71,129].
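The data splitting behind k-fold cross-validation can be sketched as a small index generator. This plain-Python version uses contiguous folds without shuffling or stratification, which is an assumption made for brevity; the sample count and number of folds are arbitrary.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds.

    The first n_samples % k folds receive one extra sample."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(k_fold_indices(7, 3))
```

Every sample appears in exactly one test fold, so averaging the test metric over the folds uses all of the data for evaluation while never scoring a model on samples it was trained on.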
In addition to these approaches, an increasingly promising alternative is the use of HM. These models combine data-driven ML algorithms with theoretical or physics-based models, creating systems that benefit from both the vast data-handling capabilities of ML models and the accuracy and interpretability of physics-based models. This synergy allows HM to perform well in complex, real-world scenarios where purely data-driven models may struggle. They can also be readily integrated into various optimization techniques to overcome the limitations of both physics-based and data-driven models. Moreover, the HM approach is viewed as a practical framework for addressing challenges in optimizing industrial biosystems, such as low-quality and limited datasets, online measurement constraints, high costs associated with periodic sampling, and insufficient physical process knowledge [64,107,138]. In this regard, the following section is dedicated to a more in-depth discussion of HM and the potential benefits of this approach for microalgae systems.

5. Importance of the Hybrid Modeling Approach

HM was introduced in 1992 to control fermentation processes [24]. The structure of a hybrid model is adaptable and depends on the approach used to integrate ML with mechanistic models [23]. First-principles, mechanistic, or phenomenological models represent a wide category of more transparent models, also called white-box models. On the other hand, ML represents a less transparent modeling framework relying almost exclusively on process data, which is why ML models are also known as black-box models [107]. In general, the HM approach provides models with a well-defined internal structure where each component performs a distinct task. The model consists of two key elements: the process parameter estimator, i.e., data-driven models, and the partial model based on first principles. The partial model offers a more robust starting point compared to a purely data-driven model while also accommodating both structural and parametric uncertainties [23,24].
A related mathematical categorization can be made based on whether the model is nonparametric, parametric, or semiparametric. Parametric models are defined based on prior knowledge of the process. They have a fixed number of parameters and offer a physical or empirical interpretation depending on the underlying process knowledge [139]. Mathematically, they can be expressed as
Y = f(X, Ω)
with the outputs (Y), the inputs (X), the model parameters (Ω), and the mathematical function f(·), which represents the parametric model.
At the opposite extreme, nonparametric models are derived exclusively from data. Although the term nonparametric might suggest that these models are entirely devoid of parameters, it instead indicates that the number and nature of the parameters are flexible, being determined by the data rather than fixed in advance as in parametric models [139,140]. In general, nonparametric models can be defined as
Y = g(X, ω)
with the model parameters (ω) and the approximating mathematical function g(·). This function is typically constructed from several interconnected modules, whose connections are weighted according to the parameters [139].
While white-box models fall into the category of parametric models, black-box or ML models, in general, belong to the nonparametric model category. Between these two extremes lies the domain of hybrid semiparametric modeling. Static hybrid semiparametric models integrate nonparametric and parametric components, with the parametric part incorporating prior knowledge of the system [139,140,141]. Mathematically, this can be generally expressed as
Y = h(f(X, Ω), g(X, ω), θ)
with f(X, Ω) and g(X, ω) representing the parametric and nonparametric models, respectively, while the function h(·) combines both models. In general, h(·) is itself parameterized by θ, which also needs to be estimated [139]. Although hybrid semiparametric models have a more complex structure and may require longer development times, these costs are typically outweighed by their benefits, which include improved data fitting, adherence to physical constraints, enhanced generalization, and greater interpretability compared to purely nonparametric models [139,140].
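As a minimal illustration of the general hybrid semiparametric form, the following Python sketch combines a parametric and a nonparametric submodel. The Monod-type expression chosen for f, the piecewise-linear interpolator chosen for g, the additive form of h, and all numerical values are assumptions made for illustration only; they are not taken from the cited works.

```python
from bisect import bisect_right


def f_parametric(x, omega):
    """Parametric part: Monod-type rate with a fixed parameter set (mu_max, k_s)."""
    mu_max, k_s = omega
    return mu_max * x / (k_s + x)


def g_nonparametric(x, knots):
    """Nonparametric part: piecewise-linear interpolation through data knots.

    `knots` is a list of (x_k, y_k) pairs sorted by x_k. Its length is set by
    the data, not fixed in advance, which is what "nonparametric" means here.
    """
    xs = [k[0] for k in knots]
    ys = [k[1] for k in knots]
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x)  # index of the first knot strictly to the right of x
    w = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
    return ys[i - 1] + w * (ys[i] - ys[i - 1])


def h_hybrid(x, omega, knots, theta):
    """Combining function h: an additive form, f + theta * g (one possible choice)."""
    return f_parametric(x, omega) + theta * g_nonparametric(x, knots)


# Hypothetical values: Monod parameters and a data-driven correction curve.
omega = (1.0, 0.5)
knots = [(0.0, 0.0), (1.0, -0.1), (2.0, -0.3)]

y_parametric = h_hybrid(1.5, omega, knots, theta=0.0)  # reduces to f alone
y_hybrid = h_hybrid(1.5, omega, knots, theta=1.0)      # f plus the correction
```

Setting theta to zero recovers the purely parametric model, which makes the contribution of the nonparametric component easy to inspect.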
Other modeling approaches have been proposed that involve the integration of different types of knowledge and/or submodels. The concept of “gray-box modeling” emerged in the 1990s within the area of systems and control theory [142]. This approach may be defined as the incorporation of prior information, mainly structural insights derived from white-box models, into black-box models. Hybrid semiparametric models may be seen as a class of gray-box models, as they incorporate parametric and nonparametric submodels with different levels of transparency. In addition, the term “hybrid modeling” is often used interchangeably with “hybrid semiparametric modeling” in the literature. However, this definition is rather ambiguous, as it can encompass various other modeling methods, such as gray-box or block-oriented modeling approaches [107].
A properly validated hybrid model can effectively address the challenges posed by incomplete physical knowledge and by low-quality or limited data. Other frequently noted HM advantages are more accurate interpolation and extrapolation, ease of analysis and output interpretation, and the requirement of fewer training examples than a pure ML model [23,24]. However, incorrectly parametrized hybrid models are also susceptible to overfitting, which can result in increased uncertainty and poor generalization. Consequently, it is crucial to identify a reliable mechanistic backbone that faithfully represents the process, thereby reducing the reliance on the data-driven model and decreasing the risk of overfitting [143].
The application of HM in chemical and biochemical processes has been showcased in different works [144,145,146,147,148]. The prevailing hybrid structure is based on mass balance equations derived from first principles combined with an ANN to describe reaction kinetics [25,149]. Standard multilayer perceptrons are the most frequently used ANN topologies for two reasons: these networks are universal non-linear function approximators, and their application does not require prior knowledge about the structure of the system to be modeled. These two factors make them particularly appealing for modeling very complex mechanisms related to cell growth and biocatalysis [25,144].
The combination of ML models with mass and energy balance equations, whether in parallel or in serial configurations, as presented in Figure 6, yields non-linear dynamical systems described by a set of ordinary differential equations or differential-algebraic equations [24,149]. The choice between serial and parallel structures is dictated by the level of detail and performance of the white-box model structure. When the white-box model structure is complete (in the sense that it covers all essential parts of the process) but not sufficiently accurate, the parallel configuration is the method of choice: the ML component runs in parallel with the white-box model and has the job of decreasing its prediction errors. If, however, the white-box model structure only partially covers the essential features of the process, with some of them lacking mechanistic understanding, then the serial configuration is the method of choice [107,150]: the ML components have the job of modeling the process features that lack a sufficiently accurate mechanistic description.
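The difference between the two configurations can be sketched in a few lines of Python for a batch biomass balance dX/dt = μ·X. Everything here is a hypothetical stand-in: the logistic-type rate expressions, the explicit Euler integrator, and the two "data-driven" functions, which would be trained ML models in practice.

```python
def euler(rhs, x0, dt, n_steps):
    """Integrate dx/dt = rhs(x) with the explicit Euler method."""
    x = x0
    trajectory = [x]
    for _ in range(n_steps):
        x = x + dt * rhs(x)
        trajectory.append(x)
    return trajectory


def mu_mechanistic(x):
    """Assumed white-box specific growth rate (logistic-type, illustrative)."""
    return 0.5 * (1.0 - x / 10.0)


def mu_data_driven(x):
    """Stand-in for an ML model that predicts the specific growth rate."""
    return 0.45 * (1.0 - x / 9.0)


def residual_data_driven(x):
    """Stand-in for an ML model trained on (measurement - white-box prediction)."""
    return -0.02 * x


# Serial configuration: the mass balance is mechanistic, while the kinetic
# term that lacks a reliable mechanistic description comes from the ML model.
serial = euler(lambda x: mu_data_driven(x) * x, x0=0.1, dt=0.1, n_steps=100)

# Parallel configuration: the complete white-box model is simulated first and
# the ML model then corrects its prediction error.
white_box = euler(lambda x: mu_mechanistic(x) * x, x0=0.1, dt=0.1, n_steps=100)
parallel = [x + residual_data_driven(x) for x in white_box]
```

In the serial case the ML output enters the right-hand side of the balance equation, so mass conservation is enforced by construction; in the parallel case the correction acts on the model output and carries no such guarantee.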
HM is a promising alternative for the efficient use of available experimental data, and such models can support the scalability of various systems, enabling the transition from laboratory experiments to large-scale industrial applications. However, they should be viewed as an enhancement tool rather than a solution for poor experimental design or gaps in process knowledge [151,152]. Each HM is tailored to the specific data available and the complexity of the bioprocess, making it difficult to replicate across systems that are not similar. While a general framework for developing HM can be established, the design of nonparametric components along with their hyperparameters and architecture remains largely heuristic. In this way, developing HM requires integrating prior domain knowledge of the underlying process [23,152].
Despite its advantages, HM has not seen widespread application in microalgae systems. This is largely due to challenges such as the limited availability of high-quality, large-scale datasets, the complexity of integrating biological knowledge into ML models, and the need for expertise in both data science and biological systems. Additionally, the high variability in microalgae strains and cultivation conditions makes it difficult to develop generalized models that can be applied across different systems, limiting the broader adoption of the HM approach in this field [21,64]. Table 4 outlines some studies that employed HM in microalgae-based systems.
As a first study, Zhang et al. [23] presented a novel HM strategy to enhance the prediction and optimization of photo-production systems. They used microalgae lutein synthesis to showcase the practical application and performance of the proposed HM approach. They selected a second-degree polynomial regression (PR) model as the data-driven model to estimate the deviations between the kinetic model and the process data. The PR model was applied in parallel with the mechanistic model to enhance the predictive modeling and to enable a more straightforward parameter estimation and uncertainty analysis. Then, an advanced model structure identification scheme was employed to determine the most physically plausible kinetic model structure and the minimal number of data-driven parameters. They concluded that the HM approach showed high potential for process monitoring and optimization, with its self-calibration feature being easily implementable in an online control system.
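The idea of a second-degree polynomial correcting a kinetic model in parallel can be illustrated with a small, self-contained Python sketch. It is not the implementation of Zhang et al. [23]: the exponential kinetic model, the use of time as the only regressor, the synthetic measurements, and the helper names are all assumptions chosen to keep the example short.

```python
import math


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_quadratic(xs, ys):
    """Least-squares fit of y = c0 + c1*x + c2*x**2 via the normal equations."""
    rows = [[x ** p for p in range(3)] for x in xs]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    return solve_linear(ata, aty)


def kinetic_model(t):
    """Hypothetical mechanistic prediction (simple exponential growth)."""
    return 0.1 * math.exp(0.3 * t)


times = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
# Synthetic "measurements": the kinetic model plus a known quadratic deviation.
measured = [kinetic_model(t) + 0.05 - 0.02 * t + 0.01 * t ** 2 for t in times]

# Parallel scheme: the polynomial is fitted to the kinetic-model residuals.
residuals = [y - kinetic_model(t) for t, y in zip(times, measured)]
coeffs = fit_quadratic(times, residuals)


def hybrid_prediction(t):
    """Kinetic prediction plus the fitted polynomial correction."""
    c0, c1, c2 = coeffs
    return kinetic_model(t) + c0 + c1 * t + c2 * t ** 2
```

Because the correction has only three coefficients, its parameters can be estimated and analyzed with standard linear regression tools, which is consistent with the motivation given above for choosing a low-order polynomial.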
The study by Zhang et al. [23] built on earlier work by del Rio-Chanona et al. [156]. In that earlier study, the authors investigated the production of lutein, a carotenoid synthesized by the microalga Desmodesmus sp. with several applications due to its antioxidant properties. They developed a kinetic model using modified Monod kinetics to predict the optimal parameters for maximizing lutein yield. This kinetic model was later used as the kinetic part of the HM approach in the subsequent study performed by Kay et al. [153]. In their work, the authors combined ANN models in a parallel scheme with the proposed kinetic model to describe the microalgae lutein synthesis. After building the hybrid model, transfer learning was applied to update the model using limited process data from another microalgae species, Chlorella sorokiniana. The hybrid transfer model was then shown to be capable of reproducing the trends of biomass growth and lutein production of the new species under various operating conditions.
In another study, Zhang et al. [109] presented an HM framework to optimize biomass growth and lutein production in microalgae cultivation, in which the kinetic model was also based on the study performed by del Rio-Chanona et al. [156]. The data-driven component employed an ANN with multilayered structures and data augmentation techniques for noise filtering, process prediction, and control optimization. The HM approach improved the predictions of biomass, nitrate, and lutein, achieving deviations as low as 5.1%, 11.7%, and 2.6%, respectively, in continuous process trajectories. These results demonstrate the effectiveness of the HM approach in addressing challenges in microalgae systems through its high predictive capability and flexibility, indicating its potential for the simulation and optimization of complex biosystems.
Wang et al. [154] proposed an HM construction framework that was showcased on a microalgae cultivation process. The kinetic model used in the study was a simplification of the model developed by del Rio-Chanona et al. [157], describing the biomass growth and nitrate consumption in a microalgae system. The data-driven model was a polynomial regression integrated into the mechanistic backbone derived from the kinetic model. Statistical methods were then employed to identify the hybrid model that best balances fitting and extrapolation performance. Among the statistical methods applied, the Bayesian information criterion was described as the most robust information criterion, even when the hybrid model was trained at different levels of Gaussian noise to mimic real-life measurements. The best hybrid model identified showed good prediction capabilities under different noise levels, indicating that the proposed framework is successful in developing an efficient hybrid model.
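To make the role of the Bayesian information criterion concrete, the sketch below ranks three candidate model structures. It assumes the common Gaussian-error form BIC = n·ln(RSS/n) + k·ln(n), where n is the number of observations, RSS the residual sum of squares, and k the number of fitted parameters; the exact formulation used in [154] is not given here, and the candidate names and numbers are purely illustrative.

```python
import math


def bic(rss, n_obs, n_params):
    """Bayesian information criterion under Gaussian errors (lower is better)."""
    return n_obs * math.log(rss / n_obs) + n_params * math.log(n_obs)


# Hypothetical candidates: residual sum of squares and parameter count.
candidates = {
    "kinetic only": {"rss": 4.0, "k": 3},
    "kinetic + linear": {"rss": 1.2, "k": 5},
    "kinetic + quadratic": {"rss": 1.15, "k": 6},
}
n_obs = 40

scores = {name: bic(c["rss"], n_obs, c["k"]) for name, c in candidates.items()}
best = min(scores, key=scores.get)
```

In this made-up example the quadratic correction fits slightly better than the linear one, but the ln(n) penalty on the extra parameter outweighs the gain, so the simpler hybrid structure is selected.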
In the study carried out by Shahhoseyni et al. [155], the authors employed HM for microalgae systems to achieve improved prediction and robustness in photobioreactor performance. They used a kinetic model that incorporated light intensity as a key variable in the modeling process and a polynomial regression as the data-driven component. Lasso regularization was also used to manage model complexity and prevent overfitting. Their results showed that the HM with the inclusion of light intensity as a variable in the kinetic model presented better performance metrics, such as higher R² and lower MAPE, than the models without the light intensity as a variable. Finally, they concluded that the HM approach effectively captured detailed dynamics, resulting in improved accuracy for predicting biomass growth under varying light conditions.
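For readers unfamiliar with the two metrics mentioned above, the following sketch computes the coefficient of determination and the mean absolute percentage error using their standard definitions. The biomass values are invented for illustration and do not come from [155].

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mape(y_true, y_pred):
    """Mean absolute percentage error in percent (requires nonzero y_true)."""
    return 100.0 / len(y_true) * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))


# Hypothetical measured and predicted biomass concentrations.
y_true = [1.0, 2.0, 4.0]
y_pred = [1.1, 1.9, 4.2]

r2 = r_squared(y_true, y_pred)
err = mape(y_true, y_pred)
```

A higher R² and a lower MAPE both indicate closer agreement with the measurements, but MAPE weights errors relative to the measured value and is therefore sensitive to measurements close to zero, such as biomass at inoculation.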

6. Future Directions in the Field of Microalgae Systems

In this review, an overview is given of the commonly used algorithms for data-driven modeling in the context of microalgae systems. While most applications of these algorithms are in supervised or unsupervised learning, other ML methods like SSL and RL can be used for more specialized tasks. SSL could compensate for the scarcity of labeled data by leveraging small amounts of labeled data along with abundant unlabeled data, improving the performance of models in tasks like species classification and growth prediction. Similarly, RL holds the potential for optimizing microalgae systems, such as adjusting environmental factors in real time to maximize microalgae growth or biomass yields. Expanding these methods in microalgae research would enhance predictive capabilities, improve process optimization, and lead to more efficient biotechnological applications.
However, the success of these ML methods depends heavily on the availability and quality of the data. Many current microalgae datasets are limited, noisy, or incomplete, which can hinder model performance. DM techniques such as data cleaning and anomaly detection can be employed to resolve issues like missing data, noise, and errors. In general, ML models’ performance improves as more data becomes available. However, collecting large datasets can be both time-consuming and costly in real-world applications. When data is limited, preprocessing techniques such as data augmentation can enhance data quality. This technique can artificially expand datasets by generating variations based on known invariances. This allows the model to learn these patterns even when data is limited or imbalanced, which leads to improved generalization and enhanced model performance [158]. Advanced methods like GAN and VAE have gained popularity due to their ability to create realistic and high-dimensional data. These methods not only help prevent overfitting by introducing variability but also improve the generalization capability of models. More details about DM techniques with a focus on missing data and data augmentation can be found elsewhere [20,159,160].
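A very simple form of data augmentation for tabular process data is to add small random perturbations to measured samples. The sketch below does this with multiplicative Gaussian noise; the assumption that the underlying relationship is invariant to measurement noise of this size, the 2% relative standard deviation, and the example values are all illustrative choices rather than recommendations from the cited literature.

```python
import random


def augment_with_noise(samples, n_copies, rel_sigma, seed=0):
    """Return the original samples followed by noisy copies of each one.

    Each synthetic value is the original multiplied by (1 + e), where e is
    drawn from a zero-mean Gaussian with standard deviation `rel_sigma`.
    """
    rng = random.Random(seed)  # fixed seed keeps the augmentation reproducible
    augmented = [list(row) for row in samples]
    for row in samples:
        for _ in range(n_copies):
            augmented.append([v * (1.0 + rng.gauss(0.0, rel_sigma)) for v in row])
    return augmented


# Hypothetical samples: [biomass concentration, nitrate concentration].
measured = [[0.5, 120.0], [1.1, 95.0], [2.3, 40.0]]
augmented = augment_with_noise(measured, n_copies=4, rel_sigma=0.02)
```

Generative approaches such as GAN and VAE pursue the same goal with learned rather than hand-specified variations; noise injection is merely the easiest baseline to implement and inspect.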
Additionally, the collection of real-time data is crucial for improving the performance and the adjustment of ML models. Recent advances in data acquisition techniques have significantly enhanced the ability to collect, analyze, and utilize data across various domains, particularly in environmental monitoring and bioprocess optimization. The use of ML algorithms in data acquisition processes has facilitated the automatic detection of anomalies and the prediction of system behaviors based on historical data patterns. The implementation of biosensors, which can provide rapid and accurate measurements of biological indicators, has further accelerated real-time monitoring capabilities. In parallel, microfluidic systems have emerged as powerful techniques for manipulating and analyzing minute volumes of fluids, allowing for high-throughput screening of microalgae under varying process conditions. Remote sensing technologies have also advanced in their ability to monitor real-time microalgae systems and environmental conditions using satellite imagery and aerial drones equipped with hyperspectral sensors.
Another aspect of critical importance in researching microalgae systems is HM, which combines the strengths of physics-based and data-driven models. By incorporating both types of models, hybrid approaches mitigate the issues arising from limited or noisy datasets. ML components within hybrid models can compensate for missing or imperfect data by learning from real-time inputs, and these models are adaptable, allowing for continuous refinement as new data is collected. One of the main advantages of HM is its ability to improve scale-up. As microalgae systems move from lab-scale to industrial-scale production, the process parameters and environmental conditions can vary dramatically. A purely physics-based model may struggle to adapt to such scale shifts, while a data-driven model might not have sufficient data to predict outcomes accurately. Hybrid models can handle these changes by combining empirical data with fundamental process knowledge, making them robust across different scales of production. This adaptability is particularly useful for optimizing dynamic systems in real time, such as photobioreactors, where environmental conditions continuously fluctuate.

7. Conclusions

This review provides a detailed overview of data mining, machine learning, and hybrid modeling approaches applied to microalgae systems. At first, an overview of different data mining tasks and machine learning methods is provided, categorizing those methods into four distinct types: supervised, unsupervised, semi-supervised, and reinforcement learning. In addition, special emphasis is placed on deep learning methods in a supervised scheme. As highlighted throughout this review, the optimization of microalgae systems relies not only on the system design and operation but also on the quality and availability of reliable datasets.
Therefore, based on the discussion in this review, we highlight the following key recommendations:
(1) Enhance data quality and acquisition: Reliable and standardized datasets are essential for model performance. Implementation of more advanced data acquisition techniques, such as biosensors, microfluidic systems, and remote sensing technologies, should be expanded to generate accurate and high-resolution data.
(2) Adopt advanced preprocessing and modeling strategies: The application of advanced preprocessing techniques, including data augmentation and ensemble methods, in combination with different learning approaches, such as deep learning, semi-supervised learning, and reinforcement learning, should be prioritized to enhance predictive accuracy and robustness.
(3) Expand the use of hybrid approaches: Hybrid modeling is a powerful tool for optimizing microalgae systems by integrating empirical data with theoretical understanding, improving scalability, efficiency, and predictive power. Its application in microalgae systems should be further explored.
Finally, future directions point toward the increased use of more sophisticated learning approaches, such as deep learning and reinforcement learning, and the application of hybrid modeling to address challenges in dataset quality, generalization, and model performance. However, a critical first step for advancing this field is the construction of homogeneous and well-documented datasets, since the effective application of hybrid models depends on data consistency and quality. Achieving this requires greater transparency from researchers, particularly in reporting how data are processed, reproduced, and linked to model development. Such advancements will be key to driving innovation and improving the sustainability and productivity of microalgae systems.

Author Contributions

Conceptualization, G.R.F. and F.G.M.; methodology, G.R.F.; formal analysis, G.R.F.; investigation, G.R.F.; resources, S.B.; writing—original draft preparation, G.R.F.; writing—review and editing, G.R.F., S.B., R.O. and F.G.M.; supervision, R.O. and F.G.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work received financial support from the Foundation for Science and Technology (FCT) and the Ministry of Science, Technology and Higher Education (MCTES), through national funds (PIDDAC): LEPABE (Laboratory for Process Engineering, Environment, Biotechnology and Energy, Faculty of Engineering, University of Porto, UIDB/00511/2020, DOI: 10.54499/UIDB/00511/2020, UIDP/00511/2020, DOI: 10.54499/UIDP/00511/2020), ALiCE (Associate Laboratory in Chemical Engineering, Faculty of Engineering, University of Porto, LA/P/0045/2020, DOI: 10.54499/LA/P/0045/2020), UCIBIO (Research Unit on Applied Molecular Biosciences, UIDP/04378/2020, DOI: 10.54499/UIDP/04378/2020, UIDB/04378/2020, DOI: 10.54499/UIDB/04378/2020), and i4HB (Associate Laboratory Institute for Health and Bioeconomy, LA/P/0140/2020, DOI: 10.54499/LA/P/0140/2020). G.R.F. thanks FCT for the individual research grant PRT/BD/154543/2022.

Data Availability Statement

Data are contained within the article.

Acknowledgments

G.R.F. thanks A4F company for providing the facilities and the necessary conditions for the accomplishment of this research.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AL: Active Learning
ANN: Artificial Neural Network
AUC: Area Under the Curve
CNN: Convolutional Neural Network
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
DCNN: Deep Convolutional Neural Network
DL: Deep Learning
DM: Data Mining
DTs: Decision Trees
FTIR: Fourier Transform Infrared
GAN: Generative Adversarial Network
GB: Gradient Boosting
GC: Gaussian Classifier
GSDAM: Grayscale Surface Direction Angle Model
HAB: Harmful Algal Blooms
HAC: Hierarchical Agglomerative Clustering
HCA: Hierarchical Cluster Analysis
HE: Harvesting Efficiency
HM: Hybrid Modeling
kNN: k-Nearest Neighbor
LSTM: Long Short-Term Memory
MAE: Mean Absolute Error
ML: Machine Learning
MLP: Multilayer Perceptron
MLR: Multiple Linear Regression
MNPs: Magnetic Nanoparticles
NIR: Near Infrared
PBR: Photobioreactor
PCA: Principal Component Analysis
PCR: Principal Component Regression
PR: Polynomial Regression
PRI: Probability Rand Index
RF: Random Forest
RL: Reinforcement Learning
RMSE: Root Mean Square Error
SSL: Semi-Supervised Learning
SVM: Support Vector Machine
SVR: Support Vector Regression
t-SNE: t-distributed Stochastic Neighbor Embedding
UL: Unsupervised Learning
VAE: Variational Autoencoder
XGBoost: Extreme Gradient Boosting

References

  1. Borowitzka, M.A. Dunaliella: Biology, Production, and Markets. In Handbook of Microalgal Culture: Applied Phycology and Biotechnology; Richmond, A., Hu, Q., Eds.; Wiley Blackwell: Chichester, UK, 2013; pp. 359–368. ISBN 9780470673898.
  2. Polle, J.E.W.; Tran, D.; Ben-Amotz, A. The Alga Dunaliella: Biodiversity, Physiology, Genomics and Biotechnology, 1st ed.; Ben-Amotz, A., Polle, J.E.W., Rao, D.V.S., Eds.; Science Publishers: Enfield, NH, USA, 2009.
  3. Alvarez, A.L.; Weyers, S.L.; Goemann, H.M.; Peyton, B.M.; Gardner, R.D. Microalgae, Soil and Plants: A Critical Review of Microalgae as Renewable Resources for Agriculture. Algal Res. 2021, 54, 102200.
  4. Aghaalipour, E.; Akbulut, A.; Güllü, G. Carbon Dioxide Capture with Microalgae Species in Continuous Gas-Supplied Closed Cultivation Systems. Biochem. Eng. J. 2020, 163, 107741.
  5. Leong, Y.K.; Chang, J.S. Bioremediation of Heavy Metals Using Microalgae: Recent Advances and Mechanisms. Bioresour. Technol. 2020, 303, 122886.
  6. Zhao, W.; Zhu, J.; Yang, S.; Liu, J.; Sun, Z.; Sun, H. Microalgal Metabolic Engineering Facilitates Precision Nutrition and Dietary Regulation. Sci. Total Environ. 2024, 951, 175460.
  7. Laamanen, C.A.; Desjardins, S.M.; Senhorinho, G.N.A.; Scott, J.A. Harvesting Microalgae for Health Beneficial Dietary Supplements. Algal Res. 2021, 54, 102189.
  8. Kassim, M.A.; Meng, T.K. Carbon Dioxide (CO2) Biofixation by Microalgae and Its Potential for Biorefinery and Biofuel Production. Sci. Total Environ. 2017, 584–585, 1121–1129.
  9. Xu, X.; Gu, X.; Wang, Z.; Shatner, W.; Wang, Z. Progress, Challenges and Solutions of Research on Photosynthetic Carbon Sequestration Efficiency of Microalgae. Renew. Sustain. Energy Rev. 2019, 110, 65–82.
  10. Nagappan, S.; Devendran, S.; Tsai, P.C.; Dahms, H.U.; Ponnusamy, V.K. Potential of Two-Stage Cultivation in Microalgae Biofuel Production. Fuel 2019, 252, 339–349.
  11. Peng, L.; Fu, D.; Chu, H.; Wang, Z.; Qi, H. Biofuel Production from Microalgae: A Review. Environ. Chem. Lett. 2020, 18, 285–297.
  12. Ambat, I.; Tang, W.Z.; Sillanpää, M. Statistical Analysis of Sustainable Production of Algal Biomass from Wastewater Treatment Process. Biomass Bioenergy 2019, 120, 471–478.
  13. Zhao, D.; Cheah, W.Y.; Lai, S.H.; Ng, E.P.; Khoo, K.S.; Show, P.L.; Ling, T.C. Symbiosis of Microalgae and Bacteria Consortium for Heavy Metal Remediation in Wastewater. J. Environ. Chem. Eng. 2023, 11, 109943.
  14. Tripathi, S.; Choudhary, S.; Meena, A.; Poluri, K.M. Carbon Capture, Storage, and Usage with Microalgae: A Review. Environ. Chem. Lett. 2023, 21, 2085–2128.
  15. Razzak, S.A.; Hossain, M.M.; Lucky, R.A.; Bassi, A.S.; De Lasa, H. Integrated CO2 Capture, Wastewater Treatment and Biofuel Production by Microalgae Culturing—A Review. Renew. Sustain. Energy Rev. 2013, 27, 622–653.
  16. Pires, J.C.M.; Alvim-Ferraz, M.C.M.; Martins, F.G. Photobioreactor Design for Microalgae Production through Computational Fluid Dynamics: A Review. Renew. Sustain. Energy Rev. 2017, 79, 248–254.
  17. Sarker, I.H. Machine Learning: Algorithms, Real-World Applications and Research Directions. SN Comput. Sci. 2021, 2, 160.
  18. Bock, F.E.; Aydin, R.C.; Cyron, C.J.; Huber, N.; Kalidindi, S.R.; Klusemann, B. A Review of the Application of Machine Learning and Data Mining Approaches in Continuum Materials Mechanics. Front. Mater. 2019, 6, 110.
  19. Järvinen, P.; Siltanen, P.; Kirschenbaum, A. Data Analytics and Machine Learning. In Big Data in Bioeconomy—Results from the European DataBio Project; Södergård, C., Mildorf, T., Habyarimana, E., Berre, A.J., Fernandes, J.A., Zinke-Wehlmann, C., Eds.; Springer: Berlin/Heidelberg, Germany, 2021.
  20. Han, J.; Kamber, M.; Pei, J. Data Mining: Concepts and Techniques, 3rd ed.; Han, J., Kamber, M., Pei, J., Eds.; Morgan Kaufmann Publishers—Elsevier: Amsterdam, The Netherlands, 2012.
  21. Mondal, P.P.; Galodha, A.; Verma, V.K.; Singh, V.; Show, P.L.; Awasthi, M.K.; Lall, B.; Anees, S.; Pollmann, K.; Jain, R. Review on Machine Learning-Based Bioprocess Optimization, Monitoring, and Control Systems. Bioresour. Technol. 2023, 370, 128523.
  22. Alpaydin, E. Introduction to Machine Learning, 2nd ed.; Dietterich, T., Ed.; MIT Press: Cambridge, MA, USA, 2010; ISBN 026201243X.
  23. Zhang, D.; Savage, T.R.; Cho, B.A. Combining Model Structure Identification and Hybrid Modelling for Photo-Production Process Predictive Simulation and Optimisation. Biotechnol. Bioeng. 2020, 117, 3356–3367.
  24. Psichogios, D.C.; Ungar, L.H. A Hybrid Neural Network-First Principles Approach to Process Modeling. AIChE J. 1992, 38, 1499–1511.
  25. Oliveira, R. Combining First Principles Modelling and Artificial Neural Networks: A General Framework. Comput. Chem. Eng. 2004, 28, 755–766.
  26. VOSviewer, Version 1.6.20; 31 October 2023. Available online: http://www.vosviewer.com/ (accessed on 30 July 2025).
  27. Garrido-Cardenas, J.A.; Manzano-Agugliaro, F.; Acien-Fernandez, F.G.; Molina-Grima, E. Microalgae Research Worldwide. Algal Res. 2018, 35, 50–60.
  28. Aggarwal, C.C. Data Mining; Springer International Publishing: Cham, Switzerland, 2015; ISBN 978-3-319-14141-1.
  29. Witten, I.H.; Frank, E.; Hall, M.A. What’s It All About? In Data Mining: Practical Machine Learning Tools and Techniques; Elsevier: Amsterdam, The Netherlands, 2011; pp. 3–38.
  30. Bellazzi, R.; Zupan, B. Predictive Data Mining in Clinical Medicine: Current Issues and Guidelines. Int. J. Med. Inform. 2008, 77, 81–97.
  31. Tan, P.-N.; Steinbach, M.; Karpatne, A.; Kumar, V. Introduction to Data Mining, 2nd ed.; Addison Wesley: New York, NY, USA, 2018.
  32. Alzubi, J.; Nayyar, A.; Kumar, A. Machine Learning from Theory to Algorithms: An Overview. J. Phys. Conf. Ser. 2018, 1142, 12012.
  33. Dhage, S.N.; Raina, C.K. A Review on Machine Learning Techniques. Int. J. Recent Innov. Trends Comput. Commun. 2016, 4, 395–399.
  34. Ota, M.; Takenaka, M.; Sato, Y.; Lee Smith, R.; Inomata, H. Effects of Light Intensity and Temperature on Photoautotrophic Growth of a Green Microalga, Chlorococcum Littorale. Biotechnol. Rep. 2015, 7, 24–29.
  35. Laurens, L.M.L.; Wolfrum, E.J. High-Throughput Quantitative Biochemical Characterization of Algal Biomass by NIR Spectroscopy; Multiple Linear Regression and Multivariate Linear Regression Analysis. J. Agric. Food Chem. 2013, 61, 12307–12314.
  36. Yan, Q.; Yang, C.; Wan, Z. A Comparative Regression Analysis between Principal Component and Partial Least Squares Methods for Flight Load Calculation. Appl. Sci. 2023, 13, 8428.
  37. Karakach, T.K.; McGinn, P.J.; Choi, J.; MacQuarrie, S.P.; Tartakovsky, B. Real-Time Monitoring, Diagnosis, and Time-Course Analysis of Microalgae Scenedesmus AMDD Cultivation Using Dual Excitation Wavelength Fluorometry. J. Appl. Phycol. 2015, 27, 1823–1832.
  38. Horton, R.B.; Duranty, E.; McConico, M.; Vogt, F. Fourier Transform Infrared (FT-IR) Spectroscopy and Improved Principal Component Regression (PCR) for Quantification of Solid Analytes in Microalgae and Bacteria. Appl. Spectrosc. 2011, 65, 442–453.
  39. Lee, J.H.; Park, J.J.; Yoon, H. Automatic Bridge Design Parameter Extraction for Scan-to-BIM. Appl. Sci. 2020, 10, 7346.
  40. Reimann, R.; Zeng, B.; Jakopec, M.; Burdukiewicz, M.; Petrick, I.; Schierack, P.; Rödiger, S. Classification of Dead and Living Microalgae Chlorella Vulgaris by Bioimage Informatics and Machine Learning. Algal Res. 2020, 48, 101908.
  41. Chong, J.W.R.; Khoo, K.S.; Chew, K.W.; Ting, H.Y.; Iwamoto, K.; Ruan, R.; Ma, Z.; Show, P.L. Artificial Intelligence-Driven Microalgae Autotrophic Batch Cultivation: A Comparative Study of Machine and Deep Learning-Based Image Classification Models. Algal Res. 2024, 79, 103400.
  42. Anuntakarun, S.; Lertampaiporn, S.; Laomettachit, T.; Wattanapornprom, W.; Ruengjitchatchawalya, M. MSRFR: A Machine Learning Model Using Microalgal Signature Features for NcRNA Classification. BioData Min. 2022, 15, 8.
  43. Chen, W.H.; Felix, C.B. Thermo-Kinetics Study of Microalgal Biomass in Oxidative Torrefaction Followed by Machine Learning Regression and Classification Approaches. Energy 2024, 301, 131677.
  44. Meenatchisundaram, K.; Gowd, S.C.; Lee, J.; Barathi, S.; Rajendran, K. Data-Driven Model Development for Prediction and Optimization of Biomass Yield of Microalgae-Based Wastewater Treatment. Sustain. Energy Technol. Assess. 2024, 63, 103670.
  45. Yew, G.Y.; Puah, B.K.; Chew, K.W.; Teng, S.Y.; Show, P.L.; Nguyen, T.H.P. Chlorella Vulgaris FSP-E Cultivation in Waste Molasses: Photo-to-Property Estimation by Artificial Intelligence. Chem. Eng. J. 2020, 402, 126230.
  46. Jung, W.S.; Jo, B.G.; Kim, Y.D. A Study on the Occurrence Characteristics of Harmful Blue-Green Algae in Stagnant Rivers Using Machine Learning. Appl. Sci. 2023, 13, 3699.
  47. Razzak, S.A.; Alam, M.S.; Hossain, S.M.Z.; Rahman, S.M. Tree-Based Machine Learning for Predicting Neochloris Oleoabundans Biomass Growth and Biological Nutrient Removal from Tertiary Municipal Wastewater. Chem. Eng. Res. Des. 2024, 210, 614–624.
  48. Singh, V.; Mishra, V. Exploring the Effects of Different Combinations of Predictor Variables for the Treatment of Wastewater by Microalgae and Biomass Production. Biochem. Eng. J. 2021, 174, 108129.
  49. Singh, V.; Verma, M.; Chivate, M.S.; Mishra, V. Machine Learning-Based Optimisation of Microalgae Biomass Production by Using Wastewater. J. Environ. Chem. Eng. 2023, 11, 111387.
  50. Coşgun, A.; Günay, M.E.; Yıldırım, R. Exploring the Critical Factors of Algal Biomass and Lipid Production for Renewable Fuel Production by Machine Learning. Renew. Energy 2021, 163, 1299–1317.
  51. Xu, Z.; Jiang, Y.; Ji, J.; Forsberg, E.; Li, Y.; He, S. Classification, Identification, and Growth Stage Estimation of Microalgae Based on Transmission Hyperspectral Microscopic Imaging and Machine Learning. Opt. Express 2020, 28, 30686.
  52. Cheng, F.; Porter, M.D.; Colosi, L.M. Is Hydrothermal Treatment Coupled with Carbon Capture and Storage an Energy-Producing Negative Emissions Technology? Energy Convers. Manag. 2020, 203, 112252.
  53. Lopez-Exposito, P.; Negro, C.; Blanco, A. Direct Estimation of Microalgal Flocs Fractal Dimension through Laser Reflectance and Machine Learning. Algal Res. 2019, 37, 240–247.
  54. López Expósito, P.; Blanco Suárez, A.; Negro Álvarez, C. Laser Reflectance Measurement for the Online Monitoring of Chlorella Sorokiniana Biomass Concentration. J. Biotechnol. 2017, 243, 10–15.
  55. Ning, H.; Li, R.; Zhou, T. Machine Learning for Microalgae Detection and Utilization. Front. Mar. Sci. 2022, 9.
  56. Harmon, J.; Mikami, H.; Kanno, H.; Ito, T.; Goda, K. Accurate Classification of Microalgae by Intelligent Frequency-Division-Multiplexed Fluorescence Imaging Flow Cytometry. OSA Contin. 2020, 3, 430.
  57. Chong, J.W.R.; Khoo, K.S.; Chew, K.W.; Ting, H.-Y.; Koji, I.; Show, P.L. Digitalised Prediction of Blue Pigment Content from Spirulina Platensis: Next-Generation Microalgae Bio-Molecule Detection. Algal Res. 2024, 83, 103642.
  58. Yeh, Y.C.; Syed, T.; Brinitzer, G.; Frick, K.; Schmid-Staiger, U.; Haasdonk, B.; Tovar, G.E.M.; Krujatz, F.; Mädler, J.; Urbas, L. Improving Microalgae Growth Modeling of Outdoor Cultivation with Light History Data Using Machine Learning Models: A Comparative Study. Bioresour. Technol. 2023, 390, 129882.
  59. Trizoglou, P.; Liu, X.; Lin, Z. Fault Detection by an Ensemble Framework of Extreme Gradient Boosting (XGBoost) in the Operation of Offshore Wind Turbines. Renew. Energy 2021, 179, 945–962.
  60. Magalhães, V.; Pinto, V.; Sousa, P.; Afonso, J.A.; Gonçalves, L.; Fernández, E.; Minas, G. A Portable and Low-Cost Optical Device for Pigment-Based Taxonomic Classification of Microalgae Using Machine Learning. Sens. Actuators B Chem. 2025, 423, 136819.
  61. Colkesen, I.; Ozturk, M.Y.; Altuntas, O.Y. Comparative Evaluation of Performances of Algae Indices, Pixel- and Object-Based Machine Learning Algorithms in Mapping Floating Algal Blooms Using Sentinel-2 Imagery. Stoch. Environ. Res. Risk Assess. 2024, 38, 1613–1634.
  62. Fu, Y.; Zhang, Q.; Tan, Z.; Yu, S.; Zhang, Y. Predicting Harvesting Efficiency of Microalgae with Magnetic Nanoparticles Using Machine Learning Models. J. Environ. Chem. Eng. 2025, 13, 115406.
  63. Meenatchi Sundaram, K.; Kumar, D.; Lee, J.; Barathi, S.; Rajendran, K. Time Series Forecasting of Microalgae Cultivation for a Sustainable Wastewater Treatment. Process Saf. Environ. Prot. 2025, 196, 106845.
  64. Syed, T.; Krujatz, F.; Ihadjadene, Y.; Mühlstädt, G.; Hamedi, H.; Mädler, J.; Urbas, L. A Review on Machine Learning Approaches for Microalgae Cultivation Systems. Comput. Biol. Med. 2024, 172, 108248.
  65. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
  66. Szelag, B.; González-Camejo, J.; Eusebi, A.L.; Barat, R.; Kiczko, A.; Fatone, F. Multi-Criteria Analysis of the Continuous Operation of a Membrane Photobioreactor to Treat Sewage: Modeling and Sensitivity Analysis. Chem. Eng. J. 2024, 496, 154202.
  67. You, H.; Ma, Z.; Tang, Y.; Wang, Y.; Yan, J.; Ni, M.; Cen, K.; Huang, Q. Comparison of ANN (MLP), ANFIS, SVM, and RF Models for the Online Classification of Heating Value of Burning Municipal Solid Waste in Circulating Fluidized Bed Incinerators. Waste Manag. 2017, 68, 186–197.
  68. Susanna, D.; Dhanapal, R.; Mahalingam, R.; Ramamurthy, V. Increasing Productivity of Spirulina Platensis in Photobioreactors Using Artificial Neural Network Modeling. Biotechnol. Bioeng. 2019, 116, 2960–2970.
  69. Bricaud, A.; Mejia, C.; Blondeau-Patissier, D.; Claustre, H.; Crepon, M.; Thiria, S. Retrieval of Pigment Concentrations and Size Structure of Algal Populations from Their Absorption Spectra Using Multilayered Perceptrons. Appl. Opt. 2007, 46, 1251–1260.
  70. Géron, A. Hands-on Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems; O’Reilly: Beijing, China, 2018; ISBN 9781491962299.
  71. Chong, J.W.R.; Khoo, K.S.; Chew, K.W.; Vo, D.V.N.; Balakrishnan, D.; Banat, F.; Munawaroh, H.S.H.; Iwamoto, K.; Show, P.L. Microalgae Identification: Future of Image Processing and Digital Algorithm. Bioresour. Technol. 2023, 369, 128418.
  72. Xu, W.; Niu, J.; Gan, W.; Gou, S.; Zhang, S.; Qiu, H.; Jiang, T. Identification of Paralytic Shellfish Toxin-Producing Microalgae Using Machine Learning and Deep Learning Methods. J. Oceanol. Limnol. 2022, 40, 2202–2217.
  73. Yadav, D.P.; Jalal, A.S.; Garlapati, D.; Hossain, K.; Goyal, A.; Pant, G. Deep Learning-Based ResNeXt Model in Phycological Studies for Future. Algal Res. 2020, 50, 102018.
  74. Pant, G.; Yadav, D.P.; Gaur, A. ResNeXt Convolution Neural Network Topology-Based Deep Learning Model for Identification and Classification of Pediastrum. Algal Res. 2020, 48, 101932.
  75. Kloster, M.; Langenkämper, D.; Zurowietz, M.; Beszteri, B.; Nattkemper, T.W. Deep Learning-Based Diatom Taxonomy on Virtual Slides. Sci. Rep. 2020, 10, 1–13.
  76. Zhuo, Z.; Wang, H.; Liao, R.; Ma, H. Machine Learning Powered Microalgae Classification by Use of Polarized Light Scattering Data. Appl. Sci. 2022, 12, 3422.
  77. Park, J.; Baek, J.; Kim, J.; You, K.; Kim, K. Deep Learning-Based Algal Detection Model Development Considering Field Application. Water 2022, 14, 1275.
  78. Otálora, P.; Guzmán, J.L.; Acién, F.G.; Berenguel, M.; Reul, A. Microalgae Classification Based on Machine Learning Techniques. Algal Res. 2021, 55, 102256.
  79. Baek, S.S.; Pyo, J.C.; Pachepsky, Y.; Park, Y.; Ligaray, M.; Ahn, C.Y.; Kim, Y.H.; Ahn Chun, J.; Hwa Cho, K. Identification and Enumeration of Cyanobacteria Species Using a Deep Neural Network. Ecol. Indic. 2020, 115, 106395.
  80. Park, J.; Lee, H.; Park, C.Y.; Hasan, S.; Heo, T.Y.; Lee, W.H. Algal Morphological Identification in Watersheds for Drinking Water Supply Using Neural Architecture Search for Convolutional Neural Network. Water 2019, 11, 1338.
  81. D’Orazio, M.; Gianangeli, A.; Monni, F.; Quagliarini, E. Automatic Monitoring of the Bio-Colonisation of Historical Building’s Facades through Convolutional Neural Networks (CNN). J. Cult. Herit. 2024, 70, 80–89.
  82. Sonmez, M.E.; Altinsoy, B.; Ozturk, B.Y.; Gumus, N.E.; Eczacioglu, N. Deep Learning-Based Classification of Microalgae Using Light and Scanning Electron Microscopy Images. Micron 2023, 172, 103506.
  83. Carloto, I.; Johnston, P.; Pestana, C.J.; Lawton, L.A. Detection of Morphological Changes Caused by Chemical Stress in the Cyanobacterium Planktothrix Agardhii Using Convolutional Neural Networks. Sci. Total Environ. 2021, 784, 146956.
  84. Nguyen, L.; Nguyen, D.K.; Nguyen, T.; Nghiem, T.X. Convolutional Neural Network Regression for Low-Cost Microalgal Density Estimation. e-Prime Adv. Electr. Eng. Electron. Energy 2024, 9, 100653.
  85. del Rio-Chanona, E.A.; Wagner, J.L.; Ali, H.; Fiorelli, F.; Zhang, D.; Hellgardt, K. Deep Learning-Based Surrogate Modeling and Optimization for Microalgal Biofuel Production and Photobioreactor Design. AIChE J. 2019, 65, 915–923.
  86. Yang, C.T.; Kristiani, E.; Leong, Y.K.; Chang, J.S. Machine Learning in Microalgae Biotechnology for Sustainable Biofuel Production: Advancements, Applications, and Prospects. Bioresour. Technol. 2024, 413, 131549.
  87. Omole, O.A.; Ogbaga, C.C.; Okolie, J.A.; Akande, O.; Kimera, R.; Dayil, J.L. Advancing Algal Biofuel Production through Data-Driven Insights: A Comprehensive Review of Machine Learning Applications. Comput. Chem. Eng. 2025, 196, 109049.
  88. Rodríguez-Rangel, H.; Arias, D.M.; Morales-Rosales, L.A.; Gonzalez-Huitron, V.; Partida, M.V.; García, J. Machine Learning Methods Modeling Carbohydrate-Enriched Cyanobacteria Biomass Production in Wastewater Treatment Systems. Energies 2022, 15, 2500.
  89. Syed, T.; Kalliadan, S.; Mädler, J.; Laukens, K.; Roef, L.; Urbas, L. LSTM-Based Soft Sensor for the Prediction of Microalgae Growth. In Computer Aided Chemical Engineering; Elsevier B.V.: Amsterdam, The Netherlands, 2024; Volume 53, pp. 3145–3150.
  90. Mazzelli, A.; Cicci, A.; Di Caprio, F.; Altimari, P.; Toro, L.; Iaquaniello, G.; Pagnanelli, F. Multivariate Modeling for Microalgae Growth in Outdoor Photobioreactors. Algal Res. 2020, 45, 101663.
  91. Davani, L.; Terenzi, C.; Tumiatti, V.; De Simone, A.; Andrisano, V.; Montanari, S. Integrated Analytical Approaches for the Characterization of Spirulina and Chlorella Microalgae. J. Pharm. Biomed. Anal. 2022, 219, 114943.
  92. Hegazi, N.; Khattab, A.R.; Saad, H.H.; Abib, B.; Farag, M.A. A Multiplex Metabolomic Approach for Quality Control of Spirulina Supplement and Its Allied Microalgae (Amphora & Chlorella) Assisted by Chemometrics and Molecular Networking. Sci. Rep. 2024, 14, 2809.
  93. Wei, P.J.; Pang, Z.Z.; Jiang, L.J.; Tan, D.Y.; Su, Y.S.; Zheng, C.H. Promoter Prediction in Nannochloropsis Based on Densely Connected Convolutional Neural Networks. Methods 2022, 204, 38–46.
  94. Suzuki, Y.; Kobayashi, K.; Wakisaka, Y.; Deng, D.; Tanaka, S.; Huang, C.-J.; Lei, C.; Sun, C.-W.; Liu, H.; Fujiwaki, Y.; et al. Label-Free Chemical Imaging Flow Cytometry by High-Speed Multicolor Stimulated Raman Scattering. Proc. Natl. Acad. Sci. USA 2019, 116, 15842–15848.
  95. Singh, V.; Mishra, V. Analysing the Effects of Culture Parameters on Wastewater Treatment Capability of Microalgae through Association Rule Mining. J. Environ. Chem. Eng. 2022, 10, 108444.
  96. Sancho, A.; Ribeiro, J.C.; Reis, M.S.; Martins, F.G. Cluster Analysis of Crude Oils with K-Means Based on Their Physicochemical Properties. Comput. Chem. Eng. 2022, 157, 107633.
  97. Sánchez, C.; Vállez, N.; Bueno, G.; Cristóbal, G. Diatom Classification Including Morphological Adaptations Using CNNs. In Proceedings of the Iberian Conference on Pattern Recognition and Image Analysis, Madrid, Spain, 1–4 July 2019; Springer: Berlin, Germany, 2019; pp. 317–328.
  98. Pozzobon, V.; Levasseur, W.; Viau, E.; Michiels, E.; Clément, T.; Perré, P. Machine Learning Processing of Microalgae Flow Cytometry Readings: Illustrated with Chlorella Vulgaris Viability Assays. J. Appl. Phycol. 2020, 32, 2967–2976.
  99. Manian, V.; Alfaro-Mejía, E.; Tokars, R.P. Hyperspectral Image Labeling and Classification Using an Ensemble Semi-Supervised Machine Learning Approach. Sensors 2022, 22, 1623.
  100. Drews-Jr, P.; Colares, R.G.; Machado, P.; de Faria, M.; Detoni, A.; Tavano, V. Microalgae Classification Using Semi-Supervised and Active Learning Based on Gaussian Mixture Models. J. Braz. Comput. Soc. 2013, 19, 411–422.
  101. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; Bradford Books: Cambridge, MA, USA, 2018.
  102. Petsagkourakis, P.; Sandoval, I.O.; Bradford, E.; Zhang, D.; del Rio-Chanona, E.A. Reinforcement Learning for Batch Bioprocess Optimization. Comput. Chem. Eng. 2020, 133, 106649.
  103. Pandian, B.J.; Noel, M.M. Control of a Bioreactor Using a New Partially Supervised Reinforcement Learning Algorithm. J. Process Control 2018, 69, 16–29.
  104. Treloar, N.J.; Fedorec, A.J.H.; Ingalls, B.; Barnes, C.P. Deep Reinforcement Learning for the Control of Microbial Co-Cultures in Bioreactors. PLoS Comput. Biol. 2020, 16, e1007783.
  105. Doan, Y.T.-T.; Ho, M.-T.; Nguyen, H.-K.; Han, H.-D. Optimization of Spirulina Sp. Cultivation Using Reinforcement Learning with State Prediction Based on LSTM Neural Network. J. Appl. Phycol. 2021, 33, 2733–2744.
  106. Tang, N.; Zhou, F.; Gu, Z.; Zheng, H.; Yu, Z.; Zheng, B. Unsupervised Pixel-Wise Classification for Chaetoceros Image Segmentation. Neurocomputing 2018, 318, 261–270.
  107. von Stosch, M.; Oliveira, R.; Peres, J.; Feyo de Azevedo, S. Hybrid Semi-Parametric Modeling in Process Systems Engineering: Past, Present and Future. Comput. Chem. Eng. 2014, 60, 86–101.
  108. Baughman, D.R.; Liu, Y.A. Neural Networks in Bioprocessing and Chemical Engineering; Academic Press: Cambridge, MA, USA, 2014.
  109. Zhang, D.; del Rio-Chanona, E.A.; Petsagkourakis, P.; Wagner, J. Hybrid Physics-Based and Data-Driven Modeling for Bioprocess Online Simulation and Optimization. Biotechnol. Bioeng. 2019, 116, 2919–2930.
  110. Thiviyanathan, V.A.; Ker, P.J.; Hoon Tang, S.G.; Amin, E.P.; Yee, W.; Hannan, M.A.; Jamaludin, Z.; Nghiem, L.D.; Indra Mahlia, T.M. Microalgae Biomass and Biomolecule Quantification: Optical Techniques, Challenges and Prospects. Renew. Sustain. Energy Rev. 2024, 189, 113926.
  111. Allouzi, M.M.A.; Allouzi, S.; Al-Salaheen, B.; Khoo, K.S.; Rajendran, S.; Sankaran, R.; Sy-Toan, N.; Show, P.L. Current Advances and Future Trend of Nanotechnology as Microalgae-Based Biosensor. Biochem. Eng. J. 2022, 187, 108653.
  112. Chuong, J.J.C.C.; Rahman, M.; Ibrahim, N.; Heng, L.Y.; Tan, L.L.; Ahmad, A. Harmful Microalgae Detection: Biosensors versus Some Conventional Methods. Sensors 2022, 22, 3144.
  113. Zhou, Z.; Tian, D.; Yang, Y.; Cui, H.; Li, Y.; Ren, S.; Han, T.; Gao, Z. Machine Learning Assisted Biosensing Technology: An Emerging Powerful Tool for Improving the Intelligence of Food Safety Detection. Curr. Res. Food Sci. 2024, 8, 100679.
  114. Huang, C.W.; Lin, C.; Nguyen, M.K.; Hussain, A.; Bui, X.T.; Ngo, H.H. A Review of Biosensor for Environmental Monitoring: Principle, Application, and Corresponding Achievement of Sustainable Development Goals. Bioengineered 2023, 14, 58–80.
  115. Naresh, V.; Lee, N. A Review on Biosensors and Recent Development of Nanostructured Materials-Enabled Biosensors. Sensors 2021, 21, 1109.
  116. Jin, X.; Liu, C.; Xu, T.; Su, L.; Zhang, X. Artificial Intelligence Biosensors: Challenges and Prospects. Biosens. Bioelectron. 2020, 165, 112412.
  117. Thapa, R.; Poudel, S.; Krukiewicz, K.; Kunwar, A. A Topical Review on AI-Interlinked Biodomain Sensors for Multi-Purpose Applications. Measurement 2024, 227, 114123.
  118. Wieser, W.; Assaf, A.A.; Le Gouic, B.; Dechandol, E.; Herve, L.; Louineau, T.; Dib, O.H.; Gonçalves, O.; Titica, M.; Couzinet-Mossion, A.; et al. Development and Application of an Automated Raman Sensor for Bioprocess Monitoring: From the Laboratory to an Algae Production Platform. Sensors 2023, 23, 9746.
  119. Solovchenko, A. Seeing Good and Bad: Optical Sensing of Microalgal Culture Condition. Algal Res. 2023, 71, 103071.
  120. Kim, H.S.; Devarenne, T.P.; Han, A. Microfluidic Systems for Microalgal Biotechnology: A Review. Algal Res. 2018, 30, 149–161.
  121. Sebastiá-Frasquet, M.T.; Aguilar-Maldonado, J.A.; Herrero-Durá, I.; Santamaría-del-Ángel, E.; Morell-Monzó, S.; Estornell, J. Advances in the Monitoring of Algal Blooms by Remote Sensing: A Bibliometric Analysis. Appl. Sci. 2020, 10, 7877.
  122. Stauffer, B.A.; Bowers, H.A.; Buckley, E.; Davis, T.W.; Johengen, T.H.; Kudela, R.; McManus, M.A.; Purcell, H.; Smith, G.J.; Woude, A.V.; et al. Considerations in Harmful Algal Bloom Research and Monitoring: Perspectives from a Consensus-Building Workshop and Technology Testing. Front. Mar. Sci. 2019, 6, 399.
  123. Zahir, M.; Su, Y.; Shahzad, M.I.; Ayub, G.; Rahman, S.U.; Ijaz, J. A Review on Monitoring, Forecasting, and Early Warning of Harmful Algal Bloom. Aquaculture 2024, 593, 741351.
  124. Plouviez, M.; Bhatia, N.; Shurygin, B.; Solovchenko, A. Advanced Imaging for Microalgal Biotechnology. Algal Res. 2024, 82, 103649.
  125. Barsanti, L.; Birindelli, L.; Gualtieri, P. Water Monitoring by Means of Digital Microscopy Identification and Classification of Microalgae. Environ. Sci. Process. Impacts 2021, 23, 1443–1457.
  126. Zhang, Y.; Li, J.; Zhou, Y.; Zhang, X.; Liu, X. Artificial Intelligence-Based Microfluidic Platform for Detecting Contaminants in Water: A Review. Sensors 2024, 24, 4350.
  127. Hou, C.; Zheng, X.; Song, Y.; Yu, Z.; Zhang, K.; Wang, J.; Zhou, X.; Zhang, Y.; Shen, Z. Prediction of Product Properties and Identification of Key Influencing Parameters in Microwave Pyrolysis of Microalgae Using Machine Learning. Algal Res. 2024, 82, 103662.
  128. Liu, D.; Yuan, G.; Tan, H.; Jiang, Y.; Bi, H.; Cheng, Y. AlgaeClass_Net: Optimizing Few-Shot Marine Microalgae Classification with Multi-Scale Feature Enhancement Network. IEEE Access 2024, 13, 16223–16237.
  129. Sonmez, M.E.; Eczacıoglu, N.; Gumuş, N.E.; Aslan, M.F.; Sabanci, K.; Aşikkutlu, B. Convolutional Neural Network—Support Vector Machine Based Approach for Classification of Cyanobacteria and Chlorophyta Microalgae Groups. Algal Res. 2022, 61, 102568.
  130. Correa, I.; Drews, P.; Botelho, S.; De Souza, M.S.; Tavano, V.M. Deep Learning for Microalgae Classification. In Proceedings of the 16th IEEE International Conference on Machine Learning and Applications (ICMLA), Cancun, Mexico, 18–21 December 2017; Institute of Electrical and Electronics Engineers Inc.: New York, NY, USA, 2017.
  131. Abdullah; Ali, S.; Khan, Z.; Hussain, A.; Athar, A.; Kim, H.C. Computer Vision Based Deep Learning Approach for the Detection and Classification of Algae Species Using Microscopic Images. Water 2022, 14, 2219.
  132. Lyu, Y.; Chen, J.; Song, Z.; Zhang, Q. Synthesizing Data by Transferring Information in Data-Intensive Regions to Enhance Process Monitoring Performance in Data-Scarce Region. Can. J. Chem. Eng. 2021, 99, S521–S539.
  133. Hao, X.; Liu, L.; Yang, R.; Yin, L.; Zhang, L.; Li, X. A Review of Data Augmentation Methods of Remote Sensing Image Target Recognition. Remote Sens. 2023, 15, 827.
  134. Chen, M.; Chen, Y.; Zhang, Q. Assessing Global Carbon Sequestration and Bioenergy Potential from Microalgae Cultivation on Marginal Lands Leveraging Machine Learning. Sci. Total Environ. 2024, 948, 174462.
  135. Hu, W.; Su, S.; Mohamed, H.F.; Xiao, J.; Kang, J.; Krock, B.; Xie, B.; Luo, Z.; Chen, B. Assessing the Global Distribution and Risk of Harmful Microalgae: A Focus on Three Toxic Alexandrium Dinoflagellates. Sci. Total Environ. 2024, 948, 174767.
  136. Andriopoulos, V.; Kornaros, M. LASSO Regression with Multiple Imputations for the Selection of Key Variables Affecting the Fatty Acid Profile of Nannochloropsis Oculata. Mar. Drugs 2023, 21, 483.
  137. Ching, P.M.L.; Mayol, A.P.; Juan, J.L.G.S.; So, R.H.Y.; Sy, C.L.; Mandia, E.; Ubando, A.T.; Culaba, A.B. Early Prediction of Spirulina Platensis Biomass Yield for Biofuel Production Using Machine Learning. Clean Technol. Environ. Policy 2022, 24, 2283–2293.
  138. Teixeira, A.P.; Alves, C.; Alves, P.M.; Carrondo, M.J.T.; Oliveira, R. Hybrid Elementary Flux Analysis/Nonparametric Modeling: Application for Bioprocess Control. BMC Bioinform. 2007, 8, 30.
  139. von Stosch, M.; Carinhas, N.; Oliveira, R. Chapter 7—Hybrid Modeling for Systems Biology: Theory and Practice. In Large-Scale Networks in Engineering and Life Sciences; Benner, P., Findeisen, R., Flockerzi, D., Reichl, U., Sundmacher, K., Eds.; Springer International Publishing: Cham, Switzerland, 2014.
  140. von Stosch, M.; Oliveira, R.; Peres, J.; Feyo de Azevedo, S. A General Hybrid Semi-Parametric Process Control Framework. J. Process Control 2012, 22, 1171–1181.
  141. Härdle, W.; Müller, M.; Sperlich, S.; Werwatz, A. Nonparametric and Semiparametric Models; Springer: Berlin/Heidelberg, Germany, 2004.
  142. Tulleken, H.J.A.F. Grey-Box Modelling and Identification Using Physical Knowledge and Bayesian Techniques. Automatica 1993, 29, 285–308.
  143. Rogers, A.W.; Cardenas, I.O.S.; del Rio-Chanona, E.A.; Zhang, D. Investigating Physics-Informed Neural Networks for Bioprocess Hybrid Model Construction. Comput. Aided Chem. Eng. 2023, 52, 83–88.
  144. Montague, G.; Morris, J. Neural-Network Contributions in Biotechnology. Trends Biotechnol. 1994, 12, 312–324.
  145. Pinto, J.; Mestre, M.; Ramos, J.; Costa, R.S.; Striedner, G.; Oliveira, R. A General Deep Hybrid Model for Bioreactor Systems: Combining First Principles with Deep Neural Networks. Comput. Chem. Eng. 2022, 165, 107952.
  146. Ramos, J.R.C.; Pinto, J.; Poiares-Oliveira, G.; Peeters, L.; Dumas, P.; Oliveira, R. Deep Hybrid Modeling of a HEK293 Process: Combining Long Short-Term Memory Networks with First Principles Equations. Biotechnol. Bioeng. 2024, 121, 1554–1568.
  147. Pinto, J.; Ramos, J.R.C.; Costa, R.S.; Rossell, S.; Dumas, P.; Oliveira, R. Hybrid Deep Modeling of a CHO-K1 Fed-Batch Process: Combining First-Principles with Deep Neural Networks. Front. Bioeng. Biotechnol. 2023, 11, 1237963.
  148. Wu, W.; Huang, C.M.; Tsai, Y.H. Design and Validation of a Microalgae Biorefinery Using Machine Learning-Assisted Modeling of Hydrothermal Liquefaction. Algal Res. 2023, 74, 103230.
  149. Thompson, M.L.; Kramer, M.A. Modeling Chemical Processes Using Prior Knowledge and Neural Networks. AIChE J. 1994, 40, 1328–1340.
  150. Lee, D.S.; Jeon, C.O.; Park, J.M.; Chang, K.S. Hybrid Neural Network Modeling of a Full-Scale Industrial Wastewater Treatment Process. Biotechnol. Bioeng. 2002, 78, 670–682.
  151. Walsh, I.; Myint, M.; Nguyen-Khuong, T.; Ho, Y.S.; Ng, S.K.; Lakshmanan, M. Harnessing the Potential of Machine Learning for Advancing “Quality by Design” in Biomanufacturing. MAbs 2022, 14, 2013593.
  152. Mahanty, B. Hybrid Modeling in Bioprocess Dynamics: Structural Variabilities, Implementation Strategies, and Practical Challenges. Biotechnol. Bioeng. 2023, 120, 2072–2091.
  153. Kay, S.; Kay, H.; Rogers, A.W.; Zhang, D. Integrating Hybrid Modelling and Transfer Learning for New Bioprocess Predictive Modelling. In Proceedings of the 33rd European Symposium on Computer Aided Chemical Engineering (ESCAPE33), Athens, Greece, 18–21 June 2023; Elsevier B.V.: Burlington, MA, USA, 2023; Volume 52, pp. 2595–2600.
  154. Wang, H.; Kontoravdi, C.; del Rio Chanona, E.A. A Hybrid Modelling Framework for Dynamic Modelling of Bioprocesses. In Proceedings of the 33rd European Symposium on Computer Aided Chemical Engineering (ESCAPE33), Athens, Greece, 18–21 June 2023; Elsevier B.V.: Burlington, MA, USA, 2023; Volume 52, pp. 469–474.
  155. Shahhoseyni, S.; Greco, L.; Sivaram, A.; Mansouri, S.S. A Reduced-Order Hybrid Model for Photobioreactor Performance and Biomass Prediction. Algal Res. 2024, 84, 103750.
  156. del Rio-Chanona, E.A.; Ahmed, N.; Zhang, D.; Lu, Y.; Jing, K. Kinetic Modeling and Process Analysis for Desmodesmus Sp. Lutein Photo-Production. AIChE J. 2017, 63, 2546–2554.
  157. del Rio-Chanona, E.A.; Dechatiwongse, P.; Zhang, D.; Maitland, G.C.; Hellgardt, K.; Arellano-Garcia, H.; Vassiliadis, V.S. Optimal Operation Strategy for Biohydrogen Production. Ind. Eng. Chem. Res. 2015, 54, 6334–6343.
  158. Oruganti, R.K.; Biji, A.P.; Lanuyanger, T.; Show, P.L.; Sriariyanun, M.; Upadhyayula, V.K.K.; Gadhamshetty, V.; Bhattacharyya, D. Artificial Intelligence and Machine Learning Tools for High-Performance Microalgal Wastewater Treatment and Algal Biorefinery: A Critical Review. Sci. Total Environ. 2023, 876, 162797.
  159. Prayaga, L.; Devulapalli, K.; Prayaga, C.; Wade, A.; Reddy, G.S.; Pola, S.S.H. Integrating Unsupervised and Supervised ML Models for Analysis of Synthetic Data From VAE, GAN, and Clustering of Variables. Int. J. Data Anal. 2024, 5, 1–19.
  160. Santos, M.S.; Pereira, R.C.; Costa, A.F.; Soares, J.P.; Santos, J.; Abreu, P.H. Generating Synthetic Missing Data: A Review by Missing Mechanism. IEEE Access 2019, 7, 11651–11667.
Figure 1. Number of publications per year retrieved from the Elsevier Scopus database for the keywords selected in this survey.
Figure 2. Network map of keywords generated using the VOSviewer software. The keywords displayed in the map were found in the scientific papers of the search performed in this study.
Figure 3. Data mining field overlapping with other fields (adapted from [31]).
Figure 4. The process of data mining (adapted from [31]).
Figure 5. Categories of ML methods (adapted from [17]).
Figure 6. Schematic representation of serial and parallel configurations in HM. White boxes represent the parametric models, and black boxes represent the nonparametric models. (A) Serial configuration in which the outputs of the parametric model serve as inputs to the nonparametric model; (B) serial configuration in which the outputs of the nonparametric model serve as inputs to the parametric model; and (C) parallel configuration in which the outputs of both models are combined by some function or operator, such as multiplication or summation (adapted from [139]).
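The three configurations in Figure 6 can be illustrated with a minimal Python sketch. It is not taken from the reviewed studies: the Monod expression standing in for the parametric model, the fixed tanh unit standing in for a trained nonparametric model, and all function names and parameter values (`monod_rate`, `black_box`, `mu_max`, `k_s`, `weight`, `bias`, `dt`) are hypothetical choices made only to show how the two model types are wired together.

```python
import math


def monod_rate(substrate, mu_max=1.0, k_s=0.5):
    """Parametric (white-box) model: Monod specific growth rate (assumed example)."""
    return mu_max * substrate / (k_s + substrate)


def black_box(x, weight=0.8, bias=0.1):
    """Stand-in for a trained nonparametric model, e.g., a small neural network.

    The weight and bias are arbitrary illustrative values, not fitted ones.
    """
    return math.tanh(weight * x + bias)


def serial_a(substrate):
    """(A) The output of the parametric model is the input of the nonparametric model."""
    return black_box(monod_rate(substrate))


def serial_b(light, biomass, dt=0.1):
    """(B) The nonparametric output feeds the parametric model.

    The black box estimates the specific growth rate mu, which is inserted into
    the biomass balance dX/dt = mu * X (one explicit Euler step of size dt).
    """
    mu = black_box(light)
    return biomass + dt * mu * biomass


def parallel(substrate, combine=lambda p, n: p + n):
    """(C) Both models receive the same input; an operator combines their outputs."""
    return combine(monod_rate(substrate), black_box(substrate))


if __name__ == "__main__":
    print("serial A:", serial_a(0.5))
    print("serial B:", serial_b(light=0.5, biomass=2.0))
    print("parallel (sum):", parallel(0.5))
    print("parallel (product):", parallel(0.5, combine=lambda p, n: p * n))
```

In practice the nonparametric block would be trained on process data, and in configuration (B) the balance equation would normally be integrated over the whole cultivation period rather than a single step.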
Table 1. Comparison of different supervised learning methods and their performances when using microalgae-based datasets.
ML ModelTaskMicroalgae Classes/SpeciesDataset/ModalitiesTarget(s)Performance Metrics/OutputsReference
MLRRegressionChlorococcum littoraleTabular data: temperature and light intensityGrowth rateAverage Relative Deviation was around 6.6%[34]
MLRRegressionChlorella sp., Scenedesmus sp., Nannochloropsis sp.Spectral data: NIR spectraLipid contentR2 of 0.86 and 0.77 when NIR spectra with wavelength of 1725 and 2305 cm−1 were used, respectively[35]
PCRRegressionScenedesmus sp. AMDDSpectral data: fluorescence emission spectraProtein concentrationR2 of 0.8 was obtained from the more complex PCR model[37]
PCRRegressionScenedesmus subspicatus, Neochloris oleobundansSpectral data: FT-IR spectraSolid analyte concentrationRelative deviations were around 7% and 8% when spectra with wavelength of 2901 and 1595 cm−1 were used, respectively[38]
kNN | Classification | Chlorella vulgaris | Imaging data and morphological properties | Living and dead microalgae cell category | Area Under the Curve (AUC) of 79.5%, accuracy of 74.3%, precision of 76.6%, and recall of 77.2% | [40]
kNN | Classification | Chlorella vulgaris, Chlamydomonas reinhardtii, Arthrospira platensis | Morphological and texture descriptors | Microalgae species designation | Accuracy of 96.93%, precision of 96.16%, recall of 96.08%, and F1-score of 96.09% | [41]
kNN | Regression | Chlorella vulgaris | Imaging data and tabular data: total lipids, proteins, and carbohydrates | Biomass concentration, nitrate, and pH | Average Deviation was around 0.10 | [45]
kNN | Regression | Scenedesmus sp., Chlorella sp. | Tabular data: pH, retention times, and phosphate, nitrate, and nitrite concentrations | Biomass yield | R² of 0.94, and MAE was around 0.13 | [44]
DT | Classification | Chlorella vulgaris | Imaging data and morphological properties | Living and dead microalgae cell category | AUC of 85.1%, accuracy of 77.4%, precision of 79.2%, and recall of 80.5% | [40]
DT | Classification | Chlorophyceae, Cyanophyceae, Eustigmatophyceae, Trebouxiophyceae, Chlorodendrophyceae, Xanthophyceae | Categorical data: microalgae class and reactor type; numerical data: CO₂ content and pH | Microalgae biomass production and growth rate | Accuracy on biomass production of 81.25% | [49]
DT | Classification | Phaeocystis, Chlamydomonas, Chaetoceros | Hyperspectral imaging | Growth stage of microalgae in a growth cycle | Accuracy of 97.5% | [51]
DT | Regression | Scenedesmus sp., Chlorella sp. | Tabular data: pH, retention times, and phosphate, nitrate, and nitrite concentrations | Biomass yield | R² of 0.91; however, abnormal deviations from the prediction line were present | [44]
DT | Regression | Multiple microalgae species | Properties of magnetic nanoparticles, conditions of magnetic flocculation, and properties of microalgae: biomass concentration and cell diameter | Harvesting efficiency of microalgae in magnetic flocculation | R² of 0.85 and MAE higher than 6% | [62]
RF | Classification | Phaeocystis, Chlamydomonas, Chaetoceros | Hyperspectral imaging | Growth stage of microalgae in a growth cycle | Accuracy of 98.1% | [51]
RF | Classification | Chlorella vulgaris | Imaging data and morphological properties | Living and dead microalgae cell category | AUC of 85.6%, accuracy of 77.7%, precision of 79.3%, and recall of 80.6% | [40]
RF | Classification | Bacillariophyta, Chlorophyta, Ochrophyta, Miozoa, Haptophyta, Cryptophyta, Cyanobacteria | Spectral data: fluorescence spectra | Identification of microalgae at the phylum level | Accuracy of 29% for training dataset | [60]
RF | Regression | Scenedesmus sp., Chlorella sp. | Tabular data: pH, retention times, phosphate, nitrate, and nitrite concentrations | Biomass yield | R² of 0.79 but presence of abnormal deviations | [44]
RF | Regression | Chlorella sorokiniana | Imaging data: microscopic images; properties: suspension chord length distribution and floc average geometry | Average fractal dimension of microalgae flocs | R² of 0.98 and RMSE of 0.003 | [53]
RF | Regression | Multiple microalgae species | Properties of magnetic nanoparticles, conditions of magnetic flocculation, and properties of microalgae: biomass concentration and cell diameter | Harvesting efficiency of microalgae in magnetic flocculation | R² of 0.9 and MAE higher than 5% | [62]
SVM | Classification | Chlorella vulgaris, Chlamydomonas reinhardtii, Arthrospira platensis | Morphological and texture descriptors | Microalgae species designation | Accuracy of 97.63%, precision of 97.81%, recall of 97.78%, and F1-score of 97.79% | [41]
SVM | Classification | Scenedesmus aff. acutus, Gloeomonas anomalipyrenoide, Chlamydomonas reinhardtii, Hamakko caudatus, Chlorella sorokiniana, Haematococcus lacustris | Morphological properties and imaging data: frequency-division-multiplexed fluorescence imaging | Identification of spherical microalgal species | High classification accuracy of 99.8% | [56]
SVM | Classification | Phaeocystis, Chlamydomonas, Chaetoceros | Hyperspectral imaging | Growth stage of microalgae in a cycle | Accuracy of 94.4% | [51]
SVM | Classification | Bacillariophyta, Chlorophyta, Ochrophyta, Miozoa, Haptophyta, Cryptophyta, Cyanobacteria | Spectral data: fluorescence spectra | Identification of microalgae at the phylum level | Accuracy of 93% for training dataset and 89% for test dataset | [60]
SVR | Regression | Phaeodactylum tricornutum | Time series data: biomass concentration, incident light intensity, and light history | Specific growth rate | R² of 0.87 and RMSE of 0.0315 | [58]
SVR | Regression | Arthrospira platensis | Imaging data: microscopic images | Blue pigment content | R² of 0.9903 | [57]
SVR | Regression | Scenedesmus sp., Chlorella sp. | Tabular data: pH, retention times, and phosphate, nitrate, and nitrite concentrations | Biomass yield | R² of 0.98 | [44]
XGBoost | Classification | Bacillariophyta, Chlorophyta, Ochrophyta, Miozoa, Haptophyta, Cryptophyta, Cyanobacteria | Spectral data: fluorescence spectra | Identification of microalgae at the phylum level | Accuracy of 92% and 97% for training and test datasets, respectively, and a weighted average of 97% and 98% for recall and precision, respectively | [60]
XGBoost | Classification | Not available | Time series data and imaging data | Mapping dense floating blooms | Overall accuracy was between 94% and 98% | [61]
XGBoost | Regression | Multiple microalgae species | Properties of microalgae, properties of magnetic nanoparticles, and conditions of magnetic flocculation | Harvesting efficiency in magnetic flocculation | R² of 0.932, RMSE of 6.96%, and MAE of 4.17% | [62]
XGBoost | Regression | Scenedesmus sp. | Time series data: temperature, pH, and light intensity | Microalgae biomass yield and growth curve | R² of 0.3 for the test dataset | [63]
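The performance column above mixes several metrics (R², RMSE, MAE, accuracy, precision, recall, F1-score), and some entries pair a high R² with a remark about abnormal deviations, so it can help to see how these quantities are computed side by side. The following is a minimal sketch using their standard definitions. The function names and the sample values are hypothetical and are not taken from any of the cited studies.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R^2, RMSE, and MAE for paired observations and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return {
        "r2": float(1.0 - ss_res / ss_tot),
        "rmse": float(np.sqrt(np.mean(residuals ** 2))),
        "mae": float(np.mean(np.abs(residuals))),
    }

def binary_classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for 0/1 labels (1 = positive).

    Assumes at least one true positive, so no denominator is zero.
    """
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = int(np.sum(y_true & y_pred))
    tn = int(np.sum(~y_true & ~y_pred))
    fp = int(np.sum(~y_true & y_pred))
    fn = int(np.sum(y_true & ~y_pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical values, for illustration only (not from any cited study).
reg = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
clf = binary_classification_metrics([1, 1, 1, 0, 0, 0, 1, 0],
                                    [1, 1, 0, 0, 0, 1, 1, 0])
print(reg)
print(clf)
```

Because R² is a ratio of summed squared errors, a few large outliers can coexist with a high value, which is one reason to report RMSE or MAE alongside it.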
Table 3. Comparison of unsupervised, semi-supervised, and reinforcement learning methods and their performances when using microalgae-based datasets.
ML Model | Task | Microalgae Classes/Species | Dataset/Modalities | Target(s) | Performance Metrics/Outputs | Reference
UL | Gray-scale surface direction angle model (GSDAM) and Canny combined with deep convolutional neural network (DCNN) | Chaetoceros | Imaging data: microscopic images | Automatic segmentation of microalgae | Boundary Displacement Error, Probabilistic Rand Index (PRI), and F1 measure of 70.6359, 0.8569, and 0.6928, respectively | [106]
SSL combined with AL | Gaussian mixture model combined with expectation-maximization algorithm and AL | Phytoplankton | Imaging data | Identification of microalgae | Accuracy of 92% | [100]
UL | CNN combined with HAC | Diatoms | Imaging data | Identification of diatoms in their life cycle | PRI and homogeneity of 0.9959 and 0.9951, respectively, and accuracy of 99.07% | [97]
Ensemble SSL | SVM combined with gradient boosting (GB), Gaussian classifier (GC), and linear perceptron | HAB and surface scum | Imaging data: hyperspectral images | Identification of cyanobacteria | Accuracy of 99.92% using GB for 3 PCA bands | [99]
UL | CNN combined with t-SNE | Euglena gracilis | Imaging data: microscopy images | Identification and classification of different cultures | Accuracy greater than 99% for all of the cultures | [94]
UL | DBSCAN | Chlorella vulgaris | Data obtained from flow cytometry readings | Automatic segregation and microalgae identification | Results with −0.10% absolute deviation compared to human processing | [98]
RL | RL combined with LSTM network | Arthrospira sp. HH | Numeric sensor data | Optimization of the dry-weight biomass production and prediction of light intensity | Achieved 17% higher biomass yield compared to traditional methods and 10% higher yield compared to the threshold-based method | [105]
UL | Hierarchical cluster analysis (HCA) and PCA | Arthrospira, Amphora, Chlorella | Spectral data: gas chromatography–mass spectrometry profiles | Assessment of the metabolome similarity of the three microalgae species | PCA explained 95.9% of the total variance present in the dataset, while HCA was able to identify 2 clusters | [92]
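The "variance explained" figure quoted for PCA in the last row can be made concrete with a short calculation. The sketch below computes the explained-variance ratio of each principal component from the singular values of the mean-centered data matrix. The synthetic `profiles` array (30 samples, 10 features, two dominant latent factors) is an assumption made purely for illustration and does not reproduce the metabolome data of the cited study.

```python
import numpy as np

def pca_explained_variance_ratio(data):
    """Fraction of total variance captured by each principal component."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)           # PCA operates on centered data
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2              # proportional to component variances
    return variances / variances.sum()

# Synthetic stand-in for spectral profiles: two latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(30, 2))
loadings = rng.normal(size=(2, 10))
profiles = latent @ loadings + 0.05 * rng.normal(size=(30, 10))

ratio = pca_explained_variance_ratio(profiles)
print("Variance explained by the first two components:", ratio[:2].sum())
```

A statement such as "PCA explained 95.9% of the total variance" corresponds to the cumulative sum of these ratios over the retained components.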
Table 4. Summary of HM studies of microalgae production systems.
Mechanistic Model | Data-Driven Model | Description | Performance Metrics/Outputs | Reference
Kinetic model is determined by an automatic model structure identification method | 2nd-degree polynomial regression | Parallel HM scheme consisting of a kinetic model to describe the overall dynamic trajectory of the process and a data-driven model to estimate the differences between the outcomes of the kinetic model and the process data. | Fitting error was 1.7%, 4.6%, and 2.9% for concentrations of microalgae biomass, nitrate, and lutein, respectively | [23]
Kinetic model is designed to account for the effects of light intensity and nitrate concentration | ANN with 4, 20, 15, and 3 neurons in the input, first hidden, second hidden, and output layers, respectively | HM approach was used to predict the interaction of light intensity, nitrate concentration, and attenuation on biomass growth and lutein production, and to administer the optimal actions related to nitrate inflow rate. | HM provided high predictive and flexible capabilities with deviations of 5.1%, 11.7%, and 2.6% for the concentration of biomass, nitrate, and lutein, respectively | [109]
Kinetic model of lutein production based on modified Monod kinetics | ANN with 3, 9, and 2 neurons in the input, hidden, and output layers, respectively | HM approach was used to predict biomass growth and lutein production of the microalgae under various operating conditions. Transfer learning was applied to update the HM using limited process data from a newly isolated strain. | HM captured the underlying mechanisms involved in the evolution of biomass, substrate, and product concentration over time to a considerable extent | [153]
Mechanistic backbone is designed by applying a mass balance on microalgae biomass and nutrients | 3rd-degree polynomial regression | HM approach was proposed and tested on a microalgae case study, in which different statistical information criteria were used to discriminate the best HM structure under different noise levels. | HM framework showed good prediction capabilities under different noise levels | [154]
Kinetic model is based on Monod equation with the inclusion of light intensity as a variable | Polynomial regression with light-dependent terms | HM approach was used for photobioreactor modeling tailored to microalgae cultivation. Kinetic model incorporated light intensity as a key decision variable, while polynomial regression was used to calculate the optimal set of model coefficients. | HM presented better prediction with higher R² and lower MAPE compared to the model that did not incorporate light intensity | [155]
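The parallel HM scheme in the first row, where a kinetic model describes the overall trajectory and a polynomial regression estimates the mismatch with process data, can be illustrated with a small sketch. Everything below is an assumption chosen for illustration: the batch Monod model, its parameter values, the synthetic "measurements", and the choice to regress the residual on time. None of it reproduces the model structure, inputs, or data of the cited studies.

```python
import numpy as np

def simulate_monod(mu_max, k_s, yield_xs, x0, s0, t_grid):
    """Explicit-Euler integration of batch Monod growth (biomass x, substrate s)."""
    x = np.empty(len(t_grid))
    s = np.empty(len(t_grid))
    x[0], s[0] = x0, s0
    for i in range(1, len(t_grid)):
        dt = t_grid[i] - t_grid[i - 1]
        mu = mu_max * s[i - 1] / (k_s + s[i - 1])
        # Biomass formed in this step, capped by the substrate still available.
        growth = min(mu * x[i - 1] * dt, yield_xs * s[i - 1])
        x[i] = x[i - 1] + growth
        s[i] = s[i - 1] - growth / yield_xs
    return x, s

def rmse(a, b):
    """Root-mean-square error between two equally sized arrays."""
    return float(np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2)))

t = np.linspace(0.0, 10.0, 51)

# Synthetic "process data": the plant grows faster than the model assumes, plus noise.
rng = np.random.default_rng(1)
x_plant, _ = simulate_monod(0.60, 0.5, 0.5, 0.1, 5.0, t)
x_measured = x_plant + rng.normal(scale=0.02, size=t.size)

# Mechanistic part: same structure, but with an imperfect growth-rate parameter.
x_mech, _ = simulate_monod(0.50, 0.5, 0.5, 0.1, 5.0, t)

# Data-driven part: 2nd-degree polynomial fitted to the model-data mismatch.
# Regressing on time is an illustrative simplification.
coeffs = np.polyfit(t, x_measured - x_mech, deg=2)

# Parallel hybrid prediction: mechanistic output plus learned correction.
x_hybrid = x_mech + np.polyval(coeffs, t)

rmse_mech = rmse(x_measured, x_mech)
rmse_hybrid = rmse(x_measured, x_hybrid)
print(f"RMSE mechanistic: {rmse_mech:.4f}, RMSE hybrid: {rmse_hybrid:.4f}")
```

Because the correction is fitted and evaluated on the same data here, the improvement is an in-sample result. A realistic assessment would evaluate the hybrid model on batches that were not used for fitting.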

Freitas, G.R.; Badenes, S.; Oliveira, R.; Martins, F.G. A Review of Intelligent Modeling for Microalgae Systems: Integrating Data Mining, Machine Learning, and Hybrid Approaches. Processes 2025, 13, 2956. https://doi.org/10.3390/pr13092956
