Conceptual Framework for a Machine Learning-Based Algorithmic Model for Early-Stage Business Idea Evaluation

Chahuán-Jiménez, Karime; Garrido-Araya, Dominique; Román, Carlos Escobedo

doi:10.3390/su172210124

by Karime Chahuán-Jiménez ¹, Dominique Garrido-Araya ²,* and Carlos Escobedo Román ³

¹ Centro de Investigación en Negocios y Gestión Empresarial, Escuela de Auditoría, Facultad de Ciencias Económicas y Administrativas, Universidad de Valparaíso, Valparaíso 2361891, Chile

² Escuela de Auditoría, Facultad de Ciencias Económicas y Administrativas, Universidad de Valparaíso, Valparaíso 2361891, Chile

³ Escuela de Ingeniería Informática, Facultad de Ingeniería, Universidad de Valparaíso, Valparaíso 2362905, Chile

* Author to whom correspondence should be addressed.

Sustainability 2025, 17(22), 10124; https://doi.org/10.3390/su172210124

Submission received: 13 October 2025 / Revised: 10 November 2025 / Accepted: 10 November 2025 / Published: 12 November 2025

Abstract

This research proposes an algorithmic machine learning framework aimed at the early evaluation of business ideas. The framework integrates fifteen critical variables organized into five dimensions—innovation, sustainability, the entrepreneurial team, scalability, and initial finances—identified from a systematic review of the literature. Unlike traditional approaches that focus on financial metrics or one-dimensional indicators, this model provides a comprehensive, multidimensional view of entrepreneurial viability in uncertain contexts. Methodologically, the study presents a structured pipeline that incorporates Random Forest, Gradient Boosting, and XGBoost ensemble algorithms, as well as SMOTE data balancing techniques. These techniques address common problems, such as class imbalance and generalization limitations. Theoretically, innovation and sustainability constructs are operationalized alongside entrepreneurial and financial factors, contributing to more consistent, integrative evaluation models. In practical terms, this proposal provides incubators, accelerators, and public policy designers with a replicable and adaptable tool for the early stages of entrepreneurship. While empirical validation is planned for the future, this work lays the methodological groundwork to bridge gaps in the literature and advance more robust predictive models for entrepreneurial evaluation.

Keywords: machine learning; SMOTE; business ideas; entrepreneurship; innovation; sustainability

1. Introduction

Entrepreneurship is a fundamental driver of global economic growth and social development. It stimulates markets, promotes competition, and accelerates the spread of innovation. Its positive impact stems not from the aggregate volume of initiatives, but from the quality of projects geared toward opportunity and capable of generating sustainable value [1,2]. Recent literature emphasizes that entrepreneurial ecosystems significantly increase productivity and economic resilience when supported by strong institutions and regulatory frameworks that favor innovation [3,4,5].

The early evaluation of business ideas is positioned as an essential strategic process to reduce uncertainty, allocate resources efficiently, and increase the likelihood of entrepreneurial projects’ success and scalability. Recent research highlights that the ability to filter, select, and validate opportunities in the early stages directly impacts business survival and the generation of sustainable value [6,7,8]. In this context, developing analytical models and automated tools can strengthen entrepreneurial decision making and contribute to more robust innovation ecosystems.

Entrepreneurial ventures with a high degree of innovation are recognized as a determining factor in long-term growth because they drive business competitiveness and facilitate the transition to more sustainable and digital economies. Empirical evidence from multinational studies confirms a direct relationship between investment in innovation and the expansion of the gross domestic product (GDP). This highlights the cross-cutting effect of innovation on productive sectors and the ability of companies to adapt to technological and environmental changes [9,10,11]. Innovation allows businesses to differentiate their products and services and face scenarios of disruption and structural transformation more effectively.

The current business environment is marked by high levels of uncertainty and turbulence. These conditions stem from rapid technological changes and global shocks, such as financial crises, health problems, geopolitical conflicts, environmental impacts, and intensified competitive dynamics. These factors create volatile conditions that challenge market stability and an organization’s ability to maintain a competitive advantage over time. In this scenario, companies must develop dynamic capabilities to adapt, reorganize, and adjust their resources quickly in response to changing contexts [12,13]. Consequently, evaluating a business idea early plays a strategic role, as it allows anticipation of failure risks and the selection of initiatives with greater potential for economic, innovative, and sustainable viability.

The use of machine learning (ML) techniques to predict business success has received increasing attention. Several studies have demonstrated that algorithms such as Random Forest, Gradient Boosting, XGBoost, and neural networks are more effective than traditional methods at anticipating the results of startups and new companies [7,14,15]. They also show that variables such as the experience of the founding team, capacity for innovation, business model, and interaction with support networks are crucial factors in startup performance.

Previous research has tended to focus on specific dimensions, primarily financial or innovation-related, while paying less attention to cross-cutting issues. Although this approach has led to significant advances in prediction accuracy, it limits the ability to make holistic diagnoses applicable to various scenarios [13].

Combining different data types in hybrid approaches often leads to more accurate results. For example, Zhang and Lau [16] reported an accuracy of 82.2% using TextCNN for multimodal crowdfunding analysis, and Sadia and Cheng [17] developed CrunchLLM, a model tailored specifically for entrepreneurship data with an accuracy exceeding 80%. These advances suggest that integrating textual analysis with structured variables is a promising frontier.

However, significant methodological limitations include limited sample sizes [18], survival bias in historical datasets [7], and geographical concentration in developed ecosystems [19]. These limitations restrict the generalizability and applicability of existing models, thus justifying the development of more robust frameworks. Survival bias occurs when analyses only consider active companies, omitting those that failed or disappeared early on. These issues lead to an overestimation of the probability of success and the identification of incomplete predictive factors [20,21].

In addition, growing market turbulence and the high complexity of competitive environments have been identified as factors that affect the validity of traditional prediction models. In response, several authors have proposed the need for algorithms that can process large volumes of heterogeneous data and adapt dynamically to changing environments [4,12].

In practice, digital platforms aimed at supporting entrepreneurs tend to focus on financial diagnostics or operational guidelines without providing early assessments based on advanced predictive models. Recognized limitations include the lack of data updates, low scalability, and absence of automation [2,22].

Most of the developed models have focused on financial dimensions or specific aspects of innovation, neglecting a comprehensive analysis that considers multiple critical factors simultaneously [14,20]. This fragmented approach prevents the full complexity of entrepreneurial processes from being captured and limits the generalizability of the results.

Consequently, there is a need to develop methodological frameworks and algorithmic models that can systematically integrate innovation and sustainability variables with data balancing and optimization techniques. These models would provide more robust early assessments applicable to contexts of high uncertainty and change [23].

This research proposes an algorithmic machine learning model for the early evaluation of business ideas. It considers identified limitations, such as the use of small samples, survival bias, and the absence of external validation [14,20,21]. To address these limitations, we propose combining Random Forest and XGBoost ensemble techniques with robust cross-validation procedures. These techniques have been shown to improve predictive power and generalization in applied studies [23]. This approach aims to minimize uncertainty in the early stages, optimize resource allocation, and enhance entrepreneurs’ ability to make informed decisions in dynamic, highly complex environments. Additionally, this research draws on advances in artificial intelligence applied to venture evaluation [24], the adoption of smart technologies linked to sustainability and the circular economy [25], and the development of collaborative processes between humans and automated systems in the context of Industry 5.0 [26], which together connect machine learning, sustainability, and technological innovation.

This study introduces an explainable, algorithmic framework designed for the preliminary assessment of business ideas. The proposed model integrates key dimensions—innovation, sustainability, teamwork, scalability, and financial performance—into a theoretically grounded quantitative structure. Methodologically, the framework employs an eight-phase machine learning pipeline encompassing class balancing through SMOTE, cross-validation, and hyperparameter optimization via GridSearchCV. This integrative design bridges qualitative assessments and quantitative modeling, thereby ensuring transparency, reproducibility, and comparability of results. The approach provides a robust decision-support tool for the early-stage evaluation of entrepreneurial initiatives, particularly in the contexts of business incubation and initial financing.

This research is organised as follows: Section 2 outlines the methodological procedure used to develop the framework. Section 3 presents the proposed algorithmic model in detail, outlining the variables and dimensions considered. Section 4 discusses the findings, contrasting the framework with traditional approaches and highlighting its contributions and limitations. Finally, Section 5 presents the main conclusions and suggests future lines of research.

2. Materials and Methods

2.1. Methodological Approach

In the literature, the concept of entrepreneurial success is ambiguous and has been operationalised in various ways. Some studies associate it with the temporary survival of the company [27], while others emphasise financial performance metrics such as sales, profitability, or access to financing rounds [19,28]. A third school of thought links it to organisational growth, measured in terms of employment, assets or international expansion [29]. More recently, indicators of innovation and sustainability have been incorporated into the concept of success, which is now understood as the ability to generate long-term value and contribute to the resilience of ecosystems [30,31,32].

The systematic review was conducted based on the PRISMA guidelines to identify studies from 2019 to 2025 that apply machine learning to the early evaluation of ventures, particularly in entrepreneurship. The Web of Science Core Collection, Scopus, ScienceDirect (Elsevier), and SpringerLink databases were consulted. The search string was: (“startup*” OR “new venture*” OR “early-stage”) AND (predict* OR model* OR forecast*) AND (“machine learning” OR “deep learning” OR “XGBoost”) AND (innovation OR sustainab* OR ESG OR “team” OR “financ*”).

Of the initial 87 results, eight met the criteria (empirical with machine learning [ML] and consistent with the study variables). Duplicates, theoretical results, unindexed results, and results already cited in the manuscript were excluded. The studies converge on the use of ensembles (RF/GBM/XGBoost), cross-validation, and text, market, and team signals to predict success, survival, or funding. They also demonstrate the importance of innovation, ESG, teamwork, scalability, and finances in the early stages (Table 1).

This diversity of definitions poses a methodological challenge, as lack of consistency limits the comparability of models and hinders empirical validation [41,42]. This research takes a multidimensional approach to success, integrating factors such as innovation, sustainability, teamwork, scalability, and initial finances, in order to overcome the limitations of one-dimensional views.

The research is framed within a methodological and propositional design to overcome the limitations identified. For instance, Gilsing et al. [43] emphasise that the initial phases of innovation and business model design predominantly rely on qualitative tools, lacking robust quantitative metrics for thorough evaluation. Similarly, Park et al. [15] note that the variety of definitions of success, inconsistent feature engineering, and diverse methodological validation pose significant obstacles to achieving comparable and practically applicable predictive models. Based on this evidence, this work aims to construct a formalised algorithmic framework for the early evaluation of business ideas. This framework integrates key dimensions of innovation and sustainability and provides a coherent methodological basis for future developments and empirical validations.

2.2. Framework Construction Procedure

The framework was developed using a structured procedure consisting of four main stages. First, a systematic review of the literature on predicting entrepreneurial success using algorithmic techniques was conducted to identify advances and limitations [7,14]. Second, critical variables were identified and categorised based on factors such as innovation, sustainability, team composition, scalability, and initial financial metrics, all of which have been repeatedly identified as determinants of business performance [8,19]. Third, a tool was created to measure the dimensions of innovation and sustainability in the initial assessment of business ideas, comprising specific variables that enable the evaluation of various aspects of entrepreneurial viability.

The result of this process was the definition of a set of 15 critical variables organised into five main categories: Innovation, Sustainability, the Entrepreneurial Team, Scalability and Initial Finances. These variables are widely recognised in the literature as determinants of the viability of early stage business projects, particularly in contexts characterised by uncertainty and technological disruption [22].

Table 2 summarises these variables and their characteristics, including data type, indicators and coding scheme, enabling their systematic use in the proposed framework. Together, these 15 variables form the basis of the framework’s algorithmic experimentation, providing a structured representation of early-stage viability.

In the final stage, candidate algorithmic techniques were selected, favouring ensemble models such as Random Forest, Gradient Boosting and XGBoost. These models were selected due to their ability to handle heterogeneous and non-linear data, as well as their robustness in complex market contexts [12,13]. Based on this selection, the logical pipeline of the framework was designed to sequentially and coherently integrate the input, transformation and output components. This allowed the entire process of the early evaluation of business ideas to be structured within controlled experimental scenarios.

2.3. Proposed Framework

The proposed framework is divided into three main components (Figure 1). The first component corresponds to the inputs, which consist of critical variables identified in recent literature. These include innovation, sustainability, the experience of the founding team, the scalability of the business model and initial financial metrics. The second component is the process, which integrates these variables through an algorithmic machine learning pipeline comprising preprocessing, integration and predictive modelling stages. The third component corresponds to the output, which is represented by a predictive score or probability of the business idea’s early success. This score can be used as a decision-making support tool.

2.4. Concept Validation

The framework’s validity rests on its consistency with recent advances in the literature and the integration of dimensions that are commonly treated in isolation. While most predictive models have focused on financial variables or specific characteristics of the founding team [7,44], the present approach simultaneously incorporates sustainability and innovation as determinants of long-term success [11,45]. This integration constitutes a significant methodological advancement, offering a framework that captures the true complexity of entrepreneurial processes in highly uncertain environments. The next phase of research will involve empirical validation using real data, which is necessary to complement the methodological contribution presented here.

2.5. Reproducibility and Transparency

The framework is formally documented through structured descriptions, conceptual diagrams and pseudocode, facilitating its replicability and potential implementation in future studies. This level of methodological detail addresses the requirement for reproducible and transferable frameworks in the field of innovation and entrepreneurship, where heterogeneity of context often hinders the comparability of results [1,2]. In line with recommendations for scientific openness, planned experimental implementations of the framework will be released in public repositories to promote transparency, academic collaboration and cumulative knowledge advancement.

3. Analysis of the Business Success Prediction Algorithm Using Machine Learning

3.1. System Architecture

3.1.1. Data Acquisition and Preprocessing Phase

The algorithm begins with the function LoadAndPreprocess(D), which encapsulates the entire business data transformation pipeline. This phase is essential because business data are heterogeneous and require specialised treatment. Explicitly separating innovation (X_inn) and sustainability (X_sus) characteristics allows us to analyse the differential contribution of each dimension to business success. Various studies have identified this as a decisive factor in predicting business results [7,14]. Different preprocessing strategies are applied depending on the type of variable. Missing data are imputed using statistical methods appropriate to each data type (continuous, categorical, or ordinal). Nominal categorical variables, such as team characteristics, are then transformed into a numerical representation. Continuous and ordinal variables, which already have a numerical scale (such as Likert scales), are standardised so that they operate on a comparable scale. This multi-layered approach ensures that all characteristics are numerical and homogeneous, preserving the original information while optimising compatibility with the selected machine learning algorithms.
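The type-dependent treatment described above can be illustrated with a minimal sketch that uses only the Python standard library. The column names, the toy values, and the helper functions (`impute`, `ordinal_encode`, `zscore`, `preprocess`) are hypothetical and are not taken from the framework's implementation; the sketch assumes median imputation for continuous variables, mode imputation for ordinal and categorical ones, integer coding of categories, and z-score scaling of continuous and ordinal variables.

```python
from statistics import mean, median, mode, pstdev

def impute(values, kind):
    """Fill None entries: median for continuous, mode for ordinal/categorical."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if kind == "continuous" else mode(observed)
    return [fill if v is None else v for v in values]

def ordinal_encode(values):
    """Map each category to an integer code (alphabetical order is an assumption)."""
    codes = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]

def zscore(values):
    """Standardize to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def preprocess(columns):
    """Apply imputation, then encoding or scaling, depending on variable type."""
    out = {}
    for name, (kind, values) in columns.items():
        filled = impute(values, kind)
        if kind == "categorical":
            out[name] = ordinal_encode(filled)   # encoded, not rescaled
        else:
            out[name] = zscore(filled)           # continuous and ordinal
    return out

# Hypothetical toy columns; None marks a missing value.
columns = {
    "initial_investment": ("continuous", [10.0, None, 30.0, 20.0]),
    "novelty": ("ordinal", [3, 4, None, 4]),
    "team_background": ("categorical", ["tech", "business", "tech", None]),
}
processed = preprocess(columns)
```

In an empirical study, the imputation values and scaling statistics would normally be estimated on the training split only and then reused on new records, so that no information from the test set leaks into preprocessing.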

3.1.2. Division Strategy and Data Balance

The implementation uses StratifiedSplit with an 80/20 ratio to preserve the distribution of success and failure classes in both sets. This technique is particularly important in business prediction contexts, where class imbalance is a common issue [46]. The conditional application of SMOTE (Synthetic Minority Over-sampling Technique) to the training set is an evidence-based methodological decision. This technique generates synthetic examples of the minority class by interpolating between existing instances, which can improve predictive power in imbalanced datasets. Previous studies have demonstrated the effectiveness of combining oversampling techniques, such as SMOTE, with ensemble algorithms, such as Random Forest and AdaBoost, in imbalanced data scenarios, enhancing metrics such as balanced accuracy and F1-score.
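As an illustration of the splitting and balancing logic, the following standard-library sketch performs a stratified 80/20 split and then applies a simplified SMOTE-style interpolation only when the minority class falls below 30% of the training set. The function names, the toy data, and the choice of k = 3 neighbours are assumptions made for this example; a real study would more likely rely on an established library implementation of SMOTE.

```python
import random

def stratified_split(X, y, test_size=0.2, seed=0):
    """Split each class separately so both sets keep the class proportions."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(set(y)):
        idx = [i for i, lab in enumerate(y) if lab == label]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    pick = lambda idx: ([X[i] for i in idx], [y[i] for i in idx])
    return pick(train_idx), pick(test_idx)

def minority_ratio(y):
    """Share of the least frequent class."""
    counts = {lab: y.count(lab) for lab in set(y)}
    return min(counts.values()) / len(y)

def smote_sketch(X, y, k=3, seed=0):
    """Simplified SMOTE for two classes: x_new = x + lam * (neighbour - x)."""
    rng = random.Random(seed)
    counts = {lab: y.count(lab) for lab in set(y)}
    minority = min(counts, key=counts.get)
    n_new = max(counts.values()) - counts[minority]
    pool = [x for x, lab in zip(X, y) if lab == minority]
    dist = lambda a, b: sum((p - q) ** 2 for p, q in zip(a, b))
    X_out, y_out = list(X), list(y)
    for _ in range(n_new):
        base = rng.choice(pool)
        # k nearest minority neighbours of the chosen instance
        neighbours = sorted((p for p in pool if p is not base),
                            key=lambda p: dist(base, p))[:k]
        nb = rng.choice(neighbours)
        lam = rng.random()  # interpolation factor in [0, 1)
        X_out.append([b + lam * (n - b) for b, n in zip(base, nb)])
        y_out.append(minority)
    return X_out, y_out

# Toy data: 20 "success" (1) and 80 "failure" (0) ideas with two features.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(100)]
y = [1] * 20 + [0] * 80

(X_train, y_train), (X_test, y_test) = stratified_split(X, y, test_size=0.2)
if minority_ratio(y_train) < 0.3:          # conditional balancing
    X_bal, y_bal = smote_sketch(X_train, y_train)
else:
    X_bal, y_bal = X_train, y_train
```

The test set is left untouched, so evaluation still reflects the original class distribution; only the training data receive synthetic minority examples.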

3.2. Algorithm Pseudocode

3.2.1. Training and Optimization Core

The algorithm (Algorithm 1) uses an ensemble approach, evaluating three complementary algorithms: Random Forest (RF), Gradient Boosting (GB) and XGBoost (XGB). This selection is based on empirical evidence indicating that decision tree-based models are highly accurate and robust for predicting the success of start-ups and ventures [7,14,44,46].

Algorithm 1 Business Success Prediction Algorithm using Machine Learning

Require: Dataset $D$ with business ideas and success labels
Ensure: Optimal model $M^{*}$ and evaluation results

// Phase 1: Data Source
$X, y \leftarrow$ LoadAndValidate$(D)$
ValidateSchema$(X)$ ▹ Verify 15 variables across 5 dimensions
// Phase 2: Preprocessing
$X \leftarrow$ ImputeMissingValues$(X)$ ▹ Median for continuous, mode for ordinal/categorical
$X \leftarrow$ EncodeAndScale$(X)$ ▹ Ordinal encoding, z-score standardization
$X_{dim} \leftarrow$ GroupByDimension$(X)$ ▹ Group by innovation, sustainability, team, scalability, finance
// Phase 3: Data Division
$X_{train}, X_{test}, y_{train}, y_{test} \leftarrow$ StratifiedSplit$(X, y, 0.2)$
// Phase 4: Class Balancing with SMOTE
if MinorityRatio$(y_{train}) < 0.3$ then
    $X_{train}, y_{train} \leftarrow$ SMOTE$(X_{train}, y_{train})$
end if
// Phase 5: Training and Optimization
$models \leftarrow \{$RandomForest, GradientBoosting, XGBoost$\}$
$trained\_models \leftarrow \{\}$
for all $model \in models$ do
    $model^{*} \leftarrow$ GridSearchCV$(model, hyperparameters, 5, \text{'f1'})$
    $trained\_models[model] \leftarrow model^{*}.$fit$(X_{train}, y_{train})$
end for
// Phase 6: Prediction Generation
$predictions \leftarrow \{\}$
for all $model \in trained\_models$ do
    $y_{pred} \leftarrow model.$predict$(X_{test})$
    $y_{proba} \leftarrow model.$predict_proba$(X_{test})$
    $predictions[model] \leftarrow \{y_{pred}, y_{proba}\}$
end for
// Phase 7: Comprehensive Evaluation
$scores \leftarrow \{\}$
for all $model \in trained\_models$ do
    $scores[model] \leftarrow \{$
        $f1:$ F1Score$(y_{test}, predictions[model].y_{pred}),$
        $accuracy:$ Accuracy$(y_{test}, predictions[model].y_{pred}),$
        $precision:$ Precision$(y_{test}, predictions[model].y_{pred}),$
        $recall:$ Recall$(y_{test}, predictions[model].y_{pred}),$
        $balanced\_acc:$ BalancedAccuracy$(y_{test}, predictions[model].y_{pred})$
    $\}$
end for
// Phase 8: Model Selection and Testing
$M^{*} \leftarrow \arg\max_{model} \, scores[model].f1$ ▹ Select best model by F1-Score
$feature\_importance \leftarrow$ ExtractImportance$(M^{*})$
$dimension\_weights \leftarrow$ ComputeDimensionalWeights$(feature\_importance, X_{dim})$
function EvaluateIdea$(x_{new})$
    $x_{processed} \leftarrow$ Preprocess$(x_{new})$ ▹ Apply learned transformations
    $dim\_scores \leftarrow$ ComputeDimensionScores$(x_{processed})$
    $global\_score \leftarrow$ AggregateScore$(dim\_scores, dimension\_weights)$
    $label \leftarrow$ ClassifyByThreshold$(global\_score)$ ▹ ≥0.70: High, 0.50-0.69: Medium, <0.50: Reformulate
    return $\{dim\_scores, global\_score, label\}$
end function
return $M^{*}, feature\_importance, dimension\_weights, scores$
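The scoring function at the end of Algorithm 1 can be made concrete with a small sketch. The algorithm does not specify how dimension scores or weights are computed, so the example below assumes that dimension weights are the normalised sums of feature importances, that each dimension score is the mean of its variables already scaled to [0, 1], and that the global score is the weighted sum. The variable names are a hypothetical subset of the fifteen framework variables, and the importance values are invented for illustration.

```python
# Hypothetical subset of the framework variables, grouped by dimension.
DIMENSIONS = {
    "innovation": ["novelty", "technological_proposal"],
    "sustainability": ["environmental_impact", "social_impact"],
    "team": ["sector_experience"],
    "scalability": ["market_access"],
    "finance": ["initial_investment"],
}

def dimension_weights(importance):
    """Sum feature importances per dimension and normalise them to 1."""
    raw = {d: sum(importance[v] for v in vs) for d, vs in DIMENSIONS.items()}
    total = sum(raw.values())
    return {d: w / total for d, w in raw.items()}

def dimension_scores(idea):
    """Mean of the (already [0, 1]-scaled) variables in each dimension."""
    return {d: sum(idea[v] for v in vs) / len(vs)
            for d, vs in DIMENSIONS.items()}

def classify_by_threshold(score):
    """Thresholds from Algorithm 1: >=0.70 High, 0.50-0.69 Medium, else Reformulate."""
    if score >= 0.70:
        return "High"
    if score >= 0.50:
        return "Medium"
    return "Reformulate"

def evaluate_idea(idea, weights):
    """Return per-dimension scores, the weighted global score, and its label."""
    scores = dimension_scores(idea)
    global_score = sum(weights[d] * scores[d] for d in scores)
    return scores, global_score, classify_by_threshold(global_score)

# Invented feature importances, standing in for those of the selected model.
importance = {
    "novelty": 0.20, "technological_proposal": 0.10,
    "environmental_impact": 0.15, "social_impact": 0.15,
    "sector_experience": 0.20, "market_access": 0.10,
    "initial_investment": 0.10,
}
weights = dimension_weights(importance)

idea = {
    "novelty": 0.8, "technological_proposal": 0.6,
    "environmental_impact": 0.5, "social_impact": 0.7,
    "sector_experience": 0.9, "market_access": 0.4,
    "initial_investment": 0.5,
}
dim_scores, global_score, label = evaluate_idea(idea, weights)
```

Returning the per-dimension scores alongside the global score keeps the result explainable: an evaluator can see which dimension pulls an idea below a threshold rather than receiving only a label.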

Random Forest reduces overfitting by constructing multiple decision trees and using majority voting to improve predictive power compared to individual trees [7,46]. XGBoost and Gradient Boosting, on the other hand, are sequential ensemble methods, whereby each new tree is trained to correct the errors of the previous one. These methods are widely recognised for their computational efficiency and excellent performance on unbalanced and complex datasets [7,14].

Optimization using GridSearchCV with 5-fold cross-validation exhaustively explores the predefined hyperparameter grid, evaluating every combination in that grid to maximize predictive performance. Values outside the grid are not explored, so the quality of the result depends on how the grid is defined. This systematic approach supports robust and generalizable models in business contexts.
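To make the cost of this exhaustive search explicit, the sketch below enumerates a small grid and counts the number of model fits under five-fold cross-validation. It is a standard-library illustration of the idea, not the GridSearchCV implementation: the grid values are hypothetical, the folds are not stratified, and `toy_score` is a made-up stand-in for training a model on the training indices and computing the F1-score on the validation indices.

```python
from itertools import product

def expand_grid(grid):
    """List every combination of the hyperparameter values in the grid."""
    names = list(grid)
    return [dict(zip(names, combo))
            for combo in product(*(grid[n] for n in names))]

def kfold_indices(n_samples, k=5):
    """Yield (train, validation) index lists for k interleaved folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]

def grid_search(grid, score_fn, n_samples, k=5):
    """Return the best parameters, their mean score, and the number of fits."""
    best_params, best_score, n_fits = None, float("-inf"), 0
    for params in expand_grid(grid):
        fold_scores = []
        for train, val in kfold_indices(n_samples, k):
            fold_scores.append(score_fn(params, train, val))
            n_fits += 1
        avg = sum(fold_scores) / k
        if avg > best_score:
            best_params, best_score = params, avg
    return best_params, best_score, n_fits

def toy_score(params, train_idx, val_idx):
    """Made-up score surface with its maximum at (200, 5, 0.1)."""
    return -(abs(params["max_depth"] - 5)
             + abs(params["learning_rate"] - 0.1)
             + abs(params["n_estimators"] - 200) / 1000)

# Hypothetical grid: 3 x 2 x 2 = 12 combinations.
grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
}
best_params, best_score, n_fits = grid_search(grid, toy_score, n_samples=80)
```

The number of fits grows as the product of the grid sizes times the number of folds, which is why the grid for each of the three algorithms has to be kept deliberately small.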

3.2.2. Evaluation and Selection System

The main metric used is the F1 score, which is the harmonic mean of precision and recall. This methodological choice is particularly appropriate for potentially imbalanced datasets, providing a balanced assessment of predictive performance. Recent empirical results report an accuracy of 81.85% for Random Forest in predicting startup success [20], while XGBoost techniques combined with SMOTE achieve an F1-score of 86% on unbalanced datasets [21].
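The reason for preferring the F1 score over accuracy can be seen in a short worked example. The sketch below computes the five metrics used in the framework from the confusion counts of a binary problem; the function names and the toy labels are assumptions for illustration. The toy model predicts failure for every idea in a set where only 10% of ideas succeed.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 = success."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def evaluate(y_true, y_pred):
    """Compute F1, accuracy, precision, recall and balanced accuracy."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0        # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "f1": f1,
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "balanced_acc": (recall + specificity) / 2,
    }

# Imbalanced toy test set: 10 successes, 90 failures.
y_true = [1] * 10 + [0] * 90
# A degenerate model that always predicts the majority class.
y_pred = [0] * 100

scores = evaluate(y_true, y_pred)
```

Here accuracy is 0.90 even though no successful idea is identified, while the F1 score is 0 and balanced accuracy is 0.5, which is the behaviour that motivates using F1 as the selection criterion.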

3.3. Methodology Flowchart

The proposed algorithm implements a machine learning pipeline structured in eight sequential stages, specifically designed to evaluate the viability of business ideas in early stages. Each phase systematically transforms the input data, progressively building the predictive capacity of the model. Similar to how an industrial process converts raw materials into finished products through specialized stations, this pipeline processes heterogeneous business data to produce robust and explainable predictions of success. Figure 2 illustrates this complete process, from data acquisition to the selection of the optimal model, ensuring reproducibility and methodological transparency.

1. Data Source: The system receives business ideas as input, represented as structured records with values for fifteen variables. During this phase, the system validates that each record contains the variables organized into the five dimensions of the framework, including innovation (novelty and technological proposal), sustainability (environmental impact and social impact), entrepreneurial team (sector experience), scalability (market access), and initial finances (initial investment). Just as a factory inspects the quality of its raw materials before entering them into the production line, this phase verifies that the data meets the structural requirements before processing it.

2. Preprocessing: In this phase, the data matrix is received with possible missing values, heterogeneous scales, and mixed variable types (ordinal, categorical, and continuous). Three sequential transformations are executed to prepare the data. First, the presence of missing values is evaluated, and if any exist, differentiated imputation is applied. Continuous variables use the median, which is robust against outliers, while ordinal and categorical variables use the mode, or the most frequent value. Second, the variables are coded and scaled according to their type: categorical variables are coded ordinally by assigning numerical values to each category; ordinal variables maintain their natural numerical scale; and both ordinal and continuous variables are standardized using z-score scaling to homogenize magnitudes. Third, the variables are grouped into subsets by conceptual dimension (innovation, sustainability, teamwork, scalability, and finance) to facilitate subsequent importance analysis. The output is a preprocessed matrix with all variables on comparable scales and no missing values, properly coded for machine learning algorithms. This phase transforms the raw data into an optimal format for algorithmic processing.

3. Data Division: The division phase receives the preprocessed matrix and the target vector. It applies stratified splitting with an 80–20 ratio (80% training and 20% testing), which balances two conflicting objectives: maximizing the data available for training to improve pattern learning and maintaining a sufficiently large test set to reduce performance metric variance and enable robust evaluation. This technique is crucial for avoiding evaluation biases, particularly in contexts where a class may be underrepresented. The result is four clearly differentiated data subsets. These subsets include the predictor variables and their corresponding labels.

4. Class Balancing with SMOTE: This phase receives training sets with potentially unbalanced class distribution. The algorithm begins by assessing whether there is significant imbalance, determined when the minority class represents less than thirty percent of the total. If imbalance is detected, the SMOTE technique is applied, which generates synthetic examples of the minority class by linear interpolation between existing instances and their k-nearest neighbors. This process artificially increases the representation of the minority class until the dataset is balanced, creating new examples that are not simple copies but realistic combinations of existing cases. If the imbalance is not significant, the original distribution is preserved to avoid unnecessary oversampling. As a result, balanced training sets are obtained when necessary, or the original sets remain unchanged when the balance is already adequate. Previous studies show that SMOTE combined with ensemble algorithms significantly improves metrics such as the F1-Score, reaching up to 86% in unbalanced datasets [21].

5. Training and Optimization: The training sets, which were balanced or preserved in the previous phase, enter this central phase of the pipeline. Three complementary ensemble algorithms are trained in parallel (Figure 3). Random Forest builds multiple independent decision trees using bootstrap aggregating and reduces overfitting through majority voting; its tuned hyperparameters include the number of trees, maximum depth, and splitting criteria. Gradient Boosting trains trees sequentially. Each new tree corrects the residual errors of the previous trees using gradient descent. Its tuned hyperparameters are the learning rate, the number of estimators, and the depth. XGBoost is an optimized version of Gradient Boosting that includes additional regularization and efficient parallelization. It uses specific parameters, such as gamma for complexity control and scale_pos_weight for handling imbalances. Each model is optimized using GridSearchCV with five-fold cross-validation to explore the defined hyperparameter space exhaustively and maximize the F1-score. As a result, three trained and optimized models are obtained, each with its optimal hyperparameter configuration identified through cross-validation. Similar to three teams of architects proposing different designs for the same building, each algorithm builds a predictive model with a complementary approach. Cross-validation acts as a jury, evaluating each proposal in multiple scenarios. The literature reports accuracies greater than 81% for Random Forest and F1 scores of 86% for XGBoost in predicting startup success, which justifies selecting these algorithms [20,21].

6. Prediction Generation: This phase receives the three trained models (Random Forest, Gradient Boosting, and XGBoost) and the test set, which contains business ideas never seen during training. The process is straightforward. Each model receives the characteristics of an idea (its 15 variables) and generates two types of predictions. First, it generates a binary classification that answers the question "Will this idea be successful?" with a response of zero (failure) or one (success). Second, it generates a numerical probability indicating the model’s confidence in its prediction (Figure 4). This probability is expressed as a value between zero and one; values close to one indicate a high probability of success. For instance, if the test set contains one hundred business ideas, each model will generate one hundred classifications (success or failure) and one hundred probabilities (confidence). Consequently, three complete sets of predictions are obtained, one for each model. These sets will later be compared to identify the model with the best predictive capacity.

7. Comprehensive Evaluation: In this phase, the predictions of each model and the true labels of the test set are evaluated. Five complementary metrics are calculated for each model. The F1 score is the harmonic mean of precision and recall and is especially robust for unbalanced datasets. Accuracy measures the proportion of correct predictions out of the total. Precision quantifies the proportion of positive predictions that were correct. Recall determines the proportion of positive cases that were correctly identified. Finally, balanced accuracy calculates the average sensitivity and specificity, neutralizing the effects of imbalance. The result is a matrix with five metrics for each of the three models, for a total of fifteen performance values. This is complemented by a confusion matrix for each model showing the detailed distribution of true positives, true negatives, false positives, and false negatives. This phase evaluates each model from different perspectives to identify its specific strengths and weaknesses. The F1 score is the main metric because it balances precision and recall, preventing models biased toward the majority class from receiving artificially high evaluations.

8. Model Selection and Testing: This final phase incorporates the evaluation results of the three candidate models, closing the framework cycle and demonstrating its practical application. The process compares the F1 scores and selects the model with the highest value as M*. The feature importances of the optimal model are extracted, and aggregate weights are calculated by conceptual dimension. This quantifies the relative contribution of innovation, sustainability, team, scalability, and finance to business success.

Once M* is selected, the system processes new business ideas. For each input case x, the framework applies preprocessing transformations and generates three types of output. First, it generates scores by dimension that reflect the idea’s relative performance on each evaluated axis. Second, it generates an overall score in [0, 1] summarizing the general alignment of the idea with the model’s principles. Third, it provides an interpretive classification label derived from the overall score. Values equal to or greater than 0.70 indicate high potential. Values between 0.50 and 0.69 correspond to medium potential. Values below 0.50 indicate that the idea requires reformulation. This output is complemented by an interpretive summary identifying the dimensions that most influence the result. This turns the model into a diagnostic tool guiding decision-making in the early stages of incubation or investment.
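
As a rough illustration of two steps in the pipeline above, the following Python sketch combines the conditional balancing of phase 4 with the metric calculation and F1-based selection of phases 7 and 8. It uses only the standard library and is not the authors' implementation: the function names (`smote`, `balance_if_needed`, `evaluate`, `select_best`), the toy data, and the hard-coded predictions are assumptions made for the example, and the interpolation is a simplified version of SMOTE rather than the one offered by a dedicated library. Only the thirty percent threshold, the five metrics, and the F1 selection rule come from the text.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Simplified SMOTE: interpolate between a minority row and one of its
    k nearest minority neighbours. Requires at least two minority rows."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [row for j, row in enumerate(minority) if j != i]
        # Sort the remaining minority rows by squared Euclidean distance.
        others.sort(key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position on the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def balance_if_needed(X, y, threshold=0.30, k=5, seed=0):
    """Phase 4: oversample only if the minority share is below `threshold`."""
    counts = {label: y.count(label) for label in set(y)}
    minority_label = min(counts, key=counts.get)
    if counts[minority_label] / len(y) >= threshold:
        return X, y  # balance is adequate, keep the original distribution
    minority = [row for row, label in zip(X, y) if label == minority_label]
    n_new = max(counts.values()) - counts[minority_label]
    new_rows = smote(minority, n_new, k, seed)
    return X + new_rows, y + [minority_label] * len(new_rows)


def evaluate(y_true, y_pred):
    """Phase 7: the five metrics, computed from the confusion matrix."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "f1": f1,
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "balanced_accuracy": (recall + specificity) / 2,
    }


def select_best(results):
    """Phase 8: return the name of the model with the highest F1 score."""
    return max(results, key=lambda name: results[name]["f1"])


# Toy data (assumed): 8 failures and 2 successes, so the minority share is 20%.
X = [[i / 10, 0.0] for i in range(8)] + [[0.9, 0.9], [1.0, 1.0]]
y = [0] * 8 + [1] * 2
X_bal, y_bal = balance_if_needed(X, y)

# Toy test labels and hard-coded predictions standing in for the three models.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = {
    "random_forest": [1, 0, 1, 0, 0, 0, 1, 0],
    "gradient_boosting": [1, 1, 1, 1, 0, 1, 1, 0],
    "xgboost": [1, 0, 1, 1, 0, 1, 1, 0],
}
results = {name: evaluate(y_true, pred) for name, pred in predictions.items()}
best = select_best(results)
print(len(y_bal), y_bal.count(1), best)  # 16 8 xgboost
```

In a full pipeline the predictions would come from the three tuned models; they are hard-coded here so that the selection logic can be followed by hand.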

Importance and Interpretability Analysis

The algorithm concludes with the extraction of feature importance and the calculation of the relative weights of the innovation and sustainability dimensions. This interpretive phase is fundamental to informed decision-making, offering insights into the proportional contribution of each dimension to business success [14,18].

The methodology enables the relative impact of various factors, such as funding received, team experience, level of innovation, media exposure, industrial sector, sustainability metrics and centrality in professional networks, to be quantified. These factors have been shown to be important predictors of financial and sustainability success [7,14,18,44,46].
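
The aggregation of feature importance by dimension can be sketched as follows. This is a minimal illustration under stated assumptions: the variable identifiers are hypothetical short names for the 15 variables of Table 2, and the importance values are invented placeholders, not results from the study. In practice the per-variable importances might be read from the selected model (for example, tree ensembles in scikit-learn expose a `feature_importances_` attribute), but that step is not shown here.

```python
# Hypothetical identifiers for the 15 variables, grouped by dimension.
DIMENSIONS = {
    "innovation": ["novelty", "ip_protection", "model_differentiation"],
    "sustainability": ["environmental_impact", "circular_economy", "social_inclusion"],
    "team": ["sector_experience", "functional_diversity", "dedication"],
    "scalability": ["market_size", "scaling_potential", "early_traction"],
    "finance": ["capital_intensity", "time_to_break_even", "funding_sources"],
}


def dimension_weights(importances):
    """Sum per-variable importances within each dimension and normalize
    the totals so that the five dimension weights add up to one."""
    totals = {
        dim: sum(importances.get(name, 0.0) for name in names)
        for dim, names in DIMENSIONS.items()
    }
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("all importances are zero")
    return {dim: total / grand_total for dim, total in totals.items()}


# Placeholder importances (invented for illustration only).
example_importances = {
    "novelty": 0.12, "ip_protection": 0.05, "model_differentiation": 0.09,
    "environmental_impact": 0.08, "circular_economy": 0.07, "social_inclusion": 0.05,
    "sector_experience": 0.10, "functional_diversity": 0.06, "dedication": 0.04,
    "market_size": 0.07, "scaling_potential": 0.06, "early_traction": 0.07,
    "capital_intensity": 0.05, "time_to_break_even": 0.05, "funding_sources": 0.04,
}
weights = dimension_weights(example_importances)
for dim, weight in sorted(weights.items(), key=lambda item: item[1], reverse=True):
    print(f"{dim}: {weight:.2f}")
```

Normalizing the totals keeps the dimension weights comparable across models whose raw importance scales differ.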

3.4. Methodological Innovation

This approach introduces a number of significant innovations. Firstly, the systematic integration of innovation and sustainability as distinct predictor variables enables a multidimensional analysis of business success. Secondly, applying class balancing techniques adapted specifically to the business context improves predictive robustness. Thirdly, using complementary algorithm ensembles maximises predictive power while maintaining interpretable results.

The methodology is aligned with recent developments, such as the SECURE AI framework. This framework integrates metrics of sustainability, scalability, viability, desirability and market fit. It combines these dimensions with machine learning algorithms to provide a holistic, multidimensional prediction of entrepreneurial success [44].

The proposed algorithmic model processes a structured dataset that quantitatively represents the characteristics of early-stage business ideas. Each record describes a business idea and consists of variables associated with five conceptual dimensions: innovation, sustainability, entrepreneurial team, scalability, and initial finances. In computational terms, the dataset can be represented as a tabular matrix

X \in \mathbb{R}^{n \times 15}

where n corresponds to the number of ideas evaluated (one per row) and 15 to the number of variables observed (one per column). These variables are defined and validated according to the model’s structural schema (schema/variables.yaml), which ensures consistency in data types, value ranges, and the presence of required fields. The model can receive records in YAML or CSV format, depending on the source or application stage, and convert them internally into standardized structures for processing.
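
To make the validation step more concrete, the sketch below checks CSV records against value ranges taken from the encodings in Table 2. The contents of schema/variables.yaml are not reproduced in the article, so the `SCHEMA` dictionary, the column names, and the `validate` function are assumptions introduced for illustration; only CSV input is handled, using the Python standard library.

```python
import csv
import io

# Assumed ranges per variable: (minimum, maximum, must_be_integer),
# following the encodings listed in Table 2. Column names are hypothetical.
SCHEMA = {
    "novelty": (1, 5, True),
    "ip_protection": (0, 1, True),
    "model_differentiation": (1, 5, True),
    "environmental_impact": (1, 3, True),
    "circular_economy": (0, 2, True),
    "social_inclusion": (0, 1, True),
    "sector_experience": (0, 1, False),
    "functional_diversity": (1, 5, True),
    "dedication": (0, 100, False),
    "market_size": (0, 1, False),
    "scaling_potential": (1, 5, True),
    "early_traction": (0, 1, False),
    "capital_intensity": (0, 1, False),
    "time_to_break_even": (0, 1, False),
    "funding_sources": (0, 1, True),
}


def validate(record):
    """Return a list of problems found in one record (empty if it is valid)."""
    errors = []
    for name, (low, high, is_integer) in SCHEMA.items():
        raw = record.get(name)
        if raw is None or raw == "":
            errors.append(f"{name}: missing required field")
            continue
        try:
            value = float(raw)
        except ValueError:
            errors.append(f"{name}: not numeric ({raw!r})")
            continue
        if is_integer and not value.is_integer():
            errors.append(f"{name}: expected an integer code, got {raw}")
        elif not low <= value <= high:
            errors.append(f"{name}: {raw} outside [{low}, {high}]")
    return errors


# Two illustrative rows: the first is valid, the second has an out-of-range
# Likert value and an empty required field.
SAMPLE_CSV = "\n".join([
    ",".join(SCHEMA),
    "4,1,3,2,1,1,0.6,4,80,0.5,3,0.2,0.4,0.7,1",
    "7,1,3,2,1,1,0.6,4,80,0.5,3,0.2,0.4,0.7,",
])
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for number, row in enumerate(rows, start=1):
    print(number, validate(row))
```

Collecting every problem for a record, rather than stopping at the first one, makes it easier to return a complete correction list to whoever submitted the idea.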

Prior to calculation, the variables are normalized to the range [0, 1] to ensure scale homogeneity between heterogeneous dimensions, and avoid distortions derived from differences in magnitude. Central processing is performed using a weighted aggregation function, where each dimension contributes to the overall score according to its theoretical relative weight within the conceptual model:

\mathrm{Score}_{\mathrm{global}} = \sum_{i=1}^{5} w_{i} \, D_{i}

where D_{i} represents the normalized average score of the variables belonging to dimension i and w_{i} represents its assigned weight: innovation (0.25), sustainability (0.25), team (0.20), scalability (0.15), and finance (0.15). According to the specialized literature [14,47], this weighting reflects the theoretical importance of each dimension in the early viability of ventures. The algorithm generates three types of output in JSON format for easy analysis and interpretation: scores per dimension, reflecting the idea’s relative performance on each evaluated axis; an overall score in the range [0, 1], which summarizes the general alignment of the idea with the principles of the model; and an interpretive classification label derived from the overall score:

≥0.70: High potential
0.50–0.69: Medium potential
<0.50: Requires reformulation

The output is complemented by an interpretive summary identifying the dimensions most influencing the result. This makes the model a diagnostic tool guiding decision-making in the early stages of incubation or investment. The model does not replace expert evaluation; rather, it complements it by providing a reproducible analytical layer based on quantitative evidence that can be adapted to future integrations of real or simulated data.
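
The scoring rule can be illustrated with a short worked example. The weights and the classification thresholds below are those stated in the text; everything else is an assumption made for the sketch, including the function names, the JSON field names, the sample values, and the decision to treat every score in [0.50, 0.70) as medium potential.

```python
import json

# Dimension weights as stated in the text.
WEIGHTS = {
    "innovation": 0.25,
    "sustainability": 0.25,
    "team": 0.20,
    "scalability": 0.15,
    "finance": 0.15,
}


def min_max(value, low, high, invert=False):
    """Rescale a raw value to [0, 1]; `invert` flips scales where lower is better."""
    scaled = (value - low) / (high - low)
    return 1.0 - scaled if invert else scaled


def label(score):
    """Map the overall score to the interpretive label."""
    if score >= 0.70:
        return "High potential"
    if score >= 0.50:
        return "Medium potential"
    return "Requires reformulation"


def score_idea(normalized):
    """`normalized` maps each dimension to its variables, already in [0, 1]."""
    dimension_scores = {dim: sum(vals) / len(vals) for dim, vals in normalized.items()}
    contributions = {dim: WEIGHTS[dim] * dimension_scores[dim] for dim in WEIGHTS}
    overall = sum(contributions.values())
    return {
        "dimension_scores": dimension_scores,
        "overall_score": overall,
        "label": label(overall),
        # Dimensions with the largest weighted contribution to the score.
        "top_dimensions": sorted(contributions, key=contributions.get, reverse=True)[:2],
    }


# Hypothetical idea; two Likert answers (4 and 3 on a 1-5 scale) and one
# inverted 0-1 variable are rescaled explicitly to show the normalization.
idea = {
    "innovation": [min_max(4, 1, 5), 1.0, min_max(3, 1, 5)],
    "sustainability": [0.5, 0.0, 1.0],
    "team": [0.6, 0.5, 0.7],
    "scalability": [0.5, 0.5, 0.2],
    "finance": [min_max(0.8, 0, 1, invert=True), 0.4, 0.0],
}
result = score_idea(idea)
print(json.dumps(result, indent=2))
```

For this idea the dimension averages are 0.75, 0.50, 0.60, 0.40, and 0.20, so the overall score is 0.25(0.75) + 0.25(0.50) + 0.20(0.60) + 0.15(0.40) + 0.15(0.20) = 0.5225, which falls in the medium-potential band.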

4. Discussion

The prediction of entrepreneurial success at an early stage has attracted growing interest in the literature, primarily due to the high level of uncertainty that characterises innovation ecosystems. Several studies have demonstrated that decisions made at the outset can significantly impact a venture’s trajectory, emphasising the need for more systematic and dependable tools for the initial evaluation of projects [48].

In this context, integrating innovation and sustainability variables has become increasingly relevant, as both are recognised as determining factors in the viability of contemporary business projects. Recent literature has emphasised that the ability to incorporate innovation, generate sustainable value and respond to technological disruption is increasingly important for entrepreneurial success, as discussed in [22]. Methodologically, advances in machine learning, especially data balancing techniques such as SMOTE and ensemble models, have overcome some limitations of traditional approaches and improved predictive power in highly heterogeneous environments [49].

However, traditional approaches to entrepreneurial assessment have been dominated by linear models (such as regressions and discriminant analysis) and financial scores, which have recurring limitations: Firstly, they tend to be one-dimensional, focusing almost exclusively on accounting and financial information. Secondly, they are sensitive to survival bias and the lack of longitudinal data, which compromises their generalisation. Thirdly, they have little capacity to handle unbalanced data, which is a common problem given that most ventures fail in the early stages [7,14].

The framework proposed helps to overcome these limitations through its three distinctive elements. Firstly, defining a set of 15 critical variables organised into five dimensions (innovation, sustainability, the entrepreneurial team, scalability and initial finances) enables a comprehensive approach to entrepreneurial viability, overcoming the fragmentation of previous studies. Secondly, standardising the operationalisation of the variables through homogeneous indicators and coding criteria facilitates their integration into machine learning algorithms. Thirdly, incorporating ensemble models (Random Forest, Gradient Boosting and XGBoost), together with data balancing techniques, enables the handling of non-linearities and class imbalances, thereby enhancing predictive robustness in contexts of high uncertainty and technological disruption.

To demonstrate the application of the model in a real-world context, a case study was conducted to evaluate its performance and the results it generated. The dataset included early-stage business ideas that were structured according to the model’s five dimensions. The algorithm processed the information in eight stages, generating scores for each dimension and an overall viability score. The results showed that ideas with higher levels of innovation and sustainability received the highest overall scores. This confirms the validity of the theoretical weighting and demonstrates the usefulness of the model as a practical tool for evaluating business ideas early on.

This contrast is summarised in Table 3, which outlines the key differences between the traditional approaches and the proposed framework.

In theoretical terms, this model provides a novel integration of the innovation and sustainability dimensions, which are usually addressed in a fragmented manner, with team, scalability, and financial factors, proposing a replicable framework for comparative studies in different entrepreneurial ecosystems. On a practical level, it is an applicable tool for incubators, accelerators, and development agencies, providing an objective and standardized framework for early-stage decision-making.

This research should be viewed as an initial methodological contribution. The following areas are proposed for future research: (i) empirical validation of the instrument in real incubation and acceleration contexts; (ii) integration of multimodal signals by combining structured and textual data through hybrid learning approaches; and (iii) incorporation of interpretability techniques such as SHAP to reinforce transparency and traceability of algorithmic models when evaluating business ideas in their early stages [7,49,50].

5. Conclusions

This research presents a framework for an algorithmic machine learning model designed for the early evaluation of business ideas. Unlike traditional approaches, which tend to focus on financial metrics or single indicators, the proposed model systematically integrates five critical dimensions: innovation, sustainability, the entrepreneurial team, scalability and initial finances. This provides a more holistic and consistent view of entrepreneurial viability in contexts of high uncertainty.

The model may exhibit biases stemming from the composition of the dataset and the chosen variables. Since the records correspond to business ideas in the early stages, there may be an uneven representation of sectors or types of innovation, generating sampling bias. Excluding contextual variables, such as macroeconomic conditions, public policies, and cultural factors, may introduce bias due to omitted variables and limit the generalization of the results. While these biases do not affect the model’s methodological validity, they should be considered when interpreting the results and applying the model to new contexts or data sources.

The main contribution of this work is the operationalisation of variables that have usually been considered separately, providing a replicable methodological structure that can be adapted to different sectors and ecosystems. Likewise, incorporating advanced algorithmic techniques such as ensemble models and data balancing methods lays the foundation for overcoming recurring limitations in the literature, such as class imbalance and lack of generalisation.

This study enhances existing approaches by combining variables that were previously examined separately, such as innovation, sustainability, scalability, team, and finances, into one model. This integration provides a more comprehensive representation of the initial feasibility of a business idea. Additionally, the model incorporates an eight-phase algorithmic pipeline that combines class balancing with SMOTE and cross-validation, as well as hyperparameter optimization with GridSearchCV. This increases the accuracy and stability of predictions in contexts with heterogeneous or unbalanced data. Unlike traditional models based solely on financial metrics, this approach generates comparable and explainable results across projects, facilitating objective decision-making in the incubation and initial financing processes.

In practice, the framework serves as a support tool for incubators, accelerators and policymakers, facilitating strategic decision-making at an early stage when uncertainty is often greater and mistakes can be costly. However, it is recognised that the proposal has limitations, including the use of simulated data in the experimental phase and the need for empirical validation in real contexts. This is projected as a priority area for future research.

In summary, this work establishes a foundation for developing more comprehensive, transparent and applicable predictive models in sustainable entrepreneurship practice, helping to bridge the gap between conceptual approaches and algorithmic tools for the early evaluation of business ideas.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/su172210124/s1.

Author Contributions

Conceptualization, K.C.-J., D.G.-A. and C.E.R.; methodology, D.G.-A. and C.E.R.; software, C.E.R.; validation, D.G.-A. and C.E.R.; formal analysis, K.C.-J., D.G.-A. and C.E.R.; investigation, K.C.-J., D.G.-A. and C.E.R.; resources, K.C.-J., D.G.-A. and C.E.R.; data curation, D.G.-A. and C.E.R.; writing—original draft preparation, K.C.-J., D.G.-A. and C.E.R.; writing—review and editing, K.C.-J., D.G.-A. and C.E.R.; visualization, K.C.-J., D.G.-A. and C.E.R.; supervision, K.C.-J. and D.G.-A.; project administration, K.C.-J. and D.G.-A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data and code are available in the Supplementary Materials.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Audretsch, D.B.; Belitski, M. Knowledge spillover entrepreneurship and economic growth. Technol. Forecast. Soc. Change 2021, 166, 120650.
2. Urbano, D.; Audretsch, D.; Aparicio, S.; Noguera, M. Does entrepreneurial activity matter for economic growth in developing countries? The role of the institutional environment. Int. Entrep. Manag. J. 2019, 16, 1065–1099.
3. Stam, E.; van de Ven, A. Entrepreneurial ecosystem elements. Small Bus. Econ. 2021, 56, 809–832.
4. Martín-Peña, M.L.; Lorenzo, P.C.; Meyer, N. Digital platforms and business ecosystems: A multidisciplinary approach for new and sustainable business models. Rev. Manag. Sci. 2024, 18, 2465–2482.
5. Rubilar-Torrealba, R.; Chahuán-Jiménez, K.; De La Fuente-Mella, H.; Marzo-Navarro, M. Econometric modeling to measure the social and economic factors in the success of entrepreneurship. Sustainability 2022, 14, 7573.
6. Estrada-Lavilla, R.; Ruiz-Navarro, J. Method for and Analysis of Early-Stage Firm Growth Patterns Using World Bank Data. Sustainability 2024, 16, 1450.
7. Kim, J.; Kim, H.; Geum, Y. How to succeed in the market? Predicting startup success using a machine learning approach. Technol. Forecast. Soc. Change 2023, 193, 122614.
8. González, M.A.M.; Terzidis, O.; Lütz, P.; Heblich, B. Critical decisions at the early stage of start-ups: A systematic literature review. J. Innov. Entrep. 2024, 13, 83.
9. Ahmad, M.; Zheng, J. The Cyclical and Nonlinear Impact of R&D and Innovation Activities on Economic Growth in OECD Economies: A New Perspective. J. Knowl. Econ. 2022, 14, 544–593.
10. Kostis, P.C. Culture, innovation, and economic development. J. Innov. Entrep. 2021, 10, 22.
11. Skare, M.; Porada-Rochon, M. The role of innovation in sustainable growth: A dynamic panel study on micro and macro levels 1990–2019. Technol. Forecast. Soc. Change 2022, 175, 121337.
12. Baden-Fuller, C.; Teece, D.J. Market sensing, dynamic capability, and competitive dynamics. Ind. Mark. Manag. 2020, 89, 105–106.
13. Carbonell Garcia, D.; Van Klyton, A.; Tavera-Mesias, J.F. The moderating effect of digital transformation on environmental turbulence in emerging economies-advancing business model innovation research. Benchmarking Int. J. 2025, 1–31.
14. Gangwani, D.; Zhu, X. Modeling and prediction of business success: A survey. Artif. Intell. Rev. 2024, 57, 44.
15. Park, J.; Choi, S.; Feng, Y. Predicting startup success using two bias-free machine learning: Resolving data imbalance using generative adversarial networks. J. Big Data 2024, 11, 122.
16. Zhang, Z.; Lau, R.Y. Exploiting Multimodal Features and Deep Learning for Predicting Crowdfunding Successes. In Proceedings of the 2024 IEEE International Conference on Omni-layer Intelligent Systems (COINS), London, UK, 29–31 July 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–6.
17. Sadia, R.T.; Cheng, Q. CrunchLLM: Multitask LLMs for Structured Business Reasoning and Outcome Prediction. arXiv 2025.
18. Bonaventura, M.; Ciotti, V.; Panzarasa, P.; Liverani, S.; Lacasa, L.; Latora, V. Predicting success in the worldwide start-up network. Sci. Rep. 2020, 10, 345.
19. Argaw, Y.M.; Liu, Y. The Pathway to Startup Success: A Comprehensive Systematic Review of Critical Factors and the Future Research Agenda in Developed and Emerging Markets. Systems 2024, 12, 541.
20. Ningrum, I.W.K.; Ridho, F.; Wijayanto, A.W. Predicting Startup Success Using Machine Learning Approach. J. Appl. Inform. Comput. 2024, 8, 280–290.
21. Najie, M.; Sofian, A.A.; Sidabutar, R.J.; Untoro, M.C. Optimizing Startup Success Prediction Through SMOTE Oversampling and Classification. J. Intell. Syst. Inf. Technol. 2024, 1, 57–65.
22. Audretsch, D.B.; Belitski, M.; Guerrero, M. Sustainable orientation management and institutional quality: Looking into European entrepreneurial innovation ecosystems. Technovation 2023, 124, 102742.
23. Takas, N.; Kouloumpris, E.; Moutsianas, K.; Liapis, G.; Vlahavas, I.; Kousenidis, D. Startup Sustainability Forecasting with Artificial Intelligence. Appl. Sci. 2024, 14, 8925.
24. Wei, C.P.; Fang, E.S.H.; Yang, C.S.; Liu, P.J. To shine or not to shine: Startup success prediction by exploiting technological and venture-capital-related features. Inf. Manag. 2025, 62, 104152.
25. Truant, E.; Giordino, D.; Borlatto, E.; Bhatia, M. Drivers and barriers of smart technologies for circular economy: Leveraging smart circular economy implementation to nurture companies’ performance. Technol. Forecast. Soc. Change 2024, 198, 122954.
26. Liu, Y.; Tian, G.; Sheng, H.; Zhang, X.; Yuan, G.; Zhang, C. Batch Eol Products Human-Robot Collaborative Remanufacturing Process Planning and Scheduling for Industry 5.0. Robot. Comput.-Integr. Manuf. 2026, 97, 103098.
27. Delmar, F. Measuring growth: Methodological considerations and empirical results. In Entrepreneurship and SME Research; Routledge: Oxfordshire, UK, 2019; pp. 199–215.
28. Davila, A.; Foster, G.; Gupta, M. Venture capital financing and the growth of startup firms. J. Bus. Ventur. 2003, 18, 689–708.
29. Levie, J.; Lichtenstein, B.B. A terminal assessment of stages theory: Introducing a dynamic states approach to entrepreneurship. Entrep. Theory Pract. 2010, 34, 317–350.
30. Pryor, C.; Webb, J.W.; Ireland, R.D.; Ketchen, D.J., Jr. Toward an integration of the behavioral and cognitive influences on the entrepreneurship process. Strateg. Entrep. J. 2016, 10, 21–42.
31. Cohen, B.; Winn, M.I. Market imperfections, opportunity and sustainable entrepreneurship. J. Bus. Ventur. 2007, 22, 29–49.
32. Cohen, B.; Almirall, E.; Chesbrough, H. The emergence of the urban entrepreneurship ecosystem: Ecosystem dynamics in an urban context. Calif. Manag. Rev. 2017, 59, 5–25.
33. Schade, P.; Schuhmacher, M.C. Predicting entrepreneurial activity using machine learning. J. Bus. Ventur. Insights 2023, 19, e00357.
34. Li, Y.; Zadehnoori, I.; Jowhar, A.; Wise, S.; Laplume, A.; Zihayat, M. Learning from Yesterday: Predicting early-stage startup success for accelerators through content and cohort dynamics. J. Bus. Ventur. Insights 2024, 22, e00490.
35. McCarthy, P.X.; Gong, X.; Braesemann, F.; Stephany, F.; Rizoiu, M.A.; Kern, M.L. The impact of founder personalities on startup success. Sci. Rep. 2023, 13, 17200.
36. Kaiser, U.; Kuhn, J. The value of publicly available, textual and non-textual information for startup performance prediction. J. Bus. Ventur. Insights 2020, 14, e00179.
37. Arroyo, J.; Corea, F.; Jimenez-Diaz, G.; Recio-Garcia, J.A. Assessment of Machine Learning Performance for Decision Support in Venture Capital Investments. IEEE Access 2019, 7, 124233–124243.
38. Kaminski, J.; Hopp, C. Predicting outcomes in crowdfunding campaigns with textual, visual, and linguistic signals. Small Bus. Econ. 2020, 55, 627–649.
39. Ralcheva, A.; Roosenboom, P. Forecasting success in equity crowdfunding. Small Bus. Econ. 2020, 55, 39–56.
40. Qiu, Y.; Chen, P.; Huang, W. Enhancing Startup Financing Success Prediction Based on Social Media Sentiment. Systems 2025, 13, 520.
41. Soto-Simeone, A.; Sirén, C.; Antretter, T. New Venture Survival: A Review and Extension. Int. J. Manag. Rev. 2020, 22, 378–407.
42. Zhao, X.; Xu, Y.; Vasa, L.; Shahzad, U. Entrepreneurial ecosystem and urban innovation: Contextual findings in the lens of sustainable development from China. Technol. Forecast. Soc. Change 2023, 191, 122526.
43. Gilsing, R.; Türetken, O.; Grefen, P.; Ozkan, B.; Adali, O.E. Business model evaluation: A systematic review of methods. Pac. Asia J. Assoc. Inf. Syst. 2022, 14, 26–61.
44. Razaghzadeh Bidgoli, M.; Raeesi Vanani, I.; Goodarzi, M. Predicting the success of startups using a machine learning approach. J. Innov. Entrep. 2024, 13, 80.
45. Lee, K.; Roh, T.; Kim, J.; Park, S.; Bae, Y. Unpacking sustainability in start-ups: A systematic review and research agenda. Environ. Dev. Sustain. 2025, 6, 218.
46. Ross, G.; Das, S.; Sciro, D.; Raza, H. CapitalVX: A machine learning model for startup selection and exit prediction. J. Financ. Data Sci. 2021, 7, 94–114.
47. Teece, D. Hand in glove: Open innovation and the dynamic capabilities framework. Strateg. Manag. Rev. 2020, 1, 233–253.
48. Carle, P.É.; Rayna, T. Where to start? Exploring how sustainable startups integrate sustainability impact assessment within their entrepreneurial process. J. Manag. Organ. 2023, 30, 148–164.
49. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357.
50. Fernández, A.; García, S.; Herrera, F.; Chawla, N.V. SMOTE for learning from imbalanced data: Progress and challenges, marking the 15-year anniversary. J. Artif. Intell. Res. 2018, 61, 863–905.

Figure 1. Three main components of the framework.

Figure 2. This is the methodological pipeline for the early evaluation of business ideas using machine learning. The diagram illustrates the eight sequential stages: (1) a data source containing 15 critical variables; (2) preprocessing with differentiated imputation; (3) a stratified division into a training set (80%) and a test set (20%); (4) conditional balancing using SMOTE, which generates synthetic examples when there is a significant imbalance; (5) training and parallel optimization of three ensemble algorithms (Random Forest, Gradient Boosting, and XGBoost) with GridSearchCV and cross-validation (k = 5); (6) generation of predictions on the test set; (7) evaluation using balanced metrics (F1 score, accuracy, precision, recall, and balanced accuracy); and (8) selection of the optimal model (M*) based on the best F1 score. The lower right box details the distinctive features of each algorithm and the hyperparameter optimization process.

Figure 3. This is a conceptual diagram of the training and parallel optimization process for ensemble models. The balanced training data simultaneously feeds three complementary algorithms: Random Forest, Gradient Boosting, and XGBoost. GridSearchCV with 5-fold cross-validation optimizes each algorithm by exhaustively exploring its hyperparameter space to maximize the F1-score. The result is three optimized models (RF*, GB*, and XGB*) that will be evaluated and compared subsequently.

Figure 4. This is a diagram of the prediction generation process. The testing dataset, which contains business ideas that were not seen during training, is distributed to three optimized models: Random Forest*, Gradient Boosting*, and XGBoost*. Each model processes the data independently and generates two types of output: binary predictions (0 = failure, 1 = success) and probabilities of success, P(y = 1 \mid x) \in [0, 1], which quantify the model’s confidence. The result is three complete sets of predictions, which will be evaluated and compared in Phase 7 to identify the model with the best predictive capacity using multiple complementary metrics.

Table 1. Summary of recent studies applying Machine Learning approaches to startup success prediction.

Author	Method/Approach	Main Finding (Brief)	Reinforced Dimension
[33]	Comparison between ML and regression (DT, RF, ANN, k-NN, XGBoost, Naïve Bayes; logistic baseline)	XGBoost outperforms regression; high accuracy under imbalanced scenarios	Algorithmic/Metric methodology
[34]	Two-phase framework; textual/team and cohort features; supervised ML	Cohort-level features enhance prediction in accelerator contexts	Scalability/Environment (network or cohort)
[35]	Interpretable ML based on Big Five personality traits	Traits such as openness and activity, along with team diversity, correlate with startup success	Entrepreneurial team
[36]	Text-as-data combined with administrative data; classification models (AUC)	Public textual information improves prediction of survival and innovation	Innovation/Early signals
[37]	Multiple ML models on >120,000 startups, 3-year horizon	ML supports VC decisions, predicting funding rounds and shutdown risk through multiple signals	Early finance/Risk
[38]	NLP and network analysis; 20,188 campaigns	Textual and visual signals outperform firm-level determinants in early-stage pitches	Innovation/Communication
[39]	Parsimonious model with 3-year moving window	Equity retention, founder experience, and accelerator participation predict success	Finance/Team
[40]	DSS with DNN; Crunchbase + Twitter; BERTweet for sentiment analysis	Incorporating social sentiment enhances prediction accuracy of funding outcomes	Environment/Market (external signals)

Table 2. The proposed framework integrates variables for the early evaluation of business ideas.

Dimension	Variable	Type	Indicator/Question	Encoding
Innovation	Novelty of the solution	Ordinal	Level of technical or commercial novelty	Likert 1–5
	Intellectual property protection	Categorical	Is there an IP registration or application?	1 = Yes, 0 = No
	Business model differentiation	Ordinal	Degree of differentiation from the market	Likert scale: 1–5
Sustainability	Environmental impact	Ordinal	Estimated level of environmental impact	1 = Low, 2 = Medium, 3 = High
	Circular economy	Ordinal	Proportion of reusable/recyclable inputs	0 = None, 1 = Partial, 2 = Total
	Social inclusion	Categorical	Does it consider historically excluded groups?	1 = Yes, 0 = No
Entrepreneurial team	Sector experience	Numerical	Average years of experience in the sector	Scale: 0–1
	Functional diversity	Ordinal	Coverage of critical roles (technical, business, etc.)	Likert 1–5
	Dedication to the project	Numerical	Percentage of dedication to the project	0–100%
Scalability	Market size (TAM/SAM/SOM)	Numerical	Documented estimate of the target market	Scale 0–1
	Scaling potential	Ordinal	Ability to expand geographically/sectorally	Likert 1–5
	Early traction	Numerical	Initial signs of interest (leads, pre-sales)	Scale 0–1
Initial financing	Capital intensity (CAPEX/OPEX)	Numerical	Initial investment vs. opportunity ratio	Scale 0–1 (inverted)
	Time to break-even	Numerical	Estimated months to reach break-even	Scale 0–1 (inverted)
	Initial funding sources	Categorical	Do you have seed capital or investors?	1 = Yes, 0 = No
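
The encodings in Table 2 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the field names, the helper functions `min_max` and `check_range`, and the upper bounds in `CAPS` are hypothetical, since the table states only "Scale 0–1" without giving the normalization bounds. Only seven of the fifteen variables are shown, one or more per encoding type.

```python
# Assumed upper bounds for min-max scaling (hypothetical; Table 2 gives none).
CAPS = {"sector_experience_years": 20.0, "break_even_months": 60.0}


def min_max(value, upper, inverted=False):
    """Scale value from [0, upper] to [0, 1], clipping out-of-range input.

    With inverted=True a larger raw value yields a smaller score, matching
    the "Scale 0-1 (inverted)" rows of Table 2.
    """
    unit = min(max(value / upper, 0.0), 1.0)
    return 1.0 - unit if inverted else unit


def check_range(name, value, low, high):
    """Return value unchanged if it lies in [low, high], else raise."""
    if not low <= value <= high:
        raise ValueError(f"{name}={value} is outside [{low}, {high}]")
    return value


def encode_idea(idea):
    """Encode a subset of the Table 2 variables for one business idea."""
    return [
        check_range("novelty", idea["novelty"], 1, 5),  # Likert 1-5
        1 if idea["ip_protection"] else 0,  # 1 = Yes, 0 = No
        check_range("environmental_impact", idea["environmental_impact"], 1, 3),
        check_range("circular_economy", idea["circular_economy"], 0, 2),
        min_max(idea["sector_experience_years"], CAPS["sector_experience_years"]),
        check_range("dedication_pct", idea["dedication_pct"], 0, 100),  # 0-100%
        min_max(idea["break_even_months"], CAPS["break_even_months"], inverted=True),
    ]


# Hypothetical idea used only to exercise the encoder.
example_idea = {
    "novelty": 4,
    "ip_protection": True,
    "environmental_impact": 1,
    "circular_economy": 2,
    "sector_experience_years": 5,
    "dedication_pct": 80,
    "break_even_months": 15,
}
features = encode_idea(example_idea)
print(features)
```

Under the assumed caps, five years of sector experience maps to 0.25, and a 15-month break-even horizon maps to 0.75 after inversion, so a shorter time to break-even produces a higher score.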

Table 3. Contrast between traditional approaches and the proposed framework.

Aspect	Traditional Approaches	Proposed Framework
Dimensions	Financial dominance; marginal innovation and sustainability	15 variables in 5 dimensions (innovation, sustainability, team, scalability, finance)
Non-linearity	Restrictive linear assumptions	Ensemble algorithms that capture interactions and non-linearities
Imbalance	Generally not addressed	SMOTE and balanced metrics (F1, balanced accuracy)
Validation	Limited cross-validation; low replicability	Systematic validation and replication-ready design
Generalization	Sensitive to concept drift and survivorship bias	Scalable pipeline oriented towards early decisions
Interpretability	Global coefficients	Importance of variables and dimensions; model-agnostic explainability
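
The "Imbalance" row of Table 3 can be made concrete with a short worked example of why balanced metrics matter. The sketch below uses plain Python and the standard definitions of accuracy, balanced accuracy, and F1; the labels are invented for illustration and are not data from the study, and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 = successful idea."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def accuracy(tp, tn, fp, fn):
    return safe_div(tp + tn, tp + tn + fp + fn)


def balanced_accuracy(tp, tn, fp, fn):
    """Mean of recall on the positive class and recall on the negative class."""
    recall = safe_div(tp, tp + fn)
    specificity = safe_div(tn, tn + fp)
    return (recall + specificity) / 2


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN)."""
    return safe_div(2 * tp, 2 * tp + fp + fn)


# Illustrative imbalanced sample: 2 successful ideas out of 10.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# A trivial classifier that always predicts "not successful".
y_pred = [0] * 10

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, tn, fp, fn)
bal_acc = balanced_accuracy(tp, tn, fp, fn)
f1 = f1_score(tp, fp, fn)
print(acc, bal_acc, f1)
```

In this example the trivial classifier reaches 0.8 accuracy while its balanced accuracy is 0.5 and its F1 is 0.0, which is the kind of misleading result that motivates reporting balanced metrics alongside resampling techniques such as SMOTE.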


© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

