1. Introduction
In recent years, the rapid development of internet technology has triggered a global entrepreneurial boom, with startups flourishing in many countries. Startups not only serve as key drivers of technological innovation and advancement but also play a vital role in addressing employment issues and promoting economic diversification [1]. The emergence of business information databases such as Crunchbase and BVD–ORBIS has transformed the way startups secure funding and reshaped the financing landscape [2]. These platforms help startups increase project visibility, expand financing channels, and provide investors with valuable decision-making references. Nevertheless, many startups fail to secure funding due to factors such as financing risks and capital constraints [3], unclear market prospects [4], and founders’ lack of experience [5]. For startup founders and investors, understanding the factors contributing to successful financing and predicting the likelihood of financing success is of significant importance.
In 2009, the U.S.-based restaurant chain Naked Pizza leveraged Twitter for online promotion, attracting a large number of investors and growing from a small eatery into a multinational restaurant chain. The rapid development of digital technology has turned social media platforms into crucial channels for communication, interaction, and information sharing. Entrepreneurs can use social media interactions to expand their networks and accumulate social capital, facilitating resource acquisition and entrepreneurial realization [6]. Social media has become one of the indispensable conditions for entrepreneurial success [7]. The positive impact of social media on corporate performance has been widely recognized, making it essential to incorporate social media analysis into the prediction of startup financing success.
Current research on predicting startup financing success mainly focuses on structured data such as financial indicators and project characteristics, often employing statistical and regression models. While these methods reveal certain linear relationships, they face significant limitations in handling complex nonlinear features and high-dimensional data. Additionally, such approaches often overlook startups’ performance in external information dissemination and market feedback, particularly the potential influence of social media as a critical channel for modern information exchange and sentiment propagation. Textual and social media data contain abundant latent value that can provide a more comprehensive perspective for evaluating startups’ financing potential. Social media texts are rich in sentiment and public perception information, reflecting market attitudes and confidence toward enterprises. The rapid development of sentiment analysis techniques enables the extraction of valuable features from unstructured data. By analyzing the sentiment polarity (e.g., positive, negative, neutral) and sentiment intensity (e.g., degree of emotional strength) in tweets and other social media texts, it is possible to capture subtle changes in public sentiment and further reveal the dynamic relationship between enterprises and the market.
Against this backdrop, this study aimed to construct a decision support system for startup financing success by integrating traditional financial numerical indicators with multi-source data from corporate social media and news information, thereby unveiling the impact mechanisms of various factors on financing performance. First, financial indicators for financing projects were systematically collected from the Crunchbase platform. Second, social media data for the financing enterprises were retrieved from Twitter, and sentiment analysis was performed using BERTweet to capture latent factors such as market attitudes and expectations toward the enterprises. Finally, traditional numerical indicators were combined with social media sentiment metrics to construct a decision support system for predicting financing success using a deep neural network (DNN). The results demonstrated that predictions based on social media news data significantly outperformed a traditional decision support system relying solely on financial indicators, highlighting the crucial role of the numerical and sentiment features of social media news in predicting startup financing outcomes. The findings provide strategies for startup founders to improve financing success rates and offer guidance for investors in assessing startup potential and risks, thereby fostering the healthy and efficient development of the startup financing market.
To guide the research process, we defined two main research questions and associated hypotheses as follows:
RQ1: How can a sentiment-aware decision support system (DSS) be constructed to effectively predict startup financing success using both structured financial indicators and unstructured social media signals?
RQ2: Which types of sentiment and engagement features extracted from social media are most predictive of startup financing success?
H2: Sentiment polarity proportions (positive, negative, neutral), sentiment intensity, and user engagement metrics (likes, retweets, comments) jointly contribute significantly to the predictive performance of the DSS.
2. Literature Review
2.1. Traditional Predictive Models for Startup Financing
Effective variable selection is crucial for accurately describing the characteristics of startups and predicting financing success. Many scholars have studied the factors that influence the financing success of startups [8]. The factors affecting startup financing success are categorized into internal resources, such as the company’s own capital, credit conditions [4], technical capabilities, and the experience and expertise of the management team; external environment factors, including overall economic outlook, the prospects of the company’s specific industry, the maturity of financial markets, and government policy support [9]; and strategic choices, such as market positioning and business model [10].
Since startup founders often raise funds through internet crowdfunding platforms, the way they describe their projects [11], language style [12], interaction between founders and investors [13], founder information [14], and information update frequency [15] are also important factors influencing financing performance. Factors such as the number of project images, the presence of videos, frequency of project updates, number of social media links, reward levels, and number of comments positively impact financing outcomes, while target amount and fundraising duration negatively affect financing success. Project nature and minimum investment amount are also key determinants of financing success [16,17]. Koch et al. found that the founder’s experience, the number of supported projects, the number of initiated projects, and the founder’s notoriety positively influence the financing amount [18]. Huckman et al. [19] argued that the experience of team members is crucial for startup financing, noting that a specialized and high-quality team has a positive impact on project success. Moreover, the number of project supporters also has a positive effect on financing success.
In studies predicting the success rate of startup financing based on the above factors, scholars have employed various statistical models and regression analysis methods. Mollick [20], using data from the crowdfunding platform Kickstarter, applied regression models and Cox proportional hazard models to find that project quality and the social networks of project founders are directly related to crowdfunding success. Xu et al. [21] conducted a semantic analysis of crowdfunding projects on Kickstarter and used hierarchical logistic regression models to analyze how different types of updates and their timing affect the success of crowdfunding campaigns. Cumming et al. [17] compared project characteristics and financing outcomes under two crowdfunding models, “AON” (all-or-nothing) and “KIA” (keep-it-all), using regression analysis and propensity score matching methods. Huang et al. [22] used logistic regression analysis to study the factors influencing the success of crowdfunding projects on platforms, finding that project rewards, related metrics of the project initiators, the number of images, and the increase in the number of shares and comments all positively influenced the success of project financing, while project financing scale had a negative impact. Although these traditional empirical methods can explain certain influencing factors to some extent, they have limitations in dealing with complex nonlinear relationships and high-dimensional data.
2.2. Research on Corporate Social Media
Third-party social networks provide a new way to reduce information asymmetry between fundraisers and investors and enhance financing performance [23]. Entrepreneurs can use social media to expand their project’s influence and attract interested investors. Kim et al. [24] pointed out that media exposure is one of the key factors determining the success of startups. Through social media interactions, entrepreneurs can expand their social networks and accumulate social capital, which facilitates resource acquisition and helps achieve financing goals [6]. Liao et al. [25] confirmed that the more social capital a project initiator possesses, the more advantageous it is for the project’s financing. Liu et al. [23] found that entrepreneurs sharing information through third-party social networks can reduce information asymmetry, positively influencing their financing performance. Moreover, the facilitative effect of personal information sharing and project information sharing demonstrates stage-specific dynamics.
Through social media, entrepreneurs can not only attract potential investors but also shape their personal and corporate image [26]. Investors can share information and gain supplementary investment decision-making insights. From the investor’s perspective, early investors who are socially closer to the project are more likely to contribute to successful financing [27].
The use of social media also significantly impacts other aspects of businesses, such as employee satisfaction, corporate performance, and innovation capability, further demonstrating its positive effects on companies. Fu et al. [28] found that the use of social media promotes social capital accumulation among employees, including better communication and collaboration. This not only improves work efficiency but also plays a crucial role in enhancing employees’ sense of social support and belonging, directly boosting job satisfaction and workplace well-being. Parveen et al. [29] conducted in-depth interviews with six organizations and explored the impact of social media usage on corporate performance. They found that the effective use of social media could significantly enhance customer relationships, reduce marketing and customer service costs, increase brand awareness, and provide competitive advantages through information sharing and market insights. Hannu et al. [30] discovered that the use of social media facilitates breakthrough innovation for companies. They suggested that corporate managers should develop innovation strategies that align with the company’s actual situation in order to fully leverage the potential of social media. Zhou et al. [11] employed an LDA topic model to identify key topics in user-generated content on Twitter, revealing the critical factors for startup success and exploring the emotional features and themes that impact corporate success. Therefore, corporate social media plays a crucial role in the success and subsequent development of startups. Incorporating social media into the prediction of startup financing success can improve prediction accuracy and reduce investment risks.
2.3. Application of Deep Learning Models in Financing Prediction
In recent years, with the rapid development of artificial intelligence technology, deep learning models have been widely applied in various performance predictions for businesses due to their strong data mining, pattern recognition, and forecasting capabilities, providing new approaches to research in this field. Lee et al. [31] used sequence-to-sequence neural network models and hierarchical attention network models to improve the accuracy of crowdfunding project success prediction, based on text data such as project descriptions, updates, and comments. Monika et al. [32] developed a deep learning model based on artificial neural networks to predict the valuation of entrepreneurial enterprises through internal resources, industrial structure, and transaction data. Hsairi et al. [1] investigated the impact of non-financial factors on the success or failure of startups by constructing a convolutional neural network model and found that the predictive accuracy of convolutional neural networks was superior to that of artificial neural networks. Jung et al. [33] employed crowdfunding project data from the Kickstarter platform, using deep learning models to calculate the tone heterogeneity between project descriptions and existing projects to obtain the overall novelty level of the project. They discovered that when a project had higher novelty, positive two-way communication contributed to the success of financing. Tang et al. [34] proposed a deep cross-attention network (DCAN) to extract text and video features of crowdfunding projects and used the cross-attention mechanism to fuse features from different modalities, demonstrating good performance in crowdfunding success prediction tasks with certain interpretability.
Compared to traditional regression analysis models, deep learning models such as artificial neural networks, convolutional neural networks, and deep neural networks are capable of processing unstructured data and handling complex high-dimensional data, improving prediction accuracy through automatic feature learning. These models are not limited to traditional numerical indicators but can integrate multi-source and multi-type data, thus enhancing prediction performance, as shown in Table 1.
2.4. Research on Sentiment Analysis
Sentiment analysis, a crucial subfield of natural language processing, focuses on identifying and evaluating emotions embedded in textual content. Its significance has been increasingly accentuated due to its capabilities in facilitating trend prediction, enhancing customer satisfaction, maintaining brand reputation, and enabling precision marketing.
Shivani et al. [35] conducted a sentiment analysis on Twitter’s American Airlines intelligence dataset using a random forest model, extracting features via TF-IDF vectorization for emotional classification. The results demonstrated that sentiment analysis can significantly improve airline user experience. Raut et al. [36] integrated financial data analysis with sentiment analysis, evaluated multiple stock price prediction models, and optimized stock market prediction frameworks. Huang et al. [37] proposed an integrated approach based on Graph Attention Networks (GAT) and Heterogeneous Graph Neural Networks (HGNN), which achieves precise and personalized marketing strategy optimization by deeply mining user emotional information and social interactions, notably improving the accuracy and ranking quality of recommendation systems. Suhaimin et al. [38] developed a multi-task learning embedding framework that captures subtle features of emotions and irony for detecting public safety threats.
In financial forecasting, Christina et al. [39] integrated sentiment analysis from Twitter with historical stock data, achieving strong performance in daily stock price prediction. Tarun et al. [40] explored the use of federated learning for sentiment modeling, enabling decentralized training across multiple platforms while preserving privacy and adaptability.
These studies collectively demonstrated that sentiment signals—when accurately captured and modeled—can significantly enhance predictive performance across domains. This provides important theoretical and technical foundations for integrating sentiment features into startup financing success prediction systems.
Current research on startup financing success rates has the following shortcomings: (1) In terms of indicator selection, the focus is mainly on the characteristics of the financing project itself, including target financing amount, project returns, fundraising duration, and other numerical indicators [41], as well as textual information such as company descriptions, founder backgrounds, and crowdfunding platform interaction features, while neglecting the impact of external social media information on financing performance. (2) Regarding prediction methods, linear regression, Logit regression, and other statistical analysis methods are commonly used to explain the linear relationships between financing performance and various indicators, but such traditional methods struggle to handle nonlinear relationships and high-dimensional data.
Therefore, the aim of this study was to fill this gap by exploring how deep neural network models can extract effective features from online news, thereby improving the accuracy of predicting startup financing success. At the same time, this research has strong practical significance. By combining online news data with deep neural network technology, this study provides investors with a new financing evaluation tool, enabling them to more efficiently obtain valuable predictive insights from complex market information. Furthermore, this study can provide useful insights for business managers, helping them better understand market feedback and optimize their market strategies and financing plans.
3. Model Construction
3.1. Deep Learning Framework Based on Multi-Source Data
To support decision-making in startup investment, we proposed a decision support system (DSS) framework that predicts financing success by integrating multi-source data. As shown in Figure 1, the framework combines text data from social media (e.g., Twitter) and structured numerical data (e.g., Crunchbase). The text pipeline involves natural language processing and sentiment analysis using a pre-trained transformer (BERTweet), while the numerical pipeline handles missing values, scaling, and encoding. Both data streams are fed into a multi-layer perceptron for prediction. The model is further enhanced through ensemble learning, hyperparameter tuning, and evaluation, with additional modules for interpretability and deployment readiness.
3.2. BERTweet Model and Sentiment Feature Extraction
To effectively extract sentiment features from social media texts, this study adopted the BERTweet model, which is a variant of the BERT (Bidirectional Encoder Representations from Transformers) model, specifically optimized for Twitter data. The model architecture is the same as BERT, utilizing the Transformer structure and bidirectional encoders and capturing internal sentence dependencies through the self-attention mechanism. The training process of the BERTweet model uses a large corpus of Twitter text data to adapt to the language features commonly found in social media, such as abbreviations, emojis, and punctuation.
3.2.1. Representation of Input Text Embeddings
Given a specific tweet $T = \{w_1, w_2, \ldots, w_n\}$, where $w_i$ represents the $i$-th token or word in the text, the BERTweet model first maps these tokens to an embedding space, generating a word embedding matrix $E \in \mathbb{R}^{n \times d}$. Here, $n$ denotes the length of the token sequence, and $d$ is the dimensionality of each token embedding. The input embeddings are represented using the following formula:
$$E = E_{\text{token}} + E_{\text{position}} + E_{\text{segment}}$$
In this formula, $E_{\text{token}}$ represents the token embedding, $E_{\text{position}}$ represents the position encoding, and $E_{\text{segment}}$ represents the segment encoding. The resulting input matrix $E$ is subsequently processed by the BERTweet model for further analysis.
3.2.2. Transformer Encoder
The BERTweet model utilizes a multi-layer Transformer encoder to extract features from the text. The output of each layer is computed through the self-attention mechanism, expressed as:
$$H^{(l)} = \text{TransformerLayer}\left(H^{(l-1)}\right), \quad l = 1, \ldots, L$$
Here, $H^{(l)}$ represents the output of the $l$-th layer, and $L$ denotes the total number of Transformer layers. The self-attention mechanism is computed using the following formula:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
In this equation, $Q$, $K$, and $V$ refer to the query, key, and value matrices, respectively, while $d_k$ represents the dimensionality of the keys.
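The self-attention computation described above can be illustrated with a minimal pure-Python sketch. It is not BERTweet's actual implementation, which is multi-headed and batched; the function names `softmax` and `scaled_dot_product_attention` are hypothetical, and matrices are represented as plain lists of row vectors for readability.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V are lists of row vectors; K and V must have the same number of rows.
    """
    d_k = len(K[0])  # dimensionality of the keys
    output = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors
        output.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return output
```

When all keys are identical, every value receives the same weight, so the output is simply the mean of the value vectors; this is a convenient sanity check for the sketch.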
3.2.3. Sentiment Analysis and Sentiment Scores
The sentiment analysis task includes the classification of sentiment polarity (positive, negative, or neutral) and the prediction of sentiment intensity (sentiment scores). After training and fine-tuning, the BERTweet model is capable of extracting sentiment features from text and outputting both sentiment categories and intensity, as shown in Figure 2.
For sentiment polarity classification, the BERTweet model maps the input text sequence $T$ into a classification label, typically positive, negative, or neutral. Using the Softmax function, the model projects the sentiment category distribution as:
$$P(y \mid T) = \text{softmax}\left(W h^{(L)} + b\right)$$
Here, $h^{(L)}$ represents the output of the last layer of the Transformer encoder, $W$ is the weight matrix, and $b$ is the bias term. $P(y \mid T)$ denotes the probability distribution of each sentiment category for the input text.
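As an illustration of the classification head described above, the following sketch applies a linear layer followed by Softmax to an encoder output vector. The function name `polarity_distribution`, the label order in `LABELS`, and the toy weights are assumptions made for this example and do not come from the study.

```python
import math

# Hypothetical label order; the real order depends on how the model was fine-tuned.
LABELS = ("negative", "neutral", "positive")


def polarity_distribution(h, W, b):
    """Return softmax(W h + b) as a dict mapping each label to its probability.

    h: encoder output vector, W: one weight row per label, b: one bias per label.
    """
    logits = [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(LABELS, exps)}
```

The predicted polarity is the label with the highest probability, for example `max(dist, key=dist.get)`.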
In addition to sentiment polarity classification, the BERTweet model can also output continuous sentiment scores, representing the intensity of the sentiment. This task is treated as a regression problem, where the model optimizes the sentiment score prediction through a regression loss function:
$$\mathcal{L}_{\text{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\left(s_i - \hat{s}_i\right)^2$$
Here, $s_i$ denotes the true sentiment score of the $i$-th sample, $\hat{s}_i$ is the sentiment score predicted by the model, and $N$ is the number of samples. Sentiment scores are typically normalized to a fixed range, such as 0 to 1, where 0 indicates completely negative sentiment, and 1 indicates completely positive sentiment.
3.2.4. Model Training
The training process of the BERTweet model includes two stages, pre-training and fine-tuning, to adapt to social media text characteristics and accurately extract sentiment information.
During the pre-training stage, BERTweet leverages a large corpus of Twitter text data and primarily adopts the Masked Language Model (MLM) task to learn contextual dependencies. This approach helps the model capture the informal syntax, abbreviations, and symbolic characteristics of social media text, providing strong initial feature representation capabilities for the model.
In the fine-tuning stage, the BERTweet model further optimizes its parameters based on the sentiment analysis task, using labeled text data for detailed adjustment. This ensures the model can meet specific application requirements. The sentiment analysis task is divided into two subtasks: sentiment polarity classification and sentiment intensity regression. Sentiment polarity classification predicts whether the input text belongs to positive, negative, or neutral categories, while sentiment intensity regression provides continuous scores ranging from 0 (completely negative) to 1 (completely positive), reflecting the sentiment strength.
This two-stage training framework ensures the model’s adaptability and precision, making it an effective component within broader decision support systems that require real-time and reliable social media sentiment signals.
To jointly optimize these two subtasks, a combined loss function is defined as follows:
$$\mathcal{L} = \alpha\,\mathcal{L}_{\text{CE}} + \beta\,\mathcal{L}_{\text{MSE}}$$
Here, $\mathcal{L}_{\text{CE}}$ represents the cross-entropy loss, which is used to optimize sentiment classification accuracy, and $\mathcal{L}_{\text{MSE}}$ represents the mean squared error loss, which is used to minimize the error in sentiment intensity prediction. $\alpha$ and $\beta$ are hyperparameters that control the relative importance of the two subtasks.
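The weighted combination of the classification and regression losses can be sketched as follows. This is a minimal illustration, not the training code used in the study; the parameter names `w_cls` and `w_reg` stand in for the two weighting hyperparameters and are hypothetical, as is the choice to average both losses over the batch.

```python
import math


def cross_entropy(probs, label):
    """Negative log-likelihood of the true class index under predicted probabilities."""
    return -math.log(probs[label])


def combined_loss(class_probs, labels, pred_scores, true_scores, w_cls=1.0, w_reg=1.0):
    """Weighted sum of mean cross-entropy (polarity) and mean squared error (intensity).

    class_probs: one probability list per sample; labels: true class indices;
    pred_scores / true_scores: predicted and true intensity values in [0, 1].
    """
    n = len(labels)
    ce = sum(cross_entropy(p, y) for p, y in zip(class_probs, labels)) / n
    mse = sum((s - t) ** 2 for s, t in zip(pred_scores, true_scores)) / n
    return w_cls * ce + w_reg * mse
```

Setting `w_reg=0.0` reduces the objective to pure polarity classification, which makes it easy to check how much the intensity subtask contributes during tuning.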
3.3. Construction of Social Media Feature Matrices
To provide finer-grained social media sentiment features, this study designed a temporal sentiment aggregation module as an integral component of the overall decision support system (DSS). Specifically, the system aggregates tweet data on a quarterly basis, computing average sentiment values and interaction metrics to construct a quarterly level feature matrix for each startup. Within this subsystem, interaction metrics—including retweet count, likes, and comments—are extracted from tweet metadata. These metrics are normalized and averaged within each quarter, allowing the system to capture fluctuations in public engagement and attention toward the company over time.
The feature matrix is structured as $X \in \mathbb{R}^{Q \times 7}$, where $Q$ denotes the number of quarters. The 7 columns comprise three interaction-related features (quarterly averages of retweets, likes, and comments), three sentiment polarity proportions (positive, negative, and neutral), and the quarterly average sentiment intensity. This approach captures temporal changes in market sentiment while integrating the overall level of public interaction. The resulting quarterly feature matrix, combined with financial data and other features, serves as input to predictive models, further improving the accuracy of predicting the financing success of startups.
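The quarterly aggregation described above can be sketched in a few lines of Python. The record layout (keys `date`, `retweets`, `likes`, `comments`, `polarity`, `intensity`) and the function names are assumptions made for illustration, and the sketch omits the normalization step mentioned in the text for brevity.

```python
from collections import defaultdict
from datetime import date

POLARITIES = ("positive", "negative", "neutral")


def quarter_key(d):
    """Map a date to a (year, quarter) tuple, e.g. 2023-05-01 -> (2023, 2)."""
    return (d.year, (d.month - 1) // 3 + 1)


def quarterly_feature_matrix(tweets):
    """Build one 7-feature row per quarter from per-tweet records.

    Row layout: mean retweets, mean likes, mean comments,
    share positive, share negative, share neutral, mean intensity.
    Returns (sorted quarter keys, matrix rows in the same order).
    """
    buckets = defaultdict(list)
    for t in tweets:
        buckets[quarter_key(t["date"])].append(t)

    quarters = sorted(buckets)
    matrix = []
    for q in quarters:
        group = buckets[q]
        n = len(group)
        row = [sum(t[k] for t in group) / n for k in ("retweets", "likes", "comments")]
        row += [sum(t["polarity"] == p for t in group) / n for p in POLARITIES]
        row.append(sum(t["intensity"] for t in group) / n)
        matrix.append(row)
    return quarters, matrix
```

Each startup then yields a matrix with one row per observed quarter, which can be flattened or summarized before being joined with the structured financial features.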
3.4. Data Feature Integration
To achieve the effective integration of multi-source data, this study proposed a decision support system (DSS)-oriented deep learning framework that systematically combines three core types of data—financial attributes, social media numerical indicators, and sentiment analysis features—for the comprehensive prediction of startup financing success. This integrated system is designed to support stakeholders in making informed investment decisions by leveraging both internal company fundamentals and external market perceptions.
The financial data module captures foundational enterprise-level information, including industry classification, product and technology types, number of employees, and company size. These structured features are standardized to eliminate scale inconsistencies and are transformed into high-dimensional vectors through embedding layers. As key representations of internal enterprise capability, they form the backbone of the decision support system’s predictive logic.
The social media numerical data module incorporates external signals derived from Twitter interaction metrics such as retweets, likes, and comment counts. These features are normalized to mitigate bias and passed through fully connected layers to uncover underlying nonlinear patterns. By integrating social engagement indicators, the DSS framework accounts for public attention and behavioral signals that reflect the company’s external visibility and reputation in real time.
The sentiment analysis module utilizes the BERTweet model, which is pre-trained on large-scale Twitter corpora to extract text-based sentiment features. It outputs both sentiment polarity (e.g., positive, neutral, or negative) and sentiment intensity (continuous values from 0 to 1), quantifying market sentiment toward startups with granularity. These outputs are transformed through neural layers to align with the dimensionality of other input sources, thereby enabling seamless integration.
3.5. Data Fusion and Prediction Engine
The system’s fusion layer adopts an early fusion strategy that concatenates the processed outputs from the three data modules into a unified feature vector. This vector is then passed into a deep neural network (DNN), which serves as the core predictive engine of the DSS. The DNN comprises multiple nonlinear transformation layers that extract deep-level interactions between heterogeneous features. The final layer outputs a binary classification result, indicating the predicted success or failure of a startup’s financing round.
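A minimal sketch of the early fusion step and the forward pass of the prediction network is given below. It assumes ReLU activations in the hidden layers, a sigmoid output unit, and a 0.5 decision threshold, none of which are specified in the text; the function names and the `layers` structure (a list of weight and bias pairs) are likewise hypothetical.

```python
import math


def dense(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(v):
    return [max(0.0, a) for a in v]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_success(financial, engagement, sentiment, layers):
    """Early fusion followed by an MLP forward pass.

    financial, engagement, sentiment: already-preprocessed feature vectors.
    layers: list of (weights, bias) pairs; the last pair must have one output unit.
    Returns (probability of financing success, binary label).
    """
    # Early fusion: concatenate the three module outputs into one vector
    h = list(financial) + list(engagement) + list(sentiment)
    *hidden, (w_out, b_out) = layers
    for w, b in hidden:
        h = relu(dense(h, w, b))
    prob = sigmoid(dense(h, w_out, b_out)[0])
    return prob, int(prob >= 0.5)
```

Because fusion happens before any shared layer, the network can learn interactions between, for example, funding amount and sentiment intensity, at the cost of requiring all three modules to be available for every sample.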
To further enhance the system’s predictive reliability, the framework incorporates optimization strategies including the Binary Cross-Entropy Loss as the objective function. This loss function effectively measures the deviation between predicted probabilities and actual outcomes, ensuring robust model training for real-world application in decision support scenarios.
The Binary Cross-Entropy Loss function is particularly suitable for the binary classification task of this study (i.e., financing success or failure). Its main advantage lies in effectively quantifying the error between the model’s output probabilities and the true labels. In financing scenarios, where data often exhibit significant class imbalance, a class-weighted form of the Binary Cross-Entropy Loss can mitigate this imbalance by assigning larger weights to samples of the minority class, thereby reducing the model’s bias toward the majority class.
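The class weighting mentioned above can be illustrated with a short sketch of a weighted binary cross-entropy. The text does not state which weighting scheme was used, so the single `pos_weight` factor applied to positive samples is an assumption, as are the function name and the clipping constant `eps`.

```python
import math


def weighted_bce(y_true, y_prob, pos_weight=1.0, eps=1e-12):
    """Mean binary cross-entropy with an extra weight on the positive class.

    y_true: 0/1 labels; y_prob: predicted probabilities of the positive class.
    pos_weight > 1 penalizes missed positives more heavily (hypothetical scheme).
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)
```

With `pos_weight=1.0` this reduces to the standard loss; a common heuristic is to set the weight to the ratio of negative to positive samples in the training set.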
4. Experimental Analysis
4.1. Data Sources
The data used in this study were obtained from the Crunchbase database and the tweets posted on companies’ Twitter accounts. Crunchbase provides key data about startups, such as company size, number of employees, funding history, and technology usage. Meanwhile, Twitter tweets reflect online news related to the companies as well as their social media interactions and public sentiment feedback, capturing public opinions and attitudes toward startups.
To ensure the timeliness and accuracy of the data, this study utilized data spanning 24 months, from January 2023 to December 2024, covering 2000 startups and their corresponding online news. The dependent variable in this study was the outcome of startup financing, categorized into two scenarios: financing success and financing failure. Specifically, a startup was labeled a financing success if it secured the next round of funding within one year after receiving angel-round funding [42], and a financing failure otherwise.
To ensure temporal alignment between social media features and financing events, we used the timestamp of each funding event as the anchor point. Specifically, for each startup, we aggregated tweet sentiment features within the quarter preceding the funding date. This window captured public sentiment and engagement prior to each financing decision, reducing data leakage. For startups with multiple funding rounds, tweet features were dynamically updated per round to reflect evolving public perceptions, as shown in Figure 3.
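The windowing rule can be sketched as a simple filter that keeps only tweets posted strictly before the funding date. The sketch interprets "the quarter preceding the funding date" as a rolling 90-day window, which is an assumption (a calendar-quarter definition is equally plausible); the function name and the `date` record key are hypothetical.

```python
from datetime import date, timedelta


def tweets_before_funding(tweets, funding_date, window_days=90):
    """Select tweets in [funding_date - window_days, funding_date).

    Excluding the funding date itself and anything later prevents
    post-announcement reactions from leaking into the features.
    """
    start = funding_date - timedelta(days=window_days)
    return [t for t in tweets if start <= t["date"] < funding_date]
```

For a startup with several rounds, the same function would be called once per round with that round's funding date, so each round receives its own feature window.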
To mitigate potential biases and ensure sample representativeness, we selected startups across a range of industries (e.g., FinTech, HealthTech, AI, E-commerce) and geographic regions, focusing on companies active in the US, UK, and India. We also observed a long-tail distribution in tweet volume, with a small number of companies being highly active on Twitter. To address this, we normalized tweet-based features by company activity level and conducted robustness checks on high-activity and low-activity subsets, as detailed in the experiment section.
For independent variables, the dataset included various features sourced from Crunchbase, such as Industries, Financing Amount, Last Funding Amount, Total Equity Funding Amount, Active Tech Count, Number of Articles, IT Spend, Number of Founders, and Number of Employees. Additionally, information about tweets posted by companies on Twitter, including tweet content, posting time, and interaction metrics (e.g., number of likes, retweets, and comments), was incorporated as numerical features related to social media sentiment. Sentiment analysis was performed on the comments of each tweet using the BERTweet model, which outputs sentiment scores and categorizes them into sentiment polarity classes (positive, negative, neutral). These sentiment scores were included as social media sentiment features for the model. A detailed description of all indicators is provided in Table 2.
4.2. Data Quality Checks and Preprocessing
To ensure data validity and model reliability, we conducted thorough data cleaning and preprocessing on both the Crunchbase and Twitter datasets.
For Crunchbase, we examined missing value rates across all structured features (e.g., funding amount, number of founders, employee size, company category). Features with low missing rates were imputed using median (for numeric variables) or mode (for categorical variables), while features with over 30% missing values were excluded from the modeling process. Duplicate entries were removed based on a composite key of company ID and funding date, ensuring the uniqueness of each funding event.
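The Crunchbase cleaning rules above (deduplication on a composite key, dropping features with more than 30% missing values, and median or mode imputation) can be sketched as follows. The field names `company_id` and `funding_date`, the function name, and the choice to deduplicate before computing missing rates are assumptions made for this example.

```python
from statistics import median, mode


def clean_records(records, numeric_cols, categorical_cols, max_missing=0.30):
    """Deduplicate funding events, drop sparse columns, and impute the rest.

    records: list of dicts; missing values are represented as None.
    Returns (cleaned rows, list of columns that were kept).
    """
    # 1. Remove duplicates on the composite key, keeping the first occurrence
    seen, rows = set(), []
    for r in records:
        key = (r["company_id"], r["funding_date"])
        if key not in seen:
            seen.add(key)
            rows.append(dict(r))
    if not rows:
        return [], []

    kept = []
    for col in list(numeric_cols) + list(categorical_cols):
        present = [r[col] for r in rows if r.get(col) is not None]
        missing_rate = 1 - len(present) / len(rows)
        # 2. Exclude features whose missing rate exceeds the threshold
        if missing_rate > max_missing or not present:
            for r in rows:
                r.pop(col, None)
            continue
        # 3. Impute: median for numeric features, mode for categorical ones
        fill = median(present) if col in numeric_cols else mode(present)
        for r in rows:
            if r.get(col) is None:
                r[col] = fill
        kept.append(col)
    return rows, kept
```

In a real pipeline the imputation statistics would be computed on the training split only and then reused for validation and test data, so that no information from held-out samples enters the features.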
For Twitter data, we removed duplicate tweets using Tweet IDs and filtered out irrelevant content based on keyword matching (e.g., company name mentions). Only English-language tweets were retained. Text preprocessing steps included removing URLs, emojis, special symbols, and stop words. To ensure the reliability of sentiment labeling and text relevance, a small random sample of tweets was manually reviewed. Outliers in key numeric features such as funding amount, site traffic, and tweet engagement were detected using the IQR method, and extreme values were either capped or removed, as shown in
Figure 4.
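As an illustration of the IQR method mentioned above, the following sketch caps extreme values at the IQR fences. The multiplier `k = 1.5` is the conventional default and is an assumption here, as is the choice of capping over removal; the text does not state which was used for which feature.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) IQR fences for a list of numbers."""
    # method="inclusive" treats the data as the full population of interest.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def cap_outliers(values, k=1.5):
    """Clip every value to the IQR fences, leaving inliers unchanged."""
    lower, upper = iqr_bounds(values, k)
    return [min(max(v, lower), upper) for v in values]


# Hypothetical engagement counts with one extreme value.
likes = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(cap_outliers(likes))  # the last value is pulled down to the upper fence
```

Capping keeps the sample size intact, which may matter for companies with few tweets; removal would be the alternative when an extreme value is believed to be a data error.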
4.3. Evaluation Metrics
To validate the feasibility of the proposed model, this study conducted experiments on the dataset and adopted multiple evaluation metrics, including Accuracy, Recall, Precision, and F1 Score. These metrics were defined based on the confusion matrix for binary classification tasks, as shown in
Table 3. Specifically, TP (True Positive) represented the number of samples that were actually positive and predicted as positive; FN (False Negative) represented the number of samples that were actually positive but predicted as negative; FP (False Positive) represented the number of samples that were actually negative but predicted as positive; and TN (True Negative) represented the number of samples that were actually negative and predicted as negative.
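Among the evaluation metrics, Accuracy was defined as:
\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
Precision was defined as:
\[ \mathrm{Precision} = \frac{TP}{TP + FP} \]
Recall was defined as:
\[ \mathrm{Recall} = \frac{TP}{TP + FN} \]
The F1 Score, which is the harmonic mean of Precision and Recall, provided a balanced consideration of both the model’s accuracy and completeness. It was defined as:
\[ F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \]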
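For concreteness, the four metrics can be computed directly from the confusion-matrix counts. The sketch below uses the standard definitions; the function name and the convention of returning 0.0 for an undefined ratio are assumptions for this example.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute Accuracy, Precision, Recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, not taken from the paper's results.
print(classification_metrics(tp=80, fp=20, fn=10, tn=90))
```

To report per-class scores for “Financing Failure”, the same function can be applied with the roles of the two classes swapped, that is, with TP and TN exchanged and FP and FN exchanged.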
4.4. Social Media Analysis Results
This study selected a subset of companies for case analysis and extracted quarterly social media sentiment and interaction features based on their associated tweet data. The aim was to explore the potential impact of public sentiment and interaction levels on the success of corporate financing. Using the BERTweet model, sentiment analysis was performed on the tweet data to extract the proportions of sentiment polarity (positive, negative, neutral) and the average sentiment intensity. Additionally, the quarterly averages of tweet interaction metrics (likes, retweets, and comments) were calculated.
Table 4 presents the quarterly sentiment and interaction features for selected companies.
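The quarterly aggregation described above can be sketched as follows. The input layout is hypothetical: each tweet is a dict carrying a `date`, a polarity `label` already produced by the sentiment model, an `intensity` score, and three interaction counts. How the paper derives intensity from BERTweet outputs is not specified, so the field is treated as given.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

POLARITIES = ("positive", "negative", "neutral")
INTERACTIONS = ("likes", "retweets", "comments")


def quarter_of(d):
    """Map a date to a label such as '2024 Q1'."""
    return f"{d.year} Q{(d.month - 1) // 3 + 1}"


def quarterly_features(tweets):
    """Aggregate per-tweet sentiment and interaction data by calendar quarter."""
    by_quarter = defaultdict(list)
    for t in tweets:
        by_quarter[quarter_of(t["date"])].append(t)

    features = {}
    for quarter, items in sorted(by_quarter.items()):
        n = len(items)
        # Share of tweets in each polarity class.
        row = {f"{p}_share": sum(t["label"] == p for t in items) / n
               for p in POLARITIES}
        row["mean_intensity"] = mean(t["intensity"] for t in items)
        # Average interaction counts per tweet.
        for name in INTERACTIONS:
            row[f"mean_{name}"] = mean(t[name] for t in items)
        features[quarter] = row
    return features


# Hypothetical tweets for one company.
sample = [
    {"date": date(2024, 2, 1), "label": "positive", "intensity": 0.8,
     "likes": 10, "retweets": 2, "comments": 1},
    {"date": date(2024, 5, 9), "label": "neutral", "intensity": 0.5,
     "likes": 4, "retweets": 0, "comments": 0},
]
print(quarterly_features(sample))
```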
From the data, significant differences in social media performance between companies across quarters can be observed. For instance, Company A (Aidaly) experienced an increase in its positive sentiment proportion from 42% in 2024 Q1 to 48% in Q2, alongside an improvement in sentiment intensity from 0.67 to 0.72, indicating a continuous enhancement in public sentiment towards the company. Meanwhile, its interaction metrics (likes, retweets, comments) also showed a steady upward trend, reflecting a growing level of public attention toward the company. Aidaly eventually achieved financing success.
In contrast, Company B (ePIC) exhibited higher negative sentiment proportions than positive sentiment proportions in both 2024 Q1 and Q2, with relatively low sentiment intensity (0.55 and 0.58, respectively). This sentiment pattern may indicate a negative public perception of the company. Its interaction metrics were also relatively low, which may have weakened its market image, and the company ultimately failed to secure financing.
Company C (Candy) demonstrated strong market recognition, maintaining a positive sentiment proportion above 50% across both quarters and achieving high sentiment intensity levels (0.72 and 0.75). Its interaction metrics were the highest among the three companies, with the average numbers of likes, retweets, and comments significantly exceeding those of the others, and Candy successfully secured funding. This indicated a strong association between high social media activity and favorable public perception.
Overall, companies with higher positive sentiment proportions and sentiment intensity levels tend to exhibit greater social media activity, as evidenced by their interaction metrics. These combined sentiment and interaction characteristics can support financing prediction; companies with greater social media engagement and favorable sentiment tend to demonstrate higher probabilities of funding success. Although limited to three companies, this fine-grained quarterly analysis illustrates the close relationship between social media sentiment and corporate market performance and supports the feature construction used in financing success prediction models.
4.5. Feature Ablation Study
To evaluate the contribution of different features to the predictive performance of the model, a systematic feature ablation study was conducted. By progressively removing specific features, the importance of financial data, social media statistical features, and sentiment features in the financing prediction model was assessed. Four experimental configurations were designed, as follows:
Exp 1 (Baseline Model): Uses only financial data as the basic input for the model.
Exp 2 (Financial Data + Numerical Features): Builds upon the baseline model by incorporating social media statistical features (e.g., likes, retweets, and comments).
Exp 3 (Social Media Features Only): Removes financial data and uses only social media statistical features and sentiment features.
Exp 4 (Financial Data + Numerical Features + Sentiment Features): Adds sentiment features (e.g., sentiment polarity proportions and average sentiment intensity) extracted using the BERTweet model to Exp 2.
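The four configurations above amount to selecting different combinations of three feature groups. A minimal sketch of that bookkeeping is given below; the column names inside each group are placeholders chosen for illustration and do not reproduce the paper's full feature list.

```python
# Hypothetical column names; the grouping follows the three feature families in the text.
FEATURE_GROUPS = {
    "financial": ["last_funding_amount", "total_equity_funding",
                  "num_founders", "num_employees"],
    "social_stats": ["mean_likes", "mean_retweets", "mean_comments"],
    "sentiment": ["positive_share", "negative_share", "neutral_share",
                  "mean_intensity"],
}

# Which groups each experiment uses, as listed above.
ABLATION_CONFIGS = {
    "Exp 1": ["financial"],
    "Exp 2": ["financial", "social_stats"],
    "Exp 3": ["social_stats", "sentiment"],
    "Exp 4": ["financial", "social_stats", "sentiment"],
}


def columns_for(experiment):
    """Return the flat list of feature columns used by one experiment."""
    return [col
            for group in ABLATION_CONFIGS[experiment]
            for col in FEATURE_GROUPS[group]]


def select_features(rows, experiment):
    """Project each row (a dict) onto the columns of the chosen experiment."""
    cols = columns_for(experiment)
    return [[row[c] for c in cols] for row in rows]


for name in ABLATION_CONFIGS:
    print(name, columns_for(name))
```

Keeping the group definitions in one place means every experiment is trained and evaluated on exactly the same rows, with only the column subset changing.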
To provide a more detailed evaluation of model performance, classification accuracy metrics, including Accuracy (ACC), Precision, Recall, and F1 Score, were calculated for two categories: “Financing Success” and “Financing Failure.” The experimental results are shown in
Table 5 and
Figure 5.
The results demonstrated the significant impact of different experimental configurations on the predictive performance for the two categories, “Financing Success” and “Financing Failure”.
Baseline Model (Exp 1): When using only financial data, the model achieved perfect Recall for the “Financing Success” category (1.000) but exhibited low Precision and Recall for the “Financing Failure” category (Recall of 0.251), resulting in a low overall F1 Score. This pattern is consistent with the model labeling most samples as successes, and it indicated that financial data alone cannot fully capture the characteristics of corporate financing.
Addition of Social Media Statistical Features (Exp 2): Incorporating social media statistical features into the baseline model significantly improved the Precision and F1 Score for the “Financing Success” category. For the “Financing Failure” category, Recall increased from 0.251 to 0.553, demonstrating that interaction data provided additional external information, enhancing the model’s ability to identify failure cases.
Social Media Features Only (Exp 3): When financial data were removed, the model still achieved high Recall for the “Financing Success” category (0.899). However, the F1 Score for the “Financing Failure” category dropped to 0.581, indicating that, while social media statistical and sentiment features provided independent predictive capabilities, the absence of financial data limited the model’s ability to predict failure cases.
Addition of Sentiment Features (Exp 4): Adding sentiment features to Exp 2 led to comprehensive performance improvements. The F1 Score for the “Financing Success” category reached 0.911, while the Recall and F1 Score for the “Financing Failure” category increased to 0.751 and 0.821, respectively. This demonstrated that sentiment features effectively captured the complexity of public emotions, significantly enhancing the model’s ability to identify failure cases and yielding optimal overall classification performance.
4.6. Model Analysis
4.6.1. Feature Importance
The analysis of feature importance revealed that financial features and social media features complemented each other in the predictive model. Financial features, such as the number of funding rounds, the amount raised in the most recent round, and the total equity financing amount, exhibited high importance in predicting the success of corporate financing. This highlights that a company’s financial records remain a key basis for investment decisions. However, social media features, particularly interaction metrics (e.g., average number of retweets and comments) and sentiment scores, also demonstrated significant predictive power. The feature importance ranking is shown in
Figure 6. These results underscore the unique value of public market perception and sentiment in corporate financing decisions.
Further analysis indicated that average sentiment scores ranked among the most important features, suggesting that sentiment features provide additional information gains for predicting financing success. Compared to relying solely on financial data, sentiment analysis revealed shifts in investors’ potential attitudes and confidence toward companies. Additionally, the model results showed that deeper public interaction behaviors, such as comments and retweets, carry greater feature weights than simpler metrics like likes. This reflects the greater importance of engagement depth and discussion intensity in predicting financing success. It suggests that fostering deep public participation often correlates with higher market recognition and investment attractiveness for companies.
Although social media and sentiment features contributed significantly to the model’s predictive power, financial features remain the foundation of the financing prediction model. These features not only reflect a company’s fundamental strengths and growth potential but also provide a solid structural foundation for the model. Moreover, some secondary features, such as a company’s technological domain, employee size, and website traffic growth rate, may also provide supplementary contributions to prediction outcomes in specific scenarios.
This multidimensional feature integration further validates the value of the multi-source data fusion strategy, demonstrating that the organic combination of financial and social media data can significantly enhance the overall performance of the predictive model.
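The text does not state how feature importance was computed. One common model-agnostic option is permutation importance, sketched below under that assumption: a feature's importance is the average drop in accuracy when its column is shuffled. The `predict` callable and the toy data are hypothetical stand-ins for a trained model and the real feature matrix.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows for which predict(row) equals the true label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Average accuracy drop per feature when that feature's column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy model that only looks at feature 0; feature 1 is ignored.
def toy_predict(row):
    return int(row[0] > 0.5)


X = [[0.1, 5.0], [0.2, 1.0], [0.3, 4.0], [0.4, 2.0],
     [0.6, 3.0], [0.7, 9.0], [0.8, 7.0], [0.9, 6.0]]
y = [toy_predict(row) for row in X]
print(permutation_importance(toy_predict, X, y))
```

In this toy setup the ignored feature receives an importance of exactly zero, while shuffling the informative feature lowers accuracy. Tree-based models also expose built-in importance scores, which would be an alternative reading of the ranking reported here.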
4.6.2. Prediction Probability Distribution
The prediction probability distribution is shown in
Figure 7, which indicates that the model performs well in classification tasks, accurately distinguishing between companies with successful and unsuccessful financing outcomes. The majority of the failed financing samples had prediction probabilities concentrated near 0, while the successful financing samples were primarily distributed near 1, clearly illustrating the model’s precise differentiation between positive and negative samples. This distribution pattern indicates that the model not only achieved high classification accuracy but also exhibited strong confidence in its decision-making process.
Additionally, the number of samples in the ambiguous probability range (i.e., 0.4 to 0.6) was extremely small, suggesting that the model’s classification boundaries were well-defined and its predictions were highly certain. Furthermore, the high concentration of probabilities near 0 for failed financing samples reflected the model’s effective learning of the characteristics associated with unsuccessful enterprises. Similarly, the prediction probabilities for successful financing samples were strongly skewed toward 1, further highlighting the model’s robustness and reliability in identifying companies with high financing potential.
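The share of samples falling in the ambiguous range can be checked with a few lines of code. The sketch below assumes a plain list of predicted success probabilities; the 0.4 to 0.6 band comes from the text, while the function names and the ten-bin histogram are illustrative choices.

```python
def ambiguous_share(probabilities, low=0.4, high=0.6):
    """Fraction of predictions whose probability lies in [low, high]."""
    if not probabilities:
        return 0.0
    return sum(low <= p <= high for p in probabilities) / len(probabilities)


def histogram(probabilities, n_bins=10):
    """Count predictions per equal-width bin over [0, 1]."""
    counts = [0] * n_bins
    for p in probabilities:
        # A probability of exactly 1.0 is placed in the last bin.
        index = min(int(p * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical predicted probabilities, not the paper's outputs.
probs = [0.02, 0.05, 0.10, 0.45, 0.88, 0.93, 0.97, 0.99]
print(ambiguous_share(probs))
print(histogram(probs))
```

A small ambiguous share indicates confident predictions, but confidence alone does not establish calibration; comparing each bin's mean predicted probability with its observed success rate would be a natural follow-up check.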
Overall, the model’s classification ability was validated not only by traditional evaluation metrics such as the F1 score but also through its prediction probability distribution. This capability provides investors with scientific and actionable decision-making support by accurately distinguishing between positive and negative classes of companies. It enables investors to efficiently identify startups with strong financing potential while avoiding high-risk companies with failed financing outcomes, thereby offering robust support for optimizing investment portfolios.
5. Discussion
This study examined how features of social media sentiment—such as polarity, emotional intensity, and engagement metrics—enhance predictive models for startup financing success. We tested the direct and indirect relationships among sentiment-derived variables and fundraising outcomes, demonstrating significant predictive improvements when sentiment signals are incorporated alongside traditional firm-level indicators.
From a theoretical standpoint, the study extends the literature on entrepreneurial finance and information asymmetry by introducing investor sentiment as a proxy for market perception in early-stage investment decisions [
43]. In contrast to prior works that primarily focused on structured financial indicators, our results demonstrate the added predictive value of unstructured social media data, thus supporting the emerging perspective that soft information can complement traditional hard metrics [
44].
The study indicates that incorporating sentiment polarity and emotional intensity metrics from Twitter substantially improves prediction accuracy compared to models relying solely on historical financing and firm characteristics. This finding is in line with Das et al. [
45], who demonstrated that public sentiment strengthens market forecasts, and builds upon Alamu’s work integrating sentiment, emotion, and text mining for financial prediction [
46]. By applying these techniques to early-stage startup financing, the research extends their relevance to the domain of entrepreneurial finance.
The analysis further reveals that social media engagement indicators—such as retweets, likes, and replies—act as robust proxies for investor interest. This result corroborates Bayar and Kesici [
47], who reported that higher engagement correlates with fewer but larger funding rounds, increased exit probabilities, and greater VC syndicate concentration. Combining sentiment intensity with engagement depth appears to signal venture credibility and elevate the likelihood of securing investment.
By integrating richer sentiment dimensions—including fear and optimism—the methodology transcends basic polarity models. This approach mirrors Vamossy’s findings that nuanced investor emotions on Twitter correspond with short-term IPO mispricings [
48]. Applied to startups, such nuanced emotional trends may likewise carry predictive signal beyond simple polarity.
For investors and startup support institutions, our findings underscore the value of real-time social media monitoring. By tracking sentiment polarity and engagement depth (likes, retweets, comments), decision support systems can better distinguish ventures with strong fundraising potential, thereby reducing information asymmetry. For entrepreneurs, these results suggest that sustaining authentic, positive digital engagement may enhance perceived legitimacy and attract investor attention, complementing traditional fundraising efforts.
While our framework was developed and validated on Twitter data for startups in major English-speaking markets, its modular design supports straightforward adaptation to other industries, geographic regions, or social platforms (e.g., LinkedIn, Reddit, Weibo). Key considerations include: (1) cross-platform data collection and preprocessing—each API has distinct rate limits, data schemas, and noise profiles, requiring customized scrapers and normalization pipelines; (2) multilingual sentiment analysis—beyond English, one can employ multilingual models such as XLM-R or apply high-quality machine translation before sentiment scoring to capture local language nuances; (3) industry and regional social conventions—different sectors (e.g., healthcare versus consumer tech) and cultures (e.g., North America versus East Asia) exhibit unique posting behaviors and emotive expressions, which may call for domain-specific lexicons or transfer learning; and (4) modular deployment and compliance—our system’s microservices architecture allows plugging in new data sources or models with minimal changes, while legal and ethical constraints (GDPR, platform terms of service) must be respected in each jurisdiction.
6. Limitations
While the current study focuses on Twitter-based sentiment signals over a 24-month period (2023–2024), our modeling framework is designed to be generalizable. The temporal alignment mechanism and multimodal feature design are platform-agnostic, making it feasible to extend the system to other sources such as LinkedIn, Reddit, or online news. In future work, we plan to incorporate such platforms and test the system’s robustness across different content modalities and data sparsity levels. In particular, LinkedIn posts—often containing professional announcements and fundraising updates—could provide complementary cues in a more formal business context. Although current access is limited due to API restrictions, the pipeline can be adapted once broader access is granted, using similar aggregation and sentiment analysis strategies.
Moreover, although the dataset includes startups from diverse industries and multiple regions (e.g., the US, UK, India), we acknowledge the current limitations in geographic and linguistic scope. The framework may benefit from adaptation using multilingual language models (e.g., XLM-R) and evaluation on startups in non-English-speaking ecosystems. In addition, we recognize that differences in company-level social media activity may introduce sampling bias—startups with higher Twitter engagement are more likely to generate sufficient signals for modeling. To mitigate this, we normalized sentiment and interaction features by activity level and ensured coverage across a wide range of activity intensities. To partially assess temporal generalizability, we conducted time-split validation across 2023 and 2024. The model showed stable performance, suggesting that sentiment signals retain predictive power over time. However, more extensive temporal and cross-platform evaluation will be necessary to confirm the model’s real-world robustness and transferability.
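The time-split validation mentioned above can be sketched as a split on the year of each funding event. The direction of the split (train on 2023, evaluate on 2024) and the `funding_date` field name are assumptions for this illustration; the text only states that validation was split across the two years.

```python
from datetime import date


def time_split(samples, train_year=2023, test_year=2024):
    """Split samples by the year of their funding date (no shuffling)."""
    train = [s for s in samples if s["funding_date"].year == train_year]
    test = [s for s in samples if s["funding_date"].year == test_year]
    return train, test


# Hypothetical funding events.
events = [
    {"company": "A", "funding_date": date(2023, 3, 1), "success": 1},
    {"company": "B", "funding_date": date(2023, 11, 20), "success": 0},
    {"company": "C", "funding_date": date(2024, 2, 14), "success": 1},
]
train, test = time_split(events)
print(len(train), len(test))
```

Unlike a random split, this keeps every evaluation sample strictly later than the training data, so features computed from later tweets cannot leak into training.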
7. Conclusions
This study presents a sentiment-aware decision support system (DSS) that integrates traditional financial metrics with social media sentiment and interaction features to predict startup financing success. By combining structured enterprise data from Crunchbase and unstructured public signals from Twitter, the system leverages the BERTweet model to extract both sentiment polarity and intensity features, enabling the dynamic modeling of market perception. The experimental results confirm that this multi-source DSS architecture significantly outperforms baseline systems relying solely on financial indicators, particularly in terms of prediction accuracy, recall, and F1 score.
From a systems perspective, the inclusion of social media sentiment and interaction data introduces an important external feedback loop into traditional financing assessment models. The findings emphasize that deeper forms of user engagement—such as retweets and comments—carry greater predictive value than surface-level metrics like likes. These features not only enhance the decision support system’s sensitivity to public sentiment fluctuations but also reflect a more nuanced representation of real-time market attention, providing decision-makers with richer contextual insights.
The proposed system highlights several key system design principles, including modular data integration, real-time sentiment sensing, and multi-modal feature fusion, which collectively improve the system’s adaptability to complex, high-dimensional decision environments. However, certain limitations remain. The system currently operates on a two-year dataset and is confined to a single social media platform (Twitter), which may constrain its temporal generalizability and platform diversity. Additionally, the potential noise and bias in social media content may affect sentiment accuracy, calling for robust filtering and normalization mechanisms in future implementations.
To advance this system into a more comprehensive decision support platform, future work could explore three main directions. First, the temporal scope of the system can be extended by incorporating longer time-series data to support trend-based and cyclical financing behavior analysis. Second, multi-platform integration—including LinkedIn, Reddit, and industry-specific media—can enhance the system’s representational completeness and improve robustness across startup types and regions. Third, the core sentiment analysis engine can be upgraded using more advanced large language models (e.g., GPT-based architectures) to handle semantic ambiguity and context-sensitive expressions more effectively.
Finally, this system framework can be generalized and deployed across broader decision support domains, such as bankruptcy early warning systems, product launch forecasting, or public market perception monitoring. Its modular architecture and real-time sentiment interface make it particularly suitable for applications requiring high responsiveness and interpretability. In doing so, this study contributes not only a predictive model but a practical, extensible system paradigm for integrating heterogeneous data sources into intelligent decision-making workflows in the era of data-driven innovation.