Predicting the Success of Internet Social Welfare Crowdfunding Based on Text Information

This study explored how the success of Internet social welfare crowdfunding projects can be predicted from the project texts. By calculating the quantity of information in each text and mining its sentimental value, we studied how the text information of Internet social welfare crowdfunding projects affects project success. To this end, a sentiment dictionary of Chinese Internet social welfare crowdfunding texts was constructed, and information entropy was used to calculate the quantity of information in the text. It was found that, compared with the information presented in the text, the fundraiser's social network factors are key to improving the success of fundraising. The sentimental value of the text positively affects the success of fundraising, while the influence of the quantity of information follows an inverted U-shaped relationship. The low R-squared values indicated that the multiple linear regression models do not perform well on this prediction task. Furthermore, this paper validated and analyzed the prediction efficiency of four machine-learning models: a multiple regression model, a decision tree regression model, a random forest regression model, and an AdaBoost regression model. The AdaBoost regressor performed best, with an R² of up to 97.7%. This study provides methods for quantifying the information contained in social welfare crowdfunding texts and for effectively predicting social welfare crowdfunding outcomes; by helping to raise the success rate of crowdfunding, it has both commercial and social value.


Introduction
Crowdfunding refers to funding a project or venture by raising small amounts of money from the public via the Internet [1]. The most common form of social welfare crowdfunding is aimed at medical care [2]. Research has provided evidence that social welfare crowdfunding improves the recipient's access to medical treatment and promotes social engagement in charity [3]. Especially in developing countries with relatively incomplete medical security systems, social welfare crowdfunding is now widely chosen by people with unaffordable health expenses [4]. In undeveloped countries, social welfare crowdfunding holds the potential to fill insurance gaps and help those burdened by medical debt [5]. As a new financial service in the Internet era, social welfare crowdfunding can benefit everyone and has thus become an important supplement to the health care insurance system. Social welfare crowdfunding activates the social participation of Internet platform users; however, not every project succeeds. Existing studies have examined the factors contributing to the success of social welfare crowdfunding projects from the perspectives of project environments, crowdfunding mechanisms, presentation elements, and intrinsic motivation [6][7][8][9][10][11]. Because social welfare crowdfunding is a non-equity-based form of crowdfunding, traditional economic theories, such as transaction costs, reputation, and market design, can explain the factors influencing it to some extent [12]. Some studies have analyzed the expressive characteristics of the texts of social welfare crowdfunding projects, including the picture information in the text [13], the identity clue of the subject [14], and the planning of behavior [15]. These studies provide a basis for understanding the fundraising factors of social welfare crowdfunding projects, but there is a lack of exploratory research that helps predict the success of projects.
In China, most social welfare crowdfunding projects acquire support through WeChat groups or Moments, which have become an infrastructure platform for social communication and information dissemination [16]. A study in the U.S. concluded that Internet crowdfunding has the potential to deepen social and health inequalities there, and that the success of crowdfunding projects requires fundraisers to have both medical and media knowledge. In reality, the uneven distribution of cultural knowledge and unequal marketing ability mean that the fundraisers most in need of financial support do not necessarily have the highest probability of success [17]. Both social networking factors and the expression of fundraising texts influence the success of social welfare crowdfunding projects, but there has been no systematic examination of these factors and their relative importance.
Intelligent technology offers operable methods of exploiting text [18]. Text mining is a variation on a field called data mining [19]. Internet crowdfunding platforms have generated a large number of project fundraising texts, which include the fundraisers' social network elements and fundraising details. However, intelligent means of text processing for social welfare crowdfunding are still lacking. Intelligent processing methods for text data have expanded continuously from single-objective solution methods to multi-objective solution methods. A series of studies have discussed the use of multi-objective solution methods to mine the knowledge contained in text data [20][21][22], including multi-objective data mining and the processing of multiple documents [23]. This study analyzes how social welfare crowdfunding project information determines the success of projects based on a text mining approach, and attempts to achieve three research objectives:
RQ1. Provide methods for the quantified processing of information contained in Chinese social welfare crowdfunding texts.
RQ2. What are the factors that can effectively predict the success of social welfare crowdfunding projects? How important are these factors?
RQ3. Build prediction models for fundraising success based on machine learning approaches and validate the prediction efficiency of four machine-learning regression models, namely, a multiple regression model, a decision tree model, a random forest model, and an AdaBoost model.
This paper is organized as follows: In Section 2, the theoretical basis is analyzed, and the research method is introduced. In Section 3, the data features, structure, and text information acquisition method are introduced, together with a description of how the sentiment lexicon for social welfare crowdfunding projects is constructed and how the sentimental values of the text are extracted in a targeted manner. In Section 4, based on the information extracted from crowdfunding texts, the factors influencing the success of social welfare crowdfunding projects are analyzed using the multiple regression model. In Section 5, the training process is described, and the performance of the machine-learning prediction models is assessed. In Section 6, conclusions, applications, and several open issues are discussed.

Information Richness and Bounded Rationality in Decision-Making
Information richness refers to the quantity of information that can be conveyed through a communication medium in a specific period of time and the extent to which the medium enables the sender and receiver to reach a common understanding [24]. Individual emotion is reflected in the information richness of a text [25]. Information richness can efficiently decrease uncertainty during the process of organizational decision-making [26]. In practice, there are three steps in the process of information transfer, including source-language translation, information transfer, and target-language translation. After the information has been translated from the sender, information richness may impose an impact on the information transfer as well as feedback, while the effective interaction of information may rely on the receiver's efficient acceptance and understanding of the information. Receivers of information during information processing are restricted by bounded rationality in decision-making [27].
With the fast extension of information and communication technology, diversity and richness of communication media in cyberspace have been constantly and extensively intensified. The Internet has ushered in an era of information surplus [28], and the attention of users has been highly diverted. The key factor of the effective transfer of information no longer exclusively lies in information media, and the effective processing of information by actors has become a great concern.
Information richness in cyberspace has an appreciable impact on users. It is a criterion for measuring the quality of information disclosure on a webpage [29,30]. In addition, high-quality information on a webpage positively influences the user's trust, which eventually affects the willingness to buy. Online introductions and evaluations with high information richness can enhance the user's perception of the effectiveness, credibility, and persuasiveness of product information [31]. Based on the text structure of an Internet social welfare crowdfunding project and the information attention features of actors in the process of online crowdfunding, this paper divides the information contained in the raw data into two dimensions: the sentimental value and the quantity of information. The text characteristics are then transformed into numeric features, which lays a foundation for probing the connection between the sentimental value, the quantity of information, and the fundraising success rate, as well as for analyzing the efficiency of machine-learning models in forecasting fundraising success.

Mining of Sentiment Value in the Text
A text's sentiment refers to the intensity of positive or negative sentimental value expressed in the project's crowdfunding text. Currently, the calculation methods of sentiment for a text are based on sentiment lexicon or machine learning [32][33][34]. The sentiment lexicon-based sentiment calculation methods mainly rely on open-sourced sentiment lexicon or scene-based expanded sentiment lexicon [35]. A pure open-sourced sentiment lexicon applies to common scenes but not to special Internet-based crowdfunding scenes. In order to improve the accuracy of a text's sentiment, the traditional expanded sentiment lexicon-based calculation method is adopted herein for calculating the sentimental value of a text, as shown in the flow chart in Figure 1.
The jieba analyzer was adopted for text segmentation, whereby the text is segmented in an accurate way for text analysis [36]. The stop word dictionary combined the prevailing stop word lists with special stop words for the specific scenario, totaling 1519 stop words. After word segmentation and stop word removal, each project text was stored as a string of space-separated words. To calculate sentiment, it is important to construct a sentiment lexicon for the scenario surveyed; the basic idea is to combine classic common sentiment lexicons with an Internet-based crowdfunding sentiment lexicon and to eliminate repeated words. To be specific, the common sentiment lexicons adopted include the CNKI Sentiment Lexicon [37] and the National Taiwan University Simplified-Chinese Dictionary (NTUSD) [38]. The Semantic Orientation Pointwise Mutual Information (SO-PMI) algorithm was utilized to construct the scene-based sentiment lexicon. The SO-PMI algorithm consists of two parts: Pointwise Mutual Information (PMI) and SO-PMI [39]. PMI measures how strongly a certain word is associated with a reference word by comparing the probability of their co-occurrence with the probability expected under independence:

$$\mathrm{PMI}(word_1, word_2) = \log_2 \frac{P(word_1, word_2)}{P(word_1)\,P(word_2)}$$

where P(word1, word2) refers to the joint probability, namely, the probability that word1 and word2 occur in the corpus simultaneously. If they are independent, P(word1, word2) = P(word1)P(word2); that is to say, the ratio is 1, and PMI = 0.
SO-PMI is used for determining the orientation of an unknown word from its correlation with the reference words: if the unknown word is more correlated with the positive reference words, it is positive; if it is more correlated with the negative reference words, it is negative. If it is equally correlated with the positive and negative reference words, or independent of them, it is considered a neutral word:

$$\mathrm{SO\text{-}PMI}(word) = \sum_{i=1}^{num(pos)} \mathrm{PMI}(word, pos_i) - \sum_{j=1}^{num(neg)} \mathrm{PMI}(word, neg_j)$$

where pos_i and neg_j denote the i-th positive and the j-th negative reference word, num(pos) refers to the total number of positive reference words, and num(neg) is the total number of negative reference words. If SO-PMI > 0, the unknown word is considered positive; if SO-PMI = 0, it is considered neutral; if SO-PMI < 0, it is considered negative.
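As an illustration of the seed-word expansion described here, the following is a minimal Python sketch of PMI and SO-PMI computed from document-level co-occurrence counts, using PMI = log2(P(w1, w2) / (P(w1)P(w2))) and SO-PMI as the sum of PMI with the positive seeds minus the sum of PMI with the negative seeds. The function names, the use of document-level co-occurrence, and the convention of returning 0 for pairs that never co-occur are assumptions made for this sketch, not details taken from the study.

```python
import math
from collections import Counter
from itertools import combinations


def build_counts(docs):
    """Count, per word and per unordered word pair, the number of documents containing it."""
    word_df, pair_df = Counter(), Counter()
    for doc in docs:
        vocab = set(doc)
        word_df.update(vocab)
        pair_df.update(frozenset(p) for p in combinations(sorted(vocab), 2))
    return word_df, pair_df


def pmi(w1, w2, word_df, pair_df, n_docs):
    """PMI in bits; returns 0.0 when the pair never co-occurs (a simplifying assumption)."""
    joint = pair_df[frozenset((w1, w2))]
    if joint == 0:
        return 0.0
    p_joint = joint / n_docs
    return math.log2(p_joint / ((word_df[w1] / n_docs) * (word_df[w2] / n_docs)))


def so_pmi(word, pos_seeds, neg_seeds, word_df, pair_df, n_docs):
    """Sum of PMI with positive seeds minus sum of PMI with negative seeds."""
    pos = sum(pmi(word, s, word_df, pair_df, n_docs) for s in pos_seeds)
    neg = sum(pmi(word, s, word_df, pair_df, n_docs) for s in neg_seeds)
    return pos - neg


# Toy corpus of already segmented texts (illustrative only).
docs = [["hope", "recover"], ["hope", "recover"], ["pain", "debt"], ["pain", "debt"]]
word_df, pair_df = build_counts(docs)
score = so_pmi("recover", ["hope"], ["pain"], word_df, pair_df, len(docs))
label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
```

Candidate words can then be ranked by this score, and the highest- and lowest-scoring words retained as positive and negative lexicon entries.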

Calculation of Text Information
Text information describes the information contained in a text. In this study, the information entropy of a single fundraising text is used to represent its text information. The theory of information entropy was proposed by Shannon in 1948 for describing the uncertainty in an information source [40]. Information entropy is often utilized as a quantitative indicator for measuring the information content of a system. It is calculated as follows:

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$$

where X is a random variable with n possible outcomes x_1, x_2, ..., x_n, which occur with probability p(x_i), such that ∑p(x_i) = 1. The information entropy H(X) reflects the information content of the random variable X and is often measured in bits. Here, the greater the information entropy of the text is, the more information the text contains [41].
This study adopted TF-IDF (term frequency-inverse document frequency) to calculate the text information entropy in combination with the information entropy formula: TF captures the word frequency within a text, while IDF captures the inverse document frequency across texts and serves as the weight W for the information entropy of each word. IDF assigns weights to feature terms (words) in a set of documents according to how many documents in the set contain them: special words that appear in only a few documents have larger weights than those that appear in many documents. Based on the interpretation of IDF by Shannon's information theory, the higher the frequency of a feature term across all documents, the lower its information entropy; if a feature term appears with high frequency in only a few documents, it will show higher information entropy [42][43][44]. Therefore, IDF can be interpreted as the cross entropy of the probability distribution of the keywords under a specific condition [45]. The original Shannon formula combined with word frequency in a single document only considers the information entropy of words in individual crowdfunding texts. With an IDF weighting factor, which describes the information entropy of words across texts, the information entropy of an individual text can be calculated in more detail within the whole crowdfunding scenario. The computational formula of information entropy for each text is as follows:

$$H(X) = -\sum_{i=1}^{n} W_{x_i}\, p(x_i) \log_2 p(x_i) \tag{4}$$

$$p(x_i) = \frac{fr(x_i)}{fr(X)}$$

$$W_{x_i} = \log \frac{D}{D_{x_i}}$$

In Formula (4), X represents a single individual text, x_i stands for the words in the text, n represents the total number of words, and p(x_i) represents the word frequency within the single individual text. fr(x_i) is the frequency of x_i in the text, and fr(X) is the total number of words obtained after text X is segmented. W_{x_i} stands for the inverse document frequency of x_i, D represents the total number of texts, and D_{x_i} stands for the number of texts containing x_i. Based on the computational formula of information entropy, the longer the single individual text, the more words it contains and the larger its information entropy [46].
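The IDF-weighted entropy can be sketched in a few lines of Python. This is a minimal illustration that assumes the weight of a word is W = log(D / D_x), its within-text probability is p = fr(x) / fr(X), and the text entropy is the negative sum of W · p · log2(p) over the distinct words of the text; the function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def idf_weights(docs):
    """W_x = log(D / D_x), where D_x is the number of texts containing word x."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def weighted_entropy(doc, weights):
    """IDF-weighted entropy of one segmented text: -sum(W_x * p(x) * log2 p(x))."""
    total = len(doc)
    entropy = 0.0
    for word, count in Counter(doc).items():
        p = count / total
        entropy -= weights.get(word, 0.0) * p * math.log2(p)
    return entropy


# Two toy segmented texts: "的" appears in both, so its weight is log(2/2) = 0.
docs = [["的", "肿瘤"], ["的", "医院"]]
weights = idf_weights(docs)
entropies = [weighted_entropy(doc, weights) for doc in docs]
```

In this toy corpus the function word "的" contributes nothing to either text's entropy, which mirrors the intent of the IDF weight: high-frequency words shared by all texts should not inflate the information measure.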

Decision Tree Model
A decision tree is a basic classification and regression algorithm in the field of machine learning, in which a tree is built by choosing an appropriate splitting strategy. It is characterized by good readability, a fast classification speed, and good understandability. The decision tree is mainly derived from the ID3 and C4.5 algorithms proposed by Quinlan [47,48], respectively, and from the CART algorithm offered by Breiman et al. in 1984 [49]. Decision trees can be classified into two categories: classification trees and regression trees. Feature selection refers to selecting one feature from the many in the training data as the splitting criterion of the current node; different quantitative evaluation criteria can be adopted for selecting the feature, thus producing different decision tree algorithms. ID3 and C4.5 (which is derived from ID3) measure splits using information theory, namely information gain and gain ratio, whereas CART uses the Gini index for classification and the squared error for regression. A regression tree generally refers to the CART tree, which is also adopted herein. In most cases, compared to algebraic prediction criteria built using conventional statistical methods, a prediction tree constructed based on the CART model yields a higher accuracy, and the more complicated the data and the more variables involved, the greater the advantage of the algorithm.
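To make the splitting step of a regression tree concrete, the sketch below searches a single numeric feature for the threshold that minimizes the summed squared error of the two child nodes, which is the usual CART regression criterion. It is a simplified illustration with hypothetical names: a full CART implementation repeats this search over all features and recurses on the children, and the feature and target values shown are invented.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty node)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, total_sse) for the best binary split x <= threshold.

    The threshold is None when no split improves on the unsplit node.
    """
    pairs = sorted(zip(xs, ys))
    best = (None, sse(list(ys)))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical feature values cannot be separated
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# Invented example: a feature (e.g., forwarding number) against the fundraising ratio.
threshold, error = best_split([1, 2, 3, 10, 11, 12], [0.2, 0.2, 0.2, 0.8, 0.8, 0.8])
```

Each leaf of the resulting tree predicts the mean target value of the training samples that fall into it.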

Random Forest Regressor
A random forest is composed of multiple decision trees that are not correlated with each other; the ultimate output of the model is jointly decided by all decision trees in the forest [50]. A certain number of samples are randomly chosen from the training set as the samples of the root node for each decision tree; while setting up each decision tree, some candidate attributes are randomly selected, and the most appropriate attribute is chosen as the splitting node. For regression, the mean of the values output by the individual decision trees is taken as the result. A random forest is an ensemble learning method of the bagging type; by combining multiple weak learners, the result is obtained by voting or taking the mean value so that the overall model has high accuracy and generalization performance [51]. Good results are attributed to "the randomness" and "the forest": the former endows it with an anti-overfitting capability, while the latter increases the accuracy. The weak learners in the random forest are CART trees, which are also called regression trees.

AdaBoost Regressor
Boosting, an ensemble learning method proposed by Schapire and Freund, can be used for changing a weak learner, whose prediction accuracy is only slightly higher than random guessing, into a strong learner with a high prediction accuracy [52]. The AdaBoost algorithm was obtained by Freund and Schapire in 1995 as an improvement on the Boosting algorithm; it chooses weak learners with the lowest weight coefficient from the trained weak learners and integrates them into a final strong learner by adjusting the weights of the samples and of the weak learners. The first weak learner is trained on the training set; each subsequent weak learner is trained with a different set of sample weights. The difficulty of predicting each sample decides its weight, and this difficulty is estimated from the outputs of the learners in the previous steps. The AdaBoost regressor is thus an iterative algorithm that works by changing the data distribution: it determines the weight of each sample according to whether the sample was predicted correctly in the previous round and according to the overall accuracy of the previous round. AdaBoost provides a framework in which various methods can be adopted for constructing the base learners, and it allows the use of simple, weak learners without screening the features.

K-Fold Cross Validation and Grid-Search Hyperparameter Tuning
Cross validation means that the raw data are divided (generally evenly) into K groups; each subset is used as the validation set once, with the other K-1 subsets as the training set, thus yielding K models. The mean validation accuracy of the K models is adopted as the performance indicator under K-fold cross validation (K-CV) [53]. Through grid-search hyperparameter tuning, the best-performing parameter combination is chosen from all candidate combinations as the result. In this study, both K-fold cross validation and grid search were utilized for optimizing the hyperparameters of decision tree regression, random forest regression, and AdaBoost regression, with K set to 5.
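The combination of K-fold cross validation and grid search can be sketched in plain Python as follows. This is a minimal illustration, not the tuning code used in the study: the model is passed in as hypothetical `fit` and `predict` callables, the folds are contiguous and unshuffled (in practice the data would usually be shuffled first), the score is the mean squared error, and at least K samples are assumed.

```python
def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_mse(fit, predict, X, y, params, k=5):
    """Mean validation MSE over k folds for one hyperparameter setting."""
    scores = []
    for fold in k_fold_indices(len(X), k):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train], **params)
        errors = [(predict(model, X[i]) - y[i]) ** 2 for i in fold]
        scores.append(sum(errors) / len(errors))
    return sum(scores) / k


def grid_search(fit, predict, X, y, grid, k=5):
    """Return the (params, score) pair with the lowest cross-validated MSE."""
    best_params, best_score = None, float("inf")
    for params in grid:
        score = cross_val_mse(fit, predict, X, y, params, k)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical toy model: predicts a shrunken mean of the training targets.
def fit_shrunken_mean(X_train, y_train, shrink):
    return shrink * sum(y_train) / len(y_train)


def predict_constant(model, x):
    return model


X = list(range(10))
y = [0.5] * 10
best_params, best_score = grid_search(
    fit_shrunken_mean, predict_constant, X, y, [{"shrink": 0.5}, {"shrink": 1.0}], k=5
)
```

The same loop applies to tree-based regressors: each candidate in the grid would be a combination of settings such as tree depth or number of estimators, and the combination with the best mean validation score is kept.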

Assessment of Prediction Model Performance
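In this study, the explained variance score (EVS), the mean square error (MSE), the mean absolute error (MAE), and R-squared (R²) were utilized as indicators for assessing the prediction performance of the prediction model [54]:

$$\mathrm{EVS} = 1 - \frac{\mathrm{Var}(y - \hat{y})}{\mathrm{Var}(y)}$$

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

$$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$$

where ŷ is the sequence of predicted values, ŷ_i is the predicted value of the i-th sample, y_i is the expected (observed) value, ȳ is the mean of the expected values, n is the number of samples, and p is the number of features. The adjusted R-squared is the form of R-squared that accounts for the number of features.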
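For reference, these indicators can be computed directly from their standard definitions. The following Python sketch uses hypothetical function names and invented numbers, and assumes population variances (division by n) in the explained variance score.

```python
def _mean(values):
    return sum(values) / len(values)


def mse(y_true, y_pred):
    """Mean square error."""
    return _mean([(t - p) ** 2 for t, p in zip(y_true, y_pred)])


def mae(y_true, y_pred):
    """Mean absolute error."""
    return _mean([abs(t - p) for t, p in zip(y_true, y_pred)])


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = _mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def explained_variance(y_true, y_pred):
    """1 - Var(y - y_hat) / Var(y), using population variances."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    res_bar, y_bar = _mean(residuals), _mean(y_true)
    var_res = _mean([(r - res_bar) ** 2 for r in residuals])
    var_y = _mean([(t - y_bar) ** 2 for t in y_true])
    return 1 - var_res / var_y


def adjusted_r2(r2_value, n_samples, n_features):
    """R-squared penalized for the number of features."""
    return 1 - (1 - r2_value) * (n_samples - 1) / (n_samples - n_features - 1)


# Invented observed and predicted values.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
scores = {
    "EVS": explained_variance(y_true, y_pred),
    "MSE": mse(y_true, y_pred),
    "MAE": mae(y_true, y_pred),
    "R2": r2(y_true, y_pred),
}
```

The explained variance score and R-squared coincide when the residuals have zero mean; they differ in this example because the predictions are biased.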

Data and Features
Qschou (https://www.qschou.com, accessed on 28 December 2021), which is also named "Fun in Funding", was chosen as the basic scenario and data source in this study. Qschou is estimated to be one of the biggest social welfare fundraising platforms, with more than 190 million registered users and nearly 20 billion RMB collected for 160 million Chinese families unable to cover medical expenses [55,56].
The basic data of 1249 crowdfunding projects were collected. Repeated projects were excluded, as were projects whose actual crowdfunding amount was greater than the fundraising goal, which is the targeted fundraising amount set by the fundraiser according to the consideration of potential donors [57]. Finally, 1239 crowdfunding projects were used as raw data. Each crowdfunding project consists of two kinds of data: structured and unstructured data. Figure 2 presents the data structure of a specific project on Qschou.
The data collected include structured and unstructured data. Structured data consist of "Goal", "Raised", "Help", "Forwarding", "Verify", "Success", and "Date"; unstructured data contain "project title" and "project help text". All projects adopted here are marked with "Success", but the field "Verify" in some projects is missing, and other structured data are complete; unstructured data refer to the crowdfunding text, totaling 920,000 Chinese characters. Table 1 shows the structure of the raw data.

The Calculation of Information Entropy
This paper explores the overall calculation of text information entropy from the following two perspectives: (1) As for the information entropy of words in a single individual text, there are two categories of words after text segmentation. One category includes pronouns and auxiliary words with a high word frequency, such as "的" (meaning "of") and "我" (meaning "I"), but without an actual consequence; the other contains nouns and adjectives with a low word frequency, such as "肿瘤" (meaning "tumor") and "医院" (meaning "hospital"). (2) In terms of the overall text generated by all texts, the inverse document frequency of a certain word is used to indicate its significance, which is used as the weight for the computational formula of information entropy, so as to lower the interference of pronouns and auxiliary words with text information entropy. Table 2 shows the computational results of some text information entropy.

Building the Sentimental Dictionary
In the application of machine learning technologies, text features and numerical features are different [58,59]. In this study, the information from the raw data includes two dimensions, sentiment and information content, so the text features were converted into two dimensions of numerical features. This study further explored the relationships between the sentiment and information content of the text and the success rate of the crowdfunding project, and laid a foundation for understanding the efficiency of the machine-learning models that analyzed and predicted the crowdfunding success rate.
In the early stage, three postgraduates specializing in Internet research sorted out the segmented words, and three other postgraduates focusing on Internet governance chose 50 positive sentiment words and 50 negative sentiment words for the scenario based on five topics: "disease", "family", "mood", "hope", and "money". Under each topic, there were 10 positive sentiment words and 10 negative sentiment words, as shown in Table 3. These 100 seed words were used as reference words in the SO-PMI algorithm, through which more than 5000 positive and negative sentiment words were ultimately obtained for this scenario. For the purpose of the accurate segmentation of sentiment words and the sound calculation of sentiment, the sentiment lexicon required herein was constructed by combining the positive and negative sentiment words whose SO-PMI scores ranked in the top 1000 with the integrated common sentiment lexicon and by eliminating repeated words. The sentiment lexicon constructed based on mutual information consists of 1000 positive words and 1000 negative words. After integrating this with the common sentiment lexicon and eliminating repeated words, 9047 positive sentiment words and 13,251 negative sentiment words were included. Thus, the proposed sentiment lexicon was constructed. Following the traditional approach to calculating text sentiment and assuming that sentiment satisfies linear superposition, we assigned each positive sentiment word a weight of 1 and each negative sentiment word a weight of −1. Afterwards, word segmentation was implemented on the sentence: if a segmented word appeared in the sentiment lexicon, the corresponding weight was added. Negative words and adverbs of degree had special rules: a negative word reversed the weight, which means the weight was multiplied by −1; meanwhile, an adverb of the most extreme degree doubled the weight, and an adverb of moderate degree multiplied the weight by 1.5.
The sentiment of the text was determined based on whether the total weight was positive or negative. The degree and negative word lists proposed herein refer to three levels and their weight settings according to the classification degree of Chinese adverbs [60]. The calculated sentiment is shown in Table 4. Based on the three levels of reference adverb classification and weight setting [61], the sentimental value features were defined and calculated, as shown in Algorithm 1.
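The scoring rules described above can be sketched as a short Python function. This is an illustrative reading rather than the paper's Algorithm 1: it assumes that a negation word or degree adverb modifies only the next sentiment word, that the modifier is discarded if an unrelated word intervenes, and it includes only the two degree weights stated in the text (2 and 1.5). The word lists are tiny hypothetical stand-ins for the constructed lexicon.

```python
def sentiment_score(tokens, pos_words, neg_words, negations, degree_weights):
    """Sum signed sentiment weights over a segmented text.

    Positive words add +1 and negative words add -1; a preceding negation
    multiplies the weight by -1, and a preceding degree adverb multiplies
    it by its degree weight.
    """
    score = 0.0
    multiplier = 1.0
    for token in tokens:
        if token in negations:
            multiplier *= -1.0
        elif token in degree_weights:
            multiplier *= degree_weights[token]
        elif token in pos_words:
            score += multiplier
            multiplier = 1.0
        elif token in neg_words:
            score -= multiplier
            multiplier = 1.0
        else:
            multiplier = 1.0  # assumption: modifiers do not carry across other words
    return score


# Hypothetical mini-lexicon for illustration only.
pos_words = {"希望", "乐观"}
neg_words = {"痛苦", "放弃"}
negations = {"不"}
degree_weights = {"非常": 2.0, "比较": 1.5}

# "非常 痛苦" contributes -2, "不 放弃" contributes +1, giving -1 in total.
score = sentiment_score(["非常", "痛苦", "不", "放弃"], pos_words, neg_words, negations, degree_weights)
polarity = "positive" if score > 0 else "negative" if score < 0 else "neutral"
```

The sign of the total gives the polarity of the text, and its magnitude serves as the sentimental value feature.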

Predictor Analysis
Multiple linear regression models are often used in prediction; they represent the relationship between a dependent variable and a set of predictor variables [62,63]. Based on multiple linear regression, this paper carries out predictions for the success of Internet social welfare crowdfunding projects. Taking the fundraising ratio of charitable crowdfunding projects as the dependent variable Y, and the numeric data together with the sentimental value and the quantity of information obtained from text mining as independent variables, the influence of the independent variables on the fundraising ratio was analyzed by setting up multiple linear regression equations. The regression models in this paper are as follows:

$$\begin{aligned}
\text{M1: } Y &= a_{10} + a_{11}X_1 + a_{12}X_2 + a_{13}X_3 + a_{14}X_4 + a_{15}X_5 + a_{16}C_{01} + a_{17}C_{02} + \varepsilon \\
\text{M2: } Y &= a_{20} + a_{21}X_1 + a_{22}X_2 + a_{23}X_3 + a_{24}X_4 + a_{25}X_5 + a_{26}X_6 + a_{27}C_{01} + a_{28}C_{02} + \varepsilon \\
\text{M3: } Y &= a_{30} + a_{31}X_4 + a_{32}X_5 + a_{33}C_{01} + a_{34}C_{02} + \varepsilon \\
\text{M4: } Y &= a_{40} + a_{41}X_4 + a_{42}X_5 + a_{43}X_6 + a_{44}C_{01} + a_{45}C_{02} + \varepsilon
\end{aligned} \tag{13}$$

Here, Y represents the fundraising success ratio, X1 stands for the forwarding number, X2 is the verifying number, X3 is the fundraising goal, X4 represents the text's sentimental value, X5 is the quantity of information, X6 is the quadratic term of the information quantity, C01 is the sentimental value of the title, C02 represents the length of the title, the a terms are the coefficients, and ε is the error term. X5 and X6 are calculated based on the formulas and processes defined in "Section 2.3". This paper builds four multiple regression models for different purposes.
Model 1 and Model 2 are full models, the independent variables of which include the forwarding number (X 1 ), the verifying number (X 2 ), the fundraising goal (X 3 ), the text's sentimental value (X 4 ), the quantity of information (X 5 ), the sentimental value of the title (C 01 ), and the length of the title (C 02 ). Compared with that of Model 1, there is an added quadratic term of the information quantity (X 6 ) in Model 2. These can fulfill two purposes. On the one hand, it can analyze the influence of all independent variables on the dependent variable; on the other hand, it can explore the connection between the quantity of information and the fundraising ratio.
Leaving aside the fundraising goal and the other numeric variables, Model 3 and Model 4 center on the influence of the text and the title on the fundraising ratio. The independent variables of Model 3 only include the sentimental value (X4), the quantity of information (X5), the sentimental value of the title (C01), and the length of the title (C02), while Model 4 adds the quadratic term of the information quantity (X6). In general, comparing Model 1 and Model 2, which analyze the influence of all variables on the dependent variable, with Model 3 and Model 4 highlights the impact of the forwarding number (X1), the verifying number (X2), and the fundraising goal (X3) on the fundraising ratio. In the comparison between Model 3 and Model 4, taking the text's sentimental value (X4), the information quantity (X5), the sentimental value of the title (C01), and the length of the title (C02) as control variables and the quadratic term of the information quantity (X6) as the independent variable, this paper further investigates the relationship between the quantity of information and the fundraising ratio. The descriptive statistics of the variables are shown in Table 5. The number of observations for the verifying number (X2) was 1209, and that for each of the other variables was 1239, with no omission. The minimum value of both the forwarding number (X1) and the verifying number (X2) was 0, and the maximum values were 53,086 and 4494, respectively. Since the two variables had relatively small mean values and relatively large standard deviations, there was a great discrepancy and strong heterogeneity in the sample. With a minimum fundraising goal (X3) of 20,000, a maximum value of 500,000, and a standard deviation of 435,061, there was a great discrepancy among the expectations of various fundraising projects, and a majority of project originators expected too much.
The mean value of the text's sentimental value (X4) was −30.95, with a standard deviation of 23.77, indicating that the text sentiment was generally negative. The standard deviation of the information quantity (X5) was relatively small, which indicated that the difference in the information quantity among texts was also relatively small. With a minimum fundraising ratio (Y) of 0, a maximum value of 1, a mean value of 0.29, and a standard deviation of 0.21, the fundraising ratio was generally low, and only a minority of fundraising projects had an extremely low or extremely high fundraising ratio. Table 6 describes the Pearson correlation analysis on all input and output variables.
Table 7 shows the fitting results of the regression models. It can be seen, based on Model 1 and Model 2, that the goal has a significant negative influence on the raising rate, which is possibly because potential donors are more willing to help those projects whose raising goals can be easily realized and will gain more satisfaction from doing so [64]. The forwarding number and the verifying number exert a positive influence on the raising rate; they are often strongly and positively correlated with the social network of the fundraiser and are key factors that determine the success of crowdfunding. The sentiment of a text has a positive influence on the success of crowdfunding, although it is less significant than the social network factors, possibly because, when donating money, potential donors consider the social relations of the beneficiary rather than the actual situation.
In Model 1 and Model 2, the text information, the sentiment of the title, and the length of the title do not significantly affect the raising rate. In Model 4, however, the coefficient of the squared text-information term is −0.0607, which is significant at p < 0.05, suggesting that text information has an inverted U-shaped relationship with the raising rate. When writing the text, the fundraiser should therefore control its length so that it is neither too long nor too short, in order to increase the raising rate [65].
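The inverted U-shape can be made explicit with a short derivation. The symbols below are illustrative notation rather than the paper's own: only the quadratic coefficient (−0.0607) is reported in the text, so the intercept, the linear coefficient, and the control terms are left symbolic, and X6 is taken to be the square of X5.

```latex
% Assumed notation: \beta_0, \beta_5, \beta_6 and "controls" are illustrative symbols.
% Only \beta_6 = -0.0607 (coefficient of X_6 = X_5^2 in Model 4) is reported in the text.
\begin{align}
Y &= \beta_0 + \beta_5 X_5 + \beta_6 X_5^{2} + (\text{controls}) + \varepsilon,
   \qquad \beta_6 = -0.0607 < 0, \\
\frac{\partial Y}{\partial X_5} &= \beta_5 + 2\beta_6 X_5 = 0
   \;\Longrightarrow\; X_5^{*} = -\frac{\beta_5}{2\beta_6}.
\end{align}
```

Because the quadratic coefficient is negative, the fitted curve is concave and peaks at X5*. The relationship appears as an inverted U in the data only if X5* lies inside the observed range of the information quantity.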
As shown in Table 8, the tolerance of every explanatory and control variable is larger than 0.1 and the variance inflation factor (VIF) is smaller than 10, so the independent variables show no multicollinearity. The adjusted R-squared indices are small, although at least one coefficient of the models is significantly different from 0. There are three possible reasons for this. First, certain important variables might not be included in the model; second, there might be a large amount of random noise in the original dataset [66]; third, the model may have failed to fit the actual data distribution well [67]. The relationship between the independent and dependent variables may nevertheless be important, even though it does not explain a large amount of the variation in the response [66]. This study aims to build methods to predict the success of Internet social welfare crowdfunding projects based on text information. Through text information extraction, we identified the factors that significantly contribute to project success predictions. The non-ideal R-squared indices show that the multiple linear regression model does not perform well on the prediction problem discussed in this study. Therefore, we turn to machine learning to find prediction models with better performance.
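For reference, the two collinearity thresholds quoted above are equivalent, as the standard textbook definitions show. Here R_j² denotes the R-squared obtained by regressing predictor j on the remaining predictors, n the number of observations, and p the number of predictors; these symbols are assumed notation, not taken from the paper.

```latex
% Standard definitions; symbols R_j^2, n, p are assumed notation.
\begin{align}
\mathrm{Tol}_j &= 1 - R_j^{2}, \\
\mathrm{VIF}_j &= \frac{1}{\mathrm{Tol}_j} = \frac{1}{1 - R_j^{2}}, \\
\bar{R}^{2} &= 1 - \left(1 - R^{2}\right)\frac{n - 1}{n - p - 1}.
\end{align}
```

A tolerance above 0.1 is therefore the same condition as a VIF below 10, and both correspond to R_j² < 0.9. The last line is the usual adjusted R-squared, which penalizes the plain R-squared for the number of predictors.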

Machine-Learning Prediction Model Performance
In order to effectively predict the raising rate, the abovementioned forwarding number, verifying number, goal, text sentiment, and text information were adopted as independent variables, and the raising rate was the dependent variable; different machine-learning regressors were adopted for prediction, including multivariate linear regression, the decision tree regressor, the random forest regressor, and the AdaBoost regressor.
The processed dataset was divided into a training portion and a testing portion at a ratio of 3:1. The training portion was further split into training sets and validation sets by five-fold cross-validation: a training set is used to fit a model, a validation set is used to determine the model structure or to control the complexity parameters of the model, and the testing set is used to evaluate the performance of the trained model.
Table 9 shows the optimal hyper-parameters obtained by five-fold cross-validation with resampling on the training data, together with the mean and standard deviation of the prediction performance of the trained models on the validation sets. The AdaBoost regressor (MSE = 0.040, MAE = 0.072, MAPE = 0.222, and R² = 0.961) had a smaller prediction error than the other models and an excellent prediction performance, followed by the decision tree regressor, the random forest regressor, and linear regression in sequence. Each model has the following optimal hyper-parameters: DecisionTree (criterion = 'mse', max_depth = 30, min_samples_leaf = 2), RandomForest (max_depth = 1000, min_samples_leaf = 1, max_features = None, n_estimators = 800, bootstrap = True), and AdaBoost (DecisionTreeRegressor (criterion = 'mse', max_depth = 30, min_samples_leaf = 2), n_estimators = 500, random_state = 0, learning_rate = 0.2).
The random forest regressor and the AdaBoost regressor both adopt a CART-based algorithm and can identify the variables that effectively reduce impurity. Figure 3 displays the arithmetic mean of the variable importance over five repeated training runs. In both the random forest regressor and the AdaBoost regressor, the goal, the forwarding number, and the verifying number are all important factors that affect the raising rate, and the forwarding number is the most important one.
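The 3:1 hold-out split and the five-fold partition of the training portion used in this section can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the seed, and the use of 1239 samples (the count given in the descriptive statistics) are choices made here for the example.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.25, seed=0):
    """Shuffle sample indices and hold out `test_fraction` of them (3:1 split)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]  # (training portion, testing set)


def k_fold_indices(train_indices, k=5):
    """Yield (fit, validation) index lists for k-fold cross-validation."""
    # Fold i takes every k-th index starting at i, so fold sizes differ by at most 1.
    folds = [train_indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validation


# Assumed sample size for illustration only.
train_idx, test_idx = train_test_split_indices(1239)
folds = list(k_fold_indices(train_idx, k=5))
```

Hyper-parameters would be chosen by averaging a validation metric over the five `(fit, validation)` pairs; the testing set is touched only once, after the hyper-parameters are fixed.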
The project initiator may use social links to engage more people in forwarding the crowdfunding project through online channels, in the hope that more insiders will verify the authenticity of the project. This, in turn, indicates that the project initiator's social network plays a decisive role in actual crowdfunding activities and that the preset goal also has an important influence on the raising rate. The text sentiment, text information, the sentiment of the title, and the length of the title do not have a significant influence on the raising rate, suggesting that the initiator's social relations are more important than the project contents.
Table 10 presents the prediction performance of the four models on the testing data using the optimal hyper-parameters. The AdaBoost regressor offers the best prediction performance, with MSE = 0.023, MAE = 0.042, MAPE = 0.090, and R² = 0.977. The smaller the MSE, MAE, and MAPE are, the better the model performs. According to R², the AdaBoost regressor shows the best prediction performance, followed by the decision tree regressor, the random forest regressor, and linear regression in sequence. Linear regression has the poorest prediction performance, which further shows that multiple linear regression is not a good prediction model in this study. Although multiple linear regression can help verify factor importance, its low R² indicates that the factors are likely to have a nonlinear and complex relationship with project success. It is therefore necessary to use machine-learning models to improve the prediction of project success from the text information indices; the results confirm the earlier conjecture that the selected nonlinear machine-learning models improve on the poor fit of the multivariate regression model.
Figure 4 shows the differences between the predicted and expected values of the testing data for the different models, displaying the prediction performance of the four models in a more intuitive way.
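The four evaluation metrics reported in Tables 9 and 10 follow standard definitions and can be computed as in the sketch below. The function name and the sample values are hypothetical. One detail is an assumption made here: because the fundraising ratio can be 0, MAPE is undefined for those samples, and this sketch skips them; the text does not say how the authors handled that case.

```python
def regression_metrics(y_true, y_pred):
    """Return MSE, MAE, MAPE and R^2 for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    sse = sum(e * e for e in errors)  # sum of squared errors
    mse = sse / n
    mae = sum(abs(e) for e in errors) / n
    # Assumption: samples with a true value of 0 are left out of MAPE.
    nonzero = [(t, e) for t, e in zip(y_true, errors) if t != 0]
    mape = sum(abs(e / t) for t, e in nonzero) / len(nonzero)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - sse / ss_tot
    return {"MSE": mse, "MAE": mae, "MAPE": mape, "R2": r2}


# Hypothetical fundraising ratios, for illustration only.
metrics = regression_metrics([0.10, 0.20, 0.40, 0.50], [0.12, 0.18, 0.35, 0.55])
```

MSE, MAE, and MAPE measure error, so smaller is better, whereas R² measures the share of variance explained, so larger is better.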

Conclusions
Using text mining, this study extracted information from a text dataset of Internet social crowdfunding projects in China, including expression factors, such as information entropy, sentimental value, and captions, and socializing factors, such as the number of people forwarding and the number of people confirming, in order to predict the success of project fundraising. For this purpose, a sentiment lexicon for Internet social crowdfunding texts was constructed, and information entropy was introduced to calculate the information quantity of the text. The results show that, compared with the information presented in the text itself, the social networking elements of the fundraiser are key to the success of fundraising. The sentimental value of the text has a positive effect on the success of fundraising, while the effect of information quantity follows an inverted U-shape. The positive effect of sentimental value supports the concerns of Berliner and Kenworthy (2017): a strongly sentimental description in a fundraising text increases the likelihood of receiving donations, which gives fundraisers with sentimental description skills an edge.
Information entropy was introduced here to measure the role of information quantity in predicting the fundraising success of crowdfunding projects for the public good. The effect of information quantity on fundraising success was found to follow an inverted U-shape. This confirms that the findings of Simon (1972) on the limited information-handling capacity of humans apply to fundraising for Internet social crowdfunding projects. Although abundant information can help donors better understand those in need of help, the limits of human information processing mean that a greater information quantity will not necessarily lead to a higher success rate; an appropriate supply of information is more beneficial to the success of such projects.
According to this research, compared with the information presentation in the text itself, socializing factors, such as the number of people confirming and the number of people forwarding, have more significant effects on the success of fundraising for Internet crowdfunding projects for the social good. Social networks are the main platforms for disseminating social welfare crowdfunding information, and social factors play an important role in the success of such projects [68,69]. This is related to the following two characteristics of such projects: being non-profit and being low in quantity. More donors chose to donate money out of love and the social validation of the person forwarding such fundraising information. In this era of information overload and distracted attention, donors will not spend too much time or energy carefully reading the text used for non-profit crowdfunding. In such a context, socializing factors seem particularly important.
Furthermore, the predictive efficiency of four machine-learning models, namely, the multiple regression model, the decision tree regression model, the random forest regression model, and the AdaBoost regression model, was verified and analyzed. Aside from linear regression, the other three models showed good predictive efficiency. Among them, the AdaBoost regressor showed the best efficiency, with an R² of up to 97.7%.

Applications
The extended research and application value of the intelligent method adopted in this study lies in the methods and tools it provides for processing the text of social welfare crowdfunding projects.
The crowdfunding text processing adopted in this research is divided into two parts. The first is the calculation of the emotional value of a single crowdfunding text. To obtain a specific, accurate, and meaningful emotional value, we adopt a calculation method based on an emotional dictionary, the critical points being the scientific selection of emotional seed words and the appropriate weighting of degree words. The second is the calculation of the information quantity in the text. To quantify the information content of a single crowdfunding text and to analyze how the information quantity of the descriptive text influences the success rate of crowdfunding, we combined the information entropy formula with the TF-IDF method and calculated the information quantity of each crowdfunding text accordingly.
The text information processing method in this study is valuable in three respects. First, this study provides methods for the quantified processing of information contained in social welfare crowdfunding texts. Specific and operable tools are presented for calculating the information quantity and emotional value of the text; for instance, a dedicated emotional dictionary for social welfare crowdfunding texts has been established. These methods benefit research on the descriptive text of social welfare crowdfunding projects and address two shortcomings of current research and applications: overly simple text processing methods and serious loss of text information. Second, the method can be extended to other text information processing scenarios. In particular, when the long descriptive texts of a specific sector need to be quantified, the method adopted here can serve as a reference. Third, the method is simple and effective. Currently popular methods, such as those based on neural networks, have a high processing cost, whereas this method solves problems in new research scenarios using classical text information processing theory, which offers an extensible line of thinking for future studies.
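The two calculations can be illustrated with a deliberately simplified sketch. Everything in it is an assumption made for illustration: the tiny English lexicon and degree weights are hypothetical stand-ins for the Chinese emotional dictionary, the scoring rule (a degree word scales the sentiment word that directly follows it) is one common convention rather than the paper's documented rule, and the entropy function is plain Shannon entropy over token frequencies, without the TF-IDF weighting that the study combines with it.

```python
import math
from collections import Counter

# Hypothetical mini-lexicon: polarity of sentiment words and weights of degree words.
SENTIMENT = {"hope": 1.0, "help": 1.0, "pain": -1.0, "died": -2.0}
DEGREE = {"slightly": 0.5, "very": 1.5, "extremely": 2.0}


def sentiment_value(tokens):
    """Sum sentiment polarities, each scaled by a directly preceding degree word."""
    score = 0.0
    for i, token in enumerate(tokens):
        if token in SENTIMENT:
            weight = DEGREE.get(tokens[i - 1], 1.0) if i > 0 else 1.0
            score += weight * SENTIMENT[token]
    return score


def shannon_entropy(tokens):
    """Shannon entropy (in bits) of the token frequency distribution of one text."""
    if not tokens:
        return 0.0
    total = len(tokens)
    counts = Counter(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Illustrative calls on pre-tokenized text.
score = sentiment_value(["extremely", "pain", "hope"])
entropy = shannon_entropy(["a", "b", "a", "b"])
```

For Chinese text, a word segmentation step would be needed before either function is applied; that step is outside this sketch.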
In addition, this study provides a method for the effective prediction of social welfare crowdfunding projects and seeks to raise the success rate of crowdfunding, and thus features significant commercial and social value.

Future Work
In this paper, based on the text information in project datasets collected on social welfare Internet crowdfunding projects in China, the success rate of such projects is predicted. Future research may extend this method to datasets from Internet crowdfunding platforms in other countries. In addition, there are many other factors to be included in the prediction model: external factors, such as the economic environment and platform policy, and internal factors, such as the personal characteristics of the help-seekers and the characteristics of potential donors. Research on these aspects can help platforms better grasp success factors, better distinguish projects more likely to succeed, and thus better manage their projects.
Author Contributions: X.C. performed the theory analysis, conceptualization, and contributed to drafting the manuscript. H.D. analyzed the data, design, coding, and modeling. S.F. collected the data and improved the writing. W.C. performed the literature reviews and funding. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement: Publicly available datasets were analyzed in this study. This data can be found here: https://github.com/kiwi1998dh/datasets (accessed on 28 December 2021).

Conflicts of Interest: The authors declare no conflict of interest.

Appendix A
Translations of the Chinese text information in Tables 1-4.
The son with aplastic anemia needs a bone marrow transplant, and only the 8-year-old sister and brother in the family have successfully matched. The sister is eagerly looking forward to being the "hero" who saves her brother, while the parents hesitated, worrying about whether the young daughter's bone marrow donation would impact her health. At that moment, the grandfather stood up and said: boy and girl are both significant, and neither should be ignored. Go for it, as they are born from one bloodline, which is the best convenience; they are the siblings . . .
Table A2. Calculation of the information entropy of texts.

1 Dear uncles and aunts. My name is ***, 12 years old, living in an ordinary family in Pingshang Town, I am the only daughter of the family and I implore everyone to help me! Wife died in a car accident on 20** . . . .

566

2 Dear social benevolent personage, I have no choice but to initiate this fundraising, hope to get everyone's understanding and support! Never thought that I would make a QingSongChou 'cause it happened too suddenly, which caught me off guard and bothered everyone. Sorry about this situation . . .

585
3 I am ***, ** years old, coming from ** province *** village. I went to the hospital and had a checkup when I felt uncomfortable at the late August and was diagnosed as uremia later. The news was like a bolt from the blue. My family couldn't believe it and then I had repeated checks in other hospitals that finally diagnosed as uremia . . .