Detecting Indicators for Startup Business Success: Sentiment Analysis Using Text Data Mining

Abstract: The main aim of this study is to identify the key factors in User Generated Content (UGC) on the Twitter social network for the creation of successful startups, as well as to identify factors for sustainable startups and business models. New technologies were used in the proposed research methodology to identify the key factors for the success of startup projects. First, a Latent Dirichlet Allocation (LDA) model was used, which is a state-of-the-art topic modeling tool that works in Python and determines the topics of a database by analyzing tweets with the #Startups hashtag on Twitter (n = 35,401 tweets). Secondly, a Sentiment Analysis was performed with a Support Vector Machine (SVM) algorithm that works with Machine Learning in Python. This was applied to the LDA results to divide the identified startup topics into negative, positive, and neutral sentiments. Thirdly, a Textual Analysis was carried out on the topics in each sentiment with Text Data Mining techniques using Nvivo software. This research has detected that the topics with positive feelings for the identification of key factors for startup business success are startup tools, technology-based startups, the attitude of the founders, and the development of the startup methodology. The negative topics are the frameworks and programming languages, the type of job offers, and the business angels' requirements. The identified neutral topics are the development of the business plan, the type of startup project, and the incubator's and startup's geolocation. The limitations of the investigation are the number of tweets in the analyzed sample and the limited time horizon. Future lines of research could improve the methodology used to determine key factors for the creation of successful startups and could also study sustainability issues.


Introduction
In recent years, advances in new technologies have led companies to adopt new business models that incorporate globalization and use the Internet as a promotion tool for products and services [1]. With the evolution of technologies since the first decade of the 21st century, these business models have been adapting to include new processes and social changes, as well as the new demands of consumers, who are increasingly supported in this new digital era where the use of new technologies has become a habit in both professional and personal worlds [2,3].
In this new digital age, companies adopt business models that are scalable by using technologies that can help them to understand what and how their users and clients think [4]. Users' thinking is expressed on digital platforms and environments and is called User Generated Content (UGC) [5,6].
Over the last decade, UGC has been used in research to determine the key factors for any chosen topic [7]. The term 'startup' was coined for business models using technology. A startup is a technology-based company that offers a new product or service using the added value of the incorporated technology. This is defined as 'innovation through technology' [8].
Startups use scalable business models, that is, startups make investments for the improvement of the technology on which they base their project, and once the technology has been improved, the product or service is created [3]. The product or service is launched when the product is ready, and many startups create successful products and services for consumers whose consumption habits are based on the digital age. Examples of services and products that were created as startups are WhatsApp, Facebook, Instagram, and the technological giant Alphabet (Google) [8].
Startups are small companies that start from an innovative idea using technology and with time and experience become a solid and solvent technological and innovative company that is sustainable over time. In a global ecosystem where new technologies and processes are produced on a daily basis, it is important to know the key factors that can make a startup successful, as well as identifying the technologies that will determine what humans will do in the coming years [9,10].
Currently there are technological processes that generate data and information in real time. New technologies such as big data, data mining, and artificial intelligence or business intelligence are the results of analyzing this data, and all provide important value for companies [11]. Startups base their business models on innovation. Innovation is the process of searching for a value that improves a current product or service or satisfies a demand that has not been covered until now [12,13]. Technological innovation is what startups do with new products and services by working with an emerging technology and applying it to a new or existing product [14].
In this sense, it is of interest for academic researchers to study what the success factors for startups are, and also to compare how the findings of these research studies fit with the startup industry. The startup industry needs to know what the success factors for its businesses are, since startups want to develop successful and profitable business models over time. Consequently, this study aimed to identify the key factors that make a startup successful by analyzing the comments made in UGC on the Twitter social network. In addition, new technologies were used to carry out the methodology of this study. First, Latent Dirichlet Allocation (LDA), which is a state-of-the-art topic modeling tool that works in Python, was used to find the topics of a database made up of extracted tweets with the hashtag #Startups on Twitter (n = 35,401). Secondly, Sentiment Analysis was performed with a Support Vector Machine (SVM) algorithm that works with Machine Learning in Python. This algorithm was applied to the results of the LDA model to divide the identified topics into negative, positive, and neutral, for the key factors that make a startup successful. Finally, a Textual Analysis was performed with the qualitative analysis software Nvivo in order to identify the key factors for the success and creation of startups using the results found for the users' sentiments about the topics identified in the Twitter UGC.

UGC Analysis
The UGC analysis was performed on samples generated as a result of online comments, user generated content, and reviews made on online platforms. An online review or comment is a piece of text in a public profile on the Internet that describes a user's experience with a product, service, or topic [15]. By studying this type of UGC on the Internet, a solid, causal relationship can be built that has a powerful meaning and can be useful for effective research [16,17]. At the same time, advances in technology that give rise to new research models help to improve Text Data Mining techniques, which help, among other things, to automatically find information in large databases or to recognize topics in database data generated by comments, reviews, and User Generated Content on social networks [18]. For example, Latent Dirichlet Allocation (LDA) is a modeling tool that is able to identify topics from a database of qualitative reviews and comments and to count the number of comments made about any topic [6,19]. Table 1 below shows the characteristics of LDA models when analyzing UGC-type content in other studies.
Source: Adapted from Jia [6]
A benefit of analyzing Twitter interactions and UGC is that user comments about other companies are included. This allows the amount of engagement to be measured. In this way, researchers are able to analyze a user's motivation from a UGC comment. Jia [6] and Wang and Zhai [25] discovered that the most important types of motivations were knowledge and sense of belonging from the content generated by chat groups on the Internet. These were found by analyzing the chat messages without directly asking the users who wrote the comments.
The research by Liang et al. [26] found users' motivations by studying users' textual expression on the Internet. In addition, finding correlations between users and the number of ratings can also be a way to quantify these methodologies in order to obtain metrics for the motivation and satisfaction of Internet users who generate content. For example, Saura et al. [3] analyzed the reviews of hotel users with the UGC on the TripAdvisor social network. Companies can see if their consumers, or users, are happy or satisfied with them by using a UGC analysis approach and can also find out the reasons for the users' feelings.

Sentiment Analysis with Social Network Analysis
Sentiment Analysis is a research methodology that analyzes the feelings of a given sample, which normally comes from digital environments such as online platforms or social networks, to find the different opinions with different methodological approaches. It has been confirmed that Sentiment Analysis can identify the feelings and therefore the opinions of product users in order to understand how these feelings and opinions affect the users' decision making [3,8]. There are different options and approaches for this technique. Approximations can be made using special software for applying machine learning, artificial intelligence techniques, and hybrid models. Other options are available, such as algorithm training with Data Mining techniques, which are processes used to improve the probability of success of an algorithm with machine learning based on the accuracy of the results [27,28].
Several studies have been carried out with machine learning models to analyze social networks and users' opinions, and to identify the key factors that influence different cases. Supervised methods using the classification and categorization of key factors, such as Maximum Entropy (MaxEnt) and Support Vector Machines (SVMs), have been used to perform social network analysis with machine learning using technological research methods to identify the important factors in different areas of research [14,29]. However, there are other types of approaches based on sentiment analysis with UGC, such as Naïve Bayes, Linear Regression, or Deep Learning [3]. These studies used keywords; ratings of feelings about a topic; semantic meaning; concepts and semantic theory; feelings about topics such as hashtags, retweets, or points on social networks; and valuation identifiers for products and services on the Internet [30].
In the research by Liang [26], a semi-supervised dual recurrent neural network was used to perform Sentiment Analysis. This is similar to a traditional neural network and can be used to evaluate a data set over a long period of time [6,27]. This technique allows an effective and efficient Sentiment Analysis to be carried out [6]. In the research by Reyes-Menendez et al. [17], Sentiment Analysis was performed on the UGC for a hashtag (#WorldEnvironmentDay) and in the research by Saura et al. [3] and [14], the authors used an SVM algorithm that identified the sentiments of a sample collected from social networks and divided them into negative, positive, and neutral sentiments. In Hasan et al. [31], a recursive neural network was used to understand the meaning of different comments.
It can be seen that the feelings expressed in UGC can be analyzed with Neural Connection Analysis of groups of interacting users; Textual Analysis, which is the analysis of words and their sentiments to determine key factors; Time, which is the analysis of the time over which this UGC takes place, and the analysis of patterns of sentiments in social networks; Hashtags, URLs, or Mentions, which is the analysis of labels of user groups; Topic, which is the analysis of sentiments of content categories with the same topic; and finally, Classification of Information, which is a feature of Sentiment Analysis that uses information keywords with the sample [6,32,33]. Table 2 shows the main characteristics of Sentiment Analysis.

Textual Analysis
Textual analysis is a text mining analysis approach that determines key factors by analyzing large amounts of data. It is a qualitative approach that uses the weight and repetition of text in a given sample to determine the keywords that express the sentiments shown about the subject under study [38]. In the research by Vázquez and Escamilla [39], a textual analysis process was undertaken with the Nvivo software, which aimed to identify attitudes towards the main factors for the health of the elderly. In the research by Saito et al. [40], textual analysis was used to predict re-tweets from the relevance of the UGC content on Twitter.
Likewise, Jiang et al. [41] analyzed the fundamental factors that affect a concept called "re-tweetability" for each tweet when using a predictive filter for the collaboration between users [42], the connections, and the repetition of keywords in tweets. Therefore, Textual Analysis can be used to determine and identify the keywords with the greatest weight in a given sample and study the influence of these on the content.
Table 3 shows a summary of the main characteristics of the approximations when using textual analysis to identify key factors in UGC analysis.

Research Questions
The following Research Questions (RQs) were proposed for this study using the above information because of the interest shown by startups in identifying technologies for their business models. Previous studies have shown that the most important topics can be found for different industries and areas by using UGC analysis and approximations [6,13]. In addition, LDA can be used to identify topics in the UGC on social networks [29]. This study used the following research questions to identify whether important business topics for startups can be found from the comments in UGC content on Twitter:
RQ 1: Can important business topics for Startups be found in the UGC content on Twitter?
Different studies have used methodological approaches with UGC content to find the feelings expressed by the comments and opinions of users on social networks such as Twitter, Google Maps, TripAdvisor, or Booking.com [14,15,17]. In this study, the Twitter social network was used for Sentiment Analysis of the topics commented on in Twitter users' UGC. These were divided into positive, negative, and neutral sentiments:
RQ 2: What sentiments are expressed about the topics for startup business success in Twitter UGC?
Key factors are important factors that influence the advance of a topic [14,25]. For startup businesses, key factors could be the leadership of the managers, the management of the team members, or the innovation technology chosen by the startup [36]. Other key indicators could be related to investors or new business models for sustainable approaches. The following research question was proposed for this research:
RQ 3: Could key indicators for startup business success be found from the sentiment topics of Twitter UGC content, and can these results consequently determine negative, positive, and neutral factors for success?

Methodology
The methodology was divided into three phases. First, a Latent Dirichlet Allocation (LDA) model was used, which is a state-of-the-art topic modeling tool that works in Python and determines the topics of a database by analyzing tweets with the #Startups hashtag on Twitter (n = 35,401 tweets). Secondly, a Sentiment Analysis was performed with a Support Vector Machine (SVM) algorithm that works with Machine Learning in Python to divide the identified topics into negative, positive, and neutral for the key factors that make a startup business successful. Thirdly, a Textual Analysis was performed on the results with Text Data Mining techniques using the Nvivo qualitative analysis software.

Data Sampling
The sample for this study was structured using information from previous studies that used the same methodology for a sample of 2000 Tweets and another of 10,000 Tweets [17,46]. Palomino et al. [46] extracted information from 6333 Tweets for the #getoutside hashtag and Reyes-Menendez et al. [17] used the #WorldEnvironmentDay hashtag with a sample of 5874 tweets. The public Twitter API (Application Programming Interface) was used to download the tweets. Initially, the sample consisted of 44,101 tweets, but after the database cleaning process, the final sample was reduced to n = 35,401 tweets. This step was done using the Mac version of Python 3.7.0. The tweets that contained the keywords "startup", "start-up", "startups", and "start-ups" in English were used [17,38]. The database of downloaded tweets was cleaned to eliminate tweets that were repeated because they were news, duplicate content, or retweets. The images and multimedia files published next to the Tweet text were not analyzed. Following Saura et al. [3], the sample of tweets was validated with the following criteria:

• Active Twitter profile. Profiles without activity in the three months prior to the use of #Startups were deleted.
• Profile photo and cover picture. The Twitter user profile had to have a profile photo and a cover picture.
• No retweets. Retweets from the same tweet about #startup, "start-up", "startups" and "start-ups" were removed (i.e., considered as duplicate content).
• Public profiles. Only public profiles and tweets using #Startups in English were included.
• Minimum 80 characters. Tweets had to be at least 80 characters long (including spaces) and use the #Startups hashtag. This means that tweets without the "#" or with a wrong label like "# startups" were omitted.
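The validation criteria above can be sketched as a small filter. This is a minimal illustration only, not the authors' cleaning script: the dictionary keys (`text`, `is_retweet`, `is_public`, `has_profile_photo`, `has_cover_picture`, `days_since_last_activity`) are hypothetical field names, and "three months" is approximated as 90 days.

```python
import re

# Matches "#Startups" in any letter case, but not "# startups" or "#startupslife".
HASHTAG = re.compile(r"#startups\b", re.IGNORECASE)


def keep_tweet(tweet):
    """Return True if a tweet record passes the validation criteria.

    `tweet` is a dict with hypothetical keys chosen for this sketch.
    """
    text = tweet["text"]
    if tweet.get("is_retweet") or text.startswith("RT @"):
        return False  # no retweets
    if not tweet.get("is_public", True):
        return False  # public profiles only
    if not (tweet.get("has_profile_photo") and tweet.get("has_cover_picture")):
        return False  # profile photo and cover picture required
    if tweet.get("days_since_last_activity", 0) > 90:
        return False  # assumed reading of "active in the prior three months"
    if len(text) < 80:
        return False  # minimum 80 characters, spaces included
    return bool(HASHTAG.search(text))  # must carry a well-formed #Startups tag


def clean_sample(tweets):
    """Drop duplicate texts and tweets that fail any criterion."""
    seen, kept = set(), []
    for tweet in tweets:
        key = " ".join(tweet["text"].lower().split())  # normalize case and spacing
        if key in seen or not keep_tweet(tweet):
            continue
        seen.add(key)
        kept.append(tweet)
    return kept
```

Deduplicating on normalized text is one possible reading of "duplicate content"; the study does not state how duplicates were detected.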

Topic Identification Using LDA
The LDA model is based on a probabilistic assumption that content is generated in two steps [6,14,47]. The first step identifies words and separates each word into a different document. The next step randomly identifies the distribution of the topics in a sample, and then selects the main topics found in that sample [6,13-15]. In real situations, neither the distribution of topics in documents nor the distribution of words in topics is known a priori [6]. The relationship between the hidden and observed variables is given by the joint distribution expressed mathematically in (1) below:

$$p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{i=1}^{K} p(\beta_i) \prod_{d=1}^{D} p(\theta_d) \left( \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \right) \quad (1)$$

where $\beta_i$ is the distribution of a word in topic $i$, with total $K$ topics; $\theta_d$ is the proportion of topics in document $d$, with total $D$ documents; $z_d$ is the topic assignment in document $d$; $z_{d,n}$ is the topic assignment for the $n$th word in document $d$, with total $N$ words; $w_d$ is the observed words for document $d$; and $w_{d,n}$ is the $n$th word for document $d$.
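The generative assumption behind LDA can be made concrete with a short simulation. The sketch below follows the textbook generative process (draw per-topic word distributions and per-document topic proportions, then draw a topic and a word for each position); it is not the fitting procedure used in the study. The Dirichlet hyperparameters `alpha` and `eta`, the toy vocabulary, and all function names are assumptions made for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(vocab, n_topics, n_docs, n_words, alpha=0.5, eta=0.5, seed=0):
    """Simulate documents from the LDA generative process.

    Returns (beta, thetas, docs): beta[i] is the word distribution of topic i,
    thetas[d] is the topic proportion of document d, docs[d] is its word list.
    """
    rng = random.Random(seed)
    # One word distribution per topic.
    beta = [sample_dirichlet([eta] * len(vocab), rng) for _ in range(n_topics)]
    thetas, docs = [], []
    for _ in range(n_docs):
        # Topic proportions for this document.
        theta = sample_dirichlet([alpha] * n_topics, rng)
        words = []
        for _ in range(n_words):
            # Topic assignment for this position, then a word from that topic.
            z = rng.choices(range(n_topics), weights=theta)[0]
            words.append(rng.choices(vocab, weights=beta[z])[0])
        thetas.append(theta)
        docs.append(words)
    return beta, thetas, docs


vocab = ["investor", "funding", "python", "framework", "incubator", "mentor"]
beta, thetas, docs = generate_corpus(vocab, n_topics=2, n_docs=3, n_words=8)
```

Fitting LDA runs this logic in reverse: given only the observed words, it estimates the hidden topic distributions and assignments.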

Sentiment Analysis
After the topics had been identified using the LDA model, the Python algorithm supplied by MonkeyLearn (MonkeyLearn, San Francisco, CA, USA) was used by connecting to its API [14,15,36]. The algorithm was first trained with data-mining processes so that it could subdivide the sample into positive, negative, and neutral sentiments about the technology startup industry. A total of 481 samples were processed with data-mining. The sample of Tweets was fed into the MonkeyLearn application and the interface was linked to the Sentiment Analysis algorithm until a probability of >0.674 was reached [3,13-15]. Algorithm training was carried out after identifying the content that was exclusively related to the research topic, including the ironic and sarcastic comments [17]. Throughout the entire process, the content that was not related to #startups was discarded from the sample and training.
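The study relied on MonkeyLearn's hosted classifier, whose internals are not described here. As a stand-in, the following sketch shows how a three-class linear SVM could be trained from labelled examples: one binary classifier per sentiment (one-vs-rest), each minimizing hinge loss with an L2 penalty by subgradient descent. The training sentences, hyperparameters, and function names are all hypothetical and chosen only to keep the example self-contained.

```python
from collections import defaultdict

# Hypothetical toy training data; the study trained on 481 labelled samples.
TRAIN = [
    ("great tool love this startup", "positive"),
    ("amazing team great product", "positive"),
    ("terrible funding awful terms", "negative"),
    ("awful job offer terrible pay", "negative"),
    ("startup located in berlin", "neutral"),
    ("business plan template available", "neutral"),
]


def featurize(text):
    """Return bag-of-words counts for a lower-cased, whitespace-split text."""
    counts = defaultdict(float)
    for token in text.lower().split():
        counts[token] += 1.0
    return dict(counts)


def score(weights, features):
    """Dot product between a weight vector and a sparse feature vector."""
    return sum(weights.get(tok, 0.0) * val for tok, val in features.items())


def train_binary_svm(examples, epochs=50, lr=0.1, lam=0.01):
    """Linear SVM without a bias term; labels y are +1 or -1."""
    weights = {}
    for _ in range(epochs):
        for features, y in examples:
            margin = y * score(weights, features)
            # L2 regularization step: shrink every weight slightly.
            for tok in weights:
                weights[tok] *= 1.0 - lr * lam
            # Hinge-loss step: only examples inside the margin move the weights.
            if margin < 1.0:
                for tok, val in features.items():
                    weights[tok] = weights.get(tok, 0.0) + lr * y * val
    return weights


def train_one_vs_rest(data):
    """Train one binary SVM per sentiment label."""
    labels = sorted({label for _, label in data})
    models = {}
    for target in labels:
        examples = [(featurize(text), 1 if label == target else -1)
                    for text, label in data]
        models[target] = train_binary_svm(examples)
    return models


def predict(models, text):
    """Return the label whose classifier gives the highest score."""
    features = featurize(text)
    return max(models, key=lambda label: score(models[label], features))


models = train_one_vs_rest(TRAIN)
print(predict(models, "great product"))
```

A production classifier would add proper tokenization, a held-out evaluation set, and calibrated probabilities; the raw scores here are margins, not the probability values reported in the study.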

Textual Analysis
The databases were processed for sentiments using different stages of the Nvivo software in which the tweets were categorized into the following three nodes: Positive (N1), Neutral (N2), and Negative (N3) [13,14,48]. The data entry process was manual for Nvivo although the databases were already divided into Sentiments. The researchers then created the node structure and filtered the database, eliminating the words identified as connectors, prepositions, articles, and plural forms [6,49]. The nodes were defined as data containers that were grouped according to their characteristics. It should be noted that the design and development of nodes is a way to analyze pure data and to achieve the highest possible descriptive and research quality. An important indicator for the analysis using Nvivo is the weighted percentage [17,50], which shows the number of times the data in a node is repeated in the sample. To calculate the weighted percentage, the following formula was used:

$$K = \frac{\sum_{i} k_i}{n}, \quad i = \{1, \ldots, n\}, \quad n = [1, 25] \quad (3)$$

In this formula, a query that allows the program to search the text is used to find $K$. The behavior of each of the words and each tweet can be seen, and the value of $K$ was found for the #Startups hashtag. Using this process, the average value of $K$ for all the tweets was calculated in order to obtain the global value [13,51]. Figure 1 shows the steps of the methodology used in this investigation.
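The word filtering and the weighted percentage can be illustrated with a few lines of code. Nvivo's own implementation is not described in the text, so this is only a sketch under stated assumptions: the stop-word list is a hypothetical sample, the weighted percentage is taken to be a word's share of all counted words, and `mean_k` reads the formula for K as a plain average of the values k_i.

```python
from collections import Counter

# Hypothetical stop-word list standing in for connectors, prepositions, and articles.
STOPWORDS = {"the", "a", "an", "of", "for", "and", "to", "in", "on", "with"}


def word_counts(tweets, stopwords=STOPWORDS):
    """Count words across tweets after lower-casing and dropping stop words."""
    counts = Counter()
    for text in tweets:
        for token in text.lower().split():
            token = token.strip(".,;:!?\"'()")
            if token and token not in stopwords:
                counts[token] += 1
    return counts


def weighted_percentages(counts):
    """Share of each word among all counted words, as a percentage."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: 100.0 * count / total for word, count in counts.items()}


def mean_k(values):
    """Average of the per-item values k_i, i.e. K = sum(k_i) / n."""
    return sum(values) / len(values)


node = ["The tools for the team", "tools and tools"]
percentages = weighted_percentages(word_counts(node))
```

Merging plural forms, which the authors also mention, would need a stemming or lemmatization step that is left out here.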

Latent Dirichlet Allocation (LDA) Model
The topics identified with the LDA are shown in Table 4. In the LDA process, the words were automatically categorized into topics, and the researchers gave each topic a name after analyzing the group of words. Manually naming the topics is a standard procedure in LDA-based topic identification [21,23,24]. The name of a topic is usually selected by researchers by taking the top 10 ranking words in the topic classification and forming a meaningful name for the topic from these words [52].

| Topic Name | Topic Description |
|---|---|
| Business Angels | Relationship with investors or business angels to obtain financing for startups. |
| Business Plans | Information about how to prepare a business plan for startups, which is adapted to its ecosystem. |
| Startup Project | Information about the startup's foundation, creation, management, and team structure. |
| Startup Methodology | Lean startup method for the development of successful startups. Guidelines to structure the projects. |
| Startup Incubators | Information about start-up incubators or accelerators that offer startup acceleration and promotion programs in their training programs. |
| Startup Jobs | Job profiles and job offers in startups. Specialist profiles for developers or digital marketing. |
| Startup Founders | Information for startup CEOs (Chief Executive Officer) and team leaders. |
| Technology-Based Startup | Startups that develop or improve the technologies on which their business model is based, seeking innovation and excellence in sustainable business processes and quality. |
| Startup Geo-Location | Location of startups and information about them. Main startup's location identified. |
| Startup Tools | Tools that startups use to organize team management and collaboration between the startups' team members. |
| Startup Frameworks and Programming Languages | Programming languages and frameworks that are usually used in startups to develop their projects. |
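The naming step described for the LDA topics, where researchers read the top-ranked words of each topic before choosing a label, can be supported by a small helper. The sketch below is illustrative only; the word probabilities are made-up values and `top_words` is a hypothetical function name.

```python
def top_words(topic_word_probs, n=10):
    """Return the n highest-probability words of a topic, ties broken alphabetically."""
    ranked = sorted(topic_word_probs.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]


# Made-up word distribution for one topic, used only to show the call.
example_topic = {"investor": 0.30, "angel": 0.25, "funding": 0.20, "pitch": 0.05}
candidates = top_words(example_topic, n=3)
```

The returned list is only an aid: choosing a name such as "Business Angels" from these words remains a manual judgment by the researchers.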

Topic Sentiment Identification
Sentiment Analysis of the topics obtained with the LDA model identified the feelings expressed in these topics [6,53]. Sentiment Analysis was done separately on the tweets included in each topic, allowing the topics to be separated into different feelings that were later used in a textual analysis. The algorithm's probability of success is determined by (i) the quality of the sample, after having been filtered and refined by the authors, and (ii) the number of times the algorithm is trained on the dataset. In this research study, we trained the algorithm with a total of 481 samples that were processed with data-mining techniques. The sample of Tweets was fed into the MonkeyLearn application and the interface was linked to the Sentiment Analysis algorithm until average probabilities of >0.794 (positive sentiment), >0.802 (neutral sentiment), and >0.693 (negative sentiment) were reached [3,13-15]. The results of the sentiment analysis are shown in Figure 2. The Sentiment Analysis identified the different sentiments shown about each topic. Table 5 shows the name and description of the topic and the identified sentiment (positive, negative, or neutral).
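The stopping rule based on average probabilities can be expressed as a simple check over classifier output. This sketch assumes each prediction is available as a (label, probability) pair, which is a hypothetical data layout; only the three threshold values come from the text.

```python
from collections import defaultdict

# Average probability levels reported in the study for each sentiment.
THRESHOLDS = {"positive": 0.794, "neutral": 0.802, "negative": 0.693}


def mean_probability_by_label(predictions):
    """Average the classifier probability separately for each predicted label."""
    sums, counts = defaultdict(float), defaultdict(int)
    for label, probability in predictions:
        sums[label] += probability
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}


def meets_thresholds(means, thresholds=THRESHOLDS):
    """True only if every sentiment exceeds its target average probability."""
    return all(means.get(label, 0.0) > target for label, target in thresholds.items())


# Hypothetical predictions, used only to show the calls.
sample = [("positive", 0.9), ("positive", 0.7), ("neutral", 0.85), ("negative", 0.75)]
means = mean_probability_by_label(sample)
ready = meets_thresholds(means)
```

With a check like this, training rounds could be repeated until `meets_thresholds` returns True, which is one way to read "until average probabilities ... were reached".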

Textual Analysis Results
The textual analysis identified factors about startups from the sentiments expressed. The factors that were identified as positive, negative, and neutral for the success of a startup are shown in Tables 6-8. The application of Textual Analysis with Nvivo software identified three nodes for the feelings shown in each topic. Text Data Mining was performed on each of these to find the most important factors from the weight of each in the selected themes. The words were grouped into different nodes according to the number of times the words were repeated in the dataset.

| Topic Name | Description |
|---|---|
| Business Plans | Business plans in startups define the viability of the products or services offered. It is of critical importance before receiving an investment. |
| Startup Project | Startup projects should be sustainable, exponential, and innovative. In addition, they should be based on technological breakthroughs. |
| Startup Incubators | Startup incubators are an opportunity to start projects with the help of mentors and funding. Startup accelerators are important for small startups that need help to develop their ideas and business plans. |
| Startup Geolocation | The location of a startup can help its success. The ecosystems and locations where there are many startups can help the projects be successful because of the surrounding community. |

Once similar words were grouped into independent nodes, a qualitative approximation was carried out to find the factors of each indicator. N1 was analyzed for the factors of positive sentiment, N2 was analyzed for the neutral factors, and N3 was analyzed for the negative factors.

Discussion
This study identified the main topics for the development of successful startups.A large number of social network users' opinions were analyzed in order to identify the relevant factors for this study.The information collected from the UGC on the Twitter social media has given us interesting results in this study.
The positive factors for startups that were expressed in users' sentiments in their UGC were identified. The UGC topics identified were related to the management tools used by startups to improve their internal processes; artificial intelligence and machine learning technologies; the attitude of the startups' management and the team leaders; as well as the correct progression of the startup business model, which should be based on sustainability and innovation. Negative sentiments were also identified for the key factor about the frameworks and programming languages that startups use, because of the difficulty of finding relevant expertise in these areas.
Likewise, the high returns demanded by business angels for investment in startup-type projects were also identified as a negative key factor. The neutral sentiment factors were those related to the progression of the business models, the type of projects, the startup incubators, and the location of the startups.
This research study identified the main topics for the success of a startup and also the main factors by analyzing the feelings detected in the UGC on Twitter.This study used a three-phase methodological approach for the analysis of UGC on Twitter.This approach is valid when using an LDA model with defined topics, on which text mining techniques are applied with a machine learning approach.This methodological text mining approach is valid for the analysis of content on social networks to identify important factors in defined research areas.
As has been observed in the results of this research study, the positive factors for a successful startup are characterized by the type of tools it uses, the technology it develops, the leadership and empathy of CEOs, and their methodologies for the project development.It can be said that artificial intelligence, machine learning processes, and the attitude of startup managers are key factors for startups to succeed.
Other factors that obtained a neutral result are the standards for success including the development of business plans, the type of project, and the support of startup incubators, and the geolocation of startups.As for the negative factors, we should highlight that they are factors that can harm the success of a startup if they are not well employed, such as the type of programming languages used, the quality of the job offers, and finally, the treatment from and negotiations with the business angels.

Conclusions
This research used a three-phase methodological process to extract the main topics about startups that appeared in Twitter users' UGC.The sentiments of these comments were identified and the key indicators for startup business success were found from the Twitter users' comments.
Important topics were identified in the startup's ecosystem, such as the importance of business plans; the startups' projects; sustainable business models; employee profiles in startups; theoretical and educational support; development or programs of institutions such as startup incubators or accelerators; and attitudes to investors and business angels.In addition, the technologies, applications, tools, and programming languages that startups use were also identified as important topics to consider.
These topics were grouped according to the sentiment that users show about them. These sentiments were negative, positive, and neutral. Key factors for startups in each topic were then identified using the comments in these groups. The key factors found allow us to understand the users' sentiment and attitude to these key issues and factors. This information is important for the creation and development of a successful startup project. The topics cover the main points of the startup development process and can be used by practitioners to improve their strategies or rethink their tactics.
RQ1 was answered in this study since the main topics on Twitter have been identified from the large volume of data obtained from the Twitter UGC.
RQ2 was also verified as the sentiments shown for each topic were found and rated by the importance given to the topic in the opinions of the social network users.
RQ3 was answered positively and the route for successful startup business in the digital era has been shown.
The indicators for this route were found from the sentiments shown in users' opinions.Both academics and professionals can use these indicators to create and follow successful startup business models using the results found in this study.

Theoretical Implications
The theoretical implications of this study, which concern the comments made on social networks, especially on Twitter, about startup business success, are relevant for researchers. Text Data Mining allows meaning to be given to large amounts of data that have been grouped by topics. Innovative methods and methodological approaches were used for the analysis of the data in this study, and patterns and indicators were identified that had not been found before.
Researchers can use the methodological approach proposed in this research to increase the literature available about research into startups or use these methods to improve and consolidate future studies.

Practical Implications
This study gives a wide range of practical results that professionals can use in the startup industry. CEOs and startup leaders can take advantage of the key indicators identified in this study to improve their projects by ensuring that the key factors for the success of a startup business according to UGC on Twitter are included in the business plan and project. This study can be used as a guide to the issues found in the startup ecosystem from the large amount of Twitter data that was analyzed. Entrepreneurs who are considering a startup project can use this research to understand the structure of the startup ecosystem.
The limitations of this study are due to the size of the sample, the topic chosen for the study, and the methodological approach taken to reach the conclusions and implications presented.Future lines of research could improve the methodological process of text mining and increase the sample size to try to find new indicators for startups.

Figure 1. Summary of the three-phase methodology process. Source: the authors.

Figure 2. Results of sentiment analysis. * Accuracy: This is the probability of success obtained after the training of the algorithm.

Table 1. Characteristics of User Generated Content (UGC) analysis in research studies. LDA = Latent Dirichlet Allocation.

Table 2. Summary of the main research using Sentiment Analysis.

Table 3. Summary of the main characteristics of Textual Analysis approaches.

Table 4. Identified topics for startups.

Table 5. Sentiment for each topic.