Identifying Startups Business Opportunities from UGC on Twitter Chatting: An Exploratory Analysis

Abstract: The startup business ecosystem in India has experienced exponential growth. The amount of investment in Indian startups in the last decade demonstrates the strong interest of the technology industry in these business models based on innovation. In this context, the present study aims to identify investment opportunities for investors in Indian startups by identifying key indicators that characterize the startup ecosystem in India. To this end, a three-step data mining method is developed. First, a sentiment analysis (SA), a machine learning approach that classifies the topics into groups expressing feelings, is applied to a dataset. Next, we develop a Latent Dirichlet Allocation (LDA) model, a topic-modeling technique that divides the sample of n = 14,531 tweets from Twitter into topics, using user-generated content (UGC) as data. Finally, in order to identify the characteristics of each topic, we apply textual analysis (TA) to identify key indicators. The originality of the present study lies in the methodological process used for data analysis. Our results also contribute to the literature on startups. The results demonstrate that the Indian startup ecosystem is influenced by areas such as fintech, innovation, crowdfunding, hardware, funds, competition, artificial intelligence, augmented reality and electronic commerce. Of note, in view of the exploratory approach of the present study, the results and implications should be taken as descriptive, rather than as determining factors for future investments in the Indian startup ecosystem.


Introduction
In the 21st century, new information communication technologies (ICTs) have revolutionized the structure of the business sector [1]. Specifically, ICTs have increased the uncertainty regarding the success of different business models based on innovation and technology; this uncertainty is mainly due to the exponential development component of these business models.
India is one of the most relevant countries worldwide for the development of startups focused on innovation and technology [2]. For further development of Indian startups, i.e., small companies based on innovation and technology, investors who provide liquidity to these businesses should be aware of several key issues and factors that define the Indian startup ecosystem. This awareness will effectively guide investment based on both reliability and risks associated with startups [2].
In this context, startups have become a key element of the digital economy in India and other developing countries [3]. These entrepreneurial projects, based on technology and innovation, seek to identify new opportunities in different sectors. Of note, in their early stages, large companies such as Google, Uber, Netflix or Facebook were considered startups [4].
However, when these business projects receive rounds of investment and scale financially, they become medium-sized and large companies [5,6]. Accordingly, these companies move from the startup ecosystem to become multinational enterprises. Innovations and applied technologies in the startup ecosystem include smart cities, healthcare, 3D printing, drones, development of professional solutions based on artificial intelligence, Big Data and so forth [7].
In this context, new startup projects are difficult to analyze from the perspective of investors who seek profitable investments in projects that can be consolidated in the market after a round of investment [8]. However, according to Kohler [9], most startups fail for one of two reasons: such projects either try to enter a saturated market, or they enter a market that has no space for the application of new technologies [10]. Sometimes, there are very high barriers to entry or limited opportunities to adapt the system to the technologies proposed by startups within the short term [11]. In this situation, startup investors try to find actionable insights regarding the projects and the corresponding innovation-based technologies to ensure that their investments will be profitable [12].
According to Weiblen and Chesbrough [13], social networks and digital marketing strategies, the main communication channels through which startups can promote their products and services, constitute an important segment in the startup ecosystem [12]. These channels are used to attract new users and clients and to show the world the projects that startups have developed [14].
Today, databases of online consumer behavior are increasingly used to make a business profitable and to derive meaningful insights through business intelligence or marketing analytics processes. By publishing content, expressing opinions or requesting information, users generate enormous amounts of data. This content from interactions on the Internet is known as user-generated content (UGC) [15]. While the initial user content is messy and unstructured, various data mining techniques based on machine learning have been widely applied to identify key elements in such data [16].
The present study analyzes the UGC with #IndianStartups and similar hashtags on Twitter. Our aim is to establish which issues affect the ecosystem of startups in India and to identify what users feel about those issues. These insights can then be used by investors to acquire information and support their decision-making strategies when investing in Indian startups.
To this end, we use the following three processes and combinations of techniques for data analysis. First, we employ sentiment analysis (SA), which divides the sample into positive, negative and neutral groups based on the feelings expressed in the corresponding tweets. To this end, we develop and train an algorithm until it reaches sufficient reliability as measured by Krippendorff's alpha value (KAV) [17]. Then, we apply a Latent Dirichlet Allocation (LDA) model, a mathematical model developed in Python that structures a sample into specific themes based on the analysis of the words in that sample. Finally, the data are submitted to text analysis (TA). In this phase of the analysis, the main specific indicators affecting the startup sector in India are identified.
The originality of the present study lies in the methodological process we use for data analysis. Our results also contribute to the literature on startups. These results can be used in further comparative research on the characteristics of startup industries. Finally, our findings provide a methodological approach based on UGC and exploratory data analysis [18] for investors to obtain valuable insights related to Indian business startups. The remainder of this paper is structured as follows. Section 2 presents the literature review of the present study. The methodology is presented in Section 3. Section 4 reports the results that are further discussed in Section 5. Finally, conclusions are drawn in Section 6.

Literature Review
In recent years, despite the COVID-19 pandemic, the start-up ecosystem in India has grown exponentially. According to a report published by ET-tech [19], a total of 1652 investment rounds were completed in 2013, raising a total of $3.51 million. In 2017, the number of investment rounds dropped to 1513, but the amount invested increased to $6.43 million. In 2018, investments worth $10.60 million were made across 1471 investment rounds in India. In 2019, before the COVID-19 pandemic, investments in Indian start-ups reached a total of $14.27 million across 1482 investment rounds. In 2020, despite the COVID-19 pandemic and the lockdown, a total of $9.33 million was raised in 1088 investment rounds (see Figure 1).

Furthermore, in the last decade, several studies have addressed the entrepreneurial ecosystem and emerging startup strategies used in different countries. For instance, in their study on the current business ecosystem in China, He [20] expanded the current understanding of entrepreneurship in Asian countries and analyzed the process of starting technology-based projects and entrepreneurial initiatives in the Asian market. Furthermore, Zhao [21] studied the technological business growth in China, highlighting the main local technologies developed to offer new approaches to economic and social development in the startup sector in India.
In addition, Barberis et al. [22] highlighted the importance of the investors' feelings about the companies that receive financing. Likewise, López-Cabarcos et al. [23] pointed out the importance of the technical and non-technical information obtained by the investors at the stock market with an approach focused on identifying the sentiment on social networks.
In another relevant study, Piñeiro-Chousa et al. [24] highlighted the importance of studying the digital ecosystem, specifically microblogging and its link to the stock market behavior. Cookson and Niessnet [25] also indicated the relevance of studying social networks from the investors' perspective.
In their analysis aimed at the identification of innovation factors in Indian startups, Dinesh and Sushil [26] underscored the importance of creativity and innovation for the success of startup strategies. In a case study in the field of social media, Sindhani [27] analyzed 25 top Indian startups to identify the indicators startup founders should follow to develop strategies in this digital environment.
In another analysis of Indian startups, Ghosh et al. [28] sought to understand environmental uncertainty for startups and explore how it affects global development. This study concluded that the perceived utility of environmental uncertainty could affect startups in their long-term strategies.
In an analysis of the business industry in China and Russia, Batjargal [29] focused on social networks. In addition, studying projects based on technology companies and entrepreneurs, Bruton and Ahlstrom [30] sought to establish the main differences between the business approaches used in the West and East. This study highlighted the differences in the respective technologies and in the economic value and impact they generate.
In addition, in their analysis of the global economy from the experience of the development of Indian startups, Dossani and Kenney [31] concluded that services developed by startups based on the health of the global economy can benefit the growth of Indian startups as long as the global economy is on the rise. Furthermore, Bindal et al. [32] also investigated the role of Indian startups in the economy and their contribution margin in the medium and long term.
Following this line of research, Wu and Wu [33] investigated the impact on Chinese society of the increase in the number of students willing to become entrepreneurs and to create a startup. The authors highlighted the importance of the technology component of these new entrepreneurial projects (see also Tabarsa et al. [34]).
Likewise, focusing on the activities and social actions that Indian startups can generate to promote a social improvement of the environment, Baporikar [35] proposed a framework for social change through the analysis of startups in India.
In addition, in an analysis of entrepreneurship in China, Pistrui et al. [36] analyzed the cultural and family forces involved in this process, such as the components that influence investment. In another relevant study, in which technology-based enterprises that pursue innovation were classified as startups, Ahlstrom and Ding [37] investigated the difficulties associated with the creation of small and medium enterprises (SMEs) in Asian countries. The results of this study demonstrated that, during the foundation and development of startup companies, there are specific indicators that help investors decide whether or not to invest in a startup project.
Furthermore, Wright [38] compared China's business industry and its startups with similar investment opportunities in European and Asian environments. In another study on the agglomeration of startups in India, Dornberger and Zeng [39] identified the factors that can affect startup development. Finally, Li et al. [40] focused on business education in emerging countries and the contribution of the business sector and startups to the economic development of the country. Table 1 provides a summary of relevant previous studies on the startup ecosystem in India and the decision-making capacity of investors.

Table 1. Previous studies of the startup ecosystem in India.

| Authors | Main Purpose | Results |
|---|---|---|
| Au and Kwan [41] | This study analyzes the development of a startup in Asia and India, with the focus on cultural aspects and their impact on the business ecosystem. | This study revealed that familism values interfere with the development of new businesses; therefore, in those countries, entrepreneurs decide to obtain funding from friends and other outsiders. |
| Zhao [42] | This study explains how to create a startup in India by analyzing real cases and identifying problems and solutions for each specific case. | The study found several interconnecting forces that shape the creation of startups. These forces include culture, country history, economic and social development and the strategic framing of social enterprises. |
| Tan et al. [43] | This study analyzes investment in SMEs and startups in India; the UGC in social networks is used for the analysis. | The findings showed that social networks enhance user interactions that generate useful data for an in-depth analysis of SMEs and startups. |
| Chen et al. [44] | This study analyzes different strategies used by startups to acquire engagement and financing through social media. | This study confirmed that stakeholders' engagement can be obtained through social media to gain awareness, as well as to build brand image or reputation. Additionally, it was found that, due to limited financial and human resources, the management and measurement of social media communications is one of the challenges for startups. |
| Saura et al. [45] | This study proposes a new methodological approach based on defining the main indicators related to the success of startups (e.g., business angels or investors in startups). | The findings showed how data-mining techniques are helpful for startups to succeed in the analysis of UGC from social networks. |
| Suresh Babu and Sridevi [46] | This study discussed the main issues and challenges that Indian startups have to face and the opportunities that India can provide in this business model. | The results showed that Indian startups have to face finance, human resources and sustainable growth issues. However, there are many opportunities available by expanding into other countries. |
| Banudevi and Shiva [47] | This study reviewed the main difficulties of Indian startups and discussed the financing resources that they used. | This study found funding to be the major concern for startups and small businesses. Additionally, the exponential growth of technology was reported to intensify the scarce motivation of investors to invest in those businesses. |
| Anand Verma and Singhal [48] | This study analyzed the data from Indian startups about investment and funding trends. | The results showed that Indian events, cities and industry verticals play an important role when startups want to acquire funding. Additionally, foreign investors were found to be greater contributors in India. |

Table 2 shows how previous studies have addressed the gap identified regarding the Indian startup ecosystem by type of research (empirical, exploratory or descriptive).

Research Questions and Methodology Development
As discussed in Section 2, the ecosystem of startups in India is an emerging market with an exponential level of growth. In this context, it is important to identify the key points within the ecosystem of Indian startups. According to Baporikar [35] and Korreck [49], the Indian startup ecosystem is mainly characterized by the following points: (i) economic growth; (ii) current market trends; (iii) technological change; and (iv) social change. Some relevant research in this direction has already been undertaken. For instance, Saura et al. [45] developed a methodological process using LDA, sentiment analysis and textual analysis to determine the main key performance indicators (KPIs) for the success of startups. Likewise, Chen et al. [44] obtained data to evaluate the engagement of Indian startups in social networks. Some relevant past research also focused on the feelings expressed in different UGC-based topics about startups on social media.
According to Naab and Sehl [50], the analysis based on UGC produces meaningful insights in many fields. UGC is characterized by personal contribution of users who are interested in a specific topic, event or industry; therefore, such contributions become relevant. Individuals behind social network profiles are diverse: they can be founders, investors, social media influencers or employees [51]. For example, Saura et al. [45] and Reyes-Menendez et al. [15] proposed a methodological approach based on machine learning and sentiment analysis to analyze the feelings associated with certain topics. Likewise, based on the opinions and feelings expressed by interviewed users, Zhao [42] identified the factors related to Indian startups that make them interesting for future investment.
Overall, given that textual analysis has been widely used to obtain relevant information from large amounts of data, it is interesting to establish which indicators investors should pay attention to when deciding whether or not to invest in an Indian startup. Based on the above, in the present study, we address three research questions. To answer these questions, we propose a methodology based on [52] that consists of three methodological processes. The first is a model known as LDA. In the second, we use an algorithm that works with machine learning and data mining to perform an SA. Finally, a TA is applied to the data classified using the aforementioned two processes (see Sections 3.2-3.4).

Data Sample
The UGC data were collected from the Twitter platform. The time horizon for data collection was from December 2018 to March 2019. This period coincides with the period of highest investment in Indian start-ups (see Figure 1), when the amount invested in Indian start-ups increased from $10.60 million in 2018 to the historical maximum of $14.27 million in 2019 [53].
To download the data, we connected to the Twitter API, which has a tweet collection limit of 7 days. The search words used for data collection were "Indian startup", "Indian startup", "India startups" and "India start-ups", each preceded by the tag "#" (a hashtag, i.e., a tag that groups themes of the same interest if it accompanies a word on Twitter). The database consisted of a total of 18,902 tweets, of which 14,531 were selected after filtering. To clean the database, retweets (RT) were eliminated; to avoid noise, we also eliminated alphanumeric characters and complete URLs. We did not analyze images and videos published with the tweets. In addition, we included only the content that came from users who had been active on Twitter during the 3 months prior to the data collection and who had profile pictures and a public cover. Finally, a minimum of 80 characters per tweet was set as the inclusion criterion [54].
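The filtering steps described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function names, the `RT @` prefix test for retweets and the choice to apply the 80-character minimum after URL removal are ours, and the account-level criteria (recent activity, profile picture, public cover) are not covered because they require profile metadata.

```python
import re

# Matches complete URLs so that they can be stripped from the tweet text.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
MIN_LENGTH = 80  # minimum number of characters per tweet stated in the text


def is_retweet(text):
    """Assumption: retweets carry the conventional 'RT @' prefix."""
    return text.lstrip().startswith("RT @")


def clean_tweet(text):
    """Remove complete URLs and collapse the remaining whitespace."""
    without_urls = URL_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", without_urls).strip()


def filter_tweets(raw_tweets):
    """Drop retweets, strip URLs and keep tweets with at least MIN_LENGTH characters."""
    kept = []
    for text in raw_tweets:
        if is_retweet(text):
            continue
        cleaned = clean_tweet(text)
        # Assumption: the length criterion is checked on the cleaned text.
        if len(cleaned) >= MIN_LENGTH:
            kept.append(cleaned)
    return kept
```

Applied to a list of raw tweet texts, `filter_tweets` returns only the cleaned tweets that pass the retweet and length criteria.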

Sentiment Analysis Process
Once the topics in the UGC database of Indian startups were identified, the sample was subdivided into these topics. To apply the sentiment analysis process, we created an algorithm in Python and trained it exclusively for the Indian startup sector with data mining processes until it reached a sufficient level of reliability according to the KAV coefficient. The KAV coefficient should be 0.800 or above. If 0.800 > α ≥ 0.667, only tentative conclusions can be drawn. Results with α < 0.667 should not be taken into account. To train the SA algorithm, the MonkeyLearn library was used. The algorithm was trained 402 times. The algorithm used was a Support Vector Machine (SVM), a machine learning algorithm. An SVM was selected because it can be trained through a digital interface, instead of creating a complex algorithm of these characteristics from scratch.
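To make the reliability thresholds concrete, the following sketch computes Krippendorff's alpha for nominal labels and maps the value to the three ranges given above. It is an illustration only, not the procedure run inside MonkeyLearn: the function names are hypothetical, and the input format (one list of labels per tweet, for example a human label and the classifier's label) is an assumption.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of lists; each inner list holds the labels that
    different coders assigned to one unit (here, one tweet).
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # units with fewer than two labels cannot be paired
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n == 0:
        raise ValueError("at least one unit needs two or more labels")

    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n * (n - 1))
    if expected == 0:
        # Alpha is undefined when only one label value occurs;
        # returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return 1.0 - observed / expected


def interpret_alpha(alpha):
    """Map alpha to the three ranges stated in the text."""
    if alpha >= 0.800:
        return "reliable"
    if alpha >= 0.667:
        return "tentative"
    return "discard"
```

With two coders, each unit is simply a pair of labels, so a list such as `[["positive", "positive"], ["negative", "neutral"]]` is a valid input.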
The SA process is shown in Figure 2 (see also Barberis et al. [22]). In Figure 2a, neuron A represents the connection to the API that researchers establish in order to download the data. Furthermore, E and C are the data extraction (E) and data collection (C) processes, and D stands for the data processing that builds the dataset. Neuron S is the dataset (S), which should be correctly filtered to eliminate duplicate, erroneous or unnecessary information. The neuron LDA is the LDA model that divides the analyzed sample into topics T1, T2, T3, ..., up to the maximum number of topics (k) identified according to the size of the dataset. As suggested previously by Korreck [47], in Figure 3b, SA refers to the SA applied to the LDA neuron to show how SA is applied to each dataset divided into topics (T1-T3). Next, from the total number of topics (k), each topic is classified into positive, negative or neutral. Based on this classification when applying TA, specific KPIs can be obtained. In order to evaluate the effectiveness of algorithms using sentiment analysis, the following three measurements are used: precision, recall and accuracy.
Here, precision measures the proportion of texts correctly assigned to a category out of all texts that the algorithm assigned to that category (both correctly and incorrectly), according to their feeling. Recall accounts for the number of texts correctly predicted within a category against the total number of texts that belong to that category. Accuracy reflects the proportion of correctly predicted texts, including both those within and outside of a category. In sum, while precision and recall are measurements used to evaluate the performance quality of the algorithm, accuracy helps to evaluate predictability [52].
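The three measurements can be illustrated with a short Python sketch. The function name and the label strings are assumptions made for the example; accuracy is computed here per category, counting texts correctly placed inside the category and texts correctly left outside it, which is one reading of the definition above.

```python
def classification_metrics(true_labels, predicted_labels, category):
    """Return (precision, recall, accuracy) for one sentiment category."""
    pairs = list(zip(true_labels, predicted_labels))
    # Texts of the category that were predicted as the category.
    tp = sum(1 for t, p in pairs if t == category and p == category)
    # Texts of other categories that were predicted as the category.
    fp = sum(1 for t, p in pairs if t != category and p == category)
    # Texts of the category that were predicted as something else.
    fn = sum(1 for t, p in pairs if t == category and p != category)
    # Texts outside the category that were also predicted outside it.
    tn = sum(1 for t, p in pairs if t != category and p != category)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return precision, recall, accuracy
```

Calling the function once per category (positive, negative, neutral) gives a per-category view of how the classifier behaves.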
However, the limitation of the algorithms that work with machine learning is that such algorithms cannot correctly determine feelings related to sarcasm, irony or polarity of the contents. In our training of the algorithm, we took this limitation into account. Likewise, the contents that did not achieve a credibility of 0.667 of positive or negative sentiment according to KAV were classified as neutral feelings. This was accomplished to increase the precision and reliability of the positive and negative feelings and, consequently, the effectiveness of the insights obtained.
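The neutral fallback can be sketched as follows. This is a hypothetical illustration: it assumes that each prediction comes with a single credibility score between 0 and 1, and the function name and label strings are ours.

```python
def assign_sentiment(label, credibility, threshold=0.667):
    """Keep a positive or negative label only if its credibility reaches the threshold."""
    if label in ("positive", "negative") and credibility >= threshold:
        return label
    # Everything else, including low-credibility predictions, is treated as neutral.
    return "neutral"
```

Under this rule, a tweet predicted as positive with a credibility of 0.60 is stored as neutral, which trades coverage of the positive and negative groups for higher precision within them.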

Application of LDA Model for Topic-Modeling
The LDA model was developed in Python using the Python library LDA 1.0.5, which relies on Gibbs sampling. LDA is a mathematical model that divides a series of samples (e.g., documents, texts, comments, reviews, etc.) into topics. The analysis of topics is based on the importance of the layers or insights that exist in the sample. To this end, we measured the frequency of repetition of words and their associations in topics. An example of the LDA process is shown in Figure 3. In Figure 2a, α is the parameter of the Dirichlet prior on the per-document topic distributions and β is the parameter of the Dirichlet prior on the per-topic word distribution. θ_m is the topic distribution for document m, while ϕ_k is the word distribution for topic k. z_mn is the topic for the nth word in document m. Finally, w_mn is the specific word. Figure 2b shows the notation for this model. In Figure 2b, K refers to the number of topics, while ϕ_1, ..., ϕ_K are V-dimensional vectors storing the parameters of the Dirichlet-distributed topic-word distributions, where V is the number of words in the vocabulary. The entities represented by θ and ϕ are matrices obtained by decomposing the original document-word matrix that represents the sample being modeled [4]. In Figure 2b, θ consists of rows defined by the documents in the sample and columns defined by topics, while ϕ consists of rows defined by topics and columns defined by words. Accordingly, ϕ_1, ..., ϕ_K are a set of rows, or vectors, each of which is a distribution over the words within a topic. Finally, θ_1, ..., θ_M indicate the set of rows that give the topic distribution of each document [52].
The LDA is divided into two processes, the first of which detects words and their connectors in the proposed database or documents. In the second step, based on the layers that make up the sample, this model develops a distribution to see how the words are repeated and how they create the topics. Equation (1) describes how this is accomplished and shows the importance of the hidden and observed variables in the joint distribution.

$$p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{i=1}^{K} p(\beta_i) \prod_{d=1}^{D} p(\theta_d) \left( \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \right) \qquad (1)$$

Of note, in the LDA development process, the names of the topics are proposed by the researcher using the words that are most frequent in the data. Furthermore, in Equations (1) and (2), β_i is the distribution over words of topic i (of the total of K topics); θ_d is the proportion of topics in document d (of the total of D documents); z_d is the topic assignment in document d; z_{d,n} is the topic assignment for the nth word in document d (with the total of N words); w_d is the set of observed words in document d; and w_{d,n} is the nth word in document d.
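For reference, in the standard formulation of LDA, the quantity that Gibbs sampling approximates is the posterior of the hidden variables given the observed words, which follows from the joint distribution in Equation (1). The expression below is a sketch in the notation of this section; that it coincides with Equation (2) is an assumption.

```latex
p(\beta_{1:K}, \theta_{1:D}, z_{1:D} \mid w_{1:D})
  = \frac{p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})}{p(w_{1:D})}
```

The denominator, the marginal probability of the observed words, cannot be computed exactly for realistic corpora, which is why sampling methods are used.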
Accordingly, Equation (2) is used to identify topics for Gibbs sampling.

Textual Analysis
The TA is a technique used to obtain indicators and key variables by analyzing a textual database. TA measures variables that determine the relevance of an indicator within the text. TA techniques can give meaning to each variable and help determine quality indicators related to a specific topic.
As discussed by Korreck [47], LDA and SA show the steps of textual analysis (see Figure 4). Here, the values of T1p, T2x and T3n, which stand for topics with positive sentiment (p), neutral feeling (x) and negative sentiment (n), respectively, are identified. Once the text samples that subdivide the topics by feelings have been identified and classified, the TA process classifies the samples into categories known as nodes. Therefore, N1, N2 and N3 correspond to T1p, T2x and T3n, respectively. After the classification of each topic with the feeling it expresses into the corresponding nodes, the text mining process starts. This process involves subdividing the main category into sub-nodes so that sub-nodes Nx and Ny belonging to N1 are created. These sub-nodes contain sub-topics or features that are relevant for the topic under investigation [52].
In essence, the categorization approach outlined above is based on two factors: (1) the weight of repeated words and phrases and (2) accuracy, i.e., the evaluation of the number of repetitions of each word. Upon division of all samples into nodes and sub-nodes, all key indicators have to be analyzed and defined. As a final step, the average weight and accuracy of the established key indicators can be summarized in tables.
For the correct development of this step, a structure of nodes should be created in software for textual analysis. In the present study, the NVivo software was used. A list of stop words (e.g., connectors, prepositions, articles and plural forms) was applied. To eliminate repeated words, Equation (3) was used in NVivo.
where K is an empirical approximation constant [15,52]. This constant K is used to delete all mechanical repetitions of words in the analyzed data. To determine K, a query is made that allows the program to search the databases. Once K has been established for each sample or topic according to the feeling it expresses, it is compared with the remainder of the sample in order to compute the average value of K for all topics and to obtain the global weight of X, i.e., the number of topics. The nodes are, in essence, data containers grouped according to their characteristics. The structure and design of new nodes is used to group raw data as accurately as possible. To this end, the Weighted Percentage (WP), which shows the number of times a node repeats its content in the database, is used; it is computed as shown in Equation (4) [52].
Finally, using the NVivo software and a word repetition filter, we calculated which words were linked to a node. This was accomplished using the Total Count (TC) indicator.
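The counting step described above can be illustrated with a short Python sketch. It does not reproduce NVivo's internal formulas or Equations (3) and (4): the stop-word list, the function name word_stats, the example texts and the definition of the percentage (a word's share of all counted words in the node) are assumptions made purely for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the list applied in the study is not given.
STOP_WORDS = {"a", "an", "and", "the", "in", "of", "for", "to", "is", "are"}

def word_stats(texts, stop_words=STOP_WORDS):
    """Return {word: (total_count, share_percent)} for the texts in one node."""
    tokens = [
        t
        for text in texts
        for t in re.findall(r"[a-z]+", text.lower())
        if t not in stop_words
    ]
    counts = Counter(tokens)
    total = sum(counts.values())
    # Assumed stand-in for a weighted percentage: share of all counted words.
    return {w: (c, 100.0 * c / total) for w, c in counts.most_common()}

# Hypothetical texts assigned to a single node.
node_texts = [
    "Fintech startups in India are raising funds",
    "Funds for fintech and payments startups",
]
stats = word_stats(node_texts)
for word, (tc, share) in stats.items():
    print(f"{word:10s} TC={tc}  share={share:.1f}%")
```

Words with the highest counts in a node are the natural candidates for the key indicators that are later summarized in tables.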

Analysis of Results
The processes outlined in Section 3 led to the identification of 9 topics that characterize Indian startups. The words were automatically organized in topics, and each topic was named after a discussion among the present authors. The naming of topics followed the standardized approximation used in interpreting the results of LDA models.
To determine the name of a topic, the words ranked in the top 10 of that topic were used. In the present study, all topics were grouped into positive, negative and neutral (see Table 3 for a summary of the results). According to the results of sentiment analysis, a KAV value of 0.769 was obtained for the tweets identified as positive; for the tweets classified as negative and neutral, the KAV values were 0.719 and 0.80, respectively. Textual analysis was used to identify key indicators within the identified topics. Using the NVivo software, the sample was classified into 3 different nodes. Each node had an associated feeling. Then, an approximation was made based on the identification and classification of the most important factors according to their weight within the selected topic.
To this end, we grouped words according to the number of times they appeared in a topic and analyzed similar terms. Once the text samples were grouped into independent nodes, a qualitative approximation was made to determine the factor of each indicator. Consequently, N1, N2 and N3 aggregated positive, neutral and negative factors, respectively. The results for N1, N2 and N3, corresponding to positive, neutral and negative indicators for investors, are summarized in Tables 4-6. As can be seen in Table 4, the following five topics have positive key indicators for investors: FS, AI, CS, AR and IS. As can be seen in Table 5, two topics (SF, EC) have neutral key indicators for investors. Finally, two topics (CH, SC) have negative key indicators for investors (Table 6).
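The grouping of topics into nodes can be written down as a small data structure. The sketch below uses the topic codes and sentiment labels reported above; the dictionary layout and the function name build_nodes are assumptions for illustration and do not reflect how NVivo stores nodes internally.

```python
# Topic codes and sentiment labels as reported in the text above.
TOPIC_SENTIMENT = {
    "FS": "positive", "AI": "positive", "CS": "positive",
    "AR": "positive", "IS": "positive",
    "SF": "neutral", "EC": "neutral",
    "CH": "negative", "SC": "negative",
}

# N1, N2 and N3 collect positive, neutral and negative topics, respectively.
NODE_FOR_SENTIMENT = {"positive": "N1", "neutral": "N2", "negative": "N3"}

def build_nodes(topic_sentiment):
    """Group topic codes into nodes according to their sentiment label."""
    nodes = {node: [] for node in NODE_FOR_SENTIMENT.values()}
    for topic, sentiment in topic_sentiment.items():
        nodes[NODE_FOR_SENTIMENT[sentiment]].append(topic)
    return nodes

nodes = build_nodes(TOPIC_SENTIMENT)
for node, topics in nodes.items():
    print(node, topics)
```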

Discussion
As shown by the results reported in Section 4, the Indian startup ecosystem is influenced by 9 areas, established in our analyses as topics, that characterize the system and that may be of interest to investors in the Indian startups industry.
According to Anand Verma and Singhal [48], investment flow in India in the FS nearly doubled from 1.9 billion dollars in 2018 to 3.7 billion dollars in 2019. This demonstrates the interest that investors have in Fintech startups and may explain why our findings showed a positive sentiment in tweets by Twitter users. The results demonstrated that Indian startups within the Fintech sector or financial entrepreneurship are positively discussed on Twitter. Therefore, the startups that base their models on entrepreneurship in the financial sector offer a great opportunity for investors. Social networks reflect the importance and impact of this specific area of Indian startups. In this respect, our results are also consistent with those reported by Sindhani [27].
Furthermore, our results showed that technologies such as artificial intelligence and augmented reality were also positively discussed on Twitter. In the Indian startup sector, these two technologies have attracted heavy investment from business angels. In fact, AI, a technology that has experienced exponential growth in recent years, has been a key factor for the success of Indian startups. According to Ahlstrom and Ding [37], from 2008 to 2017, the investment in AI by startups in India reached 50 billion dollars, and this tendency is still growing. This research is consistent with our findings regarding the positive feeling users have regarding AI investment in startups in India.
Augmented reality has been similarly successful. Specifically, Indian startups have effectively developed mobile applications and technologies that link the use of augmented reality to daily consumption habits and video gaming. As shown by Korreck [49], investment in AR in India is increasing and could reach 0.5 billion in 2022. This evidence is consistent with our results, which revealed a positive sentiment in the tweets by Twitter users. The conclusions derived in the present study are also coherent with the results reported by Chakraborty and Gupta [55].
Next, our analysis identified that the business models of Indian startups that highlight creativity and the so-called crowdfunding startups are also positively evaluated by Twitter users. This suggests that investors should focus their attention on the projects that seek financing through specific crowdfunding platforms (see also Shah and Shah [56]). Moreover, Ashta [57] listed India at the top for raising money in 2015 (27.8 million dollars), and the same tendency is expected in the upcoming years. This is also in line with the positive sentiment of the Twitter UGC found in our analysis.
With regard to the investment areas that were negatively evaluated in our data, our results showed that Indian startups that focus on hardware are positioned negatively in terms of quality. In line with our findings, Gregory [58] concluded that this sector has a long-term exponential evolution in both China and India. Another conclusion comes from the India-FDI equity inflow amount for computer hardware and software sector 2020 [59], which presents a graph showing a drop in the investment flow (25 billion dollars) for computer hardware in India in 2017. The tweets analyzed in our study were collected in 2018, which can explain the negative sentiment in the analyzed tweets. In terms of software development, however, no such negative feelings were detected. Furthermore, competition among Indian startups was also perceived negatively. This finding may be due to the difficulties Indian startups face in financing the robust marketing and development plans they need to attract investment. As also discussed by Korreck [49] and Rao and Kumar [60], due to the shortage of investors, it is difficult for startups to secure these investment rounds.
Finally, in our results, two topics were found to be associated with neutral feelings: eCommerce projects, which highlight the dynamization of the sector for logistics and low cost, and startup funds, where the amount of investment received by Indian startups was found to increase the chances of success of the respective projects. Both topics were studied by Korreck [49] and Snelson [61], who demonstrated that both the eCommerce sector and the funds received by Indian startups largely depend on the problems or opportunities derived from the logistics and technology used in the development of their products [62]. However, according to PricewaterhouseCoopers [63], India hosts important e-commerce investors such as Amazon and Flipkart, which could explain the neutral sentiment found in our research. Here, we assume that users might not have feelings regarding the e-commerce industry as such because they link their feelings with the investor brands themselves.

Conclusions
In this study, we identified the main topics of interest for investors using sentiment analysis of Twitter-based UGC on the Indian startup sector. Using LDA, we identified the topics in the sample of tweets; then, SA was used to determine the feelings associated with those topics; finally, TA was applied to derive insights for investors' decisions in the startup sector.
Our results provide an exploratory overview of the main areas of investment in Indian startups. If we consider the feelings about the topics in the UGC on Twitter, these areas are Fintech-type startups; startups that develop business models based on innovation; startups that develop crowdfunding models or that are advertised on crowdfunding platforms; and startups that develop solutions based on artificial intelligence and augmented reality. In contrast, the areas of less interest to investors, if we look at the negative sentiment indicator, are startups that develop Indian hardware and those that participate in competitive projects in conferences and seminars to seek investment. Investment areas that did not obtain results with obvious positive or negative implications are projects based on eCommerce and startups that have already received an investment.
The results of LDA provide an affirmative answer to RQ1 as to whether there could be topics in Twitter UGC that generate interest from the point of view of the identification of insights for investors in Indian startups. Our identification of positive, negative and neutral feelings in the identified topics also provides an affirmative answer to RQ2. Based on these results, we also obtained key indicators that can help investors in Indian startups to make investment decisions. Regarding RQ3, we were able to identify key indicators that have been classified into sentiments from the investor's point of view. Our results can be meaningfully used by investors for better investment decision making.

Theoretical Contributions
Our results can help to better structure the design of future studies aimed at understanding the decisions made by startup investors in their exploratory analysis of a new market of investment.
Similarly, the results could be taken as variables or indicators in statistical models elaborated with partial least squares structural equation modeling (PLS-SEM) or AMOS, so that, through surveys, statistical significance of these variables can be measured by asking investors' opinions.
Furthermore, our findings can be used to develop theoretical frameworks to better understand the functioning of UGC in the ecosystem of startups and its possible informative impact on investor market research.
Finally, the present study promotes the use of new data mining techniques in research on digital ecosystems to explore different industries. The application of these novel techniques will hopefully be furthered in future research.

Practical Implications
The present study has several practical implications. On the one hand, investors can use the topics identified in the present study to conduct market research on startups in India. In addition, investors can use the identified sentiments to better understand the impact of identified Indian startup areas, rather than determine their investments. In addition, our findings can serve as a complementary source of information for investors and practitioners to compare with the results obtained using more traditional methods.
Similarly, the present study proposes a new methodological approach to find opportunities and insights about a specific topic by using UGC from Twitter. In the present study, this method was applied to investigate Indian business startups; however, it can also be applied to study other industries. Therefore, marketers, data scientists and other professionals can benefit from this original and consistent approach when developing international strategies.

Limitations and Future Research
The present study has several limitations. First, the results of the present study provide descriptive and exploratory information for investors, rather than information that could determine their investments. Said differently, investors may use the present results to identify the main investment areas, but the indicators or variables identified in the present study are not prescriptive for their decisions to invest in Indian business startups.
Second, the methodology used in the three processes is exploratory. In the LDA model, the names of the topics are selected based on the manual analysis of the results, which is a standardized process in topic modelling. The machine learning algorithm can be improved in the future, as more training improves its predictions. The sample size can be augmented to more thoroughly analyze other variables and indicators. Accordingly, future research can focus on increasing the sample and understanding the influence of the identified topics from different theoretical and practical perspectives.

Funding: This research was funded by National Funds provided by FCT-Foundation for Science and Technology through projects UIDB/04470/2020 and UIDB/04020/2020.